XML Parsing
Designed By:

Creating Markup with XML

Outline
- Introduction
- Introduction to XML Markup
- Parsers and Well-formed XML Documents
- Parsing an XML Document with msxml
- Characters
  - Character Set
  - Characters vs. Markup
  - White Space, Entity References and Built-in Entities
  - Using Unicode in an XML Document
- Markup
- CDATA Sections

Introduction
- XML
  - Technology for creating markup languages
  - Enables document authors to describe data of any type
  - Allows creating new tags
    - HTML limits document authors to a fixed tag set

Introduction to XML Markup
- XML document (intro.xml)
  - Marks up a message as XML
  - Commonly stored in text files
    - Extension .xml

[Listing: intro.xml, a simple XML document containing a message]
- Line numbers are not part of the XML document. We include them for clarity.
- Document begins with a declaration that specifies XML version 1.0
- Comments
- Element message is a child element of root element myMessage
Content of element message: Welcome to XML!

Introduction to XML Markup (cont.)
- XML documents
  - Must contain exactly one root element
    - Attempting to create more than one root element is an error
  - Elements must be nested properly
    - Incorrect: <x><y>hello</x></y>
    - Correct: <x><y>hello</y></x>

Rules for Elements
- XML document syntax
  - Considered well formed if syntactically correct
    - Single root element
    - Each element has start tag and end tag
    - Tags properly nested
    - Attribute (discussed later) values in quotes
    - Proper capitalization (XML is case sensitive)

Parsers and Well-formed XML Documents
- XML parser
  - Processes XML document
    - Reads XML document
    - Checks syntax
    - Reports errors (if any)
  - Allows programmatic access to document's contents

Parsers and Well-formed XML Documents (cont.)
- XML parsers support
  - Document Object Model (DOM)
    - Builds tree structure containing document data in memory
  - Simple API for XML (SAX)
    - Generates events when tags, comments, etc. are encountered
    - (Events are notifications to the application)

Parsing an XML Document with msxml
- XML document
  - Contains data
  - Does not contain formatting information
- Load XML document into Internet Explorer 5.0
  - Document is parsed by msxml
  - Places plus (+) or minus (-) signs next to container elements
    - Plus sign indicates that all child elements are hidden
      - Clicking plus sign expands container element and displays children
    - Minus sign indicates that all child elements are visible
      - Clicking minus sign collapses container element and hides children
  - Error generated if document is not well formed
[Figure: XML document shown in IE5.]
[Figure: Error message for a missing end tag.]

Characters
- Character set
  - Characters that may be represented in XML document
  - e.g., ASCII character set
    - Letters of English alphabet
    - Digits (0-9)
    - Punctuation characters, such as !, - and ?

Character Set
- XML documents may contain
  - Carriage returns
  - Line feeds
  - Unicode characters
    - Enables computers to process characters for several languages

Characters vs. Markup
- XML must differentiate between
  - Markup text
    - Enclosed in angle brackets (< and >)
    - e.g., child elements
  - Character data
    - Text between start tag and end tag
    - e.g., line 7: Welcome to XML!
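The parser behaviour described in the slides above (syntax checking, a DOM tree in memory, SAX events, and the split between markup and character data) can be sketched with Python's standard library instead of msxml. This is an illustration only: the document string is an assumed reconstruction of intro.xml based on the slide's description, the x/y tags are hypothetical, and the names INTRO_XML, is_well_formed and EventLogger are made up for this sketch.

```python
import xml.sax
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Assumed reconstruction of intro.xml from the slide's description.
INTRO_XML = (
    '<?xml version="1.0"?>\n'
    "<!-- Simple introduction to XML markup -->\n"
    "<myMessage>\n"
    "   <message>Welcome to XML!</message>\n"
    "</myMessage>\n"
)


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


class EventLogger(xml.sax.ContentHandler):
    """SAX handler that records markup events and character data separately."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only chunks that sit between elements.
        if content.strip():
            self.events.append(("text", content))


# DOM: the whole document is built as a tree in memory.
dom = minidom.parseString(INTRO_XML)
root = dom.documentElement
message = root.getElementsByTagName("message")[0]
print(root.tagName, "->", message.tagName, ":", message.firstChild.data)

# SAX: the parser notifies the handler as it encounters tags and text.
logger = EventLogger()
xml.sax.parseString(INTRO_XML.encode("utf-8"), logger)
for event in logger.events:
    print(event)

# Improperly nested tags are rejected; properly nested tags are accepted.
print(is_well_formed("<x><y>hello</x></y>"))
print(is_well_formed("<x><y>hello</y></x>"))
```

The DOM route suits small documents that the application wants to navigate freely; the SAX route avoids holding the whole tree in memory, at the cost of the application tracking its own state between events.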
Element Naming
- XML elements must follow these naming rules:
  - Names can contain letters, numbers, and other characters
  - Names must not start with a number or punctuation character (a leading underscore is allowed)
  - Names must not start with the letters xml (or XML or Xml ...)
  - Names cannot contain spaces

White Space
- Whitespace characters
  - Spaces, tabs, line feeds and carriage returns
  - Significant (preserved by application)
  - Insignificant (not preserved by application)
- Normalization
  - Whitespace collapsed into single whitespace character
  - Sometimes whitespace removed entirely
  - e.g., "This   is   character   data" after normalization becomes "This is character data"

Entity References
- XML-reserved characters
  - Ampersand (&)
  - Left-angle bracket (<)
  - Right-angle bracket (>)
  - Apostrophe (')
  - Double quote (")
- Entity references
  - Allow XML-reserved characters to be used in character data
  - Begin with ampersand (&) and end with semicolon (;)
  - Prevent character data from being misinterpreted as markup

Built-in Entities
- Built-in entities
  - Ampersand (&amp;)
  - Left-angle bracket (&lt;)
  - Right-angle bracket (&gt;)
  - Apostrophe (&apos;)
  - Quotation mark (&quot;)
- Mark up characters “<>&” in element message
  <message>&lt;&gt;&amp;</message>

Using Unicode in an XML Document
- XML Unicode support
  - e.g., displays Arabic words
  - Arabic characters represented by entity references for Unicode characters

[Listing: XML document that contains Arabic words. Markup not preserved; surviving content:]
  دايتَل أند &assoc;
  أهلاً بكم فيِ عالم &text;
- Document type definition (DTD) defines document structure and entities
- Root element welcome contains child elements from and subject
- Sequence of entity references for Unicode characters in Arabic alphabet
- lang.dtd defines entities assoc and text

Markup
- XML element markup
  - Consists of
    - Start tag
    - Content
    - End tag
  - All elements must have corresponding end tag
    - <img src = "img.gif"> is correct in HTML, but not XML
  - XML requires end tag or forward slash (/) for termination
    - <img src = "img.gif"></img> or <img src = "img.gif"/> is correct XML syntax

Markup (cont.)
- Elements
  - Define structure
  - May (or may not) contain content
    - Child elements, character data, etc.
- Attributes
  - Describe elements
    - Elements may have associated attributes
    - Placed within element's start tag
    - Values are enclosed in quotes
  - Element car contains attribute doors, which has value "4": <car doors = "4"/>
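To make the end-tag rule and the attribute example above concrete, the following sketch feeds a few fragments to the parser in Python's standard library and then reads the doors attribute from a car element. The helper name parses is hypothetical, and the img fragments are only illustrative HTML-style markup.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML-style start tag with no end tag: not well formed in XML.
print(parses('<img src = "img.gif">'))
# Explicit end tag, or a forward slash that terminates the empty element.
print(parses('<img src = "img.gif"></img>'))
print(parses('<img src = "img.gif"/>'))

# Attribute values must be quoted; an unquoted value is an error.
print(parses("<car doors = 4/>"))

# Attributes live in the start tag and are exposed as name/value pairs.
car = ET.fromstring('<car doors = "4"/>')
doors = car.get("doors")
print(car.tag, car.attrib, doors)
```

Note that the attribute value comes back as the string "4"; XML itself attaches no numeric type to it, so any conversion is up to the application.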
Markup (cont.)
- Processing instruction (PI)
  - Passed to application using XML document
  - Provides application-specific document information
  - Delimited by <? and ?>

[Listing: XML document that marks up information about a fictitious book. Markup not preserved; surviving annotations and values:]
- Processing instruction specifies stylesheet (discussed in Chapter 12)
- Root element book contains child elements title, author, chapters and media
- Element book contains attribute isbn, which has value of 999-99999-9-X
- Element chapters contains four child elements, each of which contains two attributes
- Title: Deitel's XML Primer
- Author: Paul Deitel
- Entries under chapters: Welcome, Easy XML, XML Elements?, Entities

[Listing: XML document that marks up a letter. Markup not preserved; surviving content:]
  Jane Doe
  Box 12345
  15 Any Ave.
  Othertown
  Otherstate
  67890
  555-4321

  Jane Doe
  123 Main St.
  Anytown
  Anystate
  12345
  555-1234

  Dear Sir:

  It is our privilege to inform you about our new
  database managed with XML. This new system
  allows you to reduce the load on your inventory list
  server by having the client machine perform the work of
  sorting and filtering the data.

  The data in an XML element is normalized, so
  plain-text diagrams such as
  /---\
  |   |
  \---/
  will become gibberish.

  Sincerely
  Ms. Doe
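The letter's warning about normalization can be illustrated with a short sketch. It assumes that normalization means collapsing each run of whitespace into a single space, which is one of the behaviours the White Space slide describes; the element name paragraph and the function name normalize are hypothetical. Python's ElementTree keeps the whitespace in character data, so the collapsing is done explicitly here, as an application might do it.

```python
import xml.etree.ElementTree as ET

# A fragment shaped like the letter's paragraph (element name is assumed).
DOC = (
    "<paragraph>plain-text diagrams such as\n"
    "   /---\\\n"
    "   |   |\n"
    "   \\---/\n"
    "will become gibberish.</paragraph>"
)


def normalize(text):
    """Collapse every run of spaces, tabs and line breaks into one space."""
    return " ".join(text.split())


raw = ET.fromstring(DOC).text
print(raw)             # line breaks and indentation are still present
print(normalize(raw))  # the diagram collapses onto a single line
```

When layout of this kind matters, the usual remedies are to have the application treat the whitespace as significant or to avoid encoding layout in character data at all.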
CDATA Sections
- CDATA sections
  - May contain text, reserved characters and whitespace
    - Reserved characters need not be replaced by entity references
  - Not processed by XML parser
  - Commonly used for scripting code (e.g., JavaScript)
  - Begin with <![CDATA[ and end with ]]>

[Listing: XML document with a CDATA section; markup not preserved]
- XML parser does not process CDATA section

[Listing for namespace.xml; markup not preserved. Surviving content: "A book list", "A funny picture"]
- Element directory contains two namespace prefixes
- Use prefix text to describe elements file and description
- Apply the other prefix to describe elements file, description and size
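The relationship between CDATA sections and entity references described above can be checked with a small sketch using Python's standard library. The script text and the element name script are hypothetical; the point is only that both mechanisms deliver the same character data to the application, while unescaped reserved characters are rejected.

```python
import xml.etree.ElementTree as ET

# The same script text, once escaped with built-in entities and once in CDATA.
ESCAPED = "<script>if (a &lt; b &amp;&amp; b &gt; 0) go();</script>"
CDATA = "<script><![CDATA[if (a < b && b > 0) go();]]></script>"

escaped_text = ET.fromstring(ESCAPED).text
cdata_text = ET.fromstring(CDATA).text
print(escaped_text)
print(cdata_text)

# Built-in entity references are replaced by the characters they stand for.
message_text = ET.fromstring("<message>&lt;&gt;&amp;</message>").text
print(message_text)

# Without either mechanism the reserved characters are read as markup.
try:
    ET.fromstring("<script>if (a < b && b > 0) go();</script>")
    rejected = False
except ET.ParseError as err:
    rejected = True
    print("not well formed:", err)
```

CDATA is convenient when a block contains many reserved characters, as script code often does; for an occasional & or < the entity references are usually simpler.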