About XML
Extensible Markup Language (XML) is described as both a markup language and a text-based data storage format, depending on who you talk to. It is a subset of Standard Generalized Markup Language (SGML) that offers a text-based means to describe information and apply a tree-based structure to it. XML serves as the basis for a number of languages and formats, such as Really Simple Syndication (RSS), Mozilla's XML User Interface Language (XUL), Macromedia's Maximum eXperience Markup Language (MXML), Microsoft's eXtensible Application Markup Language (XAML), and the open source Java XML UI Markup Language (XAMJ). As these many flavors demonstrate, XML is a big deal. Everyone wants to get on the XML bandwagon.
Writing XML
XML's basic unit of data is the element. Elements are delimited by a start tag, such as <book>, and an end tag, such as </book>. If you have a start tag, you must have an end tag. If you fail to include an end tag for each start tag, your XML document is not well-formed, and parsers will not parse the document properly. Tags are usually named to reflect the type of content contained in the element. You would expect an element named book to contain a book title, such as Great American Novel (see Listing 1). The content between the tags, including any white space, is referred to as character data.
Listing 1. A sample XML document
<books>
  <book>
    <title>Great American Novel</title>
    <characters>
      <character>
        <name>Cliff</name>
        <desc>really great guy</desc>
      </character>
      <character>
        <name>Lovely Woman</name>
        <desc>matchless beauty</desc>
      </character>
      <character>
        <name>Loyal Dog</name>
        <desc>sleepy</desc>
      </character>
    </characters>
    <plot>
      Cliff meets Lovely Woman. Loyal Dog sleeps, but wakes up to bark at mailman.
    </plot>
    <success type="bestseller">4</success>
    <success type="bookclubs">9</success>
  </book>
</books>
XML element and attribute names can consist of the uppercase letters A-Z, the lowercase letters a-z, the digits 0-9, certain special and non-English characters, and three punctuation marks: the hyphen, the underscore, and the period. A name cannot begin with a digit, a hyphen, or a period. Other punctuation marks are not allowed in names, apart from the colon, which is reserved for namespaces.
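The naming rules above can be approximated with a short check. This is only a sketch: it is restricted to ASCII, so it ignores the special and non-English characters that XML also permits, it leaves out the namespace colon, and it assumes that a name must begin with a letter or an underscore. The function name is_valid_ascii_name is made up for this illustration.

```python
import re

# Simplified, ASCII-only approximation of the naming rules described above.
# Assumptions: non-English characters and the namespace colon are ignored,
# and a name must begin with a letter or an underscore.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_ascii_name(name):
    """Return True if name fits the simplified ASCII naming pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


# Hypothetical candidate names, some valid and some not.
candidates = ["book", "book-title", "_note", "chapter.1", "1st", "book title", "price$"]
for candidate in candidates:
    print(candidate, is_valid_ascii_name(candidate))
```

A real parser remains the authority on what counts as a legal name; a pattern like this is useful mainly as a quick sanity check before generating tags from user-supplied strings.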
XML is case sensitive. For example, <Book> and <book> describe two different elements. Either is an acceptable element name. It's probably not a good idea to use both <Book> and <book> to describe two different elements in the same document, as the possibility of clerical error seems high.
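Case sensitivity is easy to confirm with a parser. The sketch below uses Python's standard xml.etree.ElementTree module and a small made-up document; the element names Book and book are chosen only to illustrate the point.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <Book> and <book> are distinct element names.
doc = "<books><Book>First</Book><book>Second</book></books>"
root = ET.fromstring(doc)

# Each lookup matches only the element whose name has the same case.
upper = [e.text for e in root.findall("Book")]
lower = [e.text for e in root.findall("book")]
print(upper, lower)

# A start tag and an end tag that differ only in case do not match,
# so the parser rejects the document as not well-formed.
try:
    ET.fromstring("<Book>First</book>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
print(mismatch_accepted)
```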
Each XML document contains one and only one root element. The root element is the only element in an XML document that does not have a parent. In the example above, the root element is <books>. Most XML documents contain parent and child elements. The <books> element has one child, <book>. The <book> element has four kinds of children: <title>, <characters>, <plot>, and <success>, the last of which appears twice. The <characters> element has three child elements, each of which is a <character> element. Each <character> element has two child elements, <name> and <desc>.
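The parent-child relationships in Listing 1 can be inspected programmatically. The following sketch parses a compact copy of the listing with Python's standard xml.etree.ElementTree module; the helper child_tags is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# A compact copy of Listing 1.
LISTING_1 = """<books>
  <book>
    <title>Great American Novel</title>
    <characters>
      <character><name>Cliff</name><desc>really great guy</desc></character>
      <character><name>Lovely Woman</name><desc>matchless beauty</desc></character>
      <character><name>Loyal Dog</name><desc>sleepy</desc></character>
    </characters>
    <plot>Cliff meets Lovely Woman. Loyal Dog sleeps, but wakes up to bark at mailman.</plot>
    <success type="bestseller">4</success>
    <success type="bookclubs">9</success>
  </book>
</books>"""

root = ET.fromstring(LISTING_1)


def child_tags(element):
    """Return the tag names of an element's direct children, in document order."""
    return [child.tag for child in element]


book = root.find("book")
characters = book.find("characters")

print(root.tag)                    # the root element, which has no parent
print(child_tags(root))            # children of the root
print(child_tags(book))            # children of <book>
print(child_tags(characters))      # children of <characters>
print(child_tags(characters[0]))   # children of the first <character>
```

Iterating over an element yields only its direct children, which is why each call reports a single level of the tree.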
In addition to the nesting of elements that creates the parent-child relationships, XML elements can also have attributes. Attributes are name-value pairs attached to an element's start tag. Names are separated from values by an equal sign, =. Values are enclosed by single or double quotation marks. In Listing 1 above, each <success> element has one attribute named type; its value is "bestseller" on the first element and "bookclubs" on the second. There are different schools of thought among XML developers about the use of attributes. Most information contained in an attribute could instead be contained in a child element. Some developers insist that attribute information should be metadata, namely information about the data, and not the data itself, which should be contained in elements. The choice of whether to use attributes really depends on the nature of the data and how the data will be extracted from the XML.
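To make the attribute-versus-element choice concrete, here is a sketch that reads the type attribute from the <success> elements of Listing 1 and then reads the same information from a child-element layout. It uses Python's standard xml.etree.ElementTree module; the element names type and value in the alternative layout are hypothetical and are not part of Listing 1.

```python
import xml.etree.ElementTree as ET

# Fragment of Listing 1: each <success> element carries one "type" attribute.
# Single and double quotation marks are both accepted around values.
fragment = """<book>
  <success type="bestseller">4</success>
  <success type='bookclubs'>9</success>
</book>"""

book = ET.fromstring(fragment)

# Element.get() reads one attribute; Element.attrib holds them all as a dict.
ratings = {s.get("type"): s.text for s in book.findall("success")}
print(ratings)

# The same information modeled with child elements instead of an attribute.
# The element names <type> and <value> are hypothetical.
alternative = ET.fromstring(
    "<book><success><type>bestseller</type><value>4</value></success></book>"
)
first = alternative.find("success")
print(first.findtext("type"), first.findtext("value"))
```

Either layout carries the same data; the attribute form is more compact, while the child-element form leaves room for further nesting later.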
Strengths of XML
One of XML's good qualities is its relative simplicity. You can write XML with basic text editors and word processors; no special tools or software are required. The basic syntax for XML consists of nested elements, some of which have attributes and content. An element usually consists of two tags, a start tag and an end tag, each of which is bracketed by an opening < and a closing >. XML is case sensitive and does not ignore white space. It looks a lot like HTML, which is familiar to a lot of people, but, unlike HTML, it allows you to name your tags to best describe your data. Some of XML's advantages are its self-documenting format that is both human-readable and machine-readable, its support for Unicode, which allows for internationalization in human language support, and its stringent syntax and parsing requirements. Unfortunately, UTF-8 is problematic in PHP5; this shortcoming is one of the forces driving the development of PHP6.
Weaknesses of XML
XML is wordy and redundant, with the attendant consequences of being large to store and a huge consumer of bandwidth. People are supposed to be able to read it, but it's hard to imagine a human trying to read an XML file with 7 million nodes. The most basic parser functionality doesn't support a wide array of data types; therefore, irregular or unusual data, which is common, is a primary source of difficulty.
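The data type limitation shows up as soon as element content is read: a basic parser hands back everything as text, and the calling code has to convert it and cope with irregular values. The sketch below illustrates this with Python's standard xml.etree.ElementTree module. The helper to_int and the third <success> element with the "awards" type are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The first two values follow Listing 1; the padded " 9 " and the "awards"
# entry are hypothetical irregular data added for this example.
doc = """<book>
  <success type="bestseller">4</success>
  <success type="bookclubs"> 9 </success>
  <success type="awards">n/a</success>
</book>"""


def to_int(text, default=None):
    """Convert element text to int, returning default for missing or irregular data."""
    try:
        return int(text.strip())
    except (AttributeError, ValueError):
        # AttributeError: the element had no text (None); ValueError: not a number.
        return default


book = ET.fromstring(doc)
raw = [s.text for s in book.findall("success")]
parsed = [to_int(s.text) for s in book.findall("success")]
print(raw)     # every value arrives as a string
print(parsed)  # converted values, with None where conversion failed
```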
Well-Formed XML
An XML document is well-formed if it follows all of XML's syntax rules. If a document is not well-formed, it is not XML, in a technical sense. An unclosed HTML tag such as <br> is unacceptable in XML; the tag should be written <br /> to be well-formed XML. A parser won't parse XML properly if it is not well-formed. Additionally, an XML document must have one and only one root element. Think of the one root element as being like an endless file cabinet. You have one file cabinet, but there are few limits as to what and how much you can fit into it. There are endless drawers and folders into which you can stuff information.
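A parser can serve as a simple well-formedness check. The sketch below feeds four small made-up documents to Python's standard xml.etree.ElementTree module: one with an unclosed <br> tag, one using the empty-element form <br />, one with two root elements, and one with a single root. The helper check_well_formed is a name invented for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses, else (False, parser error message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        return False, str(err)


# Hypothetical sample documents.
samples = {
    "unclosed tag": "<p>line one<br>line two</p>",
    "empty-element tag": "<p>line one<br />line two</p>",
    "two root elements": "<book></book><book></book>",
    "single root": "<books><book></book><book></book></books>",
}

results = {name: check_well_formed(text) for name, text in samples.items()}
for name, (ok, message) in results.items():
    print(name, ok, message)
```

The error message returned for a rejected document includes the line and column where parsing stopped, which helps locate the offending tag.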