XML - K Dev Planet

XML

XML is rapidly becoming a data interchange format and the de facto data standard on the Internet. As a programmer you should have at least a basic idea of what it is and how it can be used in web applications (pages). Through this excerpt (taken and compiled from Professional Active Server Pages 3.0) I am going to explain the basics of this standard. In the coming days I will cover how it can be used while developing web solutions.

What is XML?

XML is not a language in the sense that Visual Basic or C++ are languages; it is a set of rules that defines how a text or document should be marked up. Marking up a document is the process of identifying certain areas of the document as having special meaning.

How does XML differ from HTML?

The major difference is that XML is designed to describe the structure of the text and not how it should be displayed (presentation style). XML doesn't have a fixed set of tags. Internet Explorer, for example, doesn't do anything special with user-defined tags. So even though the tag names have some meaning to us, they have none to XML. The point here is that tag names can be anything we like; it is only how we use them that gives them a meaning. Of course, it is sensible to give them meaningful names to start with. After all, XML is fairly readable, so using tag names that describe the contents is common sense.

XML can be used as a data interchange format. It's standard text, so it can be transferred from machine to machine. It is not in a proprietary format, so anybody can read it. And if the tags are named sensibly, the XML document data is self-describing. There is some terminology associated with XML, and there are rules for how an XML document can be laid out. Let us consider them one by one.

Tags & Elements

An element comprises a start tag and an end tag, along with the text they enclose, which can include other elements. This is a particularly important point since it forms the basis of well-formed XML, in which each opening tag must have a closing tag. If we are using XML to describe data, then it is possible that some fields might contain no data. In this case the elements would be empty. Empty elements in XML can be written in one of two ways. The first is with a start tag and an end tag, but no content; e.g.

<tagname></tagname>

The second way is to use just an opening tag, but put a forward slash immediately before the closing angle bracket; e.g.

<tagname />

Another part of being well formed is that tags in XML are case sensitive, so the opening tag and the closing tag must match in case. This means that the following is invalid in XML:

<TAGName> </tagname>
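
As a rough illustration of these rules, the sketch below feeds the three snippets above to a parser and records whether each is accepted. It assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical and introduced only for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Both ways of writing an empty element are accepted.
empty_long = is_well_formed("<tagname></tagname>")
empty_short = is_well_formed("<tagname />")

# Tag names are case sensitive, so this start/end pair does not match.
mismatched_case = is_well_formed("<TAGName> </tagname>")

print(empty_long, empty_short, mismatched_case)  # True True False
```

A parser stops at the first well-formedness error, so a check like this answers only "well formed or not"; it does not list every problem in a document.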

Root tag

One other term to be aware of is the root tag. This is the outermost tag, and an XML document must have exactly one root. For example,

<Authors>
    <Author>
        <Name>Christy</Name>
        <Id>1</Id>
    </Author>
    <Author>
        <Name>John</Name>
        <Id>2</Id>
    </Author>
</Authors>

Here the root tag is <Authors>. This is valid, since there is only one root tag. The following, however, is invalid:

<Authors>
    <Author>
        <Name>Christy</Name>
        <Id>1</Id>
    </Author>
</Authors>
<Authors>
    <Author>
        <Name>John</Name>
        <Id>2</Id>
    </Author>
</Authors>

Because there are two tags at the top level here, the document is not valid.
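
The single-root rule can be checked the same way. This is a minimal sketch, again assuming Python's standard xml.etree.ElementTree module; the constant and variable names are illustrative only, and the documents are compact versions of the two examples above.

```python
import xml.etree.ElementTree as ET

ONE_ROOT = """
<Authors>
    <Author><Name>Christy</Name><Id>1</Id></Author>
    <Author><Name>John</Name><Id>2</Id></Author>
</Authors>
"""

TWO_ROOTS = """
<Authors><Author><Name>Christy</Name><Id>1</Id></Author></Authors>
<Authors><Author><Name>John</Name><Id>2</Id></Author></Authors>
"""

# With a single root, the whole document is reachable from one element.
root = ET.fromstring(ONE_ROOT.strip())
names = [author.findtext("Name") for author in root.findall("Author")]

# With two top-level elements, the parser rejects the document.
try:
    ET.fromstring(TWO_ROOTS.strip())
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False

print(root.tag, names, two_roots_accepted)
```

Wrapping the two <Authors> blocks in one outer element would make the second document acceptable again.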

The <?xml> Tag

This is not a true XML tag, but a special declaration that gives the parser instructions about the document. When it is present, the <?xml> tag must be the very first thing in the XML document. It can be used to identify the XML version and the character encoding. For example,

<?xml version="1.0" ?>

This identifies the version of XML. Version 1.0 is the default and by far the most widely used; a later version, 1.1, exists but is rarely needed. Having the ability to specify the version in our XML documents allows us to future-proof them. This tag is also the place where we can define the character encoding used in the XML data. This is important if our data contains characters that aren't part of the ASCII character set and the document is not stored as UTF-8 or UTF-16, which a parser assumes by default. We can specify the encoding used in our document by adding the encoding attribute to the '?xml' declaration; e.g.

<?xml version="1.0" encoding="iso-8859-1" ?>
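
To see why the encoding attribute matters, here is a small sketch, assuming Python's standard xml.etree.ElementTree module. The <Name> element and its content are made up for the example. The same bytes are parsed once with the declaration and once without it.

```python
import xml.etree.ElementTree as ET

# "\u00e9" (e with an acute accent) is not ASCII; in ISO-8859-1 it is
# the single byte 0xE9.
document = '<?xml version="1.0" encoding="iso-8859-1" ?><Name>Jos\u00e9</Name>'
raw_bytes = document.encode("iso-8859-1")

# With the declaration, the parser decodes the byte correctly.
decoded_text = ET.fromstring(raw_bytes).text

# Without it, the parser assumes UTF-8, where a lone 0xE9 byte is invalid.
undeclared = raw_bytes.split(b"?>", 1)[1]
try:
    ET.fromstring(undeclared)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False

print(decoded_text == "Jos\u00e9", undeclared_accepted)
```

The declaration only describes the bytes; it is still up to whoever writes the file to save it in the encoding that the declaration names.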

Attributes

Like HTML, XML has attributes to define the properties of elements, and these must also be well formed. For attributes this means that their values must be enclosed in quotes (single or double). For example,

<Book ISBN="1-861002-61-0"> Professional Active Server Pages 3.0 </Book>
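
A short sketch of reading that attribute, assuming Python's standard xml.etree.ElementTree module. It also shows that the same element with the quotes left off is rejected, and that attribute names, like tag names, are case sensitive.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<Book ISBN="1-861002-61-0"> Professional Active Server Pages 3.0 </Book>'
)
isbn = book.get("ISBN")          # attribute value, without the quotes
lowercase_isbn = book.get("isbn")  # different name, so nothing is found
title = book.text.strip()

# The same element with an unquoted attribute value is not well formed.
try:
    ET.fromstring("<Book ISBN=1-861002-61-0> Professional ASP </Book>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(isbn, lowercase_isbn, title, unquoted_accepted)
```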

Special Characters

XML has a special set of characters that can't be used literally in normal XML text or attribute values and must be replaced by entity references. These are,

Character	Must be replaced by
&	&amp;
<	&lt;
>	&gt;
"	&quot;
'	&apos;

For example, the following XML is invalid,

<Book> Computers & Robotronics </Book>

Whereas the following is valid,

<Book> Computers &amp; Robotronics </Book>
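
In practice the replacement is usually done by a library rather than by hand. The sketch below assumes Python's standard xml.sax.saxutils.escape function, which replaces &, < and > by default and takes an optional dictionary for further replacements such as the two quote characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

title = "Computers & Robotronics"

# The raw ampersand makes the document not well formed.
try:
    ET.fromstring(f"<Book> {title} </Book>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

# escape() turns "&" into "&amp;"; the parser turns it back when reading.
escaped = escape(title)
parsed_text = ET.fromstring(f"<Book> {escaped} </Book>").text

# Quotes are only replaced when asked for through the second argument.
quote_entities = {'"': "&quot;", "'": "&apos;"}
fully_escaped = escape('<"R&D">', quote_entities)

print(raw_accepted, escaped, fully_escaped)
```

Note that the parsed text contains the plain & again: the entity reference exists only in the serialized document, not in the data the program sees.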

Schemas & DTDs

Schemas and DTDs (Document Type Definitions) are two sides of the same coin. They both specify which elements are allowed in a document, and can turn a well-formed XML document into a valid XML document. This means that, as well as being correctly marked up, it contains only allowed elements and attributes.

Microsoft somewhere along the line decided that DTDs were a bit stupid. A DTD is a text file that defines the structure of an XML document, but the DTD itself isn't XML; rather, it has a completely separate syntax. If we are dealing with XML documents, then the structures that define those documents should be XML too, and this is what Schemas are: the XML equivalent of a DTD.

For example, consider this DTD:

<!ELEMENT DOCUMENT (AUTHOR+)>
<!ELEMENT AUTHOR (au_id, contract)>
<!ELEMENT au_id (#PCDATA)>
<!ELEMENT contract (#PCDATA)>

This typical DTD is quite simple. It states that the document comprises one or more AUTHOR elements; the plus sign after AUTHOR means 'one or more'. Each AUTHOR element is made up of two other elements, au_id and contract, in that order. Each of these sub-elements contains character data, which a DTD declares as #PCDATA (parsed character data). But there are two real flaws with DTDs:

1. They aren't XML.

2. We can't specify data types, such as integer or date, for each element. #PCDATA simply means that an element contains character data; it doesn't identify the actual type of the element's contents.
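
To make the difference between well formed and valid concrete, the sketch below checks a document against the rules this DTD expresses: one or more AUTHOR elements, each holding au_id followed by contract. Python's standard library has no DTD validator, so this is a hand-written check for this one DTD only, not a general validator. The function name matches_author_dtd and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def matches_author_dtd(text: str) -> bool:
    """Hand-written check mirroring the AUTHOR DTD; not a general validator."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not even well formed
    if root.tag != "DOCUMENT":
        return False
    authors = list(root)
    # One or more children, and every one of them must be AUTHOR.
    if not authors or any(a.tag != "AUTHOR" for a in authors):
        return False
    for author in authors:
        # Exactly au_id then contract, in that order.
        if [child.tag for child in author] != ["au_id", "contract"]:
            return False
        # Character data only: no nested elements inside either child.
        if any(len(child) > 0 for child in author):
            return False
    return True


valid_doc = (
    "<DOCUMENT><AUTHOR><au_id>1</au_id><contract>true</contract></AUTHOR></DOCUMENT>"
)
wrong_order = (
    "<DOCUMENT><AUTHOR><contract>true</contract><au_id>1</au_id></AUTHOR></DOCUMENT>"
)
no_authors = "<DOCUMENT></DOCUMENT>"

print(matches_author_dtd(valid_doc),
      matches_author_dtd(wrong_order),
      matches_author_dtd(no_authors))  # True False False
```

All three sample documents are well formed; only the first also follows the structure the DTD describes, which is what makes it valid.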

For these reasons Microsoft proposed Schemas to the W3C (World Wide Web Consortium). If we convert the above DTD into a Schema, it would look something like,

<Schema ID="Author">
<Element name="au_id" />
<Element name="contract" />
</Schema>

With the addition of data types we'd get,

<Schema ID="Author">
<Element name="au_id" type="string" />
<Element name="contract" type="boolean" />
</Schema>

This Schema now details not only the allowable items but also their data types. The character data of a DTD is equivalent to a string, but the Schema allows other data types; the contract element, for example, contains Boolean data.
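
Because the Schema is itself XML, an ordinary XML parser can read it. The sketch below, assuming Python's standard xml.etree.ElementTree module, loads the typed Schema above and uses it to check the contents of an AUTHOR element. The TYPE_CHECKS table, the function names and the accepted Boolean spellings (true, false, 0, 1) are assumptions made for this illustration, not part of any real schema processor.

```python
import xml.etree.ElementTree as ET

SCHEMA = """
<Schema ID="Author">
<Element name="au_id" type="string" />
<Element name="contract" type="boolean" />
</Schema>
"""

# Hypothetical mapping from the Schema's type names to simple checks.
TYPE_CHECKS = {
    "string": lambda text: True,
    "boolean": lambda text: text in ("true", "false", "0", "1"),
}


def declared_types(schema_text: str) -> dict:
    """Read element names and their types out of the Schema document."""
    schema = ET.fromstring(schema_text.strip())
    return {e.get("name"): e.get("type") for e in schema.findall("Element")}


def type_errors(author_text: str, types: dict) -> list:
    """List the children of an element that break the declared types."""
    errors = []
    for child in ET.fromstring(author_text):
        type_name = types.get(child.tag)
        if type_name is None:
            errors.append(f"{child.tag}: not declared")
        elif not TYPE_CHECKS[type_name]((child.text or "").strip()):
            errors.append(f"{child.tag}: not a valid {type_name}")
    return errors


types = declared_types(SCHEMA)
good = "<AUTHOR><au_id>1</au_id><contract>true</contract></AUTHOR>"
bad = "<AUTHOR><au_id>2</au_id><contract>maybe</contract></AUTHOR>"

print(types)
print(type_errors(good, types))  # []
print(type_errors(bad, types))   # ['contract: not a valid boolean']
```

A DTD could not reject the second document, since "maybe" is perfectly good character data; the type information is what makes the check possible.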
