XML is rapidly becoming the de facto data interchange format on the
Internet. As a programmer you should have at least a basic idea of what
it is and how it can be used in web applications (pages). In this
excerpt (taken and compiled from Professional Active Server Pages 3.0)
I am going to explain the basics of this standard. In the coming days I
will cover how it can be used when developing web solutions.
What is XML?
XML is not a language in the sense that Visual Basic or C++ are
languages; it is a set of rules that define how a text or document
should be marked up. Marking up a document is the process of
identifying certain areas of the document as having special meaning.
How does XML differ from HTML?
The major difference is that XML is designed to describe the structure
of the text, not how it should be displayed (presentation style). XML
doesn't have a fixed set of tags, and Internet Explorer doesn't do
anything special with user-defined tags. So even though the tag names
have some meaning to us, they have none to XML. The point here is that
tag names can be anything we like; it is only how we use them that
gives them a meaning. Of course it is sensible to give them meaningful
names to start with. After all, XML is fairly readable, so using tag
names that describe the contents is common sense.
XML can be used as a data interchange format. It's standard text, so it
can be transferred from machine to machine. It is not in a proprietary
format, so anybody can read it. And if the tags are named sensibly, the
XML document is self-describing. There is some terminology associated
with XML, and there are rules for how an XML document can be laid out.
Let us consider them one by one.
- Tags & Elements
An element comprises a start tag and an end tag, along with the text
they enclose, which can include other elements. This is a particularly
important point, since it underlies the concept of Well Formed XML, in
which each opening tag must have a closing tag. If we are using XML to
describe data, then it is possible that some fields might contain no
data. In this case the elements would be empty. Empty elements in XML
can be written in one of two ways. The first is with a start tag and an
end tag, but no content; e.g.
<tagname></tagname>
The second way is to use just one tag, with a forward slash immediately
before the closing angle bracket; e.g.
<tagname />
Another part of being Well Formed is that tags in XML are case
sensitive, so the opening tag and the closing tag must match in case.
This means that the following is not well-formed XML,
<TAGName> </tagname>
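The rules above can be tried out with a parser. The following is only a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the original excerpt); the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Both spellings of an empty element parse to an element with no text.
long_form = ET.fromstring("<tagname></tagname>")
short_form = ET.fromstring("<tagname />")
print(long_form.text, short_form.text)

print(is_well_formed("<tagname></tagname>"))   # start and end tags match
print(is_well_formed("<TAGName> </tagname>"))  # tags differ in case
print(is_well_formed("<tagname>"))             # end tag is missing
```

The parser rejects the last two strings because a start tag is never closed by an end tag with exactly the same name.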
- Root tag
One other term to be aware of is the Root Tag. This is the outermost
tag, and an XML document must have exactly one root. For example,
<Authors>
  <Author>
    <Name>Christy</Name>
    <Id>1</Id>
  </Author>
  <Author>
    <Name>John</Name>
    <Id>2</Id>
  </Author>
</Authors>
Here the root tag is <Authors>. This is well formed, since there is only
one root tag. The following, however, is not,
<Authors>
  <Author>
    <Name>Christy</Name>
    <Id>1</Id>
  </Author>
</Authors>
<Authors>
  <Author>
    <Name>John</Name>
    <Id>2</Id>
  </Author>
</Authors>
Because there are two tags at the top level here, the document is not
well formed.
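As a rough illustration of the single-root rule, the sketch below feeds both documents to a parser. It assumes Python's standard xml.etree.ElementTree module, which the original excerpt does not use; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

one_root = """
<Authors>
  <Author><Name>Christy</Name><Id>1</Id></Author>
  <Author><Name>John</Name><Id>2</Id></Author>
</Authors>
"""

two_roots = """
<Authors>
  <Author><Name>Christy</Name><Id>1</Id></Author>
</Authors>
<Authors>
  <Author><Name>John</Name><Id>2</Id></Author>
</Authors>
"""

# With a single root, the parser hands back that root element.
root = ET.fromstring(one_root)
names = [author.findtext("Name") for author in root.findall("Author")]
print(root.tag, names)

# With two top-level elements, parsing fails.
try:
    ET.fromstring(two_roots)
    second_document_parsed = True
except ET.ParseError as error:
    second_document_parsed = False
    print("not well formed:", error)
```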
- The <?xml> Tag
This is not a true XML tag, but a special tag written in the form of a
processing instruction. It should be the first line of each XML
document, with nothing before it. This tag can be used to identify
version and encoding information. For example,
<?xml version="1.0" ?>
This identifies the version of XML. Version 1.0 is the default and the
one in general use, and stating it in our XML documents helps to future
proof them. This tag is also the place where we can define the character
encoding used in the XML data. This is important if our data contains
characters that aren't part of the standard ASCII character set. We can
specify the encoding used in our document by adding the encoding
attribute to the '?xml' processing instruction; e.g.
<?xml version="1.0" encoding="iso-8859-1" ?>
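To see what the encoding attribute does in practice, here is a small sketch, again assuming Python's standard xml.etree.ElementTree module. The <Name> element and the value "José" are made-up sample data, not part of the original excerpt.

```python
import xml.etree.ElementTree as ET

# "Jos\u00e9" is "José"; the escape keeps this source file pure ASCII.
text = '<?xml version="1.0" encoding="iso-8859-1" ?><Name>Jos\u00e9</Name>'

# Encode to bytes, as the document would be stored in a file or sent
# over the wire. In ISO-8859-1 the accented letter is the single byte 0xE9.
raw = text.encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
name = ET.fromstring(raw)
print(name.text)
```

Without the encoding attribute a parser would assume UTF-8 for these bytes, and the lone 0xE9 byte would not decode.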
- Attributes
Like HTML, XML has Attributes to define the properties of elements, and
these must also be well formed. For attributes this means that their
values must be enclosed in quotes. For example,
<Book ISBN="1-861002-61-0"> Professional Active Server Pages 3.0 </Book>
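A short sketch of reading that attribute back, assuming Python's standard xml.etree.ElementTree module (an assumption, not something the original excerpt uses). The unquoted variant is included only to show that the parser refuses it.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<Book ISBN="1-861002-61-0"> Professional Active Server Pages 3.0 </Book>'
)

# Attributes are looked up by name; the element text is separate.
isbn = book.get("ISBN")
title = book.text.strip()
print(isbn, "/", title)

# The same element with the quotes left off is not well formed.
try:
    ET.fromstring("<Book ISBN=1-861002-61-0>title</Book>")
    unquoted_parsed = True
except ET.ParseError:
    unquoted_parsed = False
print(unquoted_parsed)
```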
- Special Characters
XML has a special set of characters that can't be used in normal XML
strings. These are,
Character | Must be replaced by
& | &amp;
< | &lt;
> | &gt;
" | &quot;
' | &apos;
For example, the following XML is invalid,
<Book> Computers & Robotronics </Book>
Whereas the following is valid,
<Book> Computers &amp; Robotronics </Book>
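In practice the replacement is usually done by a library rather than by hand. The sketch below assumes Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree), neither of which appears in the original excerpt.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

title = "Computers & Robotronics"

# escape() replaces &, < and > with their entity references.
escaped = escape(title)
print(escaped)

# The escaped text parses, and the parser turns the entity back into "&".
book = ET.fromstring("<Book>" + escaped + "</Book>")
print(book.text)

# The raw text, with a bare ampersand, is rejected.
try:
    ET.fromstring("<Book>" + title + "</Book>")
    raw_parsed = True
except ET.ParseError:
    raw_parsed = False
```

Note that escape() leaves quote characters alone by default; for attribute values, xml.sax.saxutils.quoteattr is the usual choice.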
- Schemas & DTDs
Schemas and DTDs (Document Type Definitions) are two sides of the same
coin. They both specify which elements are allowed in a document, and
can turn a Well Formed XML document into a Valid XML document. That
means that, as well as being correctly marked up, it contains only
allowed elements and attributes.
Somewhere along the line Microsoft decided that DTDs were a bit stupid.
A DTD is a text file that defines the structure of an XML document, but
the DTD itself isn't XML; it has a completely separate syntax. If we are
dealing with XML documents, then the structures that define those
documents should be XML too, and this is what Schemas are: the XML
equivalent of a DTD.
For example, here is a typical DTD,
<!ELEMENT DOCUMENT (AUTHOR+)>
<!ELEMENT AUTHOR (au_id, contract)>
<!ELEMENT au_id (#PCDATA)>
<!ELEMENT contract (#PCDATA)>
This DTD is quite simple. It states that the document comprises one or
more AUTHOR elements; the plus sign at the end of AUTHOR says 'one or
more'. Each AUTHOR element is made up of two other elements. Each of
these sub-elements contains character data (#PCDATA, parsed character
data). But there are two real flaws with DTDs,
1. They aren't XML.
2. We can't specify the data types - such as integers, dates and so on -
for each element. #PCDATA simply means that an element contains just
character data, and doesn't identify the actual type of the element's
contents.
For these reasons Microsoft proposed Schemas to the W3C (World Wide Web
Consortium). If we convert the above DTD into a Schema, it would be
something like,
<Schema ID="Author">
<Element name="au_id" />
<Element name="contract" />
</Schema>
With the addition of data types we'd get,
<Schema ID="Author">
<Element name="au_id" type="string" />
<Element name="contract" type="boolean" />
</Schema>
This Schema now details not only the allowable items, but also their
data types. The character data of a DTD is equivalent to a string, but
the Schema allows other data types - the contract element, for example,
contains Boolean data.
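To make the difference between "well formed" and "valid" concrete, here is a hand-rolled sketch in Python. It is not a real schema validator and does not read the Schema syntax shown above: the AUTHOR_SCHEMA table, the to_boolean and check_author helpers, and the sample values are all assumptions made up for illustration. It only checks that each child element is allowed and that its text converts to the declared type.

```python
import xml.etree.ElementTree as ET


def to_boolean(text):
    """Convert 'true'/'false' to a bool; anything else is an error."""
    if text not in ("true", "false"):
        raise ValueError("not a boolean: " + repr(text))
    return text == "true"


# Hypothetical stand-in for the Schema: element name -> type converter.
AUTHOR_SCHEMA = {"au_id": str, "contract": to_boolean}


def check_author(author):
    """Return the typed values of an <AUTHOR> element or raise ValueError."""
    values = {}
    for child in author:
        if child.tag not in AUTHOR_SCHEMA:
            raise ValueError("element not allowed: " + child.tag)
        values[child.tag] = AUTHOR_SCHEMA[child.tag](child.text or "")
    return values


good = ET.fromstring(
    "<AUTHOR><au_id>172-32-1176</au_id><contract>true</contract></AUTHOR>"
)
print(check_author(good))

# Well formed, but not valid: 'maybe' is not Boolean data.
bad = ET.fromstring(
    "<AUTHOR><au_id>172-32-1176</au_id><contract>maybe</contract></AUTHOR>"
)
try:
    check_author(bad)
except ValueError as error:
    print("invalid:", error)
```

Both sample documents parse without complaint; only the type check tells them apart, which is exactly the information a DTD cannot express.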
** You can check out a live-example Webpage using XML
here. **