Getting Started with XML
by David Morris
Extensible Markup Language (XML) is a dialect of Standard Generalized Markup Language (SGML). The World Wide Web Consortium created this new dialect of
SGML to provide a simple alternative to SGML for describing data exchanged between software
applications. Although XML is a simplified version of SGML, it is powerful enough to describe almost any
data, in a format that is understood by the majority of computers in use today.
One difference between XML and other markup languages is that XML is a meta-markup language. In
other words, it defines the rules for creating markup but does not define a domain-specific
set of tags. This allows XML to be adapted to fulfill more requirements than other markup languages,
such as HTML. The ability to adapt to new functions or services makes XML very powerful.
Documents marked up with XML have many benefits over documents that are stored as plain text or are
marked up using a less capable markup such as HTML. Those benefits include:
- Data marked up with XML is self-describing
- XML relies on simple text, which eliminates most compatibility issues
- You can extend XML to fit almost any domain
These are just a few of XML's benefits. Just like Java, XML is sometimes over-hyped and presented as the
solution to all problems. It is not the solution to all problems, but it is a key technology that addresses many
of today's computing problems.
Defining an XML Document
With XML, you define document elements and attributes that are used to mark up your information. An
element can represent some piece of information, such as an address, telephone number, or a person's
name. Attributes are associated with an element and identify information that is typically not printed or
displayed. A payment type might be stored as an attribute for a payment element.
Example 1 shows how a payment transaction might be defined using XML:
<payment type="ET">
<amount>500.00</amount>
<unit>USD</unit>
</payment>
In the example, there are three elements: payment, amount, and unit. The payment root element contains
the amount and unit elements and has a type attribute. This example also shows how XML elements are
nested. Unlike HTML, every XML element must have both an opening and a closing tag, and nested
elements must be closed in order. For example, the unit element in this example, which is nested in
the payment element, has to be closed before the payment element is closed. Another difference
between HTML and XML is that XML is case sensitive. A <Unit> tag is not the same as a <unit> tag.
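To see how a program views these elements and attributes, here is a minimal sketch in Python. Python and its standard xml.etree.ElementTree module are my assumptions for illustration; they are not part of the iSeries tools discussed in this article. The sketch parses the payment document from Example 1 and reads the attribute and the nested elements:

```python
import xml.etree.ElementTree as ET

# The payment document from Example 1.
PAYMENT_XML = """<payment type="ET">
<amount>500.00</amount>
<unit>USD</unit>
</payment>"""

root = ET.fromstring(PAYMENT_XML)

# The root element's name and its type attribute.
print(root.tag)                 # payment
print(root.get("type"))         # ET

# Nested elements are looked up by name; their text is the content.
print(root.findtext("amount"))  # 500.00
print(root.findtext("unit"))    # USD

# Names are case sensitive: this document has no <Unit> element.
print(root.find("Unit"))        # None
```

The lookup for Unit returns nothing because the element name in the document is unit, in lowercase.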
eXtending with XML
Unlike most other markup languages (including HTML), which restrict you to a fixed set of tags, XML
allows you to create new tags. With HTML, you cannot create your own tags. XML is a meta-markup
language, which means you can use XML to describe your own domain-specific elements. Domain-specific
elements are elements that are useful in describing content related to a specific area. For example, there is a
domain-specific Chemical Markup Language, which allows chemists to describe data in a way that
facilitates the exchange of chemical information.
XML's extensibility makes it very powerful. In Example 1, the payment root element contains an amount
and a unit element that describes a payment. Example 2, which follows, shows how to add a received date
element to the payment transaction, which demonstrates that you can extend XML to fit your needs:
<payment type="ET">
<amount>500.00</amount>
<unit>USD</unit>
<received-date>
<year>2001</year>
<month>10</month>
<day>29</day>
</received-date>
</payment>
Now the example contains a received-date element. Notice that I broke down the received-date element
into three separate year, month, and day elements. I chose to break these down because an XML element
should contain data and not structure. Removing structure from elements allows maximum flexibility when
working with XML documents.
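One benefit of keeping structure out of element content is that a program does not have to parse a date string. The following sketch, again assuming Python and its standard library purely for illustration, builds a date value directly from the year, month, and day elements of Example 2:

```python
import datetime
import xml.etree.ElementTree as ET

# The payment document from Example 2.
PAYMENT_XML = """<payment type="ET">
<amount>500.00</amount>
<unit>USD</unit>
<received-date>
<year>2001</year>
<month>10</month>
<day>29</day>
</received-date>
</payment>"""

root = ET.fromstring(PAYMENT_XML)
received = root.find("received-date")

# Each part of the date is its own element, so no string format
# (MM/DD/YYYY, YYYY-MM-DD, and so on) has to be guessed or parsed.
received_date = datetime.date(
    int(received.findtext("year")),
    int(received.findtext("month")),
    int(received.findtext("day")),
)
print(received_date.isoformat())  # 2001-10-29
```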
Creating XML Documents
An XML document is plain text. To create an XML document, you combine character data and markup
tags. You can use a specialized XML editor or simply start your favorite text editor; WordPad will
work fine for most simple XML documents. The best XML editors I have used are SoftQuad's XMetaL and
IBM's WebSphere Studio Site Developer, which is currently in beta.
On the iSeries, you can use the EDTF command to type in XML documents, or type them in on your PC
and transfer them to your iSeries system. You can use just about any encoding for XML documents,
including Unicode, but on the iSeries it is best to stick with the standard International
Organization for Standardization (ISO) 8859 character sets and use their corresponding iSeries code
pages. For example, in the United States you would use ISO-8859-1 with a coded character set ID
(CCSID) of 819; for Cyrillic it is ISO-8859-5, with a CCSID of 915.
For XML documents that you want to build in a program, you have several choices. You can use the Unix
file APIs to create XML documents from an ILE program. The open-source iSeries-toolkit has a Unix
module that provides this type of support. Another option is to use a parser, like the XML Interface
for RPG and Procedural Languages. The latter allows your RPG applications to play along
in this open data transportation game and will be covered in more detail in an upcoming article. A parser is
useful when you want to validate the format of your XML document or manipulate individual elements.
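To illustrate the idea of building a document in a program rather than typing it, here is a small sketch. It uses Python's standard xml.etree.ElementTree module, which is my stand-in for illustration only; it is not the iSeries-toolkit or the RPG interface mentioned above. The sketch builds the payment document from Example 1 and serializes it in the ISO-8859-1 encoding:

```python
import xml.etree.ElementTree as ET

# Build the element tree for the payment document from Example 1.
payment = ET.Element("payment", {"type": "ET"})
ET.SubElement(payment, "amount").text = "500.00"
ET.SubElement(payment, "unit").text = "USD"

# Serialize to bytes in ISO-8859-1. The library writes the markup,
# so start and end tags always match.
document = ET.tostring(payment, encoding="ISO-8859-1")
print(document.decode("ISO-8859-1"))

# Reading the bytes back confirms the result is well formed.
check = ET.fromstring(document)
print(check.findtext("amount"))
```

Letting a library write the tags avoids the mismatched or unclosed tags that are easy to introduce when a program builds XML by concatenating strings.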
A complete XML document should start with an XML declaration. The declaration specifies the
version of XML and the encoding. Enclose comments in an XML document between <!-- and -->.
Example 3, which follows, is a complete XML document that describes a log delivery transaction.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!-- Sample log delivery XML document -->
<deliveries>
<load scale-type="DTL">
<scale-ticket>12345</scale-ticket>
<weight>42168</weight>
<weight-uom>LBS</weight-uom>
<scale-uom>US</scale-uom>
<delivered-date>
<month>10</month>
<day>29</day>
<year>2001</year>
</delivered-date>
<log>
<species>WESTERN LARCH</species>
<grade>PEELER</grade>
<large-end-diameter>15</large-end-diameter>
<small-end-diameter>12</small-end-diameter>
<length>32</length>
</log>
<log>
<species>WESTERN LARCH</species>
<grade>PEELER</grade>
<large-end-diameter>13</large-end-diameter>
<small-end-diameter>9</small-end-diameter>
<length>32</length>
</log>
</load>
</deliveries>
All XML documents consist of XML text, which is character data and markup. Markup is everything but
your content and includes start tags, end tags, comments, and entity references. Delimiters surround
markup. The most commonly used are tag delimiters, which are less-than (<) and greater-than (>)
symbols, and entity delimiters, which are the ampersand (&) and semicolon (;).
These are the main components you will use when creating an XML document:
- Elements made up of tags like <log> and </log>
- Attributes that add information like menuitem="Y"
- Entity references, such as &REPLACEMENT;, that act as placeholders for text or binary files
- Processing instructions to embed non-XML information
- Comments that describe an XML document
- Text, which supplies the most common form of XML content
All XML documents use some combination of these components. XML supplies strict rules that describe
how and where these components may be used.
Notice that the XML document in example 3 does not describe presentation. The syntax provided by XML
allows you to describe the content of a document. This capability allows you to describe the content of any
document and, just as importantly, allows you to separate the content of a document from the document's
presentation. Style is another term for the presentation format.
Adding Style to Your XML
After creating an XML document, you might want to display the contents in a Web browser. Unlike
HTML, XML has no built-in style, so you have to combine your XML document with a stylesheet. Style
allows you to describe formatting for data contained in an XML document. Style describes layout, fonts,
color, and behavior for the elements in an XML document.
There are several popular ways to describe style for an XML document. The first, Cascading Style Sheets
(CSS), is widely supported in browsers. The second, Extensible Stylesheet Language (XSL), uses an XML
variant that consists of two parts. The first part of XSL is a language for transforming XML documents
from one format to another. The second part of XSL provides a vocabulary for specifying formatting
semantics.
In this brief overview, I won't get into the details of CSS and XSL. If you do need to present XML data in a
browser, you will need to use one of these. CSS is mature and widely supported by browsers. However,
CSS is very limited, particularly when you need to restructure data into a list or table. With XSL, browser
support is very limited. The best way to use XSL at this point is to transform XML to HTML on your
iSeries server using XSLT.
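The general idea of turning XML content into HTML on the server can be sketched in a few lines. Note that the following is not XSLT: it is a hand-written Python stand-in (my assumption for illustration, using only the standard library, which has no XSLT processor) that turns the log elements from a trimmed copy of Example 3 into an HTML table. The function name logs_to_html and the choice of columns are hypothetical:

```python
import xml.etree.ElementTree as ET
from html import escape

# A trimmed copy of the log delivery document from Example 3.
DELIVERIES_XML = """<deliveries>
<load scale-type="DTL">
<scale-ticket>12345</scale-ticket>
<log>
<species>WESTERN LARCH</species>
<grade>PEELER</grade>
<length>32</length>
</log>
<log>
<species>WESTERN LARCH</species>
<grade>PEELER</grade>
<length>32</length>
</log>
</load>
</deliveries>"""

COLUMNS = ["species", "grade", "length"]


def logs_to_html(xml_text):
    """Render every log element as one row of an HTML table."""
    root = ET.fromstring(xml_text)
    header = "".join(f"<th>{name}</th>" for name in COLUMNS)
    rows = []
    for log in root.iter("log"):
        # Escape the content so it is safe to place in HTML.
        cells = "".join(
            f"<td>{escape(log.findtext(name, default=''))}</td>"
            for name in COLUMNS
        )
        rows.append(f"<tr>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"


table_html = logs_to_html(DELIVERIES_XML)
print(table_html)
```

The XML document itself is unchanged; the presentation lives entirely in the code that produces the HTML, which is the same separation a stylesheet provides.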
How's My Form?
One important feature of XML is that it provides built-in assurance that the form and content of
information are correct and reliable. There are several ways that XML provides this assurance:
- Documents must be well-formed, adhering to XML's syntax
- Documents can conform to a DTD or Schema
- XML can't fix or interpret malformed documents
Although XML provides the flexibility to create new tags and content as necessary, all XML documents
and extensions must conform to XML's rules. There are two sets of rules: the first ensures that the basic
syntax and structure of a document are correct; the second set of rules is provided by a Document Type
Definition or Schema and applies domain-specific validation.
When a document's basic syntax and structure are correct, the document is considered to be well formed.
Programs that process XML documents check for conformance to these basic rules and are allowed to
identify errors, but the XML specification specifically prohibits the correction of errors or interpretation of
any document that is not well formed. In addition, programs that process XML documents cannot ignore
errors.
The following list describes some of XML's rules:
- If included, the XML declaration must be the first thing in the document, starting on line one
- Every XML document must have one root element
- All elements must have matching begin and end tags
- Elements must be properly nested
- XML is case sensitive
- White space outside of elements is ignored
- Attribute values must be enclosed in single (') or double (") quotes
- Use &amp; for the ampersand (&) and &lt; for the less-than (<) symbol inside XML content
If an XML document does not follow these rules, it is not well formed and cannot be processed. Part of the
job of an XML parser is to check an XML document to make sure that it is well formed.
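To show a parser rejecting documents that break these rules, here is a small sketch. It assumes Python's standard xml.etree.ElementTree parser, chosen only for illustration; the helper name is_well_formed is hypothetical. Each failing case breaks one of the rules listed above:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<unit>USD</unit>"))           # True
print(is_well_formed("<Unit>USD</unit>"))           # False: case mismatch
print(is_well_formed("<a><b></a></b>"))             # False: bad nesting
print(is_well_formed("<a/><b/>"))                   # False: two root elements
print(is_well_formed("<payment type=ET/>"))         # False: unquoted attribute
print(is_well_formed("<name>Smith & Sons</name>"))  # False: bare ampersand

# escape() replaces & and < with their entity references.
safe = "<name>" + escape("Smith & Sons") + "</name>"
print(is_well_formed(safe))                         # True
```

The parser does not try to repair any of the failing documents; it simply reports an error, which is the behavior the XML specification requires.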
Validating XML Documents
The next level of validation for XML documents uses a document type definition (DTD) or Schema to
apply domain-specific validation. A DTD describes what elements and attributes are valid in an XML
document and may specify the order of elements and other relationships. DTDs have been around a long
time and are widely used and understood. Instead of XML, DTDs use Extended Backus-Naur Form
(EBNF) to describe data, which has its roots in Standard Generalized Markup Language (SGML). I won't go
into detail on DTDs, but example 4, which follows, shows a simple DTD that describes a recipe element
and its contained elements. A document that meets the criteria of a DTD or Schema is well formed and
valid. A validating parser performs these optional validity checks.
<!ELEMENT recipe (name, ingredient+, instruction+)>
<!ELEMENT name (#PCDATA) >
<!ELEMENT ingredient (quantity?, description) >
<!ELEMENT quantity (#PCDATA) >
<!ELEMENT description (#PCDATA) >
<!ELEMENT instruction (#PCDATA) >
Schemas are XML documents and are more powerful than DTDs. Unlike a DTD, which can check only the
structure and order of elements, a schema can also make sure that the data contained within an element
conforms to certain rules. The Schema specification became a recommendation in 2001, so support for Schemas is still
spotty. For recipes, DTDs are fine, but in time, Schemas should replace DTDs, particularly for the types of
business applications that are popular on the iSeries. The Schema snippet in example 5, which follows,
validates the log element from example 3.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<!-- Validating the log element -->
<xs:element name="log" type="logType"/>
<xs:complexType name="logType">
<xs:sequence>
<xs:element name="species" type="xs:string"/>
<xs:element name="grade" type="xs:string"/>
<xs:element name="large-end-diameter"
type="xs:positiveInteger"/>
<xs:element name="small-end-diameter"
type="xs:positiveInteger"/>
<xs:element name="length" type="xs:positiveInteger"/>
</xs:sequence>
</xs:complexType>
</xs:schema>
From the example, you can see that Schemas allow you to check many things. Like a DTD, Schemas allow
you to check the structure of an XML document. Schemas also allow you to verify whether an element
contains the right type of data, such as string, positive integer, decimal, and date. You can even check for a
list or range of values. If you decide to validate data on the iSeries with Schemas, you are best off using the
latest version of the Apache Software Foundation's Xerces parser.
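To make the kinds of checks in Example 5 concrete, here is a hand-written sketch of the same rules in Python. This is not Schema validation and it is not how Xerces works; it is only my illustration, using the standard library, of what the schema asks a validating parser to verify for a log element: the five child elements in order, with the last three holding positive integers. The names LOG_FIELDS and check_log are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

# Child elements of <log> in the order required by Example 5, with a
# flag that marks the ones typed as xs:positiveInteger.
LOG_FIELDS = [
    ("species", False),
    ("grade", False),
    ("large-end-diameter", True),
    ("small-end-diameter", True),
    ("length", True),
]


def check_log(log):
    """Return a list of problems found in one log element."""
    expected = [name for name, _ in LOG_FIELDS]
    actual = [child.tag for child in log]
    if actual != expected:
        return [f"expected children {expected}, found {actual}"]

    problems = []
    for child, (name, must_be_positive) in zip(log, LOG_FIELDS):
        text = (child.text or "").strip()
        if must_be_positive:
            # Digits with an optional plus sign, and a value above zero.
            is_integer = re.fullmatch(r"\+?[0-9]+", text) is not None
            if not (is_integer and int(text) > 0):
                problems.append(f"{name} must be a positive integer")
    return problems


good = ET.fromstring(
    "<log><species>WESTERN LARCH</species><grade>PEELER</grade>"
    "<large-end-diameter>15</large-end-diameter>"
    "<small-end-diameter>12</small-end-diameter>"
    "<length>32</length></log>"
)
bad = ET.fromstring(
    "<log><species>WESTERN LARCH</species><grade>PEELER</grade>"
    "<large-end-diameter>15</large-end-diameter>"
    "<small-end-diameter>12</small-end-diameter>"
    "<length>long</length></log>"
)
print(check_log(good))  # []
print(check_log(bad))   # ['length must be a positive integer']
```

With a Schema, none of this code has to be written or maintained: the rules live in the schema document and a validating parser applies them.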
XML on the iSeries
Even on the iSeries platform, which tends to wait for new technologies to mature before embracing them,
XML is beginning to make headway in areas like Electronic Data Interchange and Web publishing. IBM is
pushing Web Services as a way to process XML-based transactions. Products like WebSphere use XML
extensively to store configuration data.
Many of the XML technologies, including Web Services, are not mature enough to be in widespread use. A
few companies such as Microsoft, Sun, and IBM have standards-based solutions that are not
entirely open. Evolving standards from the World Wide Web Consortium Web Services Activity groups
should resolve some of the compatibility issues.
Java supplies the most complete and up-to-date XML support. On the iSeries, there is some built-in XML
support for RPG and C, but you will find more Java-based examples and programs that run on the iSeries.
The IBM Information Center's "XML Tools for OS/400" page describes the tools IBM provides for the iSeries.
<conclusion>
Use of XML is growing quickly. Many products like WebSphere use XML to support configuration data.
Recently, new products and services are using XML to define data exchanged between disparate
computers. Because the foundation of XML is simple text, XML is able to span the void between
applications running on almost any computer system.
Unlike HTML, XML describes data and not presentation. Data is described more accurately than with
HTML, so the intended use is clearer. XML defers presentation to stylesheets that provide quite a bit of
formatting flexibility. HTML is no longer being extended, so today's capabilities are all that will ever exist.
XML formatting begins where HTML leaves off and XSL style sheets already provide more formatting
options than HTML and CSS.
XML is a meta-markup language, which means XML describes information about markup. Because of this,
you can extend XML to fit almost any problem domain. The potential of XML is still developing.
Programmers are finding clever ways to use XML in their applications that make use of standard XML
tools. On the iSeries, Java is the language of choice for programmers using XML, but RPG and COBOL
programmers have decent support through the XML4PR parser and Unix APIs.
In future articles, I will take a more in-depth look at XML. I plan to provide more information on parsers,
XML document validation, XSL and XSLT, Web Services, and other XML technologies.
</conclusion>
David Morris is a software architect at Plum Creek Timber Company and started the iSeries-toolkit
open-source project. He can be contacted by e-mail at dmmorris@itjungle.com.