Guild Companies, Inc.  
 
Midrange Programmer - How-To Advice & Free Code
OS/400 Edition
Volume 1, Number 9 - May 9, 2002

Validating XML with a Document Type Definition

by David Morris

In "Getting Started with XML," I introduced you to the Extensible Markup Language and described a few of the rules that XML documents should follow. If you had a chance to read that article, you know XML documents must be "well-formed," meaning they must conform to XML's basic syntax and structure. Optionally, an XML document may be "valid," which means the document conforms to the rules of a document type definition (DTD) or an XML schema that defines the allowed structure and content.

One of the biggest problems with HTML is that the specification does not explicitly forbid HTML processors, like browsers, from attempting to fix markup errors. Not wanting to be unfriendly, most browsers are very tolerant of errors and blindly process error-ridden HTML documents. The problem with this approach is that this automatic correction takes a lot of processing time and does not always produce the intended result. The designers of XML learned from this mistake, and the XML specification explicitly forbids any sort of error correction.

The requirement that all XML documents be well-formed ensures that XML documents are syntactically correct. They must have a single root and balanced tags, and the tags must be properly nested. In addition, XML documents can specify a DTD or an XML schema that imposes further rules. A DTD allows you to specify the elements that are valid in an XML document, their order, attributes, and other relationships. Schemas support the same kind of validation as a DTD and also allow you to specify the type of data and values that elements may contain. Today, DTDs are far more prevalent than XML schemas, so this article will concentrate on DTDs and their use.

An XML parser is software that reads through an XML document, dividing it into individual elements and attributes. The XML parser then passes the individual parts of an XML document back to an application for processing. During the parsing process, if errors are found in an XML document, processing stops and an error is reported. I'll show you how to use the Xerces XML parser on your iSeries system to check an XML document.

Building Well-Formed XML Documents

All XML documents have to conform to some basic rules. An XML document that conforms to the rules is well-formed; a document that does not follow the rules cannot be processed.

The rules that an XML document must follow to be well-formed are described in the World Wide Web Consortium's XML 1.0 recommendation. The following list summarizes the most important rules:

  • If included, an XML declaration starts on line one.
  • Every XML document must have one root element.
  • All elements must have matching opening and closing tags.
  • Elements must be properly nested.
  • XML is case-sensitive.
  • White space outside of elements is ignored.
  • Attribute values must be enclosed in single (') or double (") quotation marks.
  • Use &amp; for ampersand (&) and &lt; for less than (<) inside of character content and attribute values.

If you follow these rules, you will rarely have a problem. Most of the other rules have to do with limitations, character conversions, and seldom-used aspects of XML.
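
The rules in the list above can be checked mechanically. The following is a minimal sketch in Python, which this article does not otherwise use; it relies on the non-validating parser in Python's standard library. The helper name is_well_formed and every sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False on any syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents: one good, the rest each break one rule.
samples = {
    "well-formed": '<order id="1"><item>Bolt &amp; nut</item></order>',
    "two root elements": "<order/><order/>",
    "missing closing tag": "<order><item>Bolt</order>",
    "improper nesting": "<order><item><name>Bolt</item></name></order>",
    "case mismatch": "<Order></order>",
    "unquoted attribute": "<order id=1/>",
    "bare ampersand": "<item>Bolt & nut</item>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample parses; each of the others is rejected outright, with no attempt at correction.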

Elements are the building blocks of XML documents. Elements describe the logical structure of an XML document. Matching opening and closing tags enclose elements that may contain content and structure. Content is the text or binary information within an XML document. Elements that support structure make it easier for programs to understand a document by providing a way of grouping the other elements in that document. Every XML document has one high-level element, known as the root element. Here is a short example with three elements:

<root-element>
  <sub-element first="true">
    character content
  </sub-element>
  <sub-element/>
</root-element>

The structure of an XML document is hierarchical; this example shows a root element that contains two subelements. The first subelement contains character content, and the second subelement is a singleton with no content. Elements that contain something have an opening tag and a closing tag. The closing tag matches the opening tag but has a forward slash (/) immediately after the less-than sign (<). A singleton, like the second subelement, can use a shorthand notation that places the forward slash before the closing greater-than sign (>).
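
To see the hierarchy the way a program sees it, here is a small sketch in Python using the standard library's ElementTree module. The function name describe is an assumption made for this example; the document is the three-element example shown above.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """\
<root-element>
  <sub-element first="true">
    character content
  </sub-element>
  <sub-element/>
</root-element>
"""


def describe(element, depth=0):
    """Return one line per element: indentation, tag, attributes, text."""
    text = (element.text or "").strip()
    lines = [f"{'  ' * depth}{element.tag} {element.attrib} {text!r}"]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
print("\n".join(describe(root)))
```

The output has one line per element, indented by depth, which makes the parent and child relationships easy to see.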

Attributes are associated with an element and supply additional information about it. In the previous example, the first subelement has a first attribute with the value "true." In many cases, you have to decide between creating a new element or adding an attribute to an existing element. When deciding, ask yourself the following questions:

  1. Is it common to include this item in output?
  2. Does this item contain structure; in other words, can it be broken down into subcomponents, such as month, day, and year?
  3. Can an element contain more than one of these items?

In most cases, if you answer yes to any of these questions, you should use an element rather than an attribute. If you are not certain of the answer to one of these questions, it is usually better to go with an element, because elements offer more flexibility.
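
The second and third questions are easier to judge with a concrete case. The sketch below, in Python, stores a shipping date once as an attribute and once as elements. The order, shipped, month, day, and year names are hypothetical and are not part of any example in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; the element and attribute names are made up.
AS_ATTRIBUTE = '<order shipped="2002-05-09"/>'
AS_ELEMENTS = """\
<order>
  <shipped><month>5</month><day>9</day><year>2002</year></shipped>
  <shipped><month>5</month><day>16</day><year>2002</year></shipped>
</order>
"""

# An attribute is one flat string, so the program has to split it itself,
# and an element can carry only one attribute with a given name.
flat = ET.fromstring(AS_ATTRIBUTE).get("shipped")
year, month, day = (int(part) for part in flat.split("-"))

# Elements expose the subcomponents directly and can repeat.
shipments = [
    {field.tag: int(field.text) for field in shipped}
    for shipped in ET.fromstring(AS_ELEMENTS).findall("shipped")
]

print((year, month, day))
print(shipments)
```

Because the date has structure and an order might ship more than once, the element form needs no extra parsing logic in the application.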

An XML document can include an optional XML declaration that specifies the version of XML used, as well as some other attributes. If a document includes an XML declaration, it must be the first thing found in the document, starting in the first position of the first line. It is good practice to include an XML declaration. Here is a typical XML declaration:

<?xml version="1.0" encoding="ISO-8859-1" ?>

This XML declaration tells an XML processor the version of XML and how the document is encoded. Version 1.0 is the only XML specification published, and because XML is so flexible, it may end up being the last. If encoding is not specified, most parsers will first assume UTF-8, a variable-length encoding scheme that uses one to four bytes per character and encodes the 128 characters of the ASCII character set exactly as ASCII does. Because of this, any file that contains only ASCII text is a valid UTF-8 file. On the iSeries, which does not actually support varying-length Unicode character sets like UTF-8, it is a good idea to include the encoding declaration.

Computers convert everything they deal with into a stream of bits. Many different schemes associate those bits with the characters we recognize as letters and numbers. The rules used to associate a particular set of bits with a recognizable character are an encoding scheme. Hundreds of these encoding schemes are in use today. On the iSeries, encoding schemes specify the rules that are followed when code points are assigned to characters in a particular code page and are part of the iSeries' Character Data Representation Architecture (CDRA).

Some of the most commonly used encoding schemes are defined by the International Organization for Standardization. The ISO 8859 standard defines several 8-bit character sets; each uses the 7-bit ASCII character set as a base and adds language-specific code points in the extra space that 8 bits afford. On the iSeries, the ISO-8859 character sets are supported with various code pages; for example, ISO-8859-1, also known as Latin-1, works with a code page of 819 and supports the majority of characters used in the United States. You can also use fixed-length UCS-2 Unicode by specifying UCS-2 for the encoding and a code page of 13488.
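
The effect of these encoding schemes shows up in the bytes themselves. The following sketch uses Python rather than anything on the iSeries. The sample strings are invented, and Python's utf-16-be codec stands in for UCS-2, which it matches for ordinary characters like these.

```python
import xml.etree.ElementTree as ET

ascii_text = "part"
accented = "caf\u00e9"  # "café"; the last letter is outside 7-bit ASCII

# ASCII-only text produces identical bytes in ASCII, ISO-8859-1, and UTF-8.
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("iso-8859-1")
    == ascii_text.encode("utf-8")
)

# Beyond ASCII the schemes diverge: ISO-8859-1 uses one byte per character,
# UTF-8 needs two bytes for this letter, and the fixed-length form uses two
# bytes for every character.
latin1_bytes = accented.encode("iso-8859-1")
utf8_bytes = accented.encode("utf-8")
ucs2_bytes = accented.encode("utf-16-be")


def parse_name(data):
    """Parse a <name> document from bytes; return its text or None on error."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\xe9</name>'
undeclared = b"<name>caf\xe9</name>"  # same bytes; the parser assumes UTF-8

print(same_bytes, len(latin1_bytes), len(utf8_bytes), len(ucs2_bytes))
print(parse_name(declared), parse_name(undeclared))
```

The document with the encoding declaration parses correctly, while the same bytes without it are rejected because they are not valid UTF-8. That is the practical reason to include the declaration.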

Comments in an XML document begin with <!-- and end with -->. The purpose of a comment is to add descriptive information. Comments should not be used to add proprietary extensions. You can also use comments to prevent XML markup from being recognized. The following lines show a comment:

<!-- This embedded text and tag is 
        not processed <element> -->
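
As a quick check that markup inside a comment really is ignored, the sketch below places the comment above inside a small document and lists the elements the parser reports. It uses Python's standard ElementTree module, which discards comments by default; the surrounding root-element and sub-element tags are borrowed from the earlier example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """\
<root-element>
  <!-- This embedded text and tag is
          not processed <element> -->
  <sub-element/>
</root-element>
"""

root = ET.fromstring(DOCUMENT)
tags = [element.tag for element in root.iter()]
print(tags)
```

The tag named element never appears in the list, and its missing closing tag does not make the document ill-formed.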

That's it; if you follow the preceding rules, you will have well-formed XML documents that any XML parser will process. The next step is to include a DTD or an XML schema definition, either of which adds custom rules for XML documents.

To DTD or Not to DTD?

When an XML document is valid, it includes a document type declaration, which identifies a document type definition (DTD) that defines rules the XML document must follow. The DTD lists the elements, attributes, and entities that the XML document uses and the relationships between them. In order to be valid, the document uses only those items described by the DTD; otherwise, it is invalid.

Even when an XML document is valid, some flexibility is allowed. For example, a DTD does not specify the values or type of data that an element contains. You can’t tell a DTD to check numeric values, the length of a value, or other such rules. A DTD also does not tell you what the root element is, and it cannot require an exact number of instances of an element beyond zero, one, or many. A DTD is your first line of defense; further checking must be done by an application or an XML schema definition.

A DTD is not required for XML documents. One of the first questions you have to answer is whether you even need a DTD. In many cases, a DTD is a requirement imposed upon your XML document. For example, XHTML documents need to specify one of several DTDs. Even if a DTD is not required, there are several reasons why you might want to use one:

  • DTDs provide a clear definition of what your applications expect in XML documents.
  • For complex documents, a DTD helps an XML author determine relationships between markup and other rules, like sequencing.

In most cases, if you are working with more than a few documents that share a common structure, you should create a DTD. In some cases, you have to use a DTD to support certain features, like the inclusion of non-XML content in an XML document. An XML author uses a DTD to understand what is expected and allowed in an XML document, and some XML editors allow you to specify a DTD. Those editors show what elements are valid and help prevent mistakes. In some cases, however, it is not worth the effort to create a DTD.

Creating a DTD

A DTD is a special kind of markup that is not actually XML. DTDs have been in use for more than 30 years and are part of the Standard Generalized Markup Language (SGML), the parent of XML and HTML. Unlike SGML documents, which have to use a DTD, XML imposes enough rules that a DTD is not required.

The DTD in the following example is about as simple as they come, but goes a long way toward describing the allowed content of an XML document. This DTD describes the elements in a bill of materials (BOM) XML document. This is a very simple example, and, in actual use, a BOM DTD would most likely be more complex and support additional elements.

<!ELEMENT product (name, part+, instruction+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT part (quantity, description?)>
<!ELEMENT quantity (#PCDATA)>
<!ATTLIST quantity uom CDATA #REQUIRED>
<!ELEMENT description (#PCDATA)>
<!ELEMENT instruction (#PCDATA)>

This DTD would probably be stored as a separate document referred to by an XML BOM document. However, it is also possible to define a DTD like this within the XML document that it describes, inside the document type declaration. The extension for a DTD file is usually .dtd.

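Every line of this DTD except one is an element declaration. An element declaration defines which elements you can use, the children of those elements, and the type of data those elements contain. The exception is the <!ATTLIST line, an attribute-list declaration, which declares the allowed attributes; here it gives the quantity element a required uom attribute that contains character data.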

A DTD defines the hierarchy of an XML document. The top-most element declaration should be the root element, although this is not a requirement. The BOM DTD starts with the root element declaration, which is "product." From the product element declaration, you can see it contains a name element, part elements, and instruction elements. This tells you that these three elements are all that are valid within a product element.
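
The hierarchy implied by the BOM DTD can be drawn out with a few lines of code. The following Python sketch uses plain regular expressions rather than a real DTD parser, so it handles only simple declarations like these. The function names children and outline are assumptions made for this example.

```python
import re

BOM_DTD = """\
<!ELEMENT product (name, part+, instruction+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT part (quantity, description?)>
<!ELEMENT quantity (#PCDATA)>
<!ATTLIST quantity uom CDATA #REQUIRED>
<!ELEMENT description (#PCDATA)>
<!ELEMENT instruction (#PCDATA)>
"""

# Map each declared element name to the text of its content model.
declarations = dict(re.findall(r"<!ELEMENT\s+(\S+)\s+(.*?)>", BOM_DTD))

KEYWORDS = {"PCDATA", "ANY", "EMPTY"}


def children(model):
    """Return the child element names mentioned in a content model."""
    return [n for n in re.findall(r"[\w.-]+", model) if n not in KEYWORDS]


def outline(name, depth=0):
    """Return indented lines showing an element and everything below it.

    Assumes no element contains itself, which is true of this DTD.
    """
    lines = [f"{'  ' * depth}{name} {declarations[name]}"]
    for child in children(declarations[name]):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline("product")))
```

Starting from product, the outline lists all six elements, with quantity and description nested under part.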

An element declaration starts with the <!ELEMENT tag. This tag contains the name of the element and its content model. The name of the element can be any valid XML name. The content model describes what the element contains. The following table shows the types of content that XML allows:

Content Type    Usage                                 Type of Content Allowed
Text            <!ELEMENT name (#PCDATA)>             Parsed character data
Element(s)      <!ELEMENT name (child | child2)>      Contains only child elements, here either child or child2
ANY             <!ELEMENT name ANY>                   No content restrictions
EMPTY           <!ELEMENT name EMPTY>                 Cannot contain content
Mixed content   <!ELEMENT name (#PCDATA | child)*>    Contains a mixture of parsed character data and child element(s)

The name, quantity, description, and instruction elements do not specify other elements; they specify that they contain "#PCDATA," which stands for parsed character data. Parsed character data is raw text that may contain entity references--like &amp; or &apos;--but does not contain other markup. In this example, the name, quantity, description, and instruction elements contain only text and may not contain other elements.

Looking more closely at the product element definition, you will notice that the name, part, and instruction elements are separated by commas, and that the part and instruction elements are followed by a plus sign (+). The comma ensures that the elements appear in the order specified. The plus signs indicate that product must have one or more part and instruction elements. The name element, which has no special symbol, must appear once and only once. A question mark (?) following an element, as with description in the part declaration, indicates that the element is optional and appears at most once.

The following table describes the symbols used in a DTD:

Symbol    Usage                                          Meaning
( and )   <!ELEMENT name (child, (child1 | child2)+)>    Parentheses group sequences and choices
?         <!ELEMENT name (child?)>                       Zero or one child element
*         <!ELEMENT name (child*)>                       Zero or more child elements
+         <!ELEMENT name (child+)>                       One or more child elements
(none)    <!ELEMENT name (child)>                        When ?, *, or + is not specified, a child element must appear exactly once
,         <!ELEMENT name (child1, child2)>               Child elements must appear in the order listed
|         <!ELEMENT name (child1 | child2)>              Either child1 or child2 appears, but not both
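
The symbols in this table behave much like the operators of a regular expression, which suggests a simple way to experiment with them. The sketch below, in Python, translates an element-only content model into a regular expression and tests a list of child element names against it. It illustrates the notation and is not how a validating parser such as Xerces is implemented. It does not handle #PCDATA, ANY, or EMPTY, and the function names and the sample product document are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate an element-only content model into a regular expression."""
    parts = []
    for token in re.findall(r"[\w.-]+|[()|?*+,]", model):
        if token == ",":
            continue  # a sequence is simply one pattern after another
        if token == "(":
            parts.append("(?:")  # group without capturing
        elif token in {")", "|", "?", "*", "+"}:
            parts.append(token)  # same meaning in both notations
        else:
            # An element name, stored with a trailing space as a separator.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Return True if the sequence of child names satisfies the model."""
    sequence = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(sequence) is not None


PRODUCT_MODEL = "(name, part+, instruction+)"

# Hypothetical BOM document, invented for this example.
product = ET.fromstring(
    "<product><name>Bench</name>"
    "<part><quantity uom='EA'>4</quantity></part>"
    "<instruction>Attach the legs.</instruction></product>"
)

print(children_match(PRODUCT_MODEL, [child.tag for child in product]))
print(children_match(PRODUCT_MODEL, ["name", "instruction"]))
```

The first check passes because the children appear in the declared order. The second fails because the plus sign after part requires at least one part element.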

You can see that even a short DTD is useful when you are trying to make sure that an application will be capable of processing a particular XML document. An application that uses XML documents based on the BOM DTD needs to understand six elements. Knowing the hierarchical relationship of those elements reduces the number of possibilities that the application needs to handle.

Validating with a DTD

A parser uses a DTD to test the validity of an XML document. Not all parsers support the use of a DTD. Those parsers that do support a DTD are validating parsers. Most validating parsers allow you to turn off DTD validation. In most cases, DTDs are stored in a document file that is separate from the XML documents that reference the DTD. This allows you to use one DTD to validate many XML documents. An XML document may also reference more than one DTD, or may combine internal DTD definitions with external definitions.

A document type declaration is markup that associates a DTD with an XML document. The document type declaration may include the DTD or refer to an external DTD. The document type declaration also identifies the root element of the XML document. A typical document type declaration looks like this:

<?xml version="1.0" encoding="ISO-8859-1" 
      standalone="no"?>
<!DOCTYPE product SYSTEM "bom.dtd">

The document type declaration, shown on the last line of the example, follows the XML declaration and establishes that the DTD for this XML document is contained in the external file "bom.dtd." The <!DOCTYPE tag starts the document type declaration; following that, the root element--"product," in this case--is identified. Frequently, a document type declaration uses a URI to refer to a DTD. With a URI this same reference might look like this:
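
A parser hands the pieces of the document type declaration to an application as separate values. As a sketch of that, the Python code below uses the standard library's expat parser, which reports the declaration but does not fetch or apply bom.dtd. The function name read_doctype is an assumption, and the document body is reduced to an empty root element because only the prolog matters here.

```python
import xml.parsers.expat

DOCUMENT = b"""<?xml version="1.0" encoding="ISO-8859-1"
      standalone="no"?>
<!DOCTYPE product SYSTEM "bom.dtd">
<product/>
"""


def read_doctype(data):
    """Return (root element name, system identifier, public identifier)."""
    found = {}

    def start_doctype(name, system_id, public_id, has_internal_subset):
        found["doctype"] = (name, system_id, public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = start_doctype
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return found.get("doctype")


print(read_doctype(DOCUMENT))
```

The root element name and the location of the DTD come back as distinct values, and the public identifier is empty because this declaration uses SYSTEM.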

<!DOCTYPE product SYSTEM 
          "http://www.iseriesxml.com/xml/dtd/bom.dtd">

When you use an external DTD subset, you should specify standalone="no" in the XML declaration. There are rare cases when you do not have to do this, but since standalone="no" is always permitted, the simplicity of this rule outweighs the work involved in determining if you can get away without it.

It is not a requirement for the DTD to be external. Instead of an external reference, the prolog may contain the DTD. This example shows an internal DTD for a Frequently Asked Questions XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE faq [
  <!ELEMENT faq (question, answer)+>
  <!ELEMENT question (#PCDATA)>
  <!ELEMENT answer (#PCDATA)>
]>
<faq>
  <question>How do you validate 
               an XML document?</question>
  <answer>java sax.Counter -v /yourdocument</answer>
</faq>

Using an internal DTD allows an XML document to be self-contained but prevents you from reusing the DTD. An internal DTD is most useful when you need to extend an external DTD to support a particular document. An XML document can have internal declarations and also reference an external DTD through a SYSTEM or PUBLIC identifier. When you mix internal and external DTDs, the part that is contained within the XML document is the internal DTD subset, and the part brought in from an external document is the external DTD subset.
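
Because an internal DTD subset travels with the document, even a non-validating parser reads it. The following Python sketch uses the standard library's expat parser to list the element declarations it finds. The function name declared_elements is an assumption, and the document is a copy of the FAQ example with the answer element deliberately left out.

```python
import xml.parsers.expat

# Well-formed, but not valid: the DTD requires an answer after each question.
FAQ_WITHOUT_ANSWER = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE faq [
  <!ELEMENT faq (question, answer)+>
  <!ELEMENT question (#PCDATA)>
  <!ELEMENT answer (#PCDATA)>
]>
<faq>
  <question>How do you validate an XML document?</question>
</faq>
"""


def declared_elements(data):
    """Parse data and return the element names declared in its DTD subset."""
    names = []

    def element_decl(name, model):
        names.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = element_decl
    parser.Parse(data, True)  # raises ExpatError only for well-formedness errors
    return names


print(declared_elements(FAQ_WITHOUT_ANSWER))
```

The parser reports all three declarations and still accepts the document without complaint, which is the difference between a non-validating parser and a validating one.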

Validation with a Parser

There are a lot of ways to validate a document. Most Web browsers and XML editors will validate an XML document that references a DTD. On your iSeries system, which does not come equipped with an XML editor or a built-in Web browser, you can use a mapped drive or copy files to your PC workstation for validation. Another technique is to use a validating parser.

Several validating parsers will run on the iSeries, including IBM's alphaWorks XML Interface for RPG. For the latest and greatest support I recommend that you look at the Apache Software Foundation's XML tools. The Xerces Java 2 parser works great on the iSeries and is available from the Apache XML Project Web site.

To validate an XML document using Xerces Java 2, first download the latest binary release to your PC. Open an FTP session to your iSeries by selecting Start, then Run, on your PC, and entering FTP youriseriessystem. Run the following FTP commands to upload the downloaded zip file, setting your directories appropriately:

bin
cd /tmp
put c:\temp\xerces-j-bin.2.0.1.zip
quit

Once you have the zip file uploaded to your iSeries system, start a Qshell session from a command line using the QSH command and run the following shell commands:

cd /
jar -xf /tmp/xerces-j-bin.2.0.1.zip

Exit Qshell and compile the Xerces JAR files by submitting the following command:

SBMJOB CMD(QSH CMD('
  for jar in $(find /xerces-2_0_1 -name ''*.jar''); 
  do system "CRTJVAPGM CLSF(''"$jar"'') OPTIMIZE(40)";done')) 
  JOB(COMPILE) JOBQ(QSYSNOMAX)

Now set your Java class path using the Add Environment Variable (ADDENVVAR) command, like this:

ADDENVVAR ENVVAR(CLASSPATH) 
  VALUE('/xerces-2_0_1/xercesImpl.jar:
    /xerces-2_0_1/xercesSamples.jar:
    /xerces-2_0_1/xmlParserAPIs.jar')

You are now ready to validate an XML document. Start by creating an empty document with the appropriate code page. Start Qshell and run the following:

touch -C 00819 /tmp/faq.xml

Exit Qshell and run Edit File (EDTF) from a command line:

EDTF STMF('/tmp/faq.xml')

Insert 10 blank lines by keying I10 in the first line's CMD area. Copy the FAQ XML example from this page into your edit session and press F3 twice to save and exit.

Now start Qshell again and run the sax.Counter sample. Your Qshell session should look like:

java sax.Counter -v /tmp/faq.xml
/tmp/faq.xml: 210 ms (3 elems, 0 attrs, 7 spaces, 69 chars)
$

Now try experimenting with some of the other examples I provided or with your own XML documents and DTDs. You can learn more about the sax.Counter class on the Apache XML Project's SAX Samples page.
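
If you are curious what a counting program of this kind does with the events a parser sends it, here is a rough analogue written in Python with the standard xml.sax module. It is not the source of the Xerces sample, and it does not validate. The class and function names are assumptions, and the document is a shortened copy of the FAQ example.

```python
import xml.sax

FAQ = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE faq [
  <!ELEMENT faq (question, answer)+>
  <!ELEMENT question (#PCDATA)>
  <!ELEMENT answer (#PCDATA)>
]>
<faq>
  <question>How do you validate an XML document?</question>
  <answer>Use a validating parser.</answer>
</faq>
"""


class ElementCounter(xml.sax.ContentHandler):
    """Tally elements and attributes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.elems = 0
        self.attrs = 0

    def startElement(self, name, attrs):
        # Called once for every opening (or singleton) tag.
        self.elems += 1
        self.attrs += len(attrs)


def count(data):
    """Parse data and return (element count, attribute count)."""
    handler = ElementCounter()
    xml.sax.parseString(data, handler)
    return handler.elems, handler.attrs


print(count(FAQ))
```

The element and attribute totals match the 3 elems and 0 attrs in the Qshell output above. The space and character counts are left out because this parser does not separate ignorable white space from other text.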

Well Informed About Valid

Unlike HTML documents, XML documents must conform to a minimal structural standard. This requirement reduces the amount of work that applications have to do when interpreting XML documents. Another of XML's features is its flexibility. This flexibility can also be a problem; schemas help convey what is valid and simplify the task of creating new XML documents. DTDs are the most common form of schema in use today. A DTD helps verify the structure of an XML document and provides minimal content checking.

David Morris is a software architect at Plum Creek Timber Company and started the iSeries-toolkit open-source project. He can be contacted by e-mail at dmmorris@itjungle.com.

  Last Updated: 5/8/02
Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.