Validating XML with a Document Type Definition
by David Morris
In "Getting Started with XML," I introduced you to the Extensible Markup
Language and described a few of the rules that XML documents should follow. If you had a
chance to read that article, you know XML documents must be "well-formed," meaning they
must conform to XML's basic syntax and structure. Optionally, an XML document may be
"valid," which means the document conforms to the rules of a document type definition (DTD) or
an XML schema that defines the allowed structure and content.
One of the biggest problems with HTML is that the specification does not explicitly forbid
HTML processors, like browsers, from attempting to fix markup errors. Not wanting to be
unfriendly, most browsers are very tolerant of errors and blindly process error-ridden HTML
documents. The problem with this approach is that this automatic correction takes a lot of
processing time and does not always produce the intended result. Having learned from this
mistake, the authors of the XML specification explicitly forbade any sort of error correction.
The requirement that all XML documents be well-formed ensures that XML documents are
syntactically correct. They must have a single root element, balanced tags, and properly
nested elements. In addition, XML documents can specify a DTD or an XML schema that imposes
further rules. A DTD allows you to specify the elements that are valid in an XML document, their
order, attributes, and other relationships. Schemas support the same kind of validation as a DTD
and also allow you to specify the type of data and the values that elements may contain. Today,
DTDs are far more prevalent than XML schemas, so this article will concentrate on DTDs and
their use.
An XML parser is software that reads through an XML document, dividing it into individual
elements and attributes. The XML parser then passes the individual parts of an XML document
back to an application for processing. During the parsing process, if errors are found in an XML
document, processing stops and an error is reported. I'll show you how to use the Xerces XML
parser on your iSeries system to check an XML document.
Building Well-Formed XML Documents
All XML documents have to conform to some basic rules. An XML document that conforms to
the rules is well-formed; documents that do not follow the rules cannot be processed.
The rules that an XML document must follow to be well-formed are described in the World Wide
Web Consortium's XML 1.0 recommendation. The following list summarizes the most important
rules:
- If included, an XML declaration starts on line one.
- Every XML document must have one root element.
- All elements must have matching opening and closing tags.
- Elements must be properly nested.
- XML is case-sensitive.
- White space outside of elements is ignored.
- Attribute values must be enclosed in single (') or double (") quotation marks.
- Use &amp; for ampersand (&) and &lt; for less than (<) inside of
markup.
If you follow these rules, you will rarely have a problem. Most of the other rules have to do with
limitations, character conversions, and seldom-used aspects of XML.
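The rules above can be exercised with a few lines of Java. This is a minimal sketch that assumes the XML parser built into the JDK rather than a separately installed Xerces; the class name WellFormedCheck and the sample strings are made up for illustration. It simply reports whether the parser accepts or rejects a piece of text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
public class WellFormedCheck {

    // Returns true when the text parses without a fatal (well-formedness) error.
    public static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<a><b/></a>")); // balanced and nested
        System.out.println(isWellFormed("<a><b></a>"));  // <b> is never closed
        System.out.println(isWellFormed("<a></A>"));     // tag names differ in case
        System.out.println(isWellFormed("<a/><b/>"));    // two root elements
    }
}
```

Because the parser stops at the first fatal error, the method only tells you that a document is broken, not everything that is wrong with it.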
Elements are the building blocks of XML documents. Elements describe the logical structure of
an XML document. Matching opening and closing tags enclose elements that may contain content
and structure. Content is the text or binary information within an XML document. Elements that
support structure make it easier for programs to understand a document by providing a way of
grouping the other elements in that document. Every XML document has one high-level element,
known as the root element. Here is a short example with three elements:
<root-element>
  <sub-element first="true">
    character content
  </sub-element>
  <sub-element/>
</root-element>
The structure of an XML document is hierarchical; this example shows a root element that
contains two subelements. The first subelement contains character content, and the second
subelement is a singleton with no content. Elements that contain something have an opening tag
and a closing tag. The closing tag matches the opening tag but has a forward slash (/) before
the element name. A singleton, like the second subelement, can use a shorthand notation that
adds the forward slash before the closing greater-than sign (>).
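To see that hierarchy from a program's point of view, the following sketch loads the three-element example into a DOM tree and walks it. It assumes the JDK's built-in parser; the class name ElementWalk is hypothetical, and the sample is written on fewer lines than the listing above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; the sample mirrors the article's example.
public class ElementWalk {

    public static final String SAMPLE =
        "<root-element>"
      + "<sub-element first=\"true\">character content</sub-element>"
      + "<sub-element/>"
      + "</root-element>";

    // Parses the text into a DOM document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Element root = parse(SAMPLE).getDocumentElement();
        System.out.println("root: " + root.getTagName());

        // List each subelement with its "first" attribute and text content.
        NodeList subs = root.getElementsByTagName("sub-element");
        for (int i = 0; i < subs.getLength(); i++) {
            Element sub = (Element) subs.item(i);
            System.out.println("sub-element " + i
                + " first=\"" + sub.getAttribute("first") + "\""
                + " text=\"" + sub.getTextContent().trim() + "\"");
        }
    }
}
```

Note that getAttribute returns an empty string for the singleton, which has no first attribute.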
Attributes are associated with an element and add additional information. In the previous
example, the first subelement has a first attribute with the value "true." In many cases, you have
to decide between creating a new element or adding an attribute to an existing element. When
deciding, ask yourself the following questions:
- Is it common to include this item in output?
- Does this item contain structure; in other words, can it be broken down into subcomponents,
such as month, day, and year?
- Can an element contain more than one of these items?
In most cases, if you answer yes to any of these questions, you should use an element rather than
an attribute. If you are not certain of the answer to one of these questions, it is usually better to go
with an element, because elements offer more flexibility.
An XML document can include an optional XML declaration that specifies the version of XML
used, as well as some other attributes. If a document includes an XML declaration, it must be the
first thing found in the document, starting in the first position of the first line. It is good practice
to include an XML declaration. Here is a typical XML declaration:
<?xml version="1.0" encoding="ISO-8859-1" ?>
This XML declaration tells an XML processor the version of XML and how the document is
encoded. At the time of this writing, version 1.0 is the only XML specification published, and
because XML is so flexible, it may end up being the last. If encoding is not specified, most
parsers will first assume UTF-8, which is a variable-length encoding scheme that uses one to
four bytes per character and encodes the 128 ASCII characters exactly as ASCII does. Because of
this, any file that contains only ASCII text is a valid UTF-8 file. On the iSeries, which does
not actually support varying-length Unicode character sets like UTF-8, it is a good idea to
include the encoding declaration.
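The relationship between ASCII, UTF-8, and a single-byte encoding is easy to check. The sketch below is illustrative only; the class name EncodingDemo and the sample strings are assumptions. It compares the bytes produced for plain ASCII text and for a string with one accented character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class for illustration.
public class EncodingDemo {

    // True when the text encodes to identical bytes in ASCII and UTF-8.
    public static boolean sameInAsciiAndUtf8(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.US_ASCII),
                             text.getBytes(StandardCharsets.UTF_8));
    }

    // Number of bytes the text occupies in the given encoding.
    public static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String ascii = "<name>Widget</name>";
        String accented = "caf\u00e9"; // four characters, the last is e-acute

        System.out.println(sameInAsciiAndUtf8(ascii));
        System.out.println(encodedLength(accented, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(accented, StandardCharsets.UTF_8));
    }
}
```

The accented character takes one byte in ISO-8859-1 and two in UTF-8, which is why a parser has to know the encoding before it can read anything beyond plain ASCII.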
Computers convert everything they deal with into a stream of bits. Many different schemes
associate those bits with the characters we recognize as letters and numbers. The rules used to
associate a particular set of bits with a recognizable character are an encoding scheme. Hundreds
of these encoding schemes are in use today. On the iSeries, encoding schemes specify the rules
that are followed when code points are assigned to characters in a particular code page and are
part of the iSeries' Character Data Representation Architecture (CDRA).
Some of the most commonly used encoding schemes are defined by the International Organization for
Standardization. The ISO 8859 standard defines several 8-bit character sets; each uses the 7-bit
ASCII character set as a base and adds language-specific code points in the extra range that the
eighth bit affords. On the iSeries, the ISO-8859 character sets are supported with various code
pages; for example, ISO-8859-1, also known as Latin-1, works with a code page of 819 and supports
the majority of characters used in the United States. You can also use fixed-length UCS-2 Unicode
by specifying UCS-2 for the encoding and a code page of 13488.
Comments in an XML document are surrounded by <!-- and -->. The purpose of a comment
is to add descriptive information. Comments should not be used to add proprietary
extensions. You can use comments to prevent XML markup from being recognized. The
following lines show a comment:
<!-- This embedded text and tag is
not processed <element> -->
That's it; if you follow the preceding rules, you will have well-formed XML documents that
any XML parser will process. The next step is to include a DTD or an XML schema definition,
either of which adds custom rules for XML documents.
To DTD or Not to DTD?
When an XML document is valid, it includes a document type declaration, which identifies a
document type definition (DTD) that defines rules the XML document must follow. The DTD
lists the elements, attributes, and entities that the XML document uses and the relationships
between them. In order to be valid, the document uses only those items described by the DTD;
otherwise, it is invalid.
Even when an XML document is valid, some flexibility is allowed. For example, a DTD does not
specify the values or type of data that an element contains. You can't tell a DTD to check
numeric values, the length of a value, or other such rules. A DTD also does not tell you what the
root element is or exactly how many instances of an element may appear; it can only distinguish
between optional, single, and repeating elements. A DTD is your first line of defense; further
checking must be done using an application or an XML schema definition.
A DTD is not required for XML documents. One of the first questions you have to answer is
whether you even need a DTD. In many cases, a DTD is a requirement imposed upon your XML
document. For example, XHTML documents need to specify one of several DTDs. Even if a
DTD is not required, there are several reasons why you might want to use one:
- DTDs provide a clear definition of what your applications expect in XML documents.
- For complex documents, a DTD helps an XML author determine relationships between
markup and other rules, like sequencing.
In most cases, if you are working with more than a few documents that share a common structure,
you should create a DTD. In some cases, you have to use a DTD to support certain features, like
the inclusion of non-XML content in XML documents. An XML author uses a DTD to
understand what is expected and allowed in an XML document, and some XML editors allow you
to specify a DTD. Those editors show what elements are valid and help prevent mistakes. In some
cases, however, it is not worth the effort to create a DTD.
Creating a DTD
A DTD is a special kind of markup that is not actually XML. DTDs have been in use for more
than 30 years and are part of the Standard Generalized Markup Language (SGML), the parent of
XML and HTML. Unlike SGML documents, which have to use a DTD, XML imposes enough
rules that a DTD is not required.
The DTD in the following example is about as simple as they come, but goes a long way toward
describing the allowed content of an XML document. This DTD describes the elements in a bill
of materials (BOM) XML document. This is a very simple example, and, in actual use, a BOM
DTD would most likely be more complex and support additional elements.
<!ELEMENT product (name, part+, instruction+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT part (quantity, description?)>
<!ELEMENT quantity (#PCDATA)>
<!ATTLIST quantity uom CDATA #REQUIRED>
<!ELEMENT description (#PCDATA)>
<!ELEMENT instruction (#PCDATA)>
This DTD would probably be stored as a separate document referred to by an XML BOM
document. However, it is also possible to define a DTD like this within the XML document that it
describes, inside the document type declaration. The extension for a DTD file is usually
.dtd.
All but one of the lines in this DTD are element declarations. An element declaration defines an
element you can use, the children of that element, and the type of data the element contains.
The remaining line, which starts with <!ATTLIST, is an attribute-list declaration; it requires a
uom attribute on the quantity element.
A DTD defines the hierarchy of an XML document. The top-most element declaration should be
the root element, although this is not a requirement. The BOM DTD starts with the root element
declaration, which is "product." From the product element declaration, you can see it contains a
name element, part elements, and instruction elements. This tells you that these three elements are
all that are valid within a product element.
An element declaration starts with the <!ELEMENT tag. This tag contains the name of the
element and its content model. The name of the element can be any valid XML name. The
content model describes what the element contains. The following table shows the types of
content that XML allows:
| Content Type | Usage | Type of Content Allowed |
| Text | <!ELEMENT name (#PCDATA)> | Parsed character data |
| Element(s) | <!ELEMENT name (child | child2)> | Contains only child or child2 elements |
| ANY | <!ELEMENT name ANY> | No content restrictions |
| EMPTY | <!ELEMENT name EMPTY> | Cannot contain content |
| Mixed content | <!ELEMENT name (#PCDATA | child)*> | Contains a mixture of parsed character data and child element(s) |
The name, quantity, description, and instruction elements do not specify other elements; they
specify that they contain "#PCDATA," which stands for parsed character data. Parsed
character data is raw text that may contain entity references, like &amp; or &apos;, but does
not contain other markup. In this example, these four elements contain only text and may not
contain other elements.
Looking more closely at the product element definition, you will notice that the name, part, and
instruction elements are separated by commas, and that the part and instruction elements are
followed by a plus (+) sign. The comma ensures that the elements appear in the order specified.
The plus signs indicate that product must have one or more part and instruction elements. The
name element, which has no special symbol, must appear once and only once. A question mark (?)
following an element name, as in the description element within part, indicates that the element
is optional and, if present, appears only once.
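The effect of the commas and plus signs can be seen by validating a few documents against the BOM DTD. This is a sketch only: it assumes the JDK's built-in validating parser, places the DTD in an internal subset so the example is self-contained, and uses a made-up class name (BomValidator) and made-up product data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class; the DTD text is the BOM DTD shown earlier.
public class BomValidator {

    public static final String DOCTYPE =
        "<!DOCTYPE product [\n"
      + "<!ELEMENT product (name, part+, instruction+)>\n"
      + "<!ELEMENT name (#PCDATA)>\n"
      + "<!ELEMENT part (quantity, description?)>\n"
      + "<!ELEMENT quantity (#PCDATA)>\n"
      + "<!ATTLIST quantity uom CDATA #REQUIRED>\n"
      + "<!ELEMENT description (#PCDATA)>\n"
      + "<!ELEMENT instruction (#PCDATA)>\n"
      + "]>\n";

    // Returns true when the product markup is valid against the BOM DTD.
    public static boolean isValid(String productXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Validity problems arrive through error(), so rethrow them.
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) throws SAXException { throw e; }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            builder.parse(new InputSource(new StringReader(DOCTYPE + productXml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String part = "<part><quantity uom=\"EA\">4</quantity></part>";
        String name = "<name>Bench</name>";
        String instruction = "<instruction>Attach the legs.</instruction>";

        // Correct order, then a missing instruction, then name out of order.
        System.out.println(isValid("<product>" + name + part + instruction + "</product>"));
        System.out.println(isValid("<product>" + name + part + "</product>"));
        System.out.println(isValid("<product>" + part + name + instruction + "</product>"));
    }
}
```

Only the first document satisfies the content model; the other two are well-formed but not valid, which is exactly the distinction a validating parser adds.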
The following table describes the symbols used in a DTD:
| Symbol | Usage | Meaning |
| ( and ) | <!ELEMENT name (child, (child1 | child2)+)> | Parentheses group sequences and choices |
| ? | <!ELEMENT name (child?)> | Zero or one child element |
| * | <!ELEMENT name (child*)> | Zero or more child elements |
| + | <!ELEMENT name (child+)> | One or more child elements |
| (none) | <!ELEMENT name (child)> | When ?, *, or + is not specified, the child element must appear exactly once |
| , | <!ELEMENT name (child1, child2)> | Child elements must appear in the order listed |
| | | <!ELEMENT name (child1 | child2)> | Either child1 or child2 appears, but not both |
You can see that even a short DTD is useful when you are trying to make sure that an application
will be capable of processing a particular XML document. An application that uses XML
documents based on the BOM DTD needs to understand six elements. Knowing the hierarchical
relationship of those elements reduces the number of possibilities that the application needs to
handle.
Validating with a DTD
A parser uses a DTD to test the validity of an XML document. Not all parsers support the use of a
DTD. Those parsers that do support a DTD are validating parsers. Most validating parsers allow
you to turn off DTD validation. In most cases, DTDs are stored in a document file that is separate
from the XML documents that reference the DTD. This allows you to use one DTD to validate
many XML documents. An XML document may also reference more than one DTD, or may
combine internal DTD definitions with external definitions.
A document type declaration is markup that associates a DTD with an XML document. The
document type declaration may include the DTD or refer to an external DTD. The document type
declaration also identifies the root element of the XML document. A typical document type
declaration looks like this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE product SYSTEM "bom.dtd">
The document type declaration, shown on the second line, follows the XML declaration and
establishes that the DTD for this XML document is contained in the external file "bom.dtd." The
<!DOCTYPE tag starts the document type declaration; following that, the root element--
"product," in this case--is identified. Frequently, a document type declaration uses a URI to refer
to a DTD. With a URI this same reference might look like this:
<!DOCTYPE product SYSTEM
  "http://www.iseriesxml.com/xml/dtd/bom.dtd">
When you use an external DTD subset, you should specify standalone="no" in the XML
declaration. There are rare cases when you do not have to do this, but since
standalone="no" is always permitted, the simplicity of this rule outweighs the work
involved in determining if you can get away without it.
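A program that validates against an external DTD subset needs a way to find the DTD. One option, sketched below under the assumption that you are using the JDK's built-in parser, is an entity resolver that supplies the DTD text whenever the parser asks for bom.dtd. The class name ExternalDtdValidator, the in-memory copy of the DTD, and the product data are all illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class; BOM_DTD stands in for the contents of bom.dtd.
public class ExternalDtdValidator {

    public static final String BOM_DTD =
        "<!ELEMENT product (name, part+, instruction+)>\n"
      + "<!ELEMENT name (#PCDATA)>\n"
      + "<!ELEMENT part (quantity, description?)>\n"
      + "<!ELEMENT quantity (#PCDATA)>\n"
      + "<!ATTLIST quantity uom CDATA #REQUIRED>\n"
      + "<!ELEMENT description (#PCDATA)>\n"
      + "<!ELEMENT instruction (#PCDATA)>\n";

    public static final String PROLOG =
        "<?xml version=\"1.0\" standalone=\"no\"?>\n"
      + "<!DOCTYPE product SYSTEM \"bom.dtd\">\n";

    // Validates product markup against the external subset named in PROLOG.
    public static boolean isValid(String productXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Serve bom.dtd from memory; return null for anything else so the
        // parser falls back to its normal lookup.
        builder.setEntityResolver((publicId, systemId) ->
            systemId != null && systemId.endsWith("bom.dtd")
                ? new InputSource(new StringReader(BOM_DTD))
                : null);
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) throws SAXException { throw e; }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            builder.parse(new InputSource(new StringReader(PROLOG + productXml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<part><quantity uom=\"EA\">4</quantity></part>"
            + "<instruction>Attach the legs.</instruction>";
        System.out.println(isValid("<product><name>Bench</name>" + body + "</product>"));
        System.out.println(isValid("<product>" + body + "</product>")); // name is missing
    }
}
```

In a real application the resolver would more likely map the system identifier to a file or a cached copy, so that validation does not depend on a network connection.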
It is not a requirement for the DTD to be external. Instead of an external reference, the prolog
may contain the DTD. This example shows an internal DTD for a Frequently Asked Questions
XML document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE faq [
  <!ELEMENT faq (question, answer)+>
  <!ELEMENT question (#PCDATA)>
  <!ELEMENT answer (#PCDATA)>
]>
<faq>
  <question>How do you validate
  an XML document?</question>
  <answer>java sax.Counter -v /yourdocument</answer>
</faq>
Using an internal DTD allows an XML document to be self-contained but prevents you from
reusing the DTD. An internal DTD is useful, however, when you need to extend
an external DTD to support a particular document. An XML document can have internal
declarations and reference a SYSTEM or PUBLIC URI. When you mix internal and external
DTDs, the part that is contained within the XML document is the internal DTD subset, and the
part brought in from an external document is the external DTD subset.
Validation with a Parser
There are a lot of ways to validate a document. Most Web browsers and XML editors will
validate an XML document that references a DTD. On your iSeries system, which does not come
equipped with an XML editor or a built-in Web browser, you can use a mapped drive or copy
files to your PC workstation for validation. Another technique is to use a validating parser.
Several validating parsers will run on the iSeries, including IBM's alphaWorks XML Interface for RPG. For the
latest and greatest support I recommend that you look at the Apache Software Foundation's XML tools. The Xerces Java
2 parser works great on the iSeries and is available from the Apache XML Project Web site.
To validate an XML document using Xerces Java 2, first download the latest binary release to your PC. Open an FTP
session to your iSeries by selecting Start, then Run, on your PC, and entering
FTP youriseriessystem. Run the following FTP commands to upload the downloaded zip
file, setting your directories appropriately:
bin
cd /tmp
put c:\temp\xerces-j-bin.2.0.1.zip
quit
Once you have the zip file uploaded to your iSeries system, start a Qshell session from a
command line using the QSH command and run the following shell commands:
cd /
jar -xf /tmp/xerces-j-bin.2.0.1.zip
Exit Qshell and compile the Xerces JAR files by submitting the following command:
SBMJOB CMD(QSH CMD('
for jar in $(find /xerces-2_0_1 -name ''*.jar'');
do system "CRTJVAPGM CLSF(''"$jar"'') OPTIMIZE(40)";done'))
JOB(COMPILE) JOBQ(QSYSNOMAX)
Now set your Java class path using the Add Environment Variable (ADDENVVAR) command,
like this:
ADDENVVAR ENVVAR(CLASSPATH)
VALUE('/xerces-2_0_1/xercesImpl.jar:
/xerces-2_0_1/xercesSamples.jar:
/xerces-2_0_1/xmlParserAPIs.jar')
You are now ready to validate an XML document. Start by creating an empty document with the
appropriate code page. Start Qshell and run the following:
touch -C 00819 /tmp/faq.xml
Exit Qshell and run Edit File (EDTF) from a command line:
EDTF STMF('/tmp/faq.xml')
Insert 10 blank lines by keying I10 in the first line's CMD area. Copy the FAQ XML
example from this page into your edit session and press F3 twice to save and exit.
Now start Qshell again and run the sax.Counter sample. Your Qshell session should look
like:
java sax.Counter -v /tmp/faq.xml
/tmp/faq.xml: 210 ms (3 elems, 0 attrs, 7 spaces, 69 chars)
$
Now try experimenting with some of the other examples I provided or with your own XML
documents and DTDs. You can learn more about the sax.Counter class on the Apache
XML Project's SAX Samples page.
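If you want a feel for what a program like sax.Counter does, the following sketch counts elements and attributes with a SAX handler. It is not the Xerces sample's source; it assumes the JDK's built-in SAX parser, uses a hypothetical class name (ElementCounter), does not validate, and leaves out the timing, space, and character counts.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; a much-reduced imitation of an element counter.
public class ElementCounter extends DefaultHandler {

    // The FAQ example from this article, with its internal DTD.
    public static final String FAQ =
        "<!DOCTYPE faq [\n"
      + "<!ELEMENT faq (question, answer)+>\n"
      + "<!ELEMENT question (#PCDATA)>\n"
      + "<!ELEMENT answer (#PCDATA)>\n"
      + "]>\n"
      + "<faq>\n"
      + "<question>How do you validate an XML document?</question>\n"
      + "<answer>Use a validating parser.</answer>\n"
      + "</faq>";

    private int elements;
    private int attributes;

    // Called by the parser once for every opening tag.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        elements++;
        attributes += attrs.getLength();
    }

    // Returns {element count, attribute count} for the given document.
    public static int[] count(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ElementCounter counter = new ElementCounter();
        parser.parse(new InputSource(new StringReader(xml)), counter);
        return new int[] { counter.elements, counter.attributes };
    }

    public static void main(String[] args) throws Exception {
        int[] totals = count(FAQ);
        System.out.println(totals[0] + " elems, " + totals[1] + " attrs");
    }
}
```

A SAX handler like this never builds a tree in memory, so it suits large documents where you only need to react to elements as they go by.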
Well Informed About Valid
Unlike HTML documents, XML documents must conform to a minimal structural standard. This
requirement reduces the amount of work that applications have to do when interpreting XML
documents. Another one of XML's features is its flexibility. This flexibility can also be a
problem; schemas help convey what is valid and simplify the task of creating new XML documents.
DTDs are the most common form of schema in use today. A DTD helps verify the structure of an
XML document and provides minimal content checking.
David Morris is a software architect at Plum Creek Timber Company and started the iSeries-toolkit open-source project. He can
be contacted by e-mail at dmmorris@itjungle.com.