Encoding XML (Or HTML) From Within RPG
April 17, 2013 Bob Cozzi
RPG developers who jump to the web and CGI programming soon learn that a stream-based syntax requires the use of certain control characters. Unlike the native database, which uses structures and hidden attributes to control field size and starting and ending locations, HTML and XML rely on tags, an agreed-upon syntax for start and end delimiters. You may be familiar with Comma Separated Values and the use of both the comma and the double-quote as the delimiters for that type of file. XML and HTML use much more verbose values as their tags or delimiters.
Tags do double duty; they separate data and they group related data together. For example, the individual fields of data in an XML document are separated by tags, while a set of data (similar to a record in a database table) is grouped together with an outer set of tags. Something like this:
<address>
  <street>123 Main St.</street>
  <city>Anytown</city>
  <state>Illinois</state>
  <postcode>60639</postcode>
</address>
The tags are given user-specified names. In this example, the tags named street, city, state, and postcode separate the individual pieces of data, while the address tag groups the inner collection together.
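To see the separating and grouping roles in action, here is a minimal sketch. It is written in Python only because its standard library ships an XML parser and it runs anywhere; that choice, and the names ADDRESS_XML and fields, are assumptions for illustration and not part of the RPG examples in this article.

```python
import xml.etree.ElementTree as ET

ADDRESS_XML = """
<address>
  <street>123 Main St.</street>
  <city>Anytown</city>
  <state>Illinois</state>
  <postcode>60639</postcode>
</address>
"""

# The outer <address> tag groups the record; each child tag holds one field.
address = ET.fromstring(ADDRESS_XML.strip())
fields = {child.tag: child.text for child in address}

print(address.tag)
print(fields)
```

The outer tag behaves like a record name and the inner tags behave like field names, which is about as close as XML gets to a database row.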
HTML works the same way. A basic HTML document might look something like this:
<html>
  <head>
    <title>IT Jungle</title>
  </head>
  <body>
    <h1>Bob Cozzi's Website</h1>
    <p>Hello World! </p>
  </body>
</html>
Looks very similar to our XML address example, doesn’t it? The tags separate the components of an HTML page into the header and detail pieces.
But what happens if the data in my XML or HTML includes a left or right bracket (the less than or greater than symbol)? Those characters and a few others will cause a problem.
Let’s look at the XML example first. If the address street is “123 <G> Main St.” then the XML parser will think that the <G> is some kind of XML tag and look for the closing </G> tag. If it can’t find it, the parser will fail.
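The failure is easy to reproduce. The following is a small sketch in Python (an assumption made for convenience, since its standard library includes an XML parser); the helper name parses_ok is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses_ok(document: str) -> bool:
    """Return True if the document is well-formed XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# The unescaped <G> is read as an opening tag that never gets a matching </G>.
broken = "<street>123 <G> Main St.</street>"
clean = "<street>123 Main St.</street>"

print(parses_ok(broken))
print(parses_ok(clean))
```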
To accommodate the parser, XML and HTML have instituted escape codes. Escape codes come in two forms: the escape sequence and the symbolic escape code. The normal escape sequence begins with the two characters &# followed by the numeric ASCII code for the character, followed by a semicolon. For example, the escape sequence for the left bracket (a.k.a. the less than symbol) is:
&#60;
The numeric 60 is the ASCII code for the < symbol. Likewise, the right bracket (the greater than symbol) is ASCII 62, so its escape sequence is:
&#62;
If your XML data contains a < symbol, it must be translated into &#60; in the data portion of the XML document. In other words, this. . .
<street>123 <G> Main St. </street>
. . .needs to be translated to this. . .
<street>123 &#60;G&#62; Main St. </street>
Yes, it is that ugly and yes, it is required.
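Ugly or not, the parser turns the sequences back into the original characters, so nothing is lost. Here is a hedged sketch in Python (an assumption of convenience, not the article's RPG); numeric_escape is a hypothetical helper that builds the sequence from a character's numeric code.

```python
import xml.etree.ElementTree as ET

def numeric_escape(char: str) -> str:
    """Build the &#nn; escape sequence from a character's numeric code."""
    return f"&#{ord(char)};"

lt = numeric_escape("<")
gt = numeric_escape(">")

# Escaped document text, as it would travel over the wire.
escaped = f"<street>123 {lt}G{gt} Main St.</street>"
street = ET.fromstring(escaped)

print(escaped)
print(street.text)  # the parser restores the real < and > characters
```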
As a consequence of this escape sequence, the ampersand character also needs to be escaped. Otherwise the XML parser thinks you are starting an escape sequence, finds that what follows is not a valid one, and fails. To escape an ampersand, you could use the following:
&#38;
But since the ampersand is so common, a symbolic escape code is available:
&amp;
The letters &amp; tell the XML parser that you have replaced a real & with &amp;. It has the same effect as coding &#38;; however, &amp; is easier to remember and is CCSID agnostic.
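Both forms of the ampersand escape decode to the same character, while a bare ampersand is rejected. A quick sketch, written in Python as an assumption of convenience; the company name is made-up sample data.

```python
import xml.etree.ElementTree as ET

# Numeric escape sequence and symbolic escape code for the same character.
numeric = ET.fromstring("<company>Smith &#38; Sons</company>").text
symbolic = ET.fromstring("<company>Smith &amp; Sons</company>").text
print(numeric == symbolic, symbolic)

# A bare ampersand looks like the start of an escape code, so parsing fails.
try:
    ET.fromstring("<company>Smith & Sons</company>")
    bare_ampersand_ok = True
except ET.ParseError:
    bare_ampersand_ok = False
print(bare_ampersand_ok)
```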
For most characters, XML supports the numeric &#nnn; escape sequence. That is, you can insert &#65; instead of the letter A if you really want to, but it is not necessary. You could in theory escape every single character in your XML data (the content between the tags), but why? Only five characters have special meaning to XML (such as the left and right brackets, also known as the less than and greater than symbols) and should always be escaped in your XML data. I also strongly recommend escaping a sixth character, the percent sign, whenever you send XML over HTTP. Here are those six symbols along with their escape sequences and, where XML defines one, the symbolic escape code:
Symbol   Escape sequence   Symbolic escape code
"        &#34;             &quot;
'        &#39;             &apos;
<        &#60;             &lt;
>        &#62;             &gt;
&        &#38;             &amp;
%        &#37;             (none)
The percent sign is included in this list because it can create a problem if you transfer XML over HTTP, where a percent sign can be mistaken for the start of a URL-encoded sequence such as %20. So be sure to escape it as well. It never hurts to escape, and with only six characters that need it in XML, it is not rocket science.
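As a rough illustration of the percent sign problem, the sketch below assumes the receiving side URL-decodes the payload, as happens with URL-encoded form data. That assumption, the choice of Python, and the sample note text are all mine and not from the RPG examples in this article.

```python
from urllib.parse import unquote

raw = "<note>Batch %41 passed</note>"
escaped = raw.replace("%", "&#37;")

# URL decoding reads %41 as an encoded letter A and silently changes the data.
print(unquote(raw))

# With the percent sign escaped, there is nothing left for URL decoding to misread.
print(unquote(escaped))
```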
Escape To RPG
When building the XML document in RPG, analyze the data before enclosing it in XML tags, and escape it. Fortunately in RPG IV at v7.1 it is incredibly simple to escape the data. In the example here, I use the v7.1 %SCANRPL built-in function. If you are not yet on IBM i v7.1 you can use the homegrown version named SCANRPL that I showed you in a previous article.
myData = %SCANRPL('&' : '&#38;' : myData);  // ampersand first: the escapes below contain one
myData = %SCANRPL('''' : '&#39;' : myData);
myData = %SCANRPL('"' : '&#34;' : myData);
myData = %SCANRPL('<' : '&#60;' : myData);
myData = %SCANRPL('>' : '&#62;' : myData);
myData = %SCANRPL('%' : '&#37;' : myData);
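After processing the above substitution values, the data stored in the MYDATA field is considered to be “escaped.” The ampersand has to be escaped before the other characters, because every escape sequence itself contains an ampersand that would otherwise be escaped a second time. Note that when escaping the apostrophe (which we often refer to as a quote), it must be doubled up. That is, a single quote must be doubled and then enclosed in quotes; therefore, four consecutive quotes or apostrophes are specified (in the second statement above) to identify the single character being replaced.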
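For readers who want to check the result away from the IBM i, here is a minimal sketch of the same six substitutions in Python. The language, the ESCAPES table, the escape_xml name, and the sample string are assumptions for illustration. The ampersand is handled first so that the ampersands inside the other escape sequences are not escaped again.

```python
import xml.etree.ElementTree as ET

# Ampersand comes first; every later escape sequence contains an ampersand.
ESCAPES = [
    ("&", "&#38;"),
    ("'", "&#39;"),
    ('"', "&#34;"),
    ("<", "&#60;"),
    (">", "&#62;"),
    ("%", "&#37;"),
]

def escape_xml(data: str) -> str:
    """Apply the six scan-and-replace steps in order."""
    for char, sequence in ESCAPES:
        data = data.replace(char, sequence)
    return data

raw = "5% off <G> at Bob's \"R&D\" shop"
escaped = escape_xml(raw)
document = f"<street>{escaped}</street>"

print(escaped)
print(ET.fromstring(document).text == raw)  # round trip restores the data
```

Escaping the ampersand last instead would turn an already escaped &#60; into &#38;#60;, which the receiving parser would decode back to the literal text &#60; rather than to a < symbol.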
If you escape your data before embedding it into XML, the receiving end will have a much easier time parsing the data and producing accurate results.