The Basics Of XML-SAX
June 14, 2016 Jon Paris
Note: The code accompanying this article is available for download here.
In XML-INTO And Optional Elements, I showed a reader how he could use XML-INTO to parse an XML document that effectively contained one of two completely different payloads. As I noted in that article, for this type of “does the document contain X” processing, XML-SAX can be a better choice than XML-INTO. That is the task that I’m going to demonstrate in this tip.
In his email, the reader mentioned that his original intent had been to simply identify the type of payload (Report or Event) and then to pass the document to the appropriate program to complete the processing. The basic structure of his document was like this:
<Message>
  <Payload>
    Actual payload data consisting of a <Report> or <Event> element.
    These are compound elements and each contains a completely
    different set of child elements
  </Payload>
</Message>
Before we begin looking at the code, a couple of general comments on XML-SAX are in order. Unlike XML-INTO, XML-SAX does not process the entire document in one big gulp. Rather it breaks the document into individual components and passes each one in turn to a handler subprocedure for processing. Actually, XML-INTO can also use a handler to process documents in smaller bites (or should that be sips?) but this definition will do for the sake of simplicity.
The individual components are known as events. Since <Event> is one of the element names that we will be searching for, this might be a tiny bit confusing in places, but hopefully you’ll get the idea. In my defense, I didn’t make the names up! The types of “events” that are passed include: beginning of document; start of element; element data; end of element; and so on. You must write a subprocedure to handle some or all of these events, and it is called repeatedly until either the document has been completely parsed or your subprocedure tells the parser to quit. More on this later.
Since it is the underlying XML parser that is calling your subprocedure, the parameter definitions are always the same. Let’s take a look at them:
     P saxHandler      B
     D                 PI            10I 0
     D   commArea                          Like(contentType)
     D   event                       10I 0 Value
     D   pstring                       *   Value
     D   stringLen                   20I 0 Value
     D   exceptionId                 10I 0 Value
I’ll talk about the use of the first parameter (commArea) in more detail later. For now I’ll just say that it is a means for indirectly passing a parameter of your choosing from your mainline code to the subprocedure. It can be anything–a single field, a data structure, anything.
The second parameter is a numeric code that tells you the type of event that you are about to process. IBM supplies a number of special names for these values so you don’t have to worry about what the actual numeric values are. For example, the event code that will be sent to mark the beginning of the document is *XML_START_DOCUMENT. You’ll see this in use in the code later.
The third parameter is a pointer to the data associated with the event. For the start or end of an element or attribute, that data will contain the name. For element or attribute data it references the actual data associated with the item.
The fourth is the length of the data “pointed to” by the previous parameter. It is this length that you must use to determine if data is present or not. The documentation suggests that when there is no data the pointer will be null, but the documentation lies! Use the length. Don’t trust that the pointer will be null. When we look at the sample code we’ll show how the combination of the two parameters allows us to easily access the data.
The fifth, and final, parameter is one that you probably won’t need to use. It only comes into play when the event signaled is an XML error (*XML_EXCEPTION). If the parser discovers a problem with the XML it will signal this error to your subprocedure to allow you to do any cleanup needed before the parse quits with an error. Since you can easily trap these errors by using the E(rror) operation code extender on the XML-SAX itself (or indeed a MONITOR operation), there are few times when you need to worry about it.
Now that the basics are out of the way let’s look at how XML-SAX is invoked.
       XML-SAX(E) %Handler( saxHandler: contentType ) %XML( XMLData );
       If %Error;
         // Not really an error - we found what we wanted
         Dsply ( 'XML type was: ' + contentType );
       else;
         Dsply ( 'No payload found' );
       endif;
The first parameter to %Handler names the subprocedure that will process the events. The second parameter is the communications area that I mentioned earlier. This field (contentType) is going to be used by the handler subprocedure to notify us of the type of payload that the XML contains. I specify it on the %Handler parameter and the RPG parser passes it on to the subprocedure. I could of course have used a global variable in this case, but had the subprocedure been in a separate module or service program, that would not have been an option. This way I can avoid using globals and future-proof the code at the same time.
%XML just identifies the variable containing the XML document. For this example there is no need for any options so the second parameter is omitted.
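To illustrate what that second %XML parameter is for, here is a hedged sketch of parsing a document held in an IFS file rather than in a variable. The doc=file option tells RPG that the first %XML parameter is a file name rather than the XML itself. The field name xmlFile and the path are purely hypothetical; saxHandler and contentType are the names used elsewhere in this tip.

```rpgle
     D xmlFile         S            256A   Varying Inz('/tmp/message.xml')

       // 'doc=file' - the first parameter names a file, not the XML itself
       // (xmlFile and its path are made up for this illustration)
       XML-SAX(E) %Handler( saxHandler: contentType )
                  %XML( xmlFile: 'doc=file' );
```

The handler itself would not need to change; only the source of the document differs.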
Notice that I specified the (E) extender to XML-SAX – this allows me to signal an “error” in the handler subprocedure and to detect it back in the main line as I do here. The weird thing about this particular example is that an “error” is what we actually want! It means that I have succeeded in identifying the payload and the field contentType has been set appropriately. I really wish there was a more logical way to handle this, but the only way to avoid it is to process the entire document (which could be huge), even though I found the answer in the first few lines of XML. I have actually suggested to IBM that perhaps there should be a way to quit the parse without signaling an error. We’ll see if that is a feature that can be added.
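As you can see from the logic, if %Error is not set then the entire document was processed without a payload element being found, and that is the real error in this case. Keep in mind that a genuine parsing failure, such as malformed XML, will also set %Error, so a more defensive version would also check that contentType was actually filled in.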
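For comparison, the MONITOR alternative mentioned earlier might look something like the sketch below. It is only an outline under the same assumptions as the code above (saxHandler, contentType, and XMLData are the names used in this tip). Because ON-ERROR with no status codes traps any error, the deliberate "quit" from the handler and a real parser failure both land in the same block.

```rpgle
       Monitor;
         XML-SAX %Handler( saxHandler: contentType ) %XML( XMLData );
         // Reached only if the whole document was parsed
         Dsply ( 'No payload found' );
       On-Error;
         // Either the handler asked to quit, or the parse really failed
         Dsply ( 'XML type was: ' + contentType );
       EndMon;
```

Which form to use is largely a matter of taste; the (E) extender keeps the test next to the operation, while MONITOR is convenient when several operations share one error path.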
Writing the Handler
Time now to look at the work of the handler, starting with the variable definitions and the initial processing.
(A)  D string          S          65535A   Based(pString)
(B)  D cPayload        C                   'Payload'
     D cEvent          C                   'Event'
     D cReport         C                   'Report'
     D cUnknown        C                   'Unknown'
     D cContinue       C                   0
     D cExit           C                   1
(C)  D inPayload       s               N   Static
(D)  D element         s            128A   Varying

(E)    If event = *XML_START_DOCUMENT;
         inPayload = *Off; // Static variable - set off for new doc
(F)      return cContinue;
       endif;

       // If string present capture actual string value
       // If no string then just return 'cos we have no interest
(G)    If stringLen > 0;
         element = %Subst( string: 1: stringLen );
       else;
         return cContinue;
       endif;
At (A) you can see the definition of the data string whose pointer was passed as the third parameter. That parameter (pString) is simply used as the basing pointer for the field, so the data is available the minute the subprocedure is called. No need for even the most pointer-phobic to worry about this kind of pointer usage!
(B) Defines a number of constants that will be used to identify the payload type.
(C) Is an indicator that is set once the Payload element has been found. This is used to ensure that I only check for <Report> and <Event> elements within the <Payload> element. This avoids signaling false positives should either a <Report> or <Event> element occur elsewhere in the document.
(D) Is the field that will hold the current element name. I extract it from the variable string and use this version for comparisons as it avoids having to use %Subst all the time.
At (E) the logic begins with a check to see if a new document is starting. If it is, then the inPayload indicator is cleared. Why is this needed? Remember that the subprocedure is going to be called multiple times and local variables are reset on each call. So if I need to remember that I am in the payload section then that indicator must be declared as STATIC so that it persists between calls. But that creates its own problem. I now need a means to reset that status when starting a new document. That is what this logic is about.
There was one obvious alternative to this approach: I could have included the indicator in the communications area and avoided the static variable. However, you will almost certainly need to use static variables at some time in your XML-SAX coding, so I decided that it was a “teaching moment” that should not be missed.
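For anyone curious, that alternative might be sketched roughly as follows. The data structure name commInfo and its subfield layout are invented for this illustration, and the handler is condensed (literals instead of named constants). The idea is simply that the flag travels in the communications area, so the mainline can reset it with a CLEAR before each parse and no *XML_START_DOCUMENT check is needed.

```rpgle
       // Hypothetical communications area: result field plus state flag
     D commInfo        DS                  Qualified
     D   contentType                 10A
     D   inPayload                     N

       // Mainline: reset the area, then parse as before
       Clear commInfo;
       XML-SAX(E) %Handler( saxHandler: commInfo ) %XML( XMLData );

     P saxHandler      B
     D                 PI            10I 0
     D   commArea                          LikeDS(commInfo)
     D   event                       10I 0 Value
     D   pString                       *   Value
     D   stringLen                   20I 0 Value
     D   exceptionId                 10I 0 Value

     D string          S          65535A   Based(pString)
     D element         S            128A   Varying

       // Only start-of-element events that carry a name are of interest
       If event <> *XML_START_ELEMENT or stringLen <= 0;
         return 0;                        // Zero = keep parsing
       endif;

       element = %Subst( string: 1: stringLen );

       If commArea.inPayload;             // First element inside <Payload>
         If element = 'Event' or element = 'Report';
           commArea.contentType = element;
         else;
           commArea.contentType = 'Unknown';
         endif;
         return 1;                        // Non-zero ends the parse
       endif;

       If element = 'Payload';
         commArea.inPayload = *On;        // Flag lives in the comm area
       endif;
       return 0;
     P saxHandler      E
```

After the parse, the mainline would read the result from commInfo.contentType rather than from a stand-alone field.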
The return at (F) will transfer control back to the SAX parser and the value of the constant (zero) tells it to continue processing. Later we’ll see how a non-zero value can be used to terminate the parsing.
The next part of the logic (G) checks if there was any data supplied for this event. Any value greater than zero indicates that data is present. I then use %Subst to extract the valid portion of the string and store it in element. As I noted earlier this is done to avoid using %Subst in all the following comparisons.
If there is no data then I am not interested in the event and simply return.
OK. So at this point I know that I have data and can continue the process.
       // Look for start elements
(H)    if event = *XML_START_ELEMENT;
         if inPayload; // Currently in Payload element
           // Check element type and set unknown if not identified
(I)        if (element = cEvent) or (element = cReport);
             commArea = element;
           else;
             commArea = cUnknown;
           endIf;
(J)        return cExit; // Return and exit parse
         else; // Not in payload so see if this is it
(K)        if element = cPayload;
             inPayload = *On; // Payload element so set flag
           endif;
         endif;
       endIf;
(L)    return cContinue;
At (H) I check to see if the event is the start of an element. If it is and I am already processing the payload then (I) I check to see if it is one of the payload types I recognize (Event or Report) and either copy the element name into the communications area or set the area to “Unknown”. At this point I have my answer: The payload has been found and reported so I can quit the parse and return to my mainline. This is what the return at (J) does. By returning a non-zero value I cause the parse to complete. As I noted earlier this will actually cause RPG to trigger an error, but it is the only way to get out of the parsing without running through the whole of the rest of the document and that would be a waste of time.
If I haven’t yet found the payload, then at (K) I check to see if this is it. If it is then the inPayload indicator is set and processing continues, as it does if the element is not the payload. The return at (L) transfers control back to the parser and my subprocedure will then sit patiently waiting to be called again.
That’s all there is to it. As I noted in the previous article a far simpler approach in this case might have been to simply use %Scan to identify the payload. However, there are times when you do need to extract a limited amount of information from the document and in that case XML-SAX is a far better choice than XML-INTO. Even though use of XML-SAX was not essential in the reader’s case, it still serves as a good introduction to the basic mechanics of XML-SAX.
So how do you decide whether to use XML-SAX or XML-INTO? I look at it this way. If I need to extract all (or the majority) of the data in the document, I tend to use XML-INTO. If I only need to extract the content of specific elements, or if the document is so complex and multi-formatted as to make the data structure definitions for XML-INTO problematic, then I will use XML-SAX. If I only want to know what type of document I’m processing, which was fundamentally the case for the reader, then a simple %Scan or two will normally fit the bill.
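As a rough idea of what the %Scan approach could look like for the reader's document, consider the sketch below. It is deliberately naive and rests on assumptions that would need checking against real data: that the tags carry no namespace prefix, and that the strings '<Report' and '<Event' cannot occur anywhere other than as the payload element (a <Reporting> element, for example, would fool it). XMLData and contentType are the fields used earlier in this tip.

```rpgle
       // Naive payload detection - assumes the tag text appears only
       // as the payload element and has no namespace prefix
       Select;
         When %Scan( '<Report': XMLData ) > 0;
           contentType = 'Report';
         When %Scan( '<Event': XMLData ) > 0;
           contentType = 'Event';
         Other;
           contentType = 'Unknown';
       EndSl;
```

If those assumptions do not hold, the XML-SAX handler shown above is the safer choice because it tests real element names in context.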
If enough people are interested in more details of using XML-SAX I can revisit the topic in more detail in a future tip.
Jon Paris is one of the world’s most knowledgeable experts on programming on the System i platform. Paris cut his teeth on the System/38 way back when, and in 1987 he joined IBM’s Toronto software lab to work on the COBOL compilers for the System/38 and System/36. He also worked on the creation of the COBOL/400 compilers for the original AS/400s back in 1988, and was one of the key developers behind RPG IV and the CODE/400 development tool. In 1998, he left IBM to start his own education and training firm, a job he does to this day with his wife, Susan Gantner–also an expert in System i programming. Paris and Gantner, along with Paul Tuohy and Skip Marchesani, are co-founders of System i Developer, which hosts the new RPG & DB2 Summit conference. Send your questions or comments for Jon to Ted Holt via the IT Jungle Contact page.