Guru: Sorting In Teraspace
January 29, 2018 Jon Paris
In an earlier tip, Teraspace To The Rescue, I discussed how to use teraspace in RPG to permit the storage and manipulation of data that requires more storage than RPG’s 16MB limit. In this tip I am going to discuss how such data can be sorted. Hint: SORTA won’t work!
Before I get to that though, there is one thing I should mention. In that original tip I said: “For reasons I have never understood, DEALLOC will not null your pointer. So it is a good idea to do it yourself. . . .” IBM’s Barbara Morris pointed out, in a comment to that tip, that DEALLOC does actually offer that feature via the op-code extender “N”. Apparently I “missed the memo” on that particular feature. So, had I coded DEALLOC(N), I would not have needed to manually null the pointer. Of course, any related pointers, such as p_currentPosn in the original code, still need to be manually nulled.
This story contains code, which you can download here.
OK — on to sorting things out. I hinted earlier that SORTA can’t help you here and, if you think about it, that makes perfect sense because RPG would not allow us to define the data as an array in the first place. However, we can use the qsort() API. While this API can look a bit intimidating when you first encounter it, it is really quite simple to use. Even better, once you’ve used it, all future uses are pretty much just clones. I described the basic usage of this API in a short tip back in 2012, but I’ll cover a few more details pertinent to this example later in this discussion.
For the purposes of my example, I will be using a data structure based on the records in a simple Customer file. You can see the definition of the DS in the RDi Outline View image here.
I chose this example so that I could demonstrate both single- and multi-key sorts. It is worth noting at this point that the technique shown here of using a DS defined with the LIKEREC option is basically the same as the one I use to provide sorting capabilities for 5250 subfiles. Combine these two concepts and you can provide sorting capabilities for subfiles of almost unlimited size.
As you will see if you download and study the code, the program simply loads every record from the file into my teraspace “array”. Once loaded, qsort is used to sequence the array. The results are then displayed and another sort performed.
Let’s take a look at the code, beginning with the creation and loading of the array. Since the majority of this code is much the same as in the previous tip, I will just cover the highlights.
The program variable currentmax is used to monitor the maximum number of records which our current memory allocation can accommodate. We begin (A) by using that value, multiplied by the record size, to allocate an amount of storage large enough to contain this number of records. The resulting pointer is stored in p_customers, which will always contain the storage location for the first record in the array. Its value is then copied into p_currentCustomer, which is the basing pointer for the record DS.
Once that basing pointer has been set, I can read the file (B) specifying the customerAddr DS as target for the data. The first record having been read, I can now enter the read loop that will load the rest of the data. Obviously you can apply any selectivity you like when loading the data. For that matter you could use SQL rather than a native READ operation.
Within the read loop, I check if there is sufficient space available for another record in the current allocation (C). If there is then the pointer is simply incremented to the next position. If there is not, then the maximum is increased and a %REALLOC used to get the necessary additional storage. The critical part of this operation is identified at (D) where the pointer value for the current record is recalculated. As I noted in the earlier tip, this is vital because there is no guarantee that when we reallocate the memory the starting address (p_customers) will be the same as it was before. Don’t worry, even if the memory location changes, the system will have copied your data to the new position for you so you never lose anything. However, the value presently in p_currentCustomer is to a location in our previous memory allocation so it must be recalculated before we can add any more records.
// Allocate enough storage for current maximum (A) p_customers = %Alloc( %Size( customerAddr ) * currentMax ); p_currentCustomer = p_customers; // Set current position to beginning (B) Read Customers customerAddr; // Loop priming read // Load data from file (or any other source) DoW Not %EOF( Customers ); count += 1; // Increment record count (C) if count < currentMax; // Still space remaining ? // If so advance to next element p_currentCustomer += %Size( customerAddr ); else; // Otherwise make room for more by increasing max size and then // reallocating storage. // Note: p_bigDS may now point to different storage so you MUST // recalculate the current customer position pointer currentMax += INCREMENT; p_customers = %ReAlloc( p_customers: %Size( customerAddr ) * currentMax ); (D) p_currentCustomer = p_customers + (%Size( customerAddr ) * ( count )); endif; (B) Read Customers customerAddr; // Add next address to array EndDo; Dsply ( %Char(count) + ' records loaded');
Once all the records have been loaded the first sort operation is performed (E). We pass qsort the address of the start of the data ( p_customers ), the number of entries in the “array” ( count ), the length of each entry ( %Size(customerAddr) ), and finally the procedure pointer ( p_SortCityState ) which identifies the procedure for the sort – in this case to sort the data in City within State. We will look briefly at that procedure in a moment.
The reason for using a procedure pointer is because we need to provide logic to control the processing of the qsort utility function. Unlike SORTA, qsort knows, and assumes, nothing about your data. Its sole purpose in life is to move blocks of memory around so that they are in a sequence that is determined by your logic.
Here’s how it works. Once invoked, qsort calls the routine you identified (via the pointer p_SortCityState ) and passes it a pair of elements from the array. Your logic then simply has to determine which of the two is higher in the desired sort sequence (or lower, of course, if descending sequence is what you want). It then returns the answer (HIGH,LOW, or EQUAL) to qsort, which will then pick another candidate pair and call your code again. This continues until qsort determines that everything is in sequence, and at that point it returns control to your mainline code. We’ll look at the code for the sequencing procedure a little later.
While unusual in an RPG environment, this kind of “call-back” processing, as it is known, is commonly used in other programming languages. It provides a much more flexible approach to controlling the behavior of a utility function than the more traditional approach of using exit programs. RPG’s only native use of this approach is with the %HANDLER options of XML-INTO and XML-SAX.
Once the sort has been completed, the routine DisplaySortResults is called to demonstrate that the correct sequence has been achieved. The program then goes on to sort the array into Name sequence and display those results. Needless to say, if you decide to run this program against your own data I suggest that you not use too large a file or you’ll be waiting a long time to see all the results flash by!
// Sort array into City/State sequence (E) qsort ( p_customers: count: %Size(customerAddr): p_SortCityState ); Dsply 'Data is in City/State Sequence ...'; DisplaySortResults(); // Sort array into name sequence qsort ( p_customers: count: %Size(customerAddr): p_SortByName ); Dsply 'Data is in Name sequence ...'; DisplaySortResults();
Time now to take a look at the logic for the City/State sequencing routine (F). If you were to look at the technical specifications for the qsort sequencing routine, you would see that it is defined as taking two parameters one for each of the two elements that it is to compare. Each parameter is a pointer and is passed by value. So technically speaking the procedure interface should look like this:
dcl-proc SortCityState; dcl-pi *n Int(10); p_element1 Pointer Value; p_element2 Pointer Value; end-pi; dcl-ds element1 LikeRec( custRec ) Based( p_ element1 ); dcl-ds element2 LikeRec( custRec ) Based( p_ element2 );
Notice that this approach forces us to manually define data structures for each of the elements, and to use the passed pointers as their basing pointers – a bit clumsy, and not at all RPG-like. Instead I have “cheated” and used my knowledge of how subprocedure parameters are actually passed to allow myself to code this in a much simpler and cleaner fashion. I personally think it is also far easier to understand. Here’s the version I actually used:
(F) dcl-proc SortCityState; dcl-pi *n Int(10); element1 LikeDS(customerAddr); element2 LikeDS(customerAddr); end-pi;
So how and why does this substitution work? Simply put, under the covers, passing a pointer containing the address of a field by value, has the same effect as passing that field by reference. Both of them result in a pointer, containing the address of the data, being placed on the stack. If you want to understand the mechanics behind this a little better then you can read my tip “Parameter Passing Fundamentals Of Programs Versus Procedures” and in particular the section headed “Why the Difference.”
Regardless of which way we code the procedure interface, the basic logic for the sequencing is the same. We begin by comparing the high order keys (in this case the state) and notify qsort via the RETURN value whether element 1 is higher (HIGH – which has the value 1) or lower (LOW – the value -1) in the sequence than element2. Note that HIGH, LOW and EQUAL are named constants in our procedure. You have probably guessed that the value of EQUAL is zero.
In the event that the State is the same in both elements, then the comparison moves on to the City. Hopefully you can see that if in future we were to need to add (say) the customer name to the sequence then we would simply insert the appropriate comparisons between element1.name and element2.name _before_ the RETURN EQUAL operation.
// Sequence by City in State if element1.state > element2.state; Return HIGH; elseif element1.state < element2.state; Return LOW; elseif element1.city > element2.city; Return HIGH; elseif element1.city < element2.city; Return LOW; // Add any additional low-order tests here ... else; Return EQUAL; endif;
As you can see, coding these sequencing procedures is very simple, and that makes it easy to add additional sequences to meet user’s requests. It would, for example, be possible to add a case-insensitive sort simply by unicasing the fields as part of the comparison. It would not be the most efficient way to do it, but it would work.
In the next tip I will discuss how the bsearch API can be used to perform the equivalent of %LOOKUP on our teraspace arrays and in particular how we could implement partial key searches. In the meantime, if you have any questions please let me know via the comments section.