Variable-Length Database Fields Better Use Disk Space

September 24, 2008 Jon Paris

In a recent tip, I discussed the use of variable length fields in building strings within RPG programs. This time, I will focus on how these fields can be utilized in our databases and consider the trade-offs, between disk space and performance, involved.

Why would you want to use a variable length field? Well, how about product descriptions as one example, or notes fields associated with an order? The marketing department would probably like to have an unlimited number of characters to play with, but they may accept a limit of say 1,000 as more realistic. The problem, of course, is that we will need to allocate those 1,000 characters for each record, and that’s going to seem like it “wastes” a lot of storage space when we consider all those cases where only 100 or so of those 1,000 characters are in use.

One traditional solution to this problem is to create a separate database for the descriptions. Let’s say each record contains 100 characters of the description and a sequence number. This minimizes wasted storage, but means that additional code always has to be executed in order to store and/or retrieve the description. As a result, performing even simple text search functions is problematic. And if you really enjoy exercises in futility, try to explain to an end user how to use their ODBC-based desktop software to search and retrieve information from such a database!

This is where variable length fields come into their own–they deal with all of these issues. They allow us to treat the field as a single entity and to let the system worry about how it is stored and retrieved, while still retaining a measure of control over the amount of storage used.

First let’s look at how these fields are defined in DDS:

A    LONGDESC1   1000          VARLEN
 
A    LONGDESC2   1000          VARLEN(80)

The VARLEN keyword defines the field as variable length. The purpose of the optional parameter shown in the second example is to specify the allocated length. It is easier to understand the purpose of this parameter once you understand how data is stored for a variable length field. In the case of LONGDESC1, any data in this field is always stored in what is known as the variable length portion of the file. (This is sometimes also known as the overflow or shadow area.) As a result, when you read the record containing LONGDESC1, if there is any data present in the field (i.e., it has a non zero length), a second read is required to actually retrieve that data. This obviously affects the performance of the application and so we need a means to control this.

This is the purpose of the allocated length. In my example, so long as the content of LONGDESC2 does not exceed 80 characters the data should always be stored with the main record. Only if the content of the field exceeds 80 characters will the overflow area be used, and a second read be required. The trick is in finding a balance between storage and efficiency! As a general rule, you should aim to have the allocated length be large enough to accommodate a substantial percentage of your records without resorting to using the overflow area.

What constitutes “substantial”? That depends on how often you access the records, how I/O bound your system is (i.e., how much will the extra reads affect you), and how critical disk space is to you. Only you can answer these questions, but I would suggest that setting the value to accommodate somewhere between 75 and 85 percent of the records would normally be a good starting point. If, however, you are in a situation where a large number of records will have no data in the field, then you should probably not specify the length and simply let the system use the overflow area for all the data.

Here are some other points you should bear in mind when using variable length fields:

Always remember to use %TRIMR when loading the field or all of the trailing spaces in the original fixed length field will be wasting space.
There is an overhead of 25 bytes associated with the variable length field to link it to the overflow area–so you will never save space by making short fields variable length no matter what.
Your RPG program, and any utility programs you use, will always see the full length of the field. But you can always determine the real length of the field through the %LEN BIF.
If the data in your variable length fields is volatile, remember to reorganize your file on a regular basis. This ensures that retrieval from the overflow area is as efficient as possible.

There are many other advantages to using variable length fields that we could discuss, but we’ll save those for another day.

Jon Paris is one of the world’s most knowledgeable experts on programming on the System i platform. Paris cut his teeth on the System/38 way back when, and in 1987 he joined IBM’s Toronto software lab to work on the COBOL compilers for the System/38 and System/36. He also worked on the creation of the COBOL/400 compilers for the original AS/400s back in 1988, and was one of the key developers behind RPG IV and the CODE/400 development tool. In 1998, he left IBM to start his own education and training firm, a job he does to this day with his wife, Susan Gantner–also an expert in System i programming. Paris and Gantner, along with Paul Tuohy, are co-founders of System i Developer, which hosts the new RPG & DB2 Summit conference. Send your questions or comments for Jon to Ted Holt via the IT Jungle Contact page.

The Efficiency of Varying Length Character Variables

                     Post this story to del.icio.us
               Post this story to Digg
    Post this story to Slashdot