Treasury of new DB2 6.1 (V6R1) Features, Part 4: Index and Data Type Enhancements

March 25, 2009 Michael Sansoterra

So far in this series, we’ve covered how IBM has made life easier for DB2 for i (formerly called DB2/400 or DB2 UDB for iSeries) developers by:

Part 1: Cramming greater query functionality and operations into a single statement
Part 2: Providing an easy way to summarize data with super groups and grouping sets
Part 3: Giving developers options to implement audit tracking and other features useful in a client/server environment by implementing several new client special registers

This tip will cover yet another valuable subset of features: new indexing and data type features.

Building an Index Using Expressions

The CREATE INDEX statement has been enhanced to allow column expressions. Previously, indexes containing limited expressions could be built using DDS based logical files, but now SQL has been granted this power. To do this in SQL, simply substitute a column expression where a column name would normally go in the CREATE INDEX statement:

CREATE INDEX APPDATA/ORDERLINE_I1 ON
APPDATE/ORDERLINE (SALESREP, LINEAMT * QTY DESC)

In the above example, an index is built on the ORDERLINE table using the SALESREP column followed by the calculation of the extended dollar amount in descending order. The following “Top Sales” query will allow the query optimizer to choose the newly created ORDERLINE_I1 index:

  Select SalesRep,Order,Item,LineAmt*Qty As ExtAmount
    From OrderLine
Order By SalesRep, ExtAmount Desc

As expected, the expressions and the sort sequence in the ORDER BY need to match the column names and expressions in the index. The first time I tried this I noticed that the optimizer was performing a table scan instead of using the intended index, but my original test table only had about 500 rows. Doing a similar test on a table with over 100K rows yielded the expected result with the query optimizer picking the expression index.

Here’s an example of how an index expression can benefit a legacy application. Sometimes legacy packages store dates in a less than optimal format. Say you have a PO header table that contains the following columns:

POCC	PO Date Century	Numeric(2,0)
POYMD	PO Date (YYMMDD)	Numeric(6,0)

It’s always a pain to do a “cross century” search on dates stored like this. However, we can build an index that will combine these two date piece fields into one large numeric field:

Create Index DataLib/PurchaseHeader_I1 ON     
DataLib/PurchaseHeader (POCC*1000000+POYMD,PO)

Or, we can even go a step further (assuming there’s valid date data in every row) and create an expression that will cast the numeric values to a DATE data type:

Create Index DataLib/PurchaseHeader_I2
On DataLib/PurchaseHeader 
(Date(Digits(POCC)||Left(Digits(POYMD),2)||'-'||
 Substr(Digits(POYMD),3,2)||'-'|| 
 Substr(Digits(POYMD),5,2)),PO)

Before building an index like the one in this example, you’ll want to ensure that the values in all of the rows are valid. Remember, just because the expression is in your index doesn’t mean you can use it in your Select’s column list. For instance, if you want to see the result of the above date expression in a query result, you’ll need to re-code the expression in your Select statement.

Unfortunately at this time, deterministic User-Defined Functions are not supported in index expressions. Understandably, special registers, sub-selects, aggregate expressions, and other non-deterministic functions are not supported either.

Indexes that Contain Column Names

SQL indexes can now have a list of column names and a record format specified. Previously, SQL indexes did not contain column information and hence could not be directly read by Query/400, COBOL or RPG programs. Ted Holt already did a write up on this topic, so I won’t cover it in detail again.

For quick reference, here is a sample index that uses a key expression and a column list:

Create Index AdvWorks/TranHistII On AdvWorks/TransactionHistory 
(ProductId, Case When Quantity=0 Then 0 
                 Else ActualCost/Quantity End Desc) 
RcdFmt TransHistR 
Add ProductId, ActualCost, Quantity, 
ReferenceOrderId, TransactionId, ReferenceOrderLineId, 
TransactionDate, TransactionType, ModifiedDate

This SQL index will be built with a column expression in the index, a record format of TransHistR, and will contain all of the column names listed in the Add clause. Instead of specifying individual column names on the ADD clause, you can also specify the “ALL COLUMNS” option as “ADD ALL COLUMNS.”

Having column information in a specific index should save the query optimizer time by allowing SQL to pull data out of the index without having to reference the data in the underlying base table (a.k.a. physical file). This will also allow RPG and COBOL programs to use an SQL index like a traditional logical file without using embedded SQL. These index enhancements have virtually eradicated the remaining benefits of using DDS. Although, since IBM is moving us away from DDS, I would still like an “i” specific extension to DB2 SQL to allow edit codes or edit words to be specified on a numeric column.

The big remaining limitations here are that column expressions are not allowed in the record format’s column list (that would be very nice), and they suffer the inability to reference multiple tables. To get rid of these limitations IBM’s next logical step is to give developers the ability to create Indexed Views.

Binary Constants and Comparisons with Character Data

Now moving on to our data type related discussions, a new binary string constant has been added to SQL that is very similar to the hex form of the character string constant. A binary string constant starts with the letters BX:

BX'0D0A'   /* ASCII Carriage Return & Line Feed Characters */

A binary string constant is automatically interpreted by DB2 as the BINARY data type and hence will not be translated to character data under any circumstances (except an explicit cast). If you’re wondering “why not just use the existing hex constant?” the answer is that the hex constant can, in certain circumstances, be interpreted as character data and hence can be translated to a different character set. For instance, a result set containing a hex constant could be translated by, say, the ODBC driver whereas a binary constant will not undergo such changes.

A related change is that character data tagged with the FOR BIT DATA option specified is now compatible with the binary data type. This means that these two data types can be compared, concatenated, etc., without first explicitly CASTing the data. Prior to V6R1 specifying FOR BIT DATA on a character column would still be treated as character data (even though the intention of this encoding is to store binary data). For example, the following concatenation will fail in V5R4 but succeed in V6R1:

Select Binary(X'F1')||Cast(X'F1' As Char(1) For Bit Data)
       As Test_Concat 
  From SysIBM/SysDummy1

Even so, there is a still a significant difference between the two types as binary columns are padded with binary zeros (BX’00’) and character based columns (including those with FOR BIT DATA specified) are padded with spaces. While the two are now somewhat compatible, all new applications should specify the BINARY data type for storing binary data unless there is an overriding legacy requirement to keep the column as character.

New DECFLOAT Data Type

A new data type called DECFLOAT has been introduced this release. This data type based on the IEEE 754 standard (actually it’s IEEE 754R) combines the benefits of the compact yet large value range capacity of floating point numbers, but also includes the precision of the decimal data type that is critical to business applications. (Something traditional floating point numbers are not guaranteed to have!)

DECFLOAT can be defined as having either 16 or 34 digits of precision. DECFLOAT(16) requires eight bytes of storage and can hold an exponent range of 10³⁸³ to 10³⁸⁴. DECFLOAT(34) requires 16 bytes of storage and can hold 10⁶¹⁴³ through 10⁶¹⁴⁴. If you need a small storage footprint, big numbers and accuracy then DECFLOAT is a great option.

Further, in addition to numbers, this data type is capable of holding any of the following special values:

Infinity–A value that represents an infinitely large (and can be positive or negative).
Quiet NaN–This value represents an undefined result (i.e., “not a number”) that does not generate a warning when processed.
Signaling NaN–A value that represents an undefined result that causes a warning message to be generated.

When inserting these values into a DECFLOAT column, use one of the following special constants: INFINITY, NAN, and SNAN (quotes are not required).

New companion built-in functions for use with the DECFLOAT type are:

QUANTIZE–This function can be used to control the precision of a DECFLOAT number, since floating point numbers can contain slight inaccuracies QUANTIZE can help by controlling the number of digits to be reported.
TOTALORDER–This function reports the result of a comparison operation between two values and takes into consideration the special values when doing a comparison. (For instance, infinity is considered less than a NaN value!)
DECFLOAT_SORTKEY–This function returns a special binary value that can be used to sort DECFLOAT values, which will correctly take into account the special values and rules when sorting according to the IEEE 754 standard. (For instance, in this standard the value 2.10000 is less than 2.1. In fact, 2.1000 < 2.100 < 2.10 < 2.1. DECFLOAT_SORTKEY can be used to sort values consistent with the standard.)
DECFLOAT–This function is used to CAST a value to the DECFLOAT data type.

For the record, the ILE/C compiler is the only high level language that currently supports DECFLOAT. For other high level languages, the DECFLOAT data will need to be cast to another data type (DEC or DOUBLE, for example) before it is stored in host variables.

New Unicode Data Types

Three new data types have been introduced: NCHAR, NVARCHAR, and NCLOB. These data types are double-byte characters tagged with the Unicode (UTF16) CCSID of 1200. In other words, defining the following column:

CustomerName NVARCHAR(35)

is equivalent to defining a column as:

CustomerName VARGRAPHIC(35) CCSID 1200

Similarly, NCHAR is the equivalent of GRAPHIC and NCLOB is the equivalent of DBCLOB (with CCSID 1200 specified).

While DB2 has been able to work with UTF16 since V5R3 this is a nice enhancement because:

The syntax is understandable (non DB2 developers don’t usually know about CCSIDs or the GRAPHIC data type)
You don’t have to remember the CCSID number for UTF-16
These data types provide nice compatibility with SQL Server and to a lesser extent with Oracle (which has NCHAR and NVARCHAR2 data types)

Summary

The index enhancements that enable indexes to have expressions, record format names, and specified column names eliminate many of the few advantages DDS had. Additionally, the new implementation of binary constants and binary comparisons with character “FOR BIT DATA” have eliminated most of the remaining issues related to binary data handling within DB2 for i. Finally, the new DECFLOAT and Unicode data types give developers more flexibility in numeric data type offerings as well as simplify defining columns and parameters requiring double byte Unicode data.

Michael Sansoterra is a programmer/analyst for i3 Business Solutions, an IT services firm based in Grand Rapids, Michigan. Send your questions or comments for Mike to Ted Holt via the IT Jungle Contact page.