Newsletters Subscriptions Media Kit About Us Contact Search Home

Stuff
OS/400 Edition
Volume 2, Number 7 -- March 27, 2003

Pivot Data with SQL/400


by Michael Sansoterra

[The code for this article is available for download.]

Have you ever received an annoying report request from management asking that the data be presented summarized in columns and rows, rather than just by rows? To further complicate things, the data to be presented in the columns wouldn't necessarily be consistent between runs, so you had to dynamically calculate column contents. Sound familiar? Cross tab queries provide the perfect solution for just such a dilemma.

Microsoft's Access and Excel have cross tab queries and pivot tables, to provide a programmer or user with the ability to easily create such data transformations. The Create Cross Tab SQL (CRTCTSQL) utility presented here is intended to bring similar functionality to the iSeries. Because it uses embedded SQL and free-form RPG, CRTCTSQL requires the DB2 SQL Development Kit to be installed, along with OS/400 V5R1 or later.

Cross tab queries are a splendid tool for data analysis. They allow data to be dynamically summarized by row and column values. In SQL, the GROUP BY clause summarizes data by predefined row values (a one-dimensional summary). A cross tab query can be thought of as providing a two-dimensional GROUP BY, summarizing predefined values by both row and column.

Figure 1 shows the result of a simple one-dimensional data summarization. Data is summarized by the unique row values found in the period and category columns.


Period Category Sales
2002-01 Snacks 3000.25
2002-01 Beverages 2145.01
2002-01 Fruit 175.20
2002-02 Snacks 2167.40
2002-02 Beverages 2022.39

Figure 1: Sales data is summarized in rows by period and category


Here is the SQL that was used to create the data summarization:

  Select Period, Category, Sum(SalesAmt) As Sales
    From Sales
Group By Period, Category
Order By Period, Category

Figure 2 shows the result differently, with the values in the category column pivoted from a row heading to a column heading.


Period Beverages Fruit Snacks
2002-01 2145.01 175.20 3000.25
2002-02 2022.39 -- 2167.40

Figure 2: Sales data is now summarized in rows by period, and in columns by category


Here is the SQL statement to produce the "pivoted" data:

  Select Period,
         Sum(Case When Category='Beverages' 
                  Then SalesAmt End) As Beverages,
         Sum(Case When Category='Fruit' 
                  Then SalesAmt End) As Fruit,
         Sum(Case When Category='Snacks'
                  Then SalesAmt End) As Snacks
    From Sales
Group By Period
Order By Period

Notice that the SQL CASE statement is used to group the sales amounts into columns by category. But what if you don't know beforehand what your category names are? For example, what if another 10 categories were added to the sales table? The programmer would have to go back and add another 10 CASEs to the SQL statement.

This is where the power of cross tab queries comes in. A cross tab query will examine data beforehand and then dynamically build an SQL statement consisting of a variable number of CASE statements. So if in the above example two new categories were added to the sales table, a cross tab query would examine the category column of the sales table, build an SQL statement (like the one shown in Figure 2) with a CASE statement for each distinct value in the Category column, and run the query.

The cross tab query has four components: data source, row headings, column heading, and aggregate values.

To build a cross tab query with CRTCTSQL, you need to specify a valid SQL SELECT statement as the data source. A sample data source would simply be SELECT * FROM SALES.

Row headings are specified by selecting one or more column names or expressions from the base SELECT statement. These row headings are used just like columns specified in a GROUP BY clause; that is, the data is summarized in rows by the values in these columns. Examples of row values could be ITEMNO or SUBSTR(ITEMNO,3,3).

The column heading is the name of a single column name or expression. When the cross tab query is run, the column name or expression specified for the column heading is scanned against the base table and a list of unique values is built. Each one of these unique values will be placed in a CASE statement so that the data will be "pivoted" by column. Examples of data to use in a pivoted column would be CUSTNAME or YEAR(DATE) || '-' || MONTH(DATE).

The final piece of the cross tab query is the aggregate values. These values are column names or expressions that will participate in one of the standard SQL aggregate functions such as AVG, COUNT, MIN, MAX, STDDEV, SUM. In case you're not familiar with aggregate functions, these are functions that operate on multiple rows of data. For example, the AVG function will return the median average of a numeric column in a given set of rows. The MIN and MAX functions will return the lowest and highest values, respectively, for a given column or expression for a given set of rows. Of course, the values in the aggregates can only be numeric for functions like AVG or SUM, but MIN and MAX can accept other types as well. Examples of aggregate values would be ORDERS (*COUNT) or QTY * PRICE (*SUM).

Figure 3 illustrates the cross tab query's parts.

Figure 3

Figure 3: Row headings, column headings, aggregate values, and a data source make up the cross tab query

The CRTCTSQL command is designed to build an SQL statement that will emulate a cross tab query. The following is a run down of the parameters comprising the CRTCTSQL command.


Parameter Description
COLHDG Enter the column name or expression from the data source SQL statement. The distinct values taken from the column specified here will be broken into individual columns on the cross tab query.

NOTE: If you have character fields containing binary data, the program will create invalid column alias names in the resulting cross tab query. The values of the column or expression specified here should resemble something that can be represented as an SQL column name.
MAXCOL Enter the maximum number (1-50) of columns allowed in the cross tab query (not including the row headings). This is used to limit the number of columns created in cases where a particular column heading may have a large number of distinct values.
ROWHDGS Enter one or more column names or expressions from the data source SQL statement. The distinct values for the columns specified will be used to create "row summaries." In other words, the columns or expressions given here will be placed in the SQL GROUP BY clause.
VALUES Enter one or more columns or expressions to be used in aggregate functions. The aggregate function to be used must be specified as well (SUM, COUNT, AVG, MIN). These will usually be numeric columns, to be summed or averaged. Character or date columns can be used for COUNT, MIN, and MAX functions.

Optionally, the DISTINCT keyword can be specified, so that only distinct values are used in the aggregate, as well as the option to cast the result to a specified data type. If *DECIMAL or *ZONED is specified as the data type, the precision and scale must be given as well. If an invalid data type is specified for a given column function--SUM(BLOB column)--a runtime error will occur when the Cross Tab query is executed.
SQL Place the data source SQL statement here. A maximum of 5,000 characters may be specified. If you need to use complex column expressions or join several tables, it may be wise to first encapsulate the SQL statement in a view.

The ORDER BY is not allowed here (see below). UNION is not allowed until V5R2. This statement should be entered with the *SYS naming convention unless the program's compiler option is changed to *SQL.

Note: If you intend to use a feature such as FETCH FIRST n ROWS ONLY to pick your top 10 customers for the cross tab query, you will need to create a work table first before using CRTCTSQL. That's because the ORDER BY is required for FETCH FIRST to work properly, but ORDER BY is not allowed in the base SQL statement.
OUTPUT Select an output for the cross tab SQL statement. The * option will display the results of the cross tab query on the screen. The other options--*VIEW, *QMQRY, *SRC--will put the cross tab SQL statement into a given VIEW, query manager query, or source member. For these latter options, the resulting query is not run immediately.

Several self-explanatory suboptions are available to specify the view, query, and source file/member information, along with the ability to replace existing objects or members.
ORDERBY If YES is specified, an ORDER BY clause will be appended to the SQL statement (not valid for output of *VIEW.) The column names in the ORDER BY will match the columns in the GROUP BY. Remember, a GROUP BY no longer forces the sort of the query.
FIXEDHDGS This option is used to force the column names and values in the query. In other words, this option will force the query to generate columns for a desired set of values. This is useful for date columns and the like, where data should be broken out in columns by year, quarter, or month.

Also, a column name should be assigned for each value. When the query is run, the value specified will also have the given column name in the resulting cross tab SQL statement. This name must not exceed 30 characters and must follow all of the SQL naming rules. Additionally, these names must be unique if a VIEW is to be created.

One other thing about cross tab queries needs to be said. While they're great for things like ad hoc queries, verifying program output, or creating long SQL statements that you otherwise wouldn't want to type, because the column names can often change from run to run, they're not a very friendly tool to use for production reporting, unless you have the skills and desire to build a sophisticated reporting tool that is able to work with dynamic column names.

Under certain circumstances, however, it may be known beforehand that the cross tab's columns will always be the same. For a query displaying sales data, for example, the columns may always be summarized as QTR1, QTR2, QTR3, and QTR4. Or a cross tab query to display customer names can be defined to always show the same 10 customers. In this case, the FIXEDHDGS parameter can be specified. The benefits of working with fixed headings are as follows:

  • Only the values specified here will be placed in the cross tab statement. This means unwanted values will be weeded out.
  • Processing time is not required to retrieve distinct values from the base statement.
  • The column names are defined by the user, and are therefore friendlier.
  • Since the column names are fixed, the cross tab query can easily be used by other programs or queries.

Here are a few other things to be aware of when using the command. As noted earlier, the cross tab SQL will automatically assign names to the columns created based on the value of the column. The names built will start with the column function selected (SUM, MIN, MAX), followed by an underscore ( _ ), followed by the distinct value. Any characters not allowed in a column name will be replaced by an underscore. The maximum length for a column name is 30, so any extra characters will be truncated. Keep in mind that if any two column names end up being the same, a view will not be able to be created, because views demand unique column names.

Because the values in the column are used as headings, only character fields with simple data types such as numeric and character (preferably with shorter lengths) are allowed. BLOBs and datalinks, for example, are not supported.

To get an idea of how column names are created, here is a sample SQL statement created by CRTCTSQL and highlights to show its components.

SELECT ProductID, ProductName,
/* Values */
         DECIMAL(SUM(CASE WHEN CategoryName ='Beverages' THEN Round(UnitPrice *
         Quantity * (1 - Discount),2) END),9,2) AS SUM_Beverages,
         DECIMAL(SUM(CASE WHEN CategoryName ='Condiments' THEN Round(UnitPrice
         * Quantity * (1 - Discount),2) END),9,2) AS SUM_Condiments,
         DECIMAL(SUM(CASE WHEN CategoryName ='Confections' THEN Round(UnitPrice
         * Quantity * (1 - Discount),2) END),9,2) AS SUM_Confections,
         DECIMAL(SUM(CASE WHEN CategoryName ='Dairy Products' THEN
         Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS
         SUM_Dairy_Products,
         DECIMAL(SUM(CASE WHEN CategoryName ='Grains/Cereals' THEN
         Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS
         SUM_Grains_Cereals,
         DECIMAL(SUM(CASE WHEN CategoryName ='Meat/Poultry' THEN
         Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS
         SUM_Meat_Poultry,
         DECIMAL(SUM(CASE WHEN CategoryName ='Produce' THEN Round(UnitPrice *
         Quantity * (1 - Discount),2) END),9,2) AS SUM_Produce,
         DECIMAL(SUM(CASE WHEN CategoryName ='Seafood' THEN Round(UnitPrice *
         Quantity * (1 - Discount),2) END),9,2) AS SUM_Seafood
/* Base SQL */
FROM
(Select CategoryName, A.ProductID, ProductName, A.UnitPrice, Quantity, Discount
 From Northwind/OrderDetails A Inner Join Northwind/Products B On A.Productid=B
.Productid Inner Join Northwind/Categories C On B.Categoryid=C.CategoryId) AS
TBL
/* Group By */
GROUP BY ProductID, ProductName

This SQL statement was generated by CRTCTSQL. Individual components of the cross tab query are highlighted, with the column heading portion in purple, the row heading in blue, the aggregate values in red, and the base SQL data source in green.

For the output type of *SRC, and *QMQRY, keep in mind that the created SQL statement may not run. That's because the resulting statement will be syntactically correct but may not be fully correct, in the event that a row, column, or value expression is entered incorrectly. In such cases you will not know that there is a problem with the statement until an attempt to run it is made. In contrast, outputs of * and *VIEW will generate an error if something is wrong with the resulting SQL because they attempt to run the SQL statement immediately.

CRTCTSQL uses SQL/400's nested SELECT ability to get the data from any SELECT statement. However, the ease of use of nested SELECTs comes with a performance price. It is often a good idea to run a query to select your base data into a small work table to use with CRTCTSQL in order to prevent long running cross tab queries.

If you have problems, be sure to check any column expressions and the base SQL statement. (The base SQL will be validated in the event that distinct values need to be retrieved.) It helps to dump the cross tab SQL to a source member for closer examination. Remember, a statement may be syntactically correct and still contain errors.

When running a cross tab query over a large table, it may take a while to find the DISTINCT values, so don't be surprised if there is a long delay before the command is finished.

Finally, a companion command called Retrieve Cross Tab SQL (RTVCTSQL) is available for retrieving the SQL statement into a CL variable. This is useful for those who want to do something more with the SQL statement, such as building a dynamic reporting utility, before using the cross tab query. Beware that the maximum length for a CL variable is 5,000 characters, and therefore a large SQL statement will be truncated. An error condition is sent to the calling program in this event.

How the Program Works

The CRTCTSQLR program accepts a column expression for the column headings and the data source SQL statement from the command line. From this information it creates a SELECT DISTINCT column expression statement, which is nested around the base SQL statement. For example, if the base SQL statement is SELECT * FROM SALES and the column expression is MONTH(INVDATE), the RPG program would build the following SQL statement to retrieve the DISTINCT values:

SELECT DISTINCT MONTH(INVDATE)
  FROM (SELECT * FROM SALES) A

But you may be asking, how does the program retrieve this set of distinct values from an SQL statement? How would the RPG program know what data type the user is requesting, whether packed, zoned, or varchar, since this information isn't defined at compile time?

The answer to this question lies in the power of SQL's PREPARE statement. PREPARE will validate a SQL statement and analyze it to return a description of the statement's resulting columns along with their attributes (data type, null capable) Since PREPARE also validates the SQL statement, it will inform the user of any errors in the base SQL or column heading.

The data structure where the PREPARE returns this information about a SELECT statement's columns is called the SQL Descriptor Area (SQLDA – don't confuse this with the SQL communications area, aka SQLCA). The format of this data structure can be found in Appendix C of the SQL Reference manual. It basically consists of a header structure that contains information about the statement, including the number of columns in the SELECT. It also contains an array that holds attribute information about each of the columns. To make things easy, the SQL precompiler directive INCLUDE SQLDA can be specified to have the precompiler place the data structure in the code automatically. When letting the precompiler create the SQLDA, the programmer must define a numeric constant field called SQL_NUM. This field will define how many column substructures will be allocated by the SQLDA. In other words, SQL_NUM should contain the maximum number of columns allowed in a SELECT statement for your application. CRTCTSQL defines SQL_NUM as 1, because only one column (the column heading) will ever be returned by this application.

Once the statement has been successfully "prepared," a cursor is opened from the prepared statement. At this point, the RPG program just lets SQL do its magic to retrieve the data from the DB2/400. When working with a dynamic SELECT list, instead of FETCHing data into host variables, the data will be FETCHed into the SQLDA. The SQLDA contains a pointer to the raw data for each column in the SELECT list.

Since there is only a pointer to work with, it's up to the RPG program to interpret the raw data. This isn't hard to do for VARCHAR, CHAR, or TIMESTAMP data, because these types can each be interpreted as fixed character for placement in the cross tab SQL statement. Standard numeric data types such as SMALLINT, DOUBLE, or INTEGER are also easy to work with because RPG also has an equivalent for each of these. Further, the built-in function %CHAR can be used to convert these data types to character for use in an SQL statement.

On the other hand, it isn't easy to work with packed or zoned data, because the program doesn't know beforehand the precision and scale. To convert these two data types to character, the program makes use of the Convert Edit Code to Edit Mask (QECCVTEC) and Format Numeric Data with Edit Mask (QECEDT) APIs. These APIs are useful because, unlike the %EDITC and %EDITW built-in functions, they can handle packed and zoned data at run time. The QECCVTEC API will create an EDIT MASK based on one of the standard edit codes. CRTCTSQL uses Edit Code P so that numeric data will be formatted with a leading sign and no commas, which is what the SQL syntax requires. QECEDT uses the edit mask created from QECCVTEC, the data type (zoned or packed), scale and precision to format the run time data as character.

Once the values of the cross tab's column heading are known and converted to character representation, it's easy to have the RPG program construct the appropriate SQL statement. The RPG program makes use of a temporary Query Manager query to display the results on the screen.

CRTCTSQL is a powerful data analysis tool. In addition to allowing data to be summarized in a two-dimensional matrix, it can also be used to aid programmers in creating complex SQL statements. Whether you're doing ad hoc queries to verify the output of a new COBOL program or putting some sales figures together for your next meeting, the time saving CRTCTSQL will no doubt come in handy.

CRTCTSQL Examples

To further aid your understanding of the utility, here are a few examples of how you might use it.

This first cross-tab query will summarize product sales (unit price * quantity * discount) by category name on the x-axis (column headings, with a maximum of 30 categories) and product ID and name on the y-axis (row headings.) Here is the resulting SQL statement for this CT query:

CRTCTSQL COLHDG(CategoryName) 
          MAXCOL(30) 
          ROWHDGS(ProductID ProductName)
          VALUES(('Round(UnitPrice * Quantity * (1 - Discount
                 ),2)' *SUM *ALL *DECIMAL 9 2))
          SQL('Select CategoryName, A.ProductID, ProductName,
               A.UnitPrice, Quantity, Discount From Northwind
              /OrderDetails A Inner Join Northwind/Products B
               On A.Productid=B.Productid Inner Join Northwin
              d/Categories C On B.Categoryid=C.CategoryId')

This cross-tab SQL command produced the sample SQL statement from above, which is restated here:


SELECT ProductID, ProductName, /* Values */ DECIMAL(SUM(CASE WHEN CategoryName ='Beverages' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Beverages, DECIMAL(SUM(CASE WHEN CategoryName ='Condiments' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Condiments, DECIMAL(SUM(CASE WHEN CategoryName ='Confections' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Confections, DECIMAL(SUM(CASE WHEN CategoryName ='Dairy Products' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Dairy_Products, DECIMAL(SUM(CASE WHEN CategoryName ='Grains/Cereals' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Grains_Cereals, DECIMAL(SUM(CASE WHEN CategoryName ='Meat/Poultry' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Meat_Poultry, DECIMAL(SUM(CASE WHEN CategoryName ='Produce' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Produce, DECIMAL(SUM(CASE WHEN CategoryName ='Seafood' THEN Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS SUM_Seafood /* Base SQL */ FROM (Select CategoryName, A.ProductID, ProductName, A.UnitPrice, Quantity, Discount From Northwind/OrderDetails A Inner Join Northwind/Products B On A.Productid=B .Productid Inner Join Northwind/Categories C On B.Categoryid=C.CategoryId) AS TBL /* Group By */ GROUP BY ProductID, ProductName

This next cross-tab query will show the number of orders each customer had (row headings) for each category (column headings). The *DISTINCT value is used so that an order is only counted once, even if it has multiple lines for the same product category. The resulting SQL statement is placed in a view. A sample of the output is shown here:

CRTCTSQL COLHDG(CategoryName)                               
         MAXCOL(30)                                         
         ROWHDGS(CustomerID CompanyName)                    
         VALUES((ORDERID *COUNT *DISTINCT *INTEGER))        
         SQL('Select CategoryName, C.CustomerID, CompanyName
             , A.OrderID From Northwind/OrderDetails A Inner
              Join Northwind/Products B On A.Productid=B.Pro
             ductid Inner Join Northwind/Orders C On A.Order
             Id=C.OrderID Inner Join Northwind/Customers D O
             n C.CustomerID=D.CustomerID Inner Join Northwin
             d/Categories E On B.CategoryID=E.CategoryID')  
         OUTPUT(*VIEW) 
         VIEW(NORTHWIND/VIEW2)
         RPLVIEW(*YES)

CUSTOMERID COMPANYNAME COUNT_BEVERAGES COUNT_CONDIMENTS COUNT_CONFECT
ALFKI Alfreds Futterkiste 2 4 0
ANATR Ana Trujillo Emparedados y helados 1 0 1
ANTON Antonio Moreno Taquería 4 1 2
AROUT Around the Horn 5 0 5
BERGS Berglunds snabbköp 11 3 6
BLAUS Blauer See Delikatessen 2 1 3
BLONP Blondel père et fils 6 0 2
BOLID Bólido Comidas preparadas 1 1 0
BONAP Bon app' 4 4 6
BOTTM Bottom-Dollar Markets 5 5 6

This last example sums the quantity sold by product ID and year (row headings) and quarter of the year in the columns. The fixed headings are used because there will only be four quarters. Even though expressions are used--Quarter(OrderDate) and Year(OrderDate)--the expressions might better be placed in the base SQL statement with an alias. Part of the reason for this has to do with controlling the heading name that gets created for the row headings. Another reason is that views require an alias name be given to expressions. Attempting to place this statement in a view would fail because the expression Year(OrderDate) is unnamed.

CRTCTSQL COLHDG('Quarter(OrderDate)')
         MAXCOL(30) 
         ROWHDGS(ProductID 'Year(OrderDate)') 
         VALUES((QUANTITY *SUM))               
         SQL('Select OrderDate, ProductID, Quantity From Nor
             thwind/OrderDetails A Inner Join Northwind/Orde
             rs B On A.OrderId=B.OrderID')                  
         OUTPUT(*)                                          
         ORDERBY(*YES)                                      
         FIXEDHDGS((1 Qtr1) (2 Qtr2) (3 Qtr3)
                    (4 Qtr4))

Michael Sansoterra is a programmer/analyst for SilverLake Resources, an IT services firm in Grand Rapids, Michigan. E-mail: msansoterra@silver-lake.com.


Sponsored By
T.L. ASHFORD

BARCODE400 by T.L. Ashford is the easiest
and fastest way to create and print Compliance
Labels directly from the AS/400 and iSeries.

Ashford's comprehensive library of Compliance formats is available to Barcode400 users. AIAG labels for Ford and Motorcraft, GM, and many more are available. BARCODE400 is backed by the best Technical Support Team in the industry.

FREE Guide to Bar Code Labeling

www.tlashford.com or call 800.541.4893


THIS ISSUE
SPONSORED BY:

T.L. Ashford
Esker Software
ASNA
Profound Logic Software


BACK ISSUES

TABLE OF
CONTENTS
Pivot Data with SQL/400

We're Off to See the Wizard of WebFacing

Creating and Consuming an AS/400 Web Service

The Java-Based iSeries System Debugger


Editors
Shannon O'Donnell
Kevin Vandever

Managing Editor
Shannon Pastore

Contributing Editors:
Howard Arner
Raymond Everhart
Joe Hertvik
Ted Holt
David Morris

Publisher and
Advertising Director:

Jenny Thomas

Advertising Sales Representative
Kim Reed

Contact the Editors
Do you have a gripe, inside dope or an opinion?
Email the editors:
editors@itjungle.com


Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.