|
|||||||
|
|
![]() |
|
|
Pivot Data with SQL/400 by Michael Sansoterra [The code for this article is available for download.] Have you ever received an annoying report request from management asking that the data be presented summarized in columns and rows, rather than just by rows? To further complicate things, the data to be presented in the columns wouldn't necessarily be consistent between runs, so you had to dynamically calculate column contents. Sound familiar? Cross tab queries provide the perfect solution for just such a dilemma. Microsoft's Access and Excel have cross tab queries and pivot tables, to provide a programmer or user with the ability to easily create such data transformations. The Create Cross Tab SQL (CRTCTSQL) utility presented here is intended to bring similar functionality to the iSeries. Because it uses embedded SQL and free-form RPG, CRTCTSQL requires the DB2 SQL Development Kit to be installed, along with OS/400 V5R1 or later. Cross tab queries are a splendid tool for data analysis. They allow data to be dynamically summarized by row and column values. In SQL, the GROUP BY clause summarizes data by predefined row values (a one-dimensional summary). A cross tab query can be thought of as providing a two-dimensional GROUP BY, summarizing predefined values by both row and column. Figure 1 shows the result of a simple one-dimensional data summarization. Data is summarized by the unique row values found in the period and category columns.
Figure 1: Sales data is summarized in rows by period and category Here is the SQL that was used to create the data summarization:
Select Period, Category, Sum(SalesAmt) As Sales
From Sales
Group By Period, Category
Order By Period, Category
Figure 2 shows the result differently, with the values in the category column pivoted from a row heading to a column heading.
Figure 2: Sales data is now summarized in rows by period, and in columns by category Here is the SQL statement to produce the "pivoted" data:
Select Period,
Sum(Case When Category='Beverages'
Then SalesAmt End) As Beverages,
Sum(Case When Category='Fruit'
Then SalesAmt End) As Fruit,
Sum(Case When Category='Snacks'
Then SalesAmt End) As Snacks
From Sales
Group By Period
Order By Period
Notice that the SQL CASE statement is used to group the sales amounts into columns by category. But what if you don't know beforehand what your category names are? For example, what if another 10 categories were added to the sales table? The programmer would have to go back and add another 10 CASEs to the SQL statement. This is where the power of cross tab queries comes in. A cross tab query will examine data beforehand and then dynamically build an SQL statement consisting of a variable number of CASE statements. So if in the above example two new categories were added to the sales table, a cross tab query would examine the category column of the sales table, build an SQL statement (like the one shown in Figure 2) with a CASE statement for each distinct value in the Category column, and run the query. The cross tab query has four components: data source, row headings, column heading, and aggregate values. To build a cross tab query with CRTCTSQL, you need to specify a valid SQL SELECT statement as the data source. A sample data source would simply be SELECT * FROM SALES. Row headings are specified by selecting one or more column names or expressions from the base SELECT statement. These row headings are used just like columns specified in a GROUP BY clause; that is, the data is summarized in rows by the values in these columns. Examples of row values could be ITEMNO or SUBSTR(ITEMNO,3,3). The column heading is the name of a single column name or expression. When the cross tab query is run, the column name or expression specified for the column heading is scanned against the base table and a list of unique values is built. Each one of these unique values will be placed in a CASE statement so that the data will be "pivoted" by column. Examples of data to use in a pivoted column would be CUSTNAME or YEAR(DATE) || '-' || MONTH(DATE). The final piece of the cross tab query is the aggregate values. These values are column names or expressions that will participate in one of the standard SQL aggregate functions such as AVG, COUNT, MIN, MAX, STDDEV, SUM. In case you're not familiar with aggregate functions, these are functions that operate on multiple rows of data. For example, the AVG function will return the median average of a numeric column in a given set of rows. The MIN and MAX functions will return the lowest and highest values, respectively, for a given column or expression for a given set of rows. Of course, the values in the aggregates can only be numeric for functions like AVG or SUM, but MIN and MAX can accept other types as well. Examples of aggregate values would be ORDERS (*COUNT) or QTY * PRICE (*SUM). Figure 3 illustrates the cross tab query's parts.
The CRTCTSQL command is designed to build an SQL statement that will emulate a cross tab query. The following is a run down of the parameters comprising the CRTCTSQL command.
One other thing about cross tab queries needs to be said. While they're great for things like ad hoc queries, verifying program output, or creating long SQL statements that you otherwise wouldn't want to type, because the column names can often change from run to run, they're not a very friendly tool to use for production reporting, unless you have the skills and desire to build a sophisticated reporting tool that is able to work with dynamic column names. Under certain circumstances, however, it may be known beforehand that the cross tab's columns will always be the same. For a query displaying sales data, for example, the columns may always be summarized as QTR1, QTR2, QTR3, and QTR4. Or a cross tab query to display customer names can be defined to always show the same 10 customers. In this case, the FIXEDHDGS parameter can be specified. The benefits of working with fixed headings are as follows:
Here are a few other things to be aware of when using the command. As noted earlier, the cross tab SQL will automatically assign names to the columns created based on the value of the column. The names built will start with the column function selected (SUM, MIN, MAX), followed by an underscore ( _ ), followed by the distinct value. Any characters not allowed in a column name will be replaced by an underscore. The maximum length for a column name is 30, so any extra characters will be truncated. Keep in mind that if any two column names end up being the same, a view will not be able to be created, because views demand unique column names. Because the values in the column are used as headings, only character fields with simple data types such as numeric and character (preferably with shorter lengths) are allowed. BLOBs and datalinks, for example, are not supported. To get an idea of how column names are created, here is a sample SQL statement created by CRTCTSQL and highlights to show its components.
SELECT ProductID, ProductName,
/* Values */
DECIMAL(SUM(CASE WHEN CategoryName ='Beverages' THEN Round(UnitPrice *
Quantity * (1 - Discount),2) END),9,2) AS SUM_Beverages,
DECIMAL(SUM(CASE WHEN CategoryName ='Condiments' THEN Round(UnitPrice
* Quantity * (1 - Discount),2) END),9,2) AS SUM_Condiments,
DECIMAL(SUM(CASE WHEN CategoryName ='Confections' THEN Round(UnitPrice
* Quantity * (1 - Discount),2) END),9,2) AS SUM_Confections,
DECIMAL(SUM(CASE WHEN CategoryName ='Dairy Products' THEN
Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS
SUM_Dairy_Products,
DECIMAL(SUM(CASE WHEN CategoryName ='Grains/Cereals' THEN
Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS
SUM_Grains_Cereals,
DECIMAL(SUM(CASE WHEN CategoryName ='Meat/Poultry' THEN
Round(UnitPrice * Quantity * (1 - Discount),2) END),9,2) AS
SUM_Meat_Poultry,
DECIMAL(SUM(CASE WHEN CategoryName ='Produce' THEN Round(UnitPrice *
Quantity * (1 - Discount),2) END),9,2) AS SUM_Produce,
DECIMAL(SUM(CASE WHEN CategoryName ='Seafood' THEN Round(UnitPrice *
Quantity * (1 - Discount),2) END),9,2) AS SUM_Seafood
/* Base SQL */
FROM
(Select CategoryName, A.ProductID, ProductName, A.UnitPrice, Quantity, Discount
From Northwind/OrderDetails A Inner Join Northwind/Products B On A.Productid=B
.Productid Inner Join Northwind/Categories C On B.Categoryid=C.CategoryId) AS
TBL
/* Group By */
GROUP BY ProductID, ProductName
This SQL statement was generated by CRTCTSQL. Individual components of the cross tab query are highlighted, with the column heading portion in purple, the row heading in blue, the aggregate values in red, and the base SQL data source in green. For the output type of *SRC, and *QMQRY, keep in mind that the created SQL statement may not run. That's because the resulting statement will be syntactically correct but may not be fully correct, in the event that a row, column, or value expression is entered incorrectly. In such cases you will not know that there is a problem with the statement until an attempt to run it is made. In contrast, outputs of * and *VIEW will generate an error if something is wrong with the resulting SQL because they attempt to run the SQL statement immediately. CRTCTSQL uses SQL/400's nested SELECT ability to get the data from any SELECT statement. However, the ease of use of nested SELECTs comes with a performance price. It is often a good idea to run a query to select your base data into a small work table to use with CRTCTSQL in order to prevent long running cross tab queries. If you have problems, be sure to check any column expressions and the base SQL statement. (The base SQL will be validated in the event that distinct values need to be retrieved.) It helps to dump the cross tab SQL to a source member for closer examination. Remember, a statement may be syntactically correct and still contain errors. When running a cross tab query over a large table, it may take a while to find the DISTINCT values, so don't be surprised if there is a long delay before the command is finished. Finally, a companion command called Retrieve Cross Tab SQL (RTVCTSQL) is available for retrieving the SQL statement into a CL variable. This is useful for those who want to do something more with the SQL statement, such as building a dynamic reporting utility, before using the cross tab query. Beware that the maximum length for a CL variable is 5,000 characters, and therefore a large SQL statement will be truncated. An error condition is sent to the calling program in this event. How the Program Works The CRTCTSQLR program accepts a column expression for the column headings and the data source SQL statement from the command line. From this information it creates a SELECT DISTINCT column expression statement, which is nested around the base SQL statement. For example, if the base SQL statement is SELECT * FROM SALES and the column expression is MONTH(INVDATE), the RPG program would build the following SQL statement to retrieve the DISTINCT values: SELECT DISTINCT MONTH(INVDATE) FROM (SELECT * FROM SALES) A But you may be asking, how does the program retrieve this set of distinct values from an SQL statement? How would the RPG program know what data type the user is requesting, whether packed, zoned, or varchar, since this information isn't defined at compile time? The answer to this question lies in the power of SQL's PREPARE statement. PREPARE will validate a SQL statement and analyze it to return a description of the statement's resulting columns along with their attributes (data type, null capable) Since PREPARE also validates the SQL statement, it will inform the user of any errors in the base SQL or column heading. The data structure where the PREPARE returns this information about a SELECT statement's columns is called the SQL Descriptor Area (SQLDA – don't confuse this with the SQL communications area, aka SQLCA). The format of this data structure can be found in Appendix C of the SQL Reference manual. It basically consists of a header structure that contains information about the statement, including the number of columns in the SELECT. It also contains an array that holds attribute information about each of the columns. To make things easy, the SQL precompiler directive INCLUDE SQLDA can be specified to have the precompiler place the data structure in the code automatically. When letting the precompiler create the SQLDA, the programmer must define a numeric constant field called SQL_NUM. This field will define how many column substructures will be allocated by the SQLDA. In other words, SQL_NUM should contain the maximum number of columns allowed in a SELECT statement for your application. CRTCTSQL defines SQL_NUM as 1, because only one column (the column heading) will ever be returned by this application. Once the statement has been successfully "prepared," a cursor is opened from the prepared statement. At this point, the RPG program just lets SQL do its magic to retrieve the data from the DB2/400. When working with a dynamic SELECT list, instead of FETCHing data into host variables, the data will be FETCHed into the SQLDA. The SQLDA contains a pointer to the raw data for each column in the SELECT list. Since there is only a pointer to work with, it's up to the RPG program to interpret the raw data. This isn't hard to do for VARCHAR, CHAR, or TIMESTAMP data, because these types can each be interpreted as fixed character for placement in the cross tab SQL statement. Standard numeric data types such as SMALLINT, DOUBLE, or INTEGER are also easy to work with because RPG also has an equivalent for each of these. Further, the built-in function %CHAR can be used to convert these data types to character for use in an SQL statement. On the other hand, it isn't easy to work with packed or zoned data, because the program doesn't know beforehand the precision and scale. To convert these two data types to character, the program makes use of the Convert Edit Code to Edit Mask (QECCVTEC) and Format Numeric Data with Edit Mask (QECEDT) APIs. These APIs are useful because, unlike the %EDITC and %EDITW built-in functions, they can handle packed and zoned data at run time. The QECCVTEC API will create an EDIT MASK based on one of the standard edit codes. CRTCTSQL uses Edit Code P so that numeric data will be formatted with a leading sign and no commas, which is what the SQL syntax requires. QECEDT uses the edit mask created from QECCVTEC, the data type (zoned or packed), scale and precision to format the run time data as character. Once the values of the cross tab's column heading are known and converted to character representation, it's easy to have the RPG program construct the appropriate SQL statement. The RPG program makes use of a temporary Query Manager query to display the results on the screen. CRTCTSQL is a powerful data analysis tool. In addition to allowing data to be summarized in a two-dimensional matrix, it can also be used to aid programmers in creating complex SQL statements. Whether you're doing ad hoc queries to verify the output of a new COBOL program or putting some sales figures together for your next meeting, the time saving CRTCTSQL will no doubt come in handy. CRTCTSQL Examples To further aid your understanding of the utility, here are a few examples of how you might use it. This first cross-tab query will summarize product sales (unit price * quantity * discount) by category name on the x-axis (column headings, with a maximum of 30 categories) and product ID and name on the y-axis (row headings.) Here is the resulting SQL statement for this CT query:
CRTCTSQL COLHDG(CategoryName)
MAXCOL(30)
ROWHDGS(ProductID ProductName)
VALUES(('Round(UnitPrice * Quantity * (1 - Discount
),2)' *SUM *ALL *DECIMAL 9 2))
SQL('Select CategoryName, A.ProductID, ProductName,
A.UnitPrice, Quantity, Discount From Northwind
/OrderDetails A Inner Join Northwind/Products B
On A.Productid=B.Productid Inner Join Northwin
d/Categories C On B.Categoryid=C.CategoryId')
This cross-tab SQL command produced the sample SQL statement from above, which is restated here:
This next cross-tab query will show the number of orders each customer had (row headings) for each category (column headings). The *DISTINCT value is used so that an order is only counted once, even if it has multiple lines for the same product category. The resulting SQL statement is placed in a view. A sample of the output is shown here:
CRTCTSQL COLHDG(CategoryName)
MAXCOL(30)
ROWHDGS(CustomerID CompanyName)
VALUES((ORDERID *COUNT *DISTINCT *INTEGER))
SQL('Select CategoryName, C.CustomerID, CompanyName
, A.OrderID From Northwind/OrderDetails A Inner
Join Northwind/Products B On A.Productid=B.Pro
ductid Inner Join Northwind/Orders C On A.Order
Id=C.OrderID Inner Join Northwind/Customers D O
n C.CustomerID=D.CustomerID Inner Join Northwin
d/Categories E On B.CategoryID=E.CategoryID')
OUTPUT(*VIEW)
VIEW(NORTHWIND/VIEW2)
RPLVIEW(*YES)
This last example sums the quantity sold by product ID and year (row headings) and quarter of the year in the columns. The fixed headings are used because there will only be four quarters. Even though expressions are used--Quarter(OrderDate) and Year(OrderDate)--the expressions might better be placed in the base SQL statement with an alias. Part of the reason for this has to do with controlling the heading name that gets created for the row headings. Another reason is that views require an alias name be given to expressions. Attempting to place this statement in a view would fail because the expression Year(OrderDate) is unnamed.
CRTCTSQL COLHDG('Quarter(OrderDate)')
MAXCOL(30)
ROWHDGS(ProductID 'Year(OrderDate)')
VALUES((QUANTITY *SUM))
SQL('Select OrderDate, ProductID, Quantity From Nor
thwind/OrderDetails A Inner Join Northwind/Orde
rs B On A.OrderId=B.OrderID')
OUTPUT(*)
ORDERBY(*YES)
FIXEDHDGS((1 Qtr1) (2 Qtr2) (3 Qtr3)
(4 Qtr4))
Michael Sansoterra is a programmer/analyst for SilverLake Resources, an IT services firm in Grand Rapids, Michigan. E-mail: msansoterra@silver-lake.com.
|
Editors
Contact the Editors |
| Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved. |