New OLAP Aggregate Functions In DB2 for i, Part 1
July 12, 2016 Michael Sansoterra
In my last tips about the new OLAP features in DB2 for i 7.3, I discussed the OLAP Aggregation Specification and the new LAG and LEAD OLAP functions. In this article and the next one, I continue the discussion of new OLAP features by highlighting four new OLAP aggregate functions: FIRST_VALUE, LAST_VALUE, nTH_VALUE, and RATIO_TO_REPORT. The value of these functions is that they can reference data from other rows in a query result set relative to the current row.
The new OLAP aggregate functions differ from normal aggregate functions (such as MIN, MAX, SUM, and AVG) in that they can only be used in conjunction with an OLAP aggregation specification (OAS); whereas the normal aggregates can be used with or without an OAS (e.g., they can be used in a GROUP BY summary query). Recall that the aggregation spec consists of one or more components, including a window partition clause, window order clause, and a window aggregation group clause. Since the DB2 for i terminology can be confusing, I’ll get right to an example. If you’re unfamiliar with OLAP aggregation spec, please review the Related Stories listed at the end of this tip.
FIRST_VALUE And LAST_VALUE
As their names imply, these functions will extract the first or last value of a column or expression within an OLAP window. Say you have a help desk application for tracking support calls. The two primary tables in the app are SupportIncident (header) and SupportCall (detail):
CREATE OR REPLACE TABLE SupportIncident ( Support_Id INT NOT NULL GENERATED BY DEFAULT AS IDENTITY, Product_Group VARCHAR(10) NOT NULL, Name NVARCHAR(80) NOT NULL, TimeZone VARCHAR(10) NOT NULL, EMail VARCHAR(96) NOT NULL, Phone VARCHAR(20) NOT NULL, First_Contact TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, CONSTRAINT pk_SupportIncident PRIMARY KEY (Support_Id) ); CREATE OR REPLACE TABLE SupportCall ( Support_Id INT NOT NULL, Support_Call_Id INT NOT NULL AS IDENTITY, Support_Timestamp TIMESTAMP(0) NOT NULL, Support_By_Phone CHAR(1) NOT NULL DEFAULT 'Y', Duration_Sec INT NOT NULL, Remarks CLOB(10K), Rep_Id INT NOT NULL, CONSTRAINT pk_SupportCall PRIMARY KEY (Support_Id,Support_Call_Id), CONSTRAINT fk_Support_Id FOREIGN KEY (Support_Id) REFERENCES SupportIncident (Support_Id) );
Management has requested that support calls be listed as detail lines on a “support incident” report. The query to feed this report is shown here:
SELECT c.* FROM SupportIncident i JOIN SupportCall c ON i.Support_Id=c.Support_Id
Further, because some incidents may require several calls (possibly fielded by multiple support center representatives) from a customer before their issue is resolved, when reviewing the detail it’s often helpful to know the customer support representative(s) who opened and closed the incident.
Therefore, each detail line should also include the support representative who opened the incident and the representative who closed the incident (or who last talked to the customer if the incident is still open). The support representative ID is contained in column REP_ID. To fulfill this requirement, simply use the new FIRST_VALUE and LAST_VALUE OLAP aggregate functions:
SELECT c.*, First_Value(Rep_Id) OVER( PARTITION BY c.Support_Id ORDER BY Support_Timestamp) AS First_Rep_Id, Last_Value(Rep_Id) OVER( PARTITION BY c.Support_Id ORDER BY Support_Timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS Last_Rep_Id FROM SupportIncident i JOIN SupportCall c ON i.Support_Id=c.Support_Id
The abridged results look something like this (with some columns omitted):
The source values for the First_Rep_Id and Last_Rep_Id columns are highlighted in red.
FIRST_VALUE will get the first value in a windowed set of rows and likewise LAST_VALUE will retrieve the last value. The OLAP aggregation spec controls the scope (PARTITION BY) and order (ORDER BY) of the window of rows that are to be considered by the OLAP function. Recall that this windowing concept allows these OLAP functions to be used in “detail queries” that don’t require a GROUP BY clause.
Looking closely at the use of these new functions:
First_Value(Rep_Id) OVER( PARTITION BY c.Support_Id ORDER BY Support_Timestamp) AS First_Rep_Id, Last_Value(Rep_Id) OVER( PARTITION BY c.Support_Id ORDER BY Support_Timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS Last_Rep_Id
It’s clear that the OLAP window consists of rows bounded by the Support_Id (PARTITION BY Support_Id) and the rows within the window are sorted by the support call’s timestamp. Further, because these functions are OLAP aggregation functions, the data they retrieve is from a row that is relative to the current row in relation to the current OLAP window.
By default, this class of functions will operate on the rows from the start of the window through the current row, as though the following “window aggregation group clause” was specified:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
This default works fine for the FIRST_VALUE function because the first row of the window is always available for consideration for any given partition. However, the last row in the window is not available by this default. Therefore, a different “window aggregation group clause” is required to instruct LAST_VALUE that it needs to consider all rows from the current row through the end of the window as follows:
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
Without this window aggregation group clause, the LAST_VALUE function returns the value from the current row. If this seems confusing, I don’t blame you and suggest you experiment with LAST_VALUE with and without this window aggregation group clause and watch how the behavior changes.
The nTH_VALUE function is similar to FIRST_VALUE and LAST_VALUE with the exception that the requested value is extracted from row n, where n is the row number within the window relative to the start or the end of the window.
Going back to the support query, say management concludes that knowing the rep ID for the third call in a long series of calls is pivotal to understanding how to improve customer service and shorten the number of calls. So, management requests for the rep ID of the third call in an incident to be included on the report as its own column. In this case, the nTH_VALUE function can be used to retrieve the third customer service rep ID in the chain of calls:
SELECT c.*, First_Value(Rep_Id) OVER(PARTITION BY c.Support_Id ORDER BY Support_Timestamp) AS First_Rep_Id, Last_Value(Rep_Id) OVER(PARTITION BY c.Support_Id ORDER BY Support_Timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS Last_Rep_Id, nTH_Value(Rep_Id,3) FROM FIRST OVER(PARTITION BY c.Support_Id ORDER BY Support_Timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "3rd_Rep_Id" FROM SupportIncident i JOIN SupportCall c ON i.Support_Id=c.Support_Id
As shown in the example, nTH_VALUE requires a second parameter which is the row number within the OLAP window. A literal value of three is given, which means take the requested value (REP_ID) from the third row in the window. The default FROM FIRST indicates the function will get the third from the start of the window. The alternative option is FROM LAST, which would cause the function to get the third row from the end of the window. If there are less than three rows in the window, the function returns NULL.
Based on the result set shown above, for support incident 1, the “3rd_Rep_Id” column will contain the value “8”, which is the Rep_Id in row 3.
As with LAST_VALUE, nTH_VALUE also requires a non-default window aggregation group clause to make this function evaluate the third row over the entire window:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Without this clause (keeping in mind that the default window is the first row through the current row), the nTH_VALUE function would not return a value until row three or later became the current row. In other words, the first two rows would contain a NULL value for the “3rd_Rep_Id” column.
FIRST_VALUE, LAST_VALUE and nTH_VALUE can have an additional option of RESPECT NULLS (default) or IGNORE NULLS specified as the last argument. When IGNORE NULLS is specified, the requested OLAP function will only consider non-null values within the window:
Last_Value(REMARKS) IGNORE NULLS OVER(PARTITION BY c.Support_Id ORDER BY Support_Timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS Last_Remark
In this case, LAST_VALUE will return the last non-null REMARKS column for the given support ID. For nTH_VALUE, using IGNORE NULLS will potentially cause the function’s behavior to change from retrieving data from an absolute position in the window to a relative row position that is based on whether NULLs are present.
There’s more to tell. Look for the RATIO_TO_REPORT function, coming in my next article.