DB2 For i Scalar Function Performance Considerations
March 5, 2014 Michael Sansoterra
I love using user-defined functions (UDFs) in DB2 for i. They encapsulate business logic, provide for code modularity and improve code reusability. But they can also flush performance down the drain. So in this tip, I’d like to use a sample UDF to analyze how it can affect the performance of an SQL statement.
For this illustration, I’ve exported the Microsoft SQL Server Adventure Works sample data to DB2 for i. There are three main tables to consider in this query; SalesOrderHeader (31465 rows), SalesOrderDetail (121317 rows) and ProductCostHistory (22680 rows; I dumped in some extra randomized data into this table). Figure 1 below shows the relationships between the tables including the Product table, which is not required.
The goal of the query is simple: read the entire sales order history and show the order date, product ID, net product sales, product cost and gross profit. Simply joining the order header and detail tables returns most of the required info:
SELECT soh.OrderDate,sod.SalesOrderId, sod.ProductId,OrderQty*UnitPrice As Sales FROM SalesOrderDetail sod JOIN SalesOrderHeader soh ON soh.SalesOrderId=sod.SalesOrderId
All that is left is to get the product cost and then calculate the net profit.
The query would be trivial except for a snafu. Since the product cost varies with time, a lookup of the product cost in the product cost history table (based on order date) is required. Further, it would be nice if I could count on all of the date ranges in the product cost history table to be contiguous (i.e., no overlapping start dates and end dates for a given product). But alas, due to the sad state of the data in the database, the query has to be careful to fetch only one cost per product ID and date, lest an erroneous one-to-many situation cause too many rows be returned.
To get the product cost, my first instinct is to use a UDF to do the cost lookup in the cost history table. The UDF is defined as follows:
CREATE OR REPLACE FUNCTION GetProductCost( @ProductID INT, @EffectiveDate TIMESTAMP) RETURNS DECIMAL(19,4) LANGUAGE SQL RETURNS NULL ON NULL INPUT DETERMINISTIC DISALLOW PARALLEL NOT FENCED SET OPTION DATFMT=*ISO,USRPRF=*OWNER,COMMIT=*NONE RETURN ( SELECT StandardCost FROM ProductCostHistory WHERE ProductId=@ProductId AND @EffectiveDate BETWEEN STARTDATE AND ENDDATE FETCH FIRST 1 ROW ONLY );
The savvy SQL reader will note that when using a FETCH FIRST clause there should be an ORDER BY clause to guarantee consistent results in the event that an effective date matches multiple date ranges in the table. I skipped that step for simplicity.
Plugging the GetProductCost UDF into the query above to calculate the product cost and the net profit yields the following revised query:
SELECT soh.OrderDate,sod.SalesOrderId, sod.ProductId,OrderQty*UnitPrice As Sales, OrderQty*GetProductCost(sod.ProductId,soh.OrderDate) AS Cost, OrderQty*(UnitPrice - GetProductCost(sod.ProductId,soh.OrderDate)) AS NetProfit FROM SalesOrderDetail sod JOIN SalesOrderHeader soh ON soh.SalesOrderId=sod.SalesOrderId
I now admire the UDF, knowing that I can reuse it throughout the application and even hand it to a junior developer to use, being confident that the results will be consistent. But I wonder what the performance overhead is.
I know that performance should be optimal as the query currently returns 120K rows and will return even more well into the future. Running the query with and without the function in place reveals that the UDF adds considerable time to the query with the large result set.
Looking to tweak the performance, I know that RPG is very good at keyed lookups and wonder if an RPG function would perform better than SQL. So I wrote an RPG UDF called GetProductCostRPG that has this logic to approximate the SQL UDF’s logic:
SetGt (ReqProduct:ReqDate) ProductCostHistory; ReadPE (ReqProduct) ProductCostHistory; If ReqDate>=StartDate And ReqDate<=EndDate And NOT %Eof(ProductCostHistory); StandardCost=STAND00001; Return; Else; StdCost_NI=-1; EndIf;
ReqProduct and ReqDate are the function’s input parameter values. STAND00001 is the StandardCost column from the ProductCostHistory table.
The RPG code is compiled as a program object and the performance is better than the SQL UDF, which is an embedded SQL C program under the covers. But the improvement is not enough. Hoping for more performance gains, I reworked the RPG code as a service program and created function GetProductCostRPGSRV. The result was the performance went up a tad. Hopefully, IBM will one day give developers the option to specify a PROGRAM TYPE of MAIN (*PGM) or SUB (*SRVPGM) when creating SQL functions.
While I love UDFs, I’m still not happy with how much they’re costing the query. So I looked for a way to write the query without a UDF. The following query will do the trick:
SELECT soh.OrderDate,sod.SalesOrderId,sod.ProductId, StandardCost,OrderQty*UnitPrice As Sales, OrderQty*StandardCost AS Cost, OrderQty*(UnitPrice-cost.StandardCost) AS NetProfit FROM SalesOrderDetail sod JOIN SalesOrderHeader soh ON soh.SalesOrderId=sod.SalesOrderId LEFT JOIN LATERAL ( SELECT ProductId,StandardCost FROM ProductCostHistory pch WHERE pch.ProductId=sod.ProductId AND soh.OrderDate BETWEEN STARTDATE AND ENDDATE FETCH FIRST 1 ROW ONLY ) cost ON cost.ProductId=sod.ProductId
Notice that the function’s logic is embedded in the LEFT JOIN as a nested table expression. The lateral keyword instructs the LEFT JOIN to run the nested table expression in a row by row fashion; this allows DB2 to correlate the nested table expression’s tables with the tables in the outer query. LATERAL is synonymous with TABLE.
Now, performance of the statement without using UDFs is something I’m happy with! Table 1 below shows the results of the different methods used.
To measure the query run times without worrying about network latency, I measured the query run times in an aggregate query that returned one row. So basically all the processing is done in DB2 for i with minimal results given back to the client. The queries looked like this, with the various UDF names substituted as appropriate:
SELECT SUM(Sales),SUM(Cost),SUM(NetProfit) FROM ( SELECT soh.OrderDate,sod.SalesOrderId, sod.ProductId,OrderQty*UnitPrice As Sales, OrderQty*GetProductCost(sod.ProductId,soh.OrderDate) AS Cost, OrderQty*(UnitPrice - GetProductCost(sod.ProductId,soh.OrderDate)) AS NetProfit FROM SalesOrderDetail sod JOIN SalesOrderHeader soh ON soh.SalesOrderId=sod.SalesOrderId ) QueryTest
Of course your results will vary. This test was done on a P50 with minimal other activity, so the performance differences could be starker on a P5 or P10 box with other processes running.
It’s possible to conclude in this case that the UDFs are more trouble than they’re worth, but that would be a premature conclusion. There was something obvious I forgot when defining my functions; I didn’t include “NO EXTERNAL ACTION.” Here is the CREATE FUNCTION statement wrapper for the RPG service program version. NO EXTERNAL ACTION wasnâ€™t originally specified.
CREATE OR REPLACE FUNCTION AdvWorks.GetProductCostRPGSRV (@ProductID INT, @EffectiveDate TIMESTAMP) RETURNS DECIMAL(19,4) LANGUAGE RPGLE PARAMETER STYLE SQL DETERMINISTIC NO SQL RETURNS NULL ON NULL INPUT NOT FENCED DISALLOW PARALLEL NO EXTERNAL ACTION EXTERNAL NAME 'DEV/GETCSTR2(GETCOST)'
DETERMINISTIC means that when the same input parameters are given to the function, the same output will always be returned from the function within the duration of current query execution. A UDF calculating the volume of a sphere, based on the input of the radius, is deterministic. A random number or current time UDF would not be deterministic because the UDF output can’t be determined from the input.
Specifying NO EXTERNAL ACTION means that there is nothing happening in the function that is not under the database manager’s control. Examples of an “external action” would be sending a message, accessing the IFS, writing to a data queue, etc. If DB2 thinks an “external action” is possible within the function, then it will call the function every single time, even if its output is deterministic. NO EXTERNAL ACTION tells DB2 it may not need to run the function for every row in the query.
When DETERMINISTIC and NO EXTERNAL ACTION are specified, DB2 has the option to cache the data from the UDF in order to improve performance, if it can. This is easy to show with Visual Explain.
Figure 2 below is Visual Explain’s query plan representation for the query when the default of EXTERNAL ACTION is present. The highlighted gray box titled “complicated” is how Visual Explain represents the UDF in the query plan:
Figure 3 below shows Visual Explain’s execution plan representation for the query when the function has NO EXTERNAL ACTION specified. Notice that there are two new steps, cache and cache probe:
The results from the small change to the function definition are impressive. Table 2 shows the mammoth improvement in execution times:
With the UDF caching enabled, there is no discernible difference in performance for the various function versions: SQL vs. RPG vs. RPG *SRVPGM. Notice during the first two trials, the times are improved but comparatively high to the query without a UDF. But by the third execution it appears that the SQE has somehow figured out the best way to cache the UDF results and thereby dramatically improve the query run time (even better than the statement without a function).
Keep in mind that this type of performance improvement may not always be present; it just depends on how much data DB2 has to cache and whether it can do it effectively. The ProductCostHistory table is relatively static; there aren’t many changes to it within a single week’s time. But if a UDF is based on a table that is continually changing, then DB2 may not be able to show these kinds of performance gains. That means each situation requires some experimentation.
One other interesting thing to note is when I tried the FENCED/ALLOW PARALLEL options, performance degraded. So be careful and test when using various function options. To summarize, UDFs can end up making your queries expensive; especially if the wrong keywords are specified.
So there is no right or wrong answer about when it’s appropriate to use UDFs. I’ll finish by leaving some general guidelines I use:
1. I average the execution time of several invocations of the UDF and approximate how long it would take to run it over 100,000 rows. A fast UDF that executes in about 1ms will still add 100 seconds (1:40) to a query that processes the function 100K times. This is too much overhead for my liking. Of course, there are occasions when DB2 can cache UDF results, which may alleviate much of the performance penalty.
2. SQL development often pits performance against reusability. When coding for performance, I will type out simple to intermediate column expressions in various queries instead of using a UDF. This makes the code harder to maintain but on systems with large volumes of data, I’d rather have the performance benefit.
3. If an expression’s logic is complicated or doing something beyond what SQL can natively do, then I’ll use a UDF.
4. I generally try to compare execution times with and without the UDF to get an idea of how much time the UDF is consuming. Often times LATERAL can help eliminate some UDFs that lookup data.
5. When I decide a UDF may be appropriate, I spend some time considering which language to use. SQL is getting hard to beat, so if the UDF has to operate on a set of data with many rows, I usually choose SQL. For logic involving many single I/O operations requiring the use of positioned lookups, an RPG service program is a good choice. When creating a UDF based on code libraries downloaded from the internet, such as a library capable of dynamically evaluating math expressions, I will look for a C/C++ solution before a Java solution because it’s likely to perform better.