Extracting Sample Data From A DB2 for i Table
November 3, 2015 Michael Sansoterra
The need to retrieve a sample or subset of data from a database is a relatively common (and somewhat annoying) task. For example, DB2 for i developers often want to extract data and create a scaled down copy of a library (a.k.a. schema) to develop with. Additionally, sometimes code testing can be done using random data, such as creating a battery of orders to process with varying customers, order terms, etc. This tip won’t attempt to tackle the full breadth of either of these scenarios. However, the aim is to show how some simple queries can be used to create periodic or random extracts from DB2 tables.
The code shown below was written and tested using DB2 for i 7.2 PTF Group Level 8 but should work on all versions of IBM i since V5R4.
Scenario 1: Extract periodic data
If you want a small subset of data for development, one way to achieve it is to select data with some sort of regular/periodic interval. The following query illustrates how to select every fourth row from the SalesOrderHeader table:
WITH SalesData AS ( SELECT h.*,ROW_NUMBER() OVER(ORDER BY SalesOrderId) AS ROW_ID FROM AdvWorks.SalesOrderHeader h ) SELECT * FROM SalesData WHERE MOD(ROW_ID,4)=0
The ROW_NUMBER function is used to assign a sequential number to the result set, and the MOD (modulus) function is used to filter every fourth row (every row with an ID evenly divisible by four). For the record, once you have the subset of the primary data (such as the order header table) the task of bringing the remaining related data (order lines, shipment info, etc.) from into the sample library is up to you; I have no magic bullet to offer here.
If you happen to be one of the lucky few with a well-defined set of foreign key constraints on parent tables, another option is to copy the entire library, and then delete all rows from a parent table except every fourth row. Assuming a foreign key relationship exists with ON DELETE CASCADE specified, removing data from the parent table will also delete data from the related child tables so that the data reduction is accomplished in one simple step:
/* SalesOrderDetail rows will be removed as well due to the FK */ DELETE FROM AdvWorks.SalesOrderHeader h WHERE SalesOrderId NOT IN ( SELECT SalesOrderId FROM ( SELECT h.*,ROW_NUMBER() OVER(ORDER BY SalesOrderId) AS ROW_ID FROM AdvWorks.SalesOrderHeader h) Subset WHERE MOD(ROW_ID,4)=0 )
Almost any variation on a periodic extract can be done. For example, if you want to extract sample data containing the first two orders from every month, the following query will do the trick:
WITH SalesData AS ( SELECT h.*, ROW_NUMBER() OVER(PARTITION BY YEAR(OrderDate),MONTH(OrderDate) ORDER BY OrderDate) AS Month_Order_Id FROM AdvWorks.SalesOrderHeader h ) SELECT * FROM SalesData WHERE Month_Order_Id>=2
In this case, the rows are numbered sequentially using ROW_NUMBER. The PARTITION BY causes the numbering to restart with each new month (MONTH_ORDER_ID column). This numbering scheme can be used to limit the results to only the first two rows of each month. The abridged results will look something like this:
Scenario 2: Extract a random row
Let’s say you’re writing a testing script to test code for a timesheet entry application. The testing requires the creation of several timesheet entries using a random male employee. How do you extract a random employee? It’s as easy as:
SELECT c.* FROM CorpData.Employee c WHERE SEX='M' ORDER BY 1 FETCH FIRST 1 ROW ONLY
The RAND() function will a random value between zero and one. In the case of DB2 for i, the RAND() function is non-deterministic meaning that the value it returns changes upon every invocation. In the above query, DB2 for i assigns a random number to every employee row, sorts the result set by the random number, and then returns the first one in the list. The result will most likely be a different employee every time the query is run.
Note that the RAND() function may be deterministic on other database platforms such as SQL Server, meaning that the function returns the same value for every statement invocation. This is not the case in DB2 for i.
Exercise caution when using this technique. The resulting query plan will perform a table scan and assign a random value to each row (I/O and CPU intensive), which may not be a pretty sight depending on the volume of data:
For small tables in an occasional testing scenario, this type of code is likely insignificant. However, if run against a million-plus row table in frequently used production code, it would most likely tax your hardware.
Other options for picking a random row from a table involve converting the RAND() function to an integer (or decimal) value and then using the integer to select a value from a sequential identity column or relative record number (using the RRN function). This technique requires that the random number generator know the limits of the identity column (or RRNs) and also assumes that there are not any gaps in the numbering.
This example shows how to generate a random integer value between 1 and 100,000:
Applied to a SELECT statement, this example picks a random product with an ID between 700 and 999 (and assumes all IDs are present):
SELECT * FROM ADVWORKS.PRODUCT WHERE PRODUCTID=INT(RAND()*299)+700