Guru: Use SQL to Generate Random Data
June 17, 2019 Ted Holt
Suppose I needed to generate a large database table with random data in order to adequately test the performance of an SQL query. Suppose that, for security reasons, I was not allowed to copy the production version of the table. Suppose that I needed a way to generate a lot — and I do mean a lot! — of random data. Suppose this scenario is not mere supposition.
Before an SQL query goes into production, it should be tested against a production-like dataset. Running a query against a test dataset of 25 rows (records) can produce unpleasant surprises when it’s installed into production and forced to face a production dataset of thousands or even millions of rows. Obviously we can’t key that much data, so how do we generate it? We use SQL. Let me give you an example.
Suppose you wish to test a query against a single table. (That’s not realistic, but it will keep the example small and I can illustrate the techniques.) You already have a copy of your table, and you just need to load data into it. However, since this is an illustration, here’s the statement to create my illustrative table.
create table mylib.Cust24 ( ID dec(5), Name char(20), Type dec(1), Balance dec(5,2), DueDate date, primary key (ID));
You need data for these columns (fields). You might just use a sequential number for ID, the primary key. If so, a recursive common table expression generates the sequence.
with list (IDNbr) as (values(1) union all select IDNbr + 1 from list where IDNbr < 25) select IDNbr from list
This generates 25 rows with sequential ID numbers. You’d generate many more, of course.
Now put random data into the other columns. The Type column should have an integer from 1 to 3.
dec(rand()*3+1, 1,0)
The rand() function generates a random number in the range zero to one. Multiplying that number by three changes it to a random number in the range zero to three. Adding one to that puts it into the range one to four. (Chances of the number being four are slim to none. I’ve never seen a random number generator return a zero or a one, only numbers in between them.) The dec() function removes the decimal places, leaving a whole number.
The Balance column is a five-digit packed-decimal number. If you want a positive number, you can use an expression much like the one for the Type column.
dec(rand()*999, 5, 2)
If you want some negative numbers as well, adjust the expression to give you some.
dec(rand()*1000, 5, 2)-500
A random date isn’t hard to get. Use the rand() function to generate a number of days to adjust a reference date, such as the current date.
current date + int(rand()*100) days
This expression adds up to 100 days to the current date.
I’ve saved the tricky part for last. You need to generate a random Name value. The rand() function only returns numeric values. How do you translate those numeric values into character strings? It’s slightly messy, but not difficult. Here’s one possibility.
with list (IDNbr, RandName) as (values(1, dec(rand()*100000,5,0)) union all select IDNbr + 1, dec(rand()*100000,5,0) from list where IDNbr < 25) select IDNbr, substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.1 ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.01 ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.001 ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.0001,53)+1), 1)
Let’s break that down a piece at a time.
with list (IDNbr, RandName) as (values(1, dec(rand()*100000,5,0)) union all select IDNbr + 1, dec(rand()*100000,5,0) from list where IDNbr < 25)
RandName is a five-digit packed-decimal random number. You can use RandName to generate subscripts into an array of characters.
substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName,53)+1), 1)
The mod() function divides RandName by 53, the number of characters in the literal, returning a number from zero to 52. Adding one gives me a valid subscript value to the string. Substr() returns one character from the literal. You have one random character, which is either a letter or space.
You can generate more characters by dividing RandName by other powers of ten.
substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.1 ,53)+1), 1) substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.01,53)+1), 1)
Here’s the entire query.
with list (IDNbr, RandName) as (values(1, dec(rand()*100000,5,0)) union all select IDNbr + 1, dec(rand()*100000,5,0) from list where IDNbr < 25) select IDNbr, substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.1 ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.01 ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.001 ,53)+1), 1) concat substr('ABCDEFGHIJKLMNOPQRSTUVWYXZ abcdefghijklmnopqrstuvwxyz', int(mod(RandName*0.0001,53)+1), 1) as Name, dec(rand()*3+1, 1,0) as Type, dec(rand()*1000, 5, 2) as Balance, current date + int(rand()*100) days as DueDate from list;
Does it work? You be the judge.
select * from cust24;
I’ve included links to two articles from the IBM Knowledge Center for your benefit. (See the first two links under Related Stories.) They’re not for Db2 for i, but you may find them helpful. Be aware that the queries may not work on your system. One example generates character data by putting the mod() function in the second parameter of the translate() function, and Db2 for i does not allow that (yet).
One shortcoming of using a large table of random data is that you may not be assured that all cases get tested. But you would have already done that test with a small dataset that contains all the data needed to fully exercise your code. The purpose of testing against random data is performance, not accuracy.
I have a list of about 8000 common names. I need a function to replace the names in a table with mock data. This solution cannot work in a function. NewID() and RAND() are not supported in functions.