Parsing Delimited Text Data Natively in SQL, Part 1
January 26, 2016 Michael Sansoterra
Note: The code accompanying this article is available for download here.
A common task for database developers is to accept a delimited text file, parse it, and dump it into a database table. This tip demonstrates a user-defined table function (UDTF) that can accomplish this task based on delimited text data stored in a CLOB or in an IFS file.
The UDTF is called ParseFileContent and it accepts three parameters: the CLOB data to parse, the row delimiter character(s), and the column delimiter character(s). The UDTF returns a ROWID column that represents the line number of the text data, plus 250 VARCHAR(128) columns named C1, C2, and so on through C250. Sample usage looks like this:
INSERT INTO MYLIB.DATA
SELECT ROWID, C1 AS PRODUCTID, C2 AS PRODUCTDESCRIPTION
  FROM TABLE(ParseFileContent(
       GET_CLOB_FROM_FILE('/tmp/ProductData.txt'),
       x'0D25', x'05')) Data
The sample statement above uses the built-in function GET_CLOB_FROM_FILE to transform the IFS file ProductData.txt into a CLOB. The row delimiter is x'0D25' (carriage return/line feed) and the column delimiter is x'05' (tab). Columns C1 and C2 are aliased as PRODUCTID and PRODUCTDESCRIPTION, respectively.
You probably already know the CPYFRMIMPF command can perform much the same task. However, the SQL UDTF brings the power of native SQL to this kind of utilitarian task and allows the import process to take advantage of SQL programming.
SQL UDTF Code
There are two table functions: ParseFileContent and a secondary helper function named ParseData. The abridged function header for ParseFileContent is defined as follows:
CREATE OR REPLACE FUNCTION ParseFileContent (
    @FileContent  CLOB,
    @RowDelimiter VARCHAR(12),
    @ColDelimiter VARCHAR(12))
RETURNS TABLE (
    RowId INTEGER,
    C1    VARCHAR(128),
    C2    VARCHAR(128),
    ...
    C250  VARCHAR(128))
LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
SET OPTION COMMIT=*CS
The code uses the PIPE statement recently introduced in DB2 for i (IBM i 7.1 TR9 or 7.2 TR1) and also relies on support for a large number of output columns (IBM i 7.1 TR10 or IBM i 7.2 TR2). Users on systems that don't have PIPE available can modify the code to insert each row into a temporary table in lieu of using PIPE. Users on systems that can't yet create a UDTF with a large number of output table columns can simply reduce the number of columns the function returns (approximately 120 output columns maximum).
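For the PIPE-less route, one possible shape is to recast the helper as a stored procedure that fills a session temporary table, which the caller then queries. This is only a sketch; the procedure and table names (ParseDataToTemp, SESSION.PARSED_ROWS) are hypothetical:

```
-- Hypothetical PIPE-less sketch: a procedure fills a temporary table
-- instead of piping rows back from a table function.
CREATE OR REPLACE PROCEDURE ParseDataToTemp (
    @Data      CLOB,
    @Delimiter VARCHAR(12))
LANGUAGE SQL
BEGIN
    DECLARE @START INTEGER DEFAULT 1 NOT NULL;
    DECLARE @END1  INTEGER DEFAULT 0 NOT NULL;
    DECLARE @ROWID INTEGER DEFAULT 0 NOT NULL;

    -- Session-scoped work table; WITH REPLACE discards any prior copy
    DECLARE GLOBAL TEMPORARY TABLE SESSION.PARSED_ROWS
        (ROWID INTEGER, FIELDDATA VARCHAR(32000))
        WITH REPLACE ON COMMIT PRESERVE ROWS;

    SET @END1=LOCATE(@Delimiter,@Data);
    WHILE @END1>0 DO
        SET @ROWID=@ROWID+1;
        INSERT INTO SESSION.PARSED_ROWS
            VALUES (@ROWID, SUBSTRING(@Data,@START,@END1-@START));
        SET @START=@END1+LENGTH(@Delimiter);
        SET @END1=LOCATE(@Delimiter,@Data,@START);
    END WHILE;
    -- Emit the final segment after the last delimiter
    INSERT INTO SESSION.PARSED_ROWS
        VALUES (@ROWID+1, SUBSTRING(@Data,@START,LENGTH(@Data)-@START+1));
END
```

The caller would then SELECT from SESSION.PARSED_ROWS where it previously invoked the table function.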
The secondary ParseData UDTF does the work of parsing a CLOB by the given row delimiter, returning a row with a single column called FieldData for every text line in the CLOB:
CREATE OR REPLACE FUNCTION ParseData (
    @Data      CLOB,
    @Delimiter VARCHAR(12))
RETURNS TABLE (
    RowId     INTEGER,
    FieldData VARCHAR(32000))
LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
RETURNS NULL ON NULL INPUT
SET OPTION COMMIT=*NONE
BEGIN
    DECLARE @START INTEGER DEFAULT 1 NOT NULL;
    DECLARE @END1  INTEGER DEFAULT 0 NOT NULL;
    DECLARE @ROWID INTEGER DEFAULT 0 NOT NULL;
    DECLARE @FIELD VARCHAR(32000);

    SET @END1=LOCATE(@Delimiter,@Data);
    WHILE @END1>0 DO
        SET @FIELD=SUBSTRING(@Data,@START,@END1-@START);
        SET @ROWID=@ROWID+1;
        PIPE (@ROWID, @FIELD);
        SET @START=@END1+LENGTH(@Delimiter);
        SET @END1=LOCATE(@Delimiter,@Data,@END1+LENGTH(@Delimiter));
    END WHILE;
    -- Emit the final segment after the last delimiter
    SET @FIELD=SUBSTRING(@Data,@START,LENGTH(@Data)-@START+1);
    SET @ROWID=@ROWID+1;
    PIPE (@ROWID, @FIELD);
    RETURN;
END
As evidenced by the FieldData column definition, the table function can only process a delimited file with a row length of 32,000 characters or less.
The ParseFileContent UDTF is too large to show here because of the large number of output columns, but a few comments on it will suffice. First, the code is ugly: it has an IF...ELSE IF block a mile long just to take a single field of extracted data and put it into one of the 250 output variables! It would be nice if DB2 developers could populate an array instead of populating columns one at a time. Second, for performance reasons, I incorporated much of the code from the ParseData UDTF into ParseFileContent; it's redundant code, but overall it runs much faster (more on this in a bit).
Common EBCDIC Delimiters
Row delimiters in text files vary, often depending on the tool or operating system used to create them. The most common row delimiters are the carriage return/line feed pair (x'0D25'), the line feed alone (x'25'), the carriage return alone (x'0D'), and the EBCDIC new line (x'15').
The common column delimiters for text data include the tab (x'05'), the comma ',' and the pipe '|' (x'6B' and x'4F', respectively, in CCSID 37).
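For instance, to parse a comma-delimited file only the column delimiter argument changes; the file name below is illustrative:

```
-- Comma as the column delimiter; the delimiter parameters are VARCHAR,
-- so a character literal works as well as a hex constant
SELECT ROWID, C1, C2
  FROM TABLE(ParseFileContent(
       GET_CLOB_FROM_FILE('/tmp/SampleData.csv'),
       x'0D25', ',')) Data
```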
A Problem With Journaling
If you are using a legacy application that does not use journaling (or perhaps that runs with a default of COMMIT=*NONE), you may encounter a problem when using a statement like this:
INSERT INTO MYLIB.DATA
SELECT ROWID, C1, C2
  FROM TABLE(ParseFileContent(
       GET_CLOB_FROM_FILE('/tmp/SampleData.txt'),
       x'0D25', x'05')) Data
The first error you’ll get is this:
SQL State: 42926
Vendor Code: -443
Message: [SQL0443] LOB and XML locators are not allowed with COMMIT(*NONE).
This error occurs because DB2 requires large object locators (such as returned by the GET_CLOB_FROM_FILE UDF) to be accessed within a transaction.
If you change your transaction isolation level to read uncommitted or higher, and run the statement again, you’ll get this error:
SQL State: 25006
Vendor Code: -817
Message: [SQL0817] Update operation not allowed.
This error occurs because the table that is the target of the INSERT is not being journaled, a requirement of transaction processing in DB2 for i. What a pickle: you have to use transaction processing to get the CLOB from the IFS file, yet you can't use transaction processing to INSERT into a table that isn't being journaled.
So now what? The code below shows how a dynamic compound statement can be used to do the process in two steps: the first step within a transaction boundary, and the second step without it:
BEGIN
    DECLARE @DATA CLOB;
    -- Requires a transaction
    SELECT GET_CLOB_FROM_FILE('/tmp/TabData.txt')
      INTO @DATA
      FROM SYSIBM.SYSDUMMY1
      WITH CHG;
    -- Does not require a transaction
    INSERT INTO MYLIB.DATA
    SELECT ROWID, C1
      FROM TABLE(ParseFileContent(@DATA,x'0D25',x'05')) Data;
END;
The WITH CHG isolation clause demonstrated above causes the SELECT statement to be executed with a transaction isolation level of “read uncommitted”, a requirement for GET_CLOB_FROM_FILE.
Custom Variations Of The ParseFileContent UDTF
To maximize its potential for general use, the ParseFileContent UDTF provides 250 VARCHAR(128) columns. However, nondescript column names can be confusing, and a result set with that many columns can be clumsy to wade through. The good news is that the function can be rebuilt or modified in any number of ways to suit your needs, including tailoring it for specific text data.
For example, instead of naming the columns generically (C1, C2, and so on), you can create a UDTF with column names that match the text file's content. The RETURNS TABLE clause can be made to look like this:
RETURNS TABLE (
    RowId              INTEGER,
    ProductId          INTEGER,
    ProductDescription VARCHAR(50),
    UnitCost           DEC(11,4),
    ...
Further, you can define the function's result data types to match the text file content by defining columns with types such as DATE, SMALLINT, etc. Thankfully, since V5R3, DB2 has automatically cast character data to the destination type without requiring an explicit cast in the code. The one drawback is that if bad data is encountered, DB2 will choke with an error that can be a pain to debug.
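A middle ground, if you'd rather keep the generic VARCHAR output and still catch bad data, is to cast explicitly in the SELECT. This is only a sketch: the column positions and the MYLIB.PRODUCT target are assumed, and @DATA is assumed to hold the file's CLOB as in the compound-statement example above:

```
-- Explicit casts on the generic columns (positions are assumed)
INSERT INTO MYLIB.PRODUCT
SELECT CAST(C1 AS INTEGER),
       CAST(C2 AS VARCHAR(50)),
       CAST(C3 AS DEC(11,4))
  FROM TABLE(ParseFileContent(@DATA,x'0D25',x'05')) Data;

-- If a cast fails, narrow down the bad line with a ROWID range
SELECT ROWID, C1, C3
  FROM TABLE(ParseFileContent(@DATA,x'0D25',x'05')) Data
 WHERE ROWID BETWEEN 1 AND 100;
```

The payoff is that a conversion error can be hunted down by ROWID instead of failing anonymously inside the typed UDTF.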
As a final example of tailoring options, the UDTF output table definition can be changed to have the output columns resemble Excel’s column naming convention: A, B, C . . . Z, AA, AB, etc.
RETURNS TABLE (
    RowId INTEGER,
    A     VARCHAR(64),
    B     VARCHAR(64),
    C     VARCHAR(64),
    ...
Code Reuse vs. Performance
Every seasoned developer knows that reusing code can make life much easier because modular code is easy to maintain. But, as noted above, modular coding in SQL often comes at the cost of performance.
My original approach to writing this UDTF was to use the ParseData UDTF twice: once to parse the rows, and once to parse the columns within each row. My original approach is shown here:
SELECT RowData.RowId,
       Col.RowId AS ColId,
       Col.FieldData
  FROM TABLE(ParseData(@FileContent,@RowDelimiter)) RowData
 CROSS JOIN TABLE(ParseData(RowData.FieldData,@ColDelimiter)) Col
 ORDER BY 1,2
In my testing on a 373 KB tab-delimited text file (74 columns and 738 rows) this method took about 38 seconds on average, which was too long. By copying the code to parse the columns from the ParseData function into the ParseFileContent UDTF, I was able to reduce the average run time to about 13 seconds. So, I made the UDTF process quicker, but at the cost of duplicating some code.
Caveats and Complaints
To write the function, I had hoped to use the new native regular expression support in DB2 for i. Unfortunately, these new functions are all scalar functions; they return only one match at a time, meaning the RegEx processing would need to be performed repetitively on every column of every row. Better would be an implementation of the RegEx functions that could return multiple matches as a table function or as a variable-length array.
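To illustrate the one-match-at-a-time limitation: REGEXP_SUBSTR can extract the nth delimited field via its occurrence parameter, but a separate call is required for every field of every row. A sketch with literal data, assuming the regex functions are available on your release:

```
-- Each field costs its own REGEXP_SUBSTR call (occurrences 1, 2, 3...)
-- yielding '10001', 'Widget', and '4.95' as three columns
SELECT REGEXP_SUBSTR('10001,Widget,4.95', '[^,]+', 1, 1) AS C1,
       REGEXP_SUBSTR('10001,Widget,4.95', '[^,]+', 1, 2) AS C2,
       REGEXP_SUBSTR('10001,Widget,4.95', '[^,]+', 1, 3) AS C3
  FROM SYSIBM.SYSDUMMY1
```

With dozens of columns and thousands of rows, this per-field invocation pattern quickly becomes prohibitive compared with a single pass through the data.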
The UDTF sample demonstrated here is great because it requires nothing more than SQL. The code will not, however, correctly process files with embedded column delimiters or text delimiters. An example of this problem looks like the following:
An example of this problem: consider a comma-delimited file of product IDs and descriptions that contains this line:

1002,"4x4, Redwood"

Because the column data is comma delimited, the embedded comma in "4x4, Redwood" will cause the table function to incorrectly interpret the row as three columns instead of two.
In part 2 of this series, I will demonstrate an external UDTF that uses a Java library to process delimited text files and, as a bonus, supports embedded comma delimiters and text delimiters.