Parsing Delimited Text Data Natively in SQL, Part 1
January 26, 2016 Michael Sansoterra
Note: The code accompanying this article is available for download here.
A common task for database developers is to accept a delimited text file, parse it, and dump it into a database table. This tip demonstrates a user-defined table function (UDTF) that can accomplish this task based on delimited text data stored in a CLOB or in an IFS file.
The UDTF is called ParseFileContent and it accepts three parameters: the CLOB data to parse, the row delimiter character(s), and the column delimiter character(s). The UDTF returns a ROWID column that represents the line number of the text data, plus 250 VARCHAR(128) columns named C1, C2, and so on through C250. Sample usage looks like this:
INSERT INTO MYLIB.DATA
SELECT ROWID, C1 AS PRODUCTID, C2 AS PRODUCTDESCRIPTION
  FROM TABLE(ParseFileContent(
       GET_CLOB_FROM_FILE('/tmp/ProductData.txt'),
       x'0D25', x'05')) Data
The sample statement above uses the built-in function GET_CLOB_FROM_FILE to transform the IFS file ProductData.txt into a CLOB. The row delimiter is x'0D25' (carriage return/line feed) and the column delimiter is x'05' (tab). Columns C1 and C2 are aliased as PRODUCTID and PRODUCTDESCRIPTION, respectively.
You probably already know the CPYFRMIMPF command can perform much the same task. However, the SQL UDTF brings the power of native SQL to this kind of utilitarian task and allows the import process to take advantage of SQL programming.
SQL UDTF Code
There are two table functions: ParseFileContent and a secondary helper function named ParseData. The abridged function header for ParseFileContent is defined as follows:
CREATE OR REPLACE FUNCTION ParseFileContent (
    @FileContent  CLOB,
    @RowDelimiter VARCHAR(12),
    @ColDelimiter VARCHAR(12))
RETURNS TABLE (
    RowId INTEGER,
    C1    VARCHAR(128),
    C2    VARCHAR(128),
    ...
    C250  VARCHAR(128))
LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
SET OPTION COMMIT=*CS
The code uses the PIPE statement recently introduced in DB2 for i (IBM i 7.1 TR9 or 7.2 TR1) and also relies on support for a large number of output columns (IBM i 7.1 TR10 or IBM i 7.2 TR2). Users on systems that don't have PIPE available can modify the code to insert each row into a temporary table in lieu of using PIPE. Users on systems that can't yet create a UDTF with a large number of output table columns can simply reduce the number of columns the function returns (approximately 120 output columns maximum).
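For the PIPE-less route, one possible shape is to recast the helper as a stored procedure that fills a session temporary table, which the caller then queries. This is only a sketch; the procedure and table names (ParseDataToTemp, SESSION.PARSED_ROWS) are hypothetical:

```
-- Hypothetical PIPE-less sketch: a procedure fills a temporary table
-- instead of piping rows back from a table function.
CREATE OR REPLACE PROCEDURE ParseDataToTemp (
    @Data      CLOB,
    @Delimiter VARCHAR(12))
LANGUAGE SQL
BEGIN
    DECLARE @START INTEGER DEFAULT 1 NOT NULL;
    DECLARE @END1  INTEGER DEFAULT 0 NOT NULL;
    DECLARE @ROWID INTEGER DEFAULT 0 NOT NULL;

    -- Session-scoped work table; WITH REPLACE discards any prior copy
    DECLARE GLOBAL TEMPORARY TABLE SESSION.PARSED_ROWS
        (ROWID INTEGER, FIELDDATA VARCHAR(32000))
        WITH REPLACE ON COMMIT PRESERVE ROWS;

    SET @END1=LOCATE(@Delimiter,@Data);
    WHILE @END1>0 DO
        SET @ROWID=@ROWID+1;
        INSERT INTO SESSION.PARSED_ROWS
            VALUES (@ROWID, SUBSTRING(@Data,@START,@END1-@START));
        SET @START=@END1+LENGTH(@Delimiter);
        SET @END1=LOCATE(@Delimiter,@Data,@START);
    END WHILE;
    -- Emit the final segment after the last delimiter
    INSERT INTO SESSION.PARSED_ROWS
        VALUES (@ROWID+1, SUBSTRING(@Data,@START,LENGTH(@Data)-@START+1));
END
```

The caller would then SELECT from SESSION.PARSED_ROWS where it previously invoked the table function.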
The secondary ParseData UDTF does the work of parsing a CLOB by the given row delimiter, returning a row with a single column called FieldData for every text line in the CLOB:
CREATE OR REPLACE FUNCTION ParseData (
    @Data      CLOB,
    @Delimiter VARCHAR(12))
RETURNS TABLE (
    RowId     INTEGER,
    FieldData VARCHAR(32000))
LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
RETURNS NULL ON NULL INPUT
SET OPTION COMMIT=*NONE
BEGIN
    DECLARE @START INTEGER DEFAULT 1 NOT NULL;
    DECLARE @END1  INTEGER DEFAULT 0 NOT NULL;
    DECLARE @ROWID INTEGER DEFAULT 0 NOT NULL;
    DECLARE @FIELD VARCHAR(32000);

    SET @END1=LOCATE(@Delimiter,@Data);
    WHILE @END1>0 DO
        SET @FIELD=SUBSTRING(@Data,@START,@END1-@START);
        SET @ROWID=@ROWID+1;
        PIPE (@ROWID, @FIELD);
        SET @START=@END1+LENGTH(@Delimiter);
        SET @END1=LOCATE(@Delimiter,@Data,@END1+LENGTH(@Delimiter));
    END WHILE;
    -- Emit the final segment after the last delimiter
    SET @FIELD=SUBSTRING(@Data,@START,LENGTH(@Data)-@START+1);
    SET @ROWID=@ROWID+1;
    PIPE (@ROWID, @FIELD);
    RETURN;
END
As evidenced by the FieldData column definition, the table function can only process a delimited file with a row length of 32,000 characters or less.
The ParseFileContent UDTF is too large to show here because of the large number of output columns, but a few comments on it will suffice. First, the code is ugly: it has an IF...ELSE IF block a mile long just to take a single field of extracted data and put it into one of the 250 output variables! It would be nice if DB2 developers could populate an array instead of populating columns one at a time. Second, for performance reasons, I incorporated much of the code from the ParseData UDTF into ParseFileContent; it's redundant code, but overall it runs much faster (more on this in a bit).
Common EBCDIC Delimiters
Row delimiters in text files vary, often depending on the tool or operating system used to create them. The most common row delimiters are the carriage return/line feed pair (x'0D25'), the line feed alone (x'25'), the carriage return alone (x'0D'), and the EBCDIC new line (x'15').
The common column delimiters for text data include the tab (x'05'), the comma ',' and the pipe '|' (x'6B' and x'4F', respectively, in CCSID 37).
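For instance, to parse a comma-delimited file only the column delimiter argument changes; the file name below is illustrative:

```
-- Comma as the column delimiter; the delimiter parameters are VARCHAR,
-- so a character literal works as well as a hex constant
SELECT ROWID, C1, C2
  FROM TABLE(ParseFileContent(
       GET_CLOB_FROM_FILE('/tmp/SampleData.csv'),
       x'0D25', ',')) Data
```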
A Problem With Journaling
If you are using a legacy application that does not use journaling (or perhaps that runs with a default of COMMIT=*NONE), you may encounter a problem when using a statement like this:
INSERT INTO MYLIB.DATA
SELECT ROWID, C1, C2
  FROM TABLE(ParseFileContent(
       GET_CLOB_FROM_FILE('/tmp/SampleData.txt'),
       x'0D25', x'05')) Data
The first error you’ll get is this:
SQL State: 42926
Vendor Code: -443
Message: [SQL0443] LOB and XML locators are not allowed with COMMIT(*NONE).
This error occurs because DB2 requires large object locators (such as returned by the GET_CLOB_FROM_FILE UDF) to be accessed within a transaction.
If you change your transaction isolation level to read uncommitted or higher, and run the statement again, you’ll get this error:
SQL State: 25006
Vendor Code: -817
Message: [SQL0817] Update operation not allowed.
This error occurs because the table that is the target of the INSERT is not being journaled, a requirement of transaction processing in DB2 for i. What a pickle: you have to use transaction processing to get the CLOB from the IFS file, yet you can't use transaction processing to INSERT into a table that isn't being journaled.
So now what? The code below shows how a dynamic compound statement can be used to do the process in two steps: the first step within a transaction boundary, and the second step without it:
BEGIN
    DECLARE @DATA CLOB;
    -- Requires a transaction
    SELECT GET_CLOB_FROM_FILE('/tmp/TabData.txt')
      INTO @DATA
      FROM SYSIBM.SYSDUMMY1
      WITH CHG;
    -- Does not require a transaction
    INSERT INTO MYLIB.DATA
    SELECT ROWID, C1
      FROM TABLE(ParseFileContent(@DATA,x'0D25',x'05')) Data;
END;
The WITH CHG isolation clause demonstrated above causes the SELECT statement to be executed with a transaction isolation level of “read uncommitted”, a requirement for GET_CLOB_FROM_FILE.
Custom Variations Of The ParseFileContent UDTF
To maximize its potential for general use, the ParseFileContent UDTF provides 250 VARCHAR(128) columns. However, nondescript column names can be confusing, and a result set with that many columns can be clumsy to wade through. The good news is that the function can be rebuilt or modified in any number of ways to suit your needs, including tailoring it for specific text data.
For example, instead of naming the columns generically (C1, C2, and so on), you can create a UDTF with column names that match the text file's content. The RETURNS TABLE clause can be made to look like this:
RETURNS TABLE (
    RowId              INTEGER,
    ProductId          INTEGER,
    ProductDescription VARCHAR(50),
    UnitCost           DEC(11,4),
    ...
Further, you can define the function's result data types to match the text file content by defining columns with types such as DATE, SMALLINT, etc. Thankfully, since V5R3, DB2 has automatically cast character data to the destination type without requiring an explicit cast in the code. The one drawback is that if bad data is encountered, DB2 will choke with an error that can be a pain to debug.
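A middle ground, if you'd rather keep the generic VARCHAR output and still catch bad data, is to cast explicitly in the SELECT. This is only a sketch: the column positions and the MYLIB.PRODUCT target are assumed, and @DATA is assumed to hold the file's CLOB as in the compound-statement example above:

```
-- Explicit casts on the generic columns (positions are assumed)
INSERT INTO MYLIB.PRODUCT
SELECT CAST(C1 AS INTEGER),
       CAST(C2 AS VARCHAR(50)),
       CAST(C3 AS DEC(11,4))
  FROM TABLE(ParseFileContent(@DATA,x'0D25',x'05')) Data;

-- If a cast fails, narrow down the bad line with a ROWID range
SELECT ROWID, C1, C3
  FROM TABLE(ParseFileContent(@DATA,x'0D25',x'05')) Data
 WHERE ROWID BETWEEN 1 AND 100;
```

The payoff is that a conversion error can be hunted down by ROWID instead of failing anonymously inside the typed UDTF.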
As a final example of tailoring options, the UDTF output table definition can be changed to have the output columns resemble Excel’s column naming convention: A, B, C . . . Z, AA, AB, etc.
RETURNS TABLE (
    RowId INTEGER,
    A     VARCHAR(64),
    B     VARCHAR(64),
    C     VARCHAR(64),
    ...
Code Reuse vs. Performance
Every seasoned developer knows that reusing code can make life much easier because modular code is easy to maintain. But, as noted above, modular coding in SQL often comes at the cost of performance.
My original approach to writing this UDTF was to use the ParseData UDTF twice: once to parse the rows, and once to parse the columns within each row. My original approach is shown here:
SELECT RowData.RowId,
       Col.RowId AS ColId,
       Col.FieldData
  FROM TABLE(ParseData(@FileContent,@RowDelimiter)) RowData
 CROSS JOIN TABLE(ParseData(RowData.FieldData,@ColDelimiter)) Col
 ORDER BY 1,2
In my testing on a 373 KB tab-delimited text file (74 columns and 738 rows) this method took about 38 seconds on average, which was too long. By copying the code to parse the columns from the ParseData function into the ParseFileContent UDTF, I was able to reduce the average run time to about 13 seconds. So, I made the UDTF process quicker, but at the cost of duplicating some code.
Caveats and Complaints
To write the function, I had hoped to use the new native regular expression support in DB2 for i. Unfortunately, these new functions are all scalar functions; they return only one match at a time, meaning the RegEx processing would need to be performed repetitively on every column of every row. Better would be an implementation of the RegEx functions that could return multiple matches as a table function or as a variable-length array.
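To illustrate the one-match-at-a-time limitation: REGEXP_SUBSTR can extract the nth delimited field via its occurrence parameter, but a separate call is required for every field of every row. A sketch with literal data, assuming the regex functions are available on your release:

```
-- Each field costs its own REGEXP_SUBSTR call (occurrences 1, 2, 3...)
-- yielding '10001', 'Widget', and '4.95' as three columns
SELECT REGEXP_SUBSTR('10001,Widget,4.95', '[^,]+', 1, 1) AS C1,
       REGEXP_SUBSTR('10001,Widget,4.95', '[^,]+', 1, 2) AS C2,
       REGEXP_SUBSTR('10001,Widget,4.95', '[^,]+', 1, 3) AS C3
  FROM SYSIBM.SYSDUMMY1
```

With dozens of columns and thousands of rows, this per-field invocation pattern quickly becomes prohibitive compared with a single pass through the data.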
The UDTF sample demonstrated here is great because it requires nothing more than SQL. The code will not, however, correctly process files with embedded column delimiters or text delimiters. An example of this problem looks like the following:
An example of this problem: consider a comma-delimited file of product IDs and descriptions that contains this line:

1002,"4x4, Redwood"

Because the column data is comma delimited, the embedded comma in "4x4, Redwood" will cause the table function to incorrectly interpret the row as three columns instead of two.
In part 2 of this series, I will demonstrate an external UDTF that uses a Java library to process delimited text files and, as a bonus, supports embedded comma delimiters and text delimiters.