Parsing Delimited Text Data Natively In SQL, Part 2: Java To The Rescue
February 9, 2016 Michael Sansoterra
Note: The code accompanying this article is available for download here.
In Part 1 of this series, I introduced ParseFileContent, a 100 percent SQL user-defined table function (UDTF) that can parse delimited text (as a CLOB or IFS file) and format the results as rows and columns. As an advantage over tools like CPYFRMIMPF, this UDTF can streamline a text based import process by allowing CASTs, JOINs, and data integrity checks to be done while the import is running. As useful as ParseFileContent is, it does suffer a few drawbacks.
ParseFileContent is not particularly fast and it can’t handle the features of an advanced text parser such as allowing for text delimiters and embedded column delimiters within a text delimiter. To overcome these limitations, I decided to look for an improved solution. The result is an external UDTF written in Java that uses a pre-built (and free!) Java library already suited for the purpose of parsing delimited text, including the use of advanced features.
The Java library is called Apache Commons CSV. This library is fast and packed with features so the only major task is getting it to play nice with DB2 for i. To aid DB2 and Java integration, IBM made the JAR file db2routines_classes.jar. It contains class UDF which allows developers to easily build Java scalar and table functions. This JAR file can typically be found on the IFS at this location: /qibm/ProdData/Java400/ext.
For more information about creating user-defined scalar and table functions using Java, see the section titled “Java SQL routines” in the Programming IBM Developer Kit for Java. In a nutshell, the UDF class provides the framework for Java to accept parameters from DB2 and to return a scalar value or the rows and columns of a programmatically generated tables (similar to external functions written in C, COBOL, or RPG). This UDTF is intended for IBM i 7.1 or later, but can be run on old versions of DB2 for i with minor modifications so long as Java 1.6 is the default JDK.
In honor of the Apache Commons CSV library, I named the table function demonstrated here ParseCSV. The source can be downloaded here. For this project, you will also need to download and unzip the Apache Commons CSV version 1.2 jar file that requires Java 6 (a.k.a. 1.6) or later.
Sample usage looks like this:
SELECT * FROM TABLE(ParseCSV('/mydata/TextData.csv', ',',X'25','"')) data;
This table function accepts four parameters:
If you have forgotten the importance of the quote character for string data, shown here is the delimited text problem that Part 1‘s UDTF could not handle:
In this comma delimited text example, the value 4×4, Redwood is a single string, but the embedded comma fools the parser into thinking it is two separate values. The quote character (in this case the double quote) instructs the parser to treat everything embedded between them as a single text value.
If text file /mydata/TextData.csv contains this raw text:
A,B,C 1,2,3 @,#,$ "AB",2,"c,d" "3"" pipe",4,"x"
The above query will return this result:
The values in bold demonstrate the function’s ability to handle embedded column delimiters and escaped quote characters. Notice that the quote character (“) can be escaped in a single value by repeating them (e.g. “3”” pipe”). DB2 for i SQL uses this technique to allow for embedded single quotes within a string value. If a particular delimiter is not needed, use an empty string.
For a refresher, remember the common EBCDIC row delimiters are:
Common column delimiters for text data include the tab (EBCDIC x’05’), the comma, and the pipe ‘|’.
Compiling and Using The Code
Once you download and unzip the commons-csv-1.2.jar file and the DB2TextParser.java file, save them on your IFS. In order for DB2 for i to use the commons JAR, you normally need to register the JAR by issuing the following SQL statements (updating the IFS path and schema with your own):
CALL sqlj.install_jar('file:/myfolder/commons-csv-1.2.jar', 'myschema.commons-csv-1.2',0) COMMIT
The schema that you place the JAR file in should also be set in the application’s SQL path so that DB2 can resolve the JAR file’s location.
For DB2 for i to execute the DB2TextParser.java code, the compiled class file must be placed at this special IFS location: /qibm/userdata/os400/sqllib/function. Using QSHELL, it’s easy to compile the Java source into the appropriate folder as follows:
javac -d /qibm/userdata/os400/sqllib/function /myfolder/DB2TextParser.java
Finally, run the CREATE OR REPLACE FUNCTION statement (included in the header of the Java source code) to register the Java code for use as a UDTF with DB2. An abridged version of the statement is shown here:
CREATE OR REPLACE FUNCTION mylib.ParseCSV( FileName VARCHAR(128) CCSID 37, ColDelimiter CHAR(1) CCSID 37, RowDelimiter VARCHAR(4) CCSID 37, QuoteChar CHAR(1) CCSID 37 ) RETURNS TABLE ( LineNo INT, C1 VARCHAR(128), C2 VARCHAR(128), C3 VARCHAR(128), ... C120 VARCHAR(128) ) EXTERNAL NAME 'DB2TextParser.parseFile' LANGUAGE Java PARAMETER STYLE DB2General FENCED NO SQL DISALLOW PARALLEL SCRATCHPAD FINAL CALL RETURNS NULL ON NULL INPUT
Java vs. SQL Text Parser Comparison
Compared to the ParseFileContent SQL UDTF presented in Part 1, this Java UDTF is superb as it averaged less than one-half second, compared to 38 seconds. This doesn’t mean Java code will always beat SQL, but the difference is telling for this particular application. Also note that the initial JVM start time for a given job may add a significant performance penalty to the first run of a DB2 Java routine.
A few other differences between the parsers I’ve presented worth noting are: this Java version doesn’t support CLOB data, it only supports a single character column delimiter (a limitation of the commons library), and a Java UDTF currently only supports a maximum of 125 columns. If you attempt to create more (such as the 250 columns afforded by the ParseFileContent UDTF), you’ll get this error:
SQL State: 54011 Vendor Code: -680 Message: [SQL0680] Too many columns specified. Cause . . . . . : Too many columns were specified in the definition of a user defined table function. A maximum of 125 columns can be specified for the input parameters and the return columns combined. This maximum is reduced by one if both the SCRATCHPAD and DBINFO structures are requested. Recovery . . . : Reduce the number of parameters or return columns specified for the user defined table function.
However, given the performance improvement and the ability to correctly parse quoted strings with embedded delimiters, I prefer the Java version of this utility.
Finally, remember the variability that SQL allows when defining table functions. The default function definition I provided is generic (with 120 generic column names and character data types of equal length), but it can be tweaked to have different column names and different character lengths. Since the Java program’s parameter signature is bound to the UDTF’s parameter and column definitions, changing the return output column types or number of columns also requires a change to the Java code.
One “cheap” way around this limitation is to create a UDTF wrapper around the Java ParseCSV UDTF that lets SQL do the dirty work of data conversions and limiting the number of columns returned. Say you want to extract data out of the U.S. Census Bureau’s “County Business Patterns 2013: Complete Country File” data found here. A function wrapper would look like this:
CREATE OR REPLACE FUNCTION ParseCensusData (@FileName VARCHAR(128)) RETURNS TABLE ( FIPSSTATE VARCHAR(20), FIPSCOUNTY VARCHAR(20), EMPFLAG CHAR(1), EMPCOUNT INT, Q1PAYROLL DEC(17,0)) BEGIN RETURN SELECT C1,C2,C4,C6,C8 FROM TABLE(ParseCSV(@FileName, ',',X'0D25','"')) data WHERE LINENO>1; -- Skip Headings END;
The above “wrapper” ParseCensusData UDTF is designed to return only the desired columns from the IFS file with the appropriate data type, and to skip the first line in the file because it contains unwanted headings.
In closing, the Apache Commons CSV Java library is ideally suited for parsing delimited text data. Further, by using IBM’s UDF Java class, DB2 can take advantage of this (or almost any) Java library with relatively little code.