• The Four Hundred
  • Subscribe
  • Media Kit
  • Contributors
  • About Us
  • Contact
Menu
  • The Four Hundred
  • Subscribe
  • Media Kit
  • Contributors
  • About Us
  • Contact
  • DB2 For i, Java, And Regular Expressions

    January 29, 2014 Michael Sansoterra

    In today’s world, IBM i developers often find themselves working with more than the relational data stored inside their trusty DB2 database. Coders can find themselves tasked with parsing many kinds of text data including: SMTP email, file contents of an IFS Folder, Java Properties files, free form comments imported from other systems, OCR data for invoices, resumes, XML, HTML, code fragments (RPG, COBOL, C, Java, SQL, VBA), and application logs.

    While there are many tools available to examine plain text, a regular expression engine provides one of the best generic text processing engines available.

    Yipes, it’s been over 10 years since my first set of articles on integrating Java regular expressions with SQL user-defined functions was published. That explains the hair loss and color change! It’s probably a good time to revisit the topic and introduce a few additional concepts about how useful regular expressions can be.

    In case you need a review on what regular expressions can do, or how to combine Java regular expressions and SQL user-defined scalar functions, you might want to check out a couple of my past articles: Empower SQL with Java User-Defined Functions and Staggering SQL String Handling with Regular Expressions

    The download for this tip’s Java source code is here. This tip’s code will simply add to the UDFs.java source file published with the prior tips. Two additional things I’d like to introduce in this tip are “named capturing groups,” which was available starting in Java 7 on IBM i 7.1 – licensed product 5761JV1 options 14 (32-bit) and 15 (64-bit), and user-defined table functions (available since V5R2). I’ll explain more on these concepts in a bit.

    For those RegEx pros who want to skip reading and get right to the heart of the matter, in addition to the functions published in the prior tips, the newly added SQL user-defined functions are:

    • re_Split_TF–This is a table function that will split text based on a regular expression pattern and return each “split” value as its own row. This is similar to the Java and .NET string “split” methods. This function is a companion to the prior re_Split scalar function, which is somewhat limited because it can only return a single value per invocation.
    • re_Named_Capture–This scalar function will accept text, a regular expression pattern, and the name of a capturing group and return the first occurrence of the named group’s matching values. This function has three parameters: text data, regular expression pattern, and the name of the capturing group. An overloaded second instance of the function has an extra parameter that will allow the developer to choose which occurrence of the pattern to return.
    • re_Named_Capture_TF–This table function is like the scalar function except it will return all occurrences of the requested named capture group as individual rows in the result set.

    The full instructions on how to compile the java UDFs program and how to issue the SQL CREATE FUNCTION statements are in the source code. Make sure the compiled java class is placed in the following special DB2 for i folder so that it can be accessed by DB2: /qibm/userdata/os400/sqllib/function. If you decide to add a package name (the sample code uses the default package), the EXTERNAL PROGRAM value on the CREATE FUNCTION statement will need to be changed to include the package name.

    The re_SPLIT_TF User-Defined Table Function

    Table functions are used to return many pieces of information in a tabular format. The original RegEx articles only included scalar functions that only return a single value. However, if you need to return an unknown number of values from a regular expression match or split, then the table function is the better tool.

    The re_SPLIT_TF table function is used to split text into rows based on a RE pattern. The two parameters for this function are the text to be processed and the RE pattern. In this example, which only works in free-form SQL tools like System i Navigator’s RunSQL scripts, a line feed character (EBCDIC 0x25) is supplied as the RE pattern to use to break apart the supplied text line by line:

    SELECT *
      FROM TABLE(QGPL.re_SPLIT_TF('Line 1
    Line 2
    Line 3
    ',X'25')) TEMP;
    

    The query returns:

    RowNo

    Value

    1

    Line 1

    2

    Line 2

    3

    Line 3

    The RowNo column is created by the table function to keep track of the relative position of the text as it is split. Each of the new table functions features this auto numbering column.

    Named Capturing Groups

    While the concept of RE capturing groups isn’t new, the ability to reference a capturing group by “name” is new to Java in version 7. Loading the Java 7 SDK is a prerequisite to using this code. Named capturing groups provide developers an easy way to extract a piece of information by name instead of ordinal out of plain text. The syntax for naming a capturing group within a RE pattern is as follows:

     (?<name>X) where X is the RE pattern to capture
    

    Say you’re mining text documents for invoice numbers formatted as “INVnnnnn” followed by an order number formatted as “ORDnnnnn” or “CMnnnnn” (CM=credit memo). The sample document text may look like this:

    This letter is in regard to invoice INV01432 (from store.com order ORD15423) blah blah blah

    You come up with a RegEx pattern to scan for the invoice and order number text as follows:

     (INVd{5}).*?(ORDd{5}|CMd{5})
    

    This pattern has two capturing groups, each enclosed in parenthesis:

    1. INVd{5}
    2. ORDd{5}|CMd{5}

    With Java regular expressions, the text identified by these groups can be accessed by ordinal number (with ordinal zero representing the entire pattern). Specifying a capture group of 1 from the sample document text above will return the invoice number (INV01432). Specifying a capture group of 2 from the sample text will return the order number (ORD15423).

    With named capturing, names can now be assigned to each group so you don’t have to worry about referencing ordinals (which can change) when capturing text from a particular pattern. Here, the capturing groups are assigned the names “invoice” and “order”:

     (?<invoice>INVd{5}).*?(?<order>ORDd{5}|CMd{5})
    

    Using the new re_Named_Capture function, you can extract text as follows:

    SELECT QGPL.re_Named_Capture(
    'This letter is in regard to invoice INV01432 (from store.com order
    ORD15423) blah blah blah' ,'(?<invoice>INVd{5}).*?
    (?ORDd{5}|CMd{5})','invoice') AS Invoice
    FROM SYSIBM.SYSDUMMY1
    

    The query returns the value of the invoice number: INV01432. Substituting the third parameter (capturing group name) with ‘order’, the function will return ORD15423. The capturing group names are case sensitive.

    If you’re expecting a capturing group to find only one occurrence, or if you’re only concerned with a specific single occurrence, then the scalar function re_Named_Capture should be used within your code.

    However, in some cases, a capturing group may capture many values from the input text. In this case, the table function, re_Named_Capture_TF, will allow the capture of many values with each value returned as a row in a result set. For example, let’s say you need to mine plain text for email addresses. You invent your own or search online to find this RE pattern.

    ^[_A-Za-z0-9-+]+(.[_A-Za-z0-9-]+)*
          @[A-Za-z0-9-]+(.[A-Za-z0-9]+)*(.[A-Za-z]{2,})$;
    

    The first line of the pattern retrieves the email info and the second line retrieves the domain name. Using this pattern (modified just a tad), this next SQL statement uses re_Named_Capture_TF to extract all of the email addresses out of the text:

    SELECT * FROM TABLE(re_Named_Capture_TF(
    'Please copy the following people:
    joe.shmoe@abc.com. john.doe@xyz.com, jane_pain@ghi.net.  Also, CC:  
    mike_sanso@sql.tv.  For more info, visit: http://mydata.net/',
    
    '(?<email>([A-Za-z0-9-+_]+(.[A-Za-z0-9-]+)*@
    (?<domain>([A-Za-z0-9-]+(.[A-Za-z0-9]+)*(.[A-Za-z]{2,})))))',
    
    'email')) x
    

    This query returns the following result:

    1. joe.shmoe@abc.com
    2. john.doe@xyz.com
    3. jane_pain@ghi.net
    4. mike_sanso@sql.tv

    Substituting a different capturing group name, the domain name is returned instead of the full email address:

    SELECT * FROM TABLE(re_Named_Capture_TF(
    'Please copy the following people:
    joe.shmoe@abc.com. john.doe@xyz.com, jane_pain@ghi.net.  Also, CC:  
    mike_sanso@sql.tv.  For more info, visit: http://mydata.net/',
    
    '(?<email>([A-Za-z0-9-+_]+(.[A-Za-z0-9-]+)*@
    (?<domain>([A-Za-z0-9-]+(.[A-Za-z0-9]+)*(.[A-Za-z]{2,})))))',
    
    'domain')) x
    

    That will return the following list:

    1. abc.com
    2. xyz.com
    3. ghi.net
    4. sql.tv

    Back Reference–A Cool Feature

    One thing to note about named capturing groups in regular expressions is that you can also reference a prior pattern match within an expression. For example, say you need to do a quick and dirty parse of an HTML snippet, trying to find the ALT attribute of all image (<img>) tags. Using the ibm.com/db2 web page’s HTML as an example, here is a typical image tag with the alternate text (ALT) specified:

    <img alt=”Announcing DB2 with BLU Acceleration. Accelerate your analytical workload times to the speed of thought.” height=”100″ width=”300″ src=”/software/data/db2/linux-unix-windows/images/blu-300×100.jpg”/>

    With HTML attributes, either single or double quotes are allowed as delimiters. The following RE pattern can be used to get the entire image tag, the type of quote delimiter used with the alt attribute and the alt attribute value itself:

    (?<imgtag>(<img.*alt=s*(?<quote>("|'))
    (?<altvalue>.*?)(k<quote>).*/>))
    

    Of particular importance is the “quote” named capturing group that determines whether a single or double quote is used. A back reference (k<quote>) is specified to make sure the end of the attribute is terminated with the same quote character used at the start. In other words, the k<quote> capturing group says “find the same value identified in the quote capturing group.”

    Many text documents (HTML, XML, Java Property Files, etc.) have special tools that can be used to programmatically extract information from them. However, depending on the task at hand, sometimes it’s a burdensome coding chore to break out the “right tool” for what might be a one-time code execution or a job fit for a “simple” database retrieval task. While regular expressions aren’t specifically designed to do XML/HTML parsing, they can nevertheless be used to aid with quick and simple queries.

    Keep in mind that these examples use Java’s RE engine. Many other RE implementations exist (PHP, .NET, C, etc.) but they’re not always 100 percent compatible.

    In summary, named capturing groups are handy because you can embed several capturing instructions within a single pattern and pull out the piece you need by name. Further, using table functions will allow queries to pull all values from a RE operation and return them in a result set. Regular expressions can parse and validate all kinds of text including phone numbers, mailing addresses, serial number formats, email addresses, etc., so integrate them in your applications and reap the benefits of this powerful text processing tool.

    Michael Sansoterra is a DBA for Broadway Systems in Grand Rapids, Michigan. Send your questions or comments for Mike via the IT Jungle Contact page.

    RELATED STORIES

    Empower SQL with Java User-Defined Functions

    Staggering SQL String Handling with Regular Expressions

    Named Capturing Groups feature in Java 7

    Configure Java default version on the IBM i (Author’s Note: Following these steps will allow you to make Java 7 the default JDK on IBM i and allow DB2 to make use of it.)

    Regular Expression Test Page for Java (Author’s Note: This page offers a free utility to test your Java based regular expression patterns, which is very helpful for RegEx newbies.)



                         Post this story to del.icio.us
                   Post this story to Digg
        Post this story to Slashdot

    Share this:

    • Reddit
    • Facebook
    • LinkedIn
    • Twitter
    • Email

    Tags:

    Sponsored by
    Rocket Software

    Software built on TRUST. Delivered with LOVE.

    For over 35 years, Rocket Software’s solutions have empowered businesses to modernize their infrastructure, unlock data value, and drive transformation – all while ensuring modernization without disruption.

    Learn More

    Share this:

    • Reddit
    • Facebook
    • LinkedIn
    • Twitter
    • Email

    Sponsored Links

    Fresche Legacy:  X-Analysis 10.0 is the premier knowledge mining and re-use tool for IBM i
    System i Developer:  Upgrade your skills at the RPG & DB2 Summit in Dallas, March 18-20.
    Profound Logic Software:  Live Webinar: "See What i Can Do With Modern RPG Development." Feb 12

    More IT Jungle Resources:

    System i PTF Guide: Weekly PTF Updates
    IBM i Events Calendar: National Conferences, Local Events, and Webinars
    Breaking News: News Hot Off The Press
    TPM @ EnterpriseTech: High Performance Computing Industry News From ITJ EIC Timothy Prickett Morgan

    CIOs To Feel The Pinch Again In 2014 IBM’s X86 Exit Strategy: Arguing The Good And Bad

    Leave a Reply Cancel reply

Volume 14, Number 2 -- January 29, 2014
THIS ISSUE SPONSORED BY:

SEQUEL Software
WorksRight Software
Northeast User Groups Conference

Table of Contents

  • The Case Of The Late Night FSM Explorer
  • DB2 For i, Java, And Regular Expressions
  • Admin Alert: Four Ways To Move An IBM i Partition, Part 3

Content archive

  • The Four Hundred
  • Four Hundred Stuff
  • Four Hundred Guru

Recent Posts

  • FAX/400 And CICS For i Are Dead. What Will IBM Kill Next?
  • Fresche Overhauls X-Analysis With Web UI, AI Smarts
  • Is It Time To Add The Rust Programming Language To IBM i?
  • Is IBM Going To Raise Prices On Power10 Expert Care?
  • IBM i PTF Guide, Volume 27, Number 20
  • POWERUp 2025 –Your Source For IBM i 7.6 Information
  • Maxava Consulting Services Does More Than HA/DR Project Management – A Lot More
  • Guru: Creating An SQL Stored Procedure That Returns A Result Set
  • As I See It: At Any Cost
  • IBM i PTF Guide, Volume 27, Number 19

Subscribe

To get news from IT Jungle sent to your inbox every week, subscribe to our newsletter.

Pages

  • About Us
  • Contact
  • Contributors
  • Four Hundred Monitor
  • IBM i PTF Guide
  • Media Kit
  • Subscribe

Search

Copyright © 2025 IT Jungle