• The Four Hundred
  • Subscribe
  • Media Kit
  • Contributors
  • About Us
  • Contact
Menu
  • The Four Hundred
  • Subscribe
  • Media Kit
  • Contributors
  • About Us
  • Contact
  • Native Regular Expressions In DB2 For i 7.1 And 7.2

    May 19, 2015 Michael Sansoterra

    Blast it! Another suite of custom code I have written and used over the years has recently been deprecated (or partially deprecated) by IBM. The good news is that regular expressions (abbreviated RegEx) are now a native part of DB2 featuring one new predicate (REGEXP_LIKE) and four new scalar functions: REGEXP_COUNT, REGEXP_INSTR, REGEXP_SUBSTR, and REGEXP_REPLACE.

    For IBM i 7.1, TR9 must be installed and for IBM i 7.2, TR1 must be installed. The new Regular Expression Functions Require licensed product #39 5770-SS1 International Components for Unicode to be installed on both IBM i 7.1 and 7.2. (This product is free with IBM i.) If you do not have this installed you will receive this error:

    SQL State: 42704
    Vendor Code: -204
    Message: [SQL0204] QQQSVREG in QSYS type *SRVPGM not found.
    Cause . . . . . : QQQSVREG in QSYS type *SRVPGM was not found.
    

    In short, regular expressions are used for advanced pattern matching, well beyond what a typical LIKE predicate offers. If you’re unfamiliar with regular expressions and how they can be used in a database environment, check out Staggering SQL String Handling with Regular Expressions.

    For many parts of this tutorial, I’ll be using the EMP_RESUME table in the CORPDATA sample database.

    REGEXP_LIKE Predicate

    Similar to the standard LIKE predicate, REGEXP_LIKE performs pattern testing on a text column using a regular expression pattern, rather than a simple wildcard pattern. This REGEXP_LIKE example shows how the new predicate can be used to find all plain-text resumes that have a correctly formatted birthday of the form: month name day, yyyy

    SELECT *
      FROM CORPDATA.EMP_RESUME
     WHERE RESUME_FORMAT='ascii'
       AND REGEXP_LIKE(RESUME,
    'Birthdate:s*(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|
    SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)s*d{1,2},s*d{4}'
    ,'i');
    

    The first argument passed to REGEXP_LIKE is the expression to search, in this case the RESUME column. The second argument is the regular expression pattern, and it does take some practice to get used to creating a RegEx pattern. The full list of regular expression control characters (as recognized by DB2 for i) used to build a pattern can be found in the SQL Reference manual.

    The third argument is one or more flag values that control how to interpret the RegEx processing. Shown in the example above is ‘i’, which means ignore case sensitivity. The possible flag values for this parameter are:

    Flag value

    Description

    c

    Specifies that matching is case sensitive. This
    is the default value if neither ‘c’ nor ‘i‘ is
    specified. This value must not be specified with a value of ‘i‘.

    i

    Specifies that matching is case insensitive. This
    value must not be specified with a value of ‘c’.

    m

    Specifies that the input data could contain more
    than one line. By default, the ‘^’ and the ‘$’ in a pattern will only match
    the start and the end, respectively, of the input string. If this flag is
    set, “^” and “$” will also match at the start and end of
    each line within the input string.

    n

    Specifies that the ‘.’ character in a pattern
    matches a line terminator in the input string. By default, the ‘.’ in a
    pattern will not match a line terminator. A carriage-return and line-feed
    pair in the input string behaves as a single line terminator, and will match
    a single “.” in a pattern.

    s

    Specifies that the ‘.’ character in a pattern
    matches a line terminator in the input string. This is a synonym for the ‘n’ value.

    x

    Specifies that white space characters in a
    pattern are ignored, unless escaped.

    These flag values are worth studying as they can be used with all of the REGEXP scalar functions.

    As expected, the search criteria can be negated so that the following query returns resumes that do not contain a correctly formatted birthday:

    SELECT *
      FROM CORPDATA.EMP_RESUME
     WHERE RESUME_FORMAT='ascii'
       AND NOT REGEXP_LIKE(RESUME,
    'Birthdate:s*(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|
    SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)s*d{1,2},s*d{4}'
    ,'i');
    

    Another optional parameter that can be used is the start position to search within the string. For example, to ignore the first 100 characters of the resume include 100 as the start position:

    SELECT *
      FROM CORPDATA.EMP_RESUME
     WHERE RESUME_FORMAT='ascii'
       AND REGEXP_LIKE(RESUME,
    'Birthdate:s*(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|
    SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)s*d{1,2},s*d{4}'
    ,100,'i');
    

    New Scalar Functions: REGEXP_COUNT

    This function is used to count the number of occurrences of a regular expression pattern within a string. For instance, this function counts the folder depth of an IFS file path:

    VALUES(REGEXP_COUNT('/qibm/ProdData/Java400/bin/JavaDoc','/'))
    

    The result is 5.

    The following function invocation returns the number of occurrences of a simple phone number pattern within a plain text resume:

    SELECT EMPNO,REGEXP_COUNT(RESUME,
    '(PHONE:|PH:|MOBILE:|CELL:)s*({0,1}d{3}){0,1}s*d{3}(s+|-)d{4}',
    'i')
      FROM CORPDATA.EMP_RESUME
     WHERE RESUME_FORMAT='ascii';
    

    Similar to REGEXP_LIKE, the function’s parameters (from left to right) are the text to search, the regular expression pattern and, optionally, any flags to use. In this case, the “ignore case” flag is used so that both “MOBILE” and “mobile” will be identified as a pattern match.

    REGEXP_SUBSTR (a.k.a. REGEXP_EXTRACT)

    This function will extract a substring from a given column or expression based on a regex pattern. The plain text version of the RESUME column in table EMP_RESUME contains a header that looks like this:

      Address:            1150 Eglinton Ave
                          Mellonville, Idaho 83725
      Phone:              (208) 875-9933
      Birthdate:          September 15, 1925
      Sex:                Female
      Marital Status:     Married
      Height:             5'2"
      Weight:             120 lbs.
    

    The phone number line can be extracted from the text with this REGEX pattern:

    PHONE:s*({0,1}d{3}){0,1}s*d{3}(s+|-)d{4}
    

    Using the REGEXP_SUBSTR function, the phone number can be extracted from the resume as follows:

    SELECT EMPNO,REGEXP_SUBSTR(RESUME,
    'PHONE:s*({0,1}d{3}){0,1}s*d{3}(s+|-)d{4}'
    ,1,1,'i') 
    FROM CORPDATA.EMP_RESUME 
    WHERE RESUME_FORMAT='ascii';
    

    The extracted phone number look like this:

    Phone:              (208) 875-9933
    

    This result could require a little more formatting such as ignoring extraneous spaces and eliminating the word “Phone”. To make these corrections, parenthesis can be placed around the area code and phone number in the regex pattern to make these elements of the pattern a “capture group” (see bolded red letters below):

    SELECT REGEXP_SUBSTR(RESUME,
    'PHONE:s*(({0,1}d{3}){0,1}s*d{3}(s+|-)
    d{4})'
    ,1,1,'i', 1) AS PHONE
    FROM CORPDATA.EMP_RESUME 
    WHERE RESUME_FORMAT='ascii' 
    

    The extra parenthesis in the pattern make the phone number itself a capture group. The optional sixth parameter passed to the REGEXP_SUBSTR function tells it to extract capture group number 1 (capture group 0 represents the entire pattern).

    The improved result is just the phone number without any leading or trailing words, spaces, etc.:

    (208) 875-9933
    

    REGEXP_REPLACE

    As Yoda would say: a powerful ally in the Jedi developer’s toolbox, the REGEXP_REPLACE function is. It can make complex text updates a breeze. Say the area code 208 is about to be eliminated with area code 222 taking its place. The resume phone numbers need to be updated accordingly. However, a blanket exchange of literal ‘208’ with ‘222’ using the SQL REPLACE function may inadvertently update something else in the document such as a non-area code portion of a phone number, a zip code, a house number or an apartment number.

    REGEXP_REPLACE comes in handy here because a regex pattern allows the search to be context specific to ensure the update only applies to an area code of a phone number and not something else.

    Here is an UPDATE statement that will accomplish just such a feat:

    UPDATE CORPDATA.EMP_RESUME
    SET RESUME=REGEXP_REPLACE(RESUME,
    '(.*?PHONE:s*({0,1})208(
    ){0,1}s*d{3}(s+|-)d{4}.*?)',
    '$1222$2',1,
     0,'is')
    WHERE RESUME_FORMAT='ascii' 
    

    Before I explain this mess of cryptic characters, study again what a plain text resume looks like in the sample data (abridged to show portions of the header):

      Resume:  Delores M. Quintana
      
      Personal Information
     
      Address:            1150 Eglinton Ave
                          Mellonville, Idaho 83725
      Phone:              (208) 875-9933
      Birthdate:          September 15, 1925
      Sex:                Female
    
     
      Department Information
     
      Employee Number:    000130
      Manager:            Sally Kwan
      Position:           Analyst
      Phone:              (208) 385-4578
      Hire Date:          1971-07-28
    

    Notice that phone numbers with the 208 area code appear twice. After the above statement runs, all occurrences of the 208 area code are changed to 222.

    The important things to note about the use of REGEXP_REPLACE for this purpose are highlighted in red. The first thing to consider is the pattern itself:

    (.*?PHONE:s*({0,1})208(){0,1}s*d{3}(s+|-)d{4}.*?)
    

    Remember, adding parentheses around elements of a pattern turn the elements into a “capturing group.” In the pattern above, the first capturing group identifies everything in a document prior (starting from the beginning of the document or the last pattern match) to the start of an area code:

    (.*?PHONE:s*({0,1})
    

    The second capturing group identifies everything after the area code until another phone number appears or until the end of the document is reached:

    (){0,1}s*d{3}(s+|-)d{4}.*?)
    

    The next thing to note about the use of REGEXP_REPLACE is the replacement expression as shown in parameter 3:

    '$1222$2'
    

    This is the text we want to use to replace the source text. Since we’re updating the RESUME column, we need to include the entire content of each resume.

    The $1 and $2 are special identifiers that refer to ordinal based capturing groups within a pattern. The regex pattern was created to identify everything before an area code (capture group 1), the area code itself, and everything after the area code (capture group 2). When the replace is done, the replacement text will be built by concatenating a string consisting of the content of capture group 1 (everything prior to the area code), followed by the replacement area code literal ‘222’, followed by the content of capture group 2 (everything after the area code).

    Note that $0 refers to the default capturing group which is the entire regular expression pattern. An alternative syntax for $1 and $2 is 1 and 2.

    The fifth REGEXP_REPLACE parameter represents the occurrence to replace so if three area codes are identified in the resume column, passing a 2 will only update the second match. Passing a zero as shown in this example instructs the REGEXP_REPLACE function to update all occurrences of the matched pattern.

    The last thing to note is the ‘s’ flag is also specified in the flags parameter:

    'is'
    

    This flag specifies that the ‘.’ control character in the pattern matches any character including a line terminator character. This allows the “reluctant quantifier” portion of the pattern element .*? to include text that spans multiple lines. Without this flag, the ‘.’ match would not go beyond a single line.

    Of course, the REGEXP_LIKE predicate can be added to limit the search to only those that contain phone number patterns within the 208 area code.

    UPDATE CORPDATA.EMP_RESUME
    SET RESUME=REGEXP_REPLACE(RESUME,
    '(.*?PHONE:s*({0,1})208(){0,1}s*d{3}(s+|-)d{4}.*?)',
    '$1222$2',1,0,'is')
    WHERE RESUME_FORMAT='ascii' 
      AND REGEXP_LIKE(RESUME,'({0,1}208){0,1}s*d{3}(s+|-)d{4}','i')
    

    REGEXP_INSTR

    This function returns the position of a pattern within a string. The first character position in a string is considered position one so don’t expect it to be zero as in a zero based language such as Java.

    In this example, REGEXP_INSTR is used to identify the last component of an IFS path (as identified in capture group 2: (w+)):

    VALUES(
    REGEXP_INSTR('/qibm/ProdData/Java400/bin/JavaDoc',
    '(/w+)*/(w+)'
    ,1,1,0,'', 2))
    

    The result is 28, the position of ‘JavaDoc’ in the given IFS path. Parameters 3 through 7 of REGEXP_INSTR are optional and represent start position, occurrence, return option (more on this in a minute), flags and capture group ordinal. The “2” value passed as parameter 7, instructs the function to return the position of capture group 2 rather than the position of the entire pattern.

    One other WAY COOL feature of REGEXP_INSTR is parameter 5, the “return option”. This value can be a zero or a one. When set to zero (as shown above), the function returns the start position of the requested pattern. When set to one, the function returns the position after the matched pattern.

    VALUES(
    REGEXP_INSTR(
    '/qibm/ProdData/Java400/bin/JavaDoc',
    '(/w+)*/(w+)'
    ,1,1,1,'',2))
    

    In this case, the function returns 35, the position after the pattern ends. In string processing, it’s common to identify the start position of a search string and then add the length of the same search string to the original start position. The “return option” parameter prevents the need to do this separate addition step. Since regex pattern matches can vary greatly in length, this information (start and after positions) can help you determine the length of the identified pattern.

    Express Yourself with RegEx

    Regular expressions are a powerful tool in the dev toolbox and it’s great to have them available with DB2 for i. Be careful to note that not all regular expression engines support the exact same features. For example, it doesn’t look like the DB2 for i includes support for named capture groups. Also, DB2 for i doesn’t offer a RegEx split function, which is extremely useful for parsing delimited lists. Likewise, there may be some subtle differences in the control character patterns supported by various engines.

    Michael Sansoterra is a DBA for Broadway Systems in Grand Rapids, Michigan. Send your questions or comments for Mike via the IT Jungle Contact page.

    RELATED STORIES

    DB2 For i, Java, And Regular Expressions

    Staggering SQL String Handling with Regular Expressions

    Share this:

    • Reddit
    • Facebook
    • LinkedIn
    • Twitter
    • Email

    Tags:

    Sponsored by
    Rocket Software

    Unlock the full potential of your data with Rocket Software. Our scalable solutions deliver AI-driven insights, seamless integration, and advanced compliance tools to transform your business. Discover how you can simplify data management, boost efficiency, and drive informed decisions.

    Learn more today.

    Share this:

    • Reddit
    • Facebook
    • LinkedIn
    • Twitter
    • Email

    Sponsored Links

    The Omni User:  Chicago's OMNI Technical Conference, June 4-5, Palos Hills, Illinois
    NGS:  The Many Faces of Query/Reporting/Analytics. Free, Live Webinar. May 20
    United Computer Group:  VAULT400 BaaS delivers secure cloud backup and DR solutions

    IBM i Shops Running Oracle JDE Consider MSPs And Migration Hadoop and IBM i: Not As Far Apart As One Might Think

    2 thoughts on “Native Regular Expressions In DB2 For i 7.1 And 7.2”

    • Regexp - Regular Expression con SQL del DB2 for i - Faq400.com says:
      September 7, 2017 at 1:00 am

      […] ITJungle era già uscito un po’ di tempo fa un ottimo articolo di Michael Sansoterra: “Native Regular Expressions In DB2 For i 7.1 And 7.2” che presentata degli ottimi esempi di REGEXP_LIKE, REGEXP_SUBSTR, REGEXP_REPLACE e […]

      Reply
    • Glenn Gundermann says:
      May 9, 2022 at 12:52 pm

      I created a flat file using CRTPF. This creates a file where the one field name is the same name as the file name. If I use REPLACE, it will work. If I use REGEX_REPLACE, it won’t work and returns with: Message: [SQL0171] Argument 01 of function REGEXP_REPLACE not valid.
      I really want to use REGEXP_REPLACE because of the wild card feature and case insensitive. Any idea why the two behave differently?

      Reply

    Leave a Reply Cancel reply

Volume 15, Number 10 -- May 19, 2015
THIS ISSUE SPONSORED BY:

ProData Computer Services
PowerTech
WorksRight Software

Table of Contents

  • Native Regular Expressions In DB2 For i 7.1 And 7.2
  • Prevent Overlapping In Range Tables
  • Job User Name And Current Job User

Content archive

  • The Four Hundred
  • Four Hundred Stuff
  • Four Hundred Guru

Recent Posts

  • Meet The Next Gen Of IBMers Helping To Build IBM i
  • Looks Like IBM Is Building A Linux-Like PASE For IBM i After All
  • Will Independent IBM i Clouds Survive PowerVS?
  • Now, IBM Is Jacking Up Hardware Maintenance Prices
  • IBM i PTF Guide, Volume 27, Number 24
  • Big Blue Raises IBM i License Transfer Fees, Other Prices
  • Keep The IBM i Youth Movement Going With More Training, Better Tools
  • Remain Begins Migrating DevOps Tools To VS Code
  • IBM Readies LTO-10 Tape Drives And Libraries
  • IBM i PTF Guide, Volume 27, Number 23

Subscribe

To get news from IT Jungle sent to your inbox every week, subscribe to our newsletter.

Pages

  • About Us
  • Contact
  • Contributors
  • Four Hundred Monitor
  • IBM i PTF Guide
  • Media Kit
  • Subscribe

Search

Copyright © 2025 IT Jungle