Newsletters Subscriptions Forums Media Kit About Us Contact Search Home

Stuff
OS/400 Edition
Volume 2, Number 21 -- October 23, 2003

Staggering SQL String Handling with Regular Expressions


by Michael Sansoterra

[The code for this article is available for download.]

"Empower SQL with Java User-Defined Functions" explained how to create UDFs with Java and to register them for use with SQL. This article will illustrate the usefulness of the regular expression functions created in the previous article. Who cares if you don't know Java! Read on anyway, because one benefit of UDFs is that you don't have to know anything about what they're written in or how they're implemented. All you need to know is how to use them.

To use these functions, you must have the 5722JV1 Java Developer Kit and 5722JC1 Toolbox for Java installed on your system. You also must be using Version 1.4 of the JDK (V5R2). Finally, the UDFs class should be compiled and the functions registered (as outlined in the previous article). Click here to download the UDFs.java code.

Regular expressions are splendid for working with strings, and they beat the tar out of standard string-related functions, such as LIKE, SUBSTRING, and POSSTR. The main issue with traditional string handling is the amount of work it takes to do a nontrivial task. Consider, for example, column Address3 that contains data in City, ST ZIP format. Here's a fragile SQL statement to extract the city, state and ZIP code into three separate columns:

SELECT
  CASE WHEN POSSTR(Address3,',')>1 THEN
       LEFT(Address3,POSSTR(Address3,',')-1) 
       END AS City, 
  CASE WHEN POSSTR(Address3,',')>1 THEN 
       TRIM(SUBSTR(Address3,POSSTR(Address3,',')+1,3))
       END AS State, 
  CASE WHEN POSSTR(Address3,',')>1 THEN 
       TRIM(SUBSTR(Address3,POSSTR(Address3,',')+4))
       END AS Zip
  FROM Address

It's fragile because it assumes a rigid format of the data. For example, it is expected that there is only one space before the state. Bullet-proofing for other possibilities makes the expressions bigger and uglier. One way to handle these expressions is to pack them into user defined functions, another is to use regular expressions. If you're already familiar with REs, skip to the "The SQL Regular Expression UDFs" section, below.

Regular Expression Patterns

So what makes REs superior, and how can they help? The superiority of REs lies in their ability to match advanced string patterns. (They're so advanced, I can't yet figure out many of them. But, hey, they're from the Unix world!) SQL/400 uses simple wildcard (%) and character substitution (?) in its patterns. REs can do this and more. Here is a list of some basic RE constructs, out of which RE patterns are built:


Characters:

Construct Matches
x the character x
\\ the backslash character
\t the tab character ('\u0009')
\n the newline (line feed) character ('\u000A')
\r the carriage-return character ('\u000D')

Character Classes:

Construct Matches
[abc] a, b, or c (simple class)
[^abc] any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)

Predefined Character Classes:

Construct Matches
. any character (may or may not match line terminators)
\d a digit: [0-9]
\D a non-digit: [^0-9]
\s a whitespace character: space, tab, carriage return, line feed, etc. [ \t\n\x0B\f\r]
\S a non-whitespace character: [^\s]
\w a word character: [a-zA-Z_0-9]
\W a non-word character: [^\w]


Boundary Matchers:

Construct Matches
^ the beginning of a line
$ the end of a line


Logical Operators:

Construct Matches
| or

RE constructs vary among systems. (For example, Perl's implementation of REs isn't quite the same as Java.) The list I've displayed is for Java 1.4 because this version of Java is used to implement the regular expression UDFs. The full list of pattern constructs supported by Java's RE engine can be found on Sun Microsystems' Web site.

What Can REs Do?

With a little code, here's what REs can do with these advanced patterns:

  • Test for occurrences of the pattern, similar to LIKE.

  • Find a specific occurrence of the pattern (first, second, third), similar to POSSTR.

  • Replace occurrences of the pattern with a new string.

  • Dissect a string into tokens, based on a delimiter pattern.

Regular Expression Examples

To work with REs, a pattern must be constructed using the constructs shown above. Say you need to verify that a character column contains a phone number in the yyy-zzzz format. Here's a RE pattern to test for this structure:

\d{3}-\d{4}

Here's how to interpret the pattern:

  • The \d means to find a digit (0-9). Adding {3} means to find three consecutive digits.

  • The hypen (-) by itself means to find a hyphen right after the identified three digits.

  • The \d{4} means to find four digits after the hyphen.

A test of the pattern against the string (810) 524-4210 would succeed because part of the string matches the pattern. A test of the pattern against 810 524 4210 would fail because the pattern demands that a hyphen exist between the numbers. A test of the pattern against 810524-42102423 would succeed because the pattern does exist within the string.

Performing a RE find of this pattern against the string Hey, my phone number is (616) 812-3201 would result in 812-3201, because this part of the string matches the pattern \d{3}-\d{4}.

To include the area code, enhance the pattern as follows:

\(\d{3}\) \d{3}-\d{4}

This pattern specifies the following:

  • Look for an open parenthesis: \(

  • Followed by three digits: \d{3}

  • Followed by a closed parenthesis: \)

  • Followed by a blank space

  • Followed by the prior phone number pattern: \d{3}-\d{4}

Incidentally, parentheses and other characters, like periods (.), brackets ([]), and question marks (?), are called metacharacters and have a special meaning in REs. Therefore, to specify that parentheses are to be interpreted as real characters within the pattern, rather than as metacharacters, the parentheses needs to be "escaped" with a backslash (\).

Back to the phone number pattern. Area codes and exchanges should not begin with a zero or a 1, so you can further enhance the phone number pattern to include this check as part of a valid phone number test.

\([2-9]\d{2}\) [2-9]\d{2}-\d{4}

Now the pattern reads as follows:

  • Find a left parenthesis: \(

  • Followed by a digit in the range of 2-9: [2-9]

  • Followed by another two digits: \d{2}

  • Followed by a right parenthesis: \)

  • Followed by a blank space

  • Followed by a digit in the range of 2-9: [2-9]

  • Followed by another two digits: \d{2}

  • Followed by a hyphen: -

  • Followed by four digits: \d{4}

You could do all kinds of things to further enhance the pattern to test if the phone number is valid. You could, for instance, easily make the space, parenthesis, or hyphen optional. The options are limitless.

Confusing Constructs

You will notice that a few characters have multiple functions. If, for instance, a caret (^) is the first character in a pattern, it means the pattern to follow must be found at the beginning of the string; otherwise, the caret means "not," as in [^abc], which means any character except a, b, or c.

Case-Insensitive Patterns

Patterns are normally case-sensitive. They can be made case insensitive by specifying (?i) in the pattern. For example, given the string Mike is a stooge, an attempt to do a RE replace to substitute IS NOT for the simple pattern IS will fail, because the case doesn't match. However, specifying a pattern of (?i)IS will succeed because case is ignored, and the result will be Mike IS NOT a stooge.

Quantifiers

Another construct I'll briefly introduce is called a quantifier. These can be used to test for zero or more occurrences of a character or characters.


Quantifier Description
X* find pattern X, zero or more times
X+ find X, one or more times
X{n} find X, n times
X{n,} find X, at least n times
X{n,m} find X, at least n but not more than m times

Here, X can be a single character or a group of characters enclosed by parenthesis. For example, the pattern a{4,6} specifies to find occurrences of a four to six times consecutively. Likewise, the pattern (mike){3,} means mike has to be found at least three consecutive times in the input string. The pattern ((mike).*){3,} translates to locate mike at least three times in the string but not necessarily consecutively. The period construct in this pattern means "any character" is valid (except for a line terminator, such as a carriage return or a line feed), which allows mike to be found three times, even if zero or more characters (as signified by .*) are present between the occurrences of mike.

Further, by default these quantifiers are "greedy" because they return the largest string possible when a pattern match is found. When searching the string Mike rides a bike, for pattern (?i)M.*E, the result is Mike rides a bike. That's because greedy quantifiers start searching from the end of a string and move backward, resulting in the largest possible pattern match.

By contrast, a quantifier can be made "reluctant" by appending a question mark after the quantifier symbol. In this case, the pattern search starts at the beginning of the string, moving forward, so the smallest possible match is found. Searching the string Mike rides a bike for pattern (?i)M.*?E will now result in Mike (the smallest possible match).

These few examples are only introductory for the building of RE patterns. A thorough tutorial on regular expressions can be found on Sun's Web site.

The SQL Regular Expression UDFs

Here are the abbreviated UDF signatures for using regular expressions introduced in the last article. Remember, these signatures are used to define the function for SQL and to equate the function with a static method in a Java class. These UDFs perform the four basic RE tasks (test, replace, find, and split):

CREATE FUNCTION re_ReplaceAll(input VarChar(500),
                              RegEx VarChar(500),
                              Replace VarChar(500))
RETURNS VARCHAR(500)


CREATE FUNCTION re_Test(input VarChar(500),
                        RegEx VarChar(500))
RETURNS INTEGER


CREATE FUNCTION re_Find(input VARCHAR(500),
                        RegEx VarChar(500))
RETURNS VARCHAR(500)


CREATE FUNCTION re_Find(input VARCHAR(500),
                        RegEx VarChar(500),
                        occurrence INTEGER)
RETURNS VARCHAR(500)


CREATE FUNCTION re_Split(input VARCHAR(500),
                         Delimiter VARCHAR(500))
RETURNS VARCHAR(500)


CREATE FUNCTION re_Split(input VARCHAR(500),
                         Delimiter VARCHAR(500),
                         Occurrence INTEGER)
RETURNS VARCHAR(500)

The UDFs Find and Split are overloaded, which means they have more than one signature.

Function re_Test will return a zero if no matches are found, and re_Replace will return the original value if no replacements are made. The remaining functions return character data and will return a NULL if the pattern match fails. Now consider examples of using these functions.

Split and Find Pattern Examples

Let's go back to the Address3 example. Using the RE functions, here's how the city, state and ZIP code can be retrieved:

SELECT re_Split(Address3,',') AS City,
       re_Split(Address3,', {0,}| {0,}\d',2) AS State,
       re_Find(Address3,'\d{5}') AS Zip
  FROM Address

You may be thinking this statement doesn't look so good, either. It's not the prettiest (in all honesty, the cryptic RE constructs are down right homely), but it is doing everything the original did for us and more. For the address string Grand Rapids, MI 49546, the columns will return the following:

  • re_Split(Address3,',') splits the string into an array based on the delimiter (,). For our data, the array will have two elements: Grand Rapids and MI 49546. When the second parameter of re_Split is omitted, the function returns the first element of the split: Grand Rapids.

  • re_Split(Address3,', {0,}| {0,}\d',2) will also split the string into an array. However, if you want to isolate the state by itself, you have to break the string in two places (before and after the state). To break before the state, the first part of the pattern specifies to split the string on a comma, followed by zero or more spaces. The second part of the pattern, which follows the vertical bar, says to split the string when you encounter zero or more spaces followed by a digit (\d). The string will split in two pieces: Grand Rapids and MI. What happened to the ZIP code? Remember the delimiters are removed from the string, so the comma and digits are missing from the resulting array. The second parameter of re_Split instructs the function to return the second element of the array (the state). Although unattractive, this scheme is advantageous because the function is versatile enough to handle data entry variants so that "Grand Rapids,   MI   49546" and "Grand Rapids,MI49546" are processed properly.

  • re_Find(Address3,'\d{5}') is the easiest to interpret. This function will find and return the first pattern in the string that consists of five digits. In this case, it will return 49546.

Handling Complicated Cases

You can enhance the pattern to accept ZIP+4, as follows:

\d{5}-\d{4} 

This says to find five digits, followed by a hyphen, followed by four digits. Since ZIP+4s are often missing, the pattern can accept either pattern by using the "or" (|) character again:

\d{5}-\d{4}|\d{5}

You can further restrict erroneous data by verifying that the ZIP code only comes at the end. If Grand12345 Rapids, MI 49546 is processed, the current pattern would return a ZIP code of 12345. You can ensure that the ZIP comes only at the end of the string by adding a dollar sign ($) to the end of the patterns:

\d{5}-\d{4}$|\d{5}$'

But what if there are trailing spaces after the ZIP code? Such a scenario can be handled by trimming the function first with the SQL TRIM function, or the pattern can be enhanced to allow for zero or more spaces after the ZIP code:

\d{5}-\d{4} {0,}$|\d{5} {0,}$

Once you get used to looking at them, the regular expressions aren't difficult to interpret. Because of their unsightliness, you may end up embedding them in UDFs, but they're still easier to develop and to debug than traditional string-handling methods, and they can handle more cases.

Another Pattern Find Example

Several years ago, I was on a project where I needed to extract a manufacturing order (MO) number from a free-form comment field on a purchase order. The MO# was supposed to be CHAR(7) and to always start with M, followed by six digits. This was not a task for standard SQL.

If the column COMMENT contained the text Vendor 12345 M042345 is scheduled to start 08/29/2000, the following pattern could be used to find and extract the MO number:

re_Find(COMMENT, 'M\d{6}') AS MO#

The function would return M042345, the only text in the comment that matched the pattern. If only I had had REs for that project!

The pattern can be enhanced for little problems. What if the MO number started with a lowercase M? A problem I had was that the second character was often typed as an O rather than as a zero, because the numbers were referred to verbally as "MOs"! These two problems can be accommodated easily by changing the pattern:

re_Find(COMMENT,'[Mm][Oo0-9]\d{5}') AS MO#

Where [Mm] means to find an upper or lowercase M. The [Oo0-9] means the second position of the string should be an uppercase or lowercase O or a digit. The \d{5} means the next five characters are digits.

Pattern Test Examples

The re_Test function will check a string for a pattern and return the number of occurrences of the pattern. If a zero is returned, the pattern was not found.

In the prior example, re_Find was used to find a CHAR(7) field, starting with M, followed by digits in the COMMENTS field. If the pattern wasn't matched, that would be an error condition. Likewise, if the string was found more than once, you'd need to flag an error condition, because two embedded MO numbers could be a problem. If COMMENT contained the following:

 
Prior MO M012344 was cancelled  The new MO is M012345 

The pattern would match twice, and a person is now required to figure out which number is correct. Here's an example that pulls out any comments that have no MO number or that have more than one MO number:

-- Exception Report
-- Select comments where there isn't an
-- identifiable MO or there is < 1.

SELECT COMMENT
  FROM PO_Comment
 WHERE re_Test(COMMENT, '[Mm][Oo0-9]\d{5}')<>1

In the address extraction example, you may not want to process a row that contains an invalid address such as Grand Rapids, 49546. In this case, you can come up with a test pattern that will validate the entire address and use it to test with before final processing:

-- Select bad records where the address
-- doesn't conform to "City, ST ZIP" format
SELECT RRN(Address) AS Row, Address3
  FROM mike/Address
 WHERE re_Test(Address3, 
'.{3,}, {0,}[A-Za-z]{2} {0,}(\d{5}-\d{4} {0,}$|\d{5} {0,}$)')=0

Here's the breakdown of the test pattern:


Pattern Portion Description
.{3,}, find three or more characters, followed by a comma (CITY)
{0,} followed by zero or more spaces
[A-Za-z]{2} followed by two upper or lowercase letters (STATE)
{0,} followed by zero or more spaces
(\d{5}-\d{4} {0,}$|\d{5} {0,}$) followed by the ZIP code check illustrated earlier in the article (ZIP)

For interactive SQL users: don't break the pattern string in the middle unless you stop it and start it; otherwise unwanted spaces may appear in the pattern:

WHERE re_Test(Address3, 
 '.{3,}, {0,}[A-Za-z]{2} {0,}' ||
 '(\d{5}-\d{4} {0,}$|\d{5} {0,}$)')=0

A Replace Example

The re_Replace function can do standard search and replace tasks, where:

re_ReplaceAll('14" tire', '"',' inch')

Results in 14 inch tire, with the double quotation marks being replaced by the word inch. However, RE pattern replacements offer more flexibility.

Say you need to clean the address data so all extraneous spaces should be removed. So the following:

 
 Grand   Rapids, MI  49503 

Becomes the following:

 
Grand Rapids, MI 49503

Here's a pattern than will allow two or more spaces to be changed to one:

UPDATE Address 
   SET Address3=TRIM(re_ReplaceAll(Address3,' {2,}',' '))
 WHERE re_Test(Address3,' {2,}')>0 

The find pattern ' {2,}' will locate all portions of the string that have two or more consecutive spaces and will replace them with a single space. Another useful search pattern involves using case-insensitive replacements using the (?i) construct.

Allow RPG and COBOL to access REs with SQL

Don't forget, RPG and COBOL can benefit from the REs by using embedded SQL:

C/EXEC SQL
C+ -- Regular Expression Test to validate phone# format
C+ Set :PhoneOK=re_Test(:Phone,'\(\d{3}\) \d{3}-\d{4}')
C/END-EXEC

C/Free
    If PhoneOk=0;
        CallP Message('Invalid Phone # Entered');
    Else;
    ...

For the functions that return NULLs, don't forget to use indicator variables!

Reduce Code

Don't shy away from REs. Although the syntax is intimidating at first, you will get used to them, just like subfiles and some of the other iSeries oddities (if you can remember RPG spec columns, you can memorize RE patterns). As always, start with the simple examples first. If you import text data from an external source, you need to "scrub" dirty data in your database, or you have to perform complex validation on user input, REs will make your life easier.


Michael Sansoterra is a programmer/analyst for SilverLake Resources, an IT services firm based in Grand Rapids, Michigan. E-mail: msansoterra@silver-lake.com


Sponsored By
DAMON TECHNOLOGIES

Make Stored Procedure Testing easy with STE

STE (Stored Procedure Test Environment) is the easiest way to test Stored Procedures.

· Prompts for input parameters.
· Display the result set or multiple sets.
· Display the returned parameters.
· Screen or printed output.

After creating a stored procedure, developers are faced with the challenge of testing. With STE, the developer can easily call the stored procedure, step through debug, view the returned parameters, and result set or multiple sets.

STE is an excellent communication tool between the developer of the business logic and the web developer. STE will produce a report, listing the input parameters, showing the result set(s) and the out parameters, thus eliminating miscommunication between developers.

Do like so many other companies have, cut your stored procedure development cycle time and save your company money.

Download your FREE copy of STE today!

www.damontech.com



THIS ISSUE
SPONSORED BY:

Damon Technologies
Bug Busters Software Engineering
WorksRight Software
Profound Logic Software


BACK ISSUES

TABLE OF
CONTENTS
Staggering SQL String Handling with Regular Expressions

Binder Source: The Little Language That Could

Control Access Path Rebuilding

OS/400 Alert: OS/400 Passwords Can Be Seen


Editors
Shannon O'Donnell
Kevin Vandever

Managing Editor
Shannon Pastore

Contributing Editors:
Howard Arner
Raymond Everhart
Joe Hertvik
Ted Holt
Marc Logemann
David Morris

Publisher and
Advertising Director:

Jenny Thomas

Advertising Sales Representative
Kim Reed

Contact the Editors
Do you have a gripe, inside dope or an opinion?
Email the editors:
editors@itjungle.com


Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.