|
||||||||
|
|
![]() |
|
|
Staggering SQL String Handling with Regular Expressions by Michael Sansoterra [The code for this article is available for download.] "Empower SQL with Java User-Defined Functions" explained how to create UDFs with Java and to register them for use with SQL. This article will illustrate the usefulness of the regular expression functions created in the previous article. Who cares if you don't know Java! Read on anyway, because one benefit of UDFs is that you don't have to know anything about what they're written in or how they're implemented. All you need to know is how to use them. To use these functions, you must have the 5722JV1 Java Developer Kit and 5722JC1 Toolbox for Java installed on your system. You also must be using Version 1.4 of the JDK (V5R2). Finally, the UDFs class should be compiled and the functions registered (as outlined in the previous article). Click here to download the UDFs.java code. Regular expressions are splendid for working with strings, and they beat the tar out of standard string-related functions, such as LIKE, SUBSTRING, and POSSTR. The main issue with traditional string handling is the amount of work it takes to do a nontrivial task. Consider, for example, column Address3 that contains data in City, ST ZIP format. Here's a fragile SQL statement to extract the city, state and ZIP code into three separate columns:
SELECT
CASE WHEN POSSTR(Address3,',')>1 THEN
LEFT(Address3,POSSTR(Address3,',')-1)
END AS City,
CASE WHEN POSSTR(Address3,',')>1 THEN
TRIM(SUBSTR(Address3,POSSTR(Address3,',')+1,3))
END AS State,
CASE WHEN POSSTR(Address3,',')>1 THEN
TRIM(SUBSTR(Address3,POSSTR(Address3,',')+4))
END AS Zip
FROM Address
It's fragile because it assumes a rigid format of the data. For example, it is expected that there is only one space before the state. Bullet-proofing for other possibilities makes the expressions bigger and uglier. One way to handle these expressions is to pack them into user defined functions, another is to use regular expressions. If you're already familiar with REs, skip to the "The SQL Regular Expression UDFs" section, below. Regular Expression Patterns So what makes REs superior, and how can they help? The superiority of REs lies in their ability to match advanced string patterns. (They're so advanced, I can't yet figure out many of them. But, hey, they're from the Unix world!) SQL/400 uses simple wildcard (%) and character substitution (?) in its patterns. REs can do this and more. Here is a list of some basic RE constructs, out of which RE patterns are built: Characters:
Character Classes:
Predefined Character Classes:
Boundary Matchers:
Logical Operators:
RE constructs vary among systems. (For example, Perl's implementation of REs isn't quite the same as Java.) The list I've displayed is for Java 1.4 because this version of Java is used to implement the regular expression UDFs. The full list of pattern constructs supported by Java's RE engine can be found on Sun Microsystems' Web site. What Can REs Do? With a little code, here's what REs can do with these advanced patterns:
Regular Expression Examples To work with REs, a pattern must be constructed using the constructs shown above. Say you need to verify that a character column contains a phone number in the yyy-zzzz format. Here's a RE pattern to test for this structure:
\d{3}-\d{4}
Here's how to interpret the pattern:
A test of the pattern against the string (810) 524-4210 would succeed because part of the string matches the pattern. A test of the pattern against 810 524 4210 would fail because the pattern demands that a hyphen exist between the numbers. A test of the pattern against 810524-42102423 would succeed because the pattern does exist within the string. Performing a RE find of this pattern against the string Hey, my phone number is (616) 812-3201 would result in 812-3201, because this part of the string matches the pattern \d{3}-\d{4}. To include the area code, enhance the pattern as follows:
\(\d{3}\) \d{3}-\d{4}
This pattern specifies the following:
Incidentally, parentheses and other characters, like periods (.), brackets ([]), and question marks (?), are called metacharacters and have a special meaning in REs. Therefore, to specify that parentheses are to be interpreted as real characters within the pattern, rather than as metacharacters, the parentheses needs to be "escaped" with a backslash (\). Back to the phone number pattern. Area codes and exchanges should not begin with a zero or a 1, so you can further enhance the phone number pattern to include this check as part of a valid phone number test.
\([2-9]\d{2}\) [2-9]\d{2}-\d{4}
Now the pattern reads as follows:
You could do all kinds of things to further enhance the pattern to test if the phone number is valid. You could, for instance, easily make the space, parenthesis, or hyphen optional. The options are limitless. Confusing Constructs You will notice that a few characters have multiple functions. If, for instance, a caret (^) is the first character in a pattern, it means the pattern to follow must be found at the beginning of the string; otherwise, the caret means "not," as in [^abc], which means any character except a, b, or c. Case-Insensitive Patterns Patterns are normally case-sensitive. They can be made case insensitive by specifying (?i) in the pattern. For example, given the string Mike is a stooge, an attempt to do a RE replace to substitute IS NOT for the simple pattern IS will fail, because the case doesn't match. However, specifying a pattern of (?i)IS will succeed because case is ignored, and the result will be Mike IS NOT a stooge. Quantifiers Another construct I'll briefly introduce is called a quantifier. These can be used to test for zero or more occurrences of a character or characters.
Here, X can be a single character or a group of characters enclosed by parenthesis. For example, the pattern a{4,6} specifies to find occurrences of a four to six times consecutively. Likewise, the pattern (mike){3,} means mike has to be found at least three consecutive times in the input string. The pattern ((mike).*){3,} translates to locate mike at least three times in the string but not necessarily consecutively. The period construct in this pattern means "any character" is valid (except for a line terminator, such as a carriage return or a line feed), which allows mike to be found three times, even if zero or more characters (as signified by .*) are present between the occurrences of mike. Further, by default these quantifiers are "greedy" because they return the largest string possible when a pattern match is found. When searching the string Mike rides a bike, for pattern (?i)M.*E, the result is Mike rides a bike. That's because greedy quantifiers start searching from the end of a string and move backward, resulting in the largest possible pattern match. By contrast, a quantifier can be made "reluctant" by appending a question mark after the quantifier symbol. In this case, the pattern search starts at the beginning of the string, moving forward, so the smallest possible match is found. Searching the string Mike rides a bike for pattern (?i)M.*?E will now result in Mike (the smallest possible match). These few examples are only introductory for the building of RE patterns. A thorough tutorial on regular expressions can be found on Sun's Web site. The SQL Regular Expression UDFs Here are the abbreviated UDF signatures for using regular expressions introduced in the last article. Remember, these signatures are used to define the function for SQL and to equate the function with a static method in a Java class. These UDFs perform the four basic RE tasks (test, replace, find, and split):
CREATE FUNCTION re_ReplaceAll(input VarChar(500),
RegEx VarChar(500),
Replace VarChar(500))
RETURNS VARCHAR(500)
CREATE FUNCTION re_Test(input VarChar(500),
RegEx VarChar(500))
RETURNS INTEGER
CREATE FUNCTION re_Find(input VARCHAR(500),
RegEx VarChar(500))
RETURNS VARCHAR(500)
CREATE FUNCTION re_Find(input VARCHAR(500),
RegEx VarChar(500),
occurrence INTEGER)
RETURNS VARCHAR(500)
CREATE FUNCTION re_Split(input VARCHAR(500),
Delimiter VARCHAR(500))
RETURNS VARCHAR(500)
CREATE FUNCTION re_Split(input VARCHAR(500),
Delimiter VARCHAR(500),
Occurrence INTEGER)
RETURNS VARCHAR(500)
The UDFs Find and Split are overloaded, which means they have more than one signature. Function re_Test will return a zero if no matches are found, and re_Replace will return the original value if no replacements are made. The remaining functions return character data and will return a NULL if the pattern match fails. Now consider examples of using these functions. Split and Find Pattern Examples Let's go back to the Address3 example. Using the RE functions, here's how the city, state and ZIP code can be retrieved:
SELECT re_Split(Address3,',') AS City,
re_Split(Address3,', {0,}| {0,}\d',2) AS State,
re_Find(Address3,'\d{5}') AS Zip
FROM Address
You may be thinking this statement doesn't look so good, either. It's not the prettiest (in all honesty, the cryptic RE constructs are down right homely), but it is doing everything the original did for us and more. For the address string Grand Rapids, MI 49546, the columns will return the following:
Handling Complicated Cases You can enhance the pattern to accept ZIP+4, as follows:
\d{5}-\d{4}
This says to find five digits, followed by a hyphen, followed by four digits. Since ZIP+4s are often missing, the pattern can accept either pattern by using the "or" (|) character again:
\d{5}-\d{4}|\d{5}
You can further restrict erroneous data by verifying that the ZIP code only comes at the end. If Grand12345 Rapids, MI 49546 is processed, the current pattern would return a ZIP code of 12345. You can ensure that the ZIP comes only at the end of the string by adding a dollar sign ($) to the end of the patterns:
\d{5}-\d{4}$|\d{5}$'
But what if there are trailing spaces after the ZIP code? Such a scenario can be handled by trimming the function first with the SQL TRIM function, or the pattern can be enhanced to allow for zero or more spaces after the ZIP code:
\d{5}-\d{4} {0,}$|\d{5} {0,}$
Once you get used to looking at them, the regular expressions aren't difficult to interpret. Because of their unsightliness, you may end up embedding them in UDFs, but they're still easier to develop and to debug than traditional string-handling methods, and they can handle more cases. Another Pattern Find Example Several years ago, I was on a project where I needed to extract a manufacturing order (MO) number from a free-form comment field on a purchase order. The MO# was supposed to be CHAR(7) and to always start with M, followed by six digits. This was not a task for standard SQL. If the column COMMENT contained the text Vendor 12345 M042345 is scheduled to start 08/29/2000, the following pattern could be used to find and extract the MO number:
re_Find(COMMENT, 'M\d{6}') AS MO#
The function would return M042345, the only text in the comment that matched the pattern. If only I had had REs for that project! The pattern can be enhanced for little problems. What if the MO number started with a lowercase M? A problem I had was that the second character was often typed as an O rather than as a zero, because the numbers were referred to verbally as "MOs"! These two problems can be accommodated easily by changing the pattern:
re_Find(COMMENT,'[Mm][Oo0-9]\d{5}') AS MO#
Where [Mm] means to find an upper or lowercase M. The [Oo0-9] means the second position of the string should be an uppercase or lowercase O or a digit. The \d{5} means the next five characters are digits. Pattern Test Examples The re_Test function will check a string for a pattern and return the number of occurrences of the pattern. If a zero is returned, the pattern was not found. In the prior example, re_Find was used to find a CHAR(7) field, starting with M, followed by digits in the COMMENTS field. If the pattern wasn't matched, that would be an error condition. Likewise, if the string was found more than once, you'd need to flag an error condition, because two embedded MO numbers could be a problem. If COMMENT contained the following: Prior MO M012344 was cancelled The new MO is M012345 The pattern would match twice, and a person is now required to figure out which number is correct. Here's an example that pulls out any comments that have no MO number or that have more than one MO number:
-- Exception Report
-- Select comments where there isn't an
-- identifiable MO or there is < 1.
SELECT COMMENT
FROM PO_Comment
WHERE re_Test(COMMENT, '[Mm][Oo0-9]\d{5}')<>1
In the address extraction example, you may not want to process a row that contains an invalid address such as Grand Rapids, 49546. In this case, you can come up with a test pattern that will validate the entire address and use it to test with before final processing:
-- Select bad records where the address
-- doesn't conform to "City, ST ZIP" format
SELECT RRN(Address) AS Row, Address3
FROM mike/Address
WHERE re_Test(Address3,
'.{3,}, {0,}[A-Za-z]{2} {0,}(\d{5}-\d{4} {0,}$|\d{5} {0,}$)')=0
Here's the breakdown of the test pattern:
For interactive SQL users: don't break the pattern string in the middle unless you stop it and start it; otherwise unwanted spaces may appear in the pattern:
WHERE re_Test(Address3,
'.{3,}, {0,}[A-Za-z]{2} {0,}' ||
'(\d{5}-\d{4} {0,}$|\d{5} {0,}$)')=0
A Replace Example The re_Replace function can do standard search and replace tasks, where:
re_ReplaceAll('14" tire', '"',' inch')
Results in 14 inch tire, with the double quotation marks being replaced by the word inch. However, RE pattern replacements offer more flexibility. Say you need to clean the address data so all extraneous spaces should be removed. So the following: Grand Rapids, MI 49503 Becomes the following: Grand Rapids, MI 49503 Here's a pattern than will allow two or more spaces to be changed to one:
UPDATE Address
SET Address3=TRIM(re_ReplaceAll(Address3,' {2,}',' '))
WHERE re_Test(Address3,' {2,}')>0
The find pattern ' {2,}' will locate all portions of the string that have two or more consecutive spaces and will replace them with a single space. Another useful search pattern involves using case-insensitive replacements using the (?i) construct. Allow RPG and COBOL to access REs with SQL Don't forget, RPG and COBOL can benefit from the REs by using embedded SQL:
C/EXEC SQL
C+ -- Regular Expression Test to validate phone# format
C+ Set :PhoneOK=re_Test(:Phone,'\(\d{3}\) \d{3}-\d{4}')
C/END-EXEC
C/Free
If PhoneOk=0;
CallP Message('Invalid Phone # Entered');
Else;
...
For the functions that return NULLs, don't forget to use indicator variables! Reduce Code Don't shy away from REs. Although the syntax is intimidating at first, you will get used to them, just like subfiles and some of the other iSeries oddities (if you can remember RPG spec columns, you can memorize RE patterns). As always, start with the simple examples first. If you import text data from an external source, you need to "scrub" dirty data in your database, or you have to perform complex validation on user input, REs will make your life easier. Michael Sansoterra is a programmer/analyst for SilverLake Resources, an IT services firm based in Grand Rapids, Michigan. E-mail: msansoterra@silver-lake.com
|
Editors
Contact the Editors |
| Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved. |