• The Four Hundred
  • Subscribe
  • Media Kit
  • Contributors
  • About Us
  • Contact
Menu
  • The Four Hundred
  • Subscribe
  • Media Kit
  • Contributors
  • About Us
  • Contact
  • Finding Duplicate Rows in a Name and Address Table

    May 25, 2011 Skip Marchesani

    Many companies still use the U.S. Postal Service to mail promotional and marketing materials to existing and potential customers. With rising printing expenses and postal rates, it’s important to minimize waste by identifying and eliminating duplicate rows in name and address tables. However, these tables can present some interesting challenges.

    For instance, the following two addresses are duplicates, but how do you identify them as potential duplicate rows?

    John Sampson
    Dickerson Mutual Ins.
    218 Bridge St., Suite 257
    New York, NY 12345

    John Samson
    Dickerson Insurance
    218 Bridge Street
    NYC, NY 12345

    Companies can write their own software, which has its own challenges, or purchase software from a third party, which can be quite expensive. This tip shows how to use an SQL inner join along with some simple algorithms to identify such rows.

    I developed this technique several years ago when cleaning up a name and address table used for mailing a conference brochure, and I still use this method today. This was an iterative process and there were several refinements before this technique produced the desired results. I chose the inner join because a result set row is returned only when a row in the left or outer table has a matching row in the right or inner table. Thus the name inner join (inner matching row).

    This discussion will review the process I went through to develop the technique and my sample data will be on a name and address table with the following attributes:

    Table Name: MAILLIST
    Total Rows: 30,000+
    

    Column
    Name

    Length

    Data
    Type

    Description

    IDNBR

    5.0

    Dec

    Customer Number

    FNAME

    15

    Char

    First Name

    LNAME

    30

    Char

    Last Name

    COMPNY

    50

    Char

    Company Name

    ADDRL1

    50

    Char

    Address Line 1

    ADDRL2

    50

    Char

    Address Line 2

    CITY

    25

    Char

    City

    STATE

    2

    Char

    State

    ZIP

    10

    Char

    Zip Code

    Other Data

     

     

     

    The first time I tried using the inner join, I thought it would be a simple matter of doing an inner join of the MAILLIST table to itself using the following SQL syntax.

    SELECT  a.idnbr,  a.fname,  a.lname,  a.compny,  a.addrl1,
       	      a.addrl2,  a.city,  a.state,  a.zip
              FROM  maillist  a  INNER JOIN  maillist  b
                 ON  a.fname  =  b.fname
                     AND  a.lname  =  b.lname
                     AND  a.compny  =  b.compny
                     AND  a.addrl1  =  b.addrl1
                     AND  a.addrl2  =  b.addrl2
                     AND  a.city  =  b.city
                     AND  a.state  =  b.state
                     AND  a.zip  =  b.zip
                     AND  a.idnbr  < >  b.idnbr
              ORDER BY  a.lname,  a.compny  a.fname
    

    Note that only the columns of the MAILLIST table containing address data are used as part of the join criteria (the OTHER DATA is not used), with the join condition for all but IDNBR being an “equal” condition. The join condition for IDNBR must be a “not equal” condition since this column provides unique identification for each row in the table. If the join condition for IDNBR is equal, every row in the table is returned in the result set.

    When I ran the above SQL statement, no rows were returned in the result set.

    The reason for this is obvious if you look at the duplicate rows for John Sampson at the beginning of this article. There are differences in the last name, company, address, and city for each of the John Sampson rows. Even with the differences between the two rows, mailing a letter to each of these addresses is going to result in two pieces of mail being delivered to the same address and person.

    This brings us to a definition of a duplicate row.

    From a pure database perspective, two rows would be considered duplicates when every position of every column in each row is the same. In other words, the two rows are a duplicate of each other when the data contained in each and every position for each row is identical.

    This is not the case with a name and address table used for mailing purposes. In this case, two rows may be a duplicate if the data contained in the same columns for each row contains similar data or data that closely matches but may not be an exact match.

    The trick is to figure out how to use SQL to identify such rows in the name and address table.

    After many iterations and much analysis I drew the following conclusions and was able to simplify the SQL syntax used for the inner join and therefore get better results.

    1. A single company could have multiple street addresses for the same location in the same city and state. Including ADDRL1 (address line 1) and ADDRL2 (address line 2) has a negative impact on the result set (decreased the number of rows returned). Therefore, these two columns were not included as part of the join criteria.

    2. Including the ZIP (zip code) column had no impact, positive or negative, on the result set. Therefore it was not included as part of the join criteria.

    3. Spelling or keying errors, and/or differences in name representation for first and last name, company, and city were the root causes for no records being returned. Upon further analysis, I determined that spelling or keying accuracy was very high for the first three characters of a name string, with accuracy falling off dramatically from the fourth character on. The same was true for most differences in name representation; the first three characters of the name were the same. Therefore a subset of the data for a given column–the first three characters–could be used to get better results instead of the entire column.

    Based on the above conclusions I revised the SQL syntax for the inner join as follows.

    SELECT  a.idnbr,  a.fname,  a.lname,  a.compny,  a.city
              FROM  maillist  a  INNER JOIN  maillist  b
                 ON  a.state  =  b.state
                     AND  SUBSTR(a.fname,1,3)  =  SUBSTR(b.fname,1,3)
                     AND  SUBSTR(a.lname,1,3)  =  SUBSTR(b.lname,1,3)
                     AND  SUBSTR(a.compny,1,3)  =  SUBSTR(b.compny,1,3)
                     AND  SUBSTR(a.city,1,3)  =  SUBSTR(b.city,1,3)
                     AND  a.idnbr  < >  b.idnbr
              ORDER BY  a.lname,  a.compny,  a.fname
    

    The first time I ran this SQL statement, only the following four rows were returned in the result set, and I knew there were more duplicates than those four rows.

    Let’s look at these four rows first before we discuss why they were the only ones returned.

    IDNBR

    FNAME

    LNAME

    COMPNY

    CITY

    08314

    Chris

    Smerr

    Metal Fabricators

    Barre City

    40384

    Chris

    Smer

    Metal Fabrication Inc

    Barre

    50168

    Randy

    Dodd Jr

    Masters Inc

    Jackson

    50167

    Randall

    Dodd Sr

    Masters Inc

    Jackson

    The second two rows (IDNBRS 50168 and 50167) are probably not duplicates since the last name is followed by a Jr in one case and a Sr in the other, and it is very likely that a father and son could be working a the same company.

    The first two rows (IDNBRS 08314 and 40384) most likely are duplicates of each other, with the differences being the misspelling of the last name of Smerr (Smerr versus Smer) and the name representation for Metal Fabricators (Metal Fabricators versus Metal Fabrication Inc).

    If these two rows are indeed duplicates, which one do you delete?

    The answer in my case was the one that was most current. The row format for the MAILLIST table contains a column called LASTACTD or Last Activity Date, and the row with the oldest activity date was the one deleted.

    What if both records contained historical data that had value? In that case you would have to determine how to merge that data into one of the two rows before deleting the other row, assuming merging the data was possible.

    If you don’t have a way to determine which row contains the most current data, try flipping a coin.

    Now let’s address the question of why only four rows were returned. The cause was differences in name representation of company and city. Specifically it was abbreviations of company and city versus the complete spelling out of the name.

    • NYC vs. New York City
    • St. Louis vs. Saint Louis
    • Ft. Worth vs. Forth Worth
    • IBM vs. International Business Machines
    • UPS vs. United Parcel Service
    • GE vs. General Electric

    After several iterations of testing, I determined that the solution was to have two versions of the SQL syntax for the inner join as shown below. The first one included company and excluded city, and the second one excluded company and included city.

    CITY Excluded from Join Criteria
    
    SELECT  a.idnbr,  a.fname,  a.lname,  a.compny,  a.city
              FROM  maillist  a  INNER JOIN  maillist  b
                 ON  a.state  =  b.state
                     AND  SUBSTR(a.fname,1,3)  =  SUBSTR(b.fname,1,3)
                     AND  SUBSTR(a.lname,1,3)  =  SUBSTR(b.lname,1,3)
                     AND  SUBSTR(a.compny,1,3)  =  SUBSTR(b.compny,1,3)
                     AND  a.idnbr  < >  b.idnbr
              ORDER BY  lname,  compny,  fname
    
    COMPNY Excluded from Join Criteria
    
    SELECT  a.idnbr,  a.fname,  a.lname,  a.compny,  a.city
              FROM  maillist  a  INNER JOIN  maillist  b
                 ON  a.state  =  b.state
                     AND  SUBSTR(a.fname,1,3)  =  SUBSTR(b.fname,1,3)
                     AND  SUBSTR(a.lname,1,3)  =  SUBSTR(b.lname,1,3)
                     AND  SUBSTR(a.city,1,3)  =  SUBSTR(b.city,1,3)
                     AND  a.idnbr  < >  b.idnbr
              ORDER BY  lname,  compny,  fname
    

    When I ran each of the above SQL statements, I got the results that I was looking for. Over 400 rows were identified as potential duplicates with only four of those rows existing in both result sets.

    Using the above SQL syntax, the two rows for John Sampson at the beginning of this article would be correctly identified when CITY was excluded from the join criteria.

    In this article, this SQL syntax for inner join was used to identify potential duplicate rows in a single table, where the table was joined to itself. It could also be used to identify duplicate rows that exist in two different tables simply by inserting the table names in the appropriate place in the JOIN syntax of the FROM clause in the SELECT statement.

    Will the use of an inner join as shown in this article find every potential duplicate record? No, it will not. But in my case it did find about 95 percent of the duplicates, I didn’t have to purchase expensive code from a third party, and I didn’t have to write complex application code to do the job. In the end the SQL syntax required to accomplish the task is relatively simple and straightforward.

    Skip Marchesani retired from IBM after 30 years and decided to pursue a career as a consultant in the IBM i marketplace. He spent much of his IBM career working with the Rochester Development Lab on projects for S/38 and AS/400, and was involved with the development of the AS/400. Skip was part of the team that taught early AS/400 education to customers and IBM lab sites worldwide. Now recognized as an industry expert on DB2 for i (a.k.a. DB2/400), and author of the book “DB2/400: The New AS/400 Database,” Skip specializes in providing customized education for any area of the IBM i and AS/400, does database design and design reviews, and general IBM i, iSeries, and AS/400 consulting. He has been a speaker for user groups, technical conferences, and IBM i, iSeries, and AS/400 audiences around the world, and has received COMMON’s Distinguished Service Award. Skip Recently moved to Vermont and is now president of the Vermont Mid Range User Group (VTMUG). Send your questions or comments for Skip to Ted Holt via the IT Jungle Contact page.



                         Post this story to del.icio.us
                   Post this story to Digg
        Post this story to Slashdot

    Share this:

    • Reddit
    • Facebook
    • LinkedIn
    • Twitter
    • Email

    Tags:

    Sponsored by
    DRV Tech

    Get More Out of Your IBM i

    With soaring costs, operational data is more critical than ever. IBM shops need faster, easier ways to distribute IBM applications-based data to users more efficiently, no matter where they are.

    The Problem:

    For Users, IBM Data Can Be Difficult to Get To

    IBM Applications generate reports as spooled files, originally designed to be printed. Often those reports are packed together with so much data it makes them difficult to read. Add to that hardcopy is a pain to distribute. User-friendly formats like Excel and PDF are better, offering sorting, searching, and easy portability but getting IBM reports into these formats can be tricky without the right tools.

    The Solution:

    IBM i Reports can easily be converted to easy to read and share formats like Excel and PDF and Delivered by Email

    Converting IBM i, iSeries, and AS400 reports into Excel and PDF is now a lot easier with SpoolFlex software by DRV Tech.  If you or your users are still doing this manually, think how much time is wasted dragging and reformatting to make a report readable. How much time would be saved if they were automatically formatted correctly and delivered to one or multiple recipients.

    SpoolFlex converts spooled files to Excel and PDF, automatically emailing them, and saving copies to network shared folders. SpoolFlex converts complex reports to Excel, removing unwanted headers, splitting large reports out for individual recipients, and delivering to users whether they are at the office or working from home.

    Watch our 2-minute video and see DRV’s powerful SpoolFlex software can solve your file conversion challenges.

    Watch Video

    DRV Tech

    www.drvtech.com

    866.378.3366

    Share this:

    • Reddit
    • Facebook
    • LinkedIn
    • Twitter
    • Email

    Sponsored Links

    VAULT400:  Take a 14-question Disaster Recovery Survey, 4 will win $50 Visa gift card
    SEQUEL Software:  View the recorded Webinar: 10 Ways SEQUEL Makes Developers More Productive
    looksoftware:  FREE Webcast: RPG Open Access Demystified. June 7 (Europe) & June 8 (USA)

    IT Jungle Store Top Book Picks

    BACK IN STOCK: Easy Steps to Internet Programming for System i: List Price, $49.95

    The iSeries Express Web Implementer's Guide: List Price, $49.95
    The iSeries Pocket Database Guide: List Price, $59
    The iSeries Pocket SQL Guide: List Price, $59
    The iSeries Pocket WebFacing Primer: List Price, $39
    Migrating to WebSphere Express for iSeries: List Price, $49
    Getting Started with WebSphere Express for iSeries: List Price, $49
    The All-Everything Operating System: List Price, $35
    The Best Joomla! Tutorial Ever!: List Price, $19.95

    ASCI Job Scheduler Gets Hooks for Nagios I, Cloud-i-us

    Leave a Reply Cancel reply

Volume 11, Number 17 -- May 25, 2011
THIS ISSUE SPONSORED BY:

WorksRight Software
SEQUEL Software
VAULT400

Table of Contents

  • Finding Duplicate Rows in a Name and Address Table
  • Asserting the Ostensibly Unassertable
  • Admin Alert: Other Options for a Post-Upgrade Backup

Content archive

  • The Four Hundred
  • Four Hundred Stuff
  • Four Hundred Guru

Recent Posts

  • Meet The Next Gen Of IBMers Helping To Build IBM i
  • Looks Like IBM Is Building A Linux-Like PASE For IBM i After All
  • Will Independent IBM i Clouds Survive PowerVS?
  • Now, IBM Is Jacking Up Hardware Maintenance Prices
  • IBM i PTF Guide, Volume 27, Number 24
  • Big Blue Raises IBM i License Transfer Fees, Other Prices
  • Keep The IBM i Youth Movement Going With More Training, Better Tools
  • Remain Begins Migrating DevOps Tools To VS Code
  • IBM Readies LTO-10 Tape Drives And Libraries
  • IBM i PTF Guide, Volume 27, Number 23

Subscribe

To get news from IT Jungle sent to your inbox every week, subscribe to our newsletter.

Pages

  • About Us
  • Contact
  • Contributors
  • Four Hundred Monitor
  • IBM i PTF Guide
  • Media Kit
  • Subscribe

Search

Copyright © 2025 IT Jungle