Digging Out Data Duplication

November 6, 2013 Hey, Ted

Sometimes an SQL query should return one row, yet it returns more than one. The problem turns out to be multiple matching rows in a secondary table. Is there a way to easily isolate the secondary table that causes more than one match?

–J

Yes, there is an easy way. But first, let me set up the problem for the edification of other readers.

Sometimes we execute an SQL query with the expectation that the result set will contain only one row (record), and we are surprised to get back two or more rows instead. At least one table (physical file)–usually one of the secondary tables–has more than one row that matches the join criteria. (I use the word “table” here to mean either table or view.)

I have noticed two main causes for this behavior.

Someone keyed duplicate information into the database. For example, a table that was loaded from a spreadsheet contains two rows for the same customer, vendor, item, or what have you, when it should contain only one.
The join criteria are insufficient. For instance, the fallible human being who wrote the query joined two tables on a common customer number, but should have joined them on common customer number and company number.

Finding the problem is not usually trivial because production queries often join several tables, and it’s common for some of the join criteria in a query to involve only secondary tables.

A simple way to find the culprit is to use the RRN (relative record number) function. Here’s a simple illustration of the technique.

Assume an item master table. It is uniquely keyed on item number, of course. Assume two other temporary tables that were loaded from spreadsheets. One temporary table has new prices for some items. The other has new descriptions for some items. We expect only one row per item in each table, but people, being the imperfect beings they are, may accidentally load more than one row for an item.

Here’s a query that should return one row for item BR-549.

select m.item, np.price, nd.descr
  from itemmast as m
 left join newprices as np
    on m.item = np.item
 left join newdesc as nd
    on m.item = nd.item
 where m.item = 'BR-549'

And here is the result set.

ITEM     PRICE   DESCR 
BR-549    5.00   Widget
BR-549    5.00   Widget

To locate the duplication, add RRN functions for the tables.

select m.item, rrn(m), rrn(np), rrn(d)
  from itemmast as m
  left join newprices as np
    on m.item = np.item
  left join newdesc as d
    on m.item = d.item
where m.item = 'BR-549'

And now the verdict:

ITEM    RRN ( M )  RRN ( NP )  RRN ( D )
BR-549         2           2          2
BR-549         2           3          2

Look at the RRN columns. The values are different in the third column, therefore the new prices table is the problem.

select * from newprices as np
where np.item = 'BR-549'

ITEM     PRICE
BR-549    5.00
BR-549    5.00

–Ted

                     Post this story to del.icio.us
               Post this story to Digg
    Post this story to Slashdot

Table of Contents

Content archive

Recent Posts

Subscribe

Pages

Search