Digging Out Data Duplication
November 6, 2013 Hey, Ted
Sometimes an SQL query should return one row, yet it returns more than one. The problem turns out to be multiple matching rows in a secondary table. Is there a way to easily isolate the secondary table that causes more than one match?
Yes, there is an easy way. But first, let me set up the problem for the edification of other readers.
Sometimes we execute an SQL query with the expectation that the result set will contain only one row (record), and we are surprised to get back two or more rows instead. At least one table (physical file)–usually one of the secondary tables–has more than one row that matches the join criteria. (I use the word “table” here to mean either table or view.)
I have noticed two main causes for this behavior.
Finding the problem is not usually trivial because production queries often join several tables, and it’s common for some of the join criteria in a query to involve only secondary tables.
A simple way to find the culprit is to use the RRN (relative record number) function. Here’s a simple illustration of the technique.
Assume an item master table. It is uniquely keyed on item number, of course. Assume two other temporary tables that were loaded from spreadsheets. One temporary table has new prices for some items. The other has new descriptions for some items. We expect only one row per item in each table, but people, being the imperfect beings they are, may accidentally load more than one row for an item.
Here’s a query that should return one row for item BR-549.
select m.item, np.price, nd.descr from itemmast as m left join newprices as np on m.item = np.item left join newdesc as nd on m.item = nd.item where m.item = 'BR-549'
And here is the result set.
ITEM PRICE DESCR BR-549 5.00 Widget BR-549 5.00 Widget
To locate the duplication, add RRN functions for the tables.
select m.item, rrn(m), rrn(np), rrn(d) from itemmast as m left join newprices as np on m.item = np.item left join newdesc as d on m.item = d.item where m.item = 'BR-549'
And now the verdict:
ITEM RRN ( M ) RRN ( NP ) RRN ( D ) BR-549 2 2 2 BR-549 2 3 2
Look at the RRN columns. The values are different in the third column, therefore the new prices table is the problem.
select * from newprices as np where np.item = 'BR-549' ITEM PRICE BR-549 5.00 BR-549 5.00