Distinctly Speeding Up DISTINCT

May 26, 2010 Ted Holt

The DISTINCT keyword is a great example of the power of the SQL SELECT statement. Add this powerful word to any SELECT command and voilà! Duplicate rows magically disappear. Comments from readers have alerted me to the fact that some people find use of this feature confusing. Here are a brief discussion of DISTINCT and a performance tip.

Given a table (physical file) of sales order information and a table of customer information, how do I go about finding the account numbers of customers who have orders in the database? We could try this:

select s.companyno, s.customerno
  from salesordh as s
 order by 1, 2

And we would get something like this:

COMPANYNO  CUSTOMERNO
     1       34567
     1       34567
     1       45678
     1       56789
     1       77777
     2       12345
     2       23456
     2       56789

Marvelous, but do I really need to have each customer listed once per order? In this short example, only one customer—34567–has more than one order. In a production database, containing hundreds or thousands of orders, there could be many duplicates. We can eliminate duplicates by adding the word DISTINCT:

select distinct s.companyno, s.customerno
  from salesordh as s
 order by 1, 2

And we would get this instead:

COMPANYNO  CUSTOMERNO
     1       34567
     1       45678
     1       56789
     1       77777
     2       12345
     2       23456
     2       56789

Let’s take it a step farther. Suppose I also want the customer’s name. For that, I must go to the customer master file. Here’s the same query, but I’ve added the customer master file.

select distinct s.companyno, s.customerno, c.cusnam
  from salesordh as s
  join customers as c
    on c.companyno = s.companyno
   and c.customerno = s.customerno
 order by 1, 2

This works. I get this:

COMPANYNO  CUSTOMERNO  CUSNAM
     1       34567     Polly Unsaturated
     1       45678     Molly Coddle
     1       56789     R. U. Furreal
     1       77777     Herman Nootix
     2       12345     Cal E. Phornya
     2       23456     Billy Rubin
     2       56789     Sally Mander

But notice something. Notice that all three selected fields are in the customer master file. Notice also that two of the fields–company number and customer number–are the key fields that uniquely identify each customer. In other words, why join, which creates multiple rows for each customer, then throw away duplicates? It’s unnecessary. Here’s the rewritten query:

select c.companyno, c.customerno, c.cusnam
  from customers as c
 where exists
    (select * 
       from salesordh as s
      where s.companyno  = c.companyno
        and s.customerno = c.customerno)
 order by 1, 2

DISTINCT is gone. The main query reads CUSTOMERS only. The sales order table has been relegated to a subquery. The resulting data set is the same, but performance should be better.

To sum it up, use DISTINCT when you’re retrieving data from one table (or view) only. When two or more tables and/or views are involved, and all selected columns are from one table, you can usually get a performance advantage by converting the query to include a subquery and throwing away DISTINCT.

                     Post this story to del.icio.us
               Post this story to Digg
    Post this story to Slashdot