Distinctly Speeding Up DISTINCT
May 26, 2010 Ted Holt
The DISTINCT keyword is a great example of the power of the SQL SELECT statement. Add this powerful word to any SELECT command and voilà! Duplicate rows magically disappear. Comments from readers have alerted me to the fact that some people find use of this feature confusing. Here are a brief discussion of DISTINCT and a performance tip.
Given a table (physical file) of sales order information and a table of customer information, how do I go about finding the account numbers of customers who have orders in the database? We could try this:
select s.companyno, s.customerno from salesordh as s order by 1, 2
And we would get something like this:
COMPANYNO CUSTOMERNO 1 34567 1 34567 1 45678 1 56789 1 77777 2 12345 2 23456 2 56789
Marvelous, but do I really need to have each customer listed once per order? In this short example, only one customer—34567–has more than one order. In a production database, containing hundreds or thousands of orders, there could be many duplicates. We can eliminate duplicates by adding the word DISTINCT:
select distinct s.companyno, s.customerno from salesordh as s order by 1, 2
And we would get this instead:
COMPANYNO CUSTOMERNO 1 34567 1 45678 1 56789 1 77777 2 12345 2 23456 2 56789
Let’s take it a step farther. Suppose I also want the customer’s name. For that, I must go to the customer master file. Here’s the same query, but I’ve added the customer master file.
select distinct s.companyno, s.customerno, c.cusnam from salesordh as s join customers as c on c.companyno = s.companyno and c.customerno = s.customerno order by 1, 2
This works. I get this:
COMPANYNO CUSTOMERNO CUSNAM 1 34567 Polly Unsaturated 1 45678 Molly Coddle 1 56789 R. U. Furreal 1 77777 Herman Nootix 2 12345 Cal E. Phornya 2 23456 Billy Rubin 2 56789 Sally Mander
But notice something. Notice that all three selected fields are in the customer master file. Notice also that two of the fields–company number and customer number–are the key fields that uniquely identify each customer. In other words, why join, which creates multiple rows for each customer, then throw away duplicates? It’s unnecessary. Here’s the rewritten query:
select c.companyno, c.customerno, c.cusnam from customers as c where exists (select * from salesordh as s where s.companyno = c.companyno and s.customerno = c.customerno) order by 1, 2
DISTINCT is gone. The main query reads CUSTOMERS only. The sales order table has been relegated to a subquery. The resulting data set is the same, but performance should be better.
To sum it up, use DISTINCT when you’re retrieving data from one table (or view) only. When two or more tables and/or views are involved, and all selected columns are from one table, you can usually get a performance advantage by converting the query to include a subquery and throwing away DISTINCT.