fhg
Volume 10, Number 17 -- May 26, 2010

Distinctly Speeding Up DISTINCT

Published: May 26, 2010

by Ted Holt

The DISTINCT keyword is a great example of the power of the SQL SELECT statement. Add this powerful word to any SELECT command and voilà! Duplicate rows magically disappear. Comments from readers have alerted me to the fact that some people find use of this feature confusing. Here are a brief discussion of DISTINCT and a performance tip.

Given a table (physical file) of sales order information and a table of customer information, how do I go about finding the account numbers of customers who have orders in the database? We could try this:

select s.companyno, s.customerno
  from salesordh as s
 order by 1, 2

And we would get something like this:

COMPANYNO  CUSTOMERNO
     1       34567
     1       34567
     1       45678
     1       56789
     1       77777
     2       12345
     2       23456
     2       56789

Marvelous, but do I really need to have each customer listed once per order? In this short example, only one customer—34567--has more than one order. In a production database, containing hundreds or thousands of orders, there could be many duplicates. We can eliminate duplicates by adding the word DISTINCT:

select distinct s.companyno, s.customerno
  from salesordh as s
 order by 1, 2

And we would get this instead:

COMPANYNO  CUSTOMERNO
     1       34567
     1       45678
     1       56789
     1       77777
     2       12345
     2       23456
     2       56789

Let's take it a step farther. Suppose I also want the customer's name. For that, I must go to the customer master file. Here's the same query, but I've added the customer master file.

select distinct s.companyno, s.customerno, c.cusnam
  from salesordh as s
  join customers as c
    on c.companyno = s.companyno
   and c.customerno = s.customerno
 order by 1, 2

This works. I get this:

COMPANYNO  CUSTOMERNO  CUSNAM
     1       34567     Polly Unsaturated
     1       45678     Molly Coddle
     1       56789     R. U. Furreal
     1       77777     Herman Nootix
     2       12345     Cal E. Phornya
     2       23456     Billy Rubin
     2       56789     Sally Mander

But notice something. Notice that all three selected fields are in the customer master file. Notice also that two of the fields--company number and customer number--are the key fields that uniquely identify each customer. In other words, why join, which creates multiple rows for each customer, then throw away duplicates? It's unnecessary. Here's the rewritten query:

select c.companyno, c.customerno, c.cusnam
  from customers as c
 where exists
    (select * 
       from salesordh as s
      where s.companyno  = c.companyno
        and s.customerno = c.customerno)
 order by 1, 2

DISTINCT is gone. The main query reads CUSTOMERS only. The sales order table has been relegated to a subquery. The resulting data set is the same, but performance should be better.

To sum it up, use DISTINCT when you're retrieving data from one table (or view) only. When two or more tables and/or views are involved, and all selected columns are from one table, you can usually get a performance advantage by converting the query to include a subquery and throwing away DISTINCT.




                     Post this story to del.icio.us
               Post this story to Digg
    Post this story to Slashdot


Sponsored By
SEQUEL SOFTWARE

SEQUEL ViewPoint®--Data Access & Analysis
for Power Systems™ Servers

                                               · Easy to use by IT and end users
                                               · Real-time data access and analysis
                                               · Queries, reports, pivot tables, drill-down analysis, kpi dashboards
                                               · IBM i-centric with multiplatform data access
                                               · Expert support and training
                                               · Secure data access
                                               · GUI, Web & green screen interface

SEQUEL--Fast, efficient & cost-effective data analysis

Click here for a FREE Information Kit!


Senior Technical Editor: Ted Holt
Technical Editor: Joe Hertvik
Contributing Technical Editors: Erwin Earley, Brian Kelly, Michael Sansoterra
Publisher and Advertising Director: Jenny Thomas
Advertising Sales Representative: Kim Reed
Contact the Editors: To contact anyone on the IT Jungle Team
Go to our contacts page and send us a message.

Sponsored Links

IBS:  Free e-book: The Six Margin Killers in Wholesale Distribution
WorksRight Software:  ZIP codes, area codes, Canadian postal codes, CASS certification, and more
COMMON:  Join us at the Fall 2010 Conference & Expo, Oct. 4 - 6, in San Antonio, Texas


 

IT Jungle Store Top Book Picks

Easy Steps to Internet Programming for AS/400, iSeries, and System i: List Price, $49.95
The iSeries Express Web Implementer's Guide: List Price, $49.95
The System i RPG & RPG IV Tutorial and Lab Exercises: List Price, $59.95
The System i Pocket RPG & RPG IV Guide: List Price, $69.95
The iSeries Pocket Database Guide: List Price, $59.00
The iSeries Pocket SQL Guide: List Price, $59.00
The iSeries Pocket Query Guide: List Price, $49.00
The iSeries Pocket WebFacing Primer: List Price, $39.00
Migrating to WebSphere Express for iSeries: List Price, $49.00
Getting Started With WebSphere Development Studio Client for iSeries: List Price, $89.00
Getting Started with WebSphere Express for iSeries: List Price, $49.00
Can the AS/400 Survive IBM?: List Price, $49.00
Chip Wars: List Price, $29.95


 
The Four Hundred
Power7 Blades Plus i Versus X64 Blades Plus Windows

Transitions Push Systems and Technology Group into the Red

AS/400 LUG Shares Chief i Architect's "Why i?" Arguments

Creativity Is the New Business Kool-Aid, IBM CEO Study Finds

IBM Emphasizes 'Deeper Skills' in New Business Partner Program

Four Hundred Stuff
Security and Auditing Breakthrough Gives Cilasoft Compliance Advantage

DBU Now Accessible Over the Web

PowerTech Delivers Object-Based Control Over Exit Points

IBM Considers Offering Trials of RPG OA

Databorough Targets CA 2E Apps for Migration

Four Hundred Monitor
Four Hundred Monitor's
Full iSeries Events Calendar

System i PTF Guide
May 22, 2010: Volume 12, Number 21

May 15, 2010: Volume 12, Number 20

May 8, 2010: Volume 12, Number 19

May 1, 2010: Volume 12, Number 18

April 24, 2010: Volume 12, Number 17

April 17, 2010: Volume 12, Number 16

TPM at The Register
Cray-1 resurfaces in pieces on eBay

Novell puts Identity Management into beta

BMC reveals 'free money' mainframe and DB2 tools

Oracle shows off M9000s for data warehousing

Server rebound, Perot services boost Dell

NOAA goes to Cray for climate super

Cisco shells out $99m for CoreOptics

Oracle punts first VirtualBox x64 hypervisor

Novell seeks rich suitors

Oracle sneaks out carrier grade Sparc blade, Xeon rack

SUSE Linux 11 gets first service pack

IBM punts commercial Hadoop distro

THIS ISSUE SPONSORED BY:

WorksRight Software
SEQUEL Software
Botz & Associates, Inc.


Printer Friendly Version


TABLE OF CONTENTS
Five Steps To Monitoring Your Server Log on IBM i

Distinctly Speeding Up DISTINCT

Hey! What Happened To My Last Used Dates

Four Hundred Guru

BACK ISSUES




 
Subscription Information:
You can unsubscribe, change your email address, or sign up for any of IT Jungle's free e-newsletters through our Web site at http://www.itjungle.com/sub/subscribe.html.

Copyright © 1996-2010 Guild Companies, Inc. All Rights Reserved.
Guild Companies, Inc., 50 Park Terrace East, Suite 8F, New York, NY 10034

Privacy Statement