Guru: When An Outer Join Is An Inner Join
February 15, 2021 Ted Holt
When is a boy not a boy? When he’s abed! When is a door not a door? When it’s ajar! When is an outer join not an outer join? (Sorry, no dad joke here. Three dad jokes in one paragraph would have been too many, don’t you agree?) Let me answer that last question.
In my work I often see outer joins that are not really outer joins, but inner joins. Oh, based on what I’ve heard from IBM, the query engine may treat them as outer joins, but the result set is the same as that produced by an inner join. This occurs when a query tests a column (field) from a secondary table or view (database file) for a non-null value.
Before I show you an example, let’s look at some data. Here’s an invoice header table:
And here’s the corresponding detail table:
Here’s the type of query I was referring to. There is a left outer join from the header to the detail, and the WHERE clause references the ITEM column, which is stored in the details.
select h.invoice, h.company, h.CustNbr, h.InvDate, d.line, d.item, d.qty from invhdr as h left join invdtl as d on h.invoice = d.invoice where d.item = 'AB-205';
Here’s the result set.
Since the query specifically asks for an outer join, we’d expect to see invoices 101 through 104 inclusive, but instead we see the same results we’d seen had we used an inner join.
So, which is it — an inner join or an outer join? And does it matter, as long as the query produces the correct results? I can’t speak for you, but it matters to me.
First, there’s the matter of self-documentation, the idea that a human being should be able to read source code and understand what it’s doing and possibly even why it’s doing it. Testing a secondary column in an outer join is instant confusion. Did the developer who wrote that statement want an outer join? Did the developer think he wanted an outer join? Did the developer think?
Second, there’s the matter of determining what the correct results should be. In many shops, the test data is a copy of the live data, sometimes with identifiable data changed, sometimes not. In some shops, the test data is the live data. There are two glaring problems with such test data: (1) It is rarely comprehensive, and therefore does not exercise all paths through the software and (2) there is usually far too much of it. Test data sets should be small enough that the output can be verified easily.
It’s possible that the programmer who wrote this SELECT did a poor job of testing. The programmer may have run a quick test against the production database, saw that the numbers matched a screen or report, and moved along to the next fire.
It’s also possible that the programmer did want an outer join, but only wanted secondary data for item AB-205. I wrote about that type of query in my younger days. If so, this statement definitely has a bug.
As I wrote a few months ago, certain programming practices raise a little flag when I see them. Testing for non-null values with outer joins is another such practice. One good way to help the flag go up is to use correlation names. Seeing d. in front of item increases the chances that you’ll notice that the test is for a column in a secondary table.
I should also mention that testing for null is a different story. An inner join returns only those rows where the secondary-table column is null.
select h.invoice, h.company, h.CustNbr, h.InvDate, d.line, d.item, d.qty from invhdr as h join invdtl as d on h.invoice = d.invoice where d.item is null;
The outer join returns the rows with a null secondary column as well as null values for unmatched rows.
select h.invoice, h.company, h.CustNbr, h.InvDate, d.line, d.item, d.qty from invhdr as h left join invdtl as d on h.invoice = d.invoice where d.item is null;
Effective, reliable code is not an accident. In the words of Sergeant Phil Esterhaus, “Let’s be careful out there.”