Guru: Another Red Flag – Joining On Expressions
August 30, 2021 Ted Holt
One would think that a given datum, for example a sales order number, would be defined identically in the various database tables in which it is found within an organization, but one might be wrong. I have on many occasions faced the challenge of joining two or more tables on unmatched data types.
But it gets worse than that. Sometimes joins involve expressions, which may consist of mathematical operations and/or invocations of functions, both intrinsic and user-written. As with the word DISTINCT in a SELECT, the presence of expression in a join sends up a little red flag that tells me that I should review the query. Who knows? There may be a performance problem that needs to be addressed.
Ideally all joins would be between identical data types. Consider two tables: sales order header and sales order detail.
create table sordh ( OrderNbr dec (5), CustomerID dec (5), DueDate date, primary key (OrderNbr)); insert into sordh values ( 320, 20260, '2021-07-15'), ( 321, 33750, '2021-07-12'), ( 322, 19455, '2021-07-25'); create table sordd ( OrderNbr dec (5), LineNbr dec (3), Item char (6), Quantity dec (3), primary key ( OrderNbr, LineNbr )); insert into sordd values ( 320, 1, 'AB-101', 12), ( 320, 2, 'BC-202', 1), ( 321, 1, 'XZ-411', 2), ( 322, 1, 'JR-399', 6);
These two tables obviously join on order number, which is defined as five digits packed decimal in both tables.
Now let’s bring in the manufacturing order table.
create table mord ( OrderNbr dec (5), Item char (6), Quantity dec (3), DueDate date, Comment char (24), primary key ( OrderNbr )); insert into mord values ( 725, 'C-9091', 2, '2021-07-01', 'SABGR7AA00005 DE00320001'), ( 726, 'K-3432', 1, '2021-07-01', 'TLD 7AB00003X 00320001'), ( 727, 'D-8850', 1, '2021-07-05', 'MENO 7TB00121 4 00321001'), ( 728, 'F-2855', 1, '2021-06-30', 'PLOTZ6 00009 00320002');
Let’s invent a bit of history. This organization uses an ERP package that was designed for a make-to-stock operation. As such, there is no way to tie a manufacturing order to a sales order line. When the business decided to make to order, the enterprising IT department staff decided to put order number and order line number in the last eight positions of the COMMENT field.
Here’s the join between manufacturing order and sales order header.
select m.OrderNbr, m.Item, m.DueDate, soh.CustomerID from mord as m left join sordh as soh on substr(m.Comment, 17, 5) = soh.OrderNbr;
|Manufacturing order||Item||Due date||Customer ID|
This is pretty much guaranteed to perform less than stellar. The system will have to read every row of MORD to see what’s in bytes 17 through 21 of the COMMENT column.
One way that should accelerate this query is to create an index over that piece of COMMENT.
create index mord1 on mord (substr(Comment, 17, 5));
I say should because I can’t be sure the query engine will use the index. I’ve even seen the query engine ignore indexes that I had built upon the recommendation of the index advisor.
There’s another way, the one we used to use before indexes could be built over expressions. We can create a cross-reference table.
Let’s say that sales order numbers are stored in three formats in our imaginary system. You’ve just seen two of them. The third is in a shipments table. We don’t like the way our ERP handles shipping, so we use someone else’s shipping software, and it has a spot for an eight-character sales order number.
create table shipments ( ID dec(7), Date date, Carrier char(3), SalesOrd char(8), primary key (ID)); create index shipment1 on shipments (cast (SalesOrd as dec(8))); insert into shipments values ( 184, '2021-07-15', 'ABC', '00000320'), ( 185, '2021-07-15', 'ABC', '00000322');
Here’s a join between shipments and sales order headers.
select sh.ID, sh.Date, sh.Carrier, so.OrderNbr, so.CustomerID, so.DueDate from shipments as sh left join sordh as so on sh.SalesOrd = so.OrderNbr order by sh.ID;
|Shipment ID||Ship date||Carrier||Sales order||Customer ID||Due date|
All is well and good. There are no expressions in that join. Or are there? The sales order number is five digits in one table and eight characters in the other. There is an implicit conversion.
OK, so back to the cross-reference table. Let’s create a table that has one row per sales order. Each row has the sales order number in the three formats that we use.
create table soxref ( OrderNbr dec(5), OrderC8 char(8), OrderC5 char(5), primary key (OrderNbr)); create index soxref1 on soxref (OrderC8); create index soxref2 on soxref (OrderC5); insert into soxref values ( 320, '00000320', '00320'), ( 321, '00000321', '00321'), ( 322, '00000322', '00322');
Let’s look at that shipments join again.
select sh.ID, sh.Date, sh.Carrier, so.OrderNbr, so.CustomerID, so.DueDate from shipments as sh left join soxref as x on sh.SalesOrd = x.OrderC8 left join sordh as so on x.OrderNbr = so.OrderNbr order by sh.ID;
|Shipment ID||Ship date||Carrier||Sales order||Customer ID||Due date|
Adding the cross-reference table to the query removed all data conversions from the joins.
But, is there really a need for a cross-reference table? We indexed the sales order column in each table. Isn’t that enough?
Yes, it is. But creating the cross-reference, in addition to eliminating the data conversions, opens up some other possibilities.
One, we can duplicate frequently-requested data from the other tables into the cross reference. For example, suppose it is very common to include customer number in queries. We might add customer number to the cross reference to eliminate the need to include the sales order header in a query when no other header columns are needed. I don’t advocate this technique. I try to adhere strongly to normalization to keep anomalies out of the database. However, if used properly, duplication can be effective.
Two, we can add additional columns that are functionally dependent on order number, but don’t have a place in the database. In this example, the sales order header comes from a software vendor. Unless this vendor has a way for us to add our own data to the database, it would probably be a mistake for us to add additional fields to the SORDH table. We could add our own data to the cross reference instead.
This brings me to a really good example. In some of the shops where I’ve worked there was a calendar file, also called a date dimension file. Such a file (or table) has one record (row) for each day. The file has fields (columns) for all kinds of values, such as:
- the date in various date formats and data types
- day of the week in one or more formats
- flags that indicate whether the date is a work day, weekend day, holiday, and so forth
- the value returned by the DAYS function
- the first and last days of the month
- accounting period information
- and so on
Here’s the definition of a simple calendar table.
create or replace table Calendar ( BaseDate date, YYMD numeric(8), MDY_Edited char(8), MDYY_Edited char(10), MDY numeric(6), MDYY numeric(8), DayOfWeek numeric(1), DOWAbbr char(3), DayOfWeekName varchar(9), primary key (BaseDate)); create index Calendar1 on Calendar (yymd);
Here are a few rows by way of example.
|BaseDate||YYMD||MDY Edited||MDYY Edited||MDY||MDYY||Day||Abbrev||Day name|
Using such a table removes date-conversion functions from queries, not only in the JOIN clause, but also in the SELECT, WHERE, and perhaps other clauses.
When I started work on this article a few months ago, I intended to devote a good bit of space to this example because of the usefulness of having a table of dates in the database. However, in the meantime, Kent Milligan published an article on this topic on his blog. Rather than repeat what he’s said, let me refer you to his article for more information about joining with a calendar table. While I’m at it, let me encourage you to subscribe to Kent’s blog if you haven’t done so already.
I’d also like to refer you to this article, in which Aaron Bertrand writes about using a calendar table in a Microsoft SQL Server database. He goes wild with the possibilities.
But back to the topic at hand. Using expressions to join tables certainly works. The query engine is powerful and can retrieve the results you need. I’m not saying that you should never join with expressions. If performance is acceptable, then leave the query alone and find a more useful task to spend your time on.
And if performance is unacceptable, but creating an index is out of the question, hold your nose and let the query run.