The Valuable NULL
by Michael Sansoterra
Nil, Nada, and NULL: One of these terms is of great benefit to the database programmer; the others are worthless. Coming from the "old school" world of flat files, COBOL and RPG, I had never heard of a NULL until I worked with Microsoft Access, in 1994. Later I came across NULLs on the iSeries. While I found NULLs problematic at first, I eventually found them to be a valuable aid in database development. If NULLs exhibit behavior that amuses or confuses you, read on to find out what NULLs are good for and how they operate.
So what is a NULL? The Oracle Glossary of Terms gives as good a definition as any: "A null value represents missing, unknown, or inapplicable data."
So a NULL is a "non-value"; that is, a NULL indicates that there is no data in a table column or variable. A common error is to assume that a numeric column containing a zero is a NULL. A similar error assumes blanks or an empty string is a NULL. Both of these views are incorrect because zeros, blanks, and an empty string are indeed values. Also, don't confuse the NULL string terminator (ASCII 0) in the C programming language with the NULL discussed here. The NULL in C is a value.
I'll be using the SQL Procedure Language (SPL) to demonstrate a few concepts. For easy reading, all SPL variable names will be prefixed with the @ sign. You can easily substitute embedded SQL here, remembering to replace the @ variables with host variables, prefixed by a colon (:), and also remember to use NULL indicator variables so that the host language can act appropriately on a NULL. (For more on embedded SQL and indicator variables, see the DB2 Universal Database for iSeries SQL Programming with Host Languages guide, in PDF format.)
Of What Value Is a Non-Value?
Since NULLs aren't values, what purpose do they serve? Why would you want to store a NULL in, say, a numeric column rather than just letting it contain a zero? Below are a few examples that demonstrate the benefit of NULLs.
Consider an application for a teacher that tracks students' test scores. Each student is allowed to miss a single test without hurting his grade. Say a student has the following 10 test scores recorded:
From the list, you can see that no score was entered for test 7 , the one "freebee." Assuming these test scores are stored as rows in a table, SQL's average function can be used to determine the average of these 10 scores, with the result of 821 / 9 = 91.2. But wait! There are 10 scores, so how come only nine were used in the average calculation? That's because the SQL aggregate functions ignore NULLs. NULL is better than a zero here, because if a zero were entered the average would be incorrectly calculated as 821 / 10 = 82.1. Since NULL is not a number, it's not included in the average, which saves the programmer from programmatically weeding out the lowest score before calculating the average.
In addition to AVG, other SQL column functions, such as COUNT and SUM, do not regard NULL values in their aggregations.
Below is a sample table called TestScores, which has a column, called Score, that is capable of holding the NULL value (which is a default; specifying NOT NULL prevents a column from containing a NULL).
CREATE TABLE TestScores ( School_Year SMALLINT NOT NULL, Student_No SMALLINT NOT NULL, Test_No SMALLINT NOT NULL, Score DEC(4,1), PRIMARY KEY (School_Year, Student_No, Test_No))
Here is a sample SELECT statement that would average a student's scores, ignoring NULL values. It also COUNTs the number of scores to ensure that one has been left NULL:
SELECT Student_No, AVG(Score) AS Average, /* Verify we have 10 scores entered */ CASE WHEN COUNT(Student_No) <>10 THEN 'ERROR' ELSE 'OK' END AS TestCount, /* Verify only one score is NULL */ CASE WHEN COUNT(Score) <>9 THEN 'ERROR' ELSE 'OK' END AS ScoreCount FROM TestScores WHERE "SCHOOL_YEAR"=2002 GROUP BY Student_No ORDER BY Student_No
For the next example on the benefits of NULL, consider an application for tracking church contributions. Each individual who gives on a regular basis is assigned a box of envelopes identifying an envelope with a contributor. Of course, each box of envelopes must have a unique number. A table for storing individual church membership information may be defined as follows:
CREATE TABLE DataLib/Indiv ( Family_No INTEGER NOT NULL, Indiv_No INTEGER NOT NULL, FirstName CHAR(30) NOT NULL, LastName CHAR(30), Envelope_No SMALLINT, PRIMARY KEY (Family_No,Indiv_No))
To guarantee column Envelope_No is unique, you build the following index:
CREATE UNIQUE WHERE NOT NULL INDEX DataLib/INDIVENV ON DataLib/INDIV (Envelope_No)
Not every individual member is a regular contributor. Some don't give at all and others give cash. These people won't have an envelope number, so the Envelope_No column will be NULL for these folks. Will this scenario still work? Yes, because when the unique index is built, the WHERE NOT NULL phrase is used to instruct the index to ignore NULLs. The benefit that NULL provides in this scenario is that the index on the Envelope_No column is still guaranteed to be unique, with the exception that NULLs are ignored.
This situation would be invalid if zeros were entered in the envelope number column, because zeros are values and would be included in the duplicate check. Further, programming time is again saved by having to avoid checks for duplicates or from building specialized select/omit logicals to enforce uniqueness for non-zero entries.
The envelope number and test score examples are cases in which NULLs are intended to represent "missing" data. An example of a NULL being "unknown" would be the actual ship date on an order header record. In this case, the actual ship date is unknown and is therefore left NULL until the goods are shipped and a date value is assigned to the column. For the last sample, let's examine a circumstance in which a NULL would represent "inapplicable" data.
Consider the following query to calculate the growth percentage of an item from one year to the next:
/* If there were no sales last year, the growth % will be NULL */ Select Item, CurSales, LYSales, Case When LYSales<>0 Then (CurSales-LYSales)/LYSales*100 End As GrowthPct From SalesSummary
What should we do for new items in which the prior year's sales are zero? In this circumstance, many programmers arbitrarily assign the percentage a value of either 0 or 100. In my opinion, returning NULL is appropriate here, representing "inapplicable" data, because the calculation doesn't apply when there are no sales for the prior year.
In case you're wondering, when omitting the ELSE from the CASE statement, a NULL is returned when none of the preceding conditions are met. In other words, if you don't specify an ELSE, SQL will implicitly treat the CASE statement as if an ELSE NULL were coded.
Now that you've seen how NULLs can be put to good use, let's consider some potential problems. Not understanding how NULLs operate in expressions and comparisons was one of the biggest pitfalls I encountered when first using NULLs.
In expressions, any operand that contains a NULL will cause the result of the expression to be NULL. For example, 5 + NULL will evaluate to NULL. Combining a string and a NULL such as 'CONSTANT' || NULL will yield a NULL. This behavior can cause unexpected results. Assume you have the following calculation to get an order total:
-- -- Get Customer, Taxable Flag, Order Value (sum of line items) -- SELECT Order.CustomerID, COUNT(OrderLine.LineSeq) AS NoLines, SUM(OrderLine.ExtPrice) AS OrderValue INTO @CustomerID, @NoLines, @OrderValue FROM Order JOIN Customer ON Customer.CustomerID=Order.CustomerID LEFT JOIN OrderLine ON OrderLine.OrderID=Order.OrderID WHERE Order.OrderID=@OrderID GROUP BY Order.CustomerID; -- -- Get taxes, freight and other misc charges -- for the order -- SET @OrderCharge=SELECT SUM(ChgAmt) FROM OrderCharges WHERE OrderID=@OrderID; SET @OrderTotal=@OrderValue+@OrderCharge;
Normally, the @OrderValue will contain the sum of the line items and @OrderCharge will contain charges other than the merchandise, such as taxes and shipping. When there are no special charges, @OrderCharge will have a zero. The sum of both will be @OrderTotal.
But what happens if you have an order that has no order lines but does have special charges (say a single line item for labor). Because of the LEFT JOIN, @OrderValue will contain a NULL; @OrderCharge will have the value of the special charges. The @OrderTotal will contain a NULL because adding a NULL to the @OrderCharge will yield a NULL, but we expected @OrderTotal to equal the @OrderCharge!
Preventing NULL Mayhem
How can you guarantee that NULLs will not throw a monkey wrench in the works whenever you encounter them? The answer lies in the IFNULL built-in function. Its syntax is:
IFNULL ( expression , expression )
The function will return the first non-null result in the list. The only requirement of the function is that all values in the list have compatible data types (that is, you can't mix alpha and numeric). We could revise the @OrderTotal calculation using IFNULL as follows:
Using IFNULL in this way will always substitute a zero in the event that @OrderValue contains a NULL, thereby preventing the result from being NULL. The same should be done for @OrderCharge, although the only time it would be NULL on the iSeries is when the OrderCharges table is empty.
It is important to bulletproof your code for potential NULLs using the IFNULL function because an unexpected NULL can cause your mathematical or comparison operations to fail. Two functions similar to IFNULL are COALESCE and VALUE, which also pick the first non-null value from a list of values. These three functions really operate in the same way; portability is the biggest consideration when deciding which one to use.
NULLs, Comparison Operators, and Connectors
Further, NULLs affect the results of comparison operations in, what was for me, a surprising way. Consider the following user-defined function:
Create Function SQLData.CompTest(@Dummy Integer) Returns VarChar(20) Language SQL Deterministic Begin Declare @Result VarChar(20); If @Dummy=5 Then Set @Result = 'Dummy Is 5'; ElseIf @Dummy<>5 Then Set @Result = 'Dummy Is Not 5'; Else Set @Result = 'NULL'; End If; Return @Result; End
At first glance it may appear that the ElseIf statement will catch all other possibilities if parameter @Dummy is not equal to five, thereby making the last ELSE statement unnecessary. This is true, except when @Dummy is NULL, in which case the stand-alone ELSE is evaluated and @Result is set to the constant NULL.
The reason for this behavior is revealed when you consider that a comparison operator normally evaluates to TRUE or FALSE. However, when one of the values is NULL, the comparison no longer evaluates to TRUE or FALSE; it evaluates to NULL.
In the IF statement above, when @DUMMY is NULL, the condition @DUMMY=5 is evaluated as a NULL (not FALSE) and so the ELSEIF is invoked. The ELSEIF @DUMMY<>5 will also be evaluated as a NULL (it is not TRUE, even though indeed @DUMMY is not 5). Finally, because the ELSEIF condition wasn't evaluated as TRUE, the final ELSE condition is evaluated.
The same factors need to be kept in mind when dealing with the ANDs and ORs (also called connectors). The tables below demonstrate how NULLs will affect comparisons involving ANDs and ORs.
Condition: Expr1 Or Expr2
Condition: Expr 1 And Expr 2
Once again, you can avoid having to worry about a special case for NULL by using IFNULL to substitute a zero if @Dummy contains a NULL:
If IFNULL(@Dummy,0)=5 Then Set @Result = 'Dummy Is 5'; Else Set @Result = 'Dummy Is Not 5'; End If;
Since you can't use equal (=) or not equal (<>) to test for a NULL, how do you do a comparison? The answer is to use IS NULL or IS NOT NULL respectively. Constructing a statement like the following is syntactically not allowed, and furthermore is illogical, because the result would always be NULL:
If @Dummy=NULL Then
Here is the proper way to test for a NULL:
If @Dummy IS NULL Then Set @Result = 'NULL'; ElseIf @Dummy=5 Then Set @Result = 'Dummy Is 5'; Else Set @Result = 'Dummy Is Not 5'; End If;
The NULL Sort
For a final consideration of NULL, let's ask "where does the NULL fall in the sort order on the iSeries?" The answer is that when an ORDER BY is issued against a column containing NULLs, the NULLs come last. Unfortunately, this behavior isn't consistent among SQL platforms, which can make portability problematic. For instance, Microsoft SQL Server and Microsoft Access both place NULLs at the top of the sort order.
Understanding the advantages of using NULLs where appropriate will contribute to better database design. Further, making sense of the expression and comparison issues involved with NULLs, and correctly designing applications, will save many hours of troubleshooting heartache.
Finally, beware that NULLs don't necessarily behave in the same way across all platforms; the sort order of the NULL is one example. Platforms may vary in the way that they handle such things as string concatenation or comparison tests. When writing portable applications, make sure to pay close attention to these "NULL" issues.
Michael Sansoterra is a programmer/analyst for SilverLake Resources, an IT services firm based in Grand Rapids, Michigan. E-mail: firstname.lastname@example.org
Contact the Editors
|Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.|