NULLs and the NOT IN predicate

One of the most common requests is to retrieve data based on some column value not included in a list of values. The following two tables illustrate the scenario. We have tables with colors and products:

Colors table:

color
----------
Black
Blue
Green
Red

Products table:

sku product_description color
---- -------------------- ------
1 Ball Red
2 Bike Blue
3 Tent NULL

Note that these tables do not represent a perfect design, following normalization rules and best practices. Rather, it is a simplified scenario to help illustrate this example better. In reality, the colors table would most likely contain a color code key column that would be referenced in the products table.

The request is to select a list of colors that have not previously been used on products. In other words, we need to construct a query that returns only those colors for which there is no product with that color. It might seem, at first glance, that the NOT IN predicate provides a very intuitive way to satisfy this request, very close to how the problem would be stated in plain English:

SELECT C.color
FROM Colors AS C
WHERE C.color NOT IN (SELECT P.color
FROM Products AS P);

You may have been expecting this query to return two rows (for 'black' and 'green') but, in fact, it returns an empty result set:

color
----------

(0 row(s) affected)


Obviously this is 'incorrect'. What is the problem? It's simply that SQL uses three-valued logic, driven by the existence of NULL, which is not a value but a marker to indicate missing (or UNKNOWN) information. When the NOT operator is applied to the list of values from the subquery, in the IN predicate, it is translated like this:

"color NOT IN (Red, Blue, NULL)" is equivalent to "NOT(color=Red OR color=Blue OR color=NULL)"

The expression "color=NULL" evaluates to UNKNOWN and, according to the rules of three-valued logic, NOT UNKNOWN also evaluates to UNKNOWN. As a result, all rows are filtered out and the query returns an empty set.

This mistake will often surface if requirements change, and a non-nullable column is altered to allow NULLs. It also highlights the need for thorough testing. Even if, in the initial design, a column disallows NULLs, you should make sure your queries continue to work correctly with NULLs.

One solution is to use the EXISTS predicate in place of IN, since EXISTS uses two-valued predicate logic evaluating to TRUE/FALSE:

SELECT C.color
FROM Colors AS C
WHERE NOT EXISTS(SELECT *
FROM Products AS P
WHERE C.color = P.color);


color
----------
Black
Green


Other possible solutions are as follows:

/* IS NOT NULL in the subquery */

SELECT C.color
FROM Colors AS C
WHERE C.color NOT IN (SELECT P.color
FROM Products AS P
WHERE P.color IS NOT NULL);

/* EXCEPT */

SELECT color
FROM Colors
EXCEPT
SELECT color
FROM Products;


/* LEFT OUTER JOIN */

SELECT C.color
FROM Colors AS C
LEFT OUTER JOIN Products AS P
ON C.color = P.color
WHERE P.color IS NULL;


While all solutions produce the desired results, using EXCEPT may be the easiest to understand and use. Note that the EXCEPT operator returns distinct values, which works fine in our scenario but may not be correct in another situation.