Marvel should have an evil villain named “Null.” Nulls have always been trouble in the relational world. Certainly, nulls are used all over the place by virtually everyone. Still, that does not mean that nulls are harmless.
Nulls are deceptive in subtle ways. As an example, consider a table with 10 rows and a column named Test_Code: five of the rows have a value of “X” in Test_Code, four rows have “Y,” and one row is null. A query asking how many rows have a value of “X” answers five. But a query asking how many rows do not have a value of “X” answers four. Most people would assume the answer should be five, because five rows without a value of “X” do exist in the table: the four “Y” rows plus the null row.
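A quick sketch makes the surprise concrete. The table and data here are hypothetical, matching the example above, and the multi-row VALUES syntax varies slightly by DBMS:

    -- A hypothetical table matching the example: five rows of 'X',
    -- four rows of 'Y', and one row where Test_Code is null.
    CREATE TABLE Demo (Test_Code CHAR(1));
    INSERT INTO Demo (Test_Code)
    VALUES ('X'), ('X'), ('X'), ('X'), ('X'),
           ('Y'), ('Y'), ('Y'), ('Y'),
           (NULL);

    SELECT COUNT(*) FROM Demo WHERE Test_Code = 'X';   -- returns 5
    SELECT COUNT(*) FROM Demo WHERE Test_Code <> 'X';  -- returns 4: NULL <> 'X'
                                                       -- evaluates to UNKNOWN, so
                                                       -- the null row is filtered out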
Where the Null Monster Lurks
Cleverly, the DBMS interprets the null Test_Code as “unknown,” and that unknown value might turn out to be “X” if it should later become known. Therefore, the null row is excluded from our query’s consideration. Is “four” the expected answer for the query? It is in these kinds of query executions that the null monster lurks to foil us. Certainly, there are ways of writing queries so that the null rows will be included as normally desired, but the query writer needs to know that the column is nullable and add special code for how to deal with any nulls.
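Against the same hypothetical Demo table, that special code amounts to either an extra predicate or a null-collapsing function; each sketch returns the five rows most people expect:

    -- Explicitly pull the null row back into the result:
    SELECT COUNT(*)
    FROM Demo
    WHERE Test_Code <> 'X'
       OR Test_Code IS NULL;                  -- returns 5

    -- Or collapse nulls to a sentinel first (this assumes 'U'
    -- never appears as a real Test_Code value):
    SELECT COUNT(*)
    FROM Demo
    WHERE COALESCE(Test_Code, 'U') <> 'X';    -- also returns 5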
In the analytics world, data structures defined for use in ad hoc queries frequently have strict rules about when, or if, nulls may be allowed. Often, landing or staging areas allow nulls in most columns, as incoming data from an array of sources may be riddled with bad or incomplete data and there is a desire to reflect exactly what was received. As that initial data is ingested into more refined structures, nulls may no longer be allowed at all, or may be permitted only in limited cases, perhaps numeric columns or end-date columns. Non-null rules such as these require an organization to set specific standards for how nulls are transformed. Effectively, a default value is chosen and assigned to each data type or logical domain. Start dates may get a “2000-01-01” value, a “U” may be assigned to flags, and the value “unknown” may be used for descriptions.
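Such a transformation might look like the following sketch; the staging and refined table names and columns are hypothetical, and date-literal syntax varies by DBMS:

    -- Load from staging into a refined structure, replacing nulls
    -- with the organization's standard defaults along the way.
    INSERT INTO refined_customer (customer_id, start_date, active_flag, description)
    SELECT customer_id,
           COALESCE(start_date, DATE '2000-01-01'),   -- default for start dates
           COALESCE(active_flag, 'U'),                -- default for flags
           COALESCE(description, 'unknown')           -- default for descriptions
    FROM   staging_customer;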
Preventing Unintended Harm
In this context, the intent is that the same consistent value always appears under the same specific circumstance. Consistency allows users to be certain of a value’s meaning once they learn the default. This transformation may seem a waste of effort because, at the end of the day, a value that means “null” simply replaces any null. The difference is that the meaning of null is still conveyed without users needing to know a column may be null and writing additional code in each query to ensure nulls are included. Like the villagers’ torches driving Frankenstein’s monster off into the hills, the DBMS null monster can be effectively removed from the scene and prevented from causing unintended query harm.
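One way to keep the monster from returning is to declare the rule in the table itself, so the DBMS enforces it on every load. A sketch, with hypothetical column names and the same defaults as above (constraint syntax varies slightly by DBMS):

    -- NOT NULL bars the monster at the door; DEFAULT supplies the
    -- standard value whenever a load omits one.
    CREATE TABLE refined_customer (
        customer_id  INTEGER      NOT NULL,
        start_date   DATE         DEFAULT DATE '2000-01-01' NOT NULL,
        active_flag  CHAR(1)      DEFAULT 'U' NOT NULL,
        description  VARCHAR(100) DEFAULT 'unknown' NOT NULL
    );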
Still, as mentioned previously, folks use nulls all over the place. On the operational side of IT, the data structures are part of the application, and often the application’s code contains the only queries running against that data. The engineers developing those solutions should be on top of how to write a proper query to return exactly the required result. It is in the ad hoc world, where queries are not developed, verified, tested, and tested again, that things may go wrong. Casual users can run into problems on random occasions, problems that no one may even notice. A “nulls-not-allowed” strategy for structures used in this fashion can help protect these users.