Assuring data quality should be one of the most important goals for IT professionals. This is truer today than ever, as organizations rely on data to power their AI and machine learning algorithms. Accurate, quality data is required to make accurate decisions, whether the data is being analyzed by a human or by a computer algorithm.
Data integrity in an operational DBMS entails making sure that the data is always accurate and correct. Of course, there are many aspects to data integrity: data must be correct in terms of internal DBMS functionality, business meaning, the actual values of data elements, and the relationships between data elements, and changes must be applied correctly when requested.
One of the fundamental steps that must be taken when data is stored in a database system is to define the proper data type and length for each column. By specifying the proper data type when a table is created, the DBMS automatically ensures that only the correct type of data is stored in that column. Any process that attempts to insert or update the data to a non-conforming value will be rejected. Furthermore, for some data types, a maximum length is assigned to the column to prohibit larger values from being stored in the table.
It is usually the DBA who must choose the data type and length of each column, unless a data architect has provided a data model for the DBA. And even then, changes may be needed when translating the logical model to a physical implementation. The general rule of thumb is to choose the data type that most closely matches the domain of correct values for the column. In general, adhere to the following rules (a sample table definition follows the list):
- If the data is numeric, favor SMALLINT, INTEGER, or DECIMAL data types. FLOAT is also an option, but only for approximate values (such as very large or very small numbers) where exact precision is not required.
- If the data is character, use CHAR or VARCHAR data types.
- If the data is date and time, use DATE, TIME, or TIMESTAMP data types.
- If the data is multimedia or other large object data, use BLOB, CLOB, or DBCLOB data types (with GRAPHIC and VARGRAPHIC available for double-byte character data).
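For example, a table that follows these rules might be defined as shown in this sketch. The table and column names are hypothetical, and some details (such as the CLOB length syntax) are Db2-flavored, so adjust for your DBMS:

```sql
CREATE TABLE customer_order (
    order_id     INTEGER       NOT NULL,   -- numeric: whole numbers
    order_total  DECIMAL(9,2)  NOT NULL,   -- numeric: exact precision for currency
    cust_name    VARCHAR(50)   NOT NULL,   -- character: variable length
    region_code  CHAR(2)       NOT NULL,   -- character: fixed length
    order_date   DATE          NOT NULL,   -- date only
    last_update  TIMESTAMP     NOT NULL,   -- date and time
    order_doc    CLOB(1M)                  -- large character object
);

-- With the types above, the DBMS enforces each column's domain.
-- For example, these values would all be rejected:
--   order_total = 'lots'       (non-numeric)
--   region_code = 'NORTH'      (longer than two characters)
--   order_date  = '2022-02-30' (not a valid date)
```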
The actual names of the data types may vary from DBMS to DBMS, but the general rules above should apply. Unfortunately, these rules are not always adhered to. I do not know why using improper data types is such a widespread practice, but I can guarantee you that it is.
One of the most common data type problems is storing dates in character columns, and it causes all sorts of issues. Yes, every DBMS product supports date/time data a little bit differently, but they all provide the following benefits that you don’t get when you store dates in character columns (see the example query after the list):
- Assuring only date and time data can be stored in the column. The DBMS will not permit a non-date value in a column defined with a date data type.
- Date/time columns can participate in date and time arithmetic. Adding a duration to a date, or subtracting one date from another, is difficult to program if the data is not stored as a date.
- Every DBMS provides a vast array of built-in functions that operate on and transform date and time values.
- Date/time columns can be formatted in multiple ways by the DBMS using built-in commands or functions.
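As a quick illustration, once order_date is a true DATE column each of these benefits is a single line of SQL. This is a sketch against the hypothetical table above; the arithmetic syntax and function names vary by DBMS (Db2 uses VARCHAR_FORMAT where Oracle and PostgreSQL use TO_CHAR, for example):

```sql
SELECT order_id,
       order_date + INTERVAL '30' DAY    AS due_date,       -- date arithmetic
       EXTRACT(YEAR FROM order_date)     AS order_year,     -- built-in function
       TO_CHAR(order_date, 'MM/DD/YYYY') AS formatted_date  -- formatting
FROM   customer_order
WHERE  order_date >= CURRENT_DATE - INTERVAL '90' DAY;      -- only recent orders
```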
So, the next time you see a character date column containing '02302022' or something even more nonsensical, blame the database schema—or the DBA who created that schema.
Another common mistake is to define numeric data as character. For example, consider the following scenario. A four-byte code is required to identify an entity; all of the codes are numeric and will stay that way. But for reporting purposes, users want the codes to print with leading zeroes. Should the column be defined as CHAR(4) or SMALLINT?
If the column is defined as CHAR(4), then without proper edit checks, inserts and updates could place invalid alphabetic characters into the code. This is a valid concern whenever ad hoc data modifications are permitted. And even if ad hoc changes are rare in your production databases, data integrity issues can still arise if proper edit checks are not coded into every program that can modify the data. But simply specifying a numeric data type puts the onus on the DBMS, where it should be, to ensure that the data is numeric only.
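A quick sketch makes the difference plain (hypothetical table and column names). With CHAR(4), the garbage value is stored without complaint; with SMALLINT, the DBMS rejects it, no application edit checks required:

```sql
-- Defined as character: the DBMS accepts any four characters
CREATE TABLE product_char (prod_code CHAR(4) NOT NULL);
INSERT INTO product_char VALUES ('0123');   -- accepted
INSERT INTO product_char VALUES ('AB3!');   -- also accepted; bad data is now stored

-- Defined as numeric: the DBMS enforces the domain for us
CREATE TABLE product_num (prod_code SMALLINT NOT NULL);
INSERT INTO product_num VALUES (123);       -- accepted
INSERT INTO product_num VALUES ('AB3!');    -- rejected by the DBMS: not a number
```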
Choosing the wrong data type can impact performance, too. Relational optimizers can do a better job of calculating filter factors and building proper access paths when columns are defined according to their domains. Consider our example again. There are many more possible values for a CHAR(4) column than a SMALLINT column. Even if programmatic edit checks are coded for each, most optimizers are not aware of them and will assume that all combinations of characters are permitted.
But what about those leading zeroes? If the data is stored in a CHAR(4) column, we can just input the data with leading zeroes and they will always be there. Well, this “problem” can be solved in other ways. It is better to transform the data when it is accessed than to store it that way in the first place. After all, most reporting and query tools have features that will automatically insert those leading zeroes. And if you are accessing the data with a program or SQL, it is easy to add leading zeroes when you need them.
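For example, many DBMS products provide an LPAD function (Db2, Oracle, MySQL, and PostgreSQL among them; SQL Server uses FORMAT instead), so the leading zeroes can be produced at retrieval time from the SMALLINT column:

```sql
-- Display the numeric code with leading zeroes, e.g. 123 becomes '0123'
-- (CAST target type names vary slightly by DBMS)
SELECT LPAD(CAST(prod_code AS VARCHAR(4)), 4, '0') AS display_code
FROM   product_num;
```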
We have only discussed a couple of the potential problems caused by bad data type choices. In general, it is best to assign each column the data type (and length) that best matches the values in its domain. Choosing proper data types will improve data quality, and with it the decisions made using that data.