Ensuring the integrity of the organization’s databases is a key component of the DBA’s job. A database is of little use if the data it contains is inaccurate or if it cannot be accessed due to integrity problems. The DBA has many tools at their disposal to ensure data integrity.
At a high level, there are two aspects to data integrity: database structure integrity and semantic data integrity. Keeping track of database objects and ensuring that each object is created, formatted, and maintained properly is the goal of database structure integrity. We will not discuss this aspect of data integrity in this column.
The second type of integrity, semantic data integrity, refers to the meaning of data and the relationships that need to be maintained between different types of data. The DBMS provides options, controls, and procedures to define and assure the semantic integrity of the data stored within its databases. All too often these semantic integrity controls are ignored or applied only sparingly as an afterthought. But it is imperative that DBAs understand how each DBMS enables automatic semantic data integrity checking.
DBAs frequently struggle with how best to enforce data integrity—by using DBMS features or by using application code. In general, using DBMS features to support data integrity usually offers the best solution.
It is easier to discover and maintain a semantic integrity rule using DBMS features than by reading program code. Remember that in a heterogeneous environment you may need to be able to read several programming languages (COBOL, Java, etc.), often to track down a single rule.
Many forms of semantic data integrity can be enforced by using features of the DBMS. When DBMS facilities are used to enforce data integrity, less code needs to be written, and therefore less code needs to be tested. This can reduce the time and cost to get the “product” out the door.
Entity integrity means that each occurrence of an entity must be uniquely identifiable. In other words, entity integrity requires the specification of a primary key for each entity and that no component of the primary key be set to null. In practice, no major RDBMS product forces entity integrity because entities, or tables, can be created without specifying a primary key. However, it is considered a bad practice to create tables without a primary key because it makes identifying rows of the table difficult.
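To see entity integrity in action, consider this minimal sketch using SQLite through Python's sqlite3 module (the table and column names here are illustrative, not from any particular system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'Alice')")

# A second row with the same primary key value would make the row
# non-uniquely identifiable, so the DBMS rejects it
try:
    conn.execute("INSERT INTO employee VALUES (1, 'Bob')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Because the primary key was declared in the DDL, the duplicate insert fails automatically; no application code had to check for it.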
Another type of data integrity control, similar to a primary key, is the unique constraint. Each table can have zero, one, or many unique constraints consisting of one or more columns each. The values stored in the column, or combination of columns, must be unique within the table—that is, no other row can contain the same value. Unique constraints differ from primary key constraints in that they cannot be used to support referential constraints. Furthermore, the unique constraint columns may be set to null.
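The distinction between a unique constraint and a primary key can be sketched as follows, again using SQLite via Python's sqlite3 module with illustrative names (note that SQLite, like most DBMS products, treats each null in a unique column as distinct):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ssn carries a unique constraint; unlike the primary key, it may be null
conn.execute("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        ssn    TEXT UNIQUE
    )
""")
conn.execute("INSERT INTO employee VALUES (1, '111-22-3333')")

# Duplicate values in the unique column are rejected, just like a primary key
try:
    conn.execute("INSERT INTO employee VALUES (2, '111-22-3333')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

# But unlike a primary key, the column accepts nulls
conn.execute("INSERT INTO employee VALUES (3, NULL)")
conn.execute("INSERT INTO employee VALUES (4, NULL)")
null_count = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE ssn IS NULL").fetchone()[0]
```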
Data type and data length are the most fundamental integrity constraints applied to data in the database. Simply by specifying the data type for each column when a table is created, the DBMS will automatically ensure that only the correct type of data is stored in that column. Processes that attempt to insert or update the data to the wrong type will be rejected. Furthermore, a maximum length is assigned to the column to prohibit larger values from being stored in the table. It is best to choose the data type that most closely matches the domain of correct values for the column.
A user-defined data type, or UDT for short, provides a mechanism for extending the type of data that can be stored in databases and the way that the data is treated. In other words, the DBA can create user-defined data types to further clarify the legal values for a column. The UDT, once defined and implemented, extends the functionality of the DBMS by enabling the DBA to specify the UDT in CREATE TABLE DDL statements just like built-in data types.
When columns are created within tables, they can be assigned a default value that will be used when SQL INSERT statements are issued that do not provide an explicit value for that column. This allows programmers to ignore columns and the DBMS to automatically provide a default value. Each column can have only one default value.
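A default value in action might look like this sketch (SQLite via Python's sqlite3 module; the orders table and PENDING status are hypothetical examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        status   TEXT DEFAULT 'PENDING'
    )
""")
# The INSERT omits the status column, so the DBMS supplies the default
conn.execute("INSERT INTO orders (order_id) VALUES (1)")
status = conn.execute(
    "SELECT status FROM orders WHERE order_id = 1").fetchone()[0]
# status is now 'PENDING' even though the program never set it
```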
A check constraint is a restriction, defined in the table DDL and enforced by the DBMS, on the data values that can be stored in a column or columns of a table. The restriction is expressed as an expression formulated in much the same way as an SQL WHERE clause. Any attempt to modify the column data causes the expression to be evaluated: if the modification conforms to the expression, it is permitted; if not, it fails. The primary benefit of check constraints is their ability to enforce business rules directly in the database without requiring additional application logic. Once defined, the business rule is physically implemented and cannot be bypassed.
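Here is a sketch of a check constraint enforcing a hypothetical business rule (a salary range), using SQLite through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Business rule: salary must fall within an allowed range
conn.execute("""
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        salary NUMERIC CHECK (salary BETWEEN 0 AND 500000)
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 75000)")  # conforms: permitted

try:
    conn.execute("INSERT INTO employee VALUES (2, -100)")  # violates: rejected
    violation_rejected = False
except sqlite3.IntegrityError:
    violation_rejected = True
```

No application program can insert a negative salary, regardless of which language it is written in; the rule lives in the table definition itself.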
Triggers are event-driven specialized procedures that are attached to database tables. A trigger is a piece of code that is executed in response to a data modification statement. Triggers can be thought of as an advanced form of rule or constraint that is written using an extended form of SQL. A trigger cannot be directly called or executed; it is automatically executed (or “fired”) by the RDBMS as the result of an action—a data modification to the associated table. Triggers can be implemented for many practical uses, and they make the constraint rule non-bypassable, meaning it is always implemented when database changes are requested.
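A common practical use of a trigger is maintaining an audit trail. The sketch below (SQLite via Python's sqlite3 module; the account and audit_log tables are hypothetical) fires automatically on every balance update:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (acct_id INTEGER PRIMARY KEY, balance NUMERIC);
    CREATE TABLE audit_log (acct_id INTEGER,
                            old_balance NUMERIC,
                            new_balance NUMERIC);

    -- Fired automatically by the DBMS on every UPDATE of balance;
    -- it cannot be called directly and cannot be bypassed
    CREATE TRIGGER log_balance_change
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO audit_log VALUES (OLD.acct_id, OLD.balance, NEW.balance);
    END;
""")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.execute("UPDATE account SET balance = 250 WHERE acct_id = 1")
log_rows = conn.execute("SELECT * FROM audit_log").fetchall()
# The trigger recorded the change: [(1, 100, 250)]
```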
And the final type of semantic integrity is referential integrity, or RI. The basic concept of RI can be summarized as follows: RI guarantees that an acceptable value is always in the foreign key column. “Acceptable” is defined by the values in the corresponding primary key column(s), or a null.
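Both halves of that definition (a matching primary key value, or a null) can be demonstrated in a short sketch. This one uses SQLite through Python's sqlite3 module with hypothetical dept and employee tables; note that SQLite enforces referential integrity only after the foreign_keys pragma is enabled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable RI checking
conn.executescript("""
    CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        dept_id INTEGER REFERENCES dept(dept_id)
    );
""")
conn.execute("INSERT INTO dept VALUES (10, 'Accounting')")

conn.execute("INSERT INTO employee VALUES (1, 10)")    # matches a primary key
conn.execute("INSERT INTO employee VALUES (2, NULL)")  # null is also acceptable

try:
    conn.execute("INSERT INTO employee VALUES (3, 99)")  # no such department
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The DBMS accepts the first two inserts and rejects the orphan row, exactly the guarantee RI promises.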
Of course, there is much more to all of these semantic integrity capabilities than we can discuss in a column of this nature, so be sure to learn the nuances of these features as supported in the DBMS products that you use.
Avoiding Disaster
Today’s DBMSs provide a wealth of features to support data integrity. Because one of the major problems plaguing production systems today is data quality, it is imperative that DBAs understand, implement, and administer DBMS features such as constraints, referential integrity, and triggers in their database designs. Failure to do so can be a prescription for disaster.