Far too often, information engineers and others take a dismissive stance toward managing data structures. This indifferent attitude persists largely because whatever tool the engineers use resolves queries quickly. They believe that fast performance means all is well, so nothing else matters, right?
Normalization provides stable data structures that not only describe the business but also reduce data redundancy and increase data integrity. The integrity improvements primarily help prevent needless anomalies surrounding insert, update, and delete activities. Anomalies are simply inconsistencies: when a single structure is the source of data for more than one concept, that structure can cause one of these anomalies. For example, if a single structure holds both employee and department data, how can one add a new department before any employee has been assigned to it? Or how about losing all department information once the last currently assigned employee is deleted? Not to mention the cleanup required when a piece of department information is updated for some employees in a department but not all of them, leaving the question of which values are correct. These are dangerous anomalies that can leave an organization with inconsistent answers and outright missing information, even when query responses are fast.
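For the curious, the following minimal sketch plays out those three anomalies using Python's built-in sqlite3 module. The employee_department table and its column names are hypothetical illustrations, not a recommended design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee_department (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        dept_code     TEXT NOT NULL,
        dept_name     TEXT NOT NULL   -- department fact repeated on every employee row
    )
""")
conn.executemany(
    "INSERT INTO employee_department VALUES (?, ?, ?, ?)",
    [(1, "Avery", "D10", "Accounting"),
     (2, "Blake", "D10", "Accounting")],
)

# Insert anomaly: a brand-new department cannot be recorded until an employee exists.
# Update anomaly: renaming the department on only one row leaves conflicting values.
conn.execute(
    "UPDATE employee_department SET dept_name = 'Finance' WHERE employee_id = 1")
print(conn.execute(
    "SELECT DISTINCT dept_name FROM employee_department WHERE dept_code = 'D10'"
).fetchall())   # two different names for the same department; which one is correct?

# Delete anomaly: removing the last assigned employees erases the department itself.
conn.execute("DELETE FROM employee_department WHERE dept_code = 'D10'")
print(conn.execute("SELECT * FROM employee_department").fetchall())   # []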
Normalization is a way to evaluate functional dependencies across the data being modeled. The goal of this analysis is to establish data structures that are each limited to a single concept, where the data items positioned within each structure are the items that functionally depend on that concept (as represented by the concept’s defined primary key).
Normalization is accomplished by working through several “normal forms.” For the most part, data modelers only progress through the first three. In starting a data model design, one first needs to identify the group of entities within scope. The entities are the objects under discussion, the “nouns,” for this area of the business.
Next, discussion with the subject matter experts should focus on the expected relationships between those entities (one-to-many, many-to-one, one-to-one, etc.).
And finally, the data modeler should gather the specific data elements believed necessary for each entity. These items may come from discussions with the experts, or from simply breaking down the elements that appear in dashboards or reports. A rough sketch of these three starting inputs appears below.
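As a concrete (and entirely hypothetical) illustration, the Python sketch below records a handful of candidate entities, their expected relationships, and the data elements gathered for each; the entity and attribute names are assumptions made only for the example.

# Candidate entities: the "nouns" within scope for this area of the business.
entities = ["Customer", "Order", "Product"]

# Expected relationships between those entities, with their cardinalities.
relationships = [
    ("Customer", "places",   "Order",   "one-to-many"),
    ("Order",    "contains", "Product", "many-to-many"),
]

# Specific data elements believed necessary for each entity.
data_elements = {
    "Customer": ["customer_id", "customer_name", "phone_number"],
    "Order":    ["order_id", "order_date", "customer_id"],
    "Product":  ["product_id", "product_name", "list_price"],
}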
Now the fun begins: first normal form says to remove “repeating groups” from one’s structure. Because this data repeats, it clearly depends on more than just the simple primary key of the data structure. These repeating data items should be moved into a new structure related back to their initial placement and arranged so that each occurrence of the repeated group constitutes a single row.
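The sketch below, again using sqlite3 and hypothetical names, shows a repeating group of phone numbers being moved out of a Customer structure and into a related structure with one row per occurrence.

import sqlite3

conn = sqlite3.connect(":memory:")

# Before first normal form: the phone_1..phone_3 repeating group depends on
# more than just the simple primary key of the structure.
conn.execute("""
    CREATE TABLE customer_unnormalized (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT,
        phone_1       TEXT,
        phone_2       TEXT,
        phone_3       TEXT
    )
""")

# After first normal form: each occurrence of the repeated group becomes a
# single row in a new structure related back to the original.
conn.executescript("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT
    );
    CREATE TABLE customer_phone (
        customer_id  INTEGER REFERENCES customer (customer_id),
        phone_number TEXT,
        PRIMARY KEY (customer_id, phone_number)
    );
""")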
“No partial key dependencies” is the second normal form. This phrase means that if an entity has a composite primary key, meaning the primary key has more than one attribute, then one should verify that each non-key data item within the entity depends on the full primary key rather than on just one, or any subset, of the primary key attributes. When such items are found, they are again removed from the starting structure and placed within a related structure having only those partial primary key columns as its new full primary key.
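Continuing the hypothetical example, the sketch below shows a partial key dependency being resolved: in an order-line structure keyed by (order_id, product_id), product_name depends on product_id alone, so it moves to a related structure whose full primary key is just product_id.

import sqlite3

conn = sqlite3.connect(":memory:")

# Before second normal form: product_name depends on only part of the composite key.
conn.execute("""
    CREATE TABLE order_line_unnormalized (
        order_id     INTEGER,
        product_id   INTEGER,
        product_name TEXT,      -- depends only on product_id
        quantity     INTEGER,   -- depends on the full primary key
        PRIMARY KEY (order_id, product_id)
    )
""")

# After second normal form: the partially dependent item lives in a structure
# whose full primary key is the portion of the original key it depended on.
conn.executescript("""
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT
    );
    CREATE TABLE order_line (
        order_id   INTEGER,
        product_id INTEGER REFERENCES product (product_id),
        quantity   INTEGER,
        PRIMARY KEY (order_id, product_id)
    );
""")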
The third normal form is a similar exercise, only now one is looking to see whether any non-key attributes depend on other non-key attributes rather than directly on the primary key (a transitive dependency).
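Returning to the employee and department example, the sketch below shows such a transitive dependency being resolved: dept_name depends on dept_code, a non-key attribute, rather than directly on employee_id, so the department facts move to their own structure. The names remain hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")

# Before third normal form: employee_id -> dept_code -> dept_name is a
# transitive dependency through a non-key attribute.
conn.execute("""
    CREATE TABLE employee_unnormalized (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT,
        dept_code     TEXT,
        dept_name     TEXT   -- depends on dept_code, not directly on employee_id
    )
""")

# After third normal form: every non-key item depends on the key, the whole key,
# and nothing but the key.
conn.executescript("""
    CREATE TABLE department (
        dept_code TEXT PRIMARY KEY,
        dept_name TEXT
    );
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT,
        dept_code     TEXT REFERENCES department (dept_code)
    );
""")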
Going beyond third normal form is largely an intellectual exercise that rarely impacts real data structures, unless one happens to have data structures containing multiple composite candidate keys with overlapping attributes involved in each key.
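One hedged illustration of that rare situation, borrowed from the classic student/course/instructor teaching example: if each instructor teaches exactly one course, then both (student_id, course) and (student_id, instructor) are composite candidate keys that overlap on student_id. A structure like the hypothetical one below satisfies third normal form yet still invites anomalies, which is what the forms beyond third (such as Boyce-Codd normal form) address.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollment (
        student_id INTEGER,
        course     TEXT,
        instructor TEXT,   -- instructor determines course, yet is not a key on its own
        PRIMARY KEY (student_id, course)
    )
""")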
By simply arriving at third normal form, normalized data structures will be free of the insert, update, and delete anomalies described above and will remain safe for all the usual data maintenance activities.