When building relational databases, it is not unusual to need to optimize ETL performance in order to fit tight production scheduling windows. Too often, however, a developer's first response to any performance concern is to denormalize table designs and merge database objects together. Indeed, when given free rein, many developers will denormalize before any performance issue has even arisen, just in case.
Using Denormalization in Database Design
Denormalization is a legitimate tool when used appropriately, and there are circumstances where reshaping data structures yields real performance gains. However, it is unfortunate, and sometimes even silly, to see designs denormalized as a matter of course. Planning to write code should not begin with a developer requesting database design changes before poor performance has ever been observed.
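For illustration only, here is a minimal sketch of one such data structure morphing, using SQLite through Python's standard sqlite3 module; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: orders carry only the customer key.
cur.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        order_total REAL
    );
""")

# One common denormalization: a pre-joined reporting table that repeats
# the customer name on every order row, trading redundancy (and the
# update anomalies that come with it) for fewer joins at read time.
cur.execute("""
    CREATE TABLE order_report AS
    SELECT o.order_id, o.order_total, c.customer_id, c.name
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
""")
```

The trade is explicit: read queries avoid a join, but every change to a customer's name must now be propagated to many rows, which is precisely why such a step should be justified by a measured problem rather than taken by default.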
This development approach is like a housing contractor who, about to build a new structure, assumes the family buying the soon-to-be-built home might someday have a child destined to grow seven feet tall or more, because the contractor heard of that happening somewhere, sometime before. The contractor therefore adjusts the plans so that all doorways are a foot-and-a-half taller than standard. Should the extra-tall child arrive, the parents and child will very much appreciate the additional headroom. But, obviously, that circumstance may never come to pass.
Instead, solution development should start with completely normalized database designs, and real performance issues should drive whatever changes follow. In a healthy development lifecycle, reworking a design to tune performance is exactly what refactoring is for.
Importance of the SHOW PLAN Function
Likewise, any developer writing queries against a database should know the SHOW PLAN or EXPLAIN PLAN function, or whatever variant the chosen platform provides, and be comfortable using it. This function details the steps the DBMS will take to resolve a query. When performance issues arise, the plan can reveal exactly which steps along the way are potential bottlenecks.
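The spelling varies by platform (SET SHOWPLAN options in SQL Server, EXPLAIN PLAN in Oracle, EXPLAIN in PostgreSQL and MySQL), but the idea is the same everywhere. As a minimal, runnable sketch, SQLite exposes it as EXPLAIN QUERY PLAN, reachable from Python's standard sqlite3 module; the schema here is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, order_total REAL)")

# Ask the DBMS how it intends to resolve the query, without running it.
for row in cur.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT order_total FROM orders WHERE customer_id = ?", (42,)):
    print(row)  # the detail column reports a full scan of orders,
                # since no index on customer_id exists yet
```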
Understanding the nature of a bottleneck lets the investigation focus on the right things. Overcoming a bottleneck could be as simple as adding an index, or adding a covering column to an already existing index. Perhaps a query needs its syntax altered, rewriting an outer join as a sub-query or a table expression, to nudge the DBMS optimizer toward a better access path. Not every fix needs to jump directly to options that are destructive to the database design.
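Continuing the hypothetical schema above, a sketch of the covering-index remedy: SQLite has no separate INCLUDE clause, so the extra column is simply appended to the index key, which is enough to let the query be answered from the index alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, order_total REAL)")

# Index keyed on customer_id that also carries order_total, so the
# query below never has to touch the base table at all.
cur.execute("CREATE INDEX ix_orders_cust_total "
            "ON orders (customer_id, order_total)")

for row in cur.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT order_total FROM orders WHERE customer_id = ?", (42,)):
    print(row)  # detail now reports a search using a COVERING INDEX
                # instead of the full table scan seen earlier
```

Note that the table design itself is untouched; only an access structure was added, which is exactly the kind of non-destructive remedy worth exhausting before reaching for denormalization.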
It is important to remember that relational theory and normalization techniques are intended to create data structures that are rational and "make sense" given the semantics of our circumstances. In making sense, they help guide the processes that use the data toward the right kinds of approaches and choices. Relational theory was never founded on the single goal of fastest performance, because what counts as "fastest performance" is fluid in approach and constantly changing in implementation.
Relational theory concerns the presentation of data to the user, not what happens "under the hood." The original expectation was that performance, and enhancements to it, would stay hidden under the hood while the "theory" was what database users saw. Vendors were expected to work toward optimizing performance without mangling the external, rationalized semantics. Far too often, though, DBMS vendors have exposed the under-the-hood bits and asked us to accept those quirks as though they were just as valid as the theory: the old "It's not a flaw, it's a feature" argument.