A surrogate is, “One appointed to act in place of another.” A surrogate key is appointed to act in place of a natural key. Why? Well, one could argue that surrogate keys existed since the beginning of the computer. As the computer uses internal addresses for finding everything, and in the early days that usage had to be explicitly managed by those programming. It wasn’t until relational theory came along that the concept of using data values as identifiers arose and were functionally implemented.
Dimensional modeling has always embraced surrogate keys. In dimensional models the surrogate keys helped smooth over the, sometimes awkward, dealing with time and change activity, as well as purportedly improving performance by forcing joins into using simple numeric columns. Later, with the rise of object-oriented approaches, in the use of surrogate keys, the idea of improving database join performance again was used as a justification.
Objects existed because the program executing chose to create it. Interrelationships between objects driving those entities as strong or weak were downgraded in priorities. The pervasiveness of surrogate keys impacted data modeling by confusing many about whether these surrogate keys were logical or physical-only. For the many who decided to define these surrogate keys as logical, the detailed semantics of data models diminished as all objects overnight became strong entities, and weak entities virtually disappeared.
Initially, surrogates were defined as sequential numbers. They were managed by the code inserting records into a table. Eventually, databases made things easier by defining structures to support the tracking of the sequences. Since these surrogate keys were integers, the performance enhancement crowd pointed to the single numeric join columns as the perfect goal. Eventually, databases tried to assist more by having their own internally generated unique identifiers for employment as surrogate keys. These values were often hashes of internal system identifiers and dates and times. And while a few exceptions arose, they were generally unique.
More recently, the idea of hashing algorithms has been used in generating surrogate key values. With a hashing algorithm, while moving data through its pipeline the engineer may take the natural key of an object, whether just a single item or many, and use those values as input to a function that implements a set of logic to convert all input to a single encrypted value. Ten data items may go in, but a single value is the output. This single value can be used as the surrogate key. Popularly available hashing algorithm examples include SHA-256 and MD5. Both unique identifiers and hashed keys are lengthy character strings. So, saying performance is improved by joining on numbers is no longer an argument. However, if one finds it painful to write the SQL necessary to join on multiple columns, this does guarantee single column joins between objects.
Data lineage also confuses the issue. If one system creates a surrogate key, the next system down the line inputting that data may consider those previously generated surrogates as natural keys.
As is said, one man’s ceiling is another man’s floor. Life is funny that way. When designing one’s solutions, a data modeler should be cautious about the surrogate keys their solution is generating.
Obviously, surrogate keys are basic architectural elements of dimensional data models, and data vault models. More often than not, those surrogate keys should be labelled as physical only, and logical data models should reflect the actual natural keys. This caution over a logical versus a physical approach maintains a semantic honesty about the solution being built. Regardless, every data modeler should be mindful of what is natural and what is being created.