If you’ve worked with relational database systems for any length of time, you’ve probably participated in a discussion (argument?) about the topic of this month’s column, surrogate keys. A great debate rages within the realm of database developers about the use of “synthetic” keys. And if you’ve ever Googled the term “surrogate key,” you know the hornet’s nest of opinions that swirls around the topic. For those who haven’t heard the term, here is my attempt at a quick summary: A surrogate key is a generated unique value used as the primary key of a database table; database designers tend to consider surrogate keys when the natural key consists of many columns, is very long, or may need to change.
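To make that concrete, here is a purely hypothetical example (the table and column names are mine, not from any particular shop) of a table whose natural key spans three columns:

    -- Natural key: three columns together identify each row
    CREATE TABLE order_item
        (customer_id  CHAR(10)  NOT NULL,
         order_date   DATE      NOT NULL,
         line_number  SMALLINT  NOT NULL,
         part_number  CHAR(8)   NOT NULL,
         quantity     INTEGER   NOT NULL,
         PRIMARY KEY (customer_id, order_date, line_number));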
Proponents of the surrogate key argue that a concise, manageable key is easier to deal with than an unwieldy multicolumn key. A surrogate key also provides immutability: because it is assigned by the system rather than derived from the data, its value need never change even as other data values change. Surrogate keys also provide uniformity and compatibility. If you are using several different application development systems, drivers, and object-relational mapping (ORM) frameworks, it can be simpler to give every table an integer surrogate key than to map a different natural key structure for each table. Some advocates even go so far as to claim performance advantages for surrogate keys, reasoning that it is easier to optimize a query against a single key column than against the multiple columns of a natural key.
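As a sketch of that surrogate alternative, here is my hypothetical table again using standard SQL identity-column syntax (the exact spelling varies by DBMS: SERIAL, AUTO_INCREMENT, or a separate sequence object):

    -- Surrogate key: a single generated integer
    CREATE TABLE order_item
        (order_item_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
         customer_id   CHAR(10)  NOT NULL,
         order_date    DATE      NOT NULL,
         line_number   SMALLINT  NOT NULL,
         part_number   CHAR(8)   NOT NULL,
         quantity      INTEGER   NOT NULL);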
Those arguing against surrogate keys outline their disadvantages. Perhaps the strongest claim is that the generated surrogate key is disassociated from the real-world meaning of the data. Without the natural key columns that would have been used as the foreign key, the meaning of child rows becomes less obvious, and each foreign key must be joined back to the parent table to give the data meaning. Another issue is that tables that are otherwise identical (for example, a test version and a production version) could contain the same key values but represent different data, leading to confusion and data quality issues. Other problems that can arise from using surrogate keys include the inadvertent disclosure of proprietary information (depending upon how the surrogate keys are defined), improper database design (failing to create a unique index on the natural key), and improper assumptions based on the generated key values (for example, are higher key values necessarily for newer accounts?).
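That last design point is worth spelling out: even when a surrogate serves as the primary key, the natural key should still be declared unique. Continuing my hypothetical table:

    -- Without this constraint, duplicate business data can creep in
    -- behind the surrogate key
    ALTER TABLE order_item
        ADD CONSTRAINT order_item_nk
        UNIQUE (customer_id, order_date, line_number);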
I doubt that there is any final word on this topic. Indeed, the argument has raged for years now. However, when I get to the performance portion of the argument I am somewhat skeptical. Do fewer columns in a join really translate to better performance? I can see how it would be easier to code for a programmer, but not necessarily for the DBMS to optimize. If you’ve created an index on those multiple columns, how much worse will the performance really be? Sure, the SQL is more difficult to write, but will a join over four or five indexed columns perform that much worse than a join on one indexed column? I suppose if the number of columns required for the natural key is absurdly high (e.g., 10 columns) the impact could be greater. And if you are swapping a variable-length natural key for a fixed-length surrogate, the fixed-length key may itself provide some performance benefit.
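To illustrate what I mean, compare the two joins side by side (again, hypothetical names; assume orders is the parent table, and in the surrogate design the child simply carries the parent’s generated order_id as its foreign key):

    -- Joining parent to child on a two-column natural key...
    SELECT o.order_date, i.part_number, i.quantity
    FROM   orders o
    JOIN   order_item i
           ON  i.customer_id = o.customer_id
           AND i.order_date  = o.order_date;

    -- ...versus joining on a single surrogate key column
    SELECT o.order_date, i.part_number, i.quantity
    FROM   orders o
    JOIN   order_item i
           ON  i.order_id = o.order_id;

Both predicates can be satisfied by an index; the natural-key join simply compares more columns per probe, which is the crux of my skepticism about dramatic performance claims.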
At any rate, consider this: For the parent table at least, the natural key columns are still going to be there—after all, they’re naturally part of the data, right? So the surrogate (synthetic) key gets added to each row. This will likely reduce the number of rows per page/block because the row length is now longer. And that, in turn, will negatively impact the performance of sequential access because more I/O will be required to read the “same” number of rows.
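A rough back-of-the-envelope calculation (my numbers are hypothetical, not a benchmark): assume a page with 4,000 usable bytes and a 96-byte row. That yields 41 rows per page; add a 4-byte integer surrogate and the resulting 100-byte row yields 40 rows per page. Scanning a million rows then requires roughly 25,000 pages instead of about 24,400. That is a small but real increase in I/O, and wider surrogates (GUIDs, timestamps) widen the gap.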
And, what about the impact of adding data? If a significant number of new rows are being added at the same time by different processes, there will be locking contention as they all try to put the new data on the same page, unless, of course, your surrogate key is not a sequential number and is, instead, something like the microseconds portion of the current timestamp. But then each generated value must be checked to avoid duplicates, further degrading performance.
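If you do go non-sequential, a random surrogate is the usual sketch. The generator function is DBMS-specific (gen_random_uuid() below is PostgreSQL’s spelling; SQL Server calls it NEWID()), so treat this as an illustration rather than portable code:

    -- A random key spreads inserts across pages, easing hot-page
    -- contention, but the keys are wider and clustering suffers
    CREATE TABLE account
        (account_id   UUID DEFAULT gen_random_uuid() PRIMARY KEY,
         account_name VARCHAR(50) NOT NULL);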
The one thing that usually causes me to favor natural keys is just that: they are natural. If the key data occurs naturally in the business, it is easier for end users to remember and to use. If it is a randomly generated surrogate, nobody will actually know the values. Yes, well-built applications can mask this to a great degree, but ad hoc access becomes quite difficult.
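Consider the ad hoc queries a user might type against my hypothetical table (the values are made up, of course):

    -- With a natural key, users can query on values they already know
    SELECT * FROM order_item
    WHERE  customer_id = 'ACME001'
    AND    order_date  = DATE '2024-03-15';

    -- With a surrogate, they must first look up the generated value
    SELECT * FROM order_item
    WHERE  order_item_id = 4711;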
I guess the bottom line on surrogate keys versus natural keys is that “it depends” on a lot of different considerations … and that should really be no surprise to DBAs at all!