Database Design: Using Natural Keys
Whether to use natural or surrogate keys is a long-debated subject of database design. I am a fan of using natural keys. I think there are even more compelling reasons to use natural keys in databases as the systems grow more complex and interdependent.
Natural or Surrogate
Let’s start by what we mean by natural. It’s not trivial to define this. In today’s world of APIs, someone’s surrogate key is another’s natural key. Wikipedia defines natural keys as “a type of unique key in a database formed of attributes that exist and are used in the external world outside the database”. This makes it clear that the keys we get from APIs are our natural keys. But how about the ones generated by us to be used in the external world?
When applications expose the keys on the URLs and APIs, others start relying on them. This is where our choices become important. When all those different applications generate their own keys instead of using the keys they got from other places, life becomes difficult for no reason.
Let’s elaborate with an example corporate database where employees are identified by their usernames and the departments with their domains. So our data would look like this:
| department | username | job | | ----------------- | --------- | ----------------------- | | sysadm.corp-x.com | hasegeli | Database Administrator | | sysadm.corp-x.com | john | System Administrator | | dep1.corp-x.com | jane | Developer |
When we design this using surrogate keys, it’d look like this:
CREATE TABLE departments ( id int NOT NULL GENERATED ALWAYS AS IDENTITY, domain text NOT NULL, PRIMARY KEY (id), UNIQUE (domain) ); CREATE TABLE employees ( id int NOT NULL GENERATED ALWAYS AS IDENTITY, username text NOT NULL, PRIMARY KEY (id), UNIQUE (username) ); CREATE TABLE department_employees ( id int NOT NULL GENERATED ALWAYS AS IDENTITY, department_id int NOT NULL, employee_id int NOT NULL, job text NOT NULL, PRIMARY KEY (id), UNIQUE (department_id, employee_id), FOREIGN KEY (department_id) REFERENCES departments (id), FOREIGN KEY (employee_id) REFERENCES employees (id) );
Then, we would have to make a query joining them all together:
SELECT d.domain, e.username, de.job FROM department_employees de JOIN departments d ON de.department_id = d.id JOIN employees e ON de.employee_id = e.id;
Instead, we could just have it as we need it on a single table:
CREATE TABLE department_employees ( domain text NOT NULL, username text NOT NULL, job text NOT NULL, PRIMARY KEY (domain, username) );
And, we could just query it:
SELECT domain, username, job FROM department_employees;
Good Examples of Natural Keys
A primary key must be always available (NOT NULL) and unique. It should also be compact and easy to represent. Here are some good examples of natural keys:
- Keys acquired from various APIs
- Email addresses
- ISO country codes
- IATA airport codes
I think easy representation matters for unchanged use among many systems. For that reason, I don’t think names with spaces, symbols, or non-ASCII Unicode letters are good keys.
UUIDs and hashes/digests like MD5 and SHA are also popular as primary keys. I don’t think they are a good choice as surrogate keys. Auto-incrementing IDs are easier to use, generate, store, predict — although, when they are defined in the external world and we have to store them, there’s nothing wrong with using them as the primary keys.
Another consideration when selecting keys is their stability. Things that can change are not good keys. This is not only because it is more work to change the keys everywhere, but because different parts of the system cannot rely on them.
The context matters to evaluate a key’s stability. For example, a phone number is not a good key for the users table, because users may change their phone number. However, it is perfect for the phone numbers table.
Also, I don’t think we should discard some useful keys just because they can change in an edge case. It is a good trade-off to allow the rare change rather than to complicate the system with a surrogate key.
I don’t know what you understand from the word “enterprise” but I have heard people advocating streamlined, unpractical solutions in the name of “enterprise” for a long time. I have seen people using layers of layers in complex frameworks and ORMs with caches and buffers, managers and supervisors, factories and workers, locks and semaphores, queues and consumers, all in the name of “enterprise architecture”.
Many of those frameworks have a rule like requiring all database tables to have a single-column auto-incremented primary key named “id”. Such a rule is easy to enforce and follow but it’s limiting. It would not be practical, maybe not even possible to use natural keys as the primary keys within such framework.
Even though I am writing this post to support using natural keys, strong conventions built around surrogate keys do certainly help developing complex systems in the “enterprise” environments. They help to keep systems consistent when the developers are more motivated to write code than to read what’s there, while working until they get a better offer from another company.
Performance is often an argument in favour of surrogate keys. Surrogate keys can be small auto-incrementing integers which would keep the indexes small, next branches easy to predict, so well performing.
Surrogate keys would take a bit of extra space since otherwise we don’t need to have them at all, but the space savings of using them as foreign keys would easily outweigh this. However, using surrogate keys may require the queries to have more tables joined together.
Some relational databases maintain the tables indexed by the primary key. Such systems are more affected by the size of the primary key. On PostgreSQL all indexes are secondary including the primary key, therefore the primary key is not any more important than the other indexes on the table.
Natural keys are often unique together with other keys. This forces the database designer to use multiple columns as keys. Such design has many advantages, which I may cover in future blog posts.