Good keys, what are they like?

A central value add of data warehouses is their ability to restore the sanity that comes from using good keys. Taking a model-agnostic view of keys, they refer to “something” that is uniquely identifiable. Depending on what you are modeling, those “somethings” have different names, for example: entities, rows, tuples, cells, members, documents, attributes, object instances, and relations between the any of the latter. Because this article is about relational databases and because I don’t want to move up to the “logical” level with you (as I promised before), I will use the term: “row” as the “something” that a key refers to, and the term “column” as the data structure that implements the key.

Moving on, the definition of a “good key” is a column that has the following properties:

  1. It forced to be unique
  2. It is small
  3. It is an integer
  4. Once assigned to a row, it never changes
  5. Even if deleted, it will never be re-used to refer to a new row
  6. It is a single column
  7. It is stupid
  8. It is not intended as being remembered by users

Unique

It follows from the definition of a key that it is unique. But documentation that this is so, does not make it a key: An automated mechanism must be put in place to enforces uniqueness. This could be a constraint or a unique index.

Small

A key should be small, a few bytes at most. This makes it efficient to retrieve rows and to join on the key.

Integer

A key should be an integer, always. I have previously blogged about why this is a good idea.

Never Changes

You should never update key columns or change the column that is the key. First of all, it creates a mess of anything that depends on the key (which may or may not be enforced with referential integrity). Secondly, updates of key columns can have bad performance implications since the database has to do a lot of book keeping to keep the table consistent

Never re-use

Keys should never the re-used – even if the row they identify has been deleted. This turns out to be a crucial requirement in data warehouses, because we may not always be able to track all the dependencies between tables in a automated manner. This means that the warehouse or user reports may be in a state of partial inconsistency, and this is ok and normal. However, if rows exist that refer to an old version of the key, those rows may magically reappear in user reports and end up pointing to the wrong values.

It follows that key columns should be small (as in 2), but still large enough to honour this requirement.

Single column

It is bad practice to use multi-column keys. First, it is difficult to express foreign key constraints on them. Second, it makes the SQL required for accessing the key (and joining on it) more complicated and error prone to users (and programmers). Third, multi column keys makes it harder to write automated code that accesses the model (including code that auto loads the model). Fourth, it is more difficult to express histograms on multi-column keys, which makes the query optimizer more prone to generating bad plans.

I even apply this principle to tables that implement relationships between other tables. This is best illustrated with an example. I recommend against this design:

image

 

In the above example, the relationship table implements a unique relation between two tables, and the key is used to enforce the uniqueness of the relation (“a customer can only own a product once”). But this is not flexible! What if I open stores in multiple countries? Apple iTunes is a good example of such a case. Now, the customer may own the product in multiple countries like this:

image

This wreaks havoc with any code that refers to the relationship itself (since the key has now changed, violating 4) above. Instead, I recommend this design that does not suffer from the problem:

image

If you wish to enforce the uniqueness of the relationship (a business rule, not a key) then you can declare a unique index on customer_id, product_id.

Stupid

A key should not contain information about the row it refers to. This would be a violation of the rules of database normalization. While you may not agree with the rules of normalization (and there are good reasons to disagree), consider this: If the key contains information about the row it refers to – “smart” programmers may get the idea that code could be written that would use the value of the key to implement business logic. But what happens if someone changes the row values and the key no longer reflects the other columns that is a correlated within the row? Then the code is wrong, and you will have programmers begging/ordering you to change the key (violating the principle of “never change”). Any dependency on specific key values in business logic is generally bad news for your ability to implement changes. Keep the keys to yourself and don’t tempt programmers to get smart about them.

Teflon

You generally don’t want users remembering specific primary keys in the warehouse. This is the teflon principle: “nothing sticks to it”. As we shall see later in this series, we reserve the right to make changes to the way keys are linked up to fact tables if we change our mind about history tracking. You don’t want reports to refer to your keys by their value when this happens. Imagine a data warehouse that has 50.000 reports referring to the fact tables (this is not unusual) – if any of those reports refer to a key in a fact table, and you make changes to history tracking, you will have users screaming.

It is worth pointing out that a table may have secondary keys which users refer to in queries. This is perfectly fine, you may even declare the secondary key as unique too. Ideally, users should be using the primary  key on the table only for joins, and fetch unique rows by some known, secondary key. For example, locating an invoice by the source system invoice ID may use the source system key directly, but you will maintain a different primary key.

The one exception to the teflon principle, is keys for date and time. It is a reasonable assumption that “time does not change” and you may get away with generating these keys in the integer format YYYYMMDD and HHMMSS. Incidentally, this makes queries on date and time easier to write and the data easier to partition.