Why Integer Keys are the Right Choice!

This subject has come up a lot during my years of consulting. Source systems, traditionally composed of shoddy code and poor data modeling, will often use “natural keys”. Such keys can be of any data type: strings (CHAR/VARCHAR), DECIMAL, or other inefficient types. It is my claim that keys in a warehouse should always be integers, and that it is not that difficult to implement proper keys. In this post, I will show you the price of using non-integer keys and let you be the judge of how much hardware you are willing to waste on lazy key choices.

There are several reasons quoted for using non-integer keys in source systems:

One is the use of “intelligent keys”, keys containing some form of information about the entity they represent. I will give this choice the proper beating it deserves in a future post.

Another reason I hear is that character keys are “good enough”. A convenient choice has been made that makes the key easy to render. For example, the key may be stored as a zero-padded string of integers instead of a proper integer type. Putting rendering choices into the database has its own problems, of course – but if you read data off most ERP systems, chances are you are reading zero-padded integers turned into strings.
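To illustrate the pattern, here is a minimal sketch (assumed T-SQL; the key value and column name are hypothetical, not taken from any particular ERP system):

```sql
-- A hypothetical ERP-style key: the integer 4711 stored as the
-- zero-padded string '0000004711'
SELECT RIGHT('0000000000' + CONVERT(VARCHAR(10), 4711), 10) AS OrderKey;
-- The CHAR(10) key costs 10 bytes per row; the INT it encodes costs 4.
```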

Yet another reason quoted is that the keys arrive at the warehouse from a master data management system. Why a master data system would ever spit out a non-integer key is a mystery to me – for reasons that I hope become obvious in my next post. But I have seen it happen: master data systems generating character keys.

Believe it or not, sometimes the reason quoted for using non-integer keys is that the key is generated as a string concatenation of several columns to make it unique.
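A sketch of what such a key typically looks like (assumed T-SQL; the column names and separator are hypothetical):

```sql
-- A hypothetical concatenated "unique" key built from two columns
DECLARE @CompanyCode INT = 42, @AccountNo INT = 4711;
SELECT CONVERT(VARCHAR(10), @CompanyCode)
     + '-'
     + CONVERT(VARCHAR(10), @AccountNo) AS AccountKey;  -- '42-4711'
```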

From a best practice perspective, Kimball recommends using integer surrogate keys for many of the same reasons I will give. But this guidance is often ignored, and source system keys are loaded directly into fact tables or other warehouse models to “save time” on development.

The basic fact, which I will now proceed to prove to you, is that non-integer keys are a waste of CPU cycles! CPUs deal better with integers than with string types. But just how bad is it? Let me illustrate with an example. Consider three competing design choices for a fact and dimension table:

  • Using a 4-byte integer key, as per Kimball/old-style data modeling traditions
  • Using a character key with some basic case insensitive intelligence built in. To make a fair comparison, we shall use a CHAR(4), which takes up the same space as the integer key.
  • Using a character key as above, but where we are a bit more cunning about the collation choice. To improve join speed, we will use a binary collation (which makes string comparisons faster).

The following code generates 64K row dimension tables representing these three choices. The code will also generate three 65M row fact tables that match the dimensions:
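The original script is not reproduced here, so the following is a minimal sketch of what the setup might look like (assumed T-SQL; the table and column names, the hex key encoding, and the specific collations are my illustrative choices, not the original's):

```sql
-- 1) 4-byte integer surrogate key
CREATE TABLE DimIntKey (
    DimKey    INT NOT NULL PRIMARY KEY
  , Attribute INT NOT NULL
);

-- 2) CHAR(4) key with a case insensitive collation
CREATE TABLE DimCharKey (
    DimKey    CHAR(4) COLLATE Latin1_General_CI_AS NOT NULL PRIMARY KEY
  , Attribute INT NOT NULL
);

-- 3) CHAR(4) key with a binary collation
CREATE TABLE DimCharBinKey (
    DimKey    CHAR(4) COLLATE Latin1_General_BIN2 NOT NULL PRIMARY KEY
  , Attribute INT NOT NULL
);

-- Generate 64K dimension rows. Four hex digits encode exactly 65536
-- values, so the CHAR(4) key occupies the same 4 bytes as the integer.
;WITH Numbers AS (
    SELECT TOP (65536)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO DimIntKey (DimKey, Attribute)
SELECT n, n % 1000 FROM Numbers;

INSERT INTO DimCharKey (DimKey, Attribute)
SELECT RIGHT(CONVERT(CHAR(8), CONVERT(BINARY(4), DimKey), 2), 4)
     , Attribute
FROM DimIntKey;

INSERT INTO DimCharBinKey (DimKey, Attribute)
SELECT DimKey, Attribute FROM DimCharKey;

-- Matching ~65M row fact table (only the integer variant shown; the two
-- CHAR(4) variants follow the same pattern)
CREATE TABLE FactIntKey (
    DimKey  INT NOT NULL
  , Measure BIGINT NOT NULL
);

INSERT INTO FactIntKey (DimKey, Measure)
SELECT d.DimKey, 1
FROM DimIntKey d
CROSS JOIN (SELECT TOP (1000) 1 AS x FROM sys.all_objects) r;
```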

Now, we can run our typical data warehouse query: a join of a dimension with the fact and grouping on an attribute:
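A sketch of such a query, using hypothetical table and column names (assumed T-SQL; the optimizer will typically pick a hash join for this big/small join shape):

```sql
-- Typical star join: scan the fact, join to the dimension,
-- aggregate on a dimension attribute
SELECT d.Attribute
     , SUM(f.Measure) AS Total
FROM FactIntKey f
JOIN DimIntKey d
  ON f.DimKey = d.DimKey
GROUP BY d.Attribute;
```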

Running on my laptop with a warm cache (taking I/O out of the equation with zero physical reads) gives me this result:

Key Choice                                   CPU Time   Clock Time   Logical Reads (Fact / Dim)
4-byte integer                               ~7 sec     ~7 sec       13770 / 140
CHAR(4), case sensitive, accent sensitive    ~12 sec    ~12 sec      13770 / 140
CHAR(4), binary collation                    ~8.5 sec   ~8.5 sec     13770 / 140

This should make it pretty clear that it is worth using an integer key, both to save yourself some money on hardware and to spare your users the frustration of waiting for joins to return. Also note that the above effect becomes even more pronounced as you add more dimensions to the fact table.

If you are stuck with character keys, the results also highlight that you will most likely benefit from using a binary collation (if you can live with the semantics implied by that choice).


  1. Tony Rogerson

    I totally agree with you in terms of a surrogate key approach. But as we move into an era of SSDs, with the random access seek and throughput we get, coupled with the need to scale out rather than have one big monolithic database, GUIDs should be considered. Yes, fragmentation; yes, more cache used; yes, possibly more latching – but those problems in a BI scenario (big queries with lots of IO) are negated.

    PS – great article.

    • Thomas Kejser

      Hi Tony

      I agree that we will have to worry much less about sequential IOPS in the future.

      HOWEVER, I don't think this makes the GUID point a valid one. GUIDs suffer from several problems (fragmentation being the smallest). They do not fit in CPU registers (which means they are expensive for the CPU), they occupy a lot of space in the L2 caches, and they generally make everything larger. While GUIDs may make more sense in OLTP, I don't think they have a place in data warehouses.

      The INSERT scale test I performed in a previous blog is not that relevant for warehouses: dimensions typically have a MUCH lower INSERT rate, and the latching is less of a concern here. For fact tables, you are in bulk mode anyway, and you are not generating unique keys for those tables.

      With regards to loop joins: random IOPS are not the only problem with loops. They consume a LOT more CPU than hash joins when you have the big/small join that is typical for warehouses. One of the reasons for this is that loop joins have to do “pointer walking” to crawl from the top of the B-tree to the leaf. If the next page needed is not in the L2 cache, such a pointer walk is really expensive (a memory fetch takes over 100 CPU cycles). Hash tables are much more efficient, because they allow you to jump directly to the data you need.

      So even though we will get super fast random IOPS in the future, I don't think we will see a new rise of loop join strategies 🙂

  2. Pingback: Good keys, what are they like? « Thomas Kejser's Database Blog
