Why Integer Keys are the Right Choice!
This subject has come up a lot during my years of consulting. Source systems, traditionally composed of shoddy code and poor data modeling, will often use “natural keys”. Such keys can be of any data type: strings (CHAR/VARCHAR), DECIMAL, or other inefficient types. It is my claim that keys in a warehouse should always be integers, and that proper keys are not that difficult to implement. In this post, I will show you the price of using non-integer keys and let you be the judge of how much hardware you are willing to waste on lazy key choices.
There are several reasons quoted for using non-integer keys in source systems:
One is the use of “intelligent keys”, keys containing some form of information about the entity they represent. I will give this choice the proper beating it deserves in a future post.
Another reason I hear is that character keys are “good enough”. A convenient choice has been made that makes the key easy to render. For example, the key may be stored as a zero-padded string of digits instead of a proper integer type. Baking rendering choices into the database has its own problems, of course. But if you read data off most ERP systems, chances are you are reading zero-padded integers turned into strings.
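As a sketch of what fixing this looks like during warehouse load (the variable name and value here are hypothetical), converting a zero-padded string key back into a proper integer is a one-liner. TRY_CAST (available from SQL Server 2012 onwards) returns NULL instead of raising an error when a value turns out not to be numeric:

```sql
/* Hypothetical example: a source key stored as a zero-padded string. */
DECLARE @SourceKey CHAR(4) = '0042'

SELECT TRY_CAST(@SourceKey AS INT) AS ProperIntKey  /* 42 */
```

On older versions of SQL Server, a plain CAST does the same job, but it will fail the whole batch if a single non-numeric value sneaks in.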
Yet another reason quoted is that the keys arrive in the warehouse from a master data management system. Why a master data system would ever spit out a non-integer key is a mystery to me, for reasons that I hope will become obvious in my next post. But I have seen it happen: master data systems generating character keys.
Believe it or not, sometimes the reason quoted for using non-integer keys is that the key is generated by concatenating several columns into a string to make the combination unique.
From a best practice perspective, Kimball recommends using integer surrogate keys for many of the same reasons I will give here. But this guidance is often ignored, and source system keys are loaded directly into fact tables and other warehouse models to “save time” on development.
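For completeness, here is a minimal sketch of the Kimball-style approach (the table and column names are hypothetical): an IDENTITY column generates the integer surrogate key, and the natural key from the source system is kept only as an attribute used for lookups during load:

```sql
/* Minimal sketch of an integer surrogate key (hypothetical names).
   SK is the warehouse key; NaturalKey is whatever the source system uses. */
CREATE TABLE Dim_Customer (
    SK           INT IDENTITY(1,1) NOT NULL PRIMARY KEY
  , NaturalKey   VARCHAR(20)       NOT NULL UNIQUE
  , CustomerName VARCHAR(100)      NOT NULL
)

/* During load, facts look up the dimension on NaturalKey once,
   and from then on store only the small integer SK. */
```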
The basic fact, which I will now proceed to prove to you, is that non-integer keys are a waste of CPU cycles! CPUs deal better with integers than string types. But just how bad is it? Let me illustrate with an example. Let us consider three competing design choices for a fact and dimension table:
- Using a 4-byte integer key as per Kimball/old style data modeling traditions
- Using a character key with some basic case insensitive intelligence built in. To make a fair comparison, we shall use a CHAR(4), which takes up the same space as the integer key.
- Using a character key as above, but being a bit more cunning about the collation choice. To improve join speed, we will use a binary collation (which makes string compares faster).
The following code generates the 64K-row dimension tables representing these three choices. It will also generate three matching 6.5M-row fact tables:
```sql
USE tempdb
GO
SET NOCOUNT ON
GO

/* Create dimension tables with different key types */
CREATE TABLE Dim_Int (
    SK INT NOT NULL PRIMARY KEY
  , A1 INT NOT NULL
)

CREATE TABLE Dim_Char (
    SK CHAR(4) COLLATE Latin1_General_CI_AS NOT NULL PRIMARY KEY
  , A1 INT NOT NULL
)

CREATE TABLE Dim_Char_BIN (
    SK CHAR(4) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY
  , A1 INT NOT NULL
)

/* Populate dimension tables */
DECLARE @i INT
SET @i = 0
WHILE @i < 65536 BEGIN
  INSERT Dim_Int (SK, A1) VALUES (@i, @i % 100)
  SET @i = @i + 1
END

INSERT Dim_Char
SELECT RIGHT('0000' + dbo.fn_convert_to_base(SK, 16), 4), A1
FROM Dim_Int

INSERT Dim_Char_BIN
SELECT * FROM Dim_Char

/* Create fact tables */
CREATE TABLE Fact_Int (
    SK INT NOT NULL
  , M1 SMALLMONEY NOT NULL
)

CREATE TABLE Fact_Char (
    SK CHAR(4) COLLATE Latin1_General_CI_AS NOT NULL
  , M1 SMALLMONEY NOT NULL
)

CREATE TABLE Fact_Char_BIN (
    SK CHAR(4) COLLATE Latin1_General_100_BIN2 NOT NULL
  , M1 SMALLMONEY NOT NULL
)

/* Validate that we are indeed using the same space for dimensions */
EXEC sp_spaceused 'Dim_Int'
EXEC sp_spaceused 'Dim_Char'
EXEC sp_spaceused 'Dim_Char_BIN'

INSERT Fact_Int WITH (TABLOCK)
SELECT D.SK, BINARY_CHECKSUM(D.SK, D.A1) % 1000 AS M1
FROM Dim_Int D
CROSS JOIN dbo.fn_nums(100)

/* Might run a bit of time */
INSERT Fact_Char WITH (TABLOCK)
SELECT RIGHT('0000' + dbo.fn_convert_to_base(D.SK, 16), 4), F.M1
FROM Fact_Int F
JOIN Dim_Int D ON D.SK = F.SK

INSERT Fact_Char_BIN WITH (TABLOCK)
SELECT * FROM Fact_Char

/* Validate that we are indeed using the same space for facts */
EXEC sp_spaceused 'Fact_Int'
EXEC sp_spaceused 'Fact_Char'
EXEC sp_spaceused 'Fact_Char_BIN'
```
Now, we can run our typical data warehouse query: a join of a dimension with the fact and grouping on an attribute:
```sql
SET NOCOUNT ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO

/* INT join */
SELECT D.A1, SUM(F.M1)
FROM Fact_Int F
JOIN Dim_Int D ON D.SK = F.SK
GROUP BY D.A1
OPTION (MAXDOP 1)
GO

/* CHAR(4) join */
SELECT D.A1, SUM(F.M1)
FROM Fact_Char F
JOIN Dim_Char D ON D.SK = F.SK
GROUP BY D.A1
OPTION (MAXDOP 1)
GO

/* CHAR(4) join with binary collation */
SELECT D.A1, SUM(F.M1)
FROM Fact_Char_BIN F
JOIN Dim_Char_BIN D ON D.SK = F.SK
GROUP BY D.A1
OPTION (MAXDOP 1)
```
Running on my laptop with a warm cache (taking I/O out of the equation with zero physical reads) gives me this result:
|Key Choice|CPU Time|Clock Time|Logical Reads Fact / Dim|
|---|---|---|---|
|4-byte integer|~7 sec|~7 sec|13770 / 140|
|CHAR(4), case insensitive, accent sensitive collation|~12 sec|~12 sec|13700 / 140|
|CHAR(4), binary collation|~8.5 sec|~8.5 sec|13770 / 140|
This should make it pretty clear that it is worth using an integer key to save yourself some money on hardware and to spare your users the frustration of waiting for joins to return. Also note that the above effect will become even more pronounced as you add more dimensions to the fact table.
If you are stuck with character keys, it also shows that you will most likely benefit from using a binary collation (if you can live with the semantics implied by that choice).
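If changing the column collation itself is not an option, SQL Server also lets you force a binary comparison at query time with a COLLATE clause in the join predicate. This is a sketch against the tables above; be aware that applying COLLATE to a column expression can prevent an index seek on that column:

```sql
/* Force a binary comparison for this join only.
   Caveat: the COLLATE expression on D.SK can block an index seek. */
SELECT D.A1, SUM(F.M1)
FROM Fact_Char F
JOIN Dim_Char D
  ON D.SK COLLATE Latin1_General_100_BIN2 = F.SK COLLATE Latin1_General_100_BIN2
GROUP BY D.A1
OPTION (MAXDOP 1)
```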