
Transforming Source Keys to Real Keys – Part 2: Using…

In part 1 of this post, I introduced the idea of map tables. These tables serve as an abstraction between the source systems and the entities in the data warehouse. In this post, I will describe how you can use the maps to correct for source key errors.

Using the taxonomy of key pathologies described earlier, I will walk you through some examples of map usage.

Some conventions I will use in the populated tables illustrated in this blog:

  • Source Keys are in italics
  • Entity Keys (in the Entities schema) are coloured
  • Keys and column values that are irrelevant to the argument have been grayed out so you can’t read them
  • Values that receive UPDATE statements are marked in green highlight.

In order of the taxonomy, let us start with the composite key.

Fixing Composite Keys

Recall that we use the Entities schema to hold the “master key” for every entity and that we don’t rely on any source to be a master. The map table is a way to transform any key from a source to that master key.

For example, let us assume our sales system supplies Promotion keys as a composite of SKU and PromotionDate, because some sneaky programmer couldn’t be bothered to create a new key for a promotion and instead argued that a single SKU is only ever on promotion on one date at a time (this, as it happens, is a good example of arguing for doing the wrong thing).

In this case, we would have a map like this:

And our Dimension/Entity table will be:

And no, we will NOT store the SKU or the PromotionDate in this table: they are either redundant (the SKU is already in the sales data) or we don’t want to maintain the muddle-headed design of the source in the warehouse.
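To make the structure concrete, here is a minimal sketch of the two tables, using SQLite driven from Python as a stand-in for the warehouse. The table and column names follow the examples in this post; the data types and defaults are my assumptions, not a prescription:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Promotion (
    EK_Promotion INTEGER PRIMARY KEY,   -- the "master key" for the entity
    Name         TEXT NOT NULL,
    CreatedDate  TEXT NOT NULL
);

CREATE TABLE Maps_Promotion_Sales (
    Map_ID        INTEGER PRIMARY KEY,
    SKU           INTEGER NOT NULL,     -- first half of the composite source key
    PromotionDate TEXT    NOT NULL,     -- second half of the composite source key
    ValidFrom     TEXT    NOT NULL,
    ValidTo       TEXT    NOT NULL,
    IsCurrent     INTEGER NOT NULL DEFAULT 1,
    IsDeleted     INTEGER NOT NULL DEFAULT 0,
    EK_Promotion  INTEGER NOT NULL REFERENCES Dim_Promotion (EK_Promotion)
);
""")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Note that the source key columns live only in the map; the dimension itself knows nothing about SKUs or promotion dates.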

Example population of the tables if we receive the first promotion data on 2011-01-01:

Dim.Promotion

EK_Promotion | Name             | CreatedDate
1            | Big Week Sale    | 2011-01-01
2            | Good Deals Today | 2011-01-01

Maps.Promotion_Sales

Map_ID | SKU | PromotionDate | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Promotion
5      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-07 | true      | false     | 1
6      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-02 | true      | false     | 2

Apart from making the fact tables linked to this entity smaller, we can now work around our poorly arguing programmer when the business wants to have more than one SKU on the same promotion. Let us say that SKU = 87 is now also on the “Good Deals Today” promotion. We can now update the maps, without wreaking havoc on the Entities:

Maps.Promotion_Sales (updated to reflect the new reality)

Map_ID | SKU | PromotionDate | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Promotion
5      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-07 | true      | false     | 1
6      | 43  | 2011-01-01    | 2011-01-01 | 2011-01-02 | true      | false     | 2
7      | 87  | 2011-01-01    | 2011-01-01 | 2011-01-02 | true      | false     | 2
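With the map in place, a fact load can swap the composite source key for the entity key with a simple join. A sketch, with SQLite standing in for the warehouse; the map content is copied from the table above, and the staging fact row is a hypothetical example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Maps_Promotion_Sales (
    Map_ID INTEGER, SKU INTEGER, PromotionDate TEXT, ValidFrom TEXT,
    ValidTo TEXT, IsCurrent INTEGER, IsDeleted INTEGER, EK_Promotion INTEGER);
INSERT INTO Maps_Promotion_Sales VALUES
    (5, 43, '2011-01-01', '2011-01-01', '2011-01-07', 1, 0, 1),
    (6, 43, '2011-01-01', '2011-01-01', '2011-01-02', 1, 0, 2),
    (7, 87, '2011-01-01', '2011-01-01', '2011-01-02', 1, 0, 2);
-- A hypothetical staging fact: one sale of SKU 87 during the promotion.
CREATE TABLE Staging_Sales (SKU INTEGER, PromotionDate TEXT, Quantity INTEGER);
INSERT INTO Staging_Sales VALUES (87, '2011-01-01', 3);
""")

# The load swaps the composite source key for the entity key.
rows = con.execute("""
    SELECT m.EK_Promotion, s.Quantity
    FROM Staging_Sales AS s
    JOIN Maps_Promotion_Sales AS m
      ON m.SKU = s.SKU AND m.PromotionDate = s.PromotionDate
    WHERE m.IsCurrent = 1
""").fetchall()
print(rows)  # the fact now carries EK_Promotion = 2, not the composite key
```

This is exactly why the fact table gets smaller: it stores one integer entity key instead of the composite.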

Fixing Zombie Keys

Assume our inventory system has zombie keys. Recall that they use EAN as the key over there. At the initial load of our warehouse, let us say on 2010-06-01, we receive this staging table from them:

Staging.Products_Inventory (on 2010-06-01)

EAN            | Weight
978-0465026562 | 1.0
978-0201485417 | 2.2

This will cause the following population of the map:

Maps.Products_Inventory (on 2010-06-01)

Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product
51     | 978-0465026562 | 2010-06-01 | 9999-01-01 | true      | false     | 1
52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2

Now, let us assume that EAN = 978-0465026562 goes out of stock on 2010-08-01 and that the inventory system cleans up the record. We will now receive this staged data:

Staging.Products_Inventory (on 2010-08-01)

EAN            | Weight
978-0201485417 | 2.2

We must conclude that the key is dead, and we update the map to reflect this.

Maps.Products_Inventory (on 2010-08-01)

Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product
51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1
52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2

Let us assume we have agreed with the source that it has zombie key behaviour. Because of this, we leave IsDeleted = false. We know that EK_Product = 1 is supposed to be dead, but it may return from the grave.
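The close-out step above can be sketched like this: any mapping that is current, but whose key is absent from today's staging load, gets its ValidTo set and IsCurrent cleared, while IsDeleted stays false for a zombie source. SQLite stands in for the warehouse, and the detection query is just one way of finding the missing keys:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Maps_Products_Inventory (
    Map_ID INTEGER, EAN TEXT, ValidFrom TEXT, ValidTo TEXT,
    IsCurrent INTEGER, IsDeleted INTEGER, EK_Product INTEGER);
INSERT INTO Maps_Products_Inventory VALUES
    (51, '978-0465026562', '2010-06-01', '9999-01-01', 1, 0, 1),
    (52, '978-0201485417', '2010-06-01', '9999-01-01', 1, 0, 2);
-- Staging as delivered on 2010-08-01: the first EAN is gone.
CREATE TABLE Staging_Products_Inventory (EAN TEXT, Weight REAL);
INSERT INTO Staging_Products_Inventory VALUES ('978-0201485417', 2.2);
""")

# Close every current mapping whose EAN did not arrive in today's load.
# For a zombie source we deliberately leave IsDeleted = 0: it may return.
con.execute("""
    UPDATE Maps_Products_Inventory
    SET ValidTo = '2010-08-01', IsCurrent = 0
    WHERE IsCurrent = 1
      AND EAN NOT IN (SELECT EAN FROM Staging_Products_Inventory)
""")

after_close = con.execute(
    "SELECT Map_ID, ValidTo, IsCurrent, IsDeleted "
    "FROM Maps_Products_Inventory ORDER BY Map_ID").fetchall()
print(after_close)
```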

The supplier for EAN = 978-0465026562 delivers us new stock three months later. On 2010-11-01 we again observe the EAN in the staging tables, and we reflect this in the map like this:

Maps.Products_Inventory (on 2010-11-01)

Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product
51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1
52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2
53     | 978-0465026562 | 2010-11-01 | 9999-01-01 | true      | false     | 1
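The resurrection step can be sketched as: look up the entity key the EAN pointed to before it died, and insert a new validity interval that reuses it. SQLite stands in for the warehouse, and the "most recent non-deleted row" lookup is my assumption about how to find the old entity key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Maps_Products_Inventory (
    Map_ID INTEGER PRIMARY KEY, EAN TEXT, ValidFrom TEXT, ValidTo TEXT,
    IsCurrent INTEGER, IsDeleted INTEGER, EK_Product INTEGER);
INSERT INTO Maps_Products_Inventory VALUES
    (51, '978-0465026562', '2010-06-01', '2010-08-01', 0, 0, 1),
    (52, '978-0201485417', '2010-06-01', '9999-01-01', 1, 0, 2);
""")

# The EAN is back in staging on 2010-11-01. Find the entity key it used to
# map to (IsDeleted = 0 means the zombie is allowed to return).
ek = con.execute("""
    SELECT EK_Product FROM Maps_Products_Inventory
    WHERE EAN = '978-0465026562' AND IsDeleted = 0
    ORDER BY ValidTo DESC LIMIT 1
""").fetchone()[0]

# Insert a new interval; row 51 stays closed, the entity key is reused.
con.execute(
    "INSERT INTO Maps_Products_Inventory "
    "VALUES (53, '978-0465026562', '2010-11-01', '9999-01-01', 1, 0, ?)",
    (ek,))

rows = con.execute(
    "SELECT Map_ID, EK_Product FROM Maps_Products_Inventory "
    "WHERE EAN = '978-0465026562' ORDER BY Map_ID").fetchall()
print(rows)
```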

At this point you may ask: why didn’t we just update the ValidTo date of Map_ID = 51 and set it to 2010-11-01, instead of adding a new row? The reasons we track the “dead state” of the key are twofold:

  1. We want to be able to detect wrong data from the inventory system. If a fact arrives when the key is supposed to be dead, we can flag it.
  2. We want to be able to handle any dolly keys arriving in the interval between 2010-08-01 and 2010-11-01. Recall that we have a unique index in place that enforces the rule:

“a source key cannot have TWO different meanings at the same point in time”

Let us see this rule in action…
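That rule can also be phrased as: no two map rows for the same source key may have overlapping validity intervals and different entity keys. A sketch of a validation check the load could run, in pure Python; the row layout mirrors the map tables in this post, and how you enforce the rule in your own engine (index, constraint, or load-time check) is up to you:

```python
from itertools import combinations

def overlapping_meanings(rows):
    """Return pairs of Map_IDs where the same source key maps to two
    different entity keys over overlapping validity intervals."""
    bad = []
    for a, b in combinations(rows, 2):
        if (a["key"] == b["key"] and a["ek"] != b["ek"]
                and a["valid_from"] < b["valid_to"]
                and b["valid_from"] < a["valid_to"]):
            bad.append((a["map_id"], b["map_id"]))
    return bad

# ISO dates compare correctly as strings, so no date parsing is needed.
rows = [
    {"map_id": 51, "key": "978-0465026562", "valid_from": "2010-06-01",
     "valid_to": "2010-08-01", "ek": 1},
    {"map_id": 53, "key": "978-0465026562", "valid_from": "2010-11-01",
     "valid_to": "9999-01-01", "ek": 1},
]
print(overlapping_meanings(rows))  # [] -> no key has two meanings at once
```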

Fixing Dolly Keys

After our warehouse goes into production, the guys down in inventory find an error in the reports. After some investigation, it is discovered that on 2010-10-13 a new guy unpacked a crate containing a book called “Gödel, Escher, Bach: Ein Endloses Geflochtenes Band”. The new guy, understanding German, concluded that this was most likely the book with EAN = 978-0465026562, so he created a new entry in the inventory system, temporarily making the EAN appear in the system, but pointing at the WRONG entity. The book was returned to the supplier the day after, because the inventory guys realised it was in the wrong language; after this, the batch clean-up process got rid of the EAN again.

However, the guys down in inventory care about returns; they want to track them and report on errors. We are faced with the problem of an EAN temporarily representing the wrong entity. Think this example is far-fetched? I have seen it several times, on keys a lot of people would personally care about.

Fortunately, our map structure can handle this. We start by looking up the EK_Product of the German edition of Gödel, Escher, Bach – let us assume it is EK_Product = 42. We can now handle our dolly key like this:

 

Maps.Products_Inventory (updated after 2010-11-01)

Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product
51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1
52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2
53     | 978-0465026562 | 2010-11-01 | 9999-01-01 | true      | false     | 1
54     | 978-0465026562 | 2010-10-13 | 2010-10-14 | false     | true      | 42

Notice that, unlike the zombies that are the norm for this source, we marked the dolly key with IsDeleted = true; we don’t expect this sheep to come back from the grave (zombie sheep, now THERE is an idea for a new game in the app store).
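With row 54 in place, a fact is resolved by the validity interval that covers its date, so the returns fact from 2010-10-13 lands on the German edition while every other date resolves as before. A sketch of that point-in-time lookup, with SQLite standing in for the warehouse and the map content copied from the table above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Maps_Products_Inventory (
    Map_ID INTEGER, EAN TEXT, ValidFrom TEXT, ValidTo TEXT,
    IsCurrent INTEGER, IsDeleted INTEGER, EK_Product INTEGER);
INSERT INTO Maps_Products_Inventory VALUES
    (51, '978-0465026562', '2010-06-01', '2010-08-01', 0, 0, 1),
    (53, '978-0465026562', '2010-11-01', '9999-01-01', 1, 0, 1),
    (54, '978-0465026562', '2010-10-13', '2010-10-14', 0, 1, 42);
""")

def resolve(ean, fact_date):
    # Pick the map row whose validity interval covers the fact date.
    row = con.execute("""
        SELECT EK_Product FROM Maps_Products_Inventory
        WHERE EAN = ? AND ValidFrom <= ? AND ? < ValidTo
    """, (ean, fact_date, fact_date)).fetchone()
    # No row means the fact arrived while the key was dead: flag it.
    return row[0] if row else None

print(resolve('978-0465026562', '2010-10-13'))  # 42, the German edition
print(resolve('978-0465026562', '2010-12-24'))  # 1, the normal mapping
print(resolve('978-0465026562', '2010-09-01'))  # None, key was dead: flag
```

The None case is the first reason we kept the dead interval around: facts arriving in it are detectably wrong.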

Merging Keys Together – Handling Multi Keys

To wrap up this blog entry about the usage of map tables, let me illustrate another use for them. There are cases where the source system carries two different keys for the same entity. This CAN be the result of an error (often the case with customer data), but there are situations where having two keys for an entity is a legitimate use case.

For example, our inventory system may have multiple EAN numbers for the “same” book. This happens because books can exist in multiple editions: paperback, hardcover, reprints and new editions. From the perspective of the inventory system, and often the customer, we may even care about the different editions. But perhaps this is deemed irrelevant from a business reporting perspective. If I drill into Douglas Hofstadter in my Product Dimension, I will probably WANT to see all English editions of “Gödel, Escher, Bach” as the same row.

Fortunately, the merging of source rows can be handled by maps. It WAS actually the case that “Gödel, Escher, Bach” had a paperback edition in 1999 (which I happen to own) with another EAN = 1313353195. We can reflect this in our map:

Maps.Products_Inventory (updated after 2010-11-01, old paperback added)

Map_ID | EAN            | ValidFrom  | ValidTo    | IsCurrent | IsDeleted | EK_Product
51     | 978-0465026562 | 2010-06-01 | 2010-08-01 | false     | false     | 1
52     | 978-0201485417 | 2010-06-01 | 9999-01-01 | true      | false     | 2
53     | 978-0465026562 | 2010-11-01 | 9999-01-01 | true      | false     | 1
54     | 978-0465026562 | 2010-10-13 | 2010-10-14 | false     | true      | 42
55     | 1313353195     | 1999-01-01 | 9999-01-01 | true      | true      | 1

Notice that we leave IsCurrent = true. This is a perfectly valid EAN to receive from the source system. It may simply have been lingering around on the dusty shelves for a long time.
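Because both EANs map to the same EK_Product, fact rows for either edition roll up to a single dimension member. A sketch, with SQLite standing in for the warehouse; the staging quantities are hypothetical, and the map is reduced to the columns the join needs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Maps_Products_Inventory (
    Map_ID INTEGER, EAN TEXT, IsCurrent INTEGER, EK_Product INTEGER);
INSERT INTO Maps_Products_Inventory VALUES
    (53, '978-0465026562', 1, 1),
    (55, '1313353195',     1, 1);
-- Hypothetical sales of the hardcover and the old paperback.
CREATE TABLE Staging_Sales (EAN TEXT, Quantity INTEGER);
INSERT INTO Staging_Sales VALUES
    ('978-0465026562', 2),
    ('1313353195', 5);
""")

# Both editions land on the same entity key and aggregate together.
result = con.execute("""
    SELECT m.EK_Product, SUM(s.Quantity)
    FROM Staging_Sales AS s
    JOIN Maps_Products_Inventory AS m
      ON m.EAN = s.EAN AND m.IsCurrent = 1
    GROUP BY m.EK_Product
""").fetchall()
print(result)  # one dimension member, combined quantity
```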

Summary – Transforming Source Keys to Real Keys

In this blog, I have shown you how you can use map tables to handle key errors in the sources. For those of you who currently use type 2 dimension history for this purpose: you may wish to consider the implications of polluting your end user data with this type of “source noise”. How would you use a type 2 dimension table when sources are replaced, or when errors occur that you don’t want reflected in the end user data?

Consider also what it means for the maintainability of the warehouse to adopt the poor key choices in the source as an integrated part of the end user reporting data. I think I have argued that there has to be a de-coupling between the “history reporting” that end-users want, and the tracking and handling of source keys. It is my claim that the data model is simplified by not mixing these two concepts together in the same table. It is also my claim that you should always adopt the map tables, with the structures described here, as an integral part of the ETL process in the data warehouse.

As Jørgen Guldmann pointed out in a previous comment, having the structures in place for mapping does not mean you can find a reliable source of the data needed to handle the transformation. The maps allow you to change the sources to the format they should have had in the first place, even if that means manually updating the map tables or hiring someone to correct errors by adding rows to the map (as we saw with the dolly key example). It is worth observing that if you can classify which key pathologies a source has, you can often automate the maintenance of the map, and only manually insert the exceptions you find as you iterate towards the completely cleansed Enterprise Data Warehouse.

At a point in the future, you may find the holy grail and end up with no source pathologies at all, maybe even a “master source” for all entities. The maps can serve as the requirement definition for that process. Furthermore, the maps can be used as update structures for sources that want to comply with a better key structure. Some people would call this master data management, and by proactively building maps as part of your data warehouse model, you have placed yourself in a position to lead that argument with a data driven approach.

Just don’t be surprised if you end up BEING the master data system once you start building mapping structures. But this is a good thing: the problem does not go away just because you push it around – with maps, you have a way to quantify and encapsulate it that does not ruin your brilliant data model.

 

PS: The book I have used as an example happens to be one of the best books ever written on the relationship between music, philosophy, art and mathematics. I really cannot recommend it enough.

 

11 Comments


  7. MOCKALb:

     Great post. Idea for a future post: cleaning the DWH of ‘phantom’ keys (or orphaned keys)?

     • Thomas Kejser:

       Thanks MOCKALb. Yes, I plan to talk about orphaned keys at a later time. This will be related to the posts on UPDATE (which will also cover DELETE) statements that I will do later.

  8. Marco Russo:

     Thomas, on every blog I think we converge to the same model using different names :)
     Where do you put these Map tables? I mean, in which database? The same one that contains the dimensional model? If it is different, what do you call it? (please, no “staging”…)

     • Thomas Kejser:

       Hi Marco

       I think we have a lot of overlap in our approach here – and I think that is a good sign. With regards to placing the tables physically in a database: that is actually the subject of my next blog entry.
