Archive

Archive for the ‘Data Warehouse’ Category

Physically Placing the Maps in the architecture

2011-08-19 2 comments

Before we leave the maps behind, I need to live up to my promise of describing the storage characteristics of tables visited during the journey, this must include the physical location of maps. Since believe form must follow function in a DW, let us just recall function:

From a functional perspective, I have shown you how map tables can be used to both track and correct source system keys. Maps are not visible to the end user, but they are a necessary part of the data’s journey from the source to the final data model. Maps also provide the abstraction of, or interface to, master data sources. In the absence of those sources – the maps can even serve as a makeshift master data repository.

As we travelled along the ETL road, I also introduced you to the Entities tables. I deferred the exact definition of them, but we have seen how the Entities hold the master key for every entity in the warehouse, along with some book keeping columns. Entities is our next stop, but during our brief visit to them, we noticed that there is a foreign key between the maps and the entities, with the entity holding the primary, master key.

Because foreign keys need to look up rows on every insert and update, it makes a lot of sense to physically locate tables related through foreign keys on the same machine: co-locating frequently joined tables make joins between them much faster. Furthermore, to preserve transactional integrity, the foreign and primary key tables must live inside the same transactional container. In the case of SQL server, this means that maps and entities must live in the same database – since that is the container of transactional integrity which does not require distributed transactions.

Whenever we add a new entity key, this is driven by adding a new map entry (for a new entity to appear, it must have been observed in at least one source). Because of this, we want to make sure a backup of entities always includes maps that are consistent with that data. In SQL Server terms, on a large database, this means that maps and entities live in the same file group. In DB2 and Oracle, I believe the term is table space for the same concept (the smallest unit of consistent backup).

Because of the above we will physically locate maps and entities together like this:

image

In my next post in the modeling series, we will spend more time with the Entities – keepers of the master keys and type 1 history.

I will not be covering master data in further detail in this blog, see the references section below if you want to dig in further.

References:

Location:Reading, UK

Transforming Source Keys to Real Keys – Part 2: Using Maps To Fix Key Problems

2011-08-15 10 comments

KeysIn part 1 of this post, I introduced the idea of map tables. These tables serve as an abstraction between the source systems and the entities in the data warehouse. In this post, I will describe how you can use the maps to correct for source key errors.

Using the taxonomy of key pathologies described earlier, I will walk you through some examples of map usage.

Some conventions I will use in populated tables illustrated in this blog:

  • Source Keys are in italics
  • Entity Keys (in the Entities schema) are coloured
  • Keys and column values that are irrelevant to the argument have been grayed out so you cant read them
  • Values that receive UPDATE statements are marked in green highlight.

In order of the taxonomy, let us start with the composite key.

Fixing Composite Keys

Recall that we use the Entities schema to hold the “master key” for every entity and that we don’t rely on any source to be a master. The map table is a way to transform any key from a source to that master key.

For example, let us assume our sales system supplies Promotion keys as a composite of SKU and and PromotionDate. Because some sneaky programmer couldn’t be bothered to create a new key for a promotion and instead argued that a single SKU is only ever on promotion on one date at a time (this, as it happens, is a good example of trying to argue for doing the wrong thing).

In this case, we would have a map like this:

CREATE TABLE Maps.Promotion_Sales (
Map_ID INT IDENTITY(1,1) NOT NULL
, SKU INT NOT NULL
, PromotionDate DATETIME NOT NULL
, ValidFrom DATETIME NOT NULL
, ValidTo DATETIME NOT NULL
, IsCurrent BIT NOT NULL
, IsDeleted BIT NOT NULL
, Promotion_Key INT NOT NULL
)
/* Index to enforce source uniqueness  */

ALTER TABLE Maps.Products_Inventory
ADD CONSTRAINT UK_SKU_PromotionDate
UNIQUE CLUSTERED (SKU, PromotionDate, ValidFrom)

/* Standard indexes here as per other blog entry */


And our Entities table will be:

CREATE TABLE Entities.Promotion (
, Promotion_Key INT NOT NULL PRIMARY KEY
, Name VARCHAR(50) NOT NULL DEFAULT (”)
, CreatedDate DATETIME NOT NULL DEFAULT (GETDATE())
)

And no, we will NOT store the SKU or promotion in this table, they are either redundant (the SKU is already in the sales data) or we don’t want to maintain the muddle-headed design of the source in the warehouse.

Example population of the tables if we receive the first promotion data on 2011-01-01

Entities.Promotion

Promotion_Key Name CreatedDate
1 Big Week Sale 2011-01-01
2 Good Deals Today 2011-01-01

Maps.Promotion_Sales

Map
ID
SKU Promotion
Date
Valid
From
Valid
To
Is
Current
Is
Deleted
Promotion
Key
5 43 2011-01-01 2011-01-01 2011-01-07 true false 1
6 43 2011-01-01 2011-01-01 2011-01-02 true false 2

Apart from making the fact tables linked to this entity smaller, we can now work around our poorly arguing programmer when the business wants to have more than one SKU on the same promotion. Let us say that SKU = 87 is no on the “Good Deals Today” promotion. We can now update the Maps, without wreaking havoc with the Entities:

Maps.Promotion_Sales (updated to reflect the new reality)

Map
ID
SKU Promotion
Date
Valid
From
Valid
To
Is
Current
Is
Deleted
Promotion
Key
5 43 2011-01-01 2011-01-01 2011-01-07 true false 1
6 43 2011-01-01 2011-01-01 2011-01-02 true false 2
7 87 2011-01-01 2011-01-01 2011-01-02 true false 2

Fixing Zombie Keys

Assume our inventory system has zombie keys. Recall that they use EAN as the key over there. At the initial load of our warehouse, let us say on 2010-06-01, we receive this staging table from them:

Staging.Products_Inventory (on 2010-06-01)

EAN Weight
978-0465026562 1.0
978-0201485417 2.2

This will cause the following populate of the map:

Maps.Products_Inventory (on 2010-06-01)

Map
ID
EAN Valid
From
Valid
To
Is
Current
Is
Deleted
Product
Key
51 978-0465026562 2010-06-01 9999-01-01 true false 1
52 978-0201485417 2010-06-01 9999-01-01 true false 2

Now, let us assume that EAN = 978-0465026562 goes out of stock on 2010-08-01 and that the inventory system cleans up the record. We will now receive this staged data:

Staging.Products_Inventory (on 2010-08-01)

EAN Weight
978-0201485417 2.2

We must conclude that the key is dead, and we update the map to reflect this.

Maps.Products_Inventory (on 2010-08-01)

Map
ID
EAN Valid
From
Valid
To
Is
Current
Is
Deleted
Product
Key
51 978-0465026562 2010-06-01 2010-08-01 false false 1
52 978-0201485417 2010-06-01 9999-01-01 true false 2

Let us assume we have agreed with the source that it has zombie key behaviour. Because of this, we leave IsDelete = false. We know that Product_Key = 1 is supposed to be dead, but it may return from the grave.

The supplier for EAN = 978-0465026562 delivers us a new stock three months later. On 2011-11-01 we again observe the EAN in the staging tables, and we reflect this in the map like this:

Maps.Products_Inventory (on 2010-11-01)

Map
ID
EAN Valid
From
Valid
To
Is
Current
Is
Deleted
Product
Key
51 978-0465026562 2010-06-01 2010-08-01 false false 1
52 978-0201485417 2010-06-01 9999-01-01 true false 2
53 978-0465026562 2010-11-01 9999-01-01 true false 1

At this point you may ask: why didn’t we just update the ValidTo date of Map_ID = 51 and set it to 2010-11-01, instead of adding a new row? The reason we track the “dead state” of the key are twofold:

  1. We want to be able to detect wrong data from the inventory system. If a fact arrives when the key should be dead, we can flag this.
  2. We want to be able to handle any dolly keys arriving in the interval 2010-08-01 and 2010-11-01
      Recall that we have a unique index in place that enforces the rule:

    “a source key cannot have TWO different meanings at the same point in time”

      . Let us see this rule in action…

    Fixing Dolly Keys

    After our warehouse goes into production, the guys down in inventory discover an error in the reports. After some investigation, it is discovered that on 2010-10-13 a new guy unpacked a crate containing a book called: “Gödel, Escher, Bach: Ein Endloses Geflochtenes Band”. The new guy, understanding German, understood that this was most likely the book with EAN = 978-0465026562, so he created a new entry in the inventory system, temporarily making the EAN appear in the system – but pointing to the WRONG entity. The book was returned the day after to the supplier, because the inventory guys realised it was in the wrong language, after this, the batch clean process gets rid of the EAN again.

    Picture of Black SheepHowever, the guys down in inventory care about returns, they want to track them and report on errors. We are faced with the problem of an EAN temporarily representing the wrong entity. Think this example is far fetched? I have seen it several times, on keys a lot of people would personally care about.

    Fortunately, our Map structure can handle this. We start by looking up the Product_Key of the German edition of Gödel, Escher, Bach – let us assume it is Product_Key = 42. We can now handle our Dolly key like this:

     

    Maps.Products_Inventory (updated after 2010-11-01)

    Map
    ID
    EAN Valid
    From
    Valid
    To
    Is
    Current
    Is
    Deleted
    Product
    Key
    51 978-0465026562 2010-06-01 2010-08-01 false false 1
    52 978-0201485417 2010-06-01 9999-01-01 true false 2
    53 978-0465026562 2010-11-01 9999-01-01 true false 1
    54 978-0465026562 2010-10-13 2010-10-14 false true 42

    Notice that, unlike the zombies that are the norm of this source, we marked the Dolly key as IsDeleted = true, we don’t expect this sheep to come back from the grave

    (zombie sheep, now THERE is an idea for a new game in the app store).

    Merging keys Together – Handling Multi Keys

    To wrap up this blog entry about the usage of map tables, let me illustrate another use for them. There are cases where the source system carries two different keys for the same entity. This CAN be the result of an error (often the case with customer data), but there are situations where having two keys for an entity is a legitimate use case.

    For example, our inventory system may have multi EAN numbers for the “same” book. This happens because books can exist in multiple editions. Examples of these differences are: paperback, hardcover, reprints and new editions. From the perspective of the inventory system, and often the customer, we may even care about the different editions. But perhaps this is deemed irrelevant from a business reporting perspective. If I drill into the Douglas Hoffstadter in my Product Dimension – I will probably WANT to see all English editions of “Gödel, Escher, Bach” as the same row.

    Fortunately, the merging of source rows can be handled by maps. It WAS actually the case that “Gödel, Escher, Bach” had a paperback edition in 1999 (which I happen to own) with another EAN = 1313353195. We can reflect this in our map:

    Maps.Products_Inventory (updated after 2010-11-01, old paperback added)

    Map
    ID
    EAN Valid
    From
    Valid
    To
    Is
    Current
    Is
    Deleted
    Product
    Key
    51 978-0465026562 2010-06-01 2010-08-01 false false 1
    52 978-0201485417 2010-06-01 9999-01-01 true false 2
    53 978-0465026562 2010-11-01 9999-01-01 true false 1
    54 978-0465026562 2010-10-13 2010-10-14 false true 42
    55 1313353195 1999-01-01 9999-01-01 true true 1

    Notice that we leave IsCurrent = true. This is a perfectly valid EAN to receive from the source system. I may simply have been lingering around on the dusty shelves for a long time.

    Summary – Transforming Source Keys to Real Keys

    In this blog, I have shown you how you can use map tables to handle key errors in the sources. For those of you that currently use type 2 dimension history for this purpose – you may wish to consider the implications of polluting your end user data with this type of “source noise”. How would you use a type2 dimension table when sources are replaced or errors occur that you don’t want to reflect in the end user data?

    Consider also what it means for the maintainability of the warehouse to adopt the poor key choices in the source as an integrated part of the end user reporting data. I think I have argued that there has to be a de-coupling between the “history reporting” that end-users want, and the tracking and handling of source keys. It is my claim that the data model is simplified by not mixing these two concepts together in the same table. It is also my claim that you should always adopt the map tables, with the structures described here, as an integral part of the ETL process in the data warehouse.

    As Jørgen Guldmann pointed out in a previous comment, having the structures in place for mapping does not mean you can find a reliable source of the data needed to handle the transformation. The maps allow you to change the sources to the  format they should have had in the first place, even if that means manually updating the map tables or having someone hired to correct errors by adding rows to the map (as we saw with the Dolly key example). It is worth observing that if you can  classify which key pathologies a source has, you can often automate the maintenance of the map, and only manually insert the exceptions you find as you iterate towards the completely cleansed Enterprise Data Warehouse.

    At a point in the future, you may find the holy grail and end up with no source pathologies at all, maybe even a “master source” for all entities. The maps can serve as the requirement definition for that process. Furthermore, the maps can be used as update structures for sources that want to comply with a better key structure. Some people would call this master data management, and by proactively building maps as part of your data warehouse model, you have placed yourself in a position to lead that argument with a data driven approach.

    Just don’t be surprised that you end up BEING the master data system once you start building mapping structures. But this is a good thing, the problem does not go away just because you push it around – with maps, you have a way to quantify and encapsulate it that does not ruin your brilliant data model.

    Related blog posts:

    PS: The book I have used as an example happens to be one of the best books ever written on the relationship between music, philosophy, art and mathematics. I rally cannot recommend it enough.

    Microsoft Announces Plans to Introduce Hadoop Interoperability

    2011-08-12 2 comments

    For those of you who have not yet seen it, Microsoft recently announced that they will be looking at Hadoop connectivity to the database stack:

    Some of you may have wondered why I have not yet mentioned the BigData movement as part of my DW articles. In my defense I will say that this a big trend, something I had to give a lot of thought to position correctly. Before I can talk more about how it fits into the full DW/BI architecture – I have to argue a bit more for my warehousing approach.

    I can reveal that I see traditional data warehousing (especially dimensional modeling) and BigData compliment each other in a way that solves some of the common complaints of warehouse builders across the world. I hope you will find my thoughts on BigData fit nicely into the picture I will be painting of the warehousing world going forward in this blog.

    Transforming Source Keys to Real Keys – Part 1: Introducing Map tables

    2011-08-04 11 comments

    I have ranted enough about the key quality, or lack thereof, in source systems. It is time to look at how we will practically go about transforming the dirty source to the final DW model. What I shall propose is a de-coupling of the source entities from the data warehouse entities. As will be shown, this leads to an interesting ETL pattern that can either address, or properly encapsulate, a variety of source problems.

    Because this post is about key generation, I shall limit myself to talk about dimensions or “reference data”. This data resides in the small tables in the warehouse – going no higher than 1 Billion rows (for VERY large customer dimensions or IP traffic monitoring).  More typically these tables range in the 10**3 to 10**7 rows. In fact, I will claim that small dimension/reference tables are the 99% case. I will assume it is reasonably clear to the reader that tables of such miniscule size can be comfortably be extracted and fully loaded into a temporary table in the warehouse staging area. If this should be unclear, or if you object to this assumption – please respond in the comments.

    I will also assume that we agree that source systems typically suffer from one or more pathologies as described earlier. And even if they don’t, we must at least assume that sources changes as new companies get acquired, big projects change IT infrastructure completely (SAP anyone?) or other similar events outside the control of the warehouse. We are thus faced with the task of keeping the data warehouse model relatively stable under these changing conditions.

    First, I propose that the warehouse has one table for each entity with at least this structure:

    CREATE TABLE Entities.<Entity Name (plural)> (  
     
    <Entity Name>_Key INT NOT NULL PRIMARY KEY
      , CreatedDate DATETIME NOT NULL DEFAULT (GETDATE())

      /* Other columns */
    )

    For now, please defer thinking about the exact nature of an entity, suffice to say that it is a concept that maps to some interesting object instance that can have a unique key. Also note that the table above has ONE row per entity, not one per historical version of it (i.e. it does not behave like a type 2 dimension table)

    For each source system that is the source of an entity, we shall create a map table with this structure:

    CREATE TABLE Maps.<Entity Name (plural)>_<Source Name> (
      , Map_ID INT NOT NULL PRIMARY KEY
      ,
    <Source Key Column> [, … n] NOT NULL
      , ValidFrom DATETIME NOT NULL
      , ValidTo DATETIME NOT NULL
      , IsCurrent BIT NOT NULL
      , IsDeleted BIT NOT NULL
      , <Entity Name>_Key NOT NULL
    )

    The map table is used to map between the row in the source system and the entity in the data warehouse. The entity we map to in the DW is not guaranteed to have a 1-1 relationship with the rows in the source, the map is the structure that encapsulates this discrepancy.

    By smartly populating the map table, we can handle source system errors during ETL.

    Example Scenario

    At this point, I think it is best to take the argument forward with an example. Let us assume we are designing the ETL system for a product table in the warehouse.

    The company we work for has two source systems for products: inventory and sales. The sales system uses the SKU as the key for the product, and the SKU is generated by the sales system because they run a big German made system that likes to be in control of these things. The inventory system uses the EAN of the product instead, since they use barcode scanners to keep track of products. For our scenario, we shall assume that the sales system holds the name of the product, the marketing team like to come up with cool sounding brands. The inventory system holds the weight of the product, because the guys down in storage have competitions about who can bench press the largest boxes..

    Example Schema

    Extracting the sales product table, we end up something like this in the staging area:

    CREATE TABLE Staging.Products_Sales (
      SKU INT NOT NULL
      , name NVARCHAR(50)
    )

    And for inventory:

    CREATE TABLE Staging.Products_Inventory (
      EAN DECIMAL(17,0)
      , weight FLOAT
    )

    In the data warehouse, we would like to hold the final, conformed products in a table like this:

    CREATE TABLE Entities.Products (
      , Product_Key INT NOT NULL PRIMARY KEY
      , CreatedDate DATETIME NOT NULL DEFAULT (GETDATE())
      , EAN DECIMAL(17,0) NULL
      , SKU INT NULL
      , Name NVARCHAR(50) NOT NULL DEFAULT (‘’)
      , WeightKg FLOAT NOT NULL
    )

    Product_Key is our key, we don’t trust the sources with that. Notice that EAN and SKU are just columns and not keys, we don’t assume they have values (NULL allowed) or even that they are unique.

    For bookkeeping purposes, I added the CreatedDate, to keep track of when the entity was first observed in the source.

    I don’t want to have NULL values in the Name column, so I will use the default of the empty string instead. I like to assign empty strings to text columns that are rendered in the client tools, I find that both users and client tools get easily confused by relational NULL semantics.

    For the WeightKg column, we have taken the liberty of adding the unit to the column name. This is generally good practice when this is not clear from the name of the column.

    For the inventory source system, I create the map table like this:

    CREATE TABLE Maps.Products_Inventory (
        Map_ID INT IDENTITY(1,1) NOT NULL
      , EAN DECIMAL(17,0) NOT NULL
      , ValidFrom DATETIME NOT NULL
      , ValidTo DATETIME NOT NULL
      , IsCurrent BIT NOT NULL
      , IsDeleted BIT NOT NULL
      , Product_Key INT
    )
    ALTER TABLE Maps.Products_Inventory
      ADD CONSTRAINT PK_Products_Inventory
      PRIMARY KEY NONCLUSTERED (Map_ID)
    ALTER TABLE Maps.Products_Inventory
      ADD CONSTRAINT UK_EAN
      UNIQUE CLUSTERED (EAN, ValidFrom)
    ALTER TABLE Maps.Products_Inventory
      ADD CONSTRAINT FK_Products_Inventory_Products
      FOREIGN KEY (Product_Key)
      REFERENCES Entities.Products (Product_Key)

    And similarly for the sales source:

    CREATE TABLE Maps.Products_Sales (
        Map_ID INT NOT NULL IDENTITY(1,1)
      , SKU INT NOT NULL
      , ValidFrom DATETIME NOT NULL
      , ValidTo DATETIME NOT NULL
      , IsCurrent BIT NOT NULL
      , IsDeleted BIT NOT NULL
      , Product_Key INT
    )
    ALTER TABLE Maps.Products_Sales
      ADD CONSTRAINT PK_Products_Sales
      PRIMARY KEY NONCLUSTERED(Map_ID)
    ALTER TABLE Maps.Products_Sales
      ADD CONSTRAINT UK_SKU
      UNIQUE CLUSTERED (SKU, ValidFrom)
    ALTER TABLE Maps.Products_Sales
      ADD CONSTRAINT FK_Products_Sales_Products
      FOREIGN KEY (Product_Key)
      REFERENCES Entities.Products (Product_Key)

    The data model created now looks like this:

    image

    Some things to notice about the data model so far:

    • The Product_Key is not unique in the map tables, this will come in handy when we handle multi keys
    • The source key (EAN or SKU) is not unique in the map either. We will need this to handle zombie and dolly keys
    • We are using a secondary key constraint to enforce the always true statement: “a source key cannot have TWO different meanings at the same point in time”. If it could, it would not be a key, and we have a new problem that cannot be solved by the mapping system.

    In part 2 of this blog series, I will look at how we will represent key errors in the maps. But for now, let us have a look at how we get started with populating these tables.

    Initial Population of the Maps and Entities – A Foray into Master Data

    The map tables will evolve as we learn more about our sources, but to get started with a prototype, we need an initial population of them. Our first question to the business users is this:

    “Given that I have two different sources of product keys, which one is the master source?”

    If you are in typical large organization, the answer will be “I don’t know”. You have a quest to go on, but don’t spend too much time on it, as you will see it wont matter much. At this point, you make an arbitrary assumption that one of the systems is the master source.

    Let us say you chose sales to be the master source (it is after all a German ERP system). You can now do an initial population of the Entities.Products like this:

    INSERT INTO Entities.Products (SKU)
    SELECT DISTINCT SKU FROM Staging.Products_Sales

    This generates the values for Product_Key that you need. You can now initialize Maps.Products_Sales like this:

    INSERT INTO Maps.Products_Sales (SKU, Product_Key
      , IsCurrent, IsDeleted, ValidFrom, ValidTo)
    SELECT S.SKU, P.Product_Key
      , 1, 0, GETDATE(), ‘99990101’
    FROM Staging.Products_Sales S
    JOIN Entities.Products P ON P.SKU = S.SKU

    Of course, none of this made the problem go away, you still don’t know how to map SKU to EAN. If your quest for master data was successful, about as likely as finding the holy grail, then you should have no problem supplying your warehouse with a table like this (which you will temporarily create for the prototype):

    CREATE TABLE Staging.EAN_SKU (
      EAN DECIMAL(17,0)
      , SKU INT
    )

    If you did not find the holy grail of master data, someone is in for some long hours in Excel, creating a table like the above. You may be lucky that the German ERP system contains both the EAN and SKU column, in which case a quick SELECT DISTINCT can do the initial population of this table. In either case, you will most likely find that the below query is very handy to find missing mappings during prototyping:

    SELECT DISTINCT
      COALESCE(I.EAN, ES.EAN) AS EAN
      , COALESCE(S.SKU, ES.SKU) AS SKU
      , ‘Unmatched, please fill in’ AS ActionForExcelPerson
    FROM Staging.EAN_SKU ES
    FULL OUTER JOIN Staging.Products_Sales S
      ON ES.SKU = S.SKU
    FULL OUTER JOIN Staging.Products_Inventory I
      ON ES.EAN = I.EAN
    WHERE I.EAN IS NULL OR S.SKU IS NULL

    For now, don’t worry about silly key errors in the sources, we shall correct them later. You don’t need all the data mapped in the Staging.EAN_SKU table, just get a decent start of it. You can now populate the Maps.Products_Inventory like this:

    INSERT INTO Maps.Products_Inventory (EAN, Product_Key
      , IsCurrent, IsDeleted, ValidFrom, ValidTo)
    SELECT I.EAN, S.Product_Key
      , 1, 0, GETDATE(), ‘99990101’
    FROM Staging.Products_Inventory I
    JOIN Staging.EAN_SKU ES ON I.EAN = ES.EAN
    JOIN Maps.Products_Sales S ON ES.SKU = S.SKU

     

    Notice that the above process can be repeated for multiple source systems. For every new system you add, you only need to translate the keys to ANY one of the systems you already know. You are conquering the source systems one at a time, starting with one map table and gradually working your way towards a more complete picture of the world.

    Things you may have noticed now:

    • There might be keys in inventory that don’t exist in sales. They may even be valid entities, but we have not yet added them to the Entities.Products table
    • We still haven’t done anything about key errors yet So far, we are just preparing some data for an initial, prototype load (and incidentally, building the requirement spec for a master data system in the process).

      Summary – Part 1

      In this blog, I have introduced the map and entity tables as fundamental part of a good data model for data warehousing. The entity tables are there to provide key stability in the large tables and be the conformed source of reference data (dimensions if you will).

      The map tables described here will be used to correct poor source keys and turn them into good entity keys.

    I have also described a method for initial population of the map and entity tables. This initial populating is used to drive new questions to the business users and to explore the quality of sources.

    In a near-perfect world, the map tables can be prepopulated by a master data management system. In the completely perfect world, sources would all agree on the same keys and the keys would have no pathologies. But we do no live in the ideal world, not even close. So, in the next part of this blog post I shall explore how we can use the maps to correct data errors.

    References:

    Categories: Data Warehouse, Modeling Tags: , ,

    An Overview of Source Key Pathologies

    2011-07-30 8 comments

    I previously made the point that source systems cannot be trusted to generate good keys. In this post, I will explore the pollution you may encounter as you dip your feet into the crude oil leaked ocean of data quality.

    Let us look at what can go wrong with keys:

    Composite Keys

    This is perhaps the most common pathology. It occurs when a programmer believes it is better to use a composite key than a single column key. This is rarely motivated by thoughts of sharding or of worries that the 64 bit integer space may run out. Normally, it is driven by nothing but laziness.

    Composite keys have several problems:

    • They are difficult to use for junior SQL coders, especially in join conditions and IN / NOT IN clauses in filters. Wrong usage may lead to wrong results that go unnoticed.
    • They typically perform much worse than single column keys. We care about this in a data warehouse.
    • It is difficult to create good statistics on composite keys to serve queries well. This can lead to poor query plans and unpredictable performance

    There ARE cases where composite keys make sense in a source system, but given the trouble they cause, the burden of proof rests on the designer of the composite key to show why it is superior to a single column key. As the ETL developer, you typically just have to accept the status quo.

    Duplicate Keys

    Mentioned for completeness. Even though good algorithms (both distributed and single machine) have existed for ensuring key uniqueness since the early 70’ies – there are still source system which have “keys” that are not real keys. If your key is not unique, it is not a key!

    However, these “keys” are more common that you might think, especially when the source system delivers extracts that are result of joins (some programmers may throw a DISTINCT in there for good measure)

    Zombie Keys

    Screen shot 2011-08-14 at 11.02.48 PMThese keys die and then rise from the grave (Christians may prefer another term). This is perhaps best illustrated with an example:

    Assume an inventory system that uses EAN as a key. When a product is in stock, the product table of the inventory system containss the EAN key. If a product goes out of stock, a cleanup job may run that removes all unused EAN keys. From the perspective of the ETL developers – it looks like the key just died and we are left with the question about what to do about it. Later, the product comes back in stock and the EAN number is inserted into the inventory products table again, the key has risen from the grave.

    Unfortunately, zombie keys can be hard to tell apart from…

    Dolly Keys

    Named after the cloned sheep, because sheep are stupid and clones are hard to tell apart, even though they are two different entities. Dolly keys happen when a source deletes a key and later re-uses that key for a different purpose and to name a different entity.

    Take Social Security numbers as an example. In Denmark, where I have a social security number, this key is in the format: YYMMDD-XXXX, with the first 6 digits being my birthday and the last four making the key unique (this by the way includes a checksum on the last digit). The quick reader may notice that this only leaves 10000 keys for people born on a given day (and not worrying about the shift of centuries). This is fast becoming a problem in Denmark, where we are now considering the use of letters in the keys, or re-using keys. This amounts to a lot of trouble for systems storing social security numbers. When it comes to keys, just use a value that is big enough to hold all instances – the 32 or 64 bit integer space should suffice for all by the most extreme values. 32-bit values were good enough for running the Internet, they are good enough for you!

    Multi Keys

    It happens that a source system has more than one key for the same entity. This can be the result of historical corrections and may represent a valid source scenario in the source.

    For example, a source may have two rows for the same customer, because the duplicate entry was not discovered at the point of data entry. When the error is discovered, both keys are set to point to the same entity in the source. When we extract the data we are faced with correcting this error in the customer dimension.

    Summary

    In this post I have provided an overview of typical source system key pathologies. It now rests on my shoulders to describe how to handle these pathologies in the ETL code.

    Related Posts:

    Categories: Data Warehouse, Modeling Tags: , ,

    The Ascending Column Problem in Fact Tables –Part two: Stat Job!

    2011-07-07 2 comments

    In my last post I described a common problem with statistics in data warehouses. I also directed you to a trace flag that partially solves the problem. There is an issue with this trace flag: at the time of writing this blog, it does not work on a partitioned tables. Considering that large tables tend to be partitioned, this could be an issue.

    Recall from my last post that old statistics will not contain information about the newly inserted rows until stats are updated. Queries asking for data that is outside the bounds of the histogram will estimate 1 row and typically end up with a bad, serial plan. The trace flag works, or doesn’t, by trying to intelligently guess how many rows are such an “out of bound” range instead of assuming the value of 1.

    There is another approach that does not rely on the optimizer making guesses: You can intentionally deceive the optimizer, lie about the distribution. Recall that we want the “desired DW plan”. Alas, as with all telling of lies, this exposes you to a lot of risk.

    Disclaimer: What I am about to show you is a: “Don’t do this at home” trick. We are going to see some features in SQL Server used in a ways they are not intended for. And we are going to take control of the optimizer in ways it is not designed to handle. If you go down this path, you are on your own! Don’t call support with your issues. Just because I showed it to you, doesn’t mean you can put this in production.

    Consider this example an exploration and education of the internals of SQL Server. And think of in the context of what you might consider asking for, next time you file an MSConnect item.

    On the issue of MSConnect: If you care about big (any TB sized) DW workloads: please, please file and vote on items that make your working day hard. It helps if you describe which company you work for and what the impact is on your business of not having the feature or “fix”, I know this is not always possible. I know it takes times, and I appreciate you are busy, but Microsoft does listen, and they listen more closely when you can make it clear why this feature is needed and how it hurts you (I call this the “blood on the table” approach). If your favourite Connect item is about DW functionality, shoot me a mail (include the ConnectID please) to let me know. I may well put up a big fight for items that align with my mission in MS. Please keep in mind: I don’t care about client tools at all, I fight for server side features. We have a lot of people pushing on client side features and they are very important too, I have just chosen to specialize.

    And with that said, let us dig in:

    Recall that our issue was that the histogram lacked rows/buckets containing information about a range of data. The histogram in SQL server has up to 200 rows, but even if we could add a single row to the histogram on every insert, we would soon run out on a large table.

    But how about this idea: Fake a histogram were every year we ever want to store (say 100 of them) contains a lot of data. In fact, make sure that every date range or value we select contain a tremendous amount of rows. One way to create such a histogram is to grow the fact table very large and then capture the stats and make it smaller again. If your fact table is not born with the right proportions, growing fat to get the right histogram proportions and then slimming down is not a very nice approach is it?

    Our Fact table histogram is not pretty as it is, but how about doing some precision surgery instead? For that, we need an implant. First, we create a table with the same index structure as Fact:

    CREATE TABLE HackStats
    (
      date_sk INT NOT NULL
      , payload CHAR(4000) NOT NULL DEFAULT ‘X’
    )

    CREATE CLUSTERED INDEX scan_me ON HackStats(date_sk)

    Now, let us prepare the implant. We need something that has these characteristics:

    1. Enough histogram divider rows to cover all possible data values we could ask for.
    2. A very large value for RANGE_ROWS and AVG_RANGE_ROWS between in each histogram row.
    3. A value of 1 for the DISTINCT_RANGE_ROWS in each histogram row.
      Ad 1) Make sure that we do the implant once, and that I lasts for the lifetime Fact.
      Ad 2) Make sure that there is sufficient size and elasticity for any range of dates we might touch.

    More about 3) later…

    With a little reverse engineering (no, I did not look at the source), we can create a histogram that match our requirement. It does not take a lot of rows in HackStats. Here we go:

    SET @y = 1950
    WHILE @y < 2050 BEGIN
      /* make the stats dense (low distinct) within the year */
     SET @r = 0
     WHILE @r < 100 BEGIN
    /* Make sure the equal rows in the histogram are never hit using invalid
        dates  */
        INSERT HackStats (date_sk) VALUES (@y * 10000)
        INSERT HackStats (date_sk) VALUES (@y * 10000 + 5000)
        SET @r = @r + 1
      END
      SET @y = @y + 1
    END

    You may have been wondering why I added a payload column to HackStats. For small tables (by page size), SQL Server will do a full sample, even if we ask for a small percent. And the engine is smart enough to know that is has seen all pages. A fully sampled table cannot be hacked and made artificially pretty – in ways you will soon observe. The HackStats table has to be small, but large enough (in our case 20K pages) that all rows will not be sampled. And that is why we had to to pad the rows a bit.

    We can now hack like this:

     /* prepare the implant */

    UPDATE STATISTICS HackStats WITH ROWCOUNT = 199000000
    UPDATE STATISTICS HackStats WITH SAMPLE 99 PERCENT

    The first statement above “fakes” a large table by telling SQL Server that there are a LOT of rows in the table. The second statement samples the distribution of those rows (which we have cunningly crafted to have just the right dimensions, so we need to sample almost all of them).

    Let us check the quality of the implant:

    DBCC SHOW_STATISTICS (HackStats, scan_me)
    WITH STAT_HEADER, HISTOGRAM

    Output (your results may vary a little, depending on which rows got sampled, rerun the sample if you don’t like the output)

    image

    And:

    image

    Notice the neat distribution. Any query asking for a year range will get a high estimate of the outcome of scanning the table. SQL Server is smart enough to boost the row estimates based on how many total rows it believes are in the table.

    And now, for the dirty little trick (Brent Ozar, if you are reading this, just don’t say it Smile ):

    Let us transplant the faked, big stats to the ugly Fact table. First, take the implant:

    DBCC SHOW_STATISTICS (HackStats, scan_me)
    WITH STATS_STREAM

    This gives an output somewhat like this (your result will vary):

      0x01000000010000000000000000000000B1AB0052000000….

    We don’t really have to worry what is in there, it is some binary representation of the stats. The feature to read the binary stats was put into SQL to allow database clones – and we are abusing it now. But hey, we are in the business of precision surgery: Copy the bit values to the clipboard, past and execute:

    UPDATE STATISTICS Fact scan_me WITH
    STATS_STREAM = 0x01000000010000000000000000000000B1AB0052000000….

    And a finishing flourish, to make it fact table look really big:

    /* Make the table expensive to access in the future */

    UPDATE STATISTICS Fact scan_me WITH PAGECOUNT = 200000000

    Validate the result:

    /* Check the patient */

    DBCC SHOW_STATISTICS (Fact, scan_me)
    WITH HISTOGRAM

    (output similar to HackStats)

    And let us check our problem query again:

    DBCC FREEPROCCACHE
    SELECT C.shoe_size, SUM(Measure)
    FROM Fact F
    JOIN DimCustomer C ON F.customer_sk = C.customer_sk
    WHERE F.date_sk = 20010111
      AND C.shoe_size = 43
    GROUP BY C.shoe_size

    Eureka: the “Desired DW Plan”:

    image

     

    Notice something interesting: the Estimated Operator Cost of access is about 736. That is a REALLY expensive access to the table. Just the way we like it actually, wouldn’t want the optimizer to get any ideas that this table is cheap to access. Our little trick with hacking the page size gave our Fact patient quite a bit of confidence.

    I deferred talking about why we need a low number of distinct rows in the histogram. For now, I leave this as an exercise to the reader. Hint: it is related to the bitmap filters and the way Fact is accessed.

    And remember folks, you do this at your own risk, you are on your own and in unsupported land here… Don’t come crying to support if you break something. Ask for a connect item instead.

    The Ascending Key Problem in Fact Tables– Part one: Pain!

    2011-07-01 3 comments

    Time for another deep dive into SQL Server. In the last couple of days, I have been having coffee with Joe Chang in the afternoon, who some of you may know. He is visiting Sweden a lot by the way, so shoot him an email if you are interested in having him present. Joe and I ended up talking about an interesting problem in data warehousing: the ascending key problem. I think it is best illustrated by an example.

    Let us say you have a pure star schema (sorry, Inmon people), with a large fact table. If you are following the guidance in the Top 10 Best Practices from SQLCAT (bullets 1,2 and 9), you typically partition the fact table by date, and you  cluster on date too. On of the reason you do this, is that it helps you do fast range scans.

    Let us try an example star schema that follows this guidance:

    CREATE PARTITION FUNCTION pf_daily(INT) AS RANGE RIGHT FOR VALUES
    (20000101, 20000102
    , 20000103, 20000104
    , 20000105, 20000106
    , 20000107, 20000108
    , 20000109, 20000110
    , 20000111, 20000112
    , 20000113, 20000114
    )
    CREATE PARTITION SCHEME ps_daily AS PARTITION pf_daily ALL TO ([PRIMARY])

    CREATE TABLE Fact
    (
      date_sk INT NOT NULL
      , customer_sk INT NOT NULL
      , payload CHAR(80) NOT NULL DEFAULT REPLICATE(‘X’, 80)
      , measure MONEY NULL
    )
    /* Turn off stats (as per typical DW) */
    EXEC sp_autostats ‘Fact’, ‘OFF’
    CREATE CLUSTERED INDEX scan_me ON Fact(date_sk) ON ps_daily(date_sk)

    CREATE TABLE DimCustomer
    (
      customer_sk INT NOT NULL
      , shoe_size INT NOT NULL
    )
    CREATE UNIQUE CLUSTERED INDEX CIX ON DimCustomer(customer_sk)

     

    If you have a very large fact table, you often don’t want to run stats in the middle of a query, so we have turned auto stats off. In this context, recall that stats in SQL Server are per table, not per partition – which can be a pretty expensive operation. We really want to control when stats are run for large tables.

    To generate some test data, let us fill up 10 partitions with 50MB each. For the customer dimension, we will use 100K rows:

    /* Generate test data */

    DECLARE @i INT
    DECLARE @customer_size INT
    SET @customer_size = 100000
    DECLARE @day_size_pages INT 
    DECLARE @day_size_rows INT
    SET @day_size_pages = 50001 / 8
    /* last sum are row width plus uniquefier */

    SET @day_size_rows = @day_size_pages * 8060 / (4 + 4 + 80 + 8) 

     

    /* Create one day of data */
    CREATE TABLE OneDay
    (
      date_sk INT NOT NULL
      , customer_sk INT NOT NULL
      , payload CHAR(80) NOT NULL DEFAULT REPLICATE(‘X’, 80)
      , measure MONEY NULL
    )
    SET @i = 0
    WHILE @i < @day_size_rows BEGIN
      INSERT OneDay (date_sk, customer_sk, measure)
     
    VALUES (’20000101′, @customer_size * RAND(), RAND() * 1000)
      SET @i = @i + 1
    END

    /* create the customer dimension */
    SET @i = 0
    WHILE @i < @customer_size BEGIN
      INSERT DimCustomer (customer_sk, shoe_size)
      VALUES (@i, 36 + @i % 15)
      SET @i = @i + 1
    END

    /* Insert ten days of data in the fact */
    SET @i = 0
    WHILE @i < 10 BEGIN
      INSERT Fact (date_sk, customer_sk, measure)
      SELECT 20010101 + @i, customer_sk, measure
      FROM OneDay
      SET @i = @i + 1
    END

    Good, now we can get started on some testing. Let us first execute a typical, star schema query. But before that, we will update the statistics. To be nice to the query, we will even do a full scan.

    UPDATE STATISTICS Fact WITH FULLSCAN
    DBCC FREEPROCCACHE 

    /* An ordinary star-schema query arrives */
    SELECT C.shoe_size, SUM(Measure)
    FROM Fact F
    JOIN DimCustomer C ON F.customer_sk = C.customer_sk
    WHERE F.date_sk = 20010101
      AND C.shoe_size = 43
    GROUP BY C.shoe_size

    On my 2-core laptop and a hot buffer pool, this query runs in a little over 100ms, touching over 500K rows. Life is good. The query plan is what I like to refer to as “the desired DW plan”:

    image

    All the neat stuff in the SQL Server optimizer that we like for DW workloads:

    • Hash Join to dimension tables
    • Bitmap index pushdown to the fact table (BOL link)
    • The right join order (build hash on dimension, use fact table to probe)
    • Parallelism
    • Cluster index seek into nice, fat range of facts

    But warehouses are not always nice to us. During the night, some evil ETL developer comes along and loads a new day of data:

    /* Nightly ETL run adds another day of data */
    INSERT Fact (date_sk, customer_sk)
    SELECT 20010111, customer_sk FROM OneDay

    Alas, the ETL guy did not have time to run statistics – maybe he forgot. Now, let us run the same DW query again, but this time with the newly loaded data:

    DBCC FREEPROCCACHE
    SELECT C.shoe_size, SUM(Measure)
    FROM Fact F
    JOIN DimCustomer C ON F.customer_sk = C.customer_sk
    WHERE F.date_sk =
    20010111
      AND C.shoe_size = 43
    GROUP BY C.shoe_size

    over 2 seconds on my 2 core laptop, more than 20 times slower! This is the time when people typically complain that SQL server is broken and doesn’t scale. Instead, we adopt a less moaning stance and start a little bit of troubleshooting. As always, we begin with a look at the query plan:

    image

    Well, apart from the plan fitting the width of my blog, there is nothing good to say about it. What just happened?

    This is the part where we up the geek bar just a little: SQL Server gives you the ability to dig into the statistics like this:

    DBCC SHOW_STATISTICS (Fact, scan_me)
    WITH HISTOGRAM

    Output:

    image

    Oh no, the value 20010111 is not even in the histogram. The optimizer estimates zero rows coming out of the fact table. Zero rows are very cheap to access, so the overhead cost of all those nice, parallel and hash join optimizations are not deemed worth it. Of course, the problem here is that the histogram is lying, there are 500K rows in that date range! Loop joins are expensive in that case.

    That hurt, what now?… Well, I guess you just have to remember to update your statistics won’t you? And for those of you that can see a problems with the that (and some will) – start the commenting, go vote on Connect, and of course: stay tuned…

    … And by the way, one last last thing, until we meet again in the next post: Have a look at this: Ascending Keys and Auto Quick Corrected Statistics.

    Have a great weekend everyone, and consider updating your stats if the warehouse loads data on Sunday…

    Good keys, what are they like?

    2011-05-13 27 comments

    A central value add of data warehouses is their ability to restore the sanity that comes from using good keys. Taking a model-agnostic view of keys, they refer to “something” that is uniquely identifiable. Depending on what you are modeling, those “somethings” have different names, for example: entities, rows, tuples, cells, members, documents, attributes, object instances, and relations between the any of the latter. Because this article is about relational databases and because I don’t want to move up to the “logical” level with you (as I promised before), I will use the term: “row” as the “something” that a key refers to, and the term “column” as the data structure that implements the key.

    Moving on, the definition of a “good key” is a column that has the following properties:

    1. It forced to be unique
    2. It is small
    3. It is an integer
    4. Once assigned to a row, it never changes
    5. Even if deleted, it will never be re-used to refer to a new row
    6. It is a single column
    7. It is stupid
    8. It is not intended as being remembered by users

    Unique

    It follows from the definition of a key that it is unique. But documentation that this is so, does not make it a key: An automated mechanism must be put in place to enforces uniqueness. This could be a constraint or a unique index.

    Small

    A key should be small, a few bytes at most. This makes it efficient to retrieve rows and to join on the key.

    Integer

    A key should be an integer, always. I have previously blogged about why this is a good idea.

    Never change

    You should never update key column or change the column that is the key. First of all, it creates a mess of anything that depends on the key (which may or may not be enforced with referential integrity). Secondly, updates of key columns can have bad performance implications since the database has to do a lot of book keeping to keep the table consistent

    Never re-use

    Keys should never the re-used – even if the row they identify has been deleted. This turns out to be a crucial requirement in data warehouses, because we may not always be able to track all the dependencies between tables in a automated manner. This means that the warehouse or user reports may be in a state of partial inconsistency, and this is ok and normal. However, if rows exist that refer to an old version of the key, those rows may magically reappear in user reports and end up pointing to the wrong values.

    It follows that key columns should be small (as in 2), but still large enough to honour this requirement.

    Single column

    It is bad practice to use multi-column keys. First, it is difficult to express foreign key constraints on them. Second, it makes the SQL required for accessing the key (and joining on it) more complicated and error prone to users (and programmers). Third, multi column keys makes it harder to write automated code that accesses the model (including code that auto loads the model). Fourth, it is more difficult to express histograms on multi-column keys, which makes the query optimizer more prone to generating bad plans.

    I even apply this principle to tables that implement relationships between other tables. This is best illustrated with an example. I recommend against this design:

    image

     

    In the above example, the relationship table implements a unique relation between two tables, and the key is used to enforce the uniqueness of the relation (“a customer can only own a product once”). But this is not flexible! What if I open stores in multiple countries? Apple iTunes is a good example of such a case? Now, the customer may own the product in multiple countries like this:

    image

    This wreaks havoc with any code that refers to the relationship itself (since the key has now changed, violating 4) above. Instead, I recommend this design that does not suffer from the problem:

    image

    If you wish to enforce the uniqueness of the relationship (a business rule, not a key) then you can declare a unique index on customer_id, product_id.

    Stupid

    A key should not contain information about the row it refers to. This would be a violation of the rules of database normalization. While may not agree with the rules of normalization, consider this: If the key contains information about the row it refers to – “smart” programmers may get the idea that code could be written that would use the value of the key to implement business logic. But what happens if someone changes the row values and the key no longer reflects the other columns that is a correlated within the row? Then the code is wrong, and you will have programmers begging/ordering you to change the key (violating the principle of “never change”). Any dependency on specific key values in business logic is generally bad news for your ability to implement changes. Keep the keys to yourself and don’t tempt programmers to get smart about them.

    Teflon

    You generally don’t want users remembering specific primary keys in the warehouse. This is the teflon principle: “nothing sticks to it”. As we shall see later in this series, we reserve the right to make changes to the way keys are linked up to fact tables if we change our mind about history tracking. You don’t want reports to refer to your keys by their value when this happens. Imagine a data warehouse that has 50.000 reports referring to the fact tables (this is not unusual) – if any of those reports refer to a key in a fact table, and you make changes to history tracking, you will have users screaming.

    It is worth pointing out that a table may have secondary keys which users refer to in queries. This is perfectly fine, you may even declare the secondary key as unique too. Ideally, users should be using the primary  key on the table only for joins, and fetch unique rows by some known, secondary key. For example, locating an invoice by the source system invoice ID may use the source system key directly, but you will maintain a different primary key.

    The one exception to the teflon principle, is keys for date and time. It is a reasonable assumption that “time does not change” and you may get away with generating these keys in the integer format YYYYMMDD and HHMMSS. Incidentally, this makes queries on date and time easier to write and the data easier to partition.

    References:

    Categories: Data Warehouse, Modeling, OLTP Tags: ,

    Reading Material

    2011-05-09 1 comment

    Today, I was asked a good question: "What material do you read to stay up to date on SQL Server, computers and data modeling" (thanks Chris). Let me try to answer this:

    Twitter

    I am not a big fan of Twitter – I find the noise to signal ratio horrible. But I just don’t need to stay up to date with the bleeding edge of the chatter on the net

    Blogs

    I AM a big fan of following individuals that I respect for their opinions. I read blogs a lot. It is my observation that when people take the time to sit down and write a longer text (i.e. a blog, not a Tweet) , their ideas are often expressed with more clarity. I highly recommend the links to the right. Note that only 10 are shown at a time, so they will occasionally change.

    I also track a variety of other blog sources that I may not always keep up to date on this web page – so if you are not on my list here, don’t take that the wrong way. If you are interested in sharing good reading material with me, then feel free to add me for "following" in Google Reader (which is my blog aggregator). My Gmail account is thomas<at>kejser.org.

    Some major blogs that give me daily ideas on the subjects I blog about myself:

    Books, Articles and whitepapers

    I have a rather extensive library at home, though I am currently in the process of buying Kindle editions of most books on my shelves.

    Here is a list of books I would consider the “bare minimum” to acquaint yourself with as a DBA or architect to be on the Microsoft platform.

    Basics of computers and databases:

    Before you can work with computers, operating systems and databases, I think it is crucial to understand how they are built.

    • Mark Russinovich: “Windows Internals” – THE reference on how windows works
    • Donald Knuth: “The Art of Computer Programming” (AKA: TAO)– THE reference on algorithms and optimizing code. You will not have to read this from one end to the other, but browse though and try to understand the key messages Knuth has to tell you.
    • Andrew S. Tannenbaum: “Modern Operating System” and “Structured Computer Organization” – Some of the greatest books on operating system design and behavior ever written. Though I am not in the “micro kernel camp” – I greatly admire Tannenbaum for being such a great educator.
    • Kalen Delaney: “SQL Server Internals” – Great reference guide on how SQL Server works
    • Ken Henderson: “The Guru’s Guide to SQL Server architecture and Internals” – By the late Ken Henderson, this book is IMHO still the best book ever written about SQL Server. It deals with old versions, but Ken’s technique for learning is simply superb and is just as relevant today as it was back when it was written
    • Dan Tow: “SQL Tuning” – A surprisingly unknown book in the SQL Server community. Great resource on how to find the best query plan. I attended Dan’s course once, and this guy knows what he is talking about.
    • I would recommend reading about the latest servers from the big hardware vendors: HP, Dell, IBM, Fujitsu etc. Try to understand the design tradeoffs they make
    • Intel’s and AMD’s chip roadmaps are crucial to understand too (and ARM if you are into that)
    • Follow the storage vendors websites, and try to track the NAND revolution as it is happening
    • Read the latest TPC full disclosure reports for your favourite database engine and hardware vendor. Try to understand the thinking behind the configuration of the database and hardware.

    Becoming a great developer or architect:

    My personal belief is that if you don’t have a full overview of everything from the storage all the way up the stack to organization of code modules and networking (you can skip the GUI), you are not yet ready to become an architect. Thus, the first step is to learn how to code and tune well, and get familiar with the intricate details of systems. There is really no shortcut here. But once you know how to code, how to tune and how to build a good data model, architecture discussion become trivial to you.

    • Steve McConnel: “Code Complete” – A really good resource on writing good code and remaining passionate about your job
    • Andrew Hunt & Dave Thomas: “The Pragmatic Programmer” – A great view on the mindset required to become a good programmer/architect
    • Frederick P. Brooks: “The Mythical Man Month” – wise words about organising large software projects. As relevant today as it was back then, we didn’t learn a thing
    • Christopher Duncan: “The Career Programmer – Guerilla tactics for an imperfect world” – I think this one fits the world I observe better than Steve McConnel’s: “Rapid Development”. But both are really good. Duncan is probably a better primer for the programmer to be, Steve has a more mature outlook.

    I have not yet found really good resources on data modeling. Which is one of the reasons I started this blog.

       

    Intermezzo–Data Modeling

    2011-05-03 5 comments

    From the very nice WordPress Dashboard, I can see that I now have over 100 regular readers. Comments are flowing in too.

    Thanks to everyone who is listening, it is my hope that this blog can be a great place for debates about data modeling and high scale performance tuning. I was surprised that is was hard to find concrete guidance about data modeling for the warehouse on the web. Perhaps I am missing something out there?

    My branch predictor tells me that is I time to beg for your patience with me in advance, for three reasons:

    1) I have a very combative personality – I love arguing with people. I also have a super low tolerance for bullshit/bollocks (depending on your dialect of English). My reason for behaving like this is to discover the truth, which am I not only convinced exists (I am very far from a relativist), I also believe it can only be found through strong testing of hypothesis, rational inquiry and expressing yourself with clarity. This means that I may disagree vehemently with your opinion, but please don’t take it personal – disrespect and disagreement are two different things. I can almost guarantee that I will stepping, battle axe swinging, into the religious battle of Inmon vs. Kimball during this argument.

    2) English is not my native language. I make spelling mistakes and sometimes my typing goes out of sync with my thinking. I will read over, and make minor corrections, to my own blog a few times AFTER publishing (having had a cup of coffee in the intermediate time). I prefer to get something out there fast as a brain dump, to get the debate going. That being said, I understand the need for semi-permanent content: If I make a change in the argument or a major update, I will point out the changes with a bit of manual history tracking.

    3) My Data Modeling argument will be rather long. There are a lot of assumptions that I need to fix in order to proceed with the argument – and I want to establish those first before moving on to the final conclusions. I want to take away most of the “it depends” that are so prevalent in this area. There is also a lot of common ground between Kimball and Inmon that I want to acknowledge.

    To give you an idea of the trajectory of the argument, it goes something like this:

    • Integer keys are right choice 
    • Keys in a data warehouses must have certain properties
    • Source Systems cannot be relied on to provide good keys – ever!
    • Surrogate Keys are not good keys!
    • It is not worth modeling the source system
    • Smart reloads of the warehouse (build to scale and to restrict impact) is preferable to modeling for flexibility
    • Business users should nearly always be presented with a pure star schema (not a snow flake)
    • Normalization techniques should be severely restricted in the relational warehouse
    • Update statements are NOT (that) expensive if done right!
    • Tracking history must be done with care and targeted to the end user – modeling for flexibility here leads to disaster
    • Guidance on how to handle changing business requirements in history tracking
    • Dimensional Modeling (with some modifications to Kimball’s model) is the right model choice for the relational data warehouse
    • Surrogate keys should still be used for certain scenarios (contradicting part of my own argument above Smile )
    • With proper data modeling, you can build a very large data warehouses that is both flexible, high performance and cheap.
      You can think of each bullet above mapping approximately to a blog entry – so there I a bit of way to go. You may already now disagree with me, so please chime in as I move the argument forward.

     

    Changes since publication:

    • Added bullet about my bad English
    • Added information about common ground between Inmon/Kimball
    • 2011-07-30 – Added links to new entries
    Categories: Data Warehouse, Modeling Tags:
    Follow

    Get every new post delivered to your Inbox.

    Join 350 other followers