Boxing Ring

The Data Vault vs. Kimball

Here we go again, the discussion about the claimed benefits of the Data Vault. Thomas Christensen has written some great blog posts about his take on the Vault method. Dan Linstedt has been commenting.

The discussions are a good read to track:

Apologies in advance for my spelling errors in the comments, the posts are written while travelling.

As with most complex subjects, it is often hard to have people state with clarity what EXACTLY their claim is and what their supporting arguments are. It seems that the Vault is no exception – I hope the discussions lead somewhere this time.

For your reference, here are some of the posts I have previous done that solve the postulated problems with the Kimball model.

Notice that there are some very interesting claims being made about normalization creating more load and query parallelism in the comments on Thomas Christensen’s Blog by Sanjay. I personally look forward to hearing the argument for that.

  5Comments

  1. Jose Borja   •  

    I followed the threads with interest as I have been asked to participate in a project where the customer is interested in applying DV modeling. I am an Inmon practitioner and have done several multiterabyte DW implementations successfully so this latest request to use DV modeling perked my interest.

    After several hours of reading I came to the conclusion that the original premise for DV is faulty. You also state you still do not know what problem Linstedt is attempting to fix with DV and I think the answer is in http://www.tdan.com/view-articles/5054/ article where he states

    “One particularly thorny problem is evident when a date-time stamp is placed into the primary key of a parent table (See Figure 2 below). This is necessary in order to represent changes to detail data over time.

    The problem is scalability and flexibility. If an additional parent table is added, the change is forced to cascade down through all subordinate table structures. Also, when a new row is inserted with an existing parent key (the only field to change is the date-time stamp) all child rows must be reassigned to the new parent key. This cascading effect has a tremendous impact on the processes and the data model-the larger the model the greater the impact. This makes it difficult (if not impossible) to extend and maintain an enterprise-wide data model. The architecture and design suffer as a result.”

    That entire premise is completely faulty. There is absolutely no need to put a TS in a PK of any entity. Without the TS in the PK the assumed problem is not there anymore. Time variance has been done successfully in DW using BEGIN and END timestamp pairs per row leaving the PRIMARY KEY intact. I have done several DWs that way and I just deployed a brand new DW using Teradata 13.10 Temporal features to handle Time Variance auto-magically.

    I can take the ERD of my latest DW and easily convert it into a DV model by following 3 simple rules to derive hubs, links, and satellites. In that process I would make the model incomprehensible to the community by doubling or tripling the number of artifacts and removing its business appearance for a technical appearance or table types prefixed with H, L, or S. The DV would also be far more expensive to access and maintain per TB. The community agrees the DV is practically unusable for reporting/adhoc access and is better used as a source for Dimensional Models.

    To me it looks like DV parades as a technical solution looking for a problem that does not exist. I can develop an ERD in 3NF and leave the PK alone. Row Uniqueness can be defined using a USI if needed with the PK+BEGIN TS.

    I am glad I fond the blogs where the discussions were held. I learned far more form the discussion than from the DV guides. I fail to see the value of DV and will need someone to convince me that the TS needs to be in the PK definition in spite of the several DWs I have deployed without doing just that.

    • Thomas Kejser   •     Author

      Eloquently put Jose.

      I would add that begin/end dates DO represent a particularly nasty problem for query optimisers (I have blogged about this here: http://kejser.org/why-date-between-fromdate-and-todate-is-a-dangerous-join-criteria/). Unfortunately, the DV model makes this WORSE; not better.

      Being a fellow Multi-TB DW building, I am curious to know what you mean when you say that you are an “Inmon practitioner”. Does this mean you subscribe to a particular process or a data modelling technique too?

  2. Dave Ballantyne   •  

    Thanks for accumulating these, and would I be correct in assuming that your viewpoint hasn’t changed since these discussions took place. I too have been frustrated with the DV and the religious arguments that are thrown up, you have to believe to see the point in believing. These seems to be even more compounded by the “extra secret sauce” ( my words) in the DV model 2 http://danlinstedt.com/datavaultcat/data-vault-2-0-being-announced/ It may also be interesting to note that the DV model allows for a “point in time” table http://www.tdan.com/view-articles/5067/ that de-normalises away the BETWEEN problem.

    • Thomas Kejser   •     Author

      Dave, my view basically has not changed. I think that DV is a dangerous modelling technique because it pretty consistently gets you into trouble with relational database and their optimisers. It is also much hard to do than it needs to be – there really isn’t (much) wrong with the dimensional model. What troubles me a lot is that is very hard to pinpoint exactly what problem it is that Dan claims to solve.

      I am aware of the de-normalised “solution” to the BETWEEN problem. But one really has to ask: If that is necessary (and it has a significant storage overhead) why did you bother doing it like this in the first place?

  3. Pingback: WMP Blog » Data Warehouse Architecture: Inmon CIF, Kimball Dimensional or Linstedt Data Vault?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">