The Big Picture – EDW/DW architecture

Now that the cat is out of the bag on the Kimball forum, I figured it would be a good idea to present the full architecture that I will be arguing for. I was hoping to build up to it slowly, by establishing each premise on its own before moving on to the conclusion.

But perhaps it is better to start from the conclusion and then work my way down to the premises and show each one in turn.

 

Here is the architecture and data flow that I will be arguing should form the basis of a healthy, performant, flexible and solid foundation for an enterprise:

[Diagram: proposed EDW architecture and data flow]

 

Method, at a High Level

The basic method for building this is:

  • Model your data as Kimball would have done it, but ONLY model Type 1 history in the dimensions. The dimensions form the entities (but not the entity history)
  • If you believe snowflakes are easier to maintain, go ahead and normalize the entities, but do so using standard OLTP methods (you are allowed to have one-to-many keys)
  • Use the maps as described earlier to maintain key invariance
  • For the attributes you want to history track, add an entity history table and include only the attributes you want to track (if not, you get into some interesting ETL issues later)
  • If users want fact tables with Type 2 history, either create a mart for them (using the join technique I will describe), materialize a view, or allow them directly on the EDW using joins that are slightly more complicated than a traditional star schema (a sketch of that join follows this list)
  • Users who only need Type 1 history can simply use the EDW directly
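
To make the split between Type 1 and Type 2 use concrete, here is a minimal sketch in Python/pandas. All table and column names (dim_customer, customer_history, fact_sales, customer_key, valid_from) are hypothetical stand-ins; the point is only that a Type 1 query is a plain star-schema join against the dimension, while a Type 2 view is recovered by picking, for each fact row, the tracked attribute version that was valid at transaction time.

```python
import pandas as pd

# Type 1 dimension: one row per entity, current attribute values, stable key.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "name": ["Alice", "Bob"],        # not history tracked
    "segment": ["Gold", "Silver"],   # current value only
})

# Entity history table: only the attributes we chose to track, with validity dates.
customer_history = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "valid_from": pd.to_datetime(["2010-01-01", "2011-06-01", "2010-01-01"]),
    "segment": ["Silver", "Gold", "Silver"],
})

# Fact table at the lowest grain, keyed on the stable entity key.
fact_sales = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "tx_date": pd.to_datetime(["2011-01-15", "2011-07-01", "2011-07-01"]),
    "amount": [100.0, 200.0, 50.0],
})

# Type 1 view: a plain star-schema join against the dimension.
type1 = fact_sales.merge(dim_customer, on="customer_key")

# Type 2 view: attach the segment that was valid at transaction time
# (the "slightly more complicated" join).
type2 = pd.merge_asof(
    fact_sales.sort_values("tx_date"),
    customer_history.sort_values("valid_from"),
    left_on="tx_date", right_on="valid_from",
    by="customer_key", direction="backward",
)
print(type2[["customer_key", "tx_date", "amount", "segment"]])
```

The same point-in-time lookup can of course be written in SQL as a join on valid_from <= tx_date (taking the latest qualifying history row), or materialized into a mart if users should never see the extra join.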
Major choices

There are some major Kimball choices that must be made to make this model work:

  • The grain is always the lowest possible level. This means that the facts in the EDW (though maybe not in the mart, if you use one) are copies of the source transactions. This gets rid of any complaints about aggregating the facts and losing data
  • The world is dimensional, and we are letting users loose on the EDW itself
  • We allow either normalized or pure star schema dimensions/entities. Our choice here will depend on the target database engine and the required performance
  • We are using maps to keep the entity keys stable. This minimizes the need to reload facts (a sketch of such a map follows this list)
  • We are assuming that it is easy to reload entities/dimensions and change their structure (as long as the key remains stable). If you are having trouble handling a billion-row customer table, ask questions on this blog as we move forward
  • Facts contain all the data of the source. If there is “garbage” in the source, we will split it out into a vertically partitioned table off the fact and put it on slow (i.e. cheap) storage
  • Facts are fully de-normalized – there are no big table/big table joins here
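
As an illustration of the key maps mentioned above (the full mechanics were described earlier), here is a rough sketch of the invariant they enforce. The names (KeyMap, get_or_assign, the CRM/ERP natural keys) are made up, and in practice the map would be a persisted table maintained by the ETL, but the idea is the same: a natural key gets an entity key exactly once, so dimensions can be reloaded or restructured without ever changing the keys already stamped on the facts.

```python
from itertools import count

class KeyMap:
    """Map from (source system, natural key) to a stable EDW entity key."""

    def __init__(self):
        self._map = {}             # in production: a persisted map table
        self._next_key = count(1)  # in production: a sequence/identity

    def get_or_assign(self, source_system: str, natural_key: str) -> int:
        """Assign a key the first time a natural key is seen; reuse it forever after."""
        k = (source_system, natural_key)
        if k not in self._map:
            self._map[k] = next(self._next_key)
        return self._map[k]

# Dimension (re)load: every natural key is resolved through the map, so a full
# reload of the dimension reproduces exactly the same entity keys.
key_map = KeyMap()
source_rows = [("CRM", "CUST-0042"), ("CRM", "CUST-0099"), ("ERP", "10077")]
dim_rows = [
    {"customer_key": key_map.get_or_assign(sys, nk), "natural_key": nk, "source": sys}
    for sys, nk in source_rows
]
print(dim_rows)  # stable keys: facts pointing at them never need to be reloaded
```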

What happened to my ODS/Vault/Staging/Archive etc.?

One thing to notice is what is NOT in this picture: there is no ODS, Vault or 3NF database. The reason is that I am of the opinion that you don’t need one, at least not one that has a data model you care about. I previously said that I would make the point that you should not bother modeling the source; it is a waste of time. This is an argument that needs a lot more backing; more on this later. But a few observations:

I acknowledge that there are auditing requirements which may drive you towards a “zero information loss” model, and that the EDW is often tasked with solving this problem. Note that zero information loss is achievable in the model above, with a few tweaks to the Kimball model. Here is how:

One of the things we can leverage today is the new HADOOP-type storage systems (like HIVE or Cassandra). These systems finally deliver on the promise of truly cheap storage: no big SAN storage racks involved, just cheap hardware all the way down to the disks. They can run massive archives on SATA spindles, maintain multiple copies of the data and run ad-hoc analysis with high block size I/O (giving you reasonable speed, even on SATA) at record low cost/TB. They are not (yet) super fast at responding to ad-hoc queries at high user concurrency, typically taking many minutes or hours to respond. But responding to ad-hoc queries fast and providing the “current business view of the truth” is what the EDW is all about.

In this architecture we use HADOOP both as storage for raw copies of the source systems and as a source in itself (for example for web log data). HADOOP is the mother of all stovepipes, and it lets us bring back old data if required to do so, satisfying the auditors and derailing the argument about keeping too much data around just for this purpose. In other words: it leaves us free to model the EDW to contain only data that we know is needed to answer business problems, and to throw the rest away. We don’t want to understand the source or model it (except for the purpose of ETL’ing from it the exact structures we need). We just make sure we don’t lose the source, in case it cannot guarantee zero loss for us.
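
To illustrate the archive role HADOOP plays here, the sketch below lands every extract unmodified under a Hive-style, load-date-partitioned path. The local filesystem and the path layout (/data/raw/<source>/<table>/load_date=...) are stand-ins of my own choosing; on a real cluster the copy would land in HDFS or a HIVE external table, but the principle is the same: keep the source bytes exactly as delivered, partitioned by load date, and model nothing.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional

ARCHIVE_ROOT = Path("/data/raw")   # stand-in for the HDFS/HIVE warehouse root

def archive_extract(source_system: str, table: str, extract_file: Path,
                    load_date: Optional[date] = None) -> Path:
    """Copy a raw extract, byte for byte, into its load-date partition."""
    load_date = load_date or date.today()
    partition = (ARCHIVE_ROOT / source_system / table
                 / f"load_date={load_date.isoformat()}")
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / extract_file.name
    shutil.copy2(extract_file, target)   # no transformation, no modeling
    return target

# Example (hypothetical file):
# archive_extract("erp", "orders", Path("orders_20120101.csv"))
```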

If we have them, analysts are also free to hack away with ad-hoc queries on the raw source data in HADOOP, in the table format and model they are used to seeing in the source, without having to disturb the source or rely on some new vocabulary or over-normalized modeling technique.

Moving forward (or back) from here

I am fully aware that I make some grand claims, and of course, you should not believe me until you have seen the arguments, the design considerations, the safety measures, the methods, the raw numbers and calculations.

But this is a start for the debate and a framework for fleshing out these arguments.