Archive

Archive for the ‘Data Warehouse’ Category

A DW Venn Diagram

2012-01-28 Leave a comment

image

Categories: Data Warehouse Tags:

Don’t Become a One-trick Architect

2011-12-08 9 comments

imageWe are near the dawn of a new workload: BigData. While some people say that “it is always darkest just before the dawn”. I beg to differ: I think it is darkest just before it goes pitch black.  Have a cup of wakeup coffee, get your eyes adjusted to the new light, and to flying blind a bit, because the next couple of years are going to be really interesting.

In this post, I will be sharing my views on where we have been and a bit about where we are heading in the enterprise architecture space. I will say in advance that my opinions on BigData are just crystalizing, and it is most likely that I will be adjusting them and changing my mind.

Yet, I think it will be useful to go back in history and try to learn from it. Bear we me, as I reflect on the past and walk down memory lane.

The Dawn of Computing: Mainframes

If you were to look at the single most successful computing platform of ancient times (that would be before 1990), you are stuck with the choice between the Apple II, the C64 or the IBM S/360 mainframe. The first two are consumer devices, and I may get back to those in another blog post. Today, let us look at the heavy lifting server platforms, since we are after all going to talk about data.

imageUnder the guidance of Fred Brooks, IBM created one of the most durable, highly available and performing computing platforms in the history of mankind. Today, mainframes are challenged by other custom build supercomputers, x86/x64 scale-up platforms and massive scale out systems (ex: HADOOP). But even now, the good old S/360 still holds on to a very unique value proposition. No, it is not the fact that some of these machines almost never need to reboot. It is not the prophetic beauty of JCL (a job scheduler that “gets” parallelism) or the intricacies of COBOL or PL/I…

In fact, it is not the mainframe itself that gives it an edge, it is the idea of MIPS: Paying for compute cycles!

When you pay for compute cycles, every CPU instruction you use counts. The desire to write better code goes from being a question of aesthetics, to a question about making business sense (i.e. money). Good programmers, who can reduce MIPS count, can easily visualize their value proposition to business people, and justify extraordinary consulting fees.

As we shall see, it took the rest of us a long time to realize this.

The Early 90ies: Cataclysm

My programming career really took off in the late 80ies and early 90ies – before I got bitten by the database bug. I used to write C/C++, LISP, Perl, Assembler (various), ML and even a bit of Visual Basic (sorry!) back then. Pardon my nostalgia, but in those “old days” it was expected that you were fluent in more than one programming language.

There were some common themes back then.

First of all, we took pride in killing mainframes. We saw the old green/black terminals as an early, and failed, attempt at computing – dinosaurs that had to be killed off by the cataclysm of cheap, x86 compute power (or maybe RISK processors, though I never go around to using them). We embraced new programming languages and the IDE with open arms. We thought we succeeded: IBM entered a major crisis in the 80ies for the first time in their long and proud history. However, it could be argued that it was IBM as a company that failed, not the mainframe. There are still a lot of mainframes alive today, some of them have not been turned off since the 70ies and they run a large part of what we like to call civilisation. As a computer scientist, you have to tip your hat to that.

Another theme was a general sense of quick money. IBM had a lot of fat on their organization, and all that cost had to go somewhere: It ended up as MIPS charges to the customer. This made mainframes so expensive that it was easy to compete. It was the era of the “shallow port”. ERP systems running on ISAM style “databases” were ported 1-1 to relational databases on “decentral platforms” – aka: affordable machines. Back then, this was much cheaper than running on the mainframes and it required relatively little coding effort.

The results of shallow ports was code strewn with cursors. I suspect that a lot of our hatred towards cursors is from that time. People would do incredibly silly things with cursors, because that was the easy way to port mainframe code. Oracle supported this shallow port paradigm by improving the speed of cursors and introducing the notion of sequencers and Row ID, which allows even the database ignoramus to get decent, or should we say: “does not suck completely”, performance out of a shallow port. If you hook up a profiling tool to SAP or Axapta today, you can still see the leftovers from those times.

Late 90ies: All Abstractions are Leaky

Just as proper relational data models were beginning to take off, and we began realizing the benefits of cost based optimizers and set based operations, something happened that would throw us into a dark decade of computer science. As with all such things, it started well intentioned: A brilliant movement of programmers began to worry about encapsulation and maintainability of large code bases.  Object oriented (OO) programming truly took off, it was older than that, but it now had critical mass.

The KoolAid everyone drank was that garbage collection, high level (OO) languages, IDE support and platform independence (P code/Byte Code) would massively boost productivity and that we never had to worry about low level code, storage and hardware again. To some extend, this was true: we could now trains armies of cheap programmers and throw them at them at problem (I that ironic way history plays tricks on us, the slang term was: The Mongolian Horde Technique). We also had less need to worry about ugly memory allocations and pointers – in the garbage collector we trusted. Everything was a class, or an object, and the compilers better damn well keep up with that idea (they didn’t). The champion of them all: JAVA, suffered the most under this delusion. C/C++ programmers became an endangered species. Visual Basic programmers became almost well respected. And, to please people who would give up pointers and malloc, but not curly brackets, a brilliant Dane invented C#.

At my university Danes, proud of our heritage, even invented our own languages supporting this new world (Mjolnir Beta if anyone is still interested). Everyone was high on OO. A natural consequence of this was that relational database had to keep up. If I could not squeeze a massive object graph into a relational database, that was a problem with the database, not with OO. Relational databases were not really taught at universities, it was bourgeois.

This was the dawn of object databases and the formalization of the 3-tier architecture. Central to the this architecture was the notion that scale-out was mostly about two things: 

  1. Scaling out the business logic (ex: COM+, Entity Beans)
  2. Scaling out the rendering logic (ex: HTML, Java Applets)

We still hadn’t figured out how to fully scale databases, though most database vendors were working on it, and there were expensive implementations that already had viable solutions (DB2 and Tandem for example). What we called “Scale-out” in the 3-tier stack was functional scale, not data scale: Divide up the app into separate parts and run those parts on different machines.

I suspect we scaled out business logic because, we believed (I did!) that this was where the heavy lifting had to be done. There was also a sense that component reuse was a good thing, and that highly abstract implementations of business logic (libraries of classes) facilitated this. Of course, we did not see the dark side: that taking a dependency on a generic component, tightly coupled you to release cycles of that component. And thus was born “DLL-hell” and an undergrowth of JAVA protocols for distributed communication (ex: CORBA, JMS, RMI)

Moving business logic to components outside the database also created a demand for distributed transactions and frameworks for handling it (ex: DTC). People started arguing that code like this was the Right Way™ of doing things:

Begin Distributed Transaction

  MyAccount = DataAccesLayer.ReadAccount()

  if  withdrawAmount <= accountBalance then

    MyAccount.Balance = accountBalance – withdrawAmount

    MyTransaction = DatabaseAccessLayer.CreateTran()

    MyTransaction.Debit = withDrawAmount

    MyTransaction.Target = MyAccount

    MyTransaction.Credit = withDrawAmount

    MyTransaction.Source = OtherAccount

    MyTransaction.Commit

 else

    Rollback Transaction

    Throw “You do not have enough money!”

 end

Commit Distributed Transaction

You get the idea… Predictably, this led to chatty interfaces. Network administrators started worrying about latency, people were buying dark fibers like there was no tomorrow. Database administrators were told to fix the problem and tune the database, which was mostly seen as a poorly performing file system. Looking back, it was a miracle that noSQL wasn’t invented back then.

Since this was the dawn of the Internet, scale out rendering made a lot of sense. It was all about moving fat HTML code over stateless protocols. It was not unreasonable to assume that you needed web farms to handle this rendering, so we naturally scaled out this part of the application. Much later, someone would invent AJAX and improve browsers to a point where this would become less of a concern.

We were high on compute power and coding OO like crazy. But like the Titanic and the final end of the Victorian technology bubble, we never saw what was about to hit us.

Y2K: Mediocre, But Ambitious

The new millennium had dawned (in 2000 or 2001, depending on how you count) and people generally survived. The mainframes didn’t break down, life went on. But a lot of programmers found themselves out of work.

In this light, it is interesting to consider that programmers were considered the most expensive part of running successful software back then. JAVA didn’t live up to its promise of cross platform delivery – it became: “Write once, debug everywhere” and people hated the plug-ins for the browser. While productivity gains from OO were clearly delivered, I personally think that IntelliSense was the most significant advance in productivity that happened between 1995 and 2005 (it takes away work from typing, so I can use it on thinking).

Professional managers, as the good glory hunters they are, quickly sniffed out the money to be made in computers during the tech bubble. As these things go, they invented a new term: “IT” that they could use to convince themselves that once you name something, you understand it. It was argued that we needed to “build computer systems like we build bridges”, but the actions we took looked more like “building computer systems like we butcher animals”. The capitalist conveyor belt metaphor, made so popular by Henry Ford and “enriched” by the Jack Welsh’ish horror regime of the 10-80-10 curve, eventually led to the predictable result: Outsourcing.

Make no mistake, I am a great fan of outsourcing. I truly believe in the world-wide, utilitarian view of giving away our technology, knowledge, money and work to poor countries. It is a great way to equalize opportunities in the world and help nations grow out of poverty – fast. In fact, I think we need to start outsourcing more to Africa now. Outsourcing it is the ultimate, global socialist principle.

The problem with outsourcing isn’t that we gave work away to unqualified people in Asia and Russia – because we didn’t – those people quickly became great, and some of the best minds in computer science were created there.  Chances are that you are reading this blog on a software written by Indians.

The problem with outsourcing was that it led to a large portion of westerners artificially inflating the demand for Feng Shui decorators, lifestyle coaches, Pilates instructors, postmodernwriters”, public servants and middle level management. These days, Europe is waking up to this problem.

But then again, if we send the jobs out of the country, all the unemployed have to do something with their time. Perhaps it is no coincidence that the computer game industry soon outgrew the movie industry. You can after all waste a lot more time playing Skyrim (amazing by the way) than watching the latest Mission Impossible, and you don’t have to support Tom Cruise’s delusions in the process

Sorry, I ramble, somebody stop me. Yet, the fact is that a lot of companies wasted an enormous amount of money sending work outside the country, dealing with cultural differences and building very poor software. One of the the results (again a big equalizer, this time of life expectancy) was that numerous people must have died from mal-practice as the doctors were busy arguing about the latest way to AVOID building a centralized, and properly implemented, Electronic Patient Journal/Record system. As far as I know, that battle is still with us.

The Rise of the East: Nothing interesting to say about XML

Once you have latched on to the idea of a fully componentized code base, it makes sense to standardize on the protocols used for interoperability. Just when we thought we couldn’t waste any more compute cycles – someone came up with SOAP and XML.

This led to a new generation of architects arguing that we need Service Oriented Architectures (SOA). There are very few of those today (both the architects and the systems), but the idea still rings true and for the customers who have adopted it. And it IS very true: a lot of things become easier if you can agree on XML as an interchange format. We also saw the rise of database agnostic data access layers (DAL). Everyone was really afraid to bind themselves to a monopoly provider.

I don’t think it is a coincidence that open standards and the rise of Asia/Eastern Europe/Russia coincided with *nix becoming very popular. The critical mass of programmers, as well as the interoperability offered by new and open standards, made it viable to build new businesses and software. The East has a cultural advantage of teaching mathematics in school at a level unknown to most westerns – I suspect it will be the end of our Western culture if we don’t adopt. Good riddance..

And the problem was again the over interpretation of a new idea. Just because XML is a great interchange format does not mean it is a good idea to store data in that format. We saw the rise (and fall) of XML database and a massive amount of programmer effort wasted on turning things that work perfectly well (SQL for example) into XML based, slower, versions of the same thing (XQuery for example). But something was true about this, and it lingered in our minds that SQL didn’t address everything, there WAS a gap: Unstructured data…

All the while, we were still high on compute power. Racking blade machines with SAN storage like there was no tomorrow. The term “server sprawl” was invented – but no one who wrote code cared. We continued to believe all that compute power was really cheap. Moore’s Law just kept on giving.

The end of the Free Lunch: Multi Core

But around 2005, it was the end of it. CPU clock speed stopped increasing. Throwing hardware at bad code was no longer an option. In fact, throwing hardware at bad code made it scale WORSE.

Those of us who had flirted with scale-out in the 90ies, and failed miserably, got flashbacks. Predictably, we acted like grumpy old men: angry and bitter that we didn’t get the blonde while she was still young and hot. Oracle became hugely successful with RAC, and the idea of good data modeling that actually has to run on hardware came back on the table, snatched from the hands of the OO fans while they were distracted by UML. People started arguing that perhaps, some of that business logic we all loved scaling out (functional scale-out, mind you), DID belong in the database after all. Thus came SQLCLR and JAVA support in Oracle and DB2. Opportunistic companies started publishing data models and selling them for good money, it was believed that if ONLY we could do 3NF modeling of data, all would be well. Inmon followers were high on data models!

People demanded automatic scale and everyone in the know, knew that this was not going to work without a fair amount of reengineering. But of course, we continued to pay homage to the idea – people don’t like their illusions broken.

While everyone was busy wrapping their head around scale-out, the infrastructure guys were beginning to show signs of panic. Big banks were complaining (and still are) that their server rooms were overflowing with machines. Their network switches were not keeping up anymore and the SAN guys went into denial: “You may measure a problem in Windows, but that is Windows who can’t do I/O, I see no problem in the SAN”. All our XML, OO and lazy code and CPU greed had started to take its toll on the environments. Somebody started talking about “Green Computing” – damn hippies!

Flashback: Kimball’s Revenge

While Teradata was busy selling “magic scale out” engines that had made great progress in this space, Kimball followers were spraying out Data Marts and delivering value in less time than it took the “EDW people” to do a vendor evaluation. The Kimball front got themselves a nice weapon of mass destruction with the Westmere and Nehalem CPU cores: cheap, powerful CPU power that took up very little space in the server room and didn’t need any special network components. It was all in-process on the same mainboard. Itanium (IA-64) went the way of the Alpha chip, the final realization that most code out there truly sucks: high clock speed beat elegant architectures.

Finally, a single machine with lots of compute power, no magic scale-out tricks, could solve a major subset of all warehouse problems (and interestingly most of the classic OLTP problems too). Vendors started digging out old 1970 storage formats that had a great fit with Kimball. Column stores and bitmap indexing got a revival. We saw the rise of Neteeza, Redbrick, Analysis Services, TM1, DB2 OLAP Views and ESSBase. This trend continues today, for example with Vertica and the Microsoft VertiPaq engines. there is even a great website dedicated to it: The BI Verdict.

Data kept growing though, and today we are seeing an interest in MPP, Kimball style, warehouses promising truly massive scale. Scale-out and -Up combined in a beautiful harmony.

But of course, social media beat us all to it. No matter how much data we tried to cram into the warehouse, the world could spam it faster than we could consume it.

BigData: More Pictures of Cats than you can Consume

Our tour of history is nearly at an end – bear with me a little longer. Today, the old dream of handling unstructured data is becoming reality – but perhaps a nightmare for some.

With BigData and HADOOP, we have a new architecture that for the first time makes some promises we can start believing:

  • Storage is TRULY cheap and massive (But you have to live with SATA DAS, not SAN)
  • Unstructured data, and queries on it, can be run in decent time (Not optimally fast, but good enough)
  • Semi automated scale is doable (but only if you know how to write Map/Reduce or use expression trees)
    This means that we can now drink the barrage of data coming from the Internet. Of course, human nature being what it is, a lot of this data is pure noise. In fact, the signal to noise ratio is scary on the Internet – I can think of no better word for it.
    Among all the noise of pictures of cats, tweets and re-tweets about toilet visits, and latest news about Britney Spear’s haircut, there is something we treasure: Knowledge about human nature. The problem that BigData helps us address is that we don’t always know how to find the needle in the haystack. If we build BigData installations, we can keep the data around, unafraid of losing it or burning all our money, while we dig for that insight we so desperately seek about our customers and the world around us.
    I have written elsewhere about the great fit between BigData and Data Warehouses. But I would like to reinforce my point here: Once you know the structure of the data you are looking for, by all means model it and put it in a warehouse for fast consumption. If you don’t know what you are looking for yet (or if you can wait for it), put it in BigData.
    Just be careful, there is a price associated with choosing temporary ignorance over knowledge seeking.

The Cloud: Grow or Pay!

Remember those infrastructure people who complained about server rooms overflowing? Do I hear your growls of anger? As architect/programmer/DBA, I think we often have felt the frustration of dealing with infrastructure departments (in my case the SAN guys take the price). “Why can’t those people just rack the damn server and be done with it – instead of letting me wait 3 months?” we ask ourselves as we again bang our fist into the table in frustration.

But the fact of the matter is that those server rooms ARE overflowing and overheating all over the world. Handling all the complicated logistics of racking and patching servers is expensive, and that is a direct consequence of our greed for compute power. Those infrastructure people are just trying to survive the onslaught of the server sprawl – give them a break. They respond by virtualizing (SAN and virtual machines) and they fight back when we ask for hardware that is non-standard or “not certified”.

imagePerhaps it is time we look ourselves in the mirror and ask if we really need to waste all those CPU cycles on overly complex architectures that protect us from ourselves and our inability to properly analyse what we plan to do with data. Every time we rack a new server, power consumption goes up in the server room and it gets just a little warmer in there. We are incurring a long term cost in power and cooling by writing poor code and designing overly flexible architectures. At the end of the day, a conditional branch in the code (an “if statement”), metadata and dynamically typed, interpreted code costs CPU, and if we want to be flexible, we often need a lot of that.

If there is one thing we have learned from mainframes, it is that code lives nearly forever. Especially when it is so bad that no one dares to touch it, or so complex that no one understands it.

If it makes your day, think of this as a green initiative: you are saving the planet by writing better code and saving power. But of course, such tree hugger mentality gets us nowhere in the modern business world, so how about this instead: with the dawn (overcast?) of the cloud, compute cycles will once again be billed direct to you: the coder and DBA!

MIPS are back in a new form, and it again makes good business sense to write efficient code. In fact, it always made good sense (even in the 90ies), but all those costs you were incurring from writing bad code or forcing data into storage formats that are unnatural (e.g. XML) has so far been hidden in your corporate balance sheet. With the cloud, that cost is about to uncloak itself, and like a Klingon battleship – this might leave you in trouble!

Summary: What history Taught Us

If you are still with me here, thank you for putting up with my tantrums. It is time to wrap this up. What has history taught us about architectures and how we go about storing data?

I like to think of computing as evolutionary, not revolutionary. There are very few major breakthroughs in computer science (perhaps with the aforementioned exception of IntelliType) that don’t arrive with a lot of advance warning, and grow up slowly from many failed attempts at implementations. For example, column stores are super hot today, but they were invented back in the 70ies.

Let us have a look at the “evolutionary tree of life” for data storage engines:

image

Above is of course a gross over-simplification, and you could argue about where the arrows should connect. But no matter where those arrows lead, it makes the point I want to make: That nearly all of these technologies are still around today and very much alive (perhaps with the exception of XML databases, but those were pretty silly to begin with). In fact, most large organisations have most of them around.

Are all these technologies just alive because they have not yet been replaced with the latest and greatest? Is noSQL, HADOOP or some other new product bound to replace them all over time?

I think the different technologies are there for a good reason. A reason very similar to why there are many species of animals in the world. Each technology is evolved to solve a certain problem, and to live well in a specific environment. When you target the right technology to a certain problem (for example column stores towards dimensional models) you solve that problem elegantly and with the least amount of compute cycles – preserving energy. Very soon, solving problems with a low number of compute cycles in elegant ways is going to really count.

While you might be able to solve most of the structured DW data problems with HADOOP, you eventually have to ask: Is a generic map/reduce paradigm really the way to go about that?

Humans have a strange way of wanting a single answer to the big and complex question in life, and we waste significant time searching for it. Entire cults of management theories (and I use that term lightly) and religions are build around these over simplifications. We seek simple answers to complex question, and as IT architects, this can lead us down a path of believing in “killer architectures”.

What history has taught us is that such killer architectures do not exist, but that the optimal strategy (just like in nature) depends on which environment you find yourself in. Being an IT architect means rising above all these technologies, to see that big picture, and resisting the temptation to find a single solution/platform for all problems in the enterprise. They key is to understand the environment you live in, and design for it using a palette of technologies.

Staying in the metaphor, this leads me to another conclusion: Just like evolution made us store fat on our bodies in different places depending on the usage (thank goodness) you also have to consider the option of storing your data in more than one place. Having a bit of padding all over the body is a lot more charming (and healthy) than a beer belly Smile

References:

Exploring Hash Functions in SQL Server

2011-11-06 4 comments

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may chose intelligently between them.

In this blog post, I will benchmark the build in function in SQL Server. I will focus on answering two questions:

  • How fast is the hash function?
  • How well does the hash function spread data over a 32-bit integer space

I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.

For the picky people (my dear soul mates): I will be using 1K = 1024 in the following, so 64K = 65536.

      Test data

      Because we sometimes find ourselves hashing dimension keys, I will use a very simply table with a nice key range as my test input. Such a key range could for example have been generated by an IDENTITY(1,1) column or a SEQUENCER. Here is the test script that generates the input:

CREATE TABLE CustKey (SK INT NOT NULL)

INSERT INTO CustKey WITH (TABLOCK) (SK)
SELECT n FROM dbo.fn_nums(10000000)

 

(see my previous post for the fn_nums source)

Speed of the Hash function

SQL Server exposes a series of hash functions that can be used to generate a hash based on one or more columns.

The most basic functions are CHECKSUM and BINARY_CHECKSUM. These two functions each take a column as input and outputs a 32-bit integer.

Inside SQL Server, you will also find the HASHBYTES function. This little gem can generate hashes using MD2, MD4, MD5, SHA and SHA1 algorithms. The problem for the purpose of our test is that these function spit out BINARY types, either 128 bits (for SHA) or 160 bits (for MD). However, we can quickly map that value into the integer space by doing a modulo (in SQL: %) with MaxInt (2**31 – 1). Interetingly, Modulo work directly on values of BINARY in SQL Server.

Before I move on the results, I would like share a little trick. When you need to get the raw CPU speed of the hash, you may be tempted to do something like this:

SELECT BINARY_CHECKSUM(SK) AS h
FROM CustKey

 

But this is wrong. Why? Because you are both measuring the time it takes to hash as well as the time it takes to transfer the rows to the client. This is not very useful, on my laptop it takes over two minutes to run the above query.

Instead, you should try to measure the raw time the server needs to calculate the data. One way to achieve this (and this trick can be used on most queries you benchmark) is to wrap the SELECT statement in a MAX function. Like this:

SELECT MAX(h)
FROM (
    SELECT BINARY_CHECKSUM(SK) AS h
    FROM CustKey
    ) noRows
OPTION (MAXDOP 1)

Notice something else interesting, I am forcing parallelism down to 1. I want to get as close to the raw cost of the CPU as possible – so I don’t want to clutter the query with any overhead of parallelism. The query now runs in a little less than 3 seconds on my laptop.

I addition to the above trick, we also need to quantify the time SQL Server spends on just accessing the table. We can get a good approximation of this by measuring the SELECT statement without any hash function. On my laptop, I measured this to be around 3200 ms. In the results below, I have subtracted this runtime so we are only measuring the cost of the hashing.

With all this said, let us have a look at the results:

image

Some comments on the data:

  • Notice that MD2 is very inefficient. I have not studied this algorithm in great detail and would appreciate a specialist commenting here.
  • Apart from the MD2 anomaly, the HASHBYTES functions all performan pretty much the same. Using around 1 micro sec per hash value calculation
  • BINARY_CHECKSUM and CHECKSUM are MUCH faster, by about an order of magnitude
  • Doing a simple modulo that spreads out the data over the integer space is about the same speed as CHECKSUM and BINARY_CHECKSUM
  • It is not the cast to NVARCHAR that uses CPU time (HASHBYTES requires NVARCHAR input)
  • However, the checksum function do take longer to run when you first cast to NVARCHAR. Costing around 100 ns per INT value

Of course, the HASHBYTES functions also have other characteristics than speed which may be desired (cryptographic utility for example). but if you you want is speed, it looks like CHECKSUM and BINARY_CHECKSUM are faster.

But, let us see how good the functions are at spreading the data.

Spread of the Hash Function

Another desirable characteristic, apart from speed, of a hash function is how well it spreads values over the target bit space. This is useful not only for cryptographic purposes, but also to “bucket” data values into equal sized portions, for example in scale-out MPP architectures.

Often, you will not know in advance how many buckets you eventually want to subdivide the integer space into. You may start out with 32K buckets and divide up the integer interval in 128K (32-bit space divided by 128K = 32K) even sized buckets, each with 128K values in them:

  • Bucket0: [MinInt…MinInt + 128K[
  • Bucket1: [MinInt + 128K…MinInt + 128K*2[
  • Bucket2: [MinInt + 128K*2…MinInt + 128K*3[
  • … etc..
  • Bucket 32K: [MinInt + 128K*32K…MaxInt]

Perhaps you will later want a further subdivision into 64K buckets, each with 64K hash values.

Wise from the runtimes I measured before, I used only 1M rows with values [1…1M] for this test. However, this is plenty to show some good results, so don’t worry.

To avoid running chi-square testing (See later) of integer value spread all over the 32-bit space, I calculated the hash of each key and I then bucketed them into 64K bucket ranges (each with 64K values, neat isn’t it?).

This table comes in handy for that bucketing

CREATE TABLE ExpectedBuckets 
(
    BucketID INT
 
, RangeStart INT
  , RangeEnd INT)


INSERT INTO ExpectedBuckets
SELECT
    n
    ,(n – 1 – POWER(2,15)) * POWER(2,16)
    , (n – POWER(2,15)) * POWER(2,16) – 1
FROM dbo.fn_nums( POWER(2,16) )


A sample output form this table:

image

I can calculate the hash values and put them into a table like this:

CREATE TABLE HashResult
(
  ObservationID INT IDENTITY(1,1)
, HashValue_i INT
)

CREATE CLUSTERED INDEX CIX ON HashResult (HashValue_i)

 

And finally, I can join them all up and count the values in each bucket:

SELECT
      BucketID
    , CAST(ISNULL(COUNT(HashValue_i), 0) AS FLOAT) AS Observed
FROM ExpectedBuckets EB
LEFT JOIN HashResult
ON HashValue_i BETWEEN EB.RangeStart AND EB.RangeEnd
GROUP BY BucketID

 

We now have a nice count of the content in hash bucket, it looks like this:

image

How do we test if this is a good result or not? It seems that min/max, averages and standard deviations don’t really suffice here (sorry modern MBA types, I realise I am going to loose you now Smile).

We need to do some hypothesis testing. The hypothesis I want to test (aka: the NULL hypothesis – all resemblance to relational algebra is purely coincidental).

“My chosen hash function evenly distributes the integers into the 64K hash buckets”

Interestingly, this hypothesis makes a prediction (has to, if not we cannot falsify it): namely that in each bucket filled by hashing the 1M rows, we should expect to measure 1M / 64K = 15 members.

There is also an important twist here: We have to measure the content of each bucket, even when it is zero. This is why I use a left join in the query above, we have to measure all the results we know, not just count the hash buckets that actually get filled (you can see how not bucketing the data would have made the test result very large).

We can now test the hypothesis against the measured result. If we see correlation, then it gives us confidence in the correctness of the hypothesis, and lets us conclude that our hash function is a good one. It also gives us the ability to compare different hash functions with each other. We can test for correlation between the predicted result and the actual result with the chi-square test. Interestingly, the tools we need for this test is even exposed in Excel as the CHISQ series of functions.

Without further ado (and insults), here are the results:

image

(source data available on request)

Some observations about the data:

  • Most importantly: the CHECKSUM and BINARY_CHECKSUM functions are very poor hash functions. Even with 1M input and 64K large buckets, they don’t even fill the hash space. The skew is massive on non-skewed input data.
  • We also see that casting the INT to NVARCHAR improves the behavior of both BINARY_CHECKSUM and CHECKSUM, but not enough to make them good hash functions.
  • SHA and SHA1 seem to give us the best spread of the data, but at 1 microsecond per hash, it seems rather expensive.
  • Notice that the MD series do not give a particularly nice spread. I suspect this is related to the way I divide the data with MaxInt to get from the 120bit value to a 4-byte integer
  • We see that only the modulo does a distribution that gives us nearly 100% confidence (not fully though, because 1M does not divide 64K, which is reflected in the 824 value of Chi-squared). This result is expected, and my function here is engineered to “game the test”. The modulo cannot generally be used like this to spread the value all over the 32 bit integer space.
  • For fun, I put in a test where I use random values all over the integer space. Even with a confidence interval of 10%, we cannot say it distributes evenly. This means that SHA functions distribute better than random on this input dataset. Your mileage will of course vary for random values

On the note of the CHECKUM series, Books Online does (under?) state:

If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

Summary

In this blog I have explored properties of hash function built into SQL Server. The results indicate some general guidance that could be followed:

  • For best spread over the hash space, use SHA or SHA1 algorithms with HASHBYTES function
  • Be VERY careful about using CHECKSUM and BINARY_CHECKSUM as these functions do not spread a hashed integer value nicely over the full space. Depending on what you use these functions for, they may not be suitable
  • It takes about 1 CPU microsecond to calculate a single call to HASHBYTES, except when hashing with MD2
  • Using HASHBYTES to spread out values over the 32-bit integer space seems rather expensive, since even the best cases uses tens of thousands of CPU instructions.
  • This indicates that it may be possible to implement a better hash function if you do not care about cryptographic properties of the hash

Reference:

Why Surrogate Keys are not Good Keys

2011-10-22 25 comments

History tracking in warehouses is a controversial discipline. I this post, I will begin to unravel some of the apparent complexities by taking apart the history tracking problem, piece by piece.

Observation: History tracking an EDW is not the same as archiving the source.

I already blogged about using HADOOP or "dumb databases" for archival or source system data. The purpose of these archives is to avoid the discussion with the business user about loosing data. We can always bring back the source. I also voiced the opinion that you should not bother with modeling the source, and that it is trivial to automate this archival process in such a way that the source can be restored to any point in time.

Observation: history tracking is part of the business requirements for a warehouse.

We will only track history on data that the end user needs history about. We will not model history just for the sake of modeling it.

Observation: Fact table history is very different from dimension table history

I hope it is clear that tables containing fact data typically don’t act like dimension/reference tables. I will assert that OLTP designers (they guys we get our source data from) have known and agreed upon this simple observation for years: tables behave differently and fall into certain patterns. For example: Most OLTP system have tables that are "add only", which means we don’t need to track history on them (they are already historized). As another example: By following the well laid out arguments in Kimball books on how to models inventory systems, it is also clear that we need a modeling trick "snapshots" which is quite distinct from they way we model dimensions,

In other words: Just because a fact table has from/to date does not mean we used a type 2 dimension pattern to model it.

Dimensions, on the other hand, have a property that facts rarely have: The history we model on these tables depends a lot on what we are trying to achieve for the business user and their behavior is not dictated by the behavior of the source.

Modeling Facts

Kimball has written extensively about modeling fact data, and I mostly agree with his observations and techniques. I am not even sure this statement is disagreement, but: I would like to point out that all the talk about "choosing the grain" generally doesn’t make any sense to me: just pick the lowest possible grain you can get from the source.

Modeling Dimensions

After this initial tap-dance, let me turn my attention to the subject I really want to talk about: modeling dimensions and the problem with surrogate keys.

Let me start from a set of commonly perceived problems for Kimball style warehouses. Imagine you have a model like the below:

image

Now extend you mental image to include a situation where we change our mind about the history tracking in the customer dimension. We obtain knowledge (maybe through data cleansing) about the education of our customers. There are two ways to handle this:

1) In the simplest model, we may simple add this information "going forward". Our old, historical records remain unchanged, but we can analyze the future data by education. This is not the problem I am concerned about, and I think this modeling trick has limited use.

2) In the more complex model, we may wish to change history "backwards in time". This means that the data model may change like this:

image

It is the complex model that seems to worry the opponents of dimensional modeling for Enterprise Warehousing, especially the transformation indicated by the orange arrow above.

Here is the crux of the argument against dimensional modelers, as I have understood it from my many discussion with warehouse specialists:

Problem 1: changing your mind about history means that ALL fact tables that reference the Customer dimension must change to reference the new keys in Customer’. For some reason, this is perceived as a large dependency problem that is insurmountable at scale.

Answer 1: if you don’t have a meta model (or naming convention) that allow you to quickly (literally in minutes) analyse the impact on your database of making such a change, you are doing something very wrong with your dimensional model. This problem has been solved several times by builders of dimensional enterprise warehouses – I am sure you can solve it too.

Problem 2: changing the history tracking of the dimension requires large update statements to all facts that reference it.

Answer 2: this issue has some merit, but it depends on the implementation of the database you work on. It is true that in a multi TB system, such an update can be painful. But even in an SMP system, I would temper this statement with the comment: "but not as painful as you may think". This will be the subject of a future post.

We should also observe that the impact on fact tables means that the surrogate keys we use in Kimball models, are not good keys (as per my previous definition). The reason they are not good keys, is that Kimball surrogate keys track two things:

  1. The uniqueness of the dimension keys
  2. The history of the entity

When one of these two are volatile (the history) the entire model becomes volatile. This is not a desirable trait of a large warehouse.

Problem 3: not every business user wants the same version of history tracking, so a single type2 dimension table is wrong.

Answer 3: this is also true for some scenarios, and it hints at a model where there are multiple versions of each dimension and a new way to link facts and dimensions together. We observe that several warehouses are built where this never happens, so we should be careful not to over-generalize the model technique to always take an edge case like this into account.

Here we observe that this additional requirement does not mean that we need to abandon the notion of type 1 columns – we just have an additional requirement which we could call "multiple views of type 2 columns".

Summary

The three above problems (of which only the last two have any merit) are quoted, in different forms, by the persons I have heard argue that dimensional models are unsuitable for an "Enterprise" Data Warehouse.

The solution, and this baffles me, seems to be: use a 3NF data model instead of Kimball’s model. I think this 3NF approach is a fallacy and I have witnessed the horrors that arise from it. But as I have admitted, the arguments DO have merit and should be addressed in the model.

Location:Boing 767 over the Pacific

References:

The Analysis Services 2008R2 Performance Guide is Online

2011-10-10 2 comments

I am happy to announce that companion volumes for performance tuners and operations people are now again, for the first time since Analysis Services 2000, a reality.

As a developer, you can learn about building, tuning and troubleshooting Analysis Services 2005, 2008 and 2008R2 cubes(yes, the guides cover all three editions) in the Performance Guide.

If you are the DBA or operations team, you can read about running such cubes in production in the Operations Guide.

And finally, if you are a consultant, expert developer or cube DBA, you can learn how to build the meanest and largest cubes from the 5 day Analysis Services Maestro Course.

These three artifacts, which I am proud to have contributed to (and for the guides, leading the effort on), are a big milestone in my Analysis Services career. They represent a large amount of knowledge transfer from Microsoft to the field. The publication of the companion volumes also marks my transition into some new and exiting projects at least for the near future. I will busy be digging into more “grade of the steel” work,  among it ROLAP UDM testing, and I hope to blog about over at SQLCAT.com very soon.

Thanks to everyone for the incredible feedback during the writing of these guys. The Analysis Services community is very vibrant and these guides are the result of our collaboration in the field, and I am happy to give something back.

References:

An Interesting Blog to Follow

2011-09-01 2 comments

I know some of you track Conor Cunningham’s blog, which I highly recommend. The information on query optimizers is rather sparse out there – and it is a real privilege when people who know what they are talking about share their information freely on a blog. There are just not that many of them.

Here is something to think about in that context: Even if you are only interested in SQL Server, there are good things to be learned from studying other database engines – they are after all very similar. One of the good blogs for that purpose is the Oracle optimizer blog: http://blogs.oracle.com/optimizer/. It is delightfully free from marketing, and packed with good stuff. Don’t worry about the fact that they like textual query plans over there in Oracle. You might be used to viewing plans graphically, but I am sure you agree that such minor difference in display preferences can be set aside in the name of reading good material.

The Big Picture – EDW/DW architecture

2011-08-30 27 comments

Now that the cat is out of the bag on the Kimball forum, I figured it would be a good idea to present the full architecture that I will be arguing for. I was hoping to build up to it slowly, by establishing each premise on its own before moving on to the conclusion.

But perhaps it is better to start from the conclusion and then work my way down to the premises and show each one in turn.

Here is the architecture and data flow I will be arguing should form the basis of a healthy, performing, flexible and solid foundation for an enterprise:

image

 

Method, at a High Level

The basic method for building this is:

  • Model your data as Kimball would have done it. But ONLY model Type1 history in the dimension. The dimension form the entities (but not the entity history)
  • If you believe snowflakes are easier to maintain, go ahead and normalize the entities, but do so using standard OLTP methods (you are allowed to have 1-m keys)
  • Use the maps as described earlier to maintain key invariance
  • For the attributes you want to history track, add a entity history table and add only the attributes you want to track (if not, you get into some interesting ETL issues later)
  • If users want fact tables with type2 history, either create a mart for them (using the join technique I will describe), materialize a view, or allow them directly on the EDW using joins that a slight more complicated than a traditional star schema.
  • Users who only need type1 history can simply use the EDW directly
    1. Major choices

      There are some major Kimball choices that must be made to make this model work

    2. The grain is always the lowest possible level. This means that the facts in the EDW (though maybe not in the mart, if you use it) are copies of the source transactions. This gets rid of any complaints about aggregating the fact and losing data
    3. The world is dimensional, and we are letting users lose on the EDW itself
    4. We allow either normalized or pure star schema dimensions/entities. Our choice in this will depend on our target database engine and required performance.
    5. We are using maps to keep the entity keys stable. This minimized the need for reload of facts
    6. We are assuming that it is easy to reload entities/dimensions and change their structure (as long as the key remains stable). If you are having trouble handling a billion row customer table, ask questions on this blog as we move forward.
    7. Facts contain all the data of the source. If there is “garbage” in the source we will split that out into a vertically partitioned table off the fact and put it on slow (i.e. cheap) storage.
    8. Facts are fully de-normalized – there are no big table/big table joins here

    What happened to my ODS/Vault/Staging/Archive etc?

    One thing to notice is what is NOT in this picture: There is no ODS, Vault or any 3NF databases. The reason is that I am of the opinion that you don’t need one, at least not one that has a data model you care about. I previously said that I would make the point that you should not bother modeling the source, it is a waste of time. This is an argument that needs a lot more backing, more about this later. But a few observations:

    I acknowledge that there are auditing requirements which may drive you towards a “zero information loss” model, and that the EDW is often tasked with solving this problem. Note that zero information loss is achievable in the model above, with a few tweaks to the Kimball model. Here is how:

    One of the things we can leverage today is the new HADOOP type storage systems (like HIVE or Cassandra). These systems finally deliver on the promise of truly cheap storage – no big SAN storage racks involved: cheap hardware all the way down to the disks. They can run massive archives on SATA spindles, maintain multiple copies of the data and run ad-hoc analysis with high block size I/O (giving you reasonable speed, even on SATA) at record low cost/TB. They are not (yet) super fast at responding to ad-hoc queries at high user concurrency, typically taking many minutes or hours to respond. But responding to ad-hoc queries fast and providing the “current business view of the truth” is what the EDW is all about.

    HADOOP is the mother of all stovepipesIn this architecture we use the HADOOP both as storage for raw copies of the source system and as a source itself (for example for web log data). HADOOP is the mother of all stovepipes and it lets us bring back old data if required to do so, satisfying the auditors and de-railing the argument about keeping too much data around just for this purpose. In other words: it leaves us free to model the EDW to only contain data that we know is needed to answer business problems and throw the rest away. We don’t want to understand the source or model it (except for the purpose of ETL’ing from it for the exact structures we need). We just make sure we don’t lose the source, if it cannot guarantee zero loss for us.

    If we have them,  analysts are also free to hack away with ad-hoc queries on raw source data in HADOOP, in the table format and model they are used to see in the source, without having to disturb the source or rely on some new vocabulary or over normalized model technique.

    Moving forward (or back) from here

    I am fully aware that I make some grand claims, and of course, you should not believe me until you have seen the arguments, the design considerations, the safety measures, the methods, the raw numbers and calculations.

    But this is a start for the debate and a framework for fleshing out these arguments.

    References:

    Physically Placing the Maps in the architecture

    2011-08-19 2 comments

    Before we leave the maps behind, I need to live up to my promise of describing the storage characteristics of tables visited during the journey, this must include the physical location of maps. Since believe form must follow function in a DW, let us just recall function:

    From a functional perspective, I have shown you how map tables can be used to both track and correct source system keys. Maps are not visible to the end user, but they are a necessary part of the data’s journey from the source to the final data model. Maps also provide the abstraction of, or interface to, master data sources. In the absence of those sources – the maps can even serve as a makeshift master data repository.

    As we travelled along the ETL road, I also introduced you to the Entities tables. I deferred the exact definition of them, but we have seen how the Entities hold the master key for every entity in the warehouse, along with some book keeping columns. Entities is our next stop, but during our brief visit to them, we noticed that there is a foreign key between the maps and the entities, with the entity holding the primary, master key.

    Because foreign keys need to look up rows on every insert and update, it makes a lot of sense to physically locate tables related through foreign keys on the same machine: co-locating frequently joined tables make joins between them much faster. Furthermore, to preserve transactional integrity, the foreign and primary key tables must live inside the same transactional container. In the case of SQL server, this means that maps and entities must live in the same database – since that is the container of transactional integrity which does not require distributed transactions.

    Whenever we add a new entity key, this is driven by adding a new map entry (for a new entity to appear, it must have been observed in at least one source). Because of this, we want to make sure a backup of entities always includes maps that are consistent with that data. In SQL Server terms, on a large database, this means that maps and entities live in the same file group. In DB2 and Oracle, I believe the term is table space for the same concept (the smallest unit of consistent backup).

    Because of the above we will physically locate maps and entities together like this:

    image

    In my next post in the modeling series, we will spend more time with the Entities – keepers of the master keys and type 1 history.

    I will not be covering master data in further detail in this blog, see the references section below if you want to dig in further.

    References:

    Location:Reading, UK

    Transforming Source Keys to Real Keys – Part 2: Using Maps To Fix Key Problems

    2011-08-15 10 comments

    KeysIn part 1 of this post, I introduced the idea of map tables. These tables serve as an abstraction between the source systems and the entities in the data warehouse. In this post, I will describe how you can use the maps to correct for source key errors.

    Using the taxonomy of key pathologies described earlier, I will walk you through some examples of map usage.

    Some conventions I will use in populated tables illustrated in this blog:

    • Source Keys are in italics
    • Entity Keys (in the Entities schema) are coloured
    • Keys and column values that are irrelevant to the argument have been grayed out so you cant read them
    • Values that receive UPDATE statements are marked in green highlight.

    In order of the taxonomy, let us start with the composite key.

    Fixing Composite Keys

    Recall that we use the Entities schema to hold the “master key” for every entity and that we don’t rely on any source to be a master. The map table is a way to transform any key from a source to that master key.

    For example, let us assume our sales system supplies Promotion keys as a composite of SKU and and PromotionDate. Because some sneaky programmer couldn’t be bothered to create a new key for a promotion and instead argued that a single SKU is only ever on promotion on one date at a time (this, as it happens, is a good example of trying to argue for doing the wrong thing).

    In this case, we would have a map like this:

    CREATE TABLE Maps.Promotion_Sales (
    Map_ID INT IDENTITY(1,1) NOT NULL
    , SKU INT NOT NULL
    , PromotionDate DATETIME NOT NULL
    , ValidFrom DATETIME NOT NULL
    , ValidTo DATETIME NOT NULL
    , IsCurrent BIT NOT NULL
    , IsDeleted BIT NOT NULL
    , Promotion_Key INT NOT NULL
    )
    /* Index to enforce source uniqueness  */

    ALTER TABLE Maps.Products_Inventory
    ADD CONSTRAINT UK_SKU_PromotionDate
    UNIQUE CLUSTERED (SKU, PromotionDate, ValidFrom)

    /* Standard indexes here as per other blog entry */


    And our Entities table will be:

    CREATE TABLE Entities.Promotion (
    , Promotion_Key INT NOT NULL PRIMARY KEY
    , Name VARCHAR(50) NOT NULL DEFAULT (”)
    , CreatedDate DATETIME NOT NULL DEFAULT (GETDATE())
    )

    And no, we will NOT store the SKU or promotion in this table, they are either redundant (the SKU is already in the sales data) or we don’t want to maintain the muddle-headed design of the source in the warehouse.

    Example population of the tables if we receive the first promotion data on 2011-01-01

    Entities.Promotion

    Promotion_Key Name CreatedDate
    1 Big Week Sale 2011-01-01
    2 Good Deals Today 2011-01-01

    Maps.Promotion_Sales

    Map
    ID
    SKU Promotion
    Date
    Valid
    From
    Valid
    To
    Is
    Current
    Is
    Deleted
    Promotion
    Key
    5 43 2011-01-01 2011-01-01 2011-01-07 true false 1
    6 43 2011-01-01 2011-01-01 2011-01-02 true false 2

    Apart from making the fact tables linked to this entity smaller, we can now work around our poorly arguing programmer when the business wants to have more than one SKU on the same promotion. Let us say that SKU = 87 is no on the “Good Deals Today” promotion. We can now update the Maps, without wreaking havoc with the Entities:

    Maps.Promotion_Sales (updated to reflect the new reality)

    Map
    ID
    SKU Promotion
    Date
    Valid
    From
    Valid
    To
    Is
    Current
    Is
    Deleted
    Promotion
    Key
    5 43 2011-01-01 2011-01-01 2011-01-07 true false 1
    6 43 2011-01-01 2011-01-01 2011-01-02 true false 2
    7 87 2011-01-01 2011-01-01 2011-01-02 true false 2

    Fixing Zombie Keys

    Assume our inventory system has zombie keys. Recall that they use EAN as the key over there. At the initial load of our warehouse, let us say on 2010-06-01, we receive this staging table from them:

    Staging.Products_Inventory (on 2010-06-01)

    EAN Weight
    978-0465026562 1.0
    978-0201485417 2.2

    This will cause the following populate of the map:

    Maps.Products_Inventory (on 2010-06-01)

    Map
    ID
    EAN Valid
    From
    Valid
    To
    Is
    Current
    Is
    Deleted
    Product
    Key
    51 978-0465026562 2010-06-01 9999-01-01 true false 1
    52 978-0201485417 2010-06-01 9999-01-01 true false 2

    Now, let us assume that EAN = 978-0465026562 goes out of stock on 2010-08-01 and that the inventory system cleans up the record. We will now receive this staged data:

    Staging.Products_Inventory (on 2010-08-01)

    EAN Weight
    978-0201485417 2.2

    We must conclude that the key is dead, and we update the map to reflect this.

    Maps.Products_Inventory (on 2010-08-01)

    Map
    ID
    EAN Valid
    From
    Valid
    To
    Is
    Current
    Is
    Deleted
    Product
    Key
    51 978-0465026562 2010-06-01 2010-08-01 false false 1
    52 978-0201485417 2010-06-01 9999-01-01 true false 2

    Let us assume we have agreed with the source that it has zombie key behaviour. Because of this, we leave IsDelete = false. We know that Product_Key = 1 is supposed to be dead, but it may return from the grave.

    The supplier for EAN = 978-0465026562 delivers us a new stock three months later. On 2011-11-01 we again observe the EAN in the staging tables, and we reflect this in the map like this:

    Maps.Products_Inventory (on 2010-11-01)

    Map
    ID
    EAN Valid
    From
    Valid
    To
    Is
    Current
    Is
    Deleted
    Product
    Key
    51 978-0465026562 2010-06-01 2010-08-01 false false 1
    52 978-0201485417 2010-06-01 9999-01-01 true false 2
    53 978-0465026562 2010-11-01 9999-01-01 true false 1

    At this point you may ask: why didn’t we just update the ValidTo date of Map_ID = 51 and set it to 2010-11-01, instead of adding a new row? The reason we track the “dead state” of the key are twofold:

    1. We want to be able to detect wrong data from the inventory system. If a fact arrives when the key should be dead, we can flag this.
    2. We want to be able to handle any dolly keys arriving in the interval 2010-08-01 and 2010-11-01
        Recall that we have a unique index in place that enforces the rule:

      “a source key cannot have TWO different meanings at the same point in time”

        . Let us see this rule in action…

      Fixing Dolly Keys

      After our warehouse goes into production, the guys down in inventory discover an error in the reports. After some investigation, it is discovered that on 2010-10-13 a new guy unpacked a crate containing a book called: “Gödel, Escher, Bach: Ein Endloses Geflochtenes Band”. The new guy, understanding German, understood that this was most likely the book with EAN = 978-0465026562, so he created a new entry in the inventory system, temporarily making the EAN appear in the system – but pointing to the WRONG entity. The book was returned the day after to the supplier, because the inventory guys realised it was in the wrong language, after this, the batch clean process gets rid of the EAN again.

      Picture of Black SheepHowever, the guys down in inventory care about returns, they want to track them and report on errors. We are faced with the problem of an EAN temporarily representing the wrong entity. Think this example is far fetched? I have seen it several times, on keys a lot of people would personally care about.

      Fortunately, our Map structure can handle this. We start by looking up the Product_Key of the German edition of Gödel, Escher, Bach – let us assume it is Product_Key = 42. We can now handle our Dolly key like this:

       

      Maps.Products_Inventory (updated after 2010-11-01)

      Map
      ID
      EAN Valid
      From
      Valid
      To
      Is
      Current
      Is
      Deleted
      Product
      Key
      51 978-0465026562 2010-06-01 2010-08-01 false false 1
      52 978-0201485417 2010-06-01 9999-01-01 true false 2
      53 978-0465026562 2010-11-01 9999-01-01 true false 1
      54 978-0465026562 2010-10-13 2010-10-14 false true 42

      Notice that, unlike the zombies that are the norm of this source, we marked the Dolly key as IsDeleted = true, we don’t expect this sheep to come back from the grave

      (zombie sheep, now THERE is an idea for a new game in the app store).

      Merging keys Together – Handling Multi Keys

      To wrap up this blog entry about the usage of map tables, let me illustrate another use for them. There are cases where the source system carries two different keys for the same entity. This CAN be the result of an error (often the case with customer data), but there are situations where having two keys for an entity is a legitimate use case.

      For example, our inventory system may have multi EAN numbers for the “same” book. This happens because books can exist in multiple editions. Examples of these differences are: paperback, hardcover, reprints and new editions. From the perspective of the inventory system, and often the customer, we may even care about the different editions. But perhaps this is deemed irrelevant from a business reporting perspective. If I drill into the Douglas Hoffstadter in my Product Dimension – I will probably WANT to see all English editions of “Gödel, Escher, Bach” as the same row.

      Fortunately, the merging of source rows can be handled by maps. It WAS actually the case that “Gödel, Escher, Bach” had a paperback edition in 1999 (which I happen to own) with another EAN = 1313353195. We can reflect this in our map:

      Maps.Products_Inventory (updated after 2010-11-01, old paperback added)

      Map
      ID
      EAN Valid
      From
      Valid
      To
      Is
      Current
      Is
      Deleted
      Product
      Key
      51 978-0465026562 2010-06-01 2010-08-01 false false 1
      52 978-0201485417 2010-06-01 9999-01-01 true false 2
      53 978-0465026562 2010-11-01 9999-01-01 true false 1
      54 978-0465026562 2010-10-13 2010-10-14 false true 42
      55 1313353195 1999-01-01 9999-01-01 true true 1

      Notice that we leave IsCurrent = true. This is a perfectly valid EAN to receive from the source system. I may simply have been lingering around on the dusty shelves for a long time.

      Summary – Transforming Source Keys to Real Keys

      In this blog, I have shown you how you can use map tables to handle key errors in the sources. For those of you that currently use type 2 dimension history for this purpose – you may wish to consider the implications of polluting your end user data with this type of “source noise”. How would you use a type2 dimension table when sources are replaced or errors occur that you don’t want to reflect in the end user data?

      Consider also what it means for the maintainability of the warehouse to adopt the poor key choices in the source as an integrated part of the end user reporting data. I think I have argued that there has to be a de-coupling between the “history reporting” that end-users want, and the tracking and handling of source keys. It is my claim that the data model is simplified by not mixing these two concepts together in the same table. It is also my claim that you should always adopt the map tables, with the structures described here, as an integral part of the ETL process in the data warehouse.

      As Jørgen Guldmann pointed out in a previous comment, having the structures in place for mapping does not mean you can find a reliable source of the data needed to handle the transformation. The maps allow you to change the sources to the  format they should have had in the first place, even if that means manually updating the map tables or having someone hired to correct errors by adding rows to the map (as we saw with the Dolly key example). It is worth observing that if you can  classify which key pathologies a source has, you can often automate the maintenance of the map, and only manually insert the exceptions you find as you iterate towards the completely cleansed Enterprise Data Warehouse.

      At a point in the future, you may find the holy grail and end up with no source pathologies at all, maybe even a “master source” for all entities. The maps can serve as the requirement definition for that process. Furthermore, the maps can be used as update structures for sources that want to comply with a better key structure. Some people would call this master data management, and by proactively building maps as part of your data warehouse model, you have placed yourself in a position to lead that argument with a data driven approach.

      Just don’t be surprised that you end up BEING the master data system once you start building mapping structures. But this is a good thing, the problem does not go away just because you push it around – with maps, you have a way to quantify and encapsulate it that does not ruin your brilliant data model.

      Related blog posts:

      PS: The book I have used as an example happens to be one of the best books ever written on the relationship between music, philosophy, art and mathematics. I rally cannot recommend it enough.

      Microsoft Announces Plans to Introduce Hadoop Interoperability

      2011-08-12 2 comments

      For those of you who have not yet seen it, Microsoft recently announced that they will be looking at Hadoop connectivity to the database stack:

      Some of you may have wondered why I have not yet mentioned the BigData movement as part of my DW articles. In my defense I will say that this a big trend, something I had to give a lot of thought to position correctly. Before I can talk more about how it fits into the full DW/BI architecture – I have to argue a bit more for my warehousing approach.

      I can reveal that I see traditional data warehousing (especially dimensional modeling) and BigData compliment each other in a way that solves some of the common complaints of warehouse builders across the world. I hope you will find my thoughts on BigData fit nicely into the picture I will be painting of the warehousing world going forward in this blog.

      Follow

      Get every new post delivered to your Inbox.

      Join 271 other followers