The Information Staircase
With the Big Data wave rolling over us these days, it seems everyone is trying to wrap their heads around how these new components fit into the overall information architecture of the enterprise.
Not only that, there are also organisational challenges around how to staff the systems drinking from the big data stream. We are hearing about new job roles such as “Data Scientist” being coined (the banks have had them for a long time; they call them Quants) and old names such as “Data Steward” being brought back.
While thinking of these issues, I have tried to put together a visual representation of the different architecture layers and the roles interacting with them:
As enterprises enter the big data era, it will become crucial to consolidate data into easily accessible locations in a cost-efficient manner. At the bottom of the staircase, data volumes are nearly insurmountable, and the first job is simply to locate the data and gather it in one place where it can be queried. That is what the Data Extractor does. Data integration is already in demand, and the future will require data integrators to get smart about total system throughput.
Traditionally, when we needed “quick and dirty” implementations, the Data Modeler skipped the staging step between the sources and the warehouse and transformed directly to the format the Business Analyst wants – typically a star schema. This delivers local value, which actually DOES work in many cases, because organisations are siloed into different units. Unfortunately, in this way, we get “mart sprawl” – the proliferation of a lot of small data warehouse solutions that rarely agree on the right answer to business questions.
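To make the “straight to a star schema” shortcut concrete, here is a minimal sketch in Python. The source rows, the column names and the helper functions `build_dimension` and `build_star` are all hypothetical and exist only for illustration; a real implementation would live in an ETL tool or in SQL.

```python
from typing import Dict, List, Tuple

# Hypothetical flat rows, as they might arrive from an operational source.
source_rows = [
    {"customer": "Acme", "product": "Widget", "amount": 120.0},
    {"customer": "Acme", "product": "Gadget", "amount": 80.0},
    {"customer": "Globex", "product": "Widget", "amount": 200.0},
]


def build_dimension(rows: List[dict], column: str) -> Dict[str, int]:
    """Assign a surrogate key (1, 2, 3, ...) to each distinct value of `column`."""
    dimension: Dict[str, int] = {}
    for row in rows:
        value = row[column]
        if value not in dimension:
            dimension[value] = len(dimension) + 1
    return dimension


def build_star(rows: List[dict]) -> Tuple[Dict[str, int], Dict[str, int], List[dict]]:
    """Split flat rows into two dimensions and one fact table, with no staging step."""
    dim_customer = build_dimension(rows, "customer")
    dim_product = build_dimension(rows, "product")
    fact_sales = [
        {
            "customer_key": dim_customer[row["customer"]],
            "product_key": dim_product[row["product"]],
            "amount": row["amount"],
        }
        for row in rows
    ]
    return dim_customer, dim_product, fact_sales


dim_customer, dim_product, fact_sales = build_star(source_rows)
```

Note that nothing of the source survives except what this one model chose to keep, which is part of why independently built marts drift apart.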
The traditional answer to achieving “one version of the truth” has unfortunately been to conclude that we need “one MODEL of the truth”. Organisations have wasted large amounts of money building these enormous data models, trying to make everyone agree on them and replacing the existing data marts – an enormous undertaking often spanning several years of work. Master data management systems also aim to tackle this problem, albeit only at the entity level, not for transactions.
With the waves of big data sources crashing on the shores of the warehouse, “truth” now arrives faster than we can model it. We often don’t know whether data has any value hidden in it, and we may simply have to store it until we have time to find out. We may not even know what the data MEANS from a modeling perspective. Only once the data has been stored can the Data Scientists mine it for value.
If we are to have any hope of efficiently analysing these large data sources, we need to at least co-locate them. Co-location avoids ad-hoc queries pulling large data streams across the network. This is where Big Data technologies come to the rescue: they allow us to store data in any form we desire – for example the format it is produced in – without having to worry about modeling it first, whilst at the same time maintaining the ability to query it. In order for this process to scale, we will need to engage in a process of digital curation: adding metadata on top of the source data to keep track of WHAT we have in these stores, so the Data Modeler and Data Scientist can find it again and experiment with data models which bring value to the business. Fortunately, this curation process is made easier by the fact that it is simpler to catalogue data than to model it.
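As a rough illustration of cataloguing rather than modeling, the Python sketch below keeps raw data exactly as it was produced and records only lightweight metadata about it. Everything here is an assumption made for the example: the `raw_store` dictionary stands in for a real big data store, and `CatalogueEntry`, `Catalogue` and `read_json_lines` are hypothetical names, not part of any product.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Iterator, List

# Hypothetical raw store: location -> data kept in the format it was produced in.
raw_store: Dict[str, str] = {
    "landing/web/clicks-001.jsonl": (
        '{"user": "u1", "page": "/home"}\n'
        '{"user": "u2", "page": "/buy"}\n'
    ),
    "landing/pos/sales-001.csv": "store,amount\n12,99.50\n",
}


@dataclass
class CatalogueEntry:
    """Metadata describing WHAT is stored, without modeling its contents."""
    location: str
    source_system: str
    data_format: str
    tags: List[str] = field(default_factory=list)


class Catalogue:
    def __init__(self) -> None:
        self._entries: List[CatalogueEntry] = []

    def register(self, entry: CatalogueEntry) -> None:
        self._entries.append(entry)

    def find(self, tag: str) -> List[CatalogueEntry]:
        """Return every entry carrying the given tag."""
        return [entry for entry in self._entries if tag in entry.tags]


def read_json_lines(location: str) -> Iterator[dict]:
    """Interpret the raw text only at read time, one JSON document per line."""
    for line in raw_store[location].splitlines():
        if line.strip():
            yield json.loads(line)


catalogue = Catalogue()
catalogue.register(
    CatalogueEntry("landing/web/clicks-001.jsonl", "web", "jsonl", ["clickstream"])
)
catalogue.register(
    CatalogueEntry("landing/pos/sales-001.csv", "pos", "csv", ["sales"])
)

# Find the data via the catalogue, then query it in its original form.
pages = [
    record["page"]
    for entry in catalogue.find("clickstream")
    for record in read_json_lines(entry.location)
]
```

Registering an entry costs a few fields of metadata, whereas modeling the same source would require agreeing on its meaning first.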
The most significant deviation from vanilla Kimball architectures in this approach is the introduction of the Data Scientist role. This person’s job is to form hypotheses about the available data which could drive new business value, and to test them against the raw source. When a hypothesis is confirmed, the Data Modeler can materialise the insight into a data model: one optimised for delivering it to the higher steps of the staircase. Traditionally, users have only been let loose on data after it has been carefully modeled in the warehouse. But I don’t think we can afford this luxury of non-transparency and strict model requirements anymore. As Richard Cook says in this brilliant talk, “How Complex Systems Fail” (about 26 min in):
“I think this idea about hiding in layers of abstraction everything about the details has in fact pretty much run its course now. I think the idea that we can make these sort of black box devices that we only know about the shell of and have no knowledge of the internal working of, is in fact a deep mistake. Because it turns out that in order to be able to make and reason about how the system is actually working, we have to have knowledge about what is inside the black box.”
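The loop described earlier, where the Data Scientist tests a hypothesis against the raw source and the Data Modeler then materialises it, might look roughly like the Python sketch below. The order events, the coupon hypothesis, the threshold and the function names are all invented for illustration, and the comparison of averages is deliberately naive: it is not a proper statistical test.

```python
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical raw order events, kept as the source produced them.
raw_orders = [
    {"customer": "c1", "used_coupon": True, "amount": 40.0},
    {"customer": "c2", "used_coupon": False, "amount": 70.0},
    {"customer": "c3", "used_coupon": True, "amount": 50.0},
    {"customer": "c4", "used_coupon": False, "amount": 90.0},
]


def average_amount(orders: List[dict], used_coupon: bool) -> float:
    """Average order amount for orders with the given coupon flag."""
    return mean(o["amount"] for o in orders if o["used_coupon"] is used_coupon)


def hypothesis_holds(orders: List[dict], min_difference: float) -> bool:
    """Naive check: are non-coupon orders larger on average by at least min_difference?"""
    difference = average_amount(orders, False) - average_amount(orders, True)
    return difference >= min_difference


def materialise(orders: List[dict]) -> Dict[bool, dict]:
    """Aggregate the raw events into a small table shaped for delivery."""
    return {
        flag: {
            "orders": sum(1 for o in orders if o["used_coupon"] is flag),
            "avg_amount": average_amount(orders, flag),
        }
        for flag in (True, False)
    }


# Only a confirmed hypothesis earns a model; otherwise the raw data stays as it is.
coupon_summary: Optional[Dict[bool, dict]] = (
    materialise(raw_orders) if hypothesis_holds(raw_orders, 10.0) else None
)
```

The point of the ordering is that modeling effort is spent only after the raw source has shown there is something worth modeling.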
In order for us to adopt this new approach, we need to get a heck of a lot better at data modeling and data science. And we need to show the source as it is.