Modeling Dimensions with History Tracked, Generic Attributes
Sometimes, you need to model a database in such a way that you can dynamically extend the model without altering any tables. Perhaps the attributes changes faster than you can add new columns or the data you store has a “ragged” structure that does not lend itself well to being described as a traditional table structure.
The typical approach taken in these cases is to “pivot” the required “flexible columns” of a table into rows instead and dynamically reconstruct the schema at runtime. When you model like this, adding new attributes to the model is simply a question of inserting rows into the database.
As I sure you are aware, there are issues with this approach. Tuning columns into rows can cause interesting issues for database engines. The flexibility requirement and performance loss to be balanced carefully. In this blog, I will walk you through an example of how to handle this generic case in a star schema.
To make this concrete, let me use an anonymised, real customer example I ran into this week.
Simplified, the customer has this OLTP model of employees and transactions:
Transactions are all linked up to Employees. Because the City and a few other columns of the employee are important to keep track of historically, Employee_History models this concept using the ValidFrom/To column. In addition to City and other static columns, employees can also have a general “bag” of zero or more attributes associated with them that vary over time – this relation is tracked in Employee_Has_Attribute which also needs the ValidFrom/To columns to keep track of the temporal aspect of this relation. For example, an employee could have the attribute “Part Time” associated with him for a year, and another employee have both the “Part Time” attribute and the “Speaks Chinese” set in another time period. Later, the administrator of system may decide that a new attribute “Speaks German” is now allowed and they can simply add that to employees by inserting a row in the Attribute table and adding the historical information in Employee_Has_Attribute.
Whenever you find yourself in a situation where the schema needs this form a generic attribute behaviour – the question you have to ask yourself is: HOW generic? Taking the fully generic route for every single attribute will often lead to poor performance, so it is something to be done with caution. Don’t make the schema generic just because you can. But sometimes, there is no way around it.
In the case, my customer had done their homework and carefully balanced their need for generic behaviour and fixed schema structures. For example making they made a good choice of tracking City as a column and only moving the truly generic attribute into the separate table.
The big question now is: How do you turn this design into a star schema?
Modeling the Star Schema – Tracking the City
My customer wanted to turn the source schema into a fast query star schema Kimball style. Let us walk through the modeling example for this interesting case.
The first principle I apply is always: “Only model what you need”. In this case, what was needed was the history tracking of attributes and the City column. While history may be tracked in other columns in Employee_History too, we will collapse those for now.
The customer had already done the cleaning up of EmployeeKey in the source, using a structure very similar to the maps I recommend. They are nice integers and follow all my rules for being “good keys”.
Step one of the modeling process after these things have been cleared is thus to get rid of the nasty BETWEEN join that is needed to get from Transactions to Employee_History. We use a standard Type2 dimension.
The first step at the model will thus look like this:
The ValidFrom/To dates in Dim.Employee_T2 will have to be loaded based in changes tracked in Employee_History. Note that if Employee_History tracks MORE than the City columns, the ValidFrom/To dates may NOT be the same as in the Employee_History source.
We translate the TransactionDate to a surrogate integer keys and link it up to a date dimension (Dim.Date) – standard Kimball practice so far.
Generic Attributes – Potential Designs?
How do we proceed from the T2 dimension to the generic Attribute table? We notice that because an employee can have many attributes, we have to model an M-M relationships. In Kimball terms: We need a bridge table.
Let’s walk through a few failed attempts to weight the trade-offs.
Disclaimer for the Data Vault modelers out there: I do apologise is advance for the discomfort any chills down your spine may cause you if you see the obvious parallels between these example and how you model a warehouse.
Attempt 1: Use Employee_Has_Attribute directly?
This results in this model:
That wont work very well! Why? first of all, we have to do a nasty BETWEEN join, while carrying the TransactionDate over to the final join table. Also notice that the Employee_Has_Attribute table does not have key relationship to Dim.Employee_T2 – this can cause all sorts of interesting plan estimates and poor query plans which I have blogged about previously.
Attempt 2: Bridge directly from Fact to Attribute using TransactionDate
In this model, we de-normalise the EmployeeKey into the fact (recall that we have used a surrogate key to model the City history tracking).
We get this candidate model:
This is better than attempt 1, but still suffers from the BETWEEN join. We have also made the fact table wider and wasted space.
Attempt 3 Bridge on DISTINCT TransactionDate, EmployeeKey
We could sore the distinct join result of Employee_Has_Attribute and their matching dates Fact.Transactions. This results in a model like this:
This at least gets rid of the BETWEEN join. Unfortunately, this model also results in the bridge table having a size proportional the cardinality of the PRODUCT of EmployeeKey and TransactionDate – in other words: It grows with the size of the fact table. If we proceed with this model, we will have to do big/big table joins between the bridge and the fact. And we don’t like big/big joins – they are generally not scalable.
It seems we are out of options, there is no good place to build the bridge table without either creating a large bridge or making the join criteria very bad… None of these options look promising.
Generic Attributes – Suggested Design
I asked the customer: Does an employee’s attributes change fast or slow? It turns out they change pretty slow, faster than City yes, but still slow. There is another solution in this case.
Based on the data in Employee_Has_Attribute and Employee_History, we can find the DISTINCT UNION of the dates where EITHER the attribute changed or the City changed for a given employee. In a sense, we will count the change in the Attribute relationship as a change in the Employee. If we take the result of this distinct union of all EmployeeKey and FromDate in both the relation and the Employee table, this can be used as the basis for bounding the Dim.Employee_T2 dimension. If an attribute changed, but not the city, this would mean we would have two nearly identical rows in Dim.Employee_T2 (with only the surrogate key being different) – but this new surrogate key could be used to join to a bridge table between Dim.Employee_T2 and Attribute. The final model will look like this:
No nasty BETWEEN joins here. In addition, the bridge table size is now proportional with the number of times an employee changes attributes, instead of proportional to the fact size.
In this blog entry, I have shown you a way to model star schemas on top of sources that have a semi-loose schema.
Sometimes, you need the flexibility that comes with flexible schemas and there is no way around handling it. Just remember, flexibility often comes at the price of performance, especially when you have history tracking turned on. Don’t model flexibly just because you can, if something truly is static or change only infrequently, make use of that information in the model.
I have also shown you an example of a habit that I recommend building: to mentally walk through the join paths a data model force you to take through the data. When you do, you often gain insights into optimisation and alternative ways to structure tables.