
Why “Date BETWEEN FromDate AND ToDate” is a dangerous join criteria

I have been meaning to write this blog post for some time and the discussion about Data Vault finally prompted me to do it.

Sometimes, you find yourself in situations where you have to join a table that has a structure like this:
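The original definition is not reproduced here; a minimal sketch of such a structure, with illustrative column names, might be:

CREATE TABLE SomeDimension (          -- sketch only, names are illustrative
    BusinessKey   INT      NOT NULL   -- the natural/business key
  , FromDate      DATETIME NOT NULL   -- start of the validity interval
  , ToDate        DATETIME NOT NULL   -- end of the validity interval
  , SomeAttribute INT      NOT NULL   -- whatever is being tracked over time
);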

The join criterion is expressed like this:
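Again as a sketch, reusing the illustrative names above plus a hypothetical FactTable with a TranDate column:

-- Closed interval: both FromDate and ToDate are included in the match (sketch only)
SELECT f.TranDate, d.SomeAttribute
FROM FactTable AS f
JOIN SomeDimension AS d
    ON f.BusinessKey = d.BusinessKey
   AND f.TranDate BETWEEN d.FromDate AND d.ToDate;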

Or, more commonly, this variant with a semi-open interval:
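Still a sketch with the same illustrative names; the semi-open form excludes the end point, so adjacent intervals do not overlap:

-- Semi-open interval: FromDate is included, ToDate belongs to the next version (sketch only)
SELECT f.TranDate, d.SomeAttribute
FROM FactTable AS f
JOIN SomeDimension AS d
    ON f.BusinessKey = d.BusinessKey
   AND f.TranDate >= d.FromDate
   AND f.TranDate <  d.ToDate;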

Data models that promote these types of joins are very dangerous for relational optimizers, and you have to step carefully when executing queries that contain many of them. Let us have a look at why this is so.

Temporal Join

To illustrate the issue with this query pattern, let me create a very simple test setup that you can experiment with. Use this script to generate the two tables:
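The script itself is not preserved in this copy. What follows is only a rough sketch of a comparable setup: the column names, key counts, and date ranges are illustrative, and it assumes fn_nums(@n) returns a single column n with the numbers 1 through @n.

-- Sketch only: SmallTable is a temporal tracker with 10 temporal records per BusinessKey
CREATE TABLE SmallTable (
    BusinessKey   INT      NOT NULL
  , FromDate      DATETIME NOT NULL
  , ToDate        DATETIME NOT NULL
  , SomeAttribute INT      NOT NULL
  , PRIMARY KEY (BusinessKey, FromDate)
);

INSERT INTO SmallTable (BusinessKey, FromDate, ToDate, SomeAttribute)
SELECT k.n
     , DATEADD(DAY, (v.n - 1) * 10, '20100101')   -- each version is valid for 10 days
     , DATEADD(DAY, v.n * 10, '20100101')
     , v.n
FROM dbo.fn_nums(260) AS k                        -- 260 business keys (illustrative)
CROSS JOIN dbo.fn_nums(10) AS v;                  -- 10 temporal records per key

-- Sketch only: BigTable has no temporal tracking, just a TranDate inside the intervals above
CREATE TABLE BigTable (
    RowId       INT      IDENTITY(1,1) NOT NULL PRIMARY KEY
  , BusinessKey INT      NOT NULL
  , OtherKey    INT      NOT NULL
  , TranDate    DATETIME NOT NULL
);

INSERT INTO BigTable (BusinessKey, OtherKey, TranDate)
SELECT 1 + r.n % 260                              -- spread the rows over the business keys
     , r.n % 1000
     , DATEADD(DAY, r.n % 100, '20100101')        -- stays inside SmallTable's 100-day window
FROM dbo.fn_nums(10000000) AS r;                  -- roughly 10M rows, a few hundred MB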

 

(see my utility functions for fn_nums)

SmallTable above is a temporal tracker table with 10 temporal records per BusinessKey. BigTable is about 400 MB in size and does not use temporal tracking, but has a TranDate column that determines which temporal records in SmallTable match it.

Let us now try to execute a reporting query where we ask for an aggregate over BigTable joining it up to its matching temporal keys in SmallTable.
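The query itself is also missing from this copy; a sketch of the kind of aggregate meant here, using the illustrative columns above, might be:

-- Sketch: aggregate over BigTable, resolved against its matching temporal rows at query time
SELECT s.BusinessKey
     , s.SomeAttribute
     , SUM(b.OtherKey) AS SomeMeasure
FROM BigTable AS b
JOIN SmallTable AS s
    ON b.BusinessKey = s.BusinessKey
   AND b.TranDate >= s.FromDate
   AND b.TranDate <  s.ToDate
GROUP BY s.BusinessKey, s.SomeAttribute;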

Some quick statistics about this query execution on my laptop:

  • CPU time: 25547 ms
  • Logical I/O operations: 50762  (no physical)
  • Memory Grant: 370 MB
  • Rows Returned: 2600

Nothing overly suspicious yet (though the memory grant seems rather high). Let’s just have a quick look at the query plan:

[Query plan for the temporal join: the actual row count is far higher than the optimizer’s estimate]

That is a pretty big misestimate, isn’t it? And that is the crux of the issue: it is immensely hard for a query optimizer to accurately predict that the join on the temporal table will lead to one and only one row (unless you have a temporally aware database, of course).

But misestimates are not the full story; there is also a higher CPU cost involved in doing this join. At the CPU instruction level, more work needs to be done to find the records matching an interval than to do a straight comparison of two values.

Now, you can imagine what happens if you have a data model with very long chains of these joins. As you probably know, query misestimates (and the risk of bad query plans) typically grow exponentially with the number of tables being joined. Having a data model that almost guarantees poor estimates even in a two-join setup can quickly lead to interesting tuning challenges.

Going Kimball Again

There is a very good reason Kimball recommends integer keys for Type 2 dimensions instead of the temporal join you just saw.

Let us change BigTable into a Kimball representation instead by pre-joining like this:
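The pre-join script is not preserved either. A sketch of such a transformation, assuming a surrogate key column named SmallTableKey (an illustrative name), might be:

-- Sketch: give every temporal version in SmallTable its own integer surrogate key
ALTER TABLE SmallTable ADD SmallTableKey INT IDENTITY(1,1) NOT NULL;

-- Sketch: resolve the temporal lookup once, at load time, and store the result on BigTable
ALTER TABLE BigTable ADD SmallTableKey INT NULL;

UPDATE b
SET b.SmallTableKey = s.SmallTableKey
FROM BigTable AS b
JOIN SmallTable AS s
    ON b.BusinessKey = s.BusinessKey
   AND b.TranDate >= s.FromDate
   AND b.TranDate <  s.ToDate;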

We can now rewrite the previous aggregate query to this:
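A sketch of the rewritten aggregate, now a plain equi-join on the surrogate key:

-- Sketch: same aggregate as before, joining on the materialised surrogate key
SELECT s.BusinessKey
     , s.SomeAttribute
     , SUM(b.OtherKey) AS SomeMeasure
FROM BigTable AS b
JOIN SmallTable AS s
    ON b.SmallTableKey = s.SmallTableKey
GROUP BY s.BusinessKey, s.SomeAttribute;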

Let us first have a look at the query plan for the Kimball style join:

[Query plan for the Kimball-style join: same plan shape, but the estimates match the actuals]

Same query plan shape – but look at the difference in estimates vs. actuals! We are spot on here.

Comparing the Kimball-style Type 2 join with the temporal join, we get:

Measurement      Temporal Join    Kimball Type 2
CPU Time         25547 ms         10844 ms
Logical I/O      50762            50762
Memory Grant     370 MB           315 MB
Rows Returned    2600             2600
Misestimate      3x               None

Summary

In this blog entry, I have shown you why temporal-style joins can be dangerous to query optimizers. While it is not always possible to avoid them, extreme care should be taken if you choose to make them the only way to access your data model.

 

  12 Comments

  1. Alin Selicean   •  

    Hi Thomas

    I came across this just now… Great article… I think we have exactly the same scenario with a rather large table (600 million records). The two date columns are instead integer columns representing a day_key value (the number of days since 01/01/2004, for example). In addition, the table is partitioned on the starting day column. For some reason, partition elimination does not work.

    I have a few questions:
    1. Are we under the same scenario as depicted in your article even if the columns are of a different datatype?
    2. If yes, can this be the cause why partition elimination does not work?

    Thanks,
    Alin

    • Thomas Kejser   •     Author

      Hi Alin

      Yes, I think the same will happen with the datetime family of types. Those are really just integers behind the scenes.

  2. Pingback: How Vertical Partitioning and Deep Joins Kill Parallelism « Thomas Kejser's Database Blog

  3. Pingback: Modeling Dimensions with History Tracked, Generic Attributes « Thomas Kejser's Database Blog

  4. cteveret   •  

    Thanks again Thomas for these additional tips – I appreciate you taking the time to share your experiences. We will have additional projects this year attempting to query this database so these tips will help. I checked out the Thomas Christensens forum and there is a lot to digest :-). Still working on it and have learned a good deal already!

  5. Todd Everett   •  

    Great post Thomas – thanks for taking the time. I work frequently with a database designed using a vault model concept. This “data warehouse” is built and extended using an architectural strategy based upon the idea that a small set of business data in scope of a given operational system functional enhancement project must be “warehoused” as it is already being worked upon and “might be useful someday” to those who might want to report on it. The architecture also requires that a full history is retained for every data element in every entity. The vault model has been very helpful to us in meeting these requirements given its exclusive use of M to M relationships and in retaining a history of every column change with its satellite tables. The Link tables and the Satellite tables both rely on bi-temporal relationships to relate entities to each other and retain a relationship history over time, and also to maintain a row history for each entity over time.

    While it works really well for folding in new entities and data elements without impacting existing processes (you just create a new hub or link, or add a new satellite), we often struggle to get good plans for range scan queries. The plans are hard to read as there are so many tables involved (every entity has at least one hub/link, one satellite (often 3 or 4), and one point-in-time satellite to make it easy to join the current row without a subselect). Often the optimizer times out generating the plan if we are joining in 10 or more composite entities. I can usually re-write queries to get only the data required and often use cross applies to join the hub/link tables to the satellite (which only helps for “current state” queries where the matching bi-temporal child is always the most recent one – thankfully that is almost always what is needed) but it is a lot of work. But a 20-entity “monster” query written to retrieve data as of some time in the past (like in your example) never finishes. I was at a loss to understand why this was the case and your post has really helped me grasp the trade-offs in query optimization we have in using this approach.

    Ultimately our architecture calls for the “warehouse” to be a like a distribution center – it feeds data marts and isn’t intended to be queried – but in reality our business partners don’t want to fund additional development for a data mart and our operations partners don’t want to manage the extra databases as each means more backups, more maintenance, etc. You have given me some good tools and understanding to help explain to our partners why, if they want easy and fast reporting under our current architecture, we need the extra funding to build a data mart. Usually what I encounter is the belief that indexing and the nolock hint solves all. The test script you provided is a great example to show how an inability for the optimizer to develop accurate estimates given all these bi-temporal relationships is a root problem that won’t always be easily addressed by indexes and query hints.

    I don’t want to get into a Kimball vs. Inmon / Linstedt battle as every architecture optimizes certain objectives at the expense of others and none is good or bad absent of those objectives. I have been trying to learn all I can about all the different approaches, what objectives they maximize and what trade-offs they have. Your blog has been a really good source of learning material and I look forward to your continued posts.

    • Thomas Kejser   •  

      Hi Tod

      Thanks for the reply. Your experience with Vault reflects mine.

      If your data model consistently creates these types of issues for you – isn’t there a point where you have to ask the obvious question?

      It is indeed true that Vault allows sources to be added fast – but at what cost further down the delivery chain? You might find the discussion in Thomas Christensens forum informative (see the links in my previous post). There, I describe how the history tracking you are looking for can be done without requiring the Vault model.

      The forum also contains a fascinating discussion about the tradeoffs you make with Vault and exactly which benefits are claimed. After reading it – I hope you might revise your stance that every model has BOTH good and bad sides :-)

      • cteveret   •  

        Thanks Thomas – I will check these discussions out!

        • Thomas Kejser   •  

          I forgot to share some advice, Todd (and sorry for misspelling your name).

          If you do find yourself in Vault land with no way back, there are a few tricks you can play to narrow the search space. Disclaimer: this will NOT generate the best queries – though in my experience it may make the optimiser create a “good enough” plan in more reasonable time.

          First of all: add OPTION (LOOP JOIN) to the queries. While it IS possible to perform BETWEEN queries with hash joins – the cost (and risk) of getting the join order wrong is too high.

          Second: use FORCESEEK hints on all tables. This again narrows the search space and will avoid expensive spools. If you are standard Vault indexed (especially if clustered on all join keys and from-dates) you should have a fully indexed path through the join tree.
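          (Not part of the original comment: a sketch of the two hints applied to the earlier example query, assuming seekable indexes exist on the join columns:)

          -- Illustration only: force nested loop joins and index seeks
          SELECT s.BusinessKey, s.SomeAttribute, SUM(b.OtherKey) AS SomeMeasure
          FROM BigTable AS b WITH (FORCESEEK)
          JOIN SmallTable AS s WITH (FORCESEEK)
              ON b.BusinessKey = s.BusinessKey
             AND b.TranDate >= s.FromDate
             AND b.TranDate <  s.ToDate
          GROUP BY s.BusinessKey, s.SomeAttribute
          OPTION (LOOP JOIN);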

          There are some additional tricks you can play to “unfold” history in a structured manner, but they are the subject of a full blog entry.

          May the force be with you – you will likely need it :-)

  6. Adam Machanic   •  

    You could have achieved even better results more easily by properly indexing “BigTable” ((BusinessKey, TranDate) INCLUDE (OtherKey, RowId)) and using a FORCESEEK hint on BigTable in the original query. (At least on my end, this made that query 25% faster than your revised version. It’s too bad that the QO requires a hint for that.)
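    (Not part of the original comment: a sketch of the index described above, with a made-up name:)

    -- Covering index so the temporal query can seek on (BusinessKey, TranDate)
    CREATE NONCLUSTERED INDEX IX_BigTable_BusinessKey_TranDate
    ON BigTable (BusinessKey, TranDate)
    INCLUDE (OtherKey, RowId);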

    I’m not sure how realistic this example is. I’m used to seeing temporal questions like “give me all of the rows that were active between date X and date Y,” whereas yours is “give me everything in the database.” If you’re only looking for active rows based on a narrow date range with respect to what’s in the database (which I think is much more common), you’re not going to have this issue. (Again, assuming appropriate indexes and somewhat thoughtful query construction.)

    • Thomas Kejser   •  

      Your first comment is exactly my point, Adam: optimizers don’t deal well with this construct and you often have to hint them if you go down this path. You could index it for that one join – but that won’t do you much good if you have to do a bi-temporal join to another table too.

      I agree that these queries often have the form of “slide me into this date range”, which helps immensely (but still gives you estimates that are way off when you join multiple tables together). Take this as an example:

      CREATE TABLE Product (BK, From, To, GroupBK)
      CREATE TABLE Group (BK, From, To, CategoryBK)
      CREATE TABLE Category(BK, From, To)

      Asking the question: “what did this product look like at a certain date?” is easy here (but estimates are way off already). But asking: “show me the history of all products by Group and Category” is quite tough, and the chance of getting the wrong join strategy and confusing the optimizer becomes significant – even on a small table.
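      (Not part of the original comment: a sketch of what the “history by Group and Category” query might look like over these three tables, with every join having to align overlapping validity intervals:)

      -- Sketch: each join must match overlapping validity intervals, compounding the misestimates
      SELECT p.BK AS ProductBK, g.BK AS GroupBK, c.BK AS CategoryBK, p.[From], p.[To]
      FROM Product AS p
      JOIN [Group] AS g
          ON g.BK = p.GroupBK
         AND p.[From] < g.[To] AND g.[From] < p.[To]    -- product version overlaps group version
      JOIN Category AS c
          ON c.BK = g.CategoryBK
         AND p.[From] < c.[To] AND c.[From] < p.[To];   -- and overlaps the category version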

      The query I used in this blog entry is there to illustrate the tradeoff between storing dimensions as bi-temporal relations to facts (as, for example, Data Vault recommends) and storing them as “materialized” surrogate keys. You really don’t want to serve up fact data to ETL tools or end users in this format if you can avoid it.

  7. peteads   •  

    In the original example, also consider the impact/cost of maintaining statistics on the changing dates for ToDate as your data evolves.
