Big/Big Table Joins

With the popularity of my last blog entry on Dangerous Joins, I felt inspired to write a bit more about join strategies. Thanks for participating and reading – there seems to be a large appetite for data modelling out there; this blog now gets over 7K unique visits every month.

Today, let us look at strategies for joining two big tables together.

 

What’s a big Table?

Before I move on to the test case, let me define what I mean by a “big table”.

Big Table ::= A table that is at least 2x larger than DRAM memory and where access to the table requires disk I/O

A large table exercises the database engine's ability to perform efficient I/O and utilize parallelism during access.

When I speak of a Big/Big Table Join, I mean a join between two tables where a large subset of both tables must be accessed to return the result.

Test Case

I will stay in the TPC-H model as it provides plenty of food for thought. This time, let us zoom in on the ORDERS and LINEITEM tables.

Recall that both of these tables can be generated to be Big. Together, they account for over 80% of the total data size of TPC-H.

In the following, we shall concern ourselves with this query:

SELECT L_LINESTATUS, O_SHIPPRIORITY, COUNT(*)
FROM ORDERS
INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
GROUP BY L_LINESTATUS, O_SHIPPRIORITY

The query is chosen in such a way that the return result is very small (4 rows) but that the amount of data that must be touched is very large. It is also chosen in such a way that data from both LINEITEM and ORDERS must be touched.

In the following sections, I will force different query plans using the OPTION (&lt;X&gt; JOIN) hint. Let's have a look at each query plan in turn and reason about it. To fully explore parallelism, make sure you work on a large enough dataset, or hack the statistics with UPDATE STATISTICS.
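To make the hint syntax concrete, here is the test query forced to a merge strategy; swapping MERGE for LOOP or HASH forces the other two strategies:

SELECT L_LINESTATUS, O_SHIPPRIORITY, COUNT(*)
FROM ORDERS
INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
GROUP BY L_LINESTATUS, O_SHIPPRIORITY
OPTION (MERGE JOIN)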

Merge Join

The merge join is perhaps the most intuitive query plan shape for a big/big join. We proceed in sorted order through both inputs, merging each row with its matching rows. The join part of the plan looks like this:

[Image: query plan showing the merge join with Repartition Streams operators]

Notice the Repartition Streams operators here (and that they are expensive!). SQL Server achieves intra-table parallelism by horizontally partitioning the dataset at runtime to fit the merge join. Notice in the query plan just how expensive this repartitioning is; more about this later.

Obviously, merge join only works well when both inputs are already sorted on the join key. In this case, SQL Server uses the clustered index on that key as the scan order.

The problem with a merge strategy for the big/big join is that merge is very fragile. By that I mean that it does not take much additional query complexity before it is no longer possible to merge without a sort. To illustrate, consider this example:

SELECT L_LINESTATUS, O_SHIPPRIORITY, COUNT(*)
FROM ORDERS
INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
INNER JOIN CUSTOMER ON O_CUSTKEY = C_CUSTKEY
WHERE C_NATIONKEY = 7
GROUP BY L_LINESTATUS, O_SHIPPRIORITY
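(For reference, the supporting index on O_CUSTKEY I mention below could be created like this; the index name is my own:)

CREATE NONCLUSTERED INDEX IX_ORDERS_CUSTKEY
ON ORDERS (O_CUSTKEY)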

I have indexed ORDERS with an index on O_CUSTKEY to help this query. C_NATIONKEY is moderately selective: about 1 in every 25 rows match a single nation. But the selectivity and the index don't really help us. We get this disaster of a query plan:

[Image: query plan where the merge join has been replaced by a loop join]

Goodbye merge, hello loop. Which brings me to our next candidate.

Loop Join

Going back to the original query and forcing a loop strategy, we now get:

[Image: parallel query plan with a loop join between ORDERS and LINEITEM]

The plan is still parallel because SQL Server horizontally partitions the input at runtime before entering the loop join. The data will now stream up the query tree, but recall from my previous blog that long join chains further up the tree will force the operators into a specific sequence due to the dependencies introduced by the joins. Let's consider an example of this:

SELECT L_LINESTATUS, O_SHIPPRIORITY, COUNT(*)
FROM ORDERS
INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
INNER JOIN PART ON L_PARTKEY = P_PARTKEY
INNER JOIN CUSTOMER ON O_CUSTKEY = C_CUSTKEY
WHERE C_NATIONKEY = 7
AND  P_SIZE = 46
GROUP BY L_LINESTATUS, O_SHIPPRIORITY

Here, our previous filter on C_NATIONKEY has been expanded with a filter on P_SIZE in PART. P_SIZE happens to be extremely selective (around 1:1000). But if we stay with loop joins, here is the query plan we get:

[Image: loop join query plan where the PART filter is applied late in the tree]

 

This is not good! The very selective filter from the join to PART is applied only after the expensive join between ORDERS and LINEITEM has been executed. At that point, we have already done too much work. While the merge join could proceed in large chunks, the loop join requires constant lookups into the B-tree, which is very expensive; you want to do as few of them as possible, and failing to push the filter down is really unhelpful to the plan.

Now, there are ways to partially benefit from the filter on P_SIZE, namely to use bitmap/Bloom filters generated while scanning (and later hash joining) PART. If we allow hash joins, we get the following plan (the red arrow hides smaller, irrelevant joins):

[Image: plan with hash joins against PART and CUSTOMER generating a bitmap (red circle), feeding the loop join into LINEITEM (red arrow)]

If you inspect the clustered index seek into LINEITEM, you will find the bitmap (marked with the red circle above) being pushed into the seek:

[Image: clustered index seek into LINEITEM with the bitmap filter pushed into the seek predicate]

While this allows us to collect the filters for the joins before we reach the red arrow above, it still does not change the fact that we have to loop into LINEITEM too many times (the same number of times as in the loop-only plan). Normalisation creates dependencies which make heavy filtering work less efficient!
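If you want to reproduce this mixed plan, one way is to constrain the optimiser to loop and hash joins only; listing several join hints in OPTION restricts the whole query to those join types:

SELECT L_LINESTATUS, O_SHIPPRIORITY, COUNT(*)
FROM ORDERS
INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
INNER JOIN PART ON L_PARTKEY = P_PARTKEY
INNER JOIN CUSTOMER ON O_CUSTKEY = C_CUSTKEY
WHERE C_NATIONKEY = 7
AND P_SIZE = 46
GROUP BY L_LINESTATUS, O_SHIPPRIORITY
OPTION (LOOP JOIN, HASH JOIN)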

The last query plan leads us to the final option for Big/Big joins…

Hash Join

There is a final join option that may seem tempting at first: the hash join. Forcing hash joins, we get the following plan when joining ORDERS and LINEITEM:

 

[Image: parallel query plan with a hash join between ORDERS and LINEITEM]

Again, a parallel plan. Notice that the cost of the query is almost fully dominated by the join itself, not by scanning the tables (as was the case with loop joins).

Hash joins are an amazing join trick, but only under one condition: the tables must be small enough to fit in memory. Once a hash table spills to disk (as it will when we are big/big joining), performance goes south.

Just to give you an idea of how expensive hash spills are, I ran a comparison on a small 1 GB dataset (with available memory artificially lowered) on an I/O system capable of 400 MB/sec:

Memory Condition    Runtime
No Spill            3 sec
Spill               > 5 minutes
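In case you want to reproduce the spill yourself: one way to artificially lower the memory available to a query is Resource Governor. A minimal sketch (the pool and group names are my own, and the percentages are purely illustrative):

-- Cap the memory grant so the hash join is forced to spill
CREATE RESOURCE POOL SpillPool WITH (MAX_MEMORY_PERCENT = 5)
CREATE WORKLOAD GROUP SpillGroup
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 10)
USING SpillPool
ALTER RESOURCE GOVERNOR RECONFIGURE
-- A classifier function is then needed to route the test session into SpillGroup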

In other words: Hash joins as a big/big join strategy are a risky game.

Summary of Join Strategies

The big message here is that there really is no good join strategy for a big/big join. Of the strategies available, the loop join is by far the safest. However, as we have seen, the loop join carries a heavy overhead, and when combined with normalisation it has a hard time benefitting from filters applied to the big tables.

Big/Big joins are one of the great unsolved problems of databases. Whenever you see them in a data model, tread with extreme care.

All this being said, there are ways to scale big/big joins and achieve strong co-location in an MPP system. But only under certain conditions, which will be the subject of a future blog entry.

Don’t join: Column Store Scan

There is of course another way to look at this problem: Don’t big/big join in the first place.

This requires de-normalising ORDERS and LINEITEM into a single fact table. In the following test, I have to be kind to the row store to not make the comparison too glaring. Let me therefore narrow ORDERS and LINEITEM down to these columns only (if not, the row store is going to be destroyed completely):

The point I am trying to make is already visible at 1 GB, and because I am not at home with my server right now, I will use a small dataset to prove it.

 

LINEITEM (row store):
L_ORDERKEY, L_LINESTATUS, L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_SHIPDATE, L_EXTENDEDPRICE

ORDERS (row store):
O_ORDERKEY, O_SHIPPRIORITY, O_CUSTKEY

I created ORDERS_1 and LINEITEM_1 as narrow tables containing only the above columns. Both tables use the ORDERKEY as the leading column in a unique clustered index for optimal merge join performance. To save storage and emulate a real-life situation, I have page compressed the data and set the fill factor to 100%.
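The index creation might look like this; the index names are my own, and I assume L_LINENUMBER is kept in LINEITEM_1 to make the key unique:

CREATE UNIQUE CLUSTERED INDEX CIX_LINEITEM_1
ON LINEITEM_1 (L_ORDERKEY, L_LINENUMBER)
WITH (DATA_COMPRESSION = PAGE, FILLFACTOR = 100)

CREATE UNIQUE CLUSTERED INDEX CIX_ORDERS_1
ON ORDERS_1 (O_ORDERKEY)
WITH (DATA_COMPRESSION = PAGE, FILLFACTOR = 100)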

De-normalising, we get the table I will call SALES_1:

SELECT *
INTO SALES_1
FROM ORDERS_1
INNER JOIN LINEITEM_1 ON L_ORDERKEY = O_ORDERKEY

For reasons which shall become obvious in a future blog, I chose this sort order to cluster SALES_1 on:

CREATE CLUSTERED INDEX CIX
ON SALES_1 (L_EXTENDEDPRICE, L_PARTKEY, L_SUPPKEY)

After this, I create a column store index covering all columns (the order of the columns in this statement is not important):

CREATE NONCLUSTERED COLUMNSTORE INDEX CSI
ON SALES_1 (L_LINESTATUS, O_SHIPPRIORITY, O_ORDERKEY,O_CUSTKEY,
L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_SHIPDATE, L_EXTENDEDPRICE)

Let's take stock of how much storage we have used at this point:

Table / Index         Index Type      Space
LINEITEM_1            Row cluster     152 MB
ORDERS_1              Row cluster     17 MB
Total, Normal Form                    169 MB
SALES_1               Row cluster     154 MB
SALES_1               Column store    103 MB
Total, Kimball Form                   103 MB / 257 MB

This table may surprise you. Why is the row-clustered, de-normalised table smaller than the sum of the normalised data? A few reasons:

  1. The normalised data has an extra column (the ORDERKEY, which must be present in both tables)
  2. The normalised data is clustered (to enable fast joins) on ORDERKEY. The de-normalised data does not have to worry about the join and can hence cluster on a more efficient sort order that leads to better compression.

Note that the total size of the table with the column store is still quite a bit larger than the normalised form. This is because SQL Server 2012, unlike for example Vertica, does not allow clustered column stores. But we see the potential to shrink the table, even when de-normalised, to an even smaller size than the normalised structure.

Pitting Column Stores against Row Stores

Now that we have two comparable models, we can start running some test queries. First stop: take the low-cardinality, two-column query we have used so far and benchmark it against the equivalent de-normalised form:

SELECT L_LINESTATUS, O_SHIPPRIORITY, COUNT(*)
FROM SALES_1
GROUP BY L_LINESTATUS, O_SHIPPRIORITY

Measuring, we get:

Model        CPU time    Physical Read MB
Normalised   6228 ms     174
Kimball      16 ms       0.1

Surprised? You shouldn't be. The query above is optimal for column stores, as L_LINESTATUS and O_SHIPPRIORITY have very low cardinality and therefore get really good run-length encoding. Obviously, the pseudo-merge cost of the column store is not very expensive.

Let's try something more ambitious: how about the query with the two filters used in the section on loop joins? In the Kimball model it looks like this:

SELECT L_LINESTATUS, O_SHIPPRIORITY, COUNT(*)
FROM SALES_1
INNER JOIN PART ON L_PARTKEY = P_PARTKEY
INNER JOIN CUSTOMER ON O_CUSTKEY = C_CUSTKEY
WHERE C_NATIONKEY = 7
AND  P_SIZE = 46
GROUP BY L_LINESTATUS, O_SHIPPRIORITY

And the results:

Model                CPU time    Physical Read MB
Normalised (loop)    1251 ms     975
Kimball              234 ms      121

The large reads for the normalised model come from the constant loop lookups that have to occur because of the poor filter pushdown that results from the normalisation.

And I believe this is where they say: “I rest my case”…

Summary

In this blog, we have first seen the theory behind why big/big joins are very hard to do. As we explored the traditional database options for solving this problem, we saw the problematic query plans you have to deal with. I will be blogging about some of the ways you can address this problem in MPP under certain favourable conditions of co-location. But at this point, I hope it is clear that even in a scale-out system there really isn't a good, scalable way to solve the generic problem of the big/big join.

We have also seen how column stores, even on de-normalised data, compress better, perform faster, and spend fewer CPU cycles than the equivalent normalised design. Furthermore, we have seen how normalised tables, because they are forced to implement indexing on the keys that support joins, can sometimes (in the presence of database compression) be larger than de-normalised tables, EVEN when there is no column store backing the de-normalised model.

I hope this blog has given you food for thought and that you will at this point see that data models which implement big/big joins should only be used when all other options have been exhausted.

  7 Comments

  1. Pingback: How Vertical Partitioning and Deep Joins Kill Parallelism « Olipa kerran Bigdata

  2. cnu   •  

    Is your de-normalization a big-big join? How long does that take?

    Very interesting topic!

    • Thomas Kejser   •  

      Hi cnu

      That is an odd way to see it, but let us see where this line of thinking leads.

      We can consider the turning of rows from a normalised into a de-normalised form a big/big join. However, this assumes that all rows arrive in the database at once. If rows arrive in streams (for example, via ETL), the big/big join turns into a small/big or a small/small join.

      Then there is the issue of when to pay the big/big cost. Do you pay it once (to de-normalise) during load, or every time you run a query? Interestingly, this is the OLTP vs. data warehouse question: do you optimise for minimum-latency inserts and small query results, or for high-throughput loading and big query results?

  3. Øyvind Strøm   •  

    Hi Thomas

    Very interesting read, and good examples.

    • Thomas Kejser   •  

      Hi Øyvind

      Long time no see, hope life is treating you well. Glad to hear you found the article interesting and hopefully also useful for making data modelling decisions.

      I am still looking for datasets that I can analyse for optimal column stores indexing strategy. Would you happen to have any you can share?

      Thanks
      Thomas

  4. hennie7863   •  

    Wow Thomas ! Although i haven’t read you whole blogspot it seems a thorough investigation. Thank you for sharing. So you’re saying that from a performance viewpoint it’s better to denormalize?! How conclusive are your investigations? Are these conclusions applicable in every situation of joining? are these conclusions your cons against datavault and anchormodelling?

    I know that you’re debating with Dan about this issue. Okay performance is big point when datawarehousing. But is this the only important aspect of a dwh? Isn’t dwhing a mix of multiple aspects like structure, adaptibility, auditability, etc and performance. If you plus on one aspect you min on another?!

    You give some good food for thoughts!

    Greetz,
    Hennie

    • Thomas Kejser   •  

      Hi Hennie

      Thanks for reading and participating. Happy to get the thoughts going. To answer your questions:

      It is not ALWAYS better to de-normalise. As I mentioned in previous blogs, there are cases (optimising of small writes in OLTP for example) where normalisation makes a lot of sense. But if what you are building are large table structures that have to be joined frequently with each other, you really have to think VERY hard before you normalise.

      With regards to how important performance is: it is not the ONLY important aspect. But my experience of data warehousing shows that if you cannot perform, you fail. Hence, it is a necessary condition for success, but not the only one. The question then becomes: how do you achieve the other things while maintaining performance? My discussions in the TimeExtender forum provide some directions on how to do that:

      http://thebibackend.wordpress.com/2012/06/11/thoughts-on-data-vault-and-automation/

      With regards to conclusive investigations: query optimisation is a very large area and this blog has only touched a small part of it so far. But I think the burden of evidence now rests on anyone claiming that normalisation leads to better big/big performance to prove it, and not just claim it point blank. As I also pointed out in my comments to Dan in my previous blog entry, the simple maths he has done does not map onto reality (the model of computation he uses is too simplified to make real predictions, as I think I have quite clearly shown here).

      With regards to 6NF and anchor models: I actually think an argument could be made for using some of these techniques for the DIMENSION data in the warehouse (the small tables). I describe some lightweight anchor-like models in my posts about map tables. However, I think the buck stops at dimension tables: one of my biggest issues with Data Vault is that it does not restrict its grand claims about modelling to small tables, but makes claims about the performance of large tables which, frankly, I don't think are founded in reality.

      Cheers
      Thomas
