
The Ascending Key Problem in Fact Tables – Part Two: Stat Job!

In my last post I described a common problem with statistics in data warehouses. I also directed you to a trace flag that partially solves the problem. There is an issue with this trace flag: at the time of writing this blog, it does not work on partitioned tables. Considering that large tables tend to be partitioned, this could be an issue.

Recall from my last post that old statistics will not contain information about the newly inserted rows until the stats are updated. Queries asking for data outside the bounds of the histogram will estimate 1 row and typically end up with a bad, serial plan. The trace flag works, or doesn't, by trying to intelligently guess how many rows fall into such an "out of bounds" range instead of assuming a value of 1.
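To make the symptom concrete, here is a minimal sketch using the Fact table, the date_sk column and the scan_me index from the example later in this post (the specific date is just illustrative): after a load inserts rows for a date beyond the highest histogram step, a filter on that date is estimated at 1 row until the stats are updated.

/* Sketch only: stats on Fact(date_sk) were last updated before 20010111 existed */
SELECT COUNT(*)
FROM Fact
WHERE date_sk = 20010111   /* beyond the last histogram step: estimated at 1 row */

/* Inspect the histogram to see where it currently ends */
DBCC SHOW_STATISTICS (Fact, scan_me) WITH HISTOGRAM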

There is another approach that does not rely on the optimizer making guesses: you can intentionally deceive the optimizer and lie about the distribution. Recall that we want the "desired DW plan". Alas, as with all lying, this exposes you to a lot of risk.

Disclaimer: What I am about to show you is a "don't do this at home" trick. We are going to see some features in SQL Server used in ways they were not intended for, and we are going to take control of the optimizer in ways it is not designed to handle. If you go down this path, you are on your own! Don't call support with your issues. Just because I showed it to you doesn't mean you can put this in production.

Consider this example an exploration and education of the internals of SQL Server. And think of it in the context of what you might consider asking for, next time you file an MSConnect item.

On the issue of MSConnect: if you care about big (any TB-sized) DW workloads, please, please file and vote on items that make your working day hard. It helps if you describe which company you work for and what the impact on your business is of not having the feature or "fix" (I know this is not always possible). I know it takes time, and I appreciate you are busy, but Microsoft does listen, and they listen more closely when you can make it clear why the feature is needed and how its absence hurts you (I call this the "blood on the table" approach). If your favourite Connect item is about DW functionality, shoot me a mail (include the Connect ID please) to let me know. I may well put up a big fight for items that align with my mission in MS. Please keep in mind: I don't care about client tools at all; I fight for server side features. We have a lot of people pushing on client side features and they are very important too; I have just chosen to specialize.

And with that said, let us dig in:

Recall that our issue was that the histogram lacked rows/buckets containing information about a range of data. The histogram in SQL Server has up to 200 rows, but even if we could add a single row to the histogram on every insert, we would soon run out of them on a large table.

But how about this idea: fake a histogram where every year we could ever want to store (say, 100 of them) contains a lot of data. In fact, make sure that every date range or value we select contains a tremendous number of rows. One way to create such a histogram is to grow the fact table very large, capture the stats, and then make the table smaller again. But if your fact table is not born with the right proportions, growing fat to get the right histogram proportions and then slimming down is not a very nice approach, is it?

Our Fact table histogram is not pretty as it is, but how about doing some precision surgery instead? For that, we need an implant. First, we create a table with the same index structure as Fact:

CREATE TABLE HackStats
(
date_sk INT NOT NULL
, payload CHAR(4000) NOT NULL DEFAULT 'X'
)

CREATE CLUSTERED INDEX scan_me ON HackStats(date_sk)

Now, let us prepare the implant. We need something that has these characteristics:

  1. Enough histogram divider rows to cover all possible data values we could ask for.
  2. A very large value for RANGE_ROWS and AVG_RANGE_ROWS in each histogram row.
  3. A value of 1 for the DISTINCT_RANGE_ROWS in each histogram row.

Ad 1)

    Make sure that we do the implant once, and that it lasts for the lifetime of Fact.

Ad 2)

    Make sure that there is sufficient size and elasticity for any range of dates we might touch.

More about 3) later…

With a little reverse engineering (no, I did not look at the source), we can create a histogram that matches our requirements. It does not take a lot of rows in HackStats. Here we go:

DECLARE @y INT, @r INT

SET @y = 1950
WHILE @y < 2050 BEGIN
    /* make the stats dense (low distinct count) within the year */
    SET @r = 0
    WHILE @r < 100 BEGIN
        /* Make sure the equal rows in the histogram are never hit, by using
           invalid dates */
        INSERT HackStats (date_sk) VALUES (@y * 10000)
        INSERT HackStats (date_sk) VALUES (@y * 10000 + 5000)
        SET @r = @r + 1
    END
    SET @y = @y + 1
END

You may have been wondering why I added a payload column to HackStats. For small tables (measured in pages), SQL Server will do a full sample, even if we ask for a small percentage, and the engine is smart enough to know that it has seen all pages. A fully sampled table cannot be hacked and made artificially pretty in the way you will soon observe. The HackStats table has to be small, but still large enough (in our case around 20K pages) that not all rows will be sampled. And that is why we had to pad the rows a bit.
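If you want to sanity-check that your copy of HackStats is big enough to dodge the full sample, a quick look at the page count before hacking will tell you (a sketch; sp_spaceused or the partition stats DMV both work):

/* Sketch: confirm HackStats has enough pages (roughly 20K here) that a
   99 percent sample will not silently turn into a full scan */
SELECT SUM(in_row_data_page_count) AS data_pages
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('HackStats')

/* or simply */
EXEC sp_spaceused 'HackStats'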

We can now hack like this:

 /* prepare the implant */

UPDATE STATISTICS HackStats WITH ROWCOUNT = 199000000
UPDATE STATISTICS HackStats WITH SAMPLE 99 PERCENT

The first statement above “fakes” a large table by telling SQL Server that there are a LOT of rows in the table. The second statement samples the distribution of those rows (which we have cunningly crafted to have just the right dimensions, so we need to sample almost all of them).

Let us check the quality of the implant:

DBCC SHOW_STATISTICS (HackStats, scan_me)
WITH STAT_HEADER, HISTOGRAM

Output (your results may vary a little, depending on which rows got sampled; rerun the sample if you don't like the output):

[Screenshot: STAT_HEADER output]

And:

[Screenshot: HISTOGRAM output]

Notice the neat distribution. Any query asking for a year range will now get a high estimate for scanning the table. SQL Server is smart enough to boost the row estimates based on how many total rows it believes are in the table.
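If you want to see this with your own eyes, a quick sketch: ask HackStats for a whole year and look at the Estimated Number of Rows on the clustered index scan in the estimated plan (no need to actually run it).

/* Sketch: with the faked ROWCOUNT in place, a one-year range should now be
   estimated in the millions of rows instead of just the sampled handful */
SELECT COUNT(*)
FROM HackStats
WHERE date_sk BETWEEN 20010101 AND 20011231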

And now, for the dirty little trick (Brent Ozar, if you are reading this, just don't say it :-) ):

Let us transplant the faked, big stats to the ugly Fact table. First, take the implant:

DBCC SHOW_STATISTICS (HackStats, scan_me)
WITH STATS_STREAM

This gives an output somewhat like this (your result will vary):

  0x01000000010000000000000000000000B1AB0052000000….

We don't really have to worry about what is in there; it is some binary representation of the stats. The feature to read the binary stats was put into SQL Server to allow database clones – and we are abusing it now. But hey, we are in the business of precision surgery: copy the bit values to the clipboard, paste, and execute:

UPDATE STATISTICS Fact scan_me WITH
STATS_STREAM = 0x01000000010000000000000000000000B1AB0052000000….
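If you prefer to skip the clipboard, something along these lines can capture the stream and apply it in one go. This is only a sketch: it leans on the same undocumented STATS_STREAM surface, and the column layout of the DBCC output (here assumed to be the stream plus row and page counts) may differ between versions.

/* Sketch: capture the stats blob from HackStats and transplant it to Fact
   without a manual copy/paste. Undocumented and unsupported, as above. */
DECLARE @stream VARBINARY(MAX), @sql NVARCHAR(MAX)
DECLARE @blob TABLE (stats_stream VARBINARY(MAX), stream_rows BIGINT, data_pages BIGINT)

INSERT @blob
EXEC ('DBCC SHOW_STATISTICS (HackStats, scan_me) WITH STATS_STREAM')

SELECT @stream = stats_stream FROM @blob

SET @sql = N'UPDATE STATISTICS Fact scan_me WITH STATS_STREAM = '
         + CONVERT(NVARCHAR(MAX), @stream, 1)  /* style 1 keeps the 0x prefix */
EXEC (@sql)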

And a finishing flourish, to make the fact table look really big:

/* Make the table expensive to access in the future */

UPDATE STATISTICS Fact scan_me WITH PAGECOUNT = 200000000

Validate the result:

/* Check the patient */

DBCC SHOW_STATISTICS (Fact, scan_me)
WITH HISTOGRAM

(output similar to HackStats)

And let us check our problem query again:

DBCC FREEPROCCACHE
SELECT C.shoe_size, SUM(Measure)
FROM Fact F
JOIN DimCustomer C ON F.customer_sk = C.customer_sk
WHERE F.date_sk = 20010111
AND C.shoe_size = 43
GROUP BY C.shoe_size

Eureka: the “Desired DW Plan”:

[Screenshot: query plan showing the "Desired DW Plan"]

Notice something interesting: the Estimated Operator Cost of accessing the table is about 736. That is a REALLY expensive access. Just the way we like it, actually; we wouldn't want the optimizer to get any ideas that this table is cheap to access. Our little trick of hacking the page count gave our Fact patient quite a bit of confidence.

I deferred talking about why we need a low number of distinct rows in the histogram. For now, I leave this as an exercise to the reader. Hint: it is related to the bitmap filters and the way Fact is accessed.

And remember folks, you do this at your own risk, you are on your own and in unsupported land here… Don’t come crying to support if you break something. Ask for a connect item instead.

  5 Comments

  1. @sql_handle   •  

    Excellent! This is a good tool to add to the toolbox when I’m investigating behavior… however for production use it would seem I’m likely to be far better off by freezing a ‘good’ plan, rather than trying to hack stats.
    Would you agree? Currently I’m trying to shape SQL Server stats behavior with T4199 (to improve some cases of too small samples), T2371 to lower auto-update threshold on very large tables, T2389 + T2390 to mitigate ascending key problem for non-partitioned tables, T4139 to improve the mitigation, and T2388 when evaluating ascending keys. In addition to all that, evaluating which stats keys should be norecompute in favor of periodic fullscan update, or enhanced with filtered statistics…
    That seems a bit much :-)
    Makes me hope that SQL Server plan guides and their management become more mature like the Oracle 12c set of plan management features… in part that plan guides may lose their current stigma in the SQL Server community.

    • Thomas Kejser   •     Author

      Hi SQL_handle

      I largely agree with your comment (and thanks for all the details too). However, plan guides in SQL Server are simply hopeless, so the only way to get the “guaranteed” good plan is to use query hints. OPTIMIZE FOR UNKNOWN comes to mind. Unfortunately, not many BI tools allow the addition of hints.

      Another approach I have found to be decent is to simply remove statistics and turn off auto create/update stats. Once a table reaches a certain size, there isn't much hope for getting good stats anymore.
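      For concreteness, a minimal sketch of what that looks like (the database name DW and the statistics name stat_date_sk are placeholders):

      /* Sketch: stop SQL Server from creating or refreshing stats on its own */
      ALTER DATABASE DW SET AUTO_CREATE_STATISTICS OFF
      ALTER DATABASE DW SET AUTO_UPDATE_STATISTICS OFF

      /* Drop an auto-created statistic outright... */
      DROP STATISTICS Fact.stat_date_sk

      /* ...or keep it, refresh it on your own schedule, and freeze it */
      UPDATE STATISTICS Fact scan_me WITH FULLSCAN, NORECOMPUTE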

      The REALLY nasty trick you can play is to reverse the order of the SK_Date column. This way, you can make the stats "smear" out over the entire key space. Something similar can be achieved with other dimension keys by reversing the bits in the key before adding it to the dimension.

  2. Pingback: Statistics Starters Presentation: Scripts, the Database, and Some Answers - Erin Stellato

  3. Brent Ozar   •  

    HAHAHA, I love this trick! I was sitting here reading it (thanks to a link from James Rowland-Jones) and busted out laughing when I saw my name. This is awesome. Nice work as always, sir.

    • Thomas Kejser   •  

      Thank you sir. Have fun experimenting, it probably wouldn’t take much to create a little tool that “fixes” the distribution to your preference.
