Compression

Quantifying the Cost of Compression

Last week, at SQL Saturday Exeter, I did a new Grade of Steel experiment: to quantify just how expensive it is to do PAGE compression.

The presentation is a 45-minute show, but for those of you who were not there, here is a summary of the highlights from the first part.

My tests were all run on the TPC-H LINEITEM table at scale factor 10. That is about 6.5GB of data.
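The comparisons below are between two variants of the table: one uncompressed and one rebuilt with PAGE compression. The compressed variant comes from a standard rebuild along these lines (a sketch – the schema and object names are assumptions, not the exact DDL used in the test harness):

    -- Rebuild LINEITEM with PAGE compression (sketch; object names assumed).
    ALTER TABLE dbo.LINEITEM
    REBUILD WITH (DATA_COMPRESSION = PAGE);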

Test: Table Scan of Compressed Data

My initial test is this statement:
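(The exact statement from the talk is not reproduced here; the sketch below is a query in the same spirit – it touches the column data for every row but aggregates down to a single output row – with column choices assumed from the standard TPC-H LINEITEM schema.)

    -- Scan every row and extract column values, but return only one row.
    SELECT COUNT_BIG(*)    AS row_count,
           SUM(L_QUANTITY) AS total_qty,
           MAX(L_SHIPDATE) AS last_ship
    FROM   dbo.LINEITEM;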

Because the statement returns only one row, the measured result is not drowned out by client transfer time. Instead, the raw cost of extracting the columns from the compressed format can be quantified.

The result was quite shocking on my home 12 core box:

[Image: table scan runtimes, compressed vs. uncompressed, reading from disk and from DRAM]

Even when doing I/O, it still takes quite a bit longer to scan the compressed format. And when scanning from DRAM, the cost is a whopping 2x.
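For anyone wanting to reproduce the disk versus DRAM comparison on a test box, the usual pattern is to flush the buffer pool before the cold run and time both runs (a sketch – only run DBCC DROPCLEANBUFFERS on a machine you own):

    -- Cold run: drop clean pages from the buffer pool to force physical reads.
    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;

    SET STATISTICS TIME ON;
    SELECT COUNT_BIG(*) FROM dbo.LINEITEM;  -- cold: reads from disk/flash
    SELECT COUNT_BIG(*) FROM dbo.LINEITEM;  -- warm: reads from DRAM
    SET STATISTICS TIME OFF;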

A quick xperf run shows where the time goes when scanning from memory:

[Image: xperf trace of the in-memory scan, showing where the CPU time goes]

Indeed, the additional CPU cost explains the effect. The code path is simply longer with compression.

Test: Singleton Row Fetch

By sampling some rows from LINEITEM, it is possible to measure the cost of fetching pages in an index seek. This is the test:
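(The exact batch is not reproduced here; the sketch below has the intended shape – a set of sampled keys driving nested loop seeks into LINEITEM – with the temp table, sample size and aggregate as assumptions. L_ORDERKEY and L_LINENUMBER form the primary key in the standard TPC-H schema.)

    -- Grab a random sample of primary key values to drive the seeks.
    SELECT TOP (10000) L_ORDERKEY, L_LINENUMBER
    INTO   #SampledKeys
    FROM   dbo.LINEITEM
    ORDER  BY NEWID();

    -- Seek into LINEITEM once per sampled key; the aggregate keeps the result
    -- to a single row so client transfer time stays negligible.
    SELECT MAX(L.L_EXTENDEDPRICE)
    FROM   #SampledKeys AS S
    JOIN   dbo.LINEITEM AS L
           ON  L.L_ORDERKEY   = S.L_ORDERKEY
           AND L.L_LINENUMBER = S.L_LINENUMBER
    OPTION (LOOP JOIN);  -- keep the plan dominated by the index seeks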

This gives us the plan:

[Image: query plan for the singleton fetch test]

This plan has the desired characteristic: the runtime is dominated by the seek into the LINEITEM table.

The numbers again speak for themselves:

[Image: singleton fetch runtimes, compressed vs. uncompressed]

And again, the xperf trace shows that this runtime difference can be fully explained from longer code paths.

Test: Singleton UPDATE

Using the now familiar pattern, we can run a test that updates the rows instead of selecting them. By updating a column that is NOT NULL and an INT, we can make sure the update happens in place. This means we pay the price to decompress, change and recompress that row – which should be more expensive than reading. And indeed it is:

[Image: singleton UPDATE runtimes, compressed vs. uncompressed]
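For completeness, the update is roughly of this shape (a sketch – it reuses the hypothetical #SampledKeys table from the previous sketch and picks L_SUPPKEY as an example of a NOT NULL INT column):

    -- Updating a fixed-width NOT NULL INT column keeps the change in place:
    -- the row is decompressed, modified and recompressed on the same page.
    UPDATE L
    SET    L.L_SUPPKEY = L.L_SUPPKEY + 1
    FROM   dbo.LINEITEM AS L
    JOIN   #SampledKeys AS S
           ON  L.L_ORDERKEY   = S.L_ORDERKEY
           AND L.L_LINENUMBER = S.L_LINENUMBER;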

Summary

Drawing on a few of the tests from my presentation, I have shown that PAGE compression carries a very heavy CPU cost for each page access. Of course, not every workload is dominated by accessing data in pages – some are more compute heavy on the returned data. However, in these days when I/O bottlenecks can easily be removed, it is worth asking whether the extra CPU cycles spent to save space are worth it.

It is also possible to demonstrate another expected result: locks are held longer when updating compressed pages (thereby limiting scalability if the workload contains heavily contended pages). But that is the subject of a future blog entry.

Comments

  1. Henrik Staun Poulsen   •  

    hi Thomas,
    I too would like a comparison to ROW Compression.
    My current experience is that it compresses nicely, and does not cost too much CPU.
    But then again, we are on “spinning rust” (great expression), and we do normally have extra CPU resources available.

  2. datatake   •  

    Thomas, regarding “rules of thumb”, surely this has to carry a large caveat about the I/O subsystem and how it is configured, the storage architecture topology, etc. The performance benefits of flash have been expounded for a while now, but the fact of the matter is that on my travels I see a lot of storage subsystems where you don’t even get good latencies for spinning rust – you should get 5~10 ms for reads and 10~20 ms for writes, yet sometimes I see I/O response times in the 40 ms plus region. There are certain models of storage array in which writes are throttled back when the write cache gets full, there is tiering to consider; all in all, lots of different factors. I’ve seen SANs so badly configured that a pair of SATA II drives in my home PC would have delivered better throughput and lower latency.

    On the subject of code paths, what would be interesting would be the use of a certain “leading brand flash PCIe card” that can be configured to act as an extension to main memory (if I’m correct), and how the performance of SQL Server compares when it uses this as opposed to tempdb. The hypothesis being that the code path to main memory should be shorter than that to tempdb.

    • Thomas Kejser   •  

      Hi Chris

      I think my position on using shared storage for databases is quite well known at this point :-) If you are measuring above 10ms latencies on your storage – then it is important to realise that you ARE in the business of putting lipstick on a pig. To a large degree – this is the type of problem that PAGE compression sets out to solve.

      What really puzzles me is how these shops rationalise running their infrastructure like this. While there may be a (highly questionable) business case for sharing storage – how does that stack up against wasting CPU core licenses on the database that sit there unused or burn precious cycles doing decompression? And how does it stack up against the countless hours required to tune the database to avoid I/O?

      How did we ever convince ourselves that splitting the DBA function and the storage function into different departments was a good idea?

      (on a mobile device. Spelling mistakes and brevity may happen)


      • gavinpayneuk   •  

        Unfortunately, there are many infrastructure design requirements that still require us to use shared storage, the best we can do for now is to put NAND in the shared storage.

        • Thomas Kejser   •  

          Which design requirements, Gavin? And if you DO put NAND in shared storage, you had better have an array that can keep up and doesn’t add tens of milliseconds of latency.

  3. Pingback: SQL Saturday #194 Exeter – A presenters review - SQL Server Blog - SQL Server - Telligent

  4. K.Korobkov   •  

    What was the SQL Server buffer cache size for the test? Was it bigger than the 6.5 GB data set (I guess that is the uncompressed size)? It would be interesting to see the results with the buffer cache set at 4 GB, for example – i.e. when the uncompressed data set is bigger than the buffer cache but the compressed set still fits.

  5. Staffan   •  

    How about row compression, does it have the same effect?
    /Staffan

  6. datatake   •  

    Hi Thomas,

    Assuming that you have not got a full-blown SAN or NAS at home – although I would not put this past you ;-). The length of the code path is fixed: if you use compression, be it at page or row level, the code path executed will always be the same. I get the impression that the real ‘moral’ of this story is that as you move to low latency storage, the ‘tipping point’ at which it becomes beneficial to use compression is not so much moved as removed completely, because the I/O latency becomes so low that it is dwarfed by the code path overhead of performing the compression. If you were to use disk based storage, I would consider that compression is still very much relevant.

    Referring back to Mark’s question, am I correct in thinking that the number of cores comes into play for plans that can be parallelised, or if the nature of the workload means that spin lock overheads come into play? Also, you need a certain amount of storage I/O bandwidth to drive the CPUs to a given utilisation; the rough rule of thumb is that a core can consume roughly 200MB a second, though I believe this is more like 260MB/s for the Sandy Bridge-E Xeons – another reason why the number of cores might affect the scan rate.

    It would be nice to know the exact setup of your ‘lab’: memory capacity and speed, Fusion-io card models and capacities, and whether you have them configured for maximum performance or capacity – I would suspect the former.

    Regards,

    Chris

    • Thomas Kejser   •  

      Hi Chris

      Good points. This config is with an ioDrive Duo – my slides from Exeter detail the full setup. Pretty straightforward really; total I/O speed is 1.5 GB/sec.

      I don’t actually believe this result applies only to low latency storage. For table scans, as long as you can keep up with the throughput (easy, even on spinning rust), you can keep the CPU busy.

      The Fast Track rule of thumb of 200-250MB/sec per core is for compressed data, which means that number already takes the CPU cycles spent on decompression into account. The uncompressed number is more like 400MB/sec – I did this test on an old Xeon. With Sandy Bridge, I would not be surprised to see that being a lot higher.

      The core count had other implications on the scale curve, but that is an entire blog entry of its own :-)

      Sent from my C64


  7. John Alan   •  

    Thomas,

    In several cases, page compression can make the difference between a frequently used table being in memory (more often than not) and being pulled from disk.

    It would be useful to have a rule of thumb for whether, in those cases, page compression is automatically beneficial irrespective of the longer code path (assuming it provides decent space gains, over 50% perhaps).

    • Thomas Kejser   •  

      Hi John

      For a “do I/O” vs “fetch from memory” comparison, my intuition would be:

      - Table scan: NONE still faster
      - Singleton: compressed faster (assuming you unpack in memory)

      Sent from my C64


  8. gavinpayneuk   •  

    Hello Thomas,

    An interesting analysis of the CPU cost of page compression, thank you.
    I wonder how the overall execution time of the query differed when page compression was used vs when not used?

    I have always agreed with the general principle that, while page compression increases the CPU cost of a query, we have lots of CPU time compared to I/O bandwidth, so exchanging CPU resource for fewer I/O reads is an overall gain that reduces the total execution time of the query. Is this a fair statement?

    Regards,

    Gavin

  9. Marco Russo   •  

    Very interesting! Have you seen differences caused by higher or lower number of cores?

    • Thomas Kejser   •  

      Marco, I don’t understand the question?

      Sent from my C64


      • Marco Russo   •  

        I was wondering whether the number of cores affects the scan time or not.
        Moreover, should I assume you used FusionIO for your I/O test?
        Thanks!

        • Thomas Kejser   •  

          Cores do affect scan times, as expected. The core count is kept constant. Yes, a single Fusion-io card.

          Sent from my C64

