Shot in the Foot

Clustered Indexes vs. Heaps

At Stack Overflow the other day, I once again found myself trying to debunk a lot of the “revealed wisdom” in the SQL Server community. You can find the discussion here: Indexing a PK GUID in SQL Server 2012. However, this post is not about GUIDs or sequential keys, which I have written about elsewhere; it is about clustered indexes and the love affair that SQL Server DBAs seem to have with them.

It is interesting to note that both SQL Server and MySQL will, by default, create clustered primary keys on a table. On the other hand, you will find Oracle DBAs shaking their heads in amazement as to why one would do such a thing. As with all “best practices” that are so widespread in the SQL Server community, the problem is that things are typically a little more nuanced than the whitepapers would have you believe. And “it depends” is the worst cop-out ever.

Good Reasons to Cluster

As you may have guessed, this post is about the advantages of heaps. But before we get there, let us first review the good reasons for using a clustered index on a table.

Tables with a Single Index on an Ascending Key

If your table has a single index on a primary key, you probably want a clustered index. This is the default behaviour of SQL Server. But WHY is this so?

Consider the case of INSERT statements in these two tables:
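The two tables from the original illustration are not reproduced in this copy of the post. A minimal sketch of what they might look like (table and column names are mine, not from the post):

```sql
-- Clustered version: the B-tree on the key IS the table, so an
-- INSERT only touches the last leaf page of that one structure.
CREATE TABLE dbo.OrdersClustered (
    OrderID int IDENTITY NOT NULL,
    Payload varchar(200) NOT NULL,
    CONSTRAINT PK_OrdersClustered PRIMARY KEY CLUSTERED (OrderID)
);

-- Heap version: the row goes into the heap AND the separate
-- nonclustered index must also be maintained on every INSERT.
CREATE TABLE dbo.OrdersHeap (
    OrderID int IDENTITY NOT NULL,
    Payload varchar(200) NOT NULL,
    CONSTRAINT PK_OrdersHeap PRIMARY KEY NONCLUSTERED (OrderID)
);
```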

In the clustered case, an insert only touches the leaf level of the last page. In the heap case, both the index and the heap have to be updated. This results in the heap needing one more logical read and one more write than the clustered table. Clearly, the clustered index is to be preferred here.

For now, I just want to plant the idea in your mind that scalable tables are NOT sequentially laid out. I have blogged and presented extensively about this, and Microsoft even has it documented (Diagnosing and Resolving Latch Contention on SQL Server) if you are the type of person who only takes wisdom from the corporation.

We will get back to tables with one key very soon.

Large Tables that get a lot of Ranged Scans on a Key

Obviously, if you plan to range scan a table a lot on a key, this favours the cluster. Not only does this reduce I/O; it also reduces the number of page lookups you have to do in memory.

I have sometimes heard the revealed wisdom that clusters should also be used to support queries that have an ORDER BY on the clustered key. There ARE cases where this is true and the clustered index can help, but in my experience those cases are very rare. It only takes a join to another table, and ordering on columns from that table, to completely break any ORDER BY benefit you gain from a clustered index.

This then leads us to ask: What tables are likely to be range scanned in a database?

Let’s break the question down into the two major types of workload:

Data Warehouses: Range scans on date columns in fact tables are common. Joining from a dimension table to the fact will also cause a range scan of the surrogate key in the fact. Do note that if you need to range scan on date efficiently, it is often wise to use partitioning to optimise this. You may not have Enterprise Edition, in which case clustering on the date column is often a great idea. An argument can also be made for clustering on the fact table surrogate key that is most frequently joined on.

OLTP: Range scans are most common in two scenarios:

1) Hash builds of small tables
2) Joins from the referenced table to the primary key in an FK relationship

Ad 1) We can safely ignore any benefit from clustered indexes here. Small tables are, as the name implies, small. This means that scanning them is cheap no matter how silly our indexing is. For extremely frequently accessed tables, you should most likely be creating covering indexes to squeeze out the last bit of performance. Clustered indexes don’t have any particular advantage here.

Ad 2) When you join like this, you are typically range scanning. If you expect large result sets from this, clustering can be a great help. However, if you are generally expecting large result sets, you are either running ad-hoc reports, have poorly tuned access patterns, or are running a data warehouse. There is, however, an important exception to this rule: tables where the typical query is nearly always a seek on the primary key followed by a range scan of a few records in a table referencing that key, where those records get created at the same time as the row in the primary table. A good example of this is the Order/OrderLines pair of tables that exists in most sales systems. It makes good sense to cluster OrderLines on OrderKey (and NOT on some IDENTITY column added to OrderLines), because the join is frequent and nearly always results in a range scan of OrderLines that benefits from the order rows being close together. Also note that in this example, fragmentation isn’t massively affected by the key on Orders not being sequential. There is some inherent “packing” of OrderLines at INSERT time simply because the rows tend to arrive together in a single transaction (which means that any page splitting is minimal).
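As a sketch, the Order/OrderLines clustering described above might look like this (schema and column names are illustrative assumptions, not from the post):

```sql
-- Parent table; its key need not be sequential.
CREATE TABLE dbo.Orders (
    OrderKey  int NOT NULL,
    OrderDate datetime2 NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderKey)
);

-- Cluster OrderLines on the parent OrderKey (plus a line number to
-- make the key unique), NOT on an IDENTITY column of its own.
-- The lines of one order then sit on adjacent pages, so the
-- frequent Orders-to-OrderLines join becomes a tight range scan.
CREATE TABLE dbo.OrderLines (
    OrderKey int      NOT NULL REFERENCES dbo.Orders (OrderKey),
    LineNo   smallint NOT NULL,
    Quantity int      NOT NULL,
    CONSTRAINT PK_OrderLines PRIMARY KEY CLUSTERED (OrderKey, LineNo)
);
```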

Tables that see a Lot of DELETE and INSERT Activity

Heaps have the great disadvantage of needing forward pointers when rows are deleted. Clustered indexes don’t suffer from this. For tables that see a lot of mixed deletes and inserts, forward pointers quickly waste space. They also double the cost of doing seeks: with a fragmented heap, you now have to follow the forward pointer to find the final record, which adds an additional I/O.

Tables that are in flux like this are great candidates for clustered indexes, because you only pay the price of reaching the bottom of the B-tree, not the price of chasing a forward pointer.
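If you want to check how badly forward pointers have hit an existing heap, the `sys.dm_db_index_physical_stats` DMV reports a forwarded record count (the table name here is a placeholder):

```sql
-- index_id 0 = the heap itself; 'DETAILED' mode is required to
-- populate forwarded_record_count.
SELECT index_type_desc, page_count, forwarded_record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.MyHeap'), 0, NULL, 'DETAILED');
```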

Consider for a moment which tables these are. They see a lot of both deletes AND inserts. Are they therefore likely to be big? Perhaps, but tables typically get big because they see mostly INSERTs (transaction tables and fact tables being two classic examples). Let’s keep this in mind.

Good Reasons to use a Heap

It should be clear that there are some good reasons to use a clustered index. But how common are those scenarios really? Should the general rule be to use a clustered index? I don’t think there should be a general rule, because there are a lot of equally good reasons to use heaps.

Fragmentation Prone tables with lots of INSERT activity

Consider what happens if the keys in the single index in the clustered example are NOT sequential. In this case, the clustered index is going to get fragmented.

I have already said that tables with a lot of DELETE activity are not good candidates for heaps. But what about tables that see mostly INSERTs? If you think about it, they are very common and often the tables you really need to run well. Examples of such tables:

  • Tables containing OLTP transactions in POS and banking systems
  • Most Data Warehouse fact tables
  • Log tables of all kinds
  • Healthcare records for procedures and medication
  • Telco Call Detail Records (CDR)
  • Investment banking Risk Vectors
  • Events from web servers and social media

These are the types of tables you care about at the heart of the business; they tend to make up the majority of your database size. It’s your bread and butter.

If you put a clustered IDENTITY INT/BIGINT column on those tables and declare that the key, you have just shot yourself in the foot!

…Read that again…

The IDENTITY column will make the INSERT activity un-scalable. And since this is one of your largest tables, you probably care quite a bit about that.

What about GUID keys then, the thing everyone hates? Let’s say you did that, but kept the clustered index. Your table would now be massively fragmented. That is not a problem in itself, and it gets you around the scale bottleneck. But you are wasting a lot of space.

Consider the alternative: an indexing strategy where you pick a sufficiently “random” key (like a GUID) and, instead of blindly declaring that key a clustered index, leave the table as a heap and just put a good old fashioned unique, non-clustered index on the GUID. The majority of the table growth from INSERTs now goes to the heap, which will nicely fill up and stay un-fragmented. And that is a heck of a lot more scalable than the last page insert problem (though not infinitely scalable; you need an additional trick for that). In the meantime, the index you care about remains small (it only contains the GUID, not the entire row, at the leaf level) and is therefore MUCH more likely to fit in memory. Your seeks are now faster, and any fragmentation you get on the key is not going to consume as much space as if you had to fit the entire row into those leaf pages.
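A sketch of this strategy (table and index names are mine, not from the post):

```sql
-- The table stays a heap: INSERTs fill pages in allocation order,
-- so the randomness of the GUID never fragments the data itself.
CREATE TABLE dbo.Txn (
    TxnID   uniqueidentifier NOT NULL DEFAULT NEWID(),
    Amount  money NOT NULL,
    TxnTime datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- The GUID key lives only in a small non-clustered unique index;
-- fragmentation there wastes narrow index pages, not whole rows.
CREATE UNIQUE NONCLUSTERED INDEX UX_Txn_TxnID ON dbo.Txn (TxnID);
```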

What’s not to like?

Tables with Lots of Indexes

Recall that I said tables with only one index will often benefit from a clustered index. However, consider the scenario where you have multiple indexes that are frequently accessed.

As an example, let us use the ORDERS table in TPC-H. This table is often accessed by customer, so it makes sense to add an index on O_CUSTKEY. The table also has a primary key index on O_ORDERKEY.

Consider the execution of this simple statement:
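The statement itself is missing from this copy of the post; given the plans that follow (a seek on the O_CUSTKEY index plus a lookup for the rest of the row), it was presumably something along these lines:

```sql
-- Seek on the O_CUSTKEY index, then look up the remaining
-- columns of each matching ORDERS row.
SELECT *
FROM ORDERS
WHERE O_CUSTKEY = 123456;
```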

We can try out two different strategies:

  • Unique clustered index on O_ORDERKEY and an index on O_CUSTKEY
  • Leave the table as a heap, with a unique key on O_ORDERKEY and an index on O_CUSTKEY
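In T-SQL, the two strategies might be set up like this (index names are mine):

```sql
-- Strategy 1: clustered on the primary key
CREATE UNIQUE CLUSTERED INDEX CIX_ORDERS ON ORDERS (O_ORDERKEY);
CREATE INDEX IX_ORDERS_CUSTKEY ON ORDERS (O_CUSTKEY);

-- Strategy 2: heap, with both indexes non-clustered
CREATE UNIQUE NONCLUSTERED INDEX UX_ORDERS ON ORDERS (O_ORDERKEY);
CREATE INDEX IX_ORDERS_CUSTKEY ON ORDERS (O_CUSTKEY);
```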

These are the plans we get:

[Figure: the two execution plans]

On my little scale factor 10 (10GB) TPC-H database, these are the stats:

  • Cluster: 27 logical reads
  • Heap: 11 logical reads

So the heap is better, even on I/O. Why?

Think about what needs to happen when you do the key lookup into the cluster:

[Figure: key lookup into the clustered index]

Once the correct key has been located in the secondary index, you have to traverse the entire B-tree of the clustered index just to find the rest of the row.

Compare with the heap strategy:

[Figure: RID lookup into the heap]

 

Because the index contains the RID of the row (its exact location in the data files), there is no need to climb any B-tree to find the rest of the row. Climbing B-trees is NOT free; there is a locking and memory access cost associated with it. By clustering the table, you are forcing every index except the cluster itself to pay the B-tree climb cost twice. Heaps are better for this scenario.

Tables that need to be Fully Scanned

The final example in defence of the heap is tables that you typically want to scan (and that don’t fall into the frequent DELETE/INSERT category).

In nearly all of the examples I have seen, these tables are fact tables in data warehouses or log tables. With the exception of column store indexed tables (and soon, clustered column stores), you generally want these tables to be scanned, not seeked.

In my data warehousing course, I describe the challenges of indexing fact tables. The short version is that when you get a seek on a fact table, even with a good index supporting it, it spells trouble. Seeks destroy sequential I/O on a perfectly laid out table. They are also expensive, because B-tree walking costs a lot of CPU. As if that wasn’t enough, seek-heavy plans on data warehouses can cause the execution engine to consume a lot of memory when index intersection plans are generated.

Unfortunately, the SQL Server optimiser will often pick a seek over a scan, because it underestimates the speed that a modern I/O system can achieve when properly tuned for sequential access. Fast Track is a good example of this; it is not uncommon to see sequential scan speeds exceeding 10GB/sec. Sometimes bad statistics cause the seek too, and bad statistics are difficult to avoid in a large data warehouse.

Because of these issues with the optimiser, it is often desirable to take away the option of picking a seek entirely. This means leaving the table as an un-indexed heap or a column store index. One way to think about a column store is that it is just a heap that is optimised for fast scans. You can consider partitioning the heap or column store to give the table a coarse-grained “index” that reduces the amount of data that must be scanned – this is often better than clustering.
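As a sketch, partitioning a log-style heap by date might look like this (object names and boundary values are illustrative assumptions):

```sql
-- Monthly partitions give a coarse-grained "index" on EventDate:
-- a WHERE clause on the date scans only the matching partitions.
CREATE PARTITION FUNCTION pf_EventDate (date)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2014-02-01', '2014-03-01');

CREATE PARTITION SCHEME ps_EventDate
AS PARTITION pf_EventDate ALL TO ([PRIMARY]);

-- No clustered index: the table is a heap placed on the scheme.
CREATE TABLE dbo.EventLog (
    EventDate date NOT NULL,
    Payload   varchar(400) NOT NULL
) ON ps_EventDate (EventDate);
```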

  25 Comments

  1. Chad   •  

    I’m an inexperienced developer who inherited an enterprise PLM framework that has been torturing our 1,000+ worldwide user base for several years via horrible performance and instability. We’ve got about 2,000 tables, probably 3,000 queries that span deep object graphs (most of which are ad-hoc in nature). The ‘scale’ we work with is in breadth, rather than depth, meaning we have < 5 tables with 1-4 million rows, and < 30 with 100k-1million rows. Everything in this system is keyed and clustered on GUIDs, and the overall ratio
    is probably 10 reads for every one write.

    I've spent the last few weeks learning SQL server tuning on the fly.. I write a lot of sql, but normally I'm a C# app guy. I've had the gut feeling all along that both the framework vendor and the consultants who did the initial implementation made a massive mess of the DB based on 'conventional' wisdom. Our DBAs and system 'Architect' have scoffed at my suggestions all along that we should try turning many tables into HEAPs, spouting 'Without the VENDOR suggesting it? Are you crazy?".

    While it's been a complicated tune (for me at least) and I've greatly oversimplified the problem in this post, your ideas laid out here, Thomas, helped me improve overall performance of our system by 300% on average.

    P.S. I AM an american, and share your cynicism deeply on the state of our society. The mid-size corporation I work for is the mess it is EXACTLY because people are lazy and cannot be asked to THINK for themselves.. the path of least resistance really is the american way, on average.

    TLDR: This topic helped me think outside of the box and solve a massive problem, both in critical spirit, and technical detail.

    Thank you, sir. Your site here is instantly a favorite bookmark for me.

    • Thomas Kejser   •     Author

      Chad

      Thank you for the kind words. I am really happy to hear that my advice was useful to you.

      I am also happy to see (As I also mentioned) that the US still have great people like yourself who dare to challenge conventional “wisdom”. It’s such a shame that the political system has created an environment where complacency and political correctness overrules good reason.

  2. Pingback: Clustered Tables vs Heap Tables | Senior DBA

  3. Pingback: SQL Server Radio: Show 11 – Live Show

  4. Aaron Morelli   •  

    Thomas,

    I stumbled onto this KB today and was reminded of your post:
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;297861

    The KB is old, but it seems plausible that “searching heaps for freespace” could still be bottleneck, since the core architecture of heaps hasn’t changed since SQL 2000. Is this something that you’ve seen in your high-transaction environments? If so, what have you done to mitigate?

    Thanks!
    Aaron

  5. Brent Ozar   •  

    Thomas Kejser :
    perhaps with the exception of Americans, where stupidity seems more endemic

    Umm, okay, wow. Sounds like it’s time for me to gracefully step out. Thanks for the conversation, guys!

    • Thomas Kejser   •  

      Sorry Brent, I didn’t mean to single out any individual.

      You must admit that living in a country where a large portion of the population believes the earth is 4000 years old, watch Fox news, think themselves to be informed and then go on to vote republication isn’t exactly something to be proud of. The signs of endemic stupidity can’t be much strong than that. I appreciate that not everyone is like that and that intelligent people suffer under it. But I am sure you can appreciate the irony: that my point about the dangers of catering to the lowest common denominator is particularly well illustrated by the current state of the US.

  6. tobi   •  

    Brent Ozar :
    Note the popularity of blogs like SQLauthority.com where they give a very tightly focused answer to a simple question, and they’re massively, massively more popular than Thomas and I combined. I’m certainly not saying Thomas needs to be more like Pinal, nor am I saying Pinal needs to be more like Thomas. There’s a middle ground there, though, where we can boil down complex topics into guidelines for time-pressed readers, then give more information for folks who would like to pursue the topics deeper.

    Oh god, don’t start with SqlAuthority. The solutions there are often out-dated and based on superstitiousness.

    • Thomas Kejser   •  

      @Feodor/Brent:

      I think there is a category mistake going on here. Just because something is popular or widespread does not make it RIGHT or worthy of pursuit. No matter how many people believe silly things, they remain silly things.

      Personally, I find it deeply insulting when people speak to me like a child because there happens to be an ignorant audience out there. I don’t question that there are stupid people in the world. There are people who blindly copy/paste code, shoot themselves in the foot while blaming others and then go on to ask stupid questions. But the moment we cater to them and lower our standards for communication, we are guilty of taking something good out of the world.

      The collateral damage of catering to the stupid masses is immense. It results in things like:

      1. Tech level specification at conferences (read the bloody abstract!)
      2. Boring, long sessions with very little content.
      3. …With tons of time wasting disclaimers at the beginning.
      4. Tons of demos showing otherwise smart people how to click around in SSMS (RTFM!)
      5. Thick, boring books full of screenshots that describe all the internals of SQL Server, but fail to mention the central methods required to actually do root cause analysis on complex problems (compare with the MySQL community and the complete focus on solving problems there).
      6. Dumbed down default configuration settings that are the lowest common denominator.
      7. A silly fear of trace flags that might do people good if they were to know them.
      8. PSS articles where you have to read so many disclaimers that when you finally arrive the facts, you are likely to need another cup of coffee just stay awake long enough to understand them.

      Unlike Brent, I actually believe that the majority of people out there (perhaps with the exception of Americans, where stupidity seems more endemic) are a pretty smart bunch. We owe to them to give them the facts, the special considerations, and the background they need to make good choices. To quote Goethe, who said it better than I ever can:

      “If I accept you as you are, I will make you worse; however if I treat you as though you are what you are capable of becoming, I help you become that”

    • Thomas Kejser   •  

      @Tobi: I actually think the work Pinal does has some good uses for beginners in the SQL community. However, I think a blog is the wrong format for it (and one could argue about the choice of domain name :-)

      If I was on a mission to publish small, frequent articles like that – I would have chosen a HOWTO database. Then people can publicly contribute and the bite size information would be easier to search.

  7. Pingback: My links of the week – January 19, 2014 | R4

  8. Brent Ozar   •  

    Thomas – uh oh, we’ve got a problem here, and it doesn’t have anything to do with the always-excellent quality of your research. The problem is the phrases “big” and “scalable”.

    Over and over again, when customers come to me and I ask about the size of their data, they always – ALWAYS – say it’s big. After all, it’s the biggest data *they’ve* ever seen. Ideally, if you’re doing your career right, you’re always moving upwards and onwards to bigger and faster systems, but that always means you’re working with what you *think* is “big”. In reality, most of these databases are 50, 100, 250GB.

    Readers see your post and say, “Well, Thomas says that if I’m going to have a big table, I shouldn’t cluster on an identity field. It says so on his blog. And my table, it’s big!”

    That’s how we end up with a lot of really small databases designing themselves for really big scale.

    Could you do me a huge favor? In the post somewhere, not in the comments, can you define what you mean by the terms big and scalable? That’ll really help in the years to come as I get folks saying all their 5GB tables are heaps because Thomas Kejser told them it was the right way.

    Thanks, and keep up the great work! I was worried that you’d be retiring from all SQL Server life, heh.

    • Thomas Kejser   •  

      Hi Brent

      Yes, we do indeed have a problem here.

      The problem is that people think that having an IDENTITY column on a clustered index is generally a good design choice and that Heap+GUID is a horrible idea – and that is simply not true. I don’t really care if your data is 5GB or 5TB – if you are paying a few extra IOPS (and note, you don’t HAVE to; there are ways to still get good page filling) to get something that can grow, you are IMHO doing your job as a DBA (which business does NOT want scale?). Fragmentation and page splits are a distraction (and are inevitable), and they take focus away from doing the real work of actually tuning the queries and indexes.

      As you will also notice in the blog post, I explicitly list the times where it DOES make sense to use a clustered index. The problem is that this “best practice” has been taken too far and that people now blindly apply the cluster-on-IDENTITY advice – which is a sure fire way to get them into trouble that (unlike page fragmentation) they can’t easily get out of by throwing more hardware at the problem.

      So no, I am not going to update the post with such a qualification. I am of the opinion that people need to think for themselves and understand the real trade-offs. Qualifying all advice with “this only applies if you are bigger than X” is IMHO a sure fire way to speak to SQL Server DBAs as if they were children (and it also addresses them as if they only have small databases). Give people the facts and have them think – that is the respectful thing to do.

      Thanks for the kind words though. Though I am working with both MySQL and MSSQL these days, I will be blogging about both.

      Best regards -TK

      • Brent Ozar   •  

        Ah, see, I would agree if everyone had the time to “think for themselves”, but when folks are under the gun to design and build lots of database applications quickly, they often need to start with guidelines, and understand when to deviate from those guidelines. Think about the thousands of developers working right now to build an app. Do they have time to understand database structures fully before they build a table? I wish they did – but I remember back to my job ten years ago, and I didn’t. I just needed a quick guiding rule – when I’ve got databases under X GB, what’s the safest way to get started that won’t kill me?

        • Thomas Kejser   •  

          I agree that we are all under pressure (though I think it is a little condescending to assume this means people don’t think).

          So, if you are really under pressure to get something out the door – I would claim that your best choice is to use GUIDs. They are a ton easier to program for (you can generate them in the app and not have to learn about the ins and outs of SCOPE_IDENTITY()). And with GUIDs, you don’t run into any nasty surprises if you suddenly grow the data. Now, you can do BETTER than GUIDs (and gain most of the benefits of IDENTITY) if you have time to think about it – for example by using bit reversed sequencers (see the link to Rick’s blog: http://dangerousdba.blogspot.co.uk/2011/10/day-sequences-saved-world.html) – but that is still not IDENTITY.

          To summarise: If you don’t want to think, use GUIDs – because that is future safe when the fools up in management want to throw hardware at the problem. I don’t think the evidence supports IDENTITY columns as good choices. Saving a few bytes on the rows and avoiding page splits really shouldn’t be the priority if you are under the gun; building a proper database and indexes should be.

          As I am also stating in many of my posts: If you are working on some super shared virtual machine with 2 cores, over-allocated memory and a crap SAN underneath – you are already a lost cause. Thinking about GUID/IDENTITY shouldn’t be your concern; creating fully indexed paths to the data should be (and filling out your CV would be another good idea).

          Best regards -TK

    • Feodor Georgiev   •  

      Brent, I find it very tribal to think that a reader of this blog post who manages to finish reading it will actually blindly follow the instructions without further curiosity and testing.
      As we discussed in private a while back, there are two types of SQL Server gurus: the ones that bring 300 sheets of printed paper and makes 101 level presentations around them, and the ones that push the boundaries of the SQL Server product features. (awards and audience statistics are merely food for the poor ego and are matter of interpretation).
      I don’t really understand the problem you are having with the blog post: are you worried that you will get swamped with extra consulting work if people implement heaps after reading the blog post, or are you afraid that you will have to do some heavy lifting and test the suggested approach from different angles?
      Trust me, there are a lot of humble and quiet people out there who read blog posts like this one and take it to the next level without being in anyone’s face.
      In general, doubting the qualities of the reader will make you less popular.

      • Brent Ozar   •  

        Feodor – about “actually blindly follow the instructions” – I’ve got a great story there. When I first published my original Blitz script (before it was a stored procedure), it was an interactive script that you would run line-by-line, reviewing the results. I put in big capital letters across the top, READ THIS AND RUN EACH BATCH SEPARATELY. There were variables that you needed to enter at various stages – for example, your email address was used to test database mail. The default value for that variable was my own email address.

        I got dozens of emails per day.

        People just copy scripts from the Internet, don’t even read them, and run them on their own production servers.

        You’re different – you would never do anything like that, because you’re exactly the kind of reader Thomas wants. You put a lot of thought into the advice you get, and you carefully evaluate things before you put them into production.

        It’s GREAT that you’re different. I wish every reader was more like you. But doubting the qualities of the reader doesn’t make me more popular – and in fact, it’s quite the reverse.

        Note the popularity of blogs like SQLauthority.com where they give a very tightly focused answer to a simple question, and they’re massively, massively more popular than Thomas and I combined. I’m certainly not saying Thomas needs to be more like Pinal, nor am I saying Pinal needs to be more like Thomas. There’s a middle ground there, though, where we can boil down complex topics into guidelines for time-pressed readers, then give more information for folks who would like to pursue the topics deeper.

        Hope that helps explain where I’m coming from. Thanks for taking the time to share your thoughts!

        • Thomas Kejser   •  

          Brent, there is a very easy solution to your problem with the tool (which I like by the way, the humour is great!):

          Set up an template mail that says: RTFM! with a link to the instructions.

  9. Pingback: Interesting Links #1 | James Snape

  10. tobi   •  

    And I must say that the heated Stack Overflow discussion was just caused by different people having different opinions on what a “common situation” and “assumptions commonly true” are. Depending on where a DBA comes from he will have different priorities.

    I have been in such discussions. Everyone loses.

    Such a situation can be 100% defused by giving a complete and unbiased list of advantages and disadvantages of each alternative.

    • Thomas Kejser   •  

      Tobi: My goal with this post was indeed to provide such an unbiased list of pros/cons. I think the bias has been too heavy on the CLUSTER/IDENTITY school – to the extent that it is touted as religion.

  11. tobi   •  

    Another advantage for heaps is that they get allocation-order scans (IAM scans). For a b-tree this only happens under nolock or tablock locking models.

    Another disadvantage for the heap+unique nci strategy is that it uses more disk space. The nci is redundant. One row will take 16(guid)+8(bookmark)+9(row overhead) space. That’s an increase in effective row size of 33 bytes right there.

    • Thomas Kejser   •  

      Good comments, Tobi. However, your last calculation is not what actually happens in most situations (which is why I only use heaps in the fragmentation prone situation).

      Consider the fragmented index case. The NCI adds an additional #rows x (16+8+9) bytes of overhead. However, you have to subtract from this the significantly lower cost of half filled pages: the fragmented cluster will add (1 – [fillfactor]) x [total row size] of overhead to the table. Furthermore, page splits will be more frequent (as each split doesn’t clear as many free row slots as in the NCI).

  12. Feodor Georgiev   •  

    Awesome post! Thanks, Thomas!
    For quite a while (ever since a month after I got introduced to SQL Server) I have been wondering why the PK gets clustered by default and what the link is between one and the other. Even in smaller tables in OLTP systems I have seen huge performance problems with this. At one time I got a 100x performance improvement by moving the clustered index from the PK to the FK, since the table was actually joined on the FK and the query was running about 100 times per second.
    This post has been on my mind for a while now, but I don’t think I would have written it better than you. Great job!
