Shot in the Foot

Myth Busting Query Optimisation

It’s been some time since I posted here, but recent customer events prompted me to write up some notes I have been taking over the past few years.

The problem I would like to talk to you about is one that is well examined and well understood, yet it continues to create frustration for customers. Unfortunately, some of these frustrations come from “best practices” that people apply consistently, and which in turn consistently get them into trouble.

When you are done reading this post, I hope you will have a better understanding of how to avoid these pitfalls. It is also my hope that some of your expectations about query optimisation will have been adjusted.

Myth: You can rely on the optimiser to always get queries right

It is commonly believed by customers that as long as you do some or all of the following, you are fine:

  • Keep statistics updated
  • Defrag indexes
  • Watch out for parameterised queries
  • Keep the database patched up with hotfixes and service packs
  • Index the database well
While most of the above are good practices (perhaps with the exception of defragging indexes), they are not enough. Even if you do ALL of the above, you are not on safe ground. Finding the optimal query plan for any given query is an NP-complete problem. This is computer-science speak for: very hard, likely impossible. If you are interested in knowing more about this fascinating subject, follow this link: P=NP.

What it means to you: Even in the most well-designed database, queries will sometimes run wild or give you inconsistent execution times. When things go wrong, they often go very wrong. Queries that run for seconds can run for hours if the plan they get is bad. Expect this, and defend yourself against it.

What you can do about it: First of all, you need to properly monitor the database to spot queries that run wild. SQL Server is very well equipped to do this: the details you need can be found either via Profiler/XEvent traces or by querying sys.dm_exec_requests. If you find a runaway query, it is often a good idea to force a better plan or to add indexes/statistics that help the optimiser. In order to understand what a “better” plan looks like, you need to be able to spot the difference between “the plan you want” and “the plan you are getting”. More about this later.
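
As a starting point, here is a minimal sketch of the sys.dm_exec_requests approach. The one-minute threshold and the choice of columns are purely illustrative; adjust them to your own baseline.

-- Spot requests that have been running suspiciously long, together with
-- their statement text and the plan they are currently using.
SELECT
    r.session_id,
    r.start_time,
    r.status,
    r.cpu_time,
    r.total_elapsed_time,
    r.wait_type,
    t.text AS sql_text,
    p.query_plan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS p
WHERE r.session_id <> @@SPID
  AND r.total_elapsed_time > 60000  -- more than one minute, in milliseconds
ORDER BY r.total_elapsed_time DESC;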

Myth: You should update statistics after upgrading SQL Server

This is probably the most dangerous “best practice” out there. Often, people will run UPDATE STATISTICS WITH FULLSCAN after upgrading to a new version of SQL Server. The results are often dismal: by updating the statistics, all queries will get new plans, and a carefully tuned workload can suddenly behave very differently if you happen to get plans that favour skew over equal distributions. Sometimes this helps the workload, sometimes it makes it dramatically worse. It is really a gamble.

What it means to you: You should carefully weigh the pros and cons of updating statistics after an upgrade. If you DO decide to follow the “best practice” of updating everything, make sure you take the proper action, which is:

What you can do about it: Before an upgrade, always capture a trace of the query plans the server executes on a typical day. If you decide to update all statistics and run the risk, compare the new query plans with the old ones (you can use the query_hash column for this) and look for outliers. You may get lucky and see only better performance (one of the reasons you upgraded in the first place), or you may have some outliers that need tuning after the upgrade.
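
If you would rather work from the DMVs than from a trace, the sketch below shows the same idea: persist a baseline of query_hash/query_plan_hash before the upgrade, then look for plan changes afterwards. The table name dbo.PlanBaseline is just an illustrative choice.

-- Before the upgrade: persist a plan baseline, since the DMVs are cleared on restart.
SELECT
    qs.query_hash,
    qs.query_plan_hash,
    SUM(qs.total_worker_time) AS total_cpu,
    SUM(qs.execution_count)   AS executions
INTO dbo.PlanBaseline
FROM sys.dm_exec_query_stats AS qs
GROUP BY qs.query_hash, qs.query_plan_hash;

-- After the upgrade (and a representative workload run): find queries whose
-- plan changed and compare CPU per execution to spot outliers.
SELECT
    b.query_hash,
    b.query_plan_hash AS old_plan_hash,
    n.query_plan_hash AS new_plan_hash,
    b.total_cpu / NULLIF(b.executions, 0) AS old_cpu_per_exec,
    n.total_cpu / NULLIF(n.executions, 0) AS new_cpu_per_exec
FROM dbo.PlanBaseline AS b
JOIN (
    SELECT query_hash, query_plan_hash,
           SUM(total_worker_time) AS total_cpu,
           SUM(execution_count)   AS executions
    FROM sys.dm_exec_query_stats
    GROUP BY query_hash, query_plan_hash
) AS n
  ON n.query_hash = b.query_hash
WHERE n.query_plan_hash <> b.query_plan_hash;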

Myth: Updating statistics regularly is always a good idea

This myth has a lot of truth to it. Unfortunately, it also causes a lot of trouble for customers who run large databases. UPDATE STATISTICS is an expensive operation in SQL Server and it can run for a very long time; for some customers, several days. By the time you have finished updating the statistics, you will not only have burned a lot of CPU cycles on the server (which, remember, you are paying licenses for), you may also end up with the result described above.

What this means to you: You may have tables in the database that never have good statistics, because the update takes so long that they are always out of date.

What you can do about it: Carefully plan statistics updates on large tables. Consider using the tricks described in my blog about the ascending key problem. This also means that having auto update statistics turned on for all tables is not always a good idea.
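
As one example of taking manual control, the sketch below excludes a large table from auto update statistics and refreshes it with a sampled scan on your own schedule. The table name, the sample percentage and the use of sp_autostats are illustrative assumptions, not a prescription.

-- Exclude one large table from auto update statistics, so a scheduled job
-- decides when (and how thoroughly) its statistics are refreshed.
EXEC sp_autostats 'dbo.BigFactTable', 'OFF';

-- Refresh with a modest sample instead of a full scan; tune the percentage
-- to what your maintenance window allows.
UPDATE STATISTICS dbo.BigFactTable WITH SAMPLE 5 PERCENT;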

Myth: Auto create statistics will create the statistics I need

This is only true for single-column statistics. If you have two columns that are somehow correlated, the optimiser will not be able to understand this unless you manually create multi-column statistics.

What this means to you: With no other information available, the optimiser will assume that filters on two columns are independent and simply multiply their selectivities. For example, imagine a database of customers and the food they eat and cook. If 10% of all your customers buy fish and chips and 10% of them are English, the optimiser assumes that a filter on Food = ‘Fish and Chips’ AND Nation = ‘English’ will reduce the table to 1% of the rows. If you add the filter Cooking Skill = ‘Very Poor’, it will again assume independence and estimate 0.1%. In all cases, we of course know that the real answer is around 10%. Statistics will in this case mislead the optimiser.

What you can do about it: You can use queries or my TableStat.exe tool to find out which columns are correlated. Once you know, you can add multi-column statistics on them. This is especially important in data warehouse fact tables, where high correlation between columns is particularly common (a fact exploited by column stores).
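
Creating such a statistics object is straightforward; the table and column names below follow the hypothetical example above.

-- Multi-column statistics capture the combined density of the column pair,
-- which gives the optimiser a better estimate for correlated filters.
CREATE STATISTICS stat_Customer_Food_Nation
ON dbo.Customer (Food, Nation)
WITH FULLSCAN;

Note that the histogram is still built on the leading column only; it is the density information over the column combination that helps with correlated filters.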

Myth: Patching the SQL engine makes the optimiser do the right thing

As I said in the introduction, the optimiser is not perfect. While it constantly evolves, there simply is no such thing as an optimiser that always does the right thing. Furthermore, a lot of the changes to the optimiser are disabled by default.

What this means to you: Patching the SQL Engine most likely has no effect on the query plans you get. Unless…

What you can do about it: Trace flag 4199 will enable the latest fixes to the optimiser. Before using it, make sure you do a baseline and treat enabling this flag exactly like you would treat upgrading to a new version of SQL Server.
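
A sketch of doing exactly that: scope the flag to a single query first with QUERYTRACEON so you can compare plans against your baseline, and only then consider enabling it instance-wide. The table name and filter are illustrative.

-- Try the newer optimiser fixes on one query only (requires appropriate permissions).
SELECT COUNT(*)
FROM dbo.Orders
WHERE OrderDate >= '20130101'
OPTION (QUERYTRACEON 4199);

-- Once the baseline comparison looks good, enable it instance-wide (until the next restart).
DBCC TRACEON (4199, -1);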

Myth: Buying a larger server will make my queries run faster

This is probably one of the most common myths out there. Query tuning is a tough discipline that requires a deep understanding of database internals. Often, it is tempting or even economically feasible to simply throw hardware at the problem. With a properly configured I/O system and cheap cores, it is now easy to build a server with no hardware bottlenecks.

Unfortunately, upgrading to a larger server can have negative side effects on your query workload. Queries that have bad plans can now run faster, but they will also gobble up more resources. This may make the entire workload run worse than it did before you upgraded the server. Add to this that large servers often have lower clock rates per core, which means that single threaded query plans, or plans on skewed data will run slower. For example, a 2 socket Intel Xeon can clock over 3.5GHz while a 4 socket will rarely clock over 3GHz.

What this means to you: Buying a larger server for your database may make it run SLOWER, not faster. If you have decided to invest a lot of money in hardware, this won’t make you look good.

What you can do about it: Perform a pre-analysis of the workload before upgrading to understand whether you will benefit from more hardware. If you discover you will, make sure you baseline the query workload before upgrading so you can quickly locate queries that consume more of the new resources. Comparing a simple query trace in the before/after situation is generally enough to spot the worst offenders.

Summary

Query optimisation is a hard problem (NP-hard, in fact). The code SQL Server uses to heuristically find good plans is very complex. In general, the optimiser does a great job of finding good plans. The database is generally smarter than humans, even the most experienced SQL Server veterans. After 15 years of database tuning, I am still occasionally surprised when the optimiser finds a really clever plan I never thought about forcing.

Things DO go wrong. Expect that “poor plans happen” and have a contingency plan for what to do about it. Powerful diagnostic tools inside SQL Server allow you to easily track down poor query plans, but it is up to you to use them. Query hinting, combined with a solid understanding of join types and indexing strategies, will help you fix the problem when it inevitably appears. Coming up with a good query plan when the optimiser won’t give you one is not an easy task; a few hours with a consultant will often be money well spent. If you wish to learn how to find good query plans yourself, I highly recommend the book “SQL Tuning” by Dan Tow.

The very best thing you can do to avoid poor query performance and bad plans is to use proper data modeling techniques. Good data modeling has a powerful effect on all relational databases, both column and row stores. SQL Server’s optimiser is one of the smartest optimisers out there, but even that piece of code needs a hand from the data modeller.

Use dimensional models to build data warehouses. This greatly improves your chances of getting good query plans and great performance. Make sure you manually create statistics that reflect relationships between columns in the data.

When building OLTP systems, avoid Entity-Attribute-Value models. The many self joins and skewed distributions in these models mislead the optimiser. If you absolutely must use them, be prepared to force a lot of the query plans with hints (often OPTION (LOOP JOIN)) and to run without table lock escalation and auto update statistics.
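
For illustration, a typical EAV pivot with the join strategy forced might look like the sketch below; the table and attribute names are hypothetical.

-- Pivot two attributes out of an EAV table, forcing nested loop joins so that
-- poor estimates on the skewed attribute column cannot push the plan into a
-- hash join.
SELECT n.EntityId,
       n.Value AS CustomerName,
       c.Value AS City
FROM dbo.EntityAttributeValue AS n
JOIN dbo.EntityAttributeValue AS c
  ON c.EntityId = n.EntityId
WHERE n.AttributeName = 'Name'
  AND c.AttributeName = 'City'
OPTION (LOOP JOIN);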

  Comments

  1. Ewald Cress   •  

    This is one of those blog posts where I wish you had a “like” button after each paragraph. Great stuff, Thomas.

    And I second your endorsement of Dan Tow’s book. It’s a great way to put yourself in an optimiser’s shoes and gain some understanding and respect for what it needs to achieve.

    • Thomas Kejser   •  

      Thank you Ewald

  2. tobi   •  

    I don’t buy the NP-hard argument. Finding a near-perfect plan is much easier than finding *the* optimal plan, because there are very many near-perfect plans.

    In my opinion, bad query plans mostly come from an imperfect cost model or the other things you mentioned. I almost never find a bad plan that was created just because a better one was not considered. Bad plans are almost always because a truly better one was deemed inferior.

    • Thomas Kejser   •  

      Tobi, it is well understood that the problem itself is NP-complete. But what does that mean if we look at it closer? This DOESN’T mean that there is no way to find an optimal plan, given a very small number of tables and indexes and possible join paths (for small values of n, the entire space CAN be explored in reasonable time). But consider what this means for things like the “safe cross join” optimisation that is part of the star schema optimisation problem. It also DOESN’T mean that you can’t make progress towards local optima and “climb up the hill” once you have crash landed somewhere in the space of all query plans (which is what the optimiser will try to do).

      Following this logic, you could argue that a Dimensional Model is “good” because it reduces the search space significantly. But by the same argument, normalisation is “bad” because the search space gets enormous once you go beyond really trivial plans. I am pretty sure you would not argue that one should NOT normalise OLTP systems because of this? The argument FOR normalisation in small databases is that it hurts less (in general) to get a bad plan than in a big database (and that hash joins basically don’t make sense if you want to be sure you get reasonable performance)… Of course, one can overdo it (Data Vault, for example) and intentionally shoot oneself in the foot.

      What is a “near perfect plan”? How much better does it have to be than a “not near perfect” one to consider it “optimal enough”? Where in the landscape of possible plans do you need to land first to get close to what you would consider “near perfect”?

      With regards to what a “bad” plan is: that depends on hardware, current server conditions, and the available parallelism (at high possible DOP, good plans may exist that did not make sense at low DOP). When the optimiser picks a bad plan, a few things can be true:

      1) The optimiser already considered a good plan and decided against it (the situation you mention). Unfortunately, by the time you have searched a large plan space, you may no longer hold the good plan in memory (the space complexity is also high) and there is no going back because time ran out.

      2) The optimiser made a mistake because the costs are off and calibrated to a machine hidden under Lubor’s desk (a common complaint about the SQL Server optimiser). Good luck coming up with a better costing model. I would argue that most of that trouble is not a costing problem, but caused by other factors; the subject of a completely new blog one day, perhaps.

      3) Someone did a horrible model that doesn’t HAVE any good plans that you would want (or which is too complex for them to appear).

      4) The best plan is somehow not “possible” because certain plan operators cannot be transformed into others in every context.

      5) The memory estimate of the plan is too high (even though it is correct) and the optimiser tries to conserve resources and settles for a cheaper plan.

      … etc…

  3. Davide Mauri   •  

    “Buying a larger server will make my queries run faster”: this one is particularly hard to fight. I keep finding customers who bought a server with 128 cores, 64 GB of RAM (“hey, we EVEN have a lot of RAM”) and… drum roll… 2 TB of space in 4 drives. (“With 2 TB of space you can go slow!” Yes, so why did you buy a Ferrari and not a van?)

    • Thomas Kejser   •  

      If I may take the liberty to philosophise a little. I think this linear type of “factory thinking” about server capacities is a sign of a greater problem with the way we view production capacity. Back in the days of Henry Ford, adding more people and machines would boost productivity. Unfortunately – this thinking led to really poor cars that break all the time. But as Toyota has shown us: you can make better cars if you stop thinking like that.

      In IT – we are still thinking like Ford and not like Toyota.

  4. datatake   •  

    Would interpreting low query cost to equal lower execution time be a good (or bad) myth to add to the list?

  5. Thomas Skytte   •  

    Great post. Thanks Thomas. BR Thomas Skytte
