Bottleneck Diagnosis on SQL Server – New Scripts
Finally, I have found some time with my good colleagues at Fusion-io to work on some of my old SQL scripts.
Our first script queries the server for wait stats – a very common task for nearly all SQL Server DBAs. It is the first thing I do when I meet a new SQL Server installation, and I have found that I apply the same filters over and over again. Because of this, I worked with Sumeet Bansal to standardise our approach.
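As a rough illustration of the approach (the actual script is more complete; the wait types excluded below are just a sample of the usual benign/idle suspects that get filtered out), a filtered wait stats query looks something like this:

```sql
-- Sketch of a filtered wait stats query. The excluded wait types are only a
-- sample of the benign/idle waits typically filtered away; the real script
-- applies a much longer list.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'LAZYWRITER_SLEEP', N'SLEEP_TASK', N'WAITFOR',
                        N'CHECKPOINT_QUEUE', N'LOGMGR_QUEUE',
                        N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT',
                        N'BROKER_TASK_STOP', N'CLR_AUTO_EVENT')
  AND wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```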
You can watch Sumeet introduce the script in this YouTube video: http://youtu.be/zb4FsXYvibY. A TV star is born!
We have already used this script with several customers in before/after Fusion-io situations. As you may know, the tuning game changes a lot when you remove the I/O bottleneck from the server.
Based on our experiences so far, I wanted to share some more exotic waits and latches we have been seeing lately.
This is not exotic at all, you may say. In fact, it is probably the most common wait type in most SQL Server installations.
Here is what you may not know: this wait does NOT measure the I/O latency of the log drive. To get that, you have to query the log file's entry in sys.dm_io_virtual_file_stats – which, you may notice, often reports a lower wait time than WRITELOG. Just something to think about when measuring log latency, as we see a larger and larger discrepancy between this wait and the actual I/O wait.
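For reference, a minimal query against sys.dm_io_virtual_file_stats for the current database's log file might look like this (assuming the conventional file_id of 2 for the log file):

```sql
-- Actual write latency on the log file (file_id 2 is the log by convention).
-- Compare avg_write_latency_ms against the WRITELOG numbers from
-- sys.dm_os_wait_stats to see the discrepancy described above.
SELECT DB_NAME(vfs.database_id)      AS database_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes
       END                           AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(DB_ID(), 2) AS vfs;
```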
I have provided more details on this in my SQLBits presentations:
- Finding the Limits (Presentation from SQLBits 9)
We are seeing this latch wait a lot, especially in data warehouse environments. It is related to parallelism. While lowering MAXDOP is one way to remove it – it is not always the optimal way and you may not get the most out of your hardware if you do.
We typically see this wait happening in two different situations:
- When doing table scans
- When running INSERT…SELECT as part of an ETL process
In both cases, the wait indicates that you are hitting some form of scalability issue. My friend Henk van der Valk blogged about how to work around the table scan problem:
Henk used another type of non-volatile storage: battery-backed DRAM; the lesson he learned is equally applicable to Fusion-io.
I have written about optimising INSERT…SELECT in this whitepaper:
There is probably another blog entry to be written about this one day.
Typically, we see this wait when a filegroup in a database doesn’t have enough files allocated. It is easy to diagnose: just look for PFS pages in the resource_description. Yes, this applies to user databases too, not just tempdb.
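A quick way to check for this while the workload is running is to parse the resource_description of the waiting tasks. A sketch: PFS pages are page 1 of each data file and every 8,088 pages thereafter.

```sql
-- Find PAGELATCH waits on PFS pages. resource_description has the form
-- <db_id>:<file_id>:<page_id>; a page is a PFS page if page_id = 1
-- or page_id is a multiple of 8088.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
CROSS APPLY (SELECT CAST(PARSENAME(REPLACE(wt.resource_description, ':', '.'), 1)
                         AS bigint) AS page_id) AS p
WHERE wt.wait_type LIKE N'PAGELATCH%'
  AND (p.page_id = 1 OR p.page_id % 8088 = 0);
```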
How many data files you should have in a database or filegroup has been debated a lot in the SQL Server community, even by me. When running on spindles (or spinning rust, as we like to call it at Fusion-io) the answer is a big "it depends". But on flash, our experience and our benchmarking have shown that more is better, especially under high concurrency workloads.
Above: Runtime and PAGELATCH_UP waits when adding more data files. Notice the sharp drop from 1 to 8 files, and that the runtime curve continues to drop as you add more files.
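Adding files is straightforward. A hypothetical example (database, file and path names are made up; keep the files equally sized so the proportional fill algorithm spreads allocations evenly):

```sql
-- Hypothetical example: grow a filegroup to more, equally sized files.
ALTER DATABASE MyDatabase
ADD FILE
    (NAME = N'MyData_02', FILENAME = N'D:\Data\MyData_02.ndf',
     SIZE = 10GB, FILEGROWTH = 1GB),
    (NAME = N'MyData_03', FILENAME = N'D:\Data\MyData_03.ndf',
     SIZE = 10GB, FILEGROWTH = 1GB)
TO FILEGROUP [DATA];
```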
There is enough material for another blog entry (and for some nice new scripts) on this subject.
Another common wait. Typically, it occurs in three situations:
- When you issue frequent UPDATE statements against small tables
- When you run many concurrent INSERT statements into a table that has an index on an IDENTITY or NEWSEQUENTIALID column
- When using a table as the backing store for a message queue
Ad 1) The solution to situation 1 is normally to “pad” the table with a CHAR(4000) NOT NULL column so that only a single row fits on each page. However, this is only viable on small tables – but those are exactly the ones that tend to see this effect.
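As a sketch (the table and constraint names are hypothetical; pick the padding width so that, together with the existing columns, only one row fits on an 8 KB page):

```sql
-- Hypothetical example: pad a small, hot table so each page holds only one
-- row, eliminating PAGELATCH contention between rows on the same page.
ALTER TABLE dbo.SmallHotTable
ADD Padding CHAR(4000) NOT NULL
    CONSTRAINT DF_SmallHotTable_Padding DEFAULT '';
```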
Ad 2) For high scale systems powered by flash memory, our recommendation for generating keys goes directly against the typical approach taken by the SQL Server community. DO NOT use IDENTITY, SEQUENCE or NEWSEQUENTIALID to generate keys. In fact, we have found that the old and despised NEWID is significantly faster than NEWSEQUENTIALID. The numbers speak for themselves; here is the runtime of a workload that inserts 160M rows:
Above: Myth-busting “best practices” for generating keys.
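To make the comparison concrete, here is a hypothetical table using the random-GUID approach (the names are made up for illustration):

```sql
-- Hypothetical example: NEWID() keys spread concurrent inserts across the
-- whole B-tree instead of funnelling them all onto the last page, which is
-- where sequential keys (IDENTITY, NEWSEQUENTIALID) create PAGELATCH_EX
-- contention.
CREATE TABLE dbo.InsertBenchmark
(
    Id      uniqueidentifier NOT NULL
        CONSTRAINT DF_InsertBenchmark_Id DEFAULT NEWID(),
    Payload char(100)        NOT NULL,
    CONSTRAINT PK_InsertBenchmark PRIMARY KEY CLUSTERED (Id)
);
```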
There is a very interesting solution to the key generation problem that has been blogged about by another friend of mine, here is the link to Rick’s very elegant key generator:
Ad 3) I have previously blogged about the design pattern that solves this problem.
As we install more and more Fusion-io into SQL Servers and remove the I/O problem, we continue to learn about new bottlenecks in typical customer workloads and how to solve them. Through our understanding of the full application stack, we are helping customers get more out of their new hardware. I run a mentoring programme inside the company where our field unit, always eager for more knowledge, learns how to tune even very complex problems in SQL Server.
In Fusion-io, we live every day in the world beyond the I/O problem, and we are privileged to be the first to witness the disruption NAND flash is creating in the market. We are especially excited about Project Hekaton and how that is going to change the game once again. As we explore Hekaton, we will adjust our guidance. The SQL community has a lot to look forward to from Microsoft and we look forward to continuing to provide acceleration in this space.
I would like to thank my colleagues in Fusion-io's Data Propulsion Labs for helping make all this come together – especially Sumeet Bansal, the new face of our SQL Server YouTube series.