Archive

Archive for the ‘Grade of the Steel’ Category

Implementing MurmurHash and CRC for SQLCLR

2011-12-07 Leave a comment

As we saw in my previous post, the build in hash functions of SQL Server were either expensive with good distribution, or cheap, but with poor distribution. As a breath of fresh air, let us look at a useful magic quadrant:

 

image

We see that all of the hash functions exposed by HASHBYTES fall into low speed quadrants, caution is advised here. We also see that while CHECKSUM and BINARY_CHECKSUM are challengers, they do not have the spread to fully compete. We are left with modulo as the best possible hash function, but we believe that this algorithm may have trouble executing. The Modulo hash function has some problems (sorry, had to say “problem” instead of “challenge”, I couldn’t bear speaking management bollocks anymore):

  • It only works on integer types
  • If there is structure in the data (for example, if all values are equal) it will not spread the values equally
    The question we are faced with is:
    “Could we create a high speed, good spread, hash function that belongs in the leader quadrant and which does not have the limitation of the module functions”

It was this question I sought to answer when I started hacking away at SQLCLR functions. As you read the following, please don’t laugh too hard at me (a small snigger will suffice) – it has been a long time since I hacked away at code.

MurmurHash2 and Davy’s Bit Magic

Google has published a very interesting hash function, MurmurHash, claiming to have the properties I am looking for: It is both fast and has a good spread. The hash algorithm comes in different flavours: MurmurHash2 and MurmurHash3. Each flavour also exists in architecture and word size optimized versions. I decided to implement the x86/x64, 32-bit integer version in C#.

Fortunately, someone already beat me to it. Davy Landman has an implementation of MurmurHash2 – which he has kindly shared under GPL. I took his implementation and wrapped it in a SQLCLR user defined function.

While trying to understand Davy’s code and hacking it to work with SQLCLR, I had a few “aha moments” that I would like to share.

First of all, Davy uses this cute trick to turn 4 bytes from a BYTE[] array into a 32-bit INT:

UInt32 k = (UInt32)(data[currentIndex++] 
                  | data[currentIndex++] << 8
                  | data[currentIndex++] << 16
                  | data[currentIndex++] << 24
           
);

 

I thought that was a rather neat trick. To practice my illustration skills, here is what happens:

image

Please note that left shift (<<) and or (|) operators take INT32, not UINT32 as input. Hence the need to cast the final value to UINT32.

Speaking of integers, in my modification of Davy’s source code you will find:

unchecked
{
    return (SqlInt32)(Int32)h;
}

I need to get form the UINT which is returned by the reference implementation of MurmurHash to the signed integer (SqlInt32) that SQL Server understands. Doing the unchecked cast here is faster than Convert.ToInt32.

MurmurHash3

From the C++ source code of MurmurHash3, and using Davy’s bit trick, it took me only a few hours to get myself a nice implementation of MurmurHash3. I compared this with a C++ implementation done by my colleague Christian Martinez (based directly on the Google source), and we agreed on outputs that exercise all the branches in the code. So I am reasonably confident that my implementation is correct.

Except for the fact that it turns out to be important to use SqlBinary instead of SqlBytes (blogged here), there is not much more to say. Feel free to use the MurmurHash3 for your own implementations (my implementation is GPL)

The source code for the MurmurHash functions is too long to paste in this blog, so I created a new page on my site which you can find by following this link: C# Source code for MurmurHash in SQLCLR.

Before I show you the spread and speed results, let me just talk a little about two very old, but very commonly used, hash functions.

CRC16 and CRC32

The Cyclic Redundancy Check (CRC) family of hashes are, compared to MurmurHash, very simple to implement. They walk through each byte in the input and divide it with a specially selected polynomial. For implementation purposes, the polynomial is simply a constant in the code and the division is done with XOR and shifting (making this a very efficient operation). The remainder of the byte division is fed into the division on the next byte (giving the hash algorithm its name) and so on, until the end of the input .

WikiPedia contains a great reference implementation that I used for my tests. For readability, the source code is again available on a separate page of this blog: CRC16 and CRC32 in C#.

In the naïve approach to CRC, the polynomial division is implemented using an inner loop through all the bits of each byte:

rem = rem ^ data[i];

for (int j = 0; j < 8; j++)
{
  if ((rem & 0×00000001) == 0×00000001)
  {   // if rightmost (least significant) bit is set
     rem = (rem >> 1) ^ Polynomial32; /* Polynomial */
  }
  else
  {
    rem = (rem >> 1);
  }
}

An interesting optimization, shaving off a lot of cycles, is to create a table that allows you to look up the result of this loop for all 256 combinations of 8 bits. My source code contains such  tables, both for CRC32 and CRC16 implementations. Using these tables, I can replace the above code fragment with:

rem = (rem >> 8 ) ^ CRCTable32[(rem & 0xff) ^ data[i]];

…A major optimization

Speed Results

At this point, you may wonder if I did all of this to give you a chance to laugh at my rusty coding skills. What exactly was the point of implementing a new hash function in the first place?

Without further ado, let us look at the results compared to native SQL Server functions:

image

Isn’t that beautiful? If we don’t need the cryptographic properties of the HASHBYTES functions, we can beat SQL Servers hashes with our own SQLCLR implementation. Please note that I took out MD2 from the above to avoid skewing the results (I have previously shown that it is simply too inefficient).

Now, the bad news: the yellow line above. That represent a “hash function” that returns zero, like this:


public static SqlInt32 NoHash(SqlBinary data)
{      
  return (SqlInt32)0;
}

 

Notice that even such a simple user defined function that does nothing, consumes almost all the same number of CPU cycles as a full hash calculation. This is the overhead of the CLR and throwing data back and forth between managed and unmanaged code. So much for my optimizations in CRC.

Spread Results

We now know that there is at least hope of implementing a faster hash function than HASHBYTES. But how well do these functions spread the input. Running the chi-squared test from my previous blog entry we get These results:

image

Now THAT is pretty nice isn’t it? We see that good old CRC makes for an excellent hash function for our purpose. MurmurHash does very well too, comparable with SHA, but at under half the CPU cost.

Summary

In this blog, I have shown you how to implement hash functions that are both faster and have better spread than the built in hash functions exposed in SQL Server HASHBYTES. We have also seen that the overhead of SQLCLR is significant for this case, and hence, it would be preferable if non cryptographic hash function were exposed natively in SQL Server.

If you care about such new hash functions for SQL Server, I would suggest you file a Connect item to ask for them and state your purpose for wanting them. “It would be cool” or “Thomas says so” is not enough, provide a business justification. But at least, now you know what to ask for…

I owe Christian Martinez thanks for this blog for directing me to the MurmurHash functions, and for helping me validate the correctness of my C# implementation against his C++ version.

PS: As I was exploring this little implementation, I noticed that one of my heroes, Donald Knuth, has extended The Art of Computer Programming with a new volume: Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams. That sounds like an interesting read and it is now on my birthday wish list.

Thread Synchronization in SQL Server

2011-11-09 Leave a comment

Any code optimized for highly concurrent workloads must worry about thread synchronization. SQL Server is no exception, because in a database system, synchronization is one of the core functionalities you rely on the engine to provide (the noSQL crowd may ponder that a bit). In this blog post, I will describe the synchronization primitives, like locks, latches and spinlocks, used by SQL Server to coordinate access to memory regions between threads.

imageBefore I go on, an update: am currently sitting in the train from Stockholm after a hugely successful PASS SQL Rally Nordic conference.

At the conference, I presented my new “Grade of the Steel” deck, which I hope will be available online soon. The deck led to a lot of questions about spinlocks during the party that same evening.

We had Swedish legend E-Type playing and putting on an impressive stage show (including dancing Vikings, as seen in picture to the right). The show left me wondering how to improve my presentations skills to get THAT level of reach to the audience. The conference arrangers had  also done an amazing job at keeping everything running smoothly and I return with many good memories. Thank you guys, you showed the crowd that the Nordic community is very strong.

Well, “here I go again”, being distracted by all the fun. My point was, the questions at the conference led me to write this blog entry, as your (always not so) humble servant. Keep that feedback coming.

Locks and Lock Graphs

Most DBAs and developers are familiar with locking and blocking: Locks in SQL Server are the semaphores that guard the consistency of the database and enforce the high level ACID properties and serialization properties in the database.

You can measure locking activity in several DMV, and there are many articles written on how to diagnose and troubleshoot locking issues. For this blog, suffice to say that the total time the entire engine spends on waiting for locks is measured in sys.dm_os_wait_stats. I would like to make the point that EVERY performance investigation I do on SQL Server stats with querying this view… Learn to use it.

In sys.dm_os_wait_stats, you will find rows that have wait_type beginning with LCK. These are the lock waits, and they have been extensively described elsewhere.

The one thing I would like to add, is that locks have a directed acyclic graph relationship (DAG) with each other and are generally acquired in a certain order. Let us take a hypothetical example of a query locking five rows in a table with two partitions:

image

Before the query can even start, it needs a database (DB) lock to even run, this lock is acquired by the connection itself and held as long as you are connected to the database.

The first locks acquired by the query itself, are the two schema locks (SCH in shared mode S): one of the table itself, and one for the schema the table lives in. Note that the term “schema” is a bit overloaded here: I mean to say that the lock is acquired both on the table (the table object in the database schema) and on the schema (the schema object in the database schema). The SCH locks prevent the table structure itself from being modified while the query runs. The requirement to hold this SCH lock is why, for example a SWITCH operation, will block a person from reading the table while the switch happens.

When SCH locks are acquired, the query must hold a lock on the table. This prevents other conflicting locks, for example a bulk inserter or a greedy, table escalated X-lock, from conflicting with the query.

Traversing onwards from the table level, we have a special lock level: HOBT, representing a partition (HOBT = Heap Or B-Tree, pronounced: “Hobbit”). Recall that even an unpartitioned table has one partition: the table itself. The HOBT lock is only acquired if the table has been configured to support this lock level. This can be done with this statement:

ALTER TABLE <table> SET (LOCK_ESCALATION = AUTO)

 

… note that the default for tables is to have this lock type disabled and to jump directly to…

Row and page locks: Queries will generally lock the pages it needs access to, though this is mostly a performance optimization used for table scans. Row locks fall into two different categories, RID or Key, depending on whether it is a heap or a B-tree you are locking. Note that you can disable row and page locks by altering the table like this for a B-tree:

ALTER INDEX <index> ON <table>
SET (
   
ALLOW_ROW_LOCKS = OFF
  , ALLOW_PAGE_LOCKS = OFF)

 

There are cases where disabling the row and page lock levels make queries go faster, but it can also have a negative effect on concurrency. I hope to blog about those cases, and how to diagnose them, in a later installment.

I said before that locks are acquired in a the order of the directed acyclic graph by the SQL engine. This is not something you need to worry about, the engine handles this for you. But, can you guess why it matters which order locks get acquired?

(Hint: What would happen if someone started at the top at the same time that someone else started from the bottom of the graph?)

Locks in SQL Server are stored in hash tables in the part of the engine called the lock manager. These hash tables consume some memory (but so little we don’t really care about it, so Oracle people, stop bickering this point) and are optimized for highly concurrent access. The lock manager even partitions locks over multiple NUMA nodes to make accessing lock information faster. The mechanism used to partition locks are similar to the SuperLatches described in: Hot It Works: SQL Server SuperLatch’ing / Sub-latches.

Latches

Latches are a special type of lock that is used inside the SQL Server engine to protect shared data structures. They enforce integrity at the physical layer of the database and will be used automatically by SQL Server, irrespective of the transactional semantics you request.

The wait time for latches shows up in sys.dm_os_wait_stats, there are three broad categories: PAGEIOLATCH, PAGELATCH and LATCH

    The first two are latches used by the database to protect the integrity of the buffer pool. The PAGEIOLATCH variant is used when data is brought in from disk, while the PAGELATCH is used to coordinate things that are already in memory.
    The last latch type, LATCH, is a different beast that is used several places inside the SQL code path to coordinate other memory structures – for example during query execution. The good news is that you can get more details about this latch from the DMV: sys.dm_os_latch_stats. You can think of this view as a way to “zoom in” on latches.

The view contains a latch_class called BUFFER – this is just the aggregate of all the PAGEIOLATCH and PAGELATCH waits, you already have the detail of those, so just ignore it. The rest of the rows in sys.dm_os_latch_stats add up to the LATCH waits.

Spinlocks

Locks and latches have something in common: When a thread in SQL server needs to wait for them, the thread will block and yield the scheduler. In general, it is a good idea to play nice this way and yield the scheduler to another thread if you cannot get any more work done until you acquire access to the structures you want to use. SQL Server also implements user mode scheduling, which takes down the price of a context switch significantly – further boosting scale.

Think about locks and latches with this analogy: You are about to leave a plane, but your luggage is in an overhead compartment in the seats behind you. The best strategy for optimizing throughput, is for you to sit down, wait for the people in the seats behind you to pass and then go open the overhead compartment. This optimizes for the best system throughput and lets you turn off your brain just a little longer.

HOWEVER, sitting down and patiently waiting for other people to get their work done first, is not always the smartest idea. As you leave the plane and move to passport control, you don’t sit down on the floor while you wait for the person in front of you to fumble for his passport in his pocket. The reason you don’t do this, is that you expect him to quickly be done with his fetching; and you remain alert while you wait (and curse). It would be a waste of time and leg muscles to sit down, and the people behind you would get annoyed by you too.

Spinlocks are optimized to protect data structure where you expect the thread (or person) in front of you to get their work done fast! They are typically implemented as busy waiting loops that are optimized for the CPU architecture they run on (we want the loop to be as cheap as possible and we need some low level tricks to get it working). Wikipedia has an excellent article with x86 assembly examples.

SQL Server uses spinlocks in many parts of the code where the assumption of “shortly held locks” can be made, and typically this gives you more throughput and is very efficient.

You can see how many spins are done by querying this view:

SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY backoffs DESC

 

In older version of SQL Server (before 2005) – this view was not available. However, there is another way to get the spinlock information, namely by executing:

DBCC SQLPERF ( spinlockstats )

 

The problem with spinlocks is of course that the assumption about shortly held locks is not always true. Sometimes, the guy in front of you just take too long to get stuff done, and you are stuck with all your attention pinned to him (spinning) and getting more and more annoyed. Interestingly, NUMA systems are typically the machines where this assumption breaks down, because accessing memory on a foreign node can take a significant amount of time. If access to that memory is protected by a spinlock, this means you can burn a lot of CPU in the busy waiting loop and you can get some very interesting side effects.

In general, you should not worry about high values in the spins column of sys.dm_os_spinlock_stats. Spinlocks are after all designed to spin a lot in busy waiting loops. The condition that may cause you to worry that the person in front of you is not clearing the queue fast enough, is when you see high values in the backoffs column. Backoffs are what happen when the thread spinning for a resource gives up and decided to fall asleep. If the backups go up dramatically as you increase throughput, you may have spinning on a memory region that is heavily contended and have to design around it.

There is an excellent article on how to diagnose spinlocks written by SQLCAT members Mike Ruthruff and Ewan Fairweather: Diagnosing and Resolving Spinlock Contention on SQL Server. Among other good stuff, it details how to use Trace Flag 3656 (we all love trace flags) and XEvents (and we love XEvents even more) to troubleshooting spinlock problems (see page 14-16).

Interlocked Operations

I am including this last synchronization primitive for completeness (read on to understand why).

Modern x86/x64 CPUs expose some very interesting instructions that allow you to build what is known as interlock operations. An interlock is a CPU special instruction, or series of CPU instructions, that are guaranteed to perform an atomic operation on an integer. For example, using an interlock you can safely add one to an integer and not have to worry that someone else is trying to add one at the same time. As another example, spinlocks (see above) are typically implemented using the compare and swap operation (CMPXCH in the x86 instruction set) that allow you to compare two values and swap them if they are equal, in a single, atomic instruction.

The reason I am only including this section for completeness, is that you have no way of telling if SQL Server used an interlock to synchronize access to a memory region (though it sometimes does) and no way to influence it. And the reason is that an interlock neither blocks, nor spins the thread – it simply allows the thread to execute as if no other threads were touching the data, because the interlock instruction “knows” that the underlying hardware handles the synchronization. It is the ultimate, lightweight synchronization trick.

You may wonder why SQL Server does not use interlocks everywhere. The big reason is that interlocks are severely limited. They can typically only operate on a a single integer at a time, and only support a small subset of operations on that integer. This is not enough to fully implement all the thread synchronization you need.

If you are interesting in reading more about interlock operations, there is a very nice introduction at this site for computer game programmers: Multithreaded Programming Part 3: Going Lockless. And by the way, I recommend studying the work of computer game programmers – they are the true wizards of computer science.

Related:

Exploring Hash Functions in SQL Server

2011-11-06 4 comments

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may chose intelligently between them.

In this blog post, I will benchmark the build in function in SQL Server. I will focus on answering two questions:

  • How fast is the hash function?
  • How well does the hash function spread data over a 32-bit integer space

I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.

For the picky people (my dear soul mates): I will be using 1K = 1024 in the following, so 64K = 65536.

      Test data

      Because we sometimes find ourselves hashing dimension keys, I will use a very simply table with a nice key range as my test input. Such a key range could for example have been generated by an IDENTITY(1,1) column or a SEQUENCER. Here is the test script that generates the input:

CREATE TABLE CustKey (SK INT NOT NULL)

INSERT INTO CustKey WITH (TABLOCK) (SK)
SELECT n FROM dbo.fn_nums(10000000)

 

(see my previous post for the fn_nums source)

Speed of the Hash function

SQL Server exposes a series of hash functions that can be used to generate a hash based on one or more columns.

The most basic functions are CHECKSUM and BINARY_CHECKSUM. These two functions each take a column as input and outputs a 32-bit integer.

Inside SQL Server, you will also find the HASHBYTES function. This little gem can generate hashes using MD2, MD4, MD5, SHA and SHA1 algorithms. The problem for the purpose of our test is that these function spit out BINARY types, either 128 bits (for SHA) or 160 bits (for MD). However, we can quickly map that value into the integer space by doing a modulo (in SQL: %) with MaxInt (2**31 – 1). Interetingly, Modulo work directly on values of BINARY in SQL Server.

Before I move on the results, I would like share a little trick. When you need to get the raw CPU speed of the hash, you may be tempted to do something like this:

SELECT BINARY_CHECKSUM(SK) AS h
FROM CustKey

 

But this is wrong. Why? Because you are both measuring the time it takes to hash as well as the time it takes to transfer the rows to the client. This is not very useful, on my laptop it takes over two minutes to run the above query.

Instead, you should try to measure the raw time the server needs to calculate the data. One way to achieve this (and this trick can be used on most queries you benchmark) is to wrap the SELECT statement in a MAX function. Like this:

SELECT MAX(h)
FROM (
    SELECT BINARY_CHECKSUM(SK) AS h
    FROM CustKey
    ) noRows
OPTION (MAXDOP 1)

Notice something else interesting, I am forcing parallelism down to 1. I want to get as close to the raw cost of the CPU as possible – so I don’t want to clutter the query with any overhead of parallelism. The query now runs in a little less than 3 seconds on my laptop.

I addition to the above trick, we also need to quantify the time SQL Server spends on just accessing the table. We can get a good approximation of this by measuring the SELECT statement without any hash function. On my laptop, I measured this to be around 3200 ms. In the results below, I have subtracted this runtime so we are only measuring the cost of the hashing.

With all this said, let us have a look at the results:

image

Some comments on the data:

  • Notice that MD2 is very inefficient. I have not studied this algorithm in great detail and would appreciate a specialist commenting here.
  • Apart from the MD2 anomaly, the HASHBYTES functions all performan pretty much the same. Using around 1 micro sec per hash value calculation
  • BINARY_CHECKSUM and CHECKSUM are MUCH faster, by about an order of magnitude
  • Doing a simple modulo that spreads out the data over the integer space is about the same speed as CHECKSUM and BINARY_CHECKSUM
  • It is not the cast to NVARCHAR that uses CPU time (HASHBYTES requires NVARCHAR input)
  • However, the checksum function do take longer to run when you first cast to NVARCHAR. Costing around 100 ns per INT value

Of course, the HASHBYTES functions also have other characteristics than speed which may be desired (cryptographic utility for example). but if you you want is speed, it looks like CHECKSUM and BINARY_CHECKSUM are faster.

But, let us see how good the functions are at spreading the data.

Spread of the Hash Function

Another desirable characteristic, apart from speed, of a hash function is how well it spreads values over the target bit space. This is useful not only for cryptographic purposes, but also to “bucket” data values into equal sized portions, for example in scale-out MPP architectures.

Often, you will not know in advance how many buckets you eventually want to subdivide the integer space into. You may start out with 32K buckets and divide up the integer interval in 128K (32-bit space divided by 128K = 32K) even sized buckets, each with 128K values in them:

  • Bucket0: [MinInt…MinInt + 128K[
  • Bucket1: [MinInt + 128K…MinInt + 128K*2[
  • Bucket2: [MinInt + 128K*2…MinInt + 128K*3[
  • … etc..
  • Bucket 32K: [MinInt + 128K*32K…MaxInt]

Perhaps you will later want a further subdivision into 64K buckets, each with 64K hash values.

Wise from the runtimes I measured before, I used only 1M rows with values [1…1M] for this test. However, this is plenty to show some good results, so don’t worry.

To avoid running chi-square testing (See later) of integer value spread all over the 32-bit space, I calculated the hash of each key and I then bucketed them into 64K bucket ranges (each with 64K values, neat isn’t it?).

This table comes in handy for that bucketing

CREATE TABLE ExpectedBuckets 
(
    BucketID INT
 
, RangeStart INT
  , RangeEnd INT)


INSERT INTO ExpectedBuckets
SELECT
    n
    ,(n – 1 – POWER(2,15)) * POWER(2,16)
    , (n – POWER(2,15)) * POWER(2,16) – 1
FROM dbo.fn_nums( POWER(2,16) )


A sample output form this table:

image

I can calculate the hash values and put them into a table like this:

CREATE TABLE HashResult
(
  ObservationID INT IDENTITY(1,1)
, HashValue_i INT
)

CREATE CLUSTERED INDEX CIX ON HashResult (HashValue_i)

 

And finally, I can join them all up and count the values in each bucket:

SELECT
      BucketID
    , CAST(ISNULL(COUNT(HashValue_i), 0) AS FLOAT) AS Observed
FROM ExpectedBuckets EB
LEFT JOIN HashResult
ON HashValue_i BETWEEN EB.RangeStart AND EB.RangeEnd
GROUP BY BucketID

 

We now have a nice count of the content in hash bucket, it looks like this:

image

How do we test if this is a good result or not? It seems that min/max, averages and standard deviations don’t really suffice here (sorry modern MBA types, I realise I am going to loose you now Smile).

We need to do some hypothesis testing. The hypothesis I want to test (aka: the NULL hypothesis – all resemblance to relational algebra is purely coincidental).

“My chosen hash function evenly distributes the integers into the 64K hash buckets”

Interestingly, this hypothesis makes a prediction (has to, if not we cannot falsify it): namely that in each bucket filled by hashing the 1M rows, we should expect to measure 1M / 64K = 15 members.

There is also an important twist here: We have to measure the content of each bucket, even when it is zero. This is why I use a left join in the query above, we have to measure all the results we know, not just count the hash buckets that actually get filled (you can see how not bucketing the data would have made the test result very large).

We can now test the hypothesis against the measured result. If we see correlation, then it gives us confidence in the correctness of the hypothesis, and lets us conclude that our hash function is a good one. It also gives us the ability to compare different hash functions with each other. We can test for correlation between the predicted result and the actual result with the chi-square test. Interestingly, the tools we need for this test is even exposed in Excel as the CHISQ series of functions.

Without further ado (and insults), here are the results:

image

(source data available on request)

Some observations about the data:

  • Most importantly: the CHECKSUM and BINARY_CHECKSUM functions are very poor hash functions. Even with 1M input and 64K large buckets, they don’t even fill the hash space. The skew is massive on non-skewed input data.
  • We also see that casting the INT to NVARCHAR improves the behavior of both BINARY_CHECKSUM and CHECKSUM, but not enough to make them good hash functions.
  • SHA and SHA1 seem to give us the best spread of the data, but at 1 microsecond per hash, it seems rather expensive.
  • Notice that the MD series do not give a particularly nice spread. I suspect this is related to the way I divide the data with MaxInt to get from the 120bit value to a 4-byte integer
  • We see that only the modulo does a distribution that gives us nearly 100% confidence (not fully though, because 1M does not divide 64K, which is reflected in the 824 value of Chi-squared). This result is expected, and my function here is engineered to “game the test”. The modulo cannot generally be used like this to spread the value all over the 32 bit integer space.
  • For fun, I put in a test where I use random values all over the integer space. Even with a confidence interval of 10%, we cannot say it distributes evenly. This means that SHA functions distribute better than random on this input dataset. Your mileage will of course vary for random values

On the note of the CHECKUM series, Books Online does (under?) state:

If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

Summary

In this blog I have explored properties of hash function built into SQL Server. The results indicate some general guidance that could be followed:

  • For best spread over the hash space, use SHA or SHA1 algorithms with HASHBYTES function
  • Be VERY careful about using CHECKSUM and BINARY_CHECKSUM as these functions do not spread a hashed integer value nicely over the full space. Depending on what you use these functions for, they may not be suitable
  • It takes about 1 CPU microsecond to calculate a single call to HASHBYTES, except when hashing with MD2
  • Using HASHBYTES to spread out values over the 32-bit integer space seems rather expensive, since even the best cases uses tens of thousands of CPU instructions.
  • This indicates that it may be possible to implement a better hash function if you do not care about cryptographic properties of the hash

Reference:

SQLBits and Phones

2011-10-06 2 comments

imageMy presentation from SQLBits: “Finding the Limits: The Grade of The Steel” should be online soon. There is a lot of stuff to blog about and so little time to do it. It was some fun days of tuning as the picture shows.

I am curious to hear comments on my session. Was it useful? What other tests would you like to see? Do you prefer this presentation style over other styles (no, I won’t do demos!).

Special thanks to the good people over at Fusion-io for letting me use their kit to run tests. You guys rock!

In other news: I finally found a phone that is just a phone. It is called the Nokia X2, I had it for only a few days and I am already liking it a lot. So far, it has survived on only one charge.

Boosting INSERT Speed by Generating Scalable Keys

2011-10-05 7 comments

Throughout history, similar ideas tend to surface at about the same time. Last week, at SQLBits 9, I did some “on stage” tuning of the Paul Randal INSERT challenge.

It turns out that at almost the same time, a lab run was being done that demonstrated, on a real world workload, a technique similar to the one I ended up using. You can find it at this excellent blog: Rick’s SQL Server Blog.

Now, to remind you of the Paul Randal challenge, it consists of doing as many INSERT statements as possible into a table of this format (the test does 160M inserts total)

CREATE TABLE MyBigTable (
    c1 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT
NEWID ()
    ,c2 DATETIME DEFAULT GETDATE ()
    ,c3 CHAR (111) DEFAULT ‘a’
    ,c4 INT DEFAULT 1
    ,c5 INT DEFAULT 2
    ,c6 BIGINT DEFAULT 42);

Last week, I was able to achieve  750K rows/sec (runtime: 213 seconds) on a SuperMicro, AMD 48 Core machine with 4 Fusion-io cards with this test fully tuned. I used 48 data files for best throughput, the subject of a future blog.

Now, I suspect Paul used NEWID() for key generation, to create as much I/O as possible with page splits. If the goal is purely to optimize throughput, why not ask: Is NEWID really the best way to generate a key?

Traditional SQL Server knowledge dictates that either IDENTITY(1,1) or NEWSEQUENTIALID are better key choices: they lay out the keys sequentially, at the end of the index. But one should not easily trust such recommendations without validating them. I did the above test on the 48 core box Fusion-io lend me, and here are the results:

 

image

This might shock you! The “Traditional, sequential layout” of the key is more than 100 times SLOWER than completely random inserts!

The CPU are NOT at 100% during this (they were when using NEWGUID). The problem shows itself when we dig into sys.dm_os_wait_stats. (by the way, below is my new: “this is how I think about it” diagram, let me know if you like this way of doing things and I will continue to use this format for illustrating troubleshooting mind processes)

image

We are waiting for a PAGELATCH on the last pages of the index. This wait dominates the run. The price of page splits is minor (when you have a fast I/O system) compared to the price of coordinating the last page memory access!

As Rick writes in his blog, there are ways to work around this problem by being smart about your key generation. Basically, you want a key that is NOT sequential. Rick flips the bits around in a Denali SEQUENCE objects – a very neat idea that uses a new SQL Server Denali feature.

For my test at SQLBits (which ran on 2008R2), I used a variant of the same idea: I took the @@SPID of the users session and added an increasing value (increments controlled by the app server) to each insert made. On reflecting, using the SPID is not a good idea, you should probably generate the key something like this:

INSERT Key =

   [Unique App Server ID] * 10**9 + [Unique Number Generated at App Server]

I also tried the hash trick I have described here – to compare the techniques.

Below are the results of my test of Paul’s INSERT Challenge with different key strategies (I have taken out the sequential ones, since we have seen how poorly they perform)

image

The last column is a trick I will describe in more detail in a later blog. Suffice to say that I reached 1.1M INSERT/sec (runtime: 150 sec) using a key generation strategy very similar to Rick’s.

Edits:

  • The UNIT in the introduction was originally wrong. it should of course be 750K not 750M as can be seen later in post.

The Ascending Column Problem in Fact Tables –Part two: Stat Job!

2011-07-07 2 comments

In my last post I described a common problem with statistics in data warehouses. I also directed you to a trace flag that partially solves the problem. There is an issue with this trace flag: at the time of writing this blog, it does not work on a partitioned tables. Considering that large tables tend to be partitioned, this could be an issue.

Recall from my last post that old statistics will not contain information about the newly inserted rows until stats are updated. Queries asking for data that is outside the bounds of the histogram will estimate 1 row and typically end up with a bad, serial plan. The trace flag works, or doesn’t, by trying to intelligently guess how many rows are such an “out of bound” range instead of assuming the value of 1.

There is another approach that does not rely on the optimizer making guesses: You can intentionally deceive the optimizer, lie about the distribution. Recall that we want the “desired DW plan”. Alas, as with all telling of lies, this exposes you to a lot of risk.

Disclaimer: What I am about to show you is a: “Don’t do this at home” trick. We are going to see some features in SQL Server used in a ways they are not intended for. And we are going to take control of the optimizer in ways it is not designed to handle. If you go down this path, you are on your own! Don’t call support with your issues. Just because I showed it to you, doesn’t mean you can put this in production.

Consider this example an exploration and education of the internals of SQL Server. And think of in the context of what you might consider asking for, next time you file an MSConnect item.

On the issue of MSConnect: If you care about big (any TB sized) DW workloads: please, please file and vote on items that make your working day hard. It helps if you describe which company you work for and what the impact is on your business of not having the feature or “fix”, I know this is not always possible. I know it takes times, and I appreciate you are busy, but Microsoft does listen, and they listen more closely when you can make it clear why this feature is needed and how it hurts you (I call this the “blood on the table” approach). If your favourite Connect item is about DW functionality, shoot me a mail (include the ConnectID please) to let me know. I may well put up a big fight for items that align with my mission in MS. Please keep in mind: I don’t care about client tools at all, I fight for server side features. We have a lot of people pushing on client side features and they are very important too, I have just chosen to specialize.

And with that said, let us dig in:

Recall that our issue was that the histogram lacked rows/buckets containing information about a range of data. The histogram in SQL server has up to 200 rows, but even if we could add a single row to the histogram on every insert, we would soon run out on a large table.

But how about this idea: Fake a histogram were every year we ever want to store (say 100 of them) contains a lot of data. In fact, make sure that every date range or value we select contain a tremendous amount of rows. One way to create such a histogram is to grow the fact table very large and then capture the stats and make it smaller again. If your fact table is not born with the right proportions, growing fat to get the right histogram proportions and then slimming down is not a very nice approach is it?

Our Fact table histogram is not pretty as it is, but how about doing some precision surgery instead? For that, we need an implant. First, we create a table with the same index structure as Fact:

CREATE TABLE HackStats
(
  date_sk INT NOT NULL
  , payload CHAR(4000) NOT NULL DEFAULT ‘X’
)

CREATE CLUSTERED INDEX scan_me ON HackStats(date_sk)

Now, let us prepare the implant. We need something that has these characteristics:

  1. Enough histogram divider rows to cover all possible data values we could ask for.
  2. A very large value for RANGE_ROWS and AVG_RANGE_ROWS between in each histogram row.
  3. A value of 1 for the DISTINCT_RANGE_ROWS in each histogram row.
    Ad 1) Make sure that we do the implant once, and that I lasts for the lifetime Fact.
    Ad 2) Make sure that there is sufficient size and elasticity for any range of dates we might touch.

More about 3) later…

With a little reverse engineering (no, I did not look at the source), we can create a histogram that match our requirement. It does not take a lot of rows in HackStats. Here we go:

SET @y = 1950
WHILE @y < 2050 BEGIN
  /* make the stats dense (low distinct) within the year */
 SET @r = 0
 WHILE @r < 100 BEGIN
/* Make sure the equal rows in the histogram are never hit using invalid
    dates  */
    INSERT HackStats (date_sk) VALUES (@y * 10000)
    INSERT HackStats (date_sk) VALUES (@y * 10000 + 5000)
    SET @r = @r + 1
  END
  SET @y = @y + 1
END

You may have been wondering why I added a payload column to HackStats. For small tables (by page size), SQL Server will do a full sample, even if we ask for a small percent. And the engine is smart enough to know that is has seen all pages. A fully sampled table cannot be hacked and made artificially pretty – in ways you will soon observe. The HackStats table has to be small, but large enough (in our case 20K pages) that all rows will not be sampled. And that is why we had to to pad the rows a bit.

We can now hack like this:

 /* prepare the implant */

UPDATE STATISTICS HackStats WITH ROWCOUNT = 199000000
UPDATE STATISTICS HackStats WITH SAMPLE 99 PERCENT

The first statement above “fakes” a large table by telling SQL Server that there are a LOT of rows in the table. The second statement samples the distribution of those rows (which we have cunningly crafted to have just the right dimensions, so we need to sample almost all of them).

Let us check the quality of the implant:

DBCC SHOW_STATISTICS (HackStats, scan_me)
WITH STAT_HEADER, HISTOGRAM

Output (your results may vary a little, depending on which rows got sampled, rerun the sample if you don’t like the output)

image

And:

image

Notice the neat distribution. Any query asking for a year range will get a high estimate of the outcome of scanning the table. SQL Server is smart enough to boost the row estimates based on how many total rows it believes are in the table.

And now, for the dirty little trick (Brent Ozar, if you are reading this, just don’t say it Smile ):

Let us transplant the faked, big stats to the ugly Fact table. First, take the implant:

DBCC SHOW_STATISTICS (HackStats, scan_me)
WITH STATS_STREAM

This gives an output somewhat like this (your result will vary):

  0x01000000010000000000000000000000B1AB0052000000….

We don’t really have to worry what is in there, it is some binary representation of the stats. The feature to read the binary stats was put into SQL to allow database clones – and we are abusing it now. But hey, we are in the business of precision surgery: Copy the bit values to the clipboard, past and execute:

UPDATE STATISTICS Fact scan_me WITH
STATS_STREAM = 0x01000000010000000000000000000000B1AB0052000000….

And a finishing flourish, to make it fact table look really big:

/* Make the table expensive to access in the future */

UPDATE STATISTICS Fact scan_me WITH PAGECOUNT = 200000000

Validate the result:

/* Check the patient */

DBCC SHOW_STATISTICS (Fact, scan_me)
WITH HISTOGRAM

(output similar to HackStats)

And let us check our problem query again:

DBCC FREEPROCCACHE
SELECT C.shoe_size, SUM(Measure)
FROM Fact F
JOIN DimCustomer C ON F.customer_sk = C.customer_sk
WHERE F.date_sk = 20010111
  AND C.shoe_size = 43
GROUP BY C.shoe_size

Eureka: the “Desired DW Plan”:

image

 

Notice something interesting: the Estimated Operator Cost of access is about 736. That is a REALLY expensive access to the table. Just the way we like it actually, wouldn’t want the optimizer to get any ideas that this table is cheap to access. Our little trick with hacking the page size gave our Fact patient quite a bit of confidence.

I deferred talking about why we need a low number of distinct rows in the histogram. For now, I leave this as an exercise to the reader. Hint: it is related to the bitmap filters and the way Fact is accessed.

And remember folks, you do this at your own risk, you are on your own and in unsupported land here… Don’t come crying to support if you break something. Ask for a connect item instead.

The Ascending Key Problem in Fact Tables– Part one: Pain!

2011-07-01 3 comments

Time for another deep dive into SQL Server. In the last couple of days, I have been having coffee with Joe Chang in the afternoon, who some of you may know. He is visiting Sweden a lot by the way, so shoot him an email if you are interested in having him present. Joe and I ended up talking about an interesting problem in data warehousing: the ascending key problem. I think it is best illustrated by an example.

Let us say you have a pure star schema (sorry, Inmon people), with a large fact table. If you are following the guidance in the Top 10 Best Practices from SQLCAT (bullets 1,2 and 9), you typically partition the fact table by date, and you  cluster on date too. On of the reason you do this, is that it helps you do fast range scans.

Let us try an example star schema that follows this guidance:

CREATE PARTITION FUNCTION pf_daily(INT) AS RANGE RIGHT FOR VALUES
(20000101, 20000102
, 20000103, 20000104
, 20000105, 20000106
, 20000107, 20000108
, 20000109, 20000110
, 20000111, 20000112
, 20000113, 20000114
)
CREATE PARTITION SCHEME ps_daily AS PARTITION pf_daily ALL TO ([PRIMARY])

CREATE TABLE Fact
(
  date_sk INT NOT NULL
  , customer_sk INT NOT NULL
  , payload CHAR(80) NOT NULL DEFAULT REPLICATE(‘X’, 80)
  , measure MONEY NULL
)
/* Turn off stats (as per typical DW) */
EXEC sp_autostats ‘Fact’, ‘OFF’
CREATE CLUSTERED INDEX scan_me ON Fact(date_sk) ON ps_daily(date_sk)

CREATE TABLE DimCustomer
(
  customer_sk INT NOT NULL
  , shoe_size INT NOT NULL
)
CREATE UNIQUE CLUSTERED INDEX CIX ON DimCustomer(customer_sk)

 

If you have a very large fact table, you often don’t want to run stats in the middle of a query, so we have turned auto stats off. In this context, recall that stats in SQL Server are per table, not per partition – which can be a pretty expensive operation. We really want to control when stats are run for large tables.

To generate some test data, let us fill up 10 partitions with 50MB each. For the customer dimension, we will use 100K rows:

/* Generate test data */

DECLARE @i INT
DECLARE @customer_size INT
SET @customer_size = 100000
DECLARE @day_size_pages INT 
DECLARE @day_size_rows INT
SET @day_size_pages = 50001 / 8
/* last sum are row width plus uniquefier */

SET @day_size_rows = @day_size_pages * 8060 / (4 + 4 + 80 + 8) 

 

/* Create one day of data */
CREATE TABLE OneDay
(
  date_sk INT NOT NULL
  , customer_sk INT NOT NULL
  , payload CHAR(80) NOT NULL DEFAULT REPLICATE(‘X’, 80)
  , measure MONEY NULL
)
SET @i = 0
WHILE @i < @day_size_rows BEGIN
  INSERT OneDay (date_sk, customer_sk, measure)
 
VALUES (’20000101′, @customer_size * RAND(), RAND() * 1000)
  SET @i = @i + 1
END

/* create the customer dimension */
SET @i = 0
WHILE @i < @customer_size BEGIN
  INSERT DimCustomer (customer_sk, shoe_size)
  VALUES (@i, 36 + @i % 15)
  SET @i = @i + 1
END

/* Insert ten days of data in the fact */
SET @i = 0
WHILE @i < 10 BEGIN
  INSERT Fact (date_sk, customer_sk, measure)
  SELECT 20010101 + @i, customer_sk, measure
  FROM OneDay
  SET @i = @i + 1
END

Good, now we can get started on some testing. Let us first execute a typical, star schema query. But before that, we will update the statistics. To be nice to the query, we will even do a full scan.

UPDATE STATISTICS Fact WITH FULLSCAN
DBCC FREEPROCCACHE 

/* An ordinary star-schema query arrives */
SELECT C.shoe_size, SUM(Measure)
FROM Fact F
JOIN DimCustomer C ON F.customer_sk = C.customer_sk
WHERE F.date_sk = 20010101
  AND C.shoe_size = 43
GROUP BY C.shoe_size

On my 2-core laptop and a hot buffer pool, this query runs in a little over 100ms, touching over 500K rows. Life is good. The query plan is what I like to refer to as “the desired DW plan”:

image

All the neat stuff in the SQL Server optimizer that we like for DW workloads:

  • Hash Join to dimension tables
  • Bitmap index pushdown to the fact table (BOL link)
  • The right join order (build hash on dimension, use fact table to probe)
  • Parallelism
  • Cluster index seek into nice, fat range of facts

But warehouses are not always nice to us. During the night, some evil ETL developer comes along and loads a new day of data:

/* Nightly ETL run adds another day of data */
INSERT Fact (date_sk, customer_sk)
SELECT 20010111, customer_sk FROM OneDay

Alas, the ETL guy did not have time to run statistics – maybe he forgot. Now, let us run the same DW query again, but this time with the newly loaded data:

DBCC FREEPROCCACHE
SELECT C.shoe_size, SUM(Measure)
FROM Fact F
JOIN DimCustomer C ON F.customer_sk = C.customer_sk
WHERE F.date_sk =
20010111
  AND C.shoe_size = 43
GROUP BY C.shoe_size

over 2 seconds on my 2 core laptop, more than 20 times slower! This is the time when people typically complain that SQL server is broken and doesn’t scale. Instead, we adopt a less moaning stance and start a little bit of troubleshooting. As always, we begin with a look at the query plan:

image

Well, apart from the plan fitting the width of my blog, there is nothing good to say about it. What just happened?

This is the part where we up the geek bar just a little: SQL Server gives you the ability to dig into the statistics like this:

DBCC SHOW_STATISTICS (Fact, scan_me)
WITH HISTOGRAM

Output:

image

Oh no, the value 20010111 is not even in the histogram. The optimizer estimates zero rows coming out of the fact table. Zero rows are very cheap to access, so the overhead cost of all those nice, parallel and hash join optimizations are not deemed worth it. Of course, the problem here is that the histogram is lying, there are 500K rows in that date range! Loop joins are expensive in that case.

That hurt, what now?… Well, I guess you just have to remember to update your statistics won’t you? And for those of you that can see a problems with the that (and some will) – start the commenting, go vote on Connect, and of course: stay tuned…

… And by the way, one last last thing, until we meet again in the next post: Have a look at this: Ascending Keys and Auto Quick Corrected Statistics.

Have a great weekend everyone, and consider updating your stats if the warehouse loads data on Sunday…

Diagnosing and fixing SOS_OBJECT_STORE spins for Singleton INSERTS

2011-05-30 2 comments

Following up on my previous post, my next target for “optimization”, while I am waiting for an even faster I/O system, is the SOS_OBJECT_STORE spin.

Recall that I am having high waits for WRITELOG, but still see 100% CPU, which indicates that spins may be our sinner. The big spin on the system is still LOGCACHE_ACCESS – but until we get hardware to work on that – we might as well stay greedy and learn a bit about SQL Server in the process. We just got rid of the OPT_IDX_STATS spin by running TF2330.

Unfortunately, the documentation available on our next spin: SOS_OBJECT_STORE is rather sparse. It is one of the SQLOS internal data structure used many places inside SQL Server. But there are ways, even for the public (which is why I can share it here), to get more information about what is going on. You can capture the call stacks of SQL Server when it does this spin and use publicly available symbols to lookup the function names inside the code.

One way to do this is to run an Xperf trace of the sqlservr.exe, another is with WinDbg. Anything that can collect and bucketize call stacks can help you. I will not get into more details here, but follow the links in this paragraph to learn more. I also have an article on SQLCAT that should help you get started on setting public symbol paths.

Suffice to say that I got hold of the sqlservr.pdb file (the publicly available symbols) and had a look at the call stacks that leads to SOS_OBJECT_STORE spins:

SpinlockBase::Sleep
LockManager::GetLockBlocks
lck_lockInternal
GetLock
PageContext::AcquireLock
IndexDataSetSession::AcquireLocksForInsert
IndexDataSetSession::InsertSmallRecord
IndexDataSetSession::InsertRowInternal
DatasetSession::InsertRow
RowsetNewSS::InsertRow

Aha! So this is related to the lock manager acquiring a lock on a page. Now, you may then ask: how can we influence this, surely we cannot control the way locks are acquired.

Well, as a matter of fact, we DO have a knob that gives us a tiny bit of influence. How about building the index like this:

CREATE CLUSTERED INDEX MyBigTable_cl
ON dbo.MyBigTable (c1)
WITH (ALLOW_PAGE_LOCKS = OFF)

That should get rid of one level of the lock hierarchy (ROW/PAGE/TABLE), restricting us to either table level locks or row locks. Since we are playing OLTP system here – who needs page locks anyway? Total leftover from old times Smile… (I am only half kidding here)

Interestingly, this was just what was needed, the SOS_OBJECT_STORE spin is now gone:

image

But throughput has not changed at all. This is not surprising, given the much larger amount of spins on LOGCACHE_ACCES. But we learned something new: Disabling PAGE level locks can save some CPU cycles by eliminating some of the code paths – we can speculate that this might lead to increased throughput once other bottlenecks are gone.

At this time, before I am moving to a faster transaction log, these are my waits:

image

Notice the high SOS_SCHEDULER_YIELD up there right after the WRITELOG? I have a feeling those spins are not yet done tormenting us….

Secret Scale Sauce : Heavy Singleton INSERT spinlocks

2011-05-28 3 comments

Paul Randal recently ran a very interesting test on a simple INSERT workload. His results were promising and provide both good conclusions and more suggestions for research.

You should read Paul’s blog post first, because if you have not, this blog will not make much sense.

I decided to bring Paul’s test onto a DL980 box with 8 x 320 GB FusionIO cards I have access to. After a few email exchanges with Paul, I managed to run his test harness which basically is 64 times SQLCMD running this command:

 

DECLARE @counter BIGINT;
DECLARE @start   DATETIME;
DECLARE @end     DATETIME;

SELECT @counter = 0;
SELECT @start = GETDATE ();

WHILE (@counter < 2500000)
BEGIN

  IF @counter % 1000 = 0 BEGIN TRAN

  INSERT INTO MyBigTable DEFAULT VALUES;
SELECT @counter = @counter + 1;
IF @counter % 1000 = 999 COMMIT TRAN   

END;

SELECT @end = GETDATE ();

INSERT INTO Results VALUES (CONVERT (INTEGER, DATEDIFF (second, @start, @end)));

Unfortunately, my current DL980 has the FusionIO drives set up for max capacity and low write speeds (and there are files on the drives I cannot move)– so the result will not yet be comparable to Paul’s – you should take these early runs with a grain of salt and not draw any conclusions based on them.

Observations:

  • I can drive the CPU to 100% on my 64 core box.
  • Throughput is not steady, the CPU’s are all over the place (big fluctuations)
  • I am running on 8 files
  • Largest wait is the WRITELOG wait (about 70%) – rest are PAGELATCH.
  • There are big differences between thread completion times. The threads on the first two NUMA nodes complete much faster than the others. Runtime fluctuations are between 150 and 600 seconds (but still nearly four times faster than Paul’s run)
    The WRITELOG wait is not surprising; Paul’s run is done on write optimized drives and I have sacrificed the write optimization for additional space (I hope to get some runs in on write optimized drives later this weekend). But, it might be surprising that I hit 100% CPU while still struggling with this wait. Whenever CPU is too high compared to throughput, it is generally a good idea to dig into spinlocks:

SELECT * FROM sys.dm_os_spinlock_stats
WHERE backoffs > 0
ORDER BY spins DESC

Aha! This gives us some insights:

image

We are spinning on log cache structures and burning CPU on this – again, not surprising and some of this will go away as we boost the transaction log speed. We are trying to squeeze way beyond what the log will allow us to squeeze through, and there is no one in the runnable queue behind us.

But what are those other spins? OPT_IDX_STATS and SOS_OBJECT_STORE?

Well, OPT_IDX_STATS is the optimizer trying to update the DMVs sys.dm_db_index_usage_stats and sys.dm_db_missing_index_stats. You can read about this in KB 2003031. While the KB is about databases with many thousand rows in those DMV – you might see why this problem can appear for a single table that has heavy insert activity: A quick calculation (16M rows in 600 sec) shows us that we are hitting at least 270.000 inserts/sec on this run, with peaks up to 4 times that (nice, by the way). It is not strange that there is some contention on that data structure.

The knowledge base article hints at a solution: run TF2330. This has to be set at startup, not with DBCC TRACEON. Running this trace flag got rid of the wait, and it looks like I got a minor throughput increase (hard to measure under big fluctuations)

My next step is to diagnose the SOS_OBJECT_STORE, more to follow…

(By the way, I played around with naming the columns in the INSERT statement instead of using default values. This turned out to be a very bad idea, I got massive amounts of SQL_MGR spins – you can read about why I got that here.)

Grade of the Steel: Introduction

2011-05-01 Leave a comment

There are a lot of things I want to say on this blog, so I do apologize in advance for jumping a bit between subjects. I will shortly write the next installment of my data modeling series – but before that, I want to let you know about some exciting new tests that I am working on in my copious amount of spare time Smile.

FusionIO has kindly granted me access to what can only be described as a beast of a machine! Here is the configuration I am looking at in my terminal session:

  • An HP DL980G7 box with 8 Nehalem-EX CPU at 8 Cores each (16 in HT mode). Clock speed of 2.0 GHz.
  • 22 PCI Slots, 7 which are currently filled with FusionIO cards (with at least 2 ioDIMM in each slot)
  • A good pile of RAM (still not sure how much we will run on, but the box is capable of 2TB)
    Along with Tom Scribner of FusionIO (one of the guys who wrote the FIO Windows driver) – I will be doing some tests on this monster box. This is one of the largest scale-up systems out there – and the biggest scale-up x64 box that HP currently ships.
    We still don’t know exactly how many IOPS we can squeeze out of this, but it is looking promising. You know you are heading the right way with your tuning, when both tuners are looking at 100K Random IOPS, at 64K blocks, consistent 0.6ms latency – and say: “Nice, but not impressive”. This is basically a system with no bottlenecks.

Here are a few screenshots:

imageimage

Notice the little green light in the top right corner, with the number 22 next to it.That is the number of IODimms.

Some of you may have participated in the “Crappy Code Games” during SQL Bits and seen some pretty amazing results. But now, I want to up the bar and try “Great Code Game”. I want to put SQL Server to the test and discover the grade of the steel of basic operations that we all need to do when designing databases.

Here are some of the questions I plan to answer:

  1. How fast can we run the ETL World Record on this? (I hope to hit 4 minutes for 1TB if the CPUs will do it)
  2. How many singleton INSERT / sec can this box sustain?
  3. How many single row UPDATE statements / sec can we do?
  4. How fast can you table scan fact tables?
  5. How fast can you do backup?

Ad 1) Since I don’t have any blade servers to use for ETL clients, I will run BULK INSERT commands. This will be the (current) ultimate test of a scale-up ETL load

Ad 2,3) This will of course depend a lot on the table design and file layout. If you have seen me present at SQLBits, you will know that I have some unorthodox design techniques to boost scale. I plan to put them through extreme testing and write more about them. For all you OLTP people out there – this is going to be important if you design for scale.

Ad 4) I have no clue how fast this box will run for table scans. But I know that the PCI bus can do at least 36GB/sec.

Ad 5) Again, I have no clue what the result will be and what bottlenecks I might hit. Only testing will show.

I am quite confident that the standard “out of the box” designs you generally see recommended will not scale on this box. But I am equally confident that there are ways to break through the scale roof and shut up the voices that would claim that SQL Server does not scale. Be prepared for some surprising design hints.

The box also has Hyper Threading (HT). Time allowing, I will see how well the claim that the new HT actually boosts scale will measure up to actual testing. Also, I know a lot of people have asked me about NAND devices and what can be done with them. Feel free to suggest ideas for tests in the comment section.

Anyone else have a box like this out there? I challenge you to compete and explore with me.

Let the games begin!

Follow

Get every new post delivered to your Inbox.

Join 271 other followers