Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may chose intelligently between them.
In this blog post, I will benchmark the build in function in SQL Server. I will focus on answering two questions:
- How fast is the hash function?
- How well does the hash function spread data over a 32-bit integer space
I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.
For the picky people (my dear soul mates): I will be using 1K = 1024 in the following, so 64K = 65536.
- Because we sometimes find ourselves hashing dimension keys, I will use a very simply table with a nice key range as my test input. Such a key range could for example have been generated by an IDENTITY(1,1) column or a SEQUENCER. Here is the test script that generates the input:
CREATE TABLE CustKey (SK INT NOT NULL)
INSERT INTO CustKey WITH (TABLOCK) (SK)
SELECT n FROM dbo.fn_nums(10000000)
(see my previous post for the fn_nums source)
Speed of the Hash function
SQL Server exposes a series of hash functions that can be used to generate a hash based on one or more columns.
Inside SQL Server, you will also find the HASHBYTES function. This little gem can generate hashes using MD2, MD4, MD5, SHA and SHA1 algorithms. The problem for the purpose of our test is that these function spit out BINARY types, either 128 bits (for SHA) or 160 bits (for MD). However, we can quickly map that value into the integer space by doing a modulo (in SQL: %) with MaxInt (2**31 – 1). Interetingly, Modulo work directly on values of BINARY in SQL Server.
Before I move on the results, I would like share a little trick. When you need to get the raw CPU speed of the hash, you may be tempted to do something like this:
SELECT BINARY_CHECKSUM(SK) AS h
But this is wrong. Why? Because you are both measuring the time it takes to hash as well as the time it takes to transfer the rows to the client. This is not very useful, on my laptop it takes over two minutes to run the above query.
Instead, you should try to measure the raw time the server needs to calculate the data. One way to achieve this (and this trick can be used on most queries you benchmark) is to wrap the SELECT statement in a MAX function. Like this:
SELECT BINARY_CHECKSUM(SK) AS h
OPTION (MAXDOP 1)
Notice something else interesting, I am forcing parallelism down to 1. I want to get as close to the raw cost of the CPU as possible – so I don’t want to clutter the query with any overhead of parallelism. The query now runs in a little less than 3 seconds on my laptop.
I addition to the above trick, we also need to quantify the time SQL Server spends on just accessing the table. We can get a good approximation of this by measuring the SELECT statement without any hash function. On my laptop, I measured this to be around 3200 ms. In the results below, I have subtracted this runtime so we are only measuring the cost of the hashing.
With all this said, let us have a look at the results:
Some comments on the data:
- Notice that MD2 is very inefficient. I have not studied this algorithm in great detail and would appreciate a specialist commenting here.
- Apart from the MD2 anomaly, the HASHBYTES functions all performan pretty much the same. Using around 1 micro sec per hash value calculation
- BINARY_CHECKSUM and CHECKSUM are MUCH faster, by about an order of magnitude
- Doing a simple modulo that spreads out the data over the integer space is about the same speed as CHECKSUM and BINARY_CHECKSUM
- It is not the cast to NVARCHAR that uses CPU time (HASHBYTES requires NVARCHAR input)
- However, the checksum function do take longer to run when you first cast to NVARCHAR. Costing around 100 ns per INT value
Of course, the HASHBYTES functions also have other characteristics than speed which may be desired (cryptographic utility for example). but if you you want is speed, it looks like CHECKSUM and BINARY_CHECKSUM are faster.
But, let us see how good the functions are at spreading the data.
Spread of the Hash Function
Another desirable characteristic, apart from speed, of a hash function is how well it spreads values over the target bit space. This is useful not only for cryptographic purposes, but also to “bucket” data values into equal sized portions, for example in scale-out MPP architectures.
Often, you will not know in advance how many buckets you eventually want to subdivide the integer space into. You may start out with 32K buckets and divide up the integer interval in 128K (32-bit space divided by 128K = 32K) even sized buckets, each with 128K values in them:
- Bucket0: [MinInt…MinInt + 128K[
- Bucket1: [MinInt + 128K…MinInt + 128K*2[
- Bucket2: [MinInt + 128K*2…MinInt + 128K*3[
- … etc..
- Bucket 32K: [MinInt + 128K*32K…MaxInt]
Perhaps you will later want a further subdivision into 64K buckets, each with 64K hash values.
Wise from the runtimes I measured before, I used only 1M rows with values [1…1M] for this test. However, this is plenty to show some good results, so don’t worry.
To avoid running chi-square testing (See later) of integer value spread all over the 32-bit space, I calculated the hash of each key and I then bucketed them into 64K bucket ranges (each with 64K values, neat isn’t it?).
This table comes in handy for that bucketing
CREATE TABLE ExpectedBuckets
, RangeStart INT
, RangeEnd INT)
INSERT INTO ExpectedBuckets
,(n – 1 – POWER(2,15)) * POWER(2,16)
, (n – POWER(2,15)) * POWER(2,16) – 1
FROM dbo.fn_nums( POWER(2,16) )
A sample output form this table:
I can calculate the hash values and put them into a table like this:
CREATE TABLE HashResult
ObservationID INT IDENTITY(1,1)
, HashValue_i INT
CREATE CLUSTERED INDEX CIX ON HashResult (HashValue_i)
And finally, I can join them all up and count the values in each bucket:
, CAST(ISNULL(COUNT(HashValue_i), 0) AS FLOAT) AS Observed
FROM ExpectedBuckets EB
LEFT JOIN HashResult
ON HashValue_i BETWEEN EB.RangeStart AND EB.RangeEnd
GROUP BY BucketID
We now have a nice count of the content in hash bucket, it looks like this:
How do we test if this is a good result or not? It seems that min/max, averages and standard deviations don’t really suffice here (sorry modern MBA types, I realise I am going to loose you now ).
We need to do some hypothesis testing. The hypothesis I want to test (aka: the NULL hypothesis – all resemblance to relational algebra is purely coincidental).
“My chosen hash function evenly distributes the integers into the 64K hash buckets”
Interestingly, this hypothesis makes a prediction (has to, if not we cannot falsify it): namely that in each bucket filled by hashing the 1M rows, we should expect to measure 1M / 64K = 15 members.
There is also an important twist here: We have to measure the content of each bucket, even when it is zero. This is why I use a left join in the query above, we have to measure all the results we know, not just count the hash buckets that actually get filled (you can see how not bucketing the data would have made the test result very large).
We can now test the hypothesis against the measured result. If we see correlation, then it gives us confidence in the correctness of the hypothesis, and lets us conclude that our hash function is a good one. It also gives us the ability to compare different hash functions with each other. We can test for correlation between the predicted result and the actual result with the chi-square test. Interestingly, the tools we need for this test is even exposed in Excel as the CHISQ series of functions.
Without further ado (and insults), here are the results:
(source data available on request)
Some observations about the data:
- Most importantly: the CHECKSUM and BINARY_CHECKSUM functions are very poor hash functions. Even with 1M input and 64K large buckets, they don’t even fill the hash space. The skew is massive on non-skewed input data.
- We also see that casting the INT to NVARCHAR improves the behavior of both BINARY_CHECKSUM and CHECKSUM, but not enough to make them good hash functions.
- SHA and SHA1 seem to give us the best spread of the data, but at 1 microsecond per hash, it seems rather expensive.
- Notice that the MD series do not give a particularly nice spread. I suspect this is related to the way I divide the data with MaxInt to get from the 120bit value to a 4-byte integer
- We see that only the modulo does a distribution that gives us nearly 100% confidence (not fully though, because 1M does not divide 64K, which is reflected in the 824 value of Chi-squared). This result is expected, and my function here is engineered to “game the test”. The modulo cannot generally be used like this to spread the value all over the 32 bit integer space.
- For fun, I put in a test where I use random values all over the integer space. Even with a confidence interval of 10%, we cannot say it distributes evenly. This means that SHA functions distribute better than random on this input dataset. Your mileage will of course vary for random values
On the note of the CHECKUM series, Books Online does (under?) state:
If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.
In this blog I have explored properties of hash function built into SQL Server. The results indicate some general guidance that could be followed:
- For best spread over the hash space, use SHA or SHA1 algorithms with HASHBYTES function
- Be VERY careful about using CHECKSUM and BINARY_CHECKSUM as these functions do not spread a hashed integer value nicely over the full space. Depending on what you use these functions for, they may not be suitable
- It takes about 1 CPU microsecond to calculate a single call to HASHBYTES, except when hashing with MD2
- Using HASHBYTES to spread out values over the 32-bit integer space seems rather expensive, since even the best cases uses tens of thousands of CPU instructions.
- This indicates that it may be possible to implement a better hash function if you do not care about cryptographic properties of the hash