Archive

Archive for the ‘Engines’ Category

Moving on from Microsoft

2012-05-09 17 comments

Not too long ago, I handed in my notice to Microsoft, terminating my employment mid-June 2012. I suspect this is slowly leaking out on the Twitter and Facebook these hours – as is tradition in the connected world.

image

During the graduation ceremony at my university it was said that: “Life is a series of temporary relationships” (though hopefully not with your wife). Throughout my career I have taken this fact of life to heart. And I can, without fear of the future, try out new challenges.

After June 2012, I am going to take it a bit slow, and do some consulting. I will then consider which adventure I want to go on from here. Because I believe you have to live out your dreams of adventure while you are still young, and I have a fair bit of fuel left in the tank before “peak oil”.

I part with Microsoft as a friend and will be working on creating a smooth transition over the next month. My work with SQL Server will continue, just outside the campus in Redmond. After June, I will be available for some teaching and short consulting gigs to share my knowledge. Terms and conditions will apply and follow here.

In the meantime, I have made my CV available: About Me.

Categories: Musing, Public Speaking, SQL Server Tags:

Why You Need to Stop Worrying about UPDATE Statements

2012-04-27 3 comments

There seems to be a myth perpetuated out there in the database community that UPDATE statements are somehow “bad” and should be avoided in data warehouses.

Let us have a look at the facts for a moment and weigh up if this myth has any merit.

Transaction Logged Data

In traditional, relational, ACID property, rows stored in pages, relational databases we typically distinguish between two types of DML operations from a transaction logging perspective.

  • Row Logged
  • Allocation Logged

Row logged operations will write a transaction log entry every time a row/tuple is modified. This means that the amount of transaction log traffic generated is proportional to the number of rows touched.

Allocation Logged (called: “Minimal logging” in SQL Server) operations only write the physical allocations to the transaction log, if at all. This means the log traffic (if any) is proportional to the size of the data touched. This typically generates at least an order of magnitude fewer log entries than row logged, and is thus faster… Or is it? Read on…

Typically, ACID databases only allow bulk style loads, index builds and large table/partition truncations and drops to be allocation logged. UPDATE and DELETE statements tend to be fully row logged.

Myth: This difference in allocation structures makes UPDATE “bad” because the transaction log is the bottleneck.

Reality: The picture a quite a bit more nuanced than that.

First of all, INSERT in bulk mode is not always allocation logged. Typically a lot of conditions have to be true for allocation logging to work. For those of you interested in SQL Server, I have written about this extensively here: The Data Loading Performance Guide.

Second, the transaction log bottleneck ”wall” is widely exaggerated. I have personally driven 750MB/sec write log traffic into a single database in SQL Server using a FusionIO card. I have seen colleagues do 120MB/sec with traditional, 15K spindles. True: I have also driven 3GB/sec (around 10TB in an hour) using allocation logged INSERT, which is faster than its row logged sibling. However, you have to ask yourself the question: Do you need to go that fast in a single database? In an MPP system you will also have several log files that work together to provide linear scale of log traffic with no relevant roof.

Third, it is perfectly conceivable that you are running a warehouse database that does NOT need to serialize DML operations or may even write entire blocks directly to the database instead of the transaction log. Such systems simply do not have the above bottleneck. For example, Neteeza uses a “logless” implementation and so does many noSQL database systems and HADOOP/HIVE. MySQL with the MyISAM engine also allows non logged operations.

 

INSERT/DELETE vs. in-place UPDATE

In an traditional, relational database an update can be implemented either as a transactionally wrapped INSERT of the new data followed by a DELETE of the old data (or the other way around).

Myth: Because UPDATE is an INSERT and a DELETE this can double the amount of data that needs to be written during an UPDATE operation and will make me hit the transaction log wall even faster.

Reality: If you are not modifying the keys in an index or making the column size wider, UPDATE statements can be executed as in-place modifications of the row. This allows the database to only write a special old/new value into the transaction log. This optimization can even be applied on a column by column basis, further reducing the transaction log footprint. This leads us to:

 

UPDATE vs. INSERT speed

This myth is an amalgam of the above arguments.

Myth: INSERT of row logged data is faster than UPDATE of row logged data

Reality: Let us first settle one thing which I will type on its own line and in red to make it easy to remember:

An in-place, row logged, UPDATE operation on a non compressed page is faster than doing a row logged INSERT of the same data.

Why is this? Because the INSERT operation has to allocate new physical structures in the index, while the UPDATE can simply reuse database pages without having to allocate more space.

And here are the numbers to prove it where I am running INSERT vs. UPDATE of large dataset in SQL Server (smaller is faster):

INSERT vs UPDATE

True: if your UPDATE statement has to do the INSERT/DELETE trick, it will likely be slower. But if you are NOT changing the row size and you get the in-place UPDATE, it might just be FASTER to run an UPDATE than an insert.

Also true: compression can change the game quite significantly depending on the compression algorithm you use for the table structure. This is again implementation dependent.

Summary

The myth about UPDATE statements being bad or slow is too simple a way to look at this crucial DML operation. In fact, the myth is outright false in some cases.

Avoiding UPDATE statements should not generally be a major driver for design guidance, or used as the basis for drawing any conclusions about the data modeling techniques you  should apply. There picture is quite a bit more nuanced than this.

First of all, the speed of UPDATE statement as compared to bulk inserts will depend on the database engine you run on.

Second, we have seen that even when UPDATE is fully row logged, the transaction log “wall” is very far away on proper hardware and not much of a concern to 99% of all the installations out there. There are of course cases where you will hit the “wall”, but those are largely mitigated in MPP systems or other sharded deployments that have more than one transaction log.

Third, there are cases where UPDATE statements are actually faster than (row logged) INSERT statements. These typically occur when you change columns in such a way that they don’t grow larger than they already are, allowing in-place UPDATE operations. An interesting and highly relevant example of such a case is UPDATE statements that target fact table keys – which (if you follow my guidance) are integers and therefore have constant width.

 

Bonus Exercise and chance to win (for SQL Server people): Here is an interesting experiment. In theory, it might be possible to create a workload where you UPDATE a very wide table and where the equivalent, minimally logged, super optimized “copy to new table” BULK INSERT or SELECT INTO statement is actually slower than the UPDATE. Where is the crossover point on table width? I will offer a free, 1 hour teaching session. Database subject of your choice, you host me, I bring the coffee in the London City area to the first person who can provide a test script and the data to show this crossover point. Alternatively the price can also be claimed if you can conclusively prove that the crossover point does not exist.

Conference Updates

2012-04-25 Leave a comment

A few weeks ago, I presented at SQL Bits X. The slides from my co-presentation with Conor are available here:

Last week, I spoke at Microsoft Open World 2012 doing quite a lot of presentations. I have created a new and extended “Grade of the Steel” deck that I hope to blog more about my findings (these things tend to start as PowerPoint slides and evolve into blog entries).

If you want a copy of any of my presentations, please just mail me. Speaking of presentations, I have found this really cool tool that will compress slides: NXPowerLite. This neat tool has taken off around 60-80% of my slide sizes.

Lately, I have been practicing zen style presentations: max four sentences on each slide and heavy use of visualisations instead. I highly recommend putting yourself under such constraints. After I removed the text from my slides I feel much better improvising on stage.

When Statistics are not Enough – Search Patterns

2012-02-21 4 comments

Co-author: Lasse Nedergaard

Yesterday, Lasse ran into an issues with a query pattern in the large database that he is responsible for. Based on our conversation, we wrote up this blog and created a repro.

The troublesome query we were debugging executed like this:

  1. Find a list of keys values to search for
  2. Insert these keys in a temp table – lets call this the SearchFor table
  3. Join the temp table to a large table (lets call it BigTable) and retrieve the full row from the large table

Why not use a correlated sub query in step 2? In this case, the customer in question had multiple code paths (including one accepting XML queries) that all needed to pass thousands of key to a final search procedure. They wanted a generic way to pass these key filters to to the final access of BigTable. The schema will make it clearer.

Schema

We created a repro and it turns out to be a great look into the strange world of database statistics. Here is the schema:

CREATE TABLE BigTable
(
  NearKey UNIQUEIDENTIFIER NOT NULL
  , Payload char(200) NOT NULL
  , RowType INT NOT NULL
)

/* Insert rows to simulate large table with 177 being the skewed value */

INSERT INTO Bigtable WITH (TABLOCK) (NearKey, PayLoad, RowType)
SELECT NEWID()
 
, REPLICATE(‘X’, 200)
  , CASE WHEN n%100 BETWEEN 0 AND 50 THEN 177 ELSE n%100 END
FROM dbo.fn_nums(1000000)

/* Creates indexes and statistics */

CREATE INDEX IX_NearKey ON BigTable (NearKey)
CREATE STATISTICS STAT_RowType ON BigTable(RowType)

UPDATE STATISTICS BigTable WITH FULLSCAN

/* Insert rows that are NOT in the statistics with key = 199 */

INSERT INTO BigTable (NearKey, PayLoad, RowType)
SELECT NEWID(), ‘New Rows Not stat updated’, 199 /* New type*/
FROM fn_nums (100000)

/* Create generic search table */

CREATE TABLE #SearchFor (NearKey UNIQUEIDENTIFIER)

 

In Production, BigTable is a very large table with billions of rows. Each row in the table has RowType that describes metadata about the row. The metadata is used to populate #SearchFor before accessing BigTable, as this saves significant I/O.

Things to note about BigTable:

  • All stats have been fully updated for values that are not RowType = 199. As long as we stay below this value, we have done ALL we can
  • There is big skew in the RowType column.
  • There are 100K values that are not in the statistics (you can run DBCC SHOW_STATISTICS (BigTable, STAT_RowType) WITH HISTOGRAM to convince yourself of this)

Query Problems

    When a user queries this system, the code will first populate #SearchFor with a stored procedure like this:

Typical search procedures in the system look like this:

CREATE PROCEDURE GetDataA
AS

SELECT INTO #SearchFor

EXEC dbo.ReturnFromBigTable

 

ReturnFromBigTable contains the troublesome statement that is the subject of this blog:

SELECT *
FROM BigTable
JOIN #SearchFor
  ON BigTable.NearKey = #SearchFor.NearKey
WHERE RowType = <SomeValue>

 

Looks quite innocent, doesn’t it?

To simulate the customer’s workload, let us say we are searching for 1% of the rows in BigTable. We do this by populating #SearchFor.

INSERT INTO #SearchFor
SELECT TOP 10000 NearKey FROM BigTable   
TABLESAMPLE (1 PERCENT)

And now, the fun begins! Lets first use our NearKey search and for a non skewed RowType:

SELECT * FROM BigTable
JOIN #SearchFor
  ON BigTable.NearKey = #SearchFor.NearKey
WHERE RowType = 99

Plan (only join shown)

image

Looking at Predicate in properties for the BigTable scan we unsurprisingly see that the predicate is being driven down: [dbo].[BigTable].[RowType]=(99)… Nothing interesting so far. The plan is parallel. Again, not surprising considering that we are scanning accessing 1M rows in BigTable.

… but as Lasse and I observed on his workload, something isn’t quite right here.

Let us run the same query, but with the filter RowType = 99 replaced by RowType = 177. The new plan is:

image

It is at this point your design intuition should wake you up with that good ol’ WTF feeling.  “Wait a minute you should say”…before (when there were fewer rows coming out of BigTable) we were parallel and hashing the output of BigTable. Now, when there are more (50x more) rows, we are both reversing the hash order, probing into BigTable and hashing SearchFor. Note that all estimates are relatively accurate (within 10%) and statistics in the space we are searching is fully up to date… If we take a fully naïve approach, trusting relational algebra and query optimizers, there is nothing more we can do here.

But hey, it gets worse: What happens when we go outside the boundaries of the statistics. This for example, could happen when the stats are not fully updated – a completely normal situation in a database. Searching with RowType = 199 (that is not in the histogram) and estimating the plan, we get:

image

The estimate coming out of BigTable is 1 row, the actual is 100K; this is probably going to hurt.

Running the query gives us the deadly:

image

Congratulations to us, we just committed performance suicide!

Lets just list the execution times of these query variants:

Join Strategy RowType CPU time Elapsed Logical I/O Reads
Hash BigTable
Probe #SearchFor
99 421ms 549ms 31429
Hash #SearchFor
Probe BigTable
177 765ms 1293ms 31429+62
Merge Join (sort both) 199 1575ms 2261ms 31429+62

What is the problem here? From the user’s perspective, the query response time will often vary unpredictably – which is a poor experience. But worse, from a system and DBA perspective: the memory consumption will vary a lot depending on which query plan we end up with. This means it becomes hard to extrapolate the total system throughput and resource requirements based on the queries. If you are capitalist inclined in your mindset, let me translate this: At the end of the day, this costs you money!

Why is this happening? It is happening because there are hidden assumptions here that you have not brought into the light. We are trusting technology (in this case the database statistics and optimizer) with solving something that is a data modeling problem.

This is a very good example of something Thomas has been seeing a lot lately: An unwillingness to be specific about your requirements costs you money – a LOT of money. In other words: There is a price tag associated with ignorance. Let us make it clear what we are NOT saying: There is nothing wrong with ignorance: it is hard to know what you don’t know. Sometimes, you simply don’t know a requirement exists. The trouble arrives when denial and lack of curiosity and analysis trumps the observed data.

Fortunately, Lasse is the curious kind and he asked all the right questions and sought knowledge to remedy the situation.

Let’s get pragmatic, and talk about solutions.

The solution

With all the query plans above, which one do we want? In this case, the answer is: “None of them”

Here is an observation about the workload we discovered as we were analysing the solution: It is reasonable to assume ask the user to accept that response time is proportional with the number of results returned. From a system perspective, we also want the memory consumption and CPU to be proportional to the returned result size. These are questions you have to ask yourself as system designer and discussion you can have with users.

This is where computer science comes into the equation. We know:

  1. That we can search a B-tree of n rows in lg(n) time (which is as good as constant)
  2. That the the returned result are capped by the number of rows (let us call this R) in #SearchFor
  3. That we are willing to accept that the runtime and resource use is proportional with number of rows returned
  4. That we cant live with unpredictable behavior

The optimizer KNOWS about the search time and the cap (1+2). But the optimizer does NOT know that we are willing to declare certain properties that narrow the search spare (3) and that we want predictability (4). Our "ignorance” has a price, and the optimizer makes a choice that might be optimal, but which unfortunately is also unpredictable.

This is a great case for hinting. And we can do it like this:

SELECT *
FROM #SearchFor
INNER LOOP JOIN BigTable
  ON BigTable.NearKey = #SearchFor.NearKey

WHERE RowType = 177

Note something important here: The order of the joined tables matter when you use the LOOP hint (which is why we flipped them above). The above hint BOTH forces the join order and the join strategy.

An alternative, which may be more generally applicable is this rewrite:

SELECT *
FROM BigTable
INNER JOIN #SearchFor
WITH (FORCESEEK)
ON BigTable.NearKey = #SearchFor.NearKey
WHERE RowType = 177
OPTION (LOOP JOIN)

The above rewrite does not guarantee join ordering. However, with the FORCESEEK hint, the plan we want is almost guaranteed.

Let us add some new rows to the result:

Join Strategy RowType CPU time Elapsed Logical I/O Reads
Hash BigTable
Probe #SearchFor
99 421ms 549ms 31429
Hash #SearchFor
Probe BigTable
177 765ms 1293ms 31429+62
Merge Join (sort both) 199 1575ms 2261ms 31429+62
Hinted LOOP 99 406 718 81259
Hinted LOOP 177 499 964 81259
Hinted LOOP 199 306 718 81259

Above, we can see why the optimizer didn’t want to help us! The number of IOPS required for my hinted plan are much higher than the optimizer generated plans.

The hinted plan shape is:

image

Obviously, we could replace the non clustered index with a clustered on (which would get rid of the RID lookup) to make this query faster. But in this case, this was not a viable solution for other reasons.

Summary

In this blog, I have shown you an example of database optimizers failing to make the “right choice”. There are several reasons this is happening:

Lack of information: The database needs declarative information about the user’s intent and what we consider the Right Choice™. In the example, the missing information is: “the speed of the returned result should always be proportional with the result size”. We used a hint to control the optimizers behavior, using our knowledge of the physical execution of queries. Other people might have taken more extreme steps and argued for noSQL – perhaps too aggressive a step in this case.

Leaky abstractions: We saw the line between the “logical” and “physical” model break down. We believed that by building the right model, we had supplied enough information to the database. The distinction between physical and logical model has, as Thomas previously argued, always been a pretty useless paradigm. We have to consider the components of the system in a holistic way, but not at the expense of maintaining the overview of individual building blocks.

Optimality vs. Good enough: the optimizers default behavior is to look for “optimal plans”. In the plan search space, even a very small space like this example, plans are sometimes found that look optimal from statistics. Yet, these plans may not be optimal or have the properties we seek – or we may fail to recognize the good plans as we search the space. In our case, we just wanted a plan that was “good enough” and had certain predictable attributes.

False hardware assumptions: The optimizer makes some assumptions about the cost of IOPS which unfortunately do not correlate with the reality of a modern machine. It assumes that optimizing for a low number of Logical IOPS is a Good Thing™. In this light, the plans generated are the right plans. However, the optimizer does not take into account that the I/O system might be fast, and that there is enough RAM to have a likelihood of finding some of the needed data already residing in the buffer pool. To prove this: Recall that we had fully updated statistics available for queries going after RowType = 177. But, running the query on a petty IBM laptop the  loop hinted plan, even on an empty buffer pool, is still faster than running the plan generated by the optimizer.

The simple query we studied here raise some interesting questions about the design tradeoffs that the data modeler and architect is forced to make. In this case, we were fortunate that Lasse was on the alert for these tradeoffs. With this example, we have seen what the potential consequences of “behavior and model ignorance” are and why you need to be alert to it. 

Running Many Batch Statements in Parallel

2012-01-08 1 comment

Location: Somewhere over the Atlantic

imageWhen designing highly scalable architectures for modern machines, you will often need to do some form of manual parallelism control. Managing this is not always easy, but in this blog I will give you one piece of my toolbox to help you.

Let us walk through an example together, a tiny case study. This is a problem which many of you will be familiar with.

Let us say you have 16 files that you want to load into the same table in your database in an automated manner. The naïve approach will do something like this:

BULK INSERT MyTarget FROM ‘C:\temp\MyFile1′ WITH
  (FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘\n’)

BULK INSERT MyTarget FROM ‘C:\temp\MyFile2′ WITH
  (FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘\n’)

… etc…

BULK INSERT MyTarget FROM ‘C:\temp\MyFile16′ WITH
  (FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘\n’)

Now here is the problem with this approach: it executes one statement at a time. Sequential execution is BAD, you need to stop thinking about the world like that if you want to scale on a modern architecture.

Lets assume we have enough hardware resources (in this case, it would take a blade server and a decent I/O system). What we really want is to run every one of these statements in parallel. Unfortunately, SQL server does not have a command to start up new connection from inside T-SQL… what to do?

Getting to the Command Line

Because you cannot execute more than one command on a single connection at a time, we will need multiple connections to SQL Server and this mean we have to go back to the command line. Let us start by creating a little batch file Worker.Cmd with this content:

REM Worker.Cmd File

CALL SQLCMD –S.\MyServer –q”BULK INSERT MyTarget FROM ‘C:\Temp\MyFile%1’ …EXIT

This allows us to invoke a bulk load for the first file by executing: Worker.Cmd 1

Unfortunately, we still cannot start multiple connections without manually firing up a lot of command prompts. The coders in the audience may at this point reach for their favorites programming language to write a little utility that can spawn multiple copies of an executable.

However, there is a problem with such a home made executable: you cannot generally rely on a server having the necessary runtime libraries. Typical comments might be:

 “No, we don’t have .NET 4.0 here, this is not yet certified by our infrastructure department. Could you recompile it for 1.1 please?”.

“Power Shell is much too fancy for us, what is wrong with running this on Windows 2000?”

Perhaps this customer is just skeptical about letting you run your executable on a server. This may sound silly, but I have seen this happen too many times to make assumptions.

Start to the Rescue

There is a very nice little utility for the good old command prompt that allows you to fire up new processes: START.EXE. This comes with all versions of Windows and it takes any command line executable as input, fires it up in a new thread and returns control back to the caller.

Using start.exe, we can write batch script that fire up multiple copies of the same executable. It looks like this:

REM SpawnMany.cmd

REM Author: Thomas Kejser
REM Purpose: Spawns many copies of the same executable. Useful for running many things in parallel

@ECHO OFF
ECHO Spawning %2 copies of %1

FOR /L %%i IN (1, 1, %2) DO (
    ECHO Spawning thread %%i
    START "Worker%%i" /Min %1 %%i %2
)

Each new process is started in a minimized window and we pass the thread number and the total number of threads to it. Using this little batch script, we can now do this:

  SpawnMany.exe Worker.Exe 16

This starts 16 workers, each with their own thread number assigned. Very useful for running stuff in parallel in a quick and dirty way. For example, I use this to run the TPC-H data generator dbgen.exe highly parallelized.

Notice that I added the EXIT command at the end of the worker.cmd batch. This makes sure that the window closes itself when done executing.

Summary

In this blog, I have shown you how to write a little batch script to fire up multiple threads, from the command line. each doing their own work in parallel. The script is “zero dependency” which makes it ideal for server use and for hacking together quick and dirty parallelism for test scenarios.

I mentioned that SQL server does not have a way to start up new connections from T-SQL. This is not strictly true. Sorry for leading you astray, but I wanted you to see how to do this from the command line first (and go through the pain Devil ). There is a way to hack SQL Server and implement a stored procedure I like to call sp_executesql_async. This will be the subject of a future blog, but since I am heading into a lab for a few weeks, you just have to wait for it.

Don’t Become a One-trick Architect

2011-12-08 10 comments

imageWe are near the dawn of a new workload: BigData. While some people say that “it is always darkest just before the dawn”. I beg to differ: I think it is darkest just before it goes pitch black.  Have a cup of wakeup coffee, get your eyes adjusted to the new light, and to flying blind a bit, because the next couple of years are going to be really interesting.

In this post, I will be sharing my views on where we have been and a bit about where we are heading in the enterprise architecture space. I will say in advance that my opinions on BigData are just crystalizing, and it is most likely that I will be adjusting them and changing my mind.

Yet, I think it will be useful to go back in history and try to learn from it. Bear we me, as I reflect on the past and walk down memory lane.

The Dawn of Computing: Mainframes

If you were to look at the single most successful computing platform of ancient times (that would be before 1990), you are stuck with the choice between the Apple II, the C64 or the IBM S/360 mainframe. The first two are consumer devices, and I may get back to those in another blog post. Today, let us look at the heavy lifting server platforms, since we are after all going to talk about data.

imageUnder the guidance of Fred Brooks, IBM created one of the most durable, highly available and performing computing platforms in the history of mankind. Today, mainframes are challenged by other custom build supercomputers, x86/x64 scale-up platforms and massive scale out systems (ex: HADOOP). But even now, the good old S/360 still holds on to a very unique value proposition. No, it is not the fact that some of these machines almost never need to reboot. It is not the prophetic beauty of JCL (a job scheduler that “gets” parallelism) or the intricacies of COBOL or PL/I…

In fact, it is not the mainframe itself that gives it an edge, it is the idea of MIPS: Paying for compute cycles!

When you pay for compute cycles, every CPU instruction you use counts. The desire to write better code goes from being a question of aesthetics, to a question about making business sense (i.e. money). Good programmers, who can reduce MIPS count, can easily visualize their value proposition to business people, and justify extraordinary consulting fees.

As we shall see, it took the rest of us a long time to realize this.

The Early 90ies: Cataclysm

My programming career really took off in the late 80ies and early 90ies – before I got bitten by the database bug. I used to write C/C++, LISP, Perl, Assembler (various), ML and even a bit of Visual Basic (sorry!) back then. Pardon my nostalgia, but in those “old days” it was expected that you were fluent in more than one programming language.

There were some common themes back then.

First of all, we took pride in killing mainframes. We saw the old green/black terminals as an early, and failed, attempt at computing – dinosaurs that had to be killed off by the cataclysm of cheap, x86 compute power (or maybe RISK processors, though I never go around to using them). We embraced new programming languages and the IDE with open arms. We thought we succeeded: IBM entered a major crisis in the 80ies for the first time in their long and proud history. However, it could be argued that it was IBM as a company that failed, not the mainframe. There are still a lot of mainframes alive today, some of them have not been turned off since the 70ies and they run a large part of what we like to call civilisation. As a computer scientist, you have to tip your hat to that.

Another theme was a general sense of quick money. IBM had a lot of fat on their organization, and all that cost had to go somewhere: It ended up as MIPS charges to the customer. This made mainframes so expensive that it was easy to compete. It was the era of the “shallow port”. ERP systems running on ISAM style “databases” were ported 1-1 to relational databases on “decentral platforms” – aka: affordable machines. Back then, this was much cheaper than running on the mainframes and it required relatively little coding effort.

The results of shallow ports was code strewn with cursors. I suspect that a lot of our hatred towards cursors is from that time. People would do incredibly silly things with cursors, because that was the easy way to port mainframe code. Oracle supported this shallow port paradigm by improving the speed of cursors and introducing the notion of sequencers and Row ID, which allows even the database ignoramus to get decent, or should we say: “does not suck completely”, performance out of a shallow port. If you hook up a profiling tool to SAP or Axapta today, you can still see the leftovers from those times.

Late 90ies: All Abstractions are Leaky

Just as proper relational data models were beginning to take off, and we began realizing the benefits of cost based optimizers and set based operations, something happened that would throw us into a dark decade of computer science. As with all such things, it started well intentioned: A brilliant movement of programmers began to worry about encapsulation and maintainability of large code bases.  Object oriented (OO) programming truly took off, it was older than that, but it now had critical mass.

The KoolAid everyone drank was that garbage collection, high level (OO) languages, IDE support and platform independence (P code/Byte Code) would massively boost productivity and that we never had to worry about low level code, storage and hardware again. To some extend, this was true: we could now trains armies of cheap programmers and throw them at them at problem (I that ironic way history plays tricks on us, the slang term was: The Mongolian Horde Technique). We also had less need to worry about ugly memory allocations and pointers – in the garbage collector we trusted. Everything was a class, or an object, and the compilers better damn well keep up with that idea (they didn’t). The champion of them all: JAVA, suffered the most under this delusion. C/C++ programmers became an endangered species. Visual Basic programmers became almost well respected. And, to please people who would give up pointers and malloc, but not curly brackets, a brilliant Dane invented C#.

At my university Danes, proud of our heritage, even invented our own languages supporting this new world (Mjolnir Beta if anyone is still interested). Everyone was high on OO. A natural consequence of this was that relational database had to keep up. If I could not squeeze a massive object graph into a relational database, that was a problem with the database, not with OO. Relational databases were not really taught at universities, it was bourgeois.

This was the dawn of object databases and the formalization of the 3-tier architecture. Central to the this architecture was the notion that scale-out was mostly about two things: 

  1. Scaling out the business logic (ex: COM+, Entity Beans)
  2. Scaling out the rendering logic (ex: HTML, Java Applets)

We still hadn’t figured out how to fully scale databases, though most database vendors were working on it, and there were expensive implementations that already had viable solutions (DB2 and Tandem for example). What we called “Scale-out” in the 3-tier stack was functional scale, not data scale: Divide up the app into separate parts and run those parts on different machines.

I suspect we scaled out business logic because, we believed (I did!) that this was where the heavy lifting had to be done. There was also a sense that component reuse was a good thing, and that highly abstract implementations of business logic (libraries of classes) facilitated this. Of course, we did not see the dark side: that taking a dependency on a generic component, tightly coupled you to release cycles of that component. And thus was born “DLL-hell” and an undergrowth of JAVA protocols for distributed communication (ex: CORBA, JMS, RMI)

Moving business logic to components outside the database also created a demand for distributed transactions and frameworks for handling it (ex: DTC). People started arguing that code like this was the Right Way™ of doing things:

Begin Distributed Transaction

  MyAccount = DataAccesLayer.ReadAccount()

  if  withdrawAmount <= accountBalance then

    MyAccount.Balance = accountBalance – withdrawAmount

    MyTransaction = DatabaseAccessLayer.CreateTran()

    MyTransaction.Debit = withDrawAmount

    MyTransaction.Target = MyAccount

    MyTransaction.Credit = withDrawAmount

    MyTransaction.Source = OtherAccount

    MyTransaction.Commit

 else

    Rollback Transaction

    Throw “You do not have enough money!”

 end

Commit Distributed Transaction

You get the idea… Predictably, this led to chatty interfaces. Network administrators started worrying about latency, people were buying dark fibers like there was no tomorrow. Database administrators were told to fix the problem and tune the database, which was mostly seen as a poorly performing file system. Looking back, it was a miracle that noSQL wasn’t invented back then.

Since this was the dawn of the Internet, scale out rendering made a lot of sense. It was all about moving fat HTML code over stateless protocols. It was not unreasonable to assume that you needed web farms to handle this rendering, so we naturally scaled out this part of the application. Much later, someone would invent AJAX and improve browsers to a point where this would become less of a concern.

We were high on compute power and coding OO like crazy. But like the Titanic and the final end of the Victorian technology bubble, we never saw what was about to hit us.

Y2K: Mediocre, But Ambitious

The new millennium had dawned (in 2000 or 2001, depending on how you count) and people generally survived. The mainframes didn’t break down, life went on. But a lot of programmers found themselves out of work.

In this light, it is interesting to consider that programmers were considered the most expensive part of running successful software back then. JAVA didn’t live up to its promise of cross platform delivery – it became: “Write once, debug everywhere” and people hated the plug-ins for the browser. While productivity gains from OO were clearly delivered, I personally think that IntelliSense was the most significant advance in productivity that happened between 1995 and 2005 (it takes away work from typing, so I can use it on thinking).

Professional managers, as the good glory hunters they are, quickly sniffed out the money to be made in computers during the tech bubble. As these things go, they invented a new term: “IT” that they could use to convince themselves that once you name something, you understand it. It was argued that we needed to “build computer systems like we build bridges”, but the actions we took looked more like “building computer systems like we butcher animals”. The capitalist conveyor belt metaphor, made so popular by Henry Ford and “enriched” by the Jack Welsh’ish horror regime of the 10-80-10 curve, eventually led to the predictable result: Outsourcing.

Make no mistake, I am a great fan of outsourcing. I truly believe in the world-wide, utilitarian view of giving away our technology, knowledge, money and work to poor countries. It is a great way to equalize opportunities in the world and help nations grow out of poverty – fast. In fact, I think we need to start outsourcing more to Africa now. Outsourcing it is the ultimate, global socialist principle.

The problem with outsourcing isn’t that we gave work away to unqualified people in Asia and Russia – because we didn’t – those people quickly became great, and some of the best minds in computer science were created there.  Chances are that you are reading this blog on a software written by Indians.

The problem with outsourcing was that it led to a large portion of westerners artificially inflating the demand for Feng Shui decorators, lifestyle coaches, Pilates instructors, postmodernwriters”, public servants and middle level management. These days, Europe is waking up to this problem.

But then again, if we send the jobs out of the country, all the unemployed have to do something with their time. Perhaps it is no coincidence that the computer game industry soon outgrew the movie industry. You can after all waste a lot more time playing Skyrim (amazing by the way) than watching the latest Mission Impossible, and you don’t have to support Tom Cruise’s delusions in the process

Sorry, I ramble, somebody stop me. Yet, the fact is that a lot of companies wasted an enormous amount of money sending work outside the country, dealing with cultural differences and building very poor software. One of the the results (again a big equalizer, this time of life expectancy) was that numerous people must have died from mal-practice as the doctors were busy arguing about the latest way to AVOID building a centralized, and properly implemented, Electronic Patient Journal/Record system. As far as I know, that battle is still with us.

The Rise of the East: Nothing interesting to say about XML

Once you have latched on to the idea of a fully componentized code base, it makes sense to standardize on the protocols used for interoperability. Just when we thought we couldn’t waste any more compute cycles – someone came up with SOAP and XML.

This led to a new generation of architects arguing that we need Service Oriented Architectures (SOA). There are very few of those today (both the architects and the systems), but the idea still rings true and for the customers who have adopted it. And it IS very true: a lot of things become easier if you can agree on XML as an interchange format. We also saw the rise of database agnostic data access layers (DAL). Everyone was really afraid to bind themselves to a monopoly provider.

I don’t think it is a coincidence that open standards and the rise of Asia/Eastern Europe/Russia coincided with *nix becoming very popular. The critical mass of programmers, as well as the interoperability offered by new and open standards, made it viable to build new businesses and software. The East has a cultural advantage of teaching mathematics in school at a level unknown to most westerns – I suspect it will be the end of our Western culture if we don’t adopt. Good riddance..

And the problem was again the over interpretation of a new idea. Just because XML is a great interchange format does not mean it is a good idea to store data in that format. We saw the rise (and fall) of XML database and a massive amount of programmer effort wasted on turning things that work perfectly well (SQL for example) into XML based, slower, versions of the same thing (XQuery for example). But something was true about this, and it lingered in our minds that SQL didn’t address everything, there WAS a gap: Unstructured data…

All the while, we were still high on compute power. Racking blade machines with SAN storage like there was no tomorrow. The term “server sprawl” was invented – but no one who wrote code cared. We continued to believe all that compute power was really cheap. Moore’s Law just kept on giving.

The end of the Free Lunch: Multi Core

But around 2005, it was the end of it. CPU clock speed stopped increasing. Throwing hardware at bad code was no longer an option. In fact, throwing hardware at bad code made it scale WORSE.

Those of us who had flirted with scale-out in the 90ies, and failed miserably, got flashbacks. Predictably, we acted like grumpy old men: angry and bitter that we didn’t get the blonde while she was still young and hot. Oracle became hugely successful with RAC, and the idea of good data modeling that actually has to run on hardware came back on the table, snatched from the hands of the OO fans while they were distracted by UML. People started arguing that perhaps, some of that business logic we all loved scaling out (functional scale-out, mind you), DID belong in the database after all. Thus came SQLCLR and JAVA support in Oracle and DB2. Opportunistic companies started publishing data models and selling them for good money, it was believed that if ONLY we could do 3NF modeling of data, all would be well. Inmon followers were high on data models!

People demanded automatic scale and everyone in the know, knew that this was not going to work without a fair amount of reengineering. But of course, we continued to pay homage to the idea – people don’t like their illusions broken.

While everyone was busy wrapping their head around scale-out, the infrastructure guys were beginning to show signs of panic. Big banks were complaining (and still are) that their server rooms were overflowing with machines. Their network switches were not keeping up anymore and the SAN guys went into denial: “You may measure a problem in Windows, but that is Windows who can’t do I/O, I see no problem in the SAN”. All our XML, OO and lazy code and CPU greed had started to take its toll on the environments. Somebody started talking about “Green Computing” – damn hippies!

Flashback: Kimball’s Revenge

While Teradata was busy selling “magic scale out” engines that had made great progress in this space, Kimball followers were spraying out Data Marts and delivering value in less time than it took the “EDW people” to do a vendor evaluation. The Kimball front got themselves a nice weapon of mass destruction with the Westmere and Nehalem CPU cores: cheap, powerful CPU power that took up very little space in the server room and didn’t need any special network components. It was all in-process on the same mainboard. Itanium (IA-64) went the way of the Alpha chip, the final realization that most code out there truly sucks: high clock speed beat elegant architectures.

Finally, a single machine with lots of compute power, no magic scale-out tricks, could solve a major subset of all warehouse problems (and interestingly most of the classic OLTP problems too). Vendors started digging out old 1970 storage formats that had a great fit with Kimball. Column stores and bitmap indexing got a revival. We saw the rise of Neteeza, Redbrick, Analysis Services, TM1, DB2 OLAP Views and ESSBase. This trend continues today, for example with Vertica and the Microsoft VertiPaq engines. there is even a great website dedicated to it: The BI Verdict.

Data kept growing though, and today we are seeing an interest in MPP, Kimball style, warehouses promising truly massive scale. Scale-out and -Up combined in a beautiful harmony.

But of course, social media beat us all to it. No matter how much data we tried to cram into the warehouse, the world could spam it faster than we could consume it.

BigData: More Pictures of Cats than you can Consume

Our tour of history is nearly at an end – bear with me a little longer. Today, the old dream of handling unstructured data is becoming reality – but perhaps a nightmare for some.

With BigData and HADOOP, we have a new architecture that for the first time makes some promises we can start believing:

  • Storage is TRULY cheap and massive (But you have to live with SATA DAS, not SAN)
  • Unstructured data, and queries on it, can be run in decent time (Not optimally fast, but good enough)
  • Semi automated scale is doable (but only if you know how to write Map/Reduce or use expression trees)
    This means that we can now drink the barrage of data coming from the Internet. Of course, human nature being what it is, a lot of this data is pure noise. In fact, the signal to noise ratio is scary on the Internet – I can think of no better word for it.
    Among all the noise of pictures of cats, tweets and re-tweets about toilet visits, and latest news about Britney Spear’s haircut, there is something we treasure: Knowledge about human nature. The problem that BigData helps us address is that we don’t always know how to find the needle in the haystack. If we build BigData installations, we can keep the data around, unafraid of losing it or burning all our money, while we dig for that insight we so desperately seek about our customers and the world around us.
    I have written elsewhere about the great fit between BigData and Data Warehouses. But I would like to reinforce my point here: Once you know the structure of the data you are looking for, by all means model it and put it in a warehouse for fast consumption. If you don’t know what you are looking for yet (or if you can wait for it), put it in BigData.
    Just be careful, there is a price associated with choosing temporary ignorance over knowledge seeking.

The Cloud: Grow or Pay!

Remember those infrastructure people who complained about server rooms overflowing? Do I hear your growls of anger? As architect/programmer/DBA, I think we often have felt the frustration of dealing with infrastructure departments (in my case the SAN guys take the price). “Why can’t those people just rack the damn server and be done with it – instead of letting me wait 3 months?” we ask ourselves as we again bang our fist into the table in frustration.

But the fact of the matter is that those server rooms ARE overflowing and overheating all over the world. Handling all the complicated logistics of racking and patching servers is expensive, and that is a direct consequence of our greed for compute power. Those infrastructure people are just trying to survive the onslaught of the server sprawl – give them a break. They respond by virtualizing (SAN and virtual machines) and they fight back when we ask for hardware that is non-standard or “not certified”.

imagePerhaps it is time we look ourselves in the mirror and ask if we really need to waste all those CPU cycles on overly complex architectures that protect us from ourselves and our inability to properly analyse what we plan to do with data. Every time we rack a new server, power consumption goes up in the server room and it gets just a little warmer in there. We are incurring a long term cost in power and cooling by writing poor code and designing overly flexible architectures. At the end of the day, a conditional branch in the code (an “if statement”), metadata and dynamically typed, interpreted code costs CPU, and if we want to be flexible, we often need a lot of that.

If there is one thing we have learned from mainframes, it is that code lives nearly forever. Especially when it is so bad that no one dares to touch it, or so complex that no one understands it.

If it makes your day, think of this as a green initiative: you are saving the planet by writing better code and saving power. But of course, such tree hugger mentality gets us nowhere in the modern business world, so how about this instead: with the dawn (overcast?) of the cloud, compute cycles will once again be billed direct to you: the coder and DBA!

MIPS are back in a new form, and it again makes good business sense to write efficient code. In fact, it always made good sense (even in the 90ies), but all those costs you were incurring from writing bad code or forcing data into storage formats that are unnatural (e.g. XML) has so far been hidden in your corporate balance sheet. With the cloud, that cost is about to uncloak itself, and like a Klingon battleship – this might leave you in trouble!

Summary: What history Taught Us

If you are still with me here, thank you for putting up with my tantrums. It is time to wrap this up. What has history taught us about architectures and how we go about storing data?

I like to think of computing as evolutionary, not revolutionary. There are very few major breakthroughs in computer science (perhaps with the aforementioned exception of IntelliType) that don’t arrive with a lot of advance warning, and grow up slowly from many failed attempts at implementations. For example, column stores are super hot today, but they were invented back in the 70ies.

Let us have a look at the “evolutionary tree of life” for data storage engines:

image

Above is of course a gross over-simplification, and you could argue about where the arrows should connect. But no matter where those arrows lead, it makes the point I want to make: That nearly all of these technologies are still around today and very much alive (perhaps with the exception of XML databases, but those were pretty silly to begin with). In fact, most large organisations have most of them around.

Are all these technologies just alive because they have not yet been replaced with the latest and greatest? Is noSQL, HADOOP or some other new product bound to replace them all over time?

I think the different technologies are there for a good reason. A reason very similar to why there are many species of animals in the world. Each technology is evolved to solve a certain problem, and to live well in a specific environment. When you target the right technology to a certain problem (for example column stores towards dimensional models) you solve that problem elegantly and with the least amount of compute cycles – preserving energy. Very soon, solving problems with a low number of compute cycles in elegant ways is going to really count.

While you might be able to solve most of the structured DW data problems with HADOOP, you eventually have to ask: Is a generic map/reduce paradigm really the way to go about that?

Humans have a strange way of wanting a single answer to the big and complex question in life, and we waste significant time searching for it. Entire cults of management theories (and I use that term lightly) and religions are build around these over simplifications. We seek simple answers to complex question, and as IT architects, this can lead us down a path of believing in “killer architectures”.

What history has taught us is that such killer architectures do not exist, but that the optimal strategy (just like in nature) depends on which environment you find yourself in. Being an IT architect means rising above all these technologies, to see that big picture, and resisting the temptation to find a single solution/platform for all problems in the enterprise. They key is to understand the environment you live in, and design for it using a palette of technologies.

Staying in the metaphor, this leads me to another conclusion: Just like evolution made us store fat on our bodies in different places depending on the usage (thank goodness) you also have to consider the option of storing your data in more than one place. Having a bit of padding all over the body is a lot more charming (and healthy) than a beer belly Smile

References:

Implementing MurmurHash and CRC for SQLCLR

2011-12-07 1 comment

As we saw in my previous post, the build in hash functions of SQL Server were either expensive with good distribution, or cheap, but with poor distribution. As a breath of fresh air, let us look at a useful magic quadrant:

 

image

We see that all of the hash functions exposed by HASHBYTES fall into low speed quadrants, caution is advised here. We also see that while CHECKSUM and BINARY_CHECKSUM are challengers, they do not have the spread to fully compete. We are left with modulo as the best possible hash function, but we believe that this algorithm may have trouble executing. The Modulo hash function has some problems (sorry, had to say “problem” instead of “challenge”, I couldn’t bear speaking management bollocks anymore):

  • It only works on integer types
  • If there is structure in the data (for example, if all values are equal) it will not spread the values equally
    The question we are faced with is:
    “Could we create a high speed, good spread, hash function that belongs in the leader quadrant and which does not have the limitation of the module functions”

It was this question I sought to answer when I started hacking away at SQLCLR functions. As you read the following, please don’t laugh too hard at me (a small snigger will suffice) – it has been a long time since I hacked away at code.

MurmurHash2 and Davy’s Bit Magic

Google has published a very interesting hash function, MurmurHash, claiming to have the properties I am looking for: It is both fast and has a good spread. The hash algorithm comes in different flavours: MurmurHash2 and MurmurHash3. Each flavour also exists in architecture and word size optimized versions. I decided to implement the x86/x64, 32-bit integer version in C#.

Fortunately, someone already beat me to it. Davy Landman has an implementation of MurmurHash2 – which he has kindly shared under GPL. I took his implementation and wrapped it in a SQLCLR user defined function.

While trying to understand Davy’s code and hacking it to work with SQLCLR, I had a few “aha moments” that I would like to share.

First of all, Davy uses this cute trick to turn 4 bytes from a BYTE[] array into a 32-bit INT:

UInt32 k = (UInt32)(data[currentIndex++] 
                  | data[currentIndex++] << 8
                  | data[currentIndex++] << 16
                  | data[currentIndex++] << 24
           
);

 

I thought that was a rather neat trick. To practice my illustration skills, here is what happens:

image

Please note that left shift (<<) and or (|) operators take INT32, not UINT32 as input. Hence the need to cast the final value to UINT32.

Speaking of integers, in my modification of Davy’s source code you will find:

unchecked
{
    return (SqlInt32)(Int32)h;
}

I need to get form the UINT which is returned by the reference implementation of MurmurHash to the signed integer (SqlInt32) that SQL Server understands. Doing the unchecked cast here is faster than Convert.ToInt32.

MurmurHash3

From the C++ source code of MurmurHash3, and using Davy’s bit trick, it took me only a few hours to get myself a nice implementation of MurmurHash3. I compared this with a C++ implementation done by my colleague Christian Martinez (based directly on the Google source), and we agreed on outputs that exercise all the branches in the code. So I am reasonably confident that my implementation is correct.

Except for the fact that it turns out to be important to use SqlBinary instead of SqlBytes (blogged here), there is not much more to say. Feel free to use the MurmurHash3 for your own implementations (my implementation is GPL)

The source code for the MurmurHash functions is too long to paste in this blog, so I created a new page on my site which you can find by following this link: C# Source code for MurmurHash in SQLCLR.

Before I show you the spread and speed results, let me just talk a little about two very old, but very commonly used, hash functions.

CRC16 and CRC32

The Cyclic Redundancy Check (CRC) family of hashes are, compared to MurmurHash, very simple to implement. They walk through each byte in the input and divide it with a specially selected polynomial. For implementation purposes, the polynomial is simply a constant in the code and the division is done with XOR and shifting (making this a very efficient operation). The remainder of the byte division is fed into the division on the next byte (giving the hash algorithm its name) and so on, until the end of the input .

WikiPedia contains a great reference implementation that I used for my tests. For readability, the source code is again available on a separate page of this blog: CRC16 and CRC32 in C#.

In the naïve approach to CRC, the polynomial division is implemented using an inner loop through all the bits of each byte:

rem = rem ^ data[i];

for (int j = 0; j < 8; j++)
{
  if ((rem & 0×00000001) == 0×00000001)
  {   // if rightmost (least significant) bit is set
     rem = (rem >> 1) ^ Polynomial32; /* Polynomial */
  }
  else
  {
    rem = (rem >> 1);
  }
}

An interesting optimization, shaving off a lot of cycles, is to create a table that allows you to look up the result of this loop for all 256 combinations of 8 bits. My source code contains such  tables, both for CRC32 and CRC16 implementations. Using these tables, I can replace the above code fragment with:

rem = (rem >> 8 ) ^ CRCTable32[(rem & 0xff) ^ data[i]];

…A major optimization

Speed Results

At this point, you may wonder if I did all of this to give you a chance to laugh at my rusty coding skills. What exactly was the point of implementing a new hash function in the first place?

Without further ado, let us look at the results compared to native SQL Server functions:

image

Isn’t that beautiful? If we don’t need the cryptographic properties of the HASHBYTES functions, we can beat SQL Servers hashes with our own SQLCLR implementation. Please note that I took out MD2 from the above to avoid skewing the results (I have previously shown that it is simply too inefficient).

Now, the bad news: the yellow line above. That represent a “hash function” that returns zero, like this:


public static SqlInt32 NoHash(SqlBinary data)
{      
  return (SqlInt32)0;
}

 

Notice that even such a simple user defined function that does nothing, consumes almost all the same number of CPU cycles as a full hash calculation. This is the overhead of the CLR and throwing data back and forth between managed and unmanaged code. So much for my optimizations in CRC.

Spread Results

We now know that there is at least hope of implementing a faster hash function than HASHBYTES. But how well do these functions spread the input. Running the chi-squared test from my previous blog entry we get These results:

image

Now THAT is pretty nice isn’t it? We see that good old CRC makes for an excellent hash function for our purpose. MurmurHash does very well too, comparable with SHA, but at under half the CPU cost.

Summary

In this blog, I have shown you how to implement hash functions that are both faster and have better spread than the built in hash functions exposed in SQL Server HASHBYTES. We have also seen that the overhead of SQLCLR is significant for this case, and hence, it would be preferable if non cryptographic hash function were exposed natively in SQL Server.

If you care about such new hash functions for SQL Server, I would suggest you file a Connect item to ask for them and state your purpose for wanting them. “It would be cool” or “Thomas says so” is not enough, provide a business justification. But at least, now you know what to ask for…

I owe Christian Martinez thanks for this blog for directing me to the MurmurHash functions, and for helping me validate the correctness of my C# implementation against his C++ version.

PS: As I was exploring this little implementation, I noticed that one of my heroes, Donald Knuth, has extended The Art of Computer Programming with a new volume: Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams. That sounds like an interesting read and it is now on my birthday wish list.

Thread Synchronization in SQL Server

2011-11-09 Leave a comment

Any code optimized for highly concurrent workloads must worry about thread synchronization. SQL Server is no exception, because in a database system, synchronization is one of the core functionalities you rely on the engine to provide (the noSQL crowd may ponder that a bit). In this blog post, I will describe the synchronization primitives, like locks, latches and spinlocks, used by SQL Server to coordinate access to memory regions between threads.

imageBefore I go on, an update: am currently sitting in the train from Stockholm after a hugely successful PASS SQL Rally Nordic conference.

At the conference, I presented my new “Grade of the Steel” deck, which I hope will be available online soon. The deck led to a lot of questions about spinlocks during the party that same evening.

We had Swedish legend E-Type playing and putting on an impressive stage show (including dancing Vikings, as seen in picture to the right). The show left me wondering how to improve my presentations skills to get THAT level of reach to the audience. The conference arrangers had  also done an amazing job at keeping everything running smoothly and I return with many good memories. Thank you guys, you showed the crowd that the Nordic community is very strong.

Well, “here I go again”, being distracted by all the fun. My point was, the questions at the conference led me to write this blog entry, as your (always not so) humble servant. Keep that feedback coming.

Locks and Lock Graphs

Most DBAs and developers are familiar with locking and blocking: Locks in SQL Server are the semaphores that guard the consistency of the database and enforce the high level ACID properties and serialization properties in the database.

You can measure locking activity in several DMV, and there are many articles written on how to diagnose and troubleshoot locking issues. For this blog, suffice to say that the total time the entire engine spends on waiting for locks is measured in sys.dm_os_wait_stats. I would like to make the point that EVERY performance investigation I do on SQL Server stats with querying this view… Learn to use it.

In sys.dm_os_wait_stats, you will find rows that have wait_type beginning with LCK. These are the lock waits, and they have been extensively described elsewhere.

The one thing I would like to add, is that locks have a directed acyclic graph relationship (DAG) with each other and are generally acquired in a certain order. Let us take a hypothetical example of a query locking five rows in a table with two partitions:

image

Before the query can even start, it needs a database (DB) lock to even run, this lock is acquired by the connection itself and held as long as you are connected to the database.

The first locks acquired by the query itself, are the two schema locks (SCH in shared mode S): one of the table itself, and one for the schema the table lives in. Note that the term “schema” is a bit overloaded here: I mean to say that the lock is acquired both on the table (the table object in the database schema) and on the schema (the schema object in the database schema). The SCH locks prevent the table structure itself from being modified while the query runs. The requirement to hold this SCH lock is why, for example a SWITCH operation, will block a person from reading the table while the switch happens.

When SCH locks are acquired, the query must hold a lock on the table. This prevents other conflicting locks, for example a bulk inserter or a greedy, table escalated X-lock, from conflicting with the query.

Traversing onwards from the table level, we have a special lock level: HOBT, representing a partition (HOBT = Heap Or B-Tree, pronounced: “Hobbit”). Recall that even an unpartitioned table has one partition: the table itself. The HOBT lock is only acquired if the table has been configured to support this lock level. This can be done with this statement:

ALTER TABLE <table> SET (LOCK_ESCALATION = AUTO)

 

… note that the default for tables is to have this lock type disabled and to jump directly to…

Row and page locks: Queries will generally lock the pages it needs access to, though this is mostly a performance optimization used for table scans. Row locks fall into two different categories, RID or Key, depending on whether it is a heap or a B-tree you are locking. Note that you can disable row and page locks by altering the table like this for a B-tree:

ALTER INDEX <index> ON <table>
SET (
   
ALLOW_ROW_LOCKS = OFF
  , ALLOW_PAGE_LOCKS = OFF)

 

There are cases where disabling the row and page lock levels make queries go faster, but it can also have a negative effect on concurrency. I hope to blog about those cases, and how to diagnose them, in a later installment.

I said before that locks are acquired in a the order of the directed acyclic graph by the SQL engine. This is not something you need to worry about, the engine handles this for you. But, can you guess why it matters which order locks get acquired?

(Hint: What would happen if someone started at the top at the same time that someone else started from the bottom of the graph?)

Locks in SQL Server are stored in hash tables in the part of the engine called the lock manager. These hash tables consume some memory (but so little we don’t really care about it, so Oracle people, stop bickering this point) and are optimized for highly concurrent access. The lock manager even partitions locks over multiple NUMA nodes to make accessing lock information faster. The mechanism used to partition locks are similar to the SuperLatches described in: Hot It Works: SQL Server SuperLatch’ing / Sub-latches.

Latches

Latches are a special type of lock that is used inside the SQL Server engine to protect shared data structures. They enforce integrity at the physical layer of the database and will be used automatically by SQL Server, irrespective of the transactional semantics you request.

The wait time for latches shows up in sys.dm_os_wait_stats, there are three broad categories: PAGEIOLATCH, PAGELATCH and LATCH

    The first two are latches used by the database to protect the integrity of the buffer pool. The PAGEIOLATCH variant is used when data is brought in from disk, while the PAGELATCH is used to coordinate things that are already in memory.
    The last latch type, LATCH, is a different beast that is used several places inside the SQL code path to coordinate other memory structures – for example during query execution. The good news is that you can get more details about this latch from the DMV: sys.dm_os_latch_stats. You can think of this view as a way to “zoom in” on latches.

The view contains a latch_class called BUFFER – this is just the aggregate of all the PAGEIOLATCH and PAGELATCH waits, you already have the detail of those, so just ignore it. The rest of the rows in sys.dm_os_latch_stats add up to the LATCH waits.

Spinlocks

Locks and latches have something in common: When a thread in SQL server needs to wait for them, the thread will block and yield the scheduler. In general, it is a good idea to play nice this way and yield the scheduler to another thread if you cannot get any more work done until you acquire access to the structures you want to use. SQL Server also implements user mode scheduling, which takes down the price of a context switch significantly – further boosting scale.

Think about locks and latches with this analogy: You are about to leave a plane, but your luggage is in an overhead compartment in the seats behind you. The best strategy for optimizing throughput, is for you to sit down, wait for the people in the seats behind you to pass and then go open the overhead compartment. This optimizes for the best system throughput and lets you turn off your brain just a little longer.

HOWEVER, sitting down and patiently waiting for other people to get their work done first, is not always the smartest idea. As you leave the plane and move to passport control, you don’t sit down on the floor while you wait for the person in front of you to fumble for his passport in his pocket. The reason you don’t do this, is that you expect him to quickly be done with his fetching; and you remain alert while you wait (and curse). It would be a waste of time and leg muscles to sit down, and the people behind you would get annoyed by you too.

Spinlocks are optimized to protect data structure where you expect the thread (or person) in front of you to get their work done fast! They are typically implemented as busy waiting loops that are optimized for the CPU architecture they run on (we want the loop to be as cheap as possible and we need some low level tricks to get it working). Wikipedia has an excellent article with x86 assembly examples.

SQL Server uses spinlocks in many parts of the code where the assumption of “shortly held locks” can be made, and typically this gives you more throughput and is very efficient.

You can see how many spins are done by querying this view:

SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY backoffs DESC

 

In older version of SQL Server (before 2005) – this view was not available. However, there is another way to get the spinlock information, namely by executing:

DBCC SQLPERF ( spinlockstats )

 

The problem with spinlocks is of course that the assumption about shortly held locks is not always true. Sometimes, the guy in front of you just take too long to get stuff done, and you are stuck with all your attention pinned to him (spinning) and getting more and more annoyed. Interestingly, NUMA systems are typically the machines where this assumption breaks down, because accessing memory on a foreign node can take a significant amount of time. If access to that memory is protected by a spinlock, this means you can burn a lot of CPU in the busy waiting loop and you can get some very interesting side effects.

In general, you should not worry about high values in the spins column of sys.dm_os_spinlock_stats. Spinlocks are after all designed to spin a lot in busy waiting loops. The condition that may cause you to worry that the person in front of you is not clearing the queue fast enough, is when you see high values in the backoffs column. Backoffs are what happen when the thread spinning for a resource gives up and decided to fall asleep. If the backups go up dramatically as you increase throughput, you may have spinning on a memory region that is heavily contended and have to design around it.

There is an excellent article on how to diagnose spinlocks written by SQLCAT members Mike Ruthruff and Ewan Fairweather: Diagnosing and Resolving Spinlock Contention on SQL Server. Among other good stuff, it details how to use Trace Flag 3656 (we all love trace flags) and XEvents (and we love XEvents even more) to troubleshooting spinlock problems (see page 14-16).

Interlocked Operations

I am including this last synchronization primitive for completeness (read on to understand why).

Modern x86/x64 CPUs expose some very interesting instructions that allow you to build what is known as interlock operations. An interlock is a CPU special instruction, or series of CPU instructions, that are guaranteed to perform an atomic operation on an integer. For example, using an interlock you can safely add one to an integer and not have to worry that someone else is trying to add one at the same time. As another example, spinlocks (see above) are typically implemented using the compare and swap operation (CMPXCH in the x86 instruction set) that allow you to compare two values and swap them if they are equal, in a single, atomic instruction.

The reason I am only including this section for completeness, is that you have no way of telling if SQL Server used an interlock to synchronize access to a memory region (though it sometimes does) and no way to influence it. And the reason is that an interlock neither blocks, nor spins the thread – it simply allows the thread to execute as if no other threads were touching the data, because the interlock instruction “knows” that the underlying hardware handles the synchronization. It is the ultimate, lightweight synchronization trick.

You may wonder why SQL Server does not use interlocks everywhere. The big reason is that interlocks are severely limited. They can typically only operate on a a single integer at a time, and only support a small subset of operations on that integer. This is not enough to fully implement all the thread synchronization you need.

If you are interesting in reading more about interlock operations, there is a very nice introduction at this site for computer game programmers: Multithreaded Programming Part 3: Going Lockless. And by the way, I recommend studying the work of computer game programmers – they are the true wizards of computer science.

Related:

Exploring Hash Functions in SQL Server

2011-11-06 4 comments

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may chose intelligently between them.

In this blog post, I will benchmark the build in function in SQL Server. I will focus on answering two questions:

  • How fast is the hash function?
  • How well does the hash function spread data over a 32-bit integer space

I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.

For the picky people (my dear soul mates): I will be using 1K = 1024 in the following, so 64K = 65536.

      Test data

      Because we sometimes find ourselves hashing dimension keys, I will use a very simply table with a nice key range as my test input. Such a key range could for example have been generated by an IDENTITY(1,1) column or a SEQUENCER. Here is the test script that generates the input:

CREATE TABLE CustKey (SK INT NOT NULL)

INSERT INTO CustKey WITH (TABLOCK) (SK)
SELECT n FROM dbo.fn_nums(10000000)

 

(see my previous post for the fn_nums source)

Speed of the Hash function

SQL Server exposes a series of hash functions that can be used to generate a hash based on one or more columns.

The most basic functions are CHECKSUM and BINARY_CHECKSUM. These two functions each take a column as input and outputs a 32-bit integer.

Inside SQL Server, you will also find the HASHBYTES function. This little gem can generate hashes using MD2, MD4, MD5, SHA and SHA1 algorithms. The problem for the purpose of our test is that these function spit out BINARY types, either 128 bits (for SHA) or 160 bits (for MD). However, we can quickly map that value into the integer space by doing a modulo (in SQL: %) with MaxInt (2**31 – 1). Interetingly, Modulo work directly on values of BINARY in SQL Server.

Before I move on the results, I would like share a little trick. When you need to get the raw CPU speed of the hash, you may be tempted to do something like this:

SELECT BINARY_CHECKSUM(SK) AS h
FROM CustKey

 

But this is wrong. Why? Because you are both measuring the time it takes to hash as well as the time it takes to transfer the rows to the client. This is not very useful, on my laptop it takes over two minutes to run the above query.

Instead, you should try to measure the raw time the server needs to calculate the data. One way to achieve this (and this trick can be used on most queries you benchmark) is to wrap the SELECT statement in a MAX function. Like this:

SELECT MAX(h)
FROM (
    SELECT BINARY_CHECKSUM(SK) AS h
    FROM CustKey
    ) noRows
OPTION (MAXDOP 1)

Notice something else interesting, I am forcing parallelism down to 1. I want to get as close to the raw cost of the CPU as possible – so I don’t want to clutter the query with any overhead of parallelism. The query now runs in a little less than 3 seconds on my laptop.

I addition to the above trick, we also need to quantify the time SQL Server spends on just accessing the table. We can get a good approximation of this by measuring the SELECT statement without any hash function. On my laptop, I measured this to be around 3200 ms. In the results below, I have subtracted this runtime so we are only measuring the cost of the hashing.

With all this said, let us have a look at the results:

image

Some comments on the data:

  • Notice that MD2 is very inefficient. I have not studied this algorithm in great detail and would appreciate a specialist commenting here.
  • Apart from the MD2 anomaly, the HASHBYTES functions all performan pretty much the same. Using around 1 micro sec per hash value calculation
  • BINARY_CHECKSUM and CHECKSUM are MUCH faster, by about an order of magnitude
  • Doing a simple modulo that spreads out the data over the integer space is about the same speed as CHECKSUM and BINARY_CHECKSUM
  • It is not the cast to NVARCHAR that uses CPU time (HASHBYTES requires NVARCHAR input)
  • However, the checksum function do take longer to run when you first cast to NVARCHAR. Costing around 100 ns per INT value

Of course, the HASHBYTES functions also have other characteristics than speed which may be desired (cryptographic utility for example). but if you you want is speed, it looks like CHECKSUM and BINARY_CHECKSUM are faster.

But, let us see how good the functions are at spreading the data.

Spread of the Hash Function

Another desirable characteristic, apart from speed, of a hash function is how well it spreads values over the target bit space. This is useful not only for cryptographic purposes, but also to “bucket” data values into equal sized portions, for example in scale-out MPP architectures.

Often, you will not know in advance how many buckets you eventually want to subdivide the integer space into. You may start out with 32K buckets and divide up the integer interval in 128K (32-bit space divided by 128K = 32K) even sized buckets, each with 128K values in them:

  • Bucket0: [MinInt…MinInt + 128K[
  • Bucket1: [MinInt + 128K…MinInt + 128K*2[
  • Bucket2: [MinInt + 128K*2…MinInt + 128K*3[
  • … etc..
  • Bucket 32K: [MinInt + 128K*32K…MaxInt]

Perhaps you will later want a further subdivision into 64K buckets, each with 64K hash values.

Wise from the runtimes I measured before, I used only 1M rows with values [1…1M] for this test. However, this is plenty to show some good results, so don’t worry.

To avoid running chi-square testing (See later) of integer value spread all over the 32-bit space, I calculated the hash of each key and I then bucketed them into 64K bucket ranges (each with 64K values, neat isn’t it?).

This table comes in handy for that bucketing

CREATE TABLE ExpectedBuckets 
(
    BucketID INT
 
, RangeStart INT
  , RangeEnd INT)


INSERT INTO ExpectedBuckets
SELECT
    n
    ,(n – 1 – POWER(2,15)) * POWER(2,16)
    , (n – POWER(2,15)) * POWER(2,16) – 1
FROM dbo.fn_nums( POWER(2,16) )


A sample output form this table:

image

I can calculate the hash values and put them into a table like this:

CREATE TABLE HashResult
(
  ObservationID INT IDENTITY(1,1)
, HashValue_i INT
)

CREATE CLUSTERED INDEX CIX ON HashResult (HashValue_i)

 

And finally, I can join them all up and count the values in each bucket:

SELECT
      BucketID
    , CAST(ISNULL(COUNT(HashValue_i), 0) AS FLOAT) AS Observed
FROM ExpectedBuckets EB
LEFT JOIN HashResult
ON HashValue_i BETWEEN EB.RangeStart AND EB.RangeEnd
GROUP BY BucketID

 

We now have a nice count of the content in hash bucket, it looks like this:

image

How do we test if this is a good result or not? It seems that min/max, averages and standard deviations don’t really suffice here (sorry modern MBA types, I realise I am going to loose you now Smile).

We need to do some hypothesis testing. The hypothesis I want to test (aka: the NULL hypothesis – all resemblance to relational algebra is purely coincidental).

“My chosen hash function evenly distributes the integers into the 64K hash buckets”

Interestingly, this hypothesis makes a prediction (has to, if not we cannot falsify it): namely that in each bucket filled by hashing the 1M rows, we should expect to measure 1M / 64K = 15 members.

There is also an important twist here: We have to measure the content of each bucket, even when it is zero. This is why I use a left join in the query above, we have to measure all the results we know, not just count the hash buckets that actually get filled (you can see how not bucketing the data would have made the test result very large).

We can now test the hypothesis against the measured result. If we see correlation, then it gives us confidence in the correctness of the hypothesis, and lets us conclude that our hash function is a good one. It also gives us the ability to compare different hash functions with each other. We can test for correlation between the predicted result and the actual result with the chi-square test. Interestingly, the tools we need for this test is even exposed in Excel as the CHISQ series of functions.

Without further ado (and insults), here are the results:

image

(source data available on request)

Some observations about the data:

  • Most importantly: the CHECKSUM and BINARY_CHECKSUM functions are very poor hash functions. Even with 1M input and 64K large buckets, they don’t even fill the hash space. The skew is massive on non-skewed input data.
  • We also see that casting the INT to NVARCHAR improves the behavior of both BINARY_CHECKSUM and CHECKSUM, but not enough to make them good hash functions.
  • SHA and SHA1 seem to give us the best spread of the data, but at 1 microsecond per hash, it seems rather expensive.
  • Notice that the MD series do not give a particularly nice spread. I suspect this is related to the way I divide the data with MaxInt to get from the 120bit value to a 4-byte integer
  • We see that only the modulo does a distribution that gives us nearly 100% confidence (not fully though, because 1M does not divide 64K, which is reflected in the 824 value of Chi-squared). This result is expected, and my function here is engineered to “game the test”. The modulo cannot generally be used like this to spread the value all over the 32 bit integer space.
  • For fun, I put in a test where I use random values all over the integer space. Even with a confidence interval of 10%, we cannot say it distributes evenly. This means that SHA functions distribute better than random on this input dataset. Your mileage will of course vary for random values

On the note of the CHECKUM series, Books Online does (under?) state:

If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

Summary

In this blog I have explored properties of hash function built into SQL Server. The results indicate some general guidance that could be followed:

  • For best spread over the hash space, use SHA or SHA1 algorithms with HASHBYTES function
  • Be VERY careful about using CHECKSUM and BINARY_CHECKSUM as these functions do not spread a hashed integer value nicely over the full space. Depending on what you use these functions for, they may not be suitable
  • It takes about 1 CPU microsecond to calculate a single call to HASHBYTES, except when hashing with MD2
  • Using HASHBYTES to spread out values over the 32-bit integer space seems rather expensive, since even the best cases uses tens of thousands of CPU instructions.
  • This indicates that it may be possible to implement a better hash function if you do not care about cryptographic properties of the hash

Reference:

Maestro web page online

2011-11-01 3 comments

The SSAS Maestro training now has a home at the Microsoft domain. You can find it here: http://www.microsoft.com/learning/en/us/certification/ssas-maestros.aspx

Follow

Get every new post delivered to your Inbox.

Join 350 other followers