What Structure does your Data Have?

2012-05-15 2 comments

I am currently thinking about measuring compression rates for different column store indexing strategies. In order for me to get some realistic data, I am looking for people who can share a very small anonymised data sample with me.

Specifically, I am looking for samples from Kimball style warehouses from different industries (If you are a 3NF warehouse, I am not interested). Roughly speaking, I would like something like this select statement:

 

SELECT HashFunction(DimensionKey1) AS d1
, HashFunction(DimensionKey2) AS d2
, ….
, HashFunction(DimensionKeyN) AS dN
, HashFunction(Measure1) AS m1
, …
, HashFunction(MeasureM) AS mM

FROM FactTable

 

Where HashFunction is something that yields either a 4 byte or 8 byte integer. If you are using SQL Server, BINARY_CHECKSUM will do (or you can get fancy and use my C# implementation of CRC32).

10MB of data from a fact table would be optimal, preferably sampled from a single time interval in the data. For example: if a full day is around 10MB in the warehouse, a single day would be the  best sample. I don’t need to know any column names, only which columns are dimensions and which ones are measures. If possible, some indication of which industry the data is from would be helpful, but it is not strictly needed either.

I am fully aware that there may be legal reasons you cannot share data, even when anonymised like above. Please only give it to me if you would feel confident to give this data freely away on the Internet. I am specifically trying to create guidance on compression techniques depending on the content of data, and I plan to share this on  my blog, along with distribution statistics on the anonymised data.

Data can be delivered to me via DropBox or Google Drive in CSV format. If needed, I can also HTTP or FTP GET it.

Thanks for anything you can share, I will return the sharing favour here on my blog

Moving on from Microsoft

2012-05-09 17 comments

Not too long ago, I handed in my notice to Microsoft, terminating my employment mid-June 2012. I suspect this is slowly leaking out on the Twitter and Facebook these hours – as is tradition in the connected world.

image

During the graduation ceremony at my university it was said that: “Life is a series of temporary relationships” (though hopefully not with your wife). Throughout my career I have taken this fact of life to heart. And I can, without fear of the future, try out new challenges.

After June 2012, I am going to take it a bit slow, and do some consulting. I will then consider which adventure I want to go on from here. Because I believe you have to live out your dreams of adventure while you are still young, and I have a fair bit of fuel left in the tank before “peak oil”.

I part with Microsoft as a friend and will be working on creating a smooth transition over the next month. My work with SQL Server will continue, just outside the campus in Redmond. After June, I will be available for some teaching and short consulting gigs to share my knowledge. Terms and conditions will apply and follow here.

In the meantime, I have made my CV available: About Me.

Categories: Musing, Public Speaking, SQL Server Tags:

The Hollywood Business Model is Broken

2012-05-08 6 comments

Last week, I watched “The Avengers”. Very nice movie, recommended. Why then, do I feel the need to rant today and tell the movie industry where to stick their business model?

imageLet me give you some background: I like movie theatres, a lot. One of my favourite, the venue last week, is the Electric Cinema in Noting hill. This place represents everything that is great about theatres: It has plush red chairs, footstools, Victorian interior and even a café at the back where you can buy a well made espresso. The sound is great and the screen has high quality, digital projection. The people who go there are not annoying either. The cost is just high enough to discourage people from bringing their youngest children. In other words: It represents everything good about watching movies for the childless couple – my girlfriend and I. Building something like this at home would be, to put it mildly, cost restrictive. It’s not just the quality of the room, it’s the atmosphere.

As I emerged into the daylight, a thought occurred to me: “Wouldn’t it be nice to watch Iron Man 1 and 2 now? I liked those movies too”. My girlfriend, like most people with two X-chromosomes, finds Robert Downey Jr. very charming – it would be an easy sell. I know we are a typical couple in this case – because those movies have suddenly become very popular on iTunes.

Why then, is it that I cant watch Iron Man in the movie theatres? There is really no technical barrier that prevents anyone from uploading digital movies to theatres “on demand”. Yes, the movie is old – but movies don’t go out of fashion. Heck, my father watches the same black and white movies – over and over again.

Imagine a future where I can put down a deposit of money, vote for movies to be shown in my local theatre. I would also provide some dates that I would like to watch. If enough people “vote” for a movie with their deposits, it is shown in a proper timeslot. Of course, it takes off some of the spontaneous "impulse watching” of movies. But I would not expect that everyone who votes will have time to show up. I could buy tickets of someone else or at a discount price – stock market style. This could even be done as Facebook app to drive herd behavior: “Thomas wants to watch Iron Man, hey, I will vote too”.

All of this is of course terribly complicated to implement for someone who lacks planning and IT skills. Instead, I am forced to live by the arbitrary (for the consumer), geographically controlled release dates of Hollywood movies. It seems to me that this is yet another variant of the dinosaur mentality we see in the music industry and it is carrying over to other media companies. It’s a desperate attempt to control a market, because you are too daft, too cocaine snorting or groupie distracted to come up with a better business model. Now, wouldn’t it be nice to have a consumer controlled movie theatre? How much would you pay per month to subscribe if you had influence?

Categories: Musing Tags: ,

Reading Material: Abstractions, Virtualisation and Cloud

2012-05-01 9 comments

Two SocketsWhen speaking at conferences, I often get asked questions about virtualization and how fast databases will run on it (and even if they are “supported” on virtualised systems).  This is complex question to answer. Because it requires a very deep understanding of CPU caches, memory and I/O systems to fully describe the tradeoffs.

Let us first look at the political reasons for virtualising: Operation teams, for very good reasons, often try to push developers towards virtualised systems – cloud is just the latest in this ongoing trend. They try to provide an abstraction between application code and the nasty, physical logistics of data centers – making their job easier. The methods of the operation teams employ take many forms: VLAN, SAN, Private clouds and VMWare/HyperV to quote a few examples. Virtualising will increase their flexibility – and drive down their cost per machine, which looks great in the balance sheet. However, this is flexibility comes at a very high cost. It has been said that:

 

All non-trivial abstractions, to some degree, are leaky

Joel Spolsky

In the case of virtualisation, the abstraction provided is very non-trivial indeed and the leaking is sometimes equally extreme. Traditionally, the issue with virtualisation has been slowdown of I/O or network – though this has gotten a lot better with hardware support for virtual hosts (though SAN still haunts us). Over provisioned memory is another good example of virtualisation wrecking havoc with performance. All of these seems to be surmountable though and this is driving cloud forward.

However, lately it is becoming increasingly clear that scheduling, NUMA and L2/L3 cache misses are potentially an even larger problem and one that will surface once you take I/O out of the bottleneck club.

As we grow our data centers to cloud massive scale and pay for compute power by the hour, every machine counts and will figure in the balance sheet. It should also be clear that a important optimisation will be to focus on the performance on individual scale nodes – to make the best use of the expensive power.

This morning, I ran into some fascinating research in this area (Barret Rhoden, Kevin Klues, David Zhu, Eric Brewer) who take this to another level:

Improving Per-Node Efficiency in the Datacenter with New OS Abstractions” (pdf)

To whet your appetite, here is a quote from the abstract (my highlight).

“We believe datacenters can benefit from more focus on per-node efficiency, performance, and predictability, versus the more common focus so far on scalability to a large number of nodes. Improving per-node efficiency decreases costs and fault recovery because fewer nodes are required for the same amount of work. We believe that the use of complex, general-purpose operating systems is a key contributing factor to these inefficiencies.”

A highly recommend read and a good primer on some of the things that concern me a lot these days.

Kejser’s Law

I think it is time for me to state my own law (or trivial insight if you will) of computing. Though I stand here at the shoulders of giants, I will steal a bit of the fame. I think it is appropriate that I state one of the things I aim to show people at conferences:

 

“Any shared resource in a non-trivial scale workload, will eventually bottleneck”

Why You Need to Stop Worrying about UPDATE Statements

2012-04-27 3 comments

There seems to be a myth perpetuated out there in the database community that UPDATE statements are somehow “bad” and should be avoided in data warehouses.

Let us have a look at the facts for a moment and weigh up if this myth has any merit.

Transaction Logged Data

In traditional, relational, ACID property, rows stored in pages, relational databases we typically distinguish between two types of DML operations from a transaction logging perspective.

  • Row Logged
  • Allocation Logged

Row logged operations will write a transaction log entry every time a row/tuple is modified. This means that the amount of transaction log traffic generated is proportional to the number of rows touched.

Allocation Logged (called: “Minimal logging” in SQL Server) operations only write the physical allocations to the transaction log, if at all. This means the log traffic (if any) is proportional to the size of the data touched. This typically generates at least an order of magnitude fewer log entries than row logged, and is thus faster… Or is it? Read on…

Typically, ACID databases only allow bulk style loads, index builds and large table/partition truncations and drops to be allocation logged. UPDATE and DELETE statements tend to be fully row logged.

Myth: This difference in allocation structures makes UPDATE “bad” because the transaction log is the bottleneck.

Reality: The picture a quite a bit more nuanced than that.

First of all, INSERT in bulk mode is not always allocation logged. Typically a lot of conditions have to be true for allocation logging to work. For those of you interested in SQL Server, I have written about this extensively here: The Data Loading Performance Guide.

Second, the transaction log bottleneck ”wall” is widely exaggerated. I have personally driven 750MB/sec write log traffic into a single database in SQL Server using a FusionIO card. I have seen colleagues do 120MB/sec with traditional, 15K spindles. True: I have also driven 3GB/sec (around 10TB in an hour) using allocation logged INSERT, which is faster than its row logged sibling. However, you have to ask yourself the question: Do you need to go that fast in a single database? In an MPP system you will also have several log files that work together to provide linear scale of log traffic with no relevant roof.

Third, it is perfectly conceivable that you are running a warehouse database that does NOT need to serialize DML operations or may even write entire blocks directly to the database instead of the transaction log. Such systems simply do not have the above bottleneck. For example, Neteeza uses a “logless” implementation and so does many noSQL database systems and HADOOP/HIVE. MySQL with the MyISAM engine also allows non logged operations.

 

INSERT/DELETE vs. in-place UPDATE

In an traditional, relational database an update can be implemented either as a transactionally wrapped INSERT of the new data followed by a DELETE of the old data (or the other way around).

Myth: Because UPDATE is an INSERT and a DELETE this can double the amount of data that needs to be written during an UPDATE operation and will make me hit the transaction log wall even faster.

Reality: If you are not modifying the keys in an index or making the column size wider, UPDATE statements can be executed as in-place modifications of the row. This allows the database to only write a special old/new value into the transaction log. This optimization can even be applied on a column by column basis, further reducing the transaction log footprint. This leads us to:

 

UPDATE vs. INSERT speed

This myth is an amalgam of the above arguments.

Myth: INSERT of row logged data is faster than UPDATE of row logged data

Reality: Let us first settle one thing which I will type on its own line and in red to make it easy to remember:

An in-place, row logged, UPDATE operation on a non compressed page is faster than doing a row logged INSERT of the same data.

Why is this? Because the INSERT operation has to allocate new physical structures in the index, while the UPDATE can simply reuse database pages without having to allocate more space.

And here are the numbers to prove it where I am running INSERT vs. UPDATE of large dataset in SQL Server (smaller is faster):

INSERT vs UPDATE

True: if your UPDATE statement has to do the INSERT/DELETE trick, it will likely be slower. But if you are NOT changing the row size and you get the in-place UPDATE, it might just be FASTER to run an UPDATE than an insert.

Also true: compression can change the game quite significantly depending on the compression algorithm you use for the table structure. This is again implementation dependent.

Summary

The myth about UPDATE statements being bad or slow is too simple a way to look at this crucial DML operation. In fact, the myth is outright false in some cases.

Avoiding UPDATE statements should not generally be a major driver for design guidance, or used as the basis for drawing any conclusions about the data modeling techniques you  should apply. There picture is quite a bit more nuanced than this.

First of all, the speed of UPDATE statement as compared to bulk inserts will depend on the database engine you run on.

Second, we have seen that even when UPDATE is fully row logged, the transaction log “wall” is very far away on proper hardware and not much of a concern to 99% of all the installations out there. There are of course cases where you will hit the “wall”, but those are largely mitigated in MPP systems or other sharded deployments that have more than one transaction log.

Third, there are cases where UPDATE statements are actually faster than (row logged) INSERT statements. These typically occur when you change columns in such a way that they don’t grow larger than they already are, allowing in-place UPDATE operations. An interesting and highly relevant example of such a case is UPDATE statements that target fact table keys – which (if you follow my guidance) are integers and therefore have constant width.

 

Bonus Exercise and chance to win (for SQL Server people): Here is an interesting experiment. In theory, it might be possible to create a workload where you UPDATE a very wide table and where the equivalent, minimally logged, super optimized “copy to new table” BULK INSERT or SELECT INTO statement is actually slower than the UPDATE. Where is the crossover point on table width? I will offer a free, 1 hour teaching session. Database subject of your choice, you host me, I bring the coffee in the London City area to the first person who can provide a test script and the data to show this crossover point. Alternatively the price can also be claimed if you can conclusively prove that the crossover point does not exist.

Conference Updates

2012-04-25 Leave a comment

A few weeks ago, I presented at SQL Bits X. The slides from my co-presentation with Conor are available here:

Last week, I spoke at Microsoft Open World 2012 doing quite a lot of presentations. I have created a new and extended “Grade of the Steel” deck that I hope to blog more about my findings (these things tend to start as PowerPoint slides and evolve into blog entries).

If you want a copy of any of my presentations, please just mail me. Speaking of presentations, I have found this really cool tool that will compress slides: NXPowerLite. This neat tool has taken off around 60-80% of my slide sizes.

Lately, I have been practicing zen style presentations: max four sentences on each slide and heavy use of visualisations instead. I highly recommend putting yourself under such constraints. After I removed the text from my slides I feel much better improvising on stage.

Setting Yourself up for Debugging

2012-03-14 4 comments

Lately, I have been doing a fair amount of debugging, and helped other people set up their debuggers.

It can be very painful to make this work. To assist up and coming debuggers, I put my notes together in this blog. Here is how to get things going

Downloading and Installing the Tools

First, you need the full tool package. Fortunately, all you need is now packaged in the Windows SDK to ease the pain:

Do the full install of all packages. They will take several GB. Please make sure you have a large drive (more than 100GB of space), you will need it.

Symbol Setup

This is the tricky part. In order to debug properly, you need symbols. Getting those set up requires a few tricks.

First, create directories to hold symbols and symbol caches. It is generally a good idea to stick to the naming convention that Microsoft uses in their examples:

  • Create C:\Symbols
  • Create C:\SymCache
    The next step is to configure symbol environment variables. While you CAN debug without them, setting them up is really worth the effort. First, locate you environment variables in My Computer –> Properties –> Advanced System Settings –> Advanced (why is all the good stuff always located in “Advanced”?). Here you will find:
    SettingUpDebug
    You need to set up two environment variables. I prefer to set them as system variables, since I am the only one using my machine.

Assuming you use the directories above, set your symbol paths like this:

  • _NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
  • _NT_SYMCACHE_PATH=C:\SymCache
    Note that the paths are separated with stars, not semi-colon.

Main tools overview

There are some key command line tools that you will need to get familiar with:

  • SqlDumper.exe – Used to generate dumps and minidumps. There are other tools that do the same, but I prefer this one. The usage is described here: KB 917825
  • WinDbg.exe – A lightweight debugger for analysing dumps. Leaves a lot to be desired in usability, but it is really fast once you memorise the arcane commands required.
  • SymChk – Used to check if symbols can be resolved. Gives verbose information with the /v command line, which is useful to validate symbol paths. See: SymChk Command-Line Options.
  • SymStore.exe – Allows you to create your own symbol stores (under C:\symbols for example). This is useful if you are debugging your own code or code which has symbols that dont originate from Microsoft. See: SymStore Command-Line Options.
  • xperf.exe – The mother of all code profilers. This tool comes with an enormous amount of command line parameters. It would require an entire blog entry just to list my favourites (Hint: I may do one). Documentation is hard to come by, but here is a good intro: All Your Base are Belong to Us – Xperf.
  • xperfview.exe – Used to view the traces generated by xperf

WinDbg Cheat Sheet

There are some commands that I use frequently in WinDbg. I would recommend you minidump a process you know while it is working in a test system (sqlservr.exe for example) and try them out. It is really quite amazing what you can see.

Command Description
!uniqstack Shows all unique stacks. This is generally the first command I run to get an overview of the dump
Show threads in the dump
∼<thread>s Switch to <thread>. Useful for digging into the thread stack and data structures
k Shows the thread stack. You can use ∼*kn to show all thread stacks
d<x> There are different version of this command. But all are used to dig into data types that reside in a memory address or address range.
.frame <x> Look at a stack frame
.reload Retries symbol loading. Useful if you locate a missing symbol and want to see if it matches the dump
.cls Clears the screen (which tends to get crowded)
q Quit WinDbg
Follow

Get every new post delivered to your Inbox.

Join 350 other followers