Synchronized

Synchronisation in .NET– Part 3:
Spin Locks and Interlocks/Atomics

In the previous instalments (Part 1 and Part 2) of this series, we have drawn some conclusions about both .NET itself and CPU architectures. Here is what we know so far:

  • When there is contention on a single cache line, the lock() method scales very poorly and you get negative scale the moment you leave a single CPU core.
  • The scale takes a further dip once you leave a single CPU socket
  • Even when we remove the lock() and do thread unsafe operations, scalability is still poor
  • Going from a class to a padded struct gives a scale boost, though not enough to get linear scale
  • The maximum theoretical scale we can get with the current technique is around 90K operations/ms.

In this blog entry, I will explore other synchronisation primitives to make the implementation safe again, namely the spinlock and Interlocks. As a reminder, we are still running the test on a 4 socket machine with 8 cores on each socket with hyper threading enabled (for a total of 16 logical cores on each socket).

Spin Lock Background

Spinlocks are an alternative to blocking synchronisation. The premise of a spinlock is that it is sometimes cheaper to go into a busy waiting loop and poll a resources instead of blocking when the resources is locked. Context switches are quite expensive, so it is often wise not to yield a scheduler if you expect that the held resource will be available soon.

It is difficult to determine when a spinlock is better than a blocking lock. The MSDN documentation is wisely vague:

“A spin lock can be useful in to avoid blocking; however, if you expect a significant amount of blocking, you should probably not use spin locks due to excessive spinning. Spinning can be beneficial when locks are fine-grained and large in number (for example, a lock per node in a linked list) and also when lock hold-times are always extremely short.”

My highlight above. While our data structure has a significant amount of contention, it also has a very shortly held lock. We can hypothesis that using a SpinLock instead of a lock()/Monitor might yield more throughput at high concurrency.

 

.NET Spinlock (Queued Spinlock) Implementation

Spinlocks can be implemented in several ways. There is a significant amount of controversy about when to use which. Notably, the Linux kernel uses what is known as a Ticket Spinlock, which works really well under low concurrency. It is noteworthy that Google is now lobbying to have this changed to something more like the Windows (introduced in XP/2003) implementation instead. There is now significant evidence that the ticket lock is not as scalable as one could wish for: Non-scalable locks are dangerous (PDF). Let us see if .NET does better.

The spin lock implementation included with the .NET framework is a queued spin lock, similar to the one used in the Windows Kernel. This particular spinlock, when implemented correctly, guarantees fairness and scales well under high concurrency. But how does it scale under our high contention test?

Here is the implementation of  OperationData.

We use our knowledge from the last blog about padding to make sure cache lines don’t overlap. Ideally, the SpinLock itself should take up a full cache line (so two spinlocks don’t experience false sharing when spinning). However, when i take size of on the SpinLock struct, it looks like it takes up only 4B.

Here is the implementation of Add():

 

Benchmark: .NET SpinLock

And here are the results of the standard .NET spinlock

image

This might be quite a surprising result to you based on the .NET description of the spinlock. Even though the lock is held very shortly, using a spinlock is still slower than the much more heavyweight lock(). However, as the lock becomes increasingly contended, the spinlock achieve that same scale as lock().

Even at low and no contention, the spinlock is slower than lock() – this indicates that the acquisition of the spinlock is more expensive than lock().

TTAS Spinlock Implementation

As mentioned earlier, there are several ways to implement spinlocks. One very simply variant is called the Test and Test And Set (TTAS) spinlock. The advantage of this spinlock is that it is very cheap to acquire and release – which may address the issue we saw with the .NET SpinLock struct. However, it scales very poorly under high concurrency, perhaps worse than .NET?

To see how this lock would do as concurrency increases (and to compare with the .NET implementation, I build my own TTAS spinlock.

Here is the OperationData struct:

And here is the Add() implementation with the TTAS spin lock:

 

Benchmark: TTAS Spin Lock

Let’s see how my custom TTAS spinlock does against .NET’s own implementation:

image

As can be seen, the TTAS implementation is actually faster than lock() at low concurrency. However, this graph is getting a little crowded. Allow me to zoom in on just the .NET SpinLock and TTAS Spin lock:

image

Now THAT is interesting isn’t it? While the TTAS is faster at low contention – it completely collapses under high contention. it is clear that the .NET SpinLock has been designed to be scalable under highly contended scenarios.

Interlock/Atomics Background

Advanced synchronisation structures like lock(), SpinLock, Events and Conditional Variables would be very difficult (if not impossible) to implement without explicit support of atomic operations on the CPU itself.

The x86/x64 instruction set contains instructions that allow the compiler to guarantee that some operations will happen atomically – in other words: They will either happen or not. For example, the following:

x86 instruction Behaviour
MOV Atomically read a word (But can be reordered unless MFENCE’d)
MOV + MFENCE
or XCHG
Atomically write a word sized value
LOCK INCL/ADDL Atomically add one or a value to a word
CMPXCHG Compare a word against another value. If they match, swap the word with another

Programming with atomic operations is extremely difficult and error prone. However, for those brave enough, .NET exposes atomics through the System.Threading.Interlocked static class. For those of you coding C++, wrappers of atomic operations finally becoming part of the standard.

Interlock Implementation

Using atomics, it is possible to implement the class without any synchronisation primitive at all. Here is what that looks like:

First, we make sure the struct is padded to avoid false sharing:

Second, we use atomics to make sure we change the struct correctly:

 

Note that the Interlocked.CompareExchange (CPU instruction CMPXCHG) often requires a loop to guarantee correctness. There are starvation issues with atomics which require careful coding to avoid. This is major area of research.

What we have created here is a true wait free implementation (see: http://en.wikipedia.org/wiki/Non-blocking_algorithm)

Benchmark: Atomics/Interlock

And here is the wait free algorithm in action

image

 

… Faster than the .NET SpinLock at low scale, yet just as scalable as we approach 64 cores. Note quite as fast as the TTAS at low scale though – why? The atomic instructions require a lock of a cache line (so other cores don’t update it). This is not free and costs AT LEAST the order of CPU cycles even when the cache line is uncontested. If you think about the TTAS implementation, only a single cache line lock is needed to acquire the lock – the Interlocked.CompareExchange in the spin lock. In the case of the purely atomic implementation, we need at least four locks: one to update each of the fields in the struct, potentially more than one for the min/max depending on whether we are racing against others.

Conclusions so Far

In this third part of the series, we have seen how even light weight locks don’t scale well beyond a single core when we compete for cache lines. There really is no such thing as a magic when it comes to synchronisation.

We have also seen how synchronisation methods have to carefully balance the cost of acquiring the lock with the scalability of the lock. The .NET SpinLock and TTAS implementation illustrates this balancing act.

It seems we are lost here. It is clear that a single CPU core can do 90K  operations every millisecond if we don’t have any contention – yet even a moderate amount of contention breaks that scale of every implementation we have seen so far. If we were to plug a logging framework like the one we have seen so far into an application server – we run the risk of introducing artificial bottlenecks into the system.

We have to get smarter…

Next part: Partitioned data structures. The part in which we finally implement a scalable version of this data structure using everything we have learned so far.

  2Comments

  1. Pingback: Synchronisation in .NET– Part 4: Partitioned Data Structures | Thomas Kejser's Database Blog

  2. Pingback: Synchronisation in .NET– Part 2: Unsafe Data Structures and Padding | Thomas Kejser's Database Blog

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">