Hardware Post Logo

The Effect of CPU Caches and Memory Access Patterns

In this blog, I will provide you with some “Grade of the Steel” background information that will help you understand CPU caches and their effects better.

As we move towards a future where power consumption in the server room begins to make a big dent in the balance sheet – it becomes important for programmers to fully understand hardware and make the best possible use of it, instead of floating around in the leaky abstractions of the virtual systems. Even when we take the old enemy of I/O out of the equation, there are other bottlenecks we need to worry about: DRAM access time being one of the most important.

A modern machine running a “typical” workloads out there will often spend a very large percent of its CPU cycles on fetching memory address. This happens because the access time of DRAM is a lot slower than the clock speed of the CPU, and because most code is not written with this access time in mind. This high latency pushing back and forth of data between the memory bus and the CPU is known as the  Van Neumann Bottleneck.

To work around this problem and attempt to shield programmers from the complexity, chip designers add caches to the CPU. A typical modern CPU will have at least two layers of cache on the chip itself, known as the L1 and L2. Sometimes, more layers of cache (L3… Ln) are added too – especially on multi core systems. These caches look at lot like buffer pools in relational databases and serve many of the same purposes. The analogy is not perfect, because this is chip design and nanoseconds matter. For example, cache replacement policies on CPU caches typically don’t follow the same rules as relational database buffer pools.

To take a concrete example, here is what my laptop with an i7-2620M CPU schematically looks like:

image

 

A few things to notice here: First of all, caches fall into three categories: instruction-, data- and unified caches (with unified holding either type of memory). Second, you can also see that caches may be shared between multiple cores, in this case the L3 cache is shared between two cores.

As mentioned, these caches are there to make the cost of memory access appear cheaper. But just how expensive is it to miss the cache? Lets have a look.

Test Dataset and run configuration

To set up a good test, I am afraid I have to bother you with a bit of C++ code, since we need full control of the machine to get accurate measurements.

The tests I will run will also use this data structure:

const int payloadSize = 64; /* A page is a cache line */

const int bufferSize = 500000; /* Number of data pages in the test */

struct dataPage
{
char payload[payloadSize];
};

dataPage* buffer = new dataPage[bufferSize];

 

This data structure takes up exactly 32MB of memory (with 1M = 10**6). You can think of it as a super simplified buffer pool. Each page in the pool is 64B, because this is the size of a memory cache line on my CPU. This means that every time I fetch one page form the buffer, it will cause one memory request. If the page size was smaller than the memory line size, more than one page could fit in a single memory request, which causes a completely different set of problems than the ones I try to illustrate here (might be subject of a future blog).

Side note: Why am I not allocating the buffer variable like this, without needing the pointer and new operator: database buffer[bufferSize]? Because this would cause  the array to be allocated on the stack and cause a stack overflow. The allocation has to come from the heap and hence we need a pointer instead.

I will fill up the buffer like this:

for(int i = 0; i<bufferSize; ++i)
{
/* Fill up payload with random data */
for (int j = 0;j<payloadSize -1 ;j++)
{
buffer[i].payload[j] = ‘A’ + abs(rand() % 30);
}
buffer[i].payload[59] = NULL; /* terminate char (easy printing) */
}

The tests were run on my Mac mini: an Intel Core 2 Duo processer with 3MB cache and a clock speed of 2.53GHz. The machine has plenty of memory to hold the buffer in memory and not on disk.

Test 1 – Running through Arrays of different sizes

The first test I will run is to loop through each element of the array and read the value. Intuitively, in a perfect machine world, this should be linearly proportional to the number of element in the array since each memory access is constant time? Or little-o(n) in computer science language.

Its always good to validate, let us try to run this simple test:

int traverseSimple(const dataPage* buffer, const int pageLookups)
{
register int runningSum = 0;
for(register int i = 0; i<pageLookups; ++i)
{
runningSum += buffer[i].payload[0];
}
return runningSum;
}

Notice that I am forcing the variables into CPU registers, just to be safe and avoid any memory fetches from those modifications.

I then wrote a simple harness that gradually ramps up the number of pages (as per the pageLookups parameter) each run. For each value of pageLookups, I traverse the appropriate subset of buffer 5 times and then take the average runtime. The harness looks like this:

void runTestSimple(const dataPage* buffer, const int iterations
, const int pageLookups, int& runningSum, double& runTime)
{

tick_count begin = tick_count::now();
for (register int i = 0;i< iterations;i++)
{
runningSum = traverseSimple(buffer, pageLookups);
}
tick_count end = tick_count::now();

runTime = (end-begin).seconds() / iterations;
}

 

Because only a limited number of the pages in buffer are requested for small values of pageLookups, only a subset of the full buffer structure is touched by each test cycle.

Side note: The tick_count class used to time runs comes from the Intel Thread Building Block (TBB) library and gives precision into the CPU tick range. It is a very nice parallelism library that I will be blogging more about later.

Lets have a look at the data:

image

 

That is some drop-off at around 3 MB isn’t it? This just happens to be the cache size of my machine. This is nearly an order of magnitude difference in speed! Also, notice that there are some outliers which are most likely caused by scheduling of the cores.

While the test is running, the CPU is at 100% – it does not LOOK like we are waiting for anything and from the perspective of the operating system, we aren’t (and no, this is no different in Windows than OSX).

The test runs allows us to come up with a rough estimate of how expensive it is for the CPU to get memory from DRAM. If we look at the rightmost part of the curve, we are doing around 82M fetches/sec when the cache is not helping. If we estimate the CPU instructions to run the test loop as (roughly)

  • Add one to the register of the loop (1 instruction)
  • Memory fetch the next array element (unknown instructions)
  • Read the raw value of the payload (1 instruction)
  • Compare if we are at end of loop, jump if we are not (1-2 instructions)

At 82M loops/sec we are burning 82M * 3 instructions on the loop itself, around 246M instructions/sec. The rest of the instructions from the total 2530M/sec as per my clock speed have to be burned on fetching memory. We can now guesstimate the price of a memory fetch on my CPU:

image

 

The actual, “useful” CPU work is less than 10% of total CPU cycles burned.. Interesting.

Test 2 – Random Memory Access

Wait a minute you may say. Isn’t sequential and random access something we only have to worry about with I/O systems? Perhaps, but how about putting it to the test?

Let me modify my test slightly. Instead of looping through parts of the buffer in sequence, let us supply the test with a walk argument:

int traverseBuffer(const dataPage* buffer, const int pageLookups
, const int* walk)
{
register int runningSum = 0;
for(register int i = 0; i<pageLookups; ++i)
{
runningSum += buffer[walk[i]].payload[0];

}
return runningSum;
}

The walk parameter is a simply array with exactly the same number of elements as buffer, but populated with a random ordering of the offsets in buffer. The code to do this looks like this:

int* randomWalk = new int[bufferSize];
int* sequentialWalk = new int[bufferSize];
int* singleWalk = new int[bufferSize];

for(int i = 0; i<bufferSize; ++i)
{
randomWalk[i] = i;
sequentialWalk[i] = i;
singleWalk[i] = 0;
}

/* Randomize array  */
for(int i = 0; i<bufferSize; i++)
{
swapArray(randomWalk, abs(rand() % bufferSize)
, abs(rand() % bufferSize));
}

By passing the randomWalk array into traverseBuffer we can visit elements in a random order instead of sequentially.

But we have to be careful now. The new test is not directly comparable to test 1. Because the new traverseBuffer needs another CPU instruction and memory fetch to get the walk array.

To make a true apples with apples compare, I created a sequentialWalk array that can also be passed to the test, simulating the same access pattern as in test 1, but with the new number of CPU instructions matched to the random walk. Furthermore, to illustrate a later point, I create an array that ONLY visits element zero, no matter how many iterations are done over the pageCount. This allows us to explore the code running at a VERY small cache footprint.

Without further ado, let us look at the results:

image

 

Now THAT is interesting isn’t it? Notice how the sequential access of memory is MUCH faster  (About 2x) than the random access. And this happens even when the cache is being missed beyond the 3MB drop-off. Our 28 CPU instructions per memory fetch was overly optimistic. Beyond the 3MB barrier, when randomly traversing memory, we can only sustain 40M fetches/second. This means that a random memory access will cost around 50 CPU instructions. And this isn’t even a NUMA system!

Another thing to notice is the green line data points above. This is the “perfect machine” numbers that a naïve approach to hardware would assume. When the accessed memory becomes small enough to fit into the lower level caches (32KB and 256K in my case), it seems that we can get full benefit of the CPU instructions and we only need around 2-3 CPU instruction for every loop iteration, including the memory fetch.

Why is sequential memory fetch faster than random?

Why is it so beneficial to access memory sequentially? This is mainly caused by an interesting optimizations that the CPU does. If the CPU heuristics detects that memory is being fetched in a linear manner, it will try to prefetch memory to anticipate the CPU requiring it. Because memory is being requested before it is needed, it will already be in the cache when the CPU requests the memory line. You get a pipeline effect that is only limited by the bandwidth of the memory bus.

However, to make use of this optimization, your code will have to understand how a modern CPU works and you must traverse data structures in the right way.

Summary

In this blog I have scratched the surface of what modern hardware architectures look like. We have had a chance to peek inside a CPU and been able to quantify the effect of CPU caches. I hope I have given you an idea of the types of performance gain you can get, if you code in a way that is friendly to the hardware, not just to some abstract/virtualized thinking of the world.

There are much deeper things to explore, for example an illustration on how you diagnose these issues in a real, non synthetic workload. Perhaps this will be the subject of a future blog or a course I may run.

  16Comments

  1. Pingback: How do Column Stores Work? | T-SQL

    • Thomas Kejser   •  

      Hi GG

      Yep, there is much more to it than this – I used a quite simplified example to get the point accros that “memory isnt just memory”. For example, TLB misses can REALLY kill you as you also write (My test doesnt even measure that effect).

      Thanks for the link. On the notes of memory access, I have found Ulrich Drepper’s work to be a really great description of the process too:

      http://lwn.net/Articles/250967/

  2. Joe Chang   •  

    ok, sorry about the side-track, now back to this blog. Your i7-2620 is listed at 2.7GHz with turbo-boost to 3.4GHz. So lets assume a CPU cycle time of 0.33ns that is, 3 clocks per ns.
    Techreport http://techreport.com/articles.x/19196/4 says Westmere memory latency is just under 60ns, so memory access should be approx 170 cycles, versus your test of 50 cycles per iteration. I am not sure I understand your code exactly but I would guess that the random numbers could be calculated ahead of time, effectively pipelining the sequence. A force test of random access would have one array store the next array index, so iteration n+1 cannot start until iteration n is complete?

    • Thomas Kejser   •  

      Thanks for the additional details Joe. I am indeed calculating the random sequence ahead of time (I have one array with the “walk” and another that I am accessing). Exactly as you describe.

      If I understand you right, you are basically saying that I should measure an even greater difference than I am here? Would the CPU be smart enough to prefetch the next random number from the array and prefetch the memory for it while the code is looking up the previous?

      • Joe Chang   •  

        I am sitting on a plane at SFO, big storms on east coast delaying everything, people are complaining. In the old days, people died while traveling. Right, I am expecting 170+ cycles per random access. Actually it does not even have to be random. So long as index position x contains the address of the next access, that access cannot start until the original is complete. This should be very similar to the pointer chasing code of row-store databases.

        • Thomas Kejser   •  

          Hi Joe

          It seems my test case allows branch prediction to prefetch, making random look better than it is. I will rewrite in a way that this cannot be done. Thanks for your insights.

          With regards to improved intra-page lookups, you might find this article enlightening:

          http://www.vldb.org/conf/2001/P169.pdf

  3. Joe Chang   •  

    true, I only discussed access inside the page. Let me look at the SQL Server engine code for B-tree navigation, and I will work out what the vector instructions for B-tree should be. I do think a relayout of the database page would improve prefetch performance, ie, not using the reverse order of the offset. Which reply to Dan?

    • Thomas Kejser   •  

      Hi Joe

      I remember our discussion about this idea. I agree that CPU level instruction This would make existing page layouts scale better – but only INSIDE the page – not during the B-tree pointer chase. If you change the way the page is laid out – you can do even better, with our requiring a new CPU instruction. Check the links in my reply to Dan for an interested idea where they change the 8K page layout to benefit from prefetches.

  4. Pingback: How Vertical Partitioning and Deep Joins Kill Parallelism « Thomas Kejser's Database Blog

  5. Pingback: How do Column Stores Work? « Thomas Kejser's Database Blog

  6. Pingback: Link Resource # 59 : Jun 06 – Jun 22 « Dactylonomy of Web Resource

  7. Kevin Boles   •  

    Keeping the CPU chock full of data to chew on – one of the fundamental tenants of the Batch Mode Processing the very smart people on the Microsoft SQL Server team put together to greatly enhance the performance of the Column Store Index!!

    • Thomas Kejser   •  

      That is indeed on good example of it Kevin. And as you can see, even with aggressive prefetch, you can still end up being bound by the memory speed.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">