The Effect of CPU Caches and Memory Access Patterns
In this blog, I will provide you with some “Grade of the Steel” background information that will help you understand CPU caches and their effects better.
As we move towards a future where power consumption in the server room makes a big dent in the balance sheet, it becomes important for programmers to fully understand the hardware and make the best possible use of it, instead of floating around in the leaky abstractions of virtual systems. Even when we take the old enemy, I/O, out of the equation, there are other bottlenecks we need to worry about: DRAM access time being one of the most important.
A modern machine running a “typical” workload will often spend a very large percentage of its CPU cycles fetching memory addresses. This happens because the access time of DRAM is a lot slower than the clock speed of the CPU, and because most code is not written with this access time in mind. This high-latency pushing of data back and forth between the memory bus and the CPU is known as the von Neumann bottleneck.
To work around this problem and attempt to shield programmers from the complexity, chip designers add caches to the CPU. A typical modern CPU will have at least two layers of cache on the chip itself, known as the L1 and L2. Sometimes, more layers of cache (L3… Ln) are added too – especially on multi-core systems. These caches look a lot like buffer pools in relational databases and serve many of the same purposes. The analogy is not perfect, because this is chip design and nanoseconds matter. For example, cache replacement policies on CPU caches typically don’t follow the same rules as relational database buffer pools.
To take a concrete example, here is what my laptop with an i7-2620M CPU schematically looks like:
A few things to notice here: First of all, caches fall into three categories: instruction-, data- and unified caches (with unified holding either type of memory). Second, you can also see that caches may be shared between multiple cores, in this case the L3 cache is shared between two cores.
As mentioned, these caches are there to make memory access appear cheaper. But just how expensive is it to miss the cache? Let’s have a look.
Test Dataset and run configuration
To set up a good test, I am afraid I have to bother you with a bit of C++ code, since we need full control of the machine to get accurate measurements.
The tests I will run will also use this data structure:
const int payloadSize = 64; /* A page is one cache line */
const int bufferSize = 500000; /* Number of data pages in the test */

struct dataPage { char payload[payloadSize]; };

dataPage* buffer = new dataPage[bufferSize];
This data structure takes up exactly 32MB of memory (using 1MB = 10^6 bytes). You can think of it as a super simplified buffer pool. Each page in the pool is 64B, because this is the size of a memory cache line on my CPU. This means that every time I fetch one page from the buffer, it will cause one memory request. If the page size was smaller than the cache line size, more than one page could fit in a single memory request, which causes a completely different set of problems than the ones I try to illustrate here (might be the subject of a future blog).
Side note: Why am I not allocating the buffer variable like this, without needing the pointer and new operator: dataPage buffer[bufferSize]? Because this would cause the array to be allocated on the stack and cause a stack overflow. The allocation has to come from the heap, and hence we need a pointer instead.
I will fill up the buffer like this:
for (int i = 0; i < bufferSize; ++i) {
    /* Fill up payload with random data */
    for (int j = 0; j < payloadSize - 1; j++)
        buffer[i].payload[j] = 'A' + abs(rand() % 30);
    buffer[i].payload[payloadSize - 1] = '\0'; /* terminating char (easy printing) */
}
The tests were run on my Mac mini: an Intel Core 2 Duo processor with 3MB cache and a clock speed of 2.53GHz. The machine has plenty of memory to hold the buffer in memory and not on disk.
Test 1 – Running through Arrays of different sizes
The first test I will run is to loop through each element of the array and read the value. Intuitively, in a perfect machine world, the runtime should be linearly proportional to the number of elements in the array, since each memory access takes constant time. Or O(n) in computer science language.
It’s always good to validate, so let us try to run this simple test:
int traverseSimple(const dataPage* buffer, const int pageLookups)
{
    register int runningSum = 0;
    for (register int i = 0; i < pageLookups; ++i)
        runningSum += buffer[i].payload[0];
    return runningSum;
}
Notice that I am asking the compiler to keep the variables in CPU registers (the register keyword is only a hint), just to be safe and avoid any memory fetches for the loop bookkeeping.
I then wrote a simple harness that gradually ramps up the number of pages (as per the pageLookups parameter) each run. For each value of pageLookups, I traverse the appropriate subset of buffer 5 times and then take the average runtime. The harness looks like this:
void runTestSimple(const dataPage* buffer, const int iterations
                 , const int pageLookups, int& runningSum, double& runTime)
{
    tick_count begin = tick_count::now();
    for (register int i = 0; i < iterations; i++)
        runningSum = traverseSimple(buffer, pageLookups);
    tick_count end = tick_count::now();
    runTime = (end - begin).seconds() / iterations;
}
Because only a limited number of the pages in buffer are requested for small values of pageLookups, only a subset of the full buffer structure is touched by each test cycle.
Side note: The tick_count class used to time runs comes from the Intel Threading Building Blocks (TBB) library and gives precision into the CPU tick range. It is a very nice parallelism library that I will be blogging more about later.
Let’s have a look at the data:
That is some drop-off at around 3MB, isn’t it? This just happens to be the cache size of my machine. It is nearly an order of magnitude difference in speed! Also notice that there are some outliers, most likely caused by the scheduler moving the test between cores.
While the test is running, the CPU is at 100% – it does not LOOK like we are waiting for anything, and from the perspective of the operating system, we aren’t (and no, this is no different on Windows than on OS X).
The test runs allow us to come up with a rough estimate of how expensive it is for the CPU to get memory from DRAM. If we look at the rightmost part of the curve, we are doing around 82M fetches/sec when the cache is not helping. We can estimate the CPU instructions needed to run each iteration of the test loop as (roughly):
- Add one to the register of the loop (1 instruction)
- Memory fetch the next array element (unknown instructions)
- Read the raw value of the payload (1 instruction)
- Compare if we are at end of loop, jump if we are not (1-2 instructions)
At 82M loops/sec we are burning 82M * 3 instructions on the loop itself, around 246M instructions/sec. The rest of the 2530M cycles/sec available at my clock speed have to be burned on fetching memory. We can now guesstimate the price of a memory fetch on my CPU:
The actual, “useful” CPU work is less than 10% of the total CPU cycles burned. Interesting.
Test 2 – Random Memory Access
Wait a minute you may say. Isn’t sequential and random access something we only have to worry about with I/O systems? Perhaps, but how about putting it to the test?
Let me modify my test slightly. Instead of looping through parts of the buffer in sequence, let us supply the test with a walk argument:
int traverseBuffer(const dataPage* buffer, const int pageLookups
                 , const int* walk)
{
    register int runningSum = 0;
    for (register int i = 0; i < pageLookups; ++i)
        runningSum += buffer[walk[i]].payload[0];
    return runningSum;
}
The walk parameter is simply an array with exactly the same number of elements as buffer, but populated with a random ordering of the offsets into buffer. The code to build it looks like this:
int* randomWalk = new int[bufferSize];
int* sequentialWalk = new int[bufferSize];
int* singleWalk = new int[bufferSize];

for (int i = 0; i < bufferSize; ++i) {
    randomWalk[i] = i;
    sequentialWalk[i] = i;
    singleWalk[i] = 0;
}

/* Randomize array (swapArray exchanges two elements of its first argument) */
for (int i = 0; i < bufferSize; i++)
    swapArray(randomWalk, abs(rand() % bufferSize)
            , abs(rand() % bufferSize));
By passing the randomWalk array into traverseBuffer we can visit elements in a random order instead of sequentially.
But we have to be careful now: the new test is not directly comparable to test 1, because traverseBuffer needs an extra CPU instruction and an extra memory fetch to read the walk array.
To make a true apples-to-apples comparison, I created a sequentialWalk array that can also be passed to the test, simulating the same access pattern as in test 1, but with the number of CPU instructions matched to the random walk. Furthermore, to illustrate a later point, I created a singleWalk array that ONLY visits element zero, no matter how many pages are looked up. This allows us to explore the code running with a VERY small cache footprint.
Without further ado, let us look at the results:
Now THAT is interesting, isn’t it? Notice how sequential access of memory is MUCH faster (about 2x) than random access. And this holds even when the cache is being missed beyond the 3MB drop-off. Our 28 CPU instructions per memory fetch was overly optimistic: beyond the 3MB barrier, when randomly traversing memory, we can only sustain 40M fetches/second. This means that a random memory access costs around 50 CPU instructions. And this isn’t even a NUMA system!
Another thing to notice is the green line data points above. These are the “perfect machine” numbers that a naïve view of the hardware would assume. When the accessed memory becomes small enough to fit into the lower-level caches (32KB and 256KB in my case), we get the full benefit of the CPU, and only need around 2-3 CPU instructions for every loop iteration, including the memory fetch.
Why is sequential memory fetch faster than random?
Why is it so beneficial to access memory sequentially? This is mainly caused by an interesting optimization the CPU performs: if the CPU’s heuristics detect that memory is being fetched in a linear pattern, it will prefetch memory in anticipation of the CPU requiring it. Because memory is requested before it is needed, it will already be in the cache when the CPU asks for that memory line. You get a pipeline effect that is limited only by the bandwidth of the memory bus.
However, to make use of this optimization, your code will have to understand how a modern CPU works and you must traverse data structures in the right way.
In this blog I have scratched the surface of what modern hardware architectures look like. We have had a chance to peek inside a CPU and been able to quantify the effect of CPU caches. I hope I have given you an idea of the types of performance gain you can get, if you code in a way that is friendly to the hardware, not just to some abstract/virtualized thinking of the world.
There are much deeper things to explore, for example an illustration of how you diagnose these issues in a real, non-synthetic workload. Perhaps this will be the subject of a future blog or a course I may run.