High level languages such as C, C++, C#, FORTRAN, and Java all do a great job of abstracting the hardware away from the programmer. This means that programmers generally don’t have to worry about how the hardware goes about executing their program. However, to get the maximum amount of performance out of your programs, it is necessary to start thinking about how the hardware is actually going to execute them.

This article takes a quick look at how the cache works and how to write your programs so that they take full advantage of cache coherence. Most modern CPUs have at least 1 MB of cache, and some have 8 MB or more. Datasets for many programs are easily in the tens of megabytes, if not in the hundreds or thousands of megabytes. Therefore, we have to worry about how the cache works.

What is cache?

Cache is a small amount of high speed memory which sits very close to the processor. Nowadays, the cache is almost always on the same die as the processor, which allows it to transfer many hundreds of gigabytes of data per second, easily 10 times faster than main system memory. The only catch is that the cache is relatively small, generally only a few megabytes. There is a hierarchy of cache: L1, L2, and sometimes L3. L1 is the smallest, fastest, and closest to the processor, generally only a few kilobytes; 16 kB for L1 cache is typical. L2 is a little slower than L1, but larger, generally between 256 kB and 4 MB. In a single core CPU, the L2 is usually at least a megabyte. In multicore systems such as dual or quad core CPUs, the L3 cache is generally shared among all the cores, and is usually at least 2 MB on modern systems.

Why do these levels of cache exist? If a program needs data, the processor looks in L1 first. If the data isn’t there, it looks in L2. If the data still isn’t there, it looks in L3 (single core processors usually don’t have L3). If the data isn’t in L3, then the CPU is forced to retrieve it from main system memory, which is slow, usually a couple hundred clock cycles. Another thing that is important to understand is that when something is retrieved from main system memory, that data is placed into the cache. Because the cache is so small, another, unrelated piece of data usually gets kicked out of the cache to make room.
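To make those numbers a little more concrete, here is a minimal sketch (my own illustration, not part of the original article) that does pseudo-random accesses into working sets of increasing size. Once the working set no longer fits in L1, and later L2 and L3, the average cost per access goes up. The exact sizes and timings depend entirely on your CPU, so treat the output as a rough picture rather than a precise measurement.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
	const long accesses = 10 * 1000 * 1000;	/* accesses per working set */

	/* power-of-two working sets from 4 kB up to 32 MB */
	for (size_t bytes = 4 * 1024; bytes <= 32u * 1024 * 1024; bytes *= 2)
	{
		size_t count = bytes / sizeof(int);	/* always a power of two */
		int *data = calloc(count, sizeof(int));
		if (data == NULL)
			return 1;

		long total = 0;
		size_t idx = 1;
		clock_t start = clock();
		for (long k = 0; k < accesses; k++)
		{
			/* cheap pseudo-random jump; the mask keeps it inside the working set */
			idx = (idx * 1103515245u + 12345u) & (count - 1);
			total += data[idx];
		}
		clock_t end = clock();

		printf("%6zu kB: %.1f ns per access (checksum %ld)\n",
		       bytes / 1024,
		       1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)accesses,
		       total);
		free(data);
	}
	return 0;
}

Compile it with optimizations turned on (for example gcc -O2) so that loop overhead doesn’t drown out the cost of the memory accesses.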

What is cache coherence?

Now that we’ve gone over the basics of what cache is, we can talk about cache coherence. When a program requests data from a memory location, say 0x1000, and then shortly afterwards requests data from a nearby memory location, say 0x1004, the accesses are coherent. In other words, when a program accesses different pieces of data around the same memory location within a short period of time, the data is coherent. When a piece of data is retrieved from main memory, more often than not, more data is retrieved than was actually requested, typically an entire cache line of around 64 bytes. This is because the processor ‘hopes’ that the program will soon request data from a nearby memory location, which it just so happens to have fetched already behind your back. This behavior is often referred to as prefetching, and we can use it to our advantage. Consider the following two pieces of code written in C.

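/* Version 1: the inner loop walks along a row, so consecutive iterations touch consecutive memory addresses. */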
for (int i=0; i < 32; i++)
{
	for (int j=0; j < 32; j++)
	{
		total += myArray[i][j];
	}
}

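/* Version 2: the inner loop walks down a column, so consecutive iterations are an entire row (32 ints) apart. */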
for (int i=0; i < 32; i++)
{
	for (int j=0; j < 32; j++)
	{
		total += myArray[j][i];
	}
}

The two pieces of code above perform the exact same task. However, the first piece of code is faster than the second! The reason is how memory is organized in C and C++ applications: two-dimensional arrays are stored row by row, so myArray[i][j] and myArray[i][j+1] sit next to each other in memory. The first piece of code accesses consecutive memory addresses, which not only utilizes cache coherency, but prefetching as well. The second piece of code is much less coherent, because each time the inner loop executes, the memory address being accessed is 32 words away from the one accessed in the previous iteration. This doesn’t utilize prefetching at all.
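If you want to see the difference for yourself, here is a self-contained version of the same comparison (my own sketch, not code from the original article). The array is enlarged to 1024×1024 because a 32×32 array of ints is only 4 kB, which fits entirely in L1 cache, and at that size both loops run at roughly the same speed.

#include <stdio.h>
#include <time.h>

#define N      1024
#define PASSES 100

static int myArray[N][N];

int main(void)
{
	long total;
	clock_t start, end;

	for (int i = 0; i < N; i++)
		for (int j = 0; j < N; j++)
			myArray[i][j] = 1;

	/* Row-major order: the inner loop touches consecutive addresses. */
	total = 0;
	start = clock();
	for (int pass = 0; pass < PASSES; pass++)
		for (int i = 0; i < N; i++)
			for (int j = 0; j < N; j++)
				total += myArray[i][j];
	end = clock();
	printf("row-major:    %.3f s (total %ld)\n",
	       (double)(end - start) / CLOCKS_PER_SEC, total);

	/* Column-major order: the inner loop jumps a full row (N ints) per access. */
	total = 0;
	start = clock();
	for (int pass = 0; pass < PASSES; pass++)
		for (int i = 0; i < N; i++)
			for (int j = 0; j < N; j++)
				total += myArray[j][i];
	end = clock();
	printf("column-major: %.3f s (total %ld)\n",
	       (double)(end - start) / CLOCKS_PER_SEC, total);

	return 0;
}

On most machines the column-major loop takes noticeably longer than the row-major loop, even though both perform exactly the same additions.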

Another example of cache coherence

For another, more concrete example, consider a simple ray tracing program. For every pixel on the screen, a function is called which determines what objects intersect the ray generated for that pixel, and ultimately calculates the final color for that pixel. The simple way to approach this is to call the function for every column of every row. However, there is a more efficient way to do it! Instead of simply doing each pixel in a raster style, consider ray tracing small blocks of 8×8 pixels before moving on to the next block of pixels. By doing this, chances are that the same objects are being accessed by most of those 64 rays. This means that if one ray requests information about an object, that information is put into the cache. Since all the other rays are nearby, chances are they will need information about that same object, which doesn’t have to be fetched from main memory again because it’s already in the cache! Many studies have been performed on this classic example, finding speedups from 10% to 25%. Just imagine getting a 25% speedup in your application simply because you take into account how the cache works, and make sure your program takes advantage of cache coherency!
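The tiled traversal itself only takes a few lines. Here is a sketch of the loop structure described above; trace_pixel() is a hypothetical stand-in for whatever per-pixel work your ray tracer actually does, and the image dimensions are made up for the example.

#include <stdio.h>

#define WIDTH  64
#define HEIGHT 48
#define TILE   8

/* Hypothetical stand-in for the real per-pixel work of the ray tracer:
   intersect the ray with the scene, shade the hit point, store the color. */
static void trace_pixel(int x, int y)
{
	(void)x;
	(void)y;
}

/* Walk the image in 8x8 tiles instead of one scanline at a time, so that
   nearby rays, which tend to hit the same objects, run close together in
   time while that object data is still in the cache. */
static void render_tiled(void)
{
	for (int ty = 0; ty < HEIGHT; ty += TILE)
		for (int tx = 0; tx < WIDTH; tx += TILE)
			for (int y = ty; y < ty + TILE && y < HEIGHT; y++)
				for (int x = tx; x < tx + TILE && x < WIDTH; x++)
					trace_pixel(x, y);
}

int main(void)
{
	render_tiled();
	printf("done\n");
	return 0;
}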

Rules of thumb

Access memory linearly as much as possible to utilize prefetching. Accesses to main memory can easily cost you hundreds of clock cycles. When you access memory linearly in a predictable fashion, chances are greater that the processor will notice the pattern and prefetch automatically for you.

Try to re-use the data which is already in the cache as much as possible. If you’re going to need the same data more than once, it’s better to use it close together in time, to diminish the chances of that data being kicked out of the cache by other, unrelated data.
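As a small illustration of this rule (my own example, not from the original article), consider an array that is far too big to fit in the cache, from which we want both the minimum and the maximum. Finding them in two separate passes streams the whole array in from main memory twice; by the time the second pass starts, the elements touched at the beginning of the first pass have long since been evicted. Finding both in a single pass uses each element for both tests while it is still in the cache.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define COUNT (1 << 24)   /* 16 million floats (64 MB), far larger than any cache */

int main(void)
{
	float *data = malloc(COUNT * sizeof(float));
	if (data == NULL)
		return 1;
	for (long i = 0; i < COUNT; i++)
		data[i] = (float)(i % 1000);

	clock_t start, end;
	float min, max;

	/* Two passes: the whole array is streamed in from main memory twice. */
	start = clock();
	min = data[0];
	max = data[0];
	for (long i = 1; i < COUNT; i++)
		if (data[i] < min) min = data[i];
	for (long i = 1; i < COUNT; i++)
		if (data[i] > max) max = data[i];
	end = clock();
	printf("two passes: min %.0f max %.0f in %.3f s\n",
	       min, max, (double)(end - start) / CLOCKS_PER_SEC);

	/* One pass: each element is used for both tests while it is still in
	   the cache, so the array is only streamed in from memory once. */
	start = clock();
	min = data[0];
	max = data[0];
	for (long i = 1; i < COUNT; i++)
	{
		if (data[i] < min) min = data[i];
		if (data[i] > max) max = data[i];
	}
	end = clock();
	printf("one pass:   min %.0f max %.0f in %.3f s\n",
	       min, max, (double)(end - start) / CLOCKS_PER_SEC);

	free(data);
	return 0;
}

Merging loops like this is sometimes called loop fusion, and the savings grow as the dataset gets larger relative to the cache.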