Optimizing CUDA programs for GTX 400 series

Unlike most programming languages, CUDA is very closely coupled to the hardware it runs on. While x86 processors have not changed very much over the past 10 years, CUDA hardware has already gone through several significant architectural changes: first the GeForce 8 series, which introduced CUDA, followed shortly by the GTX 200 series, and now NVIDIA has begun selling cards in the 400 series, namely the GTX 480 and GTX 470.

Major changes you should be aware of

More cores means more threads

There are simply more cores than ever before: a total of 480 on the GTX 480. What this means for you is that your program needs to create even more threads to keep the GPU busy. When writing your program, plan on spawning many thousands of threads to get the most out of the hardware. If your program already does that, there is no need to change your code!
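One common way to let the same code scale to more cores is a grid-stride loop, where the launch size is derived from the number of multiprocessors instead of being hard-coded. The sketch below is only an illustration: the kernel, the `pickBlocks` helper, and the choice of 8 blocks per multiprocessor are assumptions made for this example, not tuned recommendations.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so any grid size covers all n elements.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Hypothetical sizing rule: a fixed number of blocks per multiprocessor.
int pickBlocks(int smCount, int blocksPerSM)
{
    return smCount * blocksPerSM;
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    // 8 blocks per multiprocessor is an assumed starting point, not a tuned value.
    const int blocks = pickBlocks(prop.multiProcessorCount, 8);

    std::vector<float> hx(n, 1.0f), hy(n, 3.0f);
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("launched %d blocks x %d threads = %d threads, y[0] = %f\n",
                blocks, threads, blocks * threads, hy[0]);

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Because the loop covers the whole array regardless of grid size, the same kernel runs unchanged on a card with more multiprocessors; only the computed block count grows.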

What the new cache hierarchy means to you

The second most important change in the GTX 400 series is that there is now a true L1/L2 cache hierarchy. So what does this mean for you? Everything. One major complaint about CUDA has been that each thread could only use so many registers before they started spilling to off-chip memory, and accessing off-chip memory can cost hundreds of clock cycles! Instead of increasing the size of the register file on each SM, NVIDIA chose, correctly in my view, to add an actual cache hierarchy. Now, when threads need more registers than the hardware can provide, the spilled values land in the L1 cache first, which is very fast. If the L1 cache is full, or there are other conflicts, they fall through to the L2 cache, which is significantly larger. Even so, the L2 cache is still much faster than going to off-chip memory.

Keep in mind that even the GTX 480 has a limited amount of L2 cache: 768 kB in total. This is much smaller than most modern CPU caches. It is also important to remember that these cards have extraordinary main memory bandwidth, far exceeding that of any Intel or AMD CPU.
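If you want to check how much L2 cache a particular card reports instead of relying on the spec sheet, the runtime exposes it through `cudaGetDeviceProperties`. The sketch below assumes device 0; the `fitsInL2` helper is a hypothetical convenience for this example, and a working set that fits on paper is of course no guarantee that it stays resident in the cache.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: does a working set of this size fit in the reported L2?
bool fitsInL2(size_t workingSetBytes, int l2Bytes)
{
    return l2Bytes > 0 && workingSetBytes <= static_cast<size_t>(l2Bytes);
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }

    // l2CacheSize is reported in bytes; totalGlobalMem is a size_t in bytes.
    std::printf("%s: %d multiprocessors, %d kB L2 cache, %zu MB global memory\n",
                prop.name, prop.multiProcessorCount,
                prop.l2CacheSize / 1024, prop.totalGlobalMem >> 20);

    const size_t workingSet = 512 * 1024;  // example working set: 512 kB
    std::printf("a %zu kB working set %s in L2\n", workingSet / 1024,
                fitsInL2(workingSet, prop.l2CacheSize) ? "fits" : "does not fit");
    return 0;
}
```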

In short, you can now write your programs without worrying so much about register spilling. It can still be an issue, but it won't hurt your performance nearly as much as before.
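To see whether a kernel uses per-thread local memory at all, you can compile with `nvcc -Xptxas -v`, or ask the runtime with `cudaFuncGetAttributes`. The sketch below takes the second route. The kernel is a made-up example whose dynamically indexed array will often, but not necessarily, be placed in local memory; note that `localSizeBytes` covers such arrays as well as genuine register spills, so treat a nonzero value as a hint to investigate, not as proof of spilling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel: the dynamically indexed per-thread array is a typical
// candidate for local memory (the compiler makes the final decision).
__global__ void scratchKernel(float *out)
{
    float scratch[64];
    for (int i = 0; i < 64; ++i)
        scratch[i] = out[threadIdx.x] * i;
    out[threadIdx.x] = scratch[(threadIdx.x * 7) % 64];
}

// Hypothetical helper: true if the kernel uses any per-thread local memory.
bool usesLocalMemory(const cudaFuncAttributes &attr)
{
    return attr.localSizeBytes > 0;
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scratchKernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("registers per thread: %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    std::printf("uses local memory: %s\n", usesLocalMemory(attr) ? "yes" : "no");
    return 0;
}
```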

Double-precision floating point units

There has been much press and celebration over the fact that the GF100 chip (the chip used in the GTX 480 and 470) has double-precision floating point units that run at half the single-precision rate. It is vital to keep in mind that this half-speed double precision is NOT available on the GeForce desktop products. Instead, these products still run double precision at one eighth the single-precision speed, just like the GTX 200 series. Presumably, the half-speed double-precision arithmetic will be enabled in the supercomputing-oriented product line in the near future.
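If you would rather measure the gap on your own card than take the ratio on faith, a rough micro-benchmark is enough. The sketch below times the same multiply-add loop in `float` and in `double` using CUDA events. Everything in it (kernel, helper names, launch sizes, iteration count) is an assumption made for this example, and a single dependent chain per thread gives only a ballpark figure, not a peak-throughput measurement.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread runs a chain of dependent multiply-adds in type T.
template <typename T>
__global__ void fmaLoop(T *out, T a, T b, int iters)
{
    T x = static_cast<T>(threadIdx.x);
    for (int i = 0; i < iters; ++i)
        x = x * a + b;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the result live
}

// Time one launch of fmaLoop<T> in milliseconds using CUDA events.
template <typename T>
float timeKernelMs(int blocks, int threads, int iters)
{
    T *out = nullptr;
    cudaMalloc(&out, sizeof(T) * blocks * threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Short warm-up launch so one-time setup cost is not included in the timing.
    fmaLoop<T><<<blocks, threads>>>(out, T(1.0001), T(0.5), 16);

    cudaEventRecord(start);
    fmaLoop<T><<<blocks, threads>>>(out, T(1.0001), T(0.5), iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return ms;
}

// Hypothetical helper: how many times slower double was than single.
float slowdown(float msSingle, float msDouble)
{
    return msDouble / msSingle;
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 1 << 16;  // assumed sizes

    float msSingle = timeKernelMs<float>(blocks, threads, iters);
    float msDouble = timeKernelMs<double>(blocks, threads, iters);
    if (msSingle <= 0.0f || msDouble <= 0.0f) {
        std::printf("timing failed (is a CUDA device available?)\n");
        return 1;
    }

    std::printf("single: %.3f ms, double: %.3f ms, double/single: %.2fx\n",
                msSingle, msDouble, slowdown(msSingle, msDouble));
    return 0;
}
```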

L1/Shared memory split

The last significant change you, as the programmer, should be aware of is that the L1 cache and shared memory come out of the same on-chip storage, and you decide how it is split. You can use 16 kB of shared memory with 48 kB of L1 cache, or 48 kB of shared memory with 16 kB of L1 cache. Some programs need lots of shared memory, while others benefit from having extra cache. You will need to choose which is best for your application.
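In the CUDA runtime, this choice is expressed per kernel with `cudaFuncSetCacheConfig`, using `cudaFuncCachePreferShared` or `cudaFuncCachePreferL1`. The runtime treats the setting as a preference. The sketch below is a minimal illustration: the `blockSum` kernel and the `pickConfig` rule of thumb are assumptions for this example, and the right choice for a real application should come from measurement.

```cuda
#include <cstdio>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Block-wide sum staged through dynamically sized shared memory.
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float *in, float *out, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}

// Hypothetical rule of thumb: if a block needs more shared memory than the
// 16 kB split provides, ask for the 48 kB shared-memory split instead.
cudaFuncCache pickConfig(size_t sharedBytesPerBlock)
{
    return sharedBytesPerBlock > 16 * 1024 ? cudaFuncCachePreferShared
                                           : cudaFuncCachePreferL1;
}

int main()
{
    const int threads = 256, blocks = 4, n = threads * blocks;
    const size_t sharedBytes = threads * sizeof(float);

    // State the preferred split for this kernel before launching it.
    cudaError_t err = cudaFuncSetCacheConfig(blockSum, pickConfig(sharedBytes));
    if (err != cudaSuccess) {
        std::printf("cudaFuncSetCacheConfig failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<float> hin(n, 1.0f), hout(blocks, 0.0f);
    float *din = nullptr, *dout = nullptr;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dout, blocks * sizeof(float));
    cudaMemcpy(din, hin.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads, sharedBytes>>>(din, dout, n);
    cudaMemcpy(hout.data(), dout, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("sum computed by block 0: %f\n", hout[0]);

    cudaFree(din);
    cudaFree(dout);
    return 0;
}
```

Setting the preference per kernel means a program can mix both styles: shared-memory-heavy kernels ask for the larger shared split, while kernels with irregular global memory access ask for more L1.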

This entry was posted by admin on April 24, 2010 at 11:34 am under CUDA. Tagged 400 series, CUDA, GTX 400, GTX 470, GTX 480, Optimization.

The Supercomputing Blog