Atomic operations are often essential for multithreaded programs, especially when different threads need to access or modify the same data. Conventional multicore CPUs generally use a test-and-set instruction to manage which thread controls which data. CUDA has a much more expansive set of atomic operations. With CUDA, you can effectively perform a test-and-set using the atomicInc() instruction. However, you can also use atomic operations to actually manipulate the data itself, without the need for a lock variable.
Consider the following code, where different threads run different computations. The final variable, globalMax, should contain the maximum value obtained.
x86 code: max = calculateValuesForThisThread(); if (max > globalMax) { while(!test-and-set (&myLockVariable)) {}; if (max > globalMax) // test again since it may have changed. { globalMax = max; } unset(&myLockVariable); }
Notice how the above pseudo code uses a locking variable, myLockVariable, to control which thread currently has permission to write to the variable, globalMax. This is a typical demonstration of how the atomic operation of test-and-set is used on x86 machines. Notice that the atomic instruction isn’t used for manipulating data, but rather, for controlling which thread has permission to modify certain data.
CUDA takes a different approach. The function above could be rewritten using atomicInc(&myLockVariable, 1) instead of Test-and-set. Similarly, atomicDec(&myLockVariable, 0) could be used instead of Unset. However, CUDA can simply directly use the function, atomicMax(), and not worry about a lock variable at all.
Performance advantages and penalties of atomic operations in CUDA
Atomic functions in CUDA can greatly enhance the performance of many algorithms. However, if atomic functions are misused, they can just as easily have devastating performance impacts on your program. This article will explore and quantify these performance hits, and will give you tips on how to avoid such pitfalls while using atomic instructions in CUDA.
Atomic functions operate from main memory
While CUDA devices with compute capability 1.2 or higher support atomic operations for both shared and global memory, we will be focusing our examination of atomic operations on global memory, which is generally where atomic operations are necessary for many algorithms. CUDA devices with compute capability 1.1 or higher support atomic operations on global memory.
It is very important that you understand that atomic operations that operate from global memory are not cached. This means that every time you perform an atomic operation, there must be a read, from global memory, and if the value needs to be changed, there must be a write to global memory. This can take hundreds of clock cycles to accomplish. The CUDA architecture is set up to withstand extreme latency. However, if too many threads are stalled, the computational throughput of the GPU will severely diminish.
Example
For this article, we are going to have four arrays with random integers.
Result = A*B/C + D;
Ultimately, we want the maximum value achieved. Each thread will compute one result element. Then, each thread will perform an atomicMax() call, which will set the maximum value of a global variable, if and only if the value we pass in is greater than the current value. There are several ways to approach this problem.
Download the full source code listing here
Naive approach
The naïve approach would be to simply have each thread compute its result, and perform an atomicMax() function call.
Less naïve approach
A less naïve approach would be to simply have each thread compute its result. Then, thread 0 will read the global maximum variable, and store the value in shared memory. After a __syncthreads() function call, each thread will read the value in shared memory, and only make atomicMax() function call if their values are greater than that in shared memory. This approach is very easy to do programmatically, yet should significantly cull out the number of atomic function calls made.
Smart approach
A smarter approach would be to have each thread compute its result. Then, each thread block would use shared memory to find the maximum result for all threads in that thread block. Then, the atomicMax() function would only have to be called once for each thread block.
Smarter approach
The smart approach would be to use atomic operations on both the local and global level. When atomic operations are performed on shared memory, no access to global memory is required. This means that atomic operations performed in shared memory are generally much faster than those performed on global memory. This approach will only work on CUDA devices with compute capability of 1.2 or above.
Results
This test was performed using an underclocked GTX 280 with a 2.66GHz Core 2 Duo with 64-bit Vista Ultimate. Note that before each kernel call, the time it takes to copy all of the data from system memory to GPU memory is included in the test measurements, along with the overhead of calling the kernel. It is also important to note that simply computing A*B/C + D isn’t very intensive, which is why the CPU performance is so close to the GPU performance. This is included in the benchmark and is meant for use as a reference. The project was build and run in 64-bit release configuration.
Not surprisingly, the very naïve approach yielded the worst results, far worse than the naïve CPU implementation. The naïve approach also computes a relatively constant number of elements per second regaurdless of how many total elements there are. This is because the GPU becomes memory bound, as it is constantly reading and writing from the same global memory address over and over again. Needless to say, the naïve implementation is terrible.
Not surprisingly, the less naïve approach performed much better, and for large numbers of elements, it even surpassed the more complicated ‘smart’ approaches. Unfortunately, for a small amount of elements, this approach had very low performance.
The smart and smarter approaches to this problem had relatively good performance with a small number of elements, and had better performance when there were more elements.
CPU performance for this test was constant, regardless of how many elements there were. This is expected, because there is no overhead of copying the test data to the gpu before each test.
Conclusion
There are a couple things that you should take away from this article. The most important lesson is that atomic operations generally require global memory accesses, which take hundreds of clock cycles. Therefore, it is absolutely vital that you limit the number of atomic operations as much as you possibly can. While atomic operations are supported in shared memory in some of the more recent CUDA devices, many algorithms will make use of atomic operations to facilitate communication between separate thread blocks, which means global memory accesses. Just remember, atomic operations can either be your best friend, or your worst enemy.
A word about Fermi
At the time of this writing, nVidia has publically announced some details and whitepaper about Fermi. Fermi is nVidia’s next generation of GPUs. Unlike previous generations, Fermi will include a full-fledged L1/L2 cache hierarchy. nVidia’s published papers suggest that atomic operations will make use of this cache, thus allowing anywhere from 5x to 20x improvement in the speed of atomic operations. Fermi has yet to be released to the public, but it is worth noting that some of the traditional pitfalls of atomic operations in CUDA may disappear with the new Fermi architecture.