CUDA - Tutorial 4 - Atomic Operations | The Supercomputing Blog

CUDA – Tutorial 4 – Atomic Operations

This tutorial will discuss how to perform atomic operations in CUDA, which are often essential for many algorithms. Atomic operations are easy to use, and extremely useful in many applications. Atomic operations help avoid race conditions and can be used to make code simpler to write.

What are atomic operations?

Atomic operations are operations which are performed without interference from any other threads. Atomic operations are often used to prevent race conditions which are common problems in mulithreaded applications. For example, suppose you have two threads named A and B. Now suppose each thread wants to increase the value of memory location 0x1234 by one. Suppose the value at memory location 0x1234 is 5. If A and B both want to increase the value at location 0x1234 at the same time, each thread will first have to read the value. Depending on when the reads occur, it is possible that both A and B will read a value of 5. After adding a value of 1, both A and B will want to write 6 into the memory location, which is not correct! The value, 5, should have been increased twice (once by each thread), but instead, the value was only increased once! This is called a race condition, and can happen in any multi-threaded program if the programmer is not careful.

How to avoid race conditions

Fortunately, race conditions are easy to avoid in CUDA. An atomic operation is capable of reading, modifying, and writing a value back to memory without the interference of any other threads, which guarentees that a race condition won’t occur. Atomic operations in CUDA generally work for both shared memory and global memory. Atomic operations in shared memory are generally used to prevent race conditions between different threads within the same thread block. Atomic operations in global memory are used to prevent race conditions between two different threads regaurdless of which thread block they are in. Please note that shared memory is generally much faster than global memory.

Example:

int atomicAdd(int* address, int val);

This atomicAdd function can be called within a kernel. When a thread executes this operation, a memory address is read, has the value of ‘val’ added to it, and the result is written back to memory. The original value of the memory at location ‘address’ is returned to the thread. Many algorithms which require atomic operations will not need to use the original value at the memory location. For a full list of available atomic functions, please read a CUDA programming guide version 1.1 or later.

Performance notes

There are a couple things to beware of when using atomic operations. As mentioned before, shared memory is much faster than global memory, so atomic operations in shared memory tend to complete faster than atomic operations in global memory. While atomic operations are often necessary in some algorithms, it is important to minimize their usage when possible, especially with global memory accesses.

Also beware of serialization. If two threads perform an atomic operation at the same memory address at the same time, those operations will be serialized. The order in which the operations complete is undefined, which is fine, but the serialization can be quite costly.

Example of SLOW code:

__shared__ totalSum;
if (threadIdx.x == 0) totalSum = 0;
__syncthreads();

int localVal = pValues[blockIdx.x * blockDim.x + threadIdx.x];
atomicAdd(&totalSum, 1);
__syncthreads();

The code you see above is very simple. Each thread reads a value from memory sequentially, which is quite fast. If the sum of all those numbers is needed, you might think it would be okay to simply use an atomicAdd operations. This would effectively calculate the final sum in one line of code, which might seem great. Unfortunately, all of those operations will be serialized, which is extremely slow. If you have 512 threads per thread block, each block would have to do 512 sequential additions. However, using a reduction method discussed in a previous tutorial would be able to accomplish the same task with just 511 additions. The key here is that these additions can be done in parallel. Generally, 16 or 32 additions can be done completely parallel, making it much faster than using an atomicAdd for a reduction problem such as this. So whenever you use atomic operations, be sure to program such that there won’t need to be too many sequential operations. Failure to do so will result in a dramatic loss of parallelism, and thus a dramatic loss in performance. However, when atomic operations are used correctly, they are extremely useful.

Compatibility notes

When nVidia released their first CUDA capable cards, the original 8800GTX with 768MB memory and the 8800GTS with 640 MB of memory, CUDA was a new technology. These original CUDA capable cards are the only ones which do not support atomic operations. Every nVidia GPU that is a core 84 or higher supports CUDA 1.1 or higher, and thus supports atomic operations.

Even after the introduction of atomic operations with CUDA 1.1, there are still a couple atomic operations which were added later, such as 64-bit atomic operations, etc. Because there are a *lot* of CUDA 1.1 cards in consumer hands right now, I would recommend only using atomic operations with 32-bit integers and 32-bit unsigned integers. This will ensure that your application will work on the largest number of graphics cards available in the market today.

For full compatibility with all CUDA devices including those with compute capability 1.0, you may wish to you ifdefs in your code. This way, when your program executes on a device which supports atomic operations, they will be used, but your program will still be able to execute alternate, less efficient code if the device only has compute capability 1.0. When you compile to support atomic operations, the constant, CUDA_NO_SM_11_ATOMIC_INTRINSICS will be defined.

Compiler issues

With Visual Studio, you shouldn’t have any trouble compiling programs with atomic intrinsics. You may want to use the histogram64 program as a template for starting your own. However, if you are working on Linux or Mac OS, you may need to add “-arch=sm_11” as a compiler flag for nvcc. You can tell if it’s working by placing the following code inside the main function of your program:

#ifndef CUDA_NO_SM_11_ATOMIC_INTRINSICS
	printf("WARNING! Not using atomics!\n");
#endif

Back to CUDA tutorials

This entry was posted by admin on July 24, 2009 at 12:16 am under CUDA. Tagged Algorithm, Atomic, Coherency, CUDA, Parallel, Performance, Tutorial. Both comments and pings are currently closed.

The Supercomputing Blog