Welcome to the first article in a series of tutorials on the basics of CUDA. These tutorials will show you, in a user-friendly way, how CUDA works and how to take advantage of the massive computational power of modern GPUs.

Step 1: Download the CUDA SDK

Before programming anything in CUDA, you’ll need to download the SDK. You can download the CUDA SDK here. For now, CUDA runs only on nVidia graphics cards, so if you’re interested in getting extremely high performance out of your application, you’ll need a CUDA-capable graphics chip. You’ll also need Visual Studio 2005; Visual Studio 2008 is not yet supported at the time of this writing.
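Once the SDK is installed, a quick way to check that everything works is to compile and run a trivial program with nvcc, the CUDA compiler. Here’s a minimal sketch; the file name hello.cu and the kernel name doNothing are just examples:

// hello.cu - a minimal program to verify the CUDA toolchain.
// Build with: nvcc hello.cu -o hello
#include <stdio.h>

__global__ void doNothing()
{
    // An empty kernel; launching it exercises the driver and runtime.
}

int main()
{
    doNothing<<<1, 1>>>();       // launch one block containing one thread
    cudaThreadSynchronize();     // wait for the kernel to finish

    if (cudaGetLastError() == cudaSuccess)
        printf("CUDA is working.\n");
    else
        printf("CUDA kernel launch failed.\n");
    return 0;
}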

Step 2: Understand the thread hierarchy

The first thing you’ll need to know about is the thread hierarchy. CPUs are designed to run just a few threads very quickly. GPUs, on the other hand, are designed to process thousands of threads simultaneously, with great efficiency. So, in order to take full advantage of your graphics card, you’ll need to break your problem down into hundreds or thousands of threads.

Half-Warp – A half-warp is a group of 16 consecutive threads. Threads in a half-warp are generally executed together, and half-warps are aligned: threads 0->15 are in the same half-warp, threads 16->31 in the next, and so on.

Warp – A warp is a group of 32 consecutive threads. On future computing devices from nVidia, all threads in the same warp may be executed together in parallel. Therefore, it is a good idea to write your programs as if all threads within the same warp will execute together in parallel. Threads 0->31 are in the same warp, threads 32->63 in the next, and so on.

Block – A block is a collection of threads. For technical reasons, blocks should contain at least 192 threads to obtain maximum efficiency and full latency hiding; typically, blocks contain 256, 512, or even 768 threads. Here’s the important thing you need to know: threads within the same block can synchronize with each other and communicate with each other quickly.

Grid – A grid is a collection of blocks. Blocks cannot synchronize with each other, and therefore threads within one block cannot synchronize with threads in another block.
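To make the hierarchy concrete, here’s a minimal kernel sketch. Each thread combines its block ID, the block size, and its thread ID within the block into a unique global index, so that a whole grid of blocks covers an entire array. The names addOne, data, and d_data are just illustrative; the host-side memory setup is covered in Step 3:

__global__ void addOne(float *data, int n)
{
    // blockIdx.x  = which block this thread belongs to
    // blockDim.x  = how many threads are in each block
    // threadIdx.x = this thread's position within its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard threads that fall past the end of the array
        data[i] += 1.0f;
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
// int blocks = (n + 255) / 256;
// addOne<<<blocks, 256>>>(d_data, n);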

Step 3: Understand the memory hierarchy

Global Memory – Global memory can be thought of as the physical memory on your graphics card. If you have an integrated nVidia chipset like the ION, global memory is the portion of system memory allotted to the graphics device. All threads can read and write global memory, and the CPU can read and write it as well by copying data to and from the device.
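Here’s a minimal sketch of the typical flow: the CPU allocates global memory, copies input data into it, launches the addOne kernel from Step 2, and copies the results back. Error checking is omitted for brevity:

#include <stdio.h>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float h_data[n];
    for (int i = 0; i < n; i++)
        h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void**)&d_data, bytes);                        // allocate global memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // CPU -> global memory

    addOne<<<(n + 255) / 256, 256>>>(d_data, n);               // threads read and write global memory

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // global memory -> CPU
    cudaFree(d_data);

    printf("h_data[0] = %f\n", h_data[0]);                     // prints 1.000000
    return 0;
}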

Shared Memory – A GPU contains many processors, grouped into multiprocessors. Each multiprocessor has a small amount of shared memory, on the order of 16KB. Shared memory is generally used as a very fast working space for the threads within a block, and it is allocated on a block-by-block basis. For example, if three blocks are running concurrently on the same multiprocessor, each block can reserve at most 16KB / 3 of shared memory. Threads within the same block can quickly and easily communicate with each other by reading from and writing to shared memory. It’s worth mentioning that shared memory is at least 100 times faster than global memory, so it’s very advantageous if you can use it correctly.
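As a sketch of that pattern, the kernel below has each block stage 256 elements in shared memory, synchronize, and cooperatively reduce them to a single partial sum. The names blockSum, in, and partialSums are illustrative, and the kernel assumes it is launched with exactly 256 threads per block:

__global__ void blockSum(const float *in, float *partialSums)
{
    __shared__ float s[256];                    // fast working space for this block

    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid]; // stage one element per thread
    __syncthreads();                            // wait until all 256 values are in place

    // Tree reduction: each step halves the number of active threads.
    for (int stride = 128; stride > 0; stride /= 2)
    {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();                        // every thread must reach this barrier
    }

    if (tid == 0)
        partialSums[blockIdx.x] = s[0];         // one result per block
}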

Texture Memory – A GPU also has texture units and texture memory, which can be taken advantage of in some circumstances. Unlike global memory, texture memory is cached, and it is generally read-only. If you expect your threads to access memory addresses with some coherence, consider using texture memory to speed up those accesses.
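As a sketch using the texture reference API from this era of the SDK, the kernel below reads a buffer of floats through a texture instead of directly from global memory. The names tex, copyFromTexture, d_in, and d_out are illustrative:

// Texture references must be declared at file scope.
texture<float, 1, cudaReadModeElementType> tex;

__global__ void copyFromTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);   // this read goes through the texture cache
}

// Host side: bind a linear device buffer to the texture before launching.
// cudaBindTexture(0, tex, d_in, n * sizeof(float));
// copyFromTexture<<<(n + 255) / 256, 256>>>(d_out, n);
// cudaUnbindTexture(tex);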