CUDA stands for Compute Unified Device Architecture. It is an extension of the C programming language created by nVidia, and it allows the programmer to take advantage of the massive parallel computing power of an nVidia graphics card in order to do general purpose computation. Before continuing, it’s worth talking about this for a little bit longer.

CPUs like the Intel Core 2 Duo and AMD Opteron are good at doing one or two tasks at a time, and doing those tasks very quickly. Graphics cards, on the other hand, are good at doing a massive number of tasks at the same time, and doing each of those tasks relatively quickly. To put this into perspective, suppose you have a 20 inch monitor with a standard resolution of 1,920 x 1,200. An nVidia graphics card has the computational ability to calculate the color of each of those 2,304,000 pixels many times a second. In order to accomplish this feat, graphics cards use dozens, even hundreds, of ALUs (arithmetic logic units). Fortunately, nVidia’s ALUs are fully programmable, which enables us to harness an unprecedented amount of computational power in the programs that we write.

As stated previously, CUDA lets the programmer take advantage of the hundreds of ALUs inside a graphics processor, which together are much more powerful than the handful of ALUs available in any CPU. However, this does put a limit on the types of applications that are well suited to CUDA.

CUDA is only well suited for highly parallel algorithms

In order to run efficiently on a GPU, you need to have many hundreds of threads. Generally, the more threads you have, the better. If you have an algorithm that is mostly serial, then it does not make sense to use CUDA. Many serial algorithms do have parallel equivalents, but many do not. If you can’t break your problem down into at least a thousand threads, then CUDA probably is not the best solution for you.
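
To make “breaking a problem down into threads” concrete, here is a sketch in plain C (the variable names are illustrative). The first loop has independent iterations, so each of its N iterations could become its own GPU thread; the second is the kind of mostly serial algorithm that CUDA cannot help with:

    // Embarrassingly parallel: no iteration reads the result of another,
    // so each of the N iterations could become its own GPU thread.
    for (int i = 0; i < N; i++)
        out[i] = a[i] + b[i];

    // Inherently serial: each iteration needs the previous iteration's
    // result, so this loop cannot simply be split across threads.
    for (int i = 1; i < N; i++)
        out[i] = out[i - 1] + a[i];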

CUDA is extremely well suited for number crunching

If there is one thing that CUDA excels at, it’s number crunching. The GPU is fully capable of doing 32-bit integer and floating point operations. In fact, GPUs are best suited to floating point computations, which makes CUDA an excellent choice for number crunching. Some of the higher end graphics cards do have double precision floating point units; however, there is only one 64-bit floating point unit for every sixteen 32-bit floating point units. So double precision floating point numbers should be avoided with CUDA if they aren’t absolutely necessary for your application.
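
In practice, that mostly means being careful to keep your arithmetic in single precision. As a small illustrative sketch (the variables are made up for this example), remember that in C an unsuffixed constant like 0.5 is a double, and that the math library has single precision variants such as sinf:

    float x = 1.0f;                   // some single precision input

    // All 32-bit operations: the 'f' suffix on the constant and the
    // single precision sinf() keep the work on the plentiful 32-bit units.
    float y  = 0.5f * x + sinf(x);

    // Written carelessly, the same expression promotes to 64-bit
    // arithmetic and lands on the scarce double precision units.
    double y2 = 0.5 * x + sin(x);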

CUDA is well suited for large datasets

Most modern CPUs have a couple megabytes of L2 cache because most programs have high data locality. However, when working quickly across a large dataset, say 500 megabytes, the L2 cache may not be as helpful. The memory interface of a GPU is very different from the memory interface of a CPU. GPUs use massively parallel interfaces to connect with their memory. For example, the GTX 280 uses a 512-bit interface to its high performance GDDR3 memory. This type of interface is approximately 10 times faster than a typical CPU to memory interface, which is great. It is worth noting that most nVidia graphics cards do not have more than 1 gigabyte of memory. nVidia does offer special CUDA compute cards which have up to 4 gigabytes of RAM onboard, but these cards are more expensive than cards originally intended for gaming.
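
To give a feel for the mechanics, here is a minimal sketch of how a large dataset gets into that fast on-card memory, using the standard CUDA runtime calls (h_data is an assumed host-side buffer, and error checking is omitted for brevity):

    size_t bytes = 500 * 1024 * 1024;       // the 500 megabyte dataset from above
    float *d_data;                          // 'd_' marks a device (GPU) pointer
    cudaMalloc((void**)&d_data, bytes);     // allocate in the card's GDDR memory
    cudaMemcpy(d_data, h_data, bytes,       // copy from host RAM across the bus
               cudaMemcpyHostToDevice);
    // ... launch kernels that read d_data at full on-card bandwidth ...
    cudaFree(d_data);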

Writing a kernel in CUDA

As stated previously, CUDA can be taken full advantage of when writing in C. This is good news, since most programmers are very familiar with C. Also stated previously, the main idea of CUDA is to have thousands of threads executing in parallel. What wasn’t stated is that all of these threads are going to be executing the very same function, known as a kernel. Understanding what the kernel is and how it works is critical to your success when writing an application that uses CUDA. The idea is that even though all of the threads of your program are executing the same function, each thread will be working on a different piece of the data. Each thread will know its own ID, and based on its ID, it will determine which pieces of data to work on. Don’t worry, flow control constructs like if, for, while, and do are all supported.
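
Here is a minimal kernel sketch along those lines (the function and parameter names are illustrative, not part of CUDA). The __global__ qualifier marks the function as a kernel, and the built-in variables blockIdx, blockDim, and threadIdx are what each thread combines to compute its unique ID:

    // Runs once per thread; every thread executes this same function.
    __global__ void scaleArray(float *data, float scale, int n)
    {
        // Each thread derives its own unique ID from its block number
        // and its position within that block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Based on that ID, each thread picks the one element it owns.
        // Flow control works as promised: this 'if' guards the threads
        // whose IDs fall past the end of the array.
        if (i < n)
            data[i] = data[i] * scale;
    }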

Writing programs with CUDA

One important thing to remember is that your entire program DOES NOT need to be written in CUDA. If you’re writing a large application, complete with a user interface and many other functions, then most of your code will be written in C++ or whatever your language of choice is. Then, when something extremely computationally intense is needed, your program can simply call the CUDA kernel function you wrote. So the main idea is that CUDA should only be used for the most computationally intense portions of your program.
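
As a sketch of how that call looks in practice (reusing the illustrative scaleArray kernel from the previous section, with a hypothetical pixel buffer), the host-side C++ code allocates device memory, copies the data over, launches the kernel with CUDA’s <<<blocks, threads>>> notation, and copies the result back:

    void brightenImage(float *h_pixels, int n)   // ordinary C++ host function
    {
        size_t bytes = n * sizeof(float);
        float *d_pixels;
        cudaMalloc((void**)&d_pixels, bytes);
        cudaMemcpy(d_pixels, h_pixels, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n pixels.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleArray<<<blocks, threadsPerBlock>>>(d_pixels, 1.2f, n);

        cudaMemcpy(h_pixels, d_pixels, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_pixels);
    }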

CUDA without a graphics card

While CUDA is specifically meant to run on nVidia’s graphics cards, it can also run on any CPU. The program will never run nearly as fast on a CPU as it does on a GPU, but it will still work.

Overview

That basically does it for this article. We covered what CUDA is, and what types of applications it is and isn’t good for. The most important point is that CUDA is only well suited for computations that can be broken down and performed by thousands of threads in parallel. We’ve only scratched the surface of exactly what a kernel is and why it’s such a key concept in CUDA; I didn’t want to get too deep into how the kernel works in an introductory article. Please feel free to browse the CUDA section of this website for more information about how CUDA works and what advantages it has to offer you.