Welcome to the first tutorial on getting started with CUDA programming. This tutorial will show you how to do calculations with your CUDA-capable GPU. Any nVidia chip from series 8 or later is CUDA-capable. This tutorial will also give you some data on how much faster the GPU can do calculations compared to a CPU.
In this tutorial there are two arrays containing floating-point numbers, plus a third array for the results. Our program will simply compute C[i] = A[i] * B[i] and measure how fast this executes on the GPU and on the CPU. Later, we’ll make this equation more complicated and study the results.
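As a point of reference, the CPU version of this calculation is just a plain loop. Here is a minimal sketch; the function name and array names are placeholders for illustration, not the exact code used later in the tutorial:

// CPU reference: multiply two arrays element by element.
// h_dataA, h_dataB, and h_resultC are assumed to be allocated host arrays
// of length dataAmount (allocation is shown in Step 3).
void multiplyNumbersCPU(const float *h_dataA, const float *h_dataB,
                        float *h_resultC, int dataAmount)
{
    for (int i = 0; i < dataAmount; i++)
        h_resultC[i] = h_dataA[i] * h_dataB[i];
}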
Step 1: Get the CUDA SDK
You can freely download and use the CUDA SDK. It is available for Linux, Windows XP, Vista, and Mac OS. After you download and install the SDK, please compile and run one of the included sample programs, just to make sure your installation and drivers are all up to date. For Windows you’ll need Visual Studio 2005 (the Express edition is fine); Visual Studio 2008 is not supported at the time of this writing. Once you have your compiler and computer set up for compiling and running CUDA programs, you may proceed to step 2.
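If you prefer a quick sanity check from the command line (on Linux or Mac OS), the toolkit’s nvcc compiler can build a simple .cu file directly. The file name here is just an example:

nvcc myFirstTest.cu -o myFirstTest
./myFirstTest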
Step 2: Start a new project
When starting a new project, I usually just copy the convolutionSeparable example from the CUDA SDK and rename it. This works well because it’s one of the smaller examples.
Step 3: Understanding how the memory works
For this tutorial program, we will need to allocate three large arrays, both on the host machine and on the GPU. Because CUDA kernels can only access memory dedicated to the GPU, we must separately allocate memory space on the host and on the GPU. This is a key concept. Even if the GPU is part of a shared-memory device like ION, or any other nVidia motherboard chipset, the program must still be written this way, if only to ensure that it will also run on a GPU with dedicated memory.
float *h_dataA, *h_dataB, *h_resultC;
float *d_dataA, *d_dataB, *d_resultC;

h_dataA   = (float *)malloc(sizeof(float) * MAX_DATA_SIZE);
h_dataB   = (float *)malloc(sizeof(float) * MAX_DATA_SIZE);
h_resultC = (float *)malloc(sizeof(float) * MAX_DATA_SIZE);

CUDA_SAFE_CALL( cudaMalloc( (void **)&d_dataA,   sizeof(float) * MAX_DATA_SIZE) );
CUDA_SAFE_CALL( cudaMalloc( (void **)&d_dataB,   sizeof(float) * MAX_DATA_SIZE) );
CUDA_SAFE_CALL( cudaMalloc( (void **)&d_resultC, sizeof(float) * MAX_DATA_SIZE) );
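Before anything is copied to the device, the host input arrays need to be filled with something. A minimal sketch, assuming we just want arbitrary test values (the sample data in the full program may differ):

// Fill the host input arrays with arbitrary test data.
for (int i = 0; i < MAX_DATA_SIZE; i++)
{
    h_dataA[i] = (float)rand() / (float)RAND_MAX;
    h_dataB[i] = (float)rand() / (float)RAND_MAX;
}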
Note that malloc is being used instead of the C++ operator ‘new’. This is because the program is written in C. Don’t worry, it is possible and easy to mix CUDA and C++ in the same program, but that will be covered in a later tutorial. To allocate memory on the device, call cudaMalloc(void **devPtr, size_t numBytes).
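Just as each malloc needs a matching free, each cudaMalloc needs a matching cudaFree once the program is done with its buffers. A minimal cleanup sketch for the allocations above:

// Release the device buffers, then the host buffers.
CUDA_SAFE_CALL( cudaFree(d_dataA) );
CUDA_SAFE_CALL( cudaFree(d_dataB) );
CUDA_SAFE_CALL( cudaFree(d_resultC) );

free(h_dataA);
free(h_dataB);
free(h_resultC);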
For a better understanding of the basic CUDA memory and cache structure, I encourage you to take a look at the CUDA memory and cache architecture page.
Step 4: Using the high precision timer
Have you ever been bothered by how hard it is to find a high-precision timer or counter for a program while keeping it portable across platforms? This problem still exists today, even though it should have been solved a decade ago. Fortunately, the CUDA SDK provides a highly accurate timer which can be used to measure the performance of GPU or CPU activities.
double gpuTime;
unsigned int hTimer;

CUT_SAFE_CALL( cutCreateTimer(&hTimer) );
CUT_SAFE_CALL( cutResetTimer(hTimer) );
CUT_SAFE_CALL( cutStartTimer(hTimer) );

// Do work here

CUT_SAFE_CALL( cutStopTimer(hTimer) );
gpuTime = cutGetTimerValue(hTimer);
Step 5: Copy data to and from the device
// Copy the data to the device
CUDA_SAFE_CALL( cudaMemcpy(d_dataA, h_dataA, sizeof(float) * dataAmount, cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL( cudaMemcpy(d_dataB, h_dataB, sizeof(float) * dataAmount, cudaMemcpyHostToDevice) );

// Do the multiplication on the GPU
multiplyNumbersGPU<<<blockGridRows, threadBlockRows>>>(d_dataA, d_dataB, d_resultC);
CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
CUDA_SAFE_CALL( cudaThreadSynchronize() );

// Copy the results back to the host
CUDA_SAFE_CALL( cudaMemcpy(h_resultC, d_resultC, sizeof(float) * dataAmount, cudaMemcpyDeviceToHost) );
In the code above, the two input arrays are copied to the device, and then the kernel is executed. After the kernel finishes, the results, which still reside in GPU memory, must be copied back to host memory. Notice the interesting syntax for calling the kernel: when the host calls a CUDA kernel function, many threads are spawned, and we must specify how many threads are spawned and how those threads are organized. This will be discussed in the CUDA kernel tutorial. Also notice the last argument to cudaMemcpy. This controls whether data is being sent from the host machine to the CUDA device, or vice versa. The same function can also copy data from one location on the CUDA device to another.
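For example, a device-to-device copy just uses a different direction flag. This call is only an illustration and is not part of the tutorial program; d_resultCopy is a hypothetical second device buffer of the same size:

// Copy the results into another device buffer without going through the host.
// d_resultCopy is a hypothetical buffer allocated with cudaMalloc elsewhere.
CUDA_SAFE_CALL( cudaMemcpy(d_resultCopy, d_resultC, sizeof(float) * dataAmount, cudaMemcpyDeviceToDevice) );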
This concludes this tutorial. We covered how to use the CUDA high-precision timer, how memory must be allocated both on the device and on the host machine, and finally, how to copy data to and from the device.
Proceed to the next tutorial to learn how to write a kernel, and how threads are organized when executing a kernel. The next tutorial will also present results showing just how fast CUDA functions can be compared to doing the same calculations on a CPU.