## CUDA – Tutorial 2 – The Kernel

Welcome to the second tutorial on writing high performance CUDA applications. This tutorial covers the basics of writing a kernel and organizing threads, blocks, and grids. We will complete the application started in the previous tutorial by writing the kernel function. The goal of the application is simple: take two arrays of floating point numbers, perform an operation on them, and store the result in a third array. We will then measure how fast the code executes on a CUDA device compared to a traditional CPU; the data analysis appears toward the end of the article.

If you haven’t read the first CUDA tutorial yet, it may be a good idea to go back and read it first.

A critical part of designing CUDA applications is organizing threads, thread blocks, and grids appropriately. For this application, the simplest choice is to have each thread calculate one, and only one, element of the result array. A general guideline is that a block should consist of at least 192 threads in order to hide memory access latency, so 256 and 512 threads are common and practical choices. For this tutorial, 256 threads per block is used. It’s best to think of a thread block as a 3D block of threads, which you may shape essentially any way you like. For some applications, a 16x16x1 block makes sense. Our application deals with linear data, so it’s simplest to keep the thread structure linear as well: every block will have dimensions 256x1x1.

This block shape is described by a variable of type dim3, which will be used when calling the CUDA kernel. Now it’s time to think about how to structure the blocks. Exactly like a thread block, you can think of each grid as a 3D brick, this time filled with blocks. Since our problem is linear, we would like the grid to have a linear structure as well. For this tutorial, I wanted each array to hold about 32 million elements. Here we run into our first problem!

Size of each array = 1024 x 1024 x 32 = 33,554,432 total number of elements

Number of threads each block will calculate = 256

Number of blocks required = total number of elements / 256 = 131,072

Okay, so naturally we would like our grid to have dimensions 131,072 x 1 x 1. Unfortunately, the maximum size of a grid dimension is 65,535! We are therefore forced to choose another grid structure. For this tutorial, I have chosen 128 x 1024 x 1 (note that 128 x 1024 = 131,072, so every block is accounted for).

### The CUDA Kernel

Now that you know what the thread structure will be like, we can write the kernel.

```cpp
__global__ void multiplyNumbersGPU(float *pDataA, float *pDataB, float *pResult)
{
    // 256 threads per block, with 128 blocks per grid row (gridDim.x = 128).
    int tid = (blockIdx.y * 128 * 256) + blockIdx.x * 256 + threadIdx.x;  // Gives every thread a unique ID.

    // Simpler variants that were also timed:
    // pResult[tid] = pDataA[tid] * pDataB[tid];        // Each thread only multiplies one data element.
    // pResult[tid] = pDataA[tid] * pDataB[tid] / 12.34567;
    // pResult[tid] = sqrt(pDataA[tid] * pDataB[tid] / 12.34567);

    pResult[tid] = sqrt(pDataA[tid] * pDataB[tid] / 12.34567) * sin(pDataA[tid]);
}
```

The first thing to notice is the __global__ keyword. It indicates that this function runs on the CUDA device but is launched from the host PC. The next thing you should notice is how each thread figures out exactly which data element it is responsible for computing. Every thread runs the same code, so the only way for a thread to differentiate itself from the others is through its threadIdx and blockIdx values.

### Performance Results

The program written for this tutorial measures the amount of time the GPU takes to accomplish this task and compares it with the time the CPU needs for the same task. You may download the source code at the end of this tutorial and try it on your own system. As the code above shows, several different calculations were tried:

- C = A * B
- C = A * B / 12.34567
- C = sqrt(A * B / 12.34567)
- C = sqrt(A * B / 12.34567) * sin(A)

The test system is a 2.66 GHz Core 2 Duo with a 1333 MHz FSB and 4 GB of DDR2 RAM. The graphics card is a GTX 280, underclocked to a core clock of 540 MHz and a memory clock of 940 MHz.

As the graph shows, the CPU and GPU took roughly the same amount of time to compute A * B. This calculation is simple enough that system memory bandwidth is the main bottleneck. However, as soon as a floating point division is introduced, the CPU takes significantly longer to compute the result, while there is no noticeable difference in the GPU execution time. The calculations are likely still simple enough that the GPU is limited mainly by memory bandwidth, and the memory bandwidth of a graphics card is usually about ten times that of system memory on most desktops, so the GPU finishes sooner. Also keep in mind that these results include the time it takes to copy the data to the graphics card, launch and execute the kernel, and copy the results back to system memory.

For the next calculation, a square root is introduced. Notice how the CPU now takes significantly longer to compute, while the GPU is still unaffected by the new computation. Finally, a sin() calculation and a multiplication are added to the equation. The CPU takes even longer to complete, yet the GPU execution time is largely unchanged.

These results are simply amazing. You can also see in the graph that with as few as 1,000,000 elements to compute, it may make sense to use CUDA. Datasets of this size are very common in many applications today, which means that CUDA may be a viable option for your program.