This tutorial discusses how different threads can communicate with each other. In the previous tutorial, each thread operated independently, with no interaction or data dependencies between threads. However, most parallel algorithms require some amount of data to be communicated between threads.

How threads communicate

Generally, threads may only safely communicate with each other if they exist within the same thread block. There are technically ways for two threads from different blocks to communicate, but doing so is much more difficult and much more prone to bugs. That topic will be saved for another tutorial.

Shared memory

Threads within the same block have two main ways to communicate data with each other. The fastest way is to use shared memory. When a block of threads starts executing, it runs on an SM, a multiprocessor unit inside the GPU. Each SM has a fairly small amount of shared memory associated with it, usually 16KB. To make matters more difficult, multiple thread blocks can often run simultaneously on the same SM. For example, if each SM has 16KB of shared memory and 4 thread blocks run simultaneously on an SM, then the maximum amount of shared memory available to each thread block is 16KB/4, or 4KB. So if your threads only need to share a small amount of data at any given time, shared memory is by far the fastest and most convenient way to do it.

Global memory

However, if your program uses too much shared memory, or your threads simply need to share too much data at once, the shared memory may not be big enough to accommodate everything that needs to be shared among the threads. In that situation, threads always have the option of writing to and reading from global memory. Accessing global memory is much slower than accessing shared memory, but global memory is much larger. Most video cards sold today have at least 128MB of memory the GPU can access.
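To make this concrete, here is a minimal sketch (not from the tutorial's source code; the names h_values, d_values, and nElements are illustrative assumptions) of allocating global memory on the device and copying input data into it:

// Allocate global memory on the GPU and copy the input array into it.
float *d_values;	// device pointer into global memory
size_t size = nElements * sizeof(float);
cudaMalloc((void**)&d_values, size);
cudaMemcpy(d_values, h_values, size, cudaMemcpyHostToDevice);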

Calculating statistics with CUDA

For this tutorial, we will be writing a simple program where each block finds the minimum, maximum, and average value of an array of floating point numbers. Of course, the process would be exactly the same for integers. CUDA is an excellent choice for any program that calculates many statistics, such as standard deviation, mean, min, max, etc.

Block and thread structure

Because threads will need to share data, it is important to take that into consideration when designing the thread block structure (how the threads will be organized in any given thread block). The variables we will have to share are min, max, and average: 3 variables * 4 bytes / variable * 256 threads / block = 3KB / block. Since each SM has a total of 16KB of shared memory, it is acceptable for each block to use 3KB of it. Therefore, the thread blocks will consist of 256 threads each, in a 256x1x1 configuration.

Like the previous tutorial, we will be working with very large arrays. Because there is a limit of 65535 thread blocks in any given dimension of a grid, we will design the grid with a 128 x Y x 1 configuration, where Y changes according to how large the array is; a sketch of such a launch appears below.
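As a hedged sketch of what that launch might look like (the kernel name findStats, the device pointers, and the element count are assumptions for illustration; the array size is assumed to be an exact multiple of 128 blocks x 256 threads):

// One element per thread, 256 threads per block, 128 blocks per grid row.
int nElements = 8 * 1024 * 1024;	// example: 8M floats
int nBlocks = nElements / 256;
dim3 dimGrid(128, nBlocks / 128, 1);	// 128 x Y x 1 grid
dim3 dimBlock(256, 1, 1);	// 256 x 1 x 1 thread block
findStats<<<dimGrid, dimBlock>>>(d_values, d_min, d_max, d_avg);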

Declaring shared arrays

CUDA kernels have a special keyword, __shared__, which places a variable into the shared memory of each thread block. The __shared__ keyword works on any type of variable or array. For this tutorial, we will declare three arrays in shared memory.

// Declare arrays to be in shared memory.
// 256 elements * (4 bytes / element) * 3 = 3KB.
__shared__ float min[256];
__shared__ float max[256];
__shared__ float avg[256];

Value reduction

Reduction is a very common operation in both serial and parallel applications. For example, suppose you want to compute 1+2+3+4. To do this in parallel, one thread could calculate 1+2 while another thread calculates 3+4. Once those calculations are complete, a single thread can add the two partial results together. Thus, the addition problem is cut in half with each step. We will use reduction to find the max, min, and average values of an array of floating point numbers.
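The reduction loop below assumes that each thread has already copied one element into the shared arrays and that the block has synchronized once. Here is a minimal sketch of that setup (the parameter name values and the global indexing scheme are assumptions based on the 128 x Y x 1 grid described earlier):

	// Each thread seeds the shared arrays with its own element.
	int blockId = blockIdx.y * gridDim.x + blockIdx.x;	// flatten the 2D grid
	int idx = blockId * blockDim.x + threadIdx.x;	// this thread's global index
	min[threadIdx.x] = values[idx];
	max[threadIdx.x] = values[idx];
	avg[threadIdx.x] = values[idx];
	__syncthreads();	// all initial writes must complete before the loop reads them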

	int nTotalThreads = blockDim.x;	// Total number of active threads

	while(nTotalThreads > 1)
	{
		int halfPoint = (nTotalThreads >> 1);	// divide by two
		// only the first half of the threads will be active.

		if (threadIdx.x < halfPoint)
		{
			// Get the shared value stored by another thread
			float temp = min[threadIdx.x + halfPoint];
			if (temp < min[threadIdx.x]) min[threadIdx.x] = temp;

			temp = max[threadIdx.x + halfPoint];
			if (temp > max[threadIdx.x]) max[threadIdx.x] = temp;

			// when calculating the average, sum and divide
			avg[threadIdx.x] += avg[threadIdx.x + halfPoint];
			avg[threadIdx.x] /= 2;
		}
		__syncthreads();

		nTotalThreads = (nTotalThreads >> 1);	// divide by two.
	}

In the code above, each iteration of the while loop cuts the number of calculations to be done in half. Reduction is a popular technique because it can be massively parallelized. The thread block starts with 256 threads, the size we chose earlier. The first time the while loop executes, threads 0->127 compare their min, max, and average values with those of threads 128->255, and the results are stored in shared memory indexes 0->127. The next time the while loop executes, only threads 0->63 are active, and so on. One important thing to consider is that even though most threads will be idle during execution of the kernel, the GPU will still be drastically faster than the CPU. Performance results are displayed towards the end of this tutorial.

Notice the call to __syncthreads() in the kernel. This function acts as a barrier for all the threads in that particular thread block: no thread can continue past it until every thread in the block has reached it. While this may seem to slow down execution, since threads that arrive early sit idle, it is absolutely necessary here. By using __syncthreads(), we guarantee that all threads are in the same iteration of the while loop at the same time, and therefore that every thread reads the correct values from shared memory. Without the call to __syncthreads, there would be a race condition.

Because there is no good way of communicating data between thread blocks, each block writes its results to global memory. These values can then be copied back to the host PC for final processing, where the final answer is determined.
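Here is a hedged sketch of that final step (the output pointers minOut, maxOut, and avgOut are assumed kernel parameters, not the tutorial's exact names). After the loop finishes, index 0 of each shared array holds the block's result, so only one thread per block needs to write it out.

	// Thread 0 writes this block's results to global memory.
	if (threadIdx.x == 0)
	{
		int blockId = blockIdx.y * gridDim.x + blockIdx.x;	// flatten the 2D grid
		minOut[blockId] = min[0];
		maxOut[blockId] = max[0];
		avgOut[blockId] = avg[0];
	}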

Summary

To sum up: it is possible, and often necessary, for threads within the same block to communicate with each other through either shared memory or global memory. Shared memory is by far the fastest way, but due to its size limitations, some problems will be forced to use global memory for thread communication. Using __syncthreads is sometimes necessary to ensure that all data from all threads is valid before threads read shared memory written by other threads. Below is a graph of the execution time on my CPU against the execution time on my graphics card. The CPU is a 2.66GHz Core 2 Duo, while the graphics card is a GTX 280, slightly underclocked. As you can see, the GPU is faster once there are at least a million elements, and the gap between the GPU and CPU continues to widen with more elements. However, main system memory may be a significant bottleneck preventing the GPU from achieving more than 1.5x the processor's performance.

Download tutorial source code here

Next tutorial: Atomic functions

[Graph: CPU vs. GPU execution time (CUDA_Results3)]
