Taking the square root of a floating-point number is essential in many engineering applications. Whether you are running N-body simulations, simulating molecules, or doing linear algebra, the ability to perform thousands or even millions of square root operations quickly and accurately is essential. Unfortunately, the square root functions on most CPUs are very time consuming, even with specialized SSE instructions. Fortunately, GPUs have specialized hardware to perform such square root operations extremely fast. CUDA, NVIDIA’s solution for extremely high performance parallel computing, puts this onboard specialized hardware to full use, and easily outperforms modern Intel or AMD CPUs by a factor of over a hundred.

Example problem

The example problem for this article is a reduced N-body problem. We are given a set of (x, y) coordinates. For this article, we can have anywhere from 512 to 65536 coordinates, or elements. To keep the example as simple as possible, all we need to do is calculate, for each point, the sum of its distances to all other points. In a simulation problem such as this, the algorithmic complexity is of order N squared, which means we will need vast computational power as N, the number of elements, increases.
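Written out explicitly, the quantity computed for each point i is the sum of Euclidean distances to every point j:

```latex
\mathrm{result}_i = \sum_{j=1}^{N} \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}
```

Note that the inner sum runs over all N points, including j = i, whose contribution is zero; evaluating it for every i is what makes the algorithm O(N²).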

CPU approach

This problem is very simple, and can be approached easily in C or C++. Below is some sample code. Note that it has not been optimized for SSE, though it is debatable whether some compilers can automatically vectorize such code.

void getDistCPU(float *h_result, float *h_X, float *h_Y, int nElems)
{
	for (int i = 0; i < nElems; i++)
	{
		float localX = h_X[i];
		float localY = h_Y[i];

		float accumDist = 0.0f;
		for (int j = 0; j < nElems; j++)
		{
			float curX = h_X[j] - localX;	// x distance to point j
			float curY = h_Y[j] - localY;	// y distance to point j

			accumDist += sqrtf((curX * curX) + (curY * curY));
		}
		h_result[i] = accumDist;
	}
}

CUDA approach

Because CUDA code is basically C code, it is extremely similar to the CPU code. However, instead of two nested for loops, it is easier to simply have each CUDA thread calculate the result for one element. Thus, the CUDA kernel only needs one for loop.

__global__ void CalculateDist(float *pX, float *pY, float *pResult, int totalElements)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	if (index >= totalElements)	// guard against extra threads in a partially filled last block
		return;

	float localX = pX[index];
	float localY = pY[index];

	float accumulatedDist = 0.0f;
	for (int i = 0; i < totalElements; i++)
	{
		float curX = pX[i];
		float curY = pY[i];

		curX = curX - localX;	// curX is now the x distance
		curY = curY - localY;	// curY is now the y distance

		accumulatedDist += sqrtf((curX * curX) + (curY * curY));
	}
	pResult[index] = accumulatedDist;
}
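The article omits the host-side code that allocates GPU memory, copies the coordinates over, and launches the kernel. A minimal sketch follows; the function name `getDistGPU`, the block size of 256, and the absence of error checking are assumptions for illustration, not details from the original:

```cuda
#include <cuda_runtime.h>

// Hypothetical host-side driver for the CalculateDist kernel above.
void getDistGPU(float *h_result, float *h_X, float *h_Y, int nElems)
{
	size_t bytes = nElems * sizeof(float);
	float *d_X, *d_Y, *d_result;

	// allocate device buffers for the coordinates and the per-point sums
	cudaMalloc(&d_X, bytes);
	cudaMalloc(&d_Y, bytes);
	cudaMalloc(&d_result, bytes);

	// copy the input coordinates from the host to the GPU
	cudaMemcpy(d_X, h_X, bytes, cudaMemcpyHostToDevice);
	cudaMemcpy(d_Y, h_Y, bytes, cudaMemcpyHostToDevice);

	// one thread per element; round the grid size up so every
	// element is covered even when nElems is not a multiple of 256
	int threadsPerBlock = 256;
	int blocks = (nElems + threadsPerBlock - 1) / threadsPerBlock;
	CalculateDist<<<blocks, threadsPerBlock>>>(d_X, d_Y, d_result, nElems);

	// copy the per-point distance sums back to the host
	cudaMemcpy(h_result, d_result, bytes, cudaMemcpyDeviceToHost);

	cudaFree(d_X);
	cudaFree(d_Y);
	cudaFree(d_result);
}
```

The `cudaMemcpy` calls in this wrapper are exactly the host-to-GPU transfer overhead that the timing figures below include.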


Of course, performance is much faster on the GPU than on the CPU. The card used for this experiment is an underclocked GTX 280. The CPU is a 2.66 GHz Core 2 Duo, though only one core is utilized in this experiment. For 65536 elements, the CUDA code executed over 172 times as fast as the CPU code. In fact, even for very small element counts, 256 for example, the CUDA code still outperformed the CPU by 40%. These measurements include all overhead, including copying memory from the host to the GPU and launching the kernel. Clearly, if your computational algorithm requires many square roots, the performance of CUDA will far exceed that of a CPU.