
CUDA is great for any compute-intensive task, and that includes image processing. In this tutorial, we'll go over why CUDA is ideal for image processing and how easy it is to port normal C++ code to CUDA.

Why CUDA is ideal for image processing

A single high-definition image can have over 2 million pixels. Many image processing algorithms require dozens of floating point computations per pixel, which can result in slow runtimes even on the fastest CPUs. The slow speed of a CPU is a serious hindrance to productivity, as anyone who uses Photoshop without a CUDA-capable graphics card will tell you. In CUDA, we can generally spawn exactly one thread per pixel: each thread is responsible for calculating the final color of exactly one pixel. Since images are naturally two-dimensional, it makes sense to make each thread block two-dimensional as well. 32×16 is a good size because it lets each thread block run 512 threads. We then spawn as many thread blocks in the x and y dimensions as necessary to cover the entire image. For example, for a 1024×768 image, the grid of thread blocks is 32×48, with each thread block containing 32×16 threads.

Picking the best thread block dimensions

One question that may arise is why the thread blocks aren't a nice, even 32×32. The reason is that CUDA devices of compute capability 1.3 or lower can have a maximum of 512 threads per block. CUDA devices of compute capability 2.0 or greater can have up to 1024 threads per block, thanks to some substantial hardware enhancements. So with this in mind, if you want your code to be compatible with all CUDA devices, you'll want 512 or fewer threads per block. But if you're running your program on a supercomputer, or only on your personal computer where you know the hardware configuration, feel free to use all 1024 threads per block.

The ease of porting image processing code to CUDA

Some people don’t mind spending hours tweaking their code to get the absolute maximum performance on a CUDA device. However, many people simply want their code to run several times faster with minimal effort. This article focuses on the latter. Image processing algorithms typically do something like the following:

#pragma omp parallel for
for (int i=0; i < height; i++)
{
	for (int j=0; j < width; j++)
	{
		result[i*width+j] = ProcessPixel(j,i);
	}
}

In order to port this to CUDA, you literally just replace the for loops with a simple calculation involving each thread's thread index and block index.

int i = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;

// Make sure this thread maps to a valid pixel
if (i >= height || j >= width) return;	// Out of bounds -- skip the calculation

To demonstrate the ease of this process, I took some code from the image swirl algorithm tutorial and modified it to run on CUDA. Other than the i and j calculations, I did not need to alter the core algorithm code whatsoever!

CUDA Host code

Because this program runs on a CUDA device, the image data must be copied over to the GPU. Furthermore, this algorithm requires a read-only copy of the original image in addition to the buffer that results are written into. Therefore, the image is copied to the GPU, then duplicated on the device with a GPU-to-GPU memory copy. The kernel is called, and finally the resulting image is copied back to the host.

// Copy the data to the device
CUDA_SAFE_CALL( cudaMemcpy(d_imageData, h_imageData, sizeof(unsigned int) * MAX_DATA_SIZE, cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL( cudaMemcpy(d_imageDataCopy, d_imageData, sizeof(unsigned int) * MAX_DATA_SIZE, cudaMemcpyDeviceToDevice) );

//SwirlCu(int width, int height, int stride, unsigned int *pRawBitmapOrig, unsigned int *pBitmapCopy, double factor)
SwirlCu<<<blockGridRows, threadBlockRows>>>(C_WIDTH, C_HEIGHT, C_WIDTH*4, d_imageData, d_imageDataCopy, 0.005f);
CUT_CHECK_ERROR("SwirlCu() execution failed\n");
CUDA_SAFE_CALL( cudaThreadSynchronize() );

// Copy the data back to the host
CUDA_SAFE_CALL( cudaMemcpy(h_imageData, d_imageData, sizeof(unsigned int) * MAX_DATA_SIZE, cudaMemcpyDeviceToHost) );

You may view the CUDA Kernel code and the speed results of CUDA vs. the CPU on the next page.
