Virtually all useful programs contain some sort of loop, whether it is a for, do, or while loop, and this is especially true of programs that take a significant amount of time to execute. Much of the time, the iterations of these loops are completely independent of one another, which makes them a prime target for parallelization. OpenMP exploits exactly this common program characteristic, so it is extremely easy to allow an OpenMP program to use multiple processors simply by adding a few compiler directives to your source code.

Parallel for loops

This tutorial explores some of the ways you can use OpenMP to run the loops in your program on multiple processors. For the sake of illustration, suppose you are writing a ray tracing program. Without going too far into the details of how ray tracing works, the program simply steps through each pixel of the screen and determines that pixel's color from lighting, texture, and geometry information, then moves on to the next pixel and repeats the process. The important thing to note here is that the calculation for each pixel is completely independent of the calculation for every other pixel, which makes this program highly suitable for OpenMP. Consider the following pseudo-code:

for(int x=0; x < width; x++)
{
	for(int y=0; y < height; y++)
	{
		finalImage[x][y] = RenderPixel(x,y, &sceneData);
	}
}

This piece of code simply goes through each pixel of the screen and calls a function, RenderPixel, to determine that pixel's final color, storing the result in an array. The entire scene being rendered is stored in a variable, sceneData, whose address is passed to the RenderPixel function. Because each pixel is independent of all other pixels, and because RenderPixel is expected to take a noticeable amount of time, this small snippet of code is a prime candidate for parallelization with OpenMP. Consider the following modified pseudo-code:

#pragma omp parallel for
for(int x=0; x < width; x++)
{
	for(int y=0; y < height; y++)
	{
		finalImage[x][y] = RenderPixel(x,y, &sceneData);
	}
}

The only change to the code is the line directly above the outer for loop. This compiler directive tells the compiler to parallelize the loop with OpenMP, dividing its iterations among the available threads. On a quad-core processor, that single added line can yield up to a 4x speedup (a 300% performance increase). In practice, perfectly linear or superlinear speedups are rare, but near-linear speedups are very common.
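
To make the example concrete, here is a minimal, self-contained sketch that compiles and runs. The image dimensions, the SceneInfo type, and the trivial RenderPixel stub are placeholders invented for illustration, not part of any real ray tracer. With GCC, build it with something like gcc -fopenmp render.c; other compilers use their own OpenMP flag.

#include <stdio.h>
#include <omp.h>

#define WIDTH  640
#define HEIGHT 480

typedef struct { int unused; } SceneInfo;   /* stand-in for real scene data */

static float finalImage[WIDTH][HEIGHT];

/* Stand-in for a real ray tracer's per-pixel work. */
static float RenderPixel(int x, int y, const SceneInfo *scene)
{
	(void)scene;   /* a real renderer would use the scene here */
	return (float)(x * y);
}

int main(void)
{
	SceneInfo sceneData = {0};

	#pragma omp parallel for
	for(int x = 0; x < WIDTH; x++)
	{
		for(int y = 0; y < HEIGHT; y++)
		{
			finalImage[x][y] = RenderPixel(x, y, &sceneData);
		}
	}

	printf("Rendered %dx%d pixels using up to %d threads.\n",
	       WIDTH, HEIGHT, omp_get_max_threads());
	return 0;
}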

WARNING – Watch out for all the variables used in parallel regions of code

There are a few important things to keep in mind when parallelizing for loops, or any other sections of code, with OpenMP. For example, take a look at variable y in the pseudo-code above. Because the variable is effectively declared inside the parallelized region, each thread gets its own private copy of y. However, take the following buggy code example:

int x,y;
#pragma omp parallel for
for(x=0; x < width; x++)
{
	// BUG: y is shared by default, so all threads write to the same y
	for(y=0; y < height; y++)
	{
		finalImage[x][y] = RenderPixel(x,y, &sceneData);
	}
}

The above code has a serious bug: variables x and y are now declared outside the parallelized region. When we use the compiler directive to parallelize the outer for loop, OpenMP automatically makes the loop variable of that loop, x, private to each thread. However, the other variables (y, finalImage, and sceneData) are shared by default, meaning every thread sees the same values and can read and write them. That is exactly what makes the code buggy: y must be different for each thread, but since it is shared, the threads overwrite each other's inner loop counter and corrupt the iteration. Declaring y inside the parallelized region is one way to guarantee that a variable will be private to each thread, but there is another way to accomplish this.

int x,y;
#pragma omp parallel for private(y)
for(x=0; x < width; x++)
{
	for(y=0; y < height; y++)
	{
		finalImage[x][y] = RenderPixel(x,y, &sceneData);
	}
}

Instead of declaring variable y inside the parallel region, we can declare it outside the parallel region and explicitly declare it private in the OpenMP compiler directive. This effectively gives each thread an independent variable called y, and each thread has access only to its own copy.
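
A quick way to convince yourself that private works as described is a small experiment like the sketch below. The tiny loop and the use of omp_get_thread_num to label the output are illustrative choices, not part of the original example; the loop variable i is automatically private because it controls the parallelized loop.

#include <stdio.h>
#include <omp.h>

int main(void)
{
	int i, y;

	/* Each thread gets its own copy of y; a write by one thread
	   never disturbs another thread's copy. */
	#pragma omp parallel for private(y)
	for(i = 0; i < 4; i++)
	{
		y = i * 100;
		printf("thread %d computed y = %d\n", omp_get_thread_num(), y);
	}
	return 0;
}

Note that a private copy starts out uninitialized inside the loop, so it must be assigned before it is read; if a thread needs the variable's value from before the parallel region, OpenMP provides a separate firstprivate clause for that.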

A word about shared variables

Forgetting to declare a variable as private is one of the most common bugs in OpenMP applications. However, if you want the highest performance out of your program, it is best to use private variables only when you have to, since each private copy costs extra memory and setup time. In the pseudo-code for this tutorial, finalImage and sceneData are shared variables for good reasons. Even though each thread writes to the finalImage array, these writes will not conflict with each other as long as x and y are private, because each thread writes to a different element. sceneData is also shared because the threads only read from this data structure, which never changes, so there are no race conditions associated with it.
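
To make the contrast concrete, the sketch below shows what a genuine conflict on a shared variable looks like, using a hypothetical pixelsRendered counter that every thread increments. On a plain shared variable those increments would race and some would be lost. The standard OpenMP reduction clause shown here goes a bit beyond this tutorial, but it is the usual fix: each thread gets a private counter, and the copies are summed when the loop finishes.

#include <stdio.h>

#define WIDTH  640
#define HEIGHT 480

int main(void)
{
	long pixelsRendered = 0;

	/* pixelsRendered++ on a plain shared variable would be a race
	   condition; reduction(+:pixelsRendered) makes it safe by giving
	   each thread a private counter and summing them afterwards. */
	#pragma omp parallel for reduction(+:pixelsRendered)
	for(int x = 0; x < WIDTH; x++)
	{
		for(int y = 0; y < HEIGHT; y++)
		{
			pixelsRendered++;   /* stand-in for per-pixel bookkeeping */
		}
	}

	printf("pixels rendered: %ld\n", pixelsRendered);
	return 0;
}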

Wrapping up

There are many more interesting things that can be done with parallelizing loops, and this tutorial is just the tip of the iceberg. The main focus of this article was to show how quickly and easily you can modify your program to use multiple processors with OpenMP. Aside from using the compiler directive to specify which loop should be parallel, it is also extremely important to know which variables should be private and which should be shared; failing to classify a variable correctly leads to terrible bugs and race conditions that are very hard to debug. If your program is written correctly, it will work well on a computer with one processor and even better on a serious machine with 24 or more processors: parallelizing loops with OpenMP can be extremely scalable.
