Posts tagged ‘Parallel’

There are several standard, cross-platform ways to create high-performance, multithreaded programs. However, the C++ language itself has no standard way to spawn threads, which means that we sometimes have to resort to compiler- or platform-specific methods to create threads for our programs. This tutorial focuses on how to easily create worker threads for your Windows (Win32) program using Microsoft Visual Studio. Continue reading ‘Writing multithreaded programs for Windows’ »
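The full tutorial walks through the details; the sketch below is only a rough illustration of the idea, using the Win32 CreateThread API (the thread routine and its argument are made up for this example and are not code from the tutorial):

#include <windows.h>
#include <stdio.h>

// Hypothetical worker routine; the DWORD WINAPI (LPVOID) signature is what CreateThread expects.
DWORD WINAPI workerMain(LPVOID param)
{
	int id = *(int*)param;
	printf("Hello from worker thread %d\n", id);
	return 0;
}

int main()
{
	int id = 1;
	HANDLE hThread = CreateThread(NULL, 0, workerMain, &id, 0, NULL);
	WaitForSingleObject(hThread, INFINITE);	// block until the worker finishes
	CloseHandle(hThread);
	return 0;
}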

As with any parallel program, there is overhead associated with the time threads spend communicating with each other and waiting for each other to finish. This means that parallel programs are often less efficient than serial programs in terms of total processor time. However, in most applications what we care about is wall-time performance: a computer user typically only cares how fast his or her program runs, not how efficiently it uses the processors. (If, say, a serial run takes 10 seconds and a two-thread run takes 6 seconds, the parallel version burns more total CPU time, yet it is still the one the user prefers.) In this article, we will be examining concrete numbers for the overhead that can be associated with MPI programs.

It only makes sense to parallelize a program, or part of a program, when the amount of work justifies it. But when is there enough computation to justify spending your valuable time writing a parallel program? The answer, of course, depends on the nature of the computations you will be performing.

For this article, a program was written that operates on a large array of up to 100,000,000 integer elements. We will be running two tests:

array[i] = i*i;
array[i] = i*i - (int)sqrt((double)i);

For different array sizes, the program measures the amount of time it takes to complete all the computations. The program was run with just one thread, and again with two threads on a dual-core processor. Unfortunately, I do not have access to a quad-core computer.

[Graph: computation time versus array size for one and two threads, for both test computations]

As you can see in the graph above, the results are quite interesting. When doing something as simple as a multiplication, one thread actually finishes faster than two. The computation is so simple and fast that the overhead is enough to make the program slower when run on more than one thread. However, the more complicated and time-consuming calculation, i*i - sqrt(i), tells a different story. Because the square root is a relatively slow operation, the overhead is a much smaller fraction of the total work. You can also see that, with the more complicated calculation, the program's efficiency increases when parallelizing over a larger dataset, which makes sense.

There are many sources of overhead in MPI programs, including the time wasted in blocking operations such as MPI_Send, MPI_Recv, and MPI_Barrier, and the actual communication itself. In this example, the two cores were on the same physical processor, so communication was very fast. However, the overhead of communication is hundreds or even thousands of times higher when transferring data over a network between MPI processes running on different physical computers. So before you decide to parallelize part of your program with MPI or any other parallel framework, it’s best to make sure that the benefit of parallelization outweighs the added overhead of communication. As you can see above, the more complicated and time-consuming your calculations are, the more efficiently your program will run across multiple cores. Otherwise, you might need a very large dataset in order to receive any benefit from parallelization. Feel free to use code similar to that below to run a sanity check before writing a full-scale program.

// sweep over a range of problem sizes, timing each one
for (int i=nTasks*4; i < C_MAX_ARRAY_SIZE; i += (C_MAX_ARRAY_SIZE >> 4))
{
	// divide the first i elements evenly among the ranks; each rank computes its own chunk
	int numOfElems = i/nTasks;
	int startIndex = numOfElems * rank;
	double startTime = MPI_Wtime();
	for (int j=startIndex; j < startIndex + numOfElems; j++)
	{
		g_testArray[j] = j*j - (int)sqrt((double)j);	// the more expensive of the two test computations
	}
	// each rank needs to send its results to rank 0
	if (rank == 0)
	{
		// the master rank receives the computed chunk from every other rank
		MPI_Status status;
		for (int j=1; j < nTasks; j++)
		{
			MPI_Recv(&g_testArray[j*numOfElems], numOfElems, MPI_INT, j, 3, MPI_COMM_WORLD, &status);
		}
	}
	else
	{
		MPI_Send(&g_testArray[startIndex], numOfElems, MPI_INT, 0, 3, MPI_COMM_WORLD);
	}
	double endTime = MPI_Wtime();
	if (rank == 0) printf("i = %d    time = %f\n", i, (float)(endTime - startTime));
}
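To actually run this snippet, it needs the usual MPI scaffolding around it. A minimal sketch of that scaffolding might look like the following (the include list, the array allocation, and the size constant are my assumptions rather than the article’s original code):

#include <math.h>
#include <stdio.h>
#include <mpi.h>

#define C_MAX_ARRAY_SIZE 100000000	// assumed to match the maximum array size used in the article

int *g_testArray;

int main(int argc, char *argv[])
{
	int rank, nTasks;
	MPI_Init(&argc, &argv);			// start up the MPI runtime
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);	// this process's rank
	MPI_Comm_size(MPI_COMM_WORLD, &nTasks);	// total number of processes
	g_testArray = new int[C_MAX_ARRAY_SIZE];

	// ... the timing loop shown above goes here ...

	delete[] g_testArray;
	MPI_Finalize();
	return 0;
}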

This tutorial will discuss how to perform atomic operations in CUDA, which are essential for many algorithms. Atomic operations are easy to use and extremely useful in many applications: they help avoid race conditions and can make code simpler to write. Continue reading ‘CUDA – Tutorial 4 – Atomic Operations’ »
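To give a flavor of what the tutorial covers, the sketch below (not taken from the tutorial; the kernel and variable names are made up) uses atomicAdd so that many threads can safely increment a shared counter:

// Count how many elements of an array are even. Without atomicAdd, concurrent
// increments of *counter from thousands of threads would be a race condition.
__global__ void countEvens(const int *data, int n, int *counter)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n && data[i] % 2 == 0)
	{
		atomicAdd(counter, 1);	// performed as one indivisible read-modify-write
	}
}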

Virtually all useful programs have some sort of loop in the code, whether it is a for, do, or while loop. This is especially true for programs which take a significant amount of time to execute. Much of the time, different iterations of these loops have nothing to do with each other, which makes them a prime target for parallelization. OpenMP effectively exploits this common program characteristic, so it is extremely easy to let an OpenMP program use multiple processors simply by adding a few lines of compiler directives to your source code. Continue reading ‘Tutorial – Parallel For Loops with OpenMP’ »
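As a taste of what that looks like, the sketch below (an illustrative example with made-up names, not code from the tutorial) parallelizes a loop whose iterations are independent; it only needs the compiler's OpenMP switch (for example /openmp in Visual Studio or -fopenmp in GCC):

// Each iteration writes a different element, so the iterations are independent
// and OpenMP can safely split them across the available threads.
void fillSquares(int *array, int n)
{
	#pragma omp parallel for
	for (int i = 0; i < n; i++)
	{
		array[i] = i * i;
	}
}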

Welcome to my tutorial on the very basics of OpenMP. OpenMP is a powerful tool which makes multithreaded programming very easy. If you would like your program to run faster on dual- or quad-core computers, then your project may be very well suited to OpenMP. Continue reading ‘OpenMP tutorial – the basics’ »