As with any parallel program, there is an overhead associated with the amount of time threads spend communicating with each other, and waiting for each other to finish. This means that parallel programs are often less efficient than serial programs. However, in most applications, we care about performance, or wall time performance. A computer user typically only cares how fast his or her program runs, not how efficient the program is. In this article, we will be examining tangible numbers dealing with the overhead which can be associated with MPI programs.
It only makes sense to parallelize a program or part of a program only when it makes sense to do so. But when is there enough computations to justify spending your valuable time coding a parallel program? The answer, of course, depends on the nature of the computations you will be performing.
For this article, a program started with a large array with a maximum number of 100,000,000 integer elements. We will be running two tests:
array[i] = i*i array[i] = i*i - (int)sqrt((double)i);
For different array sizes, the program measures the amount of time it takes to complete all the computations. The program was run with just 1 thread, and again with two threads on a dual core processor. Unforunately, I do not have access to a quad core computer.
As you can see in the graph above, the results are quite interesting. When doing something as simple as a multiplication, 1 thread actually finishes faster than 2 threads. This is because the computation is so simple and fast that the amount of overhead is enough to make the program slower when run on more than 1 thread. However, the more complicated and time consuming calculation i*i – sqrt(i) tells a different story. Because square root is a relatively slow operation, the amount of overhead is much less than before when compared percentage wise. As you can see with the more complicated calculation is that the program efficiency increases when parallelizing over a larger dataset, which makes sense.
There are many sources of overhead in MPI programs, including the time wasted by blocking operations such as MPI_Send, MPI_Recv, MPI_Barrier, and the actual communication itself. In this example, the two cores were on the same substrate, so communication was very fast. However, the overhead of communication is hundreds of even thousands of times higher when transferring data over a network between threads running on different physical computers. So before you decide to parallelize a part of your program by writing with MPI or any other parallel language, it’s best to make sure that the benefit of parallelization outweighs the added overhead of thread communication. As you can see above, the more complicated and time consuming your calculations are, the more efficiently your program will run across multiple cores. Otherwise, you might need a very large dataset in order to receive any benefit from paralellization. Feel free to use code similar to that below to run a sanity check before making a full scale program.
for (int i=nTasks*4; i < C_MAX_ARRAY_SIZE; i += (C_MAX_ARRAY_SIZE >> 4)) { int numOfElems = i/nTasks; int startIndex = numOfElems * rank; double startTime = MPI_Wtime(); for (int j=startIndex; j < startIndex + numOfElems; j++) { g_testArray[j] = j*j - sqrt((double)j); // do a simple computation } // each thread needs to send results to thread 0 if (rank == 0) { // The master thread will need to receive all computations from all other threads. MPI_Status status; for (int j=1; j < nTasks; j++) { MPI_Recv(&g_testArray[j*numOfElems], numOfElems, MPI_INT, j,3, MPI_COMM_WORLD, &status); } } else { MPI_Send(&g_testArray[startIndex], numOfElems, MPI_INT, 0, 3, MPI_COMM_WORLD); } double endTime = MPI_Wtime(); if (rank == 0) printf("i = %d time = %f\n", i, (float)(endTime - startTime)); }