In our previous tutorial, Thread Communication with MPI, we covered the basics of how to send data between threads. However, that tutorial only sent single integers. Sometimes, large amounts of data need to be sent between threads.

The cost of sending data

As stated in previous tutorials, when an application is written with MPI, it is absolutely necessary to use MPI functions for communicating any sort of data between threads. The use of global variables or other such techniques is not acceptable, because an MPI application should be designed to work either on a single computer or across multiple computers, and threads running on different machines cannot share memory.

That being said, there can be a substantial cost to sending data to and from threads. If two threads are running on the same computer, the communication may be fairly quick. However, it is possible that two threads which need to communicate with each other are running on two different computers across a network. Whether it is a 10Mbps, 100Mbps, 1000Mbps, or even an InfiniBand network, there is still a substantial cost to sending a large amount of data. Even so, it is preferable to send one large chunk of data across a network rather than many small chunks, because every message pays a fixed latency cost on top of the transfer time. So one key to optimizing your program is to make as few thread communication calls as possible. Another is to organize your algorithm so that as little data as possible needs to be communicated at all.
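To make the trade-off concrete, here is a minimal sketch (the data buffer, its size N, and the destination rank dest are assumed for illustration) contrasting the two patterns:

// Costly: one MPI_Send per element, so every element pays the
// per-message latency of the network.
for (int i = 0; i < N; i++)
{
	MPI_Send(&data[i], 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
}

// Much cheaper: a single MPI_Send for the entire buffer.
MPI_Send(data, N, MPI_INT, dest, 0, MPI_COMM_WORLD);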

Word counting with MPI

For this tutorial, we’ll be writing an MPI application which will read a large text file and ask the user to input a string. The program will then count the number of times the user’s string appears in the text file. This example was chosen because it shows how to broadcast, send, and receive large amounts of data quickly. It is critical to keep in mind that, while sending large amounts of data between threads over a network can be expensive, it is nothing compared to the cost of reading from a hard drive. Therefore, one thread will read the data file, and later send out the relevant portion of the data to each thread.

After the user has input a string into thread zero, thread zero will need to broadcast this data out to all threads. When using the MPI_Bcast function, each thread needs to already know the amount of data to expect. However, we don’t know how many characters the user will input, so there is no way for the other threads to know how many characters to receive!

There are two mainstream approaches to this problem. When the user inputs data, the data goes into an array which we can presume has a limited number of characters it can store. In the case of this tutorial, that limit is 80 characters. The first approach would be to send all 80 characters, even if only the first few are used. Remember, though, thread communication across a network can be expensive, so broadcasting 80 characters to every thread when only a handful are needed is wasteful.
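With the first approach, no length needs to be communicated at all, since every thread already knows the fixed buffer size. A minimal sketch, using the same szSearchWord and hasSearchWord names as the code further below:

// First approach: always broadcast the full fixed-size buffer of 80
// characters, regardless of how many the user actually typed.
MPI_Bcast(szSearchWord, 80, MPI_CHAR, hasSearchWord, MPI_COMM_WORLD);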

The second approach is to use two MPI_Bcast calls to communicate the user input. The first broadcast sends the length of the data that is about to be sent; the second broadcast then sends the actual data.

// Every thread passes the same root rank, hasSearchWord, so the
// broadcasts match up across all threads.
MPI_Bcast(&searchWordLength, 1, MPI_INT, hasSearchWord, MPI_COMM_WORLD);
// Now receive the word itself. We're adding 1 to the length to allow for NULL termination.
MPI_Bcast(szSearchWord, searchWordLength+1, MPI_CHAR, hasSearchWord, MPI_COMM_WORLD);

As you can see in the above code, two MPI_Bcast calls are used. The first call sends just an integer, while the second sends an array of characters. Please note that MPI_Bcast can be used to send very large amounts of data if necessary.
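To see the two broadcasts in context, here is a minimal, self-contained sketch of this pattern. The variable names match the snippet above; the input handling is simplified for illustration, and in this sketch thread zero is hard-coded as the thread holding the search word:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
	char szSearchWord[80] = "";
	int searchWordLength = 0;
	const int hasSearchWord = 0;	// thread zero reads the user input
	int rank;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	if (rank == hasSearchWord)
	{
		printf("Enter a search word: ");
		fflush(stdout);
		if (fgets(szSearchWord, sizeof(szSearchWord), stdin))
			szSearchWord[strcspn(szSearchWord, "\n")] = '\0';
		searchWordLength = (int)strlen(szSearchWord);
	}

	// First broadcast: the length, so every thread knows how much to expect.
	MPI_Bcast(&searchWordLength, 1, MPI_INT, hasSearchWord, MPI_COMM_WORLD);
	// Second broadcast: the word itself, plus 1 for the NULL terminator.
	MPI_Bcast(szSearchWord, searchWordLength+1, MPI_CHAR, hasSearchWord, MPI_COMM_WORLD);

	printf("Thread %d received \"%s\"\n", rank, szSearchWord);
	MPI_Finalize();
	return 0;
}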

Sending large datasets with MPI_Send

Sending large datasets with MPI_Send and MPI_Recv uses exactly the same principle as the MPI_Bcast code above. However, this time, another concept is employed. Remember that for word counting, each thread can be assigned to review only a portion of the total data. Therefore, it is critically important for thread zero, the thread which read the data file in the beginning, to send each thread only its relevant portion of the data.

if (rank == 0)
{
	// Thread zero keeps the first portion of the file for itself.
	totalChars = pLineStartIndex[nTotalLines];
	portion = totalChars / nTasks;
	startNum = 0;
	endNum = portion;
	totalChars = endNum - startNum;

	for (int i=1; i < nTasks; i++)
	{
		// Calculate the portion of data for each thread. The start is pulled
		// back by (searchWordLength - 1) characters so a word that straddles
		// the boundary between two portions is not missed.
		int curStartNum = i * portion - (searchWordLength - 1);
		int curEndNum = (i+1) * portion;
		if (i == nTasks-1) { curEndNum = pLineStartIndex[nTotalLines]-1; }
		if (curStartNum < 0) { curStartNum = 0; }

		// First send the number of characters the thread will be receiving...
		int curLength = curEndNum - curStartNum;
		MPI_Send(&curLength, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
		// ...then send that many characters of the file buffer itself.
		MPI_Send(pszFileBuffer+curStartNum, curLength, MPI_CHAR, i, 2, MPI_COMM_WORLD);
	}
}

Above is some sample code which shows how thread zero breaks up the data and sends each thread only its relevant piece. Notice how the two MPI_Send calls use different tags, 1 and 2. When writing larger applications, there will probably be many MPI_Send calls, so replacing the 1 with a named constant such as C_LENGTH_OF_DATA and the 2 with C_ACTUAL_DATA can be helpful for debugging purposes.
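For example (the constant names here are only a suggestion, not part of the MPI API):

// Named tags make it obvious which MPI_Recv pairs with which MPI_Send.
#define C_LENGTH_OF_DATA 1
#define C_ACTUAL_DATA    2

MPI_Send(&curLength, 1, MPI_INT, i, C_LENGTH_OF_DATA, MPI_COMM_WORLD);
MPI_Send(pszFileBuffer+curStartNum, curLength, MPI_CHAR, i, C_ACTUAL_DATA, MPI_COMM_WORLD);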

MPI_Status status;
// Receive the length first, then the characters, matching the tags used above.
MPI_Recv(&totalChars, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
MPI_Recv(pszFileBuffer, totalChars, MPI_CHAR, 0, 2, MPI_COMM_WORLD, &status);

Above is an example of the source code used by the receiving threads. As you can see, the total number of characters is received first, followed by the data itself. For the purposes of this tutorial, I tested my program with the Bible, which is free and easy to download, yet large enough to justify parallel computation. In this case, the total number of characters can be quite large.
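Once a thread has its portion of the file, the counting itself is ordinary string scanning. The full program is in the download below; here is a minimal sketch of one way to do it, assuming pszFileBuffer is NULL-terminated, searchWordLength is at least 1, and the per-thread counts are combined with MPI_Reduce:

int count = 0, totalCount = 0;
const char* p = pszFileBuffer;

// strstr finds each occurrence of the search word in this thread's portion.
while ((p = strstr(p, szSearchWord)) != NULL)
{
	count++;
	p++;	// step past the match start so overlapping matches are counted too
}

// Sum the per-thread counts into a single total on thread zero.
MPI_Reduce(&count, &totalCount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);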

Wrapping up

This was a pretty simple tutorial, which showed you how to use MPI_Bcast, MPI_Send, and MPI_Recv to send large amounts of data between threads. We also covered the key concept that sending data is slow, and therefore expensive. The less thread communication a program has, the faster it will generally run. So far, we’ve covered synchronous communication only, but there are also asynchronous ways to communicate data with MPI.

Download the source code

Only small bits and pieces of the program were shown in this article. You may download the full copy of the source code here: MPI_Tutorial_3.

Back to the MPI tutorial landing page