Searching is a common task in computer science, and fortunately, it is also well suited for CUDA. For this article, we’re talking about searching through an unsorted text file for a specific word or phrase. For example, if you have a 50 megabyte text file open in Microsoft Visual Studio, you’re sure to notice that searching for a word can take several seconds, which is longer than anyone wants to wait just to find a word in a document. This article demonstrates a simple kernel that performs basic string matching.

Search algorithm on the CPU

The algorithm on the CPU is fairly straightforward, and code is provided below.

int wordSearchCPU(char *pszData, int dataLength, char *pszTarget, int targetLen)
{
	for (int i=0; i <= dataLength-targetLen; i++)
	{
		int fMatch = 1;
		for (int j=0; j < targetLen && fMatch; j++)	// Checking fMatch here exits the inner loop at the first mismatch, which reduces search time greatly
		{
			if (pszData[i+j] != pszTarget[j]) fMatch = 0;
		}
		if (fMatch) return i;
	}
	return -1;	// Target not found
}

Search algorithm on the GPU

For the GPU, the algorithm is very similar. As with porting most algorithms to CUDA, the finest-grained independent work maps onto separate threads. Here, we will spawn one thread for each character in the text file, so if your text file has a few million characters, you will spawn a few million threads. Each thread then determines whether the characters starting at its index match the target string.

But we need one result, not many

This particular type of problem requires many threads but produces only one final answer: the first occurrence of whatever you’re searching for. So how do the threads communicate? There are two options. If you’re using the original G80 hardware, you can combine the per-thread results with a standard reduction algorithm like the one provided in the CUDA SDK. However, if you’re using a chip that supports atomic instructions, as almost all CUDA chips nowadays do, you can use the atomicMin function to record the first occurrence of the target phrase. The atomic method is simple and quick, especially when running on GTX 400 series hardware. A single integer in global memory stores the earliest location found so far; before the kernel launches, it must be initialized to a value no smaller than the data length so that every thread’s first check passes. The first thing a thread does is check whether an earlier match has already been recorded; if the stored location is at or before the thread’s own starting index, the thread exits immediately.

The code

__global__ void wordSearchSimple(char *pszData, int dataLength, char *pszTarget, int targetLen, int *pFound)
{
	int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
	// Only continue if a full match would fit within the buffer
	// and an earlier instance hasn't already been found
	if (startIndex <= dataLength - targetLen && *pFound > startIndex)
	{
		int fMatch = 1;
		for (int i=0; i < targetLen && fMatch; i++)	// fMatch check exits early on the first mismatch
		{
			if (pszData[startIndex+i] != pszTarget[i]) fMatch = 0;
		}
		if (fMatch) atomicMin(pFound, startIndex);
	}
}
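For completeness, here is a host-side sketch of how this kernel might be launched (the buffer names, 256-thread block size, and omission of error checking are assumptions for illustration, not from the original article). The key detail is initializing the found-location integer to the data length before the launch, so every thread's initial check passes:

```
// Host-side sketch (error checking omitted for brevity)
char *dData, *dTarget;
int  *dFound;
int   notFound = dataLength;	// Sentinel: larger than any valid index

cudaMalloc(&dData,   dataLength);
cudaMalloc(&dTarget, targetLen);
cudaMalloc(&dFound,  sizeof(int));
cudaMemcpy(dData,   pszData,   dataLength, cudaMemcpyHostToDevice);
cudaMemcpy(dTarget, pszTarget, targetLen,  cudaMemcpyHostToDevice);
cudaMemcpy(dFound,  &notFound, sizeof(int), cudaMemcpyHostToDevice);

int blockSize = 256;
int numBlocks = (dataLength + blockSize - 1) / blockSize;	// One thread per character
wordSearchSimple<<<numBlocks, blockSize>>>(dData, dataLength, dTarget, targetLen, dFound);

int result;	// A result equal to dataLength means "not found"
cudaMemcpy(&result, dFound, sizeof(int), cudaMemcpyDeviceToHost);
```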

The results

The hardware used for this test is a Core i7 930 with a GTX 470, with everything at stock speed. The main string consisted of random lower-case letters, and the search phrase consisted of 4 random lower-case letters. The time measured is the total time for 100 runs, and each run includes the overhead of copying the entire string to device memory. By chance alone, we expect to find a match within roughly the first 450,000 characters (there are 26^4 ≈ 457,000 possible four-letter phrases), so CPU search time levels off once the string grows beyond that length. The CUDA kernel, on the other hand, still takes increasingly more time as the string grows, because it launches one thread per character regardless of where the match occurs.

[Figure: CUDA search time vs. string length]

We can also see the effects of memory-copy overhead: the crossover point, where the GPU overtakes the CPU, appears to be around 30,000 elements. We can also see the average search time exploding toward the very end of the range. An easy way to remedy this is to copy only a portion of the data string at a time, run the search on that chunk, and copy more data only if the search didn’t find a match.
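That chunked approach can be sketched as a host-side loop (the chunk size and buffer names are illustrative assumptions; note that consecutive chunks must overlap by targetLen - 1 characters so a match straddling a chunk boundary isn't missed):

```
// Sketch: search one chunk at a time, stopping at the first match.
// Each chunk extends targetLen-1 bytes past the nominal boundary so
// a match spanning two chunks is still found.
int chunkSize = 1 << 20;	// 1 MB per chunk (illustrative)
int result = -1;	// -1 means "not found yet"
for (int offset = 0; offset < dataLength && result < 0; offset += chunkSize)
{
	int len = min(chunkSize + targetLen - 1, dataLength - offset);
	int notFound = len;	// Sentinel for this chunk

	cudaMemcpy(dData,  pszData + offset, len, cudaMemcpyHostToDevice);
	cudaMemcpy(dFound, &notFound, sizeof(int), cudaMemcpyHostToDevice);

	int numBlocks = (len + 255) / 256;
	wordSearchSimple<<<numBlocks, 256>>>(dData, len, dTarget, targetLen, dFound);

	int local;
	cudaMemcpy(&local, dFound, sizeof(int), cudaMemcpyDeviceToHost);
	if (local < len) result = offset + local;	// Convert to a global index
}
```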