Using SSE to process images or video is essential to achieving good performance, and most popular multimedia applications rely on it. Unfortunately, if SSE is used incorrectly it can actually perform worse than non-SSE code. This article walks through several implementations of the same operation and compares the performance of each.

Simple image processing

This article will go through two non-SSE functions and five SSE functions which perform very simple processing on an image. The processing used throughout this article is simply removing blue from every pixel. While this operation is memory-bound on some systems, it still demonstrates some right and wrong ways to do multimedia processing.

Integers with SSE

Working with integers in SSE generally requires the SSE2 extensions and intrinsics. These extensions are available on all modern x86 processors. The SSE registers are 128 bits wide, so each one can hold four 32-bit numbers. With SSE2, these numbers can be 32-bit integers. Each pixel can be represented by a 32-bit number arranged as four 8-bit fields (ARGB), with blue in the lowest byte. This makes SSE2 a natural fit for processing pixel data: one register holds four pixels.
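To make the layout concrete, here is a minimal sketch of packing an ARGB pixel and clearing the blue channel of four pixels with one 128-bit AND. The helper names PackARGB and ClearBlueOfFour are made up for this illustration and are not part of the downloadable source.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: pack four 8-bit channels into one 32-bit ARGB pixel.
// Alpha lands in the highest byte and blue in the lowest byte.
inline std::uint32_t PackARGB(std::uint8_t a, std::uint8_t r,
                              std::uint8_t g, std::uint8_t b)
{
    return (std::uint32_t(a) << 24) | (std::uint32_t(r) << 16) |
           (std::uint32_t(g) << 8)  |  std::uint32_t(b);
}

// Hypothetical helper: clear the blue byte of exactly four pixels at once.
inline void ClearBlueOfFour(std::uint32_t* pixels)
{
    assert(pixels != nullptr);
    // Unaligned load, so the caller's array needs no special alignment.
    __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pixels));
    // The same 0xffffff00 mask replicated into all four 32-bit lanes.
    __m128i mask = _mm_set1_epi32(static_cast<int>(0xffffff00u));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(pixels),
                     _mm_and_si128(group, mask));
}
```

For instance, PackARGB(255, 10, 20, 30) yields 0xff0a141e, and after ClearBlueOfFour the same pixel reads 0xff0a1400.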

A note about the examples

All examples here use GDI+, using LockBits so that the bitmap memory may be accessed directly. The same concepts apply regardless of what you’re using for image processing. All results were measured on a Core i7-930 processor at its stock clock speed of 2.8 GHz. The image used for testing is 1024×768 pixels. The source code may be downloaded here.

Naive non-SSE function – RemoveBlue

for (int x=0;x < width; x++)
{
	for (int y=0; y < height; y++)
	{
		pRawBitmapOrig[y * bitmapData.Stride / 4 + x] &= 0xffffff00;
	}
}

Here, we have naive code which processes one pixel at a time. Execution time is 6.708 ms. Note that the y coordinate changes in the inner loop, so two consecutive iterations touch memory locations a full row (bitmapData.Stride bytes) apart. This access pattern has very poor cache locality: nearly every access lands in a different cache line, which carries a large runtime penalty.
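As a rough illustration of the fix, the sketch below keeps the two-loop structure but swaps the loop order so that x changes fastest. It assumes a 32-bit-per-pixel bitmap with a positive stride; for the 1024-pixel-wide test image that would presumably be 4096 bytes per row, so the naive inner loop jumps 4096 bytes per iteration while this one advances 4 bytes. The function name RemoveBlueRowMajor is hypothetical and does not appear in the original source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical row-major variant of the naive loop.
// strideBytes is the distance between the starts of two rows, as reported
// by BitmapData::Stride (assumed positive and a multiple of 4 here).
void RemoveBlueRowMajor(std::uint32_t* pixels, int width, int height,
                        int strideBytes)
{
    assert(strideBytes > 0 && strideBytes % 4 == 0);
    const std::size_t pixelsPerRow = static_cast<std::size_t>(strideBytes) / 4;
    for (int y = 0; y < height; y++)
    {
        // Outer loop picks the row; the inner loop walks adjacent pixels.
        std::uint32_t* row = pixels + static_cast<std::size_t>(y) * pixelsPerRow;
        for (int x = 0; x < width; x++)
        {
            row[x] &= 0xffffff00u;  // clear the blue byte
        }
    }
}
```

Unlike a flat loop over the whole buffer, this form leaves any padding at the end of each row untouched, which matters only for formats where the stride is larger than width × 4.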

Good non-SSE function – RemoveBlue2

int nPixels = height*bitmapData.Stride/4;

for (int i=0; i < nPixels; i++) pRawBitmapOrig[i] &= 0xffffff00;	// for loop reduced to one line

Here, we have code which walks through the image one consecutive pixel at a time. This is great for cache locality. When one 32-bit pixel is read, its neighbors arrive in the same cache line and the following lines will likely be prefetched, which greatly reduces runtime. In fact, this function completes in 0.538 ms. That’s over 12 times as fast as the naive code! Here is some more information about cache behavior and performance.

Naive SSE function – RemoveBlueSSE

unsigned int *pSSEArray = (unsigned int*) _aligned_malloc((bitmapData.Stride*height/4) * sizeof(unsigned int), 16);	// align to 16-byte for SSE
memcpy(pSSEArray, pRawBitmapOrig, (bitmapData.Stride*height/4) * sizeof(unsigned int));
int nPixels = height*bitmapData.Stride/4;
for (int i=0; i < nPixels; i+=4)
{
	__m128i curPixelGroup = *(__m128i*)(&pSSEArray[i]);		// This is a group of four pixels
	__m128i noBlueMask = _mm_set1_epi32 (0xffffff00);
	__m128i newPixelGroup = _mm_and_si128(curPixelGroup, noBlueMask);
	_mm_store_si128((__m128i*)&pSSEArray[i], newPixelGroup);
}
// Copy the aligned memory back into the original
memcpy(pRawBitmapOrig, pSSEArray, (bitmapData.Stride*height/4) * sizeof(unsigned int));

Many instructions in SSE and SSE2 require data to be 16-byte aligned. In this function, we allocate 16-byte aligned memory, copy the pixel data into it, manipulate that data, copy the result back to the original pixel data location, and deallocate the aligned memory (not shown). As you might have guessed, this function has a very large memory overhead, and completes in 1.458 ms. That’s worse than the good non-SSE code!
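For completeness, here is a sketch of the same copy-in, process, copy-out pattern with the deallocation step included. It uses _mm_malloc and _mm_free instead of _aligned_malloc so that it also builds outside Visual C++; that substitution, the function name RemoveBlueAlignedCopy, and the assumption that the pixel count is a multiple of 4 are choices made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2; on common compilers this also provides _mm_malloc/_mm_free

// Hypothetical variant of the naive SSE function, including cleanup.
void RemoveBlueAlignedCopy(std::uint32_t* pixels, std::size_t pixelCount)
{
    assert(pixelCount % 4 == 0);  // assumption: no leftover pixels
    const std::size_t bytes = pixelCount * sizeof(std::uint32_t);

    // 16-byte aligned scratch buffer for the aligned load/store intrinsics.
    std::uint32_t* scratch = static_cast<std::uint32_t*>(_mm_malloc(bytes, 16));
    if (scratch == nullptr) return;  // allocation failed; leave the image as-is
    std::memcpy(scratch, pixels, bytes);

    const __m128i noBlueMask = _mm_set1_epi32(static_cast<int>(0xffffff00u));
    for (std::size_t i = 0; i < pixelCount; i += 4)
    {
        __m128i* group = reinterpret_cast<__m128i*>(scratch + i);
        _mm_store_si128(group, _mm_and_si128(_mm_load_si128(group), noBlueMask));
    }

    std::memcpy(pixels, scratch, bytes);  // copy the result back
    _mm_free(scratch);                    // release the aligned buffer
}
```

The two memcpy calls and the allocation each touch the whole image, which is where the extra time goes: the AND itself is a small part of the work.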

Better SSE function – RemoveBlueSSE2

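This time, instead of using the _mm_store_si128 SSE2 intrinsic, we use the _mm_storeu_si128 SSE2 intrinsic. This way, we can store results directly to the original bitmap even if its address is unaligned, and remove the last memcpy. The aligned allocation and the first copy are still there, so the gain is modest: the runtime of this function is 1.251 ms.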
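The listing for this version is not reproduced here, so the following is only a reconstruction from the description above, not the author’s code. It assumes the loads still come from an aligned copy while the stores go straight to the bitmap; the function name RemoveBlueStoreUnaligned and the use of _mm_malloc/_mm_free are choices made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2; on common compilers this also provides _mm_malloc/_mm_free

// Hypothetical reconstruction: aligned loads from a copy, unaligned stores
// directly into the original bitmap memory.
void RemoveBlueStoreUnaligned(std::uint32_t* pixels, std::size_t pixelCount)
{
    assert(pixelCount % 4 == 0);  // assumption: no leftover pixels
    const std::size_t bytes = pixelCount * sizeof(std::uint32_t);

    std::uint32_t* scratch = static_cast<std::uint32_t*>(_mm_malloc(bytes, 16));
    if (scratch == nullptr) return;
    std::memcpy(scratch, pixels, bytes);  // the one remaining copy

    const __m128i noBlueMask = _mm_set1_epi32(static_cast<int>(0xffffff00u));
    for (std::size_t i = 0; i < pixelCount; i += 4)
    {
        // Aligned load from the scratch copy.
        __m128i group = _mm_load_si128(reinterpret_cast<const __m128i*>(scratch + i));
        // Unaligned store into the bitmap, so no copy-back is needed.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(pixels + i),
                         _mm_and_si128(group, noBlueMask));
    }
    _mm_free(scratch);
}
```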

Even better SSE function – RemoveBlueSSE3

int nPixels = height*bitmapData.Stride/4;
for (int i=0; i < nPixels; i+=4)
{
	__m128i curPixelGroup = _mm_loadu_si128((__m128i*)(&pRawBitmapOrig[i]));		// This is a group of four pixels
	__m128i noBlueMask = _mm_set1_epi32 (0xffffff00);
	__m128i newPixelGroup = _mm_and_si128(curPixelGroup, noBlueMask);
	_mm_storeu_si128((__m128i*)&pRawBitmapOrig[i], newPixelGroup);		// store with unaligned instruction
}

In this version, we use _mm_loadu_si128 so that we can load unaligned data into the SSE registers. This means we don’t have to allocate specially aligned memory or copy the bitmap. It turns out that most of the time, GDI+ bitmaps are aligned to begin with, so there is little to no performance hit from using the unaligned store and load intrinsics. This function completes in 0.243 ms. That’s 2.2 times as fast as the good non-SSE code! Keep in mind this problem is limited by memory bandwidth on some systems. So even though we’re barely doing any processing at all, it still makes sense to use SSE in order to roughly double your performance.
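The loop above steps four pixels at a time, which works for the 1024×768 test image because its pixel count is a multiple of 4. A sketch of a more general form follows: it hoists the mask out of the loop and finishes any leftover pixels with scalar code. The names IsAligned16 and RemoveBlueUnaligned are hypothetical; the alignment check is only a way to verify the claim about GDI+ bitmaps on your own data, not something the function depends on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: true if the pointer sits on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical in-place version that accepts any pixel count and alignment.
void RemoveBlueUnaligned(std::uint32_t* pixels, std::size_t pixelCount)
{
    assert(pixels != nullptr || pixelCount == 0);
    const __m128i noBlueMask = _mm_set1_epi32(static_cast<int>(0xffffff00u));

    std::size_t i = 0;
    // Main loop: four pixels per iteration with unaligned load and store.
    for (; i + 4 <= pixelCount; i += 4)
    {
        __m128i* group = reinterpret_cast<__m128i*>(pixels + i);
        _mm_storeu_si128(group, _mm_and_si128(_mm_loadu_si128(group), noBlueMask));
    }
    // Tail: at most three remaining pixels, handled one at a time.
    for (; i < pixelCount; i++)
    {
        pixels[i] &= 0xffffff00u;
    }
}
```

Calling IsAligned16 on the pointer returned by LockBits is a cheap way to see whether a given bitmap happens to start on a 16-byte boundary.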

More advanced SSE image processing

This tutorial took a look at some very simple image manipulation with SSE. But for more advanced image processing techniques, you may need to start treating your color components as floating point values. For more information about this topic, read advanced image processing with SSE.

Function                Time per call (ms)
RemoveBlue (no SSE)     6.70816063
RemoveBlue2 (no SSE)    0.53816052
RemoveBlueSSE           1.458046264
RemoveBlueSSE2          1.251234674
RemoveBlueSSE3          0.242780743
RemoveBlueSSE4          0.242357937
RemoveBlueSSE5          0.243047778