The SSE instruction set can be a very useful tool in developing high performance applications. SSE, or Streaming SIMD Extensions, is particularly helpful when you need to perform the same instructions over and over again on different pieces of data. SSE vectors are 128 bits wide, which lets you perform a calculation on 4 different floating point numbers at the same time. SSE can also be configured to work on 2 64-bit floating point numbers concurrently, 4 32-bit integers, or even 16 8-bit chars. By properly utilizing the SSE instruction set, you gain access to a large amount of parallel computing power that would otherwise go untapped. In some cases, using SSE optimization can provide just as much speedup as going from a single core to a quad core processor.

Example for this tutorial

For this tutorial, we are going to write a simple program that will calculate Y = sqrt(x)/x, for values of x ranging from 1 to 64000. This is a fairly typical scenario for any application that needs to plot a graph. We are going to be using compiler intrinsics as exposed by Microsoft Visual Studio. Generally, I encourage cross-platform capability, but Microsoft's compiler intrinsics are simply the easiest and overall best way to write applications utilizing SSE instructions. Other compilers, such as GCC, provide essentially the same SSE intrinsics, so this is just a good place to start.
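For reference, here is the plain scalar version of this computation, essentially the CPU-only loop from the full source listing at the end of this article (it assumes pResult is a float array holding length values):

float xFloat = 1.0f;
for (int i = 0; i < length; i++)
{
	pResult[i] = sqrtf(xFloat) / xFloat;	// compute one value of Y per iteration
	xFloat += 1.0f;
}

The goal of the rest of this tutorial is to compute the same array four values at a time.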

Starting the project

Create a new project with Visual Studio. A console project will work just fine. Be sure to include xmmintrin.h. It is important to note that this header file may be named differently depending on which compiler you are using.

#include <xmmintrin.h>  // Need this for SSE compiler intrinsics

Alignment

For this tutorial, we will be calculating the values of Y with SSE and storing them into a floating point array. It is of the utmost importance that the array be created and aligned to a 16-byte boundary. This will ensure that the values can be loaded into, and stored from, the SSE hardware as efficiently as possible. Failure to properly align data being used with SSE instructions will result in a huge performance hit. Since large datasets are generally required in order to justify using SSE, we'll be dynamically allocating our array so it doesn't reside on the stack.

float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16);  // align to 16-byte for SSE
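Note that memory obtained with _aligned_malloc must eventually be released with _aligned_free rather than the regular free:

_aligned_free(pResult);	// matching release for _aligned_malloc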

For this tutorial, we won’t worry about storing the actual values of X, since they could easily be derived at a later time if needed. So instead of using an array to store X, we’ll be using a special vector variable which is explicitly meant to be used for SSE instructions.

__m128 x;

We have declared the variable 'x' as a 128-bit vector. We haven't told the compiler that we're going to be using it as 4 floating point numbers because it doesn't matter at this point. Before we continue, here is a brief look at what we're trying to accomplish:

Set x to <1,2,3,4>
for (int i = 0; i < total/4; i++)
{
	Y[i] = sqrt(x) / x;	// Y[i] is also a vector of 4 floats
	x = x + 4;		// add 4 to each element of x
}

To modify the values of the SSE variables, we'll need to use special intrinsic functions, which can easily be looked up in the MSDN library. To initialize x to contain the values <1,2,3,4>, we can use _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f). We'll get to why the arguments are in reverse order later.
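In code, the initialization looks like this (the same line appears in the full source listing):

x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);	// arguments are listed from the highest vector element to the lowest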

Performing computation with SSE vectors

The computation we will be performing is sqrt(x)/x. The best way to perform the square root is to use an intrinsic function. Using special hardware, four square roots will be calculated simultaneously, which will provide a great speedup.

__m128 xSqrt = _mm_sqrt_ps(x);

Now that we have a variable holding our square roots, we need to divide by x. There are two ways to do this. The first way is to simply use the divide intrinsic and store the result (the pResultSSE pointer used below is explained in the next section). This provides the most accurate result, so I usually recommend taking this approach.

pResultSSE[i] = _mm_div_ps(xSqrt, x);

As it turns out, division is relatively slow. A faster approach is to take the reciprocal of x using the _mm_rcp_ps intrinsic, then multiply that reciprocal by the square root using the _mm_mul_ps intrinsic. Keep in mind that _mm_rcp_ps only produces an approximate reciprocal (accurate to roughly 12 bits), which is why this method is less accurate than the division above.

__m128 xRecip = _mm_rcp_ps(x);

pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt);

How regular arrays interact with SSE vectors

Previously, we discussed how the result array had to be aligned to the 16-byte boundary. When we declared pResult, we declared it as a floating point array. In order to store SSE results into this array, we will need to use an intermediate pointer.

__m128 *pResultSSE = (__m128*) pResult;

By making a pointer of type __m128 point to the same address as the floating point array, we effectively create a way for SSE results to be stored directly into our floating point array.

BEWARE! Remember earlier, when we initialized the x vector with _mm_set_ps(4, 3, 2, 1), which looks backwards from what we might expect? As it turns out, _mm_set_ps takes its arguments from the highest vector element down to the lowest, while storing a vector to memory places the lowest element at the lowest address. So when SSE results are stored into a regular floating point array this way, the 4 floating point numbers appear reversed relative to the order they were listed in. For example, if the result of _mm_mul_ps is <2,4,6,8> (written in that same high-to-low order), the value of pResult[0] will be 8, pResult[1] will be 6, and so on. So be very careful when storing results using pointers.
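Here is a minimal sketch of the ordering, assuming xmmintrin.h and stdio.h are included (the variable names are just for illustration):

__declspec(align(16)) float out[4];
__m128 *pOut = (__m128*) out;
*pOut = _mm_set_ps(2.0f, 4.0f, 6.0f, 8.0f);	// arguments listed from highest element to lowest
printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);	// prints 8, 6, 4, 2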

Incrementing each value in a vector by 4

SSE vectors typically do not play well with regular scalar numbers. Any interaction between an SSE vector and a normal 32-bit floating point value requires moving data between the two, which can be an expensive performance hit inside a tight loop. For this reason, we're going to create another __m128 variable, xDelta, and set its value to <4,4,4,4>, as shown below. That way, inside the for loop, we simply add the two __m128 vectors together to advance x to its next set of values.
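
__m128 xDelta = _mm_set1_ps(4.0f);	// Set xDelta to (4,4,4,4)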

x = _mm_add_ps(x, xDelta);    // Advance x to the next set of numbers

Results

Below are the results obtained by testing this program with the reciprocal-multiply method, the division method, and a plain CPU version that uses no SSE instructions. As you can see, the difference in run time is dramatic. Please keep in mind that the division method produces more accurate results than the reciprocal-multiply method.

Method                           Runtime
SSE, reciprocal-multiply method  23.75 seconds
SSE, division method             38.5 seconds
CPU only, no SSE                 301.5 seconds

(Times measured on the author's test system; they also appear as comments at the end of the full source listing.)

Compatibility

Virtually all x86 processors made within the past 10 years fully support SSE. Still, it is worth noting that if you run your program on a processor which does not support SSE, it will crash. There are ways to check whether or not a processor supports SSE at runtime; because it is exceptionally rare to find an x86 processor without SSE support, I won't go into much detail, but a minimal check is sketched below.
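As a rough sketch, one way to do this with Visual Studio is the __cpuid intrinsic from intrin.h; the SSE feature flag is bit 25 of the EDX value returned by CPUID function 1 (HasSSE is just an illustrative helper name):

#include <intrin.h>		// __cpuid on Visual Studio

bool HasSSE()
{
	int cpuInfo[4];
	__cpuid(cpuInfo, 1);			// query processor feature flags
	return (cpuInfo[3] & (1 << 25)) != 0;	// EDX bit 25 indicates SSE support
}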

Conclusion

That just about wraps it up for this tutorial. There are a few key points you should keep in mind when deciding to use the SSE instruction set. Often, algorithmic changes to a program make the most impact, so you should only start writing SSE code once improving the main program algorithm is no longer an option. Writing code that uses SSE is more challenging, but the potential speedup is great. In this case, the benefit of using SSE was greater than the benefit of using two or even four processors without SSE!

For a slightly more practical example of how to use SSE, feel free to check out the intro to image processing with SSE article.

Full Source Code Listing

Download SSE source code here.

/*
	SSE_Tutorial
	This tutorial was written for supercomputingblog.com
	This tutorial may be freely redistributed provided this header remains intact
*/

#include "stdafx.h"
#include <xmmintrin.h>	// Need this for SSE compiler intrinsics
#include <math.h>		// Needed for sqrt in CPU-only version


int main(int argc, char* argv[])
{
	printf("Starting calculation...\n");

	const int length = 64000;

	// We will be calculating Y = sqrt(x) / x, for x = 1->64000

	// If you do not properly align your data for SSE instructions, you may take a huge performance hit.

	float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16);	// align to 16-byte for SSE

	__m128 x;
	__m128 xDelta = _mm_set1_ps(4.0f);		// Set the xDelta to (4,4,4,4)
	__m128 *pResultSSE = (__m128*) pResult;


	const int SSELength = length / 4;

	for (int stress = 0; stress < 100000; stress++)	// lots of stress loops so we can easily use a stopwatch

	{
#define TIME_SSE	// Define this if you want to run with SSE
#ifdef TIME_SSE
		x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);	// Set the initial values of x to (4,3,2,1)

		for (int i=0; i < SSELength; i++)
		{

			__m128 xSqrt = _mm_sqrt_ps(x);
			// Note! Division is slow. It's actually faster to take the reciprocal of a number and multiply
			// Also note that Division is more accurate than taking the reciprocal and multiplying

#define USE_DIVISION_METHOD
#ifdef USE_FAST_METHOD
			__m128 xRecip = _mm_rcp_ps(x);

			pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt);
#endif //USE_FAST_METHOD
#ifdef USE_DIVISION_METHOD
			pResultSSE[i] = _mm_div_ps(xSqrt, x);

#endif	// USE_DIVISION_METHOD
			
			// NOTE! Sometimes, the order in which things are done in SSE may seem reversed.
			// When the command above executes, the four floating elements are actually flipped around
			// We have already compensated for that flipping by setting the initial x vector to (4,3,2,1) instead of (1,2,3,4)

			x = _mm_add_ps(x, xDelta);	// Advance x to the next set of numbers

		}
#endif	// TIME_SSE
#ifndef TIME_SSE
		float xFloat = 1.0f;
		for (int i=0 ; i < length; i++)
		{

			pResult[i] = sqrtf(xFloat) / xFloat;	// Division is slow here too, but unlike SSE there is no reciprocal intrinsic to fall back on
			xFloat += 1.0f;
		}

#endif	// !TIME_SSE
	}

	// To prove that the program actually worked
	for (int i=0; i < 20; i++)
	{

		printf("Result[%d] = %f\n", i, pResult[i]);
	}

	// Results for my particular system
	// 23.75 seconds for SSE with reciprocal/multiplication method
	// 38.5 seconds for SSE with division method
	// 301.5 seconds for CPU


	_aligned_free(pResult);	// Release the aligned result buffer

	return 0;
}