Understanding how to use LockBits is essential for creating high-performance GDI+ applications. GDI+ is often thought of as a low-performance graphics API, and while there is some truth to that, you can achieve very good performance if you use it properly. Continue reading ‘Using LockBits in GDI+’ »
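To give a flavor of what the article covers (the function below is my own minimal sketch, not the article’s code), LockBits hands you a raw pointer to the pixel buffer so you can process pixels directly instead of calling GetPixel/SetPixel for every single one:

#include <windows.h>
#include <gdiplus.h>
#pragma comment(lib, "gdiplus.lib")
using namespace Gdiplus;

// Minimal sketch: lock the bitmap's pixels, touch each one through the raw
// Scan0 pointer, then unlock. This avoids the per-pixel overhead of
// Bitmap::GetPixel / SetPixel.
void InvertImage(Bitmap& bmp)
{
    BitmapData data;
    Rect rect(0, 0, bmp.GetWidth(), bmp.GetHeight());

    if (bmp.LockBits(&rect, ImageLockModeRead | ImageLockModeWrite,
                     PixelFormat32bppARGB, &data) != Ok)
        return;

    for (UINT y = 0; y < data.Height; y++)
    {
        // Stride is in bytes and may include padding, so step row by row.
        UINT* row = (UINT*)((BYTE*)data.Scan0 + y * data.Stride);
        for (UINT x = 0; x < data.Width; x++)
            row[x] = ~row[x] | 0xFF000000;  // invert RGB, keep alpha opaque
    }

    bmp.UnlockBits(&data);
}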
[Updated 6-23-2025] Over a decade ago, this article was written to show how to load a JPG with C++ using GDI+. GDI+ is now effectively obsolete, however, so this article will show you another, more platform-independent way to load a JPG (or JPEG), PNG, BMP, TGA, or HDR image in a very short amount of time, and also how to modify the image data and write an image file back to disk. Continue reading ‘How to load a JPG with C++’ »
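This excerpt doesn’t name the replacement library, but as one widely used, platform-independent option, the single-header stb_image and stb_image_write libraries handle the whole load, modify, and save round trip; the sketch below assumes those headers are available and uses placeholder filenames:

// Sketch using the public-domain stb_image / stb_image_write single-header
// libraries (one possible platform-independent choice; not necessarily the
// one the article settles on). "input.jpg" and "output.png" are placeholders.
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

int main()
{
    int width, height, channels;
    // Force 3 channels (RGB); stbi_load also handles PNG, BMP, TGA, etc.
    unsigned char* pixels = stbi_load("input.jpg", &width, &height, &channels, 3);
    if (!pixels)
        return 1;

    // Modify the image data: a simple brightness boost on every byte.
    for (int i = 0; i < width * height * 3; i++)
        pixels[i] = (pixels[i] > 205) ? 255 : pixels[i] + 50;

    // Write the result back to disk as a PNG.
    stbi_write_png("output.png", width, height, 3, pixels, width * 3);

    stbi_image_free(pixels);
    return 0;
}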
This tutorial will focus on how to create and compile an application that uses GDI+. We will be starting a project from scratch in Microsoft Visual Studio. GDI+ is a powerful, object-oriented API for doing mostly 2D graphics. Unlike GDI, GDI+ is generally much easier to use, much more difficult to misuse, and in many cases can produce higher-quality images than GDI. While GDI+ is considered slower than GDI, it is still perfectly acceptable for most applications. Continue reading ‘Getting started with GDI+ in Visual Studio’ »
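As a rough sketch of the boilerplate involved (my own example, with an arbitrary function name and drawing code, and with error handling omitted), a GDI+ program only needs to link gdiplus.lib, start the library, draw through a Graphics object, and shut the library down:

#include <windows.h>
#include <gdiplus.h>
#pragma comment(lib, "gdiplus.lib")   // or add gdiplus.lib under Linker -> Input
using namespace Gdiplus;

// Rough sketch: start GDI+, draw an anti-aliased line into a window, and
// shut GDI+ down again. hWnd is assumed to be an existing window.
void DrawDemo(HWND hWnd)
{
    GdiplusStartupInput startupInput;
    ULONG_PTR gdiplusToken;
    GdiplusStartup(&gdiplusToken, &startupInput, NULL);   // once at startup

    HDC hdc = GetDC(hWnd);
    {
        Graphics graphics(hdc);
        graphics.SetSmoothingMode(SmoothingModeAntiAlias);
        Pen pen(Color(255, 0, 0, 255), 3.0f);             // opaque blue, 3 px wide
        graphics.DrawLine(&pen, 20, 20, 200, 120);
    }   // Graphics and Pen are destroyed before GDI+ shuts down
    ReleaseDC(hWnd, hdc);

    GdiplusShutdown(gdiplusToken);                        // once at shutdown
}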
Atomic operations are often essential for multithreaded programs, especially when different threads need to access or modify the same data. Conventional multicore CPUs generally use a test-and-set instruction to manage which thread controls which data. CUDA has a much more expansive set of atomic operations. With CUDA, you can effectively perform a test-and-set using the atomicInc() instruction. However, you can also use atomic operations to actually manipulate the data itself, without the need for a lock variable. Continue reading ‘CUDA – Tutorial 5 – Performance of atomics’ »
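As a minimal sketch of that last point (the kernel and variable names below are mine, not the tutorial’s), atomicAdd() lets many threads fold their results into the same variable safely, with no lock variable at all:

// Each thread adds its element into a single global total. atomicAdd()
// manipulates the shared data directly, so no lock variable is required.
__global__ void sumKernel(const int* data, int n, int* total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, data[i]);
}

// Host side (error checking omitted): d_data and d_total are device pointers
// allocated with cudaMalloc, and *d_total is zeroed before the launch.
// sumKernel<<<(n + 255) / 256, 256>>>(d_data, n, d_total);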
The SSE instruction set can be a very useful tool in developing high-performance applications. SSE, or Streaming SIMD Extensions, is particularly helpful when you need to perform the same instructions over and over again on different pieces of data. SSE vectors are 128 bits wide, which lets you perform calculations on four different floating point numbers at the same time. SSE can also be configured to work concurrently on two 64-bit floating point numbers, four 32-bit integers, or even sixteen 8-bit chars. Continue reading ‘Getting started with SSE programming’ »
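As a small taste of what that looks like in practice (a sketch of my own, not the article’s code, and it assumes 16-byte-aligned arrays whose length is a multiple of four), here is a loop that adds four floats per iteration using the SSE intrinsics:

// Add two float arrays four elements at a time with SSE intrinsics.
#include <xmmintrin.h>

void addArrays(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(&a[i]);   // load 4 floats from a
        __m128 vb = _mm_load_ps(&b[i]);   // load 4 floats from b
        __m128 vr = _mm_add_ps(va, vb);   // 4 additions in one instruction
        _mm_store_ps(&out[i], vr);        // store 4 results
    }
}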
High-level languages such as C, C++, C#, FORTRAN, and Java all do a great job of abstracting the hardware away from the programmer. This means that programmers generally don’t have to worry about how the hardware goes about executing their program. However, in order to get the maximum amount of performance out of your programs, it is necessary to start thinking about how the hardware will actually execute them.
Continue reading ‘Taking advantage of cache coherence in your programs’ »
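To make the idea concrete (this is my own illustration, not necessarily the example the article uses), simply changing the order in which you walk a 2D array changes how well the cache is used, because one traversal reads memory one cache line at a time while the other strides across it:

// Both functions do the same work, but the row-major loop touches memory
// sequentially (cache friendly), while the column-major loop jumps
// N*sizeof(float) bytes on every step and misses the cache far more often.
#define N 2048
static float grid[N][N];

void sumRowMajor(float* total)
{
    float sum = 0.0f;
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            sum += grid[row][col];   // consecutive addresses
    *total = sum;
}

void sumColMajor(float* total)
{
    float sum = 0.0f;
    for (int col = 0; col < N; col++)
        for (int row = 0; row < N; row++)
            sum += grid[row][col];   // strided accesses
    *total = sum;
}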
There are several cross-platform libraries for creating high-performance, multithreaded programs, but the C++ language itself (prior to C++11) has no standard way to spawn threads, which means that we sometimes have to resort to compiler- or platform-specific methods to create threads for our programs. This tutorial will focus on how to easily create worker threads for your Windows (Win32) program using Microsoft Visual Studio. Continue reading ‘Writing multithreaded programs for Windows’ »
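For a preview of the Win32 route (a minimal sketch of my own, not the tutorial’s full example), spawning a worker thread and waiting for it takes only a few calls:

// Spawn one worker thread with CreateThread and wait for it to finish.
// (_beginthreadex is the usual choice when the thread uses the C runtime.)
#include <windows.h>
#include <stdio.h>

DWORD WINAPI WorkerThread(LPVOID param)
{
    int id = *(int*)param;
    printf("Hello from worker %d\n", id);
    return 0;
}

int main()
{
    int id = 1;
    HANDLE hThread = CreateThread(NULL, 0, WorkerThread, &id, 0, NULL);
    if (hThread == NULL)
        return 1;

    WaitForSingleObject(hThread, INFINITE);  // block until the worker finishes
    CloseHandle(hThread);
    return 0;
}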
As with any parallel program, there is an overhead associated with the time threads spend communicating with each other and waiting for each other to finish. This means that parallel programs are often less efficient than serial programs. In most applications, however, what we really care about is wall-clock performance: a computer user typically only cares how fast his or her program runs, not how efficiently it runs. In this article, we will be examining tangible numbers dealing with the overhead that can be associated with MPI programs.
It only makes sense to parallelize a program, or part of a program, when the work involved justifies it. But when is there enough computation to justify spending your valuable time writing a parallel program? The answer, of course, depends on the nature of the computations you will be performing.
For this article, the test program allocates a large array of up to 100,000,000 integer elements. We will be running two tests:
array[i] = i*i;
array[i] = i*i - (int)sqrt((double)i);
For different array sizes, the program measures the amount of time it takes to complete all the computations. The program was run with just one thread, and again with two threads on a dual-core processor. Unfortunately, I do not have access to a quad-core computer.
As you can see in the graph above, the results are quite interesting. When doing something as simple as a multiplication, one thread actually finishes faster than two threads. The computation is so simple and fast that the overhead alone is enough to make the program slower when run on more than one thread. However, the more complicated and time-consuming calculation, i*i - sqrt(i), tells a different story. Because square root is a relatively slow operation, the overhead is a much smaller fraction of the total work. With the more complicated calculation, the program's efficiency also improves as the parallelized dataset grows larger, which makes sense.
There are many sources of overhead in MPI programs, including the time wasted in blocking operations such as MPI_Send, MPI_Recv, and MPI_Barrier, as well as the communication itself. In this example, the two cores were on the same chip, so communication was very fast. However, the overhead of communication is hundreds or even thousands of times higher when transferring data over a network between processes running on different physical computers. So before you decide to parallelize a part of your program with MPI or any other parallel framework, it's best to make sure that the benefit of parallelization outweighs the added overhead of communication. As you can see above, the more complicated and time-consuming your calculations are, the more efficiently your program will run across multiple cores. Otherwise, you might need a very large dataset in order to receive any benefit from parallelization. Feel free to use code similar to that below to run a sanity check before committing to a full-scale program.
for (int i = nTasks*4; i < C_MAX_ARRAY_SIZE; i += (C_MAX_ARRAY_SIZE >> 4))
{
    int numOfElems = i / nTasks;
    int startIndex = numOfElems * rank;
    double startTime = MPI_Wtime();

    for (int j = startIndex; j < startIndex + numOfElems; j++)
    {
        g_testArray[j] = j*j - (int)sqrt((double)j); // do a simple computation
    }

    // each thread needs to send results to thread 0
    if (rank == 0)
    {
        // The master thread will need to receive all computations from all other threads.
        MPI_Status status;
        for (int j = 1; j < nTasks; j++)
        {
            MPI_Recv(&g_testArray[j*numOfElems], numOfElems, MPI_INT, j, 3, MPI_COMM_WORLD, &status);
        }
    }
    else
    {
        MPI_Send(&g_testArray[startIndex], numOfElems, MPI_INT, 0, 3, MPI_COMM_WORLD);
    }

    double endTime = MPI_Wtime();
    if (rank == 0)
        printf("i = %d  time = %f\n", i, (float)(endTime - startTime));
}
This tutorial will discuss how to perform atomic operations in CUDA, which are essential for many algorithms. Atomic operations are easy to use and extremely useful in many applications: they help avoid race conditions and can make code simpler to write. Continue reading ‘CUDA – Tutorial 4 – Atomic Operations’ »
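As a preview of the kind of code the tutorial covers (this particular kernel is my own sketch), a histogram is the classic case where an atomic operation turns a race condition into a single safe call:

// Build a 256-bin histogram with atomicAdd(). Without the atomic, two threads
// that see the same value could read-modify-write the same bin at the same
// time and lose one of the increments.
__global__ void histogramKernel(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // safe concurrent increment of bins[0..255]
}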
Until now, we have only talked about synchronous, blocking communication in MPI. This tutorial will focus on asynchronous, non-blocking communication with MPI, which is often the key to achieving high performance in MPI applications. Using asynchronous communication has several advantages. Continue reading ‘MPI – Tutorial 5 – Asynchronous communication’ »
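To preview the pattern (a sketch of my own, not the tutorial’s code), MPI_Isend and MPI_Irecv return immediately, letting you overlap computation with communication before waiting on the outstanding requests:

// Non-blocking exchange between two ranks: post the receive and the send,
// do useful work while the messages are in flight, then wait for completion.
// (The buffer contents, tag, and two-rank assumption are placeholders.)
#include <mpi.h>

void exchange(int rank, double* sendBuf, double* recvBuf, int count)
{
    MPI_Request requests[2];
    int partner = (rank == 0) ? 1 : 0;            // assumes exactly 2 ranks

    MPI_Irecv(recvBuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &requests[0]);
    MPI_Isend(sendBuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &requests[1]);

    // ... overlap: do computation that does not touch the buffers ...

    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE); // both calls complete here
}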