Optimization | The Supercomputing Blog

Posts tagged ‘Optimization’

Advanced Image Processing with SSE

In a previous article about image processing with SSE, we used some basic SSE intrinsics to perform a simple image manipulation routine: removing all blue from an image. That task was easy because each pixel had 4 components (ARGB) of 8 bits each. However, for more advanced image processing functions such as 2D convolution, it is preferable to work with each color component as a 32-bit floating point number rather than an 8-bit unsigned integer.
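
As a rough illustration of that conversion (a minimal sketch, not the article's own code), the block below widens one 8-bit-per-component ARGB pixel into four 32-bit floats using SSE2, and packs it back again. The helper names `argb_to_floats`, `floats_to_argb`, and `pixels_to_floats` are hypothetical, and the sketch assumes a little-endian x86 target, so the float lanes come out in B, G, R, A order.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: expands one 32-bit ARGB pixel into four floats.
// Lane order follows little-endian byte order: lane 0 = B, 1 = G, 2 = R, 3 = A.
inline __m128 argb_to_floats(std::uint32_t pixel) {
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_cvtsi32_si128(static_cast<int>(pixel));  // pixel in the low 32 bits
    v = _mm_unpacklo_epi8(v, zero);   // 8-bit  -> 16-bit, zero extended
    v = _mm_unpacklo_epi16(v, zero);  // 16-bit -> 32-bit, zero extended
    return _mm_cvtepi32_ps(v);        // 32-bit integers -> 32-bit floats
}

// Hypothetical inverse: packs four floats back into one pixel.
// Values outside 0..255 are clamped by the saturating pack instructions.
inline std::uint32_t floats_to_argb(__m128 f) {
    __m128i v = _mm_cvtps_epi32(f);   // float -> int32 (current rounding mode)
    v = _mm_packs_epi32(v, v);        // int32 -> int16, signed saturation
    v = _mm_packus_epi16(v, v);       // int16 -> uint8, unsigned saturation
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(v));
}

// Converts `count` pixels into 4 * count floats. `dst` must be 16-byte aligned
// because the aligned store _mm_store_ps is used.
inline void pixels_to_floats(const std::uint32_t* src, float* dst, std::size_t count) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    for (std::size_t i = 0; i < count; ++i) {
        _mm_store_ps(dst + 4 * i, argb_to_floats(src[i]));
    }
}
```

Once the components are floats, an operation such as a convolution can accumulate weighted sums without overflowing the 8-bit range, and the result is clamped only once when it is packed back.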

Posted by admin on September 27, 2011 at 9:29 pm under C++, Graphics, Optimization, Windows.
Tags: 128-bit, Algorithm, C++, Code, Example, Floating Point, Guide, Image manipulation, Image processing, Intrinsic, Optimization, Performance, SSE, SSE2, Tutorial, Vector

Image Processing with SSE

Using SSE to process images or video is essential to achieving good performance, and most popular multimedia applications rely on it. Unfortunately, if SSE is used incorrectly, it can actually perform worse than non-SSE code. This article walks through several pieces of code and discusses the performance of each.
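
To make this concrete, here is a minimal sketch of the kind of routine involved, assuming the task is the one the follow-up article describes: removing all blue from an ARGB image. It is not the article's code. The function name `remove_blue` is hypothetical, and the split between aligned and unaligned loads is an assumption based on this post's tags.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical routine: clears the blue channel of `count` ARGB pixels in place.
// Each 128-bit vector holds four 32-bit pixels, so one AND handles four pixels.
inline void remove_blue(std::uint32_t* pixels, std::size_t count) {
    assert(pixels != nullptr || count == 0);
    const __m128i mask = _mm_set1_epi32(static_cast<int>(0xFFFFFF00u));
    const bool aligned = reinterpret_cast<std::uintptr_t>(pixels) % 16 == 0;
    std::size_t i = 0;

    if (aligned) {
        // 16-byte aligned buffer: aligned loads and stores are allowed.
        for (; i + 4 <= count; i += 4) {
            __m128i* p = reinterpret_cast<__m128i*>(pixels + i);
            _mm_store_si128(p, _mm_and_si128(_mm_load_si128(p), mask));
        }
    } else {
        // Unaligned buffer: the unaligned variants must be used instead.
        for (; i + 4 <= count; i += 4) {
            __m128i* p = reinterpret_cast<__m128i*>(pixels + i);
            _mm_storeu_si128(p, _mm_and_si128(_mm_loadu_si128(p), mask));
        }
    }

    // Scalar tail for the last 0 to 3 pixels.
    for (; i < count; ++i) {
        pixels[i] &= 0xFFFFFF00u;
    }
}
```

Which path is faster, and by how much, depends on the processor and on memory bandwidth, so any variant like this should be measured rather than assumed to be a win.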

Posted by admin on August 11, 2011 at 12:11 am under C++, Graphics, Optimization, Windows.
Tags: Aligned, Cache Coherence, Graphics, Image, Integer, Memory bandwidth, Multimedia, Optimization, Performance, SSE, SSE2, Unaligned, Video

Optimizing CUDA programs for GTX 400 series

Unlike most programming languages, CUDA is coupled very closely to the hardware it runs on. While x86 processors have not changed very much over the past 10 years, CUDA hardware has changed architecture significantly several times: first with the introduction of CUDA on the GeForce 8 series, followed shortly by the 200 series, and now NVIDIA has begun selling cards in the 400 series, namely the GTX 480 and GTX 470.

Posted by admin on April 24, 2010 at 11:34 am under CUDA.
Tags: 400 series, CUDA, GTX 400, GTX 470, GTX 480, Optimization

Performance of sqrt in CUDA

Taking the square root of a floating point number is a core operation in many engineering applications. Whether you are running N-body simulations, simulating molecules, or doing linear algebra, the ability to accurately and quickly perform thousands or even millions of square root operations is essential. Unfortunately, square root is a slow operation on most CPUs, even with specialized SSE instructions. GPUs, by contrast, have specialized hardware that performs square roots extremely fast. CUDA, NVIDIA’s platform for high performance parallel computing, puts that hardware to full use and easily outperforms modern Intel or AMD CPUs by a factor of over a hundred.

Posted by admin on January 19, 2010 at 11:17 pm under CUDA.
Tags: CUDA, Experiment, Optimization, Performance, Sqrt

Using LockBits in GDI+

Understanding how to use LockBits is essential for creating high performance GDI+ applications. GDI+ is usually thought of as a low performance graphics API. While arguments can be made for this, GDI+ can achieve great performance if you use it properly.

Posted by admin on December 12, 2009 at 1:01 am under Graphics.
Tags: C++, GDI, Graphics, Optimization, Tutorial

Getting started with SSE programming

The SSE instruction set can be a very useful tool in developing high performance applications. SSE, or Streaming SIMD Extensions, is particularly helpful when you need to perform the same instructions over and over again on different pieces of data. SSE vectors are 128 bits wide, and allow you to perform calculations on four 32-bit floating point numbers at the same time. With SSE2, the same 128-bit registers can also hold two 64-bit floating point numbers, four 32-bit integers, or even sixteen 8-bit chars.
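
A minimal sketch of the "four floats at a time" idea, using standard SSE intrinsics from `<xmmintrin.h>`. The function name `add_arrays` is hypothetical and is not taken from the article.

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the *_ps intrinsics
#include <cassert>
#include <cstddef>

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
// Each __m128 holds four 32-bit floats, so one _mm_add_ps performs four additions.
inline void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    // Scalar tail for the last 0 to 3 elements.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The unaligned load and store variants are used here so the sketch works on any buffer. The other layouts mentioned above use the `__m128d` type for doubles and the `__m128i` type for integers, with intrinsics declared in `<emmintrin.h>`.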

Posted by admin on October 2, 2009 at 4:05 pm under Optimization.
Tags: Intrinsic, MMX, Optimization, Programming, SSE, SSE2, Tutorial

Taking advantage of cache coherence in your programs

High level languages such as C, C++, C#, FORTRAN, and Java all do a great job of abstracting the hardware away from the programmer. This means that programmers generally don’t have to worry about how the hardware goes about executing their program. However, to get the maximum performance out of your programs, it is necessary to start thinking about how the hardware is actually going to execute them.
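
One common illustration of this point, which is not necessarily the example the article itself uses, is the order in which a 2D array is traversed. The sketch below assumes a row-major matrix stored in a `std::vector<int>`; the function names `sum_row_major` and `sum_col_major` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: walks memory sequentially, so consecutive accesses
// tend to fall in the same cache line.
inline long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

// Hypothetical helper: same result, but each access jumps `cols` elements ahead,
// so for large matrices most accesses touch a different cache line.
inline long long sum_col_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

Both functions compute the same sum, and the language treats them as equivalent. Any difference in speed comes only from how the memory hierarchy handles the two access patterns, and it should be measured on the target machine.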

Posted by admin on August 24, 2009 at 1:09 pm under Optimization.
Tags: Cache, Coherence, Optimization
