Archive for the ‘CUDA’ Category

Vertex transformations are an extremely common operation for both 2D and 3D programs. A transformation can include translation, rotation, scaling, or any combination of the three. While it is beyond the scope of this article to elaborate on the fine details of vertex transformations, it all boils down to a matrix multiplication. A 3D vertex can be represented as a 1×4 matrix, [x, y, z, w], where w is usually 1, and the transformation is represented as a 4×4 matrix. To get the transformed vertex, you simply multiply the vertex by the transformation matrix, and the result is also a convenient 1×4 matrix. For a more detailed explanation, you can read about the transformation matrix here. Continue reading ‘CUDA Tutorial – 3d vertex transformations’ »
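To make the multiplication concrete, here is a minimal sketch of what such a kernel might look like. The kernel name, the float4 vertex layout, and the constant-memory matrix are illustrative assumptions, not the article's actual code:

```
// Sketch: transform an array of vertices by a 4x4 matrix (row-major),
// one thread per vertex. Names and layout are illustrative assumptions.
__constant__ float c_matrix[16];   // 4x4 transformation matrix

__global__ void transformVertices(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v = in[i];   // row vector [x, y, z, w]
    float4 r;
    r.x = v.x * c_matrix[0] + v.y * c_matrix[4] + v.z * c_matrix[8]  + v.w * c_matrix[12];
    r.y = v.x * c_matrix[1] + v.y * c_matrix[5] + v.z * c_matrix[9]  + v.w * c_matrix[13];
    r.z = v.x * c_matrix[2] + v.y * c_matrix[6] + v.z * c_matrix[10] + v.w * c_matrix[14];
    r.w = v.x * c_matrix[3] + v.y * c_matrix[7] + v.z * c_matrix[11] + v.w * c_matrix[15];
    out[i] = r;   // the transformed 1x4 result
}
```

Because every vertex is independent, one thread per vertex maps perfectly onto the GPU.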

In the previous tutorial, Intro to image processing with CUDA, we examined how easy it is to port simple image processing functions over to CUDA. In this tutorial, we’ll go over a substantially more complex algorithm, and how to port it to CUDA with incredible ease. Continue reading ‘Advanced Image Processing with CUDA’ »

CUDA is great for any compute intensive task, and that includes image processing. In this tutorial, we’ll go over why CUDA is ideal for image processing, and how easy it is to port normal C++ code to CUDA. Continue reading ‘Intro to image processing with CUDA’ »
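As a taste of what such a port looks like, consider a per-pixel operation. On the CPU it is a loop over every pixel; in CUDA, each thread simply handles one pixel. The invert filter below is a minimal sketch under that assumption, not the tutorial's actual code:

```
// CPU version: for (int i = 0; i < n; i++) out[i] = 255 - in[i];
// CUDA version: the loop disappears, one thread per pixel.
// Kernel name and the invert operation are illustrative assumptions.
__global__ void invertPixels(const unsigned char *in, unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 255 - in[i];   // each thread inverts exactly one pixel
}
```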

Understanding the basic memory architecture of whatever system you’re programming for is necessary to create high performance applications. Most desktop systems consist of large amounts of system memory connected to a single CPU, which may have two or three levels of fully coherent cache. Before you get started with CUDA, you should read this article to understand the basic memory hierarchy of modern CUDA capable compute devices. Continue reading ‘CUDA Memory and Cache Architecture’ »
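As a quick preview, a single kernel can touch several distinct memory spaces, each with very different performance characteristics. The sketch below is illustrative only (the names and the 256-thread block size are assumptions), showing global, constant, and shared memory together:

```
// Sketch of CUDA memory spaces; launch with 256-thread blocks.
__constant__ float c_scale;          // constant memory: cached, read-only

__global__ void scaleValues(const float *g_in, float *g_out, int n)
{
    __shared__ float s_tile[256];    // shared memory: fast on-chip, per-block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        s_tile[threadIdx.x] = g_in[i] * c_scale;   // read global and constant memory
    __syncthreads();                               // every thread reaches the barrier
    if (i < n)
        g_out[i] = s_tile[threadIdx.x];            // write back to global memory
}
```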

Searching is a common task in computer science, and fortunately, it is also perfectly suited for CUDA. For this article, we’re talking about searching through an unsorted text file for a specific word or phrase. For example, if you have a 50 megabyte text file open in Microsoft Visual Studio, you’re sure to notice that searching for a word can take several seconds, longer than anyone wants to wait just to find a word in a document. This article will demonstrate a simple kernel that performs basic string matching.

Continue reading ‘Search algorithm with CUDA’ »
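As a preview of the approach, a brute-force search maps naturally onto CUDA: each thread tests one starting offset in the text. The kernel below is a minimal sketch under that assumption (the names and the constant-memory buffer size are illustrative, not the article's actual code):

```
// Sketch: count occurrences of a search term in a text buffer.
// Each thread checks one starting offset; matches are tallied atomically.
__constant__ char c_term[64];        // search term, copied from the host

__global__ void countMatches(const char *text, int textLen, int termLen,
                             unsigned int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + termLen > textLen) return;

    for (int j = 0; j < termLen; j++)
        if (text[i + j] != c_term[j]) return;   // mismatch at this offset

    atomicAdd(count, 1);   // every character matched starting at offset i
}
```

With millions of offsets checked in parallel, even this naive approach can be dramatically faster than a sequential scan.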

Unlike most programming languages, CUDA is coupled very closely to the hardware implementation. While x86 processors have not changed very much over the past 10 years, CUDA hardware has gone through several significant changes in architecture: first the introduction of CUDA with the GeForce 8 series, followed shortly by the 200 series, and now NVIDIA has begun selling cards in the 400 series, namely the GTX 480 and GTX 470.

Continue reading ‘Optimizing CUDA programs for GTX 400 series’ »

Taking the square root of a floating point number is essential in many engineering applications. Whether you are running n-body simulations, simulating molecules, or doing linear algebra, the ability to accurately and quickly perform thousands or even millions of square root operations is essential. Unfortunately, the square root functions on most CPUs are very time consuming, even with specialized SSE instructions. Fortunately, GPUs have specialized hardware to perform such square root operations extremely fast. CUDA, NVIDIA’s solution to extremely high performance parallel computing, puts this onboard specialized hardware to full use, and easily outperforms modern Intel or AMD CPUs by a factor of over a hundred.

Continue reading ‘Performance of sqrt in CUDA’ »
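For a sense of what this looks like in code, here is a minimal sketch (the kernel name is an illustrative assumption). sqrtf() compiles down to the GPU's fast hardware path; when you need strict IEEE rounding instead, CUDA provides the __fsqrt_rn() intrinsic, and rsqrtf() gives a fast reciprocal square root, handy for the distance terms in n-body codes:

```
// Sketch: take the square root of every element of an array in parallel.
__global__ void vectorSqrt(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(in[i]);   // single-precision hardware square root
}
```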

Atomic operations are often essential for multithreaded programs, especially when different threads need to access or modify the same data. Conventional multicore CPUs generally use a test-and-set instruction to manage which thread controls which data. CUDA has a much more expansive set of atomic operations. With CUDA, you can effectively implement a test-and-set using the atomicCAS() function. However, you can also use atomic operations to manipulate the data itself directly, without the need for a lock variable. Continue reading ‘CUDA – Tutorial 5 – Performance of atomics’ »
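The difference between the two styles is easiest to see side by side. The sketch below (function and kernel names are illustrative assumptions) builds a test-and-set style spinlock from atomicCAS(), then shows the lock-free alternative of updating the data directly with atomicAdd(). One caveat: threads in the same warp spinning on one lock can deadlock on older hardware, so locks like this are usually taken by a single thread per block:

```
// Lock-based: a spinlock built on compare-and-swap.
__device__ void acquireLock(int *lock)
{
    while (atomicCAS(lock, 0, 1) != 0)
        ;   // spin until we flip the lock from 0 to 1
}

__device__ void releaseLock(int *lock)
{
    atomicExch(lock, 0);   // release so another thread may enter
}

// Lock-free: manipulate the shared data directly, no lock variable at all.
__global__ void accumulate(const int *values, int *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, values[i]);   // indivisible read-modify-write
}
```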

This tutorial will discuss how to perform atomic operations in CUDA, which are often essential for many algorithms. Atomic operations are easy to use and extremely useful in many applications; they help avoid race conditions and can make code simpler to write. Continue reading ‘CUDA – Tutorial 4 – Atomic Operations’ »
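The classic example is a histogram, sketched below (names are illustrative assumptions): many threads may land on the same bin at once, and atomicAdd() makes each read-modify-write indivisible so no increment is lost:

```
// Sketch: build a 256-bin histogram of byte values.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);   // safe even when threads collide on a bin
}
```

Without the atomic, two threads could read the same bin value, each add one, and write back the same result, silently dropping a count.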

This tutorial will discuss how different threads can communicate with each other. In the previous tutorial, each thread operated without any interaction with, or data dependency on, other threads. However, most parallel algorithms require some amount of data to be communicated between threads. Continue reading ‘CUDA – Tutorial 3 – Thread Communication’ »
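As a preview of the mechanism, threads within a block communicate through shared memory, with __syncthreads() as the barrier that makes one thread's writes visible to the others. The block-level sum below is a minimal sketch (the kernel name and 256-thread block size are illustrative assumptions), not the tutorial's actual code:

```
// Sketch: each 256-thread block cooperatively sums its slice of the input.
// Assumes the input length is a multiple of the block size.
__global__ void blockSum(const float *in, float *blockSums)
{
    __shared__ float s_data[256];

    int tid = threadIdx.x;
    s_data[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();   // all writes land before any thread reads a neighbor

    // Tree reduction: half the threads fold in the other half each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s_data[tid] += s_data[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = s_data[0];   // one partial sum per block
}
```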