Posts tagged ‘CUDA’

Machine learning has advanced enormously over the last few years. Problems that used to take research teams months or years can now be implemented by a skilled programmer using machine learning techniques. This is the first in a series of articles that should point you in the right direction for getting started with machine learning, and deep learning in particular, as easily as possible. Continue reading ‘Getting Started in Machine Learning’ »

Vertex transformations are an extremely common operation in both 2D and 3D programs. A transformation can include translation, rotation, scaling, or any combination of the three. While it is beyond the scope of this article to elaborate on the fine details of vertex transformations, it all boils down to a matrix multiplication. A 3D vertex can be represented as a 1×4 matrix, [x, y, z, w], where w is usually 1, and the transformation is represented as a 4×4 matrix. To get the transformed vertex, you simply multiply the vertex by the transformation matrix, and the result is also a convenient 1×4 matrix. For a more detailed explanation, you can read about the transformation matrix here. Continue reading ‘CUDA Tutorial – 3d vertex transformations’ »
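To make the multiplication concrete, here is a minimal CPU-side sketch in C++ of the row-vector convention described above. It is not the kernel from the tutorial; the names Vec4, Mat4, transform, translation and transformAll are hypothetical and chosen only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-vector convention from the text: a vertex is a 1x4 matrix [x, y, z, w].
using Vec4 = std::array<float, 4>;
using Mat4 = std::array<std::array<float, 4>, 4>;

// Multiply a 1x4 vertex by a 4x4 transformation matrix; the result is 1x4.
Vec4 transform(const Vec4& v, const Mat4& m) {
    Vec4 out{};  // zero-initialized accumulator
    for (std::size_t col = 0; col < 4; ++col) {
        for (std::size_t row = 0; row < 4; ++row) {
            out[col] += v[row] * m[row][col];
        }
    }
    return out;
}

// Hypothetical helper: builds a translation matrix. With row vectors,
// the offsets sit in the bottom row of the matrix.
Mat4 translation(float tx, float ty, float tz) {
    Mat4 m{};  // all zeros
    for (std::size_t i = 0; i < 4; ++i) {
        m[i][i] = 1.0f;  // identity diagonal
    }
    m[3][0] = tx;
    m[3][1] = ty;
    m[3][2] = tz;
    return m;
}

// Transform a whole batch. Each iteration is independent of the others,
// which is why this loop maps naturally onto one GPU thread per vertex.
void transformAll(const std::vector<Vec4>& in, const Mat4& m,
                  std::vector<Vec4>& out) {
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = transform(in[i], m);
    }
}
```

With this sketch, transforming the vertex [1, 2, 3, 1] by translation(10, 20, 30) moves it to [11, 22, 33, 1]. Note that w must be 1 for the translation row to take effect, which is why w is usually 1 for positions.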

You might want to get started with OpenCL after working with another parallel computing framework. CUDA, for instance, is pretty nice, and its processing flow is well defined: input data is copied from main memory to GPU memory, the GPU does the work, and the results are copied back. That back-and-forth might call to mind some aspects of cloud computing. The catch is that the only GPUs with CUDA capabilities ship from Nvidia. This isn’t too bad for those who only work with Nvidia chipsets, which are quite common. However, there’s always going to be someone who ruins the fun by sticking with an AMD or IBM device.

Apple originally developed the Open Computing Language framework, and the non-profit Khronos Group consortium manages it. Since it’s royalty free, developers in a particular organization might want to set up a cloud computing environment to share code and the latest football scores.

This sort of technology is well suited to the world of computer games, where the OpenCL framework’s large distribution base can be particularly useful. GPU chips are used not only to render graphics, but also to perform game physics calculations. Nevertheless, that’s not the only way to use the GPU. It can be repurposed for general mathematical calculations. Cryptography and computational biology are just two of the fields that can be given a boost in this way. Everyone would rather write a biophysics formula calculator than a first person shooter, right?

Regardless, one of the best ways to get started is to make sure that the hardware you are developing for supports the OpenCL standard. Make sure that you have the right SDK and runtime files, and then you can usually proceed with development without too much trouble. Some experience with C99 might help, but it really isn’t required. C99 is the language on which OpenCL’s kernel programming language is based.

The kernel language extends C99 with constructs for parallelism, which is extremely important when working with these kinds of scenarios. However, features such as recursion, bit fields and variable-length arrays are gone. This can actually make it easier to start coding with than full C99.

In the previous tutorial, intro to image processing with CUDA, we examined how easy it is to port simple image processing functions over to CUDA. In this tutorial, we’ll be going over a substantially more complex algorithm, and how to port it to CUDA with incredible ease. Continue reading ‘Advanced Image Processing with CUDA’ »

CUDA is great for any compute-intensive task, and that includes image processing. In this tutorial, we’ll be going over why CUDA is ideal for image processing, and how easy it is to port normal C++ code to CUDA. Continue reading ‘Intro to image processing with CUDA’ »

Understanding the basic memory architecture of whatever system you’re programming for is necessary to create high performance applications. Most desktop systems consist of large amounts of system memory connected to a single CPU, which may have two or three levels of fully coherent cache. Before you get started with CUDA, you should read this to understand the basic memory hierarchy of modern CUDA capable compute devices. Continue reading ‘CUDA Memory and Cache Architecture’ »

Searching is a common task in computer science, and fortunately, it is also perfectly suited for CUDA. For this article, we’re talking about searching through an unsorted text file for a specific word or phrase. For example, if you have a 50 megabyte text file open in Microsoft Visual Studio, you’re sure to notice that searching for a word can take several seconds, which is more than any person wants to wait just to find a word in a document. This article will demonstrate a simple kernel which can perform simple string matches.
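As a rough illustration of why this problem parallelizes well, here is a CPU-side reference sketch in C++. It is not the kernel from the article; matchesAt and findAll are hypothetical names. The point is that the check at each starting position is independent of every other position, so a GPU version could plausibly assign one thread per position.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if `word` occurs in `text` starting exactly at `pos`.
// Each call reads only the text and the word, so calls for different
// positions do not depend on each other.
bool matchesAt(const std::string& text, const std::string& word,
               std::size_t pos) {
    assert(pos <= text.size());
    if (word.empty() || pos + word.size() > text.size()) {
        return false;  // nothing to match, or the word would run off the end
    }
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (text[pos + i] != word[i]) {
            return false;
        }
    }
    return true;
}

// Sequential driver: tries every starting position in turn and collects
// the positions that match, including overlapping matches.
std::vector<std::size_t> findAll(const std::string& text,
                                 const std::string& word) {
    std::vector<std::size_t> hits;
    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        if (matchesAt(text, word, pos)) {
            hits.push_back(pos);
        }
    }
    return hits;
}
```

For example, findAll("the cat and the hat", "the") reports positions 0 and 12. This brute-force approach does more total work than smarter sequential algorithms, but it needs no preprocessing of the text, which suits an unsorted file.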

Continue reading ‘Search algorithm with CUDA’ »

Unlike most programming languages, CUDA is coupled very closely to the hardware implementation. While x86 processors have not changed very much over the past 10 years, CUDA hardware has changed architecture significantly several times. CUDA was introduced with the GeForce 8 series (G80), followed shortly by the 200 series, and now nVidia has begun selling cards in the 400 series, namely the GTX 480 and GTX 470.

Continue reading ‘Optimizing CUDA programs for GTX 400 series’ »

Taking the square root of a floating point number is essential in many engineering applications. Whether you are doing n-body simulations, simulating molecules, or linear algebra, the ability to accurately and quickly perform thousands or even millions of square root operations is essential. Unfortunately, the square root functions on most CPUs are very time consuming, even with specialized SSE instructions. Fortunately, GPUs have specialized hardware to perform such square root operations extremely fast. CUDA, NVidia’s solution to extremely high performance parallel computing, puts the onboard specialized hardware to full use, and easily outperforms modern Intel or AMD CPUs by a factor of over a hundred.

Continue reading ‘Performance of sqrt in CUDA’ »

Atomic operations are often essential for multithreaded programs, especially when different threads need to access or modify the same data. Conventional multicore CPUs generally use a test-and-set instruction to manage which thread controls which data. CUDA has a much more expansive set of atomic operations. With CUDA, you can effectively perform a test-and-set using the atomicInc() function. However, you can also use atomic operations to manipulate the data itself, without the need for a lock variable. Continue reading ‘CUDA – Tutorial 5 – Performance of atomics’ »
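The difference between the two approaches can be sketched on the CPU with standard C++ atomics. This is an analogy, not CUDA code: std::atomic_flag stands in for a test-and-set lock variable, and fetch_add plays the role that a function such as atomicAdd() plays in CUDA. The function names countWithLock and countWithAtomicAdd are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Approach 1: a lock variable acquired with test-and-set guards a plain int.
int countWithLock(int numThreads, int incrementsPerThread) {
    assert(numThreads >= 0 && incrementsPerThread >= 0);
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    int counter = 0;  // not atomic; only touched while the lock is held
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                while (lock.test_and_set(std::memory_order_acquire)) {
                    // spin until the current holder clears the flag
                }
                ++counter;
                lock.clear(std::memory_order_release);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return counter;
}

// Approach 2: the atomic operation modifies the data itself, so no
// separate lock variable is needed.
int countWithAtomicAdd(int numThreads, int incrementsPerThread) {
    assert(numThreads >= 0 && incrementsPerThread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return counter.load();
}
```

Both functions return numThreads × incrementsPerThread, but the second does so with a single atomic operation per increment instead of an acquire, a modify and a release. How the two compare in speed on a GPU is a separate question that this CPU sketch does not answer.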