Taking the square root of a floating point number is essential in many engineering applications. Whether you are doing nBody simulations, simulating molecules, or linear algebra, the ability to accurately and quickly perform thousands or even millions of square root operations is essential. Unfortunately, the square root functions on most CPUs are very time consuming, even with specialized SSE instructions. Fortunately enough, GPUs have specialized hardware to perform such square root operations extremely fast. CUDA, NVidia’s solution to extremely high performance parallel computing, puts the onboard specialized hardware to full use, and easily outperforms modern Intel or AMD CPUs by a factor of over a hundred.
Posts tagged ‘Experiment’
Welcome to the second tutorial in how to write high performance CUDA based applications. This tutorial will cover the basics of how to write a kernel, and how to organize threads, blocks, and grids. For this tutorial, we will complete the previous tutorial by writing a kernel function. The goal of this application is very simple. The idea is to take two arrays of floating point numbers, and perform an operation on them and store the result in a third floating point array. We will then study how fast the code executes on a CUDA device, and compare it to a traditional CPU. The data analysis will take place toward the end of the article. Continue reading ‘CUDA – Tutorial 2 – The Kernel’ »