Vertex transformations are an extremely common operation in both 2d and 3d programs. A transformation can include translation, rotation, scaling, or any combination of the three. While the fine details of vertex transformations are beyond the scope of this article, it all boils down to a matrix multiplication. A 3d vertex can be represented as a 1×4 matrix, [x, y, z, w], where w is usually 1, and the transformation is represented as a 4×4 matrix. To get the transformed vertex, you simply multiply the vertex by the transformation matrix; the result is also a convenient 1×4 matrix. For a more detailed explanation, see any reference on transformation matrices.

## Why use CUDA for 3d vertex transformations

Because a transformation is basically just a matrix multiplication, the problem is perfectly suited for CUDA. Sure, the computation for each vertex only requires 16 multiplications and 12 additions, but it is very common for 3d models to have tens or hundreds of thousands of vertices. Large projects like James Cameron’s Avatar could easily require hundreds of millions of vertices in a single scene, all of which need to be transformed into their final positions before being rendered. CUDA’s strength lies in its massive compute capability, which can be fully exercised when presented with a large number of vertices to transform. Another reason why CUDA is well suited for 3d transformations is its sheer memory bandwidth. High-end GPUs can have 10 times the memory bandwidth of a high-end CPU from Intel or AMD, which helps in cases like this where only a few computations are needed per vertex.

## The code

The code for 3d vertex transformations in CUDA is fairly simple. In fact, we don’t need to do any fancy work at all. The kernel has three parameters: float *vertices, float *transformMatrix, and int nVertices. The vertices array is our global array of vertices, with each vertex represented by four floats. This array is also where the results are stored; while writing in place works for many applications, if you want to preserve the original vertex data you would need to create another array to store the results. The transformMatrix parameter stores a single 4×4 matrix as an array of 16 floats.
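Before the kernel can run, the vertex data and the matrix have to live in GPU global memory. Here is a sketch of that setup using the standard CUDA runtime API; the d_vertices and d_transformMatrix names match the launch code later in the article, while uploadScene, h_vertices, and h_matrix are hypothetical names for illustration:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copy vertex data and the transform matrix to the GPU,
// returning the device pointers. Error checking is omitted for brevity;
// real code should check the return value of every CUDA call.
void uploadScene(const std::vector<float>& h_vertices, const float h_matrix[16],
                 float **d_vertices, float **d_transformMatrix)
{
    size_t vBytes = h_vertices.size() * sizeof(float); // 4 floats per vertex
    cudaMalloc((void**)d_vertices, vBytes);
    cudaMalloc((void**)d_transformMatrix, 16 * sizeof(float));
    cudaMemcpy(*d_vertices, h_vertices.data(), vBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_transformMatrix, h_matrix, 16 * sizeof(float),
               cudaMemcpyHostToDevice);
}
```

After the kernel finishes, a matching cudaMemcpy with cudaMemcpyDeviceToHost brings the transformed vertices back, and cudaFree releases the device buffers.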

## Kernel structure

There are many ways you could arrange the kernel. I chose to have each thread compute exactly one result. Since each vertex is four floats, four threads are required to compute each vertex. So if you need to transform 10,000 vertices, you need to spawn 40,000 threads. Since our input data is just a one-dimensional array, I chose to have the blocks be one-dimensional as well, meaning the y and z dimensions of each block are always 1. I also decided to limit each block to 512 threads to maintain backwards compatibility all the way back to the first CUDA devices ever made.

## Grid structure

Since you will probably be transforming many more than 128 vertices (one 512-thread block’s worth) at a time, you will need a grid of blocks. The current maximum x dimension of a grid is 65535. However, at 128 vertices per block times 65535 blocks, a one-dimensional grid could only transform 8,388,480 vertices at once. Since you could easily require more than that, I decided to make the grid of blocks two-dimensional, enabling us to launch enough threads for more vertices than would ever fit in memory.

## Determining the number of threads and blocks to run

The code to determine how many threads and blocks to run is below. This isn’t really the best way to do it, but it’s a non-critical piece of code as far as speed is concerned. The code to execute the kernel is also below.

```cpp
// setup execution parameters
int nBlocksX = 1;
int nBlocksY = 1;
if (num_threads > 512) // 512 threads per block for backwards compatibility
{
    nBlocksX = (num_threads + 511) / 512; // ceiling division, so an exact
                                          // multiple of 512 doesn't add an
                                          // extra empty block
    num_threads = 512;
    if (nBlocksX > 65535) // maximum x dimension of a grid
    {
        nBlocksY = (nBlocksX + 65534) / 65535;
        nBlocksX = 65535;
    }
}
dim3 grid(nBlocksX, nBlocksY, 1);
dim3 threads(num_threads, 1, 1);

// execute the kernel
ThreeDVertexTransform<<< grid, threads >>>(d_vertices, d_transformMatrix, curNumVertices);

// check if kernel execution generated an error
getLastCudaError("Kernel execution failed");
```

## Kernel code

```cpp
__global__ void ThreeDVertexTransform(float *vertices, float *transformMatrix, int nVertices)
{
    // each thread computes exactly one number
    // four numbers per vertex [x,y,z,w] means 4 threads compute exactly
    // one vertex transformation
    // each block must be 1-D, with the number of threads being a multiple of 4 (32 is best)
    // grid of blocks may be 1-D or 2-D to accommodate a very large number of vertices

    // compute the current thread ID; each thread computes the new vertices[tid]
    int tid = (blockIdx.y * gridDim.x * blockDim.x)
            + (blockIdx.x * blockDim.x)
            + threadIdx.x;

    // bounds check to prevent buffer overruns in the last, partially full block
    if (tid >= nVertices * 4)
        return;

    // are we calculating x, y, z, or w?
    int subId = threadIdx.x % 4;

    // note: this in-place update assumes the four threads of a vertex run in
    // lockstep within a warp, so all four reads finish before any of the four
    // writes. That holds on pre-Volta hardware like the GPU used in this
    // article; on newer GPUs with independent thread scheduling, write the
    // results to a separate output array instead.
    float sum = vertices[tid - subId]     * transformMatrix[subId];
    sum      += vertices[tid - subId + 1] * transformMatrix[4 + subId];
    sum      += vertices[tid - subId + 2] * transformMatrix[8 + subId];
    sum      += vertices[tid - subId + 3] * transformMatrix[12 + subId];
    vertices[tid] = sum;
}
```

## Results

The machine used for testing has an Intel i7-930 CPU and a GTX 460 graphics card. Three tests were run. The first did the vertex transformations on the CPU alone; this test was single-threaded. I also tried a multithreaded version with OpenMP, but there was little improvement, indicating the bottleneck was not the number of CPU cores but some part of the memory and cache system. The second test ran on the GPU and included copying data to and from the GPU; if your vertex data isn’t already on the GPU, you need to copy it to the GPU’s global memory first. Finally, the third test measured the kernel runtime only, excluding memory transfers to and from the host’s main memory. Notice that the actual kernel runtime is extremely low. If you can permanently keep all your problem data on the GPU, you can achieve a massive speedup compared to the CPU. However, if you need to copy data back and forth every single time, your speedup will only be around 2x. One last thing to note: even with all the memory copies included, the GPU beat the CPU’s runtime for as few as 8192 vertices.