Writing CUDA applications using the D programming language

On the release note of Fedora 14 there is the introduction of the support for developing using the D programming language, so I’ve readed something on that language and I feel that probably I can actually use it, the main interesting feature for me is the simple interoperability with code written in C.

So just for testing purpose I decided to write a small CUDA (I’m pretty sure that you already know what is cuda) application using the D language.

Of course is not possible to write CUDA kernel and device functions directly in a D module so what is needed is to implement kernels in a cuda source file (with a proper launcher) and then use the kernel from the D module.

The application that i’ve written converts a string in uppercase, reading the input from the command line (remember is just for testing).

The CUDA kernel

As I’ve already said, the implementation of the the kernel has to be done in a cuda source file, that will be built using the cuda compiler (nvcc). So I’ve created a file, named ucase_kernel.cu, containing the implentation of the kernel and a function for launching the kernel.

The launcher is convenient but not essential: it would be possible to use the kernel directly from the D modules, but, of course, in D there isn’t the convenient “kernel<<<blocks,threads>>>(params)” syntax, so to launch the kernel there should be the appropriate calls to the routine for configuring and launching the kernel (see cudaConfigureCall, cudaSetupArgument, cudaLaunch).

The implementation of the kernel for converting a string in upper case parallely could be:

__global__ void ucase_kernel(char *s) {
	int i = threadIdx.x;
	if(s[i] >= 'a' && s[i] <= 'z')
		s[i] -= 'a' - 'A';
}

Actually nvcc uses a g++ as backend, so the binary produced from nvcc is a C++ binary and function names are mangled using the C++ “standard”,  C++ interoperability is really difficult, so the better thing is to tell to the compiler to use the C mangling for the launcher function (that has to be invoked from a D module) using the extern “C” syntax:

extern "C" {
void ucase(void *d_s, size_t len) {
	ucase_kernel<<<1,len>>>((char *) d_s);
}
}

The kernel part (and the ucase_kernel.cu source file) is complete. What we have done here is the implementation of a kernel and a launcher for that kernel, that giving it in input a string stored already on the device memory, it converts the string in uppercase parralely opering the conversion of each character in a separated thread (yes, it isn’t the best for a long string).

The D module

The runtime library provided in Fedora 14 for the D language is tango (here is just used the Console output from that library), so the upper.d source file should begin with:

module ucase;

import tango.io.Stdout;

For executing the above kernel it is needed to allocate a buffer on the device memory, copy the string to convert in it, execute the kernel and the copy the result back from the device memory (and free the allocated memory, of course: there is no garbage collector in CUDA). So what we need is to use cudaMalloc, cudaMemcpy, cudaFree and ucase. These are all external standard C function, before using it in D, it is needed to declare them; the declaration is really simple the syntax is “extern (C) ” followed by the standard C function declaration.
There is often a 1:1 compatibility for standard C types. Regarding data structurs or enumerations used as input or output types for the external C function should be declared before declaring the function self.
So in the upper.d source there is the declaration of the external C function to use (the cuda runtime routines and the kernel launcher) and the enum types used by them:

enum cudaMemcpyKind
{
  cudaMemcpyHostToHost = 0,
  cudaMemcpyHostToDevice = 1,
  cudaMemcpyDeviceToHost = 2,
  cudaMemcpyDeviceToDevice = 3
};

extern (C) int cudaMalloc(void **ptr, size_t size);
extern (C) int cudaMemcpy(void *dst, void *src, size_t count, cudaMemcpyKind kind);
extern (C) int cudaFree(void *ptr);
extern (C) void ucase(void *d_s, size_t len);

For the sake of simplicity I’m declaring the return type of the cuda runtime routines as an integer instead of cudaError_t (I had to declare the enum cudaError to use it).

Isn’t it wonderfull? Now it is possible to use cuda routines in D in the common way of .cu source files.

Now it’s time to use the ucase_kernel. So in ucase.d is implemented a function to convert a string in uppercase using the ucase_kernel, it is called ucaseGPU, and it is like the following one:

char[] ucaseGPU(char[] input) {
	void *dptr;
	char[] output;

	cudaMalloc(&amp;dptr, input.length);
	cudaMemcpy(dptr, input.ptr, input.length, cudaMemcpyKind.cudaMemcpyHostToDevice);

	ucase(dptr, input.length);

	output.length = input.length;
	cudaMemcpy(output.ptr, dptr, output.length, cudaMemcpyKind.cudaMemcpyDeviceToHost);
	cudaFree(dptr);

	return output;
}

What ucaseGPU does is:

  • allocate a buffer of the same size of the input string on the device memory;
  • copy the input string to the just allocated device memory buffer;
  • launch the kernel for doing the conversion using ucase;
  • allocate a new D dynamic array for storing the result and copy the device memory buffer on it;
  • free the device memory
  • return the result

Warning: for doing direct memory copy from and to D arrays it is used the .ptr property to get the pointer to the first element of the array.

For testing purpose there is also a ucaseCPU function that performs the conversions in a classic way:

char[] ucaseCPU(char[] input) {
	char[] output = input.dup;
	for(int i = 0; i = 'a' &amp;&amp; output[i] &lt;= 'z')
		output[i] -= 'a' - 'A';
	return output;
}

In the end there is the main function:

int main(char[][] args) {
	char[] input, cpu_output, gpu_output;

	Stdout("Executing ucase on GPU.").newline();

	foreach (i, arg; args)
		input ~= arg ~ " ";

	Stdout("Input:  '")(input)("'").newline();

	cpu_output = ucaseCPU(input);
	Stdout("CPU Output: '")(cpu_output)("'").newline();

	gpu_output = ucaseGPU(input);
	Stdout("GPU Output: '")(gpu_output)("'").newline();

	if(cpu_output == gpu_output)
		Stdout("Test PASSED").newline();
	else
		Stdout("Test FAILED").newline();

	return 0;
}

In the main function

  • the input string is obtained using the command line arguments (concatenating them);
  • it is used ucaseGPU for performing the conversion using the GPU;
  • it is used ucaseCPU for performing the conversion using the CPU;
  • it is tested that the output of both the above functions is the same;

Building the application

The application is composed of two source file: ucase_kernel.cu and ucase.d, a cuda source file and a D source file; each oh them has to be built with the appropriate compiler and the link all to build the application.

To build ucase_kernel.cu use the following command:

nvcc -c ucase_kernel.cu

Then to build ucase.d use the following command:

ldc -c ucase.d

To link the object codes together it is used the g++ compiler because nvcc uses g++ as backend so the object code produced by nvcc is a C++ binary. The application has to be linked against the tango library (and its dependencies: pthread, m, dl) and the cuda runtime library, so the command to use is:

g++ ucase.o ucase_kernel.o -ltango -lcudart -lpthread -lm -ldl -o ucase

Now its possible to run the application and test if it works:

$ ./ucase Writing CUDA applications using the D programming language
Executing ucase on GPU.
Input:  './ucase Writing CUDA applications using the D programming language '
CPU Output: './UCASE WRITING CUDA APPLICATIONS USING THE D PROGRAMMING LANGUAGE '
GPU Output: './UCASE WRITING CUDA APPLICATIONS USING THE D PROGRAMMING LANGUAGE '
Test PASSED

That’s all! Now you can write your GPU enabled applications using the D language!

If you write an interesting application think about releasing it as free software or open source software, thank you!

One thought on “Writing CUDA applications using the D programming language

  1. Hi, Thanks for the info.

    Question: Somewhere close behind the C mangling discussion there was the phrase: “converts the string in uppercase parralely opering”. Maybe you meant ‘converts the string in uppercase, to open in parallel’ or some-such?

    Thanks again for the info,
    Chuck

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s