Your first GPU kernel

In this quickstart you will execute your first GPU kernel through Silicon. The goal is to sum two arrays of N elements, element by element, into a third array of the same length.

What's a kernel?

A GPU kernel is a function written to run in parallel on a Graphics Processing Unit (GPU), executing the same code across thousands or millions of threads to process large datasets efficiently, with each thread handling a unique piece of data.

For this case, we will use a simple kernel written in Slang:

// Input buffers (read-only on the GPU).
StructuredBuffer<float> a;
StructuredBuffer<float> b;
// Output buffer (read-write on the GPU).
RWStructuredBuffer<float> result;

[shader("compute")]
[numthreads(1,1,1)]
void add(uint3 threadId : SV_DispatchThreadID)
{
    // Each thread sums exactly one pair of elements.
    uint index = threadId.x;
    result[index] = a[index] + b[index];
}
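To build intuition for what the kernel does, here is the same computation on the CPU in plain Python; each loop iteration corresponds to one GPU thread. This is only an illustration, not part of Silicon:

```python
def add(a, b):
    """CPU equivalent of the `add` kernel: one iteration per GPU thread."""
    result = [0.0] * len(a)
    for index in range(len(a)):  # each GPU thread handles one index
        result[index] = a[index] + b[index]
    return result

print(add([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```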

Before running the kernel you'll need to compile it for your native backend. You can get more information on how to use Slang in their documentation.
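For example, with the slangc compiler you might target Metal or PTX like this (assuming the kernel above is saved as add.slang; the exact target names depend on your slangc build, so check `slangc -h`):

```shell
slangc add.slang -target metal -o add.metal   # Apple backend
slangc add.slang -target ptx -o add.ptx       # NVIDIA backend
```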

Creating the context

We start by getting the first device we find and creating a context. A context is used to allocate resources, create queues and load modules.
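Silicon's exact API isn't reproduced here; the following Python-style sketch only illustrates the step, and every name in it (`silicon`, `get_devices`, `create_context`) is a hypothetical placeholder:

```python
import silicon  # hypothetical module name

device = silicon.get_devices()[0]  # take the first device we find
context = device.create_context()  # used to allocate resources,
                                   # create queues and load modules
```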

Loading the module

A module is a program that can contain zero, one or multiple functions. Its format changes between the various backends; Silicon loads it from the compiled file produced in the previous step.
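As a hypothetical sketch (the names are placeholders, not Silicon's actual API), loading might look like:

```python
# "add" module compiled for the current backend; the suffix varies.
module = context.load_module("add.metal")  # hypothetical call
```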

The suffix depends on the chosen backend:

  • For Apple systems, you need a .metal file, containing the source code of the program.

  • For NVIDIA systems, you need a .ptx file, containing the assembly code of the program.

  • For OpenCL systems, you need a .cl file, containing the source code of the program.

You can query the context for the current backend.
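Assuming the context exposes a backend identifier (the `context.backend` attribute below is a hypothetical placeholder), a small helper can pick the right suffix. The mapping itself is plain Python and mirrors the list above:

```python
def module_suffix(backend):
    """Map a backend name to the module file suffix it expects."""
    suffixes = {
        "metal": ".metal",   # Apple: source code
        "cuda": ".ptx",      # NVIDIA: assembly code
        "opencl": ".cl",     # OpenCL: source code
    }
    return suffixes[backend]

# Hypothetical usage: path = "add" + module_suffix(context.backend)
print(module_suffix("cuda"))  # .ptx
```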

Loading the kernel

Once the module is loaded, you can get the kernel that needs to be executed.
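As a hypothetical sketch (placeholder names, not Silicon's actual API):

```python
kernel = module.get_kernel("add")  # hypothetical call; "add" matches the
                                   # function name in the Slang source
```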

Creating the buffers

For our project we will need three buffers: two input buffers and one output buffer.

Silicon allows the developer to create a buffer in GPU memory and automatically copy the values from a host array. It also allows you to allocate empty buffers with a specific size.
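A hypothetical sketch of both styles (every name here is a placeholder, not Silicon's actual API):

```python
n = 1024
a_buf = context.create_buffer([1.0] * n)     # copies host values to the GPU
b_buf = context.create_buffer([2.0] * n)
result_buf = context.create_empty_buffer(n)  # empty buffer of n floats
```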

Specifying arguments

Before executing a kernel, you'll need to specify its arguments. In our case, these are the two input buffers and the output buffer.

You don't need to worry about types and sizes; Silicon handles everything for you.
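A hypothetical sketch of the binding step (placeholder names):

```python
kernel.set_arguments(a_buf, b_buf, result_buf)  # hypothetical call; order
                                                # matches the kernel's buffers
```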

Additionally, you'll also need to specify a global size and a group size.

  • Global size: The total amount of work for each dimension.

  • Group size: The total amount of work for each group in each dimension.

What's a group?

GPUs do not execute every thread independently. Instead, threads are grouped into work groups; each work group has a defined number of threads, each executing the same kernel on different data.
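With numthreads(1,1,1) each group holds a single thread, so the number of groups equals the global size. In general, the number of groups per dimension is the global size divided by the group size, rounded up. This relation is plain arithmetic, independent of Silicon:

```python
import math

def num_groups(global_size, group_size):
    """Groups dispatched per dimension: ceil(global / group)."""
    return [math.ceil(g / s) for g, s in zip(global_size, group_size)]

print(num_groups([1024, 1, 1], [1, 1, 1]))   # [1024, 1, 1]
print(num_groups([1000, 1, 1], [64, 1, 1]))  # [16, 1, 1]
```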

Queues & Execution

Queues are a fundamental part of GPU programming: they allow multiple kernels to be executed in order, reducing overhead without sacrificing throughput.

Finally, you'll need to dispatch the kernel and block the current thread until it completes.
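A hypothetical sketch of these two steps (placeholder names, not Silicon's actual API):

```python
queue = context.create_queue()  # hypothetical call
queue.dispatch(kernel,
               global_size=[n, 1, 1],  # one thread per element
               group_size=[1, 1, 1])   # matches numthreads(1,1,1)
queue.wait()                           # block until completion
```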

Results

Once the kernel has finished computing, you can read the results back to the CPU. This can easily be done using the buffer API and the previously created result buffer.
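A hypothetical sketch of the read-back (placeholder names):

```python
result = result_buf.read()  # hypothetical call: copy GPU memory to the CPU
# With a filled with 1.0 and b filled with 2.0, every element should be 3.0.
```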
