Your first GPU kernel

In this quickstart you will execute your first GPU kernel through Silicon. The goal is to sum two arrays of N elements, element by element, into a third array of the same length.

What's a kernel?

A GPU kernel is a function written to run in parallel on a Graphics Processing Unit (GPU), executing the same code across thousands or millions of threads to process large datasets efficiently, with each thread handling a unique piece of data.

For this case, we will use a simple kernel written in Slang:

// Input buffers (read-only on the GPU).
StructuredBuffer<float> a;
StructuredBuffer<float> b;
// Output buffer (read-write on the GPU).
RWStructuredBuffer<float> result;

[shader("compute")]
[numthreads(1,1,1)]
void add(uint3 threadId : SV_DispatchThreadID)
{
    // Each thread sums exactly one pair of elements.
    uint index = threadId.x;
    result[index] = a[index] + b[index];
}
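To build intuition for what the kernel does, here is the same computation on the CPU in plain Python; each loop iteration corresponds to one GPU thread. This is only an illustration, not part of Silicon:

```python
def add(a, b):
    """CPU equivalent of the `add` kernel: one iteration per GPU thread."""
    result = [0.0] * len(a)
    for index in range(len(a)):  # each GPU thread handles one index
        result[index] = a[index] + b[index]
    return result

print(add([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```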

Before running the kernel you'll need to compile it for your native backend. You can get more information on how to use Slang in their documentation.
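For example, with the slangc compiler you might target Metal or PTX like this (assuming the kernel above is saved as add.slang; the exact target names depend on your slangc build, so check `slangc -h`):

```shell
slangc add.slang -target metal -o add.metal   # Apple backend
slangc add.slang -target ptx -o add.ptx       # NVIDIA backend
```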

Creating the context

We start by getting the first device we find and creating a context. A context is used to allocate resources, create queues and load modules.
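Silicon's exact API isn't reproduced here; the following Python-style sketch only illustrates the step, and every name in it (`silicon`, `get_devices`, `create_context`) is a hypothetical placeholder:

```python
import silicon  # hypothetical module name

device = silicon.get_devices()[0]  # take the first device we find
context = device.create_context()  # used to allocate resources,
                                   # create queues and load modules
```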

Loading the module

A module is a program that can contain zero, one or multiple functions. Its format changes between the various backends; Silicon loads it from the compiled file produced in the previous step.
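As a hypothetical sketch (the names are placeholders, not Silicon's actual API), loading might look like:

```python
# "add" module compiled for the current backend; the suffix varies.
module = context.load_module("add.metal")  # hypothetical call
```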

The suffix depends on the chosen backend:

  • For Apple systems, you need a .metal file, containing the source code of the program.

  • For NVIDIA systems, you need a .ptx file, containing the assembly code of the program.

  • For OpenCL systems, you need a .cl file, containing the source code of the program.

You can query the context for the current backend.
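Assuming the context exposes a backend identifier (the `context.backend` attribute below is a hypothetical placeholder), a small helper can pick the right suffix. The mapping itself is plain Python and mirrors the list above:

```python
def module_suffix(backend):
    """Map a backend name to the module file suffix it expects."""
    suffixes = {
        "metal": ".metal",   # Apple: source code
        "cuda": ".ptx",      # NVIDIA: assembly code
        "opencl": ".cl",     # OpenCL: source code
    }
    return suffixes[backend]

# Hypothetical usage: path = "add" + module_suffix(context.backend)
print(module_suffix("cuda"))  # .ptx
```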

Loading the kernel

Once the module is loaded, you can get the kernel that needs to be executed.
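As a hypothetical sketch (placeholder names, not Silicon's actual API):

```python
kernel = module.get_kernel("add")  # hypothetical call; "add" matches the
                                   # function name in the Slang source
```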

Creating the buffers

For our project we will need three buffers: two input buffers and one output buffer.

Silicon allows the developer to create a buffer in GPU memory and automatically copy the values from a host array. It also allows you to allocate empty buffers with a specific size.
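A hypothetical sketch of both styles (every name here is a placeholder, not Silicon's actual API):

```python
n = 1024
a_buf = context.create_buffer([1.0] * n)     # copies host values to the GPU
b_buf = context.create_buffer([2.0] * n)
result_buf = context.create_empty_buffer(n)  # empty buffer of n floats
```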

Specifying arguments

Before executing a kernel, you'll need to specify its arguments. In our case, these are the two input buffers and the output buffer.

You don't need to worry about types and sizes; Silicon handles everything for you.
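A hypothetical sketch of the binding step (placeholder names):

```python
kernel.set_arguments(a_buf, b_buf, result_buf)  # hypothetical call; order
                                                # matches the kernel's buffers
```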

Additionally, you'll also need to specify a global size and a group size.

  • Global size: The total amount of work for each dimension.

  • Group size: The total amount of work for each group in each dimension.

What's a group?

GPUs do not execute every thread independently. Instead, threads are grouped into work groups; each work group has a defined number of threads, each executing the same kernel on different data.
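With numthreads(1,1,1) each group holds a single thread, so the number of groups equals the global size. In general, the number of groups per dimension is the global size divided by the group size, rounded up. This relation is plain arithmetic, independent of Silicon:

```python
import math

def num_groups(global_size, group_size):
    """Groups dispatched per dimension: ceil(global / group)."""
    return [math.ceil(g / s) for g, s in zip(global_size, group_size)]

print(num_groups([1024, 1, 1], [1, 1, 1]))   # [1024, 1, 1]
print(num_groups([1000, 1, 1], [64, 1, 1]))  # [16, 1, 1]
```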

Queues & Execution

Queues are a fundamental part of GPU programming: they allow multiple kernels to be executed in order, reducing overhead without sacrificing throughput.

Finally, you'll need to dispatch the kernel and block the current thread until it completes.
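A hypothetical sketch of these two steps (placeholder names, not Silicon's actual API):

```python
queue = context.create_queue()  # hypothetical call
queue.dispatch(kernel,
               global_size=[n, 1, 1],  # one thread per element
               group_size=[1, 1, 1])   # matches numthreads(1,1,1)
queue.wait()                           # block until completion
```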

Results

Once the kernel has finished computing, you can read the results back to the CPU. This can easily be done using the buffer API and the previously created result buffer.
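A hypothetical sketch of the read-back (placeholder names):

```python
result = result_buf.read()  # hypothetical call: copy GPU memory to the CPU
# With a filled with 1.0 and b filled with 2.0, every element should be 3.0.
```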
