CCCL Runtime: A Modern C++ Runtime for CUDA

SOURCE | 6 hours ago

✨ Enhance your Social Media content with NViNiO•AI™ for FREE

The NVIDIA CUDA Core Compute Libraries (CCCL) provides delightful and efficient abstractions for CUDA developers in C++ and Python. It features:

Parallel algorithms – Host-launched algorithms including sort, scan and reduce that remove the need to write custom kernels for common operations Cooperative algorithms – Device-side algorithms such as block-wide or warp-wide reductions or scans that simplify custom kernel development Language idiomatic CUDA abstractions – Fundamental abstractions for CUDA-specific operations including memory allocation, resource management, and hardware features

This post introduces a new group of functionality in CCCL that provides modernized C++ abstractions for fundamental CUDA programming model concepts that make CUDA C++ development safer and more convenient.

What is CCCL runtime?

NVIDIA CCCL runtime is a new set of idiomatic C++ APIs that implement core CUDA functionality: stream management, memory allocation, kernel launches, and more.

The familiar NVIDIA CUDA runtime was originally developed as a convenience layer on top of the CUDA driver API. The new CCCL runtime aims to be an alternative with the same goal, but with an updated design aligned with modern C++. Figure 1, below, shows the relationship between the three CUDA API surfaces mentioned above:

Both CCCL runtime API and CUDA runtime API are built on top of the CUDA driver API. Both runtimes are also easily interoperable with each other

Figure 1. Stack diagram of different CUDA API surfaces

CCCL runtime is a collection of headers within CCCL, such as <cuda/stream>, <cuda/buffer>, and <cuda/launch>. It leverages modern C++ features to provide more convenient and robust abstractions than what was possible within the C source compatibility constraints of the traditional CUDA runtime API.

We also took the opportunity to incorporate lessons learned over 20 years of CUDA evolution into the API design. Even with all these changes, CCCL runtime provides compatibility helpers that let developers adopt it incrementally without rewriting surrounding code that uses the CUDA runtime API.

As CUDA programs grow more complex, with multiple libraries sharing devices, streams, and memory, the need for APIs that compose cleanly and make dependencies explicit becomes more pressing. That is the space CCCL runtime is designed to fill.

The code

Here is the classic vectorAdd example implemented with the new CCCL runtime APIs. If you’ve written CUDA before, the overall structure will be familiar: Focus on what’s different. Don’t try to understand everything at once, the rest of this post will walk through this example to explain the semantics and design choices behind CCCL runtime.

#include <cuda/buffer> #include <cuda/devices> #include <cuda/launch> #include <cuda/memory_pool> #include <cuda/std/span> #include <cuda/stream> struct kernel { template <typename Config> __device__ void operator()(Config config, cuda::std::span<const int> A, cuda::std::span<const int> B, cuda::std::span<int> C) { auto tid = cuda::gpu_thread.rank(cuda::grid, config); if (tid < A.size()) C[tid] = A[tid] + B[tid]; } }; int main() { // 1. Devices and streams cuda::device_ref device = cuda::devices[0]; cuda::stream stream{device}; // 2. Memory allocation auto pool = cuda::device_default_memory_pool(device); int num_elements = 1000; auto A = cuda::make_buffer<int>(stream, pool, num_elements, 1); auto B = cuda::make_buffer<int>(stream, pool, num_elements, 2); auto C = cuda::make_buffer<int>(stream, pool, num_elements, cuda::no_init); // 3. Kernel launch constexpr int threads_per_block = 256; auto config = cuda::distribute<threads_per_block>(num_elements); cuda::launch(stream, config, kernel{}, A, B, C); // Make the CPU thread wait for the GPU work to finish. stream.sync(); return 0; }

The example can be broken down into the following three main sections:

1.) Devices and streams

Consider the creation of a stream using the CUDA Runtime API as the following code snippet shows.

cudaStream_t stream; cudaStreamCreate(&stream); // associated with whichever device happens to be "current"

Note this creates a stream, but the stream is associated with whichever device is current when cudaStreamCreate is called. Based on this call alone, you don’t know which device the stream is associated with.

Contrast that with using CCCL runtime API as illustrated by the code snippet that follows.

cuda::device_ref device = cuda::devices[0]; cuda::stream stream{device};

The above code snippet shows how to create a stream on a specific device. The first line illustrates a core design principle: CCCL runtime uses dedicated types instead of raw identifiers. A device is a device_ref, not a plain integer; a stream is an object, not an opaque pointer. Strong typing across the API helps catch mistakes at compile time rather than chasing them at runtime.

The second line illustrates another principle: making dependencies explicit. In both CCCL runtime and the CUDA runtime API, a stream is associated with a device. The difference is how. Here, the cuda::stream constructor takes the device as an explicit argument whereas with the CUDA runtime API the stream is associated with whichever device is active when the stream is created.

Explicit dependencies enable local reasoning. You can read a function and understand what it does without tracking the global state. They also improve composability: When multiple libraries are used, none of them need to save and restore implicit state across calls to avoid interfering with each other.

A related consequence is that CCCL runtime doesn’t expose the default stream. Managing the meaning of the default stream requires tracking the current device, which is exactly the kind of implicit state we are moving away from. While a default stream from the CUDA runtime API can still be wrapped into CCCL runtime types, its usage is discouraged; anything involving the default stream should be handled through the CUDA runtime API directly. With no default stream in the API, the notion of a “blocking stream” no longer applies, so all CCCL runtime streams are created as non-blocking.

Resource ownership: Owning types and refs

Following the example of std::string and std::string_view, many CUDA objects have two types in CCCL runtime: an owning type and a non-owning type with a _ref suffix; cuda::stream owns the underlying cudaStream_t handle and destroys it in its destructor. The cuda::stream_ref holds the handle without managing its lifetime and is trivially copyable.

The _ref types are essential for composability with existing code. If a stream handle’s lifetime is managed elsewhere, cudaStream_t implicitly converts to cuda::stream_ref, and the raw handle can be retrieved with .get(). To transfer ownership, cuda::stream::from_native_handle wraps a raw handle into the owning type, and .release() relinquishes ownership back.

void stream_type_example(cudaStream_t handle) { cuda::stream_ref non_owning{handle}; assert(handle == non_owning.get()); cuda::stream owning = cuda::stream::from_native_handle(handle); assert(handle == owning.get()); assert(handle == owning.release()); }

The same pattern applies to events, memory pools, and other CUDA objects: cuda::device_ref has no owning counterpart because there is no device state to own.

2.) Memory allocation

auto pool = cuda::device_default_memory_pool(device); auto A = cuda::make_buffer<int>(stream, pool, num_elements, 1); auto B = cuda::make_buffer<int>(stream, pool, num_elements, 2); auto C = cuda::make_buffer<int>(stream, pool, num_elements, cuda::no_init);

The next section demonstrates asynchronously allocating and initializing device memory. Here we see the next design principle: APIs are asynchronous by default. Rather than distinguishing synchronous and asynchronous variants by name, CCCL runtime uses a simple convention: If an API takes a stream as its first argument, it operates in stream order. We don’t plan to provide synchronous counterparts for APIs that have both variants in the CUDA runtime API.

Memory allocation is where this matters most in practice. Stream-ordered memory management via memory pools has been available since CUDA 11.2 (explained here), and CUDA 13.0 expanded it to managed and host memory. Memory pooling and less frequent synchronization points are in most cases essential to reach maximum performance, and stream-ordered memory management composes naturally with the rest of the asynchronous programming model. To convey those guidelines, CCCL runtime makes memory pools and stream-ordered allocation the default. On older CUDA versions and platforms, where newer memory pool types are not yet supported, we provide non-stream-ordered allocation as a fallback, but plan to remove it once pool support is universal.

In the snippet above, we first query the default memory pool for a given device, passing it as an explicit argument rather than relying on cudaMallocAsync‘s implicit device selection. The example uses the default pool which should be preferred where possible, but CCCL runtime also allows creating separate pool objects when different pool settings are needed.

The pool reference is then used to create three buffers using the new cuda::make_buffer. It takes a stream as its first argument to signal stream-ordered operation. Each buffer submits three operations to that stream: allocation from the specified pool, initialization, and eventually deallocation when the buffer goes out of scope.

Initialization is mandatory unless explicitly opted out with cuda::no_init, as with buffer C which will be overwritten by the kernel. Uninitialized device memory is a common source of hard-to-diagnose bugs, so we chose to require an explicit opt-out rather than making it the silent default. Input buffers A and B have all elements initialized to 1 and 2, respectively. Buffers support additional initialization modes as well, for example from another buffer or a range.

Buffer lifetime and deallocation

The stream passed to make_buffer is stored inside the buffer and used for deallocation when the buffer is destroyed. This means the buffer should generally hold the stream that corresponds to its usage, so that computation is properly ordered with deallocation. It is possible to change the stream later with .set_stream() or manually trigger destruction on a specific stream with .destroy(), but the default behavior is designed to do the right thing in the common case.

{ auto pool = cuda::device_default_memory_pool(device); // Equivalent to cudaMallocFromPoolAsync on the stream, possibly along with initialization pushed into the stream as well. Saves the stream for future deallocation auto buffer = cuda::make_buffer(allocation_stream, pool, /*... */); // buffer usage... } // Closing bracket will call cudaFreeAsync on allocation_stream, there is also buffer.destroy(which_stream) to keep the behavior explicit

3.) Kernel launch

struct kernel { template <typename Config> __device__ void operator()(Config config, cuda::std::span<const int> A, cuda::std::span<const int> B, cuda::std::span<int> C) { auto tid = cuda::gpu_thread.rank(cuda::grid, config); if (tid < A.size()) C[tid] = A[tid] + B[tid]; } }; // ... constexpr int threads_per_block = 256; auto config = cuda::distribute<threads_per_block>(num_elements); cuda::launch(stream, config, kernel{}, A, B, C);

The final section demonstrates configuring and launching the kernel on the GPU with cuda::launch.

cuda::launch takes three groups of arguments:

The stream to run on A configuration object that encodes the thread hierarchy (block and grid sizes) along with other launch options. Here, cuda::distribute creates a configuration that launches at least num_elements threads grouped into blocks of threads_per_block. This replaces the common pattern many CUDA developers are familiar with of (N + block_size - 1) / block_size The kernel and its arguments

Compile-time configuration flow

The most novel aspect of cuda::launch is how it moves compile-time information from the host launch site into device code through the type system. For example, notice how the block size is provided as a template argument to cuda::distribute, which means it is encoded in the configuration object’s type.

When the kernel accepts that configuration as its first argument, cuda::launch passes it through automatically. Inside the kernel, this static information is available when we compute the rank of the calling thread inside the grid:

auto tid = cuda::gpu_thread.rank(cuda::grid, config);

Because the block size is known at compile time, the rank calculation can use only the x dimension and skip the runtime block-size query entirely. This is a simple example, but the mechanism generalizes. The CCCL documentation shows further cases where configuration-embedded information is used to specialize device code.

Sometimes kernel implementation makes assumptions about the exact shape of the grid and/or block. Compile time information in the configuration object allows kernel authors to implement checks to ensure alignment of the kernel and the call site in those cases.

template <typename Config> __global__ void kernel(Config conf) { // Make sure the block is one dimensional with 256 threads static_assert(cuda::gpu_thread.static_dims(cuda::block, conf).x == 256); static_assert(cuda::gpu_thread.static_dims(cuda::block, conf).y == 1); static_assert(cuda::gpu_thread.static_dims(cuda::block, conf).z == 1); }

Kernel functors

You may have noticed the kernel is a struct with a __device__ operator() rather than a __global__ function. While cuda::launch supports existing __global__ functions, we also introduced kernel functors: types with a __device__-annotated call operator. The practical advantage is that template arguments are deduced automatically, whereas __global__ functions used with cuda::launch require explicit instantiation.

template <typename T> __global__ void kernel_function(T input) { // body ... } struct kernel_functor { template <typename T> __device__ void operator()(T input) { // body ... } }; // explicit template instantiation is required with a __global__ function cuda::launch(stream, config, kernel_function<int>, 42); // deduction from arguments for a functor with __device__ call operator cuda::launch(stream, config, kernel_functor{}, 42);

This is what makes the compile-time configuration flow work. The config template parameter is deduced from the configuration object passed by cuda::launch. Kernel functors also cover device lambdas and have additional features described in the CCCL documentation.

Automatic argument transformation

cuda::buffer owns its underlying allocation, but CUDA kernels can only accept trivially copyable arguments. When a buffer is passed to cuda::launch, it is automatically transformed to cuda::std::span. There is no need to manually construct the span or extract a raw pointer. The kernel signature reflects how the data is actually used on the device side.

What’s next

This post covered the core ideas behind CCCL runtime: explicit dependencies, strong typing, asynchronous-by-default APIs, and clean interoperability with existing CUDA code. But a walkthrough of one example can only show so much. The CCCL documentation has more detailed coverage of each API, including additional buffer initialization modes, event management, data movement, and advanced kernel launch features like dynamic shared memory and other launch attributes. CCCL runtime is available today in CCCL. We’d love to hear your feedback as you try it out.

✨ Enhance your brand's digital communication with NViNiO•Link™ : Get started for FREE here