AI Article

Under the Hood of a CUDA Kernel Launch

From triple-angle brackets to SASS assembly and doorbell registers, here is how NVIDIA orchestrates GPU execution.

Rachel Goldstein

Dev Tools Editor · Jun 29, 2026 · 6 min read

To a software engineer, launching a GPU kernel looks deceptively simple. You write a C++ function, decorate it with __global__, and invoke it with the triple-angle-bracket syntax: vadd<<<4096, 256>>>(da, db, dc, n).

But this clean abstraction hides a massive coordination effort. Between your high-level C++ code and the actual execution of warps on silicon lies a complex multi-stage compilation pipeline, hundreds of operating system system calls, and direct hardware-level signaling. Understanding this pipeline is not just an academic exercise. For anyone building modern machine learning libraries or high-performance GPU infrastructure, understanding how virtual instructions map to physical hardware is the key to diagnosing performance bottlenecks.

The Compilation Pipeline: PTX vs. SASS

When you run the CUDA compiler driver, nvcc, it does not produce a single binary executable for the GPU. Instead, it orchestrates a series of sub-compilers. By passing the --keep flag to nvcc, you can inspect the intermediate artifacts generated during compilation.

The compilation path splits immediately. The host code is sent to your standard host compiler (like GCC or Clang), while the device code is processed through two distinct compilation phases:

cicc (LLVM-based frontend): This tool compiles your CUDA C++ into Parallel Thread Execution (PTX), which is a virtual instruction set architecture (ISA) maintained by NVIDIA.
ptxas (assembler): This tool compiles the virtual PTX instructions into Shader Assembly (SASS), which is the machine code specific to a target GPU architecture.

PTX is device-agnostic and assumes an idealized GPU with an infinite register file. In the PTX representation of a vector addition kernel, variables are assigned to virtual registers like %r1 (a 32-bit integer), %rd4 (a 64-bit pointer), or %f2 (a 32-bit float). Because PTX cannot make assumptions about physical hardware, its instructions are verbose. For example, calculating a memory address requires explicit pointer conversion (cvta.to.global), widening the index to 64 bits (mul.wide.s32), and adding it to the base pointer (add.s64).

When ptxas translates PTX into SASS for a specific architecture, such as sm_89 (NVIDIA's Ada Lovelace architecture), it optimizes these operations for physical silicon. The infinite virtual registers are mapped to a limited set of physical registers (such as R1 through R9).

SASS also introduces hardware-specific instructions. The verbose address calculation in PTX is fused into a single IMAD.WIDE instruction in SASS. SASS also uses special registers to manage execution geometry. For example, the S2R (Special Register to Register) instruction copies hardware-maintained values like SR_CTAID.X (the block index) and SR_TID.X (the thread index within the block) into standard registers so the execution units can perform arithmetic on them.

The Runtime Handshake: Doorbells and Ioctls

Compiling the code is only half the battle. At runtime, the host CPU must instruct the GPU to execute the compiled SASS. This is not a simple function call; the CPU and GPU operate on entirely different memory spaces and execution queues.

When your host program calls a kernel, the CUDA driver performs a sequence of low-level operations. According to system-level tracing of a basic vector addition launch, executing a single kernel involves tens of millions of CPU instructions, multiple device files, approximately nine hundred ioctl system calls, and a write to a memory-mapped doorbell register.

The driver uses ioctl calls to allocate memory on the device (cudaMalloc) and copy data over the PCIe bus (cudaMemcpy). Once the data is in place and the kernel arguments are loaded into a driver-managed constant memory bank (constant bank 0, or c[0x0][...]), the driver prepares a command packet in a ring buffer called the pushbuffer.

To notify the GPU that work is waiting in the pushbuffer, the driver writes to a specific memory-mapped I/O (MMIO) register on the GPU known as a doorbell register. Writing to this register bypasses the operating system kernel for the actual launch, signaling the GPU's hardware command processor directly. This hardware-level signaling allows the CPU to queue work asynchronously and return control to the host application immediately, minimizing launch latency.

The Developer's Reality: Register Pressure and Occupancy

For developers, the transition from PTX to SASS is where performance is won or lost. The primary bottleneck in modern GPU kernels is rarely raw floating-point math; it is memory latency and register pressure.

Every Streaming Multiprocessor (SM) on a GPU has a fixed, physical register file shared among all active threads. When ptxas compiles PTX to SASS, it must allocate these physical registers. If your kernel code is complex, uses many local variables, or has deeply nested loops, the compiler will allocate more registers per thread.

This allocation directly impacts occupancy, which is the ratio of active warps per SM to the maximum supported active warps. If a kernel requires too many physical registers, the GPU cannot schedule as many concurrent blocks on each SM.

+-------------------------------------------------------------+
|                     High Register Usage                     |
|  Each thread uses more registers -> Fewer threads per SM    |
|  -> Low Occupancy -> Cannot hide memory latency (Stalls)    |
+-------------------------------------------------------------+
                              vs
+-------------------------------------------------------------+
|                      Low Register Usage                     |
|  Each thread uses fewer registers -> More threads per SM    |
|  -> High Occupancy -> Easily hides memory latency           |
+-------------------------------------------------------------+

Low occupancy limits the GPU's ability to hide memory latency. When a warp requests data from global memory via an LDG.E instruction, it must wait hundreds of clock cycles for the data to arrive. If occupancy is high, the hardware scheduler can instantly switch to another active warp that is ready to execute. If register pressure has forced low occupancy, there are no other warps available, and the execution units sit idle.

To manage this, developers should monitor register allocation during compilation by passing the --ptxas-options=-v flag to nvcc. This outputs the exact register count per thread:

nvcc -arch=sm_89 --ptxas-options=-v -o vadd vadd.cu

If the register count is too high, you have several options:

Use __launch_bounds__: You can annotate your kernel with execution bounds, telling the compiler the maximum block size. This forces ptxas to limit register usage to ensure the requested block size can fit on the SM.
Set -maxrregcount: You can pass a hard cap on registers to the compiler, forcing it to spill excess variables to local memory (which is backed by cache and global memory) rather than using physical registers.
Profile with Nsight Compute: Use Nsight Compute to measure theoretical versus active occupancy, allowing you to see if register pressure or shared memory limits are the primary bottleneck.

The Abstraction Trade-off

NVIDIA's software stack is remarkably successful because it makes a highly parallel, asynchronous machine look like a standard C++ programming environment. But this convenience comes at the cost of visibility.

When you write CUDA, you are writing for a virtual machine (PTX) that is heavily transformed before it ever touches silicon (SASS). By understanding how the compiler fuses instructions, how the driver uses doorbell registers to bypass the OS kernel, and how physical register limits dictate thread occupancy, you can write code that works with the hardware rather than against it.

Sources & further reading

What happens when you run a CUDA kernel? — fergusfinn.com

#Systems Programming #Compilers #Nvidia #Cuda #Gpu Infrastructure

Written by

Rachel Goldstein · Dev Tools Editor

Rachel has been embedded in the developer tooling ecosystem for nearly eight years, covering everything from IDE wars and package-manager drama to the quiet rise of AI-assisted coding. She has a soft spot for open-source maintainers and an unhealthy number of terminal emulators installed on a single laptop.

Discussion 0

Join the discussion

No comments yet

Be the first to weigh in.

Under the Hood of a CUDA Kernel Launch

The Compilation Pipeline: PTX vs. SASS

The Runtime Handshake: Doorbells and Ioctls

The Developer's Reality: Register Pressure and Occupancy

The Abstraction Trade-off

Sources & further reading

Discussion 0

Related Reading

Ornith-1.0: Coding Models That Train Their Own Agent Scaffolds

Qwen 3.6 27B Hits the Local Development Sweet Spot

Google's design.md: A Spec to Stop Agents Writing Ugly UI

How a Database Schema Error Triggered an Expensive AI Retry Storm