locomp: write a GPU kernel once, run it everywhere

GPU kernels are where performance lives — and where portability goes to die. A kernel hand-tuned for CUDA doesn't run on Apple Silicon. A Metal shader means nothing to an AMD card. Every backend is a rewrite, and every rewrite is a chance to introduce a subtle, silent correctness bug.

locomp removes the rewrite. You express a kernel once as a plain Python function, and the compiler lowers it through an SSA intermediate representation into native code for whatever hardware you're targeting.

One source, four backends

Apple Silicon via Metal
NVIDIA GPUs via CUDA
AMD GPUs via HIP / ROCm
RISC-V via the vector (RVV) extension

The same function becomes the right instructions for each device. There is no '#ifdef CUDA' sprawl and no parallel set of hand-written shaders drifting out of sync.

@locomp.kernel
def gelu(x: Float16) -> Float16:
    return 0.5 * x * (1.0 + tanh(0.7978845608 * (x + 0.044715 * x * x * x)))

# compiles to Metal, CUDA, HIP or RISC-V — same source
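
When a single source fans out to several devices, a host-side reference is a handy way to check each backend's output. The sketch below is plain Python and does not use locomp at all: `gelu_tanh` repeats the formula from the kernel above, while `gelu_exact` and `max_abs_error` are hypothetical helper names introduced here for illustration.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, same formula as the kernel above."""
    return 0.5 * x * (1.0 + math.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))


def gelu_exact(x: float) -> float:
    """Exact GELU written with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_abs_error(lo: float = -6.0, hi: float = 6.0, steps: int = 1201) -> float:
    """Largest gap between the approximation and exact GELU on an even grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(gelu_tanh(x) - gelu_exact(x)))
    return worst


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  tanh-approx={gelu_tanh(x):+.6f}  exact={gelu_exact(x):+.6f}")
    print("max abs error on [-6, 6]:", max_abs_error())
```

Values read back from a device can be compared against `gelu_tanh` with a tolerance loose enough for Float16, which carries only about three decimal digits of precision.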

Why an SSA IR matters

By normalizing every kernel into static single-assignment form first, the optimizer reasons about data flow once and the backends only worry about emitting native code. That separation is what lets one frontend serve four very different machines without compromising on speed.
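
To make the idea concrete, here is a toy lowering of the GELU expression into SSA-style instructions using Python's standard `ast` module. It is only an illustration of static single assignment: the function name `to_ssa`, the `%n` value names, and the instruction spellings (`mul`, `add`, `call`, `ret`) are assumptions made up for this sketch and are not locomp's actual IR.

```python
import ast
import itertools


def to_ssa(expr: str) -> list[str]:
    """Lower one arithmetic expression to SSA-style lines (illustrative only)."""
    counter = itertools.count()
    lines: list[str] = []
    ops = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul", ast.Div: "div"}

    def emit(text: str) -> str:
        # Every result gets a fresh name, so each value is assigned exactly once.
        name = f"%{next(counter)}"
        lines.append(f"{name} = {text}")
        return name

    def lower(node: ast.expr) -> str:
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.BinOp):
            left = lower(node.left)
            right = lower(node.right)
            return emit(f"{ops[type(node.op)]} {left}, {right}")
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            args = ", ".join(lower(arg) for arg in node.args)
            return emit(f"call {node.func.id}({args})")
        raise NotImplementedError(ast.dump(node))

    result = lower(ast.parse(expr, mode="eval").body)
    lines.append(f"ret {result}")
    return lines


GELU = "0.5 * x * (1.0 + tanh(0.7978845608 * (x + 0.044715 * x * x * x)))"

if __name__ == "__main__":
    print("\n".join(to_ssa(GELU)))
```

Because every intermediate value has exactly one definition, passes such as constant folding or common-subexpression elimination can run on this form once, and a backend only has to map each instruction to its own native equivalent.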

“Write a GPU kernel once as a plain Python function — one kernel runs everywhere, no rewrites.”
— locomp

locomp is what powers the kernels inside ZSE. It's open source, tested across M1, A100 and RISC-V, and it's how we keep the same engine fast on a laptop and in a data center.
