GPU kernels are where performance lives — and where portability goes to die. A kernel hand-tuned for CUDA doesn't run on Apple Silicon. A Metal shader means nothing to an AMD card. Every backend is a rewrite, and every rewrite is a chance to introduce a subtle, silent correctness bug.
locomp removes the rewrite. You express a kernel once as a plain Python function, and the compiler lowers it through an SSA intermediate representation into native code for whatever hardware you're targeting.
One source, four backends
- Apple Silicon via Metal
- NVIDIA GPUs via CUDA
- AMD GPUs via HIP / ROCm
- RISC-V via the vector (RVV) extension
The same function becomes the right instructions for each device. There is no '#ifdef CUDA' sprawl and no parallel set of hand-written shaders drifting out of sync.
@locomp.kernel
def gelu(x: Float16) -> Float16:
return 0.5 * x * (1.0 + tanh(0.7978845608 * (x + 0.044715 * x * x * x)))
# compiles to Metal, CUDA, HIP or RISC-V — same sourceWhy an SSA IR matters
By normalizing every kernel into static single-assignment form first, the optimizer reasons about data flow once and the backends only worry about emitting native code. That separation is what lets one frontend serve four very different machines without compromising on speed.
“Write a GPU kernel once as a plain Python function — one kernel runs everywhere, no rewrites.”
— locomp
locomp is what powers the kernels inside ZSE. It's open source, tested across M1, A100 and RISC-V, and it's how we keep the same engine fast on a laptop and in a data center.