Compilers · GPU · 2026 · v1.0 · Open source

locomp

A Python GPU Kernel Compiler for Apple Silicon, NVIDIA, AMD & RISC-V

Write a GPU kernel once as a plain Python function. locomp compiles it through an SSA intermediate representation into native Metal, CUDA, HIP or RISC-V vector code — one kernel runs everywhere, no rewrites.


4

Hardware backends from one source

85×

Faster GELU vs PyTorch (A100)

2.6×

Faster GELU vs MLX (Apple M1)

227+

Tests across M1, A100, RISC-V

Abstract

One kernel, every GPU

locomp is an open-source GPU kernel compiler. You write a GPU kernel as a plain Python function, and locomp compiles it through an SSA intermediate representation into native code for your target hardware — Metal on Apple Silicon, CUDA C on NVIDIA, HIP C on AMD ROCm, or C + RISC-V Vector (RVV) intrinsics on RISC-V.

Think Triton, but with a wider hardware reach: one @locomp.kernel runs on M1, A100, MI300X and RISC-V without rewriting a line. Triton's main targets are NVIDIA and AMD GPUs; locomp also covers Apple Silicon and RISC-V.

import locomp
import numpy as np

@locomp.kernel
def vector_add(X: locomp.Tensor, Y: locomp.Tensor, O: locomp.Tensor,
               N: locomp.constexpr):
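    # Each program instance handles one element, selected by its index.
    i = locomp.program_id(0)
    # O[i] = X[i] + Y[i], written as explicit loads and a store.
    locomp.store(O + i, locomp.load(X + i) + locomp.load(Y + i))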
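To make the execution model concrete, here is a minimal CPU emulation in plain NumPy. It assumes the kernel is launched with one program instance per element (a grid of size N); the helper emulate_vector_add is hypothetical and not part of locomp. It only mirrors what each program instance does.

```python
import numpy as np

def emulate_vector_add(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """CPU stand-in for the kernel: one loop iteration per program instance."""
    n = x.shape[0]
    out = np.empty_like(x)
    for i in range(n):          # i plays the role of locomp.program_id(0)
        out[i] = x[i] + y[i]    # load X[i], load Y[i], add, store to O[i]
    return out

x = np.arange(8, dtype=np.float32)
y = np.ones(8, dtype=np.float32)
result = emulate_vector_add(x, y)
assert np.array_equal(result, x + y)
```

On a GPU the loop iterations run in parallel rather than in sequence, which is why the kernel body contains no loop of its own.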

Pipeline

How it works

Your Python function is compiled, not interpreted. The compiled pipeline is cached per constexpr configuration, so repeated calls with the same configuration skip compilation and run with near-zero overhead.

@locomp.kernel  (Python function)
        │
   Python AST → Locomp IR (SSA, 60+ opcodes)
        │
   Optimizer (CSE · DCE · constant folding · type
              inference · strength reduction)
        │
   ├── backend="metal" → Metal Shading Language → Apple GPU
   ├── backend="cuda"  → CUDA C (nvcc -O3)      → NVIDIA GPU
   ├── backend="rocm"  → HIP C (hipcc -O3)      → AMD GPU
   └── backend="riscv" → C + RVV intrinsics     → RISC-V CPU
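As an illustration of the optimizer stage, the sketch below runs two of the listed passes, constant folding and dead-code elimination (DCE), over a toy SSA program. This is not locomp's IR or pass implementation: the Instr type, the five opcodes, and both function names are hypothetical, chosen only to show what the passes do.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    dest: str     # SSA name, assigned exactly once
    op: str       # "const", "load", "add", "mul", or "store"
    args: tuple   # operand names, or a single literal for "const"

def constant_fold(prog):
    """Rewrite add/mul whose operands are all known constants into a const."""
    consts, out = {}, []
    for ins in prog:
        if ins.op == "const":
            consts[ins.dest] = ins.args[0]
        elif ins.op in ("add", "mul") and all(a in consts for a in ins.args):
            a, b = (consts[name] for name in ins.args)
            value = a + b if ins.op == "add" else a * b
            consts[ins.dest] = value
            ins = Instr(ins.dest, "const", (value,))
        out.append(ins)
    return out

def eliminate_dead_code(prog):
    """Keep stores and everything they transitively depend on."""
    live, kept = set(), []
    for ins in reversed(prog):
        if ins.op == "store" or ins.dest in live:
            kept.append(ins)
            if ins.op != "const":      # const args are literals, not names
                live.update(ins.args)
    return kept[::-1]

prog = [
    Instr("a", "const", (4,)),
    Instr("b", "const", (8,)),
    Instr("c", "mul", ("a", "b")),        # foldable: 4 * 8
    Instr("x", "load", ("X",)),
    Instr("d", "add", ("x", "c")),
    Instr("unused", "add", ("a", "b")),   # result is never used
    Instr("s", "store", ("O", "d")),
]
optimized = eliminate_dead_code(constant_fold(prog))
for ins in optimized:
    print(ins)
```

After both passes, the multiply has become a single constant, and the unused add and the two constants that fed it are gone, leaving four instructions out of seven.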

Design

Built for serving engines

locomp is designed to drop in as the kernel-compilation layer under a GPU inference server. Instead of shipping pre-compiled shaders or PTX, the server compiles each kernel on first request and caches the result on disk.

Specialization per config: constexpr params (batch size, seq len, head dim) are baked into the generated code as literals, so each shape gets its own optimized pipeline with no dynamic dispatch overhead.
Persistent pipeline cache: compiled kernels are written to ~/.cache/locomp/, so nothing is recompiled when the server restarts after the first run.
Async dispatch + batch mode: multiple kernel calls are encoded into one command buffer, and the GPU works through them while the CPU prepares the next batch.
Backend-agnostic: the same kernel auto-selects Metal / CUDA / ROCm / RISC-V based on available hardware.
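The first two points, per-config specialization and a persistent cache, can be sketched as a two-level lookup keyed by kernel source, backend, and constexpr values. Everything here is an assumption made for illustration: the KernelCache class, the hashing scheme, the file layout, and the fake_compile stand-in are hypothetical and do not describe locomp's actual cache format. A temporary directory is used in place of ~/.cache/locomp/.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class KernelCache:
    """Two-level cache: an in-memory dict backed by files on disk."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.memory = {}
        self.compile_count = 0   # how many times the slow path ran

    def _key(self, source, backend, constexprs):
        # Same source + backend + constexpr values give the same key.
        payload = json.dumps([source, backend, sorted(constexprs.items())])
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, source, backend, constexprs, compile_fn):
        key = self._key(source, backend, constexprs)
        if key in self.memory:                 # hot path: already loaded
            return self.memory[key]
        path = self.cache_dir / f"{key}.bin"
        if path.exists():                      # compiled by an earlier process
            artifact = path.read_bytes()
        else:                                  # first request: compile and persist
            artifact = compile_fn(source, backend, constexprs)
            path.write_bytes(artifact)
            self.compile_count += 1
        self.memory[key] = artifact
        return artifact

def fake_compile(source, backend, constexprs):
    # Stand-in for a real compiler invocation.
    return f"{backend}:{sorted(constexprs.items())}:{source}".encode()

with tempfile.TemporaryDirectory() as tmp:
    cache = KernelCache(tmp)
    cache.get("gelu", "cuda", {"N": 1024}, fake_compile)   # compiles
    cache.get("gelu", "cuda", {"N": 1024}, fake_compile)   # memory hit
    cache.get("gelu", "cuda", {"N": 2048}, fake_compile)   # new shape, compiles
    restarted = KernelCache(tmp)                            # simulated restart
    restarted.get("gelu", "cuda", {"N": 1024}, fake_compile)  # disk hit
    print(cache.compile_count, restarted.compile_count)
```

Because the constexpr values are part of the key, each shape gets its own compiled artifact, and a restarted process finds earlier artifacts on disk without recompiling.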

Results

NVIDIA A100 vs PyTorch

Large-model kernels, fused with no temporary allocations.

Kernel	Speedup	Note
GELU (N=4M)	85×	Fused, no temp alloc
RoPE (N=1M)	21×	In-place
Softmax (B=256, D=1024)	4.7×	—
RMSNorm (B=128, D=4096)	4.1×	—

locomp vs PyTorch on an A100-80GB, CUDA 12.1.

Results

Apple M1 vs MLX 0.31

float32, median of 10 runs after 3 warmup runs. A speedup above 1 means locomp is faster.

Kernel	Speedup
Flash Attention (N=64)	12×
RoPE	2.9×
GELU + bias	2.6×
Reduce sum [32×1024]	1.5×
Batch norm (N=4, C=128)	1.4×
LayerNorm	1.3×

SmolLM2-135M (30 layers, GQA, RoPE, INT4) decodes at 7.9 tok/s on M1 — every kernel pure @locomp.kernel.
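The measurement protocol stated for this table (3 warmup runs, then the median of 10 timed runs) can be sketched as a small harness. The bench and speedup helpers are hypothetical and are not the script used to produce the numbers above. For GPU work, the timed callable would also have to block until the device has finished, which this sketch leaves to the caller.

```python
import statistics
import time

def bench(fn, warmup=3, runs=10):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):      # untimed: absorbs compilation and cache fill
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()                     # must not return before the work is done
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_fn, candidate_fn):
    """A value above 1 means the candidate is faster than the baseline."""
    return bench(baseline_fn) / bench(candidate_fn)

calls = []
median_seconds = bench(lambda: calls.append(1))
print(len(calls), median_seconds >= 0.0)
```

The median is less sensitive than the mean to a single slow run caused by a scheduler hiccup or thermal throttling.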

Results

AMD MI300X — memory bandwidth

HIP C compiled with hipcc --offload-arch=gfx942 -O3, measured on real hardware (192 GB HBM3, 5,300 GB/s theoretical peak).

Kernel	Bandwidth	% peak
vector_add	3,961 GB/s	74.7%
relu	3,587 GB/s	67.7%
scale_shift	3,440 GB/s	64.9%
gelu	3,260 GB/s	61.5%
rmsnorm	1,119 GB/s	21.1%

locomp targets AMD ROCm from the same Python source: bandwidth-bound kernels reach 62–75% of HBM3 peak on the MI300X, with rmsnorm the outlier at 21%.
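The "% peak" column is the measured bandwidth divided by the 5,300 GB/s theoretical peak. The sketch below reproduces that column from the reported bandwidths and shows how an effective bandwidth follows from bytes moved and elapsed time. The element count n and the resulting implied time are illustrative assumptions, not measurements from the benchmark.

```python
PEAK_GBPS = 5300.0   # MI300X theoretical HBM3 peak, as stated in the text

def effective_bandwidth_gbps(bytes_moved, seconds):
    """Bytes read plus bytes written, divided by elapsed time, in GB/s."""
    return bytes_moved / seconds / 1e9

def percent_of_peak(gbps, peak=PEAK_GBPS):
    return 100.0 * gbps / peak

# Reproduce the "% peak" column from the reported bandwidths.
reported = {"vector_add": 3961, "relu": 3587, "scale_shift": 3440,
            "gelu": 3260, "rmsnorm": 1119}
for name, gbps in reported.items():
    print(f"{name:12s} {percent_of_peak(gbps):5.1f}%")

# vector_add on n float32 elements reads X and Y and writes O: 3 * 4 * n bytes.
n = 1 << 26                       # assumed problem size, for illustration only
bytes_moved = 3 * 4 * n
# Time per launch that the reported 3,961 GB/s would imply at this size.
implied_seconds = bytes_moved / (reported["vector_add"] * 1e9)
print(f"{implied_seconds * 1e6:.1f} microseconds per launch")
```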

End to end

A full LLM on pure kernels

SmolLM2-135M runs entirely on @locomp.kernel — no PyTorch, no MLX, no Metal C++. The complete inference loop is just ten pure Python kernel functions: rms_norm, matvec, silu_mul, rope, gqa_attn, kv_cache_update, and a handful of element-wise ops.

$ python examples/54_smollm2_inference.py

SmolLM2-135M — locomp GPU inference
Loading weights... 272 tensors, 538MB

Prompt: "The meaning of life is"
Output: "to be found in the meaning of the universe."
Decode:  7.9 tok/s
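For orientation, the math behind two of the kernels named above can be written as NumPy references. These are the standard definitions of RMS normalization and SiLU gating, not locomp's kernel source. The function names, the eps value, and the reading of silu_mul as silu(gate) * up are assumptions.

```python
import numpy as np

def rms_norm_ref(x, weight, eps=1e-5):
    """x: (..., D). Scale each row by the inverse of its root mean square."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu_mul_ref(gate, up):
    """Gated activation: silu(gate) * up, where silu(z) = z * sigmoid(z)."""
    return gate / (1.0 + np.exp(-gate)) * up

x = np.full((4, 8), 2.0, dtype=np.float32)
w = np.ones(8, dtype=np.float32)
normed = rms_norm_ref(x, w)          # every row has RMS 2, so output is ~1
gated = silu_mul_ref(np.zeros(8, dtype=np.float32), w)   # silu(0) = 0
print(normed[0, 0], gated[0])
```

References like these are also a convenient oracle for checking a GPU kernel's output within a floating-point tolerance.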

Landscape

Where locomp fits

locomp is the only Python kernel compiler that targets Apple Silicon, NVIDIA CUDA, AMD ROCm and RISC-V from a single source — and the only one with CPU + GPU autograd, built-in auto-tuning and native INT4/INT8 quantization support.

Capability	locomp	Triton	MLX
Apple Silicon	✓	✗	✓
NVIDIA CUDA	✓	✓	✗
AMD ROCm	✓	✓	✗
RISC-V RVV	✓	✗	✗
Auto-tuning	✓	✓	✗
Autograd	CPU+GPU	✗	✓

Validation

Tested everywhere

Apple M1 & M4 — 227 passed, 22 skipped.
NVIDIA A100 (Modal) — 64/64 execution checks; A10G — 17/17 CUDA codegen + runtime.
RISC-V RVV under QEMU — 9/9 execution tests, all matching NumPy.
63 kernel examples shipped, from vector add to flash attention and full LLM inference. Apache-2.0 licensed.
