Compilers · GPU · 2026 · v1.0 · Open source

locomp

A Python GPU Kernel Compiler for Apple Silicon, NVIDIA, AMD & RISC-V

Write a GPU kernel once as a plain Python function. locomp compiles it through an SSA intermediate representation into native Metal, CUDA, HIP or RISC-V vector code — one kernel runs everywhere, no rewrites.

  • 4 hardware backends from one source
  • 85× faster GELU vs PyTorch (A100)
  • 2.6× faster GELU vs MLX (Apple M1)
  • 227+ tests across M1, A100, RISC-V
Abstract

One kernel, every GPU

locomp is an open-source GPU kernel compiler. You write a GPU kernel as a plain Python function, and locomp compiles it through an SSA intermediate representation into native code for your target hardware — Metal on Apple Silicon, CUDA C on NVIDIA, HIP C on AMD ROCm, or C + RISC-V Vector (RVV) intrinsics on RISC-V.

Think Triton, but with a wider set of targets: one @locomp.kernel runs on M1, A100, MI300X and RISC-V without rewriting a line. Triton focuses on NVIDIA and AMD GPUs; locomp also covers Apple Silicon and RISC-V.

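import locomp
import numpy as np

# N is a compile-time constant (constexpr); this minimal kernel does not use it.
@locomp.kernel
def vector_add(X: locomp.Tensor, Y: locomp.Tensor, O: locomp.Tensor,
               N: locomp.constexpr):
    # Each program instance handles the element at index program_id(0).
    i = locomp.program_id(0)
    # O[i] = X[i] + Y[i], written as pointer-style loads and a store.
    locomp.store(O + i, locomp.load(X + i) + locomp.load(Y + i))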
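To make the execution model concrete, here is a minimal pure-Python emulation of what the kernel above computes. It assumes that program_id(0) returns the index of the current program instance and that one instance is launched per element; the launch call itself is not shown in the source, so `launch` and `vector_add_reference` are hypothetical names used only for illustration.

```python
def launch(kernel, grid, *args):
    """Hypothetical launcher: run the kernel body once per program index."""
    for pid in range(grid):
        kernel(pid, *args)


def vector_add_reference(pid, x, y, out):
    # Mirrors the kernel body: out[i] = x[i] + y[i], with i = program index.
    out[pid] = x[pid] + y[pid]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

# One program instance per element (an assumption about the launch grid).
launch(vector_add_reference, len(x), x, y, out)
print(out)
```

On a GPU the instances run in parallel rather than in a loop, but the per-instance work is the same.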
Pipeline

How it works

Your Python function is compiled, not interpreted. The compiled pipeline is cached per constexpr configuration, so repeated calls have near-zero overhead.
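The caching behaviour described above can be sketched as a dictionary keyed by the constexpr values. This is an illustration of the idea only, not locomp's implementation: `SpecializationCache` and `fake_compile` are hypothetical names, and the "compiled pipeline" is stood in for by a Python closure.

```python
class SpecializationCache:
    """Hypothetical sketch: one compiled artifact per constexpr configuration."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.compile_count = 0

    def get(self, **constexprs):
        # Sort the items so the key does not depend on keyword order.
        key = tuple(sorted(constexprs.items()))
        if key not in self._cache:
            self.compile_count += 1
            self._cache[key] = self._compile_fn(**constexprs)
        return self._cache[key]


def fake_compile(N):
    # Stand-in for real compilation: N is baked into the closure as a constant.
    def run(x, y):
        return [x[i] + y[i] for i in range(N)]
    return run


cache = SpecializationCache(fake_compile)
add4 = cache.get(N=4)        # first call with N=4: compiles
add4_again = cache.get(N=4)  # same configuration: cache hit
add8 = cache.get(N=8)        # new configuration: compiles again
print(cache.compile_count)
```

Only the first call per configuration pays the compile cost; later calls are a dictionary lookup.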

@locomp.kernel  (Python function)
        │
   Python AST → Locomp IR (SSA, 60+ opcodes)
        │
   Optimizer (CSE · DCE · constant folding · type
              inference · strength reduction)
        │
   ├── backend="metal" → Metal Shading Language → Apple GPU
   ├── backend="cuda"  → CUDA C (nvcc -O3)      → NVIDIA GPU
   ├── backend="rocm"  → HIP C (hipcc -O3)      → AMD GPU
   └── backend="riscv" → C + RVV intrinsics     → RISC-V CPU
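Two of the optimizer passes named in the diagram, constant folding and dead-code elimination (DCE), can be illustrated on a toy SSA program. The `Instr` type and its four opcodes are hypothetical and far simpler than the 60+ opcode Locomp IR; this is a sketch of the techniques, not of locomp's optimizer.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[str]  # SSA name defined by this instruction (None for store)
    op: str              # "const", "add", "mul", "load" or "store"
    args: Tuple          # SSA names, or a literal value for "const"


def constant_fold(instrs):
    """Replace add/mul whose operands are all known constants with a const."""
    consts = {}
    out = []
    for ins in instrs:
        if ins.op == "const":
            consts[ins.dest] = ins.args[0]
            out.append(ins)
        elif ins.op in ("add", "mul") and all(a in consts for a in ins.args):
            a, b = (consts[name] for name in ins.args)
            value = a + b if ins.op == "add" else a * b
            consts[ins.dest] = value
            out.append(Instr(ins.dest, "const", (value,)))
        else:
            out.append(ins)
    return out


def dead_code_eliminate(instrs):
    """Keep stores and anything they transitively depend on."""
    live = set()
    kept = []
    for ins in reversed(instrs):
        if ins.op == "store" or ins.dest in live:
            kept.append(ins)
            live.update(a for a in ins.args if isinstance(a, str))
    kept.reverse()
    return kept


program = [
    Instr("a", "const", (2,)),
    Instr("b", "const", (3,)),
    Instr("c", "mul", ("a", "b")),       # folds to const 6
    Instr("unused", "add", ("a", "b")),  # folds to const 5, then removed as dead
    Instr("x", "load", ("X",)),
    Instr("y", "add", ("x", "c")),
    Instr(None, "store", ("O", "y")),
]

optimized = dead_code_eliminate(constant_fold(program))
for ins in optimized:
    print(ins)
```

Because each SSA name is defined exactly once, both passes are single linear scans: folding walks forward tracking known constants, and DCE walks backward tracking live names.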
Design

Built for serving engines

locomp is designed to drop in as the kernel-compilation layer under a GPU inference server. Instead of shipping pre-compiled shaders or PTX, the server compiles kernels at first request and caches them permanently.

  • Specialization per config — constexpr params (batch size, seq len, head dim) become hardware literals, generating a separate optimized pipeline per shape with no dynamic dispatch overhead.
  • Persistent pipeline cache — compiled kernels are written to ~/.cache/locomp/, so server restarts are instant after the first run.
  • Async dispatch + batch mode — multiple kernel calls in one command buffer; GPU pipelines work while the CPU prepares the next batch.
  • Backend-agnostic — the same kernel auto-selects Metal / CUDA / ROCm / RISC-V based on available hardware.
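The persistent cache in the list above can be sketched as a content-addressed directory: hash the kernel source, backend and constexpr values, and reuse the file if it exists. The key scheme, file layout and function names here are assumptions made for illustration (the source only states that compiled kernels are written to ~/.cache/locomp/), and the demo writes to a temporary directory instead.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(source, backend, constexprs):
    # Hypothetical key: anything that changes the generated code is hashed.
    payload = json.dumps(
        {"source": source, "backend": backend, "constexprs": constexprs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_or_compile(cache_dir, source, backend, constexprs, compile_fn):
    """Return (artifact, was_cache_hit)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(source, backend, constexprs)}.bin"
    if path.exists():
        return path.read_bytes(), True
    artifact = compile_fn(source, backend, constexprs)
    path.write_bytes(artifact)
    return artifact, False


def fake_compile(source, backend, constexprs):
    # Stand-in for the real compiler; returns some bytes to store on disk.
    return f"{backend}:{sorted(constexprs.items())}:{len(source)}".encode("utf-8")


demo_dir = Path(tempfile.mkdtemp()) / "locomp-demo"
src = "def vector_add(X, Y, O, N): ..."

first, hit_first = load_or_compile(demo_dir, src, "metal", {"N": 1024}, fake_compile)
second, hit_second = load_or_compile(demo_dir, src, "metal", {"N": 1024}, fake_compile)
other, hit_other = load_or_compile(demo_dir, src, "metal", {"N": 2048}, fake_compile)
print(hit_first, hit_second, hit_other)
```

Including the constexpr values in the key is what gives one cached pipeline per shape; a server restart then only needs the file lookup.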
Results

NVIDIA A100 vs PyTorch

Large-model kernels, fused with no temporary allocations.

Kernel                     Speedup   Note
GELU (N=4M)                85×       Fused, no temp alloc
RoPE (N=1M)                21×       In-place
Softmax (B=256, D=1024)    4.7×
RMSNorm (B=128, D=4096)    4.1×
locomp vs PyTorch on an A100-80GB, CUDA 12.1.
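The "fused, no temp alloc" note can be illustrated with a small pure-Python contrast between a step-by-step GELU that materialises a full-size temporary at every stage and a single-pass version that writes straight into the output. This sketch assumes the tanh approximation of GELU; the source does not say which formulation the benchmarked kernel uses.

```python
import math

K = math.sqrt(2.0 / math.pi)


def gelu_unfused(xs):
    # Each line allocates a new list as long as the input (a "temporary").
    cubed = [x * x * x for x in xs]
    inner = [K * (x + 0.044715 * c) for x, c in zip(xs, cubed)]
    tanhs = [math.tanh(v) for v in inner]
    return [0.5 * x * (1.0 + t) for x, t in zip(xs, tanhs)]


def gelu_fused(xs, out):
    # One pass: every intermediate lives in a local variable, not in memory.
    for i, x in enumerate(xs):
        out[i] = 0.5 * x * (1.0 + math.tanh(K * (x + 0.044715 * x * x * x)))


xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
fused_out = [0.0] * len(xs)
gelu_fused(xs, fused_out)
unfused_out = gelu_unfused(xs)
print(fused_out)
```

For a memory-bound elementwise op, the fused form reads the input once and writes the output once, which is where most of the saving comes from.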
Results

Apple M1 vs MLX 0.31

float32, median of 10 runs after 3 warmup runs. A speedup above 1 means locomp is faster.

Kernel                     Speedup
Flash Attention (N=64)     12×
RoPE                       2.9×
GELU + bias                2.6×
Reduce sum [32×1024]       1.5×
Batch norm (N=4, C=128)    1.4×
LayerNorm                  1.3×
SmolLM2-135M (30 layers, GQA, RoPE, INT4) decodes at 7.9 tok/s on M1 — every kernel pure @locomp.kernel.
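The measurement protocol stated for this table (3 warmup runs, then the median of 10 timed runs) can be sketched as a small harness. The function names are hypothetical and the workload is a CPU stand-in; for a real GPU kernel the timed callable would also need to wait for the device to finish, since kernel launches are typically asynchronous.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=10):
    """Return the median wall-clock time of `runs` calls after `warmup` calls."""
    for _ in range(warmup):
        fn()  # untimed: lets caches and any lazy compilation settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    # Greater than 1 means the candidate is faster than the baseline.
    return baseline_seconds / candidate_seconds


calls = {"n": 0}


def work():
    # Stand-in workload so the sketch runs without a GPU.
    calls["n"] += 1
    sum(i * i for i in range(1000))


median_s = benchmark(work)
print(calls["n"], median_s >= 0.0)
```

The median is less sensitive than the mean to an occasional slow run, which is why it suits short kernel timings.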
Results

AMD MI300X — memory bandwidth

HIP C compiled with hipcc --offload-arch=gfx942 -O3, measured on real hardware (192 GB HBM3, 5,300 GB/s theoretical peak).

Kernel         Bandwidth     % of peak
vector_add     3,961 GB/s    74.7%
relu           3,587 GB/s    67.7%
scale_shift    3,440 GB/s    64.9%
gelu           3,260 GB/s    61.5%
rmsnorm        1,119 GB/s    21.1%
locomp targets AMD ROCm from the same Python source; bandwidth-bound kernels reach 62–75% of HBM3 peak.
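The "% of peak" column is simply measured bandwidth divided by the 5,300 GB/s theoretical peak. The short sketch below reproduces that arithmetic from the figures in the table and shows how effective bandwidth is usually derived from bytes moved and elapsed time. The byte count and timing in the last example are made-up illustrative numbers, not measurements from the source.

```python
PEAK_GBPS = 5300.0  # MI300X theoretical HBM3 peak, as stated above

measured_gbps = {
    "vector_add": 3961.0,
    "relu": 3587.0,
    "scale_shift": 3440.0,
    "gelu": 3260.0,
    "rmsnorm": 1119.0,
}


def percent_of_peak(gbps, peak=PEAK_GBPS):
    return 100.0 * gbps / peak


def effective_bandwidth_gbps(bytes_moved, seconds):
    # Bytes read plus bytes written, divided by elapsed time, in GB/s.
    return bytes_moved / seconds / 1e9


percentages = {name: round(percent_of_peak(v), 1) for name, v in measured_gbps.items()}
print(percentages)

# Illustrative only: a float32 vector_add over n elements reads 2*n values and
# writes n, so it moves 12*n bytes. The n and timing below are hypothetical.
n = 1_000_000
example_gbps = effective_bandwidth_gbps(12 * n, 1e-3)
```

A kernel such as rmsnorm needs a reduction over each row before it can write anything, so it is expected to sit further from peak than pure streaming kernels.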
End to end

A full LLM on pure kernels

SmolLM2-135M runs entirely on @locomp.kernel — no PyTorch, no MLX, no Metal C++. The complete inference loop is just ten pure Python kernel functions: rms_norm, matvec, silu_mul, rope, gqa_attn, kv_cache_update, and a handful of element-wise ops.

$ python examples/54_smollm2_inference.py

SmolLM2-135M — locomp GPU inference
Loading weights... 272 tensors, 538MB

Prompt: "The meaning of life is"
Output: "to be found in the meaning of the universe."
Decode:  7.9 tok/s
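For readers unfamiliar with the kernel names, here are plain-Python reference versions of two of them, rms_norm and silu_mul. They follow the standard definitions of RMSNorm and the SiLU-gated product used in Llama-style MLPs; the signatures, the eps default and the exact behaviour of locomp's own kernels are assumptions, since the source lists only the names.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu_mul(gate, up):
    # silu(g) * u, where silu(g) = g * sigmoid(g) = g / (1 + exp(-g))
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
gated = silu_mul([0.0, 1.0], [5.0, 2.0])
print(normed, gated)
```

Reference functions like these are also handy as the NumPy-style oracle when checking a GPU kernel's output.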
Landscape

Where locomp fits

locomp is the only Python kernel compiler that targets Apple Silicon, NVIDIA CUDA, AMD ROCm and RISC-V from a single source — and the only one with CPU + GPU autograd, built-in auto-tuning and native INT4/INT8 quantization support.

Capability        locomp    Triton    MLX
Apple Silicon
NVIDIA CUDA
AMD ROCm
RISC-V RVV
Auto-tuning
Autograd          CPU+GPU
Validation

Tested everywhere

  • Apple M1 & M4 — 227 passed, 22 skipped.
  • NVIDIA A100 (Modal) — 64/64 execution checks; A10G — 17/17 CUDA codegen + runtime.
  • RISC-V RVV under QEMU — 9/9 execution tests, all matching NumPy.
  • 63 kernel examples shipped, from vector add to flash attention and full LLM inference. Apache-2.0 licensed.
