Systems · Inference · 2026 · v2.0 · Open source

ZSE

Zero-dependency Server Engine for LLM Inference

A production LLM inference engine that owns the full stack — no PyTorch, no Triton, no transformers. Models load in seconds, not minutes, and serve in a fraction of the memory other engines need.


30.2×

Faster cold start vs vLLM (T4)

5.79 GB

VRAM for Qwen2.5-7B on a T4

~5 MB

Pip install · zero ML deps

40.6k

Lines of code · 0 dependencies

Abstract

What is ZSE?

ZSE is a production LLM inference engine that owns the full stack. There is no PyTorch, no Triton, no bitsandbytes and no transformers — just pure Python, ctypes, and a kernel compiler that emits CUDA, ROCm (HIP) and Metal directly.

The result is an engine where models load in seconds, not minutes, and serve at a fraction of the memory other engines need. The entire system is roughly 40,600 lines of code with zero third-party dependencies, installable in about 5 MB.

pip install zse-engine   # one package, zero transitive ML deps
zse serve qwen-7b.zse    # 7-second cold start. 5.8 GB on a T4.

Motivation

The problem with the stack

Modern inference servers inherit a heavy dependency tree — PyTorch, Triton and a CUDA toolkit can total ~12 GB. That bloat translates directly into multi-minute cold starts and enormous VRAM footprints, making it expensive to run anything outside a top-tier data-center GPU.

ZSE asks a different question: what if the engine owned every layer — the quantized model format, the kernels, the scheduler and the server — so nothing is paid for that isn't used?

Approach

Owning the full stack

A custom .zse model format: pre-quantized INT4/INT8/FP16, memory-mapped so weights load instantly instead of being deserialized on every boot.
A pure-Python kernel compiler (zse-compiler) that emits CUDA C, HIP C and Metal Shading Language — 29 GPU kernels for the inference path.
A continuous-batching scheduler (ZStreamer) with disaggregated prefill/decode, SLO-aware ordering, chunked prefill and speculative decoding.
An OpenAI-compatible server with API keys, rate limiting, LoRA hot-swap and built-in RAG — all on pure asyncio, no web framework.
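
The memory-mapping idea in the first point can be sketched with the Python standard library alone. The file layout below (an 8-byte little-endian length header followed by raw weight bytes) is a hypothetical stand-in, not the actual .zse format, and the helper names write_weights and map_weights are illustrative only.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical header: payload length in bytes, little-endian unsigned 64-bit.
HEADER = struct.Struct("<Q")


def write_weights(path, payload):
    """Write a toy weight file: length header followed by the raw bytes."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(payload)))
        f.write(payload)


def map_weights(path):
    """Map the file read-only and return (mapping, zero-copy payload view).

    Nothing is deserialized: the OS pages bytes in lazily on first access.
    """
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (length,) = HEADER.unpack_from(mapped, 0)
    view = memoryview(mapped)[HEADER.size:HEADER.size + length]
    return mapped, view


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_weights.bin")
        write_weights(path, bytes(range(256)))
        mapped, view = map_weights(path)
        print(len(view), view[0], view[255])
        view.release()   # drop the buffer export before closing the mapping
        mapped.close()


if __name__ == "__main__":
    demo()
```

Because the view is a slice of the mapping rather than a copy, the cost of opening the file does not grow with the size of the weights; pages are only read when something touches them.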

Architecture

System architecture

HTTP / SSE · OpenAI API · Web dashboard · API key + RAG
        │
   ZStreamer — continuous batching, scheduling
        │
 Orchestrator   │ KV Cache (PagedAttention)     │ LoRA Mgr
 29 GPU kernels │ adaptive blocks · token-evict │ hot-swap
        │
 .zse format │ VRAM allocator (unified) │ CUDA/HIP Graphs
        │
   ZSE Kernel Compiler — Python DSL → GPU code
        │
   CUDA C (nvrtc) · HIP C (hiprtc) · Metal (MSL)

      No PyTorch · No Triton · No transformers

The kernel compiler queries each GPU's compute capability at runtime and emits the correct PTX / GCN / MSL automatically — so a new architecture usually works on day one.
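
To make the "Python DSL → GPU code" step concrete, here is a minimal sketch that maps a compute capability to an architecture flag and emits CUDA C source for an elementwise kernel as a plain string. The function names, the template and the kernel are hypothetical and are not taken from zse-compiler. The only real interface referenced is NVRTC's `--gpu-architecture=compute_XY` option.

```python
# Hypothetical template for a one-input, one-output elementwise kernel.
# Doubled braces are literal braces in str.format().
KERNEL_TEMPLATE = """\
extern "C" __global__ void {name}(const {dtype}* x, {dtype}* y, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        y[i] = {expr};
    }}
}}
"""


def arch_flag(major, minor):
    """Build the NVRTC virtual-architecture option for a compute capability."""
    return f"--gpu-architecture=compute_{major}{minor}"


def emit_elementwise(name, expr, dtype="float"):
    """Return CUDA C source for y[i] = expr, where expr may reference x[i]."""
    return KERNEL_TEMPLATE.format(name=name, dtype=dtype, expr=expr)


if __name__ == "__main__":
    # A T4 reports compute capability 7.5.
    print(arch_flag(7, 5))
    print(emit_elementwise("scale2", "x[i] * 2.0f"))
```

The returned string is what would be handed to a runtime compiler such as NVRTC together with the flag; nothing in this sketch touches a GPU.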

Results

Cold start — every GPU, every size

ZSE INT4 vs vLLM AWQ INT4, verified on Modal (T4, L4, A10G, A100), DigitalOcean (MI300X) and Apple M1.

GPU · Model	ZSE	vLLM	Speedup
T4 · Qwen2.5-7B	7.25s	218.96s	30.2×
L4 · Qwen2.5-7B	5.58s	145.22s	26.0×
A10G · Qwen2.5-7B	6.01s	193.05s	32.1×
A100-80GB · Qwen2.5-14B	6.29s	127.02s	20.2×
MI300X · Qwen2.5-32B	3.14s	42.65s	13.6×

Cold-start time, measured from launch until the server is ready to produce its first token; lower is better.

Results

VRAM — fits where others can't

GPU · Model	ZSE	vLLM	Saving
T4 · Qwen2.5-7B	5.79 GB	~14 GB	~2.4×
A100-80GB · Qwen2.5-14B	12.28 GB	71.45 GB	5.82×
MI300X · Qwen2.5-32B	22.07 GB	161.77 GB	7.33×

ZSE runs the 32B INT4 model in 22 GB, which leaves room for about seven more copies of the weights on one 192 GB MI300X.
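
As a rough check on how many such models can share one card, the sketch below divides device memory by the measured footprint. It assumes 192 GB of HBM on the MI300X (AMD's published capacity, not a figure from the table) and ignores KV-cache growth and allocator overhead, so it is an upper bound rather than a deployment plan. The function name is illustrative.

```python
import math


def copies_that_fit(capacity_gb, model_gb):
    """Upper bound on whole model copies that fit in device memory."""
    return math.floor(capacity_gb / model_gb)


if __name__ == "__main__":
    MI300X_GB = 192.0            # assumption: published HBM capacity
    total = copies_that_fit(MI300X_GB, 22.07)
    print(total, "copies in total,", total - 1, "beyond the first")
```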

Results

Single-sequence throughput

For single-sequence decode, ZSE beats vLLM on the A100 (1.40×) and is near parity on the A10G (0.95×); on the L4, MI300X and T4 it is still slower. In every case it stays far leaner on memory.

GPU · Model	ZSE (tok/s)	vLLM (tok/s)	Ratio (ZSE / vLLM)
A100-80GB · Qwen2.5-14B	37.0	26.5	1.40×
A10G · Qwen2.5-7B	48.6	50.9	0.95×
L4 · Qwen2.5-7B	36.3	47.3	0.77×
MI300X · Qwen2.5-32B	38.4	56.4	0.68×
T4 · Qwen2.5-7B	18.8	35.2	0.53×

Beyond inference

Built-in retrieval (RAG)

Hybrid retrieval: BM25 + TF-IDF + dense embeddings from mean-pooled LLM hidden states — no extra embedding model needed.
Reciprocal Rank Fusion plus an LLM cross-encoder reranker.
ZPF compressed document format — 25% fewer LLM tokens at 100% retrieval accuracy.
A full PDF parser that handles encrypted files (RC4 / AES-128 / AES-256), multi-column reflow and /ObjStm object streams, with an OCR hook.
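
Reciprocal Rank Fusion, named in the list above, merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is the textbook form, not ZSE's code: the function name, the document IDs and the constant k = 60 (the value commonly used in the literature) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into one list, best first.

    Each ranking is ordered best-to-worst; ranks are 1-based.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25 = ["a", "b", "c"]     # hypothetical results from each retriever
    tfidf = ["b", "a", "d"]
    dense = ["b", "c", "a"]
    print(reciprocal_rank_fusion([bm25, tfidf, dense]))
```

Because only ranks enter the score, the three retrievers do not need comparable score scales, which is what makes this fusion convenient for mixing lexical and dense results before a reranker.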

Honesty

What ZSE doesn't beat yet

We believe in numbers, not marketing. ZSE does not yet beat vLLM on concurrent INT4 throughput at N≥4: vLLM's hand-tuned AWQ Marlin kernels run close to the memory-bandwidth ceiling on NVIDIA, and ZSE is still closing that gap. On AMD the gap is already down to 2.12× after a wave-64 bgemv rewrite. Full Apple-Silicon transformer inference and socket-restricted tensor parallelism are wired and validated at the kernel level, pending broader hardware runs.

If steady-state batched throughput is your only metric and you have several times the VRAM budget (between about 2× and 7× in the measurements above), use vLLM. If you care about cold start, footprint, vendor lock-in, or running on anything other than an H100, use ZSE.

Hardware

Validated across vendors

NVIDIA T4 (Turing), L4 (Ada), A10G (Ampere), A100 (40/80 GB), H100 / H200 (Hopper) and AMD Instinct MI300X (CDNA3) are validated. Apple M1 is validated at the kernel level, with full transformer inference pending broader hardware runs. ZSE is Apache-2.0 licensed and built in Nagercoil, India.
