Back to research
Systems · Inference2026v2.0 · Open source

ZSE

Zero-dependency Server Engine for LLM Inference

A production LLM inference engine that owns the full stack — no PyTorch, no Triton, no transformers. Models load in seconds, not minutes, and serve in a fraction of the memory other engines need.

30.2×
Faster cold start vs vLLM (T4)
5.79 GB
VRAM for Qwen2.5-7B on a T4
~5 MB
Pip install · zero ML deps
40.6k
Lines of code · 0 dependencies
Abstract

What is ZSE?

ZSE is a production LLM inference engine that owns the full stack. There is no PyTorch, no Triton, no bitsandbytes and no transformers — just pure Python, ctypes, and a kernel compiler that emits CUDA, ROCm (HIP) and Metal directly.

The result is an engine where models load in seconds, not minutes, and serve at a fraction of the memory other engines need. The entire system is roughly 40,600 lines of code with zero third-party dependencies, installable in about 5 MB.

pip install zse-engine   # one package, zero transitive ML deps
zse serve qwen-7b.zse    # 7-second cold start. 5.8 GB on a T4.
Motivation

The problem with the stack

Modern inference servers inherit a heavy dependency tree — PyTorch, Triton and a CUDA toolkit can total ~12 GB. That bloat translates directly into multi-minute cold starts and enormous VRAM footprints, making it expensive to run anything outside a top-tier data-center GPU.

ZSE asks a different question: what if the engine owned every layer — the quantized model format, the kernels, the scheduler and the server — so nothing is paid for that isn't used?

Approach

Owning the full stack

  • A custom .zse model format: pre-quantized INT4/INT8/FP16, memory-mapped so weights load instantly instead of being deserialized on every boot.
  • A pure-Python kernel compiler (zse-compiler) that emits CUDA C, HIP C and Metal Shading Language — 29 GPU kernels for the inference path.
  • A continuous-batching scheduler (ZStreamer) with disaggregated prefill/decode, SLO-aware ordering, chunked prefill and speculative decoding.
  • An OpenAI-compatible server with API keys, rate limiting, LoRA hot-swap and built-in RAG — all on pure asyncio, no web framework.
Architecture

System architecture

HTTP / SSE · OpenAI API · Web dashboard · API key + RAG
        │
   ZStreamer — continuous batching, scheduling
        │
 Orchestrator │ KV Cache (PagedAttention) │ LoRA Mgr
 29 GPU kernels│ adaptive blocks · token-evict│ hot-swap
        │
 .zse format │ VRAM allocator (unified) │ CUDA/HIP Graphs
        │
   ZSE Kernel Compiler — Python DSL → GPU code
        │
   CUDA C (nvrtc) · HIP C (hiprtc) · Metal (MSL)

      No PyTorch · No Triton · No transformers

The kernel compiler queries each GPU's compute capability at runtime and emits the correct PTX / GCN / MSL automatically — so a new architecture usually works on day one.

Results

Cold start — every GPU, every size

ZSE INT4 vs vLLM AWQ INT4, verified on Modal (T4, L4, A10G, A100), DigitalOcean (MI300X) and Apple M1.

GPU · ModelZSEvLLMSpeedup
T4 · Qwen2.5-7B7.25s218.96s30.2×
L4 · Qwen2.5-7B5.58s145.22s26.0×
A10G · Qwen2.5-7B6.01s193.05s32.1×
A100-80GB · Qwen2.5-14B6.29s127.02s20.2×
MI300X · Qwen2.5-32B3.14s42.65s13.6×
Cold start time to first token-ready, lower is better.
Results

VRAM — fits where others can't

GPU · ModelZSEvLLMSaving
T4 · Qwen2.5-7B5.79 GB~14 GB~2.5×
A100-80GB · Qwen2.5-14B12.28 GB71.45 GB5.82×
MI300X · Qwen2.5-32B22.07 GB161.77 GB7.33×
ZSE runs 32B INT4 in 22 GB — with room for 8 more models on one MI300X.
Results

Single-sequence throughput

On data-center GPUs ZSE matches or beats vLLM for single-sequence decode, while staying far leaner on memory.

GPU · ModelZSEvLLMRatio
A100-80GB · Qwen2.5-14B37.0 tok/s26.51.40×
A10G · Qwen2.5-7B48.6 tok/s50.90.95×
L4 · Qwen2.5-7B36.3 tok/s47.30.77×
MI300X · Qwen2.5-32B38.4 tok/s56.40.68×
T4 · Qwen2.5-7B18.8 tok/s35.20.53×
Beyond inference

Built-in retrieval (RAG)

  • Hybrid retrieval: BM25 + TF-IDF + dense embeddings from mean-pooled LLM hidden states — no extra embedding model needed.
  • Reciprocal Rank Fusion plus an LLM cross-encoder reranker.
  • ZPF compressed document format — 25% fewer LLM tokens at 100% retrieval accuracy.
  • A full PDF parser handling encrypted (RC4 / AES-128 / AES-256), multi-column reflow, /ObjStm and an OCR hook.
Honesty

What ZSE doesn't beat yet

We believe in numbers, not marketing. ZSE does not yet beat vLLM on concurrent throughput at N≥4 on INT4 — vLLM's hand-tuned AWQ Marlin kernels hit memory-bandwidth ceilings we're still closing on NVIDIA (already down to 2.12× on AMD via a wave-64 bgemv rewrite). Full Apple-Silicon transformer inference and socket-restricted tensor parallelism are wired and validated at the kernel level, pending broader hardware runs.

If steady-state batched throughput is your only metric and you have ~50× the VRAM budget, use vLLM. If you care about cold start, footprint, vendor lock-in, or running on anything other than an H100 — use ZSE.

Hardware

Validated across vendors

NVIDIA T4 (Turing), L4 (Ada), A10G (Ampere), A100 (40/80 GB), H100 / H200 (Hopper); AMD Instinct MI300X (CDNA3); and Apple M1 — all validated. ZSE is Apache-2.0 licensed and built in Nagercoil, India.

Interested in this work?