When you deploy a large language model in production, the first number that bites you isn't tokens per second — it's cold start. The time between 'the GPU is allocated' and 'the model can answer a request' is where autoscaling, cost and reliability quietly fall apart. With most engines that window is measured in minutes.
ZSE is our answer. It's a zero-dependency server engine for LLM inference: no PyTorch, no Triton, no transformers. We own the full stack so we control exactly what happens between a cold container and a served token.
Why cold starts are slow
A typical stack pays a tax three times over: importing a multi-gigabyte Python ML framework, JIT-compiling kernels on first use, and streaming unquantized FP16 weights into VRAM. None of that work serves a single user; it's pure latency before the first request.
- Framework import alone can cost several seconds of pure Python startup.
- Kernel JIT compilation happens lazily, so the first request absorbs the entire compile cost.
- Loading FP16 weights for a 7B model moves ~14 GB before you can serve anything.
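The ~14 GB figure is simple arithmetic, and it's worth seeing next to the INT4 equivalent. A minimal sketch, assuming a nominal 7 billion parameters and counting raw weight storage only (no quantization scales, KV cache or activations); `weight_bytes` is a helper we made up for illustration:

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Raw storage for the weights alone, ignoring scales, KV cache and activations."""
    return n_params * bits_per_weight // 8


N_PARAMS = 7_000_000_000  # nominal "7B"; real checkpoints differ slightly

fp16_gb = weight_bytes(N_PARAMS, 16) / 1e9
int4_gb = weight_bytes(N_PARAMS, 4) / 1e9

print(f"FP16: {fp16_gb:.1f} GB")  # FP16: 14.0 GB
print(f"INT4: {int4_gb:.1f} GB")  # INT4: 3.5 GB
```

A real deployment uses more memory than the raw INT4 number because the weights are not the only thing resident in VRAM.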
What ZSE does differently
We pre-quantize weights to INT4 and store them in a format that maps almost directly into the layout the kernels expect. There's no framework to import, no graph to trace, and no kernel to compile at request time — the kernels are already there.
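To make "pre-quantized and mapped" concrete, here is a toy sketch of the idea, not ZSE's actual on-disk format. It assumes symmetric INT4 with one float32 scale per group of 8 weights, two values packed per byte, and a file that is read back through `mmap` so nothing is parsed or copied up front. The group size, record layout and every function name are hypothetical.

```python
import mmap
import os
import struct
import tempfile

GROUP = 8                # weights per scale group (hypothetical; must be even)
RECORD = 4 + GROUP // 2  # one float32 scale followed by GROUP packed nibbles


def quantize_group(values):
    """Symmetric INT4: map floats to integers in [-8, 7] with one scale per group."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak else 1.0
    return scale, [max(-8, min(7, round(v / scale))) for v in values]


def pack_nibbles(q):
    """Pack two signed 4-bit values into each byte, low nibble first."""
    return bytes((lo & 0xF) | ((hi & 0xF) << 4) for lo, hi in zip(q[0::2], q[1::2]))


def write_file(path, weights):
    """Write fixed-size records; assumes len(weights) is a multiple of GROUP."""
    with open(path, "wb") as f:
        for i in range(0, len(weights), GROUP):
            scale, q = quantize_group(weights[i:i + GROUP])
            f.write(struct.pack("<f", scale) + pack_nibbles(q))


def read_weight(buf, index):
    """Dequantize one weight straight out of the mapped buffer."""
    group, k = divmod(index, GROUP)
    base = group * RECORD
    (scale,) = struct.unpack_from("<f", buf, base)
    byte = buf[base + 4 + k // 2]
    nibble = (byte >> 4) & 0xF if k % 2 else byte & 0xF
    signed = nibble - 16 if nibble >= 8 else nibble  # sign-extend 4 bits
    return signed * scale


weights = [0.7, -0.33, 0.12, 0.0, -0.7, 0.26, 0.5, -0.08] * 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.int4")
    write_file(path, weights)
    size = os.path.getsize(path)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        restored = [read_weight(buf, i) for i in range(len(weights))]

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{size} bytes on disk, max abs error {max_err:.3f}")
```

Because every record has a fixed size, any weight can be located by offset alone. That is what lets a loader hand the mapped bytes to a kernel without a deserialization pass, at the cost of a rounding error bounded by half a quantization step per weight.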
“Models load in seconds, not minutes, and serve in a fraction of the memory other engines need.”
— ZSE design goal
$ zse serve qwen2.5-7b --quantize int4 --tp 3
→ weights mapped in 1.9s
→ engine ready in 7s
→ serving on :8000 (OpenAI-compatible)
The numbers
On a single NVIDIA T4, Qwen2.5-7B fits in 5.79 GB of VRAM and the engine is ready dramatically faster than a comparable vLLM deployment. The whole thing ships as a ~5 MB pip install with zero ML dependencies.
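If you want to check a cold-start number on your own hardware, one way is to start a timer when the process launches and poll until the server answers. A minimal sketch, assuming the engine exposes the standard OpenAI-style `GET /v1/models` route on port 8000 (an assumption on our part for this example; substitute whatever readiness endpoint your deployment has). `http_ready` and `time_to_ready` are illustrative helpers, not part of any ZSE API.

```python
import time
import urllib.error
import urllib.request


def http_ready(url: str, timeout: float = 1.0) -> bool:
    """True once the server answers the probe URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def time_to_ready(probe, deadline_s: float = 300.0, interval_s: float = 0.1) -> float:
    """Poll probe() until it returns True; return elapsed seconds.

    Raises TimeoutError if the deadline passes first.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        if probe():
            return time.monotonic() - start
        time.sleep(interval_s)
    raise TimeoutError(f"server not ready after {deadline_s:.0f}s")
```

Called right after launching the server, `time_to_ready(lambda: http_ready("http://localhost:8000/v1/models"))` gives a wall-clock figure that includes everything a request would actually wait for. The result is only accurate to within the polling interval, so keep `interval_s` small relative to the startup time you expect.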
Fast cold starts aren't a vanity metric. They're what make per-model autoscaling, scale-to-zero and cheap multi-tenant serving actually viable. That's the foundation everything else at Zyora Labs is built on.