Back to blog
ResearchZSE

Inside ZSE: how we cut LLM cold starts from minutes to seconds

Most inference engines spend minutes just getting a model ready to serve. ZSE rebuilds the stack from first principles so a 7B model is live in seconds — on a single T4.

Zyora Labs·Systems team··7 min read

When you deploy a large language model in production, the first number that bites you isn't tokens per second — it's cold start. The time between 'the GPU is allocated' and 'the model can answer a request' is where autoscaling, cost and reliability quietly fall apart. With most engines that window is measured in minutes.

ZSE is our answer. It's a zero-dependency server engine for LLM inference: no PyTorch, no Triton, no transformers. We own the full stack so we control exactly what happens between a cold container and a served token.

Why cold starts are slow

A typical stack pays a tax three times over: importing a multi-gigabyte Python ML framework, JIT-compiling kernels on first use, and streaming full-precision weights into VRAM. None of that work serves a single user — it's pure latency before the first request.

  • Framework import alone can cost several seconds of pure Python startup.
  • Kernel JIT compilation happens lazily, so the first request pays for everyone.
  • Loading FP16 weights for a 7B model moves ~14 GB before you can serve anything.

What ZSE does differently

We pre-quantize weights to INT4 and store them in a format that maps almost directly into the layout the kernels expect. There's no framework to import, no graph to trace, and no kernel to compile at request time — the kernels are already there.

Models load in seconds, not minutes, and serve in a fraction of the memory other engines need.

ZSE design goal
$ zse serve qwen2.5-7b --quantize int4 --tp 3
→ weights mapped in 1.9s
→ engine ready in 7s
→ serving on :8000 (OpenAI-compatible)

The numbers

On a single NVIDIA T4, Qwen2.5-7B fits in 5.79 GB of VRAM and the engine is ready dramatically faster than a comparable vLLM deployment. The whole thing ships as a ~5 MB pip install with zero ML dependencies.

Fast cold starts aren't a vanity metric. They're what make per-model autoscaling, scale-to-zero and cheap multi-tenant serving actually viable. That's the foundation everything else at Zyora Labs is built on.

Want to go deeper?

Read the ZSE research