Inside ZSE: how we cut LLM cold starts from minutes to seconds

When you deploy a large language model in production, the first number that bites you isn't tokens per second — it's cold start. The time between 'the GPU is allocated' and 'the model can answer a request' is where autoscaling, cost and reliability quietly fall apart. With most engines that window is measured in minutes.

ZSE is our answer. It's a zero-dependency server engine for LLM inference: no PyTorch, no Triton, no transformers. We own the full stack so we control exactly what happens between a cold container and a served token.

Why cold starts are slow

A typical stack pays a tax three times over: importing a multi-gigabyte Python ML framework, JIT-compiling kernels on first use, and streaming full-precision weights into VRAM. None of that work serves a single user — it's pure latency before the first request.

Framework import alone can cost several seconds of pure Python startup.
Kernel JIT compilation happens lazily, so the first request pays for everyone.
Loading FP16 weights for a 7B model moves ~14 GB before you can serve anything.

What ZSE does differently

We pre-quantize weights to INT4 and store them in a format that maps almost directly into the layout the kernels expect. There's no framework to import, no graph to trace, and no kernel to compile at request time — the kernels are already there.

“Models load in seconds, not minutes, and serve in a fraction of the memory other engines need.”
— ZSE design goal

$ zse serve qwen2.5-7b --quantize int4 --tp 3
→ weights mapped in 1.9s
→ engine ready in 7s
→ serving on :8000 (OpenAI-compatible)

The numbers

On a single NVIDIA T4, Qwen2.5-7B fits in 5.79 GB of VRAM and the engine is ready dramatically faster than a comparable vLLM deployment. The whole thing ships as a ~5 MB pip install with zero ML dependencies.

Fast cold starts aren't a vanity metric. They're what make per-model autoscaling, scale-to-zero and cheap multi-tenant serving actually viable. That's the foundation everything else at Zyora Labs is built on.

Inside ZSE: how we cut LLM cold starts from minutes to seconds

Why cold starts are slow

What ZSE does differently

The numbers

locomp: write a GPU kernel once, run it everywhere

Building zMesh: a complete backend you can ship in minutes

Nexula AIBOM: securing your entire AI supply chain