AI Systems
LLM Inference Serving Platform
Serve large language model completions with low latency, high throughput, and predictable cost.
Scale to anchor on
Millions of QPS, sub-second time to first token (TTFT), inter-token latency of at most hundreds of milliseconds, multi-region deployment.
Requirements
Functional
- Stream completions token-by-token.
- Support multiple model versions and routing.
- Apply rate limits, abuse filters, and safety classifiers.
- Cache prompt prefixes for cost reduction.
Non-functional
- GPU utilization above 50% at steady state.
- Graceful degradation under spikes.
- Multi-region failover for availability.
High-level architecture
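A gateway authenticates, rate-limits, and routes requests by model. Inference servers run with continuous batching, a paged KV cache, and prompt prefix caching. A safety layer (a classifier on both input and output) runs in parallel with generation where possible, so it stays off the critical path. Responses stream over Server-Sent Events (SSE) on HTTP/2, with backpressure so that a slow client cannot cause unbounded buffering on the server.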
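The streaming and backpressure step can be sketched with a bounded queue between the token producer and the SSE writer. This is a minimal illustration, not a production server: the function names, the `max_buffered` parameter, and the `[DONE]` end-of-stream marker are assumptions made for the example.

```python
import asyncio

async def produce_tokens(tokens, queue):
    # put() suspends while the queue is full, so a slow consumer
    # throttles the producer instead of letting the buffer grow.
    for tok in tokens:
        await queue.put(tok)
    await queue.put(None)  # sentinel: end of stream

async def sse_events(queue):
    # Format each token as one Server-Sent Events frame.
    while True:
        tok = await queue.get()
        if tok is None:
            yield "data: [DONE]\n\n"  # hypothetical end marker
            return
        yield f"data: {tok}\n\n"

async def stream(tokens, max_buffered=2):
    queue = asyncio.Queue(maxsize=max_buffered)
    producer = asyncio.create_task(produce_tokens(tokens, queue))
    frames = [frame async for frame in sse_events(queue)]
    await producer
    return frames

frames = asyncio.run(stream(["Hel", "lo", "!"]))
```

In a real deployment the consumer side would be the HTTP response writer, and tokens containing newlines would need escaping or JSON encoding before being placed in a `data:` field.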
Components
Gateway
Auth, rate limit, abuse detection, model routing.
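One common way to implement the gateway's rate limit is a token bucket per tenant. The sketch below assumes the caller passes in the current time; the class name, field names, and the example rate and capacity are illustrative, not taken from any particular gateway.

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float          # tokens refilled per second
    capacity: float      # maximum burst size
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the previous call

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2.0, tokens=2.0)
results = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]
```

For LLM traffic, `cost` could be set to the request's estimated token count rather than 1, so that long prompts consume more of the budget than short ones.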
Inference server (vLLM-style)
Continuous batching, paged KV cache, speculative decoding.
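The idea behind a paged KV cache is that each sequence holds a table of fixed-size blocks, allocated on demand, instead of one contiguous reservation sized for the maximum length. The following is a toy allocator that tracks block ids only; the class and method names are hypothetical and it does not reflect the internals of vLLM or any other server.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # unused physical block ids
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> bool:
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n == len(table) * self.block_size:  # all owned blocks are full
            if not self.free:
                return False  # out of memory: caller must queue or preempt
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return True

    def release(self, seq_id: str) -> None:
        # Return every block owned by the sequence to the free list.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=2, block_size=4)
grown = [alloc.append_token("a") for _ in range(5)]  # needs 2 blocks
```

Because a block is claimed only when the previous one fills, the memory wasted per sequence is bounded by one partially filled block.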
Prefix cache
Shared KV cache across requests with common prefixes.
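A simple way to picture prefix sharing is to key cached blocks by a hash of the entire token prefix up to the end of each block, so that a key match implies every earlier block matches too. The sketch below tracks only which blocks are cached, not the KV tensors themselves; `BLOCK`, `block_keys`, and `PrefixCache` are illustrative names and the block size is an assumption.

```python
import hashlib

BLOCK = 4  # tokens per cached block (hypothetical size)

def block_keys(token_ids):
    # Running hash: each key covers the whole prefix up to that block's end.
    keys, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK  # ignore the partial tail
    for i in range(0, full, BLOCK):
        h.update(repr(token_ids[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = set()

    def cached_tokens(self, token_ids):
        # Count leading tokens whose blocks are already cached.
        hit = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            hit += BLOCK
        return hit

    def insert(self, token_ids):
        self.blocks.update(block_keys(token_ids))

cache = PrefixCache()
system_prompt = list(range(8))           # stand-in for a shared system prompt
cache.insert(system_prompt + [100, 101])
reused = cache.cached_tokens(system_prompt + [200, 201, 202, 203, 204])
```

Here `reused` counts the prompt tokens whose prefill could be skipped for the second request; only the remaining tokens would need to be computed.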
Safety classifier
Pre- and post-generation policy enforcement.
Telemetry
Per-request cost, latency breakdown (prefill/decode), and policy outcomes.
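As a rough illustration of the per-request record, the sketch below derives TTFT, decode time, mean inter-token latency, and cost from token timestamps. The field names and the per-1k-token prices are assumptions; note that TTFT measured this way includes queueing time as well as prefill, so a real system would record the prefill start separately.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival_ms: int
    token_times_ms: list  # emission timestamp of each output token
    prompt_tokens: int

def summarize(trace, price_per_1k_prompt, price_per_1k_output):
    first, last = trace.token_times_ms[0], trace.token_times_ms[-1]
    n_out = len(trace.token_times_ms)
    return {
        "ttft_ms": first - trace.arrival_ms,  # queueing + prefill
        "decode_ms": last - first,            # time spent decoding
        "mean_itl_ms": (last - first) / (n_out - 1) if n_out > 1 else 0.0,
        "cost": (trace.prompt_tokens * price_per_1k_prompt
                 + n_out * price_per_1k_output) / 1000,
    }

trace = RequestTrace(arrival_ms=0, token_times_ms=[300, 330, 360, 390],
                     prompt_tokens=1000)
summary = summarize(trace, price_per_1k_prompt=0.5, price_per_1k_output=1.5)
```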
Key decisions
Continuous batching over static batching.
Tail-heavy generation lengths waste static batch slots; continuous batching keeps the GPU saturated.
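The effect can be seen in a small simulation that counts decode steps for the same workload under both policies. It is deliberately simplified: it ignores prefill cost and memory limits, assumes every sequence needs at least one step, and the function names and example lengths are made up for illustration.

```python
from collections import deque

def static_steps(lengths, batch_size):
    # Each batch runs until its longest sequence finishes.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_steps(lengths, batch_size):
    waiting, running, steps = deque(lengths), [], 0
    while waiting or running:
        # Refill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: sequences with 1 token left finish and leave.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps

lengths = [1, 1, 1, 9, 1, 1, 1, 9]  # tail-heavy: a few long generations
static_total = static_steps(lengths, batch_size=4)
continuous_total = continuous_steps(lengths, batch_size=4)
```

With this workload the static policy takes 18 steps because each batch waits for its one long sequence, while the continuous policy takes 11 because short requests backfill the freed slots.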
Prefix caching for common system prompts.
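Major cost lever: when many requests share a long system prompt, that prompt can dominate prefill cost, and caching its KV entries avoids recomputing it for every request.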
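A back-of-the-envelope estimate makes the size of the lever concrete. The sketch assumes prefill cost scales linearly with token count and that only the system prompt is cacheable; the token counts and hit rate are invented for the example.

```python
def prefill_savings(system_tokens, user_tokens, hit_rate):
    """Fraction of prefill tokens skipped when the system prompt is cached."""
    total = system_tokens + user_tokens
    return hit_rate * system_tokens / total

# Hypothetical workload: 1500-token system prompt, 500-token user turn,
# and 90% of requests finding the prefix in cache.
saved = prefill_savings(system_tokens=1500, user_tokens=500, hit_rate=0.9)
```

Under these assumptions roughly two thirds of prefill work is avoided; with a short system prompt and long user input the same cache would save far less.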
Smaller-model fallback under load.
Graceful degradation: a cheaper model serves overflow traffic instead of requests failing outright.
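A minimal routing rule for this might key off queue depth, as in the sketch below. The model names, thresholds, and the choice of queue depth as the load signal are all assumptions; a real system might use estimated wait time or GPU memory pressure instead, and add hysteresis to avoid flapping between models.

```python
def choose_model(queue_depth, primary="large", fallback="small",
                 soft_limit=100, hard_limit=500):
    # Below the soft limit the primary model has headroom.
    if queue_depth < soft_limit:
        return primary
    # Between the limits, overflow goes to the cheaper model.
    if queue_depth < hard_limit:
        return fallback
    # Beyond the hard limit, shed load (e.g. reject with a retry hint).
    return None

decisions = [choose_model(d) for d in (10, 100, 499, 500)]
```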
Safety as a layered system, not a single guardrail.
Classifier, preference-tuned model, content policy, and audit log: each layer catches what the others miss.
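The layering can be sketched as an ordered list of independent checks, each of which records its decision to an audit log. The two checks here are toy stand-ins (a length limit and a keyword match), not real classifiers, and all names are hypothetical.

```python
def run_layers(text, layers, audit_log):
    """Apply checks in order; stop at the first layer that blocks."""
    for name, check in layers:
        allowed = check(text)
        audit_log.append((name, allowed))  # record every decision made
        if not allowed:
            return False
    return True

# Toy stand-ins for real policy layers.
layers = [
    ("length_policy", lambda t: len(t) <= 200),
    ("keyword_classifier", lambda t: "forbidden" not in t.lower()),
]

log = []
ok = run_layers("hello world", layers, log)
blocked = run_layers("a FORBIDDEN request", layers, log)
```

Because each layer is a separate entry, one can be replaced or retrained without touching the others, and the audit log shows which layer made each blocking decision.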
Pitfalls
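- A single shared queue for all tenants, which lets noisy neighbors starve everyone else.
- No per-tenant cost attribution, which leads to billing surprises.
- Cold start of the safety classifier on the request path, which adds latency to the first requests.
- No clear separation between prefill and decode in monitoring, which hides the phase that is actually the bottleneck.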
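For the shared-queue pitfall, one possible mitigation is a queue per tenant served round-robin, so a tenant with a deep backlog cannot delay a tenant with a single request. This is a sketch under that assumption; the class name is hypothetical, and production schedulers often use weighted variants instead.

```python
from collections import OrderedDict, deque

class FairQueue:
    def __init__(self):
        self.queues = OrderedDict()  # tenant -> deque of pending requests

    def push(self, tenant, request):
        self.queues.setdefault(tenant, deque()).append(request)

    def pop(self):
        # Serve the tenant at the front, then rotate it to the back.
        if not self.queues:
            return None
        tenant, q = next(iter(self.queues.items()))
        request = q.popleft()
        del self.queues[tenant]
        if q:  # tenant still has work: re-queue it behind the others
            self.queues[tenant] = q
        return tenant, request

fq = FairQueue()
for r in ("a1", "a2", "a3"):
    fq.push("big", r)
fq.push("small", "b1")
order = [fq.pop() for _ in range(4)]
```

The small tenant's request is served second rather than fourth, which is the isolation property a single FIFO queue lacks.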
Follow-up questions
- How do you handle a 20x viral spike?
- What's the multi-region failover plan?
- How do you isolate large-tenant traffic from small tenants?
- What's the model rollout pattern?