AI Systems

LLM Inference Serving Platform

Serve large language model completions with low latency, high throughput, and predictable cost.

Scale to anchor on

Millions of QPS, sub-second time to first token (TTFT), inter-token latency of at most a few hundred milliseconds, multi-region deployment.

Requirements

Functional

  • Stream completions token-by-token.
  • Support multiple model versions and routing.
  • Apply rate limits, abuse filters, and safety classifiers.
  • Cache prompt prefixes for cost reduction.
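
The rate-limit requirement above could be met in several ways; one common option is a per-tenant token bucket. The sketch below is illustrative only: the class name `TokenBucket`, its parameters, and the idea of passing the clock in as `now` are assumptions made for the example, not part of the design above.

```python
class TokenBucket:
    """Token-bucket limiter: refills at `rate` per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # tokens added per second (hypothetical unit: requests)
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full so an idle tenant can burst
        self.last = now           # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Admit the request only if enough tokens remain.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant; the gateway would look this up by API key.
bucket = TokenBucket(rate=1.0, capacity=2.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

For LLM traffic, `cost` could plausibly be set from the prompt and requested output token counts rather than a flat 1 per request, so that long requests consume more of a tenant's budget.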

Non-functional

  • GPU utilization > 50% steady state.
  • Graceful degradation under spikes.
  • Multi-region failover for availability.
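
To make "graceful degradation under spikes" concrete, here is a minimal sketch of a load-based routing policy. The function name, the model labels `"large"` and `"small"`, and the queue-depth thresholds are all hypothetical; a real policy would more likely key on measured latency or GPU memory pressure than on a single counter.

```python
from typing import Optional


def choose_model(queue_depth: int,
                 primary: str = "large",
                 fallback: str = "small",
                 fallback_at: int = 100,
                 shed_at: int = 500) -> Optional[str]:
    """Pick a model for a new request given the primary pool's queue depth.

    Returns None when the request should be rejected (e.g. HTTP 429).
    Thresholds are illustrative placeholders, not tuned values.
    """
    if queue_depth >= shed_at:
        return None        # overloaded: shed load instead of queueing forever
    if queue_depth >= fallback_at:
        return fallback    # degraded: serve overflow from the cheaper model
    return primary         # healthy: serve from the primary model
```

The ordering matters: degrading to a cheaper model comes before shedding, so users see lower quality before they see errors.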

High-level architecture

A gateway authenticates, rate-limits, and routes by model. Inference servers run with continuous batching, a paged KV cache, and prompt prefix caching. A safety layer (classifiers on both input and output) runs in parallel with generation where possible. Responses stream over Server-Sent Events (SSE) on HTTP/2, with backpressure so a slow client cannot cause unbounded buffering.
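
As an illustration of streaming with backpressure, the sketch below puts a bounded `asyncio.Queue` between a token producer and an SSE writer: when the queue is full, the producer's `put` suspends until the consumer catches up. The function names and the `max_buffered` parameter are assumptions for the example, and the `sink` list stands in for a real HTTP response writer.

```python
import asyncio


def sse_frame(token: str) -> str:
    # One SSE event: a "data:" line followed by a blank line.
    # (Tokens containing newlines would need one "data:" line per line.)
    return f"data: {token}\n\n"


async def produce(tokens, queue: asyncio.Queue) -> None:
    for token in tokens:
        await queue.put(token)  # suspends while the queue is full (backpressure)
    await queue.put(None)       # sentinel: generation finished


async def consume(queue: asyncio.Queue, sink: list) -> None:
    while True:
        token = await queue.get()
        if token is None:
            break
        sink.append(sse_frame(token))  # stand-in for writing to the client socket


async def stream(tokens, max_buffered: int = 2) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)
    sink: list = []
    await asyncio.gather(produce(tokens, queue), consume(queue, sink))
    return sink


frames = asyncio.run(stream(["Hel", "lo", "!"]))
```

In a real server the decode loop cannot simply block on one slow client without stalling the batch, so the bounded buffer would more likely trigger cancellation or eviction of that request once it stays full for too long.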

Components

Gateway
Auth, rate limit, abuse detection, model routing.
Inference server (vLLM-style)
Continuous batching, paged KV cache, speculative decoding.
Prefix cache
Shared KV cache across requests with common prefixes.
Safety classifier
Pre- and post-generation policy enforcement.
Telemetry
Per-request cost, latency breakdown (prefill/decode), and policy outcomes.
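
One way a prefix cache can identify shared prefixes is to split the token sequence into fixed-size blocks and key each block by a hash that chains in the previous block's key, so that two requests share a key only if everything before that block matches too. The sketch below shows the lookup logic only; `BLOCK`, `block_keys`, and `PrefixCache` are hypothetical names, the block size of 4 is chosen for readability, and the actual KV tensors are omitted.

```python
import hashlib

BLOCK = 4  # tokens per cache block (illustrative; production block sizes are larger)


def block_keys(tokens, block: int = BLOCK):
    """Return one chained hash per full block of `tokens`."""
    keys, prev = [], b""
    full = len(tokens) - len(tokens) % block  # ignore the trailing partial block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        # Chaining `prev` makes the key depend on the entire prefix, not just this block.
        prev = hashlib.sha256(prev + b"|" + chunk).digest()
        keys.append(prev)
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = set()  # keys of blocks whose KV entries are resident

    def insert(self, tokens) -> None:
        self.blocks.update(block_keys(tokens))

    def match(self, tokens) -> int:
        """Number of leading tokens whose KV blocks are already cached."""
        hit = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            hit += BLOCK
        return hit


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])           # caches two full blocks
reused = cache.match([1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45])
```

Only the tokens after the matched prefix need prefill, which is where the cost saving comes from. Eviction (for example LRU over blocks) is left out of this sketch.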

Key decisions

Continuous batching over static batching.
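Generation lengths are heavy-tailed, so a static batch leaves finished slots idle until its longest sequence completes; continuous batching refills those slots at every decode step and keeps the GPU saturated.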
Prefix caching for common system prompts.
A major cost lever: when many requests share a long system prompt, recomputing it can dominate prefill cost.
Smaller-model fallback under load.
Graceful degradation: cheaper model serves overflow rather than the system failing.
Safety as a layered system, not a single guardrail.
Classifier + preference-tuned model + content policy + audit log — each catches what the others miss.
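
The continuous-batching decision can be illustrated with a toy model that counts decode steps needed to drain a queue. It is a simplification under stated assumptions: each request is just an output length, one step emits one token per running request, and prefill, memory limits, and scheduling overhead are ignored. The function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size: int) -> int:
    """Each batch runs until its longest request finishes; slots are not refilled."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size: int) -> int:
    """Finished requests free their slot immediately; waiting requests join next step."""
    pending = list(lengths)[::-1]  # reversed so pop() yields FIFO order
    running = []                   # remaining tokens for each in-flight request
    steps = 0
    while pending or running:
        # Admit waiting requests into any free slots.
        while pending and len(running) < batch_size:
            running.append(pending.pop())
        steps += 1
        # Every running request emits one token; drop those that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 1, 1, 1, 8, 1, 1, 1]  # two long generations among short ones
static = static_batching_steps(lengths, batch_size=4)          # 16
continuous = continuous_batching_steps(lengths, batch_size=4)  # 9
```

With this input the static scheduler pays for the long request in each batch (8 + 8 steps), while the continuous scheduler overlaps the two long requests and finishes in 9 steps.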

Pitfalls

  • Single shared queue for all tenants — noisy neighbors.
  • No per-tenant cost attribution — billing surprises.
  • Cold-start of safety classifier on the request path.
  • No clear separation between prefill and decode in monitoring.
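
As a sketch of one way around the single-shared-queue pitfall, the class below keeps a separate FIFO per tenant and serves tenants in round-robin order, so a tenant with a deep backlog cannot starve the others. `FairQueue` is a hypothetical name, and plain round-robin is an assumption for the example; weighted or token-cost-aware variants would follow the same shape.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin across tenants, FIFO within each tenant."""

    def __init__(self):
        self.queues = OrderedDict()  # tenant -> deque of waiting requests

    def push(self, tenant: str, request) -> None:
        self.queues.setdefault(tenant, deque()).append(request)

    def pop(self):
        """Return (tenant, request) for the next tenant in turn, or None if empty."""
        if not self.queues:
            return None
        tenant, q = next(iter(self.queues.items()))  # tenant at the front of the rotation
        request = q.popleft()
        del self.queues[tenant]
        if q:
            self.queues[tenant] = q  # still has work: go to the back of the rotation
        return tenant, request


fq = FairQueue()
for r in ("a1", "a2", "a3"):
    fq.push("tenant-a", r)       # noisy tenant enqueues a burst
fq.push("tenant-b", "b1")        # quiet tenant arrives afterwards
order = [fq.pop() for _ in range(5)]
```

Here `tenant-b` is served second despite arriving after three requests from `tenant-a`. Keying the same structure by tenant also gives a natural place to record per-tenant usage for cost attribution.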

Follow-up questions

  • How do you handle a 20x viral spike?
  • What's the multi-region failover plan?
  • How do you isolate large-tenant traffic from small tenants?
  • What's the model rollout pattern?
