AI Systems

LLM Inference Serving Platform

Serve large language model completions with low latency, high throughput, and predictable cost.

Scale to anchor on

Millions of QPS, sub-second time to first token (TTFT), inter-token latency of at most a few hundred milliseconds, multi-region.

Requirements

Functional

Stream completions token-by-token.
Support multiple model versions and routing.
Apply rate limits, abuse filters, and safety classifiers.
Cache prompt prefixes for cost reduction.

Non-functional

GPU utilization above 50% at steady state.
Graceful degradation under spikes.
Multi-region failover for availability.

High-level architecture

A gateway authenticates, rate-limits, and routes by model. Inference servers run with continuous batching, a paged KV cache, and prompt prefix caching. A safety layer (classifiers on input and output) runs in parallel with generation where possible. Streaming responses use Server-Sent Events (SSE) over HTTP/2 with backpressure.
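
The streaming-with-backpressure idea can be sketched with a bounded asyncio queue between the decoder and the client writer. This is a minimal illustration, not a production server: the names sse_frame, produce, consume, and stream are hypothetical, and the network write is replaced by appending to a list.

```python
import asyncio

def sse_frame(token: str) -> str:
    """Format one token as a Server-Sent Events frame."""
    return f"data: {token}\n\n"

async def produce(tokens, queue):
    # put() suspends while the queue is full, so a slow client
    # slows the producer instead of growing an unbounded buffer.
    for tok in tokens:
        await queue.put(sse_frame(tok))
    await queue.put(None)  # sentinel: end of stream

async def consume(queue, sink):
    while (frame := await queue.get()) is not None:
        sink.append(frame)       # stand-in for a network write
        await asyncio.sleep(0)   # yield control, as real I/O would

async def stream(tokens, max_buffered=2):
    queue = asyncio.Queue(maxsize=max_buffered)
    sink = []
    await asyncio.gather(produce(tokens, queue), consume(queue, sink))
    return sink

frames = asyncio.run(stream(["Hel", "lo", "!"]))
```

Tokens that contain newlines would need one data: line per line of text; the sketch assumes single-line tokens.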

Components

Gateway

Auth, rate limit, abuse detection, model routing.
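
A rough sketch of the gateway path, assuming a per-tenant token bucket for rate limiting and a static routing table. The tenant names, model names, pool names, and the handle function are all made up for illustration; abuse detection is left out.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now, cost=1.0):
        # Refill for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical model -> serving pool routing table.
ROUTES = {"large-v2": "pool-a", "small-v1": "pool-b"}

def handle(tenant, model, now, buckets):
    """Return (HTTP-style status, target pool or None)."""
    bucket = buckets.get(tenant)
    if bucket is None:
        return 401, None   # unknown tenant: authentication failure
    if model not in ROUTES:
        return 404, None   # unknown model
    if not bucket.allow(now):
        return 429, None   # rate limited
    return 200, ROUTES[model]

buckets = {"t1": TokenBucket(rate=1.0, capacity=2.0)}
results = [handle("t1", "large-v2", t, buckets) for t in (0.0, 0.0, 0.0, 1.0)]
```

With a burst capacity of 2, the third request at time 0 is rejected and the request one second later is admitted again.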

Inference server (vLLM-style)

Continuous batching, paged KV cache, speculative decoding.

Prefix cache

Shared KV cache across requests with common prefixes.
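
One way to picture this is block-level lookup: the prompt is split into fixed-size token blocks and each block is keyed by a hash chained over everything before it, so a key identifies a whole prefix. The sketch below is an assumption about how such a cache could be organized, not a description of any particular server; BLOCK, block_keys, and PrefixCache are illustrative names, and the stored values are placeholders rather than real KV tensors.

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative size)

def block_keys(tokens):
    """Chain-hash each full block so a key identifies the entire prefix."""
    keys, parent = [], b""
    full = len(tokens) - len(tokens) % BLOCK  # ignore the trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK])).encode()
        parent = hashlib.sha256(parent + chunk).digest()
        keys.append(parent)
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> placeholder for a KV block handle

    def insert(self, tokens):
        for key in block_keys(tokens):
            self.blocks.setdefault(key, object())

    def match(self, tokens):
        """Number of leading tokens whose KV blocks are already cached."""
        hit = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            hit += BLOCK
        return hit

system_prompt = list(range(8))            # stand-in token ids
cache = PrefixCache()
cache.insert(system_prompt + [100, 101])  # first request populates the cache
reused = cache.match(system_prompt + [200, 201, 202, 203, 204])
```

Here the second request reuses the two blocks that cover the shared system prompt and only has to prefill its own suffix.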

Safety classifier

Pre- and post-generation policy enforcement.

Telemetry

Per-request cost, latency breakdown (prefill/decode), and policy outcomes.
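
A per-request record along these lines keeps prefill and decode separate. The field names and the per-token prices below are assumptions chosen for the example, not a real schema or price list.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    tenant: str
    model: str
    prompt_tokens: int
    output_tokens: int
    prefill_ms: float     # time spent processing the prompt
    decode_ms: float      # time from first to last generated token
    policy_outcome: str   # e.g. "allowed", "input_blocked"

    @property
    def inter_token_ms(self):
        # Gaps between tokens: one fewer than the number of tokens.
        return self.decode_ms / max(self.output_tokens - 1, 1)

    def cost_micro_usd(self, price_in, price_out):
        """Cost in micro-dollars given per-token prices in micro-dollars."""
        return self.prompt_tokens * price_in + self.output_tokens * price_out

trace = RequestTrace("t1", "large-v2", 1000, 101, 300.0, 2000.0, "allowed")
per_token = trace.inter_token_ms
cost = trace.cost_micro_usd(price_in=2, price_out=8)  # hypothetical prices
```

Tagging every record with the tenant is what makes per-tenant cost attribution possible later.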

Key decisions

Continuous batching over static batching.

Tail-heavy generation lengths waste static batch slots; continuous batching keeps the GPU saturated.
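
A toy simulation makes the difference concrete. It counts decode steps only and deliberately ignores prefill, memory limits, and scheduling overhead; the generation lengths are invented to be tail-heavy.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot for the next waiting request."""
    waiting = list(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one decode step emits one token per active sequence
        running = [r - 1 for r in running if r > 1]
    return steps

def slot_utilization(lengths, batch_size, steps):
    """Fraction of (step, slot) pairs that produced a token."""
    return sum(lengths) / (steps * batch_size)

lengths = [2, 2, 2, 100, 2, 2, 2, 100]  # illustrative tail-heavy workload
static_steps = static_batching_steps(lengths, batch_size=4)          # 200
continuous_steps = continuous_batching_steps(lengths, batch_size=4)  # 104
```

For this workload the static schedule fills about 27% of its slots and the continuous one about 51%, because short requests no longer wait for the long ones in their batch.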

Prefix caching for common system prompts.

A major cost lever: when many requests share a long system prompt, recomputing it can dominate prefill cost.
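
A back-of-the-envelope estimate shows the size of the effect. The token counts and hit rate are invented for illustration, and the model assumes prefill cost is proportional to the number of tokens computed.

```python
def prefill_tokens_computed(system_tokens, user_tokens, hit_rate):
    """Expected prefill tokens per request when the system prompt
    is served from the prefix cache with probability `hit_rate`."""
    return (1 - hit_rate) * system_tokens + user_tokens

# Illustrative numbers: 2000-token system prompt, 200-token user message.
without_cache = prefill_tokens_computed(2000, 200, hit_rate=0.0)
with_cache = prefill_tokens_computed(2000, 200, hit_rate=0.9)
reduction = 1 - with_cache / without_cache
```

Under these assumptions the expected prefill work drops from 2200 to about 400 tokens per request, a reduction of roughly 82%.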

Smaller-model fallback under load.

Graceful degradation: cheaper model serves overflow rather than the system failing.
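
The routing rule can be as simple as a queue-depth check. This sketch assumes the gateway can see the queue depth of each pool; the function name, labels, and limits are hypothetical.

```python
def choose_model(primary_queue, fallback_queue, primary_limit, fallback_limit):
    """Pick a serving tier from current queue depths."""
    if primary_queue < primary_limit:
        return "primary"
    if fallback_queue < fallback_limit:
        return "fallback"   # overflow goes to the cheaper, smaller model
    return "shed"           # both tiers are full: reject with a retry hint

decisions = [
    choose_model(10, 0, primary_limit=100, fallback_limit=500),
    choose_model(100, 40, primary_limit=100, fallback_limit=500),
    choose_model(100, 500, primary_limit=100, fallback_limit=500),
]
```

A real deployment would likely add hysteresis so traffic does not flap between tiers when the queue hovers near the limit.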

Safety as a layered system, not a single guardrail.

Classifier + preference-tuned model + content policy + audit log — each catches what the others miss.
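
A minimal sketch of two of these layers (classifier and audit log) wrapped around generation. The input check runs concurrently with generation so it stays off the critical path, and output is withheld until both checks pass. The classify and generate functions are stand-ins: a keyword match and an echo, not real models.

```python
import asyncio

async def classify(text):
    """Hypothetical classifier: returns True when the text is allowed."""
    await asyncio.sleep(0)  # stand-in for a model call
    return "forbidden" not in text

async def generate(prompt):
    """Hypothetical generator: echoes the prompt."""
    await asyncio.sleep(0)  # stand-in for inference
    return f"echo: {prompt}"

async def guarded_completion(prompt, audit_log):
    gen = asyncio.create_task(generate(prompt))  # start generating immediately
    if not await classify(prompt):               # input check runs alongside
        gen.cancel()                             # stop spending GPU time
        audit_log.append(("input_blocked", prompt))
        return None
    output = await gen
    if not await classify(output):               # post-generation check
        audit_log.append(("output_blocked", prompt))
        return None
    audit_log.append(("allowed", prompt))
    return output

log = []
ok = asyncio.run(guarded_completion("hello", log))
blocked = asyncio.run(guarded_completion("something forbidden", log))
```

Every path writes to the audit log, so blocked requests remain visible even though they return nothing to the caller.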

Pitfalls

Single shared queue for all tenants — noisy neighbors.
No per-tenant cost attribution — billing surprises.
Cold-start of safety classifier on the request path.
No clear separation between prefill and decode in monitoring.
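
For the first pitfall, one common remedy is a queue per tenant served round-robin, so a tenant with a deep backlog cannot starve the others. The FairQueue class below is an illustrative sketch with unweighted round-robin; weights and priorities are left out.

```python
from collections import deque

class FairQueue:
    def __init__(self):
        self.queues = {}  # tenant -> deque of requests, in service order

    def push(self, tenant, request):
        self.queues.setdefault(tenant, deque()).append(request)

    def pop(self):
        """Serve the tenant at the front, then rotate it to the back."""
        if not self.queues:
            return None
        tenant = next(iter(self.queues))
        pending = self.queues.pop(tenant)
        request = pending.popleft()
        if pending:
            self.queues[tenant] = pending  # re-insert at the back
        return tenant, request

fq = FairQueue()
for r in ("b1", "b2", "b3"):
    fq.push("big", r)
fq.push("small", "s1")
order = [fq.pop() for _ in range(4)]
```

The small tenant's single request is served second rather than fourth, which is the isolation a single shared FIFO queue cannot provide.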

Follow-up questions

How do you handle a 20x viral spike?
What's the multi-region failover plan?
How do you isolate large-tenant traffic from small tenants?
What's the model rollout pattern?

Related patterns

caching, rate-limiting, circuit-breaker, multi-region