Platform
Rate Limiter at Scale
Enforce per-principal and global request limits across a distributed fleet with minimal latency overhead.
Scale to anchor on
Tens of millions of QPS globally, per-tenant limits across millions of tenants, sub-millisecond limiter overhead per request.
Requirements
Functional
- Enforce per-API-key, per-tenant, and global limits.
- Support multiple buckets (e.g., RPM, TPM for tokens) on the same request.
- Surface limit and retry-after metadata to clients.
Non-functional
- Sub-millisecond overhead on the local path; at most single-digit milliseconds when a request has to reach the central store.
- Survive Redis or backend outage with controlled degradation.
- Fair across tenants; resistant to noisy neighbors.
High-level architecture
Token-bucket algorithm backed by a distributed in-memory store (Redis, KeyDB). Each instance keeps local shadow buckets as a pre-check so that most requests avoid a round trip, and periodically resyncs them with the central store. For LLM-style token-based limits, the request body's token count is computed before the work is performed.
Components
Local pre-check
Per-instance approximate bucket to absorb traffic without central round-trip.
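A minimal sketch of how such a shadow bucket might behave, assuming a toy in-process `CentralStore` in place of Redis and a resync after every `sync_every` locally spent tokens. All class and parameter names here are illustrative, and refill over time is omitted for brevity.

```python
class CentralStore:
    """Toy stand-in for the central store; holds the authoritative balance."""

    def __init__(self, capacity: int) -> None:
        self.remaining = capacity

    def sync(self, consumed: int) -> int:
        """Apply a consumption delta from one instance, return the new balance."""
        self.remaining -= consumed
        return self.remaining


class ShadowBucket:
    """Per-instance approximate view of one principal's bucket."""

    def __init__(self, central: CentralStore, sync_every: int = 5) -> None:
        self.central = central
        self.sync_every = sync_every
        self.view = central.sync(0)  # last balance seen from the central store
        self.pending = 0             # tokens spent locally, not yet reported

    def _resync(self) -> None:
        self.view = self.central.sync(self.pending)
        self.pending = 0

    def try_acquire(self, cost: int = 1) -> bool:
        if self.view - self.pending < cost:
            self._resync()  # local view looks empty: confirm before rejecting
            if self.view < cost:
                return False
        self.pending += cost
        if self.pending >= self.sync_every:
            self._resync()
        return True


central = CentralStore(capacity=10)
a, b = ShadowBucket(central), ShadowBucket(central)
admitted = 0
for _ in range(12):  # two instances serve interleaved traffic
    admitted += a.try_acquire() + b.try_acquire()
```

In this run 15 requests are admitted against a budget of 10, because instance `a` keeps spending against a stale view until its next resync. A smaller `sync_every` tightens the overshoot at the price of more round trips.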
Central limiter (Redis)
Authoritative token bucket per principal; atomic decrement with Lua.
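The refill-then-decrement step that a server-side script would run atomically can be modeled in plain Python. This is a sketch only: `BucketState` and `take` are illustrative names, the timestamp is passed in so the example is deterministic, and in Redis the same read-compute-write sequence would live inside one script so concurrent callers cannot interleave.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BucketState:
    tokens: float        # tokens currently available
    last_refill: float   # timestamp (seconds) of the last update


def take(state: BucketState, now: float, capacity: float,
         refill_per_sec: float, cost: float = 1.0) -> Tuple[bool, float]:
    """Refill by elapsed time, then try to deduct `cost`.

    Returns (allowed, retry_after_seconds). Mutates `state` in place, which
    stands in for the write-back a server-side script would perform.
    """
    elapsed = max(0.0, now - state.last_refill)
    state.tokens = min(capacity, state.tokens + elapsed * refill_per_sec)
    state.last_refill = now
    if state.tokens >= cost:
        state.tokens -= cost
        return True, 0.0
    # Time until enough tokens have accumulated for this request.
    return False, (cost - state.tokens) / refill_per_sec


state = BucketState(tokens=5.0, last_refill=0.0)
# Six back-to-back requests at t=0: the bucket holds five, so the sixth fails.
burst = [take(state, now=0.0, capacity=5, refill_per_sec=1)[0] for _ in range(6)]
# Two seconds later two tokens have refilled, so the next request passes.
later = take(state, now=2.0, capacity=5, refill_per_sec=1)
```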
Quota policy service
Resolves which buckets apply to a given request (per-key, per-tenant, per-feature).
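As an illustration, resolution can be thought of as mapping a request to a list of (bucket key, cost) pairs. The key naming scheme, the `Request` fields, and the fixed set of buckets below are assumptions made for the sketch; a real policy service would presumably look these up from configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    api_key: str
    tenant: str
    feature: str
    token_count: int  # LLM tokens the request is expected to consume


@dataclass(frozen=True)
class BucketCharge:
    key: str   # bucket identifier in the central store (hypothetical scheme)
    cost: int  # how much this request draws from that bucket


def resolve_buckets(req: Request) -> List[BucketCharge]:
    """Return every bucket the request must pass, narrowest scope first."""
    return [
        BucketCharge(f"rl:key:{req.api_key}:rpm", 1),
        BucketCharge(f"rl:tenant:{req.tenant}:rpm", 1),
        BucketCharge(f"rl:tenant:{req.tenant}:tpm", req.token_count),
        BucketCharge(f"rl:feature:{req.feature}:rpm", 1),
        BucketCharge("rl:global:rpm", 1),
    ]


charges = resolve_buckets(Request("k1", "t1", "chat", token_count=1200))
```

Request-count buckets are charged 1 each, while the tokens-per-minute bucket is charged the request's token count.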
Metadata in responses
Returns X-RateLimit-* headers and Retry-After on 429 so clients can back off correctly.
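A small sketch of building that metadata. The X-RateLimit-* names are a widespread convention rather than a standard, and their exact semantics vary between providers; this example assumes X-RateLimit-Reset carries seconds until the bucket is full enough again. The helper name `rate_limit_headers` is illustrative.

```python
import math
from typing import Dict


def rate_limit_headers(limit: int, remaining: int, reset_in_sec: float,
                       allowed: bool) -> Dict[str, str]:
    """Build response headers describing the caller's current limit state."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(math.ceil(reset_in_sec)),
    }
    if not allowed:
        # Retry-After takes whole seconds; round up so clients never retry early.
        headers["Retry-After"] = str(max(1, math.ceil(reset_in_sec)))
    return headers


rejected = rate_limit_headers(limit=100, remaining=0, reset_in_sec=2.3, allowed=False)
accepted = rate_limit_headers(limit=100, remaining=42, reset_in_sec=12.0, allowed=True)
```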
Key decisions
Token bucket over fixed window.
Token bucket allows bursts within the average, which matches user expectations better.
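One commonly cited weakness of fixed windows, not spelled out above, is the boundary effect: a client can spend a full quota at the end of one window and another at the start of the next. The toy comparison below assumes a limit of 10 requests per 60 seconds; both function names are illustrative.

```python
from typing import Dict, List


def fixed_window_admits(timestamps: List[float], limit: int, window: float) -> int:
    """Count admitted requests under a fixed-window counter."""
    counts: Dict[int, int] = {}
    admitted = 0
    for t in timestamps:
        w = int(t // window)  # index of the window this request falls in
        if counts.get(w, 0) < limit:
            counts[w] = counts.get(w, 0) + 1
            admitted += 1
    return admitted


def token_bucket_admits(timestamps: List[float], capacity: int,
                        refill_per_sec: float) -> int:
    """Count admitted requests under a token bucket that starts full."""
    tokens, last = float(capacity), timestamps[0]
    admitted = 0
    for t in timestamps:
        tokens = min(capacity, tokens + (t - last) * refill_per_sec)
        last = t
        if tokens >= 1:
            tokens -= 1
            admitted += 1
    return admitted


# Ten requests just before the minute boundary, ten just after it.
arrivals = [59.0] * 10 + [61.0] * 10
fw = fixed_window_admits(arrivals, limit=10, window=60)
tb = token_bucket_admits(arrivals, capacity=10, refill_per_sec=10 / 60)
```

Here the fixed window admits all 20 requests within two seconds, while the token bucket admits the first burst of 10 and then holds the client to the refill rate.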
Approximate local pre-check.
Removes the Redis hop from most requests at the cost of slight overshoot, which is usually acceptable.
Fail-open vs. fail-closed during limiter outage.
Most public APIs fail open and raise elevated alerts; safety-critical limits fail closed. Choose deliberately for each limit.
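The choice can be made explicit per limit instead of being left to whatever the error path happens to do. A sketch, assuming a hypothetical `LimiterUnavailable` exception raised by the central check and an `on_outage` hook standing in for alerting:

```python
from enum import Enum
from typing import Callable, List


class FailureMode(Enum):
    OPEN = "open"      # admit traffic when the limiter is unreachable
    CLOSED = "closed"  # reject traffic when the limiter is unreachable


class LimiterUnavailable(Exception):
    """Raised by the (hypothetical) central check when the store is unreachable."""


def check_with_fallback(central_check: Callable[[], bool], mode: FailureMode,
                        on_outage: Callable[[], None] = lambda: None) -> bool:
    """Run the central check; on outage, apply the configured failure mode."""
    try:
        return central_check()
    except LimiterUnavailable:
        on_outage()  # hook for the elevated alert
        return mode is FailureMode.OPEN


def broken_check() -> bool:
    """Simulates a central store that cannot be reached."""
    raise LimiterUnavailable("central store timeout")


alerts: List[str] = []
public_api = check_with_fallback(broken_check, FailureMode.OPEN,
                                 lambda: alerts.append("outage"))
safety_limit = check_with_fallback(broken_check, FailureMode.CLOSED,
                                   lambda: alerts.append("outage"))
```

During the simulated outage the public API limit admits the request and the safety-critical limit rejects it; both paths fire the alert hook.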
Compute token cost before the work for LLM APIs.
Otherwise expensive long-context calls can exceed limits after the cost is already incurred.
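Charging up front needs an estimate of the cost. The sketch below assumes the estimate is prompt tokens plus the requested maximum output tokens (an assumption; the text only says the count is computed beforehand) and charges the requests-per-minute and tokens-per-minute buckets all-or-nothing, so a rejected call does not consume request quota. Balances are plain dictionaries here rather than a real store.

```python
from typing import Dict


def estimate_cost(prompt_tokens: int, max_output_tokens: int) -> int:
    """Upper-bound the token cost before any model work is done."""
    return prompt_tokens + max_output_tokens


def try_charge(balances: Dict[str, int], charges: Dict[str, int]) -> bool:
    """Deduct every charge or none: check all buckets first, then commit."""
    if any(balances[name] < cost for name, cost in charges.items()):
        return False
    for name, cost in charges.items():
        balances[name] -= cost
    return True


balances = {"rpm": 3, "tpm": 10_000}
# A modest request fits in both buckets and is charged to both.
small = try_charge(balances, {"rpm": 1, "tpm": estimate_cost(1_500, 500)})
# A long-context request exceeds the token bucket and is rejected before any work.
huge = try_charge(balances, {"rpm": 1, "tpm": estimate_cost(9_000, 4_000)})
```

After both calls the balances reflect only the admitted request; the rejected one leaves the request-count bucket untouched.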
Pitfalls
- Per-IP limiting penalizes mobile users who share one address behind NAT.
- Missing Retry-After headers: clients retry immediately and amplify the overload.
- A single global Redis instance becomes the bottleneck; shard by principal.
- No dead-letter queue (DLQ) or backpressure when downstream is overwhelmed.
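The sharding advice in the pitfalls above can be sketched as a stable hash from principal to shard. The shard names and the `shard_for` helper are hypothetical, and plain modulo hashing is used only for brevity: it remaps most keys when the shard count changes, which consistent hashing or Redis Cluster's hash slots avoid.

```python
import hashlib
from typing import List


def shard_for(principal: str, shards: List[str]) -> str:
    """Map a principal to one shard with a stable hash (not Python's hash())."""
    digest = hashlib.sha256(principal.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


shards = ["redis-0", "redis-1", "redis-2", "redis-3"]
first = shard_for("tenant-42", shards)
second = shard_for("tenant-42", shards)  # same principal, same shard
used = {shard_for(f"tenant-{i}", shards) for i in range(1000)}
```

Because every bucket for a principal lands on the same shard, a check for that principal stays a single-node operation.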
Follow-up questions
- How do you handle a single tenant trying to consume 100% of capacity?
- How does the system degrade when Redis is unreachable?
- How do you support multiple bucket types on one request?
- How are limits communicated to clients?