Platform

Rate Limiter at Scale

Enforce per-principal and global request limits across a distributed fleet with minimal latency overhead.

Scale to anchor on

Tens of millions of QPS globally, per-tenant limits across millions of tenants, sub-millisecond limiter overhead per request.

Requirements

Functional

Enforce per-API-key, per-tenant, and global limits.
Support multiple buckets (e.g., RPM, TPM for tokens) on the same request.
Surface limit and retry-after metadata to clients.

Non-functional

Sub-millisecond overhead on the local path; single-digit milliseconds at most when a request must reach the central store.
Survive an outage of Redis or another backing store with controlled degradation.
Fair across tenants; resistant to noisy neighbors.

High-level architecture

Token-bucket algorithm backed by a distributed in-memory store (Redis, KeyDB). A local pre-check against per-instance shadow buckets avoids a round trip to the store on every request, and each instance periodically resyncs with the central store. For LLM-style token-based limits, the request's token count is computed up front so the charge is applied before any work is performed.
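A minimal sketch of the token-bucket step itself, kept in memory for clarity. The class and field names are illustrative rather than taken from any particular library; the caller passes `now` explicitly so the behavior is easy to reason about and test.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float      # maximum burst size, in tokens
    refill_rate: float   # tokens added per second (the sustained average)
    tokens: float        # current balance
    updated_at: float    # time of the last refill, in seconds

    def try_consume(self, cost: float, now: float) -> bool:
        # Refill lazily from elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        # Guard against a clock that steps backwards.
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Refilling lazily on each call avoids a background timer per bucket, which matters when there are millions of principals.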

Components

Local pre-check

Per-instance approximate bucket that absorbs most traffic without a round trip to the central store.
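One way to sketch the shadow bucket, assuming a hypothetical `sync` hook that reports locally consumed tokens and returns the central balance. Central refill is left out to keep the example short, and `InMemoryCentral` is only a stand-in so the sketch runs without Redis.

```python
from typing import Callable


class ShadowBucket:
    """Per-instance approximation of one central bucket (illustrative only)."""

    def __init__(self, sync: Callable[[float], float], sync_interval: float, now: float):
        self._sync = sync                    # hypothetical hook: report usage, get central balance
        self._sync_interval = sync_interval  # seconds between resyncs
        self._pending = 0.0                  # tokens spent locally since the last resync
        self._remaining = sync(0.0)          # last central balance seen by this instance
        self._last_sync = now

    def try_consume(self, cost: float, now: float) -> bool:
        if now - self._last_sync >= self._sync_interval:
            self._remaining = self._sync(self._pending)
            self._pending = 0.0
            self._last_sync = now
        if self._remaining - self._pending >= cost:
            self._pending += cost  # no network call on the hot path
            return True
        return False


class InMemoryCentral:
    """Stand-in for the central store so the sketch runs without Redis."""

    def __init__(self, tokens: float):
        self.tokens = tokens

    def sync(self, used: float) -> float:
        self.tokens = max(0.0, self.tokens - used)
        return self.tokens
```

Two instances that both start from the same central balance can each admit requests against it until their next resync. That is the overshoot this design accepts, and it is bounded by the resync interval and the number of instances.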

Central limiter (Redis)

Authoritative token bucket per principal; refill and decrement run atomically in a single Lua script.
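A sketch of what such a script might look like, held as a Python string. The hash field names (`tokens`, `ts`), the argument order, and the expiry policy are assumptions made for illustration. `token_bucket_step` is a pure-Python mirror of the same arithmetic so the logic can be exercised without a Redis server.

```python
# Lua source intended for Redis EVAL. KEYS[1] is the bucket key;
# ARGV = capacity, refill rate per second, cost, caller-supplied time in ms.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local cost     = tonumber(ARGV[3])
local now_ms   = tonumber(ARGV[4])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity
local ts     = tonumber(state[2]) or now_ms

local elapsed = math.max(0, now_ms - ts) / 1000.0
tokens = math.min(capacity, tokens + elapsed * rate)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', math.max(ts, now_ms))
-- Let idle buckets expire once they would have refilled completely.
redis.call('PEXPIRE', KEYS[1], math.ceil(capacity / rate * 1000))
-- Numbers returned from Lua are truncated to integers, so send the balance as a string.
return {allowed, tostring(tokens)}
"""


def token_bucket_step(state: dict, capacity: float, rate: float, cost: float, now_ms: int):
    """Pure-Python mirror of the script's arithmetic, for tests and reasoning."""
    tokens = state.get("tokens", capacity)
    ts = state.get("ts", now_ms)
    elapsed = max(0, now_ms - ts) / 1000.0
    tokens = min(capacity, tokens + elapsed * rate)
    allowed = tokens >= cost
    if allowed:
        tokens -= cost
    state["tokens"], state["ts"] = tokens, max(ts, now_ms)
    return allowed, tokens
```

Because Redis runs a script to completion without interleaving other commands, the read, refill, and decrement cannot race with another instance's request for the same key. With redis-py, a script like this would typically be registered through `register_script` and invoked with `keys=[...]` and `args=[...]`.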

Quota policy service

Resolves which buckets apply to a given request (per-key, per-tenant, per-feature).
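A small sketch of bucket resolution. The key prefixes and the numeric limits below are hypothetical placeholders; a real policy service would load them from configuration or a database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BucketSpec:
    key: str               # storage key for the bucket
    capacity: int          # burst size
    refill_per_sec: float  # sustained rate


@dataclass(frozen=True)
class RequestContext:
    api_key: str
    tenant_id: str
    feature: Optional[str] = None


def resolve_buckets(ctx: RequestContext) -> List[BucketSpec]:
    """Return every bucket a request must pass, narrowest scope first."""
    # All limits here are made-up values for illustration.
    buckets = [
        BucketSpec(f"rl:key:{ctx.api_key}", capacity=60, refill_per_sec=1.0),
        BucketSpec(f"rl:tenant:{ctx.tenant_id}", capacity=6000, refill_per_sec=100.0),
    ]
    if ctx.feature is not None:
        buckets.append(
            BucketSpec(f"rl:feature:{ctx.tenant_id}:{ctx.feature}", capacity=600, refill_per_sec=10.0)
        )
    return buckets
```

Keeping resolution separate from enforcement means new bucket scopes can be added without touching the limiter's hot path.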

Metadata in responses

Returns X-RateLimit-* headers and Retry-After on 429 so clients can back off correctly.
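A sketch of building those headers from token-bucket state. The `X-RateLimit-*` names are a widely used convention rather than a formal standard, and their exact semantics vary between APIs; this example assumes `X-RateLimit-Reset` means seconds until the bucket is full again. The function name and parameters are illustrative.

```python
import math
from typing import Dict


def rate_limit_headers(
    limit: int, remaining: float, refill_per_sec: float, cost: float, allowed: bool
) -> Dict[str, str]:
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, math.floor(remaining))),
        # Assumption: seconds until the bucket is full again.
        "X-RateLimit-Reset": str(max(0, math.ceil((limit - remaining) / refill_per_sec))),
    }
    if not allowed:
        # Earliest whole second at which the bucket will hold `cost` tokens.
        deficit = cost - remaining
        headers["Retry-After"] = str(max(1, math.ceil(deficit / refill_per_sec)))
    return headers
```

Deriving `Retry-After` from the actual deficit, rather than returning a fixed value, tells the client the earliest moment a retry can succeed.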

Key decisions

Token bucket over fixed window.

A token bucket allows short bursts up to the bucket capacity while holding the long-run average to the refill rate, which matches user expectations better than a hard per-window cutoff.
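A small comparison, under simplified assumptions (unit-cost requests, a single principal), of how the two algorithms treat a burst that straddles a window boundary. Both helper functions are illustrative and only count admitted requests.

```python
from typing import Dict, List


def fixed_window_admitted(timestamps: List[float], limit: int, window: float) -> int:
    counts: Dict[int, int] = {}
    admitted = 0
    for t in timestamps:
        w = int(t // window)  # index of the window this request falls into
        if counts.get(w, 0) < limit:
            counts[w] = counts.get(w, 0) + 1
            admitted += 1
    return admitted


def token_bucket_admitted(timestamps: List[float], capacity: float, rate: float) -> int:
    if not timestamps:
        return 0
    tokens, last = float(capacity), timestamps[0]  # bucket starts full
    admitted = 0
    for t in timestamps:
        tokens = min(capacity, tokens + (t - last) * rate)
        last = t
        if tokens >= 1:
            tokens -= 1
            admitted += 1
    return admitted


# Ten requests just before a minute boundary and ten just after it.
burst = [59.0] * 10 + [60.0] * 10
```

With a limit of 10 per 60-second window, the fixed window admits all 20 requests within about one second because its counter resets at the boundary. A token bucket with capacity 10 refilling at 10 tokens per 60 seconds admits the first 10 and rejects the rest until tokens accrue.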

Approximate local pre-check.

Avoids a Redis hop on every request at the cost of slight overshoot, which is usually acceptable.

Fail-open vs. fail-closed during limiter outage.

Most public APIs fail open and raise elevated alerts; safety-critical limits fail closed. Pick consciously.
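A sketch of making that choice explicit per limit. `LimiterUnavailable` and the `on_degraded` hook are hypothetical names; the point is that the outage path is a deliberate branch rather than an unhandled exception.

```python
from typing import Callable


class LimiterUnavailable(Exception):
    """Raised by the (hypothetical) central check when the store cannot be reached."""


def check_with_fallback(
    check: Callable[[], bool],
    fail_open: bool,
    on_degraded: Callable[[Exception], None],
) -> bool:
    try:
        return check()
    except LimiterUnavailable as exc:
        on_degraded(exc)  # e.g. bump an alerting metric
        # fail_open=True admits the request; fail_open=False rejects it.
        return fail_open
```

Passing `fail_open` as a parameter lets a public read endpoint and a safety-critical limit share the same code path while degrading differently.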

Compute token cost before the work for LLM APIs.

Otherwise, an expensive long-context call can be found to exceed its limit only after the cost has already been incurred.
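A sketch of charging a request up front against both a request bucket and a token bucket. The four-characters-per-token estimate is only a rough heuristic standing in for a real tokenizer, and reserving `max_output_tokens` in advance is an assumption about policy; both function names are illustrative.

```python
import math
from typing import Dict


def estimate_token_cost(prompt: str, max_output_tokens: int) -> int:
    # Assumption: about 4 characters per token. Use the model's tokenizer in practice.
    return math.ceil(len(prompt) / 4) + max_output_tokens


def try_reserve(balances: Dict[str, float], costs: Dict[str, float]) -> bool:
    """All-or-nothing: deduct from every bucket, or from none of them."""
    if any(balances[name] < cost for name, cost in costs.items()):
        return False
    for name, cost in costs.items():
        balances[name] -= cost
    return True
```

For example, with `balances = {"rpm": 2, "tpm": 1000}`, a 400-character prompt with 500 reserved output tokens costs `{"rpm": 1, "tpm": 600}`. The first such request is admitted and the second is rejected without touching the request bucket, so a rejected call does not leak quota from the buckets that had room.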

Pitfalls

Per-IP limiting penalizes mobile users who share one address behind carrier NAT.
Missing Retry-After headers: clients hammer back immediately.
A single global Redis instance becomes the bottleneck; shard by principal.
No dead-letter queue (DLQ) or backpressure when downstream services are overwhelmed.
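For the sharding pitfall, a minimal sketch of routing each principal to a shard by hashing its identifier. The function name is illustrative, and plain modulo hashing is used only for brevity.

```python
import hashlib


def shard_for(principal: str, num_shards: int) -> int:
    """Map a principal to a shard index in a stable, evenly spread way."""
    digest = hashlib.sha256(principal.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards
```

All buckets for one principal land on the same shard, so the atomic script still sees a consistent view. Modulo hashing remaps most keys when the shard count changes; consistent hashing or a fixed slot table, as Redis Cluster uses, limits that churn.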

Follow-up questions

How do you handle a single tenant trying to consume 100% of capacity?
How does the system degrade when Redis is unreachable?
How do you support multiple bucket types on one request?
How are limits communicated to clients?

Related patterns

rate-limiting, circuit-breaker, caching