Staff System Design Playbook

12 cross-cutting patterns Staff and Principal candidates are expected to discuss in any design round — with when to reach for them, the real trade-offs, the pitfalls interviewers probe, and the follow-up questions you should be ready for.

Idempotency keys

Correctness

Make every mutating request safe to retry. The defining pattern for money-moving and tool-using systems.

When to reach for it

Any API where retries are possible — payments, webhooks, agent tool calls, mobile-client retries, queue consumers.

How it works

Caller supplies a unique key; server stores (key → result) for a bounded TTL. Repeated requests with the same key return the stored result rather than re-executing. Persist the key+outcome inside the same transaction as the side effect.

Trade-offs

  • Storage cost and TTL strategy (too short = lost dedup, too long = unbounded growth).
  • Choosing the right scope for the key (per-customer, per-tenant, global) affects collision and isolation.
  • Server-generated vs. client-generated keys: client-generated is more robust to network failure but requires trust.

Common pitfalls

  • Storing the key in a separate transaction from the effect — race conditions reintroduce double execution.
  • Hashing the request body as the key — small format changes break dedup.
  • Forgetting webhook delivery is itself retried — both producer and consumer need idempotency.

Follow-ups interviewers ask

  • How do you handle a request that returned an error — retry-safe or not?
  • What's your TTL and how did you pick it?
  • How is the key surfaced in logs for incident response?

Consistent hashing

Sharding

Distribute load across a fleet so that adding or removing a node moves only ~1/N of the keys.

When to reach for it

Sharded caches, distributed databases, request routing to stateful workers (e.g., per-user model state, per-tenant inference).

How it works

Hash both keys and nodes onto a ring. Each key maps to the next node clockwise. Use virtual nodes (e.g., 256 per physical) to smooth distribution and reduce hot-spotting.

Trade-offs

  • Hot keys still hot — consistent hashing doesn't solve skew, only re-balancing.
  • Virtual node count trades distribution smoothness for memory and rebalance cost.
  • Choice of hash function affects skew and security (cryptographic hashes avoid adversarial hot-spotting).

Common pitfalls

  • Tracking node membership without a strongly consistent view — split-brain routing.
  • Forgetting to handle in-flight requests during membership changes.
  • Letting a single celebrity user create a permanent hot shard.

Follow-ups interviewers ask

  • How do you handle skew when one tenant is 100x larger than the median?
  • What's the impact on replication when a node leaves?
  • How do you migrate keys during planned rebalances?

Change data capture (CDC)

Data movement

Stream every database mutation as an event log — the universal connector between OLTP and downstream systems.

When to reach for it

Materialized views, search index sync, feature stores, audit logs, cross-region replication, ML training-data pipelines.

How it works

Read the database's commit log (e.g., Postgres WAL, MySQL binlog, MongoDB oplog) and emit ordered events to a stream. Consumers apply changes idempotently downstream.

Trade-offs

  • At-least-once delivery is the realistic guarantee — consumers must be idempotent.
  • Schema evolution requires explicit contracts; downstream consumers can't move at the source's pace.
  • Backfill strategy is non-trivial — log retention, snapshot+stream merge, ordering during cutover.

Common pitfalls

  • Treating CDC as a fire-and-forget pipe instead of a versioned contract.
  • Tight coupling of consumer business logic to source schema — a column rename takes down 10 services.
  • Ignoring log retention until a consumer falls behind and can't recover without snapshot.

Follow-ups interviewers ask

  • How do you bootstrap a new consumer that needs full history?
  • How do you evolve the source schema without breaking consumers?
  • What ordering guarantees does your stream offer?

Fan-out: read vs. write

Feed / messaging

The classic timeline architecture question. Decides where you absorb the cost: at write time or at read time.

When to reach for it

Social feeds, notifications, chat delivery, broadcast updates — anywhere one event must reach many subscribers.

How it works

Fan-out on write: when a producer posts, write a copy into each subscriber's inbox. Read is a cheap lookup. Fan-out on read: writers append to a single timeline; readers query and merge across followed producers at read time. Real systems use a hybrid — fan-out on write for normal users, fan-out on read for celebrities.

Trade-offs

  • Fan-out on write: cheap reads, expensive writes, terrible for celebrities with 10M followers.
  • Fan-out on read: cheap writes, expensive reads, cache-heavy.
  • Hybrid: complexity in routing logic and consistency between paths.

Common pitfalls

  • Designing for the median user and crashing on power users (or vice versa).
  • Inbox storage growing without TTL or pagination strategy.
  • Synchronously fanning out at write time — viral content takes the site down.

Follow-ups interviewers ask

  • Where do you draw the celebrity threshold?
  • How do you handle a viral post that crosses the threshold mid-flight?
  • What's the consistency model for read-your-own-write?

Leader election and consensus

Distributed coordination

When exactly one node must be authoritative, you need a real consensus protocol — not a config file.

When to reach for it

Distributed locks, schedulers, primary-replica failover, distributed cron, deployment coordinators.

How it works

Use Raft or Paxos via an off-the-shelf system (etcd, ZooKeeper, Consul). Leader holds a time-bounded lease and renews it; followers can challenge if the lease expires. Operations route through the leader; followers serve reads with bounded staleness.

Trade-offs

  • Strong consistency at the cost of availability under partition (CP side of CAP).
  • Leader is a throughput bottleneck — sharding the keyspace across multiple Raft groups is the standard scale-out.
  • Lease duration trades failover latency against false-positive failovers.

Common pitfalls

  • Rolling your own — almost every homegrown consensus has a split-brain bug.
  • Misconfiguring lease/election timeouts and getting flapping leaders.
  • Letting clients cache stale leader identity and write to a former leader.

Follow-ups interviewers ask

  • How do you handle a network partition where the leader is isolated but still thinks it's leader?
  • What's the read consistency model for followers?
  • How does the system behave during a rolling upgrade of the consensus quorum?

Caching strategy and invalidation

Performance

Caches are how you afford the architecture. Invalidation is how you stay correct.

When to reach for it

High-read, low-write paths: profile lookups, recommendation candidates, LLM prompt-prefix caches, vector retrieval.

How it works

Pick read-through (cache loads on miss), write-through (writes go to cache and DB), or write-behind (async). Pair with TTL and explicit invalidation. For correctness-sensitive caches, use single-flight / request coalescing to avoid thundering herd on miss.

Trade-offs

  • TTL-only: simple, but stale until expiry — unacceptable for some domains.
  • Explicit invalidation: correct, but requires cache-key discipline and ordering with writes.
  • Write-through: extra latency on the write path; write-behind: data loss risk on cache failure.

Common pitfalls

  • Cache stampede when a popular key expires — solve with single-flight and probabilistic early refresh.
  • Storing serialized objects whose schema evolves — silent corruption on rollback.
  • Treating the cache as the source of truth and discovering it during a node loss.

Follow-ups interviewers ask

  • How do you handle stampede on a hot key after expiry?
  • What's your invalidation pattern — and what consistency does it guarantee?
  • What happens when the cache layer is unavailable?

Rate limiting and load shedding

Resilience

Decide who gets served and who waits — before the system collapses under load.

When to reach for it

Any public API, multi-tenant platform, LLM inference fleet, agent tool calls, webhook delivery.

How it works

Token bucket per principal (user/tenant/API key). At system level, add admission control with priority queues and explicit shedding (return 429 with Retry-After). For LLM serving, queue depth + per-replica concurrency saturation are better signals than QPS.

Trade-offs

  • Hard limits are simple but feel arbitrary to users — burst tolerance via token buckets is friendlier.
  • Distributed rate limiters require either central state (latency) or approximate algorithms (overshoot).
  • Priority queues prevent starvation but require clear product policy on whose traffic dies first.

Common pitfalls

  • Limiting by IP — kills mobile users behind NAT.
  • Forgetting to surface limit info in headers — clients have no way to back off intelligently.
  • No load shedding at all — cascade failure when one downstream slows.

Follow-ups interviewers ask

  • What happens when a single tenant tries to use 100% of capacity?
  • How does the limiter behave during a downstream outage?
  • How do you communicate limits to clients so they can implement client-side backoff?

Circuit breakers and bulkheads

Resilience

Stop pouring requests into a failing dependency. Contain blast radius before it takes the rest of the system with it.

When to reach for it

Any call to an external dependency: third-party APIs, model providers, internal services, databases.

How it works

Circuit breaker tracks failure rate over a sliding window. When the threshold is crossed, the breaker opens — calls fail fast without contacting the dependency. After a cool-down, it half-opens and lets a few probe requests through. Bulkhead pattern: isolate calls to different dependencies into separate thread pools/connection pools so one slow dependency can't exhaust the pool for the rest.

Trade-offs

  • Faster failure but no recovery without explicit probes.
  • Tuning thresholds (failure rate, window size, cool-down) is workload-specific and requires real telemetry.
  • Bulkheads cost extra resources for isolation.

Common pitfalls

  • Opening too eagerly on transient blips — flapping.
  • Forgetting to expose breaker state in metrics — silent failures invisible to on-call.
  • Using a single shared pool for all downstreams — one slow dependency starves all.

Follow-ups interviewers ask

  • What's the half-open probe strategy?
  • What does the system return to clients when a breaker is open?
  • How is breaker state observable in your dashboards?

Queues for write decoupling

Asynchrony

Absorb spikes, smooth downstream load, and turn synchronous failures into retriable async work.

When to reach for it

Email/notification delivery, video transcoding, ML training jobs, webhook delivery, anything that doesn't need a synchronous response.

How it works

Producers write to a queue (SQS, Kafka, NATS). Consumers process at their own pace. Use dead-letter queues for poison messages, visibility timeouts to avoid double-processing, and explicit retry policies with backoff.

Trade-offs

  • Decouples write/read latency — but adds end-to-end latency for the synchronous case.
  • At-least-once vs. exactly-once — exactly-once is mostly a marketing term; design consumers to be idempotent.
  • Ordering guarantees vary (partition-level vs. global) and constrain consumer parallelism.

Common pitfalls

  • Producer faster than consumers indefinitely — queue grows without bound.
  • Missing DLQ — poison messages block the whole partition.
  • Long-running consumers exceeding visibility timeout, causing duplicate processing.

Follow-ups interviewers ask

  • How do you detect and handle a stuck consumer?
  • What's the retention policy and what happens when it's exceeded?
  • How does the system behave during a consumer outage of 1 hour?

Transactional outbox

Correctness

Publish events reliably from a service that owns a database — without dual-write inconsistency.

When to reach for it

Whenever a service must both update its database AND publish an event (CDC, downstream pipeline, integration).

How it works

In the same DB transaction as the business write, insert the event into an outbox table. A separate poller (or CDC reader) drains the outbox into the message bus and marks rows as published. Consumers process at-least-once and dedupe.

Trade-offs

  • Adds DB load and write amplification.
  • End-to-end latency depends on poller frequency or CDC lag.
  • Operational complexity — outbox needs monitoring and retention rules.

Common pitfalls

  • Dual-writing to DB and message bus without outbox — the classic correctness bug.
  • Outbox without dedup downstream — duplicate processing.
  • Forgetting to clean up published rows; outbox grows unbounded.

Follow-ups interviewers ask

  • What's the dedup strategy on the consumer side?
  • How do you handle a poller falling behind?
  • What's your retention and cleanup policy for the outbox table?

Multi-region strategy

Availability

Active-active, active-passive, or single-region with replicas — the most over-asked, under-justified architecture question.

When to reach for it

Cross-continent latency requirements, regulatory data residency, regional outage tolerance, planet-scale SLAs.

How it works

Active-active: traffic served from all regions, conflicts handled by CRDTs or last-write-wins per partition. Active-passive: one writer region, async replication to passives, manual or automated failover. Single-region with edge cache: serve reads at edge, write to home region. Choose based on RPO/RTO and conflict tolerance.

Trade-offs

  • Active-active gives lowest latency and best availability but requires real conflict resolution and is expensive.
  • Active-passive is simpler but failover is a planned exercise with measurable RTO.
  • Edge-only multi-region is cheap but writes still go to one region — a regional outage takes writes down.

Common pitfalls

  • Reaching for active-active because it sounds impressive when active-passive meets the SLA.
  • Forgetting that data residency may forbid replicating certain data across regions.
  • Not testing failover until a real outage — RTO is fiction until exercised.

Follow-ups interviewers ask

  • What's your RPO/RTO and how do you test it?
  • How are conflicts resolved during partition?
  • How do schema changes roll out across regions?

Feature flags and progressive delivery

Rollout

Decouple deploy from release. The architectural primitive behind every safe rollout, A/B test, and emergency kill switch.

When to reach for it

High-risk launches, gradual rollouts, A/B experiments, regional rollouts, kill switches for incident response.

How it works

Wrap risky paths in a flag with default-off. Use a central flag service with low-latency client SDKs and audit logs. Bucket users deterministically (hash of user_id) so behavior is stable per user. Maintain a flag lifecycle: introduce → ramp → fully on → remove.

Trade-offs

  • Code complexity grows with flag count — old flags become permanent forks if not cleaned up.
  • Flag service is a critical dependency — outage degrades every flagged path.
  • Per-request flag lookups have latency; client-side caching helps but introduces staleness.

Common pitfalls

  • Permanent flags — feature flag debt accumulates faster than code debt.
  • Inconsistent bucketing across services causing the same user to see different variants.
  • Flags hidden in deep code paths with no test coverage of either branch.

Follow-ups interviewers ask

  • What's the flag-removal SLA?
  • How is the flag service made resilient — does it fail open or closed?
  • How do you handle a flag that needs to be coordinated across multiple services?

How to use this

Don't memorize. Pick a pattern, sketch a system that uses it, then walk yourself through the follow-up questions out loud. The gap between "I know what idempotency is" and "I can defend my idempotency-key TTL choice under interviewer pushback" is the gap between Senior and Staff.