Real-time
Notification Fan-out at Billion Scale
Deliver targeted notifications to billions of devices with priority and per-user policy.
Scale to anchor on
1B+ devices, multi-million notifications per second at peak fan-out, per-user preferences, per-channel deliverability.
Requirements
Functional
- Trigger notifications from many producer systems.
- Respect per-user channel preferences (push, email, SMS, in-app).
- Throttle to avoid notification fatigue.
- Track delivery and engagement.
Non-functional
- High throughput, low end-to-end latency for urgent classes.
- Resilience to downstream provider outage (APNs, FCM).
- Cost-efficient on the SMS and email tiers.
High-level architecture
Producers emit events to a stream. A policy service resolves recipient channels and rules. A fan-out worker fleet pushes to per-channel delivery services (APNs, FCM, SES, Twilio). Per-user buckets enforce throttling.
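One common way to realize the per-user buckets is a token bucket per user. The sketch below assumes that approach. `TokenBucket`, `UserThrottle`, the capacity, and the refill rate are illustrative names and values, not part of the design above. State lives in process memory here; a real fleet would presumably keep it in a shared store.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start full
        self.updated_at = now

    def allow(self, now):
        # Refill for the time elapsed since the last check, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class UserThrottle:
    """One bucket per user, created lazily on first use."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets = {}

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        bucket = self._buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self._buckets[user_id] = bucket
        return bucket.allow(now)
```

A fan-out worker would call `throttle.allow(user_id)` before handing a notification to a channel adapter and drop or defer the message when it returns `False`.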
Components
Event ingest
Validates and persists producer events.
Policy service
Resolves user preferences and quiet hours, and applies throttles and dedup.
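A minimal sketch of channel resolution, assuming a hypothetical `UserPolicy` record with an opt-in channel set and a quiet-hours window in the user's local time. The rules that urgent notifications bypass quiet hours and that in-app delivery stays allowed during them are assumptions made for illustration, not stated requirements.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class UserPolicy:
    enabled_channels: frozenset  # e.g. {"push", "email", "sms", "in_app"}
    quiet_start: time            # local time, inclusive
    quiet_end: time              # local time, exclusive


def in_quiet_hours(policy, local_time):
    start, end = policy.quiet_start, policy.quiet_end
    if start <= end:
        return start <= local_time < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local_time >= start or local_time < end


def resolve_channels(policy, requested, local_time, urgent=False):
    """Return the requested channels the user will actually receive on."""
    allowed = [c for c in requested if c in policy.enabled_channels]
    if in_quiet_hours(policy, local_time) and not urgent:
        # Assumption: only the silent in-app channel is kept in quiet hours.
        allowed = [c for c in allowed if c == "in_app"]
    return allowed
```

For a user with `push` and `in_app` enabled and quiet hours from 22:00 to 07:00, a non-urgent request for `["push", "sms", "in_app"]` at 23:00 resolves to in-app only; the same request at noon resolves to push and in-app.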
Channel adapters
Per-provider clients with retry, circuit breakers, and DLQs.
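A rough sketch of the circuit-breaker part of an adapter. `CircuitBreaker`, `send_with_breaker`, the thresholds, and the `parked` list (standing in for a retry queue or DLQ) are hypothetical. `send` is any callable that raises on failure; no real provider SDK is used.

```python
class CircuitBreaker:
    """Opens after consecutive failures; allows a probe after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self, now):
        if self.opened_at is None:
            return True
        # Half-open: let a request through once the cool-down has passed.
        return now - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed probe


def send_with_breaker(breaker, send, message, parked, now):
    """Try one send; park the message if the breaker is open or the send fails."""
    if not breaker.allow_request(now):
        parked.append(message)  # fail fast without touching the provider
        return False
    try:
        send(message)
    except Exception:
        breaker.record_failure(now)
        parked.append(message)
        return False
    breaker.record_success()
    return True
```

While the breaker is open, the adapter stops calling the provider at all, so a struggling provider gets room to recover and workers are not tied up waiting on timeouts.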
Engagement tracker
Tracks delivery, open, and click events; feeds quality models.
Key decisions
Async pipeline end-to-end.
Producer latency cannot depend on third-party deliverability; queues decouple producer from variable downstream latency.
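The decoupling can be illustrated with an in-process queue standing in for the event stream. This is only a sketch: `publish`, `run_worker`, the `failed` list, and the `stop` sentinel are made-up names, and a real pipeline would use a durable stream rather than `queue.Queue`.

```python
import queue


def publish(events, event):
    """Producer path: enqueue and return immediately, no provider call."""
    events.put_nowait(event)


def run_worker(events, deliver, failed, stop):
    """Consumer path: drain the queue and call the (possibly slow) deliverer."""
    while True:
        event = events.get()
        if event is stop:
            events.task_done()
            break
        try:
            deliver(event)
        except Exception:
            # A real worker would retry or dead-letter; here we just record it.
            failed.append(event)
        finally:
            events.task_done()
```

The producer's cost is a single enqueue regardless of how slow `deliver` is; delivery latency and failures are absorbed on the worker side.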
Per-user dedup window.
Multiple producers firing for the same event is common; dedup prevents user-visible duplication.
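A minimal in-memory sketch of a per-user dedup window. `DedupWindow` and the idea of a producer-supplied `dedup_key` are assumptions made for illustration. The sketch never evicts old entries; at scale this state would more plausibly live in a TTL-based store.

```python
class DedupWindow:
    """Suppresses repeats of the same (user, key) within `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._accepted_at = {}  # (user_id, dedup_key) -> time of last accepted send

    def is_duplicate(self, user_id, dedup_key, now):
        key = (user_id, dedup_key)
        last = self._accepted_at.get(key)
        if last is not None and now - last < self.window:
            # Duplicates do not extend the window; it is measured
            # from the last notification that was let through.
            return True
        self._accepted_at[key] = now
        return False
```

With a 60-second window, two producers emitting the key `order-42-shipped` for the same user 30 seconds apart yield one notification; the same key for a different user is unaffected.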
Channel adapters with isolated bulkheads.
An APNs slowdown must not starve FCM workers; isolated pools contain blast radius.
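One way to sketch the bulkhead idea is a separate bounded thread pool per provider, using Python's standard `concurrent.futures.ThreadPoolExecutor`. The `Bulkheads` class and the pool sizes are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One bounded worker pool per provider, so pools cannot starve each other."""

    def __init__(self, workers_per_provider):
        self._pools = {
            name: ThreadPoolExecutor(max_workers=n, thread_name_prefix=name)
            for name, n in workers_per_provider.items()
        }

    def submit(self, provider, fn, *args):
        # Work for a provider only ever runs on that provider's own pool.
        return self._pools[provider].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

If every APNs worker is blocked on a slow call, tasks submitted under `"fcm"` still run, because they never compete for the APNs threads. The same separation would apply to connection pools.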
Pitfalls
- Synchronous calls to third-party channels: outages cascade back to producers.
- No throttle: users churn from notification fatigue.
- Single shared connection pool across providers: one slow provider starves the rest.
- No urgent vs. background priority lanes: urgent sends queue behind bulk traffic.
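The priority-lane pitfall can be made concrete with a two-lane queue. This is a sketch under assumptions: `PriorityLanes` and the `background_every` knob are invented for illustration. Urgent messages are served first, with an occasional background message let through so the background lane is not starved.

```python
from collections import deque


class PriorityLanes:
    """Two FIFO lanes; urgent is preferred, background gets a guaranteed share."""

    def __init__(self, background_every=5):
        self.urgent = deque()
        self.background = deque()
        self._background_every = background_every
        self._urgent_streak = 0  # consecutive urgent messages served

    def put(self, message, urgent=False):
        (self.urgent if urgent else self.background).append(message)

    def get(self):
        """Return the next message to send, or None if both lanes are empty."""
        take_background = bool(self.background) and (
            not self.urgent or self._urgent_streak >= self._background_every
        )
        if take_background:
            self._urgent_streak = 0
            return self.background.popleft()
        if self.urgent:
            self._urgent_streak += 1
            return self.urgent.popleft()
        return None
```

With `background_every=2`, one queued background message and three urgent ones are served in the order urgent, urgent, background, urgent.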
Follow-up questions
- How do you handle a third-party provider outage?
- How do you prevent notification storms when an upstream service replays events?
- What's the user-preference data model?