Platform
Webhook Delivery with Retries
At-least-once event delivery to customer endpoints with retries, isolation, and observability.
Scale to anchor on
Hundreds of thousands of customer endpoints, billions of events/day, p99 delivery within seconds when endpoints are healthy.
Requirements
Functional
- Deliver events to customer-registered URLs with signed payloads.
- Retry failed deliveries with exponential backoff.
- Provide per-event delivery history.
- Support customer-side replay.
Non-functional
- Isolation: one slow customer must not slow others.
- At-least-once with stable event IDs for client dedup.
- Resilience to customer endpoint outages.
High-level architecture
Events from producers go to a durable queue. Per-customer worker pools deliver, signing payloads with the customer's secret. A retry scheduler with exponential backoff and a dead-letter queue handles persistent failures. A dashboard exposes delivery history and replay tools.
Components
Event queue
Durable and partitioned by customer, which preserves per-customer ordering and keeps one customer's backlog out of everyone else's path.
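One way to picture customer-based partitioning is a stable hash from customer ID to partition number. This is a minimal sketch under stated assumptions: `partition_for` and `NUM_PARTITIONS` are hypothetical names, and the partition count is arbitrary.

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical partition count, chosen only for illustration


def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a customer to a fixed partition using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    SHA-256 digest is used to keep the mapping identical across workers
    and restarts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

All events for one customer land on the same partition, so they are consumed in order. With far more customers than partitions, several customers share a partition, which is one reason the delivery workers also need their own per-customer limits.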
Delivery workers
Per-customer concurrency budget, bulkheaded to prevent noisy-neighbor problems.
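A per-customer concurrency budget could be sketched with one semaphore per customer. The names `CustomerBulkhead` and `max_in_flight` are hypothetical, and `deliver` stands in for whatever coroutine performs the HTTP call.

```python
import asyncio


class CustomerBulkhead:
    """Caps the number of in-flight deliveries for each customer."""

    def __init__(self, max_in_flight: int = 4):
        self._max_in_flight = max_in_flight
        self._semaphores = {}

    def _semaphore_for(self, customer_id: str) -> asyncio.Semaphore:
        # Created lazily, inside the running event loop, on first use.
        if customer_id not in self._semaphores:
            self._semaphores[customer_id] = asyncio.Semaphore(self._max_in_flight)
        return self._semaphores[customer_id]

    async def run(self, customer_id: str, deliver):
        # Waits here if this customer has used up its budget; deliveries
        # for other customers hold different semaphores and are unaffected.
        async with self._semaphore_for(customer_id):
            return await deliver()
```

A slow endpoint then ties up at most `max_in_flight` delivery slots, rather than every worker in a shared pool.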
Retry scheduler
Holds failed deliveries; releases at exponentially increasing intervals.
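The hold-and-release behavior can be illustrated with an in-memory min-heap keyed on due time. This is a sketch only: `RetryScheduler` and its parameters are hypothetical, a real scheduler would need durable storage, and jitter is left out here to keep the timing easy to follow.

```python
import heapq
import itertools


class RetryScheduler:
    """Holds failed deliveries and releases them once their delay has passed."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 3600.0):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal due times
        self.base_delay = base_delay
        self.max_delay = max_delay

    def delay_for(self, attempt: int) -> float:
        # Doubles per attempt (1x, 2x, 4x, ...), capped at max_delay.
        # The exponent is bounded so very large attempt counts stay cheap.
        return min(self.max_delay, self.base_delay * 2 ** min(attempt - 1, 32))

    def schedule(self, event_id: str, attempt: int, now: float) -> float:
        due = now + self.delay_for(attempt)
        heapq.heappush(self._heap, (due, next(self._seq), event_id, attempt))
        return due

    def pop_due(self, now: float):
        """Return (event_id, attempt) pairs whose due time has arrived."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, _, event_id, attempt = heapq.heappop(self._heap)
            ready.append((event_id, attempt))
        return ready
```

Passing `now` explicitly, instead of reading the clock inside the class, keeps the sketch deterministic and easy to test.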
DLQ + replay tool
Deliveries that exhaust their retries land here; customers can replay them manually.
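The hand-off from retries to the dead-letter queue, and back out through replay, might look like the sketch below. `MAX_ATTEMPTS`, `Delivery`, `record_failure`, and `replay` are hypothetical names, and plain Python containers stand in for durable stores.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 5  # hypothetical retry limit


@dataclass
class Delivery:
    event_id: str
    customer_id: str
    payload: bytes
    attempts: int = 0


def record_failure(delivery: Delivery, dlq: dict, retry_queue: list) -> str:
    """Count a failed attempt; dead-letter the delivery once the limit is hit."""
    delivery.attempts += 1
    if delivery.attempts >= MAX_ATTEMPTS:
        dlq[delivery.event_id] = delivery
        return "dead-lettered"
    retry_queue.append(delivery)
    return "retry"


def replay(customer_id: str, dlq: dict, delivery_queue: list) -> list:
    """Move one customer's dead-lettered deliveries back onto the delivery queue."""
    event_ids = [eid for eid, d in dlq.items() if d.customer_id == customer_id]
    for eid in event_ids:
        delivery = dlq.pop(eid)
        delivery.attempts = 0  # fresh retry budget, same event ID
        delivery_queue.append(delivery)
    return event_ids
```

Replay keeps the original event ID, so a customer that did receive the event earlier can still deduplicate it.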
Signer
HMAC over payload using customer secret for verification.
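A minimal signing and verification sketch using Python's standard `hmac` module follows. SHA-256 and hex encoding are assumptions, since the design does not name a hash or an encoding; `sign` and `verify` are hypothetical helper names.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the raw payload bytes."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, payload)
    # Compare as bytes so an unexpected non-ASCII header value returns
    # False instead of raising.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The sender would attach the result of `sign` to the request, for example in a header, and the customer would call `verify` on the raw body before parsing it. Verifying the exact bytes received matters because re-serializing JSON can change the payload and break the signature.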
Key decisions
Partition queues by customer.
A slow customer cannot block delivery to others; failure is isolated per partition.
Stable event IDs.
Because delivery is at-least-once, customers must dedup on their side; we send the same stable event ID in a header on every attempt, including retries and replays.
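On the customer side, deduplication can be as simple as remembering which event IDs have already been processed. The sketch below is illustrative: `DedupingReceiver` is a hypothetical name, and an in-memory set stands in for the durable store a real receiver would likely need.

```python
class DedupingReceiver:
    """Processes each event ID at most once, ignoring repeat deliveries."""

    def __init__(self):
        self._seen = set()
        self.processed = []

    def handle(self, event_id: str, payload: str) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self.processed.append(payload)
        # Marked as seen only after processing, so a crash mid-processing
        # leads to a repeat delivery rather than a lost event.
        self._seen.add(event_id)
        return True
```

A duplicate should still be acknowledged with a success response, otherwise the sender keeps retrying it.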
Exponential backoff with jitter.
Prevents a thundering herd when a customer comes back online: without jitter, every delivery that failed at the same time also retries at the same time.
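One common way to add jitter is to pick a uniformly random delay between zero and the exponential ceiling, often called full jitter. This sketch assumes that variant; the design does not say which one is used, and `backoff_with_jitter`, `base`, and `cap` are hypothetical names and values.

```python
import random


def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 3600.0,
                        rng=random) -> float:
    """Return a random delay in [0, min(cap, base * 2**(attempt - 1))]."""
    # The exponent is bounded so very large attempt counts stay cheap.
    ceiling = min(cap, base * 2 ** min(attempt - 1, 32))
    return rng.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests, while the default uses the module-level generator.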
Signed payloads.
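Customers verify that each request came from us and was not altered by recomputing the HMAC with their secret; this needs no extra transport-level setup such as TLS pinning.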
Pitfalls
- Single shared worker pool — one slow customer takes everyone down.
- Retrying immediately on failure — thundering herd.
- No DLQ — events lost on persistent failure.
- No replay — customers can't recover after their own outage.
Follow-up questions
- How do you handle a customer whose endpoint is down for 24 hours?
- How do you guarantee ordering per customer?
- How do customers replay missed events?
- How do you communicate delivery status back?