
Webhook Delivery with Retries

At-least-once event delivery to customer endpoints with retries, isolation, and observability.

Scale to anchor on

Hundreds of thousands of customer endpoints, billions of events/day, p99 delivery within seconds when endpoints are healthy.

Requirements

Functional

  • Deliver events to customer-registered URLs with signed payloads.
  • Retry failed deliveries with exponential backoff.
  • Provide per-event delivery history.
  • Support customer-side replay.

Non-functional

  • Isolation: one slow customer must not slow others.
  • At-least-once delivery, with stable event IDs so clients can deduplicate.
  • Resilience to customer endpoint outages.
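
At-least-once delivery means a customer may receive the same event more than once, for example when a response is lost and the delivery is retried. A minimal sketch of receiver-side deduplication follows. The class name, the in-memory set, and the way the event ID reaches the handler are all assumptions for illustration; a real receiver would more likely keep seen IDs in a shared store with an expiry.

```python
class DedupingReceiver:
    """Processes each event ID at most once on the customer side (hypothetical helper)."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory; production would use a durable store with a TTL

    def receive(self, event_id, payload):
        """Return True if the event was processed, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        # Run the handler first and record the ID only on success, so an
        # exception here leaves the event eligible for redelivery.
        self._handler(payload)
        self._seen.add(event_id)
        return True
```

Recording the ID only after the handler succeeds keeps the receiver compatible with retries: a failed attempt is not mistaken for a completed one.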

High-level architecture

Producers publish events to a durable queue. Per-customer worker pools deliver the events, signing each payload with that customer's secret. A retry scheduler re-attempts failed deliveries with exponential backoff, and a dead-letter queue (DLQ) captures events that keep failing. A dashboard exposes delivery history and replay tools.
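
The flow above implies a decision after every delivery attempt: done, retry later, or give up and dead-letter. A rough sketch of that decision is below. The retry budget of 8 attempts and the rule that only 2xx responses count as success are assumptions, not part of the design as stated.

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


MAX_ATTEMPTS = 8  # assumed retry budget; the design does not specify one


def route_attempt(status_code, attempt):
    """Decide what happens after delivery attempt number `attempt` (1-based).

    `status_code` is the HTTP status returned by the customer endpoint,
    or None if the request timed out or the connection failed.
    """
    if status_code is not None and 200 <= status_code < 300:
        return Outcome.DELIVERED
    if attempt >= MAX_ATTEMPTS:
        # Retry budget exhausted: hand the event to the DLQ for manual replay.
        return Outcome.DEAD_LETTER
    return Outcome.RETRY
```

A fuller version might treat some 4xx responses as permanent failures and skip straight to the DLQ; that policy is left open here.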

Components

Event queue
Durable and partitioned by customer, so each customer's events stay in order within their partition.
Delivery workers
Per-customer concurrency budget, bulkheaded to prevent noisy-neighbor problems.
Retry scheduler
Holds failed deliveries and releases each one for another attempt at exponentially increasing intervals.
DLQ + replay tool
Persistent failures land here; customers can replay manually.
Signer
HMAC over payload using customer secret for verification.
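
As an illustration of the Signer, here is a minimal sketch of signing and verifying a payload with Python's standard hmac module. SHA-256 as the digest and hex encoding of the signature are assumptions; the design only says HMAC over the payload with the customer secret.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Customer-side check: recompute the signature and compare in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, signature)
```

The signature should be computed over the exact bytes sent on the wire; re-serializing the JSON on the receiving side before verifying can change the bytes and cause a mismatch.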

Key decisions

Partition queues by customer.
A slow customer cannot block delivery to others; failure is isolated per partition.
Stable event IDs.
At-least-once delivery can produce duplicates, so customers must deduplicate on their side; we send a stable event ID in the headers of every attempt.
Exponential backoff with jitter.
Prevents thundering herd when a customer comes back online.
Signed payloads.
Customers can verify that a request came from us and was not altered, without us needing TLS pinning.
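
The backoff-with-jitter decision above can be sketched as follows, using the "full jitter" variant where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base delay, the one-hour cap, and the function name are assumptions chosen for illustration.

```python
import random

BASE_DELAY_S = 1.0    # assumed delay ceiling for the first retry
MAX_DELAY_S = 3600.0  # assumed cap so late retries do not wait indefinitely


def retry_delay(attempt, rng=random):
    """Seconds to wait before retry number `attempt` (1-based), with full jitter."""
    # Bound the exponent so very large attempt numbers cannot overflow a float.
    exponent = min(attempt - 1, 32)
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** exponent)
    # Spreading delays over [0, ceiling] keeps retries for a recovering
    # endpoint from arriving in synchronized bursts.
    return rng.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.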

Pitfalls

  • Single shared worker pool — one slow customer takes everyone down.
  • Retrying immediately on failure — thundering herd.
  • No DLQ — events lost on persistent failure.
  • No replay — customers can't recover after their own outage.
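
To make the first pitfall concrete, here is a small sketch of a per-customer concurrency budget built on asyncio semaphores. The class name, the limit of 2, and the peak counter (kept only so the cap can be observed) are assumptions; a production worker fleet would enforce the budget across processes rather than inside one event loop.

```python
import asyncio
from collections import defaultdict


class CustomerBulkhead:
    """Caps in-flight deliveries per customer (hypothetical helper)."""

    def __init__(self, per_customer_limit):
        self._limit = per_customer_limit
        self._semaphores = {}
        self._in_flight = defaultdict(int)
        self.peak = defaultdict(int)  # highest concurrency seen per customer

    def _semaphore(self, customer_id):
        if customer_id not in self._semaphores:
            self._semaphores[customer_id] = asyncio.Semaphore(self._limit)
        return self._semaphores[customer_id]

    async def run(self, customer_id, deliver):
        """Run the `deliver` coroutine function within the customer's budget."""
        async with self._semaphore(customer_id):
            self._in_flight[customer_id] += 1
            self.peak[customer_id] = max(
                self.peak[customer_id], self._in_flight[customer_id]
            )
            try:
                return await deliver()
            finally:
                self._in_flight[customer_id] -= 1


async def demo():
    bulkhead = CustomerBulkhead(per_customer_limit=2)

    async def slow_delivery():
        await asyncio.sleep(0.01)  # stands in for a slow customer endpoint
        return "ok"

    jobs = [bulkhead.run("cust_slow", slow_delivery) for _ in range(10)]
    jobs += [bulkhead.run("cust_other", slow_delivery) for _ in range(3)]
    await asyncio.gather(*jobs)
    return dict(bulkhead.peak)
```

Running `asyncio.run(demo())` queues ten deliveries for one customer, yet no more than two of them are in flight at once, so the other customer's deliveries are not starved of slots.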

Follow-up questions

  • How do you handle a customer whose endpoint is down for 24 hours?
  • How do you guarantee ordering per customer?
  • How do customers replay missed events?
  • How do you communicate delivery status back?
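
For the ordering question, one common starting point is to route every event for a customer to the same partition. A minimal sketch is below; the partition count of 64 and the function name are assumptions.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count


def partition_for(customer_id, num_partitions=NUM_PARTITIONS):
    """Map a customer ID to a stable partition index.

    SHA-256 is used instead of Python's built-in hash(), which is salted per
    process for strings and could route the same customer differently on
    different workers.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Stable routing only keeps a customer's events in one ordered stream. Strict ordering would also require delivering that customer's events one at a time, which trades off against throughput.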
