Platform

Webhook Delivery with Retries

At-least-once event delivery to customer endpoints with retries, isolation, and observability.

Scale to anchor on

Hundreds of thousands of customer endpoints, billions of events/day, p99 delivery within seconds when endpoints are healthy.
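
To make the scale concrete, daily volume can be turned into an average per-second rate. The sketch below is back-of-the-envelope arithmetic only; the volumes of 1 and 5 billion events/day are assumed for illustration, and retries and traffic peaks would push the real rate higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(events_per_day: float) -> float:
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


# Assumed volumes, chosen only to illustrate "billions of events/day".
for billions in (1, 5):
    rate = average_rate(billions * 1e9)
    print(f"{billions}B events/day is roughly {rate:,.0f} events/s on average")
```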

Requirements

Functional

Deliver events to customer-registered URLs with signed payloads.
Retry failed deliveries with exponential backoff.
Provide per-event delivery history.
Support customer-side replay.

Non-functional

Isolation: one slow customer must not slow others.
At-least-once delivery, with stable event IDs so clients can deduplicate.
Resilience to customer endpoint outages.

High-level architecture

Events from producers go to a durable queue. Per-customer worker pools deliver them, signing each payload with the customer's secret. A retry scheduler re-attempts failed deliveries with exponential backoff, and a dead-letter queue (DLQ) holds events that still fail after retries are exhausted. A dashboard exposes delivery history and replay tools.

Components

Event queue

Durable, partitioned by customer so that each customer's events stay in order within their partition.
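
One way to route events is to hash the customer ID onto a partition. This is a minimal sketch, assuming customers hash onto a fixed set of partitions; the partition count and the function name `partition_for` are hypothetical and do not come from the design above.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count, for illustration only


def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a customer to a partition using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    SHA-256 digest is used to get the same answer on every producer.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the same customer always lands on the same partition, events for that customer are consumed in the order they were enqueued.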

Delivery workers

Per-customer concurrency budget, bulkheaded to prevent noisy-neighbor problems.
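
A per-customer concurrency budget can be sketched with one semaphore per customer. The class name `Bulkhead`, the limit of 2, and the `peak` counter (kept only to make the cap observable) are assumptions for illustration.

```python
import asyncio
from collections import defaultdict

PER_CUSTOMER_LIMIT = 2  # assumed budget; real limits would be tuned


class Bulkhead:
    """Caps in-flight deliveries per customer with one semaphore each."""

    def __init__(self, limit: int = PER_CUSTOMER_LIMIT) -> None:
        self._limit = limit
        self._slots: dict[str, asyncio.Semaphore] = {}
        self.in_flight: dict[str, int] = defaultdict(int)
        self.peak: dict[str, int] = defaultdict(int)

    async def run(self, customer_id: str, deliver):
        """Run deliver() once a slot for this customer is free."""
        sem = self._slots.get(customer_id)
        if sem is None:
            sem = self._slots[customer_id] = asyncio.Semaphore(self._limit)
        async with sem:
            self.in_flight[customer_id] += 1
            self.peak[customer_id] = max(
                self.peak[customer_id], self.in_flight[customer_id]
            )
            try:
                return await deliver()
            finally:
                self.in_flight[customer_id] -= 1


async def demo():
    """Six slow deliveries for one customer, one fast delivery for another."""
    bulkhead = Bulkhead(limit=2)

    async def slow():
        await asyncio.sleep(0.05)  # stands in for a slow customer endpoint
        return "slow"

    async def fast():
        return "fast"

    tasks = [bulkhead.run("slow-co", slow) for _ in range(6)]
    tasks.append(bulkhead.run("fast-co", fast))
    results = await asyncio.gather(*tasks)
    return bulkhead.peak, results
```

Running `asyncio.run(demo())` shows that the slow customer never holds more than two slots, while the other customer's delivery does not wait on that customer's semaphore.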

Retry scheduler

Holds failed deliveries; releases at exponentially increasing intervals.
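
The scheduler can be sketched as a min-heap ordered by due time. The base delay of 1 second, the cap of 5 attempts, and all names below are assumptions. Jitter is left out here to keep the schedule deterministic.

```python
import heapq
from dataclasses import dataclass, field

BASE_DELAY_S = 1.0  # assumed delay before the first retry
MAX_ATTEMPTS = 5    # assumed cap before an event is dead-lettered


@dataclass(order=True)
class Retry:
    due_at: float                          # heap ordering key
    event_id: str = field(compare=False)
    attempt: int = field(compare=False)    # number of the upcoming attempt


class RetryScheduler:
    def __init__(self) -> None:
        self._heap: list[Retry] = []
        self.dead_letters: list[str] = []

    def record_failure(self, event_id: str, attempt: int, now: float) -> None:
        """Schedule the next attempt after attempt number `attempt` failed.

        Attempts are 1-based. After MAX_ATTEMPTS failures the event is
        handed to the dead-letter list instead of being rescheduled.
        """
        if attempt >= MAX_ATTEMPTS:
            self.dead_letters.append(event_id)
            return
        delay = BASE_DELAY_S * 2 ** (attempt - 1)  # 1s, 2s, 4s, 8s
        heapq.heappush(self._heap, Retry(now + delay, event_id, attempt + 1))

    def pop_due(self, now: float) -> list[Retry]:
        """Release every retry whose due time has passed."""
        due = []
        while self._heap and self._heap[0].due_at <= now:
            due.append(heapq.heappop(self._heap))
        return due
```

A delivery loop would call `pop_due` periodically and hand the released items back to the delivery workers.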

DLQ + replay tool

Persistent failures land here; customers can replay manually.
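
A replay request can be modeled as selecting a customer's dead-lettered events in a time window and re-enqueuing them unchanged. The record fields and the function name `select_for_replay` are hypothetical; the point of the sketch is that replayed events keep their original event IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadLetter:
    event_id: str      # unchanged from the original delivery attempts
    customer_id: str
    failed_at: float   # Unix timestamp of the final failed attempt
    payload: bytes


def select_for_replay(dlq, customer_id: str, since: float, until: float):
    """Return this customer's dead letters with since <= failed_at < until,
    oldest first, so they can be re-enqueued in their original order."""
    matches = [
        d for d in dlq
        if d.customer_id == customer_id and since <= d.failed_at < until
    ]
    return sorted(matches, key=lambda d: d.failed_at)
```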

Signer

Computes an HMAC over the payload with the customer's secret, so the customer can verify the sender and detect tampering.
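
Signing and the matching customer-side check might look like the sketch below. SHA-256 as the hash, hex encoding, and the function names are assumptions; the design above only specifies an HMAC over the payload.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw payload bytes."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    """Customer-side check: recompute the HMAC and compare in constant time."""
    expected = sign(secret, payload)
    return hmac.compare_digest(expected, signature)
```

The signature must be computed over the exact bytes sent on the wire. Re-serializing the JSON on the receiving side before verifying can change the bytes and make a valid signature fail.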

Key decisions

Partition queues by customer.

A slow customer cannot block delivery to others; failure is isolated per partition.

Stable event IDs.

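At-least-once delivery means the same event can arrive more than once, so customers must deduplicate on their side; we send a stable event ID in the request headers to make that possible.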
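
On the customer side, deduplication can be as simple as remembering which event IDs have already been processed. This is a minimal sketch with an in-memory set; a real receiver would more likely use a durable store with an expiry, and the class name is hypothetical.

```python
class DedupingReceiver:
    """Processes each event ID once, even if it is delivered several times."""

    def __init__(self) -> None:
        self._seen: set[str] = set()     # in-memory only, for illustration
        self.processed: list[str] = []   # stands in for real side effects

    def handle(self, event_id: str) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event_id in self._seen:
            return False
        self.processed.append(event_id)
        self._seen.add(event_id)
        return True
```

Either way the receiver should acknowledge the request with a success status, so that a duplicate is not retried again.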

Exponential backoff with jitter.

Jitter spreads retries over time, which prevents a thundering herd of queued deliveries when a customer's endpoint comes back online.
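
One common variant is "full jitter", where each delay is drawn uniformly between zero and the exponential ceiling. The base of 1 second, the cap of 1 hour, and the function name are assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_s: float = 1.0,
    cap_s: float = 3600.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter delay for a 0-based retry attempt.

    The ceiling doubles with each attempt up to cap_s, and the actual
    delay is uniform in [0, ceiling], so retries for the same outage
    do not line up on the same instant.
    """
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return source.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` makes the schedule reproducible in tests, while production code can rely on the default source.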

Signed payloads.

Customers can verify for themselves that a request came from us and was not altered in transit, so we do not need transport-level measures such as TLS pinning.

Pitfalls

Single shared worker pool — one slow customer takes everyone down.
Retrying immediately on failure — thundering herd.
No DLQ — events lost on persistent failure.
No replay — customers can't recover after their own outage.

Follow-up questions

How do you handle a customer whose endpoint is down for 24 hours?
How do you guarantee ordering per customer?
How do customers replay missed events?
How do you communicate delivery status back?

Related patterns

queue-decoupling, circuit-breaker, idempotency, rate-limiting