// the retry problem

Your retries are
killing the service
you’re trying to save.

Every distributed system fails eventually. The difference between a five-minute blip and a two-hour outage is how your code responds when it does.

payment-service.log, 02:14:37 UTC
// the problem

Three failure modes
every distributed system faces.

A slow downstream dependency is painful on its own. A retry storm, cascading failure, or naïve failover turns a transient blip into a sustained outage. Understanding each failure mode is the first step to designing against them.

⛈️

Retry Storm

All your clients see a 503 and retry immediately — in sync. The downstream never gets a breathing window to recover.

🚫

Cascading Failure

One slow dependency makes callers queue up blocked threads. Resource pools exhaust and the failure propagates upstream.

🔄

Naive Failover

Traffic flaps between two unhealthy nodes. Requests fail in both directions. You’ve doubled your blast radius.
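All three modes share a root cause: clients that react to failure instantly and in lock-step. As a hypothetical sketch (not from any real client library), the anti-pattern looks like this:

```typescript
// Anti-pattern: immediate, unbounded retry. Every caller that sees a failure
// re-sends instantly, so the downstream never gets a quiet window to recover.
// 200 clients running this loop IS the retry storm.
async function naiveRetry<T>(fn: () => Promise<T>): Promise<T> {
  while (true) {
    try {
      return await fn();
    } catch {
      // no delay, no cap, no jitter: retry immediately
    }
  }
}
```

Every layer in the rest of this article exists to remove one of those missing pieces.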

“The service was recovering — until all 200 clients retried at the same millisecond.”
— your on-call engineer, 2:14am
// build your resilience stack

Start bare.
Add layers one at a time.

Every diagram starts with the same bare call: Client → Service. Click the buttons below to add each resilience layer and watch how the simulation changes.

// progressive resilience builder
Resilience level: 0 / 5
Client (caller) → Service A (primary)

Each layer solves a specific problem. Retry alone creates storms. Backoff alone correlates retries. Jitter breaks the lock-step. The circuit breaker stops the bleeding. Failover routes around the problem entirely.

// backoff strategies

Not all retries
are created equal.

Three clients, one failure, three strategies. Watch when each one fires its retry attempts — and which strategy gives the service the breathing room it needs to recover.

// retry timeline — 30 second window
Fixed (2s)
Exponential
Exp + Jitter

The green dashed line marks when the service recovers. Fixed retries from 200 clients all land at the same moments — the service gets hammered in synchronised bursts. Exponential backoff spreads attempts out. Add jitter and no two clients retry at the same instant.

// exponential backoff with full jitter
function getDelay(attempt: number): number {
  const base = 1000;  // 1s base
  const cap = 30_000; // 30s ceiling
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exp; // full jitter: uniform [0, exp)
}
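A bounded retry loop is one way to consume getDelay. The maxAttempts cap below is an illustrative policy choice, not part of the snippet above, and getDelay is repeated so the sketch runs on its own:

```typescript
// Exponential backoff + full jitter, as above, repeated for self-containment.
function getDelay(attempt: number): number {
  const base = 1000;
  const cap = 30_000;
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exp; // full jitter: uniform [0, exp)
}

// Bounded retry: sleep a jittered delay between attempts, give up after the cap.
async function retryWithJitter<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  let lastError: unknown = undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt + 1 < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, getDelay(attempt)));
      }
    }
  }
  throw lastError;
}
```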
// circuit breaker

Stop hammering
a service that’s already down.

A circuit breaker tracks consecutive failures. Once it trips, requests fail immediately — no slow timeout, no wasted network call. After a cooldown it sends one probe. Success closes the circuit; failure keeps it open.

// circuit breaker state machine
CLOSED (requests flow) → [5 failures] → OPEN (fail fast) → [5s timeout] → HALF-OPEN (probe request)
Simulation: error rate 30% · 0 requests · 0 errors · 0 blocked · circuit CLOSED
// minimal circuit breaker — platform-agnostic TypeScript
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private openedAt = 0;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < 5000) throw new Error('Circuit open');
      this.state = 'half-open'; // cooldown elapsed — allow one probe
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (err) {
      if (++this.failures >= 5) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
// libraries: Resilience4j (Java), Polly (.NET), cockatiel (Node), gobreaker (Go)
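To watch the breaker trip, drive it with a downstream that is hard down. This sketch repeats the class above so it runs standalone; the always-failing call and the 10-request loop are illustrative:

```typescript
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private openedAt = 0;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < 5000) throw new Error('Circuit open');
      this.state = 'half-open'; // cooldown elapsed; allow one probe
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (err) {
      if (++this.failures >= 5) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Fire 10 requests at a downstream that always fails and count what happens.
async function demo(): Promise<{ slow: number; fast: number }> {
  const breaker = new CircuitBreaker();
  const alwaysFails = async () => { throw new Error('503'); };
  let slow = 0; // failed calls that actually reached the downstream
  let fast = 0; // calls rejected instantly by the open circuit
  for (let i = 0; i < 10; i++) {
    try {
      await breaker.call(alwaysFails);
    } catch (err) {
      if ((err as Error).message === 'Circuit open') fast++;
      else slow++;
    }
  }
  return { slow, fast }; // first 5 hit the downstream, the rest fail fast
}
```

The payoff is in the counts: after the fifth consecutive failure, every remaining call is rejected locally instead of burning a timeout against a dead service.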
// failover

Automatic rerouting
before anyone pages you.

Failover is only as fast as your health checks. Step through what happens when Service A starts failing — from first anomaly to full traffic recovery.

// failover step-through — click Next to trace
Client (caller) → Load Balancer (health-aware) → Service A (primary) / Service B (secondary)
Health check: /healthz
Step 0 / 7 — click Next to begin
// Click "Next Step" to trace a failover event
Watch traffic reroute automatically as Service A degrades — one step at a time.
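The step-through can also be sketched in code. Everything here is illustrative (the Backend shape, pickBackend, the probe helper); a real load balancer runs the probe loop for you, and failover speed is bounded by the probe interval plus the failure threshold:

```typescript
// Health-aware routing sketch: route to the first healthy backend,
// with a background probe flipping the health flag based on /healthz.
interface Backend {
  name: string;
  healthy: boolean;
  call: () => Promise<string>;
}

function pickBackend(backends: Backend[]): Backend {
  const healthy = backends.filter((b) => b.healthy);
  if (healthy.length === 0) throw new Error('No healthy backends');
  return healthy[0]; // primary first; falls through to secondary when A is marked down
}

// One probe tick: mark the backend unhealthy if /healthz fails or throws.
async function probe(
  backend: Backend,
  healthz: () => Promise<boolean>,
): Promise<void> {
  try {
    backend.healthy = await healthz();
  } catch {
    backend.healthy = false;
  }
}
```

Until a probe actually marks Service A down, the balancer keeps sending it traffic; that lag is why the article says failover is only as fast as your health checks.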
// key takeaways

Rules that don’t change
across platforms.

Kafka, RabbitMQ, SQS, gRPC, REST — these patterns apply everywhere. The implementation differs; the principles don’t.

01
Always add backoff. Instant retries amplify the failure you’re trying to recover from. Even a 100ms fixed delay is better than zero.
02
Jitter de-correlates retries. Without it, all clients retry in lock-step and re-create the storm. Full jitter — a uniform random value between 0 and the backoff ceiling — spreads the load across the entire window.
03
Circuit breakers protect downstream. Fail fast locally instead of queuing up slow requests. The downstream gets breathing room; your callers get a fast error they can handle.
04
Health checks must reflect real load. A health endpoint that pings a local constant will always pass. Check the thing that actually fails — DB connections, downstream latency, thread pool depth.
05
Failover is not free. Warm standbys, session state, DNS TTLs, and data replication all matter before you need them. Test your failover path before it tests you.
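As a sketch of rule 04, a health handler can exercise real dependencies under a time budget instead of returning a constant. The dependency names, the check functions, and the 500ms budget below are all illustrative assumptions:

```typescript
// Health check that exercises real dependencies (DB ping, queue connection,
// downstream latency) under a per-check time budget.
interface HealthDep {
  name: string;
  check: () => Promise<void>; // resolves if healthy, throws if not
}

async function healthz(
  deps: HealthDep[],
  budgetMs = 500,
): Promise<{ ok: boolean; failing: string[] }> {
  const failing: string[] = [];
  await Promise.all(
    deps.map(async (dep) => {
      const timeout = new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('timeout')), budgetMs),
      );
      try {
        await Promise.race([dep.check(), timeout]);
      } catch {
        failing.push(dep.name); // slow counts as down, same as erroring
      }
    }),
  );
  return { ok: failing.length === 0, failing };
}
```

A check that hangs past the budget is reported as failing, which is exactly the signal a health-aware load balancer needs to start the failover sequence above.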