// the retry problem

Your retries are
killing the service
you’re trying to save.

Every distributed system fails eventually. The difference between a five-minute blip and a two-hour outage is how your code responds when it does.

// payment-service.log, 02:14:37 UTC
// the problem

Three failure modes
every distributed system faces.

A slow downstream dependency is painful on its own. A retry storm, cascading failure, or naïve failover turns a transient blip into a sustained outage. Understanding each failure mode is the first step to designing against them.

⛈️

Retry Storm

All your clients see a 503 and retry immediately — in sync. The downstream never gets breathing room to recover.

🚫

Cascading Failure

One slow dependency makes callers queue up blocked threads. Resource pools exhaust and the failure propagates upstream.

🔄

Naïve Failover

Traffic flaps between two unhealthy nodes. Requests fail in both directions. You’ve doubled your blast radius.

“The service was recovering — until all 200 clients retried at the same millisecond.”
— your on-call engineer, 2:14am
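In code, that storm usually starts as an innocent-looking loop. A minimal sketch of the anti-pattern those clients were running; the wrapper name is illustrative, not from any real codebase:

// the anti-pattern: immediate, unbounded retries (do not ship this)
async function callWithNaiveRetry<T>(fn: () => Promise<T>): Promise<T> {
  while (true) {
    try {
      return await fn();
    } catch {
      // no delay, no cap: every client loops as fast as the network allows,
      // hitting the service hardest at exactly the moment it is weakest
    }
  }
}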
// build your resilience stack

Start bare.
Add layers one at a time.

Every diagram starts with the same bare call: Client → Service. Click the buttons below to add each resilience layer and watch how the simulation changes.

// progressive resilience builder (resilience level 0 / 4)
Client 1, Client 2, Client 3 (callers) → Service (downstream)

Each layer solves a specific problem. Retry alone creates storms. Backoff without jitter keeps every client on the same schedule. Jitter breaks the lock-step. The circuit breaker stops the bleeding entirely.
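Stacked together, the first three layers are a single wrapper. A minimal sketch; callWithRetry, maxAttempts, baseMs, and capMs are illustrative names and defaults, and the fourth layer, the circuit breaker, gets its own section below:

// retry + exponential backoff + full jitter (layers 1-3)
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 1_000,
  capMs = 30_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();                                  // layer 0: the bare call
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;          // layer 1: retry, but bounded
      const exp = Math.min(capMs, baseMs * 2 ** attempt); // layer 2: exponential backoff
      const delay = Math.random() * exp;                  // layer 3: full jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}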

// backoff strategies

Not all retries
are created equal.

Three clients, one failure, three strategies. Watch when each one fires its retry attempts — and which strategy gives the service the breathing room it needs to recover.

// retry timeline — 30-second window comparing Fixed (2s), Exponential, and Exp + Jitter retry schedules

The green dashed line marks when the service recovers. Fixed retries from 200 clients all land at the same moments — the service gets hammered in synchronised bursts. Exponential backoff spreads attempts out. Add jitter and no two clients retry at the same instant.

// exponential backoff with full jitter
function getDelay(attempt: number): number {
  const base = 1000;   // 1s base
  const cap = 30_000;  // 30s ceiling
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exp; // full jitter: uniform [0, exp]
}
// circuit breaker

Stop hammering
a service that’s already down.

A circuit breaker tracks the error rate over a sliding window of recent calls. Once the rate crosses a threshold it trips — requests fail immediately, no slow timeout, no wasted network call. After a cooldown it sends one probe. Success closes the circuit; failure keeps it open.

Why rate, not consecutive failures? A service failing 60% of its calls with successes interleaved might never trip a consecutive-failure counter — but it’s clearly degraded. A sliding window catches that.

// circuit breaker state machine: CLOSED (requests flow) → OPEN (fail fast) at ≥50% errors over a 10-call window; OPEN → HALF-OPEN (probe request) after a 5s timeout
// sliding-window circuit breaker — platform-agnostic TypeScript
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private window: boolean[] = []; // true = failed
  private openedAt = 0;
  private readonly windowSize = 10;
  private readonly threshold = 0.5; // trip at 50% error rate
  private readonly minCalls = 5;    // need at least 5 calls before tripping

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      // still cooling down: fail fast, no network call
      if (Date.now() - this.openedAt < 5_000) throw new Error('Circuit open');
      this.state = 'half-open'; // cooldown elapsed: let one probe through
    }
    try {
      const result = await fn();
      if (this.state === 'half-open') { this.window = []; this.state = 'closed'; }
      else this.record(false);
      return result;
    } catch (err) {
      if (this.state === 'half-open') {
        // probe failed: re-open and restart the cooldown clock
        this.state = 'open';
        this.openedAt = Date.now();
      } else {
        this.record(true);
      }
      throw err;
    }
  }

  private record(failed: boolean) {
    this.window.push(failed);
    if (this.window.length > this.windowSize) this.window.shift();
    const rate = this.window.filter(Boolean).length / this.window.length;
    if (this.window.length >= this.minCalls && rate >= this.threshold) {
      this.state = 'open'; this.openedAt = Date.now();
    }
  }
}
// libraries: Resilience4j (Java), Polly (.NET), cockatiel (Node), resilience (Go)
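Wiring it in is one wrapper around the outbound call, typically one breaker instance per downstream dependency. A hypothetical usage sketch; inventoryBreaker, fetchInventory, and the inventory.internal URL are illustrative names, not a real API:

// hypothetical usage: one breaker per downstream dependency
const inventoryBreaker = new CircuitBreaker();

async function fetchInventory(sku: string): Promise<unknown> {
  return inventoryBreaker.call(async () => {
    const res = await fetch(`https://inventory.internal/items/${sku}`); // illustrative URL
    if (!res.ok) throw new Error(`HTTP ${res.status}`); // a non-2xx response counts as a failure
    return res.json();
  });
}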
// failover

Automatic rerouting
before anyone pages you.

Failover is only as fast as your health checks. Step through what happens when Service A starts failing — from first anomaly to full traffic recovery.

// failover step-through — click Next to trace
Client (caller) → Load Balancer (health-aware) → Service A (primary) / Service B (secondary), with a Health Check probing /healthz
Step 0 / 7 — click Next to trace the failover event and watch traffic reroute automatically as Service A degrades, one step at a time.
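Because the load balancer only believes what /healthz tells it, the probe has to exercise the dependencies that actually fail. A minimal sketch, assuming a hypothetical db handle and an illustrative downstream URL; neither comes from a real system:

// health check that exercises real dependencies (not a hard-coded 200)
declare const db: { ping(): Promise<void> }; // assumed database handle, purely illustrative

async function healthz(): Promise<{ healthy: boolean; checks: Record<string, boolean> }> {
  const probe = (p: Promise<unknown>, ms = 500) =>
    Promise.race([
      p.then(() => true),
      new Promise<boolean>((resolve) => setTimeout(() => resolve(false), ms)), // a slow probe counts as a failed one
    ]).catch(() => false);

  const checks = {
    database: await probe(db.ping()),                                    // real DB round trip
    downstream: await probe(fetch('https://payments.internal/healthz')), // illustrative downstream URL
  };
  return { healthy: Object.values(checks).every(Boolean), checks };
}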
// key takeaways

Rules that don’t change
across platforms.

Kafka, RabbitMQ, SQS, gRPC, REST — these patterns apply everywhere. The implementation differs; the principles don’t.

01
Always add backoff. Instant retries amplify the failure you’re trying to recover from. Even a 100ms fixed delay is better than zero.
02
Jitter de-correlates retries. Without it, all clients retry in lock-step and re-create the storm. Full jitter — a uniform random value between 0 and the current exponential delay — spreads the load across the entire window.
03
Circuit breakers protect downstream. Fail fast locally instead of queuing up slow requests. The downstream gets breathing room; your callers get a fast error they can handle.
04
Health checks must reflect real load. A health endpoint that only returns a hard-coded 200 will always pass. Check the thing that actually fails — DB connections, downstream latency, thread pool depth.
05
Failover is not free. Warm standbys, session state, DNS TTLs, and data replication all matter before you need them. Test your failover path before it tests you.
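Put together, the whole stack stays small. A sketch that composes the getDelay and CircuitBreaker pieces from earlier in this article; callResilient, breaker, and maxAttempts are illustrative names and defaults:

// the full stack: bounded retries with backoff + jitter around a circuit breaker
const breaker = new CircuitBreaker(); // from the circuit breaker section above

async function callResilient<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await breaker.call(fn); // fails fast while the circuit is open
    } catch (err) {
      lastError = err;
      if (attempt + 1 < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, getDelay(attempt))); // backoff + full jitter between attempts
      }
    }
  }
  throw lastError; // surface the last failure once the retry budget is spent
}

Whether the breaker sits inside or outside the retry loop is a genuine design choice; here it sits inside, so retries still fail fast while the circuit is open instead of queuing behind a dead dependency.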