// the retry problem

Your retries are
killing the service
you’re trying to save.

Every distributed system fails eventually. The difference between a five-minute blip and a two-hour outage is how your code responds when it does.

// payment-service.log, 02:14:37 UTC
// the problem

Three failure modes
every distributed system faces.

A slow downstream dependency is painful on its own. A retry storm, cascading failure, or naïve failover turns a transient blip into a sustained outage. Understanding each failure mode is the first step to designing against them.

⛈️

Retry Storm

All your clients see a 503 and retry immediately — in sync. The downstream never gets breathing room to recover.

🚫

Cascading Failure

One slow dependency makes callers queue up blocked threads. Resource pools exhaust and the failure propagates upstream.

🔄

Naïve Failover

Traffic flaps between two unhealthy nodes. Requests fail in both directions. You’ve doubled your blast radius.

“The service was recovering — until all 200 clients retried at the same millisecond.”
— your on-call engineer, 2:14am
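In code, that storm usually starts as an innocent-looking loop. A minimal sketch of the anti-pattern those clients were running; the wrapper name is illustrative, not from any real codebase:

// the anti-pattern: immediate, unbounded retries (do not ship this)
async function callWithNaiveRetry<T>(fn: () => Promise<T>): Promise<T> {
  while (true) {
    try {
      return await fn();
    } catch {
      // no delay, no cap: every client loops as fast as the network allows,
      // hitting the service hardest at exactly the moment it is weakest
    }
  }
}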
// build your resilience stack

Start bare.
Add layers one at a time.

Every diagram starts with the same bare call: Client → Service. Click the buttons below to add each resilience layer and watch how the simulation changes.

// progressive resilience builder (resilience level 0 / 4)
Client 1, Client 2, Client 3 (callers) → Service (downstream)

Each layer solves a specific problem. Retry alone creates storms. Backoff without jitter keeps every client on the same schedule. Jitter breaks the lock-step. The circuit breaker stops the bleeding entirely.
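Stacked together, the first three layers are a single wrapper. A minimal sketch; callWithRetry, maxAttempts, baseMs, and capMs are illustrative names and defaults, and the fourth layer, the circuit breaker, gets its own section below:

// retry + exponential backoff + full jitter (layers 1-3)
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 1_000,
  capMs = 30_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();                                  // layer 0: the bare call
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;          // layer 1: retry, but bounded
      const exp = Math.min(capMs, baseMs * 2 ** attempt); // layer 2: exponential backoff
      const delay = Math.random() * exp;                  // layer 3: full jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}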

// backoff strategies

Not all retries
are created equal.

Three clients, one failure, three strategies. Watch when each one fires its retry attempts — and which strategy gives the service the breathing room it needs to recover.

// retry timeline — 30-second window comparing Fixed (2s), Exponential, and Exp + Jitter retry schedules

The green dashed line marks when the service recovers. Fixed retries from 200 clients all land at the same moments — the service gets hammered in synchronised bursts. Exponential backoff spreads attempts out. Add jitter and no two clients retry at the same instant.

// exponential backoff with full jitter
function getDelay(attempt: number): number {
  const base = 1000;   // 1s base
  const cap = 30_000;  // 30s ceiling
  const exp = Math.min(cap, base * 2 ** attempt);
  return Math.random() * exp; // full jitter: uniform [0, exp]
}
// circuit breaker

Stop hammering
a service that’s already down.

A circuit breaker tracks the error rate over a sliding window of recent calls. Once the rate crosses a threshold it trips — requests fail immediately, no slow timeout, no wasted network call. After a cooldown it sends one probe. Success closes the circuit; failure keeps it open.

Why rate, not consecutive failures? A service failing 60% of its calls with successes interleaved might never trip a consecutive-failure counter — but it’s clearly degraded. A sliding window catches that.

// circuit breaker state machine: CLOSED (requests flow) → OPEN (fail fast) at ≥50% errors over a 10-call window; OPEN → HALF-OPEN (probe request) after a 5s timeout
// sliding-window circuit breaker — platform-agnostic TypeScript
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private window: boolean[] = []; // true = failed
  private openedAt = 0;
  private readonly windowSize = 10;
  private readonly threshold = 0.5; // trip at 50% error rate
  private readonly minCalls = 5;    // need at least 5 calls before tripping

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      // still cooling down: fail fast, no network call
      if (Date.now() - this.openedAt < 5_000) throw new Error('Circuit open');
      this.state = 'half-open'; // cooldown elapsed: let one probe through
    }
    try {
      const result = await fn();
      if (this.state === 'half-open') { this.window = []; this.state = 'closed'; }
      else this.record(false);
      return result;
    } catch (err) {
      if (this.state === 'half-open') {
        // probe failed: re-open and restart the cooldown clock
        this.state = 'open';
        this.openedAt = Date.now();
      } else {
        this.record(true);
      }
      throw err;
    }
  }

  private record(failed: boolean) {
    this.window.push(failed);
    if (this.window.length > this.windowSize) this.window.shift();
    const rate = this.window.filter(Boolean).length / this.window.length;
    if (this.window.length >= this.minCalls && rate >= this.threshold) {
      this.state = 'open'; this.openedAt = Date.now();
    }
  }
}
// libraries: Resilience4j (Java), Polly (.NET), cockatiel (Node), resilience (Go)
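Wiring it in is one wrapper around the outbound call, typically one breaker instance per downstream dependency. A hypothetical usage sketch; inventoryBreaker, fetchInventory, and the inventory.internal URL are illustrative names, not a real API:

// hypothetical usage: one breaker per downstream dependency
const inventoryBreaker = new CircuitBreaker();

async function fetchInventory(sku: string): Promise<unknown> {
  return inventoryBreaker.call(async () => {
    const res = await fetch(`https://inventory.internal/items/${sku}`); // illustrative URL
    if (!res.ok) throw new Error(`HTTP ${res.status}`); // a non-2xx response counts as a failure
    return res.json();
  });
}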
// failover

Automatic rerouting
before anyone pages you.

Failover is only as fast as your health checks. Step through what happens when Service A starts failing — from first anomaly to full traffic recovery.

// failover step-through — click Next to trace
Client (caller) → Load Balancer (health-aware) → Service A (primary) / Service B (secondary), with a Health Check probing /healthz
Step 0 / 7 — click Next to trace the failover event and watch traffic reroute automatically as Service A degrades, one step at a time.
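Because the load balancer only believes what /healthz tells it, the probe has to exercise the dependencies that actually fail. A minimal sketch, assuming a hypothetical db handle and an illustrative downstream URL; neither comes from a real system:

// health check that exercises real dependencies (not a hard-coded 200)
declare const db: { ping(): Promise<void> }; // assumed database handle, purely illustrative

async function healthz(): Promise<{ healthy: boolean; checks: Record<string, boolean> }> {
  const probe = (p: Promise<unknown>, ms = 500) =>
    Promise.race([
      p.then(() => true),
      new Promise<boolean>((resolve) => setTimeout(() => resolve(false), ms)), // a slow probe counts as a failed one
    ]).catch(() => false);

  const checks = {
    database: await probe(db.ping()),                                    // real DB round trip
    downstream: await probe(fetch('https://payments.internal/healthz')), // illustrative downstream URL
  };
  return { healthy: Object.values(checks).every(Boolean), checks };
}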
// key takeaways

Rules that don’t change
across platforms.

Kafka, RabbitMQ, SQS, gRPC, REST — these patterns apply everywhere. The implementation differs; the principles don’t.

01
Always add backoff. Instant retries amplify the failure you’re trying to recover from. Even a 100ms fixed delay is better than zero.
02
Jitter de-correlates retries. Without it, all clients retry in lock-step and re-create the storm. Full jitter — a uniform random value between 0 and the current exponential delay — spreads the load across the entire window.
03
Circuit breakers protect downstream. Fail fast locally instead of queuing up slow requests. The downstream gets breathing room; your callers get a fast error they can handle.
04
Health checks must reflect real load. A health endpoint that only returns a hard-coded 200 will always pass. Check the thing that actually fails — DB connections, downstream latency, thread pool depth.
05
Failover is not free. Warm standbys, session state, DNS TTLs, and data replication all matter before you need them. Test your failover path before it tests you.
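Put together, the whole stack stays small. A sketch that composes the getDelay and CircuitBreaker pieces from earlier in this article; callResilient, breaker, and maxAttempts are illustrative names and defaults:

// the full stack: bounded retries with backoff + jitter around a circuit breaker
const breaker = new CircuitBreaker(); // from the circuit breaker section above

async function callResilient<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await breaker.call(fn); // fails fast while the circuit is open
    } catch (err) {
      lastError = err;
      if (attempt + 1 < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, getDelay(attempt))); // backoff + full jitter between attempts
      }
    }
  }
  throw lastError; // surface the last failure once the retry budget is spent
}

Whether the breaker sits inside or outside the retry loop is a genuine design choice; here it sits inside, so retries still fail fast while the circuit is open instead of queuing behind a dead dependency.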