Every distributed system fails eventually. The difference between a five-minute blip and a two-hour outage is how your code responds when it does.
A slow downstream dependency is painful on its own. A retry storm, cascading failure, or naïve failover turns a transient blip into a sustained outage. Understanding each failure mode is the first step to designing against it.
All your clients see a 503 and retry immediately, in lock-step. The downstream never gets a quiet window to recover.
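The storm comes from an innocuous-looking loop on the client. A minimal sketch, with a hypothetical `call_downstream()` standing in for the real request:

```python
# Minimal sketch of the naive client loop behind a retry storm.
# call_downstream() is a hypothetical stand-in for the real HTTP/RPC call.

def call_downstream() -> bool:
    """Pretend request that keeps failing while the service is overloaded."""
    return False  # simulate a 503

def fetch_with_naive_retry(max_attempts: int = 5) -> bool:
    for _ in range(max_attempts):
        if call_downstream():
            return True
        # No delay between attempts: every client that saw the same 503
        # retries at the same instant, hammering the downstream in waves.
    return False
```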
One slow dependency leaves callers blocked, with threads queuing up behind it. Resource pools exhaust and the failure propagates upstream.
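One common guard is to bound how long any single caller can block. A minimal sketch using the `requests` library, where the URL and the 0.5 s budget are placeholders rather than values from the simulation:

```python
import requests

# Without a timeout, a slow dependency can hold this thread for as long as
# the server stays silent; each stuck thread is one less worker upstream.
def call_downstream_unbounded():
    return requests.get("https://dependency.internal/api")

# With a timeout, the caller fails fast and frees its thread instead of
# joining the queue of blocked workers.
def call_downstream_bounded():
    return requests.get("https://dependency.internal/api", timeout=0.5)
```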
Traffic flaps between two unhealthy nodes. Requests fail in both directions. You’ve doubled your blast radius.
Every diagram starts with the same bare call: Client → Service. Click the buttons below to add each resilience layer and watch how the simulation changes.
Each layer solves a specific problem. Retry alone creates storms. Backoff alone leaves retries correlated. Jitter breaks the lock-step. The circuit breaker stops the bleeding. Failover routes around the problem entirely.
Three clients, one failure, three strategies. Watch when each one fires its retry attempts — and which strategy gives the service the breathing room it needs to recover.
The green dashed line marks when the service recovers. Fixed retries from 200 clients all land at the same moments, so the service gets hammered in synchronised bursts. Exponential backoff spreads attempts out. Add jitter and the attempts scatter, with clients no longer retrying in unison.
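In client code, the jittered strategy is only a few lines. A minimal sketch of capped exponential backoff with full jitter; the constants and the `call_downstream` hook are illustrative, not the simulation's parameters:

```python
import random
import time

BASE_DELAY_S = 0.1    # illustrative values, not the simulation's parameters
MAX_DELAY_S = 10.0
MAX_ATTEMPTS = 6

def fetch_with_backoff_and_jitter(call_downstream) -> bool:
    """call_downstream() -> bool is a hypothetical hook for the real request."""
    for attempt in range(MAX_ATTEMPTS):
        if call_downstream():
            return True
        # Exponential backoff: the ceiling doubles after every failure.
        ceiling = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
        # Full jitter: sleep a uniformly random slice of that ceiling, so
        # clients that failed together no longer retry together.
        time.sleep(random.uniform(0, ceiling))
    return False
```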
A circuit breaker tracks consecutive failures. Once it trips, requests fail immediately — no slow timeout, no wasted network call. After a cooldown it sends one probe. Success closes the circuit; failure keeps it open.
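A minimal, single-threaded sketch of that state machine; the threshold and cooldown are illustrative values, and production breakers typically also track failure rates and use a half-open state that admits a small trickle of probes:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                # Open: fail immediately, with no slow timeout and no network call.
                raise RuntimeError("circuit open")
            # Cooldown elapsed: let exactly this call through as the probe.
        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Trip on consecutive failures, or re-open after a failed probe.
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```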
Failover is only as fast as your health checks. Step through what happens when Service A starts failing — from first anomaly to full traffic recovery.
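To see why the probe cadence dominates that timeline, here is a sketch of a polling failover loop; the hosts, intervals, thresholds, and the `probe`/`route_to` hooks are all hypothetical:

```python
import time

PRIMARY, SECONDARY = "service-a.internal", "service-b.internal"  # placeholder hosts
PROBE_INTERVAL_S = 5.0   # illustrative probe cadence
UNHEALTHY_AFTER = 3      # consecutive failed probes before failing over
HEALTHY_AFTER = 3        # consecutive good probes before failing back

def failover_loop(probe, route_to):
    """probe(host) -> bool and route_to(host) are hypothetical hooks into
    the health checker and the load balancer."""
    active = PRIMARY
    bad = good = 0
    while True:
        if probe(PRIMARY):
            bad, good = 0, good + 1
        else:
            bad, good = bad + 1, 0
        if active == PRIMARY and bad >= UNHEALTHY_AFTER:
            # Worst case: UNHEALTHY_AFTER * PROBE_INTERVAL_S elapses between
            # the first anomaly and traffic actually moving.
            active = SECONDARY
            route_to(active)
        elif active == SECONDARY and good >= HEALTHY_AFTER:
            # Require sustained health before failing back.
            active = PRIMARY
            route_to(active)
        time.sleep(PROBE_INTERVAL_S)
```

The two consecutive-probe thresholds add hysteresis, which is what keeps traffic from flapping between two half-healthy nodes.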
Kafka, RabbitMQ, SQS, gRPC, REST — these patterns apply everywhere. The implementation differs; the principles don’t.