7 min read - 2026-06-29
Load Balancing and Circuit Breaker Architecture Patterns
A service returning 503 errors is better than a service returning wrong answers silently.
Most developers learn to build applications. Fewer learn to design systems. The difference doesn't matter much on small projects, but it becomes the entire conversation when something goes wrong under real production load.
Load balancing is how you handle volume without scaling vertically until the cost becomes absurd. Incoming requests are distributed across multiple instances instead of hammering one server. Any single instance can fail or restart without the system going down. You add capacity horizontally rather than paying indefinitely for a more expensive machine.
The nuance is in the distribution strategy. Round robin is the default and works fine until requests vary significantly in cost, or until stateful sessions need to stick to a specific instance. Understanding the tradeoffs between algorithms matters more than memorizing their names.
Circuit breakers handle a different class of problem. One slow or failing dependency, left unchecked, can hold connections open until the thread pool is exhausted and the whole service stops responding. A circuit breaker detects that failure pattern and stops sending traffic to the broken dependency, returning a fast error instead of a slow one.
The operational insight is that a fast failure is better than a slow one. Users can handle being told a feature is temporarily unavailable. They cannot handle a system that appears to accept input and then silently does nothing for thirty seconds.
Both patterns predate the current generation of frameworks by decades. They show up in Kleppmann's writing and in the eventual design of nearly every system that has survived real production traffic. Frameworks come and go. These patterns are the bedrock.
Designing Data-Intensive Applications lays out the distributed systems context that makes both patterns make sense. If you're building anything that needs to survive real load, it is the right place to start.
Why These Two Patterns Belong Together
Load balancing spreads traffic. Circuit breakers stop traffic from hammering a dependency that is already failing. One manages volume, the other prevents a bad dependency from taking everything else down with it.
Load Balancing Beyond Round Robin
Round robin is the default, but it is only one option. Weighted balancing, least connections, and sticky sessions all solve different problems, and the right choice depends on request cost, session state, and traffic shape.
Load balancing strategy comparison
| Strategy | Use case | Pros | Cons |
|---|---|---|---|
| Round robin | Uniform requests | Simple | Ignores request cost |
| Weighted | Uneven capacity | Flexible | Needs tuning |
| Least connections | Variable request length | Adaptive | Needs live metrics |
| Sticky sessions | Session state | Predictable | Can create hotspots |
Different traffic patterns need different balancing strategies.
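The four strategies in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular load balancer implements them: the `Backend` class, the backend names, the weights, and the `user-42` session ID are all hypothetical, and the weighted version is the naive "repeat each backend `weight` times" rotation rather than a smoothed variant.

```python
import hashlib
import itertools
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical backend record used only for this sketch."""
    name: str
    weight: int = 1   # relative capacity, used by weighted selection
    active: int = 0   # in-flight requests, used by least connections


def round_robin(backends):
    """Return an iterator that rotates through backends, ignoring request cost."""
    return itertools.cycle(backends)


def weighted(backends):
    """Rotate through backends, repeating each one `weight` times per cycle."""
    expanded = [b for b in backends for _ in range(b.weight)]
    return itertools.cycle(expanded)


def least_connections(backends):
    """Pick the backend with the fewest in-flight requests right now."""
    return min(backends, key=lambda b: b.active)


def sticky(backends, session_id):
    """Map a session ID to the same backend every time via a stable hash."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


pool = [Backend("a", weight=3), Backend("b"), Backend("c")]

rr = round_robin(pool)
print([next(rr).name for _ in range(4)])   # ['a', 'b', 'c', 'a']

w = weighted(pool)
print([next(w).name for _ in range(5)])    # ['a', 'a', 'a', 'b', 'c']

# Least connections needs live metrics: here they are set by hand.
pool[0].active, pool[1].active, pool[2].active = 5, 2, 7
print(least_connections(pool).name)        # b

# The same session always lands on the same backend.
print(sticky(pool, "user-42") is sticky(pool, "user-42"))  # True
```

The sticky version also shows where hotspots come from: the mapping depends only on the session ID, so a handful of heavy sessions that hash to the same backend stay there regardless of load.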
Circuit Breaker States
Closed means requests are flowing normally. Open means the dependency is failing often enough that the system stops sending traffic to it. Half-open is the test state that checks whether the dependency has recovered enough to try again.
Figure: Circuit breaker state machine. The breaker switches between closed, open, and half-open based on failure behavior.
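The three states can be written down as a small transition table. The sketch below is only a model of the state machine described above; the event names (`failure_threshold_reached`, `recovery_window_elapsed`, `probe_succeeded`, `probe_failed`) are hypothetical labels chosen for the example.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # requests flow normally
    OPEN = "open"            # requests are rejected immediately
    HALF_OPEN = "half_open"  # a trial request is allowed through


# Hypothetical event names, chosen for this sketch.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "recovery_window_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state, event):
    """Return the new state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


state = State.CLOSED
events = [
    "failure_threshold_reached",  # closed -> open
    "recovery_window_elapsed",    # open -> half_open
    "probe_failed",               # half_open -> open
    "recovery_window_elapsed",    # open -> half_open
    "probe_succeeded",            # half_open -> closed
]
for event in events:
    state = next_state(state, event)
    print(event, "->", state.value)
```

Note that there is no direct path from open back to closed: recovery always goes through half-open, so a dependency has to prove itself on a trial request before it gets full traffic again.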
What Cascading Failure Actually Looks Like
One slow service can clog the queue upstream, which then causes timeouts, which then exhausts the thread pool, which then takes the next service down as well. Once the cascade starts, the problem is rarely local anymore.
Figure: Cascading failure without a circuit breaker (panels: service chain, failure cascade). A slow downstream service can overflow the system above it.
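A rough back-of-the-envelope calculation shows why a slow dependency exhausts a thread pool so quickly. It uses Little's law (average requests in flight = arrival rate × time each request spends in the system). Every number here is an illustrative assumption: a pool of 50 threads, 100 requests per second, and a dependency whose latency goes from 50 ms to 2 seconds.

```python
def threads_needed(requests_per_second, latency_ms):
    """Little's law: average in-flight requests = arrival rate x time in system."""
    return requests_per_second * latency_ms / 1000


POOL_SIZE = 50   # assumed worker threads in the upstream service
RATE = 100       # assumed incoming requests per second

healthy = threads_needed(RATE, 50)      # dependency answers in 50 ms
degraded = threads_needed(RATE, 2000)   # dependency answers in 2 s

print(healthy)                # 5.0 threads busy on average
print(degraded)               # 200.0 threads would be needed
print(degraded > POOL_SIZE)   # True: the pool cannot keep up

# Once responses stop coming back in time, every new request takes a thread
# and holds it, so the pool fills in roughly POOL_SIZE / RATE seconds.
print(POOL_SIZE / RATE)       # 0.5
```

Under these assumptions the service goes from using a tenth of its pool to needing four times its size, and it gets there in about half a second. Requests that cannot get a thread start queueing, and the service's own callers begin to see the same latency spike.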
Implementation Sketch
The code is less important than the behavior. Track failure count, trip the breaker when the threshold is crossed, and retry only when the recovery window says it is safe to test again.
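A minimal sketch of that behavior in Python might look like the following. It is not a production implementation: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `recovery_seconds`, `CircuitOpenError`) are made up for the example, it is not thread-safe, and it counts consecutive failures rather than a failure rate over a window. The clock is injectable so the example can run without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency the breaker considers unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_seconds:
            return "half_open"  # recovery window has passed, allow a probe
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of tying up a thread on a broken dependency.
            raise CircuitOpenError("dependency unavailable, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if state == "half_open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe
            raise
        # Any success closes the breaker and resets the count.
        self.failures = 0
        self.opened_at = None
        return result


# Usage with a fake clock so the recovery window can be skipped.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_seconds=10.0,
                         clock=lambda: now[0])


def flaky():
    raise ConnectionError("downstream timed out")


for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

print(breaker.state)               # open
now[0] = 10.0                      # pretend the recovery window has elapsed
print(breaker.state)               # half_open
print(breaker.call(lambda: "ok"))  # ok
print(breaker.state)               # closed
```

One simplification worth knowing about: in the half-open state this sketch lets every caller through until one of them succeeds or fails. Real implementations usually limit half-open to a single trial request or a small fixed number.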
When Not to Use Circuit Breakers
They are the wrong fit for stateful writes, payment flows, and anything where a partial failure can create a worse outcome than a slow one. Reliability patterns only help when they match the type of work being protected.