Circuit breakers are designed to be local. You add one per dependency, configure a threshold and a window, and each process protects itself. The mental model sells well: bad traffic accumulates, the breaker trips, calls stop, the dependency recovers, the breaker resets.
That model is correct for a single process. In a distributed fleet, it is incomplete in a way that will eventually produce an incident you can’t explain.
The Fleet You Didn’t Instrument
Imagine twenty instances of a service, each holding its own breaker for the same downstream dependency. The dependency starts failing.
Three things happen simultaneously, and none of them coordinate.
Uneven trip coverage. Your breakers trip based on what they individually see. The instances handling the most traffic accumulate failures fast and trip first. The low-traffic instances may never accumulate enough failures to trip at all. While your high-traffic instances are blocking calls, your low-traffic instances keep sending. The dependency continues to receive traffic—from the nodes that had the fewest data points about its health.
Independent cooldown timers. The instances that tripped first will exit their cooldown window first. They enter half-open, start probing, observe a few successes, and begin closing. The instances that tripped later are still cooling down. The instances that never tripped were never protecting you. You have a fleet in three different states making three different decisions about the same dependency.
Dashboard noise. You open the status page. Some nodes say OPEN, some say CLOSED, some say HALF_OPEN. The aggregate is not a signal. You cannot tell from this whether the dependency is recovering or whether your closed nodes simply lacked the data to diagnose the problem.
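The uneven-trip behavior falls out of the arithmetic. Below is a toy simulation, assuming a simple consecutive-failure breaker with a made-up threshold; the class and numbers are illustrative, not any real library's API:

```python
class LocalBreaker:
    """Consecutive-failure breaker: trips after `threshold` straight failures."""
    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0
        self.state = "CLOSED"

    def record_call(self, ok):
        if self.state == "OPEN":
            return  # tripped: this instance blocks and stops observing
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "OPEN"

# Uneven traffic: five hot instances, ten warm, five nearly idle.
traffic = [50] * 5 + [10] * 10 + [1] * 5      # requests per tick
fleet = [LocalBreaker() for _ in traffic]

# The dependency fails hard for three ticks; every call errors.
for _tick in range(3):
    for breaker, rps in zip(fleet, traffic):
        for _ in range(rps):
            breaker.record_call(ok=False)

states = [b.state for b in fleet]
print(states.count("OPEN"), "OPEN,", states.count("CLOSED"), "CLOSED")
# 15 OPEN, 5 CLOSED: the idle instances never accumulated enough
# failures to trip, so they are the only ones still sending traffic.
```

The instances with the least evidence about the dependency's health are exactly the ones that keep calling it.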
This is the Witness Problem. Your N circuit breakers are N independent witnesses to the same event. Each reached its own conclusion. Nobody compared notes.
The Half-Open Stampede
The Witness Problem has a sharp edge during recovery.
When nodes trip within a tight window—as they tend to during a sharp dependency failure—their cooldown timers run roughly in parallel. They enter half-open within seconds of each other. They each begin probing. Individually, each probe is cautious: a small exposure window, a few test requests. Collectively, they are not cautious. They are simultaneous.
A dependency that can handle the probe traffic from one instance may not handle it from fifteen. The probes succeed long enough for each breaker to interpret the result as healthy, close, and restore full traffic. Then the restored full traffic from twenty instances hits the dependency at once—and the dependency fails again.
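The timing is worth making concrete. With a fixed cooldown, a cluster of trip times simply becomes a cluster of probe times shifted forward by the cooldown; the numbers below are hypothetical:

```python
import random

random.seed(0)
COOLDOWN = 30.0                    # seconds; same config on every instance

# A sharp dependency failure trips 15 breakers within a 5-second window.
trip_times = [random.uniform(0, 5) for _ in range(15)]

# Fixed cooldown shifts the cluster in time; it does not spread it out.
probe_times = sorted(t + COOLDOWN for t in trip_times)
spread = probe_times[-1] - probe_times[0]
print(f"all 15 instances enter half-open within {spread:.1f}s of each other")
```

Per-instance jitter on the cooldown widens that spread, but it only dilutes the stampede; no amount of local tuning gives the probers a shared view of each other.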
The postmortem will call this a “second wave” or an “unstable recovery.” It was neither. It was a fleet of correctly behaving local breakers producing globally self-defeating behavior because they had no awareness of what the others were doing.
What the Local Model Costs You
The cost of treating each breaker as sovereign is not obvious until you need to coordinate across the fleet.
During an incident, the question you want to answer is: does the fleet believe this dependency is healthy? The per-instance dashboard cannot answer it. “Six OPEN, two CLOSED, two HALF_OPEN” requires interpretation that shouldn’t be happening in real time while the incident is active.
During recovery, the question you want to answer is: is the dependency ready for fleet-level traffic? A single instance’s half-open probe cannot answer it. The probe volume is a fraction of normal load. A dependency that survives one instance’s probe traffic may not survive twenty.
During configuration changes, the question you want to answer is: did the new threshold take effect everywhere? Config drift in distributed deployments is real. Instances with different thresholds make different trip decisions. The fleet stops behaving uniformly without anyone noticing.
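The drift case, at least, is detectable without changing the breakers themselves: have each instance report a fingerprint of its effective config and alert on more than one distinct value. A sketch, with made-up field names:

```python
import hashlib
import json

def config_fingerprint(cfg):
    """Stable hash of an instance's effective breaker config."""
    blob = json.dumps(cfg, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

# Hypothetical per-instance configs collected from the fleet.
fleet_cfgs = {
    "i0": {"threshold": 5, "window_s": 30},
    "i1": {"threshold": 5, "window_s": 30},
    "i2": {"threshold": 10, "window_s": 30},   # drifted instance
}
fingerprints = {i: config_fingerprint(c) for i, c in fleet_cfgs.items()}
drifted = len(set(fingerprints.values())) > 1
print("config drift detected:", drifted)
```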
Each of these questions assumes fleet-level awareness. The per-instance breaker model provides none of it.
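The dashboard question, for one, admits a mechanical answer once the per-instance states are collected in one place. A minimal sketch; `fleet_verdict`, the state names, and the 25% quorum are illustrative choices, not an established API:

```python
def fleet_verdict(states, weights, open_fraction=0.25):
    """states: instance -> OPEN/HALF_OPEN/CLOSED; weights: traffic share."""
    total = sum(weights.values())
    open_weight = sum(w for i, w in weights.items() if states[i] == "OPEN")
    if open_weight / total >= open_fraction:
        return "FLEET_UNHEALTHY"
    if any(s == "HALF_OPEN" for s in states.values()):
        return "FLEET_PROBING"
    return "FLEET_HEALTHY"

# "Six OPEN, two CLOSED, two HALF_OPEN" collapses to one answer.
states = {f"i{n}": s for n, s in enumerate(
    ["OPEN"] * 6 + ["CLOSED"] * 2 + ["HALF_OPEN"] * 2)}
weights = {i: 1.0 for i in states}     # equal traffic shares for simplicity
print(fleet_verdict(states, weights))  # FLEET_UNHEALTHY
```

Weighting by traffic matters: a CLOSED vote from a near-idle instance carries far less evidence about the dependency's health than an OPEN vote from a hot one.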
Gates Versus Sensors
There is a tempting answer to the Witness Problem: centralize the state. One breaker, shared across all instances, backed by a cache or a database. One vote. No coordination overhead.
That answer is wrong. A centralized breaker is a single point of failure in the path of every call. It reintroduces distributed failure modes—network latency, cache unavailability, write contention—in a place where you cannot afford them. You’ve replaced the coordination problem with a new critical dependency.
The correct model is different. Each instance keeps its own breaker state for performance and availability. But the signals each breaker produces are not kept local—they are shared. A trip on any instance is visible to all instances. A recovery confirmation from a subset of instances becomes the basis for a fleet-wide decision, not a per-instance one.
This is the difference between treating circuit breakers as gates and treating them as sensors. A gate is a local stop-or-pass decision. A sensor generates a signal. Signal aggregation lives a layer above the per-instance breaker, and it is the layer that most implementations skip entirely.
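The sensor model can be sketched in a few lines. State stays in-process so the hot path never waits on the network; only the signal is shared. `SignalBus` is a stand-in for whatever gossip channel or pub/sub topic you actually run, and the open-on-peer-trip policy is one illustrative choice among several:

```python
class SignalBus:
    """In-memory stand-in for a shared pub/sub channel."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, callback):
        self.subscribers.append(callback)
    def publish(self, event):
        for callback in self.subscribers:
            callback(event)

class SensorBreaker:
    def __init__(self, name, bus, threshold=5):
        self.name, self.bus, self.threshold = name, bus, threshold
        self.failures = 0
        self.state = "CLOSED"
        bus.subscribe(self.on_signal)

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold and self.state != "OPEN":
            self.state = "OPEN"
            # Local decision, shared signal: every peer hears about it.
            self.bus.publish({"source": self.name, "kind": "trip"})

    def on_signal(self, event):
        # A peer's trip is advisory evidence, not a remote command.
        # This policy opens preemptively; another might wait for a quorum.
        if event["kind"] == "trip" and event["source"] != self.name:
            self.state = "OPEN"

bus = SignalBus()
fleet = [SensorBreaker(f"i{n}", bus) for n in range(5)]
for _ in range(5):
    fleet[0].record_failure()      # only instance i0 observes the failures
print([b.state for b in fleet])    # all five instances are now OPEN
```

The essential property is that the gate and the sensor are decoupled: each instance still decides locally whether to pass a call, but no instance decides on local evidence alone.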
The 2am Version
When a dependency fails hard enough to trip every breaker in the fleet, the incident is legible. Everything is OPEN, calls are blocked, the right people are paged.
The expensive version happens when the failure is soft. The dependency is slow, not down. High-traffic instances accumulate failures past threshold and trip. Low-traffic instances have small windows that haven’t filled yet. Half the fleet is blocking calls. Half is still sending them, slowly degrading a dependency that is already struggling.
The SRE on call looks at the status page, sees a mixed fleet state, and calls it “partially degraded” in the incident channel. Traffic keeps hitting the struggling dependency because half the fleet doesn’t know it should stop. Twenty minutes later, the blocking instances exit their cooldown windows and probe simultaneously. The dependency, badly weakened by sustained partial load, fails under the combined probe traffic. The entire fleet trips.
The postmortem describes two incidents: the initial degradation and the “unexpected second failure.” There was one incident. The second failure was the fleet discovering, through a Half-Open Stampede, what coordinated protection would have found twenty minutes earlier.
The Witness Problem is not a circuit breaker bug. It is a deployment pattern that was never examined as a deployment pattern. You chose to run N independent witnesses against the same dependency and assumed they would collectively behave like one.
They don’t. They behave like N witnesses who cannot compare notes. That is not a misconfiguration you can tune away. It is the shape of the problem—and the shape does not change until the architecture does.