Most teams treat recovery like a countdown: trip the breaker, wait sixty seconds, then try again. If it works, close. If it fails, re-trip.

That is a decent starting point, but it misses what matters. Recovery is not just a technical transition; it is a product decision about what you will accept as “safe enough” to resume normal behavior.

A breaker isn’t just there to stop harm. It is there to prevent you from pretending you are healthy when you are not.


What Recovery Actually Means

When you transition from OPEN back toward normal operation, you are making a specific claim:

“We believe the dependency is healthy enough to trust again.”

That claim needs evidence. Time passing is not evidence; it is only an opportunity to look for evidence. If you do not design the evidence, your system will recover based on “vibes.”


The HALF_OPEN Trap

HALF_OPEN is where many breakers quietly lie. The common failure pattern is: OPEN → HALF_OPEN → CLOSED with almost no data. Sometimes literally zero. You re-enable traffic because a clock ran out, not because the dependency improved.

The fix is to treat HALF_OPEN as a series of gates. The goal is not cleverness; it is to avoid declaring victory too early.

The Gates That Matter

  1. A Minimum Time Gate: Do not allow instant recovery. Even if the first probe succeeds, you want a short minimum window in HALF_OPEN to avoid the “one lucky request” problem.
  2. A Minimum Evidence Gate: Require a specific number of successful observations before closing. One success is a data point; ten successes is a trend.
  3. A Bounded Exposure Gate: HALF_OPEN should not mean “turn it all back on.” It means limited traffic: a small number of probes, a fixed percentage of requests, or a strict concurrency budget.
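
One way to picture the three gates working together is the small sketch below. It is illustrative only: the class name HalfOpenGate, its fields, and the default thresholds are assumptions made for this example, not the API of any particular breaker library. It uses a concurrency budget for bounded exposure, and it re-trips on any failed probe.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HalfOpenGate:
    """Hypothetical tracker for the three HALF_OPEN gates (defaults are illustrative)."""
    min_seconds: float = 10.0   # minimum time gate
    min_successes: int = 10     # minimum evidence gate
    max_in_flight: int = 2      # bounded exposure gate (concurrency budget)
    entered_at: float = field(default_factory=time.monotonic)
    successes: int = 0
    in_flight: int = 0

    def try_acquire(self) -> bool:
        """Admit a probe only while the concurrency budget has room."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def record(self, ok: bool, now: Optional[float] = None) -> str:
        """Release the probe slot and return the next breaker state."""
        self.in_flight -= 1
        if not ok:
            return "OPEN"  # assumption: any failed probe re-trips the breaker
        self.successes += 1
        return "CLOSED" if self.can_close(now) else "HALF_OPEN"

    def can_close(self, now: Optional[float] = None) -> bool:
        """Closing requires BOTH enough elapsed time and enough successes."""
        now = time.monotonic() if now is None else now
        waited_long_enough = (now - self.entered_at) >= self.min_seconds
        enough_evidence = self.successes >= self.min_successes
        return waited_long_enough and enough_evidence
```

With these defaults, a single lucky probe cannot close the breaker: record() keeps returning "HALF_OPEN" until ten successes have accumulated and at least ten seconds have passed since the gate was created.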

Cooldowns vs. Evidence

Cooldowns prevent thrash. They stop the “flip-flopping” that creates load spikes and confusing system behavior.

Cooldowns are good hygiene, but they are not evidence. Do not treat “waited long enough” as “healthy again.” Treat it as “permission to try to collect evidence again.”
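
In code, the distinction can be as small as which state the timer is allowed to produce. The function below is a minimal sketch with a hypothetical name and parameters: when the cooldown expires it returns HALF_OPEN, never CLOSED, so the clock alone can only grant permission to probe.

```python
def state_after_cooldown(opened_at: float, now: float, cooldown_s: float) -> str:
    """Cooldown expiry permits probing; it never closes the breaker by itself."""
    if now - opened_at < cooldown_s:
        return "OPEN"       # still cooling down: keep rejecting traffic
    return "HALF_OPEN"      # permission to collect evidence, not a verdict
```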

The “Prove It” Check

This is the part most teams skip because it feels like extra work. A “Prove It” check is a small, explicit health probe with a clear success condition:

  • The probe should be representative, not just a ping. Can it run a lightweight query? Can it exercise the workflow you actually rely on?
  • The success condition should be strict enough to prevent false confidence.
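
As a rough sketch, a “Prove It” check might look like the function below. The name prove_it, the run_query callable, and the 0.5 second latency budget are all assumptions for illustration; the idea is only that the probe exercises a real operation and that “success” means a usable result returned fast enough, not merely the absence of an exception.

```python
import time
from typing import Callable


def prove_it(run_query: Callable[[], object], max_latency_s: float = 0.5) -> bool:
    """Return True only if a representative call succeeds within the latency budget."""
    start = time.monotonic()
    try:
        result = run_query()  # hypothetical: a lightweight query on the real workflow
    except Exception:
        return False          # any error counts as a failed probe
    elapsed = time.monotonic() - start
    # Strict condition: a usable result AND an acceptable response time.
    return result is not None and elapsed <= max_latency_s
```

A slow but technically successful response fails this check, which is the intent: a dependency that answers in five seconds is not one you want to send full traffic to.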

Communicating the Shift

Recovery creates uncertainty, and uncertainty needs to be visible. If the system is HALF_OPEN or has insufficient evidence, the UI and API should not present it as “all good.”

The most honest phrasing is: “We are testing recovery.”

Users can handle a system that is testing its limits; they cannot handle a system that promises health and immediately fails again.
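
One possible shape for that visibility is a status payload that refuses to say “healthy” until the evidence is in. The function name, field names, and every message except “We are testing recovery.” are assumptions made for this sketch.

```python
def public_status(state: str, successes: int, required: int) -> dict:
    """Map internal breaker state to what the UI or API is allowed to claim."""
    if state == "OPEN":
        return {"status": "unavailable", "message": "Temporarily unavailable."}
    if state == "CLOSED" and successes >= required:
        return {"status": "healthy", "message": "Operating normally."}
    # HALF_OPEN, or not enough evidence yet: do not present this as "all good".
    return {
        "status": "recovering",
        "message": "We are testing recovery.",
        "evidence": f"{successes}/{required}",
    }
```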


The Trust Layer

If you are building a product like Tripswitch, the recovery rules are where users decide whether they trust you.

Tripping a breaker is easy. Recovery is where you show either discipline or wishful thinking. HALF_OPEN is not a celebration phase; it is a controlled experiment.