The Half-Open State
The Half-Open state is the recovery phase of a Tripswitch circuit breaker. Rather than using synthetic health checks, Tripswitch passively observes live traffic samples to determine if an upstream service has recovered.
The core of this state is the Hybrid Ratchet, which manages the transition from failure back to full capacity using a graduated traffic-shaping algorithm.
If you have not read Integration Patterns, start there first.
What Half-Open Is For
When a breaker is Open, Tripswitch is confident the dependency is unhealthy. Traffic is stopped. At some point, that certainty expires. You need to test whether the dependency has recovered.
Half-Open exists to answer one question: Is it safe to start sending traffic again?
It does not answer whether the dependency is healthy, whether you can resume normal traffic, or whether the incident is over. Half-Open is an observation phase, not a recovery declaration.
Why Half-Open Is Risky
The most common failure mode runs like this: the dependency goes down, the breaker trips Open, and the dependency comes back. Traffic resumes too quickly, and the dependency falls over again.
Half-Open exists to prevent premature recovery. But Half-Open itself introduces risk — you are intentionally sending traffic to something that was recently failing, across many instances at once, often during an incident. The goal is controlled exposure.
The Illusion of “Small” Allow Rates
Tripswitch models Half-Open using an allow rate. An allow rate of 0.1 means roughly 10% of requests are permitted through. For a single instance, that sounds conservative.
The surprise comes from aggregation.
Allow rates apply per instance, not globally.
If you run 100 instances, each with a 10% allow rate, you are not sending “a few” probes. You are potentially sending 100 concurrent probes.
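The aggregation effect is simple arithmetic. The sketch below is illustrative only: the function name and the figure of 50 requests per second per instance are assumptions made for the example, not Tripswitch values.

```python
def aggregate_probe_rps(instances: int, rps_per_instance: float, allow_rate: float) -> float:
    """Expected fleet-wide requests per second admitted during Half-Open.

    Each instance applies allow_rate to its own traffic independently,
    so the upstream sees the sum across the whole fleet.
    """
    if not 0.0 <= allow_rate <= 1.0:
        raise ValueError("allow_rate must be between 0 and 1")
    return instances * rps_per_instance * allow_rate


# Hypothetical workload: 50 requests/second arriving at each instance.
one_instance = aggregate_probe_rps(instances=1, rps_per_instance=50, allow_rate=0.1)
whole_fleet = aggregate_probe_rps(instances=100, rps_per_instance=50, allow_rate=0.1)
print(f"one instance: ~{one_instance:.0f} rps, fleet of 100: ~{whole_fleet:.0f} rps")
```

The per-instance number looks harmless; the fleet-wide number is what the recovering dependency actually has to absorb.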
What Tripswitch Does During Half-Open
Tripswitch’s responsibilities during Half-Open are intentionally limited.
Tripswitch Does
| Responsibility | What it means |
|---|---|
| Evaluate health centrally | The server determines breaker state based on collected samples |
| Choose when to enter Half-Open | Timing is controlled by cooldown and backoff settings |
| Specify an allow rate | The server tells SDKs what percentage of traffic to allow |
| Observe probe outcomes | Sample data from probes informs the Close/re-Open decision |
Tripswitch Does Not
| Responsibility | Why |
|---|---|
| Coordinate probes across instances | Would require global state and cross-instance interference |
| Enforce a global probe budget | Each instance operates independently by design |
Why Not Synthetic Probes?
Traditional circuit breakers test recovery with a single canary request. If it succeeds, full traffic resumes. The problem: a service that can handle one request may still collapse under production load.
Tripswitch takes a different approach. By allowing a controlled percentage of real production traffic through — rather than generating a synthetic probe — it gathers a statistically meaningful recovery signal before increasing load. A 20% allow rate on real traffic tells you far more than a single canary, and the Hybrid Ratchet ensures load only increases as that signal remains positive.
1. The Hybrid Ratchet
While in Half-Open, Tripswitch limits the volume of requests passed to the upstream service. This allow_rate is managed by a hybrid algorithm that combines a discrete jump with a linear climb:
- The Anchor Step: Upon each successful evaluation while at the floor rate, the breaker performs an “anchor jump.” By default, it doubles your min_allow_rate (e.g., from 20% to 40%). This quickly establishes a meaningful baseline of traffic to evaluate.
- The Time-Based Ramp: Once anchored, the allow_rate begins a smooth linear climb toward 100% over the recovery_window_ms.
- Monotonic Progression: The rate is designed to only increase or hold steady. If an evaluation returns Indeterminate (due to insufficient traffic samples), the rate holds steady to avoid over-pressuring a service with unknown health.
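The three behaviors above can be sketched as a small state machine. This is a minimal illustration, not Tripswitch's implementation. It assumes the ramp is measured from the moment of the anchor jump, that the rate is recomputed only on evaluation ticks, and that recovery_window_ms is 180,000 (twice the default confirmation window). The class and method names are hypothetical.

```python
class HybridRatchet:
    """Illustrative allow-rate ratchet for Half-Open (not Tripswitch source code)."""

    def __init__(self, min_allow_rate=0.2, recovery_window_ms=180_000, anchor_multiplier=2.0):
        self.min_allow_rate = min_allow_rate
        self.recovery_window_ms = recovery_window_ms
        # Anchor step target: by default double the floor, never above 100%.
        self.anchor_rate = min(1.0, min_allow_rate * anchor_multiplier)
        self.allow_rate = min_allow_rate
        self.anchored_at_ms = None  # set when the anchor jump happens

    def on_evaluation(self, now_ms, verdict):
        """Update and return the allow rate for a 'healthy' or 'indeterminate' verdict.

        A failed evaluation is not handled here: it leaves Half-Open entirely.
        """
        if verdict == "indeterminate":
            return self.allow_rate  # hold steady on insufficient samples
        if verdict != "healthy":
            raise ValueError("expected 'healthy' or 'indeterminate'")
        if self.anchored_at_ms is None:
            # First healthy evaluation at the floor: perform the anchor jump.
            self.anchored_at_ms = now_ms
            target = self.anchor_rate
        else:
            # Linear climb from the anchor rate toward 100% (assumed ramp origin).
            elapsed = now_ms - self.anchored_at_ms
            progress = min(1.0, elapsed / self.recovery_window_ms)
            target = self.anchor_rate + (1.0 - self.anchor_rate) * progress
        # Monotonic progression: the rate never decreases.
        self.allow_rate = max(self.allow_rate, target)
        return self.allow_rate


ratchet = HybridRatchet()
for tick_ms, verdict in [(0, "indeterminate"), (1_000, "healthy"), (91_000, "healthy")]:
    print(tick_ms, verdict, round(ratchet.on_evaluation(tick_ms, verdict), 3))
```

Taking the max against the previous rate is what makes this a ratchet: even if a tick arrives late or out of order, the allowance cannot slide backward.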
2. Exit Paths
A breaker’s journey through Half-Open concludes in one of two ways:
→ Closed (Confirmed Recovery)
The breaker transitions to Closed once the Confirmation Window expires, provided that:
- Sufficient Data: At least one evaluation during the window captured enough samples to be statistically significant.
- Health Maintained: No evaluation during the window triggered a trip.
→ Open (Failed Recovery)
If the traffic health violates a rule, the breaker immediately trips back to Open.
- Exponential Backoff: If half_open_backoff_enabled is true, each consecutive failure in Half-Open doubles the next cooldown_ms, capped by half_open_backoff_cap_ms.
- Fast-Trip for Consecutive Failures: For breakers using a consecutive_failures rule, any single failure during the Half-Open state will bypass the evaluation tick and trip the breaker back to Open immediately.
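The backoff rule above can be worked through numerically. In this sketch the function name and the 30,000 ms base cooldown are assumptions chosen for illustration. It also assumes doubling is applied to the configured base cooldown once per consecutive failed Half-Open attempt.

```python
def next_cooldown_ms(base_cooldown_ms, consecutive_half_open_failures,
                     backoff_enabled=True, backoff_cap_ms=3_600_000):
    """Cooldown before the next Half-Open attempt, doubling per consecutive failure."""
    if not backoff_enabled or consecutive_half_open_failures <= 0:
        return base_cooldown_ms
    doubled = base_cooldown_ms * 2 ** consecutive_half_open_failures
    return min(backoff_cap_ms, doubled)


# Hypothetical 30-second base cooldown.
for failures in range(0, 9):
    print(failures, next_cooldown_ms(30_000, failures))
```

With these numbers the cooldown reaches the one-hour cap on the seventh consecutive failure and stays there.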
3. The Indeterminate Policy
If the confirmation window expires but Tripswitch never received sufficient traffic to confidently verify recovery, it applies the Indeterminate Policy:
| Policy | Behavior |
|---|---|
| :conservative (Default) | Stay in Half-Open and keep waiting for more data. |
| :optimistic | Close the breaker and resume full traffic. |
| :pessimistic | Trip back to Open and wait for the next cooldown. |
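Taken together, the exit conditions and the policy table amount to one decision at the end of the confirmation window. The sketch below is an illustration under assumptions: the verdict strings, the function name, and the idea of passing the window's evaluations as a list are hypothetical. In practice a trip takes effect immediately rather than waiting for the window to expire.

```python
def resolve_confirmation_window(evaluations, indeterminate_policy="conservative"):
    """Return the next breaker state: 'closed', 'open', or 'half_open'.

    evaluations holds one verdict per evaluation tick in the window:
    'healthy', 'tripped', or 'indeterminate' (insufficient samples).
    """
    if "tripped" in evaluations:
        return "open"  # a rule violation always wins
    if "healthy" in evaluations:
        return "closed"  # at least one sufficient, passing evaluation and no trip
    # Every evaluation lacked samples (or there were none): apply the policy.
    outcomes = {
        "conservative": "half_open",  # keep waiting for more data
        "optimistic": "closed",       # resume full traffic
        "pessimistic": "open",        # wait for the next cooldown
    }
    return outcomes[indeterminate_policy]


print(resolve_confirmation_window(["indeterminate", "healthy"]))
print(resolve_confirmation_window(["indeterminate"], "pessimistic"))
```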
4. Configuration Reference
| Field | Default | Description |
|---|---|---|
| half_open_confirmation_ms | 90,000 | The duration of the confirmation window. |
| min_allow_rate | 0.2 | The starting traffic allowance (20%). |
| recovery_window_ms | 2x confirm | The time taken for the ramp to reach 100% capacity. |
| half_open_backoff_enabled | true | Whether to double the cooldown after repeat failures. |
| half_open_backoff_cap_ms | 3,600,000 | The maximum duration for exponential backoff (1 hour). |
5. Visualizing Recovery (Substates)
The Tripswitch Dashboard provides high-resolution substates to show exactly where a breaker is in its recovery cycle:
- :confirming: In Half-Open, but no evaluation has occurred yet (waiting for traffic).
- :at_minimum: Evaluating traffic at the min_allow_rate with no ramp events yet.
- :ramping: The ratchet is actively increasing the traffic allowance.
- :near_full: Traffic allowance is between 90% and 100%.
- :full_allowance: The rate has reached 100%, but the confirmation window is still active.
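One way to read the substates is as a function of the current allow rate and whether any evaluation has happened yet. The mapping below is a sketch, not the Dashboard's logic. The function name is hypothetical, and the boundary handling (for example, treating exactly 90% as near_full) is an assumption.

```python
def half_open_substate(allow_rate, min_allow_rate, evaluations_seen):
    """Map Half-Open progress to a dashboard-style substate name (illustrative)."""
    if evaluations_seen == 0:
        return "confirming"      # waiting for traffic, nothing evaluated yet
    if allow_rate >= 1.0:
        return "full_allowance"  # at 100%, confirmation window still active
    if allow_rate >= 0.9:
        return "near_full"       # assumed inclusive lower bound at 90%
    if allow_rate > min_allow_rate:
        return "ramping"         # ratchet has moved above the floor
    return "at_minimum"          # evaluating at the floor, no ramp events yet


for rate, seen in [(0.2, 0), (0.2, 3), (0.55, 5), (0.93, 8), (1.0, 10)]:
    print(rate, seen, half_open_substate(rate, 0.2, seen))
```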
What You Must Reason About
During Half-Open, the SDK is doing exactly what it says it does. Most problems come from assumptions outside its scope.
Fleet Size
The more instances you run, the more probes you generate in aggregate.
Probe Cost
Not all requests are equal. Some probes are cheap. Others are expensive database operations or third-party API calls. A large fleet probing at even a low allow rate can overwhelm a recovering service.
Timeouts and Outcome Classification
Slow probes are worse than failed probes — they tie up resources and delay recovery signals. Probe requests should have tight timeouts. And if your code treats “bad data” as success, Tripswitch will too. Use WithErrorEvaluator if HTTP 200 with invalid payload should count as failure.
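The classification logic an error evaluator would carry can be sketched independently of any SDK. The example below does not show the signature of WithErrorEvaluator. The function name, the JSON response shape, and the required "data" key are all hypothetical stand-ins for whatever "valid payload" means in your application.

```python
import json


def is_failure(status_code: int, body: str) -> bool:
    """Decide whether a response should be reported as a failure (illustrative).

    Treats server errors, unparseable bodies, and payloads missing the
    hypothetical 'data' key as failures, even when the status is 200.
    """
    if status_code >= 500:
        return True
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True  # HTTP 200 with a garbage body is still a failure
    return not (isinstance(payload, dict) and "data" in payload)


print(is_failure(200, '{"data": [1, 2, 3]}'))
print(is_failure(200, "<html>upstream error page</html>"))
```

How to treat 4xx responses is a separate decision for the caller; this sketch judges them by payload only.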
What Success Looks Like
A healthy Half-Open phase looks boring: a small number of probes succeed, failures are visible but contained, the breaker transitions cleanly to Closed, and traffic ramps up without a second collapse.
If Half-Open feels noisy, unpredictable, or flappy, it usually means too many probes, probes that are too expensive, or outcomes that don’t reflect reality.
What Half-Open Is Not
| Misconception | Reality |
|---|---|
| A retry mechanism | The SDK doesn’t retry. Rejected requests fail immediately. |
| A fairness system | No coordination between instances. Each gates independently. |
| A guarantee of recovery | The ratchet can fail and reset. That’s the point. |
The job of Half-Open is not to make recovery painless. It is to make recovery survivable.
Summary
Half-Open is a controlled observation phase, not a recovery declaration. Tripswitch passively evaluates live traffic samples and uses the Hybrid Ratchet to gradually restore capacity — anchoring at a meaningful floor, then climbing linearly toward full load as confidence accumulates. If recovery stalls, exponential backoff prevents flapping. If traffic is too sparse to decide, the Indeterminate Policy determines whether to wait, trust, or retreat.
The system handles the mechanics. You handle fleet sizing, probe cost, timeout configuration, and outcome classification.
If recovery feels fragile, look at what you control, not what Tripswitch controls.
Next Steps
- See Common Mistakes for concrete failure patterns
- See Tuning & Operations for guidance on allow rates and thresholds
- Revisit Integration Patterns if Half-Open behavior is surprising