Half-Open State

The Half-Open state is the recovery phase of a Tripswitch circuit breaker. Rather than using synthetic health checks, Tripswitch passively observes live traffic samples to determine if an upstream service has recovered.

The core of this state is the Hybrid Ratchet, which manages the transition from failure back to full capacity using a graduated traffic-shaping algorithm.

If you have not read Integration Patterns, start there.


What Half-Open Is For

When a breaker is Open, Tripswitch is confident the dependency is unhealthy. Traffic is stopped. At some point, that certainty expires. You need to test whether the dependency has recovered.

Half-Open exists to answer one question: Is it safe to start sending traffic again?

It does not answer whether the dependency is healthy, whether you can resume normal traffic, or whether the incident is over. Half-Open is an observation phase, not a recovery declaration.


Why Half-Open Is Risky

The most common failure mode: dependency goes down, breaker trips Open, dependency comes back. Traffic resumes too quickly. Dependency falls over again.

Half-Open exists to prevent premature recovery. But Half-Open itself introduces risk — you are intentionally sending traffic to something that was recently failing, across many instances at once, often during an incident. The goal is controlled exposure.


The Illusion of “Small” Allow Rates

Tripswitch models Half-Open using an allow rate. An allow rate of 0.1 means roughly 10% of requests are permitted through. This sounds conservative — per instance.

The surprise comes from aggregation.

Allow rates apply per instance, not globally.

If you run 100 instances, each with a 10% allow rate, you are not sending “a few” probes. You are sending roughly 10% of the fleet’s entire traffic. At 10 requests per second per instance, that is about 100 probes every second.
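To make the aggregation concrete, here is a back-of-the-envelope sketch in Go. The function name and the traffic figures are illustrative assumptions, not part of the Tripswitch SDK; the only idea taken from this page is that the allow rate applies per instance.

```go
package main

import "fmt"

// fleetProbeRate estimates how many requests per second reach the dependency
// while every instance is Half-Open. All inputs are illustrative assumptions;
// nothing here is read from Tripswitch.
func fleetProbeRate(instances int, perInstanceRPS, allowRate float64) float64 {
	return float64(instances) * perInstanceRPS * allowRate
}

func main() {
	// 100 instances, 10 requests/second each, 10% allow rate.
	fmt.Printf("%.0f probes/second\n", fleetProbeRate(100, 10, 0.1))
	// Same allow rate on a 5-instance fleet, for comparison.
	fmt.Printf("%.0f probes/second\n", fleetProbeRate(5, 10, 0.1))
}
```

The same allow rate produces very different load depending on fleet size, so it is worth running this arithmetic with your own numbers before choosing a min_allow_rate.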


What Tripswitch Does During Half-Open

Tripswitch’s responsibilities during Half-Open are intentionally limited.

Tripswitch Does

Responsibility                  What it means
Evaluate health centrally       The server determines breaker state based on collected samples
Choose when to enter Half-Open  Timing is controlled by cooldown and backoff settings
Specify an allow rate           The server tells SDKs what percentage of traffic to allow
Observe probe outcomes          Sample data from probes informs the Close/re-Open decision

Tripswitch Does Not

Responsibility                      Why
Coordinate probes across instances  Would require global state and introduce cross-instance interference
Enforce a global probe budget       Each instance operates independently by design

Why Not Synthetic Probes?

Traditional circuit breakers test recovery with a single canary request. If it succeeds, full traffic resumes. The problem: a service that can handle one request may still collapse under production load.

Tripswitch takes a different approach. By allowing a controlled percentage of real production traffic through — rather than generating a synthetic probe — it gathers a statistically meaningful recovery signal before increasing load. A 20% allow rate on real traffic tells you far more than a single canary, and the Hybrid Ratchet ensures load only increases as that signal remains positive.


1. The Hybrid Ratchet

While in Half-Open, Tripswitch limits the volume of requests passed to the upstream service. This allow_rate is managed by a hybrid algorithm that combines a discrete jump with a linear climb:

  • The Anchor Step: Upon each successful evaluation while at the floor rate, the breaker performs an “anchor jump.” By default, it doubles your min_allow_rate (e.g., from 20% to 40%). This quickly establishes a meaningful baseline of traffic to evaluate.
  • The Time-Based Ramp: Once anchored, the allow_rate begins a smooth linear climb toward 100% over the recovery_window_ms.
  • Monotonic Progression: The rate is designed to only increase or hold steady. If an evaluation returns Indeterminate (due to insufficient traffic samples), the rate holds steady to avoid over-pressuring a service with unknown health.
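The three behaviors above can be modeled in a few lines of Go. This is a minimal sketch, not Tripswitch's implementation: the Ratchet type, its fields, and the Evaluate method are hypothetical, and it assumes that the ramp is measured from the moment of the anchor jump and that a tripping evaluation is handled elsewhere (it exits Half-Open).

```go
package main

import (
	"fmt"
	"math"
)

// Outcome is the result of one evaluation tick. A tripping evaluation is
// omitted because it leaves Half-Open entirely.
type Outcome int

const (
	Healthy Outcome = iota
	Indeterminate
)

// Ratchet is a hypothetical model of the Hybrid Ratchet.
type Ratchet struct {
	minAllowRate     float64
	recoveryWindowMs int64
	rate             float64
	anchored         bool
	anchorRate       float64
	anchoredAtMs     int64
}

func NewRatchet(minAllowRate float64, recoveryWindowMs int64) *Ratchet {
	return &Ratchet{
		minAllowRate:     minAllowRate,
		recoveryWindowMs: recoveryWindowMs,
		rate:             minAllowRate, // start at the floor
	}
}

// Evaluate returns the allow rate after an evaluation that happens elapsedMs
// milliseconds after the breaker entered Half-Open.
func (r *Ratchet) Evaluate(o Outcome, elapsedMs int64) float64 {
	if o == Indeterminate {
		return r.rate // unknown health: hold steady
	}
	if !r.anchored {
		// Anchor step: jump from the floor to double the floor (capped at 1).
		r.anchored = true
		r.anchorRate = math.Min(1, 2*r.minAllowRate)
		r.anchoredAtMs = elapsedMs
		r.rate = r.anchorRate
		return r.rate
	}
	// Time-based ramp: linear climb from the anchor rate toward 1.0.
	// Assumption: the window is measured from the anchor jump.
	progress := 1.0
	if r.recoveryWindowMs > 0 {
		progress = float64(elapsedMs-r.anchoredAtMs) / float64(r.recoveryWindowMs)
	}
	target := math.Min(1, r.anchorRate+(1-r.anchorRate)*progress)
	r.rate = math.Max(r.rate, target) // monotonic: never decrease
	return r.rate
}

func main() {
	r := NewRatchet(0.2, 180_000)
	fmt.Printf("%.2f\n", r.Evaluate(Healthy, 10_000))        // anchor jump: 0.40
	fmt.Printf("%.2f\n", r.Evaluate(Indeterminate, 50_000))  // hold: 0.40
	fmt.Printf("%.2f\n", r.Evaluate(Healthy, 100_000))       // halfway up the ramp: 0.70
	fmt.Printf("%.2f\n", r.Evaluate(Healthy, 1_000_000))     // capped: 1.00
}
```

In this model an Indeterminate tick holds the rate, and the next healthy tick catches up to wherever the time-based ramp has reached. Whether Tripswitch catches up or resumes from the held value is not stated on this page.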

2. Exit Paths

A breaker’s journey through Half-Open concludes in one of two ways:

→ Closed (Confirmed Recovery)

The breaker transitions to Closed once the Confirmation Window expires, provided that:

  1. Sufficient Data: At least one evaluation during the window captured enough samples to be statistically significant.
  2. Health Maintained: No evaluation during the window triggered a trip.

→ Open (Failed Recovery)

If the traffic health violates a rule, the breaker immediately trips back to Open.

  • Exponential Backoff: If half_open_backoff_enabled is true, each consecutive failure in Half-Open doubles the next cooldown_ms, capped by half_open_backoff_cap_ms.
  • Fast-Trip for Consecutive Failures: For breakers using a consecutive_failures rule, any single failure during the Half-Open state will bypass the evaluation tick and trip the breaker back to Open immediately.
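The backoff rule can be sketched as a small pure function. The name nextCooldownMs, its parameters, and the 30,000 ms base cooldown used in the example are assumptions for illustration; this page does not state the default cooldown_ms or the exact formula, only that each consecutive Half-Open failure doubles the next cooldown up to the cap.

```go
package main

import "fmt"

// nextCooldownMs returns the cooldown to apply after `failures` consecutive
// failed Half-Open attempts. Hypothetical helper: it models "double per
// failure, capped" and is not Tripswitch's code.
func nextCooldownMs(baseMs int64, failures int, enabled bool, capMs int64) int64 {
	if !enabled {
		return baseMs // backoff disabled: always use the base cooldown
	}
	cooldown := baseMs
	for i := 0; i < failures; i++ {
		if cooldown >= capMs {
			return capMs // stop early so the value cannot overflow
		}
		cooldown *= 2
	}
	if cooldown > capMs {
		return capMs
	}
	return cooldown
}

func main() {
	const baseMs, capMs = 30_000, 3_600_000 // base is an assumed value
	for failures := 0; failures <= 8; failures++ {
		fmt.Printf("failures=%d cooldown=%dms\n",
			failures, nextCooldownMs(baseMs, failures, true, capMs))
	}
}
```

With these example numbers the cooldown reaches the one-hour cap on the seventh consecutive failure, which is what stops a persistently broken dependency from being probed every few seconds.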

3. The Indeterminate Policy

If the confirmation window expires but Tripswitch never received sufficient traffic to confidently verify recovery, it applies the Indeterminate Policy:

Policy                   Behavior
:conservative (Default)  Stay in Half-Open and keep waiting for more data.
:optimistic              Close the breaker and resume full traffic.
:pessimistic             Trip back to Open and wait for the next cooldown.
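Taken together, the exit rules and the three policies amount to a small decision function that runs when the confirmation window expires. The sketch below is a hypothetical Go model (the type names and onWindowExpired are not SDK identifiers). It assumes that any tripping evaluation has already moved the breaker to Open, so the only open question at expiry is whether enough data was seen.

```go
package main

import "fmt"

type State string
type Policy string

const (
	Closed   State = "closed"
	HalfOpen State = "half_open"
	Open     State = "open"

	Conservative Policy = "conservative"
	Optimistic   Policy = "optimistic"
	Pessimistic  Policy = "pessimistic"
)

// onWindowExpired decides the next state when the confirmation window ends.
// sufficientData reports whether at least one evaluation in the window had
// enough samples to be meaningful.
func onWindowExpired(sufficientData bool, p Policy) State {
	if sufficientData {
		return Closed // confirmed recovery
	}
	switch p {
	case Optimistic:
		return Closed // trust the absence of failures
	case Pessimistic:
		return Open // retreat and wait for the next cooldown
	default:
		return HalfOpen // conservative: keep waiting for data
	}
}

func main() {
	fmt.Println(onWindowExpired(true, Conservative))
	fmt.Println(onWindowExpired(false, Conservative))
	fmt.Println(onWindowExpired(false, Optimistic))
	fmt.Println(onWindowExpired(false, Pessimistic))
}
```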

4. Configuration Reference

Field                      Default                        Description
half_open_confirmation_ms  90,000 (90 seconds)            The duration of the confirmation window.
min_allow_rate             0.2                            The starting traffic allowance (20%).
recovery_window_ms         2 × half_open_confirmation_ms  The time taken for the ramp to reach 100% capacity.
half_open_backoff_enabled  true                           Whether to double the cooldown after repeat failures.
half_open_backoff_cap_ms   3,600,000                      The maximum duration for exponential backoff (1 hour).
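For readers who prefer to see the defaults as data, the following Go sketch mirrors the table. The HalfOpenConfig struct and its methods are hypothetical and exist only for illustration; Tripswitch's actual configuration surface may look different. The recovery window default is read here as twice the confirmation window.

```go
package main

import (
	"errors"
	"fmt"
)

// HalfOpenConfig mirrors the fields in the table above. Hypothetical type.
type HalfOpenConfig struct {
	ConfirmationMs   int64   // half_open_confirmation_ms
	MinAllowRate     float64 // min_allow_rate
	RecoveryWindowMs int64   // recovery_window_ms
	BackoffEnabled   bool    // half_open_backoff_enabled
	BackoffCapMs     int64   // half_open_backoff_cap_ms
}

// DefaultHalfOpenConfig returns the documented defaults.
func DefaultHalfOpenConfig() HalfOpenConfig {
	const confirmationMs = 90_000
	return HalfOpenConfig{
		ConfirmationMs:   confirmationMs,
		MinAllowRate:     0.2,
		RecoveryWindowMs: 2 * confirmationMs, // assumed reading of the default
		BackoffEnabled:   true,
		BackoffCapMs:     3_600_000,
	}
}

// Validate applies basic sanity checks chosen for this sketch.
func (c HalfOpenConfig) Validate() error {
	if c.MinAllowRate <= 0 || c.MinAllowRate > 1 {
		return errors.New("min_allow_rate must be in (0, 1]")
	}
	if c.ConfirmationMs <= 0 || c.RecoveryWindowMs <= 0 || c.BackoffCapMs <= 0 {
		return errors.New("durations must be positive")
	}
	return nil
}

func main() {
	cfg := DefaultHalfOpenConfig()
	fmt.Printf("%+v valid=%v\n", cfg, cfg.Validate() == nil)
}
```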

5. Visualizing Recovery (Substates)

The Tripswitch Dashboard provides high-resolution substates to show exactly where a breaker is in its recovery cycle:

  • :confirming: In Half-Open, but no evaluation has occurred yet (waiting for traffic).
  • :at_minimum: Evaluating traffic at the min_allow_rate with no ramp events yet.
  • :ramping: The ratchet is actively increasing the traffic allowance.
  • :near_full: Traffic allowance is between 90% and 100%.
  • :full_allowance: The rate has reached 100%, but the confirmation window is still active.
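One way to read the list above is as a classification over three observable facts: whether any evaluation has happened, whether the ratchet has moved, and the current allow rate. The Go function below is a hypothetical sketch of that reading, not the Dashboard's logic. It assumes 90% is inclusive for :near_full and that the higher-rate substates take precedence over :ramping.

```go
package main

import "fmt"

// substate maps Half-Open progress to the substate names shown in the
// Dashboard. Hypothetical helper; boundary handling is an assumption.
func substate(evaluated bool, rampEvents int, allowRate float64) string {
	switch {
	case !evaluated:
		return "confirming" // no evaluation yet, waiting for traffic
	case allowRate >= 1.0:
		return "full_allowance" // 100%, confirmation window still active
	case allowRate >= 0.9:
		return "near_full"
	case rampEvents > 0:
		return "ramping"
	default:
		return "at_minimum" // evaluating at min_allow_rate, no ramp yet
	}
}

func main() {
	fmt.Println(substate(false, 0, 0.2))
	fmt.Println(substate(true, 0, 0.2))
	fmt.Println(substate(true, 1, 0.4))
	fmt.Println(substate(true, 5, 0.95))
	fmt.Println(substate(true, 9, 1.0))
}
```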

What You Must Reason About

During Half-Open, the SDK is doing exactly what it says it does. Most problems come from assumptions outside its scope.

Fleet Size

The more instances you run, the more probes you generate in aggregate.

Probe Cost

Not all requests are equal. Some probes are cheap. Others are expensive database operations or third-party API calls. A large fleet probing at even a low allow rate can overwhelm a recovering service.

Timeouts and Outcome Classification

Slow probes are worse than failed probes: they tie up resources and delay recovery signals. Give probe requests tight timeouts. If your code treats “bad data” as success, Tripswitch will too. Use WithErrorEvaluator if an HTTP 200 with an invalid payload should count as a failure.
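As an illustration of outcome classification, the Go sketch below treats a 200 response with an unparseable or incomplete body as a failure. Everything in it is hypothetical: the quote payload, the classify function, and the rule that any non-200 status is a failure. This page does not show the signature WithErrorEvaluator expects, so the wiring into the SDK is deliberately left out; check the SDK reference for that.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

var errInvalidPayload = errors.New("invalid payload")

// quote is an assumed response shape used only for this example.
type quote struct {
	Price *float64 `json:"price"`
}

// classify returns a non-nil error when the response should count as a
// failure. A real service will likely need finer-grained status handling.
func classify(status int, body []byte) error {
	if status != 200 {
		return fmt.Errorf("unexpected status %d", status)
	}
	var q quote
	if err := json.Unmarshal(body, &q); err != nil {
		// 200 with a body that is not valid JSON: count it as a failure.
		return fmt.Errorf("%w: %v", errInvalidPayload, err)
	}
	if q.Price == nil {
		// 200 with valid JSON but a missing required field.
		return fmt.Errorf("%w: missing price", errInvalidPayload)
	}
	return nil
}

func main() {
	fmt.Println(classify(200, []byte(`{"price": 12.5}`)))
	fmt.Println(classify(200, []byte(`<html>maintenance</html>`)))
	fmt.Println(classify(200, []byte(`{}`)))
	fmt.Println(classify(503, nil))
}
```

The point of the design is that the breaker only sees what your code reports. If a maintenance page served with a 200 is reported as success, Half-Open will ramp traffic toward a dependency that is not actually serving.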


What Success Looks Like

A healthy Half-Open phase looks boring: a small number of probes succeed, failures are visible but contained, the breaker transitions cleanly to Closed, and traffic ramps up without a second collapse.

If Half-Open feels noisy, unpredictable, or flappy, it usually means too many probes, probes that are too expensive, or outcomes that don’t reflect reality.


What Half-Open Is Not

Misconception            Reality
A retry mechanism        The SDK doesn’t retry. Rejected requests fail immediately.
A fairness system        No coordination between instances. Each gates independently.
A guarantee of recovery  The ratchet can fail and reset. That’s the point.

The job of Half-Open is not to make recovery painless. It is to make recovery survivable.


Summary

Half-Open is a controlled observation phase, not a recovery declaration. Tripswitch passively evaluates live traffic samples and uses the Hybrid Ratchet to gradually restore capacity — anchoring at a meaningful floor, then climbing linearly toward full load as confidence accumulates. If recovery stalls, exponential backoff prevents flapping. If traffic is too sparse to decide, the Indeterminate Policy determines whether to wait, trust, or retreat.

The system handles the mechanics. You handle fleet sizing, probe cost, timeout configuration, and outcome classification.

If recovery feels fragile, look at what you control, not what Tripswitch controls.


Next Steps

  • See Common Mistakes for concrete failure patterns
  • See Tuning & Operations for guidance on allow rates and thresholds
  • Revisit Integration Patterns if Half-Open behavior is surprising