Fail Open vs Fail Closed: Pick the Default, Then Earn Exceptions

Circuit breakers are policy. When a dependency is failing, you are choosing the failure you prefer.

The choice comes down to two modes:

Fail Open: Keep running and accept the risk of being wrong.
Fail Closed: Refuse to run and accept being unavailable.

You need a default. If you do not pick one, you will end up with accidental defaults scattered across timeouts, retries, and one-off error handling. Inconsistency is where the expensive bugs live.

What These Modes Mean in Practice

Fail open is not “unsafe.” It means you have a defined fallback—returning cached data, reducing fidelity, or queueing work.
Fail closed is not “pessimistic.” It means you stop before you do something you cannot safely undo.

The decision is simple: Compare the cost of being down versus the cost of being wrong. If being wrong creates irreversible side effects, trust loss, or compliance risk, you must bias toward failing closed.
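
As a rough illustration, the comparison above can be written as a small function. This is a sketch under stated assumptions, not a prescribed API: `PathRisk`, `Mode`, and `default_mode` are hypothetical names, and the three flags simply mirror the risks listed in the preceding paragraph.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"


@dataclass(frozen=True)
class PathRisk:
    """Hypothetical summary of what being wrong costs on one code path."""
    irreversible_side_effects: bool
    trust_loss: bool
    compliance_risk: bool


def default_mode(risk: PathRisk) -> Mode:
    # Any signal that being wrong is expensive biases toward stopping.
    if risk.irreversible_side_effects or risk.trust_loss or risk.compliance_risk:
        return Mode.FAIL_CLOSED
    # Assumption: with none of those risks present, being down costs more
    # than being wrong, so degrading is acceptable.
    return Mode.FAIL_OPEN
```

A real decision will rarely be three booleans, but forcing the question into this shape makes the reasoning explicit enough to review.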

Why Teams Get This Wrong

Optimizing for visible uptime: Uptime is easy to measure. Integrity failures stay “invisible” until they show up later as support tickets, manual reconciliation, and cleanup.
Applying one rule everywhere: Reads and writes do not deserve the same default.
Hiding uncertainty: A fallback that looks like a normal answer trains users to trust something that isn’t true.

A Starting Posture

If you want a simple, resilient baseline, start here:

1. Writes default to FAIL CLOSED

This covers anything that charges money, grants permissions, mutates state, triggers external side effects, or creates authoritative records.

2. Reads can default to FAIL OPEN

…but only with honesty. Last-known-good, cached, or reduced detail is fine as long as nobody confuses it with fresh truth.
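
One way to picture this baseline is a thin wrapper around a dependency. The sketch below is illustrative only: `Gateway`, `ReadResult`, `DependencyUnavailable`, and the `breaker_open` callable are hypothetical names, and the in-memory dictionary stands in for whatever cache you actually use. The point is that a degraded read carries a `stale` flag and a timestamp, while a write refuses outright.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class DependencyUnavailable(Exception):
    """Raised when the path refuses to run because the dependency is unhealthy."""


@dataclass(frozen=True)
class ReadResult:
    value: str
    stale: bool    # True when served from last-known-good
    as_of: float   # when the value was actually fetched (epoch seconds)


class Gateway:
    def __init__(
        self,
        fetch: Callable[[str], str],
        store: Callable[[str, str], None],
        breaker_open: Callable[[], bool],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._store = store
        self._breaker_open = breaker_open
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def read(self, key: str) -> ReadResult:
        """Reads fail open, but the result says so."""
        if not self._breaker_open():
            value = self._fetch(key)
            now = self._clock()
            self._cache[key] = (value, now)
            return ReadResult(value=value, stale=False, as_of=now)
        cached = self._cache.get(key)
        if cached is None:
            # No last-known-good value: there is nothing honest to return.
            raise DependencyUnavailable(f"no cached value for {key!r}")
        return ReadResult(value=cached[0], stale=True, as_of=cached[1])

    def write(self, key: str, value: str) -> None:
        """Writes fail closed: no guessing and no silent queueing."""
        if self._breaker_open():
            raise DependencyUnavailable(f"write to {key!r} refused")
        self._store(key, value)
```

Because `stale` and `as_of` are part of the return type, a caller cannot render the value without at least seeing that it might not be fresh.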

Earn Exceptions by Writing Them Down

Treat exceptions like an interface contract. If you cannot explain the fallback, you do not have a safe fail-open path. Use this checklist to turn policy from folklore into reviewable code:

Declared mode: Fail open or fail closed?
Reason: Why can this path safely degrade (or why must it stop)?
Fallback contract: What happens instead (return cached data, queue the work, skip the step)?
Reconciliation story: How do you detect and fix wrong outcomes later?
Truth statement: How is uncertainty communicated to users?
Review trigger: When should this exception be revisited?
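
The checklist above could be captured as a small record that lives next to the code it governs. This is a minimal sketch, not a standard format: `PolicyRecord`, its field names, and the example values (including the path name and the review date) are all hypothetical placeholders, and the review trigger is simplified to a single date.

```python
from dataclasses import dataclass, fields
from datetime import date
from enum import Enum
from typing import List


class Mode(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"


@dataclass(frozen=True)
class PolicyRecord:
    """One documented exception; fields mirror the checklist items."""
    path: str               # hypothetical identifier for the code path
    mode: Mode              # declared mode
    reason: str
    fallback_contract: str
    reconciliation: str
    truth_statement: str
    review_by: date         # review trigger, simplified here to a date

    def missing(self) -> List[str]:
        """Names of text fields left blank, so a review can reject the record."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


record = PolicyRecord(
    path="catalog.price_lookup",
    mode=Mode.FAIL_OPEN,
    reason="Prices are re-validated at checkout, so a stale display price is recoverable.",
    fallback_contract="Return the last-known-good price from cache.",
    reconciliation="",  # deliberately blank to show the check firing
    truth_statement="Show when the price was last refreshed next to the price.",
    review_by=date(2030, 1, 1),  # placeholder date
)
print(record.missing())  # ['reconciliation']
```

A check like `missing()` can run in CI, so an exception without a reconciliation story fails review instead of reaching production.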

What to Make Observable

The useful question is not just “is the breaker open?” The useful question is: “What did we choose to do because it was open?”

At minimum, your observability should answer:

Which dependency was unhealthy?
Which mode applied (fail open or fail closed)?
What fallback ran, if any?
Was the output Full, Partial, Stale, or Queued?
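
A structured event with one field per question is one way to make those answers queryable. The sketch below assumes JSON log lines. `BreakerDecision`, `Outcome`, and the field names are hypothetical, and the four outcome values are taken directly from the list above.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"


class Outcome(Enum):
    FULL = "full"
    PARTIAL = "partial"
    STALE = "stale"
    QUEUED = "queued"


@dataclass(frozen=True)
class BreakerDecision:
    dependency: str           # which dependency was unhealthy
    mode: Mode                # which policy applied
    fallback: Optional[str]   # which fallback ran, or None
    outcome: Outcome          # what the caller actually received

    def to_log_line(self) -> str:
        """Serialize the decision as one JSON object with stable key order."""
        return json.dumps(
            {
                "dependency": self.dependency,
                "mode": self.mode.value,
                "fallback": self.fallback,
                "outcome": self.outcome.value,
            },
            sort_keys=True,
        )


event = BreakerDecision(
    dependency="pricing-api",  # hypothetical dependency name
    mode=Mode.FAIL_OPEN,
    fallback="last_known_good",
    outcome=Outcome.STALE,
)
print(event.to_log_line())
```

Emitting one such event per degraded request lets you count how often each fallback ran, not just how long the breaker was open.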

This is how you avoid the worst outcome: a system that “stays up” while quietly doing the wrong thing.