When a dependency is failing, “uptime” is no longer a single, coherent goal. Instead, you are forced to choose between two distinct types of failure:
- Keep running and risk being wrong.
- Stop running and accept being unavailable.
The right answer depends on which failure is more expensive in your system.
The Decision Table
Outputs: STOP = refuse to run; DEGRADE = reduced response, queue, or last-known-good
| Question | If “Yes” | If “No” |
|---|---|---|
| Does the action have irreversible side effects? | STOP (fail closed) | DEGRADE |
| Is a wrong result worse than no result? | STOP | DEGRADE |
| Can you reconcile later with an audit trail? | DEGRADE | Likely STOP |
| Can you clearly label “stale” as stale? | DEGRADE | Likely STOP |
| Is this sensitive for trust, money, or compliance? | STOP | DEGRADE |
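
One way to make the table executable is to encode each row as a boolean and combine them. The sketch below is illustrative only: the names `ActionProfile`, `Mode`, and `decide` are hypothetical, and the combination rule (any single STOP signal wins, and the two softer "No" answers in rows three and four are treated as hard STOPs) is an assumption the table itself does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    STOP = "stop"        # refuse to run (fail closed)
    DEGRADE = "degrade"  # reduced response, queue, or last-known-good


@dataclass(frozen=True)
class ActionProfile:
    # Hypothetical field names; each one answers a row of the table.
    irreversible_side_effects: bool
    wrong_worse_than_none: bool
    reconcilable_with_audit_trail: bool
    can_label_stale: bool
    sensitive: bool  # trust, money, or compliance


def decide(profile: ActionProfile) -> Mode:
    """Return STOP if any row points to STOP, otherwise DEGRADE.

    Assumption: rows are combined conservatively, so one STOP signal
    is enough to fail closed.
    """
    stop_signals = [
        profile.irreversible_side_effects,
        profile.wrong_worse_than_none,
        not profile.reconcilable_with_audit_trail,
        not profile.can_label_stale,
        profile.sensitive,
    ]
    return Mode.STOP if any(stop_signals) else Mode.DEGRADE


charge_card = ActionProfile(
    irreversible_side_effects=True,
    wrong_worse_than_none=True,
    reconcilable_with_audit_trail=False,
    can_label_stale=False,
    sensitive=True,
)
dashboard = ActionProfile(
    irreversible_side_effects=False,
    wrong_worse_than_none=False,
    reconcilable_with_audit_trail=True,
    can_label_stale=True,
    sensitive=False,
)
print(decide(charge_card))
print(decide(dashboard))
```

A stricter or looser combination rule may suit your system better; the point is that the answers are written down per action rather than decided during an incident.
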
The Rule of Thumb
If you cannot undo it, cannot reconcile it, and cannot clearly communicate uncertainty, then you should not run it.
Concrete Examples
STOP (Fail Closed)
- Payments: Charging a card when the ledger or fraud check is unreliable.
- Security: Granting access when identity or authorization checks are flaky.
- Messaging: Sending emails, webhooks, or orders when you cannot safely dedupe.
- State Mutation: Writing state when you might create a “partial truth” that looks complete.
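
As a rough illustration of failing closed for the messaging case, the sketch below refuses to send when the dedupe check itself cannot be completed. `DedupeStore`, `DependencyUnavailable`, and `send_once` are hypothetical names, and the in-memory set stands in for whatever shared store a real system would use.

```python
from typing import Callable


class DependencyUnavailable(Exception):
    """Raised when a required check cannot be completed."""


class DedupeStore:
    """Hypothetical in-memory stand-in for a shared dedupe store."""

    def __init__(self) -> None:
        self.healthy = True
        self._seen: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True only the first time a key is claimed."""
        if not self.healthy:
            raise DependencyUnavailable("dedupe store unreachable")
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def send_once(store: DedupeStore, key: str, send: Callable[[], None]) -> str:
    """Send only if the dedupe check succeeds and the key is new."""
    try:
        first_time = store.claim(key)
    except DependencyUnavailable:
        # Fail closed: without a dedupe answer, sending might duplicate.
        return "stopped"
    if not first_time:
        return "duplicate"
    send()
    return "sent"
```

The important branch is the `except`: an unknown answer from the dependency is treated as "do not act", not as "probably fine".
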
DEGRADE (Keep Serving)
- Read-Heavy UI: Showing “last-known-good” data with an explicit “Data delayed” indicator.
- Analytics: Non-critical tracking that can lag without misleading business decisions.
- Async Tasks: Queueing work that is safe to process once dependencies recover.
- Optional Features: Disabling non-essential UI components to preserve the core experience.
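
The "last-known-good" case can be sketched as a small wrapper that always returns the value together with a staleness label, so the caller cannot serve cached data without knowing it is cached. The names `LastKnownGood` and `Labeled` are hypothetical, and the injectable `clock` parameter is an assumption made to keep the example testable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Labeled(Generic[T]):
    value: T
    stale: bool          # True when served from the cache
    age_seconds: float   # time since the value was last fetched


class LastKnownGood(Generic[T]):
    """Serve fresh data when possible, labeled cached data otherwise."""

    def __init__(
        self,
        fetch: Callable[[], T],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Labeled[T]:
        try:
            value = self._fetch()
        except Exception:
            if self._fetched_at is None:
                # Nothing cached yet: there is no honest way to degrade.
                raise
            age = self._clock() - self._fetched_at
            return Labeled(self._value, stale=True, age_seconds=age)
        self._value = value
        self._fetched_at = self._clock()
        return Labeled(value, stale=False, age_seconds=0.0)
```

A UI layer could render a "Data delayed" indicator whenever `stale` is true and use `age_seconds` to say how delayed.
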
The Hidden Requirement: Honesty
DEGRADE only works if the system is honest about what it is serving. Stale data that masquerades as fresh is often more dangerous than downtime, because it produces confident decisions on bad inputs.
Choosing between DEGRADE and STOP isn’t just an infrastructure pattern; it’s a product decision. You are deciding what kind of “wrong” you are willing to ship.