When a dependency is failing, “uptime” is no longer a single, coherent goal. Instead, you are forced to choose between two distinct types of failure:

  1. Keep running and risk being wrong.
  2. Stop running and accept being unavailable.

The right answer depends on which failure is more expensive in your system.


The Decision Table

Outputs: STOP = refuse to run | DEGRADE = reduced response, queue, or last-known-good

Question                                        | If “Yes”           | If “No”
Does the action have irreversible side effects? | STOP (fail closed) | DEGRADE
Is a wrong result worse than no result?         | STOP               | DEGRADE
Can you reconcile later with an audit trail?    | DEGRADE            | STOP likely
Can you clearly label “stale” as stale?         | DEGRADE            | STOP likely
Is this trust, money, or compliance sensitive?  | STOP               | DEGRADE
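
One way to make the table executable is to encode each row as a vote and then combine the votes. The sketch below is illustrative only: the names Outcome, Answers, votes, and decide are hypothetical, and the combination policy (any STOP vote wins, and "STOP likely" counts as STOP) is an assumption that the table itself does not state. A real system would probably weight the questions per action.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Outcome(Enum):
    STOP = "stop"        # refuse to run (fail closed)
    DEGRADE = "degrade"  # reduced response, queue, or last-known-good


@dataclass(frozen=True)
class Answers:
    """Yes/no answers to the five questions, for one action."""
    irreversible_side_effects: bool
    wrong_worse_than_none: bool
    can_reconcile_later: bool
    can_label_stale: bool
    sensitive: bool  # trust, money, or compliance


def votes(a: Answers) -> Dict[str, Outcome]:
    """Return the table's recommendation for each question, row by row."""
    stop, degrade = Outcome.STOP, Outcome.DEGRADE
    return {
        "irreversible_side_effects": stop if a.irreversible_side_effects else degrade,
        "wrong_worse_than_none": stop if a.wrong_worse_than_none else degrade,
        # For the next two rows a "yes" favours DEGRADE; a "no" is "STOP likely",
        # which this sketch treats as STOP (an assumption).
        "can_reconcile_later": degrade if a.can_reconcile_later else stop,
        "can_label_stale": degrade if a.can_label_stale else stop,
        "sensitive": stop if a.sensitive else degrade,
    }


def decide(a: Answers) -> Outcome:
    """Assumed policy: a single STOP vote is enough to fail closed."""
    return Outcome.STOP if Outcome.STOP in votes(a).values() else Outcome.DEGRADE


# A read-only dashboard that can label stale data keeps serving...
dashboard = Answers(False, False, True, True, False)
# ...while a card charge with no way to reconcile stops.
card_charge = Answers(True, True, False, False, True)
print(decide(dashboard), decide(card_charge))
```

Keeping votes separate from decide makes it possible to log why an action stopped, not just that it did.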

The Rule of Thumb

If you cannot undo it, cannot reconcile it, and cannot clearly communicate uncertainty, then you should not run it.


Concrete Examples

STOP (Fail Closed)

  • Payments: Charging a card when the ledger or fraud check is unreliable.
  • Security: Granting access when identity or authorization checks are flaky.
  • Messaging: Sending emails, webhooks, or orders when you cannot safely dedupe.
  • State Mutation: Writing state when you might create a “partial truth” that looks complete.
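
As a rough illustration of failing closed, the sketch below refuses to charge a card unless the fraud check positively approves it. Everything here is hypothetical: charge_card, DependencyUnavailable, Refused, and the fraud_check and charge callables stand in for whatever your payment stack provides. The point is only that "could not check" is handled like "no", never like "yes".

```python
class DependencyUnavailable(Exception):
    """The check could not be completed (timeout, outage), as opposed to answering 'no'."""


class Refused(Exception):
    """The action was not performed because its safety could not be confirmed."""


def charge_card(amount_cents, fraud_check, charge):
    """Charge only when the fraud check positively approves the payment.

    fraud_check(amount_cents) -> bool and charge(amount_cents) are
    hypothetical callables supplied by the caller.
    """
    try:
        approved = fraud_check(amount_cents)
    except DependencyUnavailable as exc:
        # Fail closed: an unreachable check never turns into an approval.
        raise Refused("fraud check unavailable; charge not attempted") from exc
    if not approved:
        raise Refused("fraud check declined the charge")
    # Only reached after an explicit approval.
    return charge(amount_cents)
```

The caller sees a Refused error in both failure cases, so the irreversible side effect (the charge) is never attempted on an unknown answer.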

DEGRADE (Keep Serving)

  • Read-Heavy UI: Showing “last-known-good” data with an explicit “Data delayed” indicator.
  • Analytics: Non-critical tracking that can lag without misleading business decisions.
  • Async Tasks: Queueing work that is safe to process once dependencies recover.
  • Optional Features: Disabling non-essential UI components to preserve the core experience.
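
The "Async Tasks" case can be sketched as a queue that holds work while the dependency is down and drains it in order after recovery. This is a minimal in-memory sketch under stated assumptions: DeferredQueue and its methods are hypothetical names, the caller tells the queue whether the dependency is healthy, and tasks are assumed to be safe to retry, because a task that fails mid-drain stays queued and will be attempted again.

```python
from collections import deque


class DeferredQueue:
    """Holds tasks while a dependency is down; processes them in order later."""

    def __init__(self, handler):
        self._handler = handler   # callable that performs one task
        self._pending = deque()

    def submit(self, task, dependency_healthy: bool) -> str:
        # Run immediately only when healthy AND nothing is waiting,
        # so queued tasks are never overtaken by newer ones.
        if dependency_healthy and not self._pending:
            self._handler(task)
            return "processed"
        self._pending.append(task)
        return "queued"

    def drain(self) -> int:
        """Process queued tasks oldest-first and return how many completed.

        If the handler raises, the exception propagates and the failing
        task remains at the front of the queue for a later retry.
        """
        done = 0
        while self._pending:
            self._handler(self._pending[0])
            self._pending.popleft()   # remove only after success
            done += 1
        return done
```

A production version would need durable storage and a bound on queue size; an unbounded in-memory queue simply moves the failure to a later moment.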

The Hidden Requirement: Honesty

DEGRADE works only if the system is honest about it. Stale data that masquerades as fresh is often more dangerous than downtime, because it produces confident decisions on bad inputs.
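
One way to keep a degraded read honest is to make staleness part of the value itself, so no caller can receive cached data without also receiving the flag. The sketch below assumes a hypothetical fetch callable and uses illustrative names (Reading, LastKnownGood, banner); it is not tied to any particular cache library.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Reading:
    value: Any
    fetched_at: float  # epoch seconds when the value came from the source
    stale: bool        # True when served from cache after a failed refresh


class LastKnownGood:
    """Serves fresh data when possible, otherwise the last good value, labeled stale."""

    def __init__(self, fetch: Callable[[], Any], clock: Callable[[], float] = time.time):
        self._fetch = fetch
        self._clock = clock
        self._last: Optional[Reading] = None

    def get(self) -> Reading:
        try:
            value = self._fetch()
        except Exception:
            if self._last is None:
                raise  # nothing honest to serve, so surface the failure
            # Same value and original timestamp, but explicitly marked stale.
            return Reading(self._last.value, self._last.fetched_at, stale=True)
        self._last = Reading(value, self._clock(), stale=False)
        return self._last


def banner(reading: Reading, now: float) -> str:
    """Text for a 'Data delayed' indicator; empty when the data is fresh."""
    if not reading.stale:
        return ""
    return f"Data delayed: last updated {int(now - reading.fetched_at)}s ago"
```

Because the timestamp is the original fetch time rather than the time of the failed refresh, the indicator reports the true age of the data.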

Choosing between DEGRADE and STOP isn’t just an infrastructure pattern; it’s a product decision. You are deciding what kind of “wrong” you are willing to ship.