Failure Modes and Debugging — How: The Method

A Systematic Debugging Framework

When something goes wrong, follow this five-step process. It works for software, hardware, processes, and systems of any kind.


Step 1: Observe the Symptom Precisely

Don't say "it's broken." Say exactly what is happening:

  • ❌ "The page is broken"

  • ✅ "The page loads but shows 0 orders, when the user should have 15 orders"

  • ❌ "The system is slow"

  • ✅ "The search results take 12 seconds to appear; last week it was under 1 second"

  • ❌ "It doesn't work"

  • ✅ "Clicking 'Submit' does nothing — no error message, no loading indicator, no change"

Precise symptoms lead to precise diagnoses. Vague symptoms lead to guessing.


Step 2: Establish What Changed

Most bugs don't appear spontaneously. Something changed:

  • New code was deployed
  • Data volume increased
  • A third-party service updated its API
  • A configuration was modified
  • User behavior shifted (a marketing campaign drove unexpected traffic)

Ask: "What is different between when it worked and when it stopped working?"

If nothing changed internally, the cause is likely external: data, traffic, or a dependency.
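
One way to answer the "what is different?" question is to compare a snapshot from when the system worked against one from after it broke. The sketch below assumes you can capture such snapshots as flat dictionaries; the keys and values (app_version, cache_enabled, and so on) are hypothetical.

```python
def diff_snapshots(before: dict, after: dict) -> dict:
    """Return every key that was added, removed, or changed, as (old, new) pairs."""
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes[key] = ("<absent>", after[key])
        elif key not in after:
            changes[key] = (before[key], "<absent>")
        elif before[key] != after[key]:
            changes[key] = (before[key], after[key])
    return changes


# Hypothetical snapshots: one from when it worked, one from after it broke.
working = {"app_version": "2.3.0", "db_pool_size": 20, "search_timeout_s": 5}
broken = {
    "app_version": "2.4.0",
    "db_pool_size": 20,
    "search_timeout_s": 5,
    "cache_enabled": False,
}

for key, (old, new) in diff_snapshots(working, broken).items():
    print(f"{key}: {old!r} -> {new!r}")
```

Each key that appears in the diff is a candidate for "what changed". An empty diff points you toward external causes.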


Step 3: Bisect the Problem Space

Bisection is one of the most effective debugging techniques. Instead of searching everywhere, cut the problem space in half and determine which half contains the bug.

Think of your system as a chain through which data flows (see the Data Lifecycle section). Data enters at one end and the wrong result appears at the other. Check the midpoint:

Input → [A] → [B] → [C] → [D] → Wrong Output
                 ↑
          Check here first.
          Is the data correct at this point?

  • If the data leaving [B] is correct, the problem is in [C] or [D]
  • If the data leaving [B] is wrong, the problem is in [A] or [B]

You've eliminated half the system. Repeat until you've narrowed it to a single step.

This is binary search applied to debugging, and it works whether you're debugging code, a business process, a network issue, or a recipe.
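
The bisection above can be sketched in code. This is a minimal illustration, not a general-purpose tool: it assumes each step is a function, that you have a way to judge the data after each step (here, outputs recorded from a known-good run), and that once the data goes wrong it stays wrong downstream. The four stages and the bug in stage C are made up for the example.

```python
from typing import Any, Callable, Sequence


def find_first_bad_stage(
    stages: Sequence[Callable[[Any], Any]],
    data: Any,
    is_correct: Callable[[int, Any], bool],
) -> int:
    """Return the index of the first stage whose output fails is_correct.

    Assumes the final output is already known to be wrong.
    """
    # Record the output after each stage so any midpoint can be inspected.
    outputs = []
    for stage in stages:
        data = stage(data)
        outputs.append(data)

    lo, hi = 0, len(stages) - 1  # the first bad stage lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_correct(mid, outputs[mid]):
            lo = mid + 1  # data is still good here, so the bug is later
        else:
            hi = mid  # data is already bad here, so the bug is here or earlier
    return lo


# Hypothetical pipeline A -> B -> C -> D, with a bug planted in C.
stages = [
    lambda rows: [int(r) for r in rows],      # A: parse
    lambda xs: [x for x in xs if x > 0],      # B: filter
    lambda xs: [x * 2 for x in xs[:-1]],      # C: BUG, drops the last item
    lambda xs: sum(xs),                       # D: total
]
# Outputs after each stage, recorded from a known-good run (assumed available).
expected = [[1, 2, 3], [1, 2, 3], [2, 4, 6], 12]

bad = find_first_bad_stage(stages, ["1", "2", "3"], lambda i, out: out == expected[i])
print("first bad stage:", "ABCD"[bad])
```

In a real system the checks are the expensive part, so you would inspect only the midpoints rather than recording every output. With four stages the search needs two checks; with a thousand it needs about ten.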


Step 4: Form and Test Hypotheses

Once you've narrowed the area, form a specific hypothesis:

"I believe the bug is caused by [specific thing] because [evidence]. If I'm right, then [testable prediction]."

Example:

"I believe orders show as 0 because the query is filtering by the wrong date format. If I'm right, then running the query directly will return empty results even though orders exist."

Then test it by observing — check the data, check the intermediate state. Don't change anything yet. Verify or disprove the hypothesis with evidence.

If the hypothesis is wrong, that's progress — you've eliminated a possibility.
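
Here is a sketch of testing the date-format hypothesis by observation only. It uses an in-memory SQLite database as a stand-in for the real one; the orders table, its columns, the user ID, and both date formats are assumptions made for illustration. Every statement after the setup is a read, so nothing in the system under investigation changes.

```python
import sqlite3

# Setup: a stand-in database holding 15 orders with ISO-formatted dates.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_on TEXT)"
)
conn.executemany(
    "INSERT INTO orders (user_id, created_on) VALUES (?, ?)",
    [(42, f"2024-03-{day:02d}") for day in range(1, 16)],
)


def count(sql: str, params: tuple) -> int:
    """Run a COUNT(*) query and return the single number it yields."""
    return conn.execute(sql, params).fetchone()[0]


RANGE_QUERY = (
    "SELECT COUNT(*) FROM orders "
    "WHERE user_id = ? AND created_on BETWEEN ? AND ?"
)

# Observation 1: do the orders exist at all?
total = count("SELECT COUNT(*) FROM orders WHERE user_id = ?", (42,))
# Observation 2: what does the filter return with the suspected wrong format?
suspect = count(RANGE_QUERY, (42, "03/01/2024", "03/31/2024"))
# Observation 3: what does it return with the format the column actually uses?
control = count(RANGE_QUERY, (42, "2024-03-01", "2024-03-31"))

print(f"orders exist: {total}, suspect format: {suspect}, ISO format: {control}")
```

If the prediction holds, the suspect query returns nothing while the orders exist and the control query finds them. The control matters: without it, an empty result could have many other causes.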


Step 5: Verify the Fix

You found the cause. You made a change. How do you know the fix is correct?

  • Does the symptom disappear?
  • Does it work for all cases, or just the one you tested?
  • Did the fix introduce any new problems?
  • Can you explain why the fix works?

If you can't explain why the fix works, it's not a fix — it's a lucky accident that will break again.
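
One way to check "all cases, not just the one you tested" is a small table of inputs and expected outputs that includes the original failure. The sketch below continues the date-format example from Step 4; to_iso_date is a hypothetical fix, and the cases are illustrative rather than exhaustive.

```python
from datetime import datetime


def to_iso_date(text: str) -> str:
    """Hypothetical fix: accept ISO or US-style dates and return ISO 8601."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


cases = [
    ("03/01/2024", "2024-03-01"),  # the input from the original symptom
    ("2024-03-01", "2024-03-01"),  # input that already worked must still work
    ("12/31/2023", "2023-12-31"),  # year boundary
    ("02/29/2024", "2024-02-29"),  # leap day
]

failures = []
for raw, want in cases:
    got = to_iso_date(raw)
    if got != want:
        failures.append((raw, got, want))

print("all cases pass" if not failures else failures)
```

Keeping the case table in your test suite turns the bug into a regression test, so the same failure cannot return unnoticed.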


Categorizing Failures

Not all failures are the same. Understanding the categories helps you design appropriate responses.

| Category | What It Is | Example | Response |
|---|---|---|---|
| Input | Bad data coming in | Letters in a phone number field | Validate at the boundary. Reject with a clear error. |
| Logic | Code produces wrong result | Off-by-one in a calculation | Test with known inputs, verify outputs match contract |
| Integration | Two parts don't align | Module A sends format X, Module B expects Y | Validate at every boundary. Integration failures are contract violations. |
| Resource | System exhausts something | Disk full, memory exhausted, rate limit hit | Monitor. Set limits and alerts. Design for constrained operation. |
| Dependency | External thing stops working | Database down, API returns errors | Timeout, retry, fallback, degrade gracefully |
| Timing | Wrong order or time | Two updates hit the same record simultaneously | The hardest category. Design explicit ordering where it matters. |
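
As an illustration of the Input row, here is a minimal sketch of validating at the boundary and rejecting with a clear error. The accepted shape (an optional leading + followed by 7 to 15 digits, with common separators ignored) is an assumption for the example, not a complete phone number standard.

```python
import re

# Assumed rule for this sketch: optional "+", then 7 to 15 digits.
PHONE_PATTERN = re.compile(r"\+?\d{7,15}")


def parse_phone(raw: str) -> str:
    """Normalize a phone number, or raise ValueError with a clear message."""
    cleaned = re.sub(r"[\s\-().]", "", raw)  # drop spaces and common separators
    if not PHONE_PATTERN.fullmatch(cleaned):
        raise ValueError(
            f"phone number must be 7 to 15 digits with an optional leading '+', "
            f"got {raw!r}"
        )
    return cleaned


print(parse_phone("+1 (555) 010-9999"))

try:
    parse_phone("555-CALL-NOW")
except ValueError as err:
    print("rejected:", err)
```

Because bad input is rejected where it enters, every module behind the boundary can rely on the normalized form.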

Designing for Failure

For every module and contract, answer these five questions:

1. What are the failure modes?

List every way this can fail. Use the categories above as your checklist.

2. What is the blast radius?

If this fails, what else breaks? A well-bounded module limits the blast radius. A tangled one spreads damage everywhere.

3. What is the severity?

| Severity | Meaning | Example |
|---|---|---|
| Critical | Data loss, financial impact, security breach | Double-charging a customer |
| High | Core feature unavailable | Can't log in |
| Medium | Feature degraded but usable | Search is slow but returns results |
| Low | Cosmetic or minor | Profile picture doesn't load |

4. What is the response strategy?

| Strategy | When to Use |
|---|---|
| Prevent | Failure is predictable and avoidable (validate inputs, check preconditions) |
| Retry | Failure is transient (network blip, temporary overload) |
| Fallback | There's a "good enough" alternative (show cached data if live data is unavailable) |
| Degrade | Turn off the broken feature, keep everything else running |
| Alert | Needs human attention (log, notify, escalate) |
| Fail fast | Continuing would make things worse (stop if data is corrupted) |
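
Strategies often combine. The sketch below shows Retry followed by Fallback, with anything unexpected failing fast. TransientError, the attempt count, and the delays are assumptions chosen for illustration; a real system would decide which errors count as transient.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def fetch_with_retry(fetch, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fetch() up to `attempts` times with exponential backoff.

    If every attempt raises TransientError, return fallback() instead.
    Any other exception propagates immediately (fail fast).
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...
    return fallback()


# A made-up dependency that fails twice, then recovers.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary overload")
    return "live data"


def always_down():
    raise TransientError("dependency unavailable")


# sleep is replaced with a no-op so the example runs instantly.
print(fetch_with_retry(flaky, lambda: "cached data", sleep=lambda s: None))
print(fetch_with_retry(always_down, lambda: "cached data", sleep=lambda s: None))
```

The retry limit keeps a transient problem from becoming an endless loop, and the fallback keeps the feature usable when the dependency stays down.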

5. What information is needed to diagnose it later?

When something fails at 3am and you're investigating at 9am, what do you need?

  • What was the input?
  • What was the expected output?
  • What was the actual output or error?
  • When did it happen?
  • What was the system state?

This is logging — not an afterthought, but a critical design decision.
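
A minimal sketch of a log record that answers the five questions above. The field names, the "orders" logger, and the sample values are hypothetical; the point is that each question maps to one field decided at design time.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("orders")  # hypothetical logger name


def failure_record(operation, input_data, expected, actual, state) -> dict:
    """Collect the facts needed to diagnose a failure later."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened
        "operation": operation,
        "input": input_data,   # what was the input
        "expected": expected,  # what was the expected output
        "actual": actual,      # what was the actual output or error
        "state": state,        # what was the system state
    }


record = failure_record(
    operation="list_orders",
    input_data={"user_id": 42, "from": "03/01/2024", "to": "03/31/2024"},
    expected="15 orders",
    actual="0 orders",
    state={"app_version": "2.4.0", "db_replica": "replica-2"},
)

# One JSON object per line is easy to search and filter afterwards.
# default=str keeps the call from failing on values JSON cannot encode.
logger.error(json.dumps(record, default=str))
```

Avoid logging secrets or personal data in the input and state fields; record identifiers instead where that is enough to reproduce the failure.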


What to Look For in the Examples

The following pages each present a system that has failed. You'll see:

  1. A symptom described precisely — the starting point
  2. The bisection process — how we narrow down the cause
  3. Multiple hypotheses — some wrong, some right
  4. The root cause — and how it connects to a failure category
  5. How the failure could have been prevented — what design decision would have caught it earlier