Failure Modes and Debugging — How: The Method
A Systematic Debugging Framework
When something goes wrong, follow this five-step process. It works for software, hardware, processes, and systems of any kind.
Step 1: Observe the Symptom Precisely
Don't say "it's broken." Say exactly what is happening:
- ❌ "The page is broken"
- ✅ "The page loads but shows 0 orders, when the user should have 15 orders"
- ❌ "The system is slow"
- ✅ "The search results take 12 seconds to appear; last week it was under 1 second"
- ❌ "It doesn't work"
- ✅ "Clicking 'Submit' does nothing: no error message, no loading indicator, no change"
Precise symptoms lead to precise diagnoses. Vague symptoms lead to guessing.
Step 2: Establish What Changed
Most bugs don't appear spontaneously. Something changed:
- New code was deployed
- Data volume increased
- A third-party service updated its API
- A configuration was modified
- User behavior shifted (a marketing campaign drove unexpected traffic)
Ask: "What is different between when it worked and when it stopped working?"
If nothing changed internally, the cause is likely external: data, traffic, or a dependency.
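For configuration, one way to answer "what is different?" is to diff a snapshot from when the system worked against the current one. The sketch below assumes the configuration can be flattened into plain dictionaries; the function name `diff_config` and all keys and values are made up for illustration.

```python
def diff_config(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every key that was added, removed, or changed."""
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, "<missing>")
        new = after.get(key, "<missing>")
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots: last known-good configuration vs. the current one.
last_week = {"search.timeout_s": 2, "search.index": "v7", "cache.enabled": True}
today = {"search.timeout_s": 2, "search.index": "v8", "cache.enabled": False}

for key, (old, new) in diff_config(last_week, today).items():
    print(f"{key}: {old!r} -> {new!r}")
# cache.enabled: True -> False
# search.index: 'v7' -> 'v8'
```

The same idea applies to deployed versions, dependency lists, or row counts: capture the known-good state somewhere so there is something to compare against.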
Step 3: Bisect the Problem Space
This is the most powerful debugging technique. Instead of searching everywhere, cut the problem in half and determine which half contains the bug.
Think of your system as a chain through which data flows (see the Data Lifecycle section). Data enters at one end and the wrong result appears at the other. Check the midpoint:
    Input → [A] → [B] → [C] → [D] → Wrong Output
                   ↑
                   Check here first.
                   Is the data correct at this point?
- If the data is correct at [B], the problem is in [C] or [D]
- If the data is wrong at [B], the problem is in [A] or [B]
You've eliminated half the system. Repeat until you've narrowed it to a single step.
This is binary search applied to debugging, and it works whether you're debugging code, a business process, a network issue, or a recipe.
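The bisection above can be sketched as an ordinary binary search over stage indices. This is a minimal illustration, not a general tool: it assumes you have some way to check whether the data is still correct after a given stage (the hypothetical `output_ok` callback), that the final output is known to be wrong, and that once the data goes wrong it stays wrong downstream.

```python
from typing import Callable


def first_bad_stage(n_stages: int, output_ok: Callable[[int], bool]) -> int:
    """Binary-search for the index of the first stage whose output is wrong.

    output_ok(i) reports whether the data is still correct after stage i.
    """
    lo, hi = 0, n_stages - 1  # the first bad stage lies somewhere in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if output_ok(mid):
            lo = mid + 1  # everything up to mid is fine: look downstream
        else:
            hi = mid      # already wrong at mid: look at mid or upstream
    return lo


stages = ["A", "B", "C", "D"]
checked = []


def output_ok(i: int) -> bool:
    checked.append(stages[i])
    return i < 2  # hypothetical: data is correct after A and B, wrong from C on


culprit = stages[first_bad_stage(len(stages), output_ok)]
print(culprit, checked)  # C ['B', 'C']
```

With four stages the search needs two checks; with a thousand it needs about ten, which is why bisecting beats inspecting stages one by one.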
Step 4: Form and Test Hypotheses
Once you've narrowed the area, form a specific hypothesis:
"I believe the bug is caused by [specific thing] because [evidence]. If I'm right, then [testable prediction]."
Example:
"I believe orders show as 0 because the query is filtering by the wrong date format. If I'm right, then running the query directly will return empty results even though orders exist."
Then test it by observing — check the data, check the intermediate state. Don't change anything yet. Verify or disprove the hypothesis with evidence.
If the hypothesis is wrong, that's progress — you've eliminated a possibility.
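As an illustration, the date-format hypothesis can be tested purely by observation. The sketch below uses an in-memory SQLite database; the `orders` schema, the user id, and the date strings are all hypothetical stand-ins for whatever the real system stores. Nothing is modified after the setup: the three queries only read.

```python
import sqlite3

# Setup: a throwaway database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders (user_id, created_at) VALUES (?, ?)",
    [(42, f"2024-03-{day:02d}") for day in range(1, 16)],  # 15 orders, ISO dates
)


def count_orders(user_id: int, since: str) -> int:
    """Count a user's orders on or after `since` (compared as text)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE user_id = ? AND created_at >= ?",
        (user_id, since),
    ).fetchone()
    return row[0]


# Observation 1: do the orders exist at all?
total = conn.execute("SELECT COUNT(*) FROM orders WHERE user_id = 42").fetchone()[0]
# Observation 2: the filter in the format we suspect the application sends.
as_sent = count_orders(42, "3/1/2024")
# Observation 3: the same filter in the format the column actually stores.
as_stored = count_orders(42, "2024-03-01")

print(total, as_sent, as_stored)  # 15 0 15
```

The prediction held: the rows exist, the suspect filter returns nothing, and the corrected filter returns all of them. Had observation 2 returned 15, the hypothesis would be eliminated and the search would move on.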
Step 5: Verify the Fix
You found the cause. You made a change. How do you know the fix is correct?
- Does the symptom disappear?
- Does it work for all cases, or just the one you tested?
- Did the fix introduce any new problems?
- Can you explain why the fix works?
If you can't explain why the fix works, it's not a fix — it's a lucky accident that will break again.
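One way to work through those questions is to turn them into a small regression check. The sketch below assumes the fix for the date-format example was to normalize dates before querying; the helper `to_iso_date` and the accepted formats are hypothetical. The cases cover the originally reported input, a second value in the same format, input that already worked, and input that should be rejected.

```python
from datetime import datetime


def to_iso_date(text: str) -> str:
    """Normalize 'YYYY-MM-DD' or 'M/D/YYYY' input to 'YYYY-MM-DD'."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date: {text!r}")


cases = {
    "3/1/2024": "2024-03-01",    # the input from the original report
    "12/31/2023": "2023-12-31",  # another value in the same format
    "2024-03-01": "2024-03-01",  # input that already worked must still work
}
for raw, expected in cases.items():
    assert to_iso_date(raw) == expected, (raw, to_iso_date(raw))

# Bad input should fail loudly rather than pass through silently.
try:
    to_iso_date("yesterday")
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for unparseable input")

print("all cases pass")
```

Keeping the cases as a test means the explanation of why the fix works is written down and re-checked on every future change.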
Categorizing Failures
Not all failures are the same. Understanding the categories helps you design appropriate responses.
| Category | What It Is | Example | Response |
|---|---|---|---|
| Input | Bad data coming in | Letters in a phone number field | Validate at the boundary. Reject with a clear error. |
| Logic | Code produces wrong result | Off-by-one in a calculation | Test with known inputs and verify that outputs match the contract. |
| Integration | Two parts don't align | Module A sends format X, Module B expects Y | Validate at every boundary. Integration failures are contract violations. |
| Resource | System exhausts something | Disk full, memory exhausted, rate limit hit | Monitor. Set limits and alerts. Design for constrained operation. |
| Dependency | External thing stops working | Database down, API returns errors | Time out, retry, fall back, or degrade gracefully. |
| Timing | Wrong order or time | Two updates hit the same record simultaneously | The hardest category to reproduce and diagnose. Design explicit ordering where it matters. |
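To make the Input row concrete, here is a minimal sketch of validating at the boundary and rejecting with a clear error. The function name `parse_phone`, the set of allowed separators, and the 7 to 15 digit range are assumptions chosen for illustration, not a rule from this guide.

```python
import re


def parse_phone(raw: str) -> str:
    """Return the digits of a phone number, or raise ValueError with a clear message."""
    stripped = re.sub(r"[\s().-]", "", raw)  # drop spaces, parentheses, dots, hyphens
    if stripped.startswith("+"):
        stripped = stripped[1:]              # allow one leading international prefix
    if not re.fullmatch(r"[0-9]{7,15}", stripped):
        raise ValueError(
            f"Phone number must contain 7 to 15 digits and no letters; got {raw!r}"
        )
    return stripped


print(parse_phone("(555) 123-4567"))  # 5551234567

try:
    parse_phone("555-CALL-NOW")
except ValueError as err:
    print(err)
```

Everything past this function can assume a digits-only string, so the bad input never becomes a Logic or Integration failure further down the chain.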
Designing for Failure
For every module and contract, answer these five questions:
1. What are the failure modes?
List every way this can fail. Use the categories above as your checklist.
2. What is the blast radius?
If this fails, what else breaks? A well-bounded module limits the blast radius. A tangled one spreads damage everywhere.
3. What is the severity?
| Severity | Meaning | Example |
|---|---|---|
| Critical | Data loss, financial impact, security breach | Double-charging a customer |
| High | Core feature unavailable | Can't log in |
| Medium | Feature degraded but usable | Search is slow but returns results |
| Low | Cosmetic or minor | Profile picture doesn't load |
4. What is the response strategy?
| Strategy | When to Use |
|---|---|
| Prevent | Failure is predictable and avoidable (validate inputs, check preconditions) |
| Retry | Failure is transient (network blip, temporary overload) |
| Fallback | There's a "good enough" alternative (show cached data if live data is unavailable) |
| Degrade | Turn off the broken feature, keep everything else running |
| Alert | Needs human attention (log, notify, escalate) |
| Fail fast | Continuing would make things worse (stop if data is corrupted) |
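Several of these strategies compose. The sketch below combines Retry (with exponential backoff) and Fallback (serve cached data), and re-raises when neither helps. The function names, the choice of `ConnectionError` as the transient failure, and the price data are all hypothetical; the `sleep` parameter exists only so the example can run without waiting.

```python
import time
from typing import Callable, Dict, Tuple

Prices = Dict[str, float]


def call_with_retry(
    fn: Callable[[], Prices],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Prices:
    """Retry: call fn again after a transient ConnectionError, backing off each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failure
    raise AssertionError("unreachable")


def get_prices(
    fetch_live: Callable[[], Prices],
    cache: Dict[str, Prices],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Prices, str]:
    """Fallback: serve the last good result if the live source stays down."""
    try:
        prices = call_with_retry(fetch_live, sleep=sleep)
    except ConnectionError:
        if "prices" in cache:
            return cache["prices"], "cached"
        raise  # no good-enough alternative, so surface the failure
    cache["prices"] = prices
    return prices, "live"


calls = {"n": 0}


def flaky() -> Prices:
    """Hypothetical dependency that fails twice, then recovers."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network blip")
    return {"widget": 9.99}


def always_down() -> Prices:
    raise ConnectionError("service unavailable")


def no_wait(_seconds: float) -> None:
    """Stand-in for time.sleep so the demo finishes instantly."""


cache: Dict[str, Prices] = {}
result_live = get_prices(flaky, cache, sleep=no_wait)
result_cached = get_prices(always_down, cache, sleep=no_wait)
print(result_live)    # ({'widget': 9.99}, 'live')
print(result_cached)  # ({'widget': 9.99}, 'cached')
```

Returning the source label ("live" or "cached") alongside the data lets the caller tell users, or the logs, that they are looking at a degraded result.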
5. What information is needed to diagnose it later?
When something fails at 3am and you're investigating at 9am, what do you need?
- What was the input?
- What was the expected output?
- What was the actual output or error?
- When did it happen?
- What was the system state?
This is logging — not an afterthought, but a critical design decision.
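A minimal sketch of a log entry that answers those five questions, using Python's standard `logging` and `json` modules. The field names, the logger name, and the example values are assumptions for illustration; a real system would choose fields to match its own contracts.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s %(message)s")
logger = logging.getLogger("orders")  # hypothetical module name


def failure_record(operation, input_data, expected, actual, state):
    """Bundle the five diagnostic answers into one JSON log line."""
    return json.dumps(
        {
            "when": datetime.now(timezone.utc).isoformat(),
            "operation": operation,
            "input": input_data,
            "expected": expected,
            "actual": actual,
            "state": state,
        },
        default=str,  # fall back to str() for values JSON cannot encode
    )


line = failure_record(
    operation="list_orders",
    input_data={"user_id": 42, "since": "3/1/2024"},
    expected="15 orders",
    actual="0 orders",
    state={"app_version": "1.8.2", "db_replica": "replica-2"},
)
logger.error(line)
```

One structured line per failure is easier to search and filter at 9am than free-form messages scattered across several log statements.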
What to Look For in the Examples
The following pages each present a system that has failed. You'll see:
- A symptom described precisely — the starting point
- The bisection process — how we narrow down the cause
- Multiple hypotheses — some wrong, some right
- The root cause — and how it connects to a failure category
- How the failure could have been prevented — what design decision would have caught it earlier