Failure Modes and Debugging — How: The Method

A Systematic Debugging Framework

When something goes wrong, follow this five-step process. It works for software, hardware, processes, and systems of any kind.

Step 1: Observe the Symptom Precisely

Don't say "it's broken." Say exactly what is happening:

❌ "The page is broken"
✅ "The page loads but shows 0 orders, when the user should have 15 orders"
❌ "The system is slow"
✅ "The search results take 12 seconds to appear; last week it was under 1 second"
❌ "It doesn't work"
✅ "Clicking 'Submit' does nothing — no error message, no loading indicator, no change"

Precise symptoms lead to precise diagnoses. Vague symptoms lead to guessing.

Step 2: Establish What Changed

Most bugs don't appear spontaneously. Something changed:

New code was deployed
Data volume increased
A third-party service updated their API
A configuration was modified
User behavior shifted (a marketing campaign drove unexpected traffic)

Ask: "What is different between when it worked and when it stopped working?"

If nothing changed internally, the cause is likely external: data, traffic, or a dependency.
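One way to answer the "what is different?" question mechanically is to compare a snapshot from when the system worked against a snapshot from now. The sketch below assumes you can capture settings as plain dictionaries; the function name `diff_snapshots` and all keys and values are hypothetical, chosen only for illustration.

```python
def diff_snapshots(before: dict, after: dict) -> dict:
    """Return every key whose value differs, mapped to an (old, new) pair."""
    missing = object()  # sentinel, so a key set to None is not mistaken for an absent key
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, missing)
        new = after.get(key, missing)
        if old != new:
            changes[key] = (
                "<absent>" if old is missing else old,
                "<absent>" if new is missing else new,
            )
    return changes


# Hypothetical snapshots: one from when it worked, one from after it stopped working.
last_known_good = {"app_version": "2.3.0", "db_pool_size": 20, "search_timeout_s": 5}
current = {
    "app_version": "2.4.0",
    "db_pool_size": 20,
    "search_timeout_s": 5,
    "cache_enabled": False,
}

for key, (old, new) in diff_snapshots(last_known_good, current).items():
    print(f"{key}: {old!r} -> {new!r}")
```

An empty result does not prove that nothing changed. It only shows that nothing changed in what you captured, which points back toward data, traffic, or a dependency.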

Step 3: Bisect the Problem Space

This is often the most powerful debugging technique. Instead of searching everywhere, cut the problem in half and determine which half contains the bug.

Your system is a chain of data flow (from the Data Lifecycle section). Data enters at one end and the wrong result appears at the other. Check the midpoint:

Input → [A] → [B] → [C] → [D] → Wrong Output
                 ↑
          Check here first.
          Is the data correct at this point?

If the data is correct at [B], the problem is in [C] or [D].
If the data is wrong at [B], the problem is in [A] or [B].

You've eliminated half the system. Repeat until you've narrowed it to a single step.

This is binary search applied to debugging, and it works whether you're debugging code, a business process, a network issue, or a recipe.
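The bisection can be sketched in code for a pipeline whose stages are plain functions. This is a minimal illustration, not a general tool: it assumes the input is correct, the final output is wrong, and that data stays wrong once it has gone wrong. The stage functions, the `expected` values, and the name `find_first_bad_stage` are all hypothetical.

```python
from typing import Any, Callable, Sequence

Stage = Callable[[Any], Any]


def find_first_bad_stage(
    stages: Sequence[Stage],
    data: Any,
    is_correct_after: Callable[[int, Any], bool],
) -> int:
    """Return the index of the first stage whose output is wrong."""

    def output_after(index: int) -> Any:
        # Re-run the pipeline from the start up to and including `index`.
        value = data
        for stage in stages[: index + 1]:
            value = stage(value)
        return value

    lo, hi = 0, len(stages) - 1  # the first bad stage lies somewhere in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_correct_after(mid, output_after(mid)):
            lo = mid + 1  # data still correct at the midpoint: the bug is downstream
        else:
            hi = mid  # data already wrong: the bug is at the midpoint or upstream
    return lo


# Hypothetical four-stage pipeline [A] -> [B] -> [C] -> [D]; stage C has a bug.
def stage_a(text: str) -> str:
    return text.strip()

def stage_b(text: str) -> float:
    return float(text)

def stage_c(amount: float) -> int:
    return round(amount * 10)  # bug: converting to cents should multiply by 100

def stage_d(cents: int) -> str:
    return f"{cents} cents"


stages = [stage_a, stage_b, stage_c, stage_d]
expected = ["12.50", 12.5, 1250, "1250 cents"]  # known-good value after each stage

def check(index: int, value: Any) -> bool:
    return value == expected[index]

print(find_first_bad_stage(stages, "  12.50 ", check))  # prints 2, i.e. stage C
```

With four stages the search inspects two midpoints instead of four outputs. The saving grows with the length of the chain, since each check halves the remaining candidates.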

Step 4: Form and Test Hypotheses

Once you've narrowed the area, form a specific hypothesis:

"I believe the bug is caused by [specific thing] because [evidence]. If I'm right, then [testable prediction]."

Example:

"I believe orders show as 0 because the query is filtering by the wrong date format. If I'm right, then running the query directly will return empty results even though orders exist."

Then test it by observing — check the data, check the intermediate state. Don't change anything yet. Verify or disprove the hypothesis with evidence.

If the hypothesis is wrong, that's progress — you've eliminated a possibility.
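The date-format hypothesis above can be tested purely by observation. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` schema, the user id, and the assumption that the application sends dates as MM/DD/YYYY while the table stores YYYY-MM-DD text are all invented for illustration. Nothing in the system is modified: the script only runs queries and compares counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, created_on TEXT)"
)
conn.executemany(
    "INSERT INTO orders (user_id, created_on) VALUES (?, ?)",
    [(7, f"2024-03-{day:02d}") for day in range(1, 16)],  # 15 orders, ISO dates
)

COUNT_IN_RANGE = (
    "SELECT COUNT(*) FROM orders "
    "WHERE user_id = ? AND created_on BETWEEN ? AND ?"
)

# Observation 1: do the orders exist at all?
total = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE user_id = ?", (7,)
).fetchone()[0]

# Observation 2: the filter as the application is assumed to send it (MM/DD/YYYY).
suspect = conn.execute(COUNT_IN_RANGE, (7, "03/01/2024", "03/31/2024")).fetchone()[0]

# Observation 3: the same filter in the format the column actually stores.
control = conn.execute(COUNT_IN_RANGE, (7, "2024-03-01", "2024-03-31")).fetchone()[0]

print(f"orders exist: {total}, suspect filter: {suspect}, ISO filter: {control}")
# prints: orders exist: 15, suspect filter: 0, ISO filter: 15
```

The prediction holds in this toy setup: the rows exist, the suspect filter returns nothing, and the control filter returns all of them. Had the suspect filter returned 15, the hypothesis would be disproved and the search would move on.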

Step 5: Verify the Fix

You found the cause. You made a change. How do you know the fix is correct?

Does the symptom disappear?
Does it work for all cases, or just the one you tested?
Did the fix introduce any new problems?
Can you explain why the fix works?

If you can't explain why the fix works, it's not a fix — it's a lucky accident that will break again.
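These questions translate naturally into a small table of checks run against the fix. The sketch below continues the date-format example from Step 4 and assumes a hypothetical fix, `to_iso_date`, that normalizes dates before they reach the query. The cases are illustrative, not an exhaustive test plan.

```python
from datetime import datetime


def to_iso_date(text: str) -> str:
    """Hypothetical fix: accept YYYY-MM-DD or MM/DD/YYYY, always return YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")


cases = [
    ("03/01/2024", "2024-03-01"),  # the case that originally failed
    ("2024-03-01", "2024-03-01"),  # input that already worked must keep working
    ("12/31/2023", "2023-12-31"),  # year boundary
    ("02/29/2024", "2024-02-29"),  # leap day
]
for given, expected in cases:
    actual = to_iso_date(given)
    assert actual == expected, f"{given!r}: expected {expected!r}, got {actual!r}"

# Invalid input should be rejected loudly rather than passed through.
try:
    to_iso_date("yesterday")
except ValueError:
    pass
else:
    raise AssertionError("invalid input was accepted")

print("all verification cases passed")
```

The first case checks that the symptom disappears, the remaining cases check that the fix generalizes, and the second case in particular guards against breaking what already worked.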

Categorizing Failures

Not all failures are the same. Understanding the categories helps you design appropriate responses.

Category	What It Is	Example	Response
Input	Bad data coming in	Letters in a phone number field	Validate at the boundary. Reject with a clear error.
Logic	Code produces wrong result	Off-by-one in a calculation	Test with known inputs and verify that outputs match the contract.
Integration	Two parts don't align	Module A sends format X, Module B expects Y	Validate at every boundary. Integration failures are contract violations.
Resource	System exhausts something	Disk full, memory exhausted, rate limit hit	Monitor. Set limits and alerts. Design for constrained operation.
Dependency	External thing stops working	Database down, API returns errors	Timeout, retry, fallback, degrade gracefully
Timing	Wrong order or time	Two updates hit the same record simultaneously	The hardest category. Design explicit ordering where it matters.
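As an illustration of the Input row, the sketch below validates a phone number at the boundary and rejects bad data with a clear error. The accepted format (7 to 15 digits with an optional leading plus sign) and the names `parse_phone` and `ValidationError` are assumptions made for this example, not a rule taken from any particular system.

```python
import re


class ValidationError(ValueError):
    """Raised at the boundary when incoming data is unusable."""


PHONE_PATTERN = re.compile(r"\+?[0-9]{7,15}")  # assumed format, for illustration only


def parse_phone(raw: str) -> str:
    """Strip common separators, then accept only digits with an optional leading +."""
    cleaned = re.sub(r"[\s\-().]", "", raw)
    if not PHONE_PATTERN.fullmatch(cleaned):
        raise ValidationError(
            f"phone number must contain 7 to 15 digits, got {raw!r}"
        )
    return cleaned


print(parse_phone("(555) 123-4567"))  # prints 5551234567

try:
    parse_phone("555-CALL-NOW")
except ValidationError as error:
    print(f"rejected: {error}")
```

Because the check sits at the point where data enters, everything downstream can assume a cleaned value and does not need to repeat the validation.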

Severity	Meaning	Example
Critical	Data loss, financial impact, security breach	Double-charging a customer
High	Core feature unavailable	Can't log in
Medium	Feature degraded but usable	Search is slow but returns results
Low	Cosmetic or minor	Profile picture doesn't load

4. What is the response strategy?

Strategy	When to Use
Prevent	Failure is predictable and avoidable (validate inputs, check preconditions)
Retry	Failure is transient (network blip, temporary overload)
Fallback	There's a "good enough" alternative (show cached data if live data is unavailable)
Degrade	Turn off the broken feature, keep everything else running
Alert	Needs human attention (log, notify, escalate)
Fail fast	Continuing would make things worse (stop if data is corrupted)
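Two of these strategies, Retry and Fallback, combine naturally. The sketch below retries an operation a few times with exponential backoff and then falls back to a "good enough" alternative. `TransientError`, the function names, and the delay values are hypothetical; a real system would decide which of its own exceptions count as transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry_and_fallback(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `attempts` times, then use `fallback`."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt < attempts - 1:
                sleep(base_delay_s * 2**attempt)  # back off: 0.1s, 0.2s, ...
    return fallback()


calls = {"count": 0}

def flaky_live_lookup() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("network blip")
    return "live data"

def always_down() -> str:
    raise TransientError("service unavailable")

def cached_lookup() -> str:
    return "cached data"

def no_wait(seconds: float) -> None:
    pass  # skip real sleeping so the example runs instantly

print(call_with_retry_and_fallback(flaky_live_lookup, cached_lookup, sleep=no_wait))
# prints: live data
print(call_with_retry_and_fallback(always_down, cached_lookup, sleep=no_wait))
# prints: cached data
```

Only `TransientError` is caught. Any other exception propagates immediately, which matches the Fail fast row: retrying a failure that is not transient only delays the alert.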

5. What information is needed to diagnose it later?

When something fails at 3am and you're investigating at 9am, what do you need?

What was the input?
What was the expected output?
What was the actual output or error?
When did it happen?
What was the system state?

This is logging — not an afterthought, but a critical design decision.
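The five questions above map directly onto the fields of a structured log record. The sketch below uses Python's standard `logging` and `json` modules; the field names, the `failure_record` helper, and the example values are assumptions chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any

logger = logging.getLogger("orders")


def failure_record(
    operation: str, given: Any, expected: Any, actual: Any, state: dict
) -> str:
    """Build one JSON line that answers the five diagnostic questions."""
    return json.dumps(
        {
            "when": datetime.now(timezone.utc).isoformat(),  # when did it happen?
            "operation": operation,
            "input": given,  # what was the input?
            "expected": expected,  # what was the expected output?
            "actual": actual,  # what was the actual output or error?
            "state": state,  # what was the system state?
        },
        default=str,  # fall back to str() for values JSON cannot encode
    )


logging.basicConfig(level=logging.INFO)
logger.error(
    failure_record(
        operation="list_orders",
        given={"user_id": 7, "from": "03/01/2024"},
        expected="15 orders",
        actual="0 orders",
        state={"app_version": "2.4.0", "db_pool_in_use": 3},
    )
)
```

Writing one self-contained line per failure means the person investigating at 9am can search for the operation name and read the full context without reconstructing it from scattered messages.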

What to Look For in the Examples

The following pages each present a system that has failed. You'll see:

A symptom described precisely — the starting point
The bisection process — how we narrow down the cause
Multiple hypotheses — some wrong, some right
The root cause — and how it connects to a failure category
How the failure could have been prevented — what design decision would have caught it earlier

Systems Thinking for Engineers