Failure Modes and Debugging — Test Your Understanding

Answer each question by showing your reasoning process. The goal is structured, systematic thinking — not lucky guesses.

Section A: Diagnose the Problem

Question 1

Symptom: An online store's product pages load correctly, but every product shows "In Stock" even though several products are sold out.

Using the five-step debugging framework:

1. State the precise symptom
2. Hypothesize what might have changed
3. Describe how you would bisect the problem (where would you check first?)
4. Form two different hypotheses for the cause
5. For each hypothesis, describe what evidence would confirm or disprove it

Question 2

Symptom: Users report that emails from the system arrive late — sometimes hours after the action that triggered them. The system was working fine until last week.

You know the email flow:

1. User action triggers an event
2. Event is placed in a queue
3. A background worker reads the queue and sends emails
4. Email is sent via an external email service

Using bisection, walk through how you would isolate whether the delay is in step 1, 2, 3, or 4. What specific thing would you check at each stage?
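
Bisection needs something measurable at each hand-off. The sketch below is one possible way to make the four stages comparable, assuming the system can record a timestamp at each point. The EmailTrace fields and both function names are hypothetical and not part of any described system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class EmailTrace:
    """Hypothetical per-email timestamps, one per hand-off in the flow."""
    action_at: datetime      # step 1: user action triggers the event
    enqueued_at: datetime    # step 2: event is placed in the queue
    dequeued_at: datetime    # step 3: worker picks the event up
    handed_off_at: datetime  # step 3 to 4: worker calls the email service
    delivered_at: datetime   # step 4: email service reports delivery


def stage_delays(trace: EmailTrace) -> Dict[str, timedelta]:
    """Time spent between consecutive hand-offs."""
    return {
        "action -> enqueue": trace.enqueued_at - trace.action_at,
        "waiting in queue": trace.dequeued_at - trace.enqueued_at,
        "worker processing": trace.handed_off_at - trace.dequeued_at,
        "external service": trace.delivered_at - trace.handed_off_at,
    }


def slowest_stage(trace: EmailTrace) -> str:
    """Name of the interval that accounts for the largest share of the delay."""
    delays = stage_delays(trace)
    return max(delays, key=lambda name: delays[name])
```

Running slowest_stage over a sample of late emails shows which interval to investigate first. It does not explain why that interval grew last week, which is still part of the answer.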

Question 3

Symptom: A banking app shows a customer's balance as negative $500, but the customer insists they have not made any large purchases. Looking at the transaction list, all transactions appear normal and small.

This is a data integrity issue. Trace the lifecycle backward:

Where is the balance displayed?
Where is it calculated?
What data feeds the calculation?
What could cause the calculation to produce a wrong result?

List at least four distinct hypotheses, each targeting a different part of the data lifecycle.
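
One way to test hypotheses about the calculation is to recompute the balance independently and compare it with the stored value. This is a minimal sketch, assuming each transaction is a dict with a signed "amount" and that an opening balance is known. Both assumptions, and the function names, are illustrative only.

```python
from decimal import Decimal
from typing import Dict, List


def recompute_balance(opening_balance: Decimal,
                      transactions: List[Dict[str, Decimal]]) -> Decimal:
    """Add every signed transaction amount to the opening balance."""
    total = sum((t["amount"] for t in transactions), Decimal("0"))
    return opening_balance + total


def reconcile(stored_balance: Decimal,
              opening_balance: Decimal,
              transactions: List[Dict[str, Decimal]]) -> Decimal:
    """Stored balance minus recomputed balance; zero means they agree."""
    return stored_balance - recompute_balance(opening_balance, transactions)
```

A nonzero result says the stored balance and the visible transactions disagree, but not which of the two is wrong. A zero result moves suspicion to the inputs themselves.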

Section B: Design for Failure

Question 4

You are designing a system that processes online job applications. The flow:

Applicant fills out a form with personal info and uploads a resume
System validates the form data
Resume is stored
Application record is created in the database
Hiring manager is notified via email
Applicant receives a confirmation email

For each step, list:

What can fail
The severity (critical/high/medium/low)
The appropriate response strategy (prevent/retry/fallback/degrade/alert/fail fast)
What should be logged for debugging
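
If it helps to keep the answer organized, the four items above can be captured as one record per failure. The sketch below is only a template: the severity and strategy values come from the lists above, while the class names, field names, and the missing_steps helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class Strategy(Enum):
    PREVENT = "prevent"
    RETRY = "retry"
    FALLBACK = "fallback"
    DEGRADE = "degrade"
    ALERT = "alert"
    FAIL_FAST = "fail fast"


# The six steps of the job application flow, in order.
STEPS = [
    "fill out form and upload resume",
    "validate form data",
    "store resume",
    "create application record",
    "notify hiring manager",
    "send applicant confirmation",
]


@dataclass
class FailureEntry:
    """One row of the analysis: a single thing that can fail at a single step."""
    step: str
    what_can_fail: str
    severity: Severity
    strategy: Strategy
    log_fields: List[str] = field(default_factory=list)


def missing_steps(entries: List[FailureEntry]) -> List[str]:
    """Steps that have no failure entry yet, in flow order."""
    covered = {entry.step for entry in entries}
    return [step for step in STEPS if step not in covered]
```

A step can have several entries. missing_steps only confirms that no step was skipped entirely.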

Question 5

A ride-sharing app has these dependencies:

GPS service (for driver location)
Payment processor (for billing)
Map routing service (for directions)
Push notification service (for alerts)

For each dependency, answer:

What happens if it goes down for 30 seconds?
What happens if it goes down for 30 minutes?
What happens if it starts returning wrong data instead of errors?
What should the app do in each case?

Pay special attention to the third question — silent wrong data is the most dangerous failure mode.

Question 6

Perform a pre-mortem for the following system:

A school lunch ordering system where parents pre-order meals for their children through a website. The kitchen prepares meals based on the orders. Children pick up their meal at lunch using their student ID.

Imagine it's been running for three months and something has gone terribly wrong. Write five realistic failure scenarios. For each one:

What went wrong
Why it wasn't caught earlier
What design decision would have prevented it

Section C: Failure Reasoning

Question 7

You have a system with three modules in sequence:

Module A → Module B → Module C → Output

The output is wrong. You check Module A's output — it's correct. You check Module C's output — it's wrong.

Can you conclude the bug is in Module B or Module C? Why or why not? What else do you need to check? Describe the precise reasoning.
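
For experimenting with this kind of pipeline, a small harness that records every intermediate output can be useful. This is a sketch under the assumption that the correct output of each module is known independently. The function names and the (name, function) stage format are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def run_with_checkpoints(stages: List[Stage], value: Any) -> List[Tuple[str, Any]]:
    """Feed value through each stage in order, recording every intermediate output."""
    checkpoints = []
    for name, stage in stages:
        value = stage(value)
        checkpoints.append((name, value))
    return checkpoints


def first_bad_checkpoint(checkpoints: List[Tuple[str, Any]],
                         expected: Dict[str, Any]) -> Optional[str]:
    """Name of the earliest stage whose output differs from the expected one, if any."""
    for name, actual in checkpoints:
        if actual != expected[name]:
            return name
    return None
```

The harness only reports the first checkpoint that disagrees with expectations. Deciding what that does and does not prove about the remaining modules is the reasoning the question asks for.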

Question 8

An engineer says: "I added retry logic everywhere, so our system handles failures well."

Explain at least three scenarios where retrying makes the problem worse instead of better. For each scenario, describe what should be done instead.
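
To make the engineer's claim concrete, "retry logic everywhere" might look something like the wrapper below. This is a deliberately naive sketch, not a recommended implementation. The function name and parameters are hypothetical.

```python
import time
from typing import Any, Callable


def retry_everywhere(operation: Callable[[], Any],
                     attempts: int = 3,
                     delay_seconds: float = 0.0) -> Any:
    """Call operation until it succeeds or attempts run out, then re-raise."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # treats every failure as retryable
            last_error = error
            time.sleep(delay_seconds)  # fixed pause, no backoff
    raise last_error
```

When answering, consider what this wrapper does not know about the operation it wraps, and what happens when many callers run it at the same time.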

Question 9

Two failure strategies are proposed for a checkout system when the payment service is down:

Strategy A: Show the user an error: "Payment service unavailable. Please try again in a few minutes."

Strategy B: Accept the order, save it with status "payment pending," and charge the user when the payment service comes back.

Analyze both strategies. What are the risks of each? Under what circumstances is A better? Under what circumstances is B better? What failure modes does B introduce that A doesn't have?
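
The two strategies can be sketched side by side to make the comparison concrete. Everything here is illustrative: Order, PaymentUnavailable, the charge callable, and the in-memory pending list are assumptions, and the code deliberately handles only the outage case.

```python
from dataclasses import dataclass
from typing import Callable, List


class PaymentUnavailable(Exception):
    """Raised by the (hypothetical) charge function when the service is down."""


@dataclass
class Order:
    order_id: str
    amount_cents: int
    status: str = "new"


def checkout_strategy_a(order: Order, charge: Callable[[Order], None]) -> str:
    """Strategy A: surface the outage to the user and do not accept the order."""
    try:
        charge(order)
    except PaymentUnavailable:
        order.status = "rejected"
        return "Payment service unavailable. Please try again in a few minutes."
    order.status = "paid"
    return "Order confirmed."


def checkout_strategy_b(order: Order, charge: Callable[[Order], None],
                        pending: List[Order]) -> str:
    """Strategy B: accept the order now and defer the charge."""
    try:
        charge(order)
    except PaymentUnavailable:
        order.status = "payment pending"
        pending.append(order)
        return "Order accepted."
    order.status = "paid"
    return "Order confirmed."


def settle_pending(pending: List[Order],
                   charge: Callable[[Order], None]) -> List[Order]:
    """Try to charge every pending order; return the ones still unpaid."""
    still_pending = []
    for order in pending:
        try:
            charge(order)
            order.status = "paid"
        except PaymentUnavailable:
            still_pending.append(order)
    return still_pending
```

Strategy B needs an extra function and an extra status that Strategy A does not. What can go wrong inside settle_pending, beyond the service still being down, is part of what the question asks you to find.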

Section D: The Full Picture

Question 10

This is an integration exercise. You have studied all five pillars. Now apply them all:

Scenario: A hospital system manages patient appointments. Patients book appointments online, doctors see their schedule on a dashboard, and the system sends text message reminders 24 hours before each appointment.

A doctor reports: "My 2pm patient said they never received a reminder, and two of my morning patients received reminders for the wrong date."

Using everything you've learned:

Data Lifecycle: Trace the data from appointment creation to reminder delivery
Boundaries: Identify which module(s) are likely involved in the failure
Contracts: Identify what contract might be violated
Decomposition: Break the problem into investigatable pieces
Failure Mode: Categorize the failure type, form hypotheses, and describe how you would bisect to find the root cause

Question 11

Design a comprehensive failure handling plan for a simple feature: "User changes their password."

The flow: user enters current password and new password → system verifies current password → system validates new password meets requirements → system updates the stored password → user receives email confirming the change.

For this feature:

List every failure mode at every step
Categorize each by type (input, logic, integration, resource, dependency, timing)
Define the response for each
Identify the single most dangerous failure mode and explain why
Describe what logging would be needed to diagnose any failure in this flow without being able to reproduce it

Question 12

The final question. Reflect on this statement:

"A system that has never failed is more dangerous than a system that fails regularly."

Using concepts from all five pillars, explain why this might be true. Consider: untested failure paths, false confidence, unknown data lifecycle gaps, unchecked boundary assumptions, and unvalidated contracts. Give a concrete example to support your argument.

Grading Rubric

Criteria	What It Means
Systematic process	Followed a structured approach — not random guessing. Steps are traceable and logical.
Precise symptoms	Problems are stated specifically, not vaguely. "Shows $0" not "is broken."
Multiple hypotheses	More than one possible cause is considered before committing to a diagnosis.
Evidence-based reasoning	Each hypothesis has a way to test it. Decisions are based on evidence, not assumptions.
Failure design completeness	All failure modes are considered, not just the obvious ones. Silent failures and wrong-data failures are addressed, not just crashes.
Cross-pillar integration	Answers draw on data lifecycle, boundaries, contracts, and decomposition, not just debugging techniques in isolation.

Systems Thinking for Engineers