Failure Modes and Debugging — Test Your Understanding
Answer each question by showing your reasoning process. The goal is structured, systematic thinking — not lucky guesses.
Section A: Diagnose the Problem
Question 1
Symptom: An online store's product pages load correctly, but every product shows "In Stock" even though several products are sold out.
Using the five-step debugging framework:
- State the precise symptom
- Hypothesize what might have changed
- Describe how you would bisect the problem (where would you check first?)
- Form two different hypotheses for the cause
- For each hypothesis, describe what evidence would confirm or disprove it
Question 2
Symptom: Users report that emails from the system arrive late — sometimes hours after the action that triggered them. The system was working fine until last week.
You know the email flow:
1. User action triggers an event
2. Event is placed in a queue
3. A background worker reads the queue and sends emails
4. Email is sent via an external email service
Using bisection, walk through how you would isolate whether the delay is in step 1, 2, 3, or 4. What specific thing would you check at each stage?
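One way to make the bisection concrete is to compare timestamps recorded at each stage boundary. The sketch below is only an illustration: it assumes the system records the four hypothetical timestamps named in `STAGES`, which the scenario does not state.

```python
from datetime import datetime, timedelta

# Hypothetical timestamp fields, one per stage boundary in the email flow.
# These names are assumptions for illustration, not part of the scenario.
STAGES = ["event_created_at", "enqueued_at", "dequeued_at", "provider_accepted_at"]


def stage_delays(record):
    """Return the time elapsed between each pair of consecutive stages."""
    delays = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delays[f"{earlier} -> {later}"] = record[later] - record[earlier]
    return delays


def slowest_stage(record):
    """Name the gap that accounts for the largest share of the total delay."""
    delays = stage_delays(record)
    return max(delays, key=delays.get)


# Made-up example: the event waits two hours in the queue before a worker reads it.
t0 = datetime(2024, 1, 1, 9, 0, 0)
example = {
    "event_created_at": t0,
    "enqueued_at": t0 + timedelta(seconds=1),
    "dequeued_at": t0 + timedelta(hours=2),
    "provider_accepted_at": t0 + timedelta(hours=2, seconds=3),
}
print(slowest_stage(example))  # prints "enqueued_at -> dequeued_at"
```

A large gap only narrows the search to one boundary. Explaining why that gap grew since last week is still a separate hypothesis to test.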
Question 3
Symptom: A banking app shows a customer's balance as negative $500, but the customer insists they have not made any large purchases. Looking at the transaction list, all transactions appear normal and small.
This is a data integrity issue. Trace the lifecycle backward:
- Where is the balance displayed?
- Where is it calculated?
- What data feeds the calculation?
- What could cause the calculation to produce a wrong result?
List at least four distinct hypotheses, each targeting a different part of the data lifecycle.
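A common first check for this kind of symptom is to recompute the balance independently from the raw transactions and compare it with the displayed value. The sketch below assumes a hypothetical data shape (an opening balance plus a list of signed amounts); the scenario does not say how the bank stores either.

```python
from decimal import Decimal


def recompute_balance(opening_balance, transactions):
    """Sum signed transaction amounts on top of the opening balance."""
    return opening_balance + sum((t["amount"] for t in transactions), Decimal("0"))


def balance_discrepancy(displayed, opening_balance, transactions):
    """Difference between the displayed balance and the recomputed one."""
    return displayed - recompute_balance(opening_balance, transactions)


# Made-up figures for illustration only.
transactions = [
    {"amount": Decimal("-20.00")},
    {"amount": Decimal("-30.00")},
    {"amount": Decimal("50.00")},
]
print(balance_discrepancy(Decimal("-500.00"), Decimal("100.00"), transactions))
```

A non-zero discrepancy suggests the displayed value and the visible transactions come from different data, for example a stale cache or transactions that were applied but are not listed. A zero discrepancy points toward the inputs themselves, such as the opening balance.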
Section B: Design for Failure
Question 4
You are designing a system that processes online job applications. The flow:
- Applicant fills out a form with personal info and uploads a resume
- System validates the form data
- Resume is stored
- Application record is created in the database
- Hiring manager is notified via email
- Applicant receives a confirmation email
For each step, list:
- What can fail
- The severity (critical/high/medium/low)
- The appropriate response strategy (prevent/retry/fallback/degrade/alert/fail fast)
- What should be logged for debugging
Question 5
A ride-sharing app has these dependencies:
- GPS service (for driver location)
- Payment processor (for billing)
- Map routing service (for directions)
- Push notification service (for alerts)
For each dependency, answer:
- What happens if it goes down for 30 seconds?
- What happens if it goes down for 30 minutes?
- What happens if it starts returning wrong data instead of errors?
- What should the app do in each case?
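Pay special attention to the third question, where the service returns wrong data instead of errors. Silent wrong data is often the most dangerous failure mode, because nothing signals that anything has failed.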
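As one illustration of guarding against wrong data, a consumer can run a plausibility check before trusting a value. The sketch below does this for GPS readings. The speed threshold and the `(lat, lon, unix_seconds)` tuple shape are assumptions chosen for the example, not requirements from the scenario.

```python
import math

# Assumed upper bound on how fast a car can plausibly move.
MAX_PLAUSIBLE_SPEED_KMH = 200.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def is_plausible(prev, curr):
    """Check a new reading against the previous one; each is (lat, lon, unix_seconds)."""
    lat, lon, timestamp = curr
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False  # coordinates outside the valid range
    elapsed_hours = (timestamp - prev[2]) / 3600
    if elapsed_hours <= 0:
        return False  # time did not move forward
    speed_kmh = haversine_km(prev[0], prev[1], lat, lon) / elapsed_hours
    return speed_kmh <= MAX_PLAUSIBLE_SPEED_KMH
```

A check like this catches only implausible values. A reading that is wrong but believable still passes, which is part of what makes this failure mode hard.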
Question 6
Perform a pre-mortem for the following system:
A school lunch ordering system where parents pre-order meals for their children through a website. The kitchen prepares meals based on the orders. Children pick up their meal at lunch using their student ID.
Imagine it's been running for three months and something has gone terribly wrong. Write five realistic failure scenarios. For each one:
- What went wrong
- Why it wasn't caught earlier
- What design decision would have prevented it
Section C: Failure Reasoning
Question 7
You have a system with three modules in sequence:
Module A → Module B → Module C → Output
The output is wrong. You check Module A's output — it's correct. You check Module C's output — it's wrong.
Can you conclude that the bug is in Module B or Module C, and can you tell which of the two it is? Why or why not? What else do you need to check? Describe your reasoning precisely.
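The general technique of checking intermediate outputs can be sketched as a small helper that runs each stage in order and stops at the first output that fails its check. The stage functions and checks below are invented for illustration and do not represent the modules in the question.

```python
def first_bad_stage(stages, value):
    """Run stages in order and return the name of the first one whose output
    fails its check, or None if every output passes.

    stages: list of (name, run, looks_right) tuples.
    """
    for name, run, looks_right in stages:
        value = run(value)
        if not looks_right(value):
            # Stop here: everything downstream is fed bad input, so its
            # output tells us nothing about whether it has its own bug.
            return name
    return None


# Made-up pipeline: the middle stage should multiply by 10 but adds a stray 1.
stages = [
    ("A", lambda x: x + 1, lambda y: y == 3),
    ("B", lambda x: x * 10 + 1, lambda y: y == 30),
    ("C", lambda x: x - 5, lambda y: y == 25),
]
print(first_bad_stage(stages, 2))  # prints "B"
```

The helper reports only the first failing stage. A later stage may also be faulty, and that can be confirmed only by feeding it known-good input.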
Question 8
An engineer says: "I added retry logic everywhere, so our system handles failures well."
Explain at least three scenarios where retrying makes the problem worse instead of better. For each scenario, describe what should be done instead.
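For contrast with "retry everywhere", here is a minimal sketch of a more restrained retry helper. It caps the number of attempts, backs off exponentially with jitter, and refuses to retry errors marked as permanent. The two exception classes are hypothetical, and the helper assumes the operation is safe to repeat, which is itself something to verify.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type: the call might succeed if tried again later."""


class PermanentError(Exception):
    """Hypothetical error type: retrying cannot help (for example, invalid input)."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying only transient failures, with capped attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except PermanentError:
            raise  # fail fast: another attempt would fail the same way
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries so many clients do not hit the
            # struggling service at the same instant.
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists only so the waiting can be replaced in tests. Note that this helper does nothing to prevent duplicate side effects if the operation is not idempotent.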
Question 9
Two failure strategies are proposed for a checkout system when the payment service is down:
Strategy A: Show the user an error: "Payment service unavailable. Please try again in a few minutes."
Strategy B: Accept the order, save it with status "payment pending," and charge the user when the payment service comes back.
Analyze both strategies. What are the risks of each? Under what circumstances is A better? Under what circumstances is B better? What failure modes does B introduce that A doesn't have?
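To reason about Strategy B, it can help to see how much state it adds. The sketch below is one possible shape, not a recommended design: the status names, the one-hour expiry, and the `charge` callback that treats the order ID as an idempotency key are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed limit on how long an order may wait for payment.
PENDING_TTL = timedelta(hours=1)


@dataclass
class Order:
    order_id: str
    amount_cents: int
    created_at: datetime
    status: str = "payment_pending"


def settle(order, charge, now):
    """Try to settle a pending order once the payment service is back.

    charge(order_id, amount_cents) is a hypothetical callback that returns
    True on success and uses order_id as an idempotency key.
    """
    if order.status != "payment_pending":
        return order.status  # already settled: never charge a second time
    if now - order.created_at > PENDING_TTL:
        order.status = "expired"  # too old to charge without asking the user
        return order.status
    order.status = "paid" if charge(order.order_id, order.amount_cents) else "payment_failed"
    return order.status
```

Each extra status here ("expired", "payment_failed") is a case that Strategy A never has to handle.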
Section D: The Full Picture
Question 10
This is an integration exercise. You have studied all five pillars. Now apply them all:
Scenario: A hospital system manages patient appointments. Patients book appointments online, doctors see their schedule on a dashboard, and the system sends text message reminders 24 hours before each appointment.
A doctor reports: "My 2pm patient said they never received a reminder, and two of my morning patients received reminders for the wrong date."
Using everything you've learned:
- Data Lifecycle: Trace the data from appointment creation to reminder delivery
- Boundaries: Identify which module(s) are likely involved in the failure
- Contracts: Identify what contract might be violated
- Decomposition: Break the problem into investigatable pieces
- Failure Modes: Categorize the failure type, form hypotheses, and describe how you would bisect to find the root cause
Question 11
Design a comprehensive failure handling plan for a simple feature: "User changes their password."
The flow: user enters current password and new password → system verifies current password → system validates new password meets requirements → system updates the stored password → user receives email confirming the change.
For this feature:
- List every failure mode at every step
- Categorize each by type (input, logic, integration, resource, dependency, timing)
- Define the response for each
- Identify the single most dangerous failure mode and explain why
- Describe what logging would be needed to diagnose any failure in this flow without being able to reproduce it
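As a reference point for the logging question, one widely used approach is a structured log line per step, tied together by a request ID. The sketch below uses Python's standard `logging` and `json` modules. The field names and the list of forbidden keys are assumptions for illustration.

```python
import json
import logging

logger = logging.getLogger("password_change")

# Assumed list of field names that must never reach the logs.
FORBIDDEN_FIELDS = {"password", "current_password", "new_password", "password_hash"}


def log_step(request_id, step, outcome, **fields):
    """Emit one JSON log line for a step of the flow and return the entry.

    request_id ties together every step of a single password change so the
    sequence can be reconstructed later without reproducing the failure.
    """
    leaked = FORBIDDEN_FIELDS & fields.keys()
    if leaked:
        raise ValueError(f"refusing to log sensitive fields: {sorted(leaked)}")
    entry = {"request_id": request_id, "step": step, "outcome": outcome, **fields}
    logger.info(json.dumps(entry, sort_keys=True))
    return entry
```

A call such as `log_step("req-1", "verify_current_password", "ok", user_id=42)` records what happened at that step without recording the secret itself.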
Question 12
The final question. Reflect on this statement:
"A system that has never failed is more dangerous than a system that fails regularly."
Using concepts from all five pillars, explain why this might be true. Consider: untested failure paths, false confidence, unknown data lifecycle gaps, unchecked boundary assumptions, and unvalidated contracts. Give a concrete example to support your argument.
Grading Rubric
| Criteria | What It Means |
|---|---|
| Systematic process | Followed a structured approach — not random guessing. Steps are traceable and logical. |
| Precise symptoms | Problems are stated specifically, not vaguely. "Shows $0" not "is broken." |
| Multiple hypotheses | More than one possible cause is considered before committing to a diagnosis. |
| Evidence-based reasoning | Each hypothesis has a way to test it. Decisions are based on evidence, not assumptions. |
| Failure design completeness | All failure modes are considered, not just the obvious ones. Silent failures and wrong-data failures are addressed, not just crashes. |
| Cross-pillar integration | Answers draw on data lifecycle, boundaries, contracts, and decomposition, not just debugging techniques in isolation. |