Failure Modes — Pre-Mortems and Failure Planning

What Is a Pre-Mortem?

A post-mortem happens after something breaks. You investigate what went wrong.

A pre-mortem happens before anything breaks. You imagine it's six months from now and the system has failed catastrophically, then work backward: "What went wrong?"

This flips the psychology. In a planning meeting, people are optimistic. In a pre-mortem, everyone has permission to be pessimistic — and pessimism is productive.

The Pre-Mortem Process

Step 1: Define the System

State clearly what you're building and its key characteristics:

What does it do?
Who uses it?
What data does it handle?
What are the critical operations?

Step 2: Imagine the Disaster

Each person independently writes down their answer to: "It's six months from now. The system has failed in its worst possible way. What happened?"

Not little bugs. Catastrophes:

Data was lost permanently
Money was charged incorrectly
Security was breached
The system was down for days
Customers left in large numbers

Step 3: Group the Failures

Collect all imagined disasters and group them:

Which ones are about data? (loss, corruption, leakage)
Which ones are about availability? (downtime, slowness)
Which ones are about correctness? (wrong results, wrong actions)
Which ones are about security? (unauthorized access, data exposure)
Which ones are about scaling? (couldn't handle growth)

Step 4: For Each Failure, Ask Three Questions

How likely is this? (Almost certain / Probable / Possible / Unlikely)
How severe is the impact? (Critical / High / Medium / Low)
What would prevent or mitigate it? (Design decision, monitoring, process)

Step 5: Act on the High-Risk Items

Anything that is both likely and severe must be addressed in the design — not deferred to "we'll fix it later."
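
Steps 4 and 5 can be made mechanical by mapping both scales to numbers and sorting. The sketch below is one possible way to do that; the numeric weights and the `Failure` class are assumptions for illustration, and only the scale labels come from Step 4.

```python
from dataclasses import dataclass

# Numeric weights are illustrative assumptions; only the labels come from Step 4.
LIKELIHOOD = {"Unlikely": 1, "Possible": 2, "Probable": 3, "Almost certain": 4}
SEVERITY = {"Low": 1, "Medium": 2, "High": 3, "Critical": 4}


@dataclass(frozen=True)
class Failure:
    description: str
    likelihood: str
    severity: str

    @property
    def score(self) -> int:
        # Likelihood times severity: a simple, commonly used risk score.
        return LIKELIHOOD[self.likelihood] * SEVERITY[self.severity]


def rank(failures: list[Failure]) -> list[Failure]:
    """Return failures ordered from highest to lowest risk score."""
    return sorted(failures, key=lambda f: f.score, reverse=True)


failures = [
    Failure("Android crash missed in testing", "Probable", "Medium"),
    Failure("Sessions survive a password change", "Probable", "Critical"),
    Failure("Transfer debited but never credited", "Possible", "Critical"),
]
ranked = rank(failures)
for f in ranked:
    print(f.score, f.description)
```

With these weights the session failure (score 12) sorts first, ahead of the lost transfer (8) and the Android crash (6). A team would likely tune the weights; the ordering matters more than the exact numbers.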

Worked Pre-Mortem 1: Online Banking App

The System

A mobile banking app. Customers can check balances, transfer money, pay bills, and deposit checks by photographing them.

Imagined Disasters

#	Disaster	Category	Likelihood	Severity
1	"A customer transferred $10,000 but the money disappeared — left the source account but never arrived at the destination"	Correctness	Possible	Critical
2	"Someone gained access to 50,000 customer accounts because session tokens weren't invalidated after password changes"	Security	Probable	Critical
3	"The app was down for 6 hours on a Friday (payday) because a database migration failed and couldn't be rolled back"	Availability	Probable	Critical
4	"A customer deposited the same check 15 times by rapidly submitting photos, and was credited $15,000 for a $1,000 check"	Correctness	Possible	High
5	"Customer service had no way to see what went wrong with a failed transaction because logging was incomplete"	Data	Probable	High
6	"The mobile app crashed on Android 12 devices and wasn't caught because we only tested on iOS"	Availability	Probable	Medium
7	"A third-party payment provider changed their API without notice, and bill payments silently failed for 3 days"	Integration	Possible	High

Prevention Plan

Disaster 1: Disappearing transfer

Prevention: Atomic transactions — debit and credit must be a single, indivisible operation. If one fails, both roll back.
Detection: Nightly reconciliation. Verify that total debits equal total credits across all accounts; any mismatch triggers an alert.
Recovery: Transaction is logged with full details regardless of success/failure, enabling manual correction.
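
A minimal sketch of the atomic transfer, using Python's built-in `sqlite3` module as a stand-in for the bank's real database. The `accounts` table, the `transfer` function, and the use of integer cents are assumptions for illustration. The `total_balance` check is a simplified stand-in for the nightly reconciliation: a transfer should never change the system-wide total.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, cents: int) -> None:
    """Move money between accounts; debit and credit commit or roll back together."""
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        debited = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (cents, src, cents),
        )
        if debited.rowcount != 1:
            raise ValueError("unknown source account or insufficient funds")
        credited = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (cents, dst)
        )
        if credited.rowcount != 1:
            # Raising here also undoes the debit above.
            raise ValueError("unknown destination account")


def total_balance(conn: sqlite3.Connection) -> int:
    """Simplified reconciliation: the sum of all balances must stay constant."""
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]


def balance_of(conn: sqlite3.Connection, account: str) -> int:
    row = conn.execute("SELECT balance FROM accounts WHERE id = ?", (account,))
    return row.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 1_000_000), ("bob", 0)]
)
conn.commit()

before = total_balance(conn)
transfer(conn, "alice", "bob", 250_000)
try:
    transfer(conn, "alice", "nobody", 100_000)  # credit fails, debit is rolled back
except ValueError:
    pass
after = total_balance(conn)
```

The second transfer fails on the credit step, and the rollback restores the debit, so no money leaves the source account without arriving somewhere.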

Disaster 2: Session tokens after password change

Prevention: On password change, invalidate ALL active sessions for that user. Force re-authentication.
Detection: Monitor for sessions that continue after a password change event. This should trigger a security alert.
Process: Add to the security review checklist: "What happens to active sessions when credentials change?"
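
The prevention rule above can be sketched with an in-memory session store. `SessionStore` and `change_password` are hypothetical names, and a real system would keep sessions in a shared backend rather than a dictionary.

```python
import secrets


class SessionStore:
    """In-memory stand-in for a real session backend (hypothetical)."""

    def __init__(self) -> None:
        self._user_by_token: dict[str, str] = {}

    def create(self, user_id: str) -> str:
        token = secrets.token_urlsafe(32)  # unguessable session token
        self._user_by_token[token] = user_id
        return token

    def is_valid(self, token: str) -> bool:
        return token in self._user_by_token

    def invalidate_all(self, user_id: str) -> int:
        """Drop every session for this user; return how many were removed."""
        stale = [t for t, u in self._user_by_token.items() if u == user_id]
        for token in stale:
            del self._user_by_token[token]
        return len(stale)


def change_password(store: SessionStore, user_id: str) -> int:
    # Storing the new password hash is omitted; the point is the session handling.
    # Every device, including the one making the change, must log in again.
    return store.invalidate_all(user_id)


store = SessionStore()
phone = store.create("u1")
laptop = store.create("u1")
other_user = store.create("u2")
removed = change_password(store, "u1")
```

After the password change, both of the user's tokens are rejected while the other user's session is untouched.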

Disaster 3: Failed database migration on payday

Prevention: Never run migrations on Fridays. Always have a tested rollback script. Run migrations in a staging environment first.
Detection: Automated health check: if the app can't connect to the database within 5 seconds, page the on-call engineer.
Mitigation: Read-only mode: if the database is mid-migration, customers can view balances but not make transactions. Degraded but not dead.
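
Read-only mode can be as small as one flag checked on every write path. This is a sketch under that assumption; `BankingService` and `ReadOnlyError` are hypothetical names, and a real service would read the flag from shared configuration so that all instances switch together.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the service is in read-only mode."""


class BankingService:
    def __init__(self) -> None:
        self.read_only = False  # operators flip this on during a migration
        self._balances = {"alice": 500}

    def balance(self, account: str) -> int:
        # Reads keep working in read-only mode.
        return self._balances[account]

    def deposit(self, account: str, amount: int) -> None:
        if self.read_only:
            raise ReadOnlyError("transactions are paused during maintenance")
        self._balances[account] += amount


service = BankingService()
service.deposit("alice", 100)
service.read_only = True
try:
    service.deposit("alice", 100)
    write_blocked = False
except ReadOnlyError:
    write_blocked = True
visible_balance = service.balance("alice")
```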

Disaster 4: Duplicate check deposit

Prevention: Idempotency key. Derive a unique ID from the check itself (for example, the routing, account, and check numbers printed on it), not from each upload. If the same check is submitted again, the submission is recognized as a duplicate and rejected.
Detection: Flag accounts with multiple deposits of the same amount in a short time window.
Business rule: Hold deposited funds for 24 hours before making them available (standard banking practice, but now you know why).
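
A sketch of duplicate rejection for check deposits. It assumes the key is built from the routing number, account number, and check number read from the check; `DepositLedger` and `DuplicateDeposit` are hypothetical names.

```python
class DuplicateDeposit(Exception):
    """Raised when a check that was already deposited is submitted again."""


class DepositLedger:
    def __init__(self) -> None:
        self._seen: set[tuple[str, str, str]] = set()
        self.credited_cents = 0

    def deposit(self, routing: str, account: str, check_no: str, cents: int) -> None:
        key = (routing, account, check_no)  # identifies the check, not the upload
        if key in self._seen:
            raise DuplicateDeposit(f"check {check_no} was already deposited")
        self._seen.add(key)
        self.credited_cents += cents


ledger = DepositLedger()
rejected = 0
for _ in range(15):  # the customer submits the same $1,000 check 15 times
    try:
        ledger.deposit("021000021", "000123456", "1042", 100_000)
    except DuplicateDeposit:
        rejected += 1
```

This in-memory version is safe only in a single thread. With rapid concurrent submissions, the check-then-insert step would need to be atomic, for example through a unique constraint on the key in the database.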

Disaster 5: Incomplete logging

Prevention: Define logging as part of the contract for every operation. Every contract's side effects section must include what is logged.
Standard: For every transaction: log the input, the output, the timestamp, the customer ID, the IP address, and either the success result or the full error.
Verification: Regularly test that a support engineer can reconstruct what happened for a given transaction using only the logs.
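
One way to make the logging standard hard to skip is a single helper that builds every log line. This is a sketch; `log_entry` and its field names are assumptions, and the `now` parameter exists only so the example is reproducible.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional


def log_entry(
    customer_id: str,
    ip_address: str,
    request: dict[str, Any],
    result: Optional[dict[str, Any]] = None,
    error: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Build one JSON log line carrying every field the standard lists."""
    if (result is None) == (error is None):
        raise ValueError("provide exactly one of result or error")
    entry = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "customer_id": customer_id,
        "ip_address": ip_address,
        "input": request,
        "outcome": "success" if error is None else "failure",
        "output": result,
        "error": error,
    }
    return json.dumps(entry, sort_keys=True)


line = log_entry(
    "c-42",
    "203.0.113.7",
    {"op": "transfer", "cents": 5000},
    error="insufficient funds",
    now=datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc),
)
parsed = json.loads(line)
```

Because the helper refuses a call that has neither a result nor an error, an operation cannot be logged without its outcome.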

Worked Pre-Mortem 2: School Registration System

Imagined Disasters

#	Disaster	Category	Likelihood	Severity
1	"Registration opened at 8 AM and the site crashed within 2 minutes because 10,000 parents all clicked at the same time"	Scaling	Almost certain	High
2	"A parent registered their child at School A, but the system assigned them to School B because of a race condition on the last available seat"	Correctness	Probable	High
3	"A parent uploaded their child's medical records, and another parent could see them due to a document ID that was sequential and guessable"	Security	Possible	Critical
4	"Registration closed, but 200 parents say they submitted before the deadline and have no confirmation. No one can prove what happened."	Data	Probable	High
5	"The system accepted a registration without required immunization records. The school discovered this on the first day of class."	Correctness	Probable	Medium
6	"A family with special needs (IEP, 504 plan) registered but the system didn't flag this for the school, so no accommodations were prepared"	Correctness	Possible	High

Prevention Plan

Disaster 1: Opening-day crash

Prevention: Load test with 10x expected traffic before launch. Use a virtual queue ("You are #3,247 in line. Estimated wait: 12 minutes.") instead of letting everyone hit the system simultaneously.
Mitigation: Have a static "we're experiencing high volume" page that doesn't require the database, so the site doesn't show an error.
Communication: Tell parents in advance: "Registration stays open for 2 weeks. Spots are not first-come-first-served. You do not need to register at 8 AM."

Disaster 2: Race condition on last seat

Prevention: Don't assign seats in real time. Accept all registrations as "pending." Run the assignment process after registration closes, with clear tiebreaker rules.
If real-time assignment is required: Use pessimistic locking — when a parent starts registering for School A, temporarily reserve a seat. If they don't complete within 15 minutes, release it.
Never say "you're in" until the seat reservation is confirmed and committed.
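
The real-time option can be sketched as a pool of seats with time-limited holds. `SeatPool` is a hypothetical, single-process illustration; the caller passes the current time in seconds so the expiry logic is easy to follow. A multi-server deployment would need the same rules enforced inside the database.

```python
class SeatPool:
    """Seats for one school, with time-limited holds (in-memory sketch)."""

    HOLD_SECONDS = 15 * 60

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._holds: dict[str, float] = {}  # parent_id -> time the hold expires
        self._confirmed: set[str] = set()

    def _expire(self, now: float) -> None:
        self._holds = {p: t for p, t in self._holds.items() if t > now}

    def reserve(self, parent_id: str, now: float) -> bool:
        """Place a hold if a seat is free; False means the school is full right now."""
        self._expire(now)
        if parent_id in self._holds or parent_id in self._confirmed:
            return True
        if len(self._holds) + len(self._confirmed) >= self.capacity:
            return False
        self._holds[parent_id] = now + self.HOLD_SECONDS
        return True

    def confirm(self, parent_id: str, now: float) -> bool:
        """Turn a live hold into a seat; only then may the UI say "you're in"."""
        self._expire(now)
        if parent_id in self._confirmed:
            return True
        if parent_id not in self._holds:
            return False
        del self._holds[parent_id]
        self._confirmed.add(parent_id)
        return True


pool = SeatPool(capacity=1)
a_holds = pool.reserve("parent-a", now=0.0)
b_blocked = pool.reserve("parent-b", now=10.0)
b_holds = pool.reserve("parent-b", now=901.0)  # parent-a's hold has expired
a_too_late = pool.confirm("parent-a", now=902.0)
b_confirmed = pool.confirm("parent-b", now=902.0)
```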

Disaster 3: Guessable document IDs

Prevention: Use random, non-sequential document IDs (for example, version 4 UUIDs). Never use auto-incrementing IDs for anything the user can see in a URL.
Authorization: Even with random IDs, check that the requesting user is authorized to see the document. Defense in depth — random ID + authorization check.
Encryption: Store uploaded documents encrypted at rest. Even if someone accesses the storage directly, they can't read the files.
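
The first two layers, a random ID plus an ownership check, can be sketched as follows. `DocumentStore` is a hypothetical in-memory example; encryption at rest is left out to keep it short.

```python
import uuid


class DocumentStore:
    def __init__(self) -> None:
        # doc_id -> (owner_id, content)
        self._docs: dict[str, tuple[str, bytes]] = {}

    def upload(self, owner_id: str, content: bytes) -> str:
        doc_id = str(uuid.uuid4())  # random (version 4) UUID, not sequential
        self._docs[doc_id] = (owner_id, content)
        return doc_id

    def fetch(self, requester_id: str, doc_id: str) -> bytes:
        record = self._docs.get(doc_id)
        # Same error for "missing" and "not yours", so IDs cannot be probed.
        if record is None or record[0] != requester_id:
            raise PermissionError("document not found or access denied")
        return record[1]


store = DocumentStore()
doc_id = store.upload("parent-1", b"immunization record")
own_copy = store.fetch("parent-1", doc_id)
try:
    store.fetch("parent-2", doc_id)
    leaked = True
except PermissionError:
    leaked = False
```

Even if another parent somehow learns the ID, the ownership check still refuses the request.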

Disaster 4: No proof of submission

Prevention: Every submission generates a confirmation number immediately, displayed on screen AND emailed. If the email fails, the confirmation number is still shown on screen.
Logging: Log every submission attempt with timestamp, IP address, and all submitted data. This is the system's proof.
Grace period: If the system was under heavy load near the deadline, extend the deadline. Publish the policy in advance.
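
A sketch of a submission handler that records the attempt and produces the confirmation number before it tries to send email. `submit_registration`, the record fields, and the `send_email` callback are assumptions for illustration.

```python
import secrets
from typing import Callable


def submit_registration(
    submissions: list[dict],
    form: dict,
    ip_address: str,
    timestamp: str,
    send_email: Callable[[str, str], None],
) -> str:
    """Record the submission, then return a confirmation number for the screen."""
    confirmation = secrets.token_hex(4).upper()  # 8 hex characters, hard to guess
    record = {
        "confirmation": confirmation,
        "timestamp": timestamp,
        "ip_address": ip_address,
        "form": dict(form),
        "email_sent": False,
    }
    submissions.append(record)  # the log entry exists before any email is attempted
    try:
        send_email(form["email"], confirmation)
        record["email_sent"] = True
    except Exception:
        pass  # an email failure must not lose the submission or the confirmation
    return confirmation


def broken_mailer(address: str, confirmation: str) -> None:
    raise ConnectionError("mail server unavailable")


submissions: list[dict] = []
number = submit_registration(
    submissions,
    {"child": "Sam", "school": "School A", "email": "parent@example.com"},
    "203.0.113.9",
    "2024-03-01T07:59:58Z",
    broken_mailer,
)
```

Here the mail server is down, yet the parent still gets a confirmation number on screen and the system still holds a timestamped record of the submission.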

Worked Pre-Mortem 3: Smart Thermostat System

Imagined Disasters

#	Disaster	Category	Likelihood	Severity
1	"Internet goes down. Thermostat stops maintaining temperature because it depends on the cloud to get the schedule."	Availability	Almost certain	Critical (pipes freeze in winter)
2	"A software update bricked 50,000 thermostats. They display nothing and don't control temperature."	Availability	Possible	Critical
3	"A hacker accessed the thermostat API and set 100,000 homes to 95°F in August, causing danger for elderly residents."	Security	Possible	Critical
4	"The thermostat reported wrong energy data, and customers got unexpectedly high utility bills"	Correctness	Probable	High
5	"Two family members set conflicting schedules from their phones. The thermostat oscillated between 68°F and 75°F every few minutes."	Correctness	Probable	Medium

Prevention Plan

Disaster 1: Internet dependency

Prevention: The thermostat must operate independently of the internet. The schedule is stored on the device, not just in the cloud. The cloud syncs the schedule, but the device doesn't require it.
Design rule: If the internet connection goes away, the thermostat continues following its last-known schedule indefinitely. The user loses remote control but the house stays warm.
This is a boundary decision: The thermostat is its own module. The cloud is a convenience layer, not a dependency.
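
The design rule can be sketched as a device that owns its schedule and treats the cloud as optional. `Thermostat` and the hour-to-temperature schedule format are assumptions for illustration, and the schedule is assumed to be non-empty.

```python
from typing import Callable


class Thermostat:
    def __init__(self, schedule: dict[int, float]) -> None:
        # hour of day (0-23) -> target temperature in °F, stored on the device
        self._schedule = dict(schedule)

    def sync(self, fetch_schedule: Callable[[], dict[int, float]]) -> bool:
        """Try to pull a newer schedule; on failure keep the last-known one."""
        try:
            fetched = fetch_schedule()
        except ConnectionError:
            return False
        if fetched:  # ignore an empty schedule rather than lose the stored one
            self._schedule = dict(fetched)
        return True

    def target_at(self, hour: int) -> float:
        """Target for this hour: the latest schedule entry at or before it."""
        earlier = [h for h in self._schedule if h <= hour]
        start = max(earlier) if earlier else max(self._schedule)  # wrap past midnight
        return self._schedule[start]


def cloud_down() -> dict[int, float]:
    raise ConnectionError("no route to cloud")


thermostat = Thermostat({6: 70.0, 22: 64.0})
synced = thermostat.sync(cloud_down)
morning = thermostat.target_at(7)
small_hours = thermostat.target_at(3)
```

The failed sync changes nothing on the device: it keeps heating to 70°F in the morning and 64°F overnight from the stored schedule.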

Disaster 2: Bricked by update

Prevention: Two-slot firmware — the thermostat stores two copies of its software. An update writes to the backup slot. If the update fails or the device doesn't boot correctly, it automatically reverts to the previous working version.
Rollout: Never update all devices at once. Update 1% → verify → 10% → verify → 100%. This limits blast radius.
Minimum function: Even if all software fails, the hardware should maintain a safe default temperature (60°F) to prevent pipe freezing. This is a hardware fallback, not a software feature.
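
The two-slot update and the staged rollout can be sketched together. `TwoSlotDevice`, `staged_rollout`, and the `boots_ok` callback (standing in for a real boot check) are hypothetical, and the hardware fallback is not modelled.

```python
from typing import Callable, Optional


class TwoSlotDevice:
    def __init__(self, version: str) -> None:
        self.slots: dict[str, Optional[str]] = {"A": version, "B": None}
        self.active = "A"

    @property
    def running(self) -> Optional[str]:
        return self.slots[self.active]

    def update(self, new_version: str, boots_ok: Callable[[str], bool]) -> bool:
        standby = "B" if self.active == "A" else "A"
        self.slots[standby] = new_version  # the working slot is never overwritten
        if not boots_ok(new_version):
            return False  # stay on the previous version
        self.active = standby
        return True


def staged_rollout(devices, new_version, boots_ok, stages=(1, 10, 100)) -> int:
    """Update in growing batches (percent of fleet); stop at the first failure.

    Returns how many devices were attempted.
    """
    attempted = 0
    for percent in stages:
        target = min(len(devices), max(1, len(devices) * percent // 100))
        batch = devices[attempted:target]
        results = [d.update(new_version, boots_ok) for d in batch]
        attempted = max(attempted, target)
        if not all(results):
            break  # halt the rollout; later stages never see the bad build
    return attempted


fleet = [TwoSlotDevice("1.0") for _ in range(200)]
attempted_bad = staged_rollout(fleet, "2.0-broken", lambda version: False)
versions_after_bad = {d.running for d in fleet}
attempted_good = staged_rollout(fleet, "2.1", lambda version: True)
versions_after_good = {d.running for d in fleet}
```

With a fleet of 200, the broken build reaches only the 2 devices in the first stage, and both keep running version 1.0 from their untouched slot.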

Disaster 3: Security breach

Prevention: Authentication for every API call. Rate limiting on temperature changes. Maximum temperature bound (can't set above 90°F or below 50°F) enforced on the device, not just in the app.
Detection: Alert if temperature is set outside normal range, or if settings change more than 5 times in an hour.
Physical limit: The device has a physical maximum temperature that the software cannot override. Even a fully compromised cloud can't heat a house to a dangerous temperature.
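
A sketch of the device-side checks: clamp every requested setpoint to the allowed range and raise alerts for out-of-range requests or frequent changes. `DeviceGuard` is a hypothetical name; the bounds and the five-changes-per-hour threshold come from the plan above. The caller passes the current time in seconds.

```python
from collections import deque

MIN_F, MAX_F = 50.0, 90.0
MAX_CHANGES_PER_HOUR = 5


class DeviceGuard:
    """Runs on the thermostat itself, so a compromised app or cloud cannot skip it."""

    def __init__(self) -> None:
        self._recent: deque[float] = deque()  # timestamps of recent changes

    def apply(self, requested_f: float, now: float) -> tuple[float, list[str]]:
        alerts = []
        if not MIN_F <= requested_f <= MAX_F:
            alerts.append("setpoint outside allowed range")
        applied = min(MAX_F, max(MIN_F, requested_f))  # clamp to the safe range

        # Keep only the changes made within the last hour.
        while self._recent and now - self._recent[0] >= 3600:
            self._recent.popleft()
        self._recent.append(now)
        if len(self._recent) > MAX_CHANGES_PER_HOUR:
            alerts.append(f"more than {MAX_CHANGES_PER_HOUR} changes in an hour")
        return applied, alerts


guard = DeviceGuard()
hot = guard.apply(95.0, now=0.0)
for minute in range(1, 5):
    guard.apply(70.0, now=minute * 60.0)
sixth = guard.apply(71.0, now=300.0)
next_day = guard.apply(68.0, now=90_000.0)
```

The 95°F request is applied as 90°F and flagged, and the sixth change within an hour raises the frequency alert. This sketch only alerts on frequent changes; rejecting them outright would be a stricter form of rate limiting.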

The Pre-Mortem Toolkit

Questions to Ask for Any System

Question	What It Reveals
What happens when the network goes away?	Dependency on connectivity
What happens when traffic is 10x normal?	Scaling limits
What happens when a database migration fails?	Recovery procedures
What happens when a third-party service changes without notice?	Integration fragility
What happens when two users do the same thing at the same time?	Concurrency issues
What happens when the clock is wrong on one server?	Timing assumptions
What happens when someone deliberately tries to break it?	Security posture
What happens when data from 2019 meets code from 2024?	Data compatibility
What happens when the person who built this leaves the team?	Knowledge concentration
What happens when we have 100x the current data volume?	Storage and performance limits

The Risk Matrix

Plot your pre-mortem findings:

                 Low Impact         High Impact
              ┌───────────────┬────────────────────┐
  Likely      │   Monitor     │  MUST ADDRESS      │
              │               │  (Design for this) │
              ├───────────────┼────────────────────┤
  Unlikely    │   Accept      │  Plan response     │
              │   (Log it)    │  (Have a playbook) │
              └───────────────┴────────────────────┘

Likely + High Impact: Must be addressed in the design. Not optional.
Likely + Low Impact: Monitor and fix when convenient.
Unlikely + High Impact: Have a response plan. You don't have to prevent it, but you must know what to do if it happens.
Unlikely + Low Impact: Accept the risk. Log it and move on.

Summary: Why Pre-Mortems Work

They give permission to be negative. In planning, people avoid bringing up problems (it feels like criticizing the plan). In a pre-mortem, finding problems is the goal.
They surface assumptions. "The internet will always be available" is an assumption that looks obviously risky in a pre-mortem but goes unquestioned during design.
They connect to everything else in this curriculum:
- Data lifecycle → "Where is data at risk of loss or corruption?"
- Boundaries → "What's the blast radius if this module fails?"
- Contracts → "Which error cases are missing from our contracts?"
- Decomposition → "Which dependencies create single points of failure?"
They're cheap. A pre-mortem takes an hour. Recovering from a disaster you could have prevented takes weeks.
