Systems Thinking for Engineers

What This Course Is

This is not a programming course. You will not start by learning a language. You will not memorize syntax. You will not write "Hello, World."

Instead, you will learn how to think about systems — the same mental framework that separates a senior engineer with 20 years of experience from someone who just learned to code.

The tools have changed. An LLM can write a function faster than you can type it. But an LLM cannot:

Decide what the system should look like
Know where the boundaries belong
Understand why one design fails under load and another doesn't
Diagnose a problem it has never seen by reasoning from first principles

That is your job. This course teaches you to do it.

Who This Is For

Anyone entering software engineering — whether you have never written a line of code or you have written thousands but never understood why the code is organized the way it is.

If you can answer "what should I build and how should the pieces fit together?" then telling the machine how to build it is the easy part.

The Five Pillars

Each section of this course follows the same structure:

Why — The reasoning behind the concept. Why does this matter? What goes wrong without it?
How — The framework for applying it. Diagrams, patterns, and worked examples.
Test — Exercises that prove you understand. No code. Just thinking.

The pillars build on each other:

#	Pillar	Core Question
1	Data Lifecycle	Where does data live, what changes it, and how does it move?
2	Boundaries	What is a "thing"? Where does one piece end and another begin?
3	Contracts	What goes in, what comes out, and what can go wrong?
4	Decomposition	How do you break a big problem into small, solvable pieces?
5	Failure Modes	When something breaks, how do you reason about why?

Read the Why first. Don't skip it. The temptation will be to jump to "how do I do this?" but the reasoning is the entire point. If you can explain why something is the way it is, the how follows naturally.

Then work through the How — not by memorizing, but by trying to apply each concept to something you already understand (a restaurant, a library, an airport — real-world systems are software systems without the computer).

Finally, do the Test sections honestly. If you can't answer confidently, go back. There is no penalty for re-reading. There is a massive penalty in the real world for building on a shaky foundation.

Let's begin.

Data Lifecycle — Why It Matters

The Single Most Important Idea

Every piece of software that has ever existed does exactly three things with data:

Stores it
Transforms it
Transports it

That's it. Every app, every service, every script, every billion-dollar platform — strip away the UI, the branding, the complexity — and you are looking at data being stored somewhere, changed into something else, and moved from one place to another.

If you understand this, you can look at any system and immediately start reasoning about what it does. If you don't understand this, you will forever be lost in the details.

Why This Comes Before Everything Else

Most courses start with "here is a variable, here is a loop." That's like teaching someone to drive by explaining how a piston works. It's not wrong, but it's the wrong starting point.

When a senior engineer looks at a new system, they don't think about variables. They think:

"Where is the data coming from?" — a user typing? a file on disk? another system calling in?
"What happens to it?" — is it validated? calculated? reformatted? combined with other data?
"Where does it end up?" — saved to a database? shown on a screen? sent to another system?

This is the data lifecycle, and it is the foundation of every design decision in engineering.

What Goes Wrong Without This Mental Model

You build the wrong thing

A stakeholder says "we need a dashboard." Without lifecycle thinking, you start designing a screen. With lifecycle thinking, you ask: what data feeds this dashboard? Where does that data come from? How fresh does it need to be? The answers to those questions determine most of the work; the screen is the easy part.

You can't find bugs

Something is broken. Users are seeing stale data. Without lifecycle thinking, you stare at code and guess. With lifecycle thinking, you trace: the data is stored in a cache, transformed when the page loads, and transported from the API. Stale data means something is being served from storage after its source has changed, so the cache is the first suspect. You narrowed it from "something is broken" to "check whether the cache is invalidating" in 30 seconds.

You can't explain your system to anyone

"It's a web app that does stuff" is not an explanation. "User input is validated and stored, background jobs transform it into reports, and an API transports those reports to the client" — that's an explanation. Anyone can understand that, technical or not.

The Three Stages, Concretely

Storage

Data at rest. It exists somewhere and is not currently being changed.

A row in a database
A file on disk
A value held in memory
A message sitting in a queue, waiting
A cookie in a browser
A configuration file on a server

The key questions about storage:

How long does it live? (forever? until the user closes the tab? five minutes?)
Who can access it? (just this program? any program? the user?)
What happens if it disappears? (catastrophic? inconvenient? nobody notices?)
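
The first question, how long the data lives, can be written down explicitly instead of being left to chance. The sketch below is one minimal way to do that: an in-memory store whose entries expire after a fixed lifetime. The class name `ExpiringStore`, the `ttl_seconds` parameter, and the injectable clock are illustrative assumptions, not part of any particular library.

```python
import time
from typing import Any, Callable, Optional


class ExpiringStore:
    """Hypothetical in-memory store where every value has a fixed lifetime."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so the lifetime can be tested
        self._items = {}      # key -> (time stored, value)

    def put(self, key: str, value: Any) -> None:
        # Record when the value was stored alongside the value itself.
        self._items[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            # The data has outlived its lifetime: treat it as gone.
            del self._items[key]
            return None
        return value
```

A caller that gets `None` back has to decide what "it disappeared" means for this data, which is the third question in the list above.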

Transform

Data being changed from one form or value to another.

Validating an email address (raw text → confirmed-valid text)
Calculating a total (line items → sum)
Compressing an image (large file → smaller file)
Sorting a list (unordered → ordered)
Joining data from two sources (customer + orders → customer-with-orders)

The key questions about transforms:

What goes in? (what shape? what constraints?)
What comes out? (what shape? what guarantees?)
Can it fail? (what happens if the input is garbage? this is where many programs break)
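
The three transform questions map directly onto a function: its parameter, its return value, and the error it raises. Here is a minimal sketch using the email example from the list above. The function name and the plausibility rules are assumptions made for illustration; the check is deliberately simplistic and is not full email-address validation.

```python
def validate_email(raw: str) -> str:
    """Transform raw text into a trimmed, plausible-looking address.

    What goes in:   any string.
    What comes out: the same address with surrounding whitespace removed.
    Can it fail:    yes, with ValueError, when the input is garbage.
    """
    candidate = raw.strip()
    local, separator, domain = candidate.partition("@")
    plausible = (
        separator == "@"
        and local != ""
        and "@" not in domain
        and "." in domain
        and not domain.startswith(".")
        and not domain.endswith(".")
        and " " not in candidate
    )
    if not plausible:
        raise ValueError(f"not a plausible email address: {raw!r}")
    return candidate
```

The important design choice is that garbage input produces a loud, specific failure rather than a quietly wrong value that flows on to the next stage.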

Transport

Data moving from one location to another.

A user submitting a form (browser → server)
An API call between services (service A → service B)
Reading from a database (database → application)
Displaying a result on screen (application → user's eyes)
Sending an email (system → inbox)

The key questions about transport:

How fast does it need to get there? (instantly? eventually? batch every hour?)
What happens if it doesn't arrive? (retry? alert? silent failure?)
How much data is moving? (one record? millions?)
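
The second question, what happens if the data doesn't arrive, is usually answered in code with some form of retry. The following is a minimal sketch under stated assumptions: `send_with_retry` is a hypothetical helper, only `ConnectionError` is treated as "did not arrive," and the delay doubles after each failed attempt.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def send_with_retry(
    send: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send` until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: fail loudly, never silently
            # Wait longer after each failure: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Retrying is only safe when delivering the same data twice is harmless, so the receiving side needs a way to recognize duplicates.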

The Mental Shift

Stop thinking about software as "code that does things."

Start thinking about it as data that flows through stages: it arrives from somewhere, it gets stored, it gets transformed, it gets transported to the next stage, and the cycle continues.

When someone describes a feature, your first instinct should be: "What data? Where does it start? What happens to it? Where does it end up?"

This is how experienced engineers think. Not because someone taught them — because after years of debugging, designing, and rebuilding, this is the pattern that always held true.

Now you know it on day one.

Data Lifecycle — How: The Method

Lifecycle Mapping

The core skill is lifecycle mapping — taking any system or feature and tracing what happens to the data from birth to death. You don't need code for this. You need a diagram and the right questions.

The Three-Column Method

For any system, draw three columns:

Storage	Transform	Transport
Where data rests	How data changes	How data moves

Then fill them in for the system you're analyzing. Every piece of data in the system should appear in at least one column. Most data will touch all three.
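
You don't need code to use this method, but if it helps to see the three columns as a data structure, here is one possible sketch. The `Category` and `Stage` names and the newsletter sign-up stages are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    STORAGE = "Storage"
    TRANSFORM = "Transform"
    TRANSPORT = "Transport"


@dataclass(frozen=True)
class Stage:
    description: str
    category: Category


def three_columns(stages):
    """Group stage descriptions under Storage, Transform, and Transport."""
    columns = {category: [] for category in Category}
    for stage in stages:
        columns[stage.category].append(stage.description)
    return columns


# Illustrative stages for a made-up "sign up for a newsletter" feature.
signup = [
    Stage("Form submitted from browser to server", Category.TRANSPORT),
    Stage("Email address validated", Category.TRANSFORM),
    Stage("Subscriber row written to database", Category.STORAGE),
    Stage("Welcome email sent", Category.TRANSPORT),
]

for category, items in three_columns(signup).items():
    print(f"{category.value}: {items}")
```

An empty column is a useful warning sign: it usually means part of the system has not been mapped yet.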

How to Map a System

Follow these steps in order:

Step 1: Identify the Data

Before mapping anything, list every piece of data the system touches. Don't worry about how it works yet — just name the data.

Ask:

What does the user provide?
What does the system store?
What does the system calculate or derive?
What does the system output or display?
What does the system exchange with other systems?

Step 2: Trace Each Piece Through Its Lifecycle

For each piece of data, follow it from birth to death:

Where is it born? (user types it, another system sends it, it's calculated from other data)
Where does it live? (memory, database, file, cache, queue)
What changes it? (validation, calculation, formatting, enrichment)
Where does it travel? (screen, API, email, another module)
Where does it die? (deleted, archived, expired, overwritten)

Step 3: Build the Map

Use either a table or a flow diagram.

Table format — best for initial analysis:

Stage	What Happens	Category
(describe each step)	(what specifically occurs)	Storage / Transform / Transport

Flow diagram — best for communicating with others:

Boxes = storage (data at rest)
Arrows = transport (data in motion)
Labels on arrows or diamonds = transforms (data being changed)

┌───────────┐               ┌───────────┐               ┌───────────┐
│  Source   │ ────────────► │  Process  │ ────────────► │   Dest    │
│ (storage) │   transport   │(transform)│   transport   │ (storage) │
└───────────┘               └───────────┘               └───────────┘

Step 4: Find the Hidden Data

The most common mistake is forgetting data that isn't obvious. Every system has hidden data. Look for these specifically:

Metadata — data about data. When was this created? Who created it? What version? How many times has it been accessed? Metadata is critical for debugging and auditing, and it's almost always overlooked.

State — the current condition of something. Is this order pending, paid, in progress, or complete? Is this account active or suspended? State is data, and managing state transitions is where most bugs live.

Configuration — data that controls how the system behaves. Tax rates, store hours, feature flags, allowed file types, maximum limits. Configuration is storage that affects transforms.

Logs — a record of what happened. Every transport and transform should produce a log entry. When things break — and they will — logs are how you reconstruct what happened.

Derived data — data that is calculated from other data. A running total, a user's "membership level," an average rating. This data doesn't come from outside — it's created internally through transforms.
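
Of the hidden data types above, logs are the easiest to add mechanically. As a rough sketch, a decorator can write one log entry every time a transform or transport runs, whether it succeeds or fails. The decorator name `logged_stage`, the logger name `lifecycle`, and the `order_total` function are assumptions made for this example.

```python
import functools
import logging

logger = logging.getLogger("lifecycle")


def logged_stage(category: str):
    """Decorator that records one log entry per call, success or failure."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                # logger.exception also records the traceback.
                logger.exception("%s %s failed", category, func.__name__)
                raise
            logger.info("%s %s ok", category, func.__name__)
            return result
        return wrapper
    return decorate


@logged_stage("transform")
def order_total(line_items):
    """Derived data: sum line-item amounts given in cents."""
    return sum(line_items)
```

Note that `order_total` is itself an example of derived data: its result is created internally from other data rather than arriving from outside.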

The Lifecycle Question Checklist

When analyzing any system or feature, run through these questions:

Storage:

What data is stored?
Where is it stored? (and is it more than one place?)
How long does it persist? (seconds? days? forever?)
What happens if storage fails or data is lost?
Who can access it?
How much data accumulates over time?

Transform:

What transforms happen to the data?
In what order?
What can go wrong at each step?
Are transforms reversible? (Can you undo them?)
Are there transforms that happen on a schedule vs. on demand?

Transport:

Where does data move from and to?
How quickly must it move? (real-time? batch? eventually?)
What happens if transport fails? (retry? lose it? queue it?)
How much data moves at once? (one record? thousands?)
Is the transport secure? (does it need to be?)
Who initiates the transport — the sender or the receiver?

If you can answer all of these for a given system, you understand that system deeply enough to build it, debug it, or redesign it.

What a Good Lifecycle Map Looks Like

A complete lifecycle map has these properties:

Every piece of data is accounted for — nothing appears from nowhere, nothing vanishes without explanation
Every stage is labeled — you know whether each step is storage, transform, or transport
Hidden data is included — metadata, state, configuration, and logs are on the map
Failure points are visible — you can point to each stage and say "if this fails, here is what breaks"
A stranger could follow it — someone who has never seen the system could read your map and understand the data flow

The following sections present complete worked examples. Study them, then compare them to the test questions. The test will ask you to produce maps at this level of detail.

Data Lifecycle — Example: Coffee Shop Ordering

The Scenario

A customer orders coffee from a mobile app. They browse the menu, customize a drink, pay, and receive their drink at the store. The barista sees the order on a screen. The customer gets a push notification when the drink is ready.

This seems simple. Let's map it and see how much data is actually involved.

Step 1: Identify All the Data

Obvious data:

The customer's drink selection (item, size, modifications)
The customer's payment information
The store's menu
The order itself
The receipt

Less obvious data:

The store's current inventory (do they have oat milk today?)
Store hours and availability (is the store even open?)
The customer's account info (name, payment methods on file, order history)
Order status (pending → paid → in progress → ready → completed)
Timestamps (when was the order placed? when was it ready?)
The queue position (how many orders are ahead of this one?)
Estimated wait time
Push notification delivery status (did the notification reach the phone?)

That's at least 13 pieces of data for a single coffee order. Most people would have named 3 or 4.

Step 2: Full Lifecycle Map

#	Stage	What Happens	Category	Data Involved
1	Customer opens app	App loads menu from server	Transport (server → app)	Menu, store hours, inventory flags
2	Menu displayed	Menu data held in app memory	Storage (temporary)	Menu items, prices, available mods
3	Customer browses	Selection exists in app memory as they tap	Storage (temporary)	Selected item, size, mods
4	Customer taps "Add to Cart"	Selection formatted into cart item	Transform	Selection → structured cart item
5	Cart displayed	Cart data held in app memory	Storage (temporary)	Cart items, running subtotal
6	Customer taps "Place Order"	Cart data sent to server	Transport (app → server)	Cart items, customer ID, store ID
7	Server validates order	Each item checked: real menu item? Size valid? Mod available? Store open?	Transform	Raw order → validated order or error
8	Inventory checked	Server verifies items are in stock	Transform (comparison)	Order items vs. current inventory
9	Price calculated	Server computes subtotal, tax, total	Transform	Items + prices + tax rate → total
10	Order summary sent to app	Calculated total returned for confirmation	Transport (server → app)	Total, itemized breakdown
11	Customer confirms and pays	Payment info sent to server	Transport (app → server)	Payment method token, total amount
12	Payment forwarded	Server sends charge to payment provider	Transport (server → payment service)	Amount, payment token, merchant ID
13	Payment processed	Payment provider validates and charges	Transform (external)	Charge request → approval or decline
14	Payment result returned	Approval/decline sent back to server	Transport (payment service → server)	Transaction ID, status, timestamp
15	Order status updated	Status changed from "pending" to "paid"	Transform (state change)	Order status field
16	Order saved	Full order record written to database	Storage (persistent)	All order fields, customer ID, timestamps
17	Order sent to store	Order details sent to barista screen	Transport (server → store display)	Items, mods, customer name, order #
18	Queue position calculated	Server computes estimated wait time	Transform	Orders ahead + avg prep time → estimate
19	Estimate sent to customer	Wait time pushed to app	Transport (server → app)	Estimated minutes
20	Barista prepares drink	(Physical process, not data — but status tracked)	—	—
21	Barista marks "ready"	Taps button on store display	Transport (store display → server)	Order ID, "ready" status
22	Order status updated	Status changed from "in progress" to "ready"	Transform (state change)	Order status field, timestamp
23	Push notification sent	Server sends notification to customer phone	Transport (server → notification service → phone)	Customer device token, message text
24	Notification delivery logged	Whether the notification was delivered or failed	Storage (persistent)	Delivery status, timestamp, device info
25	Customer picks up drink	Barista marks "completed"	Transform (state change)	Order status → "completed," pickup timestamp
26	Receipt generated	Order data formatted into receipt	Transform	Order data → receipt format
27	Receipt stored	Saved to customer's order history	Storage (persistent)	Receipt data, linked to customer account
28	Analytics logged	Order data aggregated into business metrics	Transport (server → analytics)	Order total, items, time to fulfill, store ID

Step 3: Flow Diagram

┌──────────┐    load menu     ┌──────────┐     fetch      ┌──────────┐
│ Customer │ ◄─────────────── │  Server  │ ─────────────► │ Database │
│   App    │                  │          │                │ (menu,   │
│          │ ───────────────► │          │                │inventory)│
│          │   place order    │          │                └──────────┘
└──────────┘                  │          │
     ▲                        │          │ ─────────────► ┌──────────┐
     │ push notification      │          │  charge card   │ Payment  │
     │                        │          │ ◄───────────── │ Provider │
     │                        │          │  confirmation  └──────────┘
     │                        │          │
     │                        │          │ ─────────────► ┌──────────┐
     │                        │          │   send order   │  Store   │
     │                        │          │ ◄───────────── │ Display  │
     │                        │          │   mark ready   └──────────┘
     │                        │          │
     │                        │          │ ─────────────► ┌──────────┐
     │                        │          │   log events   │Analytics │
     │                        └──────────┘                └──────────┘
     │                             │
     │                             │ send notification
     │                        ┌──────────┐
     └────────────────────────│   Push   │
                              │ Service  │
                              └──────────┘

Step 4: Hidden Data Analysis

Let's call out the hidden data specifically:

Hidden Data	Type	Where It Lives	Why It Matters
Timestamps on every action	Metadata	Order record in database	Debugging, performance monitoring, customer disputes
Customer's device token	Configuration	Customer account record	Required for push notifications to work
Tax rate for the store's location	Configuration	Server config or database	Affects price calculation — changes by jurisdiction
Payment provider's transaction ID	Metadata	Order record	Required for refunds, fraud investigation, accounting
Menu version	Metadata	App cache + server	If menu prices changed between browsing and ordering, which price applies?
Notification delivery status	State	Notification log	If customer didn't get notified, was it our fault or theirs?
Inventory snapshot at time of order	Derived	Not typically stored — and maybe it should be	If a customer says "it showed oat milk was available," can you prove whether it was?

That last row is interesting — it's data that doesn't exist in most systems but should, which is something you'd only discover by doing this analysis.

Step 5: What Could Go Wrong (by Lifecycle Stage)

Stage	What Could Fail	Consequence
Transport: Load menu	App can't reach server (no internet)	Customer can't browse. Show cached menu? Or error?
Storage: Menu cache	Cached menu is stale (prices changed)	Customer sees old price, gets charged new price. Angry customer.
Transform: Validate order	Item was removed from menu after customer selected it	Order rejected. Good UX: explain what happened. Bad UX: generic error.
Transport: Payment	Payment service is down or slow	Customer is stuck. Don't show "order confirmed" until payment is confirmed.
Transform: Payment	Card declined	Tell the customer clearly. Don't store the attempt as a completed order.
Transport: Send to store	Store display is offline	Order is paid but barista never sees it. Customer waits forever. Critical failure.
Transform: State change	Barista forgets to tap "ready"	Customer never gets notified. Drink gets cold. Operational failure (not a system bug, but the system should account for it — timeout alert?).
Transport: Push notification	Notification service fails silently	Customer doesn't know drink is ready. System should have fallback (in-app polling?).
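
The stale-menu failure in the table above can be caught before the customer is charged if the app sends back the menu version it displayed. This is a sketch of that idea, not a description of any real ordering system; `Cart`, `StaleMenuError`, and `check_menu_version` are hypothetical names.

```python
from dataclasses import dataclass


class StaleMenuError(Exception):
    """Raised when an order was built against an out-of-date menu."""


@dataclass(frozen=True)
class Cart:
    menu_version: int   # version of the menu the app displayed
    item_ids: tuple


def check_menu_version(cart: Cart, current_version: int) -> None:
    """Reject orders built from a cached menu the server has replaced."""
    if cart.menu_version != current_version:
        raise StaleMenuError(
            f"Menu changed (app has v{cart.menu_version}, server has "
            f"v{current_version}). Please review the updated prices."
        )
```

The error message says what happened and what to do next, which is the difference between the good UX and the generic error described in the table.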

Compare and Contrast: What This Example Teaches

This example demonstrates:

A "simple" system has 27+ data stages — complexity hides in the details
Multiple external services (payment provider, notification service, store display) each introduce transport risks
State management is critical — the order status flows through 5 states, and getting out of sync at any point creates visible bugs
Hidden data (metadata, config, logs) is as important as the obvious data — timestamps, tax rates, and delivery receipts make or break the system's debuggability
Failure at any stage has different consequences — some failures are cosmetic, some lose money, some lose customers
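
The point about state management can be made concrete with a transition table. The sketch below uses the statuses that appear in the lifecycle table (pending, paid, in progress, ready, completed) and assumes they only ever move forward one step; a real system would likely need extra states such as cancelled or refunded.

```python
# Each status maps to the set of statuses it is allowed to move to.
ALLOWED = {
    "pending": {"paid"},
    "paid": {"in progress"},
    "in progress": {"ready"},
    "ready": {"completed"},
    "completed": set(),  # terminal state: nothing comes after it
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current!r} -> {new!r}")
    return new
```

With the rules in one place, an order can never jump from "pending" to "ready" without being paid, and any attempt to do so fails at the exact point where the bug occurs.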

When you encounter the test questions, map them at this level of detail. If your lifecycle map has 5 stages, you probably missed 20.

Data Lifecycle — Example: Bank ATM Withdrawal

The Scenario

A customer walks up to an ATM, inserts their card, enters their PIN, requests $200 cash, and walks away with the money and a receipt.

This is one of the most data-sensitive operations in everyday life. Every piece of data must be exact. There is zero tolerance for error — if the system says the customer has $500, they must have exactly $500. Not $499.99, not $500.01.
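
This requirement has a direct consequence for how money is stored. Binary floating-point numbers cannot represent most decimal fractions exactly, so financial code commonly stores integer cents or uses a decimal type instead. The short demonstration below uses made-up amounts.

```python
from decimal import Decimal

# Binary floating point: the sum is very close to 0.30 but not equal to it.
float_balance = 0.10 + 0.20
print(float_balance == 0.30)                  # False

# Option 1: keep money as a whole number of cents.
cents_balance = 10 + 20
print(cents_balance == 30)                    # True

# Option 2: use Decimal, built from strings rather than from floats.
decimal_balance = Decimal("0.10") + Decimal("0.20")
print(decimal_balance == Decimal("0.30"))     # True
```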

Let's map it.

Step 1: Identify All the Data

Obvious data:

Card data (account number, card info)
PIN
Withdrawal amount ($200)
Account balance
Cash dispensed
Receipt

Less obvious data:

ATM's physical cash inventory (does it have enough $20 bills?)
Daily withdrawal limit for this account
How much the customer has already withdrawn today
ATM network session (the encrypted connection between ATM and bank)
Transaction authorization code
ATM location and ID
Timestamp of the transaction
Account hold/freeze status
Currency denomination preferences (does the ATM give $20s, $50s, or $100s?)
ATM's own status (is the receipt printer working? is the cash drawer jammed?)
Fraud detection signals (is this card being used in a different country than it was 5 minutes ago?)

Step 2: Full Lifecycle Map

#	Stage	What Happens	Category	Key Details
1	Card inserted	ATM reads magnetic stripe or chip data	Transport (card → ATM)	Card number, expiry, bank identifier extracted
2	Card data stored temporarily	ATM holds card data in encrypted memory	Storage (temporary, encrypted)	Never written to disk. Held only for this session.
3	ATM connects to bank network	Encrypted session established	Transport (ATM → bank)	Session ID, ATM ID, ATM location transmitted
4	Card data sent for validation	ATM sends card info to bank	Transport (ATM → bank)	Card number + bank identifier
5	Bank validates card	Is this a real card? Is it active? Is it reported stolen? Expired?	Transform (validation)	Card status checked against bank records
6	Card status returned	Bank sends result back to ATM	Transport (bank → ATM)	"Valid" or specific rejection reason
7	PIN prompt displayed	ATM asks customer for PIN	Transport (ATM → customer screen)	No data in transit — just a UI prompt
8	Customer enters PIN	Keypad captures digits	Transport (keypad → ATM memory)	PIN stored encrypted, never displayed on screen
9	PIN sent for verification	Encrypted PIN sent to bank	Transport (ATM → bank)	PIN is encrypted end-to-end. ATM never knows the real PIN.
10	Bank verifies PIN	Entered PIN compared to stored PIN hash	Transform (comparison)	Bank doesn't store plaintext PINs either — it compares hashes
11	PIN result returned	Bank sends verification result	Transport (bank → ATM)	"Correct" or "incorrect" + remaining attempts count
12	Customer selects "Withdrawal"	Selection captured	Storage (temporary)	Transaction type stored in session
13	Customer enters $200	Amount captured	Storage (temporary)	Amount stored in session
14	ATM checks local cash	Does ATM have enough cash to dispense?	Transform (comparison)	$200 requested vs. ATM cash inventory
15	Withdrawal request sent to bank	ATM sends full transaction request	Transport (ATM → bank)	Account, amount, ATM ID, timestamp
16	Bank checks balance	Is current balance ≥ $200?	Transform (comparison)	Available balance (accounting for holds and pending transactions)
17	Bank checks daily limit	Has customer exceeded daily withdrawal limit?	Transform (comparison)	Today's total withdrawals + $200 vs. limit
18	Bank checks fraud signals	Is this transaction suspicious?	Transform (analysis)	Location, timing, amount patterns checked
19	Bank authorizes (or denies)	All checks pass → generate authorization	Transform	Creates authorization code, places temporary hold on funds
20	Funds held (not yet deducted)	Bank places a hold of $200 on the account	Storage (state change)	Available balance reduced by $200, but actual balance not yet changed
21	Authorization sent to ATM	Bank sends approval + auth code	Transport (bank → ATM)	Authorization code, approved amount
22	ATM dispenses cash	Physical mechanism releases bills	Transport (ATM cash drawer → customer)	$200 in $20 bills. ATM cash inventory reduced.
23	Customer takes cash	ATM sensors detect cash was taken	Transform (state change)	Transaction status: "cash dispensed"
24	Dispensing confirmed to bank	ATM tells bank: cash was taken successfully	Transport (ATM → bank)	Auth code + "dispensed" confirmation
25	Bank finalizes transaction	Hold converted to actual deduction. Balance permanently reduced.	Transform (state change)	Available balance and actual balance both reduced. Transaction recorded.
26	Transaction logged	Full record written to bank's transaction database	Storage (persistent, permanent)	Account, amount, ATM ID, location, timestamp, auth code
27	Receipt printed	ATM formats and prints receipt	Transform + Transport	Transaction data → receipt format → printed paper
28	Session ended	All temporary data in ATM memory cleared	Storage (destruction)	Card data, PIN, session data all wiped
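
Stages 16 and 17 are both comparisons, and the second one depends on derived data: the amount already withdrawn today. Here is a minimal sketch, assuming amounts are stored as integer cents. The `Withdrawal` record, the `authorize` function, and the limit values are hypothetical, and the fraud check from stage 18 is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Withdrawal:
    day: date
    cents: int


def authorize(available_cents, history, today, request_cents, daily_limit_cents):
    """Return "approved" or a specific denial reason."""
    # Stage 16: is the available balance enough?
    if request_cents > available_cents:
        return "denied: insufficient funds"
    # Stage 17: today's total is derived from the records, not stored.
    withdrawn_today = sum(w.cents for w in history if w.day == today)
    if withdrawn_today + request_cents > daily_limit_cents:
        return "denied: daily limit exceeded"
    return "approved"
```

Returning a specific reason rather than a bare yes or no mirrors stage 6 of the table, where the bank sends back "Valid" or a specific rejection reason.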

Step 3: Critical Data Detail — The Two-Phase Commit

Notice stages 20 and 25. This is the most important concept in this example:

The bank does NOT deduct the money when it authorizes the transaction. It places a hold first.

Why? Because of the gap between stages 21 and 24. What if:

The bank authorizes the withdrawal, but the ATM then jams and can't dispense cash?
The customer walks away without taking the money?
The network drops between authorization and dispensing confirmation?

If the bank had already deducted the money at step 19, the customer would lose $200 they never received. Instead:

Phase 1 (Hold): Money is reserved but not gone. If something fails, the hold is released and the customer's balance is restored.
Phase 2 (Finalize): Only after the ATM confirms the cash was taken does the bank permanently deduct.

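This hold-then-finalize design is a two-phase commit in the loose sense: it resembles the distributed-database protocol of that name without following it exactly, and card payment systems commonly call a similar flow authorization and capture. It exists specifically because transport between ATM and bank can fail at any point. The data lifecycle design accounts for the worst case.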
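
The hold-then-finalize sequence described above can be sketched in a few lines. This is an illustration of the idea only, with a hypothetical `Account` class and amounts in integer cents; it ignores concurrency, persistence, and hold expiry.

```python
class Account:
    """Hypothetical account that separates held funds from deducted funds."""

    def __init__(self, balance_cents: int):
        self.balance_cents = balance_cents   # actual balance
        self._holds = {}                     # auth code -> held amount

    @property
    def available_cents(self) -> int:
        return self.balance_cents - sum(self._holds.values())

    def hold(self, auth_code: str, cents: int) -> None:
        """Phase 1: reserve the money without deducting it."""
        if cents > self.available_cents:
            raise ValueError("insufficient available funds")
        self._holds[auth_code] = cents

    def finalize(self, auth_code: str) -> None:
        """Phase 2: the ATM confirmed the cash was taken, so deduct it."""
        self.balance_cents -= self._holds.pop(auth_code)

    def release(self, auth_code: str) -> None:
        """Failure path: drop the hold and restore the available balance."""
        self._holds.pop(auth_code, None)
```

For example, after `hold("A1", 20_000)` on a 50,000-cent account, the available balance is 30,000 while the actual balance is still 50,000. Calling `release("A1")` restores the available balance; calling `finalize("A1")` instead reduces both.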

Step 4: Hidden Data Analysis

Hidden Data	Type	Where It Lives	Why It Matters
PIN attempt counter	State	Bank's card record	After 3 wrong PINs, card is locked. Counter resets on success.
ATM cash inventory by denomination	State	ATM local storage	ATM must know exactly how many $20s, $50s, $100s it has. If inventory tracking is wrong, it could promise cash it can't deliver.
Daily withdrawal running total	Derived	Bank's transaction records	Calculated from today's transactions. Not stored as a single number — derived from the sum of today's withdrawals.
Transaction sequence number	Metadata	Both ATM and bank	Ensures no transaction is processed twice, even if network hiccups cause a retry.
ATM hardware status	State	ATM local diagnostics	Printer jammed? Card reader failing? Cash drawer low? These are all data that affect whether the ATM can complete a transaction.
Authorization expiry	Configuration	Bank rules	A hold might expire after 24 hours if not finalized. This prevents money from being locked indefinitely.
Fraud scoring signals	Derived	Bank's fraud detection system	Geographic velocity (was this card used 1000 miles away 10 minutes ago?), amount patterns, time-of-day patterns.
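
The transaction sequence number in the table is what makes retries safe: if the same request arrives twice, the bank can recognize it and return the original result instead of processing it again. A minimal sketch of that idea follows, with hypothetical names and an in-memory dictionary standing in for the bank's records.

```python
class TransactionLog:
    """Hypothetical bank-side record keyed by (ATM ID, sequence number)."""

    def __init__(self):
        self._results = {}

    def process(self, atm_id: str, sequence: int, handler):
        """Run `handler` once per (atm_id, sequence); replay the result after."""
        key = (atm_id, sequence)
        if key not in self._results:
            # First time this request is seen: do the real work.
            self._results[key] = handler()
        # Retries of the same request get the stored result back.
        return self._results[key]
```

A retried request with the same ATM ID and sequence number therefore changes nothing, while a new sequence number is treated as a new transaction.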

Step 5: What Could Go Wrong

Failure Point	What Happens	Consequence	Correct Response
Network fails after PIN, before authorization	ATM can't reach bank	Transaction cannot proceed	Return card. Display "service unavailable." Don't guess.
Authorization granted, but ATM cash drawer jams	Cash can't be dispensed	Customer authorized but didn't receive money	ATM sends "dispensing failed" to bank. Bank releases hold. Customer balance restored.
Network fails after cash dispensed, before confirmation	ATM can't tell bank the cash was taken	Bank doesn't know to finalize	Left alone, the bank's hold would expire and the money would return to the account even though the customer has the cash. To prevent that, the bank reconciles against the ATM's local transaction log during the next sync and finalizes the deduction.
Power failure mid-transaction	Everything stops	Unclear state	ATM writes transaction state to non-volatile storage at each step. On reboot, it replays the state to determine where it stopped and what needs recovery.
Customer walks away without taking cash	Cash is hanging out of the machine	Security risk + accounting mismatch	ATM retracts cash after a timeout (for example, 30 seconds). Sends "cash retracted" to bank. Hold released.
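The power-failure row depends on one idea: if every step is written down before moving on, the last entry tells you what to do after a reboot. Here is a rough sketch of that decision. The Step names and the plain list standing in for non-volatile storage are assumptions, and a real machine would also check its cash counters before deciding.

```python
from enum import Enum


class Step(Enum):
    AUTHORIZED = "authorized"   # bank placed the hold
    DISPENSED = "dispensed"     # cash left the machine
    CONFIRMED = "confirmed"     # bank acknowledged the finalize message


def recovery_action(journal: list) -> str:
    """Decide what to do on reboot from the last step written to the journal."""
    if not journal:
        return "nothing to recover"
    last = journal[-1]
    if last is Step.AUTHORIZED:
        # The journal shows no dispense: ask the bank to release the hold.
        return "release hold"
    if last is Step.DISPENSED:
        # Cash went out but the bank was never told: send the confirmation again.
        return "resend confirmation"
    return "nothing to recover"


print(recovery_action([Step.AUTHORIZED, Step.DISPENSED]))   # resend confirmation
```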

Compare and Contrast With the Coffee Example

Aspect	Coffee Shop	ATM Withdrawal
Data sensitivity	Low-medium (order data, payment token)	Maximum (financial records, PINs)
Error tolerance	Medium (wrong order is bad, but fixable)	Zero (wrong balance is unacceptable)
Two-phase commit needed?	No (charge and order can be atomic)	Yes (must handle gap between authorization and dispensing)
Physical-digital boundary	Barista → digital status update	Cash drawer → sensors → digital confirmation
Failure cost	Bad customer experience	Financial loss or fraud
Hidden data volume	Moderate	Extensive (fraud signals, hardware status, attempt counters)

The key lesson: the lifecycle structure is the same (storage → transform → transport), but the stakes change everything about how carefully you map it. In the coffee app, missing a notification is annoying. In the ATM system, missing a transaction confirmation means someone loses money.

When you design a system, the first question after "what is the data lifecycle?" is: "What are the stakes when it fails?" The answer determines how much detail your map needs.

The Scenario: Posting a Photo to Social Media

A user takes a photo on their phone, types a caption, adds a location tag, and posts it to a social media platform. Their followers see it in their feeds. Some like it. One person comments. The post appears in search results. A week later, the user checks how many views it got.

This example is interesting because the data fans out — one input (a photo) creates dozens of downstream data flows touching many different parts of the system.

Step 1: Identify All the Data

The obvious:

The photo file
The caption text
The location tag
Likes
Comments

The hidden:

Data	Why It Exists
Photo metadata (EXIF)	Camera embeds date, time, GPS coordinates, camera model, exposure settings into every photo file
Multiple photo sizes	The platform doesn't serve the original 12MB file — it creates thumbnail, medium, and full-size versions
The follow graph	The system needs to know who follows this user to build their feeds
Feed entries for every follower	Each follower's personalized feed needs an entry for this post
Notification records	Followers with notifications enabled need to be alerted
Search index entries	The caption and location need to be searchable
View count	Every time someone sees the post, it's counted
Engagement metrics	Likes, comments, shares, saves — each tracked separately
Content moderation signals	Automated scan for prohibited content, nudity detection, etc.
Ad relevance signals	The platform categorizes the post to match with advertisers
User activity timestamp	"Last active" and "posting frequency" updated
Privacy settings	Who can see this post? Public? Friends only? Custom list?

A single photo post touches at least 17 data categories: the 5 obvious ones plus the 12 hidden ones listed above.

Step 2: Full Lifecycle Map

Phase 1: Upload and Ingest

#	Stage	What Happens	Category
1	User taps "Post"	Photo file + caption + location sent to server	Transport (phone → server)
2	Upload received	Raw data held in temporary upload storage	Storage (temporary)
3	Input validated	File type check (is it actually an image?), file size check (under limit?), caption length check	Transform (validation)
4	Content moderation scan	Automated analysis for prohibited content	Transform (analysis)
5	EXIF data extracted	GPS, timestamp, camera info pulled from photo file	Transform (extraction)
6	EXIF data compared to provided location	If user tagged "Paris" but EXIF says "Tokyo," flag for review	Transform (comparison)
7	Photo resized	Original → thumbnail (150px), medium (600px), large (1200px)	Transform (image processing)
8	Photos stored	All sizes stored in file storage (not the database — a separate file system)	Storage (persistent)
9	EXIF stripped from public copies	GPS and camera data removed from versions served to viewers (privacy)	Transform (redaction)
10	Post record created	Database record: post ID, user ID, caption, location, timestamp, photo URLs, privacy settings	Storage (persistent)

Phase 2: Distribution (Fan-Out)

#	Stage	What Happens	Category
11	Follower list retrieved	System looks up everyone who follows this user	Transport (database → distribution service)
12	Privacy filter applied	Remove followers who are blocked or excluded by privacy settings	Transform (filtering)
13	Feed entries created	For each eligible follower, a feed entry is generated pointing to this post	Storage (persistent — one entry per follower)
14	Notification candidates identified	Which followers have notifications enabled for this user?	Transform (filtering)
15	Notifications dispatched	Push notifications sent to eligible followers	Transport (server → notification service → devices)
16	Notification delivery logged	For each notification: sent/delivered/failed	Storage (persistent)

Phase 3: Indexing

#	Stage	What Happens	Category
17	Caption text indexed	Words from caption added to search index	Transform (tokenization) + Storage (search index)
18	Location indexed	Location added to geographic search	Storage (geo index)
19	Hashtags extracted and indexed	#sunset, #paris pulled from caption and indexed	Transform (extraction) + Storage (hashtag index)
20	Post added to user's profile timeline	Post appears on the user's own profile page	Storage (profile index)

Phase 4: Engagement (Ongoing)

#	Stage	What Happens	Category
21	Follower views post	Post data retrieved and displayed	Transport (server → follower's phone)
22	View recorded	View count incremented	Transform (increment) + Storage (counter update)
23	Follower taps "Like"	Like event sent to server	Transport (phone → server)
24	Like recorded	Like record created (who liked what, when)	Storage (persistent)
25	Like count updated	Post's like count incremented	Transform (increment)
26	Post author notified of like	Notification sent to original poster	Transport (server → phone)
27	Someone comments	Comment text sent to server	Transport (phone → server)
28	Comment validated and stored	Checked for length/prohibited content, then saved	Transform + Storage
29	Comment count updated	Post's comment count incremented	Transform (increment)
30	Post author notified of comment	Notification sent	Transport

Phase 5: Analytics (Later)

#	Stage	What Happens	Category
31	User checks "insights"	Analytics data aggregated from view counts, like records, comment records	Transform (aggregation)
32	Insights displayed	Aggregated data formatted and sent to user	Transform (formatting) + Transport (server → phone)

Step 3: The Fan-Out Problem

This example reveals a pattern the other examples don't: fan-out.

When a user with 10,000 followers posts a photo, the system must:

Create 10,000 feed entries (one per follower)
Potentially send 10,000 notifications
Handle views, likes, and comments from up to 10,000 followers

This is a one-to-many transport and storage problem. The lifecycle of a single post multiplies at the distribution phase.

                                           ┌─ Follower A's feed
                                           ├─ Follower B's feed
    ┌──────┐       ┌────────┐              ├─ Follower C's feed
    │ Post │──────►│Fan-Out │──────────────├─ Follower D's feed
    │      │       │Service │              ├─ ...
    └──────┘       └────────┘              └─ Follower N's feed
                       │
                       │
                       ▼
                  ┌──────────┐             ┌─ Notification → A
                  │Notify    │─────────────├─ Notification → B
                  │Service   │             └─ Notification → (subset)
                  └──────────┘

This creates interesting data lifecycle questions:

Do you create all 10,000 feed entries immediately? Or lazily when each follower opens their app?
What if a follower opens their feed while the fan-out is still in progress? Do they see the post or not?
What if the user deletes the post 5 seconds after posting? Can you recall all 10,000 feed entries?

These are design decisions that emerge directly from mapping the lifecycle.
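The first question (eager or lazy) is easier to weigh once you see both side by side. The sketch below uses in-memory dictionaries in place of real storage, and every name in it is made up for illustration.

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}              # author -> who follows them
following = {"bob": ["alice"], "carol": ["alice"]}   # follower -> who they follow
posts_by_author = defaultdict(list)                  # author -> post ids
feeds = defaultdict(list)                            # follower -> precomputed feed


def publish_eager(author: str, post_id: str) -> None:
    """Fan-out on write: create one feed entry per follower right away."""
    posts_by_author[author].append(post_id)
    for follower in followers.get(author, []):
        feeds[follower].append(post_id)


def read_feed_lazy(user: str) -> list:
    """Fan-out on read: assemble the feed only when the follower opens the app."""
    feed = []
    for author in following.get(user, []):
        feed.extend(posts_by_author[author])
    return feed
```

The eager version pays the cost once per follower at posting time and makes reading cheap. The lazy version stores nothing extra, so deleting a post is trivial, but every feed read has to do the gathering work. Neither is "correct"; the lifecycle map is what makes the tradeoff visible.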

Step 4: Multiple Storage Locations — Same Data

Notice that the same photo exists in multiple forms and multiple places:

Version	Storage Location	Purpose	Lifetime
Original upload	Temporary upload storage	Processing input	Deleted after processing (hours)
Original (full resolution)	Permanent file storage	Backup/recovery, "download original" feature	Forever (or until user deletes post)
Large (1200px)	Permanent file storage + CDN cache	Desktop viewing	Forever
Medium (600px)	Permanent file storage + CDN cache	Mobile feed viewing	Forever
Thumbnail (150px)	Permanent file storage + CDN cache	Grid view, previews	Forever

Five copies of what started as one photo. Each has a different purpose and potentially a different lifecycle. If the user deletes the post, every copy that still exists must be deleted (the temporary upload is normally gone already), and the CDN caches must be invalidated. Missing any one copy means orphaned data sitting in storage forever.
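One way to avoid orphans is to keep a single place that knows every copy. The sketch below is a simplified illustration under assumed names: the storage/ and cdn/ key layout is invented, and a Python set stands in for the file store and the CDN.

```python
SIZES = ("original", "large", "medium", "thumbnail")


def copies_to_delete(post_id: str) -> list:
    """List every stored artifact that must go when a post is deleted."""
    targets = [f"storage/{post_id}/{size}.jpg" for size in SIZES]
    # The resized versions are also cached at the CDN and need invalidation.
    targets += [f"cdn/{post_id}/{size}.jpg" for size in SIZES if size != "original"]
    return targets


def delete_post(post_id: str, store: set) -> list:
    """Delete what exists and report anything that was expected but missing."""
    missing = []
    for key in copies_to_delete(post_id):
        if key in store:
            store.remove(key)
        else:
            missing.append(key)
    return missing


store = set(copies_to_delete("p1"))
print(delete_post("p1", store))   # [] because every expected copy was found
print(len(store))                 # 0
```

If someone adds a new image size later and forgets to add it to SIZES, deletion silently leaves that size behind. That is exactly the kind of gap a lifecycle map is meant to expose.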

Step 5: Comparing All Three Examples

Aspect	Coffee Shop	ATM	Social Media Post
Data flow shape	Linear (order → payment → fulfillment)	Linear with two-phase commit	Fan-out (one post → many feeds)
Number of data copies	1 (the order)	1 (the transaction)	Many (photo versions, feed entries, index entries)
Time sensitivity	Minutes (order should be ready soon)	Seconds (transaction must be instant)	Mixed (post immediately, analytics later)
Deletion complexity	Simple (one record)	N/A (transactions are permanent legal records)	Complex (must remove from all copies, feeds, indexes, caches)
Who consumes the data	Customer + barista	Customer + bank	Thousands of followers, search engines, analytics
Biggest hidden data	Tax config, menu version	Fraud signals, hardware status	Follow graph, EXIF metadata, content moderation

Key Takeaways From This Example

Fan-out multiplies the lifecycle — one action can create thousands of downstream data events
The same data exists in multiple forms — and each form has its own storage, its own lifecycle, and its own deletion requirements
Indexing is a separate lifecycle stage — making data searchable requires transforming and storing it in additional specialized formats
Privacy intersects with data lifecycle — EXIF stripping, privacy-filtered distribution, and blocked-user exclusion are all transforms driven by non-obvious data (privacy settings, block lists)
Analytics are derived data — not stored at the time of action, but aggregated later from atomic records (views, likes, comments)

When mapping a system where one action triggers many reactions, always ask: "How many copies of this data exist, and what happens to all of them when the original changes?"

Data Lifecycle — Common Patterns

Recognizing Patterns

After mapping enough systems, you'll see the same structures repeatedly. Learning to recognize these patterns lets you quickly understand new systems by saying "oh, this is basically a pipeline with a fan-out at the end" instead of mapping every stage from scratch.

Every real system is a combination of the five patterns described below. The coffee shop is CRUD + Request/Response. The ATM is Request/Response with a two-phase commit. The social media post is CRUD + Pipeline + Fan-Out + Event-Driven. Knowing the patterns lets you identify the building blocks.

Pattern 1: CRUD (Create, Read, Update, Delete)

The most basic lifecycle. Data is created, read back, modified, and eventually removed.

Structure

Create:  Input → Validate (Transform) → Store (Storage)
Read:    Request (Transport) → Retrieve (Storage) → Return (Transport)  
Update:  Input → Validate (Transform) → Overwrite (Storage)
Delete:  Request → Remove (Storage) → Confirm (Transport)

Real-World Example: Contact List App

Operation	Lifecycle Steps
Create a contact	User enters name + phone → app validates (phone number format check) → saved to database
Read contacts	User opens app → app requests contacts from database → database returns list → app displays them sorted alphabetically
Update a contact	User edits phone number → app validates new number → database overwrites old record
Delete a contact	User taps delete → app asks "are you sure?" → sends delete request → database removes record → app removes from displayed list

What to Watch For

Read is never just "get the data." There's almost always sorting, filtering, or pagination involved — those are transforms.
Delete is rarely simple. What about related data? If you delete a customer, what happens to their orders? Their reviews? Their saved addresses?
Update conflicts. What if two people update the same record at the same time? Does the last write win? Does the first write win? Or are they told there's a conflict?
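The third answer to the update-conflict question ("they're told there's a conflict") is often done with a version number on each record. The sketch below shows the idea only. ContactStore and ConflictError are hypothetical names, and a dictionary stands in for the database.

```python
class ConflictError(Exception):
    """Raised when a record changed after the caller last read it."""


class ContactStore:
    def __init__(self) -> None:
        self._rows = {}   # contact_id -> (version, phone)

    def create(self, contact_id: str, phone: str) -> int:
        self._rows[contact_id] = (1, phone)
        return 1

    def read(self, contact_id: str) -> tuple:
        return self._rows[contact_id]

    def update(self, contact_id: str, phone: str, expected_version: int) -> int:
        version, _ = self._rows[contact_id]
        if version != expected_version:
            # Someone else wrote first; report it instead of overwriting their change.
            raise ConflictError(f"expected v{expected_version}, found v{version}")
        self._rows[contact_id] = (version + 1, phone)
        return version + 1
```

Two editors who both read version 1 cannot both succeed: the first update moves the record to version 2, and the second is rejected because it still expects version 1.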

Lifecycle Map

┌────────────┐     validate      ┌────────────┐     store       ┌────────────┐
│ User Input │ ───────────────► │  Server    │ ─────────────► │  Database  │
│            │                   │ (transform)│                 │ (storage)  │
│            │ ◄─────────────── │            │ ◄───────────── │            │
│            │   display result  │            │   retrieve      │            │
└────────────┘                   └────────────┘                 └────────────┘

Pattern 2: Pipeline

Data flows through a series of transforms in sequence. Each step's output is the next step's input. No intermediate step stores data permanently; only the final result gets stored or delivered.

Structure

Input → Step 1 (Transform) → Step 2 (Transform) → Step 3 (Transform) → Output

Real-World Example: Photo Upload Processing

When a user uploads a profile photo, it doesn't just get saved. It passes through a pipeline:

Step	Input	Transform	Output
1. Receive	Raw uploaded file	Verify it's actually an image (not an executable or other file disguised with an image extension)	Validated image file
2. Strip metadata	Validated image	Remove EXIF data (GPS, camera info — privacy)	Clean image
3. Resize	Clean image	Create thumbnail (100px), medium (400px), large (800px) versions	Three image files
4. Compress	Three images	Optimize file sizes for web delivery	Three compressed images
5. Content scan	Compressed images (or original)	Automated check for prohibited content	Same images + moderation flag (pass/fail)
6. Store	Compressed images	Save all three sizes to file storage	URLs for each size
7. Update record	Three URLs + moderation flag	Update user's profile record with new photo URLs	Updated database record

Pipeline Characteristics

Order matters. You can't compress before you resize (you'd compress the wrong sizes). You can't strip metadata after you store (the metadata would already be in storage). Each step depends on the previous step's output.

Failure stops the pipeline. If step 5 (content scan) flags the image, steps 6 and 7 never execute. The pipeline has a clear "abort" path at every stage.

Each step is independently testable. Give step 3 a known image, check that the output is three images of the right size. You don't need the rest of the pipeline to test this one step.
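These three characteristics show up directly if you write a pipeline as an ordered list of small functions. The sketch below covers only the first three steps, and it fakes the image as a dictionary, so treat every name and field in it as an assumption made for illustration.

```python
class PipelineAbort(Exception):
    """Raised by a step to stop the pipeline; later steps never run."""


def validate(image: dict) -> dict:
    if image.get("format") not in {"jpeg", "png"}:
        raise PipelineAbort("not a supported image")
    return image


def strip_metadata(image: dict) -> dict:
    # Return a copy without the EXIF block (GPS, camera info).
    return {key: value for key, value in image.items() if key != "exif"}


def resize(image: dict) -> dict:
    # Stand-in for real image processing: just record the target widths.
    return {**image, "sizes": [100, 400, 800]}


STEPS = [validate, strip_metadata, resize]   # order matters


def run_pipeline(image: dict, steps=STEPS) -> dict:
    for step in steps:
        image = step(image)   # each step's output is the next step's input
    return image


print(run_pipeline({"format": "jpeg", "exif": {"gps": "48.85,2.35"}}))
# {'format': 'jpeg', 'sizes': [100, 400, 800]}
```

Because each step is a plain function, you can test resize on its own by handing it a known input, without running validate or strip_metadata first.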

Lifecycle Map

Raw Upload → [Validate] → [Strip EXIF] → [Resize] → [Compress] → [Scan] → [Store] → [Update Record]
                                                                     │
                                                                     ▼ (if flagged)
                                                              [Reject + Notify User]

Pattern 3: Request/Response

One system asks a question, another system answers it. The data makes a round trip.

Structure

Requester → Question (Transport) → Responder → Process (Transform) → Answer (Transport) → Requester

Real-World Example: Weather App

Step	What Happens	Category
1	User opens weather app	— (no data yet)
2	App sends request: "What's the weather for ZIP 10001?"	Transport (app → weather API)
3	Weather API receives request	Transport complete
4	API looks up current conditions for ZIP 10001	Storage (read from weather database)
5	API formats response (temperature, conditions, humidity, forecast)	Transform (raw data → structured response)
6	Response sent back to app	Transport (API → app)
7	App stores response in local cache	Storage (temporary — expires after 15 minutes)
8	App displays weather to user	Transport (app memory → screen)
9	User checks again 5 minutes later	Storage (served from local cache; no new request)
10	15 minutes pass, cache expires	Storage (cache entry deleted)
11	User checks again	Transport (repeat from step 2)

Request/Response Characteristics

There's always a waiting period. Between sending the request and receiving the response, the requester is waiting. What does it show the user? A spinner? Stale cached data? Nothing?

Timeouts are essential. What if the response never comes? The requester must decide: wait forever? Give up after 5 seconds? Show an error?

Caching changes the lifecycle. If you cache responses, you now have the data stored in two places (the source and the cache). They can get out of sync. How stale is acceptable? Who invalidates the cache?
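The 15-minute cache from the weather example can be sketched as a small time-to-live cache. This is an illustration, not a production cache: TTLCache is a made-up name, and the clock parameter is an assumption added so the expiry can be demonstrated without waiting.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # key -> (stored_at, value)

    def get(self, key, fetch):
        """Return a fresh cached value, or call fetch() and cache the result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no request to the source
        value = fetch()          # missing or stale: go back to the source
        self._entries[key] = (now, value)
        return value


cache = TTLCache(ttl_seconds=15 * 60)
print(cache.get("10001", lambda: "sunny"))   # fetched from the source
print(cache.get("10001", lambda: "rainy"))   # still "sunny": served from cache
```

The second call shows the staleness risk in miniature. The source now says "rainy," but the user keeps seeing "sunny" until the entry expires.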

Pattern 4: Event-Driven (Fan-Out)

Something happens, and multiple independent parts of the system react — each with their own lifecycle.

Structure

Event Occurs → Broadcast (Transport)
                ├─→ Listener A → (its own lifecycle)
                ├─→ Listener B → (its own lifecycle)
                └─→ Listener C → (its own lifecycle)

Real-World Example: New User Signs Up

A user creates an account. This single event triggers many independent reactions:

Listener	What It Does	Its Own Lifecycle
Welcome Email Service	Sends a welcome email	Retrieve email template (storage) → Fill in user's name (transform) → Send email (transport) → Log delivery (storage)
Default Settings Service	Creates the user's default preferences	Generate default settings (transform) → Save to database (storage)
Analytics Service	Records the signup event	Format event data (transform) → Write to analytics store (storage)
Onboarding Service	Creates a guided tutorial checklist	Generate checklist (transform) → Save progress tracker (storage)
Admin Dashboard	Updates the "new signups today" counter	Increment counter (transform) → Update dashboard data (storage)
Fraud Detection	Checks if signup looks legitimate	Analyze email domain, IP address, behavior patterns (transform) → Flag or clear (storage)

Fan-Out Characteristics

Listeners are independent. If the welcome email fails, the default settings should still be created. Each listener has its own success/failure path.

The event producer doesn't know (or care) about the listeners. The signup module just says "a user signed up." It doesn't know that six other systems are listening. This is intentional — it keeps the boundary clean.

Order usually doesn't matter. The welcome email can arrive before or after the default settings are created. But sometimes order does matter — the tutorial can't reference the user's settings if settings haven't been created yet. These ordering dependencies need to be explicit.

Fan-out can cascade. The welcome email might trigger its own event ("email sent"), which another listener responds to ("update email tracking dashboard"). One event can cascade into dozens of downstream data lifecycle chains.
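The first two characteristics (independent listeners, a producer that doesn't know who is listening) can be sketched with a tiny in-process event bus. Real systems usually put a message queue here; EventBus and the listener names below are assumptions for the example.

```python
from collections import defaultdict


class EventBus:
    def __init__(self) -> None:
        self._listeners = defaultdict(list)   # event name -> list of callables

    def subscribe(self, event: str, listener) -> None:
        self._listeners[event].append(listener)

    def publish(self, event: str, payload: dict) -> list:
        """Deliver to every listener; collect failures instead of stopping."""
        failures = []
        for listener in self._listeners[event]:
            try:
                listener(payload)
            except Exception as exc:   # one failing listener must not block the rest
                name = getattr(listener, "__name__", repr(listener))
                failures.append((name, str(exc)))
        return failures


def send_welcome_email(payload: dict) -> None:
    raise RuntimeError("mail server unreachable")


def create_default_settings(payload: dict) -> None:
    print(f"settings created for user {payload['user_id']}")


bus = EventBus()
bus.subscribe("user_signed_up", send_welcome_email)
bus.subscribe("user_signed_up", create_default_settings)
print(bus.publish("user_signed_up", {"user_id": 42}))
```

The email listener fails, the settings listener still runs, and the signup code that called publish never mentions either of them by name.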

Pattern 5: Batch Processing

Data accumulates over time, then is processed all at once on a schedule.

Structure

Events accumulate (Storage) → Timer fires → Retrieve batch (Transport) → Process all (Transform) → Store results (Storage) → Deliver (Transport)

Real-World Example: Daily Sales Report

Step	What Happens	Category	Timing
1	Orders happen throughout the day	Storage (each order written to database as it occurs)	Ongoing, real-time
2	Midnight: report job triggers	— (timer event)	Scheduled
3	Job queries all orders for the day	Transport (database → report service)	~Midnight
4	Orders aggregated by category, region, payment method	Transform (aggregation)	~Midnight
5	Summary formatted into report	Transform (formatting)	~Midnight
6	Report stored	Storage (persistent — saved to reports archive)	~Midnight
7	Report emailed to management	Transport (report service → email service → inboxes)	~Midnight

Batch Processing Characteristics

There's a delay between event and processing. An order at 9am isn't reflected in the report until midnight. This is by design — but stakeholders must understand it.

The batch window is critical. If 100,000 orders need processing and the job takes 3 hours, it must start early enough to finish before anyone needs the results. What if order volume doubles?

Failed batches are painful. If the midnight job fails, there's no report in the morning. Is there a retry? A manual trigger? Does someone get alerted?

Idempotency matters. If the job runs twice (maybe it was retried), does it produce the same report or a duplicate? The job must be safe to re-run.
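A common way to make a batch job safe to re-run is to store its result under a key that identifies the batch, so a second run overwrites rather than duplicates. The sketch below assumes a simple order format and uses a dictionary in place of the reports archive; all names are illustrative.

```python
from collections import defaultdict

reports = {}   # report date -> summary; stands in for the reports archive


def run_daily_report(day: str, orders: list) -> dict:
    """Aggregate one day's orders by category and store the result under the date."""
    totals = defaultdict(int)
    for order in orders:
        if order["date"] == day:
            totals[order["category"]] += order["amount"]
    # Keyed by date: a re-run replaces the same entry instead of adding a second one.
    reports[day] = dict(totals)
    return reports[day]


orders = [
    {"date": "2024-05-01", "category": "coffee", "amount": 450},
    {"date": "2024-05-01", "category": "pastry", "amount": 275},
    {"date": "2024-05-02", "category": "coffee", "amount": 300},
]
run_daily_report("2024-05-01", orders)
run_daily_report("2024-05-01", orders)   # retried run
print(reports)   # {'2024-05-01': {'coffee': 450, 'pastry': 275}}
```

Delivery is a separate question. Storing the report twice is harmless here, but emailing it twice is not, so the transport step needs its own "already sent" record.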

Using Patterns to Analyze New Systems

When you encounter a new system, ask:

What's the dominant pattern? (Most features are CRUD at their core)
Where are the pipelines? (Any time data is processed in steps)
Where are the request/response boundaries? (Any time two systems talk)
Where are the fan-out points? (Any time one action triggers multiple reactions)
Is there batch processing? (Any time you hear "nightly," "weekly," "scheduled")

Most systems are a combination. "When a user signs up (CRUD: create user), send a welcome email (event-driven), process their uploaded profile photo (pipeline), and load their personalized dashboard (request/response pulling data from multiple sources)."

Naming the pattern lets you immediately know what lifecycle questions to ask, what failure modes to expect, and how the data flows.

Data Lifecycle — Test Your Understanding

Answer each question thoroughly. There is no code here — only thinking. If you can't answer confidently, revisit the Why and How sections before continuing.

Section A: Identification

Question 1

A weather app shows you the current temperature for your city.

List every piece of data involved in showing that single number on your screen. For each piece, label it as Storage, Transform, or Transport. Some pieces may involve more than one.

Question 2

You take a photo on your phone, apply a filter, and post it to social media. Your friend, in another country, sees it on their phone.

Trace the data lifecycle of that photo from the moment you press the shutter button to the moment your friend sees it. Identify every stage of storage, transform, and transport.

Question 3

A thermostat in a house reads the temperature, and if it's below 68°F, turns on the heater. When the temperature reaches 72°F, it turns the heater off.

What data is involved? Where is each piece stored? What transforms occur? What transport happens?

Section B: Analysis

Question 4

A company stores customer orders in a database. Every night at midnight, a report is generated summarizing the day's sales and emailed to the management team.

Draw the full lifecycle map (table or diagram) for this process. Include data you think the description didn't mention but that must exist for the system to work.

Question 5

You are told: "The system is slow." You know the system does three things:

Receives data from an external API
Processes that data (cleans and aggregates it)
Saves the result to a database

Using only the data lifecycle model, list three distinct hypotheses for why it might be slow, one related to each lifecycle stage (storage, transform, transport).

Question 6

An e-commerce site has a feature: "Customers who bought this also bought..."

What data must be stored to make this feature work? What transform produces the recommendations? When does that transform happen — when the page loads, or ahead of time? What are the tradeoffs of each approach?

Section C: Design

Question 7

You are asked to design a system for a public library's book checkout process. Patrons scan their library card, scan the book, and walk out. Overdue books generate a notification after 14 days.

Produce a full lifecycle map. Include:

All data involved (obvious and hidden)
Every storage location
Every transform
Every transport
What happens when things go wrong (book not in system, card expired, network down)

Question 8

A food delivery app needs a feature: real-time order tracking. The customer can see "Order received," "Being prepared," "Out for delivery," and "Delivered" with live updates.

Map the lifecycle of the order status specifically. Where is status stored? What triggers each status change (transform)? How does the updated status reach the customer's screen (transport)? What happens if the driver's phone loses connectivity?

Question 9

A school wants a system where teachers enter grades, students can view their own grades, and parents receive a weekly email summary.

Three different types of users interact with the same data. Map the full lifecycle showing how grade data flows differently for each user type. Identify where the data is shared and where it diverges.

Section D: Critical Thinking

Question 10

Someone proposes: "Let's just store everything in one big database table and figure out the rest later."

Using what you know about the data lifecycle, explain specifically what problems this creates. Don't just say "it's bad" — identify at least three concrete consequences and relate each one back to storage, transform, or transport.

Question 11

You mapped a system's data lifecycle and found that the same piece of data is stored in three different places (a database, a cache, and a local file).

Is this a problem? Under what circumstances would it be the right design? Under what circumstances would it be a mistake? What specific risks does it introduce?

Question 12

A feature request says: "When a user uploads a profile picture, it should appear immediately on their profile."

This sounds simple. Using lifecycle thinking, list everything that actually needs to happen between "user selects a file" and "image appears on profile page." Identify the stages that could fail and what the user would experience in each failure case.

Grading Rubric

For each question, evaluate your answer against these criteria:

Criteria	What It Means
Completeness	Did you identify all the data, including non-obvious data (metadata, state, config)?
Accuracy	Did you correctly label each stage as storage, transform, or transport?
Failure awareness	Did you consider what happens when things go wrong?
Clarity	Could someone else read your answer and build from it?

If your lifecycle map is missing stages, it means you're not seeing the full picture yet. That's fine — re-read the material and try again. The goal is not to get it right the first time. The goal is to train your brain to automatically think in lifecycles.

Boundaries — Why It Matters

The Hardest Problem in Engineering

Ask any experienced engineer what the hardest part of their job is. They won't say "writing code." They'll say some version of:

"Figuring out where one thing ends and another thing begins."

This is the boundary problem, and it is the single most consequential decision in any system. Get the boundaries right and the system is clean, debuggable, changeable, and explainable. Get them wrong and you have a tangled mess that nobody — not even the person who built it — can understand six months later.

An LLM can write code inside a well-defined boundary. It cannot decide where the boundary should be. That's your job.

What Is a Boundary?

A boundary is an answer to the question: "What is this thing responsible for, and what is it NOT responsible for?"

When you say "the authentication module," you are drawing a boundary. Everything about verifying identity is inside. Everything about displaying dashboards is outside. The boundary is the line between them.

Boundaries exist at every level:

This operation handles validating an email address. It does NOT store the email anywhere.
This feature handles user signup. It does NOT handle password resets.
This module handles all authentication. It does NOT handle what the user sees after logging in.
This service handles the entire user account system. It does NOT handle product inventory.

Why Boundaries Are Hard

Everything feels connected

In any real system, almost everything touches something else. A user's name appears on the profile page, in order confirmations, in admin dashboards, in emails. It's tempting to think it's all one thing. It's not — it's one piece of data crossing multiple boundaries.

Premature abstraction

People draw boundaries too early, before they understand the problem. They create a "UserManager" and a "DataProcessor" and an "EventHandler" before they know what the system actually does. These names describe nothing. They bound nothing. They're boundaries without meaning.

Fear of duplication

"But this code does almost the same thing as that code!" So people merge them, destroying two clear boundaries to create one muddled one. Sometimes duplication is the right answer. Two things that happen to look similar today might evolve in completely different directions tomorrow.

What Happens Without Clear Boundaries

The Ripple Effect

You change one thing and seventeen other things break. This happens because responsibilities leaked across boundaries. The payment code shouldn't need to know about the email template format, but someone took a shortcut and now they're coupled.

The "Nobody Understands This" Problem

A new person joins the team. They ask "how does checkout work?" If boundaries are clean, you can point to the checkout module and say "it's all in there." If boundaries are muddy, the answer is "well, it starts here, but then it calls this thing over there, which triggers this other thing, which writes to this shared table that's also used by..." — and the new person learns nothing.

The "Can't Change Anything" Problem

You need to replace the payment provider. If the payment boundary is clean, you swap out the internals and nothing else knows. If the payment logic is scattered across the codebase, you're rewriting half the application.

The "Testing Is Impossible" Problem

You want to verify that order totals are calculated correctly. If calculation is inside a clear boundary with defined inputs and outputs, you test it directly. If calculation is tangled with database access and UI rendering, you have to spin up the entire system just to test arithmetic.
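Here is what "test it directly" looks like when the calculation sits inside a clear boundary. The function below is a sketch under assumptions: prices are integer cents, the tax rate is a whole-number percent, and the name order_total is invented for the example.

```python
def order_total(item_prices: list, tax_rate_percent: int) -> int:
    """Return the total in cents: subtotal plus tax, rounded to the nearest cent."""
    subtotal = sum(item_prices)
    tax = (subtotal * tax_rate_percent + 50) // 100   # integer round-half-up
    return subtotal + tax


# No database, no screen, no network: just inputs and an expected output.
assert order_total([1000, 250], 8) == 1350
assert order_total([], 8) == 0
```

The function knows nothing about where the prices came from or where the total will be shown. That ignorance is the boundary, and it is what makes the two-line test possible.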

The Hierarchy of Boundaries

Real systems have boundaries nested inside boundaries. Understanding the hierarchy is essential:

Operation

The smallest unit. One focused action.

Validate an email format
Calculate tax on a subtotal
Format a date for display

An operation should do one thing. If you describe it and use the word "and," it might be two operations.

Feature

A user-facing capability composed of operations.

"Sign up" = validate email + validate password + check for duplicate account + create account + send welcome email
"Place order" = validate cart + calculate total + process payment + create order record + send confirmation

A feature is something a user or stakeholder would recognize. "Validate email" is not a feature — "Sign up" is.

Module

A cohesive group of related features.

Authentication module: sign up, log in, log out, reset password, manage sessions
Orders module: browse products, add to cart, checkout, view order history
Notifications module: send email, send push notification, manage preferences

A module should have a clear domain. If you can't name it in one or two words, it might be too broad.

System

The complete application — all modules working together.

Service

An independently deployable system with its own storage. At large scale, modules may become services. But that's an advanced concern — start with modules.

The Naming Test

Here's a simple test for whether your boundary is right: Can you name it clearly?

✅ "AuthenticationModule" — clear, you know what's inside
✅ "OrderCalculator" — clear, it calculates order totals
✅ "EmailSender" — clear, it sends emails
❌ "Utilities" — what's in here? Everything that didn't fit elsewhere? This is a junk drawer, not a boundary.
❌ "DataManager" — manages what data? All data? This name tells you nothing.
❌ "Helper" — helps with what? This is a confession that you didn't know where to put something.

If you can't name it precisely, you haven't defined it precisely. The name IS the boundary.

Why This Matters For Your Career

In the age of LLMs, the engineer who can say:

"This system needs four modules. Here's what each one does. Here's what each one does NOT do. Here are the connections between them."

...is the engineer who leads projects. The one who asks "should I use React or Vue?" is asking the wrong question. The technology doesn't matter until the boundaries are clear.

Boundaries are the blueprint. Everything else is construction.

Boundaries — How: The Method

Drawing Boundaries: A Practical Framework

Boundaries don't appear naturally. You have to draw them deliberately. Here is a repeatable four-step process for identifying where boundaries belong in any system.

Step 1: List the Nouns

Take the system description and extract every noun — every "thing" that exists in the domain.

Don't filter yet. Don't decide what's important. Just list every noun you can find.

Example: "A library system where patrons check out books, librarians manage inventory, and overdue books generate fines."

Nouns:

Patron
Book
Librarian
Inventory
Fine
Checkout (the act of checking out — this is a noun too)
Overdue status (implied)

These nouns are your candidate boundaries. Not all of them will become boundaries, but they are where you start.

Step 2: Group by Responsibility

Ask: "Which nouns are really about the same concern?"

Some nouns are clearly siblings. "Book" and "Inventory" are both about the collection. "Patron" and "Librarian" are both people, but they have fundamentally different responsibilities — so they might belong to different groups.

Create a table:

Concern	Nouns
(name the concern in 1-3 words)	(list the related nouns)

Each concern is a candidate module.

Step 3: Define the Inside and the Outside

For each candidate boundary, write two lists:

✅ Inside: What this module is responsible for. All the actions, data, and rules it owns.
❌ Outside: What this module is explicitly NOT responsible for. Name specific things that someone might accidentally put here.

The "Outside" list is the important one. It's easy to say what something does. It's harder — and more valuable — to say what it does NOT do. The outside list is what prevents scope creep.

If you can't clearly write the "outside" list, your boundary is too vague. Go back to step 2 and re-examine.
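One way to keep yourself honest in Step 3 is to write the two lists down as data. The sketch below is a hypothetical Python structure (Boundary and is_too_vague are made-up names) with illustrative entries for a library system. It flags any boundary that has no "outside" list yet.

```python
from dataclasses import dataclass, field


@dataclass
class Boundary:
    name: str
    inside: list[str] = field(default_factory=list)
    outside: list[str] = field(default_factory=list)

    def is_too_vague(self) -> bool:
        # Rule from the text: no "outside" list means the boundary is too vague.
        return len(self.outside) == 0


circulation = Boundary(
    name="Circulation",
    inside=["record checkouts", "calculate due dates", "record returns"],
    outside=["fine amounts (Finances)", "patron contact info (Accounts)"],
)
draft = Boundary(name="Lending", inside=["everything about lending"])
```

Here draft.is_too_vague() is true, which is the signal to go back to Step 2.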

Step 4: Identify the Connections

Boundaries are not walls — they are membranes. Data flows between them, but only through defined channels.

For each pair of modules that need to communicate, define:

What data crosses the boundary (be specific — not "user data" but "user ID")
Which direction it flows (A asks B, or B notifies A, or both)
What triggers the communication (a user action? a schedule? a state change?)

Draw the connections as arrows with labels. The label should describe the question being asked or the data being passed, not the technical mechanism.
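The three questions above map naturally onto a small record. This is only a sketch: Connection, Direction, and the example values are assumptions for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    ASKS = "A asks B"
    NOTIFIES = "A notifies B"


@dataclass(frozen=True)
class Connection:
    source: str
    target: str
    data: str             # what crosses the boundary, as specific as possible
    direction: Direction  # which way it flows
    trigger: str          # what causes the communication

    def label(self) -> str:
        # The arrow label names the data or question, not the mechanism.
        return f"{self.source} -> {self.target}: {self.data}"


overdue = Connection(
    source="Circulation",
    target="Finances",
    data="patron ID, copy ID, days overdue",
    direction=Direction.NOTIFIES,
    trigger="a checked-out copy passes its due date",
)
```

If you cannot fill in every field for a connection, you have not finished defining it.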

Supporting Concepts

The Responsibility Test

For every piece of logic in your system, ask: "Whose job is it?"

If the answer is clear and singular, your boundaries are right. If the answer is "well, it could be either..." then you have a boundary problem that needs a decision.

Cohesion and Coupling

Two concepts that measure boundary quality:

Cohesion (inside a boundary) — High cohesion = good. Everything inside a boundary should be closely related. If you opened a module and found something unrelated, the boundary is wrong.

Test: "Does every piece inside this boundary serve the same purpose?"

Coupling (between boundaries) — Low coupling = good. Boundaries should depend on each other as little as possible. If changing Module A's internals forces changes inside Module B, they are too tightly coupled.

Test: "If I completely rewrote the inside of this module, would other modules need to change?"

The ideal: high cohesion within boundaries, low coupling between them.

The Naming Test

Can you name each boundary clearly, in 1-3 words, where the name accurately describes EVERYTHING inside?

✅ "Authentication" — clear, you know what's inside
✅ "OrderCalculation" — clear, it calculates order totals
❌ "Utilities" — junk drawer, not a boundary
❌ "DataManager" — manages what? This name hides confusion
❌ "Helpers" — a confession that you didn't know where to put something

The Elevator Test

Can you explain each boundary in one sentence — the kind of sentence you'd say in an elevator?

If your one-sentence explanation includes the word "and" more than once, the boundary might be too wide. If you struggle to fill a sentence at all, it might be too narrow.

What to Look For in the Examples

The following pages work through three complete systems using this four-step method. As you read each one, pay attention to:

How many nouns they start with vs. how many boundaries they end with (it's always fewer boundaries than nouns)
Where the controversial decisions are — the moments where something could reasonably go in two places
What the connection diagram looks like — are there many connections (tightly coupled) or few (loosely coupled)?
How the "outside" lists prevent problems: each "NOT responsible for" statement rules out a kind of misplaced logic before it becomes a bug

Boundaries — Example: Library System

The Scenario

A public library system where patrons check out books, librarians manage the collection, overdue books generate fines, and the library sends reminders. Patrons can search the catalog and place holds on books that are currently checked out.

Step 1: List the Nouns

From the description and common sense about how libraries work:

Patron
Librarian
Book
Copy (a library owns multiple copies of the same book)
Catalog
Inventory
Checkout record
Due date
Return
Hold (a reservation on a book)
Hold queue (when multiple patrons want the same book)
Fine
Payment
Reminder (notification about due dates)
Library card
Account (patron's account)
Search query
Search results

18 nouns. This will NOT become 18 modules.

Step 2: Group by Responsibility

Concern	Nouns	Rationale
The Collection	Book, Copy, Catalog, Inventory	All about what the library has
Patron Accounts	Patron, Library Card, Account	All about who uses the library
Circulation	Checkout, Return, Due Date, Hold, Hold Queue	All about the movement of books between library and patron
Finances	Fine, Payment	All about money
Communication	Reminder	All about notifying patrons
Search	Search Query, Search Results	All about finding books

Six concerns from 18 nouns. But is six the right number? Let's examine.
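A grouping like this is easy to sanity-check mechanically. The Python sketch below copies the Step 1 nouns and the table above into plain data and looks for nouns that were never assigned or were assigned twice. The table spellings are normalized to the Step 1 spellings, so "Checkout" becomes "Checkout record".

```python
nouns = [
    "Patron", "Librarian", "Book", "Copy", "Catalog", "Inventory",
    "Checkout record", "Due date", "Return", "Hold", "Hold queue", "Fine",
    "Payment", "Reminder", "Library card", "Account", "Search query",
    "Search results",
]

concerns = {
    "The Collection": ["Book", "Copy", "Catalog", "Inventory"],
    "Patron Accounts": ["Patron", "Library card", "Account"],
    "Circulation": ["Checkout record", "Return", "Due date", "Hold", "Hold queue"],
    "Finances": ["Fine", "Payment"],
    "Communication": ["Reminder"],
    "Search": ["Search query", "Search results"],
}

grouped = [noun for members in concerns.values() for noun in members]
ungrouped = sorted(set(nouns) - set(grouped))           # listed but never assigned
duplicates = sorted({n for n in grouped if grouped.count(n) > 1})
```

Run against the table above, ungrouped comes out as ["Librarian"] and duplicates is empty. That is a prompt to decide deliberately whether librarians belong to a concern or are simply users of the system.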

Should Search be its own boundary? Search operates on Catalog data, but it doesn't modify it. It's a read-only view with its own complexity (matching, ranking, filtering). Making it separate means the Catalog can change how books are stored without affecting how they're searched. Yes — keep it separate.

Should Communication be its own boundary? Reminders are triggered by Circulation events (book due soon, book overdue). Should reminders live inside Circulation? No — because the library might also want to send other communications later (new book announcements, event invitations). Keeping Communication separate lets it grow without changing Circulation. Yes — keep it separate.

Should Finances be its own boundary? Fines are triggered by Circulation, and payments come from Patrons. Finances sits between them. If we put fines inside Circulation, then Circulation needs to know about money, which is outside its core concern. Yes — keep Finances separate.

Step 3: Define Inside and Outside

Catalog Module

✅ Inside: Adding books to the collection, removing books, tracking how many copies of each book exist, storing book details (title, author, ISBN, genre, description), tracking the physical condition of copies
❌ Outside: Who has checked out which copy (that's Circulation), searching for books (that's Search), book prices or fines (that's Finances), patron information (that's Accounts)

Elevator test: "Manages the library's collection — what books exist and how many copies are available."

Patron Accounts Module

✅ Inside: Creating patron accounts, issuing library cards, storing patron info (name, address, contact), verifying patron identity, managing account status (active, suspended, expired), storing patron preferences
❌ Outside: What books a patron has checked out (that's Circulation), what fines they owe (that's Finances), sending emails to patrons (that's Communication)

Elevator test: "Manages who the library's patrons are and their account status."

Circulation Module

✅ Inside: Recording checkouts (which patron, which copy, what date), calculating due dates, recording returns, managing the hold queue (who wants a book and in what order), enforcing checkout limits (for example, a maximum of 10 books), tracking overdue status
❌ Outside: Book details like title or author (that's Catalog), patron contact info (that's Accounts), fine amounts or payments (that's Finances), sending overdue notices (that's Communication)

Elevator test: "Tracks the movement of books between the library and patrons — who has what and when it's due."

Finances Module

✅ Inside: Calculating fine amounts based on how overdue a book is, recording fine charges, processing payments, tracking patron balance (what they owe), applying fine policies (grace periods, maximum fines, waivers)
❌ Outside: Whether a book is overdue (that's Circulation — Finances gets told), patron contact info (that's Accounts), book details (that's Catalog)

Elevator test: "Handles all money — what patrons owe, what they've paid, and fine policies."

Communication Module

✅ Inside: Deciding how to contact a patron (email, text, postal), formatting messages, sending messages, tracking delivery success/failure, managing communication preferences (patron opts out of texts)
❌ Outside: Deciding when to contact someone (other modules trigger this), determining what the patron owes (that's Finances), knowing what books are due (that's Circulation)

Elevator test: "Delivers messages to patrons through their preferred channel."

Search Module

✅ Inside: Accepting search queries, matching against the catalog (by title, author, genre, ISBN, keyword), ranking results by relevance, filtering (available now, genre, format), paginating results
❌ Outside: Modifying the catalog (that's Catalog), checking out books (that's Circulation), patron info (that's Accounts)

Elevator test: "Helps patrons find books in the collection."

Step 4: Connection Diagram

┌──────────────┐       "does this patron       ┌──────────────┐
│   Patron     │        exist and are            │ Circulation  │
│   Accounts   │◄──── they active?"─────────────│              │
│              │                                 │  - checkouts │
│  - identity  │                                 │  - returns   │
│  - status    │                                 │  - holds     │
└──────────────┘                                 │  - due dates │
       ▲                                         └──────────────┘
       │                                              │     │
       │ "who to contact"                             │     │
       │ + "how"                                      │     │
       │                                              │     │
┌──────────────┐      "this is overdue" ─────────────┘     │
│Communication │◄──── or "due date approaching"             │
│              │                                            │
│  - channels  │      "fine has been charged" ──┐           │
│  - delivery  │◄──── or "payment received"     │           │
└──────────────┘                                │           │
                                          ┌──────────────┐  │ "does this
                                          │  Finances    │  │  book exist?"
                                          │              │◄─┘  + "is a copy
                                          │  - fines     │     available?"
                                          │  - payments  │
                                          └──────────────┘
                                                            │
                    ┌──────────────┐                        │
                    │   Catalog    │◄───────────────────────┘
                    │              │
                    │  - books     │◄──── "search for"
                    │  - copies    │       books matching X
                    │  - details   │
                    └──────────────┘      ┌──────────────┐
                           ▲              │    Search    │
                           └──────────────│              │
                             read-only    │  - queries   │
                             access       │  - results   │
                                          └──────────────┘

Connection Analysis

Let's count the connections:

Circulation connects to: Accounts, Catalog, Finances, Communication = 4 connections
Communication connects to: Accounts, Circulation, Finances = 3 connections
Finances connects to: Communication = 1 connection (and is itself triggered by Circulation)
Search connects to: Catalog = 1 connection (read-only)
Catalog connects to: nothing outbound (it's a foundational data store)
Accounts connects to: nothing outbound (it's a foundational data store)

The natural foundation pieces (Catalog, Accounts) have zero outbound dependencies: they don't need anything from the other modules, while Circulation, Communication, and Search all depend on at least one of them. This is a sign of good architecture: foundational data has no dependencies, and everything else reaches down to it.
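The same analysis can be done on a dependency map instead of by eye. The sketch below is one reading of the diagram above, where an entry A: {B} means "A asks or notifies B". The variable names are made up for illustration.

```python
# Outbound dependencies, read off the connection diagram (an interpretation).
depends_on = {
    "Circulation": {"Accounts", "Catalog", "Finances", "Communication"},
    "Communication": {"Accounts"},
    "Finances": {"Communication"},
    "Search": {"Catalog"},
    "Catalog": set(),
    "Accounts": set(),
}

# Fan-out: how many other modules each module reaches out to.
fan_out = {module: len(targets) for module, targets in depends_on.items()}

# Foundations are the modules with no outbound dependencies at all.
foundations = sorted(m for m, targets in depends_on.items() if not targets)

# The module with the highest fan-out is the one to watch for coupling.
most_coupled = max(fan_out, key=fan_out.get)
```

With this map, foundations is ["Accounts", "Catalog"] and most_coupled is "Circulation", which matches its role as the orchestrator of the system.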

Controversial Decisions and Tradeoffs

Should hold notifications come from Circulation or Communication?

When a hold becomes available (the book was returned), someone needs to tell the patron.

Option A: Circulation sends the notification directly.

Pro: Simpler — fewer module interactions
Con: Circulation now needs to know about email/text/postal preferences, which is outside its domain

Option B: Circulation emits an event ("hold available"), Communication picks it up, looks up the patron's contact preferences, and sends the message.

Pro: Each module stays within its boundary. Communication can be enhanced (add push notifications) without changing Circulation.
Con: More moving parts. If Communication fails, the patron doesn't get notified and doesn't know their hold is ready.

Decision: Option B. The extra complexity is worth it because notification channels will change over time, and Circulation shouldn't need to change when they do.
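Option B can be sketched with a tiny in-process event bus. Everything here is an assumption for illustration: the function names, the "hold_available" event name, and the sample patron data. A real system might use a message queue instead.

```python
from collections import defaultdict
from typing import Callable

# Minimal event bus: event name -> list of handlers.
_subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)


def subscribe(event_name: str, handler: Callable[[dict], None]) -> None:
    _subscribers[event_name].append(handler)


def emit(event_name: str, payload: dict) -> None:
    for handler in _subscribers[event_name]:
        handler(payload)


# --- Communication module: owns channels and delivery ---
contact_preferences = {"patron-42": "email"}  # illustrative data
sent_messages: list[str] = []


def on_hold_available(event: dict) -> None:
    channel = contact_preferences[event["patron_id"]]
    sent_messages.append(f"{channel} to {event['patron_id']}: your hold is ready")


subscribe("hold_available", on_hold_available)


# --- Circulation module: knows nothing about email, text, or postal ---
def record_return(copy_id: str, next_patron_in_queue: str) -> None:
    emit("hold_available", {"patron_id": next_patron_in_queue, "copy_id": copy_id})


record_return("copy-7", "patron-42")
```

Adding a push-notification channel later would change only the Communication section. The "Con" also shows up here: if no handler is subscribed, emit does nothing and the patron is never told.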

Should "patron has too many fines" block checkout?

A patron with $50 in unpaid fines tries to check out a book. Who blocks them?

Option A: Circulation checks with Finances before every checkout.

Pro: The block happens at the right moment
Con: Circulation is now coupled to Finances at checkout time; if Finances is unavailable, nobody can check out a book

Option B: Finances notifies Accounts to suspend the patron. Circulation checks patron status (which it already does) and sees "suspended."

Pro: Circulation doesn't need to ask Finances anything at checkout time. Account status is something it already checks.
Con: There's a window between the fine accruing and the account being suspended where the patron might check out another book.

Decision: Option B for most libraries (the window is acceptable). Option A if the rules are strict and the window is unacceptable (e.g., a very high-value collection).
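A rough Python sketch of Option B follows. The names and the $50 suspension threshold are assumptions (the threshold simply mirrors the example above). Notice that can_check_out reads only account status and never sees a balance.

```python
from enum import Enum


class AccountStatus(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"


# --- Accounts module: owns patron status ---
account_status = {"patron-42": AccountStatus.ACTIVE}


def suspend(patron_id: str) -> None:
    account_status[patron_id] = AccountStatus.SUSPENDED


# --- Finances module: decides when a balance is too high ---
SUSPENSION_THRESHOLD_CENTS = 5000  # assumed policy, matching the $50 example


def on_balance_changed(patron_id: str, balance_cents: int) -> None:
    if balance_cents >= SUSPENSION_THRESHOLD_CENTS:
        suspend(patron_id)


# --- Circulation module: checks status only, never balances ---
def can_check_out(patron_id: str) -> bool:
    return account_status[patron_id] is AccountStatus.ACTIVE


allowed_before = can_check_out("patron-42")
on_balance_changed("patron-42", 5000)
allowed_after = can_check_out("patron-42")
```

The window described in the "Con" is the time between the fine accruing and on_balance_changed running.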

What This Example Teaches

Start with many nouns, end with few modules — 18 nouns became 6 modules
Every boundary decision has a rationale — "Search is separate because..." not just "it felt right"
The outside list prevents future mistakes — explicitly stating "Circulation does NOT handle fines" means nobody accidentally adds fine logic there
Connections are minimal and directional — foundational modules have zero outbound dependencies
Controversial decisions exist in every system — the mark of good design is making them deliberately and documenting why

Boundaries — Example: E-Commerce Platform

The Scenario

An online store where customers browse products, add items to a cart, check out with payment, receive order confirmations, and track shipment. Administrators manage the product catalog, adjust inventory, and process returns. The system supports discount codes and customer reviews.

Step 1: List the Nouns

Customer
Product
Category
Price
Inventory/Stock
Cart
Cart item
Discount code
Order
Order item
Shipping address
Shipping method
Shipment tracking
Payment
Refund
Return
Order confirmation (email)
Shipping notification (email)
Review
Rating
Admin user
Product image

22 nouns.

Step 2: Group by Responsibility

Concern	Nouns	Rationale
Product Catalog	Product, Category, Price, Product Image	What's for sale and how it's described
Inventory	Inventory/Stock	How much of each product is available
Shopping	Cart, Cart Item	The customer's in-progress selection
Pricing	Discount Code, (Price from Catalog)	What things cost and how discounts apply
Orders	Order, Order Item, Shipping Address, Shipping Method	The committed purchase
Payment	Payment, Refund	Moving money
Fulfillment	Shipment Tracking, Return	Physical delivery and returns
Communication	Order Confirmation, Shipping Notification	Emails and notifications
Reviews	Review, Rating	Customer feedback
Customer Accounts	Customer, Admin User	Identity and authentication

10 candidate modules from 22 nouns. Let's examine whether that's the right number.

Merge Decisions

Should Inventory be separate from Catalog? Catalog is about what exists (product descriptions, images, categories). Inventory is about how many are available right now (stock counts, warehouse locations, restock dates). They change at very different rates — product descriptions change rarely, stock counts change constantly. Keep separate.

Should Pricing be separate from Catalog? Prices could live in the Catalog — they're a product attribute. But discount codes, sale pricing, bulk pricing, regional pricing, and coupon logic are complex enough to be their own concern. If pricing lives in Catalog, then every pricing rule change risks affecting product display. Keep separate.

Should Shopping (Cart) be separate from Orders? A cart is temporary, uncommitted, and belongs to a browsing session. An order is permanent, committed, and has legal/financial implications. They feel similar (both contain items) but have completely different lifecycles and rules. Keep separate.
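The lifecycle difference between a cart and an order can be shown directly in types. In this hypothetical Python sketch (Cart, Order, and place_order are illustrative names), the cart is freely editable while the order is a frozen snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    """Temporary and freely editable."""
    items: dict[str, int] = field(default_factory=dict)  # product ID -> quantity

    def add(self, product_id: str, quantity: int = 1) -> None:
        self.items[product_id] = self.items.get(product_id, 0) + quantity


@dataclass(frozen=True)
class Order:
    """Permanent: fields cannot be reassigned after creation."""
    order_id: str
    items: tuple[tuple[str, int], ...]


def place_order(order_id: str, cart: Cart) -> Order:
    # The order takes a snapshot; later cart edits do not affect it.
    return Order(order_id=order_id, items=tuple(sorted(cart.items.items())))


cart = Cart()
cart.add("sku-1", 2)
order = place_order("order-1", cart)
cart.add("sku-2")  # the cart keeps changing; the order does not
```

Both hold items, but only one of them is allowed to change. That difference in rules is why they sit behind different boundaries.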

Split Decisions

Should Customer and Admin be the same module? Both are "accounts" with identity and authentication. But admins have additional permissions: manage products, view all orders, process returns. Customer-specific features (saved addresses, order history, wishlists) don't apply to admins. Split into Customer Accounts and Admin Accounts — or use a single Accounts module with role-based boundaries inside it.

Decision: Single Accounts module with roles. At this scale, splitting them creates unnecessary overhead.

Step 3: Define Inside and Outside

Product Catalog

✅ Inside: Product names, descriptions, images, categories, product pages, product attributes (size, color, weight)
❌ Outside: Current stock levels (Inventory), prices and discounts (Pricing), customer reviews (Reviews), how to ship it (Fulfillment)

Inventory

✅ Inside: Stock counts per product, warehouse locations, low-stock alerts, stock reservations (when an item is in someone's cart or mid-checkout), restock tracking
❌ Outside: Product details (Catalog), pricing (Pricing), order records (Orders)

Shopping (Cart)

✅ Inside: Adding/removing items from cart, updating quantities, cart persistence (surviving a browser refresh), cart expiration (for example, after 30 days of inactivity)
❌ Outside: Product details displayed in the cart (Catalog), prices shown (Pricing), stock availability checks (Inventory), placing the order (Orders)

Pricing

✅ Inside: Base price lookups, discount code validation, discount calculation, sale/promotional pricing rules, tax calculation, bulk pricing tiers
❌ Outside: Product details (Catalog), cart management (Shopping), payment processing (Payment), order recording (Orders)

Orders

✅ Inside: Creating an order from a cart, recording order items, storing shipping address and method, tracking order status (confirmed → processing → shipped → delivered), order history for customers, cancellation logic
❌ Outside: Payment processing (Payment), physical shipment (Fulfillment), price calculation (Pricing), stock management (Inventory), sending confirmation emails (Communication)

Payment

✅ Inside: Charging credit cards, processing refunds, payment method storage (tokenized), payment confirmation, handling payment failures (retry, alternative methods)
❌ Outside: What was ordered (Orders), shipping details (Fulfillment), product info (Catalog)

Fulfillment

✅ Inside: Generating shipping labels, tracking shipment status, delivery confirmation, handling returns (receiving returned items, inspecting condition), return shipping labels
❌ Outside: Order details beyond what to ship (Orders), payment/refund processing (Payment), customer notification (Communication)

Communication

✅ Inside: Email templates, sending emails, push notifications, SMS, delivery tracking, communication preferences
❌ Outside: Deciding when to communicate (other modules trigger), order details (Orders), payment details (Payment)

Reviews

✅ Inside: Submitting reviews, editing reviews, review moderation, star ratings, calculating average rating, displaying reviews
❌ Outside: Product details (Catalog), customer identity (Accounts), order verification ("did this customer actually buy this product?" — needs to ask Orders)

Accounts

✅ Inside: Registration, login/logout, password management, profiles, roles (customer vs. admin), saved addresses, authentication tokens
❌ Outside: Order history (Orders), cart contents (Shopping), payment methods (Payment — though this is debatable)

Step 4: Connection Diagram

┌──────────┐                    ┌──────────┐
│ Accounts │◄──── "who is      │ Shopping  │
│          │       this?" ─────│  (Cart)   │
│- identity│                    │- items    │
│- roles   │                    │- quantities│
└──────────┘                    └──────────┘
     ▲                              │ │
     │                    "product  │ │ "what's in
     │                     info?"   │ │  the cart?"
     │                              ▼ │
     │                         ┌────────┐            ┌──────────┐
     │                         │Catalog │◄───────────│  Search  │
     │                         │        │ "find      │(if added)│
     │                         │-products│ products"  └──────────┘
     │                         │-images │
     │                         └────────┘
     │                              ▲
     │              "product info?" │
     │                              │
     │    ┌──────────┐        ┌──────────┐
     │    │Inventory │◄───────│ Pricing  │
     │    │          │"is it  │          │
     │    │- stock   │ in     │- prices  │
     │    │- reserves│ stock?"│- discounts│
     │    └──────────┘        └──────────┘
     │         ▲                    ▲
     │         │ "reserve stock"    │ "calculate total"
     │         │                    │
     │    ┌────────────────────────────┐
     │    │         Orders             │
     │    │                            │──────►┌──────────┐
     │    │ - order records            │"charge"│ Payment  │
     │    │ - status tracking          │       │          │
     │    │ - history                  │◄──────│- charges │
     │    └────────────────────────────┘ "paid" │- refunds│
     │              │        │                  └──────────┘
     │   "ship this"│        │"order confirmed"
     │              ▼        ▼
     │    ┌──────────┐  ┌──────────────┐
     │    │Fulfillment│  │Communication │
     │    │           │  │              │
     │    │- shipping │──►│- emails      │
     │    │- returns  │  │- notifications│
     │    └──────────┘  └──────────────┘
     │                        ▲
     │                        │ "did they buy it?"
     │                   ┌──────────┐
     └───────────────────│ Reviews  │
       "who wrote this?" │          │
                         │- ratings │
                         └──────────┘

Interesting Boundary Decisions

Where does "stock reservation" live?

When a customer adds an item to their cart, should that item be reserved (so it doesn't sell out while they're browsing)? If so, who manages that?

Option A: Shopping (Cart) tells Inventory to reserve stock.

Pro: Stock is reserved early, fewer disappointed customers at checkout
Con: Cart is coupled to Inventory. What about abandoned carts? Reservations must expire.

Option B: Stock is only reserved at checkout, when the Order is created.

Pro: Simpler. Inventory only talks to Orders, not to Cart.
Con: Customer shops for 20 minutes, goes to checkout, and finds out the item sold out.

Option C: No reservation. First to complete checkout gets it.

Pro: Simplest. No reservation management at all.
Con: High-demand items cause frustration.

Decision for this design: Option B. Reserve at checkout. The tradeoff is acceptable for most e-commerce, and it avoids complex cart-inventory coupling. For flash sales or limited editions, implement Option A with short expiry times (10 minutes).
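Here is a minimal sketch of Option B in Python. The names (reserve, create_order, OutOfStock) and the stock figures are assumptions for illustration. Only Orders calls into Inventory, and the cart never does.

```python
class OutOfStock(Exception):
    pass


# --- Inventory module ---
stock = {"sku-1": 1}     # units on hand (illustrative)
reserved = {"sku-1": 0}  # units held for orders in progress


def reserve(product_id: str, quantity: int) -> None:
    available = stock[product_id] - reserved[product_id]
    if quantity > available:
        raise OutOfStock(product_id)
    reserved[product_id] += quantity


# --- Orders module: the only caller of reserve() under Option B ---
def create_order(cart_items: dict[str, int]) -> str:
    # A fuller version would release earlier reservations if a later item fails.
    for product_id, quantity in cart_items.items():
        reserve(product_id, quantity)
    return "confirmed"


first = create_order({"sku-1": 1})
try:
    second = create_order({"sku-1": 1})
except OutOfStock:
    second = "sold out"  # the Option B downside: discovered only at checkout
```

Moving to Option A for a flash sale would mean letting the cart call reserve as well, plus adding an expiry to each reservation.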

Who should initiate a refund: Fulfillment or Orders?

A return triggers a refund. Who initiates it?

Option A: Fulfillment receives the return, inspects it, and calls Payment to refund.

Pro: Refund happens at the right moment (item received and inspected)
Con: Fulfillment is coupled to Payment

Option B: Fulfillment marks the return as "received and approved." Orders sees this and tells Payment to refund.

Pro: Orders is the orchestrator — it already connects to Payment. Fulfillment stays focused on physical goods.
Con: Extra hop (Fulfillment → Orders → Payment instead of Fulfillment → Payment)

Decision: Option B. Orders already coordinates Payment and Fulfillment, so adding refund orchestration there keeps Fulfillment focused on physical goods.
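As a sketch of that extra hop, assume Fulfillment reports the outcome to Orders with a direct call. An event would work equally well. All names, the sample amount, and the "returned" status are assumptions and not part of the design above.

```python
refunds_issued: list[tuple[str, int]] = []


# --- Payment module ---
def refund(order_id: str, amount_cents: int) -> None:
    refunds_issued.append((order_id, amount_cents))


# --- Orders module: the orchestrator ---
order_totals = {"order-1": 4200}  # illustrative data
order_status = {"order-1": "delivered"}


def on_return_approved(order_id: str) -> None:
    order_status[order_id] = "returned"  # assumed extra status
    refund(order_id, order_totals[order_id])


# --- Fulfillment module: reports the physical outcome, never calls Payment ---
def inspect_return(order_id: str, condition_ok: bool) -> None:
    if condition_ok:
        on_return_approved(order_id)


inspect_return("order-1", condition_ok=True)
```

Fulfillment never learns the refund amount or how it is paid. It only reports what happened to the physical item.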

How do product reviews verify a purchase?

A review should only be written by someone who bought the product. The Reviews module doesn't have order data.

Reviews must ask Orders: "Did customer X buy product Y?" This is a cross-boundary query, and it's the right design — Reviews shouldn't duplicate order data just to check this.
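A cross-boundary query like this can be as narrow as a single yes/no question. In the Python sketch below, PurchaseLookup, OrdersModule, and ReviewsModule are hypothetical names. Reviews receives an answer, never the order records themselves.

```python
from typing import Protocol


class PurchaseLookup(Protocol):
    def has_purchased(self, customer_id: str, product_id: str) -> bool: ...


class OrdersModule:
    def __init__(self) -> None:
        # Order data stays inside Orders (illustrative contents).
        self._purchases = {("cust-1", "sku-1")}

    def has_purchased(self, customer_id: str, product_id: str) -> bool:
        return (customer_id, product_id) in self._purchases


class ReviewsModule:
    def __init__(self, orders: PurchaseLookup) -> None:
        self._orders = orders
        self.reviews: list[tuple[str, str, int]] = []

    def submit(self, customer_id: str, product_id: str, stars: int) -> bool:
        # Cross-boundary query: a yes/no answer, not a copy of the order data.
        if not self._orders.has_purchased(customer_id, product_id):
            return False
        self.reviews.append((customer_id, product_id, stars))
        return True


reviews = ReviewsModule(OrdersModule())
accepted = reviews.submit("cust-1", "sku-1", 5)
rejected = reviews.submit("cust-2", "sku-1", 1)
```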

Comparing Library vs. E-Commerce

Aspect	Library	E-Commerce
Module count	6	10
Why more modules?	Library has a simpler domain	E-commerce has more concerns (pricing, fulfillment, and payments are all complex)
Central orchestrator	Circulation	Orders
Financial complexity	Simple (overdue fines and payments)	Complex (discounts, tax, bulk pricing, refunds)
Physical-digital bridge	Checkout desk	Shipping/fulfillment
Biggest coupling risk	Circulation ↔ Finances	Orders ↔ Payment ↔ Inventory
Common God Module	"LibrarySystem" (does everything)	"OrderProcessor" (checkout + payment + inventory + email)

The key takeaway: more complex domains need more boundaries, but each boundary should still pass the elevator test. If you can't explain a module in one sentence, it's too big — regardless of how complex the overall system is.

Boundaries — Example: Hospital Patient Management

The Scenario

A hospital system that manages patient registration, doctor scheduling, appointments, medical records, prescriptions, lab tests, billing, and insurance claims. Doctors view patient records. Nurses log vitals. Patients access a portal to see appointments and results.

This is the most complex example so far. Real hospital systems are among the most boundary-critical systems in existence — if data leaks between boundaries incorrectly, the consequences can be fatal.

Step 1: List the Nouns

Patient
Doctor
Nurse
Appointment
Schedule
Medical record
Diagnosis
Prescription
Medication
Lab test
Lab result
Vitals (blood pressure, temperature, heart rate)
Bill/Invoice
Insurance claim
Insurance provider
Payment
Patient portal
Department (cardiology, orthopedics, etc.)
Room/Bed
Admission (inpatient stay)
Discharge
Referral
Allergy
Medical history
Emergency contact

25 nouns. Let's find the boundaries.

Step 2: Group by Responsibility

Concern	Nouns	Rationale
People/Identity	Patient, Doctor, Nurse, Emergency Contact	Who people are, not what they do
Scheduling	Appointment, Schedule, Room/Bed	When and where things happen
Clinical Records	Medical Record, Diagnosis, Vitals, Allergy, Medical History	The patient's health data
Medications	Prescription, Medication	What drugs are prescribed and dispensed
Lab/Diagnostics	Lab Test, Lab Result	Tests ordered and their results
Admissions	Admission, Discharge, Referral	Inpatient stays and transfers
Billing	Bill/Invoice, Payment	What the patient owes
Insurance	Insurance Claim, Insurance Provider	Third-party payer processing
Department	Department	Organizational structure
Patient Portal	Patient Portal	Patient-facing access

10 candidate modules. Let's evaluate.

Merge Decisions

Should Insurance merge with Billing? Billing is "calculate what's owed." Insurance is "submit claims to a third party, track approval/denial, handle coverage rules." Insurance has its own external dependencies (insurance company APIs, claim formats, pre-authorization workflows). Keep separate — insurance is complex enough to be its own domain.

Should Patient Portal be a module? The portal is a view into other modules' data — it shows appointments (Scheduling), lab results (Lab), bills (Billing). It doesn't own any unique data. It's a presentation boundary, not a data boundary. The portal is not a module — it's a consumer of other modules' contracts. Important distinction.
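
One way to picture "a consumer of other modules' contracts" is the Python sketch below. The function and parameter names (`build_home_page`, `upcoming_appointments`, and so on) are hypothetical, and the three lambdas stand in for whatever Scheduling, Lab, and Billing would really expose.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PortalHomePage:
    """What the patient sees; assembled on request, never stored by the portal."""

    appointments: list[str]
    lab_results: list[str]
    balance_due_cents: int


def build_home_page(
    patient_id: str,
    upcoming_appointments: Callable[[str], list[str]],  # provided by Scheduling
    released_lab_results: Callable[[str], list[str]],   # provided by Lab
    balance_due_cents: Callable[[str], int],            # provided by Billing
) -> PortalHomePage:
    """Compose one view from three modules' contracts. The portal owns no data."""
    return PortalHomePage(
        appointments=upcoming_appointments(patient_id),
        lab_results=released_lab_results(patient_id),
        balance_due_cents=balance_due_cents(patient_id),
    )


# Stubbed module contracts, for illustration only.
page = build_home_page(
    "patient-42",
    upcoming_appointments=lambda pid: ["2030-01-15 09:00 Cardiology"],
    released_lab_results=lambda pid: ["Lipid panel: released"],
    balance_due_cents=lambda pid: 12_500,
)
```

Nothing in the portal would survive a restart, and nothing needs to: every value on the page is fetched from the module that owns it.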

Should Department be a module? Departments are organizational metadata. They affect scheduling (which department a doctor belongs to) and potentially routing (which department handles a referral). But "Department" alone doesn't have enough logic to be its own module. Merge into People/Identity as an attribute.

Revised Module List

People — identity of patients, providers, staff
Scheduling — appointments and resource allocation
Clinical Records — medical data
Medications — prescriptions and pharmacy
Lab/Diagnostics — tests and results
Admissions — inpatient management
Billing — charges and payments
Insurance — claims and coverage

8 modules.

Step 3: Define Inside and Outside

People

✅ Inside: Patient demographics (name, DOB, address, phone), doctor/nurse profiles, credentials, department assignments, emergency contacts, user authentication for portal access
❌ Outside: Medical history (Clinical Records), what appointments they have (Scheduling), what they owe (Billing), what insurance they have (Insurance — though insurance ID might be stored here as an attribute)

Scheduling

✅ Inside: Creating/canceling/rescheduling appointments, doctor availability calendars, room/bed assignments, appointment reminders, waitlist management, recurring appointment series
❌ Outside: What happens during the appointment (Clinical Records), what it costs (Billing), patient demographics (People), test orders (Lab)

Clinical Records

✅ Inside: Diagnoses, clinical notes, vitals recordings, allergy lists, medical history, treatment plans, imaging records, visit summaries
❌ Outside: Prescriptions (Medications — though they link to diagnoses), test ordering (Lab — though linked to clinical decisions), billing (Billing), appointment logistics (Scheduling)

Why is Clinical Records separate from Medications and Lab? Because clinical records are the narrative of patient care — what was observed, what was decided. Medications and Lab are the actions — what was prescribed, what was tested. These change at different rates, are governed by different regulations, and are managed by different people. A pharmacist manages medications. A lab technician manages tests. A doctor manages the clinical record.

Medications

✅ Inside: Prescriptions (what drug, what dose, what duration), drug interaction checks, refill tracking, pharmacy dispensing records, medication history
❌ Outside: The clinical reason for the prescription (Clinical Records), patient identity (People), billing for medications (Billing)

Lab/Diagnostics

✅ Inside: Ordering tests, tracking sample collection, managing lab workflows, recording results, flagging abnormal results, test history
❌ Outside: Clinical interpretation of results (Clinical Records), billing for tests (Billing), patient identity (People)

Admissions

✅ Inside: Admitting patients (inpatient), bed assignments, transfer between departments, discharge processing, length-of-stay tracking, discharge summaries, referrals to other facilities
❌ Outside: Clinical care during the stay (Clinical Records), billing for the stay (Billing), identity (People)

Billing

✅ Inside: Generating charges from procedures/visits/tests/medications, creating invoices, processing patient payments, tracking outstanding balances, payment plans
❌ Outside: Insurance claims (Insurance — Billing passes charges to Insurance for claim submission), clinical details (Clinical Records), scheduling (Scheduling)

Insurance

✅ Inside: Insurance plan details, pre-authorization requests, claim submission, claim status tracking, coverage verification, denial management, appeal processing
❌ Outside: Generating charges (Billing — Insurance receives charges), clinical justification (Clinical Records provides this when needed for pre-auth), patient identity (People)

Step 4: Connection Diagram

                          ┌───────────┐
                          │  People   │
                          │           │
                          │- patients │
                          │- doctors  │
                          │- staff    │
                          └───────────┘
                         ▲  ▲  ▲  ▲  ▲
                "who?"  /  |  |  |  \  "who?"
                       /   |  |  |   \
          ┌───────────┐  ┌─┴──┴──┴──┐  ┌───────────┐
          │Scheduling │  │Clinical  │  │Admissions │
          │           │  │Records   │  │           │
          │-appts     │  │          │  │-admits    │
          │-calendar  │  │-diagnoses│  │-transfers │
          │-rooms     │  │-vitals   │  │-discharges│
          └───────────┘  │-history  │  └───────────┘
                         └──────────┘
                          │        │
           "prescribed    │        │  "ordered
            based on      │        │   based on
            diagnosis"    ▼        ▼   diagnosis"
                   ┌───────────┐ ┌──────────┐
                   │Medications│ │   Lab    │
                   │           │ │          │
                   │-scripts   │ │-tests    │
                   │-drugs     │ │-results  │
                   │-refills   │ │-samples  │
                   └───────────┘ └──────────┘
                        │              │
                        │ "charges"    │ "charges"
                        ▼              ▼
                      ┌──────────────────┐
                      │     Billing      │
                      │                  │
                      │  - invoices      │
                      │  - payments      │
                      └──────────────────┘
                              │
                              │ "submit claim"
                              ▼
                      ┌──────────────────┐
                      │    Insurance     │
                      │                  │
                      │  - claims        │
                      │  - coverage      │
                      └──────────────────┘

Critical Observation: Data Flows Downward

Notice the shape. People is at the top: every module needs to know who someone is, which is why the "who?" arrows point up into it. Clinical Records is in the middle: clinical decisions drive medications and lab tests (and admissions, though the diagram omits that arrow). Billing is near the bottom, receiving charges from multiple sources. Insurance is at the very bottom, receiving data from Billing.

Apart from those identity lookups into People, no arrows point upward. This is not an accident; it's good boundary design. A module lower in the diagram doesn't need to know about the modules above it, beyond asking People who someone is.

Why Hospitals Are the Extreme Case for Boundaries

Regulatory boundaries are real boundaries

Medical records have different legal protections than billing data. You can't show a receptionist the same data you show a doctor. Boundaries enforce access control. If Clinical Records and Billing are in the same module, it's harder to ensure the billing clerk can't see clinical notes.

Wrong data can kill

If Medications gets the wrong patient's allergy list from Clinical Records, the patient could receive a drug they're allergic to. If Lab results are attributed to the wrong patient, treatment decisions are made on false data. Boundary contracts in healthcare are literally life-critical.

Audit requirements

Every access to a medical record must be logged: who accessed it, when, and why. This is only possible if Clinical Records is a clear boundary with defined entry points. If patient data is scattered across every module, comprehensive auditing is impossible.
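
A rough Python sketch of a "defined entry point" that makes auditing possible. The class and field names are assumptions made for illustration; a real system would persist the log and enforce permissions as well.

```python
import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessLogEntry:
    accessed_by: str
    patient_id: str
    reason: str
    accessed_at: datetime


@dataclass
class ClinicalRecords:
    """Single entry point: every read goes through get_record and is logged."""

    _records: dict[str, dict] = field(default_factory=dict)
    access_log: list[AccessLogEntry] = field(default_factory=list)

    def get_record(self, patient_id: str, accessed_by: str, reason: str) -> dict:
        if not reason.strip():
            raise ValueError("a reason for access is required")
        # Log first, so even a lookup of a missing record leaves a trace.
        self.access_log.append(
            AccessLogEntry(accessed_by, patient_id, reason, datetime.now(timezone.utc))
        )
        if patient_id not in self._records:
            raise KeyError(f"no record for patient {patient_id}")
        # Deep copy: callers cannot mutate the stored record.
        return copy.deepcopy(self._records[patient_id])


records = ClinicalRecords(_records={"patient-42": {"allergies": ["penicillin"]}})
chart = records.get_record("patient-42", accessed_by="dr-lee", reason="pre-op review")
```

Because `get_record` is the only way in, the who/when/why log is complete by construction. If other modules read the underlying storage directly, no amount of logging inside this class would catch it.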

Comparing All Three Examples

Aspect	Library	E-Commerce	Hospital
Modules	6	10	8
Biggest driver of boundary decisions	Clean domain separation	Financial accuracy	Regulatory + safety
Most connected module	Circulation	Orders	Clinical Records
Presentation layer is a module?	No	No	No (Patient Portal is a consumer)
Would work as a monolith?	Yes, for a small library	Yes, for a small store	Risky — regulatory violations likely
Consequence of bad boundaries	Wrong book to wrong patron	Financial errors, bad customer experience	Wrong treatment, legal liability, death
Key lesson	Domain boundaries emerge from nouns	Complexity drives boundary count	Stakes drive boundary rigor

Boundaries — Common Mistakes

The Five Antipatterns

After seeing how boundaries should work across three examples, let's look at how they go wrong. These mistakes are so common that you will encounter every one of them in your career. Recognizing them is half the battle.

Mistake 1: The God Module

What It Looks Like

One module grows to handle a massive portion of the system. It started small and reasonable, then feature after feature was added because "it's related" or "it's easier to put it here."

Before (Bad):

OrderService
├── Create order
├── Calculate totals
├── Apply discount codes
├── Validate inventory
├── Reserve stock
├── Process payment
├── Process refund
├── Generate invoice
├── Send confirmation email
├── Send shipping notification
├── Update order status
├── Track shipment
├── Handle returns
├── Generate sales reports
└── Manage customer loyalty points

15 responsibilities. This module is impossible to name accurately — "OrderService" doesn't cover half of what it actually does. Any change to any of these responsibilities risks breaking all the others.

How to Spot It

The module has more than about seven responsibilities
Its name doesn't accurately describe everything inside
Changes to the module are frequent and scary
Multiple developers are constantly working in the same module and colliding
Testing requires setting up the entire system because everything is connected

After (Fixed):

Orders                     Pricing                Payment
├── Create order           ├── Calculate totals   ├── Charge
├── Update status          ├── Apply discounts    ├── Refund
├── Track history          └── Tax calculation    └── Payment history
└── Cancel order

Inventory                  Fulfillment            Communication
├── Check stock            ├── Ship order         ├── Email templates
├── Reserve stock          ├── Track shipment     ├── Send confirmation
└── Release reservation    └── Process return     └── Send notifications

Billing                    Loyalty
├── Generate invoice       ├── Earn points
└── Payment tracking       └── Redeem points

Nearly the same functionality (sales reporting is left out of this diagram), in eight modules instead of one. Each with 2-4 responsibilities. Each with a clear name. Each changeable independently.

Mistake 2: The Micro-Boundary

What It Looks Like

The opposite of the God Module. Everything is its own boundary, each with trivial responsibility.

Before (Bad):

EmailValidator           ← validates email format
PasswordValidator        ← validates password strength
NameValidator            ← validates name length
AddressValidator         ← validates address format
PhoneValidator           ← validates phone format
DateValidator            ← validates date format
LoginHandler             ← handles login
LogoutHandler            ← handles logout
SessionCreator           ← creates sessions
SessionDestroyer         ← destroys sessions
PasswordHasher           ← hashes passwords
TokenGenerator           ← generates auth tokens

12 modules for what is clearly one concern: Authentication.

How to Spot It

You have modules with only 1-2 functions
Many modules always change together (if EmailValidator changes, LoginHandler probably does too)
Understanding a single feature requires reading 10 modules
The connection diagram looks like a plate of spaghetti

After (Fixed):

Authentication
├── Validate credentials (email, password, etc.)
├── Login / Logout
├── Session management
├── Password hashing
└── Token generation

One module. Five cohesive responsibilities. All related to "verifying and managing user identity." If someone asks "where is the login logic?" the answer is one word: Authentication.

The Test

If two things always change together, they probably belong together. Micro-boundaries violate cohesion — they separate things that should be unified.

Mistake 3: Boundaries Follow Technology, Not Domain

What It Looks Like

DatabaseModule          ← all database operations for all features
APIModule               ← all API endpoints for all features
UIModule                ← all user interface code for all features

Why It's Wrong

Where does "checkout" live? Partly in the API (the checkout endpoint), partly in the Database (saving the order), partly in the UI (the checkout page). The checkout logic is scattered across three modules. To understand checkout, you must read all three.

Where does "user registration" live? Also spread across all three modules. Now checkout and registration code live side-by-side in each module, even though they have nothing to do with each other.

The Result

Low cohesion: each module contains unrelated things (checkout + registration + search + ... all in the same "database module")
High coupling: changing checkout requires changing three modules simultaneously
Impossible to reason about: "where is the checkout logic?" → "everywhere"

After (Fixed):

Checkout                    Registration              Search
├── Checkout API endpoint   ├── Registration API      ├── Search API
├── Checkout database ops   ├── Registration DB ops   ├── Search index ops
└── Checkout UI page        └── Registration UI page  └── Search UI component

Each module contains everything needed for its domain — the API, the data access, and the UI. Now changing checkout only touches the Checkout module.

The Principle

Boundaries should follow the domain (what the system does), not the technology (how it's built). "Checkout" is a domain boundary. "Database" is a technology boundary. Domain boundaries create high cohesion. Technology boundaries create high coupling.

Mistake 4: The Shared Junk Drawer

What It Looks Like

Utils/
├── formatDate()
├── calculateShipping()
├── validateEmail()
├── generatePDF()
├── checkUserPermissions()
├── convertCurrency()
├── sendSlackMessage()
├── compressImage()
├── parseCSV()
└── retryWithBackoff()

Why It's Wrong

"Utils" is not a responsibility. It's a confession that nobody thought about where these things should live. Each function belongs somewhere:

Function	Actually Belongs In
formatDate()	Whichever module needs it, as a private helper. Or a shared "Date/Time" utility if multiple modules truly need the same formatting.
calculateShipping()	Fulfillment or Pricing
validateEmail()	Authentication or Accounts
generatePDF()	Billing (for invoices) or Reporting
checkUserPermissions()	Authentication/Authorization
convertCurrency()	Pricing
sendSlackMessage()	Communication/Notifications
compressImage()	Media/Content processing
parseCSV()	Import/Data Processing
retryWithBackoff()	This is genuinely cross-cutting — it's a shared infrastructure utility

The Damage

Half the system depends on "Utils," creating hidden coupling
Changing any function risks breaking modules you didn't know used it
The module grows without limit: there are no criteria for what should or shouldn't be in it
New developers dump everything there because it's the path of least resistance

After (Fixed):

Move each function to the module it actually belongs to. For the genuinely cross-cutting pieces (retry logic, date formatting if truly universal), create a named infrastructure module:

Infrastructure/Resilience    ← retryWithBackoff()
Infrastructure/Formatting    ← formatDate(), formatCurrency() (if truly shared)

These have names. They have boundaries. They are not growing junk drawers.
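
For a sense of what a named infrastructure utility might contain, here is a small Python version of `retryWithBackoff()`. The parameter names, the defaults, and the choice of which exceptions are retryable are all assumptions made for this sketch.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_seconds: float = 0.1,
    retry_on: tuple[type[Exception], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is reached.

    The base delay doubles after each failure, with random jitter added so
    that many callers do not retry in lockstep. Exceptions not listed in
    retry_on propagate immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay_seconds * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")


# Usage: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda seconds: None)  # no real waiting
```

Notice that the function knows nothing about orders, emails, or payments. That is what makes it genuinely cross-cutting and safe to share.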

Mistake 5: Hidden Cross-Boundary Coupling

What It Looks Like

The modules look clean on the architecture diagram, but they secretly share:

Shared database tables. Module A and Module B both read from and write to the same table. Neither "owns" it. If A changes the table structure, B breaks.

Shared data models. Both modules use the same internal data structures. Changing the structure in one requires changing the other.

Behavior assumptions. Module A depends on Module B processing items in a specific order, but that order isn't in the contract — it's just how B happens to work today. When B is optimized to process in a different order, A breaks.

A Concrete Example

Orders module and Shipping module both access the orders table directly.

Orders Module ──writes──► ┌─────────┐ ◄──reads── Shipping Module
                          │ orders  │
                          │  table  │
                          └─────────┘

This seems efficient. But:

Orders adds a new column → Shipping might break if it does SELECT *
Orders changes the status values from "ready" to "awaiting_shipment" → Shipping was filtering on "ready" and stops seeing orders
Shipping updates the tracking number directly in the orders table → Orders doesn't know it happened and might overwrite it

After (Fixed):

Orders Module                        Shipping Module
      │                                    ▲
      │  "here are orders                  │
      │   ready to ship"                   │
      ▼                                    │
┌──────────────────────────────────────────┐
│           Defined Contract               │
│  Orders provides: order_id, items,       │
│    shipping address, priority            │
│  Shipping returns: tracking_number,      │
│    estimated delivery date               │
└──────────────────────────────────────────┘

Each module owns its own data storage. Communication happens through defined contracts. Neither module needs to know how the other stores its data.
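
The contract in the diagram could be written down roughly as follows. This is a sketch in Python under stated assumptions: the type names, the `priority` values, and the delivery estimates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ShipmentRequest:
    """What Orders provides. Shipping never sees the orders table itself."""

    order_id: str
    items: tuple[str, ...]
    shipping_address: str
    priority: str  # assumed values: "standard" or "express"


@dataclass(frozen=True)
class ShipmentConfirmation:
    """What Shipping returns. Orders stores it however it likes."""

    order_id: str
    tracking_number: str
    estimated_delivery: date


def create_shipment(request: ShipmentRequest, today: date) -> ShipmentConfirmation:
    """Toy Shipping implementation; a real one would talk to a carrier."""
    if not request.items:
        raise ValueError("cannot ship an order with no items")
    days = 1 if request.priority == "express" else 5
    return ShipmentConfirmation(
        order_id=request.order_id,
        tracking_number=f"TRK-{request.order_id}",
        estimated_delivery=today + timedelta(days=days),
    )


confirmation = create_shipment(
    ShipmentRequest("A100", ("book",), "1 Main St", "express"),
    today=date(2030, 1, 1),
)
```

With this in place, Orders can rename its status values or add columns freely. Shipping only breaks if `ShipmentRequest` itself changes, and that change is visible to both teams.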

How to Detect These Mistakes in Any System

Question to Ask	What a Bad Answer Reveals
"Can you explain this module in one sentence?"	God Module (the sentence uses "and" five times) or Micro-Boundary (the sentence is trivially short)
"If I change the inside of this module, what else breaks?"	Hidden coupling (anything other than "nothing" is concerning)
"What's in the Utils/Helpers/Common module?"	Junk drawer (if the answer takes more than 60 seconds, it's too big)
"Where does feature X live?"	Technology-based boundaries (if the answer is "parts of it are in three different modules")
"Do any two modules read from the same database table?"	Shared data coupling
"When was the last time you changed this module without fear?"	God Module or coupling (if the answer is "never," there's a problem)

Boundaries — Test Your Understanding

Answer each question thoroughly. Focus on defining clear responsibilities — what is inside, what is outside, and why.

Section A: Identification

Question 1

A restaurant has:

Customers who order food from a menu
Waitstaff who take orders and deliver food
A kitchen that prepares the food
A billing system that produces the check
A reservation system for booking tables

Identify the natural boundaries in this system. For each boundary, write what is inside it and what is explicitly outside it.

Question 2

Someone proposes the following module structure for a blogging platform:

DatabaseModule — all database operations
APIModule — all API endpoints
UIModule — all user-facing pages

What is wrong with this boundary structure? Propose a better one and explain why it's better.

Question 3

You encounter a module called "Utils" that contains:

A function that formats dates
A function that calculates shipping costs
A function that validates email addresses
A function that generates PDF reports
A function that checks if a user is logged in

For each function, identify which boundary it actually belongs to. Explain why "Utils" is not a real boundary.

Section B: Analysis

Question 4

Two modules exist in a system:

Module A: OrderProcessing

Creates orders
Calculates totals
Applies discount codes
Charges the customer's credit card
Sends a confirmation email

Module B: CustomerManagement

Stores customer profiles
Manages addresses
Tracks order history

Evaluate the boundaries. Is everything in the right place? Identify at least two items that might belong elsewhere, and explain your reasoning.

Question 5

A team is building a social media app. They have one module called "PostManager" that handles:

Creating posts
Editing posts
Deleting posts
Displaying the news feed
Recommending trending posts
Moderating reported posts
Tracking post analytics (views, shares)

This is becoming a God Module. Propose how to split it into smaller, well-defined boundaries. For each new boundary, apply the elevator test (one-sentence description).

Question 6

You have two modules: Inventory and Shipping. Currently:

Inventory knows how to check stock levels
Shipping needs to know stock levels before it can ship

Someone proposes: "Let's just let Shipping read directly from the Inventory database to check stock."

Using the concepts of cohesion and coupling, explain why this is problematic. Propose a better approach.

Section C: Design

Question 7

Design the boundary structure for a school management system with these requirements:

Students enroll in courses
Teachers are assigned to courses
Grades are recorded per student per course
Parents can view their child's grades
The school generates report cards each semester
Attendance is tracked daily

Produce:

A list of modules with inside/outside definitions
A connection diagram showing what each module needs from the others
The elevator-test sentence for each module

Question 8

You are designing a ride-sharing app (like Uber). The core actions are:

Riders request a ride
Drivers accept rides
The system matches riders to nearby drivers
Pricing is calculated based on distance, time, and demand
Payments are processed after the ride
Both riders and drivers can rate each other

Draw the boundary structure. Pay special attention to: where does "matching" live? Where does "pricing" live? Are they the same boundary or different? Justify your decision.

Question 9

A startup asks you to architect a recipe sharing platform. Users can:

Create and share recipes
Search recipes by ingredient, cuisine, or dietary restriction
Save favorite recipes
Create meal plans for the week
Generate a shopping list from a meal plan
Follow other users and see their new recipes

Define the module boundaries. At least one of your decisions should involve a tradeoff — two reasonable options where you pick one. Explain the tradeoff and why you chose what you chose.

Section D: Critical Thinking

Question 10

"Every module should be completely independent and never talk to any other module."

Is this statement true, false, or misleading? Explain when connections between modules are necessary and how to have them without destroying boundary integrity.

Question 11

Two engineers disagree:

Engineer A: "Shopping cart and checkout should be one module. They're part of the same user flow."

Engineer B: "Shopping cart and checkout should be separate modules. A cart is about managing what you want to buy. Checkout is about paying for it."

Both have reasonable arguments. Evaluate both positions. Under what circumstances is A right? Under what circumstances is B right? What would you recommend for a small team building an MVP? What would you recommend for a large team building a mature platform?

Question 12

You inherit a system where a single module called "NotificationService" handles:

Deciding when to send notifications (business rules)
Deciding who to send them to (recipient logic)
Deciding what channel to use (email vs. push vs. SMS)
Formatting the message content
Actually sending via the appropriate channel
Logging what was sent
Managing user notification preferences

This module works fine today. Argue for or against splitting it up. If you split it, where do you draw the new boundaries? If you don't, explain what conditions would eventually force a split.

Grading Rubric

Criteria	What It Means
Clear inside/outside	Each boundary has an explicit list of what it owns and what it doesn't
Reasonable groupings	Related things are together, unrelated things are apart
Minimal coupling	Boundaries connect through narrow, well-defined channels — not through shared databases or deep knowledge of each other's internals
Defensible names	Every boundary can be explained in one sentence that a non-engineer would understand
Tradeoff awareness	Where a decision could go either way, you acknowledge the alternatives and explain your choice

Contracts and Interfaces — Why It Matters

Every Boundary Has a Door

In the previous section, you learned to draw boundaries — to define what a module is responsible for and what it isn't. But boundaries alone aren't enough. Modules need to talk to each other. The question is: how?

The answer is a contract: a precise agreement about what goes in, what comes out, and what happens when something goes wrong.

Without contracts, modules can't communicate reliably. With sloppy contracts, they communicate badly. With clear contracts, they communicate predictably, even when the people who built them have never met.

What Is a Contract?

A contract is a promise at a boundary:

"If you give me this (in this exact shape, meeting these exact conditions), I will give you back that (in this exact shape, with these exact guarantees). If you give me something I don't expect, here is exactly what will happen."

That's it. Three parts:

Inputs — what the caller provides
Outputs — what the caller receives
Error cases — what happens when things aren't right

This exists everywhere in life:

A vending machine: insert $1.50 (input), select B4 (input), receive a bag of chips (output), or get your money back if the item is stuck (error case).
A postal service: provide a correctly addressed envelope with proper postage (input), the letter arrives within 3-5 business days (output), or it's returned to sender if the address is invalid (error case).

In software, contracts are the same idea applied to the boundaries between modules, features, and systems.
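
As an illustration, the vending machine's three-part contract might look like this in Python. The names are hypothetical, "stuck" is modeled simply as "unavailable," and the underpayment error is an extra case added for the sketch rather than one taken from the example above.

```python
from dataclasses import dataclass

PRICE_CENTS = {"B4": 150}            # slot -> price
ITEM_NAMES = {"B4": "bag of chips"}  # slot -> item


class ItemUnavailable(Exception):
    """Error case: nothing is dispensed and the customer's money comes back."""

    def __init__(self, slot: str, refund_cents: int) -> None:
        super().__init__(f"slot {slot} is unavailable; refunding {refund_cents} cents")
        self.refund_cents = refund_cents


class InsufficientPayment(Exception):
    """Error case (assumed for this sketch): not enough money inserted."""


@dataclass(frozen=True)
class Purchase:
    """Output: the item, plus any change owed."""

    item: str
    change_cents: int


def vend(slot: str, inserted_cents: int, stock: dict[str, int]) -> Purchase:
    """Inputs: a slot code and the money inserted.

    Output: a Purchase. Errors: ItemUnavailable or InsufficientPayment.
    """
    if slot not in PRICE_CENTS or stock.get(slot, 0) <= 0:
        raise ItemUnavailable(slot, refund_cents=inserted_cents)
    price = PRICE_CENTS[slot]
    if inserted_cents < price:
        raise InsufficientPayment(f"slot {slot} costs {price} cents; got {inserted_cents}")
    stock[slot] -= 1
    return Purchase(item=ITEM_NAMES[slot], change_cents=inserted_cents - price)


stock = {"B4": 1}
purchase = vend("B4", 150, stock)
```

Everything a caller needs is in the signature and the docstring: what to pass, what comes back, and which failures to expect.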

Why Engineers Care About This

You can build things in parallel

If two people agree on the contract between Module A and Module B, they can build those modules simultaneously without talking to each other again. Person A knows exactly what Module B will provide. Person B knows exactly what Module A will send. The contract is the agreement.

Without contracts, you get: "Wait, I thought you were sending me a list?" "No, I'm sending an object with a list inside it." "But my code expects a plain list." Two days of debugging for a conversation that should have happened upfront.

You can replace parts of the system

If the contract is clear, you can completely replace the internals of a module and nothing breaks — as long as the new version honors the same contract. This is how large systems evolve over years. The contract is the stable surface; the implementation behind it can change freely.
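
A small Python sketch of what "replace the internals" means in practice. `CustomerDirectory` and both implementations are invented for illustration; the caller at the bottom is written once and works with either.

```python
from typing import Optional, Protocol


class CustomerDirectory(Protocol):
    """The contract: a customer ID in, an email address or None out."""

    def email_for(self, customer_id: str) -> Optional[str]: ...


class DictDirectory:
    """First implementation: a dictionary lookup."""

    def __init__(self, emails: dict[str, str]) -> None:
        self._emails = dict(emails)

    def email_for(self, customer_id: str) -> Optional[str]:
        return self._emails.get(customer_id)


class ListDirectory:
    """Replacement implementation: a linear scan over (id, email) pairs."""

    def __init__(self, rows: list[tuple[str, str]]) -> None:
        self._rows = list(rows)

    def email_for(self, customer_id: str) -> Optional[str]:
        for row_id, email in self._rows:
            if row_id == customer_id:
                return email
        return None


def receipt_address(directory: CustomerDirectory, customer_id: str) -> str:
    """Caller code: depends only on the contract, not on either class."""
    return directory.email_for(customer_id) or "no email on file"


old = DictDirectory({"c1": "ana@example.com"})
new = ListDirectory([("c1", "ana@example.com")])
```

Swapping `old` for `new` changes nothing observable to `receipt_address`. The internals differ completely; the promise at the boundary does not.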

You can test things in isolation

If you know the contract, you can test a module without needing the rest of the system. Send it the defined inputs, check that you get the defined outputs. If the contract is vague, you're guessing.
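
To make that concrete, here is a hedged Python sketch. The function `order_total_cents` and its rounding rule are assumptions made for the example; the tax lookup, which would normally come from another module, is replaced by a one-line stub.

```python
from typing import Callable


def order_total_cents(
    item_prices_cents: list[int],
    region: str,
    tax_rate_bps_for: Callable[[str], int],
) -> int:
    """Contract (assumed for this sketch):

    Inputs: non-negative prices in cents, a region code, and a lookup that
            returns the tax rate in basis points (800 means 8%).
    Output: subtotal plus tax, in cents, with tax rounded down.
    Errors: ValueError if any price is negative.
    """
    if any(price < 0 for price in item_prices_cents):
        raise ValueError("prices must be non-negative")
    subtotal = sum(item_prices_cents)
    tax = subtotal * tax_rate_bps_for(region) // 10_000
    return subtotal + tax


# The test needs no tax service, database, or network: a stub honors the contract.
def flat_eight_percent(region: str) -> int:
    return 800


total = order_total_cents([1000, 500], "CA", flat_eight_percent)
assert total == 1620  # 1500 subtotal + 120 tax
```

The stub is legitimate only because the contract says what the lookup returns. Without that, there would be nothing to stub against.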

You can debug faster

"The output is wrong." Okay — does the input match the contract? If yes, the bug is inside the module. If no, the bug is in whoever is calling the module. Contract thinking lets you cut the search space in half immediately.

What Happens Without Contracts

Assumptions replace agreements

"I assumed it would handle that case." "I assumed the data would always be in that format." "I assumed it would return an error if something was wrong." Assumptions are bugs waiting to happen. Contracts replace assumptions with explicit agreements.

Changes break everything

Without a contract, changing a module's behavior is a gamble. You don't know what other modules depend on, because the dependency was never formally defined. You change one thing and discover four other modules were relying on behavior that was never promised — just coincidental.

Nobody knows what anything does

"What does this module accept?" → "Uh, look at the code." If the answer to "what's the contract?" is "read the implementation," there is no contract. And that means nobody really knows what it does without reverse-engineering it every time.

Integration is a nightmare

Two modules need to connect. Without contracts, integration day is chaos — mismatched data formats, unexpected nulls, inconsistent error handling. With contracts, integration is mechanical: both sides already agreed on the interface, so you plug them together and it works.

Interfaces vs. Implementations

This is a critical distinction that separates experienced engineers from everyone else:

The interface is what something does (the contract — inputs, outputs, errors).

The implementation is how it does it internally.

Other modules should only ever depend on the interface. They should never know or care about the implementation. This principle has a name — information hiding — and it is one of the most important ideas in engineering.

Why? Because implementations change constantly. Algorithms get optimized. Databases get swapped. Libraries get updated. But if every other module depends on the implementation details, every change is a catastrophe. If they only depend on the interface, changes are invisible to the outside world.

Think of it like a restaurant kitchen. You (the customer) have a contract with the restaurant: you order from the menu (input), you receive food that matches the description (output), and if they're out of something, they tell you (error). You don't know or care whether the chef uses a gas stove or electric, whether they prep ingredients at 6am or buy them pre-cut, whether there's one cook or five.

The menu is the interface. The kitchen is the implementation. As long as the food is right, you're happy.

Why This Is the Lesson That Separates Beginners From Professionals

Beginners think about how to make it work. Professionals think about how to define the interface so that anyone can make it work.

When an experienced engineer approaches a new problem, they don't start coding. They start defining contracts:

"This module will accept a customer ID and return the customer's order history as a list of orders. If the customer doesn't exist, it returns an empty list, not an error."
"This service will accept an image file up to 10MB in JPEG or PNG format and return a resized version. If the file is too large or the wrong format, it returns an error with a human-readable reason."

These statements are complete enough that anyone (or any LLM) could implement them. That's the point: the contract is the spec. If you can write clear contracts, you can build systems. If you can't, you're guessing — no matter how much code you know.

Contracts and Interfaces — Why It Matters

Every Boundary Has a Door

The answer is a contract: a precise agreement about what goes in, what comes out, and what happens when something goes wrong.

What Is a Contract?

A contract is a promise at a boundary:

"If you give me this (in this exact shape, meeting these exact conditions), I will give you back that (in this exact shape, with these exact guarantees). If you give me something I don't expect, here is exactly what will happen."

That's it. Three parts:

Inputs — what the caller provides
Outputs — what the caller receives
Error cases — what happens when things aren't right

This exists everywhere in life:

A vending machine: insert $1.50 (input), select B4 (input), receive a bag of chips (output), or get your money back if the item is stuck (error case).
A postal service: provide a correctly addressed envelope with proper postage (input), the letter arrives within 3-5 business days (output), or it's returned to sender if the address is invalid (error case).

In software, contracts are the same idea applied to the boundaries between modules, features, and systems.

Why Engineers Care About This

You can build things in parallel

You can replace parts of the system

You can test things in isolation

If you know the contract, you can test a module without needing the rest of the system. Send it the defined inputs, check that you get the defined outputs. If the contract is vague, you're guessing.

You can debug faster

What Happens Without Contracts

Assumptions replace agreements

Changes break everything

Nobody knows what anything does

Integration is a nightmare

Contracts and Interfaces — How: The Method

The Contract Template

A well-defined contract has five components. Use this template every time:

CONTRACT: [Name]

ACCEPTS:
  - [input 1]: [type/shape] — [constraints]
  - [input 2]: [type/shape] — [constraints]

RETURNS:
  - [output]: [type/shape] — [guarantees]

ERRORS:
  - [condition] → [response]
  - [condition] → [response]

SIDE EFFECTS:
  - [what else occurs, if anything]

Designing Good Inputs

Be specific about shape

Not "a customer" — but "a customer ID (text, 8-12 alphanumeric characters)." Not "order data" — but "order containing: list of items (each with product_id and quantity), shipping address, and payment method."

Vague inputs create ambiguity. Ambiguity creates bugs.

Distinguish required from optional

Some inputs must always be present. Others have reasonable defaults. Make this explicit:

ACCEPTS:
  - search_query: text — required, 1-200 characters
  - page_number: number — optional, defaults to 1
  - results_per_page: number — optional, defaults to 20, maximum 100
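
One way the required/optional split might look in code is sketched below, assuming Python. The function name search and the lower bounds of 1 for page_number and results_per_page are assumptions; the contract above only states the defaults and the maximum of 100.

```python
def search(search_query: str, page_number: int = 1, results_per_page: int = 20) -> dict:
    """Validate inputs per the ACCEPTS block and echo the normalized request."""
    # Required input: must be present and 1-200 characters long.
    if not isinstance(search_query, str) or not 1 <= len(search_query) <= 200:
        raise ValueError("search_query is required and must be 1-200 characters")
    # Assumption: pages and page sizes start at 1 (not stated in the contract).
    if page_number < 1:
        raise ValueError("page_number must be at least 1")
    if not 1 <= results_per_page <= 100:
        raise ValueError("results_per_page must be between 1 and 100")
    return {
        "search_query": search_query,
        "page_number": page_number,
        "results_per_page": results_per_page,
    }
```

A caller that passes only the query gets the documented defaults; a caller that asks for 101 results per page gets an error rather than a silent truncation.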

Define constraints

What's valid? What's invalid?

"email: text — must contain exactly one @ symbol and at least one . after the @"
"quantity: number — must be a positive integer, maximum 999"
"date: text — must be in YYYY-MM-DD format, cannot be in the past"

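The three constraints above are precise enough to turn directly into checks. A minimal sketch, assuming Python; the function names are hypothetical, and "today" is passed in explicitly so that the "cannot be in the past" rule is testable.

```python
import re
from datetime import date


def is_valid_email(email: str) -> bool:
    """Exactly one @ symbol and at least one '.' after the @."""
    if email.count("@") != 1:
        return False
    _, domain = email.split("@")
    return "." in domain


def is_valid_quantity(quantity: object) -> bool:
    """A positive integer, maximum 999 (booleans are rejected)."""
    return isinstance(quantity, int) and not isinstance(quantity, bool) and 1 <= quantity <= 999


def is_valid_date(text: str, today: date) -> bool:
    """YYYY-MM-DD format, a real calendar date, and not in the past."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return False
    try:
        parsed = date.fromisoformat(text)
    except ValueError:  # e.g. 2030-02-30
        return False
    return parsed >= today
```

Note that the email rule here is exactly the one quoted above, not a full email-address validator.
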
Designing Good Outputs

Be explicit about guarantees

Don't just say "returns a list." Say:

"Returns a list of orders, sorted by date descending"
"Returns an empty list if no orders exist" (different from returning an error)
"Each order contains: order_id, date, total, and status"

Define the shape

RETURNS:
  - user:
    - id: text
    - name: text
    - email: text
    - created_date: text (YYYY-MM-DD format)
    - is_active: yes/no

Handle empty results explicitly

What happens when there's nothing to return?

Return an empty list?
Return nothing at all?
Return an error?

These are three different behaviors. The contract must specify which one.
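
To see why the choice matters, here is a small sketch of the three behaviors side by side, assuming Python. The ORDERS data, the function names, and NoOrdersError are all hypothetical.

```python
from typing import Optional

# Hypothetical data: one customer with two orders.
ORDERS = {"cust-1": ["order-10", "order-11"]}


class NoOrdersError(Exception):
    """Raised by the variant that treats 'nothing found' as an error."""


def orders_or_empty(customer_id: str) -> list[str]:
    """Behavior 1: always a list, possibly empty."""
    return list(ORDERS.get(customer_id, []))


def orders_or_none(customer_id: str) -> Optional[list[str]]:
    """Behavior 2: None when there is nothing to return."""
    found = ORDERS.get(customer_id)
    return list(found) if found else None


def orders_or_error(customer_id: str) -> list[str]:
    """Behavior 3: an error when there is nothing to return."""
    found = ORDERS.get(customer_id)
    if not found:
        raise NoOrdersError(f"No orders for {customer_id}")
    return list(found)
```

A caller that loops over the result works unchanged with the first variant, needs a None check with the second, and needs error handling with the third. Each caller must know which one it is dealing with.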

Designing Good Error Cases

Errors are part of the contract, not an afterthought.

Be exhaustive

List every way the operation can fail:

ERRORS:
  - Input validation:
    - email is empty → error: "Email is required"
    - email format is invalid → error: "Invalid email format"
    - email already exists → error: "Email already registered"
  - Business rules:
    - account is suspended → error: "Account suspended"
  - System failures:
    - database unreachable → error: "Service temporarily unavailable"

Distinguish caller errors from system errors

Caller errors: "you sent me bad input" — expected, the caller should handle them
System errors: "something is broken internally" — unexpected, needs investigation
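
In code, this distinction is often expressed as two error types so that callers can react differently to each. A rough sketch, assuming Python; the class names, the register_email function, and the handle wrapper are hypothetical, while the messages come from the ERRORS block above.

```python
class ContractError(Exception):
    """Base class for every error the contract documents."""


class CallerError(ContractError):
    """The caller sent bad input; expected, and the caller should handle it."""


class SystemFailure(ContractError):
    """Something is broken internally; unexpected, needs investigation."""


def register_email(email: str, existing: set, database_up: bool = True) -> str:
    if not database_up:
        raise SystemFailure("Service temporarily unavailable")
    if not email:
        raise CallerError("Email is required")
    if email.count("@") != 1 or "." not in email.split("@")[1]:
        raise CallerError("Invalid email format")
    if email in existing:
        raise CallerError("Email already registered")
    existing.add(email)  # side effect: the email is now registered
    return email


def handle(email: str, existing: set, database_up: bool = True) -> str:
    """Caller code: one branch per error category."""
    try:
        register_email(email, existing, database_up)
        return "ok"
    except CallerError as exc:
        return f"show to user: {exc}"
    except SystemFailure as exc:
        return f"alert operations: {exc}"
```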

Define recovery guidance

"If 'rate limit exceeded,' wait 60 seconds and retry"
"If 'session expired,' re-authenticate and retry"
"If 'item out of stock,' do not retry — display message to user"

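Recovery guidance like this can be written down as data, so the caller's behavior is looked up rather than improvised. A minimal sketch, assuming Python; the action names and the "escalate" fallback for unlisted errors are assumptions, not part of the contract.

```python
# Maps each documented error to (action, seconds to wait before retrying).
RECOVERY = {
    "rate limit exceeded": ("retry", 60),
    "session expired": ("reauthenticate_then_retry", 0),
    "item out of stock": ("show_message", None),  # do not retry
}


def next_action(error_message: str) -> tuple:
    """Return the documented recovery step, or escalate if none is documented."""
    return RECOVERY.get(error_message, ("escalate", None))
```
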
Documenting Side Effects

A side effect is anything the operation does besides returning a value:

PlaceOrder returns the order ID, but also sends a confirmation email
DeleteAccount returns success, but also erases all stored data
Login returns a session token, but also logs the event and updates the last-login timestamp

If a side effect is not documented, someone will depend on it unknowingly.
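
One way to keep a side effect visible is to make it a named collaborator and state it in the docstring. The sketch below assumes Python; RecordingMailer, place_order, and the ORD-0001 ID format are hypothetical stand-ins for the PlaceOrder example above.

```python
import itertools
from dataclasses import dataclass, field

_order_numbers = itertools.count(1)  # hypothetical ID source


@dataclass
class RecordingMailer:
    """Stand-in for an email service; records what would have been sent."""

    sent: list = field(default_factory=list)

    def send_confirmation(self, order_id: str) -> None:
        self.sent.append(order_id)


def place_order(items: list, mailer: RecordingMailer) -> str:
    """RETURNS: the new order ID.

    SIDE EFFECTS: sends a confirmation email for the order.
    """
    if not items:
        raise ValueError("Order must contain at least one item")
    order_id = f"ORD-{next(_order_numbers):04d}"
    mailer.send_confirmation(order_id)  # the documented side effect
    return order_id
```

Because the mailer is passed in, a test can check that exactly one confirmation was sent, which is how an undocumented dependency on the email would otherwise go unnoticed.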

The Contract Review Checklist

When evaluating a contract, ask:

Can a stranger implement this? Could someone who has never seen the system build a correct implementation from this contract alone?
Are all inputs fully defined? Shape, constraints, required vs. optional?
Are all outputs fully defined? Shape, guarantees, empty behavior?
Is every error case listed? Input validation, business rules, system failures?
Are side effects documented? Everything beyond returning the output?
Is the contract implementation-free? Does it say what without saying how?
Could this contract survive a rewrite? If internals were completely replaced, would it still make sense?

What to Look For in the Examples

The following pages show complete contract sets for three different systems. As you read:

Notice the level of detail in inputs — every constraint, every edge case
Notice how errors are categorized — caller errors vs. system errors
Notice side effects you wouldn't have thought of — logging, notifications, state changes
Notice how contracts chain together — the output of one becomes the input of the next
Compare the same type of operation across different systems — a "create" operation in a library vs. a restaurant vs. a bank

Contracts — Example: Library Checkout System

The Scenario

A patron visits the library, finds a book, and checks it out. Later, they return it. If it's late, a fine is assessed. They can also place a hold on a book that's currently checked out by someone else.

We'll define the full contract for every operation in the Circulation module.

Contract 1: Check Out a Book

CONTRACT: CheckOutBook

ACCEPTS:
  - patron_id: text — required, must be a valid library card number (format: LIB-XXXXX 
    where X is a digit)
  - copy_id: text — required, must be a valid physical copy ID (format: CPY-XXXXXXX)

RETURNS:
  - checkout_record:
    - checkout_id: text (unique identifier for this checkout)
    - patron_id: text
    - copy_id: text
    - book_title: text (included for convenience — pulled from Catalog)
    - checkout_date: date (YYYY-MM-DD, always today)
    - due_date: date (YYYY-MM-DD, always 14 days from checkout_date)

ERRORS:
  - patron_id not found → error: "Unknown patron" (caller error)
  - patron account is expired → error: "Patron account expired. Renewal required." (caller error)
  - patron account is suspended → error: "Patron account suspended. Contact librarian." (caller error)
  - patron has reached checkout limit (10 books) → error: "Checkout limit reached. 
    Return a book before checking out another." (caller error)
  - patron has unpaid fines over $25 → error: "Outstanding fines exceed limit. 
    Payment required before checkout." (caller error)
  - copy_id not found → error: "Unknown copy" (caller error)
  - copy is not currently available (already checked out) → error: "Copy not available. 
    Currently checked out. Consider placing a hold." (caller error)
  - copy is marked damaged/withdrawn → error: "Copy not available for checkout" (caller error)
  - database unreachable → error: "System temporarily unavailable. Please try again." (system error)

SIDE EFFECTS:
  - Copy status changed from "available" to "checked out" in Catalog
  - Patron's active checkout count incremented
  - Checkout event logged with timestamp, patron_id, copy_id, librarian_id (who processed it)
  - If patron had a hold on this book, the hold is consumed (removed from hold queue)

Why This Level of Detail Matters

Notice the error cases. There are 9 distinct error conditions. A beginner would list 2 or 3 ("book not found, patron not found"). An experienced engineer knows that each of these 9 conditions requires a different response from the caller:

"Patron expired" → the librarian can renew them on the spot
"Fines exceed limit" → the librarian directs them to payment
"Copy not available" → suggest placing a hold (a different operation)
"System unavailable" → retry later (completely different from the others)

Each error tells the caller what to do next. That's a good contract.
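
A contract this detailed translates almost line by line into code. The following is a partial sketch, assuming Python and in-memory dictionaries in place of the Catalog and patron records. The order in which conditions are checked is an assumption, and checkout_id, book_title, logging, hold consumption, and the database error are omitted for brevity.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal

LOAN_DAYS = 14
CHECKOUT_LIMIT = 10
FINE_LIMIT = Decimal("25.00")


class CheckoutError(Exception):
    """Carries the message the contract lists for each condition."""


@dataclass
class Patron:
    status: str = "active"  # "active", "expired", or "suspended"
    active_checkouts: int = 0
    unpaid_fines: Decimal = Decimal("0.00")


def check_out_book(patrons: dict, copies: dict, patron_id: str, copy_id: str, today: date) -> dict:
    patron = patrons.get(patron_id)
    if patron is None:
        raise CheckoutError("Unknown patron")
    if patron.status == "expired":
        raise CheckoutError("Patron account expired. Renewal required.")
    if patron.status == "suspended":
        raise CheckoutError("Patron account suspended. Contact librarian.")
    if patron.active_checkouts >= CHECKOUT_LIMIT:
        raise CheckoutError("Checkout limit reached. Return a book before checking out another.")
    if patron.unpaid_fines > FINE_LIMIT:
        raise CheckoutError("Outstanding fines exceed limit. Payment required before checkout.")

    copy_status = copies.get(copy_id)
    if copy_status is None:
        raise CheckoutError("Unknown copy")
    if copy_status == "checked out":
        raise CheckoutError("Copy not available. Currently checked out. Consider placing a hold.")
    if copy_status != "available":  # damaged or withdrawn
        raise CheckoutError("Copy not available for checkout")

    # Side effects listed in the contract (status change and count increment).
    copies[copy_id] = "checked out"
    patron.active_checkouts += 1

    return {
        "patron_id": patron_id,
        "copy_id": copy_id,
        "checkout_date": today.isoformat(),
        "due_date": (today + timedelta(days=LOAN_DAYS)).isoformat(),
    }
```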

Contract 2: Return a Book

CONTRACT: ReturnBook

ACCEPTS:
  - copy_id: text — required, must be a valid physical copy ID

    Note: patron_id is NOT required. The system looks up who has this copy checked out.
    This matches real-world behavior — you return a book, not "your checkout record."

RETURNS:
  - return_record:
    - return_id: text
    - checkout_id: text (the original checkout this return closes)
    - patron_id: text (who had it)
    - copy_id: text
    - checkout_date: date
    - due_date: date
    - return_date: date (today)
    - days_overdue: number (0 if on time, positive if late)
    - fine_assessed: currency (0.00 if on time)

ERRORS:
  - copy_id not found → error: "Unknown copy"
  - copy is not currently checked out → error: "This copy is not checked out"
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Copy status changed from "checked out" to "available" in Catalog
  - Patron's active checkout count decremented
  - If days_overdue > 0, a fine is created in the Finances module
    (amount = days_overdue × $0.25, capped at replacement cost of book)
  - If there is a hold queue for this book, the next patron in the queue is notified
    (via Communication module)
  - Return event logged with timestamp, copy_id, condition notes (if any)

Key Design Decisions in This Contract

The return contract accepts copy_id, not patron_id. This is a deliberate design choice that matches the physical reality: a librarian scans the book, not the patron's card. The system figures out who had it. This reduces errors (the patron doesn't need their card to return).

The fine amount is returned, but creating the fine is a side effect. The return operation calculates the fine and includes it in the return record (for display), but the actual fine creation is a side effect handled by the Finances module. Return doesn't need to know how fines are stored or managed.

Hold notification is a cascading side effect. Returning a book might mean someone else is waiting for it. The contract documents this so that whoever implements it knows they must check the hold queue.
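
The fine rule in the ReturnBook contract (days_overdue × $0.25, capped at the replacement cost) is simple enough to check by hand. A small sketch, assuming Python and the decimal module for currency; the function name assess_return is hypothetical.

```python
from datetime import date
from decimal import Decimal

DAILY_FINE = Decimal("0.25")


def assess_return(due_date: date, return_date: date, replacement_cost: Decimal) -> tuple:
    """Return (days_overdue, fine_assessed) per the ReturnBook contract."""
    # 0 if on time or early, positive if late.
    days_overdue = max(0, (return_date - due_date).days)
    # Fine grows by $0.25 per day but never exceeds the replacement cost.
    fine = min(DAILY_FINE * days_overdue, replacement_cost)
    return days_overdue, fine.quantize(Decimal("0.01"))
```

For a book due on 2024-03-15 and returned on 2024-03-18, this gives 3 days overdue and a fine of 0.75.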

Contract 3: Place a Hold

CONTRACT: PlaceHold

ACCEPTS:
  - patron_id: text — required, valid library card number
  - book_id: text — required, valid book ID (not copy_id — the patron wants
    the book, not a specific physical copy)

RETURNS:
  - hold_record:
    - hold_id: text
    - patron_id: text
    - book_id: text
    - book_title: text
    - hold_date: date (today)
    - queue_position: number (1 = you're next, 2 = one person ahead of you, etc.)
    - estimated_availability: text ("approximately 2 weeks" based on due dates
      of current checkouts and queue length)

ERRORS:
  - patron_id not found → error: "Unknown patron"
  - patron account expired/suspended → error: "Account not active"
  - book_id not found → error: "Unknown book"
  - patron already has a hold on this book → error: "Hold already exists for this book"
  - patron currently has this book checked out → error: "You currently have this book.
    Return it instead of placing a hold."
  - patron has reached hold limit (5 holds) → error: "Hold limit reached"
  - at least one copy of this book is available (no need for a hold) → error: "Copies are
    available now. No hold needed — check it out directly."
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Hold is added to the queue for this book
  - Hold event logged

What Makes This Contract Interesting

Book vs. Copy distinction. When checking out, you specify a copy (a physical item). When placing a hold, you specify a book (the title). The system decides which copy to assign when one becomes available. This distinction matters because it's a different level of abstraction — and the contract makes it explicit.

"Copies are available" is an error. You can place a hold on a book that has copies available — but this contract treats it as an error because the correct action is to check it out, not hold it. This is a business rule baked into the contract. A different library might allow it. The contract forces the decision to be explicit.

Estimated availability is a best guess. The contract says "approximately" — this sets expectations. The caller knows not to treat this as a guarantee.

Contract 4: Cancel a Hold

CONTRACT: CancelHold

ACCEPTS:
  - hold_id: text — required

RETURNS:
  - confirmation:
    - hold_id: text
    - status: "cancelled"
    - cancelled_date: date

ERRORS:
  - hold_id not found → error: "Unknown hold"
  - hold has already been fulfilled (book was checked out) → error: "Hold already
    fulfilled. Book was checked out on [date]."
  - hold was already cancelled → error: "Hold was already cancelled on [date]"
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Hold removed from queue
  - All patrons behind this one in the queue move up one position
  - If the book has an available copy and there's a next-in-line patron,
    that patron is notified
  - Cancellation event logged
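
The queue behavior described in PlaceHold and CancelHold can be sketched as an ordered list per book. This is a simplified illustration, assuming Python; the HoldQueue class is hypothetical, and it identifies holds by patron_id within a single book's queue rather than by hold_id as the contract does.

```python
class HoldQueue:
    """Holds for one book, in the order they were placed."""

    def __init__(self) -> None:
        self._patrons: list = []

    def place(self, patron_id: str) -> int:
        """Add a hold and return queue_position (1 = next in line)."""
        if patron_id in self._patrons:
            raise ValueError("Hold already exists for this book")
        self._patrons.append(patron_id)
        return len(self._patrons)

    def cancel(self, patron_id: str) -> None:
        """Remove a hold; everyone behind it moves up one position."""
        if patron_id not in self._patrons:
            raise ValueError("Unknown hold")
        self._patrons.remove(patron_id)

    def position(self, patron_id: str) -> int:
        return self._patrons.index(patron_id) + 1
```

With a list, "all patrons behind this one move up" needs no extra work: positions are derived from order, so they cannot drift out of sync with the queue.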

How These Contracts Work Together

Let's trace a complete scenario:

Patron A checks out the last copy of "Dune." Patron B wants it and places a hold. Patron A returns it late.

Step	Contract Called	Key Data Flow
1	`CheckOutBook(patron_A, copy_42)`	Returns checkout record. Copy marked "checked out."
2	`PlaceHold(patron_B, book_dune)`	Returns hold record. Queue position = 1. Estimated availability = "approximately 2 weeks."
3	(14 days pass. Patron A doesn't return the book.)	—
4	(Day 17. Patron A returns the book.)	—
5	`ReturnBook(copy_42)`	Returns: days_overdue = 3, fine_assessed = $0.75. Side effects: (a) Fine created in Finances. (b) Hold queue checked — Patron B is next. (c) Communication module notifies Patron B: "Your hold is ready."
6	(Patron B receives notification and comes to the library.)	—
7	`CheckOutBook(patron_B, copy_42)`	Returns checkout record. Side effect: Patron B's hold is consumed (removed from queue).

Notice how the contracts chain together through side effects. ReturnBook doesn't call PlaceHold or Communication directly — but its side effects trigger actions in other modules. The contracts document this so that the chain is visible and predictable.

Summary: What This Example Teaches

Error cases outnumber happy paths — each contract has one successful outcome but between 3 and 9 distinct error conditions
Side effects connect modules — the explicit side effects section shows cross-boundary impacts
Contracts encode business rules — "can't place a hold if copies are available" is a policy, not a technical limitation
Input specificity matters — copy_id vs. book_id is not a minor detail; it changes the entire meaning
Contracts chain through events — one contract's side effect is another contract's trigger

Contracts — Example: Restaurant Ordering System

The Scenario

A restaurant with table service and online ordering. Customers dine in or order delivery. Waitstaff take orders at the table. Kitchen receives orders and marks them complete. The system calculates bills, splits checks, and processes payment. Tips are recorded.

This is a different domain from the library — more real-time, more physical-world interaction, and more complex pricing.

Contract 1: Create Table Order

CONTRACT: CreateTableOrder

ACCEPTS:
  - table_number: number — required, must be a valid table in the system (1-30)
  - server_id: text — required, must be a valid staff ID for an active server
  - party_size: number — required, must be 1-12

RETURNS:
  - order:
    - order_id: text (unique)
    - table_number: number
    - server_id: text
    - server_name: text
    - party_size: number
    - opened_at: timestamp
    - status: "open"
    - items: empty list (no items yet)
    - subtotal: 0.00

ERRORS:
  - table_number not found → error: "Invalid table number"
  - table already has an active order → error: "Table [N] already has an open order 
    (order_id: [X]). Close or transfer it first."
  - server_id not found → error: "Unknown server"
  - server is clocked out → error: "Server is not currently clocked in"
  - party_size is 0 or negative → error: "Party size must be at least 1"
  - party_size exceeds table capacity → error: "Table [N] seats [X]. Party of [Y] 
    requires a different table."
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Table status changed to "occupied" in the floor plan system
  - Order opened event logged with timestamp

Design Notes

Table capacity checking. The contract validates party size against table capacity — a business rule that prevents operational problems (8 people at a 4-person table). This data comes from the floor plan, which is configuration data.

"Table already has an active order" is common. Servers sometimes forget to close a tab. Instead of silently creating a second order, the error forces the server to deal with the existing one.

Contract 2: Add Item to Order

CONTRACT: AddItemToOrder

ACCEPTS:
  - order_id: text — required
  - menu_item_id: text — required, must be a valid item from the active menu
  - quantity: number — required, must be 1-20
  - modifications: list of text — optional (e.g., ["no onions", "extra cheese", 
    "sub gluten-free bun"])
  - seat_number: number — optional (for tracking who ordered what within a party)
  - special_instructions: text — optional, max 200 characters

RETURNS:
  - updated_order:
    - order_id: text
    - items: list (now includes the new item)
      - Each item:
        - line_item_id: text (unique per item in the order)
        - menu_item_id: text
        - item_name: text
        - quantity: number
        - unit_price: currency
        - modifications: list of text
        - modification_charges: currency (extra cheese = $1.50, etc.)
        - line_total: currency (quantity × (unit_price + modification_charges))
        - seat_number: number or null
        - special_instructions: text or empty
        - status: "ordered"
    - subtotal: currency (updated sum of all line_totals)

ERRORS:
  - order_id not found → error: "Unknown order"
  - order is not open (already closed/paid) → error: "Order is closed. Cannot add items."
  - menu_item_id not found → error: "Unknown menu item"
  - menu item is unavailable (86'd) → error: "[Item name] is currently unavailable"
  - modification is not recognized → error: "Unknown modification: [text]. 
    Available modifications: [list]"
  - modification is not applicable to this item → error: "'[modification]' cannot be 
    applied to [item name]"
  - quantity exceeds limit → error: "Maximum quantity per line item is 20"

SIDE EFFECTS:
  - Order sent to kitchen display (Transport to kitchen) with new item(s) highlighted
  - If item has an allergy flag (e.g., contains nuts), allergy alert included in
    kitchen display
  - Inventory for ingredients decremented (optional — depends on whether the restaurant
    tracks ingredient inventory in real time)
  - Item addition logged with server_id and timestamp

Why Modifications Are Complex

Modifications look simple ("no onions") but they create contract complexity:

Some modifications are free ("no onions" — they're removing something)
Some modifications have a charge ("extra cheese" = $1.50, "add avocado" = $2.00)
Some modifications are impossible ("sub gluten-free bun" on a salad)
Some modifications create allergy implications ("add peanut sauce")

The contract must handle all of these. A vague contract ("accepts modifications: list") leaves all of this to guesswork.
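
A sketch of how the pricing and validation rules for modifications might fit together, assuming Python. The lookup tables are hypothetical; the $1.50 and $2.00 charges come from the list above, while the zero charge for the gluten-free bun and the per-item allowed sets are assumptions.

```python
from decimal import Decimal

# Charge per recognized modification.
MODIFICATION_CHARGES = {
    "no onions": Decimal("0.00"),
    "extra cheese": Decimal("1.50"),
    "add avocado": Decimal("2.00"),
    "sub gluten-free bun": Decimal("0.00"),  # assumed price
}

# Which modifications apply to which item (assumed menu rules).
ALLOWED = {
    "burger": {"no onions", "extra cheese", "add avocado", "sub gluten-free bun"},
    "salad": {"no onions", "add avocado"},
}


def line_total(item_name: str, unit_price: Decimal, quantity: int, modifications: list) -> Decimal:
    """quantity × (unit_price + modification_charges), with contract errors."""
    charges = Decimal("0.00")
    for mod in modifications:
        if mod not in MODIFICATION_CHARGES:
            raise ValueError(f"Unknown modification: {mod}")
        if mod not in ALLOWED.get(item_name, set()):
            raise ValueError(f"'{mod}' cannot be applied to {item_name}")
        charges += MODIFICATION_CHARGES[mod]
    return quantity * (unit_price + charges)
```

Two burgers at 12.00 with extra cheese come to 2 × 13.50 = 27.00, and a gluten-free bun on a salad is rejected with the second error.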

Contract 3: Send Order to Kitchen

CONTRACT: SendToKitchen

ACCEPTS:
  - order_id: text — required
  - items_to_send: list of line_item_ids — optional. If omitted or empty, sends all 
    items with status "ordered" (not yet sent)

    Note on "courses": A server might take the full order upfront but send 
    appetizers to the kitchen first, entrées later. This contract supports that 
    by allowing partial sends.

RETURNS:
  - kitchen_ticket:
    - ticket_id: text
    - order_id: text
    - table_number: number
    - items: list of items being sent
      - Each item: name, quantity, modifications, special instructions, seat number
    - sent_at: timestamp
    - allergy_alerts: list (any items flagged with allergy concerns)
    - estimated_prep_time: minutes (calculated from item prep times)

ERRORS:
  - order_id not found → error: "Unknown order"
  - no items to send (all items already sent or order is empty) → error: 
    "No unsent items on this order"
  - line_item_id not found in order → error: "Item [id] not found on order [id]"
  - kitchen is in "overflow" status → warning (not error): "Kitchen is backed up. 
    Current estimated wait: [X] minutes." (Order is still accepted — this is 
    informational.)

SIDE EFFECTS:
  - Items' status changed from "ordered" to "sent to kitchen"
  - Kitchen display updated with new ticket
  - Ticket printed at the appropriate kitchen station (grill items → grill station, 
    salads → cold station, etc.)
  - Estimated wait time sent back to server's device

The Course Problem

Real restaurants have courses. Appetizers go first, then entrées, then dessert. The contract handles this by allowing the server to choose which items to send. But the contract doesn't enforce course ordering — a server could send desserts first. Is that an error?

Decision: No. The contract allows it. The server might have a reason (the customer wants dessert only). Business rules about course ordering are the server's training, not the system's enforcement. This is a deliberate contract design choice — not every rule belongs in the software.

Contract 4: Close Order and Calculate Bill

CONTRACT: CalculateBill

ACCEPTS:
  - order_id: text — required
  - split_method: one of ["no_split", "equal_split", "by_seat", "custom"]
    - If "equal_split": split_count: number (how many ways to split, 2-12)
    - If "by_seat": (no additional input — each seat gets their items)
    - If "custom": custom_splits: list of {split_label: text, line_item_ids: list}

RETURNS:
  - bill:
    - order_id: text
    - splits: list of:
      - split_id: text
      - split_label: text ("Check 1", "Seat 3", "Jordan's portion", etc.)
      - items: list of items in this split
      - subtotal: currency
      - tax: currency (calculated from local tax rate)
      - total: currency (subtotal + tax)
    - order_subtotal: currency (pre-tax sum of all splits)
    - order_tax: currency
    - order_total: currency
    - gratuity_suggestion:
      - 15_percent: currency
      - 18_percent: currency
      - 20_percent: currency
      - 25_percent: currency
      (calculated on pre-tax subtotal)

ERRORS:
  - order_id not found → error: "Unknown order"
  - order has no items → error: "Cannot generate bill for empty order"
  - order has items with status "sent to kitchen" but not "completed" → 
    warning: "Kitchen has not completed all items. Generate bill anyway?"
  - split_method "by_seat" but some items have no seat assigned → error: 
    "[N] items have no seat number. Assign seats or use a different split method."
  - custom_splits don't cover all items → error: "The following items are not 
    assigned to any split: [list]"
  - custom_splits assign the same item to multiple splits → error: "Item [name] 
    is assigned to multiple splits"
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Order status changed to "bill generated"
  - Bill event logged with split details and timestamp

The Split Check Problem

Check splitting might be the most complex everyday contract. Consider:

Equal split — simple math, but what about items that cost significantly more? Equitable ≠ equal.
By seat — requires every item to be assigned to a seat. If the server didn't track seats, this fails.
Custom — maximum flexibility, but the contract must verify that all items are covered (no orphans) and no item is double-counted.

The contract handles all three approaches with clear errors for each. A weaker contract would just say "accepts split_method: text" and leave all validation to implementation.
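
Two of these checks can be sketched briefly, assuming Python. The function names are hypothetical; custom splits are represented here as a dictionary from split label to line item IDs rather than the list form in the contract, and the rule that leftover cents go to the first checks in an equal split is an assumption the contract does not specify.

```python
from decimal import Decimal


def validate_custom_splits(order_item_ids: list, custom_splits: dict) -> None:
    """Every item must appear in exactly one split (no orphans, no doubles)."""
    seen = set()
    for item_ids in custom_splits.values():
        for item_id in item_ids:
            if item_id in seen:
                raise ValueError(f"Item {item_id} is assigned to multiple splits")
            seen.add(item_id)
    missing = [item_id for item_id in order_item_ids if item_id not in seen]
    if missing:
        raise ValueError(
            "The following items are not assigned to any split: " + ", ".join(missing)
        )


def equal_split(total: Decimal, split_count: int) -> list:
    """Split a total into split_count shares that add up exactly to the total."""
    if not 2 <= split_count <= 12:
        raise ValueError("split_count must be between 2 and 12")
    cents = int(total * 100)
    base, extra = divmod(cents, split_count)
    # Assumed policy: the first `extra` shares each carry one leftover cent.
    return [Decimal(base + (1 if i < extra else 0)) / 100 for i in range(split_count)]
```

Splitting 100.00 three ways yields 33.34, 33.33, and 33.33, so no cent is lost or invented by rounding.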

Contract 5: Process Payment

CONTRACT: ProcessPayment

ACCEPTS:
  - split_id: text — required (pays one split at a time)
  - payment_method: one of ["cash", "credit_card", "debit_card", "gift_card"]
    - If credit/debit: card_token: text (tokenized card data, never raw card numbers)
    - If gift_card: card_number: text, pin: text
    - If cash: amount_tendered: currency
  - tip_amount: currency — optional, default 0.00

RETURNS:
  - payment_receipt:
    - payment_id: text
    - split_id: text
    - amount_charged: currency
    - tip_amount: currency
    - total_charged: currency (amount_charged + tip_amount)
    - payment_method: text
    - change_due: currency (only for cash, 0.00 otherwise)
    - paid_at: timestamp

ERRORS:
  - split_id not found → error: "Unknown split"
  - split already paid → error: "This split has already been paid"
  - credit card declined → error: "Card declined. Reason: [reason from processor]"
  - gift card insufficient balance → error: "Gift card balance is [X]. 
    Split total is [Y]. Remaining [Z] must be paid by another method."
  - cash amount_tendered is less than total → error: "Amount tendered ([X]) 
    is less than total ([Y])"
  - tip_amount is negative → error: "Tip cannot be negative"
  - database unreachable → error: "System temporarily unavailable"
  - payment processor unreachable → error: "Payment system temporarily unavailable. 
    Cash payment available."

SIDE EFFECTS:
  - Split marked as "paid"
  - If all splits for the order are paid, order status changed to "closed"
  - If order is closed, table status changed to "available" in floor plan
  - Payment logged for accounting/end-of-day reconciliation
  - Tip recorded and attributed to server for tip-out calculations
  - If cash: cash drawer amount updated
  - Receipt generated for printing or digital delivery
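
The cash branch of this contract can be sketched in a few lines, assuming Python. The function name pay_cash is hypothetical, and reading "total" in the amount_tendered error as the split total plus tip is an interpretation; the contract does not say whether the tip is included.

```python
from decimal import Decimal


def pay_cash(split_total: Decimal, amount_tendered: Decimal,
             tip_amount: Decimal = Decimal("0.00")) -> dict:
    """Cash-only slice of ProcessPayment: validate, then compute change."""
    if tip_amount < 0:
        raise ValueError("Tip cannot be negative")
    total_charged = split_total + tip_amount  # assumed: tip counts toward total
    if amount_tendered < total_charged:
        raise ValueError(
            f"Amount tendered ({amount_tendered}) is less than total ({total_charged})"
        )
    return {
        "amount_charged": split_total,
        "tip_amount": tip_amount,
        "total_charged": total_charged,
        "payment_method": "cash",
        "change_due": amount_tendered - total_charged,
    }
```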

How These Contracts Chain Together

A complete table service flow:

CreateTableOrder(table_5, server_12, party_4)
    ↓
AddItemToOrder(order_001, "calamari", qty=1, seat=4)
AddItemToOrder(order_001, "burger", qty=2, mods=["no onions"], seat=1)
AddItemToOrder(order_001, "pasta", qty=1, seat=2)
AddItemToOrder(order_001, "salmon", qty=1, seat=3)
    ↓
SendToKitchen(order_001, items=["calamari"])       ← Appetizer first
    ↓
    ... kitchen prepares and marks complete ...
    ↓
SendToKitchen(order_001)                            ← Remaining items (entrées)
    ↓
    ... kitchen prepares and marks complete ...
    ↓
CalculateBill(order_001, split_method="by_seat")
    ↓
ProcessPayment(split_seat1, credit_card, tip=8.00)
ProcessPayment(split_seat2, cash, amount_tendered=30.00)
ProcessPayment(split_seat3, credit_card, tip=6.00)
ProcessPayment(split_seat4, gift_card, tip=5.00)
    ↓
Order closed. Table 5 available.

Each step has its own contract. Each step can fail independently with clear error messages. The chain is explicit — no hidden dependencies.

Comparing Library vs. Restaurant Contracts

Aspect	Library	Restaurant
Complexity of inputs	Simple (IDs)	Complex (modifications, split methods, multiple payment types)
Error case count per contract	3-9	4-8
Side effects crossing modules	Catalog, Finances, Communication	Kitchen display, Floor plan, Accounting
Time sensitivity	Relaxed (books are due in 14 days)	High (food gets cold, customers get impatient)
Physical-world interaction	Book in/out	Food preparation, cash handling
Business rules in contracts	"Can't hold available books"	"Can't split by seat without seat assignments"
Partial operations	Limited (hold vs. checkout)	Extensive (courses, split checks, partial payments, gift card remainder)

The restaurant contracts are more complex because the domain is more complex — but the structure is identical: name, inputs, outputs, errors, side effects.

Contracts — Example: Bank Transfer System

The Scenario

A banking system where customers transfer money between their own accounts and to other people's accounts. Wire transfers to external banks are supported. Daily limits and fraud detection are enforced. Every transaction must be auditable.

This is the highest-stakes contract environment. A vague contract in a bank means money appears, disappears, or doubles. There is zero tolerance for ambiguity.

Contract 1: Internal Transfer (Between Own Accounts)

CONTRACT: TransferBetweenOwnAccounts

ACCEPTS:
  - customer_id: text — required, authenticated customer
  - source_account_id: text — required, must belong to customer_id
  - destination_account_id: text — required, must belong to customer_id
  - amount: currency — required, must be positive, two decimal places maximum
    (e.g., 100.00, not 100.001)
  - memo: text — optional, max 140 characters (customer's note for their records)

RETURNS:
  - transfer_record:
    - transfer_id: text (unique, used for all future references to this transaction)
    - source_account_id: text
    - source_new_balance: currency
    - destination_account_id: text
    - destination_new_balance: currency
    - amount: currency
    - memo: text or empty
    - executed_at: timestamp (precise to millisecond)
    - status: "completed"

ERRORS:
  - customer_id not authenticated → error: "Authentication required"
  - source_account_id not found → error: "Account not found"
  - source_account does not belong to customer → error: "Account not found" 
    (IMPORTANT: same message as "not found" — never reveal that the account 
    exists but belongs to someone else)
  - destination_account_id not found → error: "Account not found"
  - destination_account does not belong to customer → error: "Account not found"
  - source and destination are the same account → error: "Source and destination 
    must be different accounts"
  - amount is zero or negative → error: "Amount must be greater than zero"
  - amount has more than 2 decimal places → error: "Amount must not exceed 
    two decimal places"
  - insufficient funds (source balance < amount) → error: "Insufficient funds. 
    Available balance: [X]"
  - source account is frozen → error: "Account is restricted. Contact support."
  - destination account is frozen → error: "Destination account is restricted. 
    Contact support."
  - daily transfer limit exceeded → error: "Daily transfer limit of [X] reached. 
    Transferred today: [Y]. Remaining: [Z]. Resets at midnight [timezone]."
  - database unreachable → error: "Service temporarily unavailable. Your transfer 
    has not been processed. Please try again."

SIDE EFFECTS:
  - Source account balance decreased by amount (atomic operation)
  - Destination account balance increased by amount (atomic operation)
  - Two transaction records created (one debit, one credit) — both reference
    the same transfer_id for traceability
  - Daily transfer running total updated for this customer
  - Transaction event logged with full details (all inputs, all outputs, 
    timestamp, IP address, device info)
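  - If amount exceeds $10,000: regulatory reporting flag set (illustrative
    threshold; in the US, a Currency Transaction Report covers cash transactions,
    so the rule that applies to account transfers depends on jurisdiction)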

Critical Design Point: Atomicity

The most important word in this contract is "atomic." Both the deduction from the source and the addition to the destination must happen as a single, indivisible operation. You cannot have a state where:

Money has left the source but not arrived at the destination (money lost)
Money has arrived at the destination but not left the source (money created)

This is the same "two-phase commit" concept from the ATM example. The contract specifies atomic behavior, and the implementation must guarantee it — how it does so is an implementation detail, but the guarantee is part of the contract.
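The all-or-nothing guarantee can be illustrated with a small sketch. This is not how a bank implements it; a real system would typically rely on a database transaction. The function name `transfer`, the `TransferError` type, and the in-memory `balances` mapping are assumptions made for illustration. The point is only that every check runs before any change, and that the caller sees either the old state or the fully updated one.

```python
from decimal import Decimal


class TransferError(Exception):
    """Raised when a transfer is rejected; no balance has changed."""


def transfer(balances: dict[str, Decimal], source: str, destination: str,
             amount: Decimal) -> dict[str, Decimal]:
    """Return a NEW balances mapping with the transfer applied.

    All checks run before any change is made, and the caller only ever sees
    the old mapping or the fully updated one, never a half-applied state.
    """
    if source == destination:
        raise TransferError("Source and destination must be different accounts")
    if source not in balances or destination not in balances:
        raise TransferError("Account not found")
    if amount <= 0:
        raise TransferError("Amount must be greater than zero")
    if amount != amount.quantize(Decimal("0.01")):
        raise TransferError("Amount must not exceed two decimal places")
    if balances[source] < amount:
        raise TransferError(
            f"Insufficient funds. Available balance: {balances[source]}")

    updated = dict(balances)          # work on a copy
    updated[source] -= amount         # debit
    updated[destination] += amount    # credit
    return updated                    # both changes become visible together
```

Because the debit and credit are applied to a copy, a rejected transfer leaves the original mapping untouched, and the sum of all balances is the same before and after a successful one.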

Critical Design Point: Security in Error Messages

Notice that "account not found" and "account doesn't belong to you" return the same error message. This is intentional. If the system said "account 12345 exists but doesn't belong to you," an attacker could probe for valid account numbers. The contract uses identical error messages for different failure reasons to prevent information leakage.
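A minimal sketch of this rule, assuming a hypothetical lookup table `ACCOUNT_OWNERS` and a hypothetical helper `resolve_account`: both failure reasons pass through one branch and raise the same error with the same message, so the caller has nothing to compare.

```python
from typing import Optional


class AccountNotFound(Exception):
    """Single error type for 'does not exist' AND 'belongs to someone else'."""


# Hypothetical data: account_id -> owning customer_id.
ACCOUNT_OWNERS = {"acct-1": "cust-A", "acct-2": "cust-B"}


def resolve_account(customer_id: str, account_id: str) -> str:
    """Return account_id if it exists and belongs to customer_id."""
    owner: Optional[str] = ACCOUNT_OWNERS.get(account_id)
    # One branch covers both failure reasons, so callers cannot tell them apart.
    if owner is None or owner != customer_id:
        raise AccountNotFound("Account not found")
    return account_id
```

The detailed reason can still be written to an internal audit log; the constraint applies only to what the caller is told.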

Contract 2: Transfer to Another Person

CONTRACT: TransferToOtherCustomer

ACCEPTS:
  - customer_id: text — required, authenticated customer (the sender)
  - source_account_id: text — required, must belong to customer_id
  - recipient_identifier: one of:
    - account_number: text (direct account number)
    - email: text (if recipient has registered email for receiving transfers)
    - phone: text (if recipient has registered phone for receiving transfers)
  - amount: currency — required, positive, two decimal places max
  - memo: text — optional, max 140 characters

RETURNS:
  - transfer_record:
    - transfer_id: text
    - source_account_id: text
    - source_new_balance: currency
    - recipient_display: text (recipient's name, partially masked: "J*** Smith")
    - amount: currency
    - memo: text or empty
    - status: one of:
      - "completed" (instant transfer, recipient is at the same bank)
      - "pending" (recipient at external bank, or amount triggers review)
    - executed_at: timestamp
    - estimated_arrival: text (if pending — "within 1 business day" or 
      "within 3 business days")

ERRORS:
  (All errors from TransferBetweenOwnAccounts, PLUS:)
  - recipient not found → error: "No account found for this recipient"
  - recipient account is closed → error: "Recipient account is not active"
  - sender and recipient are the same person → error: "Use internal transfer 
    for transfers between your own accounts" (different flow, different limits)
  - amount exceeds person-to-person daily limit → error: "Person-to-person 
    daily limit of [X] reached"
  - fraud detection flag → error: "Transfer requires additional verification. 
    Please contact support or verify via [method]." 
    (The transfer is NOT processed. It is held.)

SIDE EFFECTS:
  - Source balance decreased by amount
  - If same bank and status = "completed": recipient balance increased immediately
  - If different bank and status = "pending": transfer queued for batch processing
  - If amount > $3,000 to a new recipient: additional verification step triggered
    (two-factor authentication sent to customer's phone)
  - Both sender's and recipient's transaction histories updated
  - Transfer event logged (sender's IP, device fingerprint, amount, recipient info)
  - Fraud scoring model updated with this transfer's characteristics
  - If amount > $10,000: regulatory reporting flag set
  - Notification sent to recipient (if they have notifications enabled)

Key Difference From Internal Transfer

The recipient might not get the money immediately. This introduces the concept of eventual consistency — the sender's balance changes now, but the recipient's balance might change later. The contract makes this explicit through the status and estimated_arrival fields.

Fraud detection can block the transfer. Unlike internal transfers (low risk), person-to-person transfers are a fraud vector. The contract includes a specific error for this case that instructs the caller to handle it as a "held" state — not a rejection, not a success, but a third state.
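One way a caller might represent the three states is sketched below. The contract above reports the held case through its ERRORS section; modeling it as a status value here is an assumption made for illustration, as are the names `TransferStatus`, `TransferOutcome`, and `next_action`.

```python
from dataclasses import dataclass
from enum import Enum


class TransferStatus(Enum):
    COMPLETED = "completed"   # money has moved
    PENDING = "pending"       # accepted, will settle later
    HELD = "held"             # not processed; needs verification


@dataclass(frozen=True)
class TransferOutcome:
    status: TransferStatus
    message: str


def next_action(outcome: TransferOutcome) -> str:
    """What a caller (for example, a mobile app) should do with each state."""
    if outcome.status is TransferStatus.COMPLETED:
        return "show receipt"
    if outcome.status is TransferStatus.PENDING:
        return "show estimated arrival"
    # HELD is neither success nor rejection: do not retry, ask for verification.
    return "prompt for verification"
```

Making the third state explicit keeps a caller from treating a held transfer as a failure and retrying it automatically.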

Contract 3: Wire Transfer (External Bank)

CONTRACT: WireTransfer

ACCEPTS:
  - customer_id: text — required, authenticated
  - source_account_id: text — required, must belong to customer_id
  - recipient_name: text — required, full legal name
  - recipient_bank_routing_number: text — required, 9 digits
  - recipient_account_number: text — required
  - amount: currency — required, positive, two decimal places max
  - wire_type: one of ["domestic", "international"]
    - If "international":
      - swift_code: text — required, 8 or 11 characters
      - recipient_bank_name: text — required
      - recipient_bank_address: text — required
      - recipient_country: text — required, ISO country code
  - purpose: text — required for international (regulatory requirement), 
    optional for domestic, max 200 characters
  - memo: text — optional, max 140 characters

RETURNS:
  - wire_record:
    - wire_id: text
    - source_account_id: text
    - source_new_balance: currency (amount + wire fee already deducted)
    - recipient_name: text
    - amount: currency
    - wire_fee: currency (displayed separately — $25 domestic, $45 international)
    - total_deducted: currency (amount + wire_fee)
    - status: "pending" (wires are never instant)
    - submitted_at: timestamp
    - estimated_arrival: text ("1-2 business days" domestic, 
      "3-5 business days" international)
    - confirmation_number: text (for tracking with the wire network)

ERRORS:
  (All standard account/amount errors, PLUS:)
  - routing_number invalid format → error: "Routing number must be exactly 9 digits"
  - routing_number not found in bank directory → error: "Unknown routing number. 
    Verify with recipient's bank."
  - swift_code invalid format → error: "SWIFT code must be 8 or 11 characters"
  - wire_type is "international" and purpose is empty → error: "Purpose is 
    required for international wire transfers"
  - insufficient funds for amount + wire_fee → error: "Insufficient funds. 
    Transfer amount ([X]) + wire fee ([Y]) = [Z]. Available balance: [W]."
  - wire transfer daily limit exceeded → error: "Daily wire limit exceeded"
  - customer has not completed wire transfer authorization form → error: 
    "Wire transfer authorization required. Complete enrollment first."
  - fraud or compliance hold → error: "Transfer requires manual review. 
    Expected completion: [1-2 business days]. Reference: [case_id]."

SIDE EFFECTS:
  - Source balance decreased by (amount + wire_fee) — this happens immediately
    even though the wire is "pending"
  - Wire queued for submission to Federal Reserve wire network (domestic) 
    or SWIFT network (international)
  - Compliance review automatically triggered for:
    - Any international wire
    - Domestic wires over $10,000
    - Wires to certain countries (OFAC screening)
  - Customer receives confirmation email with wire details
  - Wire event logged with full audit trail

Why Wire Contracts Are Maximally Detailed

The money leaves immediately, but the wire takes days. This creates a period where the customer's balance is reduced but the recipient hasn't received anything. The contract must make this clear — and the side effects section must document that the balance deduction is immediate even though delivery is not.

Regulatory requirements are part of the contract. Purpose is required for international wires — not because the bank wants it, but because the law requires it. The contract enforces this. If you omitted it, the implementation might skip it, and the bank could face legal penalties.

Wire fees are not optional. The contract explicitly includes the fee and shows the total deduction. A vague contract might say "returns amount" without clarifying whether the fee is included or separate — this ambiguity could cause accounting errors.
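The fee arithmetic the contract spells out can be sketched as follows. The fee values come from the WireTransfer contract above; the function name `wire_deduction` and the dictionary return shape are assumptions for illustration. Decimal is used so that currency amounts are not subject to floating-point rounding.

```python
from decimal import Decimal

# Fee values taken from the WireTransfer contract above.
WIRE_FEES = {"domestic": Decimal("25.00"), "international": Decimal("45.00")}


def wire_deduction(amount: Decimal, wire_type: str, available: Decimal) -> dict:
    """Return the fee breakdown, or raise if the balance cannot cover it."""
    fee = WIRE_FEES[wire_type]
    total = amount + fee
    if available < total:
        raise ValueError(
            f"Insufficient funds. Transfer amount ({amount}) + wire fee ({fee}) "
            f"= {total}. Available balance: {available}."
        )
    return {
        "amount": amount,                      # what the recipient is sent
        "wire_fee": fee,                       # reported separately
        "total_deducted": total,               # what leaves the source account
        "source_new_balance": available - total,
    }
```

Returning `amount`, `wire_fee`, and `total_deducted` as separate fields removes the question of whether the fee is included in any one number.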

Comparing All Three Domain Examples

Aspect	Library	Restaurant	Bank
Strictest constraint	Checkout limits	Food timing	Atomicity + compliance
Error message security	Low concern	Low concern	Critical (never reveal account existence)
Side effects count	3-4 per contract	4-6 per contract	6-10 per contract
Regulatory requirements	Minimal	Health codes (not in contracts)	Extensive (CTR, OFAC, wire auth)
Can you "undo" the operation?	Yes (return the book)	Partially (can void before kitchen)	Depends (internal yes, wire maybe not)
Money involved	Small fines	Meal costs	Unlimited
Time horizon	14 days (loan period)	Minutes (meal duration)	Days (wire processing)
Highest-stakes error case	Lost book ($25 replacement)	Food allergy incident	Money loss, legal violation

The contract structure is identical across all three: name, inputs, outputs, errors, side effects. But the rigor scales with the stakes. A missing error case in the library contract is an inconvenience. A missing error case in the bank contract is a potential financial loss or legal violation.

This is the core lesson: the contract template is universal, but the thoroughness is proportional to what's at risk.

Contracts — Composing Contract Chains

Why Composition Matters

Real features are never one contract. They are chains — sequences of contracts where each step's output feeds the next step's input. The chain's success depends on every link, and a failure at any point must be handled.

Composing contracts is where design becomes architecture.

The Three Rules of Contract Chains

Rule 1: Output Shape Must Match Input Shape

If Contract A returns a validated_cart and Contract B accepts a validated_cart, they connect. If A returns a cart (not validated) and B expects validated_cart, they don't — and the gap will produce a bug.
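Rule 1 can be made concrete with types. In this sketch, `Cart`, `ValidatedCart`, `validate_cart`, `reserve_inventory`, and the `IN_STOCK` set are all hypothetical names. The idea is that the validated shape is a distinct type, so a step that needs a validated cart cannot silently accept an unvalidated one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cart:
    items: tuple[str, ...]


@dataclass(frozen=True)
class ValidatedCart:
    """Produced by validate_cart, so stock has been checked."""
    items: tuple[str, ...]


IN_STOCK = {"book", "pen"}  # hypothetical inventory


def validate_cart(cart: Cart) -> ValidatedCart:
    missing = [item for item in cart.items if item not in IN_STOCK]
    if missing:
        raise ValueError(f"Item {missing[0]} is out of stock")
    return ValidatedCart(items=cart.items)


def reserve_inventory(cart: ValidatedCart) -> str:
    # A static type checker flags a plain Cart here; the runtime check
    # below catches the same mismatch when no checker is in use.
    if not isinstance(cart, ValidatedCart):
        raise TypeError("reserve_inventory requires a ValidatedCart")
    return f"reservation-for-{len(cart.items)}-items"
```

Passing a plain `Cart` to `reserve_inventory` fails immediately at the boundary instead of producing a bug several steps later.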

Rule 2: Every Link Has a Failure Path

What happens if step 3 of 5 fails? Does the whole chain stop? Do steps 1 and 2 need to be undone? Does the chain skip step 3 and continue? The answer must be explicit for every link.

Rule 3: Side Effects Complicate Rollback

If step 1 sends a confirmation email and step 3 fails, you can't "unsend" the email. Side effects are often irreversible, so the chain must account for which steps can be undone and which can't.

Worked Example 1: E-Commerce Checkout

A customer clicks "Place Order." Here's the full chain:

Step 1: ValidateCart
  IN:  cart_id
  OUT: validated_cart (items confirmed in stock, prices confirmed current)
  FAIL: "Item X is out of stock" → stop, show error, suggest alternatives

Step 2: CalculateTotal
  IN:  validated_cart, discount_code (optional), shipping_address
  OUT: order_total (subtotal, discount_amount, tax, shipping, grand_total)
  FAIL: "Invalid discount code" → stop, show error, let customer fix
  FAIL: "Cannot ship to this address" → stop, show error

Step 3: ReserveInventory
  IN:  validated_cart
  OUT: reservation_id (items held for 10 minutes)
  FAIL: "Item X went out of stock since cart was validated" → stop, show 
        error, go back to step 1
  NOTE: This is a TEMPORARY hold. If step 4 or step 5 fails, the reservation
        is released.

Step 4: ProcessPayment
  IN:  grand_total, payment_method
  OUT: payment_confirmation (transaction_id, status)
  FAIL: "Card declined" → release reservation (undo step 3), show error
  FAIL: "Payment service unavailable" → release reservation, show error,
        suggest retry

Step 5: CreateOrder
  IN:  validated_cart, order_total, payment_confirmation, customer_id
  OUT: order_record (order_id, status = "confirmed")
  FAIL: "System error creating order" → THIS IS CRITICAL. Payment was 
        already processed. Must either: (a) retry order creation, or 
        (b) refund the payment. Never leave money charged without an order.

Step 6: SendConfirmation
  IN:  order_record, customer_email
  OUT: (none — fire and forget)
  FAIL: "Email service unavailable" → log the failure, do NOT undo the 
        order. The order is valid. Email can be resent later.
  SIDE EFFECT: Email sent to customer.

The Failure Cascade

Let's visualize what happens at each failure point:

Fails At	Steps Completed	What Must Be Undone	User Sees
Step 1	None	Nothing	"Item out of stock" — redirect to cart
Step 2	Cart validated	Nothing (validation has no side effects)	"Invalid discount code" — fix and retry
Step 3	Cart validated, total calculated	Nothing (no side effects yet)	"Item just went out of stock" — back to cart
Step 4	Cart validated, total calculated, inventory reserved	Release inventory reservation	"Card declined" — try another card
Step 5	All above + payment charged	Refund payment + release inventory	"System error — please contact support" + automatic refund
Step 6	Everything above + order created	Nothing to undo — order is valid	Order succeeds. Email will be retried later.

This table is the most valuable artifact in the design. It shows exactly what's at risk at each step and what recovery looks like.
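The table translates into a simple mechanism: pair each step with the action that undoes it, and on failure run the undo actions of the completed steps in reverse order. The sketch below is a simplified illustration, not a production design. The names `Step`, `run_chain`, `fail`, and `checkout` are hypothetical, the step bodies are placeholders, and the option of retrying order creation before refunding is left out.

```python
from typing import Callable, Optional

# (step name, action, compensation or None when there is nothing to undo)
Step = tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_chain(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            for done_name, _, undo in reversed(completed):
                if undo is not None:
                    undo()
                    log.append(f"undid {done_name}")
            return log
        completed.append(step)
        log.append(f"{name} ok")
    return log


def fail(message: str) -> None:
    raise RuntimeError(message)


# Placeholder actions; CreateOrder is made to fail to show the rollback.
checkout: list[Step] = [
    ("ValidateCart", lambda: None, None),
    ("CalculateTotal", lambda: None, None),
    ("ReserveInventory", lambda: None, lambda: None),   # undo: release hold
    ("ProcessPayment", lambda: None, lambda: None),     # undo: refund
    ("CreateOrder", lambda: fail("system error"), None),
]
```

Running `run_chain(checkout)` undoes the payment first and the reservation second, which matches the "Refund payment + release inventory" row. SendConfirmation is omitted because its failure never triggers an undo.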

Worked Example 2: Employee Onboarding

A new employee is onboarded into a company's systems. This is a multi-system chain involving HR, IT, Facilities, Payroll, and more.

Step 1: CreateEmployeeRecord
  IN:  name, role, department, start_date, manager_id, salary
  OUT: employee_id, employee_record
  FAIL: "Manager not found" → stop, HR fixes manager assignment
  FAIL: "Duplicate employee (matching name + DOB)" → stop, HR investigates

Step 2: SetupPayroll
  IN:  employee_id, salary, tax_withholding_info, bank_account (for direct deposit)
  OUT: payroll_enrollment_confirmation
  FAIL: "Invalid bank routing number" → stop, request corrected info
        (Step 1 persists — employee record exists but payroll isn't set up)
  SIDE EFFECT: Employee added to next payroll cycle

Step 3: CreateITAccounts
  IN:  employee_id, role, department
  OUT: email_address, system_credentials, access_permissions_list
  FAIL: "Email address conflict (firstname.lastname already taken)" → 
        auto-generate alternative (firstname.middle.lastname), proceed
  SIDE EFFECTS: Email created, VPN access granted, software licenses assigned

Step 4: AssignEquipment
  IN:  employee_id, role, department
  OUT: equipment_list (laptop model, monitor, phone, badge)
  FAIL: "Laptop model out of stock" → substitute, proceed with warning
  SIDE EFFECT: Equipment reserved in inventory, shipping initiated

Step 5: SetupWorkspace
  IN:  employee_id, department, start_date
  OUT: workspace_assignment (building, floor, desk number)
  FAIL: "No desks available in department area" → assign temporary desk, 
        add to waitlist
  SIDE EFFECT: Desk reserved in facilities system

Step 6: SendWelcomePackage
  IN:  employee_id, email_address, start_date, workspace, equipment_list
  OUT: (confirmation)
  FAIL: Non-critical — retry later
  SIDE EFFECT: Welcome email sent with first-day instructions

Key Differences From E-Commerce Chain

Not all steps are dependent. Steps 2, 3, 4, and 5 can happen in parallel — they all need employee_id from step 1, but they don't need each other's outputs. This changes the chain from a strict sequence to a fan-out:

                                ┌── Step 2: Payroll
                                ├── Step 3: IT Accounts
Step 1: Create Record ──────────┤
                                ├── Step 4: Equipment
                                └── Step 5: Workspace
                                          │
                         All complete ─────┘
                                          │
                                    Step 6: Welcome

Failures don't cascade backward. If IT can't create an account, that doesn't mean HR needs to delete the employee record. Each step has its own failure handling. This is a design choice — the chain is tolerant of partial completion, unlike the e-commerce chain where payment requires inventory reservation.

Some failures are handled with substitution, not cancellation. "Laptop out of stock" → substitute a different model. "Email conflict" → generate alternative. The chain tries to continue whenever possible.
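The fan-out and its tolerance for partial completion can be sketched with a thread pool. The step functions here are stand-ins with made-up return values, and `create_it_accounts` is written to fail so the behavior is visible. The function names and the `onboard` helper are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def setup_payroll(employee_id: str) -> str:
    return f"payroll enrolled for {employee_id}"


def create_it_accounts(employee_id: str) -> str:
    raise RuntimeError("directory service unavailable")  # simulated failure


def assign_equipment(employee_id: str) -> str:
    return f"laptop reserved for {employee_id}"


def setup_workspace(employee_id: str) -> str:
    return f"desk assigned to {employee_id}"


def onboard(employee_id: str) -> dict[str, str]:
    """Fan out steps 2-5 and collect each result or failure independently."""
    steps = {
        "payroll": setup_payroll,
        "it_accounts": create_it_accounts,
        "equipment": assign_equipment,
        "workspace": setup_workspace,
    }
    results: dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, employee_id) for name, fn in steps.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failed branch does not cancel or undo the others.
                results[name] = f"FAILED: {exc}"
    return results
```

Each branch only needs `employee_id` from step 1, so none of them waits on another. The result records which branches succeeded and which still need attention, without touching the employee record.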

Worked Example 3: Medical Lab Test Process

A doctor orders a blood test. The sample is collected, processed, and results are delivered.

Step 1: OrderTest
  IN:  doctor_id, patient_id, test_type (e.g., "complete blood count"), 
       urgency ("routine" | "urgent" | "stat"), clinical_notes
  OUT: test_order (order_id, patient_name, test_type, collection_instructions)
  FAIL: "Patient has allergy flagged for this test prep" → warning to doctor
  SIDE EFFECT: Order appears on lab's work queue

Step 2: CollectSample
  IN:  order_id, collector_id (phlebotomist), patient_id_verification 
       (wristband scan or verbal confirmation of DOB)
  OUT: sample_record (sample_id, collection_time, tube_type, volume)
  FAIL: "Patient ID verification failed (wristband doesn't match order)" 
        → HARD STOP. Do not collect. This prevents testing the wrong 
        patient's blood — a potentially fatal error.
  FAIL: "Insufficient sample volume" → recollect
  SIDE EFFECT: Sample labeled with barcode, linked to order_id

Step 3: ProcessSample
  IN:  sample_id
  OUT: processing_record (processing_start_time, analyzer_id, status)
  FAIL: "Sample hemolyzed (damaged)" → error to collector: "Recollection 
        needed. Reason: hemolysis." → back to Step 2
  FAIL: "Analyzer malfunction" → route to backup analyzer
  SIDE EFFECT: Sample processing logged for quality control

Step 4: AnalyzeResults
  IN:  processing_record
  OUT: raw_results (values for each test component, reference ranges, 
       flags for abnormal values)
  FAIL: "Results outside analyzable range" → flag for manual review 
        by lab technician
  SIDE EFFECT: Results stored in lab information system

Step 5: ReviewResults
  IN:  raw_results, patient_history (previous test results for comparison)
  OUT: reviewed_results (same as raw, plus: technician_notes, 
       critical_value_flag)
  FAIL: None — this step always produces a result (even if the result 
        is "requires further testing")
  SIDE EFFECT: If critical value detected (life-threatening result), 
  IMMEDIATE notification to ordering doctor — this must happen within 
  minutes, not hours. This is a regulatory requirement.

Step 6: DeliverResults
  IN:  reviewed_results, doctor_id, patient_id
  OUT: delivery_confirmation
  FAIL: "Doctor not available" → deliver to covering physician
  SIDE EFFECTS: Results appear in patient's medical record, 
  doctor receives notification in their clinical dashboard

Why This Chain Is Unique

Patient safety creates hard stops. Step 2 has a verification check that cannot be bypassed or substituted. If the patient ID doesn't match, the chain stops completely. No alternative, no workaround. This is the highest-stakes failure in the chain — a wrong patient's blood being analyzed means wrong treatment decisions.

Some failures loop backward. "Sample hemolyzed" at step 3 sends the chain back to step 2 (recollect). This isn't a simple linear chain — it has loops.

Time sensitivity varies by step. Routine orders might wait hours at each step. Stat orders bypass the queue at every step. The urgency flag changes the behavior of every contract in the chain without changing the contracts themselves — it's a priority signal that travels through the chain.

Critical value notification is a side effect that overrides normal flow. Normally, results go through all steps sequentially. But if step 5 detects a critical value (e.g., dangerously low blood sugar), the side effect triggers an immediate alert — even before step 6 formally delivers the results. The side effect has higher priority than the main chain.
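The difference between a hard stop and a loop-back can be sketched in a few lines. Everything here is a simulation: the first sample is marked hemolyzed on purpose, and the `max_attempts` limit and the escalation message are assumptions that the chain above does not specify. The names `HardStop`, `collect_sample`, `process_sample`, and `collect_and_process` are hypothetical.

```python
class HardStop(Exception):
    """Patient identity mismatch: the chain ends, no retry, no substitute."""


def collect_sample(order_patient: str, wristband_patient: str, attempt: int) -> dict:
    if order_patient != wristband_patient:
        raise HardStop("Patient ID verification failed")
    # Simulation only: the first draw is damaged, later draws are fine.
    return {"sample_id": f"S-{attempt}", "hemolyzed": attempt == 1}


def process_sample(sample: dict) -> str:
    if sample["hemolyzed"]:
        raise ValueError("Recollection needed. Reason: hemolysis.")
    return f"processed {sample['sample_id']}"


def collect_and_process(order_patient: str, wristband_patient: str,
                        max_attempts: int = 3) -> str:
    for attempt in range(1, max_attempts + 1):
        # Step 2: a HardStop raised here is not caught, so it ends the chain.
        sample = collect_sample(order_patient, wristband_patient, attempt)
        try:
            return process_sample(sample)   # Step 3
        except ValueError:
            continue                        # loop back to Step 2 and recollect
    raise RuntimeError("Recollection limit reached; escalate to lab supervisor")
```

The two failure types take different routes by design: hemolysis is caught and retried, while an identity mismatch is never caught inside the loop.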

Composing Contracts: Summary Principles

Principle	What It Means
Map the happy path first	Get the chain right when everything works, then add failure handling
Define the failure path for every step	What happens if this step fails? Stop? Undo? Substitute? Skip?
Identify irreversible steps	Emails sent, payments charged, physical actions taken — these can't be undone
Look for parallelizable steps	Not every chain is strictly sequential — find steps that don't depend on each other
Look for loops	Some failures send you back to an earlier step. Map these explicitly.
Time sensitivity shapes the chain	A chain that must complete in 200 milliseconds is designed very differently from one that spans 5 business days
The rollback plan is as important as the happy path	For every step that changes state, document how to reverse it if a later step fails

Contracts and Interfaces — Test Your Understanding

Answer each question by writing contracts in plain language. No code. Focus on precision and completeness.

Section A: Write the Contract

Question 1

Write the contract for a password reset operation. A user provides their email address and requests a password reset. Think through: what are all the inputs? What are all the outputs? What can go wrong? What side effects occur?

Use the template:

CONTRACT: ResetPassword
ACCEPTS: ...
RETURNS: ...
ERRORS: ...
SIDE EFFECTS: ...

Question 2

Write the contract for searching a product catalog. A user can search by keyword, filter by category, filter by price range, and choose how results are sorted. Define every input (required vs. optional, constraints), exactly what the output looks like, and what happens when no results match.

Question 3

Write the contract for transferring money between two bank accounts. This is a high-stakes operation. Be thorough: think about validation, insufficient funds, daily transfer limits, accounts that don't exist, accounts that are frozen, and what happens if the system fails mid-transfer.

Section B: Evaluate the Contract

Question 4

Here is a contract someone wrote:

CONTRACT: GetUser
ACCEPTS: user_id
RETURNS: user data
ERRORS: returns error if something goes wrong

List every problem with this contract. Then rewrite it properly.

Question 5

Here is a more detailed contract:

CONTRACT: SubmitReview
ACCEPTS:
  - product_id: text
  - user_id: text
  - rating: number (1-5)
  - review_text: text
RETURNS:
  - review_id: text
ERRORS:
  - invalid product → error
  - invalid user → error

This is better, but still incomplete. Identify at least five things that are missing or underspecified. Rewrite the contract to address them.

Question 6

You receive two different contract proposals for the same operation — sending a notification to a user:

Proposal A:

CONTRACT: SendNotification
ACCEPTS: user_id, message_text, channel (email/push/sms)
RETURNS: success/failure
ERRORS: user not found, invalid channel

Proposal B:

CONTRACT: SendNotification
ACCEPTS: user_id, message_text
RETURNS: notification_id, channel_used, status
ERRORS: user not found, user has no valid contact methods, message too long
SIDE EFFECTS: notification logged, delivery attempted via user's preferred channel

Compare these proposals. Which is better and why? What does Proposal B handle that Proposal A ignores? Is there anything Proposal A does better?

Section C: Contract Composition

Question 7

A food delivery app has the following user action: "Customer places an order from a restaurant."

Break this into a chain of individual contracts. Each contract should have full inputs, outputs, errors, and side effects. The chain should cover everything from the customer pressing "Place Order" to the restaurant receiving the order on their screen.

Define at least 4 contracts in the chain. For each one, show how the output of the previous step feeds into the input of the next.

Question 8

An event ticketing system needs to handle: "User purchases 3 tickets for a concert."

This involves checking seat availability, holding seats temporarily while the user pays, processing payment, and issuing the tickets. These steps must happen in order, and failure at any step has specific consequences.

Write the contract chain. Pay special attention to: what happens if seats are taken between checking availability and completing payment? What happens if payment fails after seats are held?

Question 9

A user wants to change their email address in a system. This seems simple but involves:

Verifying the new email is valid and not already in use
Sending a verification link to the new email
The user clicking the link to confirm
Updating the email in the system
Notifying the old email address about the change

Write contracts for each step. Identify where time gaps exist between steps (e.g., the user might click the link 5 minutes later, or never). How do the contracts handle these gaps?

Section D: Critical Thinking

Question 10

"A good contract should cover every possible edge case."

Is this true? Is it realistic? Where do you draw the line between thoroughness and over-specification? Give an example of an edge case that MUST be in the contract and one that reasonably could be omitted.

Question 11

You write a contract that says:

RETURNS: list of orders, sorted by date descending, maximum 100 results

A colleague argues: "Don't put 'maximum 100' in the contract. That's an implementation detail — maybe we'll change it later."

Who is right? Make the case for each side. What is the cost of including it in the contract? What is the cost of leaving it out?

Question 12

A contract exists between Module A and Module B. Module A has been happily calling Module B for a year. Now Module B needs to add a new required input to the contract.

What problem does this create? How would you handle this change without breaking Module A? Describe at least two approaches and the tradeoffs of each.

Grading Rubric

Criteria	What It Means
Completeness	All five components present: name, inputs (with shapes and constraints), outputs (with guarantees), errors (exhaustive), side effects
Precision	No vague terms like "user data" or "returns error" — every statement is specific enough to implement from
Implementation-free	The contract says what, never how. No mention of databases, languages, or algorithms
Error awareness	Edge cases are considered. Empty results, invalid inputs, system failures, and time-related issues are addressed
Composability	Where contracts chain together, outputs clearly match the next step's inputs. Failure at each step has defined consequences

Decomposition — Why It Matters

The One Skill That Makes Everything Possible

You now know how to trace data through a system (lifecycle), draw lines around responsibilities (boundaries), and define the agreements between parts (contracts). But there's a skill that comes before all of them — the skill that determines whether you can even begin solving a problem:

Decomposition — the ability to take something large and vague and break it into small, concrete, solvable pieces.

This is the single most important skill in engineering, and it has nothing to do with code.

Every senior engineer you'll ever meet does this instinctively. When handed a problem — any problem — their brain immediately starts dividing it into parts. Not because they were taught a formal method, but because they've learned through painful experience that trying to solve a big problem all at once leads to failure, every time.

Why Big Problems Are Impossible

Human brains have limits. You can hold about 4-7 things in working memory at once. A real-world feature might involve 50 things — data sources, business rules, edge cases, user interactions, error handling, performance concerns, security requirements, and more.

If you try to think about all 50 at once, you'll get overwhelmed, miss details, and build something that sort of works but falls apart under scrutiny. This is the experience every beginner has: "it works on my machine, for the simple case, if nothing goes wrong."

Decomposition is the antidote. You don't solve a 50-piece problem. You solve ten 5-piece problems. Each one is small enough to fit in your head. Each one can be verified independently. And when you compose them together, they form the complete solution.

What Decomposition Actually Means

Decomposition is not just "break it into pieces." It's breaking it into pieces that are:

Independent enough to work on separately
Small enough to understand completely
Concrete enough to know when you're done
Ordered correctly so dependencies flow naturally

Bad decomposition leads to pieces that can't be built without building other pieces first, pieces so vague you don't know where to start, or pieces that don't actually combine into a solution.

The Vague Feature Problem

The number one challenge in professional engineering is not technical. It's this: someone gives you a vague request, and you have to turn it into something buildable.

"We need reporting."

What does that mean? Reports about what? For whom? How often? In what format? What data do they need? What decisions will they make from the reports? How accurate does the data need to be? Can it be delayed by an hour? A day?

A beginner hears "we need reporting" and starts building a report page. A senior engineer hears "we need reporting" and asks 20 questions — because they know that the decomposition of the request determines the entire architecture.

Why Decomposition Before Code

In the LLM era, writing code is the cheapest part of the process. An LLM can generate a module in seconds. But it can only do that if someone has already decomposed the problem into a clear, well-bounded piece with a defined contract.

The expensive part — the part that requires human judgment — is:

Understanding what the actual problem is
Breaking it into pieces that make sense
Deciding what order to tackle them in
Knowing when a piece is "done"

This is decomposition. Get it right, and the rest is mechanical. Get it wrong, and no amount of code will save you.

What Goes Wrong Without Decomposition

The Monolith

Someone tries to build everything at once. They create one massive piece of work that does a hundred things. It takes months. Nobody can review it because it's too big. It has subtle bugs that won't be found until production. When it needs a change, any change, the whole thing is at risk.

The Wrong Order

Someone builds the dashboard before building the data pipeline that feeds it. They build the payment system before defining the order structure. They build the notification system before deciding what events trigger notifications. Now they have to rework everything because the foundation doesn't support what was built on top.
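Once the pieces and their dependencies are written down, a valid build order can be computed mechanically. The sketch below uses Python's standard `graphlib` module; the piece names and the dependency map are a hypothetical reading of the examples in the paragraph above.

```python
from graphlib import TopologicalSorter

# Each piece maps to the set of pieces it depends on (hypothetical example).
pieces = {
    "data_pipeline": set(),
    "order_structure": set(),
    "payment_system": {"order_structure"},
    "dashboard": {"data_pipeline"},
    "notifications": {"payment_system"},
}

# static_order() yields every piece after all of its dependencies.
build_order = list(TopologicalSorter(pieces).static_order())
```

Several orders may be valid, but in every one of them the data pipeline comes before the dashboard and the order structure comes before the payment system. The hard part remains the human one: deciding what the pieces and dependencies are.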

The Black Hole

A piece of work keeps getting bigger because nobody defined its edges. "While I'm in here, I'll also add..." and the scope expands endlessly. What was supposed to take a week takes two months, and nobody knows how close it is to done because the target keeps moving.

Invisible Progress

Without decomposition, the only status update is "I'm still working on it." With decomposition, you can say "5 of 8 pieces are complete, the 6th is in progress, and I'm blocked on the 7th until we get a decision on X." This is the difference between a project that surprises everyone with delays and a project that's managed with clarity.

Decomposition Is Not Just For Code

This skill applies to:

Writing a document: What sections does it need? What does each section cover? What order makes sense for the reader?
Planning an event: What are the independent tasks? What depends on what? What can be done in parallel?
Diagnosing a problem: What are the possible causes? How do I eliminate them one by one?
Learning something new: What are the foundational concepts? What builds on what? What can I skip for now?

Engineers who think in decomposition don't just write better software. They communicate better, plan better, and solve problems faster — because they've trained their brain to automatically ask: "What are the pieces, and what's the right order?"

That's what this section teaches you to do deliberately.

Decomposition — Why It Matters

The One Skill That Makes Everything Possible

Decomposition — the ability to take something large and vague and break it into small, concrete, solvable pieces.

This is the single most important skill in engineering, and it has nothing to do with code.

Why Big Problems Are Impossible

What Decomposition Actually Means

Decomposition is not just "break it into pieces." It's breaking it into pieces that are:

Independent enough to work on separately
Small enough to understand completely
Concrete enough to know when you're done
Ordered correctly so dependencies flow naturally

Bad decomposition leads to pieces that can't be built without building other pieces first, pieces so vague you don't know where to start, or pieces that don't actually combine into a solution.

The Vague Feature Problem

The number one challenge in professional engineering is not technical. It's this: someone gives you a vague request, and you have to turn it into something buildable.

"We need reporting."

Decomposition — How: The Method

Two Directions

There are two fundamental approaches, and experts use both — often on the same problem.

Top-Down: Start From the Goal

Begin with the end result and repeatedly ask: "What sub-problems does this require?"

Each answer becomes a new question. You keep asking until each piece is small enough to describe with a clear contract (precise inputs, precise outputs) and small enough that you can estimate the effort.

Quick example: "Build an online bookstore."

What does a bookstore need? → Browsing, cart, checkout, admin
What does browsing need? → Search, filter, sort, details page
What does search need? → Accept query, match results, handle "no results"

Three levels. Each level more specific. Stop when the pieces feel tangible.

Bottom-Up: Start From What You Have

Begin with the pieces you know and ask: "What can I compose from these?"

This works when you have existing components, known constraints, or a technology platform with built-in capabilities.

Quick example: You have a database of books, an email service, and a payment API.

Database → browsing and search features
Email → order confirmations and notifications
Payment API → checkout flow
Combine all three → a basic bookstore

Bottom-up is powerful when building blocks constrain the design. If your payment API only supports credit cards, that fact shapes the checkout feature — you discover this from the bottom, not the top.

When to Use Which

Situation	Approach
New project, blank slate	Top-down — start from what users need
Existing system, adding features	Bottom-up — start from what already exists
Unclear requirements	Top-down first to clarify scope, then bottom-up to ground it in reality
Well-understood problem	Either works; most engineers blend both naturally

The Decomposition Tree

The primary artifact of decomposition is a tree — a hierarchy where each node is a piece of the problem and its children are its sub-pieces.

At every leaf of this tree, you should be able to:

Write a contract (inputs, outputs, errors)
Estimate the effort (small, medium, large)
Identify dependencies (does this need something else built first?)

If you can't do all three, the piece isn't decomposed enough — keep breaking it down.
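
The three leaf tests can be written down as a small check. What follows is a minimal Python sketch for illustration, not a prescribed tool: the `Node` class, its field names (`contract`, `effort`, `depends_on`), and the sample tree are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One piece of the problem; children are its sub-pieces."""
    name: str
    children: list["Node"] = field(default_factory=list)
    contract: Optional[str] = None          # inputs, outputs, errors
    effort: Optional[str] = None            # "small", "medium", or "large"
    depends_on: Optional[list[str]] = None  # [] means "depends on nothing"


def unfinished_leaves(node: Node) -> list[str]:
    """Return the names of leaves that fail any of the three tests."""
    if node.children:
        names: list[str] = []
        for child in node.children:
            names.extend(unfinished_leaves(child))
        return names
    ready = (
        node.contract is not None
        and node.effort is not None
        and node.depends_on is not None
    )
    return [] if ready else [node.name]


tree = Node("Browsing", children=[
    Node("Search", contract="IN: query. OUT: results. ERRORS: no results.",
         effort="medium", depends_on=["Book data"]),
    Node("Filter"),  # nothing written down yet, so it needs more work
])
print(unfinished_leaves(tree))  # ['Filter']
```

Any name the check returns is a piece to keep breaking down.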

Finding Seams

A seam is a natural break point — a place where one concern ends and another begins. Recognizing seams makes decomposition faster and more accurate.

Data format changes

Wherever data changes shape, there's a seam. Raw input → validated input. Validated input → database record. Database record → display format.

Responsibility changes

Wherever "whose job is this?" changes, there's a seam. The user's browser collects input. The server validates it. The database stores it. Three responsibilities, three seams.

Time boundaries

Wherever something can happen "later" or "separately," there's a seam. The order is placed now. The shipping label is generated later. The daily summary runs overnight.

Audience changes

Wherever different users see different things, there's a seam. The customer sees their order. The admin sees all orders. The warehouse sees orders ready to ship.

Error handling boundaries

Wherever the response to failure changes, there's a seam. Search fails → show "no results." Payment fails → stop checkout. Email fails → log it and continue.
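
The payment and email cases above can be sketched in a few lines of Python. This is an illustration under stated assumptions: `checkout`, `charge`, and `send_email` are hypothetical names, and `RuntimeError` stands in for whatever failure a real service would report.

```python
def checkout(charge, send_email, log):
    """Payment failure stops the flow; email failure is logged and ignored."""
    try:
        charge()
    except RuntimeError as exc:
        return f"checkout stopped: {exc}"
    try:
        send_email()
    except RuntimeError as exc:
        log.append(f"email failed: {exc}")
    return "order placed"


def ok():
    """Stand-in for a call that succeeds."""
    return None


def fail():
    """Stand-in for a call that fails."""
    raise RuntimeError("service unavailable")


log = []
print(checkout(ok, fail, log))   # order placed
print(log)                       # ['email failed: service unavailable']
print(checkout(fail, ok, []))    # checkout stopped: service unavailable
```

The two `try` blocks respond to failure differently, and that difference is exactly where the seam sits.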

Dependency Mapping

Once you have a tree, identify what depends on what.

Three dependency rules:

Things at the bottom should depend on nothing or on stable abstractions
Things at the top can depend on things below
Circular dependencies are a design error — if A needs B and B needs A, your decomposition is wrong

The dependency arrows tell you the build order: start with pieces that have no dependencies, then build what depends on them, then what depends on those.
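
Reading a build order off the arrows is a topological sort. The sketch below uses `graphlib` from Python's standard library (3.9+). The piece names are illustrative assumptions, not a required breakdown.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the pieces in its set (hypothetical example pieces).
depends_on = {
    "catalog": set(),
    "cart": {"catalog"},
    "checkout": {"cart"},
    "order_record": {"checkout"},
}

# Pieces with no dependencies come first, then whatever depends on them.
build_order = list(TopologicalSorter(depends_on).static_order())
print(build_order)  # ['catalog', 'cart', 'checkout', 'order_record']

# A circular dependency has no valid build order, so it is reported as an error.
circular = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(circular).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```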

Estimating From Decomposition

Once a problem is decomposed into leaves, you can estimate by creating a table:

Leaf	Complexity	Dependencies	Priority
Each decomposed piece	Small/Medium/Large	What it needs	High/Medium/Low

This table, derived entirely from decomposition, gives you a project plan — not a guess, but a structured breakdown where each piece is estimable and the order is logical.
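
Once the table exists as data, simple questions about the plan become one-liners. A minimal Python sketch follows, with a handful of made-up rows standing in for a real table.

```python
from collections import Counter

# (leaf, complexity, dependencies, priority): illustrative rows only
leaves = [
    ("Store book data", "Small", [], "High"),
    ("Search books", "Medium", ["Store book data"], "High"),
    ("Add to cart", "Small", ["Store book data"], "High"),
    ("Request return", "Medium", ["Create order record"], "Low"),
]

# How much work of each size is there?
by_complexity = Counter(complexity for _, complexity, _, _ in leaves)
print(dict(by_complexity))  # {'Small': 2, 'Medium': 2}

# Leaves with no dependencies are candidates to start with.
ready_now = [name for name, _, deps, _ in leaves if not deps]
print(ready_now)  # ['Store book data']
```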

The Decomposition Checklist

When you've finished decomposing, verify:

Every leaf is concrete — you can write a contract for it
Every leaf is small — you could explain it completely in 2-3 sentences
No leaf has hidden complexity — if it feels big, it needs more decomposition
Dependencies are explicit — you know what depends on what
No circular dependencies — everything flows in one direction
Nothing is missing — trace the user's journey start to finish; every step has a leaf
Nothing overlaps — each responsibility appears exactly once
Build order is clear — you know what to start with

What to Look For in the Examples

The following pages take three very different systems and decompose them completely. As you read:

Watch how the tree grows — from a vague goal to concrete, estimable leaves
Notice where seams appear — and which type of seam it is
Compare the final tree depth — some systems are deeper than others
Look at the dependency map — what must be built first?
Notice the decisions — decomposition isn't mechanical; there are judgment calls about where to split

Decomposition — Example: Online Bookstore

The Starting Point

Goal: Build an online bookstore where customers can browse, search, buy books, and track orders. Administrators can manage inventory.

This is deliberately a "boring" system. It's well-understood, which lets us focus purely on the decomposition technique without domain complexity getting in the way.

Step 1: Top-Down — First Level

Ask: "What does an online bookstore need?"

Online Bookstore
├── Browsing & Search
├── Shopping Cart
├── Checkout & Payment
├── Order Management
└── Admin: Inventory

Five branches. Each is a major area of functionality. But none of these are concrete enough to build yet — "Browsing & Search" could mean a hundred different things.

Step 2: Second Level — Each Branch

Ask: "What does each of these need?"

Browsing & Search

Browsing & Search
├── List all books (paginated)
├── Search by keyword (title, author, or ISBN)
├── Filter by genre, price range, publication date
├── Sort by price / date / rating / relevance
└── View book details (description, reviews, availability)

Shopping Cart

Shopping Cart
├── Add item to cart
├── Remove item from cart
├── Update quantity
├── View cart summary (items, quantities, subtotal)
└── Cart persistence (survives page refresh, login/logout)

Checkout & Payment

Checkout & Payment
├── Enter shipping address
├── Validate shipping address
├── Select shipping method (standard, express)
├── Calculate order total (items + shipping + tax)
├── Enter payment information
├── Process payment
├── Create order record
└── Send confirmation email

Order Management

Order Management
├── View order history (list of past orders)
├── View single order details (items, shipping status, tracking)
├── Cancel order (if not yet shipped)
└── Request return (if within return window)

Admin: Inventory

Admin: Inventory
├── Add new book to catalog
├── Update book details (price, description, cover image)
├── Adjust stock levels (restock, corrections)
├── Mark book as discontinued
└── View low-stock alerts

Step 3: Check Each Leaf Against the Three Tests

For every leaf, ask:

Can I write a contract for it? (Inputs, outputs, errors)
Can I estimate the effort? (Small, medium, large)
Can I identify its dependencies?

Let's test a few:

"Add item to cart" — Leaf Test

Question	Answer
Contract?	✅ IN: customer_id, book_id, quantity. OUT: updated cart. ERRORS: book not found, out of stock, invalid quantity.
Estimate?	✅ Small — straightforward data operation
Dependencies?	✅ Needs: Book Catalog (to verify the book exists and is in stock)

Verdict: Concrete enough. This is a good leaf.
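
A leaf that passes the test translates almost directly into a function signature. The sketch below is one possible Python rendering of the contract above. The exception names and the in-memory `STOCK` and `CARTS` stand-ins are assumptions made for illustration.

```python
class BookNotFound(Exception):
    pass


class OutOfStock(Exception):
    pass


class InvalidQuantity(Exception):
    pass


# Hypothetical in-memory stand-ins for the catalog and the cart store.
STOCK = {"book-1": 3, "book-2": 0}        # book_id -> units in stock
CARTS: dict[str, dict[str, int]] = {}     # customer_id -> {book_id: quantity}


def add_item_to_cart(customer_id: str, book_id: str, quantity: int) -> dict[str, int]:
    """IN: customer_id, book_id, quantity. OUT: updated cart. ERRORS: the three above."""
    if quantity < 1:
        raise InvalidQuantity(f"quantity must be at least 1, got {quantity}")
    if book_id not in STOCK:
        raise BookNotFound(book_id)
    cart = CARTS.setdefault(customer_id, {})
    wanted = cart.get(book_id, 0) + quantity
    if wanted > STOCK[book_id]:
        raise OutOfStock(book_id)
    cart[book_id] = wanted
    return dict(cart)


print(add_item_to_cart("cust-1", "book-1", 2))  # {'book-1': 2}
```

Every error named in the contract maps to exactly one branch. That one-to-one mapping is a quick sign that the contract is complete enough to build from.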

"Process payment" — Leaf Test

Question	Answer
Contract?	⚠️ Partially — but what about different payment methods? Retries? Partial payments?
Estimate?	⚠️ "Medium" is a guess — there might be hidden complexity
Dependencies?	✅ Needs: order total from "Calculate order total," payment info from "Enter payment"

Verdict: Not quite decomposed enough. Let's break it down further.

Process Payment
├── Validate payment method (card not expired, sufficient funds estimate)
├── Submit payment to processor
├── Handle processor response (approved, declined, error)
├── Record payment result
└── Handle retry on transient failure

Now each sub-piece is concrete and estimable.
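
To show how small these sub-pieces are, here is a Python sketch of two of them together: handling the processor response and retrying on transient failure. `TransientError`, `submit_with_retry`, and the fake processor are hypothetical names invented for this example. No real payment API is being described.

```python
from typing import Callable


class TransientError(Exception):
    """Hypothetical: a failure worth retrying, such as a timeout."""


def submit_with_retry(submit: Callable[[int], str], amount_cents: int,
                      max_attempts: int = 3) -> str:
    """Return 'approved', 'declined', or 'error' after at most max_attempts tries."""
    for _ in range(max_attempts):
        try:
            response = submit(amount_cents)
        except TransientError:
            continue  # try again; a real system would also wait between tries
        if response in ("approved", "declined"):
            return response
        return "error"  # unrecognized response: do not retry blindly
    return "error"      # every attempt hit a transient failure


def make_flaky_processor(failures: int) -> Callable[[int], str]:
    """Fake processor that times out `failures` times, then approves."""
    remaining = [failures]

    def submit(amount_cents: int) -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("timeout")
        return "approved"

    return submit


print(submit_with_retry(make_flaky_processor(2), 2599))  # approved
print(submit_with_retry(make_flaky_processor(5), 2599))  # error
```

The sketch leaves out waiting between attempts and any safeguard against charging twice. Both would need their own decisions in a real system.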

"Cart persistence" — Leaf Test

Question	Answer
Contract?	⚠️ This isn't an operation — it's a behavior. "Persist" isn't something a user does; it's something the system must maintain.
Estimate?	⚠️ Depends on implementation approach (cookies? database? session?)
Dependencies?	⚠️ Unclear

Verdict: This is a requirement, not a leaf. It constrains how the cart works but doesn't decompose into an operation with a contract. Move it to a "requirements" list and let it inform the design of the other cart operations.

Step 4: Dependency Map

Now draw the dependency arrows:

Book Catalog ──────────────────── (foundation — depends on nothing)
          │
          ▼
Shopping Cart ──── needs catalog for prices, stock checks
          │
          ▼
Checkout ──── needs cart contents
    │    │
    │    ▼
    │  Shipping ──── needs shipping address
    │    │
    │    ▼
    │  Order Total ──── needs cart + shipping + tax rules
    │    │
    │    ▼
    └─ Payment ──── needs order total
          │
          ▼
Order Record ──── needs payment confirmation + cart + shipping
          │
          ▼
Confirmation Email ──── needs order record

And separately:

Order Management ──── needs Order Record store (read-only)
Admin: Inventory ──── needs Book Catalog store (read-write)

What This Map Tells Us

Build order:

Book Catalog (no dependencies)
Shopping Cart (needs catalog)
Shipping + Tax calculations
Payment integration
Order creation
Email notifications
Order management views
Admin tools

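Steps 3 and 4 (Shipping + Tax calculations and Payment integration) can be built in parallel. Payment needs the order total only as an input, so the integration can be developed against a placeholder total while the calculations are still being written.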

Step 5: The Complete Tree

Online Bookstore
├── Book Catalog (foundation)
│   ├── Store book data (title, author, price, genre, description, cover)
│   ├── List books (paginated, sorted)
│   ├── Search books (keyword match against title/author/ISBN)
│   ├── Filter books (genre, price range, date range)
│   └── Get single book details
│
├── Shopping Cart
│   ├── Add item (book_id, quantity) → validate against catalog
│   ├── Remove item (line_item_id)
│   ├── Update quantity (line_item_id, new_quantity) → revalidate stock
│   └── Get cart summary (items, subtotal)
│
├── Checkout
│   ├── Enter shipping address
│   ├── Validate shipping address (real address? deliverable?)
│   ├── Select shipping method → calculate shipping cost
│   ├── Calculate order total (items + shipping + tax)
│   ├── Validate payment method
│   ├── Submit payment → handle response
│   ├── Record payment
│   ├── Create order record
│   └── Send confirmation email
│
├── Order Management (customer-facing)
│   ├── List my orders (paginated, newest first)
│   ├── View order details (items, status, tracking)
│   ├── Cancel order (if status = "processing")
│   └── Request return (if within 30-day window)
│
└── Admin: Inventory
    ├── Add book to catalog
    ├── Update book details
    ├── Adjust stock levels
    ├── Mark book as discontinued
    └── View low-stock alerts (books with stock < threshold)

Total leaf count: 27 operations.

Each one can be contracted, estimated, and built. The original "build an online bookstore" has been transformed from a vague idea into 27 specific, concrete tasks with a clear build order.

The Estimation Table

Leaf	Complexity	Dependencies	Priority	Notes
Store book data	Small	None	Critical	Foundation for everything
List books	Small	Book data	Critical	Core feature
Search books	Medium	Book data	Critical	Needs text matching logic
Filter books	Small	Book data	High	Simple query constraints
Get book details	Small	Book data	Critical	Used by cart and display
Add to cart	Small	Get book details	Critical
Remove from cart	Small	None	Critical
Update cart quantity	Small	Get book details (stock check)	High
Get cart summary	Small	None	Critical	Checkout needs this
Enter shipping address	Small	None	Critical
Validate shipping address	Medium	External validation service	High	Edge cases with international
Select shipping method	Small	Validated address	High
Calculate order total	Small	Cart + shipping + tax rules	Critical
Validate payment	Small	None	Critical
Submit payment	Medium	External payment API + order total	Critical	Error handling is complex
Record payment	Small	Payment response	Critical
Create order record	Small	Cart + payment + shipping	Critical
Send confirmation email	Small	Order record	High	Can be async
List my orders	Small	Order records	High
View order details	Small	Order records	High
Cancel order	Small	Order record (status check)	Medium
Request return	Medium	Order record + return policy rules	Low	Can ship v1 without it
Add book (admin)	Small	None	High
Update book (admin)	Small	Book data	High
Adjust stock (admin)	Small	Book data	High
Discontinue book (admin)	Small	Book data	Medium
Low-stock alerts (admin)	Small	Book data	Low	Nice to have for v1

Rough estimate: 4 medium tasks + 23 small tasks. This is a plannable project.

What This Example Teaches

Start big, get specific — 5 branches became 27 concrete leaves
Test every leaf — if you can't contract it, it's not done decomposing
Some "leaves" are actually requirements — cart persistence is a constraint, not an operation
Dependencies give you build order — don't guess what to build first; the map tells you
The boring system is a good training system — if you can fully decompose a bookstore, you can decompose anything

Decomposition — Example: Messaging Application

The Starting Point

Goal: Build a messaging app where users can send direct messages, create group chats, share files, and see who's online. Think Slack or Discord — but focus on the decomposition, not the scale.

This system is different from the bookstore because it's real-time, multi-user, and has state that changes constantly (who's online, who's typing, unread counts). These properties create new kinds of seams.

Step 1: Top-Down — First Level

Ask: "What does a messaging app need?"

Messaging App
├── User Management
├── Conversations (1-on-1 and groups)
├── Messages
├── Presence (online/offline/typing)
├── Notifications
└── File Sharing

Six branches. Immediately, questions arise:

Is "conversations" one thing, or are 1-on-1 and groups different enough to be separate branches?
Where does "search messages" live? Under Messages? Its own branch?
Is "presence" really separate from "user management"?

Decision: Keep them separate for now. If two branches share too much, that's a signal to merge them later. It's easier to merge than to split.

Step 2: Second Level — Each Branch

User Management

User Management
├── Register new user
├── Log in (with session creation)
├── Log out (with session cleanup)
├── Update profile (display name, avatar, status message)
├── Block a user
└── Unblock a user

Conversations

Conversations
├── Direct Messages
│   ├── Start a DM conversation (with one other user)
│   └── List my DM conversations (sorted by most recent message)
├── Group Chats
│   ├── Create a group (name, initial members)
│   ├── Add member to group
│   ├── Remove member from group
│   ├── Leave group
│   ├── Update group details (name, description, avatar)
│   └── List my groups (sorted by most recent message)
└── Shared
    ├── Get conversation history (paginated, newest first)
    └── Mark conversation as read

Messages

Messages
├── Send text message (to a conversation)
├── Edit a sent message
├── Delete a sent message
├── React to a message (emoji)
├── Reply to a specific message (threaded)
├── Search messages (across all conversations or within one)
└── Pin a message in a conversation

Presence

Presence
├── Update my status (online / away / do-not-disturb / offline)
├── Get a user's current status
├── Get status for a list of users (for the sidebar)
├── Typing indicators (start typing, stop typing)
└── Last seen timestamp

Notifications

Notifications
├── In-app notification (badge count, pop-up)
├── Push notification (mobile device)
├── Email notification (for offline users, after delay)
├── Notification preferences (per conversation: all, mentions only, mute)
└── Mark notification as read

File Sharing

File Sharing
├── Upload file (attached to a message in a conversation)
├── Download file
├── Generate file preview (images, PDFs)
├── Enforce file size limits
└── Track storage usage per user

Step 3: Finding Seams — Where This Gets Interesting

Seam: Real-Time vs. Stored

Some operations are real-time (typing indicators, presence) and some are stored (message history, user profiles). This creates a fundamental seam:

Real-time data is ephemeral — "user is typing" doesn't go in a database; it's a transient signal
Stored data is permanent — messages are kept until deleted

This seam affects the entire architecture. Real-time features use different patterns (publish-subscribe) than stored features (request-response).

Seam: Who Sees What

When a message is sent to a group of 50 people, all 50 need to receive it. But:

10 are currently online → they see it instantly
15 have push notifications → they get a phone notification
25 are offline with email notifications → they get an email after 15 minutes

Same event, three different delivery paths. This is an audience seam — the delivery mechanism changes based on the recipient's state.

Seam: Sender vs. Recipient Experience

When you send a message:

You see it immediately in your conversation (optimistic display)
The system stores it
The system delivers it to recipients
Recipients see a notification

The sender's experience and the recipient's experience are different flows triggered by the same event. This is a seam.

Seam: Conversation Create vs. Message in Conversation

What happens when you message someone for the first time? Is it:

Create a conversation, then send a message into it? (Two operations)
"Send message to user" — and the conversation is created as a side effect?

Both are valid decompositions. Option 1 is more explicit. Option 2 is more user-friendly. This is a decomposition judgment call — the right answer depends on how you want the user experience to work.

Decision: The user action is "send message to person." Internally, this decomposes into "find or create conversation" + "add message to conversation." The decomposition has two pieces, but they're presented to the user as one action.
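
This decision can be sketched as two small functions composed into one user-facing action. The Python below is illustrative only: the in-memory dictionaries and function names are assumptions, and DM conversations are identified here by the unordered pair of participants.

```python
from itertools import count

_next_id = count(1)
dm_index: dict[frozenset, int] = {}               # pair of users -> conversation id
messages: dict[int, list[tuple[str, str]]] = {}   # conversation id -> (sender, text)


def find_or_create_conversation(user_a: str, user_b: str) -> int:
    """Return the existing DM conversation for this pair, creating it if needed."""
    pair = frozenset((user_a, user_b))
    if pair not in dm_index:
        dm_index[pair] = next(_next_id)
        messages[dm_index[pair]] = []
    return dm_index[pair]


def add_message(conversation_id: int, sender: str, text: str) -> None:
    """Append one message to an existing conversation."""
    messages[conversation_id].append((sender, text))


def send_message_to_user(sender: str, recipient: str, text: str) -> int:
    """The single user-facing action, composed from the two pieces."""
    conversation_id = find_or_create_conversation(sender, recipient)
    add_message(conversation_id, sender, text)
    return conversation_id


first = send_message_to_user("ana", "ben", "hi")
second = send_message_to_user("ben", "ana", "hello")
print(first == second, len(messages[first]))  # True 2
```

The user never sees the first function, but keeping it separate means each piece can still be given its own contract and tested alone.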

Step 4: Leaf Test — Checking Our Work

"Send text message" — Leaf Test

Question	Answer
Contract?	✅ IN: sender_id, conversation_id, message_text. OUT: message_record (id, timestamp, content). ERRORS: conversation not found, sender not a member, message too long, sender is blocked by recipient.
Estimate?	⚠️ Sending is small, but delivery is complex — 50 people need to receive it via 3 different channels
Dependencies?	✅ Needs: conversation existence, sender membership

Verdict: "Send" is concrete, but "deliver to all recipients" is hidden inside it. Split it:

Send text message
├── Validate and store message
└── Fan out to recipients
    ├── Deliver to online recipients (real-time)
    ├── Queue push notifications (for mobile recipients)
    └── Queue email notifications (for offline recipients, delayed)

Now each sub-piece is estimable. "Fan out to recipients" was hidden complexity — it looked simple until you asked "what about the 50 people?"
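
The routing step inside fan-out is small once it is isolated. Here is a minimal Python sketch using the 50-person group from earlier. The state labels (`online`, `push`, `offline`) and channel names are assumptions chosen for the example.

```python
from collections import defaultdict

# Which delivery channel serves each recipient state (illustrative labels).
CHANNEL_FOR_STATE = {
    "online": "realtime",
    "push": "push_queue",
    "offline": "email_queue",
}


def fan_out(recipient_states: dict[str, str]) -> dict[str, list[str]]:
    """Group recipients by delivery channel based on their current state."""
    plan: dict[str, list[str]] = defaultdict(list)
    for user_id, state in recipient_states.items():
        plan[CHANNEL_FOR_STATE[state]].append(user_id)
    return dict(plan)


states = {f"user{i}": "online" for i in range(10)}
states.update({f"user{i}": "push" for i in range(10, 25)})
states.update({f"user{i}": "offline" for i in range(25, 50)})

plan = fan_out(states)
print({channel: len(users) for channel, users in plan.items()})
# {'realtime': 10, 'push_queue': 15, 'email_queue': 25}
```

The function only decides who goes where. Actually delivering on each channel remains a separate leaf.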

"Typing indicators" — Leaf Test

Question	Answer
Contract?	✅ IN: user_id, conversation_id, is_typing (yes/no). OUT: (broadcast event to other members). ERRORS: none meaningful — this is fire-and-forget.
Estimate?	✅ Small — ephemeral event, no storage
Dependencies?	✅ Needs: real-time connection to conversation members

Verdict: Concrete enough. But note the seam: this is a real-time feature. It follows a completely different pattern from message storage.
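
To make the "different pattern" concrete, here is a small in-process publish-subscribe sketch in Python. It is a toy stand-in for a real-time connection. `subscribe`, `set_typing`, and the listener registry are hypothetical names, and nothing is written to storage.

```python
from collections import defaultdict
from typing import Callable

Listener = Callable[[dict], None]
# conversation_id -> {user_id: listener}; lives in memory only
listeners: dict[str, dict[str, Listener]] = defaultdict(dict)


def subscribe(conversation_id: str, user_id: str, listener: Listener) -> None:
    """Register a member's callback for events in one conversation."""
    listeners[conversation_id][user_id] = listener


def set_typing(user_id: str, conversation_id: str, is_typing: bool) -> None:
    """Broadcast to the other members; fire-and-forget, nothing is returned."""
    event = {"user_id": user_id, "is_typing": is_typing}
    for member_id, listener in listeners[conversation_id].items():
        if member_id != user_id:
            listener(event)


seen_by_ben = []
subscribe("conv-1", "ana", lambda event: None)
subscribe("conv-1", "ben", seen_by_ben.append)
set_typing("ana", "conv-1", True)
print(seen_by_ben)  # [{'user_id': 'ana', 'is_typing': True}]
```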

"Search messages" — Leaf Test

Question	Answer
Contract?	⚠️ IN: query, scope (all conversations or specific one). OUT: list of matching messages with context. But what about: fuzzy matching? Matching within files? Searching by date? Searching by sender?
Estimate?	❌ "Medium" is a guess — full-text search is a deep problem
Dependencies?	✅ Needs: all stored messages

Verdict: Needs further decomposition:

Search messages
├── Basic keyword search (exact match within text)
├── Filter by conversation
├── Filter by sender
├── Filter by date range
├── Combine filters (keyword + sender + date range)
└── Return results with surrounding context (messages before/after)
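
With the pieces separated, a first version of search can be quite plain. The Python sketch below combines keyword matching, the three filters, and surrounding context. It assumes case-insensitive substring matching and an in-memory list of messages in chronological order. The `Message` fields are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Message:
    conversation_id: str
    sender: str
    sent_on: date
    text: str


def search(messages, keyword, conversation_id=None, sender=None,
           start=None, end=None, context=1):
    """Return each hit together with `context` messages before and after it."""
    threads = defaultdict(list)
    for message in messages:
        threads[message.conversation_id].append(message)

    results = []
    for thread_id, thread in threads.items():
        if conversation_id is not None and thread_id != conversation_id:
            continue
        for i, message in enumerate(thread):
            if keyword.lower() not in message.text.lower():
                continue
            if sender is not None and message.sender != sender:
                continue
            if start is not None and message.sent_on < start:
                continue
            if end is not None and message.sent_on > end:
                continue
            # Context is taken from the same conversation only.
            results.append(thread[max(0, i - context): i + context + 1])
    return results


history = [
    Message("general", "ana", date(2024, 1, 1), "Lunch today?"),
    Message("general", "ben", date(2024, 1, 1), "Yes, the report is done"),
    Message("general", "ana", date(2024, 1, 2), "Great"),
    Message("random", "ben", date(2024, 1, 3), "Report card memes"),
]

hits = search(history, "report", sender="ben", end=date(2024, 1, 2))
print([[m.text for m in hit] for hit in hits])
# [['Lunch today?', 'Yes, the report is done', 'Great']]
```

Fuzzy matching and searching inside files are deliberately absent. They were the open questions that made the original leaf impossible to estimate.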

Step 5: Dependency Map

User Management ────────────────── (foundation)
       │
       ▼
Conversations ──── needs users to exist
       │
       ├──────────────────────────┐
       ▼                          ▼
Messages ──── needs conversation   Presence ──── needs user sessions
       │                                │
       ▼                                │
Fan-out to recipients ──────────────────┘
       │                     needs presence to know 
       │                     who's online vs. offline
       ▼
Notifications ──── needs message events + user preferences
       │
       ▼
File Sharing ──── needs messages (files are attached to messages)

What This Map Reveals

The fan-out node depends on BOTH messages AND presence. To deliver a message, you need to know who's online (real-time delivery) vs. offline (push/email). This dependency is invisible if you decompose messages and presence separately — the dependency map reveals the connection.

File sharing is a leaf-level feature. It depends on messages but nothing depends on it. This means it can be built last (or omitted from v1).

Notifications depend on nearly everything. They need messages (what happened), users (who to notify), presence (how to notify), and preferences (should we notify). This makes notifications a high-dependency, build-last feature.
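
These observations can be checked mechanically. The Python sketch below transcribes the map into a dictionary and groups the branches into build layers with the standard-library `graphlib` module. The edge list is one reading of the diagram (for instance, Presence is placed under Conversations because that is where its arrow is drawn), so treat it as an assumption.

```python
from graphlib import TopologicalSorter

# Each key depends on the branches in its set (one reading of the map).
depends_on = {
    "user_management": set(),
    "conversations": {"user_management"},
    "messages": {"conversations"},
    "presence": {"conversations"},
    "fan_out": {"messages", "presence"},
    "notifications": {"fan_out"},
    "file_sharing": {"messages"},
}

# Group branches into layers; everything in one layer can be built in parallel.
sorter = TopologicalSorter(depends_on)
sorter.prepare()
layers = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    layers.append(ready)
    sorter.done(*ready)

for layer in layers:
    print(layer)
# ['user_management']
# ['conversations']
# ['messages', 'presence']
# ['fan_out', 'file_sharing']
# ['notifications']

# Branches that nothing else depends on are the safest to defer.
deferrable = sorted(
    name for name in depends_on
    if not any(name in deps for deps in depends_on.values())
)
print(deferrable)  # ['file_sharing', 'notifications']
```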

Step 6: The Complete Tree

Messaging App
├── User Management (foundation)
│   ├── Register
│   ├── Login / Logout
│   ├── Update profile
│   └── Block / Unblock user
│
├── Conversations
│   ├── DM: Start conversation (find or create)
│   ├── DM: List my conversations
│   ├── Group: Create group
│   ├── Group: Add / Remove member
│   ├── Group: Leave group
│   ├── Group: Update group details
│   ├── Group: List my groups
│   ├── Shared: Get message history (paginated)
│   └── Shared: Mark as read
│
├── Messages
│   ├── Send message (validate + store)
│   ├── Fan out to recipients
│   │   ├── Real-time delivery (online users)
│   │   ├── Push notification queue (mobile)
│   │   └── Email notification queue (offline, delayed)
│   ├── Edit message
│   ├── Delete message
│   ├── React to message
│   ├── Reply (thread)
│   ├── Pin message
│   └── Search
│       ├── Keyword search
│       ├── Filter by conversation / sender / date
│       └── Return results with context
│
├── Presence
│   ├── Update my status
│   ├── Get user status
│   ├── Get bulk status (sidebar)
│   ├── Typing indicator (broadcast)
│   └── Last seen timestamp
│
├── Notifications
│   ├── In-app badge / pop-up
│   ├── Push notification dispatch
│   ├── Email notification dispatch
│   ├── Notification preferences
│   └── Mark notification as read
│
└── File Sharing
    ├── Upload file (to message)
    ├── Download file
    ├── Generate preview
    └── Enforce limits (size, storage)

Total leaf count: 35+ operations.

Comparing Bookstore vs. Messaging App

Aspect	Bookstore	Messaging App
Primary data flow	Request-response (user asks, system answers)	Bidirectional (users send/receive continuously)
Real-time requirements	None (page refreshes are fine)	Critical (messages must appear instantly)
Hardest decomposition challenge	Checkout flow (sequential, many steps)	Message fan-out (one event → many recipients × multiple channels)
Hidden complexity	Payment error handling	Presence + notification routing
Deepest branch	Checkout (9 leaves)	Messages → Fan-out → 3 delivery channels
Seams discovered	Data format changes, time boundaries	Real-time vs. stored, sender vs. recipient, audience routing
Build-last features	Returns, low-stock alerts	Notifications, file sharing, search
Approximate leaf count	27	35+

The messaging app has more leaves not because it's "harder" but because it has more dimensions: real-time + stored, sender + receiver, online + offline. Each dimension multiplies the decomposition.

What This Example Teaches

Real-time creates new seam types — ephemeral vs. persistent data is a fundamental split
Fan-out is hidden complexity — "send a message" sounds atomic but it triggers per-recipient work
Some features bridge multiple branches — notifications depend on messages, presence, AND user preferences
"Start conversation" is a design decision — explicit vs. implicit conversation creation changes the decomposition
More dimensions = more leaves — systems with multiple audiences, delivery channels, and timing requirements have larger trees

Decomposition — Example: Package Delivery Logistics

The Starting Point

Goal: Build a system for a delivery company that picks up packages from senders, routes them through sorting facilities, and delivers them to recipients. Track every package at every step. Handle failed deliveries, re-routing, and returns.

This system is different from the previous examples because it spans physical space and physical time. A package moves through multiple locations over multiple days. The decomposition must account for geography, vehicle routing, and the hard reality that physical objects can be lost.

Step 1: Top-Down — First Level

Ask: "What does a package delivery system need?"

Package Delivery
├── Package Intake (getting packages into the system)
├── Routing (deciding how packages travel)
├── Sorting & Transfer (physical movement through facilities)
├── Last-Mile Delivery (getting packages to recipients)
├── Tracking (visibility into package location)
└── Exception Handling (when things go wrong)

Six branches. But already, this decomposition reveals something: "Exception Handling" is its own branch. In the bookstore and messaging app, error handling was part of each leaf. Here, exceptions are so common and varied that they form a top-level concern.

Why? Because physical systems fail differently than digital ones. A database transaction either commits or rolls back. A package can be partially delivered (wrong address, left with neighbor, returned to sender). The failure modes create their own subsystem.

Step 2: Second Level — Each Branch

Package Intake

Package Intake
├── Customer creates shipment request
│   ├── Enter sender address
│   ├── Enter recipient address
│   ├── Enter package details (weight, dimensions, contents description)
│   ├── Select service level (overnight, 2-day, ground, economy)
│   └── Validate addresses (real addresses? deliverable? restricted areas?)
├── Calculate price (based on weight, dimensions, distance, service level)
├── Generate shipping label (with barcode/QR code and tracking number)
├── Schedule pickup (or drop at a facility)
└── Record package in system (status: "label created")

Routing

Routing
├── Determine route plan
│   ├── Origin facility (nearest sorting center to sender)
│   ├── Destination facility (nearest sorting center to recipient)
│   ├── Intermediate hops (if origin and destination aren't connected directly)
│   └── Transportation mode for each leg (truck, air, rail)
├── Optimize route for service level
│   ├── Overnight → air route, priority loading
│   ├── Ground → truck route, cost-optimized
│   └── Economy → most cost-efficient, may wait for full truck
└── Update route (if re-routing is needed due to weather, capacity, etc.)

Sorting & Transfer

Sorting & Transfer
├── Package arrives at facility (scan barcode → update status: "at [facility]")
├── Sort package to outbound lane (based on route's next hop)
├── Load package onto vehicle (scan → update status: "in transit to [next facility]")
├── Transfer between vehicles (for multi-leg routes)
└── Package arrives at destination facility (scan → update status: "at destination facility")

Last-Mile Delivery

Last-Mile Delivery
├── Assign package to delivery route (group packages by neighborhood)
├── Load delivery vehicle (scan each package)
├── Attempt delivery
│   ├── Successful delivery (scan → status: "delivered", capture signature or photo)
│   ├── No one home → leave at door (if authorized) or leave notice
│   ├── Wrong address → return to facility, flag for investigation
│   ├── Refused by recipient → return to facility
│   └── Access issue (gated community, locked building) → leave notice, reschedule
└── End of day: reconcile (all packages either delivered or accounted for)

Tracking

Tracking
├── Generate tracking number (at intake)
├── Record scan event (every scan at every point creates a tracking update)
├── Calculate estimated delivery date (based on route + service level)
├── Update estimated delivery date (if delays occur)
├── Provide tracking timeline to customer (ordered list of events)
└── Send proactive notifications
    ├── "Package picked up"
    ├── "Package in transit"
    ├── "Out for delivery"
    ├── "Delivered" (with photo/signature)
    └── "Delivery attempted — notice left"

Exception Handling

Exception Handling
├── Failed delivery (no one home, refused, wrong address)
│   ├── Reschedule delivery attempt
│   ├── Hold at facility for customer pickup
│   └── Return to sender (after N failed attempts)
├── Damaged package
│   ├── Assess damage (at any scan point)
│   ├── Notify sender and recipient
│   ├── File insurance claim (if insured)
│   └── Decide: deliver damaged or return to sender
├── Lost package
│   ├── Detect: expected scan didn't happen within time window
│   ├── Investigate: check last known scan, vehicle manifest
│   ├── Notify customer
│   └── File claim / send replacement or refund
├── Address correction
│   ├── Recipient contacts company with corrected address
│   ├── Update route while package is in transit
│   └── If package already at destination facility, re-sort
└── Customer-initiated redirect
    ├── Hold at facility
    ├── Deliver to alternate address
    └── Return to sender (sender requests)
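
The "Detect" step under Lost package can be sketched as a simple time-window check. This is a minimal illustration, not the course's prescribed design: the function name `is_overdue`, the 24-hour window, and the timestamps are all assumptions made up for the example.

```python
from datetime import datetime, timedelta


def is_overdue(last_scan: datetime, expected_within: timedelta, now: datetime) -> bool:
    """True when the next expected scan has not arrived inside its time window."""
    return now - last_scan > expected_within


# Hypothetical values: last scan at 08:00, next scan expected within 24 hours.
last_scan = datetime(2024, 1, 10, 8, 0)
window = timedelta(hours=24)

on_time = is_overdue(last_scan, window, now=datetime(2024, 1, 10, 20, 0))
overdue = is_overdue(last_scan, window, now=datetime(2024, 1, 12, 8, 0))
```

Note that the check never says where the package is. It only says the confirmed state is older than it should be, which is the trigger to start the "Investigate" step.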

Step 3: Finding Seams

Seam: Physical Scan Points

Every barcode scan is a seam. Data literally changes at each scan point:

Before scan: "package was loaded onto truck" (assumed location)
After scan: "package is confirmed at Chicago facility" (known location)

The system transitions from assumed state to confirmed state at every scan. Between scans, the package's location is inferred, not known. This is fundamentally different from a purely digital system, where the current state can be read directly instead of being inferred.
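
One way to make the assumed/confirmed distinction visible in data is to store the certainty next to the location. The sketch below is illustrative only; the names `Certainty`, `PackageLocation`, `record_scan`, and `depart` are hypothetical and not part of any real tracking system.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    CONFIRMED = "confirmed"  # backed by a physical scan
    ASSUMED = "assumed"      # inferred from the route plan


@dataclass
class PackageLocation:
    place: str
    certainty: Certainty


def record_scan(facility: str) -> PackageLocation:
    """A scan replaces whatever we believed with a confirmed fact."""
    return PackageLocation(place=facility, certainty=Certainty.CONFIRMED)


def depart(next_stop: str) -> PackageLocation:
    """After loading, we only infer that the package is heading to next_stop."""
    return PackageLocation(place=f"in transit to {next_stop}", certainty=Certainty.ASSUMED)


at_origin = record_scan("New York facility")
on_truck = depart("Chicago facility")
at_chicago = record_scan("Chicago facility")
```

Keeping the certainty explicit means anything that reads a location has to decide how much to trust it, rather than treating every value as equally reliable.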

Seam: Service Level → Route Strategy

The same package going from New York to Los Angeles decomposes differently based on service level:

Overnight: NYC facility → JFK airport → LAX airport → LA facility → delivery
Ground: NYC facility → truck to Pittsburgh hub → truck to Denver hub → truck to LA facility → delivery
Economy: NYC facility → waits for full truck → Pittsburgh → waits → Denver → waits → LA → delivery

Same origin, same destination, completely different decomposition of the journey. The service level seam changes the entire routing tree.

Seam: Custody Changes

Every time the package changes hands (sender → pickup driver → facility worker → vehicle → next facility → delivery driver → recipient), there's a custody seam. At each custody transfer, responsibility shifts. If the package is damaged, the question is: during whose custody? This is a liability seam as much as a technical one.

Seam: Customer Expectation vs. Physical Reality

The customer sees: "In transit → Out for delivery → Delivered." The system sees: "Scan 47 → Sort lane B → Load truck 104 → Scan 48 → Driver route position 12/38 → Scan 49 → Delivery confirmed GPS 40.7128° N."

Same package, radically different levels of detail. The tracking system must translate between these two views — that's a seam between the physical world and the customer experience.
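
The translation between the two views can be sketched as a lookup plus a merge. The internal event names below are invented for illustration; only the three customer-facing statuses come from the text above.

```python
# Hypothetical internal event types mapped to the customer-facing statuses.
CUSTOMER_STATUS = {
    "facility_scan": "In transit",
    "sorted_to_lane": "In transit",
    "loaded_on_truck": "In transit",
    "loaded_on_delivery_vehicle": "Out for delivery",
    "delivery_confirmed": "Delivered",
}


def customer_timeline(internal_events: list[str]) -> list[str]:
    """Collapse detailed internal events into the short customer view.

    Consecutive events that map to the same status are merged, so the
    customer sees each status once, in order.
    """
    timeline: list[str] = []
    for event in internal_events:
        status = CUSTOMER_STATUS.get(event)
        if status is None:
            continue  # internal-only detail, never shown to the customer
        if not timeline or timeline[-1] != status:
            timeline.append(status)
    return timeline


events = [
    "facility_scan",
    "sorted_to_lane",
    "loaded_on_truck",
    "facility_scan",
    "driver_route_position_updated",
    "loaded_on_delivery_vehicle",
    "delivery_confirmed",
]
timeline = customer_timeline(events)
# timeline == ["In transit", "Out for delivery", "Delivered"]
```

Seven internal events become three customer statuses. The mapping table is the seam: it is the one place where a decision about what customers should see is written down.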

Step 4: Dependency Map

Package Intake ──────────── (entry point, no dependencies)
       │
       ├────────────────┐
       ▼                ▼
   Routing          Tracking (tracking number created at intake)
       │                │
       ▼                │
Sorting & Transfer ─────┘ (each scan updates tracking)
       │
       ▼
Last-Mile Delivery ──── needs sorted packages + route assignments
       │
       ├────────────────┐
       ▼                ▼
  Tracking           Exception Handling
  (delivery scan)    (failed delivery triggers exception flow)

What This Map Reveals

Tracking runs in parallel with everything. It's not a step in the chain; it's a continuous side channel that records events from every other branch. Every scan in Sorting & Transfer and Last-Mile Delivery feeds into Tracking, and route changes from Routing feed its delivery estimates.

Exception Handling is triggered from multiple points. A failed delivery triggers an exception. A damaged package at a sort facility triggers a different exception. A lost package (detected by Tracking) triggers yet another. Exception Handling isn't downstream of one branch — it's connected to everything.

The physical chain is strictly sequential. A package must be: picked up → routed → sorted → transported → sorted again → delivered. You can't skip a step or deliver out of order. This contrasts with the messaging app, where sending and receiving happen simultaneously.

Step 5: The Complete Tree (Abridged)

Package Delivery
├── Package Intake
│   ├── Create shipment request (addresses, weight, dimensions, service level)
│   ├── Validate addresses
│   ├── Calculate price
│   ├── Generate label + tracking number
│   ├── Schedule pickup
│   └── Record in system
│
├── Routing
│   ├── Determine origin/destination facilities
│   ├── Plan route (hops, transport modes)
│   ├── Optimize for service level
│   └── Re-route (on delay or capacity change)
│
├── Sorting & Transfer
│   ├── Scan at arrival (per facility)
│   ├── Sort to outbound lane
│   ├── Load onto vehicle (scan)
│   └── Track vehicle movement
│
├── Last-Mile Delivery
│   ├── Create delivery routes (cluster by geography)
│   ├── Load delivery vehicle (scan each package)
│   ├── Attempt delivery (success / fail scenarios)
│   ├── Capture proof (signature, photo)
│   └── End-of-day reconciliation
│
├── Tracking
│   ├── Record scan events (from all sources)
│   ├── Calculate estimated delivery
│   ├── Update estimated delivery on delay
│   ├── Provide customer timeline
│   └── Send proactive notifications (picked up, in transit, out for delivery, delivered)
│
└── Exception Handling
    ├── Failed delivery (reschedule, hold, return)
    ├── Damaged package (assess, notify, claim)
    ├── Lost package (detect, investigate, claim)
    ├── Address correction (update route, re-sort)
    └── Customer redirect (hold, alternate address, return)

Total leaf count: roughly 30 operations in this abridged tree

Comparing All Three Systems

Aspect	Bookstore	Messaging App	Package Delivery
Primary domain	Digital commerce	Digital communication	Physical logistics
Time horizon	Minutes (browsing → purchase)	Milliseconds (real-time messages)	Days (pickup → delivery)
Failure recovery	Refund/retry (reversible)	Retry/resend (mostly reversible)	Physical recovery (often irreversible)
Fan-out pattern	1 order → 1 customer	1 message → N recipients	1 package → many scan points
Deepest complexity	Checkout (sequential chain)	Notification routing (multi-channel)	Exception handling (branching physical outcomes)
Seams unique to this domain	Format changes (cart → order → record)	Real-time vs. stored, audience routing	Custody changes, physical scan points, assumed vs. confirmed state
Exception handling	Part of individual operations	Part of individual operations	Its own top-level branch
Build order	Catalog → Cart → Checkout → Orders	Users → Conversations → Messages → Presence → Notifications	Intake → Routing → Sort → Delivery → Tracking → Exceptions

What This Example Teaches

Physical systems have assumed state: between scans, you're inferring where the package is, whereas a purely digital system can read its current state directly
Exception handling can be its own subsystem — when failures are common and varied, they deserve a top-level branch, not just error cases in contracts
The same entity (package) decomposes differently based on context — overnight vs. ground creates entirely different route trees
Custody changes are seams — every handoff between people, vehicles, or facilities is a decomposition boundary
Sequential physical chains can't be parallelized — unlike software where operations can run simultaneously, a package must physically move step by step

Decomposition — Common Mistakes and Dependency Traps

Mistake 1: Technology-Layer Decomposition

The Wrong Way

Online Store
├── Frontend
├── Backend
├── Database
└── DevOps

This isn't decomposition — it's technology labeling. It tells you what kind of code lives where, but not what the system does. You can't write a contract for "Frontend." You can't estimate "Backend." These aren't features; they're implementation categories.

Why People Do It

It's comfortable. Developers naturally think in terms of where code lives. "I'll build the frontend, you build the backend" feels like a plan. But it's not a plan — it's an organizational split that leaves all the actual design decisions unmade.

The Fix

Decompose by feature, not by technology. "Browse books" is a feature that touches frontend, backend, and database. When it's decomposed correctly, the technology concerns become implementation details within each leaf:

Online Store
├── Browse books
│   ├── List books (paginated)
│   ├── Search books
│   └── View book details
├── Shopping Cart
│   ├── Add item
│   ├── Remove item
│   └── View cart
...

Each leaf here has a clear contract, a clear estimate, and a clear set of dependencies — regardless of which technology implements it.

How to Spot It

If every branch of your tree is a technology or a "layer" rather than something a user does or the business needs, you've done technology-layer decomposition.

Mistake 2: Decomposing Too Shallow

The Wrong Way

Hospital System
├── Patient Management
├── Appointments
├── Billing
└── Records

Four branches. Each one is an entire system. "Patient Management" alone could be 50 operations. This isn't a decomposition — it's a table of contents.

Why People Do It

They stop when the branches feel "reasonable" rather than when they're concrete. "Patient Management" sounds like a reasonable module. But can you write a contract for it? Can you estimate it? No — it's still an entire subsystem compressed into two words.

The Fix

Keep asking "what does this need?" until every leaf passes the three tests (contractable, estimable, dependency-identified):

Patient Management
├── Register new patient
│   ├── Collect demographics (name, DOB, address, phone, email)
│   ├── Assign patient ID
│   ├── Verify insurance (if applicable)
│   └── Create initial medical record
├── Update patient information
│   ├── Update demographics
│   ├── Update insurance
│   └── Update emergency contact
├── Search patients
│   ├── By name
│   ├── By patient ID
│   └── By date of birth
├── Merge duplicate patient records
└── Deactivate patient record (moved away, deceased)

Now every leaf is contractable. "Collect demographics" has clear inputs, clear outputs, and a clear estimate (small).

How to Spot It

If a non-technical person can't understand what each leaf does, it's too shallow. "Patient Management" is abstract. "Register new patient" is concrete.

Mistake 3: Decomposing Too Deep

The Wrong Way

Search books by keyword
├── Receive search text from user interface
├── Trim whitespace from search text
├── Convert search text to lowercase
├── Split search text into individual words
├── Remove common words (the, a, an, is)
├── For each remaining word:
│   ├── Look up word in search index
│   ├── Retrieve list of matching book IDs
│   └── Score each match by relevance
├── Combine results from all words
├── Remove duplicate book IDs
├── Sort combined results by total relevance score
├── Fetch book details for top N results
└── Return results to user interface

This is implementation pseudocode, not decomposition. At the decomposition stage, "Search books by keyword" is a single leaf. The internal algorithm is an implementation detail.

Why People Do It

Perfectionism. The desire to have everything figured out before starting. Or anxiety about estimation — "I can't estimate 'search' unless I know exactly how it works."

The Fix

Stop decomposing when a leaf is one responsibility with a clear contract:

CONTRACT: SearchBooks
ACCEPTS: search_query (text, 1-200 characters)
RETURNS: list of matching books (title, author, price, relevance score), sorted by relevance
ERRORS: empty query, no results found

That's the leaf. How search works internally — tokenization, indexing, scoring — is decided when you implement the leaf, not when you decompose the system.
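
To show how little the contract says about the internals, here is a minimal Python sketch of the same leaf. The exception names, the `BookMatch` type, the catalog shape, and the substring scoring are all placeholders invented for illustration. Stripping whitespace before checking the length is also an assumption; the contract does not say.

```python
from dataclasses import dataclass


class InvalidQueryError(ValueError):
    """The query violates the contract (empty or over 200 characters)."""


class NoResultsError(LookupError):
    """The query was valid but matched nothing."""


@dataclass(frozen=True)
class BookMatch:
    title: str
    author: str
    price: float
    relevance: float


def search_books(search_query: str, catalog: list[tuple[str, str, float]]) -> list[BookMatch]:
    """Contract boundary for SearchBooks; the matching logic is a placeholder."""
    query = search_query.strip()  # assumption: surrounding whitespace is ignored
    if not 1 <= len(query) <= 200:
        raise InvalidQueryError("search_query must be 1-200 characters")

    # Placeholder scoring: 1.0 for a title hit, 0.5 for an author hit.
    needle = query.lower()
    matches = []
    for title, author, price in catalog:
        if needle in title.lower():
            matches.append(BookMatch(title, author, price, relevance=1.0))
        elif needle in author.lower():
            matches.append(BookMatch(title, author, price, relevance=0.5))

    if not matches:
        raise NoResultsError(f"no books match {query!r}")
    return sorted(matches, key=lambda m: m.relevance, reverse=True)
```

The scoring loop could be swapped for a real search index without changing the signature, the return shape, or the two error cases. That is what it means for the algorithm to be an implementation detail of the leaf.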

How to Spot It

If your leaves describe how something works rather than what it does, you've gone too deep. Decomposition answers "what are the pieces?" Implementation answers "how does each piece work?"

Mistake 4: Overlapping Responsibilities

The Wrong Way

E-Commerce Platform
├── Product Page
│   ├── Display product details
│   ├── Show stock availability   ← checks inventory
│   └── Show recommended products
├── Shopping Cart
│   ├── Add item to cart
│   ├── Validate stock on add     ← checks inventory
│   └── View cart
├── Checkout
│   ├── Reserve inventory          ← modifies inventory
│   ├── Process payment
│   └── Create order
└── Admin
    ├── Update stock levels         ← modifies inventory
    └── View low-stock alerts       ← checks inventory

"Inventory" appears in four different branches. Stock checking happens in Product Page, Cart, and Admin. Stock modification happens in Checkout and Admin. If you build each branch independently, you'll build the inventory logic four different times, with four different behaviors.

The Fix

Identify the shared responsibility and make it explicit:

E-Commerce Platform
├── Inventory (shared)
│   ├── Check stock level
│   ├── Reserve stock (temporary hold)
│   ├── Confirm reservation (convert to deduction)
│   ├── Release reservation (timeout or cancel)
│   └── Adjust stock (admin)
├── Product Page → uses Inventory.CheckStock
├── Shopping Cart → uses Inventory.CheckStock
├── Checkout → uses Inventory.Reserve, then Inventory.Confirm
└── Admin → uses Inventory.Adjust, Inventory.CheckStock

Now inventory is decomposed once and referenced by the branches that need it. The dependency is explicit.
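
A rough sketch of what the shared Inventory branch might look like as a single module. The class and method names mirror the tree above, but the in-memory dictionaries and the lack of validation are simplifications chosen for illustration, not a recommended implementation.

```python
class InsufficientStockError(Exception):
    """Raised when a reservation asks for more than is available."""


class Inventory:
    """Single owner of stock logic; other branches call it instead of re-implementing it."""

    def __init__(self) -> None:
        self._on_hand: dict[str, int] = {}
        self._reserved: dict[str, int] = {}

    def adjust_stock(self, sku: str, delta: int) -> None:
        """Admin: add or remove physical stock."""
        self._on_hand[sku] = self._on_hand.get(sku, 0) + delta

    def check_stock(self, sku: str) -> int:
        """Units available to sell: on hand minus active reservations."""
        return self._on_hand.get(sku, 0) - self._reserved.get(sku, 0)

    def reserve(self, sku: str, qty: int) -> None:
        """Checkout: place a temporary hold."""
        if qty > self.check_stock(sku):
            raise InsufficientStockError(sku)
        self._reserved[sku] = self._reserved.get(sku, 0) + qty

    def confirm(self, sku: str, qty: int) -> None:
        """Checkout: convert a hold into a deduction (assumes a prior reserve)."""
        self._reserved[sku] -= qty
        self._on_hand[sku] -= qty

    def release(self, sku: str, qty: int) -> None:
        """Timeout or cancel: drop the hold (assumes a prior reserve)."""
        self._reserved[sku] -= qty


inventory = Inventory()
inventory.adjust_stock("book-123", 5)   # Admin
inventory.reserve("book-123", 2)        # Checkout places a hold
available_during_hold = inventory.check_stock("book-123")
inventory.confirm("book-123", 2)        # Checkout completes
available_after_sale = inventory.check_stock("book-123")
```

Product Page and Shopping Cart would both call `check_stock`, so they cannot disagree about what "in stock" means.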

How to Spot It

If the same verb + noun appears in multiple branches ("check stock," "validate user," "calculate price"), it's an overlap. Extract it as a shared dependency.

Mistake 5: Missing the Unhappy Path

The Wrong Way

Flight Booking
├── Search flights
├── Select flight
├── Enter passenger details
├── Pay
└── Issue ticket

Five steps. All happy path. But what about:

Flight sells out between search and payment?
Payment is declined?
Passenger name doesn't match their ID?
Flight is canceled after booking?
Customer wants to change their flight?
Customer wants a refund?

The unhappy paths are at least as numerous as the happy path, often more.

The Fix

For every happy-path branch, ask: "What can go wrong, and what do we do about it?"

Flight Booking
├── Search flights
├── Select flight
│   └── Handle: flight no longer available → show alternatives
├── Enter passenger details
│   └── Handle: validation failures → show field-level errors
├── Pay
│   ├── Handle: payment declined → retry or try different card
│   ├── Handle: flight sold out during payment → refund, show alternatives
│   └── Handle: payment timeout → check if payment went through, avoid double charge
├── Issue ticket
│   └── Handle: system error after payment → queue for retry, notify customer of delay
└── Post-Booking
    ├── Cancel booking → calculate refund based on fare rules
    ├── Change flight → calculate fare difference
    ├── Flight canceled by airline → auto-rebook or refund
    └── Schedule change by airline → notify and offer alternatives

The tree roughly doubled. That's normal for real systems — the unhappy paths are half the work.
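
The "payment timeout" branch is the least obvious of these, so here is a small sketch of it. `FakeGateway` is a made-up stand-in that simulates a provider whose reply gets lost; the method names and return strings are hypothetical and do not belong to any real payment API.

```python
class FakeGateway:
    """Stand-in for a payment provider (hypothetical, for illustration only)."""

    def __init__(self, lose_reply: bool = False) -> None:
        self.charges: dict[str, int] = {}
        self.lose_reply = lose_reply

    def charge(self, booking_ref: str, amount_cents: int) -> None:
        self.charges[booking_ref] = amount_cents  # the charge succeeds...
        if self.lose_reply:
            raise TimeoutError("no response from provider")  # ...but the reply is lost

    def was_charged(self, booking_ref: str) -> bool:
        return booking_ref in self.charges


def pay(gateway: FakeGateway, booking_ref: str, amount_cents: int) -> str:
    """Return 'paid' or 'safe_to_retry'; never charge twice for one booking."""
    try:
        gateway.charge(booking_ref, amount_cents)
        return "paid"
    except TimeoutError:
        # A timeout says nothing about whether the charge landed, so check first.
        return "paid" if gateway.was_charged(booking_ref) else "safe_to_retry"


flaky = FakeGateway(lose_reply=True)
outcome = pay(flaky, "BK-1001", 25_000)
```

A naive handler that retries on any error would charge this customer twice. The unhappy-path leaf exists precisely to make that decision explicit.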

How to Spot It

If your decomposition tree reads like a tutorial ("step 1, step 2, step 3...") with no branching, you've only captured the happy path. Real systems branch extensively.

Mistake 6: Circular Dependencies

The Wrong Way

Module A (User Profiles) needs Module B (Permissions) to check if user can edit profiles
Module B (Permissions) needs Module A (User Profiles) to look up user's role

A depends on B. B depends on A. Neither can be built first. Neither can be tested alone. This is a circular dependency — and it's always a decomposition error.

Why It Happens

Two things that are related get decomposed as peers that reference each other. In reality, one should depend on the other, or both should depend on a third, more fundamental thing.

The Fix

Find the deeper abstraction:

Module C (User Data) ← stores user ID, role, basic profile data (no logic)
       │
       ├────────────────┐
       ▼                ▼
Module A (Profiles)  Module B (Permissions)
uses User Data       uses User Data

Now both A and B depend on C, but not on each other. The cycle is broken.
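
The same structure, sketched in Python. The class names follow the diagram; the `edit_display_name` function is an added assumption showing one way a higher-level caller might combine Profiles and Permissions without either one importing the other.

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    user_id: int
    role: str
    display_name: str


class UserData:
    """Module C: plain storage, no business logic, depends on nothing."""

    def __init__(self) -> None:
        self._users: dict[int, UserRecord] = {}

    def save(self, record: UserRecord) -> None:
        self._users[record.user_id] = record

    def get(self, user_id: int) -> UserRecord:
        return self._users[user_id]


class Permissions:
    """Module B: reads roles from UserData, knows nothing about Profiles."""

    def __init__(self, users: UserData) -> None:
        self._users = users

    def can_edit_profile(self, actor_id: int, target_id: int) -> bool:
        return actor_id == target_id or self._users.get(actor_id).role == "admin"


class Profiles:
    """Module A: edits profile fields in UserData, knows nothing about Permissions."""

    def __init__(self, users: UserData) -> None:
        self._users = users

    def rename(self, user_id: int, new_name: str) -> None:
        self._users.get(user_id).display_name = new_name


def edit_display_name(permissions: Permissions, profiles: Profiles,
                      actor_id: int, target_id: int, new_name: str) -> bool:
    """Hypothetical caller that composes A and B without linking them."""
    if not permissions.can_edit_profile(actor_id, target_id):
        return False
    profiles.rename(target_id, new_name)
    return True
```

Each of the three classes can now be constructed and tested with only `UserData` in hand, which is the practical meaning of "the cycle is broken."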

How to Spot It

Draw your dependency arrows. If you can follow the arrows in a circle (A → B → C → A), you have a cycle. Every cycle must be broken by extracting the shared dependency.

The Dependency Health Checklist

After decomposing, validate your dependency structure:

Check	What to Look For
No cycles	Can you sort all modules into a build order where each module depends only on modules that come earlier in the order?
Shared responsibilities are explicit	Is any logic duplicated across branches? Extract it.
Foundation modules depend on nothing	Your data stores, core entities, and configuration should be at the bottom of the dependency graph.
High-level features depend on low-level services	"Checkout" depends on "Inventory" and "Payment" — not the reverse.
Every dependency is justified	For each arrow, can you explain why it exists? If not, it might be artificial.
It's possible to build and test each piece independently	If you can't build module X without also building module Y, either X depends on Y (document it) or they should be combined.
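
The "No cycles" check can be automated once the dependency arrows are written down as data. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the module names and the `build_order` helper are illustrative assumptions.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional

# Each key depends on the modules in its set (module names are illustrative).
dependencies = {
    "Inventory": set(),
    "Payment": set(),
    "Cart": {"Inventory"},
    "Checkout": {"Cart", "Inventory", "Payment"},
}


def build_order(deps: dict[str, set[str]]) -> Optional[list[str]]:
    """Return a valid build order, or None if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


order = build_order(dependencies)

# The circular dependency from Mistake 6 has no valid build order.
cyclic = {"Profiles": {"Permissions"}, "Permissions": {"Profiles"}}
no_order = build_order(cyclic)
```

Several orders can be valid for the same graph (Inventory and Payment can swap places here), so the useful property to check is that every module appears after everything it depends on.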

Decomposition — Test Your Understanding

Answer each question by producing decomposition trees, dependency maps, and/or build orders. No code. Show your reasoning.

Section A: Decompose It

Question 1

"We need a system for a veterinary clinic."

Clients bring their pets for appointments. The vet records diagnoses and prescribes treatments. The clinic sends appointment reminders. Clients pay for visits.

Produce a complete decomposition tree. Go at least three levels deep. Identify the leaves and verify that each one could have a contract written for it.

Question 2

"Build a recipe sharing platform."

Users create recipes with ingredients and steps. Other users can search, save favorites, and leave reviews. Users can create weekly meal plans and generate shopping lists from their meal plans.

Decompose this top-down. Then identify the dependencies between your leaves. What is the build order?

Question 3

You're in a bottom-up situation. You already have:

A user authentication service
A file storage service (can store and retrieve files)
An email sending service
A database for structured data

A client asks: "Can you build me a simple document collaboration tool where teams can upload, share, and comment on documents?"

Using the existing services as your starting point, decompose what needs to be built (not what already exists). Show how the new pieces connect to the existing services.

Section B: Find the Seams

Question 4

A single feature request reads:

"When a customer completes a purchase, they should see an order confirmation page, receive a confirmation email, the inventory should be updated, the sales team should see the order in their dashboard, and if the order is over $500, it should be flagged for manual review."

Find every seam in this description. Group the pieces by the boundary they belong to. Show which pieces can happen in parallel and which must happen in sequence.

Question 5

Here is a vague feature request:

"We need analytics."

Write the 10 questions you would ask to decompose this. For each question, explain why the answer matters for decomposition (i.e., how does it change the shape of the tree?).

Question 6

A system currently works as follows:

User uploads a CSV file
System parses the CSV
System validates each row
Valid rows are saved to the database
Invalid rows are collected into an error report
Error report is emailed to the user
A summary is displayed on screen

Map the seams. Then answer: if step 3 (validation) needs to become much more complex (adding cross-row validation, checking against external data sources), which seams help you isolate that change? Which other steps would be affected?

Section C: Dependencies and Ordering

Question 7

You've decomposed a project into these pieces:

A: User registration
B: User login
C: Create a post
D: View feed (list of posts from followed users)
E: Follow/unfollow other users
F: Like a post
G: Notification when someone likes your post
H: User profile page

Map all the dependencies (which pieces need which other pieces to exist first). Draw the dependency graph. Determine the build order — what gets built in phase 1, phase 2, etc.?

Question 8

You have a dependency problem. You've identified:

Module X needs data from Module Y
Module Y needs a callback from Module X when processing is done
This creates a circular dependency

Without knowing any specifics about what X and Y do, describe three general strategies for breaking a circular dependency. For each strategy, explain the tradeoff.

Question 9

A project has 12 decomposed tasks. Here are their dependencies:

Task	Depends On
A	nothing
B	nothing
C	A
D	A
E	B
F	C, E
G	D
H	F
I	F, G
J	H
K	I
L	J, K

Draw the dependency graph. What is the critical path (the longest chain of dependencies from start to finish)? If you had two people working in parallel, what's the most efficient assignment of tasks to people?

Section D: Critical Thinking

Question 10

"You should decompose until every leaf takes less than a day to build."

Is this good advice? When is it right? When might it be wrong? What are the risks of decomposing too finely versus too coarsely?

Question 11

You're decomposing a system and you encounter a feature that feels like it could belong in two different branches of your tree:

"Send a notification when an order ships."

Is this part of Orders (since it's triggered by an order event)? Or part of Notifications (since it's a notification)? Both feel reasonable.

How do you resolve this? Propose a decomposition that handles this cleanly. Explain the principle behind your decision.

Question 12

You've been given a completed decomposition tree by a colleague. How do you evaluate it? Create a checklist of at least 8 specific questions you would ask to determine if the decomposition is good, complete, and buildable. For each question, explain what a bad answer would reveal.

Grading Rubric

Criteria	What It Means
Depth	Trees go deep enough that leaves are concrete and estimable — but not so deep that they describe implementation
Completeness	No missing steps. Trace the user journey and confirm every step has a corresponding leaf
No overlap	Each responsibility appears exactly once in the tree
Dependencies are explicit	Clear arrows showing what needs what. No hidden assumptions
Build order is logical	Foundation pieces first, dependent pieces after. Circular dependencies identified and resolved
Seam recognition	Natural break points are identified and used to structure the decomposition

Failure Modes and Debugging — Why It Matters

Things Will Break

Every system fails. Not "might fail" — will fail. Hardware dies. Networks drop. Users do unexpected things. Data gets corrupted. Services go down. Bugs hide in logic that worked fine for a year and then didn't.

The difference between a junior and senior engineer isn't that the senior's systems don't break. It's that the senior expects failure, designs for it, and diagnoses it systematically when it happens.

This section is not about learning debugging tools. Tools change. This section is about learning to reason about failure — a skill that works in any language, on any platform, in any decade.

Why Debugging Is a Thinking Skill, Not a Tool Skill

Most courses teach debugging as: "here's how to set a breakpoint, here's how to read a stack trace, here's how to use print statements." These are useful techniques, but they're like teaching someone to use a stethoscope without teaching them medicine. The tool is worthless without the reasoning behind it.

Real debugging is a reasoning process:

Something is wrong (the symptom)
The symptom has a cause
The cause is usually not where the symptom appears
Finding the cause requires systematic elimination of possibilities

This is pure critical thinking. It doesn't require a computer. It requires the ability to form hypotheses, test them, and follow evidence.

The Two Failures Most People Make When Debugging

Failure 1: Guessing Instead of Reasoning

Something breaks. The engineer's first instinct is to change something — anything — and see if it fixes the problem. This is like a doctor prescribing random medication because the patient has a headache. Sometimes it works by luck. Usually it wastes hours, introduces new bugs, and teaches nothing.

The alternative: stop. Think. What do you know? What do you not know? What would help you narrow it down?

Failure 2: Assuming Instead of Verifying

"That part works fine, the problem must be somewhere else." Says who? Have you verified it? One of the most common debugging experiences is spending hours looking in the wrong place because you assumed some component was correct — and it wasn't.

The alternative: verify everything. Trust nothing. Check each assumption with evidence.

Why Failure Modes Are a Design Concern

Most people think about failure after the system is built. That's backwards. You should think about failure during design, for two reasons:

1. The cost of failure is a design decision

Some failures are acceptable ("the profile picture takes 2 seconds longer to load"). Some are catastrophic ("we charged the customer twice"). The difference isn't technical — it's about what the system does and who it serves. This must be decided during design, not discovered during an outage.

2. Error handling is half the work

In a typical system, the "happy path" (everything works) is maybe 30% of the logic. The other 70% is: what if this input is invalid? What if that service is down? What if the data is in an unexpected format? What if the network times out? What if the user does something in the wrong order?

If you design only for the happy path, you've built 30% of the system. "But it works!" Yes, until it doesn't. And when it doesn't, nobody planned for it, so the failure is chaotic rather than graceful.

What Does "Graceful Failure" Mean?

A system that fails gracefully does these things:

Detects that something went wrong (not silently corrupting data)
Contains the failure (one broken feature doesn't take down the whole system)
Communicates what happened (to the user, to the logs, to the monitoring system)
Degrades rather than crashes (if search is down, the rest of the site still works)
Recovers when possible (retries, fallbacks, self-healing)

A system that fails badly:

Crashes entirely because one component failed
Shows the user a cryptic technical error
Corrupts data silently
Provides no information about what went wrong or why
Requires a manual restart or intervention to recover

The difference is not complexity. It's forethought. Graceful failure is designed in. Bad failure is what happens when nobody thought about it.
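
As a small illustration of "detects, contains, communicates, degrades," consider a page that includes search results. Everything here is hypothetical: the `search` function simulates a backend that is down, and the page is a plain dictionary standing in for a real response.

```python
import logging

logger = logging.getLogger("storefront")


def search(query: str) -> list[str]:
    """Hypothetical search backend that is currently down."""
    raise ConnectionError("search service unavailable")


def render_home_page(query: str) -> dict:
    """Build the page; a search failure degrades one section instead of crashing."""
    page = {"featured": ["Book A", "Book B"], "search_results": [], "notice": None}
    try:
        page["search_results"] = search(query)
    except ConnectionError as exc:
        # Detect and communicate: a log entry for engineers, a plain message for the user.
        logger.warning("search failed for %r: %s", query, exc)
        page["notice"] = "Search is temporarily unavailable."
    return page


page = render_home_page("dune")
```

The featured section still renders, the user gets a readable notice instead of a stack trace, and the log records what failed. Without the `try` block, the same outage would take down the whole page.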

Why This Is The Capstone Skill

This section comes last because it requires everything before it:

Data Lifecycle — to trace where data went wrong, you must know where it flows
Boundaries — to contain failures, you must have clear boundaries to contain them within
Contracts — to detect failures, you must know what the expected behavior is (the contract) so you can recognize when it's violated
Decomposition — to isolate failures, the system must be decomposed into testable pieces

A well-decomposed system with clear boundaries and explicit contracts is inherently debuggable. A tangled system with no structure is inherently not. Debugging skill matters, but system design determines whether debugging is even possible.

The Mindset Shift

Stop thinking: "How do I make this work?" Start thinking: "How will this fail, and what should happen when it does?"

For every operation, every contract, every module, the questions are:

What are the ways this can fail?
Which failures are likely? Which are unlikely but catastrophic?
For each failure, what should the system do?
Can the user recover? Can the system recover automatically?
If nothing else works, what information do we need to diagnose the problem later?

This isn't pessimism. It's engineering. Bridges stay standing because someone thought about load limits. They collapse when someone didn't.

The same is true of software.

Failure Modes and Debugging — How: The Method

A Systematic Debugging Framework

When something goes wrong, follow this five-step process. It works for software, hardware, processes, and systems of any kind.

Step 1: Observe the Symptom Precisely

Don't say "it's broken." Say exactly what is happening:

❌ "The page is broken"
✅ "The page loads but shows 0 orders, when the user should have 15 orders"
❌ "The system is slow"
✅ "The search results take 12 seconds to appear; last week it was under 1 second"
❌ "It doesn't work"
✅ "Clicking 'Submit' does nothing — no error message, no loading indicator, no change"

Precise symptoms lead to precise diagnoses. Vague symptoms lead to guessing.

Step 2: Establish What Changed

Most bugs don't appear spontaneously. Something changed:

New code was deployed
Data volume increased
A third-party service updated their API
A configuration was modified
User behavior shifted (a marketing campaign drove unexpected traffic)

Ask: "What is different between when it worked and when it stopped working?"

If nothing changed internally, the cause is likely external: data, traffic, or a dependency.

Step 3: Bisect the Problem Space

This is the most powerful debugging technique. Instead of searching everywhere, cut the problem in half and determine which half contains the bug.

Your system is a chain of data flow (from the Data Lifecycle section). Data enters at one end and the wrong result appears at the other. Check the midpoint:

Input → [A] → [B] → [C] → [D] → Wrong Output
                 ↑
          Check here first.
          Is the data correct at this point?

If the data is correct at [B], the problem is in [C] or [D]
If the data is wrong at [B], the problem is in [A] or [B]

You've eliminated half the system. Repeat until you've narrowed it to a single step.

This is binary search applied to debugging, and it works whether you're debugging code, a business process, a network issue, or a recipe.
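The halving procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a general debugging tool: `first_bad_stage`, the four toy stages, and the `expected` checkpoint values are all hypothetical, and the sketch assumes that once the data goes wrong it stays wrong at every later checkpoint.

```python
from typing import Callable, List, Sequence


def first_bad_stage(
    stages: Sequence[Callable[[int], int]],
    initial: int,
    is_correct: Callable[[int, int], bool],
) -> int:
    """Return the index of the stage that turned correct data into wrong data.

    is_correct(n, data) says whether the data is right after the first n stages.
    Assumes the input is correct and the final output is wrong.
    """
    def data_after(n: int) -> int:
        # Re-run the first n stages to inspect the data at that checkpoint.
        data = initial
        for stage in stages[:n]:
            data = stage(data)
        return data

    good, bad = 0, len(stages)  # correct after `good` stages, wrong after `bad`
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_correct(mid, data_after(mid)):
            good = mid  # the bug is in a later stage
        else:
            bad = mid   # the bug is at or before this checkpoint
    return good


# Toy pipeline A -> B -> C -> D; stage C has the bug (it should subtract 1).
stages: List[Callable[[int], int]] = [
    lambda x: x + 1,    # A
    lambda x: x * 2,    # B
    lambda x: x - 100,  # C (buggy)
    lambda x: x + 3,    # D
]
expected = [1, 2, 4, 3, 6]  # correct value at each checkpoint for input 1

culprit = first_bad_stage(stages, 1, lambda n, data: data == expected[n])
print("ABCD"[culprit])  # prints "C"
```

With four stages the search needs two checks instead of four. The saving grows with the length of the chain, since each check halves the remaining candidates.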

Step 4: Form and Test Hypotheses

Once you've narrowed the area, form a specific hypothesis:

"I believe the bug is caused by [specific thing] because [evidence]. If I'm right, then [testable prediction]."

Example:

"I believe orders show as 0 because the query is filtering by the wrong date format. If I'm right, then running the query directly will return empty results even though orders exist."

Then test it by observing — check the data, check the intermediate state. Don't change anything yet. Verify or disprove the hypothesis with evidence.

If the hypothesis is wrong, that's progress — you've eliminated a possibility.

Step 5: Verify the Fix

You found the cause. You made a change. How do you know the fix is correct?

Does the symptom disappear?
Does it work for all cases, or just the one you tested?
Did the fix introduce any new problems?
Can you explain why the fix works?

If you can't explain why the fix works, it's not a fix — it's a lucky accident that will break again.

Categorizing Failures

Not all failures are the same. Understanding the categories helps you design appropriate responses.

Category	What It Is	Example	Response
Input	Bad data coming in	Letters in a phone number field	Validate at the boundary. Reject with a clear error.
Logic	Code produces wrong result	Off-by-one in a calculation	Test with known inputs, verify outputs match contract
Integration	Two parts don't align	Module A sends format X, Module B expects Y	Validate at every boundary. Integration failures are contract violations.
Resource	System exhausts something	Disk full, memory exhausted, rate limit hit	Monitor. Set limits and alerts. Design for constrained operation.
Dependency	External thing stops working	Database down, API returns errors	Timeout, retry, fallback, degrade gracefully
Timing	Wrong order or time	Two updates hit the same record simultaneously	The hardest category. Design explicit ordering where it matters.

Designing for Failure

For every module and contract, answer these five questions:

1. What are the failure modes?

List every way this can fail. Use the categories above as your checklist.

2. What is the blast radius?

If this fails, what else breaks? A well-bounded module limits the blast radius. A tangled one spreads damage everywhere.

3. What is the severity?

Severity	Meaning	Example
Critical	Data loss, financial impact, security breach	Double-charging a customer
High	Core feature unavailable	Can't log in
Medium	Feature degraded but usable	Search is slow but returns results
Low	Cosmetic or minor	Profile picture doesn't load

4. What is the response strategy?

Strategy	When to Use
Prevent	Failure is predictable and avoidable (validate inputs, check preconditions)
Retry	Failure is transient (network blip, temporary overload)
Fallback	There's a "good enough" alternative (show cached data if live data is unavailable)
Degrade	Turn off the broken feature, keep everything else running
Alert	Needs human attention (log, notify, escalate)
Fail fast	Continuing would make things worse (stop if data is corrupted)
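Several of these strategies compose. The sketch below combines Retry, Fallback, and Fail fast under stated assumptions: `TransientError`, `call_with_retry_and_fallback`, the attempt count, and the delays are hypothetical names and values chosen for illustration, not a prescribed policy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network blip)."""


def call_with_retry_and_fallback(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with growing delays, then use the fallback.

    Any other exception is not caught, so it propagates immediately
    (fail fast).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt < attempts - 1:
                # Double the wait each time: 0.1s, 0.2s, ...
                sleep(base_delay * 2 ** attempt)
    return fallback()


def search_is_down() -> str:
    raise TransientError("search backend unreachable")


# Fallback: show cached data if live data is unavailable.
result = call_with_retry_and_fallback(
    search_is_down, lambda: "cached results", sleep=lambda s: None
)
print(result)  # prints "cached results"
```

Retrying only a named class of error is the important design choice here: retrying a failure that is not transient just repeats the damage.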

5. What information is needed to diagnose it later?

When something fails at 3am and you're investigating at 9am, what do you need?

What was the input?
What was the expected output?
What was the actual output or error?
When did it happen?
What was the system state?

This is logging — not an afterthought, but a critical design decision.
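One way to capture those five answers is a structured log record. The sketch below uses Python's standard `logging` and `json` modules; the `failure_record` helper, the field names, and the example values are illustrative assumptions rather than a required schema.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict

logger = logging.getLogger("orders")  # logger name is a placeholder


def failure_record(
    operation: str, inputs: Any, expected: Any, actual: Any, state: Dict[str, Any]
) -> str:
    """Serialize one failure with the five facts needed to diagnose it later."""
    record = {
        "operation": operation,
        "input": inputs,        # What was the input?
        "expected": expected,   # What was the expected output?
        "actual": actual,       # What was the actual output or error?
        "when": datetime.now(timezone.utc).isoformat(),  # When did it happen?
        "state": state,         # What was the system state?
    }
    # default=str keeps the log call from failing on non-JSON types.
    return json.dumps(record, default=str)


line = failure_record(
    operation="create_pick_list",
    inputs={"order_id": 10547},
    expected="bin C-22",
    actual="bin B-14",
    state={"mapping_source": "cache"},  # hypothetical state field
)
logger.error(line)
```

A record like this can be searched and filtered by field, which matters when the investigation starts hours after the failure.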

What to Look For in the Examples

The following pages each present a system that has failed. You'll see:

A symptom described precisely — the starting point
The bisection process — how we narrow down the cause
Multiple hypotheses — some wrong, some right
The root cause — and how it connects to a failure category
How the failure could have been prevented — what design decision would have caught it earlier

Failure Modes — Example: E-Commerce Order Goes Wrong

The Scenario

An online store selling electronics. Customers are reporting a strange problem: some orders show the wrong items. A customer ordered a laptop and received a phone charger. Another ordered headphones and received a keyboard. It's not happening to all orders — just some.

This is a real investigation. Let's walk through it.

Step 1: Observe the Symptom Precisely

We gather reports and look for patterns:

Customer	Ordered	Received	Order Date	Payment Correct?
Customer A	Laptop ($899)	Phone Charger ($15)	March 5	Charged $899 ✅
Customer B	Headphones ($79)	Keyboard ($49)	March 5	Charged $79 ✅
Customer C	Monitor ($350)	Monitor ($350)	March 5	Charged $350 ✅
Customer D	Tablet ($449)	Mouse ($25)	March 6	Charged $449 ✅
Customer E	Keyboard ($49)	Keyboard ($49)	March 6	Charged $49 ✅

Observations:

Payment amounts are always correct (matches what they ordered, not what they received)
Some orders are fine (C and E got the right items)
Wrong items don't seem related (laptop → charger, headphones → keyboard)
Started March 5

The symptom is: The warehouse is shipping the wrong physical items for some orders, but the order records and payments are correct.

Step 2: Establish What Changed

What happened around March 5?

March 4: New inventory system deployed (upgraded from v2.3 to v3.0)
March 5: First wrong-item reports
Nothing else changed (no other code deploys, no staff changes, no new warehouse)

Strong correlation: new inventory system → wrong items. But correlation isn't causation — let's investigate.

Step 3: Bisect the Problem Space

The order lifecycle is:

Customer places order → Order recorded → Warehouse receives pick list → 
Worker picks items → Items packed → Items shipped → Customer receives

Where is the wrong data? Let's check the midpoint — the pick list that the warehouse receives.

Check 1: Is the order record correct?

We look at Customer A's order in the database:

Order #10547: Product = "Laptop XPS 15", Product ID = LP-2001, Quantity = 1

The order record is correct. The customer ordered a laptop and the database says laptop.

Check 2: Is the pick list correct?

We look at the pick list that was sent to the warehouse for Order #10547:

Order #10547: Bin Location = B-14, Quantity = 1

Wait — the pick list shows a bin location, not a product name. The warehouse worker goes to bin B-14 and picks whatever is there.

Check 3: What's in bin B-14?

Before March 4: Bin B-14 = Laptop XPS 15 (correct)
After March 4 (inventory system upgrade): Bin B-14 = Phone Charger USB-C

The bin assignments changed when the inventory system was upgraded. The old system had one mapping of products to bins. The new system reassigned bins based on a different optimization algorithm. But the pick list generation was still using the old mapping — it was reading from a cached or stale copy of the bin assignments.

Step 4: Form and Test Hypothesis

Hypothesis: The pick list generator is using a cached copy of the product-to-bin mapping that wasn't updated when the inventory system was upgraded on March 4. Products whose bins didn't change (same bin in old and new system) are shipping correctly. Products whose bins changed are shipping wrong items.

Test the prediction:

If this hypothesis is correct, then:

Products that shipped correctly should have the same bin location in both old and new systems
Products that shipped wrong should have different bin locations

Customer	Product	Old Bin	New Bin	Same?	Shipped Correctly?
A	Laptop	B-14	C-22	❌ No	❌ Wrong item
B	Headphones	D-08	A-31	❌ No	❌ Wrong item
C	Monitor	F-15	F-15	✅ Yes	✅ Correct
D	Tablet	A-31	D-08	❌ No	❌ Wrong item
E	Keyboard	G-03	G-03	✅ Yes	✅ Correct

Perfect correlation. Every wrong shipment has a bin mismatch. Every correct shipment has the same bin. Hypothesis confirmed.
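The same check can be run mechanically. This sketch encodes the five rows of the table above and looks for any order that contradicts the prediction; the tuple layout and variable names are just one convenient way to write it.

```python
# (customer, product, old_bin, new_bin, shipped_correctly), copied from the table
rows = [
    ("A", "Laptop", "B-14", "C-22", False),
    ("B", "Headphones", "D-08", "A-31", False),
    ("C", "Monitor", "F-15", "F-15", True),
    ("D", "Tablet", "A-31", "D-08", False),
    ("E", "Keyboard", "G-03", "G-03", True),
]

# Prediction: an order ships correctly exactly when its bin did not change.
counterexamples = [
    row for row in rows
    if (row[2] == row[3]) != row[4]
]

print(counterexamples)  # prints []
```

A single counterexample would be enough to send the hypothesis back for revision, which is why the check looks for mismatches rather than counting agreements.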

Notice something extra:

Customer B ordered headphones (old bin D-08, new bin A-31). Customer D ordered a tablet (old bin A-31, new bin D-08). Their bins swapped. Because the stale pick list points at each product's old bin, Customer B's order was picked from D-08 and Customer D's from A-31. What each customer actually received depends on what the new system stocked in that old bin.

This is how a bin mapping error creates cross-contamination — wrong items go to wrong customers in unpredictable combinations.

Step 5: Root Cause and Fix

Root Cause

Integration failure (Category: Integration). Two modules — the pick list generator and the inventory system — were reading from different versions of the bin mapping. The upgrade updated the inventory system's internal mapping but didn't invalidate or update the cache used by the pick list generator.

The Direct Fix

Update the pick list generator to read bin locations from the new inventory system's live data, not from a cached copy.

Verify the Fix

After the fix, run 10 test orders for products with changed bins. Verify the pick lists show the new bin locations.
Check that the fix doesn't affect products with unchanged bins (they should still work).
Check timing: the fix should take effect immediately, not after a cache timeout.

The Deeper Lesson: What Should Have Prevented This

1. Contract violation

The pick list generator relied on an implicit contract with the inventory system, in which inventory promises: "I will give you bin locations for product IDs." But the contract didn't specify where that data came from (live query vs. cached copy). If the contract had been explicit ("bin locations must be queried from the inventory system at pick time, not cached"), the cache would never have been built.

2. Boundary violation

The pick list generator cached data that belonged to another module (inventory). It crossed a boundary. If the boundary were enforced — "only the inventory module knows bin locations; everyone else must ask" — the stale cache wouldn't exist.

3. Missing failure mode in the upgrade plan

The inventory system upgrade plan didn't include: "What other systems read our bin mapping, and how do they read it?" A pre-mortem would have surfaced this: "What if other systems have a stale copy of our bin assignments?"

4. No verification at the seam

The seam between "pick list generated" and "warehouse worker picks item" has no verification. The worker goes to the bin and picks what's there — they have no way to verify it's the right product (unless they check the product name, which wasn't on the pick list). Adding a product name or barcode scan at pick time would have caught the mismatch immediately.
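A verification step at that seam could look like the sketch below. It assumes the pick list is extended to carry the product ID and that the worker scans a barcode at the bin. `PickLine`, `WrongItemError`, `confirm_pick`, and the charger ID `PC-0042` are hypothetical; only `LP-2001`, order #10547, and bin B-14 come from the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PickLine:
    order_id: str
    product_id: str    # added so the pick can be verified at the bin
    bin_location: str
    quantity: int


class WrongItemError(Exception):
    """Raised when the scanned item is not the one the order asked for."""


def confirm_pick(line: PickLine, scanned_product_id: str) -> None:
    """Compare the scanned barcode with the product the pick line expects."""
    if scanned_product_id != line.product_id:
        raise WrongItemError(
            f"Order {line.order_id}: bin {line.bin_location} held "
            f"{scanned_product_id}, expected {line.product_id}"
        )


line = PickLine(order_id="10547", product_id="LP-2001", bin_location="B-14", quantity=1)

try:
    confirm_pick(line, scanned_product_id="PC-0042")  # hypothetical charger ID
except WrongItemError as err:
    print(err)
```

The check does not fix the stale mapping. It turns a silent failure into a loud one on the first affected order, before anything ships.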

Failure Category Map for This Scenario

Root cause:    Integration failure (stale cache)
Amplifier:     No verification at physical seam
Blast radius:  All orders with changed bin locations (~30% of products)
Severity:      High (wrong items shipped, expensive returns)
Could prevent: Explicit contract, boundary enforcement, upgrade checklist
Could detect:  Barcode verification at pick, bin mapping comparison test
Could reduce:  Faster detection through customer complaint pattern analysis

Failure Modes — Example: Messaging App Mystery

The Scenario

A team messaging app (like Slack). Users are reporting that messages are appearing out of order in group conversations. A message sent at 2:03 PM appears above a message sent at 2:01 PM. It doesn't happen in every conversation, and it doesn't happen all the time. Some users say they can't reproduce it.

This is a timing failure — the hardest category to debug because the problem is intermittent and order-dependent.

Step 1: Observe the Symptom Precisely

We collect specific reports:

Report	Conversation	What User Sees	Expected Order
1	#engineering (45 members)	Message from Bob at 2:03 appears above Alice's from 2:01	Alice first, Bob second
2	#engineering (45 members)	Same message pair — but Carol sees them in the correct order	Depends on viewer?
3	DM between Dave and Eve	Never happens — DMs always in order	—
4	#general (200 members)	Frequent reordering, especially during active discussion	—
5	#random (10 members)	Rarely happens	—

Pattern emerging:

Happens in group conversations, not DMs
More frequent in larger groups
More frequent during high activity
Different users see different orders for the same messages

Step 2: Establish What Changed

Users say this "started recently" but can't say exactly when. We check the deployment log:

2 weeks ago: Scaled the messaging backend from 1 server to 3 servers (load balancing) to handle growing user count
Nothing else changed

Hypothesis forming: Scaling from 1 server to 3 might be related to the ordering issue.

Step 3: Bisect the Problem Space

The message flow is:

Sender types message → Sender's device sends to server → Server stores message → 
Server broadcasts to group members → Each member's device receives and displays

Check the midpoint: Are messages stored in the correct order?

We query the database for the #engineering conversation around 2:00 PM:

Message ID	Sender	Text	Timestamp (server)	Stored Order
msg-4401	Alice	"Has anyone seen the test results?"	2:01:03.142	1st
msg-4402	Bob	"Just posted them in the doc"	2:01:47.891	2nd
msg-4403	Carol	"Thanks!"	2:02:15.003	3rd
msg-4404	Bob	"The latency numbers look bad"	2:03:01.556	4th

The database has the correct order. Messages are stored with server timestamps and the order is right.

So the bug is after storage — in the broadcast/display phase.

Step 4: Deeper Investigation — The Broadcast

How does broadcast work?

Before the scale-up (1 server):

Message stored → Server sends to all connected members → Done

After the scale-up (3 servers):

Message stored → Publish event to message queue → All 3 servers read from queue → 
Each server sends to its connected members

With 3 servers, the 45 members of #engineering are distributed:

Server 1: 18 members connected
Server 2: 15 members connected
Server 3: 12 members connected

When Alice sends a message at 2:01, the flow is:

Alice's device → Server 2 (she happens to be connected to Server 2)
Server 2 stores the message
Server 2 publishes "new message" event to the message queue
All 3 servers pick up the event and send it to their connected members

When Bob sends a message at 2:03, the flow is:

Bob's device → Server 1 (he's connected to Server 1)
Server 1 stores the message
Server 1 publishes "new message" event to the message queue
All 3 servers pick up the event

Here's the problem:

The message queue doesn't guarantee that events are delivered to all consumers in the same order. Server 1 might receive Bob's event before Alice's event because Bob's message was published from Server 1 (local) while Alice's had to travel across the network.

Timeline:

Server 2 stores Alice's msg at 2:01:03.142
Server 2 publishes event ─────────────────────── travels across network
Server 1 stores Bob's msg at 2:03:01.556
Server 1 publishes event ─── stays local

Server 1 receives Bob's event at 2:03:01.560 ← 4ms after it was stored (local)
Server 1 receives Alice's event at 2:03:01.580 ← 20ms after Bob's event, although it was published first

Server 1 broadcasts Bob's message to its 18 connected members FIRST
Server 1 broadcasts Alice's message 20ms later

Users on Server 1 see: Bob, then Alice (WRONG ORDER)
Users on Server 2 see: Alice, then Bob (CORRECT ORDER)

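Different users see different orders because they're connected to different servers, and the servers receive events in different orders.

Note the size of the gap in this timeline. Alice's event was published almost two minutes before Bob's, so a few milliseconds of network latency cannot explain its late arrival on its own. A delay that long means the event was held up somewhere between publish and delivery, for example in a backlogged queue. Messages sent within milliseconds of each other need no such delay: they can cross on latency alone, which is why busy conversations are hit hardest.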

Step 4 (continued): Form Hypothesis

Hypothesis: When multiple messages arrive at a server via the message queue within a short time window, the server broadcasts them in arrival order (when the event reached that particular server) rather than timestamp order (when the message was actually created). Users connected to different servers receive messages in different orders.

Test the prediction:

If this is correct, then:

DMs never have this problem (they involve only 2 people, likely routed through one server)
Larger groups are more affected (more members = more servers involved along the way)
Fast-paced conversations are more affected (messages close together in time are more susceptible to reordering)
Users on the same server always see the same order (right or wrong)

All four predictions match the reports. Hypothesis confirmed.

Step 5: The Fix — And Why It's Not Obvious

Attempted Fix 1: "Just sort by timestamp on the server"

Have each server sort messages by timestamp before broadcasting.

Problem: The server doesn't know if more messages are coming. When it receives Bob's event, should it wait to see if an earlier message might arrive? How long should it wait? 10ms? 100ms? 1 second?

Wait too short → still might miss earlier messages
Wait too long → messages feel laggy to users

This is the fundamental tradeoff of distributed systems: you can't have both instant delivery AND perfectly correct ordering without coordination.

Attempted Fix 2: "Sort on the client"

Each user's device sorts messages by server timestamp after receiving them.

Problem: This mostly works, but creates a jarring experience — a message appears at the bottom, then "jumps up" when an earlier message arrives a moment later. Users see messages rearranging in real time, which feels buggy even though it's technically correct.

Actual Fix: Client-side insertion sort with timestamp

Each user's device maintains messages sorted by server timestamp. When a new message arrives:

Check its timestamp against the last displayed message
If it's newer → append at the bottom (most common case, feels instant)
If it's older → insert it at the correct position AND show a subtle visual indicator ("1 earlier message inserted above")

This is a compromise: correct ordering with a visual cue so users aren't confused by messages appearing "above" what they already read.
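The client-side logic can be sketched with Python's `bisect` module. The `Message` and `ConversationView` classes are hypothetical, the timestamps are assumed to arrive as fixed-width 24-hour strings so that string comparison matches time order, and using the message ID as a tie-breaker is an added assumption so that every client resolves equal timestamps the same way.

```python
import bisect
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Message:
    server_timestamp: str             # assumed format "HH:MM:SS.mmm", 24-hour
    message_id: str                   # tie-breaker for identical timestamps
    text: str = field(compare=False)  # not part of the ordering


class ConversationView:
    def __init__(self) -> None:
        self.messages: List[Message] = []  # always kept in sorted order

    def receive(self, msg: Message) -> bool:
        """Insert msg at its sorted position.

        Returns True when msg landed above something already displayed,
        which is when the UI should show the "earlier message" indicator.
        """
        arrived_late = bool(self.messages) and msg < self.messages[-1]
        bisect.insort(self.messages, msg)
        return arrived_late


view = ConversationView()
bob = Message("14:03:01.556", "msg-4404", "The latency numbers look bad")
alice = Message("14:01:03.142", "msg-4401", "Has anyone seen the test results?")

print(view.receive(bob))    # prints False: appended at the bottom
print(view.receive(alice))  # prints True: inserted above Bob's message
print([m.message_id for m in view.messages])  # prints ['msg-4401', 'msg-4404']
```

Because the common case is a newer message, the comparison against the last element lets the UI treat an append as instant and reserve the indicator for the rare late arrival.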

The Deeper Lesson: Distributed Systems Create Timing Failures

Why 1 server didn't have this problem

With one server, all messages passed through a single point. The server processed them sequentially, so the broadcast order always matched the storage order. No timing ambiguity.

Why 3 servers created the problem

With three servers, there are three independent paths for messages to travel. Each path has slightly different timing. This is called non-deterministic ordering — the order depends on network latency, load, and which server the sender is connected to.

The general principle

Any time you add parallelism, you create the possibility of ordering problems. This applies to:

Multiple servers
Multiple threads in a program
Multiple workers processing a queue
Multiple microservices handling events

The question isn't "will ordering be a problem?" It's "how will we handle the ordering problem?"

Failure Category Map

Root cause:    Timing failure (non-deterministic event ordering across servers)
Amplifier:     High message volume in large groups
Blast radius:  All group conversations (DMs unaffected)
Severity:      Medium (annoying, confusing, but no data loss)
Could prevent: Design the broadcast system with ordering guarantees from day one
Could detect:  Automated ordering test (send messages with known timestamps,
               verify all clients receive in correct order)
Could reduce:  Client-side timestamp sorting with visual indicators

Compare With the E-Commerce Bug

Aspect	E-Commerce (wrong items)	Messaging (wrong order)
Failure category	Integration (stale cache)	Timing (non-deterministic ordering)
Reproducibility	100% for affected products	Intermittent, depends on timing
Who's affected	Orders for the ~30% of products whose bins changed	Users on different servers, during high activity
Data corrupted?	No (database correct, pick list wrong)	No (database correct, display order wrong)
Root cause found by	Checking midpoint (pick list)	Checking delivery path (server → client)
Fix complexity	Simple (update the data source)	Hard (fundamental distributed systems tradeoff)
Prevention	Better contracts and boundary enforcement	Architectural decision about ordering guarantees

The e-commerce bug had a clear, fixable root cause. The messaging bug revealed a fundamental limitation of the architecture that required a compromise, not a simple fix. This is the difference between a bug and a design constraint.

Failure Modes — Example: Flight Booking Cascade

The Scenario

An airline booking system. On a busy Friday afternoon, the following happens within a 30-minute window:

Customers report they can't search for flights (the search page spins forever)
Customers who already selected flights can't complete payment
The customer service phone lines are flooded
An agent manually checks and sees that the booking database is responding, but slowly
Internal monitoring shows the flight search API is responding in 45 seconds (normal: under 500 milliseconds)

This is a cascading failure — one problem triggers a chain of other problems that makes everything worse.

Step 1: Observe the Symptoms — All of Them

Unlike the previous examples (single symptom), here we have multiple symptoms appearing simultaneously:

Symptom	Affected System	Severity
Flight search returns in 45 seconds	Search API	High
Payment processing times out	Payment Service	Critical
Customer service call volume 5x normal	Call Center	High
Booking database slow (but responding)	Database	Medium
Internal admin dashboard unresponsive	Admin UI	Low

These aren't five separate bugs. They're connected. The key question: which one caused the others?

Step 2: Establish a Timeline

We reconstruct what happened:

Time	Event
2:00 PM	Everything normal
2:12 PM	Marketing team launches a flash sale: "50% off all Caribbean flights this weekend." Email sent to 2 million subscribers.
2:15 PM	Website traffic increases 10x
2:17 PM	Search API response times begin rising (500ms → 2s → 5s → 15s → 45s)
2:20 PM	Payment service starts timing out (its requests to the database are queued behind search queries)
2:22 PM	Customers who can't search or pay start calling customer service
2:25 PM	Admin dashboard becomes unresponsive (it also queries the same database)
2:30 PM	All systems severely degraded

The trigger: A flash sale email drove sudden, massive traffic. But the trigger is not the root cause. Traffic spikes are expected. The question is: why didn't the system handle it?

Step 3: Bisect — Find the Bottleneck

The system architecture:

Users → Web Servers → Search API ───┐
                                    ├── Database
Users → Web Servers → Payment API ──┤
                                    │
Admin → Admin Dashboard ────────────┘

Three services (Search, Payment, Admin) all share one database. Let's check each layer:

Web servers: Handling requests, but slowly. They're waiting on responses from the APIs. Not the bottleneck — they're victims.

Search API: Sending queries to the database, but queries are slow. Not the root cause — it's also waiting.

Payment API: Same situation. Queries are queuing up and timing out.

Database: Here's the bottleneck.

Inside the Database

The database can handle approximately 500 queries per second under normal load. Each search query involves:

Searching available flights by route, date, and class
Checking seat availability for each matching flight
Calculating dynamic pricing for each available flight

A single search is about 3-5 database queries. The estimates below use the upper figure of 5.

Normal traffic: 50 searches/sec × 5 queries = 250 queries/sec → database comfortable

Flash sale traffic: 500 searches/sec × 5 queries = 2,500 queries/sec → database overwhelmed

The database hits its connection limit. New queries queue up. Queue times increase. The search API waits for the database, the web server waits for the search API, the user waits for the web server. Each layer adds its own timeout on top.

But here's the critical part: The payment API, which only handles 10-20 transactions per second, also queries the same database — but its queries are now stuck behind 2,500 search queries. A payment that normally takes 200ms now takes 30 seconds and times out.

The flash sale broke search AND payment, even though payment traffic didn't increase at all.
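The load arithmetic is worth writing down, because it is the kind of estimate that can be done before a promotion goes out. The sketch below uses only the figures given above; the constant and function names are illustrative.

```python
DB_CAPACITY_QPS = 500    # approximate queries/sec the database can handle
QUERIES_PER_SEARCH = 5   # upper end of the 3-5 range


def search_query_load(searches_per_sec: int) -> int:
    """Database queries per second generated by search traffic."""
    return searches_per_sec * QUERIES_PER_SEARCH


normal_load = search_query_load(50)
flash_load = search_query_load(500)

print(normal_load, normal_load / DB_CAPACITY_QPS)  # prints 250 0.5
print(flash_load, flash_load / DB_CAPACITY_QPS)    # prints 2500 5.0
```

Normal search traffic uses half of the database's capacity. The flash sale asks for five times the capacity, before payment and admin queries are counted at all.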

Step 4: The Cascade Chain

Flash sale email
    → 10x website traffic
        → 10x database query volume (search queries), 5x the database's capacity
            → Database connection pool exhausted
                → Search queries slow to 45 seconds
                → Payment queries can't get a database connection
                    → Payment times out
                    → Customers can't pay
                        → Customers call support
                            → Support lines overwhelmed
                → Admin queries can't get a database connection
                    → Admin dashboard unresponsive
                    → Operators can't see what's happening
                        → Slow response to the incident

One event (flash sale) → six cascading failures. Each failure amplifies the next.

The Amplification Pattern

Notice the feedback loop:

Search is slow → users retry (refresh the page) → more search queries → database even slower → users retry more aggressively

This is often called a thundering herd, or more precisely a retry storm: when a system slows down, users retry, which generates even more load, which makes it slower, which generates more retries. The system enters a death spiral where recovery is impossible without intervention.

Step 5: Hypotheses and What Would Fix Each

Hypothesis 1: "Just upgrade the database"

Get a bigger database that can handle 2,500 queries/sec.

Problem: This fixes today's flash sale but doesn't fix the next one. If the sale is bigger (5 million emails), you'd need an even bigger database. You're scaling to the peak — expensive and always one step behind.

Verdict: Treats the symptom, not the disease.

Hypothesis 2: "Separate the databases"

Give search, payment, and admin their own databases.

Search API  → Search Database (can be slow under load — annoying but not critical)
Payment API → Payment Database (protected from search traffic — critical operations stay fast)
Admin       → Admin Database (or read replica)

Verdict: This prevents search traffic from killing payments. The blast radius of a search overload no longer includes payment. This is the boundary principle from Section 2 — critical and non-critical operations should not share the same resource pool.

Hypothesis 3: "Add rate limiting to search"

Limit search to 200 queries per second. Beyond that, return a "please try again in a moment" message.

Verdict: This prevents the database from being overwhelmed. Users see a brief delay instead of a 45-second hang. It's annoying but preferable to the entire system collapsing. This is the degrade strategy: intentionally limit one feature to protect the rest.
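One common way to enforce such a limit is a token bucket. The scenario does not name an algorithm, so treat this as one possible sketch: `TokenBucket` and its parameters are hypothetical, and the clock is injectable only so the behavior can be checked without waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate=200, capacity=200)


def handle_search(query: str) -> str:
    if not limiter.allow():
        return "Please try again in a moment"
    return f"results for {query}"  # placeholder for the real search call
```

A rejected request costs almost nothing, while an accepted one costs several database queries. That asymmetry is what lets the limiter protect the database during a spike.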

Hypothesis 4: "Cache popular search results"

Caribbean flights are what the sale promoted. Cache the results for common Caribbean route+date queries. The first search hits the database; subsequent identical searches are answered from cache.

Verdict: This dramatically reduces database load for the exact queries the flash sale generates. If 80% of search queries are for the same Caribbean routes, cache handles 2,000 of the 2,500 queries/sec, leaving only 500 for the database. That is right at the database's approximate capacity of 500 queries/sec, so caching alone leaves no headroom for payment and admin queries.
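A minimal sketch of the caching idea, assuming a simple time-to-live policy. `TTLCache`, the 30-second TTL, and the route-and-date key are illustrative assumptions; the scenario does not say how long a search result may safely be reused, and since seat availability and prices change, the TTL would need to be chosen with care.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Remember computed values for a fixed number of seconds."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no database work
        value = compute()    # miss or expired: do the real work once
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = 0


def query_database() -> list:
    """Stand-in for the real 3-5 query search."""
    global db_calls
    db_calls += 1
    return ["flight 1", "flight 2"]


cache = TTLCache(ttl_seconds=30)
key = ("MIA", "SJU", "2024-03-09")  # hypothetical route and date
for _ in range(1000):
    cache.get_or_compute(key, query_database)

print(db_calls)  # prints 1
```

A thousand identical searches reach the database once. The first search after each expiry still pays full price, so the TTL sets how much staleness is traded for how much load reduction.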

The Real Fix: All of the Above (Layered Defense)

No single fix is sufficient. Real systems use layered defenses:

Rate limiting (immediate: deploy within hours, prevents the death spiral)
Caching (short-term: deploy within days, reduces database load for common queries)
Separate databases (medium-term: deploy within weeks, isolates critical from non-critical)
Load testing before promotions (process: coordinate with marketing — "tell engineering before sending 2 million emails")

The Deeper Lesson: Shared Resources Create Cascading Failures

The Shared Resource Anti-Pattern

BEFORE (dangerous):

Search  ──┐
Payment ──┼── Shared Database
Admin   ──┘

Any one service can saturate the database and starve the others.

AFTER (isolated):

Search  → Search DB (or cache + DB)
Payment → Payment DB
Admin   → Read replica

Each service has its own resource pool. A surge in one doesn't affect the others.
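The same isolation idea also applies inside a single process. The sketch below is not the separate-database design itself but a related technique, sometimes called a bulkhead, that gives each service its own cap on concurrent work. `Bulkhead`, `BulkheadFull`, and the slot counts are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class BulkheadFull(Exception):
    """Raised when a service has used all of its own slots."""


class Bulkhead:
    """Cap concurrent work for one service so it cannot starve the others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Reject immediately instead of queueing behind a saturated pool.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()


search = Bulkhead("search", max_concurrent=2)    # tiny numbers for illustration
payment = Bulkhead("payment", max_concurrent=1)

with search.slot(), search.slot():
    # Search is saturated, yet payment still gets its own slot.
    with payment.slot():
        print("payment proceeds while search is full")
```

Each limit is a deliberate statement about blast radius: a search overload can exhaust search's slots and nothing else.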

The Blast Radius Principle

Every shared resource is a potential blast radius amplifier. When you share:

A database
A network connection
A thread pool
A queue
A rate limit
A budget

…you're saying "the failure of any one consumer can affect all consumers." Sometimes sharing is the right choice (cost, simplicity). But you must know what you're risking.

The Traffic Spike Is Not the Bug

The flash sale email was the trigger, not the cause. The cause was:

No isolation between critical (payment) and non-critical (search) systems
No rate limiting to prevent overload
No caching for predictable high-volume queries
No coordination between marketing and engineering

The flash sale was a normal business event. The system should have handled it — or at least degraded gracefully instead of collapsing completely.

Failure Category Map

Root cause:    Resource failure (shared database overwhelmed)
Trigger:       External traffic spike (flash sale email)
Amplifier:     Thundering herd (user retries), shared resource (database)
Blast radius:  ALL services (search, payment, admin, support)
Severity:      Critical (payment broken = lost revenue)
Could prevent: Resource isolation (separate databases), rate limiting,
               caching, load testing, marketing coordination
Could detect:  Database connection pool monitoring, query queue length 
               alerts, response time thresholds
Could reduce:  Graceful degradation (return cached results, queue 
               payments for retry instead of dropping them)

Compare With Previous Examples

Aspect	E-Commerce (wrong items)	Messaging (wrong order)	Flight Booking (cascade)
Number of symptoms	1 (wrong items)	1 (wrong order)	5+ (everything breaks)
Failure category	Integration	Timing	Resource + cascading
Trigger	System upgrade	Scaling to 3 servers	Traffic spike
Root cause	Stale cache	Non-deterministic ordering	Shared database bottleneck
Fix complexity	Simple (one change)	Moderate (client-side compromise)	High (layered, multiple changes)
Blast radius	Orders for ~30% of products	Group conversations	Every user, every feature
Feedback loop?	No	No	Yes (retries amplify load)
Prevention theme	Contracts + boundaries	Architectural ordering decisions	Resource isolation + graceful degradation

The key escalation: from a single-cause bug, to a design limitation, to a systemic vulnerability. Each example requires more sophisticated thinking about failure.

Failure Modes — Pre-Mortems and Failure Planning

What Is a Pre-Mortem?

A post-mortem happens after something breaks. You investigate what went wrong.

A pre-mortem happens before anything breaks. You imagine it's six months from now and the system has failed catastrophically, then work backward: "What went wrong?"

This flips the psychology. In a planning meeting, people are optimistic. In a pre-mortem, everyone has permission to be pessimistic — and pessimism is productive.

The Pre-Mortem Process

Step 1: Define the System

State clearly what you're building and its key characteristics:

What does it do?
Who uses it?
What data does it handle?
What are the critical operations?

Step 2: Imagine the Disaster

Each person independently writes down their answer to: "It's six months from now. The system has failed in its worst possible way. What happened?"

Not little bugs. Catastrophes:

Data was lost permanently
Money was charged incorrectly
Security was breached
The system was down for days
Customers left in large numbers

Step 3: Group the Failures

Collect all imagined disasters and group them:

Which ones are about data? (loss, corruption, leakage)
Which ones are about availability? (downtime, slowness)
Which ones are about correctness? (wrong results, wrong actions)
Which ones are about security? (unauthorized access, data exposure)
Which ones are about scaling? (couldn't handle growth)

Step 4: For Each Failure, Ask Three Questions

How likely is this? (Almost certain / Probable / Possible / Unlikely)
How severe is the impact? (Critical / High / Medium / Low)
What would prevent or mitigate it? (Design decision, monitoring, process)

Step 5: Act on the High-Risk Items

Anything that is both likely and severe must be addressed in the design — not deferred to "we'll fix it later."

Worked Pre-Mortem 1: Online Banking App

The System

A mobile banking app. Customers can check balances, transfer money, pay bills, and deposit checks by photographing them.

Imagined Disasters

#	Disaster	Category	Likelihood	Severity
1	"A customer transferred $10,000 but the money disappeared — left the source account but never arrived at the destination"	Correctness	Possible	Critical
2	"Someone gained access to 50,000 customer accounts because session tokens weren't invalidated after password changes"	Security	Probable	Critical
3	"The app was down for 6 hours on a Friday (payday) because a database migration failed and couldn't be rolled back"	Availability	Probable	Critical
4	"A customer deposited the same check 15 times by rapidly submitting photos, and was credited $15,000 for a $1,000 check"	Correctness	Possible	High
5	"Customer service had no way to see what went wrong with a failed transaction because logging was incomplete"	Data	Probable	High
6	"The mobile app crashed on Android 12 devices and wasn't caught because we only tested on iOS"	Availability	Probable	Medium
7	"A third-party payment provider changed their API without notice, and bill payments silently failed for 3 days"	Integration	Possible	High

Prevention Plan

Disaster 1: Disappearing transfer

Prevention: Atomic transactions — debit and credit must be a single, indivisible operation. If one fails, both roll back.
Detection: Reconciliation: every night, verify that total debits = total credits across all accounts. Any mismatch triggers an alert.
Recovery: Transaction is logged with full details regardless of success/failure, enabling manual correction.
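
Here is one way the "both succeed or both roll back" rule might look, sketched with Python's built-in `sqlite3` module. The `accounts` table, the account names, and the whole-dollar amounts are assumptions made for illustration; a real bank would use its own schema and database.

```python
import sqlite3

def transfer(conn, source, dest, amount):
    """Move `amount` between accounts as one indivisible transaction."""
    with conn:  # commits on success, rolls back if anything inside raises
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, source, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("source account missing or insufficient funds")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dest),
        )
        if cur.rowcount != 1:
            raise ValueError("destination account missing")

def total_balance(conn):
    """Reconciliation check: a transfer must never change the overall total."""
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 10_000), ("bob", 500)])
conn.commit()

before = total_balance(conn)
transfer(conn, "alice", "bob", 10_000)
try:
    transfer(conn, "bob", "nobody", 100)   # debit succeeds, credit fails
except ValueError:
    pass                                   # the debit was rolled back with it
assert total_balance(conn) == before
```

The second transfer is the disaster scenario in miniature: the money leaves one account and has nowhere to land. Because both updates share one transaction, the debit is undone and the totals still reconcile.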

Disaster 2: Session tokens after password change

Prevention: On password change, invalidate ALL active sessions for that user. Force re-authentication.
Detection: Monitor for sessions that continue after a password change event. This should trigger a security alert.
Process: Add to the security review checklist: "What happens to active sessions when credentials change?"
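
A minimal sketch of the prevention rule, assuming sessions are kept in a simple in-memory store. `SessionStore`, `change_password`, and the `passwords` dictionary are hypothetical names; a production system would typically keep sessions in a shared store, but the rule is the same.

```python
import secrets

class SessionStore:
    """Hypothetical in-memory session store: token -> user_id."""

    def __init__(self):
        self._sessions = {}

    def create(self, user_id):
        token = secrets.token_urlsafe(32)
        self._sessions[token] = user_id
        return token

    def is_valid(self, token):
        return token in self._sessions

    def invalidate_all(self, user_id):
        """Drop every session for this user; return how many were removed."""
        stale = [t for t, u in self._sessions.items() if u == user_id]
        for token in stale:
            del self._sessions[token]
        return len(stale)

def change_password(store, passwords, user_id, new_password_hash):
    """Update the credential, then force re-authentication everywhere."""
    passwords[user_id] = new_password_hash
    return store.invalidate_all(user_id)
```

The important design choice is that invalidation happens inside `change_password` itself, so no caller can change a password and forget the sessions.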

Disaster 3: Failed database migration on payday

Prevention: Never run migrations on Fridays. Always have a tested rollback script. Run migrations in a staging environment first.
Detection: Automated health check: if the app can't connect to the database within 5 seconds, page the on-call engineer.
Mitigation: Read-only mode: if the database is mid-migration, customers can view balances but not make transactions. Degraded but not dead.

Disaster 4: Duplicate check deposit

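Prevention: Idempotency key: each check gets a unique ID derived from the check itself (for example, from its image or its check number), not from the individual submission. If the same check is submitted twice, the second submission is recognized as a duplicate and rejected.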
Detection: Flag accounts with multiple deposits of the same amount in a short time window.
Business rule: Hold deposited funds for 24 hours before making them available (standard banking practice, but now you know why).
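
The duplicate check can be sketched as follows. This version assumes the key is a hash of the account ID plus the raw image bytes, which only catches byte-identical resubmissions; a key built from details read off the check would likely be more robust against a second photo of the same check. `DepositLedger` and its fields are hypothetical.

```python
import hashlib

class DepositLedger:
    """Hypothetical ledger that refuses to credit the same check twice."""

    def __init__(self):
        self._seen = {}      # idempotency key -> deposit record
        self.balance = 0

    def deposit_check(self, account_id, image_bytes, amount):
        key = hashlib.sha256(account_id.encode() + b":" + image_bytes).hexdigest()
        if key in self._seen:
            return {"status": "duplicate", "key": key}   # no second credit
        self._seen[key] = {"account": account_id, "amount": amount}
        self.balance += amount
        return {"status": "accepted", "key": key}

ledger = DepositLedger()
photo = b"...bytes of the check photo..."
results = [ledger.deposit_check("acct-1", photo, 1_000) for _ in range(15)]
```

Fifteen rapid submissions produce one accepted deposit and fourteen duplicates, so the balance reflects the check once.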

Disaster 5: Incomplete logging

Prevention: Define logging as part of the contract for every operation. Every contract's side effects section must include what is logged.
Standard: For every transaction: log the input, the output, the timestamp, the customer ID, the IP address, and either the success result or the full error.
Verification: Regularly test that a support engineer can reconstruct what happened for a given transaction using only the logs.
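
One possible shape for that logging standard is a wrapper that writes exactly one record per operation, whether it succeeds or fails. The function name `run_logged`, the field names, and the list used as a log sink are assumptions; the request is also assumed to be JSON-serializable.

```python
import json
import traceback
from datetime import datetime, timezone

def run_logged(operation, func, request, customer_id, ip_address, sink):
    """Run func(request) and append one JSON log line to `sink`, success or failure."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operation": operation,
        "customer_id": customer_id,
        "ip_address": ip_address,
        "input": request,
    }
    try:
        record["output"] = func(request)
        record["status"] = "success"
        return record["output"]
    except Exception:
        record["status"] = "error"
        record["error"] = traceback.format_exc()   # the full error, not a summary
        raise
    finally:
        sink.append(json.dumps(record))            # written on both paths
```

Because the write sits in `finally`, a failed transaction cannot skip its log entry, which is the gap Disaster 5 describes.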

Worked Pre-Mortem 2: School Registration System

The System

An online system where parents register their children for the upcoming school year. Choose a school, submit personal information, upload documents (proof of address, immunization records), and get a confirmation.

Imagined Disasters

#	Disaster	Category	Likelihood	Severity
1	"Registration opened at 8 AM and the site crashed within 2 minutes because 10,000 parents all clicked at the same time"	Scaling	Almost certain	High
2	"A parent registered their child at School A, but the system assigned them to School B because of a race condition on the last available seat"	Correctness	Probable	High
3	"A parent uploaded their child's medical records, and another parent could see them due to a document ID that was sequential and guessable"	Security	Possible	Critical
4	"Registration closed, but 200 parents say they submitted before the deadline and have no confirmation. No one can prove what happened."	Data	Probable	High
5	"The system accepted a registration without required immunization records. The school discovered this on the first day of class."	Correctness	Probable	Medium
6	"A family with special needs (IEP, 504 plan) registered but the system didn't flag this for the school, so no accommodations were prepared"	Correctness	Possible	High

Prevention Plan

Disaster 1: Opening-day crash

Prevention: Load test with 10x expected traffic before launch. Use a virtual queue ("You are #3,247 in line. Estimated wait: 12 minutes.") instead of letting everyone hit the system simultaneously.
Mitigation: Have a static "we're experiencing high volume" page that doesn't require the database, so the site doesn't show an error.
Communication: Tell parents in advance: "Registration stays open for 2 weeks. Spots are not first-come-first-served. You do not need to register at 8 AM."

Disaster 2: Race condition on last seat

Prevention: Don't assign seats in real time. Accept all registrations as "pending." Run the assignment process after registration closes, with clear tiebreaker rules.
If real-time assignment is required: Use a pessimistic, time-limited hold. When a parent starts registering for School A, temporarily reserve a seat. If they don't complete within 15 minutes, release it.
Never say "you're in" until the seat reservation is confirmed and committed.
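
The real-time option can be sketched as a seat hold with an expiry. This is a single-process illustration with hypothetical names (`SeatHolds`, `reserve`, `confirm`); a real deployment with several servers would need the hold to live in the database so that two servers cannot both grant the last seat.

```python
import time

class SeatHolds:
    """Hypothetical seat holds for one school, with a time limit on each hold."""

    def __init__(self, capacity, hold_seconds=15 * 60, clock=time.monotonic):
        self.capacity = capacity
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.holds = {}          # parent_id -> expiry time
        self.confirmed = set()

    def _expire(self):
        now = self.clock()
        self.holds = {p: t for p, t in self.holds.items() if t > now}

    def reserve(self, parent_id):
        """Hold a seat while the parent fills in the form."""
        self._expire()
        if parent_id in self.holds or parent_id in self.confirmed:
            return True
        if len(self.holds) + len(self.confirmed) >= self.capacity:
            return False
        self.holds[parent_id] = self.clock() + self.hold_seconds
        return True

    def confirm(self, parent_id):
        """Only a live hold can become a confirmed seat."""
        self._expire()
        if parent_id not in self.holds:
            return False         # hold expired or never existed: do not say "you're in"
        del self.holds[parent_id]
        self.confirmed.add(parent_id)
        return True
```

The `clock` parameter exists so the expiry rule can be tested without waiting 15 minutes.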

Disaster 3: Guessable document IDs

Prevention: Use random, non-sequential document IDs (for example, version 4 UUIDs, which are randomly generated). Never use auto-incrementing IDs for anything the user can see in a URL.
Authorization: Even with random IDs, check that the requesting user is authorized to see the document. Defense in depth — random ID + authorization check.
Encryption: Store uploaded documents encrypted at rest. Even if someone accesses the storage directly, they can't read the files.
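
The first two layers, random IDs plus an ownership check, might look like this. The `documents` dictionary and the `upload` and `fetch` functions are hypothetical; encryption at rest is left out of the sketch.

```python
import uuid

documents = {}   # document_id -> {"owner": ..., "content": ...}

class Forbidden(Exception):
    """Raised for both 'no such document' and 'not yours', so IDs cannot be probed."""

def upload(owner_id, content):
    doc_id = str(uuid.uuid4())          # random, not sequential
    documents[doc_id] = {"owner": owner_id, "content": content}
    return doc_id

def fetch(requesting_user, doc_id):
    doc = documents.get(doc_id)
    if doc is None or doc["owner"] != requesting_user:
        raise Forbidden("not found or not authorized")
    return doc["content"]
```

Even if an attacker somehow learns a valid ID, `fetch` still refuses to return another family's document.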

Disaster 4: No proof of submission

Prevention: Every submission generates a confirmation number immediately, displayed on screen AND emailed. If the email fails, the confirmation number is still shown on screen.
Logging: Log every submission attempt with timestamp, IP address, and all submitted data. This is the system's proof.
Grace period: If the system was under heavy load near the deadline, extend the deadline. Publish the policy in advance.

Worked Pre-Mortem 3: Smart Thermostat System

The System

A home thermostat connected to the internet. Users set schedules via a phone app. The thermostat communicates with the furnace/AC and reports energy usage.

Imagined Disasters

#	Disaster	Category	Likelihood	Severity
1	"Internet goes down. Thermostat stops maintaining temperature because it depends on the cloud to get the schedule."	Availability	Almost certain	Critical (pipes freeze in winter)
2	"A software update bricked 50,000 thermostats. They display nothing and don't control temperature."	Availability	Possible	Critical
3	"A hacker accessed the thermostat API and set 100,000 homes to 95°F in August, causing danger for elderly residents."	Security	Possible	Critical
4	"The thermostat reported wrong energy data, and customers got unexpectedly high utility bills"	Correctness	Probable	High
5	"Two family members set conflicting schedules from their phones. The thermostat oscillated between 68°F and 75°F every few minutes."	Correctness	Probable	Medium

Prevention Plan

Disaster 1: Internet dependency

Prevention: The thermostat must operate independently of the internet. The schedule is stored on the device, not just in the cloud. The cloud syncs the schedule, but the device doesn't require it.
Design rule: If the internet connection goes away, the thermostat continues following its last-known schedule indefinitely. The user loses remote control but the house stays warm.
This is a boundary decision: The thermostat is its own module. The cloud is a convenience layer, not a dependency.
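
That boundary can be sketched in a few lines. The `Thermostat` class, the schedule format (start hour mapped to target temperature in °F), and the use of `ConnectionError` to represent "no internet" are assumptions made for illustration.

```python
class Thermostat:
    """Device-side logic: the schedule lives on the thermostat itself."""

    def __init__(self, schedule):
        # schedule: {start_hour: target_temp_f}, stored on the device
        self.schedule = dict(schedule)

    def sync(self, fetch_from_cloud):
        """Try to pull a newer schedule; keep the last-known one on failure."""
        try:
            self.schedule = dict(fetch_from_cloud())
            return True
        except ConnectionError:
            return False

    def target_for(self, hour):
        """Pick the latest schedule entry that has already started."""
        entries = sorted(self.schedule.items())
        target = entries[-1][1]    # before the first entry, last night's setting applies
        for start_hour, temp in entries:
            if start_hour <= hour:
                target = temp
        return target
```

Notice that `target_for` never touches the network. The cloud can only improve the schedule through `sync`; it can never be the reason the house goes cold.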

Disaster 2: Bricked by update

Prevention: Two-slot firmware — the thermostat stores two copies of its software. An update writes to the backup slot. If the update fails or the device doesn't boot correctly, it automatically reverts to the previous working version.
Rollout: Never update all devices at once. Update 1% → verify → 10% → verify → 100%. This limits blast radius.
Minimum function: Even if all software fails, the hardware should maintain a safe default temperature (60°F) to prevent pipe freezing. This is a hardware fallback, not a software feature.
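
Two-slot firmware and a staged rollout can be sketched together. Everything here is hypothetical: the `Device` class, the `boots_ok` callback standing in for a real boot check, and the rule that a single failed device halts the rollout.

```python
class Device:
    """Hypothetical thermostat with two firmware slots."""

    def __init__(self, version):
        self.slots = {"A": version, "B": None}
        self.active = "A"

    @property
    def running_version(self):
        return self.slots[self.active]

    def apply_update(self, new_version, boots_ok):
        backup = "B" if self.active == "A" else "A"
        self.slots[backup] = new_version     # never overwrite the running slot
        if boots_ok(new_version):
            self.active = backup
            return True
        return False                         # stay on the previous working version

def staged_rollout(devices, new_version, boots_ok, stages=(1, 10, 100)):
    """Update a growing percentage of the fleet; halt at the first failing stage."""
    done = 0
    for percent in stages:
        target = max(1, len(devices) * percent // 100)
        batch = devices[done:target]
        results = [d.apply_update(new_version, boots_ok) for d in batch]
        done = max(done, target)
        if not all(results):
            return done     # blast radius: only the devices touched so far
    return done
```

With a fleet of 200 and a bad update, only the first 2 devices ever attempt it, and both fall back to their working slot.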

Disaster 3: Security breach

Prevention: Authentication for every API call. Rate limiting on temperature changes. Maximum temperature bound (can't set above 90°F or below 50°F) enforced on the device, not just in the app.
Detection: Alert if temperature is set outside normal range, or if settings change more than 5 times in an hour.
Physical limit: The device has a physical maximum temperature that the software cannot override. Even a fully compromised cloud can't heat a house to a dangerous temperature.
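
A sketch of the device-side checks, assuming the 50°F to 90°F bounds from the prevention rule. Reusing "5 changes per hour" as the rate limit is an assumption borrowed from the detection rule, and `SetpointGuard` and the starting setpoint of 68°F are hypothetical.

```python
import time

MIN_SETPOINT_F = 50
MAX_SETPOINT_F = 90
MAX_CHANGES_PER_HOUR = 5   # assumed limit, taken from the detection threshold

class SetpointGuard:
    """Runs on the thermostat, so a compromised app or cloud cannot bypass it."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.change_times = []
        self.setpoint = 68

    def request(self, temp_f):
        now = self.clock()
        # Keep only the changes made within the last hour
        self.change_times = [t for t in self.change_times if now - t < 3600]
        if len(self.change_times) >= MAX_CHANGES_PER_HOUR:
            return "rejected: too many changes"
        self.change_times.append(now)
        self.setpoint = min(MAX_SETPOINT_F, max(MIN_SETPOINT_F, temp_f))
        return "clamped" if self.setpoint != temp_f else "ok"
```

A request for 95°F is not refused outright; it is clamped to 90°F. A "clamped" or "rejected" result is also a natural place to raise the security alert.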

The Pre-Mortem Toolkit

Questions to Ask for Any System

Question	What It Reveals
What happens when the network goes away?	Dependency on connectivity
What happens when traffic is 10x normal?	Scaling limits
What happens when a database migration fails?	Recovery procedures
What happens when a third-party service changes without notice?	Integration fragility
What happens when two users do the same thing at the same time?	Concurrency issues
What happens when the clock is wrong on one server?	Timing assumptions
What happens when someone deliberately tries to break it?	Security posture
What happens when data from 2019 meets code from 2024?	Data compatibility
What happens when the person who built this leaves the team?	Knowledge concentration
What happens when we have 100x the current data volume?	Storage and performance limits

The Risk Matrix

Plot your pre-mortem findings:

                 Low Impact         High Impact
              ┌───────────────┬─────────────────────┐
  Likely      │   Monitor     │  MUST ADDRESS       │
              │               │  (Design for this)  │
              ├───────────────┼─────────────────────┤
  Unlikely    │   Accept      │  Plan response      │
              │   (Log it)    │  (Have a playbook)  │
              └───────────────┴─────────────────────┘

Likely + High Impact: Must be addressed in the design. Not optional.
Likely + Low Impact: Monitor and fix when convenient.
Unlikely + High Impact: Have a response plan. You don't have to prevent it, but you must know what to do if it happens.
Unlikely + Low Impact: Accept the risk. Log it and move on.
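
The matrix can be written down as a small lookup. Where to draw the lines is a judgment call: this sketch assumes "Almost certain" and "Probable" count as likely, and "Critical" and "High" count as high impact. Your team may draw them differently.

```python
# Assumed thresholds for collapsing the four-level scales into the 2x2 matrix
LIKELY = {"Almost certain", "Probable"}
HIGH_IMPACT = {"Critical", "High"}

def action_for(likelihood, severity):
    """Map one pre-mortem finding to its quadrant of the risk matrix."""
    likely = likelihood in LIKELY
    high_impact = severity in HIGH_IMPACT
    if likely and high_impact:
        return "Must address"
    if likely:
        return "Monitor"
    if high_impact:
        return "Plan response"
    return "Accept"

# Two rows from the thermostat pre-mortem
print(action_for("Almost certain", "Critical"))   # internet dependency
print(action_for("Probable", "Medium"))           # conflicting schedules
```

Under these thresholds the internet dependency lands in "Must address" and the conflicting schedules land in "Monitor".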

Summary: Why Pre-Mortems Work

They give permission to be negative. In planning, people avoid bringing up problems (it feels like criticizing the plan). In a pre-mortem, finding problems is the goal.
They surface assumptions. "The internet will always be available" is an assumption that goes unexamined during design but looks obviously fragile once you imagine the failure.
They connect to everything else in this curriculum:
- Data lifecycle → "Where is data at risk of loss or corruption?"
- Boundaries → "What's the blast radius if this module fails?"
- Contracts → "Which error cases are missing from our contracts?"
- Decomposition → "Which dependencies create single points of failure?"
They're cheap. A pre-mortem takes an hour. Recovering from a disaster you could have prevented takes weeks.

Failure Modes and Debugging — Test Your Understanding

Answer each question by showing your reasoning process. The goal is structured, systematic thinking — not lucky guesses.

Section A: Diagnose the Problem

Question 1

Symptom: An online store's product pages load correctly, but every product shows "In Stock" even though several products are sold out.

Using the five-step debugging framework:

State the precise symptom
Hypothesize what might have changed
Describe how you would bisect the problem (where would you check first?)
Form two different hypotheses for the cause
For each hypothesis, describe what evidence would confirm or disprove it

Question 2

Symptom: Users report that emails from the system arrive late — sometimes hours after the action that triggered them. The system was working fine until last week.

You know the email flow:

User action triggers an event
Event is placed in a queue
A background worker reads the queue and sends emails
Email is sent via an external email service

Using bisection, walk through how you would isolate whether the delay is in step 1, 2, 3, or 4. What specific thing would you check at each stage?

Question 3

Symptom: A banking app shows a customer's balance as negative $500, but the customer insists they have not made any large purchases. Looking at the transaction list, all transactions appear normal and small.

This is a data integrity issue. Trace the lifecycle backward:

Where is the balance displayed?
Where is it calculated?
What data feeds the calculation?
What could cause the calculation to produce a wrong result?

List at least four distinct hypotheses, each targeting a different part of the data lifecycle.

Section B: Design for Failure

Question 4

You are designing a system that processes online job applications. The flow:

Applicant fills out a form with personal info and uploads a resume
System validates the form data
Resume is stored
Application record is created in the database
Hiring manager is notified via email
Applicant receives a confirmation email

For each step, list:

What can fail
The severity (critical/high/medium/low)
The appropriate response strategy (prevent/retry/fallback/degrade/alert/fail fast)
What should be logged for debugging

Question 5

A ride-sharing app has these dependencies:

GPS service (for driver location)
Payment processor (for billing)
Map routing service (for directions)
Push notification service (for alerts)

For each dependency, answer:

What happens if it goes down for 30 seconds?
What happens if it goes down for 30 minutes?
What happens if it starts returning wrong data instead of errors?
What should the app do in each case?

Pay special attention to the third question — silent wrong data is the most dangerous failure mode.

Question 6

Perform a pre-mortem for the following system:

A school lunch ordering system where parents pre-order meals for their children through a website. The kitchen prepares meals based on the orders. Children pick up their meal at lunch using their student ID.

Imagine it's been running for three months and something has gone terribly wrong. Write five realistic failure scenarios. For each one:

What went wrong
Why it wasn't caught earlier
What design decision would have prevented it

Section C: Failure Reasoning

Question 7

You have a system with three modules in sequence:

Module A → Module B → Module C → Output

The output is wrong. You check Module A's output — it's correct. You check Module C's output — it's wrong.

Can you conclude the bug is in Module B or Module C? Why or why not? What else do you need to check? Describe the precise reasoning.

Question 8

An engineer says: "I added retry logic everywhere, so our system handles failures well."

Explain at least three scenarios where retrying makes the problem worse instead of better. For each scenario, describe what should be done instead.

Question 9

Two failure strategies are proposed for a checkout system when the payment service is down:

Strategy A: Show the user an error: "Payment service unavailable. Please try again in a few minutes."

Strategy B: Accept the order, save it with status "payment pending," and charge the user when the payment service comes back.

Analyze both strategies. What are the risks of each? Under what circumstances is A better? Under what circumstances is B better? What failure modes does B introduce that A doesn't have?

Section D: The Full Picture

Question 10

This is an integration exercise. You have studied all five pillars. Now apply them all:

Scenario: A hospital system manages patient appointments. Patients book appointments online, doctors see their schedule on a dashboard, and the system sends text message reminders 24 hours before each appointment.

A doctor reports: "My 2pm patient said they never received a reminder, and two of my morning patients received reminders for the wrong date."

Using everything you've learned:

Data Lifecycle: Trace the data from appointment creation to reminder delivery
Boundaries: Identify which module(s) are likely involved in the failure
Contracts: Identify what contract might be violated
Decomposition: Break the problem into investigatable pieces
Failure Mode: Categorize the failure type, form hypotheses, and describe how you would bisect to find the root cause

Question 11

Design a comprehensive failure handling plan for a simple feature: "User changes their password."

The flow: user enters current password and new password → system verifies current password → system validates new password meets requirements → system updates the stored password → user receives email confirming the change.

For this feature:

List every failure mode at every step
Categorize each by type (input, logic, integration, resource, dependency, timing)
Define the response for each
Identify the single most dangerous failure mode and explain why
Describe what logging would be needed to diagnose any failure in this flow without being able to reproduce it

Question 12

The final question. Reflect on this statement:

"A system that has never failed is more dangerous than a system that fails regularly."

Using concepts from all five pillars, explain why this might be true. Consider: untested failure paths, false confidence, unknown data lifecycle gaps, unchecked boundary assumptions, and unvalidated contracts. Give a concrete example to support your argument.

Grading Rubric

Criteria	What It Means
Systematic process	Followed a structured approach — not random guessing. Steps are traceable and logical.
Precise symptoms	Problems are stated specifically, not vaguely. "Shows $0" not "is broken."
Multiple hypotheses	More than one possible cause is considered before committing to a diagnosis
Evidence-based reasoning	Each hypothesis has a way to test it. Decisions are based on evidence, not assumptions.
Failure design completeness	All failure modes are considered, not just the obvious ones. Silent failures and wrong-data failures are addressed, not just crashes.
Cross-pillar integration	Answers draw on data lifecycle, boundaries, contracts, and decomposition — not just debugging techniques in isolation