Systems Thinking for Engineers

What This Course Is

This is not a programming course. You will not start by learning a language. You will not memorize syntax. You will not write "Hello, World."

Instead, you will learn how to think about systems — the same mental framework that separates a senior engineer with 20 years of experience from someone who just learned to code.

The tools have changed. An LLM can write a function faster than you can type it. But an LLM cannot:

  • Decide what the system should look like
  • Know where the boundaries belong
  • Understand why one design fails under load and another doesn't
  • Diagnose a problem it has never seen by reasoning from first principles

That is your job. This course teaches you to do it.

Who This Is For

Anyone entering software engineering — whether you have never written a line of code or you have written thousands but never understood why the code is organized the way it is.

If you can answer "what should I build and how should the pieces fit together?" then telling the machine how to build it is the easy part.

The Five Pillars

Each section of this course follows the same structure:

  1. Why — The reasoning behind the concept. Why does this matter? What goes wrong without it?
  2. How — The framework for applying it. Diagrams, patterns, and worked examples.
  3. Test — Exercises that prove you understand. No code. Just thinking.

The pillars build on each other:

# | Pillar | Core Question
1 | Data Lifecycle | Where does data live, what changes it, and how does it move?
2 | Boundaries | What is a "thing"? Where does one piece end and another begin?
3 | Contracts | What goes in, what comes out, and what can go wrong?
4 | Decomposition | How do you break a big problem into small, solvable pieces?
5 | Failure Modes | When something breaks, how do you reason about why?

How to Use This Material

Read the Why first. Don't skip it. The temptation will be to jump to "how do I do this?" but the reasoning is the entire point. If you can explain why something is the way it is, the how follows naturally.

Then work through the How — not by memorizing, but by trying to apply each concept to something you already understand (a restaurant, a library, an airport — real-world systems are software systems without the computer).

Finally, do the Test sections honestly. If you can't answer confidently, go back. There is no penalty for re-reading. There is a massive penalty in the real world for building on a shaky foundation.

Let's begin.

Data Lifecycle — Why It Matters

The Single Most Important Idea

Every piece of software that has ever existed does exactly three things with data:

  1. Stores it
  2. Transforms it
  3. Transports it

That's it. Every app, every service, every script, every billion-dollar platform — strip away the UI, the branding, the complexity — and you are looking at data being stored somewhere, changed into something else, and moved from one place to another.

If you understand this, you can look at any system and immediately start reasoning about what it does. If you don't understand this, you will forever be lost in the details.

Why This Comes Before Everything Else

Most courses start with "here is a variable, here is a loop." That's like teaching someone to drive by explaining how a piston works. It's not wrong, but it's the wrong starting point.

When a senior engineer looks at a new system, they don't think about variables. They think:

  • "Where is the data coming from?" — a user typing? a file on disk? another system calling in?
  • "What happens to it?" — is it validated? calculated? reformatted? combined with other data?
  • "Where does it end up?" — saved to a database? shown on a screen? sent to another system?

This is the data lifecycle, and it is the foundation of every design decision in engineering.

What Goes Wrong Without This Mental Model

You build the wrong thing

A stakeholder says "we need a dashboard." Without lifecycle thinking, you start designing a screen. With lifecycle thinking, you ask: what data feeds this dashboard? Where does that data come from? How fresh does it need to be? The answers to those questions determine most of the work; the screen is the easy part.

You can't find bugs

Something is broken. Users are seeing stale data. Without lifecycle thinking, you stare at code and guess. With lifecycle thinking, you trace the path: the data is stored in a cache, transformed when the page loads, and transported from the API. You check each stage in turn. The API returns fresh data and the page transform is correct, so the cache is the problem. You narrowed it from "something is broken" to "the cache isn't invalidating" in a few minutes.

You can't explain your system to anyone

"It's a web app that does stuff" is not an explanation. "User input is validated and stored, background jobs transform it into reports, and an API transports those reports to the client" — that's an explanation. Anyone can understand that, technical or not.

The Three Stages, Concretely

Storage

Data at rest. It exists somewhere and is not currently being changed.

  • A row in a database
  • A file on disk
  • A value held in memory
  • A message sitting in a queue, waiting
  • A cookie in a browser
  • A configuration file on a server

The key questions about storage:

  • How long does it live? (forever? until the user closes the tab? five minutes?)
  • Who can access it? (just this program? any program? the user?)
  • What happens if it disappears? (catastrophic? inconvenient? nobody notices?)

Transform

Data being changed from one form or value to another.

  • Validating an email address (raw text → confirmed-valid text)
  • Calculating a total (line items → sum)
  • Compressing an image (large file → smaller file)
  • Sorting a list (unordered → ordered)
  • Joining data from two sources (customer + orders → customer-with-orders)

The key questions about transforms:

  • What goes in? (what shape? what constraints?)
  • What comes out? (what shape? what guarantees?)
  • Can it fail? (what happens if the input is garbage? unhandled bad input is one of the most common sources of bugs)
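
You don't need code to answer these questions, but it can help to see them written down as a function. The sketch below is one possible shape for the email example, not a recommended validator: the pattern is deliberately loose, and the names `validate_email` and `ValidationError` are made up for illustration.

```python
import re

# Hypothetical, deliberately loose rule: something@something.tld
_EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


class ValidationError(ValueError):
    """Raised when the input cannot be turned into a valid output."""


def validate_email(raw: str) -> str:
    """Transform: raw text -> confirmed-valid text.

    What goes in:  any string, possibly with stray whitespace.
    What comes out: a trimmed string that matches the pattern.
    Can it fail:   yes, and the failure is part of the contract.
    """
    candidate = raw.strip()
    if not _EMAIL_PATTERN.match(candidate):
        raise ValidationError(f"not a valid email address: {raw!r}")
    return candidate


print(validate_email("  ada@example.com "))
try:
    validate_email("garbage")
except ValidationError as err:
    print("rejected:", err)
```

The point is the docstring, not the regular expression: input shape, output guarantee, and failure behavior are all stated before any logic runs.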

Transport

Data moving from one location to another.

  • A user submitting a form (browser → server)
  • An API call between services (service A → service B)
  • Reading from a database (database → application)
  • Displaying a result on screen (application → user's eyes)
  • Sending an email (system → inbox)

The key questions about transport:

  • How fast does it need to get there? (instantly? eventually? batch every hour?)
  • What happens if it doesn't arrive? (retry? alert? silent failure?)
  • How much data is moving? (one record? millions?)

The Mental Shift

Stop thinking about software as "code that does things."

Start thinking about it as data that flows through stages: it arrives from somewhere, it gets stored, it gets transformed, it gets transported to the next stage, and the cycle continues.

When someone describes a feature, your first instinct should be: "What data? Where does it start? What happens to it? Where does it end up?"

This is how experienced engineers think. Not because someone taught them — because after years of debugging, designing, and rebuilding, this is the pattern that always held true.

Now you know it on day one.

Data Lifecycle — How: The Method

Lifecycle Mapping

The core skill is lifecycle mapping — taking any system or feature and tracing what happens to the data from birth to death. You don't need code for this. You need a diagram and the right questions.

The Three-Column Method

For any system, draw three columns:

Storage | Transform | Transport
Where data rests | How data changes | How data moves

Then fill them in for the system you're analyzing. Every piece of data in the system should appear in at least one column. Most data will touch all three.

How to Map a System

Follow these steps in order:

Step 1: Identify the Data

Before mapping anything, list every piece of data the system touches. Don't worry about how it works yet — just name the data.

Ask:

  • What does the user provide?
  • What does the system store?
  • What does the system calculate or derive?
  • What does the system output or display?
  • What does the system exchange with other systems?

Step 2: Trace Each Piece Through Its Lifecycle

For each piece of data, follow it from birth to death:

  • Where is it born? (user types it, another system sends it, it's calculated from other data)
  • Where does it live? (memory, database, file, cache, queue)
  • What changes it? (validation, calculation, formatting, enrichment)
  • Where does it travel? (screen, API, email, another module)
  • Where does it die? (deleted, archived, expired, overwritten)

Step 3: Build the Map

Use either a table or a flow diagram.

Table format — best for initial analysis:

Stage | What Happens | Category
(describe each step) | (what specifically occurs) | Storage / Transform / Transport
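
If you prefer to keep the table in a file rather than on paper, it can be represented as plain data. This is only a sketch: the `Step` type, the `three_columns` helper, and the contact-form example are all invented here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

CATEGORIES = ("Storage", "Transform", "Transport")


@dataclass(frozen=True)
class Step:
    stage: str          # short name of the step
    what_happens: str   # what specifically occurs
    category: str       # one of CATEGORIES


def three_columns(steps):
    """Group stage names under Storage / Transform / Transport."""
    columns = defaultdict(list)
    for step in steps:
        if step.category not in CATEGORIES:
            raise ValueError(f"unlabeled step: {step.stage}")
        columns[step.category].append(step.stage)
    return {name: columns[name] for name in CATEGORIES}


# Hypothetical example: a contact form.
contact_form = [
    Step("Form submitted", "Browser sends fields to server", "Transport"),
    Step("Fields validated", "Server checks required fields", "Transform"),
    Step("Message saved", "Row written to database", "Storage"),
]

for name, stages in three_columns(contact_form).items():
    print(f"{name}: {stages}")
```

The same list of steps gives you both views: read it top to bottom for the table, or group it by category for the three columns.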

Flow diagram — best for communicating with others:

  • Boxes = storage (data at rest)
  • Arrows = transport (data in motion)
  • Labels on arrows or diamonds = transforms (data being changed)

┌───────────┐              ┌───────────┐              ┌───────────┐
│  Source   │ ───────────► │  Process  │ ───────────► │   Dest    │
│ (storage) │  transport   │(transform)│  transport   │ (storage) │
└───────────┘              └───────────┘              └───────────┘

Step 4: Find the Hidden Data

The most common mistake is forgetting data that isn't obvious. Every system has hidden data. Look for these specifically:

Metadata — data about data. When was this created? Who created it? What version? How many times has it been accessed? Metadata is critical for debugging and auditing, and it's almost always overlooked.

State — the current condition of something. Is this order pending, paid, in progress, or complete? Is this account active or suspended? State is data, and managing state transitions is where most bugs live.

Configuration — data that controls how the system behaves. Tax rates, store hours, feature flags, allowed file types, maximum limits. Configuration is storage that affects transforms.

Logs — a record of what happened. Every transport and transform should produce a log entry. When things break — and they will — logs are how you reconstruct what happened.

Derived data — data that is calculated from other data. A running total, a user's "membership level," an average rating. This data doesn't come from outside — it's created internally through transforms.
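
To make the five kinds of hidden data concrete, here is a small sketch of a customer account record with each kind labeled. Everything in it is an assumption made for illustration: the `Account` type, the `GOLD_THRESHOLD` value, and the rule that membership level depends on order count.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

GOLD_THRESHOLD = 10   # configuration (hypothetical): orders needed for "gold"


@dataclass
class Account:
    email: str                                  # the obvious data
    order_count: int = 0
    status: str = "active"                      # state
    created_at: datetime = field(               # metadata
        default_factory=lambda: datetime.now(timezone.utc))
    log: list = field(default_factory=list)     # logs

    @property
    def membership_level(self) -> str:          # derived data
        return "gold" if self.order_count >= GOLD_THRESHOLD else "standard"

    def suspend(self, reason: str) -> None:
        """A state transition that leaves a log entry behind."""
        self.log.append(f"{self.status} -> suspended: {reason}")
        self.status = "suspended"


account = Account(email="ada@example.com", order_count=12)
account.suspend("chargeback")
print(account.status, account.membership_level, account.log)
```

Only `email` and `order_count` would show up in a first-pass list of "the data." The other four fields are the ones that get forgotten.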

The Lifecycle Question Checklist

When analyzing any system or feature, run through these questions:

Storage:

  • What data is stored?
  • Where is it stored? (and is it more than one place?)
  • How long does it persist? (seconds? days? forever?)
  • What happens if storage fails or data is lost?
  • Who can access it?
  • How much data accumulates over time?

Transform:

  • What transforms happen to the data?
  • In what order?
  • What can go wrong at each step?
  • Are transforms reversible? (Can you undo them?)
  • Are there transforms that happen on a schedule vs. on demand?

Transport:

  • Where does data move from and to?
  • How quickly must it move? (real-time? batch? eventually?)
  • What happens if transport fails? (retry? lose it? queue it?)
  • How much data moves at once? (one record? thousands?)
  • Is the transport secure? (does it need to be?)
  • Who initiates the transport — the sender or the receiver?

If you can answer all of these for a given system, you understand that system deeply enough to build it, debug it, or redesign it.

What a Good Lifecycle Map Looks Like

A complete lifecycle map has these properties:

  1. Every piece of data is accounted for — nothing appears from nowhere, nothing vanishes without explanation
  2. Every stage is labeled — you know whether each step is storage, transform, or transport
  3. Hidden data is included — metadata, state, configuration, and logs are on the map
  4. Failure points are visible — you can point to each stage and say "if this fails, here is what breaks"
  5. A stranger could follow it — someone who has never seen the system could read your map and understand the data flow
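
The first two properties can be checked mechanically if the map is written down as data. The sketch below is one way to do that under simple assumptions: each step declares what it reads and writes, and anything read must either come from outside the system or be written by an earlier step. The `Step` type and `find_problems` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    category: str                               # "Storage", "Transform" or "Transport"
    reads: set = field(default_factory=set)     # data this step needs
    writes: set = field(default_factory=set)    # data this step produces


def find_problems(steps, external_inputs):
    """Return a list of human-readable problems found in the map."""
    problems = []
    known = set(external_inputs)    # data that legitimately starts outside
    for step in steps:
        if step.category not in {"Storage", "Transform", "Transport"}:
            problems.append(f"{step.name}: stage is not labeled")
        for item in sorted(step.reads - known):
            problems.append(f"{step.name}: '{item}' appears from nowhere")
        known |= step.writes
    return problems


steps = [
    Step("Receive cart", "Transport", reads={"cart"}, writes={"raw order"}),
    Step("Price order", "Transform", reads={"raw order", "tax rate"},
         writes={"total"}),
]
print(find_problems(steps, external_inputs={"cart"}))
```

Here the check reports that `tax rate` appears from nowhere, which is exactly the kind of hidden configuration data a first draft of a map tends to miss. Properties 3 to 5 still need a human reader.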

The following sections present complete worked examples. Study them, then compare them to the test questions. The test will ask you to produce maps at this level of detail.

Data Lifecycle — Example: Coffee Shop Ordering

The Scenario

A customer orders coffee from a mobile app. They browse the menu, customize a drink, pay, and receive their drink at the store. The barista sees the order on a screen. The customer gets a push notification when the drink is ready.

This seems simple. Let's map it and see how much data is actually involved.


Step 1: Identify All the Data

Obvious data:

  • The customer's drink selection (item, size, modifications)
  • The customer's payment information
  • The store's menu
  • The order itself
  • The receipt

Less obvious data:

  • The store's current inventory (do they have oat milk today?)
  • Store hours and availability (is the store even open?)
  • The customer's account info (name, payment methods on file, order history)
  • Order status (pending → paid → in progress → ready → completed)
  • Timestamps (when was the order placed? when was it ready?)
  • The queue position (how many orders are ahead of this one?)
  • Estimated wait time
  • Push notification delivery status (did the notification reach the phone?)

That's at least 13 pieces of data for a single coffee order. Most people would have named 3 or 4.


Step 2: Full Lifecycle Map

# | Stage | What Happens | Category | Data Involved
1 | Customer opens app | App loads menu from server | Transport (server → app) | Menu, store hours, inventory flags
2 | Menu displayed | Menu data held in app memory | Storage (temporary) | Menu items, prices, available mods
3 | Customer browses | Selection exists in app memory as they tap | Storage (temporary) | Selected item, size, mods
4 | Customer taps "Add to Cart" | Selection formatted into cart item | Transform | Selection → structured cart item
5 | Cart displayed | Cart data held in app memory | Storage (temporary) | Cart items, running subtotal
6 | Customer taps "Place Order" | Cart data sent to server | Transport (app → server) | Cart items, customer ID, store ID
7 | Server validates order | Each item checked: real menu item? Size valid? Mod available? Store open? | Transform | Raw order → validated order or error
8 | Inventory checked | Server verifies items are in stock | Transform (comparison) | Order items vs. current inventory
9 | Price calculated | Server computes subtotal, tax, total | Transform | Items + prices + tax rate → total
10 | Order summary sent to app | Calculated total returned for confirmation | Transport (server → app) | Total, itemized breakdown
11 | Customer confirms and pays | Payment info sent to server | Transport (app → server) | Payment method token, total amount
12 | Payment forwarded | Server sends charge to payment provider | Transport (server → payment service) | Amount, payment token, merchant ID
13 | Payment processed | Payment provider validates and charges | Transform (external) | Charge request → approval or decline
14 | Payment result returned | Approval/decline sent back to server | Transport (payment service → server) | Transaction ID, status, timestamp
15 | Order status updated | Status changed from "pending" to "paid" | Transform (state change) | Order status field
16 | Order saved | Full order record written to database | Storage (persistent) | All order fields, customer ID, timestamps
17 | Order sent to store | Order details sent to barista screen | Transport (server → store display) | Items, mods, customer name, order #
18 | Queue position calculated | Server computes estimated wait time | Transform | Orders ahead + avg prep time → estimate
19 | Estimate sent to customer | Wait time pushed to app | Transport (server → app) | Estimated minutes
20 | Barista prepares drink | (Physical process, not data, but status is tracked) | |
21 | Barista marks "ready" | Taps button on store display | Transport (store display → server) | Order ID, "ready" status
22 | Order status updated | Status changed from "in progress" to "ready" | Transform (state change) | Order status field, timestamp
23 | Push notification sent | Server sends notification to customer phone | Transport (server → notification service → phone) | Customer device token, message text
24 | Notification delivery logged | Whether the notification was delivered or failed | Storage (persistent) | Delivery status, timestamp, device info
25 | Customer picks up drink | Barista marks "completed" | Transform (state change) | Order status → "completed," pickup timestamp
26 | Receipt generated | Order data formatted into receipt | Transform | Order data → receipt format
27 | Receipt stored | Saved to customer's order history | Storage (persistent) | Receipt data, linked to customer account
28 | Analytics logged | Order data aggregated into business metrics | Transport (server → analytics) | Order total, items, time to fulfill, store ID
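
Stage 9 is a good example of a transform whose inputs come from three different places: the order, the menu, and configuration. A minimal sketch follows, assuming a made-up menu, a made-up tax rate, and half-up rounding to the cent; a real store's prices, rate, and rounding rule would come from storage and from local tax law.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical menu and tax rate; real values would be loaded, not hard-coded.
MENU_PRICES = {"latte": Decimal("4.50"), "oat milk": Decimal("0.70")}
TAX_RATE = Decimal("0.0825")


def price_order(item_names):
    """Transform: items + prices + tax rate -> subtotal, tax, total."""
    unknown = [name for name in item_names if name not in MENU_PRICES]
    if unknown:
        raise ValueError(f"not on the menu: {unknown}")
    subtotal = sum((MENU_PRICES[name] for name in item_names), Decimal("0"))
    tax = (subtotal * TAX_RATE).quantize(Decimal("0.01"),
                                         rounding=ROUND_HALF_UP)
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}


print(price_order(["latte", "oat milk"]))
```

The sketch uses `Decimal` instead of floating-point numbers so that cents stay exact. Note that it calculates on the server from the server's own prices, which is why stage 10 sends the total back to the app for confirmation instead of trusting the app's running subtotal.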

Step 3: Flow Diagram

┌──────────┐   load menu    ┌──────────┐     fetch      ┌──────────┐
│ Customer │◄───────────────│  Server  │───────────────►│ Database │
│   App    │                │          │                │ (menu,   │
│          │───────────────►│          │                │inventory)│
│          │  place order   │          │                └──────────┘
└──────────┘                │          │
     ▲                      │          │  charge card   ┌──────────┐
     │ push notification    │          │───────────────►│ Payment  │
     │                      │          │◄───────────────│ Provider │
     │                      │          │  confirmation  └──────────┘
     │                      │          │
     │                      │          │   send order   ┌──────────┐
     │                      │          │───────────────►│  Store   │
     │                      │          │◄───────────────│ Display  │
     │                      │          │   mark ready   └──────────┘
     │                      │          │
     │                      │          │   log events   ┌──────────┐
     │                      │          │───────────────►│Analytics │
     │                      └──────────┘                └──────────┘
     │                           │
     │                           │ send notification
     │                      ┌──────────┐
     └──────────────────────│   Push   │
                            │ Service  │
                            └──────────┘

Step 4: Hidden Data Analysis

Let's call out the hidden data specifically:

Hidden Data | Type | Where It Lives | Why It Matters
Timestamps on every action | Metadata | Order record in database | Debugging, performance monitoring, customer disputes
Customer's device token | Configuration | Customer account record | Required for push notifications to work
Tax rate for the store's location | Configuration | Server config or database | Affects price calculation; changes by jurisdiction
Payment provider's transaction ID | Metadata | Order record | Required for refunds, fraud investigation, accounting
Menu version | Metadata | App cache + server | If menu prices changed between browsing and ordering, which price applies?
Notification delivery status | State | Notification log | If customer didn't get notified, was it our fault or theirs?
Inventory snapshot at time of order | Derived | Not typically stored (and maybe it should be) | If a customer says "it showed oat milk was available," can you prove whether it was?

That last row is interesting — it's data that doesn't exist in most systems but should, which is something you'd only discover by doing this analysis.


Step 5: What Could Go Wrong (by Lifecycle Stage)

Stage | What Could Fail | Consequence
Transport: Load menu | App can't reach server (no internet) | Customer can't browse. Show cached menu? Or error?
Storage: Menu cache | Cached menu is stale (prices changed) | Customer sees old price, gets charged new price. Angry customer.
Transform: Validate order | Item was removed from menu after customer selected it | Order rejected. Good UX: explain what happened. Bad UX: generic error.
Transport: Payment | Payment service is down or slow | Customer is stuck. Don't show "order confirmed" until payment is confirmed.
Transform: Payment | Card declined | Tell the customer clearly. Don't store the attempt as a completed order.
Transport: Send to store | Store display is offline | Order is paid but barista never sees it. Customer waits forever. Critical failure.
Transform: State change | Barista forgets to tap "ready" | Customer never gets notified. Drink gets cold. Operational failure (not a system bug, but the system should account for it: timeout alert?).
Transport: Push notification | Notification service fails silently | Customer doesn't know drink is ready. System should have fallback (in-app polling?).
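
The last row suggests a fallback when the notification transport fails. One way that could look is sketched below. The function names, the three-attempt limit, and the use of `ConnectionError` to signal a failed send are all assumptions; a real notification service would have its own client and its own error types.

```python
import time


def send_with_retry(send, attempts=3, base_delay=0.0):
    """Call send() until it succeeds or attempts run out.

    Returns True on success, False if every attempt failed.
    base_delay is 0 here so the demo runs instantly; a real
    system would wait longer between tries.
    """
    for attempt in range(attempts):
        try:
            send()
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # back off, then retry
    return False


def notify_customer(send_push, mark_for_polling):
    """Try the push channel; fall back to in-app polling if it fails."""
    if send_with_retry(send_push):
        return "push"
    mark_for_polling()
    return "polling"


# Hypothetical stand-in for a notification service that is down.
def broken_push():
    raise ConnectionError("notification service unreachable")


pending_for_app = []
print(notify_customer(broken_push, lambda: pending_for_app.append("order ready")))
```

The design choice worth noticing is that the failure is never silent: the function always returns which channel carried the message, so the delivery log in stage 24 has something to record.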

Compare and Contrast: What This Example Teaches

This example demonstrates:

  1. A "simple" system has 27+ data stages — complexity hides in the details
  2. Multiple external services (payment provider, notification service, store display) each introduce transport risks
  3. State management is critical — the order status flows through 5 states, and getting out of sync at any point creates visible bugs
  4. Hidden data (metadata, config, logs) is as important as the obvious data — timestamps, tax rates, and delivery receipts make or break the system's debuggability
  5. Failure at any stage has different consequences — some failures are cosmetic, some lose money, some lose customers
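
Point 3 is easier to enforce when the allowed status changes are written down in one place. The sketch below uses the status names from the lifecycle map ("pending", "paid", "in progress", "ready", "completed"). The `TRANSITIONS` table and `advance` function are illustrative, and cancellations or refunds are left out to keep it short.

```python
# Allowed transitions between the five order states.
TRANSITIONS = {
    "pending": {"paid"},
    "paid": {"in progress"},
    "in progress": {"ready"},
    "ready": {"completed"},
    "completed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the jump is not allowed."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


status = "pending"
for next_status in ["paid", "in progress", "ready", "completed"]:
    status = advance(status, next_status)
print(status)

try:
    advance("pending", "ready")   # skipping payment is rejected
except ValueError as err:
    print(err)
```

With a table like this, an order cannot become "ready" without having been "paid" first, which rules out a whole class of out-of-sync bugs.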

When you encounter the test questions, map them at this level of detail. If your lifecycle map has 5 stages, you probably missed 20.

Data Lifecycle — Example: Bank ATM Withdrawal

The Scenario

A customer walks up to an ATM, inserts their card, enters their PIN, requests $200 cash, and walks away with the money and a receipt.

This is one of the most data-sensitive operations in everyday life. Every piece of data must be exact. There is zero tolerance for error — if the system says the customer has $500, they must have exactly $500. Not $499.99, not $500.01.

Let's map it.


Step 1: Identify All the Data

Obvious data:

  • Card data (account number, card info)
  • PIN
  • Withdrawal amount ($200)
  • Account balance
  • Cash dispensed
  • Receipt

Less obvious data:

  • ATM's physical cash inventory (does it have enough $20 bills?)
  • Daily withdrawal limit for this account
  • How much the customer has already withdrawn today
  • ATM network session (the encrypted connection between ATM and bank)
  • Transaction authorization code
  • ATM location and ID
  • Timestamp of the transaction
  • Account hold/freeze status
  • Currency denomination preferences (does the ATM give $20s, $50s, or $100s?)
  • ATM's own status (is the receipt printer working? is the cash drawer jammed?)
  • Fraud detection signals (is this card being used in a different country than it was 5 minutes ago?)

Step 2: Full Lifecycle Map

# | Stage | What Happens | Category | Key Details
1 | Card inserted | ATM reads magnetic stripe or chip data | Transport (card → ATM) | Card number, expiry, bank identifier extracted
2 | Card data stored temporarily | ATM holds card data in encrypted memory | Storage (temporary, encrypted) | Never written to disk. Held only for this session.
3 | ATM connects to bank network | Encrypted session established | Transport (ATM → bank) | Session ID, ATM ID, ATM location transmitted
4 | Card data sent for validation | ATM sends card info to bank | Transport (ATM → bank) | Card number + bank identifier
5 | Bank validates card | Is this a real card? Is it active? Is it reported stolen? Expired? | Transform (validation) | Card status checked against bank records
6 | Card status returned | Bank sends result back to ATM | Transport (bank → ATM) | "Valid" or specific rejection reason
7 | PIN prompt displayed | ATM asks customer for PIN | Transport (ATM → customer screen) | No data in transit, just a UI prompt
8 | Customer enters PIN | Keypad captures digits | Transport (keypad → ATM memory) | PIN stored encrypted, never displayed on screen
9 | PIN sent for verification | Encrypted PIN sent to bank | Transport (ATM → bank) | PIN is encrypted end-to-end. ATM never knows the real PIN.
10 | Bank verifies PIN | Entered PIN compared to stored PIN hash | Transform (comparison) | Bank doesn't store plaintext PINs either; it compares hashes
11 | PIN result returned | Bank sends verification result | Transport (bank → ATM) | "Correct" or "incorrect" + remaining attempts count
12 | Customer selects "Withdrawal" | Selection captured | Storage (temporary) | Transaction type stored in session
13 | Customer enters $200 | Amount captured | Storage (temporary) | Amount stored in session
14 | ATM checks local cash | Does ATM have enough cash to dispense? | Transform (comparison) | $200 requested vs. ATM cash inventory
15 | Withdrawal request sent to bank | ATM sends full transaction request | Transport (ATM → bank) | Account, amount, ATM ID, timestamp
16 | Bank checks balance | Is available balance ≥ $200? | Transform (comparison) | Available balance (accounting for holds and pending transactions)
17 | Bank checks daily limit | Would this withdrawal exceed the daily withdrawal limit? | Transform (comparison) | Today's total withdrawals + $200 vs. limit
18 | Bank checks fraud signals | Is this transaction suspicious? | Transform (analysis) | Location, timing, amount patterns checked
19 | Bank authorizes (or denies) | All checks pass → generate authorization | Transform | Creates authorization code, places temporary hold on funds
20 | Funds held (not yet deducted) | Bank places a hold of $200 on the account | Storage (state change) | Available balance reduced by $200, but actual balance not yet changed
21 | Authorization sent to ATM | Bank sends approval + auth code | Transport (bank → ATM) | Authorization code, approved amount
22 | ATM dispenses cash | Physical mechanism releases bills | Transport (ATM cash drawer → customer) | $200 in $20 bills. ATM cash inventory reduced.
23 | Customer takes cash | ATM sensors detect cash was taken | Transform (state change) | Transaction status: "cash dispensed"
24 | Dispensing confirmed to bank | ATM tells bank: cash was taken successfully | Transport (ATM → bank) | Auth code + "dispensed" confirmation
25 | Bank finalizes transaction | Hold converted to actual deduction. Balance permanently reduced. | Transform (state change) | Actual balance reduced by $200 and the hold removed, so available balance stays at its already-reduced value. Transaction recorded.
26 | Transaction logged | Full record written to bank's transaction database | Storage (persistent, permanent) | Account, amount, ATM ID, location, timestamp, auth code
27 | Receipt printed | ATM formats and prints receipt | Transform + Transport | Transaction data → receipt format → printed paper
28 | Session ended | All temporary data in ATM memory cleared | Storage (destruction) | Card data, PIN, session data all wiped

Step 3: Critical Data Detail — The Two-Phase Commit

Notice stages 20 and 25. This is the most important concept in this example:

The bank does NOT deduct the money when it authorizes the transaction. It places a hold first.

Why? Because of the gap between stages 21 and 24. What if:

  • The bank authorizes the withdrawal but the ATM then jams and can't dispense cash?
  • The customer walks away without taking the money?
  • The network drops between authorization and dispensing confirmation?

If the bank had already deducted the money at stage 19, the customer would lose $200 they never received. Instead:

  1. Phase 1 (Hold): Money is reserved but not gone. If something fails, the hold is released and the customer's balance is restored.
  2. Phase 2 (Finalize): Only after the ATM confirms the cash was taken does the bank permanently deduct.

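This hold-then-finalize pattern is often described as a two-phase commit, by analogy with the database protocol of that name (in card payments the same idea is usually called an authorization hold). It exists specifically because transport between ATM and bank can fail at any point. The data lifecycle design accounts for the worst case.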
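
The two phases can be sketched in a few lines. This is a simplified model, not how any particular bank implements it: the `Account` class, its method names, and the in-memory dictionary of holds are assumptions made to show the idea.

```python
from decimal import Decimal


class Account:
    def __init__(self, balance: Decimal):
        self.balance = balance    # actual balance
        self.holds = {}           # authorization code -> held amount

    @property
    def available(self) -> Decimal:
        """Actual balance minus everything currently on hold."""
        return self.balance - sum(self.holds.values(), Decimal("0"))

    def place_hold(self, auth_code: str, amount: Decimal) -> None:
        """Phase 1: reserve the money without deducting it."""
        if amount > self.available:
            raise ValueError("insufficient funds")
        self.holds[auth_code] = amount

    def finalize(self, auth_code: str) -> None:
        """Phase 2: the ATM confirmed dispensing, so deduct for real."""
        self.balance -= self.holds.pop(auth_code)

    def release(self, auth_code: str) -> None:
        """Failure path: drop the hold; the actual balance never changed."""
        self.holds.pop(auth_code, None)


account = Account(Decimal("500"))
account.place_hold("A1", Decimal("200"))
print(account.balance, account.available)   # 500 300
account.release("A1")                       # cash drawer jammed
print(account.balance, account.available)   # 500 500
account.place_hold("A2", Decimal("200"))
account.finalize("A2")                      # cash was taken
print(account.balance, account.available)   # 300 300
```

Notice that the failure path needs no "refund" logic. Because the money was never deducted, releasing the hold is enough to restore the customer's available balance.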


Step 4: Hidden Data Analysis

Hidden Data | Type | Where It Lives | Why It Matters
PIN attempt counter | State | Bank's card record | After 3 wrong PINs, card is locked. Counter resets on success.
ATM cash inventory by denomination | State | ATM local storage | ATM must know exactly how many $20s, $50s, $100s it has. If inventory tracking is wrong, it could promise cash it can't deliver.
Daily withdrawal running total | Derived | Bank's transaction records | Calculated from today's transactions. Not stored as a single number; derived from the sum of today's withdrawals.
Transaction sequence number | Metadata | Both ATM and bank | Ensures no transaction is processed twice, even if network hiccups cause a retry.
ATM hardware status | State | ATM local diagnostics | Printer jammed? Card reader failing? Cash drawer low? These are all data that affect whether the ATM can complete a transaction.
Authorization expiry | Configuration | Bank rules | A hold might expire after 24 hours if not finalized. This prevents money from being locked indefinitely.
Fraud scoring signals | Derived | Bank's fraud detection system | Geographic velocity (was this card used 1000 miles away 10 minutes ago?), amount patterns, time-of-day patterns.
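The transaction sequence number is easy to overlook, so here is a small sketch of how it can prevent double processing. `BankEndpoint`, `authorize`, and the in-memory dictionary are hypothetical names chosen for this illustration, not a description of any real banking protocol.

```python
class BankEndpoint:
    """Hypothetical bank-side handler that ignores replayed messages."""

    def __init__(self, available_cents: int) -> None:
        self.available = available_cents
        # (atm_id, sequence number) -> the answer already given for that message
        self.processed: dict[tuple[str, int], str] = {}

    def authorize(self, atm_id: str, seq: int, amount: int) -> str:
        key = (atm_id, seq)
        if key in self.processed:
            # Same ATM, same sequence number: a retry, not a new withdrawal.
            return self.processed[key]
        if amount > self.available:
            result = "declined"
        else:
            self.available -= amount      # reserve the money (the hold)
            result = "approved"
        self.processed[key] = result
        return result


bank = BankEndpoint(available_cents=50_000)
bank.authorize("ATM-7", seq=101, amount=20_000)
bank.authorize("ATM-7", seq=101, amount=20_000)   # network retry of the same message
print(bank.available)  # 30000
```

The retry gets the same answer as the first attempt, and the money is reserved only once.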

Step 5: What Could Go Wrong

Failure Point | What Happens | Consequence | Correct Response
Network fails after PIN, before authorization | ATM can't reach bank | Transaction cannot proceed | Return card. Display "service unavailable." Don't guess.
Authorization granted, but ATM cash drawer jams | Cash can't be dispensed | Customer authorized but didn't receive money | ATM sends "dispensing failed" to bank. Bank releases hold. Customer balance restored.
Network fails after cash dispensed, before confirmation | ATM can't tell bank the cash was taken | Bank doesn't know to finalize | Bank's hold expires → money returns to account. But customer HAS the cash. Bank reconciles from ATM's local transaction log during next sync.
Power failure mid-transaction | Everything stops | Unclear state | ATM writes transaction state to non-volatile storage at each step. On reboot, it replays the state to determine where it stopped and what needs recovery.
Customer walks away without taking cash | Cash is hanging out of the machine | Security risk + accounting mismatch | ATM retracts cash after timeout (usually 30 seconds). Sends "cash retracted" to bank. Hold released.

Compare and Contrast With the Coffee Example

Aspect | Coffee Shop | ATM Withdrawal
Data sensitivity | Low-medium (order data, payment token) | Maximum (financial records, PINs)
Error tolerance | Medium (wrong order is bad, but fixable) | Zero (wrong balance is unacceptable)
Two-phase commit needed? | No (charge and order can be atomic) | Yes (must handle gap between authorization and dispensing)
Physical-digital boundary | Barista → digital status update | Cash drawer → sensors → digital confirmation
Failure cost | Bad customer experience | Financial loss or fraud
Hidden data volume | Moderate | Extensive (fraud signals, hardware status, attempt counters)

The key lesson: the lifecycle structure is the same (storage → transform → transport), but the stakes change everything about how carefully you map it. In the coffee app, missing a notification is annoying. In the ATM system, missing a transaction confirmation means someone loses money.

When you design a system, the first question after "what is the data lifecycle?" is: "What are the stakes when it fails?" The answer determines how much detail your map needs.

Data Lifecycle — Example: Social Media Photo Post

The Scenario

A user takes a photo on their phone, types a caption, adds a location tag, and posts it to a social media platform. Their followers see it in their feeds. Some like it. One person comments. The post appears in search results. A week later, the user checks how many views it got.

This example is interesting because the data fans out — one input (a photo) creates dozens of downstream data flows touching many different parts of the system.


Step 1: Identify All the Data

The obvious:

  • The photo file
  • The caption text
  • The location tag
  • Likes
  • Comments

The hidden:

Data | Why It Exists
Photo metadata (EXIF) | Camera embeds date, time, GPS coordinates, camera model, exposure settings into every photo file
Multiple photo sizes | The platform doesn't serve the original 12MB file; it creates thumbnail, medium, and full-size versions
The follow graph | The system needs to know who follows this user to build their feeds
Feed entries for every follower | Each follower's personalized feed needs an entry for this post
Notification records | Followers with notifications enabled need to be alerted
Search index entries | The caption and location need to be searchable
View count | Every time someone sees the post, it's counted
Engagement metrics | Likes, comments, shares, saves, each tracked separately
Content moderation signals | Automated scan for prohibited content, nudity detection, etc.
Ad relevance signals | The platform categorizes the post to match with advertisers
User activity timestamp | "Last active" and "posting frequency" updated
Privacy settings | Who can see this post? Public? Friends only? Custom list?

A single photo post touches 15+ data categories.


Step 2: Full Lifecycle Map

Phase 1: Upload and Ingest

# | Stage | What Happens | Category
1 | User taps "Post" | Photo file + caption + location sent to server | Transport (phone → server)
2 | Upload received | Raw data held in temporary upload storage | Storage (temporary)
3 | Input validated | File type check (is it actually an image?), file size check (under limit?), caption length check | Transform (validation)
4 | Content moderation scan | Automated analysis for prohibited content | Transform (analysis)
5 | EXIF data extracted | GPS, timestamp, camera info pulled from photo file | Transform (extraction)
6 | EXIF data compared to provided location | If user tagged "Paris" but EXIF says "Tokyo," flag for review | Transform (comparison)
7 | Photo resized | Original → thumbnail (150px), medium (600px), large (1200px) | Transform (image processing)
8 | Photos stored | All sizes stored in file storage (not the database; a separate file system) | Storage (persistent)
9 | EXIF stripped from public copies | GPS and camera data removed from versions served to viewers (privacy) | Transform (redaction)
10 | Post record created | Database record: post ID, user ID, caption, location, timestamp, photo URLs, privacy settings | Storage (persistent)

Phase 2: Distribution (Fan-Out)

# | Stage | What Happens | Category
11 | Follower list retrieved | System looks up everyone who follows this user | Transport (database → distribution service)
12 | Privacy filter applied | Remove followers who are blocked or excluded by privacy settings | Transform (filtering)
13 | Feed entries created | For each eligible follower, a feed entry is generated pointing to this post | Storage (persistent, one entry per follower)
14 | Notification candidates identified | Which followers have notifications enabled for this user? | Transform (filtering)
15 | Notifications dispatched | Push notifications sent to eligible followers | Transport (server → notification service → devices)
16 | Notification delivery logged | For each notification: sent/delivered/failed | Storage (persistent)

Phase 3: Indexing

# | Stage | What Happens | Category
17 | Caption text indexed | Words from caption added to search index | Transform (tokenization) + Storage (search index)
18 | Location indexed | Location added to geographic search | Storage (geo index)
19 | Hashtags extracted and indexed | #sunset, #paris pulled from caption and indexed | Transform (extraction) + Storage (hashtag index)
20 | Post added to user's profile timeline | Post appears on the user's own profile page | Storage (profile index)

Phase 4: Engagement (Ongoing)

# | Stage | What Happens | Category
21 | Follower views post | Post data retrieved and displayed | Transport (server → follower's phone)
22 | View recorded | View count incremented | Transform (increment) + Storage (counter update)
23 | Follower taps "Like" | Like event sent to server | Transport (phone → server)
24 | Like recorded | Like record created (who liked what, when) | Storage (persistent)
25 | Like count updated | Post's like count incremented | Transform (increment)
26 | Post author notified of like | Notification sent to original poster | Transport (server → phone)
27 | Someone comments | Comment text sent to server | Transport (phone → server)
28 | Comment validated and stored | Checked for length/prohibited content, then saved | Transform + Storage
29 | Comment count updated | Post's comment count incremented | Transform (increment)
30 | Post author notified of comment | Notification sent | Transport

Phase 5: Analytics (Later)

# | Stage | What Happens | Category
31 | User checks "insights" | Analytics data aggregated from view counts, like records, comment records | Transform (aggregation)
32 | Insights displayed | Aggregated data formatted and sent to user | Transform (formatting) + Transport (server → phone)

Step 3: The Fan-Out Problem

This example reveals a pattern the other examples don't: fan-out.

When a user with 10,000 followers posts a photo, the system must:

  • Create 10,000 feed entries (one per follower)
  • Potentially send 10,000 notifications
  • Handle 10,000 potential views, likes, and comments

This is a one-to-many transport and storage problem. The lifecycle of a single post multiplies at the distribution phase.

                                           ┌─ Follower A's feed
                                           ├─ Follower B's feed
    ┌──────┐       ┌────────┐              ├─ Follower C's feed
    │ Post │──────►│Fan-Out │──────────────├─ Follower D's feed
    │      │       │Service │              ├─ ...
    └──────┘       └────────┘              └─ Follower N's feed
                       │
                       │
                       ▼
                  ┌──────────┐             ┌─ Notification → A
                  │Notify    │─────────────├─ Notification → B
                  │Service   │             └─ Notification → (subset)
                  └──────────┘

This creates interesting data lifecycle questions:

  • Do you create all 10,000 feed entries immediately? Or lazily when each follower opens their app?
  • What if a follower opens their feed while the fan-out is still in progress? Do they see the post or not?
  • What if the user deletes the post 5 seconds after posting? Can you recall all 10,000 feed entries?

These are design decisions that emerge directly from mapping the lifecycle.
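The first question (eager or lazy?) can be sketched in a few lines. Everything here is a toy assumption: the in-memory dictionaries stand in for the follow graph, the post store, and the feed store, and the names are made up for the example.

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol", "dave"]}   # author -> who follows them
following = {"bob": ["alice"], "carol": ["alice"], "dave": ["alice"]}
posts_by_author = defaultdict(list)               # author -> post IDs
feeds = defaultdict(list)                         # follower -> precomputed feed


def post_eager(author: str, post_id: str) -> int:
    """Fan-out on write: one feed entry per follower, created at post time."""
    posts_by_author[author].append(post_id)
    for follower in followers.get(author, []):
        feeds[follower].append(post_id)
    return len(followers.get(author, []))         # feed writes caused by one post


def read_feed_lazy(user: str) -> list[str]:
    """Fan-out on read: build the feed only when the follower opens the app."""
    return [post for author in following.get(user, [])
            for post in posts_by_author[author]]


print(post_eager("alice", "p1"))   # 3
print(feeds["bob"])                # ['p1']
print(read_feed_lazy("carol"))     # ['p1']
```

The eager version pays at post time (10,000 followers means 10,000 writes). The lazy version pays every time a feed is opened. Neither is free; the lifecycle map tells you where the cost lands.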


Step 4: Multiple Storage Locations — Same Data

Notice that the same photo exists in multiple forms and multiple places:

Version | Storage Location | Purpose | Lifetime
Original upload | Temporary upload storage | Processing input | Deleted after processing (hours)
Original (full resolution) | Permanent file storage | Backup/recovery, "download original" feature | Forever (or until user deletes post)
Large (1200px) | Permanent file storage + CDN cache | Desktop viewing | Forever
Medium (600px) | Permanent file storage + CDN cache | Mobile feed viewing | Forever
Thumbnail (150px) | Permanent file storage + CDN cache | Grid view, previews | Forever

Five copies of what started as one photo. Each has a different purpose and potentially a different lifetime. If the user deletes the post, every copy that still exists must be deleted (the temporary upload should already be gone, which leaves four), and the CDN caches must be invalidated. Missing any one copy means orphaned data sitting in storage forever.
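One way to avoid orphaned copies is to derive every storage key from the post ID in a single place, so deletion cannot forget a version. The sketch below assumes a made-up key naming scheme and uses plain dictionaries in place of real file storage and a real CDN. It covers the four permanent versions and assumes the temporary upload was already removed after processing.

```python
SIZES = ("original", "large", "medium", "thumbnail")   # the permanent versions


def storage_keys(post_id: str) -> list[str]:
    """Every permanent file-storage key for one post (hypothetical naming scheme)."""
    return [f"photos/{post_id}/{size}.jpg" for size in SIZES]


def delete_post_files(post_id: str, file_store: dict, cdn_cache: dict) -> list[str]:
    """Delete every copy and report what was removed, so nothing is orphaned."""
    removed = []
    for key in storage_keys(post_id):
        if file_store.pop(key, None) is not None:
            removed.append(key)
        cdn_cache.pop(key, None)       # invalidate the cached copy as well
    return removed
```

Returning the list of removed keys gives you something to log and check: if it has fewer than four entries, a copy was already missing and that is worth investigating.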


Step 5: Comparing All Three Examples

Aspect | Coffee Shop | ATM | Social Media Post
Data flow shape | Linear (order → payment → fulfillment) | Linear with two-phase commit | Fan-out (one post → many feeds)
Number of data copies | 1 (the order) | 1 (the transaction) | Many (photo versions, feed entries, index entries)
Time sensitivity | Minutes (order should be ready soon) | Seconds (transaction must be instant) | Mixed (post immediately, analytics later)
Deletion complexity | Simple (one record) | N/A (transactions are permanent legal records) | Complex (must remove from all copies, feeds, indexes, caches)
Who consumes the data | Customer + barista | Customer + bank | Thousands of followers, search engines, analytics
Biggest hidden data | Tax config, menu version | Fraud signals, hardware status | Follow graph, EXIF metadata, content moderation

Key Takeaways From This Example

  1. Fan-out multiplies the lifecycle — one action can create thousands of downstream data events
  2. The same data exists in multiple forms — and each form has its own storage, its own lifecycle, and its own deletion requirements
  3. Indexing is a separate lifecycle stage — making data searchable requires transforming and storing it in additional specialized formats
  4. Privacy intersects with data lifecycle — EXIF stripping, privacy-filtered distribution, and blocked-user exclusion are all transforms driven by non-obvious data (privacy settings, block lists)
  5. Analytics are derived data — not stored at the time of action, but aggregated later from atomic records (views, likes, comments)

When mapping a system where one action triggers many reactions, always ask: "How many copies of this data exist, and what happens to all of them when the original changes?"

Data Lifecycle — Common Patterns

Recognizing Patterns

After mapping enough systems, you'll see the same structures repeatedly. Learning to recognize these patterns lets you quickly understand new systems by saying "oh, this is basically a pipeline with a fan-out at the end" instead of mapping every stage from scratch.

Every real system is a combination of these patterns. The coffee shop is CRUD + Request/Response. The ATM is Request/Response with a two-phase commit. The social media post is CRUD + Pipeline + Fan-Out + Event-Driven. Knowing the patterns lets you identify the building blocks.


Pattern 1: CRUD (Create, Read, Update, Delete)

The most basic lifecycle. Data is created, read back, modified, and eventually removed.

Structure

Create:  Input → Validate (Transform) → Store (Storage)
Read:    Request (Transport) → Retrieve (Storage) → Return (Transport)  
Update:  Input → Validate (Transform) → Overwrite (Storage)
Delete:  Request → Remove (Storage) → Confirm (Transport)

Real-World Example: Contact List App

Operation | Lifecycle Steps
Create a contact | User enters name + phone → app validates (phone number format check) → saved to database
Read contacts | User opens app → app requests contacts from database → database returns list → app displays them sorted alphabetically
Update a contact | User edits phone number → app validates new number → database overwrites old record
Delete a contact | User taps delete → app asks "are you sure?" → sends delete request → database removes record → app removes from displayed list

What to Watch For

  • Read is never just "get the data." There's almost always sorting, filtering, or pagination involved — those are transforms.
  • Delete is rarely simple. What about related data? If you delete a customer, what happens to their orders? Their reviews? Their saved addresses?
  • Update conflicts. What if two people update the same record at the same time? The last write wins? The first write wins? They're told there's a conflict?
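The update-conflict question has several valid answers. The sketch below shows one of them, a version check (often called optimistic locking), where the second writer is told there is a conflict. `ContactStore` and its in-memory dictionary are assumptions for illustration, not a real database API.

```python
class ConflictError(Exception):
    """Raised when a record changed since the caller last read it."""


class ContactStore:
    def __init__(self) -> None:
        self._rows: dict[int, dict] = {}
        self._next_id = 1

    def create(self, name: str, phone: str) -> int:
        contact_id = self._next_id
        self._next_id += 1
        self._rows[contact_id] = {"name": name, "phone": phone, "version": 1}
        return contact_id

    def read(self, contact_id: int) -> dict:
        return dict(self._rows[contact_id])   # a copy, so callers can't mutate storage

    def update(self, contact_id: int, phone: str, expected_version: int) -> None:
        row = self._rows[contact_id]
        if row["version"] != expected_version:
            raise ConflictError(f"contact {contact_id} was modified by someone else")
        row["phone"] = phone
        row["version"] += 1

    def delete(self, contact_id: int) -> None:
        del self._rows[contact_id]
```

Each caller passes back the version it read. If someone else updated the record in between, the versions no longer match and the update is rejected instead of silently overwriting. Dropping the check would give you "last write wins" instead.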

Lifecycle Map

┌────────────┐     validate      ┌────────────┐     store       ┌────────────┐
│ User Input │ ───────────────► │  Server    │ ─────────────► │  Database  │
│            │                   │ (transform)│                 │ (storage)  │
│            │ ◄─────────────── │            │ ◄───────────── │            │
│            │   display result  │            │   retrieve      │            │
└────────────┘                   └────────────┘                 └────────────┘

Pattern 2: Pipeline

Data flows through a series of transforms in sequence. Each step's output is the next step's input. No intermediate step stores data permanently; only the final result is stored or delivered.

Structure

Input → Step 1 (Transform) → Step 2 (Transform) → Step 3 (Transform) → Output

Real-World Example: Photo Upload Processing

When a user uploads a profile photo, it doesn't just get saved. It passes through a pipeline:

Step | Input | Transform | Output
1. Receive | Raw uploaded file | Verify it's actually an image (not a virus) | Validated image file
2. Strip metadata | Validated image | Remove EXIF data (GPS, camera info; for privacy) | Clean image
3. Resize | Clean image | Create thumbnail (100px), medium (400px), large (800px) versions | Three image files
4. Compress | Three images | Optimize file sizes for web delivery | Three compressed images
5. Content scan | Compressed images (or original) | Automated check for prohibited content | Same images + moderation flag (pass/fail)
6. Store | Compressed images | Save all three sizes to file storage | URLs for each size
7. Update record | Three URLs + moderation flag | Update user's profile record with new photo URLs | Updated database record

Pipeline Characteristics

Order matters. You can't compress before you resize (you'd compress the wrong sizes). You can't strip metadata after you store (the metadata would already be in storage). Each step depends on the previous step's output.

Failure stops the pipeline. If step 5 (content scan) flags the image, steps 6 and 7 never execute. The pipeline has a clear "abort" path at every stage.

Each step is independently testable. Give step 3 a known image, check that the output is three images of the right size. You don't need the rest of the pipeline to test this one step.

Lifecycle Map

Raw Upload → [Validate] → [Strip EXIF] → [Resize] → [Compress] → [Scan] → [Store] → [Update Record]
                                                                     │
                                                                     ▼ (if flagged)
                                                              [Reject + Notify User]
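The three characteristics (order matters, failure stops the pipeline, each step is testable alone) show up clearly in a small sketch. The steps here are stand-ins: a dictionary plays the role of the image, the extension check is a placeholder for real file inspection, and only the first three stages are modeled.

```python
from typing import Callable

Step = Callable[[dict], dict]


class PipelineAbort(Exception):
    """Raised by a step to stop the pipeline."""


def validate(item: dict) -> dict:
    # Placeholder check; a real validator would inspect the file contents.
    if not item["filename"].lower().endswith((".jpg", ".png")):
        raise PipelineAbort("not an image")
    return item


def strip_metadata(item: dict) -> dict:
    return {**item, "exif": None}


def resize(item: dict) -> dict:
    return {**item, "sizes": [100, 400, 800]}


def run_pipeline(item: dict, steps: list[Step]) -> dict:
    for step in steps:
        item = step(item)     # each step's output is the next step's input
    return item


upload = {"filename": "me.jpg", "exif": {"gps": (48.85, 2.35)}}
result = run_pipeline(upload, [validate, strip_metadata, resize])
print(result["exif"], result["sizes"])   # None [100, 400, 800]
```

Because each step takes a dictionary and returns a dictionary, you can call `resize` on its own with a known input and check the result without running anything else. An exception raised in any step means later steps never run.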

Pattern 3: Request/Response

One system asks a question, another system answers it. The data makes a round trip.

Structure

Requester → Question (Transport) → Responder → Process (Transform) → Answer (Transport) → Requester

Real-World Example: Weather App

Step | What Happens | Category
1 | User opens weather app | None (no data yet)
2 | App sends request: "What's the weather for ZIP 10001?" | Transport (app → weather API)
3 | Weather API receives request | Transport complete
4 | API looks up current conditions for ZIP 10001 | Storage (read from weather database)
5 | API formats response (temperature, conditions, humidity, forecast) | Transform (raw data → structured response)
6 | Response sent back to app | Transport (API → app)
7 | App stores response in local cache | Storage (temporary, expires after 15 minutes)
8 | App displays weather to user | Transport (app memory → screen)
9 | User checks again 5 minutes later | App serves from cache (no new request)
10 | 15 minutes pass, cache expires | Cache entry deleted
11 | User checks again | Repeat from step 2

Request/Response Characteristics

There's always a waiting period. Between sending the request and receiving the response, the requester is waiting. What does it show the user? A spinner? Stale cached data? Nothing?

Timeouts are essential. What if the response never comes? The requester must decide: wait forever? Give up after 5 seconds? Show an error?

Caching changes the lifecycle. If you cache responses, you now have the data stored in two places (the source and the cache). They can get out of sync. How stale is acceptable? Who invalidates the cache?
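The caching behavior in the weather example can be sketched as a small time-limited cache. `TTLCache` and its `fetch` callback are hypothetical names for this illustration; the clock is passed in so the expiry logic can be tested without waiting 15 minutes.

```python
import time
from typing import Callable


class TTLCache:
    """Serve a stored response until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, object]] = {}   # key -> (stored_at, value)

    def get(self, key: str, fetch: Callable[[], object]) -> object:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]            # fresh enough: no new request
        value = fetch()                # the real round trip happens here
        self._entries[key] = (now, value)
        return value
```

With `ttl_seconds=900`, a second lookup five minutes later is served from memory, and a lookup after 15 minutes triggers a new request. The sketch deliberately leaves out the other two questions from this section: what to show while `fetch` is running, and how long `fetch` is allowed to take before giving up.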


Pattern 4: Event-Driven (Fan-Out)

Something happens, and multiple independent parts of the system react — each with their own lifecycle.

Structure

Event Occurs → Broadcast (Transport)
                ├─→ Listener A → (its own lifecycle)
                ├─→ Listener B → (its own lifecycle)
                └─→ Listener C → (its own lifecycle)

Real-World Example: New User Signs Up

A user creates an account. This single event triggers many independent reactions:

Listener | What It Does | Its Own Lifecycle
Welcome Email Service | Sends a welcome email | Retrieve email template (storage) → Fill in user's name (transform) → Send email (transport) → Log delivery (storage)
Default Settings Service | Creates the user's default preferences | Generate default settings (transform) → Save to database (storage)
Analytics Service | Records the signup event | Format event data (transform) → Write to analytics store (storage)
Onboarding Service | Creates a guided tutorial checklist | Generate checklist (transform) → Save progress tracker (storage)
Admin Dashboard | Updates the "new signups today" counter | Increment counter (transform) → Update dashboard data (storage)
Fraud Detection | Checks if signup looks legitimate | Analyze email domain, IP address, behavior patterns (transform) → Flag or clear (storage)

Fan-Out Characteristics

Listeners are independent. If the welcome email fails, the default settings should still be created. Each listener has its own success/failure path.

The event producer doesn't know (or care) about the listeners. The signup module just says "a user signed up." It doesn't know that six other systems are listening. This is intentional — it keeps the boundary clean.

Order usually doesn't matter. The welcome email can arrive before or after the default settings are created. But sometimes order does matter — the tutorial can't reference the user's settings if settings haven't been created yet. These ordering dependencies need to be explicit.

Fan-out can cascade. The welcome email might trigger its own event ("email sent"), which another listener responds to ("update email tracking dashboard"). One event can cascade into dozens of downstream data lifecycle chains.
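Listener independence is the property most worth seeing in code. The sketch below is a minimal in-process event bus, with `EventBus`, `subscribe`, and `publish` as names invented for the example. Real systems usually put a message queue between producer and listeners, but the shape is the same.

```python
from collections import defaultdict
from typing import Callable

Listener = Callable[[dict], None]


class EventBus:
    def __init__(self) -> None:
        self._listeners: dict[str, list[Listener]] = defaultdict(list)

    def subscribe(self, event_name: str, listener: Listener) -> None:
        self._listeners[event_name].append(listener)

    def publish(self, event_name: str, payload: dict) -> list[str]:
        """Deliver to every listener; return the names of listeners that failed."""
        failed = []
        for listener in self._listeners[event_name]:
            try:
                listener(payload)
            except Exception:
                # One listener failing must not stop the others.
                failed.append(getattr(listener, "__name__", repr(listener)))
        return failed


settings: dict[str, dict] = {}


def send_welcome_email(event: dict) -> None:
    raise ConnectionError("mail server unreachable")   # simulated failure


def create_default_settings(event: dict) -> None:
    settings[event["user_id"]] = {"theme": "light"}


bus = EventBus()
bus.subscribe("user_signed_up", send_welcome_email)
bus.subscribe("user_signed_up", create_default_settings)
print(bus.publish("user_signed_up", {"user_id": "u1"}))   # ['send_welcome_email']
print(settings)                                           # {'u1': {'theme': 'light'}}
```

The code that publishes the event never mentions email or settings. It only announces that a signup happened, which is exactly the clean boundary described above.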


Pattern 5: Batch Processing

Data accumulates over time, then is processed all at once on a schedule.

Structure

Events accumulate (Storage) → Timer fires → Retrieve batch (Transport) → Process all (Transform) → Store results (Storage) → Deliver (Transport)

Real-World Example: Daily Sales Report

Step | What Happens | Category | Timing
1 | Orders happen throughout the day | Storage (each order written to database as it occurs) | Ongoing, real-time
2 | Midnight: report job triggers | None (timer event) | Scheduled
3 | Job queries all orders for the day | Transport (database → report service) | ~Midnight
4 | Orders aggregated by category, region, payment method | Transform (aggregation) | ~Midnight
5 | Summary formatted into report | Transform (formatting) | ~Midnight
6 | Report stored | Storage (persistent, saved to reports archive) | ~Midnight
7 | Report emailed to management | Transport (report service → email service → inboxes) | ~Midnight

Batch Processing Characteristics

There's a delay between event and processing. An order at 9am isn't reflected in the report until midnight. This is by design — but stakeholders must understand it.

The batch window is critical. If 100,000 orders need processing and the job takes 3 hours, it must start early enough to finish before anyone needs the results. What if order volume doubles?

Failed batches are painful. If the midnight job fails, there's no report in the morning. Is there a retry? A manual trigger? Does someone get alerted?

Idempotency matters. If the job runs twice (maybe it was retried), does it produce the same report or a duplicate? The job must be safe to re-run.
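One simple way to make a batch job safe to re-run is to store its result under a key derived from the batch itself, here the date. The sketch assumes orders are dictionaries with `date`, `category`, and `amount_cents` fields and that the archive is a plain dictionary; both are stand-ins for real storage.

```python
from collections import defaultdict


def build_daily_report(orders: list[dict], day: str) -> dict:
    """Aggregate one day's orders by category (the Transform step)."""
    totals: dict[str, int] = defaultdict(int)
    for order in orders:
        if order["date"] == day:
            totals[order["category"]] += order["amount_cents"]
    return {"day": day, "totals": dict(totals)}


def run_report_job(orders: list[dict], day: str, archive: dict[str, dict]) -> None:
    """Store the report under its date, so a re-run overwrites instead of duplicating."""
    archive[day] = build_daily_report(orders, day)


orders = [
    {"date": "2024-03-01", "category": "coffee", "amount_cents": 450},
    {"date": "2024-03-01", "category": "coffee", "amount_cents": 500},
    {"date": "2024-03-01", "category": "pastry", "amount_cents": 325},
    {"date": "2024-03-02", "category": "coffee", "amount_cents": 450},
]
archive: dict[str, dict] = {}
run_report_job(orders, "2024-03-01", archive)
run_report_job(orders, "2024-03-01", archive)   # retried after a suspected failure
print(len(archive), archive["2024-03-01"]["totals"])
# 1 {'coffee': 950, 'pastry': 325}
```

The same keying idea applies to the delivery step, which this sketch leaves out: you would also record that the email for a given date was sent, so a retry does not send it twice.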


Using Patterns to Analyze New Systems

When you encounter a new system, ask:

  1. What's the dominant pattern? (Most features are CRUD at their core)
  2. Where are the pipelines? (Any time data is processed in steps)
  3. Where are the request/response boundaries? (Any time two systems talk)
  4. Where are the fan-out points? (Any time one action triggers multiple reactions)
  5. Is there batch processing? (Any time you hear "nightly," "weekly," "scheduled")

Most systems are a combination. "When a user signs up (CRUD: create user), send a welcome email (event-driven), process their uploaded profile photo (pipeline), and load their personalized dashboard (request/response pulling data from multiple sources)."

Naming the pattern lets you immediately know what lifecycle questions to ask, what failure modes to expect, and how the data flows.

Data Lifecycle — Test Your Understanding

Answer each question thoroughly. There is no code here — only thinking. If you can't answer confidently, revisit the Why and How sections before continuing.


Section A: Identification

Question 1

A weather app shows you the current temperature for your city.

List every piece of data involved in showing that single number on your screen. For each piece, label it as Storage, Transform, or Transport. Some pieces may involve more than one.


Question 2

You take a photo on your phone, apply a filter, and post it to social media. Your friend, in another country, sees it on their phone.

Trace the data lifecycle of that photo from the moment you press the shutter button to the moment your friend sees it. Identify every stage of storage, transform, and transport.


Question 3

A thermostat in a house reads the temperature, and if it's below 68°F, turns on the heater. When the temperature reaches 72°F, it turns the heater off.

What data is involved? Where is each piece stored? What transforms occur? What transport happens?


Section B: Analysis

Question 4

A company stores customer orders in a database. Every night at midnight, a report is generated summarizing the day's sales and emailed to the management team.

Draw the full lifecycle map (table or diagram) for this process. Include data you think the description didn't mention but that must exist for the system to work.


Question 5

You are told: "The system is slow." You know the system does three things:

  1. Receives data from an external API
  2. Processes that data (cleans and aggregates it)
  3. Saves the result to a database

Using only the data lifecycle model, list three distinct hypotheses for why it might be slow, one related to each lifecycle stage (storage, transform, transport).


Question 6

An e-commerce site has a feature: "Customers who bought this also bought..."

What data must be stored to make this feature work? What transform produces the recommendations? When does that transform happen — when the page loads, or ahead of time? What are the tradeoffs of each approach?


Section C: Design

Question 7

You are asked to design a system for a public library's book checkout process. Patrons scan their library card, scan the book, and walk out. Overdue books generate a notification after 14 days.

Produce a full lifecycle map. Include:

  • All data involved (obvious and hidden)
  • Every storage location
  • Every transform
  • Every transport
  • What happens when things go wrong (book not in system, card expired, network down)

Question 8

A food delivery app needs a feature: real-time order tracking. The customer can see "Order received," "Being prepared," "Out for delivery," and "Delivered" with live updates.

Map the lifecycle of the order status specifically. Where is status stored? What triggers each status change (transform)? How does the updated status reach the customer's screen (transport)? What happens if the driver's phone loses connectivity?


Question 9

A school wants a system where teachers enter grades, students can view their own grades, and parents receive a weekly email summary.

Three different types of users interact with the same data. Map the full lifecycle showing how grade data flows differently for each user type. Identify where the data is shared and where it diverges.


Section D: Critical Thinking

Question 10

Someone proposes: "Let's just store everything in one big database table and figure out the rest later."

Using what you know about the data lifecycle, explain specifically what problems this creates. Don't just say "it's bad" — identify at least three concrete consequences and relate each one back to storage, transform, or transport.


Question 11

You mapped a system's data lifecycle and found that the same piece of data is stored in three different places (a database, a cache, and a local file).

Is this a problem? Under what circumstances would it be the right design? Under what circumstances would it be a mistake? What specific risks does it introduce?


Question 12

A feature request says: "When a user uploads a profile picture, it should appear immediately on their profile."

This sounds simple. Using lifecycle thinking, list everything that actually needs to happen between "user selects a file" and "image appears on profile page." Identify the stages that could fail and what the user would experience in each failure case.


Grading Rubric

For each question, evaluate your answer against these criteria:

Criteria | What It Means
Completeness | Did you identify all the data, including non-obvious data (metadata, state, config)?
Accuracy | Did you correctly label each stage as storage, transform, or transport?
Failure awareness | Did you consider what happens when things go wrong?
Clarity | Could someone else read your answer and build from it?

If your lifecycle map is missing stages, it means you're not seeing the full picture yet. That's fine — re-read the material and try again. The goal is not to get it right the first time. The goal is to train your brain to automatically think in lifecycles.

Boundaries — Why It Matters

The Hardest Problem in Engineering

Ask any experienced engineer what the hardest part of their job is. They won't say "writing code." They'll say some version of:

"Figuring out where one thing ends and another thing begins."

This is the boundary problem, and it is the single most consequential decision in any system. Get the boundaries right and the system is clean, debuggable, changeable, and explainable. Get them wrong and you have a tangled mess that nobody — not even the person who built it — can understand six months later.

An LLM can write code inside a well-defined boundary. It cannot decide where the boundary should be. That's your job.

What Is a Boundary?

A boundary is an answer to the question: "What is this thing responsible for, and what is it NOT responsible for?"

When you say "the authentication module," you are drawing a boundary. Everything about verifying identity is inside. Everything about displaying dashboards is outside. The boundary is the line between them.

Boundaries exist at every level:

  • This operation handles validating an email address. It does NOT store the email anywhere.
  • This feature handles user signup. It does NOT handle password resets.
  • This module handles all authentication. It does NOT handle what the user sees after logging in.
  • This service handles the entire user account system. It does NOT handle product inventory.

Why Boundaries Are Hard

Everything feels connected

In any real system, almost everything touches something else. A user's name appears on the profile page, in order confirmations, in admin dashboards, in emails. It's tempting to think it's all one thing. It's not — it's one piece of data crossing multiple boundaries.

Premature abstraction

People draw boundaries too early, before they understand the problem. They create a "UserManager" and a "DataProcessor" and an "EventHandler" before they know what the system actually does. These names describe nothing. They bound nothing. They're boundaries without meaning.

Fear of duplication

"But this code does almost the same thing as that code!" So people merge them, destroying two clear boundaries to create one muddled one. Sometimes duplication is the right answer. Two things that happen to look similar today might evolve in completely different directions tomorrow.

What Happens Without Clear Boundaries

The Ripple Effect

You change one thing and seventeen other things break. This happens because responsibilities leaked across boundaries. The payment code shouldn't need to know about the email template format, but someone took a shortcut and now they're coupled.

The "Nobody Understands This" Problem

A new person joins the team. They ask "how does checkout work?" If boundaries are clean, you can point to the checkout module and say "it's all in there." If boundaries are muddy, the answer is "well, it starts here, but then it calls this thing over there, which triggers this other thing, which writes to this shared table that's also used by..." — and the new person learns nothing.

The "Can't Change Anything" Problem

You need to replace the payment provider. If the payment boundary is clean, you swap out the internals and nothing else knows. If the payment logic is scattered across the codebase, you're rewriting half the application.

The "Testing Is Impossible" Problem

You want to verify that order totals are calculated correctly. If calculation is inside a clear boundary with defined inputs and outputs, you test it directly. If calculation is tangled with database access and UI rendering, you have to spin up the entire system just to test arithmetic.
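Here is what "a clear boundary with defined inputs and outputs" can look like for the order total. The function name, the (price, quantity) tuple format, and the rounding rule are assumptions made for this sketch; the point is that nothing inside it touches a database or a screen.

```python
from decimal import Decimal, ROUND_HALF_UP


def order_total(line_items: list[tuple[Decimal, int]], tax_rate: Decimal) -> Decimal:
    """Pure calculation: (unit price, quantity) pairs plus a tax rate in, total out."""
    subtotal = sum((price * qty for price, qty in line_items), Decimal("0"))
    tax = (subtotal * tax_rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return subtotal + tax


items = [(Decimal("4.50"), 2), (Decimal("3.00"), 1)]
print(order_total(items, Decimal("0.08")))   # 12.96
```

Testing this takes one line and no setup. If the same arithmetic lived inside a function that also loaded the cart from the database and rendered the receipt, the test would need all of that running first.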

The Hierarchy of Boundaries

Real systems have boundaries nested inside boundaries. Understanding the hierarchy is essential:

Operation

The smallest unit. One focused action.

  • Validate an email format
  • Calculate tax on a subtotal
  • Format a date for display

An operation should do one thing. If you describe it and use the word "and," it might be two operations.

Feature

A user-facing capability composed of operations.

  • "Sign up" = validate email + validate password + check for duplicate account + create account + send welcome email
  • "Place order" = validate cart + calculate total + process payment + create order record + send confirmation

A feature is something a user or stakeholder would recognize. "Validate email" is not a feature — "Sign up" is.
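The "Sign up" feature above can be sketched as a function that composes small operations. This is an illustrative sketch only: the password rule, the in-memory accounts dictionary, and the outbox list standing in for email delivery are all assumptions.

```python
import re


def is_valid_email(email: str) -> bool:
    # Operation: deliberately simple format check, not full RFC validation.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None


def is_valid_password(password: str) -> bool:
    # Operation: assumed rule for illustration (at least 8 characters).
    return len(password) >= 8


def sign_up(email: str, password: str,
            accounts: dict[str, str], outbox: list[str]) -> bool:
    """Feature: runs the operations in order and stops at the first failure."""
    if not is_valid_email(email):
        return False
    if not is_valid_password(password):
        return False
    if email in accounts:  # check for duplicate account
        return False
    accounts[email] = password  # create account (a real system would store a hash)
    outbox.append(f"welcome:{email}")  # stand-in for sending the welcome email
    return True
```

Each operation can be described without the word "and"; the feature is the place where they are joined.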

Module

A cohesive group of related features.

  • Authentication module: sign up, log in, log out, reset password, manage sessions
  • Orders module: browse products, add to cart, checkout, view order history
  • Notifications module: send email, send push notification, manage preferences

A module should have a clear domain. If you can't name it in one to three words, it might be too broad.

System

The complete application — all modules working together.

Service

An independently deployable system with its own storage. At large scale, modules may become services. But that's an advanced concern — start with modules.

The Naming Test

Here's a simple test for whether your boundary is right: Can you name it clearly?

  • ✅ "AuthenticationModule" — clear, you know what's inside
  • ✅ "OrderCalculator" — clear, it calculates orders
  • ✅ "EmailSender" — clear, it sends emails
  • ❌ "Utilities" — what's in here? Everything that didn't fit elsewhere? This is a junk drawer, not a boundary.
  • ❌ "DataManager" — manages what data? All data? This name tells you nothing.
  • ❌ "Helper" — helps with what? This is a confession that you didn't know where to put something.

If you can't name it precisely, you haven't defined it precisely. The name IS the boundary.

Why This Matters For Your Career

In the age of LLMs, the engineer who can say:

"This system needs four modules. Here's what each one does. Here's what each one does NOT do. Here are the connections between them."

...is the engineer who leads projects. The one who asks "should I use React or Vue?" is asking the wrong question. The technology doesn't matter until the boundaries are clear.

Boundaries are the blueprint. Everything else is construction.

Boundaries — How: The Method

Drawing Boundaries: A Practical Framework

Boundaries don't appear naturally. You have to draw them deliberately. Here is a repeatable four-step process for identifying where boundaries belong in any system.


Step 1: List the Nouns

Take the system description and extract every noun — every "thing" that exists in the domain.

Don't filter yet. Don't decide what's important. Just list every noun you can find.

Example: "A library system where patrons check out books, librarians manage inventory, and overdue books generate fines."

Nouns:

  • Patron
  • Book
  • Librarian
  • Inventory
  • Fine
  • Checkout (the act of checking out — this is a noun too)
  • Overdue status (implied)

These nouns are your candidate boundaries. Not all of them will become boundaries, but they are where you start.

Step 2: Group by Responsibility

Ask: "Which nouns are really about the same concern?"

Some nouns are clearly siblings. "Book" and "Inventory" are both about the collection. "Patron" and "Librarian" are both people, but they have fundamentally different responsibilities — so they might belong to different groups.

Create a table:

Concern                          | Nouns
(name the concern in 1-3 words)  | (list the related nouns)

Each concern is a candidate module.

Step 3: Define the Inside and the Outside

For each candidate boundary, write two lists:

  • ✅ Inside: What this module is responsible for. All the actions, data, and rules it owns.
  • ❌ Outside: What this module is explicitly NOT responsible for. Name specific things that someone might accidentally put here.

The "Outside" list is the important one. It's easy to say what something does. It's harder — and more valuable — to say what it does NOT do. The outside list is what prevents scope creep.

If you can't clearly write the "outside" list, your boundary is too vague. Go back to step 2 and re-examine.
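One way to make the inside and outside lists checkable is to write them down as data. The sketch below is an assumption-laden illustration: the Boundary class and the problems function are hypothetical names, and responsibilities are compared as plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class Boundary:
    name: str
    inside: set[str] = field(default_factory=set)
    outside: set[str] = field(default_factory=set)


def problems(boundaries: list[Boundary]) -> list[str]:
    """Report vague or overlapping boundaries."""
    found = []
    for b in boundaries:
        if not b.outside:
            found.append(f"{b.name}: empty outside list")
        for item in sorted(b.inside & b.outside):
            found.append(f"{b.name}: '{item}' is both inside and outside")
    # A responsibility claimed by two modules has no single owner.
    for i, a in enumerate(boundaries):
        for b in boundaries[i + 1:]:
            for item in sorted(a.inside & b.inside):
                found.append(f"'{item}' is claimed by {a.name} and {b.name}")
    return found
```

An empty result does not prove the boundaries are good, but a non-empty one points at a decision that still has to be made.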

Step 4: Identify the Connections

Boundaries are not walls — they are membranes. Data flows between them, but only through defined channels.

For each pair of modules that need to communicate, define:

  • What data crosses the boundary (be specific — not "user data" but "user ID")
  • Which direction it flows (A asks B, or B notifies A, or both)
  • What triggers the communication (a user action? a schedule? a state change?)

Draw the connections as arrows with labels. The label should describe the question being asked or the data being passed, not the technical mechanism.
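The three questions above (what crosses, which direction, what triggers it) fit naturally into a small record. This is a hypothetical sketch; the Connection class and the sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    source: str   # module that initiates the communication
    target: str   # module that is asked or notified
    data: str     # the specific data that crosses the boundary
    trigger: str  # what causes the communication

    def label(self) -> str:
        """Arrow label: describes the data and trigger, not the mechanism."""
        return f"{self.source} -> {self.target}: {self.data} (on {self.trigger})"


example = Connection(
    source="Circulation",
    target="Accounts",
    data="patron ID",
    trigger="checkout request",
)
```

Writing "patron ID" rather than "user data" in the data field forces the specificity that step 4 asks for.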


Supporting Concepts

The Responsibility Test

For every piece of logic in your system, ask: "Whose job is it?"

If the answer is clear and singular, your boundaries are right. If the answer is "well, it could be either..." then you have a boundary problem that needs a decision.

Cohesion and Coupling

Two concepts that measure boundary quality:

Cohesion (inside a boundary) — High cohesion = good. Everything inside a boundary should be closely related. If you opened a module and found something unrelated, the boundary is wrong.

Test: "Does every piece inside this boundary serve the same purpose?"

Coupling (between boundaries) — Low coupling = good. Boundaries should depend on each other as little as possible. If changing Module A's internals forces changes inside Module B, they are too tightly coupled.

Test: "If I completely rewrote the inside of this module, would other modules need to change?"

The ideal: high cohesion within boundaries, low coupling between them.

The Naming Test

Can you name each boundary clearly, in 1-3 words, where the name accurately describes EVERYTHING inside?

  • ✅ "Authentication" — clear, you know what's inside
  • ✅ "OrderCalculation" — clear, it calculates orders
  • ❌ "Utilities" — junk drawer, not a boundary
  • ❌ "DataManager" — manages what? This name hides confusion
  • ❌ "Helpers" — a confession that you didn't know where to put something

The Elevator Test

Can you explain each boundary in one sentence — the kind of sentence you'd say in an elevator?

If your one-sentence explanation includes the word "and" more than once, the boundary might be too wide. If you struggle to fill a sentence at all, it might be too narrow.


What to Look For in the Examples

The following pages work through three complete systems using this four-step method. As you read each one, pay attention to:

  1. How many nouns they start with vs. how many boundaries they end with (it's always fewer boundaries than nouns)
  2. Where the controversial decisions are — the moments where something could reasonably go in two places
  3. What the connection diagram looks like — are there many connections (tightly coupled) or few (loosely coupled)?
  4. How the "outside" lists prevent problems — each "NOT responsible for" statement is a bug that won't happen

Boundaries — Example: Library System

The Scenario

A public library system where patrons check out books, librarians manage the collection, overdue books generate fines, and the library sends reminders. Patrons can search the catalog and place holds on books that are currently checked out.


Step 1: List the Nouns

From the description and common sense about how libraries work:

  • Patron
  • Librarian
  • Book
  • Copy (a library owns multiple copies of the same book)
  • Catalog
  • Inventory
  • Checkout record
  • Due date
  • Return
  • Hold (a reservation on a book)
  • Hold queue (when multiple patrons want the same book)
  • Fine
  • Payment
  • Reminder (notification about due dates)
  • Library card
  • Account (patron's account)
  • Search query
  • Search results

18 nouns. This will NOT become 18 modules.


Step 2: Group by Responsibility

Concern         | Nouns                                        | Rationale
The Collection  | Book, Copy, Catalog, Inventory               | All about what the library has
Patron Accounts | Patron, Library Card, Account                | All about who uses the library
Circulation     | Checkout, Return, Due Date, Hold, Hold Queue | All about the movement of books between library and patron
Finances        | Fine, Payment                                | All about money
Communication   | Reminder                                     | All about notifying patrons
Search          | Search Query, Search Results                 | All about finding books

Six concerns from 18 nouns. But are six the right number? Let's examine.

Should Search be its own boundary? Search operates on Catalog data, but it doesn't modify it. It's a read-only view with its own complexity (matching, ranking, filtering). Making it separate means the Catalog can change how books are stored without affecting how they're searched. Yes — keep it separate.

Should Communication be its own boundary? Reminders are triggered by Circulation events (book due soon, book overdue). Should reminders live inside Circulation? No — because the library might also want to send other communications later (new book announcements, event invitations). Keeping Communication separate lets it grow without changing Circulation. Yes — keep it separate.

Should Finances be its own boundary? Fines are triggered by Circulation, and payments come from Patrons. Finances sits between them. If we put fines inside Circulation, then Circulation needs to know about money, which is outside its core concern. Yes — keep Finances separate.
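The grouping can be checked mechanically: every noun from step 1 should land in exactly one concern. The sketch below uses the step 1 spellings (for example "Checkout record"); the function names are hypothetical. Run against the grouping above, it flags Librarian, which the table does not place in any concern, so where it belongs is still an open decision.

```python
NOUNS = [
    "Patron", "Librarian", "Book", "Copy", "Catalog", "Inventory",
    "Checkout record", "Due date", "Return", "Hold", "Hold queue",
    "Fine", "Payment", "Reminder", "Library card", "Account",
    "Search query", "Search results",
]

CONCERNS = {
    "The Collection": ["Book", "Copy", "Catalog", "Inventory"],
    "Patron Accounts": ["Patron", "Library card", "Account"],
    "Circulation": ["Checkout record", "Return", "Due date", "Hold", "Hold queue"],
    "Finances": ["Fine", "Payment"],
    "Communication": ["Reminder"],
    "Search": ["Search query", "Search results"],
}


def ungrouped(nouns: list[str], concerns: dict[str, list[str]]) -> list[str]:
    """Nouns from step 1 that no concern has claimed."""
    assigned = {noun for group in concerns.values() for noun in group}
    return [noun for noun in nouns if noun not in assigned]


def assigned_twice(concerns: dict[str, list[str]]) -> list[str]:
    """Nouns that appear in more than one concern."""
    seen: set[str] = set()
    repeated: list[str] = []
    for group in concerns.values():
        for noun in group:
            if noun in seen and noun not in repeated:
                repeated.append(noun)
            seen.add(noun)
    return repeated
```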


Step 3: Define Inside and Outside

Catalog Module

  • Inside: Adding books to the collection, removing books, tracking how many copies of each book exist, storing book details (title, author, ISBN, genre, description), tracking the physical condition of copies
  • Outside: Who has checked out which copy (that's Circulation), searching for books (that's Search), book prices or fines (that's Finances), patron information (that's Accounts)

Elevator test: "Manages the library's collection — what books exist and how many copies are available."

Patron Accounts Module

  • Inside: Creating patron accounts, issuing library cards, storing patron info (name, address, contact), verifying patron identity, managing account status (active, suspended, expired), storing patron preferences
  • Outside: What books a patron has checked out (that's Circulation), what fines they owe (that's Finances), sending emails to patrons (that's Communication)

Elevator test: "Manages who the library's patrons are and their account status."

Circulation Module

  • Inside: Recording checkouts (which patron, which copy, what date), calculating due dates, recording returns, managing the hold queue (who wants a book and in what order), enforcing checkout limits (max 10 books), tracking overdue status
  • Outside: Book details like title or author (that's Catalog), patron contact info (that's Accounts), fine amounts or payments (that's Finances), sending overdue notices (that's Communication)

Elevator test: "Tracks the movement of books between the library and patrons — who has what and when it's due."

Finances Module

  • Inside: Calculating fine amounts based on how overdue a book is, recording fine charges, processing payments, tracking patron balance (what they owe), applying fine policies (grace periods, maximum fines, waivers)
  • Outside: Whether a book is overdue (that's Circulation — Finances gets told), patron contact info (that's Accounts), book details (that's Catalog)

Elevator test: "Handles all money — what patrons owe, what they've paid, and fine policies."

Communication Module

  • Inside: Deciding how to contact a patron (email, text, postal), formatting messages, sending messages, tracking delivery success/failure, managing communication preferences (patron opts out of texts)
  • Outside: Deciding when to contact someone (other modules trigger this), determining what the patron owes (that's Finances), knowing what books are due (that's Circulation)

Elevator test: "Delivers messages to patrons through their preferred channel."

Search Module

  • Inside: Accepting search queries, matching against the catalog (by title, author, genre, ISBN, keyword), ranking results by relevance, filtering (available now, genre, format), paginating results
  • Outside: Modifying the catalog (that's Catalog), checking out books (that's Circulation), patron info (that's Accounts)

Elevator test: "Helps patrons find books in the collection."


Step 4: Connection Diagram

┌──────────────┐       "does this patron       ┌──────────────┐
│   Patron     │        exist and are            │ Circulation  │
│   Accounts   │◄──── they active?"─────────────│              │
│              │                                 │  - checkouts │
│  - identity  │                                 │  - returns   │
│  - status    │                                 │  - holds     │
└──────────────┘                                 │  - due dates │
       ▲                                         └──────────────┘
       │                                              │     │
       │ "who to contact"                             │     │
       │ + "how"                                      │     │
       │                                              │     │
┌──────────────┐      "this is overdue" ─────────────┘     │
│Communication │◄──── or "due date approaching"             │
│              │                                            │
│  - channels  │      "fine has been charged" ──┐           │
│  - delivery  │◄──── or "payment received"     │           │
└──────────────┘                                │           │
                                          ┌──────────────┐  │ "does this
                                          │  Finances    │  │  book exist?"
                                          │              │◄─┘  + "is a copy
                                          │  - fines     │     available?"
                                          │  - payments  │
                                          └──────────────┘
                                                            │
                    ┌──────────────┐                        │
                    │   Catalog    │◄───────────────────────┘
                    │              │
                    │  - books     │◄──── "search for"
                    │  - copies    │       books matching X
                    │  - details   │
                    └──────────────┘      ┌──────────────┐
                           ▲              │    Search    │
                           └──────────────│              │
                             read-only    │  - queries   │
                             access       │  - results   │
                                          └──────────────┘

Connection Analysis

Let's count the connections:

  • Circulation connects to: Accounts, Catalog, Finances, Communication = 4 connections
  • Communication connects to: Accounts, Circulation, Finances = 3 connections
  • Finances connects to: Circulation, Communication = 2 connections (triggered by Circulation, notifies Communication)
  • Search connects to: Catalog = 1 connection (read-only)
  • Catalog connects to: nothing outbound (it's a foundational data store)
  • Accounts connects to: nothing outbound (it's a foundational data store)

The natural foundation pieces (Catalog, Accounts) have zero outbound dependencies: they don't need anything from the other modules. Circulation, Communication, and Search all depend on them. This is a sign of good architecture: foundational data has no dependencies, and the modules built on top reach down to it.
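The "zero outbound dependencies" observation can be computed from a list of directed edges. The edge list below is one reading of the diagram (initiator first, target second); the direction assigned to each arrow is an assumption, and the function names are hypothetical.

```python
MODULES = ["Accounts", "Catalog", "Circulation", "Communication", "Finances", "Search"]

# (initiator, target) pairs, as read off the connection diagram.
EDGES = [
    ("Circulation", "Accounts"),       # "does this patron exist and are they active?"
    ("Circulation", "Catalog"),        # "does this book exist?"
    ("Circulation", "Finances"),       # "this is overdue"
    ("Circulation", "Communication"),  # "due date approaching"
    ("Finances", "Communication"),     # "fine has been charged"
    ("Communication", "Accounts"),     # "who to contact" + "how"
    ("Search", "Catalog"),             # read-only access
]


def outbound_counts(modules: list[str], edges: list[tuple[str, str]]) -> dict[str, int]:
    """How many other modules each module reaches out to."""
    counts = {module: 0 for module in modules}
    for source, _target in edges:
        counts[source] += 1
    return counts


def foundations(modules: list[str], edges: list[tuple[str, str]]) -> list[str]:
    """Modules with no outbound dependencies."""
    counts = outbound_counts(modules, edges)
    return [module for module in modules if counts[module] == 0]
```

With these edges, foundations(MODULES, EDGES) returns Accounts and Catalog, and Circulation has the highest outbound count, which makes it the module most exposed to changes elsewhere.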


Controversial Decisions and Tradeoffs

Should hold notifications come from Circulation or Communication?

When a hold becomes available (the book was returned), someone needs to tell the patron.

Option A: Circulation sends the notification directly.

  • Pro: Simpler — fewer module interactions
  • Con: Circulation now needs to know about email/text/postal preferences, which is outside its domain

Option B: Circulation emits an event ("hold available"), Communication picks it up, looks up the patron's contact preferences, and sends the message.

  • Pro: Each module stays within its boundary. Communication can be enhanced (add push notifications) without changing Circulation.
  • Con: More moving parts. If Communication fails, the patron doesn't get notified and doesn't know their hold is ready.

Decision: Option B. The extra complexity is worth it because notification channels will change over time, and Circulation shouldn't need to change when they do.
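Option B can be sketched with a tiny in-process event bus. Everything here is hypothetical (EventBus, the "hold_available" event name, the payload fields, the "postal" fallback); a real system might use a message queue instead, which is also where the delivery-failure risk named above would need handling.

```python
from collections import defaultdict
from typing import Callable, Optional

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-process event bus."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._handlers[event_name].append(handler)

    def emit(self, event_name: str, payload: dict) -> None:
        for handler in self._handlers[event_name]:
            handler(payload)


class Communication:
    """Owns contact preferences and delivery; knows nothing about holds."""

    def __init__(self, preferences: dict[str, str]) -> None:
        self._preferences = preferences
        self.sent: list[tuple[str, str, str]] = []  # stand-in for real delivery

    def on_hold_available(self, payload: dict) -> None:
        channel = self._preferences.get(payload["patron_id"], "postal")
        self.sent.append((channel, payload["patron_id"], payload["copy_id"]))


class Circulation:
    """Announces what happened; knows nothing about email, text, or postal."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus

    def record_return(self, copy_id: str, next_in_hold_queue: Optional[str]) -> None:
        if next_in_hold_queue is not None:
            self._bus.emit(
                "hold_available",
                {"patron_id": next_in_hold_queue, "copy_id": copy_id},
            )
```

Adding a new channel changes only Communication; Circulation keeps emitting the same event.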

Should "patron has too many fines" block checkout?

A patron with $50 in unpaid fines tries to check out a book. Who blocks them?

Option A: Circulation checks with Finances before every checkout.

  • Pro: The block happens at the right moment
  • Con: Circulation is now coupled to Finances — it can't work without it

Option B: Finances notifies Accounts to suspend the patron. Circulation checks patron status (which it already does) and sees "suspended."

  • Pro: Circulation doesn't need to ask Finances anything at checkout. Account status is something it already checks.
  • Con: There's a window between the fine accruing and the account being suspended where the patron might check out another book.

Decision: Option B for most libraries (the window is acceptable). Option A if the rules are strict and the window is unacceptable (e.g., a very high-value collection).
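A minimal sketch of Option B, assuming a suspension threshold of 5000 cents ($50, echoing the scenario above; the actual policy is not specified). Class and method names are hypothetical.

```python
class Accounts:
    """Owns account status."""

    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def register(self, patron_id: str) -> None:
        self._status[patron_id] = "active"

    def suspend(self, patron_id: str) -> None:
        self._status[patron_id] = "suspended"

    def is_active(self, patron_id: str) -> bool:
        return self._status.get(patron_id) == "active"


class Finances:
    """Owns balances and fine policy; tells Accounts when to suspend."""

    def __init__(self, accounts: Accounts, suspend_at_cents: int = 5000) -> None:
        self._accounts = accounts
        self._suspend_at_cents = suspend_at_cents  # assumed policy value
        self._balance: dict[str, int] = {}

    def charge_fine(self, patron_id: str, amount_cents: int) -> None:
        self._balance[patron_id] = self._balance.get(patron_id, 0) + amount_cents
        if self._balance[patron_id] >= self._suspend_at_cents:
            self._accounts.suspend(patron_id)


class Circulation:
    """Checks account status only; never asks about money."""

    def __init__(self, accounts: Accounts) -> None:
        self._accounts = accounts

    def can_check_out(self, patron_id: str) -> bool:
        return self._accounts.is_active(patron_id)
```

In this synchronous sketch the suspension is immediate. The window described in the Con appears when the notification from Finances to Accounts is delayed, for example by a nightly batch job.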


What This Example Teaches

  1. Start with many nouns, end with few modules — 18 nouns became 6 modules
  2. Every boundary decision has a rationale — "Search is separate because..." not just "it felt right"
  3. The outside list prevents future mistakes — explicitly stating "Circulation does NOT handle fines" means nobody accidentally adds fine logic there
  4. Connections are minimal and directional — foundational modules have zero outbound dependencies
  5. Controversial decisions exist in every system — the mark of good design is making them deliberately and documenting why

Boundaries — Example: E-Commerce Platform

The Scenario

An online store where customers browse products, add items to a cart, check out with payment, receive order confirmations, and track shipment. Administrators manage the product catalog, adjust inventory, and process returns. The system supports discount codes and customer reviews.


Step 1: List the Nouns

  • Customer
  • Product
  • Category
  • Price
  • Inventory/Stock
  • Cart
  • Cart item
  • Discount code
  • Order
  • Order item
  • Shipping address
  • Shipping method
  • Shipment tracking
  • Payment
  • Refund
  • Return
  • Order confirmation (email)
  • Shipping notification (email)
  • Review
  • Rating
  • Admin user
  • Product image

22 nouns.


Step 2: Group by Responsibility

Concern           | Nouns                                               | Rationale
Product Catalog   | Product, Category, Price, Product Image             | What's for sale and how it's described
Inventory         | Inventory/Stock                                     | How much of each product is available
Shopping          | Cart, Cart Item                                     | The customer's in-progress selection
Pricing           | Discount Code, (Price from Catalog)                 | What things cost and how discounts apply
Orders            | Order, Order Item, Shipping Address, Shipping Method | The committed purchase
Payment           | Payment, Refund                                     | Moving money
Fulfillment       | Shipment Tracking, Return                           | Physical delivery and returns
Communication     | Order Confirmation, Shipping Notification           | Emails and notifications
Reviews           | Review, Rating                                      | Customer feedback
Customer Accounts | Customer, Admin User                                | Identity and authentication

10 candidate modules from 22 nouns. Let's examine whether that's the right number.

Merge Decisions

Should Inventory be separate from Catalog? Catalog is about what exists (product descriptions, images, categories). Inventory is about how many are available right now (stock counts, warehouse locations, restock dates). They change at very different rates — product descriptions change rarely, stock counts change constantly. Keep separate.

Should Pricing be separate from Catalog? Prices could live in the Catalog — they're a product attribute. But discount codes, sale pricing, bulk pricing, regional pricing, and coupon logic are complex enough to be their own concern. If pricing lives in Catalog, then every pricing rule change risks affecting product display. Keep separate.

Should Shopping (Cart) be separate from Orders? A cart is temporary, uncommitted, and belongs to a browsing session. An order is permanent, committed, and has legal/financial implications. They feel similar (both contain items) but have completely different lifecycles and rules. Keep separate.
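The different lifecycles of a cart and an order can be made visible in the types themselves. In this illustrative sketch (all names hypothetical), the cart is freely mutable while the order is frozen once created.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    """Temporary and uncommitted: items can change at any time."""
    items: dict[str, int] = field(default_factory=dict)

    def add(self, product_id: str, quantity: int = 1) -> None:
        self.items[product_id] = self.items.get(product_id, 0) + quantity

    def remove(self, product_id: str) -> None:
        self.items.pop(product_id, None)


@dataclass(frozen=True)
class Order:
    """Permanent and committed: fields cannot be reassigned after creation."""
    order_id: str
    items: tuple[tuple[str, int], ...]


def place_order(order_id: str, cart: Cart) -> Order:
    # Copy the cart contents into an immutable snapshot, then empty the cart.
    order = Order(order_id, tuple(sorted(cart.items.items())))
    cart.items.clear()
    return order
```

Both hold items, but merging them would force one type to serve two sets of rules.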

Split Decisions

Should Customer and Admin be the same module? Both are "accounts" with identity and authentication. But admins have additional permissions: manage products, view all orders, process returns. Customer-specific features (saved addresses, order history, wishlists) don't apply to admins. Split into Customer Accounts and Admin Accounts — or use a single Accounts module with role-based boundaries inside it.

Decision: Single Accounts module with roles. At this scale, splitting them creates unnecessary overhead.
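A single Accounts module with role-based boundaries inside it might look like the sketch below. The action names are taken loosely from the capabilities listed above; the Role enum, the PERMISSIONS table, and is_allowed are hypothetical.

```python
from enum import Enum


class Role(Enum):
    CUSTOMER = "customer"
    ADMIN = "admin"


# Which actions each role may perform (illustrative, not exhaustive).
PERMISSIONS: dict[Role, set[str]] = {
    Role.CUSTOMER: {"view_own_orders", "save_address", "edit_wishlist"},
    Role.ADMIN: {"manage_products", "view_all_orders", "process_returns"},
}


def is_allowed(role: Role, action: str) -> bool:
    """One module, one lookup: the role decides what is permitted."""
    return action in PERMISSIONS[role]
```

If the two roles later diverge enough that this table becomes hard to maintain, that is a signal to revisit the split.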


Step 3: Define Inside and Outside

Product Catalog

  • Inside: Product names, descriptions, images, categories, product pages, product attributes (size, color, weight)
  • Outside: Current stock levels (Inventory), prices and discounts (Pricing), customer reviews (Reviews), how to ship it (Fulfillment)

Inventory

  • Inside: Stock counts per product, warehouse locations, low-stock alerts, stock reservations (when an item is in someone's cart or mid-checkout), restock tracking
  • Outside: Product details (Catalog), pricing (Pricing), order records (Orders)

Shopping (Cart)

  • Inside: Adding/removing items from cart, updating quantities, cart persistence (survive browser refresh), cart expiration (after 30 days of inactivity)
  • Outside: Product details displayed in the cart (Catalog), prices shown (Pricing), stock availability checks (Inventory), placing the order (Orders)

Pricing

  • Inside: Base price lookups, discount code validation, discount calculation, sale/promotional pricing rules, tax calculation, bulk pricing tiers
  • Outside: Product details (Catalog), cart management (Shopping), payment processing (Payment), order recording (Orders)

Orders

  • Inside: Creating an order from a cart, recording order items, storing shipping address and method, tracking order status (confirmed → processing → shipped → delivered), order history for customers, cancellation logic
  • Outside: Payment processing (Payment), physical shipment (Fulfillment), price calculation (Pricing), stock management (Inventory), sending confirmation emails (Communication)

Payment

  • Inside: Charging credit cards, processing refunds, payment method storage (tokenized), payment confirmation, handling payment failures (retry, alternative methods)
  • Outside: What was ordered (Orders), shipping details (Fulfillment), product info (Catalog)

Fulfillment

  • Inside: Generating shipping labels, tracking shipment status, delivery confirmation, handling returns (receiving returned items, inspecting condition), return shipping labels
  • Outside: Order details beyond what to ship (Orders), payment/refund processing (Payment), customer notification (Communication)

Communication

  • Inside: Email templates, sending emails, push notifications, SMS, delivery tracking, communication preferences
  • Outside: Deciding when to communicate (other modules trigger), order details (Orders), payment details (Payment)

Reviews

  • Inside: Submitting reviews, editing reviews, review moderation, star ratings, calculating average rating, displaying reviews
  • Outside: Product details (Catalog), customer identity (Accounts), order verification ("did this customer actually buy this product?" — needs to ask Orders)

Accounts

  • Inside: Registration, login/logout, password management, profiles, roles (customer vs. admin), saved addresses, authentication tokens
  • Outside: Order history (Orders), cart contents (Shopping), payment methods (Payment — though this is debatable)

Step 4: Connection Diagram

┌──────────┐                    ┌──────────┐
│ Accounts │◄──── "who is      │ Shopping  │
│          │       this?" ─────│  (Cart)   │
│- identity│                    │- items    │
│- roles   │                    │- quantities│
└──────────┘                    └──────────┘
     ▲                              │ │
     │                    "product  │ │ "what's in
     │                     info?"   │ │  the cart?"
     │                              ▼ │
     │                         ┌────────┐            ┌──────────┐
     │                         │Catalog │◄───────────│  Search  │
     │                         │        │ "find      │(if added)│
     │                         │-products│ products"  └──────────┘
     │                         │-images │
     │                         └────────┘
     │                              ▲
     │              "product info?" │
     │                              │
     │    ┌──────────┐        ┌──────────┐
     │    │Inventory │◄───────│ Pricing  │
     │    │          │"is it  │          │
     │    │- stock   │ in     │- prices  │
     │    │- reserves│ stock?"│- discounts│
     │    └──────────┘        └──────────┘
     │         ▲                    ▲
     │         │ "reserve stock"    │ "calculate total"
     │         │                    │
     │    ┌────────────────────────────┐
     │    │         Orders             │
     │    │                            │──────►┌──────────┐
     │    │ - order records            │"charge"│ Payment  │
     │    │ - status tracking          │       │          │
     │    │ - history                  │◄──────│- charges │
     │    └────────────────────────────┘ "paid" │- refunds│
     │              │        │                  └──────────┘
     │   "ship this"│        │"order confirmed"
     │              ▼        ▼
     │    ┌──────────┐  ┌──────────────┐
     │    │Fulfillment│  │Communication │
     │    │           │  │              │
     │    │- shipping │──►│- emails      │
     │    │- returns  │  │- notifications│
     │    └──────────┘  └──────────────┘
     │                        ▲
     │                        │ "did they buy it?" (asks Orders)
     │                   ┌──────────┐
     └───────────────────│ Reviews  │
       "who wrote this?" │          │
                         │- ratings │
                         └──────────┘

Interesting Boundary Decisions

Where does "stock reservation" live?

When a customer adds an item to their cart, should that item be reserved (so it doesn't sell out while they're browsing)? If so, who manages that?

Option A: Shopping (Cart) tells Inventory to reserve stock.

  • Pro: Stock is reserved early, fewer disappointed customers at checkout
  • Con: Cart is coupled to Inventory. What about abandoned carts? Reservations must expire.

Option B: Stock is only reserved at checkout, when the Order is created.

  • Pro: Simpler. Inventory only talks to Orders, not to Cart.
  • Con: Customer shops for 20 minutes, goes to checkout, and finds out the item sold out.

Option C: No reservation. First to complete checkout gets it.

  • Pro: Simplest. No reservation management at all.
  • Con: High-demand items cause frustration.

Decision for this design: Option B. Reserve at checkout. The tradeoff is acceptable for most e-commerce, and it avoids coupling Cart to Inventory. For flash sales or limited editions, switch to Option A with short expiry times (for example, 10 minutes).
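A minimal Python sketch of Option B, assuming a hypothetical in-memory Inventory. The class name, the method names, and the 600-second hold are illustrative assumptions, not part of the design above. It shows the two ideas that matter: only checkout reserves stock, and every reservation carries an expiry so that an abandoned checkout gives the stock back.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    """Hypothetical in-memory inventory; a real one would sit on a database."""
    stock: dict[str, int]
    # reservation_id -> (sku, quantity, expires_at)
    reservations: dict[str, tuple[str, int, float]] = field(default_factory=dict)

    def available(self, sku: str) -> int:
        """Stock on hand minus everything currently held by a reservation."""
        held = sum(q for s, q, _ in self.reservations.values() if s == sku)
        return self.stock.get(sku, 0) - held

    def reserve(self, reservation_id: str, sku: str, quantity: int,
                now: float, hold_seconds: float = 600.0) -> bool:
        """Called by Orders at checkout. Returns False if stock is short."""
        self.release_expired(now)
        if quantity <= 0 or self.available(sku) < quantity:
            return False
        self.reservations[reservation_id] = (sku, quantity, now + hold_seconds)
        return True

    def release_expired(self, now: float) -> None:
        """Drop holds whose expiry has passed so the stock is available again."""
        expired = [rid for rid, (_, _, exp) in self.reservations.items() if exp <= now]
        for rid in expired:
            del self.reservations[rid]


inventory = Inventory(stock={"sku-1": 1})
first = inventory.reserve("order-1", "sku-1", 1, now=0.0)
second = inventory.reserve("order-2", "sku-1", 1, now=30.0)   # still held by order-1
third = inventory.reserve("order-3", "sku-1", 1, now=601.0)   # order-1's hold has expired
```

Passing the current time in as a parameter, rather than reading a clock inside the module, keeps the expiry rule easy to test.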

Should Payment handle refunds, or should Fulfillment?

A return triggers a refund. Who initiates it?

Option A: Fulfillment receives the return, inspects it, and calls Payment to refund.

  • Pro: Refund happens at the right moment (item received and inspected)
  • Con: Fulfillment is coupled to Payment

Option B: Fulfillment marks the return as "received and approved." Orders sees this and tells Payment to refund.

  • Pro: Orders is the orchestrator — it already connects to Payment. Fulfillment stays focused on physical goods.
  • Con: Extra hop (Fulfillment → Orders → Payment instead of Fulfillment → Payment)

Decision: Option B. Orders already orchestrates both Payment and Fulfillment, so adding refund orchestration to Orders costs one extra hop and keeps Fulfillment focused on physical goods.
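One way to picture Option B in Python, as a sketch under stated assumptions: the three classes, their method names, and the callback wiring are hypothetical. Fulfillment reports that a return was approved; Orders decides that this means a refund and asks Payment to issue it. Fulfillment never imports or calls Payment.

```python
from typing import Callable


class Payment:
    """Hypothetical Payment module: the only place refunds are issued."""
    def __init__(self) -> None:
        self.refunds: list[tuple[str, int]] = []

    def refund(self, order_id: str, amount_cents: int) -> None:
        self.refunds.append((order_id, amount_cents))


class Orders:
    """Orchestrator: reacts to an approved return by asking Payment to refund."""
    def __init__(self, payment: Payment) -> None:
        self._payment = payment
        self._totals: dict[str, int] = {}
        self.status: dict[str, str] = {}

    def create(self, order_id: str, total_cents: int) -> None:
        self._totals[order_id] = total_cents
        self.status[order_id] = "paid"

    def on_return_approved(self, order_id: str) -> None:
        if self.status.get(order_id) != "paid":
            raise ValueError(f"order {order_id} is not refundable")
        self._payment.refund(order_id, self._totals[order_id])
        self.status[order_id] = "refunded"


class Fulfillment:
    """Handles physical goods. It reports events and knows nothing about Payment."""
    def __init__(self, notify_return_approved: Callable[[str], None]) -> None:
        self._notify = notify_return_approved

    def approve_return(self, order_id: str) -> None:
        # Receiving and inspecting the returned item would happen here.
        self._notify(order_id)


payment = Payment()
orders = Orders(payment)
fulfillment = Fulfillment(notify_return_approved=orders.on_return_approved)
orders.create("order-42", 1999)
fulfillment.approve_return("order-42")
```

The extra hop is visible in the wiring line: Fulfillment is handed a callback, and only Orders holds a reference to Payment.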

How do "product reviews" verify a purchase?

A review should only be written by someone who bought the product. The Reviews module doesn't have order data.

Reviews must ask Orders: "Did customer X buy product Y?" This is a cross-boundary query, and it's the right design — Reviews shouldn't duplicate order data just to check this.
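A small Python sketch of that cross-boundary query. The names PurchaseLookup, has_purchased, and submit are assumptions made for illustration. Reviews depends on a single question it may ask, not on how Orders stores its data.

```python
from typing import Protocol


class PurchaseLookup(Protocol):
    """The one question Reviews is allowed to ask Orders."""
    def has_purchased(self, customer_id: str, product_id: str) -> bool: ...


class Orders:
    """Hypothetical Orders module; it owns all order data."""
    def __init__(self) -> None:
        self._purchases: set[tuple[str, str]] = set()

    def record_purchase(self, customer_id: str, product_id: str) -> None:
        self._purchases.add((customer_id, product_id))

    def has_purchased(self, customer_id: str, product_id: str) -> bool:
        return (customer_id, product_id) in self._purchases


class Reviews:
    """Owns ratings and review text; stores no order data of its own."""
    def __init__(self, purchases: PurchaseLookup) -> None:
        self._purchases = purchases
        self.reviews: list[tuple[str, str, int]] = []

    def submit(self, customer_id: str, product_id: str, rating: int) -> bool:
        if not self._purchases.has_purchased(customer_id, product_id):
            return False  # rejected: not a verified buyer
        self.reviews.append((customer_id, product_id, rating))
        return True


orders = Orders()
orders.record_purchase("cust-1", "prod-9")
reviews = Reviews(purchases=orders)
accepted = reviews.submit("cust-1", "prod-9", rating=5)
rejected = reviews.submit("cust-2", "prod-9", rating=1)
```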


Comparing Library vs. E-Commerce

Aspect | Library | E-Commerce
Module count | 6 | 10
Why more modules? | Library has simpler domain | E-commerce has more concerns (pricing, fulfillment, payments are all complex)
Central orchestrator | Circulation | Orders
Financial complexity | Simple (flat fine rates) | Complex (discounts, tax, multi-currency, refunds)
Physical-digital bridge | Checkout desk | Shipping/fulfillment
Biggest coupling risk | Circulation ↔ Fines | Orders ↔ Payment ↔ Inventory
Common God Module | "LibrarySystem" (does everything) | "OrderProcessor" (checkout + payment + inventory + email)

The key takeaway: more complex domains need more boundaries, but each boundary should still pass the elevator test. If you can't explain a module in one sentence, it's too big — regardless of how complex the overall system is.

Boundaries — Example: Hospital Patient Management

The Scenario

A hospital system that manages patient registration, doctor scheduling, appointments, medical records, prescriptions, lab tests, billing, and insurance claims. Doctors view patient records. Nurses log vitals. Patients access a portal to see appointments and results.

This is the most complex example so far. Real hospital systems are among the most boundary-critical systems in existence — if data leaks between boundaries incorrectly, the consequences can be fatal.


Step 1: List the Nouns

  • Patient
  • Doctor
  • Nurse
  • Appointment
  • Schedule
  • Medical record
  • Diagnosis
  • Prescription
  • Medication
  • Lab test
  • Lab result
  • Vitals (blood pressure, temperature, heart rate)
  • Bill/Invoice
  • Insurance claim
  • Insurance provider
  • Payment
  • Patient portal
  • Department (cardiology, orthopedics, etc.)
  • Room/Bed
  • Admission (inpatient stay)
  • Discharge
  • Referral
  • Allergy
  • Medical history
  • Emergency contact

25 nouns. Let's find the boundaries.


Step 2: Group by Responsibility

Concern | Nouns | Rationale
People/Identity | Patient, Doctor, Nurse, Emergency Contact | Who people are, not what they do
Scheduling | Appointment, Schedule, Room/Bed | When and where things happen
Clinical Records | Medical Record, Diagnosis, Vitals, Allergy, Medical History | The patient's health data
Medications | Prescription, Medication | What drugs are prescribed and dispensed
Lab/Diagnostics | Lab Test, Lab Result | Tests ordered and their results
Admissions | Admission, Discharge, Referral | Inpatient stays and transfers
Billing | Bill/Invoice, Payment | What the patient owes
Insurance | Insurance Claim, Insurance Provider | Third-party payer processing
Department | Department | Organizational structure
Patient Portal | Patient Portal | Patient-facing access

10 candidate modules. Let's evaluate.

Merge Decisions

Should Insurance merge with Billing? Billing is "calculate what's owed." Insurance is "submit claims to a third party, track approval/denial, handle coverage rules." Insurance has its own external dependencies (insurance company APIs, claim formats, pre-authorization workflows). Keep separate — insurance is complex enough to be its own domain.

Should Patient Portal be a module? The portal is a view into other modules' data — it shows appointments (Scheduling), lab results (Lab), bills (Billing). It doesn't own any unique data. It's a presentation boundary, not a data boundary. The portal is not a module — it's a consumer of other modules' contracts. Important distinction.

Should Department be a module? Departments are organizational metadata. They affect scheduling (which department a doctor belongs to) and potentially routing (which department handles a referral). But "Department" alone doesn't have enough logic to be its own module. Merge into People/Identity as an attribute.

Revised Module List

  1. People — identity of patients, providers, staff
  2. Scheduling — appointments and resource allocation
  3. Clinical Records — medical data
  4. Medications — prescriptions and pharmacy
  5. Lab/Diagnostics — tests and results
  6. Admissions — inpatient management
  7. Billing — charges and payments
  8. Insurance — claims and coverage

8 modules.


Step 3: Define Inside and Outside

People

  • Inside: Patient demographics (name, DOB, address, phone), doctor/nurse profiles, credentials, department assignments, emergency contacts, user authentication for portal access
  • Outside: Medical history (Clinical Records), what appointments they have (Scheduling), what they owe (Billing), what insurance they have (Insurance — though insurance ID might be stored here as an attribute)

Scheduling

  • Inside: Creating/canceling/rescheduling appointments, doctor availability calendars, room/bed assignments, appointment reminders, waitlist management, recurring appointment series
  • Outside: What happens during the appointment (Clinical Records), what it costs (Billing), patient demographics (People), test orders (Lab)

Clinical Records

  • Inside: Diagnoses, clinical notes, vitals recordings, allergy lists, medical history, treatment plans, imaging records, visit summaries
  • Outside: Prescriptions (Medications — though they link to diagnoses), test ordering (Lab — though linked to clinical decisions), billing (Billing), appointment logistics (Scheduling)

Why is Clinical Records separate from Medications and Lab? Because clinical records are the narrative of patient care — what was observed, what was decided. Medications and Lab are the actions — what was prescribed, what was tested. These change at different rates, are governed by different regulations, and are managed by different people. A pharmacist manages medications. A lab technician manages tests. A doctor manages the clinical record.

Medications

  • Inside: Prescriptions (what drug, what dose, what duration), drug interaction checks, refill tracking, pharmacy dispensing records, medication history
  • Outside: The clinical reason for the prescription (Clinical Records), patient identity (People), billing for medications (Billing)

Lab/Diagnostics

  • Inside: Ordering tests, tracking sample collection, managing lab workflows, recording results, flagging abnormal results, test history
  • Outside: Clinical interpretation of results (Clinical Records), billing for tests (Billing), patient identity (People)

Admissions

  • Inside: Admitting patients (inpatient), bed assignments, transfer between departments, discharge processing, length-of-stay tracking, discharge summaries, referrals to other facilities
  • Outside: Clinical care during the stay (Clinical Records), billing for the stay (Billing), identity (People)

Billing

  • Inside: Generating charges from procedures/visits/tests/medications, creating invoices, processing patient payments, tracking outstanding balances, payment plans
  • Outside: Insurance claims (Insurance — Billing passes charges to Insurance for claim submission), clinical details (Clinical Records), scheduling (Scheduling)

Insurance

  • Inside: Insurance plan details, pre-authorization requests, claim submission, claim status tracking, coverage verification, denial management, appeal processing
  • Outside: Generating charges (Billing — Insurance receives charges), clinical justification (Clinical Records provides this when needed for pre-auth), patient identity (People)

Step 4: Connection Diagram

                          ┌───────────┐
                          │  People   │
                          │           │
                          │- patients │
                          │- doctors  │
                          │- staff    │
                          └───────────┘
                         ▲  ▲  ▲  ▲  ▲
                "who?"  /  |  |  |  \  "who?"
                       /   |  |  |   \
          ┌───────────┐  ┌─┴──┴──┴──┐  ┌───────────┐
          │Scheduling │  │Clinical  │  │Admissions │
          │           │  │Records   │  │           │
          │-appts     │  │          │  │-admits    │
          │-calendar  │  │-diagnoses│  │-transfers │
          │-rooms     │  │-vitals   │  │-discharges│
          └───────────┘  │-history  │  └───────────┘
                         └──────────┘
                          │        │
           "prescribed    │        │  "ordered
            based on      │        │   based on
            diagnosis"    ▼        ▼   diagnosis"
                   ┌───────────┐ ┌──────────┐
                   │Medications│ │   Lab    │
                   │           │ │          │
                   │-scripts   │ │-tests    │
                   │-drugs     │ │-results  │
                   │-refills   │ │-samples  │
                   └───────────┘ └──────────┘
                        │              │
                        │ "charges"    │ "charges"
                        ▼              ▼
                      ┌──────────────────┐
                      │     Billing      │
                      │                  │
                      │  - invoices      │
                      │  - payments      │
                      └──────────────────┘
                              │
                              │ "submit claim"
                              ▼
                      ┌──────────────────┐
                      │    Insurance     │
                      │                  │
                      │  - claims        │
                      │  - coverage      │
                      └──────────────────┘

Critical Observation: Data Flows Downward

Notice the shape. People is at the top: every module needs to know who someone is. Clinical Records is in the middle: clinical decisions drive medications, lab tests, and admissions. Billing is near the bottom: it receives charges from multiple sources. Insurance is at the very bottom: it receives data from Billing.

Apart from the "who?" identity lookups to People, no arrows point upward. This is not an accident; it's good boundary design. Lower modules don't need to know about higher modules.
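A related property can be checked mechanically: following the arrows should never lead from a module back to itself. The sketch below transcribes the arrows from the connection diagram (approximately; the dictionary is an assumption about how to read it) and uses Python's standard graphlib to look for a cycle. The "regression" case is hypothetical: it imagines Insurance starting to call Clinical Records directly for pre-authorization data.

```python
from graphlib import CycleError, TopologicalSorter

# Arrows read off the connection diagram: source module -> target modules.
ARROWS = {
    "Scheduling": {"People"},
    "Clinical Records": {"People", "Medications", "Lab"},
    "Admissions": {"People"},
    "Medications": {"Billing"},
    "Lab": {"Billing"},
    "Billing": {"Insurance"},
}


def has_cycle(arrows: dict[str, set[str]]) -> bool:
    """Return True if following arrows can lead from a module back to itself."""
    try:
        # TopologicalSorter reads each value set as the key's predecessors.
        # The direction convention does not matter for detecting a cycle.
        TopologicalSorter(arrows).prepare()
    except CycleError:
        return True
    return False


diagram_has_cycle = has_cycle(ARROWS)

# Hypothetical regression: Insurance reaches back up into Clinical Records.
regression = {**ARROWS, "Insurance": {"Clinical Records"}}
regression_has_cycle = has_cycle(regression)
```

In the regression, Clinical Records leads to Medications, Medications to Billing, Billing to Insurance, and Insurance back to Clinical Records, which is exactly the loop the layered shape is meant to prevent.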


Why Hospitals Are the Extreme Case for Boundaries

Regulatory boundaries are real boundaries

Medical records have different legal protections than billing data, and a receptionist must not see the same data a doctor sees. Boundaries give access control a place to be enforced: if Clinical Records and Billing live in the same module, it is much harder to guarantee that a billing clerk can't see clinical notes.

Wrong data can kill

If Medications gets the wrong patient's allergy list from Clinical Records, the patient could receive a drug they're allergic to. If Lab results are attributed to the wrong patient, treatment decisions are made on false data. Boundary contracts in healthcare are literally life-critical.

Audit requirements

Every access to a medical record must be logged: who accessed it, when, and why. This is only possible if Clinical Records is a clear boundary with defined entry points. If patient data is scattered across every module, comprehensive auditing is impossible.
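A sketch of what "defined entry points" can mean in Python. The class, method, and field names are assumptions chosen for illustration. Records are private to the module, and the only way to read them is a method that cannot run without recording who, when, and why.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessLogEntry:
    accessed_by: str
    patient_id: str
    reason: str
    at: datetime


class ClinicalRecords:
    """Hypothetical module: storage is private, reads go through one method."""
    def __init__(self) -> None:
        self._records: dict[str, list[str]] = {}
        self.access_log: list[AccessLogEntry] = []

    def add_note(self, patient_id: str, note: str) -> None:
        self._records.setdefault(patient_id, []).append(note)

    def read_record(self, patient_id: str, accessed_by: str, reason: str) -> list[str]:
        """The single read entry point: who, when, and why are always logged."""
        if not reason.strip():
            raise ValueError("a reason is required to read a medical record")
        self.access_log.append(
            AccessLogEntry(accessed_by, patient_id, reason, datetime.now(timezone.utc))
        )
        # Return a copy so callers cannot change stored notes behind the module's back.
        return list(self._records.get(patient_id, []))


records = ClinicalRecords()
records.add_note("patient-7", "BP 120/80")
notes = records.read_record("patient-7", accessed_by="dr-lee", reason="follow-up visit")
```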


Comparing All Three Examples

Aspect | Library | E-Commerce | Hospital
Modules | 6 | 10 | 8
Biggest driver of boundary decisions | Clean domain separation | Financial accuracy | Regulatory + safety
Most connected module | Circulation | Orders | Clinical Records
Presentation layer is a module? | No | No | No (Patient Portal is a consumer)
Would work as a monolith? | Yes, for a small library | Yes, for a small store | Risky: regulatory violations likely
Consequence of bad boundaries | Wrong book to wrong patron | Financial errors, bad customer experience | Wrong treatment, legal liability, death
Key lesson | Domain boundaries emerge from nouns | Complexity drives boundary count | Stakes drive boundary rigor

Boundaries — Common Mistakes

The Five Antipatterns

After seeing how boundaries should work across three examples, let's look at how they go wrong. These mistakes are so common that you will encounter every one of them in your career. Recognizing them is half the battle.


Mistake 1: The God Module

What It Looks Like

One module grows to handle a massive portion of the system. It started small and reasonable, then feature after feature was added because "it's related" or "it's easier to put it here."

Before (Bad):

OrderService
├── Create order
├── Calculate totals
├── Apply discount codes
├── Validate inventory
├── Reserve stock
├── Process payment
├── Process refund
├── Generate invoice
├── Send confirmation email
├── Send shipping notification
├── Update order status
├── Track shipment
├── Handle returns
├── Generate sales reports
└── Manage customer loyalty points

15 responsibilities. This module is impossible to name accurately — "OrderService" doesn't cover half of what it actually does. Any change to any of these responsibilities risks breaking all the others.

How to Spot It

  • The module has more than 5-7 responsibilities
  • Its name doesn't accurately describe everything inside
  • Changes to the module are frequent and scary
  • Multiple developers are constantly working in the same module and colliding
  • Testing requires setting up the entire system because everything is connected

After (Fixed):

Orders                     Pricing                Payment
├── Create order           ├── Calculate totals   ├── Charge
├── Update status          ├── Apply discounts    ├── Refund
├── Track history          └── Tax calculation    └── Payment history
└── Cancel order

Inventory                  Fulfillment            Communication
├── Check stock            ├── Ship order         ├── Email templates
├── Reserve stock          ├── Track shipment     ├── Send confirmation
└── Release reservation    └── Process return     └── Send notifications

Billing                    Loyalty
├── Generate invoice       ├── Earn points
└── Payment tracking       └── Redeem points

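The same core functionality, now in eight modules instead of one. Each has 2-4 responsibilities and a clear name, and each can change independently. (Sales reports are not shown; they belong in a separate reporting module, not in Orders.)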


Mistake 2: The Micro-Boundary

What It Looks Like

The opposite of the God Module. Everything is its own boundary, each with trivial responsibility.

Before (Bad):

EmailValidator           ← validates email format
PasswordValidator        ← validates password strength
NameValidator            ← validates name length
AddressValidator         ← validates address format
PhoneValidator           ← validates phone format
DateValidator            ← validates date format
LoginHandler             ← handles login
LogoutHandler            ← handles logout
SessionCreator           ← creates sessions
SessionDestroyer         ← destroys sessions
PasswordHasher           ← hashes passwords
TokenGenerator           ← generates auth tokens

12 modules for what is clearly one concern: Authentication.

How to Spot It

  • You have modules with only 1-2 functions
  • Many modules always change together (if EmailValidator changes, LoginHandler probably does too)
  • Understanding a single feature requires reading 10 modules
  • The connection diagram looks like a plate of spaghetti

After (Fixed):

Authentication
├── Validate credentials (email, password, etc.)
├── Login / Logout
├── Session management
├── Password hashing
└── Token generation

One module. Five cohesive responsibilities. All related to "verifying and managing user identity." If someone asks "where is the login logic?" the answer is one word: Authentication.

The Test

If two things always change together, they probably belong together. Micro-boundaries violate cohesion — they separate things that should be unified.
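The "change together" test can be made concrete with a little arithmetic. The sketch below is an illustration only: the change history is invented, and the ratio used (changes touching both modules divided by changes touching either) is one reasonable measure among several.

```python
from collections import Counter
from itertools import combinations

# Hypothetical change history: each set is the modules touched by one change.
CHANGES = [
    {"LoginHandler", "SessionCreator", "TokenGenerator"},
    {"LoginHandler", "SessionCreator"},
    {"EmailValidator", "LoginHandler", "SessionCreator"},
    {"Catalog"},
    {"Catalog", "Search"},
]


def co_change_ratio(changes: list[set[str]]) -> dict[tuple[str, str], float]:
    """For each pair: changes touching both, divided by changes touching either."""
    touched = Counter(module for change in changes for module in change)
    together = Counter(
        pair for change in changes for pair in combinations(sorted(change), 2)
    )
    return {
        (a, b): n / (touched[a] + touched[b] - n)
        for (a, b), n in together.items()
    }


ratios = co_change_ratio(CHANGES)
login_session = ratios[("LoginHandler", "SessionCreator")]
catalog_search = ratios[("Catalog", "Search")]
```

A ratio of 1.0 means the two modules have never changed apart, which is a strong hint that the boundary between them is artificial. A ratio of 0.5 means they often change independently, so separate modules are easier to justify.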


Mistake 3: Boundaries Follow Technology, Not Domain

What It Looks Like

DatabaseModule          ← all database operations for all features
APIModule               ← all API endpoints for all features
UIModule                ← all user interface code for all features

Why It's Wrong

Where does "checkout" live? Partly in the API (the checkout endpoint), partly in the Database (saving the order), partly in the UI (the checkout page). The checkout logic is scattered across three modules. To understand checkout, you must read all three.

Where does "user registration" live? Also spread across all three modules. Now checkout and registration code live side-by-side in each module, even though they have nothing to do with each other.

The Result

  • Low cohesion: each module contains unrelated things (checkout + registration + search + ... all in the same "database module")
  • High coupling: changing checkout requires changing three modules simultaneously
  • Impossible to reason about: "where is the checkout logic?" → "everywhere"

After (Fixed):

Checkout                    Registration              Search
├── Checkout API endpoint   ├── Registration API      ├── Search API
├── Checkout database ops   ├── Registration DB ops   ├── Search index ops
└── Checkout UI page        └── Registration UI page  └── Search UI component

Each module contains everything needed for its domain — the API, the data access, and the UI. Now changing checkout only touches the Checkout module.

The Principle

Boundaries should follow the domain (what the system does), not the technology (how it's built). "Checkout" is a domain boundary. "Database" is a technology boundary. Domain boundaries create high cohesion. Technology boundaries create high coupling.


Mistake 4: The Shared Junk Drawer

What It Looks Like

Utils/
├── formatDate()
├── calculateShipping()
├── validateEmail()
├── generatePDF()
├── checkUserPermissions()
├── convertCurrency()
├── sendSlackMessage()
├── compressImage()
├── parseCSV()
└── retryWithBackoff()

Why It's Wrong

"Utils" is not a responsibility. It's a confession that nobody thought about where these things should live. Each function belongs somewhere:

Function | Actually Belongs In
formatDate() | Whichever module needs it, as a private helper. Or a shared "Date/Time" utility if multiple modules truly need the same formatting.
calculateShipping() | Fulfillment or Pricing
validateEmail() | Authentication or Accounts
generatePDF() | Billing (for invoices) or Reporting
checkUserPermissions() | Authentication/Authorization
convertCurrency() | Pricing
sendSlackMessage() | Communication/Notifications
compressImage() | Media/Content processing
parseCSV() | Import/Data Processing
retryWithBackoff() | This is genuinely cross-cutting: it's a shared infrastructure utility

The Damage

  • Half the system depends on "Utils," creating hidden coupling
  • Changing any function risks breaking modules you didn't know used it
  • The module grows without limit: there are no criteria for what should or shouldn't be in it
  • New developers dump everything there because it's the path of least resistance

After (Fixed):

Move each function to the module it actually belongs to. For the genuinely cross-cutting pieces (retry logic, date formatting if truly universal), create a named infrastructure module:

Infrastructure/Resilience    ← retryWithBackoff()
Infrastructure/Formatting    ← formatDate(), formatCurrency() (if truly shared)

These have names. They have boundaries. They are not growing junk drawers.
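For a sense of what belongs in a named infrastructure module, here is a minimal Python sketch of a retry helper with exponential backoff. The signature, the default values, and the injectable sleep parameter are assumptions, not a prescribed API. Notice that it knows nothing about orders, payments, or any other domain, which is what makes it genuinely cross-cutting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the delay after each failure.

    Re-raises the last exception once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Usage: an operation that fails twice, then succeeds.
calls = {"count": 0}
delays: list[float] = []


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Recording the delays instead of sleeping keeps the example instant.
result = retry_with_backoff(flaky, sleep=delays.append)
```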


Mistake 5: Hidden Cross-Boundary Coupling

What It Looks Like

The modules look clean on the architecture diagram, but they secretly share:

Shared database tables. Module A and Module B both read from and write to the same table. Neither "owns" it. If A changes the table structure, B breaks.

Shared data models. Both modules use the same internal data structures. Changing the structure in one requires changing the other.

Behavior assumptions. Module A depends on Module B processing items in a specific order, but that order isn't in the contract — it's just how B happens to work today. When B is optimized to process in a different order, A breaks.

A Concrete Example

Orders module and Shipping module both access the orders table directly.

Orders Module ──writes──► ┌─────────┐ ◄──reads── Shipping Module
                          │ orders  │
                          │  table  │
                          └─────────┘

This seems efficient. But:

  • Orders adds a new column → Shipping might break if it does SELECT *
  • Orders changes the status values from "ready" to "awaiting_shipment" → Shipping was filtering on "ready" and stops seeing orders
  • Shipping updates the tracking number directly in the orders table → Orders doesn't know it happened and might overwrite it

After (Fixed):

Orders Module                        Shipping Module
      │                                    ▲
      │  "here are orders                  │
      │   ready to ship"                   │
      ▼                                    │
┌──────────────────────────────────────────┐
│           Defined Contract               │
│  Orders provides: order_id, items,       │
│    shipping address, priority            │
│  Shipping returns: tracking_number,      │
│    estimated delivery date               │
└──────────────────────────────────────────┘

Each module owns its own data storage. Communication happens through defined contracts. Neither module needs to know how the other stores its data.
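The contract box above can be written down as two small data types. In this Python sketch the field names come from the box; the class names, the priority values, the tracking-number format, the delivery estimates, and the extra order_id on the confirmation are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ShipmentRequest:
    """What Orders provides."""
    order_id: str
    items: tuple[str, ...]
    shipping_address: str
    priority: str  # assumed values: "standard" or "express"


@dataclass(frozen=True)
class ShipmentConfirmation:
    """What Shipping returns."""
    order_id: str  # assumed: echoed back so Orders can match the reply
    tracking_number: str
    estimated_delivery: date


class Shipping:
    """Owns its own storage; it never reads the orders table."""
    def __init__(self) -> None:
        self._shipments: dict[str, ShipmentConfirmation] = {}

    def ship(self, request: ShipmentRequest, today: date) -> ShipmentConfirmation:
        if not request.items:
            raise ValueError("cannot ship an order with no items")
        days = 1 if request.priority == "express" else 5
        confirmation = ShipmentConfirmation(
            order_id=request.order_id,
            tracking_number=f"TRK-{len(self._shipments) + 1:06d}",
            estimated_delivery=today + timedelta(days=days),
        )
        self._shipments[request.order_id] = confirmation
        return confirmation


shipping = Shipping()
request = ShipmentRequest("order-42", ("mug",), "1 Main St", "express")
confirmation = shipping.ship(request, today=date(2024, 1, 1))
```

If Orders later renames its internal status values or adds columns, nothing here changes, because Shipping only ever sees a ShipmentRequest.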


How to Detect These Mistakes in Any System

Question to Ask | What a Bad Answer Reveals
"Can you explain this module in one sentence?" | God Module (the sentence uses "and" five times) or Micro-Boundary (the sentence is trivially short)
"If I change the inside of this module, what else breaks?" | Hidden coupling (anything other than "nothing" is concerning)
"What's in the Utils/Helpers/Common module?" | Junk drawer (if the answer takes more than 60 seconds, it's too big)
"Where does feature X live?" | Technology-based boundaries (if the answer is "parts of it are in three different modules")
"Do any two modules read from the same database table?" | Shared data coupling
"When was the last time you changed this module without fear?" | God Module or coupling (if the answer is "never," there's a problem)

Boundaries — Test Your Understanding

Answer each question thoroughly. Focus on defining clear responsibilities — what is inside, what is outside, and why.


Section A: Identification

Question 1

A restaurant has:

  • Customers who order food from a menu
  • Waitstaff who take orders and deliver food
  • A kitchen that prepares the food
  • A billing system that produces the check
  • A reservation system for booking tables

Identify the natural boundaries in this system. For each boundary, write what is inside it and what is explicitly outside it.


Question 2

Someone proposes the following module structure for a blogging platform:

  • DatabaseModule — all database operations
  • APIModule — all API endpoints
  • UIModule — all user-facing pages

What is wrong with this boundary structure? Propose a better one and explain why it's better.


Question 3

You encounter a module called "Utils" that contains:

  • A function that formats dates
  • A function that calculates shipping costs
  • A function that validates email addresses
  • A function that generates PDF reports
  • A function that checks if a user is logged in

For each function, identify which boundary it actually belongs to. Explain why "Utils" is not a real boundary.


Section B: Analysis

Question 4

Two modules exist in a system:

Module A: OrderProcessing

  • Creates orders
  • Calculates totals
  • Applies discount codes
  • Charges the customer's credit card
  • Sends a confirmation email

Module B: CustomerManagement

  • Stores customer profiles
  • Manages addresses
  • Tracks order history

Evaluate the boundaries. Is everything in the right place? Identify at least two items that might belong elsewhere, and explain your reasoning.


Question 5

A team is building a social media app. They have one module called "PostManager" that handles:

  • Creating posts
  • Editing posts
  • Deleting posts
  • Displaying the news feed
  • Recommending trending posts
  • Moderating reported posts
  • Tracking post analytics (views, shares)

This is becoming a God Module. Propose how to split it into smaller, well-defined boundaries. For each new boundary, apply the elevator test (one-sentence description).


Question 6

You have two modules: Inventory and Shipping. Currently:

  • Inventory knows how to check stock levels
  • Shipping needs to know stock levels before it can ship

Someone proposes: "Let's just let Shipping read directly from the Inventory database to check stock."

Using the concepts of cohesion and coupling, explain why this is problematic. Propose a better approach.


Section C: Design

Question 7

Design the boundary structure for a school management system with these requirements:

  • Students enroll in courses
  • Teachers are assigned to courses
  • Grades are recorded per student per course
  • Parents can view their child's grades
  • The school generates report cards each semester
  • Attendance is tracked daily

Produce:

  1. A list of modules with inside/outside definitions
  2. A connection diagram showing what each module needs from the others
  3. The elevator-test sentence for each module

Question 8

You are designing a ride-sharing app (like Uber). The core actions are:

  • Riders request a ride
  • Drivers accept rides
  • The system matches riders to nearby drivers
  • Pricing is calculated based on distance, time, and demand
  • Payments are processed after the ride
  • Both riders and drivers can rate each other

Draw the boundary structure. Pay special attention to: where does "matching" live? Where does "pricing" live? Are they the same boundary or different? Justify your decision.


Question 9

A startup asks you to architect a recipe sharing platform. Users can:

  • Create and share recipes
  • Search recipes by ingredient, cuisine, or dietary restriction
  • Save favorite recipes
  • Create meal plans for the week
  • Generate a shopping list from a meal plan
  • Follow other users and see their new recipes

Define the module boundaries. At least one of your decisions should involve a tradeoff — two reasonable options where you pick one. Explain the tradeoff and why you chose what you chose.


Section D: Critical Thinking

Question 10

"Every module should be completely independent and never talk to any other module."

Is this statement true, false, or misleading? Explain when connections between modules are necessary and how to have them without destroying boundary integrity.


Question 11

Two engineers disagree:

Engineer A: "Shopping cart and checkout should be one module. They're part of the same user flow."

Engineer B: "Shopping cart and checkout should be separate modules. A cart is about managing what you want to buy. Checkout is about paying for it."

Both have reasonable arguments. Evaluate both positions. Under what circumstances is A right? Under what circumstances is B right? What would you recommend for a small team building an MVP? What would you recommend for a large team building a mature platform?


Question 12

You inherit a system where a single module called "NotificationService" handles:

  • Deciding when to send notifications (business rules)
  • Deciding who to send them to (recipient logic)
  • Deciding what channel to use (email vs. push vs. SMS)
  • Formatting the message content
  • Actually sending via the appropriate channel
  • Logging what was sent
  • Managing user notification preferences

This module works fine today. Argue for or against splitting it up. If you split it, where do you draw the new boundaries? If you don't, explain what conditions would eventually force a split.


Grading Rubric

Criteria | What It Means
Clear inside/outside | Each boundary has an explicit list of what it owns and what it doesn't
Reasonable groupings | Related things are together, unrelated things are apart
Minimal coupling | Boundaries connect through narrow, well-defined channels, not through shared databases or deep knowledge of each other's internals
Defensible names | Every boundary can be explained in one sentence that a non-engineer would understand
Tradeoff awareness | Where a decision could go either way, you acknowledge the alternatives and explain your choice

Contracts and Interfaces — Why It Matters

Every Boundary Has a Door

In the previous section, you learned to draw boundaries — to define what a module is responsible for and what it isn't. But boundaries alone aren't enough. Modules need to talk to each other. The question is: how?

The answer is a contract: a precise agreement about what goes in, what comes out, and what happens when something goes wrong.

Without contracts, modules can't communicate reliably. With sloppy contracts, they communicate badly. With clear contracts, they communicate predictably, even when the people who built them have never met.

What Is a Contract?

A contract is a promise at a boundary:

"If you give me this (in this exact shape, meeting these exact conditions), I will give you back that (in this exact shape, with these exact guarantees). If you give me something I don't expect, here is exactly what will happen."

That's it. Three parts:

  1. Inputs — what the caller provides
  2. Outputs — what the caller receives
  3. Error cases — what happens when things aren't right

This exists everywhere in life:

  • A vending machine: insert $1.50 (input), select B4 (input), receive a bag of chips (output), or get your money back if the item is stuck (error case).
  • A postal service: provide a correctly addressed envelope with proper postage (input), the letter arrives within 3-5 business days (output), or it's returned to sender if the address is invalid (error case).

In software, contracts are the same idea applied to the boundaries between modules, features, and systems.

Why Engineers Care About This

You can build things in parallel

If two people agree on the contract between Module A and Module B, they can build those modules simultaneously without talking to each other again. Person A knows exactly what Module B will provide. Person B knows exactly what Module A will send. The contract is the agreement.

Without contracts, you get: "Wait, I thought you were sending me a list?" "No, I'm sending an object with a list inside it." "But my code expects a plain list." Two days of debugging for a conversation that should have happened upfront.

You can replace parts of the system

If the contract is clear, you can completely replace the internals of a module and nothing breaks — as long as the new version honors the same contract. This is how large systems evolve over years. The contract is the stable surface; the implementation behind it can change freely.

You can test things in isolation

If you know the contract, you can test a module without needing the rest of the system. Send it the defined inputs, check that you get the defined outputs. If the contract is vague, you're guessing.

You can debug faster

"The output is wrong." Okay — does the input match the contract? If yes, the bug is inside the module. If no, the bug is in whoever is calling the module. Contract thinking lets you cut the search space in half immediately.

What Happens Without Contracts

Assumptions replace agreements

"I assumed it would handle that case." "I assumed the data would always be in that format." "I assumed it would return an error if something was wrong." Assumptions are bugs waiting to happen. Contracts replace assumptions with explicit agreements.

Changes break everything

Without a contract, changing a module's behavior is a gamble. You don't know what other modules depend on, because the dependency was never formally defined. You change one thing and discover four other modules were relying on behavior that was never promised — just coincidental.

Nobody knows what anything does

"What does this module accept?" → "Uh, look at the code." If the answer to "what's the contract?" is "read the implementation," there is no contract. And that means nobody really knows what it does without reverse-engineering it every time.

Integration is a nightmare

Two modules need to connect. Without contracts, integration day is chaos — mismatched data formats, unexpected nulls, inconsistent error handling. With contracts, integration is mechanical: both sides already agreed on the interface, so you plug them together and it works.

Interfaces vs. Implementations

This is a critical distinction that separates experienced engineers from everyone else:

The interface is what something does (the contract — inputs, outputs, errors).

The implementation is how it does it internally.

Other modules should only ever depend on the interface. They should never know or care about the implementation. This principle has a name — information hiding — and it is one of the most important ideas in engineering.

Why? Because implementations change constantly. Algorithms get optimized. Databases get swapped. Libraries get updated. But if every other module depends on the implementation details, every change is a catastrophe. If they only depend on the interface, changes are invisible to the outside world.

Think of it like a restaurant kitchen. You (the customer) have a contract with the restaurant: you order from the menu (input), you receive food that matches the description (output), and if they're out of something, they tell you (error). You don't know or care whether the chef uses a gas stove or electric, whether they prep ingredients at 6am or buy them pre-cut, whether there's one cook or five.

The menu is the interface. The kitchen is the implementation. As long as the food is right, you're happy.
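
The menu/kitchen split can be sketched in a few lines of Python. The names below (OrderHistory, InMemoryOrderHistory, TupleLogOrderHistory, count_orders) are hypothetical and exist only for this illustration. The caller is written against the interface, so either implementation can be swapped in without the caller changing.

```python
from typing import Protocol


class OrderHistory(Protocol):
    """The interface: the only thing callers may rely on."""

    def orders_for(self, customer_id: str) -> list[str]:
        """Return the customer's order IDs; an empty list if there are none."""
        ...


class InMemoryOrderHistory:
    """One implementation: orders kept in a dictionary."""

    def __init__(self, data: dict[str, list[str]]) -> None:
        self._data = data

    def orders_for(self, customer_id: str) -> list[str]:
        return list(self._data.get(customer_id, []))


class TupleLogOrderHistory:
    """A different implementation: a flat log of (customer, order) pairs."""

    def __init__(self, log: list[tuple[str, str]]) -> None:
        self._log = log

    def orders_for(self, customer_id: str) -> list[str]:
        return [order for customer, order in self._log if customer == customer_id]


def count_orders(history: OrderHistory, customer_id: str) -> int:
    """A caller that depends only on the interface, never on the storage."""
    return len(history.orders_for(customer_id))


a = InMemoryOrderHistory({"C1": ["O1", "O2"]})
b = TupleLogOrderHistory([("C1", "O1"), ("C2", "O9"), ("C1", "O2")])
print(count_orders(a, "C1"), count_orders(b, "C1"))  # prints: 2 2
```

Swapping a for b is the code equivalent of the kitchen changing stoves: count_orders cannot tell the difference.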

Why This Is the Lesson That Separates Beginners From Professionals

Beginners think about how to make it work. Professionals think about how to define the interface so that anyone can make it work.

When an experienced engineer approaches a new problem, they don't start coding. They start defining contracts:

  • "This module will accept a customer ID and return the customer's order history as a list of orders. If the customer doesn't exist, it returns an empty list, not an error."
  • "This service will accept an image file up to 10MB in JPEG or PNG format and return a resized version. If the file is too large or the wrong format, it returns an error with a human-readable reason."

These statements are complete enough that anyone (or any LLM) could implement them. That's the point: the contract is the spec. If you can write clear contracts, you can build systems. If you can't, you're guessing — no matter how much code you know.

Contracts and Interfaces — How: The Method

The Contract Template

A well-defined contract has five components: a name, the inputs it accepts, the outputs it returns, its error cases, and its side effects. Use this template every time:

CONTRACT: [Name]

ACCEPTS:
  - [input 1]: [type/shape] — [constraints]
  - [input 2]: [type/shape] — [constraints]

RETURNS:
  - [output]: [type/shape] — [guarantees]

ERRORS:
  - [condition] → [response]
  - [condition] → [response]

SIDE EFFECTS:
  - [what else occurs, if anything]

Designing Good Inputs

Be specific about shape

Not "a customer" — but "a customer ID (text, 8-12 alphanumeric characters)." Not "order data" — but "order containing: list of items (each with product_id and quantity), shipping address, and payment method."

Vague inputs create ambiguity. Ambiguity creates bugs.

Distinguish required from optional

Some inputs must always be present. Others have reasonable defaults. Make this explicit:

ACCEPTS:
  - search_query: text — required, 1-200 characters
  - page_number: number — optional, defaults to 1
  - results_per_page: number — optional, defaults to 20, maximum 100
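
As a rough sketch, the search contract above could map onto a Python function like this. The function name search and the returned dictionary are assumptions made for illustration. The lower bounds on page_number and results_per_page are also assumptions, since the contract only states defaults and a maximum.

```python
def search(search_query: str, page_number: int = 1, results_per_page: int = 20) -> dict:
    """Check the inputs against the contract, then report the values in effect."""
    if not 1 <= len(search_query) <= 200:
        raise ValueError("search_query is required and must be 1-200 characters")
    if page_number < 1:  # assumption: the contract gives no lower bound
        raise ValueError("page_number must be at least 1")
    if not 1 <= results_per_page <= 100:  # the lower bound of 1 is an assumption
        raise ValueError("results_per_page must be between 1 and 100")
    return {
        "search_query": search_query,
        "page_number": page_number,
        "results_per_page": results_per_page,
    }


print(search("dune"))
print(search("dune", page_number=3, results_per_page=50))
```

The required input has no default, so leaving it out fails immediately. The optional inputs carry their defaults in the signature, where every caller can see them.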

Define constraints

What's valid? What's invalid?

  • "email: text — must contain exactly one @ symbol and at least one . after the @"
  • "quantity: number — must be a positive integer, maximum 999"
  • "date: text — must be in YYYY-MM-DD format, cannot be in the past"
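
Each of the three constraints above is precise enough to be checked mechanically. A minimal sketch in Python, with hypothetical function names; the today parameter is passed in explicitly so the "not in the past" rule can be checked against a known date.

```python
import re
from datetime import date


def email_ok(email: str) -> bool:
    """Exactly one @, and at least one . somewhere after it."""
    if email.count("@") != 1:
        return False
    domain = email.split("@")[1]
    return "." in domain


def quantity_ok(quantity: int) -> bool:
    """A positive integer no larger than 999 (True/False are rejected)."""
    return type(quantity) is int and 1 <= quantity <= 999


def date_ok(text: str, today: date) -> bool:
    """YYYY-MM-DD, a real calendar date, and not earlier than today."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text):
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        parsed = date(year, month, day)
    except ValueError:  # for example month 13 or February 30
        return False
    return parsed >= today
```

Note that these checks implement the constraints exactly as written and nothing more. The email rule, for instance, is far looser than real-world email validation; that is a property of the stated constraint, not of the code.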

Designing Good Outputs

Be explicit about guarantees

Don't just say "returns a list." Say:

  • "Returns a list of orders, sorted by date descending"
  • "Returns an empty list if no orders exist" (different from returning an error)
  • "Each order contains: order_id, date, total, and status"

Define the shape

RETURNS:
  - user:
    - id: text
    - name: text
    - email: text
    - created_date: text (YYYY-MM-DD format)
    - is_active: yes/no

Handle empty results explicitly

What happens when there's nothing to return?

  • Return an empty list?
  • Return nothing at all?
  • Return an error?

These are three different behaviors. The contract must specify which one.
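
To make the difference concrete, here is the same lookup written three ways in Python. The data and function names are hypothetical. A caller written for one behavior will misbehave against another, which is why the contract has to pick one.

```python
from typing import Optional

# Hypothetical data: customer C2 exists but has no orders.
ORDERS = {"C1": ["O-100", "O-101"], "C2": []}


def orders_or_empty_list(customer_id: str) -> list:
    """Behavior 1: nothing to return means an empty list."""
    return list(ORDERS.get(customer_id, []))


def orders_or_nothing(customer_id: str) -> Optional[list]:
    """Behavior 2: nothing to return means no value at all (None)."""
    orders = ORDERS.get(customer_id)
    return list(orders) if orders else None


def orders_or_error(customer_id: str) -> list:
    """Behavior 3: nothing to return is reported as an error."""
    orders = ORDERS.get(customer_id)
    if not orders:
        raise LookupError(f"No orders found for {customer_id}")
    return list(orders)
```

A caller that loops over the result works with behavior 1, crashes on behavior 2 unless it checks for None first, and never reaches the loop with behavior 3.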


Designing Good Error Cases

Errors are part of the contract, not an afterthought.

Be exhaustive

List every way the operation can fail:

ERRORS:
  - Input validation:
    - email is empty → error: "Email is required"
    - email format is invalid → error: "Invalid email format"
    - email already exists → error: "Email already registered"
  - Business rules:
    - account is suspended → error: "Account suspended"
  - System failures:
    - database unreachable → error: "Service temporarily unavailable"

Distinguish caller errors from system errors

  • Caller errors: "you sent me bad input" — expected, the caller should handle them
  • System errors: "something is broken internally" — unexpected, needs investigation
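
One possible way to keep the two kinds of error apart in code is to give each its own exception type. The sketch below uses hypothetical names (ContractError, CallerError, SystemFailure, register, handle) and borrows two error messages from the registration example above. The handling strings are placeholders for whatever the caller would really do.

```python
class ContractError(Exception):
    """Base class for every error the contract promises."""


class CallerError(ContractError):
    """The caller sent bad input. Expected; the caller should handle it."""


class SystemFailure(ContractError):
    """Something is broken internally. Unexpected; needs investigation."""


def register(email: str, database_up: bool = True) -> str:
    if not email:
        raise CallerError("Email is required")
    if not database_up:  # stands in for a real connectivity check
        raise SystemFailure("Service temporarily unavailable")
    return f"registered {email}"


def handle(email: str, database_up: bool = True) -> str:
    """A caller that responds differently to each category."""
    try:
        return register(email, database_up)
    except CallerError as exc:
        return f"show to user: {exc}"
    except SystemFailure as exc:
        return f"alert on-call: {exc}"
```

Because the category is part of the error itself, the caller does not have to parse message text to decide whether to show a message or raise an alarm.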

Define recovery guidance

  • "If 'rate limit exceeded,' wait 60 seconds and retry"
  • "If 'session expired,' re-authenticate and retry"
  • "If 'item out of stock,' do not retry — display message to user"
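
Recovery guidance like this can be written down as a simple lookup table, so that callers follow the contract instead of improvising. A small Python sketch; the table layout, the zero-second wait for "session expired", and the "escalate" fallback for undocumented errors are all assumptions made for illustration.

```python
from typing import Optional

# error text -> (what the caller should do, seconds to wait before retrying)
# A wait of None means "do not retry".
RECOVERY = {
    "rate limit exceeded": ("retry", 60),
    "session expired": ("re-authenticate and retry", 0),
    "item out of stock": ("display message to user", None),
}


def next_step(error: str) -> tuple[str, Optional[int]]:
    """Look up the documented recovery; unknown errors are escalated."""
    return RECOVERY.get(error, ("escalate", None))


print(next_step("rate limit exceeded"))
print(next_step("item out of stock"))
```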

Documenting Side Effects

A side effect is anything the operation does besides returning a value:

  • PlaceOrder returns the order ID, but also sends a confirmation email
  • DeleteAccount returns success, but also erases all stored data
  • Login returns a session token, but also logs the event and updates the last-login timestamp

If a side effect is not documented, someone will depend on it unknowingly.


The Contract Review Checklist

When evaluating a contract, ask:

  • Can a stranger implement this? Could someone who has never seen the system build a correct implementation from this contract alone?
  • Are all inputs fully defined? Shape, constraints, required vs. optional?
  • Are all outputs fully defined? Shape, guarantees, empty behavior?
  • Is every error case listed? Input validation, business rules, system failures?
  • Are side effects documented? Everything beyond returning the output?
  • Is the contract implementation-free? Does it say what without saying how?
  • Could this contract survive a rewrite? If internals were completely replaced, would it still make sense?

What to Look For in the Examples

The following pages show complete contract sets for three different systems. As you read:

  1. Notice the level of detail in inputs — every constraint, every edge case
  2. Notice how errors are categorized — caller errors vs. system errors
  3. Notice side effects you wouldn't have thought of — logging, notifications, state changes
  4. Notice how contracts chain together — the output of one becomes the input of the next
  5. Compare the same type of operation across different systems — a "create" operation in a library vs. a restaurant vs. a bank

Contracts — Example: Library Checkout System

The Scenario

A patron visits the library, finds a book, and checks it out. Later, they return it. If it's late, a fine is assessed. They can also place a hold on a book that's currently checked out by someone else.

We'll define the full contract for every operation in the Circulation module.


Contract 1: Check Out a Book

CONTRACT: CheckOutBook

ACCEPTS:
  - patron_id: text — required, must be a valid library card number (format: LIB-XXXXX 
    where X is a digit)
  - copy_id: text — required, must be a valid physical copy ID (format: CPY-XXXXXXX)

RETURNS:
  - checkout_record:
    - checkout_id: text (unique identifier for this checkout)
    - patron_id: text
    - copy_id: text
    - book_title: text (included for convenience — pulled from Catalog)
    - checkout_date: date (YYYY-MM-DD, always today)
    - due_date: date (YYYY-MM-DD, always 14 days from checkout_date)

ERRORS:
  - patron_id not found → error: "Unknown patron" (caller error)
  - patron account is expired → error: "Patron account expired. Renewal required." (caller error)
  - patron account is suspended → error: "Patron account suspended. Contact librarian." (caller error)
  - patron has reached checkout limit (10 books) → error: "Checkout limit reached. 
    Return a book before checking out another." (caller error)
  - patron has unpaid fines over $25 → error: "Outstanding fines exceed limit. 
    Payment required before checkout." (caller error)
  - copy_id not found → error: "Unknown copy" (caller error)
  - copy is not currently available (already checked out) → error: "Copy not available. 
    Currently checked out. Consider placing a hold." (caller error)
  - copy is marked damaged/withdrawn → error: "Copy not available for checkout" (caller error)
  - database unreachable → error: "System temporarily unavailable. Please try again." (system error)

SIDE EFFECTS:
  - Copy status changed from "available" to "checked out" in Catalog
  - Patron's active checkout count incremented
  - Checkout event logged with timestamp, patron_id, copy_id, librarian_id (who processed it)
  - If patron had a hold on this book, the hold is consumed (removed from hold queue)

Why This Level of Detail Matters

Notice the error cases. There are 9 distinct error conditions. A beginner would list 2 or 3 ("book not found, patron not found"). An experienced engineer knows that each of these 9 conditions requires a different response from the caller:

  • "Patron expired" → the librarian can renew them on the spot
  • "Fines exceed limit" → the librarian directs them to payment
  • "Copy not available" → suggest placing a hold (a different operation)
  • "System unavailable" → retry later (completely different from the others)

Each error tells the caller what to do next. That's a good contract.
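
For readers who want to see how directly such a contract translates, here is a partial Python sketch of CheckOutBook over in-memory data. It is an illustration under assumptions, not the implementation: the class names, the order in which conditions are checked, and the storage are all invented, and it omits checkout_id, book_title, logging, hold consumption, and the database error.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from decimal import Decimal


class CheckoutError(Exception):
    def __init__(self, message: str, kind: str) -> None:
        super().__init__(message)
        self.kind = kind  # "caller" or "system"


@dataclass
class Patron:
    status: str = "active"  # "active", "expired", or "suspended"
    checkouts: int = 0
    unpaid_fines: Decimal = Decimal("0")


@dataclass
class Library:
    patrons: dict[str, Patron] = field(default_factory=dict)
    copies: dict[str, str] = field(default_factory=dict)  # copy_id -> status


def check_out_book(lib: Library, patron_id: str, copy_id: str, today: date) -> dict:
    def fail(message: str) -> CheckoutError:
        return CheckoutError(message, "caller")

    patron = lib.patrons.get(patron_id)
    if patron is None:
        raise fail("Unknown patron")
    if patron.status == "expired":
        raise fail("Patron account expired. Renewal required.")
    if patron.status == "suspended":
        raise fail("Patron account suspended. Contact librarian.")
    if patron.checkouts >= 10:
        raise fail("Checkout limit reached. Return a book before checking out another.")
    if patron.unpaid_fines > 25:
        raise fail("Outstanding fines exceed limit. Payment required before checkout.")
    status = lib.copies.get(copy_id)
    if status is None:
        raise fail("Unknown copy")
    if status == "checked out":
        raise fail("Copy not available. Currently checked out. Consider placing a hold.")
    if status != "available":  # damaged or withdrawn
        raise fail("Copy not available for checkout")

    # Side effects documented in the contract.
    lib.copies[copy_id] = "checked out"
    patron.checkouts += 1
    return {
        "patron_id": patron_id,
        "copy_id": copy_id,
        "checkout_date": today.isoformat(),
        "due_date": (today + timedelta(days=14)).isoformat(),
    }
```

Almost every line of the function is an error case from the contract. The happy path is the last few lines.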


Contract 2: Return a Book

CONTRACT: ReturnBook

ACCEPTS:
  - copy_id: text — required, must be a valid physical copy ID

    Note: patron_id is NOT required. The system looks up who has this copy checked out.
    This matches real-world behavior — you return a book, not "your checkout record."

RETURNS:
  - return_record:
    - return_id: text
    - checkout_id: text (the original checkout this return closes)
    - patron_id: text (who had it)
    - copy_id: text
    - checkout_date: date
    - due_date: date
    - return_date: date (today)
    - days_overdue: number (0 if on time, positive if late)
    - fine_assessed: currency (0.00 if on time)

ERRORS:
  - copy_id not found → error: "Unknown copy"
  - copy is not currently checked out → error: "This copy is not checked out"
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Copy status changed from "checked out" to "available" in Catalog
  - Patron's active checkout count decremented
  - If days_overdue > 0, a fine is created in the Finances module
    (amount = days_overdue × $0.25, capped at replacement cost of book)
  - If there is a hold queue for this book, the next patron in the queue is notified
    (via Communication module)
  - Return event logged with timestamp, copy_id, condition notes (if any)

Key Design Decisions in This Contract

The return contract accepts copy_id, not patron_id. This is a deliberate design choice that matches the physical reality: a librarian scans the book, not the patron's card. The system figures out who had it. This reduces errors (the patron doesn't need their card to return a book).

Fine creation is a side effect, not a return value. The return operation calculates the fine and includes the amount in the return record (for display), but the fine itself is created in the Finances module as a side effect. ReturnBook doesn't need to know how fines are stored or managed.

Hold notification is a cascading side effect. Returning a book might mean someone else is waiting for it. The contract documents this so that whoever implements it knows they must check the hold queue.
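
The fine rule in the contract (days_overdue × $0.25, capped at the replacement cost) is small enough to write out. A minimal Python sketch, assuming a hypothetical assess_fine function and using Decimal so that currency amounts stay exact:

```python
from datetime import date
from decimal import Decimal

DAILY_FINE = Decimal("0.25")


def assess_fine(due_date: date, return_date: date, replacement_cost: Decimal) -> tuple[int, Decimal]:
    """Return (days_overdue, fine_assessed) as defined in the ReturnBook contract."""
    days_overdue = max(0, (return_date - due_date).days)
    fine = min(days_overdue * DAILY_FINE, replacement_cost)
    return days_overdue, fine


# Three days late on a book that costs $12.00 to replace.
print(assess_fine(date(2024, 3, 15), date(2024, 3, 18), Decimal("12.00")))
```

An early or on-time return gives zero days overdue and a zero fine, and a very late return never costs more than the book itself.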


Contract 3: Place a Hold

CONTRACT: PlaceHold

ACCEPTS:
  - patron_id: text — required, valid library card number
  - book_id: text — required, valid book ID (not copy_id — the patron wants
    the book, not a specific physical copy)

RETURNS:
  - hold_record:
    - hold_id: text
    - patron_id: text
    - book_id: text
    - book_title: text
    - hold_date: date (today)
    - queue_position: number (1 = you're next, 2 = one person ahead of you, etc.)
    - estimated_availability: text ("approximately 2 weeks" based on due dates
      of current checkouts and queue length)

ERRORS:
  - patron_id not found → error: "Unknown patron"
  - patron account expired/suspended → error: "Account not active"
  - book_id not found → error: "Unknown book"
  - patron already has a hold on this book → error: "Hold already exists for this book"
  - patron currently has this book checked out → error: "You currently have this book.
    Return it instead of placing a hold."
  - patron has reached hold limit (5 holds) → error: "Hold limit reached"
  - at least one copy of this book is available (no need for a hold) → error: "Copies are
    available now. No hold needed — check it out directly."
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Hold is added to the queue for this book
  - Hold event logged

What Makes This Contract Interesting

Book vs. Copy distinction. When checking out, you specify a copy (a physical item). When placing a hold, you specify a book (the title). The system decides which copy to assign when one becomes available. This distinction matters because it's a different level of abstraction — and the contract makes it explicit.

"Copies are available" is an error. You can place a hold on a book that has copies available — but this contract treats it as an error because the correct action is to check it out, not hold it. This is a business rule baked into the contract. A different library might allow it. The contract forces the decision to be explicit.

Estimated availability is a best guess. The contract says "approximately" — this sets expectations. The caller knows not to treat this as a guarantee.


Contract 4: Cancel a Hold

CONTRACT: CancelHold

ACCEPTS:
  - hold_id: text — required

RETURNS:
  - confirmation:
    - hold_id: text
    - status: "cancelled"
    - cancelled_date: date

ERRORS:
  - hold_id not found → error: "Unknown hold"
  - hold has already been fulfilled (book was checked out) → error: "Hold already
    fulfilled. Book was checked out on [date]."
  - hold was already cancelled → error: "Hold was already cancelled on [date]"
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Hold removed from queue
  - All patrons behind this one in the queue move up one position
  - If the book has an available copy and there's a next-in-line patron,
    that patron is notified
  - Cancellation event logged
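
The queue behavior shared by PlaceHold and CancelHold can be pictured as an ordered list. The Python sketch below is a simplification under assumptions: holds are identified by patron rather than by hold_id, the queue is a plain list for one book, and notifications and logging are left out.

```python
def place_hold(queue: list[str], patron_id: str) -> int:
    """Add the patron to the end of the queue and return their position."""
    if patron_id in queue:
        raise ValueError("Hold already exists for this book")
    queue.append(patron_id)
    return len(queue)


def cancel_hold(queue: list[str], patron_id: str) -> None:
    """Remove the hold; everyone behind it moves up one position."""
    if patron_id not in queue:
        raise ValueError("Unknown hold")
    queue.remove(patron_id)


def queue_position(queue: list[str], patron_id: str) -> int:
    """1 means next in line."""
    return queue.index(patron_id) + 1


dune_queue: list[str] = []
place_hold(dune_queue, "patron_B")
place_hold(dune_queue, "patron_C")
cancel_hold(dune_queue, "patron_B")
print(queue_position(dune_queue, "patron_C"))  # prints: 1
```

The "move up one position" side effect needs no extra code here because position is derived from list order. An implementation that stored positions as numbers would have to renumber them, which is exactly why the contract spells the side effect out.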

How These Contracts Work Together

Let's trace a complete scenario:

Patron A checks out the last copy of "Dune." Patron B wants it and places a hold. Patron A returns it late.

Step | Contract Called | Key Data Flow
1 | CheckOutBook(patron_A, copy_42) | Returns checkout record. Copy marked "checked out."
2 | PlaceHold(patron_B, book_dune) | Returns hold record. Queue position = 1. Estimated availability = "approximately 2 weeks."
3 | (14 days pass. Patron A doesn't return the book.) |
4 | (Day 17. Patron A returns the book.) |
5 | ReturnBook(copy_42) | Returns: days_overdue = 3, fine_assessed = $0.75. Side effects: (a) Fine created in Finances. (b) Hold queue checked — Patron B is next. (c) Communication module notifies Patron B: "Your hold is ready."
6 | (Patron B receives notification and comes to the library.) |
7 | CheckOutBook(patron_B, copy_42) | Returns checkout record. Side effect: Patron B's hold is consumed (removed from queue).

Notice how the contracts chain together through side effects. ReturnBook doesn't call PlaceHold or Communication directly — but its side effects trigger actions in other modules. The contracts document this so that the chain is visible and predictable.
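
One common way to get this kind of indirect chaining is a publish/subscribe mechanism, sketched below in Python. This is an assumption about how the chain could be built, not something the contracts prescribe. The event name, handler names, and data are hypothetical.

```python
from collections import defaultdict
from typing import Callable

handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
notifications: list[str] = []  # stands in for the Communication module
hold_queues = {"book_dune": ["patron_B"]}


def subscribe(event: str, handler: Callable[[dict], None]) -> None:
    handlers[event].append(handler)


def publish(event: str, payload: dict) -> None:
    for handler in handlers[event]:
        handler(payload)


def notify_next_in_queue(payload: dict) -> None:
    """Holds logic: react to a return without being called by it directly."""
    queue = hold_queues.get(payload["book_id"], [])
    if queue:
        notifications.append(f"{queue[0]}: Your hold is ready.")


subscribe("book_returned", notify_next_in_queue)


def return_book(copy_id: str, book_id: str) -> dict:
    """Announces the return; it knows nothing about holds or notifications."""
    publish("book_returned", {"copy_id": copy_id, "book_id": book_id})
    return {"copy_id": copy_id, "status": "available"}
```

Calling return_book("copy_42", "book_dune") leaves one message for patron_B in notifications, even though return_book never mentions holds. The documented side effect is what tells a reader that this connection exists.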


Summary: What This Example Teaches

  1. Error cases outnumber happy paths — each contract has one successful outcome but between three and nine distinct error conditions
  2. Side effects connect modules — the explicit side effects section shows cross-boundary impacts
  3. Contracts encode business rules — "can't place a hold if copies are available" is a policy, not a technical limitation
  4. Input specificity matters — copy_id vs. book_id is not a minor detail; it changes the entire meaning
  5. Contracts chain through events — one contract's side effect is another contract's trigger

Contracts — Example: Restaurant Ordering System

The Scenario

A restaurant with table service and online ordering. Customers dine in or order delivery. Waitstaff take orders at the table. Kitchen receives orders and marks them complete. The system calculates bills, splits checks, and processes payment. Tips are recorded.

This is a different domain from the library — more real-time, more physical-world interaction, and more complex pricing.


Contract 1: Create Table Order

CONTRACT: CreateTableOrder

ACCEPTS:
  - table_number: number — required, must be a valid table in the system (1-30)
  - server_id: text — required, must be a valid staff ID for an active server
  - party_size: number — required, must be 1-12

RETURNS:
  - order:
    - order_id: text (unique)
    - table_number: number
    - server_id: text
    - server_name: text
    - party_size: number
    - opened_at: timestamp
    - status: "open"
    - items: empty list (no items yet)
    - subtotal: 0.00

ERRORS:
  - table_number not found → error: "Invalid table number"
  - table already has an active order → error: "Table [N] already has an open order 
    (order_id: [X]). Close or transfer it first."
  - server_id not found → error: "Unknown server"
  - server is clocked out → error: "Server is not currently clocked in"
  - party_size is 0 or negative → error: "Party size must be at least 1"
  - party_size exceeds table capacity → error: "Table [N] seats [X]. Party of [Y] 
    requires a different table."
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Table status changed to "occupied" in the floor plan system
  - Order opened event logged with timestamp

Design Notes

Table capacity checking. The contract validates party size against table capacity — a business rule that prevents operational problems (8 people at a 4-person table). This data comes from the floor plan, which is configuration data.

"Table already has an active order" is common. Servers sometimes forget to close a tab. Instead of silently creating a second order, the error forces the server to deal with the existing one.


Contract 2: Add Item to Order

CONTRACT: AddItemToOrder

ACCEPTS:
  - order_id: text — required
  - menu_item_id: text — required, must be a valid item from the active menu
  - quantity: number — required, must be 1-20
  - modifications: list of text — optional (e.g., ["no onions", "extra cheese", 
    "sub gluten-free bun"])
  - seat_number: number — optional (for tracking who ordered what within a party)
  - special_instructions: text — optional, max 200 characters

RETURNS:
  - updated_order:
    - order_id: text
    - items: list (now includes the new item)
      - Each item:
        - line_item_id: text (unique per item in the order)
        - menu_item_id: text
        - item_name: text
        - quantity: number
        - unit_price: currency
        - modifications: list of text
        - modification_charges: currency (extra cheese = $1.50, etc.)
        - line_total: currency (quantity × (unit_price + modification_charges))
        - seat_number: number or null
        - special_instructions: text or empty
        - status: "ordered"
    - subtotal: currency (updated sum of all line_totals)

ERRORS:
  - order_id not found → error: "Unknown order"
  - order is not open (already closed/paid) → error: "Order is closed. Cannot add items."
  - menu_item_id not found → error: "Unknown menu item"
  - menu item is unavailable (86'd) → error: "[Item name] is currently unavailable"
  - modification is not recognized → error: "Unknown modification: [text]. 
    Available modifications: [list]"
  - modification is not applicable to this item → error: "'[modification]' cannot be 
    applied to [item name]"
  - quantity exceeds limit → error: "Maximum quantity per line item is 20"

SIDE EFFECTS:
  - Order sent to kitchen display (Transport to kitchen) with new item(s) highlighted
  - If item has an allergy flag (e.g., contains nuts), allergy alert included in
    kitchen display
  - Inventory for ingredients decremented (optional — depends on whether the restaurant
    tracks ingredient inventory in real time)
  - Item addition logged with server_id and timestamp

Why Modifications Are Complex

Modifications look simple ("no onions") but they create contract complexity:

  1. Some modifications are free ("no onions" — they're removing something)
  2. Some modifications have a charge ("extra cheese" = $1.50, "add avocado" = $2.00)
  3. Some modifications are impossible ("sub gluten-free bun" on a salad)
  4. Some modifications create allergy implications ("add peanut sauce")

The contract must handle all of these. A vague contract ("accepts modifications: list") leaves all of this to guesswork.
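
A short Python sketch shows how the first three cases feed into the line_total formula from the contract (quantity × (unit_price + modification_charges)). The menu data is hypothetical: the two charges come from the examples above, while the item names, the allowed-modification sets, and the zero charge for the gluten-free bun are assumptions. Allergy flags are left out.

```python
from decimal import Decimal

# Hypothetical menu configuration.
MOD_CHARGES = {
    "no onions": Decimal("0.00"),
    "extra cheese": Decimal("1.50"),
    "add avocado": Decimal("2.00"),
    "sub gluten-free bun": Decimal("0.00"),  # assumed free for this sketch
}
ALLOWED = {
    "burger": {"no onions", "extra cheese", "add avocado", "sub gluten-free bun"},
    "salad": {"no onions", "add avocado"},
}


def line_total(item: str, unit_price: Decimal, quantity: int, modifications: list[str]) -> Decimal:
    """quantity x (unit_price + modification_charges), after validating each modification."""
    if item not in ALLOWED:
        raise ValueError("Unknown menu item")
    charges = Decimal("0.00")
    for mod in modifications:
        if mod not in MOD_CHARGES:
            raise ValueError(f"Unknown modification: {mod}")
        if mod not in ALLOWED[item]:
            raise ValueError(f"'{mod}' cannot be applied to {item}")
        charges += MOD_CHARGES[mod]
    return quantity * (unit_price + charges)


# Two burgers at $10.00 with extra cheese and no onions: 2 x (10.00 + 1.50)
print(line_total("burger", Decimal("10.00"), 2, ["extra cheese", "no onions"]))  # prints: 23.00
```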


Contract 3: Send Order to Kitchen

CONTRACT: SendToKitchen

ACCEPTS:
  - order_id: text — required
  - items_to_send: list of line_item_ids — optional. If empty, sends all 
    items with status "ordered" (not yet sent)

    Note on "courses": A server might take the full order upfront but send 
    appetizers to the kitchen first, entrées later. This contract supports that 
    by allowing partial sends.

RETURNS:
  - kitchen_ticket:
    - ticket_id: text
    - order_id: text
    - table_number: number
    - items: list of items being sent
      - Each item: name, quantity, modifications, special instructions, seat number
    - sent_at: timestamp
    - allergy_alerts: list (any items flagged with allergy concerns)
    - estimated_prep_time: minutes (calculated from item prep times)

ERRORS:
  - order_id not found → error: "Unknown order"
  - no items to send (all items already sent or order is empty) → error: 
    "No unsent items on this order"
  - line_item_id not found in order → error: "Item [id] not found on order [id]"
  - kitchen is in "overflow" status → warning (not error): "Kitchen is backed up. 
    Current estimated wait: [X] minutes." (Order is still accepted — this is 
    informational.)

SIDE EFFECTS:
  - Items' status changed from "ordered" to "sent to kitchen"
  - Kitchen display updated with new ticket
  - Ticket printed at the appropriate kitchen station (grill items → grill station, 
    salads → cold station, etc.)
  - Estimated wait time sent back to server's device

The Course Problem

Real restaurants have courses. Appetizers go first, then entrées, then dessert. The contract handles this by allowing the server to choose which items to send. But the contract doesn't enforce course ordering — a server could send desserts first. Is that an error?

Decision: No. The contract allows it. The server might have a reason (the customer wants dessert only). Course ordering is a matter of server training, not something the system enforces. This is a deliberate contract design choice: not every rule belongs in the software.
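
As a rough illustration, the item-selection rule from SendToKitchen's ACCEPTS section could be sketched in Python as follows. The function name select_items_to_send and the dictionary shape are assumptions made for this example, and skipping already-sent items in an explicit list is a choice the contract itself leaves open. Nothing here checks course order, which matches the decision above.

```python
def select_items_to_send(order_items, items_to_send=None):
    """Pick the line items a SendToKitchen call should send.

    order_items: dict mapping line_item_id -> status (hypothetical shape).
    items_to_send: optional list of line_item_ids; empty or None means
    "send everything still in status 'ordered'".
    """
    if not items_to_send:
        selected = [item_id for item_id, status in order_items.items()
                    if status == "ordered"]
    else:
        for item_id in items_to_send:
            if item_id not in order_items:
                raise LookupError(f"Item {item_id} not found on order")
        # Assumption: items that were already sent are skipped, not rejected.
        selected = [item_id for item_id in items_to_send
                    if order_items[item_id] == "ordered"]
    if not selected:
        raise ValueError("No unsent items on this order")
    return selected


order = {"calamari": "ordered", "burger": "ordered", "pasta": "ordered"}
appetizers = select_items_to_send(order, ["calamari"])  # partial send
order["calamari"] = "sent to kitchen"
entrees = select_items_to_send(order)  # everything not yet sent
```

The second call passes no list, so it picks up whatever is still in status "ordered", which is how the entrée send works in a course-by-course flow.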


Contract 4: Close Order and Calculate Bill

CONTRACT: CalculateBill

ACCEPTS:
  - order_id: text — required
  - split_method: one of ["no_split", "equal_split", "by_seat", "custom"]
    - If "equal_split": split_count: number (how many ways to split, 2-12)
    - If "by_seat": (no additional input — each seat gets their items)
    - If "custom": custom_splits: list of {split_label: text, line_item_ids: list}

RETURNS:
  - bill:
    - order_id: text
    - splits: list of:
      - split_id: text
      - split_label: text ("Check 1", "Seat 3", "Jordan's portion", etc.)
      - items: list of items in this split
      - subtotal: currency
      - tax: currency (calculated from local tax rate)
      - total: currency (subtotal + tax)
    - order_subtotal: currency (pre-tax sum of all splits)
    - order_tax: currency
    - order_total: currency
    - gratuity_suggestion:
      - 15_percent: currency
      - 18_percent: currency
      - 20_percent: currency
      - 25_percent: currency
      (calculated on pre-tax subtotal)

ERRORS:
  - order_id not found → error: "Unknown order"
  - order has no items → error: "Cannot generate bill for empty order"
  - order has items with status "sent to kitchen" but not "completed" → 
    warning: "Kitchen has not completed all items. Generate bill anyway?"
  - split_method "by_seat" but some items have no seat assigned → error: 
    "[N] items have no seat number. Assign seats or use a different split method."
  - custom_splits don't cover all items → error: "The following items are not 
    assigned to any split: [list]"
  - custom_splits assign the same item to multiple splits → error: "Item [name] 
    is assigned to multiple splits"
  - database unreachable → error: "System temporarily unavailable"

SIDE EFFECTS:
  - Order status changed to "bill generated"
  - Bill event logged with split details and timestamp

The Split Check Problem

Check splitting might be the most complex everyday contract. Consider:

  • Equal split — simple math, but what about items that cost significantly more? Equitable ≠ equal.
  • By seat — requires every item to be assigned to a seat. If the server didn't track seats, this fails.
  • Custom — maximum flexibility, but the contract must verify that all items are covered (no orphans) and no item is double-counted.

The contract handles all three approaches with clear errors for each. A weaker contract would just say "accepts split_method: text" and leave all validation to implementation.
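
The two checks the custom split relies on (no orphans, no double-counting) can be sketched in a few lines of Python. This is an illustration under assumptions: the function name validate_custom_splits and the plain-dict input shape are invented for the example, and it does not handle split entries that reference items missing from the order.

```python
from collections import Counter


def validate_custom_splits(order_item_ids, custom_splits):
    """Return a list of error messages; an empty list means the splits are valid.

    order_item_ids: list of line item ids on the order.
    custom_splits: list of dicts like
        {"split_label": "Check 1", "line_item_ids": ["a", "b"]}
    """
    # Count how many splits each item appears in.
    assigned = Counter(
        item_id
        for split in custom_splits
        for item_id in split["line_item_ids"]
    )
    errors = []
    orphans = [item_id for item_id in order_item_ids if assigned[item_id] == 0]
    if orphans:
        errors.append(
            f"The following items are not assigned to any split: {orphans}"
        )
    for item_id in order_item_ids:
        if assigned[item_id] > 1:
            errors.append(f"Item {item_id} is assigned to multiple splits")
    return errors
```

Returning every problem at once, rather than stopping at the first, lets the server fix the whole split in one pass.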


Contract 5: Process Payment

CONTRACT: ProcessPayment

ACCEPTS:
  - split_id: text — required (pays one split at a time)
  - payment_method: one of ["cash", "credit_card", "debit_card", "gift_card"]
    - If credit/debit: card_token: text (tokenized card data, never raw card numbers)
    - If gift_card: card_number: text, pin: text
    - If cash: amount_tendered: currency
  - tip_amount: currency — optional, default 0.00

RETURNS:
  - payment_receipt:
    - payment_id: text
    - split_id: text
    - amount_charged: currency
    - tip_amount: currency
    - total_charged: currency (amount + tip)
    - payment_method: text
    - change_due: currency (only for cash, 0.00 otherwise)
    - paid_at: timestamp

ERRORS:
  - split_id not found → error: "Unknown split"
  - split already paid → error: "This split has already been paid"
  - credit card declined → error: "Card declined. Reason: [reason from processor]"
  - gift card insufficient balance → error: "Gift card balance is [X]. 
    Split total is [Y]. Remaining [Z] must be paid by another method."
  - cash amount_tendered is less than total → error: "Amount tendered ([X]) 
    is less than total ([Y])"
  - tip_amount is negative → error: "Tip cannot be negative"
  - database unreachable → error: "System temporarily unavailable"
  - payment processor unreachable → error: "Payment system temporarily unavailable. 
    Cash payment available."

SIDE EFFECTS:
  - Split marked as "paid"
  - If all splits for the order are paid, order status changed to "closed"
  - If order is closed, table status changed to "available" in floor plan
  - Payment logged for accounting/end-of-day reconciliation
  - Tip recorded and attributed to server for tip-out calculations
  - If cash: cash drawer amount updated
  - Receipt generated for printing or digital delivery
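
The cash branch of this contract comes down to a small calculation, sketched below in Python. The function name cash_change_due is hypothetical, and the sketch assumes that "total" in the cash error case means split total plus tip, which the contract does not state outright. Decimal is used so that currency amounts are not subject to binary floating-point rounding.

```python
from decimal import Decimal


def cash_change_due(split_total, tip_amount, amount_tendered):
    """Return change due for a cash payment, or raise with the contract's error text."""
    if tip_amount < 0:
        raise ValueError("Tip cannot be negative")
    # Assumption: the amount owed is the split total plus the tip.
    total = split_total + tip_amount
    if amount_tendered < total:
        raise ValueError(
            f"Amount tendered ({amount_tendered}) is less than total ({total})"
        )
    return amount_tendered - total


change = cash_change_due(Decimal("24.50"), Decimal("3.00"), Decimal("30.00"))
```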

How These Contracts Chain Together

A complete table service flow:

CreateTableOrder(table_5, server_12, party_4)
    ↓
AddItemToOrder(order_001, "calamari", qty=1, seat=4)
AddItemToOrder(order_001, "burger", qty=2, mods=["no onions"], seat=1)
AddItemToOrder(order_001, "pasta", qty=1, seat=2)
AddItemToOrder(order_001, "salmon", qty=1, seat=3)
    ↓
SendToKitchen(order_001, items=["calamari"])       ← Appetizer first
    ↓
    ... kitchen prepares and marks complete ...
    ↓
SendToKitchen(order_001)                            ← Remaining items (entrées)
    ↓
    ... kitchen prepares and marks complete ...
    ↓
CalculateBill(order_001, split_method="by_seat")
    ↓
ProcessPayment(split_seat1, credit_card, tip=8.00)
ProcessPayment(split_seat2, cash, amount_tendered=30.00)
ProcessPayment(split_seat3, credit_card, tip=6.00)
ProcessPayment(split_seat4, gift_card, tip=5.00)
    ↓
Order closed. Table 5 available.

Each step has its own contract. Each step can fail independently with clear error messages. The chain is explicit — no hidden dependencies.


Comparing Library vs. Restaurant Contracts

Aspect | Library | Restaurant
------ | ------- | ----------
Complexity of inputs | Simple (IDs) | Complex (modifications, split methods, multiple payment types)
Error case count per contract | 6-9 | 7-10
Side effects crossing modules | Catalog, Finances, Communication | Kitchen display, Floor plan, Accounting
Time sensitivity | Relaxed (books are due in 14 days) | High (food gets cold, customers get impatient)
Physical-world interaction | Book in/out | Food preparation, cash handling
Business rules in contracts | "Can't hold available books" | "Can't split by seat without seat assignments"
Partial operations | Partial (hold vs. checkout) | Extensive (courses, split checks, partial payments, gift card remainder)

The restaurant contracts are more complex because the domain is more complex — but the structure is identical: name, inputs, outputs, errors, side effects.

Contracts — Example: Bank Transfer System

The Scenario

A banking system where customers transfer money between their own accounts and to other people's accounts. Wire transfers to external banks are supported. Daily limits and fraud detection are enforced. Every transaction must be auditable.

This is the highest-stakes contract environment of the three examples. A vague contract in a bank means money appears, disappears, or doubles. There is zero tolerance for ambiguity.


Contract 1: Internal Transfer (Between Own Accounts)

CONTRACT: TransferBetweenOwnAccounts

ACCEPTS:
  - customer_id: text — required, authenticated customer
  - source_account_id: text — required, must belong to customer_id
  - destination_account_id: text — required, must belong to customer_id
  - amount: currency — required, must be positive, two decimal places maximum
    (e.g., 100.00, not 100.001)
  - memo: text — optional, max 140 characters (customer's note for their records)

RETURNS:
  - transfer_record:
    - transfer_id: text (unique, used for all future references to this transaction)
    - source_account_id: text
    - source_new_balance: currency
    - destination_account_id: text
    - destination_new_balance: currency
    - amount: currency
    - memo: text or empty
    - executed_at: timestamp (precise to millisecond)
    - status: "completed"

ERRORS:
  - customer_id not authenticated → error: "Authentication required"
  - source_account_id not found → error: "Account not found"
  - source_account does not belong to customer → error: "Account not found" 
    (IMPORTANT: same message as "not found" — never reveal that the account 
    exists but belongs to someone else)
  - destination_account_id not found → error: "Account not found"
  - destination_account does not belong to customer → error: "Account not found"
  - source and destination are the same account → error: "Source and destination 
    must be different accounts"
  - amount is zero or negative → error: "Amount must be greater than zero"
  - amount has more than 2 decimal places → error: "Amount must not exceed 
    two decimal places"
  - insufficient funds (source balance < amount) → error: "Insufficient funds. 
    Available balance: [X]"
  - source account is frozen → error: "Account is restricted. Contact support."
  - destination account is frozen → error: "Destination account is restricted. 
    Contact support."
  - daily transfer limit exceeded → error: "Daily transfer limit of [X] reached. 
    Transferred today: [Y]. Remaining: [Z]. Resets at midnight [timezone]."
  - database unreachable → error: "Service temporarily unavailable. Your transfer 
    has not been processed. Please try again."

SIDE EFFECTS:
  - Source account balance decreased by amount (atomic operation)
  - Destination account balance increased by amount (atomic operation)
  - Two transaction records created (one debit, one credit) — both reference
    the same transfer_id for traceability
  - Daily transfer running total updated for this customer
  - Transaction event logged with full details (all inputs, all outputs, 
    timestamp, IP address, device info)
  - If amount exceeds $10,000: regulatory reporting flag set (Currency 
    Transaction Report required by law)
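
The amount rules in the ACCEPTS section (positive, at most two decimal places) map directly onto the two amount errors. A minimal Python sketch follows; the function name parse_amount and the choice to accept the amount as text are assumptions for illustration. Note that this strict version also rejects "100.000", since it counts digits as written.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse a transfer amount, raising ValueError with the contract's messages."""
    try:
        amount = Decimal(text)
    except InvalidOperation:
        raise ValueError("Amount must be a number")
    if not amount.is_finite():
        # Rejects NaN and Infinity before any comparison is attempted.
        raise ValueError("Amount must be a number")
    if amount <= 0:
        raise ValueError("Amount must be greater than zero")
    # A negative exponent is the number of digits after the decimal point.
    if amount.as_tuple().exponent < -2:
        raise ValueError("Amount must not exceed two decimal places")
    return amount
```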

Critical Design Point: Atomicity

The most important word in this contract is "atomic." Both the deduction from the source and the addition to the destination must happen as a single, indivisible operation. You cannot have a state where:

  • Money has left the source but not arrived at the destination (money lost)
  • Money has arrived at the destination but not left the source (money created)

This is the same "two-phase commit" concept from the ATM example. The contract specifies atomic behavior, and the implementation must guarantee it — how it does so is an implementation detail, but the guarantee is part of the contract.
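
One common way an implementation can honor the atomicity guarantee is a database transaction. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table layout, the helper names, and the use of integer cents are all assumptions for illustration, not something the contract prescribes.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("checking", 10000), ("savings", 500)],
    )
    conn.commit()
    return conn


def transfer(conn, source_id, dest_id, amount_cents):
    """Debit and credit inside one transaction: both happen or neither does."""
    # "with conn" commits if the block finishes and rolls back if it raises.
    with conn:
        debit = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ? "
            "WHERE id = ? AND balance_cents >= ?",
            (amount_cents, source_id, amount_cents),
        )
        if debit.rowcount != 1:
            raise ValueError("Insufficient funds or unknown source account")
        credit = conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount_cents, dest_id),
        )
        if credit.rowcount != 1:
            # Raising here rolls back the debit that already ran.
            raise LookupError("Destination account not found")


def balances(conn):
    return dict(conn.execute("SELECT id, balance_cents FROM accounts"))
```

If the credit fails after the debit has run, the rollback restores the source balance, so neither the "money lost" nor the "money created" state is ever visible.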

Critical Design Point: Security in Error Messages

Notice that "account not found" and "account doesn't belong to you" return the same error message. This is intentional. If the system said "account 12345 exists but doesn't belong to you," an attacker could probe for valid account numbers. The contract uses identical error messages for different failure reasons to prevent information leakage.
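
A small Python sketch of this rule: two different internal failure reasons, one external message. The in-memory ACCOUNT_OWNERS table and the names used here are hypothetical stand-ins for a real account store.

```python
class AccountNotFound(Exception):
    """Raised for both 'no such account' and 'not your account'."""


# Hypothetical store: account_id -> owning customer_id.
ACCOUNT_OWNERS = {"acct-1": "cust-A", "acct-2": "cust-B"}


def require_owned_account(customer_id, account_id):
    """Return account_id if the customer owns it; otherwise raise AccountNotFound."""
    owner = ACCOUNT_OWNERS.get(account_id)
    if owner is None or owner != customer_id:
        # The real reason can go to an internal audit log, but the caller
        # sees the same message in both cases.
        raise AccountNotFound("Account not found")
    return account_id
```

Because both branches raise the same exception with the same text, a caller probing account numbers cannot tell a missing account from someone else's account.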


Contract 2: Transfer to Another Person

CONTRACT: TransferToOtherCustomer

ACCEPTS:
  - customer_id: text — required, authenticated customer (the sender)
  - source_account_id: text — required, must belong to customer_id
  - recipient_identifier: one of:
    - account_number: text (direct account number)
    - email: text (if recipient has registered email for receiving transfers)
    - phone: text (if recipient has registered phone for receiving transfers)
  - amount: currency — required, positive, two decimal places max
  - memo: text — optional, max 140 characters

RETURNS:
  - transfer_record:
    - transfer_id: text
    - source_account_id: text
    - source_new_balance: currency
    - recipient_display: text (recipient's name, partially masked: "J*** Smith")
    - amount: currency
    - memo: text or empty
    - status: one of:
      - "completed" (instant transfer, recipient is at the same bank)
      - "pending" (recipient at external bank, or amount triggers review)
    - executed_at: timestamp
    - estimated_arrival: text (if pending — "within 1 business day" or 
      "within 3 business days")

ERRORS:
  (All errors from TransferBetweenOwnAccounts, PLUS:)
  - recipient not found → error: "No account found for this recipient"
  - recipient account is closed → error: "Recipient account is not active"
  - sender and recipient are the same person → error: "Use internal transfer 
    for transfers between your own accounts" (different flow, different limits)
  - amount exceeds person-to-person daily limit → error: "Person-to-person 
    daily limit of [X] reached"
  - fraud detection flag → error: "Transfer requires additional verification. 
    Please contact support or verify via [method]." 
    (The transfer is NOT processed. It is held.)

SIDE EFFECTS:
  - Source balance decreased by amount
  - If same bank and status = "completed": recipient balance increased immediately
  - If different bank and status = "pending": transfer queued for batch processing
  - If amount > $3,000 to a new recipient: additional verification step triggered
    (two-factor authentication sent to customer's phone)
  - Both sender's and recipient's transaction histories updated
  - Transfer event logged (sender's IP, device fingerprint, amount, recipient info)
  - Fraud scoring model updated with this transfer's characteristics
  - If amount > $10,000: regulatory reporting flag set
  - Notification sent to recipient (if they have notifications enabled)

Key Difference From Internal Transfer

The recipient might not get the money immediately. This introduces the concept of eventual consistency — the sender's balance changes now, but the recipient's balance might change later. The contract makes this explicit through the status and estimated_arrival fields.

Fraud detection can block the transfer. Unlike internal transfers (low risk), person-to-person transfers are a fraud vector. The contract includes a specific error for this case that instructs the caller to handle it as a "held" state — not a rejection, not a success, but a third state.
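
From the caller's side, the three outcomes can be modeled explicitly so that none is forgotten. The Python sketch below is illustrative only: the enum name TransferOutcome, the label "held", and the message wording are assumptions, not part of the contract.

```python
from enum import Enum


class TransferOutcome(Enum):
    COMPLETED = "completed"
    PENDING = "pending"
    HELD = "held"  # assumption: a label for the fraud-review state


def message_for(outcome, estimated_arrival=None):
    """Map each outcome to what the sender is told."""
    if outcome is TransferOutcome.COMPLETED:
        return "Transfer complete."
    if outcome is TransferOutcome.PENDING:
        return f"Transfer pending. Estimated arrival: {estimated_arrival}."
    if outcome is TransferOutcome.HELD:
        return "Transfer requires additional verification. It has not been processed."
    raise ValueError(f"Unhandled outcome: {outcome}")
```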


Contract 3: Wire Transfer (External Bank)

CONTRACT: WireTransfer

ACCEPTS:
  - customer_id: text — required, authenticated
  - source_account_id: text — required, must belong to customer_id
  - recipient_name: text — required, full legal name
  - recipient_bank_routing_number: text — required, 9 digits
  - recipient_account_number: text — required
  - amount: currency — required, positive, two decimal places max
  - wire_type: one of ["domestic", "international"]
    - If "international":
      - swift_code: text — required, 8 or 11 characters
      - recipient_bank_name: text — required
      - recipient_bank_address: text — required
      - recipient_country: text — required, ISO country code
  - purpose: text — required for international (regulatory requirement), 
    optional for domestic, max 200 characters
  - memo: text — optional, max 140 characters

RETURNS:
  - wire_record:
    - wire_id: text
    - source_account_id: text
    - source_new_balance: currency (amount + wire fee already deducted)
    - recipient_name: text
    - amount: currency
    - wire_fee: currency (displayed separately — $25 domestic, $45 international)
    - total_deducted: currency (amount + wire_fee)
    - status: "pending" (wires are never instant)
    - submitted_at: timestamp
    - estimated_arrival: text ("1-2 business days" domestic, 
      "3-5 business days" international)
    - confirmation_number: text (for tracking with the wire network)

ERRORS:
  (All standard account/amount errors, PLUS:)
  - routing_number invalid format → error: "Routing number must be exactly 9 digits"
  - routing_number not found in bank directory → error: "Unknown routing number. 
    Verify with recipient's bank."
  - swift_code invalid format → error: "SWIFT code must be 8 or 11 characters"
  - wire_type is "international" and purpose is empty → error: "Purpose is 
    required for international wire transfers"
  - insufficient funds for amount + wire_fee → error: "Insufficient funds. 
    Transfer amount ([X]) + wire fee ([Y]) = [Z]. Available balance: [W]."
  - wire transfer daily limit exceeded → error: "Daily wire limit exceeded"
  - customer has not completed wire transfer authorization form → error: 
    "Wire transfer authorization required. Complete enrollment first."
  - fraud or compliance hold → error: "Transfer requires manual review. 
    Expected completion: [1-2 business days]. Reference: [case_id]."

SIDE EFFECTS:
  - Source balance decreased by (amount + wire_fee) — this happens immediately
    even though the wire is "pending"
  - Wire queued for submission to Federal Reserve wire network (domestic) 
    or SWIFT network (international)
  - Compliance review automatically triggered for:
    - Any international wire
    - Domestic wires over $10,000
    - Wires to certain countries (OFAC screening)
  - Customer receives confirmation email with wire details
  - Wire event logged with full audit trail

Why Wire Contracts Are Maximally Detailed

The money leaves immediately, but the wire takes days. This creates a period where the customer's balance is reduced but the recipient hasn't received anything. The contract must make this clear — and the side effects section must document that the balance deduction is immediate even though delivery is not.

Regulatory requirements are part of the contract. Purpose is required for international wires — not because the bank wants it, but because the law requires it. The contract enforces this. If you omitted it, the implementation might skip it, and the bank could face legal penalties.

Wire fees are not optional. The contract explicitly includes the fee and shows the total deduction. A vague contract might say "returns amount" without clarifying whether the fee is included or separate — this ambiguity could cause accounting errors.
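
The fee arithmetic the contract spells out can be sketched in Python as follows. The fee values come from the RETURNS section above; the function name wire_deduction and the dictionary it returns are assumptions for illustration.

```python
from decimal import Decimal

# Fee schedule as stated in the WireTransfer contract.
WIRE_FEES = {
    "domestic": Decimal("25.00"),
    "international": Decimal("45.00"),
}


def wire_deduction(amount, wire_type, available_balance):
    """Compute fee, total deducted, and new balance, or raise if funds are short."""
    fee = WIRE_FEES[wire_type]
    total = amount + fee
    if total > available_balance:
        raise ValueError(
            f"Insufficient funds. Transfer amount ({amount}) + wire fee ({fee}) "
            f"= {total}. Available balance: {available_balance}."
        )
    return {
        "amount": amount,
        "wire_fee": fee,
        "total_deducted": total,
        "source_new_balance": available_balance - total,
    }


result = wire_deduction(Decimal("1000.00"), "domestic", Decimal("1500.00"))
```

Keeping amount, wire_fee, and total_deducted as separate fields is what removes the "is the fee included?" ambiguity described above.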


Comparing All Three Domain Examples

Aspect | Library | Restaurant | Bank
------ | ------- | ---------- | ----
Strictest constraint | Checkout limits | Food timing | Atomicity + compliance
Error message security | Low concern | Low concern | Critical (never reveal account existence)
Side effects count | 3-4 per contract | 4-6 per contract | 6-10 per contract
Regulatory requirements | Minimal | Health codes (not in contracts) | Extensive (CTR, OFAC, wire auth)
Can you "undo" the operation? | Yes (return the book) | Partially (can void before kitchen) | Depends (internal yes, wire maybe not)
Money involved | Small fines | Meal costs | Unlimited
Time horizon | 14 days (loan period) | Minutes (meal duration) | Days (wire processing)
Highest-stakes error case | Lost book ($25 replacement) | Food allergy incident | Money loss, legal violation

The contract structure is identical across all three: name, inputs, outputs, errors, side effects. But the rigor scales with the stakes. A missing error case in the library contract is an inconvenience. A missing error case in the bank contract is a potential financial loss or legal violation.

This is the core lesson: the contract template is universal, but the thoroughness is proportional to what's at risk.

Contracts — Composing Contract Chains

Why Composition Matters

Real features are never one contract. They are chains — sequences of contracts where each step's output feeds the next step's input. The chain's success depends on every link, and a failure at any point must be handled.

Composing contracts is where design becomes architecture.


The Three Rules of Contract Chains

Rule 1: Output Shape Must Match Input Shape

If Contract A returns a validated_cart and Contract B accepts a validated_cart, they connect. If A returns a cart (not validated) and B expects validated_cart, they don't — and the gap will produce a bug.
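
One way to make the shape mismatch visible is to give the validated and unvalidated carts different types. The Python sketch below is a minimal illustration; the class names and the in_stock set are assumptions, and a real system might lean on a static type checker instead of a runtime check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cart:
    items: tuple


@dataclass(frozen=True)
class ValidatedCart:
    items: tuple


def validate_cart(cart, in_stock):
    """Contract A: Cart in, ValidatedCart out (or an error)."""
    missing = [item for item in cart.items if item not in in_stock]
    if missing:
        raise ValueError(f"Out of stock: {missing}")
    return ValidatedCart(items=cart.items)


def reserve_inventory(cart):
    """Contract B: accepts only a ValidatedCart."""
    if not isinstance(cart, ValidatedCart):
        raise TypeError("reserve_inventory expects a ValidatedCart")
    return f"reservation for {len(cart.items)} item(s)"


raw = Cart(items=("mug", "pen"))
reservation = reserve_inventory(validate_cart(raw, in_stock={"mug", "pen"}))
```

Passing raw straight to reserve_inventory fails immediately with a TypeError, which is far easier to find than a bug caused by reserving items that were never checked.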

Rule 2: Every Link Must Define Its Failure Behavior

What happens if step 3 of 5 fails? Does the whole chain stop? Do steps 1 and 2 need to be undone? Does the chain skip step 3 and continue? The answer must be explicit for every link.

Rule 3: Side Effects Complicate Rollback

If step 1 sends a confirmation email and step 3 fails, you can't "unsend" the email. Side effects are often irreversible, so the chain must account for which steps can be undone and which can't.


Worked Example 1: E-Commerce Checkout

A customer clicks "Place Order." Here's the full chain:

Step 1: ValidateCart
  IN:  cart_id
  OUT: validated_cart (items confirmed in stock, prices confirmed current)
  FAIL: "Item X is out of stock" → stop, show error, suggest alternatives

Step 2: CalculateTotal
  IN:  validated_cart, discount_code (optional), shipping_address
  OUT: order_total (subtotal, discount_amount, tax, shipping, grand_total)
  FAIL: "Invalid discount code" → stop, show error, let customer fix
  FAIL: "Cannot ship to this address" → stop, show error

Step 3: ReserveInventory
  IN:  validated_cart
  OUT: reservation_id (items held for 10 minutes)
  FAIL: "Item X went out of stock since cart was validated" → stop, show 
        error, go back to step 1
  NOTE: This is a TEMPORARY hold. If step 4 fails, or step 5 fails and the 
        payment is refunded, the reservation is released.

Step 4: ProcessPayment
  IN:  grand_total, payment_method
  OUT: payment_confirmation (transaction_id, status)
  FAIL: "Card declined" → release reservation (undo step 3), show error
  FAIL: "Payment service unavailable" → release reservation, show error,
        suggest retry

Step 5: CreateOrder
  IN:  validated_cart, order_total, payment_confirmation, customer_id
  OUT: order_record (order_id, status = "confirmed")
  FAIL: "System error creating order" → THIS IS CRITICAL. Payment was 
        already processed. Must either: (a) retry order creation, or 
        (b) refund the payment. Never leave money charged without an order.

Step 6: SendConfirmation
  IN:  order_record, customer_email
  OUT: (none — fire and forget)
  FAIL: "Email service unavailable" → log the failure, do NOT undo the 
        order. The order is valid. Email can be resent later.
  SIDE EFFECT: Email sent to customer.

The Failure Cascade

Let's visualize what happens at each failure point:

Fails At | Steps Completed | What Must Be Undone | User Sees
-------- | --------------- | ------------------- | ---------
Step 1 | None | Nothing | "Item out of stock": redirect to cart
Step 2 | Cart validated | Nothing (validation has no side effects) | "Invalid discount code": fix and retry
Step 3 | Cart validated, total calculated | Nothing (no side effects yet) | "Item just went out of stock": back to cart
Step 4 | Cart validated, total calculated, inventory reserved | Release inventory reservation | "Card declined": try another card
Step 5 | All above + payment charged | Refund payment + release inventory | "System error: please contact support" + automatic refund
Step 6 | Everything above + order created | Nothing to undo; order is valid | Order succeeds. Email will be retried later.

This table is the most valuable artifact in the design. It shows exactly what's at risk at each step and what recovery looks like.
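
The "What Must Be Undone" column can be expressed as code by pairing each step with an optional undo action and running the undos in reverse when a later step fails. The Python sketch below is a simplified illustration of that idea, not a full checkout: run_checkout, demo, and the log messages are all hypothetical, and the steps just append to a log instead of calling real services.

```python
def run_checkout(steps):
    """Run (name, action, undo) steps in order.

    If a step raises, undo the steps that already completed, newest first,
    and report where the chain failed.
    """
    completed = []
    for name, action, undo in steps:
        try:
            action()
        except Exception as exc:
            for _done_name, done_undo in reversed(completed):
                if done_undo is not None:
                    done_undo()
            return f"failed at {name}: {exc}"
        completed.append((name, undo))
    return "ok"


def demo(card_ok):
    log = []

    def pay():
        if not card_ok:
            raise RuntimeError("Card declined")
        log.append("payment charged")

    steps = [
        # Validation has no side effects, so it has nothing to undo.
        ("ValidateCart", lambda: log.append("cart validated"), None),
        ("ReserveInventory",
         lambda: log.append("inventory reserved"),
         lambda: log.append("reservation released")),
        ("ProcessPayment", pay, lambda: log.append("payment refunded")),
        ("CreateOrder", lambda: log.append("order created"), None),
    ]
    return run_checkout(steps), log
```

Calling demo(False) stops at ProcessPayment and releases the reservation, mirroring the Step 4 row of the table; a step that never completed (the payment itself) is not undone.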


Worked Example 2: Employee Onboarding

A new employee is onboarded into a company's systems. This is a multi-system chain involving HR, IT, Facilities, Payroll, and more.

Step 1: CreateEmployeeRecord
  IN:  name, role, department, start_date, manager_id, salary
  OUT: employee_id, employee_record
  FAIL: "Manager not found" → stop, HR fixes manager assignment
  FAIL: "Duplicate employee (matching name + DOB)" → stop, HR investigates

Step 2: SetupPayroll
  IN:  employee_id, salary, tax_withholding_info, bank_account (for direct deposit)
  OUT: payroll_enrollment_confirmation
  FAIL: "Invalid bank routing number" → stop, request corrected info
        (Step 1 persists — employee record exists but payroll isn't set up)
  SIDE EFFECT: Employee added to next payroll cycle

Step 3: CreateITAccounts
  IN:  employee_id, role, department
  OUT: email_address, system_credentials, access_permissions_list
  FAIL: "Email address conflict (name.lastname already taken)" → 
        auto-generate alternative (name.middle.lastname), proceed
  SIDE EFFECTS: Email created, VPN access granted, software licenses assigned

Step 4: AssignEquipment
  IN:  employee_id, role, department
  OUT: equipment_list (laptop model, monitor, phone, badge)
  FAIL: "Laptop model out of stock" → substitute, proceed with warning
  SIDE EFFECT: Equipment reserved in inventory, shipping initiated

Step 5: SetupWorkspace
  IN:  employee_id, department, start_date
  OUT: workspace_assignment (building, floor, desk number)
  FAIL: "No desks available in department area" → assign temporary desk, 
        add to waitlist
  SIDE EFFECT: Desk reserved in facilities system

Step 6: SendWelcomePackage
  IN:  employee_id, email_address, start_date, workspace, equipment_list
  OUT: (confirmation)
  FAIL: Non-critical — retry later
  SIDE EFFECT: Welcome email sent with first-day instructions

Key Differences From E-Commerce Chain

Not all steps are dependent. Steps 2, 3, 4, and 5 can happen in parallel — they all need employee_id from step 1, but they don't need each other's outputs. This changes the chain from a strict sequence to a fan-out:

                                ┌── Step 2: Payroll
                                ├── Step 3: IT Accounts
Step 1: Create Record ──────────┤
                                ├── Step 4: Equipment
                                └── Step 5: Workspace
                                          │
                         All complete ─────┘
                                          │
                                    Step 6: Welcome

Failures don't cascade backward. If IT can't create an account, that doesn't mean HR needs to delete the employee record. Each step has its own failure handling. This is a design choice — the chain is tolerant of partial completion, unlike the e-commerce chain where payment requires inventory reservation.

Some failures are handled with substitution, not cancellation. "Laptop out of stock" → substitute a different model. "Email conflict" → generate alternative. The chain tries to continue whenever possible.
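
A fan-out that tolerates partial completion can be sketched with Python's standard concurrent.futures module. The step functions below are hypothetical placeholders for the real HR, IT, and Facilities calls; the point is only that one step's failure is recorded without cancelling the others.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(employee_id, steps):
    """Run independent steps concurrently; return (results, failures) by step name."""
    results, failures = {}, {}
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(step, employee_id) for name, step in steps.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failed step is recorded, not propagated to its siblings.
                failures[name] = str(exc)
    return results, failures


def setup_payroll(employee_id):
    raise ValueError("Invalid bank routing number")


def create_it_accounts(employee_id):
    return f"{employee_id}@example.com"


def assign_equipment(employee_id):
    return ["laptop", "badge"]


results, failures = fan_out(
    "emp-42",
    {
        "payroll": setup_payroll,
        "it_accounts": create_it_accounts,
        "equipment": assign_equipment,
    },
)
```

Here payroll fails while the IT and equipment steps still complete, so the welcome step could proceed with a note that payroll needs corrected information.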


Worked Example 3: Medical Lab Test Process

A doctor orders a blood test. The sample is collected, processed, and results are delivered.

Step 1: OrderTest
  IN:  doctor_id, patient_id, test_type (e.g., "complete blood count"), 
       urgency ("routine" | "urgent" | "stat"), clinical_notes
  OUT: test_order (order_id, patient_name, test_type, collection_instructions)
  FAIL: "Patient has allergy flagged for this test prep" → warning to doctor
  SIDE EFFECT: Order appears on lab's work queue

Step 2: CollectSample
  IN:  order_id, collector_id (phlebotomist), patient_id_verification 
       (wristband scan or verbal confirmation of DOB)
  OUT: sample_record (sample_id, collection_time, tube_type, volume)
  FAIL: "Patient ID verification failed (wristband doesn't match order)" 
        → HARD STOP. Do not collect. This prevents testing the wrong 
        patient's blood — a potentially fatal error.
  FAIL: "Insufficient sample volume" → recollect
  SIDE EFFECT: Sample labeled with barcode, linked to order_id

Step 3: ProcessSample
  IN:  sample_id
  OUT: processing_record (processing_start_time, analyzer_id, status)
  FAIL: "Sample hemolyzed (damaged)" → error to collector: "Recollection 
        needed. Reason: hemolysis." → back to Step 2
  FAIL: "Analyzer malfunction" → route to backup analyzer
  SIDE EFFECT: Sample processing logged for quality control

Step 4: AnalyzeResults
  IN:  processing_record
  OUT: raw_results (values for each test component, reference ranges, 
       flags for abnormal values)
  FAIL: "Results outside analyzable range" → flag for manual review 
        by lab technician
  SIDE EFFECT: Results stored in lab information system

Step 5: ReviewResults
  IN:  raw_results, patient_history (previous test results for comparison)
  OUT: reviewed_results (same as raw, plus: technician_notes, 
       critical_value_flag)
  FAIL: None — this step always produces a result (even if the result 
        is "requires further testing")
  SIDE EFFECT: If critical value detected (life-threatening result), 
  IMMEDIATE notification to ordering doctor — this must happen within 
  minutes, not hours. This is a regulatory requirement.

Step 6: DeliverResults
  IN:  reviewed_results, doctor_id, patient_id
  OUT: delivery_confirmation
  FAIL: "Doctor not available" → deliver to covering physician
  SIDE EFFECTS: Results appear in patient's medical record, 
  doctor receives notification in their clinical dashboard

Why This Chain Is Unique

Patient safety creates hard stops. Step 2 has a verification check that cannot be bypassed or substituted. If the patient ID doesn't match, the chain stops completely. No alternative, no workaround. This is the highest-stakes failure in the chain — a wrong patient's blood being analyzed means wrong treatment decisions.

Some failures loop backward. "Sample hemolyzed" at step 3 sends the chain back to step 2 (recollect). This isn't a simple linear chain — it has loops.

Time sensitivity varies by step. Routine orders might wait hours at each step. Stat orders bypass the queue at every step. The urgency flag changes the behavior of every contract in the chain without changing the contracts themselves — it's a priority signal that travels through the chain.

Critical value notification is a side effect that overrides normal flow. Normally, results go through all steps sequentially. But if step 5 detects a critical value (e.g., dangerously low blood sugar), the side effect triggers an immediate alert — even before step 6 formally delivers the results. The side effect has higher priority than the main chain.
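
The backward loop between steps 3 and 2 can be sketched as a bounded retry in Python. The attempt limit and the function names are assumptions added for illustration; the chain as described does not say how many recollections are allowed.

```python
def collect_until_usable(collect_sample, process_sample, max_attempts=3):
    """Repeat Step 2 (collect) and Step 3 (process) until the sample is usable.

    Returns (sample_id, attempts_used). The max_attempts cap is an assumption.
    """
    for attempt in range(1, max_attempts + 1):
        sample_id = collect_sample()        # Step 2: CollectSample
        status = process_sample(sample_id)  # Step 3: ProcessSample
        if status != "hemolyzed":
            return sample_id, attempt
        # Hemolyzed: loop back to Step 2 and recollect.
    raise RuntimeError("Recollection limit reached")


samples = iter(["S1", "S2", "S3"])
outcome = collect_until_usable(
    lambda: next(samples),
    lambda sample_id: "hemolyzed" if sample_id == "S1" else "processed",
)
```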


Composing Contracts: Summary Principles

Principle | What It Means
--------- | -------------
Map the happy path first | Get the chain right when everything works, then add failure handling
Define the failure point for every step | What happens here if this fails? Stop? Undo? Substitute? Skip?
Identify irreversible steps | Emails sent, payments charged, physical actions taken: these can't be undone
Look for parallelizable steps | Not every chain is strictly sequential; find steps that don't depend on each other
Look for loops | Some failures send you back to an earlier step. Map these explicitly.
Time sensitivity shapes the chain | A chain that must complete in 200 milliseconds is designed very differently from one that spans 5 business days
The rollback plan is as important as the happy path | For every step that changes state, document how to reverse it if a later step fails

Contracts and Interfaces — Test Your Understanding

Answer each question by writing contracts in plain language. No code. Focus on precision and completeness.


Section A: Write the Contract

Question 1

Write the contract for a password reset operation. A user provides their email address and requests a password reset. Think through: what are all the inputs? What are all the outputs? What can go wrong? What side effects occur?

Use the template:

CONTRACT: ResetPassword
ACCEPTS: ...
RETURNS: ...
ERRORS: ...
SIDE EFFECTS: ...

Question 2

Write the contract for searching a product catalog. A user can search by keyword, filter by category, filter by price range, and choose how results are sorted. Define every input (required vs. optional, constraints), exactly what the output looks like, and what happens when no results match.


Question 3

Write the contract for transferring money between two bank accounts. This is a high-stakes operation. Be thorough: think about validation, insufficient funds, daily transfer limits, accounts that don't exist, accounts that are frozen, and what happens if the system fails mid-transfer.


Section B: Evaluate the Contract

Question 4

Here is a contract someone wrote:

CONTRACT: GetUser
ACCEPTS: user_id
RETURNS: user data
ERRORS: returns error if something goes wrong

List every problem with this contract. Then rewrite it properly.


Question 5

Here is a more detailed contract:

CONTRACT: SubmitReview
ACCEPTS:
  - product_id: text
  - user_id: text
  - rating: number (1-5)
  - review_text: text
RETURNS:
  - review_id: text
ERRORS:
  - invalid product → error
  - invalid user → error

This is better, but still incomplete. Identify at least five things that are missing or underspecified. Rewrite the contract to address them.


Question 6

You receive two different contract proposals for the same operation — sending a notification to a user:

Proposal A:

CONTRACT: SendNotification
ACCEPTS: user_id, message_text, channel (email/push/sms)
RETURNS: success/failure
ERRORS: user not found, invalid channel

Proposal B:

CONTRACT: SendNotification
ACCEPTS: user_id, message_text
RETURNS: notification_id, channel_used, status
ERRORS: user not found, user has no valid contact methods, message too long
SIDE EFFECTS: notification logged, delivery attempted via user's preferred channel

Compare these proposals. Which is better and why? What does Proposal B handle that Proposal A ignores? Is there anything Proposal A does better?


Section C: Contract Composition

Question 7

A food delivery app has the following user action: "Customer places an order from a restaurant."

Break this into a chain of individual contracts. Each contract should have full inputs, outputs, errors, and side effects. The chain should cover everything from the customer pressing "Place Order" to the restaurant receiving the order on their screen.

Define at least 4 contracts in the chain. For each one, show how the output of the previous step feeds into the input of the next.


Question 8

An event ticketing system needs to handle: "User purchases 3 tickets for a concert."

This involves checking seat availability, holding seats temporarily while the user pays, processing payment, and issuing the tickets. These steps must happen in order, and failure at any step has specific consequences.

Write the contract chain. Pay special attention to: what happens if seats are taken between checking availability and completing payment? What happens if payment fails after seats are held?


Question 9

A user wants to change their email address in a system. This seems simple but involves:

  • Verifying the new email is valid and not already in use
  • Sending a verification link to the new email
  • The user clicking the link to confirm
  • Updating the email in the system
  • Notifying the old email address about the change

Write contracts for each step. Identify where time gaps exist between steps (e.g., the user might click the link 5 minutes later, or never). How do the contracts handle these gaps?


Section D: Critical Thinking

Question 10

"A good contract should cover every possible edge case."

Is this true? Is it realistic? Where do you draw the line between thoroughness and over-specification? Give an example of an edge case that MUST be in the contract and one that reasonably could be omitted.


Question 11

You write a contract that says:

RETURNS: list of orders, sorted by date descending, maximum 100 results

A colleague argues: "Don't put 'maximum 100' in the contract. That's an implementation detail — maybe we'll change it later."

Who is right? Make the case for each side. What is the cost of including it in the contract? What is the cost of leaving it out?


Question 12

A contract exists between Module A and Module B. Module A has been happily calling Module B for a year. Now Module B needs to add a new required input to the contract.

What problem does this create? How would you handle this change without breaking Module A? Describe at least two approaches and the tradeoffs of each.


Grading Rubric

| Criteria | What It Means |
| --- | --- |
| Completeness | All five components present: name, inputs (with shapes and constraints), outputs (with guarantees), errors (exhaustive), side effects |
| Precision | No vague terms like "user data" or "returns error"; every statement is specific enough to implement from |
| Implementation-free | The contract says what, never how. No mention of databases, languages, or algorithms |
| Error awareness | Edge cases are considered. Empty results, invalid inputs, system failures, and time-related issues are addressed |
| Composability | Where contracts chain together, outputs clearly match the next step's inputs. Failure at each step has defined consequences |

Decomposition — Why It Matters

The One Skill That Makes Everything Possible

You now know how to trace data through a system (lifecycle), draw lines around responsibilities (boundaries), and define the agreements between parts (contracts). But there's a skill that comes before all of them — the skill that determines whether you can even begin solving a problem:

Decomposition — the ability to take something large and vague and break it into small, concrete, solvable pieces.

This is the single most important skill in engineering, and it has nothing to do with code.

Every senior engineer you'll ever meet does this instinctively. When handed a problem — any problem — their brain immediately starts dividing it into parts. Not because they were taught a formal method, but because they've learned through painful experience that trying to solve a big problem all at once leads to failure, every time.

Why Big Problems Are Impossible

Human brains have limits. You can hold about 4-7 things in working memory at once. A real-world feature might involve 50 things — data sources, business rules, edge cases, user interactions, error handling, performance concerns, security requirements, and more.

If you try to think about all 50 at once, you'll get overwhelmed, miss details, and build something that sort of works but falls apart under scrutiny. This is the experience every beginner has: "it works on my machine, for the simple case, if nothing goes wrong."

Decomposition is the antidote. You don't solve a 50-piece problem. You solve ten 5-piece problems. Each one is small enough to fit in your head. Each one can be verified independently. And when you compose them together, they form the complete solution.

What Decomposition Actually Means

Decomposition is not just "break it into pieces." It's breaking it into pieces that are:

  1. Independent enough to work on separately
  2. Small enough to understand completely
  3. Concrete enough to know when you're done
  4. Ordered correctly so dependencies flow naturally

Bad decomposition leads to pieces that can't be built without building other pieces first, pieces so vague you don't know where to start, or pieces that don't actually combine into a solution.

The Vague Feature Problem

The number one challenge in professional engineering is not technical. It's this: someone gives you a vague request, and you have to turn it into something buildable.

"We need reporting."

What does that mean? Reports about what? For whom? How often? In what format? What data do they need? What decisions will they make from the reports? How accurate does the data need to be? Can it be delayed by an hour? A day?

A beginner hears "we need reporting" and starts building a report page. A senior engineer hears "we need reporting" and asks 20 questions — because they know that the decomposition of the request determines the entire architecture.

Why Decomposition Before Code

In the LLM era, writing code is the cheapest part of the process. An LLM can generate a module in seconds. But it can only do that if someone has already decomposed the problem into a clear, well-bounded piece with a defined contract.

The expensive part — the part that requires human judgment — is:

  1. Understanding what the actual problem is
  2. Breaking it into pieces that make sense
  3. Deciding what order to tackle them in
  4. Knowing when a piece is "done"

This is decomposition. Get it right, and the rest is mechanical. Get it wrong, and no amount of code will save you.

What Goes Wrong Without Decomposition

The Monolith

Someone tries to build everything at once. They create one massive piece of work that does a hundred things. It takes months. Nobody can review it because it's too big. It has subtle bugs that won't be found until production. When it needs a change, any change, the whole thing is at risk.

The Wrong Order

Someone builds the dashboard before building the data pipeline that feeds it. They build the payment system before defining the order structure. They build the notification system before deciding what events trigger notifications. Now they have to rework everything because the foundation doesn't support what was built on top.

The Black Hole

A piece of work keeps getting bigger because nobody defined its edges. "While I'm in here, I'll also add..." and the scope expands endlessly. What was supposed to take a week takes two months, and nobody knows how close it is to done because the target keeps moving.

Invisible Progress

Without decomposition, the only status update is "I'm still working on it." With decomposition, you can say "5 of 8 pieces are complete, the 6th is in progress, and I'm blocked on the 7th until we get a decision on X." This is the difference between a project that surprises everyone with delays and a project that's managed with clarity.

Decomposition Is Not Just For Code

This skill applies to:

  • Writing a document: What sections does it need? What does each section cover? What order makes sense for the reader?
  • Planning an event: What are the independent tasks? What depends on what? What can be done in parallel?
  • Diagnosing a problem: What are the possible causes? How do I eliminate them one by one?
  • Learning something new: What are the foundational concepts? What builds on what? What can I skip for now?

Engineers who think in decomposition don't just write better software. They communicate better, plan better, and solve problems faster — because they've trained their brain to automatically ask: "What are the pieces, and what's the right order?"

That's what this section teaches you to do deliberately.

Decomposition — How: The Method

Two Directions

There are two fundamental approaches, and experts use both — often on the same problem.

Top-Down: Start From the Goal

Begin with the end result and repeatedly ask: "What sub-problems does this require?"

Each answer becomes a new question. Keep asking until each piece is small enough that you can describe it with a clear contract (precise inputs, precise outputs) and estimate the effort.

Quick example: "Build an online bookstore."

  • What does a bookstore need? → Browsing, cart, checkout, admin
  • What does browsing need? → Search, filter, sort, details page
  • What does search need? → Accept query, match results, handle "no results"

Three levels. Each level more specific. Stop when the pieces feel tangible.

Bottom-Up: Start From What You Have

Begin with the pieces you know and ask: "What can I compose from these?"

This works when you have existing components, known constraints, or a technology platform with built-in capabilities.

Quick example: You have a database of books, an email service, and a payment API.

  • Database → browsing and search features
  • Email → order confirmations and notifications
  • Payment API → checkout flow
  • Combine all three → a basic bookstore

Bottom-up is powerful when building blocks constrain the design. If your payment API only supports credit cards, that fact shapes the checkout feature — you discover this from the bottom, not the top.

When to Use Which

| Situation | Approach |
| --- | --- |
| New project, blank slate | Top-down: start from what users need |
| Existing system, adding features | Bottom-up: start from what already exists |
| Unclear requirements | Top-down first to clarify scope, then bottom-up to ground it in reality |
| Well-understood problem | Either works; most engineers blend both naturally |

The Decomposition Tree

The primary artifact of decomposition is a tree — a hierarchy where each node is a piece of the problem and its children are its sub-pieces.

At every leaf of this tree, you should be able to:

  1. Write a contract (inputs, outputs, errors)
  2. Estimate the effort (small, medium, large)
  3. Identify dependencies (does this need something else built first?)

If you can't do all three, the piece isn't decomposed enough — keep breaking it down.
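
As a rough illustration, the three leaf checks can be recorded per node and verified mechanically. The sketch below is only one possible shape: the `Node` class, its field names, and the `unready_leaves` helper are assumptions made for this example, not part of the method itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One piece of the problem. All field names are hypothetical."""
    name: str
    children: list["Node"] = field(default_factory=list)
    contract: Optional[str] = None            # inputs, outputs, errors in plain language
    effort: Optional[str] = None              # "small", "medium", or "large"
    dependencies: Optional[list[str]] = None  # None means "not yet identified"


def unready_leaves(node: Node) -> list[str]:
    """Return the names of leaves that fail any of the three leaf checks."""
    if node.children:
        names: list[str] = []
        for child in node.children:
            names.extend(unready_leaves(child))
        return names
    ready = (
        node.contract is not None
        and node.effort is not None
        and node.dependencies is not None
    )
    return [] if ready else [node.name]


search = Node("Search", children=[
    Node("Accept query",
         contract="IN: text. OUT: parsed query. ERRORS: empty query.",
         effort="small", dependencies=[]),
    Node("Match results"),  # no contract yet, so it needs more decomposition
])

print(unready_leaves(search))
```

Here only "Match results" is reported, which is the signal to keep breaking that piece down.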


Finding Seams

A seam is a natural break point — a place where one concern ends and another begins. Recognizing seams makes decomposition faster and more accurate.

Data format changes

Wherever data changes shape, there's a seam. Raw input → validated input. Validated input → database record. Database record → display format.
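
A minimal sketch of these three shape changes, assuming a book with a title and a price. The type and function names (`ValidatedBook`, `validate`, `to_record`, `to_display`) are hypothetical, and a plain dict stands in for a database record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedBook:
    title: str
    price_cents: int


def validate(raw: dict[str, str]) -> ValidatedBook:
    """Seam 1: raw input -> validated input."""
    title = raw.get("title", "").strip()
    if not title:
        raise ValueError("title is required")
    price = float(raw.get("price", ""))  # raises ValueError if not a number
    if price < 0:
        raise ValueError("price must not be negative")
    return ValidatedBook(title=title, price_cents=round(price * 100))


def to_record(book: ValidatedBook, record_id: int) -> dict[str, object]:
    """Seam 2: validated input -> database record (a plain dict here)."""
    return {"id": record_id, "title": book.title, "price_cents": book.price_cents}


def to_display(record: dict[str, object]) -> str:
    """Seam 3: database record -> display format."""
    cents = int(record["price_cents"])
    return f"{record['title']}: ${cents // 100}.{cents % 100:02d}"


record = to_record(validate({"title": " Dune ", "price": "12.50"}), record_id=1)
print(to_display(record))
```

Each function sits on exactly one seam, so each can be contracted and tested without the others.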

Responsibility changes

Wherever "whose job is this?" changes, there's a seam. The user's browser collects input. The server validates it. The database stores it. Three responsibilities, with a seam at each handoff.

Time boundaries

Wherever something can happen "later" or "separately," there's a seam. The order is placed now. The shipping label is generated later. The daily summary runs overnight.

Audience changes

Wherever different users see different things, there's a seam. The customer sees their order. The admin sees all orders. The warehouse sees orders ready to ship.

Error handling boundaries

Wherever the response to failure changes, there's a seam. Search fails → show "no results." Payment fails → stop checkout. Email fails → log it and continue.
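
The three failure responses can be sketched as three small wrappers. Everything here is an assumption for illustration: the callables passed in stand for whatever search, payment, and email pieces a real system would have, and `CheckoutStopped` is a made-up exception name.

```python
import logging
from typing import Callable

logger = logging.getLogger("bookstore")


class CheckoutStopped(Exception):
    """Raised when a failure must halt the checkout."""


def run_search(search: Callable[[str], list[str]], query: str) -> list[str]:
    """Search failure degrades to an empty result list ("no results")."""
    try:
        return search(query)
    except Exception:
        logger.exception("search failed for %r", query)
        return []


def take_payment(charge: Callable[[int], None], amount_cents: int) -> None:
    """Payment failure stops the checkout: the error is re-raised, not hidden."""
    try:
        charge(amount_cents)
    except Exception as exc:
        raise CheckoutStopped("payment failed") from exc


def send_confirmation(send: Callable[[str], None], address: str) -> bool:
    """Email failure is logged and the flow continues."""
    try:
        send(address)
        return True
    except Exception:
        logger.exception("confirmation email to %s failed", address)
        return False
```

The same kind of exception leads to three different outcomes, which is why each response belongs to a separate piece.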


Dependency Mapping

Once you have a tree, identify what depends on what.

Three dependency rules:

  1. Things at the bottom should depend on nothing or on stable abstractions
  2. Things at the top can depend on things below
  3. Circular dependencies are a design error — if A needs B and B needs A, your decomposition is wrong

The dependency arrows tell you the build order: start with pieces that have no dependencies, then build what depends on them, then what depends on those.
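
Deriving a build order from dependency arrows is a topological sort. The sketch below uses Python's standard `graphlib` module; the piece names and edges are a simplified assumption based on the bookstore example, not a complete map.

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends on the pieces in its set.
# The exact edges are an assumption for illustration.
depends_on = {
    "Book Catalog": set(),
    "Shopping Cart": {"Book Catalog"},
    "Checkout": {"Shopping Cart"},
    "Confirmation Email": {"Checkout"},
}

# Pieces with no dependencies come first, then whatever depends on them.
build_order = list(TopologicalSorter(depends_on).static_order())
print(build_order)

# Rule 3: a circular dependency has no valid build order.
circular = {"A": {"B"}, "B": {"A"}}
try:
    list(TopologicalSorter(circular).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

For this chain the order is Book Catalog, Shopping Cart, Checkout, Confirmation Email, and the circular graph is rejected rather than ordered.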


Estimating From Decomposition

Once a problem is decomposed into leaves, you can estimate by creating a table:

| Leaf | Complexity | Dependencies | Priority |
| --- | --- | --- | --- |
| Each decomposed piece | Small/Medium/Large | What it needs | High/Medium/Low |

This table, derived entirely from decomposition, gives you a project plan — not a guess, but a structured breakdown where each piece is estimable and the order is logical.
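
Once the rows exist, simple summaries fall out of them. A small sketch, assuming a hypothetical `Leaf` record and placeholder values that are not real estimates:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    name: str
    complexity: str                 # "small", "medium", or "large"
    dependencies: tuple[str, ...]   # names of leaves this one needs
    priority: str                   # "high", "medium", or "low"


# Hypothetical rows; the values are placeholders, not real estimates.
leaves = [
    Leaf("Accept query", "small", (), "high"),
    Leaf("Match results", "medium", ("Accept query",), "high"),
    Leaf("Handle no results", "small", ("Match results",), "medium"),
]

# How much work is there, by size?
by_complexity = Counter(leaf.complexity for leaf in leaves)

# Which leaves can be started today, because nothing blocks them?
can_start_now = [leaf.name for leaf in leaves if not leaf.dependencies]

print(dict(by_complexity))
print(can_start_now)
```

The same rows answer both "how big is this?" and "where do we start?", which is what makes the table a plan rather than a list.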


The Decomposition Checklist

When you've finished decomposing, verify:

  • Every leaf is concrete — you can write a contract for it
  • Every leaf is small — you could explain it completely in 2-3 sentences
  • No leaf has hidden complexity — if it feels big, it needs more decomposition
  • Dependencies are explicit — you know what depends on what
  • No circular dependencies — everything flows in one direction
  • Nothing is missing — trace the user's journey start to finish; every step has a leaf
  • Nothing overlaps — each responsibility appears exactly once
  • Build order is clear — you know what to start with

What to Look For in the Examples

The following pages take three very different systems and decompose them completely. As you read:

  1. Watch how the tree grows — from a vague goal to concrete, estimable leaves
  2. Notice where seams appear — and which type of seam it is
  3. Compare the final tree depth — some systems are deeper than others
  4. Look at the dependency map — what must be built first?
  5. Notice the decisions — decomposition isn't mechanical; there are judgment calls about where to split

Decomposition — Example: Online Bookstore

The Starting Point

Goal: Build an online bookstore where customers can browse, search, buy books, and track orders. Administrators can manage inventory.

This is deliberately a "boring" system. It's well-understood, which lets us focus purely on the decomposition technique without domain complexity getting in the way.


Step 1: Top-Down — First Level

Ask: "What does an online bookstore need?"

Online Bookstore
├── Browsing & Search
├── Shopping Cart
├── Checkout & Payment
├── Order Management
└── Admin: Inventory

Five branches. Each is a major area of functionality. But none of these are concrete enough to build yet — "Browsing & Search" could mean a hundred different things.


Step 2: Second Level — Each Branch

Ask: "What does each of these need?"

Browsing & Search
├── List all books (paginated)
├── Search by keyword (title, author, or ISBN)
├── Filter by genre, price range, publication date
├── Sort by price / date / rating / relevance
└── View book details (description, reviews, availability)

Shopping Cart

Shopping Cart
├── Add item to cart
├── Remove item from cart
├── Update quantity
├── View cart summary (items, quantities, subtotal)
└── Cart persistence (survives page refresh, login/logout)

Checkout & Payment

Checkout & Payment
├── Enter shipping address
├── Validate shipping address
├── Select shipping method (standard, express)
├── Calculate order total (items + shipping + tax)
├── Enter payment information
├── Process payment
├── Create order record
└── Send confirmation email

Order Management

Order Management
├── View order history (list of past orders)
├── View single order details (items, shipping status, tracking)
├── Cancel order (if not yet shipped)
└── Request return (if within return window)

Admin: Inventory

Admin: Inventory
├── Add new book to catalog
├── Update book details (price, description, cover image)
├── Adjust stock levels (restock, corrections)
├── Mark book as discontinued
└── View low-stock alerts

Step 3: Check Each Leaf Against the Three Tests

For every leaf, ask:

  1. Can I write a contract for it? (Inputs, outputs, errors)
  2. Can I estimate the effort? (Small, medium, large)
  3. Can I identify its dependencies?

Let's test a few:

"Add item to cart" — Leaf Test

| Question | Answer |
| --- | --- |
| Contract? | ✅ IN: customer_id, book_id, quantity. OUT: updated cart. ERRORS: book not found, out of stock, invalid quantity. |
| Estimate? | ✅ Small: straightforward data operation |
| Dependencies? | ✅ Needs: Book Catalog (to verify the book exists and is in stock) |

Verdict: Concrete enough. This is a good leaf.
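
To show that this contract really is specific enough to build from, here is one possible sketch of it. The names are assumptions: `Cart` holds the customer's items, and the `stock` dict stands in for the Book Catalog dependency.

```python
from dataclasses import dataclass, field


class BookNotFound(Exception): ...
class OutOfStock(Exception): ...
class InvalidQuantity(Exception): ...


@dataclass
class Cart:
    customer_id: str
    items: dict[str, int] = field(default_factory=dict)  # book_id -> quantity


def add_item_to_cart(cart: Cart, stock: dict[str, int],
                     book_id: str, quantity: int) -> Cart:
    """IN: the customer's cart, book_id, quantity. OUT: updated cart.

    `stock` (book_id -> units in stock) is a stand-in for the catalog.
    """
    if quantity < 1:
        raise InvalidQuantity(f"quantity must be at least 1, got {quantity}")
    if book_id not in stock:
        raise BookNotFound(book_id)
    wanted = cart.items.get(book_id, 0) + quantity
    if wanted > stock[book_id]:
        raise OutOfStock(book_id)
    cart.items[book_id] = wanted
    return cart


cart = add_item_to_cart(Cart("customer-1"), {"book-1": 2}, "book-1", 2)
print(cart.items)
```

Each error in the contract maps to exactly one exception, and the only outside knowledge the function needs is the catalog's stock data.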

"Process payment" — Leaf Test

| Question | Answer |
| --- | --- |
| Contract? | ⚠️ Partially, but what about different payment methods? Retries? Partial payments? |
| Estimate? | ⚠️ "Medium" is a guess; there might be hidden complexity |
| Dependencies? | ✅ Needs: order total from "Calculate order total," payment info from "Enter payment" |

Verdict: Not quite decomposed enough. Let's break it down further.

Process Payment
├── Validate payment method (card not expired, sufficient funds estimate)
├── Submit payment to processor
├── Handle processor response (approved, declined, error)
├── Record payment result
└── Handle retry on transient failure

Now each sub-piece is concrete and estimable.
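
A minimal sketch of how these sub-pieces might compose, under stated assumptions: `submit` is a hypothetical stand-in for the payment processor, a plain list stands in for the payment record, and every `ERROR` outcome is treated as transient.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    APPROVED = "approved"
    DECLINED = "declined"
    ERROR = "error"  # treated here as transient, e.g. a timeout


def process_payment(
    submit: Callable[[int], Outcome],    # hypothetical stand-in for the processor
    amount_cents: int,
    record: list[tuple[int, Outcome]],   # stand-in for "record payment result"
    max_attempts: int = 3,
) -> Outcome:
    """Submit, interpret the response, record it, and retry transient errors."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    outcome = Outcome.ERROR
    for attempt in range(1, max_attempts + 1):
        outcome = submit(amount_cents)
        record.append((attempt, outcome))
        if outcome is not Outcome.ERROR:
            break  # approved or declined: do not retry
    return outcome


responses = iter([Outcome.ERROR, Outcome.APPROVED])
log: list[tuple[int, Outcome]] = []
result = process_payment(lambda amount: next(responses), 1500, log)
print(result, log)
```

A real integration would also need a way to avoid charging twice when a retry follows a request that actually succeeded; that concern is left out of this sketch.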

"Cart persistence" — Leaf Test

| Question | Answer |
| --- | --- |
| Contract? | ⚠️ This isn't an operation; it's a behavior. "Persist" isn't something a user does; it's something the system must maintain. |
| Estimate? | ⚠️ Depends on implementation approach (cookies? database? session?) |
| Dependencies? | ⚠️ Unclear |

Verdict: This is a requirement, not a leaf. It constrains how the cart works but doesn't decompose into an operation with a contract. Move it to a "requirements" list and let it inform the design of the other cart operations.


Step 4: Dependency Map

Now draw the dependency arrows:

Book Catalog ──────────────────── (foundation — depends on nothing)
          │
          ▼
Shopping Cart ──── needs catalog for prices, stock checks
          │
          ▼
Checkout ──── needs cart contents
    │    │
    │    ▼
    │  Shipping ──── needs shipping address
    │    │
    │    ▼
    │  Order Total ──── needs cart + shipping + tax rules
    │    │
    │    ▼
    └─ Payment ──── needs order total
          │
          ▼
Order Record ──── needs payment confirmation + cart + shipping
          │
          ▼
Confirmation Email ──── needs order record

And separately:

Order Management ──── needs Order Record store (read-only)
Admin: Inventory ──── needs Book Catalog store (read-write)

What This Map Tells Us

Build order:

  1. Book Catalog (no dependencies)
  2. Shopping Cart (needs catalog)
  3. Shipping + Tax calculations
  4. Payment integration
  5. Order creation
  6. Email notifications
  7. Order management views
  8. Admin tools

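Steps 3 and 4 can be built in parallel. At run time, payment needs the order total, but the payment integration can be built against that contract (an amount to charge) while the shipping and tax calculations are still in progress.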


Step 5: The Complete Tree

Online Bookstore
├── Book Catalog (foundation)
│   ├── Store book data (title, author, price, genre, description, cover)
│   ├── List books (paginated, sorted)
│   ├── Search books (keyword match against title/author/ISBN)
│   ├── Filter books (genre, price range, date range)
│   └── Get single book details
│
├── Shopping Cart
│   ├── Add item (book_id, quantity) → validate against catalog
│   ├── Remove item (line_item_id)
│   ├── Update quantity (line_item_id, new_quantity) → revalidate stock
│   └── Get cart summary (items, subtotal)
│
├── Checkout
│   ├── Enter shipping address
│   ├── Validate shipping address (real address? deliverable?)
│   ├── Select shipping method → calculate shipping cost
│   ├── Calculate order total (items + shipping + tax)
│   ├── Validate payment method
│   ├── Submit payment → handle response
│   ├── Record payment
│   ├── Create order record
│   └── Send confirmation email
│
├── Order Management (customer-facing)
│   ├── List my orders (paginated, newest first)
│   ├── View order details (items, status, tracking)
│   ├── Cancel order (if status = "processing")
│   └── Request return (if within 30-day window)
│
└── Admin: Inventory
    ├── Add book to catalog
    ├── Update book details
    ├── Adjust stock levels
    ├── Mark book as discontinued
    └── View low-stock alerts (books with stock < threshold)

Total leaf count: 27 operations.

Each one can be contracted, estimated, and built. The original "build an online bookstore" has been transformed from a vague idea into 27 specific, concrete tasks with a clear build order.


The Estimation Table

| Leaf | Complexity | Dependencies | Priority | Notes |
| --- | --- | --- | --- | --- |
| Store book data | Small | None | Critical | Foundation for everything |
| List books | Small | Book data | Critical | Core feature |
| Search books | Medium | Book data | Critical | Needs text matching logic |
| Filter books | Small | Book data | High | Simple query constraints |
| Get book details | Small | Book data | Critical | Used by cart and display |
| Add to cart | Small | Get book details | Critical | |
| Remove from cart | Small | None | Critical | |
| Update cart quantity | Small | Get book details (stock check) | High | |
| Get cart summary | Small | None | Critical | Checkout needs this |
| Enter shipping address | Small | None | Critical | |
| Validate shipping address | Medium | External validation service | High | Edge cases with international addresses |
| Select shipping method | Small | Validated address | High | |
| Calculate order total | Small | Cart + shipping + tax rules | Critical | |
| Validate payment | Small | None | Critical | |
| Submit payment | Medium | External payment API + order total | Critical | Error handling is complex |
| Record payment | Small | Payment response | Critical | |
| Create order record | Small | Cart + payment + shipping | Critical | |
| Send confirmation email | Small | Order record | High | Can be async |
| List my orders | Small | Order records | High | |
| View order details | Small | Order records | High | |
| Cancel order | Small | Order record (status check) | Medium | |
| Request return | Medium | Order record + return policy rules | Low | Can ship v1 without it |
| Add book (admin) | Small | None | High | |
| Update book (admin) | Small | Book data | High | |
| Adjust stock (admin) | Small | Book data | High | |
| Discontinue book (admin) | Small | Book data | Medium | |
| Low-stock alerts (admin) | Small | Book data | Low | Nice to have for v1 |

Rough estimate: 4 medium tasks + 23 small tasks. This is a plannable project.


What This Example Teaches

  1. Start big, get specific — 5 branches became 27 concrete leaves
  2. Test every leaf — if you can't contract it, it's not done decomposing
  3. Some "leaves" are actually requirements — cart persistence is a constraint, not an operation
  4. Dependencies give you build order — don't guess what to build first; the map tells you
  5. The boring system is a good training system — if you can fully decompose a bookstore, you can decompose anything

Decomposition — Example: Messaging Application

The Starting Point

Goal: Build a messaging app where users can send direct messages, create group chats, share files, and see who's online. Think Slack or Discord — but focus on the decomposition, not the scale.

This system is different from the bookstore because it's real-time, multi-user, and has state that changes constantly (who's online, who's typing, unread counts). These properties create new kinds of seams.


Step 1: Top-Down — First Level

Ask: "What does a messaging app need?"

Messaging App
├── User Management
├── Conversations (1-on-1 and groups)
├── Messages
├── Presence (online/offline/typing)
├── Notifications
└── File Sharing

Six branches. Immediately, questions arise:

  • Is "conversations" one thing, or are 1-on-1 and groups different enough to be separate branches?
  • Where does "search messages" live? Under Messages? Its own branch?
  • Is "presence" really separate from "user management"?

Decision: Keep them separate for now. If two branches share too much, that's a signal to merge them later. It's easier to merge than to split.


Step 2: Second Level — Each Branch

User Management

User Management
├── Register new user
├── Log in (with session creation)
├── Log out (with session cleanup)
├── Update profile (display name, avatar, status message)
├── Block a user
└── Unblock a user

Conversations

Conversations
├── Direct Messages
│   ├── Start a DM conversation (with one other user)
│   └── List my DM conversations (sorted by most recent message)
├── Group Chats
│   ├── Create a group (name, initial members)
│   ├── Add member to group
│   ├── Remove member from group
│   ├── Leave group
│   ├── Update group details (name, description, avatar)
│   └── List my groups (sorted by most recent message)
└── Shared
    ├── Get conversation history (paginated, newest first)
    └── Mark conversation as read

Messages

Messages
├── Send text message (to a conversation)
├── Edit a sent message
├── Delete a sent message
├── React to a message (emoji)
├── Reply to a specific message (threaded)
├── Search messages (across all conversations or within one)
└── Pin a message in a conversation

Presence

Presence
├── Update my status (online / away / do-not-disturb / offline)
├── Get a user's current status
├── Get status for a list of users (for the sidebar)
├── Typing indicators (start typing, stop typing)
└── Last seen timestamp

Notifications

Notifications
├── In-app notification (badge count, pop-up)
├── Push notification (mobile device)
├── Email notification (for offline users, after delay)
├── Notification preferences (per conversation: all, mentions only, mute)
└── Mark notification as read

File Sharing

File Sharing
├── Upload file (attached to a message in a conversation)
├── Download file
├── Generate file preview (images, PDFs)
├── Enforce file size limits
└── Track storage usage per user

Step 3: Finding Seams — Where This Gets Interesting

Seam: Real-Time vs. Stored

Some operations are real-time (typing indicators, presence) and some are stored (message history, user profiles). This creates a fundamental seam:

  • Real-time data is ephemeral — "user is typing" doesn't go in a database; it's a transient signal
  • Stored data is permanent — messages are kept until deleted

This seam affects the entire architecture. Real-time features use different patterns (publish-subscribe) than stored features (request-response).

Seam: Who Sees What

When a message is sent to a group of 50 people, all 50 need to receive it. But:

  • 10 are currently online → they see it instantly
  • 15 have push notifications → they get a phone notification
  • 25 are offline with email notifications → they get an email after 15 minutes

Same event, three different delivery paths. This is an audience seam — the delivery mechanism changes based on the recipient's state.

Seam: Sender vs. Recipient Experience

When you send a message:

  • You see it immediately in your conversation (optimistic display)
  • The system stores it
  • The system delivers it to recipients
  • Recipients see a notification

The sender's experience and the recipient's experience are different flows triggered by the same event. This is a seam.

Seam: Conversation Create vs. Message in Conversation

What happens when you message someone for the first time? Is it:

  1. Create a conversation, then send a message into it? (Two operations)
  2. "Send message to user" — and the conversation is created as a side effect?

Both are valid decompositions. Option 1 is more explicit. Option 2 is more user-friendly. This is a decomposition judgment call — the right answer depends on how you want the user experience to work.

Decision: The user action is "send message to person." Internally, this decomposes into "find or create conversation" + "add message to conversation." The decomposition has two pieces, but they're presented to the user as one action.
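
For readers who want to see this decision in concrete form, here is a minimal Python sketch. The names (ConversationStore, find_or_create_dm, add_message, send_message_to_user) and the in-memory storage are assumptions made for illustration, not part of the design above.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical ID generator for this sketch


@dataclass
class Conversation:
    id: int
    members: frozenset
    messages: list = field(default_factory=list)


class ConversationStore:
    """In-memory stand-in for whatever storage a real system would use."""

    def __init__(self):
        self._by_members = {}

    def find_or_create_dm(self, user_a, user_b):
        # A DM is identified by its unordered pair of members.
        key = frozenset({user_a, user_b})
        if key not in self._by_members:
            self._by_members[key] = Conversation(id=next(_ids), members=key)
        return self._by_members[key]


def add_message(conversation, sender, text):
    # Second internal piece: append to a conversation that already exists.
    message = {"sender": sender, "text": text}
    conversation.messages.append(message)
    return message


def send_message_to_user(store, sender, recipient, text):
    # The single user-facing action composes the two internal pieces.
    conversation = store.find_or_create_dm(sender, recipient)
    add_message(conversation, sender, text)
    return conversation
```

The user only ever calls the last function; the two pieces beneath it can still be contracted, estimated, and tested separately.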


Step 4: Leaf Test — Checking Our Work

"Send text message" — Leaf Test

Question | Answer
Contract? | ✅ IN: sender_id, conversation_id, message_text. OUT: message_record (id, timestamp, content). ERRORS: conversation not found, sender not a member, message too long, sender is blocked by recipient.
Estimate? | ⚠️ Sending is small, but delivery is complex: 50 people need to receive it via 3 different channels
Dependencies? | ✅ Needs: conversation existence, sender membership

Verdict: "Send" is concrete, but "deliver to all recipients" is hidden inside it. Split it:

Send text message
├── Validate and store message
└── Fan out to recipients
    ├── Deliver to online recipients (real-time)
    ├── Queue push notifications (for mobile recipients)
    └── Queue email notifications (for offline recipients, delayed)

Now each sub-piece is estimable. "Fan out to recipients" was hidden complexity — it looked simple until you asked "what about the 50 people?"
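
The fan-out split can be sketched in a few lines of Python. This is an illustration under stated assumptions: the state labels ("online", "mobile", "offline") and channel names are invented here, and a real system would hand each list to a delivery mechanism rather than return it.

```python
from collections import defaultdict


def fan_out(message_id, recipients):
    """Route one stored message to a delivery channel per recipient.

    `recipients` maps user_id -> state, where state is one of
    "online", "mobile", or "offline" (labels assumed for this sketch).
    """
    channel_for_state = {
        "online": "realtime",      # deliver over the live connection
        "mobile": "push_queue",    # queue a push notification
        "offline": "email_queue",  # queue a delayed email
    }
    deliveries = defaultdict(list)
    for user_id, state in recipients.items():
        deliveries[channel_for_state[state]].append((user_id, message_id))
    return dict(deliveries)


# The 50-person group from the example: 10 online, 15 mobile, 25 offline.
group = {f"user{i}": "online" for i in range(10)}
group.update({f"user{i}": "mobile" for i in range(10, 25)})
group.update({f"user{i}": "offline" for i in range(25, 50)})

plan = fan_out("msg-1", group)
```

Storing the message and computing this plan are separate steps, which is exactly why they are separate leaves.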

"Typing indicators" — Leaf Test

Question | Answer
Contract? | ✅ IN: user_id, conversation_id, is_typing (yes/no). OUT: (broadcast event to other members). ERRORS: none meaningful; this is fire-and-forget.
Estimate? | ✅ Small: ephemeral event, no storage
Dependencies? | ✅ Needs: real-time connection to conversation members

Verdict: Concrete enough. But note the seam: this is a real-time feature. It follows a completely different pattern from message storage.
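
To make "a completely different pattern" tangible, here is a small publish-subscribe sketch in Python. The class name TypingBroadcaster and its methods are hypothetical, and the in-memory callbacks stand in for real network connections.

```python
from collections import defaultdict


class TypingBroadcaster:
    """In-memory publish-subscribe hub; nothing here is written to storage."""

    def __init__(self):
        # conversation_id -> {user_id: callback}
        self._subscribers = defaultdict(dict)

    def subscribe(self, conversation_id, user_id, callback):
        self._subscribers[conversation_id][user_id] = callback

    def publish_typing(self, conversation_id, user_id, is_typing):
        # Fire-and-forget: notify every other member and return nothing.
        for member, callback in self._subscribers[conversation_id].items():
            if member != user_id:
                callback({"user_id": user_id, "is_typing": is_typing})
```

Notice what is missing: no record is saved and no result is returned, which matches the contract's "no meaningful errors" and "no storage" answers.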

"Search messages" — Leaf Test

Question | Answer
Contract? | ⚠️ IN: query, scope (all conversations or specific one). OUT: list of matching messages with context. But what about: fuzzy matching? Matching within files? Searching by date? Searching by sender?
Estimate? | ❌ "Medium" is a guess; full-text search is a deep problem
Dependencies? | ✅ Needs: all stored messages

Verdict: Needs further decomposition:

Search messages
├── Basic keyword search (exact match within text)
├── Filter by conversation
├── Filter by sender
├── Filter by date range
├── Combine filters (keyword + sender + date range)
└── Return results with surrounding context (messages before/after)
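
As a rough illustration of why these leaves are now estimable, the sketch below implements the keyword and filter leaves in Python. It is deliberately naive (a case-sensitive substring scan over an in-memory list), the Message fields are assumptions, and the "surrounding context" leaf is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Message:
    conversation_id: str
    sender: str
    sent_on: date
    text: str


def search_messages(messages, keyword, conversation_id=None, sender=None,
                    start=None, end=None):
    """Exact keyword match plus optional filters; each filter is one leaf."""
    results = []
    for m in messages:
        if keyword not in m.text:
            continue
        if conversation_id is not None and m.conversation_id != conversation_id:
            continue
        if sender is not None and m.sender != sender:
            continue
        if start is not None and m.sent_on < start:
            continue
        if end is not None and m.sent_on > end:
            continue
        results.append(m)
    return results
```

Each `if` corresponds to one leaf in the tree, and combining filters is simply applying more than one of them.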

Step 5: Dependency Map

User Management ────────────────── (foundation)
       │
       ▼
Conversations ──── needs users to exist
       │
       ├──────────────────────────┐
       ▼                          ▼
Messages ──── needs conversation   Presence ──── needs user sessions
       │                                │
       ▼                                │
Fan-out to recipients ──────────────────┘
       │                     needs presence to know 
       │                     who's online vs. offline
       ▼
Notifications ──── needs message events + user preferences
       │
       ▼
File Sharing ──── needs messages (files are attached to messages)

What This Map Reveals

The fan-out node depends on BOTH messages AND presence. To deliver a message, you need to know who's online (real-time delivery) vs. offline (push/email). This dependency is invisible if you decompose messages and presence separately — the dependency map reveals the connection.

File sharing is a leaf-level feature. It depends on messages but nothing depends on it. This means it can be built last (or omitted from v1).

Notifications depend on nearly everything. They need messages (what happened), users (who to notify), presence (how to notify), and preferences (should we notify). This makes notifications a high-dependency, build-last feature.
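
A dependency map like this one can be turned into a build order mechanically. The sketch below uses Python's standard-library graphlib module; the dictionary is one reading of the map (for example, it assumes Presence needs only User Management, per the "needs user sessions" label), so treat the edges as an interpretation rather than the definitive graph.

```python
from graphlib import TopologicalSorter

# Each key depends on the branches in its set (one reading of the map).
depends_on = {
    "User Management": set(),
    "Conversations": {"User Management"},
    "Messages": {"Conversations"},
    "Presence": {"User Management"},
    "Fan-out": {"Messages", "Presence"},
    "Notifications": {"Fan-out", "User Management"},
    "File Sharing": {"Messages"},
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(depends_on).static_order())
```

Several valid orders exist; what matters is that no branch appears before something it depends on.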


Step 6: The Complete Tree

Messaging App
├── User Management (foundation)
│   ├── Register
│   ├── Login / Logout
│   ├── Update profile
│   └── Block / Unblock user
│
├── Conversations
│   ├── DM: Start conversation (find or create)
│   ├── DM: List my conversations
│   ├── Group: Create group
│   ├── Group: Add / Remove member
│   ├── Group: Leave group
│   ├── Group: Update group details
│   ├── Group: List my groups
│   ├── Shared: Get message history (paginated)
│   └── Shared: Mark as read
│
├── Messages
│   ├── Send message (validate + store)
│   ├── Fan out to recipients
│   │   ├── Real-time delivery (online users)
│   │   ├── Push notification queue (mobile)
│   │   └── Email notification queue (offline, delayed)
│   ├── Edit message
│   ├── Delete message
│   ├── React to message
│   ├── Reply (thread)
│   ├── Pin message
│   └── Search
│       ├── Keyword search
│       ├── Filter by conversation / sender / date
│       └── Return results with context
│
├── Presence
│   ├── Update my status
│   ├── Get user status
│   ├── Get bulk status (sidebar)
│   ├── Typing indicator (broadcast)
│   └── Last seen timestamp
│
├── Notifications
│   ├── In-app badge / pop-up
│   ├── Push notification dispatch
│   ├── Email notification dispatch
│   ├── Notification preferences
│   └── Mark notification as read
│
└── File Sharing
    ├── Upload file (to message)
    ├── Download file
    ├── Generate preview
    └── Enforce limits (size, storage)

Total leaf count: 35+ operations.


Comparing Bookstore vs. Messaging App

Aspect | Bookstore | Messaging App
Primary data flow | Request-response (user asks, system answers) | Bidirectional (users send/receive continuously)
Real-time requirements | None (page refreshes are fine) | Critical (messages must appear instantly)
Hardest decomposition challenge | Checkout flow (sequential, many steps) | Message fan-out (one event → many recipients × multiple channels)
Hidden complexity | Payment error handling | Presence + notification routing
Deepest branch | Checkout (8 leaves) | Messages → Fan-out → 3 delivery channels
Seams discovered | Data format changes, time boundaries | Real-time vs. stored, sender vs. recipient, audience routing
Build-last features | Returns, low-stock alerts | Notifications, file sharing, search
Approximate leaf count | 23 | 35+

The messaging app has more leaves not because it's "harder" but because it has more dimensions: real-time + stored, sender + receiver, online + offline. Each dimension multiplies the decomposition.


What This Example Teaches

  1. Real-time creates new seam types — ephemeral vs. persistent data is a fundamental split
  2. Fan-out is hidden complexity — "send a message" sounds atomic but it triggers per-recipient work
  3. Some features bridge multiple branches — notifications depend on messages, presence, AND user preferences
  4. "Start conversation" is a design decision — explicit vs. implicit conversation creation changes the decomposition
  5. More dimensions = more leaves — systems with multiple audiences, delivery channels, and timing requirements have larger trees

Decomposition — Example: Package Delivery Logistics

The Starting Point

Goal: Build a system for a delivery company that picks up packages from senders, routes them through sorting facilities, and delivers them to recipients. Track every package at every step. Handle failed deliveries, re-routing, and returns.

This system is different from the previous examples because it spans physical space and physical time. A package moves through multiple locations over multiple days. The decomposition must account for geography, vehicle routing, and the hard reality that physical objects can be lost.


Step 1: Top-Down — First Level

Ask: "What does a package delivery system need?"

Package Delivery
├── Package Intake (getting packages into the system)
├── Routing (deciding how packages travel)
├── Sorting & Transfer (physical movement through facilities)
├── Last-Mile Delivery (getting packages to recipients)
├── Tracking (visibility into package location)
└── Exception Handling (when things go wrong)

Six branches. But already, this decomposition reveals something: "Exception Handling" is its own branch. In the bookstore and messaging app, error handling was part of each leaf. Here, exceptions are so common and varied that they form a top-level concern.

Why? Because physical systems fail differently than digital ones. A database transaction either commits or rolls back. A package can be partially delivered (wrong address, left with neighbor, returned to sender). The failure modes create their own subsystem.


Step 2: Second Level — Each Branch

Package Intake

Package Intake
├── Customer creates shipment request
│   ├── Enter sender address
│   ├── Enter recipient address
│   ├── Enter package details (weight, dimensions, contents description)
│   ├── Select service level (overnight, 2-day, ground, economy)
│   └── Validate addresses (real addresses? deliverable? restricted areas?)
├── Calculate price (based on weight, dimensions, distance, service level)
├── Generate shipping label (with barcode/QR code and tracking number)
├── Schedule pickup (or drop at a facility)
└── Record package in system (status: "label created")

Routing

Routing
├── Determine route plan
│   ├── Origin facility (nearest sorting center to sender)
│   ├── Destination facility (nearest sorting center to recipient)
│   ├── Intermediate hops (if origin and destination aren't connected directly)
│   └── Transportation mode for each leg (truck, air, rail)
├── Optimize route for service level
│   ├── Overnight → air route, priority loading
│   ├── Ground → truck route, cost-optimized
│   └── Economy → most cost-efficient, may wait for full truck
└── Update route (if re-routing is needed due to weather, capacity, etc.)

Sorting & Transfer

Sorting & Transfer
├── Package arrives at facility (scan barcode → update status: "at [facility]")
├── Sort package to outbound lane (based on route's next hop)
├── Load package onto vehicle (scan → update status: "in transit to [next facility]")
├── Transfer between vehicles (for multi-leg routes)
└── Package arrives at destination facility (scan → update status: "at destination facility")

Last-Mile Delivery

Last-Mile Delivery
├── Assign package to delivery route (group packages by neighborhood)
├── Load delivery vehicle (scan each package)
├── Attempt delivery
│   ├── Successful delivery (scan → status: "delivered", capture signature or photo)
│   ├── No one home → leave at door (if authorized) or leave notice
│   ├── Wrong address → return to facility, flag for investigation
│   ├── Refused by recipient → return to facility
│   └── Access issue (gated community, locked building) → leave notice, reschedule
└── End of day: reconcile (all packages either delivered or accounted for)

Tracking

Tracking
├── Generate tracking number (at intake)
├── Record scan event (every scan at every point creates a tracking update)
├── Calculate estimated delivery date (based on route + service level)
├── Update estimated delivery date (if delays occur)
├── Provide tracking timeline to customer (ordered list of events)
└── Send proactive notifications
    ├── "Package picked up"
    ├── "Package in transit"
    ├── "Out for delivery"
    ├── "Delivered" (with photo/signature)
    └── "Delivery attempted — notice left"

Exception Handling

Exception Handling
├── Failed delivery (no one home, refused, wrong address)
│   ├── Reschedule delivery attempt
│   ├── Hold at facility for customer pickup
│   └── Return to sender (after N failed attempts)
├── Damaged package
│   ├── Assess damage (at any scan point)
│   ├── Notify sender and recipient
│   ├── File insurance claim (if insured)
│   └── Decide: deliver damaged or return to sender
├── Lost package
│   ├── Detect: expected scan didn't happen within time window
│   ├── Investigate: check last known scan, vehicle manifest
│   ├── Notify customer
│   └── File claim / send replacement or refund
├── Address correction
│   ├── Recipient contacts company with corrected address
│   ├── Update route while package is in transit
│   └── If package already at destination facility, re-sort
└── Customer-initiated redirect
    ├── Hold at facility
    ├── Deliver to alternate address
    └── Return to sender (sender requests)
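
The "Detect" leaf under Lost package is a good example of a leaf that is small once it is isolated. A hedged Python sketch follows; the function name, the input shape, and the 6-hour grace period are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def find_overdue_packages(expected_scans, now, grace=timedelta(hours=6)):
    """Return tracking numbers whose next expected scan is overdue.

    `expected_scans` maps tracking_number -> datetime by which the next
    scan was expected. The 6-hour grace period is an illustrative default.
    """
    return sorted(
        tracking_number
        for tracking_number, due in expected_scans.items()
        if now > due + grace
    )


# Example: one package is an hour late (within grace), one is a day late.
now = datetime(2024, 1, 2, 12, 0)
expected = {
    "PKG1": datetime(2024, 1, 2, 11, 0),
    "PKG2": datetime(2024, 1, 1, 12, 0),
}
overdue = find_overdue_packages(expected, now)
```

Detection only flags a package; investigating, notifying, and filing a claim remain separate leaves with their own contracts.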

Step 3: Finding Seams

Seam: Physical Scan Points

Every barcode scan is a seam. Data literally changes at each scan point:

  • Before scan: "package was loaded onto truck" (assumed location)
  • After scan: "package is confirmed at Chicago facility" (known location)

The system transitions from assumed state to confirmed state at every scan. Between scans, the package's location is inferred, not known. This is fundamentally different from digital systems where data is always in a known state.
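
One way to keep this seam visible in a data model is to store the confirmed and assumed locations separately. The Python sketch below is a minimal illustration; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackageLocation:
    confirmed_at: Optional[str] = None  # last place a scan proved the package was
    assumed_next: Optional[str] = None  # where the route plan says it is heading

    def load_onto_vehicle(self, next_facility):
        # Loading tells us the destination, not that the package arrived.
        self.assumed_next = next_facility

    def record_scan(self, facility):
        # A scan converts an assumption into a confirmed fact.
        self.confirmed_at = facility
        self.assumed_next = None

    def describe(self):
        if self.assumed_next is not None:
            return (f"last confirmed at {self.confirmed_at}, "
                    f"assumed in transit to {self.assumed_next}")
        return f"confirmed at {self.confirmed_at}"
```

Keeping the two fields apart means the system never reports an inferred location as if it were a known one.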

Seam: Service Level → Route Strategy

The same package going from New York to Los Angeles decomposes differently based on service level:

  • Overnight: NYC facility → JFK airport → LAX airport → LA facility → delivery
  • Ground: NYC facility → truck to Pittsburgh hub → truck to Denver hub → truck to LA facility → delivery
  • Economy: NYC facility → waits for full truck → Pittsburgh → waits → Denver → waits → LA → delivery

Same origin, same destination, completely different decomposition of the journey. The service level seam changes the entire routing tree.

Seam: Custody Changes

Every time the package changes hands (sender → pickup driver → facility worker → vehicle → next facility → delivery driver → recipient), there's a custody seam. At each custody transfer, responsibility shifts. If the package is damaged, the question is: during whose custody? This is a liability seam as much as a technical one.

Seam: Customer Expectation vs. Physical Reality

The customer sees: "In transit → Out for delivery → Delivered." The system sees: "Scan 47 → Sort lane B → Load truck 104 → Scan 48 → Driver route position 12/38 → Scan 49 → Delivery confirmed GPS 40.7128° N."

Same package, radically different levels of detail. The tracking system must translate between these two views — that's a seam between the physical world and the customer experience.
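
The translation across this seam can be sketched as a simple mapping. The internal event names below are invented for the example; only the three customer-facing statuses come from the text above.

```python
# Hypothetical internal event types mapped to the customer-facing statuses.
CUSTOMER_STATUS = {
    "facility_scan": "In transit",
    "sorted_to_lane": "In transit",
    "loaded_on_linehaul": "In transit",
    "loaded_on_delivery_vehicle": "Out for delivery",
    "delivery_confirmed": "Delivered",
}


def customer_timeline(internal_events):
    """Collapse a detailed event list into the distinct statuses a customer sees."""
    timeline = []
    for event in internal_events:
        status = CUSTOMER_STATUS[event]
        # Only record a status when it differs from the previous one.
        if not timeline or timeline[-1] != status:
            timeline.append(status)
    return timeline
```

Many internal events collapse into one customer status, so the customer view loses detail by design.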


Step 4: Dependency Map

Package Intake ──────────── (entry point, no dependencies)
       │
       ├────────────────┐
       ▼                ▼
   Routing          Tracking (tracking number created at intake)
       │                │
       ▼                │
Sorting & Transfer ─────┘ (each scan updates tracking)
       │
       ▼
Last-Mile Delivery ──── needs sorted packages + route assignments
       │
       ├────────────────┐
       ▼                ▼
  Tracking           Exception Handling
  (delivery scan)    (failed delivery triggers exception flow)

What This Map Reveals

Tracking runs in parallel with everything. It's not a step in the chain; it's a continuous side channel that records events from every other branch. Every scan, from pickup through Sorting & Transfer to Last-Mile Delivery, feeds into Tracking.

Exception Handling is triggered from multiple points. A failed delivery triggers an exception. A damaged package at a sort facility triggers a different exception. A lost package (detected by Tracking) triggers yet another. Exception Handling isn't downstream of one branch — it's connected to everything.

The physical chain is strictly sequential. A package must be: picked up → routed → sorted → transported → sorted again → delivered. You can't skip a step or deliver out of order. This contrasts with the messaging app, where sending and receiving happen simultaneously.


Step 5: The Complete Tree (Abridged)

Package Delivery
├── Package Intake
│   ├── Create shipment request (addresses, weight, dimensions, service level)
│   ├── Validate addresses
│   ├── Calculate price
│   ├── Generate label + tracking number
│   ├── Schedule pickup
│   └── Record in system
│
├── Routing
│   ├── Determine origin/destination facilities
│   ├── Plan route (hops, transport modes)
│   ├── Optimize for service level
│   └── Re-route (on delay or capacity change)
│
├── Sorting & Transfer
│   ├── Scan at arrival (per facility)
│   ├── Sort to outbound lane
│   ├── Load onto vehicle (scan)
│   └── Track vehicle movement
│
├── Last-Mile Delivery
│   ├── Create delivery routes (cluster by geography)
│   ├── Load delivery vehicle (scan each package)
│   ├── Attempt delivery (success / fail scenarios)
│   ├── Capture proof (signature, photo)
│   └── End-of-day reconciliation
│
├── Tracking
│   ├── Record scan events (from all sources)
│   ├── Calculate estimated delivery
│   ├── Update estimated delivery on delay
│   ├── Provide customer timeline
│   └── Send proactive notifications (picked up, in transit, out for delivery, delivered)
│
└── Exception Handling
    ├── Failed delivery (reschedule, hold, return)
    ├── Damaged package (assess, notify, claim)
    ├── Lost package (detect, investigate, claim)
    ├── Address correction (update route, re-sort)
    └── Customer redirect (hold, alternate address, return)

Total leaf count: about 30 operations in this abridged tree, and more once the sub-leaves from Step 2 are expanded.


Comparing All Three Systems

Aspect | Bookstore | Messaging App | Package Delivery
Primary domain | Digital commerce | Digital communication | Physical logistics
Time horizon | Minutes (browsing → purchase) | Milliseconds (real-time messages) | Days (pickup → delivery)
Failure recovery | Refund/retry (reversible) | Retry/resend (mostly reversible) | Physical recovery (often irreversible)
Fan-out pattern | 1 order → 1 customer | 1 message → N recipients | 1 package → many scan points
Deepest complexity | Checkout (sequential chain) | Notification routing (multi-channel) | Exception handling (branching physical outcomes)
Seams unique to this domain | Format changes (cart → order → record) | Real-time vs. stored, audience routing | Custody changes, physical scan points, assumed vs. confirmed state
Exception handling | Part of individual operations | Part of individual operations | Its own top-level branch
Build order | Catalog → Cart → Checkout → Orders | Users → Conversations → Messages → Presence → Notifications | Intake → Routing → Sort → Delivery → Tracking → Exceptions

What This Example Teaches

  1. Physical systems have assumed state — between scans, you're guessing where the package is. Digital systems always know.
  2. Exception handling can be its own subsystem — when failures are common and varied, they deserve a top-level branch, not just error cases in contracts
  3. The same entity (package) decomposes differently based on context — overnight vs. ground creates entirely different route trees
  4. Custody changes are seams — every handoff between people, vehicles, or facilities is a decomposition boundary
  5. Sequential physical chains can't be parallelized — unlike software where operations can run simultaneously, a package must physically move step by step

Decomposition — Common Mistakes and Dependency Traps

Mistake 1: Technology-Layer Decomposition

The Wrong Way

Online Store
├── Frontend
├── Backend
├── Database
└── DevOps

This isn't decomposition — it's technology labeling. It tells you what kind of code lives where, but not what the system does. You can't write a contract for "Frontend." You can't estimate "Backend." These aren't features; they're implementation categories.

Why People Do It

It's comfortable. Developers naturally think in terms of where code lives. "I'll build the frontend, you build the backend" feels like a plan. But it's not a plan — it's an organizational split that leaves all the actual design decisions unmade.

The Fix

Decompose by feature, not by technology. "Browse books" is a feature that touches frontend, backend, and database. When it's decomposed correctly, the technology concerns become implementation details within each leaf:

Online Store
├── Browse books
│   ├── List books (paginated)
│   ├── Search books
│   └── View book details
├── Shopping Cart
│   ├── Add item
│   ├── Remove item
│   └── View cart
...

Each leaf here has a clear contract, a clear estimate, and a clear set of dependencies — regardless of which technology implements it.

How to Spot It

If every branch of your tree is a technology or a "layer" rather than something a user does or the business needs, you've done technology-layer decomposition.


Mistake 2: Decomposing Too Shallow

The Wrong Way

Hospital System
├── Patient Management
├── Appointments
├── Billing
└── Records

Four branches. Each one is an entire system. "Patient Management" alone could be 50 operations. This isn't a decomposition — it's a table of contents.

Why People Do It

They stop when the branches feel "reasonable" rather than when they're concrete. "Patient Management" sounds like a reasonable module. But can you write a contract for it? Can you estimate it? No — it's still an entire subsystem compressed into two words.

The Fix

Keep asking "what does this need?" until every leaf passes the three tests (contractable, estimable, dependency-identified):

Patient Management
├── Register new patient
│   ├── Collect demographics (name, DOB, address, phone, email)
│   ├── Assign patient ID
│   ├── Verify insurance (if applicable)
│   └── Create initial medical record
├── Update patient information
│   ├── Update demographics
│   ├── Update insurance
│   └── Update emergency contact
├── Search patients
│   ├── By name
│   ├── By patient ID
│   └── By date of birth
├── Merge duplicate patient records
└── Deactivate patient record (moved away, deceased)

Now every leaf is contractable. "Collect demographics" has clear inputs, clear outputs, and a clear estimate (small).

How to Spot It

If a non-technical person can't understand what each leaf does, it's too shallow. "Patient Management" is abstract. "Register new patient" is concrete.


Mistake 3: Decomposing Too Deep

The Wrong Way

Search books by keyword
├── Receive search text from user interface
├── Trim whitespace from search text
├── Convert search text to lowercase
├── Split search text into individual words
├── Remove common words (the, a, an, is)
├── For each remaining word:
│   ├── Look up word in search index
│   ├── Retrieve list of matching book IDs
│   └── Score each match by relevance
├── Combine results from all words
├── Remove duplicate book IDs
├── Sort combined results by total relevance score
├── Fetch book details for top N results
└── Return results to user interface

This is implementation pseudocode, not decomposition. At the decomposition stage, "Search books by keyword" is a single leaf. The internal algorithm is an implementation detail.

Why People Do It

Perfectionism. The desire to have everything figured out before starting. Or anxiety about estimation — "I can't estimate 'search' unless I know exactly how it works."

The Fix

Stop decomposing when a leaf is one responsibility with a clear contract:

CONTRACT: SearchBooks
ACCEPTS: search_query (text, 1-200 characters)
RETURNS: list of matching books (title, author, price, relevance score), sorted by relevance
ERRORS: empty query, query longer than 200 characters, no results found

That's the leaf. How search works internally — tokenization, indexing, scoring — is decided when you implement the leaf, not when you decompose the system.
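
For comparison, here is how the same contract might look once someone does implement the leaf. This Python sketch is only one possible shape: the exception names, the BookResult type, the tuple-based catalog, and the title-match scoring are all placeholders invented for the example.

```python
from dataclasses import dataclass


class EmptyQueryError(ValueError):
    """Raised when the query is empty after trimming whitespace."""


class NoResultsError(LookupError):
    """Raised when nothing matches; the contract lists this as an error."""


@dataclass
class BookResult:
    title: str
    author: str
    price: float
    relevance: float


def search_books(search_query, catalog):
    """Enforce the SearchBooks contract around a placeholder algorithm.

    `catalog` is a hypothetical list of (title, author, price) tuples.
    Scoring by keyword occurrences in the title is a stand-in for whatever
    the real implementation chooses.
    """
    query = search_query.strip().lower()
    if not query:
        raise EmptyQueryError("search_query must be 1-200 characters")
    if len(query) > 200:
        raise ValueError("search_query must be 1-200 characters")
    results = [
        BookResult(title, author, price, float(title.lower().count(query)))
        for title, author, price in catalog
        if query in title.lower()
    ]
    if not results:
        raise NoResultsError(query)
    return sorted(results, key=lambda r: r.relevance, reverse=True)
```

The validation and return shape come straight from the contract; everything between them could be swapped out without touching the decomposition.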

How to Spot It

If your leaves describe how something works rather than what it does, you've gone too deep. Decomposition answers "what are the pieces?" Implementation answers "how does each piece work?"


Mistake 4: Overlapping Responsibilities

The Wrong Way

E-Commerce Platform
├── Product Page
│   ├── Display product details
│   ├── Show stock availability   ← checks inventory
│   └── Show recommended products
├── Shopping Cart
│   ├── Add item to cart
│   ├── Validate stock on add     ← checks inventory
│   └── View cart
├── Checkout
│   ├── Reserve inventory          ← modifies inventory
│   ├── Process payment
│   └── Create order
└── Admin
    ├── Update stock levels         ← modifies inventory
    └── View low-stock alerts       ← checks inventory

"Inventory" appears in four different branches. Stock checking happens in Product Page, Shopping Cart, and Admin. Stock modification happens in Checkout and Admin. If you build each branch independently, you'll build the inventory logic four different times, likely with four subtly different behaviors.

The Fix

Identify the shared responsibility and make it explicit:

E-Commerce Platform
├── Inventory (shared)
│   ├── Check stock level
│   ├── Reserve stock (temporary hold)
│   ├── Confirm reservation (convert to deduction)
│   ├── Release reservation (timeout or cancel)
│   └── Adjust stock (admin)
├── Product Page → uses Inventory.CheckStock
├── Shopping Cart → uses Inventory.CheckStock
├── Checkout → uses Inventory.Reserve, then Inventory.Confirm
└── Admin → uses Inventory.Adjust, Inventory.CheckStock

Now inventory is decomposed once and referenced by the branches that need it. The dependency is explicit.
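
A shared Inventory branch might look like the following Python sketch. It is a simplified, in-memory illustration: the method names mirror the leaves above, while the reservation IDs and the rule that reserved units reduce available stock are assumptions.

```python
class Inventory:
    """Single owner of stock logic; other branches call it instead of re-implementing it."""

    def __init__(self, stock):
        self._on_hand = dict(stock)  # sku -> units physically available
        self._reserved = {}          # reservation_id -> (sku, quantity)
        self._next_id = 1

    def check_stock(self, sku):
        # Available stock is on-hand units minus units held by reservations.
        held = sum(q for s, q in self._reserved.values() if s == sku)
        return self._on_hand.get(sku, 0) - held

    def reserve(self, sku, quantity):
        if quantity > self.check_stock(sku):
            raise ValueError("insufficient stock")
        reservation_id = self._next_id
        self._next_id += 1
        self._reserved[reservation_id] = (sku, quantity)
        return reservation_id

    def confirm(self, reservation_id):
        # Convert the temporary hold into a permanent deduction.
        sku, quantity = self._reserved.pop(reservation_id)
        self._on_hand[sku] -= quantity

    def release(self, reservation_id):
        # Timeout or cancel: drop the hold without changing on-hand stock.
        self._reserved.pop(reservation_id)

    def adjust(self, sku, new_level):
        self._on_hand[sku] = new_level
```

Product Page and Shopping Cart would call only `check_stock`, Checkout would call `reserve` then `confirm`, and Admin would call `adjust`, so every branch sees the same definition of "in stock."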

How to Spot It

If the same verb + noun appears in multiple branches ("check stock," "validate user," "calculate price"), it's an overlap. Extract it as a shared dependency.


Mistake 5: Missing the Unhappy Path

The Wrong Way

Flight Booking
├── Search flights
├── Select flight
├── Enter passenger details
├── Pay
└── Issue ticket

Five steps. All happy path. But what about:

  • Flight sells out between search and payment?
  • Payment is declined?
  • Passenger name doesn't match their ID?
  • Flight is canceled after booking?
  • Customer wants to change their flight?
  • Customer wants a refund?

The unhappy paths are at least as numerous as the happy-path steps, and often more numerous.

The Fix

For every happy-path branch, ask: "What can go wrong, and what do we do about it?"

Flight Booking
├── Search flights
├── Select flight
│   └── Handle: flight no longer available → show alternatives
├── Enter passenger details
│   └── Handle: validation failures → show field-level errors
├── Pay
│   ├── Handle: payment declined → retry or try different card
│   ├── Handle: flight sold out during payment → refund, show alternatives
│   └── Handle: payment timeout → check if payment went through, avoid double charge
├── Issue ticket
│   └── Handle: system error after payment → queue for retry, notify customer of delay
└── Post-Booking
    ├── Cancel booking → calculate refund based on fare rules
    ├── Change flight → calculate fare difference
    ├── Flight canceled by airline → auto-rebook or refund
    └── Schedule change by airline → notify and offer alternatives

The tree more than tripled, from 5 nodes to 16. That's normal for real systems: the unhappy paths are at least half the work.

How to Spot It

If your decomposition tree reads like a tutorial ("step 1, step 2, step 3...") with no branching, you've only captured the happy path. Real systems branch extensively.


Mistake 6: Circular Dependencies

The Wrong Way

Module A (User Profiles) needs Module B (Permissions) to check if user can edit profiles
Module B (Permissions) needs Module A (User Profiles) to look up user's role

A depends on B. B depends on A. Neither can be built first. Neither can be tested alone. This is a circular dependency — and it's always a decomposition error.

Why It Happens

Two things that are related get decomposed as peers that reference each other. In reality, one should depend on the other, or both should depend on a third, more fundamental thing.

The Fix

Find the deeper abstraction:

Module C (User Data) ← stores user ID, role, basic profile data (no logic)
       │
       ├────────────────┐
       ▼                ▼
Module A (Profiles)  Module B (Permissions)
uses User Data       uses User Data

Now both A and B depend on C, but not on each other. The cycle is broken.

How to Spot It

Draw your dependency arrows. If you can follow the arrows in a circle (A → B → C → A), you have a cycle. Every cycle must be broken, most often by extracting the shared dependency into a module of its own.
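
Following arrows by eye works for small maps; for larger ones the check can be automated. The Python sketch below relies on the standard-library graphlib module, which raises CycleError when no build order exists. The helper name find_cycle and the module names in the example are illustrative.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(depends_on):
    """Return the modules forming a cycle, or None if a build order exists."""
    sorter = TopologicalSorter(depends_on)
    try:
        sorter.prepare()
    except CycleError as error:
        # graphlib reports the cycle as a list whose first and last items match.
        return error.args[1]
    return None


# The Profiles/Permissions example before and after extracting User Data.
before = {"Profiles": {"Permissions"}, "Permissions": {"Profiles"}}
after = {"Profiles": {"User Data"}, "Permissions": {"User Data"}, "User Data": set()}

cycle_before = find_cycle(before)
cycle_after = find_cycle(after)
```

Here `cycle_before` names the two modules that reference each other, and `cycle_after` is None because both now depend only on User Data.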


The Dependency Health Checklist

After decomposing, validate your dependency structure:

Check | What to Look For
No cycles | Can you sort all modules into a build order where each module depends only on modules built before it?
Shared responsibilities are explicit | Is any logic duplicated across branches? Extract it.
Foundation modules depend on nothing | Your data stores, core entities, and configuration should be at the bottom of the dependency graph.
High-level features depend on low-level services | "Checkout" depends on "Inventory" and "Payment", not the reverse.
Every dependency is justified | For each arrow, can you explain why it exists? If not, it might be artificial.
Each piece can be built and tested independently | If you can't build module X without also building module Y, either X depends on Y (document it) or they should be combined.

Decomposition — Test Your Understanding

Answer each question by producing decomposition trees, dependency maps, and/or build orders. No code. Show your reasoning.


Section A: Decompose It

Question 1

"We need a system for a veterinary clinic."

Clients bring their pets for appointments. The vet records diagnoses and prescribes treatments. The clinic sends appointment reminders. Clients pay for visits.

Produce a complete decomposition tree. Go at least three levels deep. Identify the leaves and verify that each one could have a contract written for it.


Question 2

"Build a recipe sharing platform."

Users create recipes with ingredients and steps. Other users can search, save favorites, and leave reviews. Users can create weekly meal plans and generate shopping lists from their meal plans.

Decompose this top-down. Then identify the dependencies between your leaves. What is the build order?


Question 3

You're in a bottom-up situation. You already have:

  • A user authentication service
  • A file storage service (can store and retrieve files)
  • An email sending service
  • A database for structured data

A client asks: "Can you build me a simple document collaboration tool where teams can upload, share, and comment on documents?"

Using the existing services as your starting point, decompose what needs to be built (not what already exists). Show how the new pieces connect to the existing services.


Section B: Find the Seams

Question 4

A single feature request reads:

"When a customer completes a purchase, they should see an order confirmation page, receive a confirmation email, the inventory should be updated, the sales team should see the order in their dashboard, and if the order is over $500, it should be flagged for manual review."

Find every seam in this description. Group the pieces by the boundary they belong to. Show which pieces can happen in parallel and which must happen in sequence.


Question 5

Here is a vague feature request:

"We need analytics."

Write the 10 questions you would ask to decompose this. For each question, explain why the answer matters for decomposition (i.e., how does it change the shape of the tree?).


Question 6

A system currently works as follows:

  1. User uploads a CSV file
  2. System parses the CSV
  3. System validates each row
  4. Valid rows are saved to the database
  5. Invalid rows are collected into an error report
  6. Error report is emailed to the user
  7. A summary is displayed on screen

Map the seams. Then answer: if step 3 (validation) needs to become much more complex (adding cross-row validation, checking against external data sources), which seams help you isolate that change? Which other steps would be affected?


Section C: Dependencies and Ordering

Question 7

You've decomposed a project into these pieces:

  • A: User registration
  • B: User login
  • C: Create a post
  • D: View feed (list of posts from followed users)
  • E: Follow/unfollow other users
  • F: Like a post
  • G: Notification when someone likes your post
  • H: User profile page

Map all the dependencies (which pieces need which other pieces to exist first). Draw the dependency graph. Determine the build order — what gets built in phase 1, phase 2, etc.?


Question 8

You have a dependency problem. You've identified:

  • Module X needs data from Module Y
  • Module Y needs a callback from Module X when processing is done
  • This creates a circular dependency

Without knowing any specifics about what X and Y do, describe three general strategies for breaking a circular dependency. For each strategy, explain the tradeoff.


Question 9

A project has 12 decomposed tasks. Here are their dependencies:

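| Task | Depends On |
|------|------------|
| A | nothing |
| B | nothing |
| C | A |
| D | A |
| E | B |
| F | C, E |
| G | D |
| H | F |
| I | F, G |
| J | H |
| K | I |
| L | J, K |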

Draw the dependency graph. What is the critical path (the longest chain of dependencies from start to finish)? If you had two people working in parallel, what's the most efficient assignment of tasks to people?


Section D: Critical Thinking

Question 10

"You should decompose until every leaf takes less than a day to build."

Is this good advice? When is it right? When might it be wrong? What are the risks of decomposing too finely versus too coarsely?


Question 11

You're decomposing a system and you encounter a feature that feels like it could belong in two different branches of your tree:

"Send a notification when an order ships."

Is this part of Orders (since it's triggered by an order event)? Or part of Notifications (since it's a notification)? Both feel reasonable.

How do you resolve this? Propose a decomposition that handles this cleanly. Explain the principle behind your decision.


Question 12

You've been given a completed decomposition tree by a colleague. How do you evaluate it? Create a checklist of at least 8 specific questions you would ask to determine if the decomposition is good, complete, and buildable. For each question, explain what a bad answer would reveal.


Grading Rubric

| Criterion | What It Means |
|-----------|---------------|
| Depth | Trees go deep enough that leaves are concrete and estimable, but not so deep that they describe implementation |
| Completeness | No missing steps. Trace the user journey and confirm every step has a corresponding leaf |
| No overlap | Each responsibility appears exactly once in the tree |
| Dependencies are explicit | Clear arrows showing what needs what. No hidden assumptions |
| Build order is logical | Foundation pieces first, dependent pieces after. Circular dependencies identified and resolved |
| Seam recognition | Natural break points are identified and used to structure the decomposition |

Failure Modes and Debugging — Why It Matters

Things Will Break

Every system fails. Not "might fail" — will fail. Hardware dies. Networks drop. Users do unexpected things. Data gets corrupted. Services go down. Bugs hide in logic that worked fine for a year and then didn't.

The difference between a junior and senior engineer isn't that the senior's systems don't break. It's that the senior expects failure, designs for it, and diagnoses it systematically when it happens.

This section is not about learning debugging tools. Tools change. This section is about learning to reason about failure — a skill that works in any language, on any platform, in any decade.

Why Debugging Is a Thinking Skill, Not a Tool Skill

Most courses teach debugging as: "here's how to set a breakpoint, here's how to read a stack trace, here's how to use print statements." These are useful techniques, but they're like teaching someone to use a stethoscope without teaching them medicine. The tool is worthless without the reasoning behind it.

Real debugging is a reasoning process:

  1. Something is wrong (the symptom)
  2. The symptom has a cause
  3. The cause is usually not where the symptom appears
  4. Finding the cause requires systematic elimination of possibilities

This is pure critical thinking. It doesn't require a computer. It requires the ability to form hypotheses, test them, and follow evidence.

The Two Failures Most People Make When Debugging

Failure 1: Guessing Instead of Reasoning

Something breaks. The engineer's first instinct is to change something — anything — and see if it fixes the problem. This is like a doctor prescribing random medication because the patient has a headache. Sometimes it works by luck. Usually it wastes hours, introduces new bugs, and teaches nothing.

The alternative: stop. Think. What do you know? What do you not know? What would help you narrow it down?

Failure 2: Assuming Instead of Verifying

"That part works fine, the problem must be somewhere else." Says who? Have you verified it? One of the most common debugging experiences is spending hours looking in the wrong place because you assumed some component was correct — and it wasn't.

The alternative: verify everything. Trust nothing. Check each assumption with evidence.

Why Failure Modes Are a Design Concern

Most people think about failure after the system is built. That's backwards. You should think about failure during design, for two reasons:

1. The cost of failure is a design decision

Some failures are acceptable ("the profile picture takes 2 seconds longer to load"). Some are catastrophic ("we charged the customer twice"). The difference isn't technical — it's about what the system does and who it serves. This must be decided during design, not discovered during an outage.

2. Error handling is most of the work

In a typical system, the "happy path" (everything works) is maybe 30% of the logic. The other 70% is: what if this input is invalid? What if that service is down? What if the data is in an unexpected format? What if the network times out? What if the user does something in the wrong order?

If you design only for the happy path, you've built 30% of the system. "But it works!" Yes, until it doesn't. And when it doesn't, nobody planned for it, so the failure is chaotic rather than graceful.

What Does "Graceful Failure" Mean?

A system that fails gracefully does these things:

  1. Detects that something went wrong (not silently corrupting data)
  2. Contains the failure (one broken feature doesn't take down the whole system)
  3. Communicates what happened (to the user, to the logs, to the monitoring system)
  4. Degrades rather than crashes (if search is down, the rest of the site still works)
  5. Recovers when possible (retries, fallbacks, self-healing)

A system that fails badly:

  • Crashes entirely because one component failed
  • Shows the user a cryptic technical error
  • Corrupts data silently
  • Provides no information about what went wrong or why
  • Requires a manual restart or intervention to recover

The difference is not complexity. It's forethought. Graceful failure is designed in. Bad failure is what happens when nobody thought about it.

Why This Is The Capstone Skill

This section comes last because it requires everything before it:

  • Data Lifecycle — to trace where data went wrong, you must know where it flows
  • Boundaries — to contain failures, you must have clear boundaries to contain them within
  • Contracts — to detect failures, you must know what the expected behavior is (the contract) so you can recognize when it's violated
  • Decomposition — to isolate failures, the system must be decomposed into testable pieces

A well-decomposed system with clear boundaries and explicit contracts is inherently debuggable. A tangled system with no structure is inherently not. Debugging skill matters, but system design determines whether debugging is even possible.

The Mindset Shift

Stop thinking: "How do I make this work?" Start thinking: "How will this fail, and what should happen when it does?"

For every operation, every contract, every module, the questions are:

  • What are the ways this can fail?
  • Which failures are likely? Which are unlikely but catastrophic?
  • For each failure, what should the system do?
  • Can the user recover? Can the system recover automatically?
  • If nothing else works, what information do we need to diagnose the problem later?

This isn't pessimism. It's engineering. Bridges stay standing because someone thought about load limits. They collapse when someone didn't.

The same is true of software.

Failure Modes and Debugging — How: The Method

A Systematic Debugging Framework

When something goes wrong, follow this five-step process. It works for software, hardware, processes, and systems of any kind.


Step 1: Observe the Symptom Precisely

Don't say "it's broken." Say exactly what is happening:

  • ❌ "The page is broken"

  • ✅ "The page loads but shows 0 orders, when the user should have 15 orders"

  • ❌ "The system is slow"

  • ✅ "The search results take 12 seconds to appear; last week it was under 1 second"

  • ❌ "It doesn't work"

  • ✅ "Clicking 'Submit' does nothing — no error message, no loading indicator, no change"

Precise symptoms lead to precise diagnoses. Vague symptoms lead to guessing.


Step 2: Establish What Changed

Most bugs don't appear spontaneously. Something changed:

  • New code was deployed
  • Data volume increased
  • A third-party service updated their API
  • A configuration was modified
  • User behavior shifted (a marketing campaign drove unexpected traffic)

Ask: "What is different between when it worked and when it stopped working?"

If nothing changed internally, the cause is likely external: data, traffic, or a dependency.


Step 3: Bisect the Problem Space

This is the most powerful debugging technique. Instead of searching everywhere, cut the problem in half and determine which half contains the bug.

Your system is a chain of data flow (from the Data Lifecycle section). Data enters at one end and the wrong result appears at the other. Check the midpoint:

Input → [A] → [B] → [C] → [D] → Wrong Output
                 ↑
          Check here first.
          Is the data correct at this point?

  • If the data is correct at [B], the problem is in [C] or [D]
  • If the data is wrong at [B], the problem is in [A] or [B]

You've eliminated half the system. Repeat until you've narrowed it to a single step.

This is binary search applied to debugging, and it works whether you're debugging code, a business process, a network issue, or a recipe.
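
The bisection above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a general debugging tool: the `stages` and `checks` lists are hypothetical, the final output is already known to be wrong, and once the data goes wrong it stays wrong downstream (that last assumption is what makes halving valid).

```python
from typing import Any, Callable, Sequence


def find_first_bad_stage(
    stages: Sequence[Callable[[Any], Any]],
    checks: Sequence[Callable[[Any], bool]],
    data: Any,
) -> int:
    """Return the index of the first stage whose output fails its check."""
    # Capture the output after every stage. In a real system these would be
    # logs or snapshots taken at each boundary.
    outputs = []
    for stage in stages:
        data = stage(data)
        outputs.append(data)

    lo, hi = 0, len(stages) - 1  # the first bad stage lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if checks[mid](outputs[mid]):
            lo = mid + 1  # correct at the midpoint: fault is downstream
        else:
            hi = mid      # wrong at the midpoint: fault is here or upstream
    return lo


# Hypothetical four-stage chain with a bug planted in stage C.
stages = [
    lambda x: x.strip(),     # A
    lambda x: int(x),        # B
    lambda x: x * 3,         # C: bug, the contract says "double the value"
    lambda x: f"total={x}",  # D
]
checks = [
    lambda out: out == "21",
    lambda out: out == 21,
    lambda out: out == 42,
    lambda out: out == "total=42",
]
bad = find_first_bad_stage(stages, checks, " 21 ")
print("ABCD"[bad])  # C
```

Only two of the four checks are evaluated here. The saving grows with the length of the chain, which is why bisection beats inspecting every step in order.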


Step 4: Form and Test Hypotheses

Once you've narrowed the area, form a specific hypothesis:

"I believe the bug is caused by [specific thing] because [evidence]. If I'm right, then [testable prediction]."

Example:

"I believe orders show as 0 because the query is filtering by the wrong date format. If I'm right, then running the query directly will return empty results even though orders exist."

Then test it by observing — check the data, check the intermediate state. Don't change anything yet. Verify or disprove the hypothesis with evidence.

If the hypothesis is wrong, that's progress — you've eliminated a possibility.


Step 5: Verify the Fix

You found the cause. You made a change. How do you know the fix is correct?

  • Does the symptom disappear?
  • Does it work for all cases, or just the one you tested?
  • Did the fix introduce any new problems?
  • Can you explain why the fix works?

If you can't explain why the fix works, it's not a fix — it's a lucky accident that will break again.


Categorizing Failures

Not all failures are the same. Understanding the categories helps you design appropriate responses.

| Category | What It Is | Example | Response |
|----------|------------|---------|----------|
| Input | Bad data coming in | Letters in a phone number field | Validate at the boundary. Reject with a clear error. |
| Logic | Code produces wrong result | Off-by-one in a calculation | Test with known inputs, verify outputs match contract |
| Integration | Two parts don't align | Module A sends format X, Module B expects Y | Validate at every boundary. Integration failures are contract violations. |
| Resource | System exhausts something | Disk full, memory exhausted, rate limit hit | Monitor. Set limits and alerts. Design for constrained operation. |
| Dependency | External thing stops working | Database down, API returns errors | Timeout, retry, fallback, degrade gracefully |
| Timing | Wrong order or time | Two updates hit the same record simultaneously | The hardest category. Design explicit ordering where it matters. |
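
As one illustration of the Input row ("validate at the boundary, reject with a clear error"), here is a minimal Python sketch. The function name, the accepted separator characters, and the 7 to 15 digit range are assumptions chosen for the example, not rules taken from this course.

```python
import re


class InputError(ValueError):
    """Raised at the boundary when incoming data violates the contract."""


def parse_phone(raw: str) -> str:
    """Return the digits of a phone number, or raise InputError."""
    # Assumption: spaces, dashes, dots, parentheses and '+' are separators.
    digits = re.sub(r"[\s\-().+]", "", raw)
    if not (digits.isascii() and digits.isdigit()):
        raise InputError(
            f"phone number may contain only digits and separators, got {raw!r}"
        )
    # Assumption: 7 to 15 digits is the accepted length for this example.
    if not 7 <= len(digits) <= 15:
        raise InputError(f"phone number must have 7 to 15 digits, got {len(digits)}")
    return digits


print(parse_phone("(555) 123-4567"))  # 5551234567
try:
    parse_phone("555-CALL-NOW")
except InputError as err:
    print("rejected:", err)
```

Everything past this function can assume it holds a clean digit string. The bad value is stopped at the edge instead of surfacing three modules later as a Logic or Integration failure.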

Designing for Failure

For every module and contract, answer these five questions:

1. What are the failure modes?

List every way this can fail. Use the categories above as your checklist.

2. What is the blast radius?

If this fails, what else breaks? A well-bounded module limits the blast radius. A tangled one spreads damage everywhere.

3. What is the severity?

| Severity | Meaning | Example |
|----------|---------|---------|
| Critical | Data loss, financial impact, security breach | Double-charging a customer |
| High | Core feature unavailable | Can't log in |
| Medium | Feature degraded but usable | Search is slow but returns results |
| Low | Cosmetic or minor | Profile picture doesn't load |

4. What is the response strategy?

| Strategy | When to Use |
|----------|-------------|
| Prevent | Failure is predictable and avoidable (validate inputs, check preconditions) |
| Retry | Failure is transient (network blip, temporary overload) |
| Fallback | There's a "good enough" alternative (show cached data if live data is unavailable) |
| Degrade | Turn off the broken feature, keep everything else running |
| Alert | Needs human attention (log, notify, escalate) |
| Fail fast | Continuing would make things worse (stop if data is corrupted) |
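
Several of these strategies often appear together in one call site. The sketch below combines Retry, Fallback, and Fail fast in Python. `TransientError`, the default of three attempts, and the backoff delays are all hypothetical choices for illustration; a real system would pick them from its own failure analysis.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, overload)."""


def call_with_retry_and_fallback(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with backoff, then use the fallback.

    Any other exception propagates immediately (fail fast).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Retries exhausted: return the "good enough" alternative.
    return fallback()


calls = {"n": 0}


def flaky_fetch() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")
    return "live data"


result = call_with_retry_and_fallback(
    flaky_fetch, lambda: "cached data", sleep=lambda seconds: None
)
print(result)  # live data
```

Passing `sleep` as a parameter is a design choice that keeps the example testable without real delays.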

5. What information is needed to diagnose it later?

When something fails at 3am and you're investigating at 9am, what do you need?

  • What was the input?
  • What was the expected output?
  • What was the actual output or error?
  • When did it happen?
  • What was the system state?

This is logging — not an afterthought, but a critical design decision.
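
A minimal sketch of what such a log entry might look like, using Python's standard `logging` and `json` modules. The field names, the logger name, and the sample values are hypothetical; the point is only that each of the five questions above maps to one field.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("orders")  # hypothetical logger name


def failure_record(operation, input_data, expected, actual, state):
    """Build one structured record that answers the five questions above."""
    return {
        "operation": operation,
        "input": input_data,   # what was the input?
        "expected": expected,  # what was the expected output?
        "actual": actual,      # what was the actual output or error?
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when?
        "state": state,        # what was the system state?
    }


# Hypothetical failure, values invented for illustration.
record = failure_record(
    operation="apply_discount",
    input_data={"order_id": 1234, "code": "SPRING10"},
    expected="total reduced by 10%",
    actual="ValueError: unknown discount code",
    state={"config_version": "2024-03-01", "cache_age_seconds": 5400},
)
logger.error(json.dumps(record))
```

Writing the record as JSON keeps it searchable at 9am: you can filter by operation, time window, or state without parsing free-form text.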


What to Look For in the Examples

The following pages each present a system that has failed. You'll see:

  1. A symptom described precisely — the starting point
  2. The bisection process — how we narrow down the cause
  3. Multiple hypotheses — some wrong, some right
  4. The root cause — and how it connects to a failure category
  5. How the failure could have been prevented — what design decision would have caught it earlier

Failure Modes — Example: E-Commerce Order Goes Wrong

The Scenario

An online store selling electronics. Customers are reporting a strange problem: some orders show the wrong items. A customer ordered a laptop and received a phone charger. Another ordered headphones and received a keyboard. It's not happening to all orders — just some.

This is a real investigation. Let's walk through it.


Step 1: Observe the Symptom Precisely

We gather reports and look for patterns:

| Customer | Ordered | Received | Order Date | Payment Correct? |
|----------|---------|----------|------------|------------------|
| Customer A | Laptop ($899) | Phone Charger ($15) | March 5 | Charged $899 ✅ |
| Customer B | Headphones ($79) | Keyboard ($49) | March 5 | Charged $79 ✅ |
| Customer C | Monitor ($350) | Monitor ($350) | March 5 | Charged $350 ✅ |
| Customer D | Tablet ($449) | Mouse ($25) | March 6 | Charged $449 ✅ |
| Customer E | Keyboard ($49) | Keyboard ($49) | March 6 | Charged $49 ✅ |

Observations:

  • Payment amounts are always correct (matches what they ordered, not what they received)
  • Some orders are fine (C and E got the right items)
  • Wrong items don't seem related (laptop → charger, headphones → keyboard)
  • Started March 5

The symptom is: The warehouse is shipping the wrong physical items for some orders, but the order records and payments are correct.


Step 2: Establish What Changed

What happened around March 5?

  • March 4: New inventory system deployed (upgraded from v2.3 to v3.0)
  • March 5: First wrong-item reports
  • Nothing else changed (no other code deploys, no staff changes, no new warehouse)

Strong correlation: new inventory system → wrong items. But correlation isn't causation — let's investigate.


Step 3: Bisect the Problem Space

The order lifecycle is:

Customer places order → Order recorded → Warehouse receives pick list → 
Worker picks items → Items packed → Items shipped → Customer receives

Where is the wrong data? Let's check the midpoint — the pick list that the warehouse receives.

Check 1: Is the order record correct?

We look at Customer A's order in the database:

  • Order #10547: Product = "Laptop XPS 15", Product ID = LP-2001, Quantity = 1

The order record is correct. The customer ordered a laptop and the database says laptop.

Check 2: Is the pick list correct?

We look at the pick list that was sent to the warehouse for Order #10547:

  • Order #10547: Bin Location = B-14, Quantity = 1

Wait — the pick list shows a bin location, not a product name. The warehouse worker goes to bin B-14 and picks whatever is there.

Check 3: What's in bin B-14?

  • Before March 4: Bin B-14 = Laptop XPS 15 (correct)
  • After March 4 (inventory system upgrade): Bin B-14 = Phone Charger USB-C

The bin assignments changed when the inventory system was upgraded. The old system had one mapping of products to bins. The new system reassigned bins based on a different optimization algorithm. But the pick list generation was still using the old mapping — it was reading from a cached or stale copy of the bin assignments.


Step 4: Form and Test Hypothesis

Hypothesis: The pick list generator is using a cached copy of the product-to-bin mapping that wasn't updated when the inventory system was upgraded on March 4. Products whose bins didn't change (same bin in old and new system) are shipping correctly. Products whose bins changed are shipping wrong items.

Test the prediction:

If this hypothesis is correct, then:

  1. Products that shipped correctly should have the same bin location in both old and new systems
  2. Products that shipped wrong should have different bin locations
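
| Customer | Product | Old Bin | New Bin | Same? | Shipped Correctly? |
|----------|---------|---------|---------|-------|--------------------|
| A | Laptop | B-14 | C-22 | ❌ No | ❌ Wrong item |
| B | Headphones | D-08 | A-31 | ❌ No | ❌ Wrong item |
| C | Monitor | F-15 | F-15 | ✅ Yes | ✅ Correct |
| D | Tablet | A-31 | D-08 | ❌ No | ❌ Wrong item |
| E | Keyboard | G-03 | G-03 | ✅ Yes | ✅ Correct |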

Perfect correlation. Every wrong shipment has a bin mismatch. Every correct shipment has the same bin. Hypothesis confirmed.
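
The same check can be written as a short Python script, which scales better than eyeballing once there are hundreds of orders instead of five. The rows are copied from the table above; the variable names are only illustrative.

```python
# Rows from the table above: (customer, old_bin, new_bin, shipped_correctly)
observations = [
    ("A", "B-14", "C-22", False),
    ("B", "D-08", "A-31", False),
    ("C", "F-15", "F-15", True),
    ("D", "A-31", "D-08", False),
    ("E", "G-03", "G-03", True),
]

# Prediction: an order ships correctly if and only if its bin did not change.
counterexamples = [
    customer
    for customer, old_bin, new_bin, shipped_ok in observations
    if (old_bin == new_bin) != shipped_ok
]
print(counterexamples)  # []
```

An empty list means no observation contradicts the hypothesis. A single counterexample, such as a wrong shipment from an unchanged bin, would be enough to send the investigation back to Step 3.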

Notice something extra:

Headphones moved from D-08 to A-31 and tablets moved from A-31 to D-08: the two products swapped bins. A stale pick list therefore sends a tablet order (Customer D) to A-31, which now holds headphones, and a headphone order (Customer B) to D-08, which now holds tablets. What each customer actually receives depends on what is physically stocked in that bin at pick time.

This is how a bin mapping error creates cross-contamination — wrong items go to wrong customers in unpredictable combinations.


Step 5: Root Cause and Fix

Root Cause

Integration failure (Category: Integration). Two modules — the pick list generator and the inventory system — were reading from different versions of the bin mapping. The upgrade updated the inventory system's internal mapping but didn't invalidate or update the cache used by the pick list generator.

The Direct Fix

Update the pick list generator to read bin locations from the new inventory system's live data, not from a cached copy.

Verify the Fix

  • After the fix, run 10 test orders for products with changed bins. Verify the pick lists show the new bin locations.
  • Check that the fix doesn't affect products with unchanged bins (they should still work).
  • Check timing: the fix should take effect immediately, not after a cache timeout.

The Deeper Lesson: What Should Have Prevented This

1. Contract violation

The pick list generator had an implicit contract with the inventory system: "I will give you bin locations for product IDs." But the contract didn't specify where that data came from — live query vs. cached copy. If the contract had been explicit ("bin locations must be queried from the inventory system at pick time, not cached"), the cache would never have been built.

2. Boundary violation

The pick list generator cached data that belonged to another module (inventory). It crossed a boundary. If the boundary were enforced — "only the inventory module knows bin locations; everyone else must ask" — the stale cache wouldn't exist.

3. Missing failure mode in the upgrade plan

The inventory system upgrade plan didn't include: "What other systems read our bin mapping, and how do they read it?" A pre-mortem would have surfaced this: "What if other systems have a stale copy of our bin assignments?"

4. No verification at the seam

The seam between "pick list generated" and "warehouse worker picks item" has no verification. The worker goes to the bin and picks what's there — they have no way to verify it's the right product (unless they check the product name, which wasn't on the pick list). Adding a product name or barcode scan at pick time would have caught the mismatch immediately.
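
A minimal Python sketch of that verification step follows. It assumes a pick list line that carries the product ID alongside the bin, and a scanner that returns the product ID of whatever the worker picked up. `PickLine`, `verify_pick`, and the charger ID `CH-0042` are hypothetical; the order number, product ID, and bin come from the investigation above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PickLine:
    order_id: int
    product_id: str    # what the order says should ship
    bin_location: str  # where the worker is told to look


class PickMismatch(Exception):
    """The item in the worker's hand is not the item on the order."""


def verify_pick(line: PickLine, scanned_product_id: str) -> None:
    """Raise PickMismatch unless the scanned item matches the pick line."""
    if scanned_product_id != line.product_id:
        raise PickMismatch(
            f"order {line.order_id}: expected {line.product_id} "
            f"in bin {line.bin_location}, scanned {scanned_product_id}"
        )


line = PickLine(order_id=10547, product_id="LP-2001", bin_location="B-14")
try:
    verify_pick(line, "CH-0042")  # hypothetical ID for the phone charger
except PickMismatch as err:
    print("stop and escalate:", err)
```

With this check in place, the stale mapping would have produced an error on the first affected pick on March 5 instead of a customer complaint days later.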


Failure Category Map for This Scenario

Root cause:    Integration failure (stale cache)
Amplifier:     No verification at physical seam
Blast radius:  All orders with changed bin locations (~30% of products)
Severity:      High (wrong items shipped, expensive returns)
Could prevent: Explicit contract, boundary enforcement, upgrade checklist
Could detect:  Barcode verification at pick, bin mapping comparison test
Could reduce:  Faster detection through customer complaint pattern analysis

Failure Modes — Example: Messaging App Mystery

The Scenario

A team messaging app (like Slack). Users are reporting that messages are appearing out of order in group conversations. A message sent at 2:03 PM appears above a message sent at 2:01 PM. It doesn't happen in every conversation, and it doesn't happen all the time. Some users say they can't reproduce it.

This is a timing failure — the hardest category to debug because the problem is intermittent and order-dependent.


Step 1: Observe the Symptom Precisely

We collect specific reports:

| Report | Conversation | What User Sees | Expected Order |
|--------|--------------|----------------|----------------|
| 1 | #engineering (45 members) | Message from Bob at 2:03 appears above Alice's from 2:01 | Alice first, Bob second |
| 2 | #engineering (45 members) | Same message pair, but Carol sees them in the correct order | Depends on viewer? |
| 3 | DM between Dave and Eve | Never happens; DMs are always in order | |
| 4 | #general (200 members) | Frequent reordering, especially during active discussion | |
| 5 | #random (10 members) | Rarely happens | |

Pattern emerging:

  • Happens in group conversations, not DMs
  • More frequent in larger groups
  • More frequent during high activity
  • Different users see different orders for the same messages

Step 2: Establish What Changed

Users say this "started recently" but can't say exactly when. We check the deployment log:

  • 2 weeks ago: Scaled the messaging backend from 1 server to 3 servers (load balancing) to handle growing user count
  • Nothing else changed

Hypothesis forming: Scaling from 1 server to 3 might be related to the ordering issue.


Step 3: Bisect the Problem Space

The message flow is:

Sender types message → Sender's device sends to server → Server stores message → 
Server broadcasts to group members → Each member's device receives and displays

Check the midpoint: Are messages stored in the correct order?

We query the database for the #engineering conversation around 2:00 PM:

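| Message ID | Sender | Text | Timestamp (server) | Stored Order |
|------------|--------|------|--------------------|--------------|
| msg-4401 | Alice | "Has anyone seen the test results?" | 2:01:03.142 | 1st |
| msg-4402 | Bob | "Just posted them in the doc" | 2:01:47.891 | 2nd |
| msg-4403 | Carol | "Thanks!" | 2:02:15.003 | 3rd |
| msg-4404 | Bob | "The latency numbers look bad" | 2:03:01.556 | 4th |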

The database has the correct order. Messages are stored with server timestamps and the order is right.

So the bug is after storage — in the broadcast/display phase.


Step 4: Deeper Investigation — The Broadcast

How does broadcast work?

Before the scale-up (1 server):

Message stored → Server sends to all connected members → Done

After the scale-up (3 servers):

Message stored → Publish event to message queue → All 3 servers read from queue → 
Each server sends to its connected members

With 3 servers, the 45 members of #engineering are distributed:

  • Server 1: 18 members connected
  • Server 2: 15 members connected
  • Server 3: 12 members connected

When Alice sends a message at 2:01, the flow is:

  1. Alice's device → Server 2 (she happens to be connected to Server 2)
  2. Server 2 stores the message
  3. Server 2 publishes "new message" event to the message queue
  4. All 3 servers pick up the event and send it to their connected members

When Bob sends a message at 2:03, the flow is:

  1. Bob's device → Server 1 (he's connected to Server 1)
  2. Server 1 stores the message
  3. Server 1 publishes "new message" event to the message queue
  4. All 3 servers pick up the event

Here's the problem:

The message queue doesn't guarantee that events are delivered to all consumers in the same order. Server 1 might receive Bob's event before Alice's event because Bob's message was published from Server 1 (local) while Alice's had to travel across the network.

Timeline:

Server 2 stores Alice's msg at 2:01:03.142
Server 2 publishes event ─────────────────────── travels across network
Server 1 stores Bob's msg at 2:03:01.556
Server 1 publishes event ─── stays local

Server 1 receives Bob's event at 2:03:01.560 ← 4ms after it was stored (local)
Server 1 receives Alice's event at 2:03:01.580 ← 20ms after Bob's event (delayed crossing the network)

Server 1 broadcasts Bob's message to its 18 connected members FIRST
Server 1 broadcasts Alice's message 20ms later

Users on Server 1 see: Bob, then Alice (WRONG ORDER)
Users on Server 2 see: Alice, then Bob (CORRECT ORDER)

Different users see different orders because they're connected to different servers, and the servers receive events in different orders.
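
The effect can be reproduced with a toy model in Python. This is a sketch of the behavior described above, not the app's real code: the `Event` type, the server names, and the arrival lists are assumptions that mirror the timeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sender: str
    created_at: float  # server timestamp, in seconds after 2:00:00 PM


alice = Event("Alice", 63.142)  # stored at 2:01:03.142
bob = Event("Bob", 181.556)     # stored at 2:03:01.556

# Arrival order at each server, as in the timeline above.
arrivals = {
    "server-1": [bob, alice],  # Bob's event is local, Alice's arrives after it
    "server-2": [alice, bob],
}


def broadcast_order(events):
    """A server that forwards events in arrival order, ignoring timestamps."""
    return [event.sender for event in events]


def timestamp_order(events):
    """The order users expect: sorted by when each message was created."""
    return [event.sender for event in sorted(events, key=lambda e: e.created_at)]


for server, events in arrivals.items():
    seen = broadcast_order(events)
    status = "correct" if seen == timestamp_order(events) else "WRONG ORDER"
    print(server, seen, status)
```

Both servers hold the same two events with the same timestamps. Only the arrival order differs, which is why the database looks correct while some users see the wrong order.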


Step 4 (continued): Form Hypothesis

Hypothesis: When multiple messages arrive at a server via the message queue within a short time window, the server broadcasts them in arrival order (when the event reached that particular server) rather than timestamp order (when the message was actually created). Users connected to different servers receive messages in different orders.

Test the prediction:

If this is correct, then:

  1. DMs never have this problem (they involve only 2 people, likely routed through one server)
  2. Larger groups are more affected (more members = more servers involved along the way)
  3. Fast-paced conversations are more affected (messages close together in time are more susceptible to reordering)
  4. Users on the same server always see the same order (right or wrong)

All four predictions match the reports. Hypothesis confirmed.


Step 5: The Fix — And Why It's Not Obvious

Attempted Fix 1: "Just sort by timestamp on the server"

Have each server sort messages by timestamp before broadcasting.

Problem: The server doesn't know if more messages are coming. When it receives Bob's event, should it wait to see if an earlier message might arrive? How long should it wait? 10ms? 100ms? 1 second?

  • Wait too short → still might miss earlier messages
  • Wait too long → messages feel laggy to users

This is a fundamental tradeoff in distributed systems: you can't have both instant delivery AND perfectly correct ordering without coordination.

Attempted Fix 2: "Sort on the client"

Each user's device sorts messages by server timestamp after receiving them.

Problem: This mostly works, but creates a jarring experience — a message appears at the bottom, then "jumps up" when an earlier message arrives a moment later. Users see messages rearranging in real time, which feels buggy even though it's technically correct.

Actual Fix: Client-side insertion sort with timestamp

Each user's device maintains messages sorted by server timestamp. When a new message arrives:

  1. Check its timestamp against the last displayed message
  2. If it's newer → append at the bottom (most common case, feels instant)
  3. If it's older → insert it at the correct position AND show a subtle visual indicator ("1 earlier message inserted above")

This is a compromise: correct ordering with a visual cue so users aren't confused by messages appearing "above" what they already read.
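
The three steps above can be sketched as follows. This is a minimal illustration, not the app's actual client code: ChatMessage and Conversation are hypothetical names, and the return value stands in for "show the visual indicator".

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class ChatMessage:
    server_ts_ms: int                      # ordering uses the timestamp only
    text: str = field(compare=False)

class Conversation:
    def __init__(self) -> None:
        self.messages: list[ChatMessage] = []

    def receive(self, msg: ChatMessage) -> bool:
        """Insert msg in timestamp order.

        Returns True when the message landed above one that is already
        displayed, so the UI can show an "earlier message inserted" cue.
        """
        if not self.messages or msg.server_ts_ms >= self.messages[-1].server_ts_ms:
            self.messages.append(msg)      # common case: newest message
            return False
        bisect.insort(self.messages, msg)  # late arrival: place by timestamp
        return True

chat = Conversation()
chat.receive(ChatMessage(1000, "Bob: hi"))
inserted_above = chat.receive(ChatMessage(990, "Alice: hello"))
print(inserted_above, [m.text for m in chat.messages])
```

The append path keeps the common case cheap; only late arrivals pay for a search through the list.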


The Deeper Lesson: Distributed Systems Create Timing Failures

Why 1 server didn't have this problem

With one server, all messages passed through a single point. The server processed them sequentially, so the broadcast order always matched the storage order. No timing ambiguity.

Why 3 servers created the problem

With three servers, there are three independent paths for messages to travel. Each path has slightly different timing. This is called non-deterministic ordering — the order depends on network latency, load, and which server the sender is connected to.

The general principle

Any time you add parallelism, you create the possibility of ordering problems. This applies to:

  • Multiple servers
  • Multiple threads in a program
  • Multiple workers processing a queue
  • Multiple microservices handling events

The question isn't "will ordering be a problem?" It's "how will we handle the ordering problem?"


Failure Category Map

Root cause:    Timing failure (non-deterministic event ordering across servers)
Amplifier:     High message volume in large groups
Blast radius:  All group conversations (DMs unaffected)
Severity:      Medium (annoying, confusing, but no data loss)
Could prevent: Design the broadcast system with ordering guarantees from day one
Could detect:  Automated ordering test (send messages with known timestamps,
               verify all clients receive in correct order)
Could reduce:  Client-side timestamp sorting with visual indicators

Compare With the E-Commerce Bug

Aspect | E-Commerce (wrong items) | Messaging (wrong order)
Failure category | Integration (stale cache) | Timing (non-deterministic ordering)
Reproducibility | 100% for affected products | Intermittent, depends on timing
Who's affected | ~30% of orders (those with changed bins) | Users on different servers, during high activity
Data corrupted? | No (database correct, pick list wrong) | No (database correct, display order wrong)
Root cause found by | Checking midpoint (pick list) | Checking delivery path (server → client)
Fix complexity | Simple (update the data source) | Hard (fundamental distributed systems tradeoff)
Prevention | Better contracts and boundary enforcement | Architectural decision about ordering guarantees

The e-commerce bug had a clear, fixable root cause. The messaging bug revealed a fundamental limitation of the architecture that required a compromise, not a simple fix. This is the difference between a bug and a design constraint.

Failure Modes — Example: Flight Booking Cascade

The Scenario

An airline booking system. On a busy Friday afternoon, the following happens within a 30-minute window:

  1. Customers report they can't search for flights (the search page spins forever)
  2. Customers who already selected flights can't complete payment
  3. The customer service phone lines are flooded
  4. An agent manually checks and sees that the booking database is responding, but slowly
  5. Internal monitoring shows the flight search API is responding in 45 seconds (normal: under 500 milliseconds)

This is a cascading failure — one problem triggers a chain of other problems that makes everything worse.


Step 1: Observe the Symptoms — All of Them

Unlike the previous examples (single symptom), here we have multiple symptoms appearing simultaneously:

Symptom | Affected System | Severity
Flight search returns in 45 seconds | Search API | High
Payment processing times out | Payment Service | Critical
Customer service call volume 5x normal | Call Center | High
Booking database slow (but responding) | Database | Medium
Internal admin dashboard unresponsive | Admin UI | Low

These aren't five separate bugs. They're connected. The key question: which one caused the others?


Step 2: Establish a Timeline

We reconstruct what happened:

Time | Event
2:00 PM | Everything normal
2:12 PM | Marketing team launches a flash sale: "50% off all Caribbean flights this weekend." Email sent to 2 million subscribers.
2:15 PM | Website traffic increases 10x
2:17 PM | Search API response times begin rising (500ms → 2s → 5s → 15s → 45s)
2:20 PM | Payment service starts timing out (its requests to the database are queued behind search queries)
2:22 PM | Customers who can't search or pay start calling customer service
2:25 PM | Admin dashboard becomes unresponsive (it also queries the same database)
2:30 PM | All systems severely degraded

The trigger: A flash sale email drove sudden, massive traffic. But the trigger is not the root cause. Traffic spikes are expected. The question is: why didn't the system handle it?


Step 3: Bisect — Find the Bottleneck

The system architecture:

Users → Web Servers → Search API ──┐
                                    ├── Database
Users → Web Servers → Payment API ──┘
                                    │
Admin → Admin Dashboard ────────────┘

Three services (Search, Payment, Admin) all share one database. Let's check each layer:

Web servers: Handling requests, but slowly. They're waiting on responses from the APIs. Not the bottleneck — they're victims.

Search API: Sending queries to the database, but queries are slow. Not the root cause — it's also waiting.

Payment API: Same situation. Queries are queuing up and timing out.

Database: Here's the bottleneck.

Inside the Database

The database can handle approximately 500 queries per second under normal load. Each search query involves:

  • Searching available flights by route, date, and class
  • Checking seat availability for each matching flight
  • Calculating dynamic pricing for each available flight

A single search is about 3-5 database queries. The estimates below use the upper figure of 5.

Normal traffic: 50 searches/sec × 5 queries = 250 queries/sec → database comfortable

Flash sale traffic: 500 searches/sec × 5 queries = 2,500 queries/sec → database overwhelmed

The database hits its connection limit. New queries queue up. Queue times increase. The search API waits for the database, the web server waits for the search API, the user waits for the web server. Each layer adds its own timeout on top.

But here's the critical part: The payment API, which only handles 10-20 transactions per second, also queries the same database — but its queries are now stuck behind 2,500 search queries. A payment that normally takes 200ms now takes 30 seconds and times out.

The flash sale broke search AND payment, even though payment traffic didn't increase at all.
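
The load arithmetic can be checked in a few lines. This sketch uses the figures given above (roughly 500 queries/sec of capacity, up to 5 queries per search); the function names are illustrative only.

```python
DB_CAPACITY_QPS = 500     # approximate capacity stated above
QUERIES_PER_SEARCH = 5    # upper end of the 3-5 estimate

def search_load_qps(searches_per_sec: int) -> int:
    """Database queries per second generated by search traffic."""
    return searches_per_sec * QUERIES_PER_SEARCH

def utilization(load_qps: int) -> float:
    """Fraction of database capacity consumed; above 1.0 means queuing."""
    return load_qps / DB_CAPACITY_QPS

normal = search_load_qps(50)
flash = search_load_qps(500)
print(normal, utilization(normal))  # 250 0.5
print(flash, utilization(flash))    # 2500 5.0
```

At 5x capacity the queue grows faster than it drains, which is why payment's 10-20 transactions per second starve even though payment traffic itself never changed.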


Step 4: The Cascade Chain

Flash sale email
    → 10x website traffic
        → 10x database query volume (search queries)
            → Database connection pool exhausted
                → Search queries slow to 45 seconds
                → Payment queries can't get a database connection
                    → Payment times out
                    → Customers can't pay
                        → Customers call support
                            → Support lines overwhelmed
                → Admin queries can't get a database connection
                    → Admin dashboard unresponsive
                    → Operators can't see what's happening
                        → Slow response to the incident

One event (flash sale) → six cascading failures. Each failure amplifies the next.

The Amplification Pattern

Notice the feedback loop:

  • Search is slow → users retry (refresh the page) → more search queries → database even slower → users retry more aggressively

This is a thundering herd — when a system slows down, users retry, which generates even more load, which makes it slower, which generates more retries. The system enters a death spiral where recovery is impossible without intervention.
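
The feedback loop can be illustrated with a toy model. It assumes, purely for illustration, that every request the database cannot serve in a given second comes back as 1.5 retries in the next second; real retry behavior is messier, and simulate_retries is a hypothetical helper.

```python
def simulate_retries(base_qps: float, capacity_qps: float,
                     retries_per_failure: float, seconds: int) -> list[float]:
    """Offered load per second when unserved requests are retried."""
    offered = []
    retries = 0.0
    for _ in range(seconds):
        load = base_qps + retries          # fresh traffic plus last second's retries
        offered.append(load)
        unserved = max(0.0, load - capacity_qps)
        retries = unserved * retries_per_failure
    return offered

overloaded = simulate_retries(base_qps=2500, capacity_qps=500,
                              retries_per_failure=1.5, seconds=5)
healthy = simulate_retries(base_qps=250, capacity_qps=500,
                           retries_per_failure=1.5, seconds=5)
print(overloaded)
print(healthy)
```

Below capacity, retries never appear and the load stays flat. Above capacity, each second's failures add to the next second's load, so the offered load keeps climbing even though the number of real customers has not changed.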


Step 5: Hypotheses and What Would Fix Each

Hypothesis 1: "Just upgrade the database"

Get a bigger database that can handle 2,500 queries/sec.

Problem: This fixes today's flash sale but doesn't fix the next one. If the sale is bigger (5 million emails), you'd need an even bigger database. You're scaling to the peak — expensive and always one step behind.

Verdict: Treats the symptom, not the disease.

Hypothesis 2: "Separate the databases"

Give search, payment, and admin their own databases.

Search API  → Search Database (can be slow under load — annoying but not critical)
Payment API → Payment Database (protected from search traffic — critical operations stay fast)
Admin       → Admin Database (or read replica)

Verdict: This prevents search traffic from killing payments. The blast radius of a search overload no longer includes payment. This is the boundary principle from Section 2 — critical and non-critical operations should not share the same resource pool.

Hypothesis 3: "Add rate limiting"

Limit search to 200 queries per second. Beyond that, return a "please try again in a moment" message.

Verdict: This prevents the database from being overwhelmed. Users see a brief delay instead of a 45-second hang. It's annoying but preferable to the entire system collapsing. This is the degrade strategy: intentionally limit one feature to protect the rest.
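
One common way to implement such a limit is a token bucket. The sketch below is an assumption about how it could be done, not a description of the airline's system; the clock is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allows up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller answers "please try again in a moment"

limiter = TokenBucket(rate=200, capacity=200)
accepted_first_second = sum(limiter.allow(0.0) for _ in range(500))
accepted_next_second = sum(limiter.allow(1.0) for _ in range(500))
print(accepted_first_second, accepted_next_second)
```

Rejected requests cost almost nothing to answer, which is the point: the expensive resource (the database) only ever sees the traffic it can handle.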

Hypothesis 4: "Cache the popular searches"

Caribbean flights are what the sale promoted. Cache the results for common Caribbean route+date queries. The first search hits the database; subsequent identical searches are answered from cache.

Verdict: This dramatically reduces database load for the exact queries the flash sale generates. If 80% of search queries are for the same Caribbean routes, cache handles 2,000 of the 2,500 queries/sec, leaving about 500 for the database: right at its normal capacity instead of five times over it.
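
A minimal time-to-live cache shows the idea. Everything here is hypothetical: the TTLCache class, the 30-second lifetime, the route code, and the search_database stand-in.

```python
from typing import Callable

class TTLCache:
    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._store: dict[tuple, tuple[float, object]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: tuple, load: Callable[[], object], now: float) -> object:
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = load()                     # only misses reach the database
        self._store[key] = (now, value)
        return value

db_calls = []

def search_database(route: str, date: str) -> list[str]:
    db_calls.append((route, date))         # stand-in for the real query
    return [f"{route} on {date}: flight 101"]

cache = TTLCache(ttl_seconds=30)
key = ("MIA-SJU", "2024-06-01")
for second in range(10):
    cache.get(key, lambda: search_database(*key), now=float(second))
print(len(db_calls), cache.hits)           # 1 9
```

The cost is staleness: for up to 30 seconds, seat availability and prices may be out of date. That is the same kind of risk that caused the e-commerce bug, so the lifetime has to be chosen deliberately.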

The Real Fix: All of the Above (Layered Defense)

No single fix is sufficient. Real systems use layered defenses:

  1. Rate limiting (immediate: deploy within hours, prevents the death spiral)
  2. Caching (short-term: deploy within days, reduces database load for common queries)
  3. Separate databases (medium-term: deploy within weeks, isolates critical from non-critical)
  4. Load testing before promotions (process: coordinate with marketing — "tell engineering before sending 2 million emails")

The Deeper Lesson: Shared Resources Create Cascading Failures

The Shared Resource Anti-Pattern

BEFORE (dangerous):

Search ──┐
Payment ──┼── Shared Database
Admin ────┘

Any one service can saturate the database and starve the others.

AFTER (isolated):

Search  → Search DB (or cache + DB)
Payment → Payment DB
Admin   → Read replica

Each service has its own resource pool. A surge in one doesn't affect the others.
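
The difference can be shown with a toy connection pool. The pool sizes are invented for illustration, and a real pool would also handle waiting, timeouts, and thread safety.

```python
class ConnectionPool:
    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def try_acquire(self) -> bool:
        if self.in_use < self.size:
            self.in_use += 1
            return True
        return False

    def release(self) -> None:
        self.in_use = max(0, self.in_use - 1)

# BEFORE: one pool shared by every service.
shared = ConnectionPool(size=100)
search_got = sum(shared.try_acquire() for _ in range(2500))   # search surge
payment_shared_ok = shared.try_acquire()                      # payment arrives after

# AFTER: one pool per service.
pools = {
    "search": ConnectionPool(80),
    "payment": ConnectionPool(15),
    "admin": ConnectionPool(5),
}
search_isolated_got = sum(pools["search"].try_acquire() for _ in range(2500))
payment_isolated_ok = pools["payment"].try_acquire()

print(payment_shared_ok, payment_isolated_ok)   # False True
```

Search is equally overloaded in both cases. The only thing that changes is who else gets hurt.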

The Blast Radius Principle

Every shared resource is a potential blast radius amplifier. When you share:

  • A database
  • A network connection
  • A thread pool
  • A queue
  • A rate limit
  • A budget

…you're saying "the failure of any one consumer can affect all consumers." Sometimes sharing is the right choice (cost, simplicity). But you must know what you're risking.

The Traffic Spike Is Not the Bug

The flash sale email was the trigger, not the cause. The cause was:

  1. No isolation between critical (payment) and non-critical (search) systems
  2. No rate limiting to prevent overload
  3. No caching for predictable high-volume queries
  4. No coordination between marketing and engineering

The flash sale was a normal business event. The system should have handled it — or at least degraded gracefully instead of collapsing completely.


Failure Category Map

Root cause:    Resource failure (shared database overwhelmed)
Trigger:       External traffic spike (flash sale email)
Amplifier:     Thundering herd (user retries), shared resource (database)
Blast radius:  ALL services (search, payment, admin, support)
Severity:      Critical (payment broken = lost revenue)
Could prevent: Resource isolation (separate databases), rate limiting,
               caching, load testing, marketing coordination
Could detect:  Database connection pool monitoring, query queue length 
               alerts, response time thresholds
Could reduce:  Graceful degradation (return cached results, queue 
               payments for retry instead of dropping them)

Compare With Previous Examples

Aspect | E-Commerce (wrong items) | Messaging (wrong order) | Flight Booking (cascade)
Number of symptoms | 1 (wrong items) | 1 (wrong order) | 5+ (everything breaks)
Failure category | Integration | Timing | Resource + cascading
Trigger | System upgrade | Scaling to 3 servers | Traffic spike
Root cause | Stale cache | Non-deterministic ordering | Shared database bottleneck
Fix complexity | Simple (one change) | Moderate (client-side compromise) | High (layered, multiple changes)
Blast radius | 30% of orders | Group conversations | Every user, every feature
Feedback loop? | No | No | Yes (retries amplify load)
Prevention theme | Contracts + boundaries | Architectural ordering decisions | Resource isolation + graceful degradation

The key escalation: from a single-cause bug, to a design limitation, to a systemic vulnerability. Each example requires more sophisticated thinking about failure.

Failure Modes — Pre-Mortems and Failure Planning

What Is a Pre-Mortem?

A post-mortem happens after something breaks. You investigate what went wrong.

A pre-mortem happens before anything breaks. You imagine it's six months from now and the system has failed catastrophically, then work backward: "What went wrong?"

This flips the psychology. In a planning meeting, people are optimistic. In a pre-mortem, everyone has permission to be pessimistic — and pessimism is productive.


The Pre-Mortem Process

Step 1: Define the System

State clearly what you're building and its key characteristics:

  • What does it do?
  • Who uses it?
  • What data does it handle?
  • What are the critical operations?

Step 2: Imagine the Disaster

Each person independently writes down their answer to: "It's six months from now. The system has failed in its worst possible way. What happened?"

Not little bugs. Catastrophes:

  • Data was lost permanently
  • Money was charged incorrectly
  • Security was breached
  • The system was down for days
  • Customers left in large numbers

Step 3: Group the Failures

Collect all imagined disasters and group them:

  • Which ones are about data? (loss, corruption, leakage)
  • Which ones are about availability? (downtime, slowness)
  • Which ones are about correctness? (wrong results, wrong actions)
  • Which ones are about security? (unauthorized access, data exposure)
  • Which ones are about scaling? (couldn't handle growth)

Step 4: For Each Failure, Ask Three Questions

  1. How likely is this? (Almost certain / Probable / Possible / Unlikely)
  2. How severe is the impact? (Critical / High / Medium / Low)
  3. What would prevent or mitigate it? (Design decision, monitoring, process)

Step 5: Act on the High-Risk Items

Anything that is both likely and severe must be addressed in the design — not deferred to "we'll fix it later."


Worked Pre-Mortem 1: Online Banking App

The System

A mobile banking app. Customers can check balances, transfer money, pay bills, and deposit checks by photographing them.

Imagined Disasters

# | Disaster | Category | Likelihood | Severity
1 | "A customer transferred $10,000 but the money disappeared: it left the source account but never arrived at the destination" | Correctness | Possible | Critical
2 | "Someone gained access to 50,000 customer accounts because session tokens weren't invalidated after password changes" | Security | Probable | Critical
3 | "The app was down for 6 hours on a Friday (payday) because a database migration failed and couldn't be rolled back" | Availability | Probable | Critical
4 | "A customer deposited the same check 15 times by rapidly submitting photos, and was credited $15,000 for a $1,000 check" | Correctness | Possible | High
5 | "Customer service had no way to see what went wrong with a failed transaction because logging was incomplete" | Data | Probable | High
6 | "The mobile app crashed on Android 12 devices and wasn't caught because we only tested on iOS" | Availability | Probable | Medium
7 | "A third-party payment provider changed their API without notice, and bill payments silently failed for 3 days" | Integration | Possible | High

Prevention Plan

Disaster 1: Disappearing transfer

  • Prevention: Atomic transactions — debit and credit must be a single, indivisible operation. If one fails, both roll back.
  • Detection: Reconciliation: every night, verify that total debits = total credits across all accounts. Any mismatch triggers an alert.
  • Recovery: Transaction is logged with full details regardless of success/failure, enabling manual correction.
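
A database transaction is the usual way to get this all-or-nothing behavior. The sketch below uses Python's built-in sqlite3 module with a made-up accounts table; it illustrates the principle only and leaves out logging, currency handling, and concurrency control.

```python
import sqlite3

def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Debit src and credit dst as one unit; any error rolls back both."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown source account")
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        if cur.rowcount != 1:
            raise ValueError("unknown destination account")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 10_000), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 10_000)
try:
    transfer(conn, "bob", "nobody", 5_000)   # credit fails, so the debit is undone
except ValueError:
    pass

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)   # {'alice': 0, 'bob': 10000}
```

The failed second transfer leaves bob's balance untouched, and the total across accounts is still 10,000, which is the invariant the nightly reconciliation checks.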

Disaster 2: Session tokens after password change

  • Prevention: On password change, invalidate ALL active sessions for that user. Force re-authentication.
  • Detection: Monitor for sessions that continue after a password change event. This should trigger a security alert.
  • Process: Add to the security review checklist: "What happens to active sessions when credentials change?"
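
The prevention rule can be sketched with an in-memory session store. SessionStore and change_password are hypothetical names; password verification, hashing, and persistence are deliberately left out.

```python
import secrets

class SessionStore:
    def __init__(self) -> None:
        self._sessions: dict[str, str] = {}   # token -> user id

    def login(self, user_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._sessions[token] = user_id
        return token

    def is_valid(self, token: str) -> bool:
        return token in self._sessions

    def invalidate_all(self, user_id: str) -> int:
        """Remove every session for user_id; returns how many were removed."""
        stale = [t for t, u in self._sessions.items() if u == user_id]
        for token in stale:
            del self._sessions[token]
        return len(stale)

def change_password(store: SessionStore, user_id: str) -> int:
    # Storing the new password is out of scope for this sketch.
    # The important part: no old session survives the change.
    return store.invalidate_all(user_id)

store = SessionStore()
phone = store.login("user-1")
laptop = store.login("user-1")
other = store.login("user-2")
removed = change_password(store, "user-1")
print(removed, store.is_valid(phone), store.is_valid(other))   # 2 False True
```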

Disaster 3: Failed database migration on payday

  • Prevention: Never run migrations on Fridays. Always have a tested rollback script. Run migrations in a staging environment first.
  • Detection: Automated health check: if the app can't connect to the database within 5 seconds, page the on-call engineer.
  • Mitigation: Read-only mode: if the database is mid-migration, customers can view balances but not make transactions. Degraded but not dead.

Disaster 4: Duplicate check deposit

  • Prevention: Idempotency key — each check deposit gets a unique ID. If the same check image is submitted twice, the second submission is recognized as a duplicate and rejected.
  • Detection: Flag accounts with multiple deposits of the same amount in a short time window.
  • Business rule: Hold deposited funds for 24 hours before making them available (standard banking practice, but now you know why).
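
A minimal sketch of the idempotency check, assuming the key is derived from the account and a hash of the image bytes. DepositService is a hypothetical name, and a production system would store keys durably rather than in memory.

```python
import hashlib

class DepositService:
    def __init__(self) -> None:
        self._seen: dict[str, str] = {}   # idempotency key -> deposit id

    @staticmethod
    def idempotency_key(account_id: str, check_image: bytes) -> str:
        digest = hashlib.sha256(check_image).hexdigest()
        return f"{account_id}:{digest}"

    def deposit(self, account_id: str, check_image: bytes) -> tuple[str, bool]:
        """Returns (deposit_id, created). A repeat returns the original id."""
        key = self.idempotency_key(account_id, check_image)
        if key in self._seen:
            return self._seen[key], False      # duplicate: credit nothing
        deposit_id = f"dep-{len(self._seen) + 1}"
        self._seen[key] = deposit_id
        return deposit_id, True

service = DepositService()
first = service.deposit("acct-1", b"check-photo-bytes")
repeats = [service.deposit("acct-1", b"check-photo-bytes") for _ in range(14)]
print(first, repeats[0])   # ('dep-1', True) ('dep-1', False)
```

Hashing the image only catches byte-identical resubmissions. A second photo of the same check produces different bytes, so a stronger key would be built from data read off the check itself, such as its routing, account, and check numbers. That is why the detection rule and the funds hold still matter.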

Disaster 5: Incomplete logging

  • Prevention: Define logging as part of the contract for every operation. Every contract's side effects section must include what is logged.
  • Standard: For every transaction: log the input, the output, the timestamp, the customer ID, the IP address, and either the success result or the full error.
  • Verification: Regularly test that a support engineer can reconstruct what happened for a given transaction using only the logs.

Worked Pre-Mortem 2: School Registration System

The System

An online system where parents register their children for the upcoming school year. Choose a school, submit personal information, upload documents (proof of address, immunization records), and get a confirmation.

Imagined Disasters

# | Disaster | Category | Likelihood | Severity
1 | "Registration opened at 8 AM and the site crashed within 2 minutes because 10,000 parents all clicked at the same time" | Scaling | Almost certain | High
2 | "A parent registered their child at School A, but the system assigned them to School B because of a race condition on the last available seat" | Correctness | Probable | High
3 | "A parent uploaded their child's medical records, and another parent could see them due to a document ID that was sequential and guessable" | Security | Possible | Critical
4 | "Registration closed, but 200 parents say they submitted before the deadline and have no confirmation. No one can prove what happened." | Data | Probable | High
5 | "The system accepted a registration without required immunization records. The school discovered this on the first day of class." | Correctness | Probable | Medium
6 | "A family with special needs (IEP, 504 plan) registered but the system didn't flag this for the school, so no accommodations were prepared" | Correctness | Possible | High

Prevention Plan

Disaster 1: Opening-day crash

  • Prevention: Load test with 10x expected traffic before launch. Use a virtual queue ("You are #3,247 in line. Estimated wait: 12 minutes.") instead of letting everyone hit the system simultaneously.
  • Mitigation: Have a static "we're experiencing high volume" page that doesn't require the database, so the site doesn't show an error.
  • Communication: Tell parents in advance: "Registration stays open for 2 weeks. Spots are not first-come-first-served. You do not need to register at 8 AM."

Disaster 2: Race condition on last seat

  • Prevention: Don't assign seats in real time. Accept all registrations as "pending." Run the assignment process after registration closes, with clear tiebreaker rules.
  • If real-time assignment is required: Use pessimistic locking — when a parent starts registering for School A, temporarily reserve a seat. If they don't complete within 15 minutes, release it.
  • Never say "you're in" until the seat reservation is confirmed and committed.
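
The reserve-then-confirm flow can be sketched as a small state machine. This is a single-threaded illustration with a hypothetical SeatHolds class; in a real system the check and the reservation would have to happen atomically in the database.

```python
class SeatHolds:
    HOLD_SECONDS = 15 * 60   # release an unconfirmed hold after 15 minutes

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.confirmed: set[str] = set()
        self.holds: dict[str, float] = {}   # parent id -> hold expiry time

    def _expire(self, now: float) -> None:
        self.holds = {p: t for p, t in self.holds.items() if t > now}

    def reserve(self, parent_id: str, now: float) -> bool:
        self._expire(now)
        if parent_id in self.holds or parent_id in self.confirmed:
            return True
        if len(self.confirmed) + len(self.holds) >= self.capacity:
            return False                    # no seat to promise
        self.holds[parent_id] = now + self.HOLD_SECONDS
        return True

    def confirm(self, parent_id: str, now: float) -> bool:
        """Only a live hold can become a confirmed seat."""
        self._expire(now)
        if parent_id not in self.holds:
            return False
        del self.holds[parent_id]
        self.confirmed.add(parent_id)
        return True

school_a = SeatHolds(capacity=1)            # one seat left
print(school_a.reserve("parent-1", now=0))    # True
print(school_a.reserve("parent-2", now=60))   # False: seat is held
print(school_a.reserve("parent-2", now=901))  # True: first hold expired
print(school_a.confirm("parent-1", now=902))  # False: too late
print(school_a.confirm("parent-2", now=902))  # True
```

The system never tells a parent "you're in" until confirm returns True, which is the rule in the last bullet.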

Disaster 3: Guessable document IDs

  • Prevention: Use random, non-sequential document IDs (UUIDs). Never use auto-incrementing IDs for anything the user can see in a URL.
  • Authorization: Even with random IDs, check that the requesting user is authorized to see the document. Defense in depth — random ID + authorization check.
  • Encryption: Store uploaded documents encrypted at rest. Even if someone accesses the storage directly, they can't read the files.
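
The first two bullets combine as follows. DocumentStore is a hypothetical in-memory stand-in; encryption at rest is omitted from the sketch.

```python
import uuid

class DocumentStore:
    def __init__(self) -> None:
        # doc id -> (owner id, content)
        self._docs: dict[str, tuple[str, bytes]] = {}

    def upload(self, owner_id: str, content: bytes) -> str:
        doc_id = str(uuid.uuid4())   # random, not sequential
        self._docs[doc_id] = (owner_id, content)
        return doc_id

    def fetch(self, requester_id: str, doc_id: str) -> bytes:
        record = self._docs.get(doc_id)
        # Same error for "missing" and "not yours", so callers cannot
        # use the response to learn which IDs exist.
        if record is None or record[0] != requester_id:
            raise PermissionError("document not found")
        return record[1]

docs = DocumentStore()
doc_id = docs.upload("parent-a", b"immunization record")
print(docs.fetch("parent-a", doc_id))
try:
    docs.fetch("parent-b", doc_id)   # knows the ID, still refused
except PermissionError as err:
    print("denied:", err)
```

The random ID makes guessing impractical; the ownership check makes a leaked or shared ID useless to anyone else.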

Disaster 4: No proof of submission

  • Prevention: Every submission generates a confirmation number immediately, displayed on screen AND emailed. If the email fails, the confirmation number is still shown on screen.
  • Logging: Log every submission attempt with timestamp, IP address, and all submitted data. This is the system's proof.
  • Grace period: If the system was under heavy load near the deadline, extend the deadline. Publish the policy in advance.

Worked Pre-Mortem 3: Smart Thermostat System

The System

A home thermostat connected to the internet. Users set schedules via a phone app. The thermostat communicates with the furnace/AC and reports energy usage.

Imagined Disasters

# | Disaster | Category | Likelihood | Severity
1 | "Internet goes down. Thermostat stops maintaining temperature because it depends on the cloud to get the schedule." | Availability | Almost certain | Critical (pipes freeze in winter)
2 | "A software update bricked 50,000 thermostats. They display nothing and don't control temperature." | Availability | Possible | Critical
3 | "A hacker accessed the thermostat API and set 100,000 homes to 95°F in August, causing danger for elderly residents." | Security | Possible | Critical
4 | "The thermostat reported wrong energy data, and customers got unexpectedly high utility bills" | Correctness | Probable | High
5 | "Two family members set conflicting schedules from their phones. The thermostat oscillated between 68°F and 75°F every few minutes." | Correctness | Probable | Medium

Prevention Plan

Disaster 1: Internet dependency

  • Prevention: The thermostat must operate independently of the internet. The schedule is stored on the device, not just in the cloud. The cloud syncs the schedule, but the device doesn't require it.
  • Design rule: If the internet connection goes away, the thermostat continues following its last-known schedule indefinitely. The user loses remote control but the house stays warm.
  • This is a boundary decision: The thermostat is its own module. The cloud is a convenience layer, not a dependency.

Disaster 2: Bricked by update

  • Prevention: Two-slot firmware — the thermostat stores two copies of its software. An update writes to the backup slot. If the update fails or the device doesn't boot correctly, it automatically reverts to the previous working version.
  • Rollout: Never update all devices at once. Update 1% → verify → 10% → verify → 100%. This limits blast radius.
  • Minimum function: Even if all software fails, the hardware should maintain a safe default temperature (60°F) to prevent pipe freezing. This is a hardware fallback, not a software feature.

Disaster 3: Security breach

  • Prevention: Authentication for every API call. Rate limiting on temperature changes. Maximum temperature bound (can't set above 90°F or below 50°F) enforced on the device, not just in the app.
  • Detection: Alert if temperature is set outside normal range, or if settings change more than 5 times in an hour.
  • Physical limit: The device has a physical maximum temperature that the software cannot override. Even a fully compromised cloud can't heat a house to a dangerous temperature.
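
The device-side bound and the change-rate alert are both small pieces of logic. The sketch below assumes the 50-90°F range and the "more than 5 changes in an hour" threshold from the bullets above; apply_setpoint and ChangeMonitor are hypothetical names.

```python
from collections import deque

MIN_SETPOINT_F = 50.0
MAX_SETPOINT_F = 90.0

def apply_setpoint(requested_f: float) -> float:
    """Runs on the device: clamp whatever the cloud or app asks for."""
    return max(MIN_SETPOINT_F, min(MAX_SETPOINT_F, requested_f))

class ChangeMonitor:
    def __init__(self, max_changes: int = 5, window_s: float = 3600.0) -> None:
        self.max_changes = max_changes
        self.window_s = window_s
        self._times: deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record a settings change; True means "raise an alert"."""
        self._times.append(now)
        # Drop changes that fell out of the sliding window.
        while self._times and now - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.max_changes

print(apply_setpoint(95.0), apply_setpoint(40.0), apply_setpoint(72.0))  # 90.0 50.0 72.0

monitor = ChangeMonitor()
alerts = [monitor.record(minute * 60.0) for minute in range(6)]
print(alerts)   # five False values, then True on the sixth change
```

Because the clamp runs on the thermostat, a compromised app or cloud API can request 95°F but the device will never act on it.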

The Pre-Mortem Toolkit

Questions to Ask for Any System

Question | What It Reveals
What happens when the network goes away? | Dependency on connectivity
What happens when traffic is 10x normal? | Scaling limits
What happens when a database migration fails? | Recovery procedures
What happens when a third-party service changes without notice? | Integration fragility
What happens when two users do the same thing at the same time? | Concurrency issues
What happens when the clock is wrong on one server? | Timing assumptions
What happens when someone deliberately tries to break it? | Security posture
What happens when data from 2019 meets code from 2024? | Data compatibility
What happens when the person who built this leaves the team? | Knowledge concentration
What happens when we have 100x the current data volume? | Storage and performance limits

The Risk Matrix

Plot your pre-mortem findings:

                 Low Impact         High Impact
              ┌───────────────┬────────────────────┐
  Likely      │   Monitor     │  MUST ADDRESS      │
              │               │  (Design for this) │
              ├───────────────┼────────────────────┤
  Unlikely    │   Accept      │  Plan response     │
              │   (Log it)    │  (Have a playbook) │
              └───────────────┴────────────────────┘

  • Likely + High Impact: Must be addressed in the design. Not optional.
  • Likely + Low Impact: Monitor and fix when convenient.
  • Unlikely + High Impact: Have a response plan. You don't have to prevent it, but you must know what to do if it happens.
  • Unlikely + Low Impact: Accept the risk. Log it and move on.
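
The matrix is simple enough to express as a lookup. One assumption is needed to connect it to the worked examples: the sketch treats "Almost certain" and "Probable" as likely, and "Critical" and "High" as high impact. That mapping is a judgment call, not something the matrix itself defines.

```python
from enum import Enum

class Action(Enum):
    MUST_ADDRESS = "Must be addressed in the design"
    MONITOR = "Monitor and fix when convenient"
    PLAN_RESPONSE = "Have a response plan"
    ACCEPT = "Accept the risk and log it"

# Assumed mapping from the four-level scales to the two-by-two matrix.
LIKELY = {"Almost certain", "Probable"}
HIGH_IMPACT = {"Critical", "High"}

def triage(likelihood: str, severity: str) -> Action:
    likely = likelihood in LIKELY
    high = severity in HIGH_IMPACT
    if likely and high:
        return Action.MUST_ADDRESS
    if likely:
        return Action.MONITOR
    if high:
        return Action.PLAN_RESPONSE
    return Action.ACCEPT

# Two rows from the banking pre-mortem:
print(triage("Probable", "Critical").value)   # session tokens
print(triage("Possible", "Critical").value)   # disappearing transfer
```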

Summary: Why Pre-Mortems Work

  1. They give permission to be negative. In planning, people avoid bringing up problems (it feels like criticizing the plan). In a pre-mortem, finding problems is the goal.

  2. They surface assumptions. "The internet will always be available" is an assumption that looks obviously fragile in a pre-mortem but goes unquestioned in design.

  3. They connect to everything else in this curriculum:

    • Data lifecycle → "Where is data at risk of loss or corruption?"
    • Boundaries → "What's the blast radius if this module fails?"
    • Contracts → "Which error cases are missing from our contracts?"
    • Decomposition → "Which dependencies create single points of failure?"

  4. They're cheap. A pre-mortem takes an hour. Recovering from a disaster you could have prevented takes weeks.

Failure Modes and Debugging — Test Your Understanding

Answer each question by showing your reasoning process. The goal is structured, systematic thinking — not lucky guesses.


Section A: Diagnose the Problem

Question 1

Symptom: An online store's product pages load correctly, but every product shows "In Stock" even though several products are sold out.

Using the five-step debugging framework:

  1. State the precise symptom
  2. Hypothesize what might have changed
  3. Describe how you would bisect the problem (where would you check first?)
  4. Form two different hypotheses for the cause
  5. For each hypothesis, describe what evidence would confirm or disprove it

Question 2

Symptom: Users report that emails from the system arrive late — sometimes hours after the action that triggered them. The system was working fine until last week.

You know the email flow:

  1. User action triggers an event
  2. Event is placed in a queue
  3. A background worker reads the queue and sends emails
  4. Email is sent via an external email service

Using bisection, walk through how you would isolate whether the delay is in step 1, 2, 3, or 4. What specific thing would you check at each stage?


Question 3

Symptom: A banking app shows a customer's balance as negative $500, but the customer insists they have not made any large purchases. Looking at the transaction list, all transactions appear normal and small.

This is a data integrity issue. Trace the lifecycle backward:

  • Where is the balance displayed?
  • Where is it calculated?
  • What data feeds the calculation?
  • What could cause the calculation to produce a wrong result?

List at least four distinct hypotheses, each targeting a different part of the data lifecycle.


Section B: Design for Failure

Question 4

You are designing a system that processes online job applications. The flow:

  1. Applicant fills out a form with personal info and uploads a resume
  2. System validates the form data
  3. Resume is stored
  4. Application record is created in the database
  5. Hiring manager is notified via email
  6. Applicant receives a confirmation email

For each step, list:

  • What can fail
  • The severity (critical/high/medium/low)
  • The appropriate response strategy (prevent/retry/fallback/degrade/alert/fail fast)
  • What should be logged for debugging

Question 5

A ride-sharing app has these dependencies:

  • GPS service (for driver location)
  • Payment processor (for billing)
  • Map routing service (for directions)
  • Push notification service (for alerts)

For each dependency, answer:

  • What happens if it goes down for 30 seconds?
  • What happens if it goes down for 30 minutes?
  • What happens if it starts returning wrong data instead of errors?
  • What should the app do in each case?

Pay special attention to the third question — silent wrong data is the most dangerous failure mode.


Question 6

Perform a pre-mortem for the following system:

A school lunch ordering system where parents pre-order meals for their children through a website. The kitchen prepares meals based on the orders. Children pick up their meal at lunch using their student ID.

Imagine it's been running for three months and something has gone terribly wrong. Write five realistic failure scenarios. For each one:

  • What went wrong
  • Why it wasn't caught earlier
  • What design decision would have prevented it

Section C: Failure Reasoning

Question 7

You have a system with three modules in sequence:

Module A → Module B → Module C → Output

The output is wrong. You check Module A's output — it's correct. You check Module C's output — it's wrong.

Can you conclude the bug is in Module B or Module C? Why or why not? What else do you need to check? Describe the precise reasoning.


Question 8

An engineer says: "I added retry logic everywhere, so our system handles failures well."

Explain at least three scenarios where retrying makes the problem worse instead of better. For each scenario, describe what should be done instead.
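
The claim is easier to analyze against something concrete. Below is a minimal sketch of what "retry logic everywhere" might look like, assuming a hypothetical helper named retry that is not from the course. It is deliberately naive: it retries every kind of exception, waits a fixed delay, and assumes the operation is safe to repeat. Treat it as the thing to critique, not as a recommended design.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Call operation until it succeeds or the attempts run out.

    Naive on purpose: no distinction between error types, no backoff,
    and no check that repeating the operation is harmless.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_s)
    raise AssertionError("unreachable")


# Usage: an operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporarily unavailable")
    return "ok"


result = retry(flaky, attempts=3)
```

While working through the question, ask what each call to operation does to the world when it fails partway, and what happens when many callers run this loop at the same moment.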


Question 9

Two failure strategies are proposed for a checkout system when the payment service is down:

Strategy A: Show the user an error: "Payment service unavailable. Please try again in a few minutes."

Strategy B: Accept the order, save it with status "payment pending," and charge the user when the payment service comes back.

Analyze both strategies. What are the risks of each? Under what circumstances is A better? Under what circumstances is B better? What failure modes does B introduce that A doesn't have?
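
The two strategies can be sketched side by side to make their state changes explicit. This is a minimal illustration under stated assumptions: the names (Order, OrderStore, PaymentUnavailable, settle_pending) are hypothetical, the store is an in-memory list, and charge stands in for whatever client the payment service provides. It shows what each strategy does, not which one is right.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class PaymentUnavailable(Exception):
    """Raised by the (hypothetical) payment client when the service is down."""


@dataclass
class Order:
    order_id: str
    amount_cents: int
    status: str = "new"


@dataclass
class OrderStore:
    orders: List[Order] = field(default_factory=list)

    def save(self, order: Order) -> None:
        self.orders.append(order)

    def pending(self) -> List[Order]:
        return [o for o in self.orders if o.status == "payment pending"]


Charge = Callable[[Order], None]


def checkout_strategy_a(order: Order, charge: Charge, store: OrderStore) -> str:
    """Strategy A: refuse the order when payment cannot be taken now."""
    try:
        charge(order)
    except PaymentUnavailable:
        return "Payment service unavailable. Please try again in a few minutes."
    order.status = "paid"
    store.save(order)
    return "Order confirmed."


def checkout_strategy_b(order: Order, charge: Charge, store: OrderStore) -> str:
    """Strategy B: accept the order and defer the charge if payment is down."""
    try:
        charge(order)
        order.status = "paid"
    except PaymentUnavailable:
        order.status = "payment pending"
    store.save(order)
    return "Order accepted."


def settle_pending(store: OrderStore, charge: Charge) -> int:
    """Later job for Strategy B: charge pending orders, return how many settled."""
    settled = 0
    for order in store.pending():
        try:
            charge(order)
        except PaymentUnavailable:
            continue  # still down: leave the order pending
        order.status = "paid"
        settled += 1
    return settled
```

Strategy B needs a third function that Strategy A does not. Every line of settle_pending is a place where a new failure mode can appear, which is a useful starting point for the last part of the question.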


Section D: The Full Picture

Question 10

This is an integration exercise. You have studied all five pillars. Now apply them all:

Scenario: A hospital system manages patient appointments. Patients book appointments online, doctors see their schedule on a dashboard, and the system sends text message reminders 24 hours before each appointment.

A doctor reports: "My 2pm patient said they never received a reminder, and two of my morning patients received reminders for the wrong date."

Using everything you've learned:

  1. Data Lifecycle: Trace the data from appointment creation to reminder delivery
  2. Boundaries: Identify which module(s) are likely involved in the failure
  3. Contracts: Identify what contract might be violated
  4. Decomposition: Break the problem into investigatable pieces
  5. Failure Modes: Categorize the failure type, form hypotheses, and describe how you would bisect to find the root cause

Question 11

Design a comprehensive failure handling plan for a simple feature: "User changes their password."

The flow: user enters current password and new password → system verifies current password → system validates new password meets requirements → system updates the stored password → user receives email confirming the change.
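
As a reference point, the flow can be sketched as a single function. This is a happy-path illustration under stated assumptions: Account, change_password, the 12-character minimum, and the PBKDF2 work factor are all choices made for this sketch, and send_email stands in for whatever mail service the system uses. It handles only the two rejections named in the flow and nothing else, which leaves the failure analysis to you.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Callable

ITERATIONS = 100_000  # assumed work factor for this sketch
MIN_LENGTH = 12       # assumed password requirement


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a hash with PBKDF2-HMAC-SHA256 from the standard library."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


@dataclass
class Account:
    email: str
    salt: bytes
    password_hash: bytes


def change_password(
    account: Account,
    current: str,
    new: str,
    send_email: Callable[[str, str], None],
) -> str:
    # Step 1: verify the current password (constant-time comparison).
    supplied = hash_password(current, account.salt)
    if not hmac.compare_digest(supplied, account.password_hash):
        return "current password incorrect"

    # Step 2: validate the new password against the requirements.
    if len(new) < MIN_LENGTH:
        return "new password too short"

    # Step 3: update the stored password with a fresh salt.
    new_salt = secrets.token_bytes(16)
    account.password_hash = hash_password(new, new_salt)
    account.salt = new_salt

    # Step 4: confirm the change by email.
    send_email(account.email, "Your password was changed.")
    return "password changed"
```

Notice that steps 3 and 4 are two separate writes to two separate systems. Consider what state the account is in if the function stops between any two lines.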

For this feature:

  1. List every failure mode at every step
  2. Categorize each by type (input, logic, integration, resource, dependency, timing)
  3. Define the response for each
  4. Identify the single most dangerous failure mode and explain why
  5. Describe what logging would be needed to diagnose any failure in this flow without being able to reproduce it

Question 12

The final question. Reflect on this statement:

"A system that has never failed is more dangerous than a system that fails regularly."

Using concepts from all five pillars, explain why this might be true. Consider: untested failure paths, false confidence, unknown data lifecycle gaps, unchecked boundary assumptions, and unvalidated contracts. Give a concrete example to support your argument.


Grading Rubric

Criteria | What It Means
Systematic process | Followed a structured approach, not random guessing. Steps are traceable and logical.
Precise symptoms | Problems are stated specifically, not vaguely. "Shows $0" not "is broken."
Multiple hypotheses | More than one possible cause is considered before committing to a diagnosis.
Evidence-based reasoning | Each hypothesis has a way to test it. Decisions are based on evidence, not assumptions.
Failure design completeness | All failure modes are considered, not just the obvious ones. Silent failures and wrong-data failures are addressed, not just crashes.
Cross-pillar integration | Answers draw on data lifecycle, boundaries, contracts, and decomposition, not just debugging techniques in isolation.