Failure Modes — Example: E-Commerce Order Goes Wrong

The Scenario

An online store sells electronics. Customers are reporting a strange problem: some orders arrive with the wrong items. A customer ordered a laptop and received a phone charger. Another ordered headphones and received a keyboard. It is not happening to all orders, only some.

This is a real investigation. Let's walk through it.


Step 1: Observe the Symptom Precisely

We gather reports and look for patterns:

Customer     Ordered            Received              Order Date   Payment Correct?
Customer A   Laptop ($899)      Phone Charger ($15)   March 5      Charged $899 ✅
Customer B   Headphones ($79)   Keyboard ($49)        March 5      Charged $79 ✅
Customer C   Monitor ($350)     Monitor ($350)        March 5      Charged $350 ✅
Customer D   Tablet ($449)      Mouse ($25)           March 6      Charged $449 ✅
Customer E   Keyboard ($49)     Keyboard ($49)        March 6      Charged $49 ✅

Observations:

  • Payment amounts are always correct (they match what the customer ordered, not what the customer received)
  • Some orders are fine (C and E got the right items)
  • The wrong items show no obvious relation to the ordered items (laptop → charger, headphones → keyboard)
  • Started March 5

The symptom is: The warehouse is shipping the wrong physical items for some orders, but the order records and payments are correct.


Step 2: Establish What Changed

What happened around March 5?

  • March 4: New inventory system deployed (upgraded from v2.3 to v3.0)
  • March 5: First wrong-item reports
  • Nothing else changed (no code deploys, no staff changes, no new warehouse)

Strong correlation: new inventory system → wrong items. But correlation isn't causation — let's investigate.


Step 3: Bisect the Problem Space

The order lifecycle is:

Customer places order → Order recorded → Warehouse receives pick list → 
Worker picks items → Items packed → Items shipped → Customer receives

Where is the wrong data? Let's check the midpoint — the pick list that the warehouse receives.

Check 1: Is the order record correct?

We look at Customer A's order in the database:

  • Order #10547: Product = "Laptop XPS 15", Product ID = LP-2001, Quantity = 1

The order record is correct. The customer ordered a laptop and the database says laptop.

Check 2: Is the pick list correct?

We look at the pick list that was sent to the warehouse for Order #10547:

  • Order #10547: Bin Location = B-14, Quantity = 1

Wait — the pick list shows a bin location, not a product name. The warehouse worker goes to bin B-14 and picks whatever is there.

Check 3: What's in bin B-14?

  • Before March 4: Bin B-14 = Laptop XPS 15 (correct)
  • After March 4 (inventory system upgrade): Bin B-14 = Phone Charger USB-C

The bin assignments changed when the inventory system was upgraded. The old system had one mapping of products to bins. The new system reassigned bins based on a different optimization algorithm. But the pick list generation was still using the old mapping — it was reading from a cached or stale copy of the bin assignments.
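The mechanism can be illustrated with a minimal Python sketch. It is not the store's actual code: the function names and the dictionary layout are assumptions made for illustration, while the laptop's product ID, its bins (B-14 before, C-22 after, per the Step 4 table), and the charger now sitting in B-14 come from the case.

```python
# Mapping the pick list generator cached before the upgrade (product ID -> bin).
stale_cache = {"LP-2001": "B-14"}

# What physically sits in each bin after the March 4 upgrade (bin -> product).
shelf_contents = {"B-14": "Phone Charger USB-C", "C-22": "Laptop XPS 15"}


def generate_pick_list(order_id, product_id, quantity, bin_lookup):
    """Build one pick list line. Note that it carries no product name."""
    return {"order": order_id, "bin": bin_lookup[product_id], "qty": quantity}


def pick(pick_line):
    """The worker takes whatever is in the named bin."""
    return shelf_contents[pick_line["bin"]]


line = generate_pick_list(10547, "LP-2001", 1, stale_cache)
shipped = pick(line)
print(line["bin"], "->", shipped)
```

The order record never enters the picture after the pick list is generated, which is why the database and the payment can both be right while the box is wrong.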


Step 4: Form and Test Hypothesis

Hypothesis: The pick list generator is using a cached copy of the product-to-bin mapping that wasn't updated when the inventory system was upgraded on March 4. Products whose bins didn't change (same bin in old and new system) are shipping correctly. Products whose bins changed are shipping wrong items.

Test the prediction:

If this hypothesis is correct, then:

  1. Products that shipped correctly should have the same bin location in both old and new systems
  2. Products that shipped wrong should have different bin locations

Customer   Product      Old Bin   New Bin   Same?    Shipped Correctly?
A          Laptop       B-14      C-22      ❌ No    ❌ Wrong item
B          Headphones   D-08      A-31      ❌ No    ❌ Wrong item
C          Monitor      F-15      F-15      ✅ Yes   ✅ Correct
D          Tablet       A-31      D-08      ❌ No    ❌ Wrong item
E          Keyboard     G-03      G-03      ✅ Yes   ✅ Correct

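The prediction holds for all five orders. Every wrong shipment has a bin mismatch, and every correct shipment has an unchanged bin. Five orders is a small sample, but there is no counterexample, and the hypothesis also explains why payments and order records stay correct. We treat it as confirmed.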
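The same check can be written as a short script, which scales better than reading a table once there are hundreds of orders. This is a sketch: the `Shipment` record and the `counterexamples` helper are names invented here, and the data is the five rows from the Step 4 table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shipment:
    customer: str
    product: str
    old_bin: str
    new_bin: str
    shipped_correctly: bool


SHIPMENTS = [
    Shipment("A", "Laptop", "B-14", "C-22", False),
    Shipment("B", "Headphones", "D-08", "A-31", False),
    Shipment("C", "Monitor", "F-15", "F-15", True),
    Shipment("D", "Tablet", "A-31", "D-08", False),
    Shipment("E", "Keyboard", "G-03", "G-03", True),
]


def counterexamples(shipments):
    """Return shipments that contradict the stale-cache hypothesis.

    The hypothesis predicts: shipped correctly if and only if the bin is unchanged.
    """
    return [s for s in shipments if (s.old_bin == s.new_bin) != s.shipped_correctly]


violations = counterexamples(SHIPMENTS)
matched = len(SHIPMENTS) - len(violations)
print(f"{matched} of {len(SHIPMENTS)} shipments match the prediction")
```

A single counterexample (a wrong item from an unchanged bin, or a correct item from a changed bin) would mean the hypothesis is incomplete and the search has to continue.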

Notice something extra:

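Customer B ordered headphones (old bin D-08, new bin A-31). Customer D ordered a tablet (old bin A-31, new bin D-08). Their bins swapped. The stale pick list sent B's picker to D-08 and D's picker to A-31, so if the shelves had been fully restocked to the new layout, B would have received a tablet and D would have received headphones. The Step 1 reports show a keyboard and a mouse instead, so the physical contents of those bins at pick time evidently did not match the new mapping exactly either. What a customer receives depends on whatever is physically in the old bin when the worker arrives.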

This is how a bin mapping error creates cross-contamination — wrong items go to wrong customers in unpredictable combinations.


Step 5: Root Cause and Fix

Root Cause

Integration failure (Category: Integration). Two modules — the pick list generator and the inventory system — were reading from different versions of the bin mapping. The upgrade updated the inventory system's internal mapping but didn't invalidate or update the cache used by the pick list generator.

The Direct Fix

Update the pick list generator to read bin locations from the new inventory system's live data, not from a cached copy.
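One possible shape for this fix is sketched below in Python. The class and method names (`InventoryService`, `bin_for`, `reassign`, `PickListGenerator`) are hypothetical and not the real system's API, and order #10548 is an invented follow-up order. The point is only that the generator holds a reference to the inventory system and asks it on every call, rather than keeping its own copy of the mapping.

```python
class InventoryService:
    """Hypothetical stand-in for the inventory system's live bin lookup."""

    def __init__(self, bins):
        self._bins = dict(bins)

    def bin_for(self, product_id):
        return self._bins[product_id]

    def reassign(self, product_id, new_bin):
        self._bins[product_id] = new_bin


class PickListGenerator:
    def __init__(self, inventory):
        # Keep a reference to the owner of the data, not a copy of the data.
        self._inventory = inventory

    def line_for(self, order_id, product_id, quantity):
        bin_location = self._inventory.bin_for(product_id)  # queried at pick time
        return {"order": order_id, "bin": bin_location, "qty": quantity}


inventory = InventoryService({"LP-2001": "B-14"})
generator = PickListGenerator(inventory)

before = generator.line_for(10547, "LP-2001", 1)
inventory.reassign("LP-2001", "C-22")  # simulates the upgrade's reassignment
after = generator.line_for(10548, "LP-2001", 1)
print(before["bin"], after["bin"])
```

If a live query per pick line turns out to be too slow in practice, a cache can be reintroduced, but then invalidation on reassignment has to be part of the design rather than an afterthought.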

Verify the Fix

  • After the fix, run 10 test orders for products with changed bins. Verify the pick lists show the new bin locations.
  • Check that the fix doesn't affect products with unchanged bins (they should still work).
  • Check timing: the fix should take effect immediately, not after a cache timeout.
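The first two checks can be automated with a comparison like the following sketch. The helper name `find_stale_lines` is invented here, and the product IDs other than LP-2001 are placeholders, since the case only gives the laptop's ID; the bin values come from the Step 4 table. The timing check is not covered and would need a test against the real deployment.

```python
def find_stale_lines(generate_bin, live_bins):
    """Return product IDs whose generated bin differs from the live mapping."""
    return sorted(pid for pid, live in live_bins.items() if generate_bin(pid) != live)


# Live mapping after the upgrade, including one product whose bin did not change.
live_bins = {"LP-2001": "C-22", "HP-EXAMPLE": "A-31", "MN-EXAMPLE": "F-15"}

# The mapping the old, cached generator would have used.
stale_bins = {"LP-2001": "B-14", "HP-EXAMPLE": "D-08", "MN-EXAMPLE": "F-15"}

broken = find_stale_lines(stale_bins.__getitem__, live_bins)
fixed = find_stale_lines(live_bins.__getitem__, live_bins)
print("stale generator mismatches:", broken)
print("fixed generator mismatches:", fixed)
```

Run against the stale generator, the check flags exactly the products whose bins changed; run against a generator that reads live data, it should flag nothing.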

The Deeper Lesson: What Should Have Prevented This

1. Contract violation

The inventory system had an implicit contract with the pick list generator: "I will give you bin locations for product IDs." But the contract didn't specify how that data had to be read: by live query or from a cached copy. If the contract had been explicit ("bin locations must be queried from the inventory system at pick time, not cached"), the cache would never have been built.

2. Boundary violation

The pick list generator cached data that belonged to another module (inventory). It crossed a boundary. If the boundary were enforced — "only the inventory module knows bin locations; everyone else must ask" — the stale cache wouldn't exist.

3. Missing failure mode in the upgrade plan

The inventory system upgrade plan didn't include: "What other systems read our bin mapping, and how do they read it?" A pre-mortem would have surfaced this: "What if other systems have a stale copy of our bin assignments?"

4. No verification at the seam

The seam between "pick list generated" and "warehouse worker picks item" has no verification. The worker goes to the bin and picks what's there — they have no way to verify it's the right product (unless they check the product name, which wasn't on the pick list). Adding a product name or barcode scan at pick time would have caught the mismatch immediately.
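A scan check at the seam could look roughly like this Python sketch. It assumes the pick list line is extended with the expected product ID and that a scanner returns the product ID of the picked item; both are assumptions, as are the names `confirm_pick` and `PickMismatch` and the placeholder charger ID.

```python
class PickMismatch(Exception):
    """Raised when the scanned item is not the product the order expects."""


def confirm_pick(pick_line, scanned_product_id):
    """Compare the scanned barcode against the product the pick line expects."""
    expected = pick_line["product_id"]
    if scanned_product_id != expected:
        raise PickMismatch(
            f"order {pick_line['order']}: expected {expected}, "
            f"scanned {scanned_product_id} in bin {pick_line['bin']}"
        )
    return True


# Pick line extended with the product ID (not present on the original pick list).
pick_line = {"order": 10547, "product_id": "LP-2001", "bin": "B-14", "qty": 1}

try:
    confirm_pick(pick_line, "PC-EXAMPLE")  # placeholder ID for the phone charger
    caught = False
except PickMismatch as err:
    caught = True
    print("blocked:", err)
```

With this in place, the first mispicked order on March 5 would have stopped at the shelf instead of reaching a customer, regardless of why the bin mapping was wrong.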


Failure Category Map for This Scenario

Root cause:    Integration failure (stale cache)
Amplifier:     No verification at physical seam
Blast radius:  All orders with changed bin locations (~30% of products)
Severity:      High (wrong items shipped, expensive returns)
Could prevent: Explicit contract, boundary enforcement, upgrade checklist
Could detect:  Barcode verification at pick, bin mapping comparison test
Could reduce:  Faster detection through customer complaint pattern analysis