Failure Modes — Example: Flight Booking Cascade
The Scenario
An airline booking system. On a busy Friday afternoon, the following happens within a 30-minute window:
- Customers report they can't search for flights (the search page spins forever)
- Customers who already selected flights can't complete payment
- The customer service phone lines are flooded
- An agent manually checks and sees that the booking database is responding, but slowly
- Internal monitoring shows the flight search API is responding in 45 seconds (normal: under 500 milliseconds)
This is a cascading failure — one problem triggers a chain of other problems that makes everything worse.
Step 1: Observe the Symptoms — All of Them
Unlike the previous examples (single symptom), here we have multiple symptoms appearing simultaneously:
| Symptom | Affected System | Severity |
|---|---|---|
| Flight search returns in 45 seconds | Search API | High |
| Payment processing times out | Payment Service | Critical |
| Customer service call volume 5x normal | Call Center | High |
| Booking database slow (but responding) | Database | Medium |
| Internal admin dashboard unresponsive | Admin UI | Low |
These aren't five separate bugs. They're connected. The key question: which one caused the others?
Step 2: Establish a Timeline
We reconstruct what happened:
| Time | Event |
|---|---|
| 2:00 PM | Everything normal |
| 2:12 PM | Marketing team launches a flash sale: "50% off all Caribbean flights this weekend." Email sent to 2 million subscribers. |
| 2:15 PM | Website traffic increases 10x |
| 2:17 PM | Search API response times begin rising (500ms → 2s → 5s → 15s → 45s) |
| 2:20 PM | Payment service starts timing out (its requests to the database are queued behind search queries) |
| 2:22 PM | Customers who can't search or pay start calling customer service |
| 2:25 PM | Admin dashboard becomes unresponsive (it also queries the same database) |
| 2:30 PM | All systems severely degraded |
The trigger: A flash sale email drove sudden, massive traffic. But the trigger is not the root cause. Traffic spikes are expected. The question is: why didn't the system handle it?
Step 3: Bisect — Find the Bottleneck
The system architecture:
Users → Web Servers → Search API ───┐
                                    ├── Database
Users → Web Servers → Payment API ──┤
                                    │
Admin → Admin Dashboard ────────────┘
Three services (Search, Payment, Admin) all share one database. Let's check each layer:
Web servers: Handling requests, but slowly. They're waiting on responses from the APIs. Not the bottleneck — they're victims.
Search API: Sending queries to the database, but queries are slow. Not the root cause — it's also waiting.
Payment API: Same situation. Queries are queuing up and timing out.
Database: Here's the bottleneck.
Inside the Database
The database can handle approximately 500 queries per second under normal load. Each search query involves:
- Searching available flights by route, date, and class
- Checking seat availability for each matching flight
- Calculating dynamic pricing for each available flight
A single search issues about 3-5 database queries; the estimates below use the upper figure of 5.
Normal traffic: 50 searches/sec × 5 queries = 250 queries/sec → database comfortable
Flash sale traffic: 500 searches/sec × 5 queries = 2,500 queries/sec → database overwhelmed
The database hits its connection limit. New queries queue up. Queue times increase. The search API waits for the database, the web server waits for the search API, the user waits for the web server. Each layer adds its own timeout on top.
But here's the critical part: the payment API handles only 10-20 transactions per second, yet it queries the same database, so its queries now wait in the same queue as 2,500 search queries per second. A payment that normally takes 200ms now takes 30 seconds and times out.
The flash sale broke search AND payment, even though payment traffic didn't increase at all.
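The arithmetic above can be sketched in a few lines of Python. This is a rough model, not a description of the real system: it assumes a single first-in-first-out queue in front of the database, a fixed capacity of 500 queries/sec, and no timeouts or retries. The function names `db_load` and `wait_seconds` are illustrative.

```python
def db_load(searches_per_sec: float, queries_per_search: int = 5) -> float:
    """Database queries per second generated by search traffic."""
    return searches_per_sec * queries_per_search


def wait_seconds(elapsed_sec: float, arrival_qps: float, capacity_qps: float) -> float:
    """Queue wait for a query arriving `elapsed_sec` after overload begins.

    Simplified FIFO model: the backlog grows by (arrival - capacity) queries
    every second, and the database drains it at `capacity_qps`.
    """
    backlog = max(0.0, arrival_qps - capacity_qps) * elapsed_sec
    return backlog / capacity_qps


CAPACITY_QPS = 500  # approximate capacity quoted in the text

normal = db_load(50)       # 250 queries/sec
flash_sale = db_load(500)  # 2,500 queries/sec

print(normal / CAPACITY_QPS)                       # 0.5 (half of capacity)
print(flash_sale / CAPACITY_QPS)                   # 5.0 (five times capacity)
print(wait_seconds(10, flash_sale, CAPACITY_QPS))  # 40.0
```

Under these assumptions, a query that arrives just 10 seconds into the overload already waits about 40 seconds, regardless of whether it is a search or a payment. That is why a handful of payment queries can time out without any increase in payment traffic.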
Step 4: The Cascade Chain
Flash sale email
→ 10x website traffic
  → 10x database query volume (search queries: 250 → 2,500 queries/sec)
    → Database connection pool exhausted
      → Search queries slow to 45 seconds
      → Payment queries can't get a database connection
        → Payment times out
          → Customers can't pay
            → Customers call support
              → Support lines overwhelmed
      → Admin queries can't get a database connection
        → Admin dashboard unresponsive
          → Operators can't see what's happening
            → Slow response to the incident
One event (flash sale) → six cascading failures. Each failure amplifies the next.
The Amplification Pattern
Notice the feedback loop:
- Search is slow → users retry (refresh the page) → more search queries → database even slower → users retry more aggressively
This is a retry storm, often loosely called a thundering herd: when a system slows down, users retry, which generates more load, which makes it slower, which generates more retries. The system enters a death spiral from which it cannot recover without intervention.
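A toy simulation makes the feedback loop concrete. It is a sketch under simple assumptions: time moves in one-second steps, every request beyond capacity fails, and each failed request comes back `retry_factor` times in the next second. The function name and the retry model are illustrative, not measurements from the incident.

```python
def simulate_retries(base_qps, capacity_qps, retry_factor, seconds):
    """Return the offered load per second when failed requests are retried.

    Each second, requests beyond `capacity_qps` fail. Every failed request
    is retried `retry_factor` times in the next second, on top of the
    steady base traffic.
    """
    offered = float(base_qps)
    history = []
    for _ in range(seconds):
        history.append(offered)
        failed = max(0.0, offered - capacity_qps)
        offered = base_qps + failed * retry_factor
    return history


# Flash sale load against a 500 queries/sec database, one retry per failure.
print(simulate_retries(2500, 500, 1.0, 4))  # [2500.0, 4500.0, 6500.0, 8500.0]

# Below capacity nothing fails, so nothing is retried and load stays flat.
print(simulate_retries(400, 500, 1.0, 3))   # [400.0, 400.0, 400.0]
```

Even with a single retry per failure the offered load keeps climbing while base traffic stays constant. With `retry_factor` above 1 (impatient users refreshing repeatedly) the growth accelerates.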
Step 5: Hypotheses and What Would Fix Each
Hypothesis 1: "Just upgrade the database"
Get a bigger database that can handle 2,500 queries/sec.
Problem: This fixes today's flash sale but doesn't fix the next one. If the sale is bigger (5 million emails), you'd need an even bigger database. You're scaling to the peak — expensive and always one step behind.
Verdict: Treats the symptom, not the disease.
Hypothesis 2: "Separate the databases"
Give search, payment, and admin their own databases.
Search API → Search Database (can be slow under load — annoying but not critical)
Payment API → Payment Database (protected from search traffic — critical operations stay fast)
Admin → Admin Database (or read replica)
Verdict: This prevents search traffic from killing payments. The blast radius of a search overload no longer includes payment. This is the boundary principle from Section 2 — critical and non-critical operations should not share the same resource pool.
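The same isolation idea can be illustrated inside a single process with a bulkhead: each service gets its own fixed budget of concurrent database work. Note that this is a different technique from physically separate databases, shown here only because it fits in a few lines. The `Bulkhead` class and the pool sizes are hypothetical.

```python
import threading
from contextlib import contextmanager


class Bulkhead:
    """Caps concurrent database work for one service (hypothetical helper)."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self, timeout: float = 0.1):
        # Fail fast instead of queueing indefinitely behind other requests.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError(f"{self.name}: no capacity available")
        try:
            yield
        finally:
            self._slots.release()


# Illustrative sizes; real values would come from load testing.
search_pool = Bulkhead("search", max_concurrent=2)
payment_pool = Bulkhead("payment", max_concurrent=2)

with search_pool.slot(), search_pool.slot():  # search is saturated
    with payment_pool.slot():                 # payment still gets a slot
        print("payment proceeds while search is saturated")
```

Because search can only exhaust its own slots, a search surge produces fast failures for search and leaves the payment budget untouched.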
Hypothesis 3: "Add rate limiting to search"
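Limit search to a rate the database can sustain. At 5 queries per search and roughly 500 queries/sec of database capacity, that means fewer than 100 searches per second, and less if payment and admin are to keep some headroom. Beyond that limit, return a "please try again in a moment" message.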
Verdict: This prevents the database from being overwhelmed. Users see a quick "try again" message instead of a 45-second hang. It's annoying but preferable to the entire system collapsing. This is the degrade strategy: intentionally limit one feature to protect the rest.
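One common way to implement such a limit is a token bucket. The sketch below is a minimal single-process version; `TokenBucket`, `handle_search`, and the 429 status code are assumptions made for illustration, and a real deployment would more likely enforce the limit at a gateway or load balancer.

```python
import time


class TokenBucket:
    """Allows `rate_per_sec` requests on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        # Refill tokens for the time elapsed since the last call, then
        # spend one token if available.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_search(bucket, run_search):
    """Reject quickly when over the limit instead of queueing on the database."""
    if not bucket.allow():
        return 429, "Please try again in a moment"
    return 200, run_search()
```

The important property is that a rejected request costs almost nothing: it never reaches the database, so it cannot lengthen the queue that payment queries are waiting in.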
Hypothesis 4: "Cache popular search results"
Caribbean flights are what the sale promoted. Cache the results for common Caribbean route+date queries. The first search hits the database; subsequent identical searches are answered from cache.
Verdict: This sharply reduces database load for exactly the queries the flash sale generates. If 80% of search queries can be served from cache, the cache absorbs 2,000 of the 2,500 queries/sec and leaves 500 for the database. That is right at its approximate 500 queries/sec limit, so caching alone leaves no headroom for payment and admin queries.
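A minimal sketch of this idea is a cache with a time-to-live (TTL), keyed by route and date. The `TTLCache` class, the 30-second TTL, and the airport codes are assumptions for illustration. Cached availability and prices can be stale, so a real system would still re-check seats and price against the database at booking time.

```python
import time


class TTLCache:
    """Caches search results for `ttl_seconds` (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit: no database work
        value = loader()               # cache miss: query the database once
        self._store[key] = (now + self.ttl, value)
        return value


def db_qps_after_cache(total_qps, hit_percent):
    """Queries per second that still reach the database."""
    return total_qps * (100 - hit_percent) / 100


cache = TTLCache(ttl_seconds=30)
key = ("MIA", "NAS", "2024-06-01")  # made-up route and date
flights = cache.get_or_load(key, lambda: ["FL123", "FL456"])  # loader stands in for the DB

print(db_qps_after_cache(2500, 80))  # 500.0
```

The TTL is the trade-off knob: a longer TTL removes more database load but shows users older prices and seat counts.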
The Real Fix: All of the Above (Layered Defense)
No single fix is sufficient. Real systems use layered defenses:
- Rate limiting (immediate: deploy within hours, prevents the death spiral)
- Caching (short-term: deploy within days, reduces database load for common queries)
- Separate databases (medium-term: deploy within weeks, isolates critical from non-critical)
- Load testing before promotions (process: coordinate with marketing — "tell engineering before sending 2 million emails")
The Deeper Lesson: Shared Resources Create Cascading Failures
The Shared Resource Anti-Pattern
BEFORE (dangerous):
Search ───┐
Payment ──┼── Shared Database
Admin ────┘
Any one service can saturate the database and starve the others.
AFTER (isolated):
Search → Search DB (or cache + DB)
Payment → Payment DB
Admin → Read replica
Each service has its own resource pool. A surge in one doesn't affect the others.
The Blast Radius Principle
Every shared resource is a potential blast radius amplifier. When you share:
- A database
- A network connection
- A thread pool
- A queue
- A rate limit
- A budget
…you're saying "the failure of any one consumer can affect all consumers." Sometimes sharing is the right choice (cost, simplicity). But you must know what you're risking.
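One way to "know what you're risking" is to write down which services use which resources and compute the overlap. The sketch below does that for direct sharing only (one hop); the `blast_radius` function and the resource names are illustrative.

```python
from collections import defaultdict


def blast_radius(uses, failing_service):
    """Services that share at least one resource with `failing_service`."""
    consumers = defaultdict(set)  # resource -> services that use it
    for service, resources in uses.items():
        for resource in resources:
            consumers[resource].add(service)

    affected = set()
    for resource in uses.get(failing_service, set()):
        affected |= consumers[resource]
    affected.discard(failing_service)
    return affected


before = {
    "search": {"shared_db"},
    "payment": {"shared_db"},
    "admin": {"shared_db"},
}
after = {
    "search": {"search_db"},
    "payment": {"payment_db"},
    "admin": {"read_replica"},
}

print(sorted(blast_radius(before, "search")))  # ['admin', 'payment']
print(sorted(blast_radius(after, "search")))   # []
```

The same map works for any resource in the list above: add a shared thread pool or queue as another entry and the affected set grows accordingly.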
The Traffic Spike Is Not the Bug
The flash sale email was the trigger, not the cause. The cause was:
- No isolation between critical (payment) and non-critical (search) systems
- No rate limiting to prevent overload
- No caching for predictable high-volume queries
- No coordination between marketing and engineering
The flash sale was a normal business event. The system should have handled it — or at least degraded gracefully instead of collapsing completely.
Failure Category Map
Root cause:    Resource failure (shared database overwhelmed)
Trigger:       External traffic spike (flash sale email)
Amplifier:     Thundering herd (user retries), shared resource (database)
Blast radius:  ALL services (search, payment, admin, support)
Severity:      Critical (payment broken = lost revenue)
Could prevent: Resource isolation (separate databases), rate limiting,
               caching, load testing, marketing coordination
Could detect:  Database connection pool monitoring, query queue length
               alerts, response time thresholds
Could reduce:  Graceful degradation (return cached results, queue
               payments for retry instead of dropping them)
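The "Could detect" row can be sketched as a simple threshold check over database metrics. Everything here is an assumption for illustration: the `DbSnapshot` fields, the threshold defaults, and the alert wording. Real monitoring would read these values from the database or a metrics system.

```python
from dataclasses import dataclass


@dataclass
class DbSnapshot:
    connections_in_use: int
    connection_limit: int
    queued_queries: int
    p95_latency_ms: float


def check_alerts(s, pool_pct=80, max_queue=100, max_p95_ms=500):
    """Return the alerts triggered by one metrics snapshot."""
    alerts = []
    # Integer comparison avoids floating-point rounding at the boundary.
    if s.connections_in_use * 100 >= s.connection_limit * pool_pct:
        alerts.append("connection pool near limit")
    if s.queued_queries > max_queue:
        alerts.append("query queue growing")
    if s.p95_latency_ms > max_p95_ms:
        alerts.append("response time above threshold")
    return alerts


healthy = DbSnapshot(40, 100, 0, 120.0)
overloaded = DbSnapshot(95, 100, 2000, 45000.0)

print(check_alerts(healthy))     # []
print(check_alerts(overloaded))  # all three alerts
```

Alerts like these are most useful when they fire early, while the pool is filling and before queries start timing out.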
Compare With Previous Examples
| Aspect | E-Commerce (wrong items) | Messaging (wrong order) | Flight Booking (cascade) |
|---|---|---|---|
| Number of symptoms | 1 (wrong items) | 1 (wrong order) | 5+ (everything breaks) |
| Failure category | Integration | Timing | Resource + cascading |
| Trigger | System upgrade | Scaling to 3 servers | Traffic spike |
| Root cause | Stale cache | Non-deterministic ordering | Shared database bottleneck |
| Fix complexity | Simple (one change) | Moderate (client-side compromise) | High (layered, multiple changes) |
| Blast radius | 30% of orders | Group conversations | Every user, every feature |
| Feedback loop? | No | No | Yes (retries amplify load) |
| Prevention theme | Contracts + boundaries | Architectural ordering decisions | Resource isolation + graceful degradation |
The key escalation: from a single-cause bug, to a design limitation, to a systemic vulnerability. Each example requires more sophisticated thinking about failure.