Failure Modes — Example: Flight Booking Cascade

The Scenario

An airline booking system. On a busy Friday afternoon, the following happens within a 30-minute window:

Customers report they can't search for flights (the search page spins forever)
Customers who already selected flights can't complete payment
The customer service phone lines are flooded
An agent manually checks and sees that the booking database is responding, but slowly
Internal monitoring shows the flight search API is responding in 45 seconds (normal: under 500 milliseconds)

This is a cascading failure — one problem triggers a chain of other problems that makes everything worse.

Step 1: Observe the Symptoms — All of Them

Unlike the previous examples (single symptom), here we have multiple symptoms appearing simultaneously:

Symptom	Affected System	Severity
Flight search returns in 45 seconds	Search API	High
Payment processing times out	Payment Service	Critical
Customer service call volume 5x normal	Call Center	High
Booking database slow (but responding)	Database	Medium
Internal admin dashboard unresponsive	Admin UI	Low

These aren't five separate bugs. They're connected. The key question: which one caused the others?

Step 2: Establish a Timeline

We reconstruct what happened:

Time	Event
2:00 PM	Everything normal
2:12 PM	Marketing team launches a flash sale: "50% off all Caribbean flights this weekend." Email sent to 2 million subscribers.
2:15 PM	Website traffic increases 10x
2:17 PM	Search API response times begin rising (500ms → 2s → 5s → 15s → 45s)
2:20 PM	Payment service starts timing out (its requests to the database are queued behind search queries)
2:22 PM	Customers who can't search or pay start calling customer service
2:25 PM	Admin dashboard becomes unresponsive (it also queries the same database)
2:30 PM	All systems severely degraded

The trigger: A flash sale email drove sudden, massive traffic. But the trigger is not the root cause. Traffic spikes are expected. The question is: why didn't the system handle it?

Step 3: Bisect — Find the Bottleneck

The system architecture:

Users → Web Servers → Search API ──┐
                                    ├── Database
Users → Web Servers → Payment API ──┘
                                    │
Admin → Admin Dashboard ────────────┘

Three services (Search, Payment, Admin) all share one database. Let's check each layer:

Web servers: Handling requests, but slowly. They're waiting on responses from the APIs. Not the bottleneck — they're victims.

Search API: Sending queries to the database, but queries are slow. Not the root cause — it's also waiting.

Payment API: Same situation. Queries are queuing up and timing out.

Database: Here's the bottleneck.

Inside the Database

The database can handle approximately 500 queries per second under normal load. Each search query involves:

Searching available flights by route, date, and class
Checking seat availability for each matching flight
Calculating dynamic pricing for each available flight

A single search issues about 3-5 database queries. The estimates below use the upper bound of 5.

Normal traffic: 50 searches/sec × 5 queries = 250 queries/sec → database comfortable

Flash sale traffic: 500 searches/sec × 5 queries = 2,500 queries/sec → database overwhelmed
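The arithmetic above can be checked with a short sketch. The capacity and traffic figures come from the text; the function names and the choice to use the upper bound of 5 queries per search are illustrative assumptions.

```python
# Back-of-the-envelope load check using the figures from the text.
# Both constants are the text's approximations, not measured values.
DB_CAPACITY_QPS = 500      # approximate database capacity (queries/sec)
QUERIES_PER_SEARCH = 5     # upper bound of the 3-5 queries per search


def db_load(searches_per_sec: int) -> int:
    """Database queries per second generated by search traffic."""
    return searches_per_sec * QUERIES_PER_SEARCH


def utilization(searches_per_sec: int) -> float:
    """Fraction of database capacity consumed (above 1.0 means overload)."""
    return db_load(searches_per_sec) / DB_CAPACITY_QPS


for label, rate in [("normal", 50), ("flash sale", 500)]:
    print(f"{label}: {db_load(rate)} queries/sec, "
          f"{utilization(rate):.0%} of capacity")
```

Normal traffic uses half of the database's capacity; the flash sale asks for five times what it can deliver.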

The database hits its connection limit. New queries queue up. Queue times increase. The search API waits for the database, the web server waits for the search API, the user waits for the web server. Each layer adds its own timeout on top.

But here's the critical part: the payment API, which handles only 10-20 transactions per second, queries the same database, and its queries are now stuck in the same backlog as the 2,500 search queries arriving every second. A payment that normally takes 200ms now takes 30 seconds and times out.

The flash sale broke search AND payment, even though payment traffic didn't increase at all.
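To see why unchanged payment traffic still suffers, here is a rough sketch that treats the database as a single first-in, first-out queue served at a fixed rate. That is a strong simplifying assumption (a real database schedules work in more complicated ways), and the function names are hypothetical. The rates are the ones quoted in the text.

```python
# Fluid (deterministic) approximation of one shared FIFO queue.
# Assumption: every query, search or payment, waits in the same line.
CAPACITY_QPS = 500        # queries served per second (from the text)
SEARCH_QPS = 2500         # flash-sale search load (from the text)
PAYMENT_QPS = 20          # payment load, unchanged by the sale


def backlog_after(seconds: float) -> float:
    """Queries waiting in the shared queue after `seconds` of overload."""
    arrival = SEARCH_QPS + PAYMENT_QPS
    return max(0.0, (arrival - CAPACITY_QPS) * seconds)


def wait_for_new_query(seconds: float) -> float:
    """Seconds a query arriving at time `seconds` waits behind the backlog."""
    return backlog_after(seconds) / CAPACITY_QPS


for t in (1, 5, 10):
    print(f"after {t:>2}s of overload: a new payment query waits "
          f"{wait_for_new_query(t):.1f}s")
```

The model only shows the direction of the effect: the backlog is almost entirely search queries, yet a payment query arriving later must wait behind all of it. The real incident unfolded over minutes because traffic ramped up gradually, so the timing here should not be read literally.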

Step 4: The Cascade Chain

Flash sale email
    → 10x website traffic
        → 10x database query volume (search queries)
            → Database connection pool exhausted
                → Search queries slow to 45 seconds
                → Payment queries can't get a database connection
                    → Payment times out
                    → Customers can't pay
                        → Customers call support
                            → Support lines overwhelmed
                → Admin queries can't get a database connection
                    → Admin dashboard unresponsive
                    → Operators can't see what's happening
                        → Slow response to the incident

One event (flash sale) → six cascading failures. Each failure amplifies the next.

The Amplification Pattern

Notice the feedback loop:

Search is slow → users retry (refresh the page) → more search queries → database even slower → users retry more aggressively

This is a retry storm, often loosely called a thundering herd: when a system slows down, users retry, which generates even more load, which makes it slower, which generates more retries. The system enters a death spiral from which it cannot recover without intervention.
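The feedback loop can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: time moves in one-second steps, capacity is fixed, and a fixed share of users whose search was not served retries in the next step. The numbers are deliberately small and are not the ones from this incident.

```python
# Toy model of the retry feedback loop (all numbers are illustrative).
CAPACITY = 100            # searches the backend can serve per step
NEW_USERS = 120           # fresh searches arriving per step
RETRY_PERCENT = 90        # share of unserved users who retry next step


def simulate(steps: int) -> list[int]:
    """Return the offered load (new + retried searches) at each step."""
    offered_history = []
    retries = 0
    for _ in range(steps):
        offered = NEW_USERS + retries
        offered_history.append(offered)
        unserved = max(0, offered - CAPACITY)
        # Integer arithmetic keeps the result exact and reproducible.
        retries = unserved * RETRY_PERCENT // 100
    return offered_history


print(simulate(6))
```

Fresh demand never changes, yet the offered load rises at every step because each round of failures feeds the next round of retries.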

Step 5: Hypotheses and What Would Fix Each

Hypothesis 1: "Just upgrade the database"

Get a bigger database that can handle 2,500 queries/sec.

Problem: This fixes today's flash sale but doesn't fix the next one. If the sale is bigger (5 million emails), you'd need an even bigger database. You're scaling to the peak — expensive and always one step behind.

Verdict: Treats the symptom, not the disease.

Hypothesis 2: "Separate the databases"

Give search, payment, and admin their own databases.

Search API  → Search Database (can be slow under load — annoying but not critical)
Payment API → Payment Database (protected from search traffic — critical operations stay fast)
Admin       → Admin Database (or read replica)

Verdict: This prevents search traffic from killing payments. The blast radius of a search overload no longer includes payment. This is the boundary principle from Section 2 — critical and non-critical operations should not share the same resource pool.

Hypothesis 3: "Add rate limiting to search"

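Cap search at a rate the database can sustain alongside payment and admin traffic, for example 80 searches per second (about 400 database queries at 5 queries each, below the 500 queries/sec capacity). Beyond that, return a "please try again in a moment" message.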

Verdict: This prevents the database from being overwhelmed. Users over the limit see a quick "try again" message instead of a 45-second hang. It's annoying but preferable to the entire system collapsing. This is the degrade strategy: intentionally limit one feature to protect the rest.
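One common way to implement such a limit is a token bucket. The sketch below is a minimal single-process version, assuming the limit is enforced inside the search service itself. `TokenBucket` and `handle_search` are hypothetical names, and the rate passed in is a placeholder to be tuned against the database's real capacity.

```python
import time


class TokenBucket:
    """Token-bucket limiter: allows `rate` requests/sec, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_search(limiter: TokenBucket) -> str:
    """Hypothetical handler: shed load instead of queueing on the database."""
    if not limiter.allow():
        return "429 Too Many Requests: please try again in a moment"
    return "200 OK: search results"
```

A rejected request costs almost nothing, while an accepted one costs several database queries, which is why shedding excess load at the edge protects everything behind it. A multi-server deployment would need a shared or per-server limit, which this sketch does not cover.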

Hypothesis 4: "Cache popular search results"

Caribbean flights are what the sale promoted. Cache the results for common Caribbean route+date queries. The first search hits the database; subsequent identical searches are answered from cache.

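Verdict: This dramatically reduces database load for the exact queries the flash sale generates. If 80% of search queries are for the same Caribbean routes, the cache absorbs 2,000 of the 2,500 queries/sec, leaving about 500 for the database. That is right at its capacity with no headroom for payment or admin queries, so caching helps a great deal but is not sufficient on its own.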
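A minimal sketch of such a cache follows, assuming results may be served slightly stale. The class name, the 30-second lifetime, and the airport codes are illustrative assumptions. Seat availability and dynamic pricing change quickly, so a real system would likely need a short lifetime and a fresh check at booking time.

```python
import time
from typing import Callable

# Assumption: search results may be up to TTL_SECONDS old.
TTL_SECONDS = 30.0


class SearchCache:
    """Tiny time-to-live cache keyed by (origin, destination, date)."""

    def __init__(self, fetch: Callable[[tuple], list], clock=time.monotonic):
        self._fetch = fetch          # function that queries the database
        self._clock = clock
        self._entries = {}           # key -> (stored_at, result)
        self.hits = 0
        self.misses = 0

    def search(self, origin: str, destination: str, date: str) -> list:
        key = (origin, destination, date)
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[0] < TTL_SECONDS:
            self.hits += 1
            return entry[1]
        self.misses += 1
        result = self._fetch(key)    # only misses reach the database
        self._entries[key] = (now, result)
        return result


# Demo with a stand-in for the database (hypothetical data).
db_calls = []


def fake_db_search(key: tuple) -> list:
    db_calls.append(key)
    return ["flight A", "flight B"]


cache = SearchCache(fake_db_search)
for _ in range(10):
    cache.search("JFK", "SJU", "2030-01-05")
print(f"database calls: {len(db_calls)}, cache hits: {cache.hits}")
```

Ten identical searches produce one database call. The more concentrated the traffic is on a few routes, as in a targeted promotion, the larger the share the cache absorbs.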

The Real Fix: All of the Above (Layered Defense)

No single fix is sufficient. Real systems use layered defenses:

Rate limiting (immediate: deploy within hours, prevents the death spiral)
Caching (short-term: deploy within days, reduces database load for common queries)
Separate databases (medium-term: deploy within weeks, isolates critical from non-critical)
Load testing before promotions (process: coordinate with marketing — "tell engineering before sending 2 million emails")

The Deeper Lesson: Shared Resources Create Cascading Failures

The Shared Resource Anti-Pattern

BEFORE (dangerous):

Search ───┐
Payment ──┼── Shared Database
Admin ────┘

Any one service can saturate the database and starve the others.

AFTER (isolated):

Search  → Search DB (or cache + DB)
Payment → Payment DB
Admin   → Read replica

Each service has its own resource pool. A surge in one doesn't affect the others.

The Blast Radius Principle

Every shared resource is a potential blast radius amplifier. When you share:

A database
A network connection
A thread pool
A queue
A rate limit
A budget

…you're saying "the failure of any one consumer can affect all consumers." Sometimes sharing is the right choice (cost, simplicity). But you must know what you're risking.
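One way to "know what you're risking" is to write the dependencies down and compute who is affected when a shared resource fails. The sketch below encodes the architecture from this example as a dependency map. The component names are shorthand chosen here, and `blast_radius` is a hypothetical helper.

```python
from collections import defaultdict

# Dependency map taken from the architecture described in this example.
DEPENDS_ON = {
    "search_api": ["database"],
    "payment_api": ["database"],
    "admin_dashboard": ["database"],
    "web_servers": ["search_api", "payment_api"],
}


def blast_radius(failed: str) -> set:
    """Return every component that depends on `failed`, directly or not."""
    # Invert the map: for each resource, who consumes it?
    dependents = defaultdict(set)
    for component, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].add(component)

    # Walk upward from the failed resource to everything that relies on it.
    affected, stack = set(), [failed]
    while stack:
        for component in dependents[stack.pop()]:
            if component not in affected:
                affected.add(component)
                stack.append(component)
    return affected


print(sorted(blast_radius("database")))
print(sorted(blast_radius("search_api")))
```

A database failure reaches every component in the map, while a search API failure reaches only the web servers. The size of that set is a simple, reviewable measure of how much is riding on each shared resource.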

The Traffic Spike Is Not the Bug

The flash sale email was the trigger, not the cause. The cause was:

No isolation between critical (payment) and non-critical (search) systems
No rate limiting to prevent overload
No caching for predictable high-volume queries
No coordination between marketing and engineering

The flash sale was a normal business event. The system should have handled it — or at least degraded gracefully instead of collapsing completely.

Failure Category Map

Root cause:    Resource failure (shared database overwhelmed)
Trigger:       External traffic spike (flash sale email)
Amplifier:     Thundering herd (user retries), shared resource (database)
Blast radius:  ALL services (search, payment, admin, support)
Severity:      Critical (payment broken = lost revenue)
Could prevent: Resource isolation (separate databases), rate limiting,
               caching, load testing, marketing coordination
Could detect:  Database connection pool monitoring, query queue length 
               alerts, response time thresholds
Could reduce:  Graceful degradation (return cached results, queue 
               payments for retry instead of dropping them)
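The "Could detect" row can be sketched as a simple threshold check. The metric names and threshold values below are assumptions for illustration only; the 500 ms latency figure reuses the "normal" search response time from the scenario.

```python
from dataclasses import dataclass


@dataclass
class DbMetrics:
    """Snapshot of hypothetical database metrics."""
    pool_in_use: int
    pool_size: int
    queued_queries: int
    p95_latency_ms: float


# Illustrative thresholds only; real values depend on the system.
POOL_UTILIZATION_WARN = 0.8
QUEUE_LENGTH_WARN = 100
LATENCY_WARN_MS = 500.0


def check_alerts(m: DbMetrics) -> list:
    """Return a human-readable alert for each threshold that is exceeded."""
    alerts = []
    if m.pool_in_use / m.pool_size >= POOL_UTILIZATION_WARN:
        alerts.append("connection pool nearly exhausted")
    if m.queued_queries >= QUEUE_LENGTH_WARN:
        alerts.append("query queue growing")
    if m.p95_latency_ms >= LATENCY_WARN_MS:
        alerts.append("response time above threshold")
    return alerts


print(check_alerts(DbMetrics(pool_in_use=95, pool_size=100,
                             queued_queries=450, p95_latency_ms=15000.0)))
```

Because the admin dashboard in this incident went down with the database it was meant to observe, alerts like these are most useful when they are delivered through a path that does not depend on that same database.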

Compare With Previous Examples

Aspect	E-Commerce (wrong items)	Messaging (wrong order)	Flight Booking (cascade)
Number of symptoms	1 (wrong items)	1 (wrong order)	5+ (everything breaks)
Failure category	Integration	Timing	Resource + cascading
Trigger	System upgrade	Scaling to 3 servers	Traffic spike
Root cause	Stale cache	Non-deterministic ordering	Shared database bottleneck
Fix complexity	Simple (one change)	Moderate (client-side compromise)	High (layered, multiple changes)
Blast radius	30% of orders	Group conversations	Every user, every feature
Feedback loop?	No	No	Yes (retries amplify load)
Prevention theme	Contracts + boundaries	Architectural ordering decisions	Resource isolation + graceful degradation

The key escalation: from a single-cause bug, to a design limitation, to a systemic vulnerability. Each example requires more sophisticated thinking about failure.

Systems Thinking for Engineers