Failure Modes — Example: Messaging App Mystery
The Scenario
A team messaging app (like Slack). Users are reporting that messages are appearing out of order in group conversations. A message sent at 2:03 PM appears above a message sent at 2:01 PM. It doesn't happen in every conversation, and it doesn't happen all the time. Some users say they can't reproduce it.
This is a timing failure, one of the hardest categories to debug because the problem is intermittent and order-dependent.
Step 1: Observe the Symptom Precisely
We collect specific reports:
| Report | Conversation | What User Sees | Expected Order |
|---|---|---|---|
| 1 | #engineering (45 members) | Message from Bob at 2:03 appears above Alice's from 2:01 | Alice first, Bob second |
| 2 | #engineering (45 members) | Same message pair — but Carol sees them in the correct order | Depends on viewer? |
| 3 | DM between Dave and Eve | Never happens — DMs always in order | — |
| 4 | #general (200 members) | Frequent reordering, especially during active discussion | — |
| 5 | #random (10 members) | Rarely happens | — |
Pattern emerging:
- Happens in group conversations, not DMs
- More frequent in larger groups
- More frequent during high activity
- Different users see different orders for the same messages
Step 2: Establish What Changed
Users say this "started recently" but can't say exactly when. We check the deployment log:
- 2 weeks ago: Scaled the messaging backend from 1 server to 3 servers (load balancing) to handle growing user count
- Nothing else changed
Hypothesis forming: Scaling from 1 server to 3 might be related to the ordering issue.
Step 3: Bisect the Problem Space
The message flow is:
Sender types message → Sender's device sends to server → Server stores message →
Server broadcasts to group members → Each member's device receives and displays
Check the midpoint: Are messages stored in the correct order?
We query the database for the #engineering conversation around 2:00 PM:
| Message ID | Sender | Text | Timestamp (server) | Stored Order |
|---|---|---|---|---|
| msg-4401 | Alice | "Has anyone seen the test results?" | 2:01:03.142 | 1st |
| msg-4402 | Bob | "Just posted them in the doc" | 2:01:47.891 | 2nd |
| msg-4403 | Carol | "Thanks!" | 2:02:15.003 | 3rd |
| msg-4404 | Bob | "The latency numbers look bad" | 2:03:01.556 | 4th |
The database has the correct order. Messages are stored with server timestamps and the order is right.
So the bug is after storage — in the broadcast/display phase.
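The midpoint check can be automated. The sketch below copies the four rows from the table above into a list and confirms that storage order matches timestamp order. The names `stored_rows` and `is_stored_in_timestamp_order` are illustrative and not part of the app; a real check would read the rows from the database.

```python
from datetime import time

# Rows copied from the table above: (message_id, sender, server_timestamp).
# 2:01:03.142 PM is written as time(14, 1, 3, 142000) (microseconds).
stored_rows = [
    ("msg-4401", "Alice", time(14, 1, 3, 142000)),
    ("msg-4402", "Bob", time(14, 1, 47, 891000)),
    ("msg-4403", "Carol", time(14, 2, 15, 3000)),
    ("msg-4404", "Bob", time(14, 3, 1, 556000)),
]


def is_stored_in_timestamp_order(rows):
    """Return True if every row's timestamp is >= the previous row's."""
    timestamps = [ts for _, _, ts in rows]
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))


print(is_stored_in_timestamp_order(stored_rows))  # True
```

If this check passes while users still see reordering, the fault lies downstream of storage.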
Step 4: Deeper Investigation — The Broadcast
How does broadcast work?
Before the scale-up (1 server):
Message stored → Server sends to all connected members → Done
After the scale-up (3 servers):
Message stored → Publish event to message queue → All 3 servers read from queue →
Each server sends to its connected members
With 3 servers, the 45 members of #engineering are distributed:
- Server 1: 18 members connected
- Server 2: 15 members connected
- Server 3: 12 members connected
When Alice sends a message at 2:01, the flow is:
- Alice's device → Server 2 (she happens to be connected to Server 2)
- Server 2 stores the message
- Server 2 publishes "new message" event to the message queue
- All 3 servers pick up the event and send it to their connected members
When Bob sends a message at 2:03, the flow is:
- Bob's device → Server 1 (he's connected to Server 1)
- Server 1 stores the message
- Server 1 publishes "new message" event to the message queue
- All 3 servers pick up the event
Here's the problem:
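The message queue doesn't guarantee that every consumer receives events in the same order. Server 1 can receive Bob's event before Alice's because Bob's event was published from Server 1 itself (local), while Alice's had to cross the network from Server 2 and could be delayed along the way. Note that a 20ms network hop alone cannot reorder messages sent two minutes apart; for this pair, Alice's event must have been held up much longer (for example, queued behind a backlog during heavy traffic). The same mechanism reorders messages sent milliseconds apart far more easily.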
Timeline:
Server 2 stores Alice's msg at 2:01:03.142
Server 2 publishes event ─────────────────────── travels across network
Server 1 stores Bob's msg at 2:03:01.556
Server 1 publishes event ─── stays local
Server 1 receives Bob's event at 2:03:01.560 ← 4ms after it was published (local)
Server 1 receives Alice's event at 2:03:01.580 ← 20ms after Bob's event (delayed in transit)
Server 1 broadcasts Bob's message to its 18 connected members FIRST
Server 1 broadcasts Alice's message 20ms later
Users on Server 1 see: Bob, then Alice (WRONG ORDER)
Users on Server 2 see: Alice, then Bob (CORRECT ORDER)
Different users see different orders because they're connected to different servers, and the servers receive events in different orders.
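A minimal sketch of that effect, assuming each server forwards events in plain arrival order. Times are seconds after 2:00:00 PM. Server 1's arrival times come from the timeline above. Server 2's are assumptions made for illustration (4ms for its local event, 20ms for the remote one). The names `arrivals_s` and `broadcast_order` are hypothetical.

```python
# Arrival time of each sender's event at each server, in seconds after 2:00:00 PM.
# server-1 values are from the timeline; server-2 values are assumed.
arrivals_s = {
    "server-1": {"alice": 181.580, "bob": 181.560},
    "server-2": {"alice": 63.146, "bob": 181.576},
}


def broadcast_order(arrivals):
    """Return senders in the order a server forwards them: arrival order."""
    return sorted(arrivals, key=arrivals.get)


for server, arrivals in arrivals_s.items():
    print(server, broadcast_order(arrivals))
# server-1 ['bob', 'alice']
# server-2 ['alice', 'bob']
```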
Step 4 (continued): Form Hypothesis
Hypothesis: When multiple messages arrive at a server via the message queue within a short time window, the server broadcasts them in arrival order (when the event reached that particular server) rather than timestamp order (when the message was actually created). Users connected to different servers receive messages in different orders.
Test the prediction:
If this is correct, then:
- DMs never have this problem (they involve only 2 people, likely routed through one server)
- Larger groups are more affected (more members means they are spread across more servers, and more messages are in flight)
- Fast-paced conversations are more affected (messages close together in time are more susceptible to reordering)
- Users on the same server always see the same order (right or wrong)
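The first three predictions match reports 3, 4, and 5 directly. The fourth is consistent with report 2 (Carol sees the correct order while another member of the same channel does not) and can be checked by looking up which server each reporter was connected to. The hypothesis is well supported.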
Step 5: The Fix — And Why It's Not Obvious
Attempted Fix 1: "Just sort by timestamp on the server"
Have each server sort messages by timestamp before broadcasting.
Problem: The server doesn't know if more messages are coming. When it receives Bob's event, should it wait to see if an earlier message might arrive? How long should it wait? 10ms? 100ms? 1 second?
- Wait too short → still might miss earlier messages
- Wait too long → messages feel laggy to users
This is a fundamental tradeoff in distributed systems: without coordination between servers, you can't have both instant delivery and perfectly correct ordering.
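The tradeoff can be seen in a small simulation. This is a sketch under stated assumptions, not the app's code: `ReorderBuffer`, `simulate`, and the hold values are hypothetical. The buffer holds each event for `hold_ms` after it arrives and releases events in timestamp order. Arrival times are milliseconds after 2:03:01.000, and message timestamps are seconds after 2:00:00 PM.

```python
import heapq


class ReorderBuffer:
    """Hold events briefly, then release them in timestamp order."""

    def __init__(self, hold_ms):
        self.hold_ms = hold_ms
        self._heap = []  # (timestamp, arrival_ms, text); earliest timestamp on top

    def add(self, timestamp, text, arrival_ms):
        heapq.heappush(self._heap, (timestamp, arrival_ms, text))

    def release(self, now_ms):
        """Pop events from the top whose hold window has expired."""
        released = []
        while self._heap and self._heap[0][1] + self.hold_ms <= now_ms:
            released.append(heapq.heappop(self._heap)[2])
        return released


def simulate(hold_ms, events):
    """events: list of (arrival_ms, timestamp, text). Returns broadcast order."""
    buffer = ReorderBuffer(hold_ms)
    out = []
    for arrival_ms, timestamp, text in sorted(events):
        out += buffer.release(arrival_ms)  # flush whatever is due by now
        buffer.add(timestamp, text, arrival_ms)
    out += buffer.release(float("inf"))  # drain the rest at the end
    return out


# Server 1's view: Bob's event arrives at 560ms, Alice's at 580ms.
events = [(560, 181.556, "Bob"), (580, 63.142, "Alice")]
print(simulate(10, events))  # ['Bob', 'Alice']  window too short
print(simulate(50, events))  # ['Alice', 'Bob']  correct, but 50ms slower
```

A 50ms window fixes this pair only because Alice's event arrived 20ms late. A later arrival would need a longer window, and every message pays the delay.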
Attempted Fix 2: "Sort on the client"
Each user's device sorts messages by server timestamp after receiving them.
Problem: This mostly works, but creates a jarring experience — a message appears at the bottom, then "jumps up" when an earlier message arrives a moment later. Users see messages rearranging in real time, which feels buggy even though it's technically correct.
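Actual Fix: Client-side sorted insertion by server timestamp
Each user's device maintains messages sorted by server timestamp. This relies on the three servers' clocks being closely synchronized; otherwise timestamps assigned by different servers are not directly comparable. When a new message arrives: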
- Check its timestamp against the last displayed message
- If it's newer → append at the bottom (most common case, feels instant)
- If it's older → insert it at the correct position AND show a subtle visual indicator ("1 earlier message inserted above")
This is a compromise: correct ordering with a visual cue so users aren't confused by messages appearing "above" what they already read.
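The insertion logic might look like the sketch below. `MessageList` and its methods are hypothetical names, not the app's client code. Timestamps are zero-padded 24-hour strings, so string order matches time order within one day, and the message ID breaks ties. The sketch assumes timestamps from different servers are comparable.

```python
import bisect


class MessageList:
    """Keep a conversation sorted by (server_timestamp, message_id)."""

    def __init__(self):
        self._keys = []   # sorted list of (server_timestamp, message_id)
        self._texts = []  # message texts, in the same order as _keys

    def insert(self, server_timestamp, message_id, text):
        """Insert a message; return True if it landed above existing ones."""
        key = (server_timestamp, message_id)
        index = bisect.bisect_right(self._keys, key)
        self._keys.insert(index, key)
        self._texts.insert(index, text)
        # Appending at the bottom gives index == last position; anything
        # smaller means the UI should show the "inserted above" indicator.
        return index < len(self._keys) - 1

    def texts(self):
        return list(self._texts)


conversation = MessageList()
print(conversation.insert("14:03:01.556", "msg-4404", "The latency numbers look bad"))  # False
print(conversation.insert("14:01:03.142", "msg-4401", "Has anyone seen the test results?"))  # True
print(conversation.texts())
```

The boolean return value is what would drive the visual indicator: `False` is the common append case, and `True` means an earlier message was placed above something already on screen.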
The Deeper Lesson: Distributed Systems Create Timing Failures
Why 1 server didn't have this problem
With one server, all messages passed through a single point. The server processed them sequentially, so the broadcast order always matched the storage order. No timing ambiguity.
Why 3 servers created the problem
With three servers, there are three independent paths for messages to travel. Each path has slightly different timing. This is called non-deterministic ordering — the order depends on network latency, load, and which server the sender is connected to.
The general principle
Any time you add parallelism, you create the possibility of ordering problems. This applies to:
- Multiple servers
- Multiple threads in a program
- Multiple workers processing a queue
- Multiple microservices handling events
The question isn't "will ordering be a problem?" It's "how will we handle the ordering problem?"
Failure Category Map
Root cause: Timing failure (non-deterministic event ordering across servers)
Amplifier: High message volume in large groups
Blast radius: All group conversations (DMs unaffected)
Severity: Medium (annoying, confusing, but no data loss)
Could prevent: Design the broadcast system with ordering guarantees from day one
Could detect: Automated ordering test (send messages with known timestamps,
verify all clients receive in correct order)
Could reduce: Client-side timestamp sorting with visual indicators
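The automated ordering test named under "Could detect" could be sketched as follows. The function names and the sample data are hypothetical. In a real test, each list would hold the timestamps a test client actually received, in display order. Here timestamps are seconds after 2:00:00 PM.

```python
def out_of_order_pairs(received_timestamps):
    """Return adjacent (shown_first, shown_second) pairs where the first is newer."""
    return [
        (a, b)
        for a, b in zip(received_timestamps, received_timestamps[1:])
        if a > b
    ]


def check_clients(received_by_client):
    """Map each failing client to its out-of-order pairs; {} means all passed."""
    failures = {}
    for client, timestamps in received_by_client.items():
        pairs = out_of_order_pairs(timestamps)
        if pairs:
            failures[client] = pairs
    return failures


# Illustrative data: one test client per server, in the order messages were displayed.
received = {
    "client-on-server-1": [181.556, 63.142],
    "client-on-server-2": [63.142, 181.556],
}
print(check_clients(received))  # {'client-on-server-1': [(181.556, 63.142)]}
```

Because the failure is intermittent, a test like this would need to run under load and with test clients connected to every server.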
Compare With the E-Commerce Bug
| Aspect | E-Commerce (wrong items) | Messaging (wrong order) |
|---|---|---|
| Failure category | Integration (stale cache) | Timing (non-deterministic ordering) |
| Reproducibility | 100% for affected products | Intermittent, depends on timing |
| Who's affected | ~30% of orders (those with changed bins) | Users on different servers, during high activity |
| Data corrupted? | No (database correct, pick list wrong) | No (database correct, display order wrong) |
| Root cause found by | Checking midpoint (pick list) | Checking delivery path (server → client) |
| Fix complexity | Simple (update the data source) | Hard (fundamental distributed systems tradeoff) |
| Prevention | Better contracts and boundary enforcement | Architectural decision about ordering guarantees |
The e-commerce bug had a clear, fixable root cause. The messaging bug revealed a fundamental limitation of the architecture that required a compromise, not a simple fix. This is the difference between a bug and a design constraint.