Failure Modes — Example: Messaging App Mystery

The Scenario

A team messaging app (like Slack). Users are reporting that messages are appearing out of order in group conversations. A message sent at 2:03 PM appears above a message sent at 2:01 PM. It doesn't happen in every conversation, and it doesn't happen all the time. Some users say they can't reproduce it.

This is a timing failure, one of the hardest categories to debug because the problem is intermittent and order-dependent.


Step 1: Observe the Symptom Precisely

We collect specific reports:

| Report | Conversation | What User Sees | Expected Order |
|--------|--------------|----------------|----------------|
| 1 | #engineering (45 members) | Message from Bob at 2:03 appears above Alice's from 2:01 | Alice first, Bob second |
| 2 | #engineering (45 members) | Same message pair, but Carol sees them in the correct order | Depends on viewer? |
| 3 | DM between Dave and Eve | Never happens; DMs always in order | |
| 4 | #general (200 members) | Frequent reordering, especially during active discussion | |
| 5 | #random (10 members) | Rarely happens | |

Pattern emerging:

  • Happens in group conversations, not DMs
  • More frequent in larger groups
  • More frequent during high activity
  • Different users see different orders for the same messages

Step 2: Establish What Changed

Users say this "started recently" but can't say exactly when. We check the deployment log:

  • 2 weeks ago: Scaled the messaging backend from 1 server to 3 servers (load balancing) to handle growing user count
  • Nothing else changed

Hypothesis forming: Scaling from 1 server to 3 might be related to the ordering issue.


Step 3: Bisect the Problem Space

The message flow is:

Sender types message → Sender's device sends to server → Server stores message → 
Server broadcasts to group members → Each member's device receives and displays

Check the midpoint: Are messages stored in the correct order?

We query the database for the #engineering conversation around 2:00 PM:

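| Message ID | Sender | Text | Timestamp (server) | Stored Order |
|------------|--------|------|--------------------|--------------|
| msg-4401 | Alice | "Has anyone seen the test results?" | 2:01:03.142 | 1st |
| msg-4402 | Bob | "Just posted them in the doc" | 2:01:47.891 | 2nd |
| msg-4403 | Carol | "Thanks!" | 2:02:15.003 | 3rd |
| msg-4404 | Bob | "The latency numbers look bad" | 2:03:01.556 | 4th |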

The database has the correct order. Messages are stored with server timestamps and the order is right.

So the bug is after storage — in the broadcast/display phase.
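The midpoint check can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the table above, and the function name `out_of_order_pairs` is an assumption, not part of the app's real tooling.

```python
from datetime import time

# Rows from the query result, listed in stored order:
# (message_id, sender, server_timestamp)
stored = [
    ("msg-4401", "Alice", time(14, 1, 3, 142000)),
    ("msg-4402", "Bob", time(14, 1, 47, 891000)),
    ("msg-4403", "Carol", time(14, 2, 15, 3000)),
    ("msg-4404", "Bob", time(14, 3, 1, 556000)),
]


def out_of_order_pairs(rows):
    """Return (earlier_id, later_id) for adjacent rows whose timestamps decrease."""
    return [
        (earlier[0], later[0])
        for earlier, later in zip(rows, rows[1:])
        if later[2] < earlier[2]
    ]


violations = out_of_order_pairs(stored)
print(violations)  # [] means storage order matches timestamp order
```

An empty result clears the storage layer and moves the search to whatever happens after it.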


Step 4: Deeper Investigation — The Broadcast

How does broadcast work?

Before the scale-up (1 server):

Message stored → Server sends to all connected members → Done

After the scale-up (3 servers):

Message stored → Publish event to message queue → All 3 servers read from queue → 
Each server sends to its connected members

With 3 servers, the 45 members of #engineering are distributed:

  • Server 1: 18 members connected
  • Server 2: 15 members connected
  • Server 3: 12 members connected

When Alice sends a message at 2:01, the flow is:

  1. Alice's device → Server 2 (she happens to be connected to Server 2)
  2. Server 2 stores the message
  3. Server 2 publishes "new message" event to the message queue
  4. All 3 servers pick up the event and send it to their connected members

When Bob sends a message at 2:03, the flow is:

  1. Bob's device → Server 1 (he's connected to Server 1)
  2. Server 1 stores the message
  3. Server 1 publishes "new message" event to the message queue
  4. All 3 servers pick up the event

Here's the problem:

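The message queue doesn't guarantee that every consumer receives events in the same order. Server 1 can receive Bob's event before Alice's: Bob's event was published from Server 1 itself (local), while Alice's had to cross the network from Server 2 and was delayed on the way. Ordinary network latency (tens of milliseconds) is only enough to reorder messages sent within milliseconds of each other; for a pair sent two minutes apart, as here, Alice's event must have been held up in the queue for much longer than usual.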

Timeline:

Server 2 stores Alice's msg at 2:01:03.142
Server 2 publishes event ─────────────────────── travels across network
Server 1 stores Bob's msg at 2:03:01.556
Server 1 publishes event ─── stays local

Server 1 receives Bob's event at 2:03:01.560 ← 4 ms after it was stored (local)
Server 1 receives Alice's event at 2:03:01.580 ← 20 ms after Bob's event (delayed crossing the network)

Server 1 broadcasts Bob's message to its 18 connected members FIRST
Server 1 broadcasts Alice's message 20ms later

Users on Server 1 see: Bob, then Alice (WRONG ORDER)
Users on Server 2 see: Alice, then Bob (CORRECT ORDER)

Different users see different orders because they're connected to different servers, and the servers receive events in different orders.
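A small simulation makes the effect concrete. It is a sketch under stated assumptions: the 4 ms local and 20 ms cross-server delays echo the timeline above, the two messages are placed a hypothetical 10 ms apart so that latency alone is enough to flip them, and the names `Event` and `arrival_order` are invented for illustration.

```python
from dataclasses import dataclass

# Assumed delays, loosely based on the 4 ms and 20 ms figures in the timeline.
LOCAL_DELAY_MS = 4
REMOTE_DELAY_MS = 20


@dataclass(frozen=True)
class Event:
    sender: str
    created_ms: int  # when the message was stored, in ms from an arbitrary start
    origin: str      # server that stored the message and published the event


def arrival_order(server, events):
    """Return senders in the order their events reach `server`."""
    def arrival_ms(event):
        delay = LOCAL_DELAY_MS if event.origin == server else REMOTE_DELAY_MS
        return event.created_ms + delay
    return [event.sender for event in sorted(events, key=arrival_ms)]


events = [
    Event("Alice", created_ms=0, origin="server-2"),
    Event("Bob", created_ms=10, origin="server-1"),  # hypothetical 10 ms gap
]

for server in ("server-1", "server-2", "server-3"):
    print(server, arrival_order(server, events))
# server-1 ['Bob', 'Alice']
# server-2 ['Alice', 'Bob']
# server-3 ['Alice', 'Bob']
```

Only the server that published the later message sees the pair flipped, which is why the wrong order shows up for some viewers and not others.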


Step 4 (continued): Form Hypothesis

Hypothesis: When multiple messages arrive at a server via the message queue within a short time window, the server broadcasts them in arrival order (when the event reached that particular server) rather than timestamp order (when the message was actually created). Users connected to different servers receive messages in different orders.

Test the prediction:

If this is correct, then:

  1. DMs never have this problem (they involve only 2 people, likely routed through one server)
  2. Larger groups are more affected (more members means senders and viewers are more likely to be spread across different servers)
  3. Fast-paced conversations are more affected (messages close together in time are more susceptible to reordering)
  4. Users on the same server always see the same order (right or wrong)

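The first three predictions match the reports directly. The fourth is consistent with report 2, where two viewers see different orders for the same message pair, though confirming it fully would require checking which server each viewer was connected to. The hypothesis is strongly supported.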


Step 5: The Fix — And Why It's Not Obvious

Attempted Fix 1: "Just sort by timestamp on the server"

Have each server sort messages by timestamp before broadcasting.

Problem: The server doesn't know if more messages are coming. When it receives Bob's event, should it wait to see if an earlier message might arrive? How long should it wait? 10ms? 100ms? 1 second?

  • Wait too short → still might miss earlier messages
  • Wait too long → messages feel laggy to users

This is a fundamental tradeoff in distributed systems: without coordination, you can't have both immediate delivery and guaranteed correct ordering.
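The waiting-window dilemma can be sketched as a hold-back buffer. This is a toy model, not the app's code: `HoldBackBuffer` and `broadcast_order` are hypothetical names, time is a simple millisecond counter, and all servers are assumed to share a synchronized clock.

```python
import heapq


class HoldBackBuffer:
    """Hold each message until its timestamp is `window_ms` old, then release in timestamp order."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self._heap = []  # (created_ms, sender), smallest timestamp first

    def receive(self, created_ms, sender):
        heapq.heappush(self._heap, (created_ms, sender))

    def release(self, now_ms):
        """Return senders of all buffered messages that have waited out the window."""
        ready = []
        while self._heap and self._heap[0][0] + self.window_ms <= now_ms:
            ready.append(heapq.heappop(self._heap)[1])
        return ready


def broadcast_order(window_ms, arrivals, end_ms=200):
    """Replay (arrival_ms, created_ms, sender) tuples and return the broadcast order."""
    buffer = HoldBackBuffer(window_ms)
    pending = sorted(arrivals)
    order = []
    for now in range(end_ms + 1):
        while pending and pending[0][0] <= now:
            _, created_ms, sender = pending.pop(0)
            buffer.receive(created_ms, sender)
        order.extend(buffer.release(now))
    return order


# Hypothetical arrivals at one server: Bob's later message arrives first.
arrivals = [(14, 10, "Bob"), (20, 0, "Alice")]

print(broadcast_order(5, arrivals))   # ['Bob', 'Alice']  window too short
print(broadcast_order(50, arrivals))  # ['Alice', 'Bob']  correct, but every message waits 50 ms
```

The 50 ms window fixes this particular pair only because Alice's event happened to arrive within it; a slower event would still slip through, and every message now pays the delay.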

Attempted Fix 2: "Sort on the client"

Each user's device sorts messages by server timestamp after receiving them.

Problem: This mostly works, but creates a jarring experience — a message appears at the bottom, then "jumps up" when an earlier message arrives a moment later. Users see messages rearranging in real time, which feels buggy even though it's technically correct.

Actual Fix: Client-side insertion sort with timestamp

Each user's device maintains messages sorted by server timestamp. When a new message arrives:

  1. Check its timestamp against the last displayed message
  2. If it's newer → append at the bottom (most common case, feels instant)
  3. If it's older → insert it at the correct position AND show a subtle visual indicator ("1 earlier message inserted above")

This is a compromise: correct ordering with a visual cue so users aren't confused by messages appearing "above" what they already read.
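A minimal sketch of that client-side logic, assuming each message carries a server timestamp in milliseconds and a unique message ID (used here only to break timestamp ties). The `Conversation` class and its return convention are illustrative assumptions, not the app's actual client code.

```python
import bisect


class Conversation:
    """Keeps messages sorted by (server timestamp, message ID)."""

    def __init__(self):
        self.messages = []  # list of (timestamp_ms, message_id, text), always sorted

    def receive(self, timestamp_ms, message_id, text):
        """Insert a message; return how many already displayed messages now sit below it.

        0 means a plain append. Any positive value means the message was
        inserted above existing ones, so the UI should show the
        "earlier message inserted above" indicator.
        """
        entry = (timestamp_ms, message_id, text)
        if not self.messages or entry >= self.messages[-1]:
            self.messages.append(entry)  # common case: newest message
            return 0
        index = bisect.bisect_right(self.messages, entry)
        self.messages.insert(index, entry)
        return len(self.messages) - 1 - index


conversation = Conversation()
# Bob's later message reaches this client first (hypothetical timestamps).
print(conversation.receive(10, "msg-4404", "The latency numbers look bad"))       # 0
print(conversation.receive(0, "msg-4401", "Has anyone seen the test results?"))   # 1
print([message_id for _, message_id, _ in conversation.messages])
# ['msg-4401', 'msg-4404']
```

Checking the last element first keeps the common append case cheap; the binary search only runs when a message really does arrive late.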


The Deeper Lesson: Distributed Systems Create Timing Failures

Why 1 server didn't have this problem

With one server, all messages passed through a single point. The server processed them sequentially, so the broadcast order always matched the storage order. No timing ambiguity.

Why 3 servers created the problem

With three servers, there are three independent paths for messages to travel. Each path has slightly different timing. This is called non-deterministic ordering — the order depends on network latency, load, and which server the sender is connected to.

The general principle

Any time you add parallelism, you create the possibility of ordering problems. This applies to:

  • Multiple servers
  • Multiple threads in a program
  • Multiple workers processing a queue
  • Multiple microservices handling events

The question isn't "will ordering be a problem?" It's "how will we handle the ordering problem?"


Failure Category Map

Root cause:    Timing failure (non-deterministic event ordering across servers)
Amplifier:     High message volume in large groups
Blast radius:  All group conversations (DMs unaffected)
Severity:      Medium (annoying, confusing, but no data loss)
Could prevent: Design the broadcast system with ordering guarantees from day one
Could detect:  Automated ordering test (send messages with known timestamps,
               verify all clients receive in correct order)
Could reduce:  Client-side timestamp sorting with visual indicators
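The "Could detect" entry might look something like the check below. It is a sketch only: the function name `misordered_clients`, the client names, and the data shapes are assumptions, and a real test would gather each client's received order from instrumented test clients.

```python
def misordered_clients(timestamps, received):
    """Return the clients whose received order is not server-timestamp order.

    timestamps: {message_id: server_timestamp_ms}
    received:   {client_name: [message_id, ...] in the order the client got them}
    """
    bad = []
    for client, message_ids in received.items():
        stamps = [timestamps[message_id] for message_id in message_ids]
        if stamps != sorted(stamps):
            bad.append(client)
    return sorted(bad)


# Hypothetical test run: two known messages, three observing clients.
timestamps = {"msg-a": 0, "msg-b": 10}
received = {
    "client-on-server-1": ["msg-b", "msg-a"],
    "client-on-server-2": ["msg-a", "msg-b"],
    "client-on-server-3": ["msg-a", "msg-b"],
}

print(misordered_clients(timestamps, received))  # ['client-on-server-1']
```

Because the failure is intermittent, a check like this is most useful when run repeatedly under load rather than once.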

Compare With the E-Commerce Bug

| Aspect | E-Commerce (wrong items) | Messaging (wrong order) |
|--------|--------------------------|-------------------------|
| Failure category | Integration (stale cache) | Timing (non-deterministic ordering) |
| Reproducibility | 100% for affected products | Intermittent, depends on timing |
| Who's affected | ~30% of orders (those with changed bins) | Users on different servers, during high activity |
| Data corrupted? | No (database correct, pick list wrong) | No (database correct, display order wrong) |
| Root cause found by | Checking midpoint (pick list) | Checking delivery path (server → client) |
| Fix complexity | Simple (update the data source) | Hard (fundamental distributed systems tradeoff) |
| Prevention | Better contracts and boundary enforcement | Architectural decision about ordering guarantees |

The e-commerce bug had a clear, fixable root cause. The messaging bug revealed a fundamental limitation of the architecture that required a compromise, not a simple fix. This is the difference between a bug and a design constraint.