Not All Race Conditions Are Threads — Race Conditions in Distributed Systems

When engineers hear “race condition”, most imagine two threads modifying the same variable.

That is the smallest version of the problem.

In distributed systems, race conditions are far more dangerous because they don’t depend on shared memory.
They depend on time, ordering and partial knowledge.

No locks.
No stack traces.
No deterministic reproduction.

And the system can be perfectly healthy from an infrastructure perspective.

This post is about the kinds of race conditions that actually appear in production backend systems.

1) The Double-Execution Race (Duplicate Processing)

This is the most common distributed race condition.

A worker processes a message.
The message broker doesn’t receive the acknowledgement in time.
The broker redelivers.

Now two workers execute the same operation.

Typical scenarios:

order creation
payment capture
email sending
inventory reservation
coupon redemption

Nothing crashed.

The system did exactly what it was designed to do: at-least-once delivery.

But the business operation was not idempotent.

What makes this dangerous

The second execution is not a retry from the same process.
It is a concurrent logical operation.

You now have:

two payment captures
two shipments
two state transitions
inconsistent accounting

And the logs look completely valid.

Typical mistaken fixes

increasing visibility timeout
reducing consumer concurrency
adding delays

Those reduce the probability of the race. They do not eliminate it.

Real fix

You need idempotency at the business boundary, not at the infrastructure layer.

Examples:

idempotency keys stored with unique constraints
operation tokens
deduplication tables
state transition guards

The system must be able to answer:

“Has this operation already been logically completed?”

Not:

“Has this message already been seen by this worker?”

2) The Lost Update Race (Concurrent Writers)

Two services read the same entity state and both decide to modify it.

Timeline:

Service A reads balance = 100
Service B reads balance = 100
A subtracts 40 → writes 60
B subtracts 80 → writes 20

Final state: 20
Correct state: −20 or rejected

No conflicts detected.
Database behaved correctly.
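The timeline above can be replayed deterministically. This is only a sketch: a plain dictionary stands in for the shared database row, and the interleaving is written out by hand rather than produced by real concurrent services.

```python
# A dictionary stands in for one row in a shared database (assumption for illustration).
account = {"balance": 100}

# Both services read before either one writes.
read_by_a = account["balance"]
read_by_b = account["balance"]

# Each service computes its new value from its own stale read.
account["balance"] = read_by_a - 40  # A writes 60
account["balance"] = read_by_b - 80  # B writes 20, silently overwriting A's update

final_balance = account["balance"]
balance_if_both_applied = 100 - 40 - 80
```

Here final_balance is 20, while applying both debits would give -20. A's debit of 40 has vanished without any error.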

This happens frequently with:

wallets
inventory
quotas
rate limits
seat reservations

Why transactions don’t automatically save you

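Because both transactions are individually valid. At the default isolation level of most relational databases, neither one sees a conflict.

The race is between each read and the write that depends on it, not between the writes themselves.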

Correct approaches

optimistic locking (version column)
compare-and-swap updates
conditional writes
atomic database operations
append-only ledgers instead of mutable state

The real solution is not stronger transactions.

It is state transition control.
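To make the first approach concrete, here is a minimal sketch of optimistic locking with a version column, again using sqlite3 from the Python standard library. The accounts table and the read_account and try_debit helpers are hypothetical names for this example. The interleaving from the timeline above is replayed by hand.

```python
import sqlite3


def read_account(conn, account_id):
    """Return (balance, version) as currently stored."""
    return conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()


def try_debit(conn, account_id, amount, seen_balance, seen_version):
    """Apply the debit only if the row is still at the version we read.

    Returns True if the write was applied, False if it was rejected
    (insufficient funds, or another writer got there first).
    """
    if seen_balance < amount:
        return False
    with conn:  # one transaction around the conditional write
        cursor = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (seen_balance - amount, account_id, seen_version),
        )
    # rowcount is 0 when the version no longer matches.
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts (id, balance, version) VALUES (1, 100, 0)")
conn.commit()

# Both services read the same state.
seen_by_a = read_account(conn, 1)
seen_by_b = read_account(conn, 1)

a_applied = try_debit(conn, 1, 40, *seen_by_a)  # version matches, write applied
b_applied = try_debit(conn, 1, 80, *seen_by_b)  # stale version, write refused

# B must re-read and decide again against the current state.
b_retry_applied = try_debit(conn, 1, 80, *read_account(conn, 1))
final_balance, final_version = read_account(conn, 1)
```

B's first write is refused because the version moved, and its retry is rejected because the current balance of 60 cannot cover 80. The overwrite from the timeline can no longer happen silently.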

3) The Out-of-Order Event Race

Distributed systems do not guarantee global ordering.

Even Kafka does not: it guarantees ordering only within a partition.

Typical example:

OrderCancelled
`Order
