
When the Message “Disappears”: A Production-Focused Guide Using AWS SQS


In most production incidents involving “missing messages,” the queue is blamed early.

SQS is down.
The message was dropped.
AWS lost it.

True message loss inside managed queue infrastructure is extremely rare. What teams experience instead is a loss of certainty across lifecycle boundaries.

The system accepted an event.
Infrastructure metrics look healthy.
The business outcome did not occur.

That gap — between technical signals and business reality — is where distributed systems become difficult.

This article breaks down how messages appear to disappear, why teams usually detect it too late, and how to design systems that remain diagnosable and recoverable.


1. Start With the Lifecycle, Not the Queue

A simplified SQS lifecycle:

Producer → SQS → Consumer → Process → Commit → Delete
Else → Visibility Timeout → Retry → DLQ

Every transition is a failure boundary.

A message can:

  • Fail to publish (including partial batch failures).

  • Be published but never consumed (misconfiguration, IAM, polling issues).

  • Be consumed but fail during processing.

  • Succeed in processing but fail during state commit.

  • Be retried due to visibility timeout.

  • Move to a DLQ after max receives.

  • Expire due to retention limits.

  • Be processed twice and overwrite newer state.

If these transitions are not observable, investigation becomes reconstruction.
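The very first boundary, failing to publish, includes the easy-to-miss partial batch failure: SQS can accept some entries of a batch and reject others in the same successful API call. A minimal sketch of handling that case (the `{"Successful": [...], "Failed": [...]}` response shape mirrors the real `SendMessageBatch` response; `make_flaky_publisher` is an in-memory stand-in, not a boto3 call):

```python
# Sketch: retry only the entries SQS reported as failed, instead of
# treating the whole batch call as success or failure.

def make_flaky_publisher(fail_once_ids):
    """Stand-in publisher: the listed Ids fail on their first attempt only."""
    pending_failures = set(fail_once_ids)

    def publish_batch(entries):
        failed = [e["Id"] for e in entries if e["Id"] in pending_failures]
        pending_failures.difference_update(failed)
        return {
            "Successful": [{"Id": e["Id"]} for e in entries if e["Id"] not in failed],
            "Failed": [{"Id": fid, "SenderFault": False} for fid in failed],
        }

    return publish_batch

def publish_with_retry(publish_batch, entries, max_attempts=3):
    """Re-send only the entries listed in the Failed part of the response."""
    pending = entries
    for _ in range(max_attempts):
        response = publish_batch(pending)
        failed_ids = {f["Id"] for f in response.get("Failed", [])}
        if not failed_ids:
            return []  # everything acknowledged
        pending = [e for e in pending if e["Id"] in failed_ids]
    return pending  # still unsent: caller must persist these or alert
```

Whatever `publish_with_retry` returns non-empty must be treated as unpublished, even though the batch calls themselves "succeeded".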


2. Infrastructure Success vs Business Correctness

One of the most expensive patterns in distributed systems:

  1. Message is received.

  2. Business logic throws.

  3. Exception is caught or downgraded.

  4. Message is deleted.

  5. Metrics remain green.

From the queue’s perspective, the lifecycle completed.

From the business perspective, nothing happened.

This disconnect emerges when systems measure:

  • Messages sent

  • Messages received

  • Messages deleted

But do not measure:

  • Domain invariants

  • State transitions

  • Outcome completion

Queue health is not system health.
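The structural fix for this pattern is to make deletion conditional on the business commit, not on message receipt. A minimal sketch, with an in-memory dict standing in for the database and a list standing in for the SQS delete call:

```python
# Sketch: delete only after the business state is durably committed.
# On failure, deliberately do NOT delete -- the visibility timeout makes
# the message visible again and retry/DLQ policy takes over.

class BusinessError(Exception):
    pass

def process(message):
    # Stand-in business logic: poison messages raise.
    if message.get("poison"):
        raise BusinessError("invalid payload")
    return {"status": "done"}

def handle(message, store, deleted):
    try:
        result = process(message)      # business logic may throw
        store[message["id"]] = result  # commit state first
    except BusinessError:
        return False                   # no delete: message will be redelivered
    deleted.append(message["id"])      # delete only after commit
    return True
```

The ordering is the point: a crash between commit and delete causes a duplicate delivery (handled by idempotency, below), while the inverted ordering causes silent loss.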


3. Visibility Timeout and Duplicate Effects

SQS guarantees at-least-once delivery.

If processing time exceeds the visibility timeout:

  • The message becomes visible again.

  • Another consumer processes it.

  • Side effects execute more than once.

Without idempotent handlers, this leads to:

  • Reverted state

  • Conflicting updates

  • Financial inconsistencies

  • External API duplication

Exactly-once semantics do not emerge automatically from SQS. They must be constructed at the application layer.

Idempotency and conditional state transitions are foundational, not optional.
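A minimal idempotent consumer guards the side effect with a deduplication key, so redelivery after a visibility-timeout expiry does not execute it twice. In production the seen-set would be a durable conditional write (e.g. a DynamoDB put with a condition expression), not in-process memory as in this sketch:

```python
# Sketch: a dedup key recorded before the side effect makes duplicate
# deliveries harmless. The in-memory set is a stand-in for a durable
# conditional write.

def make_idempotent_handler(side_effect):
    seen = set()

    def handle(message):
        key = message["dedup_key"]
        if key in seen:
            return False   # duplicate delivery: acknowledge, skip side effect
        seen.add(key)      # analogue of a conditional put
        side_effect(message)
        return True

    return handle
```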


4. DLQ as a System Signal

A DLQ is a diagnostic channel.

In multiple real-world incidents:

  • The primary queue throughput was normal.

  • Consumers were active.

  • No alarms fired.

Meanwhile, the DLQ accumulated messages due to:

  • Schema evolution mismatches

  • Validation failures

  • Unexpected enum values

  • Downstream dependency errors

Teams discovered this days later during reconciliation.

DLQ depth should be treated as a production signal with strict alerting thresholds.
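Wiring the DLQ itself is a one-line queue attribute; the alerting discipline is the hard part. The `RedrivePolicy` JSON keys below (`deadLetterTargetArn`, `maxReceiveCount`) are the real SQS attribute format; the ARN is a placeholder:

```python
import json

# Sketch: make "max receives" an explicit, reviewable number by
# constructing the RedrivePolicy attribute value.

def redrive_policy(dlq_arn, max_receives=5):
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receives),
    })

# Applied with something like:
# sqs.set_queue_attributes(QueueUrl=queue_url,
#                          Attributes={"RedrivePolicy": redrive_policy(dlq_arn)})
#
# The alert then goes on the DLQ's ApproximateNumberOfMessagesVisible
# CloudWatch metric, with a threshold near zero.
```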


5. Retention Is Part of Reliability

SQS retention defaults to four days and can extend to fourteen.

If consumers are unavailable beyond retention, messages are deleted.

When detection occurs late:

  • The original events may no longer exist.

  • Replay is impossible without external persistence.

  • Reconstruction requires alternative data sources.

Retention settings must align with operational recovery expectations.

If recovery time objectives exceed retention, data loss becomes predictable.
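Retention is configured in seconds via the `MessageRetentionPeriod` queue attribute (default 345,600 seconds, maximum 1,209,600), which makes the alignment check above trivially automatable. A sketch:

```python
# Sketch: validate retention against the recovery time objective as a
# number, not a hope.

FOUR_DAYS = 4 * 24 * 3600       # 345600 s: SQS default retention
FOURTEEN_DAYS = 14 * 24 * 3600  # 1209600 s: SQS maximum retention

def retention_covers_rto(retention_seconds, rto_seconds):
    """False means data loss is predictable, not hypothetical."""
    return retention_seconds >= rto_seconds

# Applied with something like:
# sqs.set_queue_attributes(QueueUrl=queue_url,
#                          Attributes={"MessageRetentionPeriod": str(FOURTEEN_DAYS)})
```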


6. Why Detection Happens Too Late

Most systems monitor infrastructure but not domain outcomes.

Infrastructure metrics:

  • Sent

  • Received

  • Deleted

  • Queue depth

Business metrics:

  • Orders completed

  • Payments captured

  • State transitions finalized

Without business-level observability, failures surface only when humans notice discrepancies.

By that time:

  • Retention windows may have closed.

  • Logs may have rotated.

  • State divergence may have propagated.

The problem transitions from debugging to recovery.


7. Replay Is a System Capability, Not an Emergency Script

Replaying messages introduces additional constraints.

Idempotency

Reprocessing events can trigger duplicate side effects:

  • Financial operations

  • Notifications

  • External integrations

Consumers must tolerate historical re-execution safely.

Ordering

Standard SQS queues do not guarantee ordering (FIFO queues do, with throughput trade-offs).

Replaying subsets of events may apply state transitions out of sequence.

Version checks or sequence validation are required to prevent regression.

Reconstruction

If events are no longer available in the queue, replay requires:

  • Audit tables

  • Change data capture streams

  • Data warehouse reconstruction

  • External reconciliation

This significantly increases operational complexity.

Load Amplification

Bulk reprocessing can overload downstream services and recreate the original failure condition.

Replay requires throttling, isolation, and staged execution.
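The ordering and load constraints can be combined into one small replay loop: apply events in throttled batches, and let a per-entity version check reject stale events so replay cannot regress state. All names here are illustrative:

```python
import time

# Sketch: staged, throttled replay with a version guard.

def make_versioned_apply(state):
    """Apply an event only if its version is newer than stored state."""
    def apply(event):
        key, version = event["key"], event["version"]
        if state.get(key, -1) >= version:
            return False   # stale replay: skip to prevent regression
        state[key] = version
        return True
    return apply

def replay(events, apply, batch_size=10, pause_seconds=0.0):
    replayed = 0
    for i in range(0, len(events), batch_size):
        for event in events[i:i + batch_size]:
            if apply(event):
                replayed += 1
        time.sleep(pause_seconds)  # throttle to protect downstream services
    return replayed
```

A non-zero `pause_seconds` (or a proper rate limiter) is what keeps bulk reprocessing from recreating the original overload.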


8. Designing for Certainty

Systems that remain diagnosable under stress share several characteristics.

End-to-End Traceability

Every message carries a correlation identifier across boundaries.

You can answer:

  • When was the event published?

  • Which consumer processed it?

  • Was it retried?

  • Did it reach the DLQ?

  • Was state committed durably?

Without this, incident timelines become speculative.
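A sketch of the mechanical part: attach a correlation identifier at publish time and read it back on the consumer side. The `MessageAttributes` shape matches SQS message attributes; the attribute name `correlation_id` is a team convention, not an SQS built-in:

```python
import uuid

# Sketch: one identifier generated at publish time, carried through
# every log line and downstream call.

def with_correlation(body, correlation_id=None):
    cid = correlation_id or str(uuid.uuid4())
    return {
        "MessageBody": body,
        "MessageAttributes": {
            "correlation_id": {"DataType": "String", "StringValue": cid}
        },
    }

def correlation_of(message):
    return message["MessageAttributes"]["correlation_id"]["StringValue"]
```

Every log statement in producer, consumer, and DLQ tooling should include this value, so the five questions above become queries instead of archaeology.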


Idempotent State Transitions

State changes are guarded by:

  • Version checks

  • Conditional updates

  • Deduplication keys

  • Event sequencing

This enables safe replay and retry.


Durable Event Storage

Queues are delivery systems, not archival systems.

Persisting events durably — or maintaining an append-only event log — expands recovery options beyond queue retention.


Business-Level Monitoring

Emit metrics tied directly to domain outcomes.

Infrastructure metrics indicate delivery behavior.

Business metrics indicate correctness.

Detection speed determines recovery complexity.
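A minimal sketch of what "business-level" means in code: count domain outcomes next to infrastructure counts and alarm on the gap. The outcome names are illustrative; on AWS these counts would typically become CloudWatch custom metrics:

```python
# Sketch: domain outcome counters whose difference is the alertable
# signal -- accepted events that never reached a business outcome.

class OutcomeMetrics:
    def __init__(self):
        self.counts = {}

    def record(self, outcome):
        self.counts[outcome] = self.counts.get(outcome, 0) + 1

    def gap(self, accepted, completed):
        """Accepted events with no corresponding completed outcome."""
        return self.counts.get(accepted, 0) - self.counts.get(completed, 0)
```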


Incident Checklist

When someone says “a message is missing,” proceed methodically:

  1. Verify publish acknowledgment and identifiers.

  2. Compare Sent vs Received vs Deleted.

  3. Inspect DLQ depth and payload contents.

  4. Evaluate processing time vs visibility timeout.

  5. Assess idempotency guarantees.

  6. Confirm retention settings.

  7. Compare business metrics against expected throughput.

  8. Evaluate replay risk before executing recovery.

Structured progression reduces uncertainty quickly.


Closing Thoughts

Distributed systems rarely fail in dramatic ways.

They degrade at lifecycle boundaries:

  • Retries

  • Timeouts

  • Partial commits

  • Schema drift

  • Weak observability

Messages do not typically vanish outright.

What erodes is certainty.

Systems designed with traceability, idempotency, and replay in mind remain bounded during incidents.

Systems without those properties turn a simple Slack question into a multi-day investigation.

Design for clarity early. Operational confidence depends on it.