<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Leandro Maia]]></title><description><![CDATA[Notes on Backend Systems and Software Architecture]]></description><link>https://leandromaia.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 14:34:06 GMT</lastBuildDate><atom:link href="https://leandromaia.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[When the Message “Disappears”: A Production-Focused Guide Using AWS SQS]]></title><description><![CDATA[In most production incidents involving “missing messages,” the queue is blamed early.
SQS is down. The message was dropped. AWS lost it.
True message loss inside managed queue infrastructure is extremel]]></description><link>https://leandromaia.dev/when-the-message-disappears-how-distributed-systems-lose-certainty-and-how-to-get-it-back</link><guid isPermaLink="true">https://leandromaia.dev/when-the-message-disappears-how-distributed-systems-lose-certainty-and-how-to-get-it-back</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 03 Mar 2026 15:53:10 GMT</pubDate><content:encoded><![CDATA[<p>In most production incidents involving “missing messages,” the queue is blamed early.</p>
<p>SQS is down.<br />The message was dropped.<br />AWS lost it.</p>
<p>True message loss inside managed queue infrastructure is extremely rare. What teams experience instead is a loss of <strong>certainty across lifecycle boundaries</strong>.</p>
<p>The system accepted an event.<br />Infrastructure metrics look healthy.<br />The business outcome did not occur.</p>
<p>That gap — between technical signals and business reality — is where distributed systems become difficult.</p>
<p>This article breaks down how messages appear to disappear, why teams usually detect it too late, and how to design systems that remain diagnosable and recoverable.</p>
<hr />
<h2>1. Start With the Lifecycle, Not the Queue</h2>
<p>A simplified SQS lifecycle:</p>
<pre><code class="language-plaintext">Producer → SQS → Consumer → Process → Commit → Delete
Else → Visibility Timeout → Retry → DLQ
</code></pre>
<p>Every transition is a failure boundary.</p>
<p>A message can:</p>
<ul>
<li><p>Fail to publish (including partial batch failures).</p>
</li>
<li><p>Be published but never consumed (misconfiguration, IAM, polling issues).</p>
</li>
<li><p>Be consumed but fail during processing.</p>
</li>
<li><p>Succeed in processing but fail during state commit.</p>
</li>
<li><p>Be retried due to visibility timeout.</p>
</li>
<li><p>Move to a DLQ after max receives.</p>
</li>
<li><p>Expire due to retention limits.</p>
</li>
<li><p>Be processed twice and overwrite newer state.</p>
</li>
</ul>
<p>If these transitions are not observable, investigation becomes reconstruction.</p>
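<p>The delete-last rule implied by that lifecycle can be sketched as a consumer loop. The <code>MessageQueue</code> and <code>StateStore</code> interfaces below are stand-ins for the real SQS client and the system of record, not actual SDK types:</p>
<pre><code class="language-java">// Sketch only: MessageQueue and StateStore stand in for the real SQS client
// and the system of record. The invariant: delete only after a durable commit,
// so a crash at any earlier step causes redelivery, never silent loss.
interface MessageQueue {
    String receive();            // returns null when the queue is empty
    void delete(String message);
}

interface StateStore {
    void commit(String result);  // durable business state change
}

class QueueConsumer {
    private final MessageQueue queue;
    private final StateStore store;

    QueueConsumer(MessageQueue queue, StateStore store) {
        this.queue = queue;
        this.store = store;
    }

    /** Handles at most one message; returns true if one was processed. */
    boolean poll() {
        String message = queue.receive();
        if (message == null) {
            return false;
        }
        String result = process(message);
        store.commit(result);    // if this throws, the message is NOT deleted
        queue.delete(message);   // delete last: redelivery covers crashes
        return true;
    }

    String process(String message) {
        return message.toUpperCase();
    }
}
</code></pre>
<p>If the process crashes before <code>delete</code>, the visibility timeout expires and the message is redelivered; the failure mode is a retry, not silent loss.</p>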
<hr />
<h2>2. Infrastructure Success vs Business Correctness</h2>
<p>One of the most expensive patterns in distributed systems:</p>
<ol>
<li><p>Message is received.</p>
</li>
<li><p>Business logic throws.</p>
</li>
<li><p>Exception is caught or downgraded.</p>
</li>
<li><p>Message is deleted.</p>
</li>
<li><p>Metrics remain green.</p>
</li>
</ol>
<p>From the queue’s perspective, the lifecycle completed.</p>
<p>From the business perspective, nothing happened.</p>
<p>This disconnect emerges when systems measure:</p>
<ul>
<li><p>Messages sent</p>
</li>
<li><p>Messages received</p>
</li>
<li><p>Messages deleted</p>
</li>
</ul>
<p>But do not measure:</p>
<ul>
<li><p>Domain invariants</p>
</li>
<li><p>State transitions</p>
</li>
<li><p>Outcome completion</p>
</li>
</ul>
<p>Queue health is not system health.</p>
<hr />
<h2>3. Visibility Timeout and Duplicate Effects</h2>
<p>SQS guarantees <strong>at-least-once delivery</strong>.</p>
<p>If processing time exceeds visibility timeout:</p>
<ul>
<li><p>The message becomes visible again.</p>
</li>
<li><p>Another consumer processes it.</p>
</li>
<li><p>Side effects execute more than once.</p>
</li>
</ul>
<p>Without idempotent handlers, this leads to:</p>
<ul>
<li><p>Reverted state</p>
</li>
<li><p>Conflicting updates</p>
</li>
<li><p>Financial inconsistencies</p>
</li>
<li><p>External API duplication</p>
</li>
</ul>
<p>Exactly-once semantics do not emerge automatically from SQS. They must be constructed at the application layer.</p>
<p>Idempotency and conditional state transitions are foundational, not optional.</p>
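<p>A minimal sketch of that application-layer guard, assuming each message carries a stable deduplication key (an order id, for example). The set is in-memory for brevity; a production handler would record keys in a durable store using conditional writes.</p>
<pre><code class="language-java">import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the first delivery of a key runs the side effect; redeliveries
// caused by visibility-timeout retries or replays are skipped.
public class IdempotencyGuard {

    private final Set&lt;String&gt; processed = ConcurrentHashMap.newKeySet();

    /** Returns true if the side effect ran, false for a duplicate delivery. */
    public boolean handleOnce(String dedupKey, Runnable sideEffect) {
        if (!processed.add(dedupKey)) {   // Set.add is atomic per key
            return false;
        }
        sideEffect.run();
        return true;
    }
}
</code></pre>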
<hr />
<h2>4. DLQ as a System Signal</h2>
<p>A DLQ is a diagnostic channel.</p>
<p>In multiple real-world incidents:</p>
<ul>
<li><p>The primary queue throughput was normal.</p>
</li>
<li><p>Consumers were active.</p>
</li>
<li><p>No alarms fired.</p>
</li>
</ul>
<p>Meanwhile, the DLQ accumulated messages due to:</p>
<ul>
<li><p>Schema evolution mismatches</p>
</li>
<li><p>Validation failures</p>
</li>
<li><p>Unexpected enum values</p>
</li>
<li><p>Downstream dependency errors</p>
</li>
</ul>
<p>Teams discovered this days later during reconciliation.</p>
<p>DLQ depth should be treated as a production signal with strict alerting thresholds.</p>
<hr />
<h2>5. Retention Is Part of Reliability</h2>
<p>SQS retention defaults to four days and can extend to fourteen.</p>
<p>If consumers are unavailable beyond retention, messages are deleted.</p>
<p>When detection occurs late:</p>
<ul>
<li><p>The original events may no longer exist.</p>
</li>
<li><p>Replay is impossible without external persistence.</p>
</li>
<li><p>Reconstruction requires alternative data sources.</p>
</li>
</ul>
<p>Retention settings must align with operational recovery expectations.  </p>
<p>If recovery time objectives exceed retention, data loss becomes predictable.</p>
<hr />
<h2>6. Why Detection Happens Too Late</h2>
<p>Most systems monitor infrastructure but not domain outcomes.</p>
<p>Infrastructure metrics:</p>
<ul>
<li><p>Sent</p>
</li>
<li><p>Received</p>
</li>
<li><p>Deleted</p>
</li>
<li><p>Queue depth</p>
</li>
</ul>
<p>Business metrics:</p>
<ul>
<li><p>Orders completed</p>
</li>
<li><p>Payments captured</p>
</li>
<li><p>State transitions finalized</p>
</li>
</ul>
<p>Without business-level observability, failures surface only when humans notice discrepancies.</p>
<p>By that time:</p>
<ul>
<li><p>Retention windows may have closed.</p>
</li>
<li><p>Logs may have rotated.</p>
</li>
<li><p>State divergence may have propagated.</p>
</li>
</ul>
<p>The problem transitions from debugging to recovery.</p>
<hr />
<h2>7. Replay Is a System Capability, Not an Emergency Script</h2>
<p>Replaying messages introduces additional constraints.</p>
<h3>Idempotency</h3>
<p>Reprocessing events can trigger duplicate side effects:</p>
<ul>
<li><p>Financial operations</p>
</li>
<li><p>Notifications</p>
</li>
<li><p>External integrations</p>
</li>
</ul>
<p>Consumers must tolerate historical re-execution safely.</p>
<h3>Ordering</h3>
<p>Standard SQS queues do not guarantee ordering.</p>
<p>Replaying subsets of events may apply state transitions out of sequence.  </p>
<p>Version checks or sequence validation are required to prevent regression.</p>
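<p>A sketch of such a version check, assuming each event carries a monotonically increasing version number for its aggregate. Stale or duplicate replays are rejected rather than allowed to regress state:</p>
<pre><code class="language-java">// Sketch: state advances only on strictly newer events, so replaying an
// older subset cannot overwrite a newer value.
public class VersionedState {

    private long version = 0;
    private String value = "";

    /** Applies the event only if its version is newer; returns false otherwise. */
    public synchronized boolean apply(long eventVersion, String newValue) {
        if (eventVersion &lt;= version) {
            return false;   // stale or duplicate event: ignore
        }
        version = eventVersion;
        value = newValue;
        return true;
    }

    public String value() { return value; }
}
</code></pre>
<p>In a database-backed consumer, the same guard becomes a conditional update that matches on the stored version.</p>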
<h3>Reconstruction</h3>
<p>If events are no longer available in the queue, replay requires:</p>
<ul>
<li><p>Audit tables</p>
</li>
<li><p>Change data capture streams</p>
</li>
<li><p>Data warehouse reconstruction</p>
</li>
<li><p>External reconciliation</p>
</li>
</ul>
<p>This significantly increases operational complexity.</p>
<h3>Load Amplification</h3>
<p>Bulk reprocessing can overload downstream services and recreate the original failure condition.</p>
<p>Replay requires throttling, isolation, and staged execution.</p>
<hr />
<h2>8. Designing for Certainty</h2>
<p>Systems that remain diagnosable under stress share several characteristics.</p>
<h3>End-to-End Traceability</h3>
<p>Every message carries a correlation identifier across boundaries.</p>
<p>You can answer:</p>
<ul>
<li><p>When was the event published?</p>
</li>
<li><p>Which consumer processed it?</p>
</li>
<li><p>Was it retried?</p>
</li>
<li><p>Did it reach the DLQ?</p>
</li>
<li><p>Was state committed durably?</p>
</li>
</ul>
<p>Without this, incident timelines become speculative.</p>
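<p>A sketch of the producer-side convention, where a correlation id is minted at the edge and reused on every hop (with SQS it would typically travel as a message attribute; the <code>Envelope</code> type here is illustrative):</p>
<pre><code class="language-java">import java.util.UUID;

// Sketch: every message carries a correlation id. Downstream services reuse
// the incoming id instead of minting a new one, so logs join across hops.
public record Envelope(String correlationId, String payload) {

    public static Envelope wrap(String payload, String incomingCorrelationId) {
        String id = (incomingCorrelationId != null)
                ? incomingCorrelationId
                : UUID.randomUUID().toString();
        return new Envelope(id, payload);
    }
}
</code></pre>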
<hr />
<h3>Idempotent State Transitions</h3>
<p>State changes are guarded by:</p>
<ul>
<li><p>Version checks</p>
</li>
<li><p>Conditional updates</p>
</li>
<li><p>Deduplication keys</p>
</li>
<li><p>Event sequencing</p>
</li>
</ul>
<p>This enables safe replay and retry.</p>
<hr />
<h3>Durable Event Storage</h3>
<p>Queues are delivery systems, not archival systems.</p>
<p>Persisting events durably — or maintaining an append-only event log — expands recovery options beyond queue retention.</p>
<hr />
<h3>Business-Level Monitoring</h3>
<p>Emit metrics tied directly to domain outcomes.</p>
<p>Infrastructure metrics indicate delivery behavior.  </p>
<p>Business metrics indicate correctness.</p>
<p>Detection speed determines recovery complexity.</p>
<hr />
<h2>Incident Checklist</h2>
<p>When someone says “a message is missing,” proceed methodically:</p>
<ol>
<li><p>Verify publish acknowledgment and identifiers.</p>
</li>
<li><p>Compare Sent vs Received vs Deleted.</p>
</li>
<li><p>Inspect DLQ depth and payload contents.</p>
</li>
<li><p>Evaluate processing time vs visibility timeout.</p>
</li>
<li><p>Assess idempotency guarantees.</p>
</li>
<li><p>Confirm retention settings.</p>
</li>
<li><p>Compare business metrics against expected throughput.</p>
</li>
<li><p>Evaluate replay risk before executing recovery.</p>
</li>
</ol>
<p>Structured progression reduces uncertainty quickly.</p>
<hr />
<h2>Closing Thoughts</h2>
<p>Distributed systems rarely fail in dramatic ways.</p>
<p>They degrade at lifecycle boundaries:</p>
<ul>
<li><p>Retries</p>
</li>
<li><p>Timeouts</p>
</li>
<li><p>Partial commits</p>
</li>
<li><p>Schema drift</p>
</li>
<li><p>Weak observability</p>
</li>
</ul>
<p>Messages do not typically vanish outright.</p>
<p>What erodes is certainty.</p>
<p>Systems designed with traceability, idempotency, and replay in mind remain bounded during incidents.</p>
<p>Systems without those properties turn a simple Slack question into a multi-day investigation.</p>
<p>Design for clarity early. Operational confidence depends on it.</p>
]]></content:encoded></item><item><title><![CDATA[Java 21 in Distributed Systems: Bounded Concurrency, Deadlines, and Failure Containment]]></title><description><![CDATA[Modern backend services rarely perform isolated work. A single request often fans out into multiple network calls, database queries and asynchronous operations. The service is effectively coordinating]]></description><link>https://leandromaia.dev/java-21-in-distributed-systems-bounded-concurrency-deadlines-and-failure-containment</link><guid isPermaLink="true">https://leandromaia.dev/java-21-in-distributed-systems-bounded-concurrency-deadlines-and-failure-containment</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Wed, 25 Feb 2026 20:25:08 GMT</pubDate><content:encoded><![CDATA[<p>Modern backend services rarely perform isolated work. A single request often fans out into multiple network calls, database queries and asynchronous operations. The service is effectively coordinating latency rather than performing computation.</p>
<p>In that environment, reliability problems usually come from resource pressure rather than functional errors. Threads pile up waiting on I/O, retry logic multiplies work, and a slow dependency spreads delay across the system. The service remains technically “up”, but it stops behaving predictably.</p>
<p>Java 21 finally gives us practical tools to manage this properly: virtual threads and structured concurrency. They allow writing synchronous-style code while retaining the scalability properties typically associated with reactive frameworks. The real benefit appears when we combine them with three explicit controls:</p>
<ul>
<li><p>bounded concurrency</p>
</li>
<li><p>a global request deadline</p>
</li>
<li><p>cancellation propagation</p>
</li>
</ul>
<p>The combination keeps work proportional to capacity and limits the blast radius of downstream failures.</p>
<hr />
<h2>The Aggregator Problem</h2>
<p>Consider an API endpoint that returns a product page. To assemble the response, it calls several internal services:</p>
<ul>
<li><p>product metadata</p>
</li>
<li><p>pricing</p>
</li>
<li><p>inventory</p>
</li>
<li><p>reviews</p>
</li>
<li><p>recommendations</p>
</li>
</ul>
<p>Each call is fast in isolation. The endpoint is implemented sequentially first, then parallelized to improve latency.</p>
<p>Without constraints, the parallel version introduces a subtle risk: the service can now initiate many outbound calls simultaneously for every incoming request.</p>
<p>When traffic grows or a dependency slows down, the service stops being limited by CPU and becomes limited by waiting operations.</p>
<p>Each client request triggers multiple downstream calls. Under load, the number of concurrent outbound calls grows uncontrollably.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69943b5f2f5eee9031f88a4a/9210be28-b51b-4bee-bf30-099d1f19e99e.png" alt="" style="display:block;margin:0 auto" />

<p>Multiple clients multiply the pattern, and downstream latency feeds back into the caller as growing concurrency.</p>
<h2>A Realistic Aggregator Implementation</h2>
<p>A typical implementation starts simple and perfectly reasonable.</p>
<p>Sequential version:</p>
<pre><code class="language-java">public ProductPage getProductPage(String id) {
    Product product = productClient.get(id);
    Price price = pricingClient.get(id);
    Inventory inventory = inventoryClient.get(id);
    Reviews reviews = reviewsClient.get(id);

    return new ProductPage(product, price, inventory, reviews);
}
</code></pre>
<p>Latency is the sum of downstream calls.<br />If each dependency takes ~80ms, the endpoint takes ~320ms.</p>
<p>The natural next step is parallelization.</p>
<hr />
<h3>First Attempt: CompletableFuture Fan-Out</h3>
<p>Before Java 21, many teams used <code>CompletableFuture</code> to parallelize I/O:</p>
<pre><code class="language-java">public ProductPage getProductPage(String id) {

    CompletableFuture&lt;Product&gt; product =
        CompletableFuture.supplyAsync(() -&gt; productClient.get(id));

    CompletableFuture&lt;Price&gt; price =
        CompletableFuture.supplyAsync(() -&gt; pricingClient.get(id));

    CompletableFuture&lt;Inventory&gt; inventory =
        CompletableFuture.supplyAsync(() -&gt; inventoryClient.get(id));

    CompletableFuture&lt;Reviews&gt; reviews =
        CompletableFuture.supplyAsync(() -&gt; reviewsClient.get(id));

    return CompletableFuture.allOf(product, price, inventory, reviews)
        .thenApply(v -&gt; new ProductPage(
            product.join(),
            price.join(),
            inventory.join(),
            reviews.join()
        ))
        .join();
}
</code></pre>
<p>Latency improves significantly and the endpoint now behaves in parallel.</p>
<p>At this stage the service often passes load testing and looks production-ready.</p>
<hr />
<h3>Where It Starts Failing</h3>
<p>Assume:</p>
<ul>
<li><p>200 requests per second</p>
</li>
<li><p>each request calls 4 downstream services</p>
</li>
</ul>
<p>The service now initiates <strong>800 outbound requests per second</strong>.</p>
<p>If one dependency slows down — for example pricing increases from 80ms to 1.5s — those futures remain active and occupy resources much longer than expected.</p>
<p>What accumulates is not CPU work but <em>waiting work</em>:</p>
<ul>
<li><p>HTTP connections remain open</p>
</li>
<li><p>thread pools saturate</p>
</li>
<li><p>retries multiply</p>
</li>
<li><p>latency increases upstream</p>
</li>
</ul>
<p>The system is still functional, but its behavior changes under pressure. Response times become unstable and tail latency grows quickly.</p>
<p>The code is correct.<br />The concurrency model is not bounded.</p>
<hr />
<h2>Using Virtual Threads Safely</h2>
<p>Virtual threads make parallel I/O simple:</p>
<pre><code class="language-java">ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

try (executor) {
    Future&lt;Product&gt; product = executor.submit(() -&gt; productClient.get(id));
    Future&lt;Price&gt; price = executor.submit(() -&gt; pricingClient.get(id));
    Future&lt;Inventory&gt; inventory = executor.submit(() -&gt; inventoryClient.get(id));

    return new ProductPage(
        product.get(),
        price.get(),
        inventory.get()
    );
}
</code></pre>
<p>This code is easy to read and scales far better than platform threads. However, it introduces a new risk: every incoming request may create many concurrent outbound operations.</p>
<p>Virtual threads are cheap, but downstream capacity is not.</p>
<hr />
<h2>Bounded Concurrency (The Missing Control)</h2>
<p>Instead of allowing unlimited parallelism, the service should explicitly limit how many external operations it performs at once.</p>
<p>Concurrency is capped, and excess work is rejected quickly instead of accumulating.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69943b5f2f5eee9031f88a4a/df53e082-9c26-4e57-a148-7ea1e2147244.png" alt="" style="display:block;margin:0 auto" />

<p>The system sheds load instead of amplifying latency.</p>
<p>A simple and effective mechanism is a semaphore acting as a bulkhead.</p>
<pre><code class="language-java">public class DownstreamLimiter {

    private final Semaphore permits = new Semaphore(100);

    public &lt;T&gt; T call(Callable&lt;T&gt; task) throws Exception {
        if (!permits.tryAcquire(200, TimeUnit.MILLISECONDS)) {
            throw new RuntimeException("Downstream concurrency limit reached");
        }

        try {
            return task.call();
        } finally {
            permits.release();
        }
    }
}
</code></pre>
<p>Usage:</p>
<pre><code class="language-java">var limiter = new DownstreamLimiter();

Future&lt;Price&gt; price = executor.submit(
    () -&gt; limiter.call(() -&gt; pricingClient.get(id))
);
</code></pre>
<p>Now the service’s behavior depends on a defined capacity rather than incoming traffic spikes.</p>
<hr />
<h2>Deadlines Instead of Timeouts</h2>
<p>Timeouts are typically configured per call.<br />In practice, a request should have a total time budget.</p>
<p>Java 21’s structured concurrency (a preview API, enabled with <code>--enable-preview</code>) makes this straightforward:</p>
<pre><code class="language-java">try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {

    var product = scope.fork(() -&gt; productClient.get(id));
    var price = scope.fork(() -&gt; pricingClient.get(id));
    var inventory = scope.fork(() -&gt; inventoryClient.get(id));

    scope.joinUntil(Instant.now().plusMillis(300));
    scope.throwIfFailed();

    return new ProductPage(
        product.get(),
        price.get(),
        inventory.get()
    );
}
</code></pre>
<p>The deadline applies to the entire request, not individual calls.</p>
<p>When the deadline expires, <code>joinUntil</code> throws <code>TimeoutException</code>, and closing the scope interrupts any unfinished work.</p>
<hr />
<h2>Cancellation Propagation</h2>
<p>Without cancellation, a request can time out to the client while the service continues executing downstream calls. The system keeps consuming resources for a response nobody will read.</p>
<p>Structured concurrency automatically interrupts remaining tasks when the scope closes.<br />This reduces wasted work and prevents retry storms during partial failures.</p>
<p>For example, with Java 21 structured concurrency the request scope itself controls the lifecycle of downstream work:</p>
<pre><code class="language-java">public ProductPage getProductPage(String id) throws Exception {

    Instant deadline = Instant.now().plusMillis(300);

    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {

        var product = scope.fork(() -&gt; productClient.get(id));
        var price = scope.fork(() -&gt; pricingClient.get(id));
        var inventory = scope.fork(() -&gt; inventoryClient.get(id));

        try {
            // wait only until the request deadline
            scope.joinUntil(deadline);
        } catch (TimeoutException e) {
            // deadline reached → interrupt remaining subtasks and fail fast
            scope.shutdown();
            throw new TimeoutException("request deadline exceeded");
        }

        scope.throwIfFailed();

        return new ProductPage(product.get(), price.get(), inventory.get());
    }
}
</code></pre>
<p>When the deadline expires, unfinished downstream calls are interrupted and the service stops doing work for a response the client will no longer receive.</p>
<hr />
<h2>Operational Impact</h2>
<p>Three behaviors change immediately:</p>
<ol>
<li><p>Slow dependencies no longer saturate threads.</p>
</li>
<li><p>Retries decrease because requests fail quickly.</p>
</li>
<li><p>Latency distribution becomes tighter (p99 improves even if p50 does not).</p>
</li>
</ol>
<p>The service stops amplifying downstream instability.</p>
<p>In practice this becomes very visible in observability tooling. A typical Datadog APM view during an incident looks like this:</p>
<p><strong>Before bounded concurrency</strong></p>
<ul>
<li><p><code>api-service</code> p99 latency: 2.8s</p>
</li>
<li><p>error rate: low (system is technically healthy)</p>
</li>
<li><p>active requests: continuously growing</p>
</li>
<li><p>downstream <code>pricing-service</code> latency: elevated but stable</p>
</li>
</ul>
<p>In the APM flame graph, most of the request time appears as <em>waiting</em>, not CPU work.<br />The main span shows long gaps where the service is idle but holding resources.</p>
<p>Trace Analytics often shows:</p>
<ul>
<li><p>many concurrent traces stuck in <code>http.client</code></p>
</li>
<li><p>connection pool saturation</p>
</li>
<li><p>retries from upstream clients</p>
</li>
</ul>
<p>After introducing concurrency limits and deadlines:</p>
<p><strong>After bounded concurrency + deadline</strong></p>
<ul>
<li><p><code>api-service</code> p99 latency: 350–450ms</p>
</li>
<li><p>some requests fail fast (429/timeout)</p>
</li>
<li><p>active requests plateau instead of growing</p>
</li>
<li><p>downstream latency unchanged</p>
</li>
</ul>
<p>The important change is not that the dependency became faster.<br />The service stopped amplifying its slowness.</p>
<p>In Datadog’s service map, the edge between <code>api-service</code> and <code>pricing-service</code> changes from a thick, high-latency connection to a stable one with lower request volume. The number of concurrent traces drops sharply, and flame graphs become short and consistent rather than long with idle gaps.</p>
<p>The system did not gain capacity.<br />It regained control over how work is admitted.</p>
<hr />
<h2>Final Thoughts</h2>
<p>Virtual threads make concurrency easier, but they also make it easier to create unbounded work. Distributed systems reward services that keep strict control over resource usage.</p>
<p>Bounded fan-out, deadlines and cancellation form a small set of constraints that dramatically improve production behavior. Instead of reacting to incidents, the service actively limits the scope of failures.</p>
<p>The code remains straightforward and synchronous, but the operational characteristics become much closer to a well-designed asynchronous system.</p>
]]></content:encoded></item><item><title><![CDATA[The Operational Cost of LLM APIs]]></title><description><![CDATA[Large language model APIs feel deceptively simple from an engineering perspective. You send a prompt, you receive text. Compared to provisioning databases, tuning JVM memory or debugging distributed locks, the integration looks almost trivial. A singl...]]></description><link>https://leandromaia.dev/the-operational-cost-of-llm-apis</link><guid isPermaLink="true">https://leandromaia.dev/the-operational-cost-of-llm-apis</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 17 Feb 2026 13:06:55 GMT</pubDate><content:encoded><![CDATA[<p>Large language model APIs feel deceptively simple from an engineering perspective.<br />You send a prompt, you receive text. Compared to provisioning databases, tuning JVM memory or debugging distributed locks, the integration looks almost trivial. A single HTTP request and a JSON response. Most SDKs make it possible to have a working prototype in less than an afternoon.</p>
<p>Because of that simplicity, teams often evaluate LLM features as product work rather than infrastructure work. They estimate development time, UX complexity and maybe latency. What they rarely estimate correctly is operational behavior.</p>
<p>An LLM integration is not just a remote function call. It is a probabilistic, metered, latency-sensitive external compute dependency whose cost scales with <em>user behavior</em>, not with system capacity. That distinction matters much more than it initially appears.</p>
<hr />
<h2 id="heading-the-invisible-meter">The Invisible Meter</h2>
<p>Traditional backend infrastructure has a fairly intuitive scaling model.<br />If your system doubles in users, CPU usage and database load grow in a somewhat predictable way. Engineers already know how to reason about it: caching, queues, horizontal scaling and rate limits.</p>
<p>LLM APIs introduce a different scaling axis: token consumption.</p>
<p>The system is no longer paying per request, nor per server, nor per hour of uptime. It is paying for <em>every unit of generated reasoning</em>. A single user interaction can cost more than thousands of database reads. And the expensive part is not the request itself — it is the output length and how many times the user retries, iterates or explores.</p>
<p>Unlike most APIs, LLM usage encourages repetition. Users don’t submit one request. They refine.</p>
<p>They adjust a prompt, regenerate, ask for another version, ask for clarification, request expansion and then ask the model to rewrite the answer in a different tone. From a human perspective this feels like one interaction. From a billing perspective it can be fifteen.</p>
<p>This is the first operational shift: cost is tied to <em>conversation depth</em>, not traffic volume.</p>
<hr />
<h2 id="heading-a-small-saas-100-users">A Small SaaS: 100 Users</h2>
<p>Imagine a small productivity SaaS that adds an AI assistant to help users draft reports.<br />The team estimates that each user will generate about five reports per day, and each report requires one LLM call. They calculate cost assuming roughly 500 requests daily. The feature looks financially safe.</p>
<p>In reality, usage looks different.</p>
<p>A single report becomes a dialogue:</p>
<ul>
<li><p>“summarize this data”</p>
</li>
<li><p>“make it shorter”</p>
</li>
<li><p>“add a professional tone”</p>
</li>
<li><p>“rewrite for a technical audience”</p>
</li>
<li><p>“give me three alternative versions”</p>
</li>
</ul>
<p>The user did not use the system five times. They used it once — but the backend performed five LLM invocations. Many users will also retry when latency exceeds a few seconds because they assume the system stalled.</p>
<p>After deployment, the service stabilizes around 100 daily active users.<br />However, instead of 500 model calls per day, the system performs closer to 3,000.</p>
<p>Nothing is broken.<br />The feature is popular.</p>
<p>But the cost model is wrong by a factor of six.</p>
<p>The engineering system is healthy. The product is successful. Yet finance notices that the new feature is now the single largest operating expense of the platform. The infrastructure did not scale unexpectedly — <em>human curiosity did</em>.</p>
<p>At this stage the problem is manageable, but a second operational effect appears: engineers start shaping user behavior. They introduce response limits, shorten outputs and add cooldowns. This is unusual; backend engineers rarely need to think about how wording affects infrastructure cost. With LLMs, prompt design and product UX directly influence operating margins.</p>
<hr />
<h2 id="heading-the-spike-scenario-10000-users">The Spike Scenario: 10,000 Users</h2>
<p>Now consider a different situation.</p>
<p>The same SaaS releases a new “AI project planner” feature.<br />A well-known influencer shares it, and over two days the product receives 10,000 new users. This kind of spike is familiar in SaaS. Usually the concern is database capacity, queue backlog or CPU saturation. Auto-scaling groups exist precisely for this scenario.</p>
<p>But LLM APIs do not scale with your servers.</p>
<p>Your system may handle the HTTP traffic perfectly while your costs grow faster than your infrastructure ever could.</p>
<p>Let’s assume each new user performs ten exploratory interactions during onboarding — which is realistic because new users experiment more than established ones. If each interaction consumes a moderately sized prompt and response, the system may suddenly generate hundreds of thousands of tokens per hour.</p>
<p>Nothing crashes.<br />There is no 500 error.<br />Your monitoring dashboards remain green.</p>
<p>However, the billing dashboard tells a different story. In less than 48 hours, the LLM usage cost exceeds the previous month’s total infrastructure spend.</p>
<p>This is a uniquely uncomfortable operational situation. Traditional incidents degrade service; this incident degrades the company’s financial predictability. Engineers cannot fix it by scaling servers or restarting workers. The system is behaving correctly.</p>
<p>The system is simply <em>too successful too quickly</em>.</p>
<hr />
<h2 id="heading-latency-is-also-operational-cost">Latency Is Also Operational Cost</h2>
<p>There is another operational dimension beyond billing.</p>
<p>LLM APIs have variable latency. A database query might fluctuate between 5 and 20 milliseconds. An LLM response might vary between 2 seconds and 25 seconds depending on load and output length.</p>
<p>Users react strongly to waiting. When a response takes longer than expected, they retry, refresh or open multiple tabs. Each retry is a new model invocation. Latency therefore multiplies cost.</p>
<p>In distributed systems we often worry about retry storms against downstream services. LLM integrations can produce a similar pattern, except the downstream system is a metered compute provider. A single slow period can double both request volume and billing simultaneously.</p>
<p>This creates a feedback loop: slower responses cause retries, retries cause higher usage, higher usage increases latency and the cycle repeats.</p>
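<p>A standard way to dampen this loop on the client side is exponential backoff with jitter. A sketch, where the base delay and cap are assumed values rather than provider recommendations:</p>
<pre><code class="language-java">import java.util.concurrent.ThreadLocalRandom;

// Sketch: exponential backoff with full jitter. Randomizing the delay spreads
// retries over time so a slow period is not met by synchronized retry waves.
public class Backoff {

    static final long BASE_MS = 500;    // assumed base delay
    static final long CAP_MS = 30_000;  // assumed maximum delay

    /** Delay before a given retry attempt (attempt 0 is the first retry). */
    public static long delayMillis(int attempt) {
        double exp = Math.min((double) CAP_MS, BASE_MS * Math.pow(2, attempt));
        return ThreadLocalRandom.current().nextLong((long) exp + 1); // full jitter
    }
}
</code></pre>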
<hr />
<h2 id="heading-why-caching-is-not-straightforward">Why Caching Is Not Straightforward</h2>
<p>The natural engineering instinct is caching.<br />If the same question is asked, store the answer.</p>
<p>The difficulty is that LLM requests are rarely identical. Small wording changes produce different prompts and therefore different cache keys. Even when two prompts are semantically equivalent, they are textually different. Traditional caching strategies depend on deterministic inputs; conversational systems are inherently non-deterministic.</p>
<p>You can cache aggressively for structured tasks (classification, tagging, summarization templates), but creative or exploratory usage — precisely the usage users value — resists caching.</p>
<p>This is why LLM integrations behave more like human labor than like computation. Each request is unique work.</p>
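<p>For those structured tasks, caching on a normalized prompt can still pay off. A sketch; the normalization rules here are illustrative and do not solve semantic equivalence:</p>
<pre><code class="language-java">import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: trivially different prompts (case, whitespace) collapse into one
// cache key. This helps templated tasks; free-form prompts will still miss.
public class PromptCache {

    private final Map&lt;String, String&gt; cache = new ConcurrentHashMap&lt;&gt;();

    static String normalize(String prompt) {
        return prompt.trim().toLowerCase().replaceAll("\\s+", " ");
    }

    public String complete(String prompt, Function&lt;String, String&gt; model) {
        // the model is only invoked on a cache miss
        return cache.computeIfAbsent(normalize(prompt), model);
    }
}
</code></pre>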
<hr />
<h2 id="heading-operational-mitigations">Operational Mitigations</h2>
<p>Over time, teams operating LLM features converge on similar patterns:</p>
<ul>
<li><p>explicit rate limiting per user</p>
</li>
<li><p>bounded output size</p>
</li>
<li><p>asynchronous processing for long tasks</p>
</li>
<li><p>progressive responses instead of regeneration</p>
</li>
<li><p>usage quotas tied to subscription plans</p>
</li>
<li><p>internal token accounting, not just request counting</p>
</li>
</ul>
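<p>As a rough sketch of the last two items, per-user token accounting can start this small. The budget value and the <code>try_spend</code> API here are invented for illustration; real systems back this with a durable store:</p>
<pre><code class="language-python">from collections import defaultdict

DAILY_BUDGET = 50_000  # assumed tokens per user per day

class TokenQuota:
    def __init__(self, budget=DAILY_BUDGET):
        self.budget = budget
        self.used = defaultdict(int)  # user_id -&gt; tokens spent today

    def try_spend(self, user_id, tokens):
        """Reserve tokens for a request; refuse once the budget is gone."""
        remaining = self.budget - self.used[user_id]
        granted = min(tokens, max(0, remaining))
        allowed = (granted == tokens)   # the full request fits in budget
        if allowed:
            self.used[user_id] += tokens
        return allowed

quota = TokenQuota(budget=100)
print(quota.try_spend("u1", 60))  # True
print(quota.try_spend("u1", 60))  # False: only 40 tokens left
</code></pre>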
<p>The most important shift is cultural. Engineers begin tracking not only latency and error rate but also <em>token burn rate</em>. Observability expands from technical health to economic health.</p>
<p>In practice, a production dashboard for an AI feature often includes: request latency, error rate, queue backlog and daily cost per active user. All four become operational signals.</p>
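<p>A minimal sketch of that last signal, assuming hypothetical per-token prices and a made-up request log — the field names are illustrative, not a real schema:</p>
<pre><code class="language-python">from collections import defaultdict

requests = [
    {"user": "u1", "tokens_in": 900,  "tokens_out": 400},
    {"user": "u1", "tokens_in": 1200, "tokens_out": 2500},
    {"user": "u2", "tokens_in": 300,  "tokens_out": 150},
]
PRICE_IN = 3.0 / 1_000_000    # assumed $ per input token
PRICE_OUT = 15.0 / 1_000_000  # assumed $ per output token

spend = defaultdict(float)
for r in requests:
    spend[r["user"]] += r["tokens_in"] * PRICE_IN + r["tokens_out"] * PRICE_OUT

daily_cost_per_active_user = sum(spend.values()) / len(spend)
print(f"{daily_cost_per_active_user:.6f}")
</code></pre>
<p>Note that output tokens typically cost several times more than input tokens, which is why bounded output size appears in the mitigation list above.</p>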
<hr />
<h2 id="heading-human-behavior-becomes-infrastructure">Human Behavior Becomes Infrastructure</h2>
<p>The core lesson is that LLM APIs move part of system reliability into user psychology.</p>
<p>Traditional backend engineering assumes users submit discrete actions. LLM interfaces encourage exploration. People converse, iterate and experiment. The system is no longer processing transactions; it is hosting thinking sessions.</p>
<p>Because of that, operational planning must account for behavior patterns rather than just concurrency. The number of users matters less than <em>how they engage</em>.</p>
<p>A hundred power users can cost more than ten thousand passive ones.<br />A successful onboarding flow can be more expensive than steady-state usage.<br />A viral moment can become a financial incident before it becomes a technical one.</p>
<hr />
<h2 id="heading-final-thought">Final Thought</h2>
<p>Integrating an LLM API is easy. Operating it responsibly is not.</p>
<p>The challenge is not calling the model. The challenge is predicting and shaping the interaction between human curiosity and metered computation. Traditional systems fail when servers overload. LLM-enabled systems can remain perfectly stable while the operational cost becomes the primary reliability risk.</p>
<p>When adopting these systems, engineers are not only managing software behavior anymore.<br />They are managing an economic feedback loop attached to human conversation.</p>
]]></content:encoded></item><item><title><![CDATA[Why AI Features Are Becoming Reliability Problems]]></title><description><![CDATA[Over the last year, many products added AI features.
Chat assistants, automatic summaries, classification, recommendations, drafting emails, generating documentation, suggesting actions. In many cases]]></description><link>https://leandromaia.dev/why-ai-features-are-becoming-reliability-problems</link><guid isPermaLink="true">https://leandromaia.dev/why-ai-features-are-becoming-reliability-problems</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 10 Feb 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Over the last year, many products added AI features.</p>
<p>Chat assistants, automatic summaries, classification, recommendations, drafting emails, generating documentation, suggesting actions. In many cases these features were relatively easy to ship. An API call, some prompt engineering, a bit of UI, and something useful appears on the screen.</p>
<p>From an implementation perspective, they often look simpler than traditional backend features.</p>
<p>From an operational perspective, they are not.</p>
<p>What many teams are discovering is that AI features rarely fail like software systems used to fail.</p>
<p>They create a new category of reliability problem.</p>
<hr />
<h2>Traditional failures are visible</h2>
<p>Historically, backend reliability issues were easy to detect.</p>
<p>A service crashed.<br />A database timed out.<br />An endpoint returned 500.<br />Latency spiked.</p>
<p>Monitoring worked because failures were explicit. Systems either produced a correct response or an error. Alerts fired, dashboards changed, and on-call engineers investigated.</p>
<p>The system clearly signaled: something is wrong.</p>
<p>AI features do not behave this way.</p>
<p>They usually return a response.</p>
<hr />
<h2>The problem is plausible wrongness</h2>
<p>An LLM rarely returns a null pointer, a stack trace, or a malformed response. Instead, it produces something coherent and confident that looks reasonable to both monitoring systems and users.</p>
<p>The output is syntactically valid.<br />The API returned 200.<br />Latency is normal.</p>
<p>But the behavior is unacceptable.</p>
<p>A summary omits critical information.<br />A classification routes a ticket to the wrong team.<br />A generated email misrepresents the situation.<br />A suggested action creates operational confusion.</p>
<p>Nothing technically failed, yet the system did the wrong thing.</p>
<p>This is a reliability issue, but it does not look like one.</p>
<hr />
<h2>Monitoring no longer detects the incident</h2>
<p>Traditional observability assumes failures are binary: success or error.</p>
<p>AI features introduce a third state: successful but incorrect.</p>
<p>HTTP metrics remain healthy.<br />Error rates remain low.<br />Infrastructure dashboards look normal.</p>
<p>The first signal of a problem often comes from support tickets, confused users, or business teams noticing unexpected behavior. In other words, the monitoring system becomes human.</p>
<p>This is a major shift. Reliability engineering historically depended on detecting technical anomalies. With AI features, the anomaly is semantic.</p>
<p>The system worked exactly as implemented, but not as intended.</p>
<hr />
<h2>Testing becomes weaker</h2>
<p>Testing AI features is also different from testing deterministic systems.</p>
<p>Traditional features can be validated with assertions. Given an input, the expected output is known. Automated tests verify correctness with high confidence.</p>
<p>AI systems produce distributions, not exact outputs. The same prompt can yield slightly different results. The challenge is not whether the system returns a response, but whether the response is acceptable.</p>
<p>A test suite can confirm the feature runs.<br />It cannot easily confirm the feature behaves appropriately across real usage.</p>
<p>This weakens one of the strongest reliability tools teams rely on before deployment.</p>
<hr />
<h2>Rollbacks stop working</h2>
<p>When a normal feature causes problems, rollback is a reliable safety mechanism. Reverting the change restores the previous behavior.</p>
<p>AI failures often do not map cleanly to deploys.</p>
<p>The model may be external.<br />The data distribution changed.<br />The prompt interacts differently with real user inputs.<br />Caching stores incorrect outputs.<br />Fine-tuned behavior evolves.</p>
<p>The incident does not necessarily correspond to a specific code release. Teams may see a degradation in behavior without any deployment occurring. From an operational perspective, this is deeply unfamiliar territory.</p>
<p>You cannot always roll back to a previous commit if the system’s behavior depends on probabilistic outputs and live data.</p>
<hr />
<h2>Support becomes part of the reliability pipeline</h2>
<p>In many organizations, customer support has quietly become the primary detection system for AI issues.</p>
<p>Users report confusing results.<br />Operators notice inconsistent classifications.<br />Internal teams start distrusting recommendations.</p>
<p>The reliability loop changes:</p>
<p>previously → monitoring detects → engineering investigates<br />now → users notice → support escalates → engineering investigates</p>
<p>The incident is real, but it emerges socially rather than technically.</p>
<hr />
<h2>Why this matters for system design</h2>
<p>AI features are often introduced as product enhancements, but they behave operationally like external dependencies with unpredictable behavior. They should be treated less like deterministic code and more like a probabilistic subsystem.</p>
<p>This affects architectural decisions.</p>
<p>You need:</p>
<ul>
<li><p>fallback behaviors</p>
</li>
<li><p>human override paths</p>
</li>
<li><p>auditability</p>
</li>
<li><p>clear UI communication</p>
</li>
<li><p>bounded authority</p>
</li>
</ul>
<p>In other words, you design not only for failure, but for uncertainty.</p>
<p>The question is no longer “what happens if the service is down?” but “what happens if the service is confidently wrong?”</p>
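<p>One way to sketch "bounded authority" in code: the model proposes, deterministic rules decide whether it may act alone. Here <code>classify()</code> is a hypothetical stand-in for a model call, and the label set and confidence threshold are invented:</p>
<pre><code class="language-python"># Sketch only: classify() and its output shape are placeholders.
ALLOWED_LABELS = {"billing", "technical", "account"}

def classify(ticket_text):
    # stand-in for a model call; it always answers confidently
    return {"label": "billing", "confidence": 0.62}

def route_ticket(ticket_text):
    result = classify(ticket_text)
    label, conf = result["label"], result["confidence"]
    # Guardrails: unknown labels or low confidence fall back to a
    # human, and the decision is logged for audit either way.
    confident = (min(conf, 0.80) == 0.80)  # confidence at or above 0.80
    if label in ALLOWED_LABELS and confident:
        decision = ("auto_route", label)
    else:
        decision = ("human_review", label)
    print(f"audit: label={label} conf={conf} decision={decision[0]}")
    return decision

print(route_ticket("I was charged twice"))
</code></pre>
<p>The model still did its job; the system simply refuses to grant it unilateral authority over a consequential action.</p>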
<hr />
<h2>A different reliability mindset</h2>
<p>For years, backend reliability engineering focused on availability and latency. Systems were considered healthy when they responded quickly and without errors.</p>
<p>AI systems expand the definition of reliability. A system can be fully available and still operationally harmful.</p>
<p>Correctness now includes behavioral trust.</p>
<p>Teams that succeed with AI features are not the ones that integrate models fastest, but the ones that design guardrails around them. The work shifts from implementing functionality to managing risk.</p>
<p>The engineering challenge is no longer only keeping systems online.</p>
<p>It is keeping them trustworthy.</p>
]]></content:encoded></item><item><title><![CDATA[What Technical Interviews in Distributed Systems Actually Test]]></title><description><![CDATA[Modern backend engineering increasingly revolves around distributed systems.As a consequence, many technical interviews — even for senior and leadership roles — are designed around deceptively simple ]]></description><link>https://leandromaia.dev/what-technical-interviews-in-distributed-systems-actually-test</link><guid isPermaLink="true">https://leandromaia.dev/what-technical-interviews-in-distributed-systems-actually-test</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 03 Feb 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Modern backend engineering increasingly revolves around distributed systems.<br />As a consequence, many technical interviews — even for senior and leadership roles — are designed around deceptively simple scenarios: a text editor, a counter, a cart, a document, a status update.</p>
<p>Then the interviewer asks:</p>
<blockquote>
<p>“Why did the system end up with the wrong value?”</p>
</blockquote>
<p>Very often, the correct answer is not about architecture diagrams, microservices, or cloud providers.</p>
<p>It is about <strong>concurrency</strong>.</p>
<p>Below are some of the core concepts these interviews tend to probe, and why they matter in real systems.</p>
<hr />
<h2>1. Race Conditions: The Default State of Distributed Systems</h2>
<p>A race condition occurs when multiple operations access and modify shared state concurrently, and the final result depends on the timing of execution rather than the logical order of events.</p>
<p>Consider a simple pattern:</p>
<pre><code class="language-plaintext">read current value
apply change
write new value
</code></pre>
<p>If two requests execute simultaneously across two backend instances, both may read the same previous value and overwrite each other.</p>
<p>This is known as the <strong>lost update problem</strong>.</p>
<p>The system did not crash.<br />No exception occurred.<br />Every operation “succeeded”.</p>
<p>Yet the state is incorrect.</p>
<p>This is one of the most common real production bugs in multi-instance services.</p>
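<p>The lost update is easy to reproduce even on one machine. A small Python sketch, with two threads standing in for two backend instances:</p>
<pre><code class="language-python">import threading

# Minimal reproduction of the lost update problem: two writers read
# the same value, both write, and increments can silently vanish.
counter = {"value": 0}

def unsafe_increment(n):
    for _ in range(n):
        current = counter["value"]       # read
        counter["value"] = current + 1   # write (no lock between the two)

threads = [threading.Thread(target=unsafe_increment, args=(100_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 200000; interleaved read-modify-write may produce less.
print(counter["value"])
</code></pre>
<p>Whether updates are actually lost depends on scheduling, which is exactly the point: the bug is timing-dependent and invisible in tests that run serially.</p>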
<hr />
<h2>2. The Illusion of Ordering</h2>
<p>Engineers often intuitively assume that requests arrive and are processed in order.</p>
<p>In practice:</p>
<ul>
<li><p>clients retry</p>
</li>
<li><p>networks reorder packets</p>
</li>
<li><p>load balancers distribute requests</p>
</li>
<li><p>UDP is unordered</p>
</li>
<li><p>mobile devices reconnect</p>
</li>
<li><p>clocks differ</p>
</li>
</ul>
<p>The system does not process “events”.<br />It processes <strong>arrivals</strong>.</p>
<p>These are not the same thing.</p>
<p>A later user action can be processed before an earlier one.<br />Without safeguards, the system may persist an older state after a newer one.</p>
<hr />
<h2>3. Why “Read Then Write” Is Dangerous</h2>
<p>Many naive implementations rely on:</p>
<pre><code class="language-plaintext">SELECT state
compute new state
UPDATE state
</code></pre>
<p>In a single-threaded program this is safe.</p>
<p>In distributed systems, this is a critical section — but there is no lock.</p>
<p>Two processes can execute this sequence simultaneously and overwrite each other.<br />This is not a performance issue. It is a <strong>correctness issue</strong>.</p>
<p>Scaling stateless services horizontally amplifies this risk because concurrency increases with capacity.</p>
<hr />
<h2>4. Typical Solutions</h2>
<p>There is no single universal fix. Instead, systems use different consistency strategies.</p>
<h3>4.1 Optimistic Concurrency Control (Versioning)</h3>
<p>Each record carries a version:</p>
<pre><code class="language-plaintext">UPDATE document
SET content = ?, version = version + 1
WHERE id = ? AND version = ?
</code></pre>
<p>Only one writer succeeds. Others must retry.</p>
<p>This is effectively a <strong>compare-and-swap (CAS)</strong>.</p>
<p>Widely used in:</p>
<ul>
<li><p>relational databases</p>
</li>
<li><p>DynamoDB conditional writes</p>
</li>
<li><p>document stores</p>
</li>
</ul>
<p>It prevents lost updates without heavy locking.</p>
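<p>The same pattern in miniature, with a dict standing in for the row and a lock standing in for the database's per-row atomicity:</p>
<pre><code class="language-python">import threading

lock = threading.Lock()  # stands in for the DB applying condition+write atomically
doc = {"content": "", "version": 0}

def cas_write(expected_version, new_content):
    """Succeeds only if nobody else wrote since we read."""
    with lock:
        if doc["version"] == expected_version:
            doc["content"] = new_content
            doc["version"] += 1
            return True
        return False

def save_with_retry(mutate, attempts=5):
    for _ in range(attempts):
        snapshot = dict(doc)                        # read
        new_content = mutate(snapshot["content"])   # compute
        if cas_write(snapshot["version"], new_content):
            return True
    return False                                    # give up after N conflicts

save_with_retry(lambda c: c + "A")
save_with_retry(lambda c: c + "B")
print(doc)  # {'content': 'AB', 'version': 2}
</code></pre>
<p>A concurrent writer holding a stale version simply fails the condition and retries against the fresh state, so no update is silently lost.</p>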
<hr />
<h3>4.2 Idempotency</h3>
<p>Requests should be safe to repeat.</p>
<p>If the same operation arrives twice (retries, network duplication), the system should not produce a different result.</p>
<p>This is essential in:</p>
<ul>
<li><p>payment systems</p>
</li>
<li><p>event consumers</p>
</li>
<li><p>APIs behind unreliable networks</p>
</li>
</ul>
<p>Idempotency keys or operation identifiers allow systems to detect duplicates.</p>
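<p>A minimal sketch of the pattern, using an in-memory dict where a real system would use a durable store:</p>
<pre><code class="language-python"># Sketch of idempotency-key handling for a payment-style endpoint.
processed = {}  # idempotency_key -&gt; stored response

def charge(idempotency_key, amount):
    if idempotency_key in processed:
        # Duplicate delivery (client retry, network replay):
        # return the original result instead of charging again.
        return processed[idempotency_key]
    result = {"status": "charged", "amount": amount}  # side effect runs once
    processed[idempotency_key] = result
    return result

first = charge("key-123", 50)
second = charge("key-123", 50)   # retry of the same operation
print(first is second)           # True: same stored outcome, no double charge
</code></pre>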
<hr />
<h3>4.3 Event Ordering</h3>
<p>Sometimes the state should reflect the <em>latest logical event</em>, not the last write.</p>
<p>Solutions include:</p>
<ul>
<li><p>timestamps (careful: clocks drift)</p>
</li>
<li><p>logical clocks</p>
</li>
<li><p>sequence numbers per entity</p>
</li>
<li><p>monotonic versioning</p>
</li>
</ul>
<p>The key insight:<br /><strong>Last write ≠ most recent action.</strong></p>
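<p>A sketch of per-entity sequence numbers (all names invented): a delayed older arrival is detected and ignored instead of clobbering newer state:</p>
<pre><code class="language-python">state = {}  # entity_id -&gt; {"seq": int, "value": ...}

def apply_event(entity_id, seq, value):
    current = state.get(entity_id, {"seq": -1})
    newest = max(current["seq"], seq)
    # apply only if this event's sequence number is strictly newer
    if newest == seq and seq != current["seq"]:
        state[entity_id] = {"seq": seq, "value": value}
        return "applied"
    return "ignored (stale arrival)"

print(apply_event("doc-1", 1, "draft"))   # applied
print(apply_event("doc-1", 3, "final"))   # applied
print(apply_event("doc-1", 2, "edited"))  # ignored: arrived late
print(state["doc-1"]["value"])            # final
</code></pre>
<p>The sequence number must be assigned by something with authority over the entity (a database counter, a single writer), not by client clocks.</p>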
<hr />
<h3>4.4 Serialization via Queues</h3>
<p>Instead of multiple concurrent writers:</p>
<pre><code class="language-plaintext">clients → queue → single consumer → database
</code></pre>
<p>Queues provide ordering and eliminate write races at the cost of latency and throughput constraints.</p>
<p>Common in:</p>
<ul>
<li><p>collaborative editing</p>
</li>
<li><p>inventory systems</p>
</li>
<li><p>financial ledgers</p>
</li>
</ul>
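<p>The shape of the pattern in a few lines, using Python's standard-library queue as a stand-in for a real broker:</p>
<pre><code class="language-python">import queue
import threading

# clients → queue → single consumer → database
writes = queue.Queue()
db = {"stock": 10}

def consumer():
    while True:
        op = writes.get()
        if op is None:        # shutdown sentinel
            break
        db["stock"] += op     # only this thread ever touches db
        writes.task_done()

worker = threading.Thread(target=consumer)
worker.start()
for delta in (-1, -1, -3):    # concurrent clients just enqueue
    writes.put(delta)
writes.put(None)
worker.join()
print(db["stock"])  # 5
</code></pre>
<p>No write races are possible because writes are serialized; the trade is that the single consumer becomes the throughput ceiling.</p>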
<hr />
<h2>5. Why Timestamps Alone Are Not Enough</h2>
<p>A common first instinct is to store timestamps and keep the “latest”.</p>
<p>This works only if:</p>
<ul>
<li><p>clocks are synchronized</p>
</li>
<li><p>events are monotonic</p>
</li>
<li><p>no client is offline</p>
</li>
<li><p>no retries occur</p>
</li>
</ul>
<p>In real systems:</p>
<ul>
<li><p>client clocks lie</p>
</li>
<li><p>mobile reconnects happen</p>
</li>
<li><p>messages are delayed</p>
</li>
</ul>
<p>Relying solely on timestamps often replaces a race condition with a <strong>time consistency bug</strong>.</p>
<hr />
<h2>6. Databases Do Not Automatically Save You</h2>
<p>Developers often assume the database guarantees correctness.</p>
<p>Databases guarantee <strong>atomicity per operation</strong>, not per workflow.</p>
<p>This is atomic:</p>
<pre><code class="language-plaintext">UPDATE row SET value = 5
</code></pre>
<p>This is not:</p>
<pre><code class="language-plaintext">READ row
MODIFY
WRITE row
</code></pre>
<p>Without isolation (locks or conditional updates), the database cannot detect a logical conflict.</p>
<p>The bug lives above the database layer.</p>
<hr />
<h2>7. What These Questions Really Evaluate</h2>
<p>Concurrency questions in interviews are not about memorizing definitions.</p>
<p>They evaluate whether you understand:</p>
<ul>
<li><p>the difference between scaling and correctness</p>
</li>
<li><p>state vs events</p>
</li>
<li><p>arrival vs ordering</p>
</li>
<li><p>retries vs duplicates</p>
</li>
<li><p>atomic operations vs atomic workflows</p>
</li>
</ul>
<p>In other words:</p>
<blockquote>
<p>Do you design systems assuming the network is unreliable and multiple things happen at once?</p>
</blockquote>
<p>Because in production, they always do.</p>
<hr />
<h2>8. A Useful Mental Model</h2>
<p>Single-machine programming assumes:</p>
<blockquote>
<p>“Things happen one after another.”</p>
</blockquote>
<p>Distributed systems require assuming:</p>
<blockquote>
<p>“Everything happens at the same time, out of order, and at least once.”</p>
</blockquote>
<p>Once you adopt this model, many design decisions change:</p>
<ul>
<li><p>APIs</p>
</li>
<li><p>database writes</p>
</li>
<li><p>caching</p>
</li>
<li><p>retries</p>
</li>
<li><p>message processing</p>
</li>
</ul>
<p>Concurrency is not an edge case.<br />It is the baseline.</p>
<hr />
<h2>Closing Thoughts</h2>
<p>Many system design discussions focus on scale, cloud architecture, and service boundaries.</p>
<p>But some of the most critical failures in real systems come from simpler issues:<br />two valid operations interacting in an invalid way.</p>
<p>Before worrying about microservices, queues, or multi-region deployments, systems must answer a more fundamental question:</p>
<p><strong>What happens when two users change the same thing at the same time?</strong></p>
<p>The answer to that question often defines whether a system is merely scalable — or actually correct.</p>
]]></content:encoded></item><item><title><![CDATA[AI Made Writing Code Easier — Software Development Didn’t Get Easier]]></title><description><![CDATA[Over the last year, most conversations about AI in software engineering have revolved around speed. People ask whether engineers are now two, five, or ten times faster. The assumption behind that ques]]></description><link>https://leandromaia.dev/ai-made-writing-code-easier-software-development-didnt-get-easier</link><guid isPermaLink="true">https://leandromaia.dev/ai-made-writing-code-easier-software-development-didnt-get-easier</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 27 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Over the last year, most conversations about AI in software engineering have revolved around speed. People ask whether engineers are now two, five, or ten times faster. The assumption behind that question is that writing code was the main thing slowing software delivery down.</p>
<p>In practice, that rarely matched my experience. Even before AI, most teams I worked with were not waiting on someone to type faster. They were waiting on decisions, coordination, and risk assessment. The work around the code mattered more than the code itself.</p>
<p>AI didn’t remove the bottleneck. It moved it.</p>
<hr />
<h2>We were rarely waiting on code</h2>
<p>In a small team, code is a large part of the job. A few engineers understand the whole system and progress mostly depends on someone sitting down and implementing the feature. But once systems and organizations grow, delivery depends on a chain of human processes. Someone needs to decide the feature is worth doing, multiple teams need to agree on behavior, operational risk has to be evaluated, and someone has to support the system after customers start relying on it.</p>
<p>Because implementation was expensive, it acted as a natural filter. If a feature required weeks of work, someone had to justify it. Discussions were more careful, and priorities were clearer. That friction was often frustrating, but it forced clarity.</p>
<p>AI changed that dynamic. Implementation became cheap enough that the filter weakened. The number of things that <em>could</em> be built suddenly increased, but the organization’s ability to evaluate them did not.</p>
<hr />
<h2>Cheap implementation increases volume</h2>
<p>Lowering the cost of implementation does not automatically produce better outcomes. It produces more outcomes. Teams can prototype faster, experiment more, and try more variations of ideas. On paper this sounds like pure productivity, but most organizations are not structured to process a large volume of change.</p>
<p>Before, weak ideas often died early because they were costly. Now they survive longer because they are easy to implement. The result is not necessarily better software — it is more software entering the system. The constraint shifts from “can we build this?” to “should this exist at all?”</p>
<p>This is where much of the real work now lives.</p>
<hr />
<h2>The new work: integration</h2>
<p>What I see in practice is not dramatically faster systems, but a higher number of partially correct ones. AI-generated code frequently looks reasonable and works locally. The problems appear when the code interacts with the rest of the system.</p>
<p>Real systems have expectations that are rarely explicit: data contracts, operational behavior, retry logic, monitoring, and ownership boundaries. Software rarely fails because a function was hard to write. It fails because multiple correct components interact in an incorrect way.</p>
<p>AI handles the first 80% of implementation easily. The remaining 20% — understanding how a change behaves in production — remains difficult. Engineers spend less time creating code from scratch and more time validating, adapting, and stabilizing generated work so that it behaves predictably in a larger system.</p>
<hr />
<h2>The review bottleneck</h2>
<p>One immediate consequence is that review capacity becomes a constraint. The number of changes increases faster than the organization’s ability to understand them. Teams did not suddenly gain more reviewers, deeper system knowledge, or better operational visibility. They simply gained more code.</p>
<p>As a result, engineers are often less limited by writing code than by reading it. Code review used to involve carefully reasoning about a focused change. Now it frequently involves evaluating large generated modifications whose correctness depends on context not visible in the diff.</p>
<p>Speed increased on the production side, but not on the understanding side. And most software failures originate from misunderstanding, not syntax errors.</p>
<hr />
<h2>AI amplifies existing problems</h2>
<p>AI does not introduce entirely new dysfunctions. It amplifies what already exists. If prioritization is weak, more low-value work appears. If ownership is unclear, integration failures multiply. If coordination is slow, conflicts increase.</p>
<p>The surrounding organization — support teams, product processes, operations, training — still operates at human speed. Even if code generation accelerates dramatically, delivery remains constrained by alignment and understanding. The bottleneck simply relocates.</p>
<hr />
<h2>Why experience matters more</h2>
<p>AI reduces the effort required to produce code. It does not reduce the effort required to reason about consequences. That changes which skills matter most.</p>
<p>The valuable skill shifts away from implementation speed and toward judgment: recognizing coupling, anticipating operational impact, and deciding when a feature should not yet exist. Maintaining system clarity becomes more important than producing additional code.</p>
<p>Software systems rarely collapse because code was difficult to write. They collapse because their behavior became too complex to reason about. AI increases code abundance, but understanding remains scarce.</p>
<hr />
<h2>What actually changed</h2>
<p>AI is genuinely useful. It helps with exploration, scaffolding, and repetitive tasks. I use it regularly and it meaningfully improves parts of the workflow. But its main impact is not replacing engineers or eliminating effort. It changes the type of effort required.</p>
<p>There is less typing and more evaluation, less syntax work and more system reasoning. The teams that benefit most will not be those that generate the most code, but those that maintain a clear understanding of how their systems behave.</p>
<p>Software rarely fails because nobody could implement the solution. It fails because, over time, nobody fully understood the system they had built.</p>
]]></content:encoded></item><item><title><![CDATA[Observability Is Not Dashboards — How I Actually Use Datadog in Production]]></title><description><![CDATA[Many teams say they “have observability”.
Usually that means:

CPU graphs

memory usage

request count

maybe error rate


Those are useful, but they are not observability.
They are infrastructure vis]]></description><link>https://leandromaia.dev/observability-is-not-dashboards-how-i-actually-use-datadog-in-production</link><guid isPermaLink="true">https://leandromaia.dev/observability-is-not-dashboards-how-i-actually-use-datadog-in-production</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 20 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Many teams say they “have observability”.</p>
<p>Usually that means:</p>
<ul>
<li><p>CPU graphs</p>
</li>
<li><p>memory usage</p>
</li>
<li><p>request count</p>
</li>
<li><p>maybe error rate</p>
</li>
</ul>
<p>Those are useful, but they are not observability.</p>
<p>They are <strong>infrastructure visibility</strong>.</p>
<p>Observability only starts when you can answer a different kind of question:</p>
<blockquote>
<p>“Why is this specific user request slow right now?”</p>
</blockquote>
<p>Not average latency.<br />Not system health.</p>
<p>A concrete request, in a real moment, under real load.</p>
<p>This is how I use Datadog in practice: not as a monitoring wall, but as a debugging environment for live systems.</p>
<hr />
<h2>The Shift That Changed Everything</h2>
<p>Earlier in my career, monitoring meant:</p>
<ol>
<li><p>Alert fires</p>
</li>
<li><p>Look at dashboards</p>
</li>
<li><p>Guess which service is responsible</p>
</li>
<li><p>SSH into machines</p>
</li>
<li><p>Grep logs</p>
</li>
</ol>
<p>This works for small systems.</p>
<p>It collapses in distributed architectures.</p>
<p>In microservices, a single request might pass through:</p>
<ul>
<li><p>API gateway</p>
</li>
<li><p>authentication</p>
</li>
<li><p>core service</p>
</li>
<li><p>cache</p>
</li>
<li><p>database</p>
</li>
<li><p>message broker</p>
</li>
<li><p>async worker</p>
</li>
<li><p>third-party API</p>
</li>
</ul>
<p>Dashboards don’t reconstruct that story.</p>
<p>Traces do.</p>
<p>The biggest shift was moving from <strong>service health</strong> to <strong>request behavior</strong>.</p>
<hr />
<h2>The Three Pillars (How They Actually Connect)</h2>
<p>People often mention logs, metrics and traces as separate tools.</p>
<p>They are not.</p>
<p>They are three zoom levels of the same incident.</p>
<h3>Metrics — “Something is wrong”</h3>
<h3>Traces — “Where it is wrong”</h3>
<h3>Logs — “What exactly happened”</h3>
<p>The value appears when you can jump between them in seconds.</p>
<p>That is the core reason I rely heavily on Datadog’s APM rather than only metrics.</p>
<hr />
<h2>Starting Point: The Latency Graph</h2>
<p>Most real incidents I’ve seen start the same way:</p>
<p>Latency increases slightly.</p>
<p>Not a spike.<br />Not an outage.</p>
<p>Just p95 drifting from 180ms → 400ms → 900ms over 20 minutes.</p>
<p>Error rate is still normal.</p>
<p>This is the most dangerous state in production: <strong>a system degrading without failing</strong>.</p>
<p>In Datadog I rarely start with CPU or memory.</p>
<p>I start with:</p>
<p><strong>Service → APM → Endpoint latency → p95/p99</strong></p>
<p>Because users don’t experience servers.<br />They experience requests.</p>
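<p>Percentiles are worth internalizing. A nearest-rank sketch (real agents use histogram sketches rather than sorting raw samples, but the intuition is the same) shows how one bad request vanishes from p50 and dominates p95:</p>
<pre><code class="language-python">def percentile(samples, p):
    """Nearest-rank percentile; fine for a sketch, not for an agent."""
    ordered = sorted(samples)
    rank = max(1, round(p * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = [120, 130, 125, 140, 135, 128, 900, 132, 127, 131]
print(percentile(latencies_ms, 0.50))  # the typical request
print(percentile(latencies_ms, 0.95))  # what unlucky users actually feel
</code></pre>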
<hr />
<h2>From Metric to Trace</h2>
<p>Once I see a slow endpoint, the next step is not guessing.</p>
<p>I open a real trace.</p>
<p>This is where observability becomes different from monitoring.</p>
<p>A single trace shows:</p>
<ul>
<li><p>every service hop</p>
</li>
<li><p>DB queries</p>
</li>
<li><p>external calls</p>
</li>
<li><p>cache usage</p>
</li>
<li><p>retries</p>
</li>
<li><p>time spent waiting vs executing</p>
</li>
</ul>
<p>Now the system is no longer abstract.</p>
<p>I’m looking at a specific request that actually happened.</p>
<p>Very often the problem is immediately visible:</p>
<ul>
<li><p>a 2.4s external API call</p>
</li>
<li><p>a blocking call inside async flow</p>
</li>
<li><p>N+1 database queries</p>
</li>
<li><p>a retry storm</p>
</li>
<li><p>thread pool saturation</p>
</li>
</ul>
<p>Without tracing, these issues look identical from metrics.</p>
<hr />
<h2>The Most Valuable Feature: Outliers</h2>
<p>Averages hide incidents.</p>
<p>I care much more about <strong>the slowest 1%</strong>.</p>
<p>In Datadog I frequently filter:</p>
<p>“Show traces where duration &gt; 2 seconds”</p>
<p>This reveals patterns dashboards never show.</p>
<p>Examples I’ve encountered:</p>
<ul>
<li><p>only requests with large payloads fail</p>
</li>
<li><p>only specific tenants are slow</p>
</li>
<li><p>only cache misses trigger a downstream bottleneck</p>
</li>
<li><p>only first request after deployment is slow (cold initialization)</p>
</li>
</ul>
<p>Production problems are rarely global.<br />They are conditional.</p>
<p>Tracing exposes conditions.</p>
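<p>A toy version of that filter (the threshold and trace fields are invented) shows how quickly a shared attribute surfaces once you keep only the outliers:</p>
<pre><code class="language-python">from collections import Counter

THRESHOLD_MS = 2000

def at_least(value, limit):
    # true when value is at or above limit
    return max(value, limit) == value

traces = [
    {"ms": 180,  "tenant": "a"}, {"ms": 210,  "tenant": "b"},
    {"ms": 2600, "tenant": "c"}, {"ms": 190,  "tenant": "a"},
    {"ms": 3100, "tenant": "c"}, {"ms": 240,  "tenant": "b"},
]
slow = [t for t in traces if at_least(t["ms"], THRESHOLD_MS)]
print(Counter(t["tenant"] for t in slow))  # slowness clusters on one tenant
</code></pre>
<p>In Datadog the equivalent is a trace search query plus a group-by on a tag; the point is that the pattern lives in the outliers, not in the aggregate.</p>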
<hr />
<h2>Database Visibility</h2>
<p>One of the most practical benefits is query-level insight.</p>
<p>Instead of:</p>
<p>“database seems slow”</p>
<p>I can see:</p>
<ul>
<li><p>exact SQL query</p>
</li>
<li><p>execution time</p>
</li>
<li><p>frequency</p>
</li>
<li><p>which endpoint triggered it</p>
</li>
</ul>
<p>This immediately distinguishes:</p>
<ul>
<li><p>bad indexing</p>
</li>
<li><p>missing caching</p>
</li>
<li><p>accidental full table scan</p>
</li>
<li><p>ORM misuse</p>
</li>
</ul>
<p>Many “performance problems” turn out to be <strong>query shape problems</strong>.</p>
<p>And they are visible in seconds.</p>
<hr />
<h2>Logs — But Only After the Trace</h2>
<p>A mistake I made early on was starting with logs.</p>
<p>Logs are high volume and low signal.</p>
<p>Now I almost never search logs first.</p>
<p>I:</p>
<ol>
<li><p>Find a slow trace</p>
</li>
<li><p>Jump to correlated logs from that trace</p>
</li>
</ol>
<p>This changes everything.</p>
<p>Instead of reading thousands of lines, I read the exact log lines produced by the failing request.</p>
<p>Logs become evidence instead of noise.</p>
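<p>The mechanism behind the correlation is simple: every log line carries the request's trace id. A hand-rolled sketch of the idea — tracing libraries such as ddtrace can inject these ids into log records automatically, so you rarely write this yourself:</p>
<pre><code class="language-python">import logging

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("checkout")

def handle_request(trace_id, payload):
    tag = f"trace_id={trace_id}"
    log.info("start %s payload_size=%d", tag, len(payload))
    # ... business logic ...
    log.info("done %s", tag)

handle_request("abc123", "some payload")
</code></pre>
<p>Once the id is present, "logs for this trace" becomes a filter instead of a search.</p>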
<hr />
<h2>Alerts I Actually Trust</h2>
<p>The alerts that wake engineers at night should be very few.</p>
<p>The most useful alerts I’ve configured are not CPU or memory alerts.</p>
<p>They are:</p>
<ul>
<li><p>p99 latency above threshold for sustained period</p>
</li>
<li><p>error rate per endpoint (not global)</p>
</li>
<li><p>saturation of worker queues</p>
</li>
<li><p>retry explosion indicators</p>
</li>
</ul>
<p>Hardware metrics often alert <em>after</em> users notice.</p>
<p>Request-level alerts often alert <em>before</em> support tickets appear.</p>
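<p>The "sustained period" part matters as much as the threshold. A sketch of that evaluation logic, with an invented window size and threshold:</p>
<pre><code class="language-python">from collections import deque

WINDOW = 5           # consecutive evaluation points
THRESHOLD_MS = 800   # assumed p99 limit

recent = deque(maxlen=WINDOW)

def observe(p99_ms):
    """Fire only when every sample in the window breaches the threshold."""
    breached = (max(p99_ms, THRESHOLD_MS) == p99_ms)  # at or above limit
    recent.append(breached)
    return len(recent) == WINDOW and all(recent)

for sample in (900, 950, 400, 900, 910, 905, 880, 920):
    if observe(sample):
        print("ALERT: p99 sustained above threshold")
</code></pre>
<p>A single noisy sample (the 400ms dip, or one 900ms spike) never pages anyone; five bad points in a row do.</p>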
<hr />
<h2>An Unexpected Use: Verifying Fixes</h2>
<p>Observability is not only for incidents.</p>
<p>It is also for confidence.</p>
<p>After deploying a fix, I don’t just wait.</p>
<p>I compare traces:</p>
<p>before vs after</p>
<p>I verify:</p>
<ul>
<li><p>latency distribution</p>
</li>
<li><p>retry count</p>
</li>
<li><p>DB query time</p>
</li>
<li><p>downstream calls</p>
</li>
</ul>
<p>This turns performance work from intuition into evidence.</p>
<hr />
<h2>Final Thought</h2>
<p>Monitoring tells you when the system is unhealthy.</p>
<p>Observability lets you understand behavior while the system is still running.</p>
<p>Modern backend systems rarely fail loudly.</p>
<p>They degrade, retry, compensate and partially work.</p>
<p>Without request-level visibility, engineers debug symptoms.</p>
<p>With observability, they debug causality.</p>
<p>Datadog is simply the tool I use to see that causality while the system is alive.</p>
]]></content:encoded></item><item><title><![CDATA[Working With an AI Coding Assistant (Codex) as a Backend Engineer]]></title><description><![CDATA[Over the last months I started using an AI coding assistant powered by large language models (Codex-style systems).
I did not approach it as a novelty or productivity experiment.
I approached it the s]]></description><link>https://leandromaia.dev/working-with-an-ai-coding-assistant-codex-as-a-backend-engineer</link><guid isPermaLink="true">https://leandromaia.dev/working-with-an-ai-coding-assistant-codex-as-a-backend-engineer</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 13 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Over the last months I started using an AI coding assistant powered by large language models (Codex-style systems).</p>
<p>I did not approach it as a novelty or productivity experiment.</p>
<p>I approached it the same way I approach any new piece of infrastructure:<br />with skepticism and with a production mindset.</p>
<p>The interesting discovery was this:</p>
<p>The assistant is not a faster autocomplete.</p>
<p>It behaves much closer to a <strong>very fast junior engineer with perfect recall and zero operational experience</strong>.</p>
<p>Once I started treating it that way, it became genuinely useful.</p>
<p>This post is not about whether AI will replace engineers.<br />It is about how it actually changes day-to-day backend work.</p>
<hr />
<h2>What It Is Actually Good At</h2>
<p>The first surprise was not code generation.</p>
<p>It was <em>code navigation</em>.</p>
<p>In large systems, a lot of time is not spent writing code.<br />It is spent reconstructing intent.</p>
<p>Typical tasks:</p>
<ul>
<li><p>understanding an unfamiliar module</p>
</li>
<li><p>finding where a side effect originates</p>
</li>
<li><p>tracing request flows</p>
</li>
<li><p>reconstructing configuration behavior</p>
</li>
<li><p>mapping DTOs across layers</p>
</li>
</ul>
<p>The assistant is very good at building a mental index of a codebase quickly.</p>
<p>You can ask questions like:</p>
<p>“Where could a timeout be happening in this flow?”</p>
<p>And it will point to:</p>
<ul>
<li><p>HTTP client configuration</p>
</li>
<li><p>thread pool limits</p>
</li>
<li><p>retry wrappers</p>
</li>
<li><p>circuit breaker policies</p>
</li>
</ul>
<p>Not always correctly — but almost always <em>usefully</em>.</p>
<p>The real productivity gain is not typing less code.</p>
<p>It is reducing search time.</p>
<hr />
<h2>The Refactoring Multiplier</h2>
<p>The second strong use case is mechanical refactoring.</p>
<p>Things engineers postpone for months:</p>
<ul>
<li><p>renaming confusing interfaces</p>
</li>
<li><p>splitting large classes</p>
</li>
<li><p>extracting validation logic</p>
</li>
<li><p>migrating method signatures</p>
</li>
<li><p>removing duplication</p>
</li>
</ul>
<p>These tasks are cognitively easy but operationally expensive.</p>
<p>They require attention, but not deep design thinking.</p>
<p>The assistant is extremely effective here.</p>
<p>You still review every change.</p>
<p>But the cost of attempting a refactor drops dramatically.</p>
<p>The interesting side effect:</p>
<p>I started performing refactors earlier.</p>
<p>Not because the assistant is perfect — but because the activation energy disappeared.</p>
<hr />
<h2>Where It Fails (Consistently)</h2>
<p>The assistant writes correct-looking code far more often than correct systems.</p>
<p>This is the most important observation.</p>
<p>It is strong at:</p>
<ul>
<li><p>syntax</p>
</li>
<li><p>API usage</p>
</li>
<li><p>small local logic</p>
</li>
</ul>
<p>It is weak at:</p>
<ul>
<li><p>concurrency</p>
</li>
<li><p>distributed systems</p>
</li>
<li><p>failure handling</p>
</li>
<li><p>timeouts</p>
</li>
<li><p>idempotency</p>
</li>
<li><p>partial failure</p>
</li>
</ul>
<p>In other words, it struggles exactly where real backend incidents happen.</p>
<p>If you ask it to implement a retry mechanism, it will produce one.</p>
<p>If you ask it to design a safe retry mechanism, it will often produce a system that can duplicate side effects.</p>
<p>This is a critical difference.</p>
<p>The assistant optimizes for <em>plausibility</em>, not for <em>operability</em>.</p>
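<p>To make the difference concrete, here is a minimal sketch of what a "safe" retry needs that a naive one lacks: one idempotency key per logical operation, so a retry after a lost response cannot create a second side effect. The gateway class and its failure injection are hypothetical stand-ins for a real external service:</p>

```python
import uuid

class PaymentGateway:
    """Hypothetical stand-in for an external payment service
    that deduplicates on an idempotency key."""

    def __init__(self, fail_first_n=0):
        self.captured = {}  # idempotency_key -> amount
        self._failures_left = fail_first_n

    def capture(self, idempotency_key, amount):
        if idempotency_key in self.captured:
            return "already-captured"  # server-side dedup
        self.captured[idempotency_key] = amount
        if self._failures_left > 0:
            self._failures_left -= 1
            # The side effect happened, but the caller never saw the response.
            raise TimeoutError("response lost after the side effect happened")
        return "captured"

def capture_with_retry(gateway, amount, attempts=3):
    # One key for the whole logical operation, NOT one per attempt:
    # this is what prevents a retry from duplicating the capture.
    key = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return gateway.capture(key, amount)
        except TimeoutError:
            continue
    raise RuntimeError("gave up")
```

<p>A naive retry generates a fresh request per attempt, so the "response lost" case produces two captures. In my experience this per-attempt-key version is exactly what an assistant tends to produce unless you ask for the invariant explicitly.</p>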
<hr />
<h2>The Illusion of Correctness</h2>
<p>The most dangerous property is fluency.</p>
<p>Bad code used to look suspicious.</p>
<p>AI-generated code often looks clean, documented and well structured.</p>
<p>Which makes engineers trust it more than they should.</p>
<p>The failure mode is subtle:</p>
<p>You stop questioning decisions that you did not consciously make.</p>
<p>Over time, this introduces architectural drift.</p>
<p>Not dramatic failures — but many small design decisions that nobody truly owns.</p>
<p>I’ve seen:</p>
<ul>
<li><p>retries added in three layers</p>
</li>
<li><p>hidden blocking calls inside async flows</p>
</li>
<li><p>silent error swallowing</p>
</li>
<li><p>accidental N+1 queries</p>
</li>
</ul>
<p>All of them reasonable locally.<br />All of them problematic systemically.</p>
<hr />
<h2>The Real Productivity Shift</h2>
<p>The assistant does not remove the need for senior engineers.</p>
<p>It increases the value of judgment.</p>
<p>Before:</p>
<p>Senior engineers wrote more code correctly.</p>
<p>Now:</p>
<p>Senior engineers <strong>reject more code correctly</strong>.</p>
<p>A large part of using an AI assistant well is knowing when <em>not</em> to accept its solution.</p>
<p>You stop thinking of it as a coding tool and start thinking of it as a proposal generator.</p>
<hr />
<h2>A Practical Workflow That Worked For Me</h2>
<p>What worked best for me was separating tasks into two categories.</p>
<h3>Tasks I delegate to the assistant</h3>
<ul>
<li><p>boilerplate</p>
</li>
<li><p>DTO mapping</p>
</li>
<li><p>test scaffolding</p>
</li>
<li><p>refactor mechanics</p>
</li>
<li><p>documentation drafts</p>
</li>
<li><p>codebase exploration</p>
</li>
</ul>
<h3>Tasks I never delegate</h3>
<ul>
<li><p>concurrency control</p>
</li>
<li><p>transactional boundaries</p>
</li>
<li><p>caching strategy</p>
</li>
<li><p>retries</p>
</li>
<li><p>idempotency</p>
</li>
<li><p>API contracts</p>
</li>
<li><p>state machines</p>
</li>
</ul>
<p>Interestingly, this boundary maps almost exactly to the boundary between programming and engineering.</p>
<p>The assistant is good at programming.</p>
<p>Engineering still requires responsibility for behavior in production.</p>
<hr />
<h2>Unexpected Benefit: Thinking More Explicitly</h2>
<p>One side effect I did not expect:</p>
<p>I started writing clearer code.</p>
<p>Because I needed to describe intent precisely when prompting, I became more explicit about:</p>
<ul>
<li><p>invariants</p>
</li>
<li><p>failure modes</p>
</li>
<li><p>assumptions</p>
</li>
<li><p>data ownership</p>
</li>
</ul>
<p>The tool forces you to articulate reasoning you previously kept in your head.</p>
<p>That alone improved code reviews and documentation.</p>
<hr />
<h2>Final Thought</h2>
<p>AI coding assistants change how code is produced.</p>
<p>They do not change what reliable systems require.</p>
<p>Production systems are constrained by latency, partial failure, concurrency and time.</p>
<p>The assistant does not experience incidents, on-call or operational consequences.</p>
<p>Engineers do.</p>
<p>The most useful mental model I found is this:</p>
<p>The assistant can generate solutions.</p>
<p>The engineer is still accountable for reality.</p>
<p>And in backend systems, reality is what eventually wins.</p>
]]></content:encoded></item><item><title><![CDATA[Why backend systems become fragile as companies grow]]></title><description><![CDATA[Most backend systems don’t fail because of a bad initial design.
Many systems start simple, clean and understandable. Early teams usually know every component, every database table and most side effec]]></description><link>https://leandromaia.dev/why-backend-systems-become-fragile-as-companies-grow</link><guid isPermaLink="true">https://leandromaia.dev/why-backend-systems-become-fragile-as-companies-grow</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 06 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Most backend systems don’t fail because of a bad initial design.</p>
<p>Many systems start simple, clean and understandable. Early teams usually know every component, every database table and most side effects of a change. Deployments feel safe and incidents are rare.</p>
<p>Yet, after some growth, the same systems often become fragile. Small changes cause unexpected problems. Incidents appear more frequently. Teams begin to fear deployments.</p>
<p>What changed is rarely the programming language, the framework or even the main architecture.</p>
<p>What changed is the context around the system.</p>
<h2>The moment complexity becomes invisible</h2>
<p>In early stages, a system is small enough to exist inside a shared mental model. A few engineers understand how data flows, which services depend on others and what assumptions exist.</p>
<p>Growth breaks that.</p>
<p>As a company grows, three things tend to happen at the same time:</p>
<ul>
<li><p>more teams interact with the same system</p>
</li>
<li><p>integrations increase</p>
</li>
<li><p>local decisions accumulate</p>
</li>
</ul>
<p>None of these are inherently bad. Each decision is usually reasonable in isolation. A new integration enables a business opportunity. A quick workaround solves an urgent need. A new service isolates a responsibility.</p>
<p>The fragility comes from how these decisions interact over time.</p>
<p>No one owns the full mental model anymore, but the system still behaves as if someone should.</p>
<h2>Local optimizations, global consequences</h2>
<p>A common pattern in growing organizations is local optimization.</p>
<p>A team improves performance for their feature. Another team adds caching for a specific endpoint. A third team creates a background job to guarantee retries.</p>
<p>Individually, these changes make sense.</p>
<p>Collectively, they create hidden coupling.</p>
<p>Soon, actions that were once simple — like reprocessing an event, replaying a queue, or fixing a database record — become dangerous. Not because the code is poorly written, but because the number of implicit assumptions increased.</p>
<p>The system did not become fragile due to complexity alone.</p>
<p>It became fragile because complexity became <em>implicit</em>.</p>
<h2>Microservices don’t automatically solve this</h2>
<p>At this stage, many organizations assume the solution is an architectural change.</p>
<p>Often the reaction is to split the system further: more services, more queues, more boundaries.</p>
<p>This can help, but only when boundaries reflect real ownership and domain understanding.</p>
<p>Otherwise, microservices simply distribute fragility across network calls instead of function calls.</p>
<p>The core issue is not whether the system is a monolith or microservices.</p>
<p>The issue is whether the system’s structure matches how teams understand and operate it.</p>
<h2>What actually improves stability</h2>
<p>In practice, stability improves not primarily through new technology, but through clearer system thinking.</p>
<p>A few patterns consistently help:</p>
<h3>Clear ownership</h3>
<p>Every important component should have a team that understands its behavior in production, not just its code.</p>
<h3>Explicit boundaries</h3>
<p>Systems become safer when assumptions are documented and contracts are treated seriously. Many production issues come from assumptions that were never written down.</p>
<h3>Observability over cleverness</h3>
<p>Metrics, logs and traces reduce fear because they allow engineers to reason about behavior instead of guessing.</p>
<h3>Fewer responsibilities per component</h3>
<p>Components that handle many unrelated responsibilities become risk multipliers. Simplicity in responsibility often matters more than technical elegance.</p>
<h2>Fragility is a systems problem</h2>
<p>It is tempting to see incidents as isolated technical failures.</p>
<p>More often, they are signals that the system outgrew its original mental model.</p>
<p>The code did not suddenly become worse. The system simply reached a scale where informal knowledge stopped being enough.</p>
<p>Backend fragility rarely comes from bad engineers or bad intentions.</p>
<p>It comes from successful systems growing beyond the structures that once kept them understandable.</p>
<p>Improving stability, therefore, is less about rewriting everything and more about making the system understandable again.</p>
]]></content:encoded></item><item><title><![CDATA[Not All Race Conditions Are Threads — Race Conditions in Distributed Systems]]></title><description><![CDATA[When engineers hear “race condition”, most imagine two threads modifying the same variable.
That is the smallest version of the problem.
In distributed systems, race conditions are far more dangerous ]]></description><link>https://leandromaia.dev/not-all-race-conditions-are-threads-race-conditions-in-distributed-systems</link><guid isPermaLink="true">https://leandromaia.dev/not-all-race-conditions-are-threads-race-conditions-in-distributed-systems</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 06 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>When engineers hear “race condition”, most imagine two threads modifying the same variable.</p>
<p>That is the <em>smallest</em> version of the problem.</p>
<p>In distributed systems, race conditions are far more dangerous because they don’t depend on shared memory.<br />They depend on <strong>time, ordering and partial knowledge</strong>.</p>
<p>No locks.<br />No stack traces.<br />No deterministic reproduction.</p>
<p>And the system can be perfectly healthy from an infrastructure perspective.</p>
<p>This post is about the kinds of race conditions that actually appear in production backend systems.</p>
<hr />
<h2>1) The Double-Execution Race (Duplicate Processing)</h2>
<p>This is the most common distributed race condition.</p>
<p>A worker processes a message.<br />The message broker doesn’t receive the acknowledgement in time.<br />The broker redelivers.</p>
<p>Now two workers execute the same operation.</p>
<p>Typical scenario:</p>
<ul>
<li><p>order creation</p>
</li>
<li><p>payment capture</p>
</li>
<li><p>email sending</p>
</li>
<li><p>inventory reservation</p>
</li>
<li><p>coupon redemption</p>
</li>
</ul>
<p>Nothing crashed.</p>
<p>The system did exactly what it was designed to do: <strong>at-least-once delivery</strong>.</p>
<p>But the business operation was not idempotent.</p>
<h3>What makes this dangerous</h3>
<p>The second execution is not a retry from the same process.<br />It is a <em>concurrent logical operation</em>.</p>
<p>You now have:</p>
<ul>
<li><p>two payment captures</p>
</li>
<li><p>two shipments</p>
</li>
<li><p>two state transitions</p>
</li>
<li><p>inconsistent accounting</p>
</li>
</ul>
<p>And logs look completely valid.</p>
<h3>Typical mistaken fixes</h3>
<ul>
<li><p>increasing visibility timeout</p>
</li>
<li><p>reducing consumer concurrency</p>
</li>
<li><p>adding delays</p>
</li>
</ul>
<p>Those reduce the probability of the race, not the race itself.</p>
<h3>Real fix</h3>
<p>You need <strong>idempotency at the business boundary</strong>, not at the infrastructure layer.</p>
<p>Examples:</p>
<ul>
<li><p>idempotency keys stored with unique constraints</p>
</li>
<li><p>operation tokens</p>
</li>
<li><p>deduplication tables</p>
</li>
<li><p>state transition guards</p>
</li>
</ul>
<p>The system must be able to answer:</p>
<blockquote>
<p>“Has this operation already been logically completed?”</p>
</blockquote>
<p>Not:</p>
<blockquote>
<p>“Has this message already been seen by this worker?”</p>
</blockquote>
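<p>A minimal sketch of the dedup-table approach, using an in-memory SQLite database as a stand-in for a real one (table and column names are illustrative). The unique constraint, not application logic, is what arbitrates the race:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_operations (
        idempotency_key TEXT PRIMARY KEY,  -- the unique constraint is the guard
        result          TEXT
    )
""")

def process_once(key, handler):
    """Run handler only if this logical operation has not completed before."""
    try:
        # Claim the key first; a duplicate delivery hits the primary-key
        # constraint instead of re-running the side effect.
        conn.execute(
            "INSERT INTO processed_operations (idempotency_key, result) "
            "VALUES (?, NULL)",
            (key,),
        )
    except sqlite3.IntegrityError:
        return "duplicate-skipped"
    result = handler()
    conn.execute(
        "UPDATE processed_operations SET result = ? WHERE idempotency_key = ?",
        (result, key),
    )
    conn.commit()
    return result
```

<p>In a real system you would claim the key and run the side effect inside the same transaction (or handle a crash between the two), but the core idea stands: the key identifies the <em>business operation</em>, so it survives redelivery to a different worker.</p>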
<hr />
<h2>2) The Lost Update Race (Concurrent Writers)</h2>
<p>Two services read the same entity state and both decide to modify it.</p>
<p>Timeline:</p>
<ol>
<li><p>Service A reads balance = 100</p>
</li>
<li><p>Service B reads balance = 100</p>
</li>
<li><p>A subtracts 40 → writes 60</p>
</li>
<li><p>B subtracts 80 → writes 20</p>
</li>
</ol>
<p>Final state: 20<br />Correct state: −20 or rejected</p>
<p>No conflicts detected.<br />Database behaved correctly.</p>
<p>This happens frequently with:</p>
<ul>
<li><p>wallets</p>
</li>
<li><p>inventory</p>
</li>
<li><p>quotas</p>
</li>
<li><p>rate limits</p>
</li>
<li><p>seat reservations</p>
</li>
</ul>
<h3>Why transactions don’t automatically save you</h3>
<p>Because both transactions are individually valid.</p>
<p>The race is <strong>between reads</strong>, not writes.</p>
<h3>Correct approaches</h3>
<ul>
<li><p>optimistic locking (version column)</p>
</li>
<li><p>compare-and-swap updates</p>
</li>
<li><p>conditional writes</p>
</li>
<li><p>atomic database operations</p>
</li>
<li><p>append-only ledgers instead of mutable state</p>
</li>
</ul>
<p>The real solution is not stronger transactions.</p>
<p>It is <strong>state transition control</strong>.</p>
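<p>A minimal sketch of the compare-and-swap variant, again with in-memory SQLite as a stand-in (schema and names are illustrative). The update is conditional on the version we read, so a writer holding a stale snapshot simply fails instead of silently overwriting:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wallets (id INTEGER PRIMARY KEY, "
    "balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO wallets VALUES (1, 100, 0)")

def debit(wallet_id, amount):
    """Compare-and-swap debit: the write only succeeds against
    the exact version we read."""
    balance, version = conn.execute(
        "SELECT balance, version FROM wallets WHERE id = ?", (wallet_id,)
    ).fetchone()
    if balance < amount:
        return "rejected"
    cur = conn.execute(
        "UPDATE wallets SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",  # no rows match if someone wrote in between
        (balance - amount, wallet_id, version),
    )
    return "ok" if cur.rowcount == 1 else "conflict"
```

<p>Replaying the timeline from above: A debits 40 and succeeds; B, still holding version 0, gets a zero-row update and must re-read, at which point the balance of 60 makes its debit of 80 a clean rejection instead of a silent lost update.</p>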
<hr />
<h2>3) The Out-of-Order Event Race</h2>
<p>Distributed systems do not guarantee global ordering.</p>
<p>Even Kafka does not — only per partition.</p>
<p>Typical example:</p>
<ol>
<li><p><code>OrderCancelled</code></p>
</li>
<li><p>`Order</p>
</li>
</ol>
]]></content:encoded></item></channel></rss>