<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Leandro Maia]]></title><description><![CDATA[Notes on Backend Systems and Software Architecture]]></description><link>https://leandromaia.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 14:34:06 GMT</lastBuildDate><atom:link href="https://leandromaia.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[When the Message “Disappears”: A Production-Focused Guide Using AWS SQS]]></title><description><![CDATA[In most production incidents involving “missing messages,” the queue is blamed early.
SQS is down. The message was dropped. AWS lost it.
True message loss inside managed queue infrastructure is extremel]]></description><link>https://leandromaia.dev/when-the-message-disappears-how-distributed-systems-lose-certainty-and-how-to-get-it-back</link><guid isPermaLink="true">https://leandromaia.dev/when-the-message-disappears-how-distributed-systems-lose-certainty-and-how-to-get-it-back</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 03 Mar 2026 15:53:10 GMT</pubDate><content:encoded><![CDATA[<p>In most production incidents involving “missing messages,” the queue is blamed early.</p>
<p>SQS is down.<br />The message was dropped.<br />AWS lost it.</p>
<p>True message loss inside managed queue infrastructure is extremely rare. What teams experience instead is a loss of <strong>certainty across lifecycle boundaries</strong>.</p>
<p>The system accepted an event.<br />Infrastructure metrics look healthy.<br />The business outcome did not occur.</p>
<p>That gap — between technical signals and business reality — is where distributed systems become difficult.</p>
<p>This article breaks down how messages appear to disappear, why teams usually detect it too late, and how to design systems that remain diagnosable and recoverable.</p>
<hr />
<h2>1. Start With the Lifecycle, Not the Queue</h2>
<p>A simplified SQS lifecycle:</p>
<pre><code class="language-plaintext">Producer → SQS → Consumer → Process → Commit → Delete
Else → Visibility Timeout → Retry → DLQ
</code></pre>
<p>Every transition is a failure boundary.</p>
<p>A message can:</p>
<ul>
<li><p>Fail to publish (including partial batch failures).</p>
</li>
<li><p>Be published but never consumed (misconfiguration, IAM, polling issues).</p>
</li>
<li><p>Be consumed but fail during processing.</p>
</li>
<li><p>Succeed in processing but fail during state commit.</p>
</li>
<li><p>Be retried due to visibility timeout.</p>
</li>
<li><p>Move to a DLQ after max receives.</p>
</li>
<li><p>Expire due to retention limits.</p>
</li>
<li><p>Be processed twice and overwrite newer state.</p>
</li>
</ul>
<p>If these transitions are not observable, investigation becomes reconstruction.</p>
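<p>The delete-last rule implied by that lifecycle can be sketched as a consumer loop. The <code>MessageQueue</code> and <code>StateStore</code> interfaces below are stand-ins for the real SQS client and the system of record, not actual SDK types:</p>
<pre><code class="language-java">// Sketch only: MessageQueue and StateStore stand in for the real SQS client
// and the system of record. The invariant: delete only after a durable commit,
// so a crash at any earlier step causes redelivery, never silent loss.
interface MessageQueue {
    String receive();            // returns null when the queue is empty
    void delete(String message);
}

interface StateStore {
    void commit(String result);  // durable business state change
}

class QueueConsumer {
    private final MessageQueue queue;
    private final StateStore store;

    QueueConsumer(MessageQueue queue, StateStore store) {
        this.queue = queue;
        this.store = store;
    }

    /** Handles at most one message; returns true if one was processed. */
    boolean poll() {
        String message = queue.receive();
        if (message == null) {
            return false;
        }
        String result = process(message);
        store.commit(result);    // if this throws, the message is NOT deleted
        queue.delete(message);   // delete last: redelivery covers crashes
        return true;
    }

    String process(String message) {
        return message.toUpperCase();
    }
}
</code></pre>
<p>If the process crashes before <code>delete</code>, the visibility timeout expires and the message is redelivered; the failure mode is a retry, not silent loss.</p>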
<hr />
<h2>2. Infrastructure Success vs Business Correctness</h2>
<p>One of the most expensive patterns in distributed systems:</p>
<ol>
<li><p>Message is received.</p>
</li>
<li><p>Business logic throws.</p>
</li>
<li><p>Exception is caught or downgraded.</p>
</li>
<li><p>Message is deleted.</p>
</li>
<li><p>Metrics remain green.</p>
</li>
</ol>
<p>From the queue’s perspective, the lifecycle completed.</p>
<p>From the business perspective, nothing happened.</p>
<p>This disconnect emerges when systems measure:</p>
<ul>
<li><p>Messages sent</p>
</li>
<li><p>Messages received</p>
</li>
<li><p>Messages deleted</p>
</li>
</ul>
<p>But do not measure:</p>
<ul>
<li><p>Domain invariants</p>
</li>
<li><p>State transitions</p>
</li>
<li><p>Outcome completion</p>
</li>
</ul>
<p>Queue health is not system health.</p>
<hr />
<h2>3. Visibility Timeout and Duplicate Effects</h2>
<p>SQS guarantees <strong>at-least-once delivery</strong>.</p>
<p>If processing time exceeds visibility timeout:</p>
<ul>
<li><p>The message becomes visible again.</p>
</li>
<li><p>Another consumer processes it.</p>
</li>
<li><p>Side effects execute more than once.</p>
</li>
</ul>
<p>Without idempotent handlers, this leads to:</p>
<ul>
<li><p>Reverted state</p>
</li>
<li><p>Conflicting updates</p>
</li>
<li><p>Financial inconsistencies</p>
</li>
<li><p>External API duplication</p>
</li>
</ul>
<p>Exactly-once semantics do not emerge automatically from SQS. They must be constructed at the application layer.</p>
<p>Idempotency and conditional state transitions are foundational, not optional.</p>
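<p>A minimal sketch of that application-layer guard, assuming each message carries a stable deduplication key (an order id, for example). The set is in-memory for brevity; a production handler would record keys in a durable store using conditional writes.</p>
<pre><code class="language-java">import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the first delivery of a key runs the side effect; redeliveries
// caused by visibility-timeout retries or replays are skipped.
public class IdempotencyGuard {

    private final Set&lt;String&gt; processed = ConcurrentHashMap.newKeySet();

    /** Returns true if the side effect ran, false for a duplicate delivery. */
    public boolean handleOnce(String dedupKey, Runnable sideEffect) {
        if (!processed.add(dedupKey)) {   // Set.add is atomic per key
            return false;
        }
        sideEffect.run();
        return true;
    }
}
</code></pre>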
<hr />
<h2>4. DLQ as a System Signal</h2>
<p>A DLQ is a diagnostic channel.</p>
<p>In multiple real-world incidents:</p>
<ul>
<li><p>The primary queue throughput was normal.</p>
</li>
<li><p>Consumers were active.</p>
</li>
<li><p>No alarms fired.</p>
</li>
</ul>
<p>Meanwhile, the DLQ accumulated messages due to:</p>
<ul>
<li><p>Schema evolution mismatches</p>
</li>
<li><p>Validation failures</p>
</li>
<li><p>Unexpected enum values</p>
</li>
<li><p>Downstream dependency errors</p>
</li>
</ul>
<p>Teams discovered this days later during reconciliation.</p>
<p>DLQ depth should be treated as a production signal with strict alerting thresholds.</p>
<hr />
<h2>5. Retention Is Part of Reliability</h2>
<p>SQS retention defaults to four days and can extend to fourteen.</p>
<p>If consumers are unavailable beyond retention, messages are deleted.</p>
<p>When detection occurs late:</p>
<ul>
<li><p>The original events may no longer exist.</p>
</li>
<li><p>Replay is impossible without external persistence.</p>
</li>
<li><p>Reconstruction requires alternative data sources.</p>
</li>
</ul>
<p>Retention settings must align with operational recovery expectations.  </p>
<p>If recovery time objectives exceed retention, data loss becomes predictable.</p>
<hr />
<h2>6. Why Detection Happens Too Late</h2>
<p>Most systems monitor infrastructure but not domain outcomes.</p>
<p>Infrastructure metrics:</p>
<ul>
<li><p>Sent</p>
</li>
<li><p>Received</p>
</li>
<li><p>Deleted</p>
</li>
<li><p>Queue depth</p>
</li>
</ul>
<p>Business metrics:</p>
<ul>
<li><p>Orders completed</p>
</li>
<li><p>Payments captured</p>
</li>
<li><p>State transitions finalized</p>
</li>
</ul>
<p>Without business-level observability, failures surface only when humans notice discrepancies.</p>
<p>By that time:</p>
<ul>
<li><p>Retention windows may have closed.</p>
</li>
<li><p>Logs may have rotated.</p>
</li>
<li><p>State divergence may have propagated.</p>
</li>
</ul>
<p>The problem transitions from debugging to recovery.</p>
<hr />
<h2>7. Replay Is a System Capability, Not an Emergency Script</h2>
<p>Replaying messages introduces additional constraints.</p>
<h3>Idempotency</h3>
<p>Reprocessing events can trigger duplicate side effects:</p>
<ul>
<li><p>Financial operations</p>
</li>
<li><p>Notifications</p>
</li>
<li><p>External integrations</p>
</li>
</ul>
<p>Consumers must tolerate historical re-execution safely.</p>
<h3>Ordering</h3>
<p>Standard SQS queues do not guarantee ordering.</p>
<p>Replaying subsets of events may apply state transitions out of sequence.  </p>
<p>Version checks or sequence validation are required to prevent regression.</p>
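<p>A sketch of such a version check, assuming each event carries a monotonically increasing version number for its aggregate. Stale or duplicate replays are rejected rather than allowed to regress state:</p>
<pre><code class="language-java">// Sketch: state advances only on strictly newer events, so replaying an
// older subset cannot overwrite a newer value.
public class VersionedState {

    private long version = 0;
    private String value = "";

    /** Applies the event only if its version is newer; returns false otherwise. */
    public synchronized boolean apply(long eventVersion, String newValue) {
        if (eventVersion &lt;= version) {
            return false;   // stale or duplicate event: ignore
        }
        version = eventVersion;
        value = newValue;
        return true;
    }

    public String value() { return value; }
}
</code></pre>
<p>In a database-backed consumer, the same guard becomes a conditional update that matches on the stored version.</p>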
<h3>Reconstruction</h3>
<p>If events are no longer available in the queue, replay requires:</p>
<ul>
<li><p>Audit tables</p>
</li>
<li><p>Change data capture streams</p>
</li>
<li><p>Data warehouse reconstruction</p>
</li>
<li><p>External reconciliation</p>
</li>
</ul>
<p>This significantly increases operational complexity.</p>
<h3>Load Amplification</h3>
<p>Bulk reprocessing can overload downstream services and recreate the original failure condition.</p>
<p>Replay requires throttling, isolation, and staged execution.</p>
<hr />
<h2>8. Designing for Certainty</h2>
<p>Systems that remain diagnosable under stress share several characteristics.</p>
<h3>End-to-End Traceability</h3>
<p>Every message carries a correlation identifier across boundaries.</p>
<p>You can answer:</p>
<ul>
<li><p>When was the event published?</p>
</li>
<li><p>Which consumer processed it?</p>
</li>
<li><p>Was it retried?</p>
</li>
<li><p>Did it reach the DLQ?</p>
</li>
<li><p>Was state committed durably?</p>
</li>
</ul>
<p>Without this, incident timelines become speculative.</p>
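<p>A sketch of the producer-side convention, where a correlation id is minted at the edge and reused on every hop (with SQS it would typically travel as a message attribute; the <code>Envelope</code> type here is illustrative):</p>
<pre><code class="language-java">import java.util.UUID;

// Sketch: every message carries a correlation id. Downstream services reuse
// the incoming id instead of minting a new one, so logs join across hops.
public record Envelope(String correlationId, String payload) {

    public static Envelope wrap(String payload, String incomingCorrelationId) {
        String id = (incomingCorrelationId != null)
                ? incomingCorrelationId
                : UUID.randomUUID().toString();
        return new Envelope(id, payload);
    }
}
</code></pre>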
<hr />
<h3>Idempotent State Transitions</h3>
<p>State changes are guarded by:</p>
<ul>
<li><p>Version checks</p>
</li>
<li><p>Conditional updates</p>
</li>
<li><p>Deduplication keys</p>
</li>
<li><p>Event sequencing</p>
</li>
</ul>
<p>This enables safe replay and retry.</p>
<hr />
<h3>Durable Event Storage</h3>
<p>Queues are delivery systems, not archival systems.</p>
<p>Persisting events durably — or maintaining an append-only event log — expands recovery options beyond queue retention.</p>
<hr />
<h3>Business-Level Monitoring</h3>
<p>Emit metrics tied directly to domain outcomes.</p>
<p>Infrastructure metrics indicate delivery behavior.  </p>
<p>Business metrics indicate correctness.</p>
<p>Detection speed determines recovery complexity.</p>
<hr />
<h2>Incident Checklist</h2>
<p>When someone says “a message is missing,” proceed methodically:</p>
<ol>
<li><p>Verify publish acknowledgment and identifiers.</p>
</li>
<li><p>Compare Sent vs Received vs Deleted.</p>
</li>
<li><p>Inspect DLQ depth and payload contents.</p>
</li>
<li><p>Evaluate processing time vs visibility timeout.</p>
</li>
<li><p>Assess idempotency guarantees.</p>
</li>
<li><p>Confirm retention settings.</p>
</li>
<li><p>Compare business metrics against expected throughput.</p>
</li>
<li><p>Evaluate replay risk before executing recovery.</p>
</li>
</ol>
<p>Structured progression reduces uncertainty quickly.</p>
<hr />
<h2>Closing Thoughts</h2>
<p>Distributed systems rarely fail in dramatic ways.</p>
<p>They degrade at lifecycle boundaries:</p>
<ul>
<li><p>Retries</p>
</li>
<li><p>Timeouts</p>
</li>
<li><p>Partial commits</p>
</li>
<li><p>Schema drift</p>
</li>
<li><p>Weak observability</p>
</li>
</ul>
<p>Messages do not typically vanish outright.</p>
<p>What erodes is certainty.</p>
<p>Systems designed with traceability, idempotency, and replay in mind remain bounded during incidents.</p>
<p>Systems without those properties turn a simple Slack question into a multi-day investigation.</p>
<p>Design for clarity early. Operational confidence depends on it.</p>
]]></content:encoded></item><item><title><![CDATA[Java 21 in Distributed Systems: Bounded Concurrency, Deadlines, and Failure Containment]]></title><description><![CDATA[Modern backend services rarely perform isolated work. A single request often fans out into multiple network calls, database queries and asynchronous operations. The service is effectively coordinating]]></description><link>https://leandromaia.dev/java-21-in-distributed-systems-bounded-concurrency-deadlines-and-failure-containment</link><guid isPermaLink="true">https://leandromaia.dev/java-21-in-distributed-systems-bounded-concurrency-deadlines-and-failure-containment</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Wed, 25 Feb 2026 20:25:08 GMT</pubDate><content:encoded><![CDATA[<p>Modern backend services rarely perform isolated work. A single request often fans out into multiple network calls, database queries and asynchronous operations. The service is effectively coordinating latency rather than performing computation.</p>
<p>In that environment, reliability problems usually come from resource pressure rather than functional errors. Threads pile up waiting on I/O, retry logic multiplies work, and a slow dependency spreads delay across the system. The service remains technically “up”, but it stops behaving predictably.</p>
<p>Java 21 finally gives us practical tools to manage this properly: virtual threads and structured concurrency. They allow writing synchronous-style code while retaining the scalability properties typically associated with reactive frameworks. The real benefit appears when we combine them with three explicit controls:</p>
<ul>
<li><p>bounded concurrency</p>
</li>
<li><p>a global request deadline</p>
</li>
<li><p>cancellation propagation</p>
</li>
</ul>
<p>The combination keeps work proportional to capacity and limits the blast radius of downstream failures.</p>
<hr />
<h2>The Aggregator Problem</h2>
<p>Consider an API endpoint that returns a product page. To assemble the response, it calls several internal services:</p>
<ul>
<li><p>product metadata</p>
</li>
<li><p>pricing</p>
</li>
<li><p>inventory</p>
</li>
<li><p>reviews</p>
</li>
<li><p>recommendations</p>
</li>
</ul>
<p>Each call is fast in isolation. The endpoint is implemented sequentially first, then parallelized to improve latency.</p>
<p>Without constraints, the parallel version introduces a subtle risk: the service can now initiate many outbound calls simultaneously for every incoming request.</p>
<p>When traffic grows or a dependency slows down, the service stops being limited by CPU and becomes limited by waiting operations.</p>
<p>Each client request triggers multiple downstream calls. Under load, the number of concurrent outbound calls grows uncontrollably.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69943b5f2f5eee9031f88a4a/9210be28-b51b-4bee-bf30-099d1f19e99e.png" alt="" style="display:block;margin:0 auto" />

<p>Multiple clients multiply the pattern, and downstream latency feeds back into the caller as growing concurrency.</p>
<h2>A Realistic Aggregator Implementation</h2>
<p>A typical implementation starts simple and perfectly reasonable.</p>
<p>Sequential version:</p>
<pre><code class="language-java">public ProductPage getProductPage(String id) {
    Product product = productClient.get(id);
    Price price = pricingClient.get(id);
    Inventory inventory = inventoryClient.get(id);
    Reviews reviews = reviewsClient.get(id);

    return new ProductPage(product, price, inventory, reviews);
}
</code></pre>
<p>Latency is the sum of downstream calls.<br />If each dependency takes ~80ms, the endpoint takes ~320ms.</p>
<p>The natural next step is parallelization.</p>
<hr />
<h3>First Attempt: CompletableFuture Fan-Out</h3>
<p>Before Java 21, many teams used <code>CompletableFuture</code> to parallelize I/O:</p>
<pre><code class="language-java">public ProductPage getProductPage(String id) {

    CompletableFuture&lt;Product&gt; product =
        CompletableFuture.supplyAsync(() -&gt; productClient.get(id));

    CompletableFuture&lt;Price&gt; price =
        CompletableFuture.supplyAsync(() -&gt; pricingClient.get(id));

    CompletableFuture&lt;Inventory&gt; inventory =
        CompletableFuture.supplyAsync(() -&gt; inventoryClient.get(id));

    CompletableFuture&lt;Reviews&gt; reviews =
        CompletableFuture.supplyAsync(() -&gt; reviewsClient.get(id));

    return CompletableFuture.allOf(product, price, inventory, reviews)
        .thenApply(v -&gt; new ProductPage(
            product.join(),
            price.join(),
            inventory.join(),
            reviews.join()
        ))
        .join();
}
</code></pre>
<p>Latency improves significantly and the endpoint now behaves in parallel.</p>
<p>At this stage the service often passes load testing and looks production-ready.</p>
<hr />
<h3>Where It Starts Failing</h3>
<p>Assume:</p>
<ul>
<li><p>200 requests per second</p>
</li>
<li><p>each request calls 4 downstream services</p>
</li>
</ul>
<p>The service now initiates <strong>800 outbound requests per second</strong>.</p>
<p>If one dependency slows down — for example pricing increases from 80ms to 1.5s — those futures remain active and occupy resources much longer than expected.</p>
<p>What accumulates is not CPU work but <em>waiting work</em>:</p>
<ul>
<li><p>HTTP connections remain open</p>
</li>
<li><p>thread pools saturate</p>
</li>
<li><p>retries multiply</p>
</li>
<li><p>latency increases upstream</p>
</li>
</ul>
<p>The system is still functional, but its behavior changes under pressure. Response times become unstable and tail latency grows quickly.</p>
<p>The code is correct.<br />The concurrency model is not bounded.</p>
<hr />
<h2>Using Virtual Threads Safely</h2>
<p>Virtual threads make parallel I/O simple:</p>
<pre><code class="language-java">ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

try (executor) {
    Future&lt;Product&gt; product = executor.submit(() -&gt; productClient.get(id));
    Future&lt;Price&gt; price = executor.submit(() -&gt; pricingClient.get(id));
    Future&lt;Inventory&gt; inventory = executor.submit(() -&gt; inventoryClient.get(id));

    return new ProductPage(
        product.get(),
        price.get(),
        inventory.get()
    );
}
</code></pre>
<p>This code is easy to read and scales far better than platform threads. However, it introduces a new risk: every incoming request may create many concurrent outbound operations.</p>
<p>Virtual threads are cheap, but downstream capacity is not.</p>
<hr />
<h2>Bounded Concurrency (The Missing Control)</h2>
<p>Instead of allowing unlimited parallelism, the service should explicitly limit how many external operations it performs at once.</p>
<p>Concurrency is capped, and excess work is rejected quickly instead of accumulating.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69943b5f2f5eee9031f88a4a/df53e082-9c26-4e57-a148-7ea1e2147244.png" alt="" style="display:block;margin:0 auto" />

<p>The system sheds load instead of amplifying latency.</p>
<p>A simple and effective mechanism is a semaphore acting as a bulkhead.</p>
<pre><code class="language-java">public class DownstreamLimiter {

    private final Semaphore permits = new Semaphore(100);

    public &lt;T&gt; T call(Callable&lt;T&gt; task) throws Exception {
        if (!permits.tryAcquire(200, TimeUnit.MILLISECONDS)) {
            throw new RuntimeException("Downstream concurrency limit reached");
        }

        try {
            return task.call();
        } finally {
            permits.release();
        }
    }
}
</code></pre>
<p>Usage:</p>
<pre><code class="language-java">var limiter = new DownstreamLimiter();

Future&lt;Price&gt; price = executor.submit(
    () -&gt; limiter.call(() -&gt; pricingClient.get(id))
);
</code></pre>
<p>Now the service’s behavior depends on a defined capacity rather than incoming traffic spikes.</p>
<hr />
<h2>Deadlines Instead of Timeouts</h2>
<p>Timeouts are typically configured per call.<br />In practice, a request should have a total time budget.</p>
<p>Java 21’s structured concurrency (a preview API, enabled with <code>--enable-preview</code>) makes this straightforward:</p>
<pre><code class="language-java">try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {

    var product = scope.fork(() -&gt; productClient.get(id));
    var price = scope.fork(() -&gt; pricingClient.get(id));
    var inventory = scope.fork(() -&gt; inventoryClient.get(id));

    scope.joinUntil(Instant.now().plusMillis(300));
    scope.throwIfFailed();

    return new ProductPage(
        product.get(),
        price.get(),
        inventory.get()
    );
}
</code></pre>
<p>The deadline applies to the entire request, not individual calls.</p>
<p>When the deadline expires, <code>joinUntil</code> throws <code>TimeoutException</code>, and closing the scope interrupts any unfinished work.</p>
<hr />
<h2>Cancellation Propagation</h2>
<p>Without cancellation, a request can time out to the client while the service continues executing downstream calls. The system keeps consuming resources for a response nobody will read.</p>
<p>Structured concurrency automatically interrupts remaining tasks when the scope closes.<br />This reduces wasted work and prevents retry storms during partial failures.</p>
<p>For example, with Java 21 structured concurrency the request scope itself controls the lifecycle of downstream work:</p>
<pre><code class="language-java">public ProductPage getProductPage(String id) throws Exception {

    Instant deadline = Instant.now().plusMillis(300);

    try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {

        var product = scope.fork(() -&gt; productClient.get(id));
        var price = scope.fork(() -&gt; pricingClient.get(id));
        var inventory = scope.fork(() -&gt; inventoryClient.get(id));

        try {
            // wait only until the request deadline
            scope.joinUntil(deadline);
        } catch (TimeoutException e) {
            // deadline reached → interrupt remaining subtasks and fail fast
            scope.shutdown();
            throw new TimeoutException("request deadline exceeded");
        }

        scope.throwIfFailed();

        return new ProductPage(product.get(), price.get(), inventory.get());
    }
}
</code></pre>
<p>When the deadline expires, unfinished downstream calls are interrupted and the service stops doing work for a response the client will no longer receive.</p>
<hr />
<h2>Operational Impact</h2>
<p>Three behaviors change immediately:</p>
<ol>
<li><p>Slow dependencies no longer saturate threads.</p>
</li>
<li><p>Retries decrease because requests fail quickly.</p>
</li>
<li><p>Latency distribution becomes tighter (p99 improves even if p50 does not).</p>
</li>
</ol>
<p>The service stops amplifying downstream instability.</p>
<p>In practice this becomes very visible in observability tooling. A typical Datadog APM view during an incident looks like this:</p>
<p><strong>Before bounded concurrency</strong></p>
<ul>
<li><p><code>api-service</code> p99 latency: 2.8s</p>
</li>
<li><p>error rate: low (system is technically healthy)</p>
</li>
<li><p>active requests: continuously growing</p>
</li>
<li><p>downstream <code>pricing-service</code> latency: elevated but stable</p>
</li>
</ul>
<p>In the APM flame graph, most of the request time appears as <em>waiting</em>, not CPU work.<br />The main span shows long gaps where the service is idle but holding resources.</p>
<p>Trace Analytics often shows:</p>
<ul>
<li><p>many concurrent traces stuck in <code>http.client</code></p>
</li>
<li><p>connection pool saturation</p>
</li>
<li><p>retries from upstream clients</p>
</li>
</ul>
<p>After introducing concurrency limits and deadlines:</p>
<p><strong>After bounded concurrency + deadline</strong></p>
<ul>
<li><p><code>api-service</code> p99 latency: 350–450ms</p>
</li>
<li><p>some requests fail fast (429/timeout)</p>
</li>
<li><p>active requests plateau instead of growing</p>
</li>
<li><p>downstream latency unchanged</p>
</li>
</ul>
<p>The important change is not that the dependency became faster.<br />The service stopped amplifying its slowness.</p>
<p>In Datadog’s service map, the edge between <code>api-service</code> and <code>pricing-service</code> changes from a thick, high-latency connection to a stable one with lower request volume. The number of concurrent traces drops sharply, and flame graphs become short and consistent rather than long with idle gaps.</p>
<p>The system did not gain capacity.<br />It regained control over how work is admitted.</p>
<hr />
<h2>Final Thoughts</h2>
<p>Virtual threads make concurrency easier, but they also make it easier to create unbounded work. Distributed systems reward services that keep strict control over resource usage.</p>
<p>Bounded fan-out, deadlines and cancellation form a small set of constraints that dramatically improve production behavior. Instead of reacting to incidents, the service actively limits the scope of failures.</p>
<p>The code remains straightforward and synchronous, but the operational characteristics become much closer to a well-designed asynchronous system.</p>
]]></content:encoded></item><item><title><![CDATA[The Operational Cost of LLM APIs]]></title><description><![CDATA[Large language model APIs feel deceptively simple from an engineering perspective. You send a prompt, you receive text. Compared to provisioning databases, tuning JVM memory or debugging distributed locks, the integration looks almost trivial. A singl...]]></description><link>https://leandromaia.dev/the-operational-cost-of-llm-apis</link><guid isPermaLink="true">https://leandromaia.dev/the-operational-cost-of-llm-apis</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 17 Feb 2026 13:06:55 GMT</pubDate><content:encoded><![CDATA[<p>Large language model APIs feel deceptively simple from an engineering perspective.<br />You send a prompt, you receive text. Compared to provisioning databases, tuning JVM memory or debugging distributed locks, the integration looks almost trivial. A single HTTP request and a JSON response. Most SDKs make it possible to have a working prototype in less than an afternoon.</p>
<p>Because of that simplicity, teams often evaluate LLM features as product work rather than infrastructure work. They estimate development time, UX complexity and maybe latency. What they rarely estimate correctly is operational behavior.</p>
<p>An LLM integration is not just a remote function call. It is a probabilistic, metered, latency-sensitive external compute dependency whose cost scales with <em>user behavior</em>, not with system capacity. That distinction matters much more than it initially appears.</p>
<hr />
<h2 id="heading-the-invisible-meter">The Invisible Meter</h2>
<p>Traditional backend infrastructure has a fairly intuitive scaling model.<br />If your system doubles in users, CPU usage and database load grow in a somewhat predictable way. Engineers already know how to reason about it: caching, queues, horizontal scaling and rate limits.</p>
<p>LLM APIs introduce a different scaling axis: token consumption.</p>
<p>The system is no longer paying per request, nor per server, nor per hour of uptime. It is paying for <em>every unit of generated reasoning</em>. A single user interaction can cost more than thousands of database reads. And the expensive part is not the request itself — it is the output length and how many times the user retries, iterates or explores.</p>
<p>Unlike most APIs, LLM usage encourages repetition. Users don’t submit one request. They refine.</p>
<p>They adjust a prompt, regenerate, ask for another version, ask for clarification, request expansion and then ask the model to rewrite the answer in a different tone. From a human perspective this feels like one interaction. From a billing perspective it can be fifteen.</p>
<p>This is the first operational shift: cost is tied to <em>conversation depth</em>, not traffic volume.</p>
<hr />
<h2 id="heading-a-small-saas-100-users">A Small SaaS: 100 Users</h2>
<p>Imagine a small productivity SaaS that adds an AI assistant to help users draft reports.<br />The team estimates that each user will generate about five reports per day, and each report requires one LLM call. They calculate cost assuming roughly 500 requests daily. The feature looks financially safe.</p>
<p>In reality, usage looks different.</p>
<p>A single report becomes a dialogue:</p>
<ul>
<li><p>“summarize this data”</p>
</li>
<li><p>“make it shorter”</p>
</li>
<li><p>“add a professional tone”</p>
</li>
<li><p>“rewrite for a technical audience”</p>
</li>
<li><p>“give me three alternative versions”</p>
</li>
</ul>
<p>The user did not use the system five times. They used it once — but the backend performed five LLM invocations. Many users will also retry when latency exceeds a few seconds because they assume the system stalled.</p>
<p>After deployment, the service stabilizes around 100 daily active users.<br />However, instead of 500 model calls per day, the system performs closer to 3,000.</p>
<p>Nothing is broken.<br />The feature is popular.</p>
<p>But the cost model is wrong by a factor of six.</p>
<p>The engineering system is healthy. The product is successful. Yet finance notices that the new feature is now the single largest operating expense of the platform. The infrastructure did not scale unexpectedly — <em>human curiosity did</em>.</p>
<p>At this stage the problem is manageable, but a second operational effect appears: engineers start shaping user behavior. They introduce response limits, shorten outputs and add cooldowns. This is unusual; backend engineers rarely need to think about how wording affects infrastructure cost. With LLMs, prompt design and product UX directly influence operating margins.</p>
<hr />
<h2 id="heading-the-spike-scenario-10000-users">The Spike Scenario: 10,000 Users</h2>
<p>Now consider a different situation.</p>
<p>The same SaaS releases a new “AI project planner” feature.<br />A well-known influencer shares it, and over two days the product receives 10,000 new users. This kind of spike is familiar in SaaS. Usually the concern is database capacity, queue backlog or CPU saturation. Auto-scaling groups exist precisely for this scenario.</p>
<p>But LLM APIs do not scale with your servers.</p>
<p>Your system may handle the HTTP traffic perfectly while your costs grow faster than your infrastructure ever could.</p>
<p>Let’s assume each new user performs ten exploratory interactions during onboarding — which is realistic because new users experiment more than established ones. If each interaction consumes a moderately sized prompt and response, the system may suddenly generate hundreds of thousands of tokens per hour.</p>
<p>Nothing crashes.<br />There is no 500 error.<br />Your monitoring dashboards remain green.</p>
<p>However, the billing dashboard tells a different story. In less than 48 hours, the LLM usage cost exceeds the previous month’s total infrastructure spend.</p>
<p>This is a uniquely uncomfortable operational situation. Traditional incidents degrade service; this incident degrades the company’s financial predictability. Engineers cannot fix it by scaling servers or restarting workers. The system is behaving correctly.</p>
<p>The system is simply <em>too successful too quickly</em>.</p>
<hr />
<h2 id="heading-latency-is-also-operational-cost">Latency Is Also Operational Cost</h2>
<p>There is another operational dimension beyond billing.</p>
<p>LLM APIs have variable latency. A database query might fluctuate between 5 and 20 milliseconds. An LLM response might vary between 2 seconds and 25 seconds depending on load and output length.</p>
<p>Users react strongly to waiting. When a response takes longer than expected, they retry, refresh or open multiple tabs. Each retry is a new model invocation. Latency therefore multiplies cost.</p>
<p>In distributed systems we often worry about retry storms against downstream services. LLM integrations can produce a similar pattern, except the downstream system is a metered compute provider. A single slow period can double both request volume and billing simultaneously.</p>
<p>This creates a feedback loop: slower responses cause retries, retries cause higher usage, higher usage increases latency and the cycle repeats.</p>
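<p>A standard way to dampen this loop on the client side is exponential backoff with jitter. A sketch, where the base delay and cap are assumed values rather than provider recommendations:</p>
<pre><code class="language-java">import java.util.concurrent.ThreadLocalRandom;

// Sketch: exponential backoff with full jitter. Randomizing the delay spreads
// retries over time so a slow period is not met by synchronized retry waves.
public class Backoff {

    static final long BASE_MS = 500;    // assumed base delay
    static final long CAP_MS = 30_000;  // assumed maximum delay

    /** Delay before a given retry attempt (attempt 0 is the first retry). */
    public static long delayMillis(int attempt) {
        double exp = Math.min((double) CAP_MS, BASE_MS * Math.pow(2, attempt));
        return ThreadLocalRandom.current().nextLong((long) exp + 1); // full jitter
    }
}
</code></pre>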
<hr />
<h2 id="heading-why-caching-is-not-straightforward">Why Caching Is Not Straightforward</h2>
<p>The natural engineering instinct is caching.<br />If the same question is asked, store the answer.</p>
<p>The difficulty is that LLM requests are rarely identical. Small wording changes produce different prompts and therefore different cache keys. Even when two prompts are semantically equivalent, they are textually different. Traditional caching strategies depend on deterministic inputs; conversational systems are inherently non-deterministic.</p>
<p>You can cache aggressively for structured tasks (classification, tagging, summarization templates), but creative or exploratory usage — precisely the usage users value — resists caching.</p>
<p>This is why LLM integrations behave more like human labor than like computation. Each request is unique work.</p>
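<p>For those structured tasks, caching on a normalized prompt can still pay off. A sketch; the normalization rules here are illustrative and do not solve semantic equivalence:</p>
<pre><code class="language-java">import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: trivially different prompts (case, whitespace) collapse into one
// cache key. This helps templated tasks; free-form prompts will still miss.
public class PromptCache {

    private final Map&lt;String, String&gt; cache = new ConcurrentHashMap&lt;&gt;();

    static String normalize(String prompt) {
        return prompt.trim().toLowerCase().replaceAll("\\s+", " ");
    }

    public String complete(String prompt, Function&lt;String, String&gt; model) {
        // the model is only invoked on a cache miss
        return cache.computeIfAbsent(normalize(prompt), model);
    }
}
</code></pre>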
<hr />
<h2 id="heading-operational-mitigations">Operational Mitigations</h2>
<p>Over time, teams operating LLM features converge on similar patterns:</p>
<ul>
<li><p>explicit rate limiting per user</p>
</li>
<li><p>bounded output size</p>
</li>
<li><p>asynchronous processing for long tasks</p>
</li>
<li><p>progressive responses instead of regeneration</p>
</li>
<li><p>usage quotas tied to subscription plans</p>
</li>
<li><p>internal token accounting, not just request counting</p>
</li>
</ul>
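<p>As a rough sketch of the last two items, per-user token accounting can start this small. The budget value and the <code>try_spend</code> API here are invented for illustration; real systems back this with a durable store:</p>
<pre><code class="language-python">from collections import defaultdict

DAILY_BUDGET = 50_000  # assumed tokens per user per day

class TokenQuota:
    def __init__(self, budget=DAILY_BUDGET):
        self.budget = budget
        self.used = defaultdict(int)  # user_id -&gt; tokens spent today

    def try_spend(self, user_id, tokens):
        """Reserve tokens for a request; refuse once the budget is gone."""
        remaining = self.budget - self.used[user_id]
        granted = min(tokens, max(0, remaining))
        allowed = (granted == tokens)   # the full request fits in budget
        if allowed:
            self.used[user_id] += tokens
        return allowed

quota = TokenQuota(budget=100)
print(quota.try_spend("u1", 60))  # True
print(quota.try_spend("u1", 60))  # False: only 40 tokens left
</code></pre>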
<p>The most important shift is cultural. Engineers begin tracking not only latency and error rate but also <em>token burn rate</em>. Observability expands from technical health to economic health.</p>
<p>In practice, a production dashboard for an AI feature often includes: request latency, error rate, queue backlog and daily cost per active user. All four become operational signals.</p>
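<p>A minimal sketch of that last signal, assuming hypothetical per-token prices and a made-up request log — the field names are illustrative, not a real schema:</p>
<pre><code class="language-python">from collections import defaultdict

requests = [
    {"user": "u1", "tokens_in": 900,  "tokens_out": 400},
    {"user": "u1", "tokens_in": 1200, "tokens_out": 2500},
    {"user": "u2", "tokens_in": 300,  "tokens_out": 150},
]
PRICE_IN = 3.0 / 1_000_000    # assumed $ per input token
PRICE_OUT = 15.0 / 1_000_000  # assumed $ per output token

spend = defaultdict(float)
for r in requests:
    spend[r["user"]] += r["tokens_in"] * PRICE_IN + r["tokens_out"] * PRICE_OUT

daily_cost_per_active_user = sum(spend.values()) / len(spend)
print(f"{daily_cost_per_active_user:.6f}")
</code></pre>
<p>Note that output tokens typically cost several times more than input tokens, which is why bounded output size appears in the mitigation list above.</p>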
<hr />
<h2 id="heading-human-behavior-becomes-infrastructure">Human Behavior Becomes Infrastructure</h2>
<p>The core lesson is that LLM APIs move part of system reliability into user psychology.</p>
<p>Traditional backend engineering assumes users submit discrete actions. LLM interfaces encourage exploration. People converse, iterate and experiment. The system is no longer processing transactions; it is hosting thinking sessions.</p>
<p>Because of that, operational planning must account for behavior patterns rather than just concurrency. The number of users matters less than <em>how they engage</em>.</p>
<p>A hundred power users can cost more than ten thousand passive ones.<br />A successful onboarding flow can be more expensive than steady-state usage.<br />A viral moment can become a financial incident before it becomes a technical one.</p>
<hr />
<h2 id="heading-final-thought">Final Thought</h2>
<p>Integrating an LLM API is easy. Operating it responsibly is not.</p>
<p>The challenge is not calling the model. The challenge is predicting and shaping the interaction between human curiosity and metered computation. Traditional systems fail when servers overload. LLM-enabled systems can remain perfectly stable while the operational cost becomes the primary reliability risk.</p>
<p>When adopting these systems, engineers are not only managing software behavior anymore.<br />They are managing an economic feedback loop attached to human conversation.</p>
]]></content:encoded></item><item><title><![CDATA[Why AI Features Are Becoming Reliability Problems]]></title><description><![CDATA[Over the last year, many products added AI features.
Chat assistants, automatic summaries, classification, recommendations, drafting emails, generating documentation, suggesting actions. In many cases]]></description><link>https://leandromaia.dev/why-ai-features-are-becoming-reliability-problems</link><guid isPermaLink="true">https://leandromaia.dev/why-ai-features-are-becoming-reliability-problems</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 10 Feb 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Over the last year, many products added AI features.</p>
<p>Chat assistants, automatic summaries, classification, recommendations, drafting emails, generating documentation, suggesting actions. In many cases these features were relatively easy to ship. An API call, some prompt engineering, a bit of UI, and something useful appears on the screen.</p>
<p>From an implementation perspective, they often look simpler than traditional backend features.</p>
<p>From an operational perspective, they are not.</p>
<p>What many teams are discovering is that AI features rarely fail like software systems used to fail.</p>
<p>They create a new category of reliability problem.</p>
<hr />
<h2>Traditional failures are visible</h2>
<p>Historically, backend reliability issues were easy to detect.</p>
<p>A service crashed.<br />A database timed out.<br />An endpoint returned 500.<br />Latency spiked.</p>
<p>Monitoring worked because failures were explicit. Systems either produced a correct response or an error. Alerts fired, dashboards changed, and on-call engineers investigated.</p>
<p>The system clearly signaled: something is wrong.</p>
<p>AI features do not behave this way.</p>
<p>They usually return a response.</p>
<hr />
<h2>The problem is plausible wrongness</h2>
<p>An LLM rarely returns a null pointer, a stack trace, or a malformed response. Instead, it produces something coherent and confident that looks reasonable to both monitoring systems and users.</p>
<p>The output is syntactically valid.<br />The API returned 200.<br />Latency is normal.</p>
<p>But the behavior is unacceptable.</p>
<p>A summary omits critical information.<br />A classification routes a ticket to the wrong team.<br />A generated email misrepresents the situation.<br />A suggested action creates operational confusion.</p>
<p>Nothing technically failed, yet the system did the wrong thing.</p>
<p>This is a reliability issue, but it does not look like one.</p>
<hr />
<h2>Monitoring no longer detects the incident</h2>
<p>Traditional observability assumes failures are binary: success or error.</p>
<p>AI features introduce a third state: successful but incorrect.</p>
<p>HTTP metrics remain healthy.<br />Error rates remain low.<br />Infrastructure dashboards look normal.</p>
<p>The first signal of a problem often comes from support tickets, confused users, or business teams noticing unexpected behavior. In other words, the monitoring system becomes human.</p>
<p>This is a major shift. Reliability engineering historically depended on detecting technical anomalies. With AI features, the anomaly is semantic.</p>
<p>The system worked exactly as implemented, but not as intended.</p>
<hr />
<h2>Testing becomes weaker</h2>
<p>Testing AI features is also different from testing deterministic systems.</p>
<p>Traditional features can be validated with assertions. Given an input, the expected output is known. Automated tests verify correctness with high confidence.</p>
<p>AI systems produce distributions, not exact outputs. The same prompt can yield slightly different results. The challenge is not whether the system returns a response, but whether the response is acceptable.</p>
<p>A test suite can confirm the feature runs.<br />It cannot easily confirm the feature behaves appropriately across real usage.</p>
<p>This weakens one of the strongest reliability tools teams rely on before deployment.</p>
<hr />
<h2>Rollbacks stop working</h2>
<p>When a normal feature causes problems, rollback is a reliable safety mechanism. Reverting the change restores the previous behavior.</p>
<p>AI failures often do not map cleanly to deploys.</p>
<p>The model may be external.<br />The data distribution changed.<br />The prompt interacts differently with real user inputs.<br />Caching stores incorrect outputs.<br />Fine-tuned behavior evolves.</p>
<p>The incident does not necessarily correspond to a specific code release. Teams may see a degradation in behavior without any deployment occurring. From an operational perspective, this is deeply unfamiliar territory.</p>
<p>You cannot always roll back to a previous commit if the system’s behavior depends on probabilistic outputs and live data.</p>
<hr />
<h2>Support becomes part of the reliability pipeline</h2>
<p>In many organizations, customer support has quietly become the primary detection system for AI issues.</p>
<p>Users report confusing results.<br />Operators notice inconsistent classifications.<br />Internal teams start distrusting recommendations.</p>
<p>The reliability loop changes:</p>
<p>previously → monitoring detects → engineering investigates<br />now → users notice → support escalates → engineering investigates</p>
<p>The incident is real, but it emerges socially rather than technically.</p>
<hr />
<h2>Why this matters for system design</h2>
<p>AI features are often introduced as product enhancements, but they behave operationally like external dependencies with unpredictable behavior. They should be treated less like deterministic code and more like a probabilistic subsystem.</p>
<p>This affects architectural decisions.</p>
<p>You need:</p>
<ul>
<li><p>fallback behaviors</p>
</li>
<li><p>human override paths</p>
</li>
<li><p>auditability</p>
</li>
<li><p>clear UI communication</p>
</li>
<li><p>bounded authority</p>
</li>
</ul>
<p>In other words, you design not only for failure, but for uncertainty.</p>
<p>The question is no longer “what happens if the service is down?” but “what happens if the service is confidently wrong?”</p>
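<p>One way to sketch "bounded authority" in code: the model proposes, deterministic rules decide whether it may act alone. Here <code>classify()</code> is a hypothetical stand-in for a model call, and the label set and confidence threshold are invented:</p>
<pre><code class="language-python"># Sketch only: classify() and its output shape are placeholders.
ALLOWED_LABELS = {"billing", "technical", "account"}

def classify(ticket_text):
    # stand-in for a model call; it always answers confidently
    return {"label": "billing", "confidence": 0.62}

def route_ticket(ticket_text):
    result = classify(ticket_text)
    label, conf = result["label"], result["confidence"]
    # Guardrails: unknown labels or low confidence fall back to a
    # human, and the decision is logged for audit either way.
    confident = (min(conf, 0.80) == 0.80)  # confidence at or above 0.80
    if label in ALLOWED_LABELS and confident:
        decision = ("auto_route", label)
    else:
        decision = ("human_review", label)
    print(f"audit: label={label} conf={conf} decision={decision[0]}")
    return decision

print(route_ticket("I was charged twice"))
</code></pre>
<p>The model still did its job; the system simply refuses to grant it unilateral authority over a consequential action.</p>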
<hr />
<h2>A different reliability mindset</h2>
<p>For years, backend reliability engineering focused on availability and latency. Systems were considered healthy when they responded quickly and without errors.</p>
<p>AI systems expand the definition of reliability. A system can be fully available and still operationally harmful.</p>
<p>Correctness now includes behavioral trust.</p>
<p>Teams that succeed with AI features are not the ones that integrate models fastest, but the ones that design guardrails around them. The work shifts from implementing functionality to managing risk.</p>
<p>The engineering challenge is no longer only keeping systems online.</p>
<p>It is keeping them trustworthy.</p>
]]></content:encoded></item><item><title><![CDATA[What Technical Interviews in Distributed Systems Actually Test]]></title><description><![CDATA[Modern backend engineering increasingly revolves around distributed systems.As a consequence, many technical interviews — even for senior and leadership roles — are designed around deceptively simple ]]></description><link>https://leandromaia.dev/what-technical-interviews-in-distributed-systems-actually-test</link><guid isPermaLink="true">https://leandromaia.dev/what-technical-interviews-in-distributed-systems-actually-test</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 03 Feb 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Modern backend engineering increasingly revolves around distributed systems.<br />As a consequence, many technical interviews — even for senior and leadership roles — are designed around deceptively simple scenarios: a text editor, a counter, a cart, a document, a status update.</p>
<p>Then the interviewer asks:</p>
<blockquote>
<p>“Why did the system end up with the wrong value?”</p>
</blockquote>
<p>Very often, the correct answer is not about architecture diagrams, microservices, or cloud providers.</p>
<p>It is about <strong>concurrency</strong>.</p>
<p>Below are some of the core concepts these interviews tend to probe, and why they matter in real systems.</p>
<hr />
<h2>1. Race Conditions: The Default State of Distributed Systems</h2>
<p>A race condition occurs when multiple operations access and modify shared state concurrently, and the final result depends on the timing of execution rather than the logical order of events.</p>
<p>Consider a simple pattern:</p>
<pre><code class="language-plaintext">read current value
apply change
write new value
</code></pre>
<p>If two requests execute simultaneously across two backend instances, both may read the same previous value and overwrite each other.</p>
<p>This is known as the <strong>lost update problem</strong>.</p>
<p>The system did not crash.<br />No exception occurred.<br />Every operation “succeeded”.</p>
<p>Yet the state is incorrect.</p>
<p>This is one of the most common real production bugs in multi-instance services.</p>
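<p>The lost update is easy to reproduce even on one machine. A small Python sketch, with two threads standing in for two backend instances:</p>
<pre><code class="language-python">import threading

# Minimal reproduction of the lost update problem: two writers read
# the same value, both write, and increments can silently vanish.
counter = {"value": 0}

def unsafe_increment(n):
    for _ in range(n):
        current = counter["value"]       # read
        counter["value"] = current + 1   # write (no lock between the two)

threads = [threading.Thread(target=unsafe_increment, args=(100_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 200000; interleaved read-modify-write may produce less.
print(counter["value"])
</code></pre>
<p>Whether updates are actually lost depends on scheduling, which is exactly the point: the bug is timing-dependent and invisible in tests that run serially.</p>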
<hr />
<h2>2. The Illusion of Ordering</h2>
<p>Engineers often intuitively assume that requests arrive and are processed in order.</p>
<p>In practice:</p>
<ul>
<li><p>clients retry</p>
</li>
<li><p>networks reorder packets</p>
</li>
<li><p>load balancers distribute requests</p>
</li>
<li><p>UDP is unordered</p>
</li>
<li><p>mobile devices reconnect</p>
</li>
<li><p>clocks differ</p>
</li>
</ul>
<p>The system does not process “events”.<br />It processes <strong>arrivals</strong>.</p>
<p>These are not the same thing.</p>
<p>A later user action can be processed before an earlier one.<br />Without safeguards, the system may persist an older state after a newer one.</p>
<hr />
<h2>3. Why “Read Then Write” Is Dangerous</h2>
<p>Many naive implementations rely on:</p>
<pre><code class="language-plaintext">SELECT state
compute new state
UPDATE state
</code></pre>
<p>In a single-threaded program this is safe.</p>
<p>In distributed systems, this is a critical section — but there is no lock.</p>
<p>Two processes can execute this sequence simultaneously and overwrite each other.<br />This is not a performance issue. It is a <strong>correctness issue</strong>.</p>
<p>Scaling stateless services horizontally amplifies this risk because concurrency increases with capacity.</p>
<hr />
<h2>4. Typical Solutions</h2>
<p>There is no single universal fix. Instead, systems use different consistency strategies.</p>
<h3>4.1 Optimistic Concurrency Control (Versioning)</h3>
<p>Each record carries a version:</p>
<pre><code class="language-plaintext">UPDATE document
SET content = ?, version = version + 1
WHERE id = ? AND version = ?
</code></pre>
<p>Only one writer succeeds. Others must retry.</p>
<p>This is effectively a <strong>compare-and-swap (CAS)</strong>.</p>
<p>Widely used in:</p>
<ul>
<li><p>relational databases</p>
</li>
<li><p>DynamoDB conditional writes</p>
</li>
<li><p>document stores</p>
</li>
</ul>
<p>It prevents lost updates without heavy locking.</p>
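<p>The same pattern in miniature, with a dict standing in for the row and a lock standing in for the database's per-row atomicity:</p>
<pre><code class="language-python">import threading

lock = threading.Lock()  # stands in for the DB applying condition+write atomically
doc = {"content": "", "version": 0}

def cas_write(expected_version, new_content):
    """Succeeds only if nobody else wrote since we read."""
    with lock:
        if doc["version"] == expected_version:
            doc["content"] = new_content
            doc["version"] += 1
            return True
        return False

def save_with_retry(mutate, attempts=5):
    for _ in range(attempts):
        snapshot = dict(doc)                        # read
        new_content = mutate(snapshot["content"])   # compute
        if cas_write(snapshot["version"], new_content):
            return True
    return False                                    # give up after N conflicts

save_with_retry(lambda c: c + "A")
save_with_retry(lambda c: c + "B")
print(doc)  # {'content': 'AB', 'version': 2}
</code></pre>
<p>A concurrent writer holding a stale version simply fails the condition and retries against the fresh state, so no update is silently lost.</p>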
<hr />
<h3>4.2 Idempotency</h3>
<p>Requests should be safe to repeat.</p>
<p>If the same operation arrives twice (retries, network duplication), the system should not produce a different result.</p>
<p>This is essential in:</p>
<ul>
<li><p>payment systems</p>
</li>
<li><p>event consumers</p>
</li>
<li><p>APIs behind unreliable networks</p>
</li>
</ul>
<p>Idempotency keys or operation identifiers allow systems to detect duplicates.</p>
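<p>A minimal sketch of the pattern, using an in-memory dict where a real system would use a durable store:</p>
<pre><code class="language-python"># Sketch of idempotency-key handling for a payment-style endpoint.
processed = {}  # idempotency_key -&gt; stored response

def charge(idempotency_key, amount):
    if idempotency_key in processed:
        # Duplicate delivery (client retry, network replay):
        # return the original result instead of charging again.
        return processed[idempotency_key]
    result = {"status": "charged", "amount": amount}  # side effect runs once
    processed[idempotency_key] = result
    return result

first = charge("key-123", 50)
second = charge("key-123", 50)   # retry of the same operation
print(first is second)           # True: same stored outcome, no double charge
</code></pre>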
<hr />
<h3>4.3 Event Ordering</h3>
<p>Sometimes the state should reflect the <em>latest logical event</em>, not the last write.</p>
<p>Solutions include:</p>
<ul>
<li><p>timestamps (careful: clocks drift)</p>
</li>
<li><p>logical clocks</p>
</li>
<li><p>sequence numbers per entity</p>
</li>
<li><p>monotonic versioning</p>
</li>
</ul>
<p>The key insight:<br /><strong>Last write ≠ most recent action.</strong></p>
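<p>A sketch of per-entity sequence numbers (all names invented): a delayed older arrival is detected and ignored instead of clobbering newer state:</p>
<pre><code class="language-python">state = {}  # entity_id -&gt; {"seq": int, "value": ...}

def apply_event(entity_id, seq, value):
    current = state.get(entity_id, {"seq": -1})
    newest = max(current["seq"], seq)
    # apply only if this event's sequence number is strictly newer
    if newest == seq and seq != current["seq"]:
        state[entity_id] = {"seq": seq, "value": value}
        return "applied"
    return "ignored (stale arrival)"

print(apply_event("doc-1", 1, "draft"))   # applied
print(apply_event("doc-1", 3, "final"))   # applied
print(apply_event("doc-1", 2, "edited"))  # ignored: arrived late
print(state["doc-1"]["value"])            # final
</code></pre>
<p>The sequence number must be assigned by something with authority over the entity (a database counter, a single writer), not by client clocks.</p>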
<hr />
<h3>4.4 Serialization via Queues</h3>
<p>Instead of multiple concurrent writers:</p>
<pre><code class="language-plaintext">clients → queue → single consumer → database
</code></pre>
<p>Queues provide ordering and eliminate write races at the cost of latency and throughput constraints.</p>
<p>Common in:</p>
<ul>
<li><p>collaborative editing</p>
</li>
<li><p>inventory systems</p>
</li>
<li><p>financial ledgers</p>
</li>
</ul>
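<p>The shape of the pattern in a few lines, using Python's standard-library queue as a stand-in for a real broker:</p>
<pre><code class="language-python">import queue
import threading

# clients → queue → single consumer → database
writes = queue.Queue()
db = {"stock": 10}

def consumer():
    while True:
        op = writes.get()
        if op is None:        # shutdown sentinel
            break
        db["stock"] += op     # only this thread ever touches db
        writes.task_done()

worker = threading.Thread(target=consumer)
worker.start()
for delta in (-1, -1, -3):    # concurrent clients just enqueue
    writes.put(delta)
writes.put(None)
worker.join()
print(db["stock"])  # 5
</code></pre>
<p>No write races are possible because writes are serialized; the trade is that the single consumer becomes the throughput ceiling.</p>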
<hr />
<h2>5. Why Timestamps Alone Are Not Enough</h2>
<p>A common first instinct is to store timestamps and keep the “latest”.</p>
<p>This works only if:</p>
<ul>
<li><p>clocks are synchronized</p>
</li>
<li><p>events are monotonic</p>
</li>
<li><p>no client is offline</p>
</li>
<li><p>no retries occur</p>
</li>
</ul>
<p>In real systems:</p>
<ul>
<li><p>client clocks lie</p>
</li>
<li><p>mobile reconnects happen</p>
</li>
<li><p>messages are delayed</p>
</li>
</ul>
<p>Relying solely on timestamps often replaces a race condition with a <strong>time consistency bug</strong>.</p>
<hr />
<h2>6. Databases Do Not Automatically Save You</h2>
<p>Developers often assume the database guarantees correctness.</p>
<p>Databases guarantee <strong>atomicity per operation</strong>, not per workflow.</p>
<p>This is atomic:</p>
<pre><code class="language-plaintext">UPDATE row SET value = 5
</code></pre>
<p>This is not:</p>
<pre><code class="language-plaintext">READ row
MODIFY
WRITE row
</code></pre>
<p>Without isolation (locks or conditional updates), the database cannot detect a logical conflict.</p>
<p>The bug lives above the database layer.</p>
<hr />
<h2>7. What These Questions Really Evaluate</h2>
<p>Concurrency questions in interviews are not about memorizing definitions.</p>
<p>They evaluate whether you understand:</p>
<ul>
<li><p>the difference between scaling and correctness</p>
</li>
<li><p>state vs events</p>
</li>
<li><p>arrival vs ordering</p>
</li>
<li><p>retries vs duplicates</p>
</li>
<li><p>atomic operations vs atomic workflows</p>
</li>
</ul>
<p>In other words:</p>
<blockquote>
<p>Do you design systems assuming the network is unreliable and multiple things happen at once?</p>
</blockquote>
<p>Because in production, they always do.</p>
<hr />
<h2>8. A Useful Mental Model</h2>
<p>Single-machine programming assumes:</p>
<blockquote>
<p>“Things happen one after another.”</p>
</blockquote>
<p>Distributed systems require assuming:</p>
<blockquote>
<p>“Everything happens at the same time, out of order, and at least once.”</p>
</blockquote>
<p>Once you adopt this model, many design decisions change:</p>
<ul>
<li><p>APIs</p>
</li>
<li><p>database writes</p>
</li>
<li><p>caching</p>
</li>
<li><p>retries</p>
</li>
<li><p>message processing</p>
</li>
</ul>
<p>Concurrency is not an edge case.<br />It is the baseline.</p>
<hr />
<h2>Closing Thoughts</h2>
<p>Many system design discussions focus on scale, cloud architecture, and service boundaries.</p>
<p>But some of the most critical failures in real systems come from simpler issues:<br />two valid operations interacting in an invalid way.</p>
<p>Before worrying about microservices, queues, or multi-region deployments, systems must answer a more fundamental question:</p>
<p><strong>What happens when two users change the same thing at the same time?</strong></p>
<p>The answer to that question often defines whether a system is merely scalable — or actually correct.</p>
]]></content:encoded></item><item><title><![CDATA[AI Made Writing Code Easier — Software Development Didn’t Get Easier]]></title><description><![CDATA[Over the last year, most conversations about AI in software engineering have revolved around speed. People ask whether engineers are now two, five, or ten times faster. The assumption behind that ques]]></description><link>https://leandromaia.dev/ai-made-writing-code-easier-software-development-didnt-get-easier</link><guid isPermaLink="true">https://leandromaia.dev/ai-made-writing-code-easier-software-development-didnt-get-easier</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 27 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Over the last year, most conversations about AI in software engineering have revolved around speed. People ask whether engineers are now two, five, or ten times faster. The assumption behind that question is that writing code was the main thing slowing software delivery down.</p>
<p>In practice, that rarely matched my experience. Even before AI, most teams I worked with were not waiting on someone to type faster. They were waiting on decisions, coordination, and risk assessment. The work around the code mattered more than the code itself.</p>
<p>AI didn’t remove the bottleneck. It moved it.</p>
<hr />
<h2>We were rarely waiting on code</h2>
<p>In a small team, code is a large part of the job. A few engineers understand the whole system and progress mostly depends on someone sitting down and implementing the feature. But once systems and organizations grow, delivery depends on a chain of human processes. Someone needs to decide the feature is worth doing, multiple teams need to agree on behavior, operational risk has to be evaluated, and someone has to support the system after customers start relying on it.</p>
<p>Because implementation was expensive, it acted as a natural filter. If a feature required weeks of work, someone had to justify it. Discussions were more careful, and priorities were clearer. That friction was often frustrating, but it forced clarity.</p>
<p>AI changed that dynamic. Implementation became cheap enough that the filter weakened. The number of things that <em>could</em> be built suddenly increased, but the organization’s ability to evaluate them did not.</p>
<hr />
<h2>Cheap implementation increases volume</h2>
<p>Lowering the cost of implementation does not automatically produce better outcomes. It produces more outcomes. Teams can prototype faster, experiment more, and try more variations of ideas. On paper this sounds like pure productivity, but most organizations are not structured to process a large volume of change.</p>
<p>Before, weak ideas often died early because they were costly. Now they survive longer because they are easy to implement. The result is not necessarily better software — it is more software entering the system. The constraint shifts from “can we build this?” to “should this exist at all?”</p>
<p>This is where much of the real work now lives.</p>
<hr />
<h2>The new work: integration</h2>
<p>What I see in practice is not dramatically faster systems, but a higher number of partially correct ones. AI-generated code frequently looks reasonable and works locally. The problems appear when the code interacts with the rest of the system.</p>
<p>Real systems have expectations that are rarely explicit: data contracts, operational behavior, retry logic, monitoring, and ownership boundaries. Software rarely fails because a function was hard to write. It fails because multiple correct components interact in an incorrect way.</p>
<p>AI handles the first 80% of implementation easily. The remaining 20% — understanding how a change behaves in production — remains difficult. Engineers spend less time creating code from scratch and more time validating, adapting, and stabilizing generated work so that it behaves predictably in a larger system.</p>
<hr />
<h2>The review bottleneck</h2>
<p>One immediate consequence is that review capacity becomes a constraint. The number of changes increases faster than the organization’s ability to understand them. Teams did not suddenly gain more reviewers, deeper system knowledge, or better operational visibility. They simply gained more code.</p>
<p>As a result, engineers are often less limited by writing code than by reading it. Code review used to involve carefully reasoning about a focused change. Now it frequently involves evaluating large generated modifications whose correctness depends on context not visible in the diff.</p>
<p>Speed increased on the production side, but not on the understanding side. And most software failures originate from misunderstanding, not syntax errors.</p>
<hr />
<h2>AI amplifies existing problems</h2>
<p>AI does not introduce entirely new dysfunctions. It amplifies what already exists. If prioritization is weak, more low-value work appears. If ownership is unclear, integration failures multiply. If coordination is slow, conflicts increase.</p>
<p>The surrounding organization — support teams, product processes, operations, training — still operates at human speed. Even if code generation accelerates dramatically, delivery remains constrained by alignment and understanding. The bottleneck simply relocates.</p>
<hr />
<h2>Why experience matters more</h2>
<p>AI reduces the effort required to produce code. It does not reduce the effort required to reason about consequences. That changes which skills matter most.</p>
<p>The valuable skill shifts away from implementation speed and toward judgment: recognizing coupling, anticipating operational impact, and deciding when a feature should not yet exist. Maintaining system clarity becomes more important than producing additional code.</p>
<p>Software systems rarely collapse because code was difficult to write. They collapse because their behavior became too complex to reason about. AI increases code abundance, but understanding remains scarce.</p>
<hr />
<h2>What actually changed</h2>
<p>AI is genuinely useful. It helps with exploration, scaffolding, and repetitive tasks. I use it regularly and it meaningfully improves parts of the workflow. But its main impact is not replacing engineers or eliminating effort. It changes the type of effort required.</p>
<p>There is less typing and more evaluation, less syntax work and more system reasoning. The teams that benefit most will not be those that generate the most code, but those that maintain a clear understanding of how their systems behave.</p>
<p>Software rarely fails because nobody could implement the solution. It fails because, over time, nobody fully understood the system they had built.</p>
]]></content:encoded></item><item><title><![CDATA[Observability Is Not Dashboards — How I Actually Use Datadog in Production]]></title><description><![CDATA[Many teams say they “have observability”.
Usually that means:

CPU graphs

memory usage

request count

maybe error rate


Those are useful, but they are not observability.
They are infrastructure vis]]></description><link>https://leandromaia.dev/observability-is-not-dashboards-how-i-actually-use-datadog-in-production</link><guid isPermaLink="true">https://leandromaia.dev/observability-is-not-dashboards-how-i-actually-use-datadog-in-production</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 20 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Many teams say they “have observability”.</p>
<p>Usually that means:</p>
<ul>
<li><p>CPU graphs</p>
</li>
<li><p>memory usage</p>
</li>
<li><p>request count</p>
</li>
<li><p>maybe error rate</p>
</li>
</ul>
<p>Those are useful, but they are not observability.</p>
<p>They are <strong>infrastructure visibility</strong>.</p>
<p>Observability only starts when you can answer a different kind of question:</p>
<blockquote>
<p>“Why is this specific user request slow right now?”</p>
</blockquote>
<p>Not average latency.<br />Not system health.</p>
<p>A concrete request, in a real moment, under real load.</p>
<p>This is how I use Datadog in practice: not as a monitoring wall, but as a debugging environment for live systems.</p>
<hr />
<h2>The Shift That Changed Everything</h2>
<p>Earlier in my career, monitoring meant:</p>
<ol>
<li><p>Alert fires</p>
</li>
<li><p>Look at dashboards</p>
</li>
<li><p>Guess which service is responsible</p>
</li>
<li><p>SSH into machines</p>
</li>
<li><p>Grep logs</p>
</li>
</ol>
<p>This works for small systems.</p>
<p>It collapses in distributed architectures.</p>
<p>In microservices, a single request might pass through:</p>
<ul>
<li><p>API gateway</p>
</li>
<li><p>authentication</p>
</li>
<li><p>core service</p>
</li>
<li><p>cache</p>
</li>
<li><p>database</p>
</li>
<li><p>message broker</p>
</li>
<li><p>async worker</p>
</li>
<li><p>third-party API</p>
</li>
</ul>
<p>Dashboards don’t reconstruct that story.</p>
<p>Traces do.</p>
<p>The biggest shift was moving from <strong>service health</strong> to <strong>request behavior</strong>.</p>
<hr />
<h2>The Three Pillars (How They Actually Connect)</h2>
<p>People often mention logs, metrics and traces as separate tools.</p>
<p>They are not.</p>
<p>They are three zoom levels of the same incident.</p>
<h3>Metrics — “Something is wrong”</h3>
<h3>Traces — “Where it is wrong”</h3>
<h3>Logs — “What exactly happened”</h3>
<p>The value appears when you can jump between them in seconds.</p>
<p>That is the core reason I rely heavily on Datadog’s APM rather than only metrics.</p>
<hr />
<h2>Starting Point: The Latency Graph</h2>
<p>Most real incidents I’ve seen start the same way:</p>
<p>Latency increases slightly.</p>
<p>Not a spike.<br />Not an outage.</p>
<p>Just p95 drifting from 180ms → 400ms → 900ms over 20 minutes.</p>
<p>Error rate is still normal.</p>
<p>This is the most dangerous state in production: <strong>a system degrading without failing</strong>.</p>
<p>In Datadog I rarely start with CPU or memory.</p>
<p>I start with:</p>
<p><strong>Service → APM → Endpoint latency → p95/p99</strong></p>
<p>Because users don’t experience servers.<br />They experience requests.</p>
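<p>Percentiles are worth internalizing. A nearest-rank sketch (real agents use histogram sketches rather than sorting raw samples, but the intuition is the same) shows how one bad request vanishes from p50 and dominates p95:</p>
<pre><code class="language-python">def percentile(samples, p):
    """Nearest-rank percentile; fine for a sketch, not for an agent."""
    ordered = sorted(samples)
    rank = max(1, round(p * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = [120, 130, 125, 140, 135, 128, 900, 132, 127, 131]
print(percentile(latencies_ms, 0.50))  # the typical request
print(percentile(latencies_ms, 0.95))  # what unlucky users actually feel
</code></pre>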
<hr />
<h2>From Metric to Trace</h2>
<p>Once I see a slow endpoint, the next step is not guessing.</p>
<p>I open a real trace.</p>
<p>This is where observability becomes different from monitoring.</p>
<p>A single trace shows:</p>
<ul>
<li><p>every service hop</p>
</li>
<li><p>DB queries</p>
</li>
<li><p>external calls</p>
</li>
<li><p>cache usage</p>
</li>
<li><p>retries</p>
</li>
<li><p>time spent waiting vs executing</p>
</li>
</ul>
<p>Now the system is no longer abstract.</p>
<p>I’m looking at a specific request that actually happened.</p>
<p>Very often the problem is immediately visible:</p>
<ul>
<li><p>a 2.4s external API call</p>
</li>
<li><p>a blocking call inside async flow</p>
</li>
<li><p>N+1 database queries</p>
</li>
<li><p>a retry storm</p>
</li>
<li><p>thread pool saturation</p>
</li>
</ul>
<p>Without tracing, these issues look identical from metrics.</p>
<hr />
<h2>The Most Valuable Feature: Outliers</h2>
<p>Averages hide incidents.</p>
<p>I care much more about <strong>the slowest 1%</strong>.</p>
<p>In Datadog I frequently filter:</p>
<p>“Show traces where duration &gt; 2 seconds”</p>
<p>This reveals patterns dashboards never show.</p>
<p>Examples I’ve encountered:</p>
<ul>
<li><p>only requests with large payloads fail</p>
</li>
<li><p>only specific tenants are slow</p>
</li>
<li><p>only cache misses trigger a downstream bottleneck</p>
</li>
<li><p>only first request after deployment is slow (cold initialization)</p>
</li>
</ul>
<p>Production problems are rarely global.<br />They are conditional.</p>
<p>Tracing exposes conditions.</p>
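<p>A toy version of that filter (the threshold and trace fields are invented) shows how quickly a shared attribute surfaces once you keep only the outliers:</p>
<pre><code class="language-python">from collections import Counter

THRESHOLD_MS = 2000

def at_least(value, limit):
    # true when value is at or above limit
    return max(value, limit) == value

traces = [
    {"ms": 180,  "tenant": "a"}, {"ms": 210,  "tenant": "b"},
    {"ms": 2600, "tenant": "c"}, {"ms": 190,  "tenant": "a"},
    {"ms": 3100, "tenant": "c"}, {"ms": 240,  "tenant": "b"},
]
slow = [t for t in traces if at_least(t["ms"], THRESHOLD_MS)]
print(Counter(t["tenant"] for t in slow))  # slowness clusters on one tenant
</code></pre>
<p>In Datadog the equivalent is a trace search query plus a group-by on a tag; the point is that the pattern lives in the outliers, not in the aggregate.</p>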
<hr />
<h2>Database Visibility</h2>
<p>One of the most practical benefits is query-level insight.</p>
<p>Instead of:</p>
<p>“database seems slow”</p>
<p>I can see:</p>
<ul>
<li><p>exact SQL query</p>
</li>
<li><p>execution time</p>
</li>
<li><p>frequency</p>
</li>
<li><p>which endpoint triggered it</p>
</li>
</ul>
<p>This immediately distinguishes:</p>
<ul>
<li><p>bad indexing</p>
</li>
<li><p>missing caching</p>
</li>
<li><p>accidental full table scan</p>
</li>
<li><p>ORM misuse</p>
</li>
</ul>
<p>Many “performance problems” turn out to be <strong>query shape problems</strong>.</p>
<p>And they are visible in seconds.</p>
<hr />
<h2>Logs — But Only After the Trace</h2>
<p>A mistake I made early on was starting with logs.</p>
<p>Logs are high volume and low signal.</p>
<p>Now I almost never search logs first.</p>
<p>I:</p>
<ol>
<li><p>Find a slow trace</p>
</li>
<li><p>Jump to correlated logs from that trace</p>
</li>
</ol>
<p>This changes everything.</p>
<p>Instead of reading thousands of lines, I read the exact log lines produced by the failing request.</p>
<p>Logs become evidence instead of noise.</p>
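<p>The mechanism behind the correlation is simple: every log line carries the request's trace id. A hand-rolled sketch of the idea — tracing libraries such as ddtrace can inject these ids into log records automatically, so you rarely write this yourself:</p>
<pre><code class="language-python">import logging

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("checkout")

def handle_request(trace_id, payload):
    tag = f"trace_id={trace_id}"
    log.info("start %s payload_size=%d", tag, len(payload))
    # ... business logic ...
    log.info("done %s", tag)

handle_request("abc123", "some payload")
</code></pre>
<p>Once the id is present, "logs for this trace" becomes a filter instead of a search.</p>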
<hr />
<h2>Alerts I Actually Trust</h2>
<p>The alerts that wake engineers at night should be very few.</p>
<p>The most useful alerts I’ve configured are not CPU or memory alerts.</p>
<p>They are:</p>
<ul>
<li><p>p99 latency above threshold for sustained period</p>
</li>
<li><p>error rate per endpoint (not global)</p>
</li>
<li><p>saturation of worker queues</p>
</li>
<li><p>retry explosion indicators</p>
</li>
</ul>
<p>Hardware metrics often alert <em>after</em> users notice.</p>
<p>Request-level alerts often alert <em>before</em> support tickets appear.</p>
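<p>The "sustained period" part matters as much as the threshold. A sketch of that evaluation logic, with an invented window size and threshold:</p>
<pre><code class="language-python">from collections import deque

WINDOW = 5           # consecutive evaluation points
THRESHOLD_MS = 800   # assumed p99 limit

recent = deque(maxlen=WINDOW)

def observe(p99_ms):
    """Fire only when every sample in the window breaches the threshold."""
    breached = (max(p99_ms, THRESHOLD_MS) == p99_ms)  # at or above limit
    recent.append(breached)
    return len(recent) == WINDOW and all(recent)

for sample in (900, 950, 400, 900, 910, 905, 880, 920):
    if observe(sample):
        print("ALERT: p99 sustained above threshold")
</code></pre>
<p>A single noisy sample (the 400ms dip, or one 900ms spike) never pages anyone; five bad points in a row do.</p>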
<hr />
<h2>An Unexpected Use: Verifying Fixes</h2>
<p>Observability is not only for incidents.</p>
<p>It is also for confidence.</p>
<p>After deploying a fix, I don’t just wait.</p>
<p>I compare traces:</p>
<p>before vs after</p>
<p>I verify:</p>
<ul>
<li><p>latency distribution</p>
</li>
<li><p>retry count</p>
</li>
<li><p>DB query time</p>
</li>
<li><p>downstream calls</p>
</li>
</ul>
<p>This turns performance work from intuition into evidence.</p>
<hr />
<h2>Final Thought</h2>
<p>Monitoring tells you when the system is unhealthy.</p>
<p>Observability lets you understand behavior while the system is still running.</p>
<p>Modern backend systems rarely fail loudly.</p>
<p>They degrade, retry, compensate and partially work.</p>
<p>Without request-level visibility, engineers debug symptoms.</p>
<p>With observability, they debug causality.</p>
<p>Datadog is simply the tool I use to see that causality while the system is alive.</p>
]]></content:encoded></item><item><title><![CDATA[Working With an AI Coding Assistant (Codex) as a Backend Engineer]]></title><description><![CDATA[Over the last months I started using an AI coding assistant powered by large language models (Codex-style systems).
I did not approach it as a novelty or productivity experiment.
I approached it the s]]></description><link>https://leandromaia.dev/working-with-an-ai-coding-assistant-codex-as-a-backend-engineer</link><guid isPermaLink="true">https://leandromaia.dev/working-with-an-ai-coding-assistant-codex-as-a-backend-engineer</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 13 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Over the last months I started using an AI coding assistant powered by large language models (Codex-style systems).</p>
<p>I did not approach it as a novelty or productivity experiment.</p>
<p>I approached it the same way I approach any new piece of infrastructure:<br />with skepticism and with a production mindset.</p>
<p>The interesting discovery was this:</p>
<p>The assistant is not a faster autocomplete.</p>
<p>It behaves much closer to a <strong>very fast junior engineer with perfect recall and zero operational experience</strong>.</p>
<p>Once I started treating it that way, it became genuinely useful.</p>
<p>This post is not about whether AI will replace engineers.<br />It is about how it actually changes day-to-day backend work.</p>
<hr />
<h2>What It Is Actually Good At</h2>
<p>The first surprise was not code generation.</p>
<p>It was <em>code navigation</em>.</p>
<p>In large systems, a lot of time is not spent writing code.<br />It is spent reconstructing intent.</p>
<p>Typical tasks:</p>
<ul>
<li><p>understanding an unfamiliar module</p>
</li>
<li><p>finding where a side effect originates</p>
</li>
<li><p>tracing request flows</p>
</li>
<li><p>reconstructing configuration behavior</p>
</li>
<li><p>mapping DTOs across layers</p>
</li>
</ul>
<p>The assistant is very good at building a mental index of a codebase quickly.</p>
<p>You can ask questions like:</p>
<p>“Where could a timeout be happening in this flow?”</p>
<p>And it will point to:</p>
<ul>
<li><p>HTTP client configuration</p>
</li>
<li><p>thread pool limits</p>
</li>
<li><p>retry wrappers</p>
</li>
<li><p>circuit breaker policies</p>
</li>
</ul>
<p>Not always correctly — but almost always <em>usefully</em>.</p>
<p>The real productivity gain is not typing less code.</p>
<p>It is reducing search time.</p>
<hr />
<h2>The Refactoring Multiplier</h2>
<p>The second strong use case is mechanical refactoring.</p>
<p>Things engineers postpone for months:</p>
<ul>
<li><p>renaming confusing interfaces</p>
</li>
<li><p>splitting large classes</p>
</li>
<li><p>extracting validation logic</p>
</li>
<li><p>migrating method signatures</p>
</li>
<li><p>removing duplication</p>
</li>
</ul>
<p>These tasks are cognitively easy but operationally expensive.</p>
<p>They require attention, but not deep design thinking.</p>
<p>The assistant is extremely effective here.</p>
<p>You still review every change.</p>
<p>But the cost of attempting a refactor drops dramatically.</p>
<p>The interesting side effect:</p>
<p>I started performing refactors earlier.</p>
<p>Not because the assistant is perfect — but because the activation energy disappeared.</p>
<hr />
<h2>Where It Fails (Consistently)</h2>
<p>The assistant writes correct-looking code far more often than correct systems.</p>
<p>This is the most important observation.</p>
<p>It is strong at:</p>
<ul>
<li><p>syntax</p>
</li>
<li><p>API usage</p>
</li>
<li><p>small local logic</p>
</li>
</ul>
<p>It is weak at:</p>
<ul>
<li><p>concurrency</p>
</li>
<li><p>distributed systems</p>
</li>
<li><p>failure handling</p>
</li>
<li><p>timeouts</p>
</li>
<li><p>idempotency</p>
</li>
<li><p>partial failure</p>
</li>
</ul>
<p>In other words, it struggles exactly where real backend incidents happen.</p>
<p>If you ask it to implement a retry mechanism, it will produce one.</p>
<p>If you ask it to design a safe retry mechanism, it will often produce a system that can duplicate side effects.</p>
<p>This is a critical difference.</p>
<p>The assistant optimizes for <em>plausibility</em>, not for <em>operability</em>.</p>
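<p>To make the difference concrete, here is a minimal sketch of what a "safe" retry needs that a naive one lacks: one idempotency key per logical operation, so a retry after a lost response cannot create a second side effect. The gateway class and its failure injection are hypothetical stand-ins for a real external service:</p>

```python
import uuid

class PaymentGateway:
    """Hypothetical stand-in for an external payment service
    that deduplicates on an idempotency key."""

    def __init__(self, fail_first_n=0):
        self.captured = {}  # idempotency_key -> amount
        self._failures_left = fail_first_n

    def capture(self, idempotency_key, amount):
        if idempotency_key in self.captured:
            return "already-captured"  # server-side dedup
        self.captured[idempotency_key] = amount
        if self._failures_left > 0:
            self._failures_left -= 1
            # The side effect happened, but the caller never saw the response.
            raise TimeoutError("response lost after the side effect happened")
        return "captured"

def capture_with_retry(gateway, amount, attempts=3):
    # One key for the whole logical operation, NOT one per attempt:
    # this is what prevents a retry from duplicating the capture.
    key = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return gateway.capture(key, amount)
        except TimeoutError:
            continue
    raise RuntimeError("gave up")
```

<p>A naive retry generates a fresh request per attempt, so the "response lost" case produces two captures. In my experience this per-attempt-key version is exactly what an assistant tends to produce unless you ask for the invariant explicitly.</p>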
<hr />
<h2>The Illusion of Correctness</h2>
<p>The most dangerous property is fluency.</p>
<p>Bad code used to look suspicious.</p>
<p>AI-generated code often looks clean, documented and well structured.</p>
<p>Which makes engineers trust it more than they should.</p>
<p>The failure mode is subtle:</p>
<p>You stop questioning decisions that you did not consciously make.</p>
<p>Over time, this introduces architectural drift.</p>
<p>Not dramatic failures — but many small design decisions that nobody truly owns.</p>
<p>I’ve seen:</p>
<ul>
<li><p>retries added in three layers</p>
</li>
<li><p>hidden blocking calls inside async flows</p>
</li>
<li><p>silent error swallowing</p>
</li>
<li><p>accidental N+1 queries</p>
</li>
</ul>
<p>All of them reasonable locally.<br />All of them problematic systemically.</p>
<hr />
<h2>The Real Productivity Shift</h2>
<p>The assistant does not remove the need for senior engineers.</p>
<p>It increases the value of judgment.</p>
<p>Before:</p>
<p>Senior engineers wrote more code correctly.</p>
<p>Now:</p>
<p>Senior engineers <strong>reject more code correctly</strong>.</p>
<p>A large part of using an AI assistant well is knowing when <em>not</em> to accept its solution.</p>
<p>You stop thinking of it as a coding tool and start thinking of it as a proposal generator.</p>
<hr />
<h2>A Practical Workflow That Worked For Me</h2>
<p>What worked best for me was separating tasks into two categories.</p>
<h3>Tasks I delegate to the assistant</h3>
<ul>
<li><p>boilerplate</p>
</li>
<li><p>DTO mapping</p>
</li>
<li><p>test scaffolding</p>
</li>
<li><p>refactor mechanics</p>
</li>
<li><p>documentation drafts</p>
</li>
<li><p>codebase exploration</p>
</li>
</ul>
<h3>Tasks I never delegate</h3>
<ul>
<li><p>concurrency control</p>
</li>
<li><p>transactional boundaries</p>
</li>
<li><p>caching strategy</p>
</li>
<li><p>retries</p>
</li>
<li><p>idempotency</p>
</li>
<li><p>API contracts</p>
</li>
<li><p>state machines</p>
</li>
</ul>
<p>Interestingly, this boundary maps almost exactly to the boundary between programming and engineering.</p>
<p>The assistant is good at programming.</p>
<p>Engineering still requires responsibility for behavior in production.</p>
<hr />
<h2>Unexpected Benefit: Thinking More Explicitly</h2>
<p>One side effect I did not expect:</p>
<p>I started writing clearer code.</p>
<p>Because I needed to describe intent precisely when prompting, I became more explicit about:</p>
<ul>
<li><p>invariants</p>
</li>
<li><p>failure modes</p>
</li>
<li><p>assumptions</p>
</li>
<li><p>data ownership</p>
</li>
</ul>
<p>The tool forces you to articulate reasoning you previously kept in your head.</p>
<p>That alone improved code reviews and documentation.</p>
<hr />
<h2>Final Thought</h2>
<p>AI coding assistants change how code is produced.</p>
<p>They do not change what reliable systems require.</p>
<p>Production systems are constrained by latency, partial failure, concurrency and time.</p>
<p>The assistant does not experience incidents, on-call or operational consequences.</p>
<p>Engineers do.</p>
<p>The most useful mental model I found is this:</p>
<p>The assistant can generate solutions.</p>
<p>The engineer is still accountable for reality.</p>
<p>And in backend systems, reality is what eventually wins.</p>
]]></content:encoded></item><item><title><![CDATA[Why backend systems become fragile as companies grow]]></title><description><![CDATA[Most backend systems don’t fail because of a bad initial design.
Many systems start simple, clean and understandable. Early teams usually know every component, every database table and most side effec]]></description><link>https://leandromaia.dev/why-backend-systems-become-fragile-as-companies-grow</link><guid isPermaLink="true">https://leandromaia.dev/why-backend-systems-become-fragile-as-companies-grow</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 06 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>Most backend systems don’t fail because of a bad initial design.</p>
<p>Many systems start simple, clean and understandable. Early teams usually know every component, every database table and most side effects of a change. Deployments feel safe and incidents are rare.</p>
<p>Yet, after some growth, the same systems often become fragile. Small changes cause unexpected problems. Incidents appear more frequently. Teams begin to fear deployments.</p>
<p>What changed is rarely the programming language, the framework or even the main architecture.</p>
<p>What changed is the context around the system.</p>
<h2>The moment complexity becomes invisible</h2>
<p>In early stages, a system is small enough to exist inside a shared mental model. A few engineers understand how data flows, which services depend on others and what assumptions exist.</p>
<p>Growth breaks that.</p>
<p>As a company grows, three things tend to happen at the same time:</p>
<ul>
<li><p>more teams interact with the same system</p>
</li>
<li><p>integrations increase</p>
</li>
<li><p>local decisions accumulate</p>
</li>
</ul>
<p>None of these are inherently bad. Each decision is usually reasonable in isolation. A new integration enables a business opportunity. A quick workaround solves an urgent need. A new service isolates a responsibility.</p>
<p>The fragility comes from how these decisions interact over time.</p>
<p>No one owns the full mental model anymore, but the system still behaves as if someone should.</p>
<h2>Local optimizations, global consequences</h2>
<p>A common pattern in growing organizations is local optimization.</p>
<p>A team improves performance for their feature. Another team adds caching for a specific endpoint. A third team creates a background job to guarantee retries.</p>
<p>Individually, these changes make sense.</p>
<p>Collectively, they create hidden coupling.</p>
<p>Soon, actions that were once simple — like reprocessing an event, replaying a queue, or fixing a database record — become dangerous. Not because the code is poorly written, but because the number of implicit assumptions increased.</p>
<p>The system did not become fragile due to complexity alone.</p>
<p>It became fragile because complexity became <em>implicit</em>.</p>
<h2>Microservices don’t automatically solve this</h2>
<p>At this stage, many organizations assume the solution is an architectural change.</p>
<p>Often the reaction is to split the system further: more services, more queues, more boundaries.</p>
<p>This can help, but only when boundaries reflect real ownership and domain understanding.</p>
<p>Otherwise, microservices simply distribute fragility across network calls instead of function calls.</p>
<p>The core issue is not whether the system is a monolith or microservices.</p>
<p>The issue is whether the system’s structure matches how teams understand and operate it.</p>
<h2>What actually improves stability</h2>
<p>In practice, stability improves not primarily through new technology, but through clearer system thinking.</p>
<p>A few patterns consistently help:</p>
<h3>Clear ownership</h3>
<p>Every important component should have a team that understands its behavior in production, not just its code.</p>
<h3>Explicit boundaries</h3>
<p>Systems become safer when assumptions are documented and contracts are treated seriously. Many production issues come from assumptions that were never written down.</p>
<h3>Observability over cleverness</h3>
<p>Metrics, logs and traces reduce fear because they allow engineers to reason about behavior instead of guessing.</p>
<h3>Fewer responsibilities per component</h3>
<p>Components that handle many unrelated responsibilities become risk multipliers. Simplicity in responsibility often matters more than technical elegance.</p>
<h2>Fragility is a systems problem</h2>
<p>It is tempting to see incidents as isolated technical failures.</p>
<p>More often, they are signals that the system outgrew its original mental model.</p>
<p>The code did not suddenly become worse. The system simply reached a scale where informal knowledge stopped being enough.</p>
<p>Backend fragility rarely comes from bad engineers or bad intentions.</p>
<p>It comes from successful systems growing beyond the structures that once kept them understandable.</p>
<p>Improving stability, therefore, is less about rewriting everything and more about making the system understandable again.</p>
]]></content:encoded></item><item><title><![CDATA[Not All Race Conditions Are Threads — Race Conditions in Distributed Systems]]></title><description><![CDATA[When engineers hear “race condition”, most imagine two threads modifying the same variable.
That is the smallest version of the problem.
In distributed systems, race conditions are far more dangerous ]]></description><link>https://leandromaia.dev/not-all-race-conditions-are-threads-race-conditions-in-distributed-systems</link><guid isPermaLink="true">https://leandromaia.dev/not-all-race-conditions-are-threads-race-conditions-in-distributed-systems</guid><dc:creator><![CDATA[Leandro Maia]]></dc:creator><pubDate>Tue, 06 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>When engineers hear “race condition”, most imagine two threads modifying the same variable.</p>
<p>That is the <em>smallest</em> version of the problem.</p>
<p>In distributed systems, race conditions are far more dangerous because they don’t depend on shared memory.<br />They depend on <strong>time, ordering and partial knowledge</strong>.</p>
<p>No locks.<br />No stack traces.<br />No deterministic reproduction.</p>
<p>And the system can be perfectly healthy from an infrastructure perspective.</p>
<p>This post is about the kinds of race conditions that actually appear in production backend systems.</p>
<hr />
<h2>1) The Double-Execution Race (Duplicate Processing)</h2>
<p>This is the most common distributed race condition.</p>
<p>A worker processes a message.<br />The message broker doesn’t receive the acknowledgement in time.<br />The broker redelivers.</p>
<p>Now two workers execute the same operation.</p>
<p>Typical scenario:</p>
<ul>
<li><p>order creation</p>
</li>
<li><p>payment capture</p>
</li>
<li><p>email sending</p>
</li>
<li><p>inventory reservation</p>
</li>
<li><p>coupon redemption</p>
</li>
</ul>
<p>Nothing crashed.</p>
<p>The system did exactly what it was designed to do: <strong>at-least-once delivery</strong>.</p>
<p>But the business operation was not idempotent.</p>
<h3>What makes this dangerous</h3>
<p>The second execution is not a retry from the same process.<br />It is a <em>concurrent logical operation</em>.</p>
<p>You now have:</p>
<ul>
<li><p>two payment captures</p>
</li>
<li><p>two shipments</p>
</li>
<li><p>two state transitions</p>
</li>
<li><p>inconsistent accounting</p>
</li>
</ul>
<p>And logs look completely valid.</p>
<h3>Typical mistaken fixes</h3>
<ul>
<li><p>increasing visibility timeout</p>
</li>
<li><p>reducing consumer concurrency</p>
</li>
<li><p>adding delays</p>
</li>
</ul>
<p>Those reduce the probability of the race, not the race itself.</p>
<h3>Real fix</h3>
<p>You need <strong>idempotency at the business boundary</strong>, not at the infrastructure layer.</p>
<p>Examples:</p>
<ul>
<li><p>idempotency keys stored with unique constraints</p>
</li>
<li><p>operation tokens</p>
</li>
<li><p>deduplication tables</p>
</li>
<li><p>state transition guards</p>
</li>
</ul>
<p>The system must be able to answer:</p>
<blockquote>
<p>“Has this operation already been logically completed?”</p>
</blockquote>
<p>Not:</p>
<blockquote>
<p>“Has this message already been seen by this worker?”</p>
</blockquote>
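<p>A minimal sketch of the dedup-table approach, using an in-memory SQLite database as a stand-in for a real one (table and column names are illustrative). The unique constraint, not application logic, is what arbitrates the race:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_operations (
        idempotency_key TEXT PRIMARY KEY,  -- the unique constraint is the guard
        result          TEXT
    )
""")

def process_once(key, handler):
    """Run handler only if this logical operation has not completed before."""
    try:
        # Claim the key first; a duplicate delivery hits the primary-key
        # constraint instead of re-running the side effect.
        conn.execute(
            "INSERT INTO processed_operations (idempotency_key, result) "
            "VALUES (?, NULL)",
            (key,),
        )
    except sqlite3.IntegrityError:
        return "duplicate-skipped"
    result = handler()
    conn.execute(
        "UPDATE processed_operations SET result = ? WHERE idempotency_key = ?",
        (result, key),
    )
    conn.commit()
    return result
```

<p>In a real system you would claim the key and run the side effect inside the same transaction (or handle a crash between the two), but the core idea stands: the key identifies the <em>business operation</em>, so it survives redelivery to a different worker.</p>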
<hr />
<h2>2) The Lost Update Race (Concurrent Writers)</h2>
<p>Two services read the same entity state and both decide to modify it.</p>
<p>Timeline:</p>
<ol>
<li><p>Service A reads balance = 100</p>
</li>
<li><p>Service B reads balance = 100</p>
</li>
<li><p>A subtracts 40 → writes 60</p>
</li>
<li><p>B subtracts 80 → writes 20</p>
</li>
</ol>
<p>Final state: 20<br />Correct state: −20 or rejected</p>
<p>No conflicts detected.<br />Database behaved correctly.</p>
<p>This happens frequently with:</p>
<ul>
<li><p>wallets</p>
</li>
<li><p>inventory</p>
</li>
<li><p>quotas</p>
</li>
<li><p>rate limits</p>
</li>
<li><p>seat reservations</p>
</li>
</ul>
<h3>Why transactions don’t automatically save you</h3>
<p>Because both transactions are individually valid.</p>
<p>The race is <strong>between reads</strong>, not writes.</p>
<h3>Correct approaches</h3>
<ul>
<li><p>optimistic locking (version column)</p>
</li>
<li><p>compare-and-swap updates</p>
</li>
<li><p>conditional writes</p>
</li>
<li><p>atomic database operations</p>
</li>
<li><p>append-only ledgers instead of mutable state</p>
</li>
</ul>
<p>The real solution is not stronger transactions.</p>
<p>It is <strong>state transition control</strong>.</p>
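<p>A minimal sketch of the compare-and-swap variant, again with in-memory SQLite as a stand-in (schema and names are illustrative). The update is conditional on the version we read, so a writer holding a stale snapshot simply fails instead of silently overwriting:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wallets (id INTEGER PRIMARY KEY, "
    "balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO wallets VALUES (1, 100, 0)")

def debit(wallet_id, amount):
    """Compare-and-swap debit: the write only succeeds against
    the exact version we read."""
    balance, version = conn.execute(
        "SELECT balance, version FROM wallets WHERE id = ?", (wallet_id,)
    ).fetchone()
    if balance < amount:
        return "rejected"
    cur = conn.execute(
        "UPDATE wallets SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",  # no rows match if someone wrote in between
        (balance - amount, wallet_id, version),
    )
    return "ok" if cur.rowcount == 1 else "conflict"
```

<p>Replaying the timeline from above: A debits 40 and succeeds; B, still holding version 0, gets a zero-row update and must re-read, at which point the balance of 60 makes its debit of 80 a clean rejection instead of a silent lost update.</p>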
<hr />
<h2>3) The Out-of-Order Event Race</h2>
<p>Distributed systems do not guarantee global ordering.</p>
<p>Even Kafka does not — only per partition.</p>
<p>Typical example:</p>
<ol>
<li><p><code>OrderCancelled</code></p>
</li>
<li><p>`Order</p>
</li>
</ol>
]]></content:encoded></item></channel></rss>