
Observability Is Not Dashboards — How I Actually Use Datadog in Production

5 min read

Many teams say they “have observability”.

Usually that means:

  • CPU graphs

  • memory usage

  • request count

  • maybe error rate

Those are useful, but they are not observability.

They are infrastructure visibility.

Observability only starts when you can answer a different kind of question:

“Why is this specific user request slow right now?”

Not average latency.
Not system health.

A concrete request, in a real moment, under real load.

This is how I use Datadog in practice: not as a monitoring wall, but as a debugging environment for live systems.


The Shift That Changed Everything

Earlier in my career, monitoring meant:

  1. Alert fires

  2. Look at dashboards

  3. Guess which service is responsible

  4. SSH into machines

  5. Grep logs

This works for small systems.

It collapses in distributed architectures.

In microservices, a single request might pass through:

  • API gateway

  • authentication

  • core service

  • cache

  • database

  • message broker

  • async worker

  • third-party API

Dashboards don’t reconstruct that story.

Traces do.

The biggest shift was moving from service health to request behavior.
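The mental model behind a trace can be sketched in a few lines of stdlib Python: each hop becomes a timed span, and the spans form a tree. This is a toy illustration only — real tracers (ddtrace, OpenTelemetry) record and ship this automatically:

```python
# Toy model of a trace: a tree of timed spans, one per hop.
import time
from contextlib import contextmanager

spans = []

@contextmanager
def span(name, parent=None):
    start = time.perf_counter()
    try:
        yield name
    finally:
        spans.append({"name": name, "parent": parent,
                      "ms": (time.perf_counter() - start) * 1000})

with span("api.request") as root:
    with span("auth.check", parent=root):
        time.sleep(0.01)   # stand-in for the auth hop
    with span("db.query", parent=root):
        time.sleep(0.02)   # stand-in for the database hop

# Sorting child spans by duration shows where the request's time went.
slowest = max((s for s in spans if s["parent"]), key=lambda s: s["ms"])
print(slowest["name"])  # db.query
```

That last line is the whole point: the question stops being "is the system healthy?" and becomes "which hop ate this request's time?"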


The Three Pillars (How They Actually Connect)

People often mention logs, metrics and traces as separate tools.

They are not.

They are three zoom levels of the same incident.

Metrics — “Something is wrong”

Traces — “Where it is wrong”

Logs — “What exactly happened”

The value appears when you can jump between them in seconds.

That is the core reason I rely heavily on Datadog’s APM rather than only metrics.


Starting Point: The Latency Graph

Most real incidents I’ve seen start the same way:

Latency increases slightly.

Not a spike.
Not an outage.

Just p95 drifting from 180ms → 400ms → 900ms over 20 minutes.

Error rate is still normal.

This is the most dangerous state in production: a system degrading without failing.

In Datadog I rarely start with CPU or memory.

I start with:

Service → APM → Endpoint latency → p95/p99

Because users don’t experience servers.
They experience requests.
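The "p95 drifting while everything else looks normal" pattern is easy to see in numbers. A tiny illustration with hypothetical latencies and a nearest-rank percentile:

```python
# Why p95 catches what the average hides: a few slow requests barely
# move the mean but dominate the tail. Latency values are hypothetical.
import statistics

def p95(samples):
    s = sorted(samples)
    return s[int(0.95 * (len(s) - 1))]  # nearest-rank style index

healthy  = [180] * 100                # steady 180 ms
degraded = [180] * 94 + [900] * 6     # 6% of requests now take 900 ms

print(statistics.mean(degraded))  # 223.2 — the mean barely moves
print(p95(degraded))              # 900  — the tail tells the truth
```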


From Metric to Trace

Once I see a slow endpoint, the next step is not guessing.

I open a real trace.

This is where observability becomes different from monitoring.

A single trace shows:

  • every service hop

  • DB queries

  • external calls

  • cache usage

  • retries

  • time spent waiting vs executing

Now the system is no longer abstract.

I’m looking at a specific request that actually happened.

Very often the problem is immediately visible:

  • a 2.4s external API call

  • a blocking call inside async flow

  • N+1 database queries

  • a retry storm

  • thread pool saturation

Without tracing, these issues look identical from metrics.
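One of those patterns — a blocking call inside an async flow — is simple to reproduce. In a trace it shows up as every concurrent request stalling together. A stdlib-only sketch:

```python
# A blocking call inside an async flow: time.sleep() freezes the whole
# event loop, so "concurrent" requests actually run one after another.
import asyncio
import time

async def blocking_handler():
    time.sleep(0.05)           # BUG: blocks the event loop

async def async_handler():
    await asyncio.sleep(0.05)  # yields; other requests keep running

async def main(handler):
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(5)))
    return time.perf_counter() - start

print(asyncio.run(main(blocking_handler)))  # ~0.25 s: serialized
print(asyncio.run(main(async_handler)))     # ~0.05 s: concurrent
```

From metrics alone both versions just look "slow under load"; a trace shows five spans queued behind each other instead of overlapping.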


The Most Valuable Feature: Outliers

Averages hide incidents.

I care much more about the slowest 1%.

In Datadog I frequently filter:

“Show traces where duration > 2 seconds”

This reveals patterns dashboards never show.

Examples I’ve encountered:

  • only requests with large payloads fail

  • only specific tenants are slow

  • only cache misses trigger a downstream bottleneck

  • only first request after deployment is slow (cold initialization)

Production problems are rarely global.
They are conditional.

Tracing exposes conditions.
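That ">2 seconds" filter is a one-line query in the Trace Explorer. The service name here is a placeholder, and the exact duration-facet syntax is worth checking against your Datadog version:

```
service:checkout @duration:>2s
```

From there, grouping the results by tag (tenant, payload size, cache hit/miss) is what surfaces the conditions.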


Database Visibility

One of the most practical benefits is query-level insight.

Instead of:

“database seems slow”

I can see:

  • exact SQL query

  • execution time

  • frequency

  • which endpoint triggered it

This immediately distinguishes:

  • bad indexing

  • missing caching

  • accidental full table scan

  • ORM misuse

Many “performance problems” turn out to be query shape problems.

And they are visible in seconds.
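The N+1 pattern in particular is worth seeing in code, because in APM it appears as the same SQL statement repeated once per row. A minimal sketch with sqlite3 standing in for any database:

```python
# The N+1 shape that query-level tracing makes obvious: one query per
# id in a loop vs a single batched query. sqlite3 stands in for any DB.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO orders (user_id) VALUES (1), (1), (2), (3);
""")

user_ids = [1, 2, 3]

# N+1: in a trace, the same query span repeated len(user_ids) times.
counts_n_plus_1 = {
    uid: db.execute(
        "SELECT COUNT(*) FROM orders WHERE user_id = ?", (uid,)
    ).fetchone()[0]
    for uid in user_ids
}

# Batched: one query, one span.
placeholders = ",".join("?" * len(user_ids))
rows = db.execute(
    f"SELECT user_id, COUNT(*) FROM orders "
    f"WHERE user_id IN ({placeholders}) GROUP BY user_id",
    user_ids,
).fetchall()
counts_batched = dict(rows)

print(counts_n_plus_1 == counts_batched)  # True: same result, 1 query vs 3
```

Both versions return identical data, which is exactly why the problem hides until you see the query count per endpoint.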


Logs — But Only After the Trace

A mistake I made early on was starting with logs.

Logs are high volume and low signal.

Now I almost never search logs first.

Instead, I:

  1. Find a slow trace

  2. Jump to correlated logs from that trace

This changes everything.

Instead of reading thousands of lines, I read the exact log lines produced by the failing request.

Logs become evidence instead of noise.
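That trace-to-log jump works because every log line carries the active trace id. With ddtrace the injection is automatic (via `DD_LOGS_INJECTION=true`); the stdlib sketch below fakes the same mechanism, with a hard-coded trace id standing in for the active span's:

```python
# How trace -> log correlation works: a logging.Filter stamps every
# record with the current trace id, so logs can be pivoted from a span.
# The trace id here is hard-coded for illustration; a real tracer
# supplies it from the active span.
import io
import logging

CURRENT_TRACE_ID = "8742198742"

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = CURRENT_TRACE_ID
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    "%(levelname)s [dd.trace_id=%(trace_id)s] %(message)s"))
handler.addFilter(TraceIdFilter())

log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("payment authorized")

print(buf.getvalue().strip())
# INFO [dd.trace_id=8742198742] payment authorized
```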


Alerts I Actually Trust

The alerts that wake engineers at night should be very few.

The most useful alerts I’ve configured are not CPU or memory alerts.

They are:

  • p99 latency above threshold for sustained period

  • error rate per endpoint (not global)

  • saturation of worker queues

  • retry explosion indicators

Hardware metrics often alert after users notice.

Request-level alerts often alert before support tickets appear.
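As a sketch, a request-level monitor query looks roughly like this — metric name, service, and threshold are placeholders, and the percentile aggregation assumes a distribution metric:

```
avg(last_10m):p99:trace.http.request{env:prod,service:checkout} > 1
```

The shape matters more than the exact syntax: a percentile, scoped to one service's requests, sustained over a window — not a host-level gauge.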


An Unexpected Use: Verifying Fixes

Observability is not only for incidents.

It is also for confidence.

After deploying a fix, I don’t just wait.

I compare traces:

before vs after

I verify:

  • latency distribution

  • retry count

  • DB query time

  • downstream calls

This turns performance work from intuition into evidence.
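A sketch of what that comparison looks like as numbers rather than a feeling — all values are hypothetical, as if exported from trace queries before and after the deploy:

```python
# Verifying a fix with evidence: summary stats from traces taken
# before and after the deploy. All numbers are hypothetical.
import statistics

before = {"latency_ms": [480, 950, 890, 510, 470], "retries": 37}
after  = {"latency_ms": [180, 210, 205, 195, 188], "retries": 2}

b = statistics.median(before["latency_ms"])
a = statistics.median(after["latency_ms"])
print(f"median latency: {b} -> {a} ms")                       # 510 -> 195 ms
print(f"retries: {before['retries']} -> {after['retries']}")  # 37 -> 2
```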


Final Thought

Monitoring tells you when the system is unhealthy.

Observability lets you understand behavior while the system is still running.

Modern backend systems rarely fail loudly.

They degrade, retry, compensate and partially work.

Without request-level visibility, engineers debug symptoms.

With observability, they debug causality.

Datadog is simply the tool I use to see that causality while the system is alive.

Leandro Maia

Notes on Backend Systems and Software Architecture