A few weeks ago, on an otherwise quiet morning, a Datadog monitor fires for one of
our legacy PHP applications in production. The p95 of HTTP request latency has
hit 10+ seconds, well past the critical threshold. By the time we acknowledge it,
the alert has already half-resolved itself: it fired at 08:42:58 UTC and recovered
at 08:50:58 UTC. An eight-minute window. The kind of incident that’s gone before
you’ve finished pouring coffee.
The symptom is gone. The question isn’t. You still need to know why, because the answer goes into a follow-up ticket and possibly a postmortem.
What sits between you and that answer, on a normal day, is a series of context switches: the Datadog metrics tab, a kubectl shell on the right cluster, the application logs, and the application’s source repo. Plus a fistful of timestamps, copied four times over into a markdown file that will eventually go into the incident report.
Triage friction is mostly plumbing
The “intellectual” work in triage isn’t large. You’re building an evidence chain: the symptom, where the symptom narrows to, and what code or config that narrowing implicates. Once you have the chain, the diagnosis mostly writes itself.
The bulk of the time goes into the boring art of plumbing all these tools together.
You need the alert window in ISO 8601 UTC. You need the right kubectl context, picked out of all your EKS clusters. You need to remember which namespace the deployment lives in. You need to translate “this monitor fired” into “show me memory usage by pod in that deployment, between these two timestamps.” None of it is hard. All of it is routine SRE work. All of it is friction between you and the first useful piece of evidence.
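Done by hand, that plumbing looks something like this. A sketch only: the context name and labels are made up, and date -d is GNU date.

# alert window to ISO 8601 UTC
date -u -d '2026-04-10 08:42:58' +%Y-%m-%dT%H:%M:%SZ
# right cluster, right namespace, right deployment
kubectl config use-context staging-eks
kubectl get pods -n <ns> -l app=my-php-app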
Two tools, one workflow
Two tools now sit between me and that chain of routine tasks.
The first is pup: Datadog’s new CLI for humans and agents. It wraps the same Datadog
API the web UI hits, but emits plain text or JSON. Instead of clicking through the
metric explorer, I get:
pup metrics query \
--query='p95:trace.symfony.request{service:my-php-app,env:staging}' \
--from='2026-04-10T07:00:00Z' \
--to='2026-04-10T08:30:00Z'
Shellable. Pipeable. Diff-able. Most usefully: scriptable.
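Comparing the alert window against the same window a day earlier, for instance, is two redirects and a diff, using the same flags as above with only the timestamps changed:

pup metrics query \
--query='p95:trace.symfony.request{service:my-php-app,env:staging}' \
--from='2026-04-09T07:00:00Z' --to='2026-04-09T08:30:00Z' > yesterday.txt
pup metrics query \
--query='p95:trace.symfony.request{service:my-php-app,env:staging}' \
--from='2026-04-10T07:00:00Z' --to='2026-04-10T08:30:00Z' > today.txt
diff yesterday.txt today.txt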
The second tool is Claude Code with a small custom skill I wrote called /alert-triage.
The skill knows our conventions. Where investigations are tracked. What it can and cannot do, thanks to a deny/approve list of commands in its settings.json (sketched below).
Which kubectl contexts map to which environments. Which repos hold source code it can read — the application repos, the Helm
charts. When I paste in a monitor ID, the skill doesn’t ask what to do. It runs the
same investigation I would have run, and writes a markdown report to the
investigations directory.
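The deny/approve list is ordinary Claude Code permission config. A trimmed sketch of the shape it takes in settings.json; the patterns here are illustrative, not our real list:

{
  "permissions": {
    "allow": [
      "Bash(pup metrics query:*)",
      "Bash(kubectl get:*)",
      "Bash(kubectl describe:*)"
    ],
    "deny": [
      "Bash(kubectl delete:*)",
      "Bash(kubectl apply:*)"
    ]
  }
}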
The report is the detail that matters more than the speed. The deliverable of a triage session is not the diagnosis in my head; it’s an artifact that I, and anyone after me, can read.
The case, end to end
Back to the alert. The skill’s first move is the same as mine: pull the monitor config, then query the metric over the alert window to confirm the symptom. That part is mechanical.
Then it widens. A per-pod memory query:
pup metrics query \
--query='max:kubernetes.memory.usage{kube_deployment:my-php-app,env:staging} by {pod_name}' \
--from='2026-04-10T05:00:00Z' \
--to='2026-04-10T08:30:00Z'
That returns the smoking gun. The worst pod peaked at 96% of its limit. Two of its neighbours were within a few hundred MB of the ceiling. Then it cross-checks with kubectl:
kubectl describe pod <pod> -n <ns> | grep -A10 "Last State"
Four pods OOMKilled, exit code 137, between 06:19 and 06:58 UTC, well before the latency spike. The HPA had already scaled to its maximum replica count. CPU was still sitting at 138% of the 80% target defined in the autoscaler policy. Capacity was insufficient, and there was nowhere left for the autoscaler to go.
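The same check across the whole deployment, rather than pod by pod, is a one-liner; the label selector is illustrative:

# last termination reason for every pod in the deployment
kubectl get pods -n <ns> -l app=my-php-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
# HPA status: current vs. max replicas, current vs. target CPU
kubectl get hpa my-php-app -n <ns>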
Then the skill reads the FPM config out of the Helm chart. pm.max_children = 80.
Per-worker memory ceiling: 256MB. That’s a 20GB theoretical maximum against a 2GB
pod limit. The match had been lit long before the alert fired.
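The check itself is mundane. Something like this against the chart, with an illustrative chart path:

# render the chart, pull out the FPM pool settings and the pod limits
helm template charts/my-php-app | grep -E 'pm\.max_children|memory_limit|memory:'
# worst case: 80 workers at 256MB each
echo $((80 * 256))   # 20480 MB, against a 2048 MB pod limit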
The chain that ends up in the report reads like this: memory pressure from PHP-FPM worker accumulation caused a wave of OOMKills, which reduced pod capacity, which pushed surviving pods into CPU saturation, which in turn starved FPM workers and drove up latency. Each link in that sentence has a query underneath it.
The artifact is a markdown file with a Timeline section in UTC, a “What this is NOT” section listing false leads ruled out, and an Issues to Fix list ranked by priority. It goes straight into a postmortem if the incident needs one, and into the engineering queue otherwise.
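The skeleton, roughly; the sections are the ones named above, the title and ordering are illustrative:

# Investigation: my-php-app p95 latency, 2026-04-10
## Timeline (UTC)
## Evidence chain
## What this is NOT
## Issues to Fix (ranked by priority)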
A different report from the same week investigates a CPU spike on one of our JVM backend services and lands in a completely different shape: an unimplemented check handler logging the same “not found” message thousands of times, combined with a scheduled batch running simultaneously on every pod because there’s no distributed lock. The tooling is the same. The diagnosis is not transferable. That’s the point: the skill is a workflow, not a playbook.
When to reach for it
A few practical notes after running this for a while.
Run it when the source matters. Without access to source, the model can only correlate symptoms; the diagnosis ends at “memory pressure, cause unknown.” The pre-authorised reads of the application repo and the Helm chart are what take it from “things looked bad” to “the FPM config is the cause.”
It costs tokens. Each investigation reads metrics, logs, and potentially several source files, CI/CD pipeline runs, or git logs. A noisy alert that turns out to be a non-incident costs more to dismiss with the skill than to verify by eye. I don’t run the skill on everything, only when the timeline is wide enough and enough systems are involved that the plumbing alone would be a pain.
Verify before acting on the recommendation. The agent will suggest "raise pod
memory to 3–4Gi, lower pm.max_children to 25." Accepting that is on you. More than
once it has hallucinated or missed important information. Remember the postmortem
test: agents suggest, humans decide.
What changed
The time gap between alert and evidence. That’s the part that collapses. The plumbing of timestamps, Helm config, namespace names, the four-tool copy-paste dance: all of it goes from minutes to seconds. The judgement still takes the time it always took. Which is the right division of labour: the typing is the agent’s job, the diagnosis is mine.