Agentic Incident Triage: The D2C CloudWatch-Native Build

At 2:47 AM during BFCM 2024, a $7M fashion brand's checkout service started throwing errors. P95 latency hit 8.4 seconds; the baseline was 340ms. The CTO — also the on-call engineer that night — opened CloudWatch and worked through it systematically: Logs Insights for error signatures, RDS Performance Insights for slow queries, X-Ray traces for the slow path, ALB access logs for request patterns. Seventy-three minutes after the alert fired, he found it: a database migration script had left a lock on the orders table. He killed the migration, the connection pool cleared, latency dropped to 290ms.

The Slack message he posted at 4:00 AM: "Fixed. Was DB connections."

Three months later, a different engineer got a similar alert. Same symptom — checkout latency spike, elevated 5xx rate. The first incident's investigation produced no artifact beyond two words in a Slack thread. The second engineer spent 68 minutes on the same root cause before finding the same fix. The AWS agentic triage pattern that just dropped — Amazon Q, New Relic, five investigation queries from one prompt, structured RCA brief as output — is built to solve both the investigation and the handoff. D2C brands on CloudWatch can build the same five-query pattern without a New Relic subscription. But the RCA brief format is what stops the second 68-minute investigation from happening.

Running AWS for a D2C brand with one on-call engineer and no incident runbook? Book a 30-min audit — Dhwani joins every call, we review your CloudWatch alarm configuration and triage tooling, written brief inside a week. No SDR layer.

What an Agentic Triage Pattern Does Differently

A traditional on-call workflow is sequential: alarm fires, engineer gets paged, engineer opens six tools in some order they remember from last time, engineer finds root cause or gives up and reboots something. The investigation quality depends entirely on the engineer's familiarity with the system, what they check first, and whether they're making good decisions at 2am.

An agentic triage pattern changes the investigation step structurally. A single prompt ("checkout is slow, server errors on checkout-service, last 24 hours") triggers the agent to decide autonomously which tools to invoke and in what order, then synthesize the results into a structured output. The engineer shifts from executor of investigation steps to reviewer of a completed investigation. The agent doesn't get tired. It doesn't skip the RDS Performance Insights check because it assumes the problem is in application code. It works from the same five queries every time and produces an output in the same format regardless of who is on call.

The Amazon Q implementation in the reference post calls five New Relic tools from one prompt: alert insights, user impact assessment, log analysis, transaction analysis, and a custom NRQL query. Each tool's output feeds context into the next. The final result is an RCA brief and an Asana task created automatically for handoff. The New Relic tools are specific to New Relic customers. But the five-query pattern maps directly to CloudWatch, X-Ray, and RDS tools that every D2C brand running on AWS already has access to.

The Five Investigation Queries for a CloudWatch-Native Stack

Each of the five New Relic tools has a CloudWatch-native equivalent:

Query 1: Alarm state (New Relic: alert insights → CloudWatch: DescribeAlarms + Anomaly Detection). Which alarms are currently in ALARM state or have transitioned in the past 2 hours? Which metrics crossed their anomaly detection bands? This gives the agent starting facts: latency spike, error rate increase, or throughput drop. CloudWatch API: DescribeAlarms filtered by state for what is firing now, DescribeAlarmHistory for state transitions in the past 2 hours, plus GetMetricData for anomaly detection band violations against the relevant namespace.
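A minimal sketch of the first half of query 1 in Python, assuming a boto3 CloudWatch client (`boto3.client("cloudwatch")`) is passed in. The function name and the fields kept in each summary are illustrative choices, not part of the reference implementation.

```python
from typing import Any, Dict, List, Optional


def alarms_in_alarm_state(cloudwatch: Any, name_prefix: Optional[str] = None) -> List[Dict[str, Any]]:
    """Summarize every metric alarm currently in ALARM state.

    `cloudwatch` is assumed to be a boto3 CloudWatch client.
    """
    kwargs: Dict[str, Any] = {"StateValue": "ALARM"}
    if name_prefix:
        kwargs["AlarmNamePrefix"] = name_prefix

    summaries: List[Dict[str, Any]] = []
    while True:
        page = cloudwatch.describe_alarms(**kwargs)
        for alarm in page.get("MetricAlarms", []):
            summaries.append({
                "name": alarm["AlarmName"],
                "metric": alarm.get("MetricName"),
                "namespace": alarm.get("Namespace"),
                "reason": alarm.get("StateReason"),
                "since": alarm.get("StateUpdatedTimestamp"),
            })
        # DescribeAlarms is paginated; keep following NextToken until it is absent.
        token = page.get("NextToken")
        if not token:
            return summaries
        kwargs["NextToken"] = token
```

Passing a prefix such as `"checkout"` keeps the agent's starting facts scoped to one service, provided alarm names follow that convention.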

Query 2: User impact (New Relic: user impact → CloudWatch/ALB: request rate + error rate by path). How many requests is the affected service handling, and what fraction are erroring? Which paths are affected? CloudWatch Metrics: ALB HTTPCode_Target_5XX_Count and RequestCount by TargetGroup. HTTPCode_Target_5XX_Count only counts 5xx responses generated by the targets; errors returned by the load balancer itself appear under HTTPCode_ELB_5XX_Count. For path-level breakdown, use ALB access logs via Athena, a setup we covered in detail in our D2C incident investigation post. This query also answers the alert fatigue question: if error rate and order rate are clean despite the latency spike, the on-call engineer can monitor rather than escalate.
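The error-rate half of query 2 can be expressed as a single GetMetricData request using metric math. The sketch below only builds the `MetricDataQueries` payload; the function name, the query IDs, and the use of `FILL(errors, 0)` to treat missing 5xx datapoints as zero are assumptions for illustration.

```python
from typing import Any, Dict, List


def alb_error_rate_queries(load_balancer: str, target_group: str, period: int = 60) -> List[Dict[str, Any]]:
    """Build MetricDataQueries for the target 5xx rate (%) of one ALB target group."""
    dimensions = [
        {"Name": "LoadBalancer", "Value": load_balancer},
        {"Name": "TargetGroup", "Value": target_group},
    ]

    def stat(query_id: str, metric_name: str) -> Dict[str, Any]:
        # Raw series are inputs to the expression only, so ReturnData is False.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return [
        stat("errors", "HTTPCode_Target_5XX_Count"),
        stat("requests", "RequestCount"),
        {
            "Id": "error_rate",
            # The 5xx metric is typically absent when there are no errors; fill gaps with 0.
            "Expression": "100 * FILL(errors, 0) / requests",
            "Label": "Target 5xx rate (%)",
            "ReturnData": True,
        },
    ]
```

The result would be passed as `cloudwatch.get_metric_data(MetricDataQueries=queries, StartTime=..., EndTime=...)`, with the dimension values taken from the load balancer and target group ARN suffixes.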

Query 3: Log error signatures (New Relic: log analysis → CloudWatch Logs Insights). What error messages are appearing in the service logs in the 30-minute window around the alarm? What exception types and stack traces are clustered? CloudWatch Logs Insights query: filter @message like /ERROR/ | stats count(*) as errorCount by bin(5m), errorType | sort errorCount desc. The count is aliased so the sort has a named field to reference, and errorType assumes the service writes structured (JSON) logs that carry that field. This surfaces whether the errors are database connection errors, timeout errors, or null pointer exceptions, which determines whether the investigation goes toward infrastructure or application code next.
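Logs Insights queries are asynchronous: StartQuery returns a query ID and GetQueryResults is polled until the status is terminal. A rough sketch of that loop, assuming a boto3 CloudWatch Logs client is passed in; the function name, polling defaults, and the aliased `errorCount` field are illustrative, and `errorType` is assumed to exist as a field in the service's structured logs.

```python
import time
from typing import Any, Dict, List

ERROR_QUERY = (
    "filter @message like /ERROR/ "
    "| stats count(*) as errorCount by bin(5m), errorType "
    "| sort errorCount desc"
)


def run_insights_query(logs: Any, log_group: str, start_epoch: int, end_epoch: int,
                       query: str = ERROR_QUERY, poll_seconds: float = 1.0,
                       max_polls: int = 60) -> List[Dict[str, str]]:
    """Start a Logs Insights query and poll until it completes.

    `logs` is assumed to be a boto3 CloudWatch Logs client; times are epoch seconds.
    """
    started = logs.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )
    query_id = started["queryId"]

    for _ in range(max_polls):
        result = logs.get_query_results(queryId=query_id)
        status = result["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells; flatten to a dict.
            return [{cell["field"]: cell["value"] for cell in row} for row in result["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Logs Insights query {query_id} ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Logs Insights query {query_id} did not finish after {max_polls} polls")
```

Inside a Lambda tool, `max_polls * poll_seconds` should stay below the function timeout so the agent gets an explicit error rather than a silent cutoff.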

Query 4: Transaction analysis (New Relic: transaction analysis → X-Ray). Which traces in the affected time window show high latency? Which segment in the call chain is the slow one? X-Ray API: GetTraceSummaries with a filter expression on service name and response time above the P95 threshold, followed by BatchGetTraces on the slowest three to identify the specific segment contributing the delay. For a checkout service calling auth, inventory, payments, and fulfillment, this immediately names which downstream call is slow.
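One way to sketch query 4, assuming a boto3 X-Ray client is passed in. The function name and parameters are hypothetical; the latency threshold is supplied by the caller in seconds, for example the P95 baseline.

```python
from typing import Any, Dict, List


def slowest_traces(xray: Any, service: str, start: Any, end: Any,
                   min_seconds: float, top_n: int = 3) -> List[Dict[str, Any]]:
    """Fetch full trace documents for the slowest traces of one service.

    `xray` is assumed to be a boto3 X-Ray client; `start` and `end` bound the window.
    """
    kwargs: Dict[str, Any] = {
        "StartTime": start,
        "EndTime": end,
        "FilterExpression": f'service("{service}") AND responsetime > {min_seconds}',
    }

    summaries: List[Dict[str, Any]] = []
    while True:
        page = xray.get_trace_summaries(**kwargs)
        summaries.extend(page.get("TraceSummaries", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    # Rank by response time and keep only the slowest few.
    slowest = sorted(summaries, key=lambda s: s.get("ResponseTime", 0.0), reverse=True)[:top_n]
    trace_ids = [s["Id"] for s in slowest]
    if not trace_ids:
        return []
    # BatchGetTraces accepts only a handful of IDs per call, so top_n should stay small.
    return xray.batch_get_traces(TraceIds=trace_ids)["Traces"]
```

The returned trace documents contain the per-segment timings the agent needs to name the slow downstream call.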

Query 5: Database load (New Relic: custom NRQL → RDS Performance Insights). This is the D2C-specific fifth step that most generic triage architectures omit. Checkout services at $5-15M GMV almost always terminate at a relational database, and the most common root cause of checkout latency spikes is database-level: a slow query, a lock, connection pool exhaustion, or an index that stopped being used after a schema change. RDS Performance Insights API: GetResourceMetrics for db.load.avg grouped by SQL digest. If one query accounts for more than 40% of DB load in the incident window, treat it as the leading root-cause candidate; the agent can name it directly in the RCA brief.
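The 40% rule can be applied to a GetResourceMetrics response with a few lines of Python. This sketch assumes the request grouped `db.load.avg` by `db.sql_tokenized` (for example `MetricQueries=[{"Metric": "db.load.avg", "GroupBy": {"Group": "db.sql_tokenized", "Limit": 10}}]`) and that the response's `MetricList` contains one series without dimensions for total load plus one series per SQL digest. That shape, and the function name, are assumptions to verify against a real response.

```python
from typing import Any, Dict, List, Optional, Tuple


def dominant_sql(metric_list: List[Dict[str, Any]], threshold: float = 0.4) -> Optional[Tuple[str, float]]:
    """Return (statement, share_of_load) if one SQL digest exceeds `threshold` of DB load."""

    def mean(points: List[Dict[str, Any]]) -> float:
        values = [p["Value"] for p in points if p.get("Value") is not None]
        return sum(values) / len(values) if values else 0.0

    total = 0.0
    per_sql: Dict[str, float] = {}
    for series in metric_list:
        dims = series["Key"].get("Dimensions")
        load = mean(series.get("DataPoints", []))
        if not dims:
            total = load  # series without dimensions is the overall db.load.avg
        else:
            per_sql[dims.get("db.sql_tokenized.statement", "<unknown>")] = load

    if total <= 0 or not per_sql:
        return None
    statement, load = max(per_sql.items(), key=lambda kv: kv[1])
    share = load / total
    return (statement, share) if share > threshold else None
```

A `None` result tells the agent that no single statement dominates, which points the investigation back toward application code or downstream services.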

For the $7M fashion brand's 73-minute investigation: query 5 would have surfaced the locked migration query in under 90 seconds. The prior 72 minutes were spent getting from alarm to database, one manual step at a time.

The RCA Brief Format That Survives Handoff

The Amazon Q implementation generates a structured RCA brief automatically as the investigation output. The format it uses — summary, blast radius, trigger, evidence links, next actions — is deliberately minimal: enough for the next engineer to understand what happened in 90 seconds, not so detailed that it takes 20 minutes to produce.

For D2C incidents, we use a seven-field version that adds two fields the standard brief omits:

SUMMARY: [Service, symptom, duration, peak metric value]
USER IMPACT: [Orders/hour affected, checkout conversion delta, webhook failures]
TRIGGER: [Root cause in one sentence]
EVIDENCE: [CloudWatch alarm link, Logs Insights permalink, X-Ray trace ID, RDS PI screenshot]
FIX APPLIED: [What was done, at what time, by whom]
WATCH FOR: [Leading indicator of recurrence — the specific metric + threshold]
FOLLOW-UP: [Ticket link for permanent fix, owner, deadline]

The WATCH FOR field is the one most D2C teams skip, and the one that turns a second identical incident from 68 minutes to 10. For the migration lock incident: "Watch for: db.load.avg spike above 0.8 concurrent with DatabaseConnections plateau in RDS Performance Insights. Check pg_locks count before assuming application cause." That single line, attached to the next deployment runbook, ends the recurring incident.

The EVIDENCE field with direct deep-links to CloudWatch Logs Insights saved queries and X-Ray trace IDs is what the Slack message "Fixed. Was DB connections" doesn't have. When the second engineer opens the first incident's brief, they can replay the exact queries that found the root cause and check whether the same pattern appears — rather than starting the five-query sequence from scratch.
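To keep the seven fields structured rather than free text, the brief can be modelled as a small data class that renders to the plain-text layout shown above. This is an illustrative sketch; the class name and attribute names are assumptions, and only the seven labels come from the format itself.

```python
from dataclasses import dataclass

# Attribute name -> label, in the order the brief is rendered.
_LABELS = (
    ("summary", "SUMMARY"),
    ("user_impact", "USER IMPACT"),
    ("trigger", "TRIGGER"),
    ("evidence", "EVIDENCE"),
    ("fix_applied", "FIX APPLIED"),
    ("watch_for", "WATCH FOR"),
    ("follow_up", "FOLLOW-UP"),
)


@dataclass
class RcaBrief:
    summary: str
    user_impact: str
    trigger: str
    evidence: str
    fix_applied: str
    watch_for: str
    follow_up: str

    def render(self) -> str:
        """Render the brief as one 'LABEL: value' line per field."""
        return "\n".join(f"{label}: {getattr(self, attr)}" for attr, label in _LABELS)

    def as_ticket_fields(self) -> dict:
        """Return the fields keyed by label, for structured ticket metadata."""
        return {label: getattr(self, attr) for attr, label in _LABELS}
```

The same object can feed both the Slack message (`render()`) and the ticket (`as_ticket_fields()`), so the two never drift apart.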

Building an on-call runbook and triage agent for a D2C AWS stack? We've implemented this across 10+ brands. Grab 30 minutes — we map your existing CloudWatch setup against the five-query pattern and tell you what's already there vs what needs building. Written brief inside a week.

Alert Fatigue Is Why D2C Teams Skip Investigation Steps

A D2C brand at $5-10M GMV running ECS, RDS, Shopify webhooks, and CloudFront generates a lot of CloudWatch alarms. Scaling events during flash sales fire latency alarms that aren't service failures. End-of-month batch jobs fire throughput alarms that resolve in 15 minutes. Over time, engineers learn to dismiss alarms that look like prior false positives — and occasionally dismiss one that isn't.

The agentic triage pattern reduces alert fatigue specifically through query 2: user impact assessment. If the ALB error rate and order throughput are within normal range during a latency spike, the on-call engineer doesn't need to escalate at 2am. The agent surfaces that context immediately and the brief reads: "Latency elevated, no user-visible error impact, order throughput normal — consistent with auto-scaling lag, monitor." The engineer acknowledges and goes back to sleep.
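The monitor-or-escalate call can be sketched as a small function. The thresholds here (1% error rate, 90% of baseline order throughput) are hypothetical defaults for illustration and would need tuning per brand.

```python
def triage_decision(error_rate_pct: float, orders_per_hour: float,
                    baseline_orders_per_hour: float,
                    max_error_rate_pct: float = 1.0,
                    min_order_ratio: float = 0.9) -> str:
    """Return 'monitor' when user impact looks normal, otherwise 'escalate'."""
    errors_ok = error_rate_pct <= max_error_rate_pct
    # With no baseline to compare against, order throughput cannot veto the decision.
    orders_ok = (
        baseline_orders_per_hour <= 0
        or orders_per_hour / baseline_orders_per_hour >= min_order_ratio
    )
    return "monitor" if errors_ok and orders_ok else "escalate"
```

A latency alarm with a 0.2% error rate and orders at 96% of baseline returns `"monitor"`, which matches the auto-scaling lag case described above.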

Without that first-pass impact triage, every latency alarm looks like a potential revenue threat at 2am. The skip-and-hope pattern — dismiss the alarm, if it gets worse it'll page again — is the natural response to a runbook that requires manual investigation for every alert. The agentic pattern eliminates the skip-or-investigate binary: the agent investigates in 90 seconds and tells you which option applies.

Building the Pattern on Bedrock Without New Relic

The Amazon Q + New Relic implementation uses three paid subscriptions. For a D2C brand already running on AWS, the same five-query pattern runs on Bedrock with four components and no additional subscriptions.

1. Bedrock agent with five Lambda tool definitions. Each tool wraps one AWS API call: CloudWatch DescribeAlarms, CloudWatch Logs Insights StartQuery + GetQueryResults, X-Ray GetTraceSummaries, RDS Performance Insights GetResourceMetrics, and an Athena query against the ALB access log table. The Bedrock agent decides which tools to call based on the incident prompt and what each previous query returns — the same autonomous reasoning the New Relic implementation uses, running on Bedrock Claude instead.

2. Automatic trigger via SNS + Lambda. A Lambda function subscribes to the SNS topic attached to CloudWatch Alarms. When an alarm transitions to ALARM state, the Lambda formats an incident prompt — "Alert: [alarm name] at [timestamp]. Service: [extracted from alarm description]. Metric: [current value]. Investigate the past 30 minutes and generate an RCA brief." — and sends it to the Bedrock agent via the InvokeAgent API.

3. RCA brief output to Slack. The agent formats the seven-field brief and the trigger Lambda posts it to a Slack channel. Evidence links point to CloudWatch Logs Insights saved queries, X-Ray trace IDs, and Athena query result S3 paths. The brief arrives in Slack typically within 90 to 120 seconds of the alarm firing — before the on-call engineer has finished loading their laptop.

4. Linear or Jira task creation. A final tool call creates an incident ticket with the brief attached and assigns it to the on-call rotation owner. The ticket has all seven fields as structured metadata, not free text in a description — so incident retrospectives can query across tickets for recurring root causes by service and trigger type.
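Component 2 can be sketched as two small functions: one turns the SNS record that CloudWatch Alarms publishes into an incident prompt, the other sends it to the agent and collects the streamed reply. The function names are hypothetical, the agent runtime client is assumed to be `boto3.client("bedrock-agent-runtime")`, and the service name is assumed to be stored in the alarm description. The alarm notification carries the metric value inside `NewStateReason` rather than as its own field, so the sketch passes that text through.

```python
import json
import uuid
from typing import Any, Dict, Optional


def prompt_from_sns_record(record: Dict[str, Any]) -> Optional[str]:
    """Build an incident prompt from one SNS record, or None if not an ALARM transition."""
    alarm = json.loads(record["Sns"]["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return None
    return (
        f"Alert: {alarm['AlarmName']} entered ALARM state at {alarm['StateChangeTime']}. "
        f"Service: {alarm.get('AlarmDescription') or 'unknown'}. "
        f"Metric: {alarm.get('Trigger', {}).get('MetricName', 'unknown')}. "
        f"Reason: {alarm.get('NewStateReason', 'n/a')}. "
        "Investigate the past 30 minutes and generate an RCA brief."
    )


def invoke_triage_agent(agent_runtime: Any, agent_id: str, alias_id: str, prompt: str) -> str:
    """Send the prompt to a Bedrock agent and join the streamed completion chunks."""
    response = agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=str(uuid.uuid4()),  # one fresh session per incident
        inputText=prompt,
    )
    parts = []
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)
```

A Lambda handler would loop over `event["Records"]`, skip records where the prompt is `None`, and post the returned brief to Slack.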

This build sits within our standard AI agent infrastructure for D2C brands — a Bedrock agent with a Lambda tool layer, SNS integration, and a Slack webhook. All five underlying AWS APIs are available without any additional subscriptions. The full build runs to about 3-4 weeks including the Athena setup for ALB log querying and the X-Ray instrumentation on ECS tasks.

Frequently Asked Questions

Does the agentic triage pattern work if X-Ray traces aren't enabled?

Yes, with degraded depth. Queries 1, 2, and 3 (alarm state, user impact, log errors) still run and frequently surface the root cause without trace data. Query 4 (transaction analysis) is skipped, and the agent notes in the RCA brief that trace-level evidence was unavailable. For incidents involving slow database queries, query 5 (RDS Performance Insights) often identifies the bottleneck without needing X-Ray to pinpoint the call chain. That said, X-Ray adds meaningful depth for ECS-based microservice stacks where latency is distributed across multiple services. Enabling it on ECS means adding the X-Ray daemon (or the AWS Distro for OpenTelemetry collector) as a sidecar container in each task definition and instrumenting the application code. The trace cost stays under $5/month at D2C traffic volumes, so it is worth the setup before the next BFCM.

How is an agentic triage pattern different from a CloudWatch alarm runbook?

A CloudWatch alarm runbook is a static checklist: when this alarm fires, do these steps in this order. An agentic pattern is dynamic: the agent reads the alarm description and any initial context, then decides which investigation tools to call based on what each query returns. If the alarm state query shows a latency spike with no concurrent error rate increase, the agent may skip the log error step and go straight to transaction traces. If the log analysis surfaces a specific exception class, the agent can follow that thread with a targeted query rather than executing the next predetermined step. A runbook written for a hypothetical average incident will be wrong for most actual incidents in some detail. An agent that reasons about each query's output before deciding what to call next adapts to the specific incident at hand.

What should an incident prompt include to get useful output from the triage agent?

Three things: the affected service name (so the agent queries the right CloudWatch log group, X-Ray service, and RDS instance), the symptom (latency spike, error rate increase, or throughput drop), and the time window. For a CloudWatch-triggered automatic prompt: "Alert: [alarm name] entered ALARM state at [timestamp]. Service: [service name from alarm description]. P95 latency is [metric value] vs [baseline]. Investigate the past 30 minutes and generate an RCA brief." The service name is the most important field — without it, the agent must guess which log group and RDS instance to query, which adds unnecessary tool calls and slows the investigation. Embedding the service name in the alarm description rather than in the alarm name is the configuration change that makes auto-generated prompts accurate.
