ALB and VPC Flow Logs: The D2C Incident Playbook

At 11:18 PM during BFCM 2024, a $6M apparel brand's Shopify webhook receiver went silent. Deliveries for the orders/create topic started returning 502. Shopify queued them and retried every 5 minutes. By 7:30 AM, 340 orders had processed in Shopify with zero fulfillment records in Odoo. The ops lead spent 4 hours rebooting ECS tasks, cycling target groups, and checking security groups before tracing the failure to an ECS container that had been OOM-killed during the traffic spike. The ALB had not marked it unhealthy quickly enough, so every webhook in that window hit a target that was already gone.

CloudTrail had nothing. No API call caused this failure. It was data-plane: a container runtime event that leaves no control-plane footprint. AWS's DevOps Agent investigation pattern, recently extended to pull S3 logs and invoke custom MCP analysis tools, is built for this failure mode. It correlates ALB access logs, VPC Flow Logs, and packet captures to surface root causes that CloudTrail cannot see. For the pattern to work, those logs have to exist. Most D2C brands don't have any of them enabled.

Running a D2C stack on AWS with Shopify webhooks and Odoo integration? Book a 30-min audit — Dev joins every call, we check your log coverage and ALB configuration, written brief inside a week. No SDR layer.

What CloudTrail Doesn't Capture — and What Does

CloudTrail records API calls to AWS services. When a deployment stops a service, when an IAM policy change locks out a function, when someone runs an SSM command against an instance — CloudTrail has it. The timestamp, the identity, the parameters. That's why it's the first place every ops team looks during an incident.

But when an ECS container runs out of memory and the runtime kills it, no API call initiates that event. When the ALB health check marks a target unhealthy and starts returning 502 to incoming requests, there's no API call for that either. When a tightened security group rule starts dropping packets from the app tier to RDS, CloudTrail records the rule change itself but not a single dropped packet. As we noted in our post on AI-powered cost anomaly investigation, the distinction between control-plane and data-plane failures is exactly where automated investigation tools need a second log source.

The three log streams that cover the data plane for a D2C AWS stack are all disabled by default:

ALB access logs: every request the load balancer handles, including timing fields that distinguish "target never responded" from "target responded slowly" from "target returned a 5xx." Stored in S3. Off by default on every new ALB.
VPC Flow Logs: a record for each accepted or rejected traffic flow at the network interface level (flows are aggregated over a capture window rather than logged packet by packet). Shows which connections succeeded and which were dropped before the application saw them. Stored in S3 or CloudWatch Logs. Off by default on every new VPC.
CloudFront standard logs: every request through the CDN, including the origin response code and the edge result type. Distinguishes a CloudFront-layer error from an ALB-layer error. Stored in S3. Free except for storage. Off by default on every new distribution.

The "-1 -1 -1" Pattern and What It Means for Webhook Failures

An ALB access log record contains three timing fields in sequence: request processing time, target processing time, response processing time. A healthy request to a live ECS task reads something like 0.001 0.003 0.000. A request where the ALB could not reach the target at all reads -1 -1 -1.

The -1 -1 -1 pattern is the fingerprint of a dead or unhealthy target. When an ECS task is OOM-killed, the ALB keeps routing to it until enough health checks fail: the default interval is 30 seconds, and a target must fail consecutive checks before it is marked unhealthy, so the exposure window can run to a minute or more. Requests that arrive in that window hit a target that's gone. Shopify's webhook sender gets a 502, logs a delivery failure, and queues a retry. If ECS replaces the task quickly, most retries succeed and the ops team never knows. If the outage persists past Shopify's first retry window, webhooks start silently expiring.
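As a minimal sketch of how the timing fields can be checked programmatically, the snippet below splits one access log line and tests for the all -1 pattern. The sample line is illustrative rather than taken from a real log, and the function names are hypothetical; only the leading fields of the entry are named.

```python
import shlex

# Names for the leading fields of an ALB access log entry, in order.
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
]


def parse_alb_entry(line: str) -> dict:
    """Split one access log line and name its first ten fields."""
    # shlex keeps quoted fields (request line, user agent) as single tokens.
    parts = shlex.split(line)
    return dict(zip(FIELDS, parts[:len(FIELDS)]))


def target_never_responded(entry: dict) -> bool:
    """True when all three timing fields are -1."""
    timing = ("request_processing_time",
              "target_processing_time",
              "response_processing_time")
    return all(entry[name] == "-1" for name in timing)


# Illustrative line (assumption): hostnames, IPs, and IDs are made up.
SAMPLE = (
    'https 2024-11-29T23:18:04.123456Z app/webhook-alb/0123456789abcdef '
    '203.0.113.10:44321 10.0.2.100:8080 -1 -1 -1 502 - 512 288 '
    '"POST https://hooks.example.com:443/webhooks/orders HTTP/1.1" "example-agent"'
)

entry = parse_alb_entry(SAMPLE)
print(entry["target"], entry["elb_status_code"], target_never_responded(entry))
```

Grouping matches by the target field is usually the quickest way to see whether one task or the whole target group was unreachable.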

The AWS DevOps Agent investigation demo surfaces exactly this pattern: a CloudWatch alarm fires on elevated 5xx rate, the agent reads ALB access logs from S3, finds the -1 -1 -1 timing on the relevant target group, correlates the timestamp with an ECS task count drop in CloudWatch metrics, and names the root cause. For the $6M apparel brand, this investigation would have taken 8 minutes. Without ALB access logs on their webhook receiver, it took 4 hours of reboots.

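The storage cost: S3 Standard storage runs about $0.023 per GB-month, and ALB access logging itself carries no separate charge. At 10,000 Shopify webhook deliveries per day during BFCM, the webhook receiver ALB log volume is roughly 400MB/day, so each day of logs adds about $0.009 to the monthly storage bill. A full 30 days of retained logs is about 12GB, or roughly $0.28 per month. Set a 90-day S3 lifecycle policy and the retained volume tops out near 36GB, about $0.83 per month, which keeps the cost well under $3 permanently.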
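The storage arithmetic can be sketched as a steady-state calculation. This assumes the $0.023 figure is a per-GB-month S3 Standard price and that daily volume stays flat at the article's 400MB estimate; the function name is hypothetical.

```python
def monthly_storage_cost(daily_gb: float, retention_days: int,
                         price_per_gb_month: float = 0.023) -> float:
    """Monthly S3 cost once the retention window is full of logs."""
    retained_gb = daily_gb * retention_days
    return retained_gb * price_per_gb_month


# 400 MB/day, retained for 30 and 90 days.
print(f"30 days retained: ${monthly_storage_cost(0.4, 30):.2f}/month")
print(f"90 days retained: ${monthly_storage_cost(0.4, 90):.2f}/month")
```

Because the lifecycle rule bounds how much is retained, the monthly figure stops growing once the oldest logs start expiring.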

The webhook reliability audit — ALB log coverage, ECS memory limits, health check configuration — is a 2-hour fix on most D2C stacks we've reviewed. If BFCM is on the horizon and you've never checked these settings, grab 30 minutes. Written findings inside a week, fixed before peak.

VPC Flow Logs: REJECT Before the Reboot

VPC Flow Logs operate one layer below ALB access logs: at the network layer, before the application processes anything. A rejected connection record looks like this in S3:

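2 123456789012 eni-abc12345 10.0.1.50 10.0.2.100 49152 5432 6 1 40 1700000000 1700000060 REJECT OK

The default field order is version, account ID, interface ID, source address, destination address, source port, destination port, protocol, packets, bytes, start, end, action, and log status. This record is therefore a single TCP packet (protocol 6) from 10.0.1.50 to port 5432 on 10.0.2.100. REJECT in the action field means a security group or network ACL did not permit the traffic; the trailing OK is only the logging status. The destination never saw the connection attempt. Three D2C scenarios where Flow Logs are the only evidence: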

Odoo loses connectivity to RDS: A Terraform apply during a compliance review tightens a security group and removes the rule allowing the app tier to reach RDS on port 5432. Orders start failing. CloudTrail shows the Terraform API calls, but not the downstream effect on live traffic. Flow Logs show REJECT on port 5432 from the app tier CIDR, timestamped to within a second of the first order failure.
ECS task can't reach Shopify API: A NAT Gateway is removed during an infrastructure cost-cutting pass (common after BFCM). Lambda functions and ECS tasks reaching external APIs start failing silently. Flow Logs show outbound attempts with no return traffic — the NAT Gateway that would have translated the source IP is gone.
New ECS task can't reach ElastiCache: A redeployed task gets a new ENI with a new private IP, and the cache's security group inbound rule was scoped to the old task's IP address rather than to a security group reference. New tasks can't reach the cache layer. Ops sees cache miss rate spike in CloudWatch; Flow Logs identify the rejected connections at the specific new ENI.
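A minimal sketch of the filter behind the first scenario, assuming records in the default version 2 field order (source port before destination port). The sample records and function names are illustrative, not from a real account.

```python
from typing import NamedTuple


class FlowRecord(NamedTuple):
    version: int
    account_id: str
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int
    packets: int
    byte_count: int
    start: int
    end: int
    action: str
    log_status: str


def parse_flow_record(fields: list) -> FlowRecord:
    """Convert 14 whitespace-separated fields into a typed record."""
    return FlowRecord(
        int(fields[0]), fields[1], fields[2], fields[3], fields[4],
        int(fields[5]), int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
        fields[12], fields[13],
    )


def rejected_to_port(lines, dstport: int) -> list:
    """Return REJECT records whose destination port matches."""
    matches = []
    for line in lines:
        fields = line.split()
        # Skip header lines and NODATA/SKIPDATA records, which carry "-" values.
        if len(fields) != 14 or fields[-1] != "OK":
            continue
        record = parse_flow_record(fields)
        if record.action == "REJECT" and record.dstport == dstport:
            matches.append(record)
    return matches


# Illustrative records (assumption): addresses and ENI IDs are made up.
SAMPLE_LINES = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.50 10.0.2.100 49152 5432 6 1 40 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.50 10.0.3.7 49153 443 6 12 4200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA",
]

for rec in rejected_to_port(SAMPLE_LINES, 5432):
    print(rec.interface_id, rec.srcaddr, "->", rec.dstaddr, rec.dstport, rec.start)
```

Sorting the matches by start time gives the first rejected attempt, which is the timestamp to line up against the first failed order.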

Our AWS consulting work treats VPC Flow Logs as a required baseline for any D2C stack running ECS alongside external integrations. It's the only log that distinguishes "service is down" from "service can't reach its dependencies" without requiring someone to SSH into a container and run curl.

CloudFront Standard Logs and the BFCM Origin Failure

CloudFront standard logs record the status code CloudFront returned to the viewer (sc-status) alongside fields that describe how the edge handled the request. The x-edge-response-result-type field tells you whether CloudFront served a cached response, went to the origin, or returned an error, and x-edge-detailed-result-type adds more specific detail when the result is an error.

During peak traffic, this distinction matters. A browser getting a 502 from a D2C storefront could mean CloudFront is returning a cached error response (the origin was briefly down and CloudFront cached the error), CloudFront got a 502 directly from the ALB origin, or CloudFront timed out waiting for the ALB response. All three look identical in your application monitoring. CloudFront standard logs make them distinguishable in under a minute: Error in x-edge-response-result-type with a 5xx sc-status points to a failed origin fetch, which the detailed result type can narrow further; LimitExceeded means CloudFront denied the request because one of its own quotas was exceeded.

For the DevOps Agent pattern, CloudFront logs in S3 are the first filter: if the edge layer is serving errors from cache, the investigation stays at the CDN configuration layer. If the origin is returning 502, the investigation moves to ALB access logs. The two log sources chain naturally — CloudFront narrows the failure layer, ALB logs name the specific target.
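The routing decision described here can be sketched as a small triage function. This is an illustration under assumptions, not CloudFront's own logic: it reads field names from the log file's #Fields header, uses an abbreviated field set in the sample, and encodes only the two rules stated in this section.

```python
def parse_cloudfront_log(text: str):
    """Yield one dict per request, keyed by the file's own #Fields header."""
    names = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            names = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            # Data rows in standard log files are tab-separated.
            yield dict(zip(names, line.split("\t")))


def next_step(row: dict) -> str:
    """Name the layer to investigate next for one request."""
    status = int(row["sc-status"])
    result = row["x-edge-response-result-type"]
    if result == "LimitExceeded":
        return "cloudfront"          # stay at the CDN layer
    if result == "Error" and status >= 500:
        return "alb-access-logs"     # move on to the ALB logs
    return "none"


# Abbreviated sample (assumption): real files carry many more fields.
SAMPLE_LOG = "\n".join([
    "#Version: 1.0",
    "#Fields: date time sc-status x-edge-result-type x-edge-response-result-type",
    "2024-11-29\t23:18:05\t502\tError\tError",
    "2024-11-29\t23:18:06\t200\tHit\tHit",
    "2024-11-29\t23:18:07\t503\tLimitExceeded\tLimitExceeded",
])

steps = [next_step(row) for row in parse_cloudfront_log(SAMPLE_LOG)]
print(steps)
```

Reading names from the header rather than hard-coding positions keeps the parser working if the field list differs between log versions.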

The MCP Server Pattern: What It Means for D2C Log Tooling

The AWS blog's most architecturally interesting piece is the custom PCAP MCP Server — a tool deployed on Bedrock AgentCore Runtime that translates raw packet capture files into LLM-readable comparisons. The DevOps Agent can't natively parse a .pcap binary; the MCP server acts as a translation layer that surfaces structured evidence from an unstructured format.

The same pattern applies directly to compressed ALB access logs and VPC Flow Logs in S3. Both are gzip-compressed, space-delimited text files by default (Flow Logs can optionally be delivered to S3 as Parquet). An LLM can't read them directly. An MCP tool that accepts a time window, decompresses the relevant log files from S3, filters to records matching a target group and status code pattern, and returns a structured summary gives the DevOps Agent exactly what it needs in readable form.

For a D2C team running Bedrock AgentCore, this is a one-afternoon build: three Lambda functions registered as MCP tools, one each for ALB logs, Flow Logs, and CloudFront logs. The agent calls the ALB tool with the incident time window, gets back the -1 -1 -1 records and their target IPs, calls the Flow Logs tool for the same window on those target IPs, and surfaces a complete picture of why webhooks were failing at the network and application layers simultaneously. We covered how AgentCore hosts these kinds of custom tools in our post on AgentCore and D2C payment flows — the same runtime infrastructure handles both.
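A rough sketch of the core of such an ALB tool, under stated assumptions: it takes the gzip bytes of one log object (fetching from S3 and MCP registration are left out), the function name and the summary shape are hypothetical, and the sample lines are made up.

```python
import gzip
import json
import shlex
from collections import Counter
from datetime import datetime, timezone


def summarize_alb_window(gz_bytes: bytes, start: datetime, end: datetime) -> dict:
    """Decompress one ALB log object and summarize unreachable targets."""
    unreachable = Counter()
    in_window = 0
    for line in gzip.decompress(gz_bytes).decode("utf-8").splitlines():
        parts = shlex.split(line)
        if len(parts) < 9:
            continue  # not a complete log entry
        ts = datetime.strptime(parts[1], "%Y-%m-%dT%H:%M:%S.%fZ")
        ts = ts.replace(tzinfo=timezone.utc)
        if not (start <= ts < end):
            continue
        in_window += 1
        # Fields 5..7 are the three timing values; parts[4] is target:port.
        if parts[5:8] == ["-1", "-1", "-1"]:
            unreachable[parts[4]] += 1
    return {
        "window_start": start.isoformat(),
        "window_end": end.isoformat(),
        "requests_in_window": in_window,
        "unreachable_by_target": dict(unreachable),
    }


# Illustrative input (assumption): three made-up entries, one outside the window.
_LINES = [
    'https 2024-11-29T23:18:04.123456Z app/webhook-alb/0123456789abcdef 203.0.113.10:44321 10.0.2.100:8080 -1 -1 -1 502 - 512 288 "POST https://hooks.example.com:443/webhooks/orders HTTP/1.1"',
    'https 2024-11-29T23:19:10.000001Z app/webhook-alb/0123456789abcdef 203.0.113.10:44322 10.0.2.101:8080 0.001 0.003 0.000 200 200 512 288 "POST https://hooks.example.com:443/webhooks/orders HTTP/1.1"',
    'https 2024-11-30T02:00:00.000000Z app/webhook-alb/0123456789abcdef 203.0.113.10:44323 10.0.2.100:8080 -1 -1 -1 502 - 512 288 "POST https://hooks.example.com:443/webhooks/orders HTTP/1.1"',
]
sample_gz = gzip.compress("\n".join(_LINES).encode("utf-8"))

summary = summarize_alb_window(
    sample_gz,
    datetime(2024, 11, 29, 23, 0, tzinfo=timezone.utc),
    datetime(2024, 11, 29, 23, 30, tzinfo=timezone.utc),
)
print(json.dumps(summary, indent=2))
```

Returning a small JSON summary rather than raw lines keeps the agent's context short, and the target addresses in it are what the Flow Logs tool would take as its filter.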

Enabling All Three Before Your Next Peak Event

Total setup time: 90 minutes. Total monthly cost at D2C traffic volumes: under $15. Here's the sequence.

ALB access logs (15 minutes): EC2 console → Load Balancers → select your ALB → Attributes tab → Edit → Enable access logs → point to an S3 bucket. The bucket needs a policy that allows Elastic Load Balancing to write to it; without one, enabling the attribute fails. Do this for every ALB that receives Shopify webhooks or external API traffic, because the internal ALB serving your API tier matters as much as the public-facing one. Add an S3 lifecycle rule expiring logs after 90 days to keep storage costs flat.
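The same console step can be scripted. The sketch below assumes boto3 is installed and credentials are configured; the helper names, bucket, and prefix are hypothetical, and the bucket policy still has to be in place before the call succeeds.

```python
def access_log_attributes(bucket: str, prefix: str) -> list:
    """Build the load balancer attributes that turn on S3 access logging."""
    return [
        {"Key": "access_logs.s3.enabled", "Value": "true"},
        {"Key": "access_logs.s3.bucket", "Value": bucket},
        {"Key": "access_logs.s3.prefix", "Value": prefix},
    ]


def enable_access_logs(load_balancer_arn: str, bucket: str, prefix: str) -> None:
    """Apply the attributes to one ALB (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    elbv2 = boto3.client("elbv2")
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=access_log_attributes(bucket, prefix),
    )


# Hypothetical bucket and prefix names, shown only to illustrate the payload.
attrs = access_log_attributes("example-alb-logs", "webhook-receiver")
print(attrs)
```

Looping enable_access_logs over every ALB ARN in the account is one way to cover the internal load balancers as well as the public one.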

VPC Flow Logs (30 minutes): VPC console → Your VPCs → select production VPC → Flow Logs tab → Create. Destination: S3 (cheaper than CloudWatch Logs for high-volume storage). Traffic type: All — capturing only REJECT misses the slow-connection pattern, and the cost difference is marginal. Apply to your production VPC and your staging VPC if ECS-based integration testing runs there. Add the same 90-day lifecycle policy to the S3 destination.
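A scripted equivalent, sketched under the same assumptions (boto3 installed, credentials configured). The VPC ID, bucket name, prefix, and helper names are placeholders, not values from this article.

```python
def flow_log_request(vpc_id: str, bucket: str, prefix: str = "vpc-flow-logs/") -> dict:
    """Build the arguments for an S3-delivered, all-traffic VPC flow log."""
    return {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": "ALL",  # ACCEPT and REJECT, as recommended above
        "LogDestinationType": "s3",
        "LogDestination": f"arn:aws:s3:::{bucket}/{prefix}",
    }


def expire_after_days(prefix: str, days: int = 90) -> dict:
    """Build a lifecycle configuration that expires objects under a prefix."""
    return {
        "Rules": [{
            "ID": f"expire-{days}d",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }]
    }


def apply(vpc_id: str, bucket: str, prefix: str = "vpc-flow-logs/") -> None:
    """Create the flow log and lifecycle rule (requires boto3 and credentials)."""
    import boto3  # imported here so the builders above work without boto3

    boto3.client("ec2").create_flow_logs(**flow_log_request(vpc_id, bucket, prefix))
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expire_after_days(prefix),
    )


# Placeholder identifiers, shown only to illustrate the request shape.
request = flow_log_request("vpc-0123456789abcdef0", "example-flow-logs")
print(request["LogDestination"])
```

Note that put_bucket_lifecycle_configuration replaces the bucket's whole lifecycle configuration, so on a shared bucket any existing rules would need to be merged in first.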

CloudFront standard logs (15 minutes): CloudFront console → select your distribution → General tab → Standard logging → On. Destination: S3 bucket, prefixed with the distribution ID. No extra AWS charge beyond S3 storage — typically under $2/month for a D2C storefront's CDN access volume.

Athena table setup (30 minutes): Create an Athena table over each log bucket using the DDL schemas AWS publishes for ALB logs and VPC Flow Logs (CloudFront logs use the same Athena pattern). With partitioning by date, a query scoping to a 30-minute incident window scans roughly 15–30MB and returns results in under 10 seconds. At $5/TB scanned, a single incident investigation costs under $0.001 in Athena query fees. Without Athena, investigation means downloading and decompressing gzip files by hand — which is exactly what the 4-hour manual investigation looked like.
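A sketch of how an incident-window query might be generated. The table name, the day partition column in yyyy/MM/dd form, and the column names are assumptions based on the DDL AWS publishes for ALB logs; check them against your own table. The builder also assumes the window stays within a single day.

```python
from datetime import datetime, timedelta


def incident_query(start: datetime, minutes: int = 30,
                   table: str = "alb_access_logs") -> str:
    """Build an Athena query for unreachable-target 502s in one time window."""
    end = start + timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%S"
    # The time column is an ISO-8601 string, so string comparison orders correctly.
    return f"""
SELECT time, target_ip, elb_status_code, request_url
FROM {table}
WHERE day = '{start:%Y/%m/%d}'
  AND elb_status_code = 502
  AND target_processing_time = -1
  AND time BETWEEN '{start.strftime(fmt)}' AND '{end.strftime(fmt)}'
ORDER BY time
""".strip()


query = incident_query(datetime(2024, 11, 29, 23, 18))
print(query)
```

The partition predicate on day is what keeps the scan small; without it Athena reads every log object under the table location.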

Frequently Asked Questions

How do ALB access logs differ from CloudWatch metrics for the load balancer?

CloudWatch metrics for ALB give you aggregate counts: total request count, 5xx error rate, target response time as a percentile. They tell you a problem exists and roughly when it started. ALB access logs give you individual request records: every request with its timing fields, status code, target IP, and the critical -1 -1 -1 pattern that distinguishes "target never responded" from "target returned an error." Metrics are for alerting. Access logs are for diagnosis — you cannot identify which specific target group member failed, or trace a specific Shopify webhook delivery to its outcome, using CloudWatch metrics alone.

Do VPC Flow Logs capture traffic between ECS tasks within the same VPC?

Yes. A flow log can be attached at the VPC, subnet, or ENI level, and each scope captures traffic for the network interfaces it covers. Traffic between tasks in the same VPC is captured as long as the task ENIs are within the logging scope; with the awsvpc network mode (the only mode on Fargate), every task has its own ENI. A VPC-level flow log is the simplest way to guarantee that, since its scope includes every subnet and ENI in the VPC. If you only want to see which task-to-task connections are being rejected in one shared subnet, for example an app-tier task trying to reach an RDS Proxy with a restrictive security group, a subnet-level flow log catches that traffic with less volume.

What does an Athena query cost when investigating an active incident in ALB access logs?

Athena charges $5 per TB of data scanned. ALB access logs for a D2C brand handling 10,000 orders per day run roughly 400MB to 800MB per day compressed in S3. A query scoping to a 30-minute incident window scans approximately 15 to 30MB of data — costing under $0.001 per query. Even a broad query across 7 days of logs to find a recurring pattern scans under 6GB and costs around $0.03. Athena's per-query cost is not a meaningful constraint for incident investigation at D2C volumes. The value is speed: a properly partitioned table returns a 30-minute window of ALB logs in 3 to 8 seconds.
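The per-query figures can be reproduced with a few lines. This sketch assumes the $5 per TB price quoted above, treats 1 TB as 10^12 bytes for simplicity, and applies Athena's 10MB per-query minimum; the function name is hypothetical.

```python
PRICE_PER_TB = 5.00
MIN_BILLABLE_BYTES = 10 * 1000**2  # Athena bills at least 10 MB per query


def athena_query_cost(scanned_mb: float) -> float:
    """Estimated cost in dollars for a query scanning the given volume."""
    billable = max(scanned_mb * 1000**2, MIN_BILLABLE_BYTES)
    return billable / 1000**4 * PRICE_PER_TB


print(f"30 MB window:  ${athena_query_cost(30):.5f}")
print(f"6 GB, 7 days:  ${athena_query_cost(6000):.2f}")
```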
