At 10:14 PM on BFCM eve 2024, a developer at a $6M apparel brand pushed a hotfix. A discount calculation bug was causing free shipping to apply incorrectly to a subset of cart combinations. The fix was urgent — the campaign was already running, orders were coming in, and the team had confirmed the bug in production. The deployment ran in 88 seconds. No errors in CloudWatch. ECS showed the new task as healthy. The team went to bed.
The next morning: 23 Stripe payment intents marked as abandoned. $11,400 in lost revenue. Each abandoned intent corresponded to a customer who had been mid-3DS authentication at exactly the moment the deployment ran. The bank's authentication page had redirected them away from the checkout. When the bank redirected them back 45 to 75 seconds later, the ECS task that held their session was already drained. The ALB returned a 502. Stripe logged no response from the return URL. The payment intent timed out.
The root cause was not the deployment itself. It was a default config value: ALB target deregistration timeout, set at 30 seconds. No one had changed it. No one knew it mattered. AWS research on resilient payment infrastructure — the same patterns financial institutions use for persistent payment protocol connections — identifies this as a production-hardening configuration that every payment-connected service needs. For D2C, the fix is four specific values.
Running checkout on ECS behind an ALB and haven't reviewed your drain config? Book a 30-minute audit — Dev joins every call, we review your ALB timeout configuration and deployment safety settings before the next BFCM. Written brief inside a week. No SDR layer.
What ALB Target Deregistration Does During an ECS Deployment
When an ECS service deployment replaces a running task, the ALB marks the old task's target as "deregistering." During the deregistration window, the ALB continues routing in-flight requests to the deregistering target — active connections are not immediately severed. Once the deregistration timeout expires, the ALB stops routing to the old target regardless of whether any connections are still open. Those connections are dropped.
The default deregistration timeout is 30 seconds. AWS sets it conservatively low so that deployments complete quickly and new task versions become fully active without long wait times. For stateless API endpoints where each HTTP request completes in under a second, 30 seconds is more than enough. For checkout flows that involve a redirect to an external bank authentication page — and then wait for that bank to redirect the customer back — 30 seconds is not enough.
The deregistration delay is a per-target-group attribute. It applies to every ECS service pointing at that target group. Changing it does not require a code change, a container rebuild, or a redeployment. It is a single CLI command or a Terraform attribute that takes effect immediately.
Why 3DS Timing Makes the Default Gap Lethal for D2C
3DS (Three-Domain Secure) is the card-network authentication protocol that redirects customers to their issuing bank during checkout. Under PSD2 it is mandatory for European card transactions. Under RBI guidelines it applies to Indian credit card payments. For a D2C brand selling internationally or processing credit cards in India, 3DS affects a meaningful fraction of every checkout session.
The 3DS flow from the ALB's perspective: the customer's browser sends a payment initiation request to the checkout API. The checkout API calls Stripe and gets back a requires_action payment intent with a redirect URL. The browser redirects to the bank's authentication page. The customer authenticates (OTP, biometric, or password). The bank posts a completion event back to Stripe, and Stripe redirects the customer to the merchant's return_url. The browser loads the return_url from the checkout API, which queries Stripe to confirm the payment intent status and shows the order confirmation.
That round-trip — from the checkout API receiving the initial payment request to it receiving the return_url callback — takes 45 to 90 seconds depending on the issuing bank's authentication page performance. Banks in Europe typically run 45-60 seconds. Indian banks, depending on the OTP delivery path, regularly reach 75-90 seconds. The customer is off-site the entire time. The checkout API's ECS task is holding state, waiting for the return call.
If the ECS task drains after 30 seconds, it is gone 15 to 60 seconds before the customer returns from a 45 to 90 second round-trip. The return_url request hits a 502. Stripe abandons the payment intent after failing to post the confirmation webhook. The customer sees an error page. The order is lost.
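The timing gap is simple subtraction, and it helps to see it against both drain settings. A minimal sketch, assuming only the figures quoted in this post (a 45 to 90 second 3DS round-trip, a 30-second drain today, 120 seconds after the fix); the function name is illustrative, not part of any AWS tooling:

```python
def seconds_gone_before_return(drain_seconds: int, round_trip_seconds: int) -> int:
    """How long the old task has been gone when the customer comes back.

    Returns 0 when the drain window outlasts the 3DS round-trip.
    """
    return max(0, round_trip_seconds - drain_seconds)


# Round-trip range quoted in the text: 45 to 90 seconds.
for drain in (30, 120):
    fastest = seconds_gone_before_return(drain, 45)
    slowest = seconds_gone_before_return(drain, 90)
    print(f"drain={drain}s: task gone {fastest}s to {slowest}s before the return")
```

A result of 0 for both ends of the range means the drain window outlasts every round-trip in that range.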
Four Config Values That Close the Gap
1. ALB target group deregistration delay: increase to 120 seconds.
120 seconds covers the 90-second maximum 3DS round-trip with 30 seconds of margin. The immediate fix:
aws elbv2 modify-target-group-attributes --target-group-arn arn:aws:elasticloadbalancing:... --attributes Key=deregistration_delay.timeout_seconds,Value=120
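In Terraform: deregistration_delay = 120 on the aws_lb_target_group resource (aws_alb_target_group is the older alias for the same resource). The new value takes effect on the target group as soon as the change is applied; no ECS deployment is required.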
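If you script this change rather than running the CLI by hand, the same request can be built in Python. This is a sketch, not a drop-in tool: the helper name is hypothetical, the ARN is left elided exactly as above, and the boto3 call is shown only as a comment so the snippet runs without AWS credentials.

```python
def deregistration_delay_request(target_group_arn: str, delay_seconds: int = 120) -> dict:
    """Build keyword arguments for the ELBv2 ModifyTargetGroupAttributes call."""
    # The service accepts 0 to 3600 seconds for this attribute.
    if not 0 <= delay_seconds <= 3600:
        raise ValueError("deregistration delay must be between 0 and 3600 seconds")
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [
            # Attribute values are passed as strings, not integers.
            {"Key": "deregistration_delay.timeout_seconds", "Value": str(delay_seconds)}
        ],
    }


request = deregistration_delay_request("arn:aws:elasticloadbalancing:...", 120)
print(request)

# With boto3 installed and credentials configured, the request would be sent as:
#   boto3.client("elbv2").modify_target_group_attributes(**request)
```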
2. ECS task stop timeout: set to 120 seconds to match.
The ECS stop timeout controls how long ECS waits, after sending SIGTERM, for a container to exit gracefully before it sends SIGKILL. If the stop timeout is shorter than the deregistration delay, ECS kills the task before the drain window completes. stopTimeout is set per container, inside the container definition of the task definition:
"stopTimeout": 120
This keeps the task alive for the full 120-second drain window rather than being killed after the default 30-second stop timeout. On Fargate, 120 seconds is also the maximum value stopTimeout accepts.
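A mismatch between these two values is easy to miss in review, so it is worth checking mechanically. The sketch below is an assumption-laden example: it takes a task definition as a plain dict (for instance the JSON you register), treats a missing stopTimeout as the 30-second default, and uses a hypothetical container name.

```python
from typing import List

ECS_DEFAULT_STOP_TIMEOUT = 30   # seconds, used when stopTimeout is not set
FARGATE_MAX_STOP_TIMEOUT = 120  # seconds, the ceiling on Fargate


def stop_timeout_problems(task_definition: dict, deregistration_delay: int) -> List[str]:
    """Return readable problems; an empty list means the values line up."""
    problems = []
    for container in task_definition.get("containerDefinitions", []):
        name = container.get("name", "<unnamed>")
        stop_timeout = container.get("stopTimeout", ECS_DEFAULT_STOP_TIMEOUT)
        if stop_timeout < deregistration_delay:
            problems.append(
                f"{name}: stopTimeout {stop_timeout}s is shorter than the "
                f"{deregistration_delay}s deregistration delay"
            )
        if stop_timeout > FARGATE_MAX_STOP_TIMEOUT:
            problems.append(
                f"{name}: stopTimeout {stop_timeout}s exceeds the Fargate maximum"
            )
    return problems


# Hypothetical task definition whose only container omits stopTimeout.
example = {"containerDefinitions": [{"name": "checkout-api"}]}
print(stop_timeout_problems(example, deregistration_delay=120))
```

Running a check like this in CI before a deploy is one way to keep the two timeouts from drifting apart.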
3. ALB idle connection timeout: increase to 120 seconds.
Separate from deregistration, the ALB has an idle connection timeout that applies to all connections — even during normal operation with no deployment. The default is 60 seconds. If the ALB's idle timeout fires during a 3DS flow while the customer is on the bank's page, the connection to the checkout API drops even without a deployment event. Set it to 120 seconds on the ALB load balancer attributes:
aws elbv2 modify-load-balancer-attributes --load-balancer-arn arn:aws:elasticloadbalancing:... --attributes Key=idle_timeout.timeout_seconds,Value=120
4. TCP keepalive on the webhook receiver ECS task.
If the checkout API or a separate webhook receiver listens for Stripe payment events over a persistent connection behind an NLB, the NLB's TCP idle timeout (350 seconds by default for NLBs) can drop the connection during quiet periods between webhook events. Configure Linux TCP keepalive in the ECS task's container via a startup script or sysctl configuration:
net.ipv4.tcp_keepalive_time = 45
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 5
This sends the first keepalive probe after 45 seconds of idle time, then one probe every 10 seconds, and declares the connection dead after 5 unanswered probes: 45 + (5 × 10) = 95 seconds, well within the NLB's 350-second idle timeout. AWS financial infrastructure research uses this exact configuration for persistent payment protocol connections.
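The 95-second figure follows directly from the three sysctl values. A small sketch that parses the settings above and recomputes it, so you can try other values before rolling them out; the parser and function names are illustrative only:

```python
SYSCTL_TEXT = """
net.ipv4.tcp_keepalive_time = 45
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 5
"""

NLB_IDLE_TIMEOUT_SECONDS = 350  # NLB TCP idle timeout quoted in the text


def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines into a dict of integers."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def dead_peer_detection_seconds(settings: dict) -> int:
    """Idle time before the first probe, plus one interval per unanswered probe."""
    return (
        settings["net.ipv4.tcp_keepalive_time"]
        + settings["net.ipv4.tcp_keepalive_intvl"]
        * settings["net.ipv4.tcp_keepalive_probes"]
    )


settings = parse_sysctl(SYSCTL_TEXT)
detection = dead_peer_detection_seconds(settings)
print(detection, detection < NLB_IDLE_TIMEOUT_SECONDS)
```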
The ECS Deployment Sequence That Prevents Mid-3DS Drops
The deregistration timeout increase alone handles the gap. For BFCM deployments where extra caution is warranted, a weighted blue/green deployment eliminates the risk entirely by shifting traffic gradually rather than cutting over all at once.
The ECS rolling update configuration that enables safe BFCM deployments:
deploymentConfiguration:
  minimumHealthyPercent: 100
  maximumPercent: 200
minimumHealthyPercent: 100 means ECS never reduces below the current task count during a deployment. New tasks spin up and pass health checks before any old tasks begin draining. With maximumPercent: 200, ECS can run up to twice the task count simultaneously — new tasks serving new traffic, old tasks completing their active sessions through the 120-second drain window.
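To see what those two percentages mean for a specific service, convert them into task counts. This sketch assumes ECS rounds the minimum up and the maximum down to whole tasks, which matches our reading of the service documentation; the desired counts used are hypothetical.

```python
def rolling_update_bounds(desired_count: int,
                          minimum_healthy_percent: int = 100,
                          maximum_percent: int = 200) -> tuple:
    """Return (fewest running tasks allowed, most running tasks allowed)."""
    # Integer ceiling division for the lower bound, floor for the upper bound.
    lower = -(-desired_count * minimum_healthy_percent // 100)
    upper = desired_count * maximum_percent // 100
    return lower, upper


# A service running 4 checkout tasks never drops below 4 and may peak at 8.
print(rolling_update_bounds(4))

# For contrast, a looser 50/150 setting on 3 tasks allows dipping to 2.
print(rolling_update_bounds(3, 50, 150))
```

The headroom between the two numbers is what lets old tasks finish draining while new tasks are already serving traffic, so the cluster needs spare capacity for the upper bound.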
For CodeDeploy blue/green on Fargate, the equivalent is setting terminationWaitTimeInMinutes to at least 2 in the deployment group configuration, and configuring the test listener health check to verify the new task set is handling payment flows correctly before the original task set is terminated.
The deployment sequence under this configuration: new ECS tasks spin up → ALB health checks pass → ALB registers new targets → ALB begins routing new checkout sessions to new tasks → old targets enter deregistration → 120-second drain window → all active 3DS flows on old tasks complete → old tasks stop. No payment dropped. For guidance on the infrastructure layer this sits within, our AWS consulting service covers ECS deployment safety as part of every checkout resilience engagement.
CloudWatch Signals That Catch the Problem During a Deployment
Two CloudWatch metrics detect mid-deployment payment drops before the Stripe dashboard confirms them the next morning:
ALB HTTPCode_ELB_5XX_Count spike during deployment. A spike in 5xx responses from the ALB itself (not from targets) during a deployment window indicates the ALB dropped connections rather than routing them to a healthy target. This is the clearest signal that deregistration fired before connections closed. Set a CloudWatch Alarm: if HTTPCode_ELB_5XX_Count exceeds 5 in any 1-minute window during a deployment event, fire a notification and consider rollback.
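One way to define that alarm in code is to build the arguments for CloudWatch's PutMetricAlarm call. Treat this as a sketch: the alarm name, the load balancer dimension value, and the SNS topic ARN are hypothetical placeholders, and the alarm as written is always on rather than scoped to deployment windows.

```python
def elb_5xx_alarm_request(load_balancer_dimension: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": "checkout-alb-elb-5xx",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer_dimension}],
        "Statistic": "Sum",
        "Period": 60,             # one-minute window
        "EvaluationPeriods": 1,
        "Threshold": 5,           # alarm when the sum exceeds 5
        "ComparisonOperator": "GreaterThanThreshold",
        # No 5xx datapoints at all is the healthy case, not an alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


request = elb_5xx_alarm_request(
    "app/checkout-alb/0123456789abcdef",             # placeholder dimension value
    "arn:aws:sns:us-east-1:123456789012:deploy-alerts",  # placeholder topic
)
print(request["MetricName"], request["Threshold"])

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**request)
```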
Stripe payment_intent.canceled webhook spike. A custom CloudWatch metric ingesting Stripe webhook events from a Lambda processor shows when payment intents are abandoned. A spike in cancellations aligned with a deployment timestamp is the definitive confirmation that mid-3DS drops occurred. We covered how to build the webhook receiver and log pipeline for this in our D2C incident investigation post. With this metric in place, a CloudWatch alarm evaluated on a 30-second period surfaces the drops within minutes of the deployment, instead of leaving an $11,400 finding for the next morning.
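The correlation itself is a small amount of logic once the webhook events are stored. A minimal sketch, assuming each stored event keeps Stripe's type and created (Unix timestamp) fields; the five-minute window and the deployment timestamp are hypothetical and should be tuned to how quickly cancellations actually arrive in your account.

```python
from datetime import datetime, timedelta, timezone


def canceled_near_deployment(events, deployed_at, window=timedelta(minutes=5)):
    """Count payment_intent.canceled events inside the window after a deploy."""
    count = 0
    for event in events:
        if event.get("type") != "payment_intent.canceled":
            continue
        created = datetime.fromtimestamp(event["created"], tz=timezone.utc)
        if deployed_at <= created <= deployed_at + window:
            count += 1
    return count


# Hypothetical deployment time and three stored events.
deployed_at = datetime(2024, 11, 28, 22, 14, tzinfo=timezone.utc)
events = [
    {"type": "payment_intent.canceled",
     "created": int((deployed_at + timedelta(seconds=50)).timestamp())},
    {"type": "payment_intent.succeeded",
     "created": int((deployed_at + timedelta(seconds=60)).timestamp())},
    {"type": "payment_intent.canceled",
     "created": int((deployed_at - timedelta(hours=1)).timestamp())},
]
print(canceled_near_deployment(events, deployed_at))
```

Publishing that count as the custom metric gives the alarm something concrete to threshold on.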
The financial infrastructure pattern that AWS recommends for payment-connected services uses RstPacketsReceived and ActiveFlowCount NLB metrics for the equivalent monitoring. For ALB-based D2C checkout services, the 5xx count and the Stripe webhook event stream are the equivalent observability layer. More on building the full checkout observability stack in our BFCM incident triage post.
BFCM is closer than it looks and this config takes 10 minutes to fix. Book 30 minutes — we review your ALB target group configuration, ECS task stop timeouts, and deployment settings before peak season. Written brief inside a week. Dev joins every call.
Frequently Asked Questions
Does this only affect customers using cards that require 3DS authentication?
3DS is the most common affected flow, but not the only one. Any checkout step where the customer leaves the merchant's domain and returns (bank OTP pages, PayPal redirect, UPI deep-links on mobile, buy-now-pay-later verification flows) has the same timing structure: the ECS task holds session state, the customer is off-site for 45 to 120 seconds, and the deregistration timer may expire before they return. For D2C brands selling internationally, European cards require 3DS under PSD2. For brands selling in India, RBI guidelines increasingly mandate 3DS for credit cards. At $5M or more GMV with international or Indian credit card traffic, the affected population during a BFCM deployment is typically 15 to 30 percent of the checkout sessions active at the moment of deployment.
We use Shopify Payments — do we need to configure any of this ourselves?
No — if the entire checkout runs inside Shopify's hosted environment, Shopify's infrastructure team manages deregistration timeouts, connection draining, and deployment safety for the payment processing layer. This configuration applies to D2C brands that run their own checkout API or payment backend on AWS: brands with customized checkout flows, a custom order processing service behind an ALB, or any ECS service involved in the payment flow outside Shopify's hosted layer. Where it becomes your responsibility regardless of Shopify: any webhook receiver running on AWS for Stripe, Razorpay, or other gateway webhooks. If Stripe delivers payment events to a Lambda or ECS container you own, and that container sits behind an NLB, the TCP keepalive configuration in point 4 applies to your infrastructure.
What is the revenue impact of a single dropped-payment deployment at $5M GMV?
At $5M annual GMV, roughly $400,000 to $600,000 transacts in the 4-day BFCM window. Peak checkout throughput can be 15 to 25 times the daily average. A deployment at peak with 30 active checkout sessions at cutover drops 4 to 8 sessions if 3DS flows are involved. At an average D2C order value of $65 to $120, a single deployment event costs $260 to $960 in direct dropped payments. That is before accounting for customers who do not return — cart abandonment rates after a failed payment run 60 to 70 percent. The actual revenue impact is 2 to 3 times the dropped payment value. The fix — increasing deregistration timeout from 30 to 120 seconds — costs nothing and takes under 10 minutes to implement.
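The arithmetic behind those ranges, written out so you can substitute your own numbers. This is only a worked example of the figures in this answer (4 to 8 dropped sessions, $65 to $120 order value, a 2 to 3 times multiplier for customers who do not return); the function name is illustrative.

```python
def dropped_payment_impact(sessions=(4, 8), order_value=(65, 120), multiplier=(2, 3)):
    """Return low/high dollar ranges for direct and total revenue impact."""
    direct_low = sessions[0] * order_value[0]
    direct_high = sessions[1] * order_value[1]
    return {
        "direct": (direct_low, direct_high),
        # Total impact scales the direct loss by the abandonment multiplier.
        "total": (direct_low * multiplier[0], direct_high * multiplier[1]),
    }


impact = dropped_payment_impact()
print(impact["direct"])  # (260, 960)
print(impact["total"])   # (520, 2880)
```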
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
