AI Summary - 20-sec read - Reviewed by experts
- LLM providers cap you on requests per minute and tokens per minute. When your real traffic is bursty - and it always is - you hit those caps at the worst moment and the API returns 429 Too Many Requests. The demo worked because one person was using it.
- The first fix is retrying correctly: exponential backoff with random jitter, and honour the Retry-After header the provider sends. A tight retry loop with no backoff makes the overload worse, not better.
- Retrying alone is not enough. Shape the load before it hits the provider with a request queue, a concurrency cap, and a token-bucket limiter set just under your quota, so you smooth bursts instead of slamming into the ceiling.
- For anything user-facing, add provider fallback and graceful degradation: a second model or provider to spill over to, and a sensible partial response when everything is saturated, so users see slowness, not an error page.
The AI feature was flawless in the demo. One reviewer, a few requests, instant answers. Then it launched, real traffic arrived in bursts, and it started returning errors - 429 Too Many Requests - right when the most people were watching. This is the most common way a promising AI feature embarrasses a team in production, and it has nothing to do with model quality. The cause is that your provider limits how fast you can call it, your traffic is spiky, and your code was never built to absorb the gap between the two.
Why 429s happen when you least want them
Model providers typically enforce at least two limits: requests per minute and tokens per minute. Cross either and the API stops serving you and returns a 429. Those limits are fine on average - your daily volume may sit well under them - but averages lie. Real traffic is bursty. A newsletter goes out, a feature trends, a batch job fires, and a minute's worth of requests arrives in ten seconds. The provider does not care that you were quiet all morning. It sees a spike over the per-minute ceiling and throttles you.
The reason it hurts so much is timing. You hit the limit precisely when usage is highest, which is precisely when the feature matters most and the most people are affected. So the failure is not just technical - it lands in front of your busiest, most engaged users. Handling it is not an edge case to tidy up later. It is the difference between a feature that survives its own launch and one that does not.
Retry the right way: backoff, jitter, and Retry-After
When a 429 comes back, the instinct is to retry immediately. That is the worst thing you can do - a tight retry loop from every failing request piles more load onto an already-overloaded endpoint and turns a brief spike into a sustained outage. Retrying is right; retrying blindly is not.
- Exponential backoff. Wait longer after each failure - roughly one second, then two, then four - so you give the limit window time to reset instead of hammering it. Cap the total attempts so a request cannot retry forever.
- Jitter. Add a random offset to each wait. Without it, every request that failed in the same burst retries at the same instant and you get a second synchronised spike. Jitter spreads the retries out so they arrive smoothly.
- Honour Retry-After. Providers often tell you exactly how long to wait in a Retry-After header on the 429. When it is there, use it - it is the provider telling you when it will serve you again, and guessing shorter just earns another 429.
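The three rules above can be sketched as one delay calculation. This is a minimal illustration, not a provider-specific implementation: the function names, the 30-second cap, and the 50% jitter range are assumptions chosen for the example. Retry-After can also arrive as an HTTP date; this sketch handles only the number-of-seconds form and falls back to backoff otherwise.

```python
import math
import random
from typing import Optional


def parse_retry_after(value: Optional[str]) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent or not numeric.

    Only the delta-seconds form is handled here; an HTTP-date value is
    treated as missing so the caller falls back to exponential backoff.
    """
    if value is None:
        return None
    try:
        seconds = float(value.strip())
    except ValueError:
        return None
    if not math.isfinite(seconds) or seconds < 0:
        return None
    return seconds


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    hinted = parse_retry_after(retry_after)
    if hinted is not None:
        return hinted  # the provider's own hint wins over our guess
    exponential = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    # Jitter: add up to 50% extra so requests from the same burst spread out.
    return exponential + random.uniform(0, exponential * 0.5)
```

With these defaults the first three waits fall in the ranges 1 to 1.5, 2 to 3, and 4 to 6 seconds, unless the response carries a numeric Retry-After, in which case that value is used as-is.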
Retry only the failures that are worth retrying. A 429 or a transient network error, yes. A 400 for a malformed request will fail identically every time - retrying it wastes your quota and hides the real bug. That distinction is part of the same discipline as reliable tool calling with schema validation and recovery: know which errors are recoverable and treat the rest as real.
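A rough sketch of that distinction in code follows. The `send` callable, the `NonRetryableError` class, and the set of retryable status codes are hypothetical, introduced only for illustration; check your provider's documentation for which codes it treats as transient. The backoff is written inline so the example stands alone.

```python
import random
import time
from typing import Callable, Tuple

# Assumption: 429 and 503 are transient for your provider. Adjust as documented.
RETRYABLE_STATUS = {429, 503}


class NonRetryableError(Exception):
    """Raised for responses, such as a 400, that will fail identically on retry."""


def call_with_retry(send: Callable[[], Tuple[int, str]],
                    max_attempts: int = 4,
                    base: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `send` (returns (status_code, body)) and retry only recoverable failures."""
    for attempt in range(max_attempts):
        try:
            status, body = send()
        except (ConnectionError, TimeoutError):
            pass  # transient network failure: fall through to the backoff below
        else:
            if 200 <= status < 300:
                return body
            if status not in RETRYABLE_STATUS:
                # Surface the real bug instead of burning quota on it.
                raise NonRetryableError(f"status {status}: {body}")
        if attempt < max_attempts - 1:
            delay = base * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # backoff plus jitter
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

The `sleep` parameter exists so tests can pass a no-op and run instantly; in production the default `time.sleep` applies.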
Do not just retry - shape the load
Backoff handles the failures that happen. The better move is to stop generating so many in the first place by controlling the rate at which you call the provider. Three controls do most of the work:
- A request queue. Instead of every request calling the model directly, put work on a queue and drain it at a steady pace. Bursts land in the queue; the provider sees an even flow. For anything that does not need an instant answer - summaries, enrichment, batch scoring - this alone removes most 429s.
- A concurrency cap. Limit how many calls are in flight at once. It stops a traffic spike from opening a thousand simultaneous connections and blowing straight through the per-minute limit.
- A token-bucket limiter. Track your own usage and keep it just under the provider's quota - refill a bucket at your allowed rate and only send when there is a token. You throttle yourself smoothly rather than letting the provider throttle you abruptly with an error.
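As an illustration of the last control, here is a minimal token-bucket sketch. The class name, parameters, and injectable clock are assumptions made for the example, and the class is not thread-safe; a real service would add a lock or use an existing rate-limiting library.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate_per_minute` units per minute, with bursts up to `capacity`."""

    def __init__(self, rate_per_minute: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_minute / 60.0  # units refilled per second
        self.capacity = capacity
        self.tokens = capacity              # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` units if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` units will be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.rate)
```

For a hypothetical quota of 600 requests per minute, `TokenBucket(rate_per_minute=540, capacity=20)` would hold you at 90% of the ceiling while still letting short bursts of up to 20 calls through. The same class can meter tokens per minute by passing each request's estimated token count as `cost`.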
The shift in thinking is from reactive to proactive. Backoff cleans up after you have already overloaded the endpoint. Load shaping means you rarely overload it at all. The two work together, but the queue and the limiter are what let a feature take a real spike without flinching. Knowing where your calls actually stack up needs visibility, which is why this pairs with observability and monitoring for AI agents in production.
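The queue and the concurrency cap can be combined in a few lines of asyncio. This is a sketch under stated assumptions: `drain` and `call_model` are hypothetical names, the job list is known up front, and a long-running service would keep the queue and workers alive rather than draining a fixed batch.

```python
import asyncio
from typing import Awaitable, Callable, List


async def drain(jobs: List[str],
                call_model: Callable[[str], Awaitable[str]],
                max_in_flight: int = 4) -> List[str]:
    """Queue every job, then drain the queue with a fixed number of workers.

    The number of workers is the concurrency cap: no more than
    `max_in_flight` calls to `call_model` run at the same time.
    """
    queue: asyncio.Queue = asyncio.Queue()
    for index, job in enumerate(jobs):
        queue.put_nowait((index, job))
    results: List[str] = [""] * len(jobs)

    async def worker() -> None:
        while True:
            try:
                index, job = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            results[index] = await call_model(job)

    await asyncio.gather(*(worker() for _ in range(max_in_flight)))
    return results
```

A burst of a thousand jobs lands in the queue immediately, but the provider only ever sees `max_in_flight` calls at once. Error handling is left out for brevity: an exception from `call_model` propagates out of `drain`.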
Takeaways
- 429s come from per-minute request and token caps meeting bursty traffic - they hit hardest exactly when usage peaks.
- Retry with exponential backoff plus random jitter, and honour the Retry-After header instead of guessing.
- Only retry recoverable errors - retrying a 400 wastes quota and hides the real bug.
- Shape load with a queue, a concurrency cap, and a token-bucket limiter set just under quota so you rarely hit the ceiling.
- For user-facing features, add provider fallback and a graceful partial response so users see slowness, not an error.
Fallback and graceful degradation
Even with backoff and load shaping, a big enough spike or a provider incident can saturate everything. For a user-facing feature, the answer is a fallback path. Route overflow to a second model or a second provider, so when your primary is throttled the request still gets served, just perhaps by a smaller or cheaper model. Keeping a routing layer that can switch models is the same capability that lets you tune cost and speed, which we cover in cutting AI agent latency with streaming and model routing.
When even the fallback is stretched, degrade gracefully. A cached answer, a shorter response, a clear "we are busy, here is a partial result" - any of these beats a raw error screen. Users forgive slowness. They remember the feature that broke. Building for that difference is exactly what our AI agent development and AI development services focus on: not just a model that answers, but a feature that stays up when the traffic is real.
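The fallback chain and the degraded response can be sketched together. Everything named here is hypothetical: the `Saturated` exception stands in for whatever signal your client raises when a provider is throttled, each provider is a plain callable, and the cache is a dictionary keyed by prompt.

```python
from typing import Callable, Dict, Optional, Sequence


class Saturated(Exception):
    """Hypothetical signal that a provider is throttled or unavailable."""


def answer(prompt: str,
           providers: Sequence[Callable[[str], str]],
           cache: Optional[Dict[str, str]] = None) -> Dict[str, object]:
    """Try each provider in order, then a cached answer, then a busy notice."""
    for index, provider in enumerate(providers):
        try:
            text = provider(prompt)
        except Saturated:
            continue  # spill over to the next provider in the list
        # Anything served by a non-primary provider is flagged as degraded.
        return {"text": text, "source": f"provider-{index}", "degraded": index > 0}
    if cache is not None and prompt in cache:
        return {"text": cache[prompt], "source": "cache", "degraded": True}
    return {"text": "We are busy right now. Please try again shortly.",
            "source": "notice", "degraded": True}
```

Returning a `degraded` flag rather than raising lets the interface show a softer message or a smaller result, and gives you something to count in monitoring.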
Frequently asked questions
Can I just ask my provider to raise my rate limits?
Often yes, and you should - higher limits give you more headroom. But higher limits never remove the problem, because traffic is still bursty and you can still exceed any ceiling in a short window. Treat a limit increase as buying room, not as a fix. You still need backoff and load shaping so that when you do brush the higher limit, the feature bends instead of breaking.
What is the difference between the request limit and the token limit?
The requests-per-minute limit caps how many calls you make; the tokens-per-minute limit caps the total size of those calls. You can breach the token limit with a handful of very large prompts even while making few requests. That is why trimming prompt size and context - sending only what the model needs - is a rate-limit tactic as much as a cost one: smaller calls stretch further under both ceilings.
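A small sketch makes the two ceilings concrete. The four-characters-per-token ratio is only a rough rule of thumb for English text, and the limits in the usage note are invented for illustration; use your provider's tokenizer and published quotas for real numbers. Whether output tokens count toward the limit also varies by provider, so this example counts prompt text only.

```python
from typing import List, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough size estimate; a real tokenizer will give different counts."""
    return max(1, round(len(text) / chars_per_token))


def fits_limits(prompts: List[str], rpm_limit: int, tpm_limit: int) -> Tuple[bool, bool]:
    """Return (within_request_limit, within_token_limit) for one minute of prompts."""
    total_tokens = sum(estimate_tokens(p) for p in prompts)
    return len(prompts) <= rpm_limit, total_tokens <= tpm_limit
```

Five prompts of 40,000 characters each come to roughly 50,000 estimated tokens, so against hypothetical limits of 60 requests and 30,000 tokens per minute, `fits_limits` reports the request limit as fine and the token limit as breached.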
Should every AI request go through a queue?
No. Queue the work that can tolerate a short delay - background summaries, enrichment, scoring. For a request a user is actively waiting on, a queue adds latency you do not want, so there you lean on the concurrency cap, the limiter, and a fast fallback instead. The split is by whether a human is waiting on the answer right now.
How do I know my limits before launch, not after?
Load-test against the real provider at the traffic shape you expect at peak, not the average. Ramp until you see 429s, note where they start, and set your limiter comfortably below that. Finding the ceiling in a test is cheap; finding it in front of users on launch day is not.
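Turning a ramp test into a limiter setting is simple arithmetic. The sketch below assumes you recorded, for each step of the ramp, the request rate and how many 429s came back; the function name and the 80% headroom factor are assumptions, not a recommendation from any provider.

```python
from typing import List, Optional, Tuple


def safe_rate(ramp: List[Tuple[int, int]], headroom: float = 0.8) -> Optional[int]:
    """Pick a limiter rate from load-test results.

    `ramp` holds (requests_per_minute, number_of_429s) per step, in
    increasing rate order. The result is `headroom` times the highest
    rate that saw no 429s before throttling began, or None if even the
    first step was throttled.
    """
    last_clean: Optional[int] = None
    for rate, throttled in ramp:
        if throttled > 0:
            break
        last_clean = rate
    return None if last_clean is None else int(last_clean * headroom)
```

If the ramp ran clean at 100, 200, and 400 requests per minute and started returning 429s at 800, this suggests setting the limiter at 320. Finer ramp steps between the last clean rate and the first throttled one narrow down where the ceiling really sits.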
The short version: your model works, but your provider limits how fast you can call it and your traffic arrives in bursts. Retry with backoff and jitter, shape the load with a queue and a limiter so you rarely hit the ceiling, and add a fallback so a saturated provider means slowness rather than a broken feature. Do that and the AI feature that dazzled in the demo also survives the day everyone actually uses it.
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
