production reliability · LLM APIs
Rate limits, retries & backoff — surviving the 429
Your first 429 always comes at 2am. This is the field guide: what's actually being limited, the tier-ladder mechanics nobody documents, the retry pattern that survives a thundering herd, and a simulator that shows exactly how jitter turns 15% success into 95%.
01 The three things being limited
Almost every model API enforces at least three independent limits. They run in parallel — you can hit any one of them, and "remaining headroom" on the others doesn't help.
RPM (requests per minute). Calls per minute, regardless of size. Even one-token requests count. Burns out first on chat apps with lots of small turns.
TPM (tokens per minute). Input + output tokens summed per rolling minute. Burns out first on RAG, long-document, and batch jobs.
Concurrent requests. How many requests can be open simultaneously. Burns out first on long-running streaming completions or agents.
On top of those: a daily organisational ceiling (your billing cap), per-model limits (cheap models often have tighter TPM than expensive ones on the same tier), and quietly-enforced per-key limits when admins split traffic. The "I have 60K TPM" you read in the docs is the most generous of these — the one that bit you was something else.
Watch the response headers: x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and retry-after (exact names vary by provider). Log them. Alert when remaining drops below 10%. You should see the cliff coming a minute before you fall off it.
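A minimal sketch of that alert check. It assumes the provider also sends matching limit headers; `x-ratelimit-limit-requests` and `x-ratelimit-limit-tokens` are assumptions here, and header names differ between providers, so confirm yours. The helper names `headroom` and `low_headroom` are hypothetical.

```python
from typing import Mapping

# "remaining" header names come from the text; the "limit" names are assumed.
PAIRS = {
    "requests": ("x-ratelimit-remaining-requests", "x-ratelimit-limit-requests"),
    "tokens": ("x-ratelimit-remaining-tokens", "x-ratelimit-limit-tokens"),
}

def headroom(headers: Mapping[str, str]) -> dict:
    """Return the fraction of each limit still available (0.0 to 1.0)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    out = {}
    for name, (remaining_key, limit_key) in PAIRS.items():
        try:
            remaining = float(lowered[remaining_key])
            limit = float(lowered[limit_key])
        except (KeyError, ValueError):
            continue  # header missing or unparsable: skip rather than guess
        if limit > 0:
            out[name] = remaining / limit
    return out

def low_headroom(headers: Mapping[str, str], threshold: float = 0.10) -> list:
    """Names of the limits whose remaining fraction is below the threshold."""
    return [name for name, frac in headroom(headers).items() if frac < threshold]

sample = {
    "x-ratelimit-remaining-requests": "45",
    "x-ratelimit-limit-requests": "50",
    "x-ratelimit-remaining-tokens": "3000",
    "x-ratelimit-limit-tokens": "60000",
}
print(low_headroom(sample))  # ['tokens']
```

In practice you would emit the fractions from `headroom` as a gauge metric on every response and let the alerting system own the threshold.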
02 The tier ladder
New accounts start in Tier 1: deliberately tiny limits, designed to keep abusers from spending themselves into bankruptcy on day one. You climb tiers by spending money and not being flagged. Tier 1 → Tier 5 is typically a 1–3 month journey of consistent legitimate usage.
| Tier | Requirement | Sonnet RPM | Sonnet TPM | Time to climb |
|---|---|---|---|---|
| Tier 1 | Any verified card | ~50 | ~40K | day zero |
| Tier 2 | $5 spent · 7+ days | ~1K | ~80K | ~1 week |
| Tier 3 | $40 spent · 7+ days | ~2K | ~160K | ~2 weeks |
| Tier 4 | $200 spent · 14+ days | ~4K | ~400K | ~1 month |
| Tier 5 | $400+ · 30+ days · clean record | ~4K+ | ~2M+ | ~3 months |
Approximate; providers adjust these. Confirm in your console.
What gets you flagged and knocked back down: a sudden burst from zero (looks like a stolen key), geographic anomalies (key issued in SF, traffic from three datacentres in different countries), and bursts that match known abuse patterns. The mitigations are unglamorous. Ramp slowly. Pre-deposit credit (some providers move you up a tier immediately on prepay). Open a support ticket with your use case before launch: provider rate-limit teams will manually bump you in days, not weeks, if you have a real story.
03 The retry pattern that works
There is a correct shape for retries and it is everywhere in production code that doesn't fall over:
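```python
# pseudocode that maps to anything
# (call_api, MAX_RETRIES, MAX_BACKOFF and the two exceptions are placeholders)
import random
import time

RETRYABLE = {429, 500, 502, 503, 504, 529}

def call_with_retries(request):
    for attempt in range(MAX_RETRIES):
        resp = call_api(request)
        if resp.ok:
            return resp
        if resp.status not in RETRYABLE:
            raise NotRetryable(resp)               # not retryable
        retry_after = float(resp.headers.get("retry-after", 0))
        base = 2 ** attempt                        # 1, 2, 4, 8…
        jitter = random.uniform(0.5, 1.5)          # spread the herd
        backoff = min(base * jitter, MAX_BACKOFF)  # cap only our own backoff
        time.sleep(max(retry_after, backoff))      # retry-after stays a floor
    raise RateLimitExhausted
```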
Four things matter, in decreasing order of how much they hurt when missed:
Honour retry-after. Both Anthropic and OpenAI return this header on 429s. Use it as a floor, not a ceiling. Ignoring it just causes another 429 (and provider abuse counters tick up).
Add jitter. Without jitter, every client retrying at the same moment will synchronise. They'll all wake up at t+1s, hammer the API, all get 429, all wait 2s, hammer again. Multiply your backoff by a random 0.5–1.5 factor. Done. The simulator below shows what this single change does.
Cap your retry budget. 5 attempts. 2-minute total. After that, fail loudly with structured logging. Infinite retries turn a 30-second blip into a 4-hour outage and a bill you don't want to explain.
Only retry retryable errors. Most provider errors are your fault and re-sending the exact same broken request will get the exact same error back.
| Status | Meaning | Retry? |
|---|---|---|
| 400 | Bad request — your prompt or schema is malformed. | never |
| 401 | Auth failed — your API key is wrong or revoked. | never |
| 403 | Forbidden — content policy, region block, account flag. | never |
| 404 | Model name typo, or model deprecated. | never |
| 422 | Validation — your tool schema or message order is invalid. | never |
| 429 | Rate limit — honour retry-after, then backoff + jitter. | yes |
| 500 | Server error — usually transient, retry. | yes |
| 502/503 | Bad gateway / unavailable — load balancer hiccup. | yes |
| 504 | Gateway timeout — only retry if request was idempotent. | careful |
| 529 | Overloaded (Anthropic) — global capacity, definitely retry. | yes |
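The table and the retry budget can be folded into one small wrapper. This is a sketch under stated assumptions, not a drop-in client: `send` is a hypothetical callable returning `(status, retry_after_seconds, body)`, the function names are made up for illustration, and the 5-attempt / 120-second defaults simply mirror the budget suggested above.

```python
import random
import time

ALWAYS_RETRY = {429, 500, 502, 503, 529}

def should_retry(status: int, idempotent: bool) -> bool:
    """Mirror the table: 504 is retried only when the request is idempotent."""
    if status in ALWAYS_RETRY:
        return True
    if status == 504:
        return idempotent
    return False  # the 4xx rows, plus anything unrecognised

def call_with_budget(send, *, idempotent=False, max_attempts=5, budget_s=120.0,
                     sleep=time.sleep, clock=time.monotonic, rng=random.uniform):
    """Call send() until success, a non-retryable status, or an exhausted budget."""
    deadline = clock() + budget_s
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if 200 <= status < 300:
            return body
        if not should_retry(status, idempotent):
            raise RuntimeError(f"non-retryable status {status}")
        # retry-after is a floor; jittered exponential backoff sits on top of it.
        delay = max(retry_after, (2 ** attempt) * rng(0.5, 1.5))
        if attempt == max_attempts - 1 or clock() + delay > deadline:
            break  # out of attempts, or the next sleep would blow the time budget
        sleep(delay)
    raise TimeoutError("retry budget exhausted")
```

`sleep`, `clock`, and `rng` are injectable so the loop can be unit-tested without waiting; production callers can leave the defaults alone.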
04 Idempotency keys
Here is the worst kind of bug: a 504 timeout where the request did reach the server, the server did execute it, but the response packet was lost. Your retry executes it again. Now the user got two emails, was charged twice, has two rows in the database.
The solution is an Idempotency-Key header. Generate a UUID per logical operation (not per attempt). Send the same UUID on retries. The server hashes it, looks up the cached response, and returns the original. Newer Anthropic and OpenAI endpoints are reported to honour it, but verify this per endpoint in the provider docs; where the header is ignored, you enforce idempotency in your own service.
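```shell
# Same Idempotency-Key on every retry of this logical op.
# (A comment between continuation lines would break the command, so it lives up here.)
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -H "Idempotency-Key: 7f3c2-a91-batch-row-49271" \
  -d '{ "model": "claude-sonnet-4-6", "messages": [...] }'
```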
Critical rule: the key must be deterministic for a logical operation, not random per-attempt. If you generate a fresh UUID inside your retry loop, you have idempotency keys that idempotency-protect nothing. Hash the input payload + the row ID + the date, or pass the key down from the queue worker that owns the job.
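One way to derive such a key, as a sketch: hash a canonical form of the payload together with the row ID and the date. The function name, the argument shapes, and the 32-character truncation are illustrative choices, not anything a provider requires.

```python
import hashlib
import json

def idempotency_key(payload: dict, row_id: str, date: str) -> str:
    """Derive a stable key from the logical operation, never from the attempt."""
    # sort_keys plus fixed separators give canonical JSON, so the order in
    # which the dict was built does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    material = "\n".join([canonical, row_id, date])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]
```

Compute the key once, outside the retry loop, and pass it in. Every attempt for row 49271 on a given date then carries the same header value, while a different row or a changed payload gets a different one.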
05 Hedged requests for tail latency
For latency-critical interactive apps where p99 matters more than money: fire the request, wait 200ms, then fire a second identical request to a backup region or backup provider. First response wins; cancel the other. Costs roughly 2× in dollars but cuts p99 dramatically — because the slow request usually isn't slow because the model is slow, it's slow because your specific connection landed on a bad host, and the second attempt rolls the dice again.
```python
# sketch: typically lives in your client wrapper
# call(region, req) is your own async client function
import asyncio

async def hedged_call(req, hedge_after_ms=200):
    primary = asyncio.create_task(call("us-east", req))
    # Give the primary a head start before paying for a second request.
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_ms / 1000)
    if primary in done:
        return primary.result()
    hedge = asyncio.create_task(call("eu-west", req))
    done, pending = await asyncio.wait(
        {primary, hedge}, return_when=asyncio.FIRST_COMPLETED
    )
    for t in pending:
        t.cancel()  # save money on the loser
    return next(iter(done)).result()
```
06 Dynamic queue management
When you have a fleet of N workers hitting one provider, the failure mode isn't any single worker misbehaving — it's all of them being right but cumulatively wrong. Three patterns survive this:
Token bucket at the queue. A single shared bucket (Redis sorted set, in-process semaphore, whatever) that enforces your TPM locally before any request reaches the API. The API never sees the burst because you ate it. This is the most underused production pattern; it costs you a 5-line Redis Lua script and removes most 429s permanently.
Adaptive concurrency. The same algorithm as TCP congestion control — start with N concurrent workers, halve N when 429 rate spikes, slowly increase when no errors. AIMD (additive increase, multiplicative decrease). Self-tunes to whatever the provider is willing to give you today, including when they silently change it.
Per-tenant fairness. If you're multi-tenant and one customer ships a buggy loop, naive queueing means everyone queues behind their burst. Token-bucket per tenant, with the provider-level bucket on top as a global ceiling. The well-behaved tenants don't notice the abusive one.
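The three patterns above can be sketched in a few dozen lines. This version is in-process only and deliberately simplified: a real fleet would keep the bucket state in a shared store such as the Redis option mentioned above, and every class name and parameter here is hypothetical.

```python
import time

class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.updated = clock()

    def has(self, cost: float) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        return self.tokens >= cost

    def take(self, cost: float) -> None:
        self.tokens -= cost


class TenantLimiter:
    """Per-tenant buckets under one global bucket that acts as the ceiling."""

    def __init__(self, global_bucket: TokenBucket, make_tenant_bucket):
        self.global_bucket = global_bucket
        self.make_tenant_bucket = make_tenant_bucket
        self.tenants = {}

    def try_acquire(self, tenant: str, cost: float) -> bool:
        if tenant not in self.tenants:
            self.tenants[tenant] = self.make_tenant_bucket()
        bucket = self.tenants[tenant]
        # Check both buckets before spending either, so a refusal costs nothing.
        if bucket.has(cost) and self.global_bucket.has(cost):
            bucket.take(cost)
            self.global_bucket.take(cost)
            return True
        return False


class AIMD:
    """Adaptive concurrency: add one after a clean window, halve after 429s."""

    def __init__(self, start: int, floor: int = 1, ceiling: int = 256):
        self.limit, self.floor, self.ceiling = start, floor, ceiling

    def on_window(self, saw_429: bool) -> int:
        if saw_429:
            self.limit = max(self.floor, self.limit // 2)
        else:
            self.limit = min(self.ceiling, self.limit + 1)
        return self.limit
```

A worker would call `try_acquire(tenant, estimated_tokens)` before each request and requeue the job on `False`; a supervisor would call `on_window` once per observation window and resize the worker pool to the returned limit.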
07 Watch the retry storm in action
Pick a load pattern and a retry strategy. Watch a 60-second timeline play out. The "API endpoint" in the middle has a token bucket: 100 requests/min, refilling at ~1.67 req/sec. Anything above that gets a 429.
[Interactive simulator not reproduced here. Each run reports four counters: success, failed, 429s seen, and total attempts.]
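For readers without the interactive widget, here is a rough stand-in you can run locally. It is not the article's simulator and will not reproduce its exact percentages. The burst capacity of 10, the 200-client stampede at t=0, and the 5-attempt cap are all assumptions chosen for illustration; only the 100 requests/min refill rate and the 60-second timeline come from the description above.

```python
import heapq
import random

def simulate(n_clients=200, strategy="fixed", max_attempts=5,
             rate_per_min=100, burst=10, horizon_s=60.0, seed=0):
    """Replay a burst of clients against a token-bucket endpoint."""
    rng = random.Random(seed)
    rate = rate_per_min / 60.0            # ~1.67 tokens per second
    tokens, last = float(burst), 0.0
    # Each event is (time, client_id, attempt_number); everyone fires at t=0.
    events = [(0.0, cid, 0) for cid in range(n_clients)]
    heapq.heapify(events)
    stats = {"success": 0, "failed": 0, "429s": 0, "attempts": 0}
    while events:
        t, cid, attempt = heapq.heappop(events)
        if t > horizon_s:
            stats["failed"] += 1          # ran off the end of the timeline
            continue
        tokens = min(burst, tokens + (t - last) * rate)
        last = t
        stats["attempts"] += 1
        if tokens >= 1.0:
            tokens -= 1.0
            stats["success"] += 1
            continue
        stats["429s"] += 1
        if attempt + 1 >= max_attempts:
            stats["failed"] += 1          # retry budget spent
            continue
        if strategy == "fixed":
            delay = 1.0                   # everyone wakes up together
        else:                             # exponential backoff x 0.5-1.5 jitter
            delay = (2 ** attempt) * rng.uniform(0.5, 1.5)
        heapq.heappush(events, (t + delay, cid, attempt + 1))
    return stats
```

Compare `simulate(strategy="fixed")` with `simulate(strategy="jitter")`: with a fixed delay the retries land in synchronised waves and most of them collide, while the jittered runs spread across the timeline and pick up tokens as they refill.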
08 Three pages from the on-call log
▸ The 3am batch job
Cron job kicked off at 03:00 UTC. Hammered the API as fast as the worker pool would go (200 concurrent). Hit TPM at 03:00:08. Retry loop kicked in with 1-second fixed delay. Hit TPM again at 03:00:09. Retried again. Pager woke us at 03:14 when the queue depth crossed 50K and downstream consumers started timing out. Bill that morning: real.
Fix: token-bucket the job itself, not the retry loop. The cron is the producer; rate-limit at the producer. We dropped concurrency from 200 → 8 with a local TPM bucket, and the same job finished 11 minutes later but with zero 429s and lower total tokens (no retry waste).
▸ The silent snapshot rollout
Provider rolled out a new model snapshot version with stricter TPM enforcement on a Tuesday. Our clients didn't notice until Wednesday's traffic peak. Error rate climbed from 0.1% to 14% over forty minutes. We'd been at 95% of the previous TPM limit during peaks for weeks and never looked.
Fix: ship x-ratelimit-remaining-tokens as a metric and alert when it drops below 15%. The data was always in the response; we just weren't reading it.
▸ The cold-start tier flap
New API key issued to a new microservice. First production deploy fires 800 requests in 30 seconds. Tier 1 ceiling is 50 RPM. Service crash-loops. Pager.
Fix: warm-start with a single low-token request, confirm 200 OK, then ramp concurrency over the first 60 seconds. Also: contact provider support, prepay $500, get bumped to Tier 3 inside 30 minutes. Always do this before a launch, not during one.
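The ramp half of that fix can be as small as one function. A minimal sketch, assuming a linear ramp from one worker up to the target over the first 60 seconds after the warm-up request succeeds; the function name and the linear shape are illustrative, not the only reasonable choice.

```python
def allowed_concurrency(elapsed_s: float, target: int, ramp_s: float = 60.0) -> int:
    """Linear ramp from 1 worker at t=0 to `target` workers at ramp_s."""
    if elapsed_s >= ramp_s:
        return target
    frac = max(0.0, elapsed_s) / ramp_s
    # Never drop below one worker, and round down so the ramp errs on the slow side.
    return max(1, int(1 + (target - 1) * frac))
```

The worker pool polls this each time it considers starting a new request, with `elapsed_s` measured from the first confirmed 200 OK rather than from process start.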
09 A representative on-call log
This is the shape of a request that survives, from one of our wrappers:
Note what's there: a stable req_id across retries, an idempotency-key on the side-effect-bearing send, the actual retry-after value being respected, jitter explicitly logged, and the remaining-tokens header captured so we can graph proximity to the limit.
10 What to do when you see a 429
Honour retry-after. Wait at least that long. Retry.
If x-ratelimit-remaining-requests drops under 5%, slow down preemptively; don't wait for the cliff.