[2026-05-13 02:47:13] PAGER · 429 burst · model=sonnet-4.6 · err_rate=62% · queue=4,712

production reliability · LLM APIs

Rate limits, retries & backoff — surviving the 429

Your first 429 always comes at 2am. This is the field guide: what's actually being limited, the tier-ladder mechanics nobody documents, the retry pattern that survives a thundering herd, and a simulator that shows exactly how jitter turns 15% success into 95%.

01 The three things being limited

Almost every model API enforces at least three independent limits. They run in parallel — you can hit any one of them, and "remaining headroom" on the others doesn't help.

REQUESTS / MIN RPM

Calls per minute, regardless of size. Even one-token requests count. Burns out first on chat apps with lots of small turns.

TOKENS / MIN TPM

Input + output tokens summed per rolling minute. Burns out first on RAG, long-document, and batch jobs.

IN-FLIGHT CONC

How many requests can be open simultaneously. Burns out first on long-running streaming completions or agents.

On top of those: a daily organisational ceiling (your billing cap), per-model limits (cheap models often have tighter TPM than expensive ones on the same tier), and quietly-enforced per-key limits when admins split traffic. The "I have 60K TPM" you read in the docs is the most generous of these — the one that bit you was something else.

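Where the rate-limit signal actually lives: the response headers, not the body. Header names vary by provider: OpenAI returns x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, Anthropic returns anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining, and a 429 carries retry-after. Log them. Alert when remaining drops below 10% of the matching limit header. You should see the cliff coming a minute before you fall off it.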
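
A minimal sketch of turning those headers into a headroom fraction you can graph and alert on. It assumes the OpenAI-style x-ratelimit-remaining-* names plus the matching x-ratelimit-limit-* headers as the denominator; the `HEADER_PAIRS` table and both function names are illustrative, so swap in whatever your provider actually returns.

```python
from typing import Mapping

# Assumed header names (OpenAI-style). Adapt to your provider.
HEADER_PAIRS = {
    "requests": ("x-ratelimit-remaining-requests", "x-ratelimit-limit-requests"),
    "tokens": ("x-ratelimit-remaining-tokens", "x-ratelimit-limit-tokens"),
}

def headroom(headers: Mapping[str, str]) -> dict:
    """Return remaining/limit as a fraction per dimension, skipping missing or bad values."""
    lowered = {k.lower(): v for k, v in headers.items()}   # header names are case-insensitive
    out = {}
    for name, (remaining_key, limit_key) in HEADER_PAIRS.items():
        try:
            remaining = float(lowered[remaining_key])
            limit = float(lowered[limit_key])
        except (KeyError, ValueError):
            continue
        if limit > 0:
            out[name] = remaining / limit
    return out

def low_headroom(headers: Mapping[str, str], threshold: float = 0.10) -> list:
    """Names of the dimensions whose remaining fraction is below the threshold."""
    return sorted(name for name, frac in headroom(headers).items() if frac < threshold)
```

Emit `headroom(...)` as a gauge on every response and page on `low_headroom(...)` being non-empty for more than a few seconds.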

02 The tier ladder

New accounts start in Tier 1: deliberately tiny limits, designed to keep abusers from spending themselves into bankruptcy on day one. You climb tiers by spending money and not being flagged. Tier 1 → Tier 5 is typically a 1–3 month journey of consistent legitimate usage.

Tier	Requirement	Sonnet RPM	Sonnet TPM	Time to climb
Tier 1	Any verified card	~50	~40K	day zero
Tier 2	$5 spent · 7+ days	~1K	~80K	~1 week
Tier 3	$40 spent · 7+ days	~2K	~160K	~2 weeks
Tier 4	$200 spent · 14+ days	~4K	~400K	~1 month
Tier 5	$400+ · 30+ days · clean record	~4K+	~2M+	~3 months

Approximate; providers adjust these. Confirm in your console.

What gets you flagged and knocked back down a tier: a sudden burst from zero (looks like a stolen key), geographic anomalies (key issued in SF, traffic from three datacentres in different countries), and bursts that match known abuse patterns. The mitigations are unglamorous. Ramp slowly. Pre-deposit credit (some providers move you up a tier immediately on prepay). Open a support ticket with your use case before launch: provider rate-limit teams will manually bump you in days, not weeks, if you have a real story.

03 The retry pattern that works

There is a correct shape for retries and it is everywhere in production code that doesn't fall over:

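# Python sketch that maps to anything; call_api() stands in for your HTTP client
import random
import time

MAX_RETRIES = 5      # total attempts
MAX_BACKOFF = 30.0   # seconds (assumed value); caps our own backoff, never retry-after
RETRYABLE = {429, 500, 502, 503, 504, 529}

class RateLimitExhausted(Exception):
    pass

def call_with_retries(request):
    for attempt in range(MAX_RETRIES):
        resp = call_api(request)
        if resp.ok:
            return resp
        if resp.status not in RETRYABLE:
            raise RuntimeError(f"not retryable: HTTP {resp.status}")
        if attempt == MAX_RETRIES - 1:
            break                                    # budget spent; don't sleep before failing
        try:
            retry_after = float(resp.headers.get("retry-after", 0))
        except ValueError:
            retry_after = 0.0                        # header held an HTTP date, not seconds
        base = 2 ** attempt                          # 1, 2, 4, 8…
        jitter = random.uniform(0.5, 1.5)            # spread the herd
        backoff = min(base * jitter, MAX_BACKOFF)
        time.sleep(max(retry_after, backoff))        # retry-after is a floor, not a ceiling
    raise RateLimitExhausted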

Four things matter, in decreasing order of how much they hurt when missed:

Honour retry-after. Both Anthropic and OpenAI return this header on 429s. Use it as a floor, not a ceiling. Ignoring it just causes another 429 (and provider abuse counters tick up).

Add jitter. Without jitter, every client retrying at the same moment will synchronise. They'll all wake up at t+1s, hammer the API, all get 429, all wait 2s, hammer again. Multiply your backoff by a random 0.5–1.5 factor. Done. The simulator below shows what this single change does.

Cap your retry budget. 5 attempts. 2-minute total. After that, fail loudly with structured logging. Infinite retries turn a 30-second blip into a 4-hour outage and a bill you don't want to explain.

Only retry retryable errors. Most provider errors are your fault and re-sending the exact same broken request will get the exact same error back.

Status	Meaning	Retry?
400	Bad request — your prompt or schema is malformed.	never
401	Auth failed — your API key is wrong or revoked.	never
403	Forbidden — content policy, region block, account flag.	never
404	Model name typo, or model deprecated.	never
422	Validation — your tool schema or message order is invalid.	never
429	Rate limit — honour `retry-after`, then backoff + jitter.	yes
500	Server error — usually transient, retry.	yes
502/503	Bad gateway / unavailable — load balancer hiccup.	yes
504	Gateway timeout — only retry if request was idempotent.	careful
529	Overloaded (Anthropic) — global capacity, definitely retry.	yes
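
The table and the retry budget can be folded into one decision function. This is a sketch under stated assumptions: `should_retry` and its parameter names are hypothetical, the defaults mirror the 5-attempt, 2-minute budget above, and 504 is retried only when the caller says the request is idempotent.

```python
# Statuses the table marks "yes": safe to retry after backing off.
RETRYABLE = {429, 500, 502, 503, 529}

def should_retry(status: int, *, idempotent: bool, attempt: int, elapsed_s: float,
                 max_attempts: int = 5, max_elapsed_s: float = 120.0) -> bool:
    """Decide whether to retry.

    `attempt` counts attempts already made (1 after the first call);
    `elapsed_s` is wall-clock time since the first attempt started.
    """
    if attempt >= max_attempts or elapsed_s >= max_elapsed_s:
        return False              # budget spent: fail loudly instead
    if status in RETRYABLE:
        return True
    if status == 504:
        return idempotent         # the request may already have executed server-side
    return False                  # 400/401/403/404/422 and anything unrecognised
```

Unknown statuses fall through to "no" on purpose: retrying something you don't understand is how a blip becomes a bill.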

04 Idempotency keys

Here is the worst kind of bug: a 504 timeout where the request did reach the server, the server did execute it, but the response packet was lost. Your retry executes it again. Now the user got two emails, was charged twice, has two rows in the database.

Solution: an Idempotency-Key header. Generate one key per logical operation (not per attempt) and send the same key on every retry. A server that supports it looks the key up, finds the cached response, and returns the original instead of executing the request again. Support varies by provider and endpoint, so confirm it in the API reference before relying on it; where it isn't honoured, enforce the same dedup in your own service.

# Send the same Idempotency-Key on every retry of this logical op
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -H "Idempotency-Key: 7f3c2-a91-batch-row-49271" \
  -d '{ "model": "claude-sonnet-4-6", "max_tokens": 1024, "messages": [...] }'

Critical rule: the key must be deterministic for a logical operation, not random per-attempt. If you generate a fresh UUID inside your retry loop, you have idempotency keys that idempotency-protect nothing. Hash the input payload + the row ID + the date, or pass the key down from the queue worker that owns the job.
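
One way to derive such a key, sketched with the standard library. The function name, the separator, and the 32-character truncation are illustrative choices, not a provider requirement; the only property that matters is that the same logical operation always yields the same string.

```python
import hashlib
import json

def idempotency_key(payload: dict, row_id: str, date: str) -> str:
    """Derive a stable key from the payload, the row ID, and the date (e.g. "2026-05-13")."""
    # Canonical JSON so that dict ordering cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{row_id}|{date}|{canonical}".encode("utf-8")).hexdigest()
    return f"{row_id}-{digest[:32]}"
```

Compute it once where the job is created and pass it down; the retry loop should only ever read it.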

05 Hedged requests for tail latency

For latency-critical interactive apps where p99 matters more than money: fire the request, wait 200ms, then fire a second identical request to a backup region or backup provider. First response wins; cancel the other. Costs roughly 2× in dollars but cuts p99 dramatically — because the slow request usually isn't slow because the model is slow, it's slow because your specific connection landed on a bad host, and the second attempt rolls the dice again.

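# sketch: typically lives in your client wrapper; call(region, req) is your async client
import asyncio

async def hedged_call(req, hedge_after_ms=200):
    primary = asyncio.create_task(call("us-east", req))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_ms / 1000)
    if primary in done:
        return primary.result()                   # fast path: answered (or failed) before the hedge
    hedge = asyncio.create_task(call("eu-west", req))
    pending = {primary, hedge}
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for t in done:
            if t.exception() is None:             # first successful response wins
                for p in pending:
                    p.cancel()                    # save money on the loser
                return t.result()
    return primary.result()                       # both failed: re-raise the primary's error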

Be careful with hedging non-idempotent calls. If you hedge a tool call that, say, sends an email, you might send two. Pair hedging with idempotency keys (server-side dedup) or restrict it to pure read-style operations.

06 Dynamic queue management

When you have a fleet of N workers hitting one provider, the failure mode isn't any single worker misbehaving — it's all of them being right but cumulatively wrong. Three patterns survive this:

Token bucket at the queue. A single shared bucket (Redis sorted set, in-process semaphore, whatever) that enforces your TPM locally before any request reaches the API. The API never sees the burst because you ate it. This is the most underused production pattern; it costs you a 5-line Redis Lua script and removes most 429s permanently.
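
A minimal in-process sketch of the bucket, to show the mechanics. It is an assumption-laden stand-in: a real fleet would keep the same two fields (tokens, last update time) in a shared store so every worker draws from one bucket, and the class and method names here are hypothetical.

```python
import time

class TokenBucket:
    """In-process token bucket; a fleet would keep the same state in a shared store."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock                 # injectable so tests can control time
        self.tokens = capacity
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens will be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.refill_per_sec)
```

For a 60K TPM limit you would construct `TokenBucket(60_000, 1_000)`, estimate each request's token cost before sending, and sleep for `wait_time(cost)` instead of letting the API say no.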

Adaptive concurrency. The same algorithm as TCP congestion control — start with N concurrent workers, halve N when 429 rate spikes, slowly increase when no errors. AIMD (additive increase, multiplicative decrease). Self-tunes to whatever the provider is willing to give you today, including when they silently change it.
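
The AIMD rule fits in a few lines. This sketch assumes you call `on_success` and `on_rate_limited` from your worker loop and read `allowed` before dispatching; the class name, the defaults, and the choice to halve on every 429 (rather than on a windowed 429 rate, which a production version would likely prefer) are all illustrative.

```python
class AIMDLimit:
    """Concurrency limit tuned by additive increase, multiplicative decrease."""

    def __init__(self, initial: int = 8, floor: int = 1, ceiling: int = 200,
                 increase: float = 1.0, decrease: float = 0.5):
        self.limit = float(initial)
        self.floor, self.ceiling = floor, ceiling
        self.increase, self.decrease = increase, decrease

    def on_success(self) -> None:
        # Additive: roughly +`increase` per full window of successes at the current limit.
        self.limit = min(self.ceiling, self.limit + self.increase / self.limit)

    def on_rate_limited(self) -> None:
        # Multiplicative: cut the limit by `decrease` as soon as a 429 is seen.
        self.limit = max(self.floor, self.limit * self.decrease)

    @property
    def allowed(self) -> int:
        """Whole number of requests that may be in flight right now."""
        return max(self.floor, int(self.limit))
```

The asymmetry is the point: recovery is slow and cautious, backing off is immediate.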

Per-tenant fairness. If you're multi-tenant and one customer ships a buggy loop, naive queueing means everyone queues behind their burst. Token-bucket per tenant, with the provider-level bucket on top as a global ceiling. The well-behaved tenants don't notice the abusive one.
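
A sketch of the two-level check, assuming token-per-minute budgets and an in-process store. `Bucket` and `FairLimiter` are hypothetical names; the detail worth copying is that both buckets are checked before either is charged, so a rejected request costs nothing.

```python
import time
from collections import defaultdict

class Bucket:
    def __init__(self, capacity, refill_per_sec, clock):
        self.capacity, self.refill_per_sec, self.clock = capacity, refill_per_sec, clock
        self.tokens, self.updated = capacity, clock()

    def available(self):
        """Refill for the elapsed time, then report the current token count."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        return self.tokens

class FairLimiter:
    """Per-tenant buckets under one global bucket; a request must fit in both."""

    def __init__(self, global_tpm, tenant_tpm, clock=time.monotonic):
        self.global_bucket = Bucket(global_tpm, global_tpm / 60, clock)
        self.tenants = defaultdict(lambda: Bucket(tenant_tpm, tenant_tpm / 60, clock))

    def try_acquire(self, tenant, cost):
        tenant_bucket = self.tenants[tenant]
        # Check both before spending either, so a rejected request consumes nothing.
        if tenant_bucket.available() < cost or self.global_bucket.available() < cost:
            return False
        tenant_bucket.tokens -= cost
        self.global_bucket.tokens -= cost
        return True
```

Set the per-tenant budget well below the global one; if a single tenant can consume the whole provider budget, the per-tenant bucket is decoration.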

07 Watch the retry storm in action

Pick a load pattern and a retry strategy. Watch a 60-second timeline play out. The "API endpoint" in the middle has a token bucket: 100 requests/min, refilling at ~1.67 req/sec. Anything above that gets a 429.

[Interactive retry-storm simulator. Controls: load pattern, retry strategy. Output: a 60-second timeline (bucket capacity 100, refill 1.67/s) marking each incoming request as 200 OK, 429, retried, or final failure, with counters for success, failed, 429s seen, and total attempts.]

Pick a pattern and a strategy, then run. The naive strategy on a burst should fail almost everything; the same load on exp + jitter should succeed almost everything.
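
If you can't run the interactive version, the same experiment can be approximated offline. This is a rough sketch, not the page's simulator: it models one load pattern only (a burst at t=0), the function name and parameters are hypothetical, and the exact percentages depend on the burst size and retry budget you choose.

```python
import heapq
import random

def simulate(clients=130, strategy="naive", max_attempts=6,
             capacity=100.0, refill_per_sec=100 / 60, seed=0):
    """Burst of `clients` requests at t=0 against a token-bucket endpoint."""
    rng = random.Random(seed)
    tokens, last = capacity, 0.0
    events = [(0.0, cid, 0) for cid in range(clients)]   # (time, client, attempt index)
    heapq.heapify(events)
    stats = {"success": 0, "failed": 0, "429s": 0, "attempts": 0}
    while events:
        now, cid, attempt = heapq.heappop(events)
        tokens = min(capacity, tokens + (now - last) * refill_per_sec)
        last = now
        stats["attempts"] += 1
        if tokens >= 1.0:
            tokens -= 1.0
            stats["success"] += 1
            continue
        stats["429s"] += 1
        if attempt + 1 >= max_attempts:
            stats["failed"] += 1                          # retry budget spent
            continue
        if strategy == "naive":
            delay = 1.0                                   # fixed delay: the herd stays in step
        else:
            delay = (2 ** attempt) * rng.uniform(0.5, 1.5)  # exponential backoff with jitter
        heapq.heappush(events, (now + delay, cid, attempt + 1))
    return stats

if __name__ == "__main__":
    for name in ("naive", "jitter"):
        print(name, simulate(strategy=name))
```

With the defaults, the fixed-delay run is capped by arithmetic: 100 tokens in the bucket plus roughly 8 refilled during its 5 seconds of retries, so it can never serve more than 108 of the 130 clients. The jittered run spreads its retries over a much longer window and gets to use the tokens that refill in the meantime.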

08 Three pages from the on-call log

▸ The 3am batch job

Cron job kicked off at 03:00 UTC. Hammered the API as fast as the worker pool would go (200 concurrent). Hit TPM at 03:00:08. Retry loop kicked in with 1-second fixed delay. Hit TPM again at 03:00:09. Retried again. Pager woke us at 03:14 when the queue depth crossed 50K and downstream consumers started timing out. Bill that morning: real.

Fix: token-bucket the job itself, not the retry loop. The cron is the producer; rate-limit at the producer. We dropped concurrency from 200 → 8 with a local TPM bucket, and the same job finished 11 minutes later but with zero 429s and lower total tokens (no retry waste).

▸ The silent snapshot rollout

Provider rolled out a new model snapshot version with stricter TPM enforcement on a Tuesday. Our clients didn't notice until Wednesday's traffic peak. Error rate climbed from 0.1% to 14% over forty minutes. We'd been at 95% of the previous TPM limit during peaks for weeks and never looked.

Fix: ship x-ratelimit-remaining-tokens as a metric and alert when it drops below 15%. The data was always in the response; we just weren't reading it.

▸ The cold-start tier flap

New API key issued to a new microservice. First production deploy fires 800 requests in 30 seconds. Tier 1 ceiling is 50 RPM. Service crash-loops. Pager.

Fix: warm-start with a single low-token request, confirm 200 OK, then ramp concurrency over the first 60 seconds. Also: contact provider support, prepay $500, get bumped to Tier 3 inside 30 minutes. Always do this before a launch, not during one.
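
The ramp half of that fix can be sketched as a schedule the service consults at startup. It assumes a linear climb in equal steps; `ramp_schedule` and its parameters are hypothetical, and the initial low-token probe request is left to the caller.

```python
def ramp_schedule(target_concurrency: int, ramp_seconds: int = 60, steps: int = 6):
    """Return (start_second, concurrency) pairs that climb linearly to the target."""
    schedule = []
    for i in range(1, steps + 1):
        start = (i - 1) * ramp_seconds // steps           # when this step begins
        concurrency = max(1, target_concurrency * i // steps)
        schedule.append((start, concurrency))
    return schedule
```

For a target of 24 workers over 60 seconds this yields steps of 4, 8, 12, 16, 20, and 24 workers starting at 0, 10, 20, 30, 40, and 50 seconds.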

09 A representative on-call log

This is the shape of a request that survives, from one of our wrappers:

02:47:13.402 INFO send req_id=7f3c-a91 model=sonnet-4.6 tokens=1820
02:47:13.610 WARN recv 429 retry-after=2.4 x-rl-remaining-tokens=0
02:47:13.610 INFO backoff attempt=1 base=1.0s jitter=0.83 → sleep 2.40s (floor=retry-after)
02:47:16.012 INFO send req_id=7f3c-a91 idempotency-key=batch-49271-row-7
02:47:16.890 WARN recv 429 retry-after=1.1
02:47:16.890 INFO backoff attempt=2 base=2.0s jitter=1.31 → sleep 2.62s
02:47:19.512 INFO send req_id=7f3c-a91
02:47:20.844 OK recv 200 tokens_in=1820 tokens_out=412 x-rl-remaining-tokens=38214

Note what's there: a stable req_id across retries, an idempotency-key on the side-effect-bearing send, the actual retry-after value being respected, jitter explicitly logged, and the remaining-tokens header captured so we can graph proximity to the limit.

10 What to do when you see a 429

1. Read retry-after. Wait at least that long. Retry.

2. If you see a 429 twice in a row, halve your current concurrency for the next minute.

3. If x-ratelimit-remaining-requests drops under 5%, slow down preemptively; don't wait for the cliff.

4. If you've burned the retry budget (5 attempts), fail loudly with the full headers in your error log. Don't swallow.

5. If you're growing and still under Tier 5, talk to support. The fastest tier bump is a sales call, not a payment.

6. If the storm is your own (multiple workers, no shared bucket): add the bucket. Token-bucket at the producer, always.

The one-line summary: honour the header, add jitter, cap the budget, log the limit-remaining metric. Most 429 outages aren't the provider being down; they're you synchronising your own thundering herd. The provider's job is to throttle you. Your job is to not need throttling.