API Rate Limiting Patterns That Actually Work in Production


Rate limiting sounds straightforward: limit how many requests a client can make in a given time window. But the gap between a basic implementation and one that works well in production is enormous. I’ve seen teams spend weeks debugging rate limiting that either blocks paying customers or does nothing to stop abuse.

The simplest approach—fixed window counters—counts requests per time window (e.g., 100 requests per minute). It works, but has a well-known edge case: a burst of 100 requests at the end of one window and 100 at the start of the next gives you 200 requests in a two-second span. That might be fine for your use case, or it might overwhelm your backend.
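As a concrete illustration, here's a minimal in-memory sketch of a fixed window counter (class and parameter names are mine; the injectable clock exists only to make the boundary-burst behavior easy to demonstrate, and a production version would keep these counters in a shared store):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Counts requests per client in fixed time windows.
    Hypothetical in-memory sketch, not a production implementation."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for deterministic testing
        self.counts = defaultdict(int)  # (client_id, window_id) -> count

    def allow(self, client_id):
        # All timestamps in the same window share one window_id.
        window_id = int(self.clock() // self.window)
        key = (client_id, window_id)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

Note how a client that exhausts its quota just before a window boundary gets a fresh quota the moment the next window starts, which is exactly the double-burst edge case described above.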

Sliding window algorithms fix this by smoothing the count across windows. The sliding window log tracks timestamps of every request and counts how many fall within the current window. It’s accurate but expensive—storing a timestamp per request adds up fast at scale. The sliding window counter is a compromise: it interpolates between the current and previous fixed windows to approximate a sliding count. Most production systems use some variation of this.
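The interpolation in the sliding window counter can be sketched in a few lines. This is a generic illustration of the technique, not any particular production system's code; the previous window's count is weighted by how much of that window still overlaps the sliding window:

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Approximates a sliding window by interpolating between the
    previous and current fixed windows. In-memory sketch."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for deterministic testing
        self.counts = defaultdict(int)  # (client_id, window_id) -> count

    def allow(self, client_id):
        now = self.clock()
        window_id = int(now // self.window)
        elapsed_frac = (now % self.window) / self.window
        prev = self.counts[(client_id, window_id - 1)]
        curr = self.counts[(client_id, window_id)]
        # Weight the previous window by the fraction of it still inside
        # the sliding window, then add the current window's full count.
        estimated = prev * (1.0 - elapsed_frac) + curr
        if estimated >= self.limit:
            return False
        self.counts[(client_id, window_id)] += 1
        return True
```

Only two counters per client are stored, rather than one timestamp per request as in the sliding window log, which is why this variant scales so much better.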

Token bucket is my preferred pattern for most APIs. Each client gets a bucket that fills with tokens at a steady rate. Each request consumes a token. If the bucket is empty, the request is rejected or queued. The beauty is that it naturally allows short bursts while enforcing a sustained rate. A bucket that holds 50 tokens and refills at 10 per second lets a client burst to 50 requests instantly, then sustain 10 per second after that.
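A minimal token bucket sketch, under the same caveats as above (in-memory, names mine, clock injectable for testing). The 50-token, 10-per-second example from the text maps directly onto the `capacity` and `refill_rate` parameters:

```python
import time

class TokenBucket:
    """Refills `refill_rate` tokens per second up to `capacity`;
    each request spends one token. Hypothetical sketch."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock  # injectable for deterministic testing
        self.tokens = float(capacity)  # start full: allows an initial burst
        self.last = clock()

    def allow(self, cost=1.0):
        # Lazily refill based on elapsed time since the last check.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Refilling lazily on each call, rather than on a timer, means there is no background work per client, just two floats of state per bucket.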

Where things get complicated is deciding what to rate limit on. IP address is the obvious choice but breaks down quickly. Corporate offices share a single external IP. Mobile carriers use CGNAT, putting thousands of users behind one IP. VPNs and proxies further muddy the picture.

API keys work better for authenticated APIs. Each key gets its own limit, and you can tier limits based on pricing plans. But you need to handle the case where someone creates multiple free accounts to circumvent limits. Combining API key limits with IP-based secondary limits catches most of this.
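One way to combine the two layers is to require both limiters to agree. This sketch is illustrative (the inner `SimpleLimit` ignores time windows entirely for brevity; a real deployment would use proper windowed limiters for both layers):

```python
from collections import defaultdict

class SimpleLimit:
    """Minimal per-identifier counter, no time window; for illustration only."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)

    def allow(self, ident):
        if self.counts[ident] >= self.limit:
            return False
        self.counts[ident] += 1
        return True

class LayeredLimiter:
    """Admits a request only if both the per-key and per-IP limiters agree.
    `key_limiter` and `ip_limiter` are any objects with an allow() method."""

    def __init__(self, key_limiter, ip_limiter):
        self.key_limiter = key_limiter
        self.ip_limiter = ip_limiter

    def allow(self, api_key, ip):
        # Note: short-circuiting means the key limiter's quota is consumed
        # even when the IP limiter then rejects the request; acceptable
        # here, but worth knowing about.
        return self.key_limiter.allow(api_key) and self.ip_limiter.allow(ip)
```

The per-IP layer is what catches the multiple-free-accounts trick: each fresh API key gets its own quota, but the shared IP's quota caps them all in aggregate.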

After consulting with an AI consultancy about one project, we implemented adaptive rate limiting that adjusts thresholds based on server load. During normal operation, limits are generous. When the system is under stress, limits tighten automatically. This requires monitoring infrastructure but provides much better user experience than static limits.
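The core idea can be reduced to a function that scales the base limit by current load. The thresholds below are invented for illustration, not the values from the project described; the shape (generous below a low-load threshold, clamped to a floor above a high-load threshold, linear in between) is the point:

```python
def adaptive_limit(base_limit, load, low=0.5, high=0.9, floor=0.2):
    """Scale a per-client limit down as server load rises.
    `load` is a 0.0-1.0 utilization figure from your monitoring.
    Thresholds here are illustrative placeholders."""
    if load <= low:
        scale = 1.0          # normal operation: full limit
    elif load >= high:
        scale = floor        # under stress: clamp to the floor fraction
    else:
        # Linear ramp between the two thresholds.
        scale = 1.0 - (1.0 - floor) * (load - low) / (high - low)
    return max(1, round(base_limit * scale))
```

The function is pure, so it slots in front of any of the limiters above: compute the effective limit per request (or per monitoring tick) and pass it through.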

Response headers matter more than most teams realize. Including X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers lets well-behaved clients self-throttle before hitting limits. Good API clients will check these headers and back off. Not returning them forces clients to discover limits by hitting them.
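Emitting the headers is trivial; the sketch below just shows the conventional shape (these `X-RateLimit-*` names are a widespread convention rather than a formal standard, and `X-RateLimit-Reset` is most commonly a Unix epoch timestamp):

```python
def rate_limit_headers(limit, remaining, reset_epoch):
    """Build the conventional informational rate limit headers.
    `reset_epoch` is the Unix time at which the window resets."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),  # never negative
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
```

Attach these to every response, not just 429s, so clients can self-throttle before they ever hit the limit.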

The 429 Too Many Requests response should include a Retry-After header telling the client exactly when to try again. Without it, clients typically implement exponential backoff with jitter, which works but is slower than necessary.
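From the client side, that logic looks roughly like this sketch: honor `Retry-After` when the server provides it, and fall back to exponential backoff with full jitter otherwise (parameter names and defaults are mine):

```python
import random

def retry_delay(status, retry_after_header, attempt, base=0.5, cap=60.0):
    """Seconds to wait before retrying a failed request.
    Prefers the server's Retry-After (delta-seconds form); otherwise
    uses exponential backoff with full jitter, capped at `cap`."""
    if status == 429 and retry_after_header is not None:
        # The server told us exactly when capacity frees up.
        return float(retry_after_header)
    # Full jitter: uniform in [0, min(cap, base * 2^attempt)].
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The jitter matters: without it, every client that was rejected at the same moment retries at the same moment, recreating the original spike.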

Distributed rate limiting is where most implementations fall apart. If your API runs across multiple servers, you need shared state for rate limit counters. Redis is the standard choice—it’s fast enough for per-request lookups and supports atomic increment operations. But Redis failures need graceful handling. When your rate limiter’s data store is down, do you fail open (allow all traffic) or fail closed (block all traffic)? Neither is ideal, and the right answer depends on your risk tolerance.
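Whichever policy you choose, it's worth making it an explicit, named decision in code rather than an accident of exception handling. A minimal wrapper sketch (the wrapped limiter is any object with an `allow()` method that may raise on store failure):

```python
class GuardedLimiter:
    """Wraps a limiter whose backing store (e.g. Redis) may be down.
    fail_open=True admits traffic on store errors; fail_open=False
    blocks it. Which is right depends on your risk tolerance."""

    def __init__(self, limiter, fail_open=True):
        self.limiter = limiter
        self.fail_open = fail_open

    def allow(self, ident):
        try:
            return self.limiter.allow(ident)
        except (ConnectionError, TimeoutError):
            # Data store unreachable: apply the configured policy.
            # A real system would also emit a metric/alert here.
            return self.fail_open
```

Failing open risks letting abuse through during an outage; failing closed turns a Redis blip into a full API outage. Most consumer-facing APIs fail open and alert loudly.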

Redis doesn't ship a rate limiter as such, but its INCR and EXPIRE commands are the standard building blocks. A short Lua script can atomically increment a counter and set its expiry in a single round trip. For token bucket semantics, the CL.THROTTLE command from the redis-cell module does everything you need.
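A sketch of that Lua approach, assuming redis-py's `eval(script, numkeys, *keys_and_args)` interface (the key prefix and function name are mine). The script implements a fixed window: the first INCR in a window creates the key, which then expires after the window length:

```python
# Atomically increment a per-client counter and set its expiry in one
# round trip. KEYS[1] = counter key, ARGV[1] = window seconds,
# ARGV[2] = limit. Returns 1 if allowed, 0 if over the limit.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if count > tonumber(ARGV[2]) then
  return 0
end
return 1
"""

def allow_request(client, api_key, window_seconds=60, limit=100):
    """`client` is any redis-py-style object exposing eval().
    Returns True if the request is within the limit."""
    key = f"ratelimit:{api_key}"
    return client.eval(FIXED_WINDOW_LUA, 1, key, window_seconds, limit) == 1
```

Because the increment and the expiry-set happen inside one script, there's no race where a crash between the two commands leaves a counter that never expires.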

One pattern that’s underused is rate limiting by cost rather than request count. An API endpoint that returns a list of 10 items shouldn’t count the same as one that triggers an expensive computation. Assign weights to different endpoints and deduct from the token bucket accordingly. Shopify’s API uses this approach effectively.
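In token bucket terms, cost-based limiting just means deducting more than one token for heavy endpoints. A self-contained sketch (the endpoints and weights are invented; real values come from profiling what each endpoint actually costs you):

```python
import time

# Illustrative weights only; derive real ones from profiling.
ENDPOINT_COSTS = {
    "GET /items": 1,
    "POST /reports/generate": 25,  # triggers an expensive computation
}

class CostAwareBucket:
    """Token bucket that deducts an endpoint-specific cost per request."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock  # injectable for deterministic testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, endpoint):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        cost = ENDPOINT_COSTS.get(endpoint, 1)  # unknown endpoints cost 1
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The client-visible contract shifts from "N requests per minute" to "N units per minute", which is worth documenting explicitly so the weights don't surprise anyone.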

Don’t forget about webhook and callback rate limiting. If your API sends webhooks to customer endpoints, you need outbound rate limiting too. Overwhelming a customer’s webhook receiver is a quick way to get your callbacks blocked.

Testing rate limiting is harder than implementing it. You need load testing that simulates realistic traffic patterns—not just “hit the API as fast as possible.” Legitimate users don’t make perfectly uniform requests. They burst, pause, burst again. Your tests should reflect that.
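One cheap way to get burst-pause-burst traffic into a load test is to generate the request timestamps up front. The shape and parameters below are illustrative:

```python
import random

def bursty_schedule(n_bursts, burst_size, pause_range=(2.0, 10.0),
                    spacing=0.05):
    """Generate request timestamps (seconds from start) that burst,
    pause a random interval, then burst again, rather than arriving
    at a uniform rate. All parameters are illustrative."""
    t, times = 0.0, []
    for _ in range(n_bursts):
        # Requests within a burst arrive `spacing` seconds apart.
        for i in range(burst_size):
            times.append(t + i * spacing)
        # Then the client goes quiet for a while.
        t = times[-1] + random.uniform(*pause_range)
    return times
```

Replaying such a schedule against a fixed window limiter versus a token bucket makes their different burst behavior visible in a way a constant-rate test never will.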

The monitoring side deserves attention too. Track how often rate limits are hit, by which clients, and at what times. A spike in rate limit hits might indicate abuse, or it might indicate that your limits are too tight for legitimate usage. Either way, you want to know about it.

For most Australian SMBs building APIs, a token bucket implementation in Redis with per-API-key limits and proper response headers covers 90% of needs. Start simple, monitor the results, and add complexity only when the data tells you it’s necessary. Over-engineering rate limiting from day one is a common trap that delays launching the actual product.