Secure Multi-Tenant Rate Limiting Strategies
A single misbehaving tenant generating 50x their normal traffic will, in most systems, take down the rest of the service. The fact that you have a rate_limit_per_minute config column does not change this — most rate-limiting implementations have subtle correctness issues that surface exactly when they matter most. This post is about how to do it right: which algorithm to use, how to implement it correctly across a distributed fleet, and how to defend against the noisy-neighbor failure modes that make multi-tenancy operationally dangerous.
The Problem, Precisely Stated
In a multi-tenant API, “rate limiting” can mean any of several things:
- Abuse prevention. Stop a credential-stuffing attack or a runaway script from hammering your system.
- Quota enforcement. Each tenant has a contractual limit (10K requests/hour on the Pro plan); enforce it.
- Fairness. Prevent one tenant from consuming a disproportionate share of shared resources during a spike.
- Backpressure. When the system is approaching capacity, shed load fairly across tenants.
These have overlapping but distinct mechanisms. A single token bucket doesn’t solve all four well; a layered design does.
The Algorithms Worth Knowing
Five algorithms cover essentially all production rate limiting. Understanding their properties is the foundation.
Fixed Window
Count requests in a discrete window (e.g., per minute). Reset at the boundary.
    # One counter per tenant per minute; the minute number is part of the key.
    key = f"rl:{tenant_id}:{int(time.time() // 60)}"
    count = await redis.incr(key)
    if count == 1:
        # First hit in this window: expire the key shortly after the window ends.
        await redis.expire(key, 90)
    if count > limit:
        raise RateLimitExceeded

Pros: trivially simple, one Redis op per request, accurate count. Cons: the window boundary effect, where a tenant can do 2× the limit in 2 seconds (full burst at end of window, full burst at start of next). For coarse abuse limits this is fine; for tight SLA enforcement it is not.
Sliding Window Log
Store a sorted set of request timestamps. Drop entries older than the window; count what remains.
Pros: precise. Cons: O(N) storage per tenant, where N is the number of requests inside the window; expensive at high rates.
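The log approach can be sketched in a few lines. This is an in-memory, single-key illustration rather than a Redis implementation (a sorted set would play the role of the deque); the class name `SlidingWindowLog` and the choice not to record rejected requests are assumptions made for the example.

```python
from collections import deque


class SlidingWindowLog:
    """In-memory sliding window log for one key (illustrative only)."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self, now: float) -> bool:
        # Drop entries that have aged out of the window.
        cutoff = now - self.window
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        # Reject without recording if the window is already full.
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```

The storage cost is visible directly: the deque holds one entry per accepted request in the window.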
Sliding Window Counter
Combine two adjacent fixed-window counts, weighted by where the current time is in the window.
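    effective_count = current_window_count + previous_window_count * fraction_of_window_remaining

Here fraction_of_window_remaining is the share of the current window that has not yet elapsed, which is also the share of the previous window still covered by the sliding window. Approximate but efficient: one or two Redis ops, no per-request log. Good middle ground for most production needs.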
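As a worked instance of the weighting, here is a small sketch. The function name and the assumption that windows are aligned to multiples of `window_seconds` are illustrative, not part of the original design.

```python
def sliding_window_estimate(current_count: int, previous_count: int,
                            now: float, window_seconds: float) -> float:
    """Estimate requests in the trailing window from two fixed-window counts."""
    # Fraction of the current fixed window that has already elapsed.
    elapsed_fraction = (now % window_seconds) / window_seconds
    # The previous window contributes only the part still inside the sliding window.
    return current_count + previous_count * (1.0 - elapsed_fraction)
```

For example, 15 seconds into a 60-second window with 10 requests so far and 100 in the previous window, the estimate is 10 + 100 × 0.75 = 85.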
Token Bucket
A bucket with a fixed capacity refills at a fixed rate. Each request consumes one token; reject when the bucket is empty.
Two parameters: rate (tokens per second) and burst (capacity). The bucket allows short bursts up to capacity while maintaining a long-term average rate.
This is the algorithm most production systems should default to. It models real traffic behavior (steady state plus occasional spikes), naturally supports variable-cost requests (an expensive call costs more tokens), and is easy to implement atomically.
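Before moving to Redis, the mechanics can be seen in a minimal in-process sketch. The class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Single-process token bucket (illustrative sketch, not thread-safe)."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.updated_at = now

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily, based on time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        # Never move the clock backwards, so skew cannot over-credit later.
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=3`, three requests succeed immediately, the fourth is rejected, and two seconds later two tokens are available again.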
Leaky Bucket
Equivalent to a queue with a fixed drain rate: new requests are added to the queue, processed at the drain rate, and dropped when the queue is full. Used purely as a meter (admit or reject, no queuing), leaky bucket and token bucket are duals: same semantics, different framing. Used as a queue, leaky bucket additionally smooths the output.
The case for leaky bucket: it shapes the output rate strictly, useful when downstream systems can’t absorb bursts. The case for token bucket: it preserves burst capacity, friendlier to bursty client patterns.
For most APIs, token bucket is the right default.
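For contrast, the queue form of leaky bucket might look like the sketch below. The names are hypothetical, and it assumes `drain` is called frequently (for example from a timer); a caller that drains rarely would see several items released in one call.

```python
from collections import deque


class LeakyBucketQueue:
    """Queue-style leaky bucket: bounded queue, fixed release interval."""

    def __init__(self, drain_rate: float, capacity: int, now: float) -> None:
        self.interval = 1.0 / drain_rate  # seconds between releases
        self.capacity = capacity          # maximum queued items
        self.queue = deque()
        self.next_release = now

    def offer(self, item, now: float) -> bool:
        if not self.queue:
            # Idle time must not bank release slots, or new arrivals would burst.
            self.next_release = max(self.next_release, now)
        if len(self.queue) >= self.capacity:
            return False  # bucket full: drop
        self.queue.append(item)
        return True

    def drain(self, now: float) -> list:
        # Release every item whose scheduled slot has arrived.
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.interval
        return released
```

Unlike the token bucket, a burst of arrivals here leaves one at a time, which is the output-shaping property described above.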
Implementing Token Bucket in Redis Correctly
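A naive read-modify-write implementation (read the token count, compute, write it back) races: two concurrent requests can read the same count and both be admitted. The correct version uses a Lua script for atomicity: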
    -- KEYS[1] = bucket key
    -- ARGV[1] = now (ms)
    -- ARGV[2] = rate (tokens/sec)
    -- ARGV[3] = capacity
    -- ARGV[4] = cost of this request (usually 1)

    local now = tonumber(ARGV[1])
    local rate = tonumber(ARGV[2])
    local capacity = tonumber(ARGV[3])
    local cost = tonumber(ARGV[4])

    local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'updated_at')
    local tokens = tonumber(bucket[1])
    local updated_at = tonumber(bucket[2])

    -- First request for this key: start with a full bucket.
    if tokens == nil then
      tokens = capacity
      updated_at = now
    end

    -- Refill for the time elapsed since the last update (now is in ms).
    local elapsed = math.max(0, now - updated_at)
    tokens = math.min(capacity, tokens + elapsed * rate / 1000)

    local allowed = 0
    if tokens >= cost then
      tokens = tokens - cost
      allowed = 1
    end

    redis.call('HMSET', KEYS[1], 'tokens', tokens, 'updated_at', now)
    -- Expire idle buckets: time to refill from empty, plus a minute of slack.
    redis.call('PEXPIRE', KEYS[1], math.ceil(capacity / rate * 1000) + 60000)

    return { allowed, tokens }

This script is atomic, refills on read, supports variable-cost requests, and bounds memory via the TTL. Returning the current token count lets you emit standard rate-limit headers (X-RateLimit-Remaining, X-RateLimit-Reset). Note that Redis truncates a Lua number to an integer when converting it to a reply, so the caller sees the whole number of tokens remaining.
The application call:
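    allowed, remaining = await redis.evalsha(
        SCRIPT_SHA,
        1,
        f"rl:{tenant_id}:{endpoint}",
        int(time.time() * 1000),
        rate_per_second,
        burst_capacity,
        cost,
    )
    if not allowed:
        # Wait only for the shortfall to refill, not for the full cost.
        raise RateLimitExceeded(retry_after=(cost - remaining) / rate_per_second)

Use SCRIPT LOAD once at startup; reference by SHA on every call to avoid the per-call upload overhead. Handle the NOSCRIPT error, which Redis returns after a restart or SCRIPT FLUSH, by reloading the script and retrying.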
Where to Enforce
Rate limiting can happen at several layers. Each has different properties:
- Edge (CDN / WAF). Cloudflare, AWS WAF, CloudFront — protects against volumetric attacks before they hit your infrastructure. Limited per-tenant awareness because the edge often can’t see authenticated identity.
- API Gateway. Kong, Envoy, AWS API Gateway — sees the request after auth. Can enforce per-key or per-tenant limits. Often the cleanest place for quota enforcement.
- Application. Inside the service, after parsing the request and resolving the tenant. Most flexible (can rate-limit by tenant + endpoint + cost class), but adds per-request latency.
- Database / queue. Some downstream resources need their own rate limits (e.g., expensive search queries).
The defensible pattern is layered: edge first, then gateway, then application.
Each layer catches what fits its visibility. The edge layer blocks volumetric attacks; the gateway enforces contractual quotas; the application enforces fairness and per-feature limits.
Per-Tenant Limit Configuration
Hard-coded rate limits are wrong. Tenants on different plans need different limits; specific tenants (large customers, beta features) need overrides.
A configuration model that works:
    plans:
      free:       { requests_per_minute: 60,   burst: 100 }
      pro:        { requests_per_minute: 600,  burst: 1000 }
      enterprise: { requests_per_minute: 6000, burst: 10000 }

    overrides:
      - tenant_id: "acme-corp"
        rate_per_second: 200
        burst: 500
        reason: "contract addendum 2025-01"
        expires_at: "2026-12-31"

Load into Redis or a config service; cache in-process with a short TTL; look up on each request via the tenant context. Plans here are expressed per minute and the override per second, so normalize to one unit when loading. Per-endpoint overrides for expensive operations (e.g., bulk export) layer on top.
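One way the lookup could work is sketched below, using the plan and override values from the configuration above. The function name `resolve_limit`, the dictionary layout, and the rule that an override is valid through its `expires_at` date are assumptions for illustration.

```python
from datetime import date

# Plan values copied from the configuration above.
PLANS = {
    "free": {"requests_per_minute": 60, "burst": 100},
    "pro": {"requests_per_minute": 600, "burst": 1000},
    "enterprise": {"requests_per_minute": 6000, "burst": 10000},
}

OVERRIDES = [
    {"tenant_id": "acme-corp", "rate_per_second": 200, "burst": 500,
     "reason": "contract addendum 2025-01", "expires_at": "2026-12-31"},
]


def resolve_limit(tenant_id: str, plan: str, today: date) -> dict:
    """Return the effective limit, normalized to tokens per second."""
    for override in OVERRIDES:
        not_expired = date.fromisoformat(override["expires_at"]) >= today
        if override["tenant_id"] == tenant_id and not_expired:
            return {"rate_per_second": float(override["rate_per_second"]),
                    "burst": override["burst"]}
    base = PLANS[plan]
    # Plans are configured per minute; the limiter works per second.
    return {"rate_per_second": base["requests_per_minute"] / 60.0,
            "burst": base["burst"]}
```

Once the override expires, the tenant silently falls back to its plan limit, which is usually the behavior a contract addendum implies.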
Variable-Cost Requests
Not all requests cost the same. A bulk-export endpoint hitting the database for a minute costs vastly more than a GET /users/me. A correctly-designed rate limiter charges by cost, not by count.
    cost_table = {
        "GET /users/me": 1,
        "POST /search": 5,
        "POST /exports": 50,
        "POST /llm/generate": 10,  # plus dynamic charge based on tokens
    }

The token bucket script accepts an arbitrary cost; the application charges based on the endpoint and any dynamic factors (output size, token count for LLM endpoints, file size for uploads). For LLM endpoints specifically, charge for the actual tokens used after the response completes, because a pre-charge can only guess at the true cost.
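A sketch of how the two charges might be computed follows. The helper names, the default cost of 1 for unlisted endpoints, and the rate of one bucket token per 1,000 LLM tokens are all hypothetical choices, not values from the source.

```python
import math

COST_TABLE = {
    "GET /users/me": 1,
    "POST /search": 5,
    "POST /exports": 50,
    "POST /llm/generate": 10,
}


def base_cost(method: str, path: str) -> int:
    """Up-front charge for an endpoint; unlisted endpoints cost 1 (assumed)."""
    return COST_TABLE.get(f"{method} {path}", 1)


def llm_post_charge(tokens_used: int, tokens_per_cost_unit: int = 1000) -> int:
    """Extra charge applied after the response, based on actual LLM tokens."""
    return math.ceil(tokens_used / tokens_per_cost_unit)
```

A post-charge has to be deducted unconditionally, since the work is already done. The bucket script shown earlier rejects when tokens are short, so post-charging would need a variant that lets the balance go to zero or negative.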
Multiple Limits per Tenant
Real systems enforce several limits in parallel:
- Burst limit — short window, prevents instantaneous floods.
- Sustained limit — longer window, enforces the contract.
- Daily quota — coarse limit for billing/contract purposes.
- Per-endpoint limits — specific resources with their own throttles.
A request must pass all applicable limits. The check is cheap because each limit is one Redis op. Decide the failure behavior deliberately: either fail closed (reject if Redis is unreachable) or fail open with a fallback in-process limit.
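One subtlety when checking limits in sequence is that a rejection by a later limit should not leave tokens spent on earlier ones. The in-memory sketch below checks everything first and deducts only if all limits pass; the `Bucket` dataclass and `check_all` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    name: str
    rate: float        # tokens per second
    capacity: float
    tokens: float
    updated_at: float

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = max(self.updated_at, now)


def check_all(buckets: list, now: float, cost: float = 1.0) -> tuple:
    """Return (allowed, name_of_blocking_limit_or_None)."""
    for bucket in buckets:
        bucket.refill(now)
    # Check every limit before spending anything.
    for bucket in buckets:
        if bucket.tokens < cost:
            return False, bucket.name
    # All limits passed: commit the deduction everywhere.
    for bucket in buckets:
        bucket.tokens -= cost
    return True, None
```

In Redis the same check-then-commit can live in a single Lua script over several keys. In Redis Cluster, all keys used by one script must map to the same hash slot, which a hash tag such as `{tenant_id}` in the key can arrange.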
Response Semantics
A rate-limited response should be unambiguous and actionable. The standard:
    HTTP/1.1 429 Too Many Requests
    Retry-After: 30
    X-RateLimit-Limit: 600
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1715600400
    Content-Type: application/problem+json

    {
      "type": "https://example.com/errors/rate-limited",
      "title": "Rate limit exceeded",
      "detail": "Sustained request limit of 600/min exceeded for tenant acme-corp",
      "retry_after_seconds": 30
    }

The Retry-After header tells well-behaved clients when to back off. The X-RateLimit-* headers give them runway awareness so they can pace themselves. Returning a JSON body with a Problem Details envelope (RFC 9457) makes the failure programmatically meaningful.
Emit the rate-limit headers on every response, not just on 429s. Clients use them for adaptive pacing; emitting them only on rejection is too late.
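A helper along these lines could build the headers from the token bucket state on every response. The function name is hypothetical, and it assumes X-RateLimit-Reset means the epoch second at which the bucket is full again; services differ on that definition.

```python
import math


def rate_limit_headers(limit: int, remaining: float, rate_per_second: float,
                       capacity: float, now: float, allowed: bool,
                       cost: float = 1.0) -> dict:
    """Build rate-limit headers from token bucket state (illustrative)."""
    # Assumed meaning of Reset: when the bucket will be completely refilled.
    seconds_until_full = (capacity - remaining) / rate_per_second
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, math.floor(remaining))),
        "X-RateLimit-Reset": str(math.ceil(now + seconds_until_full)),
    }
    if not allowed:
        # Time for the shortfall to refill, rounded up to at least one second.
        wait = math.ceil((cost - remaining) / rate_per_second)
        headers["Retry-After"] = str(max(1, wait))
    return headers
```

Retry-After is attached only to rejections, while the X-RateLimit-* values accompany successes as well.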
Distributed Failure Modes
A rate limiter backed by Redis becomes a hard dependency. What happens when Redis is degraded or unreachable?
Three strategies, in order of risk tolerance:
- Fail closed. Reject all requests when the limiter is down. Safe but produces self-inflicted outages.
- Fail open. Allow all requests when the limiter is down. Risky in the face of abuse; defensible if the limiter is supplementary (e.g., a WAF is the real defense).
- Fall back to in-process limit. Each instance maintains a coarse-grained in-memory limit as a backstop; activate when Redis is unavailable. Better than either extreme.
The right answer depends on the workload. For abuse prevention, fail open is dangerous. For SLA-driven quota, brief fail-open is acceptable.
Also consider: a Redis cluster failover takes seconds. Set timeouts on the limiter calls (50–100ms) and have a clear fallback path; you do not want the limiter to add latency to every request when Redis is healthy and outages when it isn’t.
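The timeout-plus-fallback path could be wired up roughly as follows. `remote_check` and `local_allow` are hypothetical callables standing in for the Redis-backed limiter and the coarse in-process backstop. The sketch catches only built-in exception types; a real Redis client raises its own exception classes, which would need to be added to the `except` clause.

```python
import asyncio


async def check_with_fallback(remote_check, local_allow,
                              timeout_s: float = 0.05) -> tuple:
    """Return (allowed, source), where source is 'remote' or 'local'."""
    try:
        # Bound the time spent waiting on the shared limiter.
        allowed = await asyncio.wait_for(remote_check(), timeout=timeout_s)
        return allowed, "remote"
    except (asyncio.TimeoutError, OSError):
        # Limiter slow or unreachable: use the per-instance backstop.
        # Assumption: the real client's own error types are also handled here.
        return local_allow(), "local"
```

Returning the source alongside the decision makes it straightforward to count fallback invocations as a metric.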
Anti-Abuse Patterns Beyond Rate Limiting
Rate limiting is necessary but not sufficient for production anti-abuse:
- Per-IP limits in addition to per-tenant. A compromised credential generating absurd traffic from one source is suspicious regardless of the tenant’s limit.
- Anomaly detection. A tenant that historically does 100 RPM suddenly doing 10K RPM is suspicious even if “Enterprise plan” allows it. Alert and/or auto-throttle.
- Per-resource limits. A tenant has 10K user records; a request asking for all of them is suspicious. Enforce maximum response sizes and pagination requirements.
- Captcha and JS challenges for unauthenticated endpoints. Edge providers (Cloudflare, Akamai) handle this; rolling your own is rarely worthwhile.
- Per-credential and per-API-key limits even within a tenant. A leaked API key shouldn’t burn the tenant’s entire budget.
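The anomaly-detection idea above can start very simply: keep a smoothed baseline per tenant and flag traffic that exceeds it by a large factor. The function names, the 10x factor, the 50 RPM floor, and the smoothing constant below are illustrative assumptions, not recommended values.

```python
def update_baseline(baseline_rpm: float, observed_rpm: float,
                    alpha: float = 0.1) -> float:
    """Exponentially weighted moving average of a tenant's request rate."""
    return (1.0 - alpha) * baseline_rpm + alpha * observed_rpm


def is_anomalous(current_rpm: float, baseline_rpm: float,
                 factor: float = 10.0, floor_rpm: float = 50.0) -> bool:
    """Flag traffic far above the tenant's own history.

    The floor keeps tiny tenants from tripping the check on trivial volume.
    """
    return current_rpm > max(floor_rpm, baseline_rpm * factor)
```

With these defaults, a tenant with a 100 RPM baseline is flagged at 10K RPM but not at 300 RPM.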
Observability for Rate Limiting
Metrics worth tracking:
- Limited request rate per tenant per endpoint. Spikes indicate either an attack or a misconfigured limit.
- Top tenants by request rate. Used for capacity planning and detecting noisy neighbors.
- Bucket fullness distribution. Tenants whose buckets sit consistently near empty are running close to their limit and may need a higher tier.
- Limiter latency. Sub-millisecond for healthy Redis; spikes are a leading indicator of Redis problems.
- Fail-open invocations. Should be near zero in steady state; non-zero means the limiter is degraded.
Edge Cases
A few patterns that bite:
- Clock skew across application instances. If different instances disagree about “now,” the Lua script’s elapsed-time math is wrong. Use the Redis server’s TIME command if precision matters, or accept that NTP-synced application clocks are close enough.
- Token bucket with very low rates. A rate of 0.01 tokens/sec (one token per 100s) accumulates fractional tokens; floating-point arithmetic in Lua can produce surprising rounding. Use integer math where possible, or cap the precision deliberately.
- Hot-key contention. A single tenant making 100K requests/sec hits the same Redis key. The single shard handling that key becomes the bottleneck. Mitigate with client-side aggregation (batch token requests in the application, deduct N tokens per batch) or shard the key (tenant:1234:shard:N with N = hash(request_id) % 4), giving each shard its share of the tenant's limit.
- Multi-region deployments. Rate limits per region or global? Per-region is cheap; global requires either cross-region Redis (slow) or eventual-consistency limits (loose). Document the choice explicitly.
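For the hot-key case, key sharding might be sketched like this. The helper names are hypothetical. CRC32 is used instead of Python's built-in `hash()`, because string hashes are randomized per process by default and instances would otherwise disagree on the shard.

```python
import zlib


def shard_key(tenant_id: str, request_id: str, shards: int = 4) -> str:
    """Pick one of several bucket keys for a tenant, stable across processes."""
    n = zlib.crc32(request_id.encode("utf-8")) % shards
    return f"rl:{tenant_id}:shard:{n}"


def per_shard_limit(rate: float, capacity: float, shards: int = 4) -> tuple:
    """Split a tenant's rate and burst evenly across the shard buckets."""
    return rate / shards, capacity / shards
```

The trade-off is precision: a request can be rejected by an empty shard while another shard still has tokens, so the effective limit becomes approximate.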
Closing
Multi-tenant rate limiting is one of those features that looks like a one-day project and is actually a piece of infrastructure with edge cases for years. The mechanics are straightforward — token bucket in a Lua script in Redis, layered enforcement from edge to application, variable-cost charging, parallel limits at different windows, standard response semantics. The discipline is harder: configuring per-tenant limits that match the contract, charging the right cost per request, handling Redis failures deliberately, instrumenting limits as first-class metrics, and adding anti-abuse layers beyond rate limiting alone. Get the algorithm right and the per-tenant fairness story holds at any scale. Get it wrong and the day a single tenant misbehaves, you find out the limiter you thought you had wasn’t actually limiting much of anything.