Secure Multi-Tenant Rate Limiting Strategies

A single misbehaving tenant generating 50x their normal traffic will, in most systems, take down the rest of the service. The fact that you have a rate_limit_per_minute config column does not change this — most rate-limiting implementations have subtle correctness issues that surface exactly when they matter most. This post is about how to do it right: which algorithm to use, how to implement it correctly across a distributed fleet, and how to defend against the noisy-neighbor failure modes that make multi-tenancy operationally dangerous.

The Problem, Precisely Stated

In a multi-tenant API, “rate limiting” can mean any of several things:

  • Abuse prevention. Stop a credential-stuffing attack or a runaway script from hammering your system.
  • Quota enforcement. Each tenant has a contractual limit (10K requests/hour on the Pro plan); enforce it.
  • Fairness. Prevent one tenant from consuming a disproportionate share of shared resources during a spike.
  • Backpressure. When the system is approaching capacity, shed load fairly across tenants.

These have overlapping but distinct mechanisms. A single token bucket doesn’t solve all four well; a layered design does.

The Algorithms Worth Knowing

Five algorithms cover essentially all production rate limiting. Understanding their properties is the foundation.

Fixed Window

Count requests in a discrete window (e.g., per minute). Reset at the boundary.

import time
# One counter per tenant per one-minute window; the TTL cleans up old windows.
key = f"rl:{tenant_id}:{int(time.time() // 60)}"
count = await redis.incr(key)
if count == 1:
    await redis.expire(key, 90)
if count > limit:
    raise RateLimitExceeded

Pros: trivially simple, one Redis op per request, accurate count. Cons: the window boundary effect — a tenant can do 2× the limit in 2 seconds (full burst at end of window, full burst at start of next). For coarse abuse limits this is fine; for tight SLA enforcement it is not.
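To make the boundary effect concrete, here is a minimal in-memory sketch. It does not use Redis: the class name `FixedWindowCounter` and the hand-picked timestamps are illustrative assumptions, chosen only to mirror the keying scheme above.

```python
from collections import defaultdict


class FixedWindowCounter:
    """In-memory stand-in for the Redis INCR pattern (illustrative only)."""

    def __init__(self, limit: int, window_seconds: int = 60) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)

    def allow(self, tenant_id: str, now: float) -> bool:
        # Same bucketing as the Redis key: one counter per tenant per window.
        window = int(now // self.window_seconds)
        self.counts[(tenant_id, window)] += 1
        return self.counts[(tenant_id, window)] <= self.limit


limiter = FixedWindowCounter(limit=100)
# 100 requests at the end of one window, 100 more at the start of the next.
late = sum(limiter.allow("t1", 59.5) for _ in range(100))
early = sum(limiter.allow("t1", 60.5) for _ in range(100))
print(late + early)  # 200 requests admitted within about one second
```

Both bursts are admitted in full because each lands in a different window, even though they are only one second apart.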

Sliding Window Log

Store a sorted set of request timestamps. Drop entries older than the window; count what remains.

Pros: precise. Cons: O(N) storage per tenant, where N is the number of requests in the window; expensive at high rates.
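A minimal single-key sketch of the idea, kept in process memory. In Redis the log would typically live in a sorted set; here a `deque` stands in for it, and the class name `SlidingWindowLog` is an assumption made for illustration.

```python
from collections import deque


class SlidingWindowLog:
    """Exact sliding-window limiter for one key, held in process memory."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop entries that have aged out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False  # rejected requests are not recorded
        self.timestamps.append(now)
        return True
```

This sketch records only admitted requests. Recording rejected ones as well is a stricter variant that keeps a persistently abusive client locked out for longer.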

Sliding Window Counter

Combine two adjacent fixed-window counts, weighted by where the current time is in the window.

effective_count = (
    current_window_count
    + previous_window_count * fraction_of_current_window_remaining
)

Approximate but efficient: one or two Redis ops, no per-request log. Good middle ground for most production needs.
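As a worked example of the weighting, the sketch below computes the estimate for one point in time. The function name and the 60-second default are assumptions; the two counts would come from the two fixed-window counters.

```python
def sliding_window_estimate(
    current_count: int,
    previous_count: int,
    now: float,
    window_seconds: float = 60.0,
) -> float:
    """Weighted request estimate for the window ending at `now`."""
    elapsed_fraction = (now % window_seconds) / window_seconds
    # The sliding window still overlaps (1 - elapsed_fraction) of the previous window.
    return current_count + previous_count * (1.0 - elapsed_fraction)


# 15 s into the current minute: 75% of the previous window still counts.
estimate = sliding_window_estimate(current_count=20, previous_count=80, now=75.0)
print(estimate)  # 80.0
```

The approximation assumes requests in the previous window were evenly spread, which is why the result is an estimate rather than an exact count.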

Token Bucket

A bucket with a fixed capacity refills at a fixed rate. Each request consumes one token; reject when the bucket is empty.

Two parameters: rate (tokens per second) and burst (capacity). The bucket allows short bursts up to capacity while maintaining a long-term average rate.

This is the algorithm most production systems should default to. It models real traffic behavior (steady state plus occasional spikes), naturally supports variable-cost requests (an expensive call costs more tokens), and is easy to implement atomically.
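Before moving to a distributed version, it can help to see the mechanics in a single process. The following is a minimal sketch, not a production limiter: the class name `TokenBucket` and the injected `now` argument (seconds) are assumptions made so the behavior is easy to follow.

```python
class TokenBucket:
    """Single-process token bucket with lazy refill."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens held (burst size)
        self.tokens = capacity      # start full
        self.updated_at = now

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = max(now, self.updated_at)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=10, capacity=20)
burst = sum(bucket.try_consume(now=0.0) for _ in range(25))
print(burst)                                # 20: the burst is capped at capacity
print(bucket.try_consume(now=0.5, cost=5))  # True: 0.5 s refilled 5 tokens
```

The burst of 25 is cut off at the capacity of 20, and half a second later exactly enough tokens have refilled to cover a request that costs 5.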

Leaky Bucket

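Equivalent to a queue with a fixed drain rate: new requests join the queue, are processed at the drain rate, and are dropped when the queue is full. Used purely as a meter (admit or reject, no queueing), leaky bucket and token bucket are duals: same semantics, different framing. Used as a queue, it behaves differently, because it delays requests instead of admitting a burst at once.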

The case for leaky bucket: it shapes the output rate strictly, useful when downstream systems can’t absorb bursts. The case for token bucket: it preserves burst capacity, friendlier to bursty client patterns.
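The meter form of the leaky bucket can be sketched in a few lines. This is an illustration under assumed names (`LeakyBucketMeter`, `try_add`), not a queueing implementation: it tracks a fill level that drains over time, which is the mirror image of counting remaining tokens.

```python
class LeakyBucketMeter:
    """Leaky bucket used as a meter: the level drains at `leak_rate` per second."""

    def __init__(self, leak_rate: float, capacity: float) -> None:
        self.leak_rate = leak_rate
        self.capacity = capacity
        self.level = 0.0        # starts empty
        self.updated_at = 0.0

    def try_add(self, now: float, cost: float = 1.0) -> bool:
        # Drain the bucket for the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated_at)
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.updated_at = max(now, self.updated_at)
        if self.level + cost > self.capacity:
            return False  # adding this request would overflow the bucket
        self.level += cost
        return True
```

At any moment `level` equals `capacity - tokens` for a token bucket with the same rate and capacity, which is the sense in which the two are duals. Strict output shaping needs the queue form, where admitted requests wait for the drain.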

For most APIs, token bucket is the right default.

Implementing Token Bucket in Redis Correctly

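A naive implementation that reads the token count, computes the refill, and writes the result back as separate commands has a race condition: two concurrent requests can both read the same count and both be admitted. The correct version uses a Lua script for atomicity: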

-- KEYS[1] = bucket key
-- ARGV[1] = now (ms)
-- ARGV[2] = rate (tokens/sec)
-- ARGV[3] = capacity
-- ARGV[4] = cost of this request (usually 1)
local now = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'updated_at')
local tokens = tonumber(bucket[1])
local updated_at = tonumber(bucket[2])
if tokens == nil then
  -- First request for this key: start with a full bucket.
  tokens = capacity
  updated_at = now
end

-- Refill for the elapsed time. Never move updated_at backwards, so a caller
-- with a lagging clock cannot cause the same interval to be refilled twice.
local elapsed = math.max(0, now - updated_at)
tokens = math.min(capacity, tokens + elapsed * rate / 1000)
updated_at = math.max(now, updated_at)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'updated_at', updated_at)
-- Expire idle buckets: time to refill completely, plus a one-minute margin.
redis.call('PEXPIRE', KEYS[1], math.ceil(capacity / rate * 1000) + 60000)
return { allowed, tokens }

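This script is atomic, refills on read, supports variable-cost requests, and bounds memory via the TTL. Returning the current token count lets you emit the conventional rate-limit headers (X-RateLimit-Remaining, X-RateLimit-Reset). Note that Redis converts Lua numbers in a script reply to integers, so the returned token count is truncated to a whole number.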

The application call:

allowed, remaining = await redis.evalsha(
    SCRIPT_SHA,
    1,  # number of keys
    f"rl:{tenant_id}:{endpoint}",
    int(time.time() * 1000),
    rate_per_second,
    burst_capacity,
    cost,
)
if not allowed:
    # Wait only for the tokens still missing, not for the full cost.
    raise RateLimitExceeded(retry_after=max(0, cost - remaining) / rate_per_second)

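Use SCRIPT LOAD once at startup and reference the script by SHA on every call to avoid the per-call upload overhead. A Redis restart or SCRIPT FLUSH empties the script cache, so handle the NOSCRIPT error by reloading the script and retrying (redis-py’s register_script helper does this for you).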

Where to Enforce

Rate limiting can happen at several layers. Each has different properties:

  • Edge (CDN / WAF). Cloudflare, AWS WAF, CloudFront — protects against volumetric attacks before they hit your infrastructure. Limited per-tenant awareness because the edge often can’t see authenticated identity.
  • API Gateway. Kong, Envoy, AWS API Gateway — sees the request after auth. Can enforce per-key or per-tenant limits. Often the cleanest place for quota enforcement.
  • Application. Inside the service, after parsing the request and resolving the tenant. Most flexible (can rate-limit by tenant + endpoint + cost class), but adds per-request latency.
  • Database / queue. Some downstream resources need their own rate limits (e.g., expensive search queries).

The defensible pattern is layered: edge first, then gateway, then application.

Each layer catches what fits its visibility. The edge layer blocks volumetric attacks; the gateway enforces contractual quotas; the application enforces fairness and per-feature limits.

Per-Tenant Limit Configuration

Hard-coded rate limits are wrong. Tenants on different plans need different limits; specific tenants (large customers, beta features) need overrides.

A configuration model that works:

plans:
  free:       { requests_per_minute: 60,   burst: 100 }
  pro:        { requests_per_minute: 600,  burst: 1000 }
  enterprise: { requests_per_minute: 6000, burst: 10000 }
overrides:
  - tenant_id: "acme-corp"
    rate_per_second: 200   # note the unit: per second, unlike the per-minute plan values
    burst: 500
    reason: "contract addendum 2025-01"
    expires_at: "2026-12-31"

Load into Redis or a config service; cache in-process with a short TTL; look up on each request via the tenant context. Per-endpoint overrides for expensive operations (e.g., bulk export) layer on top.
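The lookup itself can be sketched as follows. The dictionaries mirror the configuration above, while the function name `resolve_limit` and the choice to pass in `today` explicitly are assumptions for illustration. The sketch normalizes the mixed per-minute and per-second units to tokens per second, since that is what the token bucket script expects.

```python
from datetime import date

PLANS = {
    "free": {"requests_per_minute": 60, "burst": 100},
    "pro": {"requests_per_minute": 600, "burst": 1000},
    "enterprise": {"requests_per_minute": 6000, "burst": 10000},
}
OVERRIDES = {
    "acme-corp": {
        "rate_per_second": 200,
        "burst": 500,
        "expires_at": date(2026, 12, 31),
    },
}


def resolve_limit(tenant_id: str, plan: str, today: date):
    """Return (tokens_per_second, burst_capacity) for a tenant."""
    override = OVERRIDES.get(tenant_id)
    # An override applies only until its expiry date; after that the plan wins.
    if override is not None and today <= override["expires_at"]:
        return float(override["rate_per_second"]), override["burst"]
    base = PLANS[plan]
    return base["requests_per_minute"] / 60.0, base["burst"]


print(resolve_limit("acme-corp", "enterprise", date(2026, 6, 1)))  # (200.0, 500)
print(resolve_limit("acme-corp", "enterprise", date(2027, 1, 1)))  # (100.0, 10000)
```

Checking the expiry at lookup time means a lapsed override stops applying on its own, without anyone having to remember to delete it.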

Variable-Cost Requests

Not all requests cost the same. A bulk-export endpoint hitting the database for a minute costs vastly more than a GET /users/me. A correctly-designed rate limiter charges by cost, not by count.

cost_table = {
    "GET /users/me": 1,
    "POST /search": 5,
    "POST /exports": 50,
    "POST /llm/generate": 10,  # plus dynamic charge based on tokens
}

The token bucket script accepts an arbitrary cost; the application charges based on the endpoint and any dynamic factors (output size, token count for LLM endpoints, file size for uploads). For LLM endpoints specifically, the output length is unknown at admission time, so a pre-charge alone underestimates the true cost: charge a base cost up front, then post-charge for the actual tokens used once the response completes.
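A sketch of the two charging steps, reusing the cost table above. The default cost for unlisted endpoints and the conversion of 1,000 LLM tokens to one limiter token are assumptions picked for the example, not values from any real pricing model.

```python
COST_TABLE = {
    "GET /users/me": 1,
    "POST /search": 5,
    "POST /exports": 50,
    "POST /llm/generate": 10,
}
DEFAULT_COST = 1            # assumption: unlisted endpoints cost one token
LLM_TOKENS_PER_UNIT = 1000  # assumption: 1,000 LLM tokens = one limiter token


def pre_charge(endpoint: str) -> int:
    """Cost charged at admission, before the request runs."""
    return COST_TABLE.get(endpoint, DEFAULT_COST)


def post_charge(llm_tokens_used: int) -> int:
    """Extra cost charged after an LLM response, rounded up."""
    return -(-llm_tokens_used // LLM_TOKENS_PER_UNIT)  # ceiling division


print(pre_charge("POST /llm/generate"))  # 10
print(post_charge(2500))                 # 3
```

One design point to settle: the admission script rejects when the bucket cannot cover the cost, so a post-charge needs a separate deduction path that is allowed to drive the balance to zero or below instead of rejecting a request that has already been served.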

Multiple Limits per Tenant

Real systems enforce several limits in parallel:

  • Burst limit — short window, prevents instantaneous floods.
  • Sustained limit — longer window, enforces the contract.
  • Daily quota — coarse limit for billing/contract purposes.
  • Per-endpoint limits — specific resources with their own throttles.

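A request must pass all applicable limits. Each check is a single Redis operation, so the combined check stays cheap; where possible, evaluate the limits together so that a request rejected by one limit is not charged against the others. Decide the failure behavior deliberately: fail closed (reject if Redis is unreachable), or fail open with a fallback in-process limit.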
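The "pass all limits" rule has one subtlety worth sketching: charge nothing unless every limit can afford the request. The example below is an in-memory illustration with an assumed function name (`check_all`) and plain dictionaries in place of Redis buckets.

```python
from typing import Dict, Optional, Tuple


def check_all(
    buckets: Dict[str, float], costs: Dict[str, float]
) -> Tuple[bool, Optional[str]]:
    """Charge every limit only if all of them can afford the request.

    `buckets` maps a limit name to its remaining tokens (already refilled).
    Returns (allowed, name_of_first_exhausted_limit).
    """
    # First pass: look for any limit that cannot cover the cost.
    for name, cost in costs.items():
        if buckets[name] < cost:
            return False, name  # reject without charging any limit
    # Second pass: all limits passed, so deduct from each.
    for name, cost in costs.items():
        buckets[name] -= cost
    return True, None


buckets = {"burst": 5, "sustained": 100, "daily": 0}
print(check_all(buckets, {"burst": 1, "sustained": 1, "daily": 1}))  # (False, 'daily')
print(buckets["burst"])  # 5: nothing was charged on rejection
```

Returning the name of the exhausted limit also makes it possible to tell the client which limit it hit.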

Response Semantics

A rate-limited response should be unambiguous and actionable. The standard:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1715600400
Content-Type: application/problem+json

{
  "type": "https://example.com/errors/rate-limited",
  "title": "Rate limit exceeded",
  "detail": "Sustained request limit of 600/min exceeded for tenant acme-corp",
  "retry_after_seconds": 30
}

The Retry-After header tells well-behaved clients when to back off. The X-RateLimit-* headers give them runway awareness so they can pace themselves. Returning a JSON body with a Problem Details envelope (RFC 9457) makes the failure programmatically meaningful.

Emit the rate-limit headers on every response, not just on 429s. Clients use them for adaptive pacing; emitting them only on rejection is too late.
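One way to derive these headers from a token bucket result is sketched below. The function name and parameters are assumptions, and so is the meaning given to X-RateLimit-Reset (the epoch second at which the bucket would be full again); APIs differ on that header’s semantics, so document whichever you choose.

```python
import math
from typing import Dict


def rate_limit_headers(
    limit_per_minute: int,
    remaining: float,
    rate_per_second: float,
    capacity: float,
    now: float,
    cost: float = 1.0,
    allowed: bool = True,
) -> Dict[str, str]:
    """Build response headers from the limiter's result."""
    # Seconds until the bucket is full again (assumed meaning of Reset).
    seconds_to_full = (capacity - remaining) / rate_per_second
    headers = {
        "X-RateLimit-Limit": str(limit_per_minute),
        "X-RateLimit-Remaining": str(max(0, math.floor(remaining))),
        "X-RateLimit-Reset": str(math.ceil(now + seconds_to_full)),
    }
    if not allowed:
        # Seconds until enough tokens have refilled to cover this request.
        wait = max(0.0, cost - remaining) / rate_per_second
        headers["Retry-After"] = str(max(1, math.ceil(wait)))
    return headers
```

Retry-After is rounded up to a whole number of seconds with a floor of one, since the header carries integer seconds and a value of zero would invite an immediate retry.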

Distributed Failure Modes

A rate limiter backed by Redis becomes a hard dependency. What happens when Redis is degraded or unreachable?

Three strategies, in order of risk tolerance:

  • Fail closed. Reject all requests when the limiter is down. Safe but produces self-inflicted outages.
  • Fail open. Allow all requests when the limiter is down. Risky in the face of abuse; defensible if the limiter is supplementary (e.g., a WAF is the real defense).
  • Fall back to in-process limit. Each instance maintains a coarse-grained in-memory limit as a backstop; activate when Redis is unavailable. Better than either extreme.

The right answer depends on the workload. For abuse prevention, fail open is dangerous. For SLA-driven quota, brief fail-open is acceptable.

Also consider: a Redis cluster failover takes seconds. Set timeouts on the limiter calls (50–100ms) and have a clear fallback path; you do not want the limiter to add latency to every request when Redis is healthy and outages when it isn’t.
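The timeout-plus-fallback path can be sketched with asyncio alone. The function name `check_with_fallback` and the two callables are assumptions: `remote_check` stands for the Redis-backed limiter call and `local_check` for a coarse in-process limit.

```python
import asyncio
from typing import Awaitable, Callable


async def check_with_fallback(
    remote_check: Callable[[], Awaitable[bool]],
    local_check: Callable[[], bool],
    timeout_seconds: float = 0.05,
) -> bool:
    """Ask the shared limiter first; use the in-process limit if it fails."""
    try:
        return await asyncio.wait_for(remote_check(), timeout=timeout_seconds)
    except (asyncio.TimeoutError, OSError):
        # Limiter is slow or unreachable: degrade to the coarse local limit.
        # A real Redis client raises its own exception types; catch those too.
        return local_check()
```

Count every trip through the `except` branch as a metric. In practice a circuit breaker in front of this call avoids paying the full timeout on every request during a long outage.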

Anti-Abuse Patterns Beyond Rate Limiting

Rate limiting is necessary but not sufficient for production anti-abuse:

  • Per-IP limits in addition to per-tenant. A compromised credential generating absurd traffic from one source is suspicious regardless of the tenant’s limit.
  • Anomaly detection. A tenant that historically does 100 RPM suddenly doing 10K RPM is suspicious even if “Enterprise plan” allows it. Alert and/or auto-throttle.
  • Per-resource limits. A tenant has 10K user records; a request asking for all of them is suspicious. Enforce maximum response sizes and pagination requirements.
  • CAPTCHA and JS challenges for unauthenticated endpoints. Edge providers (Cloudflare, Akamai) handle this; rolling your own is rarely worthwhile.
  • Per-credential and per-API-key limits even within a tenant. A leaked API key shouldn’t burn the tenant’s entire budget.
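The anomaly-detection item above can start out very simply: compare current traffic with the tenant’s own recent baseline. The sketch below is one possible heuristic, not a recommendation of specific thresholds. The function names, the 10x multiplier, the 100 RPM floor, and the smoothing factor are all assumptions.

```python
def is_anomalous(
    current_rpm: float,
    baseline_rpm: float,
    multiplier: float = 10.0,
    min_rpm: float = 100.0,
) -> bool:
    """Flag traffic far above a tenant's own history, whatever the plan allows."""
    # Ignore low-volume tenants, where a large relative jump is still tiny.
    if current_rpm < min_rpm:
        return False
    return current_rpm > baseline_rpm * multiplier


def update_baseline(baseline_rpm: float, observed_rpm: float, alpha: float = 0.1) -> float:
    """Exponentially weighted moving average of the tenant's request rate."""
    return (1.0 - alpha) * baseline_rpm + alpha * observed_rpm


print(is_anomalous(current_rpm=10_000, baseline_rpm=100))  # True
print(is_anomalous(current_rpm=500, baseline_rpm=100))     # False
```

A flagged tenant would typically trigger an alert or a temporary tighter limit rather than an outright block, since legitimate launches and backfills look similar.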

Observability for Rate Limiting

Metrics worth tracking:

  • Limited request rate per tenant per endpoint. Spikes indicate either an attack or a misconfigured limit.
  • Top tenants by request rate. Used for capacity planning and detecting noisy neighbors.
  • Bucket fullness distribution. Tenants whose buckets sit consistently near zero remaining tokens are running close to their limit and may need a higher tier.
  • Limiter latency. Sub-millisecond for healthy Redis; spikes are a leading indicator of Redis problems.
  • Fail-open invocations. Should be near zero in steady state; non-zero means the limiter is degraded.

Edge Cases

A few patterns that bite:

  • Clock skew across application instances. If different instances disagree about “now,” the Lua script’s elapsed-time math is wrong. Use the Redis server’s TIME command if precision matters, or accept that NTP-synced application clocks are close enough.
  • Token bucket with very low rates. A rate of 0.01 tokens/sec (one token per 100s) accumulates fractional tokens; floating-point arithmetic in Lua can produce surprising rounding. Use integer math where possible, or cap the precision deliberately.
  • Hot-key contention. A single tenant making 100K requests/sec hits the same Redis key, and the single shard handling that key becomes the bottleneck. Mitigate with client-side aggregation (batch token requests in the application, deduct N tokens per batch) or shard the key (tenant:1234:shard:N with N = hash(request_id) % 4), giving each shard a quarter of the rate and capacity so the total still matches the tenant’s limit.
  • Multi-region deployments. Rate limits per region or global? Per-region is cheap; global requires either cross-region Redis (slow) or eventual-consistency limits (loose). Document the choice explicitly.
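For the hot-key case, key sharding can be sketched as follows. The function names and the `rl:` key prefix are assumptions; `zlib.crc32` stands in for the hash because Python’s built-in `hash()` for strings is randomized per process and would route the same request differently on each instance.

```python
import zlib
from typing import Tuple


def shard_key(tenant_id: str, request_id: str, shards: int = 4) -> str:
    """Pick one of `shards` bucket keys for this request."""
    # crc32 is stable across processes, unlike the built-in hash() for str.
    n = zlib.crc32(request_id.encode("utf-8")) % shards
    return f"rl:{tenant_id}:shard:{n}"


def per_shard_limits(
    rate_per_second: float, capacity: float, shards: int = 4
) -> Tuple[float, float]:
    """Each shard enforces an equal slice of the tenant's total limit."""
    return rate_per_second / shards, capacity / shards


print(per_shard_limits(200, 500))  # (50.0, 125.0)
```

The trade-off is precision: if requests land unevenly, one shard can reject while another still has tokens, so the effective limit is slightly below the nominal one.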

Closing

Multi-tenant rate limiting is one of those features that looks like a one-day project and is actually a piece of infrastructure with edge cases for years. The mechanics are straightforward — token bucket in a Lua script in Redis, layered enforcement from edge to application, variable-cost charging, parallel limits at different windows, standard response semantics. The discipline is harder: configuring per-tenant limits that match the contract, charging the right cost per request, handling Redis failures deliberately, instrumenting limits as first-class metrics, and adding anti-abuse layers beyond rate limiting alone. Get the algorithm right and the per-tenant fairness story holds at any scale. Get it wrong and the day a single tenant misbehaves, you find out the limiter you thought you had wasn’t actually limiting much of anything.