Secure Multi-Tenant Rate Limiting Strategies

A single misbehaving tenant generating 50x their normal traffic can, in many systems, degrade or take down the service for everyone else. Having a rate_limit_per_minute config column does not change this: many rate-limiting implementations have subtle correctness issues that surface exactly when they matter most. This post is about how to do it right: which algorithm to use, how to implement it correctly across a distributed fleet, and how to defend against the noisy-neighbor failure modes that make multi-tenancy operationally dangerous.

The Problem, Precisely Stated

In a multi-tenant API, “rate limiting” can mean any of several things:

Abuse prevention. Stop a credential-stuffing attack or a runaway script from hammering your system.
Quota enforcement. Each tenant has a contractual limit (say, 600 requests/minute on the Pro plan); enforce it.
Fairness. Prevent one tenant from consuming a disproportionate share of shared resources during a spike.
Backpressure. When the system is approaching capacity, shed load fairly across tenants.

These have overlapping but distinct mechanisms. A single token bucket doesn’t solve all four well; a layered design does.

The Algorithms Worth Knowing

Five algorithms cover essentially all production rate limiting. Understanding their properties is the foundation.

Fixed Window

Count requests in a discrete window (e.g., per minute). Reset at the boundary.

key = f"rl:{tenant_id}:{int(time.time() // 60)}"
count = await redis.incr(key)
if count == 1:
    # First request in this window: set a TTL longer than the window so old keys expire.
    await redis.expire(key, 90)
if count > limit:
    raise RateLimitExceeded

Pros: trivially simple, one Redis op per request, accurate count. Cons: the window boundary effect — a tenant can do 2× the limit in 2 seconds (full burst at end of window, full burst at start of next). For coarse abuse limits this is fine; for tight SLA enforcement it is not.

Sliding Window Log

Store a sorted set of request timestamps. Drop entries older than the window; count what remains.

Pros: precise. Cons: O(N) storage per tenant; expensive at high rates.
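To make the mechanics concrete, here is a minimal in-memory sketch of a sliding window log for a single key. The class name and its methods are illustrative, not taken from any library; a Redis-backed version would typically map the same three steps onto ZREMRANGEBYSCORE, ZCARD, and ZADD on a sorted set.

```python
from collections import deque


class SlidingWindowLog:
    """In-memory sliding window log for one key (illustrative sketch only)."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self, now: float) -> bool:
        # Drop entries that have aged out of the window ending at `now`.
        cutoff = now - self.window
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        # Count what remains; only accepted requests are recorded.
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```

The storage cost is visible here: the deque holds one entry per accepted request in the window, which is exactly the O(N) term.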

Sliding Window Counter

Combine two adjacent fixed-window counts, weighted by where the current time is in the window.

effective_count = current_window_count
                + previous_window_count * (1 - fraction_of_current_window_elapsed)

Approximate but efficient: one or two Redis ops, no per-request log. Good middle ground for most production needs.
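As a worked example of the weighting, the sketch below computes the estimate from two counts. The function name and signature are assumptions for illustration; it also assumes windows are aligned to multiples of the window length, as in the fixed-window key scheme.

```python
def sliding_window_count(
    current_count: int,
    previous_count: int,
    now: float,
    window_seconds: float,
) -> float:
    """Estimate requests in the trailing window from two fixed-window counts."""
    # How far we are into the current fixed window, as a fraction in [0, 1).
    elapsed_fraction = (now % window_seconds) / window_seconds
    # The trailing window still overlaps the last (1 - elapsed_fraction)
    # of the previous fixed window, so weight that count accordingly.
    return current_count + previous_count * (1.0 - elapsed_fraction)
```

With a 60-second window, 40 requests in the previous window, and 10 so far in the current one, a check 15 seconds into the current window gives 10 + 40 * 0.75 = 40. The estimate assumes the previous window's requests were spread evenly, which is where the approximation comes from.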

Token Bucket

A bucket with a fixed capacity refills at a fixed rate. Each request consumes one token; reject when the bucket is empty.

Two parameters: rate (tokens per second) and burst (capacity). The bucket allows short bursts up to capacity while maintaining a long-term average rate.

This is the algorithm most production systems should default to. It models real traffic behavior (steady state plus occasional spikes), naturally supports variable-cost requests (an expensive call costs more tokens), and is easy to implement atomically.
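Before bringing Redis into it, the refill-then-consume logic fits in a few lines. This is a single-process sketch with a caller-supplied clock; the class and method names are illustrative, and it has no locking, so it is not safe to share across threads as written.

```python
class TokenBucket:
    """Single-process token bucket; the caller supplies the clock in seconds."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full
        self.updated_at = now

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily based on the time since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with rate=1 and capacity=3 admits three back-to-back requests, rejects the fourth, and two seconds later has refilled enough for a single cost-2 request.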

Leaky Bucket

Equivalent to a queue with a fixed drain rate: new requests are added to the queue, processed at the drain rate, and dropped when the queue is full. Used purely as a meter (admit or reject, no queueing), leaky bucket and token bucket are duals: same semantics, different framing. Used as a queue, leaky bucket additionally smooths the output.

The case for leaky bucket: it shapes the output rate strictly, useful when downstream systems can’t absorb bursts. The case for token bucket: it preserves burst capacity, friendlier to bursty client patterns.

For most APIs, token bucket is the right default.

Implementing Token Bucket in Redis Correctly

A naive implementation has at least one race condition. The correct version uses a Lua script for atomicity:

-- KEYS[1] = bucket key
-- ARGV[1] = now (ms)
-- ARGV[2] = rate (tokens/sec)
-- ARGV[3] = capacity
-- ARGV[4] = cost of this request (usually 1)

local now      = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'updated_at')
local tokens = tonumber(bucket[1])
local updated_at = tonumber(bucket[2])

if tokens == nil then
  tokens = capacity
  updated_at = now
end

local elapsed = math.max(0, now - updated_at)
tokens = math.min(capacity, tokens + elapsed * rate / 1000)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'updated_at', now)
redis.call('PEXPIRE', KEYS[1], math.ceil(capacity / rate * 1000) + 60000)

return { allowed, tokens }

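This script is atomic, refills on read, supports variable-cost requests, and bounds memory via the TTL. Returning the current token count lets you emit the conventional rate-limit headers (X-RateLimit-Remaining directly, X-RateLimit-Reset derived from the count and the refill rate). Note that Redis truncates Lua numbers to integers in script replies, so the returned count is rounded down.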

The application call:

allowed, remaining = await redis.evalsha(
    SCRIPT_SHA,
    1,
    f"rl:{tenant_id}:{endpoint}",
    int(time.time() * 1000),
    rate_per_second,
    burst_capacity,
    cost,
)
if not allowed:
    # Wait only for the shortfall: the bucket already holds `remaining` tokens.
    raise RateLimitExceeded(retry_after=max(0, cost - remaining) / rate_per_second)

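Use SCRIPT LOAD once at startup; reference by SHA on every call to avoid the per-call upload overhead. Handle NOSCRIPT errors by reloading the script, since a Redis restart or SCRIPT FLUSH empties the script cache.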

Where to Enforce

Rate limiting can happen at several layers. Each has different properties:

Edge (CDN / WAF). Cloudflare, AWS WAF, CloudFront — protects against volumetric attacks before they hit your infrastructure. Limited per-tenant awareness because the edge often can’t see authenticated identity.
API Gateway. Kong, Envoy, AWS API Gateway — sees the request after auth. Can enforce per-key or per-tenant limits. Often the cleanest place for quota enforcement.
Application. Inside the service, after parsing the request and resolving the tenant. Most flexible (can rate-limit by tenant + endpoint + cost class), but adds per-request latency.
Database / queue. Some downstream resources need their own rate limits (e.g., expensive search queries).

The defensible pattern is layered: edge first, then gateway, then application.

Each layer catches what fits its visibility. The edge layer blocks volumetric attacks; the gateway enforces contractual quotas; the application enforces fairness and per-feature limits.

Per-Tenant Limit Configuration

Hard-coded rate limits are wrong. Tenants on different plans need different limits; specific tenants (large customers, beta features) need overrides.

A configuration model that works:

plans:
  free:    { requests_per_minute: 60,    burst: 100 }
  pro:     { requests_per_minute: 600,   burst: 1000 }
  enterprise: { requests_per_minute: 6000, burst: 10000 }

overrides:
  - tenant_id: "acme-corp"
    requests_per_minute: 12000
    burst: 500
    reason: "contract addendum 2025-01"
    expires_at: "2026-12-31"

Load into Redis or a config service; cache in-process with short TTL; lookup on each request via the tenant context. Per-endpoint overrides for expensive operations (e.g., bulk export) layer on top.
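A sketch of the lookup, assuming the config has already been loaded into plain Python structures. The Limit and Override types and the resolve_limit function are hypothetical names for illustration; plan rates are converted to tokens per second because that is what the bucket script takes, and the acme-corp override is expressed as 200 tokens per second.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Limit:
    rate_per_second: float
    burst: int


@dataclass(frozen=True)
class Override:
    limit: Limit
    expires_at: date  # last day on which the override applies


# Plan defaults, converted from requests per minute to tokens per second.
PLANS = {
    "free": Limit(60 / 60, 100),
    "pro": Limit(600 / 60, 1000),
    "enterprise": Limit(6000 / 60, 10000),
}

OVERRIDES = {
    "acme-corp": Override(Limit(200.0, 500), date(2026, 12, 31)),
}


def resolve_limit(tenant_id: str, plan: str, today: date) -> Limit:
    """Return the tenant's unexpired override if one exists, else the plan default."""
    override = OVERRIDES.get(tenant_id)
    if override is not None and today <= override.expires_at:
        return override.limit
    return PLANS[plan]
```

Checking expires_at at lookup time means an expired override silently falls back to the plan limit, which is usually what the contract intends; whether to alert before that happens is a separate decision.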

Variable-Cost Requests

Not all requests cost the same. A bulk-export endpoint hitting the database for a minute costs vastly more than a GET /users/me. A correctly-designed rate limiter charges by cost, not by count.

cost_table = {
    "GET /users/me": 1,
    "POST /search": 5,
    "POST /exports": 50,
    "POST /llm/generate": 10,  # plus dynamic charge based on tokens
}

The token bucket script accepts an arbitrary cost; the application charges based on the endpoint and any dynamic factors (output size, token count for LLM endpoints, file size for uploads). For LLM endpoints specifically, post-charge for the actual tokens used after the response — pre-charge underestimates true cost.
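One way to split the charge for an LLM endpoint is sketched below: a static pre-charge from the table, then a settlement after the response. The helper names, the default cost of 1 for unlisted endpoints, and the 1,000-LLM-tokens-per-unit conversion are all assumptions made for the example, not values from any real pricing scheme.

```python
import math

# Static per-endpoint costs, copied from the table above.
COST_TABLE = {
    "GET /users/me": 1,
    "POST /search": 5,
    "POST /exports": 50,
    "POST /llm/generate": 10,
}
DEFAULT_COST = 1  # assumption: unlisted endpoints cost one token


def base_cost(method: str, path: str) -> int:
    """Cost charged before the request runs."""
    return COST_TABLE.get(f"{method} {path}", DEFAULT_COST)


def llm_post_charge(
    precharged: int,
    llm_tokens_used: int,
    llm_tokens_per_unit: int = 1000,  # hypothetical conversion rate
) -> int:
    """Extra bucket tokens to deduct once the real usage is known."""
    actual = math.ceil(llm_tokens_used / llm_tokens_per_unit)
    # Never refund in this sketch: a cheap response keeps its pre-charge.
    return max(0, actual - precharged)
```

One caveat: the bucket script as written only deducts when the bucket holds at least the requested cost, so a post-charge path would need a variant that can drive the balance negative. That variant is not shown here.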

Multiple Limits per Tenant

Real systems enforce several limits in parallel:

Burst limit — short window, prevents instantaneous floods.
Sustained limit — longer window, enforces the contract.
Daily quota — coarse limit for billing/contract purposes.
Per-endpoint limits — specific resources with their own throttles.

A request must pass all applicable limits. The check is cheap because each limit is one Redis op. Decide up front what happens when Redis is unreachable: fail closed (reject the request) or fail open with a fallback in-process limit.
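A detail that is easy to miss when checking limits one after another: a request rejected by the second limit should not have consumed a token from the first. The in-memory sketch below checks every bucket before charging any of them. The Bucket type and check_all function are illustrative; in Redis the same all-or-nothing behavior would need a single script covering all of the keys.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bucket:
    name: str
    rate: float        # tokens per second
    capacity: float
    tokens: float
    updated_at: float

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now


def check_all(buckets: List[Bucket], now: float, cost: float = 1.0) -> Optional[str]:
    """Charge every bucket, or none. Returns the name of the limit hit, else None."""
    for bucket in buckets:
        bucket.refill(now)
    # First pass: find any limit that cannot cover the cost.
    for bucket in buckets:
        if bucket.tokens < cost:
            return bucket.name
    # Second pass: all limits can cover it, so charge them all.
    for bucket in buckets:
        bucket.tokens -= cost
    return None
```

Returning the name of the limit that was hit also gives the 429 response something specific to say.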

Response Semantics

A rate-limited response should be unambiguous and actionable. The standard:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1715600400
Content-Type: application/problem+json

{
  "type": "https://example.com/errors/rate-limited",
  "title": "Rate limit exceeded",
  "detail": "Sustained request limit of 600/min exceeded for tenant acme-corp",
  "retry_after_seconds": 30
}

The Retry-After header tells well-behaved clients when to back off. The X-RateLimit-* headers give them runway awareness so they can pace themselves. Returning a JSON body with a Problem Details envelope (RFC 9457) makes the failure programmatically meaningful.

Emit the rate-limit headers on every response, not just on 429s. Clients use them for adaptive pacing; emitting them only on rejection is too late.
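A sketch of building those headers from token bucket state. The function and its parameters are hypothetical, and it makes one assumption worth calling out: X-RateLimit-Reset is taken to mean the epoch second at which the bucket would be full again, since these headers have no single agreed definition for a token bucket.

```python
import math
from typing import Dict


def rate_limit_headers(
    limit_per_minute: int,
    capacity: float,
    remaining: float,
    rate_per_second: float,
    now: float,
    cost: float = 1.0,
) -> Dict[str, str]:
    """Build rate-limit headers for any response; adds Retry-After when rejected."""
    headers = {
        "X-RateLimit-Limit": str(limit_per_minute),
        "X-RateLimit-Remaining": str(max(0, math.floor(remaining))),
        # Assumption: "reset" is when the bucket is full again, in epoch seconds.
        "X-RateLimit-Reset": str(
            math.ceil(now + (capacity - remaining) / rate_per_second)
        ),
    }
    if remaining < cost:
        # Seconds until the shortfall has refilled, rounded up to a whole second.
        headers["Retry-After"] = str(
            math.ceil((cost - remaining) / rate_per_second)
        )
    return headers
```

Calling this on both the success path and the 429 path keeps the two sets of headers consistent, because they come from the same bucket snapshot.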

Distributed Failure Modes

A rate limiter backed by Redis becomes a hard dependency. What happens when Redis is degraded or unreachable?

Three strategies, in order of risk tolerance:

Fail closed. Reject all requests when the limiter is down. Safe but produces self-inflicted outages.
Fail open. Allow all requests when the limiter is down. Risky in the face of abuse; defensible if the limiter is supplementary (e.g., a WAF is the real defense).
Fall back to in-process limit. Each instance maintains a coarse-grained in-memory limit as a backstop; activate when Redis is unavailable. Better than either extreme.

The right answer depends on the workload. For abuse prevention, fail open is dangerous. For SLA-driven quota, brief fail-open is acceptable.

Also consider: a Redis cluster failover takes seconds. Set timeouts on the limiter calls (50–100ms) and have a clear fallback path; you do not want the limiter to add latency to every request when Redis is healthy and outages when it isn’t.
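The timeout-plus-fallback path can be sketched as a thin wrapper around the remote check. Everything here is illustrative: remote_check stands in for whatever coroutine calls the Redis script, LocalBucket is a deliberately coarse per-process bucket, and the exception list would need the client library's own error types added.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


class LocalBucket:
    """Coarse per-process token bucket, used only while the shared limiter is down."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated_at = capacity, now

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


async def check_with_fallback(
    remote_check: Callable[[], Awaitable[bool]],
    local: LocalBucket,
    now: float,
    timeout_s: float = 0.05,
) -> Tuple[bool, str]:
    """Return (allowed, source), where source is "remote" or "local"."""
    try:
        allowed = await asyncio.wait_for(remote_check(), timeout=timeout_s)
        return allowed, "remote"
    except (asyncio.TimeoutError, OSError):
        # Assumption: a real client raises its own exception types; add them here.
        return local.try_consume(now), "local"
```

Returning the source alongside the decision makes it straightforward to count fallback invocations as a metric. Because each instance has its own local bucket, the effective limit during an outage is roughly the local limit times the number of instances, so size it with that in mind.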

Anti-Abuse Patterns Beyond Rate Limiting

Rate limiting is necessary but not sufficient for production anti-abuse:

Per-IP limits in addition to per-tenant. A compromised credential generating absurd traffic from one source is suspicious regardless of the tenant’s limit.
Anomaly detection. A tenant that historically does 100 RPM suddenly doing 10K RPM is suspicious even if “Enterprise plan” allows it. Alert and/or auto-throttle.
Per-resource limits. A tenant has 10K user records; a request asking for all of them is suspicious. Enforce maximum response sizes and pagination requirements.
Captcha and JS challenges for unauthenticated endpoints. Edge providers (Cloudflare, Akamai) handle this; rolling your own is rarely worthwhile.
Per-credential and per-API-key limits even within a tenant. A leaked API key shouldn’t burn the tenant’s entire budget.

Observability for Rate Limiting

Metrics worth tracking:

Limited request rate per tenant per endpoint. Spikes indicate either an attack or a misconfigured limit.
Top tenants by request rate. Used for capacity planning and detecting noisy neighbors.
Bucket fullness distribution. Tenants whose buckets sit consistently near empty are running close to their limit and may need a higher tier.
Limiter latency. Sub-millisecond for healthy Redis; spikes are a leading indicator of Redis problems.
Fail-open invocations. Should be near zero in steady state; non-zero means the limiter is degraded.

Edge Cases

A few patterns that bite:

Clock skew across application instances. If different instances disagree about “now,” the Lua script’s elapsed-time math is wrong. Use the Redis server’s TIME command if precision matters, or accept that NTP-synced application clocks are close enough.
Token bucket with very low rates. A rate of 0.01 tokens/sec (one token per 100s) accumulates fractional tokens; floating-point arithmetic in Lua can produce surprising rounding. Use integer math where possible, or cap the precision deliberately.
Hot-key contention. A single tenant making 100K requests/sec hits the same Redis key. The single shard handling that key becomes the bottleneck. Mitigate with client-side aggregation (batch token requests in the application, deduct N tokens per batch) or shard the key (tenant:1234:shard:N with N = hash(request_id) % 4), giving each shard a quarter of the tenant’s rate and capacity.
Multi-region deployments. Rate limits per region or global? Per-region is cheap; global requires either cross-region Redis (slow) or eventual-consistency limits (loose). Document the choice explicitly.
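For the key-sharding mitigation, a minimal sketch follows. The function names are illustrative. It uses zlib.crc32 rather than Python's built-in hash(), because hash() on strings is randomized per process and different instances would then disagree about which shard a request maps to.

```python
import zlib
from typing import Tuple


def shard_key(tenant_id: str, request_id: str, shards: int = 4) -> str:
    """Pick one of `shards` bucket keys for this request, stable across processes."""
    n = zlib.crc32(request_id.encode("utf-8")) % shards
    return f"tenant:{tenant_id}:shard:{n}"


def per_shard_limit(rate: float, capacity: float, shards: int = 4) -> Tuple[float, float]:
    """Split the tenant's rate and burst evenly so the shards sum to the full limit."""
    return rate / shards, capacity / shards
```

The trade-off is precision: a request can be rejected because its shard is empty while the other shards still hold tokens, so the tenant may see slightly less than the configured limit under uneven hashing.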

Closing

Multi-tenant rate limiting is one of those features that looks like a one-day project and is actually a piece of infrastructure with edge cases for years. The mechanics are straightforward — token bucket in a Lua script in Redis, layered enforcement from edge to application, variable-cost charging, parallel limits at different windows, standard response semantics. The discipline is harder: configuring per-tenant limits that match the contract, charging the right cost per request, handling Redis failures deliberately, instrumenting limits as first-class metrics, and adding anti-abuse layers beyond rate limiting alone. Get the algorithm right and the per-tenant fairness story holds at any scale. Get it wrong and the day a single tenant misbehaves, you find out the limiter you thought you had wasn’t actually limiting much of anything.