Caching Strategies for Low-Latency APIs (Redis + In-Memory)

Almost every API that has a latency problem at scale has a caching problem in disguise. The database is rarely the bottleneck per se — it is the database being asked to answer the same questions thousands of times per second. The fix is rarely “add more replicas” and almost always “stop asking the database those questions.” Caching, done correctly, is the single highest-leverage performance investment in a backend system. Done incorrectly, it is the single largest source of correctness bugs.

This post is about how layered caching actually works in production, where the failure modes live, and how to build a cache topology that holds up under real load.

The Layers and Their Latency Budgets

A serious API has three data layers, whether the team realizes it or not:

Layer	Typical Latency	Capacity	Visibility
Process-local (in-memory)	50–500 ns	MBs–GBs per instance	Single process
Distributed (Redis/Memcached)	0.5–2 ms	GBs–TBs cluster-wide	All instances
Database / origin	5–50 ms	TBs+	Source of truth

The three to four orders of magnitude between local memory and Redis, and the further order of magnitude between Redis and the database, are the entire reason caching works. Moving a request up one layer cuts its cost by at least 10x, and by 1,000x or more when the move is from Redis to local memory.

The goal is to serve as much traffic as possible from the process-local cache (L1), fall back to the distributed cache (L2) for cross-instance sharing and for entries that should survive a process restart, and hit the origin only when neither layer has the answer.
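To make the latency budget concrete, here is a minimal sketch of the expected read latency when the layers are checked in order. The function name and the default latencies (500 ns, 1 ms, 20 ms) are illustrative picks from the ranges in the table, not measurements.

```python
def expected_latency_ms(l1_hit: float, l2_hit: float,
                        l1_ms: float = 0.0005, l2_ms: float = 1.0,
                        origin_ms: float = 20.0) -> float:
    """Expected read latency when L1, L2 and the origin are tried in order.

    l1_hit is the fraction of requests answered by L1; l2_hit is the
    fraction of the remaining requests answered by L2.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    # Every request pays the L1 lookup; L1 misses also pay the L2 lookup;
    # requests that miss both layers also pay the origin.
    return l1_ms + l1_miss * (l2_ms + l2_miss * origin_ms)
```

Under these assumed numbers, an 80% L1 hit rate and a 90% L2 hit rate on the remainder bring the mean read from about 21 ms (no cache hits at all) down to about 0.6 ms.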

Cache Patterns: Look-Aside, Write-Through, Write-Behind

There are three canonical interaction patterns. Pick deliberately; the wrong one introduces subtle inconsistencies.

Look-Aside (Cache-Aside)

The application explicitly checks the cache, falls back to the source on miss, and writes to the cache after reading. By far the most common pattern.

async def get_user(user_id: str) -> User:
    key = f"user:{user_id}"
    cached = await redis.get(key)
    if cached is not None:                      # hit: deserialize and return
        return User.parse_raw(cached)
    user = await db.fetch_user(user_id)         # miss: fall back to the source
    await redis.set(key, user.json(), ex=300)   # populate with a 5-minute TTL
    return user

Pros: simple, application has full control, cache failures degrade gracefully. Cons: every miss costs a database round-trip; cache and DB can diverge if writes do not invalidate.

Write-Through

Writes go through the cache, which forwards to the source. The cache always holds fresh data after a write.

Pros: cache and source are tightly coupled on writes. Cons: write latency is now latency(cache) + latency(source), and cache outages can block writes if not handled carefully.
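A minimal sketch of write-through, using plain dicts as stand-ins for the database and for Redis. The class and method names are hypothetical. This version writes the source first and the cache second, which is one reasonable ordering rather than the only one.

```python
import asyncio


class WriteThroughCache:
    """A write is acknowledged only after both the source and the cache hold it."""

    def __init__(self, store: dict, cache: dict):
        self.store = store   # stand-in for the database
        self.cache = cache   # stand-in for Redis

    async def put(self, key: str, value: str) -> None:
        # Source first, so a failed source write never leaves a phantom cache entry.
        await self._write_source(key, value)
        self.cache[key] = value

    async def get(self, key: str):
        if key in self.cache:
            return self.cache[key]
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    async def _write_source(self, key: str, value: str) -> None:
        await asyncio.sleep(0)  # placeholder for a real network round-trip
        self.store[key] = value
```

The caller waits for both writes, which is where the latency(cache) + latency(source) cost comes from.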

Write-Behind (Write-Back)

Writes go to the cache only; a background process flushes to the source asynchronously. Used for very high-write workloads where eventual durability is acceptable.

Pros: write latency is cache-only. Cons: durability gap on cache failure; complex backpressure and consistency machinery. Rarely the right choice for general-purpose APIs.
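For contrast, a write-behind sketch under the same dict-as-backend assumption. The names are hypothetical, and a production flusher would also batch, retry, and persist its queue. The bounded queue is the simplest form of backpressure: writers wait when the flusher falls behind.

```python
import asyncio


class WriteBehindCache:
    def __init__(self, store: dict, max_pending: int = 1000):
        self.store = store                   # stand-in for the database
        self.cache: dict[str, str] = {}      # stand-in for Redis
        # Bounded queue: put() blocks once max_pending writes are waiting.
        self.pending: asyncio.Queue = asyncio.Queue(maxsize=max_pending)

    async def put(self, key: str, value: str) -> None:
        self.cache[key] = value              # acknowledged after the cache write only
        await self.pending.put((key, value))

    async def flush_forever(self) -> None:
        """Background task that drains queued writes into the source."""
        while True:
            key, value = await self.pending.get()
            self.store[key] = value
            self.pending.task_done()
```

Anything still in `pending` when the process dies is lost, which is the durability gap described above.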

For most APIs the answer is look-aside with explicit invalidation on writes — write to the database, then delete the cache key (don’t update it; the next read will repopulate from the source of truth). Updating the cache on write is a common source of race-induced staleness.
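The write path for look-aside with invalidation is short. A sketch with dicts standing in for the database and the cache (the function name and the email field are made up for illustration):

```python
import asyncio


async def update_user_email(db: dict, cache: dict, user_id: str, email: str) -> None:
    # 1. Commit to the source of truth first.
    db[user_id] = {**db.get(user_id, {}), "email": email}
    # 2. Delete instead of overwriting. The next reader repopulates from the
    #    database, so two racing writers cannot leave the older value cached.
    cache.pop(f"user:{user_id}", None)
```

A reader that fetched the old row just before the write can still repopulate the cache with it afterwards, which is one more reason the TTL stays in place as a backstop.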

TTLs Are a Correctness Decision, Not a Performance Knob

The most common cache bug in the wild is the wrong TTL on the wrong data. A few rules:

Bounded-staleness data (user profiles, configuration, feature flags) can tolerate seconds to minutes of staleness. TTL accordingly.
Strongly consistent data (account balances, inventory counts) should not be cached without explicit invalidation. A TTL is not a substitute for invalidation; it is a backstop.
Hot, low-cardinality data (homepage feeds, top products) benefits from longer TTLs combined with proactive refresh.
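Per-user data usually has a heavy-tailed (Pareto-like) access distribution: a small fraction of users generates a disproportionate share of traffic. Long TTLs are fine for the inactive majority; the active few need short TTLs or explicit invalidation so they are not served stale data.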

Always set a TTL. An unbounded key is a future memory leak.
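One way to keep TTLs deliberate is to make every cache write name its data class. A small sketch follows; the enum, the helper, and the numbers are illustrative only, and real values come from how much staleness each consumer can tolerate.

```python
from enum import Enum


class DataClass(Enum):
    BOUNDED_STALENESS = "bounded_staleness"        # profiles, config, feature flags
    STRONGLY_CONSISTENT = "strongly_consistent"    # balances, inventory counts
    HOT_LOW_CARDINALITY = "hot_low_cardinality"    # homepage feeds, top products


# Illustrative values, not recommendations.
TTL_SECONDS = {
    DataClass.BOUNDED_STALENESS: 60,
    DataClass.STRONGLY_CONSISTENT: 5,      # backstop only; invalidation does the real work
    DataClass.HOT_LOW_CARDINALITY: 600,    # paired with proactive refresh
}


def ttl_for(data_class: DataClass) -> int:
    """Return the TTL for a data class; there is no 'no expiry' option."""
    ttl = TTL_SECONDS[data_class]
    if ttl <= 0:
        raise ValueError(f"{data_class} must have a positive TTL")
    return ttl
```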

The Thundering Herd

A popular cache entry expires. A thousand concurrent requests miss simultaneously. A thousand database queries hit the origin in the same millisecond. The database falls over. This is the thundering herd, and every meaningful caching system handles it explicitly.

Three techniques, in order of complexity:

1. Single-Flight (Request Coalescing)

In-process, allow only one in-flight fetch per key; queue the rest behind it.

from asyncio import Lock
from collections import defaultdict

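# One lock per key. This dict grows with the keyspace; bound or prune it in long-running processes.
_locks: dict[str, Lock] = defaultdict(Lock)

async def get_with_singleflight(key: str, loader):
    cached = await cache.get(key)
    if cached is not None:
        return cached
    async with _locks[key]:
        # Double-check: another coroutine may have filled the cache while we waited for the lock.
        cached = await cache.get(key)
        if cached is not None:
            return cached
        value = await loader()
        await cache.set(key, value, ex=300)
        return value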

This handles the within-instance herd. For cross-instance herds you need distributed coordination.

2. Distributed Locks (Carefully)

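A SET NX EX lock in Redis lets one instance at a time refresh the cache while the others wait or serve stale. Implement it with timeouts and fencing tokens: the expiry keeps a crashed holder from blocking refreshes forever, and the token keeps a slow holder from releasing a lock that has since passed to someone else. Because the lock can expire mid-refresh, it is not a hard guarantee; never rely on Redis as a primary lock service for safety-critical mutexes. For cache refresh, where the worst case is a duplicate fetch, the trade-offs are usually acceptable.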
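A sketch of the lock-guarded refresh, assuming a redis-py asyncio client is passed in as `redis`. The function name, the `lock:` prefix, and the default timeouts are hypothetical. Release goes through a small Lua script so that the compare and the delete happen atomically.

```python
import asyncio
import uuid

# Delete the lock only if we still own it (compare-and-delete, atomic inside Redis).
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


async def refresh_with_lock(redis, key: str, loader, ttl_s: int = 300, lock_ttl_s: int = 10):
    lock_key = f"lock:{key}"
    token = uuid.uuid4().hex  # identifies this holder
    # SET NX EX: succeeds for one caller; EX bounds how long a crashed holder blocks others.
    acquired = await redis.set(lock_key, token, nx=True, ex=lock_ttl_s)
    if not acquired:
        return None  # someone else is refreshing: serve stale or retry shortly
    try:
        value = await loader()
        await redis.set(key, value, ex=ttl_s)
        return value
    finally:
        await redis.eval(RELEASE_SCRIPT, 1, lock_key, token)
```

Callers that get `None` back did not win the lock; what they do next (short sleep and re-read, or serve the stale entry) is a policy choice left out of this sketch.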

3. Probabilistic Early Expiration

Each request computes an expiration probability that grows as the TTL nears. Some fraction of requests refresh early, before any single hard expiration creates a herd.

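import math, random

RECOMPUTE_COST_S = 0.2  # placeholder; use the measured duration of your own fetch

def should_refresh(value_age_s: float, ttl_s: float, beta: float = 1.0) -> bool:
    # 1.0 - random.random() lies in (0, 1], so math.log never receives 0.
    delta = -math.log(1.0 - random.random()) * beta * RECOMPUTE_COST_S
    return value_age_s + delta >= ttl_s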

(RECOMPUTE_COST_S is how long the underlying fetch takes; a beta above 1.0 makes refreshes happen earlier.) This is the XFetch algorithm; it avoids most herds without explicit coordination and is a production-friendly answer for high-traffic systems.
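It helps to see how sharply the refresh probability rises near expiry. Since -ln(U) for a uniform U is exponentially distributed with mean 1, the chance that a single request refreshes early has a closed form. The helper below is a sketch with a hypothetical name; it is not part of any library.

```python
import math


def refresh_probability(value_age_s: float, ttl_s: float,
                        recompute_cost_s: float, beta: float = 1.0) -> float:
    """Probability that one request triggers an early refresh.

    P(age + (-ln U) * beta * cost >= ttl) = exp(-(ttl - age) / (beta * cost))
    """
    remaining = ttl_s - value_age_s
    if remaining <= 0:
        return 1.0  # already expired: always refresh
    return math.exp(-remaining / (beta * recompute_cost_s))
```

With a 300 s TTL and a 0.5 s fetch, a request arriving 0.5 s before expiry refreshes with probability e^-1 (about 37%), while one arriving 5 s before expiry does so with probability e^-10 (about 0.005%). Early refreshes therefore cluster within a few fetch-durations of the deadline.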

Stale-While-Revalidate

A complementary pattern, borrowed from HTTP caching: continue serving the stale value while a background fetch refreshes it. The user sees no latency penalty on expiry; the cache becomes effectively non-blocking for reads.

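import asyncio

_background: set[asyncio.Task] = set()  # strong references so pending refreshes are not garbage-collected

async def get_swr(key: str, fresh_ttl: int, stale_ttl: int, loader):
    # refresh(key, loader) is assumed to call loader, store the result, and return it.
    entry = await cache.get(key)
    if entry and entry.age < fresh_ttl:
        return entry.value                      # fresh: serve directly
    if entry and entry.age < stale_ttl:
        task = asyncio.create_task(refresh(key, loader))
        _background.add(task)
        task.add_done_callback(_background.discard)
        return entry.value                      # stale but usable: serve now, refresh in background
    return await refresh(key, loader)           # missing or too old: block on the fetch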

Combine SWR with single-flight: only one background refresh fires per key. CDNs use this pattern almost universally.

Two-Tier Caches: L1 + L2 Done Right

Process-local caches are tempting and dangerous. They are fast and they go stale invisibly because they are not aware of writes happening elsewhere.

Rules for L1:

Short TTLs. Seconds, not minutes. The L1 exists to absorb micro-bursts, not to be a source of truth.
Sized as LRU/LFU. cachetools.TTLCache, aiocache.SimpleMemoryCache, or Go’s ristretto. Never an unbounded dict.
Invalidation via pub/sub. When a write occurs, publish an invalidation on a Redis channel; subscribed instances evict the key from L1.
Cache only immutable or read-mostly data. Anything user-specific that must reflect a write immediately belongs only in L2.

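async def on_user_write(user_id: str):
    await db.update_user(user_id, ...)
    await redis.delete(f"user:{user_id}")                    # invalidate L2
    await redis.publish("invalidations", f"user:{user_id}")  # tell every instance to drop its L1 copy

# On startup, every instance runs this as a background task (redis-py asyncio API):
async def subscribe_invalidations():
    pubsub = redis.pubsub()
    await pubsub.subscribe("invalidations")
    async for msg in pubsub.listen():
        if msg["type"] != "message":
            continue  # skip subscribe confirmations
        data = msg["data"]
        key = data.decode() if isinstance(data, bytes) else data
        l1_cache.pop(key, None)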

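This pattern keeps L1 hits near memory speed while making cross-instance correctness explicit. Pub/sub delivery is at-most-once, so an instance that is briefly disconnected misses invalidations; the short L1 TTL is what bounds the resulting staleness. Redis 6.0+ also offers client-side caching with server-assisted invalidation (CLIENT TRACKING, with invalidations pushed over RESP3), which is the same idea built into the protocol and worth considering if your client library supports it cleanly.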
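The read side of the two-tier setup can be sketched with the standard library alone. `TinyTTLCache` is a hypothetical stand-in for cachetools.TTLCache, and `l2` is assumed to expose async `get` and `set(key, value, ex=...)` in the style of a redis-py asyncio client.

```python
import time
from collections import OrderedDict


class TinyTTLCache:
    """Bounded LRU with per-entry expiry (stdlib stand-in for a real L1 library)."""

    def __init__(self, maxsize: int, ttl_s: float):
        self.maxsize, self.ttl_s = maxsize, ttl_s
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self._data.get(key)
        if item is None:
            return None
        if item[0] <= now:
            del self._data[key]          # expired
            return None
        self._data.move_to_end(key)      # mark as recently used
        return item[1]

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (now + self.ttl_s, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry

    def pop(self, key, default=None):
        item = self._data.pop(key, None)
        return default if item is None else item[1]


async def get_two_tier(key, l1: TinyTTLCache, l2, loader, l2_ttl_s: int = 300):
    value = l1.get(key)
    if value is not None:
        return value                     # L1 hit: no network at all
    value = await l2.get(key)            # L2 lookup
    if value is None:
        value = await loader()           # origin
        await l2.set(key, value, ex=l2_ttl_s)
    l1.set(key, value)                   # short-lived local copy
    return value
```

The `pop` method matches what the invalidation subscriber calls, so an instance of this class could serve as the `l1_cache` in the listing above.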

Negative Caching

When a lookup returns nothing, cache that too — for a shorter time. Without negative caching, every request for a non-existent key hits the database.

SENTINEL = b"__NULL__"  # marks "looked up, does not exist"

async def get_or_none(key, loader, ttl=300, neg_ttl=30):
    v = await cache.get(key)
    if v == SENTINEL:
        return None              # cached miss: skip the database
    if v is not None:
        return decode(v)
    v = await loader()
    found = v is not None
    await cache.set(key, encode(v) if found else SENTINEL,
                    ex=ttl if found else neg_ttl)
    return v

Negative TTL should be shorter than positive TTL — when an entity is created, you want it to appear quickly.

Cache Key Design

Keys are an interface. Treat them with the same discipline as a public schema.

Namespacing. app:env:version:entity:id. Putting the version in the prefix lets you mass-invalidate by deploying a new prefix.
Determinism. Hash arguments consistently. Sort dict keys before serializing. Two requests that should hit the same entry must produce identical keys.
Length matters. Redis keys live in memory; 200-byte keys at scale are expensive. Hash long composite keys (xxhash, blake3) once they exceed ~60 bytes.
Tenant scoping. Every multi-tenant key must include the tenant ID. Universally. No exceptions.
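These rules fit in one small key builder. A sketch, assuming flat parameters whose values have stable string forms and contain no "&" or "=" characters. The function name and the exact layout are illustrative, and blake2b is used only because it ships with the standard library.

```python
import hashlib

MAX_KEY_BYTES = 60  # threshold from the rule of thumb above


def cache_key(app: str, env: str, tenant_id: str, entity: str,
              params: dict, version: str = "v1") -> str:
    # Sorting makes the key independent of dict insertion order.
    canonical = "&".join(f"{name}={params[name]}" for name in sorted(params))
    prefix = f"{app}:{env}:{version}:{tenant_id}:{entity}"
    key = f"{prefix}:{canonical}"
    if len(key.encode()) > MAX_KEY_BYTES:
        # Long composite keys are replaced by a fixed-size digest of the parameters.
        digest = hashlib.blake2b(canonical.encode(), digest_size=16).hexdigest()
        key = f"{prefix}:{digest}"
    return key
```

The parameters are deliberately not serialized as JSON: on Redis Cluster, text between braces in a key is treated as a hash tag and changes which slot the key lands on.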

Eviction and Capacity Planning

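Redis with maxmemory set and a sensible eviction policy is non-negotiable in production. The defaults (no memory limit on 64-bit builds, and the noeviction policy) are wrong for a cache: memory grows until the host runs out, or writes start failing once a limit is reached.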

allkeys-lfu is the right policy for most general caches — frequency-aware, robust to scan workloads.
allkeys-lru is fine for workloads with strong recency bias.
volatile-lru (evicts only keys with a TTL) is useful when you mix transient cache and durable data in the same instance — though mixing those is usually a mistake.

Plan capacity for working set, not total data. The cache’s job is to fit the hot set in memory; cold data is fine in the origin.
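A back-of-the-envelope sizing helper shows what planning for the working set looks like. The per-key overhead and the headroom percentage are assumptions chosen for illustration; actual overhead depends on the Redis version, the encoding, and the data type, so measure on a sample of real data before committing to a figure.

```python
def working_set_bytes(hot_keys: int, avg_key_bytes: int, avg_value_bytes: int,
                      per_key_overhead_bytes: int = 64, headroom_pct: int = 30) -> int:
    """Rough maxmemory estimate for the hot set, using integer arithmetic.

    per_key_overhead_bytes and headroom_pct are assumed values, not Redis constants.
    """
    per_key = avg_key_bytes + avg_value_bytes + per_key_overhead_bytes
    raw = hot_keys * per_key
    return raw * (100 + headroom_pct) // 100
```

For example, one million hot keys with 40-byte keys and 500-byte values come out to roughly 785 MB under these assumptions, regardless of how large the full dataset in the origin is.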

Observability

Cache instrumentation is cheap and tells you almost everything you need to know about cache health. Track per cache name:

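Hit rate. A cache with a hit rate below 80% is usually misconfigured or wrongly placed. Below 50%, check whether it is paying for itself: every miss adds a cache round-trip on top of the origin call, so the cache only lowers mean latency when the hit rate exceeds the ratio of cache latency to origin latency.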
Get/set latencies (p50/p95/p99). Spikes here usually mean a Redis bottleneck.
Stampede counts. Coalesced requests vs. originating requests per key class.
Eviction rate. Sustained evictions mean undersized cache or unbounded keyspace.
Memory usage and key count. Per database, per prefix.

A simple Grafana dashboard with these five panels per cache prevents most incidents.
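The counters behind those panels are simple to keep. A sketch of per-cache stats plus the break-even hit rate for a look-aside cache; both names are hypothetical, and a real service would export these through its metrics library instead of holding them in a dataclass.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    coalesced: int = 0   # requests that waited on another request's fetch
    evictions: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def break_even_hit_rate(cache_ms: float, origin_ms: float) -> float:
    """Hit rate above which checking the cache first lowers mean latency.

    With the cache: cache_ms + (1 - h) * origin_ms. Without it: origin_ms.
    The cache wins when h > cache_ms / origin_ms.
    """
    return cache_ms / origin_ms
```

With a 1 ms cache in front of a 20 ms origin the break-even is only 5%, so a low hit rate is more often a sign of wasted memory and misplaced keys than of added latency.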

Failure Modes

A few that bite production systems:

Cache as source of truth. If your code path cannot survive an empty cache, you do not have a cache; you have a database with terrible durability.
Big-key problems. A single 50MB value blocks Redis’s single-threaded event loop on read or eviction. Split large values; consider a separate compressed-blob store for them.
Hot-key problems. A single key getting 100k req/s overwhelms one Redis shard. Mitigate with client-side caching (L1), key sharding (N copies stored as key:0 through key:N-1, with each reader picking one at random), or denormalizing the hot entity.
KEYS * and FLUSHDB in production. Both block the entire instance. Use SCAN for iteration and per-prefix invalidation strategies instead.
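Pipeline misuse. A pipeline is sent over one connection to one node, so it only helps for keys that live on that shard; batching commands across the cluster requires a cluster-aware client that splits the batch per node.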
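Key sharding for a hot entry can be sketched in a few lines. A dict stands in for the Redis cluster here, and the function names and the default of eight copies are hypothetical.

```python
import random


def shard_keys(key: str, n_shards: int) -> list[str]:
    """Names of the N copies of a hot key; different names usually map to different slots."""
    return [f"{key}:{i}" for i in range(n_shards)]


def write_hot(store: dict, key: str, value, n_shards: int = 8) -> None:
    for shard_key in shard_keys(key, n_shards):
        store[shard_key] = value     # writers update every copy


def read_hot(store: dict, key: str, n_shards: int = 8, rng=random):
    # Readers pick one copy at random, spreading the load across shards.
    return store.get(f"{key}:{rng.randrange(n_shards)}")
```

The cost is write amplification and a short window in which copies disagree, so this fits read-heavy entries that already tolerate bounded staleness.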

Closing

Layered caching is the difference between an API that costs $200k/month in database compute and one that costs $20k/month and runs faster. The mechanics are not subtle — L1 in-process for nanosecond reads, L2 in Redis for cross-instance sharing, look-aside with explicit invalidation, single-flight or XFetch for herds, SWR for non-blocking refresh, pub/sub for L1 invalidation, and disciplined keys with disciplined TTLs. What is subtle is the consistency: every cached value is a copy that can go stale, and the design has to make peace with that explicitly. Choose your TTLs as a correctness budget, instrument hit rates per cache, and treat cache invalidation as the named-hard-problem it is. The rest is implementation.