Scaling REST APIs to Sub-Second Latency Under Load
A REST API serves 50ms p95 with 100 RPS, then the traffic graph goes vertical at 09:00 and the p95 climbs to 4 seconds while every dashboard turns red. The team adds replicas; latency stays bad. Adds CPU; stays bad. Doubles the database; finally improves, then degrades again at the next traffic peak. This sequence is the most common operational story in backend engineering, and it nearly always traces back to a small number of well-understood problems: connection pool exhaustion, unbounded concurrency, N+1 queries, and unbounded blocking work inside request handlers. Fix those four and most “scaling” problems disappear without adding capacity.
This post is about the mechanics of holding sub-second latency under load — what the bottlenecks actually are, where to instrument them, and how to design APIs that degrade gracefully rather than catastrophically.
Latency Is a Budget, Not a Measurement
A 500ms p95 budget for an API call is not a single number — it is a sum of stages, each with its own contribution:
| Stage | Typical contribution |
|---|---|
| TLS + TCP handshake (cold) | 50–150ms |
| Load balancer / reverse proxy | 1–5ms |
| Application request parsing | 1–10ms |
| Auth / middleware | 5–50ms |
| Business logic | 5–100ms |
| Database queries | 5–200ms |
| External API calls | 50–2000ms |
| Response serialization | 1–20ms |
Hold yourself to a budget per stage. If “external API calls” alone consumes 1500ms, no amount of replicas fixes it; you need a cache, a circuit breaker, or an async pattern. Latency is debugged by accounting, not by adding hardware.
Concurrency Is Where Latency Comes From
Under low concurrency, every stage runs in isolation and latency is dominated by service time. Under high concurrency, latency is dominated by wait time: requests queued behind other requests competing for shared resources — CPU, threads, DB connections, file descriptors. The math is Little’s Law:
L = λ × WOutstanding requests = arrival rate × time in system. If you cannot serve a request in 100ms and 1000 RPS arrives, you have 100 in flight at any moment. If your server has 50 worker threads, half of every request is spent queueing for a thread.
The point: capacity planning is concurrency planning. The bottleneck is whichever resource saturates first — worker threads, DB connections, FDs, memory bandwidth, or event-loop CPU. The fix is to identify it and either raise the limit, reduce per-request consumption, or shed load.
Connection Pools: The Most Common Bottleneck
A 32-vCPU box with 16 worker processes × 32 threads per process can theoretically handle 512 concurrent requests. If those workers share a database connection pool with 20 connections, your effective concurrency is 20. The other 492 requests wait — and your DB latency monitoring shows nothing wrong with the database.
Rules that hold up:
- Pool size per instance = expected concurrent in-flight queries, not request count. Most requests spend a fraction of their time inside a query.
- Total pool size across all instances ≤ database max_connections × 0.8. Leave headroom for migrations, replicas, and admin connections.
- Use a connection pooler (PgBouncer for Postgres, ProxySQL for MySQL) when instance count × per-instance pool exceeds the database’s connection limit. Transaction-mode pooling is the standard answer for stateless services.
- Set an acquisition timeout. A request waiting forever for a connection is a request that fails the user’s deadline; failing fast is better. 500ms–2s is typical.
- Monitor pool utilization. Sustained utilization > 80% is a leading indicator of approaching saturation.
pool = asyncpg.create_pool( DB_URL, min_size=4, max_size=20, timeout=2.0, command_timeout=5.0, max_inactive_connection_lifetime=300,)Per-query timeouts are equally important. A runaway query that holds a connection for 30 seconds blocks 30 seconds of throughput.
N+1 Queries: The Quiet Killer
The single most common latency regression in production is an N+1: fetching a list of N records and then fetching related data with N additional queries. A 10ms list + 50 × 5ms detail queries = 260ms instead of 15ms.
Detect them with:
- ORM logs / slow query logs in development.
- Per-request DB query counts in tracing.
- An alert on “requests issuing > 20 queries.”
Fix with proper joins, IN (...) batch queries, or DataLoader-style batching:
class UserLoader: async def load_many(self, ids: list[int]) -> list[User]: rows = await db.fetch( "SELECT * FROM users WHERE id = ANY($1::int[])", ids ) index = {r["id"]: User.from_row(r) for r in rows} return [index[i] for i in ids]The right primitive depends on the stack — GraphQL servers use DataLoader explicitly; ORM-based services use eager loading or projections. Either way, every list endpoint should produce a fixed, small number of queries regardless of result size.
The Tail Latency Problem
Average latency is meaningless under load. p99 and p99.9 are what matters because a service serving 1000 RPS has a request hitting its tail every second.
Sources of tail latency, in rough order of frequency:
- Garbage collection pauses in JVM/Go/Node services.
- Connection establishment (TLS handshakes, TCP slow start) when long-lived connections are absent.
- Lock contention on hot in-process locks (the dreaded global mutex).
- Cross-AZ network jitter between application and database.
- Slow disks under fsync in databases.
- Eviction storms in caches.
Mitigations are surprisingly consistent: connection keep-alive at every hop, JIT/GC tuning where applicable, lock-free data structures or per-shard locks for hot paths, latency hedging for read-only operations that can be retried against another replica.
Hedged requests are the most underused tool here:
async def hedged_read(operation, hedge_after_ms=50): primary = asyncio.create_task(operation()) try: return await asyncio.wait_for(asyncio.shield(primary), hedge_after_ms/1000) except asyncio.TimeoutError: secondary = asyncio.create_task(operation()) done, _ = await asyncio.wait([primary, secondary], return_when=asyncio.FIRST_COMPLETED) return done.pop().result()For idempotent reads against a multi-replica system, hedging cuts p99 by 5–10x at <10% extra load.
Statelessness Is Non-Negotiable
A stateless service is one where any request can be served by any instance. State (sessions, in-memory caches, request-scoped contexts) lives outside the request lifecycle in a shared store — Redis, the database, the JWT itself.
Statelessness enables:
- Horizontal scaling without sticky sessions.
- Trivial blue/green and rolling deployments.
- Crash safety: a dead instance loses zero data.
- Effective load balancing: requests can be distributed by least-connections, not by session affinity.
Sticky sessions are an anti-pattern in 2026 except for very specific cases (WebSocket connections, large in-memory caches that cannot be externalized). For REST APIs, push session state to JWTs or Redis, and let any instance handle any request.
Backpressure and Load Shedding
A service that accepts every incoming request will eventually queue itself to death. The honest answer is to refuse requests when overloaded.
Mechanisms:
- Concurrency limits at the server level (
uvicorn --limit-concurrency, TomcatmaxThreads, fastifyconnectionTimeout). - Bulkheads — separate pools per dependency so a slow downstream cannot consume all your capacity.
- Circuit breakers on external calls (
pybreaker,resilience4j). Fail fast when a dependency is degraded; recover when it heals. - Request priorities. Health-check and admin endpoints in their own pool; high-priority business traffic prioritized over batch.
- Adaptive concurrency — algorithms like Netflix’s
concurrency-limitslibrary adjust limits dynamically based on observed latency.
Returning 503 Service Unavailable with Retry-After is correct behavior under overload. It is also far better than letting requests timeout at 30 seconds — the client retries pile on the same overloaded service and amplify the problem.
Async vs Sync: Choose Deliberately
For I/O-bound APIs, async runtimes (Node.js, FastAPI on uvicorn, Go, .NET async, JVM Project Loom) sustain much higher concurrency than thread-per-request servers. A single async process can hold 10,000+ concurrent connections.
But async is not free:
- Blocking calls poison the event loop. A synchronous database driver in an async handler stalls every other request on the same worker. Use async drivers everywhere or run blocking work in a thread pool.
- CPU-bound code still benefits from process-level parallelism. The event loop is single-threaded per process.
- Tracing and debugging is harder; tools must understand async context propagation.
For most modern REST APIs, async is the right default. For CPU-heavy services (image processing, ML inference), use a thread/process pool deliberately and treat async as a fan-out coordinator.
HTTP/2, Keep-Alive, and the Transport Layer
A surprising amount of “API latency” lives below the application:
- Always enable HTTP keep-alive. New TLS handshakes on every request cost 50–150ms cold.
- HTTP/2 between services lets one connection multiplex many requests; reduces head-of-line blocking and connection-pool pressure on intermediaries.
- TCP_NODELAY for low-latency RPC; Nagle’s algorithm buffers small writes and adds 40ms+ per call.
- Connection pooling in HTTP clients. A new
httpx.AsyncClient()per request is a 50ms tax per call. Instantiate once at startup; reuse for the process lifetime.
These are unglamorous fixes. They are also some of the cheapest latency wins available.
Caching as a Latency Strategy
Caching moves the bottleneck from “things the database can do per second” to “things the cache can do per second” — which is usually one to two orders of magnitude higher. Patterns covered elsewhere; the latency-specific points:
- In-process LRU for tiny, hot, near-immutable data. Sub-microsecond reads.
- Redis for shared cache. Sub-millisecond reads if co-located, 1–3ms cross-AZ.
- HTTP caching at the edge for public endpoints.
Cache-Control: public, s-maxage=60puts a CDN between you and most of your traffic. - Computed result caching for expensive queries. Pre-aggregate; serve the aggregate from cache; refresh asynchronously.
The right cache eliminates the most expensive stage from the latency budget entirely.
Observability: You Cannot Fix What You Cannot See
For a sub-second-p95 API, the minimum observable footprint is:
- RED metrics per endpoint: Rate, Errors, Duration (with histograms for percentiles).
- Resource utilization: CPU, memory, FD count, GC pause time, event-loop lag.
- Dependency latency: per-database, per-cache, per-external-API duration histograms.
- Pool utilization: DB connections, HTTP client pools, thread pools.
- Distributed traces with span attributes for query counts, cache hit/miss, and external call durations.
Without traces, the answer to “why is p95 bad?” is guesswork. With traces, it is a query. Invest in OpenTelemetry instrumentation early; the cost compounds in your favor over the system’s lifetime.
Production Realities
A few patterns that show up repeatedly in real systems:
- Cold starts. First request to a new instance is 5–20x slower than steady state — JIT warm-up, connection pool growth, lazy initialization. Warm pools at startup; load-balance away from cold instances briefly after they come up.
- Bimodal latency distributions. Most requests fast, a small tail slow. Almost always a cache miss path, a rare code branch, or a dependency on a slow downstream. Find the bimodality in your histograms; it tells you where to look.
- Latency != availability. A service that returns 200ms p95 with 0.5% errors is far worse than 300ms p95 with 0.01% errors for most users. Optimize for the SLO that matters, not the metric that looks good.
- Capacity is bounded by your worst dependency. Latency under load is set by the slowest synchronous call in the critical path. Removing it (with a cache, async pattern, or precomputation) usually beats scaling the rest of the system.
Closing
Holding sub-second latency under load is a discipline of accounting, not a feat of engineering. The bottlenecks are well-known: connection pools sized below concurrency, N+1 queries, blocking I/O in async runtimes, unbounded queues without backpressure, and naive HTTP transport. Instrumentation makes them visible; concurrency limits, hedging, caching, and stateless design make them tractable. Add capacity last, not first — most “scaling problems” are concurrency problems with hardware-shaped patches. Build the API so each stage stays inside its slice of the latency budget, treat overload as a planned condition rather than an accident, and the system stays predictable at every traffic regime instead of failing dramatically at exactly the worst moment.