Database Performance Tuning for High-Throughput APIs

A “slow database” is almost always a misunderstood database. The default Postgres instance can comfortably serve tens of thousands of simple queries per second on modest hardware. When an API saturates a database at much lower throughput, the cause is rarely the database engine itself — it’s the access pattern, the indexes, the connection management, or the queries the application is issuing. Buying a bigger instance moves the problem; fixing the workload eliminates it.

This post is about the diagnostic and tuning workflow that actually moves database performance: how to find what’s slow, what to do about each common pattern, and where the genuinely unavoidable trade-offs sit.

The Throughput Equation

A database’s throughput is bounded by:

throughput ≤ min(IO_throughput, CPU_throughput, connection_limit) / cost_per_query

Every optimization is either reducing the cost per query (faster plans, fewer rows touched, less data transferred) or raising one of the limits (faster disks, more CPU, more connections).

The diagnostic question is almost always: which limit are we hitting? Get that right and the rest of the work follows.

Step One: Measure, Don’t Guess

The first move in any database tuning effort is to look at the database’s own telemetry, not the application’s.

For Postgres, the views that matter:

pg_stat_statements. Per-query aggregate statistics — total time, mean time, calls, rows. Find the queries consuming the most cumulative time. Almost always reveals a small number of queries responsible for the majority of database load.
pg_stat_activity. Live view of what’s happening right now. Long-running queries, locks held, idle-in-transaction sessions.
pg_locks. Active locks. Useful when investigating contention.
pg_stat_user_tables / pg_stat_user_indexes. Table-level scan counts, index usage, dead tuple counts.

A productive workflow:

pg_stat_statements ordered by total_exec_time DESC LIMIT 20 — these are the queries that matter.
For each, EXPLAIN (ANALYZE, BUFFERS) to see the plan and the actual cost.
Decide: index, query rewrite, schema change, or “leave it alone.”

The BUFFERS clause shows shared buffer hits vs. disk reads — the difference between a query that’s fast because data is in memory and one that will be slow when working set exceeds RAM.

Indexing: The Highest-Leverage Tool

A correctly-indexed query is often 100–1000x faster than the unindexed equivalent. A working set of principles:

Index for selectivity, not for columns. An index on a column with two values (status = active|inactive) is rarely useful; the planner will choose a sequential scan when half the rows match.
Composite indexes are ordered. (tenant_id, created_at) supports WHERE tenant_id = ? AND created_at > ? and WHERE tenant_id = ? but not WHERE created_at > ?.
Covering indexes (Postgres: INCLUDE) keep additional columns in the index so queries can be answered without visiting the heap.
Partial indexes for narrow conditions: CREATE INDEX ... WHERE status = 'pending' is dramatically smaller and faster than indexing the whole table.
Expression indexes for query-time transformations: CREATE INDEX ... ON users (lower(email)) to support case-insensitive lookups.
GIN/GiST indexes for arrays, JSONB, full-text search, geometric data. Different mechanics; powerful where applicable.

The temptation is to index everything. The cost is real:

Each index adds write amplification on every insert/update/delete.
Each index consumes disk and memory.
More indexes slow INSERT throughput; on write-heavy tables, the marginal index is sometimes net negative.

The rule: add indexes deliberately, justified by a real query. Periodically review pg_stat_user_indexes for indexes with low idx_scan counts and remove them.

`CREATE INDEX CONCURRENTLY`

In Postgres, CREATE INDEX takes an ACCESS EXCLUSIVE lock that blocks all reads and writes on the table until complete. On a multi-GB table this is an outage. Always use CREATE INDEX CONCURRENTLY in production — it builds the index in two passes without blocking writes, at the cost of taking longer.

The trade-off: a concurrent index build can fail (e.g., on constraint violations) and leave an invalid index behind. Check pg_index.indisvalid after concurrent builds; drop and retry if invalid.

N+1 Queries: The Most Common Performance Bug

A query that returns N records, followed by N additional queries to fetch related data, is the single most common performance regression in production APIs.

Detection:

Slow query logs showing thousands of similar queries with different IDs.
Per-request query count in tracing.
Pg_stat_statements showing extremely high call counts on a “simple” query.

Fixes, in order of preference:

Join in SQL. SELECT u.*, p.* FROM users u JOIN profiles p ON p.user_id = u.id WHERE u.id = ANY($1) returns everything in one round trip.
IN (...) batch. SELECT * FROM profiles WHERE user_id = ANY($1) then assemble in application code.
DataLoader-style batching. Useful for GraphQL servers; coalesce per-request item fetches into a batch on each tick.

ORM users often hit this through lazy-loaded relationships. Eager loading directives (prefetch_related in Django, joinedload in SQLAlchemy, Include in EF Core) are the per-query fix; ORM-level discipline (lint rules, query count tests) is the long-term fix.

Query Plans Worth Understanding

EXPLAIN ANALYZE output is dense, but a few patterns are worth recognizing on sight:

Sequential Scan on large table. The planner is reading every row. Usually a missing index or low selectivity.
Index Scan with high Rows Removed by Filter. The index narrowed the candidate set but most rows didn’t pass the filter. Consider a composite or partial index.
Bitmap Heap Scan with high Heap Blocks: lossy=…. Bitmap is overflowing work_mem and falling back to lossy mode. Raise work_mem for the session, or rewrite the query.
Nested Loop with many outer rows. Almost always a slow path. The planner picked it because it underestimated cardinality.
Hash Join with disk-spilled hash. work_mem too small for the join.
Sort with external merge. work_mem too small for the sort.

Most of these point to one of: a missing index, a query rewrite opportunity, or insufficient work_mem for the query class.

Statistics

The planner depends on statistics from ANALYZE. Out-of-date stats produce bad plans. Auto-vacuum handles this on most tables but can lag on hot ones; force ANALYZE table_name after large data loads or schema changes.

For columns with skewed distributions (one tenant has 90% of the rows), the default 100-bucket histogram is too coarse. ALTER TABLE t ALTER COLUMN tenant_id SET STATISTICS 1000 increases the resolution and often dramatically improves plans on multi-tenant tables.

For correlated columns (e.g., city always implies country), extended statistics (CREATE STATISTICS ... ON city, country FROM t) help the planner estimate combined selectivity.

Connection Management

Postgres processes (one per connection) carry ~5–10 MB of memory and have non-trivial setup cost. Beyond a few hundred connections, the database itself spends meaningful time managing connections rather than running queries.

The standard solution: PgBouncer (transaction mode) in front of Postgres.

Application servers maintain large pools (1000+ logical connections) to PgBouncer.
PgBouncer maintains a small pool (50–200) to Postgres.
Connections are checked out from Postgres only for the duration of a transaction, then returned to the pool.

Constraints of transaction-mode pooling:

No session-level state. SET LOCAL works inside a transaction; SET outside a transaction does not survive the next checkout.
No prepared statements at the protocol level (until Postgres 17 + PgBouncer 1.21+, which support them via the pool). Many drivers handle this transparently; some require disabling client-side prepared statement caching.
No advisory locks across statements. They release when the transaction ends.
No LISTEN/NOTIFY — sessions are non-persistent.

The cost is real but acceptable for most workloads. The benefit is that Postgres serves orders of magnitude more clients without resource exhaustion.

Read Replicas and Read/Write Splitting

For read-heavy workloads, read replicas (streaming replication in Postgres, read replicas in RDS / Aurora) absorb most read traffic away from the primary.

Eventual consistency. Replicas lag the primary by milliseconds to seconds. Read-your-write expectations need attention — a freshly written record may not yet be on the replica. Common solutions: route writes + immediate reads to the primary; tag a request with must_read_primary = true for a short window after a write.
Replica lag monitoring is non-negotiable. A replica more than seconds behind is serving stale data; routing reads to it produces user-visible bugs.
Replicas are not free. Each replica receives every WAL change; high write rates mean replicas need similar I/O capacity to the primary.

For applications with stricter consistency needs, multi-master setups (BDR, Spanner-style) or single-primary writes with synchronous replication are options. Each carries its own latency and operational cost.

Aurora and Managed Databases

For AWS-deployed systems, Aurora is often the right answer. The relevant properties:

Storage and compute are separated. Storage is a distributed log-structured volume across 6 copies in 3 AZs.
Replication is faster. Replicas read from the same storage, no WAL shipping.
Failover is faster. Typically 30–60 seconds, sometimes less.
Auto Scaling for read replicas. Up to 15 read replicas, scaled on CPU or connection count.
Aurora Serverless v2. Per-ACU billing, scales in seconds. Right answer for spiky workloads; meaningfully more expensive per ACU than provisioned for steady-state workloads.

The trade-off is operational opacity — you don’t tune storage parameters, you can’t directly inspect WAL behavior, you depend on AWS’s choices about replication tuning. For most teams this is a feature.

Writes: The Other Half

Write-heavy workloads have a different tuning profile:

Batching. A single INSERT of 1000 rows is dramatically faster than 1000 single inserts. Use COPY for bulk load; multi-row INSERT for medium batches.
Async commit. synchronous_commit = off for non-critical inserts trades durability for throughput; the latest commits are at risk on crash, but the database is otherwise consistent. Useful for high-volume telemetry.
Partitioning. For very large tables (>100GB), declarative partitioning by time or tenant keeps indexes smaller and enables partition pruning.
HOT updates. Postgres can avoid index updates if an update doesn’t change indexed columns and the new row fits on the same page. fillfactor = 80 leaves space for in-page updates. Significant write-throughput improvement on update-heavy tables.

Vacuum and Autovacuum

Postgres uses MVCC; updates and deletes leave dead row versions that need to be reclaimed by VACUUM. Autovacuum handles this in the background, but it’s misconfigured by default for high-write workloads.

The symptoms of autovacuum lagging:

Table bloat (dead tuples making the table larger than its live data).
Slowing sequential scans on the bloated tables.
Sudden long-running vacuums that block other operations.
Transaction ID wraparound warnings in the worst case.

The fixes:

autovacuum_vacuum_scale_factor = 0.05 (default 0.2) on hot tables, so vacuums run more often on smaller deltas.
autovacuum_max_workers = 5+ so high-churn tables don’t queue.
Monitor pg_stat_user_tables.n_dead_tup and the ratio of dead to live tuples.
For severely bloated tables, pg_repack rewrites them online without locking.

Common Anti-Patterns

A few patterns to recognize:

SELECT * everywhere. Drag back columns you don’t need; consume bandwidth; defeat covering indexes. Project only what’s needed.
OFFSET pagination on large tables. OFFSET 10000 LIMIT 20 reads 10020 rows and discards 10000. Use keyset pagination (WHERE id > $last_id ORDER BY id LIMIT 20) instead.
JSON columns for queryable data. Storing JSON is cheap; querying it efficiently requires GIN indexes and care. If you regularly query specific fields, those fields belong in real columns.
COUNT(*) on huge tables. Postgres has no shortcut; it walks the table (or an index). For approximate counts, pg_class.reltuples is good enough. For exact counts, maintain a counter via triggers or use a separate analytics path.
Long-running transactions. Hold locks, prevent autovacuum, accumulate WAL. Set statement_timeout and idle_in_transaction_session_timeout aggressively.

Closing

High-throughput databases are not the result of bigger hardware; they are the result of disciplined access patterns, correct indexes, sensible connection management, and ongoing maintenance. The diagnostic workflow is consistent: look at the database’s own statistics, find the queries that dominate cost, understand why each is slow, fix or accept it. The fixes are well-known — index the right columns with the right structure, eliminate N+1 patterns, batch writes, pool connections via PgBouncer, offload reads to replicas, keep statistics current, watch out for bloat. Most “we need a bigger database” conversations are really “we have N+1 queries we never noticed” or “our most expensive query has no useful index.” Get the access patterns right and the same hardware serves dramatically more traffic. Get them wrong and no instance size is enough.