Structured Logging Strategies for Distributed Systems

In a monolith, debugging often starts with grep and ends with a stack trace. In a distributed system, the same investigation requires correlating events across five services, three load balancers, and a couple of queues. Free-form text logs that worked for a single process collapse under this load. By the time the team is on its third “I need to see all logs for request X across all services” incident, structured logging has become non-optional.

This post is about doing it correctly — what the log schema should look like, how to propagate context across service boundaries, where the storage and cost trade-offs land, and what to log versus what belongs in metrics or traces.

The Case for Structured

A traditional log line looks like this:

2026-05-13 12:34:56 [INFO] orders.api - Created order 1289 for user 4421 totaling $42.99

Useful for a human reading one line. Useless when there are millions of them. To answer “how many orders did user 4421 place this week?” you have to parse user 4421 out of the message; to answer “what’s the p95 of order creation latency?” you have to extract the latency from somewhere else. Every query becomes a regex problem.

A structured log line carries the same information as a typed record:

{
  "timestamp": "2026-05-13T12:34:56.789Z",
  "level": "INFO",
  "service": "orders-api",
  "version": "1.4.2",
  "env": "production",
  "trace_id": "01HFG2K8...",
  "span_id": "...",
  "user_id": "4421",
  "tenant_id": "acme",
  "event": "order.created",
  "order_id": "1289",
  "total_cents": 4299,
  "currency": "USD",
  "duration_ms": 87
}

The downstream effect is dramatic: indexed fields make level:ERROR service:orders-api tenant_id:acme a sub-second query rather than a multi-minute regex over text. Correlation across services becomes a join on trace_id. Aggregations on numeric fields work natively.
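As a rough illustration of why typed fields matter, the sketch below answers a "how many orders per user" question and a latency question over a few JSON lines using only the standard library. The sample records and the helper name are made up for this example; in practice the log store runs these queries, not application code.

```python
import json
from collections import Counter

# Hypothetical sample: three JSON log lines in the shape shown above.
RAW_LINES = [
    '{"event": "order.created", "user_id": "4421", "duration_ms": 87}',
    '{"event": "order.created", "user_id": "4421", "duration_ms": 120}',
    '{"event": "order.created", "user_id": "9001", "duration_ms": 45}',
]

def orders_per_user(lines):
    """Count order.created events per user_id without any regex."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("event") == "order.created":
            counts[record["user_id"]] += 1
    return counts

counts = orders_per_user(RAW_LINES)
# Numeric fields aggregate directly; no text extraction step is needed.
slowest_ms = max(json.loads(line)["duration_ms"] for line in RAW_LINES)
print(counts["4421"], slowest_ms)
```

The same two questions against the free-form line would each need a hand-written pattern that breaks whenever the message wording changes.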

Schema: The Part That Earns Its Keep

A consistent schema is the difference between structured logs that scale and structured logs that just have more punctuation. The minimum set of fields every service should emit:

timestamp — ISO 8601 with timezone (always UTC).
level — DEBUG / INFO / WARN / ERROR / FATAL. Pick five and stop.
service — the emitting service name.
version — the build SHA or release tag.
env — dev / staging / production.
event — a short, machine-readable identifier for what happened (order.created, auth.failed, db.query.slow). Treat this as an enum.
message — a human-readable description. Optional but useful for ad-hoc investigation.
trace_id / span_id — for joining to traces.
Tenant/user/request context — tenant_id, user_id, request_id where applicable.
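To make the field list concrete, here is a minimal sketch of a formatter for Python's standard logging module that emits those base fields as one JSON object per line. The class name, the LEVEL_NAMES map, and the extra={"fields": ...} convention are assumptions made for this example, not an established API; a library such as structlog or python-json-logger would normally do this job.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the five-level vocabulary used in this post.
LEVEL_NAMES = {"WARNING": "WARN", "CRITICAL": "FATAL"}

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object with the base fields."""

    def __init__(self, service, version, env):
        super().__init__()
        self.static = {"service": service, "version": version, "env": env}

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname),
            **self.static,
            "event": record.getMessage(),
        }
        # Per-event fields are passed as extra={"fields": {...}} (a convention
        # invented for this sketch, not part of the logging module).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("orders-api", "1.4.2", "production"))
logger = logging.getLogger("orders.api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order.created", extra={"fields": {"order_id": "1289", "total_cents": 4299}})
```

Putting the static fields in the formatter rather than in each call site is what keeps service, version, and env from drifting between log statements.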

Beyond these, every log event has its own payload fields. The discipline that matters: always name fields the same way across the organization. userId, user_id, and uid for the same thing is the most common cause of log dashboards being useless.

Pick a naming convention (snake_case is the dominant choice in log ecosystems), document it, and enforce it in code review or via a structured logger that doesn’t let you write keys casually.
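One possible way to enforce the convention mechanically is a small key checker that runs in tests or CI. The sketch below is illustrative only: the alias table and the function name are hypothetical, and a real setup would load the canonical names from the shared schema document.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Hypothetical alias table: non-canonical spellings mapped to the agreed name.
KNOWN_ALIASES = {"userId": "user_id", "uid": "user_id", "tenantId": "tenant_id"}

def check_keys(fields):
    """Return a list of human-readable problems with the given log keys."""
    problems = []
    for key in fields:
        if key in KNOWN_ALIASES:
            problems.append(f"{key}: use {KNOWN_ALIASES[key]}")
        elif not SNAKE_CASE.match(key):
            problems.append(f"{key}: not snake_case")
    return problems

print(check_keys({"user_id": "4421", "uid": "4421", "orderTotal": 4299}))
```

The alias check runs before the pattern check because a key like uid is valid snake_case and would otherwise slip through.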

Logger Libraries

The libraries that produce structured output reliably:

Go: slog (stdlib, Go 1.21+), zap, zerolog.
Python: structlog, python-json-logger (for stdlib logging), loguru.
Node.js: pino, winston (with JSON formatter).
Java: Logback / Log4j2 with JSON layout.
Rust: tracing with tracing-subscriber JSON formatter.

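pino and zap deserve specific mention for being aggressively fast. In their own published benchmarks they are often several times faster than common alternatives, which matters in hot paths where logging itself can become the bottleneck. Measure with your own workload before treating any single number as given.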

Two configuration choices worth getting right:

JSON output, always. Pretty-printed in dev terminals via a separate formatter; raw JSON to stdout in production where it’s collected by the runtime.
Context binding. A child logger that carries trace_id, tenant_id, etc. so every line at that scope has them automatically. Manually adding context to every call is how you end up with inconsistent logs.

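import os
import time

import structlog

# db, DatabaseError, and current_trace_id() are assumed to be provided by the
# application's data-access and tracing layers; they are not defined here.
# Reading version and environment from env vars is likewise an assumption.
APP_VERSION = os.environ.get("APP_VERSION", "unknown")
ENV = os.environ.get("ENV", "dev")

# Service-wide fields are bound once at import time.
log = structlog.get_logger("orders.api").bind(
    service="orders-api",
    version=APP_VERSION,
    env=ENV,
)

async def create_order(req):
    # Request-scoped fields are bound once, then reused by every call below.
    log_ctx = log.bind(
        tenant_id=req.tenant_id,
        user_id=req.user_id,
        trace_id=current_trace_id(),
    )
    started = time.monotonic()
    log_ctx.info("order.create.start", item_count=len(req.items))
    try:
        order = await db.insert_order(...)
        elapsed_ms = round((time.monotonic() - started) * 1000)
        log_ctx.info("order.create.success", order_id=order.id,
                     duration_ms=elapsed_ms)
        return order
    except DatabaseError as e:
        log_ctx.error("order.create.failed",
                      error=str(e), error_code=e.code)
        raise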

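The bound logger carries its context with it: every call on log_ctx emits tenant_id, user_id, and trace_id automatically. Downstream functions see the same correlation fields only if they receive the bound logger or if the context is stored in context variables (structlog ships a contextvars module for this), which removes the need for explicit threading.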

Context Propagation Across Services

In a distributed system, a single user action spans many services. Correlating their logs requires that each service receive and emit the same correlation IDs.

The patterns that work:

OpenTelemetry trace context. Trace ID and span ID propagate through W3C traceparent and tracestate headers. Auto-instrumentation in OTel SDKs handles HTTP, gRPC, Kafka, and SQS for major languages. Every service emits logs tagged with the current trace_id and span_id.
Tenant/user context. Less standardized; typically propagated via your own auth context (JWT claims) and explicitly threaded into log binding.
Request ID at the edge. The first gateway generates a request_id, attaches it as X-Request-ID, and every downstream service includes it. Useful when traces are sampled and the request_id is not.

The discipline: middleware extracts incoming headers, binds them to the request-scoped logger context, and outgoing clients re-inject them. Every framework supports this — FastAPI middleware, Express middleware, Spring filters, gRPC interceptors.
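The extract-and-reinject step can be sketched without any framework. This is a simplified illustration of the traceparent layout (version, 32-hex trace ID, 16-hex parent span ID, flags) and ignores tracestate; the function names and the dictionary shape are assumptions for the example, and a real service should normally rely on the OpenTelemetry SDK's propagators instead.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract_context(headers):
    """Pull correlation IDs from incoming headers (keys assumed lowercased)."""
    ctx = {}
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    # All-zero trace or parent IDs are invalid and treated as missing.
    if match and match.group(2) != "0" * 32 and match.group(3) != "0" * 16:
        ctx["trace_id"] = match.group(2)
        ctx["parent_span_id"] = match.group(3)
        ctx["trace_flags"] = match.group(4)
    else:
        # No valid upstream trace: start a new one.
        ctx["trace_id"] = secrets.token_hex(16)
        ctx["trace_flags"] = "00"
    # Reuse the edge request ID if present, otherwise mint one.
    ctx["request_id"] = headers.get("x-request-id") or secrets.token_hex(8)
    ctx["span_id"] = secrets.token_hex(8)  # this service's own span
    return ctx

def inject_context(ctx):
    """Build the headers an outgoing client call should carry."""
    return {
        "traceparent": f"00-{ctx['trace_id']}-{ctx['span_id']}-{ctx['trace_flags']}",
        "x-request-id": ctx["request_id"],
    }

incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "x-request-id": "req-1",
}
ctx = extract_context(incoming)
print(inject_context(ctx))
```

The dictionary returned by extract_context is what a middleware would bind to the request-scoped logger, so trace_id, span_id, and request_id appear on every line for that request.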

Log Levels: A Brief Sermon

Almost every team has a log-level convention that is broken in the same way: everything is INFO. Below is a brief, opinionated taxonomy:

DEBUG — Inner-loop diagnostics; off in production by default; enabled selectively for an incident.
INFO — One-line-per-significant-event. Order placed, deploy started, cache invalidated. The level production engineers grep first.
WARN — Recoverable abnormality. Retry succeeded, fallback used, slow query observed.
ERROR — Failed operation requiring attention. Always actionable; if it isn’t, it shouldn’t be ERROR.
FATAL — Process is about to terminate. Rare; should trigger a process exit.

The discipline that earns its keep: ERROR alerts page; WARN alerts ticket; everything else is for investigation, not alerting. Mis-leveling produces alert fatigue.

What Not to Log

Logs are not a metrics system, not an audit log, not a database. A few categories that get logged when they shouldn’t:

High-cardinality events that should be metrics. “Request received” with full URL on every request inflates log volume without adding diagnostic value beyond what a request-rate metric would. Sample or omit; use metrics for counts.
PII and secrets. Passwords, tokens, credit card numbers, full user emails. Logs leak — through console access, log aggregation breaches, or accidental exports. Mask at the source. Regex post-processing is a leaky last-resort, not a primary defense.
Health-check noise. GET /healthz 200 from your load balancer 10x per second per instance is pure noise. Filter at the application or collector level.
Stack traces on expected errors. A 404 doesn’t need a stack trace. Reserve them for genuinely unexpected conditions.
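Masking at the source can be as simple as a function applied to the field dictionary before it reaches the logger. The sketch below assumes a hypothetical deny-list of key names and a simple email-masking rule; the right list depends on your own data model, and key-based redaction will not catch secrets embedded in free-text values.

```python
# Hypothetical deny-list; extend it to match your own schema.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}

def mask_email(value):
    """Keep the domain for debugging, hide the local part."""
    local, sep, domain = value.partition("@")
    return f"***@{domain}" if sep else value

def redact(fields):
    """Return a copy of the log fields that is safe to emit."""
    safe = {}
    for key, value in fields.items():
        if key in SENSITIVE_KEYS:
            safe[key] = "[REDACTED]"
        elif key == "email" and isinstance(value, str):
            safe[key] = mask_email(value)
        elif isinstance(value, dict):
            safe[key] = redact(value)  # recurse into nested payloads
        else:
            safe[key] = value
    return safe

print(redact({"user_id": "4421", "password": "hunter2", "email": "pat@example.com"}))
```

In a structured logger this would typically run as a processor or filter in the logging pipeline, so no call site can skip it.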

A common litmus test: “if I’m paged at 3am, will this log help?” If no, it probably shouldn’t be there, or it should be at DEBUG.

Collection and Pipeline

Logs go from process stdout to long-term storage through a pipeline. The current well-trodden options:

Fluent Bit — lightweight, the default in most Kubernetes deployments. Tails container logs, enriches with metadata, ships to a destination.
Vector — Rust-based, with throughput that is competitive with Fluent Bit and a richer transformation language.
Fluentd — older, written in Ruby (with C extensions), still widely used.

Destinations:

CloudWatch Logs — AWS-native, simple, expensive at high volume ($0.50/GB ingest).
Elasticsearch / OpenSearch — full-text search, fast queries, complex to operate at scale.
Loki — index labels only, not log content. Cheap; queries are slower than ES for full-text patterns but fast for label-filtered access. Pairs naturally with Prometheus + Grafana.
Datadog / Splunk / Honeycomb / Axiom — managed; pricing varies dramatically; capabilities and ergonomics differ.
S3 + Athena / ClickHouse — long-term archive with queryable access for compliance and post-mortems.

The architecture that ages well: collect at the node level with Fluent Bit or Vector, ship the hot tier (recent logs) to a search engine, and the cold tier (compliance, audit) to S3. Different teams query different tiers based on need.

Cost: The Conversation Everyone Avoids

Logging is often one of the largest line items on an observability bill, and the easiest to leave unmanaged. Strategies to keep it sustainable:

Sample DEBUG and INFO in production. Keep 100% of WARN and ERROR; sample lower levels at 1–10% with deterministic sampling per trace_id (so all logs for the same request are kept or dropped together).
Drop noisy sources at the collector. Health checks, robot user agents, unauthenticated 404s.
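Compress before shipping. Fluent Bit and Vector both support gzip / Zstandard compression, which cuts network transfer and egress cost. Check how the destination bills ingest: some charge on uncompressed bytes, in which case compression helps the network bill but not the ingest bill.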
Short retention for high-volume tiers. 7 days in hot search, 90 days in cold storage, archived to compliance store beyond that.
Drop fields, not events. A noisy field on a useful event can be dropped at the collector while keeping the event.
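The deterministic sampling rule above can be sketched as a pure function of level and trace_id. This is an illustrative version, assuming SHA-256 as the hash and a 5% default rate; should_keep and ALWAYS_KEEP are names made up for the example, and in production the same decision is usually expressed in the collector's configuration.

```python
import hashlib

ALWAYS_KEEP = {"WARN", "ERROR", "FATAL"}

def should_keep(level, trace_id, sample_rate=0.05):
    """Keep all WARN+ logs; keep lower levels for a fixed fraction of traces."""
    if level in ALWAYS_KEEP:
        return True
    # Hash the trace_id so every service makes the same decision for a request.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64

# DEBUG and INFO lines for the same trace are kept or dropped together.
print(should_keep("INFO", "trace-42") == should_keep("DEBUG", "trace-42"))
```

Because the decision depends only on the trace_id, no coordination between services is needed: each one computes the same bucket independently.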

Visibility into log cost should be a regular review, not an annual surprise. Tag log volume by service and team in the destination’s billing dimensions.

Searching and Investigating

The point of structured logs is that investigation becomes a query. Some patterns that earn their keep:

Always start by trace_id when you have one. Joins everything across services for that request.
Then filter by service + level + time range. The minimum to narrow the surface.
Use saved queries for common investigations. “Errors in checkout flow in the last hour by error_code” should be one click, not a re-write each time.
Build runbook links into alerts. Each alert should link to a saved query that returns the relevant logs.

For teams using Loki specifically, the label-first design changes how you think about queries: filter on {service="orders-api", level="ERROR"} first to get into a manageable stream, then |= text-match within. Don’t try to put every searchable thing into a label; that explodes cardinality and breaks Loki the same way it breaks Prometheus.

Audit Logs Are Different

A common mistake is conflating diagnostic logs with audit logs. They have different requirements:

Concern	Diagnostic Logs	Audit Logs
Audience	Engineers	Compliance, security
Retention	Days to months	Years
Mutability	Append-only conceptually	Immutable, tamper-evident
Schema	Evolves with the code	Stable, versioned
Volume	High	Low
Storage	Search-optimized	WORM / object lock

Audit events deserve their own pipeline: a separate logger, a separate destination, often a separate retention class. Mixing them produces a system that is too expensive for retention if treated as audit and too uncontrolled if treated as diagnostic.

Common Failure Modes

A few patterns I’ve seen repeatedly:

Inconsistent field naming across teams. Three different ways of identifying the same user. Solve with a central schema document, code-generated logger helpers, or a linter on log key names.
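Logging in tight loops. A log.debug inside a per-row loop can add tens of milliseconds to a request once the row count is large, and its arguments are often built even when DEBUG is disabled. Guard it with a level check or move the log outside the loop.
Re-serializing already-serialized JSON. Logging a JSON payload as a pre-serialized string produces escaped JSON inside JSON, which the log store cannot index. Parse it and log it as a structured field, not as a string.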
Synchronous log writes blocking the request path. All production loggers should be async or buffered; a network hiccup to the log collector should not affect user-facing latency.
No log rotation on disk-buffered loggers. A log collector outage fills the disk; the application crashes. Bound the buffer size.
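The re-serialization problem is easy to reproduce with the standard json module. In this sketch the upstream_body value is a made-up response string; the point is only the difference between embedding it as text and embedding it as an object.

```python
import json

upstream_body = '{"status": "declined", "code": 51}'  # hypothetical response text

# Anti-pattern: the payload is embedded as a string, so it gets escaped again.
bad = json.dumps({"event": "payment.response", "payload": upstream_body})

# Better: parse first, then log the object as a structured field.
good = json.dumps({"event": "payment.response", "payload": json.loads(upstream_body)})

print(bad)
print(good)
```

In the first line the payload is an opaque string full of escaped quotes; in the second, payload.code is a numeric field that can be filtered and aggregated.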

Closing

Structured logging is one of those investments that compounds. The work — schema definition, context binding, propagation across service boundaries, collection pipeline, retention policy — is concentrated up front. The payoff is that every incident afterward is solvable by query, every metric anomaly can be drilled to specific events, and post-mortems are about decisions, not log archaeology. The mechanics are well-understood: JSON output, consistent schema, trace-context propagation, sampling and filtering at the collector, separation of diagnostic and audit pipelines. The discipline is harder: keeping field names consistent across teams, resisting the urge to log everything, removing logs that aren’t useful, and treating the schema as a contract rather than a suggestion. In a distributed system, logs are the only signal that survives every other failure — get them structured and consistent, and the rest of incident response gets dramatically faster.