Implementing Observability with Prometheus and Grafana

Observability is one of those topics where the vocabulary has gotten ahead of the practice. Most organizations have Prometheus and Grafana installed. Fewer have meaningful SLIs defined. Fewer still have alerts tied to SLO burn rate rather than threshold crossings. And only a small fraction can answer “is this service healthy right now?” with a single dashboard panel that everyone trusts. The gap between “we have monitoring” and “we have observability” is mostly that unglamorous work: defining SLIs, tying alerts to them, and building dashboards people trust.

This post is about the operational shape of a Prometheus and Grafana stack that actually supports running production systems — what to instrument, how to structure metrics, how to write alerts that don’t burn out the on-call rotation, and where the common pitfalls live.

The Three Signals Worth Instrumenting

Every production service needs three classes of signals, and most need exactly three:

RED metrics for request-driven services: Rate, Errors, Duration. The user-facing health of an API or function.
USE metrics for resource-driven systems: Utilization, Saturation, Errors. The health of a CPU, disk, queue, or thread pool.
Business / SLI metrics tied to user outcomes: checkout success rate, search relevance, deploy frequency, time-to-first-token.

A service can be 100% healthy on RED and 100% broken on the business signal: every request returns quickly with a 200, yet the user couldn’t complete a checkout because the recommendation service returned poor results. The reverse is also possible. Instrument both the technical signals and the business ones.

Brendan Gregg’s USE method and Tom Wilkie’s RED method are the canonical references; treat them as a checklist when bringing a new service into observability.

The Prometheus Model

Prometheus is a pull-based, dimensional time-series database. Two properties shape how you use it:

Pull-based collection. Prometheus scrapes targets on a schedule (typically 15–60s). Short-lived batch jobs that exit before they can be scraped push their final metrics to the Pushgateway, which is intentionally limited; Prometheus is not designed as an event store, so per-event data belongs in logs or traces.
Dimensional model with labels. A metric (http_requests_total) is a family of time series indexed by labels (method, status, route). High-cardinality labels (user_id, request_id) explode the time series count and can exhaust Prometheus’s memory.

For production at scale, raw Prometheus has limits: single-node storage, retention measured in weeks not years, federation that doesn’t always behave well. The community answers are Thanos, Cortex, and Mimir, which add long-term storage, horizontal scale, and global query. Pick one and ship data to it (remote_write for Cortex, Mimir, or Thanos Receive; the Thanos sidecar uploads TSDB blocks to object storage instead); keep local Prometheus instances as scrape-and-evaluate nodes.

Metric Types and When to Use Them

Prometheus exposes four metric types. The right choice matters more than people credit.

Counter. Monotonically increasing value that only goes back to zero when the process restarts. http_requests_total, bytes_sent. Always query via rate() or increase() over a time window; both treat a drop in the raw value as a reset.
Gauge. Value that can go up and down. cpu_usage_percent, queue_depth, in_flight_requests.
Histogram. Observations into pre-defined buckets, plus a sum and count. The right tool for latency. Bucket boundaries determine the percentiles you can compute.
Summary. Percentiles pre-computed in the client. Cheap to query on the server, but more expensive to observe on the client and impossible to aggregate across instances. Use histograms instead in almost every case.
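
To make the counter rule concrete, here is a minimal Python sketch of why raw counter values are never read directly. It is a simplification, not Prometheus's implementation: the real rate() and increase() also extrapolate to the window boundaries, which this sketch leaves out. The function names and the sample values are made up for illustration.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across raw counter samples, treating any drop as a reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went backwards: the process restarted and began
            # counting from zero again, so the whole new value is new work.
            total += curr
    return total


def per_second_rate(samples: Sequence[float], window_seconds: float) -> float:
    """Average per-second rate over the window covered by the samples."""
    return counter_increase(samples) / window_seconds


# A counter scraped every 15s that restarts between the third and fourth sample.
scraped = [100.0, 130.0, 160.0, 20.0, 50.0]
print(counter_increase(scraped))                  # 30 + 30 + 20 + 30 = 110.0
print(round(per_second_rate(scraped, 60.0), 2))   # 1.83
```

Subtracting the first sample from the last would give -50 here, which is why a counter graphed without rate() or increase() is almost always a mistake.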

The histogram bucket trap: percentiles from histograms are interpolated within buckets. With [0.1, 0.5, 1, 2.5, 5, 10] second buckets, you cannot accurately compute p99 for a service whose actual p99 is 250ms — the answer is “somewhere between 100ms and 500ms.” Pick buckets that span your actual distribution and densely cover the percentiles you care about.
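
The trap is easier to see with numbers. The sketch below is a simplified Python version of the linear interpolation that histogram_quantile() performs on classic buckets; it assumes cumulative counts, a final +Inf bucket, and at least one finite bucket, and the traffic mix is invented for the example.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Nothing to interpolate towards: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


bounds = [0.1, 0.5, 1.0, 2.5, 5.0, 10.0, math.inf]
# 1000 requests: 900 finish within 100ms, the other 100 all take about 250ms,
# so the true p99 is about 0.25s.
cumulative = [900, 1000, 1000, 1000, 1000, 1000, 1000]
estimate = histogram_quantile(0.99, list(zip(bounds, cumulative)))
print(round(estimate, 3))  # 0.46
```

The estimate lands at 0.46s because the interpolation only knows that the slow requests fell somewhere in the 0.1 to 0.5 bucket. Adding boundaries at, say, 0.2 and 0.3 would pull the estimate much closer to the real value.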

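Prometheus native histograms collapse the bucket-design problem by using a sparse, exponential bucket representation whose resolution is configured rather than hand-picked. They started life as an experimental feature, so check their status in your Prometheus version and client library first. If your scrape target supports them, prefer native histograms.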

Cardinality: The Quiet Killer

Every unique combination of label values is a separate time series. An http_requests_total metric with method (5 values), status (15 values), route (200 values), and user_id (1M values) can produce up to 5 × 15 × 200 × 1,000,000 = 15 billion time series. Each one consumes memory.

The rule: never put unbounded-cardinality values into labels. User IDs, request IDs, trace IDs, full URLs with path parameters — these belong in logs and traces, not in metrics. The right labels are bounded and meaningful: HTTP method, status class, route template, deployment color, environment, region.

A practical workflow:

Estimate cardinality before deploying a new metric: cardinality = ∏ |labels|. If the result exceeds ~10,000 per metric, redesign.
Monitor prometheus_tsdb_head_series for unexpected growth. Sudden spikes are usually a new label with high cardinality.
Set per-scrape limits in scrape_configs (sample_limit, label_limit, target_limit) so an oversized target fails its scrape loudly rather than degrading Prometheus silently.
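
The first step of that workflow fits in a few lines. This is a rough Python sketch of the worst-case estimate; the label sizes are the ones from the example above, the bounded variant (status_class instead of status, no user_id) is an assumed redesign, and the 10,000 budget is the rule of thumb from the text rather than a Prometheus limit.

```python
from math import prod
from typing import Dict

SERIES_BUDGET = 10_000  # rule-of-thumb ceiling per metric, not a hard limit


def worst_case_series(label_sizes: Dict[str, int]) -> int:
    """Upper bound on series count: the product of distinct values per label."""
    return prod(label_sizes.values())


def biggest_label(label_sizes: Dict[str, int]) -> str:
    """The label contributing the most distinct values."""
    return max(label_sizes, key=lambda name: label_sizes[name])


with_user_id = {"method": 5, "status": 15, "route": 200, "user_id": 1_000_000}
bounded = {"method": 5, "status_class": 5, "route": 200}

for name, labels in [("with_user_id", with_user_id), ("bounded", bounded)]:
    series = worst_case_series(labels)
    verdict = "ok" if series <= SERIES_BUDGET else f"redesign, start with {biggest_label(labels)}"
    print(f"{name}: {series:,} series ({verdict})")
```

The product is an upper bound: real traffic rarely hits every combination, but the estimate is cheap and it errs in the safe direction.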

Service Discovery and Relabeling

Static scrape_configs work for ten targets and fall apart at a hundred. Use service discovery:

Kubernetes SD. Automatically discovers pods, services, endpoints, ingresses. The standard for any K8s deployment.
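EC2 SD. AWS-native discovery of instances via the EC2 API. EKS workloads are normally discovered through Kubernetes SD; for ECS, check what your Prometheus version supports before assuming native discovery.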
Consul / DNS SD. For non-cloud or mixed environments.

Relabeling rules transform discovered targets: adding labels, filtering, rewriting the metrics path. The patterns that earn their keep:

relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: app
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "go_.*"
    action: drop

relabel_configs shape which targets get scraped and what labels they carry. metric_relabel_configs filter metrics post-scrape — useful for dropping noisy Go runtime metrics or high-cardinality offenders before they hit storage.
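
Relabeling semantics are easy to get subtly wrong, so here is a small Python model of the three actions used above. It is a sketch under assumptions, not Prometheus's implementation: it only handles keep, drop, and replace with the default replacement of the first capture group, and Python's re module stands in for the RE2 syntax Prometheus uses. The pod and metric names are invented.

```python
import re
from typing import Dict, List, Optional

Labels = Dict[str, str]


def apply_rule(labels: Labels, rule: dict) -> Optional[Labels]:
    """Apply one relabel rule; return None when the target or series is dropped."""
    # Source label values are joined with ";" (the default separator).
    value = ";".join(labels.get(name, "") for name in rule["source_labels"])
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(rule.get("regex", "(.*)"), value)
    action = rule.get("action", "replace")
    if action == "keep":
        return labels if match else None
    if action == "drop":
        return None if match else labels
    if action == "replace":
        if match is None:
            return labels
        # Simplification: always write the first capture group.
        return {**labels, rule["target_label"]: match.group(1)}
    raise ValueError(f"unsupported action: {action}")


def relabel(labels: Labels, rules: List[dict]) -> Optional[Labels]:
    """Run rules in order, stopping as soon as one drops the input."""
    current: Optional[Labels] = labels
    for rule in rules:
        current = apply_rule(current, rule)
        if current is None:
            return None
    return current


target_rules = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
]
metric_rules = [{"source_labels": ["__name__"], "regex": "go_.*", "action": "drop"}]

pod = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_namespace": "shop",
}
print(relabel(pod, target_rules)["namespace"])                    # shop
print(relabel({"__name__": "go_goroutines"}, metric_rules))       # None (dropped)
print(relabel({"__name__": "cargo_loaded_total"}, metric_rules))  # kept
```

The last line is the detail worth remembering: because the regex is anchored, "go_.*" drops go_goroutines but leaves cargo_loaded_total alone, even though the name contains "go_".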

Dashboards That People Actually Use

Grafana dashboards age badly. The patterns that keep them useful:

One dashboard per service, with a standard layout. Top row: SLI/SLO status. Second row: RED metrics. Third row: dependencies. Fourth: resource utilization. Below the fold: deep-dive panels.
Variables for environment, region, deployment. A single dashboard parameterized by environment beats five copies that drift.
Annotations from deploys and incidents. Mark deploys on every dashboard; correlation is the most common diagnostic question.
Heatmaps for latency, not single-percentile lines. histogram_quantile() lines hide bimodal distributions. Heatmaps show the full shape.
No more than 12 panels per dashboard. Beyond that, no one uses them; scroll fatigue kills the value.

The hardest dashboard discipline is deletion. Dashboards proliferate; old ones rot; new engineers can’t tell which dashboard is the canonical one. Periodically audit, consolidate, and delete.

For golden-signal dashboards, the kube-prometheus-stack and the mixin ecosystem (node-mixin, kubernetes-mixin, etcd-mixin) provide starting points that have been refined over years. Adopt them and customize.

PromQL Patterns Worth Knowing

A small set of PromQL idioms covers most production needs:

# Request rate per service over 5 min
sum by (service) (rate(http_requests_total[5m]))

# Error ratio (5xx only) per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# p95 latency
histogram_quantile(0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

# SLO burn rate: actual error ratio / budget allowed
(sum(rate(http_requests_total{status=~"5.."}[1h]))
 / sum(rate(http_requests_total[1h])))
 / (1 - 0.999)  # 99.9% availability SLO

# Disk utilization: used / capacity (parentheses required: / binds tighter than -)
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
  / node_filesystem_size_bytes

The most important pattern is the second one: error rate as a ratio, not absolute count. Absolute error counts vary with traffic; ratios are meaningful regardless of load.

For all percentile queries, use sum by (le, ...) (rate(... _bucket[5m])) — histogram_quantile over a non-aggregated bucket series produces per-instance percentiles, which are usually not what you want.

Recording Rules: Precompute the Expensive

Queries used in dashboards and alerts often recompute the same aggregation. Recording rules precompute these at evaluation time and store the result as a new series:

groups:
  - name: api-aggregates
    interval: 30s
    rules:
      - record: api:http_requests:rate5m
        expr: sum by (service, status) (rate(http_requests_total[5m]))
      - record: api:http_request_duration:p95
        expr: histogram_quantile(0.95,
              sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

Alerts and dashboards then query the precomputed series. Two benefits: alert evaluation is dramatically cheaper, and dashboard load times improve. The trade-off is one more layer to maintain.

SLOs and Alerts That Don’t Burn Out the Team

The dominant cause of alert fatigue is alerts that fire on conditions that don’t matter to users. The dominant cause of missed incidents is alerts that don’t fire on conditions that do matter. SLO-based alerting addresses both.

The model, derived from Google’s SRE practices and now standard:

Define SLIs. A measurable indicator of user-perceived behavior. “Fraction of API requests completing under 500ms with status 2xx.”
Define SLOs. A target on the SLI. “99.9% of requests over a 30-day window.”
Compute the error budget. 1 - SLO over the window. 0.1% of requests can fail in a 30-day window.
Alert on burn rate. Page when the budget is being consumed faster than acceptable.

- alert: HighErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
       / sum(rate(http_requests_total[5m]))
    ) > (14.4 * (1 - 0.999))
   and
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
       / sum(rate(http_requests_total[1h]))
    ) > (14.4 * (1 - 0.999))
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "Burning 30-day error budget in about 2 days or less"

The multi-window alert (short window for fast detection, long window for noise filtering) prevents single spikes from paging while catching sustained burns. The classic reference is Google’s Site Reliability Workbook, chapter “Alerting on SLOs”; the burn-rate thresholds (14.4x for fast burn, 6x for medium, 1x for slow) are well-tested.
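
The thresholds stop looking like magic numbers once the arithmetic is written down. The Python sketch below assumes the 99.9% SLO and 30-day window used above; the budget fractions (2% in 1 hour, 5% in 6 hours, 10% in 3 days) are the ones recommended in the Site Reliability Workbook and are not stated elsewhere in this post. Function names are made up for illustration.

```python
SLO = 0.999
WINDOW_DAYS = 30
WINDOW_HOURS = WINDOW_DAYS * 24  # 720


def allowed_bad_minutes(slo: float = SLO, window_days: int = WINDOW_DAYS) -> float:
    """Error budget expressed as minutes of total failure in the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate_for(budget_fraction: float, alert_window_hours: float,
                  slo_window_hours: float = WINDOW_HOURS) -> float:
    """Burn rate that consumes budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def days_to_exhaustion(burn_rate: float, window_days: int = WINDOW_DAYS) -> float:
    """How long the whole budget lasts if this burn rate is sustained."""
    return window_days / burn_rate


def error_ratio_threshold(burn_rate: float, slo: float = SLO) -> float:
    """Error ratio that corresponds to the burn rate, as used in the alert expression."""
    return burn_rate * (1 - slo)


print(round(allowed_bad_minutes(), 1))          # 43.2
print(round(burn_rate_for(0.02, 1), 1))         # 14.4 (fast burn)
print(round(burn_rate_for(0.05, 6), 1))         # 6.0  (medium burn)
print(round(burn_rate_for(0.10, 72), 1))        # 1.0  (slow burn)
print(round(days_to_exhaustion(14.4), 2))       # 2.08
print(round(error_ratio_threshold(14.4), 4))    # 0.0144
```

So the fast-burn condition in the rule above is simply "more than 1.44% of requests failing", sustained over both the 5-minute and the 1-hour window.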

Standard practice now is to define SLOs in code (sloth, pyrra, or hand-rolled rules), generate the recording and alerting rules from the SLO definition, and change SLOs through pull requests reviewed by both engineering and product.

What Not to Alert On

The corollary to “alert on user-impact”: stop alerting on things that aren’t.

High CPU. A service running at 90% CPU but serving requests under SLO is fine. Alert on the SLO, not the resource.
A single instance unhealthy. A pod restart in a fleet of 50 is normal. Alert on the fleet’s aggregate availability, not individual replicas.
Disk space at 80%. Trend-based alerts (“disk full in 24 hours at current growth”) are far more useful than threshold alerts.
Memory usage above X. Same logic. Alert on OOM-kills, not raw memory.
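
The trend-based disk alert mentioned above is a straight-line extrapolation, which is what PromQL's predict_linear() does. Here is a minimal Python sketch of the idea using ordinary least squares; it predicts relative to the last sample rather than the evaluation time, and the sample data (a volume losing 1 GiB of free space per hour) is invented.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and extrapolate.

    Needs at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


HOUR = 3600.0
# Free space in GiB, sampled hourly: 20, 19, 18, 17, 16, 15.
free_gib = [(i * HOUR, 20.0 - i) for i in range(6)]

in_4h = predict_linear(free_gib, 4 * HOUR)
in_24h = predict_linear(free_gib, 24 * HOUR)
print(round(in_4h, 2))    # 11.0
print(round(in_24h, 2))   # -9.0
print("page" if in_24h < 0 else "ok")  # page: the disk fills within 24 hours
```

A static 80% threshold would say nothing about this volume until it was nearly full; the projection flags it while there are still about 15 hours left to act.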

Every alert should answer “what user-visible thing is broken?” If you can’t answer that, the alert probably shouldn’t page.

Alertmanager Configuration

Alertmanager handles deduplication, grouping, silencing, and routing. The patterns that work:

Route by severity. page to PagerDuty, ticket to a Slack channel, info to email.
Group by service and alert name. A flood of “DB connection refused” alerts collapses to one notification per service.
Inhibition rules. Suppress dependent alerts (“API down” inhibits “API latency high”). Reduces noise during real incidents.
Silences for known issues. A documented runbook for a known problem deserves a silence, not a 4am page.
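
Grouping and severity routing are simple enough to model directly. The Python sketch below only illustrates the behavior; it is not how Alertmanager is configured, and the receiver names and alert payloads are hypothetical. In Alertmanager itself, the grouping key comes from the route's group_by setting.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, Dict[str, str]]

# Hypothetical receiver names, mirroring the severity routing described above.
RECEIVER_BY_SEVERITY = {"page": "pagerduty", "ticket": "slack", "info": "email"}


def group_alerts(alerts: List[Alert],
                 group_by: Tuple[str, ...] = ("service", "alertname")
                 ) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


def receiver_for(alert: Alert, default: str = "email") -> str:
    """Pick a receiver from the alert's severity label."""
    return RECEIVER_BY_SEVERITY.get(alert["labels"].get("severity", ""), default)


firing = [
    {"labels": {"alertname": "DBConnectionRefused", "service": "checkout",
                "instance": "checkout-1", "severity": "page"}},
    {"labels": {"alertname": "DBConnectionRefused", "service": "checkout",
                "instance": "checkout-2", "severity": "page"}},
    {"labels": {"alertname": "DBConnectionRefused", "service": "search",
                "instance": "search-1", "severity": "ticket"}},
]

groups = group_alerts(firing)
for (service, alertname), members in groups.items():
    print(f"{service}/{alertname}: {len(members)} alert(s) -> {receiver_for(members[0])}")
```

Three firing alerts become two notifications, one per service. Leaving instance out of the grouping key is what makes the flood collapse.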

Production Realities

A few things that bite people in production with Prometheus + Grafana:

Prometheus is single-tenant by default. Multi-tenancy comes from Cortex / Mimir or careful labeling. A flat Prometheus shared across teams becomes a noisy-neighbor nightmare.
Long-term storage is not optional. Local TSDB retention beyond a few weeks is impractical at scale. remote_write to Mimir / Thanos / managed Prometheus is the standard.
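Scrape staleness matters. When a target disappears from service discovery or a scrape fails, Prometheus marks its series stale, and a series that simply stops receiving samples drops out of query results after the 5-minute lookback window. A threshold alert on those series does not fire as “no data”; its expression returns nothing and the alert quietly resolves. Pair important alerts with an explicit check such as up == 0 or absent().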
Federation has limits. A federated query across 50 child Prometheus instances is slow. For aggregated views, use Cortex/Mimir’s global query rather than federation.
Grafana data sources should be templated. Hard-coded data source names in dashboards break when you change clusters. Use variables.
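Back up the Grafana database. Grafana stores dashboards, alerts, users, and settings in its database (SQLite by default, or MySQL or Postgres). Unless your dashboards are provisioned from files in version control, losing it is losing all your dashboards.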

Closing

A working Prometheus and Grafana stack is the product of fewer decisions than people think, made carefully. Instrument the three signal classes — RED, USE, business — with bounded-cardinality labels. Run Prometheus as a scraper and forward to a long-term backend for queries beyond a few weeks. Use service discovery and relabeling so adding a service to the observability stack is one annotation, not a config push. Build dashboards by service with a standard layout, parameterized by environment, and prune them. Define SLOs, alert on burn rate against the error budget, suppress everything else. Most of the operational pain teams attribute to “monitoring” is actually the cost of skipping one of these decisions. Get them right and observability becomes the substrate for the rest of operations — incident response, capacity planning, deploy confidence, on-call sustainability. Get them wrong and you have a graph collection people stop looking at.