CI/CD Pipelines for Zero-Downtime Deployments with GitHub Actions
A zero-downtime deployment is not a feature you enable — it is the cumulative effect of dozens of small decisions made consistently: how your application handles SIGTERM, what your readiness probe actually checks, how your load balancer drains connections, what your migration strategy is, how you roll back, and what your CI pipeline does between the moment a commit lands and the moment users see the change. GitHub Actions provides the orchestration; whether the deployment is actually zero-downtime depends on what you put in the pipeline.
This post is a working pattern for production-grade CI/CD on GitHub Actions, with the deployment-time concerns that actually determine whether users notice the release.
The Pipeline Shape
A serviceable CI/CD pipeline for a service has four stages, executed in this order:
- Build and test. Lint, type-check, unit tests, build artifact.
- Container build and push. Build image, tag with commit SHA, push to registry.
- Deploy to staging. Apply manifests/Helm chart, run smoke tests, integration tests against staging.
- Deploy to production. Rolling or canary deploy, monitor health, manual or automated promotion.
The interesting questions are at the seams: when does a stage block the next one, what triggers a rollback, and how is the production deployment performed without dropping connections.
Workflow Structure in GitHub Actions
The current best practice is reusable workflows plus composite actions for shared logic. A single repository owns the canonical CI/CD workflow; each service consumes it via workflow_call.
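On the callee side, a minimal sketch of what the shared deploy.yml might look like, assuming it lives in the org/workflows repository that the callers reference. The input names match what the caller passes; the deploy step is a placeholder, and the role ARN and region are the illustrative values used elsewhere in this post. One detail that is easy to miss: a job that calls a reusable workflow cannot set environment itself, so the called workflow sets it from an input.

```yaml
# .github/workflows/deploy.yml in the shared workflows repository (sketch)
name: deploy
on:
  workflow_call:
    inputs:
      environment:
        description: Target GitHub environment (staging or production)
        type: string
        required: true
      image:
        description: Fully qualified image reference to deploy
        type: string
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    # Protection rules (reviewers, wait timers) attach to this job.
    environment: ${{ inputs.environment }}
    permissions:
      id-token: write   # needed to request an OIDC token; the caller may need to grant it as well
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-deploy
          aws-region: us-east-1
      - name: Deploy
        env:
          IMAGE: ${{ inputs.image }}
          TARGET: ${{ inputs.environment }}
        # Placeholder: the real deploy command (helm, kubectl, ...) goes here.
        run: |
          echo "Deploying ${IMAGE} to ${TARGET}"
```

A service repository then consumes it with a short caller workflow: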
name: deploy
on:
  push:
    branches: [main]
  workflow_dispatch:

concurrency:
  group: deploy-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: false

jobs:
  build:
    uses: org/workflows/.github/workflows/build.yml@v1
    with:
      service: orders-api

  staging:
    needs: build
    uses: org/workflows/.github/workflows/deploy.yml@v1
    with:
      environment: staging
      image: ${{ needs.build.outputs.image }}

  production:
    # build must be a direct dependency for needs.build.outputs to resolve.
    needs: [build, staging]
    # A job that calls a reusable workflow cannot set environment: itself;
    # deploy.yml is expected to set it on its own job from the input below.
    uses: org/workflows/.github/workflows/deploy.yml@v1
    with:
      environment: production
      image: ${{ needs.build.outputs.image }}

A few non-obvious details worth pinning down:
- concurrency with cancel-in-progress: false. Two deploys of the same service must not overlap; the second waits. Cancelling in-progress deploys mid-flight is a recipe for half-deployed states.
- environment: production triggers GitHub’s environment protection rules: required reviewers, wait timers, allowed branches. Use these instead of inventing approval flows in YAML. The key is only valid on a regular job, so with reusable workflows it belongs on the job inside the called workflow.
- Pin workflow versions to tags (@v1), not branches (@main). A pipeline change should be a deliberate version bump.
- OIDC, not long-lived secrets. GitHub’s OIDC provider authenticates to AWS/GCP/Azure via short-lived role assumptions; the job needs permissions: id-token: write to request a token. Avoid long-lived access keys in Actions secrets entirely.

- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/gha-deploy
    aws-region: us-east-1

The Build: Reproducible and Cached
Every build step should be reproducible (same input → same output) and aggressively cached. The patterns that matter:
- Lock files in source control: package-lock.json, poetry.lock, go.sum. Cache keys are derived from them, so caches hit only when the lock file is committed and unchanged.
- Layer ordering for Dockerfiles. Dependencies before source code. A code-only change should not invalidate the dependency install layer.
- BuildKit with cache mount. RUN --mount=type=cache,target=/root/.cache/pip pip install .... Reuses package caches across builds on the same builder.
- Registry-backed cache. cache-from: type=registry,ref=org/svc:cache and cache-to: type=registry,ref=org/svc:cache,mode=max. Persists layers between runners.
- Tag deterministically. org/svc:${{ github.sha }} for traceability, plus an optional org/svc:latest for convenience. Never deploy latest to production; the SHA is the canonical reference.

- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: |
      ghcr.io/org/${{ inputs.service }}:${{ github.sha }}
      ghcr.io/org/${{ inputs.service }}:sha-${{ github.sha }}
    # GitHub Actions cache backend; type=registry takes the same two options.
    cache-from: type=gha,scope=${{ inputs.service }}
    cache-to: type=gha,scope=${{ inputs.service }},mode=max
    provenance: true
    sbom: true

provenance: true and sbom: true produce SLSA provenance attestations and an SBOM as part of the build. Both are useful for supply-chain compliance and increasingly expected in regulated environments.
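Because the caller reads needs.build.outputs.image, the reusable build workflow has to declare that output explicitly. A minimal sketch of how build.yml might wire it up, assuming images are pushed to ghcr.io/org as in the step above; the step id ref is a name chosen for this illustration.

```yaml
# .github/workflows/build.yml in the shared workflows repository (sketch)
name: build
on:
  workflow_call:
    inputs:
      service:
        type: string
        required: true
    outputs:
      image:
        description: Image reference tagged with the commit SHA
        value: ${{ jobs.build.outputs.image }}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # push to ghcr.io; the caller must not restrict this
    outputs:
      image: ${{ steps.ref.outputs.image }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ github.token }}
      - id: ref
        env:
          SERVICE: ${{ inputs.service }}
        # Compute the SHA-tagged reference once and reuse it everywhere.
        run: |
          echo "image=ghcr.io/org/${SERVICE}:${GITHUB_SHA}" >> "$GITHUB_OUTPUT"
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.ref.outputs.image }}
          cache-from: type=gha,scope=${{ inputs.service }}
          cache-to: type=gha,scope=${{ inputs.service }},mode=max
```

Computing the reference in one step keeps the tag that is pushed and the tag that is deployed from drifting apart.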
Test Stratification
Slow tests in CI are the single largest barrier to high deploy frequency. The stratification that works:
- Unit tests — fast, deterministic, run on every PR and every push. Goal: full unit suite under 3 minutes.
- Integration tests — talk to real dependencies (test database, in-memory queue). Run on every push to main and before staging deploy.
- End-to-end tests — full system, run against staging post-deploy. Smaller suite focused on critical flows.
- Smoke tests — handful of canary checks against production post-deploy. Validate the most basic functionality.
Don’t gate every PR on a 30-minute E2E suite. Gate on unit + integration; let E2E run post-deploy with rollback automation.
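One way to express the first two tiers as jobs, sketched under assumptions: the make targets are hypothetical stand-ins for whatever runs each suite, and the integration tier talks to a throwaway Postgres service container.

```yaml
name: test
on:
  pull_request:
  push:
    branches: [main]

jobs:
  unit:
    runs-on: ubuntu-latest
    timeout-minutes: 5            # fail fast if the suite drifts past its budget
    steps:
      - uses: actions/checkout@v4
      - run: make test-unit       # hypothetical target

  integration:
    needs: unit                   # only spend time here once unit tests pass
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
        options: >-
          --health-cmd "pg_isready -U postgres"
          --health-interval 5s
          --health-timeout 5s
          --health-retries 10
    steps:
      - uses: actions/checkout@v4
      - run: make test-integration   # hypothetical target
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
```

The timeout on the unit job turns the runtime goal into something the pipeline enforces rather than something the team remembers.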
Test runtime can also be parallelized via matrix:
strategy:
  fail-fast: false
  matrix:
    shard: [0, 1, 2, 3, 4, 5, 6, 7]
steps:
  # Assumes the pytest-shard plugin, whose shard IDs are zero-based.
  - run: pytest --shard-id=${{ matrix.shard }} --num-shards=8

Migrations: The Zero-Downtime Trap
Database migrations are where most “zero-downtime” deploys fail. The rule that earns its keep: every migration must be backward-compatible with the previous version of the application code.
This forces a multi-step expand/contract sequence for breaking changes:
- Deploy schema change (additive only — new column, new table, new index).
- Deploy code that writes to both old and new.
- Backfill data.
- Deploy code that reads from new only.
- Deploy schema change to remove old.
Renaming a column in one deploy is a request for outages. The expand/contract pattern (expand-migrate-contract, or “ECT”) takes more deploys but prevents the half-deployed inconsistency that takes services down.
Tooling: Flyway, Liquibase, Alembic, golang-migrate, dbmate. All of them support forward-only migrations, which is correct for production — rolling back schema changes is rarely worth the complexity. Roll forward with a fixing migration instead.
For Postgres specifically:
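- CREATE INDEX CONCURRENTLY: builds the index without blocking writes to the table. It cannot run inside a transaction block, so the migration tool must not wrap that step in a transaction.
- ALTER TABLE ADD COLUMN with a default: in Postgres 11+ this is a metadata-only operation when the default is a constant or other non-volatile expression. Without a default, it’s cheap. With a volatile default (e.g., random() or clock_timestamp()), it rewrites the table.
- Long-running migrations should run as a separate job, not inline with the deploy. A 4-hour backfill in a deploy pipeline blocks every other deploy.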
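For the last point, one option is to run the backfill as its own Kubernetes Job so the deploy pipeline never waits on it. This is a sketch only: the Job name, image tag placeholder, and backfill command are hypothetical and depend on how the service packages its migration code.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orders-api-backfill        # hypothetical name
  namespace: production
spec:
  backoffLimit: 0                  # do not retry a half-finished backfill automatically
  activeDeadlineSeconds: 21600     # give up after 6 hours
  ttlSecondsAfterFinished: 86400   # clean up the finished Job after a day
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backfill
          # Placeholder tag: use the SHA of the release that contains the backfill code.
          image: ghcr.io/org/orders-api:REPLACE_WITH_SHA
          # Hypothetical entrypoint; the backfill should be idempotent and resumable.
          command: ["python", "-m", "orders.backfill"]
```

Keeping the backfill idempotent means a failed run can simply be resubmitted without a separate cleanup step.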
Deployment: How Rolling Updates Actually Work
For Kubernetes-deployed services, the zero-downtime properties depend on the interaction between four things: the rolling update strategy, the readiness probe, the preStop delay, and the graceful shutdown handler.

# Deployment spec
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 25%

# Container spec
readinessProbe:
  httpGet: { path: /healthz/ready, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

lifecycle:
  preStop:
    exec: { command: ["/bin/sh", "-c", "sleep 5"] }

# Pod spec
terminationGracePeriodSeconds: 30

What each does:
- maxUnavailable: 0. Never reduce capacity below the desired count during a rollout. A new pod must be Ready before an old one terminates.
- Readiness probe checks dependencies. Not just “is the process alive?” but “can it serve requests?”: DB pool initialized, caches warm, downstream connectivity OK.
- preStop sleep. Removing a pod from the Service endpoints takes seconds; if the process exits immediately on SIGTERM, the load balancer is still routing traffic to it. Sleep buys the propagation time.
- Graceful shutdown handler in the app. On SIGTERM, stop accepting new connections, drain in-flight requests, close DB pools, exit.
Without all four, “rolling update” produces visible errors during deploys.
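Assembled into one manifest, the settings sit at three different levels of the Deployment, which is easy to get wrong when copying fragments. A sketch, assuming a service named orders-api listening on port 8080; the replica count, labels, and image tag placeholder are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: orders-api
  strategy:                              # Deployment level
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 25%
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      # Pod level: must cover the preStop sleep plus the app's drain time.
      terminationGracePeriodSeconds: 30
      containers:
        - name: orders-api
          image: ghcr.io/org/orders-api:REPLACE_WITH_SHA   # placeholder tag
          ports:
            - containerPort: 8080
          readinessProbe:                # container level
            httpGet:
              path: /healthz/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:                     # container level
            preStop:
              exec:
                # Requires a shell in the image; SIGTERM is sent after this returns.
                command: ["/bin/sh", "-c", "sleep 5"]
```

The manifest covers three of the four pieces; the SIGTERM handler still has to live in the application code.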
Canary and Progressive Delivery
For high-stakes services, rolling updates are not enough. Progressive delivery sends a small fraction of traffic to the new version, monitors health, and ramps up only if metrics stay healthy.
GitHub Actions plus Argo Rollouts (or Flagger) is a common pattern:

- name: Promote canary
  run: |
    kubectl argo rollouts promote orders-api -n production
    kubectl argo rollouts status orders-api -n production --watch

Flagger goes further: it ramps the canary, queries Prometheus for SLI metrics (error rate, p95 latency), and auto-promotes or auto-rolls-back based on thresholds. The pipeline kicks off the deploy; Flagger owns the rollout. This is the model that lets you deploy a hundred times a day without humans watching graphs.
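For the Argo Rollouts path, the promote step in the pipeline only makes sense against a Rollout that pauses and waits for it. A minimal sketch of such a Rollout, with illustrative weights and durations; without a traffic router configured, Argo Rollouts approximates the weights by scaling replica counts.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders-api
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: ghcr.io/org/orders-api:REPLACE_WITH_SHA   # placeholder tag
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {}                # waits indefinitely until the rollout is promoted
        - setWeight: 50
        - pause: { duration: 10m } # timed soak before going to 100%
```

The empty pause is the gate that the pipeline's promote command releases; the timed pause gives metrics a chance to surface problems at 50% before full rollout.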
Rollback as a First-Class Operation
Rollbacks should be one command, well-documented, and exercised regularly. Two patterns:
- Helm rollback: helm rollback <release> <revision>. Works for any Helm-managed deployment. Re-applies the manifests and values recorded for the earlier revision.
- Image SHA rollback: re-deploy a previous SHA via the same pipeline. Slower but produces a clean audit trail.
Whichever you use, every release artifact (chart version, image SHA, manifest revision) must be addressable. Rollbacks fail when the team can’t reliably identify “the previous version.”
The rollback discipline that matters: practice it monthly. A rollback you’ve never tested is not a rollback; it is hope.
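The image SHA pattern can be a small manually triggered workflow that reuses the shared deploy workflow. A sketch, assuming the deploy.yml inputs shown earlier and images tagged by commit SHA; the concurrency group name is an assumption and only serializes against regular deploys if they use the same group.

```yaml
name: rollback
on:
  workflow_dispatch:
    inputs:
      sha:
        description: Commit SHA of the image to redeploy
        type: string
        required: true

permissions:
  id-token: write   # lets the called workflow authenticate via OIDC
  contents: read

concurrency:
  group: deploy-orders-api-production   # assumed shared group name
  cancel-in-progress: false

jobs:
  production:
    uses: org/workflows/.github/workflows/deploy.yml@v1
    with:
      environment: production
      image: ghcr.io/org/orders-api:${{ inputs.sha }}
```

Because it goes through the same deploy path, a monthly rollback drill also exercises the deploy workflow, the environment approvals, and the image retention policy.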
Secrets and Config
Secrets must never live in workflow YAML or environment variables in source. Two options:
- GitHub Encrypted Secrets for CI-time secrets (registry credentials, deploy keys).
- Cloud-native secret stores (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) for runtime secrets. Pods fetch via External Secrets Operator or directly via SDK + IRSA.
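With External Secrets Operator, the runtime path might look like the sketch below. The store name, secret path, and key names are hypothetical, and the apiVersion depends on the operator release installed (newer releases also serve v1).

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-api-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: orders-api-secrets       # Kubernetes Secret the operator creates and keeps in sync
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: prod/orders-api/db    # hypothetical path in the secret store
        property: password
```

The pod then mounts or references the generated Secret as usual, and no secret value ever passes through the CI pipeline.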
Config is different from secrets. Per-environment values (DB host, region, feature flags) belong in chart values files, not in code or runtime secrets:

charts/orders-api/
  values.yaml
  values-staging.yaml
  values-production.yaml

Helm’s -f values-production.yaml overlays environment-specific config at deploy time.
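As a deploy step, the overlay could be applied like this. A sketch inside a reusable workflow with an environment input; it assumes the namespace matches the environment name and that the chart exposes an image.tag value, neither of which is given by the layout above.

```yaml
- name: Deploy chart
  env:
    TARGET: ${{ inputs.environment }}
    IMAGE_TAG: ${{ github.sha }}
  run: |
    set -euo pipefail
    # values.yaml inside the chart is loaded automatically; the overlay is layered on top.
    helm upgrade --install orders-api charts/orders-api \
      --namespace "$TARGET" \
      -f "charts/orders-api/values-${TARGET}.yaml" \
      --set image.tag="$IMAGE_TAG" \
      --atomic --wait --timeout 5m
```

With --atomic, a release that fails to become ready within the timeout is rolled back automatically, which keeps a failed deploy from lingering in a half-applied state.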
Observability Hooks
A CI/CD pipeline that doesn’t surface what was deployed to your observability system is missing a critical signal. Two integrations earn their keep:
- Deploy annotations on dashboards. Most observability platforms (Datadog, Grafana, Honeycomb, Sentry) accept deploy events that mark dashboards. Correlating a metric spike with “X deployed at 14:32” cuts diagnostic time dramatically.
- Release tagging in error trackers. Sentry, Rollbar, Bugsnag all support releases — errors are tagged with the release that produced them, regressions show up against the right deploy.
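For the annotation side, a sketch against Grafana's annotations HTTP API. GRAFANA_URL and GRAFANA_TOKEN are hypothetical names for a repository variable and a secret, and the tags are illustrative.

```yaml
- name: Annotate deploy in Grafana
  env:
    GRAFANA_URL: ${{ vars.GRAFANA_URL }}
    GRAFANA_TOKEN: ${{ secrets.GRAFANA_TOKEN }}
  run: |
    set -euo pipefail
    # Build the JSON with jq so the text is escaped correctly.
    payload=$(jq -n --arg text "orders-api ${GITHUB_SHA} deployed to production" \
      '{tags: ["deploy", "orders-api", "production"], text: $text}')
    curl --fail --silent --show-error \
      -X POST "${GRAFANA_URL}/api/annotations" \
      -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
      -H "Content-Type: application/json" \
      -d "$payload"
```

Release tagging in an error tracker follows the same shape: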
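
- name: Sentry release
  uses: getsentry/action-release@v1
  env:
    # The action reads its credentials from the environment; secret names are placeholders.
    SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
    SENTRY_ORG: ${{ secrets.SENTRY_ORG }}
    SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }}
  with:
    environment: production
    version: ${{ github.sha }}

Pipeline Failure Modes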
A few common pitfalls:
- Lengthy approval timeouts. A staging-to-production approval that takes 4 hours encourages teams to deploy at 8pm to avoid it. Either approve quickly or remove the approval — somewhere in between is worst-of-both-worlds.
- Pipelines that fail silently. A step that exits 0 on test failure is worse than a step that fails loudly. Always check exit codes; use set -euo pipefail in shell scripts.
- Resource-starved runners. Standard GitHub-hosted runners are small (2 to 4 vCPUs, depending on the plan and repository visibility). Heavy CI suites need larger runners (for example runs-on: ubuntu-latest-8-core, where the label is whatever your organization named the runner) or self-hosted runners.
- Workflow sprawl. Each service ends up with a slightly different deploy workflow. Diagnosing “why did this deploy fail?” requires reading the YAML. Standardize via reusable workflows; resist the urge to customize.
- Untested workflow changes. A pipeline change is a code change. Lint workflow YAML (actionlint), test in a fork before rolling out, and version reusable workflows like libraries.
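Linting can itself be a small workflow that runs only when workflow files change. A sketch using the published actionlint Docker image; pinning a specific image tag instead of latest would be the more careful choice.

```yaml
name: lint-workflows
on:
  pull_request:
    paths:
      - ".github/workflows/**"

jobs:
  actionlint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run actionlint
        run: |
          docker run --rm -v "$PWD:/repo" --workdir /repo rhysd/actionlint:latest -color
```

This catches problems such as an unsupported key on a reusable workflow call or a reference to an undeclared output before the change reaches main.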
Closing
A zero-downtime CI/CD pipeline is the sum of disciplined choices: deterministic builds with cached dependencies, stratified tests that gate the right stages, additive-only migrations executed before app rollout, rolling deployments with correctly configured probes and shutdown handlers, progressive delivery for high-stakes services, and rollbacks practiced often enough to be a routine operation. GitHub Actions provides the orchestration — reusable workflows, OIDC-based cloud auth, environment protection rules, concurrency control — but the pipeline’s properties are determined by what runs inside it. Build the pipeline so that the dangerous things (migrations, breaking changes, slow tests) are gated explicitly, the safe things (rolling updates, observability, rollback) are automated, and the human approvals are reserved for decisions that genuinely require judgment. The goal is not a flashy deployment dashboard; it is the property that users do not notice you deployed.