Terraform for Reproducible Cloud Infrastructure
The promise of infrastructure-as-code is that the production environment is reproducible — that staging looks like production, that a new region can be brought up by running a tool, that a rollback is a git revert. The reality, in most organizations, is more complicated. Terraform configurations drift from running state, modules diverge across services, state files become operationally fragile, and “reproducibility” turns into a fiction that holds up only as long as no one looks too closely.
This post is about the patterns that make Terraform actually reproducible at production scale. The mechanics of Terraform itself are well-documented; the operational discipline that turns it into a reliable engineering tool is not.
What “Reproducible” Actually Requires
Reproducibility is not a single property; it is the conjunction of several:
- Determinism. The same config produces the same infrastructure, regardless of who runs it.
- Idempotency. Re-running the same config makes no changes if nothing has drifted.
- Auditability. Every change is reviewable, attributable, and reversible.
- Environmental parity. Dev, staging, and production share a structure; differences are explicit and bounded.
- State integrity. The state file accurately reflects reality, with no untracked resources and no drift.
Terraform supports all of these in principle. In practice each requires deliberate work, and skipping any one undermines the rest.
The Backend Decision
State management is the single largest factor in whether Terraform stays reliable as the codebase grows. The local-state default is acceptable for a single user; in any team or production setting, a remote backend with locking is mandatory.
The widely-used options in 2026:
- S3 + DynamoDB locking (the AWS-native default). Cheap, durable, well-supported. Configure encryption at rest, versioning on the state bucket, and a separate IAM role for state access.
- Terraform Cloud / HCP Terraform. Managed state, run history, policy enforcement, VCS integration. Worth the cost for teams that want managed workflows.
- Terraform-state-aware alternatives: Pulumi, OpenTofu (the open-source Terraform fork), Spacelift, Atlantis. The state semantics are similar; the operational model varies.
State file layout matters as much as backend choice. The model that scales:
- One state per environment per concern. Networking is a separate state from application infrastructure; databases are a separate state from compute. Avoid the monolithic single-state file that requires a 20-minute plan to change anything.
- Pass values between states via remote state data sources (or, better, via SSM Parameter Store / outputs / a service catalog). Avoid hard-coding values between states.
- Per-environment isolation. Production state is in a different account, with different IAM, than staging or dev.
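The cross-state hand-off described above can be sketched as follows. This is a minimal illustration rather than a prescribed layout: the output name `private_subnet_ids` and the parameter path `/platform/networking/vpc_id` are hypothetical, and the sketch assumes the networking stack actually exports those values.

```hcl
# Option 1: read outputs directly from the networking state.
# Assumes the networking stack declares an output named "private_subnet_ids" (hypothetical).
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "myorg-tf-state-prod"
    key    = "platform/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Option 2: read a value the networking stack published to SSM Parameter Store.
# The parameter name is a placeholder.
data "aws_ssm_parameter" "vpc_id" {
  name = "/platform/networking/vpc_id"
}

locals {
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  vpc_id             = data.aws_ssm_parameter.vpc_id.value
}
```

The SSM variant decouples consumers from the producer's state file. Reading remote state requires read access to the whole state snapshot, which may contain secrets, whereas a parameter exposes only the published value.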
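A backend block for the production networking state, following this layout:

```hcl
terraform {
  required_version = ">= 1.6"

  backend "s3" {
    bucket         = "myorg-tf-state-prod"
    key            = "platform/networking/terraform.tfstate" # one key per environment per concern
    region         = "us-east-1"
    dynamodb_table = "tf-locks-prod" # state locking; Terraform 1.10+ can also lock natively in S3 via use_lockfile
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:...:key/..."
  }
}
```

Module Design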
Modules are how you turn Terraform from “a 5,000-line root config” into something maintainable. The principles that hold up:
- One module, one concern. A “VPC” module creates a VPC, subnets, NAT, route tables. It does not create EKS, RDS, or DNS records.
- Inputs are sparse and high-level. A consumer should specify intent, not implementation. `subnet_size = "small"` is better than 16 individual CIDR parameters.
- Outputs are stable. Once a module is consumed by another stack, removing or renaming an output is a breaking change. Treat outputs like a public API.
- Versions are pinned. `source = "git::ssh://[email protected]/org/tf-modules//vpc?ref=v1.4.2"`. Floating references to `main` will eventually break someone’s apply.
- Prefer flat composition over conditional logic. A module with five `enable_X` flags is a sign that you should have written multiple smaller modules.
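As an illustration of the "sparse inputs, stable outputs" principles, here is a rough sketch of a module interface. The size names, their mapping to prefix lengths, and the output name are invented for the example, and `aws_vpc.this` is assumed to be defined elsewhere in the module.

```hcl
# variables.tf (inside a hypothetical vpc module)
variable "subnet_size" {
  description = "Intent-level subnet sizing; callers never pass raw CIDRs."
  type        = string
  default     = "small"
}

locals {
  # Illustrative mapping from intent to subnet prefix length.
  subnet_prefix_lengths = {
    small  = 26
    medium = 24
    large  = 22
  }

  # An unknown size fails at plan time because the key lookup errors.
  subnet_prefix_length = local.subnet_prefix_lengths[var.subnet_size]
}

# outputs.tf: treated as public API; rename only with a major version bump.
output "vpc_id" {
  description = "ID of the VPC created by this module."
  value       = aws_vpc.this.id
}
```

The caller's configuration stays at one line per decision, and the module remains free to change how "small" is implemented without breaking consumers.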
The community modules (terraform-aws-modules/*) are excellent for getting started; many teams outgrow them as their requirements become specific. Write your own thin wrappers around community modules early, so when you need to replace them you have a stable interface.
A Worked Example: VPC Module
A typical wrapper module that captures organizational defaults:
```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.10"

  name = "${var.env}-${var.region_short}-vpc"
  cidr = var.vpc_cidr

  azs             = local.azs
  private_subnets = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 4, i)]
  public_subnets  = [for i, az in local.azs : cidrsubnet(var.vpc_cidr, 4, i + 8)]

  enable_nat_gateway     = true
  single_nat_gateway     = var.env != "prod"
  one_nat_gateway_per_az = var.env == "prod"

  enable_flow_log           = true
  flow_log_destination_type = "s3"
  flow_log_destination_arn  = aws_s3_bucket.flow_logs.arn
  flow_log_traffic_type     = "ALL"

  tags = local.common_tags
}
```

A few defensible choices in this snippet:
- Single NAT in non-prod, per-AZ NAT in prod. Cost vs. availability trade-off captured in code.
- Flow logs to S3, not CloudWatch. Cheaper for high-traffic VPCs; queryable with Athena.
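- CIDR arithmetic via `cidrsubnet`. Subnets are derived from the VPC CIDR, so they are uniformly sized and cannot overlap as long as there are at most eight AZs (private subnets use indices 0 to 7, public subnets 8 to 15).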
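To make the CIDR arithmetic concrete, here is a small standalone sketch, assuming a `10.0.0.0/16` VPC and three availability zones (both values are illustrative and not taken from the module above). Adding 4 bits to a /16 yields sixteen /20 blocks.

```hcl
locals {
  example_vpc_cidr = "10.0.0.0/16"                              # assumed for illustration
  example_azs      = ["us-east-1a", "us-east-1b", "us-east-1c"] # assumed for illustration

  example_private = [for i, az in local.example_azs : cidrsubnet(local.example_vpc_cidr, 4, i)]
  example_public  = [for i, az in local.example_azs : cidrsubnet(local.example_vpc_cidr, 4, i + 8)]
}

output "example_private" {
  value = local.example_private # ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]
}

output "example_public" {
  value = local.example_public # ["10.0.128.0/20", "10.0.144.0/20", "10.0.160.0/20"]
}
```

Evaluating an expression such as `cidrsubnet("10.0.0.0/16", 4, 8)` in `terraform console` is a quick way to check an index before committing.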
Variables, Locals, and Tags
Tags deserve their own section because tag discipline is the difference between a comprehensible AWS account and a haunted one.
```hcl
locals {
  common_tags = {
    Environment = var.env
    Service     = var.service
    Owner       = var.owner_team
    ManagedBy   = "terraform"
    Repo        = "github.com/org/${var.repo}"
    CostCenter  = var.cost_center
  }
}
```

Apply `local.common_tags` to every taggable resource. Enforce required tags via AWS Config or Service Control Policies in production accounts. Cost attribution, security investigations, and orphan-resource cleanup all depend on consistent tagging.
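One way to apply the tags everywhere without repeating them on each resource is the AWS provider's `default_tags` block. A minimal sketch, assuming a single provider configuration in the root module; the `aws_s3_bucket.artifacts` resource, its bucket name, and the `DataClass` tag are hypothetical.

```hcl
provider "aws" {
  region = "us-east-1"

  # Merged into every taggable resource created through this provider.
  default_tags {
    tags = local.common_tags
  }
}

resource "aws_s3_bucket" "artifacts" {
  bucket = "myorg-artifacts-example" # placeholder name

  # Resource-level tags are merged with default_tags; on key conflicts the resource wins.
  tags = {
    DataClass = "internal"
  }
}
```

Not every resource inherits provider defaults (Auto Scaling group tag propagation is a known exception), so account-level enforcement remains the backstop.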
For variable hygiene:
- Validate inputs. A `validation` block with a condition such as `contains(["dev", "staging", "prod"], var.env)` and an `error_message` rejects bad values at plan time.
- Sensible defaults only when safe. Defaulting `env = "dev"` is a foot-gun; require explicit input for environment selection.
- Type everything. `type = list(object({ ... }))` catches structural errors at plan time.
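Written out in full, the hygiene rules above might look like this. The `node_groups` variable and its attributes are hypothetical, chosen only to show a structured type. Note that a `validation` block needs an `error_message` alongside its `condition`.

```hcl
variable "env" {
  description = "Deployment environment. Deliberately has no default."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.env)
    error_message = "env must be one of: dev, staging, prod."
  }
}

variable "node_groups" {
  description = "Hypothetical structured input; shape errors surface at plan time."
  type = list(object({
    name          = string
    instance_type = string
    min_size      = number
    max_size      = number
  }))

  validation {
    condition     = alltrue([for g in var.node_groups : g.min_size <= g.max_size])
    error_message = "Each node group must have min_size <= max_size."
  }
}
```

Because `env` has no default, a plan run without an explicit value stops and asks for one instead of silently targeting dev.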
Workspaces vs. Directory-per-Environment
Two patterns exist for handling multiple environments:
- Terraform workspaces. One backend, one set of files, switch contexts via `terraform workspace select`. Concise but easy to misuse; a single bad `apply` can hit production.
- Directory-per-environment. `environments/dev/`, `environments/staging/`, `environments/prod/`, each with its own backend config and its own `tfvars`. Verbose but explicit.
For production systems, the directory-per-environment pattern wins on safety. Workspaces are fine for transient parallel environments (per-PR, per-feature branches) where the tradeoff favors brevity.
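With directory-per-environment, the backend settings that differ between environments can live in a per-directory file, using Terraform's partial backend configuration. A sketch for `environments/prod/`, reusing the bucket and lock table names from the backend example earlier; the state key is a placeholder.

```hcl
# environments/prod/main.tf
terraform {
  # Left empty on purpose: values are supplied at init time via -backend-config.
  backend "s3" {}
}
```

```hcl
# environments/prod/backend.hcl
bucket         = "myorg-tf-state-prod"
key            = "platform/app/terraform.tfstate" # placeholder key
region         = "us-east-1"
dynamodb_table = "tf-locks-prod"
encrypt        = true
```

Keeping the backend values out of `main.tf` means the same root configuration can be copied between environment directories without editing the `terraform` block.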
```
infra/
├── modules/
│   ├── vpc/
│   ├── eks-cluster/
│   └── rds-instance/
└── environments/
    ├── dev/
    │   ├── backend.hcl
    │   ├── main.tf
    │   └── terraform.tfvars
    ├── staging/
    │   └── ...
    └── prod/
        └── ...
```

Initialize with the environment-specific backend:
```shell
terraform -chdir=environments/prod init -backend-config=backend.hcl
terraform -chdir=environments/prod plan -var-file=terraform.tfvars
```

Drift, and What to Do About It
Drift is when the actual cloud state differs from the Terraform state. It happens through console access, other tooling, third-party automations, or AWS itself (auto-scaling, default tags, etc.). A configuration that drifts is no longer reproducible.
Two strategies:
- Detect. Run `terraform plan` on a schedule (nightly is common); alert on non-empty plans. Tools like `driftctl` and the built-in `terraform plan -detailed-exitcode` automate this; with `-detailed-exitcode`, exit code 2 means the plan contains changes.
- Prevent. Restrict console write access in production accounts; use AWS SCPs to forbid most direct API actions. Console-edit-then-Terraform-import is the source of most drift; remove the first step.
When drift is detected, the resolution is usually one of:
- Update Terraform code to match reality and commit.
- Run `terraform apply` to revert reality to match code.
- Decide that the drifted resource shouldn’t be managed by Terraform and remove it from state with `terraform state rm`.
The choice is contextual but the response time is not — drift left unaddressed compounds.
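Some of these resolutions can be expressed in configuration rather than as ad hoc commands. The sketch below assumes Terraform 1.7 or later for the `removed` block. The resource names are hypothetical, the Auto Scaling group is abbreviated to the arguments relevant here, and `aws_launch_template.web` and `var.private_subnet_ids` are assumed to exist elsewhere.

```hcl
# Expected drift: the autoscaler changes desired_capacity at runtime,
# so Terraform is told not to treat that attribute as drift.
resource "aws_autoscaling_group" "web" {
  name                = "web"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

# Reviewable equivalent of `terraform state rm`: stop managing the resource
# without destroying it. Delete the matching resource block in the same change.
removed {
  from = aws_instance.legacy_bastion

  lifecycle {
    destroy = false
  }
}
```

Both forms go through the normal PR and plan review, which keeps the decision visible in history instead of living only in someone's shell.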
Plan Reviews and Policy Enforcement
A terraform plan is a code review artifact. The discipline:
- Every change is proposed via PR, with the plan attached. Atlantis, Terraform Cloud, or `terraform-plan` GitHub Actions surface the plan in the PR comment.
- The plan is reviewed for blast radius. “This change will destroy 23 resources” needs more scrutiny than “this change adds one resource.”
- Apply happens after merge, not as part of the PR. A merged PR triggers the apply against the production backend, with a clean separation between proposal and execution.
Policy enforcement via OPA (Open Policy Agent) / Sentinel / tflint / checkov is increasingly standard:
- `tflint` for syntax and AWS-specific lints.
- `checkov` or `tfsec` for security misconfigurations.
- OPA / Sentinel for organization-specific policies (“S3 buckets must have encryption”, “EC2 instances must use approved AMIs”).
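```yaml
- run: tflint --recursive
- run: checkov -d . --skip-check CKV_AWS_123
- run: terraform plan -out=tfplan
# OPA reads JSON, so render the binary plan first.
- run: terraform show -json tfplan > tfplan.json
# --fail-defined exits non-zero when any deny message is produced.
- run: opa eval --fail-defined --data policies/ --input tfplan.json 'data.tf.deny[x]'
```

A plan that violates a critical policy must fail the pipeline.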
Imports and Adoption
Bringing existing infrastructure under Terraform management is a common task and a common source of breakage. The mechanics:
- `terraform import` (or `import` blocks in Terraform 1.5+) brings resources into state.
- `terraform plan` after import should show no changes. If it shows changes, your config doesn’t match reality and applying would cause drift in the other direction.
- Iterate carefully. Import one resource at a time; verify the plan; commit; move on.
Import blocks (HCL-native imports) are dramatically better than the older terraform import CLI because they’re checked into source control and reviewable:
```hcl
import {
  to = aws_s3_bucket.legacy
  id = "my-legacy-bucket"
}

resource "aws_s3_bucket" "legacy" {
  bucket = "my-legacy-bucket"
}
```

For large-scale imports, Terraformer can reverse-engineer existing infrastructure into HCL. It is useful as a starting point, but the generated code typically needs significant refactoring before it’s usable.
Trade-offs Worth Owning
Terraform is not perfect, and being honest about its limitations is part of using it well:
- State files are operationally fragile. A corrupted or split state requires careful recovery. Versioning on the state bucket is non-negotiable; backups before risky operations are wise.
- Provider versions move. A pinned provider version (`aws ~> 5.40`) is required for reproducibility; an unpinned one will eventually break a plan in production.
- HCL is not a programming language. Complex logic in `dynamic` and `for_each` becomes unreadable. Push complexity into modules; keep root configs flat.
- Terraform is not great at stateful, mutable resources. Database schema migrations, complex IAM policies that other systems modify, and anything with a long-lived identity are better handled by purpose-built tools, with Terraform owning the surrounding scaffolding only.
- Refactoring requires care. Renaming a resource (`aws_instance.foo` → `aws_instance.bar`) causes destroy-then-create unless you use `moved` blocks (available since Terraform 1.1) or `terraform state mv`. Plan refactors deliberately.
- OpenTofu vs. Terraform. HashiCorp’s BSL license shift in 2023 produced the OpenTofu fork. Both are viable in 2026; OpenTofu largely has feature parity and an active community. The choice is mostly about which roadmap you trust.
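The pinning and refactoring points can be sketched in a few lines. The version constraint mirrors the `aws ~> 5.40` example above, and the resource addresses are the same placeholders used in the refactoring bullet.

```hcl
terraform {
  required_version = ">= 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # allows 5.40 and later 5.x releases, never 6.0
    }
  }
}

# Records the rename so the next plan shows a move, not destroy-then-create.
moved {
  from = aws_instance.foo
  to   = aws_instance.bar
}
```

A `~>` constraint is still a range. Committing `.terraform.lock.hcl` is what fixes the exact provider version and checksums that every run resolves to.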
CI/CD for Terraform
The pipeline that supports the workflow above:
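```yaml
name: terraform
on: [pull_request, push]

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read, pull-requests: write }
    strategy:
      matrix:
        env: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::...:role/tf-plan
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      # tflint and checkov are assumed to be installed on the runner.
      - run: terraform fmt -check -recursive
      - run: terraform -chdir=environments/${{ matrix.env }} init -backend-config=backend.hcl
      - run: terraform -chdir=environments/${{ matrix.env }} validate
      - run: tflint --recursive
      - run: checkov -d environments/${{ matrix.env }}
      - run: terraform -chdir=environments/${{ matrix.env }} plan -out=tfplan
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.env }}
          path: environments/${{ matrix.env }}/tfplan

  apply:
    # Runs on push to main. It cannot depend on the PR-only plan job, so it re-plans.
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: production
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::...:role/tf-apply # hypothetical role name
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=environments/prod init -backend-config=backend.hcl
      - run: terraform -chdir=environments/prod plan -out=tfplan
      - run: terraform -chdir=environments/prod apply tfplan
```

Apply jobs run in a protected environment with required reviewers. Plans are uploaded as artifacts of the PR run and can be posted as PR comments by the workflow. Because artifacts from the PR run are not available to the push-triggered run by default, the apply job re-plans against the merged commit before applying.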
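The workflow authenticates through GitHub's OIDC provider (`id-token: write` plus `role-to-assume`), so no long-lived AWS keys are stored in CI. A rough sketch of the trust policy for an apply role follows. It assumes the OIDC provider is already registered in the account and passed in by ARN; the role name `tf-apply` and the repository `org/infra` are hypothetical.

```hcl
variable "github_oidc_provider_arn" {
  description = "ARN of the existing IAM OIDC provider for token.actions.githubusercontent.com."
  type        = string
}

data "aws_iam_policy_document" "tf_apply_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only jobs running in the protected "production" environment may assume the role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:org/infra:environment:production"] # hypothetical repository
    }
  }
}

resource "aws_iam_role" "tf_apply" {
  name               = "tf-apply" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.tf_apply_trust.json
}
```

A separate, read-mostly plan role could use the same pattern with a `sub` value of `repo:org/infra:pull_request`, so that pull requests can plan but never apply.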
Closing
Reproducible cloud infrastructure with Terraform is achievable, and it is not free. The mechanics — modules, providers, state, plan/apply — are well-understood. The discipline — sparse modules, pinned versions, environment isolation, consistent tagging, drift detection, policy enforcement, plan review, careful imports — is what separates infrastructure that stays predictable from infrastructure that drifts into ungovernable complexity. Build the codebase deliberately: one concern per module, one backend per environment per concern, OIDC-based CI/CD that surfaces plans for review and applies them in protected environments, drift monitored on a schedule. Treat state as a critical asset; treat tags as a contract; treat policy as code. The pay-off is real: environments that look identical, changes that are reviewable artifacts, rollbacks that are git operations, and an audit trail that survives turnover. Anything less is infrastructure scripts dressed up as infrastructure-as-code.