Infrastructure as Code Best Practices: 10 Proven Patterns for Production Teams in 2026

Infrastructure as code best practices are well documented at the beginner level. The documentation tells you to use version control, use modules, and avoid hardcoded secrets. What it does not tell you is that 67% of teams using IaC experience significant configuration drift, that teams dealing with frequent drift have 2.3x higher change failure rates, and that the most common production incidents caused by IaC failures trace back to three specific anti-patterns: monolithic state files shared across environments, secrets stored in Terraform outputs, and CI pipelines that run apply without scheduled drift detection in between.

This guide covers ten infrastructure as code best practices at the level of teams who are already past the basics: module design for reusability at scale, remote state architecture that prevents concurrent apply conflicts, secrets management that keeps credentials out of state, a testing pipeline with four distinct stages, drift detection cadence, policy as code enforcement, tagging and naming conventions that make infrastructure auditable, GitOps integration, environment parity, and progressive deployment with rollback.

Why Infrastructure as Code Best Practices Matter More in 2026

The context for infrastructure as code best practices has shifted significantly. Three changes define the current landscape.

First, Terraform’s licensing change in 2023 drove substantial migration to OpenTofu, the community-maintained fork. As of 2026, teams are increasingly choosing OpenTofu for greenfield work. The practices in this guide apply equally to both: the tooling differs; the patterns do not.

Second, IaC footprints have grown. Teams managing ten modules in 2020 are managing hundreds in 2026, often across multiple cloud accounts, multiple regions, and multiple teams with different maturity levels. The infrastructure as code best practices that work for a single engineer managing one environment break down at this scale.

Third, security requirements have tightened. Running IaC without policy enforcement is no longer acceptable in regulated environments and increasingly not acceptable in Series A/B startups handling customer data. Policy as code tools like OPA, Checkov, and Sentinel are now expected elements of a mature IaC pipeline, not advanced add-ons.

Practice 1: Modular Design With Clear Boundaries

The most foundational of all infrastructure as code best practices is module design, and the most common violation is building modules that are too large, too opinionated, or too coupled to a specific deployment context.

Good modules have three properties: they encapsulate a single infrastructure concept, they are parameterized for all environment-specific values, and they expose only the outputs that callers need.

Repository structure that scales:

infrastructure/
├── modules/
│   ├── vpc/                    # VPC, subnets, routing - no environment logic
│   ├── eks-cluster/            # EKS cluster - no app-specific config
│   ├── rds-postgres/           # RDS module - no application schemas
│   └── kubernetes-namespace/   # Namespace + RBAC - no workload manifests
├── environments/
│   ├── dev/
│   │   ├── main.tf             # Calls modules with dev-specific values
│   │   ├── variables.tf
│   │   └── backend.tf          # Dev state backend
│   ├── staging/
│   │   ├── main.tf
│   │   └── backend.tf
│   └── production/
│       ├── main.tf
│       └── backend.tf
└── live/                       # Deployed state, Terragrunt pattern

What a well-designed module looks like:

# modules/eks-cluster/main.tf
# Module: eks-cluster
# Purpose: Creates an EKS cluster with managed node groups
# Does NOT: configure applications, namespaces, or workloads

variable "cluster_name" { type = string }
variable "kubernetes_version" { type = string }
variable "node_instance_types" { type = list(string) }
variable "min_size" { type = number }
variable "max_size" { type = number }
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }
variable "tags" { type = map(string) }

# Module implementation...

output "cluster_endpoint" { value = aws_eks_cluster.this.endpoint }
output "cluster_name" { value = aws_eks_cluster.this.name }
output "cluster_ca_certificate" {
  value     = aws_eks_cluster.this.certificate_authority[0].data
  sensitive = true
}

The root module in each environment becomes thin and readable:

# environments/production/main.tf
module "eks" {
  source = "../../modules/eks-cluster"

  cluster_name        = "prod-cluster"
  kubernetes_version  = "1.30"
  node_instance_types = ["m6i.2xlarge"]
  min_size            = 3
  max_size            = 12
  vpc_id              = module.vpc.vpc_id
  subnet_ids          = module.vpc.private_subnet_ids
  tags                = local.common_tags
}

The anti-pattern to avoid:

Modules that contain environment-specific logic (if var.environment == "production" then...) make it impossible to test environments in isolation and create hidden coupling between deployment contexts. Every difference between environments should be expressed as a variable value, never as conditional logic inside a module.
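A related boundary discipline is version pinning: when modules live in a shared repository, each consuming environment should pin a released tag so that module upgrades are deliberate PRs, not surprises. A sketch, with a hypothetical repository URL and tag:

```hcl
# environments/production/main.tf
# Pin consumers to a released module version (URL and tag are illustrative)
module "eks" {
  source = "git::https://github.com/company/terraform-modules.git//eks-cluster?ref=v2.3.1"

  cluster_name = "prod-cluster"
  # ...remaining inputs as shown above
}
```

Upgrading production then means changing `ref=v2.3.1` to `ref=v2.4.0` in a reviewed pull request, after the same version has been exercised in dev and staging.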

Practice 2: Remote State With Isolated Backends Per Environment

Local state files are incompatible with team development. This is not a matter of best-practice preference: a local terraform.tfstate file on one engineer’s laptop means another engineer running terraform apply operates on a different view of reality.

Remote state with backend isolation per environment is the infrastructure as code best practice that prevents the most destructive class of IaC incidents: state conflicts causing unexpected resource destruction.

S3 + native locking configuration (Terraform 1.10+ / current best practice):

# environments/production/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state-prod"
    key            = "infrastructure/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true

    # Native S3 locking (replaces DynamoDB - DynamoDB locking is deprecated)
    use_lockfile = true
  }
}

Note: DynamoDB-based state locking is deprecated as of Terraform 1.10. The S3 native locking using use_lockfile = true is the current recommended approach for AWS backends.
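The state bucket itself is worth managing as code, in a small bootstrap configuration (since the backend cannot store its own state). A sketch with hypothetical names, wiring in the versioning, encryption, and access hardening this guide relies on:

```hcl
# bootstrap/state-bucket.tf
resource "aws_s3_bucket" "state" {
  bucket = "company-terraform-state-prod"
}

# Versioning enables rollback to a previous known-good state (Practice 10)
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Default encryption so every state object is encrypted at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State buckets should never be publicly reachable
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```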

Critical: separate backend per environment, never shared:

# Correct - each environment has isolated state
s3://company-state/infrastructure/production/terraform.tfstate
s3://company-state/infrastructure/staging/terraform.tfstate
s3://company-state/infrastructure/dev/terraform.tfstate

# Dangerous - shared state with workspace prefix
s3://company-state/infrastructure/terraform.tfstate  # Do not do this

Workspaces share the same backend configuration and the same locking scope. A runaway terraform destroy in the wrong workspace with shared state can cascade in ways that isolated backends prevent. Separate backends are the infrastructure as code best practice; workspaces are appropriate for lightweight environment variants, not for production/staging/dev isolation.
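If repeating the backend block per environment feels error-prone, one common variant keeps a minimal backend skeleton in code and supplies the per-environment values at init time (file names below are hypothetical):

```hcl
# backend.tf - shared skeleton, no environment-specific values
terraform {
  backend "s3" {}
}

# environments/production/backend.hcl - passed in via:
#   terraform init -backend-config=environments/production/backend.hcl
#
# bucket       = "company-terraform-state-prod"
# key          = "infrastructure/production/terraform.tfstate"
# region       = "us-east-1"
# encrypt      = true
# use_lockfile = true
```

The backends remain fully isolated; only the boilerplate is deduplicated.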

State access controls:

# IAM policy for CI/CD role - read/write to specific state path only
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::company-terraform-state-prod/infrastructure/production/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::company-terraform-state-prod"
    }
  ]
}

Practice 3: Secrets Never in State

Storing secrets in Terraform state is the most dangerous anti-pattern in IaC security and one of the most common infrastructure as code best practices violations. The mechanism that creates the vulnerability is subtle: any resource attribute that Terraform reads from a provider and stores in state is accessible to anyone with read access to the state file. Even attributes marked sensitive in the provider schema are stored in plaintext in state; the sensitive flag suppresses display, not storage.

The pattern that puts secrets in state:

# DANGEROUS: This writes the password to state in plaintext
resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_db_instance" "main" {
  password = random_password.db.result  # Now in state
}

output "db_password" {
  value     = random_password.db.result  # Also in state
  sensitive = true  # Only hides it from terminal output, NOT from state
}

The infrastructure as code best practice for secrets:

# Correct: let the provider manage the credential end to end
resource "aws_db_instance" "main" {
  # Secrets Manager generates and stores the master password
  # Only the secret's ARN lands in state, not the password value
  manage_master_user_password   = true
  master_user_secret_kms_key_id = aws_kms_key.rds.arn
}

Note that reading a credential with a data "aws_secretsmanager_secret_version" block does not avoid the problem: data source results are written to state in plaintext like any other attribute.

For secrets that must be generated and stored:

# Use Vault provider - Vault becomes the source of truth for consumers
# Caveat: the data_json written here is still recorded in state, so the
# state access controls from Practice 2 remain essential
provider "vault" {}

resource "vault_generic_secret" "db_password" {
  path = "secret/production/database"
  data_json = jsonencode({
    password = var.db_password  # Provided at apply time via TF_VAR_, never committed
  })
}

Pre-commit hook to catch secrets before they reach the repository:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks
  - repo: https://github.com/trufflesecurity/trufflehog
    rev: v3.67.0
    hooks:
      - id: trufflehog

Rotate all secrets if state file access cannot be fully audited. The state file is as sensitive as the secrets it contains.
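Newer Terraform versions add another option worth knowing: ephemeral resources (Terraform 1.10+) read a value at plan/apply time without persisting it to state or plan files. A sketch, assuming an AWS provider version that ships the ephemeral aws_secretsmanager_secret_version resource and the write-only password_wo argument; exact attribute names may vary by provider version:

```hcl
# Read the secret ephemerally - never written to state or plan files
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/rds/app-password"
}

resource "aws_db_instance" "app" {
  # Write-only argument: sent to the provider, never stored in state
  password_wo = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string

  # Bump this version to signal that the write-only value has changed
  password_wo_version = 1
}
```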

Practice 4: The Four-Stage IaC Testing Pipeline

Testing infrastructure as code at a single point (static validation before merge) misses the failure modes that only appear when infrastructure is actually provisioned. The infrastructure as code best practice is a four-stage testing pipeline that catches different failure classes at different points.

Stage 1: Static analysis (runs on every commit, seconds):

# .github/workflows/iac-static.yml
name: IaC Static Analysis

on: [pull_request]

jobs:
  static:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: terraform fmt check
        run: terraform fmt -check -recursive

      - name: terraform validate
        run: |
          terraform init -backend=false
          terraform validate

      - name: Setup TFLint
        uses: terraform-linters/setup-tflint@v4

      - name: TFLint
        run: tflint --recursive

      - name: Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          framework: terraform
          soft_fail: false

Stage 2: Unit tests with terraform test (runs on PR, minutes):

# modules/eks-cluster/tests/cluster.tftest.hcl
run "cluster_name_format" {
  variables {
    cluster_name        = "test-cluster"
    kubernetes_version  = "1.30"
    node_instance_types = ["m6i.large"]
    min_size            = 1
    max_size            = 3
    vpc_id              = "vpc-test"
    subnet_ids          = ["subnet-test"]
    tags                = {}
  }

  # Validate plan without deploying
  command = plan

  assert {
    condition     = output.cluster_name == "test-cluster"
    error_message = "Cluster name output does not match input"
  }
}
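The same framework can assert that invalid inputs are rejected, which is how the naming conventions from Practice 7 get regression coverage. A sketch, assuming the module declares a validation block on cluster_name:

```hcl
# modules/eks-cluster/tests/validation.tftest.hcl
run "rejects_invalid_cluster_name" {
  command = plan

  variables {
    cluster_name        = "Invalid_Name!"
    kubernetes_version  = "1.30"
    node_instance_types = ["m6i.large"]
    min_size            = 1
    max_size            = 3
    vpc_id              = "vpc-test"
    subnet_ids          = ["subnet-test"]
    tags                = {}
  }

  # The run passes only if this variable's validation fails
  expect_failures = [var.cluster_name]
}
```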

Stage 3: Security scanning with OPA/Conftest (runs on PR, seconds):

# policies/terraform/no-public-s3.rego
package terraform.s3

deny[msg] {
  resource := input.resource.aws_s3_bucket[_]
  resource.acl == "public-read"
  msg := sprintf("S3 bucket '%v' has public-read ACL - must be private", [resource.bucket])
}

deny[msg] {
  resource := input.resource.aws_s3_bucket[_]
  not resource.server_side_encryption_configuration
  msg := sprintf("S3 bucket '%v' has no server-side encryption configured", [resource.bucket])
}

Stage 4: Integration tests with Terratest (runs on merge to main, minutes to hours):

// test/eks_cluster_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestEKSClusterDeploys(t *testing.T) {
    t.Parallel()

    options := &terraform.Options{
        TerraformDir: "../examples/eks-cluster",
        Vars: map[string]interface{}{
            "cluster_name":        "test-eks-" + random.UniqueId(),
            "kubernetes_version":  "1.30",
            "node_instance_types": []string{"m6i.large"},
            "min_size":            1,
            "max_size":            2,
        },
    }

    defer terraform.Destroy(t, options)
    terraform.InitAndApply(t, options)

    clusterName := terraform.Output(t, options, "cluster_name")
    assert.NotEmpty(t, clusterName)

    // Verify cluster exists in AWS
    cluster := aws.GetEksCluster(t, "us-east-1", clusterName)
    assert.Equal(t, "ACTIVE", aws.StringValue(cluster.Status))
}

Integration tests deploy real infrastructure into a sandbox account and destroy it after validation. They are expensive in time and money but catch the category of failure that static analysis cannot: actual provider behavior, IAM permission boundaries, and resource dependency ordering.

Practice 5: Drift Detection on a Schedule

Running terraform plan once before each terraform apply is necessary but not sufficient. The infrastructure as code best practice for drift detection is scheduled plan runs between deployments that alert when something changes outside of IaC.

Manual console changes are the most common source of drift. An engineer fixes an incident by editing a security group rule directly. The fix works. The rule is not reflected in Terraform. The next terraform apply – potentially weeks later – sees the rule as drift and plans to remove it.

Scheduled drift detection via GitHub Actions:

# .github/workflows/drift-detection.yml
name: Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'   # Every 6 hours
  workflow_dispatch:          # Allow manual trigger

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [staging, production]

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets[format('AWS_ROLE_{0}', matrix.environment)] }}
          aws-region: us-east-1

      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init

      - name: Terraform Plan (drift check)
        id: plan
        working-directory: environments/${{ matrix.environment }}
        run: |
          set +e
          terraform plan -detailed-exitcode -out=drift.tfplan
          code=$?
          echo "exit_code=$code" >> "$GITHUB_OUTPUT"
          # Exit 1 means the plan itself errored; 0 and 2 are handled below
          if [ "$code" -eq 1 ]; then exit 1; fi

      - name: Alert on drift
        if: steps.plan.outputs.exit_code == '2'
        uses: slackapi/slack-github-action@v1
        with:
          slack-message: |
            Drift detected in ${{ matrix.environment }}!
            Review: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

Exit code 2 from terraform plan -detailed-exitcode means the plan is non-empty: something has drifted. Exit code 0 means no changes are needed. Exit code 1 means the plan itself failed.

When drift is detected, the decision is explicit: import the change into state, revert the manual change, or update the IaC to declare it intentionally. The infrastructure as code best practice is that all three options are valid, but the choice must be made deliberately, not discovered accidentally during the next deployment.
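Terraform 1.5+ makes the first option declarative: an import block brings the manually created resource under management through the normal PR flow instead of a one-off CLI command. A sketch with hypothetical resource names and IDs:

```hcl
# Adopt a security group that was created by hand during an incident
import {
  to = aws_security_group.emergency_fix
  id = "sg-0123456789abcdef0"
}

resource "aws_security_group" "emergency_fix" {
  name   = "emergency-fix"
  vpc_id = module.vpc.vpc_id
  # ...rules as observed in the console, or generated with:
  #   terraform plan -generate-config-out=generated.tf
}
```

Because the import lands in a reviewed PR, the drift resolution itself is auditable.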

Practice 6: Policy as Code Enforcement in the Pipeline

Policy as code converts compliance requirements into automated checks that run before infrastructure reaches production. The infrastructure as code best practice is enforcement, not suggestion: violations block the pipeline.

Checkov integrated in CI (catches 1000+ misconfigurations):

- name: Checkov scan
  uses: bridgecrewio/checkov-action@v12
  with:
    directory: environments/production
    framework: terraform
    check: CKV_AWS_18,CKV_AWS_19,CKV_AWS_20  # Specific checks
    # OR
    skip_check: CKV_AWS_144  # Skip with documented justification
    output_format: sarif
    soft_fail: false  # Hard fail on violations

OPA policies for organization-specific rules:

# policies/tagging.rego
package terraform.tagging

required_tags := {"environment", "team", "cost-center"}

deny[msg] {
  resource := input.resource[_][_]
  missing := required_tags - {key | resource.tags[key]}
  count(missing) > 0
  msg := sprintf(
    "Resource is missing required tags: %v",
    [concat(", ", missing)]
  )
}

Applying the policies in CI with Conftest:

- name: Conftest policy check
  run: |
    terraform show -json terraform.tfplan > plan.json
    conftest test plan.json \
      --policy policies/ \
      --namespace terraform

Policy as code covers what static scanners miss: organization-specific rules like mandatory tagging conventions, approved instance type lists, required encryption configurations, and IAM permission boundaries that are context-specific. These cannot be expressed as generic rules in Checkov but can be expressed precisely in Rego.

Practice 7: Tagging and Naming Conventions as Code

Untagged resources are a FinOps and audit nightmare. Infrastructure as code best practices around tagging are most effective when implemented as module-level defaults that are difficult to bypass rather than guidelines that each engineer applies inconsistently.

Common tags module:

# modules/common-tags/main.tf
variable "environment" {
  type    = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}
variable "team" { type = string }
variable "cost_center" { type = string }
variable "repository" { type = string }

locals {
  common_tags = {
    environment  = var.environment
    team         = var.team
    cost_center  = var.cost_center
    repository   = var.repository
    managed_by   = "terraform"
    # Avoid timestamp() here: its value changes on every plan, which shows
    # up as permanent drift on every tagged resource
  }
}

output "tags" { value = local.common_tags }

Applied across all environments:

# environments/production/main.tf
module "tags" {
  source      = "../../modules/common-tags"
  environment = "production"
  team        = "platform"
  cost_center = "infra-001"
  repository  = "github.com/company/infrastructure"
}

module "vpc" {
  source = "../../modules/vpc"
  tags   = module.tags.tags
  # ...
}
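On AWS, the tag baseline can also be enforced at the provider level with default_tags, so even a module that forgets to wire the tags variable still produces tagged resources. A sketch, reusing the local.common_tags referenced earlier in this guide:

```hcl
# environments/production/providers.tf
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = local.common_tags
  }
}
```

Resource-level tags still merge over the defaults, so per-resource additions remain possible.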

Naming convention enforcement via validation:

variable "cluster_name" {
  type = string
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{2,28}[a-z0-9]$", var.cluster_name))
    error_message = "Cluster name must be lowercase alphanumeric with hyphens, 4-30 chars."
  }
}

Validation blocks in Terraform enforce naming conventions at plan time, before resources are created, with human-readable error messages.

Practice 8: GitOps Integration for IaC

GitOps extends the infrastructure as code best practices around version control to automated reconciliation: Git is not just where the code lives, it is the mechanism that triggers and controls deployments.

The PR-based deployment flow:

Developer → Feature branch → terraform plan (automatic) → PR review
                                                              |
                                        Required reviewers approve
                                                              |
                                             Merge to main
                                                              |
                                        terraform apply (automatic)
                                                              |
                                         Drift detection (scheduled)

No engineer should be running terraform apply from a local terminal against production. This is the infrastructure as code best practice that most teams document but fewer enforce. Enforcing it requires removing direct cloud console access for engineers (or making it audit-logged and exceptional) and making the CI pipeline the only apply path.

Atlantis for collaborative Terraform GitOps:

# atlantis.yaml
version: 3
projects:
  - name: production-vpc
    dir: environments/production/vpc
    workspace: default
    autoplan:
      when_modified: ["*.tf", "../../modules/vpc/*.tf"]
    apply_requirements: [approved, mergeable]

With Atlantis, terraform plan runs automatically on PR creation, the plan output is posted as a PR comment, and terraform apply only runs after PR approval and merge. The apply is triggered by a comment (atlantis apply) or automatically on merge, depending on configuration.

Practice 9: Environment Parity Through Variables, Not Structure

Maintaining environment parity (dev, staging, and production behaving identically in structure, differing only in scale and configuration) is one of the infrastructure as code best practices most directly connected to incident reduction. The DORA research finding that teams maintaining environment parity have significantly lower change failure rates reflects this.

The pattern that breaks parity is creating separate module implementations for each environment. The pattern that maintains it is using the same modules with different variable inputs.

Correct: single module, environment-specific values:

# environments/dev/main.tf
module "eks" {
  source              = "../../modules/eks-cluster"
  cluster_name        = "dev-cluster"
  node_instance_types = ["m6i.large"]    # Smaller in dev
  min_size            = 1
  max_size            = 3
}

# environments/production/main.tf
module "eks" {
  source              = "../../modules/eks-cluster"
  cluster_name        = "prod-cluster"
  node_instance_types = ["m6i.2xlarge"]  # Larger in production
  min_size            = 3
  max_size            = 12
}

Wrong: separate module implementations per environment:

modules/
├── eks-cluster-dev/      # Different code path for dev
└── eks-cluster-prod/     # Different code path for production

When modules diverge, bugs exist in only one environment. The staging environment stops accurately predicting production behavior. The infrastructure as code best practice is that the difference between environments is always and only expressed as variable values.
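One way to keep those per-environment deltas reviewable in a single place is a shared locals map keyed by environment. A sketch, assuming a var.environment input is defined (names are illustrative):

```hcl
locals {
  # Every scale difference between environments, visible in one diff
  env_sizing = {
    dev        = { instance_types = ["m6i.large"],   min_size = 1, max_size = 3 }
    staging    = { instance_types = ["m6i.xlarge"],  min_size = 2, max_size = 6 }
    production = { instance_types = ["m6i.2xlarge"], min_size = 3, max_size = 12 }
  }
  sizing = local.env_sizing[var.environment]
}

module "eks" {
  source              = "../../modules/eks-cluster"
  cluster_name        = "${var.environment}-cluster"
  node_instance_types = local.sizing.instance_types
  min_size            = local.sizing.min_size
  max_size            = local.sizing.max_size
}
```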

Practice 10: Progressive Deployment and Rollback

The final infrastructure as code best practice is treating infrastructure changes with the same deployment discipline applied to application code: deploy incrementally, verify at each stage, maintain rollback capability.

Terraform target for incremental rollouts:

# Deploy network changes first, validate, then compute
terraform apply -target=module.vpc
# Verify networking is correct
terraform apply -target=module.eks
# Verify cluster is healthy
terraform apply  # Apply remaining resources

-target is explicitly not a recommended daily workflow: it creates partial state that can hide dependencies. It is reserved for high-risk changes where a staged rollout reduces the blast radius.

State versioning for rollback:

S3 versioning on the state bucket enables rollback to a previous known-good state:

# List state versions
aws s3api list-object-versions \
  --bucket company-terraform-state-prod \
  --prefix infrastructure/production/terraform.tfstate

# Restore previous state version
aws s3api get-object \
  --bucket company-terraform-state-prod \
  --key infrastructure/production/terraform.tfstate \
  --version-id PREVIOUS_VERSION_ID \
  terraform.tfstate.backup

# Restore (use with extreme caution - review differences first)
aws s3 cp terraform.tfstate.backup \
  s3://company-terraform-state-prod/infrastructure/production/terraform.tfstate

State rollback does not undo infrastructure changes: resources already deleted are not recreated automatically. It restores Terraform’s view of what exists, which then requires a careful plan to reconcile with reality. It is a recovery mechanism, not a substitute for careful change management.
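Related to this, when the reconciliation decision is "Terraform should forget this resource without destroying it," Terraform 1.7+ offers a declarative, PR-reviewable alternative to terraform state rm (resource name hypothetical):

```hcl
# On the next apply, drop this resource from state but leave it running
removed {
  from = aws_instance.legacy_worker

  lifecycle {
    destroy = false
  }
}
```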

The Infrastructure as Code Best Practices Checklist

Bringing all ten patterns together into an operational checklist:

Module design:

  • Single responsibility per module, no environment logic inside modules.
  • All environment differences expressed as variables.
  • Semantic versioning on shared modules with a pinned version in consuming configurations.

State management:

  • Remote backend per environment, never shared.
  • S3 native locking enabled (DynamoDB-based locking is deprecated).
  • State bucket with versioning and access logging.
  • IAM roles scoped to specific state paths, not entire buckets.

Secrets:

  • No secrets generated with random_password and stored in outputs.
  • AWS Secrets Manager or Vault as the source of truth for credentials.
  • Pre-commit hooks scanning for secrets before commit.
  • State files treated as sensitive documents with restricted access.

Testing pipeline:

  • terraform fmt -check and terraform validate on every commit.
  • Checkov and TFLint on every PR.
  • terraform test for unit tests on module logic.
  • Terratest integration tests in a sandbox account for significant changes.

Drift detection:

  • Scheduled plan runs every 4-6 hours on all production environments.
  • Alerting to Slack/PagerDuty when plan is non-empty.
  • Documented process for handling detected drift.

Policy as code:

  • Checkov in CI with hard failure mode.
  • OPA/Conftest for organization-specific rules.
  • Required tags validated at plan time.

GitOps:

  • No manual terraform apply from local terminals to production.
  • All applies via CI pipeline after PR approval.
  • Plan output posted to PR for review before merge.

Conclusion

Infrastructure as code best practices at the level of production teams in 2026 are not about knowing Terraform syntax; they are about the operational discipline that prevents the 67% drift rate, the state conflicts that corrupt infrastructure records, and the secrets accidentally stored in plaintext in state files.

The ten practices in this guide are not independent suggestions. They form a system: modules that are testable because they have clear boundaries, state management that prevents concurrent apply failures, secrets handling that keeps credentials out of the audit trail, testing that catches failures before production, and drift detection that closes the loop between declarations and reality.

At The Good Shell we implement and operate IaC pipelines for DevOps and platform engineering teams at funded startups. See our DevOps and infrastructure services or our case studies to see what a mature IaC implementation looks like in practice.

For the current Terraform and OpenTofu documentation on state backends and provider best practices, the HashiCorp Terraform documentation and OpenTofu docs are the authoritative references.