Infrastructure as code best practices are well documented at the beginner level. The documentation tells you to use version control, use modules, and avoid hardcoded secrets. What it does not tell you is that 67% of teams using IaC experience significant configuration drift, that teams dealing with frequent drift have 2.3x higher change failure rates, and that the most common production incidents caused by IaC failures trace back to three specific anti-patterns: monolithic state files shared across environments, secrets stored in Terraform outputs, and CI pipelines that run apply without scheduled drift detection in between.
This guide covers ten infrastructure as code best practices at the level of teams who are already past the basics: module design for reusability at scale, remote state architecture that prevents concurrent apply conflicts, secrets management that keeps credentials out of state, a testing pipeline with four distinct stages, drift detection cadence, policy as code enforcement, tagging and naming conventions that make infrastructure auditable, GitOps integration, environment parity, and progressive deployment with rollback.
Why Infrastructure as Code Best Practices Matter More in 2026
The context for infrastructure as code best practices has shifted significantly. Three changes define the current landscape.
First, Terraform’s licensing change in 2023 drove substantial migration to OpenTofu, the community-maintained fork. As of 2026, new projects are increasingly choosing OpenTofu for greenfield work. The practices in this guide apply equally to both: the tooling differs, the patterns do not.
Second, IaC footprints have grown. Teams managing ten modules in 2020 are managing hundreds in 2026, often across multiple cloud accounts, multiple regions, and multiple teams with different maturity levels. The infrastructure as code best practices that work for a single engineer managing one environment break down at this scale.
Third, security requirements have tightened. Running IaC without policy enforcement is no longer acceptable in regulated environments and increasingly not acceptable in Series A/B startups handling customer data. Policy as code tools like OPA, Checkov, and Sentinel are now expected elements of a mature IaC pipeline, not advanced add-ons.
Practice 1: Modular Design With Clear Boundaries
The most foundational of all infrastructure as code best practices is module design, and the most common violation is building modules that are too large, too opinionated, or too coupled to a specific deployment context.
Good modules have three properties: they encapsulate a single infrastructure concept, they are parameterized for all environment-specific values, and they expose only the outputs that callers need.
Repository structure that scales:
infrastructure/
├── modules/
│ ├── vpc/ # VPC, subnets, routing - no environment logic
│ ├── eks-cluster/ # EKS cluster - no app-specific config
│ ├── rds-postgres/ # RDS module - no application schemas
│ └── kubernetes-namespace/ # Namespace + RBAC - no workload manifests
├── environments/
│ ├── dev/
│ │ ├── main.tf # Calls modules with dev-specific values
│ │ ├── variables.tf
│ │ └── backend.tf # Dev state backend
│ ├── staging/
│ │ ├── main.tf
│ │ └── backend.tf
│ └── production/
│ ├── main.tf
│ └── backend.tf
└── live/ # Deployed state, Terragrunt pattern
What a well-designed module looks like:
# modules/eks-cluster/main.tf
# Module: eks-cluster
# Purpose: Creates an EKS cluster with managed node groups
# Does NOT: configure applications, namespaces, or workloads
variable "cluster_name" { type = string }
variable "kubernetes_version" { type = string }
variable "node_instance_types" { type = list(string) }
variable "min_size" { type = number }
variable "max_size" { type = number }
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }
variable "tags" { type = map(string) }
# Module implementation...
output "cluster_endpoint" { value = aws_eks_cluster.this.endpoint }
output "cluster_name" { value = aws_eks_cluster.this.name }
output "cluster_ca_certificate" {
value = aws_eks_cluster.this.certificate_authority[0].data
sensitive = true
}
The root module in each environment becomes thin and readable:
# environments/production/main.tf
module "eks" {
source = "../../modules/eks-cluster"
cluster_name = "prod-cluster"
kubernetes_version = "1.30"
node_instance_types = ["m6i.2xlarge"]
min_size = 3
max_size = 12
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
tags = local.common_tags
}
The anti-pattern to avoid:
Modules that contain environment-specific logic (if var.environment == "production" then...) make it impossible to test environments in isolation and create hidden coupling between deployment contexts. Every difference between environments should be expressed as a variable value, never as conditional logic inside a module.
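To make the anti-pattern concrete, here is an illustrative sketch (resource and variable names are hypothetical) contrasting conditional logic inside a module with the parameterized alternative:

```hcl
# Anti-pattern: the module decides based on environment
resource "aws_db_instance" "coupled" {
  instance_class = var.environment == "production" ? "db.m6i.2xlarge" : "db.t3.medium"
  multi_az       = var.environment == "production"
}

# Better: the module accepts values; each environment's root module decides
resource "aws_db_instance" "parameterized" {
  instance_class = var.instance_class
  multi_az       = var.multi_az
}
```

The second form lets dev exercise the exact code path production will run, only with smaller values.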
Practice 2: Remote State With Isolated Backends Per Environment
Local state files are incompatible with team development. This is not a matter of best-practice preference: a local terraform.tfstate file on one engineer’s laptop means another engineer running terraform apply operates on a different view of reality.
Remote state with backend isolation per environment is the infrastructure as code best practice that prevents the most destructive class of IaC incidents: state conflicts causing unexpected resource destruction.
S3 + native locking configuration (Terraform 1.10+ / current best practice):
# environments/production/backend.tf
terraform {
backend "s3" {
bucket = "company-terraform-state-prod"
key = "infrastructure/production/terraform.tfstate"
region = "us-east-1"
encrypt = true
# Native S3 locking (replaces DynamoDB - DynamoDB locking is deprecated)
use_lockfile = true
}
}
Note: DynamoDB-based state locking is deprecated as of Terraform 1.10. S3 native locking via use_lockfile = true is the current recommended approach for AWS backends.
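For teams still on DynamoDB locking, one migration sketch (bucket and table names follow the example above; verify the option against your Terraform version before applying):

```hcl
terraform {
  backend "s3" {
    bucket  = "company-terraform-state-prod"
    key     = "infrastructure/production/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # dynamodb_table = "terraform-locks"   # remove the deprecated setting
    use_lockfile = true                    # native S3 lock file takes over
  }
}
```

After editing the backend block, re-run terraform init -reconfigure; the state location is unchanged, so no state migration occurs.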
Critical: separate backend per environment, never shared:
# Correct - each environment has isolated state
s3://company-state/infrastructure/production/terraform.tfstate
s3://company-state/infrastructure/staging/terraform.tfstate
s3://company-state/infrastructure/dev/terraform.tfstate
# Dangerous - shared state with workspace prefix
s3://company-state/infrastructure/terraform.tfstate # Do not do this
Workspaces share the same backend configuration and the same locking scope. A runaway terraform destroy in the wrong workspace with shared state can cascade in ways that isolated backends prevent. Separate backends are the infrastructure as code best practice; workspaces are appropriate for lightweight environment variants, not for production/staging/dev isolation.
State access controls:
# IAM policy for CI/CD role - read/write to specific state path only
{
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::company-terraform-state-prod/infrastructure/production/*"
},
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::company-terraform-state-prod"
}
]
}
Practice 3: Secrets Never in State
Storing secrets in Terraform state is the most dangerous anti-pattern in IaC security and one of the most common violations of infrastructure as code best practices. The mechanism that creates the vulnerability is subtle: any resource attribute that Terraform reads from a provider and stores in state is accessible to anyone with read access to the state file. Even attributes marked sensitive in the provider schema are stored in plaintext in state.
The pattern that puts secrets in state:
# DANGEROUS: This writes the password to state in plaintext
resource "random_password" "db" {
length = 32
special = true
}
resource "aws_db_instance" "main" {
password = random_password.db.result # Now in state
}
output "db_password" {
value = random_password.db.result # Also in state
sensitive = true # Only hides it from terminal output, NOT from state
}
The infrastructure as code best practice for secrets:
# Reference existing secrets via data source when a value is needed at plan time
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "production/rds/master-password"
}
# Caution: referencing secret_string from this data source in a resource
# would copy the value into state - prefer provider-managed secrets as below
resource "aws_db_instance" "main" {
# Password is fetched from Secrets Manager at apply time
# Only the secret ARN reference is in state, not the password value
manage_master_user_password = true
master_user_secret_kms_key_id = aws_kms_key.rds.arn
}
For secrets that must be generated and stored:
# Use Vault provider - Vault is the source of truth, but note the Vault
# provider still records data_json in state (marked sensitive), so state
# access controls remain essential
provider "vault" {}
resource "vault_generic_secret" "db_password" {
path = "secret/production/database"
data_json = jsonencode({
password = var.db_password # Provided at apply time via TF_VAR_, never committed
})
}
Pre-commit hook to catch secrets before they reach the repository:
# .pre-commit-config.yaml
repos:
- repo: https://github.com/gitleaks/gitleaks
rev: v8.18.0
hooks:
- id: gitleaks
- repo: https://github.com/trufflesecurity/trufflehog
rev: v3.67.0
hooks:
- id: trufflehog
Rotate all secrets if state file access cannot be fully audited. The state file is as sensitive as the secrets it contains.
Practice 4: The Four-Stage IaC Testing Pipeline
Testing infrastructure as code at a single point (static validation before merge) misses the failure modes that only appear when infrastructure is actually provisioned. The infrastructure as code best practice is a four-stage testing pipeline that catches different failure classes at different points.
Stage 1: Static analysis (runs on every commit, seconds):
# .github/workflows/iac-static.yml
name: IaC Static Analysis
on: [pull_request]
jobs:
static:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: terraform fmt check
run: terraform fmt -check -recursive
- name: terraform validate
run: |
terraform init -backend=false
terraform validate
- name: Setup TFLint
uses: terraform-linters/setup-tflint@v4
- name: TFLint
run: tflint --recursive
- name: Checkov
uses: bridgecrewio/checkov-action@v12
with:
directory: .
framework: terraform
soft_fail: false
Stage 2: Unit tests with terraform test (runs on PR, minutes):
# modules/eks-cluster/tests/cluster.tftest.hcl
run "cluster_name_format" {
variables {
cluster_name = "test-cluster"
kubernetes_version = "1.30"
node_instance_types = ["m6i.large"]
min_size = 1
max_size = 3
vpc_id = "vpc-test"
subnet_ids = ["subnet-test"]
tags = {}
}
# Validate plan without deploying
command = plan
assert {
condition = output.cluster_name == "test-cluster"
error_message = "Cluster name output does not match input"
}
}
Stage 3: Security scanning with OPA/Conftest (runs on PR, seconds):
# policies/terraform/no-public-s3.rego
package terraform.s3
deny[msg] {
resource := input.resource.aws_s3_bucket[_]
resource.acl == "public-read"
msg := sprintf("S3 bucket '%v' has public-read ACL - must be private", [resource.bucket])
}
deny[msg] {
resource := input.resource.aws_s3_bucket[_]
not resource.server_side_encryption_configuration
msg := sprintf("S3 bucket '%v' has no server-side encryption configured", [resource.bucket])
}
Stage 4: Integration tests with Terratest (runs on merge to main, minutes to hours):
// test/eks_cluster_test.go
package test
import (
"testing"

"github.com/gruntwork-io/terratest/modules/aws"
"github.com/gruntwork-io/terratest/modules/random"
"github.com/gruntwork-io/terratest/modules/terraform"
"github.com/stretchr/testify/assert"
)
func TestEKSClusterDeploys(t *testing.T) {
t.Parallel()
options := &terraform.Options{
TerraformDir: "../examples/eks-cluster",
Vars: map[string]interface{}{
"cluster_name": "test-eks-" + random.UniqueId(),
"kubernetes_version": "1.30",
"node_instance_types": []string{"m6i.large"},
"min_size": 1,
"max_size": 2,
},
}
defer terraform.Destroy(t, options)
terraform.InitAndApply(t, options)
clusterName := terraform.Output(t, options, "cluster_name")
assert.NotEmpty(t, clusterName)
// Verify cluster exists in AWS
cluster := aws.GetEksCluster(t, "us-east-1", clusterName)
assert.Equal(t, "ACTIVE", *cluster.Status)
}
Integration tests deploy real infrastructure into a sandbox account and destroy it after validation. They are expensive in time and money but catch the category of failure that static analysis cannot: actual provider behavior, IAM permission boundaries, and resource dependency ordering.
Practice 5: Drift Detection on a Schedule
Running terraform plan once before each terraform apply is necessary but not sufficient. The infrastructure as code best practice for drift detection is scheduled plan runs between deployments that alert when something changes outside of IaC.
Manual console changes are the most common source of drift. An engineer fixes an incident by editing a security group rule directly. The fix works. The rule is not reflected in Terraform. The next terraform apply – potentially weeks later – sees the rule as drift and plans to remove it.
Scheduled drift detection via GitHub Actions:
# .github/workflows/drift-detection.yml
name: Drift Detection
on:
schedule:
- cron: '0 */6 * * *' # Every 6 hours
workflow_dispatch: # Allow manual trigger
jobs:
detect-drift:
runs-on: ubuntu-latest
strategy:
matrix:
environment: [staging, production]
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets[format('AWS_ROLE_{0}', matrix.environment)] }}
aws-region: us-east-1
- name: Terraform Init
working-directory: environments/${{ matrix.environment }}
run: terraform init
- name: Terraform Plan (drift check)
id: plan
working-directory: environments/${{ matrix.environment }}
run: |
set +e # Exit code 2 (drift) must not abort the step before it is captured
terraform plan -detailed-exitcode -out=drift.tfplan
echo "exit_code=$?" >> "$GITHUB_OUTPUT"
- name: Alert on drift
if: steps.plan.outputs.exit_code == '2'
uses: slackapi/slack-github-action@v1
with:
slack-message: |
Drift detected in ${{ matrix.environment }}!
Review: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
Exit code 2 from terraform plan -detailed-exitcode means the plan is non-empty: something has drifted. Exit code 0 means no changes are needed. Exit code 1 means an error.
When drift is detected, the decision is explicit: import the change into state, revert the manual change, or update the IaC to declare it intentionally. The infrastructure as code best practice is that all three options are valid, but the choice must be made deliberately, not discovered accidentally during the next deployment.
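When the decision is to adopt the manual change, Terraform 1.5+ import blocks let the adoption itself go through the PR flow rather than an engineer's terminal. A hedged sketch with a hypothetical security group rule and import ID:

```hcl
# Declare the resource as it exists after the incident fix
resource "aws_security_group_rule" "incident_fix" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/8"]
  security_group_id = var.app_security_group_id # hypothetical variable
}

# Adopt the existing rule into state on the next apply
import {
  to = aws_security_group_rule.incident_fix
  id = "sg-0123456789abcdef0_ingress_tcp_443_443_10.0.0.0/8" # provider-specific ID format
}
```

terraform plan then shows the import alongside any attribute differences, and the import block can be deleted once the apply has completed.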
Practice 6: Policy as Code Enforcement in the Pipeline
Policy as code converts compliance requirements into automated checks that run before infrastructure reaches production. The infrastructure as code best practice is enforcement, not suggestion: violations block the pipeline.
Checkov integrated in CI (catches 1000+ misconfigurations):
- name: Checkov scan
uses: bridgecrewio/checkov-action@v12
with:
directory: environments/production
framework: terraform
check: CKV_AWS_18,CKV_AWS_19,CKV_AWS_20 # Specific checks
# OR
skip_check: CKV_AWS_144 # Skip with documented justification
output_format: sarif
soft_fail: false # Hard fail on violations
OPA policies for organization-specific rules:
# policies/tagging.rego
package terraform.tagging
required_tags := {"environment", "team", "cost-center"}
deny[msg] {
resource := input.resource[_][_]
missing := required_tags - {key | resource.tags[key]}
count(missing) > 0
msg := sprintf(
"Resource is missing required tags: %v",
[concat(", ", missing)]
)
}
# Apply OPA policies with conftest
- name: Conftest policy check
run: |
terraform show -json terraform.tfplan > plan.json
conftest test plan.json \
--policy policies/ \
--namespace terraform
Policy as code covers what static scanners miss: organization-specific rules like mandatory tagging conventions, approved instance type lists, required encryption configurations, and IAM permission boundaries that are context-specific. These cannot be expressed as generic rules in Checkov but can be expressed precisely in Rego.
Practice 7: Tagging and Naming Conventions as Code
Untagged resources are a FinOps and audit nightmare. Infrastructure as code best practices around tagging are most effective when implemented as module-level defaults that are difficult to bypass rather than guidelines that each engineer applies inconsistently.
Common tags module:
# modules/common-tags/main.tf
variable "environment" {
type = string
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Environment must be dev, staging, or production."
}
}
variable "team" { type = string }
variable "cost_center" { type = string }
variable "repository" { type = string }
locals {
common_tags = {
environment = var.environment
team = var.team
cost_center = var.cost_center
repository = var.repository
managed_by = "terraform"
# Avoid timestamp() here: it changes on every run and produces a permanent diff
}
}
output "tags" { value = local.common_tags }
Applied across all environments:
# environments/production/main.tf
module "tags" {
source = "../../modules/common-tags"
environment = "production"
team = "platform"
cost_center = "infra-001"
repository = "github.com/company/infrastructure"
}
module "vpc" {
source = "../../modules/vpc"
tags = module.tags.tags
# ...
}
Naming convention enforcement via validation:
variable "cluster_name" {
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{2,28}[a-z0-9]$", var.cluster_name))
error_message = "Cluster name must be lowercase alphanumeric with hyphens, 4-30 chars."
}
}
Validation blocks in Terraform enforce naming conventions at plan time, before resources are created, with human-readable error messages.
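Assuming the AWS provider, default_tags is a complementary enforcement layer worth considering: the provider applies the tag set to every taggable resource it creates, catching resources whose module calls forget to pass tags.

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates;
  # resource-level tags are merged on top and win on conflicts
  default_tags {
    tags = {
      environment = "production"
      team        = "platform"
      managed_by  = "terraform"
    }
  }
}
```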
Practice 8: GitOps Integration for IaC
GitOps extends the infrastructure as code best practices around version control to automated reconciliation: Git is not just where the code lives, it is the mechanism that triggers and controls deployments.
The PR-based deployment flow:
Developer → Feature branch → terraform plan (automatic) → PR review
|
Required reviewers approve
|
Merge to main
|
terraform apply (automatic)
|
Drift detection (scheduled)
No engineer should be running terraform apply from a local terminal against production. This is the infrastructure as code best practice that most teams document but fewer enforce. Enforcing it requires removing direct cloud console access for engineers (or making it audit-logged and exceptional) and making the CI pipeline the only apply path.
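One way to make the pipeline the only apply path, sketched under the assumption of GitHub Actions with AWS OIDC federation (role, repository, and provider names are hypothetical): the production apply role trusts only workflow runs from the main branch, so a laptop cannot assume it.

```hcl
data "aws_iam_policy_document" "ci_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # Only runs from the main branch of the infrastructure repo qualify
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:company/infrastructure:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "terraform_apply_production" {
  name               = "terraform-apply-production"
  assume_role_policy = data.aws_iam_policy_document.ci_assume.json
}
```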
Atlantis for collaborative Terraform GitOps:
# atlantis.yaml
version: 3
projects:
- name: production-vpc
dir: environments/production/vpc
workspace: default
autoplan:
when_modified: ["*.tf", "../../modules/vpc/*.tf"]
apply_requirements: [approved, mergeable]
With Atlantis, terraform plan runs automatically on PR creation, the plan output is posted as a PR comment, and terraform apply only runs after PR approval and merge. The apply is triggered by a comment (atlantis apply) or automatically on merge, depending on configuration.
Practice 9: Environment Parity Through Variables, Not Structure
Maintaining environment parity (dev, staging, and production behaving identically in structure, differing only in scale and configuration) is one of the infrastructure as code best practices most directly connected to incident reduction. DORA research finds that teams maintaining environment parity have significantly lower change failure rates.
The pattern that breaks parity is creating separate module implementations for each environment. The pattern that maintains it is using the same modules with different variable inputs.
Correct: single module, environment-specific values:
# environments/dev/main.tf
module "eks" {
source = "../../modules/eks-cluster"
cluster_name = "dev-cluster"
node_instance_types = ["m6i.large"] # Smaller in dev
min_size = 1
max_size = 3
}
# environments/production/main.tf
module "eks" {
source = "../../modules/eks-cluster"
cluster_name = "prod-cluster"
node_instance_types = ["m6i.2xlarge"] # Larger in production
min_size = 3
max_size = 12
}
Wrong: separate module implementations per environment:
modules/
├── eks-cluster-dev/ # Different code path for dev
└── eks-cluster-prod/ # Different code path for production
When modules diverge, bugs exist in only one environment. The staging environment stops accurately predicting production behavior. The infrastructure as code best practice is that the difference between environments is always and only expressed as variable values.
Practice 10: Progressive Deployment and Rollback
The final infrastructure as code best practice is treating infrastructure changes with the same deployment discipline applied to application code: deploy incrementally, verify at each stage, maintain rollback capability.
Terraform target for incremental rollouts:
# Deploy network changes first, validate, then compute
terraform apply -target=module.vpc
# Verify networking is correct
terraform apply -target=module.eks
# Verify cluster is healthy
terraform apply # Apply remaining resources
-target is explicitly not a recommended daily workflow: it creates partial state that can hide dependencies. It is reserved for high-risk changes where a staged rollout reduces blast radius.
State versioning for rollback:
S3 versioning on the state bucket enables rollback to a previous known-good state:
# List state versions
aws s3api list-object-versions \
--bucket company-terraform-state-prod \
--prefix infrastructure/production/terraform.tfstate
# Restore previous state version
aws s3api get-object \
--bucket company-terraform-state-prod \
--key infrastructure/production/terraform.tfstate \
--version-id PREVIOUS_VERSION_ID \
terraform.tfstate.backup
# Restore (use with extreme caution - review differences first)
aws s3 cp terraform.tfstate.backup \
s3://company-terraform-state-prod/infrastructure/production/terraform.tfstate
State rollback does not undo infrastructure changes: resources already deleted are not recreated automatically. It restores Terraform’s view of what exists, which then requires a careful plan to reconcile with reality. It is a recovery mechanism, not a substitute for careful change management.
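Rollback planning pairs well with guarding the resources that are most expensive to recreate. A lifecycle prevent_destroy sketch (instance attributes elided):

```hcl
resource "aws_db_instance" "main" {
  # ... instance configuration ...

  lifecycle {
    # Any plan that would destroy this resource fails instead of executing
    prevent_destroy = true
  }
}
```

This converts an accidental destructive apply into a hard planning error that must be consciously removed before the resource can be destroyed.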
The Infrastructure as Code Best Practices Checklist
Bringing all ten patterns together into an operational checklist:
Module design:
- Single responsibility per module, no environment logic inside modules.
- All environment differences expressed as variables.
- Semantic versioning on shared modules with a pinned version in consuming configurations.
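The version-pinning item can take two common shapes, sketched here with one real registry module and one hypothetical in-house Git source:

```hcl
# Registry module: bounded version constraint
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.8" # allows 5.8.x and newer 5.x, blocks 6.0
}

# In-house module via Git: pin to a release tag
module "eks" {
  source = "git::https://github.com/company/infrastructure.git//modules/eks-cluster?ref=v1.4.2"
}
```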
State management:
- Remote backend per environment, never shared.
- S3 native locking enabled (not DynamoDB – deprecated).
- State bucket with versioning and access logging.
- IAM roles scoped to specific state paths, not entire buckets.
Secrets:
- No secrets generated with random_password and stored in outputs.
- AWS Secrets Manager or Vault as the source of truth for credentials.
- Pre-commit hooks scanning for secrets before commit.
- State files treated as sensitive documents with restricted access.
Testing pipeline:
- terraform fmt -check and terraform validate on every commit.
- Checkov and TFLint on every PR.
- terraform test for unit tests on module logic.
- Terratest integration tests in a sandbox account for significant changes.
Drift detection:
- Scheduled plan runs every 4-6 hours on all production environments.
- Alerting to Slack/PagerDuty when plan is non-empty.
- Documented process for handling detected drift.
Policy as code:
- Checkov in CI with hard failure mode.
- OPA/Conftest for organization-specific rules.
- Required tags validated at plan time.
GitOps:
- No manual terraform apply from local terminals to production.
- All applies via CI pipeline after PR approval.
- Plan output posted to PR for review before merge.
Conclusion
Infrastructure as code best practices at the level of production teams in 2026 are not about knowing Terraform syntax; they are about the operational discipline that prevents the 67% drift rate, the state conflicts that corrupt infrastructure records, and the secrets accidentally stored in plaintext in state files.
The ten practices in this guide are not independent suggestions. They form a system: modules that are testable because they have clear boundaries, state management that prevents concurrent apply failures, secrets handling that keeps credentials out of the audit trail, testing that catches failures before production, and drift detection that closes the loop between declarations and reality.
At The Good Shell we implement and operate IaC pipelines for DevOps and platform engineering teams at funded startups. See our DevOps and infrastructure services or our case studies to see what a mature IaC implementation looks like in practice.
For the current Terraform and OpenTofu documentation on state backends and provider best practices, the HashiCorp Terraform documentation and OpenTofu docs are the authoritative references.

