Chaos engineering Kubernetes tutorials cover the same experiments: kill a pod, inject network latency, simulate a node failure. Those experiments are valuable for stateless web applications. They are insufficient for blockchain workloads. A Cosmos validator or Ethereum consensus client that survives pod eviction in a chaos experiment can still get slashed in production when clock skew causes it to miss attestations, when a peer disconnect cascade leaves it isolated from the network, or when a timing edge case triggers a signing timeout that no generic chaos test would ever surface.
The gap between standard chaos engineering Kubernetes practice and what blockchain infrastructure actually needs is large and largely undocumented. Trail of Bits published Attacknet in March 2024, built in collaboration with the Ethereum Foundation, specifically to address the limitations of traditional runtime verification tools when applied to blockchain nodes. Attacknet subjects devnets to the most challenging network conditions imaginable, and Trail of Bits was able to reproduce the Ethereum finality incident of May 2023 using a clock skew fault that conventional chaos testing would not have caught. (Trail of Bits Attacknet)
This guide covers chaos engineering Kubernetes for blockchain workloads specifically: the theoretical framework from the ChaosETH academic methodology, the tooling stack (Chaos Mesh, LitmusChaos, Attacknet), six production experiments designed for validator infrastructure, the blast radius controls that make chaos testing safe to run alongside live workloads, and the monitoring layer that tells you whether your validators survived.
Why Chaos Engineering Kubernetes Is Different for Blockchain Workloads
Standard chaos engineering Kubernetes experiments test for resilience to infrastructure failures: pods crashing, nodes going offline, network latency spiking. These are important failure modes for any distributed system. Blockchain validators have additional failure modes that are protocol-specific and financially consequential in ways that a crashed web server is not.
The slashing dimension. A Cosmos validator or Ethereum consensus client that double-signs, even accidentally, due to a failover configuration activating while the primary is still running, gets slashed. The economic penalty is immediate and irreversible. A chaos experiment that tests failover without verifying that dual-signing cannot occur during the transition is not testing the right thing. See our Cosmos validator slashing guide for the full slashing risk taxonomy.
The consensus timing dimension. Blockchain consensus protocols have strict timing requirements. A validator that is 2 seconds late submitting an attestation misses it. A validator whose clock drifts by more than the protocol’s tolerance window gets excluded from consensus. Clock skew injection, applying a time delta to a running validator pod, is not a standard Kubernetes chaos experiment, but it is the most realistic failure mode for consensus clients.
The peer connectivity dimension. A blockchain node that loses connectivity to all peers is not just “unavailable” in the way a web service is unavailable. It continues running, consuming resources, and potentially accumulating signing obligations it cannot fulfill, leading to missed attestations and performance penalties that accumulate silently until the operator reviews epoch reports.
The state corruption dimension. A database corruption that causes a web service to return errors will manifest immediately in failed requests. A corrupted validator state database may cause the node to believe it has signed something it has not, or vice versa, creating silent inconsistencies that only manifest during specific network conditions.
The ChaosETH academic methodology published in ACM Distributed Ledger Technologies identified these blockchain-specific failure modes as fundamentally different from those addressed by conventional chaos engineering frameworks. The study found that Ethereum clients exhibit unique resilience patterns under peer isolation that standard Kubernetes chaos tools do not test for. (ChaosETH paper)
Chaos Engineering Kubernetes Prerequisites
Cluster setup:
# Verify cluster access
kubectl cluster-info
kubectl get nodes
# Verify you have a dedicated chaos testing namespace
kubectl create namespace chaos-testing
kubectl create namespace validators-chaos # Your blockchain workloads here
# Never run chaos experiments in production namespaces without
# explicit blast radius controls - covered in the next section
Install Chaos Mesh:
Chaos Mesh is the CNCF-hosted chaos engineering platform that both Attacknet and the experiments in this guide build on:
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace chaos-mesh \
--create-namespace \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock
# Verify installation
kubectl get pods -n chaos-mesh
# chaos-controller-manager, chaos-daemon, chaos-dashboard should all be Running
# Access Chaos Mesh dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Open http://localhost:2333
Install LitmusChaos (alternative for workflow-based experiments):
LitmusChaos provides pre-built chaos experiment libraries including Kubernetes-specific scenarios:
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
# Verify
kubectl get pods -n litmus
Monitoring prerequisites:
Chaos experiments without monitoring are blind. Before running any experiment, verify that your Prometheus stack is scraping validator metrics and that your Grafana dashboards show:
- Attestation inclusion distance (Ethereum validators).
- Block production rate (Cosmos validators).
- Peer count per validator pod.
- Signing latency.
If these metrics are not visible before the experiment, you will not be able to attribute post-experiment degradation to the chaos injection. The sketch below shows one way to capture a baseline snapshot before each run.
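Capturing the baseline can be scripted. A minimal sketch, assuming an in-cluster Prometheus reachable at prometheus:9090 and the metric names listed above (adjust both to your stack):
# capture-baseline.sh - snapshot key validator metrics before an experiment
PROM="http://prometheus:9090"   # assumed in-cluster Prometheus address
STAMP=$(date -u +%Y%m%dT%H%M%SZ)
for METRIC in \
  ethereum_attestation_inclusion_distance \
  cosmos_validator_missed_blocks_total \
  p2p_peer_count \
  validator_signing_latency_seconds
do
  curl -s "${PROM}/api/v1/query" --data-urlencode "query=${METRIC}" \
    | jq -c --arg m "${METRIC}" '{metric: $m, result: .data.result}' \
    >> "baseline-${STAMP}.json"
done
echo "Baseline written to baseline-${STAMP}.json"
Re-run the same queries after the experiment ends; any attribution argument starts from this file.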
Blast Radius Controls: The Non-Negotiable Safety Layer
Every chaos engineering Kubernetes experiment must be wrapped in blast radius controls. For blockchain workloads where a misconfigured experiment can cause slashing, these controls are not optional.
Namespace isolation:
# chaos-testing-namespace.yaml
# Run chaos experiments against a shadow validator setup,
# not against production validators directly
apiVersion: v1
kind: Namespace
metadata:
name: validators-chaos
labels:
chaos-testing: enabled
environment: chaos-shadow
---
# Network policy: chaos namespace cannot reach production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: isolate-chaos-namespace
namespace: validators-chaos
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: validators-chaos
  # Only allow egress within the chaos namespace - not to production
Chaos Mesh permission scoping:
# chaos-rbac.yaml - limit which namespaces chaos experiments can affect
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: chaos-experiment-runner
namespace: validators-chaos
rules:
- apiGroups: ["chaos-mesh.org"]
resources: ["*"]
verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: chaos-experiment-runner
namespace: validators-chaos # Scoped to chaos namespace only
subjects:
- kind: ServiceAccount
name: chaos-runner
namespace: validators-chaos
roleRef:
kind: Role
name: chaos-experiment-runner
  apiGroup: rbac.authorization.k8s.io
Experiment duration limits:
Every Chaos Mesh experiment must have an explicit duration. Never run an experiment without a duration field: an experiment without a duration runs until manually stopped, and a forgotten network partition experiment can leave validator pods isolated indefinitely.
# Every experiment must include:
spec:
duration: "5m" # Maximum experiment duration
  # Also set a graceful recovery time after duration
Pre-experiment checklist:
# Before every chaos experiment:
# 1. Verify monitoring is active and baselines are captured
curl 'http://prometheus:9090/api/v1/query?query=validator_attestation_inclusion_distance'
# 2. Verify experiment targets only chaos namespace
kubectl get pods -n validators-chaos --show-labels
# 3. Verify production validators are healthy
kubectl get pods -n validators-production
# 4. Document the current epoch / block height
# For Cosmos:
gaiad status | jq '.SyncInfo.latest_block_height'
# For Ethereum:
curl -s http://localhost:5052/eth/v1/node/syncing | jq '.data.head_slot'
# 5. Have rollback ready - know how to manually stop the experiment
# kubectl delete networkchaos --all -n validators-chaos
Experiment 1: Network Partition Between Validator and Peers
The most important chaos engineering Kubernetes experiment for blockchain workloads is network partition. A validator that cannot reach peers stops receiving blocks to attest to and stops having its attestations propagated. In Cosmos, this causes missed blocks and, if the partition outlasts the downtime window, jailing. In Ethereum, this causes missed attestations and accrues inactivity penalties.
Chaos Mesh network partition:
# experiment-1-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: validator-peer-partition
namespace: validators-chaos
spec:
action: partition
mode: one # Affect one pod (not all)
selector:
namespaces:
- validators-chaos
labelSelectors:
app: cosmos-validator
duration: "5m"
direction: both # Block both ingress and egress
target:
mode: all # Partition from all other pods
selector:
namespaces:
      - validators-chaos
# Apply the experiment
kubectl apply -f experiment-1-network-partition.yaml
# Monitor in real time
watch kubectl get pods -n validators-chaos
watch kubectl logs -n validators-chaos -l app=cosmos-validator --tail=20
# Check peer count during experiment (should drop to 0)
# Cosmos (CometBFT RPC):
kubectl exec -n validators-chaos cosmos-validator-0 -- \
  curl -s localhost:26657/net_info | jq '.result.n_peers'
# After 5 minutes, experiment stops automatically
# Verify recovery:
kubectl exec -n validators-chaos cosmos-validator-0 -- \
  gaiad status | jq '.SyncInfo.catching_up'
What to verify:
- Peer count drops to zero during partition.
- No double-signing occurs during isolation (check signing logs).
- Node reconnects within expected time after partition ends.
- Missed blocks/attestations match expected count for partition duration (the sketch below automates these recovery checks).
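A post-experiment verification sketch, under the same assumptions as the commands above (pod cosmos-validator-0, CometBFT RPC on localhost:26657 inside the pod; the grep patterns are illustrative):
# verify-recovery.sh - post-partition checks for a Cosmos validator
NS=validators-chaos
POD=cosmos-validator-0
# 1. Peer count should be back above your normal floor
kubectl exec -n "$NS" "$POD" -- \
  curl -s localhost:26657/net_info | jq -r '.result.n_peers'
# 2. Node should no longer be catching up
kubectl exec -n "$NS" "$POD" -- \
  gaiad status | jq -r '.SyncInfo.catching_up'
# 3. No double-sign evidence in recent logs
kubectl logs -n "$NS" "$POD" --since=10m | grep -iE 'duplicate|double sign' \
  || echo "no double-sign evidence found"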
The Attacknet equivalent for Ethereum validators:
Attacknet uses Chaos Mesh to inject faults into a devnet environment generated by Kurtosis, creating various network topologies with ensembles of different kinds of faults. Network latency, where a node’s connection to the network is delayed, can help reproduce global latency conditions or detect unintentional synchronicity assumptions in the blockchain’s consensus. (Attacknet repo)
Experiment 2: Clock Skew Injection on Consensus Pods
Clock skew is the failure mode that generic chaos engineering Kubernetes tutorials never cover. Trail of Bits was able to reproduce the Ethereum finality incident of May 2023 using a clock skew fault, where a node’s clock is skewed forwards or backwards for a specific duration. (Trail of Bits Attacknet)
Blockchain consensus protocols assume synchronized clocks. Ethereum’s RANDAO mechanism and Cosmos’s Tendermint consensus both have clock drift tolerances. Exceeding them causes validators to be excluded from rounds, miss attestation windows, or in severe cases, fork the local chain state.
Chaos Mesh time chaos experiment:
# experiment-2-clock-skew.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
name: validator-clock-skew
namespace: validators-chaos
spec:
mode: one
selector:
namespaces:
- validators-chaos
labelSelectors:
app: ethereum-consensus-client
duration: "10m"
timeOffset: "+30s" # Skew clock forward by 30 seconds
clockIds:
- CLOCK_REALTIME # System clock
    - CLOCK_MONOTONIC # Monotonic clock
Graduated skew testing – increase until failure:
# Start with 2 seconds skew (below typical tolerance)
# Monitor: does the validator continue attesting normally?
# Increase to 10 seconds
# Monitor: does attestation delay increase?
# Increase to 30 seconds (Ethereum) / 500ms (Cosmos — tighter tolerance)
# Monitor: does the validator get excluded from consensus?
# Increase to 2 minutes
# Monitor: does the node detect the clock issue and halt?
# Document the failure threshold for your specific client and version
# This threshold changes with protocol upgrades - run quarterly
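The graduation can be scripted against the TimeChaos spec above. A sketch, where the offsets, wait time, and label selector are assumptions carried over from this guide:
# graduated-skew.sh - apply increasing clock offsets until failure
NS=validators-chaos
for OFFSET in 2s 10s 30s 2m; do
  echo "=== Applying +${OFFSET} clock skew ==="
  cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: skew-${OFFSET}
  namespace: ${NS}
spec:
  mode: one
  selector:
    namespaces: ["${NS}"]
    labelSelectors:
      app: ethereum-consensus-client
  duration: "5m"
  timeOffset: "+${OFFSET}"
  clockIds: ["CLOCK_REALTIME"]
EOF
  sleep 360 # experiment duration plus recovery margin
  kubectl delete timechaos "skew-${OFFSET}" -n "${NS}"
  # Record attestation inclusion distance here before escalating
done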
What to verify:
- At what skew threshold does attestation inclusion distance increase?
- Does the consensus client detect the clock drift and log a warning?
- Does automatic clock correction (NTP resync) recover the node within the experiment window?
- Is there a skew level that causes the node to stop signing entirely (protective behavior)?
Experiment 3: Peer Disconnect Cascade
A single peer disconnect is routine. A cascade, where multiple peers disconnect in rapid succession due to a network event, tests whether the validator’s peer discovery and reconnection logic can keep up with the rate of loss.
Simulating a peer cascade with pod termination:
# experiment-3-peer-cascade.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: peer-cascade-kill
namespace: validators-chaos
spec:
action: pod-kill
mode: fixed-percent
value: "50" # Kill 50% of peer pods
selector:
namespaces:
- validators-chaos
labelSelectors:
role: validator-peer # Target peer pods, not the primary validator
duration: "3m"
  gracePeriod: 0
Combined with network latency on survivors:
# experiment-3b-survivor-latency.yaml
# After killing 50% of peers, inject latency on remaining ones
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: survivor-peer-latency
namespace: validators-chaos
spec:
action: delay
mode: all
selector:
namespaces:
- validators-chaos
labelSelectors:
role: validator-peer
duration: "3m"
delay:
latency: "500ms"
correlation: "100" # Correlated latency (not random)
    jitter: "100ms"
What to verify:
- Does the primary validator detect peer count dropping and trigger peer discovery?
- Does the validator maintain attestation performance with 50% fewer peers?
- When killed peers restart, does reconnection happen within expected time?
- Does latency on surviving peers cause attestation inclusion distance to increase?
Chaos Engineering Blockchain: Experiment 4 – Database Corruption Simulation for State Stores
Validator state stores (LevelDB for Ethereum clients, the Tendermint state database for Cosmos) are single points of failure. Corruption or lock contention causes the validator to halt or, in worst cases, start from a corrupt state that causes signing inconsistencies.
I/O stress injection:
# experiment-4-io-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: validator-io-stress
namespace: validators-chaos
spec:
mode: one
selector:
namespaces:
- validators-chaos
labelSelectors:
app: cosmos-validator
duration: "5m"
  # Chaos Mesh's built-in stressors cover CPU and memory only; disk I/O
  # pressure is passed through to stress-ng directly. Note the hdd stressor
  # writes scratch files inside the container filesystem rather than
  # targeting a chosen mount path such as the state store volume.
  stressngStressors: "--hdd 4 --hdd-bytes 1g"
Simulating read errors with Chaos Mesh IOChaos:
# experiment-4b-io-fault.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: validator-state-read-error
namespace: validators-chaos
spec:
action: fault
mode: one
selector:
namespaces:
- validators-chaos
labelSelectors:
app: cosmos-validator
volumePath: /validator-data
path: /validator-data/data/cs.wal/* # Target write-ahead log
  errno: 5 # EIO - I/O error
  percent: 20 # 20% of file operations on matching paths return EIO
  duration: "3m"
What to verify:
- Does the validator detect I/O errors and halt rather than continue with potentially corrupt state?
- Does the validator log sufficient information to diagnose the I/O issue?
- Does automatic recovery from a backup state work correctly?
- Is the slashing protection database (for Ethereum) read before any signing attempt during recovery?
Experiment 5: Resource Exhaustion on Signing Services
Remote signing services (Web3Signer for Ethereum, a Tendermint KMS such as tmkms for Cosmos) are often separate pods from the main validator process. Resource exhaustion on the signing service causes signing requests to time out, which can manifest as missed attestations or, in misconfigured setups, as the validator attempting to sign with a local fallback key while the remote signer recovers.
CPU stress on the signing service:
# experiment-5-signer-cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: signing-service-cpu-stress
namespace: validators-chaos
spec:
mode: one
selector:
namespaces:
- validators-chaos
labelSelectors:
app: web3signer # Target the signing service, not the validator
duration: "5m"
stressors:
cpu:
workers: 8 # Consume all available CPU
    load: 100
What to verify:
- Does signing latency increase during CPU stress?
- Does the validator detect signing timeouts and log them correctly?
- Does the validator fall back to local signing if the remote signer is unavailable, and is that fallback configured correctly (or intentionally disabled)?
- After CPU stress ends, does signing latency return to baseline within expected time? (See the probe sketch below.)
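A rough latency probe, assuming Web3Signer's default port 9000. It uses the /upcheck liveness endpoint as a proxy for responsiveness rather than a real signing round trip, so treat the numbers as indicative only:
# probe-signer-latency.sh - run before, during, and after the CPU stress
NS=validators-chaos
SIGNER_POD=$(kubectl get pods -n "$NS" -l app=web3signer \
  -o jsonpath='{.items[0].metadata.name}')
for i in $(seq 1 10); do
  kubectl exec -n "$NS" "$SIGNER_POD" -- \
    curl -o /dev/null -s -w '%{time_total}\n' http://localhost:9000/upcheck
  sleep 2
done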
This experiment is particularly important for EigenLayer AVS operators. See our EigenLayer AVS setup guide for the key architecture that makes signing service resilience critical: an AVS operator whose signing service fails under load will miss tasks and accumulate performance penalties.
Chaos Engineering for Validators: Experiment 6 – Region Failover for HA Validator Setups
For high-availability validator setups running across multiple availability zones or regions, the most critical experiment is verifying that failover is exclusive: only one validator instance is active at any time. Dual signing during a failover is a slashing condition on every major blockchain.
Simulating AZ failure with node-level chaos:
# experiment-6-az-failover.yaml
# Simulate an AZ failure by isolating nodes in the primary AZ
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: primary-az-isolation
namespace: validators-chaos
spec:
action: partition
mode: all
selector:
namespaces:
- validators-chaos
nodeSelectors:
topology.kubernetes.io/zone: us-east-1a # Primary AZ
direction: both
duration: "10m"
target:
mode: all
selector:
namespaces:
- validators-chaos
nodeSelectors:
        topology.kubernetes.io/zone: us-east-1b # Secondary AZ
Verifying exclusivity during failover:
# During the experiment, verify that only ONE validator pod is signing
# Check signing logs across all validator pods
# For Cosmos - only one pod should be producing prevotes/precommits:
kubectl logs -n validators-chaos cosmos-validator-primary --tail=50 | grep "signed"
kubectl logs -n validators-chaos cosmos-validator-standby --tail=50 | grep "signed"
# Standby should show NO signing activity while primary AZ is isolated
# For Ethereum - check the standby validator client logs for attestation activity
# (the beacon API's attestation_data endpoint requires a numeric slot and
# committee_index, so logs are the simpler check here):
kubectl logs -n validators-chaos ethereum-validator-standby --tail=50 | grep -i attestation
# Standby should NOT be publishing attestations while primary is healthy
The dual-sign detection test:
This is the most important verification step and one that generic chaos engineering Kubernetes guides never include:
# After the experiment, inspect the slashing protection database
# For Ethereum validators using Lighthouse:
kubectl exec -n validators-chaos ethereum-validator-standby -- \
lighthouse account validator slashing-protection export /tmp/slashing-export.json
# Verify: the export should show NO signed blocks or attestations
# from the standby validator during the period when the primary was active
kubectl exec -n validators-chaos ethereum-validator-standby -- \
  cat /tmp/slashing-export.json | jq '.data[].signed_blocks | length'
# Should be 0 if failover was exclusive
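A rough Cosmos equivalent, assuming the default CometBFT home layout and the pod names used above (both assumptions): compare the last-signed state files on both pods.
# For Cosmos - compare last-signed height/round/step across both pods
for POD in cosmos-validator-primary cosmos-validator-standby; do
  echo "=== $POD ==="
  kubectl exec -n validators-chaos "$POD" -- \
    cat /root/.gaia/data/priv_validator_state.json | jq '{height, round, step}'
done
# The standby's height should not have advanced during any window
# in which the primary was signing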
Attacknet: The Blockchain-Native Chaos Framework
Attacknet uses Chaos Mesh to inject faults into a devnet environment generated by Kurtosis. By building on top of Kurtosis and Chaos Mesh, Attacknet can create various network topologies with ensembles of different kinds of faults to push a blockchain network to its most extreme edge cases. (Attacknet repo)
Attacknet differs from the experiments above in scope: rather than testing a single validator’s response to a fault, Attacknet tests the entire network’s response to coordinated faults across multiple nodes simultaneously. It is the tool for finding protocol-level bugs, not operational resilience issues.
Installing Attacknet:
# Prerequisites: Kurtosis, kubectl, Chaos Mesh installed
curl -L https://github.com/kurtosis-tech/kurtosis/releases/latest/download/kurtosis-cli_linux_amd64.tar.gz | tar xz
sudo mv kurtosis /usr/local/bin/
# Clone Attacknet
git clone https://github.com/crytic/attacknet.git
cd attacknet
# Install Go dependencies
go mod download
# Configure for your network (example: Ethereum devnet)
cp examples/ethereum-network.yaml tests/my-test.yaml
Example Attacknet test configuration:
# tests/clock-skew-test.yaml
attacknetConfig:
grafanaPodName: grafana
grafanaPodPort: 3000
waitBeforeInjectionSeconds: 60
reuseDevnetBetweenRuns: false
existingDevnetNamespace: ""
harnessConfig:
networkPackage: github.com/ethpandaops/ethereum-package
networkConfig:
participants:
- el_type: geth
cl_type: lighthouse
- el_type: nethermind
cl_type: prysm
testConfig:
tests:
- testName: clock-skew-consensus-client
planSteps:
- stepType: injectFault
description: "Inject 30s clock skew on one CL client"
chaosFaultSpec:
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
spec:
mode: one
selector:
labelSelectors:
kurtosistech.com/app-id: cl-client
timeOffset: "+30s"
duration: "10m"
- stepType: waitForFaultCompletion
- stepType: assertNetworkConverges
description: "Verify network reaches consensus after fault clears"
      timeout: 300
# Run the test
go run cmd/main.go start --config tests/clock-skew-test.yaml
# Access Chaos Mesh dashboard to monitor
kubectl --namespace chaos-mesh port-forward svc/chaos-dashboard 2333
Attacknet combines Kurtosis for deploying Ethereum networks with Chaos Mesh for orchestrating chaos tests, performing health checks and attempting to ascertain if a network was negatively affected by the chaos. (Attacknet repo)
Monitoring During Chaos Experiments
Chaos engineering Kubernetes experiments without real-time monitoring are useless. You cannot draw conclusions from an experiment you could not observe.
Prometheus queries to monitor during experiments:
# Attestation inclusion distance (Ethereum) - should be 0-1 normally
# Increasing during chaos = validator being affected
ethereum_attestation_inclusion_distance
# Cosmos missed blocks - should be 0
# Any non-zero value during chaos = direct impact
cosmos_validator_missed_blocks_total
# Peer count - drops to 0 during network partition experiments
p2p_peer_count{instance="validator-pod"}
# Signing latency - increases during CPU stress on signing service
validator_signing_latency_seconds
# IBC packet timeouts - relevant if running IBC relayers alongside validators
# See: https://thegoodshell.com/cosmos-ibc-tutorial/
ibc_transfer_timeout_packets_total
Chaos experiment Prometheus alerts:
# Add to your existing alert rules during chaos windows
groups:
- name: chaos-experiment-monitoring
rules:
- alert: ValidatorSlashingRisk
expr: |
        ethereum_validator_double_sign_detected > 0
          or cosmos_validator_double_sign_detected > 0
labels:
severity: critical
annotations:
summary: "DOUBLE SIGNING DETECTED — stop chaos experiment immediately"
description: "A signing conflict has been detected. Halt all experiments and investigate before continuing."
- alert: ChaosExperimentExceededDuration
expr: |
time() - chaos_experiment_start_time > 600 # 10 minutes
labels:
severity: warning
annotations:
summary: "Chaos experiment has been running for over 10 minutes"
        description: "Verify experiment has a defined duration and is not stuck."
The ChaosETH Methodology Applied to Production Validators
The ChaosETH academic paper (published in ACM Distributed Ledger Technologies) provides a systematic framework for chaos engineering blockchain nodes that goes beyond ad-hoc experiment design. The key contribution is a taxonomy of Ethereum client failure modes mapped to chaos experiment types.
The four ChaosETH failure categories adapted for validators:
Category 1 – Network-level failures: Packet loss, latency injection, bandwidth throttling, network partition. Corresponds to experiments 1 and 3 in this guide.
Category 2 – Resource-level failures: CPU exhaustion, memory pressure, disk I/O degradation. Corresponds to experiments 4 and 5.
Category 3 – Time-level failures: Clock skew, NTP disruption, time zone changes. Corresponds to experiment 2, the most blockchain-specific category.
Category 4 – State-level failures: Database corruption, state store lock contention, snapshot inconsistency. Corresponds to experiment 4b.
The ChaosETH finding most relevant to production operators: Time-level failures are disproportionately dangerous for blockchain validators because consensus protocols assume clock synchronization in ways that are rarely tested. Most validators have never had their clock deliberately skewed. The ones that have, and passed, have higher confidence in their production resilience than those relying on NTP stability alone.
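A crude way to measure the drift you already have before deliberately injecting any, assuming date and bc are available in the validator images (kubectl exec round-trip time limits accuracy to tens of milliseconds):
# clock-offset-check.sh - rough per-pod clock offset vs. the local machine
NS=validators-chaos
for POD in $(kubectl get pods -n "$NS" -o jsonpath='{.items[*].metadata.name}'); do
  POD_TIME=$(kubectl exec -n "$NS" "$POD" -- date +%s.%N 2>/dev/null) || continue
  LOCAL_TIME=$(date +%s.%N)
  echo "$POD offset: $(echo "$LOCAL_TIME - $POD_TIME" | bc)s"
done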
Chaos Engineering Kubernetes Checklist for Blockchain Validators
BEFORE EACH EXPERIMENT
[ ] Experiment targets only the chaos namespace - not production
[ ] Production validators are healthy and attesting normally
[ ] Monitoring dashboards are open and showing baselines
[ ] Current epoch/slot height documented
[ ] Duration field set on every ChaosObject
[ ] Rollback command ready: kubectl delete networkchaos --all -n [namespace]
[ ] Slashing protection database backed up (Ethereum validators)
DURING EACH EXPERIMENT
[ ] Peer count monitored in real time
[ ] Attestation inclusion distance monitored
[ ] No double-signing alerts firing
[ ] Experiment ending at expected time
AFTER EACH EXPERIMENT
[ ] Validator reconnected to expected peer count
[ ] Attestation performance returned to baseline
[ ] No unexpected entries in slashing protection database
[ ] Missed blocks/attestations match expected count for fault duration
[ ] Experiment findings documented in runbook
QUARTERLY CHAOS SCHEDULE
[ ] Experiment 1: Network partition (5 minutes)
[ ] Experiment 2: Clock skew graduation (15 minutes, increasing offsets)
[ ] Experiment 3: Peer cascade (10 minutes)
[ ] Experiment 4: I/O stress (5 minutes)
[ ] Experiment 5: Signing service CPU stress (5 minutes)
[ ] Experiment 6: Region failover exclusivity verification (10 minutes)
[ ] Attacknet devnet run for Ethereum client version validation
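Individual items on this schedule can be automated with Chaos Mesh's Schedule resource, which wraps any chaos type in a cron cadence. A sketch for the recurring partition run, reusing the Experiment 1 selectors (the cron expression is illustrative):
# schedule-peer-partition.yaml - recurring Experiment 1 via Chaos Mesh Schedule
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: scheduled-peer-partition
  namespace: validators-chaos
spec:
  schedule: "0 3 1 */3 *" # 03:00 on the 1st of every third month
  historyLimit: 6
  concurrencyPolicy: Forbid
  type: NetworkChaos
  networkChaos:
    action: partition
    mode: one
    selector:
      namespaces:
        - validators-chaos
      labelSelectors:
        app: cosmos-validator
    direction: both
    duration: "5m"
A scheduled experiment still needs the alert rules from the monitoring section armed; an unobserved chaos run violates the blast radius rules above.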
Conclusion
Chaos engineering Kubernetes for blockchain workloads requires a different experiment design than for stateless web services. Clock skew, peer isolation, signing service exhaustion, and dual-sign verification during failover are the experiments that matter for validator infrastructure, and none of them appear in standard chaos engineering tutorials.
Attacknet, built by Trail of Bits in collaboration with the Ethereum Foundation, demonstrated that chaos engineering can reproduce production incidents, including the Ethereum finality event of May 2023, that conventional testing cannot surface. The clock skew fault used to reproduce that incident was not part of any standard testing methodology before Attacknet. (Trail of Bits Attacknet)
The six experiments in this guide provide a starting point for a quarterly chaos testing cadence. The Attacknet devnet methodology extends that to protocol-level validation. Together, they provide the confidence in validator resilience that the economics of staking demand.
At The Good Shell we design and operate validator infrastructure for Cosmos, Ethereum, and EigenLayer ecosystems, including chaos testing frameworks for production validator setups. See our Web3 infrastructure services or our case studies.
FAQ: Chaos Engineering Kubernetes for Blockchain
Is it safe to run chaos experiments on production validators?
No experiment in this guide should be run directly on production validators managing real stake without a parallel shadow setup. The correct approach is a chaos testing environment that mirrors your production configuration exactly: same client versions, same hardware class, same network topology, with test keys that have no delegated stake. Some experiments (specifically the failover exclusivity test) can be run on production after extensive shadow testing, with a human operator ready to intervene immediately.
What is the difference between Chaos Mesh and LitmusChaos?
Chaos Mesh is a CNCF-hosted chaos engineering platform with strong Kubernetes integration and the most mature support for the fault types relevant to blockchain workloads (network chaos, time chaos, I/O chaos). LitmusChaos is a CNCF-hosted chaos engineering platform focused on workflow-based experiments with a larger library of pre-built scenarios. Attacknet builds on Chaos Mesh specifically. For blockchain validator chaos testing, Chaos Mesh is the primary tool.
How often should chaos experiments run against validator infrastructure?
Quarterly is the minimum for a production validator operation managing significant delegated stake. The additional trigger is every major client upgrade: client updates can change consensus timing assumptions, peer connectivity behavior, or state store format in ways that alter your resilience baseline. Run experiments before upgrading production to establish confidence in the new version’s behavior.
Can chaos engineering detect slashing risks before they occur in production?
Yes, and this is the primary value proposition for blockchain-specific chaos engineering. The dual-sign detection test in Experiment 6 specifically verifies that your failover configuration cannot produce a double-signing event. Operators who discover that their HA failover configuration would result in dual signing during a chaos experiment have avoided a slashing event that would have otherwise occurred during the next production failover. See our Cosmos validator slashing guide for the full taxonomy of slashing conditions that chaos experiments can surface pre-production.
Related Articles
- → Cosmos Validator Slashing: How to Prevent It and Recover Fast
- → Cosmos IBC Tutorial: 8 Proven Steps to Master IBC Eureka Migration
- → EigenLayer AVS Setup: 7 Proven Production Steps
- → Kubernetes Security Best Practices: The Essential Hardening Guide
- → Prometheus Alertmanager Setup: Complete Configuration Guide
