Validator Upgrade Pipeline: 7 Proven Steps to Upgrade Nodes Without Downtime

A validator upgrade pipeline is the difference between a routine software upgrade and a production incident. Every Cosmos SDK chain, every Ethereum consensus client, every Substrate-based validator will require binary upgrades throughout its lifecycle. Most teams handle these manually – one engineer, watching the block height, ready to swap the binary at the right moment. One missed notification, one slow SSH connection, one wrong binary path, and the validator misses blocks, gets jailed, and the delegators start moving their stake.

This guide covers how to build a proper validator upgrade pipeline: automated, auditable, and safe. We’ll cover Cosmovisor for Cosmos SDK chains, rolling upgrade strategies for Ethereum consensus clients, pre-upgrade validation workflows, and the GitHub Actions pipeline that ties it all together.

Why Manual Validator Upgrades Are a Slashing Risk

Before building the pipeline, it’s worth being honest about what the risk actually is. On Cosmos Hub, signing fewer than 5% of the last 10,000 blocks (missing more than 9,500) gets you jailed and slashed 0.01% of bonded stake. A hard fork upgrade that takes 20 minutes to execute manually – while the chain is halted at a specific block height – puts every validator that wasn’t ready at risk.

The bigger risk is the double-sign scenario. An engineer upgrades the validator binary manually, the new binary starts, but the old process is still running in the background. Two signing processes, same validator key, same block height: that’s a 5% slash and permanent tombstoning on most Cosmos chains. There is no recovery from a double-sign slash.

A validator upgrade pipeline eliminates both risks. Upgrade timing is controlled by block height, not by human reaction time. Binary swaps are atomic: the old process terminates before the new one starts. And every step is logged, auditable, and repeatable.

The Two Upgrade Models

Before writing any pipeline code, understand which upgrade model applies to your chain.

Coordinated hard fork upgrades -> the chain halts at a specific block height. All validators must upgrade their binaries within a narrow window. If enough validators don’t upgrade in time, the chain stalls. This is the standard Cosmos SDK upgrade model: governance passes a proposal, validators prepare, the chain halts at the upgrade height, everyone swaps binaries.

Rolling client upgrades -> the network continues running while validators upgrade one by one. This is how Ethereum consensus client upgrades work: Lighthouse, Prysm, Teku, and Nimbus release new versions independently, and validators can upgrade on their own schedule without chain coordination.

Your validator upgrade pipeline needs to handle both models. They require different tooling and different safety checks.

Step 1 – Cosmovisor: The Foundation of Any Cosmos Validator Upgrade Pipeline

For Cosmos SDK chains, Cosmovisor is the essential building block of a safe validator upgrade pipeline. It’s a process manager that wraps your chain binary and monitors the governance module for upgrade proposals. When the chain reaches the upgrade height, Cosmovisor automatically swaps the binary and restarts the process.

Install Cosmovisor:

```bash
go install cosmossdk.io/tools/cosmovisor/cmd/cosmovisor@latest
```

**Directory structure - this is critical:**
```
$DAEMON_HOME/cosmovisor/
├── current -> genesis or upgrades/<upgrade-name>
├── genesis/
│   └── bin/
│       └── gaiad          # Original binary
└── upgrades/
    └── v15/               # Upgrade name from governance proposal
        └── bin/
            └── gaiad      # New binary, pre-placed before upgrade height
```

Environment configuration:

```ini
# /etc/systemd/system/cosmovisor.service
[Unit]
Description=Cosmovisor - Cosmos Validator
After=network-online.target

[Service]
User=validator
Environment=DAEMON_NAME=gaiad
Environment=DAEMON_HOME=/home/validator/.gaia
Environment=DAEMON_RESTART_AFTER_UPGRADE=true
Environment=DAEMON_ALLOW_DOWNLOAD_BINARIES=false
Environment=UNSAFE_SKIP_BACKUP=false
Environment=DAEMON_DATA_BACKUP_DIR=/mnt/validator-backups
ExecStart=/usr/local/bin/cosmovisor run start
Restart=always
RestartSec=3
LimitNOFILE=65535

[Install]
WantedBy=multi-user.target
```

Two critical settings here. DAEMON_ALLOW_DOWNLOAD_BINARIES=false: for validators, always pre-place the binary manually. Auto-download doesn’t verify binary integrity before swapping, and a failed download causes a chain halt on your node. UNSAFE_SKIP_BACKUP=false: Cosmovisor backs up the entire data directory before upgrading. This takes time but means you can roll back if the new binary has a critical bug.

Step 2 – Pre-Upgrade Binary Validation

The validator upgrade pipeline must verify the new binary before the upgrade height. Placing an untested binary in the Cosmovisor upgrades directory and hoping it works is not a pipeline; it’s wishful thinking.

Build and verify the binary:

```bash
# Clone the new version
git clone https://github.com/cosmos/gaia
cd gaia
git checkout v15.0.0

# Verify the commit hash matches the governance proposal
git log --oneline -1
# Output should match the hash published in the upgrade proposal

# Build
make install

# Verify binary version
gaiad version --long
# Should output: v15.0.0 with the expected commit hash

# Verify binary checksum matches the published checksum
sha256sum $(which gaiad)
# Compare with the checksum published in the upgrade plan
```

Place the binary in Cosmovisor before the upgrade height:

```bash
UPGRADE_NAME=v15
mkdir -p $HOME/.gaia/cosmovisor/upgrades/$UPGRADE_NAME/bin
cp $(which gaiad) $HOME/.gaia/cosmovisor/upgrades/$UPGRADE_NAME/bin/gaiad

# Verify it's in place
ls -la $HOME/.gaia/cosmovisor/upgrades/$UPGRADE_NAME/bin/
cosmovisor run version
```

Do this at least 24 hours before the upgrade height. The upgrade governance proposal always publishes the target block height; calculate the approximate time and plan accordingly.
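That calculation is easy to script. A minimal sketch, assuming a Tendermint-style RPC endpoint; the public RPC URL and the ~6-second block time are example values, not guarantees, so measure your chain’s real average block time before relying on the output:

```bash
#!/bin/bash
# estimate-upgrade-eta.sh - rough wall-clock ETA for an upgrade height.
# Sketch only: the RPC endpoint and the 6s average block time are assumptions.

eta_seconds() {
  # args: upgrade_height current_height avg_block_time_seconds
  awk -v u="$1" -v c="$2" -v t="$3" 'BEGIN { printf "%d", (u - c) * t }'
}

if [ -n "$1" ]; then
  UPGRADE_HEIGHT=$1
  RPC="${2:-https://cosmos-rpc.publicnode.com}"
  CURRENT_HEIGHT=$(curl -s "$RPC/status" | jq -r '.result.sync_info.latest_block_height')
  SECS=$(eta_seconds "$UPGRADE_HEIGHT" "$CURRENT_HEIGHT" 6.0)
  echo "Height $UPGRADE_HEIGHT is ~$SECS seconds away (around $(date -d "+$SECS seconds"))"
fi
```

Run it as `./estimate-upgrade-eta.sh <upgrade_height> [rpc_url]` once the governance proposal publishes the height, and re-run it as the window approaches, since block times drift.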

Step 3 – GitHub Actions: Automating the Validator Upgrade Pipeline

Manual binary placement is better than nothing. An automated validator upgrade pipeline is better than manual. Here’s a GitHub Actions workflow that monitors for new chain releases, builds the binary, verifies the checksum, and deploys it to the validator:

```yaml
# .github/workflows/validator-upgrade.yaml
name: Validator Upgrade Pipeline

on:
  workflow_dispatch:
    inputs:
      upgrade_name:
        description: 'Upgrade name (matches governance proposal)'
        required: true
        type: string
      binary_version:
        description: 'Binary version tag (e.g. v15.0.0)'
        required: true
        type: string
      expected_checksum:
        description: 'Expected SHA256 checksum of binary'
        required: true
        type: string
      target_network:
        description: 'Target network'
        required: true
        type: choice
        options:
          - testnet
          - mainnet

jobs:
  build-and-verify:
    runs-on: ubuntu-latest
    outputs:
      checksum: ${{ steps.checksum.outputs.value }}

    steps:
      - name: Checkout chain repository
        uses: actions/checkout@v4
        with:
          repository: cosmos/gaia
          ref: ${{ inputs.binary_version }}

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
          cache: true

      - name: Build binary
        run: |
          make install
          # make install puts the binary in GOPATH/bin - expose it to later steps
          echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"

      - name: Verify binary version
        run: |
          VERSION=$(gaiad version 2>&1)
          echo "Built version: $VERSION"
          if [[ "$VERSION" != *"${{ inputs.binary_version }}"* ]]; then
            echo "Version mismatch - aborting"
            exit 1
          fi

      - name: Calculate and verify checksum
        id: checksum
        run: |
          CHECKSUM=$(sha256sum $(which gaiad) | awk '{print $1}')
          echo "value=$CHECKSUM" >> $GITHUB_OUTPUT
          if [ "$CHECKSUM" != "${{ inputs.expected_checksum }}" ]; then
            echo "Checksum mismatch - aborting"
            echo "Expected: ${{ inputs.expected_checksum }}"
            echo "Got: $CHECKSUM"
            exit 1
          fi
          echo "Checksum verified: $CHECKSUM"

      - name: Upload binary artifact
        uses: actions/upload-artifact@v4
        with:
          name: validator-binary-${{ inputs.binary_version }}
          path: ~/go/bin/gaiad
          retention-days: 7

  deploy-testnet:
    needs: build-and-verify
    runs-on: ubuntu-latest
    environment: testnet
    # Testnet deployment runs for both targets - mainnet runs pass through testnet first
    if: inputs.target_network == 'testnet' || inputs.target_network == 'mainnet'

    steps:
      - name: Download binary
        uses: actions/download-artifact@v4
        with:
          name: validator-binary-${{ inputs.binary_version }}

      - name: Copy binary to testnet validator
        uses: appleboy/scp-action@v0.1.7
        with:
          host: ${{ secrets.TESTNET_VALIDATOR_HOST }}
          username: validator
          key: ${{ secrets.TESTNET_VALIDATOR_SSH_KEY }}
          source: gaiad
          target: /tmp/

      - name: Place binary in Cosmovisor upgrade directory
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.TESTNET_VALIDATOR_HOST }}
          username: validator
          key: ${{ secrets.TESTNET_VALIDATOR_SSH_KEY }}
          script: |
            UPGRADE_NAME=${{ inputs.upgrade_name }}
            DAEMON_HOME=$HOME/.gaia

            # Create upgrade directory and move the binary into place
            mkdir -p $DAEMON_HOME/cosmovisor/upgrades/$UPGRADE_NAME/bin
            mv /tmp/gaiad $DAEMON_HOME/cosmovisor/upgrades/$UPGRADE_NAME/bin/gaiad
            chmod +x $DAEMON_HOME/cosmovisor/upgrades/$UPGRADE_NAME/bin/gaiad

      - name: Verify deployment
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.TESTNET_VALIDATOR_HOST }}
          username: validator
          key: ${{ secrets.TESTNET_VALIDATOR_SSH_KEY }}
          script: |
            UPGRADE_NAME=${{ inputs.upgrade_name }}
            BINARY_PATH=$HOME/.gaia/cosmovisor/upgrades/$UPGRADE_NAME/bin/gaiad

            # Verify binary exists and is executable
            ls -la $BINARY_PATH

            # Verify version
            $BINARY_PATH version

            # Verify checksum
            sha256sum $BINARY_PATH

      - name: Notify Slack - testnet ready
        uses: slackapi/slack-github-action@v1.27.0
        with:
          payload: |
            {
              "text": "✅ Validator upgrade ${{ inputs.upgrade_name }} deployed to TESTNET. Binary version: ${{ inputs.binary_version }}. Checksum verified."
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

  deploy-mainnet:
    needs: [build-and-verify, deploy-testnet]
    runs-on: ubuntu-latest
    environment: mainnet
    if: inputs.target_network == 'mainnet'

    steps:
      - name: Download binary
        uses: actions/download-artifact@v4
        with:
          name: validator-binary-${{ inputs.binary_version }}

      - name: Copy binary to mainnet validator
        uses: appleboy/scp-action@v0.1.7
        with:
          host: ${{ secrets.MAINNET_VALIDATOR_HOST }}
          username: validator
          key: ${{ secrets.MAINNET_VALIDATOR_SSH_KEY }}
          source: gaiad
          target: /tmp/

      - name: Place binary in Cosmovisor upgrade directory
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.MAINNET_VALIDATOR_HOST }}
          username: validator
          key: ${{ secrets.MAINNET_VALIDATOR_SSH_KEY }}
          script: |
            UPGRADE_NAME=${{ inputs.upgrade_name }}
            DAEMON_HOME=$HOME/.gaia
            mkdir -p $DAEMON_HOME/cosmovisor/upgrades/$UPGRADE_NAME/bin
            mv /tmp/gaiad $DAEMON_HOME/cosmovisor/upgrades/$UPGRADE_NAME/bin/gaiad
            chmod +x $DAEMON_HOME/cosmovisor/upgrades/$UPGRADE_NAME/bin/gaiad

      - name: Notify Slack - mainnet ready
        uses: slackapi/slack-github-action@v1.27.0
        with:
          payload: |
            {
              "text": "✅ Validator upgrade ${{ inputs.upgrade_name }} deployed to MAINNET. Cosmovisor will execute at upgrade height. Monitor signing rate."
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```

The environment: mainnet block activates GitHub’s manual approval gate: a required reviewer must approve the mainnet deployment before it runs. This is your last human checkpoint before the binary goes live.

Step 4 – Ethereum Consensus Client Upgrades

For Ethereum validators, the validator upgrade pipeline works differently because upgrades are rolling, not coordinated. You can upgrade Lighthouse, Prysm, or Teku at any time without waiting for a specific block height.

The risk is different: the execution client and consensus client must remain compatible. Upgrading one without checking the other’s compatibility is a common source of missed attestations.

Lighthouse rolling upgrade workflow:

```bash
#!/bin/bash
# upgrade-lighthouse.sh

NEW_VERSION=$1
if [ -z "$NEW_VERSION" ]; then
  echo "Usage: ./upgrade-lighthouse.sh v5.3.0"
  exit 1
fi

# Step 1 - Verify compatibility with current execution client
echo "Current Geth version: $(geth version 2>&1 | head -1)"
echo "Target Lighthouse version: $NEW_VERSION"
echo "Check compatibility at: https://notes.ethereum.org/@launchpad/upgrades"

# Step 2 - Download and verify new binary
wget "https://github.com/sigp/lighthouse/releases/download/$NEW_VERSION/lighthouse-$NEW_VERSION-x86_64-unknown-linux-gnu.tar.gz"
wget "https://github.com/sigp/lighthouse/releases/download/$NEW_VERSION/lighthouse-$NEW_VERSION-x86_64-unknown-linux-gnu.tar.gz.sha256"

# Verify checksum
if ! sha256sum -c "lighthouse-$NEW_VERSION-x86_64-unknown-linux-gnu.tar.gz.sha256"; then
  echo "Checksum verification failed - aborting"
  exit 1
fi

# Step 3 - Extract and prepare new binary
tar xzf "lighthouse-$NEW_VERSION-x86_64-unknown-linux-gnu.tar.gz"
chmod +x lighthouse

# Step 4 - Verify new binary before replacing old one
./lighthouse --version

# Step 5 - Backup current binary (record the exact path for rollback)
BACKUP_PATH="/opt/lighthouse/lighthouse-backup-$(lighthouse --version 2>&1 | head -1 | awk '{print $2}')"
sudo mkdir -p /opt/lighthouse
sudo cp "$(which lighthouse)" "$BACKUP_PATH"

# Step 6 - Replace binary (graceful - wait for current attestation window)
echo "Waiting 60 seconds for current attestation cycle to complete..."
sleep 60

sudo systemctl stop lighthouse-validator
sudo cp ./lighthouse /usr/local/bin/lighthouse
sudo systemctl start lighthouse-validator

# Step 7 - Verify new version running
sleep 5
RUNNING_VERSION=$(lighthouse --version 2>&1 | head -1)
echo "Running version: $RUNNING_VERSION"

if systemctl is-active --quiet lighthouse-validator; then
  echo "Lighthouse validator is running"
else
  echo "Lighthouse failed to start - rolling back"
  sudo systemctl stop lighthouse-validator
  sudo cp "$BACKUP_PATH" /usr/local/bin/lighthouse
  sudo systemctl start lighthouse-validator
  exit 1
fi
```

Step 5 – Pre-Upgrade Monitoring and Block Height Tracking

A good validator upgrade pipeline knows when the upgrade is coming. Don’t rely on Discord notifications; automate the monitoring.

Monitor upgrade height on Cosmos chains:

```bash
#!/bin/bash
# check-upgrade-plan.sh

CHAIN_RPC="${1:-https://cosmos-rpc.publicnode.com}"

# Get current upgrade plan.
# Note: the ABCI query value is protobuf-encoded; a chain's REST API endpoint
# /cosmos/upgrade/v1beta1/current_plan returns the same plan as plain JSON.
UPGRADE_PLAN=$(curl -s "$CHAIN_RPC/abci_query?path=\"/cosmos.upgrade.v1beta1.Query/CurrentPlan\"" | jq -r '.result.response.value' | base64 -d 2>/dev/null)

# Get current block height
CURRENT_HEIGHT=$(curl -s "$CHAIN_RPC/status" | jq -r '.result.sync_info.latest_block_height')

echo "Current block height: $CURRENT_HEIGHT"
echo "Upgrade plan: $UPGRADE_PLAN"
```

Prometheus alert for approaching upgrade height:

```yaml
# upgrade-alerts.yaml
# Assumes an exporter exposing cosmos_upgrade_plan_height and cosmos_consensus_height
groups:
  - name: validator.upgrades
    rules:
      - alert: ValidatorUpgradeApproaching
        # Put the subtraction on the left of "and" so $value is the blocks remaining
        expr: |
          (cosmos_upgrade_plan_height - cosmos_consensus_height) < 1000
          and cosmos_upgrade_plan_height > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Chain upgrade in less than 1000 blocks"
          description: "Upgrade height is {{ $value }} blocks away. Verify binary is in place."

      - alert: ValidatorUpgradeImminent
        expr: |
          (cosmos_upgrade_plan_height - cosmos_consensus_height) < 100
          and cosmos_upgrade_plan_height > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Chain upgrade in less than 100 blocks - verify binary NOW"
```

Step 6 – Post-Upgrade Verification

The validator upgrade pipeline isn’t complete when the binary swaps. It’s complete when you’ve confirmed the validator is signing blocks correctly on the new version.

Automated post-upgrade verification:

```bash
#!/bin/bash
# verify-upgrade.sh

EXPECTED_VERSION=$1
CHAIN_RPC=$2
VALIDATOR_ADDR=$3

# Step 1 - Verify binary version
RUNNING_VERSION=$(cosmovisor run version 2>&1)
echo "Running version: $RUNNING_VERSION"

if [[ "$RUNNING_VERSION" != *"$EXPECTED_VERSION"* ]]; then
  echo "ERROR: Wrong version running after upgrade"
  exit 1
fi

# Step 2 - Check validator is in active set
sleep 30  # Wait for a few blocks to be produced

VALIDATOR_STATUS=$(gaiad query tendermint-validator-set --node "$CHAIN_RPC" | grep "$VALIDATOR_ADDR")
if [ -z "$VALIDATOR_STATUS" ]; then
  echo "ERROR: Validator not found in active set"
  exit 1
fi
echo "Validator confirmed in active set"

# Step 3 - Check the slashing module's missed-blocks counter.
# signing-info expects the consensus pubkey, not the operator address;
# on the validator node itself, read it from the running binary.
CONS_PUBKEY=$(gaiad tendermint show-validator)
MISSED_BLOCKS=$(gaiad query slashing signing-info "$CONS_PUBKEY" --node "$CHAIN_RPC" --output json | jq -r '.val_signing_info.missed_blocks_counter')
echo "Missed blocks counter: $MISSED_BLOCKS"

if [ "$MISSED_BLOCKS" -gt "5" ]; then
  echo "WARNING: Validator missing blocks post-upgrade - investigate"
  exit 1
fi

echo "✅ Upgrade verification passed - validator signing correctly on $EXPECTED_VERSION"
```

Step 7 – Rollback Strategy

Every validator upgrade pipeline needs a tested rollback path. Cosmovisor makes this straightforward for Cosmos chains:

```bash
# Roll back to previous binary
cd $DAEMON_HOME/cosmovisor
ls -la  # Identify previous version

# Stop cosmovisor
systemctl stop cosmovisor

# If the data directory is intact (no state migration):
# point the "current" symlink back at the previous version directory
ln -sfn $DAEMON_HOME/cosmovisor/genesis $DAEMON_HOME/cosmovisor/current

# If state migration ran (hard fork):
# restore from Cosmovisor's automatic backup
rm -rf $DAEMON_HOME/data
cp -r $DAEMON_DATA_BACKUP_DIR/data $DAEMON_HOME/data
systemctl start cosmovisor
```

For Ethereum, rollback is simpler – just downgrade the binary and restart. There’s no state migration for routine client upgrades, though you can’t downgrade past a hard fork that has already activated, and the slashing protection database must be left intact.
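A minimal sketch of that rollback, reusing a binary backup like the one taken in the upgrade script above; the lighthouse-validator unit name and backup path are assumptions, so substitute your own:

```bash
#!/bin/bash
# rollback-lighthouse.sh - restore a previously backed-up consensus client binary.
# Only valid if no hard fork has activated since the backup was taken, and the
# slashing protection database must be left untouched.

rollback() {
  local backup=$1
  if [ ! -f "$backup" ]; then
    echo "Backup not found: $backup"
    return 1
  fi
  sudo systemctl stop lighthouse-validator
  sudo cp "$backup" /usr/local/bin/lighthouse
  sudo systemctl start lighthouse-validator
  echo "Rolled back to: $(lighthouse --version 2>&1 | head -1)"
}

# Usage: ./rollback-lighthouse.sh /opt/lighthouse/lighthouse-backup-<version>
if [ -n "$1" ]; then
  rollback "$1" || exit 1
fi
```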

The Validator Upgrade Pipeline Checklist

Before every upgrade, run through this checklist:

7 days before upgrade height:

  • New binary release published and checksum available
  • Cosmovisor upgrade directory created
  • Binary built from source and checksum verified

24 hours before upgrade height:

  • Binary deployed to testnet via upgrade pipeline
  • Testnet upgrade executed and validator confirmed signing
  • Binary deployed to mainnet cosmovisor directory
  • Backup directory has sufficient disk space

1 hour before upgrade height:

  • Monitoring dashboard open
  • Slack/PagerDuty alerts configured
  • On-call engineer confirmed available
  • Rollback procedure reviewed

Post-upgrade:

  • Validator confirmed in active set
  • Signing rate normal
  • Confirmed only one signing process is running (no double-sign risk)
  • Post-upgrade notification sent to team
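The double-sign item on that list can be checked mechanically rather than by eyeballing ps output. A sketch, assuming the daemon process name is gaiad (adjust for your chain); pgrep -f matches against full command lines, so the count includes the child process that Cosmovisor spawns, which is exactly the process you want counted:

```bash
#!/bin/bash
# check-single-signer.sh - confirm exactly one chain daemon is running.
# Two processes holding the same validator key is a double-sign waiting to happen.

count_procs() {
  # pgrep -c prints the number of processes whose command line matches;
  # swallow its nonzero exit status when there are no matches
  pgrep -c -f "$1" || true
}

if [ -n "$1" ]; then
  DAEMON=$1
  COUNT=$(count_procs "$DAEMON")
  if [ "$COUNT" -eq 1 ]; then
    echo "OK: exactly one $DAEMON process running"
  elif [ "$COUNT" -eq 0 ]; then
    echo "ERROR: no $DAEMON process running"
    exit 1
  else
    echo "ERROR: $COUNT $DAEMON processes running - stop the extras before they double-sign"
    exit 1
  fi
fi
```

Run it as `./check-single-signer.sh gaiad` immediately after the upgrade height, before declaring the upgrade complete.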

Conclusion

A validator upgrade pipeline is not an optional improvement; it’s a prerequisite for running a professional validator operation at any scale. Manual upgrades introduce human error, timing risk, and the double-sign scenario that ends validator careers. The pipeline described here (Cosmovisor for Cosmos chains, automated binary verification, GitHub Actions for deployment, and post-upgrade monitoring) handles all of these systematically.

The teams that run validators with 99.97%+ uptime are not the ones watching block heights on Discord. They’re the ones who built the automation and went to sleep before the upgrade height.

If you need this infrastructure built for your validator operation or blockchain team, this is exactly what we do at The Good Shell. See our Web3 infrastructure services or read our case studies to see production validator infrastructure in practice.

For reference on Cosmovisor configuration and all available options, see the official Cosmos SDK Cosmovisor documentation.
