OpenTelemetry Tutorial for Validators: 7 Proven Production Patterns Most Guides Skip

Every OpenTelemetry tutorial covers the same ground: instrument a Node.js service, deploy the Collector as a DaemonSet, export to Jaeger or Prometheus, add some spans. It is a useful starting point for web services. It is insufficient for blockchain validator infrastructure.

In February 2026, Succinct launched SP1 Hypercube on mainnet, bringing real-time Ethereum proving with proofs generated in under 12 seconds. In May 2026, the Base Azul upgrade routed $7.4B in deposits through SP1 ZK proofs. The observability requirements for infrastructure that generates and verifies ZK proofs under this kind of load are categorically different from tracing a REST API request. The latency budget is different. The failure modes are different. The correlation model, which connects a failed proof to a specific slot, a specific validator state, and a specific network condition, requires instrumentation design that no generic OpenTelemetry tutorial addresses.

This tutorial covers seven production patterns for OpenTelemetry on blockchain validator infrastructure: custom resource attributes for validator identity, epoch and slot correlation across signals, IBC packet tracing for Cosmos relayers, ZK proof pipeline observability, tail-based sampling tuned for consensus timing, Collector pipeline design for high-cardinality validator metrics, and the alert correlation model that connects OTel traces to slashing risk.

Why This OpenTelemetry Tutorial Focuses on Validators

Generic OpenTelemetry tutorials instrument stateless services where every request is independent. Validator infrastructure has fundamentally different observability requirements.

State persistence across observations. A validator’s behavior in epoch N is causally connected to its behavior in epoch N-1. An attestation miss is not an isolated event. It is the result of a chain of conditions: peer count at slot start, signing latency at the attestation window, memory pressure during block production. Correlating these across time requires instrumentation designed around epoch boundaries, not request boundaries.

Financial consequences of observability gaps. A web service that loses traces for five minutes produces a gap in your dashboard. A validator that loses observability for five minutes may be accumulating slashing conditions, missing attestations, or drifting from consensus, conditions that manifest as irreversible financial penalties. See our Cosmos validator slashing guide for the full taxonomy of conditions that late or absent observability allows to compound.

Cross-chain correlation. Validators running in Cosmos ecosystems interact with IBC relayers. An IBC packet timeout is causally connected to the validator’s availability during the packet’s timeout window. Tracing that connection requires span propagation across the IBC boundary, something no generic OpenTelemetry tutorial covers. See our Cosmos IBC tutorial for the IBC v2 context.

ZK proof pipeline latency. For EigenLayer AVS operators and validators in ZK-enabled ecosystems, proof generation is a critical path component with strict timing requirements. SP1 Hypercube targeting sub-12-second proof times means that any latency in the proof pipeline that pushes generation beyond the slot window causes a missed task, equivalent to a missed attestation in economic terms. See our EigenLayer AVS setup guide for the operator architecture context.

OpenTelemetry Prerequisites and Stack

This tutorial uses the official OpenTelemetry Collector in a two-tier DaemonSet + Gateway architecture. All configuration references the official OpenTelemetry specification at github.com/open-telemetry/opentelemetry-specification.

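# Option A: install the OpenTelemetry Operator from the release manifest
# (this manifest expects cert-manager to be installed in the cluster)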
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

# Verify
kubectl get pods -n opentelemetry-operator-system

# Option B: install via Helm instead of the manifest above (recommended for production)
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry \
  --create-namespace \
  --set manager.collectorImage.repository=otel/opentelemetry-collector-contrib \
  --set admissionWebhooks.certManager.enabled=false \
  --set admissionWebhooks.autoGenerateCert.enabled=true

Backend stack for this tutorial:

Traces: Tempo (Grafana) or Jaeger.
Metrics: Prometheus (already running on most validator setups).
Logs: Loki (Grafana).
Dashboards: Grafana.

If you already have Prometheus running for validator metrics (the standard setup covered in our Prometheus Alertmanager setup guide), OTel integrates alongside it: the Collector can scrape Prometheus endpoints and bridge metrics into the OTel pipeline while Prometheus continues to serve existing alert rules.

Pattern 1: Custom Resource Attributes for Validator Identity

The most important OpenTelemetry pattern for blockchain validators is resource attribute design. Resource attributes are key-value pairs attached to every span, metric, and log emitted by a process. They answer the question “which validator, on which chain, in which role?” without requiring that information to be repeated in every individual span.

Generic OpenTelemetry tutorials use standard resource attributes: service.name, service.version, host.name. These are insufficient for validator infrastructure where you need to correlate telemetry by validator address, chain ID, client type, and operator set.

Custom resource attributes for Cosmos validators:

# opentelemetry-collector-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: validators
data:
  config.yaml: |
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

      # Scrape Prometheus metrics from validator
      prometheus:
        config:
          scrape_configs:
            - job_name: cosmos-validator
              static_configs:
                - targets: ['localhost:26660']
              metric_relabel_configs:
                - source_labels: [__name__]
                  regex: 'tendermint_.*|cosmos_.*'
                  action: keep

    processors:
      # Resource processor: add validator identity to all telemetry
      resource:
        attributes:
          - key: validator.address
            value: "${VALIDATOR_ADDRESS}"
            action: insert
          - key: validator.chain_id
            value: "${CHAIN_ID}"
            action: insert
          - key: validator.client_type
            value: "${CLIENT_TYPE}"    # gaia, osmosis, neutron, etc.
            action: insert
          - key: validator.operator_address
            value: "${OPERATOR_ADDRESS}"
            action: insert
          - key: validator.role
            value: "${VALIDATOR_ROLE}"  # primary, standby, relayer
            action: insert
          - key: network.environment
            value: "${ENVIRONMENT}"    # mainnet, testnet
            action: insert

      # Kubernetes metadata (if running in k8s)
      k8sattributes:
        auth_type: serviceAccount
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.pod.name
            - k8s.node.name
          labels:
            - tag_name: validator.set
              key: validator-set
              from: pod

      batch:
        timeout: 5s
        send_batch_size: 1000

    exporters:
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
      loki:
        endpoint: http://loki:3100/loki/api/v1/push

    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [resource, k8sattributes, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp, prometheus]
          processors: [resource, batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [resource, k8sattributes, batch]
          exporters: [loki]

Deploy the Collector as a DaemonSet (one per node):

# opentelemetry-collector-daemonset.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: validator-agent
  namespace: validators
spec:
  mode: daemonset
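  config: |
    # The OpenTelemetryCollector resource takes the Collector config inline:
    # paste the contents of config.yaml from the ConfigMap above here.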
  env:
    - name: VALIDATOR_ADDRESS
      valueFrom:
        secretKeyRef:
          name: validator-identity
          key: address
    - name: CHAIN_ID
      value: cosmoshub-4
    - name: CLIENT_TYPE
      value: gaia
    - name: VALIDATOR_ROLE
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['validator-role']
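    - name: ENVIRONMENT
      value: mainnet
    # OPERATOR_ADDRESS is referenced by the resource processor above.
    # The secret key name below is an assumption; match it to your secret.
    - name: OPERATOR_ADDRESS
      valueFrom:
        secretKeyRef:
          name: validator-identity
          key: operator-address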

With validator.address as a resource attribute on every span, metric, and log, you can filter all telemetry for a specific validator in a single Grafana query, without knowing which pod it ran on or which node it was scheduled to.
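
For processes instrumented directly with an OTel SDK, the same identity can also be supplied through the standard OTEL_RESOURCE_ATTRIBUTES environment variable, which takes comma-separated key=value pairs. The sketch below only builds that string using the Go standard library. The helper name resourceAttrString and the sample values are hypothetical, and the percent-encoding step is a conservative assumption to keep commas and equals signs inside values from breaking the list.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// resourceAttrString renders attributes in the key=value,key=value form
// read from OTEL_RESOURCE_ATTRIBUTES. Keys are sorted so the output is
// stable. Values are percent-encoded so that "," and "=" inside a value
// cannot be mistaken for separators.
func resourceAttrString(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		// QueryEscape encodes spaces as "+"; switch those to "%20".
		v := strings.ReplaceAll(url.QueryEscape(attrs[k]), "+", "%20")
		pairs = append(pairs, k+"="+v)
	}
	return strings.Join(pairs, ",")
}

func main() {
	// Sample values only; they mirror the attribute keys used above.
	attrs := map[string]string{
		"validator.chain_id":    "cosmoshub-4",
		"validator.client_type": "gaia",
		"validator.role":        "primary",
		"network.environment":   "mainnet",
	}
	fmt.Println("OTEL_RESOURCE_ATTRIBUTES=" + resourceAttrString(attrs))
}
```

Setting identity at the process level and again in the Collector is harmless with action: insert, because insert only adds a key when it is not already present.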

Pattern 2: OpenTelemetry Blockchain: Epoch and Slot Span Design

The second pattern unique to this OpenTelemetry tutorial is epoch and slot correlation. Blockchain consensus operates in time windows (slots in Ethereum, blocks in Cosmos) that are the natural unit of analysis for validator performance. OTel spans for validator operations should be structured around these windows.

Instrumenting a Cosmos validator with epoch context:

// validator/telemetry/cosmos.go
package telemetry

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

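var tracer = otel.Tracer("cosmos-validator")

// epochLength is the number of blocks per epoch. The value below is a
// placeholder assumption; set it from your chain's epoch parameters.
var epochLength int64 = 100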

// StartBlockSpan creates a span for each block production attempt
// with block height and epoch as span attributes
func StartBlockSpan(ctx context.Context, height int64, epoch int64) (context.Context, trace.Span) {
    ctx, span := tracer.Start(ctx, "validator.block_production",
        trace.WithAttributes(
            attribute.Int64("cosmos.block.height", height),
            attribute.Int64("cosmos.epoch", epoch),
            attribute.Int64("cosmos.epoch_block", height%epochLength),
        ),
    )
    return ctx, span
}

// RecordAttestationResult records the outcome of an attestation attempt
func RecordAttestationResult(ctx context.Context, span trace.Span, included bool, inclusionDelay int64) {
    span.SetAttributes(
        attribute.Bool("cosmos.attestation.included", included),
        attribute.Int64("cosmos.attestation.inclusion_delay_blocks", inclusionDelay),
    )

    if !included {
        span.SetAttributes(
            attribute.String("cosmos.attestation.miss_reason", "inclusion_timeout"),
        )
        // Link to the chaos engineering alert if relevant
        // See: /chaos-engineering-kubernetes/ for the blast radius context
    }
}

// StartSigningSpan creates a child span for the signing operation
// Critical: signing latency relative to slot window is the key metric
func StartSigningSpan(ctx context.Context, signerType string) (context.Context, trace.Span) {
    ctx, span := tracer.Start(ctx, "validator.signing",
        trace.WithAttributes(
            attribute.String("validator.signer.type", signerType),  // local, remote, hsm
        ),
    )
    return ctx, span
}

Ethereum consensus client instrumentation:

// validator/telemetry/ethereum.go
package telemetry

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

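var ethTracer = otel.Tracer("ethereum-validator")

// AttestationResult is the duty outcome passed to RecordAttestationDuty.
// The field set is inferred from how the function below uses it.
type AttestationResult struct {
    CommitteeIndex    string
    InclusionDistance int64
    ClientType        string // lighthouse, prysm, teku
}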

func StartSlotSpan(ctx context.Context, slot uint64, epoch uint64) (context.Context, trace.Span) {
    ctx, span := ethTracer.Start(ctx, "validator.slot",
        trace.WithAttributes(
            attribute.Int64("ethereum.slot", int64(slot)),
            attribute.Int64("ethereum.epoch", int64(epoch)),
            attribute.Int64("ethereum.slot_in_epoch", int64(slot%32)),
        ),
    )
    return ctx, span
}

func RecordAttestationDuty(ctx context.Context, span trace.Span, result AttestationResult) {
    span.SetAttributes(
        attribute.String("ethereum.attestation.committee", result.CommitteeIndex),
        attribute.Int64("ethereum.attestation.inclusion_distance", result.InclusionDistance),
        attribute.Bool("ethereum.attestation.optimal", result.InclusionDistance == 1),
        attribute.String("ethereum.client", result.ClientType), // lighthouse, prysm, teku
    )

    if result.InclusionDistance > 1 {
        span.AddEvent("attestation_delayed",
            trace.WithAttributes(
                attribute.Int64("delay_slots", result.InclusionDistance-1),
            ),
        )
    }
}

Grafana query correlating attestation performance by epoch:

# Tempo/TraceQL: find Cosmos spans with a non-zero inclusion delay and return their epoch
{ resource.validator.chain_id = "cosmoshub-4" && span.cosmos.attestation.inclusion_delay_blocks > 0 }
| select(span.cosmos.epoch)

Pattern 3 – OpenTelemetry Tutorial: IBC Packet Correlation

For Cosmos chains running IBC relayers, packet timeouts and acknowledgment failures are directly related to validator availability. A relayer packet that times out during a period when the destination chain’s validators were struggling produces a causal relationship that only distributed tracing can surface.

Propagating trace context across IBC boundaries:

IBC packets carry application-level data. W3C TraceContext headers can be included in the packet memo field to propagate trace context from the sending chain to the receiving chain, connecting the sending transaction span to the receiving transaction span.

// relayer/telemetry/ibc.go
package telemetry

import (
    "context"
    "encoding/json"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

var relayerTracer = otel.Tracer("ibc-relayer")

// PacketMemo carries trace context in the IBC packet memo field
type PacketMemo struct {
    TraceContext map[string]string `json:"otel_trace_context,omitempty"`
    Source       string            `json:"source,omitempty"`
}

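// InjectTraceContext adds OTel trace context to an IBC packet memo.
// It requires otel.SetTextMapPropagator(propagation.TraceContext{}) at
// startup, because the default global propagator is a no-op.
// Note: the returned JSON replaces the incoming memo. If the memo already
// carries application data, merge the trace context into it instead.
func InjectTraceContext(ctx context.Context, memo string) string {
    carrier := propagation.MapCarrier{}
    otel.GetTextMapPropagator().Inject(ctx, carrier)

    packetMemo := PacketMemo{
        TraceContext: carrier,
        Source:       "hermes-relayer",
    }

    memoBytes, err := json.Marshal(packetMemo)
    if err != nil {
        return memo // fall back to the original memo rather than sending an empty one
    }
    return string(memoBytes)
}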

// StartPacketSpan creates a span for the full IBC packet lifecycle
func StartPacketSpan(ctx context.Context, sourceChannel, destChannel string, sequence uint64) (context.Context, trace.Span) {
    ctx, span := relayerTracer.Start(ctx, "ibc.packet.relay",
        trace.WithAttributes(
            attribute.String("ibc.source.channel", sourceChannel),
            attribute.String("ibc.dest.channel", destChannel),
            attribute.Int64("ibc.packet.sequence", int64(sequence)),
            attribute.String("ibc.protocol_version", "v2"),  // IBC v2 or classic
        ),
        trace.WithSpanKind(trace.SpanKindProducer),
    )
    return ctx, span
}

// RecordPacketTimeout records a packet timeout with context linking
func RecordPacketTimeout(span trace.Span, timeoutHeight uint64, timeoutTimestamp uint64) {
    span.SetAttributes(
        attribute.Int64("ibc.packet.timeout_height", int64(timeoutHeight)),
        attribute.Int64("ibc.packet.timeout_timestamp", int64(timeoutTimestamp)),
        attribute.String("ibc.packet.result", "timeout"),
    )
    span.AddEvent("packet_timeout")
}

With this instrumentation, a Tempo/Jaeger query for “all IBC packets that timed out during epoch X” returns traces connected to the validator availability spans from that same epoch, making the causal relationship visible without manual log correlation.
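
The receiving side has to read that context back out of the memo before it can continue the trace. The sketch below shows one way to do that with only the standard library, assuming the memo layout from PacketMemo above and a W3C traceparent value of the form version-traceid-spanid-flags. The names packetMemo, remoteParent, and parentFromMemo are hypothetical. In a real service the extracted values would be handed to the OTel propagator rather than parsed by hand.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// packetMemo mirrors the otel_trace_context field written by the sender.
type packetMemo struct {
	TraceContext map[string]string `json:"otel_trace_context,omitempty"`
}

// remoteParent is the sender's span identity recovered from the memo.
type remoteParent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

// isHexOfLen reports whether s is exactly n hexadecimal characters.
func isHexOfLen(s string, n int) bool {
	if len(s) != n {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// parentFromMemo extracts and validates a version-00 traceparent value.
func parentFromMemo(memo string) (remoteParent, error) {
	var m packetMemo
	if err := json.Unmarshal([]byte(memo), &m); err != nil {
		return remoteParent{}, fmt.Errorf("memo is not JSON: %w", err)
	}
	tp, ok := m.TraceContext["traceparent"]
	if !ok {
		return remoteParent{}, errors.New("memo carries no traceparent")
	}
	parts := strings.Split(tp, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return remoteParent{}, errors.New("unsupported traceparent format")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if !isHexOfLen(traceID, 32) || !isHexOfLen(spanID, 16) || !isHexOfLen(flags, 2) {
		return remoteParent{}, errors.New("malformed traceparent field")
	}
	if traceID == strings.Repeat("0", 32) || spanID == strings.Repeat("0", 16) {
		return remoteParent{}, errors.New("all-zero trace or span ID")
	}
	flagByte, _ := hex.DecodeString(flags) // already validated above
	return remoteParent{
		TraceID: traceID,
		SpanID:  spanID,
		Sampled: flagByte[0]&0x01 == 1, // lowest bit is the sampled flag
	}, nil
}

func main() {
	memo := `{"otel_trace_context":{"traceparent":"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},"source":"hermes-relayer"}`
	parent, err := parentFromMemo(memo)
	if err != nil {
		fmt.Println("no trace context:", err)
		return
	}
	fmt.Printf("trace=%s parent_span=%s sampled=%t\n", parent.TraceID, parent.SpanID, parent.Sampled)
}
```

Treating a missing or malformed memo as "no parent" rather than an error keeps untraced packets flowing normally.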

Pattern 4: OpenTelemetry Ethereum: ZK Proof Pipeline Tracing

The most advanced pattern in this OpenTelemetry tutorial is ZK proof pipeline observability. For EigenLayer AVS operators and validators in ZK-enabled ecosystems where SP1 proofs must be generated within slot windows, the proof pipeline is a critical path with a hard latency requirement.

Instrumenting an SP1 proof generation pipeline:

// prover/telemetry/sp1.go
package telemetry

import (
    "context"
    "time"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
    "go.opentelemetry.io/otel/trace"
)

var (
    proverTracer = otel.Tracer("sp1-prover")
    meter        = otel.Meter("sp1-prover")
)

// ProofGenerationMetrics tracks ZK proof timing against slot window budget
type ProofGenerationMetrics struct {
    ProofDuration    metric.Float64Histogram
    SlotBudgetUsed   metric.Float64Gauge     // Fraction of slot window consumed
    ProofsFailed     metric.Int64Counter
    ProofsSuccessful metric.Int64Counter
}

func NewProofMetrics() *ProofGenerationMetrics {
    proofDuration, _ := meter.Float64Histogram(
        "sp1.proof.duration_seconds",
        metric.WithDescription("Time to generate SP1 ZK proof"),
        metric.WithUnit("s"),
        metric.WithExplicitBucketBoundaries(1, 3, 6, 9, 12, 15, 20, 30),
    )

    slotBudget, _ := meter.Float64Gauge(
        "sp1.proof.slot_budget_fraction",
        metric.WithDescription("Fraction of slot window consumed by proof generation (1.0 = full slot)"),
    )

    failed, _ := meter.Int64Counter(
        "sp1.proof.failures_total",
        metric.WithDescription("Total failed proof generations"),
    )

    successful, _ := meter.Int64Counter(
        "sp1.proof.successes_total",
        metric.WithDescription("Total successful proof generations"),
    )

    return &ProofGenerationMetrics{
        ProofDuration:    proofDuration,
        SlotBudgetUsed:   slotBudget,
        ProofsFailed:     failed,
        ProofsSuccessful: successful,
    }
}

// StartProofSpan creates a span for the proof generation with slot budget tracking
func StartProofSpan(ctx context.Context, slot uint64, slotWindowSeconds float64) (context.Context, trace.Span, time.Time) {
    startTime := time.Now()
    ctx, span := proverTracer.Start(ctx, "sp1.proof.generate",
        trace.WithAttributes(
            attribute.Int64("sp1.slot", int64(slot)),
            attribute.Float64("sp1.slot_window_seconds", slotWindowSeconds),
            attribute.String("sp1.prover_version", "hypercube-mainnet"),
        ),
    )
    return ctx, span, startTime
}

// RecordProofResult records the outcome with budget consumption
func RecordProofResult(
    ctx context.Context,
    span trace.Span,
    metrics *ProofGenerationMetrics,
    startTime time.Time,
    slotWindowSeconds float64,
    success bool,
) {
    duration := time.Since(startTime).Seconds()
    budgetFraction := duration / slotWindowSeconds

    span.SetAttributes(
        attribute.Float64("sp1.proof.duration_seconds", duration),
        attribute.Float64("sp1.proof.slot_budget_fraction", budgetFraction),
        attribute.Bool("sp1.proof.within_budget", budgetFraction < 1.0),
        attribute.Bool("sp1.proof.success", success),
    )

    metrics.ProofDuration.Record(ctx, duration,
        metric.WithAttributes(
            attribute.Bool("success", success),
        ),
    )
    metrics.SlotBudgetUsed.Record(ctx, budgetFraction)

    if success {
        metrics.ProofsSuccessful.Add(ctx, 1)
    } else {
        metrics.ProofsFailed.Add(ctx, 1)
        span.AddEvent("proof_failed",
            trace.WithAttributes(
                attribute.Float64("budget_overrun_seconds", duration-slotWindowSeconds),
            ),
        )
    }
}

Prometheus alert for proof budget overrun:

- alert: SP1ProofBudgetOverrun
  expr: |
    sp1_proof_slot_budget_fraction > 0.8
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "SP1 proof generation consuming >80% of slot window"
    description: "Proof budget fraction {{ $value }}. At current rate, proofs will miss slot window."

- alert: SP1ProofMissedSlot
  expr: |
    increase(sp1_proof_failures_total[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "SP1 proof failed to generate within slot window"
    description: "{{ $value }} proof failures in the last 5 minutes. Tasks will be missed."

Pattern 5: Tail-Based Sampling for Consensus Timing

Generic OpenTelemetry tutorials recommend head-based sampling, where the decision is made at the start of a trace, before the outcome is known. For validator infrastructure, head-based sampling discards exactly the traces you need most: the ones where something went wrong.

Tail-based sampling in the Collector – keep all error and high-latency traces:

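# In the gateway Collector config, add tail_sampling ahead of the batch processor.
# All spans of a trace must reach the same Collector instance for the decision
# to see the whole trace.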
processors:
  tail_sampling:
    decision_wait: 10s          # Wait 10s for all spans in a trace before deciding
    num_traces: 50000           # Buffer up to 50k traces in memory
    expected_new_traces_per_sec: 100
    policies:
      # Always keep traces with errors
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow attestation traces (inclusion delay of 1 block or more)
      - name: keep-slow-attestations
        type: numeric_attribute
        numeric_attribute:
          key: cosmos.attestation.inclusion_delay_blocks
          min_value: 1

      # Always keep traces whose total duration exceeded 500ms (covers slow signing)
      - name: keep-slow-signing
        type: latency
        latency:
          threshold_ms: 500

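      # Always keep traces approaching or exceeding the proof budget.
      # sp1.proof.slot_budget_fraction is a float and numeric_attribute takes
      # integer bounds, so this policy uses an OTTL condition instead.
      - name: keep-proof-overrun
        type: ottl_condition
        ottl_condition:
          error_mode: ignore
          span:
            - 'attributes["sp1.proof.slot_budget_fraction"] >= 0.7'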

      # Sample 1% of healthy traces (to maintain baseline visibility)
      - name: sample-healthy
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

This configuration ensures that every attestation miss, every signing timeout, and every proof overrun is captured in full, while healthy operation produces only a 1% sampled trace stream. The result is a Tempo instance that contains exactly the traces needed for incident investigation without the storage cost of capturing every healthy slot.
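
To make the decision logic concrete, here is a small model of how those policies combine: a trace is kept if any policy matches, and only traces that match none of the "always keep" policies fall through to the probabilistic sample. This is an illustration of the policy semantics, not the Collector's implementation. The traceSummary type, the keepTrace function, and the hash-based sampling are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// traceSummary is what a tail sampler knows once a trace is complete.
type traceSummary struct {
	TraceID        string
	HasError       bool
	DurationMs     int64
	InclusionDelay int64   // cosmos.attestation.inclusion_delay_blocks
	BudgetFraction float64 // sp1.proof.slot_budget_fraction
}

// keepTrace returns whether to keep the trace and which policy matched.
// healthyPercent is the share (0-100) of healthy traces to retain.
func keepTrace(t traceSummary, healthyPercent uint32) (bool, string) {
	switch {
	case t.HasError:
		return true, "keep-errors"
	case t.InclusionDelay >= 1:
		return true, "keep-slow-attestations"
	case t.DurationMs > 500:
		return true, "keep-slow-signing"
	case t.BudgetFraction >= 0.7:
		return true, "keep-proof-overrun"
	}
	// Hash the trace ID so the same trace always gets the same decision.
	h := fnv.New32a()
	h.Write([]byte(t.TraceID))
	if h.Sum32()%100 < healthyPercent {
		return true, "sample-healthy"
	}
	return false, ""
}

func main() {
	traces := []traceSummary{
		{TraceID: "a1", HasError: true},
		{TraceID: "b2", InclusionDelay: 2},
		{TraceID: "c3", BudgetFraction: 0.92},
		{TraceID: "d4", DurationMs: 120},
	}
	for _, t := range traces {
		keep, policy := keepTrace(t, 1)
		fmt.Printf("trace=%s keep=%t policy=%q\n", t.TraceID, keep, policy)
	}
}
```

The decision can only be made after decision_wait has elapsed, which is why the buffer size in num_traces matters for memory.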

Pattern 6 – OpenTelemetry Cosmos: Collector Pipeline for High-Cardinality Metrics

Validator metrics are high-cardinality. A single Cosmos chain can have well over a hundred active validators. An EigenLayer operator running multiple AVSs has metrics labeled by validator address, chain ID, AVS ID, and operator set, producing cardinality in the tens of thousands of label combinations.

Cardinality management in the Collector:

processors:
  # Filter: drop metrics with cardinality that would OOM Prometheus
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          # Drop per-peer metrics if there are >100 peers
          - "tendermint_p2p_peer_.*_individual"

  # Transform: cap label cardinality
  transform:
    metric_statements:
      - context: datapoint
        statements:
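          # Replace the full validator address on datapoints with a short,
          # truncated form (a substring, not a hash). If the address is bech32
          # (for example cosmosvaloper1...), the leading characters are a shared
          # prefix, so raise the start offset past it.
          # Keep the full address only in resource attributes.
          - set(attributes["validator.address_short"],
              Substring(attributes["validator.address"], 0, 10))
          - delete_key(attributes, "validator.address")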

  # Memory limiter: prevent the Collector from OOMing during cardinality spikes
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
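
Before deciding which labels to drop or shorten, it helps to estimate how many series a label set can produce. The upper bound for one metric name is the product of the distinct values of each label. The sketch below computes that bound; the function name seriesUpperBound and the label counts are illustrative assumptions, not measurements.

```go
package main

import "fmt"

// seriesUpperBound multiplies the number of distinct values per label.
// Real series counts are usually lower because not every combination
// of label values occurs.
func seriesUpperBound(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		if n > 0 {
			total *= n
		}
	}
	return total
}

func main() {
	// Illustrative label counts only.
	labels := map[string]int{
		"validator_address": 100,
		"chain_id":          10,
		"avs_id":            5,
		"operator_set":      4,
	}
	fmt.Println("with validator_address:", seriesUpperBound(labels))

	// Moving the address to a resource attribute removes it from the product.
	delete(labels, "validator_address")
	fmt.Println("without validator_address:", seriesUpperBound(labels))
}
```

The bound applies per metric name, so a label removed from datapoints reduces every metric that carried it.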

Exemplars: connecting metrics to traces:

Exemplars allow Prometheus metrics to carry a pointer to a specific trace. When an alert fires on high attestation miss rate, clicking on the exemplar in Grafana opens the exact trace that corresponds to the worst miss in the alerting window.

// In your validator instrumentation, attach exemplars to histograms.
// attestationLatency (a metric.Float64Histogram) and chainID are assumed
// to be package-level values defined elsewhere.
func RecordAttestationLatency(ctx context.Context, latencyMs float64) {
    // With exemplars enabled, the OTel SDK attaches the current trace ID as
    // an exemplar when a histogram observation is recorded inside a sampled span
    attestationLatency.Record(ctx, latencyMs,
        metric.WithAttributes(
            attribute.String("chain_id", chainID),
        ),
    )
    // Grafana can now link from the histogram panel directly to the trace
}

Pattern 7 – Alert Correlation: Connecting OTel Traces to Slashing Risk

The final pattern completes the loop between observability and operational response. The goal is not just to see that something went wrong, but to have the trace context available at the moment an alert fires so the on-call engineer does not start from zero.

Linking Prometheus alerts to Tempo traces:

# In Prometheus alert rules, carry the trace query as an annotation.
# Annotations cannot read labels added by the rule itself, and a query
# string stored as a label would become part of the alert's identity.
groups:
- name: validator-observability
  rules:
  - alert: ValidatorAttestationMiss
    expr: |
      increase(cosmos_validator_missed_blocks_total[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "Validator {{ $labels.validator_address }} missed attestations"
      description: |
        {{ $value }} attestation misses in the last 5 minutes.
        Paste the trace_query annotation into Grafana Explore (Tempo datasource).
      # TraceQL for the matching traces
      trace_query: '{ resource.validator.address = "{{ $labels.validator_address }}" && span.cosmos.attestation.included = false }'

  - alert: EigenLayerTaskMiss
    expr: |
      increase(eigenlayer_avs_tasks_missed_total[5m]) > 0
    labels:
      severity: critical
    annotations:
      summary: "EigenLayer AVS task missed"
      description: |
        Task miss detected. Check chaos engineering baseline:
        https://thegoodshell.com/chaos-engineering-kubernetes/
        Trace query: { resource.validator.address = "{{ $labels.validator_address }}" }

Grafana dashboard variable for validator-scoped views:

{
  "templating": {
    "list": [
      {
        "name": "validator_address",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(cosmos_validator_missed_blocks_total, validator_address)",
        "refresh": 2,
        "label": "Validator"
      },
      {
        "name": "chain_id",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(cosmos_validator_missed_blocks_total{validator_address=\"$validator_address\"}, chain_id)"
      }
    ]
  }
}

With these variables, a single Grafana dashboard shows all signals (traces, metrics, and logs) scoped to a specific validator address and chain, without requiring separate dashboards per validator.

Conclusion

Generic OpenTelemetry tutorials instrument stateless services with independent requests. Validator infrastructure has state, timing constraints, financial consequences for observability gaps, and cross-chain causal relationships that require different instrumentation design.

The seven patterns in this tutorial address the gap between what generic OTel tutorials cover and what production validator operations require: custom resource attributes for validator identity, epoch and slot span design, IBC packet correlation, ZK proof pipeline tracing, tail-based sampling tuned for consensus timing, cardinality management for high-label-count validator metrics, and alert correlation linking traces to slashing risk.

At The Good Shell we design observability infrastructure for Cosmos, Ethereum, and EigenLayer validator operations. See our infrastructure and SRE services or our case studies.

For the official reference, the OpenTelemetry documentation at opentelemetry.io/docs covers every SDK and Collector component referenced in this tutorial. The full specification is at github.com/open-telemetry/opentelemetry-specification.

FAQ: OpenTelemetry Tutorial for Validators

Can I use OpenTelemetry alongside my existing Prometheus setup?

Yes, and this tutorial recommends it as the standard approach for most validator operators. The OTel Collector has a Prometheus receiver that scrapes existing Prometheus endpoints and bridges the metrics into the OTel pipeline. Your existing Prometheus alert rules continue to work unchanged. OTel adds the traces and structured logs that Prometheus alone cannot provide.

What is the performance overhead of OTel instrumentation on a running validator?

For Go-based validators (the majority of Cosmos clients), OTel instrumentation adds negligible overhead to the hot path. The SDK is designed for low-latency production use. The Collector does the heavy processing outside the validator process. The main resource consideration is the Collector itself: a production Collector with tail-based sampling requires 512MB-1GB of memory for the trace buffer.

How do I handle OTel instrumentation in consensus-critical code paths?

In slots or blocks where signing latency is the determining factor, avoid synchronous OTel operations on the critical path. Prefer the asynchronous (observable) counter and gauge instruments, which are read at collection time, over histograms, which have a higher per-observation cost. Span creation and attribute setting should happen outside the signing window. Batch your OTel writes: the SDK's batch span processor exports in the background, and the Collector’s batch processor handles the rest.
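
One way to keep telemetry off the signing path is to hand events to a bounded queue and record them from a background goroutine. The sketch below shows the idea with only the standard library: the signing code calls Offer, which never blocks, and drops the event if the queue is full. The names signingEvent, eventQueue, Offer, and Drain are hypothetical, and the record callback stands in for whatever OTel calls you make off the hot path.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// signingEvent is the minimal data captured inside the signing window.
type signingEvent struct {
	Height  int64
	Latency time.Duration
}

// eventQueue is a bounded, non-blocking hand-off between the signing
// path and the telemetry goroutine.
type eventQueue struct {
	ch      chan signingEvent
	dropped atomic.Int64
}

func newEventQueue(capacity int) *eventQueue {
	return &eventQueue{ch: make(chan signingEvent, capacity)}
}

// Offer enqueues an event without blocking. If the queue is full the
// event is dropped and counted, so signing is never delayed.
func (q *eventQueue) Offer(ev signingEvent) bool {
	select {
	case q.ch <- ev:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

// Drain records events until the channel is closed. Run it in its own
// goroutine; span and metric calls belong inside record.
func (q *eventQueue) Drain(record func(signingEvent)) {
	for ev := range q.ch {
		record(ev)
	}
}

func main() {
	q := newEventQueue(2)
	done := make(chan struct{})
	go func() {
		q.Drain(func(ev signingEvent) {
			fmt.Printf("height=%d signing_latency=%s\n", ev.Height, ev.Latency)
		})
		close(done)
	}()

	q.Offer(signingEvent{Height: 100, Latency: 42 * time.Millisecond})
	close(q.ch) // shut down the consumer for this demo
	<-done
	fmt.Println("dropped:", q.dropped.Load())
}
```

Exporting the dropped counter as its own metric makes telemetry loss visible instead of silent.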

Which OpenTelemetry signals are most useful for validator observability?

In order of operational value: Metrics (continuous visibility into attestation rate, peer count, signing latency), then Traces (root cause analysis for specific incidents), then Logs (audit trail for key operations). Start with metrics, because they give you the alerting foundation. Add traces for the operations that matter most: block production, attestation, signing. Add structured logs last, focused on key lifecycle events rather than verbose debug output.