Prometheus Alertmanager setup is one of those infrastructure tasks that takes an hour to complete and months to get right. The installation is straightforward. The configuration that actually works in production (routing trees that don't flood on-call, inhibition rules that prevent alert storms, timing parameters calibrated to real incident patterns) requires experience that no official documentation can fully transmit.
This guide covers the complete Prometheus Alertmanager setup from installation through production hardening: the architecture, the configuration file structure, routing trees with real examples, Slack and PagerDuty integration, inhibition rules, silence management, high availability, and the alert fatigue patterns that make engineers stop paying attention to their phones at 3am.
What Prometheus Alertmanager Actually Does
Before touching configuration, understand where Alertmanager sits in the observability stack and why it exists as a separate component.
Prometheus evaluates alerting rules against metrics and fires alerts. But Prometheus is not responsible for what happens to those alerts: who gets notified, on which channel, with what frequency, and whether a minor alert should be suppressed because a more critical alert is already firing. That is Alertmanager’s job.
The separation is intentional and important. Alertmanager handles four distinct concerns that would be operationally chaotic to mix into Prometheus itself:
Grouping: when a network partition takes down 50 services simultaneously, you want one notification saying “50 services down, root cause: network partition”, not 50 separate pages. Grouping by shared labels batches related alerts into single notifications.
Deduplication: if the same alert fires from three Prometheus instances (in an HA setup), Alertmanager sends one notification, not three.
Routing: critical alerts go to PagerDuty and wake someone up. Warning alerts go to Slack. Info alerts go to a low-priority channel. Different teams own different services. The routing tree handles all of this through label matching.
Inhibition: if a node is down, every service on that node will fire alerts. Without inhibition, you get dozens of pages for symptoms of the same root cause. Inhibition suppresses secondary alerts when a root cause alert is already firing.
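The grouping concern in particular is easy to see in miniature. The sketch below (plain Python with hypothetical alert dicts, not Alertmanager's actual code) shows how alerts sharing the same values for the group_by labels collapse into a single notification group:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Collapse alerts into groups keyed by their group_by label values."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# 50 service alerts caused by one network partition, all in one cluster
alerts = [
    {"labels": {"alertname": "ServiceDown", "cluster": "prod", "service": f"svc-{i}"}}
    for i in range(50)
]

grouped = group_alerts(alerts, group_by=["alertname", "cluster"])
print(len(grouped))                           # 1 group -> 1 notification
print(len(grouped[("ServiceDown", "prod")]))  # containing all 50 alerts
```

With group_by: ['alertname', 'cluster'], all 50 alerts share one group key, so on-call receives one notification instead of 50 pages.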
Step 1: Installation
At the time of writing, the current stable Alertmanager release is v0.30.1 (January 2026). Always use the latest stable release from the official GitHub releases page; never pull from unverified sources.
Bare metal / VM installation:
# Create system user
sudo useradd --no-create-home --shell /bin/false alertmanager
# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
# Download and install
AM_VERSION="0.30.1"
cd /tmp
curl -LO "https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
tar xf "alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
cd "alertmanager-${AM_VERSION}.linux-amd64"
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
# Clean up
cd /tmp && rm -rf "alertmanager-${AM_VERSION}.linux-amd64"*
# Verify
alertmanager --version

Systemd service:
# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network-online.target
[Service]
Type=simple
User=alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.external-url=https://alertmanager.yourdomain.com
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager

Kubernetes deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.30.1
          args:
            - "--config.file=/etc/alertmanager/config.yml"
            - "--storage.path=/alertmanager"
          ports:
            - containerPort: 9093
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/alertmanager
            - name: storage
              mountPath: /alertmanager
      volumes:
        - name: config
          configMap:
            name: alertmanager-config
        - name: storage
          emptyDir: {}

Step 2: The Configuration File Structure
The Prometheus Alertmanager setup revolves entirely around alertmanager.yml. Every configuration decision flows through this file. Understanding its structure before writing a single line prevents the most common configuration mistakes.
# alertmanager.yml — top-level structure
global:
  # Defaults inherited by all receivers unless overridden
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

templates:
  # Custom notification templates
  - '/etc/alertmanager/templates/*.tmpl'

route:
  # The routing tree — every alert enters here
  receiver: 'default-slack'
  group_by: ['alertname', 'cluster', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Child routes — more specific matching

receivers:
  # Named notification configurations
  - name: 'default-slack'
    slack_configs: [...]

inhibit_rules:
  # Alert suppression rules
  - source_matchers: [...]
    target_matchers: [...]
    equal: [...]

The four top-level sections map directly to the four Alertmanager concerns: global for defaults, route for routing and grouping, receivers for notification destinations, and inhibit_rules for suppression.
Step 3: The Routing Tree
The routing tree is the heart of any Prometheus Alertmanager setup. Every alert enters at the root route and traverses the tree until it finds the most specific matching route. Understanding how the tree is evaluated is essential.
Evaluation rules:
- Routes are evaluated top to bottom.
- An alert matches the first route whose matchers it satisfies.
- continue: true allows an alert to keep matching routes after the current one, which is critical for sending to multiple destinations.
- If no child route matches, the alert uses the parent route’s configuration.
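These rules can be modeled in a few lines. The toy sketch below (equality matchers only, hypothetical dict-based routes, not the real implementation) mirrors the depth-first walk: the first matching child claims the alert unless it sets continue: true, and an unclaimed alert falls back to the parent's receiver:

```python
def matches(labels, matchers):
    """Equality matchers only, e.g. [("severity", "critical")]."""
    return all(labels.get(name) == value for name, value in matchers)

def find_receivers(route, labels):
    """Depth-first walk of the routing tree. The first matching child claims
    the alert unless it sets continue; an unclaimed alert uses this route's
    own receiver."""
    claimed = []
    for child in route.get("routes", []):
        if not matches(labels, child.get("matchers", [])):
            continue
        claimed.extend(find_receivers(child, labels))
        if not child.get("continue", False):
            break
    return claimed or [route["receiver"]]

# Hypothetical tree mirroring the critical-alert routes in this guide
tree = {
    "receiver": "default-slack",
    "routes": [
        {"matchers": [("severity", "critical")],
         "receiver": "pagerduty-critical", "continue": True},
        {"matchers": [("severity", "critical")],
         "receiver": "slack-critical"},
        {"matchers": [("severity", "warning")],
         "receiver": "slack-warning"},
    ],
}

print(find_receivers(tree, {"severity": "critical"}))  # both destinations
print(find_receivers(tree, {"severity": "info"}))      # root fallback
```

A critical alert lands on both pagerduty-critical and slack-critical because the first route sets continue: true; an alert matching no child falls back to the root receiver.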
Production routing tree for a multi-team infrastructure:
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'cluster', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Web3 infrastructure gets dedicated routing. The team matcher must
    # come before the generic severity routes below, or blockchain alerts
    # with severity labels would be claimed there and never reach this route.
    - matchers:
        - team="blockchain"
      receiver: 'blockchain-pagerduty'
      routes:
        - matchers:
            - alertname="ValidatorMissingBlocks"
          receiver: 'blockchain-critical'
          group_wait: 10s  # Faster for validator incidents
    # Critical alerts go to PagerDuty AND Slack
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      continue: true  # Keep matching — also send to Slack below
    # All alerts also route to Slack by severity
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-warning'
    # Watchdog heartbeat — send to null receiver (drop)
    - matchers:
        - alertname="Watchdog"
      receiver: 'null'
    # Maintenance window route
    - matchers:
        - environment="staging"
      receiver: 'slack-staging'
      repeat_interval: 24h  # Reduce staging noise

The continue: true pattern for critical alerts is important. Without it, a critical alert matches the first severity route and stops: it only goes to PagerDuty. With continue: true, it keeps traversing and also matches the Slack route, so critical alerts appear in both places.
Step 4: Receivers Configuration
Slack Integration
receivers:
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
        # Color based on alert status
        color: '{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
        # Action buttons
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Silence 4h'
            url: '{{ template "silence.link" . }}'

  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-warning'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'null'
    # Empty receiver for dropping alerts

PagerDuty Integration
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        send_resolved: true
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.instances" .Alerts.Firing }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'

Custom notification templates
Create /etc/alertmanager/templates/custom.tmpl:
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}
{{ define "slack.text" }}
{{ range .Alerts }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Environment:* {{ .Labels.environment }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}

Step 5: Inhibition Rules
Inhibition rules are the most powerful and most under-configured part of any Prometheus Alertmanager setup. Without them, a single infrastructure failure generates dozens of pages: one for each symptom, all pointing to the same root cause.
The three inhibition rules every production setup needs:
inhibit_rules:
  # Rule 1: Critical suppresses warning for the same alert
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster', 'namespace']

  # Rule 2: Node down suppresses all alerts on that node
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity=~"warning|critical"
    equal: ['instance']

  # Rule 3: Cluster unreachable suppresses all cluster alerts
  - source_matchers:
      - alertname="ClusterUnreachable"
    target_matchers:
      - cluster=~".+"
    equal: ['cluster']

Critical warning about the equal field:
The equal list specifies which labels must have identical values for the inhibition to apply. If you omit a label from equal, the rule is far more aggressive than intended. If you list a label in equal that neither the source nor target alert has, the rule applies unconditionally, potentially silencing alerts you need.
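To make that gotcha concrete, here is a toy model of the equal check (it ignores the source and target matchers and is not the real implementation): a label missing from both alerts compares as empty equals empty, so it never blocks the inhibition.

```python
def inhibited(source_labels, target_labels, equal):
    """Toy version of the equal-label check: a label absent from BOTH
    alerts compares as "" == "", so it never prevents suppression."""
    return all(source_labels.get(l, "") == target_labels.get(l, "") for l in equal)

# Hypothetical alerts for illustration
node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "node-1"}
disk_warn = {"alertname": "DiskFull", "severity": "warning", "instance": "node-1"}
other_disk = {"alertname": "DiskFull", "severity": "warning", "instance": "node-2"}

print(inhibited(node_down, disk_warn, equal=["instance"]))    # True: same node
print(inhibited(node_down, other_disk, equal=["instance"]))   # False: different node
# The gotcha: neither alert carries a "datacenter" label, so that rule
# matches unconditionally and silences the other node's alert too.
print(inhibited(node_down, other_disk, equal=["datacenter"])) # True
```

The last line is the dangerous case: listing a label in equal that neither alert has makes the rule apply everywhere, not nowhere.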
Always test inhibition rules before deploying to production. Use amtool to simulate:
# Test your routing configuration
amtool config routes test \
--config.file=/etc/alertmanager/alertmanager.yml \
severity=critical alertname=NodeDown
# Check which routes an alert would match
amtool config routes show \
--config.file=/etc/alertmanager/alertmanager.yml

Step 6: Timing Parameters and Alert Fatigue Prevention
The timing parameters in the routing tree have a bigger impact on on-call quality of life than any other configuration choice. Miscalibrated timing is the primary cause of alert fatigue: engineers who stop trusting their alerting system because they cannot distinguish signal from noise.
The three timing parameters:
group_wait: how long Alertmanager waits after the first alert in a new group before sending a notification. This allows related alerts to accumulate so the first notification shows the full picture. Default: 30 seconds. For most production environments, 30-60 seconds is correct.
group_interval: after the initial notification, how long between checking for new alerts in the same group. Default: 5 minutes. This prevents a continuously growing incident from generating continuous pages.
repeat_interval: how long before re-notifying for an unresolved group that has not changed. Default: 4 hours for most receivers, much shorter for critical. The most common misconfiguration is repeat_interval set too short: engineers get paged repeatedly for an incident that is known and being worked.
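As a rough mental model, the re-notification schedule for a group that fires once and never changes can be computed like this (a simplification: real Alertmanager also re-notifies at the next group_interval tick whenever the group gains or resolves alerts):

```python
def notification_times(group_wait, group_interval, repeat_interval, horizon):
    """Seconds at which an unchanged, unresolved group is (re-)notified:
    first send after group_wait, then on the first group_interval tick at
    which repeat_interval has elapsed since the previous send."""
    times = [group_wait]
    last = group_wait
    while True:
        nxt = last
        while nxt - last < repeat_interval:
            nxt += group_interval
        if nxt > horizon:
            break
        times.append(nxt)
        last = nxt
    return times

# Critical route from this guide: group_wait 10s, group_interval 2m,
# repeat_interval 1h, observed over a 4-hour incident
pages = notification_times(10, 120, 3600, horizon=4 * 3600)
print(pages)  # [10, 3610, 7210, 10810] -> roughly one page per hour
```

Four pages in four hours for a known, unresolved incident; halve repeat_interval and on-call gets double the pages for the same information.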
Production-calibrated timing:
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'cluster', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h  # Default — warnings and info
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      group_wait: 10s  # Faster initial notification for critical
      group_interval: 2m
      repeat_interval: 1h  # Re-page every hour if unresolved
    - matchers:
        - environment="staging"
      receiver: 'slack-staging'
      repeat_interval: 24h  # Staging alerts once per day maximum

Silence management during maintenance:
For planned maintenance, create silences rather than disabling Alertmanager. Silences are time-bounded, auditable, and expire automatically:
# Create a 4-hour silence for a specific instance during maintenance
amtool silence add \
--alertmanager.url=http://localhost:9093 \
--duration=4h \
--author="sre-team" \
--comment="Scheduled maintenance - node upgrade" \
instance="node-1.prod"
# List active silences
amtool silence query
# Expire a silence early
amtool silence expire SILENCE_ID

Step 7: High Availability Prometheus Alertmanager Setup
A single Alertmanager instance is a single point of failure in your alerting pipeline. If it goes down during an incident, alerts fire in Prometheus but nobody gets notified. A production Prometheus Alertmanager setup requires at least two instances in a cluster.
Alertmanager’s HA model uses a gossip protocol for cluster communication. Each instance knows about the others, shares notification state (to prevent duplicate notifications), and deduplicates across the cluster.
Two-node HA setup:
# alertmanager-1 systemd service
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=alertmanager-2.internal:9094 \
--web.external-url=https://alertmanager.yourdomain.com
# alertmanager-2 systemd service
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=alertmanager-1.internal:9094 \
--web.external-url=https://alertmanager.yourdomain.com

Critical: Prometheus must point to ALL Alertmanager instances, not a load balancer.
Prometheus’s alert deduplication relies on each Prometheus instance sending to all Alertmanager instances simultaneously. If you put a load balancer in front, Prometheus sends each alert to only one instance, breaking the deduplication model.
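The deduplication works because every instance derives the same key for "this group was already notified on this receiver" and shares that state over gossip. A loose sketch of the idea (the real notification log keys on the group key, receiver, and hashes of the firing alerts; the hashing below is illustrative only):

```python
import hashlib
import json

def notified_key(group_labels, receiver):
    """Illustrative stable key meaning 'this group was already notified on
    this receiver'. Every instance computing the same key is what lets the
    gossiped notification log suppress duplicate sends."""
    payload = json.dumps(
        {"labels": sorted(group_labels.items()), "receiver": receiver},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Both Prometheus servers send the same alert to both Alertmanagers; each
# instance derives the identical key, so once one records a send in the
# gossiped notification log, the other skips the duplicate notification.
key_am1 = notified_key({"alertname": "NodeDown", "cluster": "prod"}, "pagerduty-critical")
key_am2 = notified_key({"alertname": "NodeDown", "cluster": "prod"}, "pagerduty-critical")
print(key_am1 == key_am2)  # True
```

This is also why a load balancer breaks the model: if each alert reaches only one instance, there is no duplicate to suppress, and a dead instance silently drops its share of alerts.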
In your Prometheus configuration:
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1.internal:9093
            - alertmanager-2.internal:9093

Verify the cluster is healthy:
# Check cluster members
curl -s http://alertmanager-1:9093/api/v2/status | jq '.cluster'
# Should show both peers in "ready" state

Kubernetes HA with kube-prometheus-stack:
If you are using the kube-prometheus-stack Helm chart, HA is configured by setting replicas:
# values.yaml
alertmanager:
  alertmanagerSpec:
    replicas: 3  # Three-node cluster
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi

Connecting Prometheus to Alertmanager
The Prometheus Alertmanager setup is not complete until Prometheus is configured to send alerts. In your prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093  # Or your HA instances

rule_files:
  - /etc/prometheus/rules/*.yml

Example alerting rules that integrate with the routing configuration above:
# /etc/prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          runbook_url: "https://wiki.yourdomain.com/runbooks/node-down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High memory usage on {{ $labels.instance }}: {{ $value | humanize }}%"

      - alert: ValidatorMissingBlocks
        expr: |
          increase(cosmos_validator_missed_blocks_total[5m]) > 10
        for: 1m
        labels:
          severity: critical
          team: blockchain
        annotations:
          summary: "Validator {{ $labels.validator }} missing blocks"
          runbook_url: "https://wiki.yourdomain.com/runbooks/validator-missing-blocks"

Conclusion
A production Prometheus Alertmanager setup is not a one-time configuration; it is an ongoing calibration between signal and noise. The routing tree, inhibition rules, and timing parameters need to evolve as your infrastructure grows and your team’s on-call patterns become clearer.
The teams who trust their alerting at 3am are the ones who invested in inhibition rules that prevent alert storms, timing parameters calibrated to real incident patterns, and silence workflows that make maintenance windows clean. Alert fatigue is an infrastructure problem, not a people problem.
At The Good Shell we build and operate observability stacks for DevOps and SRE teams across infrastructure and Web3. See our SRE and infrastructure services or read our case studies to see what production monitoring looks like in practice.
For the complete Alertmanager configuration reference and all available receiver integrations, the official Prometheus Alertmanager documentation is the authoritative source.

