Prometheus Alertmanager setup is one of those infrastructure tasks that takes an hour to complete and months to get right. The installation is straightforward. The configuration that actually works in production (routing trees that don't flood on-call, inhibition rules that prevent alert storms, timing parameters calibrated to real incident patterns) requires experience that no official documentation can fully transmit.
This guide covers the complete Prometheus Alertmanager setup from installation through production hardening: the architecture, the configuration file structure, routing trees with real examples, Slack and PagerDuty integration, inhibition rules, silence management, high availability, and the alert fatigue patterns that make engineers stop paying attention to their phones at 3am.
What Prometheus Alertmanager Actually Does
Before touching configuration, understand where Alertmanager sits in the observability stack and why it exists as a separate component.
Prometheus evaluates alerting rules against metrics and fires alerts. But Prometheus is not responsible for what happens to those alerts: who gets notified, on which channel, with what frequency, and whether a minor alert should be suppressed because a more critical alert is already firing. That is Alertmanager’s job.
The separation is intentional and important. Alertmanager handles four distinct concerns that would be operationally chaotic to mix into Prometheus itself:
Grouping: when a network partition takes down 50 services simultaneously, you want one notification saying “50 services down, root cause: network partition”, not 50 separate pages. Grouping by shared labels batches related alerts into single notifications.
Deduplication: if the same alert fires from three Prometheus instances (in an HA setup), Alertmanager sends one notification, not three.
Routing: critical alerts go to PagerDuty and wake someone up. Warning alerts go to Slack. Info alerts go to a low-priority channel. Different teams own different services. The routing tree handles all of this through label matching.
Inhibition: if a node is down, every service on that node will fire alerts. Without inhibition, you get dozens of pages for symptoms of the same root cause. Inhibition suppresses secondary alerts when a root cause alert is already firing.
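The grouping concern in particular is easy to see in miniature. The sketch below (plain Python with hypothetical alert dicts, not Alertmanager's actual code) shows how alerts sharing the same values for the group_by labels collapse into a single notification group:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Collapse alerts into groups keyed by their group_by label values."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# 50 service alerts caused by one network partition, all in one cluster
alerts = [
    {"labels": {"alertname": "ServiceDown", "cluster": "prod", "service": f"svc-{i}"}}
    for i in range(50)
]

grouped = group_alerts(alerts, group_by=["alertname", "cluster"])
print(len(grouped))                           # 1 group -> 1 notification
print(len(grouped[("ServiceDown", "prod")]))  # containing all 50 alerts
```

With group_by: ['alertname', 'cluster'], all 50 alerts share one group key, so on-call receives one notification instead of 50 pages.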
Step 1: Installation
At the time of writing, the current stable Alertmanager release is v0.30.1 (January 2026). Always use the latest stable release from the official GitHub releases page; never pull from unverified sources.
Bare metal / VM installation:
# Create system user
sudo useradd --no-create-home --shell /bin/false alertmanager
# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
# Download and install
AM_VERSION="0.30.1"
cd /tmp
curl -LO "https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
tar xf "alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
cd "alertmanager-${AM_VERSION}.linux-amd64"
sudo cp alertmanager amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
# Clean up
cd /tmp && rm -rf "alertmanager-${AM_VERSION}.linux-amd64"*
# Verify
alertmanager --version

Systemd service:
# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network-online.target
[Service]
Type=simple
User=alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.external-url=https://alertmanager.yourdomain.com
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager

Kubernetes deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.30.1
          args:
            - "--config.file=/etc/alertmanager/config.yml"
            - "--storage.path=/alertmanager"
          ports:
            - containerPort: 9093
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/alertmanager
            - name: storage
              mountPath: /alertmanager
      volumes:
        - name: config
          configMap:
            name: alertmanager-config
        - name: storage
          emptyDir: {}

Step 2: The Configuration File Structure
The Prometheus Alertmanager setup revolves entirely around alertmanager.yml. Every configuration decision flows through this file. Understanding its structure before writing a single line prevents the most common configuration mistakes.
# alertmanager.yml — top-level structure
global:
  # Defaults inherited by all receivers unless overridden
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

templates:
  # Custom notification templates
  - '/etc/alertmanager/templates/*.tmpl'

route:
  # The routing tree — every alert enters here
  receiver: 'default-slack'
  group_by: ['alertname', 'cluster', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Child routes — more specific matching

receivers:
  # Named notification configurations
  - name: 'default-slack'
    slack_configs: [...]

inhibit_rules:
  # Alert suppression rules
  - source_matchers: [...]
    target_matchers: [...]
    equal: [...]

The four top-level sections map directly to the four Alertmanager concerns: global for defaults, route for routing and grouping, receivers for notification destinations, and inhibit_rules for suppression.
Step 3: The Routing Tree
The routing tree is the heart of any Prometheus Alertmanager setup. Every alert enters at the root route and traverses the tree until it finds the most specific matching route. Understanding how the tree is evaluated is essential.
Evaluation rules:
- Routes are evaluated top to bottom.
- An alert matches the first route whose matchers it satisfies.
- continue: true allows an alert to keep matching routes after the current one, which is critical for sending to multiple destinations.
- If no child route matches, the alert uses the parent route’s configuration.
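These rules can be modeled in a few lines. The toy sketch below (equality matchers only, hypothetical dict-based routes, not the real implementation) mirrors the depth-first walk: the first matching child claims the alert unless it sets continue: true, and an unclaimed alert falls back to the parent's receiver:

```python
def matches(labels, matchers):
    """Equality matchers only, e.g. [("severity", "critical")]."""
    return all(labels.get(name) == value for name, value in matchers)

def find_receivers(route, labels):
    """Depth-first walk of the routing tree. The first matching child claims
    the alert unless it sets continue; an unclaimed alert uses this route's
    own receiver."""
    claimed = []
    for child in route.get("routes", []):
        if not matches(labels, child.get("matchers", [])):
            continue
        claimed.extend(find_receivers(child, labels))
        if not child.get("continue", False):
            break
    return claimed or [route["receiver"]]

# Hypothetical tree mirroring the critical-alert routes in this guide
tree = {
    "receiver": "default-slack",
    "routes": [
        {"matchers": [("severity", "critical")],
         "receiver": "pagerduty-critical", "continue": True},
        {"matchers": [("severity", "critical")],
         "receiver": "slack-critical"},
        {"matchers": [("severity", "warning")],
         "receiver": "slack-warning"},
    ],
}

print(find_receivers(tree, {"severity": "critical"}))  # both destinations
print(find_receivers(tree, {"severity": "info"}))      # root fallback
```

A critical alert lands on both pagerduty-critical and slack-critical because the first route sets continue: true; an alert matching no child falls back to the root receiver.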
Production routing tree for a multi-team infrastructure:
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'cluster', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Web3 infrastructure gets dedicated routing. The team matcher must
    # come before the generic severity routes below, or blockchain alerts
    # with severity labels would be claimed there and never reach this route.
    - matchers:
        - team="blockchain"
      receiver: 'blockchain-pagerduty'
      routes:
        - matchers:
            - alertname="ValidatorMissingBlocks"
          receiver: 'blockchain-critical'
          group_wait: 10s  # Faster for validator incidents
    # Critical alerts go to PagerDuty AND Slack
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      continue: true  # Keep matching — also send to Slack below
    # All alerts also route to Slack by severity
    - matchers:
        - severity="critical"
      receiver: 'slack-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-warning'
    # Watchdog heartbeat — send to null receiver (drop)
    - matchers:
        - alertname="Watchdog"
      receiver: 'null'
    # Maintenance window route
    - matchers:
        - environment="staging"
      receiver: 'slack-staging'
      repeat_interval: 24h  # Reduce staging noise

The continue: true pattern for critical alerts is important. Without it, a critical alert matches the first severity route and stops: it only goes to PagerDuty. With continue: true, it keeps traversing and also matches the Slack route, so critical alerts appear in both places.
Step 4: Receivers Configuration
Slack Integration
receivers:
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-critical'
        send_resolved: true
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
        # Color based on alert status
        color: '{{ if eq .Status "firing" }}{{ if eq .CommonLabels.severity "critical" }}danger{{ else }}warning{{ end }}{{ else }}good{{ end }}'
        # Action buttons
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Silence 4h'
            url: '{{ template "silence.link" . }}'

  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-warning'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'null'
    # Empty receiver for dropping alerts

PagerDuty Integration
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        send_resolved: true
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.instances" .Alerts.Firing }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'

Custom notification templates
Create /etc/alertmanager/templates/custom.tmpl:
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}
{{ define "slack.text" }}
{{ range .Alerts }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Environment:* {{ .Labels.environment }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}

Step 5: Inhibition Rules
Inhibition rules are the most powerful and most under-configured part of any Prometheus Alertmanager setup. Without them, a single infrastructure failure generates dozens of pages: one for each symptom, all pointing to the same root cause.
The three inhibition rules every production setup needs:
inhibit_rules:
  # Rule 1: Critical suppresses warning for the same alert
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster', 'namespace']

  # Rule 2: Node down suppresses all alerts on that node
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity=~"warning|critical"
    equal: ['instance']

  # Rule 3: Cluster unreachable suppresses all cluster alerts
  - source_matchers:
      - alertname="ClusterUnreachable"
    target_matchers:
      - cluster=~".+"
    equal: ['cluster']

Critical warning about the equal field:
The equal list specifies which labels must have identical values for the inhibition to apply. If you omit a label from equal, the rule is far more aggressive than intended. If you list a label in equal that neither the source nor target alert has, the rule applies unconditionally, potentially silencing alerts you need.
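To make that gotcha concrete, here is a toy model of the equal check (it ignores the source and target matchers and is not the real implementation): a label missing from both alerts compares as empty equals empty, so it never blocks the inhibition.

```python
def inhibited(source_labels, target_labels, equal):
    """Toy version of the equal-label check: a label absent from BOTH
    alerts compares as "" == "", so it never prevents suppression."""
    return all(source_labels.get(l, "") == target_labels.get(l, "") for l in equal)

# Hypothetical alerts for illustration
node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "node-1"}
disk_warn = {"alertname": "DiskFull", "severity": "warning", "instance": "node-1"}
other_disk = {"alertname": "DiskFull", "severity": "warning", "instance": "node-2"}

print(inhibited(node_down, disk_warn, equal=["instance"]))    # True: same node
print(inhibited(node_down, other_disk, equal=["instance"]))   # False: different node
# The gotcha: neither alert carries a "datacenter" label, so that rule
# matches unconditionally and silences the other node's alert too.
print(inhibited(node_down, other_disk, equal=["datacenter"])) # True
```

The last line is the dangerous case: listing a label in equal that neither alert has makes the rule apply everywhere, not nowhere.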
Always test inhibition rules before deploying to production. Use amtool to simulate:
# Test your routing configuration
amtool config routes test \
--config.file=/etc/alertmanager/alertmanager.yml \
severity=critical alertname=NodeDown
# Check which routes an alert would match
amtool config routes show \
--config.file=/etc/alertmanager/alertmanager.yml

Step 6: Timing Parameters and Alert Fatigue Prevention
The timing parameters in the routing tree have a bigger impact on on-call quality of life than any other configuration choice. Miscalibrated timing is the primary cause of alert fatigue: engineers who stop trusting their alerting system because they cannot distinguish signal from noise.
The three timing parameters:
group_wait: how long Alertmanager waits after the first alert in a new group before sending a notification. This allows related alerts to accumulate so the first notification shows the full picture. Default: 30 seconds. For most production environments, 30-60 seconds is correct.
group_interval: after the initial notification, how long between checking for new alerts in the same group. Default: 5 minutes. This prevents a continuously growing incident from generating continuous pages.
repeat_interval: how long before re-notifying for an unresolved group that has not changed. Default: 4 hours for most receivers, much shorter for critical. The most common misconfiguration is repeat_interval set too short: engineers get paged repeatedly for an incident that is known and being worked.
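As a rough mental model, the re-notification schedule for a group that fires once and never changes can be computed like this (a simplification: real Alertmanager also re-notifies at the next group_interval tick whenever the group gains or resolves alerts):

```python
def notification_times(group_wait, group_interval, repeat_interval, horizon):
    """Seconds at which an unchanged, unresolved group is (re-)notified:
    first send after group_wait, then on the first group_interval tick at
    which repeat_interval has elapsed since the previous send."""
    times = [group_wait]
    last = group_wait
    while True:
        nxt = last
        while nxt - last < repeat_interval:
            nxt += group_interval
        if nxt > horizon:
            break
        times.append(nxt)
        last = nxt
    return times

# Critical route from this guide: group_wait 10s, group_interval 2m,
# repeat_interval 1h, observed over a 4-hour incident
pages = notification_times(10, 120, 3600, horizon=4 * 3600)
print(pages)  # [10, 3610, 7210, 10810] -> roughly one page per hour
```

Four pages in four hours for a known, unresolved incident; halve repeat_interval and on-call gets double the pages for the same information.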
Production-calibrated timing:
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'cluster', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h  # Default — warnings and info
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      group_wait: 10s  # Faster initial notification for critical
      group_interval: 2m
      repeat_interval: 1h  # Re-page every hour if unresolved
    - matchers:
        - environment="staging"
      receiver: 'slack-staging'
      repeat_interval: 24h  # Staging alerts once per day maximum

Silence management during maintenance:
For planned maintenance, create silences rather than disabling Alertmanager. Silences are time-bounded, auditable, and expire automatically:
# Create a 4-hour silence for a specific instance during maintenance
amtool silence add \
--alertmanager.url=http://localhost:9093 \
--duration=4h \
--author="sre-team" \
--comment="Scheduled maintenance - node upgrade" \
instance="node-1.prod"
# List active silences
amtool silence query
# Expire a silence early
amtool silence expire SILENCE_ID

Step 7: High Availability Prometheus Alertmanager Setup
A single Alertmanager instance is a single point of failure in your alerting pipeline. If it goes down during an incident, alerts fire in Prometheus but nobody gets notified. A production Prometheus Alertmanager setup requires at least two instances in a cluster.
Alertmanager’s HA model uses a gossip protocol for cluster communication. Each instance knows about the others, shares notification state (to prevent duplicate notifications), and deduplicates across the cluster.
Two-node HA setup:
# alertmanager-1 systemd service
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=alertmanager-2.internal:9094 \
--web.external-url=https://alertmanager.yourdomain.com
# alertmanager-2 systemd service
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=alertmanager-1.internal:9094 \
--web.external-url=https://alertmanager.yourdomain.com

Critical: Prometheus must point to ALL Alertmanager instances, not a load balancer.
Prometheus’s alert deduplication relies on each Prometheus instance sending to all Alertmanager instances simultaneously. If you put a load balancer in front, Prometheus sends each alert to only one instance, breaking the deduplication model.
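The deduplication works because every instance derives the same key for "this group was already notified on this receiver" and shares that state over gossip. A loose sketch of the idea (the real notification log keys on the group key, receiver, and hashes of the firing alerts; the hashing below is illustrative only):

```python
import hashlib
import json

def notified_key(group_labels, receiver):
    """Illustrative stable key meaning 'this group was already notified on
    this receiver'. Every instance computing the same key is what lets the
    gossiped notification log suppress duplicate sends."""
    payload = json.dumps(
        {"labels": sorted(group_labels.items()), "receiver": receiver},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Both Prometheus servers send the same alert to both Alertmanagers; each
# instance derives the identical key, so once one records a send in the
# gossiped notification log, the other skips the duplicate notification.
key_am1 = notified_key({"alertname": "NodeDown", "cluster": "prod"}, "pagerduty-critical")
key_am2 = notified_key({"alertname": "NodeDown", "cluster": "prod"}, "pagerduty-critical")
print(key_am1 == key_am2)  # True
```

This is also why a load balancer breaks the model: if each alert reaches only one instance, there is no duplicate to suppress, and a dead instance silently drops its share of alerts.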
In your Prometheus configuration:
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-1.internal:9093
            - alertmanager-2.internal:9093

Verify the cluster is healthy:
# Check cluster members
curl -s http://alertmanager-1:9093/api/v2/status | jq '.cluster'
# Should show both peers in "ready" state

Kubernetes HA with kube-prometheus-stack:
If you are using the kube-prometheus-stack Helm chart, HA is configured by setting replicas:
# values.yaml
alertmanager:
  alertmanagerSpec:
    replicas: 3  # Three-node cluster
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi

Connecting Prometheus to Alertmanager
The Prometheus Alertmanager setup is not complete until Prometheus is configured to send alerts. In your prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093  # Or your HA instances

rule_files:
  - /etc/prometheus/rules/*.yml

Example alerting rules that integrate with the routing configuration above:
# /etc/prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          runbook_url: "https://wiki.yourdomain.com/runbooks/node-down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "High memory usage on {{ $labels.instance }}: {{ $value | humanize }}%"

      - alert: ValidatorMissingBlocks
        expr: |
          increase(cosmos_validator_missed_blocks_total[5m]) > 10
        for: 1m
        labels:
          severity: critical
          team: blockchain
        annotations:
          summary: "Validator {{ $labels.validator }} missing blocks"
          runbook_url: "https://wiki.yourdomain.com/runbooks/validator-missing-blocks"

Conclusion
A production Prometheus Alertmanager setup is not a one-time configuration; it is an ongoing calibration between signal and noise. The routing tree, inhibition rules, and timing parameters need to evolve as your infrastructure grows and your team’s on-call patterns become clearer.
The teams who trust their alerting at 3am are the ones who invested in inhibition rules that prevent alert storms, timing parameters calibrated to real incident patterns, and silence workflows that make maintenance windows clean. Alert fatigue is an infrastructure problem, not a people problem.
At The Good Shell we build and operate observability stacks for DevOps and SRE teams across infrastructure and Web3. See our SRE and infrastructure services or read our case studies to see what production monitoring looks like in practice.
For the complete Alertmanager configuration reference and all available receiver integrations, the official Prometheus Alertmanager documentation is the authoritative source.

