An incident runbook template is the difference between an engineer at 3am who knows exactly what to do and one who spends the first fifteen minutes figuring out where to start. Toil rose 30% in 2026, the first increase in five years, despite AI tooling, and the State of Incident Management 2026 is clear about why: teams added more observability, more alerting, and more automation without fixing the underlying process problem. Runbooks that are static, outdated, or simply absent are a primary contributor.
This guide provides a complete incident runbook template for SRE and DevOps teams: the structural components every runbook needs, a copy-paste base template with all sections, severity classification, role assignments, escalation paths, mitigation steps, and a postmortem format. It also covers the distinction between runbooks and playbooks, terms that are frequently conflated, and what changes about runbook management when you migrate away from Opsgenie.
Runbook vs Playbook: The Distinction That Matters
Before presenting the incident runbook template, it is worth clarifying the terminology. The terms “runbook” and “playbook” are used interchangeably in many organizations and distinctly in others. Understanding the difference prevents building the wrong document for the problem.
A runbook documents the technical execution of a specific response. It answers: what commands do I run, what dashboards do I check, what services do I restart, what rollback do I execute? It is written for the on-call engineer who is already in the middle of an incident and needs precise, actionable steps, not context.
A playbook documents the organizational response to a class of incident. It answers: who is the incident commander, how do we communicate with customers, when do we escalate to leadership, what is the postmortem process? It is written for the team, not the individual.
The practical distinction: your incident runbook template for “high database connection pool exhaustion” is a runbook. Your incident runbook template for “how we handle any SEV-1 as an organization” is a playbook. Both are necessary. They are different documents.
The confusion matters because teams that write playbooks when they need runbooks end up with documents full of narrative that provide no actionable guidance at 3am. The test: can an engineer who joined six months ago follow this document through a live incident without asking anyone else for help? If not, it is not a runbook.
The Five Principles of an Effective Incident Runbook Template
Every incident runbook template should follow five principles regardless of the specific incident type it addresses:
Actionable: Every step is a command, a check, or a decision, not a paragraph of explanation. “Check if the database is responding” is not actionable. `psql -h db-prod -U app -c "SELECT 1;"` is actionable.
Accessible: The runbook must be reachable from the alert. If an engineer has to navigate three tools to find the relevant runbook after being paged, the coordination tax is already costing MTTR. The alert itself should link directly to the runbook (see the CI check sketched after this list).
Accurate: A runbook that describes a system that no longer exists is worse than no runbook: it sends engineers down dead-end diagnostic paths. Every runbook must have a last-reviewed date and an owner. Inaccurate runbooks should be treated the same as failed tests: something to be fixed before the next deployment.
Authoritative: One canonical version. Not a Google Doc someone updated informally, a Confluence page nobody knows about, and a Notion copy that is six months out of date. Single source of truth, version controlled, linked from the alert.
Adaptable: The runbook covers the known steps. Unknown scenarios require escalation paths and judgment calls that no template can fully anticipate. The runbook should explicitly say “if none of the above resolved the issue, escalate to X with the current state document.”
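The Accessible and Accurate principles are easiest to keep when they are enforced in CI rather than by convention. A minimal sketch, assuming Prometheus-style alert rules under a hypothetical `alerts/` directory, each alert carrying its runbook link in a `runbook_url` annotation, with yq v4 available:

```bash
#!/usr/bin/env bash
# Sketch: CI guard that every alerting rule links to a runbook.
# Assumptions: rule files live under alerts/*.yml (hypothetical path), each alert
# stores its link in annotations.runbook_url, and yq v4 is installed.
set -euo pipefail

# List alert names that have no runbook_url annotation
missing=$(yq '.groups[].rules[] | select(has("alert")) | select(.annotations.runbook_url == null) | .alert' alerts/*.yml)

if [ -n "$missing" ]; then
  echo "Alerts without a runbook_url annotation:"
  echo "$missing"
  exit 1
fi
echo "Every alert links to a runbook."
```

Run it in the pipeline that deploys alert rules; a missing link then fails the build instead of costing minutes during an incident.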
Severity Classification: Before the Runbook Applies
Every incident runbook template operates within a severity framework. Without defined severity levels, engineers spend the first minutes of every incident debating whether something is a SEV-1 or a SEV-2 instead of resolving it.
Severity levels and their operational implications:
**SEV-1 (Critical)**
- Definition: Complete service unavailability, data loss risk, security breach, or SLA breach imminent. All users or critical functionality affected.
- Response time: Immediate. Page primary on-call within 2 minutes.
- Escalation: Secondary on-call + engineering manager within 5 minutes.
- Communication: Customer status page update within 15 minutes.
- Examples: Production API down, database data corruption, auth service failure.

**SEV-2 (High)**
- Definition: Significant degradation affecting a meaningful subset of users or a critical feature. SLO burn rate elevated but SLA not yet at risk.
- Response time: Page primary on-call within 5 minutes.
- Escalation: Secondary on-call if unresolved after 15 minutes.
- Communication: Internal status update. External if customer-visible for > 30 min.
- Examples: P99 latency exceeding SLO, partial feature unavailability, elevated error rate.

**SEV-3 (Medium)**
- Definition: Non-critical degradation, single-service issue with workaround available, or issue affecting internal tooling only.
- Response time: Slack notification. Acknowledge within 30 minutes during business hours.
- Escalation: On-call engineer at their discretion.
- Communication: Internal only.
- Examples: Non-critical microservice degraded, monitoring gaps, staging environment issues.

**SEV-4 (Low)**
- Definition: Minor issue with minimal user impact. Can be scheduled for next sprint.
- Response time: Ticket creation. Review in next sprint planning.
- Escalation: None.
- Communication: None.
- Examples: Cosmetic UI issues, non-urgent performance improvements needed.

The decision tree for severity assignment should take under 60 seconds. If severity classification takes longer than that, the criteria are not clear enough.
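To make the 60-second budget concrete, the decision tree can be collapsed into three yes/no questions. A minimal sketch that mirrors the definitions above; the wording is illustrative and should be adapted to your own criteria:

```bash
#!/usr/bin/env bash
# Sketch: 60-second severity triage. Questions mirror the SEV definitions above;
# adapt the wording and thresholds to your own criteria.
read -rp "All users or a critical function down, data at risk, or SLA breach imminent? [y/N] " a
if [ "$a" = "y" ]; then echo "SEV-1: page primary on-call immediately."; exit 0; fi

read -rp "Meaningful subset of users or a critical feature degraded? [y/N] " b
if [ "$b" = "y" ]; then echo "SEV-2: page primary on-call."; exit 0; fi

read -rp "Any user-facing or internal impact, even with a workaround? [y/N] " c
if [ "$c" = "y" ]; then
  echo "SEV-3: Slack notification, acknowledge within 30 minutes."
else
  echo "SEV-4: create a ticket for next sprint."
fi
```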
The Incident Runbook Template: Base Structure
This is the incident runbook template structure that every service-specific runbook should follow. Copy this as a base and fill in the service-specific details.
# Runbook: [Alert Name / Incident Type]
**Service:** [Service name]
**Owner:** [Team name] | [Slack channel]
**Last reviewed:** [YYYY-MM-DD]
**Severity:** [SEV-1 / SEV-2 / SEV-3]
**Alert link:** [Direct link to the firing alert in your monitoring tool]
**Dashboard:** [Direct link to the service dashboard in Grafana]
---
## What is happening
[One to two sentences in plain English describing what the alert means and
what the user-facing impact is. Write this for an engineer who has never
seen this service before.]
Example: The payment service is returning errors for a significant percentage
of checkout requests. Users attempting to complete purchases are seeing
error messages and transactions are not completing.
---
## Immediate triage (first 5 minutes)
Confirm the scope:
- [ ] How many users are affected? Check: [specific dashboard panel link]
- [ ] Which region/availability zone? Check: [specific dashboard panel link]
- [ ] When did this start? Check: [alert history link]
- [ ] Is there a recent deployment? Check: [deployment pipeline link]
Confirm severity:
- [ ] Is this SEV-1 (all users affected) or SEV-2 (subset affected)?
- [ ] Assign severity and update the incident record.
---
## Role assignments
| Role | Responsibility | Current assignee |
|---|---|---|
| Incident commander | Owns the incident, coordinates the response, makes decisions | [Primary on-call] |
| Operations lead | Executes technical diagnosis and remediation steps | [Secondary on-call] |
| Communications lead | Updates status page, notifies stakeholders | [If SEV-1: Engineering manager] |
| Scribe | Records timeline, decisions, and actions in the incident channel | [Whoever is available] |
For SEV-3 and below: single on-call engineer handles all roles.
---
## Diagnosis steps
Work through these in order. Stop at the first step that identifies the
root cause and proceed to mitigation.
### Step 1 - Check recent deployments
```bash
# Check what deployed in the last 2 hours
kubectl rollout history deployment/payment-service -n production
# Check deployment timestamps
kubectl get deployments -n production -o wide
```
If a recent deployment correlates with the incident start time:
→ Go to **Mitigation: rollback**
### Step 2 - Check database connectivity
```bash
# Test database connection from the service pod
kubectl exec -it deployment/payment-service -n production -- \
psql -h db-prod.internal -U payment_app -c "SELECT 1;"
```
If connection fails:
→ Go to **Mitigation: database connectivity**
### Step 3 - Check connection pool exhaustion
```bash
# Check active connections vs max connections
kubectl exec -it deployment/payment-service -n production -- \
psql -h db-prod.internal -U payment_app -c \
"SELECT count(*), max_conn FROM pg_stat_activity, \
(SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') s;"
```
If active connections are at or near max:
→ Go to **Mitigation: connection pool**
### Step 4 - Check downstream dependencies
```bash
# Check Stripe API status
curl -s https://status.stripe.com/api/v2/status.json | jq '.status.indicator'
# Check internal payment gateway health
curl -s https://payment-gateway.internal/health | jq '.status'
```
If a downstream dependency is degraded:
→ Go to **Mitigation: dependency degradation**
### Step 5 - Check memory and CPU
```bash
# Check resource usage
kubectl top pods -n production -l app=payment-service
# Check for OOMKilled events
kubectl describe pods -n production -l app=payment-service | grep -A5 "OOMKilled"
```
---
## Mitigation steps
### Mitigation: Rollback
```bash
# Rollback to previous deployment
kubectl rollout undo deployment/payment-service -n production
# Monitor rollout status
kubectl rollout status deployment/payment-service -n production
# Verify error rate is recovering
# Check: [dashboard link] - error rate should drop within 2-3 minutes
```
Verification: Error rate below 1% for 5 consecutive minutes.
### Mitigation: Database connectivity
```bash
# Check if database pod is running
kubectl get pods -n database -l app=postgres
# Check database logs for recent errors
kubectl logs -n database -l app=postgres --since=30m | grep ERROR
# If database pod is unhealthy, restart it
kubectl rollout restart deployment/postgres -n database
```
Verification: `SELECT 1` returns successfully. Error rate recovering.
### Mitigation: Connection pool
```bash
# Identify and kill idle connections exceeding threshold
kubectl exec -it deployment/payment-service -n production -- \
psql -h db-prod.internal -U payment_app -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < NOW() - INTERVAL '5 minutes';"
# Scale up service replicas temporarily to distribute connection load
kubectl scale deployment/payment-service -n production --replicas=8
```
Verification: Active connections below 80% of max. Error rate recovering.
### Mitigation: Dependency degradation
If Stripe or other payment processor is degraded:
1. Update status page: "Payment processing is experiencing delays due to our payment processor. We are monitoring the situation."
2. Check provider status page for ETA
3. Do not restart services - the issue is external
4. Page engineering manager if degradation exceeds 30 minutes
---
## Escalation path
If none of the above mitigations resolve the incident within 20 minutes:

Level 1: Secondary on-call – [Name] | [Phone] | Slack: @[handle]
Level 2: Engineering manager – [Name] | [Phone] | Slack: @[handle]
Level 3: VP Engineering – [Name] | [Phone] | Slack: @[handle]
Level 4: CTO – For SEV-1 incidents unresolved after 45 minutes
When escalating, provide:
- Current severity level
- Duration of incident
- What has been tried and ruled out
- Current hypothesis
- Number of affected users
---
## Communication templates
**Status page update (SEV-1, first 15 minutes):**
We are investigating an issue affecting payment processing.
Some users may be unable to complete purchases. We are working
on a resolution and will provide an update in 30 minutes.

**Status page update (mitigation in progress):**
We have identified the root cause of the payment processing issue
and are applying a fix. We expect full restoration within [X] minutes.

**Resolution:**
The payment processing issue has been resolved. All services are
operating normally. We will publish a postmortem within 48 hours.
---
## Post-incident
### Timeline document
Complete this during the incident or immediately after:

[HH:MM] Alert fired: [alert name]
[HH:MM] Incident acknowledged by: [name]
[HH:MM] Severity declared: SEV-[X]
[HH:MM] [Action taken]
[HH:MM] [Finding or escalation]
[HH:MM] [Mitigation applied]
[HH:MM] Service restored
[HH:MM] Incident closed
### Postmortem trigger
SEV-1: Postmortem required within 48 hours.
SEV-2: Postmortem required within 5 business days.
SEV-3: Postmortem at engineering team discretion.
---
## Runbook maintenance
Last tested: [YYYY-MM-DD]
Next review due: [3 months from last review]
Owner: [Team / person responsible for keeping this current]
If a step in this runbook failed or was inaccurate during an incident,
file a ticket immediately: [ticket creation link]

The Postmortem Template
The postmortem is the mechanism that converts incidents into system improvements. An incident runbook template is incomplete without a corresponding postmortem structure that engineers will actually fill in.
# Postmortem: [Service] [Brief description]
**Incident date:** [YYYY-MM-DD]
**Severity:** [SEV-X]
**Duration:** [X hours Y minutes]
**Authors:** [Names]
---
## Summary
[Two to three sentences: what broke, why it broke, what the user impact was,
and what was done to fix it. Write this last.]
---
## Timeline
[Copy from the incident timeline document - no reconstruction from memory]
---
## Root cause
[One specific, technical root cause. Not "human error" - what specifically
failed and why. Example: "Connection pool limit set to 100 was insufficient
for current traffic - a slow query introduced in v3.4.2 held connections open
for 8-12 seconds instead of the normal 200ms, exhausting the pool under load."]
---
## Contributing factors
[Conditions that made the incident worse or harder to detect. Not root causes -
these are the things that amplified the impact or delayed resolution.]
- [Factor 1]
- [Factor 2]
---
## What went well
[Honest assessment of what worked. The on-call response, the monitoring that
caught it, the rollback that worked. This section matters for morale and for
identifying practices to replicate.]
---
## What went poorly
[Honest assessment of what slowed resolution. Runbook inaccuracies, missing
dashboards, slow escalation, unclear ownership. This section drives action items.]
---
## Action items
| Action | Owner | Priority | Due date |
|---|---|---|---|
| [Specific fix] | [Name] | High | [YYYY-MM-DD] |
| [Monitoring improvement] | [Name] | Medium | [YYYY-MM-DD] |
| [Runbook update] | [Name] | High | [YYYY-MM-DD] |
Action items must be specific and assigned. "Improve monitoring" is not an
action item. "Add alert for connection pool utilization > 80% with 5-minute
evaluation window" is an action item.
---
## Blameless declaration
This postmortem follows Google's blameless postmortem principles. The goal is
to understand what happened and prevent recurrence - not to assign blame. The
root cause is always a system failure, not an individual failure.

Runbook Storage and the Opsgenie Migration Question
Where you store your incident runbook template matters more than most teams acknowledge. The runbook that exists but cannot be found in the first two minutes of an incident is operationally equivalent to a runbook that does not exist.
The three requirements for runbook storage:
First, the alert must link directly to the runbook. Not to the wiki homepage or the team folder, but to the specific runbook for that specific alert. Engineers under stress should not have to navigate.
Second, runbooks must be version controlled. Every change should be trackable. When an incident reveals an inaccurate runbook step, you need to know who changed it and when.
Third, runbooks must be accessible without depending on the alerting tool itself. If PagerDuty or your incident management platform goes down during an incident, you need to be able to access your runbooks through a different path: a Git repository, a static documentation site, or a runbook-in-Slack pattern.
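A minimal sketch of the second and third requirements together, assuming the canonical runbooks are Markdown files in a Git repository; the repository URL and file paths are placeholders:

```bash
# Sketch: keep a local mirror of the runbook repo as a fallback path that does not
# depend on the alerting tool. Repo URL and paths are placeholders.
git clone --depth 1 git@github.com:example-org/runbooks.git ~/runbooks-mirror 2>/dev/null \
  || git -C ~/runbooks-mirror pull --ff-only

# Who changed a specific runbook step, and when (the version-control trail):
git -C ~/runbooks-mirror log --oneline -- payment-service/connection-pool.md
git -C ~/runbooks-mirror blame payment-service/connection-pool.md

# Refresh the mirror hourly via cron (add with crontab -e):
# 0 * * * * git -C ~/runbooks-mirror pull --ff-only
```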
The Opsgenie migration impact on runbooks:
Atlassian stopped selling Opsgenie in June 2025 and will shut it down completely on April 5, 2027. If your incident runbook templates live inside Opsgenie as embedded runbook links, alert annotations, or service catalog entries, they are at risk.
The migration creates an opportunity to audit your entire runbook library. Teams that have migrated report that the process reveals runbooks that were last updated years ago, runbooks that reference services that no longer exist, and alert-to-runbook links that were never properly maintained. The forced migration is worth more than the disruption cost if it results in a runbook library that is actually accurate.
When migrating runbooks away from Opsgenie:
Export all runbook content via the Opsgenie API before the shutdown. Do not rely on Atlassian’s automated migration tools for runbook content; the format does not import cleanly into most alternative tools. Rebuild runbook links in your new tool (PagerDuty, incident.io, Rootly, Better Stack) pointing to your canonical runbook storage, which should be version-controlled and independent of any specific alerting platform.
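A minimal export sketch against the documented Opsgenie REST API; the resources listed cover the configuration most teams need to rebuild elsewhere, the API key is assumed to have read access, and pagination is omitted for brevity:

```bash
# Sketch: archive Opsgenie configuration as JSON before the shutdown.
# Assumes a read-only API key; large accounts will need to follow pagination links.
export OPSGENIE_API_KEY="..."   # replace with your key

for resource in teams schedules escalations integrations; do
  curl -s -H "Authorization: GenieKey ${OPSGENIE_API_KEY}" \
    "https://api.opsgenie.com/v2/${resource}" > "opsgenie-${resource}.json"
done
```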
Measuring Runbook Quality
An incident runbook template is only as good as how well it performs during real incidents. Measure this explicitly.
Time to first action: Time from alert acknowledgment to the first diagnostic command executed. An effective runbook drives this to under two minutes. If engineers are spending time reading background instead of executing steps, the runbook is too verbose.
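This metric only gets managed if it is actually computed. A minimal sketch, assuming a hypothetical incidents.json export in which each incident records acknowledged_at and first_action_at as ISO-8601 timestamps; adapt the field names to whatever your incident tool exports:

```bash
# Sketch: minutes from acknowledgment to first diagnostic action, per incident.
# Assumes a hypothetical export with ISO-8601 acknowledged_at and first_action_at fields.
jq -r '.incidents[]
  | "\(.id): \(((.first_action_at | fromdate) - (.acknowledged_at | fromdate)) / 60) minutes to first action"' \
  incidents.json
```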
Runbook follow rate: Percentage of incidents where the on-call engineer could follow the runbook from detection to resolution without improvising. Track deviations; each deviation is a runbook improvement opportunity.
Runbook accuracy incidents: Number of times per quarter that an inaccurate runbook step contributed to delayed resolution. This metric should trend to zero. If it does not, runbook maintenance is not being treated as a first-class engineering responsibility.
Postmortem action item completion rate: Percentage of runbook-related action items completed within the agreed timeline. Runbooks that are never updated based on incident learnings will drift toward uselessness over time.
Conclusion
An incident runbook template is not a one-time documentation exercise. It is an operational artifact that degrades without maintenance, improves with each postmortem, and pays dividends every time an engineer resolves an incident in twelve minutes instead of ninety.
The toil increase in 2026 reflects organizations that invested in tooling without investing in process. Better monitoring, more alerting, and AI-assisted triage all have diminishing returns when the engineer responding to the alert does not have a clear, accurate, actionable document telling them what to do next.
Build the runbook, link it to the alert, review it after every incident, and assign a human owner who is responsible for keeping it accurate. That combination, not any specific tool, is what reduces MTTR.
At The Good Shell we implement SRE practices for startups and platform engineering teams, including incident response process design and runbook library development. See our SRE and DevOps services or our case studies.
For the authoritative reference on incident management, the Google SRE Book chapter on managing incidents covers the organizational model that the templates in this guide are built on.

