Kubernetes monitoring is critical for any production cluster. If your system goes down and you find out through a user complaint, you don’t have a monitoring problem – you have a visibility problem.
Prometheus and Grafana are the industry standard for Kubernetes observability. Prometheus scrapes and stores metrics from your cluster. Grafana visualises them. Together they give you real-time insight into pod health, node performance, resource usage, and custom application metrics – so you know about problems before your users do.
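Under the hood, Prometheus scrapes a plain-text exposition format over HTTP: one `metric_name{labels} value` line per sample. As a rough illustration of what a scraped line looks like (real applications should use an official Prometheus client library, not hand-rolled formatting):

```python
# Sketch of Prometheus' text exposition format: metric_name{label="value",...} sample
# Illustrative only - use an official Prometheus client library in real services.
def format_metric(name: str, labels: dict, value: float) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

print(format_metric("http_requests_total", {"method": "GET", "code": "200"}, 1027))
# http_requests_total{code="200",method="GET"} 1027
```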
In this guide we’ll walk through setting up a complete, production-ready monitoring stack from scratch: deploying Prometheus and Grafana on Kubernetes using Helm, configuring alerting rules, connecting a Grafana dashboard, and setting up PagerDuty notifications for on-call routing.
By the end you’ll have a Kubernetes monitoring stack you can actually rely on in production.
Prerequisites
Before you start, make sure you have the following:
- A running Kubernetes cluster (local with Minikube, or cloud-based on EKS, GKE, or AKS)
- `kubectl` configured and pointing at your cluster
- Helm 3 installed
- Basic familiarity with Kubernetes concepts: pods, namespaces, services, and deployments
Step 1 – Kubernetes Monitoring Namespace Setup
The first step in any Kubernetes monitoring setup is isolating your stack in its own namespace. This makes it easier to manage RBAC, resource quotas, and network policies later.
```
kubectl create namespace monitoring
```

Verify it was created:
```
kubectl get namespaces
```

You should see `monitoring` in the list.
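With the namespace in place, you can optionally cap what the monitoring stack is allowed to consume with a ResourceQuota. The limits below are illustrative assumptions, not recommendations – tune them to your cluster's capacity:

```yaml
# monitoring-quota.yaml (optional; values are illustrative assumptions)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: monitoring-quota
  namespace: monitoring
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
```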
Step 2 – Add the Prometheus Community Helm repository
The kube-prometheus-stack Helm chart is the most complete option for Kubernetes monitoring. It bundles Prometheus, Alertmanager, Grafana, and a set of pre-configured Kubernetes dashboards and alerting rules – everything you need in a single chart.
```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

Step 3 – Create a custom values file
Before installing, create a values.yaml file to customise the deployment. This lets you persist data, configure storage, and set Grafana credentials without modifying the chart directly.
```
# values.yaml
grafana:
  adminPassword: "your-secure-password"  # Change this
  persistence:
    enabled: true
    size: 5Gi
  service:
    type: ClusterIP

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi
```

A few things to note here. Retention is set to 15 days – adjust this based on your storage capacity and compliance requirements. For production, 30 days is a sensible default. Grafana persistence is enabled so dashboards and data sources survive pod restarts. And we’re using ClusterIP for the Grafana service – we’ll expose it securely via port-forwarding or an Ingress later.
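To sanity-check the 20Gi request against your retention window, you can estimate disk usage from your ingestion rate. The Prometheus storage documentation cites roughly 1–2 bytes per sample after compression; the sketch below assumes 2 bytes to stay conservative:

```python
# Rough Prometheus disk sizing: retention * ingestion rate * bytes per sample.
# ~1-2 bytes/sample after compression is the commonly cited rule of thumb;
# we assume 2 bytes/sample to err on the safe side.
def estimate_storage_gb(retention_days: int, samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample / 1024**3

# e.g. 15 days of retention at 10,000 samples/s comes out around 24 GB
print(round(estimate_storage_gb(15, 10_000), 1))
```

If the estimate comes out well above your `storage` request, either shorten retention, reduce scrape frequency, or request a bigger volume.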
Step 4 – Install the kube-prometheus-stack chart
```
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values.yaml
```

Wait a minute or two for all pods to come up, then verify:
```
kubectl get pods -n monitoring
```
You should see pods for Prometheus, Grafana, Alertmanager, and several exporters including `node-exporter` and `kube-state-metrics`. All pods should be in `Running` status.
```
NAME                                                     READY   STATUS    RESTARTS
alertmanager-kube-prometheus-stack-alertmanager-0        2/2     Running   0
kube-prometheus-stack-grafana-7d9f8b6c4-xk9p2            3/3     Running   0
kube-prometheus-stack-kube-state-metrics-84d7f9d-lmn3    1/1     Running   0
kube-prometheus-stack-operator-6b8c9f7d5-qrs7t           1/1     Running   0
kube-prometheus-stack-prometheus-node-exporter-4vkpj     1/1     Running   0
prometheus-kube-prometheus-stack-prometheus-0            2/2     Running   0
```

Step 5 – Access Your Kubernetes Monitoring Dashboard
With your Kubernetes monitoring stack running, the quickest way to access Grafana without exposing it publicly is port-forwarding.
```
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
```

Now open http://localhost:3000 in your browser. Log in with:
- Username: `admin`
- Password: the password you set in `values.yaml`
You’ll be greeted with Grafana’s home screen. The chart already pre-loaded several Kubernetes dashboards – navigate to Dashboards → Browse to see them. The most useful ones out of the box are:
- Kubernetes / Compute Resources / Cluster – cluster-wide CPU and memory usage.
- Kubernetes / Compute Resources / Namespace (Pods) – per-namespace resource consumption.
- Node Exporter / Full – detailed node-level metrics including disk I/O, network, and memory pressure.
- Kubernetes / Persistent Volumes – PVC usage and capacity.
Step 6 – Expose Grafana with an Ingress (production setup)
Port-forwarding works for local access but isn’t suitable for production. To expose Grafana properly, configure an Ingress. This example uses NGINX Ingress Controller with TLS:
```
# grafana-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - grafana.yourdomain.com
      secretName: grafana-tls
  rules:
    - host: grafana.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kube-prometheus-stack-grafana
                port:
                  number: 80
```

Apply it:
```
kubectl apply -f grafana-ingress.yaml
```

Make sure you have cert-manager installed for automatic TLS certificate provisioning. If you don’t, you can find the official installation guide in the cert-manager documentation.
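The `letsencrypt-prod` issuer referenced in the annotation also has to exist. If you haven’t created one yet, a cert-manager ClusterIssuer for it looks roughly like this – the email is a placeholder, and this assumes the NGINX ingress class for the HTTP-01 challenge:

```yaml
# letsencrypt-issuer.yaml (email is a placeholder - use a real address)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@yourdomain.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```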
Step 7 – Configure alerting rules
A robust Kubernetes monitoring setup needs proper alerting. The chart already ships a solid set of default alerting rules; to cover your own workloads, add custom rules via a PrometheusRule resource:
```
# custom-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: custom.rules
      interval: 30s
      rules:
        - alert: HighPodRestartRate
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted more than 5 times in the last hour."
        - alert: PersistentVolumeFillingUp
          expr: |
            kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is almost full"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has less than 15% space remaining."
        - alert: HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
              / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, namespace) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on pod {{ $labels.pod }}"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is using more than 85% of its CPU limit."
```

Apply the rules:
```
kubectl apply -f custom-alerts.yaml
```

To verify Prometheus picked them up, go to the Prometheus UI (port-forward on port 9090) → Status → Rules. You should see your custom rules listed and active.
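The PromQL expressions above are just arithmetic over time series. As a quick sanity check of the `PersistentVolumeFillingUp` threshold, here is the same condition in plain Python – purely illustrative, not how Prometheus actually evaluates rules:

```python
# The PVC alert fires when available/capacity drops below the 15% threshold.
def pvc_filling_up(available_bytes: int, capacity_bytes: int,
                   threshold: float = 0.15) -> bool:
    return available_bytes / capacity_bytes < threshold

GIB = 1024**3
print(pvc_filling_up(1 * GIB, 10 * GIB))   # True: only 10% free, alert fires
print(pvc_filling_up(3 * GIB, 10 * GIB))   # False: 30% free, no alert
```

The `for: 10m` clause means the condition must hold continuously for ten minutes before the alert actually fires, which filters out short-lived spikes.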
Step 8 – Set up Alertmanager with PagerDuty routing
Alertmanager receives alerts from Prometheus and routes them to the right channels: each alert is delivered to the first route whose matchers fit, falling back to the default receiver. The configuration below sends critical alerts to PagerDuty and warnings to Slack.
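That first-match behaviour is worth internalizing before reading the YAML. Here is a simplified Python sketch of the routing semantics – illustrative only, not Alertmanager's actual implementation (real routing is a tree with grouping, `continue` flags, and timers):

```python
# Simplified first-match routing, mirroring the severity-based routes
# configured below. Each entry is (matchers, receiver).
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]

def route(alert_labels: dict, default: str = "default") -> str:
    for matchers, receiver in ROUTES:
        if all(alert_labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return default

print(route({"severity": "critical", "namespace": "prod"}))  # pagerduty-critical
print(route({"severity": "info"}))                           # default
```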
```
# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-routing
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'namespace']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'default'
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: pagerduty-critical
      - matchers:
          - name: severity
            value: warning
        receiver: slack-warnings
  receivers:
    - name: 'default'
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: url
          channel: '#alerts'
          text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'pagerduty-critical'
      pagerdutyConfigs:
        - serviceKey:
            name: pagerduty-secret
            key: serviceKey
          description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
    - name: 'slack-warnings'
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: url
          channel: '#infra-warnings'
          text: '⚠️ {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
```

Create the secrets for your PagerDuty and Slack credentials:
```
# PagerDuty
kubectl create secret generic pagerduty-secret \
  --from-literal=serviceKey=YOUR_PAGERDUTY_SERVICE_KEY \
  -n monitoring

# Slack
kubectl create secret generic slack-webhook-secret \
  --from-literal=url=YOUR_SLACK_WEBHOOK_URL \
  -n monitoring
```

Then apply the AlertmanagerConfig:
```
kubectl apply -f alertmanager-config.yaml
```

Step 9 – Verify your stack end to end

To confirm everything is working correctly, run through this checklist:
```
# Check all monitoring pods are healthy
kubectl get pods -n monitoring

# Check Prometheus targets are being scraped
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring
# Open http://localhost:9090/targets - all targets should show "UP"

# Check Alertmanager is receiving alerts
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
# Open http://localhost:9093 - you should see active alerts and silences

# Check Grafana dashboards are loading
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# Open http://localhost:3000 - navigate to Dashboards and verify data is flowing
```

If any targets show as DOWN in Prometheus, check the pod logs for the corresponding exporter:
```
kubectl logs -n monitoring -l app.kubernetes.io/name=kube-state-metrics
```

Common issues and how to fix them
**Prometheus targets showing as DOWN** – This is usually a network policy or RBAC issue. Check that the Prometheus service account has the correct ClusterRole bindings and that no NetworkPolicy is blocking scrape traffic from Prometheus to the exporters' metrics ports.

**Grafana dashboards showing “No data”** – Verify the Prometheus data source is correctly configured in Grafana (Configuration → Data Sources). The URL should be `http://kube-prometheus-stack-prometheus:9090` within the cluster.

**Alertmanager not sending notifications** – Check the Alertmanager logs for authentication errors: `kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0`. Most issues come down to wrong API keys or incorrect secret names.

**PVCs stuck in Pending state** – If your cluster doesn’t have a default StorageClass, PVCs won’t provision automatically. Check with `kubectl get storageclass` and either create one or set `persistence.enabled: false` in your values file for a non-persistent setup.
What to monitor next
Once your base stack is running, these are the next metrics worth adding:
Application-level metrics – if your services expose a /metrics endpoint using the Prometheus client library, add a ServiceMonitor resource to scrape them. This is where monitoring becomes genuinely powerful – you can track business metrics like request rates, error rates, and latency alongside infrastructure metrics.
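A minimal ServiceMonitor for such a service might look like the following. The app name, namespace, label selector, and port name are all placeholders for your own service; the `release: kube-prometheus-stack` label is what lets the operator discover the resource with the chart's default selectors:

```yaml
# my-app-servicemonitor.yaml (names and selectors are placeholders)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # so the Prometheus Operator picks it up
spec:
  namespaceSelector:
    matchNames:
      - default          # namespace where your app's Service lives
  selector:
    matchLabels:
      app: my-app        # must match your Service's labels
  endpoints:
    - port: http-metrics # named port on the Service exposing /metrics
      path: /metrics
      interval: 30s
```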
Kubernetes events – the kubernetes-event-exporter tool forwards cluster events to sinks such as Elasticsearch, Slack, or a webhook, so you can surface and alert on things like failed image pulls, OOM kills, and node pressure warnings.
Long-term storage – Prometheus is not designed for long-term metric storage. If you need data retention beyond 30 days, consider adding Thanos or Cortex as a remote write backend.
For the official Prometheus documentation and configuration reference, visit the Prometheus documentation.
Conclusion – Your Kubernetes Monitoring Stack is Ready
You now have a production-ready Kubernetes monitoring stack that gives you real-time visibility into your cluster, configurable alerting rules, and PagerDuty integration for on-call routing.
The most important thing is not the tools – it’s making sure alerts are actionable. Every alert that fires should have a runbook. Every runbook should have a clear remediation path. Start with the defaults, observe for a week, and tune the noise out aggressively.
If you’re setting this up for a funded startup or a Web3 project and want someone who’s done this before to own the implementation, that’s exactly what we do at The Good Shell. See our DevOps and SRE services or read our case studies to see what the results look like in practice.
