Kubernetes monitoring is critical for any production cluster. If your system goes down and you find out through a user complaint, you don’t have a monitoring problem – you have a visibility problem.
Prometheus and Grafana are the industry standard for Kubernetes observability. Prometheus scrapes and stores metrics from your cluster. Grafana visualises them. Together they give you real-time insight into pod health, node performance, resource usage, and custom application metrics – so you know about problems before your users do.
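Under the hood, Prometheus scrapes a plain-text exposition format over HTTP: one `metric_name{labels} value` line per sample. As a rough illustration of what a scraped line looks like (real applications should use an official Prometheus client library, not hand-rolled formatting):

```python
# Sketch of Prometheus' text exposition format: metric_name{label="value",...} sample
# Illustrative only - use an official Prometheus client library in real services.
def format_metric(name: str, labels: dict, value: float) -> str:
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

print(format_metric("http_requests_total", {"method": "GET", "code": "200"}, 1027))
# http_requests_total{code="200",method="GET"} 1027
```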
In this guide we’ll walk through setting up a complete, production-ready monitoring stack from scratch: deploying Prometheus and Grafana on Kubernetes using Helm, configuring alerting rules, connecting a Grafana dashboard, and setting up PagerDuty notifications for on-call routing.
By the end you’ll have a Kubernetes monitoring stack you can actually rely on in production.
Prerequisites
Before you start, make sure you have the following:
- A running Kubernetes cluster (local with Minikube, or cloud-based on EKS, GKE, or AKS)
- `kubectl` configured and pointing at your cluster
- Helm 3 installed
- Basic familiarity with Kubernetes concepts: pods, namespaces, services, and deployments
Step 1 – Kubernetes Monitoring Namespace Setup
The first step in any Kubernetes monitoring setup is isolating your stack in its own namespace. This makes it easier to manage RBAC, resource quotas, and network policies later.
```
kubectl create namespace monitoring
```

Verify it was created:
```
kubectl get namespaces
```

You should see `monitoring` in the list.
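With the namespace in place, you can optionally cap what the monitoring stack is allowed to consume with a ResourceQuota. The limits below are illustrative assumptions, not recommendations – tune them to your cluster's capacity:

```yaml
# monitoring-quota.yaml (optional; values are illustrative assumptions)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: monitoring-quota
  namespace: monitoring
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
```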
Step 2 – Add the Prometheus Community Helm repository
The kube-prometheus-stack Helm chart is the most complete option for Kubernetes monitoring. It bundles Prometheus, Alertmanager, Grafana, and a set of pre-configured Kubernetes dashboards and alerting rules – everything you need in a single chart.
```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

Step 3 – Create a custom values file
Before installing, create a values.yaml file to customise the deployment. This lets you persist data, configure storage, and set Grafana credentials without modifying the chart directly.
```
# values.yaml
grafana:
  adminPassword: "your-secure-password"  # Change this
  persistence:
    enabled: true
    size: 5Gi
  service:
    type: ClusterIP

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi
```

A few things to note here. Retention is set to 15 days – adjust this based on your storage capacity and compliance requirements. For production, 30 days is a sensible default. Grafana persistence is enabled so dashboards and data sources survive pod restarts. And we’re using ClusterIP for the Grafana service – we’ll expose it securely via port-forwarding or an Ingress later.
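To sanity-check the 20Gi request against your retention window, you can estimate disk usage from your ingestion rate. The Prometheus storage documentation cites roughly 1–2 bytes per sample after compression; the sketch below assumes 2 bytes to stay conservative:

```python
# Rough Prometheus disk sizing: retention * ingestion rate * bytes per sample.
# ~1-2 bytes/sample after compression is the commonly cited rule of thumb;
# we assume 2 bytes/sample to err on the safe side.
def estimate_storage_gb(retention_days: int, samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample / 1024**3

# e.g. 15 days of retention at 10,000 samples/s comes out around 24 GB
print(round(estimate_storage_gb(15, 10_000), 1))
```

If the estimate comes out well above your `storage` request, either shorten retention, reduce scrape frequency, or request a bigger volume.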
Step 4 – Install the kube-prometheus-stack chart
```
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values values.yaml
```

Wait a minute or two for all pods to come up, then verify:
```
kubectl get pods -n monitoring
```
You should see pods for Prometheus, Grafana, Alertmanager, and several exporters including `node-exporter` and `kube-state-metrics`. All pods should be in `Running` status.
```
NAME                                                     READY   STATUS    RESTARTS
alertmanager-kube-prometheus-stack-alertmanager-0        2/2     Running   0
kube-prometheus-stack-grafana-7d9f8b6c4-xk9p2            3/3     Running   0
kube-prometheus-stack-kube-state-metrics-84d7f9d-lmn3    1/1     Running   0
kube-prometheus-stack-operator-6b8c9f7d5-qrs7t           1/1     Running   0
kube-prometheus-stack-prometheus-node-exporter-4vkpj     1/1     Running   0
prometheus-kube-prometheus-stack-prometheus-0            2/2     Running   0
```

Step 5 – Access Your Kubernetes Monitoring Dashboard
With your Kubernetes monitoring stack running, the quickest way to access Grafana without exposing it publicly is port-forwarding.
```
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
```

Now open http://localhost:3000 in your browser. Log in with:
- Username: `admin`
- Password: the password you set in `values.yaml`
You’ll be greeted with Grafana’s home screen. The chart already pre-loaded several Kubernetes dashboards – navigate to Dashboards → Browse to see them. The most useful ones out of the box are:
- Kubernetes / Compute Resources / Cluster – cluster-wide CPU and memory usage.
- Kubernetes / Compute Resources / Namespace (Pods) – per-namespace resource consumption.
- Node Exporter / Full – detailed node-level metrics including disk I/O, network, and memory pressure.
- Kubernetes / Persistent Volumes – PVC usage and capacity.
Step 6 – Expose Grafana with an Ingress (production setup)
Port-forwarding works for local access but isn’t suitable for production. To expose Grafana properly, configure an Ingress. This example uses NGINX Ingress Controller with TLS:
```
# grafana-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - grafana.yourdomain.com
      secretName: grafana-tls
  rules:
    - host: grafana.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kube-prometheus-stack-grafana
                port:
                  number: 80
```

Apply it:
```
kubectl apply -f grafana-ingress.yaml
```

Make sure you have cert-manager installed for automatic TLS certificate provisioning. If you don’t, you can find the official installation guide in the cert-manager documentation.
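The `letsencrypt-prod` issuer referenced in the annotation also has to exist. If you haven’t created one yet, a cert-manager ClusterIssuer for it looks roughly like this – the email is a placeholder, and this assumes the NGINX ingress class for the HTTP-01 challenge:

```yaml
# letsencrypt-issuer.yaml (email is a placeholder - use a real address)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@yourdomain.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```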
Step 7 – Configure alerting rules
A robust Kubernetes monitoring setup needs proper alerting. The chart already ships a solid set of default alerting rules; to cover your own workloads, add custom rules via a PrometheusRule resource:
```
# custom-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: custom.rules
      interval: 30s
      rules:
        - alert: HighPodRestartRate
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted more than 5 times in the last hour."
        - alert: PersistentVolumeFillingUp
          expr: |
            kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.15
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is almost full"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has less than 15% space remaining."
        - alert: HighCPUUsage
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
              / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, namespace) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on pod {{ $labels.pod }}"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is using more than 85% of its CPU limit."
```

Apply the rules:
```
kubectl apply -f custom-alerts.yaml
```

To verify Prometheus picked them up, go to the Prometheus UI (port-forward on port 9090) → Status → Rules. You should see your custom rules listed and active.
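The PromQL expressions above are just arithmetic over time series. As a quick sanity check of the `PersistentVolumeFillingUp` threshold, here is the same condition in plain Python – purely illustrative, not how Prometheus actually evaluates rules:

```python
# The PVC alert fires when available/capacity drops below the 15% threshold.
def pvc_filling_up(available_bytes: int, capacity_bytes: int,
                   threshold: float = 0.15) -> bool:
    return available_bytes / capacity_bytes < threshold

GIB = 1024**3
print(pvc_filling_up(1 * GIB, 10 * GIB))   # True: only 10% free, alert fires
print(pvc_filling_up(3 * GIB, 10 * GIB))   # False: 30% free, no alert
```

The `for: 10m` clause means the condition must hold continuously for ten minutes before the alert actually fires, which filters out short-lived spikes.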
Step 8 – Set up Alertmanager with PagerDuty routing
Alertmanager receives alerts from Prometheus and routes them to the right channels: each alert is delivered to the first route whose matchers fit, falling back to the default receiver. The configuration below sends critical alerts to PagerDuty and warnings to Slack.
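That first-match behaviour is worth internalizing before reading the YAML. Here is a simplified Python sketch of the routing semantics – illustrative only, not Alertmanager's actual implementation (real routing is a tree with grouping, `continue` flags, and timers):

```python
# Simplified first-match routing, mirroring the severity-based routes
# configured below. Each entry is (matchers, receiver).
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]

def route(alert_labels: dict, default: str = "default") -> str:
    for matchers, receiver in ROUTES:
        if all(alert_labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return default

print(route({"severity": "critical", "namespace": "prod"}))  # pagerduty-critical
print(route({"severity": "info"}))                           # default
```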
```
# alertmanager-config.yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alert-routing
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'namespace']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'default'
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: pagerduty-critical
      - matchers:
          - name: severity
            value: warning
        receiver: slack-warnings
  receivers:
    - name: 'default'
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: url
          channel: '#alerts'
          text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'pagerduty-critical'
      pagerdutyConfigs:
        - serviceKey:
            name: pagerduty-secret
            key: serviceKey
          description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
    - name: 'slack-warnings'
      slackConfigs:
        - apiURL:
            name: slack-webhook-secret
            key: url
          channel: '#infra-warnings'
          text: '⚠️ {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
```

Create the secrets for your PagerDuty and Slack credentials:
```
# PagerDuty
kubectl create secret generic pagerduty-secret \
  --from-literal=serviceKey=YOUR_PAGERDUTY_SERVICE_KEY \
  -n monitoring

# Slack
kubectl create secret generic slack-webhook-secret \
  --from-literal=url=YOUR_SLACK_WEBHOOK_URL \
  -n monitoring
```

Then apply the AlertmanagerConfig:
```
kubectl apply -f alertmanager-config.yaml
```

Step 9 – Verify your stack end to end

To confirm everything is working correctly, run through this checklist:
```
# Check all monitoring pods are healthy
kubectl get pods -n monitoring

# Check Prometheus targets are being scraped
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring
# Open http://localhost:9090/targets - all targets should show "UP"

# Check Alertmanager is receiving alerts
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
# Open http://localhost:9093 - you should see active alerts and silences

# Check Grafana dashboards are loading
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
# Open http://localhost:3000 - navigate to Dashboards and verify data is flowing
```

If any targets show as DOWN in Prometheus, check the pod logs for the corresponding exporter:
```
kubectl logs -n monitoring -l app.kubernetes.io/name=kube-state-metrics
```

Common issues and how to fix them
**Prometheus targets showing as DOWN** – This is usually a network policy or RBAC issue. Check that the Prometheus service account has the correct ClusterRole bindings and that no NetworkPolicy is blocking scrape traffic from Prometheus to the exporters' metrics ports.

**Grafana dashboards showing “No data”** – Verify the Prometheus data source is correctly configured in Grafana (Configuration → Data Sources). The URL should be `http://kube-prometheus-stack-prometheus:9090` within the cluster.

**Alertmanager not sending notifications** – Check the Alertmanager logs for authentication errors: `kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0`. Most issues come down to wrong API keys or incorrect secret names.

**PVCs stuck in Pending state** – If your cluster doesn’t have a default StorageClass, PVCs won’t provision automatically. Check with `kubectl get storageclass` and either create one or set `persistence.enabled: false` in your values file for a non-persistent setup.
What to monitor next
Once your base stack is running, these are the next metrics worth adding:
Application-level metrics – if your services expose a /metrics endpoint using the Prometheus client library, add a ServiceMonitor resource to scrape them. This is where monitoring becomes genuinely powerful – you can track business metrics like request rates, error rates, and latency alongside infrastructure metrics.
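A minimal ServiceMonitor for such a service might look like the following. The app name, namespace, label selector, and port name are all placeholders for your own service; the `release: kube-prometheus-stack` label is what lets the operator discover the resource with the chart's default selectors:

```yaml
# my-app-servicemonitor.yaml (names and selectors are placeholders)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # so the Prometheus Operator picks it up
spec:
  namespaceSelector:
    matchNames:
      - default          # namespace where your app's Service lives
  selector:
    matchLabels:
      app: my-app        # must match your Service's labels
  endpoints:
    - port: http-metrics # named port on the Service exposing /metrics
      path: /metrics
      interval: 30s
```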
Kubernetes events – the kubernetes-event-exporter tool forwards cluster events to sinks such as Elasticsearch, Slack, or a webhook, so you can surface and alert on things like failed image pulls, OOM kills, and node pressure warnings.
Long-term storage – Prometheus is not designed for long-term metric storage. If you need data retention beyond 30 days, consider adding Thanos or Cortex as a remote write backend.
For the official Prometheus documentation and configuration reference, visit the Prometheus documentation.
Conclusion – Your Kubernetes Monitoring Stack is Ready
You now have a production-ready Kubernetes monitoring stack that gives you real-time visibility into your cluster, configurable alerting rules, and PagerDuty integration for on-call routing.
The most important thing is not the tools – it’s making sure alerts are actionable. Every alert that fires should have a runbook. Every runbook should have a clear remediation path. Start with the defaults, observe for a week, and tune the noise out aggressively.
If you’re setting this up for a funded startup or a Web3 project and want someone who’s done this before to own the implementation, that’s exactly what we do at The Good Shell. See our DevOps and SRE services or read our case studies to see what the results look like in practice.
