Kubernetes for GPU Workloads: 7 Proven Patterns 2026

Most Kubernetes GPU tutorials stop at kubectl apply and declare victory. Then production arrives: GPU nodes sit idle because the scheduler ignores NVLink topology, autoscalers trigger on CPU metrics while the inference queue backs up, and model pods restart from scratch because nobody implemented checkpointing for Spot interruptions. The gap between a working demo and a production GPU cluster is wide, and every hour spent in it is billed at GPU rates.

Quick answer: Production Kubernetes for GPU workloads requires seven layers that generic Kubernetes guides skip: GPU Operator for declarative driver lifecycle, MIG or time-slicing for utilization, topology-aware scheduling for multi-GPU training, Spot with checkpointing for cost, DCGM for real observability, namespace isolation for multi-tenancy, and vLLM with KEDA for inference autoscaling. This guide covers all seven with working configuration.

What changed in May 2026: NVIDIA published Dynamo Snapshot, a checkpoint/restore approach that reduces vLLM cold-start latency to near zero on single-GPU inference workloads. Azure retired Low-Priority VMs in March 2026; if you are on AKS with Low-Priority node pools, migrate to Spot VMs now. The CNCF accepted KServe as an Incubating project in November 2025. GPU node provisioning on EKS via Karpenter now takes approximately 60 seconds, meaningfully faster than the 5-15 minute AKS Cluster Autoscaler path, which matters for LLM workloads where cold start is already measured in minutes.

Why Kubernetes for GPU Workloads Requires Different Patterns

Standard Kubernetes scheduling, resource management, and observability were designed for CPU workloads. GPU workloads break the assumptions at every layer.

Topology matters for performance. A training job whose 8 GPUs are split across two nodes can run 30-40% slower than the same job with all 8 GPUs in a single NVLink domain, because GPU-to-GPU traffic crosses a PCIe or network boundary instead of staying on NVLink. The default Kubernetes scheduler has no awareness of NVLink topology. It will happily spread your multi-GPU training job across nodes in a way that destroys throughput.

GPU utilization is invisible to standard metrics. Kubernetes reports GPU allocated (0 or 1 for each GPU slot). It does not report GPU SM utilization, memory bandwidth, KV cache fill rate, or any of the metrics that tell you whether a GPU is actually doing useful work. A cluster reporting 100% GPU allocation can be running at 15% actual utilization with 85% of GPU capacity wasted on idle processes.
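
The gap between allocation and utilization is simple arithmetic. The sketch below is illustrative only: the function name and the sample numbers are assumptions, with the utilization figures standing in for per-GPU SM utilization read from a monitoring system.

```python
def wasted_gpu_fraction(allocated_gpus: int, sm_utilization_pct: list) -> float:
    """Fraction of allocated GPU capacity that is doing no work (0.0 to 1.0)."""
    if allocated_gpus <= 0:
        raise ValueError("allocated_gpus must be positive")
    if len(sm_utilization_pct) != allocated_gpus:
        raise ValueError("need one utilization figure per allocated GPU")
    mean_util = sum(sm_utilization_pct) / allocated_gpus
    return 1.0 - mean_util / 100.0


# Eight GPUs, all allocated, each averaging 15% SM utilization
waste = wasted_gpu_fraction(8, [15.0] * 8)
print(f"Allocation: 100%, wasted capacity: {waste:.0%}")
```

Kubernetes only sees the first number (allocation); the second requires GPU-level metrics.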

Failure modes are expensive. A CPU pod that gets evicted from a Spot node restarts in seconds. A GPU training job that gets evicted from a Spot node loses hours of training progress unless checkpointing was implemented. GPU time costs money at rates that make the Spot savings irrelevant if interruptions are not handled correctly.

The Kubernetes cost allocation patterns covered in our FinOps guide apply here, but GPU cost visibility requires additional instrumentation beyond standard namespace and label allocation.

Pattern 1: GPU Operator for Declarative Driver Lifecycle Management

The problem with manual driver installation: Most tutorials install NVIDIA drivers directly on the node, configure the container runtime manually, install the device plugin separately, and set up monitoring as an afterthought. This approach works for one node. It breaks at scale: driver upgrades require coordinated node maintenance, different GPU generations need different driver versions, and monitoring configuration drifts across nodes over time.

The GPU Operator solution: NVIDIA GPU Operator manages the entire driver stack declaratively: drivers, container runtime, device plugin, DCGM Exporter, and the feature discovery component that labels nodes by GPU type. A single Helm install handles everything, and upgrades are rolling.

# Install GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgm.enabled=true \
  --set dcgmExporter.enabled=true \
  --set nodeStatusExporter.enabled=true

# Verify GPU nodes are ready
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node [gpu-node] | grep "nvidia.com/"

Multi-generation GPU clusters: If your cluster has A100 and H100 nodes, GPU Operator handles both. GPU Feature Discovery (deployed by the Operator alongside Node Feature Discovery) labels each node with its GPU model, so you can use nodeSelector or affinity to schedule specific workloads to specific GPU generations:

# Schedule on H100 nodes only
nodeSelector:
  nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"

# Or prefer H100, fall back to A100
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values: ["NVIDIA-H100-80GB-HBM3"]

Driver upgrade without downtime: With GPU Operator, driver upgrades use a rolling strategy that cordons the node, waits for GPU workloads to complete or migrate, upgrades the driver, and uncordons – without manual intervention per node.

Pattern 2: MIG and GPU Time-Slicing for Inference Cost Optimization

The waste problem: A single inference request to a 7B model uses approximately 15-20GB of GPU memory on an 80GB H100. Running one request at a time on a full H100 leaves 75-80% of the GPU’s memory unused. Most AI teams deploying Kubernetes for GPU workloads at early scale make this mistake: they over-provision GPU resources per workload because the default model is one GPU per pod.

MIG (Multi-Instance GPU): Available on A100 and H100, MIG creates hardware-isolated GPU partitions with guaranteed memory and compute. An H100 can be divided into up to 7 MIG instances, each with dedicated memory (10GB for 7 instances), dedicated SM slices, and hardware isolation, meaning one workload cannot affect another’s memory bandwidth.

# Enable MIG on an H100 node
nvidia-smi -i 0 -mig 1

# Create MIG instances (7 equal 1g.10gb instances on H100; profile ID 19)
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# Verify
nvidia-smi -L
# GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
#   MIG 1g.10gb      Device  0: (UUID: MIG-...)
#   MIG 1g.10gb      Device  1: (UUID: MIG-...)
# ...

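GPU Operator automates MIG configuration across nodes through its MIG Manager: you define layouts in a ConfigMap and select one per node with the nvidia.com/mig.config label, so there is no manual nvidia-smi per node.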

Time-slicing: Available on all NVIDIA GPUs, time-slicing multiplexes multiple pods onto a single GPU through temporal sharing. There is no memory isolation: pods share GPU memory and can interfere with each other. Use it for development and testing environments, not for production inference with memory-sensitive workloads.

# ConfigMap for time-slicing (4 virtual GPUs per physical GPU)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4

Decision framework:

Scenario	Use
Production inference, memory isolation required	MIG (A100/H100 only)
Development, testing, cost reduction	Time-slicing
Multi-GPU training, tensor parallelism	Full GPU, no partitioning
Compliance requirement for hardware isolation	MIG only
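
The decision table can be encoded as a small function, for example to validate workload manifests in CI. This is a sketch: the Workload fields, the return strings, the precedence between rows, and the full-GPU fallback for cases the table does not cover are all assumptions, not part of any NVIDIA or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    kind: str                              # "inference", "training", or "dev"
    needs_memory_isolation: bool = False
    compliance_hw_isolation: bool = False
    multi_gpu: bool = False


def choose_gpu_sharing(w: Workload) -> str:
    """Map a workload to "mig", "time-slicing", or "full-gpu"."""
    # Assumed precedence: compliance requirements win over everything else
    if w.compliance_hw_isolation:
        return "mig"
    if w.kind == "training" and w.multi_gpu:
        return "full-gpu"
    if w.kind == "inference" and w.needs_memory_isolation:
        return "mig"
    if w.kind == "dev":
        return "time-slicing"
    # Assumed fallback for cases the table does not list
    return "full-gpu"


print(choose_gpu_sharing(Workload(kind="inference", needs_memory_isolation=True)))
```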

Pattern 3: Topology-Aware Scheduling with Kueue

The throughput penalty: Multi-GPU training jobs that use tensor parallelism or pipeline parallelism exchange data between GPUs constantly. On the same node, NVLink provides 600 GB/s per GPU on A100 (NVLink 3.0) and 900 GB/s on H100 (NVLink 4.0). Across nodes, traffic goes over the network at 200-400 Gb/s (gigabits, so 25-50 GB/s), with latency that is orders of magnitude higher. Ignoring topology produces 30-40% throughput degradation on multi-node training compared to topology-optimal scheduling.
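
The units are easy to misread: NVLink is quoted in gigabytes per second and network links in gigabits per second. A rough sketch of the raw transfer-time gap follows, using the 600 GB/s and 400 Gb/s figures from the text. The 14 GB payload is a hypothetical example, and the model ignores latency, protocol overhead, and compute/communication overlap.

```python
def transfer_seconds(payload_gb: float, bandwidth: float, unit: str) -> float:
    """Time to move payload_gb gigabytes over a link; unit is "GB/s" or "Gb/s"."""
    if unit == "GB/s":
        gbytes_per_s = bandwidth
    elif unit == "Gb/s":
        gbytes_per_s = bandwidth / 8.0   # 8 bits per byte
    else:
        raise ValueError(f"unknown unit: {unit}")
    return payload_gb / gbytes_per_s


payload = 14.0  # hypothetical: GB exchanged in one synchronization
nvlink = transfer_seconds(payload, 600, "GB/s")
network = transfer_seconds(payload, 400, "Gb/s")
print(f"NVLink: {nvlink * 1000:.1f} ms, network: {network * 1000:.1f} ms")
print(f"Raw bandwidth gap: {network / nvlink:.0f}x")
```

Even the best-case 400 Gb/s link is only 50 GB/s, a 12x gap in raw bandwidth before latency is considered.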

Kueue for Kubernetes-native batch scheduling:

Kueue is a Kubernetes-native job queueing system that manages resource borrowing, quotas, and fair scheduling for batch workloads including GPU training jobs. It is a Kubernetes SIG project that you install into the cluster separately (it does not ship with Kubernetes itself), and it is a common choice for multi-tenant GPU clusters.

# ClusterQueue: defines the GPU resource pool
# Each flavor name below must match an existing ResourceFlavor object.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-training-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  # Every flavor must set a quota for every covered resource, so only
  # list cpu and memory here if you also add quotas for them below.
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: h100-nvlink
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8      # 8 H100 GPUs in this flavor
    - name: a100-nvlink
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16     # 16 A100 GPUs
  preemption:
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: LowerPriority

# LocalQueue: per-namespace queue mapping to ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: training-queue
  namespace: ml-workloads
spec:
  clusterQueue: gpu-training-queue

# Job using Kueue with topology awareness
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-training
  namespace: ml-workloads
  labels:
    kueue.x-k8s.io/queue-name: training-queue
spec:
  parallelism: 8
  completions: 8
  template:
    spec:
      restartPolicy: Never    # Jobs require Never or OnFailure
      affinity:
        podAffinity:
          # Prefer pods on same node (NVLink domain)
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  job-name: llm-training
      containers:
      - name: trainer
        image: your-registry/trainer:tag    # Placeholder, replace with your training image
        resources:
          limits:
            nvidia.com/gpu: "1"
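
To make the quota mechanics concrete, here is a deliberately simplified model of how a queued job is matched to a flavor: flavors are tried in order, and the job is admitted only if its whole GPU request fits in one flavor's unused quota. This is a sketch of the idea, not Kueue's implementation; it ignores cohorts, borrowing, and preemption, and the dictionary layout is an assumption.

```python
from typing import Optional


def assign_flavor(requested_gpus: int, flavors: list) -> Optional[str]:
    """Return the first flavor whose unused nominal quota fits the whole job."""
    for flavor in flavors:
        free = flavor["nominal_quota"] - flavor["in_use"]
        if requested_gpus <= free:
            return flavor["name"]
    return None  # No flavor fits: the job stays queued


flavors = [
    {"name": "h100-nvlink", "nominal_quota": 8, "in_use": 4},
    {"name": "a100-nvlink", "nominal_quota": 16, "in_use": 0},
]

# An 8-GPU job does not fit the 4 free H100s, so it falls through to A100
print(assign_flavor(8, flavors))
```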

Volcano is the alternative for HPC-style gang scheduling – all pods of a job are scheduled simultaneously or none are. This prevents the deadlock where half a training job’s GPU pods are running and consuming resources while the other half wait for GPU availability on another node. Use Volcano when your training framework requires all workers to start simultaneously.

Pattern 4: Spot GPU Nodes with Checkpointing and Graceful Preemption

The cost case: Spot GPU instances on AWS cost 70-80% less than On-Demand. A p4d.24xlarge (8 x A100) costs $32.77/hour On-Demand and approximately $9-11/hour on Spot. For training workloads measured in days, the difference is thousands of dollars. The barrier is interruption handling: without checkpointing, a Spot interruption after 10 hours of training loses 10 hours of work.
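
A back-of-the-envelope model shows how interruptions eat into the saving. The On-Demand rate comes from the text; the $10 Spot rate is the midpoint of the quoted range, and the job length, interruption count, and lost-work figures are hypothetical inputs you would replace with your own.

```python
def training_cost(compute_hours: float, hourly_rate: float,
                  interruptions: int = 0,
                  lost_hours_per_interruption: float = 0.0) -> float:
    """Total cost, including re-running work lost since the last checkpoint."""
    total_hours = compute_hours + interruptions * lost_hours_per_interruption
    return total_hours * hourly_rate


ON_DEMAND_RATE = 32.77   # p4d.24xlarge On-Demand, from the text
SPOT_RATE = 10.0         # assumed midpoint of the $9-11 range
JOB_HOURS = 72           # hypothetical three-day training job

on_demand = training_cost(JOB_HOURS, ON_DEMAND_RATE)
# Three interruptions, each losing 30 minutes since the last checkpoint
spot = training_cost(JOB_HOURS, SPOT_RATE, interruptions=3,
                     lost_hours_per_interruption=0.5)
print(f"On-Demand: ${on_demand:,.2f}  Spot with checkpoints: ${spot:,.2f}")
```

Without checkpointing, lost_hours_per_interruption is everything computed so far, which is how a cheap hourly rate turns into an expensive job.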

Node pool architecture:

# EKS managed node group - mixed On-Demand (inference) + Spot (training)
managedNodeGroups:
  # On-Demand for inference - latency-sensitive, cannot be interrupted
  - name: inference-on-demand
    instanceType: g5.xlarge    # A10G, inference workloads
    spot: false
    minSize: 0
    maxSize: 10
    labels:
      workload-type: inference
    taints:
      - key: workload-type
        value: inference
        effect: NoSchedule

  # Spot for training - interruptible, checkpointing required
  - name: training-spot
    instanceTypes:
      - p4d.24xlarge    # 8x A100
      - p3.16xlarge     # 8x V100 fallback
    spot: true
    minSize: 0
    maxSize: 4
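    labels:
      workload-type: training
    taints:
      - key: workload-type
        value: training
        effect: NoSchedule    # Matched by the toleration on the training Job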

PyTorch training with checkpoint-on-interrupt:

# training/checkpoint_handler.py
import os
import signal
import sys

import torch


class CheckpointHandler:
    def __init__(self, model, optimizer, checkpoint_path: str):
        self.model = model
        self.optimizer = optimizer
        self.checkpoint_path = checkpoint_path
        self.epoch = 0  # The training loop must update this after each epoch
        os.makedirs(checkpoint_path, exist_ok=True)
        # Register SIGTERM handler (the kubelet sends this when the pod is evicted)
        signal.signal(signal.SIGTERM, self.save_checkpoint)

    def save_checkpoint(self, signum=None, frame=None):
        """Save a checkpoint; also runs as the SIGTERM handler on Spot interruption."""
        checkpoint = {
            'epoch': self.epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
        }
        # Save to PVC or S3 (must be network-attached, not local disk).
        # Zero-padding keeps filename sort order equal to epoch order.
        final_path = os.path.join(
            self.checkpoint_path, f"checkpoint_epoch_{self.epoch:06d}.pt")
        tmp_path = final_path + ".tmp"
        torch.save(checkpoint, tmp_path)
        os.replace(tmp_path, final_path)  # Rename so a partial file is never loaded
        print(f"Checkpoint saved at epoch {self.epoch}")
        if signum is not None:
            # Exit non-zero so the interrupted pod is not counted as a completion
            sys.exit(128 + signum)

    def load_latest_checkpoint(self):
        """Resume from latest checkpoint if available."""
        checkpoints = sorted(
            f for f in os.listdir(self.checkpoint_path)
            if f.startswith('checkpoint_') and f.endswith('.pt')
        )
        if checkpoints:
            latest = os.path.join(self.checkpoint_path, checkpoints[-1])
            checkpoint = torch.load(latest, map_location='cpu')
            self.model.load_state_dict(checkpoint['model_state_dict'])
            self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            self.epoch = checkpoint['epoch']
            return self.epoch
        return 0
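
One design variation worth considering: saving directly inside a signal handler can capture the model mid-update. A common alternative is to have the handler only set a flag and let the training loop checkpoint at the next step boundary. The sketch below uses only the standard library; StopFlag and run_training are hypothetical names, and train_step and save stand in for your own training step and checkpoint function.

```python
import signal


class StopFlag:
    """Records that SIGTERM arrived so the loop can stop at a safe point."""

    def __init__(self):
        self.requested = False

    def install(self):
        # signal.signal must be called from the main thread
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.requested = True  # No I/O here; just record the request


def run_training(num_steps, stop, train_step, save):
    """Run steps until done or until a stop is requested; return steps completed."""
    for step in range(num_steps):
        train_step(step)
        if stop.requested:
            save(step + 1)   # State is consistent: the step has finished
            return step + 1
    return num_steps
```

In a real job you would call stop.install() once at startup. The step duration then has to fit comfortably inside the pod's termination grace period, since the save only starts after the current step completes.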

# Training Job with Spot tolerations and checkpoint volume
spec:
  template:
    spec:
      tolerations:
      - key: "workload-type"
        value: "training"
        effect: "NoSchedule"
      terminationGracePeriodSeconds: 120  # Time to save checkpoint
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: training-checkpoints  # Network-attached, survives node loss
      containers:
      - name: trainer
        env:
        - name: CHECKPOINT_PATH
          value: /checkpoints
        volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints

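The AWS Node Termination Handler watches for the 2-minute Spot interruption notice and cordons and drains the node. The resulting evictions deliver SIGTERM to each pod, giving the checkpoint handler time to save state before the node is reclaimed. Because the notice itself is only 2 minutes, a checkpoint must finish well inside that window regardless of terminationGracePeriodSeconds.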

Pattern 5: GPU Observability with DCGM Exporter and OpenTelemetry

The metrics that matter for Kubernetes GPU workloads:

Standard Kubernetes metrics tell you GPU is allocated. They tell you nothing about whether it is doing useful work. The metrics that matter for production GPU clusters:

Metric	What it reveals	Alert threshold
`DCGM_FI_DEV_GPU_UTIL`	SM utilization, actual compute activity	< 30% for 15min = waste
`DCGM_FI_DEV_MEM_COPY_UTIL`	Memory bandwidth utilization	< 20% = memory bandwidth largely idle
`DCGM_FI_DEV_FB_USED`	GPU memory used	> 90% = OOM risk
`DCGM_FI_DEV_POWER_USAGE`	Power draw vs TDP	Correlated with actual compute
`DCGM_FI_PROF_NVLINK_TX_BYTES`	NVLink bandwidth utilization	Low on multi-GPU = topology issue

DCGM Exporter is deployed automatically by GPU Operator. Verify it is scraping:

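# Check DCGM metrics are available
# (Service name as created by GPU Operator; confirm with: kubectl get svc -n gpu-operator)
kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL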
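
For scripted checks, the same endpoint output can be parsed directly. The sketch below is a minimal parser for the Prometheus text format using only the standard library. The SAMPLE text is made up for illustration; a real exporter emits more labels per series, and the parser does not handle timestamps or every escaping rule.

```python
import re

_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metric(text: str, metric: str) -> list:
    """Return (labels, value) pairs for one metric from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match and match.group("name") == metric:
            labels = dict(_LABEL.findall(match.group("labels") or ""))
            samples.append((labels, float(match.group("value"))))
    return samples


# Made-up sample in the exporter's output format
SAMPLE = """# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 12
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 87
"""

low = [labels["gpu"]
       for labels, value in parse_metric(SAMPLE, "DCGM_FI_DEV_GPU_UTIL")
       if value < 30]
print("GPUs under 30% utilization:", low)
```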

Critical Prometheus alerts for GPU workloads:

groups:
- name: gpu-utilization
  rules:
  - alert: GPULowUtilization
    expr: |
      avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) < 30
      and DCGM_FI_DEV_FB_USED > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} underutilized on {{ $labels.instance }}"
      description: "GPU utilization {{ $value }}% for 15 minutes. Review batch size and concurrency settings."

  - alert: GPUMemoryNearCapacity
    expr: |
      DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU memory above 90% on {{ $labels.instance }}"
      description: "OOM kill risk. Check for memory leaks or reduce batch size."

  - alert: GPUNodeIdle
    # Assumes both series carry a matching "node" label (add it via relabeling if needed)
    expr: |
      DCGM_FI_DEV_GPU_UTIL == 0
      and on(node) kube_node_labels{label_workload_type="training"} == 1
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "GPU training node idle for 30 minutes"
      description: "Potential cost waste. Check for failed jobs or scheduling issues."
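
It can help to unit-test alert logic outside Prometheus before deploying rules. The sketch below re-expresses the first two alert conditions as plain functions, assuming utilization samples in percent and framebuffer values in MiB. Memory occupancy is computed as used over (used + free); dividing used by free alone would cross 0.9 at roughly 47% occupancy.

```python
def low_utilization(util_samples_pct: list, fb_used_mib: float,
                    threshold_pct: float = 30.0) -> bool:
    """True when mean SM utilization over the window is below the threshold
    on a GPU that is actually holding memory."""
    if not util_samples_pct:
        return False
    mean_util = sum(util_samples_pct) / len(util_samples_pct)
    return mean_util < threshold_pct and fb_used_mib > 0


def memory_fraction_used(fb_used_mib: float, fb_free_mib: float) -> float:
    """Fraction of framebuffer memory in use (0.0 to 1.0)."""
    total = fb_used_mib + fb_free_mib
    return fb_used_mib / total if total > 0 else 0.0


# Hypothetical 80 GiB GPU with 76 GiB in use
fraction = memory_fraction_used(76 * 1024, 4 * 1024)
print(f"Memory used: {fraction:.0%}, alert: {fraction > 0.90}")
print("Low utilization:", low_utilization([10.0, 20.0, 15.0], fb_used_mib=2048))
```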

OpenTelemetry bridge for request-to-GPU correlation:

Connecting DCGM metrics to application traces allows you to correlate a slow inference request with GPU memory pressure or SM utilization at the time of the request. The same observability pattern from our OpenTelemetry tutorial applies here: GPU utilization becomes a span attribute on the request that caused it. We use inference.gpu_budget_fraction to measure what percentage of the GPU capacity a request consumed.

Pattern 6: Multi-Tenancy with Namespace Isolation and GPU Quotas

The multi-tenant GPU problem: Multiple teams sharing a GPU cluster without isolation create two failure modes. The first is resource starvation: a training job from one team saturates all GPU slots and blocks inference for another. The second is performance interference: memory bandwidth contention between workloads sharing the same GPU degrades latency for all of them.

Namespace isolation with GPU ResourceQuotas:

# Each team gets a dedicated namespace with GPU quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-ml-gpu-quota
  namespace: team-ml
spec:
  hard:
    requests.nvidia.com/gpu: "4"    # Up to 4 GPUs; extended resources only support the requests. prefix
    requests.cpu: "32"
    requests.memory: "256Gi"
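
The effect of this quota is easy to state: a new pod is rejected at creation time if its GPU request would push the namespace total above the hard limit. A minimal sketch of that check follows; the function name and inputs are illustrative, not a Kubernetes API.

```python
def fits_gpu_quota(hard_limit: int, existing_requests: list, new_request: int) -> bool:
    """True if the namespace's total GPU requests stay within the quota."""
    return sum(existing_requests) + new_request <= hard_limit


# Namespace quota of 4 GPUs with pods already requesting 1 and 2
print(fits_gpu_quota(4, [1, 2], 1))   # Fits exactly
print(fits_gpu_quota(4, [1, 2], 2))   # Would exceed the quota
```

Note that the quota counts requests, not usage: an idle pod holding a GPU still consumes its team's allowance.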

Dedicated node pools per tenant (hard isolation):

For teams with compliance requirements or workloads where memory bandwidth isolation matters, use dedicated node pools with taints and tolerations:

# Taint and label dedicated nodes for a specific team
kubectl taint nodes [gpu-node-1] team=ml-team:NoSchedule
kubectl label nodes [gpu-node-1] team=ml-team    # Needed by the nodeSelector below

# Team's workloads tolerate the taint
spec:
  tolerations:
  - key: "team"
    operator: "Equal"
    value: "ml-team"
    effect: "NoSchedule"
  nodeSelector:
    team: ml-team

Preventing noisy neighbor in GPU memory bandwidth:

Time-sliced GPUs share memory bandwidth. For inference workloads where latency SLOs matter, use MIG instead of time-slicing: hardware-isolated memory bandwidth per tenant is not negotiable when p99 latency is in your SLO. See our Kubernetes security best practices guide for the NetworkPolicy and RBAC patterns that complement namespace isolation.

Pattern 7: LLM Serving with vLLM and KEDA Autoscaling

The serving gap: Standard Kubernetes autoscaling (HPA on CPU/memory) is wrong for LLM inference. CPU utilization during inference is low because the work is on the GPU. Memory utilization is high but stable because the model stays loaded in GPU memory. Neither metric reflects the actual scaling signal, which is inference queue depth or KV cache utilization.

vLLM deployment on Kubernetes:

vLLM is a widely adopted inference engine for LLMs on Kubernetes, used in production by LinkedIn, Uber, and others. PagedAttention increases concurrent serving capacity significantly compared to naive inference servers: the difference between serving 30 and 100+ concurrent requests on the same H100 for a 7B model.

# vLLM deployment for an 8B model on H100
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: llm-inference
spec:
  replicas: 1    # KEDA manages scaling
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      tolerations:
      - key: "workload-type"
        value: "inference"
        effect: "NoSchedule"
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model
        - /models/llama-3-8b
        - --max-model-len
        - "8192"
        - --gpu-memory-utilization
        - "0.90"
        - --enable-prefix-caching     # Reduces KV-cache recomputation
        - --enable-chunked-prefill    # Improves throughput on mixed workloads
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 60       # Models take time to load
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llm-models    # Pre-loaded model weights, avoids cold-start download

KEDA autoscaling on inference queue depth:

# ScaledObject: scale vLLM replicas based on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_request_queue_depth
      threshold: "10"              # Scale up when >10 requests queued per replica
      # KEDA's default AverageValue target already divides by replica count,
      # so the query returns the total queue depth.
      query: sum(vllm:num_requests_waiting)
  - type: prometheus
    metricType: Value              # Compare the average directly to the threshold
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_kv_cache_usage
      threshold: "0.8"             # Scale up when KV cache >80% full
      query: avg(vllm:gpu_cache_usage_perc)
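
To reason about what replica count a queue-depth trigger will produce, the core calculation can be sketched as follows. It assumes the trigger reports the total number of waiting requests and that the threshold is a per-replica average target, and it leaves out the tolerance band, stabilization windows, and scaling policies that the real autoscaler applies.

```python
import math


def desired_replicas(total_waiting: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """ceil(total / per-replica target), clamped to the configured bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_waiting / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Bounds and threshold from the ScaledObject above; queue depths are examples
for waiting in (0, 45, 200):
    print(waiting, "waiting ->", desired_replicas(waiting, 10, 1, 8), "replicas")
```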

Handling cold starts: vLLM model loading takes 3-8 minutes for 7B-70B models. NVIDIA Dynamo Snapshot (released May 2026) reduces single-GPU vLLM cold-start to near-zero by checkpointing and restoring the loaded model state. For production workloads where scale-to-zero is desired but cold-start latency matters, this changes the calculus significantly.
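
Cold-start time also determines how the startupProbe must be sized: the container is restarted if it has not passed the probe within failureThreshold x periodSeconds. A quick check against the load times above, with illustrative function names and assuming no initial delay:

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Longest startup the probe tolerates before the container is restarted."""
    return failure_threshold * period_seconds


def probe_covers_load(failure_threshold: int, period_seconds: int,
                      load_minutes: float) -> bool:
    return startup_budget_seconds(failure_threshold, period_seconds) >= load_minutes * 60


# failureThreshold: 60 and periodSeconds: 10 from the Deployment above
print(startup_budget_seconds(60, 10), "seconds of startup budget")
print("Covers an 8-minute load:", probe_covers_load(60, 10, 8))
print("Covers a 12-minute load:", probe_covers_load(60, 10, 12))
```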

Model weights caching: Store model weights on a PVC shared across replicas (this needs a volume that supports ReadOnlyMany or ReadWriteMany access) so that new pods mount pre-loaded weights rather than downloading from Hugging Face on every cold start. A 7B model is ~14GB in FP16; at typical cloud storage bandwidth, that download takes 2-5 minutes without caching.
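
The size and download figures follow from simple arithmetic: parameters times bytes per parameter, then size divided by bandwidth. In the sketch below the 0.5 Gbit/s effective throughput is a hypothetical value chosen for illustration; measure your own.

```python
def weights_size_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Model weight size in GB; FP16 uses 2 bytes per parameter."""
    return params_billion * bytes_per_param


def download_minutes(size_gb: float, gbit_per_s: float) -> float:
    """Minutes to download size_gb gigabytes at gbit_per_s gigabits per second."""
    return size_gb * 8 / gbit_per_s / 60


size = weights_size_gb(7)                      # 7B parameters in FP16
minutes = download_minutes(size, gbit_per_s=0.5)
print(f"{size:.0f} GB, about {minutes:.1f} minutes at 0.5 Gbit/s")
```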

The Kubernetes GPU Workloads Production Checklist

GPU OPERATOR AND DRIVERS
[ ] GPU Operator installed and managing drivers declaratively
[ ] Node Feature Discovery labeling nodes by GPU model
[ ] DCGM Exporter running and scraped by Prometheus
[ ] Driver upgrade tested with rolling strategy (no manual intervention)

GPU PARTITIONING
[ ] MIG configured for inference workloads requiring memory isolation
[ ] Time-slicing configured for dev/test environments
[ ] Partitioning strategy documented and enforced via GPU Operator MIG Manager configuration

SCHEDULING
[ ] Kueue installed for batch training job queueing
[ ] ClusterQueue and LocalQueue configured per team
[ ] Pod affinity rules pushing multi-GPU jobs to same NVLink domain
[ ] Gang scheduling (Volcano) if training framework requires simultaneous pod start

SPOT AND COST
[ ] Mixed On-Demand (inference) + Spot (training) node pools configured
[ ] AWS Node Termination Handler installed
[ ] SIGTERM checkpoint handler in training code
[ ] Checkpoint storage on network-attached PVC (not local disk)
[ ] GPU node scale-down annotation on training nodes

OBSERVABILITY
[ ] DCGM_FI_DEV_GPU_UTIL alert firing at <30% for 15 minutes
[ ] GPU memory alert at >90% utilization
[ ] GPU idle node alert for training nodes
[ ] OTel bridge connecting DCGM metrics to request traces

MULTI-TENANCY
[ ] ResourceQuotas per namespace limiting GPU requests
[ ] Dedicated node pools for teams with isolation requirements
[ ] MIG (not time-slicing) for production inference with latency SLOs

LLM SERVING
[ ] vLLM deployed with PagedAttention, prefix caching, chunked prefill
[ ] KEDA scaling on queue depth and KV cache utilization (not CPU/memory)
[ ] Model weights on shared PVC to avoid cold-start download
[ ] startupProbe with sufficient failureThreshold for model loading time

FAQ: Kubernetes for GPU Workloads

Do I need the GPU Operator or can I install NVIDIA drivers manually?

Manual driver installation works for one or two nodes. At any larger scale, GPU Operator is the correct approach: it manages drivers, container runtime configuration, device plugin, and monitoring declaratively, supports rolling upgrades without manual intervention, and handles multiple GPU generations in the same cluster. The operational overhead of manual driver management at scale exceeds the learning curve of the Operator by a significant margin.

When should I use MIG versus time-slicing for Kubernetes GPU workloads?

MIG provides hardware-level memory isolation and is required for production inference where workloads must not interfere with each other’s memory bandwidth. MIG requires A100 or H100. Time-slicing works on all NVIDIA GPUs but provides no memory isolation: pods share GPU memory and can affect each other’s performance. Use MIG for production, time-slicing for development and testing.

How should I autoscale LLM inference on Kubernetes?

Do not use HPA on CPU or memory; neither reflects inference load. Use KEDA with Prometheus triggers on inference queue depth (number of requests waiting) and KV cache utilization. A queue depth above 10 requests per replica signals insufficient capacity. KV cache utilization above 80% signals memory pressure that will degrade latency. Scale on the metric closest to the actual resource constraint.

How do I handle Spot interruptions for GPU training jobs?

Register a SIGTERM handler in your training code that saves a checkpoint to network-attached storage (PVC, S3, or GCS) when the signal is received. Kubernetes sends SIGTERM before killing a pod, and terminationGracePeriodSeconds: 120 gives the checkpoint handler up to 2 minutes to save state. At pod restart, load the latest checkpoint and resume. AWS Node Termination Handler reacts to the 2-minute Spot interruption notice by draining the node, which is what triggers the SIGTERM.

What is the right GPU observability stack for Kubernetes GPU workloads?

DCGM Exporter (deployed by GPU Operator) feeds GPU metrics to Prometheus. Alert on SM utilization (not just allocation), memory utilization, and power draw. For request-level correlation, bridge DCGM metrics to OpenTelemetry traces using the patterns in our OpenTelemetry tutorial, so that GPU utilization becomes a span attribute on the request that caused it.

Conclusion

Kubernetes for GPU workloads is a distinct operational domain from standard Kubernetes cluster management. The patterns that make CPU workloads reliable (HPA on resource metrics, multiple replicas for availability, Spot without checkpointing for cost) either do not apply or actively cause problems when applied to GPU training and inference.

The seven patterns in this guide (GPU Operator for driver lifecycle, MIG for utilization, Kueue for topology-aware scheduling, Spot with checkpointing, DCGM for real observability, namespace isolation for multi-tenancy, and vLLM with KEDA for LLM serving) are the production layer that generic Kubernetes guides skip.

For teams building AI infrastructure on Kubernetes and connecting it to decentralized compute or verification layers, see our EigenLayer AVS setup guide for the infrastructure patterns that connect centralized GPU clusters to decentralized validation networks.

At The Good Shell we design and operate GPU infrastructure and AI platform engineering for teams moving from prototype to production. See our infrastructure and DevOps services and case studies to understand what that looks like in practice.

For the authoritative GPU Operator reference, the NVIDIA GPU Operator documentation covers installation, configuration, and upgrade procedures. For vLLM, the official vLLM repository is the primary reference for production configuration options.