vLLM Kubernetes: 7 Proven Production Patterns 2026

vLLM Kubernetes deployments fail in production for a reason that has nothing to do with vLLM itself: the standard Kubernetes autoscaling model does not work for LLM inference. HPA scales on CPU and memory by default. During inference, CPU utilization is low because the work is on the GPU, and GPU memory stays constant because vLLM pre-allocates it for the KV cache. The signals available for scaling decisions are flat regardless of load. A vLLM deployment under heavy traffic looks idle to the autoscaler while requests queue up inside the engine and latency degrades.

Quick answer: Production vLLM Kubernetes requires autoscaling on the metrics that actually reflect load (inference queue depth and KV cache utilization), not CPU or memory. Use KEDA with Prometheus triggers on vllm:num_requests_waiting and vllm:gpu_cache_usage_perc. Pre-cache model weights on a PVC to cut cold starts. Run real-time inference on On-Demand GPU nodes and batch or training workloads on Spot. Monitor TTFT, KV cache utilization, queue depth, and preemption rate. This guide covers the seven patterns that make it work.

What changed in 2026: NVIDIA published Dynamo Snapshot in May 2026, reducing single-GPU vLLM cold-start to near-zero by checkpointing and restoring loaded model state. Azure retired Low-Priority VMs in March 2026, so AKS Low-Priority node pools need to migrate to Spot. For models that exceed the capacity of a single 8-GPU node, llm-d introduced disaggregated prefill/decode serving in 2025. For models that fit on a single node, vLLM remains simpler and sufficient.

At a Glance: vLLM Kubernetes Production Requirements

Component	Requirement
Kubernetes version	1.27+ with NVIDIA GPU Operator
Autoscaling	KEDA on queue depth and KV cache (not HPA on CPU)
GPU memory sizing	`params_B × 2` GB for FP16 weights + 25% for KV cache
`gpu-memory-utilization`	0.85-0.90 (balance KV cache vs OOM headroom)
Cold start mitigation	Model weights pre-cached on PVC
Node strategy	On-Demand for inference, Spot for batch/training
Critical metrics	TTFT p99, KV cache utilization, queue depth, preemption rate
Inference optimizations	PagedAttention, prefix caching, chunked prefill

Why vLLM Kubernetes Autoscaling Is Different

The core problem with vLLM Kubernetes deployments is that the signals Kubernetes uses for autoscaling do not reflect LLM inference load.

CPU is the wrong signal. During inference, the GPU does the work. CPU utilization on a vLLM pod stays low even at maximum throughput. Scaling a GPU workload on CPU is like sizing a factory by how hot the parking lot is.

GPU memory is the wrong signal. HPA’s memory metric measures host RAM, not GPU memory, and GPU memory would not help even if it were exposed: vLLM pre-allocates it for the KV cache at startup. Memory usage stays constant regardless of load, so it never triggers scale-up or scale-down. A pod serving zero requests and a pod serving 100 requests show identical GPU memory.

The right signals live inside vLLM. The metrics that reflect actual load are the inference queue depth (how many requests are waiting) and the KV cache utilization (how full the attention-state memory is). When the KV cache fills, vLLM preempts in-flight requests to make room and resumes them later, a silent degradation that no standard Kubernetes metric reveals. These metrics are exported by vLLM’s Prometheus endpoint but require explicit wiring into the autoscaling path.
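
To make the wiring concrete, the following is a minimal sketch of reading the two load gauges out of Prometheus exposition text. The sample payload, the model_name label value, and the helper name read_gauges are assumptions made for illustration. In a real cluster Prometheus scrapes the endpoint and KEDA queries Prometheus, so nothing like this runs in the serving path.

```python
import re

# Illustrative sample of Prometheus exposition text. The values and the
# model_name label are made up for this sketch.
SAMPLE_METRICS = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo-model"} 14.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo-model"} 0.93
"""

# name, optional {labels}, value, optional timestamp.
# Simplified: label values containing "}" are not handled.
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)(?:\s+\d+)?$'
)


def read_gauges(text, wanted):
    """Return {metric_name: [values]} for each wanted metric found in text."""
    found = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match and match.group("name") in wanted:
            found.setdefault(match.group("name"), []).append(float(match.group("value")))
    return found


gauges = read_gauges(
    SAMPLE_METRICS,
    {"vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc"},
)
print(gauges)
```

With the sample above, the queue gauge shows 14 waiting requests and the cache gauge shows 93% utilization, a state that CPU and memory metrics would not reveal.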

This is why a vLLM Kubernetes deployment needs the patterns below rather than the standard Deployment-plus-HPA approach that works for stateless web services. The GPU cost angle of this, making sure those expensive GPU nodes are not sitting idle, connects directly to the Kubernetes cost allocation patterns for GPU workloads.

Pattern 1: Right-Size GPU Memory Before Launch

The single biggest cause of vLLM Kubernetes OOM crashes is incorrect GPU memory sizing. Get this wrong and the pod crashes on the first large request or under concurrent load.

The sizing rule:

GPU memory needed = (model_params_B × 2 GB)        # FP16 weights
                  + 25% for KV cache                # attention states
                  + overhead

A 7B model in FP16 needs ~14GB for weights, plus KV cache, fitting comfortably on a single 24GB or 40GB GPU. A 70B model in FP16 needs ~140GB for weights alone. Two A100 80GB GPUs (160GB) can hold those weights but leave little room for KV cache, so by the rule above (~175GB plus overhead) plan for 4x A100 80GB with tensor parallelism. With INT4 quantization (AWQ or GPTQ), the 70B weight footprint drops to ~35GB, fitting on a single A100 80GB or H100.
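
The sizing rule can be applied mechanically. The sketch below is one way to do that, under stated assumptions: the 25% KV cache share is read as 25% on top of the weights, overhead_gb is a placeholder value rather than a measured one, and GPU counts are rounded up to a power of two because tensor-parallel sizes usually need to divide the attention head count evenly. Applying the rule literally, including the KV cache share and the utilization cap, gives a more conservative GPU count than looking at weights alone.

```python
import math


def estimate_gpu_memory_gb(params_billion, bytes_per_param=2.0,
                           kv_cache_fraction=0.25, overhead_gb=2.0):
    """Apply the sizing rule: weights + 25% for KV cache + overhead.

    overhead_gb is an assumed placeholder; measure it for your model.
    """
    weights_gb = params_billion * bytes_per_param
    kv_cache_gb = weights_gb * kv_cache_fraction
    return {
        "weights_gb": weights_gb,
        "kv_cache_gb": kv_cache_gb,
        "total_gb": weights_gb + kv_cache_gb + overhead_gb,
    }


def gpus_needed(total_gb, gpu_gb, utilization=0.85):
    """GPUs required when each GPU contributes gpu_gb * utilization."""
    raw = math.ceil(total_gb / (gpu_gb * utilization))
    # Round up to a power of two (assumption: typical tensor-parallel sizes).
    return 1 << (raw - 1).bit_length()


small = estimate_gpu_memory_gb(7)                          # FP16 7B model
large = estimate_gpu_memory_gb(70)                         # FP16 70B model
large_int4 = estimate_gpu_memory_gb(70, bytes_per_param=0.5)

print(small["total_gb"], gpus_needed(small["total_gb"], gpu_gb=24))
print(large["total_gb"], gpus_needed(large["total_gb"], gpu_gb=80))
print(large_int4["weights_gb"])
```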

The vLLM container configuration:

containers:
- name: vllm
  image: vllm/vllm-openai:v0.8.0    # Pin the version, never use latest
  args:
  - --model
  - meta-llama/Meta-Llama-3-8B-Instruct   # Hugging Face repo id (gated, needs an access token)
  - --max-model-len
  - "8192"
  - --gpu-memory-utilization
  - "0.85"                          # 85% leaves OOM headroom
  - --enable-prefix-caching         # Reuse KV cache across shared prefixes
  - --enable-chunked-prefill        # Better throughput on mixed workloads
  resources:
    limits:
      nvidia.com/gpu: "1"

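Setting --gpu-memory-utilization to 0.85 rather than 0.95 leaves headroom for allocations that fall outside the budget vLLM profiles at startup, such as CUDA graph capture, communication buffers, and allocator fragmentation. That headroom is what prevents out-of-memory crashes under concurrent load. The KV cache itself is pre-allocated inside the budget and does not grow at runtime. The tradeoff is slightly less KV cache capacity, which is the correct tradeoff for production stability.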
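
To see what the utilization setting costs in KV cache capacity, here is a rough worked calculation. It assumes the Llama 3 8B architecture (32 layers, 8 key/value heads, head dimension 128), FP16 cache entries, decimal gigabytes, about 16GB of weights, and a hypothetical 1GB reserved for activations. Treat the result as an order-of-magnitude estimate, not what vLLM will report at startup.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(gpu_gb, utilization, weights_gb, reserved_gb,
                            bytes_per_token):
    """Approximate number of tokens that fit in the remaining budget."""
    budget_bytes = (gpu_gb * utilization - weights_gb - reserved_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


# Llama 3 8B shape; reserved_gb=1.0 is an assumed placeholder.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

at_085 = kv_cache_token_capacity(24, 0.85, weights_gb=16, reserved_gb=1.0,
                                 bytes_per_token=per_token)
at_095 = kv_cache_token_capacity(24, 0.95, weights_gb=16, reserved_gb=1.0,
                                 bytes_per_token=per_token)

print(per_token)        # bytes per cached token
print(at_085, at_095)   # approximate token capacity at each setting
```

On a 24GB card the lower setting gives up a noticeable share of cache capacity, which is why the choice between 0.85 and 0.90 is worth measuring on the target GPU rather than copying from another deployment.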

Pattern 2: vLLM Kubernetes Autoscaling with KEDA on Queue Depth

This is the pattern that makes vLLM Kubernetes work. Replace HPA-on-CPU with KEDA triggered by the metrics that reflect actual inference load.

The signal chain: vLLM exports metrics to its Prometheus endpoint. Prometheus scrapes them. KEDA reads the queue depth and KV cache metrics from Prometheus and makes scaling decisions based on the load signals that matter.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300              # Only governs the final 1 -> 0 step (relevant when minReplicaCount is 0)
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # Wait 5 min before removing replicas (cold start is expensive)
  triggers:
  # Scale on requests waiting in the queue
  - type: prometheus
    metricType: AverageValue       # KEDA default: the query result is divided by the replica count
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_depth
      threshold: "10"             # Scale up at >10 waiting requests per replica
      query: sum(vllm:num_requests_waiting)
  # Scale on KV cache utilization
  - type: prometheus
    metricType: Value              # Compare the average directly, without dividing by replicas
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_kv_cache
      threshold: "0.8"            # Scale up at >80% KV cache full
      query: avg(vllm:gpu_cache_usage_perc)

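Why the scale-down delay matters: vLLM cold start is expensive (model loading takes minutes). Removing replicas too eagerly causes thrashing: scaling down, then immediately back up, paying the cold-start cost repeatedly. Note that KEDA’s cooldownPeriod only governs the final step from one replica to zero. Scale-down between replica counts is handled by the underlying HPA, whose scaleDown stabilization window defaults to 5 minutes and can be tuned under advanced.horizontalPodAutoscalerConfig.behavior in the ScaledObject.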

Scale-to-zero consideration: KEDA can scale vLLM to zero when traffic drops, eliminating GPU cost during idle periods. The tradeoff is cold-start latency on the first request. With weights pre-cached on a PVC, cold start is 30-60 seconds for a 7B model and 2-5 minutes for a 70B model. Without caching, add 5-10 minutes for download. At zero replicas no vLLM pod is exporting metrics, so the wake-up trigger has to come from a signal outside vLLM, such as request counts at the gateway. Only use scale-to-zero where that first-request latency is acceptable.
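
KEDA hands the Prometheus value to a Horizontal Pod Autoscaler, which applies the standard HPA formula. The sketch below reproduces that arithmetic for the two triggers so the thresholds can be sanity-checked before deployment. It is a simplification: it ignores the HPA tolerance band and stabilization windows, and the function names are illustrative.

```python
import math


def replicas_for_average_value(total_metric, per_replica_target,
                               min_replicas=1, max_replicas=8):
    """AverageValue targets: total metric divided by the per-replica target."""
    raw = math.ceil(total_metric / per_replica_target)
    return max(min_replicas, min(max_replicas, raw))


def replicas_for_value(current_replicas, metric_value, target,
                       min_replicas=1, max_replicas=8):
    """Value targets: scale current replicas by the metric-to-target ratio."""
    raw = math.ceil(current_replicas * metric_value / target)
    return max(min_replicas, min(max_replicas, raw))


# 45 requests waiting in total, target of 10 per replica.
queue_based = replicas_for_average_value(45, 10)

# 2 replicas at 92% average KV cache usage, target of 80%.
cache_based = replicas_for_value(2, 0.92, 0.8)

# With several triggers, the largest proposal wins.
print(max(queue_based, cache_based))
```

The example shows why both triggers are worth having: a deployment can be comfortable on one signal and saturated on the other, and the autoscaler follows whichever asks for more replicas.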

Pattern 3: Cold Start Mitigation

Cold start is the defining operational challenge of vLLM Kubernetes autoscaling. Every scale-up event pays the model loading cost, and that cost determines whether autoscaling is responsive or painful.

Pre-cache model weights on a PVC:

# Store model weights on a shared PVC so pods mount pre-loaded weights
# instead of downloading from HuggingFace on every cold start

# Pod spec level (spec.template.spec.volumes)
volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: llm-model-cache    # ReadOnlyMany, pre-populated with weights
    readOnly: true

# Container level (containers[].volumeMounts)
volumeMounts:
- name: model-cache
  mountPath: /models
  readOnly: true

A 7B model is ~14GB in FP16. Downloading that from HuggingFace on every cold start adds 2-5 minutes. Mounting from a pre-populated PVC eliminates the download entirely: the pod reads weights from cluster storage instead of the public internet, so the remaining cold-start time is load time only.

Configure probes for model loading time:

startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 60          # Allow up to 10 minutes for model load
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
  # Pod only receives traffic after model is fully loaded

The startupProbe with a high failureThreshold prevents Kubernetes from killing the pod during the legitimate model-loading window. The readinessProbe ensures no traffic routes to the pod until the model is loaded and ready.
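
The startup budget is simply failureThreshold multiplied by periodSeconds, so the threshold can be derived from an expected load time instead of guessed. A small sketch follows. The safety factor of 2 is an assumption, not a Kubernetes default, and the 300-second load time corresponds to the upper end of the 70B cold-start range quoted earlier.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds):
    """Total time the startupProbe allows before the pod is restarted."""
    return failure_threshold * period_seconds


def startup_failure_threshold(expected_load_seconds, period_seconds=10,
                              safety_factor=2.0):
    """failureThreshold that covers the expected load time with a margin."""
    return math.ceil(expected_load_seconds * safety_factor / period_seconds)


# The probe in this guide: 60 failures x 10 s.
budget = startup_budget_seconds(60, 10)

# A model that takes up to 5 minutes to load, with 2x margin.
threshold = startup_failure_threshold(300, period_seconds=10)

print(budget, threshold)
```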

NVIDIA Dynamo Snapshot: Released May 2026, Dynamo Snapshot checkpoints and restores loaded model state, reducing single-GPU vLLM cold-start to near-zero. For deployments where scale-to-zero is desirable but cold-start latency has been the blocker, this changes the calculation: it makes aggressive scale-to-zero viable for latency-sensitive workloads that previously could not tolerate it.

Pattern 4: Spot GPU Nodes for Batch, On-Demand for Inference

GPU cost is the dominant cost in any vLLM Kubernetes deployment. The node strategy that balances cost and reliability splits workloads by interruption tolerance.

The split: Real-time inference runs on On-Demand GPU nodes because it is latency-sensitive and cannot tolerate Spot interruptions mid-request. Batch inference, offline processing, and training run on Spot GPU nodes at 70-80% lower cost, with interruption handling.

# Inference deployment - On-Demand nodes, no interruption tolerance
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  template:
    spec:
      nodeSelector:
        node-lifecycle: on-demand
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

---
# Batch inference Job - Spot nodes, interruption-tolerant
apiVersion: batch/v1
kind: Job
metadata:
  name: vllm-batch-inference
spec:
  backoffLimit: 6                         # Allow retries after Spot interruptions
  template:
    spec:
      restartPolicy: OnFailure            # Jobs require Never or OnFailure
      nodeSelector:
        node-lifecycle: spot
      terminationGracePeriodSeconds: 60   # Time to flush a checkpoint on interruption
      containers:
      - name: vllm-batch
        # Batch jobs checkpoint progress and resume on restart

Azure note: Azure retired Low-Priority VMs in March 2026. If you run AKS with Low-Priority node pools for vLLM batch workloads, migrate to Spot VMs. They have been the recommended choice for some time, and Low-Priority is no longer available.

The GPU cost optimization here is FinOps applied to AI infrastructure, the same discipline covered in our Kubernetes cost allocation guide, applied to the most expensive resource class in the cluster.

Pattern 5: vLLM Kubernetes Observability: The Four Metrics That Matter

Most teams instrument vLLM Kubernetes deployments with whatever their existing Kubernetes monitoring covers: CPU, memory, pod restarts. None of these tell you whether your inference system is healthy. The metrics that matter come from vLLM’s own Prometheus endpoint.

The four critical vLLM metrics:

Metric	What it reveals	Alert threshold
`vllm:time_to_first_token_seconds`	TTFT – perceived latency	p99 > your SLO
`vllm:gpu_cache_usage_perc`	KV cache utilization	> 0.9 sustained = preemption risk
`vllm:num_requests_waiting`	Queue depth	> 10/replica = under-provisioned
`vllm:num_preemptions_total`	Preemption rate	> 0.05/sec for 30s = capacity problem

Prometheus alerts for vLLM Kubernetes:

groups:
- name: vllm-inference
  rules:
  - alert: VLLMHighPreemptionRate
    expr: |
      rate(vllm:num_preemptions_total[1m]) > 0.05
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "vLLM preempting requests on {{ $labels.pod }}"
      description: "KV cache full, requests being preempted. P99 latency will spike. Scale up or reduce max-model-len."

  - alert: VLLMHighTTFT
    expr: |
      histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM TTFT p99 above 2 seconds"
      description: "Time to first token degraded. Check queue depth and KV cache utilization."

  - alert: VLLMQueueBacklog
    expr: |
      sum(vllm:num_requests_waiting) /
      sum(kube_deployment_status_replicas_ready{deployment="vllm-inference"}) > 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "vLLM queue depth above 20 per replica"
      description: "Requests queuing faster than processing. KEDA should be scaling - verify autoscaler health."
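
The VLLMHighTTFT expression relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. The sketch below mirrors that idea in simplified form so the effect of bucket boundaries on the reported p99 is easier to reason about. The bucket bounds and counts are invented for illustration and are not vLLM's actual TTFT buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            if count == prev_count:
                return upper
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Invented TTFT buckets in seconds: 90 requests under 0.5 s, 98 under 1 s,
# all 100 under 2 s.
ttft_buckets = [(0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, ttft_buckets)
print(p99)
```

Because the estimate is interpolated, a p99 that lands in a wide bucket is only as precise as that bucket's width. That is worth keeping in mind when setting a 2-second threshold.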

The VLLMHighPreemptionRate alert is the most important. When the KV cache fills and vLLM starts preempting requests, P99 latency can spike 8x under load. Catching preemption early, before it cascades, is what separates a stable vLLM Kubernetes deployment from one that degrades silently.
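
To put the 0.05/sec threshold in concrete terms, it corresponds to three preemptions per minute. The following sketch computes a per-second rate from two counter samples. It is a simplified stand-in for the PromQL rate function, which also extrapolates to the window edges and handles counter resets between every pair of samples.

```python
def per_second_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    (t_start, v_start), (t_end, v_end) = samples[0], samples[-1]
    if t_end <= t_start:
        return 0.0
    # If the counter went backwards, assume it reset to zero in between.
    increase = v_end - v_start if v_end >= v_start else v_end
    return increase / (t_end - t_start)


THRESHOLD = 0.05  # preemptions per second, as in the alert rule

quiet = per_second_rate([(0, 100), (60, 102)])   # 2 preemptions in a minute
busy = per_second_rate([(0, 100), (60, 104)])    # 4 preemptions in a minute

print(quiet > THRESHOLD, busy > THRESHOLD)
```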

To correlate these inference metrics with request-level traces, the OpenTelemetry patterns in our OpenTelemetry tutorial bridge vLLM Prometheus metrics into distributed traces, connecting a slow request to the KV cache state at the moment it was served.

Pattern 6: PagedAttention, Prefix Caching, and Chunked Prefill

vLLM’s throughput advantage over naive inference servers comes from PagedAttention, and production deployments should enable the optimizations built on top of it.

Prefix caching: When multiple requests share a common prefix (a system prompt, a few-shot template), prefix caching reuses the KV cache for that shared prefix instead of recomputing it. For workloads with long shared system prompts, this is a significant throughput gain.

args:
- --enable-prefix-caching         # Reuse KV cache for shared prefixes
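
The idea behind prefix caching can be sketched in a few lines: split the prompt into fixed-size token blocks, key each block by its own tokens plus the key of the block before it, and reuse any leading blocks whose keys are already cached. This is a conceptual illustration only, not vLLM's implementation. The block size of 16, the class name, and the use of Python's built-in hash are all assumptions.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


def block_keys(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a key identifies the whole prefix."""
    keys = []
    parent = None
    full_length = len(tokens) - len(tokens) % block_size
    for start in range(0, full_length, block_size):
        parent = hash((parent, tuple(tokens[start:start + block_size])))
        keys.append(parent)
    return keys


class PrefixCache:
    """Tracks which prefix blocks already have KV entries."""

    def __init__(self):
        self._cached = set()

    def admit(self, tokens):
        """Return how many prompt tokens can reuse cached KV blocks."""
        reused = 0
        matching = True
        for key in block_keys(tokens):
            if matching and key in self._cached:
                reused += BLOCK_SIZE
            else:
                matching = False          # first miss ends the shared prefix
                self._cached.add(key)
        return reused


system_prompt = list(range(48))            # 48 shared tokens = 3 full blocks
request_a = system_prompt + list(range(100, 120))
request_b = system_prompt + list(range(200, 220))

cache = PrefixCache()
first = cache.admit(request_a)
second = cache.admit(request_b)
print(first, second)
```

The second request skips prefill for the shared system prompt, which is where the throughput gain for long shared prefixes comes from.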

Chunked prefill: Mixing long prompts (prefill-heavy) with active generations (decode-heavy) in the same batch causes latency spikes. Chunked prefill breaks long prefills into chunks that interleave with decode steps, smoothing latency under mixed workloads.

args:
- --enable-chunked-prefill
- --max-num-batched-tokens
- "8192"
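
The effect of --max-num-batched-tokens on a long prompt can be approximated with simple arithmetic. The sketch below assumes each active decode consumes one token of the per-step budget and that the number of active decodes stays constant while the prompt is being prefilled. Both are simplifications of what the scheduler actually does.

```python
def plan_prefill_chunks(prompt_tokens, active_decodes,
                        max_num_batched_tokens=8192):
    """Split a prompt into per-step chunks that share the token budget
    with ongoing decode requests."""
    if active_decodes >= max_num_batched_tokens:
        raise ValueError("no token budget left for prefill")
    per_step = max_num_batched_tokens - active_decodes
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(per_step, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# A 20,000-token prompt arriving while 192 requests are generating.
chunks = plan_prefill_chunks(20000, active_decodes=192)
print(chunks)
```

A smaller budget means more, shorter chunks: decode steps are delayed less by each prefill, at the price of a longer time to first token for the long prompt.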

The throughput impact: PagedAttention increases concurrent serving capacity significantly compared to naive serving, on the order of the difference between 30 and 100+ concurrent requests on the same H100 for a 7B model. Prefix caching and chunked prefill build on that base to optimize for the specific shape of production traffic.

These are configuration flags, not architectural changes, but they materially affect how many requests a vLLM Kubernetes pod can serve before KEDA needs to scale up, which directly affects GPU cost.

Pattern 7: When to Move Beyond Single-Node vLLM

For most production workloads, a single-node vLLM Kubernetes deployment with the patterns above is sufficient. There are two thresholds where you need more.

KServe for production model serving: KServe (a CNCF incubating project) wraps vLLM with Kubernetes-native deployment, canary rollouts, multi-model endpoints, and autoscaling. The production pattern is vLLM as the inference runtime inside a KServe InferenceService. Use KServe when you need canary deployments of new model versions, multi-model serving on shared infrastructure, or standardized inference APIs across many models.

llm-d for models exceeding single-node capacity: For frontier open-weight models (Llama 3.1 405B, DeepSeek V3, Qwen 3 235B) that can exceed a single 8-GPU node at full precision, llm-d provides disaggregated serving: separating the prefill and decode phases, offloading KV cache between nodes, and cross-node tensor parallelism. Introduced in 2025, llm-d is rapidly evolving and currently used by AI-native companies and research institutions. For models that fit on a single node, vLLM alone is simpler and sufficient. Do not add llm-d complexity until model size forces it.

The decision:

Model fits on 1 GPU, single model            -> vLLM + KEDA (this guide)
Model fits on 1 node, need canary/multi      -> KServe wrapping vLLM
Model exceeds 1 node (e.g. 405B in FP16)     -> llm-d disaggregated serving
High throughput >1000 req/s                  -> Multi-node vLLM + Ray scheduler
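
The same decision can be written as a small function, which is handy when the choice is made per model in a platform catalog. The precedence between rules is an assumption (throughput first, then model size, then rollout needs), as is treating a multi-GPU model on a single node like the single-GPU case. The list above does not specify either.

```python
def choose_serving_stack(fits_one_node, needs_canary_or_multi_model=False,
                         target_rps=0):
    """Map the decision list to a recommendation (assumed precedence)."""
    if target_rps > 1000:
        return "Multi-node vLLM + Ray scheduler"
    if not fits_one_node:
        return "llm-d disaggregated serving"
    if needs_canary_or_multi_model:
        return "KServe wrapping vLLM"
    return "vLLM + KEDA"


print(choose_serving_stack(fits_one_node=True))
print(choose_serving_stack(fits_one_node=True, needs_canary_or_multi_model=True))
print(choose_serving_stack(fits_one_node=False))
```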

The vLLM Kubernetes Production Checklist

GPU MEMORY SIZING
[ ] Model memory calculated: params_B × 2 + 25% KV cache
[ ] gpu-memory-utilization set to 0.85-0.90
[ ] Quantization (AWQ/GPTQ) evaluated for large models
[ ] vLLM image version pinned (not latest)

AUTOSCALING
[ ] KEDA installed (not relying on HPA-on-CPU)
[ ] ScaledObject triggering on vllm:num_requests_waiting
[ ] ScaledObject triggering on vllm:gpu_cache_usage_perc
[ ] Scale-down stabilization window set to 300s to prevent thrashing (cooldownPeriod only covers scale-to-zero)

COLD START
[ ] Model weights pre-cached on ReadOnlyMany PVC
[ ] startupProbe with failureThreshold for model load time
[ ] readinessProbe gating traffic until model loaded
[ ] NVIDIA Dynamo Snapshot evaluated for scale-to-zero

COST
[ ] Inference on On-Demand nodes
[ ] Batch/training on Spot nodes with interruption handling
[ ] AKS Low-Priority migrated to Spot (if applicable)
[ ] GPU cost allocation tracked per workload

OBSERVABILITY
[ ] vLLM Prometheus endpoint scraped
[ ] TTFT p99 alert against SLO
[ ] KV cache utilization alert at >0.9
[ ] Preemption rate alert at >0.05/sec for 30s
[ ] Queue depth alert per replica

INFERENCE OPTIMIZATION
[ ] PagedAttention (default in vLLM)
[ ] Prefix caching enabled for shared-prefix workloads
[ ] Chunked prefill enabled for mixed workloads

FAQ: vLLM Kubernetes

Why can’t I autoscale vLLM with standard Kubernetes HPA?

HPA scales on CPU and memory by default. During LLM inference, CPU utilization is low (the GPU does the work) and GPU memory is constant (vLLM pre-allocates it for the KV cache). Neither metric reflects actual load, so HPA cannot make correct scaling decisions. Use KEDA with Prometheus triggers on inference queue depth and KV cache utilization instead. These are the metrics that reflect real load on a vLLM Kubernetes deployment.

How do I reduce vLLM cold start time on Kubernetes?

Pre-cache model weights on a ReadOnlyMany PVC so pods mount pre-loaded weights instead of downloading from HuggingFace. This cuts cold start from minutes (including download) to 30-60 seconds for a 7B model. Configure a startupProbe with a high failureThreshold to prevent Kubernetes from killing the pod during model loading. For scale-to-zero with minimal cold-start penalty, evaluate NVIDIA Dynamo Snapshot (released May 2026).

What GPU memory do I need for vLLM on Kubernetes?

Use the rule: model parameters in billions times 2 for FP16 weights in GB, plus 25% for KV cache. A 7B model needs ~14GB plus KV cache (fits on 24GB+). A 70B model needs ~140GB for weights and about 175GB with KV cache, so plan for 4x A100 80GB with tensor parallelism. INT4 quantization reduces the 70B weight footprint to ~35GB. Set gpu-memory-utilization to 0.85 for OOM headroom.

What metrics should I monitor for vLLM Kubernetes?

Four vLLM-specific metrics matter most: TTFT (time to first token) for perceived latency, KV cache utilization for memory pressure, queue depth for whether requests are waiting, and preemption rate for capacity problems. Standard Kubernetes metrics (CPU, memory, restarts) do not reveal inference health. Alert on preemption rate above 0.05/sec, which precedes P99 latency spikes of up to 8x.

Should I use KServe or plain vLLM on Kubernetes?

Use plain vLLM with KEDA for single-model deployments where you control the serving layer. Use KServe (which wraps vLLM) when you need canary deployments of new model versions, multi-model endpoints on shared infrastructure, or a standardized inference API across many models. KServe adds operational features at the cost of additional complexity. Adopt it when you need those specific features.

Conclusion

vLLM Kubernetes deployments succeed or fail on one decision: whether you autoscale on the metrics that reflect LLM inference load or on the CPU and memory metrics that do not. Everything else in this guide builds on that foundation: KEDA on queue depth and KV cache, cold-start mitigation through weight caching, the Spot-versus-On-Demand split, and the observability that catches preemption before it cascades into latency spikes.

The seven patterns here are the production layer that the vLLM documentation and quickstart guides do not cover. They are what separates a vLLM deployment that works in a demo from one that serves production traffic reliably and cost-effectively.

For the broader GPU infrastructure context that vLLM Kubernetes deployments run on, see our Kubernetes for GPU workloads guide covering the GPU Operator, MIG partitioning, and topology-aware scheduling that underpin production AI infrastructure.

At The Good Shell we design and operate AI inference infrastructure on Kubernetes for teams moving from prototype to production. See our infrastructure and DevOps services and case studies to understand what that looks like in practice.

If your team is sizing GPU infrastructure for an LLM workload in production and wants a second pair of eyes on the autoscaling and observability before going live, our 7-day infrastructure audit covers exactly that scope with a fixed price and clear deliverables.

For the authoritative vLLM reference, the official documentation at docs.vllm.ai and the source repository at github.com/vllm-project/vllm cover every serving parameter and optimization flag.