Kubernetes deployment best practices are well documented. The official documentation covers them. Every major cloud provider has a guide. There are dozens of Medium articles listing the same checklist. And yet 61% of teams experienced a production incident in the past year directly attributable to insufficient cluster configuration: settings that looked correct in staging and failed under real load.
The gap is not knowledge. It is the difference between knowing what a practice is and understanding why it breaks and under what conditions.
This guide covers five Kubernetes deployment best practices that are widely recommended, frequently misimplemented, and reliably dangerous when they go wrong. For each one: what the practice is, the exact failure mode that happens in production, and the configuration that actually works.
Why Kubernetes Deployment Best Practices Break in Production
Before getting into the specific patterns, it helps to understand the structural reason most Kubernetes deployment best practices fail to survive contact with production.
Staging environments are quiet. They have predictable load, no competing workloads from other teams, and no accumulated state from months of operation. A liveness probe with a one-second timeout works fine in staging because the application always responds in 50ms. The same probe kills pods under production load when the application is momentarily CPU-throttled and takes 1.1 seconds to respond.
The 2025 Kubernetes Benchmark Report found that average CPU utilization across clusters is just 10% and average memory utilization is 23%. That is not a sign that clusters are well provisioned; it is a sign that teams are dramatically overprovisioning to compensate for configuration they are not confident in, while simultaneously running into incidents caused by that same misconfiguration.
The five patterns below are not beginner mistakes. They are the failures that happen to teams who have read the documentation, set up the recommended configuration, and still end up with production incidents.
Pattern 1: Resource Limits That Create the Failure They Prevent
Resource requests and limits are among the most fundamental Kubernetes deployment best practices. Set requests so the scheduler knows where to place pods. Set limits so one misbehaving container cannot starve its neighbours. Every guide says this. It is correct. And it produces one of the most insidious production failure modes in Kubernetes.
The failure mode: CPU limits causing CrashLoopBackOff via liveness probe feedback loop
Here is the scenario. You set a CPU limit on a pod that is correctly sized for steady-state operation. You configure a liveness probe with a one-second timeout, which is the default. Under a burst of traffic, the pod hits its CPU limit and gets throttled. While throttled, the liveness probe HTTP request queues behind ordinary requests and takes 1.2 seconds to respond. The liveness probe fails. The kubelet restarts the pod. The consumers of this service have now built up an even larger backlog. The pod restarts into a higher load than it left. The liveness probe fails again, faster. The pod stays in CrashLoopBackOff indefinitely, because the mechanism designed to keep it healthy is the mechanism keeping it down.
This is not a theoretical scenario. It is a documented production incident pattern where the combination of CPU limits and liveness probes creates a self-reinforcing failure loop that requires human intervention to break.
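For illustration, the risky combination looks like this. The values are hypothetical; the point is the interaction between the CPU limit and the default probe timing:

```yaml
# Do not ship this combination: CPU limit + default liveness probe timing
resources:
  limits:
    cpu: "500m"          # throttling kicks in under burst load
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 1      # default: fails when a throttled response takes >1s
  failureThreshold: 3    # default: three throttled probes trigger a restart
```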
The fix:
For CPU limits specifically, the production-tested approach is to set CPU requests but remove CPU limits. CPU requests give the scheduler the information it needs. CPU limits introduce throttling that degrades probe reliability. Memory limits are different: keep those, because being OOMKilled is preferable to a memory leak consuming the entire node.
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    # No CPU limit: requests provide scheduling signal without throttling risk
```

For liveness probes, increase timeoutSeconds and failureThreshold beyond defaults. The default of timeoutSeconds: 1 and failureThreshold: 3 is calibrated for well-provisioned, lightly loaded clusters. Production needs more tolerance:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5    # Not 1
  failureThreshold: 5  # Not 3
```

Pattern 2: Liveness and Readiness Probes Pointing at the Same Endpoint
Separating liveness and readiness probes is listed in Kubernetes deployment best practices guides because they serve fundamentally different purposes. Most teams implement them correctly in principle and incorrectly in practice, by having both probes check the same endpoint with the same logic.
The failure mode: cascading restart storm during dependency outage
Your application checks database connectivity in its health endpoint. Both liveness and readiness probes point to /health, which returns 200 only when the database is reachable. Your database has a 30-second failover. During those 30 seconds, every pod in the deployment fails its liveness probe. The kubelet restarts all pods simultaneously. The pods come back up and immediately fail liveness again because the database is still in failover. You now have zero available pods and a cascading restart loop at exactly the moment you most need the application to be running to process in-flight requests gracefully.
This is one of the most thoroughly documented best-practice violations. The official Kubernetes documentation warns against it explicitly. It still happens in production because the failure only manifests under real dependency outages, never in staging.
The fix:
Liveness and readiness must check different things:
/healthz — liveness endpoint. Returns 200 if the process is alive and not deadlocked. Checks nothing external. Should never fail because a dependency is down.
/ready — readiness endpoint. Returns 200 if the application can serve traffic. Can check database connectivity, cache warmth, dependency health.
```yaml
livenessProbe:
  httpGet:
    path: /healthz   # Is the process alive?
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /ready     # Can this pod serve traffic?
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```

When the database goes down, readiness fails and the pod is removed from the service endpoints, so no traffic reaches it. Liveness continues to pass because the process is alive. The pod is not restarted. When the database recovers, readiness passes and the pod rejoins the endpoints. Zero cascading restarts.
For slow-starting applications, add a startup probe to prevent liveness from killing the pod during initialisation:
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10  # 300 seconds to start before liveness takes over
```

Pattern 3: Rolling Update Configuration That Causes Downtime
Rolling updates are the standard Kubernetes best practice for zero-downtime deployments. The default configuration (maxSurge: 25%, maxUnavailable: 25%) is presented as safe. For many workloads it is. For workloads with fewer than eight replicas, it causes downtime.
The failure mode: the mathematics of low replica counts
You have four replicas. maxUnavailable: 25% rounds down to one pod. Kubernetes terminates one old pod and starts one new pod. The new pod takes 45 seconds to initialise and pass its readiness probe. During those 45 seconds, you have three running pods instead of four. So far so good. But if minReadySeconds is not set, Kubernetes considers the new pod ready the moment it passes its readiness probe, before it has actually served any traffic successfully. If the new pod has a subtle startup issue that only manifests under real traffic, it starts receiving requests immediately. You have no buffer.
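The arithmetic is easy to check: the Deployment controller resolves a percentage maxUnavailable by rounding down and a percentage maxSurge by rounding up. A quick sketch (the helper name is mine):

```go
package main

import (
	"fmt"
	"math"
)

// effectiveRollout resolves percentage settings the way the Deployment
// controller does: maxUnavailable rounds down, maxSurge rounds up.
func effectiveRollout(replicas int, pct float64) (maxUnavailable, maxSurge int) {
	maxUnavailable = int(math.Floor(float64(replicas) * pct))
	maxSurge = int(math.Ceil(float64(replicas) * pct))
	return
}

func main() {
	for _, r := range []int{2, 4, 8, 20} {
		down, up := effectiveRollout(r, 0.25) // the 25%/25% defaults
		fmt.Printf("replicas=%2d maxUnavailable=%d maxSurge=%d minimum old pods serving=%d\n",
			r, down, up, r-down)
	}
}
```

At four replicas the defaults let the rollout drop you to three serving pods; at twenty replicas the same percentages tolerate five pods down at once, which is why the defaults feel safe on large deployments and bite on small ones.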
The compounding problem: without a Pod Disruption Budget (PDB), cluster maintenance events (node upgrades, autoscaler scale-downs) can terminate multiple pods simultaneously. On a four-replica deployment, a node failure taking down two pods means 50% capacity loss with no protection.
The fix:
For rolling updates, the configuration that actually protects production is:
```yaml
spec:
  replicas: 4
  minReadySeconds: 30    # New pods must be stable for 30s before proceeding
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Never terminate old pods before new ones are ready
```

maxUnavailable: 0 with maxSurge: 1 means Kubernetes always starts a new pod before terminating an old one. At no point do you have fewer than your desired replica count serving traffic.
And add a PDB to protect against maintenance events:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 3  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```

Without the PDB, a cluster upgrade can evict all your pods simultaneously. With it, the upgrade controller is forced to respect your minimum availability constraint.
Pattern 4: HPA Scaling on the Wrong Metric
Horizontal Pod Autoscaler is listed in every Kubernetes deployment best practices guide, almost always with CPU as the scaling metric. This is correct for CPU-bound workloads. For the majority of production workloads (I/O-bound APIs, queue processors, web services waiting on database queries), it causes HPA to fail silently while latency climbs.
The failure mode: HPA inaction during real saturation
You have a Node.js API that spends 90% of its time waiting on database queries. Under heavy traffic, request latency climbs from 50ms to 800ms. Your error rate starts increasing. The HPA sees 8% CPU utilisation across all pods and takes no action because there is no CPU pressure to respond to. The application is saturated by I/O, not CPU. The HPA metric is measuring the wrong signal.
This is not a Kubernetes bug. It is correct behaviour based on an incorrect configuration decision.
The fix:
For I/O-bound services, scale on the metric that actually represents saturation. For a web API, that is request rate or latency. For a queue processor, it is queue depth:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"  # Scale when the per-pod average exceeds 100 req/s
```

For queue-based services using Kubernetes Event-Driven Autoscaler (KEDA):
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-processor
spec:
  scaleTargetRef:
    name: queue-worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: rabbitmq
    metadata:
      queueName: work-queue
      queueLength: "30"  # One replica per 30 messages in queue
```

The technical requirement here: custom metrics need to be exposed via Prometheus and made available to the HPA through the metrics API. For most teams this means deploying the Prometheus Adapter. For KEDA, the KEDA controller handles this natively.
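As a sketch of what that wiring involves, a Prometheus Adapter rule that derives a per-second rate from a counter might look like the following. The counter name http_requests_total is an assumption; adapt it to whatever your application actually exposes:

```yaml
# Prometheus Adapter rule: turn a raw counter into the per-second rate
# the HPA manifest above consumes as http_requests_per_second.
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"   # exposed to the HPA via the custom metrics API
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```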
Also note the known conflict between HPA and VPA in auto mode: VPA restarts pods to resize them, which directly conflicts with HPA's replica count management. If you are running HPA in auto mode, run VPA in recommendation mode only:
```yaml
# VPA in recommendation mode only
spec:
  updatePolicy:
    updateMode: "Off"  # Recommendations only, no automatic resizing
```

Pattern 5: Ignoring Pod Topology and Node Failure Domains
Anti-affinity rules appear in Kubernetes deployment best practices documentation as a way to spread pods across nodes. Most implementations stop there. Real production deployments need to spread across availability zones, not just nodes: a node failure is a single-machine problem, but an availability zone failure takes down every node in that zone simultaneously.
The failure mode: replicas concentrated in one availability zone
You have five replicas. Your cluster has nodes in three availability zones. You have podAntiAffinity configured to spread pods across nodes. Kubernetes schedules two pods on nodes in us-east-1a, two on nodes in us-east-1b, and one on a node in us-east-1c. An availability zone failure takes down us-east-1a. You lose two out of five replicas instantly, a 40% capacity loss. Without a PDB, a maintenance event in that zone could have taken all five pods before the failure even happened.
The fix:
Use topologySpreadConstraints instead of, or in addition to, pod anti-affinity. This is the recommended practice for production clusters in 2026: anti-affinity prevents co-location on the same node, while topology spread constraints enforce distribution across zones:
```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app
```

maxSkew: 1 means the difference in pod count between the most and least populated zone cannot exceed one. DoNotSchedule on the zone spread means Kubernetes will refuse to schedule pods that would violate the constraint; it will wait for capacity rather than pack everything into one zone.
The zone constraint uses DoNotSchedule (hard requirement) and the node constraint uses ScheduleAnyway (soft preference), which gives zone diversity as a hard guarantee while still spreading across nodes where possible.
The Production Readiness Checklist
Bringing the five patterns together, a production-ready Kubernetes deployment has all of the following:
Resources: CPU requests set, CPU limits absent or generous. Memory requests and limits both set. QoS class is Guaranteed or Burstable.
Probes: Liveness checks only process health, never dependencies. Readiness checks everything required to serve traffic. Startup probe configured for slow-starting applications. timeoutSeconds calibrated to actual response time under load, not defaults.
Rolling update: maxUnavailable: 0 and maxSurge: 1. minReadySeconds set to at least 30 seconds. revisionHistoryLimit set to 5 for rollback capability.
PDB: minAvailable or maxUnavailable defined for every production deployment. Prevents maintenance events from taking down more replicas than acceptable.
Autoscaling: HPA metrics match actual saturation signal for the workload type. VPA in recommendation mode if HPA is in auto mode. KEDA for queue-based workloads.
Topology: topologySpreadConstraints configured for zone and node distribution. Zone spread is a hard constraint. Node spread is a preference.
Conclusion
Kubernetes deployment best practices exist for good reasons. The patterns above are not wrong; they are incomplete as commonly documented. The liveness probe documentation does not tell you what happens when it is combined with CPU limits under load. The rolling update guide does not explain the mathematics of low replica counts. The HPA documentation mentions custom metrics but defaults to CPU.
The production failure modes in this guide are not edge cases. They are the incidents that happen reliably as clusters grow, traffic increases, and infrastructure events occur outside controlled conditions.
At The Good Shell we run production Kubernetes infrastructure for startups and Web3 teams that cannot afford to learn these lessons through incidents. See our DevOps and platform engineering services or read our case studies to see what production-grade cluster configuration looks like in practice.
For the official Kubernetes documentation on deployment configuration, the Kubernetes production best practices checklist maintained by the community is the most comprehensive open reference.

