Reducing Kubernetes Infrastructure Costs by 40% on EKS

How we cut a SaaS platform’s cloud bill from $45k to $27k per month by redesigning network routing, fixing workload placement, and replacing CPU-based autoscaling with cost-aware decisions, without sacrificing reliability.

Kubernetes, AWS EKS, FinOps, Karpenter, Networking, Autoscaling

In this case study

TL;DR
Client context
The cost trajectory and root causes
Discovery and audit phase
Architecture and key decisions
Stack and tooling
Implementation details
Results and numbers
Lessons learned
When this fits your team

Figure: Before and after cost optimization architecture diagram

TL;DR

A Series B SaaS platform on EKS was watching its AWS bill grow faster than its revenue. The line items were vague and the team did not know which workloads were responsible. We ran a structured audit, fixed the routing layer first (NAT and inter-AZ), then compute placement and autoscaling. Total monthly spend dropped from about 45,000 USD to 27,000 USD, a 40% reduction, while p99 latency stayed flat and reliability actually improved.

Client context

The client is a Series B B2B SaaS company running their main platform on AWS EKS across three availability zones. About 200 microservices, a mix of Go and TypeScript workloads, around 30 engineers, and roughly 500 million API requests per month. They had grown 4x in the previous year and the cloud bill had grown 5x. Their CTO had been told by the board to bring infrastructure cost growth back under control by the next quarterly review.

The team was technically capable but had built fast: one default node group, default autoscaling on CPU, no FinOps tooling, no cost tagging, no VPC Endpoints, multi-AZ everywhere by reflex. The architecture was reliable but expensive, and the team did not have a clear picture of where the money was going.

The cost trajectory and root causes

It is tempting to attack a high cloud bill with broad strokes: smaller instances, fewer replicas, Spot everywhere. That is the wrong move when the line items are not understood. The first job is to map the spend to the workloads. We found four root causes once we did.

1. NAT Gateway was the single biggest line item

NAT Gateway was costing about 8,000 USD per month. Almost all of it was egress to AWS APIs (S3, ECR, CloudWatch, STS, Secrets Manager) and to a handful of external SaaS providers. Every node sat in a private subnet and all egress flowed through NAT, so every container image pull, every log line shipped to a third party, every secret rotation, and every cross-region replication was billed per GB processed.
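To make the per-GB effect concrete, here is a minimal sketch of the NAT Gateway billing model: a fixed hourly charge per gateway plus a charge per GB processed. The rates (0.045 USD per hour and 0.045 USD per GB) are assumed us-east-1 list prices and should be checked for your region; the function names and the 730-hour month are illustrative choices, not something taken from the client's bill.

```python
HOURS_PER_MONTH = 730  # common billing approximation (assumption)


def nat_monthly_cost(gateways: int, gb_processed: float,
                     hourly_rate: float = 0.045,
                     per_gb_rate: float = 0.045) -> float:
    """Estimate monthly NAT Gateway cost in USD. Rates are assumed list prices."""
    fixed = gateways * hourly_rate * HOURS_PER_MONTH
    variable = gb_processed * per_gb_rate
    return fixed + variable


def gb_implied_by_bill(monthly_bill: float, gateways: int,
                       hourly_rate: float = 0.045,
                       per_gb_rate: float = 0.045) -> float:
    """Invert the model: how many GB per month would explain a given bill."""
    fixed = gateways * hourly_rate * HOURS_PER_MONTH
    return max(monthly_bill - fixed, 0.0) / per_gb_rate


if __name__ == "__main__":
    # Hypothetical inputs: three gateways, 1 TB of processed traffic.
    print(f"{nat_monthly_cost(3, 1000):.2f} USD")
```

Under these assumed rates the fixed hourly part is small, so a bill in the thousands of dollars is almost entirely data processing. That is why moving traffic off NAT matters more than trimming the number of gateways.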

2. Inter-AZ traffic at the service layer

Multi-AZ is a reliability feature, but nothing in a default Kubernetes setup accounts for the cost of traffic crossing AZs. The scheduler placed pods on whatever nodes had room, regardless of where their callers ran, and kube-proxy spread service calls across all endpoints regardless of zone. We measured around 60% of internal service-to-service traffic crossing AZs unnecessarily.
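A figure in that range is what zone-unaware routing predicts. The sketch below computes the expected share of requests that leave their zone when every endpoint is equally likely to be chosen. The zone names and the even distribution are assumptions for illustration.

```python
from fractions import Fraction


def expected_cross_zone_fraction(clients_by_zone, endpoints_by_zone):
    """Probability that a request leaves its zone when endpoints are picked uniformly."""
    total_clients = sum(clients_by_zone.values())
    total_endpoints = sum(endpoints_by_zone.values())
    same_zone = Fraction(0)
    for zone, clients in clients_by_zone.items():
        p_client = Fraction(clients, total_clients)
        p_endpoint = Fraction(endpoints_by_zone.get(zone, 0), total_endpoints)
        same_zone += p_client * p_endpoint
    return 1 - same_zone


# Hypothetical: callers and endpoints spread evenly over three zones.
even = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
print(expected_cross_zone_fraction(even, even))  # 2/3
```

With three evenly loaded zones, two out of three calls cross a zone boundary purely by chance, which is close to the measured figure.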

3. Over-provisioned compute hiding reliability issues

The team had run into a couple of incidents where pods got OOM killed under load. The fix at the time had been to double the memory requests everywhere. That fixed the symptom but left the cluster running at about 35% real CPU and memory utilization, with twice the nodes needed for the actual workload.

4. No Spot, no Karpenter, no instance diversity

The cluster ran on a single instance type (m5.2xlarge on-demand) for everything. Stateless workloads were paying on-demand prices when they could absorb Spot interruptions. The Cluster Autoscaler had to wait for whole nodes to drain when scaling down, which added latency to scale-down events.

Discovery and audit phase

Two weeks of audit before any change. The output was a spreadsheet that mapped every significant line item in the AWS bill to a Kubernetes workload, and a ranking of where the largest savings were with the lowest risk.

We enabled cost allocation tags everywhere, deployed OpenCost into the cluster, and turned on VPC Flow Logs sampled at 1%. After two weeks of data we could answer questions like “how much does the billing-service cost us per month broken down by compute, NAT egress, and EBS?”. Most teams cannot answer that question.
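The per-service question comes down to a group-by over tagged line items. The sketch below shows the shape of that aggregation. The record fields (`service`, `category`, `usd`) and the sample values are hypothetical, and this is not the schema of OpenCost or of the AWS Cost and Usage Report.

```python
from collections import defaultdict


def cost_by_service(line_items):
    """Sum cost per service and per category (compute, NAT egress, EBS, ...)."""
    totals = defaultdict(lambda: defaultdict(float))
    for item in line_items:
        totals[item["service"]][item["category"]] += item["usd"]
    return {service: dict(categories) for service, categories in totals.items()}


# Hypothetical line items, not client data.
items = [
    {"service": "billing-service", "category": "compute", "usd": 120.0},
    {"service": "billing-service", "category": "nat-egress", "usd": 30.0},
    {"service": "billing-service", "category": "compute", "usd": 80.0},
    {"service": "search", "category": "ebs", "usd": 15.0},
]
print(cost_by_service(items)["billing-service"])
```

The code is the easy part. The aggregation only works once every line item carries a tag or label that identifies its workload, which is what the two weeks of data collection were for.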

The audit also surfaced a few quick wins worth tens of thousands per year that were not in the original scope: unused EBS volumes from old deployments, three RDS instances at db.r5.4xlarge serving traffic that fit comfortably on db.r5.xlarge, and two NAT Gateways in zones that did not need them.

Architecture and key decisions

The new design tackles cost at three layers, in this order: routing, compute placement, then compute itself. Order matters here because optimizing compute before fixing routing means you over-correct.

Routing layer: VPC Endpoints first, then topology-aware hints

We added VPC Endpoints (gateway endpoints for S3 and DynamoDB, interface endpoints for the rest) so AWS API traffic stops going through NAT. This single change eliminated about 80% of NAT Gateway spend. Then we enabled topology-aware routing in Kubernetes (service.kubernetes.io/topology-mode: Auto) so kube-proxy prefers same-zone endpoints when possible, which cut cross-AZ data transfer by about 70%.
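Gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge, so they are a pure saving. Interface endpoints do have a price, so each one is worth a quick break-even check. The sketch below assumes us-east-1 list prices (0.01 USD per hour per AZ and 0.01 USD per GB for an interface endpoint, 0.045 USD per GB through NAT). Treat the rates and the function name as assumptions to verify against current pricing.

```python
HOURS_PER_MONTH = 730  # common billing approximation (assumption)


def endpoint_breakeven_gb(azs: int,
                          nat_per_gb: float = 0.045,
                          endpoint_hourly: float = 0.01,
                          endpoint_per_gb: float = 0.01) -> float:
    """GB per month above which one interface endpoint is cheaper than NAT."""
    if nat_per_gb <= endpoint_per_gb:
        raise ValueError("endpoint never breaks even at these rates")
    fixed = azs * endpoint_hourly * HOURS_PER_MONTH
    return fixed / (nat_per_gb - endpoint_per_gb)


if __name__ == "__main__":
    # An endpoint deployed in three AZs, under the assumed rates.
    print(f"break-even: {endpoint_breakeven_gb(3):.0f} GB per month")
```

At these assumed rates an interface endpoint pays for itself after a few hundred GB per month, so high-volume services such as ECR and CloudWatch Logs clear the bar easily, while a rarely used API may not.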

Trade-off considered

Topology-aware routing can hurt availability if all of a zone’s endpoints are unhealthy or overloaded. We enabled it only for services with enough replicas in each zone to handle the load alone (minimum 3 replicas per AZ for opted-in services). Services with fewer replicas keep the default zone-agnostic routing. This kept availability intact while still capturing most of the savings.
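The opt-in rule is simple enough to express as a check. This is a minimal sketch of the rule as described, with hypothetical zone names and function name. It is not a Kubernetes API.

```python
def can_opt_in(ready_replicas_by_zone, zones, min_per_zone=3):
    """True if every zone has at least `min_per_zone` ready replicas."""
    return all(ready_replicas_by_zone.get(zone, 0) >= min_per_zone
               for zone in zones)


ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names

print(can_opt_in({"zone-a": 3, "zone-b": 4, "zone-c": 3}, ZONES))  # True
print(can_opt_in({"zone-a": 5, "zone-b": 5}, ZONES))               # False
```

The second case fails even with ten replicas in total, because one zone has none. The rule is about the weakest zone, not the total replica count.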

Compute placement: dedicated node pools, Karpenter, taints

We replaced the single node group with three pools: a stateless pool managed by Karpenter (with Spot for non-critical, on-demand for critical), a stateful pool on-demand for databases and queues, and a system pool for ingress, monitoring, and add-ons. Each pool has labels and taints so the scheduler cannot mix them.
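Labels and taints do different jobs here: a node selector pulls a pod toward its pool, and a taint keeps everything else out. The sketch below is a deliberately simplified model of that interaction (exact-match tolerations only, `NoSchedule` effect only). It is not the real scheduler logic, and the `pool` key is an assumed naming convention.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass
class Node:
    labels: dict
    taints: list = field(default_factory=list)


@dataclass
class Pod:
    node_selector: dict = field(default_factory=dict)
    tolerations: list = field(default_factory=list)  # taints this pod tolerates


def can_schedule(pod: Pod, node: Node) -> bool:
    """Simplified check: selector must match and every taint must be tolerated."""
    selector_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = all(taint in pod.tolerations for taint in node.taints)
    return selector_ok and taints_ok


stateful_taint = Taint("pool", "stateful")
stateful_node = Node(labels={"pool": "stateful"}, taints=[stateful_taint])

api_pod = Pod(node_selector={"pool": "stateless"})
db_pod = Pod(node_selector={"pool": "stateful"}, tolerations=[stateful_taint])

print(can_schedule(api_pod, stateful_node))  # False
print(can_schedule(db_pod, stateful_node))   # True
```

A toleration alone only permits placement on the tainted pool. It is the selector that guarantees the database pod does not drift onto a cheaper pool.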

Compute itself: right-size with VPA recommendations, not eyeballs

We deployed Vertical Pod Autoscaler in recommendation mode (not auto-apply) and Goldilocks for human-readable suggestions. The platform team reviewed the recommendations service by service, accepting or adjusting them through pull requests instead of letting the VPA modify pods live. This kept the team in the loop and avoided the classic pitfall of VPA causing pod restarts during traffic peaks.
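To show the kind of reasoning a reviewer applies to a recommendation, here is a simplified sketch that derives a memory request from observed usage: take a high percentile and add headroom. This is not the VPA recommender's algorithm. The 95th percentile, the 15% headroom, and the sample data are all assumptions.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_request_mib(usage_samples_mib, pct: int = 95, headroom: float = 0.15) -> int:
    """Suggest a memory request: high percentile of usage plus headroom."""
    return math.ceil(percentile(usage_samples_mib, pct) * (1 + headroom))


# Hypothetical usage samples in MiB: 1, 2, ..., 100.
samples = list(range(1, 101))
print(suggest_request_mib(samples))  # 110
```

Comparing a suggestion like this against the current request in a pull request makes the gap visible service by service, which is hard to see when requests were doubled across the board.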

Stack and tooling

Compute: AWS EKS, Karpenter for stateless workloads with Spot diversity (8+ instance types across m6i, m6a, c6i, r6i families), managed node groups on-demand for stateful and system pools
Networking: VPC Endpoints for S3, ECR (API and Docker registry), STS, EC2, CloudWatch Logs, Secrets Manager, and KMS, plus topology-aware routing on selected services
Autoscaling: HPA on custom metrics where CPU is misleading, Karpenter for node-level scaling, scheduled scale-down for non-production environments at night
Right-sizing: VPA in recommendation mode plus Goldilocks dashboard, recommendations applied via PR review
Cost observability: OpenCost integrated with Prometheus, dashboards per namespace and per service, weekly cost diff report posted to Slack
Storage: Migrated from default gp2 volumes to gp3 with explicit IOPS, removed orphaned snapshots and unused volumes during audit
Databases: Right-sized RDS instances, moved non-production to burstable db.t4g instances, scheduled stop for dev/staging
GitOps: ArgoCD for cluster changes, Atlantis for Terraform PRs touching AWS resources
Policy: OPA Gatekeeper rules requiring cost-allocation labels on every new namespace and workload

Implementation details

A few decisions deserve more depth because they show up in almost every cost-optimization engagement.

Spot, but only where the workload can absorb it

Spot saves 60 to 70% on compute, but AWS can reclaim Spot instances with only a 2-minute warning, taking their pods with them. We classified workloads into three categories: latency-critical (no Spot), tolerant (mixed Spot and on-demand), and best-effort (mostly Spot). Karpenter handles the diversity automatically once the constraints are expressed as node selectors and pod disruption budgets.
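The effect of the classification on the bill can be estimated with a blended-cost calculation. In the sketch below, the Spot share per class (0%, 50%, 90%) and the 65% discount are illustrative assumptions chosen to match the categories above, not measured values from this engagement.

```python
# Assumed share of each class's capacity that runs on Spot.
SPOT_SHARE = {
    "latency-critical": 0.0,
    "tolerant": 0.5,
    "best-effort": 0.9,
}


def blended_cost(on_demand_usd: float, workload_class: str,
                 spot_discount: float = 0.65) -> float:
    """Cost after moving the class's Spot share to discounted capacity."""
    share = SPOT_SHARE[workload_class]
    return on_demand_usd * (1 - share * spot_discount)


for name in SPOT_SHARE:
    print(name, round(blended_cost(1000, name), 2))
```

The saving scales with the Spot share, so the classification work is what determines how much of the headline discount a cluster actually sees.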

NodeAffinity and PodAntiAffinity that actually work

Topology spread constraints and topology-aware routing have to be configured together, because routing can only keep traffic local if every zone has endpoints to send it to. We use topologySpreadConstraints with maxSkew: 1 to keep replicas distributed across zones, and rely on topology-aware routing to keep traffic local. The combination gives both reliability and savings.
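The maxSkew rule can be read as a simple check at placement time: the pod count in the target zone after placement, minus the lowest count in any zone, must not exceed maxSkew. The sketch below models only that check, under the assumption that every eligible zone is listed in the input. The real scheduler also considers node eligibility and other constraints.

```python
def placement_allowed(pods_by_zone, target_zone, max_skew=1):
    """Simplified maxSkew check for placing one more pod in `target_zone`."""
    global_min = min(pods_by_zone.values())
    skew = pods_by_zone[target_zone] + 1 - global_min
    return skew <= max_skew


current = {"zone-a": 2, "zone-b": 2, "zone-c": 1}  # hypothetical replica counts

print(placement_allowed(current, "zone-a"))  # False: 3 - 1 = 2 exceeds maxSkew
print(placement_allowed(current, "zone-c"))  # True: 2 - 1 = 1
```

With maxSkew: 1 the next replica is forced into the emptiest zone, which keeps the per-zone replica counts that topology-aware routing depends on.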

Scheduled scale-down for non-production

Dev and staging environments do not need to run at full capacity overnight or on weekends. A KEDA cron scaler scales selected deployments down to zero outside business hours, with a clear in-cluster banner explaining what is paused. This alone saved about 2,000 USD per month on non-prod compute.
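The schedule logic itself is small. The sketch below shows the decision a cron-style scaler encodes and how much of the week the workloads stay up. The business hours (08:00 to 20:00, Monday to Friday) are an assumption, and the code uses naive local datetimes, so a real schedule would need an explicit timezone.

```python
from datetime import datetime


def desired_replicas(now: datetime, normal_replicas: int,
                     start_hour: int = 8, end_hour: int = 20) -> int:
    """Normal replica count during weekday business hours, zero otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0
    in_hours = start_hour <= now.hour < end_hour
    return normal_replicas if (is_weekday and in_hours) else 0


def weekly_on_fraction(start_hour: int = 8, end_hour: int = 20, days: int = 5) -> float:
    """Fraction of the week during which the workloads are running."""
    return (end_hour - start_hour) * days / (24 * 7)


print(desired_replicas(datetime(2024, 1, 1, 10, 0), 4))  # Monday morning: 4
print(desired_replicas(datetime(2024, 1, 6, 10, 0), 4))  # Saturday: 0
print(f"{weekly_on_fraction():.0%} of the week running")
```

Under these assumed hours the environments run for roughly a third of the week, which is where the non-production saving comes from.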

A cost-allocation tag on every workload

We added an OPA Gatekeeper rule requiring every Namespace and every Deployment to carry a team and a cost-center label. Karpenter and OpenCost propagate those labels into AWS resources, so the AWS Cost Explorer view by tag now actually maps to teams. This made monthly cost reviews easy enough that they happen consistently instead of when the bill spikes.
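The rule is easy to state outside of Gatekeeper as well, which is useful for checking manifests in CI before they reach the cluster. This is a plain Python sketch of the same check, not Rego and not the policy used on the engagement. The exact label keys `team` and `cost-center` are assumed from the description above.

```python
REQUIRED_LABELS = ("team", "cost-center")   # assumed label keys
ENFORCED_KINDS = {"Namespace", "Deployment"}


def missing_labels(manifest: dict) -> list:
    """Return the required labels that are absent or empty on a manifest."""
    if manifest.get("kind") not in ENFORCED_KINDS:
        return []
    labels = manifest.get("metadata", {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


# Hypothetical manifest missing its cost-center label.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "billing-service", "labels": {"team": "payments"}},
}
print(missing_labels(deployment))  # ['cost-center']
```

Rejecting the manifest at review time gives the author a faster signal than an admission error at deploy time, while the in-cluster policy remains the backstop.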

Results and numbers

The full optimization took eight weeks from kickoff to a stable new steady state. The numbers below are the rolling 30-day view three months after go-live.

Total monthly spend: -40% ($45k to $27k per month)
NAT Gateway cost: -85% ($8k to $1.2k per month)
Cross-AZ traffic: -70% (topology-aware routing)
Compute right-sizing: -35% (VPA-guided + Karpenter)
p99 latency: flat (no regression after changes)
Reliability: +0.3% (uptime improved, not degraded)
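The headline percentages follow directly from the dollar figures, as this short check shows. The helper name is illustrative.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


print(round(pct_change(45_000, 27_000)))  # -40
print(round(pct_change(8_000, 1_200)))    # -85
```

The NAT Gateway line alone accounts for 6,800 USD of the 18,000 USD monthly saving, which is more than a third of the total.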

Bottom line: The platform now grows with traffic instead of growing faster than traffic. The team has a weekly cost diff report on Slack and a clear understanding of which workload owns which dollar, which means cost-conscious design is now a habit, not a heroic effort.

Lessons learned

These are patterns we see in almost every Kubernetes cost engagement.

Fix routing before compute. Optimizing compute when 20% of your bill is NAT and cross-AZ is a waste of effort. Once routing is clean, compute decisions become easier because the noise floor drops.

VPA in auto mode is dangerous. Let it recommend, not act. Applying recommendations through pull requests gives you a review step, a paper trail, and a way to roll back. We have seen VPA auto-apply cause cascading restarts under load in unrelated incidents.

Spot is not free money. It is a 60-70% discount in exchange for interruption tolerance you have to actually have. PDBs, graceful shutdown, and stateless design have to be in place first. Otherwise Spot causes more incidents than it saves dollars.

Cost tags are organizational, not technical. The hardest part of cost allocation is not the tooling, it is getting every team to agree on what a “team” and a “cost center” are, and then enforcing the convention. Gatekeeper makes the enforcement automatic, but the conversation has to happen first.

Weekly diffs beat monthly reviews. Monthly cost reviews mostly find regrets. A weekly diff posted to Slack catches a 2x overnight spike the morning after it happens. The conversation shifts from “why did the bill go up last month” to “what did we ship yesterday that costs more”.
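A weekly diff does not need much machinery. The sketch below compares two per-service cost snapshots and flags anything that at least doubled. The snapshot format, the service names, and the 2x threshold are assumptions for illustration, and posting the result to Slack is left out.

```python
def cost_spikes(last_week: dict, this_week: dict, ratio: float = 2.0) -> dict:
    """Services whose cost grew by at least `ratio` week over week."""
    spikes = {}
    for service, current in this_week.items():
        previous = last_week.get(service)
        # Services with no previous cost are skipped here; a real report
        # would probably list them separately as "new".
        if previous and current / previous >= ratio:
            spikes[service] = (previous, current)
    return spikes


# Hypothetical weekly totals in USD.
last_week = {"billing-service": 400.0, "search": 900.0}
this_week = {"billing-service": 950.0, "search": 910.0, "exports": 50.0}
print(cost_spikes(last_week, this_week))
```

Because the report is keyed by service, a flagged spike already points at an owner and can be matched against that week's deploys.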

When this fits your team

You are running Kubernetes at scale on a public cloud and the bill is growing faster than your traffic
You cannot answer the question “how much does service X cost us per month” with confidence
NAT Gateway, data transfer, or compute over-provisioning are line items you have heard about but never quantified
You want to optimize without sacrificing reliability or developer velocity, and without freezing feature work for months
You want a senior DevOps and SRE partner who has shipped this exact pattern before, not a generic cost-cutting consultancy

Want this kind of review for your stack?

Start with a 7-day Infrastructure Audit ($4,500 fixed) to scope the work and identify the highest-impact fixes, or book a free 30-min call to see if we are a fit.
