Production-Grade RPC Infrastructure for Web3

How we designed and operated public RPC endpoints for a Web3 protocol handling 100M+ monthly requests, with 99.95% uptime, sub-500ms p99 latency and WAF protection that blocks 40% of abusive traffic at the edge.

Kubernetes, AWS EKS, RPC Nodes, WAF, Cloudflare, Web3

In this case study

TL;DR
Client context
The challenge in detail
Discovery and audit phase
Architecture and key decisions
Stack and tooling
Implementation details
Results and numbers
Lessons learned
When this fits your team

TL;DR

A Series A Web3 protocol was running RPC endpoints that broke under load, leaked operational cost through unprotected abuse, and missed the reliability targets that mattered for partner integrations. We redesigned the public RPC layer end to end. The result: p99 latency dropped from 1.2 seconds to 380 ms, uptime climbed from 99.2% to 99.95%, and around 40% of incoming requests are now blocked at the edge before they ever reach a node, saving roughly 8,000 USD per month in wasted compute.

Client context

The client is a Series A Web3 infrastructure company operating a public protocol with around 100 million RPC requests per month and growing roughly 3x quarter over quarter. They have an internal team of eight engineers, split between protocol and platform, and partner integrations (wallets, indexers, dApps) depend on their public endpoints staying available and fast. Downtime translates into partner SLAs being breached and, more concretely, into Discord and Telegram lighting up within minutes.

Before our engagement they were running everything on a single managed Kubernetes cluster with one large node group, RPC nodes deployed alongside application workloads, no WAF, and a single Cloudflare proxy in front. Observability was a default Prometheus stack with no SLOs, and the alerts were mostly static thresholds on CPU.

The challenge in detail

“Make the RPC stable” is an easy ask to receive and a hard one to deliver, because the symptoms hide several independent problems. We deconstructed it into four root causes that needed different solutions.

1. No traffic filtering at the edge

Public RPC endpoints attract a particular kind of load: bots scraping the chain, abusive integrations that misconfigure retry logic and hammer the endpoint, and the occasional intentional DoS. Without a WAF and without rate limiting, every junk request consumed a node connection, a bit of CPU on the RPC service, and a bit of disk I/O on the node itself.

2. Mixed workload placement

RPC nodes (Geth and Erigon in this case) need consistent disk I/O and large memory. When they share node groups with stateless application pods, the Kubernetes scheduler ends up bin-packing them next to noisy neighbors. Latency on the RPC layer would spike for no obvious reason, and the team had no tooling to correlate it with what was running on the same EC2 instance.

3. No layered protection or routing

All traffic terminated TLS at a single Cloudflare proxy, then went straight to a Kubernetes Service of type LoadBalancer. There was no L7 routing, no separation of read and write endpoints, and no graceful degradation when a single node was syncing or slow.

4. Observability that did not match the questions being asked

The dashboards showed CPU and memory. The actual questions during incidents were: “is a specific node out of sync?”, “is one validator running slow?”, “what is the p99 for eth_getLogs right now?”. None of those were instrumented. Time to resolve incidents was high because every incident started with twenty minutes of building the right dashboard.

Discovery and audit phase

The first two weeks were an audit, not a build. We did three things in parallel.

First, we instrumented the existing stack with the metrics we needed to answer real questions: per-method latency histograms, per-node sync lag, request volume by source ASN, and 4xx/5xx rates broken down by RPC method. This alone surfaced patterns nobody had seen before. For example, two-thirds of failed requests were a handful of integrations retrying on transient errors with no backoff.
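To make the "no backoff" finding concrete, here is a minimal sketch of what a well-behaved integration could do between retries: exponential backoff with full jitter. The function name, base delay, and cap are illustrative assumptions, not values from the engagement.

```python
import random


def backoff_delays(max_retries, base=0.25, cap=8.0, rng=None):
    """Return one randomized delay in seconds per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so retries spread out instead of hammering the endpoint in lockstep.
    base and cap are hypothetical defaults.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A client would sleep for each delay in turn before retrying, and give up once the list is exhausted.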

Second, we ran a structured threat model on the public endpoints. Not the corporate-security sort, but the practical one: what does an attacker get by hitting this endpoint with junk traffic, what does a misconfigured integration cost us per hour, what happens if a single node falls behind the chain head by ten blocks.

Third, we audited cost: how much of the existing compute was actually serving real users versus serving bots and broken retry loops. The number was uncomfortable. Over a third of the EC2 spend on the cluster was processing requests that should never have made it past the edge.

Architecture and key decisions

The new architecture has four layers, each with a clear job.

Edge: Cloudflare plus AWS WAF

Cloudflare handles DNS, DDoS mitigation at the transport layer, and edge caching for read-heavy methods like eth_chainId and eth_blockNumber, whose results are either constant or change on a predictable cadence. Behind Cloudflare we put an AWS Application Load Balancer with AWS WAF, configured with managed rules plus custom rate limit rules: 100 requests per minute per IP and 1,000 per minute per ASN, with a stricter rule for known abusive prefixes. Geo-blocking is enabled for restricted regions per compliance requirements.
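As a rough illustration of how the two limits interact (and not of how AWS WAF evaluates rate-based rules internally), the sketch below counts requests per IP and per ASN in fixed one-minute windows. The class name and the fixed-window approach are our assumptions; only the 100 and 1,000 per minute figures come from the setup described above.

```python
from collections import defaultdict

IP_LIMIT_PER_MIN = 100     # per-IP limit from the text
ASN_LIMIT_PER_MIN = 1000   # per-ASN limit from the text


class FixedWindowLimiter:
    """Hypothetical dual-key limiter: a request must pass both limits."""

    def __init__(self, ip_limit=IP_LIMIT_PER_MIN, asn_limit=ASN_LIMIT_PER_MIN,
                 window_s=60):
        self.ip_limit = ip_limit
        self.asn_limit = asn_limit
        self.window_s = window_s
        # Counters are never evicted here; a real limiter would expire old windows.
        self.counts = defaultdict(int)

    def allow(self, ip, asn, now_s):
        window = int(now_s // self.window_s)
        ip_key = ("ip", ip, window)
        asn_key = ("asn", asn, window)
        if (self.counts[ip_key] >= self.ip_limit
                or self.counts[asn_key] >= self.asn_limit):
            return False
        self.counts[ip_key] += 1
        self.counts[asn_key] += 1
        return True
```

The per-ASN counter is what catches traffic spread thinly across many addresses: each IP can stay under its own limit while the network as a whole does not.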

Trade-off considered

We evaluated using only Cloudflare for WAF, which would have been simpler. We chose AWS WAF in addition because we wanted the rate-limit decisions to be visible in CloudWatch alongside ALB metrics and feed the same observability pipeline. Two layers cost slightly more but the operational visibility paid for itself within the first incident.

Cluster: dedicated node groups for RPC

We split the cluster into three separate node groups: one for stateless application workloads on standard memory-optimized instances, one for RPC nodes on storage-optimized instances with provisioned IOPS gp3 volumes, and one for indexer and API services. Each node group carries its own taints, and only the matching workloads carry the corresponding tolerations, so the scheduler never mixes them.

L7 routing inside Kubernetes

NGINX Ingress Controller handles routing inside the cluster. Read methods route to a pool of full nodes; write methods (transaction submission) route to a smaller, more conservative pool with stricter health checks. A custom Envoy filter (added later) does header-based routing for partners that need archive node access.
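The routing decision itself is simple enough to sketch in a few lines. This is an illustration of the logic only, not the actual ingress or Envoy configuration; the pool names, the header name, and the exact set of write methods are assumptions.

```python
# Methods treated as writes (transaction submission). Illustrative set.
WRITE_METHODS = {"eth_sendRawTransaction", "eth_sendTransaction"}

# Hypothetical header a partner route would set for archive access.
ARCHIVE_HEADER = "x-archive-access"


def pick_pool(request, headers):
    """Return the upstream pool name for one JSON-RPC request.

    Assumes header names have already been lowercased by the proxy.
    """
    if headers.get(ARCHIVE_HEADER) == "true":
        return "archive"
    if request.get("method") in WRITE_METHODS:
        return "write"
    return "read"
```

Checking the archive header first means a partner with archive access is routed there regardless of method; whether writes should be excluded from that route is a policy choice this sketch does not settle.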

Indexer and custom API

The indexer is a separate service backed by Postgres on RDS, behind its own internal endpoint. This isolates heavy historical queries (like eth_getLogs over wide block ranges) from the latency-sensitive path. Partners that need historical data get a separate API key and route, with its own rate limits.
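One way to decide when an eth_getLogs query counts as "wide" is to measure the block span of its filter. The sketch below is a hedged illustration: the threshold, the function names, and the choice to treat the safe and finalized tags as the chain head (which slightly overestimates their height) are all assumptions.

```python
MAX_HOT_PATH_SPAN = 2_000  # hypothetical threshold, in blocks

# Tags resolved to the current head; omitted fields default to "latest".
_HEAD_TAGS = {None, "latest", "pending", "safe", "finalized"}


def resolve_block(tag, head):
    """Turn a block tag or hex quantity into a block number."""
    if tag in _HEAD_TAGS:
        return head
    if tag == "earliest":
        return 0
    return int(tag, 16)


def log_query_span(flt, head):
    """Number of blocks an eth_getLogs filter object covers."""
    if "blockHash" in flt:
        return 1  # a blockHash filter targets exactly one block
    start = resolve_block(flt.get("fromBlock"), head)
    end = resolve_block(flt.get("toBlock"), head)
    return max(0, end - start + 1)


def route_get_logs(flt, head, max_span=MAX_HOT_PATH_SPAN):
    """Send wide historical queries to the indexer, the rest to RPC nodes."""
    return "indexer" if log_query_span(flt, head) > max_span else "rpc"
```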

Stack and tooling

The full production stack:

Compute: AWS EKS with mixed instance types via Karpenter for the stateless pools, and pinned instance types (r6id.4xlarge and r6id.8xlarge) for the RPC node pool
Networking: Cloudflare in front of AWS ALB, AWS WAF with managed plus custom rules
Ingress: NGINX Ingress Controller with custom annotations for upstream timeouts tuned per method
RPC clients: Geth for full nodes, Erigon for archive nodes, deployed via custom Helm charts
Storage: EBS gp3 with 10,000 IOPS for archive nodes, gp3 default for full nodes
Observability: Prometheus, Grafana, Loki for logs, Tempo for traces, with SLO dashboards built around per-method latency and per-node sync lag
Alerting: Alertmanager into PagerDuty, with explicit SLO-burn alerts replacing CPU/memory threshold alerts
GitOps: ArgoCD for everything in the cluster, with PR-based promotion across staging and production
IaC: Terraform for AWS resources (VPC, EKS, ALB, WAF, RDS, IAM), with modules per service
Secrets: External Secrets Operator pulling from AWS Secrets Manager

Implementation details

Several decisions deserve more detail because they are the ones that most often go wrong in similar setups.

HPA on QPS, not CPU

The default Horizontal Pod Autoscaler reacts to CPU, which lags for RPC workloads where I/O is the bottleneck. We switched to scaling on requests per second using a custom metric from the Ingress, with the HPA target set to keep each pod under a known safe QPS. This kept autoscaling responsive to actual load instead of reacting after the pods were already saturated.
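The scaling arithmetic follows the formula documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance of 1. The sketch below applies it to a per-pod QPS metric; the replica bounds and the numbers in the usage note are hypothetical, not the production values.

```python
import math


def desired_replicas(current_replicas, avg_qps_per_pod, target_qps_per_pod,
                     min_replicas=2, max_replicas=20, tolerance=0.1):
    """Approximate the HPA calculation for an average-per-pod QPS metric."""
    ratio = avg_qps_per_pod / target_qps_per_pod
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 pods averaging 300 QPS against a 200 QPS target give a ratio of 1.5, so the function asks for 6 pods; 10 pods averaging 50 QPS scale down to 3.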

WAF rules that actually fire

Generic WAF rules block almost nothing useful for RPC traffic. The custom rules we shipped in this engagement focus on three patterns: per-IP rate limits with burst allowances tuned to legitimate retry patterns, per-ASN rate limits to catch botnets that rotate IPs, and method-level rules that block expensive methods (like archive-only calls) on non-archive routes.
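The idea of a rate limit with a burst allowance can be illustrated with a token bucket. This is a conceptual sketch only: it is not how AWS WAF implements rate-based rules, and the class name and parameters are assumptions.

```python
class TokenBucket:
    """Allow short bursts up to `burst` requests, then a steady `rate_per_s`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)   # start full so an initial burst passes
        self.updated_at = None

    def allow(self, now_s):
        # Refill in proportion to elapsed time, never above capacity.
        if self.updated_at is not None:
            elapsed = max(0.0, now_s - self.updated_at)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now_s
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Sizing the burst to cover a legitimate client's retry pattern lets a brief spike through while still capping sustained throughput at the steady rate.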

Sync lag as a first-class metric

A node that is two blocks behind the chain head is healthy for most queries and quietly wrong for a few critical ones. We export a rpc_node_block_lag_blocks metric per node and use it both to drive readiness probes (any node lagging more than 10 blocks gets removed from the load balancer) and to surface in dashboards alongside latency.
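The readiness check reduces to a small comparison. The sketch below assumes block heights arrive as hex strings (as eth_blockNumber returns them) and that the reference head is the highest block seen across the pool; both the helper names and that choice of reference are assumptions, while the 10-block threshold is the one described above.

```python
MAX_LAG_BLOCKS = 10  # threshold from the text


def pool_head(block_hexes):
    """Assumed reference head: the highest block reported by any node."""
    return max(int(b, 16) for b in block_hexes)


def block_lag(node_block_hex, head_block):
    """Blocks the node is behind the reference head (never negative)."""
    return max(0, head_block - int(node_block_hex, 16))


def is_ready(node_block_hex, head_block, max_lag=MAX_LAG_BLOCKS):
    """A node stays in rotation unless it lags by more than max_lag blocks."""
    return block_lag(node_block_hex, head_block) <= max_lag
```

The same lag value can be exported as the rpc_node_block_lag_blocks gauge, so the probe and the dashboard agree on what "behind" means.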

Cost-aware deployment

Archive nodes are expensive: large disks, large memory, long sync times. We run fewer of them, behind a separate route, with stricter rate limits. Partners that need archive access authenticate and get routed there explicitly. This single decision avoided having to over-provision the entire RPC pool to handle a few heavy queries.

Results and numbers

Six weeks after the new architecture went live, the metrics that matter looked very different.

Uptime: 99.95%, up from 99.2% (30-day rolling)
p99 latency: 380 ms, down from 1.2 s
p50 latency: 62 ms, down from 180 ms
Edge block rate: ~40% of requests (junk dropped before reaching the cluster)
Compute saved: ~$8k/mo (headroom regained, no over-provisioning)
MTTR: 12 min, down from ~55 min on similar incidents

Bottom line: The platform now handles partner growth without needing to rebuild itself every quarter. The team spends evenings building features instead of paging through CPU dashboards.

Lessons learned

A few things became clearer over the engagement and are worth surfacing because they apply to almost every Web3 infra setup we have seen.

WAF rules without telemetry are guesswork. The first cut of WAF rules blocked too much legitimate traffic. The fix was not to write smarter rules; it was to log every blocked request with enough context to tune them. After two weeks of tuning, the false-positive rate was below 0.1%.

Archive nodes are not the same product as full nodes. Trying to treat them as one pool is a recipe for either over-paying or breaking historical queries. Separate them by route, rate limit, and SLO from day one.

The expensive RPC method is not always the obvious one. eth_getLogs with wide ranges is the obvious offender. The less obvious one was repeated eth_call against the same contract from misconfigured indexers, which we ended up caching for short TTLs at the edge.
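Short-TTL caching of identical calls can be sketched as follows. The production cache sits at the edge; this in-process version only illustrates the logic, and the class name, key format, and injectable clock are assumptions. Note that caching a call made against the "latest" block can serve a result that is stale by up to one TTL.

```python
import json
import time


class TTLCache:
    """Cache JSON-RPC results keyed by method and params for a short time."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    @staticmethod
    def _key(method, params):
        # sort_keys makes logically equal param objects produce the same key
        return method + ":" + json.dumps(params, sort_keys=True)

    def get_or_fetch(self, method, params, fetch):
        key = self._key(method, params)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = fetch()  # cache miss or expired: go to the node
        self.entries[key] = (now + self.ttl_s, value)
        return value
```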

Sync lag belongs in the readiness probe. A node that responds quickly but is five blocks behind is worse than a node that takes 200 ms more and is current. Tying readiness to sync lag avoided a class of incidents that were impossible to debug after the fact.

SLOs make the on-call rotation sane. Switching from threshold alerts to SLO-burn alerts cut alert volume by roughly 70% and put the team on a real on-call rotation instead of one engineer absorbing everything.
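For readers new to SLO-burn alerting, the arithmetic is short. The sketch below uses 99.95% as an illustrative availability target (borrowed from the uptime figure above, not a stated SLO) and a paging threshold of 14.4, a value commonly cited for a one-hour window against a 30-day budget; both are assumptions, as are the function names.

```python
SLO_TARGET = 0.9995  # illustrative target, assumed for this example


def error_budget(slo=SLO_TARGET):
    """Fraction of requests (or time) allowed to fail."""
    return 1.0 - slo


def allowed_downtime_minutes(slo=SLO_TARGET, days=30):
    """Error budget expressed as minutes of full downtime per period."""
    return days * 24 * 60 * error_budget(slo)


def burn_rate(failed, total, slo=SLO_TARGET):
    """How many times faster than sustainable the budget is being spent."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo)


def should_page(long_window_burn, short_window_burn, threshold=14.4):
    """Page only when both windows burn fast, which filters brief blips."""
    return long_window_burn >= threshold and short_window_burn >= threshold
```

At a 99.95% target the 30-day budget is about 21.6 minutes of downtime, which is why a sustained burn rate in the double digits warrants a page while a single CPU spike does not.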

When this fits your team

You are operating public RPC endpoints, validators, or any blockchain infrastructure that needs to stay up for partner integrations
You suspect a significant portion of your compute is processing abusive or misconfigured traffic and you cannot quantify it
Your incidents are slow to resolve because the dashboards do not answer the questions you are asking
You are growing fast and your current architecture is hitting its limits every quarter
You want a senior DevOps and SRE partner who has shipped this exact pattern before, without committing to a full-time hire

Need this for your project?

Start with a 7-day Infrastructure Audit ($4,500 fixed) to scope the work and identify the highest-impact fixes, or book a free 30-min call to see if we are a fit.
