Production-Grade RPC Infrastructure for Web3
How we designed and operated public RPC endpoints for a Web3 protocol handling 100M+ monthly requests, with 99.95% uptime, sub-500ms p99 latency and WAF protection that blocks 40% of abusive traffic at the edge.
A Series A Web3 protocol was running RPC endpoints that broke under load, burned compute on unfiltered abusive traffic, and missed the reliability targets its partner integrations depended on. We redesigned the public RPC layer end to end. The result: p99 latency dropped from 1.2 seconds to 380 ms, uptime climbed from 99.2% to 99.95%, and around 40% of incoming requests are now blocked at the edge before they ever reach a node, saving roughly 8,000 USD per month in wasted compute.
Client context
The client is a Series A Web3 infrastructure company operating a public protocol with around 100 million RPC requests per month and growing roughly 3x quarter over quarter. They have an internal team of eight engineers, split between protocol and platform, and partner integrations (wallets, indexers, dApps) depend on their public endpoints staying available and fast. Downtime translates into partner SLAs being breached and, more concretely, into Discord and Telegram lighting up within minutes.
Before our engagement they were running everything on a single managed Kubernetes cluster with one large node group, RPC nodes deployed alongside application workloads, no WAF, and a single Cloudflare proxy in front. Observability was a default Prometheus stack with no SLOs, and the alerts were mostly CPU thresholds.
The challenge in detail
“Make the RPC stable” is an easy ask to receive and a hard one to deliver, because the symptoms hide several independent problems. We deconstructed it into four root causes that needed different solutions.
1. No traffic filtering at the edge
Public RPC endpoints attract a particular kind of load: bots scraping the chain, abusive integrations that misconfigure retry logic and hammer the endpoint, and the occasional intentional DoS. Without a WAF and without rate limiting, every junk request consumed a node connection, a bit of CPU on the RPC service, and a bit of disk I/O on the node itself.
2. Mixed workload placement
RPC nodes (Geth and Erigon in this case) need consistent disk I/O and large memory. When they share node groups with stateless application pods, the Kubernetes scheduler ends up bin-packing them next to noisy neighbors. Latency on the RPC layer would spike for no obvious reason, and the team had no tooling to correlate it with what was running on the same EC2 instance.
3. No layered protection or routing
All traffic terminated TLS at a single Cloudflare proxy, then went straight to a Kubernetes Service of type LoadBalancer. There was no L7 routing, no separation of read vs write endpoints, no graceful degradation when a single node was syncing or slow.
4. Observability that did not match the questions being asked
The dashboards showed CPU and memory. The actual questions during incidents were: “is a specific node out of sync?”, “is one validator running slow?”, “what is the p99 for eth_getLogs right now?”. None of those were instrumented. Time to resolve incidents was high because every incident started with twenty minutes of building the right dashboard.
Discovery and audit phase
The first two weeks were an audit, not a build. We did three things in parallel.
First, we instrumented the existing stack with the metrics we needed to answer real questions: per-method latency histograms, per-node sync lag, request volume by source ASN, and 4xx/5xx rates broken down by RPC method. This alone surfaced patterns nobody had seen before. For example, two-thirds of failed requests were a handful of integrations retrying on transient errors with no backoff.
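The retry pattern behind those failures is easy to show. The sketch below is illustrative rather than code from the engagement: it shows exponential backoff with full jitter, one common way for an integration to retry transient errors without hammering the endpoint. The function names, the 250 ms base delay, and the 30 second cap are all assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 0.25, cap: float = 30.0, rng=random) -> float:
    """Return a delay in seconds for the given retry attempt (0-based).

    Full jitter: pick uniformly between 0 and the exponential ceiling,
    so that many clients retrying at once do not synchronize.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(max_attempts: int, seed: int = 0) -> list:
    """Build a reproducible list of delays, one per attempt (for inspection or tests)."""
    rng = random.Random(seed)
    return [backoff_delay(i, rng=rng) for i in range(max_attempts)]
```

A client following this schedule waits at most 0.25 s before the first retry, 0.5 s before the second, and so on up to the cap, instead of retrying in a tight loop.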
Second, we ran a structured threat model on the public endpoints. Not the corporate-security sort, but the practical one: what does an attacker get by hitting this endpoint with junk traffic, what does a misconfigured integration cost us per hour, what happens if a single node falls behind the chain head by ten blocks.
Third, we audited cost: how much of the existing compute was actually serving real users versus serving bots and broken retry loops. The number was uncomfortable. Over a third of the EC2 spend on the cluster was processing requests that should never have made it past the edge.
Architecture and key decisions
The new architecture has four layers, each with a clear job.
Edge: Cloudflare plus AWS WAF
Cloudflare handles DNS, transport-layer DDoS mitigation, and edge caching for read-heavy methods whose results change predictably: eth_chainId is constant, and eth_blockNumber changes only when a new block arrives. Behind Cloudflare we put an AWS Application Load Balancer with AWS WAF, configured with managed rules plus custom rate limit rules: 100 requests per minute per IP and 1,000 per minute per ASN, with a stricter rule for known abusive prefixes. Geo-blocking is enabled for restricted regions per compliance requirements.
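To make the two-tier limit concrete, here is a minimal in-memory model of the policy: a request is admitted only if both its source IP and its source ASN are under their per-minute limits. This is a conceptual sketch and not AWS WAF configuration. The class names are hypothetical, and the sliding-window bookkeeping is an assumption about how such a policy could be evaluated.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Counts hits per key over a sliding time window."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key -> timestamps of admitted requests

    def would_allow(self, key, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        return len(hits) < self.limit

    def record(self, key, now: float) -> None:
        self._hits[key].append(now)


class EdgeRateLimiter:
    """Admit a request only if both the per-IP and the per-ASN limits allow it."""

    def __init__(self, per_ip_limit: int = 100, per_asn_limit: int = 1000):
        self.per_ip = SlidingWindowLimiter(per_ip_limit)
        self.per_asn = SlidingWindowLimiter(per_asn_limit)

    def admit(self, ip: str, asn: int, now: float) -> bool:
        if self.per_ip.would_allow(ip, now) and self.per_asn.would_allow(asn, now):
            # Only admitted requests count against the limits.
            self.per_ip.record(ip, now)
            self.per_asn.record(asn, now)
            return True
        return False
```

The per-ASN tier is what catches a botnet spreading load across many addresses: ten IPs in one ASN can each stay under 100 requests per minute and still exhaust the shared 1,000 budget.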
We evaluated using only Cloudflare for WAF, which would have been simpler. We chose AWS WAF in addition because we wanted the rate-limit decisions to be visible in CloudWatch alongside ALB metrics and feed the same observability pipeline. Two layers cost slightly more but the operational visibility paid for itself within the first incident.
Cluster: dedicated node groups for RPC
We split the cluster into separate node groups: one for stateless application workloads on standard memory-optimized instances, one for RPC nodes on storage-optimized instances with provisioned IOPS gp3 volumes, and one for indexer and API services. Each pool has its own taints and tolerations so the scheduler never mixes them.
L7 routing inside Kubernetes
NGINX Ingress Controller handles routing inside the cluster. Read methods route to a pool of full nodes, write methods (transaction submission) route to a smaller, more conservative pool with stricter health checks. A custom Envoy filter (added later) does header-based routing for partners that need archive node access.
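The routing decision itself is simple to express. The sketch below classifies a JSON-RPC request body into one of three upstream pools. It is an illustration of the rule, not the ingress configuration: the pool names, the set of write methods, and the `archive_access` flag (standing in for the partner header) are assumptions, and the source does not describe how the real controller inspects the method.

```python
import json

# Assumption: transaction submission is what counts as a "write" here.
WRITE_METHODS = frozenset({"eth_sendRawTransaction", "eth_sendTransaction"})


def pool_for_request(body: bytes, archive_access: bool = False) -> str:
    """Pick an upstream pool name for a JSON-RPC request body.

    Handles both single calls and batches (a JSON array of calls).
    A batch containing any write method goes to the write pool.
    """
    payload = json.loads(body)
    calls = payload if isinstance(payload, list) else [payload]
    methods = {c.get("method") for c in calls if isinstance(c, dict)}
    if methods & WRITE_METHODS:
        return "write-pool"
    if archive_access:
        return "archive-pool"
    return "read-pool"
```

Sending a mixed batch to the write pool is the conservative choice, since that pool has the stricter health checks.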
Indexer and custom API
The indexer is a separate service backed by Postgres on RDS, behind its own internal endpoint. This isolates heavy historical queries (like eth_getLogs over wide block ranges) from the latency-sensitive path. Partners that need historical data get a separate API key and route, with its own rate limits.
Stack and tooling
The full production stack:
- Compute: AWS EKS with mixed instance types via Karpenter for the stateless pools, and pinned instance types (r6id.4xlarge and r6id.8xlarge) for the RPC node pool
- Networking: Cloudflare in front of AWS ALB, AWS WAF with managed plus custom rules
- Ingress: NGINX Ingress Controller with custom annotations for upstream timeouts tuned per method
- RPC clients: Geth for full nodes, Erigon for archive nodes, deployed via custom Helm charts
- Storage: EBS gp3 with 10,000 IOPS for archive nodes, gp3 default for full nodes
- Observability: Prometheus, Grafana, Loki for logs, Tempo for traces, with SLO dashboards built around per-method latency and per-node sync lag
- Alerting: Alertmanager into PagerDuty, with explicit SLO-burn alerts replacing CPU/memory threshold alerts
- GitOps: ArgoCD for everything in the cluster, with PR-based promotion across staging and production
- IaC: Terraform for AWS resources (VPC, EKS, ALB, WAF, RDS, IAM), with modules per service
- Secrets: External Secrets Operator pulling from AWS Secrets Manager
Implementation details
Several decisions deserve more detail because they are the ones that most often go wrong in similar setups.
HPA on QPS, not CPU
The default Horizontal Pod Autoscaler reacts to CPU, which lags for RPC workloads where I/O is the bottleneck. We switched to scaling on requests per second using a custom metric from the Ingress, with the HPA target set to keep each pod under a known safe QPS. This kept autoscaling responsive to actual load instead of reacting after the pods were already saturated.
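For intuition, the replica count the HPA aims for with a per-pod average target can be worked out by hand. The sketch below follows the scaling rule documented for the Horizontal Pod Autoscaler (scale by the ratio of observed to target metric, and skip changes within a tolerance). The 200 QPS target, the replica bounds, and the function name are assumptions for illustration, not values from the engagement.

```python
import math


def desired_replicas(
    current_replicas: int,
    total_qps: float,
    target_qps_per_pod: float,
    min_replicas: int = 2,
    max_replicas: int = 50,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA decision for an average-per-pod QPS target."""
    if current_replicas < 1 or target_qps_per_pod <= 0:
        raise ValueError("need at least one replica and a positive target")
    avg_qps = total_qps / current_replicas
    ratio = avg_qps / target_qps_per_pod
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # ceil(current * ratio) simplifies to ceil(total / target) for an average metric.
    desired = math.ceil(total_qps / target_qps_per_pod)
    return max(min_replicas, min(max_replicas, desired))
```

With four pods serving 1,200 QPS against a 200 QPS target, this yields six replicas. The scale-up is driven by request volume rather than by CPU catching up after the pods are already saturated.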
WAF rules that actually fire
Generic WAF rules block almost nothing useful for RPC traffic. The custom rules we shipped in this engagement focus on three patterns: per-IP rate limits with burst allowances tuned to legitimate retry patterns, per-ASN rate limits to catch botnets that rotate IPs, and method-level rules that block expensive methods (like archive-only calls) on non-archive routes.
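A token bucket is one way to picture what a "burst allowance" means: a client may spend a small reserve of requests at once, then is held to a steady refill rate. The sketch below is a conceptual model, not WAF configuration, and the class name and parameters are hypothetical.

```python
class TokenBucket:
    """Allow short bursts up to `burst` requests, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.tokens = float(burst)  # start full so a fresh client can burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Tuning then comes down to two numbers per client class: a burst large enough to absorb a legitimate retry storm, and a refill rate that matches the sustained limit.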
Sync lag as a first-class metric
A node that is two blocks behind the chain head is healthy for most queries and quietly wrong for a few critical ones. We export a rpc_node_block_lag_blocks metric per node and use it in two places: it drives readiness probes (any node lagging more than 10 blocks is removed from the load balancer), and it appears in dashboards alongside latency.
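The readiness rule can be sketched in a few lines. This is an illustration under stated assumptions: the chain head is estimated as the highest block reported by any node in the pool (the source does not say how the head is determined), and the `NodeStatus` type and function names are hypothetical.

```python
from dataclasses import dataclass

MAX_LAG_BLOCKS = 10  # threshold from the text: more than 10 blocks behind is not ready


@dataclass(frozen=True)
class NodeStatus:
    name: str
    head_block: int


def block_lag(node_head: int, chain_head: int) -> int:
    """Blocks behind the chain head; never negative."""
    return max(0, chain_head - node_head)


def ready_nodes(nodes, max_lag: int = MAX_LAG_BLOCKS):
    """Return the names of nodes that should stay in the load balancer."""
    if not nodes:
        return []
    # Assumption: the best-known head is the highest block any pool member reports.
    chain_head = max(n.head_block for n in nodes)
    return [n.name for n in nodes if block_lag(n.head_block, chain_head) <= max_lag]
```

The same `block_lag` value is what a per-node exporter would publish as the lag metric, so the probe and the dashboard agree by construction.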
Cost-aware deployment
Archive nodes are expensive: large disks, large memory, long sync times. We run fewer of them, behind a separate route, with stricter rate limits. Partners that need archive access authenticate and get routed there explicitly. This single decision avoided having to over-provision the entire RPC pool to handle a few heavy queries.
Results and numbers
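Six weeks after the new architecture went live, the metrics that matter looked very different: p99 latency had dropped from 1.2 seconds to 380 ms, uptime had climbed from 99.2% to 99.95%, around 40% of incoming requests were being blocked at the edge, and wasted compute was down by roughly 8,000 USD per month.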
Lessons learned
A few things became clearer over the engagement and are worth surfacing because they apply to almost every Web3 infra setup we have seen.
WAF rules without telemetry are guesswork. The first cut of WAF rules blocked too much legitimate traffic. The fix was not to write smarter rules but to log every blocked request with enough context to tune them. After two weeks of tuning, the false-positive rate was below 0.1%.
Archive nodes are not the same product as full nodes. Trying to treat them as one pool is a recipe for either over-paying or breaking historical queries. Separate them by route, rate limit, and SLO from day one.
The expensive RPC method is not always the obvious one. eth_getLogs with wide ranges is the obvious offender. The less obvious one was repeated eth_call against the same contract from misconfigured indexers, which we ended up caching for short TTLs at the edge.
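Short-TTL caching of repeated calls can be sketched as follows. This is a minimal in-process model, not the edge configuration used in the engagement: the class name, the 2 second TTL, and the choice to key on a hash of the method plus its parameters are all assumptions.

```python
import hashlib
import json


class TtlCache:
    """Cache JSON-RPC results for a short, fixed time-to-live."""

    def __init__(self, ttl_s: float = 2.0):
        self.ttl_s = ttl_s
        self._store = {}  # cache key -> (expiry time, cached result)

    @staticmethod
    def key(method: str, params) -> str:
        # Canonical JSON so that identical calls always map to the same key.
        canonical = json.dumps([method, params], sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, method: str, params, now: float):
        k = self.key(method, params)
        entry = self._store.get(k)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            del self._store[k]  # expired: evict and report a miss
            return None
        return value

    def put(self, method: str, params, value, now: float) -> None:
        self._store[self.key(method, params)] = (now + self.ttl_s, value)
```

The TTL has to stay short relative to block time, because an eth_call against the latest block can legitimately return a different result once a new block lands.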
Sync lag belongs in the readiness probe. A node that responds quickly but is five blocks behind is worse than a node that takes 200 ms more and is current. Tying readiness to sync lag avoided a class of incidents that were impossible to debug after the fact.
SLOs make the on-call rotation sane. Switching from threshold alerts to SLO-burn alerts cut alert volume by roughly 70% and put the team on a real on-call rotation instead of one engineer absorbing everything.
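For readers new to SLO-burn alerting, the core arithmetic is small. The sketch below computes a burn rate (how fast the error budget is being spent relative to plan) and applies a two-window paging rule. The 99.95% target is borrowed from the uptime figure as an assumption. The 14.4 threshold over a long and a short window follows the commonly cited multi-window pattern from the Google SRE Workbook, not a value stated for this engagement.

```python
def burn_rate(errors: int, total: int, slo: float = 0.9995) -> float:
    """Error rate divided by the error budget; 1.0 means spending exactly on plan."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo
    return (errors / total) / budget


def should_page(long_window, short_window, threshold: float = 14.4, slo: float = 0.9995) -> bool:
    """Page only if both windows burn fast.

    Each window is an (errors, total) pair, for example the last hour and
    the last five minutes. The short window stops alerts from firing long
    after the problem has already cleared.
    """
    long_errors, long_total = long_window
    short_errors, short_total = short_window
    return (
        burn_rate(long_errors, long_total, slo) >= threshold
        and burn_rate(short_errors, short_total, slo) >= threshold
    )
```

A CPU spike that never produces failed requests leaves the burn rate at zero and pages nobody, which is where most of the reduction in alert volume comes from.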
When this fits your team
- You are operating public RPC endpoints, validators, or any blockchain infrastructure that needs to stay up for partner integrations
- You suspect a significant portion of your compute is processing abusive or misconfigured traffic and you cannot quantify it
- Your incidents are slow to resolve because the dashboards do not answer the questions you are asking
- You are growing fast and your current architecture is hitting its limits every quarter
- You want a senior DevOps and SRE partner who has shipped this exact pattern before, without committing to a full-time hire
Need this for your project?
Start with a 7-day Infrastructure Audit ($4,500 fixed) to scope the work and identify the highest-impact fixes, or book a free 30-min call to see if we are a fit.
