Observability and SRE done right: LGTM stack plus OpenTelemetry

How we replaced an alert-fatigued, dashboard-everywhere observability setup with a focused LGTM stack on OTel, cutting alert volume by 79% and MTTR from 45 to 11 minutes, while making on-call sustainable again.

SRE, OpenTelemetry, Grafana, Prometheus, Mimir, Loki, Tempo, SLOs

In this case study

TL;DR
Client context
Why most observability setups quietly fail
Discovery and audit phase
Architecture and key decisions
Stack and tooling
Implementation details
Results and numbers
Lessons learned
When this fits your team

Observability architecture: applications, OpenTelemetry collectors, LGTM stack, Grafana, Alertmanager and PagerDuty

TL;DR

A SaaS platform team had Prometheus and Grafana installed and called that observability. In reality they had threshold alerts firing 120 times a week, eighty dashboards nobody trusted, no traces, no SLOs, and incidents that took 45 minutes to diagnose because the data was in the wrong shape. We rebuilt their observability stack on LGTM (Loki, Grafana, Tempo, Mimir) with OpenTelemetry as the single ingestion path, defined SLOs for the user-facing endpoints, and replaced threshold alerts with multi-window burn-rate alerts. Alert volume dropped 79%, MTTR went from 45 to 11 minutes, and the team got their evenings back.

Client context

The client is a B2B SaaS platform with around 150 microservices, 25 engineers across 6 product teams, and an on-call rotation that everyone dreaded. They had been growing fast and the observability stack had grown organically alongside the product: a Prometheus instance per cluster, a Grafana shared by everyone, two ELK clusters that nobody owned, and a small forest of bespoke dashboards built ad-hoc during past incidents and never cleaned up.

The team was technically capable. The problem was not skill, it was structure. Every team had built dashboards and alerts that made sense to them in isolation, with no shared conventions and no shared ownership. The result was an observability surface area that grew faster than the platform itself.

Why most observability setups quietly fail

“We have observability” usually means “we have Prometheus and Grafana installed”. That is a starting point, not an outcome. We see five failure patterns over and over, and this client had all of them.

1. Alerts on causes, not symptoms

Most of their alerts fired on CPU, memory, or queue depth. Those are causes that may or may not affect users. The result was alert volume that did not correlate with user-visible problems. The on-call would get paged for a CPU spike that resolved itself before they finished reading the alert, while a real outage on a different surface went unnoticed for ten minutes.

2. Dashboards built during incidents, kept forever

Every past incident had spawned a new dashboard, built under pressure with copy-pasted PromQL. None of them got reviewed or retired afterwards. Eighty dashboards, maybe a dozen actually consulted in the last 90 days, and no way to know which were trustworthy.

3. Three signals, one of them missing

They had metrics. They had logs (in ELK, unconnected to anything). They had no distributed tracing at all. When an incident happened, the team could see that latency was high somewhere and that errors were happening somewhere, but not which service called which downstream and where the latency was actually coming from. Triage involved a lot of guessing.

4. No SLOs, no error budgets

There were no agreed targets for what “available” or “fast enough” meant per surface. Engineering and product had different intuitions, alerts fired on numbers no one had committed to, and there was no shared language for “we are within budget, we can ship” versus “we are burning budget, freeze risky changes”.

5. Logs nobody could search

The two ELK clusters had been set up for compliance, were full of unstructured text, and were too slow to use during an incident. Most engineers had stopped looking. Logs had effectively been demoted from “first place to check” to “place to check after everything else”.

Discovery and audit phase

Two weeks of audit produced a clear picture and a prioritized backlog. We did four things in parallel.

We classified every existing alert by whether it fired on a symptom (user-visible) or a cause (internal metric), how often it had fired in the last 90 days, and how often it had led to a real action. The output was uncomfortable: roughly 65% of alerts had not led to a single action in the last quarter, and 20% had fired more than 100 times. Those are the alerts that train people to ignore the on-call channel.
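The classification above can be sketched as a small script. This is a minimal illustration, not the audit tooling used in the engagement: the `AlertStats` fields, the sample alerts and the 100-firing noise threshold applied in code are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    kind: str          # "symptom" (user-visible) or "cause" (internal metric)
    fired_90d: int     # how often the alert fired in the last 90 days
    actions_90d: int   # how often a firing led to a real action


def audit_summary(alerts, noisy_threshold=100):
    """Summarize an alert inventory as percentages of the total."""
    total = len(alerts)
    if total == 0:
        return {"never_actioned_pct": 0.0, "noisy_pct": 0.0, "symptom_pct": 0.0}
    never_actioned = sum(1 for a in alerts if a.actions_90d == 0)
    noisy = sum(1 for a in alerts if a.fired_90d > noisy_threshold)
    symptoms = sum(1 for a in alerts if a.kind == "symptom")
    return {
        "never_actioned_pct": 100.0 * never_actioned / total,
        "noisy_pct": 100.0 * noisy / total,
        "symptom_pct": 100.0 * symptoms / total,
    }


# Hypothetical inventory, for illustration only.
sample_alerts = [
    AlertStats("HighCPU", "cause", fired_90d=140, actions_90d=0),
    AlertStats("QueueDepth", "cause", fired_90d=30, actions_90d=0),
    AlertStats("CheckoutErrors", "symptom", fired_90d=6, actions_90d=5),
    AlertStats("DiskFull", "cause", fired_90d=3, actions_90d=2),
]
summary = audit_summary(sample_alerts)
```

Alerts that are both noisy and never actioned are the natural first candidates for deletion or demotion to a non-paging channel.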

We mapped every dashboard to its query patterns and access logs. Most of the dashboards that were actually used could be consolidated into about 20 well-designed ones organized around user-facing surfaces (checkout, login, search, billing, etc.) rather than internal services.

We instrumented a handful of critical paths with OpenTelemetry tracing in a sandbox to demonstrate the missing third signal. The first trace showed a 400 ms call that had been blamed on the database actually spending most of its time in a serialization library nobody had thought to look at. That single demo unblocked the conversation about adopting tracing platform-wide.

We sat down with each product team and defined two or three SLOs per critical surface, focused on user-facing latency and availability, with explicit error budgets. This is the conversation that takes the longest and pays the most. Without SLOs every later decision about alerting and on-call is arbitrary.

Architecture and key decisions

The new architecture has one ingestion path and one query path. That is the part that matters. Everything else is implementation detail.

OpenTelemetry as the single source of truth

Every application emits metrics, logs and traces via the OpenTelemetry SDK. Auto-instrumentation covers most of the standard libraries, manual spans cover the business-critical paths. The OTel Collector runs in two layers: an agent DaemonSet on every node for fast local collection, and a gateway Deployment for routing, sampling and enrichment before signals reach the storage backends.

Trade-off considered

We could have kept Prometheus scraping for metrics and only used OTel for traces and logs. That is the easier migration path and what many teams do. We chose to push metrics through OTel as well because it gave us a single configuration surface, a single sampling policy, and the ability to enrich all three signals with the same resource attributes. The cost was a more complex collector configuration. For a team with 150 services and growing, the consistency was worth it.

LGTM as the storage backend

Loki for logs, Grafana for visualization, Tempo for traces, Mimir for long-term metrics (with Prometheus for short-term scrape on the edge). All four are operated as cloud-native services on the same Kubernetes cluster, with S3 as the object backend and per-tenant isolation. The reason we chose this stack instead of a vendor SaaS was cost predictability at the data volumes they were heading toward (about 500 GB of logs per day) and the ability to correlate signals natively through Grafana without paying for cross-product integration.

SLO-based alerting

Every alert that pages a human now ties to an SLO. We use Sloth to generate Prometheus recording and alerting rules from SLO definitions in YAML, with the standard multi-window burn-rate alerts (a fast burn for short outages, a slow burn for sustained degradation). Internal cause-based alerts still exist but they go to Slack, not PagerDuty.
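To make the idea of generating rules from an SLO definition concrete, here is a hand-rolled sketch that turns an SLO target into a multi-window burn-rate alert expression. It is not Sloth's actual output or API. The metric name `http_requests_total` and the labels `service` and `code` are hypothetical, and the window pair and factor are passed in by the caller.

```python
def error_ratio(service: str, window: str) -> str:
    """PromQL for the fraction of requests that failed over a window."""
    errors = (
        f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[{window}]))'
    )
    total = f'sum(rate(http_requests_total{{service="{service}"}}[{window}]))'
    return f"({errors} / {total})"


def burn_rate_alert_expr(service: str, slo_target: float,
                         long_window: str, short_window: str,
                         factor: float) -> str:
    """Alert when both windows burn the error budget faster than `factor`."""
    # Error ratio that corresponds to burning the budget at `factor` times
    # the sustainable rate.
    threshold = factor * (1 - slo_target)
    return (
        f"({error_ratio(service, long_window)} > {threshold:.6f})"
        f" and ({error_ratio(service, short_window)} > {threshold:.6f})"
    )


# Fast-burn expression for a hypothetical 99.9% checkout SLO.
fast_burn = burn_rate_alert_expr("checkout", 0.999, "1h", "5m", 14.4)
```

Requiring both the long and the short window to exceed the threshold is what lets the alert fire quickly and also stop firing soon after the problem ends.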

Dashboards as code, organized around user journeys

All dashboards live in Git, generated via Grafonnet or the Grafana provisioning system, organized around user-facing surfaces. Each team owns the dashboards for their surfaces. There is exactly one “platform health” dashboard, owned by the platform team, that the on-call starts with during an incident.

Stack and tooling

Instrumentation: OpenTelemetry SDK across all services (Go, TypeScript, Python), auto-instrumentation for HTTP/gRPC/DB/queue clients, manual spans for business-critical operations
Collection: OpenTelemetry Collector in agent (DaemonSet) plus gateway (Deployment) pattern, with tail-based sampling at the gateway
Metrics: Prometheus for scrape and short-term storage, Mimir for long-term storage and multi-tenant query, all backed by S3
Logs: Loki with structured metadata for high-cardinality fields, Promtail or OTel filelog receiver for stdout/syslog, S3 backend, tiered retention
Traces: Tempo as the trace backend, native OTLP ingestion, tail-based sampling integration, trace-to-logs and trace-to-metrics correlations in Grafana
Profiling: Pyroscope for continuous CPU and memory profiling on critical services (optional but very high signal)
Visualization: Grafana with dashboards provisioned from Git, RBAC scoped per team, single platform-health dashboard for on-call
SLOs: Sloth for SLO-as-code, Pyrra as the dashboard layer over the burn-rate metrics
Alerting: Alertmanager for routing, Grafana OnCall (or PagerDuty) for the rotation, Slack for non-paging notifications
Runbooks: linked from every alert, stored in Git alongside the SLO definition, reviewed quarterly
Deployment: Helm + ArgoCD for the whole observability stack, isolated node pool on EKS so an incident in the platform does not take observability down with it

Implementation details

Several decisions matter more than they look like they should.

Tail-based sampling at the gateway

Head-based sampling (decide whether to keep a trace when it starts, before its outcome is known) is cheap but throws away most of the interesting traces. Tail-based sampling (decide after the whole trace is collected) keeps every trace that contains an error or exceeds a latency threshold, plus a small percentage of normal traffic for baseline. We sample at about 5% baseline and 100% on errors. This single decision is what makes traces actually useful during an incident.
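The decision logic of tail-based sampling can be sketched in a few lines. This illustrates the policy only, not the collector's implementation: the `Span` type is hypothetical, the 1000 ms latency threshold is an assumed value (the threshold used in the engagement is not stated), and using the longest span as the trace duration is a simplification.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans, latency_threshold_ms=1000.0, baseline_rate=0.05):
    """Decide whether to keep a trace once all of its spans have arrived."""
    if not spans:
        return False
    # Keep every trace that contains an error.
    if any(s.is_error for s in spans):
        return True
    # Keep every trace that is slower than the threshold.
    if max(s.duration_ms for s in spans) > latency_threshold_ms:
        return True
    # Otherwise keep a deterministic ~5% baseline, keyed on the trace ID so
    # that every gateway replica makes the same decision for the same trace.
    digest = hashlib.sha256(spans[0].trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate
```

The important property is that the error and latency checks run before the probabilistic one, so no interesting trace is ever lost to the baseline rate.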

Structured metadata in Loki

Loki indexes labels, not log content. Putting high-cardinality fields like user IDs or request IDs in labels is the classic Loki mistake that blows up the index. We push those fields into structured metadata (enabled by default since Loki 3.0) and keep labels low-cardinality. The result is searchable logs at predictable cost.
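The split between labels and structured metadata can be shown by building a push payload by hand. This sketch assumes the JSON push format in which each log entry may carry an optional third element holding structured metadata; the label names, field names and values are hypothetical, and nothing is sent over the network.

```python
import json
import time


def loki_push_payload(line, labels, metadata, ts_ns=None):
    """Build a single-entry push payload with low-cardinality labels and
    high-cardinality fields carried as structured metadata."""
    timestamp = str(ts_ns if ts_ns is not None else time.time_ns())
    # Structured metadata values are sent as strings.
    meta = {key: str(value) for key, value in metadata.items()}
    return {
        "streams": [
            {
                "stream": labels,                    # indexed, keep small
                "values": [[timestamp, line, meta]]  # metadata is not indexed
            }
        ]
    }


payload = loki_push_payload(
    line="payment authorized",
    labels={"service": "checkout", "env": "prod", "level": "info"},
    metadata={"user_id": "u-123", "request_id": "req-9f2c"},
    ts_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)
```

The stream labels here have a handful of possible values each, while the user and request IDs stay attached to the line without creating a new stream per value.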

SLOs that mean something

A common failure mode is to set SLOs at 99.99% on everything to look impressive, then ignore them when they burn. We worked with each team to pick targets that matched the actual user expectation for that surface: 99.9% on checkout, 99.5% on the search results page (where slowness is annoying but not catastrophic), and lower targets on background workers. The targets are documented, agreed with product, and reviewed quarterly against actual user behavior.
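It helps to translate each target into the amount of failure it actually allows. The arithmetic below assumes a 30-day rolling window and a time-based availability SLO, both of which are assumptions for the example rather than details from the engagement.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability allowed within the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


checkout_budget = round(error_budget_minutes(0.999), 1)   # 43.2 minutes
search_budget = round(error_budget_minutes(0.995), 1)     # 216.0 minutes
four_nines_budget = round(error_budget_minutes(0.9999), 1)  # 4.3 minutes
```

A 99.99% target leaves roughly four minutes a month, which is less than most teams need just to acknowledge a page. That is why blanket four-nines targets tend to get ignored.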

Multi-window burn-rate alerts

We use the multi-window pattern from the Google SRE Workbook: a fast-burn alert (14.4x burn rate over a 1-hour window, confirmed by a 5-minute window) and a slow-burn alert (6x burn rate over a 6-hour window, confirmed by a 30-minute window). Together they catch both sudden outages and slow degradation without producing the noise of single-threshold alerts.
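The burn-rate math is short enough to work through directly. This sketch assumes a 30-day SLO period and the window pairs recommended in the SRE Workbook (1 hour with a 5-minute check at 14.4x, 6 hours with a 30-minute check at 6x); the function names are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo_target)


def budget_consumed(factor: float, window_hours: float,
                    period_days: int = 30) -> float:
    """Fraction of the period's budget spent by burning at `factor`
    for `window_hours`."""
    return factor * window_hours / (period_days * 24)


def should_page(long_ratio: float, short_ratio: float,
                slo_target: float, factor: float) -> bool:
    """Page only if both the long and the short window exceed the factor."""
    return (burn_rate(long_ratio, slo_target) >= factor
            and burn_rate(short_ratio, slo_target) >= factor)


fast_budget = round(budget_consumed(14.4, 1), 4)  # 0.02, i.e. 2% of the budget
slow_budget = round(budget_consumed(6, 6), 4)     # 0.05, i.e. 5% of the budget

# 2% errors against a 99.9% SLO is a burn rate of about 20x.
ongoing_outage = should_page(0.02, 0.02, 0.999, 14.4)
# The long window is still elevated but the short window has recovered.
already_recovered = should_page(0.02, 0.0005, 0.999, 14.4)
```

The second case is why the short window matters: once the problem is over, the alert stops firing even though the hour-long average is still above the threshold.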

Runbook linked from every alert

Every paging alert has a runbook_url annotation that points to a markdown file in Git. The runbook is required to land an alert. This is a small process tax that pays back enormously during incidents at 3 AM.
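A rule like this is easy to enforce in CI. The sketch below checks already-parsed Prometheus rule groups and assumes paging alerts are marked with a `severity: page` label, which is a convention chosen for this example and not something stated above.

```python
def missing_runbooks(rule_groups, paging_severity="page"):
    """Return the names of paging alerts that have no runbook_url annotation."""
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rule, nothing to check
            if rule.get("labels", {}).get("severity") != paging_severity:
                continue  # non-paging alert, runbook is optional
            url = rule.get("annotations", {}).get("runbook_url", "")
            if not url.strip():
                missing.append(rule["alert"])
    return missing


# Hypothetical rule groups, shaped like a parsed Prometheus rules file.
sample_groups = [
    {
        "name": "checkout-slo",
        "rules": [
            {"record": "slo:checkout:error_ratio_5m", "expr": "..."},
            {
                "alert": "CheckoutFastBurn",
                "labels": {"severity": "page"},
                "annotations": {"runbook_url": "runbooks/checkout-fast-burn.md"},
            },
            {"alert": "CheckoutSlowBurn", "labels": {"severity": "page"}},
            {"alert": "QueueDepthHigh", "labels": {"severity": "slack"}},
        ],
    }
]
offenders = missing_runbooks(sample_groups)
```

A CI job can fail the merge whenever the returned list is non-empty, which makes the runbook requirement mechanical instead of a review comment.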

Cost controls in Loki and Tempo

Both Loki and Tempo can become expensive fast. We use per-tenant ingestion limits, retention policies tiered by importance (90 days for paging-related, 14 days for everything else), and a weekly cost review per team based on OpenCost data. Cost ownership matches signal ownership.
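The effect of tiered retention is simple arithmetic. This estimate uses the roughly 500 GB of logs per day mentioned earlier and the 90/14-day tiers; the 10% share of paging-related logs is a made-up figure for illustration, and the result is raw ingested volume before compression.

```python
def steady_state_gb(daily_gb: float, paging_share: float,
                    paging_days: int = 90, default_days: int = 14) -> float:
    """Raw log volume held at steady state under two retention tiers."""
    paging_gb = daily_gb * paging_share * paging_days
    default_gb = daily_gb * (1 - paging_share) * default_days
    return paging_gb + default_gb


tiered = round(steady_state_gb(500, 0.10))   # 10800 GB
flat_90 = round(steady_state_gb(500, 1.0))   # 45000 GB if everything kept 90 days
```

Under these assumptions, tiering holds about a quarter of what a flat 90-day policy would, and the ratio improves as the paging-related share shrinks.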

Results and numbers

Eight weeks of work, then a steady state that the team can actually maintain. Numbers below are 60-day rolling averages.

Alert volume: -79% (120 to 25 alerts per week)
MTTR: 11 min (down from 45 min)
Pages per on-call: -72% (sustainable rotation again)
Dashboards in use: down from 80, all owned
Traces coverage: 100% (critical paths instrumented)
Obs stack cost: -30% (vs prior ELK + SaaS combo)

Bottom line: The team trusts their alerts again. On-call is something engineers do without dreading it. Incidents resolve in a fraction of the time because the data is in the right shape. New services are instrumented correctly by default because the pattern is in place.

Lessons learned

Patterns that show up in every observability engagement.

Alert on symptoms, not causes. CPU is not a symptom. Users not being able to check out is a symptom. Build alerts from the outside in.

SLOs first, alerts second. Without an agreed SLO every alert threshold is opinion. With an SLO the burn-rate math is mechanical and the on-call rotation has a clear definition of “is this worth waking someone up”.

Three signals or none. Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where. Missing any of the three turns incidents into archeology. OpenTelemetry makes it cheap to have all three from the start.

Dashboards rot fast. Build them as code, review them quarterly, retire them aggressively. A dashboard nobody trusts is worse than no dashboard, because it gives wrong answers under pressure.

Cost is part of observability. The same engineering rigor you apply to product cost has to apply to observability cost. Otherwise the bill catches up with you and someone in finance decides the cuts for you.

Runbooks make alerts good. If you cannot write a runbook for an alert, the alert is not ready to ship. This rule alone removes about 20% of proposed alerts before they ever fire.

When this fits your team

Your on-call rotation is unsustainable and engineers are pushing back on joining it
You have Prometheus and Grafana but no SLOs, no traces, and dashboards nobody trusts
Incidents take long to diagnose because the data is in three different places and does not correlate
Your observability bill is growing faster than your traffic and you cannot explain why
You are building on Kubernetes and want a self-hosted, OTel-native observability stack instead of paying per-host vendor pricing
You want a senior SRE partner who can put a real observability platform in place in weeks, not quarters
