Why observability matters before an incident
Most teams add monitoring after something breaks in production. A user reports it, engineers scramble through CloudWatch logs, and someone spends three hours correlating events across five different dashboards. The post-mortem says "add better alerting." Then the cycle repeats.
The problem isn't alerting — it's that metrics, logs, and traces are disconnected. Prometheus tells you CPU spiked at 14:32. Loki tells you there were errors at 14:32. Tempo tells you requests slowed at 14:32. When these are linked in Grafana, you can click from a metric spike directly to the relevant logs and traces in seconds.
This guide walks through deploying the full PLG stack (Prometheus + Loki + Grafana) plus Tempo on Kubernetes using kube-prometheus-stack and the Grafana Operator — managed as code, not clicked together in a UI.
The stack
| Component | Role |
|---|---|
| Prometheus | Metrics scraping and storage |
| Alertmanager | Alert routing (Slack, PagerDuty, OpsGenie) |
| Grafana | Dashboards, alert UI, data source correlation |
| Loki | Log aggregation |
| Promtail | Log shipping from pods to Loki |
| Tempo | Distributed tracing |
| OpenTelemetry Collector | Trace and metric ingestion pipeline |
Deploy kube-prometheus-stack
kube-prometheus-stack is the standard Helm chart that bundles Prometheus, Alertmanager, Grafana, and a full set of pre-built Kubernetes dashboards and alert rules.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Create a values.yaml:
# values.yaml for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 50Gi
    # Discover ServiceMonitors in every namespace, regardless of their labels
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 2Gi
grafana:
  enabled: true
  adminPassword: "change-me-use-external-secret"
  persistence:
    enabled: true
    storageClassName: gp3
    size: 5Gi
  grafana.ini:
    server:
      root_url: "https://grafana.yourdomain.com"
    auth.anonymous:
      enabled: false
# Reduce noise from default rules if needed
defaultRules:
  rules:
    etcd: false # disable if not monitoring etcd directly
    kubeScheduler: false
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace --version 56.x.x -f values.yaml
After a few minutes you'll have Prometheus scraping all Kubernetes system components, Grafana with pre-built dashboards, and Alertmanager ready for routing.
Add Loki for log aggregation
Loki doesn't index the content of logs, only the labels (like pod name, namespace, container). Because the index stays small, storage is typically far cheaper than with a full-text index such as Elasticsearch, while LogQL still lets you filter and aggregate log lines at query time.
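To make the label-only idea concrete, here is a toy Python model. It is a sketch of the concept, not how Loki is implemented: the `TinyLogStore` class and the sample log lines are made up for illustration. The index holds one entry per unique label set, and content filtering is a plain scan over the selected streams, roughly what a LogQL query like `{namespace="my-app"} |= "500"` expresses.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy model of label-only indexing (hypothetical, for illustration).

    The index maps each unique label set to a stream; log lines are stored
    unindexed inside the stream.
    """

    def __init__(self):
        # One index entry per unique label set, never one per log line.
        self.streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        key = frozenset(labels.items())
        self.streams[key].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Select streams by label match, then scan their lines for a substring."""
        wanted = set(selector.items())
        results = []
        for key, lines in self.streams.items():
            if wanted <= key:  # the only indexed lookup: label matching
                results.extend(l for l in lines if contains in l)
        return results


store = TinyLogStore()
store.push({"namespace": "my-app", "pod": "api-1"}, "GET /health 200")
store.push({"namespace": "my-app", "pod": "api-1"}, "GET /orders 500 timeout")
store.push({"namespace": "kube-system", "pod": "coredns-1"}, "reload ok")

errors = store.query({"namespace": "my-app"}, contains="500")
print(errors)              # lines from my-app streams that contain "500"
print(len(store.streams))  # number of index entries (unique label sets)
```

Three log lines produce only two index entries here. The flip side is that every new label value creates a new stream, which is why high-cardinality values such as request IDs belong in the log line rather than in labels.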
helm repo add grafana https://grafana.github.io/helm-charts
A minimal Loki values file using S3 for storage:
# loki-values.yaml
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      region: us-east-1
      bucketnames: your-loki-bucket
      s3ForcePathStyle: false
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
singleBinary:
  replicas: 1 # Use distributed mode for production scale
# Promtail ships logs from every node
promtail:
  enabled: true
helm upgrade --install loki grafana/loki-stack --namespace monitoring -f loki-values.yaml
Add Loki as a data source in Grafana:
# As a Grafana data source (or via GrafanaDatasource CRD)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki:3100
        access: proxy
Instrument your application with a ServiceMonitor
Prometheus uses ServiceMonitors to know what to scrape. Here's how to expose metrics from a Node.js app and tell Prometheus about it:
# serviceMonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack # required only if Prometheus selects ServiceMonitors by this label
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Your app's Service needs a port named metrics:
spec:
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
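Whatever language the app is written in, the `/metrics` endpoint has to return the Prometheus text exposition format. The sketch below renders one counter by hand in Python purely to show what that response body looks like. The `render_counter` helper and the sample values are hypothetical; a real service, including the Node.js app mentioned above, would normally use a Prometheus client library instead of building this string itself.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts for the example app.
samples = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "GET"), ("status", "500")): 3,
}
body = render_counter("http_requests_total", "Total HTTP requests.", samples)
print(body)
```

Each label combination becomes its own time series in Prometheus, which is what makes selectors like `status=~"5.."` possible in the rules that follow.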
Writing meaningful PrometheusRules
Default rules cover cluster health. You need custom rules for your application's SLOs.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slos
  namespace: my-app
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app.slos
      interval: 30s
      rules:
        # Error rate SLI; "by (job)" keeps the job label so {{ $labels.job }} resolves in the alert
        - record: job:http_errors:rate5m
          expr: |
            sum by (job) (rate(http_requests_total{job="my-app",status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total{job="my-app"}[5m]))
        # Alert when error budget is burning fast
        - alert: ErrorBudgetBurnHigh
          expr: job:http_errors:rate5m > 0.01
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Error rate above 1% SLO for {{ $labels.job }}"
            description: "Current error rate: {{ $value | humanizePercentage }}"
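The arithmetic behind the rule is simple enough to check by hand. The Python sketch below mirrors the recording rule (errors divided by total requests) and relates the 1% threshold to a burn rate, assuming the SLO is 99% availability so the error budget is 1%. The function names and the traffic numbers are made up for illustration.

```python
def error_ratio(errors_per_s: float, requests_per_s: float) -> float:
    """Fraction of requests that failed.

    Returns 0.0 when there is no traffic; this is a choice made for the
    sketch (the PromQL division would yield NaN in that case).
    """
    if requests_per_s == 0:
        return 0.0
    return errors_per_s / requests_per_s


def burn_rate(ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    error_budget = 1.0 - slo_target
    return ratio / error_budget


# Hypothetical traffic: 200 requests/s, 3 of them failing with 5xx.
ratio = error_ratio(errors_per_s=3.0, requests_per_s=200.0)
print(ratio)                   # the value of job:http_errors:rate5m
print(ratio > 0.01)            # whether the alert expression holds
print(burn_rate(ratio, 0.99))  # budget consumption relative to the 99% SLO
```

A burn rate of 1 means the budget would be used up exactly at the end of the SLO window; anything above 1 exhausts it early. The rule above fires at any sustained burn rate over 1, which is a deliberately simple starting point.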
Correlating metrics, logs, and traces in Grafana
This is where the investment pays off. Configure Grafana data source links so you can jump from a metric spike to logs in one click.
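In Grafana, edit the Prometheus data source and add an exemplar link, so that trace IDs attached to metric samples open in Tempo. Derived fields are a Loki feature; for Prometheus the equivalent is an entry under `exemplarTraceIdDestinations`. This assumes the Tempo data source has the UID `tempo` and that your app emits exemplars with a `traceID` label:
{
  "name": "traceID",
  "datasourceUid": "tempo",
  "urlDisplayLabel": "View Trace"
}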
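And for Loki, add a derived field pointing to Tempo. Using `datasourceUid` (assumed here to be `tempo`) makes it an internal Grafana link, so the browser never has to reach the in-cluster Tempo service directly:
{
  "name": "TraceID",
  "matcherRegex": "traceId=(\\w+)",
  "url": "${__value.raw}",
  "datasourceUid": "tempo",
  "urlDisplayLabel": "View in Tempo"
}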
Now when you see a log line with a traceId, you get a direct link to the trace in Tempo. This is the core loop: metric alert → Grafana dashboard → click into logs → click into trace → find the slow database query.
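To see what the Loki derived field's `matcherRegex` actually captures, here is the same pattern applied in Python. The log lines are invented for the example; the captured group is the value Grafana substitutes for `${__value.raw}`.

```python
import re
from typing import Optional

# Same pattern as the Loki derived field (unescaped, since this is not JSON).
MATCHER = re.compile(r"traceId=(\w+)")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the captured trace ID, or None if the line has no match."""
    match = MATCHER.search(log_line)
    return match.group(1) if match else None


# Hypothetical log lines.
line = 'level=error msg="db query timed out" traceId=4bf92f3577b34da6a3ce929d0e0e4736 duration=2.3s'
print(extract_trace_id(line))                       # the 32-character trace ID
print(extract_trace_id("level=info msg=started"))   # None: no link is rendered
print(extract_trace_id("traceID=abc123"))           # None: the match is case-sensitive
```

The last case is worth noting: if your logger writes `traceID` or `trace_id` instead of `traceId`, the regex has to match that exact spelling or no link appears.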
Managing dashboards and alerts as code
Never build Grafana dashboards by hand in the UI. They'll diverge across environments and nobody will know which version is correct. Use the GrafanaDashboard CRD:
helm upgrade --install grafana-operator grafana/grafana-operator --namespace monitoring
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: my-app-dashboard
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  json: |
    {
      "title": "My App",
      "panels": [...]
    }
Store the JSON in Git. PR review for dashboard changes. No more "who changed the dashboard on Friday afternoon."
SLO-driven alerting: the approach that actually works
Symptom-based alerts are better than cause-based alerts. Instead of alerting on "CPU > 80%", alert on "error rate > SLO" or "latency P99 > 500ms". Users don't care about CPU — they care about the service working.
The four golden signals are the right starting point:
- Latency — how long requests take (especially P95/P99, not averages)
- Traffic — requests per second
- Errors — rate of failed requests (5xx, timeouts)
- Saturation — resource utilization headroom (CPU, memory, disk)
A Prometheus query for P99 latency:
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="my-app"}[5m]))
  by (le)
)
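It helps to know that this P99 is an estimate, not an exact measurement. The Python sketch below reproduces the linear-interpolation idea behind `histogram_quantile` for classic cumulative buckets. It is a simplified illustration with made-up bucket values, and it does not cover every edge case that Prometheus handles.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes classic Prometheus-style buckets: counts are cumulative and the
    last bucket has an upper bound of +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical per-second bucket rates, keyed by the "le" upper bound in seconds.
buckets = [(0.1, 60.0), (0.25, 90.0), (0.5, 98.0), (1.0, 100.0), (math.inf, 100.0)]
p99 = histogram_quantile(0.99, buckets)
p50 = histogram_quantile(0.50, buckets)
print(p99)  # lands between the 0.5s and 1.0s bounds
print(p50)  # lands inside the first bucket
```

Because the result is interpolated within a bucket, it can never be more precise than the bucket boundaries. If your SLO threshold is 500ms, make sure there is a bucket boundary at or near 0.5.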
Keep your total page-level alert count below 20. If you have 200 alerts firing, nobody reads them. Ruthlessly silence noise and promote only actionable alerts to on-call.
What's next
With the PLG stack running, the next step is setting up Alertmanager routing to send critical alerts to PagerDuty and warnings to Slack — with proper inhibition rules so a cluster outage doesn't generate 300 simultaneous alerts. That's where most teams spend half their observability setup time.