Case study

Zero to Observable: Adding Monitoring to a System That Had None

Introducing Grafana and Prometheus to a system with no visibility — what we measured first, and what we learned about debugging with data.

Grafana · Prometheus · Node.js

Context

The platform had no structured monitoring. When something went wrong, we'd check logs, guess, and hope. We had no idea what 'normal' looked like — request latency, error rates, database connection usage. We needed to go from zero visibility to enough signal to debug production issues. The goal wasn't perfect observability; it was 'when something breaks, we can find out why.'

Constraints

  • No prior experience with Prometheus or Grafana — we had to learn while shipping
  • Limited time — we couldn't instrument everything; we had to pick the highest-value metrics first
  • Existing app had no metrics endpoints — we had to add them without disrupting the codebase

Architecture

We started with the four golden signals: latency, traffic, errors, saturation. We added a Prometheus client to the Node.js app, exposed a /metrics endpoint, and scraped it with Prometheus. We created a few Grafana dashboards: request rate and latency by endpoint, error rate by status code, database connection pool usage. We set up basic alerting — when error rate spiked or latency crossed a threshold, we'd get a Slack notification. We didn't add distributed tracing or log aggregation initially; those were phase two. The key was starting small: a few dashboards and two or three alerts. We learned what we actually looked at when debugging, then added more.

Alternatives considered

  • Full observability stack (tracing, logs, metrics) from day one: Would have been overwhelming. We needed to walk before we ran. Metrics first, then logs, then tracing if we hit limits.
  • Rely on cloud provider metrics (e.g. AWS CloudWatch) only: CloudWatch gave us infra metrics (CPU, memory) but not application-level metrics (request latency by route, error rate by endpoint). We needed both.

Lessons learned

  • Start with the metrics you'll actually use when debugging. We thought we'd care about cache hit rate; we ended up caring most about latency p99.
  • Alert fatigue is real. We tuned thresholds after a few false alarms. Better to occasionally miss an alert than to train the team to ignore them all.
  • Document what each dashboard is for. A dashboard with no clear purpose becomes noise.
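The p99 latency and threshold-tuning lessons translate into a Prometheus alerting rule roughly like the following sketch. The metric name, route label, threshold, and `for:` duration are illustrative assumptions, not our exact values; the `for:` clause is one way to cut false alarms, since the condition must hold for the whole window before the alert fires.

```yaml
# rules/latency.yml — hedged sketch of a p99 latency alert
groups:
  - name: latency
    rules:
      - alert: HighP99Latency
        # p99 over a 5-minute window, per route, from the
        # request-duration histogram buckets
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
          ) > 1.5
        for: 10m   # must hold for 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1.5s on {{ $labels.route }}"
```

Alertmanager (or a Grafana alert channel) then routes the firing alert to Slack.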