
🚀 Observability Workshop: From Visibility to Root Cause in Grafana Play

This exercise uses the public Grafana Play environment to demonstrate the critical difference between Visibility (seeing a problem metric) and Observability (correlating metrics with logs/traces to find the root cause).

Tool: Grafana Play (public demo instance)
Focus: Correlating metrics (Prometheus) and logs (Loki)


🛠️ Step-by-Step Guide: The Observability Pivot

Phase 1: Access and Orientation (Visibility)

Goal: Access the demo, find a dashboard, and identify a problematic metric spike.

Step 1: Access the Grafana Play Environment

  1. Open your web browser and navigate to the official Grafana demo site: play.grafana.org.
  2. Navigate to the Dashboards menu (the stack of squares icon) on the left sidebar.
  3. Open a detailed statistics dashboard, such as “Prometheus 2.0 Stats” or “Grafana Internal Stats.”

Step 2: Identify a Problem Metric (Visibility)

  1. Scan the graphs for metrics related to application performance (e.g., Request Duration, Latency, Error Rate).
  2. Find a graph that shows a clear spike or deviation from the normal line—this is your anomaly.
  3. Action: Hover your mouse over the anomaly.
    • Record 1: Note the exact timestamp (Date and Time) when the spike occurred.
    • Record 2: Note the name of the metric (e.g., http_request_duration_seconds).
    • This data represents your VISIBILITY. You know WHAT happened, but not WHY.
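The two values you just recorded are exactly what you would need to fetch the metric programmatically. As a hedged illustration only, this Python sketch builds the parameters for a Prometheus `query_range` HTTP API call centred on the anomaly; the metric name and timestamp are placeholders, not verified series in the Play environment:

```python
from datetime import datetime, timedelta, timezone

def range_query_params(metric: str, anomaly: datetime, window_min: int = 5) -> dict:
    """Build Prometheus /api/v1/query_range parameters centred on the anomaly."""
    start = anomaly - timedelta(minutes=window_min)
    end = anomaly + timedelta(minutes=window_min)
    return {
        "query": metric,              # the metric recorded in Step 2
        "start": start.timestamp(),   # unix seconds
        "end": end.timestamp(),
        "step": "15s",                # resolution of returned samples
    }

# Placeholder anomaly timestamp, standing in for the one you recorded.
anomaly = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
params = range_query_params("http_request_duration_seconds", anomaly)
```

The point is that "what happened" reduces to a metric name plus a time window; the "why" still requires a second data source.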

Phase 2: Drilling Down for Observability

Goal: Use the problem metric’s timestamp to search logs and uncover the root cause.

Step 3: Access the Explore Feature

The Explore feature is Grafana’s interface for investigative querying, allowing you to seamlessly pivot between different data sources.

  1. Click on the “Explore” icon (the compass or graph icon) in the left-hand navigation panel.
  2. In the main workspace, find the Data Source selector at the top left.

Step 4: Correlate Metrics to Logs (The Observability Pivot)

To find the root cause, you must search the log data source (Loki) during the exact time the metric spiked.

  1. In the Data Source selector, change the source from Prometheus (Metrics) to Loki (Logs).
  2. Use the Time Range Selector (top right) to narrow the window. Set the range to cover only 5 minutes before and 5 minutes after the anomaly time recorded in Step 2.
  3. In the Log Queries field, enter a basic label filter to pull relevant logs. A common filter used in the Play environment is:
    {job="grafana"}
    
  4. Click “Run Query.”
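The same pivot can be expressed programmatically. As a sketch under stated assumptions (the selector mirrors the one above; the anomaly timestamp is a placeholder), this builds parameters for Loki's `query_range` HTTP API, which accepts nanosecond epoch timestamps:

```python
from datetime import datetime, timedelta, timezone

def loki_window_params(selector: str, anomaly: datetime, window_min: int = 5) -> dict:
    """Build Loki /loki/api/v1/query_range parameters around the anomaly."""
    start = anomaly - timedelta(minutes=window_min)
    end = anomaly + timedelta(minutes=window_min)
    # Loki takes nanosecond epoch timestamps; whole seconds are assumed
    # here so the arithmetic stays exact.
    return {
        "query": selector,                              # LogQL label filter
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "limit": 100,                                   # cap returned lines
    }

anomaly = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
params = loki_window_params('{job="grafana"}', anomaly)
```

Notice that the only inputs are the two facts you recorded in Step 2: a time and a scope. That is the whole pivot.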

Step 5: Identify the Root Cause

  1. Review the log lines returned for that narrow time window.
  2. Look specifically for log lines containing keywords that explain a failure or slowdown:
    • error, panic, timeout
    • restarting, database failure, or high-volume bursts.
  3. Conclusion: By correlating the time of the metric spike with the content of the logs, you have found the root cause (the why). You have successfully shifted from Visibility (a spike on a graph) to Observability (a database timeout message in the logs).
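The manual scan in this step amounts to a keyword filter over the returned lines. A minimal Python sketch (the log lines are invented examples, not real Play output):

```python
# Failure keywords from Step 5; matching is case-insensitive.
KEYWORDS = ("error", "panic", "timeout", "restarting")

def suspicious(lines: list[str]) -> list[str]:
    """Return only the log lines that mention a failure keyword."""
    return [line for line in lines if any(k in line.lower() for k in KEYWORDS)]

logs = [
    'ts=12:00:01 level=info msg="request served"',
    'ts=12:00:03 level=error msg="database timeout after 30s"',
]
hits = suspicious(logs)  # only the error/timeout line survives
```

In practice Loki's label filters (next section) do this server-side far more efficiently, but the principle is the same: narrow by time, then narrow by content.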

🔑 Key Configurations to Manipulate

Experiment with these controls in the Explore view to deepen your understanding of each data source:

| Data Source | Configuration | Learning Objective |
| --- | --- | --- |
| Prometheus (Metrics) | Query Functions: Wrap your metric in functions like rate() or sum(). | Understand how data aggregation changes the story told by the metric. |
| Loki (Logs) | Label Filters: Change the query from {job="grafana"} to {job="grafana", level="error"}. | Learn to use metadata (labels) to quickly narrow the scope of investigation to only problematic events. |
| Both | Time Range Selector: Constantly zoom in and out around the anomaly. | Emphasize that context and time segmentation are the most powerful tools in observability. |
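To make the query variations above concrete, here they are composed as strings (Python purely for illustration; the metric name is the Step 2 example, not a verified series):

```python
# Build the PromQL and LogQL variations from the table above as strings.
metric = "http_request_duration_seconds"  # example metric from Step 2

rate_q = f"rate({metric}[5m])"   # per-second rate over a 5-minute window
sum_q = f"sum({rate_q})"         # aggregate the rate across all series
loki_q = '{job="grafana", level="error"}'  # label filter narrowed to errors
```

Each wrapping changes what the graph says: the raw metric shows individual samples, rate() shows change over time, and sum() collapses many series into one trend line.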