🚀 Observability Workshop: From Visibility to Root Cause in Grafana Play
This exercise uses the public Grafana Play environment to demonstrate the critical difference between Visibility (seeing a problem metric) and Observability (correlating metrics with logs/traces to find the root cause).
Tool: Grafana Play (Public Demo Instance) Focus: Correlating Metrics (Prometheus) and Logs (Loki)
🛠️ Step-by-Step Guide: The Observability Pivot
Phase 1: Access and Orientation (Visibility)
Goal: Access the demo, find a dashboard, and identify a problematic metric spike.
Step 1: Access the Grafana Play Environment
- Open your web browser and navigate to the official Grafana demo site: play.grafana.org.
- Navigate to the Dashboards menu (the stack of squares icon) on the left sidebar.
- Open a detailed statistics dashboard, such as “Prometheus 2.0 Stats” or “Grafana Internal Stats.”
Step 2: Identify a Problem Metric (Visibility)
- Scan the graphs for metrics related to application performance (e.g., Request Duration, Latency, Error Rate).
- Find a graph that shows a clear spike or deviation from the baseline; this is your anomaly.
- Action: Hover your mouse over the anomaly.
- Record 1: Note the exact timestamp (Date and Time) when the spike occurred.
- Record 2: Note the name of the metric (e.g., `http_request_duration_seconds`).
- This data represents your VISIBILITY. You know WHAT happened, but not WHY. (A sample query for a metric like this is sketched below.)
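As a reference point, latency metrics like this are typically built from Prometheus histograms and examined with rate-based queries. The sketch below is a generic PromQL example, assuming a standard histogram named `http_request_duration_seconds`; the exact metric names on the Play dashboards may differ:

```promql
# Average request duration over a 5-minute window:
# total time spent serving requests divided by the request count.
rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])
```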
Phase 2: Drilling Down for Observability
Goal: Use the problem metric’s timestamp to search logs and uncover the root cause.
Step 3: Access the Explore Feature
The Explore feature is Grafana’s interface for investigative querying, allowing you to seamlessly pivot between different data sources.
- Click on the “Explore” icon (the compass or graph icon) in the left-hand navigation panel.
- In the main workspace, find the Data Source selector at the top left.
Step 4: Correlate Metrics to Logs (The Observability Pivot)
To find the root cause, you must search the log data source (Loki) during the exact time the metric spiked.
- In the Data Source selector, change the source from Prometheus (Metrics) to Loki (Logs).
- Use the Time Range Selector (top right) to narrow the window to roughly 5 minutes before and 5 minutes after the anomaly timestamp you recorded in Step 2.
- In the Log Queries field, enter a basic label filter to pull relevant logs (an extended version is sketched after this list). A common filter used in the Play environment is:

  ```logql
  {job="grafana"}
  ```

- Click "Run Query."
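If the label filter alone returns too many lines, LogQL line filters can trim the output before you start reading. A minimal sketch using the same `job` label (the labels available in the Play environment may vary):

```logql
# Keep only log lines that contain the substring "error"
{job="grafana"} |= "error"
```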
Step 5: Identify the Root Cause
- Review the log lines returned for that narrow time window.
- Look specifically for log lines containing keywords that explain a failure or slowdown: `error`, `panic`, `timeout`, `restarting`, `database failure`, or evidence of high-volume bursts (see the regex sketch after this list).
- Conclusion: By correlating the time of the metric spike with the content of the logs, you have found the root cause (the why). You have successfully shifted from Visibility (a spike on a graph) to Observability (a database timeout message in the logs).
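To scan for several of these failure keywords in one pass, LogQL also supports regex line filters. A sketch under the same assumptions as above:

```logql
# Match lines mentioning any of the common failure keywords
{job="grafana"} |~ "error|panic|timeout|restarting"
```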
🔑 Key Configurations to Manipulate
Experiment with these controls in the Explore view to deepen your understanding of each data source:
| Data Source | Configuration | Learning Objective |
|---|---|---|
| Prometheus (Metrics) | Query Functions: Wrap your metric in functions like rate() or sum(). | Understand how data aggregation changes the story told by the metric. |
| Loki (Logs) | Label Filters: Change the query from {job="grafana"} to {job="grafana", level="error"}. | Learn to use metadata (labels) to quickly narrow the scope of investigation to only problematic events. |
| Both | Time Range Selector: Constantly zoom in and out around the anomaly. | Emphasize that context and time segmentation are the most powerful tools in observability. |
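To try the first row of the table concretely, wrap a counter in `rate()` and aggregate with `sum()`. The sketch below assumes the `prometheus_http_requests_total` counter, which Prometheus servers expose about themselves; any counter on the dashboard works the same way:

```promql
# Per-second HTTP request rate, aggregated per handler
sum by (handler) (rate(prometheus_http_requests_total[5m]))
```

Dropping the `sum by (handler)` wrapper shows every label combination individually, which is often the quickest way to see how aggregation changes the story a metric tells.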