Monitoring Distributed Systems

The following summarizes the fundamental principles and best practices that Google's Site Reliability Engineering (SRE) teams follow when building effective monitoring and alerting systems.

Terminology

Monitoring: The process of collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts, error rates, and response times.
White-box Monitoring: Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or other internal statistics.
Black-box Monitoring: Testing the externally visible behavior of a system as a user would experience it.
Dashboard: A web-based application that offers a summary view of core service metrics. Dashboards help in answering basic questions about a service and can display information like ticket queue length, high-priority bugs, and the current on-call engineer.
Alert: A notification intended to be read by a human, pushed to a system such as a bug or ticket queue, an email alias, or a pager. Depending on the destination, alerts are classified as tickets, email alerts, or pages.
Root Cause: A defect in a system or human process that, when fixed, instills confidence that a similar issue won't occur in the same way.
Node and Machine: These terms are used interchangeably to refer to a single instance of a running kernel, whether on a physical server, virtual machine, or container.
Push: Any change to a service's running software or configuration.

Why Monitoring Matters

Monitoring serves several purposes, including:

Analyzing Trends: Monitoring helps in understanding the long-term trends in your system's performance, such as database growth rates or user count increases.
Comparing Performance: It allows for comparing different configurations or experiments to determine the most efficient approach.
Alerting: It notifies you when something is broken and needs immediate attention, or when something is likely to break soon. Effective alerting is crucial for incident response.
Building Dashboards: Dashboards provide a snapshot of critical metrics, helping teams quickly assess the health of their services.
Ad Hoc Analysis: When problems arise, monitoring data can help in debugging and determining what might have caused the issue.

Monitoring and alerting enable systems to identify and respond to problems, potentially before they impact users. However, setting up an effective monitoring system requires careful consideration to avoid excessive alerts and ensure that each alert serves a meaningful purpose.

Setting Realistic Expectations

It's important to set realistic expectations for your monitoring efforts. Monitoring is a significant engineering endeavor, and even with a mature infrastructure, dedicated monitoring personnel are often required. Google's SRE teams have moved towards simpler and faster monitoring systems while avoiding overly complex "magic" systems that try to automatically detect thresholds or causality.

Symptoms vs. Causes

An essential aspect of monitoring is distinguishing between symptoms and causes. A symptom describes what is broken, while a cause explains why. For example, "the service is returning HTTP 500s" is a symptom, and "the database servers are refusing connections" is a possible cause. Alert on symptoms so that problems are identified quickly, and use cause-oriented data afterwards to debug them efficiently.

Balancing White-Box and Black-Box Monitoring

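Google combines white-box monitoring (inspecting internal system metrics) with black-box monitoring (testing external system behavior). Black-box monitoring is symptom-oriented and shows problems that are affecting users right now, while white-box monitoring relies on instrumentation inside the system and can also reveal imminent problems or failures that retries would otherwise hide. Which one to use depends on the information needed to assess system health.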
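The two approaches can be contrasted in a minimal sketch. The names Counters and black_box_probe, the injected fetch callable, and the one-second timeout are all hypothetical and chosen only for illustration; this is not Google's tooling.

```python
import time
from typing import Callable, Dict


class Counters:
    """White-box: the service increments these from inside its own code."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}

    def inc(self, name: str) -> None:
        self.values[name] = self.values.get(name, 0) + 1


def black_box_probe(fetch: Callable[[], str], expected: str,
                    timeout_s: float = 1.0) -> dict:
    """Black-box: call the service as a user would and judge only the result.

    `fetch` is a hypothetical stand-in for a real client call (for example
    an HTTP GET). The probe knows nothing about the service's internals.
    """
    start = time.monotonic()
    try:
        body = fetch()
        ok = expected in body
    except Exception:
        # Any failure a user would see counts as a failed probe.
        ok = False
    elapsed = time.monotonic() - start
    return {"ok": ok and elapsed <= timeout_s, "elapsed_s": elapsed}
```

A service handler would call counters.inc("requests") on every request (white-box), while a separate prober would call black_box_probe on a schedule and record whether the visible behavior was correct (black-box).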

The Four Golden Signals

Google emphasizes monitoring four golden signals: latency, traffic, errors, and saturation. If you can measure only four metrics of a user-facing system, focus on these; together they give a service reasonable baseline coverage.

Latency: Measures the time it takes to process requests. Track the latency of successful and failed requests separately, because an error that is served very quickly (or very slowly) can distort the overall numbers.
Traffic: Measures the demand placed on your system, typically in requests per second.
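Errors: Tracks the rate of failed requests, whether they fail explicitly (e.g., HTTP 500 errors), implicitly (e.g., an HTTP 200 response with the wrong content), or by policy (e.g., any request slower than a committed one-second response time counts as an error).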
Saturation: Reflects how "full" your service is, focusing on constrained resources like CPU, memory, or I/O.
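As a rough illustration, the four signals above could be computed from a window of request records as follows. This is a sketch under stated assumptions: the Request fields, the one-second LATENCY_SLO_S policy, the use of the 99th percentile, and CPU as the constrained resource are all hypothetical choices, not part of the original text.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One served request. All fields are hypothetical, for illustration."""
    duration_s: float  # wall-clock time spent serving the request
    status: int        # HTTP status code returned
    body_ok: bool      # False if a successful status carried wrong content


LATENCY_SLO_S = 1.0    # assumed policy: slower than this counts as an error


def is_error(r: Request) -> bool:
    explicit = r.status >= 500                    # e.g. HTTP 500
    implicit = r.status < 400 and not r.body_ok   # success status, wrong content
    by_policy = r.duration_s > LATENCY_SLO_S      # broke the latency commitment
    # Note: 4xx responses are not treated as errors in this sketch.
    return explicit or implicit or by_policy


def p99(values):
    """99th percentile; falls back to the max for fewer than two samples."""
    if len(values) < 2:
        return max(values, default=0.0)
    return quantiles(values, n=100, method="inclusive")[98]


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    ok = [r.duration_s for r in requests if not is_error(r)]
    failed = [r.duration_s for r in requests if is_error(r)]
    total = len(requests)
    return {
        "latency_p99_ok_s": p99(ok),           # latency of successful requests
        "latency_p99_failed_s": p99(failed),   # latency of failed requests
        "traffic_rps": total / window_s,       # demand on the system
        "error_rate": len(failed) / total if total else 0.0,
        "saturation": cpu_used / cpu_capacity, # how "full" the resource is
    }
```

Keeping the latency of successful and failed requests in separate series means a burst of fast failures does not make the service look healthier than it is.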

Addressing the Long Term

Monitoring isn't just about detecting immediate issues; decisions about it should also serve long-term goals. Sometimes this means accepting a controlled, short-term hit to availability or performance in exchange for long-term stability and improvements.

A Monitoring Philosophy

Google's SRE teams advocate a monitoring philosophy built on actionable alerts and on paging for symptoms rather than causes. Monitoring should be simple and clear enough to support rapid diagnosis. In practice this means balancing symptom and cause monitoring, keeping long-term goals in mind, and making sure that every alert has a clear purpose.
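One way to picture this philosophy is as a routing rule that decides which of the three alert classes from the terminology (page, ticket, email alert) a condition deserves. The Alert fields and the specific policy in route are assumptions made for this sketch, not rules stated by Google.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical description of an alerting condition."""
    name: str
    is_symptom: bool  # user-visible effect rather than an internal cause
    actionable: bool  # a human can do something about it
    urgent: bool      # waiting until working hours would hurt users


def route(alert: Alert) -> str:
    """Assumed policy: page only for urgent, actionable symptoms."""
    if not alert.actionable:
        return "email"   # informational only; nobody is expected to act
    if alert.is_symptom and alert.urgent:
        return "page"    # interrupt a human now
    return "ticket"      # needs action, but can wait
```

For example, route(Alert("high error rate", True, True, True)) yields a page, while an actionable but non-urgent cause such as a slowly filling disk becomes a ticket.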