Monitoring a single process on a single computer is easy. You can look directly at the metrics you expect to indicate process health and be relatively certain of their import. As you add processes and computers, the challenge grows. Instead of viewing the information locally, it becomes necessary to aggregate metrics in a common location. The number of metrics per second grows, and it becomes more and more challenging to view the complicated system and assess anything at all. At ten machines, the situation is best characterized by confusion and frustration; at a hundred, the effort required to fully understand an incident cannot reliably be extended to every machine involved. We need a higher-order framework to describe the health of the system.
First, it’s important to focus on the metrics that matter most. By that, I don’t mean your favorite metrics, the metric responsible for the last outage, or the metrics added most recently. The four golden signals are a great example: Latency, Traffic, Errors, and Saturation. These are defined in chapter 6 of Google’s Site Reliability Engineering book as follows.
Latency: The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered by loss of connection to a database or other critical backend might be served very quickly; however, because an HTTP 500 indicates a failed request, factoring 500s into your overall latency can result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
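To make the distinction concrete, here is a minimal, purely illustrative Python sketch (all traffic numbers are invented) showing how a burst of fast 500s from a dead backend can drag a blended latency average down and mask both signals:

```python
import statistics
from collections import defaultdict

# Latency samples bucketed by outcome, so fast errors don't distort
# (or hide inside) the latency of requests that actually did work.
latencies = defaultdict(list)  # outcome -> list of seconds

def record(status_code, seconds):
    outcome = "error" if status_code >= 500 else "success"
    latencies[outcome].append(seconds)

# Simulated traffic: normal successes plus fast 500s from a dead backend.
for _ in range(100):
    record(200, 0.250)
for _ in range(20):
    record(500, 0.005)

blended = statistics.mean(latencies["success"] + latencies["error"])
print(round(blended, 3))                               # → 0.209 (misleading)
print(round(statistics.mean(latencies["success"]), 3))  # → 0.25
print(round(statistics.mean(latencies["error"]), 3))    # → 0.005
```

The blended number suggests the service sped up, when in fact a fifth of requests are failing; tracked separately, both the real success latency and the suspiciously fast error latency are visible.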
Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
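The “error by policy” idea can be sketched in a few lines. This is an illustrative classifier, not a real implementation; the one-second SLO is the example threshold from above:

```python
# A request can succeed at the protocol level (HTTP 200) yet still
# count as failed if it violates a response-time commitment.
SLO_SECONDS = 1.0  # assumed one-second response-time commitment

def is_error(status_code, seconds):
    if status_code >= 500:           # explicit protocol-level failure
        return True
    return seconds > SLO_SECONDS     # failure by policy: too slow

# Invented sample: one slow 200 and one 500 among four requests.
requests = [(200, 0.4), (200, 1.7), (500, 0.01), (200, 0.9)]
error_rate = sum(is_error(s, t) for s, t in requests) / len(requests)
print(error_rate)  # → 0.5
```

Note that the slow-but-successful request counts against the error rate even though a load balancer would never flag it.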
Saturation: How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
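A windowed 99th-percentile tracker of the kind described can be sketched as follows. This is a naive nearest-rank implementation over a sliding time window; production systems would typically use histograms or quantile sketches instead:

```python
import math
from collections import deque

class WindowedP99:
    """Track the 99th-percentile latency over the last `window` seconds."""

    def __init__(self, window=60.0):
        self.window = window
        self.samples = deque()  # (timestamp_seconds, latency_seconds)

    def record(self, ts, latency):
        self.samples.append((ts, latency))
        # Evict samples that have fallen out of the window.
        while self.samples and self.samples[0][0] < ts - self.window:
            self.samples.popleft()

    def p99(self):
        values = sorted(latency for _, latency in self.samples)
        if not values:
            return None
        rank = math.ceil(0.99 * len(values))  # nearest-rank percentile
        return values[rank - 1]

# Invented traffic: 98 fast requests, then 2 slow ones, all within a minute.
tracker = WindowedP99(window=60.0)
for i in range(98):
    tracker.record(ts=i * 0.5, latency=0.05)
tracker.record(ts=49.0, latency=2.0)
tracker.record(ts=49.5, latency=2.0)
print(tracker.p99())  # → 2.0
```

Just 2% of requests slowing down moves the p99 from 50 ms to 2 s, which is why a short-window tail percentile surfaces saturation well before averages do.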
Finally, saturation is also concerned with predictions of impending saturation, such as “It looks like your database will fill its hard drive in 4 hours.”
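Such a prediction can be as simple as fitting a straight line through recent usage samples and extrapolating to capacity. This is a deliberately naive least-squares sketch with invented numbers; real capacity planning would account for non-linear growth:

```python
def hours_until_full(samples, capacity_gb):
    """samples: list of (hour, used_gb) pairs, assumed roughly linear growth."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    # Least-squares slope: GB of growth per hour.
    slope = (sum((t - mean_t) * (u - mean_u) for t, u in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    if slope <= 0:
        return None  # usage flat or shrinking; no fill predicted
    latest_t, latest_u = samples[-1]
    return (capacity_gb - latest_u) / slope

# Invented data: ~10 GB/hour growth with 40 GB of headroom left.
samples = [(0, 400), (1, 410), (2, 420), (3, 430), (4, 440), (5, 450), (6, 460)]
print(round(hours_until_full(samples, 500.0), 1))  # → 4.0
```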
Alternative views into this situation exist. Brendan Gregg proposes the USE method, which focuses particularly on physical resources but extends to some software resources as well. Given these definitions, one could imagine monitoring any sort of machine or physical process in these terms. He defines the terms as follows:
Resource: all physical server functional components (CPUs, disks, busses, …)
Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can’t service, often queued
Errors: the count of error events
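A toy illustration of what a USE snapshot for a single resource might look like, here a worker draining a queue. The function name and all numbers are invented for the example:

```python
def use_snapshot(busy_seconds, interval_seconds, queued_jobs, error_count):
    """Summarize one resource over one measurement interval in USE terms."""
    return {
        "utilization": busy_seconds / interval_seconds,  # fraction of time busy
        "saturation": queued_jobs,    # work the resource couldn't service yet
        "errors": error_count,        # count of error events in the interval
    }

# A worker that was busy 54 of the last 60 seconds, with 7 jobs queued.
snap = use_snapshot(busy_seconds=54.0, interval_seconds=60.0,
                    queued_jobs=7, error_count=2)
print(snap["utilization"])  # → 0.9
```

High utilization with a growing queue is exactly the pattern the USE method is designed to make visible at a glance.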
Tom Wilkie of Weaveworks suggests the RED metrics as a framework for monitoring services. These make no reference to physical hardware, only to performance characteristics, so one could imagine applying this kind of monitoring even in serverless architectures. They are defined as follows:
For every service, check that the:
Rate (the number of requests per second),
Errors (the number of those requests that fail), and
Duration (distributions of the time each request takes)
are within SLO/SLA.
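A minimal sketch of a RED rollup over one measurement window. The log format and numbers are hypothetical; each entry is a (succeeded, duration_seconds) pair collected over a 60-second window:

```python
def red_rollup(log, window_seconds=60.0):
    """Compute Rate, Errors, and a Duration summary for one window."""
    durations = sorted(d for _, d in log)
    n = len(log)
    return {
        "rate": n / window_seconds,                               # requests/sec
        "errors": sum(1 for ok, _ in log if not ok) / window_seconds,  # failures/sec
        "duration_p50": durations[n // 2],                        # median latency
    }

# Invented window: 118 fast successes and 2 slow failures.
log = [(True, 0.1)] * 118 + [(False, 0.5)] * 2
summary = red_rollup(log)
print(summary["rate"])          # → 2.0
print(summary["duration_p50"])  # → 0.1
```

Each of the three numbers can then be compared directly against the corresponding SLO/SLA threshold.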
Where they apply, each of these frameworks is a good option. If you’re not sure which one to apply, the four golden signals cover everything, and monitoring one extra graph may well be worth the effort. In any case, focusing on some set of these KPIs is more important than picking the “correct” one.
Even at moderate scale, system complexity can be overwhelming when you are trying to determine the root cause of an event. Higher-order models of system health like the four golden signals, USE, and RED provide a great overview of your systems. A significant challenge remains, however: how to identify the root cause.
In this series, we will discuss what to do with these metrics once you have them, and how they can make your operations easier.