Prometheus Relabel Rules and the ‘action’ Parameter

Today I want to talk about the action parameter in the relabel_config and metric_relabel_config elements in Prometheus. This was an epiphany I had while searching for how to dig substrings out of the __meta_* label names returned from service discovery (hint: use action: labelmap).

[Relabel configs](<relabel_config>) are composed of the following:

  • source_labels
  • separator (default ‘;’)
  • target_label – mandatory for replace actions. More on this below
  • regex (default ‘.*’)
  • modulus
  • replacement (default ‘$1’)
  • action (default ‘replace’)

Some of these elements have defaults, others are required based on the value of the action element.

When first learning about relabel configs in Prometheus you will encounter many examples that look something like this:

    - source_labels: [__meta_kubernetes_role]
      action: keep
      regex: (?:apiserver|node)
    - source_labels: [__meta_kubernetes_role]
      target_label: job
      replacement: kubernetes_$1
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

In every example I’ve found, the stanza starts with source_labels as the first entry in the list of elements. As the docs state:

action defaults to ‘replace’

After reading the code (trying to find labelmap) it occurred to me that action is really the star of the show.

There are many different actions available to the relabel configs (lifted from the docs):

  • replace: Match regex against the concatenated source_labels. Then, set
    target_label to replacement, with match group references
    (${1}, ${2}, …) in replacement substituted by their value. If regex
    does not match, no replacement takes place.
  • keep: Drop targets for which regex does not match the concatenated source_labels.
  • drop: Drop targets for which regex matches the concatenated source_labels.
  • hashmod: Set target_label to the modulus of a hash of the concatenated source_labels.
  • labelmap: Match regex against all label names. Then copy the values of the matching labels
    to label names given by replacement with match group references
    (${1}, ${2}, …) in replacement substituted by their value.
  • labeldrop: Match regex against all label names. Any label that matches will be
    removed from the set of labels.
  • labelkeep: Match regex against all label names. Any label that does not match will be
    removed from the set of labels.

From a neophyte’s perspective, perhaps all relabel-rule examples should start with action, even when it’s redundant. To rewrite the example above:

    - action: keep
      source_labels: [__meta_kubernetes_role]
      regex: (?:apiserver|node)
    - action: replace
      source_labels: [__meta_kubernetes_role]
      target_label: job
      replacement: kubernetes_$1
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

By leading with action it is crystal clear what is happening.
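The actions without an example above follow the same pattern. As a sketch (the four-way shard, the __tmp_shard label, and the annotation prefix are my own illustrative choices, not from the Prometheus docs), hashmod paired with keep spreads scrape targets across multiple Prometheus servers, and labeldrop prunes unwanted labels:

    # Hash the target address into one of 4 buckets...
    - action: hashmod
      source_labels: [__address__]
      modulus: 4
      target_label: __tmp_shard
    # ...and keep only the targets belonging to this server's shard.
    - action: keep
      source_labels: [__tmp_shard]
      regex: 2
    # Remove any label matching a (hypothetical) noisy prefix.
    - action: labeldrop
      regex: annotation_.*

Each of the other servers in the fleet would carry the same rules with a different regex value for the keep step.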

Monitoring is Dead, Long Live Observability

At this year’s (2017) Monitorama, there were several presentations that declared “monitoring is dead”. The context for such a bombastic statement was that “traditional” time-series monitoring tools do not give developers or operators enough information to truly know what is happening in their systems. The argument was that if your system is not telling you exactly what is going on, down to the request level, you have no hope of making sense of what is happening or what effect the change you just shipped will have. This is especially pertinent with the complex micro-services architectures we are building today.

However, monitoring is far from dead.

Because words matter, choosing the correct terminology to describe the type of monitoring you are doing is important. From my perspective, there are five types of “monitoring” systems: Infrastructure Monitoring, Log Analysis, Distributed Tracing, Application Performance Monitoring (APM) and Real User Monitoring. All of these observability tools fall under the “monitoring” umbrella. Let’s take a quick look at each.

Infrastructure Monitoring

Infrastructure Monitoring (IM) is the most common use of the term “monitoring”. Aggregate metrics are collected from a variety of hardware and applications and plotted on graphs with time on the X-axis. Graphs are assembled into dashboards and placed in central, visible places.

These systems are the first stop when responding to alerts; they help answer what is happening and how it compares to recent history. More advanced versions of these systems automatically detect anomalies based on historic data.

Log Analysis

Human-readable, text-based information from our systems is long rooted in the UNIX operating system. The primary way the system communicates with its human overlords is by writing a line of text into a log file in a well-known location. This pattern is simple, and the tools that ship with UNIX can deal with the text format. The pattern has been copied into all corners of software engineering: print plain text to a file. The proliferation of log files has created entire industries around the distribution, parsing and searching of logs, with impressive results. Correlation of timestamps across distributed logs and extraction of numerical data from logs have made this type of tool indispensable for some teams.

The recent rise of structured logging, using a regular, parseable log format, makes the job of the log analysis tool easier. There is no need to write a custom parser or parsing rules to extract the important signal from the human-readable noise.

At scale, this infrastructure comes at a cost. Indexing and searching terabytes of data per week/day/hour quickly becomes a burden for hardware scaling, licensing or both.

Distributed Tracing

With the rise of Service-Oriented Architecture (SOA), or as it’s called these days, a micro-services architecture, developer team productivity is being optimized over the complexity of the overall system. Infrastructure Monitoring still affords an aggregate level of observability for any given micro-service; however, determining the slow points in an architecture where requests may touch 3–12 different systems quickly becomes intractable.

In distributed tracing, a standard set of headers is passed and propagated between all distributed systems. Detailed performance metrics are captured for all possible interaction points such as third-party APIs, databases and messaging systems. When correlated by an outside system, the entire call structure, and its performance through the system, is revealed in great detail. At low volumes these systems will emit tracing information for every request. When the correlation system becomes overwhelmed, sampling can be employed.

When distributed tracing systems operate at the request, or event, level, a very detailed picture of how the system is performing emerges. Further, if traces are annotated with who is invoking the request, observability starts to include why the system is slow in the form of who is making the request. This type of demographic and usage information is useful for every aspect of the business, not just engineering.

Distributed tracing can also be the source of time-series-style data if stream processing is employed to aggregate the individual events. This is often done when an aggregation is needed on a subset of the available dimensions; the grain of IM metrics is too coarse and the deluge of individual events is too fine.

Application Performance Monitoring (APM)

Application Performance Monitoring is a powerful black-box monitoring solution. Application code is post-processed, or byte-code injection is employed, to potentially measure every line of code and every external system interaction. If used throughout a homogeneous infrastructure, distributed tracing can also be derived from the stream of data.

APM has a very low barrier to entry, as developers need to make few code-level changes. However, code-less APM only works on some runtime engines: Java, .NET and Python, for example. Other runtimes need to be instrumented by developers using a vendor SDK. Further, if you have a polyglot micro-services architecture, not all language runtimes will be instrumented, leaving gaps in visibility.

Real User Monitoring (RUM)

All of the above monitoring solutions are back-end focused. They can tell you in great detail about the performance of the services and the servers they run on. But rich browser user interfaces and Internet latency also contribute to the overall end-user experience. Real User Monitoring instruments the JavaScript running in the browser and sends performance and usage information to a metrics collector.

This method shows what performance the end-user is experiencing. Time spent in the browser, network latency and overall back-end performance are detailed. It does not matter that your systems are fast if the user’s experience is slow.

Some systems tie the browser instrumentation into the back-end distributed tracing, giving full click-to-render detail that includes all back-end processing. Pairing this with Infrastructure Monitoring is especially powerful.

Dead? Really?

Each one of these observability systems gives you important insight into the behavior of your system. The tools are chosen based on the challenges your team has encountered in trying to debug your system. If you have gobs of information in your application logs, you may choose a log analysis system. If your customers are complaining about your application being slow, it’s reasonable to reach for a real-user monitoring tool so you can see what they see.

While there is no one tool or vendor that does everything, several vendors are starting to realize that the combination of two or more of these observability types becomes more than the sum of its parts.

Infrastructure Monitoring, while decreed dead, still remains at the center of any observability strategy. Aggregate visibility into our systems is paramount to understanding those systems and how they change over time. These tools are the canary in the coal mine, and they should be the first place you look when the alerts are going off.

Unfortunately, the current state of Infrastructure Monitoring tools gives us only rudimentary information when things are out of tolerance. While machine learning promises that alerts will only be sent when things are really wrong, we have yet to achieve this nirvana.

Monitoring may well be dead, but observability is alive and kicking.

What Makes a System Healthy (Part 2)

In the first part of this series, we discussed the various frameworks for getting good KPIs out of your monitoring, and for simplifying your monitoring down to a manageable number of metrics. In this part, we will discuss what you should do with these metrics, how to dig in if they go awry, and what impact this can have on your operations.

What makes these metrics valuable? Their value is not that they tell you precisely what’s wrong, but that they give a leading indicator that something is wrong to begin with. They are also valuable because, unlike many metrics, each of these is human-interpretable, and it’s easy to say which direction is desirable: a service is doing well if it has high throughput, low latency, low errors, and reasonable saturation.

What makes these metrics helpful? First is obviously simplicity. They are simpler to monitor from an operations perspective, and simpler to instrument from a development perspective. Setting this viewpoint coherently between both groups helps facilitate a smooth devops process.

Instead of a massive dashboard full of every value that a service reports, looking at these four metrics per service dramatically reduces the cognitive load of the operations process. If any of the four golden signals is bad, there’s a problem (e.g. responses taking forever, people can’t get to the service, machines all saturated, error rate is high) and something should be done. As far as users can tell, if those four metrics are fine, then none of the rest really matter.

This may seem like a small thing, but it’s not. Let’s take an example. Apache Kafka is a distributed streaming platform in wide use. It emits literally hundreds of metrics; in dashboards, this is page upon page of graphs. Many of these may behave in strange ways for any number of reasons, and watching them, it’s unclear whether there’s any consequence to many of them being high, low, or behaving strangely. The consuming service essentially sees only the four golden signals. So do these others even matter? Not if things are going well.

This framework is great until, of course, one of the main KPIs blows up. The standard solution is unfulfilling: if the KPIs blow up, you sift through the rest of the metrics to find out what else is going wrong, in hopes of driving toward the root cause. This process is intensely manual, and here FreshTracks offers something different. From a data perspective, there are many metrics which have nothing to do with each other, but also a rich set of metrics which influence each other’s behaviors.

From a statistical perspective, a comprehensive study of these is an exercise in the extremes of the multiple comparisons problem. Attempting to learn about a huge number of things simultaneously from the same data diminishes the statistical power of each analysis. If you’re trying to figure out whether one metric influences another in a hypothesis-testing framework, for example, you may find signals that are stronger than 95% likely, meaning there’s a 1-in-20 chance of noise looking like signal. If you make several thousand of these comparisons, it’s only natural to expect several false signals. This can be corrected, but the number of samples (and therefore the time) required to make these assessments grows to perhaps untenable bounds.
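To put rough numbers on it (the count of 2,000 comparisons is an invented round figure for illustration): at significance level α, the expected number of false signals among m independent comparisons, and the classic Bonferroni remedy, look like this:

    % Expected false signals among m independent comparisons at level alpha:
    \mathbb{E}[\text{false signals}] = \alpha m
      \qquad \text{e.g. } 0.05 \times 2000 = 100.
    % Probability of seeing at least one false signal:
    P(\text{at least one}) = 1 - (1 - \alpha)^m \approx 1 \quad \text{for large } m.
    % Bonferroni correction: test each comparison at level alpha/m, so that
    P(\text{any false signal}) \le m \cdot \frac{\alpha}{m} = \alpha.

The correction works, but testing each pair at α/m demands far more data per test, which is exactly the untenable growth in samples described above.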

Figuring out how the metrics impact the KPIs alone, on the other hand, makes this problem tractable. Indeed, being able to predict the KPIs from the other metrics gives you a map to use when the KPIs go awry. To this end, FreshTracks thinks there’s a better way to automatically drive at the root cause of production problems: human-interpretable machine learning models. These are white-box models, whose decisions can be audited and understood by humans. In seeing what the models do, their results can be trusted. And because of that interpretability, model insights can be applied in other domains, for other problems.

This gives a multivariate lens into whether or not your systems are behaving as expected, a fresh set of information to mine for system anomalies, especially if the prediction of the KPI and the KPI itself drift too far away from each other. In a system with which we are not familiar, this can save an enormous amount of time by solving one of the most fundamental problems: where to start.

What makes a System Healthy?

Monitoring a single process on a single computer is easy. You can look clearly at the metrics you expect to indicate process health and be relatively certain of their import. As you add processes and computers, the challenge becomes harder: instead of viewing the information locally, it becomes necessary to aggregate metrics in a common location. The number of metrics per second grows, and it becomes more and more challenging to view the complicated system and assess anything at all. At ten machines, the situation is best characterized by confusion and frustration; at 100, the effort required to fully understand an incident cannot reliably be extended to all the machines in question. We need a higher-order framework to describe the health of the system.

First, it’s important to focus on the metrics that matter most. By that, I don’t mean the favorite metrics, or the metric that was responsible for the last outage, or the newest metrics added. The four golden signals are a great example: Latency, Traffic, Errors, and Saturation. These are defined in chapter 6 of Google’s Site Reliability Engineering book as follows.

Latency

The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

Traffic

A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

Errors

The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.

Saturation

How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.

In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.

Finally, saturation is also concerned with predictions of impending saturation, such as “It looks like your database will fill its hard drive in 4 hours.”
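The four signals above map directly onto Prometheus recording rules. This is only a sketch: the metric names (http_requests_total, http_request_duration_seconds_bucket, node_cpu_seconds_total) are conventional client-library and node_exporter names, not anything prescribed by the SRE book.

    groups:
      - name: golden-signals
        rules:
          # Traffic: requests per second over the last five minutes.
          - record: job:http_requests:rate5m
            expr: sum(rate(http_requests_total[5m])) by (job)
          # Errors: fraction of requests that returned an HTTP 5xx.
          - record: job:http_errors:ratio5m
            expr: >
              sum(rate(http_requests_total{code=~"5.."}[5m])) by (job)
                / sum(rate(http_requests_total[5m])) by (job)
          # Latency: 99th percentile response time from a histogram.
          - record: job:http_latency:p99_5m
            expr: >
              histogram_quantile(0.99,
                sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
          # Saturation: CPU utilization as a stand-in for the most
          # constrained resource (pick the right resource for your system).
          - record: instance:cpu_utilization:rate5m
            expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)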

Alternative views of this situation exist. Brendan Gregg cites the USE metrics; these are particularly focused on physical resources, but extend to some software resources as well. Given their descriptions, one could imagine that any sort of machine or physical process could be monitored in these terms. He defines them as follows:

  • Resource: all physical server functional components (CPUs, disks, busses, …)
  • Utilization: the average time that the resource was busy servicing work
  • Saturation: the degree to which the resource has extra work which it can’t service, often queued
  • Errors: the count of error events
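Like the golden signals, USE maps naturally onto Prometheus. Here is a sketch for a single resource class (disk), assuming node_exporter-style metric names; note that node_exporter exposes few true disk-error counters, so the errors rule uses an imperfect filesystem-level proxy:

    groups:
      - name: use-disk
        rules:
          # Utilization: fraction of time the device was busy doing I/O.
          - record: instance_device:disk_utilization:rate5m
            expr: rate(node_disk_io_time_seconds_total[5m])
          # Saturation: average queue depth of outstanding I/O.
          - record: instance_device:disk_saturation:rate5m
            expr: rate(node_disk_io_time_weighted_seconds_total[5m])
          # Errors: filesystem device errors as a (rough) proxy.
          - record: instance_device:fs_errors
            expr: node_filesystem_device_error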

Tom Wilkie of Weaveworks suggests the RED metrics as a framework for monitoring services. These have no reference to any physical hardware, just performance characteristics; one could imagine doing this kind of monitoring even in serverless architectures. They are defined as follows:

For every service, check that the request:

  • Rate,
  • Error (rate) and
  • Duration (distributions)

are within SLO/SLA.
Where they apply, these each represent good options. If you’re not precisely sure which framework to apply, the four golden signals cover everything, and monitoring one extra graph may be well worth the effort. In any case, focusing on a set of these KPIs is more important than picking the “correct” one.

Even at moderate scale, system complexity can be overwhelming when trying to determine the root cause of an event. Higher-order models for describing system health, like the four golden signals, USE and RED, provide a great overview of your systems. However, a significant challenge remains: how to identify the root cause.

In this series we will discuss what to do with these metrics once you have them, and how they can make your operations easier.

About the Author

Chris Bonnell

Math PhD focusing on machine learning and signal processing.

First Post

16 inches fresh at Winter Park today…  And we’re making our website.  We may not be doing it right