Improve Prometheus Monitoring in Kubernetes with Better Self-Scrape Configs

Thorough instrumentation and observability are important for all running code, and this applies especially to data stores like Prometheus. That’s why Prometheus exposes its internal metrics in a Prometheus-compatible format and provides an out-of-the-box static scrape target of localhost:9090, so that right away an end-user can request of it, “Prometheus, observe thyself.”

The Visibility Struggle

In a production setting, scraping localhost:9090 is only helpful if Prometheus is running on a single static node with a well-known hostname. In this scenario, querying, say:
prometheus_local_storage_memory_chunks

… will yield something like:
prometheus_local_storage_memory_chunks{instance="localhost:9090"}

This is somewhat “OK” since everyone presumably knows what host houses Prometheus.

At FreshTracks, however, we run Prometheus on Kubernetes – and this instance label gives us less information than we usually need to monitor said Prometheus instance(s). The problem is compounded when you run more than one Prometheus instance in a high-availability configuration. While we can technically force Prometheus to run on a single “static” node within a K8s cluster, there isn’t often a strong need to do so (and it tends to defeat the purpose of using an orchestration platform).

Assuming that we let the K8s scheduler decide where our Prometheus pod(s) reside, we’ll inevitably need to know which specific node a particular pod is running on at any given time. For example: Consider a Prometheus 1.x instance that is entering “rushed mode” periodically. You’ll likely need to check out disk and memory stats on the underlying node, but you can’t if the only information you have about each running pod is {instance="localhost:9090"}.

The Case for Targeted Service Discovery

This is where Prometheus’ Kubernetes service discovery features can help us out. When service discovery is considered, it’s usually in the context of “I have a lot of things and I don’t have a clue where they might be from one moment to the next.” That scope can reasonably be scaled back considerably, though, to one of “I have a single thing that floats around in my cluster.” To get away from our opaque label set of {instance="localhost:9090"}, we employ a service discovery configuration such as:

scrape_configs:
- job_name: prometheus
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:

  # Constrain service discovery to one specific Prometheus 
  #  workload running one or more pods
  - action: keep
    source_labels: [__meta_kubernetes_pod_name]
    # Replace match string with your deployment's name
    regex: ^p8s-prometheus-server.+$
    replacement: $1

  # Swap "instance" label with the cluster node's IP address
  - action: replace
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: (.*)
    target_label: instance
    replacement: $1

  # Don't want to lose any extra flavor from custom K8s labels
  - action: labelmap
    source_labels: []
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1

  # Particularly helpful to have a pod_name label in K8s HA configs
  - action: replace
    source_labels: [__meta_kubernetes_pod_name]
    regex: (.*)
    target_label: kubernetes_pod_name
    replacement: $1

Now we have a much richer label set, and we can use our new instance value to associate our node-exporter metrics with our Prometheus metrics in Grafana as follows (using Grafana’s templating and query interpolation):

### Template query, variable name = "node"
label_values(prometheus_target_interval_length_seconds_count, instance)    

### Example panel query
sum(irate(node_disk_reads_completed{instance=~'$node.*'}[5m]))

About the Author

Cody Boggs

Lead Yak Shaver at FreshTracks.io

What is the New Kubernetes Metrics Server?

 

Kubernetes released the new “metrics-server” as an alpha feature in Kubernetes 1.7, and it is slated for beta in 1.8. The related documentation and code seem haphazard and are difficult to collect and digest. This is my attempt to collect and summarize them.

The Current State of Metrics

When we talk about “Kubernetes Metrics” we are mostly interested in the node/container-level metrics: CPU, memory, disk and network. These are also referred to as the “Core” metrics. “Custom” metrics refer to application metrics, e.g. HTTP request rate.

Today (Kubernetes 1.7), there are several sources of metrics within a Kubernetes cluster:

Heapster

  • Heapster is an add-on to Kubernetes that collects node, namespace, pod and container level metrics and forwards them to one or more “sinks” (e.g. InfluxDB). It also provides REST endpoints to gather those metrics. The metrics are constrained to CPU, filesystem, memory, network and uptime.
  • Heapster queries the kubelet for its data.
  • Today, heapster is the source of the time-series data for the Kubernetes Dashboard.
  • A stripped down version of heapster will be the basis for the metrics-server (more below).

cAdvisor

  • The cAdvisor project is a standalone container/node metrics collection and monitoring tool.
  • cAdvisor monitors node and container core metrics in addition to container [events](https://github.com/google/cadvisor/blob/master/docs/api.md#events).
  • It natively provides a Prometheus metrics endpoint.
  • The Kubernetes kubelet has an embedded cAdvisor that only exposes the metrics, not the events (a scrape config sketch follows this list).
  • There is talk of moving cAdvisor out of the kubelet.
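
Since the kubelet-embedded cAdvisor exposes a Prometheus endpoint, it can be scraped directly. Here is a minimal sketch of a scrape config, modeled on the well-known example configs from the Prometheus project; it assumes Prometheus runs in-cluster with a service account that is allowed to proxy to nodes, and on kubelets older than roughly 1.7.3 the cAdvisor metrics live on the kubelet’s main /metrics endpoint rather than /metrics/cadvisor:

scrape_configs:
- job_name: kubernetes-cadvisor
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  # Carry the Kubernetes node labels over onto the scraped series
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  # Reach each kubelet's embedded cAdvisor through the API server proxy
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor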

The Kubernetes API

  • The Kubernetes API does not track metrics per se, but it can be used to derive cluster-wide, state-based metrics, e.g. the number of pods or containers running. Kube-state-metrics is one project that does just this.

Metrics Needs of the Kubernetes Control Plane

The Kubernetes scheduler and (eventually) the Horizontal Pod Autoscaler (HPA) need access to the “Core” metrics, as they apply to both nodes and containers, in order to make scheduling decisions. Currently there is no standard API mechanism within Kubernetes to get metrics from any of the above metrics sources. From the Metrics Server Design Doc:

Resource Metrics API is an effort to provide a first-class Kubernetes API (stable, versioned, discoverable, available through apiserver and with client support) that serves resource usage metrics for pods and nodes.

The “metrics-server” feature (alpha in 1.7, beta in 1.8) solves this by running a stripped-down version of Heapster, called the “metrics-server”, in the cluster as a single instance. The metrics-server will collect “Core” metrics from cAdvisor APIs (currently embedded in the kubelet) and store them in memory, as opposed to in etcd. Because the metrics-server will not be a component of the core API server, a mechanism for aggregating API-serving components was needed. This is called the “kube-aggregator” and was the major blocker for this project.
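
To give a feel for the kube-aggregator mechanism, here is a sketch of the APIService object that registers the Resource Metrics API with the aggregator so that it is served through the normal apiserver. The field values below are modeled on the metrics-server deployment manifests and may differ by version:

apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  # Route requests for metrics.k8s.io/v1beta1 to the metrics-server service
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100

Once registered, clients fetch the metrics through the apiserver, e.g. under /apis/metrics.k8s.io/v1beta1/nodes and /apis/metrics.k8s.io/v1beta1/pods.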

Overall Monitoring Architecture

The Kubernetes monitoring architecture Design Proposal is a great resource for where all the pieces fit together. Highlights:

  • The metrics-server will provide a supported API for feeding schedulers and horizontal pod auto-scalers
  • There are plans for an “Infrastore”, a Kubernetes component that keeps historical data and events
  • User supplied monitoring tools should not be talking to the metrics-server directly
  • User supplied monitoring tools that have application metrics can be a source of data to the HPA via an adapter
  • cAdvisor (embedded or not) should be the source of container metrics
  • All other Kubernetes components will supply their own metrics in a Prometheus format

Looking Forward

The metrics-server will provide a much needed official API for the internal components of Kubernetes to make decisions about the utilization and performance of the cluster. In true Kubernetes fashion, long-term metric collection and storage will remain an optional and pluggable component. It would appear that all Kubernetes internal metrics will continue to be exposed using the Prometheus exposition format, which is great given the surging popularity of Prometheus in the Cloud Native ecosystem.

About the Author

Bob Cotton

Founder FreshTracks.io a CA Accelerator Incubation

Deploying an HA Prometheus in Kubernetes on AWS – Multiple Availability Zone Gotchas

It is a best practice to run Prometheus in an HA mode by running 2 instances of Prometheus on separate hosts in separate Availability Zones, each configured to scrape the same targets and send alerts to the same AlertManager(s). This minimizes the chance of missing alerts in the event that a Prometheus instance dies or an Availability Zone becomes unavailable.

Prometheus installations at scale usually require a significant quantity of resources – particularly memory, CPU, and disk – to run efficiently. We generally don’t want other pods running on the Prometheus nodes polluting the shared page cache and causing memory or disk pressure, so we need a way to pin and isolate our Prometheus pods.

At FreshTracks we use kops to provision our K8s clusters in AWS and Helm to install most software into those clusters. We have a mix of server types and utilize spot pricing for some of our workload. Spot pricing would not be a great option for running Prometheus, as we don’t want those instances to disappear unexpectedly, even though we’re architected to withstand such an event.

Kubernetes gives us several tools for node or resource isolation; however, sometimes theory is usurped by practice when the realities of the underlying cloud provider come into play. In our case we saw our EBS-backed PersistentVolumes being auto-provisioned in different Availability Zones from the hosts provisioned by the kops InstanceGroup.

The workaround was to create two separate StorageClass definitions in Kubernetes to force each EBS volume into a particular availability zone and tell the two separate Prometheus pods to use those StorageClasses.

This blog post provides a walkthrough to get a reasonable HA Prometheus setup running using Helm and kops to achieve our goals.

Assumptions

For this post, we’re assuming a couple of things. Note that these are not requirements to achieve the end-goal of HA Prometheus instances in an AWS Kubernetes cluster, just a convenient way to get there:

  • You have a kops-built cluster in AWS
  • You’re using Helm to install Prometheus

Objectives

By the end of this walkthrough, we should have:

  • 1 new kops InstanceGroup which maintains 2 dedicated Prometheus instances. Each instance should:
    • Be sized sufficiently for a “busy” Prometheus server
    • Reside in a different AvailabilityZone than one another
    • Have access to EBS-backed PersistentVolumes that reside in their respective AvailabilityZones
  • 2 Prometheus pods running in our AWS K8S cluster. Each pod should:
    • Run only on our dedicated instances
    • Run on a different instance than one another. (ie: pods shouldn’t overlap)
    • Not have to worry about non-Prometheus pods running on the dedicated instances
    • Send alerts to a single AlertManager service
  • No duplication of the various other entities that come with the Prometheus Helm chart

There are some out-of-scope objectives that are worth pursuing, but not as part of this walkthrough:

  • HA AlertManager service(s)
  • Dynamic failure detection for dashboarding tools (ie: make Grafana switch data sources if one dies)

Building Dedicated Instances

Out of the box, kops builds a handful of InstanceGroups – specifically, one InstanceGroup per master per AvailabilityZone, and a single InstanceGroup that holds the nodes. Since a kops InstanceGroup maps to an AutoScaling Group in AWS and we need to specify a particular instance type (and count) for our Prometheus-only nodes, we can’t just scale up the nodes InstanceGroup or simply bump the instance type – we won’t get the isolation we want. The solution here is to create a new InstanceGroup that specifies the things we need:

  1. Node labels to ensure Prometheus pods aren’t scheduled to any other sufficiently-sized nodes
  2. Node taints to ensure that nothing but Prometheus pods get scheduled to the Prometheus-only nodes (exempting the standard Kubernetes pods and expected DaemonSets)
  3. An appropriate instance size that will handle the load we’ll throw at Prometheus
  4. The number of nodes we want to dedicate to Prometheus

Here’s an example InstanceGroup spec that will accomplish what we want:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my.k8s.cluster
  name: nodes-prometheus
spec:
  image: kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02
  machineType: m3.xlarge
  maxSize: 2
  minSize: 2
  nodeLabels:
    prometheus-only: "true"
  role: Node
  subnets:
  - us-west-2a
  - us-west-2b
  taints:
  - dedicated=prometheus:NoSchedule

The steps to create our new InstanceGroup (assuming we’ve set the appropriate AWS_PROFILE and KOPS_STATE_STORE env vars) are:

  1. kops get instancegroup nodes -o yaml > prometheus-ig.yaml
  2. Edit prometheus-ig.yaml to reflect the new values (example above)
  3. kops create -f prometheus-ig.yaml
  4. kops update cluster my.k8s.cluster --yes

Note that there’s no need to run a kops rolling-update after these steps, since this is a purely additive change. The kops update --yes is sufficient and your new nodes will be provisioned as soon as possible.

We can verify that our settings were applied correctly with a quick kubectl get node <new instance name> -o yaml (results snipped for brevity):

apiVersion: v1
kind: Node
metadata:
  <snip>
  labels:
    <snip>
    prometheus-only: "true"
  name: <instance name>
spec:
  <snip>
  taints:
  - effect: NoSchedule
    key: dedicated
    timeAdded: null
    value: prometheus

Pod and Node Isolation – Why not DaemonSets?

We want our Prometheus installation to adhere to the rules set out in our objectives above – specifically, each dedicated Prometheus host should have exactly one Prometheus pod, and nothing else [1].

Normally a DaemonSet would be a great fit for this situation, as we want to make sure that both Prometheus instances are never scheduled to the same node. Since we’re installing Prometheus via Helm, however, a non-trivial amount of customization is needed to make the chart configure Prometheus as a DaemonSet instead of a Deployment. As a result, we decided to use a fairly bullet-proof combination of node taints, node selectors, and resource requests to pin the Prometheus servers to the appropriate nodes with no overlap.

A Note On Persistent Volumes and Availability Zones

We originally set out to use resource requests to force our two Prometheus pods to reside on different AWS instances, since we couldn’t easily make Helm install Prometheus as a DaemonSet. This generally works by making each pod request more than half of the resources available on an individual target node. Such a configuration forces the scheduler to place the pods on different nodes, since scheduling both on one node would require more than 100% of its available resources.

As it turns out, this was insufficient in AWS due to our use of multiple AZs for the dedicated Prometheus nodes. A kops InstanceGroup will effectively spread your nodes across whatever Availability Zones have been specified therein. This is exactly what we want for single-region HA, but there’s a catch: the EBS volumes which back our PersistentVolumes are not likewise spread across those AZs. This can lead to a situation where at least one of the Helm-created PersistentVolumeClaims can’t bind any of the PersistentVolumes, as they reside in a different AZ than the node that is trying to use the PersistentVolumeClaim. This leaves you with one pod “running”, and the other “pending” indefinitely.

To remedy this, we create two discrete StorageClass entities that explicitly specify the zones in which our EBS volumes should be created. These are defined as follows:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    k8s-addon: storage-aws.addons.k8s.io
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2a
  name: gp2-us-west-2a
parameters:
  type: gp2
  zone: us-west-2a
provisioner: kubernetes.io/aws-ebs

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    k8s-addon: storage-aws.addons.k8s.io
    failure-domain.beta.kubernetes.io/region: us-west-2
    failure-domain.beta.kubernetes.io/zone: us-west-2b
  name: gp2-us-west-2b
parameters:
  type: gp2
  zone: us-west-2b
provisioner: kubernetes.io/aws-ebs

A quick kubectl create -f later, and now we have two very specific StorageClasses that we can use in our custom Helm releases.

(We decided to keep the resource requests in place, as they provide an extra layer of overlap protection and it’s also an IaaS-agnostic solution to the problem.)

Installing Prometheus via Helm

Next up, we need to create a custom configuration for our Prometheus Helm chart releases. Note a couple of caveats:

  • You will need at least the 4.1.1 version of the stable/prometheus chart for these steps to work properly
  • “Helm chart releases” is not a typo – we will be running two discrete chart releases for Prometheus

As we’ll be running two discrete customized Helm chart releases, we need to grab the values from the default chart so that we can edit them. We’ll start with the first release:
helm inspect values stable/prometheus > first-prometheus-helm.yaml

We need a few key pieces in this first configuration to yield a Prometheus deployment whose pods will reside on the proper nodes:

  1. Valid node-selector to avoid wandering Prometheus pods
  2. Valid taint toleration to allow Prometheus pods to reside on our tainted nodes
  3. Sufficient resource requests which will avoid doubling up our Prometheus pods on a single node

Here’s an example chart configuration that should work well for our m3.xlarge instance type (example heavily condensed for brevity):

server:
  <snip>
  name: server
  nodeSelector:
    prometheus-only: "true"
  replicaCount: 2
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "prometheus"
    effect: "NoSchedule"
  persistentVolume:
    <snip>
    storageClass: "gp2-us-west-2a"
  resources:
    requests:
      cpu: 3000m     # request 75% of node CPU so that two pods can't fit
      memory: 11.6Gi # request 75% of node RAM so that two pods can't fit
  <snip>

Installing this release takes a command along the lines of:
helm install stable/prometheus --name first-prometheus -f first-prometheus-helm.yaml

This gets us all the goodies in the Prometheus chart, and a single prometheus-server pod running on one of our dedicated instances.

Doubling Down with a Second Prometheus Release

We want two of these, though, so we need a second chart configuration. It’s important to note that we don’t want two of everything, however – no need for doubling up our node exporters, kube-state-metrics, etc. Let’s copy first-prometheus-helm.yaml to second-prometheus-helm.yaml and make the necessary edits (example once more heavily condensed):

alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: false
server:
  alertmanagerURL: "http://first-prometheus-alertmanager"
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "prometheus"
    effect: "NoSchedule"
  nodeSelector:
    prometheus-only: "true"
  persistentVolume:
    <snip>
    storageClass: "gp2-us-west-2b"
    <snip>
  resources:
    requests:
      cpu: 3000m     # request 75% of node CPU so that two pods can't fit
      memory: 11.6Gi # request 75% of node RAM so that two pods can't fit
  <snip>

We install this second Prometheus release with:
helm install stable/prometheus --name second-prometheus -f second-prometheus-helm.yaml

Wrapping Up

Kops makes it easy to build a highly-available single-region Kubernetes cluster in AWS, and Helm simplifies installing software into that cluster. That said, achieving HA installations of your software in a Kubernetes cluster can require some extra planning and finesse. Fortunately kops and Helm still make this process easier than it would be otherwise, and provide a level of reproducibility that is certain to come in handy.

[1] “Nothing else” is loosely applied here to mean “nothing except for the requisite Kubernetes pods and various cluster-wide daemonsets.”

About the Author

Cody Boggs

Lead Yak Shaver at FreshTracks.io

Prometheus Relabel Rules and the ‘action’ Parameter

Today I want to talk about the action parameter in the relabel_config and metric_relabel_config elements in Prometheus. This was an epiphany I had while searching for how to dig substrings out of the __meta_* label names returned from service discovery (hint: use action: labelmap).

[Relabel configs](https://prometheus.io/docs/operating/configuration/#<relabel_config>) are composed of the following:

  • source_labels
  • separator (default ‘;’)
  • target_label – mandatory for replace actions. More on this below
  • regex (default ‘.*’)
  • modulus
  • replacement (default ‘$1’)
  • action (default ‘replace’)

Some of these elements have defaults, others are required based on the value of the action element.

When first learning about relabel configs in Prometheus you will encounter many examples that look something like this:

  relabel_configs:
    - source_labels: [__meta_kubernetes_role]
      action: keep
      regex: (?:apiserver|node)
    - source_labels: [__meta_kubernetes_role]
      target_label: job
      replacement: kubernetes_$1
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

In every example I’ve found, the stanza starts with source_labels as the first entry in the list of elements. As the docs state:

action defaults to ‘replace’

After reading the code (trying to find labelmap) it occurred to me that action is really the star of the show.

There are many different actions available in relabel configs (descriptions lifted from the docs; a sketch of the hashmod and labeldrop actions follows the list):

  • replace: Match regex against the concatenated source_labels. Then, set
    target_label to replacement, with match group references
    (${1}, ${2}, …) in replacement substituted by their value. If regex
    does not match, no replacement takes place.
  • keep: Drop targets for which regex does not match the concatenated source_labels.
  • drop: Drop targets for which regex matches the concatenated source_labels.
  • hashmod: Set target_label to the modulus of a hash of the concatenated source_labels.
  • labelmap: Match regex against all label names. Then copy the values of the matching labels
    to label names given by replacement with match group references
    (${1}, ${2}, …) in replacement substituted by their value.
  • labeldrop: Match regex against all label names. Any label that matches will be
    removed from the set of labels.
  • labelkeep: Match regex against all label names. Any label that does not match will be
    removed from the set of labels.
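
The hashmod and labeldrop actions don’t appear in the examples in this post, so here is a hedged sketch of both in one scrape config. The job name, targets and the tmp_ label prefix are made up purely for illustration; the real point is the shape of the rules:

scrape_configs:
- job_name: sharded-example          # hypothetical job
  static_configs:
  - targets: ['app-1:9100', 'app-2:9100', 'app-3:9100']   # hypothetical targets
  relabel_configs:
  # hashmod: hash each target's address into one of two buckets (0 or 1)
  - action: hashmod
    source_labels: [__address__]
    modulus: 2
    target_label: __tmp_shard
  # keep only the targets that hash to this server's shard; a second
  # Prometheus configured with regex: "1" would scrape the other half
  - action: keep
    source_labels: [__tmp_shard]
    regex: "0"
  metric_relabel_configs:
  # labeldrop: strip labels we never query on before samples are stored
  - action: labeldrop
    regex: tmp_.*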

From a neophyte’s perspective, perhaps all relabel rule examples should start with action, even for replace rules where it’s redundant. To rewrite the above example:

  relabel_configs:
    - action: keep
      source_labels: [__meta_kubernetes_role]
      regex: (?:apiserver|node)
    - action: replace
      source_labels: [__meta_kubernetes_role]
      target_label: job
      replacement: kubernetes_$1
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

By leading with action it is crystal clear what is happening.

About the Author

Bob Cotton

Founder FreshTracks.io a CA Accelerator Incubation

Monitoring is Dead, Long Live Observability

At this year’s (2017) Monitorama, there were several presentations that declared “monitoring is dead”. The context for such a bombastic statement was that “traditional” time-series monitoring tools do not give developers or operators enough information to truly know what is happening in their systems. The argument was that if your system is not telling you exactly what is going on, down to the request level, you have no hope of making sense of what is happening or what effect the change you just shipped will have. This is especially pertinent with the complex micro-services architectures we are building today.

However, monitoring is far from dead.

Because words matter, choosing the correct terminology to describe the type of monitoring you are doing is important. From my perspective, there are 5 types of “monitoring” systems: Infrastructure Monitoring, Log Analysis, Distributed Tracing, Application Performance Monitoring (APM) and Real User Monitoring. All of these observability tools fall under the “monitoring” umbrella. Let’s take a quick look at each of these.

Infrastructure Monitoring

Infrastructure Monitoring (IM) is the most common use of the term “monitoring”: aggregate metrics collected from a variety of hardware and applications and plotted on a graph with time on the X-axis. Graphs are assembled into dashboards and placed in central, visible places.

These systems are the first stop when responding to alerts; they can help answer the question of what is happening and how it compares to recent history. More advanced versions of these systems automatically detect anomalies based on historic data.

Log Analysis

Human-readable, text-based information from our systems is long rooted in the UNIX operating system. The primary way the system communicates with its human overlords is to write a line of text into a log file in a well-known location. This pattern is simple, and the tools that ship with UNIX deal well with text. The pattern has been copied into all corners of software engineering: print plain text to a file. The proliferation of log files has created entire industries around the distribution, parsing and searching of logs, with impressive results. Correlation of timestamps across distributed logs and extraction of numerical data from logs have made this type of tool indispensable for some teams.

The recent rise of structured logging, using a regular and parse-able log format, makes the job of the log analysis tool easier. No need to write a custom parser or parsing rules to extract the important signal from the human-readable noise.

At scale, this infrastructure comes at a cost. Indexing and searching terabytes of data per week/day/hour quickly becomes a burden for hardware scaling, licensing or both.

Distributed Tracing

With the rise of Service Oriented Architecture (SOA), or as it’s called these days, a micro-services architecture, developer team productivity is being optimized over the complexity of the overall system. Infrastructure Monitoring still affords an aggregate level of observability for any given micro-service; however, determining the slow points in an architecture where requests may touch 3–12 different systems quickly becomes intractable.

In distributed tracing, a standard set of headers is passed and propagated between all of the distributed systems. Detailed performance metrics are captured for all possible interaction points such as third-party APIs, databases and messaging systems. When correlated by an outside system, the entire call structure and its performance through the system are revealed in great detail. At low volumes these systems will emit tracing information for every request. When the correlation system becomes overwhelmed, sampling can be employed.

When distributed tracing systems operate at the request or event level, a very detailed picture of how the system is performing emerges. Further, if traces are annotated with who is invoking the request, observability starts to include why the system is slow, in the form of who is making the request. This type of demographic and usage information is useful for every aspect of the business, not just engineering.

Distributed tracing can also be the source of time-series style data if stream processing is employed to aggregate the individual events. This is often employed when an aggregation is needed on a subset of the available dimensions; the grain of IM metrics is too coarse and the deluge of individual events is too fine.

Application Performance Monitoring (APM)

Application Performance Monitoring is a powerful black-box monitoring solution. Application code is post-processed, or byte-code injection is employed, to potentially measure every line of code and every external system interaction. If used throughout a homogeneous infrastructure, distributed tracing can also be derived from the stream of data.

APM has a very low barrier to entry, as developers need to make few code-level changes. However, code-less APM only works on some runtime engines (Java, .NET and Python, for example). Other run-times need to be developer-instrumented using a vendor SDK. Further, if you have a polyglot micro-services architecture, not all language run-times will be instrumented, leaving gaps in visibility.

Real User Monitoring (RUM)

All of the above monitoring solutions are back-end focused. They can tell you in great detail about the performance of the services and the servers they run on. But rich browser user interfaces and Internet latency also contribute to the overall end-user experience. Real User Monitoring instruments the JavaScript running in the browser and sends performance and usage information to a metrics collector.

This method shows what performance the end-user is experiencing. Time spent in the browser, network latency and overall back-end performance are detailed. It does not matter that your systems are fast if the user’s experience is slow.

Some systems tie the browser instrumentation into the back-end distributed tracing, giving a full click-to-render detail that includes all back-end processing. This, combined with Infrastructure Monitoring, is powerful.

Dead? Really?

Each one of these observability systems gives you some important insight into the behavior of your system. The tools are chosen based on the challenges your team has encountered in trying to debug your system. If you have gobs of information in your application logs, you may choose a log analysis system. If your customers are complaining that their experience with your application is slow, it’s reasonable to reach for a real-user monitoring tool so you can see what they see.

While there is no one tool or vendor that does everything, several vendors are starting to realize that the combination of two or more of these observability types becomes more than the sum of its parts.

Infrastructure Monitoring, while being declared dead, still remains at the center of any observability strategy. Aggregate visibility into our systems is paramount to gaining an understanding of those systems and how they change over time. These tools are the canary in the coal mine, and they should be the first place you look when the alerts are going off.

Unfortunately, the current state of Infrastructure Monitoring tools gives us only rudimentary information when things are out of tolerance. While machine learning promises that alerts will only be sent when things are really wrong, we have yet to achieve this nirvana.

Monitoring may well be dead, but observability is alive and kicking.

About the Author

Bob Cotton

Founder FreshTracks.io a CA Accelerator Incubation

What Makes a System Healthy (Part 2)

In the first part of this series, we discussed the various frameworks for getting good KPIs out of your monitoring, and simplifying your monitoring down to a manageable number of metrics. In this part, we will go on to discuss what you should do with these metrics, how to dig in if they go awry, and what impact this can have on your operations.

What makes these metrics valuable? Their value is not that they tell you precisely what’s wrong, but that they give a leading indicator that something is wrong to begin with. They are important as KPIs in this sense, as well as several others. Unlike many metrics, each of these is human-interpretable, and it’s easy to say which direction is desirable: a service is doing well if it has high throughput, low latency, low errors, and reasonable saturation.

What makes these metrics helpful? First is obviously simplicity. They are simpler to monitor from an operations perspective, and simpler to instrument from a development perspective. Setting this viewpoint coherently between both groups helps facilitate a smooth devops process.

Instead of a massive dashboard full of every value that a service reports, looking at these four metrics per service dramatically reduces the cognitive load of the operations process. If any of the four golden signals is bad, there’s a problem (e.g. responses taking forever, people can’t get to the service, machines all saturated, error rate is high) and something should be done. As far as users can tell, though, if those four metrics are fine, then none of the rest really matter. This may seem small, but it’s not. Let’s take an example. Apache Kafka is a distributed streaming platform in wide use. It emits literally hundreds of metrics. In dashboards, this is page upon page of graphs. Many of these may behave in strange ways for any of a number of reasons. Watching these metrics, it’s unclear whether there’s any consequence to many of them being high, low, or behaving strangely. The consuming service essentially only sees the four golden signals. That said, do these others even matter? Not if things are going well.

This framework is great until, of course, one of the main KPIs blows up. The standard solution is unfulfilling: if the KPIs blow up, you need to sift through the rest of the metrics and find out what else is going wrong in hopes of driving towards the root cause. This process is intensely manual, and here FreshTracks offers something different. From a data perspective, there are many metrics which have nothing to do with each other, but also a rich set of metrics which influence each other’s behaviors.

From a statistical perspective, a comprehensive study of these is an exercise in the extremes of the multiple comparisons problem. Attempting to learn about a huge number of things simultaneously from the same data diminishes the statistical power of each of the analyses. If you’re trying to figure out whether or not a certain metric influences another in a hypothesis testing framework, for example, you may find signals that are stronger than 95% likely, meaning that there’s a 1/20 chance of having noise look like signal. If you make several thousand of these comparisons, it’s only natural to expect several false signals. This can be corrected, but the number of samples (and therefore the time) required to make these assessments grows to perhaps untenable bounds.
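
To make the arithmetic concrete (the numbers here are purely illustrative): with m = 1,000 independent comparisons, each tested at a significance level of α = 0.05,

E[false signals] = m × α = 1000 × 0.05 = 50
P(at least one false signal) = 1 − (1 − α)^m = 1 − 0.95^1000 ≈ 1

A Bonferroni-style correction would instead test each comparison at α/m = 0.00005, and collecting enough samples to reliably detect effects at that stricter threshold is what pushes the time required toward untenable bounds.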

Figuring out how the metrics impact the KPIs alone, on the other hand, makes this problem tractable. Indeed, being able to predict the KPI from the other metrics gives a map to use when the KPIs go awry. To this end, FreshTracks thinks there’s a better way to automatically drive at a root cause for production problems using human interpretable machine learning models. These are whitebox models, models whose decisions could be audited and understood by humans. In seeing what the models do, their results can be trusted. Because of interpretability, model insights can be applied in other domains, for other problems.

This gives a multivariate lens into whether or not your systems are behaving as expected, a fresh set of information to mine for system anomalies, especially if the prediction of the KPI and the KPI itself drift too far away from each other. In a system with which we are not familiar, this can save an enormous amount of time by solving one of the most fundamental problems: where to start.

About the Author

Chris Bonnell

Math PhD focusing on machine learning and signal processing.

What makes a System Healthy?

Monitoring a single process on a single computer is easy. You can look clearly at the metrics you expect to indicate process health and be relatively certain of their import. As you add processes and computers, the challenge becomes harder. Instead of viewing the information locally, it becomes necessary to aggregate metrics in a common location. The number of metrics per second grows. It becomes more and more challenging to view the complicated system and assess anything at all. At ten machines, this situation is best characterized by confusion and frustration; at 100, the amount of effort required to fully understand an incident cannot reliably be extended to all the machines in question. We need a higher-order framework to describe the health of the system.

First, it’s important to focus on the metrics that matter most. By that, I don’t mean the favorite metrics, or the metric that was responsible for the last outage, or the newest metrics added. The four golden signals are a great example: Latency, Saturation, Traffic, and Errors. These are defined in chapter 6 of Google’s Site Reliability Engineering book as follows.

Latency

The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

Traffic

A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

Errors

The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.

Saturation

How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.

In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.

Finally, saturation is also concerned with predictions of impending saturation, such as “It looks like your database will fill its hard drive in 4 hours.”

Alternative views into this situation exist. Brendan Gregg cites the USE metrics; these are particularly focused on physical resources, but extend to some software resources as well. Given their descriptions, one could imagine that any sort of machine or physical process could be monitored in these terms. He defines them as follows:

  • Resource: all physical server functional components (CPUs, disks, busses, …)
  • Utilization: the average time that the resource was busy servicing work
  • Saturation: the degree to which the resource has extra work which it can’t service, often queued
  • Errors: the count of error events

Tom Wilkie of WeaveWorks uses the RED metrics. These are suggested as a framework for monitoring services. They have no reference to any physical hardware, just performance characteristics. One could imagine doing this kind of monitoring even in serverless architectures. They are defined as follows:

For every service, check that the request:
  • Rate
  • Error (rate) and
  • Duration (distributions)
are within SLO/A

In the situations where they apply, these each represent good options. If you’re not precisely sure which framework to apply, the four golden signals cover everything, and monitoring one extra graph may be well worth the effort. In any case, focusing on a set of these KPIs is more important than focusing on the “correct” one.
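
To make this concrete, here is a hedged sketch of Prometheus recording rules (in the newer YAML rules format) that compute RED-style versions of these signals. The metric names http_requests_total, http_request_duration_seconds_bucket and the node-exporter memory metrics are assumptions based on common client-library conventions, not something any particular service is guaranteed to export, so adjust them to whatever your instrumentation actually produces:

groups:
- name: golden-signals
  rules:
  # Traffic / Rate: requests per second, per job
  - record: job:http_requests:rate5m
    expr: sum(rate(http_requests_total[5m])) by (job)
  # Errors: fraction of requests returning a 5xx code
  - record: job:http_errors:ratio_rate5m
    expr: sum(rate(http_requests_total{code=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
  # Latency / Duration: 99th percentile request duration over a small window
  - record: job:http_request_duration_seconds:p99_5m
    expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
  # Saturation: pick the most constrained resource; memory shown here
  # (node-exporter metric names vary by version)
  - record: instance:node_memory_utilization:ratio
    expr: 1 - (node_memory_MemAvailable / node_memory_MemTotal)

Alerting thresholds can then be hung off these recorded series rather than off raw, per-endpoint metrics.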

Even at moderate scale, system complexity can be overwhelming when trying to determine the root cause of an event. Higher-order models to describe system health like the four golden signals, USE and RED provide a great overview of your systems. However, there still remains a significant challenge: how to identify the root cause.

In this series we will discuss what to do with these metrics once you have them, and how they can make your operations easier.

About the Author

Chris Bonnell

Math PhD focusing on machine learning and signal processing.

First Post

16 inches fresh at Winter Park today…  And we’re making our website.  We may not be doing it right