At this year’s (2017) Monitorama, there were several presentations that declared “monitoring is dead”. The context for such a bombastic statement was that “traditional” time-series monitoring tools do not give developers or operators enough information to truly know what is happening in their systems. The argument was that if your system is not telling you exactly what is going on, down to the request level, you have no hope of making sense of what is happening or what effect the change you just shipped will have. This is especially pertinent with the complex micro-services architectures we are building today.
However, monitoring is far from dead.
Because words matter, choosing the correct terminology to describe the type of monitoring you are doing is important. From my perspective, there are five types of “monitoring” systems: Infrastructure Monitoring, Log Analysis, Distributed Tracing, Application Performance Monitoring (APM), and Real User Monitoring. All of these observability tools fall under the “monitoring” umbrella. Let’s take a quick look at each of them.
Infrastructure Monitoring (IM)

Infrastructure Monitoring is the most common use of the term Monitoring. Aggregate metrics are collected from a variety of hardware and applications and plotted on graphs with time on the X-axis. Graphs are assembled into dashboards and placed in central, visible places.
These systems are the first stop when responding to alerts; they help answer what is happening right now and how that compares to recent history. More advanced versions of these systems automatically detect anomalies based on historic data.
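As a rough illustration of what that anomaly detection involves, here is a minimal sketch that flags a metric sample when it strays too far from its recent baseline; the data, names, and threshold are invented, and real products use far more sophisticated models than a simple standard-deviation check.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, sigmas=3.0):
    """Flag `latest` if it falls more than `sigmas` standard
    deviations away from the mean of the recent history."""
    mu = mean(history)
    sd = stdev(history)
    return sd > 0 and abs(latest - mu) > sigmas * sd

# e.g. requests-per-second samples from the last hour
history = [120, 118, 125, 122, 119, 121, 117, 123]
print(is_anomalous(history, 121))  # False: within the normal range
print(is_anomalous(history, 480))  # True: likely an anomaly
```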
Log Analysis

Human-readable, text-based information from our systems has long been rooted in the UNIX operating system. The primary way the system communicates with its human overlords is to write a line of text into a log file in a well-known location. The pattern is simple, and the tools that ship with UNIX handle plain text well. This pattern, printing plain text to a file, has been copied into every corner of software engineering. The proliferation of log files has created entire industries around the distribution, parsing, and searching of logs, with impressive results. Correlating timestamps across distributed logs and extracting numerical data from them have made this type of tool indispensable for some teams.
The recent rise of structured logging, using a regular and parseable log format, makes the job of the log analysis tool easier: there is no need to write a custom parser or parsing rules to extract the important signal from the human-readable noise.
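To make the difference concrete, here is a small sketch contrasting the two approaches; the log lines and field names are invented for illustration.

```python
import json
import re

# Plain text: a custom regex is needed to pull the signal out.
line = '2017-06-12 10:01:03 GET /checkout 200 142ms'
match = re.search(r'(?P<verb>GET|POST) (?P<path>\S+) (?P<status>\d+) (?P<ms>\d+)ms', line)
print(match.group('path'), match.group('ms'))

# Structured: the same event as JSON parses with no custom rules.
event = '{"ts": "2017-06-12T10:01:03Z", "verb": "GET", "path": "/checkout", "status": 200, "duration_ms": 142}'
record = json.loads(event)
print(record['path'], record['duration_ms'])
```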
At scale, this infrastructure comes at a cost. Indexing and searching terabytes of data per week, day, or hour quickly becomes a burden in hardware, licensing, or both.
Distributed Tracing

With the rise of Service-Oriented Architecture (SOA), or as it’s called these days, a micro-services architecture, developer team productivity is being optimized at the expense of overall system complexity. Infrastructure Monitoring still affords an aggregate level of observability for any given micro-service; however, determining the slow points in an architecture where requests may touch 3–12 different systems quickly becomes intractable.
In distributed tracing, a standard set of headers is passed and propagated between all distributed systems. Detailed performance metrics are captured for all possible interaction points such as third-party APIs, databases, and messaging systems. When correlated by an outside system, the entire call structure, and its performance through the system, is revealed in great detail. At low volumes these systems emit tracing information for every request. When the correlation system becomes overwhelmed, sampling can be employed.
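A minimal sketch of the propagation half of that contract, assuming hypothetical header names (real systems use established conventions such as Zipkin’s B3 headers):

```python
import uuid
import requests

TRACE_HEADER = "X-Trace-Id"  # hypothetical names for illustration;
SPAN_HEADER = "X-Span-Id"    # real systems use e.g. Zipkin's B3 headers

def call_downstream(url, incoming_headers):
    """Propagate the trace id from the incoming request (or start a
    new trace) and mint a fresh span id for this hop."""
    trace_id = incoming_headers.get(TRACE_HEADER, uuid.uuid4().hex)
    headers = {TRACE_HEADER: trace_id, SPAN_HEADER: uuid.uuid4().hex}
    return requests.get(url, headers=headers)
```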
When distributed tracing systems operate at the request, or event, level, a very detailed picture of how the system is performing emerges. Further, if traces are annotated with who is invoking the request, observability starts to include why the system is slow, in the form of who is making the request. This type of demographic and usage information is useful for every aspect of the business, not just engineering.
Distributed tracing can also be the source of time-series-style data if stream processing is employed to aggregate the individual events. This is often employed when an aggregation is needed on a subset of the available dimensions; the grain of IM metrics is too coarse and the deluge of individual events is too fine.
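A toy batch version of that aggregation, with invented event fields; a real deployment would do this continuously in a stream processor rather than over a list in memory:

```python
from collections import defaultdict

def aggregate(events, dimensions=("service", "endpoint")):
    """Roll individual trace events up into per-minute buckets keyed
    by a chosen subset of dimensions."""
    buckets = defaultdict(list)
    for e in events:
        minute = e["ts"] - (e["ts"] % 60)  # truncate timestamp to the minute
        key = (minute,) + tuple(e[d] for d in dimensions)
        buckets[key].append(e["duration_ms"])
    # Emit count and mean latency per bucket.
    return {k: (len(v), sum(v) / len(v)) for k, v in buckets.items()}

events = [
    {"ts": 1497258063, "service": "cart", "endpoint": "/checkout", "duration_ms": 140},
    {"ts": 1497258071, "service": "cart", "endpoint": "/checkout", "duration_ms": 95},
]
print(aggregate(events))  # {(minute, 'cart', '/checkout'): (2, 117.5)}
```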
Application Performance Monitoring (APM)
Application Performance Monitoring is a powerful black-box monitoring solution. Application code is post-processed, or byte-code injection is employed, to measure potentially every line of code and every external system interaction. If used throughout a homogeneous infrastructure, distributed tracing can also be derived from the stream of data.
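For intuition, here is a hand-rolled sketch of the kind of per-call timing an APM agent collects; the point of a real agent is that byte-code injection applies this to every function automatically, without decorators or any other code change.

```python
import functools
import time

def traced(fn):
    """Record wall-clock time for every call, roughly what an APM
    agent does automatically via byte-code injection."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"{fn.__qualname__} took {elapsed_ms:.1f}ms")
    return wrapper

@traced
def lookup_user(user_id):
    time.sleep(0.05)  # stand-in for a database call
    return {"id": user_id}

lookup_user(42)
```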
APM has a very low barrier to entry, as developers need to make few code-level changes. However, code-less APM only works on some runtime engines, Java, .NET, and Python for example. Other runtimes need to be instrumented by developers using a vendor SDK. Further, if you have a polyglot micro-services architecture, not all language runtimes will be instrumented, leaving gaps in visibility.
Real User Monitoring (RUM)
This method shows the performance the end user is actually experiencing. Time spent in the browser, network latency, and overall back-end performance are detailed. It does not matter that your systems are fast if the user’s experience is slow.
Some systems tie the browser instrumentation into the back-end distributed tracing, giving full click-to-render detail that includes all back-end processing. This, combined with Infrastructure Monitoring, is a powerful combination.
Each one of these observability systems gives you important insight into the behavior of your system. The tools are chosen based on the challenges your team has encountered in trying to debug your system. If you have gobs of information in your application logs, you may choose a log analysis system. If your customers are complaining about your application being slow, it’s reasonable to reach for a real-user monitoring tool so you can see what they see.
While there is no one tool or vendor that does everything, several vendors are starting to realize that the combination of two or more of these observability types becomes more than the sum of its parts.
Infrastructure Monitoring, despite being declared dead, still remains at the center of any observability strategy. Aggregate visibility into our systems is paramount to understanding those systems and how they change over time. These tools are the canary in the coal mine. They should be the first place you look when the alerts are going off.
Unfortunately, the current state of Infrastructure Monitoring tools gives us only rudimentary information when things are out of tolerance. While machine learning promises that alerts will be sent only when things are really wrong, we have yet to achieve this nirvana.
Monitoring may well be dead, but observability is alive and kicking.