What changed?
That’s the question ops personnel have been asking for decades whenever something goes wrong in the production IT environment.
Everything was working before, so the reasoning goes, and now it’s not. We have an incident. And to figure out what caused the incident – and hence, to have any idea how to fix it – we must know what changed.
There’s just one problem with this approach. What if everything is subject to change, all the time? Simply identifying one needle in a veritable haystack of needles would hardly be an effective way to get to the bottom of an incident.
We can’t look only at the individual events – the "what changed" bits that gave us clues in the past. Instead, we must look at the big picture – how everything fits together. In other words, the context for those events.
Given the complex and dynamic nature of today’s production environments, however, we can’t wait around for this big picture to materialize. We must understand the context for each incident in real-time if we have any hope of resolving issues promptly and cost-effectively.
Here’s how it’s done.
The four T's of observability
Just as understanding a forest begins with the trees, we must understand the behavior of each individual component in our production environment. To this end, we want each component to be observable: to understand its behavior, we need only look to the information it provides for that purpose. We call such information telemetry.
The most familiar forms of telemetry are logs and metrics: streams of time series data that components generate in order to provide a record of their activity. Any change in the behavior of such a component is reflected in this telemetry and represents an event at a particular time – time being the second ‘T.’
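To make this concrete, here is a minimal sketch in Python of telemetry as a stream of time-stamped data points. The component and metric names are invented for illustration; this is not StackState's data model.

```python
# A minimal sketch of telemetry as time-series data.
# Names are illustrative only, not StackState's data model.
from dataclasses import dataclass, field
from time import time

@dataclass
class MetricPoint:
    component: str        # which component emitted the telemetry
    name: str             # e.g. "request_latency_ms"
    value: float
    timestamp: float = field(default_factory=time)  # the second 'T': time

# A stream of such points is the component's record of its own activity.
stream = [
    MetricPoint("checkout-service", "request_latency_ms", 42.0),
    MetricPoint("checkout-service", "request_latency_ms", 880.0),
]

# Any abrupt change in the stream is an event at a particular time.
for earlier, later in zip(stream, stream[1:]):
    if later.value > 10 * earlier.value:
        print(f"event at {later.timestamp}: {later.name} jumped on {later.component}")
```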
Events, however, may or may not have any importance to an engineer seeking to understand an incident. Too many monitoring tools simply report on all events, flooding ops personnel with false positives.
Engineers need some way of connecting events causally, so that they can uncover which events are correlated to incidents, and ideally, what the causes of those incidents are.
We call these causal sequences of events traces – the third ‘T.’ Traces provide some measure of context for events and their telemetry, as they give engineers a basis to understand which events might be useful for uncovering root causes of incidents.
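As a rough illustration, here is a minimal sketch of a trace as a chain of parent/child spans. The span model and names are assumptions made for the example, not any particular tracing format.

```python
# A minimal sketch of a trace as a causal sequence of events.
# The parent/child span model here is an assumption for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root of the causal chain
    component: str
    operation: str

spans = [
    Span("a1", None, "frontend", "GET /checkout"),
    Span("b2", "a1", "checkout-service", "create_order"),
    Span("c3", "b2", "payment-service", "charge_card"),  # where the error surfaced
]

# Walking the parent links turns isolated events into a causal sequence,
# leading the engineer from the symptom back toward the cause.
by_id = {s.span_id: s for s in spans}
span = spans[-1]
while span is not None:
    print(f"{span.component}: {span.operation}")
    span = by_id.get(span.parent_id) if span.parent_id else None
```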
Are traces enough?
Traces, therefore, are better at providing the context for events than logs or metrics alone. Even so, traces fall short in modern IT environments.
The problem with traces is that they rarely occur in isolation. In the real world, the connections between causes and effects are rarely simple linear sequences, and furthermore, may be subject to change.
As a result, we must add the fourth ‘T’: topology. The topology represents all the relationships among all of the components in your entire IT landscape: what is connected to what, what depends on what, and how messages travel from one component to the next.
Furthermore, the topology is as dynamic as the environment it describes: if a relationship between components changes in reality, that change is automatically reflected in the topology.
Traces, therefore, are an integral part of the topology – but it’s the topology itself that can provide end-to-end insight into the entire IT landscape.
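To picture what such a topology might look like in code, here is a minimal sketch of a dynamic graph of components and their dependencies. The structure and component names are illustrative assumptions, not StackState's schema.

```python
# A minimal sketch of a topology as a dynamic graph.
# Structure and names are illustrative, not StackState's schema.
from collections import defaultdict

class Topology:
    def __init__(self):
        self.edges = defaultdict(set)   # component -> components it depends on

    def connect(self, source, target):
        self.edges[source].add(target)

    def disconnect(self, source, target):
        self.edges[source].discard(target)

    def dependencies(self, component):
        return self.edges[component]

topo = Topology()
topo.connect("frontend", "checkout-service")
topo.connect("checkout-service", "payment-service")

# When a relationship changes in reality, the graph changes with it:
topo.disconnect("checkout-service", "payment-service")
topo.connect("checkout-service", "payment-service-v2")
print(topo.dependencies("checkout-service"))  # {'payment-service-v2'}
```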
Topology provides the context
We call a topology that consists of nodes and their relationships a graph, and predictably, StackState’s architecture depends upon a versioned graph database.
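One way to picture the "versioned" part is to timestamp each relationship, so the graph can be queried as of any moment. The sketch below is an assumption about how such versioning could work, not how StackState's database is actually implemented.

```python
# A minimal sketch of a versioned graph: each relationship records when
# it was valid, so the topology can be reconstructed as of any moment.
# (An illustrative assumption, not StackState's actual implementation.)
import math

edges = [
    # (source, target, valid_from, valid_to)
    ("checkout-service", "payment-service",    100.0, 500.0),
    ("checkout-service", "payment-service-v2", 500.0, math.inf),
]

def topology_at(timestamp):
    """Return the relationships that existed at the given time."""
    return [(s, t) for s, t, start, end in edges if start <= timestamp < end]

print(topology_at(450.0))  # [('checkout-service', 'payment-service')]
print(topology_at(600.0))  # [('checkout-service', 'payment-service-v2')]
```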
The data that goes into this graph database, however, is more than traditional telemetry data. In fact, topology data comes from multiple sources, including the platforms that provision and deploy networks, virtual machines, containers and services – and in many cases, discovery agents must collect such topology information, as it is richer than typical telemetry.
This topology information resembles instructions for building a structure with Tinkertoys: how one component connects to another and so on. Each of these Tinkertoy instructions represents a subgraph of the overall topology. StackState then merges the subgraphs into its representation of the overall topology.
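A minimal sketch of that merging step might look like the following, with each data source contributing its own subgraph of edges. The source names and connections are invented for illustration.

```python
# A minimal sketch of merging per-source subgraphs into one topology --
# the "Tinkertoy instructions" idea. All names here are invented.

# Each data source contributes a subgraph: a set of (source, target) edges.
kubernetes_agent  = {("pod-a", "service-x"), ("pod-b", "service-x")}
cloud_provisioner = {("service-x", "database-1")}
network_discovery = {("database-1", "storage-volume-7")}

# Merging the subgraphs yields the end-to-end topology.
overall_topology = kubernetes_agent | cloud_provisioner | network_discovery

for source, target in sorted(overall_topology):
    print(f"{source} -> {target}")
```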
This topology, then, provides the necessary context to the telemetry. Now engineers have the information they require to understand the relationships among the components in the environment, even as those relationships change.
Engineers should continue to ask what changed. But now, with the combination of telemetry, traces, topology and time, the answer to that question will lead the engineer to the root cause as simply as playing with Tinkertoys.
The Intellyx take
In a Kubernetes deployment, where the components in question are microservices, containers, pods, clusters and the like, leveraging the topology to track the context of the environment’s behavior is essential for understanding the activity of inherently ephemeral components.
But there is more to the story than Kubernetes. In reality, the full cloud native computing landscape includes all of hybrid IT, from on-premises legacy systems to traditional virtual machines to containers to serverless computing.
Some of these components are inherently observable, as they generate useful telemetry. Others are not, thus requiring agents to glean the information necessary to monitor and manage them.
Regardless of the specifics, however, the enterprise IT topology includes the entire hybrid IT landscape. After all, everything boils down to components and their relationships with one another, regardless of the nature of those components.
Whatever the particular mix of systems, applications and infrastructure that makes up your IT landscape, therefore, understanding the root causes of incidents well enough to mitigate them quickly depends upon the overall context of the components of that landscape. Without that context, your engineers will be working in the dark.
About Jason Bloomberg
Jason Bloomberg is a leading IT industry analyst, author, keynote speaker and globally recognized expert on multiple disruptive trends in enterprise technology and digital transformation.
He is founder and president of digital transformation analyst firm Intellyx. He is ranked among the top nine low-code analysts on the Influencer50 Low-Code50 Study for 2019, #5 on Onalytica’s list of top Digital Transformation influencers for 2018 and #15 on Jax’s list of top DevOps influencers for 2017.
Mr. Bloomberg is the author or co-author of five books, including "Low-Code for Dummies," published in October 2019.
Copyright © Intellyx LLC. StackState is an Intellyx customer. Intellyx retains the final editorial control of this article. Image credit: Mike Mozart.