But first, when an organization says it wants its system monitoring to become predictive, what exactly does that mean? We can break it down into three problem statements:
Alerts for serious IT system problems never arrive; instead, the problem gets reported by somebody who happens to notice it (worst-case scenario: the customer notices first).
The alerts come too late (worst-case scenario: the business is already impacted).
The alerts do come but are ignored.
These three problems happen for a number of reasons.
First, it's very hard to get good monitoring coverage. Think of the space of all possible valid and useful monitoring checks; I call this the monitoring space. It's similar to NASA looking out into the dark expanse of the universe, with all of its planets, moons, stars, and other celestial objects. Now imagine trying to track all of the relevant activity, such as an asteroid about to collide with Earth, and warning about it in time.
This is a great way to picture what we, as IT professionals, are trying to achieve in IT monitoring: spotting and preventing disasters somewhere in the monitoring space. With that in mind, you can grasp how huge the amount of data to be collected is. On top of that, IT systems typically expose a huge number of functions and rest on an equally huge number of logical and physical components. Multiply these factors by the number of changes IT systems go through every day, and the monitoring space becomes very large and complex. That is why it's so hard to get good monitoring coverage. (How does NASA do it?!)
Second, the checks, or rules, that create actionable alerts to prevent serious production problems are the hardest to create. That's because the toughest and most serious problems do not occur in isolation. For example, when a sev1 outage hits a production environment, it is typically not something trivial that could have been caught by better testing in a quality assurance (QA) environment. Actionable information needs a high degree of context, and checks that provide that context are very, very hard to write.
Third, alerts come late (or not at all) because most alerts that are responded to are based on checks that detect only obvious problematic behaviors, e.g., the disk is full, the API is not responding, latency is too high. When writing a rule for a monitoring check, for instance, it's easy to forget (or be unaware of) exactly what conditions should be captured, until a problem hits. Let's say someone has set a rule to send an alert when the number of users logging into their bank accounts drops to zero. No alert will be sent when the number of users drops to one. Such a situation also indicates a problem, but no alert is sent because of how the rule was set.
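To make this concrete, here is a minimal sketch in Python; the metric, the numbers, and the second rule are hypothetical illustrations, not a reference to any particular tool. It contrasts the brittle "drops to zero" rule with one that watches for a relative drop instead of an exact value:

# Hypothetical illustration: a brittle monitoring rule vs. one based on a relative drop.
# Metric: number of users logging into their bank accounts per minute.

def brittle_rule(logins_per_minute: int) -> bool:
    """Alert only when logins drop to exactly zero; a drop to one is missed."""
    return logins_per_minute == 0

def relative_drop_rule(logins_per_minute: int, recent_average: float,
                       drop_ratio: float = 0.9) -> bool:
    """Alert when logins fall far below their recent average,
    whether the count lands on 0, 1, or 50."""
    if recent_average <= 0:
        return False
    return logins_per_minute < recent_average * (1 - drop_ratio)

# On a typical weekday, roughly 500 users log in per minute.
print(brittle_rule(1))                            # False: the problem is missed
print(relative_drop_rule(1, recent_average=500))  # True: a drop of about 99.8% triggers an alert

Neither rule is a recommendation; the point is that the shape of the rule decides which problems can ever be seen.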
Fourth, alerts in the form of warnings (e.g., mean latency is above 3s, the error count is averaging 20 per minute, the disk is 90% full) are so often ignored because they are too often neither actionable nor really indicative of an impending disaster (i.e., false positives). This is called alert fatigue. In short, because of the enormous size of the monitoring space, there are too many low-quality alerts and not enough good-quality ones.
Predictive monitoring is often incorrectly posed as a solution to these problems
When you don't have good monitoring coverage to begin with, what are you going to predict? If you are already experiencing alert fatigue, why do you think informative alerts, based on predictions, are going to improve the mean-time-to-resolution (MTTR)?
While companies are genuinely looking to solve these problems, what I see many of them doing is investing in monitoring solutions that make predictions for them. Framing the goal as becoming predictive leads to solutions that, most of the time, just don't cut it. Unfortunately, I've seen a lot of time, effort, and resources wasted on such projects and solutions. In the end, they just generate more alerts that get ignored.
Instead, I propose that organizations strive to become proactive and preventative. It's about focusing on the goal, not the solution. That entails a deeper dive into the data behind a problem to build a better and more mature monitoring model. Prediction may be part of the solution, of course, but there are many ways organizations can become more proactive without predictions. I highlight three examples here:
A lot can be gained by contextualizing IT monitoring data through a topology. There is strength in numbers: if multiple monitoring systems are reporting suspicious behavior about the same chain of IT components, it should add to your sense of urgency. But without that context, how will you know about it?
Anomaly detection is frequently a much better early warning system than predictive monitoring. However, without bringing anomalies back into context, it will only add to your alert fatigue. (A minimal sketch of this idea follows this list.)
Changes often precede trouble, yet I've seen very few organizations that have an effective way to relate anomalies or alerts back to changes. What if the person or team who made a change could be alerted directly about the consequences their change has had on the production environment? (A second sketch after this list illustrates the idea.)
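As a rough illustration of the anomaly-detection example above, here is another minimal sketch; the metric, window size, and threshold are assumptions made for the example, not the method of any particular product. It flags values that deviate strongly from their recent rolling behavior:

# Minimal anomaly-detection sketch: flag metric values that deviate strongly
# from their recent rolling mean. All names and thresholds are illustrative.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(values, window=30, threshold=3.0):
    """Yield (index, value) pairs where a value lies more than
    `threshold` standard deviations away from the rolling mean."""
    history = deque(maxlen=window)
    for i, value in enumerate(values):
        if len(history) >= window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value  # anomalous with respect to recent history
        history.append(value)

# Example: steady login counts around 500 per minute, then a sudden drop to 1.
logins = [500, 510, 495, 505, 498] * 10 + [1]
for index, value in detect_anomalies(logins):
    print(f"Anomaly at minute {index}: {value} logins")

On its own, an anomaly like this is just another signal; it only becomes actionable once it is tied back to the topology it belongs to, which is exactly the contextualization point above.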
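And for the third example, an equally rough sketch; the change records, the notification, and the lookback window are invented purely for illustration. It relates an alert on a component back to recent changes on that same component, so the change author can be told directly:

# Illustrative sketch: relate an alert on a component back to recent changes
# on that component and notify the change author. All data is made up.
from datetime import datetime, timedelta

changes = [
    {"component": "payment-api", "author": "team-payments",
     "description": "Deployed v2.4.1", "time": datetime(2023, 5, 4, 14, 0)},
    {"component": "login-service", "author": "team-identity",
     "description": "Changed connection pool size", "time": datetime(2023, 5, 4, 15, 30)},
]

def changes_preceding(alert_component, alert_time, window=timedelta(hours=2)):
    """Return changes on the alerting component within the lookback window."""
    return [c for c in changes
            if c["component"] == alert_component
            and alert_time - window <= c["time"] <= alert_time]

# An alert fires on login-service shortly after the pool-size change.
alert_time = datetime(2023, 5, 4, 16, 0)
for change in changes_preceding("login-service", alert_time):
    # In a real setup this would page or message the author directly.
    print(f"Notify {change['author']}: alert may relate to '{change['description']}'")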
I speak with a lot of organizations about monitoring, and I've yet to find one for which predictive monitoring was the most logical next step. I suggest we move IT monitoring forward by taking a step back: being proactive rather than predictive. It requires many different strategies and it is a journey, but it makes more sense for organizations that want to improve their monitoring coverage, increase problem-resolution efficacy, reduce alert fatigue, and save money at the same time.
About Lodewijk Bogaards
Lodewijk Bogaards is a StackState co-founder and CTO. He combines deep technical skills with high-level technical vision. With over 20 years of experience in IT, he specializes in monitoring, AI, graph databases, and software development.