Observability Maturity Model Fundamentals series, Part 5
This series of 6 blogs outlines the basics of the Observability Maturity Model. Use it to identify where you are on the observability path, understand the road ahead and provide guidance to help you find your way.
Go back to or skip to: Part 1 / Part 2 / Part 3 / Part 4 / Part 6
What is the cause of the incident and what is the impact?
Level 4, Proactive Observability With AIOps, is the most advanced level of observability. At this stage, artificial intelligence for IT operations (AIOps) is added to the mix. AIOps, in the context of monitoring and observability, is about applying AI and machine learning (ML) to sort through mountains of data looking for patterns that
drive better responses
at the soonest opportunity
by both humans and automated systems.
In Gartner’s “Market Guide for AIOps Platforms,” May 2022, by Pankaj Prasad, Padraig Byrne and Gregg Siegried, Gartner defines the characteristics of AIOps platforms in the following way:
“AIOps platforms analyze telemetry and events, and identify meaningful patterns that provide insights to support proactive responses. AIOps platforms have five characteristics:
Cross-domain data ingestion and analytics
Topology assembly from implicit and explicit sources of asset relationship and dependency
Correlation between related or redundant events associated with an incident
Pattern recognition to detect incidents, their leading indicators or probable root cause
Association of probable remediation”
We have the same view on AIOps as Gartner. AIOps builds on core capabilities from previous levels in this maturity model — such as gathering and operationalizing data, topology assembly and correlation of data — and adds in pattern recognition, anomaly detection and more accurate suggestions for remediating issues. Causal Observability is a necessary foundation: time-series topology provides an essential framework.
AIOps can help teams find problems faster and even prevent problems altogether. AI/ML algorithms look for changes in patterns that precede warnings, alerts and failures, helping teams know when a service or component starts to deviate from normal behavior and address the issue before something fails.
However, anomalies occur frequently. They do not necessarily mean a problem will occur, nor that remediation should be a high priority. AIOps helps determine which anomalies require attention and which can be ignored. Another goal of AIOps for observability is to drive automated remediation through IT service management (ITSM) and self-healing systems. If these systems receive incorrect root cause input, for example, they can self-correct the wrong issue and cause bigger problems. AIOps delivers more accurate input that enhances their effectiveness.
At Level 4, you should notice more efficient and incident-free IT operations that deliver a better customer experience. To achieve these goals, set up AIOps to transcend silos and ingest data gathered from across the environment. The AI/ML models should analyze all the observability data types we discussed in previous levels: events, metrics, logs, traces, changes and topology, all correlated over time.
A Word of Caution: Don’t Skip Level 3
Proactive observability with AIOps is the best way to ensure reliable operation of your IT systems, but it’s a mistake to move directly to Level 4 and skip over the causal observability steps in Level 3 (data consolidation, topology, correlation of all data streams over time).
Each level in this Observability Maturity Model builds on capabilities established in previous levels, but having a complete foundation matters most for success in Level 4. If you apply AI/ML without a comprehensive foundation of data, you can actually cause damage. For example, let’s say you use AI/ML on the front end of an automated self-healing system. If the algorithm determines an incorrect root cause, the self-healing system tries to remediate the wrong thing and can further break the system. If you apply AI/ML on top of insufficient data or poor-quality data, you may drive automation in the wrong direction as the algorithm learns the wrong thing.
Without topology data correlated with metric, log and trace data over time, AIOps tools will likely not understand the correlation between these various sorts of data as they come together. AIOps needs the additional context provided by topology and time to accurately assess root cause, determine business impact, detect anomalies and proactively determine when to alert SRE and DevOps teams.
Next Steps
Most AIOps solutions today require significant configuration and training time but often yield inaccurate results, especially if topology changes over time are not considered. Teams often implement them with unrealistic expectations and unclear goals, then find themselves disappointed.
Level 4 is the final observability maturity level for now, but as IT continues to evolve, we fully expect a Level 5 to emerge.