These are the questions that I&O leaders need answered. The problem of relating business KPIs to IT issues is neither new nor easy, and almost every enterprise IT organization struggles with it. Thanks to Artificial Intelligence for IT Operations (AIOps), it can be tackled and IT issues can finally get proper root cause analysis.
The problem is formulated as follows: there are a number of business KPIs:
Additive - such as ticket sales, revenue, views, errors, etc.
Correlate-able - such as performance KPIs: latency, throughput and error count
Those KPIs constitute the top level performance indicators. On the other side, at the bottom of the IT stack, there are various types of telemetry like logs, events, traces and metrics that describe the infrastructure state.
The goal is to find possible explanations for KPI violations in terms of observable state or state changes. Since each KPI relates to costs in its own way, KPI-driven root cause analysis should immediately give actionable, cost-aware insight into where an IT department should shift its focus, pointing to the right issue as the root cause.
Manually relating IT issues to anomalies in business KPIs is hard and time-consuming
Relating anomalies in KPIs to IT issues manually is a laborious and time-consuming process in which engineers have to search through logs, traces and other telemetry to find evidence that a specific pattern they found actually caused the incident.
A popular approach for this is building a data lake with Splunk. Many data lakes have been built in the last decade with the idea that they can be mined for data gold at a later stage. The problem, however, is that the relationships needed to understand the data in a broader context are not part of the data lake, nor are they a first-class citizen in it. Data lakes therefore quickly turn into data swamps and relating IT issues to anomalies in business KPIs becomes a painful process.
Automatic KPI-driven root cause analysis
Structurally, KPI-driven root cause analysis (RCA) consists of two modules: (1) anomaly detection and (2) the inference engine.
1. Anomaly detection
The anomaly detection module captures the steady functioning of the system in some form of 'normality' profile. As soon as the functioning of an application or cloud resource deviates from that 'normality' profile, a violation is reported. The specification of what is and is not normal for a resource depends, in the general case, on the context, e.g. the business goal of monitoring.
This makes the definition of normality for many resources quite vague and dependent on an expert who is able to manually tag the resource metrics with an anomaly label. This labeling requirement makes anomaly detection more laborious, more expensive and less practical.
We would like to automate this laborious process as much as possible. Therefore we require the KPI metrics to be context-independent or, at a minimum, we aim to define the normality profile in terms of the metric itself. An example of a suitable KPI is 'Sales' with the alerting criterion 'Sales decline'; another alerting criterion can be a downward seasonal trend. Applying these criteria to 'Sales' means that sales for today should be at least the same as yesterday. Another good example of a KPI metric is the 95th percentile of latency, with the 'normality' criterion being a threshold: 95% of all requests should complete in less than 300 ms.
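As a small illustration, such context-independent criteria can be expressed directly in terms of the metric itself, with no labeling involved. The sketch below (plain Python with NumPy; the function names, thresholds and data are purely illustrative, not part of any specific product) checks the two criteria mentioned above: a day-over-day sales decline and a 95th-percentile latency threshold.

```python
import numpy as np

def sales_decline_alert(sales_today: float, sales_yesterday: float) -> bool:
    """'Sales decline' criterion: today's sales should be at least yesterday's."""
    return sales_today < sales_yesterday

def latency_p95_alert(latencies_ms: np.ndarray, threshold_ms: float = 300.0) -> bool:
    """'Normality' criterion: 95% of all requests should complete below the threshold."""
    return np.percentile(latencies_ms, 95) > threshold_ms

# Illustrative usage with synthetic data.
print(sales_decline_alert(sales_today=920.0, sales_yesterday=1000.0))  # True -> violation
rng = np.random.default_rng(42)
latencies = rng.lognormal(mean=5.0, sigma=0.5, size=10_000)            # synthetic latencies in ms
print(latency_p95_alert(latencies))                                    # True if p95 exceeds 300 ms
```

Both criteria are defined purely in terms of the metric itself, which is exactly what makes them automatable without expert labeling.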
An example of a context-dependent KPI is CPU utilization. CPU utilization is largely a technical metric, disconnected from business targets, so its normality requires additional context: what type of service is running? Is the service IO-bound or CPU-bound? What latency KPI is enforced, so that the CPU should be only partially busy and able to absorb request bursts? These and possibly other parameters have to be provided by an expert, which is why we refrain from defining normality on such metrics.
2. The inference engine
The inference engine is the second module in KPI-driven RCA. Its task is to automate the root cause search and provide the engineer with the top contributors. The contributors can be modules, data patterns or upgrades.
Depending on the type of KPI, there are several possible approaches to inference.
For additive KPIs such as revenue, finding the root cause is defined as a search across the dimensions that best explain the KPI anomaly. This approach considers the top-level KPI as a sum-aggregate across multiple dimensions - that is why it is called additive. The sub-aggregates are collected across dimensions such as datacenter, marketing program, user device, region, processing node, service, etc. In complex cases the structure of the dimensions is hierarchical [1]. The inference algorithm evaluates all dimensions and isolates only those with the highest explanatory power and surprise scores [2].
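To make the idea concrete, here is a minimal sketch in the spirit of the Adtributor approach [2]. It is not StackState's production algorithm; the forecast values, dimension elements and scoring details are assumptions chosen for illustration. For one dimension it computes, per element, the explanatory power (the share of the total KPI change that element accounts for) and a surprise score (how much the element's share of the KPI shifted between forecast and actual).

```python
import numpy as np

def js_surprise(p: float, q: float) -> float:
    """Jensen-Shannon-style surprise between forecast (p) and actual (q) proportions."""
    m = 0.5 * (p + q)

    def term(x: float) -> float:
        return 0.0 if x == 0 else 0.5 * x * np.log(x / m)

    return term(p) + term(q)

def score_dimension(forecast: dict, actual: dict):
    """Score each element of one dimension of an additive KPI.

    forecast/actual map element name -> sub-aggregate of the KPI
    (e.g. revenue per datacenter).
    """
    f_total, a_total = sum(forecast.values()), sum(actual.values())
    delta_total = a_total - f_total
    scores = []
    for elem in forecast:
        explanatory_power = (actual[elem] - forecast[elem]) / delta_total
        surprise = js_surprise(forecast[elem] / f_total, actual[elem] / a_total)
        scores.append((elem, round(explanatory_power, 3), round(surprise, 4)))
    return sorted(scores, key=lambda s: -s[2])  # most surprising elements first

# Revenue dropped from 300 to 240; almost the entire drop sits in datacenter "eu-1".
forecast = {"eu-1": 100.0, "us-1": 100.0, "ap-1": 100.0}
actual   = {"eu-1":  45.0, "us-1":  98.0, "ap-1":  97.0}
print(score_dimension(forecast, actual))  # "eu-1" scores highest on both measures
```

Running this over every dimension (datacenter, region, device, and so on) and keeping only the elements with the highest combined scores yields the candidate root causes.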
Golden signal KPIs (latency, throughput, error count) are correlate-able KPIs. They are used in multi-tiered architectures where there is a split between user-facing services with public APIs and internal services; this also applies to microservice architectures. The approach here is to gather golden signals from the entire stack and to infer anomaly propagation paths that terminate at the possible contributors. One can use a PageRank-style inference method on a call graph where the link score takes the value of the correlation between the golden signals of the connected services. Variations of this inference can be found in [3] and [4].
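A minimal sketch of this idea follows, using the networkx library. The topology, signal values and parameters are invented for illustration, and weighting edges by the absolute correlation of latencies is only one of several options discussed in [3] and [4].

```python
import networkx as nx
import numpy as np

def rank_root_cause_candidates(call_edges, signals):
    """Rank services by PageRank on a call graph weighted by golden-signal correlation.

    call_edges: list of (caller, callee) pairs.
    signals:    dict mapping service -> time series of one golden signal (e.g. latency).
    Edges follow the call direction, so the random walk drifts from symptomatic
    front-end services towards the back-end dependencies that likely caused the anomaly.
    """
    graph = nx.DiGraph()
    for caller, callee in call_edges:
        weight = abs(np.corrcoef(signals[caller], signals[callee])[0, 1])
        graph.add_edge(caller, callee, weight=weight)
    scores = nx.pagerank(graph, alpha=0.85, weight="weight")
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy topology: the front end calls two back ends; backend-a depends on a database
# whose latency shifts halfway through the observation window.
rng = np.random.default_rng(0)
db_latency = rng.normal(100, 5, 200)
db_latency[150:] += 80                                # incident in the database
signals = {
    "frontend":  db_latency + rng.normal(0, 5, 200),  # inherits the anomaly
    "backend-a": db_latency + rng.normal(0, 5, 200),  # inherits the anomaly
    "backend-b": rng.normal(50, 5, 200),              # unaffected
    "db":        db_latency,
}
edges = [("frontend", "backend-a"), ("frontend", "backend-b"), ("backend-a", "db")]
print(rank_root_cause_candidates(edges, signals))     # "db" should rank highest
```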
How StackState relates IT issues to business KPIs
The approaches we just described are two examples of finding the root causes of KPI anomalies. Despite the common architectures we have observed at many companies, every environment is still unique and has its own variation of performance and capacity problems.
There is always a common theme that may suggest the adoption of a specific method, but ops engineers have to adjust it to their own environment. Underestimating this leads to poor adoption of an AIOps approach that does not reduce the number of false positives but does exactly the opposite - it generates a huge number of alerts. Therefore the implementation has to be a conscious choice based on evidence, essentially the result of analysis.
StackState's AIOps platform can assist the ops engineer and data scientist on multiple levels. StackState is able to ingest data beyond typical monitoring data, including business metrics, CMDBs, CI/CD tools, service registries, automation and incident management tools. StackState uses the data it collects to learn about dependencies, allowing it to build a unique topology visualization of an IT environment - from infrastructure to business processes and from on-premise to cloud.
Further work is done with the help of Automated Machine Learning (AutoML) capabilities, where a predefined problem-solution recipe is automatically trained, validated and tuned to the local environment. The goal of the AutoML feature is to reduce the involvement of the data scientist in the notoriously laborious process of data preprocessing, cleaning, imputing, feature engineering and hyperparameter selection. Although data preprocessing has its specifics and every environment is quite unique, in StackState it is effective because it focuses only on certain types of data. This, of course, does not prevent the data scientist from opening a Jupyter notebook, querying correlated streams and playing with them using their own favorite Python toolset, such as scikit-learn, TensorFlow and Keras. This opens the opportunity to fine-tune the models, swap them for completely custom ones or use completely different services altogether.
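For example, a data scientist could pull a correlated stream into a notebook and swap in a custom anomaly model. The snippet below is only a sketch of that workflow: the DataFrame is synthesized here, whereas in practice it would be queried from the platform, and the choice of an Isolation Forest and its parameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for an exported, correlated telemetry stream.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "latency_p95_ms": rng.normal(250, 20, 500),
    "error_rate":     rng.normal(0.01, 0.002, 500),
})
df.loc[480:, "latency_p95_ms"] += 150  # inject an incident at the tail of the stream
df.loc[480:, "error_rate"] += 0.05

# A custom model the data scientist might swap in, with the contamination
# parameter tuned to the expected incident rate.
model = IsolationForest(contamination=0.05, random_state=0).fit(df)
df["anomaly"] = model.predict(df) == -1
print(df["anomaly"].tail(20).mean())   # fraction of the injected window flagged
```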
StackState's visualization capabilities make it possible not only to visualize topology and telemetry and "rewind" them back in time, but also to visualize infrastructure highlights. A highlight is a general event whose goal is to attract the attention of the ops engineer. Typical highlights are alerts, infrastructure events correlated with KPI violations, or infrastructure state predictions.
The problem of finding the root cause of KPI anomalies is essential for an ops engineer. There is a handful of methods that can be applied, and the usefulness and applicability of a specific method depend on experimental evidence. StackState allows you both to conduct the experiments and, based on the confidence obtained, to automatically tailor the approach to a specific environment.
References
1. HotSpot: Anomaly Localization for Additive KPIs With Multi-Dimensional Attributes. DOI: 10.1109/ACCESS.2018.2804764
2. Adtributor: Revenue Debugging in Advertising Systems. ISBN 978-1-931971-09-6
3. Root Cause Analysis of Anomalies of Multitier Services in Public Clouds. DOI: 10.1109/TNET.2018.2843805
4. Root cause detection in a service-oriented architecture. DOI: 10.1145/2465529.2465753