AIOps: Anomaly Detection Using Deep Learning

Artem Grotov

As IT infrastructure grows and becomes more dynamic, detecting and fixing problems becomes increasingly difficult: the behavior of the system becomes too complex to understand without special tools. One such tool is anomaly detection. It allows system operators to quickly discover potential problems by automatically analyzing data generated by the IT system and notifying them when something “unusual” happens. Anomaly detection offers the following benefits for IT teams:

  • Save time and increase productivity. Manually setting static checks doesn't work for today's dynamic environments. Anomaly detection does the heavy lifting for you and automatically sets up baselines. If there's any anomalous behavior, it can alert you immediately.

  • Proactively identify threats as soon as they arise. React to problems before they affect your critical business processes.

  • Find what you didn't know. Anomaly detection helps you spot problems that static checks would miss, which in turn helps you understand your IT landscape better.
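To make the idea of an automatically learned baseline concrete, here is a minimal Python sketch. It is an illustration only, not StackState's implementation: it uses a simple rolling mean and standard deviation rather than deep learning, and the function name `rolling_baseline_anomalies`, the `window` and `n_sigmas` parameters, the synthetic metric, and the static threshold of 125 are all assumptions made for the example.

```python
import numpy as np

def rolling_baseline_anomalies(values, window=20, n_sigmas=4.0):
    """Flag points that deviate strongly from a baseline learned
    from the preceding `window` observations."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = history.mean()
        spread = history.std() + 1e-9  # guard against a perfectly flat history
        flags[i] = abs(values[i] - baseline) > n_sigmas * spread
    return flags

# Synthetic metric (assumption): a slow cycle between roughly 80 and 120,
# plus noise, with one spike injected while the metric is near its low point.
rng = np.random.default_rng(0)
t = np.arange(500)
metric = 100 + 20 * np.sin(2 * np.pi * t / 200) + rng.normal(0, 1, size=t.size)
metric[350] += 25  # the "unusual" event

# A static check has to sit above the normal peak, so it misses the spike.
static_threshold = 125.0  # hypothetical hand-set limit
static_alert = metric[350] > static_threshold

flags = rolling_baseline_anomalies(metric)
print("static check fired:  ", bool(static_alert))
print("baseline check fired:", bool(flags[350]))
```

The spike stays inside the range the metric normally covers, so a fixed limit cannot separate it from ordinary peaks, while a baseline that follows recent behavior can.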

Proactive insights about monitored processes

Anomaly detection is an important and well-studied problem. Anomalies, also referred to as abnormalities, deviants, or outliers, often convey valuable information about the monitored process. It is important to detect outliers in your data streams because they may indicate malicious actions, system failures, or intentional fraud. Anomaly detection is therefore considered an essential step in various decision-making systems. It is used in many fields: fraud detection, cyber-intrusion detection, medical anomaly detection, sensor network anomaly detection, video surveillance, Internet of Things anomaly detection, log anomaly detection, industrial damage detection, and others. At StackState we use anomaly detection to discover problems in complex IT environments.

The use of deep learning for anomaly detection is motivated by the following challenges. First, traditional algorithms perform sub-optimally at detecting outliers in image (e.g. medical images), time series, natural language, and graph-based datasets because they fail to capture complex structures in the data. Second, traditional methods do not scale to large volumes of data. Finally, the automatic feature learning capability of deep learning eliminates the need for domain experts to handcraft features manually: it solves the problem end-to-end using raw input data.

3 types of deep learning approaches to anomaly detection

There are three types of deep learning approaches to anomaly detection based on the availability of labels: supervised, semi-supervised, and unsupervised. Supervised methods formulate the anomaly detection problem as a classification problem. They have superior performance; however, they need a large volume of training examples with anomalies present, which limits their applicability. Unsupervised methods are based on (variational) autoencoders and generative adversarial networks (GANs); they require only anomaly-free examples for training. Given enough training samples of the normal class, autoencoders and GANs produce lower reconstruction errors for normal instances than for anomalous events, so a high reconstruction error signals an anomaly. Semi-supervised methods leverage small sets of anomalous examples to tune the performance of unsupervised methods.
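The reconstruction-error idea can be sketched in a few lines of Python. To keep the example small and deterministic, it swaps the deep autoencoder for its simplest relative: a linear autoencoder whose optimal weights are obtained in closed form with PCA (via SVD). The data generator, the `MIXING` matrix, the helper names, and the 99th-percentile threshold are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "normal" telemetry: 5 observed features driven by 2 hidden factors.
MIXING = np.array([[1.0, 0.5, -0.3, 0.8, 0.1],
                   [0.2, -1.0, 0.7, 0.1, 0.9]])

def make_normal(n):
    factors = rng.normal(size=(n, 2))
    return factors @ MIXING + rng.normal(scale=0.05, size=(n, 5))

def fit_linear_autoencoder(x, n_latent):
    """PCA via SVD: the closed-form optimum of a linear autoencoder."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_latent]

def reconstruction_error(x, mean, components):
    latent = (x - mean) @ components.T          # encode
    reconstructed = latent @ components + mean  # decode
    return np.mean((x - reconstructed) ** 2, axis=1)

# Train on anomaly-free data only, then pick a threshold from the training errors.
train = make_normal(1000)
mean, components = fit_linear_autoencoder(train, n_latent=2)
threshold = np.percentile(reconstruction_error(train, mean, components), 99)

# Fresh normal samples follow the learned structure; the anomalies do not.
normal_errors = reconstruction_error(make_normal(50), mean, components)
anomalies = rng.normal(size=(50, 5))  # features no longer correlated as in training
anomaly_errors = reconstruction_error(anomalies, mean, components)
flagged = anomaly_errors > threshold

print("mean error, normal:   ", normal_errors.mean())
print("mean error, anomalous:", anomaly_errors.mean())
print("fraction of anomalies flagged:", flagged.mean())
```

A nonlinear autoencoder follows the same pattern: fit on normal data, measure how badly each new sample is reconstructed, and flag the samples whose error exceeds a threshold.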

Identify potential problems in your IT landscape with StackState AIOps

At StackState we use a combination of supervised, semi-supervised, and unsupervised approaches. We apply them to telemetry streams to identify potential problems in the IT landscape. At the core of our implementation is unsupervised representation learning using autoencoders. This representation is then used to compute the log-likelihood of the observed data and alert the user if it is too low. Additionally, the same representation is used to train a supervised neural network to explain past known anomalies and predict new ones. For training, it uses implicit labels derived from Jira tickets. This way we can combine the power of supervised methods with the broad applicability of unsupervised methods to achieve high precision and recall in anomaly detection.
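The "alert when the log-likelihood is too low" step might look roughly like the sketch below. The article does not say which density model is used, so this example assumes a multivariate Gaussian fitted to the learned representations. The latent vectors here are synthetic stand-ins for autoencoder outputs, and the 1st-percentile threshold and helper names are likewise assumptions.

```python
import numpy as np

def fit_gaussian(z):
    """Estimate mean and covariance of the representations of normal data."""
    mean = z.mean(axis=0)
    cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(z.shape[1])  # small ridge for stability
    return mean, cov

def log_likelihood(z, mean, cov):
    """Row-wise log-density of z under a multivariate Gaussian."""
    d = z.shape[1]
    diff = z - mean
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis_sq = np.einsum("ij,ij->i", diff, solved)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (mahalanobis_sq + logdet + d * np.log(2 * np.pi))

# Synthetic stand-in for 2-D representations of anomaly-free telemetry.
rng = np.random.default_rng(7)
train_latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=2000)

mean, cov = fit_gaussian(train_latent)
train_ll = log_likelihood(train_latent, mean, cov)
threshold = np.percentile(train_ll, 1)  # alert on the least likely 1% (assumed policy)

# Two new observations: one typical, one far from anything seen in training.
new_latent = np.array([[0.1, -0.2],
                       [6.0, -6.0]])
alerts = log_likelihood(new_latent, mean, cov) < threshold
print(alerts)
```

Scoring in the representation space rather than on raw telemetry keeps the density model small, and the percentile gives operators a direct handle on how often alerts fire on normal data.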

Anomaly detection StackState

Anomaly detection allows operators to quickly and automatically spot potential problems in complex IT systems. This makes it possible for small teams to manage big systems, enabling the business to scale while maintaining team productivity and staying competitive.

IDC Research recently published a report about how StackState applies Artificial Intelligence across topology, telemetry, time and traces to drive better service delivery outcomes.