Part II: Anomaly detection within monitoring: how can you get started?
Anomaly detection within monitoring. Let’s go deeper into this subject
In a previous post we introduced anomaly detection as a group of techniques used to identify unusual behavior that does not conform to expected data patterns. In this article we will see how to apply anomaly detection within monitoring.
Here we go:
We believe anomaly detection is a potential alternative to static thresholds in monitoring, so we recommend first identifying the areas of your environment with:
- A large number of alerts, most of them non-actionable
- Thresholds and alerts that need frequent redefinition
- Metrics such as latency, throughput or error counts, for which no static threshold seems to work properly
Choosing the metric or group of metrics to work with is not a simple task. There is no universal guide that applies to every problem.
However, once you have identified the part of your environment you want to study with anomaly detection, there are some practical recommendations you need to know in order to start the process correctly:
- Take your time to understand the problem: you might find a KPI behaving badly, but don't rush to define the metrics involved. Instead, review the problem and make sure there is no static-rule solution; turn to a statistical procedure only when it is really the only option.
- Choose metrics that behave regularly in normal periods and unusually in problem periods. If a metric behaves erratically even under normal conditions, you probably need to check the problem definition once more.
- Start with fewer metrics: it is easier to apply anomaly detection to one metric for one problem first, and only then extend it to a group of metrics.
Define your toolkit
Check your monitoring tools and see whether any of them offers an anomaly detection solution. If one does, document that solution.
Tools that offer anomaly detection are based on algorithms that implement different statistical models. You need to document: Which model is implemented? How does the algorithm work? What properties does your data need to have?
Keep in mind that many tools distinguish between anomaly detection and outlier detection, so it is a good idea to be clear about the models and algorithms used in each case.
If you have a cloud or hybrid environment, you should check what kind of services your cloud provider offers. Microsoft Azure, for example, offers implementations of two algorithms: Support Vector Machines (SVM) and Principal Component Analysis (PCA).
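To make the PCA idea concrete: the model learns the correlations among a set of metrics from normal data, then flags points whose reconstruction error from the principal components is unusually large. The sketch below uses synthetic data and a made-up threshold rule; it illustrates the general technique, not Azure's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic "normal" data: 200 samples of 3 strongly correlated metrics
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               base * 2 + rng.normal(scale=0.1, size=(200, 1)),
               base * -1 + rng.normal(scale=0.1, size=(200, 1))])

mean = X.mean(axis=0)
Xc = X - mean
# Principal components from the SVD of the centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:1]  # keep only the dominant component

def reconstruction_error(x):
    """Distance between a point and its projection onto the kept components."""
    xc = x - mean
    return float(np.linalg.norm(xc - (xc @ P.T) @ P))

# Made-up threshold rule: 99th percentile of the training errors
errors = [reconstruction_error(x) for x in X]
threshold = float(np.percentile(errors, 99))

normal_point = mean + 0.7 * Vt[0]            # lies along the learned correlation
odd_point = mean + np.array([0.5, -3.0, 4.0])  # breaks the learned correlation

print(reconstruction_error(normal_point) <= threshold)  # → True
print(reconstruction_error(odd_point) > threshold)      # → True
```

The point of the example: the anomalous sample is not extreme in any single metric; it is the broken correlation between metrics that PCA detects.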
To provide a more general view, for this post we ran a test in which we imagined a monitoring tool without an anomaly detection solution and with no cloud service.
If your situation is similar to this, your first step is to define an architecture for anomaly detection.
So you need a tool that efficiently receives, stores and processes all the time-series data for the metrics you work with. It should also have a strong visualization module, because anomaly detection is a visually demanding activity.
Choosing a tool
To choose a tool, you need:
- Compatibility: proven compatibility with the monitoring tool you are using to collect your time-series data
- Easy installation and configuration
- Scalability: even if you are just testing out anomaly detection, you should think about how scalable the solution is. Consider that the architecture will have to handle different problems, different metrics and a large amount of sensitive data.
- Familiarity: it might be a good idea to choose a tool that you have already used before and feel comfortable with.
For our test we chose to install and configure Graphite and Grafana because:
- The monitoring tool we use to gather the data appears on the list of tools compatible with Graphite.
- We have used Graphite and Grafana in other projects with good results.
- Scalability was not a key factor in our test, so we installed Graphite with its default database on a very simple server, without any redundancy scheme.
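For context on how data gets into such a setup: Graphite's Carbon listener accepts time-series points over a simple plaintext protocol, one `metric.path value timestamp` line per datapoint (TCP port 2003 by default). A minimal sketch in Python, with a hypothetical metric name:

```python
import socket
import time

def format_metric_line(path, value, timestamp=None):
    """Graphite's plaintext protocol: one '<path> <value> <unix-timestamp>' line."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"

def send_metric(path, value, host="localhost", port=2003):
    """Send one datapoint to the Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(format_metric_line(path, value).encode("ascii"))

# With a Graphite server running locally, this would record one datapoint
# for a hypothetical latency metric:
# send_metric("shop.checkout.latency_ms", 123.4)
```

In practice your monitoring tool's Graphite integration handles this for you; the sketch only shows how little the wire format demands.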
In addition to offering a solution for storing and graphing time-series data, the projects behind these tools also offer anomaly detection solutions. This is why it is necessary to look into the implemented models and algorithms, in order to understand all the options available for an anomaly detection within monitoring project.
In our test, we used R packages and anomaly detection services to complete our toolkit.
Models and techniques
Now that we have a problem to solve, an implemented architecture, and a clear toolkit with defined metrics, it is time to go deeper into the models associated with anomaly detection.
As we said in our first article, anomaly detection always works on two fronts:
- Defining the expected or “normal” behavior
- Identifying and studying unusual behavior patterns
Defining the expected behavior implies being able to make a prediction, and this prediction process requires a statistical model.
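As an illustration of the simplest such model, the expected value can be predicted as the mean of a recent window of observations, with anything more than k standard deviations away flagged as anomalous. A minimal sketch with made-up latency values:

```python
from statistics import mean, stdev

def detect_anomalies(series, window=10, k=3.0):
    """Flag points more than k standard deviations away from the mean
    of the preceding `window` observations (a simple baseline model)."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        band = max(sigma, 1e-9) * k  # avoid a zero-width band on flat data
        if abs(series[i] - mu) > band:
            anomalies.append((i, series[i]))
    return anomalies

# Made-up latency samples with one obvious spike
latency = [100, 102, 99, 101, 103, 98, 100, 102, 99, 101, 250, 100, 101]
print(detect_anomalies(latency))  # → [(10, 250)]
```

Note the weakness the post warns about: once the spike enters the window, the inflated standard deviation masks later deviations, which is one reason simple models need careful evaluation.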
That is one of the reasons why anomaly detection within monitoring can be difficult to implement. Understanding the models behind anomaly detection sometimes seems like something only personnel specialized in statistics or operations research can do.
However, the efforts of DevOps and monitoring analysts are necessary to apply each model correctly and to improve the chances of getting an interesting result.
Models are implemented by algorithms. If you check the anomaly detection options in your tools, you will find the algorithms they use. Even so, it is worth understanding the whole process and knowing which algorithm and which model you can use.
Finally, before you apply any algorithm, we recommend:
- Start with the simplest: the truth is that no model-algorithm pair works for every case. Our recommendation is to try the simplest model and algorithm first; if you don’t obtain satisfactory results, move on to more complicated pairs.
- Evaluate your data: a central and not always easy task for analysts is to verify that the time-series data satisfies the conditions of the chosen model and algorithm, so it is important not to skip this step.
- Stay away from general solutions: do not be tempted by solutions that guarantee good results in all cases; they usually fail in most situations.
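One rough way to start evaluating a series is to compare its two halves: if their means differ by much more than their spread, the constant-mean assumption behind many simple models probably does not hold. A sketch of this heuristic (a quick sanity check, not a formal stationarity test):

```python
from statistics import mean, stdev

def halves_look_similar(series, k=1.0):
    """Rough heuristic: the two halves of the series should have similar
    means relative to their spread. A large shift suggests a trend or a
    level change that violates constant-mean model assumptions."""
    half = len(series) // 2
    a, b = series[:half], series[half:]
    pooled = (stdev(a) + stdev(b)) / 2
    return abs(mean(a) - mean(b)) <= k * max(pooled, 1e-9)

print(halves_look_similar([10, 11, 10, 9, 10, 11, 10, 9, 10, 11]))  # → True
print(halves_look_similar(list(range(1, 11))))                      # → False
```

A steady series passes; a trending one fails, signaling that you should either detrend the data or pick a model that handles trends before applying any anomaly detection algorithm.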
Don’t hesitate to share your experience with anomaly detection within monitoring.