Using Anomaly Detection to Validate Data Quality

Isolation Forest is an unsupervised anomaly detection algorithm for finding multivariate anomalies.

Sourced from https://www.datavedas.com/wp-content/uploads/2018/05/3.1.2-UNSUPERVISED-LEARNING-3-S-1.jpg

Introduction

Anomalies and outliers both reflect irregularities in the data. They differ, however, in how many variables are involved: anomalies are multivariate and outliers are univariate. Outliers in a single variable can be easily identified by utilizing a normal distribution or a boxplot. Conversely, anomalies are more difficult to identify because they are determined based on multiple variables at the same time.

Anomaly detection can be used for many different purposes. Anomalies and outliers reflect extreme values in the dataset, and what those extreme values mean depends on the use case. Common use cases of anomaly detection include:

1. Fraud detection (e.g. credit cards)

2. Intrusion detection (e.g. IT systems)

3. Predictive maintenance (e.g. equipment)

4. Health deterioration (e.g. cancer detection)

An unusual use case for anomaly detection is to identify data quality issues. Poor data quality needs to be quickly identified and rectified because it can permeate through the entire data science process, negatively impacting the integrity of the business insights and recommended actions. A common adage that sums this up nicely is: “Garbage in, Garbage out”.

My focus here will be on first introducing the common ways to detect outliers and anomalies and then diving briefly into how to use the Isolation Forest anomaly detection algorithm to reveal potential data quality issues.

Outlier vs Anomaly Detection

Outliers can be easily identified visually since they are based on univariate data (a single variable). There are 2 main ways to visually identify an outlier:

1. Distribution Plot for normally distributed data where the mean is a good measure of the center of the data

2. Boxplot for non-normally distributed data where the median is a good measure of the center of the data

With the distribution plot, in the case of a normally distributed dataset, outliers are defined as any value that is more than 3 standard deviations above or below the dataset mean. Below is an example of a normally distributed variable, Price Momentum, which quantifies a stock's volatility. The distribution represents 4,311 different stocks and their respective Price Momentum values.

Distribution of Price Momentum for 4,311 Stocks

As seen in the distribution graph above, outliers are any values that fall outside of the upper and lower limits, which are shown as the red dotted lines. In total, 33 stocks (0.76%) are outliers that fall beyond the 3 standard deviation boundaries.
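The 3 standard deviation rule described above can be sketched in a few lines of NumPy. This is only an illustration on synthetic, normally distributed values; the function name `three_sigma_outliers` and the generated `momentum` array are my own assumptions and are not the data behind the chart.

```python
import numpy as np

def three_sigma_outliers(values, n_sigma=3.0):
    """Return a boolean mask marking values more than n_sigma
    standard deviations above or below the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()
    lower, upper = mean - n_sigma * std, mean + n_sigma * std
    return (values < lower) | (values > upper)

# Synthetic stand-in for a normally distributed metric (not the article's data)
rng = np.random.default_rng(0)
momentum = rng.normal(loc=0.0, scale=1.0, size=4311)

mask = three_sigma_outliers(momentum)
print(f"{mask.sum()} outliers ({mask.mean():.2%} of {momentum.size} values)")
```

For truly normal data, only a small fraction of a percent of values should land beyond the limits, so a noticeably larger share is a hint that the variable has heavier tails than a normal distribution.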

In the case of a non-normal distribution, a boxplot is a better visual indicator of outliers. This is because the boxplot is built from the median and the quartiles rather than the mean and standard deviation, which are easily skewed by extreme values. The outlier boundaries are calculated by multiplying the inter-quartile range by 1.5, where the inter-quartile range is the 75th percentile value minus the 25th percentile value. Therefore, any value that is more than 1.5 times the inter-quartile range below the 25th percentile or above the 75th percentile is considered an outlier.

Boxplot of 1-Year Price Percent Change of 4,311 Stocks

The boxplot above represents all the values for the 1-year percent change of 4,311 stocks. As you can see, the outliers are the points outside the boundaries, which are set by multiplying the interquartile range by 1.5 and then subtracting that value from the 25th percentile and adding it to the 75th percentile. On the boxplot, the “whiskers” extend to the most extreme values that still fall within these 2 boundaries.
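Here is a minimal sketch of the boxplot rule, using the common convention in which the boundaries sit 1.5 times the interquartile range below the 25th percentile and above the 75th percentile. The helper name `iqr_outlier_bounds` and the handful of percent-change values are made up for illustration.

```python
import numpy as np

def iqr_outlier_bounds(values, k=1.5):
    """Return the (lower, upper) boundaries used to flag boxplot outliers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Made-up 1-year percent changes, including one extreme value
pct_change = [-12.0, -3.5, 1.2, 4.8, 7.1, 9.9, 15.3, 22.0, 310.0]

lower, upper = iqr_outlier_bounds(pct_change)
outliers = [v for v in pct_change if v < lower or v > upper]
print(f"boundaries: ({lower:.2f}, {upper:.2f}), outliers: {outliers}")
```

Because the quartiles barely move when a single extreme value is added, the boundaries stay stable even for skewed data, which is why this rule suits non-normal distributions.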

The next step once outliers are identified is to determine what to do with these values. It is common to remove them, especially if the outliers are more noise than signal for the specific business question you are answering. If you are trying to find which stocks have performed the best over the past year because you’d like to identify investment opportunities, then the outliers are exactly what you are looking for. However, if the outliers are due to a data quality issue, or the business question is about common correlations between different variables, you may want to remove the outliers to focus on the center of the dataset that is most representative.

Anomaly Detection Using Isolation Forest

Anomaly detection differs from outlier detection in that it can be used to identify outliers in multivariate datasets. This is especially helpful when there are more than 3 dimensions, because the data can no longer be plotted directly to spot outliers visually.

The Isolation Forest is an unsupervised machine learning algorithm, which means it attempts to find anomalies without having a ground-truth target to train from. Isolation Forest is a tree-based algorithm that randomly selects a variable and splits it at a random value between that variable's minimum and maximum, repeating this multiple times to isolate a data point. Grown to completion, the decision trees end up isolating every single data point (leaf size = 1), and anomalies are determined by how many partitions, averaged across the trees, are needed to isolate a data point into its own leaf. The lower the number of partitions, the more anomalous the data point. The predicted results will be a “1” for a normal data point and a “-1” for an anomaly.
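To make the partitioning idea concrete, here is a toy sketch that counts how many random splits it takes to isolate one point. It is a simplification of my own and not scikit-learn's implementation: it works on the full dataset instead of subsamples, applies no depth limit, and the function names and synthetic data are assumptions for illustration.

```python
import numpy as np

def partitions_to_isolate(X, target_idx, rng, max_splits=1000):
    """Count random axis-aligned splits until X[target_idx] is alone."""
    subset = X
    target = X[target_idx]
    splits = 0
    while len(subset) > 1 and splits < max_splits:
        lo, hi = subset.min(axis=0), subset.max(axis=0)
        splittable = np.flatnonzero(hi > lo)
        if splittable.size == 0:  # only exact duplicates remain
            break
        feature = rng.choice(splittable)                    # random variable
        threshold = rng.uniform(lo[feature], hi[feature])   # random split value
        # Keep only the side of the split that contains the target point
        if target[feature] < threshold:
            subset = subset[subset[:, feature] < threshold]
        else:
            subset = subset[subset[:, feature] >= threshold]
        splits += 1
    return splits

rng = np.random.default_rng(0)
# 200 "normal" points around the origin plus one far-away point
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
outlier_idx = len(X) - 1
inlier_idx = int(np.argmin(np.linalg.norm(X[:200], axis=1)))

def average_partitions(idx, trials=200):
    """Average the split count over many random partitionings."""
    return float(np.mean([partitions_to_isolate(X, idx, rng) for _ in range(trials)]))

avg_outlier = average_partitions(outlier_idx)
avg_inlier = average_partitions(inlier_idx)
print(f"far-away point: {avg_outlier:.1f} splits, central point: {avg_inlier:.1f} splits")
```

The far-away point should need far fewer splits on average than the point in the middle of the cloud, which is exactly the signal the algorithm turns into an anomaly score.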

This algorithm, unlike other anomaly detection algorithms, does not need to first understand the “normal” datapoints to then find anomalies. This is a big benefit of the Isolation Forest as it requires less time and memory to run.

Isolation Forest Example Schematic (source: https://betterprogramming.pub/anomaly-detection-with-isolation-forest-e41f1f55cc6)

Anomaly Detection for Identifying Poor Data Quality

Poor data quality can hurt our ability to answer business questions with the accuracy required to be confident in those answers. Anomaly detection can help identify where the data quality may be suspect. I recently utilized the Isolation Forest algorithm to identify stocks that may have anomalous data. The API that this data comes from has been shown to have data quality issues, and I needed a quick way to identify those anomalies so that I was not making investment decisions based on bad data. I was classifying stocks into “action” categories with labels such as “Buy”, “Hold” and “Sell”. These labels were based on 5 key metrics that would rank and label the stocks. Ensuring the underlying data quality of these 5 key metrics was imperative for an accurate analysis. I utilized scikit-learn’s Isolation Forest estimator from its ensemble module to identify which stocks may have anomalous data in the 5 key metrics.

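# import the Isolation Forest estimator
from sklearn.ensemble import IsolationForest

# create the isolation forest with default parameters
if_model = IsolationForest()

# fit on the metric columns only; 'symbol' is an identifier, not a feature
results = if_model.fit(df_anomaly.drop(columns=['symbol']))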

While there are many parameters that can be tuned, the most important is “contamination”, which per the docstring is “The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples.” The default is “auto”, which is the best choice when the level of contamination (number of outliers) is not known. Since I did not know how many stocks potentially had bad data, I kept the default, which resulted in 4.7% of the stocks being flagged as anomalies.
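As a rough sketch of how the flagged share can be measured, the snippet below fits the model on synthetic data and compares the default contamination with a fixed value. The `metric_1` to `metric_5` column names, the generated symbols, and the `random_state` values are placeholders I made up; they are not the metrics or settings from my analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the stock table: a symbol plus 5 hypothetical metrics
rng = np.random.default_rng(42)
metric_cols = [f"metric_{i}" for i in range(1, 6)]
df_anomaly = pd.DataFrame(rng.normal(size=(1000, 5)), columns=metric_cols)
df_anomaly.insert(0, "symbol", [f"STK{i:04d}" for i in range(1000)])

features = df_anomaly[metric_cols]

# contamination="auto": the share of flagged rows depends on the data
auto_model = IsolationForest(contamination="auto", random_state=0).fit(features)
auto_labels = auto_model.predict(features)  # 1 = normal, -1 = anomaly
auto_share = (auto_labels == -1).mean()

# A fixed contamination sets the threshold so roughly that share is flagged
fixed_model = IsolationForest(contamination=0.05, random_state=0).fit(features)
fixed_share = (fixed_model.predict(features) == -1).mean()

# Store a per-stock marker and score (lower score = more anomalous)
df_anomaly["anomaly"] = auto_labels
df_anomaly["anomaly_score"] = auto_model.score_samples(features)

print(f"flagged with 'auto': {auto_share:.1%}, flagged with 0.05: {fixed_share:.1%}")
```

Setting `random_state` is optional, but without it the flagged set can change slightly from run to run because the trees are built from random splits.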

Validating the Results

My main goal was to have an anomaly “marker” for each stock based on the Isolation Forest results, to give me pause when a label of “Buy”, “Hold” or “Sell” may be due to incorrect data. I created a Tableau dashboard that allowed me to investigate stocks for investment, and the anomaly detection marker helped me quickly see where to proceed with caution. For example, IIVIP was the number 1 ranked stock based on the 5 key metrics; however, it had an anomaly marker by it.

Tableau Dashboard Table Showing Stock Performance with Anomaly Detection

Further evaluation revealed that IIVIP had overstated income data in Q3 and Q4 of 2020, which made it look like it was outperforming relative to its stock price. Looking back at the key metrics, the Isolation Forest algorithm deemed it an anomaly due to an extremely large value for Revenue per Share of 26,457. Now, would a simple outlier detection method like a distribution or boxplot have found this data point to be anomalous? Potentially. However, anomaly detection is more powerful because a data point's value could be well within the normal range (not an outlier) and still be an anomaly in the context of the other variables and their data points. This is how anomaly detection differs from outlier detection.
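The difference can be illustrated with a small synthetic experiment: two strongly correlated variables plus one point whose individual values look ordinary but whose combination is unusual. Everything here (the covariance, the point at 2.5 and -2.5, the seeds) is an assumption chosen for illustration, and because Isolation Forest splits along one variable at a time, how strongly it reacts to such points will vary with the data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Two strongly correlated variables: high values of one go with high values of the other
cov = [[1.0, 0.95], [0.95, 1.0]]
X = rng.multivariate_normal([0.0, 0.0], cov, size=1000)

# A point that breaks the relationship: high on one variable, low on the other
odd_point = np.array([[2.5, -2.5]])
X_all = np.vstack([X, odd_point])

# Univariate check: is any single value more than 3 standard deviations from its mean?
z = np.abs((X_all - X_all.mean(axis=0)) / X_all.std(axis=0))
univariate_flag = (z > 3).any(axis=1)

# Multivariate check: Isolation Forest score (lower = more anomalous)
model = IsolationForest(random_state=0).fit(X_all)
scores = model.score_samples(X_all)

print(f"flagged by the 3 standard deviation rule: {univariate_flag[-1]}")
print(f"score of odd point: {scores[-1]:.3f}, median score: {np.median(scores):.3f}")
```

In this setup the odd point passes the univariate check on both variables, yet its Isolation Forest score should be more anomalous than that of a typical point, since it sits in a region where almost no other data lives.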

Tableau Dashboard Showing Why IIVIP Was Marked as an Anomaly

Conclusion

Identifying and dealing with outliers and anomalies is an important part of analyzing a dataset, and how you deal with these data points depends on the problem you are trying to solve. Anomaly detection can be a powerful tool for finding rare data points in many use cases, especially when you are looking at multiple variables at the same time. My experience of using the Isolation Forest anomaly detection algorithm to find where the data quality may be bad is just one example of how powerful it can be for ensuring data analyses are accurate.