Anomaly detection, also referred to as outlier detection, is an invaluable technique to be leveraged in data sciences. When evaluating a dataset, an anomaly is any observation or event that does not conform to an expected pattern or to the other items. In modern use case, the detection of an anomaly is often indicative of adverse events such as network intrusions, bank fraud, medical problems, or errors in text. To demonstrate the utility and efficacy of anomaly detection, 2-dimensional anomaly detection will be implemented and compared to detection based on Mahalanobis Distance.
The R package mvoutlier
includes a variety of methods used in anomaly detection and will be the primary basis for this tutorial. The package Rfunspaceadventure
was built by 2Lt Kallhoff and will be used for an inclass exercise at the end of the tutorial.
library(mvoutlier)
library(Rfunspaceadventure)
The data set humus
we will use is built into the mvoutlier
package and is an ideal and salient use case example. The data originates from the Kola Project, which was a geological survey and data collection effort conducted from 1993-1998. Samples from the humus ground layer were taken from Norway, Finland, and Russia and elemental ground makeup was recorded for 617 unique observations. The humus ground layer is formed from decomposing organic litter to include leaves, roots, branches, etc. We begin by examining the structure of the data:
str(humus)
## 'data.frame': 617 obs. of 44 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ XCOO: num 547960 770025 498651 795152 437050 ...
## $ YCOO: num 7693790 7679170 7668150 7569390 7855900 ...
## $ Ag : num 0.124 0.13 0.124 0.309 0.04 0.421 0.235 0.043 0.07 0.086 ...
## $ Al : num 1420 7190 1970 6360 630 2060 1200 1510 3380 1320 ...
## $ As : num 0.743 0.973 1.1 1.24 0.533 1.54 0.836 0.657 1.75 1.04 ...
## $ B : num 2.51 3.45 1.62 1.59 4.33 1.93 2.51 4.46 5.08 2.1 ...
## $ Ba : num 74 65.2 121 63.5 36 111 109 24.5 42.2 51.4 ...
## $ Be : num 0.02 0.08 0.04 0.07 0.01 0.07 0.01 0.05 0.06 0.01 ...
## $ Bi : num 0.108 0.084 0.137 0.135 0.076 0.176 0.145 0.075 0.259 0.112 ...
## $ Ca : num 3900 1900 3240 2470 3500 3340 3010 3090 1760 2810 ...
## $ Cd : num 0.298 0.39 0.295 0.222 0.214 0.327 0.294 0.18 0.295 0.284 ...
## $ Co : num 1.28 3.37 0.98 3.91 0.43 4.61 0.75 0.55 1.68 1.12 ...
## $ Cr : num 3.04 2.02 1.33 17 0.79 3.78 2.01 1.63 10.9 1.71 ...
## $ Cu : num 10.9 12.3 5.59 29.8 6.23 47.3 8.66 5.55 10 9.25 ...
## $ Fe : num 2050 3170 1100 6180 790 3370 1150 1800 4820 970 ...
## $ Hg : num 0.155 0.263 0.168 0.175 0.197 0.239 0.233 0.217 0.238 0.229 ...
## $ K : num 1100 900 1000 900 700 1200 1000 600 800 1000 ...
## $ La : num 2.3 6.4 1.7 5.3 1.2 7 1.5 2.9 6.3 1.2 ...
## $ Mg : num 820 670 790 1180 1770 800 660 1750 820 570 ...
## $ Mn : num 202 22.8 70.3 118 24.4 86.1 1210 16.7 39.2 115 ...
## $ Mo : num 0.662 0.286 0.166 0.28 0.248 0.473 0.185 0.372 0.277 0.15 ...
## $ Na : num 40 120 30 150 410 60 10 320 110 50 ...
## $ Ni : num 9.41 14.8 5.05 59.5 1.98 78 5.7 2.19 11.7 13.2 ...
## $ P : num 743 1030 922 717 795 942 1120 790 779 745 ...
## $ Pb : num 15.9 13.9 19 16.3 14.3 21.7 15.2 11.6 284 14.1 ...
## $ Rb : num 8.14 2.82 4.45 8.13 2.67 5.23 4.94 2.33 4.07 5.35 ...
## $ S : num 1300 1950 1750 965 1860 1410 1880 2210 1260 1250 ...
## $ Sb : num 0.12 0.161 0.25 0.017 0.169 0.217 0.189 0.126 0.809 0.202 ...
## $ Sc : num 0.6 1.2 0.4 1.4 0.2 0.7 0.4 0.6 1.1 0.3 ...
## $ Si : num 630 640 580 690 440 530 650 490 630 560 ...
## $ Sr : num 22.2 34.4 43.4 29.2 46.4 67.9 20.7 59.1 28.8 23.2 ...
## $ Th : num 0.413 0.281 0.246 0.816 0.25 0.432 0.371 0.535 1.13 0.109 ...
## $ Tl : num 0.081 0.068 0.077 0.099 0.06 0.084 0.088 0.042 0.084 0.097 ...
## $ U : num 0.24 0.12 0.042 0.163 0.127 0.112 0.051 0.265 0.221 0.031 ...
## $ V : num 6.79 3.89 2.86 16.1 1.63 7.4 3.55 2.57 13.9 2.76 ...
## $ Y : num 0.8 2.4 0.5 1.9 0.8 1.6 0.5 1.8 1.7 0.4 ...
## $ Zn : num 59.1 18.1 67.6 43.3 41.4 46 103 21.5 23.2 38.5 ...
## $ C : num 39.9 47.5 44.2 19.4 45.8 47.8 45.7 47.3 29.5 46.8 ...
## $ H : num 5.5 6.8 6.3 3 6.1 5.8 6.5 6.4 4.4 6.5 ...
## $ N : num 1.2 1.5 1.5 0.7 1.5 1.3 1.7 1.5 0.7 1 ...
## $ LOI : num 82.2 94.7 95.4 52.7 95.3 91.3 92 92.6 64.1 94.3 ...
## $ pH : num 3.9 4.1 3.8 4 4.1 3.7 4 4.2 4 3.6 ...
## $ Cond: num 140 100 123 106 108 ...
Before conducting any analysis of this dataset, it is important to ensure that the data is in an acceptable format. The reason this humus dataset provides a strong example for outlier detection is because most observations appear to be numeric and continuous in nature. Outlier detection techniques will normalize all of the data, so the mismatch in scaling is of no consequence. Close attention must still be called to the variables themselves. It is important to note that the first variable corresponds to an identification number rather than a data point and should not be included in outlier detection analysis. Additionally, the variables XCOO and YCOO correspond to map coordinates from which the soil samples were taken. This package offers unique tools for visualizing outliers on a coordinate map, however, the coordinates will not be included in our analysis.
The classic Mahalanobis Distance as shown in equation 1, is a concept first introduced in 1936. By measuring the distance between a point and a distribution to which that point belongs, this technique acts as a statistical measure for the classification of a point as an outlier based on a chi-square distribution.
MD=√(x−ˉx)TC−1(x−ˉx) For many of its multivariate functions, this package uses the Robust Mahalanobis distance based the Minimum Co-variance Determination. While classic calculation of the distances is widely accepted, the robust calculation is preferred. The package provides a tool for plotting the robust versus classic distance called dd.plot
.
x <- log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")])
distances <- dd.plot(x, quan=1/2, alpha=0.025)
# The classical and Robust distances now exist as vectors
str(distances$md.cla)
## Named num [1:617] 1.86 2.6 1.78 2.66 3.03 ...
## - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
str(distances$md.rob)
## Named num [1:617] 2.28 2.64 1.94 2.99 3.83 ...
## - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
The color.plot
function of this package provides outlier classification visualization for a given dataset where input arguments are as follows:
x
: a 2-dimensional data frame or matrixquan
: amount of observations to be used for mcd calculationalpha
: amount of observations used to calculate the adjusted quantileThe plot generated by this function represents the euclidean distance as a color, where blue represent small distances, and red represent large distances from the data set minimum. Mahalanobis distances are represented by their symbol. The function also draws four ellipsoids where the Mahalanobis Distances are constant. The constant values correspond to the 25%, 50%, and 75% and adjusted quantiles. Finally, this function can provide us with vectors containing Mahalanobis Distance, Euclidean Distances, and a Boolean vector qualifying outliers.
x <- log(humus[,c("U","Pb")])
colorData <- color.plot(x, quan=1/2, alpha=0.025)
str(colorData$md)
## Named num [1:617] 1.508 0.908 1.226 0.944 0.896 ...
## - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
str(colorData$euclidean)
## Named num [1:617] 5.12 4.25 3.83 4.76 4.34 ...
## - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
str(colorData$outliers)
## Named logi [1:617] FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
What this technique lacks more than anything is scalability. The data that can be represented in this way is only limited to 2-dimensional. In the present, where big-data is becoming the norm, a multivariate solution is preferable.
The adjusted quantile plot function of the mvoutlier
package will solve for and order the squared Mahalanobis Distances for the given observations and plot them against the empirical Chi-Squared distribution function of these values. The function takes as input:
X
: Data frame or matrix that must be at least 2-dimensional.delta
: Quantile of the Chi-Squared distribution.quan
: the portion of observations used for Minimum Co-variance Determinant (MCD) estimation.alpha
: Maximum Thresholding ProportionThe analysis will be conducted on several randomly selected features of the humus dataset. We will attempt to identify outliers with respects to the elements “Al”, “Co”, “Cu”, “Ni”, and “Pb”, and “U”. These elements were chosen randomly to simply demonstrate the technique.
The graphics below provide an intuitive visualization of the outliers contained in this dataset. All observations are plotted against the first and second principal components, and every observation outside of the Chi-Square quantile is colored in red. The function displays outliers outside of the adjusted chi-square quantile as well. Finally, we are able to save a Boolean list which details which specific observations are considered outliers. Any observation outside of the chi-Square quantile will be listed as “TRUE”, meaning it is an anomaly
# Generate the data frame with sleected features
x <- log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")])
# Execute
outliers <- aq.plot(x, delta=qchisq(0.975, df = ncol(x)),
quan = 1/2, alpha = 0.05)
## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.6449667
# Inspect outliers
str(outliers)
## List of 1
## $ outliers: Named logi [1:617] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..- attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
It may be desirable for the user to eliminate some of the outlier observations manually. In this case the chisq.plot
function can be used to interactively set our quantile threshold. If the data is normally distributed, the user would ideally want to delete outliers until a straight line appears. All observations eliminated by the user are printed on the console. This function takes as input:
x
: matrix or data frame.quan
: amount of observations used for mcd estimationsask
: logical operator specifying whether user interaction is allowed or not.x <- log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")])
par(mfrow = c(1,1))
chisq.plot(x, quan = 0.5, ask = TRUE)
## Remove outliers with left-click, stop with right-click on plotting device
## $outliers
## NULL
The visual above is the plot of Chi2 vs the ordered robust mahalanobis distance and includes all observations. As a user, perhaps you want to eliminate the last 9 observations and determine which observations they correspond to. In order to do this, you would simply need to left-click on the graph 9 times. This would eliminate the furthest 9 observations and save them in a vector. After 9 clicks, we obtain the following plot:
As we click through the iterations, the console displays a corresponding output:
# Excerpt from console output, iterations 7-9
# Observations left out:
# 75 28 67 35 87 613 449
# Remove outliers with left-click, stop with right-click on plotting device
# Observations left out:
# 75 28 67 35 87 613 449 173
# Remove outliers with left-click, stop with right-click on plotting device
# Observations left out:
# 75 28 67 35 87 613 449 173 116
# Remove outliers with left-click, stop with right-click on plotting device
If we want to save the final vector of outliers, we can write the function as follows:
res <- chisq.plot(log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")]), quan = 0.5, ask = TRUE)
res$outliers
# [1] 75 28 67 35 87 613 449 173 116
The mvoutlier
package provides several useful tools for analyzing and understanding our multivariate outliers. Many of these tools are very specific to the test datasets and are based off of the geo-locations the data was collected from, however, they still provide a strong example of how this type of data can be utilized. For example, we can plot a map with outliers identified.
# kola background is a map from which humus samples were taken
kola <- rbind(kola.background$coast,kola.background$boundary,
kola.background$bordes)
# XY coordinates within humus dataset
XY <- humus[,c("XCOO","YCOO")]
# The mvoutlier.CoDa function returns all pertinent information needed for further analysis
data <- mvoutlier.CoDa(humus[ , c("Al","Cu", "Ni", "Co")])
# execute:
plot(data,coord=XY,map=kola,onlyout=TRUE,which="map",bw=FALSE,
symb=TRUE,symbtxt=TRUE)
Now we can see exactly where the outliers fall on the map. We know not only which observations are anomalous, but exactly where the anomalous samples were collected from. While informative, this plot is very specific to the sample data set and requires as a prerequisite the appropriate background map, and associated coordinates for data samples. Another tool we can use that is not specific to this data set and could be implemented in any dataset condusive to multivariate analysis is the Biplot function. This plot allows data from both observations and variables to be viewed, and shows towards which outlier observations the variables trend.
plot(data,onlyout=TRUE,which="biplot",bw=FALSE,
symb=TRUE,symbtxt=TRUE)
Ensure you have an internet connection and use the following link to begin: LAUNCH FUN SPACE ADVENTURE
Peter Filzmoser and Moritz Gschwandtner (2017). mvoutlier: Multivariate Outlier Detection Based on Robust Methods. R package version 2.0.8. https://CRAN.R-project.org/package=mvoutlier