Processing math: 100%

Introduction

Anomaly detection, also referred to as outlier detection, is an invaluable technique to be leveraged in data sciences. When evaluating a dataset, an anomaly is any observation or event that does not conform to an expected pattern or to the other items. In modern use case, the detection of an anomaly is often indicative of adverse events such as network intrusions, bank fraud, medical problems, or errors in text. To demonstrate the utility and efficacy of anomaly detection, 2-dimensional anomaly detection will be implemented and compared to detection based on Mahalanobis Distance.

Packages Required

The R package mvoutlier includes a variety of methods used in anomaly detection and will be the primary basis for this tutorial. The package Rfunspaceadventure was built by 2Lt Kallhoff and will be used for an inclass exercise at the end of the tutorial.

library(mvoutlier)
library(Rfunspaceadventure)

Tutorial Data Set

The data set humus we will use is built into the mvoutlier package and is an ideal and salient use case example. The data originates from the Kola Project, which was a geological survey and data collection effort conducted from 1993-1998. Samples from the humus ground layer were taken from Norway, Finland, and Russia and elemental ground makeup was recorded for 617 unique observations. The humus ground layer is formed from decomposing organic litter to include leaves, roots, branches, etc. We begin by examining the structure of the data:

str(humus)
## 'data.frame':    617 obs. of  44 variables:
##  $ ID  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ XCOO: num  547960 770025 498651 795152 437050 ...
##  $ YCOO: num  7693790 7679170 7668150 7569390 7855900 ...
##  $ Ag  : num  0.124 0.13 0.124 0.309 0.04 0.421 0.235 0.043 0.07 0.086 ...
##  $ Al  : num  1420 7190 1970 6360 630 2060 1200 1510 3380 1320 ...
##  $ As  : num  0.743 0.973 1.1 1.24 0.533 1.54 0.836 0.657 1.75 1.04 ...
##  $ B   : num  2.51 3.45 1.62 1.59 4.33 1.93 2.51 4.46 5.08 2.1 ...
##  $ Ba  : num  74 65.2 121 63.5 36 111 109 24.5 42.2 51.4 ...
##  $ Be  : num  0.02 0.08 0.04 0.07 0.01 0.07 0.01 0.05 0.06 0.01 ...
##  $ Bi  : num  0.108 0.084 0.137 0.135 0.076 0.176 0.145 0.075 0.259 0.112 ...
##  $ Ca  : num  3900 1900 3240 2470 3500 3340 3010 3090 1760 2810 ...
##  $ Cd  : num  0.298 0.39 0.295 0.222 0.214 0.327 0.294 0.18 0.295 0.284 ...
##  $ Co  : num  1.28 3.37 0.98 3.91 0.43 4.61 0.75 0.55 1.68 1.12 ...
##  $ Cr  : num  3.04 2.02 1.33 17 0.79 3.78 2.01 1.63 10.9 1.71 ...
##  $ Cu  : num  10.9 12.3 5.59 29.8 6.23 47.3 8.66 5.55 10 9.25 ...
##  $ Fe  : num  2050 3170 1100 6180 790 3370 1150 1800 4820 970 ...
##  $ Hg  : num  0.155 0.263 0.168 0.175 0.197 0.239 0.233 0.217 0.238 0.229 ...
##  $ K   : num  1100 900 1000 900 700 1200 1000 600 800 1000 ...
##  $ La  : num  2.3 6.4 1.7 5.3 1.2 7 1.5 2.9 6.3 1.2 ...
##  $ Mg  : num  820 670 790 1180 1770 800 660 1750 820 570 ...
##  $ Mn  : num  202 22.8 70.3 118 24.4 86.1 1210 16.7 39.2 115 ...
##  $ Mo  : num  0.662 0.286 0.166 0.28 0.248 0.473 0.185 0.372 0.277 0.15 ...
##  $ Na  : num  40 120 30 150 410 60 10 320 110 50 ...
##  $ Ni  : num  9.41 14.8 5.05 59.5 1.98 78 5.7 2.19 11.7 13.2 ...
##  $ P   : num  743 1030 922 717 795 942 1120 790 779 745 ...
##  $ Pb  : num  15.9 13.9 19 16.3 14.3 21.7 15.2 11.6 284 14.1 ...
##  $ Rb  : num  8.14 2.82 4.45 8.13 2.67 5.23 4.94 2.33 4.07 5.35 ...
##  $ S   : num  1300 1950 1750 965 1860 1410 1880 2210 1260 1250 ...
##  $ Sb  : num  0.12 0.161 0.25 0.017 0.169 0.217 0.189 0.126 0.809 0.202 ...
##  $ Sc  : num  0.6 1.2 0.4 1.4 0.2 0.7 0.4 0.6 1.1 0.3 ...
##  $ Si  : num  630 640 580 690 440 530 650 490 630 560 ...
##  $ Sr  : num  22.2 34.4 43.4 29.2 46.4 67.9 20.7 59.1 28.8 23.2 ...
##  $ Th  : num  0.413 0.281 0.246 0.816 0.25 0.432 0.371 0.535 1.13 0.109 ...
##  $ Tl  : num  0.081 0.068 0.077 0.099 0.06 0.084 0.088 0.042 0.084 0.097 ...
##  $ U   : num  0.24 0.12 0.042 0.163 0.127 0.112 0.051 0.265 0.221 0.031 ...
##  $ V   : num  6.79 3.89 2.86 16.1 1.63 7.4 3.55 2.57 13.9 2.76 ...
##  $ Y   : num  0.8 2.4 0.5 1.9 0.8 1.6 0.5 1.8 1.7 0.4 ...
##  $ Zn  : num  59.1 18.1 67.6 43.3 41.4 46 103 21.5 23.2 38.5 ...
##  $ C   : num  39.9 47.5 44.2 19.4 45.8 47.8 45.7 47.3 29.5 46.8 ...
##  $ H   : num  5.5 6.8 6.3 3 6.1 5.8 6.5 6.4 4.4 6.5 ...
##  $ N   : num  1.2 1.5 1.5 0.7 1.5 1.3 1.7 1.5 0.7 1 ...
##  $ LOI : num  82.2 94.7 95.4 52.7 95.3 91.3 92 92.6 64.1 94.3 ...
##  $ pH  : num  3.9 4.1 3.8 4 4.1 3.7 4 4.2 4 3.6 ...
##  $ Cond: num  140 100 123 106 108 ...

Before conducting any analysis of this dataset, it is important to ensure that the data is in an acceptable format. The reason this humus dataset provides a strong example for outlier detection is because most observations appear to be numeric and continuous in nature. Outlier detection techniques will normalize all of the data, so the mismatch in scaling is of no consequence. Close attention must still be called to the variables themselves. It is important to note that the first variable corresponds to an identification number rather than a data point and should not be included in outlier detection analysis. Additionally, the variables XCOO and YCOO correspond to map coordinates from which the soil samples were taken. This package offers unique tools for visualizing outliers on a coordinate map, however, the coordinates will not be included in our analysis.

Multivariate Outlier Detection

The classic Mahalanobis Distance as shown in equation 1, is a concept first introduced in 1936. By measuring the distance between a point and a distribution to which that point belongs, this technique acts as a statistical measure for the classification of a point as an outlier based on a chi-square distribution.

MD=(xˉx)TC1(xˉx) For many of its multivariate functions, this package uses the Robust Mahalanobis distance based the Minimum Co-variance Determination. While classic calculation of the distances is widely accepted, the robust calculation is preferred. The package provides a tool for plotting the robust versus classic distance called dd.plot.

x <- log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")])
distances <- dd.plot(x, quan=1/2, alpha=0.025)

# The classical and Robust distances now exist as vectors

str(distances$md.cla)
##  Named num [1:617] 1.86 2.6 1.78 2.66 3.03 ...
##  - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
str(distances$md.rob)
##  Named num [1:617] 2.28 2.64 1.94 2.99 3.83 ...
##  - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...

Outlier Visualization of 2-dimensional Data

The color.plot function of this package provides outlier classification visualization for a given dataset where input arguments are as follows:

  • x: a 2-dimensional data frame or matrix
  • quan: amount of observations to be used for mcd calculation
  • alpha: amount of observations used to calculate the adjusted quantile

The plot generated by this function represents the euclidean distance as a color, where blue represent small distances, and red represent large distances from the data set minimum. Mahalanobis distances are represented by their symbol. The function also draws four ellipsoids where the Mahalanobis Distances are constant. The constant values correspond to the 25%, 50%, and 75% and adjusted quantiles. Finally, this function can provide us with vectors containing Mahalanobis Distance, Euclidean Distances, and a Boolean vector qualifying outliers.

x <- log(humus[,c("U","Pb")])
colorData <- color.plot(x, quan=1/2, alpha=0.025)

str(colorData$md)
##  Named num [1:617] 1.508 0.908 1.226 0.944 0.896 ...
##  - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
str(colorData$euclidean)
##  Named num [1:617] 5.12 4.25 3.83 4.76 4.34 ...
##  - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...
str(colorData$outliers)
##  Named logi [1:617] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...

What this technique lacks more than anything is scalability. The data that can be represented in this way is only limited to 2-dimensional. In the present, where big-data is becoming the norm, a multivariate solution is preferable.

Outlier Visualization of Multivariate Data

The adjusted quantile plot function of the mvoutlier package will solve for and order the squared Mahalanobis Distances for the given observations and plot them against the empirical Chi-Squared distribution function of these values. The function takes as input:

  • X: Data frame or matrix that must be at least 2-dimensional.
  • delta: Quantile of the Chi-Squared distribution.
  • quan: the portion of observations used for Minimum Co-variance Determinant (MCD) estimation.
  • alpha: Maximum Thresholding Proportion

The analysis will be conducted on several randomly selected features of the humus dataset. We will attempt to identify outliers with respects to the elements “Al”, “Co”, “Cu”, “Ni”, and “Pb”, and “U”. These elements were chosen randomly to simply demonstrate the technique.

The graphics below provide an intuitive visualization of the outliers contained in this dataset. All observations are plotted against the first and second principal components, and every observation outside of the Chi-Square quantile is colored in red. The function displays outliers outside of the adjusted chi-square quantile as well. Finally, we are able to save a Boolean list which details which specific observations are considered outliers. Any observation outside of the chi-Square quantile will be listed as “TRUE”, meaning it is an anomaly

# Generate the data frame with sleected features
x <- log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")])

# Execute
outliers <- aq.plot(x, delta=qchisq(0.975, df = ncol(x)),
                    quan = 1/2, alpha = 0.05)
## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.6449667

# Inspect outliers
str(outliers)
## List of 1
##  $ outliers: Named logi [1:617] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..- attr(*, "names")= chr [1:617] "1" "2" "3" "4" ...

Interactive Plot

It may be desirable for the user to eliminate some of the outlier observations manually. In this case the chisq.plot function can be used to interactively set our quantile threshold. If the data is normally distributed, the user would ideally want to delete outliers until a straight line appears. All observations eliminated by the user are printed on the console. This function takes as input:

  • x: matrix or data frame.
  • quan: amount of observations used for mcd estimations
  • ask: logical operator specifying whether user interaction is allowed or not.
x <- log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")])

par(mfrow = c(1,1))

chisq.plot(x, quan = 0.5, ask = TRUE)

## Remove outliers with left-click, stop with right-click on plotting device
## $outliers
## NULL

The visual above is the plot of Chi2 vs the ordered robust mahalanobis distance and includes all observations. As a user, perhaps you want to eliminate the last 9 observations and determine which observations they correspond to. In order to do this, you would simply need to left-click on the graph 9 times. This would eliminate the furthest 9 observations and save them in a vector. After 9 clicks, we obtain the following plot:

As we click through the iterations, the console displays a corresponding output:

# Excerpt from console output, iterations 7-9

# Observations left out:
# 75 28 67 35 87 613 449
# Remove outliers with left-click, stop with right-click on plotting device
# Observations left out:
# 75 28 67 35 87 613 449 173
# Remove outliers with left-click, stop with right-click on plotting device
# Observations left out:
# 75 28 67 35 87 613 449 173 116
# Remove outliers with left-click, stop with right-click on plotting device

If we want to save the final vector of outliers, we can write the function as follows:

res <- chisq.plot(log(humus[ , c("Al","Co","Cu","Ni", "Pb","U")]), quan = 0.5, ask = TRUE)
res$outliers
# [1]  75  28  67  35  87 613 449 173 116

Interpretation Plots

The mvoutlier package provides several useful tools for analyzing and understanding our multivariate outliers. Many of these tools are very specific to the test datasets and are based off of the geo-locations the data was collected from, however, they still provide a strong example of how this type of data can be utilized. For example, we can plot a map with outliers identified.

# kola background is a map from which humus samples were taken
kola <- rbind(kola.background$coast,kola.background$boundary,
             kola.background$bordes) 

# XY coordinates within humus dataset
XY <- humus[,c("XCOO","YCOO")]

# The mvoutlier.CoDa function returns all pertinent information needed for further analysis
data <- mvoutlier.CoDa(humus[ , c("Al","Cu", "Ni", "Co")])

# execute:

plot(data,coord=XY,map=kola,onlyout=TRUE,which="map",bw=FALSE,
     symb=TRUE,symbtxt=TRUE)

Now we can see exactly where the outliers fall on the map. We know not only which observations are anomalous, but exactly where the anomalous samples were collected from. While informative, this plot is very specific to the sample data set and requires as a prerequisite the appropriate background map, and associated coordinates for data samples. Another tool we can use that is not specific to this data set and could be implemented in any dataset condusive to multivariate analysis is the Biplot function. This plot allows data from both observations and variables to be viewed, and shows towards which outlier observations the variables trend.

 plot(data,onlyout=TRUE,which="biplot",bw=FALSE,
      symb=TRUE,symbtxt=TRUE) 

In Class Exercise

Ensure you have an internet connection and use the following link to begin: LAUNCH FUN SPACE ADVENTURE

Sources

Peter Filzmoser and Moritz Gschwandtner (2017). mvoutlier: Multivariate Outlier Detection Based on Robust Methods. R package version 2.0.8. https://CRAN.R-project.org/package=mvoutlier