set.seed(123)
random_numbers <- matrix(rnorm(1000))
par(oma = c(0,0,0,0), mar = c(4,4,3,2))
hist(random_numbers, breaks=50, col="navy",
main="Randomly-generated numbers\nfrom normal distribution",
xlab="value")
Anomaly detection is the process of identifying patterns in data that do not conform to expected behavior.
An anomaly is, by definition, an observation that is unusual (i.e. not normal).
Formal definition: “Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior.” (Chandola, Banerjee, and Kumar 2009)
A simple example of anomalies in a 2-dimensional dataset. \(N_1\) and \(N_2\) are normal regions. \(o_1\) and \(o_2\) are points that are sufficiently far away from these regions. The points in \(O_3\) are clustered anomalies.
Robust Covariance (Statistical Based) fits a Gaussian distribution to the dataset.
One-Class SVM (Classification Based) learns a hyperplane from the data as a decision boundary that includes as many data points as possible.
Isolation Forest (Isolation Based) builds multiple randomly generated binary trees from samples of the data. Points with short path lengths are anomalies.
Local Outlier Factor (Nearest Neighbour Based) computes pairwise distances and flags points with a higher local outlier factor as anomalies.
Definition:
Anomalies are few and different
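To make the "few and different" idea concrete, here is a minimal base-R sketch of isolation by random splits on one-dimensional data. The helpers `isolation_depth` and `avg_isolation_depth` are hypothetical names introduced for illustration; they are not part of `isotree`, and the sketch skips the sub-sampling that a real isolation forest performs.

```r
# Hypothetical helper (not from isotree): number of random splits needed
# to isolate `target`, which is assumed to be one of the entries of `values`.
isolation_depth <- function(values, target, depth = 0, max_depth = 50) {
  if (length(values) <= 1 || depth >= max_depth) return(depth)
  lo <- min(values)
  hi <- max(values)
  if (lo == hi) return(depth)          # identical values cannot be split
  cut <- runif(1, lo, hi)              # uniformly random split point
  # Keep only the side of the split that contains the target
  if (target < cut) {
    side <- values[values < cut]
  } else {
    side <- values[values >= cut]
  }
  isolation_depth(side, target, depth + 1, max_depth)
}

# Average the depth over many independent random "trees"
avg_isolation_depth <- function(values, target, n_trees = 200) {
  mean(replicate(n_trees, isolation_depth(values, target)))
}

set.seed(123)
values <- c(rnorm(500), 6)                      # 500 typical points, one far away
inlier <- values[which.min(abs(values))]        # the point closest to 0
depth_outlier <- avg_isolation_depth(values, 6)
depth_inlier  <- avg_isolation_depth(values, inlier)
cat("outlier:", depth_outlier, " inlier:", depth_inlier, "\n")
```

The far-away point should be separated after far fewer splits on average than the point in the middle of the distribution, which is the signal an isolation forest turns into an anomaly score.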
library(isotree)
model <- isolation.forest(random_numbers, ndim=1, ntrees=10, nthreads=1)
scores <- predict(model, random_numbers, type="avg_depth")
par(mar = c(4,5,3,2))
plot(random_numbers, scores, type="p", col="darkred",
main="Average isolation depth\nfor normally-distributed numbers",
xlab="value", ylab="Average isolation depth")
# Randomly-generated data from different distributions
set.seed(1)
cluster1 <- data.frame(
x = rnorm(1000, -1, .4),
y = rnorm(1000, -1, .2)
)
cluster2 <- data.frame(
x = rnorm(1000, +1, .2),
y = rnorm(1000, +1, .4)
)
outlier <- data.frame(
x = -1,
y = 1
)
### Putting them together
X <- rbind(cluster1, cluster2, outlier)
### Function to produce a heatmap of the scores
pts = seq(-3, 3, .1)
space_d <- expand.grid(x = pts, y = pts)
plot.space <- function(Z, ttl, cex.main = 1.4) {
image(pts, pts, matrix(Z, nrow = length(pts)),
col = rev(heat.colors(50)),
main = ttl, cex.main = cex.main,
xlim = c(-3, 3), ylim = c(-3, 3),
xlab = "", ylab = "")
par(new = TRUE)
plot(X, type = "p", xlim = c(-3, 3), ylim = c(-3, 3),
col = "#0000801A",
axes = FALSE, main = "",
xlab = "", ylab = "")
}
Statlog/Landsat Satellite dataset from the UCI repository. The three smallest classes (2, 4, and 5) are combined to form the outlier class, while all the other classes are combined to form the inlier class.
| Model | Time (s) | AUROC |
|---|---|---|
| Isolation Forest | 0.01 | 0.691 |
| Density Isolation Forest | 0.02 | 0.827 |
| Fair-Cut Forest | 0.00 | 0.897 |
| One-Class SVM | 7.02 | 0.299 |
| Local Outlier Factor | 0.97 | 0.490 |
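
For reference, the AUROC column can be computed from anomaly scores and true labels without extra packages. The sketch below uses the rank-sum (Mann-Whitney) identity; `auroc` is a hypothetical helper name, and it assumes that higher scores mean "more anomalous" and that the labels mark outliers as `TRUE` (or 1). It is not necessarily how the figures in the table were produced.

```r
# Hypothetical helper: AUROC via the rank-sum (Mann-Whitney U) identity.
# scores:     numeric vector, higher = more anomalous (assumption)
# is_outlier: logical or 0/1 vector of the same length
auroc <- function(scores, is_outlier) {
  is_outlier <- as.logical(is_outlier)
  n_pos <- sum(is_outlier)             # number of outliers
  n_neg <- sum(!is_outlier)            # number of inliers
  ranks <- rank(scores)                # ties receive average ranks
  (sum(ranks[is_outlier]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data made up for illustration only
perfect  <- auroc(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))
inverted <- auroc(c(0.1, 0.2, 0.8, 0.9), c(1, 1, 0, 0))
tied     <- auroc(c(1, 1, 1, 1), c(0, 1, 0, 1))
```

A value of 0.5 corresponds to a random ranking, and values below 0.5 mean that outliers tend to receive lower scores than inliers.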