Introduction

Outliers

  • Anomaly detection is the process of identifying patterns in data that do not conform to expected behavior.

  • The definition of an anomaly is that it is unusual (or not normal).

    • This definition is super-helpful (sarcasm)!
  • Formal definition: “Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior.” (Chandola, Banerjee, and Kumar 2009)

Why should you care?

  • Business Opportunity: Australia’s anti-money laundering and counter-terrorism financing reform is coming soon…
    • System Modernization
    • Scope extended to include tranche 2 entities (real estate professionals, lawyers, accountants, trust and company service providers)
  • Enrich your knowledge: Anomaly Detection is one of the main branches of Data Mining, alongside Supervised Learning, Clustering and Association Rules.

Anti-Money Laundering

What are Anomalies?


A simple example of anomalies in a 2-dimensional dataset. \(N_1\) and \(N_2\) are normal regions. \(o_1\) and \(o_2\) are points that are sufficiently far away from these regions. The points in region \(O_3\) are clustered anomalies.

Examples of real-life anomalies:

  • Credit card frauds
  • Insurance frauds
  • Cyber-intrusions
  • Terrorist activities
  • System breakdowns

Challenges of Anomaly Detection

  • Defining a normal region is very difficult, because it needs to encompass every possible normal behavior.
  • The masking and swamping problems: anomalies are harder to detect when they hide among normal points, or when they become frequent enough to look like normal points.
  • The concept drift problem: normal and anomalous behavior keeps evolving.
  • The notion of an anomaly differs between applications.
  • Availability of labelled data for training and validating models is a major issue.
  • Data often naturally contains noise that looks very similar to anomalies, which makes the real anomalies harder to detect. (Chandola, Banerjee, and Kumar 2009)

How to detect Anomalies? What are the Applications?

Major Techniques

  • Classification Based
  • Clustering Based
  • Nearest Neighbor Based
  • Statistical Based
  • Isolation Based

(Chandola, Banerjee, and Kumar 2009)
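As a rough illustration of the nearest-neighbor-based family listed above, the sketch below scores each point by the distance to its k-th nearest neighbor, using base R only. The function name `knn_distance_score`, the choice `k = 5`, and the simulated data are assumptions made for this example; they do not come from the cited survey.

```r
# Score each row of X by the distance to its k-th nearest neighbor.
# Larger scores mean the point sits in a sparser region.
knn_distance_score <- function(X, k = 5) {
  D <- as.matrix(dist(X))            # full pairwise Euclidean distance matrix
  # After sorting, position 1 is the self-distance (0), so take position k + 1.
  apply(D, 1, function(d) sort(d)[k + 1])
}

set.seed(1)
X <- rbind(
  matrix(rnorm(200), ncol = 2),      # 100 points around the origin
  c(10, 10)                          # one hypothetical far-away point
)
scores <- knn_distance_score(X, k = 5)
most_anomalous <- which.max(scores)  # row index of the highest score
print(most_anomalous)
```

The full distance matrix costs O(n²) memory, so this form is only practical for small datasets; it is meant to show the idea, not to replace a dedicated library.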

Applications

  • Cyber-Intrusion Detection
  • Fraud Detection
  • Medical Anomaly Detection
  • Industrial Damage Detection
  • Image Processing

Libraries for Anomaly Detection

Overview of outlier detection methods (Scikit-Learn 2024)

The code can be found here.

  • Robust Covariance (Statistical Based) fits a Gaussian distribution to the dataset; points far from the fitted distribution are anomalies.

  • One-Class SVM (Classification Based) learns a hyperplane from the data as a decision boundary that includes as many data points as possible.

  • Isolation Forest (Isolation Based) builds multiple randomly generated binary trees from samples of the data. Points with short path lengths are anomalies.

  • Local Outlier Factor (Nearest Neighbor Based) calculates pairwise distances and flags points with a high local outlier factor as anomalies.
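The statistical-based idea can be sketched in R roughly as follows: estimate a robust center and covariance, then flag points whose Mahalanobis distance is large. This is not the scikit-learn implementation; it assumes the `MASS` package is available, and the simulated data, the planted point at (8, 8), and the 97.5% chi-squared cut-off are choices made only for illustration.

```r
library(MASS)  # recommended package shipped with standard R installations

set.seed(1)
X <- rbind(
  matrix(rnorm(400), ncol = 2),  # 200 points from a standard bivariate normal
  c(8, 8)                        # one hypothetical planted anomaly
)

# Robust location and scatter via the Minimum Covariance Determinant estimator.
fit <- cov.rob(X, method = "mcd")

# Squared Mahalanobis distance of every point to the robust center.
d2 <- mahalanobis(X, center = fit$center, cov = fit$cov)

# Under a Gaussian model d2 is roughly chi-squared with ncol(X) degrees of
# freedom, so a high quantile gives a simple (assumed) cut-off.
cutoff <- qchisq(0.975, df = ncol(X))
flagged <- d2 > cutoff
print(which(flagged))
```

With this cut-off a few percent of genuinely normal points are expected to be flagged as well, so in practice the threshold is a tuning choice.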

Isolation Forest

Normal points need more partitions to be isolated; anomalies require fewer.

Path length (number of partitions) converges as the number of trees increases. (Liu, Ting, and Zhou 2008)

Definition:

Anomalies are few and different

In an Isolation Tree, anomalies/outliers have short path lengths, while normal points have long path lengths. (Liu, Ting, and Zhou 2008)
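To make the path-length intuition concrete, here is a minimal base-R sketch of random partitioning in one dimension. It is a toy version of a single branch of an isolation tree, not the algorithm as implemented in any library; the helper names `isolation_depth` and `avg_depth`, the simulated data, and the number of trials are all assumptions for this example.

```r
# Count the random splits needed to isolate `target` from the values in `x`.
# Assumes `target` is an element of `x` and that values are distinct.
isolation_depth <- function(x, target) {
  depth <- 0
  while (length(x) > 1 && min(x) < max(x)) {
    split <- runif(1, min(x), max(x))   # random cut inside the current range
    # Keep only the side of the cut that contains the target.
    x <- if (target < split) x[x < split] else x[x >= split]
    depth <- depth + 1
  }
  depth
}

set.seed(42)
x <- c(rnorm(200), 6)              # 200 normal values plus one far-away value
inlier  <- x[which.min(abs(x))]    # the value closest to the center
outlier <- 6

# Average the depth over many random partitionings (one per "tree").
avg_depth <- function(target, n_trials = 200) {
  mean(replicate(n_trials, isolation_depth(x, target)))
}
depth_inlier  <- avg_depth(inlier)
depth_outlier <- avg_depth(outlier)
print(c(inlier = depth_inlier, outlier = depth_outlier))
```

The far-away value is typically separated after only a few cuts, whereas the central value needs many more, which is the "few and different" property in action.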

IsoTree - Fast Isolation Forest in C++

# Simulate 1000 draws from a standard normal distribution and plot them
set.seed(123)
random_numbers <- matrix(rnorm(1000))
par(oma = c(0,0,0,0), mar = c(4,4,3,2))
hist(random_numbers, breaks=50, col="navy",
     main="Randomly-generated numbers\nfrom normal distribution",
     xlab="value")

library(isotree)

# Fit a small isolation forest and get the average isolation depth per point
model <- isolation.forest(random_numbers, ndim=1, ntrees=10, nthreads=1)
scores <- predict(model, random_numbers, type="avg_depth")
# Values in the tails should show a lower average depth than central values
par(mar = c(4,5,3,2))
plot(random_numbers, scores, type="p", col="darkred",
     main="Average isolation depth\nfor normally-distributed numbers",
     xlab="value", ylab="Average isolation depth")
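An average depth can be turned into a standardized anomaly score using the normalization from Liu, Ting, and Zhou (2008), s = 2^(-E[h(x)] / c(n)). The sketch below follows the paper's formula in base R; the function names `c_n` and `anomaly_score` and the sub-sample size of 256 are assumptions for this example, and the scores returned by isotree itself may be computed with different details.

```r
# Average path length of an unsuccessful search in a binary search tree
# built on n points (Liu, Ting, and Zhou 2008); used to normalize depths.
c_n <- function(n) {
  euler_gamma <- 0.5772156649
  2 * (log(n - 1) + euler_gamma) - 2 * (n - 1) / n
}

# Anomaly score s = 2^(-E[h(x)] / c(n)):
#   close to 1     -> very short paths, likely an anomaly
#   well below 0.5 -> long paths, likely a normal point
anomaly_score <- function(avg_depth, n) 2^(-avg_depth / c_n(n))

n <- 256                                      # assumed sub-sample size per tree
example_depths <- c(2, c_n(n), 2 * c_n(n))    # short, average, and long paths
example_scores <- anomaly_score(example_depths, n)
print(round(example_scores, 3))
```

A point whose average depth equals c(n) gets a score of exactly 0.5, which is why 0.5 is the usual reference level when reading these scores.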

An example in 2D

# Randomly-generated data from different distributions
set.seed(1)
cluster1 <- data.frame(
    x = rnorm(1000, -1, .4),
    y = rnorm(1000, -1, .2)
)
cluster2 <- data.frame(
    x = rnorm(1000, +1, .2),
    y = rnorm(1000, +1, .4)
)
outlier <- data.frame(
    x = -1,
    y =  1
)

### Putting them together
X <- rbind(cluster1, cluster2, outlier)

### Function to produce a heatmap of the scores
pts <- seq(-3, 3, .1)
space_d <- expand.grid(x = pts, y = pts)
plot.space <- function(Z, ttl, cex.main = 1.4) {
    image(pts, pts, matrix(Z, nrow = length(pts)),
          col = rev(heat.colors(50)),
          main = ttl, cex.main = cex.main,
          xlim = c(-3, 3), ylim = c(-3, 3),
          xlab = "", ylab = "")
    par(new = TRUE)
    plot(X, type = "p", xlim = c(-3, 3), ylim = c(-3, 3),
         col = "#0000801A",
         axes = FALSE, main = "",
         xlab = "", ylab = "")
}
### Fit the forest on the data and score every cell of the grid
model <- isolation.forest(X, ndim=1, ntrees=100, nthreads=1)
scores <- predict(model, space_d)
par(mar = c(2.5,2.2,2,2.5))
plot.space(scores, "Outlier Scores\n(clustered data with an outlier on top)", 1.0)

Variations of Isolation Forests

The code can be found here.

Real life data - Statlog

Statlog (Landsat Satellite) dataset from the UCI repository. The three smallest classes (2, 4 and 5) are combined to form the outlier class, while all other classes are combined to form the inlier class.

library(mlbench)

data("Satellite")
# Classes 2, 4 and 5 (the three smallest) form the outlier class
is_outlier <- Satellite$classes %in% c("damp grey soil", "cotton crop", "vegetation stubble")
# Drop the label column before modelling
sat_without_class <- Satellite[, names(Satellite) != "classes"]
dim(sat_without_class)
#> [1] 6435   36
summary(is_outlier)
#>    Mode   FALSE    TRUE
#> logical    4399    2036
Model                       Time (s)   AUROC
Isolation Forest                0.01   0.691
Density Isolation Forest        0.02   0.827
Fair-Cut Forest                 0.00   0.897
One-Class SVM                   7.02   0.299
Local Outlier Factor            0.97   0.490

Area Under Receiver Operating Characteristic Curves (AUROC)

Learn more about ROC Curves
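For readers who want to see what the AUROC measures, it can be computed from ranks alone, without drawing the curve. The base-R sketch below uses the rank-sum (Mann-Whitney) identity; the function name `auroc` and the toy scores are assumptions for this example, and it is not necessarily how the figures in the table were produced.

```r
# AUROC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen outlier receives a higher score than a randomly chosen inlier.
auroc <- function(scores, is_outlier) {
  r <- rank(scores)                 # average ranks, so ties count as 0.5
  n_pos <- sum(is_outlier)
  n_neg <- sum(!is_outlier)
  (sum(r[is_outlier]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(FALSE, FALSE, FALSE, TRUE, TRUE)
perfect  <- auroc(c(0.1, 0.2, 0.3, 0.8, 0.9), labels)  # outliers ranked on top
inverted <- auroc(c(0.8, 0.9, 0.7, 0.1, 0.2), labels)  # outliers ranked last
print(c(perfect = perfect, inverted = inverted))
```

An AUROC of 0.5 corresponds to a random ranking. Values below 0.5, such as the 0.299 reported for One-Class SVM in the table, mean the scores rank inliers above outliers more often than not.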

Questions

Chandola, Varun, Arindam Banerjee, and Vipin Kumar. 2009. “Anomaly Detection: A Survey.” ACM Computing Surveys (CSUR) 41 (3): 15. http://scholar.google.de/scholar.bib?q=info:jAfBmk-9uAcJ:scholar.google.com/&output=citation&hl=de&ct=citation&cd=0.
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. 2008. “Isolation Forest.” In ICDM, 413–22. IEEE Computer Society. http://dblp.uni-trier.de/db/conf/icdm/icdm2008.html#LiuTZ08.
Scikit-Learn. 2024. “Novelty and Outlier Detection.” https://scikit-learn.org/stable/modules/outlier_detection.html.