Ch. 1 - Statistical outlier detection

What do we mean when we talk about anomalies?

[Video]

Defining the term anomaly

Anomaly: a data point or collection of data points that do not follow the same pattern or have the same structure as the rest of the data

Point anomaly

  • A single data point
  • Unusual when compared to the rest of the data

Example: A single 30C daily high temperature among a set of ordinary spring days

boxplot(temperature, ylab = "Celsius")

Collective anomaly

  • An anomalous collection of data instances
  • Unusual when considered together

Example: 10 consecutive high daily temperatures

Recognizing anomaly types

Which of the following is best described as a collective anomaly?

  • A $10,000 transaction made using a credit card on an online auction site.
  • A footfall of 600 visits to a large computer store on a Saturday afternoon.
  • A customer showing twice the usual spending behavior in mid-December.
  • [*] A high volume of server requests from different users using an outdated web browser.

Exploring the river nitrate data

# Explore contents of dataset
head(river)
##   index nitrate   months
## 1     1   1.581  January
## 2     2   1.323 February
## 3     3   1.140    March
## 4     4   1.245    April
## 5     5   1.072      May
## 6     6   1.483     June
summary(river$nitrate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5920  0.9485  1.0680  1.0649  1.1700  1.8970
boxplot(river$nitrate)

Testing the extremes with Grubbs’ test

[Video]

Visual assessment is not always reliable!

Grubbs’ test

  • Statistical test to decide if a point is outlying
  • Assumes the data are normally distributed
  • Requires checking the normality assumption first

Checking normality with a histogram

hist(temperature, breaks = 6)

Symmetrical & bell shaped? If no, Grubbs’ test should not be used.

Running Grubbs’ test

Use the grubbs.test() function:

grubbs.test(temperature)
    Grubbs test for one outlier

data:  temperature
G = 3.07610, U = 0.41065, p-value = 0.001796
alternative hypothesis: highest value 30 is an outlier

p-value

  • Near 0 - stronger evidence of an outlier
  • Near 1 - weaker evidence of an outlier

Reject H0; strong evidence that highest value of 30 is an outlier.

Get the row index of an outlier

Location of the maximum

which.max(weights)
[1] 5

Location of the minimum

which.min(temperature)
[1] 12

Visual check of normality

head(river)
##   index nitrate   months
## 1     1   1.581  January
## 2     2   1.323 February
## 3     3   1.140    March
## 4     4   1.245    April
## 5     5   1.072      May
## 6     6   1.483     June
# Plot a histogram of the nitrate column
hist(river$nitrate, xlab = "Nitrate concentration", breaks = 40)

Grubbs’ test

# Apply Grubbs' test to the river nitrate data
grubbs.test(river$nitrate)
## 
##  Grubbs test for one outlier
## 
## data:  river$nitrate
## G = 4.72676, U = 0.92269, p-value = 0.000211
## alternative hypothesis: highest value 1.897 is an outlier

Hunting multiple outliers using Grubbs’ test

# Apply Grubbs' test to the nitrate data
grubbs.test(river$nitrate)
## 
##  Grubbs test for one outlier
## 
## data:  river$nitrate
## G = 4.72676, U = 0.92269, p-value = 0.000211
## alternative hypothesis: highest value 1.897 is an outlier
# Use which.max to find row index of the max
which.max(river$nitrate)
## [1] 156
# Runs Grubbs' test excluding row 156
grubbs.test(river$nitrate[-156])
## 
##  Grubbs test for one outlier
## 
## data:  river$nitrate[-156]
## G = 3.42983, U = 0.95915, p-value = 0.07756
## alternative hypothesis: highest value 1.643 is an outlier
# Print the value tested in the second Grubbs' test
max(river$nitrate[-156])
## [1] 1.643

The second Grubbs’ test p-value was 0.07; since this is above 0.05, we wouldn’t treat this point as an outlier.

Anomalies in time series

[Video]

Monthly revenue data

head(msales)

  sales month
1 6.068     1
2 5.966     2
3 6.133     3
4 6.230     4
5 6.407     5
6 6.433     6

Grubbs’ test not appropriate here

  • Seasonality may be present
  • May be multiple anomalies

Visualizing monthly revenue

plot(sales ~ month, data = msales, type = 'o')

Seasonal-Hybrid ESD algorithm usage

library(AnomalyDetection)

sales_ad <- AnomalyDetectionVec(x = msales$sales, period = 12, 
                                direction = 'both')

Arguments

  • x: vector of values
  • period: period of repeating pattern
  • direction: find anomalies that are small (‘neg’), large (‘pos’), or both (‘both’)

Seasonal-Hybrid ESD algorithm output

sales_ad <- AnomalyDetectionVec(x = msales$sales, period = 12, 
                                direction = 'both')
sales_ad$anoms

  index anoms
1    14 1.561
2   108 2.156

Seasonal-Hybrid ESD algorithm plot

AnomalyDetectionVec(x = msales$sales, period = 12, 
                    direction = 'both', plot = T)

Visual assessment of seasonality

# View contents of dataset
head(river)
##   index nitrate   months
## 1     1   1.581  January
## 2     2   1.323 February
## 3     3   1.140    March
## 4     4   1.245    April
## 5     5   1.072      May
## 6     6   1.483     June
# Show the time series of nitrate concentrations with time
plot(nitrate ~ index, data = river, type = "o")

# Calculate the mean nitrate by month
monthly_mean <- tapply(river$nitrate, river$months, FUN = mean)
monthly_mean
##     April    August  December  February   January      July      June     March 
## 1.0166250 0.9380833 1.2264167 1.1838400 1.2163600 0.9810417 0.9792083 1.1050400 
##       May  November   October September 
## 0.9978333 1.0962500 1.0360000 0.9885833
# Plot the monthly means 
plot(monthly_mean, type = "o", xlab = "Month", ylab = "Monthly mean")

# Create a boxplot of nitrate against months
boxplot(nitrate ~ months, data = river)

Seasonal Hybrid ESD algorithm

# Run Seasonal-Hybrid ESD for nitrate concentrations
AnomalyDetectionVec(x = river$nitrate, period = 12, direction = 'both', plot = T)
## $anoms
##   index anoms
## 1     6 1.483
## 2    53 1.533
## 3   156 1.897
## 
## $plot

Interpreting Seasonal-Hybrid ESD output

# Use Seasonal-Hybrid ESD for nitrate concentrations
river_anomalies <- AnomalyDetectionVec(x = river$nitrate, period = 12, direction = 'both', plot = T)

# Print the anomalies
river_anomalies$anoms
##   index anoms
## 1     6 1.483
## 2    53 1.533
## 3   156 1.897
# Print the plot
print(river_anomalies$plot)

Seasonal-Hybrid ESD versus Grubbs’ test

Recall when using Grubbs’ test on the river nitrate data, that only row 156 was found to be anomalous, while Seasonal-Hybrid ESD identified 2 further high-valued anomalies. Which of the following provides the best explanation for the difference between the two approaches?

  • There are more extreme values associated with the seasonal cycle.
  • Grubbs’ test assumes that the mean is constant, and has been adversely influenced by a strong underlying trend.
  • [*] The extra anomalies are unusual with respect to the seasonal contribution at their time of occurrence.

Ch. 2 - Distance and density based anomaly detection

k-nearest neighbors distance score

[Video]

Exploring wine

# View the contents of the wine data
head(wine)
##         pH   alcohol
## 1 2.999069  8.800096
## 2 3.300928  9.499615
## 3 3.258771 10.099157
## 4 3.189202  9.898710
## 5 3.180881  9.598667
## 6 3.219069 11.001029
# Scatterplot of wine pH against alcohol
plot(pH ~ alcohol, data = wine)

kNN distance matrix

# Calculate the 5 nearest neighbors distance
wine_nn <- get.knn(wine, k = 5)

# View the distance matrix
head(wine_nn$nn.dist)
##             [,1]        [,2]       [,3]       [,4]       [,5]
## [1,] 0.010067258 0.012404926 0.02233037 0.02743524 0.03287520
## [2,] 0.008050804 0.020067068 0.02058207 0.02776543 0.03029810
## [3,] 0.008990137 0.009986587 0.01736312 0.02212520 0.02986814
## [4,] 0.007605694 0.009860220 0.02110417 0.02198950 0.02822642
## [5,] 0.007579848 0.008917272 0.02081650 0.02083640 0.02927584
## [6,] 0.009466140 0.013015131 0.02019557 0.02098385 0.02471838
# Distance from wine 5 to nearest neighbor
wine_nn$nn.dist[5, 1]
## [1] 0.007579848
# Row index of wine 5's nearest neighbor 
wine_nn$nn.ind[5, 1]
## [1] 1751
# Return data for wine and its nearest neighbor
wine[c(5, 1751), ]
##            pH  alcohol
## 5    3.180881 9.598667
## 1751 3.188425 9.599396

kNN distance score

# Calculate the 5 nearest neighbors distance
wine_nn <- get.knn(wine, k = 5)

# Create score by averaging distances
wine_nnd <- rowMeans(wine_nn$nn.dist)

# Print row index of the most anomalous point
which.max(wine_nnd)
## [1] 1520

Visualizing kNN distance

[Video]

Standardizing features

# Without standardization, features have different scales
summary(wine)
##        pH           alcohol      
##  Min.   :2.719   Min.   : 7.999  
##  1st Qu.:3.079   1st Qu.: 9.700  
##  Median :3.198   Median :10.700  
##  Mean   :3.207   Mean   :10.797  
##  3rd Qu.:3.329   3rd Qu.:11.801  
##  Max.   :3.821   Max.   :14.201
# Standardize the wine columns
wine_scaled <- scale(wine)

# Standardized features have similar means and quartiles
summary(wine_scaled)
##        pH              alcohol        
##  Min.   :-2.74170   Min.   :-2.16657  
##  1st Qu.:-0.71846   1st Qu.:-0.84910  
##  Median :-0.05055   Median :-0.07542  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.68135   3rd Qu.: 0.77698  
##  Max.   : 3.44366   Max.   : 2.63530

Appending the kNN score

# Print the 5-nearest neighbor distance score
wine_nnd[1:5]
## [1] 0.02102260 0.02135269 0.01766664 0.01775720 0.01748517
# Append the score as a new column 
wine$score <- wine_nnd

Visualizing kNN distance score

# Scatterplot showing pH, alcohol and kNN score
plot(pH ~ alcohol, data = wine, cex = sqrt(score), pch = 20)

Local outlier factor

[Video]

LOF calculation

# Calculate the LOF for wine data
wine_lof <- lof(scale(wine), k = 5)

# Append the LOF score as a new column
wine$score <- wine_lof

LOF visualization

# Scatterplot showing pH, alcohol and LOF score
plot(pH ~ alcohol, data = wine, cex = score, pch = 20)

LOF vs kNN

# Scaled wine data
wine_scaled <- scale(wine)

# Calculate and append kNN distance as a new column
wine_nn <- get.knn(wine_scaled, k = 10)
wine$score_knn <- rowMeans(wine_nn$nn.dist)     

# Calculate and append LOF as a new column
wine$score_lof <- lof(wine_scaled, k = 10)

# Find the row location of highest kNN
which.max(wine$score_knn)
## [1] 1774
# Find the row location of highest LOF
which.max(wine$score_lof)
## [1] 1305

Ch. 3 - Isolation forest

Isolation trees

[Video]

Fit and predict with an isolation tree

# Build an isolation tree 
wine_tree <- iForest(wine, nt = 1)

# Create isolation score
wine$tree_score <- predict(wine_tree, newdata = wine)

# Histogram plot of the scores
hist(wine$tree_score, breaks = 40)

Score interpretation

The wine data with the anomaly score you created in the previous exercise is preloaded.

Which of the following statements is true of the wines based on the anomaly score in the column tree_score?

  • The narrow range of scores suggests there are no potential anomalies.
  • Wine 2 score is unlikely to be anomalous because its score is close to 0.5.
  • Wine 187 has a score near to 1 and is more likely to indicate an anomaly.
  • [*] Wine 1061 has a score greater than 0.6 which indicates a possible anomaly.

Isolation forest

[Video]

Fit an isolation forest

# Fit isolation forest
wine_forest <- iForest(wine, nt = 100)
# Fit isolation forest
wine_forest <- iForest(wine, nt = 100, phi = 200)
# Fit isolation forest
wine_forest <- iForest(wine, nt = 100, phi = 200)

# Create isolation score from forest
wine_score <- predict(wine_forest, newdata = wine)

# Append score to the wine data
wine$score <- wine_score

Checking convergence

# View the contents of the wine scores
head(wine_scores)
##    trees_10 trees_100 trees_200 trees_500 trees_1000 trees_2000
## 1 0.5182650 0.5443061 0.5386447 0.5315498  0.5274856  0.5278313
## 2 0.4499686 0.4445050 0.4485685 0.4495496  0.4489324  0.4529206
## 3 0.4340265 0.4358717 0.4343007 0.4331406  0.4331947  0.4356845
## 4 0.4050908 0.4319684 0.4303714 0.4311672  0.4339464  0.4359655
## 5 0.4347151 0.4381716 0.4351090 0.4335480  0.4363655  0.4398163
## 6 0.4155559 0.4368569 0.4261294 0.4233282  0.4264659  0.4302045
# Score scatterplot 2000 vs 1000 trees 
plot(trees_2000 ~ trees_1000, data = wine_scores)

# Add reference line of equality
abline(a = 0, b = 1)

Visualizing the isolation score

[Video]

A grid of points

# Sequence of values for pH and alcohol
ph_seq <- seq(min(wine$pH),  max(wine$pH), length.out = 25)
alcohol_seq <- seq(min(wine$alcohol),  max(wine$alcohol), length.out = 25)

# Create a data frame of grid coordinates
wine_grid <- expand.grid(pH = ph_seq, alcohol = alcohol_seq)

# Visualise the grid using a scatterplot
plot(pH ~ alcohol, data = wine_grid, pch = 20)

Prediction over a grid

# Calculate isolation score at grid locations
wine_grid$score <- predict(wine_forest, wine_grid)

Anomaly contours

# Contour plot of isolation scores
contourplot(score ~ pH + alcohol, data = wine_grid, region = TRUE)


Ch. 4 - Comparing performance

Labeled anomalies

[Video]

Thyroid data

# View contents of thryoid data
head(thyroid)
##   label        TSH        T3       TT4       T4U        FTI       TBG
## 1     0 -0.2559334 -6.783703 -1.983614 -1.288439 -1.2181574 -1.443646
## 2     0 -1.3971053 -7.659171 -1.273372 -1.110363 -0.6250937 -1.750020
## 3     0 -0.7039581 -5.631023 -1.500762 -1.453953 -0.6427933 -2.082726
## 4     0 -0.3894648 -6.378238 -1.854402 -1.741635 -1.0986123 -1.994618
## 5     0 -1.4415570 -7.659171 -1.419084 -1.139142 -1.0986123 -1.396179
## 6     0 -0.3130918 -7.659171 -1.916923 -1.628306 -1.4294665 -1.617668
# Tabulate the labels
table(thyroid$label)
## 
##   0   1 
## 978  22
# Percentage of patients with disease
prop_disease <- 22 / 1000

Visualizing thyroid disease

# Scatterplot showing TSH and T3
plot(TSH ~ T3, data = thyroid, pch = 20)

# Scatterplot showing TSH, T3 and anomaly labels
plot(TSH ~ T3, data = thyroid, pch = 20, col = label + 1)

# Scatterplot showing TT4, TBG and anomaly labels
plot(TT4 ~ TBG, data = thyroid, pch = 20, col = label + 1)

Anomaly score

# Fit isolation forest
thyroid_forest <- iForest(thyroid[, -1], nt = 200, phi = 100)

# Anomaly score 
thyroid$iso_score <- predict(thyroid_forest, thyroid[, -1])

# Boxplot of the anomaly score against labels
boxplot(iso_score ~ label, data = thyroid, col = "olivedrab4")

Measuring performance

[Video]

Binarized scores

# Scale the measurement columns of thyroid
scaled_thyroid_measurements <- scale(thyroid[, -1])

# Create a LOF score for the measurements
lof_score <- lof(scaled_thyroid_measurements, k = 10)
                 
# Calculate high threshold for lof_score
high_lof <- quantile(lof_score, probs = 0.98)  

# Append binary LOF score to thyroid data
thyroid$binary_lof <- as.numeric(lof_score >= high_lof)
                 
# Calculate high threshold for iso_score
high_iso <- quantile(iso_score, probs = 0.98)  

# Append binary isolation score to thyroid data
thyroid$binary_iso <- as.numeric(iso_score >= high_iso)

Cross-tabulate binary scores

# Tabulate agreement of label and binary isolation score 
table(thyroid$label, thyroid$binary_iso)
##    
##       0   1
##   0 970   8
##   1  10  12
# Tabulate agreement of label and binary LOF score 
table(thyroid$label, thyroid$binary_lof)
##    
##       0   1
##   0 958  20
##   1  22   0
# Proportion of binary_iso and label that agree
iso_prop <- (970 + 12) / 1000

# Proportion of binary_lof and label that agree
lof_prop <- (958 + 0) / 1000

Thyroid precision and recall

# Tabulation for the binary isolation score
table(thyroid$label, thyroid$binary_iso)
##    
##       0   1
##   0 970   8
##   1  10  12
# Precision for the isolation score
precision_iso <- 12 / (8 + 12)
# Recall for the isolation score
recall_iso <- 12 / (12 + 10)

# Tabulation for the binary lof score
table(thyroid$label, thyroid$binary_lof)
##    
##       0   1
##   0 958  20
##   1  22   0
# Precision for the binary lof score
precision_lof <- 0 / (20 + 0)
# Recall for the binary lof score
recall_lof <- 0 / (22 + 0)

Which of the following statements about precision and recall is not true?

  • [*] The isolation forest has a high precision which means it can generally find the majority of the examples labelled as anomalous.
  • The isolation forest has higher recall and precision than LOF and therefore should be preferred.
  • An anomaly detector that labels everything as anomalous has a recall of 1.
  • The LOF score has correctly identified 0 anomalies and which means both precision and recall are 0.

Working with categorical features

[Video]

Converting character to factor

# Print the column classes in thyroid
sapply(X = thyroid, FUN = class)
##       label         TSH          T3         TT4         T4U         FTI 
##   "integer"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
##         TBG         age         sex 
##   "numeric" "character" "character"
# Convert column with character class to factor
thyroid$age <- as.factor(thyroid$age)
thyroid$sex <- as.factor(thyroid$sex)

# Check that all columns are factor or numeric
sapply(X = thyroid, FUN = class)
##     label       TSH        T3       TT4       T4U       FTI       TBG       age 
## "integer" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"  "factor" 
##       sex 
##  "factor"

Isolation forest with factors

# Check the class of age column
class(thyroid$age)
## [1] "factor"
# Check the class of sex column
class(thyroid$sex)
## [1] "factor"
# Fit an isolation forest with 100 trees
thyroid_for <- iForest(thyroid[, -1], nt = 100)

LOF with factors

# Calculate Gower's distance matrix
thyroid_dist <- daisy(thyroid[, -1], metric = "gower")

# Generate LOF scores for thyroid data
thyroid_lof <- lof(thyroid_dist, k = 10)

# Range of values in the distance matrix
range(as.matrix(thyroid_dist))
## [1] 0.0000000 0.7025413

Wrap-up

[Video]


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a CliftonStrengths “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470