[Video]
Defining the term anomaly
Anomaly: a data point or collection of data points that do not follow the same pattern or have the same structure as the rest of the data
Point anomaly
Example: A single 30°C daily high temperature among a set of ordinary spring days
boxplot(temperature, ylab = "Celsius")
Collective anomaly
Example: 10 consecutive high daily temperatures
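As a rough sketch (using the same temperature vector as on the point-anomaly slide), plotting the values in time order makes a run of consecutive high readings visible, which a boxplot would hide:
# Plot temperatures in time order; a collective anomaly appears as a run of high points
plot(temperature, type = "o", xlab = "Day", ylab = "Celsius")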
Which of the following is best described as a collective anomaly?
# Explore contents of dataset
head(river)
## index nitrate months
## 1 1 1.581 January
## 2 2 1.323 February
## 3 3 1.140 March
## 4 4 1.245 April
## 5 5 1.072 May
## 6 6 1.483 June
summary(river$nitrate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5920 0.9485 1.0680 1.0649 1.1700 1.8970
boxplot(river$nitrate)
[Video]
Visual assessment is not always reliable!
Grubbs’ test
Checking normality with a histogram
hist(temperature, breaks = 6)
Is the histogram symmetrical and bell-shaped? If not, Grubbs' test should not be used.
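A normal Q-Q plot is one additional way to check this assumption visually; as a small sketch with the same temperature data, points lying close to the reference line support using Grubbs' test.
# Q-Q plot of the temperatures; points near the line suggest approximate normality
qqnorm(temperature)
qqline(temperature)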
Running Grubbs’ test
Use the grubbs.test() function from the outliers package:
grubbs.test(temperature)
Grubbs test for one outlier
data: temperature
G = 3.07610, U = 0.41065, p-value = 0.001796
alternative hypothesis: highest value 30 is an outlier
p-value = 0.0018 (below 0.05)
Reject H0; strong evidence that the highest value of 30 is an outlier.
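The same decision rule can be applied programmatically: grubbs.test() returns an htest object, so a minimal sketch is to compare its p.value element to the 5% significance level (the temp_test name here is just for illustration).
# Store the test result and compare its p-value to 0.05
temp_test <- grubbs.test(temperature)
temp_test$p.value < 0.05  # TRUE means reject H0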
Get the row index of an outlier
Location of the maximum
which.max(weights)
[1] 5
Location of the minimum
which.min(temperature)
[1] 12
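These indices are typically used to drop a flagged observation before re-running Grubbs' test, as in the river exercise below; a minimal sketch with the temperature data:
# Re-run Grubbs' test with the flagged maximum removed
grubbs.test(temperature[-which.max(temperature)])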
head(river)
## index nitrate months
## 1 1 1.581 January
## 2 2 1.323 February
## 3 3 1.140 March
## 4 4 1.245 April
## 5 5 1.072 May
## 6 6 1.483 June
# Plot a histogram of the nitrate column
hist(river$nitrate, xlab = "Nitrate concentration", breaks = 40)
# Apply Grubbs' test to the river nitrate data
grubbs.test(river$nitrate)
##
## Grubbs test for one outlier
##
## data: river$nitrate
## G = 4.72676, U = 0.92269, p-value = 0.000211
## alternative hypothesis: highest value 1.897 is an outlier
# Use which.max to find row index of the max
which.max(river$nitrate)
## [1] 156
# Run Grubbs' test excluding row 156
grubbs.test(river$nitrate[-156])
##
## Grubbs test for one outlier
##
## data: river$nitrate[-156]
## G = 3.42983, U = 0.95915, p-value = 0.07756
## alternative hypothesis: highest value 1.643 is an outlier
# Print the value tested in the second Grubbs' test
max(river$nitrate[-156])
## [1] 1.643
The second Grubbs' test p-value was 0.078; since this is above 0.05, we wouldn't treat the next-highest value (1.643) as an outlier.
[Video]
Monthly revenue data
head(msales)
sales month
1 6.068 1
2 5.966 2
3 6.133 3
4 6.230 4
5 6.407 5
6 6.433 6
Grubbs’ test not appropriate here
Visualizing monthly revenue
plot(sales ~ month, data = msales, type = 'o')
Seasonal-Hybrid ESD algorithm usage
library(AnomalyDetection)
sales_ad <- AnomalyDetectionVec(x = msales$sales, period = 12,
direction = 'both')
Arguments: x is the numeric vector of observations, period is the number of observations in one seasonal cycle (12 for monthly data), and direction controls whether high anomalies ('pos'), low anomalies ('neg'), or both are detected.
Seasonal-Hybrid ESD algorithm output
sales_ad <- AnomalyDetectionVec(x = msales$sales, period = 12,
direction = 'both')
sales_ad$anoms
index anoms
1 14 1.561
2 108 2.156
Seasonal-Hybrid ESD algorithm plot
AnomalyDetectionVec(x = msales$sales, period = 12,
direction = 'both', plot = T)
# View contents of dataset
head(river)
## index nitrate months
## 1 1 1.581 January
## 2 2 1.323 February
## 3 3 1.140 March
## 4 4 1.245 April
## 5 5 1.072 May
## 6 6 1.483 June
# Plot the nitrate concentrations as a time series
plot(nitrate ~ index, data = river, type = "o")
# Calculate the mean nitrate by month
monthly_mean <- tapply(river$nitrate, river$months, FUN = mean)
monthly_mean
## April August December February January July June March
## 1.0166250 0.9380833 1.2264167 1.1838400 1.2163600 0.9810417 0.9792083 1.1050400
## May November October September
## 0.9978333 1.0962500 1.0360000 0.9885833
# Plot the monthly means
plot(monthly_mean, type = "o", xlab = "Month", ylab = "Monthly mean")
# Create a boxplot of nitrate against months
boxplot(nitrate ~ months, data = river)
# Run Seasonal-Hybrid ESD for nitrate concentrations
AnomalyDetectionVec(x = river$nitrate, period = 12, direction = 'both', plot = T)
## $anoms
## index anoms
## 1 6 1.483
## 2 53 1.533
## 3 156 1.897
##
## $plot
# Use Seasonal-Hybrid ESD for nitrate concentrations
river_anomalies <- AnomalyDetectionVec(x = river$nitrate, period = 12, direction = 'both', plot = T)
# Print the anomalies
river_anomalies$anoms
## index anoms
## 1 6 1.483
## 2 53 1.533
## 3 156 1.897
# Print the plot
print(river_anomalies$plot)
Recall that when Grubbs' test was applied to the river nitrate data, only row 156 was found to be anomalous, while Seasonal-Hybrid ESD identified two further high-valued anomalies. Which of the following provides the best explanation for the difference between the two approaches?
[Video]
# View the contents of the wine data
head(wine)
## pH alcohol
## 1 2.999069 8.800096
## 2 3.300928 9.499615
## 3 3.258771 10.099157
## 4 3.189202 9.898710
## 5 3.180881 9.598667
## 6 3.219069 11.001029
# Scatterplot of wine pH against alcohol
plot(pH ~ alcohol, data = wine)
# Calculate distances to the 5 nearest neighbors (get.knn() is from the FNN package)
wine_nn <- get.knn(wine, k = 5)
# View the distance matrix
head(wine_nn$nn.dist)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.010067258 0.012404926 0.02233037 0.02743524 0.03287520
## [2,] 0.008050804 0.020067068 0.02058207 0.02776543 0.03029810
## [3,] 0.008990137 0.009986587 0.01736312 0.02212520 0.02986814
## [4,] 0.007605694 0.009860220 0.02110417 0.02198950 0.02822642
## [5,] 0.007579848 0.008917272 0.02081650 0.02083640 0.02927584
## [6,] 0.009466140 0.013015131 0.02019557 0.02098385 0.02471838
# Distance from wine 5 to nearest neighbor
wine_nn$nn.dist[5, 1]
## [1] 0.007579848
# Row index of wine 5's nearest neighbor
wine_nn$nn.ind[5, 1]
## [1] 1751
# Return data for wine 5 and its nearest neighbor
wine[c(5, 1751), ]
## pH alcohol
## 5 3.180881 9.598667
## 1751 3.188425 9.599396
# Calculate the 5 nearest neighbors distance
wine_nn <- get.knn(wine, k = 5)
# Create score by averaging distances
wine_nnd <- rowMeans(wine_nn$nn.dist)
# Print row index of the most anomalous point
which.max(wine_nnd)
## [1] 1520
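To see what makes that wine stand out, its row and score can be inspected directly; a small sketch using the objects created above:
# Inspect the most anomalous wine and its average kNN distance
wine[which.max(wine_nnd), ]
wine_nnd[which.max(wine_nnd)]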
[Video]
# Without standardization, features have different scales
summary(wine)
## pH alcohol
## Min. :2.719 Min. : 7.999
## 1st Qu.:3.079 1st Qu.: 9.700
## Median :3.198 Median :10.700
## Mean :3.207 Mean :10.797
## 3rd Qu.:3.329 3rd Qu.:11.801
## Max. :3.821 Max. :14.201
# Standardize the wine columns
wine_scaled <- scale(wine)
# Standardized features have similar means and quartiles
summary(wine_scaled)
## pH alcohol
## Min. :-2.74170 Min. :-2.16657
## 1st Qu.:-0.71846 1st Qu.:-0.84910
## Median :-0.05055 Median :-0.07542
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.68135 3rd Qu.: 0.77698
## Max. : 3.44366 Max. : 2.63530
# Print the first five values of the 5-nearest-neighbor distance score
wine_nnd[1:5]
## [1] 0.02102260 0.02135269 0.01766664 0.01775720 0.01748517
# Append the score as a new column
wine$score <- wine_nnd
# Scatterplot showing pH, alcohol and kNN score
plot(pH ~ alcohol, data = wine, cex = sqrt(score), pch = 20)
[Video]
# Calculate the LOF for wine data
wine_lof <- lof(scale(wine), k = 5)
# Append the LOF score as a new column
wine$score <- wine_lof
# Scatterplot showing pH, alcohol and LOF score
plot(pH ~ alcohol, data = wine, cex = score, pch = 20)
# Scaled wine data
wine_scaled <- scale(wine)
# Calculate and append kNN distance as a new column
wine_nn <- get.knn(wine_scaled, k = 10)
wine$score_knn <- rowMeans(wine_nn$nn.dist)
# Calculate and append LOF as a new column
wine$score_lof <- lof(wine_scaled, k = 10)
# Find the row location of highest kNN
which.max(wine$score_knn)
## [1] 1774
# Find the row location of highest LOF
which.max(wine$score_lof)
## [1] 1305
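Because the two methods flag different rows (1774 versus 1305), a quick comparison is to plot one score against the other using the columns just appended; a minimal sketch:
# Compare the kNN distance score with the LOF score
plot(score_lof ~ score_knn, data = wine)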
[Video]
# Build an isolation tree (iForest() is from the isofor package)
wine_tree <- iForest(wine, nt = 1)
# Create isolation score
wine$tree_score <- predict(wine_tree, newdata = wine)
# Histogram plot of the scores
hist(wine$tree_score, breaks = 40)
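As a small follow-up sketch using the tree_score column just created, the wines with the highest isolation scores can be listed directly:
# Show the 10 wines with the highest isolation scores
head(wine[order(wine$tree_score, decreasing = TRUE), ], 10)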
The wine data with the anomaly score you created in the previous exercise is preloaded.
Which of the following statements is true of the wines based on the anomaly score in the column tree_score?
[Video]
# Fit isolation forest
wine_forest <- iForest(wine, nt = 100)
# Fit isolation forest
wine_forest <- iForest(wine, nt = 100, phi = 200)
# Create isolation score from forest
wine_score <- predict(wine_forest, newdata = wine)
# Append score to the wine data
wine$score <- wine_score
# View the contents of the wine scores
head(wine_scores)
## trees_10 trees_100 trees_200 trees_500 trees_1000 trees_2000
## 1 0.5182650 0.5443061 0.5386447 0.5315498 0.5274856 0.5278313
## 2 0.4499686 0.4445050 0.4485685 0.4495496 0.4489324 0.4529206
## 3 0.4340265 0.4358717 0.4343007 0.4331406 0.4331947 0.4356845
## 4 0.4050908 0.4319684 0.4303714 0.4311672 0.4339464 0.4359655
## 5 0.4347151 0.4381716 0.4351090 0.4335480 0.4363655 0.4398163
## 6 0.4155559 0.4368569 0.4261294 0.4233282 0.4264659 0.4302045
# Score scatterplot 2000 vs 1000 trees
plot(trees_2000 ~ trees_1000, data = wine_scores)
# Add reference line of equality
abline(a = 0, b = 1)
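If the scores have stabilised, the points should sit close to the line of equality; a small complementary check is the correlation between the two columns, which should be close to 1:
# Correlation between scores from 1000 and 2000 trees
cor(wine_scores$trees_1000, wine_scores$trees_2000)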
[Video]
# Sequence of values for pH and alcohol
ph_seq <- seq(min(wine$pH), max(wine$pH), length.out = 25)
alcohol_seq <- seq(min(wine$alcohol), max(wine$alcohol), length.out = 25)
# Create a data frame of grid coordinates
wine_grid <- expand.grid(pH = ph_seq, alcohol = alcohol_seq)
# Visualise the grid using a scatterplot
plot(pH ~ alcohol, data = wine_grid, pch = 20)
# Calculate isolation score at grid locations
wine_grid$score <- predict(wine_forest, wine_grid)
# Contour plot of isolation scores (contourplot() is from the lattice package)
contourplot(score ~ pH + alcohol, data = wine_grid, region = TRUE)
[Video]
# View contents of thyroid data
head(thyroid)
## label TSH T3 TT4 T4U FTI TBG
## 1 0 -0.2559334 -6.783703 -1.983614 -1.288439 -1.2181574 -1.443646
## 2 0 -1.3971053 -7.659171 -1.273372 -1.110363 -0.6250937 -1.750020
## 3 0 -0.7039581 -5.631023 -1.500762 -1.453953 -0.6427933 -2.082726
## 4 0 -0.3894648 -6.378238 -1.854402 -1.741635 -1.0986123 -1.994618
## 5 0 -1.4415570 -7.659171 -1.419084 -1.139142 -1.0986123 -1.396179
## 6 0 -0.3130918 -7.659171 -1.916923 -1.628306 -1.4294665 -1.617668
# Tabulate the labels
table(thyroid$label)
##
## 0 1
## 978 22
# Proportion of patients with disease
prop_disease <- 22 / 1000
# Scatterplot showing TSH and T3
plot(TSH ~ T3, data = thyroid, pch = 20)
# Scatterplot showing TSH, T3 and anomaly labels
plot(TSH ~ T3, data = thyroid, pch = 20, col = label + 1)
# Scatterplot showing TT4, TBG and anomaly labels
plot(TT4 ~ TBG, data = thyroid, pch = 20, col = label + 1)
# Fit isolation forest
thyroid_forest <- iForest(thyroid[, -1], nt = 200, phi = 100)
# Anomaly score
thyroid$iso_score <- predict(thyroid_forest, thyroid[, -1])
# Boxplot of the anomaly score against labels
boxplot(iso_score ~ label, data = thyroid, col = "olivedrab4")
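A numerical version of the same comparison is to summarise the score within each label group; a minimal sketch using tapply() as before:
# Median isolation score for healthy (0) and diseased (1) patients
tapply(thyroid$iso_score, thyroid$label, FUN = median)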
[Video]
# Scale the measurement columns of thyroid
scaled_thyroid_measurements <- scale(thyroid[, -1])
# Create a LOF score for the measurements
lof_score <- lof(scaled_thyroid_measurements, k = 10)
# Calculate high threshold for lof_score
high_lof <- quantile(lof_score, probs = 0.98)
# Append binary LOF score to thyroid data
thyroid$binary_lof <- as.numeric(lof_score >= high_lof)
# Calculate high threshold for iso_score
high_iso <- quantile(iso_score, probs = 0.98)
# Append binary isolation score to thyroid data
thyroid$binary_iso <- as.numeric(iso_score >= high_iso)
# Tabulate agreement of label and binary isolation score
table(thyroid$label, thyroid$binary_iso)
##
## 0 1
## 0 970 8
## 1 10 12
# Tabulate agreement of label and binary LOF score
table(thyroid$label, thyroid$binary_lof)
##
## 0 1
## 0 958 20
## 1 22 0
# Proportion of binary_iso and label that agree
iso_prop <- (970 + 12) / 1000
# Proportion of binary_lof and label that agree
lof_prop <- (958 + 0) / 1000
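The same agreement proportions can be computed without reading counts off the tables; a minimal sketch using the columns created above:
# Proportion of rows where each binary score agrees with the label
mean(thyroid$label == thyroid$binary_iso)
mean(thyroid$label == thyroid$binary_lof)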
# Tabulation for the binary isolation score
table(thyroid$label, thyroid$binary_iso)
##
## 0 1
## 0 970 8
## 1 10 12
# Precision for the isolation score
precision_iso <- 12 / (8 + 12)
# Recall for the isolation score
recall_iso <- 12 / (12 + 10)
# Tabulation for the binary lof score
table(thyroid$label, thyroid$binary_lof)
##
## 0 1
## 0 958 20
## 1 22 0
# Precision for the binary lof score
precision_lof <- 0 / (20 + 0)
# Recall for the binary lof score
recall_lof <- 0 / (22 + 0)
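Precision and recall can also be read from the confusion table programmatically rather than typing the counts by hand; a minimal sketch for the isolation score, where rows are true labels, columns are the binary score, and conf_iso is an illustrative name:
# Precision and recall computed directly from the confusion table
conf_iso <- table(thyroid$label, thyroid$binary_iso)
conf_iso["1", "1"] / sum(conf_iso[, "1"])  # precision
conf_iso["1", "1"] / sum(conf_iso["1", ])  # recall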
Which of the following statements about precision and recall is not true?
[Video]
# Print the column classes in thyroid
sapply(X = thyroid, FUN = class)
## label TSH T3 TT4 T4U FTI
## "integer" "numeric" "numeric" "numeric" "numeric" "numeric"
## TBG age sex
## "numeric" "character" "character"
# Convert columns with character class to factor
thyroid$age <- as.factor(thyroid$age)
thyroid$sex <- as.factor(thyroid$sex)
# Check that all columns are factor or numeric
sapply(X = thyroid, FUN = class)
## label TSH T3 TT4 T4U FTI TBG age
## "integer" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor"
## sex
## "factor"
# Check the class of age column
class(thyroid$age)
## [1] "factor"
# Check the class of sex column
class(thyroid$sex)
## [1] "factor"
# Fit an isolation forest with 100 trees
thyroid_for <- iForest(thyroid[, -1], nt = 100)
# Calculate Gower's distance matrix (daisy() is from the cluster package)
thyroid_dist <- daisy(thyroid[, -1], metric = "gower")
# Generate LOF scores for thyroid data
thyroid_lof <- lof(thyroid_dist, k = 10)
# Range of values in the distance matrix
range(as.matrix(thyroid_dist))
## [1] 0.0000000 0.7025413
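The Gower-based LOF scores can then be used in the same way as before; a minimal sketch to locate the most anomalous patient:
# Row index of the patient with the highest Gower-based LOF score
which.max(thyroid_lof)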
[Video]
Michael is a hybrid thinker and doer—a byproduct of being a CliftonStrengths “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470