Read the data from the CSV file into the R environment
We will work with the original version of the Stamps dataset, first reported by Micenková, Beusekom, and Shafait (2012) and available from the outlier data repository described in Campos et al. (2016). In particular, the dataset (not normalised, without duplicates) contains 340 observations described by 9 numerical predictors. Each observation (row) is a feature vector describing a stamp, with 9 features (columns 1 to 9). It is a binary classification dataset containing forged (photocopied and printed) stamps as well as genuine (ink) stamps. The last (10th) column of the dataset contains the class labels ('yes' denotes forged, 'no' denotes genuine).
Stamps <- read.table("/Users/Biljana/Data Mining_1/Ass 3/Stamps_withoutdupl_09.csv", header = FALSE,
sep = ",", dec = ".") # read the csv file into a dataframe
colnames(Stamps) # display column names
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10"
head(Stamps) # show first six rows
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 0.112011 0.425226 0.424364 0.681209 0.748623 0.066976 0.865725 0.020741
## 2 0.000000 0.425226 0.043612 0.086249 0.122159 0.000000 0.889814 1.000000
## 3 0.119357 0.425226 0.280526 0.587800 0.696619 0.014433 0.980303 0.367449
## 4 0.024987 0.425226 0.056383 0.097072 0.130242 0.001800 0.787256 0.284758
## 5 0.176667 0.425226 0.356267 0.515091 0.555632 0.059072 0.881087 0.006373
## 6 0.337423 0.272848 0.218090 0.564323 0.687203 0.075768 0.888813 0.001621
## V9 V10
## 1 0.181869 yes
## 2 0.969187 yes
## 3 0.595861 yes
## 4 0.804989 yes
## 5 0.124229 yes
## 6 0.110962 yes
summary(Stamps) # 9 Predictors (V1 to V9) and class labels (V10)
## V1 V2 V3 V4
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.04883 1st Qu.:0.3357 1st Qu.:0.04662 1st Qu.:0.2369
## Median :0.07872 Median :0.4252 Median :0.09584 Median :0.3456
## Mean :0.10265 Mean :0.4197 Mean :0.14193 Mean :0.3841
## 3rd Qu.:0.12214 3rd Qu.:0.4416 3rd Qu.:0.19458 3rd Qu.:0.5233
## Max. :1.00000 Max. :0.8645 Max. :1.00000 Max. :1.0000
## V5 V6 V7 V8
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.000656
## 1st Qu.:0.4524 1st Qu.:0.01010 1st Qu.:0.9183 1st Qu.:0.018221
## Median :0.5783 Median :0.01945 Median :0.9724 Median :0.032162
## Mean :0.5986 Mean :0.04151 Mean :0.9299 Mean :0.054981
## 3rd Qu.:0.7373 3rd Qu.:0.04528 3rd Qu.:0.9884 3rd Qu.:0.059360
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.000000
## V9 V10
## Min. :0.0000 Length:340
## 1st Qu.:0.4492 Class :character
## Median :0.5787 Mode :character
## Mean :0.5693
## 3rd Qu.:0.7247
## Max. :1.0000
PB_Predictors <- Stamps[,1:9] # 9 Predictors (V1 to V9)
PB_class <- Stamps[,10] # Class labels (V10)
PB_class <- ifelse(PB_class == 'no',0,1) # Inliers (class "no") = 0, Outliers (class "yes") = 1
PB_Class <- as.numeric(PB_class) # store the class labels as a numeric vector
PB_Predictors_Data <- as.matrix(PB_Predictors) # conversion to conventional matrix
table(PB_Class) # how many observations are genuine or forged
## PB_Class
## 0 1
## 309 31
In total there are 340 observations, of which 31 (approximately 9.1%) are outliers. The code reads the dataset, separates the predictors (variables 1 to 9) from the class labels (variable 10), and recodes the class labels so that inliers are labelled 0 and outliers are labelled 1.
Perform Principal Component Analysis (PCA) on the Stamps data in the 9-dimensional space of the numerical predictors (PB_Predictors), and show the Proportion of Variance Explained (PVE) for each of the nine resulting principal components. Plot the accumulated sum of PVE for the first m components, as a function of m, and discuss the result: (a) How many components do we need to explain 90% or more of the total variance? (b) How much of the total variance is explained by the first three components?
# Applying PCA to 9-dimensional space of predictor variables:
#===========================================================
PCA_Stamps <- prcomp(PB_Predictors_Data, scale = TRUE, center = TRUE) # apply prcomp() from stats package
# Inspect PCA results with get_eigenvalue() from factoextra package:
get_eigenvalue(PCA_Stamps)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.5865791 39.850879 39.85088
## Dim.2 1.3724191 15.249102 55.09998
## Dim.3 1.2412560 13.791734 68.89171
## Dim.4 0.9014735 10.016372 78.90809
## Dim.5 0.7068964 7.854404 86.76249
## Dim.6 0.5650074 6.277860 93.04035
## Dim.7 0.2959451 3.288279 96.32863
## Dim.8 0.1978152 2.197946 98.52658
## Dim.9 0.1326082 1.473424 100.00000
eig <- get_eigenvalue(PCA_Stamps) # assign PCA results into a variable
# Create a dataframe from PCA results in order to present them with ggplot2:
df_eig <- data.frame(eigenvalue = eig[ ,1], variance = eig[ ,2], cumvariance = eig[ ,3]) # create a dataframe
df_eig <- cbind(df_eig, row_id = as.factor(1:nrow(eig)), row_count = 1:nrow(eig)) # add row_id and row_count
# Assign and rename variables:
Row_Id <- df_eig$row_id
Row_Count <- df_eig$row_count
CumPVE <- (round(df_eig$cumvariance, digits = 2)) # round the values to 2 digits
Variance <- (round(df_eig$variance, digits = 2)) # round the values to 2 digits
Proportion of Variance Explained (PVE) and Accumulated Sum of PVE for Each PCA Component
## Display of Proportion of Variance Explained (PVE) for each of the nine PCA components with ggplot2:
#===============================================================================================
ggplot(df_eig, aes(x = Row_Id, y = Variance)) +
geom_bar(stat = "identity", aes(fill = Row_Id), color = "black") +
geom_path(aes(x = Row_Count), size = 1, color = "Gray50") +
geom_point(color = "Gray50", size = 1) +
xlab("Principal Component") +
ylab("Proportion of Variance Explained, (PVE)") +
scale_fill_brewer(breaks = c(1:9),
palette = "YlGnBu",
direction = -1) +
theme_bw(base_size = 12) +
theme(legend.position = "none")
Fig 1: Scree Plot - Percentage of Variance Explained by Each of the Nine Principal Components.
# Display of Accumulated Sum of PVE for each of the nine PCA components with geom_bar() from ggplot2:
#===================================================================================================
ggplot(df_eig, aes(x = Row_Id, y = CumPVE, group = 1)) +
geom_bar(stat = "identity", aes(fill = Row_Id), color = "black") +
geom_path(aes(x = Row_Count), size = 1, color = "Gray50") +
geom_hline(yintercept = 90, linetype = "dashed", color = "Gray50", size = 1, show.legend = T) +
geom_text(aes(label = paste(CumPVE, "%", sep = "")), vjust = 3, color = "Gray50") +
geom_point(color = "Gray50", size = 1) +
xlab("Principal Component") +
ylab("Accumulated Percentage of PVE") +
scale_fill_brewer(breaks = c(1:9),
palette = "YlGnBu",
direction = -1) +
theme_bw(base_size = 12) +
theme(legend.position = "none")
Fig 2: Accumulated Sum of PVE for Each of the Nine Principal Components, with the 90% cut-off indicated by a dashed line.
There are at least two important applications of PCA in data mining: dimensionality reduction and visualisation. As a dimensionality (variable) reduction procedure, PCA is appropriate when we want to find a low-dimensional representation of multivariate data that captures as much of the information as possible, and then plot the observations in this low-dimensional space. Each of the n observations in the dataset lives in p-dimensional space, and each variable can be considered a separate dimension. For datasets where p > 3, a multi-dimensional hyperspace is very difficult to visualise. We therefore wish to derive a smaller number of artificial variables (called principal components) that account for most of the variance in the observed variables and can be used in subsequent analyses. These new variables are linear combinations of the p features. The first principal component accounts for a maximal amount of the total variance in the observed variables. The second principal component has two important properties: (a) it accounts for a maximal amount of the variance not accounted for by the first component, and is typically correlated with some of the variables that did not display strong correlations with the first component; (b) it is uncorrelated with the first component.
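As a sanity check, the PVE values reported by get_eigenvalue() can be reproduced directly from the prcomp object, since prcomp() stores the standard deviations of the components in PCA_Stamps$sdev; a minimal sketch:
PVE <- PCA_Stamps$sdev^2 / sum(PCA_Stamps$sdev^2) # proportion of variance explained by each component
round(100 * PVE, 2) # should match the variance.percent column above (39.85, 15.25, ...)
round(100 * cumsum(PVE), 2) # cumulative PVE, matching cumulative.variance.percent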
In the Stamps data, the first principal component explains 39.85% of the total variance, the second 15.25%, the third 13.79%, the fourth 10.02%, the fifth 7.85%, the sixth 6.28%, and so on. Each succeeding component accounts for a progressively smaller amount of variance. How many components do we need to retain to explain 90% or more of the total variance in the original dataset? From the output of the get_eigenvalue() function and the scree plot (Fig 1) we can see that the first 6 principal components explain 93.04% of the total variance, while the remaining components (7 to 9) account for only 6.96%. In other words, we can reduce the 9 components to 6 while losing only a small proportion of the variance.
An alternative criterion that can be used to check this result is to retain just enough components so that the cumulative sum of the proportion of variance explained reaches a chosen cut-off, here >= 90% (Fig 2). The cumulative PVE shows that components 1 to 6 together account for 93.04% of the variance, so we must retain all of the first six components. This criterion therefore confirms the conclusion drawn from the scree plot (Fig 1) and the individual PVE values.
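The same 90% rule can be expressed in code; a small sketch, assuming the PVE vector computed from PCA_Stamps$sdev above, that returns the smallest number of components whose cumulative PVE reaches the threshold:
m_90 <- which(cumsum(PVE) >= 0.90)[1] # smallest m with cumulative PVE >= 90%
m_90 # evaluates to 6 for the Stamps data, in line with the table above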
How much of the total variance is explained by the first three components? By applying PCA to the 9-dimensional space of the predictors, we can see that the first two components alone explain 55.01% of the variance. The first three principal components explain 68.89% of total variance in the Stamps dataset. It is interesting to visualise the data in these three principal components, PC1, PC2 and PC3 with the ground truth outliers (according to the class labels) highlighted in blue (Fig 3).
# Create a dataframe of 3 PCA components:
PCA_Components_Stamps = data.frame(PCA_Stamps$x[, 1:3])
PC1 <- PCA_Components_Stamps$PC1
PC2 <- PCA_Components_Stamps$PC2
PC3 <- PCA_Components_Stamps$PC3
# Scatter-plot observations by PCA components 1 and 2:
plot1 = ggplot(PCA_Components_Stamps, aes(PC1, PC2)) +
geom_point(aes(colour = factor(PB_Class)))
# Scatter-plot observations by PCA components 1 and 3:
plot2 = ggplot(PCA_Components_Stamps, aes(PC1, PC3)) +
geom_point(aes(colour = factor(PB_Class)))
# Scatter-plot observations by PCA components 2 and 3:
plot3 = ggplot(PCA_Components_Stamps, aes(PC2, PC3)) +
geom_point(aes(colour = factor(PB_Class)))
grid.arrange(plot1, plot2, plot3, ncol = 1) # set up the grid
Fig 3: PCA scatter plots for the three pairwise combinations of the first three principal components (PC1, PC2 and PC3), with ground-truth outliers highlighted in blue.
The scatter plots suggest that forged stamp samples tend to have lower values on PC1 than genuine stamp samples, so the first principal component helps to separate forged stamps from genuine ones. Forged stamp samples also tend to have higher values on PC3, which further separates the two groups. The first three principal components are therefore reasonably useful for distinguishing forged from genuine stamps.
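A quick numerical check of this visual impression is to compare the class-wise means of the component scores; a hedged sketch using the PC1, PC3 and PB_Class objects defined above:
tapply(PC1, PB_Class, mean) # mean PC1 score for inliers (0) versus outliers (1)
tapply(PC3, PB_Class, mean) # mean PC3 score for inliers (0) versus outliers (1)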
Do some research by yourself on how to render 3D plots in R, and then plot a 3D scatter-plot of the Stamps data as represented by the first three principal components computed in the previous item ( x = PC1 , y = PC2 , and z = PC3 ). You can use, for example, the function scatter3D() from the package plot3D. Use the class labels (PB_class) to plot inliers and outliers in different colours (for example, inliers in black and outliers in red). Make sure you produce multiple plots from different angles (at least three). Recalling that the class labels would not be available in a practical application of unsupervised outlier detection, do the outliers (forged stamps) look easy to detect in an unsupervised way, assuming that the 3D visualisation of the data via PCA is a reasonable representation of the data in full space? How about in a supervised way? Why? Justify your answers.
# 3D-Scatter Plots at different angles with PB_Class and without PB_Class:
#====================================================
x <- PCA_Components_Stamps[,1] # define x axis
y <- PCA_Components_Stamps[,2] # define y axis
z <- PCA_Components_Stamps[,3] # define z axis
# full panels of box are drawn (bty = "f")
scatter3D(x, y, z, pch = 20, col = "black", phi = 180,
theta = 60, bty = "f",
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
Fig 4: 3D Scatter Plot without Class Labels.
scatter3D(x, y, z, pch = 20, col = c("black", "red"), colvar = PB_Class, phi = 180,
theta = 60, bty = "f",
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
Fig 5: First view - 3D Scatter Plot with Class Labels.
scatter3D(x, y, z, pch = 20, col = c("black", "red"), colvar = PB_Class, phi = 60,
theta = 180, bty = "f",
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
Fig 6: Second view - 3D Scatter Plot with Class Labels.
scatter3D(x, y, z, pch = 20, col = c("black", "red"), colvar = PB_Class, phi = 120,
theta = 360, bty = "f",
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
Fig 7: Third view - 3D Scatter Plot with Class Labels.
Fig 5, Fig 6 and Fig 7 suggest that the data, at least as approximated by the first three principal components, consist of one dense cluster, while many of the observations labelled as outliers (in red) fall close to, but do not overlap with, that cluster and form a smaller group of their own. If this is a trustworthy representation of the full 9-dimensional space, such observations should be relatively easy to detect with an unsupervised outlier detection method. In addition, a small fraction of the scattered points far away from the main cluster are indeed outliers according to the class labels. The results obtained from the 3D plots confirm those obtained from the 2D plots in Fig 3.
In a supervised setting, the class labels mark which observations are anomalies and which are normal, and the labels are used to train a model that can recognise those specific types of anomalies. The results obtained from an unsupervised method can therefore differ considerably from those of a supervised method. As a general rule, if reliable class labels are available, supervised methods are preferable because of their ability to learn the specific anomalies of interest. A supervised outlier detection model is essentially a predictive model for the normal versus outlier classes, so anomaly detection becomes a binary classification problem. One difficulty is that anomalies are rare: the number of forged stamps (31) is very small compared to genuine stamps (309), so a classifier could predict most forged stamps as genuine and still achieve excellent accuracy. Another concern is whether the available labels (forged stamps) are a representative and accurate sample of the anomalies in this domain. Nevertheless, given the separation visible in the 2D and 3D plots, we expect a supervised classifier to be quite accurate (AUC-ROC > 0.9), which we can verify on a test dataset.
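To make the accuracy pitfall concrete, consider a trivial classifier that labels every stamp as genuine; using the label vector from Activity 1:
mean(PB_Class == 0) # accuracy of an "always genuine" classifier: 309/340, about 0.91
# roughly 91% accuracy, yet none of the 31 forged stamps would be detected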
In this second activity, you are asked to perform unsupervised outlier detection on the Stamps data in the 9-dimensional space of the numerical predictors (PB_Predictors), using KNN Outlier with different values of the parameter k (at least the following three: k = 5, 25, 100). For each k, produce the same 3D PCA visualisation of the data as in Activity 1 (PCA), but rather than using the class labels to colour the points, use instead the resulting KNN Outlier Scores as a continuous, diverging colour scale. Then, for each k, produce a second plot where the top-31 outliers according to the KNN Outlier Scores are shown in red, while the other points are shown in black. Do these plots give you any insights on the values of k that look more or less appropriate from an unsupervised perspective (ignoring the class labels)? Justify your answer.
# Unsupervised Outlier Detection with different values of parameter k:
#====================================================================
k1 = 5 # KNN parameter
KNN_5_Outlier <- kNNdist(x = PB_Predictors, k = k1, all = TRUE)[,k1] # KNN distance (outlier score) computation
k2 = 25 # KNN parameter
KNN_25_Outlier <- kNNdist(x = PB_Predictors, k = k2, all = TRUE)[,k2] # KNN distance (outlier score) computation
k3 = 100 # KNN parameter
KNN_100_Outlier <- kNNdist(x = PB_Predictors, k = k3, all = TRUE)[,k3] # KNN distance (outlier score) computation
# Function to display KNN outlier scores on the 3D PCA plot (first viewing angle, phi = 180, theta = 60):
KNN_outliers <- function(PB_Predictors, PCA_Components_Stamps, k) {
x = PCA_Components_Stamps[,1]
y = PCA_Components_Stamps[,2]
z = PCA_Components_Stamps[,3]
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k]
cols <- brewer.pal(11, "BrBG") # diverging colour palette "BrBG" with 11 colours
pal <- colorRampPalette(cols) # pass the palette to colorRampPalette()
scatter3D(x, y, z, bty = "f", pch = 20,
col = pal(11), colvar = KNN_Outlier, phi = 180, theta = 60,
main = paste("KNN Outliers if k = ", k, sep = ""),
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
}
# Function to display the top-31 KNN outliers on the 3D PCA plot (first viewing angle):
Top_31_KNN_Outliers <- function(PB_Predictors, PCA_Components_Stamps, k){
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k]
PCA_Outliers_Df <- as.data.frame(cbind(PCA_Components_Stamps, KNN_Outlier)) # combine PCA with KNN outliers
PCA_Outliers_Rank <- PCA_Outliers_Df[order(PCA_Outliers_Df$KNN_Outlier, decreasing = T),] # reorder data frame
KNN_Result <- cbind(PCA_Outliers_Rank, ID = seq(1:nrow(PCA_Outliers_Rank))) # add in ID column
Top_Outliers <- ifelse(KNN_Result$ID <= 31,"1","0") # define the factor levels for top 31 outliers
# Assign x, y and z coordinates according to the new reordered dataframe
x = KNN_Result[,1]
y = KNN_Result[,2]
z = KNN_Result[,3]
scatter3D(x, y, z, pch = 20, col = c("black", "red"), colvar = as.numeric(Top_Outliers), phi = 180,
theta = 60, bty = "f",
main = paste("Top 31 KNN Outliers if k = ", k, sep = ""),
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
}
# 3D plot with KNN outliers if k = 5:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 5)
# 3D plot with Top 31 KNN outliers if k = 5:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 5)
# 3D plot with KNN outliers if k = 25:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 25)
# 3D plot with Top 31 KNN outliers if k = 25:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 25)
# 3D plot with KNN outliers if k = 100:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 100)
# 3D plot with Top 31 KNN outliers if k = 100:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 100)
# Redefine the KNN outlier display function for a second viewing angle (phi = 60, theta = 180):
KNN_outliers <- function(PB_Predictors, PCA_Components_Stamps, k) {
x = PCA_Components_Stamps[,1]
y = PCA_Components_Stamps[,2]
z = PCA_Components_Stamps[,3]
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k]
cols <- brewer.pal(11, "BrBG")
pal <- colorRampPalette(cols)
scatter3D(x, y, z, bty = "f", pch = 20,
col = pal(11), colvar = KNN_Outlier, phi = 60, theta = 180,
main = paste("KNN Outliers if k = ", k, sep = ""),
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
}
# Redefine the top-31 KNN outlier display function for the second viewing angle:
Top_31_KNN_Outliers <- function(PB_Predictors, PCA_Components_Stamps, k){
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k]
PCA_Outliers_Df <- as.data.frame(cbind(PCA_Components_Stamps, KNN_Outlier)) # combine PCA with KNN outliers
PCA_Outliers_Rank <- PCA_Outliers_Df[order(PCA_Outliers_Df$KNN_Outlier, decreasing = T),] # reorder data frame using KNN values
KNN_Result <- cbind(PCA_Outliers_Rank, ID = seq(1:nrow(PCA_Outliers_Rank))) # add in ID column
Top_Outliers <- ifelse(KNN_Result$ID <= 31,"1","0") # define the factor levels for top 31 outliers
# Assign x, y and z coordinates according to the new reordered dataframe
x = KNN_Result[,1]
y = KNN_Result[,2]
z = KNN_Result[,3]
scatter3D(x, y, z, pch = 20, col = c("black", "red"), colvar = as.numeric(Top_Outliers), phi = 60,
theta = 180, bty = "f",
main = paste("Top 31 KNN Outliers if k = ", k, sep = ""),
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
}
# 3D plot with KNN outliers if k = 5:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 5)
# 3D plot with Top 31 KNN outliers if k = 5:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 5)
# 3D plot with KNN outliers if k = 25:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 25)
# 3D plot with Top 31 KNN outliers if k = 25:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 25)
# 3D plot with KNN outliers if k = 100:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 100)
# 3D plot with Top 31 KNN outliers if k = 100:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 100)
# Redefine the KNN outlier display function for a third viewing angle (phi = 120, theta = 360):
KNN_outliers <- function(PB_Predictors, PCA_Components_Stamps, k) {
x = PCA_Components_Stamps[,1]
y = PCA_Components_Stamps[,2]
z = PCA_Components_Stamps[,3]
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k]
cols <- brewer.pal(11, "BrBG")
pal <- colorRampPalette(cols)
scatter3D(x, y, z, bty = "f", pch = 20,
col = pal(11), colvar = KNN_Outlier, phi = 120, theta = 360,
main = paste("KNN Outliers if k = ", k, sep = ""),
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
}
# Redefine the top-31 KNN outlier display function for the third viewing angle:
Top_31_KNN_Outliers <- function(PB_Predictors, PCA_Components_Stamps, k){
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k]
PCA_Outliers_Df <- as.data.frame(cbind(PCA_Components_Stamps, KNN_Outlier)) # combine PCA with KNN outliers
PCA_Outliers_Rank <- PCA_Outliers_Df[order(PCA_Outliers_Df$KNN_Outlier, decreasing = T),] # reorder data frame using KNN values
KNN_Result <- cbind(PCA_Outliers_Rank, ID = seq(1:nrow(PCA_Outliers_Rank))) # add in ID column
Top_Outliers <- ifelse(KNN_Result$ID <= 31,"1","0") # define the factor levels for top 31 outliers
# Assign x, y and z coordinates according to the new reordered dataframe
x = KNN_Result[,1]
y = KNN_Result[,2]
z = KNN_Result[,3]
scatter3D(x, y, z, pch = 20, col = c("black", "red"), colvar = as.numeric(Top_Outliers), phi = 120,
theta = 360, bty = "f",
main = paste("Top 31 KNN Outliers if k = ", k, sep = ""),
xlab = "PC1",
ylab = "PC2",
zlab = "PC3")
}
# 3D plot with KNN outliers if k = 5:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 5)
# 3D plot with Top 31 KNN outliers if k = 5:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 5)
# 3D plot with KNN outliers if k = 25:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 25)
# 3D plot with Top 31 KNN outliers if k = 25:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 25)
# 3D plot with KNN outliers if k = 100:
KNN_outliers(PB_Predictors, PCA_Components_Stamps, k = 100)
# 3D plot with Top 31 KNN outliers if k = 100:
Top_31_KNN_Outliers(PB_Predictors, PCA_Components_Stamps, k = 100)
The distance-based approach is a model-free anomaly detection approach, as it does not require constructing an explicit model of the normal data to determine anomaly scores or class labels. We have studied the KNN outlier algorithm, a non-parametric method for outlier detection that can handle large and high-dimensional databases without making any assumption about the data-generating distribution. It is a simple and effective method: it assigns the distance of an observation to its k-th nearest neighbour as that observation's degree of outlyingness. Observations that fall inside clusters (inliers) have small distances to their nearest neighbours and therefore low outlyingness scores, while observations that do not fall within clusters have larger distances to their nearest neighbours and therefore high outlyingness scores.
We expect normal observations in the predictor space (PB_Predictors) to have a small distance to their k-th nearest neighbour, whereas anomalies should have a large distance to their k-th nearest neighbour. In the examples above, we apply the k-nearest-neighbour approach with k = 5, k = 25 and k = 100 to identify the anomalous forged stamps among the normal stamps, and we also examine the data points associated with the top-31 highest anomaly scores. The 3D scatter plots of the first three principal components, viewed from different angles, show about 5 highly outlying observations with the largest distances to their k-th nearest neighbours (global outliers) in all of the images. Most of the other flagged outliers do not lie within the main cluster but in its vicinity (local outliers), while at least one flagged observation falls inside the cluster and is likely an inlier (a false positive). The continuous diverging colour palette maps well onto the point distances: observations with high outlyingness appear in white and light-to-dark blue tones, while inliers appear in varying tones of brown. The red/black plots of the top-31 KNN Outlier Scores are very similar to their counterparts coloured by the class labels in the previous activity. In conclusion, this unsupervised method is quite capable of identifying the outliers in this dataset, despite the known limitations of the KNN outlier algorithm.
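As an aside, the row indices of the top-31 scoring observations can also be obtained compactly with order(); a minimal sketch using the KNN_5_Outlier scores computed earlier, cross-checked against the ground-truth labels (which would of course not be available in a truly unsupervised setting):
Top31_Idx <- order(KNN_5_Outlier, decreasing = TRUE)[1:31] # indices of the 31 largest 5-NN outlier scores
sum(PB_Class[Top31_Idx] == 1) # how many of these are forged stamps according to the class labels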
1. Perform supervised classification of the Stamps data, using a KNN classifier with the same values of k as used in Activity 2 (unsupervised outlier detection). For each classifier (that is, each value of k), compute the Area Under the Curve ROC (AUC-ROC) in a Leave-One-Out Cross-Validation (LOOCV) scheme.
# KNN supervised classification:
#===============================
# Put class variable(PB_Class) back on and rename:
Stamps_Df <- cbind(PB_Predictors, PB_Class)
names(Stamps_Df)[10] <- "Class"
# Check data:
colnames(Stamps_Df)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9"
## [10] "Class"
head(Stamps_Df)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 0.112011 0.425226 0.424364 0.681209 0.748623 0.066976 0.865725 0.020741
## 2 0.000000 0.425226 0.043612 0.086249 0.122159 0.000000 0.889814 1.000000
## 3 0.119357 0.425226 0.280526 0.587800 0.696619 0.014433 0.980303 0.367449
## 4 0.024987 0.425226 0.056383 0.097072 0.130242 0.001800 0.787256 0.284758
## 5 0.176667 0.425226 0.356267 0.515091 0.555632 0.059072 0.881087 0.006373
## 6 0.337423 0.272848 0.218090 0.564323 0.687203 0.075768 0.888813 0.001621
## V9 Class
## 1 0.181869 1
## 2 0.969187 1
## 3 0.595861 1
## 4 0.804989 1
## 5 0.124229 1
## 6 0.110962 1
dim(Stamps_Df)
## [1] 340 10
# Split data into training set (80%) and testing set (20%):
set.seed(0) # random seed
No_Obs <- dim(Stamps_Df)[1] # No. of observations (340)
Test_Index <- sample(No_Obs, size = as.integer(No_Obs*0.2), replace = FALSE) # 20% data records for test
Test_Predictors <- Stamps_Df[Test_Index, c(1:9)] # testing dataset
Test_Class <- as.numeric(Stamps_Df[Test_Index, "Class"]) # testing class label
Training_Index <- -Test_Index # 80% data records for training
Training_Predictors <- Stamps_Df[Training_Index, c(1:9)] # training dataset
Training_Class <- as.numeric(Stamps_Df[Training_Index, "Class"]) # training class label
Build and test the classifiers and compare the predicted outcomes with the observed outcomes
# Build knn classifier if k = 5
Dfn_5_Pred <- knn(train = Training_Predictors, test = Test_Predictors, cl = Training_Class, k = 5, prob = T)
(table(Dfn_5_Pred, Test_Class))
## Test_Class
## Dfn_5_Pred 0 1
## 0 59 1
## 1 3 5
# Build knn classifier if k = 25
Dfn_25_Pred <- knn(train = Training_Predictors, test = Test_Predictors, cl = Training_Class, k = 25, prob = T)
(table(Dfn_25_Pred, Test_Class))
## Test_Class
## Dfn_25_Pred 0 1
## 0 60 2
## 1 2 4
# Build knn classifier if k = 100
Dfn_100_Pred <- knn(train = Training_Predictors, test = Test_Predictors, cl = Training_Class, k = 100, prob = T)
(table(Dfn_100_Pred, Test_Class))
## Test_Class
## Dfn_100_Pred 0 1
## 0 62 6
## 1 0 0
Calculate AUC-ROC values for each supervised KNN classifier in a Leave-One-Out Cross-Validation (LOOCV) scheme
#Create ROCPlot function for calculating AUC-ROC:
#===============================================
rocplot <- function(predicted, observed){
Pred_Obj <- prediction(predicted, observed)
ROC <- performance(Pred_Obj, "tpr", "fpr")
# Plot the ROC Curve
plot(ROC, colorize = T, lwd = 3, main = "ROC Curve")
auc <- performance(Pred_Obj, measure = "auc")
auc <- auc@y.values[[1]]
# Return the Area Under the Curve ROC
return(auc)
}
#Plot ROC Curves for KNN classifiers with LOOCV cross-validation and with k parameter values (5, 25 and 100):
#============================================================================================================
# Create a function to assess a KNN classifier using LOOCV and to plot ROC curve:
ROC_AUC_knn.cv <- function(train, cl, k){
Pred_Class <- knn.cv(train = train, cl = cl, k = k, prob = TRUE) # use the function arguments rather than globals
Pred_Prob <- attr(Pred_Class, "prob")
Pred_Prob <- ifelse(Pred_Class == 1, Pred_Prob, 1 - Pred_Prob)
AUC <- rocplot(predicted = Pred_Prob, observed = cl) # call the rocplot()
cat("K-value:", k, ", AUC:", AUC, fill = TRUE)
}
ROC_AUC_knn.cv(Training_Predictors, Training_Class, 5)
## K-value: 5 , AUC: 0.9527935
ROC_AUC_knn.cv(Training_Predictors, Training_Class, 25)
## K-value: 25 , AUC: 0.9561943
ROC_AUC_knn.cv(Training_Predictors, Training_Class, 100)
## K-value: 100 , AUC: 0.908583
Plot AUC results for supervised KNN classifiers under LOOCV cross-validation
# Create AUC_Calc() function to calculate AUC results:
#====================================================
AUC_Calc <- function(predicted, observed){
Pred_Obj <- prediction(predicted, observed)
auc <- performance(Pred_Obj, measure = "auc")
auc <- auc@y.values[[1]]
# Return the Area Under the Curve ROC
return(auc)
}
# Calculate AUC results for supervised KNN classifiers with LOOCV schema and k = 1:100:
#======================================================================================
set.seed(0) # random seed
AUC_Super <- rep(0, 100) # create an empty vector
for (k in 1:100) {
Pred_Class <- knn.cv(train = Training_Predictors, cl = Training_Class, k = k, prob = T) # LOOCV applied with knn.cv()
Pred_Prob <- attr(Pred_Class, "prob")
Pred_Prob <- ifelse(Pred_Class == 1, Pred_Prob, 1 - Pred_Prob)
AUC_Super[k] <- AUC_Calc(predicted = Pred_Prob, observed = Training_Class)
}
# Plot the AUC values with ggplot2:
#==================================
Auc_Df <- as.data.frame(AUC_Super) # create data frame
Auc_Df <- cbind(Auc_Df, k = 1:nrow(Auc_Df)) # add in k column as index
k <- Auc_Df$k # assign variable as a vector
AUC_Super <- Auc_Df$AUC_Super # assign variable as a vector
ggplot(data = Auc_Df, aes(x = k, y = AUC_Super, group = 1)) +
geom_point(alpha = 20/40, color = "blue") + # showing values as scatter plots
geom_line(color = "magenta3") + # layer a pink line
geom_hline(yintercept = 0.967, linetype = "dashed", color = "Grey50") + # higlight with dashed line
xlab(label = "k value") +
ylab(label = "AUC") +
ggtitle("AUC Results for Supervised KNN Classifers with k = 1:100 on Training Dataset \n Best AUC = 0,9678 for k = 8")
Fig 8: Line graph of AUC results for supervised KNN classifiers with k = 1:100
Maximum AUC results calculation:
Auc_Df %>% filter(AUC_Super == max(AUC_Super))
## AUC_Super k
## 1 0.9721457 8
In general, cross-validation includes the following steps: (a) divide the dataset into K disjoint, equal-sized subsets (called folds), each containing n/K observations, where n is the dataset size; (b) choose one fold as the test set, with the remaining K-1 folds forming the training set; (c) build the model on the training set; (d) evaluate the resulting classifier on the held-out test fold; (e) repeat this K times, using each fold once as the test set; (f) average the assessment results (e.g., accuracy) obtained across the K test folds.
If K = 10, this is the widely adopted 10-fold cross-validation procedure. Technically, K can be set to any value between 2 and n (the dataset size). If K = n, the procedure is called Leave-One-Out Cross-Validation (LOOCV), which is also broadly used in practice. To perform LOOCV we used the built-in knn.cv() function from the class package with three different values of the KNN parameter k (5, 25 and 100). The resulting AUC-ROC values, which measure classification performance, for the supervised KNN classifiers on the training subset of the Stamps dataset are approximately 0.95, 0.96 and 0.91 for k = 5, 25 and 100, respectively. Assessing the KNN classifier with LOOCV while varying k from 1 to 100 (Fig 8) shows that the best classifier achieves an AUC of about 0.97 at k = 8. There is no universally best k, nor is a larger k necessarily better: choosing k involves a trade-off between bias and variance. If k is small, the classifier has low bias and high variance; if k is large, it has high bias and low variance when estimating the test error.
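For completeness, the LOOCV procedure described above can also be written out by hand; a minimal sketch, equivalent in spirit to what knn.cv() does internally (the helper loocv_knn_pred is hypothetical and introduced here only for illustration):
library(class) # provides knn(); already loaded earlier for knn() and knn.cv()
loocv_knn_pred <- function(predictors, labels, k) {
n <- nrow(predictors)
pred <- character(n)
for (i in 1:n) {
# train on all observations except the i-th, then predict the held-out i-th observation
pred[i] <- as.character(knn(train = predictors[-i, ], test = predictors[i, , drop = FALSE], cl = labels[-i], k = k))
}
pred
}
# Example: Manual_Pred <- loocv_knn_pred(Training_Predictors, Training_Class, k = 5)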
2. Compare the resulting (supervised) KNN classification performance for each value of k, against the classification performance obtained in an unsupervised way by the KNN Outlier method with the same value of k. Notice that, if we rescale the KNN Outlier Scores (obtained in Activity 2 (unsupervised outlier detection)) into the [0,1] interval, these scores can be interpreted as outlier probabilities, which can then be compared with the class labels (ground truth) in PB_class to compute an AUC-ROC value. This way, for each value of k, the AUC-ROC of the supervised KNN classifier can be compared with the AUC-ROC of KNN Outlier as an unsupervised classifier. Compare the performances of the supervised versus unsupervised classifiers and discuss the results. For example, recalling that the supervised method makes use of the class labels, whereas the unsupervised method doesn't, what can you conclude considering there are applications where class labels are not available?
First, we calculate the AUC-ROC values of the unsupervised KNN outlier detector on the non-normalised dataset, with the same values of k, so that they can be compared with the AUC-ROC values obtained from the supervised KNN classifiers under the same conditions: the non-normalised dataset and k values of 5, 25 and 100.
# Calculate AUC-ROC values on non-normalized dataset for unsupervised KNN Classifier:
#====================================================================================
# Create a function to calculate AUC-ROC for unsupervised KNN Classifier:
KNN_outliers <- function(PB_Predictors, PB_Class, k) {
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k]
AUC <- rocplot(predicted = KNN_Outlier, observed = PB_Class)
cat("K-value:", k, ", AUC:", AUC, fill = TRUE)
}
KNN_outliers(PB_Predictors, PB_Class, 5)
## K-value: 5 , AUC: 0.8240944
KNN_outliers(PB_Predictors, PB_Class, 25)
## K-value: 25 , AUC: 0.8921599
KNN_outliers(PB_Predictors, PB_Class, 100)
## K-value: 100 , AUC: 0.9055225
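Note that AUC-ROC depends only on the ranking of the scores, so min-max rescaling the KNN Outlier Scores into the [0,1] interval (as suggested in the task) leaves the AUC unchanged; a hedged sketch for k = 5, reusing the AUC_Calc() helper defined earlier:
KNN_5_Scaled <- (KNN_5_Outlier - min(KNN_5_Outlier)) / (max(KNN_5_Outlier) - min(KNN_5_Outlier)) # rescale scores to [0,1]
AUC_Calc(predicted = KNN_5_Scaled, observed = PB_Class) # same AUC as with the unscaled scores (0.8240944)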
Plot the AUC results for supervised versus unsupervised KNN Classifiers
# Create a for loop to generate AUC values for 100 k's:
#======================================================
set.seed(0) # random seed
AUC_Unsuper <- rep(0, 100) # create an empty vector
for (k in 1:100) {
KNN_Outlier <- kNNdist(x = PB_Predictors, k = k, all = TRUE)[,k] # assess the unsupervised KNN
AUC_Unsuper[k] <- AUC_Calc(predicted = KNN_Outlier, observed = PB_Class) # call AUC_Calc
}
# Create a dataframe containing AUC values of supervised and unsupervised classifications in order
# to plot them together with ggplot2:
#===================================================================================================
Auc_Unsuper_Df <- as.data.frame(AUC_Unsuper) # create dataframe with AUC_unsuper first
Auc_Unsuper_Df <- cbind(Auc_Unsuper_Df, k = 1:nrow(Auc_Unsuper_Df)) # add in k column as index
AUC_Values = merge(Auc_Unsuper_Df, Auc_Df, by = "k") # merge both dataframes by k
AUC_Values_Melted <- reshape2::melt(AUC_Values, id.var = 'k') # melt both dataframes
Value <- AUC_Values_Melted$value # assign variable to a vector
Variable <- AUC_Values_Melted$variable # assign variable to a vector
K <- AUC_Values_Melted$k # assign variable to a vector
# Generate the plot:
#===================
ggplot(AUC_Values_Melted, aes(x = K, y = Value, col = Variable)) +
geom_point(alpha = 20/40, color = "blue") + # showing values as scatter plots
geom_line() + # one line per classifier type
xlab(label = "k value") +
ylab(label = "AUC") +
ggtitle("Comparison of AUC Results of Supervised Versus Unsupervised KNN Classifiers on Non-Normalized Dataset")
Fig 9: Line graph of AUC results for supervised versus unsupervised KNN classifiers on non-normalized data for k values from 1 to 100.
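The comparison table below (and the Auc_Results object reused later in the summary) can be produced in the same way as the normalised-data table further down; a sketch assuming the merged AUC_Values data frame created above:
Auc_Results <- AUC_Values[c(5,25,100), 1:3] # select the rows for k = 5, 25 and 100
row.names(Auc_Results) <- NULL # remove default row names
pander(Auc_Results, style = 'rmarkdown', caption = "AUC results for KNN Classifiers on non-normalized data") # make table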
| k | AUC_Unsuper | AUC_Super |
|---|---|---|
| 5 | 0.8241 | 0.9528 |
| 25 | 0.8922 | 0.9562 |
| 100 | 0.9055 | 0.9086 |
The choice of k has a noticeable effect on the KNN classifiers according to the results in the table. The maximum AUC for the supervised classifier is obtained at k = 8 (shown previously), whereas the maximum AUC for the unsupervised detector occurs at k = 81 (shown below). The table also shows that the unsupervised detector's performance improves as k increases, whilst the supervised classifier's performance declines at higher values of k. When k = 1, the classifier has low bias but very high variance; as k grows, it corresponds to a classifier with lower variance but higher bias. To understand the critical level of flexibility for the success of the classifier, we would further need to investigate the training and test error rates and their relationship, analyse the decision boundary at different values of k, and take the bias-variance trade-off into consideration, which is a difficult task.
Maximum AUC results calculation for unsupervised KNN Classifiers:
# Maximum AUC results calculation with the corresponding k parameter for k = 1:100:
#================================================================================
Auc_Unsuper_Df %>% filter(AUC_Unsuper == max(AUC_Unsuper))
## AUC_Unsuper k
## 1 0.9067752 81
# Create a function for min-max normalisation of the dataset to the range [0, 1]:
#==========================================================================
normalise <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
# Create new dataframe of all predictor variables with normalised data:
#=======================================================================
PB_Predictors_Scaled <- as.data.frame(lapply(PB_Predictors, normalise))
Calculate AUC-ROC values of unsupervised KNN classifiers on the normalised dataset
# Redefine KNN_outliers() to compute AUC-ROC on the normalised predictors, then call it:
#=================================
KNN_outliers <- function(PB_Predictors_Scaled, PB_Class, k) {
KNN_Outlier <- kNNdist(x = PB_Predictors_Scaled, k = k, all = TRUE)[,k]
AUC <- rocplot(predicted = KNN_Outlier, observed = PB_Class)
cat("K-value:", k, ", AUC:", AUC, fill = TRUE)
}
KNN_outliers(PB_Predictors_Scaled, PB_Class, 5)
## K-value: 5 , AUC: 0.8362042
KNN_outliers(PB_Predictors_Scaled, PB_Class, 25)
## K-value: 25 , AUC: 0.885583
KNN_outliers(PB_Predictors_Scaled, PB_Class, 100)
## K-value: 100 , AUC: 0.8991544
Calculate AUC-ROC of supervised KNN classifiers with LOOCV cross-validation on the normalised dataset
# Redefine ROC_AUC_knn.cv() for the normalised dataset, then call it:
#====================================
ROC_AUC_knn.cv <- function(PB_Predictors_Scaled, PB_Class, k){
Pred_Class <- knn.cv(train = PB_Predictors_Scaled, cl = PB_Class, k = k, prob = TRUE)
Pred_Prob <- attr(Pred_Class, "prob")
Pred_Prob <- ifelse(Pred_Class == 1, Pred_Prob, 1 - Pred_Prob)
AUC <- rocplot(predicted = Pred_Prob, observed = PB_Class) # call rocplot()
cat("K-value:", k, ", AUC:", AUC, fill = TRUE)
}
ROC_AUC_knn.cv(PB_Predictors_Scaled, PB_Class, 5)
## K-value: 5 , AUC: 0.9564151
ROC_AUC_knn.cv(PB_Predictors_Scaled, PB_Class, 25)
## K-value: 25 , AUC: 0.9650277
ROC_AUC_knn.cv(PB_Predictors_Scaled, PB_Class, 100)
## K-value: 100 , AUC: 0.9303163
Plot the AUC results for supervised versus unsupervised KNN Classifiers
# Create a for loop to generate AUC values for 100 k's for supervised Classifiers:
#===================================================================================
set.seed(0) # random seed
AUC_Super_Norm <- rep(0, 100) # create an empty vector
for (k in 1:100) {
Pred_Class <- knn.cv(train = PB_Predictors_Scaled, cl = PB_Class, k = k, prob = T)
Pred_Prob <- attr(Pred_Class, "prob")
Pred_Prob <- ifelse(Pred_Class == 1, Pred_Prob, 1 - Pred_Prob)
AUC_Super_Norm[k] <- AUC_Calc(predicted = Pred_Prob, observed = PB_Class)
}
# Create a for loop to generate AUC values for 100 k's for unsupervised Classifiers:
#===================================================================================
set.seed(0) # random seed
AUC_Unsuper_Norm <- rep(0, 100) # create an empty vector
for (k in 1:100) {
KNN_Outlier <- kNNdist(x = PB_Predictors_Scaled, k = k, all = TRUE)[,k]
AUC_Unsuper_Norm[k] <- AUC_Calc(predicted = KNN_Outlier, observed = PB_Class)
}
## Create a dataframe containing AUC values of supervised and unsupervised classifications in order
# to plot them together with ggplot2:
#=================================================================================================
Auc_Super_Df_Norm <- as.data.frame(AUC_Super_Norm) # create dataframe of AUC supervised values first
Auc_Super_Df_Norm <- cbind(Auc_Super_Df_Norm, k = 1:nrow(Auc_Super_Df_Norm)) # add in k column as index
Auc_Unsuper_Df_Norm <- as.data.frame(AUC_Unsuper_Norm) # create dataframe with AUC unsupervised values
Auc_Unsuper_Df_Norm <- cbind(Auc_Unsuper_Df_Norm, k = 1:nrow(Auc_Unsuper_Df_Norm)) # add in k column as index
AUC_Values_Norm = merge(Auc_Unsuper_Df_Norm, Auc_Super_Df_Norm, by = "k") # merge both dataframes by k
AUC_Values_Melted_Norm <- reshape2::melt(AUC_Values_Norm, id.var = 'k') # melt both dataframes
Value <- AUC_Values_Melted_Norm$value # assign variable to a vector
Variable <- AUC_Values_Melted_Norm$variable # assign variable to a vector
K <- AUC_Values_Melted_Norm$k # assign variable to a vector
# Generate the plot:
#===================
ggplot(AUC_Values_Melted_Norm, aes(x = K, y = Value, col = Variable)) +
geom_point(alpha = 20/40, color = "blue") + # showing values as scatter plots
geom_line() + # one line per classifier type
xlab(label = "k value") +
ylab(label = "AUC") +
ggtitle("Comparison of AUC Result for Supervised versus Unsupervised KNN Classifers on Normalized Dataset")
Fig 10: Line graph of AUC results for supervised versus unsupervised KNN classifiers on normalized data for k values from 1 to 100.
Maximum AUC result calculation for supervised KNN classifiers:
Auc_Super_Df_Norm %>% filter(AUC_Super_Norm == max(AUC_Super_Norm))
## AUC_Super_Norm k
## 1 0.9719177 14
Maximum AUC result calculation for unsupervised KNN classifiers:
Auc_Unsuper_Df_Norm %>% filter(AUC_Unsuper_Norm == max(AUC_Unsuper_Norm))
## AUC_Unsuper_Norm k
## 1 0.9011379 17
Comparison of AUC results on normalized data
Auc_Results_Norm <- AUC_Values_Norm[c(5,25,100), 1:3] # select 5th, 25th and 100th row
row.names(Auc_Results_Norm) <- NULL # remove default row names
pander(Auc_Results_Norm, style = 'rmarkdown', caption = "AUC results for KNN Classifiers on normalized data") # make table
| k | AUC_Unsuper_Norm | AUC_Super_Norm |
|---|---|---|
| 5 | 0.8362 | 0.9564 |
| 25 | 0.8856 | 0.965 |
| 100 | 0.8992 | 0.9303 |
Summary of AUC results
Overall_Results <- merge(Auc_Results, Auc_Results_Norm, by = "k") # merge both result dataframes by k
pander(Overall_Results, style = 'rmarkdown', caption = "Comparison of AUC results for KNN Classifiers on non-normalized and normalized dataset") # make table
| k | AUC_Unsuper | AUC_Super | AUC_Unsuper_Norm | AUC_Super_Norm |
|---|---|---|---|---|
| 5 | 0.8241 | 0.9528 | 0.8362 | 0.9564 |
| 25 | 0.8922 | 0.9562 | 0.8856 | 0.965 |
| 100 | 0.9055 | 0.9086 | 0.8992 | 0.9303 |
According to the AUC results, the supervised KNN classifiers stand out on both the normalised and the non-normalised datasets (blue line in Fig 9 and Fig 10). Among the three values of k considered, the supervised KNN classifier with k = 25 on the normalised data achieves the highest AUC, approximately 0.97, and is the winner in this assessment.
It has been shown that the class labels (PB_class) are highly unbalanced, with genuine stamps greatly outnumbering forged ones. This is inconvenient for supervised learning when we try to classify future observations correctly. Despite this imbalance, the unsupervised KNN outlier detectors perform well, which confirms that unsupervised methods are a valuable tool for real-world applications where class labels do not exist. Normalisation of the dataset also plays a role in the performance obtained, although the AUC values of the unsupervised KNN detectors decrease slightly after normalisation.
References