Statistical classification, more commonly known simply as classification, is the problem of assigning observations to the set of categories they belong to. In this example, both a supervised and an unsupervised method are used to classify the vertebral column dataset. A summary of the dataset is shown below. Based on the summary, the vertebral column data consists of 310 observations and 7 attributes: 6 numerical and 1 categorical.
More information about the data can be found here: https://archive.ics.uci.edu/ml/datasets/Vertebral+Column
# Clustering & Classification
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization - fviz
library(ggplot2) # general plotting
library(GGally) # ggpairs for the pair plot
library(dplyr) # data manipulation
library(NbClust) # determine the optimal number of clusters for k-means
library(caret) # model training and evaluation (train, confusionMatrix)
library(e1071) # required by some caret functions
library(gridExtra) # grid.arrange for arranging plots side by side
# Read the whitespace-delimited data file (no header row)
df <- read.delim('D:/Dataset/vertebral_column_data.txt', header = FALSE, sep = '')
summary(df)
## V1 V2 V3 V4
## Min. : 26.15 Min. :-6.55 Min. : 14.00 Min. : 13.37
## 1st Qu.: 46.43 1st Qu.:10.67 1st Qu.: 37.00 1st Qu.: 33.35
## Median : 58.69 Median :16.36 Median : 49.56 Median : 42.41
## Mean : 60.50 Mean :17.54 Mean : 51.93 Mean : 42.95
## 3rd Qu.: 72.88 3rd Qu.:22.12 3rd Qu.: 63.00 3rd Qu.: 52.69
## Max. :129.83 Max. :49.43 Max. :125.74 Max. :121.43
## V5 V6 V7
## Min. : 70.08 Min. :-11.06 AB:210
## 1st Qu.:110.71 1st Qu.: 1.60 NO:100
## Median :118.27 Median : 11.77
## Mean :117.92 Mean : 26.30
## 3rd Qu.:125.47 3rd Qu.: 41.28
## Max. :163.07 Max. :418.54
Before the data is used for either analysis, it is checked for missing values (NAs). The columns are renamed based on the metadata to improve readability. The data is then standardised using the ‘scale()’ function in R to allow meaningful comparison across attributes. Principal component analysis (PCA) is performed on the standardised data to reduce its dimensionality so that the results can be visualised more easily.
colnames(df)[1:7] <- c('pelvic_incidence','pelvic_tilt','lumbar_lordosis','sacral_slope','pelvic_radius','spondylolisthesis','class' )
# Check for missing values (none are reported in the summary above)
# and remove any NAs, if present
df <- na.omit(df)
# Scale / Standardise the data
df_std <- df %>% mutate_at(.vars = vars(-7), .funs = scale)
# scale() returns one-column matrices; convert them back to plain numeric vectors
df_std$pelvic_incidence <- as.numeric(df_std$pelvic_incidence)
df_std$pelvic_tilt <- as.numeric(df_std$pelvic_tilt)
df_std$lumbar_lordosis <- as.numeric(df_std$lumbar_lordosis)
df_std$sacral_slope <- as.numeric(df_std$sacral_slope)
df_std$pelvic_radius <- as.numeric(df_std$pelvic_radius)
df_std$spondylolisthesis <- as.numeric(df_std$spondylolisthesis)
A pair plot of the standardised data is shown below, summarising the relationships between the variables using density plots and scatterplots. The two colours on the plot correspond to the two distinct classes in the dataset. Large overlaps between the groups can be observed, indicating that the algorithms may not partition the data perfectly.
ggpairs(df_std, columns = 1:6,
mapping = ggplot2::aes(colour=class,alpha=0.7),
upper = list(continuous = wrap("density", alpha = 0.5), combo = "box_no_facet"),
lower = list(continuous = wrap("points", alpha = 0.3), combo = wrap("dot_no_facet", alpha = 0.4)))
K-means clustering is an unsupervised machine learning algorithm that aims to split the data points into k pre-defined, non-overlapping clusters, with each data point eventually belonging to exactly one group. The goal of k-means clustering is to minimise the total intra-cluster variation while maximising the variation between clusters. The end result is reached when the sum of squared distances between the data points and their cluster centroids is at its minimum.
The centroid positions are optimised iteratively. The smaller the sum of squared distances between the data points and the centroids, the less variation there is within each cluster, and hence the more homogeneous the data points are within the same cluster.
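The objective can be made concrete with a minimal sketch (not part of the original analysis): given a matrix of scaled attributes and a vector of cluster assignments, the quantity k-means minimises is the total within-cluster sum of squares. The function below is illustrative only; ‘df1’ and ‘kmean_model’ refer to objects created later in this section.
# Minimal sketch of the k-means objective: total within-cluster sum of squares.
# `x` is a numeric matrix of (scaled) attributes, `assignment` a vector of cluster labels.
total_withinss <- function(x, assignment) {
  sum(sapply(unique(assignment), function(k) {
    members  <- x[assignment == k, , drop = FALSE]
    centroid <- colMeans(members)
    sum(sweep(members, 2, centroid)^2)   # squared distances of members to their centroid
  }))
}
# e.g. total_withinss(as.matrix(df1), kmean_model$cluster) should match
# kmean_model$tot.withinss once the model is fitted further below.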
To evaluate the optimal number of pre-defined clusters (k) for the algorithm, four approaches are used: the elbow method, the silhouette method, the gap statistic and the ‘NbClust’ R package. The elbow and silhouette methods determine the optimal number of clusters by optimising a criterion, namely the total within-cluster sum of squares (WSS) and the average silhouette width, respectively. The gap statistic instead compares the observed within-cluster variation against what would be expected under a null reference distribution. Finally, ‘NbClust’ determines the optimal number of clusters by computing a large set of indices (including the Hubert index and D index) and applying a majority rule.
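For reference, the elbow criterion can also be computed directly with base R rather than via the ‘fviz_nbclust’ wrapper used below. This is a sketch only, assuming ‘df1’ (defined in the next code block) holds the scaled numeric attributes.
# Manual elbow computation: total within-cluster SS for k = 1 to 10
set.seed(1226)
wss <- sapply(1:10, function(k) kmeans(df1, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")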
# Create df for kmeans
# exclude the categorical data column
df1 <- df_std[1:6]
# PCs from PCA
# use princomp to return all scores
pca <- princomp(df1, cor=T) # principal components analysis using correlation matrix
pc.comp <- pca$scores
pc.comp1 <- -1*pc.comp[,1] # flip signs (PC signs are arbitrary) for plotting convenience
pc.comp2 <- -1*pc.comp[,2]
pc_combined <- cbind(pc.comp1, pc.comp2)
# cbind() coerces the class factor to its numeric codes (1 = AB, 2 = NO); re-label below
df_new <- as.data.frame(cbind(df_std$class, pc_combined))
names(df_new)[1] <- 'class'
df_new$class <- as.factor(ifelse(df_new$class == 1, 'AB', 'NO'))
# Clustering - K mean
fviz_nbclust(df1, kmeans, method = "wss")
# Elbow method
fviz_nbclust(df1, kmeans, method = "wss") +
geom_vline(xintercept = 2, linetype = 2)+
labs(subtitle = "Elbow method")
# Elbow plot isn't conclusive - could be either 2 or 3
# Silhouette method
fviz_nbclust(df1, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
# Silhouette method selects 2 as the optimal number of clusters
# Gap statistic
# nboot = 100 here to keep the computation reasonably fast;
# nboot = 500 is recommended for a final analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(1226)
fviz_nbclust(df1, kmeans, nstart = 25, method = "gap_stat", nboot = 100)+
labs(subtitle = "Gap statistic method")
# Gap statistic selects 3 as the optimal number of clusters
# Use NbClust
nc_test <- NbClust(data = df1, distance = "euclidean",
min.nc = 2, max.nc = 10, method = 'kmeans',index = 'all')
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 7 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
fviz_nbclust(nc_test)
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 9 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 7 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
The k-means clustering algorithm is then run on the dataset with k = 2, based on the results obtained from the four methods listed above. The figure below shows the cluster plot for the k-means result.
# Select centers = 2 based on the results of the three criteria above and NbClust
kmean_model <- kmeans(df1, centers = 2, nstart = 25)
str(kmean_model)
## List of 9
## $ cluster : int [1:310] 2 2 1 1 2 2 2 2 2 2 ...
## $ centers : num [1:2, 1:6] 0.969 -0.663 0.609 -0.417 0.887 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:6] "pelvic_incidence" "pelvic_tilt" "lumbar_lordosis" "sacral_slope" ...
## $ totss : num 1854
## $ withinss : num [1:2] 684 464
## $ tot.withinss: num 1148
## $ betweenss : num 706
## $ size : int [1:2] 126 184
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
fviz_cluster(kmean_model, data = df1)
kmean_model_cluster <- as.data.frame(kmean_model$cluster)
names(kmean_model_cluster)[1] <- 'class'
# Map cluster 1 to 'AB' and cluster 2 to 'NO' so the clusters can be compared with the actual labels
kmean_model_cluster$class <- as.factor(ifelse(kmean_model_cluster$class == 1, 'AB', 'NO'))
K-nearest neighbours (kNN) classification is a supervised machine learning technique used to assign a class to unlabelled data, based on labelled training data, by exploiting the proximity between data points. Unlike k-means, k here represents the number of nearest neighbours considered rather than the number of pre-defined clusters. The proximity between data points is measured using the Euclidean distance.
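To make the mechanics concrete, the sketch below (illustrative only, not the model used in this analysis) classifies a single query point by hand: compute the Euclidean distances to all labelled training points, keep the k closest and take a majority vote. The ‘train’ and ‘test’ objects refer to the split created in the next code block, and k = 5 is an arbitrary choice.
# Minimal sketch of a single kNN prediction by majority vote among the k nearest neighbours
knn_predict_one <- function(query, train_x, train_y, k = 5) {
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, unlist(query))^2)) # Euclidean distances
  neighbours <- train_y[order(dists)[1:k]]  # classes of the k closest training points
  names(which.max(table(neighbours)))       # majority vote
}
# e.g. knn_predict_one(test[1, 1:6], train[, 1:6], train$class, k = 5)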
The data is divided into training and testing sets using a 75:25 ratio. To determine the optimal number of nearest neighbours, the training data is fitted using 10-fold cross-validation with k set to every integer between 2 and 50. The results of the cross-validation are visualised below; the number of nearest neighbours with the highest accuracy score is 16.
# Classification
# Create 10 fold cross validation - under caret
trControl <- trainControl(method = "cv",number = 10)
n <- nrow(df_std)
# 75:25 train/test split
index <- sample(1:n, n*0.75, replace = FALSE)
train <- df_std[index, ]
test <- df_std[-index, ]
test_pca <- pc_combined[-index,]
# fit model for KNN classification
set.seed(1226)
fit <- train(class ~ .,
method = "knn",
tuneGrid = expand.grid(k = 2:50),
trControl = trControl,
metric = "Accuracy",
data = train)
fit
## k-Nearest Neighbors
##
## 232 samples
## 6 predictor
## 2 classes: 'AB', 'NO'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 208, 208, 210, 209, 209, 209, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 2 0.7716568 0.4579393
## 3 0.8066535 0.5494719
## 4 0.8236989 0.5867044
## 5 0.8060935 0.5613082
## 6 0.8146245 0.5838177
## 7 0.8104578 0.5758807
## 8 0.8282279 0.6089463
## 9 0.8323781 0.6182759
## 10 0.8329216 0.6178013
## 11 0.8367260 0.6270583
## 12 0.8407115 0.6358181
## 13 0.8283926 0.6074799
## 14 0.8195158 0.5858334
## 15 0.8319993 0.6184616
## 16 0.8408926 0.6387022
## 17 0.8407115 0.6380115
## 18 0.8276515 0.6073951
## 19 0.8280303 0.6050902
## 20 0.8323781 0.6140521
## 21 0.8367260 0.6233856
## 22 0.8282115 0.6071578
## 23 0.8106390 0.5721592
## 24 0.8319993 0.6169344
## 25 0.8148057 0.5773493
## 26 0.8276680 0.6039273
## 27 0.8321970 0.6138330
## 28 0.8233037 0.5975102
## 29 0.8280303 0.6052890
## 30 0.8153491 0.5805423
## 31 0.8280138 0.6080406
## 32 0.8276515 0.6038011
## 33 0.8153327 0.5864169
## 34 0.8283762 0.6192920
## 35 0.8323617 0.6251942
## 36 0.8238472 0.6037564
## 37 0.8234684 0.6010030
## 38 0.8279974 0.6170532
## 39 0.8321805 0.6312197
## 40 0.8323617 0.6303715
## 41 0.8363472 0.6400636
## 42 0.8234848 0.6099179
## 43 0.8278327 0.6232803
## 44 0.8234848 0.6135971
## 45 0.8321805 0.6330640
## 46 0.8151515 0.5935146
## 47 0.8319993 0.6336104
## 48 0.8363472 0.6409269
## 49 0.8276350 0.6219673
## 50 0.8276350 0.6206371
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 16.
accuracy <- fit$results[,c(1:2)]
best_K <- fit$bestTune[1,1]
ggplot(accuracy,aes(x=k,y=Accuracy))+
geom_line(size=1)+
xlab('# of K')+
ylab('Score')+
ggtitle(paste0("Accuracy Score for various K, best k = ", best_K))+
theme(plot.title = element_text(hjust = 0.5))+
geom_vline(xintercept = best_K, linetype="dashed", color = "#00BFC4", size=1.5)
# predict test using fit
test_pred <- predict(fit, newdata = test)
test2 <- test
test2$P.class <- test_pred # attach the predicted class to the test set
test3 <- cbind(test2,test_pca)
The positions of the data points relative to the first two principal components (PCs), coloured by actual and predicted class, are plotted below for both the k-means clustering and the kNN classification. As k-means is an unsupervised learning algorithm, all data points are included. The scatterplot of the predicted classes shows a clear boundary separating the data points into the two pre-defined clusters. A confusion matrix of the k-means results gives an accuracy of 66.45%.
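As a quick sanity check (not part of the original output), the same accuracy can be reproduced directly as the proportion of mapped cluster labels that agree with the actual class.
# Proportion of observations whose mapped cluster label matches the actual class;
# this should agree with the accuracy reported by confusionMatrix() below.
mean(kmean_model_cluster$class == df_std$class)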
confusionMatrix(df_std$class, kmean_model_cluster$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction AB NO
## AB 116 94
## NO 10 90
##
## Accuracy : 0.6645
## 95% CI : (0.609, 0.7169)
## No Information Rate : 0.5935
## P-Value [Acc > NIR] : 0.006054
##
## Kappa : 0.3708
##
## Mcnemar's Test P-Value : 3.992e-16
##
## Sensitivity : 0.9206
## Specificity : 0.4891
## Pos Pred Value : 0.5524
## Neg Pred Value : 0.9000
## Prevalence : 0.4065
## Detection Rate : 0.3742
## Detection Prevalence : 0.6774
## Balanced Accuracy : 0.7049
##
## 'Positive' Class : AB
##
# Plot comparison
df_kmean_compare <- cbind(df_std,kmean_model_cluster,pc_combined)
names(df_kmean_compare)[8] <- 'P.Class'
plot_actual_kmean <- ggplot(df_kmean_compare,aes(x=pc.comp1,y=pc.comp2,col=class))+
geom_point(size=2)+
ggtitle('Actual Class')+
theme(plot.title = element_text(hjust = 0.5))
plot_pred_kmean <- ggplot(df_kmean_compare,aes(x=pc.comp1,y=pc.comp2,col=P.Class))+
geom_point(size=2)+
ggtitle('Predicted Class - K-means')+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot_actual_kmean,plot_pred_kmean,nrow=1)
As kNN is a supervised algorithm, only the data points from the testing dataset are included in the scatterplot. The class of each test observation is predicted using the kNN model trained with k = 16 and compared against the actual class. Unlike k-means, the data points are not separated by a clear boundary; however, based on the scatterplot, the class of a data point is more likely to be captured correctly by the kNN algorithm. The comparison of actual and predicted classes is tabulated in a confusion matrix, giving an accuracy of 78.21%.
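The same quick sanity check as before (not part of the original output) reproduces this accuracy directly.
# Proportion of correct test-set predictions; should agree with confusionMatrix() below.
mean(test_pred == test$class)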
confusionMatrix(test_pred, test$class )
## Confusion Matrix and Statistics
##
## Reference
## Prediction AB NO
## AB 43 8
## NO 9 18
##
## Accuracy : 0.7821
## 95% CI : (0.6741, 0.8676)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.01803
##
## Kappa : 0.5143
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8269
## Specificity : 0.6923
## Pos Pred Value : 0.8431
## Neg Pred Value : 0.6667
## Prevalence : 0.6667
## Detection Rate : 0.5513
## Detection Prevalence : 0.6538
## Balanced Accuracy : 0.7596
##
## 'Positive' Class : AB
##
plot_actual_knn <- ggplot(test3,aes(x=pc.comp1,y=pc.comp2,col=class))+
geom_point(size=2)+
ggtitle('Actual Class')+
theme(plot.title = element_text(hjust = 0.5))
plot_pred_knn <- ggplot(test3,aes(x=pc.comp1,y=pc.comp2,col=P.class))+
geom_point(size=2)+
ggtitle('Predicted Class - KNN')+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot_actual_knn,plot_pred_knn,nrow=1)
Based on the results, kNN classification, with the higher accuracy (78.21%), clearly outperforms the k-means clustering algorithm (66.45%). However, the comparison is not entirely meaningful, as supervised and unsupervised methods are not directly comparable. In the supervised algorithm (kNN), the model is trained with labelled data, which helps it evaluate the class of each observation. In contrast, the input data for the (unsupervised) k-means algorithm is unlabelled, making it harder to recover the actual classes. Therefore, the decision to implement a supervised or an unsupervised method should be based on the type of input data available, so that the problem at hand can be solved most effectively.