by Martynas Rakickis
There are many ways to solve a classification problem, and in this work I will be looking at using the k-means clustering method
to classify the seeds data set provided by M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, ‘A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images’, in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24. [https://archive.ics.uci.edu/ml/datasets/seeds]
I will use the following packages for data processing and visualisation:
library(readr)
library(dplyr)
library(here)
library(ggplot2)
Let's start by loading the data set and exploring the data:
src_path = here("data set", "seeds_dataset.txt")
seeds = readr::read_delim(file = src_path, delim = "\t", col_names = FALSE)
head(seeds)
## # A tibble: 6 x 8
## X1 X2 X3 X4 X5 X6 X7 X8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15.3 14.8 0.871 5.76 3.31 2.22 5.22 1
## 2 14.9 14.6 0.881 5.55 3.33 1.02 4.96 1
## 3 14.3 14.1 0.905 5.29 3.34 2.70 4.82 1
## 4 13.8 13.9 0.896 5.32 3.38 2.26 4.80 1
## 5 16.1 15.0 0.903 5.66 3.56 1.36 5.18 1
## 6 14.4 14.2 0.895 5.39 3.31 2.46 4.96 1
dim(seeds)
## [1] 210 8
As per the description of the data set, these are the attributes:
1. area A,
2. perimeter P,
3. compactness C = 4πA/P² (see the sanity check below),
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove,
8. wheat variety: Kama (1), Rosa (2) and Canadian (3).
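As a quick sanity check of the compactness formula (my own verification, not part of the original analysis), we can recompute C from the raw area (X1) and perimeter (X2) columns and compare it to the provided compactness column (X3); the two should agree up to rounding:
# Recompute compactness C = 4*pi*A / P^2 from area (X1) and perimeter (X2);
# compare against the provided compactness column (X3)
head(4 * pi * seeds$X1 / seeds$X2^2)
head(seeds$X3)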
So let's change the column names to something more meaningful for easier processing and add the wheat variety labels:
colnames(seeds) = c("area", "perimeter", "compactness", "kernel_length",
                    "kernel_width", "asymmetry_coef", "kernel_groove_length",
                    "wheat_variety")
seeds = mutate(seeds,
               wheat_variety = case_when(wheat_variety == 1 ~ "Kama",
                                         wheat_variety == 2 ~ "Rosa",
                                         wheat_variety == 3 ~ "Canadian")
) %>% mutate_if(is.character, factor)
head(seeds)
## # A tibble: 6 x 8
## area perimeter compactness kernel_length kernel_width asymmetry_coef
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15.3 14.8 0.871 5.76 3.31 2.22
## 2 14.9 14.6 0.881 5.55 3.33 1.02
## 3 14.3 14.1 0.905 5.29 3.34 2.70
## 4 13.8 13.9 0.896 5.32 3.38 2.26
## 5 16.1 15.0 0.903 5.66 3.56 1.36
## 6 14.4 14.2 0.895 5.39 3.31 2.46
## # ... with 2 more variables: kernel_groove_length <dbl>, wheat_variety <fct>
A general summary of the data set is as follows:
summary(seeds)
## area perimeter compactness kernel_length
## Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899
## 1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262
## Median :14.36 Median :14.32 Median :0.8734 Median :5.524
## Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629
## 3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675
## kernel_width asymmetry_coef kernel_groove_length wheat_variety
## Min. :2.630 Min. :0.7651 Min. :4.519 Canadian:70
## 1st Qu.:2.944 1st Qu.:2.5615 1st Qu.:5.045 Kama :70
## Median :3.237 Median :3.5990 Median :5.223 Rosa :70
## Mean :3.259 Mean :3.7002 Mean :5.408
## 3rd Qu.:3.562 3rd Qu.:4.7687 3rd Qu.:5.877
## Max. :4.033 Max. :8.4560 Max. :6.550
count(seeds, wheat_variety)
## # A tibble: 3 x 2
## wheat_variety n
## <fct> <int>
## 1 Canadian 70
## 2 Kama 70
## 3 Rosa 70
There are no NA values, and the clustering target is equally distributed, with 70 items of each wheat variety.
One of the more widely used clustering methods is kmeans, implemented in R in the stats package as the kmeans function. An important thing to consider when clustering is the number of clusters we expect to have once the process has finished. Here both the data set description and the data itself tell us that there are three different groups of wheat. If the number of clusters is not known beforehand, there are methods to estimate it; commonly used ones are the elbow method, the silhouette method and the gap statistic (a minimal sketch of the elbow method follows the scaled summary below). Another thing to consider is whether or not to scale and/or center the clustering attributes. As our attributes are on different scales, we should scale the data first and only then try clustering.
seeds_scaled = scale(seeds[,1:7], center = TRUE, scale = TRUE)
seeds_scaled = as_tibble(seeds_scaled)
seeds_scaled$wheat_variety = seeds$wheat_variety
summary(seeds_scaled)
## area perimeter compactness kernel_length
## Min. :-1.4632 Min. :-1.6458 Min. :-2.6619 Min. :-1.6466
## 1st Qu.:-0.8858 1st Qu.:-0.8494 1st Qu.:-0.5967 1st Qu.:-0.8267
## Median :-0.1693 Median :-0.1832 Median : 0.1037 Median :-0.2371
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8446 3rd Qu.: 0.8850 3rd Qu.: 0.7100 3rd Qu.: 0.7927
## Max. : 2.1763 Max. : 2.0603 Max. : 2.0018 Max. : 2.3619
## kernel_width asymmetry_coef kernel_groove_length wheat_variety
## Min. :-1.6642 Min. :-1.95210 Min. :-1.8090 Canadian:70
## 1st Qu.:-0.8329 1st Qu.:-0.75734 1st Qu.:-0.7387 Kama :70
## Median :-0.0572 Median :-0.06731 Median :-0.3766 Rosa :70
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.8026 3rd Qu.: 0.71068 3rd Qu.: 0.9541
## Max. : 2.0502 Max. : 3.16303 Max. : 2.3234
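As an aside, here is the minimal sketch of the elbow method mentioned above: we run kmeans for a range of candidate k values and look for the "elbow" where the total within-cluster sum of squares stops dropping sharply. The set.seed() call and the range of 1 to 10 are my own choices for illustration:
set.seed(42)  # for reproducible kmeans starts; not part of the original analysis
wss = sapply(1:10, function(k) {
  kmeans(seeds_scaled[, 1:7], centers = k, iter.max = 100, nstart = 5)$tot.withinss
})
ggplot(tibble(k = 1:10, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:10, name = "Number of clusters k") +
  scale_y_continuous(name = "Total within-cluster sum of squares")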
Let's run the kmeans method:
km = kmeans(seeds_scaled[,1:7], centers = 3, iter.max = 100, nstart = 5)
print(km)
## K-means clustering with 3 clusters of sizes 71, 67, 72
##
## Cluster means:
## area perimeter compactness kernel_length kernel_width asymmetry_coef
## 1 -0.1407831 -0.1696372 0.4485346 -0.2571999 0.001643014 -0.66034079
## 2 1.2536860 1.2589580 0.5591283 1.2349319 1.162075101 -0.04511157
## 3 -1.0277967 -1.0042491 -0.9626050 -0.8955451 -1.082995635 0.69314821
## kernel_groove_length
## 1 -0.5844965
## 2 1.2892273
## 3 -0.6233191
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 3 1 1 1 1 1 3 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 1 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 144.4586 139.5542 144.5954
## (between_SS / total_SS = 70.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The output component I am interested in is “cluster”, which contains the cluster assignment for each data point. Let's add those assignments to our data set:
seeds_scaled$Cluster = km$cluster
count(seeds_scaled, Cluster)
## # A tibble: 3 x 2
## Cluster n
## <int> <int>
## 1 1 71
## 2 2 67
## 3 3 72
It seems the clustering has not worked perfectly: the groups should be equally split 70/70/70, so let's try to find out what the results look like visualised.
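Before visualising, a quick side check (my own addition, not part of the original output) is to cross-tabulate the kmeans assignments against the true varieties; this shows exactly where the 71/67/72 split disagrees with the 70/70/70 ground truth:
# Cross-tabulate cluster assignments against the actual wheat varieties
table(seeds_scaled$Cluster, seeds_scaled$wheat_variety)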
We have 7 continuous attributes, so we would need a 7-dimensional plot to visualise all of them. Luckily, we will not have to do that: we can use Principal Component Analysis (PCA) to reduce the dimensionality from the available seven attributes down to two for a 2-D plot. I will run PCA on the scaled and centered data set to make sure all attributes are on the same scale:
km_pca = prcomp(seeds_scaled[,1:7])
print(km_pca)
## Standard deviations (1, .., p=7):
## [1] 2.24303392 1.09433672 0.82340964 0.26146601 0.13679769 0.07302086 0.02850258
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5
## area -0.4444735 0.02656355 -0.02587094 0.19363997 -0.20441167
## perimeter -0.4415715 0.08400282 0.05983912 0.29545659 -0.17427591
## compactness -0.2770174 -0.52915125 -0.62969178 -0.33281640 0.33265481
## kernel_length -0.4235633 0.20597518 0.21187966 0.26340659 0.76609839
## kernel_width -0.4328187 -0.11668963 -0.21648338 0.19963039 -0.46536555
## asymmetry_coef 0.1186925 0.71688203 -0.67950584 0.09246481 0.03625822
## kernel_groove_length -0.3871608 0.37719327 0.21389720 -0.80414995 -0.11134657
## PC6 PC7
## area 0.42643686 0.734805689
## perimeter 0.47623853 -0.670751532
## compactness 0.14162884 -0.072552703
## kernel_length -0.27357647 0.046276051
## kernel_width -0.70301171 -0.039289079
## asymmetry_coef 0.01964186 -0.003723456
## kernel_groove_length -0.04282974 -0.034498098
PCA transforms the attributes of the data set into principal components (PC1 to PC7), each carrying a share of the data set's variance. We can check how much of the variance each principal component explains by extracting its standard deviation and squaring it to obtain the variance:
pca_tbl = tibble(proportional_variance = km_pca$sdev^2 / sum(km_pca$sdev^2), PC = paste0("PC", 1:7))
print(pca_tbl)
## # A tibble: 7 x 2
## proportional_variance PC
## <dbl> <chr>
## 1 0.719 PC1
## 2 0.171 PC2
## 3 0.0969 PC3
## 4 0.00977 PC4
## 5 0.00267 PC5
## 6 0.000762 PC6
## 7 0.000116 PC7
Let's visualise the cumulative variance of these principal components:
ggplot(pca_tbl, aes(x = 1:7, y = cumsum(proportional_variance))) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:7, labels = pca_tbl$PC, name = "Principal Component") +
  scale_y_continuous(name = "Cumulative Variance Explained", breaks = seq(from = 0.6, to = 1, by = 0.05), labels = scales::percent_format(accuracy = 1)) +
  labs(caption = "Fig. 1 Explaining dataset variance using PCA")
This plot shows that using only the first two principal components, PC1 and PC2, we can explain about 89% of the data set’s variance. The next step is to plot the data along these two principal components:
seeds_scaled[,c("PC1", "PC2")] = km_pca$x[,1:2]
p1 = ggplot(seeds_scaled, aes(x = PC1, y = PC2, color = wheat_variety )) +
geom_point(size = 3) + scale_color_brewer(palette = "Paired") +
labs(colour = "Wheat Variety", caption = "Fig. 2 Wheat Variety plotted in 2D space by using PCA coordinates of first two principal components")
p1
Let's add a convex hull for each variety to this plot to make the group boundaries more visible. First we need to calculate which points comprise each hull:
hull = seeds_scaled %>% group_by(wheat_variety) %>% slice(chull(PC1, PC2))
And let's add the convex hulls:
p1 + geom_polygon(data = hull, alpha = 0) +
  labs(color = "Wheat Variety", caption = "Fig. 3 Same as Fig. 2, but with each wheat variety gathered in its convex hull")
Since the convex hulls of the different wheat varieties intersect, the clustering results will likely not be 100% accurate.
Now let's color the data points based on the kmeans clustering results:
ggplot(seeds_scaled, aes(x = PC1, y = PC2, color = factor(Cluster))) +
  geom_point(size = 5) +
  scale_color_brewer(palette = "Paired") +
  geom_polygon(data = seeds_scaled %>% group_by(Cluster) %>% slice(chull(PC1, PC2)), alpha = 0) +
  geom_point(aes(color = wheat_variety), size = 2) +
  labs(color = "Wheat Variety", caption = "Fig. 4 Clustering results overlaid on top of the actual labels: 1/2/3 are the kmeans assignments, Kama/Rosa/Canadian are the actual labels")
Fig. 4 shows the wheat varieties inside convex hulls based on the clustering assignments (Fig. 3 showed convex hulls based on the actual labels). kmeans assigned the groups in such a way that the different groups’ convex hulls do not intersect with each other.
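To put a rough number on the agreement between clusters and labels, here is a hedged sketch (my own addition): it maps each cluster to its most frequent variety and computes the proportion of points whose label matches their cluster's majority. The name cluster_map is illustrative, not from any package, and the sketch assumes no ties in the majority counts:
# Map each cluster to its most frequent wheat variety
cluster_map = seeds_scaled %>%
  count(Cluster, wheat_variety) %>%
  group_by(Cluster) %>%
  filter(n == max(n)) %>%
  select(Cluster, majority_variety = wheat_variety)
# Proportion of points whose cluster's majority variety matches their label
seeds_scaled %>%
  left_join(cluster_map, by = "Cluster") %>%
  summarise(agreement = mean(majority_variety == wheat_variety))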