by Martynas Rakickis
There are many ways to solve a classification problem, and in this work I will be looking at using the k-means clustering method
to classify the seeds data set provided by M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, ‘A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images’, in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24. [https://archive.ics.uci.edu/ml/datasets/seeds]
I will use the following packages for data processing and visualisation:
library(readr)
library(dplyr)
library(here)
library(ggplot2)
Let's start by loading the data set and exploring the data:
src_path = here("data set", "seeds_dataset.txt")
seeds = readr::read_delim(file = src_path, delim = "\t", col_names = FALSE)
head(seeds)
## # A tibble: 6 x 8
## X1 X2 X3 X4 X5 X6 X7 X8
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15.3 14.8 0.871 5.76 3.31 2.22 5.22 1
## 2 14.9 14.6 0.881 5.55 3.33 1.02 4.96 1
## 3 14.3 14.1 0.905 5.29 3.34 2.70 4.82 1
## 4 13.8 13.9 0.896 5.32 3.38 2.26 4.80 1
## 5 16.1 15.0 0.903 5.66 3.56 1.36 5.18 1
## 6 14.4 14.2 0.895 5.39 3.31 2.46 4.96 1
dim(seeds)
## [1] 210 8
As per the description of the data set, these are the attributes:
1. area A,
2. perimeter P,
3. compactness C = 4πA/P² (see the sanity check below),
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove,
8. wheat variety: Kama (1), Rosa (2) and Canadian (3).
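As a quick sanity check of the compactness formula (my own verification, not part of the original analysis), we can recompute C from the raw area (X1) and perimeter (X2) columns and compare it to the provided compactness column (X3); the two should agree up to rounding:
# Recompute compactness C = 4*pi*A / P^2 from area (X1) and perimeter (X2);
# compare against the provided compactness column (X3)
head(4 * pi * seeds$X1 / seeds$X2^2)
head(seeds$X3)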
So let's change the column names to something more meaningful for easier processing and add the wheat variety labels:
colnames(seeds) = c("area", "perimeter", "compactness", "kernel_length",
                    "kernel_width", "asymmetry_coef", "kernel_groove_length",
                    "wheat_variety")
seeds = mutate(seeds,
               wheat_variety = case_when(wheat_variety == 1 ~ "Kama",
                                         wheat_variety == 2 ~ "Rosa",
                                         wheat_variety == 3 ~ "Canadian")
) %>% mutate_if(is.character, factor)
head(seeds)
## # A tibble: 6 x 8
## area perimeter compactness kernel_length kernel_width asymmetry_coef
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15.3 14.8 0.871 5.76 3.31 2.22
## 2 14.9 14.6 0.881 5.55 3.33 1.02
## 3 14.3 14.1 0.905 5.29 3.34 2.70
## 4 13.8 13.9 0.896 5.32 3.38 2.26
## 5 16.1 15.0 0.903 5.66 3.56 1.36
## 6 14.4 14.2 0.895 5.39 3.31 2.46
## # ... with 2 more variables: kernel_groove_length <dbl>, wheat_variety <fct>
A general summary of the data set is as follows:
summary(seeds)
## area perimeter compactness kernel_length
## Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899
## 1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262
## Median :14.36 Median :14.32 Median :0.8734 Median :5.524
## Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629
## 3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675
## kernel_width asymmetry_coef kernel_groove_length wheat_variety
## Min. :2.630 Min. :0.7651 Min. :4.519 Canadian:70
## 1st Qu.:2.944 1st Qu.:2.5615 1st Qu.:5.045 Kama :70
## Median :3.237 Median :3.5990 Median :5.223 Rosa :70
## Mean :3.259 Mean :3.7002 Mean :5.408
## 3rd Qu.:3.562 3rd Qu.:4.7687 3rd Qu.:5.877
## Max. :4.033 Max. :8.4560 Max. :6.550
count(seeds, wheat_variety)
## # A tibble: 3 x 2
## wheat_variety n
## <fct> <int>
## 1 Canadian 70
## 2 Kama 70
## 3 Rosa 70
There are no NA values, and the clustering target is equally distributed, with 70 items of each wheat variety.
One of the more widely used clustering methods is kmeans, implemented in R in the stats package as the kmeans function. An important thing to consider when clustering is the number of clusters we expect to have once the process has finished. Here both the data set description and the data itself tell us that there are three different groups of wheat. If the number of clusters is not known beforehand, there are methods to estimate it; commonly used ones are the elbow method, the silhouette method and the gap statistic (a minimal sketch of the elbow method follows the scaled summary below). Another thing to consider is whether or not to scale and/or center the clustering attributes. As our attributes are on different scales, we should scale the data first and only then try clustering.
seeds_scaled = scale(seeds[,1:7], center = TRUE, scale = TRUE)
seeds_scaled = as_tibble(seeds_scaled)
seeds_scaled$wheat_variety = seeds$wheat_variety
summary(seeds_scaled)
## area perimeter compactness kernel_length
## Min. :-1.4632 Min. :-1.6458 Min. :-2.6619 Min. :-1.6466
## 1st Qu.:-0.8858 1st Qu.:-0.8494 1st Qu.:-0.5967 1st Qu.:-0.8267
## Median :-0.1693 Median :-0.1832 Median : 0.1037 Median :-0.2371
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8446 3rd Qu.: 0.8850 3rd Qu.: 0.7100 3rd Qu.: 0.7927
## Max. : 2.1763 Max. : 2.0603 Max. : 2.0018 Max. : 2.3619
## kernel_width asymmetry_coef kernel_groove_length wheat_variety
## Min. :-1.6642 Min. :-1.95210 Min. :-1.8090 Canadian:70
## 1st Qu.:-0.8329 1st Qu.:-0.75734 1st Qu.:-0.7387 Kama :70
## Median :-0.0572 Median :-0.06731 Median :-0.3766 Rosa :70
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.8026 3rd Qu.: 0.71068 3rd Qu.: 0.9541
## Max. : 2.0502 Max. : 3.16303 Max. : 2.3234
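As an aside, here is the minimal sketch of the elbow method mentioned above: we run kmeans for a range of candidate k values and look for the "elbow" where the total within-cluster sum of squares stops dropping sharply. The set.seed() call and the range of 1 to 10 are my own choices for illustration:
set.seed(42)  # for reproducible kmeans starts; not part of the original analysis
wss = sapply(1:10, function(k) {
  kmeans(seeds_scaled[, 1:7], centers = k, iter.max = 100, nstart = 5)$tot.withinss
})
ggplot(tibble(k = 1:10, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:10, name = "Number of clusters k") +
  scale_y_continuous(name = "Total within-cluster sum of squares")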
Let's run the kmeans method:
km = kmeans(seeds_scaled[,1:7], centers = 3, iter.max = 100, nstart = 5)
print(km)
## K-means clustering with 3 clusters of sizes 71, 67, 72
##
## Cluster means:
## area perimeter compactness kernel_length kernel_width asymmetry_coef
## 1 -0.1407831 -0.1696372 0.4485346 -0.2571999 0.001643014 -0.66034079
## 2 1.2536860 1.2589580 0.5591283 1.2349319 1.162075101 -0.04511157
## 3 -1.0277967 -1.0042491 -0.9626050 -0.8955451 -1.082995635 0.69314821
## kernel_groove_length
## 1 -0.5844965
## 2 1.2892273
## 3 -0.6233191
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 3 1 1 1 1 1 3 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 1 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 144.4586 139.5542 144.5954
## (between_SS / total_SS = 70.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The output component I am interested in is “cluster”, which contains the cluster assignment for each data point. Let's add those assignments to our data set:
seeds_scaled$Cluster = km$cluster
count(seeds_scaled, Cluster)
## # A tibble: 3 x 2
## Cluster n
## <int> <int>
## 1 1 71
## 2 2 67
## 3 3 72
It seems the clustering has not worked perfectly: the groups should be equally split 70/70/70, so let's try to find out what the results look like visualised.
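Before visualising, a quick side check (my own addition, not part of the original output) is to cross-tabulate the kmeans assignments against the true varieties; this shows exactly where the 71/67/72 split disagrees with the 70/70/70 ground truth:
# Cross-tabulate cluster assignments against the actual wheat varieties
table(seeds_scaled$Cluster, seeds_scaled$wheat_variety)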
We have 7 continuous attributes, so we would need a 7-dimensional plot to visualise all of them. Luckily, we will not have to do that: we can use Principal Component Analysis (PCA) to reduce the dimensionality from the available seven attributes down to two for a 2-D plot. I will run PCA on the scaled and centered data set to make sure all attributes are on the same scale:
km_pca = prcomp(seeds_scaled[,1:7])
print(km_pca)
## Standard deviations (1, .., p=7):
## [1] 2.24303392 1.09433672 0.82340964 0.26146601 0.13679769 0.07302086 0.02850258
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5
## area -0.4444735 0.02656355 -0.02587094 0.19363997 -0.20441167
## perimeter -0.4415715 0.08400282 0.05983912 0.29545659 -0.17427591
## compactness -0.2770174 -0.52915125 -0.62969178 -0.33281640 0.33265481
## kernel_length -0.4235633 0.20597518 0.21187966 0.26340659 0.76609839
## kernel_width -0.4328187 -0.11668963 -0.21648338 0.19963039 -0.46536555
## asymmetry_coef 0.1186925 0.71688203 -0.67950584 0.09246481 0.03625822
## kernel_groove_length -0.3871608 0.37719327 0.21389720 -0.80414995 -0.11134657
## PC6 PC7
## area 0.42643686 0.734805689
## perimeter 0.47623853 -0.670751532
## compactness 0.14162884 -0.072552703
## kernel_length -0.27357647 0.046276051
## kernel_width -0.70301171 -0.039289079
## asymmetry_coef 0.01964186 -0.003723456
## kernel_groove_length -0.04282974 -0.034498098
PCA transforms the attributes of the data set into principal components (PC1 to PC7), each carrying a share of the data set's variance. We can check how much of the variance each principal component explains by extracting its standard deviation and squaring it to obtain the variance:
pca_tbl = tibble(proportional_variance = km_pca$sdev^2 / sum(km_pca$sdev^2), PC = paste0("PC", 1:7))
print(pca_tbl)
## # A tibble: 7 x 2
## proportional_variance PC
## <dbl> <chr>
## 1 0.719 PC1
## 2 0.171 PC2
## 3 0.0969 PC3
## 4 0.00977 PC4
## 5 0.00267 PC5
## 6 0.000762 PC6
## 7 0.000116 PC7
Let's visualise the cumulative variance of these principal components:
ggplot(pca_tbl, aes(x = 1:7, y = cumsum(proportional_variance))) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:7, labels = pca_tbl$PC, name = "Principal Component") +
  scale_y_continuous(name = "Cumulative Variance Explained", breaks = seq(from = 0.6, to = 1, by = 0.05), labels = scales::percent_format(accuracy = 1)) +
  labs(caption = "Fig. 1 Explaining dataset variance using PCA")
This plot shows that using only the first two principal components, PC1 and PC2, we can explain about 89% of the data set’s variance. The next step is to plot the data along these two principal components:
seeds_scaled[,c("PC1", "PC2")] = km_pca$x[,1:2]
p1 = ggplot(seeds_scaled, aes(x = PC1, y = PC2, color = wheat_variety )) +
geom_point(size = 3) + scale_color_brewer(palette = "Paired") +
labs(colour = "Wheat Variety", caption = "Fig. 2 Wheat Variety plotted in 2D space by using PCA coordinates of first two principal components")
p1
Let's add a convex hull for each variety to this plot to make the group boundaries more visible. First we need to calculate which points comprise each hull:
hull = seeds_scaled %>% group_by(wheat_variety) %>% slice(chull(PC1, PC2))
And let's add the convex hulls:
p1 + geom_polygon(data = hull, alpha = 0) +
  labs(color = "Wheat Variety", caption = "Fig. 3 Same as Fig. 2, but with each wheat variety gathered in its convex hull")
Since the convex hulls of the different wheat varieties intersect, the clustering results will likely not be 100% accurate.
Now let's color the data points based on the kmeans clustering results:
ggplot(seeds_scaled, aes(x = PC1, y = PC2, color = factor(Cluster))) +
  geom_point(size = 5) +
  scale_color_brewer(palette = "Paired") +
  geom_polygon(data = seeds_scaled %>% group_by(Cluster) %>% slice(chull(PC1, PC2)), alpha = 0) +
  geom_point(aes(color = wheat_variety), size = 2) +
  labs(color = "Wheat Variety", caption = "Fig. 4 Clustering results overlaid on top of the actual labels: 1/2/3 are the kmeans assignments, Kama/Rosa/Canadian are the actual labels")
Fig. 4 shows the wheat varieties inside convex hulls based on the clustering assignments (Fig. 3 showed convex hulls based on the actual labels). kmeans assigned the groups in such a way that the different groups’ convex hulls do not intersect with each other.
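To put a rough number on the agreement between clusters and labels, here is a hedged sketch (my own addition): it maps each cluster to its most frequent variety and computes the proportion of points whose label matches their cluster's majority. The name cluster_map is illustrative, not from any package, and the sketch assumes no ties in the majority counts:
# Map each cluster to its most frequent wheat variety
cluster_map = seeds_scaled %>%
  count(Cluster, wheat_variety) %>%
  group_by(Cluster) %>%
  filter(n == max(n)) %>%
  select(Cluster, majority_variety = wheat_variety)
# Proportion of points whose cluster's majority variety matches their label
seeds_scaled %>%
  left_join(cluster_map, by = "Cluster") %>%
  summarise(agreement = mean(majority_variety == wheat_variety))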