We will apply hierarchical clustering on the seeds dataset.
This dataset consists of measurements of geometrical properties of kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian.
It has variables that describe properties of the kernels, such as area, perimeter and asymmetry coefficient. There are 70 observations for each variety of wheat. You can find the details of the dataset here:
https://archive.ics.uci.edu/ml/datasets/seeds#
M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, ‘A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images’, in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.
We need the following package:
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.16.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
file_loc <- 'seeds_dataset.txt'   # path to the tab-separated data file
seeds_df <- read.csv(file_loc, sep = '\t', header = FALSE)
feature_name <- c('area','perimeter','compactness','length.of.kernel','width.of.kernal','asymmetry.coefficient','length.of.kernel.groove','type.of.seed')
colnames(seeds_df) <- feature_name   # the file has no header row, so set the column names ourselves
Basic Features of the dataset:
str(seeds_df)
## 'data.frame': 210 obs. of 8 variables:
## $ area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ compactness : num 0.871 0.881 0.905 0.895 0.903 ...
## $ length.of.kernel : num 5.76 5.55 5.29 5.32 5.66 ...
## $ width.of.kernal : num 3.31 3.33 3.34 3.38 3.56 ...
## $ asymmetry.coefficient : num 2.22 1.02 2.7 2.26 1.35 ...
## $ length.of.kernel.groove: num 5.22 4.96 4.83 4.8 5.17 ...
## $ type.of.seed : int 1 1 1 1 1 1 1 1 1 1 ...
summary(seeds_df)
## area perimeter compactness length.of.kernel
## Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899
## 1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262
## Median :14.36 Median :14.32 Median :0.8734 Median :5.524
## Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629
## 3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675
## width.of.kernal asymmetry.coefficient length.of.kernel.groove type.of.seed
## Min. :2.630 Min. :0.7651 Min. :4.519 Min. :1
## 1st Qu.:2.944 1st Qu.:2.5615 1st Qu.:5.045 1st Qu.:1
## Median :3.237 Median :3.5990 Median :5.223 Median :2
## Mean :3.259 Mean :3.7002 Mean :5.408 Mean :2
## 3rd Qu.:3.562 3rd Qu.:4.7687 3rd Qu.:5.877 3rd Qu.:3
## Max. :4.033 Max. :8.4560 Max. :6.550 Max. :3
Check whether the dataset contains any missing values:
any(is.na(seeds_df))
## [1] FALSE
We will now store the labels in a separate variable and exclude the type.of.seed column from the dataset in order to do the clustering. Later we will use the true labels to check how good the clustering turned out to be.
seeds_label <- seeds_df$type.of.seed
seeds_df$type.of.seed <- NULL
str(seeds_df)
## 'data.frame': 210 obs. of 7 variables:
## $ area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ compactness : num 0.871 0.881 0.905 0.895 0.903 ...
## $ length.of.kernel : num 5.76 5.55 5.29 5.32 5.66 ...
## $ width.of.kernal : num 3.31 3.33 3.34 3.38 3.56 ...
## $ asymmetry.coefficient : num 2.22 1.02 2.7 2.26 1.35 ...
## $ length.of.kernel.groove: num 5.22 4.96 4.83 4.8 5.17 ...
We will later append the cluster assignments back to the original dataframe under the column name cluster with mutate() from the dplyr package, and count how many observations were assigned to each cluster with the count() function.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Notice that the scales of the features are different, so we need to normalize them.
(Each observation's feature values are treated as coordinates in n-dimensional space, where n is the number of features, and the distances between these coordinates are then calculated. If the coordinates are not normalized, features measured on larger numeric scales dominate the distances and can lead to misleading results.)
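As a quick illustration of this effect (a minimal sketch; rows 1 and 2 are arbitrary examples), compare the distance between two observations before and after scaling:
dist(seeds_df[1:2, ])          # raw distance: features with larger numeric ranges carry more weight
dist(scale(seeds_df)[1:2, ])   # standardized distance: each feature contributes on a comparable scale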
seeds_df_sc <- as.data.frame(scale(seeds_df))
summary(seeds_df_sc)
## area perimeter compactness length.of.kernel
## Min. :-1.4632 Min. :-1.6458 Min. :-2.6619 Min. :-1.6466
## 1st Qu.:-0.8858 1st Qu.:-0.8494 1st Qu.:-0.5967 1st Qu.:-0.8267
## Median :-0.1693 Median :-0.1832 Median : 0.1037 Median :-0.2371
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8446 3rd Qu.: 0.8850 3rd Qu.: 0.7100 3rd Qu.: 0.7927
## Max. : 2.1763 Max. : 2.0603 Max. : 2.0018 Max. : 2.3619
## width.of.kernal asymmetry.coefficient length.of.kernel.groove
## Min. :-1.6642 Min. :-1.95210 Min. :-1.8090
## 1st Qu.:-0.8329 1st Qu.:-0.75734 1st Qu.:-0.7387
## Median :-0.0572 Median :-0.06731 Median :-0.3766
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.8026 3rd Qu.: 0.71068 3rd Qu.: 0.9541
## Max. : 2.0502 Max. : 3.16303 Max. : 2.3234
Now the mean of all the columns is 0 and the standard deviation is 1.
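A quick sanity check confirms this:
round(colMeans(seeds_df_sc), 10)   # all (numerically) zero
apply(seeds_df_sc, 2, sd)          # all equal to 1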
We will use the Euclidean distance method. The method argument of dist() must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"; any unambiguous substring can be given.
dist_mat <- dist(seeds_df_sc, method = 'euclidean')
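For illustration only, any of the other supported metrics could be computed the same way; the rest of the tutorial uses the Euclidean matrix.
dist_mat_manhattan <- dist(seeds_df_sc, method = 'manhattan')   # not used below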
Now we decide which linkage method we want to use and proceed with the hierarchical clustering.
We can try several linkage methods and later decide which one performed better (see the sketch after the dendrogram plot below).
Here we will proceed with the average linkage method.
hclust_avg <- hclust(dist_mat, method = 'average')
plot(hclust_avg)
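As mentioned above, other linkage methods can be fitted on the same distance matrix for comparison. Here is a minimal sketch; the method names are standard hclust() options:
other_linkages <- c('complete', 'single', 'ward.D2')
par(mfrow = c(1, 3))                 # show the three dendrograms side by side
for (m in other_linkages) {
  plot(hclust(dist_mat, method = m), main = paste(m, 'linkage'), labels = FALSE)
}
par(mfrow = c(1, 1))                 # reset the plotting layout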
Next, we can cut the dendrogram to create the desired number of clusters.
Since we already know there are three varieties of wheat, we choose k = 3 clusters; as the dendrogram shows, cutting at a height of about h = 3 also gives three clusters.
We use R's cutree() function with hclust_avg as the first argument and k = 3 as the second.
cut_avg <- cutree(hclust_avg, k = 3)          # cluster membership for each observation
plot(hclust_avg)
rect.hclust(hclust_avg, k = 3, border = 2:6)  # draw rectangles around the three clusters
abline(h = 3, col = 'red')                    # mark the cut height
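If you want to check whether the height-based cut mentioned above agrees with the count-based one, you can cross-tabulate the two (they may or may not match exactly, depending on the dendrogram heights):
table(cutree(hclust_avg, k = 3), cutree(hclust_avg, h = 3))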
We can also colour the dendrogram branches by cluster using dendextend's color_branches():
avg_dend_obj <- as.dendrogram(hclust_avg)
avg_col_dend <- color_branches(avg_dend_obj, h = 3)
plot(avg_col_dend)
Now we append the cluster assignments back to the dataframe under the column name cluster with mutate(), and count how many observations were assigned to each cluster with count().
seeds_df_cl <- mutate(seeds_df, cluster = cut_avg)
count(seeds_df_cl,cluster)
## cluster n
## 1 1 65
## 2 2 75
## 3 3 70
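Since we set the true labels aside earlier in seeds_label, we can cross-tabulate them against the cluster assignments to see how well the clustering recovered the three varieties:
table(seeds_df_cl$cluster, seeds_label)   # rows: clusters, columns: true seed types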
In many cases we don't actually have the true labels. In those cases, as already discussed, you can use other measures such as maximizing Dunn's index, which can be calculated with the dunn() function from the clValid package. When you do have the true labels, you can also consider validating the results by splitting the data into train and test sets, just as in other machine learning workflows, and clustering each part.
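A minimal sketch of the Dunn index calculation, assuming the clValid package is installed:
library(clValid)
# higher values of the Dunn index indicate compact, well-separated clusters
dunn(distance = dist_mat, clusters = cut_avg)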