We will apply hierarchical clustering on the seeds dataset.

This dataset consists of measurements of geometrical properties of kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian.

It has variables that describe properties of the seeds, such as area, perimeter, and asymmetry coefficient. There are 70 observations for each variety of wheat. You can find the details about the dataset here:

https://archive.ics.uci.edu/ml/datasets/seeds#

M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, ‘A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images’, in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.

We need the following package:

library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.16.0
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
file_loc <- 'seeds_dataset.txt'
seeds_df <- read.csv(file_loc,sep = '\t',header = FALSE)
feature_name <- c('area','perimeter','compactness','length.of.kernel','width.of.kernal','asymmetry.coefficient','length.of.kernel.groove','type.of.seed')
colnames(seeds_df) <- feature_name

Basic Features of the dataset:

str(seeds_df)
## 'data.frame':    210 obs. of  8 variables:
##  $ area                   : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ perimeter              : num  14.8 14.6 14.1 13.9 15 ...
##  $ compactness            : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ length.of.kernel       : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ width.of.kernal        : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ asymmetry.coefficient  : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ length.of.kernel.groove: num  5.22 4.96 4.83 4.8 5.17 ...
##  $ type.of.seed           : int  1 1 1 1 1 1 1 1 1 1 ...
summary(seeds_df)
##       area         perimeter      compactness     length.of.kernel
##  Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899   
##  1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262   
##  Median :14.36   Median :14.32   Median :0.8734   Median :5.524   
##  Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629   
##  3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980   
##  Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675   
##  width.of.kernal asymmetry.coefficient length.of.kernel.groove  type.of.seed
##  Min.   :2.630   Min.   :0.7651        Min.   :4.519           Min.   :1    
##  1st Qu.:2.944   1st Qu.:2.5615        1st Qu.:5.045           1st Qu.:1    
##  Median :3.237   Median :3.5990        Median :5.223           Median :2    
##  Mean   :3.259   Mean   :3.7002        Mean   :5.408           Mean   :2    
##  3rd Qu.:3.562   3rd Qu.:4.7687        3rd Qu.:5.877           3rd Qu.:3    
##  Max.   :4.033   Max.   :8.4560        Max.   :6.550           Max.   :3
any(is.na(seeds_df))
## [1] FALSE

We will now store the labels in a separate variable and exclude the type.of.seed column from the dataset in order to do the clustering. Later we will use the true labels to check how good the clustering turned out to be.

seeds_label <- seeds_df$type.of.seed
seeds_df$type.of.seed <- NULL
str(seeds_df)
## 'data.frame':    210 obs. of  7 variables:
##  $ area                   : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ perimeter              : num  14.8 14.6 14.1 13.9 15 ...
##  $ compactness            : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ length.of.kernel       : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ width.of.kernal        : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ asymmetry.coefficient  : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ length.of.kernel.groove: num  5.22 4.96 4.83 4.8 5.17 ...

Later we will append the cluster assignments back to the original data frame under a column named cluster with mutate() from the dplyr package, and count how many observations were assigned to each cluster with the count() function, so we load dplyr now.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Notice that the scales of the features are different, so we need to normalize them.

(This is because each observation's feature values are treated as coordinates in n-dimensional space, where n is the number of features, and the distances between these coordinates are then calculated. If the features are not on comparable scales, the variables with the largest ranges dominate the distance and can lead to misleading results.)
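
As a quick optional check (a minimal sketch), the per-column standard deviations confirm what the summary above already suggests: area and perimeter vary over much larger ranges than compactness, so an unscaled Euclidean distance would be dominated by them.

# Optional check: standard deviation of each raw feature
apply(seeds_df, 2, sd)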

seeds_df_sc <- as.data.frame(scale(seeds_df))
summary(seeds_df_sc)
##       area           perimeter        compactness      length.of.kernel 
##  Min.   :-1.4632   Min.   :-1.6458   Min.   :-2.6619   Min.   :-1.6466  
##  1st Qu.:-0.8858   1st Qu.:-0.8494   1st Qu.:-0.5967   1st Qu.:-0.8267  
##  Median :-0.1693   Median :-0.1832   Median : 0.1037   Median :-0.2371  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8446   3rd Qu.: 0.8850   3rd Qu.: 0.7100   3rd Qu.: 0.7927  
##  Max.   : 2.1763   Max.   : 2.0603   Max.   : 2.0018   Max.   : 2.3619  
##  width.of.kernal   asymmetry.coefficient length.of.kernel.groove
##  Min.   :-1.6642   Min.   :-1.95210      Min.   :-1.8090        
##  1st Qu.:-0.8329   1st Qu.:-0.75734      1st Qu.:-0.7387        
##  Median :-0.0572   Median :-0.06731      Median :-0.3766        
##  Mean   : 0.0000   Mean   : 0.00000      Mean   : 0.0000        
##  3rd Qu.: 0.8026   3rd Qu.: 0.71068      3rd Qu.: 0.9541        
##  Max.   : 2.0502   Max.   : 3.16303      Max.   : 2.3234

Now every column has a mean of 0 and a standard deviation of 1.

We will use the Euclidean distance. The method argument of dist() specifies the distance measure to be used; this must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski", and any unambiguous substring can be given.

dist_mat <- dist(seeds_df_sc, method = 'euclidean')
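
dist() returns a compact distance object. If you want to peek at a few pairwise distances (optional, just for inspection), you can coerce it to a matrix:

# Optional: distances between the first three observations
as.matrix(dist_mat)[1:3, 1:3]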

Now we decide which linkage method we want to use and proceed with hierarchical clustering.

We can try different linkage methods and later decide which one performs best.
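
For example, here is a minimal sketch of fitting two common alternatives, complete and Ward linkage, on the same distance matrix and plotting their dendrograms side by side (the choice of these two alternatives is just for illustration):

# Sketch: compare alternative linkage methods on the same distance matrix
hclust_cmp  <- hclust(dist_mat, method = 'complete')
hclust_ward <- hclust(dist_mat, method = 'ward.D2')
par(mfrow = c(1, 2))
plot(hclust_cmp, main = 'Complete linkage')
plot(hclust_ward, main = 'Ward linkage')
par(mfrow = c(1, 1))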

Here we will proceed with the average linkage method.


hclust_avg <- hclust(dist_mat, method = 'average')
plot(hclust_avg)

Next, we can cut the dendrogram in order to create the desired number of clusters.

Since in this case we already know that there are only three varieties of wheat, we choose the number of clusters to be k = 3; equivalently, as you can see in the dendrogram, cutting at height h = 3 also yields three clusters.

We use R's cutree() function to cut the tree, passing hclust_avg and k = 3 (cutting at h = 3 gives the same three clusters here).

cut_avg <- cutree(hclust_avg, k = 3)
plot(hclust_avg)
rect.hclust(hclust_avg , k = 3, border = 2:6)
abline(h = 3, col = 'red')
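
As a small side check (a sketch, not required for the rest of the analysis), cutting by height instead of by cluster count should give the same three groups if the red line at h = 3 really does cross exactly three branches:

# Sketch: cutting at height h = 3 should reproduce the k = 3 assignment
cut_h <- cutree(hclust_avg, h = 3)
table(cut_h, cut_avg)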

We can also colour the dendrogram branches by cluster:

avg_dend_obj <- as.dendrogram(hclust_avg)
avg_col_dend <- color_branches(avg_dend_obj, h = 3)
plot(avg_col_dend)

We now append the cluster assignments back to the original data frame under a column named cluster with mutate() from the dplyr package, and count how many observations were assigned to each cluster with the count() function.

seeds_df_cl <- mutate(seeds_df, cluster = cut_avg)
count(seeds_df_cl,cluster)
##   cluster  n
## 1       1 65
## 2       2 75
## 3       3 70
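
Since we kept the true labels in seeds_label, we can also cross-tabulate them against the cluster assignments to see how well the clustering recovered the three varieties (a minimal sketch; the cluster numbers are arbitrary, so only the pattern of the table matters):

# Cross-tabulate cluster assignments against the true wheat varieties
table(cluster = seeds_df_cl$cluster, variety = seeds_label)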

In many cases we don't actually have the true labels. In those cases, as already discussed, you can use other measures, such as maximizing Dunn's index, which can be computed with the dunn() function from the clValid package. You can also consider cross-validating the results by making train and test sets, just as you would for any other machine learning algorithm, and then checking the clustering on the held-out data when true labels are available.
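
A minimal sketch of the Dunn's index calculation, assuming the clValid package is installed; dunn() takes the distance matrix and the vector of cluster assignments:

# Sketch: Dunn's index for the average-linkage clustering (higher is better)
library(clValid)
dunn(dist_mat, cut_avg)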