Mateusz Tomczak

432561

Introduction

Palmerpenguins is a dataset originally published in Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Gorman et al., 2014) and currently available as an R package. As the authors state, the goal of the dataset is to provide an alternative to the iris dataset for data exploration and visualisation.

The aim of this project is to test whether this dataset can be used for testing or exploring different clustering methods. The dataset consists of 344 observations with 8 features: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year.

In this project, clustering will be performed on 7 of the features, while the 8th feature - species - will be used as a control variable to test whether the clustering correctly separated the different groups of penguins.

Libraries used

library(palmerpenguins)
library(gridExtra)
library(clustertend)
library(factoextra)
library(flexclust)
library(cluster) 
library(fpc)

The Dataset

As a first step, the data is inspected and prepared

data <- penguins
head(data)

Check the completeness of the dataset

data[!complete.cases(data),]

As we can see, some of the rows contain missing values. In this case, we need to remove them from the dataset to perform clustering

data <- data[complete.cases(data),]

Now we will save the species column into a separate variable

type <- data$species
table(data$species)
## 
##    Adelie Chinstrap    Gentoo 
##       146        68       119

To perform clustering, all data should be in numeric format

str(data)
## tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
##  $ year             : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
data$island <- as.numeric(factor(data$island))
data$sex <- as.numeric(data$sex)
data$flipper_length_mm <- as.numeric(data$flipper_length_mm)
data$body_mass_g <- as.numeric(data$body_mass_g)
data$year <- as.numeric(data$year)

Now we can select the 7 features that will be used in the clustering

data <- data[,c('island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year')]

Below, some statistics describing the data by column are shown. Those statistics include the minimum value, 1st quartile, median, mean, 3rd quartile and maximum value

options(width = 100)
summary(data)
##      island      bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g        sex       
##  Min.   :1.000   Min.   :32.10   Min.   :13.10   Min.   :172       Min.   :2700   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:39.50   1st Qu.:15.60   1st Qu.:190       1st Qu.:3550   1st Qu.:1.000  
##  Median :2.000   Median :44.50   Median :17.30   Median :197       Median :4050   Median :2.000  
##  Mean   :1.652   Mean   :43.99   Mean   :17.16   Mean   :201       Mean   :4207   Mean   :1.505  
##  3rd Qu.:2.000   3rd Qu.:48.60   3rd Qu.:18.70   3rd Qu.:213       3rd Qu.:4775   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :59.60   Max.   :21.50   Max.   :231       Max.   :6300   Max.   :2.000  
##       year     
##  Min.   :2007  
##  1st Qu.:2007  
##  Median :2008  
##  Mean   :2008  
##  3rd Qu.:2009  
##  Max.   :2009

Diagnostics

As a first step of the analysis we will run diagnostics to check whether the data is clusterable. For that we will use the Hopkins statistic, which assesses the clustering tendency of the data. The null hypothesis states that the dataset is uniformly distributed, meaning there are no meaningful clusters. The alternative hypothesis states that the dataset is not uniformly distributed, which means that it contains meaningful clusters.

A value of the statistic close to 0 indicates that it is unlikely to find statistically significant clusters. A value around 0.5 indicates random data. A value close to 1 indicates significant clusterability of the data.
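
A minimal sketch of how the Hopkins statistic can be computed by hand on a scaled numeric matrix is shown below. The helper hopkins_sketch and the sample size m = 50 are assumptions for illustration only; the analysis itself relies on get_clust_tendency() and clustertend::hopkins().

# Hypothetical helper, for illustration only: compares nearest-neighbour
# distances of artificial uniform points (u) and of sampled real points (w);
# H = sum(u) / (sum(u) + sum(w)) is close to 1 for clusterable data
hopkins_sketch <- function(X, m = 50) {
  n <- nrow(X)
  idx <- sample(n, m)
  # artificial points drawn uniformly within the range of each feature
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
  # sampled real points, measured against the remaining observations
  w <- sapply(idx, function(i)
    min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2))))
  sum(u) / (sum(u) + sum(w))
}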

We will also plot the Ordered Dissimilarity Matrix, which can be used to roughly assess the clusterability of the data by checking whether blocks of color are visible, which hint at a clustering tendency

data <- scale(data)
d_m <- dist(data, method = "euclidean")
get_clust_tendency(data, nrow(data)-1, graph=TRUE)
## $hopkins_stat
## [1] 0.8048359
## 
## $plot

cat("Hopkins statistic:", 1-hopkins(data, n=nrow(data)-1)$H)
## Hopkins statistic: 0.8003098

As we can see on the graph, some blocks of color are roughly visible. In the case of the Hopkins statistic we get a value of about 0.8, which indicates that the data is significantly clusterable.

Centroid-based clustering

Centroid-based clustering works by organizing the data into non-hierarchical clusters around centroids. The algorithms we will use are K-Means, PAM and CLARA. For each method we will check the optimal number of clusters using the silhouette and elbow methods; if the indicated optimum differs from 3, an additional clustering with k=3 will be performed to compare the results with the variable indicating penguin species. Each method will use Euclidean distance. The resulting clusterings will be compared using the Calinski-Harabasz index, for which higher values are better, and the homogeneity of the clusters will be assessed using the Duda-Hart index.
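
For reference, a minimal sketch of how the Calinski-Harabasz index can be computed for a labelled clustering is given below. The helper ch_index is hypothetical and for illustration only; the analysis uses fpc::calinhara().

# Hypothetical helper, for illustration: CH = (B / (k - 1)) / (W / (n - k)),
# where B is the between-cluster and W the within-cluster sum of squares
ch_index <- function(X, labels) {
  n <- nrow(X); k <- length(unique(labels))
  grand <- colMeans(X)
  W <- 0; B <- 0
  for (g in unique(labels)) {
    Xg <- X[labels == g, , drop = FALSE]
    cg <- colMeans(Xg)
    W <- W + sum(sweep(Xg, 2, cg)^2)        # scatter around the cluster mean
    B <- B + nrow(Xg) * sum((cg - grand)^2) # scatter of the cluster means
  }
  (B / (k - 1)) / (W / (n - k))             # higher values are better
}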

K-means

K-Means is one of the most popular approaches to clustering. It works by assigning observations to the closest cluster based on the cluster centers (means). Initial cluster centers are chosen at random; observations are then assigned to their nearest center, the centers are recalculated as the means of their clusters, and the process repeats until the assignments stabilize. The goal of this method is to minimize the total within-cluster sum of squares.
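
For reference, the base-R kmeans() call that eclust() wraps below looks roughly as follows. The value nstart = 25, an assumption made here only for illustration, re-runs the algorithm from several random initializations to reduce sensitivity to the starting centers.

# A sketch of the underlying base-R call; tot.withinss is the quantity
# that the algorithm tries to minimize
km <- kmeans(data, centers = 2, nstart = 25)
km$centers        # final cluster centers (means)
km$tot.withinss   # total within-cluster sum of squares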

Silhouette

fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") 

Elbow method

fviz_nbclust(data, FUN = kmeans, method = "wss")

We can see that the silhouette method indicates that the optimal number of clusters is 2, while the elbow method indicates 2 or 3 clusters.

Clustering for k=2

clust_kmeans <- eclust(data, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
kmeans_silhouette <- fviz_silhouette(clust_kmeans, print.summary = FALSE)
kmeans_graph <- fviz_cluster(clust_kmeans, data = data)
grid.arrange(kmeans_silhouette, kmeans_graph, ncol=2)

table(clust_kmeans$cluster, type)
##    type
##     Adelie Chinstrap Gentoo
##   1      0         0    119
##   2    146        68      0

We can see that in the case of 2 clusters the K-Means algorithm correctly identified the Gentoo species, assigning it to the 1st cluster and combining the two remaining species into the 2nd cluster. The average silhouette width in this case equals 0.37.

Clustering for k=3

clust_kmeans_3 <- eclust(data, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)

kmeans_silhouette_3 <- fviz_silhouette(clust_kmeans_3, print.summary = FALSE)
kmeans_graph_3 <- fviz_cluster(clust_kmeans_3, data = data)
grid.arrange(kmeans_silhouette_3, kmeans_graph_3, ncol=2)

table(clust_kmeans_3$cluster, type)
##    type
##     Adelie Chinstrap Gentoo
##   1      0         0     61
##   2    146        68      0
##   3      0         0     58

For 3 clusters K-Means does not perform well. We can see that the 2nd cluster remained unchanged compared to the k=2 case; the algorithm instead divided the 1st cluster into two. The average silhouette width is a little lower than in the case of two clusters, and equals 0.33.

Calinski-Harabasz and Duda-Hart indexes

cat("Calinski-Harabasz Index\nk=2:", round(calinhara(data, clust_kmeans$cluster),digits=2), "\nk=3:", round(calinhara(data, clust_kmeans_3$cluster),digits=2))
## Calinski-Harabasz Index
## k=2: 221.28 
## k=3: 154.35

Based on the Calinski-Harabasz index we can see that the 2 cluster solution is better.

dh_kmeans <- dudahart2(data, clust_kmeans$cluster)
dh_kmeans_3 <- dudahart2(data, clust_kmeans_3$cluster)
cat("Duda-Hart Index\n\nk=2\ndh:",  dh_kmeans$dh, "\ncompare:", dh_kmeans$compare, "\nAccept H0: ", dh_kmeans$cluster1, "\n\nk=3\ndh:",  dh_kmeans_3$dh, "\ncompare:", dh_kmeans_3$compare, "\nAccept H0: ", dh_kmeans_3$cluster1)
## Duda-Hart Index
## 
## k=2
## dh: 0.5993324 
## compare: 0.8239383 
## Accept H0:  FALSE 
## 
## k=3
## dh: 0.4832184 
## compare: 0.8239383 
## Accept H0:  FALSE

For both k=2 and k=3 the Duda-Hart test rejects the null hypothesis of a single homogeneous cluster, meaning the data is heterogeneous and splitting it into clusters is justified.

PAM

This method uses medoids to cluster the data. A medoid is the most centrally located object in a cluster. Compared to K-Means, PAM is more robust when noise or outliers are present, as medoids are less influenced by them than means are.
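
A minimal sketch of calling cluster::pam() directly, which eclust() wraps below; it makes the medoids, i.e. the representative observations, easy to inspect.

# Direct PAM call (a sketch); the medoids are actual observations
pam_fit <- cluster::pam(data, k = 2, metric = "euclidean")
pam_fit$medoids            # most centrally located observation per cluster
pam_fit$silinfo$avg.width  # average silhouette width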

Silhouette

fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette")

Elbow method

fviz_nbclust(data, FUN = cluster::pam, method = "wss")

The silhouette method again suggests 2 clusters. The elbow method again suggests around 3, maybe 4, clusters.

Clustering for k=2

clust_pam <- eclust(data, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
pam_silhouette <- fviz_silhouette(clust_pam, print.summary = FALSE)
pam_graph <- fviz_cluster(clust_pam, data = data, ellipse.type = "convex")
grid.arrange(pam_silhouette, pam_graph, ncol=2)

table(clust_pam$cluster, type)
##    type
##     Adelie Chinstrap Gentoo
##   1    145        68      0
##   2      1         0    119

In the case of 2 clusters PAM, similarly to K-Means, correctly assigned the Gentoo species to one cluster, with the exception of one Adelie observation. The average silhouette width in this case is 0.37.

Clustering for k=3

clust_pam_3 <- eclust(data, k=3, FUNcluster="pam", hc_metric="euclidean", graph=FALSE) 

pam_silhouette_3 <- fviz_silhouette(clust_pam_3, print.summary = FALSE)
pam_graph_3 <- fviz_cluster(clust_pam_3, data = data, ellipse.type = "convex")
grid.arrange(pam_silhouette_3, pam_graph_3, ncol=2)

table(clust_pam_3$cluster, type)
##    type
##     Adelie Chinstrap Gentoo
##   1     73        37      0
##   2     73        31      0
##   3      0         0    119

For 3 clusters PAM correctly clusters the Gentoo species, this time without any observations from other species. However, observations from the Adelie and Chinstrap species are divided almost equally between the 1st and 2nd clusters. The average silhouette width is slightly lower than in the case of two clusters, and equals 0.33.

Calinski-Harabasz and Duda-Hart indexes

cat("Calinski-Harabasz Index", "\nk=2:", round(calinhara(data, clust_pam$cluster),digits=2), "\nk=3:", round(calinhara(data, clust_pam_3$cluster),digits=2))
## Calinski-Harabasz Index 
## k=2: 219.65 
## k=3: 186.64

Based on the Calinski-Harabasz index we can see that, again, the 2 cluster solution is better.

dh_pam <- dudahart2(data, clust_pam$cluster) 
dh_pam_3 <- dudahart2(data, clust_pam_3$cluster) 
cat("Duda-Hart Index\n\nk=2\ndh:",  dh_pam$dh, "\ncompare:", dh_pam$compare, "\nAccept H0: ", dh_pam$cluster1, "\n\nk=3\ndh:",  dh_pam_3$dh, "\ncompare:", dh_pam_3$compare, "\nAccept H0: ", dh_pam_3$cluster1)
## Duda-Hart Index
## 
## k=2
## dh: 0.6011108 
## compare: 0.8239383 
## Accept H0:  FALSE 
## 
## k=3
## dh: 0.3103675 
## compare: 0.8239383 
## Accept H0:  FALSE

The Duda-Hart test again rejects the hypothesis of homogeneity for both cases, justifying the split into clusters.

CLARA

CLARA is designed for large datasets. It draws multiple samples of the data, applies PAM to each of them, and returns the best resulting clustering.
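
A minimal sketch of calling cluster::clara() directly is shown below; the samples and sampsize values are assumptions chosen for illustration, while the analysis itself uses eclust() with its defaults.

# Direct CLARA call (a sketch): draw 50 subsamples of 100 observations,
# run PAM on each, and keep the clustering with the best objective value
clara_fit <- cluster::clara(data, k = 2, metric = "euclidean",
                            samples = 50, sampsize = 100)
clara_fit$medoids  # medoids of the best sample-based clustering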

Silhouette

fviz_nbclust(data, FUNcluster = cluster::clara, method = "silhouette")

Elbow method

fviz_nbclust(data, FUN = cluster::clara, method = "wss")

Based on the silhouette method we can see that the optimal number of clusters is 2. For the elbow method, 2-3 clusters appear optimal.

Clustering for k=2

clust_clara <- eclust(data, k=2, FUNcluster="clara", hc_metric="euclidean", graph=FALSE)
clara_silhouette <- fviz_silhouette(clust_clara, print.summary = FALSE)
clara_graph <- fviz_cluster(clust_clara, data = data, ellipse.type = "convex")
grid.arrange(clara_silhouette, clara_graph, ncol=2)

table(clust_clara$cluster, type)
##    type
##     Adelie Chinstrap Gentoo
##   1    146        68      0
##   2      0         0    119

Again, as with the other methods, CLARA correctly assigns the Gentoo species to one cluster, and the remaining species to the other cluster. The average silhouette width is 0.37.

Clustering for k=3

clust_clara_3 <- eclust(data, k=3, FUNcluster="clara", hc_metric="euclidean", graph=FALSE)

clara_silhouette_3 <- fviz_silhouette(clust_clara_3, print.summary = FALSE)
clara_graph_3 <- fviz_cluster(clust_clara_3, data = data, ellipse.type = "convex")
grid.arrange(clara_silhouette_3, clara_graph_3, ncol=2)

table(clust_clara_3$cluster, type)
##    type
##     Adelie Chinstrap Gentoo
##   1     73        34      0
##   2     73        34      0
##   3      0         0    119

For 3 clusters we can see that CLARA divided the Adelie and Chinstrap species equally between the 1st and 2nd clusters, with the 3rd cluster containing only the Gentoo species. The average silhouette width is 0.33.

Calinski-Harabasz and Duda-Hart indexes

cat("Calinski-Harabasz Index\nk=2:", round(calinhara(data, clust_clara$cluster),digits=2), "\nk=3:", round(calinhara(data, clust_clara_3$cluster),digits=2))
## Calinski-Harabasz Index
## k=2: 221.28 
## k=3: 187.78

The Calinski-Harabasz index indicates that, also in the case of CLARA, the 2 cluster solution is better.

dh_clara <- dudahart2(data, clust_clara$cluster)  
dh_clara_3 <- dudahart2(data, clust_clara_3$cluster)  
cat("Duda-Hart Index\n\nk=2\ndh:",  dh_clara$dh, "\ncompare:", dh_clara$compare, "\nAccept H0: ", dh_clara$cluster1, "\n\nk=3\ndh:",  dh_clara_3$dh, "\ncompare:", dh_clara_3$compare, "\nAccept H0: ", dh_clara_3$cluster1)
## Duda-Hart Index
## 
## k=2
## dh: 0.5993324 
## compare: 0.8239383 
## Accept H0:  FALSE 
## 
## k=3
## dh: 0.3088481 
## compare: 0.8239383 
## Accept H0:  FALSE

Once more, the Duda-Hart test rejects the hypothesis of homogeneity for both cases.

Connectivity-based clustering

Hierarchical clustering represents connectivity-based clustering. It works by assigning each object to its own cluster; the algorithm then iteratively joins the two most similar clusters until only one cluster remains. At each iteration the distances between clusters are recomputed using the dissimilarity formula defined by the chosen linkage method. Hierarchical clustering does not require the number of clusters as input; however, we need to propose a terminal condition. We will use Euclidean distance in every method and only consider the case of 3 clusters in the data. The Duda-Hart index will again be used to test the homogeneity of the clusters.
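
The generic base-R workflow that eclust() wraps in this section is sketched below, using Ward’s method as an example; d_m is the Euclidean distance matrix computed earlier.

# Build the tree from the distance matrix, then cut it at the chosen k
hc <- hclust(d_m, method = "ward.D2")
clusters <- cutree(hc, k = 3)
table(clusters)  # cluster sizes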

Agglomerative hierarchical clustering

The agglomerative approach starts with points as individual clusters, and at each step the closest pair of clusters is merged, until one cluster is left. We will try four linkage methods: single linkage, complete linkage, average linkage and Ward’s method. The linkage method defines how the distance between two clusters is computed from the underlying distance metric, and thus determines which clusters are merged.

Single linkage

Single linkage defines the distance between two clusters as the distance between their two closest members.

single <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
plot(single, cex=0.5, hang=-1, main = "Dendrogram - single linkage", xlab = '', sub = '', labels=FALSE)
rect.hclust(single, k=3, border='red')

singleCut <- cutree(single, 3)
fviz_cluster(list(data = data, cluster = singleCut))

table(singleCut, type)
##          type
## singleCut Adelie Chinstrap Gentoo
##         1    146        67     58
##         2      0         0     61
##         3      0         1      0

We can see that in the case of single linkage the hierarchical clustering does not work well. No species was isolated in its own cluster; nearly all observations were assigned to the 1st cluster.

Average silhouette width and Duda-Hart index

singleStats <- cluster.stats(d_m,singleCut)
cat("Average silhouette width: ", singleStats$avg.silwidth)
## Average silhouette width:  0.1781883
dh_single <- dudahart2(data, singleCut)  
cat("Duda-Hart Index\ndh:",  dh_single$dh, "\ncompare:", dh_single$compare, "\nAccept H0: ", dh_single$cluster1) 
## Duda-Hart Index
## dh: 0.7370881 
## compare: 0.8239383 
## Accept H0:  FALSE

Complete linkage

Complete linkage defines the distance between two clusters as the distance between their two most distant members.

complete <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(complete, cex=0.5, hang=-1, main = "Dendrogram - complete linkage", xlab = '', sub = '', labels = FALSE)
Axis(side=2, labels=FALSE)
rect.hclust(complete, k=3, border='red')

completeCut <- cutree(complete, 3)
fviz_cluster(list(data = data, cluster = completeCut))

table(completeCut, type)
##            type
## completeCut Adelie Chinstrap Gentoo
##           1     69        52      0
##           2     77        16      0
##           3      0         0    119

In the case of complete linkage we can see that only the Gentoo species was identified and assigned to a single cluster. The other two species were divided between the two remaining clusters.

Average silhouette width and Duda-Hart index

completeStats <- cluster.stats(d_m,completeCut)
cat("Average silhouette width: ", completeStats$avg.silwidth)
## Average silhouette width:  0.3066251
dh_complete <- dudahart2(data, completeCut)
cat("Duda-Hart Index\ndh:", dh_complete$dh, "\ncompare:", dh_complete$compare, "\nAccept H0: ", dh_complete$cluster1)
## Duda-Hart Index
## dh: 0.3269892 
## compare: 0.8239383 
## Accept H0:  FALSE

Average linkage

Average linkage defines the distance between two clusters as the average of the pairwise distances between their members.

avg <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
plot(avg, cex=0.5, hang=-1, main = "Dendrogram - average linkage", xlab = '', sub = '', labels=FALSE)
rect.hclust(avg, k=3, border='red')

avgCut <- cutree(avg, 3)
fviz_cluster(list(data = data, cluster = avgCut))

table(avgCut, type)
##       type
## avgCut Adelie Chinstrap Gentoo
##      1    146        67      0
##      2      0         0    119
##      3      0         1      0

Average linkage correctly identified the Gentoo species. However, almost all other observations were assigned to the 1st cluster, with the exception of one Chinstrap penguin, which was the only observation assigned to the 3rd cluster.

Average silhouette width and Duda-Hart index

avgStats <- cluster.stats(d_m,avgCut)
cat("Average silhouette width: ", avgStats$avg.silwidth)
## Average silhouette width:  0.3011883
dh_avg <- dudahart2(data, avgCut)
cat("Duda-Hart Index\ndh:", dh_avg$dh, "\ncompare:", dh_avg$compare, "\nAccept H0: ", dh_avg$cluster1)
## Duda-Hart Index
## dh: 0.5942125 
## compare: 0.8239383 
## Accept H0:  FALSE

Ward’s method

Ward’s method is mainly used when there is no clear theoretical justification for the choice of another linkage method. At each step it merges the pair of clusters whose merger gives the smallest increase in the total within-cluster sum of squared distances from the cluster means.
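
To illustrate the criterion, below is a minimal sketch of the merge cost that Ward’s method evaluates; ess and ward_cost are hypothetical helpers defined here for illustration only.

# ess: within-cluster sum of squared distances from the cluster mean;
# Ward's method merges the pair of clusters with the smallest ward_cost
ess <- function(X) sum(sweep(X, 2, colMeans(X))^2)
ward_cost <- function(A, B) ess(rbind(A, B)) - ess(A) - ess(B)
ward_cost(data[1:10, ], data[11:20, ])  # cost of merging two example groups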

ward <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
plot(ward, cex=0.5, hang=-1, main = "Dendrogram - Ward's method", xlab = '', sub = '', labels=FALSE)
rect.hclust(ward, k=3, border='red')

wardCut <- cutree(ward, 3)
fviz_cluster(list(data = data, cluster = wardCut))

table(wardCut, type)
##        type
## wardCut Adelie Chinstrap Gentoo
##       1     73        34      0
##       2     73        34      0
##       3      0         0    119

Using Ward’s method we end up with the same cluster composition as in the case of CLARA. Only one species is correctly identified, with the other two split equally between the remaining two clusters.

Average silhouette width and Duda-Hart index

wardStats <- cluster.stats(d_m,wardCut)
cat("Average silhouette width: ", wardStats$avg.silwidth)
## Average silhouette width:  0.3331447
dh_ward <- dudahart2(data, wardCut)
cat("Duda-Hart Index\ndh:", dh_ward$dh, "\ncompare:", dh_ward$compare, "\nAccept H0: ", dh_ward$cluster1)
## Duda-Hart Index
## dh: 0.3088481 
## compare: 0.8239383 
## Accept H0:  FALSE

Divisive hierarchical clustering

In the divisive approach to hierarchical clustering we start with one cluster containing all observations. At each step a cluster is split, until each cluster contains only a single observation. Here we will use the DIANA algorithm.
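
A minimal sketch of calling cluster::diana() directly, which eclust() wraps below; the divisive coefficient dc summarizes how strong the recovered clustering structure is.

# Direct DIANA call (a sketch)
di <- cluster::diana(data, metric = "euclidean")
di$dc  # divisive coefficient; values closer to 1 suggest stronger structure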

DIANA

diana <- eclust(data, k=3, FUNcluster="diana")
pltree(diana, cex = 0.5, hang = -1, main = "Dendrogram - diana", xlab = '', sub = '', labels=FALSE)
rect.hclust(diana, k=3, border='red')

dianaCut <- cutree(diana, 3)
fviz_cluster(list(data = data, cluster = dianaCut))

table(dianaCut, type)
##         type
## dianaCut Adelie Chinstrap Gentoo
##        1     73        34      0
##        2     73        34      0
##        3      0         0    119

As we can see, clustering with DIANA gives the same cluster composition as agglomerative clustering with Ward’s method.

Average silhouette width and Duda-Hart index

dianaStats <- cluster.stats(d_m,dianaCut)
cat("Average silhouette width: ", dianaStats$avg.silwidth)
## Average silhouette width:  0.3331447
dh_diana <- dudahart2(data, dianaCut)
cat("Duda-Hart Index\ndh:", dh_diana$dh, "\ncompare:", dh_diana$compare, "\nAccept H0: ", dh_diana$cluster1)
## Duda-Hart Index
## dh: 0.3088481 
## compare: 0.8239383 
## Accept H0:  FALSE

Conclusions

As we can see, the conducted analysis indicates that the palmerpenguins dataset is not well suited for clustering. In the case of centroid-based clustering we obtained reasonable results for two clusters, where one species was assigned to one cluster and the other two were combined in the second cluster. Using the Calinski-Harabasz index we confirmed that the two cluster solution was better in all cases.

In the case of agglomerative connectivity-based clustering, only Ward’s method gave reasonable results, similar to those obtained with centroid-based clustering. The divisive DIANA method gave the same results as Ward’s method.

In all of the cases the Duda-Hart test rejected the hypothesis of homogeneity, so the data does contain structure worth splitting. Even so, none of the methods cleanly recovered all three penguin species, which leads to the conclusion that the palmerpenguins dataset is not the best choice for clustering analysis.

References

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.