Mateusz Tomczak
432561
palmerpenguins is a dataset originally published in Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Gorman et al., 2014) and is currently available as an R package. As the authors state, the goal of the dataset is to provide an alternative to the iris dataset for data exploration and visualisation.
The aim of this project is to test whether this dataset can be used for testing or exploring different clustering methods. The dataset consists of 344 observations with 8 features. Those features are:
species - the penguin’s species; one of ‘Adelie’, ‘Chinstrap’ and ‘Gentoo’
island - the island on which the penguin was observed; one of ‘Torgersen’, ‘Dream’ and ‘Biscoe’
bill_length_mm - length of the penguin’s bill, in millimeters
bill_depth_mm - depth of the penguin’s bill, in millimeters
flipper_length_mm - length of the penguin’s flipper, in millimeters
body_mass_g - the penguin’s body mass, in grams
sex - the penguin’s sex
year - the year in which the observation was collected
In this project, clustering will be performed on 7 of the features, while the 8th feature - species - will be used as a control variable to test whether the clustering correctly separates the penguin species.
library(palmerpenguins)
library(gridExtra)
library(clustertend)
library(factoextra)
library(flexclust)
library(cluster)
library(fpc)
As a first step, the data is inspected and prepared.
data <- penguins
head(data)
Check the completeness of the dataset.
data[!complete.cases(data),]
As we can see, some of the rows contain missing values. We need to remove them from the dataset before performing clustering.
data <- data[complete.cases(data),]
Now we will save the species column into a separate variable.
type <- data$species
table(data$species)
##
## Adelie Chinstrap Gentoo
## 146 68 119
To perform clustering, all features should be in numeric format.
str(data)
## tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
## $ year : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
data$island <- as.numeric(factor(data$island))
data$sex <- as.numeric(data$sex)
data$flipper_length_mm <- as.numeric(data$flipper_length_mm)
data$body_mass_g <- as.numeric(data$body_mass_g)
data$year <- as.numeric(data$year)
Now we can select the 7 features that will be used in the clustering.
data <- data[,c('island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year')]
Below, some statistics describing the data by column are shown. These include the minimum value, 1st quartile, median, mean, 3rd quartile and maximum value.
options(width = 100)
summary(data)
## island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## Min. :1.000 Min. :32.10 Min. :13.10 Min. :172 Min. :2700 Min. :1.000
## 1st Qu.:1.000 1st Qu.:39.50 1st Qu.:15.60 1st Qu.:190 1st Qu.:3550 1st Qu.:1.000
## Median :2.000 Median :44.50 Median :17.30 Median :197 Median :4050 Median :2.000
## Mean :1.652 Mean :43.99 Mean :17.16 Mean :201 Mean :4207 Mean :1.505
## 3rd Qu.:2.000 3rd Qu.:48.60 3rd Qu.:18.70 3rd Qu.:213 3rd Qu.:4775 3rd Qu.:2.000
## Max. :3.000 Max. :59.60 Max. :21.50 Max. :231 Max. :6300 Max. :2.000
## year
## Min. :2007
## 1st Qu.:2007
## Median :2008
## Mean :2008
## 3rd Qu.:2009
## Max. :2009
As a first step of the analysis we will run diagnostics to check whether the data is clusterable. For that we will use the Hopkins statistic, which assesses the clustering tendency of the data. The null hypothesis states that the dataset is uniformly distributed, meaning there are no meaningful clusters; the alternative hypothesis states that the dataset is not uniformly distributed, which means that it contains meaningful clusters.
A value of the statistic close to 0 indicates that it is unlikely to find statistically significant clusters, a value around 0.5 indicates random data, and a value close to 1 indicates significant clusterability of the data.
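To make the computation concrete, below is a minimal sketch of how the Hopkins statistic can be computed by hand; the helper hopkins_sketch and its sampling details are illustrative only, while the analysis itself relies on get_clust_tendency and clustertend’s hopkins.
# A minimal sketch of the Hopkins statistic (hopkins_sketch is not a package
# function). For m sampled points, compare nearest-neighbour distances from
# uniform random points to the data (u) with nearest-neighbour distances
# within the data itself (w): H = sum(u) / (sum(u) + sum(w)).
hopkins_sketch <- function(X, m = 50) {
  X <- as.matrix(X)
  n <- nrow(X); d <- ncol(X)
  # uniform points drawn within the bounding box of the data
  U <- sapply(1:d, function(j) runif(m, min(X[, j]), max(X[, j])))
  nn_dist <- function(P, self = FALSE) {
    apply(P, 1, function(p) {
      dists <- sqrt(colSums((t(X) - p)^2))
      if (self) sort(dists)[2] else min(dists)  # self = TRUE skips the point itself
    })
  }
  u <- nn_dist(U)                                             # uniform -> data
  w <- nn_dist(X[sample(n, m), , drop = FALSE], self = TRUE)  # data -> data
  sum(u) / (sum(u) + sum(w))
}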
We will also plot the Ordered Dissimilarity Matrix, which can be used to roughly assess the clusterability of the data by checking whether blocks of color are visible, as these hint at a clustering tendency.
data <- scale(data)
d_m <- dist(data, method = "euclidean")
get_clust_tendency(data, nrow(data)-1, graph=TRUE)
## $hopkins_stat
## [1] 0.8048359
##
## $plot
cat("Hopkins statistic:", 1-hopkins(data, n=nrow(data)-1)$H)
## Hopkins statistic: 0.8003098
As we can see on the graph, some blocks of color are roughly visible. The Hopkins statistic is about 0.8, which indicates that the data is significantly clusterable.
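For reference, the same ordered dissimilarity image can also be drawn directly from the precomputed distance matrix with factoextra’s fviz_dist:
# alternative way to draw the ordered dissimilarity image from d_m
fviz_dist(d_m, show_labels = FALSE)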
Centroid-based clustering works by organizing the data into non-hierarchical clusters around centroids. The algorithms we will use are K-Means, PAM and CLARA. For each method we will check the optimal number of clusters using the silhouette and elbow methods; if both methods suggest a number of clusters different from 3, an additional clustering with k=3 is performed so that the results can be compared with the variable indicating penguin species. Each method uses the Euclidean distance. The clustering results are compared using the Calinski-Harabasz index, for which higher values are better. The homogeneity of the clusters is assessed using the Duda-Hart index.
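To illustrate what the Calinski-Harabasz index measures, below is a minimal sketch of its computation; the helper ch_sketch is illustrative only, and the analysis itself uses calinhara from the fpc package.
# A minimal sketch of the Calinski-Harabasz index: between-cluster dispersion
# over within-cluster dispersion, each scaled by its degrees of freedom
ch_sketch <- function(X, cl) {
  X <- as.matrix(X)
  n <- nrow(X); k <- length(unique(cl))
  overall <- colMeans(X)
  # between-cluster sum of squares, weighted by cluster sizes
  ssb <- sum(sapply(unique(cl), function(j) {
    nj <- sum(cl == j)
    nj * sum((colMeans(X[cl == j, , drop = FALSE]) - overall)^2)
  }))
  # within-cluster sum of squares
  ssw <- sum(sapply(unique(cl), function(j) {
    Xj <- X[cl == j, , drop = FALSE]
    sum(sweep(Xj, 2, colMeans(Xj))^2)
  }))
  (ssb / (k - 1)) / (ssw / (n - k))
}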
K-Means is one of the most popular approaches to clustering. It works by assigning observations to the nearest cluster based on the cluster centers (means). Initial cluster centers are chosen at random, and their positions are recalculated as observations are assigned, continuing until the cluster assignments stabilize. The goal of this method is to minimize the within-cluster sum of squared errors.
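As an illustration of these mechanics, here is a minimal sketch of a Lloyd-style K-Means iteration; kmeans_sketch is illustrative only, while the clustering below uses the kmeans implementation wrapped by eclust, which also handles empty clusters and convergence checks.
# A minimal sketch of the K-Means iteration (assumes no cluster becomes
# empty during the iterations)
kmeans_sketch <- function(X, k, iter = 10) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centers
  for (i in 1:iter) {
    # assign each observation to its nearest center (squared Euclidean)
    cl <- apply(X, 1, function(x) which.min(colSums((t(centers) - x)^2)))
    # recompute each center as the mean of its assigned observations
    centers <- t(sapply(1:k, function(j) colMeans(X[cl == j, , drop = FALSE])))
  }
  list(cluster = cl, centers = centers)
}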
Silhouette
fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette")
Elbow method
fviz_nbclust(data, FUN = kmeans, method = "wss")
We can see that the silhouette method indicates 2 as the optimal number of clusters, while the elbow method suggests 2 or 3 clusters.
clust_kmeans <- eclust(data, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
kmeans_silhouette <- fviz_silhouette(clust_kmeans, print.summary = FALSE)
kmeans_graph <- fviz_cluster(clust_kmeans, data = data)
grid.arrange(kmeans_silhouette, kmeans_graph, ncol=2)
table(clust_kmeans$cluster, type)
## type
## Adelie Chinstrap Gentoo
## 1 0 0 119
## 2 146 68 0
We can see that in the case of 2 clusters the K-Means algorithm correctly identified the Gentoo species, assigning it to the 1st cluster and combining the two remaining species into the 2nd cluster. The average silhouette width in this case equals 0.37.
clust_kmeans_3 <- eclust(data, k=3, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
kmeans_silhouette_3 <- fviz_silhouette(clust_kmeans_3, print.summary = FALSE)
kmeans_graph_3 <- fviz_cluster(clust_kmeans_3, data = data)
grid.arrange(kmeans_silhouette_3, kmeans_graph_3, ncol=2)
table(clust_kmeans_3$cluster, type)
## type
## Adelie Chinstrap Gentoo
## 1 0 0 61
## 2 146 68 0
## 3 0 0 58
For 3 clusters K-Means does not perform well. We can see that the 2nd cluster remained unchanged compared to the k=2 case; the algorithm instead divided the 1st cluster into two. The average silhouette width is slightly lower than in the case of two clusters, and equals 0.33.
Calinski-Harabasz and Duda-Hart indexes
cat("Calinski-Harabasz Index\nk=2:", round(calinhara(data, clust_kmeans$cluster),digits=2), "\nk=3:", round(calinhara(data, clust_kmeans_3$cluster),digits=2))
## Calinski-Harabasz Index
## k=2: 221.28
## k=3: 154.35
Based on the Calinski-Harabasz index we can see that the 2-cluster solution is better.
dh_kmeans <- dudahart2(data, clust_kmeans$cluster)
dh_kmeans_3 <- dudahart2(data, clust_kmeans_3$cluster)
cat("Duda-Hart Index\n\nk=2\ndh:", dh_kmeans$dh, "\ncompare:", dh_kmeans$compare, "\nAccept H0: ", dh_kmeans$cluster1, "\n\nk=3\ndh:", dh_kmeans_3$dh, "\ncompare:", dh_kmeans_3$compare, "\nAccept H0: ", dh_kmeans_3$cluster1)
## Duda-Hart Index
##
## k=2
## dh: 0.5993324
## compare: 0.8239383
## Accept H0: FALSE
##
## k=3
## dh: 0.4832184
## compare: 0.8239383
## Accept H0: FALSE
For both k=2 and k=3 the Duda-Hart index indicates that the data within the clusters is heterogeneous.
This method uses medoids to cluster the data. A medoid is the most centrally located object in a cluster. Compared to K-Means, PAM is more robust in the presence of noise or outliers, as medoids are less influenced by them than means.
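eclust with FUNcluster="pam" wraps pam from the cluster package; the equivalent direct call below also exposes the medoids, which are actual observations and can be inspected (shown for reference only).
# equivalent direct call; medoids are actual observations from the data
pam_direct <- pam(data, k = 2, metric = "euclidean")
pam_direct$medoids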
Silhouette
fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette")
Elbow method
fviz_nbclust(data, FUN = cluster::pam, method = "wss")
The silhouette method again suggests 2 clusters, while the elbow method suggests around 3, maybe 4 clusters.
clust_pam <- eclust(data, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
pam_silhouette <- fviz_silhouette(clust_pam, print.summary = FALSE)
pam_graph <- fviz_cluster(clust_pam, data = data, ellipse.type = "convex")
grid.arrange(pam_silhouette, pam_graph, ncol=2)
table(clust_pam$cluster, type)
## type
## Adelie Chinstrap Gentoo
## 1 145 68 0
## 2 1 0 119
In the case of 2 clusters PAM, similarly to K-Means, correctly assigned the Gentoo species to one cluster, with the exception of one Adelie observation. The average silhouette width in this case is 0.37.
clust_pam_3 <- eclust(data, k=3, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
pam_silhouette_3 <- fviz_silhouette(clust_pam_3, print.summary = FALSE)
pam_graph_3 <- fviz_cluster(clust_pam_3, data = data, ellipse.type = "convex")
grid.arrange(pam_silhouette_3, pam_graph_3, ncol=2)
table(clust_pam_3$cluster, type)
## type
## Adelie Chinstrap Gentoo
## 1 73 37 0
## 2 73 31 0
## 3 0 0 119
For 3 clusters PAM correctly clusters the Gentoo species, this time without any observations from other species. However, observations from the Adelie and Chinstrap species are divided almost equally between the 1st and 2nd clusters. The average silhouette width is slightly lower than in the case of two clusters, and equals 0.33.
Calinski-Harabasz and Duda-Hart indexes
cat("Calinski-Harabasz Index", "\nk=2:", round(calinhara(data, clust_pam$cluster),digits=2), "\nk=3:", round(calinhara(data, clust_pam_3$cluster),digits=2))
## Calinski-Harabasz Index
## k=2: 219.65
## k=3: 186.64
Based on the Calinski-Harabasz index we can see that, again, the 2-cluster solution is better.
dh_pam <- dudahart2(data, clust_pam$cluster)
dh_pam_3 <- dudahart2(data, clust_pam_3$cluster)
cat("Duda-Hart Index\n\nk=2\ndh:", dh_pam$dh, "\ncompare:", dh_pam$compare, "\nAccept H0: ", dh_pam$cluster1, "\n\nk=3\ndh:", dh_pam_3$dh, "\ncompare:", dh_pam_3$compare, "\nAccept H0: ", dh_pam_3$cluster1)
## Duda-Hart Index
##
## k=2
## dh: 0.6011108
## compare: 0.8239383
## Accept H0: FALSE
##
## k=3
## dh: 0.3103675
## compare: 0.8239383
## Accept H0: FALSE
The Duda-Hart index indicates that in both cases the data within the clusters is heterogeneous.
CLARA is designed for large datasets. It draws multiple samples of the data, applies PAM to each one and, at the end, returns the best resulting clustering.
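eclust with FUNcluster="clara" wraps clara from the cluster package; in a direct call the number and size of the drawn samples can be controlled explicitly (the values below are illustrative).
# equivalent direct call; `samples` sets how many subsets are drawn and
# `sampsize` their size (values chosen for illustration)
clara_direct <- clara(data, k = 2, metric = "euclidean",
                      samples = 50, sampsize = 100)
clara_direct$medoids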
Silhouette
fviz_nbclust(data, FUNcluster = cluster::clara, method = "silhouette")
Elbow method
fviz_nbclust(data, FUN = cluster::clara, method = "wss")
Based on the silhouette method we can see that the optimal number of clusters is 2. The elbow method points to 2 or 3 clusters.
clust_clara <- eclust(data, k=2, FUNcluster="clara", hc_metric="euclidean", graph=FALSE)
clara_silhouette <- fviz_silhouette(clust_clara, print.summary = FALSE)
clara_graph <- fviz_cluster(clust_clara, data = data, ellipse.type = "convex")
grid.arrange(clara_silhouette, clara_graph, ncol=2)
table(clust_clara$cluster, type)
## type
## Adelie Chinstrap Gentoo
## 1 146 68 0
## 2 0 0 119
Again, as with the other methods, CLARA correctly assigns the Gentoo species to one cluster and the remaining species to the other. The average silhouette width is 0.37.
clust_clara_3 <- eclust(data, k=3, FUNcluster="clara", hc_metric="euclidean", graph=FALSE)
clara_silhouette_3 <- fviz_silhouette(clust_clara_3, print.summary = FALSE)
clara_graph_3 <- fviz_cluster(clust_clara_3, data = data, ellipse.type = "convex")
grid.arrange(clara_silhouette_3, clara_graph_3, ncol=2)
table(clust_clara_3$cluster, type)
## type
## Adelie Chinstrap Gentoo
## 1 73 34 0
## 2 73 34 0
## 3 0 0 119
For 3 clusters we can see that CLARA divided the Adelie and Chinstrap species equally between the 1st and 2nd clusters, with the 3rd cluster containing only the Gentoo species. The average silhouette width is 0.33.
Calinski-Harabasz and Duda-Hart indexes
cat("Calinski-Harabasz Index\nk=2:", round(calinhara(data, clust_clara$cluster),digits=2), "\nk=3:", round(calinhara(data, clust_clara_3$cluster),digits=2))
## Calinski-Harabasz Index
## k=2: 221.28
## k=3: 187.78
The Calinski-Harabasz index indicates that also in the case of CLARA the 2-cluster solution is better.
dh_clara <- dudahart2(data, clust_clara$cluster)
dh_clara_3 <- dudahart2(data, clust_clara_3$cluster)
cat("Duda-Hart Index\n\nk=2\ndh:", dh_clara$dh, "\ncompare:", dh_clara$compare, "\nAccept H0: ", dh_clara$cluster1, "\n\nk=3\ndh:", dh_clara_3$dh, "\ncompare:", dh_clara_3$compare, "\nAccept H0: ", dh_clara_3$cluster1)
## Duda-Hart Index
##
## k=2
## dh: 0.5993324
## compare: 0.8239383
## Accept H0: FALSE
##
## k=3
## dh: 0.3088481
## compare: 0.8239383
## Accept H0: FALSE
The Duda-Hart index indicates that in both cases the data within the clusters is heterogeneous.
Hierarchical clustering represents connectivity-based clustering. It works by assigning each object to its own cluster, then iteratively joining the two most similar clusters, continuing until only one cluster remains. At each iteration the distances between clusters are recomputed using the dissimilarity formula of the chosen linkage method. Hierarchical clustering does not require the number of clusters as input, but a terminal condition has to be proposed; here the tree is cut at 3 clusters. We will use the Euclidean distance in every method. The Duda-Hart index will again be used to test the homogeneity of the clusters.
The agglomerative approach starts with each point as an individual cluster and at each step merges the closest pair of clusters, until only one cluster is left. We will try four linkage methods: single linkage, complete linkage, average linkage and Ward’s method. The linkage method defines how the distance between two clusters is computed from the chosen distance metric, and thus which clusters get merged; see the sketch below.
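For reference, the four agglomerative variants compared below can also be produced directly with hclust on the Euclidean distance matrix d_m computed earlier, with cutree extracting the 3-cluster solution; eclust with FUNcluster="hclust" wraps these calls.
# direct hclust calls for the four linkage methods, all on d_m
hc_single   <- hclust(d_m, method = "single")
hc_complete <- hclust(d_m, method = "complete")
hc_average  <- hclust(d_m, method = "average")
hc_ward     <- hclust(d_m, method = "ward.D2")
table(cutree(hc_ward, k = 3))  # cluster sizes for Ward's method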
Single linkage defines the distance between two clusters as the distance between their two closest members.
single <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
plot(single, cex=0.5, hang=-1, main = "Dendrogram - single linkage", xlab = '', sub = '', labels=FALSE)
rect.hclust(single, k=3, border='red')
singleCut <- cutree(single, 3)
fviz_cluster(list(data = data, cluster = singleCut))
table(singleCut, type)
## type
## singleCut Adelie Chinstrap Gentoo
## 1 146 67 58
## 2 0 0 61
## 3 0 1 0
We can see that in the case of single linkage the hierarchical clustering does not work well. No species was cleanly separated into its own cluster; most observations, regardless of species, ended up in the 1st cluster.
Average silhouette width and Duda-Hart index
singleStats <- cluster.stats(d_m,singleCut)
cat("Average silhouette width: ", singleStats$avg.silwidth)
## Average silhouette width: 0.1781883
dh_single <- dudahart2(data, singleCut)
cat("Duda-Hart Index\ndh:", dh_single$dh, "\ncompare:", dh_single$compare, "\nAccept H0: ", dh_single$cluster1)
## Duda-Hart Index
## dh: 0.7370881
## compare: 0.8239383
## Accept H0: FALSE
Complete linkage defines the distance between two clusters as the distance between their two most distant members.
complete <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(complete, cex=0.5, hang=-1, main = "Dendrogram - complete linkage", xlab = '', sub = '', labels = FALSE)
Axis(side=2, labels=FALSE)
rect.hclust(complete, k=3, border='red')
completeCut <- cutree(complete, 3)
fviz_cluster(list(data = data, cluster = completeCut))
table(completeCut, type)
## type
## completeCut Adelie Chinstrap Gentoo
## 1 69 52 0
## 2 77 16 0
## 3 0 0 119
In the case of complete linkage we can see that only the Gentoo species was identified and assigned to a single cluster. The other two were divided between the two remaining clusters.
Average silhouette width and Duda-Hart index
completeStats <- cluster.stats(d_m,completeCut)
cat("Average silhouette width: ", completeStats$avg.silwidth)
## Average silhouette width: 0.3066251
dh_complete <- dudahart2(data, completeCut)
cat("Duda-Hart Index\ndh:", dh_complete$dh, "\ncompare:", dh_complete$compare, "\nAccept H0: ", dh_complete$cluster1)
## Duda-Hart Index
## dh: 0.3269892
## compare: 0.8239383
## Accept H0: FALSE
Average linkage defines the distance between two clusters as the average of the pairwise distances between their members.
avg <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
plot(avg, cex=0.5, hang=-1, main = "Dendrogram - average linkage", xlab = '', sub = '', labels=FALSE)
rect.hclust(avg, k=3, border='red')
avgCut <- cutree(avg, 3)
fviz_cluster(list(data = data, cluster = avgCut))
table(avgCut, type)
## type
## avgCut Adelie Chinstrap Gentoo
## 1 146 67 0
## 2 0 0 119
## 3 0 1 0
Average linkage correctly identified the Gentoo species. However, almost all other observations were assigned to the 1st cluster, with the exception of one Chinstrap penguin, which was the only observation assigned to the 3rd cluster.
Average silhouette width and Duda-Hart index
avgStats <- cluster.stats(d_m,avgCut)
cat("Average silhouette width: ", avgStats$avg.silwidth)
## Average silhouette width: 0.3011883
dh_avg <- dudahart2(data, avgCut)
cat("Duda-Hart Index\ndh:", dh_avg$dh, "\ncompare:", dh_avg$compare, "\nAccept H0: ", dh_avg$cluster1)
## Duda-Hart Index
## dh: 0.5942125
## compare: 0.8239383
## Accept H0: FALSE
Ward’s method is mainly used when there is no clear theoretical justification for the choice of another linkage method. It works out which clusters to merge by minimizing the increase in the total within-cluster sum of squared distances of each observation from its cluster mean.
ward <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
plot(ward, cex=0.5, hang=-1, main = "Dendrogram - Ward's method", xlab = '', sub = '', labels=FALSE)
rect.hclust(ward, k=3, border='red')
wardCut <- cutree(ward, 3)
fviz_cluster(list(data = data, cluster = wardCut))
table(wardCut, type)
## type
## wardCut Adelie Chinstrap Gentoo
## 1 73 34 0
## 2 73 34 0
## 3 0 0 119
Using Ward’s method we end up with the same cluster composition as in the case of CLARA. Only one species is correctly identified, with the other two split equally between the remaining two clusters.
Average silhouette width and Duda-Hart index
wardStats <- cluster.stats(d_m,wardCut)
cat("Average silhouette width: ", wardStats$avg.silwidth)
## Average silhouette width: 0.3331447
dh_ward <- dudahart2(data, wardCut)
cat("Duda-Hart Index\ndh:", dh_ward$dh, "\ncompare:", dh_ward$compare, "\nAccept H0: ", dh_ward$cluster1)
## Duda-Hart Index
## dh: 0.3088481
## compare: 0.8239383
## Accept H0: FALSE
In the divisive approach to hierarchical clustering we start with one cluster containing all observations. At each step a cluster is split, until each cluster contains only a single observation. Here we will use the DIANA technique.
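eclust with FUNcluster="diana" wraps diana from the cluster package; the direct call below also returns the divisive coefficient dc, where values close to 1 indicate a stronger clustering structure.
# equivalent direct call; dc is the divisive coefficient
diana_direct <- diana(data, metric = "euclidean")
diana_direct$dc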
diana <- eclust(data, k=3, FUNcluster="diana")
pltree(diana, cex = 0.5, hang = -1, main = "Dendrogram - diana", xlab = '', sub = '', labels=FALSE)
rect.hclust(diana, k=3, border='red')
dianaCut <- cutree(diana, 3)
fviz_cluster(list(data = data, cluster = dianaCut))
table(dianaCut, type)
## type
## dianaCut Adelie Chinstrap Gentoo
## 1 73 34 0
## 2 73 34 0
## 3 0 0 119
As we can see, clustering with DIANA gives the same cluster composition as agglomerative clustering using Ward’s method.
Average silhouette width and Duda-Hart index
dianaStats <- cluster.stats(d_m,dianaCut)
cat("Average silhouette width: ", dianaStats$avg.silwidth)
## Average silhouette width: 0.3331447
dh_diana <- dudahart2(data, dianaCut)
cat("Duda-Hart Index\ndh:", dh_diana$dh, "\ncompare:", dh_diana$compare, "\nAccept H0: ", dh_diana$cluster1)
## Duda-Hart Index
## dh: 0.3088481
## compare: 0.8239383
## Accept H0: FALSE
As we can see, the conducted analysis indicates that the palmerpenguins dataset is not well suited for clustering. In the case of centroid-based clustering we were able to obtain some reasonable results with two clusters, where one species was assigned to one cluster and the two remaining species were combined in the second. Using the Calinski-Harabasz index we confirmed that the two-cluster solution was better in all cases.
In the case of agglomerative connectivity-based clustering, only Ward’s method gave reasonable results, similar to those obtained with centroid-based clustering. The divisive DIANA method gave the same results as Ward’s method.
In all of the cases the Duda-Hart index showed heterogeneity of the clusters, which suggests that the data inside the clusters was diverse. All of the results thus lead to the conclusion that the palmerpenguins dataset is not the best choice for clustering analysis.
Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.