The purpose of this document is to explore whether sexual dimorphism in Antarctic penguins is pronounced enough to distinguish the sexes using the K-means and PAM clustering algorithms. Basic body measurements may differ enough for the algorithms to cluster the birds into 2 distinct categories, although that is not obvious from simply plotting the data. The experiment was conducted on a small sample of penguins and a small set of features, which inherently limits the scope of the research, but it could be scaled up should more data become available.
library(knitr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(ggpubr)
library(cluster)
library(factoextra)
Data was taken from Kaggle and, as per the description, was made available by Dr. Kristen Gorman of the Palmer Station, Antarctica LTER. She is a member of the Long Term Ecological Research Network. The original dataset contains 344 observations of the following 8 variables: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex and year.
Additionally, it includes an index number for each observation.
In the initial cleaning, 11 rows with missing data were removed, along with the year and index features, leaving 333 observations. The rows were removed because the sex of each penguin is required to answer the question, and imputing measurements, for example by averaging over the whole set, would still not provide that information. The year was removed because the project focuses on the basic body measurements of the penguins, and 3 years' worth of data is too short a span to observe changes in those measurements.
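The cleaning step described above can be sketched in base R as follows. This is a minimal illustration, not the code used for the report: the data frame `raw`, its values, and the column name `index` are all made up for the example.

```r
# Toy stand-in for the raw data; every value here is invented for illustration.
raw <- data.frame(
  index = 1:5,  # hypothetical name for the index column
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, NA, 46.1, 50.0, 46.5),
  body_mass_g = c(3750, 3800, NA, 5700, 3500),
  sex = c("male", "female", "female", NA, "female"),
  year = c(2007, 2007, 2008, 2008, 2009)
)

# Keep only rows that have no missing value in any column.
complete <- raw[complete.cases(raw), ]

# Drop the index and year columns, which are not body measurements.
penguins_clean <- complete[, !(names(complete) %in% c("index", "year"))]

# Number of rows removed by the cleaning step.
rows_removed <- nrow(raw) - nrow(penguins_clean)
```

Dropping incomplete rows before removing columns means a row is discarded if any of its fields is missing, which matches the description above.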
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g    sex
## 1  Adelie Torgersen           39.1          18.7               181        3750   male
## 2  Adelie Torgersen           39.5          17.4               186        3800 female
## 3  Adelie Torgersen           40.3          18.0               195        3250 female
## 5  Adelie Torgersen           36.7          19.3               193        3450 female
## 6  Adelie Torgersen           39.3          20.6               190        3650   male
## 7  Adelie Torgersen           38.9          17.8               181        3625 female
Pre-processed data was visualized to give an initial outlook on what is available.
Based on this plot alone, the clustering algorithms may have trouble distinguishing the sexes.
To test whether the algorithms can recognize sex from the body measurements, we first need to know how the observations are split between the two sexes.
## Sex Count
## 1 female 165
## 2 male 168
For the K-means and PAM algorithms, a new data frame was created with the sex, island and species features dropped, leaving only the relevant numerical features.
The new set was then standardized. Both algorithms are distance-based, so without scaling the feature with the largest numerical range (body mass in grams) would dominate the result.
# dataset with only relevant features - no encoding
penguins_rel <- penguins %>% select(-sex, -species, -island)
# standardizing the data
penguins_scaled <- scale(penguins_rel)
penguins_scaled <- as.data.frame(penguins_scaled)
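As a quick check of what `scale()` does, the sketch below standardizes one column by hand and compares the result. It uses only the six flipper lengths from the rows printed earlier, so the z-scores are relative to those six rows and are not the values in `penguins_scaled`.

```r
# Flipper lengths (mm) of the six rows shown in the data preview.
x <- c(181, 186, 195, 193, 190, 181)

# Standardize by hand: subtract the mean, divide by the sample standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() applies the same centering and scaling column by column.
z_scale <- as.numeric(scale(x))

# TRUE when both approaches agree.
same <- isTRUE(all.equal(z_manual, z_scale))
```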
Before proceeding with K-means, both the silhouette and elbow methods were used to find the optimal number of clusters. This somewhat contradicts the initial objective of the project, which presumes 2 clusters, but it is a necessary check.
While the elbow method could perhaps point towards 3 clusters, the silhouette method leaves no room for loose interpretation: 2 clusters were selected, which aligns with the initial assumption of the project. Clustering with 3 clusters was nevertheless also run for comparison. It could also be useful should other questions about the data set be raised, such as whether species can be told apart from body measurements or whether penguins from different islands differ.
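Both criteria can be computed by hand, which makes it clearer what the plots summarize. The sketch below is not the code behind the report (the plots may well have come from a helper such as `fviz_nbclust` in factoextra); it runs on synthetic two-blob data, and the names `toy`, `wss`, `avg_sil` and `best_k` are illustrative.

```r
library(cluster)  # silhouette()

set.seed(42)
# Synthetic stand-in for the scaled measurements: two well-separated blobs.
toy <- rbind(
  matrix(rnorm(100, mean = -3), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2)
)

ks <- 2:6
d <- dist(toy)  # pairwise Euclidean distances, needed by silhouette()

# Elbow method: total within-cluster sum of squares for each k.
# The "elbow" is the k after which this curve stops dropping sharply.
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Silhouette method: average silhouette width for each k; higher is better.
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The k with the highest average silhouette width.
best_k <- ks[which.max(avg_sil)]
```

The elbow criterion needs a visual judgment, while the silhouette criterion reduces to a single maximum, which is why the two methods can disagree as they do above.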
Running K-means with both 2 and 3 clusters produces the following results.
Size and centers of each pass:
## [1] 119 214
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 0.6537742 -1.101050 1.160716 1.0995561
## 2 -0.3635474 0.612266 -0.645445 -0.6114354
## [1] 213 62 58
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 -0.3721283 0.6067124 -0.6515011 -0.6177631
## 2 0.2797368 -1.4474393 0.8539325 0.6116866
## 3 1.0675801 -0.6808363 1.4797572 1.6148100
## cluster size ave.sil.width
## 1 1 119 0.61
## 2 2 214 0.49
## cluster size ave.sil.width
## 1 1 213 0.44
## 2 2 62 0.51
## 3 3 58 0.34
The silhouette plots show that 2 clusters are the better choice for this task: the average silhouette widths are 0.61 and 0.49 with 2 clusters, against 0.44, 0.51 and 0.34 with 3. In the 2-cluster K-means, some values in the 2nd cluster come close to zero, which could indicate that they lie on the border with the other cluster, but none of them are below zero, so no observation is clearly misassigned. The overall structure remains strong. The two dimensions of the cluster plot together explain 88.1% of the variance, with dimension 1 being the stronger component, so the two-dimensional view represents the scaled features well.
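The 88.1% figure and the reference to dimension 1 suggest that the clusters were plotted on the first two principal components of the scaled features, which is what `fviz_cluster` in factoextra does when there are more than two features. Assuming that is the case, the share of variance carried by each dimension can be checked with `prcomp()`. The data below is synthetic and the column names are placeholders.

```r
set.seed(1)
# Synthetic stand-in with four correlated numeric features.
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.3)
toy$c <- rnorm(50)
toy$d <- -toy$a + rnorm(50, sd = 0.5)

# Principal component analysis on standardized features.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component.
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share of variance visible in a plot of the first two components.
first_two <- sum(explained[1:2])
```

This quantity describes how faithful the two-dimensional plot is. It does not measure how well the clustering itself fits the data.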
Additionally, when K-means is run with 3 clusters, it retains 1 of the clusters in nearly the same shape (213 of its 214 observations) while trying to separate the other cluster into 2. The one reassigned point must lie slightly closer to the center of the 3rd cluster. The 2 “new” clusters also overlap each other. The shape of the silhouette substantiates that, with negative values present for one of the clusters, indicating observations that would fit better in another cluster.
After K-means, similar steps were taken for the PAM algorithm, beginning with finding the optimal number of clusters.
Again, 2 clusters were selected and produce the following results.
Given how well 2 clusters fit previously, it is no surprise that PAM produces an identical split. However, it approaches 3 clusters completely differently, showing a uniformly worse overlap for clusters 1 and 3 in the second plot. This is further evident in the silhouette plots.
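One way to confirm that two algorithms produce the same split is to cross-tabulate their assignments. The sketch below does this on synthetic data with `kmeans()` and `cluster::pam()`; the object names are illustrative and are not the ones used in the report.

```r
library(cluster)  # pam()

set.seed(7)
# Synthetic stand-in: two well-separated blobs of 30 points each.
toy <- rbind(
  matrix(rnorm(60, mean = -3), ncol = 2),
  matrix(rnorm(60, mean = 3), ncol = 2)
)

km <- kmeans(toy, centers = 2, nstart = 25)
pm <- pam(toy, k = 2)

# Cluster numbers are arbitrary labels, so compare via a contingency table.
# An identical split shows exactly one nonzero cell in each row and column.
agreement <- table(kmeans = km$cluster, pam = pm$clustering)
```

Because the numbering is arbitrary, cluster 1 of one algorithm can correspond to cluster 2 of the other, as in the silhouette summaries here where the sizes 119 and 214 appear in swapped order.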
## cluster size ave.sil.width
## 1 1 214 0.49
## 2 2 119 0.61
## cluster size ave.sil.width
## 1 1 146 0.40
## 2 2 68 0.40
## 3 3 119 0.57
To answer the question of whether there is enough sexual dimorphism in penguins for the K-means and PAM algorithms to tell the sexes apart, we have to relate the outputs of those algorithms to the original data.
Because no rows were reordered or removed between scaling the data and clustering, the row order is preserved and the results can simply be appended to the original data set. Each row is given its cluster assignment from both algorithms.
# relating 2 cluster k-means to original data - index is preserved in the data set
penguins$cluster2c <- penguins_kmeans_2c$cluster
penguins$cluster3c <- penguins_kmeans_3c$cluster
# relating PAM data to original data frame
penguins$cluster_pam2c <- penguins_PAM_2c$cluster
penguins$cluster_pam3c <- penguins_PAM_3c$cluster
Now the sex variable in the original data is converted to a binary variable and compared with the cluster assignments, to see whether the algorithms “guessed” correctly or not.
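A minimal sketch of such a comparison is shown below. It is not the code used for the report: the labels are invented, and the mapping of "male" to cluster 1 is an arbitrary assumption. Because cluster numbers carry no meaning, the sketch also evaluates the opposite mapping and keeps the better of the two.

```r
# Hypothetical sex labels and cluster assignments, for illustration only.
sex <- c("male", "female", "female", "male", "female", "male")
cluster <- c(1, 2, 1, 1, 2, 2)

# Encode sex as 1/2 so it can be compared with the cluster numbers.
# Mapping "male" to 1 is an arbitrary choice.
sex_binary <- ifelse(sex == "male", 1, 2)

# TRUE where the cluster number matches the encoded sex.
gender_score <- sex_binary == cluster

# Share of matches under this mapping.
accuracy <- mean(gender_score)

# Flipping the mapping turns every match into a mismatch and vice versa,
# so the accuracy under the better mapping is:
accuracy_best <- max(accuracy, 1 - accuracy)
```

With two clusters, a score close to 0.5 under both mappings means the clusters carry essentially no information about sex.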
## [1] "K-means"
## # A tibble: 2 × 2
## gender_score count
## <chr> <int>
## 1 False 165
## 2 True 168
## [1] "PAM"
## # A tibble: 2 × 2
## gender_score2 count
## <chr> <int>
## 1 False 165
## 2 True 168
Because both algorithms produced the same 2-cluster split, the results are exactly the same for both. They guessed correctly for 168 of the 333 penguins, about half the time, which is no better than chance.
Based on the data available, neither PAM nor K-means is able to substantially differentiate between the sexes from basic body measurements, which could mean that there is not much sexual dimorphism in Antarctic penguins. This is not certain, however, as more research could be done, for example comparing the average measurements of male and female penguins with the centroids and medoids found by the algorithms. Additionally, while clustering with 3 clusters fit the data less well than with 2, it gives insight into how differently the data could be split if more clusters were considered. There could also be differences between the species of penguins, which could be explored.
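The comparison of per-sex averages with cluster centers suggested above could start from something like the following sketch. It runs on invented data with placeholder column names (`body_mass`, `flipper`), so the numbers say nothing about the real penguins.

```r
set.seed(3)
# Made-up measurements for 20 "female" and 20 "male" observations.
toy <- data.frame(
  sex = rep(c("female", "male"), each = 20),
  body_mass = c(rnorm(20, 4000, 300), rnorm(20, 4500, 300)),
  flipper = c(rnorm(20, 195, 8), rnorm(20, 205, 8))
)

# Standardize the numeric features, as done before clustering.
scaled <- as.data.frame(scale(toy[, c("body_mass", "flipper")]))

# Mean of each standardized feature per sex.
sex_means <- aggregate(scaled, by = list(sex = toy$sex), FUN = mean)

# K-means centroids on the same scaled features.
km <- kmeans(scaled, centers = 2, nstart = 25)
centers <- km$centers
```

If the rows of `sex_means` sit close to the rows of `centers`, the clusters line up with sex. If they sit far apart, the clusters are capturing some other structure in the data.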