Introduction

This document explores whether sexual dimorphism in Antarctic penguins is pronounced enough for the K-means and PAM clustering algorithms to distinguish between the sexes. Basic body measurements may differ enough for the algorithms to separate the birds into 2 distinct clusters, although that is not obvious from plotting the data alone. The experiment uses a small sample of penguins and a small set of features, which limits its scope, but it could be scaled up should more data become available.

Libraries

library(knitr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(ggpubr)
library(cluster)
library(factoextra)

The data set

The data was taken from Kaggle and, as per the description, was made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. The original dataset contains 344 observations of the following 8 variables:

  • Species
  • Island
  • Length of the bill in millimeters
  • Depth of the bill in millimeters
  • Length of the flippers in millimeters
  • Body mass in grams
  • Sex
  • Year of observation

Additionally, it includes an index number for each observation.

In the initial cleaning, 11 rows with missing data were removed, along with the year and index features. The rows were removed because the sex of each penguin is required to answer the question, and imputing values, for example by averaging over the whole set, cannot supply that information. The year was removed because the project focuses on the basic body measurements of the penguins, and 3 years of data is too short a span to observe changes in those measurements.
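The cleaning step described above could look roughly like the following sketch. The loading and cleaning code is not shown in this report, so the toy data frame and the column names `X` (index) and `year` are assumptions made purely for illustration.

```r
# Toy stand-in for the raw data; column names are assumed, not taken from the source file.
penguins_raw <- data.frame(
  X = 1:5,                                   # index column (assumed name)
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, 39.5, NA, 46.1, 50.0),
  body_mass_g = c(3750, 3800, NA, 4500, 5700),
  sex = c("male", "female", NA, "female", "male"),
  year = c(2007, 2007, 2007, 2008, 2009)
)

# Keep only rows with no missing values, then drop the index and year columns.
complete_rows <- complete.cases(penguins_raw)
keep_cols <- !(names(penguins_raw) %in% c("X", "year"))
penguins_clean <- penguins_raw[complete_rows, keep_cols]

print(nrow(penguins_raw) - nrow(penguins_clean))  # number of rows removed
print(names(penguins_clean))                      # remaining columns
print(rownames(penguins_clean))                   # original row labels are kept
```

Base R subsetting keeps the original row labels, so a removed row leaves a gap in the numbering of the printed data.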

##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
## 7  Adelie Torgersen           38.9          17.8               181        3625
##      sex
## 1   male
## 2 female
## 3 female
## 5 female
## 6   male
## 7 female

Pre-processed data was visualized to give an initial outlook on what is available.

Based on this plot alone, the clustering algorithms may have trouble distinguishing the sexes.

To test whether the algorithms can recognize the sexes from body measurements, we first need to know how the data is split between them.

##      Sex Count
## 1 female   165
## 2   male   168

Scaling

For K-means and PAM, a new data frame was created with the features Sex, Island and Species dropped from the original, leaving only the relevant numerical features:

  • Length of the bill in millimeters
  • Depth of the bill in millimeters
  • Length of the flippers in millimeters
  • Body mass in grams

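The new set was then standardized. Both K-means and PAM rely on distances between observations, so without scaling the feature with the largest values (body mass in grams) would dominate the result.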

## dataset with only relevant features - no encoding
penguins_rel <- penguins %>% select(-sex, -species, -island)

# standardizing the data
penguins_scaled <- scale(penguins_rel)
penguins_scaled <- as.data.frame(penguins_scaled)
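As a quick check of what `scale()` does, the sketch below standardizes a few illustrative values (not the full data set) and compares one column with the same calculation done by hand.

```r
# Toy measurements on very different scales (values are illustrative only).
toy <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, 46.1, 50.0),
  body_mass_g    = c(3750, 3800, 3250, 4500, 5700)
)

# scale() subtracts each column's mean and divides by its sample standard deviation.
toy_scaled <- as.data.frame(scale(toy))

# The same result computed by hand for one column.
manual <- (toy$body_mass_g - mean(toy$body_mass_g)) / sd(toy$body_mass_g)

print(round(colMeans(toy_scaled), 10))            # column means after scaling
print(sapply(toy_scaled, sd))                     # column standard deviations after scaling
print(all.equal(toy_scaled$body_mass_g, manual))  # scale() agrees with the manual version
```

After scaling, every feature has mean 0 and standard deviation 1, so each contributes on a comparable footing to the distances used by the clustering algorithms.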

K-means

Before running K-means, both the silhouette and elbow methods were used to find the optimal number of clusters. This somewhat contradicts the initial objective of the project, which presupposes 2 clusters, but it is a necessary check.
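The report loads factoextra, whose `fviz_nbclust()` can draw both curves, but the plotting code is not shown here. The sketch below computes the underlying numbers with base R and the `cluster` package on synthetic data; the toy matrix, the seed and the range of `k` are assumptions for illustration.

```r
library(cluster)  # silhouette()

set.seed(42)  # arbitrary seed so the random data and k-means starts are repeatable

# Synthetic stand-in for the scaled penguin data: two well-separated groups.
toy <- rbind(
  matrix(rnorm(60, mean = -2, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean =  2, sd = 0.5), ncol = 2)
)
d <- dist(toy)  # Euclidean distances, needed for the silhouette widths

ks <- 2:6
# Elbow method: total within-cluster sum of squares for each k.
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
# Silhouette method: average silhouette width for each k.
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

print(data.frame(k = ks, wss = wss, avg_sil = avg_sil))
best_k <- ks[which.max(avg_sil)]  # k with the highest average silhouette width
print(best_k)
```

The elbow method looks for the `k` after which the within-cluster sum of squares stops dropping sharply, while the silhouette method simply picks the `k` with the highest average width, which is why the latter leaves less room for interpretation.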

While the elbow method could perhaps point towards 3 clusters, the silhouette method leaves no room for loose interpretation. 2 clusters were selected, which aligns with the initial assumption of the project. That said, 3 clusters were also tested for comparison. That result could also be useful should other questions about the data set be raised, such as whether species can be identified from body measurements, or whether penguins from different islands differ.

Running K-means with both 2 and 3 clusters produces the following results.

Size and centers of each pass:

## [1] 119 214
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1      0.6537742     -1.101050          1.160716   1.0995561
## 2     -0.3635474      0.612266         -0.645445  -0.6114354
## [1] 213  62  58
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1     -0.3721283     0.6067124        -0.6515011  -0.6177631
## 2      0.2797368    -1.4474393         0.8539325   0.6116866
## 3      1.0675801    -0.6808363         1.4797572   1.6148100

##   cluster size ave.sil.width
## 1       1  119          0.61
## 2       2  214          0.49

##   cluster size ave.sil.width
## 1       1  213          0.44
## 2       2   62          0.51
## 3       3   58          0.34

The silhouette results favor 2 clusters. Although one cluster in the 3-cluster solution scores slightly higher (0.51) than the weaker cluster of the 2-cluster solution (0.49), the 2-cluster solution has the better average silhouette width overall, and its weakest cluster is well above the weakest cluster of the 3-cluster solution (0.34). In the 2-cluster solution, some silhouette values in the 2nd cluster approach zero, which could indicate points on the border of the other cluster, but none fall below zero, so that cannot be stated with certainty. The overall structure remains strong. The cluster plot is also a faithful picture of the scaled data: its 2 dimensions together account for 88.1% of the variance, with dimension 1 being the stronger component.

Additionally, when K-means is run with 3 clusters, it retains 1 of the clusters in nearly the same shape (213 of its 214 points) while trying to separate the other cluster into 2. The remaining point joins one of the new clusters, presumably because it lies slightly closer to the center of the 3rd cluster. The 2 “new” clusters also overlap each other. The silhouette plot substantiates that: one of the clusters contains negative values, indicating points that may belong to another cluster.
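The per-cluster silhouette tables above can be condensed into one overall figure per solution by weighting each cluster's average width by its size. The sketch below uses the rounded values printed in those tables, so the results are approximate.

```r
# Per-cluster sizes and average silhouette widths copied from the K-means tables above.
sizes_2c <- c(119, 214)
sil_2c   <- c(0.61, 0.49)
sizes_3c <- c(213, 62, 58)
sil_3c   <- c(0.44, 0.51, 0.34)

# Overall average width = size-weighted mean of the per-cluster averages.
overall_2c <- weighted.mean(sil_2c, sizes_2c)
overall_3c <- weighted.mean(sil_3c, sizes_3c)

print(round(c(two_clusters = overall_2c, three_clusters = overall_3c), 2))
```

This gives roughly 0.53 for 2 clusters against roughly 0.44 for 3 clusters.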

PAM

After K-means, similar steps were taken for the PAM algorithm, beginning with finding the optimal number of clusters.

Again, 2 clusters were selected, producing the following results.

Given how well 2 clusters fit previously, it is no surprise that PAM produces an identical split. However, it approaches 3 clusters completely differently, showing a worse overlap between clusters 1 and 3 in the second plot. This is further evident in the silhouette plots.

##   cluster size ave.sil.width
## 1       1  214          0.49
## 2       2  119          0.61

##   cluster size ave.sil.width
## 1       1  146          0.40
## 2       2   68          0.40
## 3       3  119          0.57
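For readers unfamiliar with PAM, the following sketch shows a minimal call to `pam()` from the `cluster` package on synthetic data. The toy matrix and seed are assumptions for illustration; the call that produced the results above is not shown in this report.

```r
library(cluster)  # pam()

set.seed(1)  # arbitrary seed for the synthetic data

# Synthetic stand-in for the scaled data: two well-separated groups of 20 points.
toy <- rbind(
  matrix(rnorm(40, mean = -2, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean =  2, sd = 0.5), ncol = 2)
)

# PAM picks actual observations (medoids) as cluster centers.
fit <- pam(toy, k = 2)

print(fit$medoids)                  # coordinates of the two medoid observations
print(table(fit$clustering))        # cluster sizes
print(fit$silinfo$clus.avg.widths)  # per-cluster average silhouette width
```

Unlike K-means centroids, which are averages and need not coincide with any observation, each medoid is a real row of the data, which tends to make PAM less sensitive to outliers.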

Final comparison of 2 and 3 clusters

The final question

To answer whether sexual dimorphism in penguins is sufficient for the K-means and PAM algorithms to tell the sexes apart, the outputs of those algorithms must be related back to the original data.

Because no rows were added, removed or reordered between scaling the data and clustering, the row order is preserved and the results can simply be appended to the original data set. Each row receives its cluster assignment from both algorithms.

# relating 2 and 3 cluster k-means to original data - row order is preserved in the data set
penguins$cluster2c <- penguins_kmeans_2c$cluster
penguins$cluster3c <- penguins_kmeans_3c$cluster

# relating 2 and 3 cluster PAM to original data frame
penguins$cluster_pam2c <- penguins_PAM_2c$cluster
penguins$cluster_pam3c <- penguins_PAM_3c$cluster

Next, the Sex variable in the original data is converted to a binary variable and compared with the cluster assignment, to see whether the algorithms “guessed” correctly or not.

## [1] "K-means"
## # A tibble: 2 × 2
##   gender_score count
##   <chr>        <int>
## 1 False          165
## 2 True           168
## [1] "PAM"
## # A tibble: 2 × 2
##   gender_score2 count
##   <chr>         <int>
## 1 False           165
## 2 True            168

Because both algorithms produced the same 2-cluster split, the answers are exactly the same for both: 168 of 333 assignments (about 50.5%) match the recorded sex, which is no better than chance.
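One caveat when scoring clusters against known labels is that cluster numbers are arbitrary: cluster 1 could correspond to either sex. The sketch below, on made-up labels rather than the penguin data, cross-tabulates clusters against sex and scores both possible mappings. The scoring code used in this report is not shown, so this is one possible approach and not necessarily the calculation used above.

```r
# Toy labels: the known sex and an arbitrary cluster number per observation.
sex     <- c("male", "female", "female", "male", "female", "male")
cluster <- c(1, 1, 2, 2, 2, 1)

# Cross-tabulate clusters against sex.
tab <- table(cluster, sex)
print(tab)

# Cluster numbers carry no meaning, so score both possible mappings
# (cluster 1 = female or cluster 1 = male) and keep the better one.
agree_a <- tab["1", "female"] + tab["2", "male"]
agree_b <- tab["1", "male"] + tab["2", "female"]
accuracy <- max(agree_a, agree_b) / length(sex)
print(accuracy)
```

With 2 clusters the better mapping always scores at least 50%, so a result close to 50% is the weakest agreement possible.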

Conclusions

Based on the data available, neither PAM nor K-means can meaningfully differentiate between the sexes from basic body measurements alone. This could mean that there is not much sexual dimorphism in Antarctic penguins, but that is not certain: more research could be done, for example comparing the average measurements of male and female penguins with the centroids and medoids found by the algorithms. Additionally, while clustering with 3 or more clusters fit the data less well than clustering with 2, it gives insight into how the data might be partitioned if more clusters were considered. Differences between the penguin species could also be explored.