Ilina Maksimovska

Research question: Can we categorize the penguins in the data set into homogeneous groups based on three cluster variables (the length and depth of the penguin’s culmen and its body mass)?

mydata <- read.csv("./penguins.csv")

head(mydata)
##   culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g    sex
## 1             39.1            18.7               181        3750   MALE
## 2             39.5            17.4               186        3800 FEMALE
## 3             40.3            18.0               195        3250 FEMALE
## 4             36.7            19.3               193        3450 FEMALE
## 5             39.3            20.6               190        3650   MALE
## 6             38.9            17.8               181        3625 FEMALE

Explanation of a dataset:

Explanation of the variables in the data set:

library(tidyr)
mydata <- drop_na(mydata) #Drop non availables 
mydata$sex <- as.factor(mydata$sex)
summary(mydata) #Descriptive statistics
##  culmen_length_mm culmen_depth_mm flipper_length_mm  body_mass_g       sex     
##  Min.   :32.10    Min.   :13.10   Min.   : 172.0    Min.   :2700   FEMALE:165  
##  1st Qu.:39.50    1st Qu.:15.60   1st Qu.: 190.0    1st Qu.:3550   MALE  :168  
##  Median :44.50    Median :17.30   Median : 197.0    Median :4050               
##  Mean   :44.02    Mean   :17.16   Mean   : 215.4    Mean   :4207               
##  3rd Qu.:48.60    3rd Qu.:18.70   3rd Qu.: 213.0    3rd Qu.:4775               
##  Max.   :59.60    Max.   :21.50   Max.   :5000.0    Max.   :6300

Descriptive statistics

mydata$Culmenlength_z <- scale(mydata$culmen_length_mm) #Standardization of variables
mydata$Culmendepth_z   <- scale(mydata$culmen_depth_mm)
mydata$Bodymass_z <- scale(mydata$body_mass_g)
mydata$id <- seq_along(mydata$culmen_length_mm)
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[, c("Culmenlength_z", "Culmendepth_z", "Bodymass_z")]), 
      type = "pearson") #Correlation matrix
##                Culmenlength_z Culmendepth_z Bodymass_z
## Culmenlength_z           1.00         -0.22       0.59
## Culmendepth_z           -0.22          1.00      -0.47
## Bodymass_z               0.59         -0.47       1.00
## 
## n= 333 
## 
## 
## P
##                Culmenlength_z Culmendepth_z Bodymass_z
## Culmenlength_z                 0             0        
## Culmendepth_z   0                            0        
## Bodymass_z      0              0

Even though the correlation between body mass and culmen depth is above 0.3 as well as the correlation between body mass and culmen length, for educational purposes we won’t drop any variable and we will assume that it is okay.

mydata$Dissimilarity <- sqrt(mydata$Culmenlength_z^2 + mydata$Culmendepth_z^2 + mydata$Bodymass_z^2) #Finding outliers
head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
##     culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g    sex
## 247             59.6            17.0               230        6050   MALE
## 232             49.2            15.2               221        6300   MALE
## 327             55.1            16.0               230        5850   MALE
## 314             55.9            17.0               228        5600   MALE
## 137             32.1            15.5               188        3050 FEMALE
## 277             54.3            15.7               231        5650   MALE
## 164             58.0            17.8               181        3700 FEMALE
## 93              33.1            16.1               178        2900 FEMALE
## 178             54.2            20.8               201        4300   MALE
## 235             50.2            14.3               218        5700   MALE
##     Culmenlength_z Culmendepth_z Bodymass_z  id Dissimilarity
## 247      2.8620613   -0.08254921  2.2895045 247      3.666066
## 232      0.9521822   -0.99884549  2.6000058 232      2.943531
## 327      2.0356713   -0.59160270  2.0411034 327      2.942797
## 314      2.1825851   -0.08254921  1.7306021 314      2.786660
## 137     -2.1880999   -0.84612945 -1.4365116 137      2.750869
## 277      1.8887575   -0.74431875  1.7927024 277      2.708357
## 164      2.5682337    0.32469358 -0.6292081 164      2.664048
## 93      -2.0044576   -0.54069735 -1.6228124  93      2.635095
## 178      1.8703933    1.85185404  0.1159951 178      2.634614
## 235      1.1358244   -1.45699363  1.8548026 235      2.617866
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>% filter(id != 247) #Removing unit due to high dissimilarity and it was an outlier on the cluster plot
#install.packages("factoextra")
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")], 
                     method = "euclidian")

distance2 <- distance^2

fviz_dist(distance2) #Showing dissimilarity matrix

Based on the dissimilarity matrix, we can see that there are some clusters forming along the line,

get_clust_tendency(mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")], #Hopkins statistics
                   n = nrow(mydata) - 1, 
                   graph = FALSE)
## $hopkins_stat
## [1] 0.7699599
## 
## $plot
## NULL

Data is clusterable since Hopkins statistics is above 0.5.

library(dplyr) #Allowing use of %>%

WARD <- mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")] %>%
  #scale() %>%                           
  get_dist(method = "euclidean") %>%  #Selecting distance
  hclust(method = "ward.D2") #Selecting algorithm

WARD
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 332
library(factoextra)
fviz_dend(WARD) #Dendogram
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

fviz_dend(WARD, 
          k = 5,
          cex = 0.5, 
          palette = "jama",
          color_labels_by_k = TRUE, 
          rect = TRUE)

mydata$ClusterWard <- cutree(WARD, 
                             k = 5) #Cutting the dendrogram
head(mydata)
##   culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g    sex
## 1             39.1            18.7               181        3750   MALE
## 2             39.5            17.4               186        3800 FEMALE
## 3             40.3            18.0               195        3250 FEMALE
## 4             36.7            19.3               193        3450 FEMALE
## 5             39.3            20.6               190        3650   MALE
## 6             38.9            17.8               181        3625 FEMALE
##   Culmenlength_z Culmendepth_z Bodymass_z id Dissimilarity ClusterWard
## 1     -0.9026043     0.7828417 -0.5671079  1      1.322553           1
## 2     -0.8291474     0.1210722 -0.5050076  2      0.978354           1
## 3     -0.6822336     0.4265043 -1.1881105  3      1.434906           1
## 4     -1.3433456     1.0882738 -0.9397095  4      1.967733           1
## 5     -0.8658758     1.7500433 -0.6913084  5      2.071304           2
## 6     -0.9393327     0.3246936 -0.7223585  6      1.228647           1
#Calculating positions of initial leaders

Initial_leaders <- aggregate(mydata[, c("Culmenlength_z","Culmendepth_z", "Bodymass_z")], 
                             by = list(mydata$ClusterWard), 
                             FUN = mean)

Initial_leaders
##   Group.1 Culmenlength_z Culmendepth_z Bodymass_z
## 1       1     -1.1862295     0.2670008 -0.9586846
## 2       2     -0.5079193     1.0300963 -0.1614847
## 3       3      1.0018559     0.6927142 -0.5650718
## 4       4      0.2194497    -1.5516776  0.5221309
## 5       5      0.9384090    -0.7869894  1.5077717
library(factoextra)

#Performing K-Means clustering - initial leaders are chosen as centroids of groups, found with hierarchical clustering
K_MEANS <- hkmeans(mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")], 
                   k = 5,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS
## Hierarchical K-means clustering with 5 clusters of sizes 78, 73, 63, 61, 57
## 
## Cluster means:
##   Culmenlength_z Culmendepth_z Bodymass_z
## 1     -1.1533466     0.1484828 -1.0575408
## 2     -0.6829883     1.0269085 -0.2072666
## 3      0.9868701     0.7085037 -0.5286648
## 4      0.2733081    -1.4603317  0.6021079
## 5      1.0195177    -0.7371741  1.6123937
## 
## Clustering vector:
##   [1] 2 1 1 1 2 1 2 2 1 2 1 2 2 1 2 1 1 2 2 1 2 2 1 1 2 1 1 1 2 1 2 2 2 1 2 1 2
##  [38] 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 3 1 2 1 2 1 2
##  [75] 1 2 2 2 1 2 2 1 2 2 1 2 1 2 1 2 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 2 1
## [112] 2 1 1 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 2 2 1 1 1 1 2 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 4 5 4 5 5 4 4 5
## [223] 4 5 4 5 4 5 4 5 4 5 4 5 5 4 4 5 4 4 5 4 5 5 4 4 5 5 4 5 4 5 4 5 4 4 5 4 4
## [260] 5 4 5 4 5 4 5 4 4 4 4 4 5 4 5 4 5 4 5 5 4 5 4 4 5 4 4 5 4 5 4 5 4 5 4 5 4
## [297] 5 4 5 4 5 4 5 4 5 4 5 5 4 4 5 4 5 4 5 5 4 5 4 5 5 5 4 5 4 5 5 4 4 5 4 5
## 
## Within cluster sum of squares by cluster:
## [1] 35.78330 40.79583 48.73695 19.96303 24.64238
##  (between_SS / total_SS =  82.7 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

Number of units per group:

Group 1: 78

Group 2: 73

Group 3: 63

Group 4: 61

Group 5: 57

Ratio between sum of squares of between groups divided by total sum of squares: 82.7% We want this ratio to be as high as possible, 82.7% is a good number meaning that the cluster groups differentiate between the units.

fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic()) #Cluster plot

mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("id", "ClusterWard", "ClusterK_Means")])
##   id ClusterWard ClusterK_Means
## 1  1           1              2
## 2  2           1              1
## 3  3           1              1
## 4  4           1              1
## 5  5           2              2
## 6  6           1              1

We can see that there were some reclassifications, for example unit 1 was reclassified from group 1 to group 2

#Checking for reclassifications
table(mydata$ClusterWard)
## 
##  1  2  3  4  5 
## 90 63 61 50 68

Initially, with hierarchical clustering there were 90 units in group 1, 63 units in group 2, 61 units in group 3, 50 units in group 4 and 68 units in group 5.

table(mydata$ClusterK_Means)#Checking for reclassifications
## 
##  1  2  3  4  5 
## 78 73 63 61 57

If we do K Means clustering there are 78 units in group 1, 73 units in group 2, 63 units in group 3, 61 units in group 4 and 57 units in group 5.

table(mydata$ClusterWard, mydata$ClusterK_Means)#Checking for reclassifications
##    
##      1  2  3  4  5
##   1 75 15  0  0  0
##   2  2 58  3  0  0
##   3  1  0 60  0  0
##   4  0  0  0 50  0
##   5  0  0  0 11 57

Explanation of first row: With hierarchical clustering there were 90 units in group 1. 75 units are still clustered in group 1, 15 units were reclassified to group 2.

Explanation of first column: In first group performed my K means clustering there are 78 units. From those 78 units, 75 were already in group 1, 2 units come from group 2 and 1 unit comes from group 3.

Explanation of second row: With hierarchical clustering there were 63 units in group 2. 58 units are still clustered in group 2, 2 units were reclassified to group 1 and 3 units were reclassified to group 3.

Explanation of second column: In second group performed my K means clustering there are 73 units. From those 73 units, 58 were already in group 2 and 15 units come from group 1.

Explanation of third row: With hierarchical clustering there were 61 units in group 3. 60 units are still clustered in group 3, 1 units were reclassified to group 1.

Explanation of third column: In third group performed my K means clustering there are 63 units. From those 63 units, 60 were already in group 3 and 3 units come from group 2.

Explanation of forth row: With hierarchical clustering there were 50 units in group 4 and all of them are still clustered in group 4.

Explanation of forth column: In forth group performed my K means clustering there are 61 units. From those 61 units, 50 were already in group 4 and 11 units come from group 5.

Explanation of fifth row: With hierarchical clustering there were 68 units in group 5. 57 units are still clustered in group 5, 11 units were reclassified to group 4.

Explanation of fifth column: In fifth group performed my K means clustering there are 57 units. From those 57 units, all of them were already in group 5.

Centroids <- K_MEANS$centers
Centroids
##   Culmenlength_z Culmendepth_z Bodymass_z
## 1     -1.1533466     0.1484828 -1.0575408
## 2     -0.6829883     1.0269085 -0.2072666
## 3      0.9868701     0.7085037 -0.5286648
## 4      0.2733081    -1.4603317  0.6021079
## 5      1.0195177    -0.7371741  1.6123937
library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("Culmenlength_z","Culmendepth_z", "Bodymass_z"))

Figure$Groups <- factor(Figure$id, 
                        levels = c(1, 2, 3, 4, 5), 
                        labels = c("1", "2", "3", "4", "5"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("Culmenlength_z","Culmendepth_z", "Bodymass_z"), 
                            labels = c("Culmenlength_z","Culmendepth_z", "Bodymass_z"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-2, 2)

Group 1:

This group exhibits the lowest culmen length and body mass among the clusters, with each falling approximately one standard deviation below the mean. However, it notably deviates from the average in culmen depth, showing a measurement slightly above the mean.

Group 2:

In this cluster, both culmen length and body mass are below the average. Conversely, the group stands out with a distinctive feature—a culmen depth above the average, which is the highest among the clusters, measuring one standard deviation higher than the mean.

Group 3:

Group 3 displays a notable characteristic with culmen length measuring one standard deviation above the mean. It shares the highest culmen length with Group 5. Additionally, it exhibits culmen depth above the mean. However, the body mass of this group falls below the mean, marking a distinctive contrast with the other attributes.

Group 4:

Within Group 4, both culmen length and body mass surpass the average, standing at values above the mean. The group is uniquely characterized by its culmen depth, notably the lowest among the clusters, measuring 1.5 standard deviations below the mean.

Group 5:

In Group 5, the culmen length exceeds the mean by 1 standard deviation, signifying a significant deviation from the average. However, the culmen depth falls below the mean, showcasing a distinct characteristic. Notably, this group’s body mass surpasses the mean by 1.5 standard deviations, making it the highest among all clusters. Additionally, Group 5 shares the highest culmen length with Group 3 as mentioned above.

#Are all the cluster variables successful at classifing units into groups? Performing ANOVAs.

fit <- aov(cbind(Culmenlength_z, Culmendepth_z, Bodymass_z) ~ as.factor(ClusterK_Means), 
                 data = mydata)

summary(fit)
##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 262.94  65.736  353.31 < 2.2e-16 ***
## Residuals                 327  60.84   0.186                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 271.388  67.847  366.07 < 2.2e-16 ***
## Residuals                 327  60.605   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   4 278.266  69.567  469.27 < 2.2e-16 ***
## Residuals                 327  48.476   0.148                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: arithmetic mean (culmenlength, G1) = arithmetic mean (culmenlength, G2) = arithmetic mean (culmenlength, G3) = arithmetic mean (culmenlength, G4) = arithmetic mean (culmenlength, G5)

H1: At least one arithmetic mean of culmenlength is different among the groups.

We reject the null hypothesis at p < 0.001 and assume that at least one group has different arithmetic mean of culmen length than the others.

H0: Arithmetic mean (culmendepth, G1) = Arithmetic mean (culmendepth, G2) = Arithmetic mean (culmendepth, G3) = Arithmetic mean (culmendepth, G4) = Arithmetic mean (culmendepth, G5)

H1: At least one arithmetic mean of culmendepth is different among the groups.

We reject the null hypothesis at p < 0.001 and assume that at least one group has different arithmetic mean of culmen depth than the others.

H0: Arithmetic mean (bodymass, G1)=Arithmetic mean (bodymass, G2)=Arithmetic mean (bodymass, G3)=Arithmetic mean (bodymass, G4)=Arithmetic mean (bodymass, G5)

H1: At least one arithmetic mean of bodymass is different among the groups.

We reject the null hypothesis at p < 0.001 and assume that at least one group has different arithmetic mean of bodymass than the others.

Validation of results

#Pearson Chi2 test

chisq_results <- chisq.test(mydata$sex, as.factor(mydata$ClusterK_Means))
chisq_results
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$sex and as.factor(mydata$ClusterK_Means)
## X-squared = 189.71, df = 4, p-value < 2.2e-16

Null Hypothesis: There is no association between the variables sex and clusters by K Means.

Alternative Hypothesis: There is a significant association between the variables sex and clusters by K Means.

We reject the null hypothesis at p < 0.001 and assume that there is association between the variables sex and clusters by K Means..

addmargins(chisq_results$observed)
##           
## mydata$sex   1   2   3   4   5 Sum
##     FEMALE  71   8  28  56   2 165
##     MALE     7  65  35   5  55 167
##     Sum     78  73  63  61  57 332
round(chisq_results$expected, 2)
##           
## mydata$sex     1     2     3     4     5
##     FEMALE 38.77 36.28 31.31 30.32 28.33
##     MALE   39.23 36.72 31.69 30.68 28.67

All expected frequencies are above 5 which is good.

round(chisq_results$res, 2)
##           
## mydata$sex     1     2     3     4     5
##     FEMALE  5.18 -4.70 -0.59  4.66 -4.95
##     MALE   -5.15  4.67  0.59 -4.64  4.92

There is less than expected number of penguins in category Male and group 1 (alpha = 0.001).

There is more than expected number of penguins in category Female and group 1 (alpha = 0.001).

There is less than expected number of penguins in category Female and group 2 (alpha = 0.001).

There is more than expected number of penguins in category Male and group 2 (alpha = 0.001).

There is less than expected number of penguins in category Male and group 4 (alpha = 0.001).

There is more than expected number of penguins in category Female and group 4 (alpha = 0.001).

There is less than expected number of penguins in category Female and group 5 (alpha = 0.001).

There is more than expected number of penguins in category Male and group 5 (alpha = 0.001).

Conclusion

Research question: Can we categorize the penguins in the data set into homogeneous groups based on three cluster variables (the length and depth of the penguin’s culmen and its body mass)?

Answer: Yes, we can. We clustered them into 5 groups.

Group 1: After K Means clustering, 78 penguins, constituting 23.5% of the sample, fall into the first group. Within this group, 91% are female, while 9% are male. There are more females in this group than expected and less males. This group is characterized by the lowest culmen length and body mass among the clusters, with each falling approximately one standard deviation below the mean. Additionally, their culmen depth is slightly above the mean.

Group 2: After K Means clustering, 73 penguins, constituting 22% of the sample, fall into the second group. Within this group, 11% are female, while 89% are male. There are less females in this group than expected and more males. This cluster is characterized by below the average of both culmen length and body mass. Conversely, the group stands out with a distinctive feature—a culmen depth above the average, which is the highest among the clusters, measuring one standard deviation above the mean.

Group 3: After K Means clustering, 63 penguins, constituting 19% of the sample, fall into the third group. Within this group, 44% are female, while 56% are male. This group is characterized by culmen length measuring one standard deviation above the mean. It shares the highest culmen length with Group 5. Additionally, it exhibits culmen depth above the mean. However, the body mass of this group falls below the mean, marking a distinctive contrast with the other attributes.

Group 4: After K Means clustering, 61 penguins, constituting 18.5% of the sample, fall into the fourth group. Within this group, 92% are female, while 8% are male. There are more females in this group than expected and less males. This group is characterized by both culmen length and body mass that surpass the average, standing at values above the mean. The group is uniquely characterized by its culmen depth, notably the lowest among the clusters, measuring 1.5 standard deviations below the mean.

Group 5: After K Means clustering, 57 penguins, constituting 17% of the sample, fall into the fifth group. Within this group, 4% are female, while 96% are male. There are less females in this group than expected and more males. This cluster is characterized by the culmen length exceeds the mean by 1 standard deviation, signifying a significant deviation from the average. However, the culmen depth falls below the mean, showcasing a distinct characteristic. Notably, this group’s body mass surpasses the mean by 1.5 standard deviations, making it the highest among all clusters. Additionally, Group 5 shares the highest culmen length with Group 3 as mentioned above.