mydata <- read.csv("./penguins.csv")
head(mydata)
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
## 1 39.1 18.7 181 3750 MALE
## 2 39.5 17.4 186 3800 FEMALE
## 3 40.3 18.0 195 3250 FEMALE
## 4 36.7 19.3 193 3450 FEMALE
## 5 39.3 20.6 190 3650 MALE
## 6 38.9 17.8 181 3625 FEMALE
Unit of observation: one penguin
Sample size: 338 observations
Number of variables: 5
Source: Aboelwafa, Y. (2023, October 31). Clustering Penguins species. Kaggle. https://www.kaggle.com/datasets/youssefaboelwafa/clustering-penguins-species
culmen_length_mm: The length of the penguin’s beak (culmen) measured in millimeters.
culmen_depth_mm: The depth or thickness of the penguin’s beak (culmen) measured in millimeters.
flipper_length_mm: The length of the penguin’s flipper (wing) measured in millimeters.
body_mass_g: The mass or weight of the penguin’s body measured in grams.
sex: The sex of the penguin (Male or Female).
library(tidyr)
mydata <- drop_na(mydata) #Drop non availables
mydata$sex <- as.factor(mydata$sex)
summary(mydata) #Descriptive statistics
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
## Min. :32.10 Min. :13.10 Min. : 172.0 Min. :2700 FEMALE:165
## 1st Qu.:39.50 1st Qu.:15.60 1st Qu.: 190.0 1st Qu.:3550 MALE :168
## Median :44.50 Median :17.30 Median : 197.0 Median :4050
## Mean :44.02 Mean :17.16 Mean : 215.4 Mean :4207
## 3rd Qu.:48.60 3rd Qu.:18.70 3rd Qu.: 213.0 3rd Qu.:4775
## Max. :59.60 Max. :21.50 Max. :5000.0 Max. :6300
Minimum Culmen Length: The smallest culmen length observed in the dataset is 32.10 mm.
1st Quartile Culmen Length: 25% of the penguins have a culmen length less than or equal to 39.50 mm.
Median Culmen Depth: The median culmen depth is 17.30 mm, representing that 50% of the observations have a culmen depth equal to or less than 17.30 mm, and the other 50% have a culmen depth greater than 17.30 mm.
Mean Culmen Depth: On average, the culmen depth for the penguins is approximately 17.16 mm.
3rd Quartile Body Mass: 75% of the penguins have a body mass less than or equal to 4775 grams and 25% of the penguins have a body mass higher than 4775 grams.
Maximum Body Mass: The largest body mass observed in the dataset is 6300 grams.
There are 165 female penguins and 168 male penguins in the dataset
mydata$Culmenlength_z <- scale(mydata$culmen_length_mm) #Standardization of variables
mydata$Culmendepth_z <- scale(mydata$culmen_depth_mm)
mydata$Bodymass_z <- scale(mydata$body_mass_g)
mydata$id <- seq_along(mydata$culmen_length_mm)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[, c("Culmenlength_z", "Culmendepth_z", "Bodymass_z")]),
type = "pearson") #Correlation matrix
## Culmenlength_z Culmendepth_z Bodymass_z
## Culmenlength_z 1.00 -0.22 0.59
## Culmendepth_z -0.22 1.00 -0.47
## Bodymass_z 0.59 -0.47 1.00
##
## n= 333
##
##
## P
## Culmenlength_z Culmendepth_z Bodymass_z
## Culmenlength_z 0 0
## Culmendepth_z 0 0
## Bodymass_z 0 0
Even though the correlation between body mass and culmen depth is above 0.3 as well as the correlation between body mass and culmen length, for educational purposes we won’t drop any variable and we will assume that it is okay.
mydata$Dissimilarity <- sqrt(mydata$Culmenlength_z^2 + mydata$Culmendepth_z^2 + mydata$Bodymass_z^2) #Finding outliers
head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
## 247 59.6 17.0 230 6050 MALE
## 232 49.2 15.2 221 6300 MALE
## 327 55.1 16.0 230 5850 MALE
## 314 55.9 17.0 228 5600 MALE
## 137 32.1 15.5 188 3050 FEMALE
## 277 54.3 15.7 231 5650 MALE
## 164 58.0 17.8 181 3700 FEMALE
## 93 33.1 16.1 178 2900 FEMALE
## 178 54.2 20.8 201 4300 MALE
## 235 50.2 14.3 218 5700 MALE
## Culmenlength_z Culmendepth_z Bodymass_z id Dissimilarity
## 247 2.8620613 -0.08254921 2.2895045 247 3.666066
## 232 0.9521822 -0.99884549 2.6000058 232 2.943531
## 327 2.0356713 -0.59160270 2.0411034 327 2.942797
## 314 2.1825851 -0.08254921 1.7306021 314 2.786660
## 137 -2.1880999 -0.84612945 -1.4365116 137 2.750869
## 277 1.8887575 -0.74431875 1.7927024 277 2.708357
## 164 2.5682337 0.32469358 -0.6292081 164 2.664048
## 93 -2.0044576 -0.54069735 -1.6228124 93 2.635095
## 178 1.8703933 1.85185404 0.1159951 178 2.634614
## 235 1.1358244 -1.45699363 1.8548026 235 2.617866
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>% filter(id != 247) #Removing unit due to high dissimilarity and it was an outlier on the cluster plot
#install.packages("factoextra")
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")],
method = "euclidian")
distance2 <- distance^2
fviz_dist(distance2) #Showing dissimilarity matrix
Based on the dissimilarity matrix, we can see that there are some clusters forming along the line,
get_clust_tendency(mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")], #Hopkins statistics
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7699599
##
## $plot
## NULL
Data is clusterable since Hopkins statistics is above 0.5.
library(dplyr) #Allowing use of %>%
WARD <- mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")] %>%
#scale() %>%
get_dist(method = "euclidean") %>% #Selecting distance
hclust(method = "ward.D2") #Selecting algorithm
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 332
library(factoextra)
fviz_dend(WARD) #Dendogram
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
fviz_dend(WARD,
k = 5,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
mydata$ClusterWard <- cutree(WARD,
k = 5) #Cutting the dendrogram
head(mydata)
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
## 1 39.1 18.7 181 3750 MALE
## 2 39.5 17.4 186 3800 FEMALE
## 3 40.3 18.0 195 3250 FEMALE
## 4 36.7 19.3 193 3450 FEMALE
## 5 39.3 20.6 190 3650 MALE
## 6 38.9 17.8 181 3625 FEMALE
## Culmenlength_z Culmendepth_z Bodymass_z id Dissimilarity ClusterWard
## 1 -0.9026043 0.7828417 -0.5671079 1 1.322553 1
## 2 -0.8291474 0.1210722 -0.5050076 2 0.978354 1
## 3 -0.6822336 0.4265043 -1.1881105 3 1.434906 1
## 4 -1.3433456 1.0882738 -0.9397095 4 1.967733 1
## 5 -0.8658758 1.7500433 -0.6913084 5 2.071304 2
## 6 -0.9393327 0.3246936 -0.7223585 6 1.228647 1
#Calculating positions of initial leaders
Initial_leaders <- aggregate(mydata[, c("Culmenlength_z","Culmendepth_z", "Bodymass_z")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 Culmenlength_z Culmendepth_z Bodymass_z
## 1 1 -1.1862295 0.2670008 -0.9586846
## 2 2 -0.5079193 1.0300963 -0.1614847
## 3 3 1.0018559 0.6927142 -0.5650718
## 4 4 0.2194497 -1.5516776 0.5221309
## 5 5 0.9384090 -0.7869894 1.5077717
library(factoextra)
#Performing K-Means clustering - initial leaders are chosen as centroids of groups, found with hierarchical clustering
K_MEANS <- hkmeans(mydata[c("Culmenlength_z","Culmendepth_z", "Bodymass_z")],
k = 5,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 5 clusters of sizes 78, 73, 63, 61, 57
##
## Cluster means:
## Culmenlength_z Culmendepth_z Bodymass_z
## 1 -1.1533466 0.1484828 -1.0575408
## 2 -0.6829883 1.0269085 -0.2072666
## 3 0.9868701 0.7085037 -0.5286648
## 4 0.2733081 -1.4603317 0.6021079
## 5 1.0195177 -0.7371741 1.6123937
##
## Clustering vector:
## [1] 2 1 1 1 2 1 2 2 1 2 1 2 2 1 2 1 1 2 2 1 2 2 1 1 2 1 1 1 2 1 2 2 2 1 2 1 2
## [38] 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 3 1 2 1 2 1 2
## [75] 1 2 2 2 1 2 2 1 2 2 1 2 1 2 1 2 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 2 1
## [112] 2 1 1 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 2 2 1 1 1 1 2 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 3 3 3 4 5 4 5 5 4 4 5
## [223] 4 5 4 5 4 5 4 5 4 5 4 5 5 4 4 5 4 4 5 4 5 5 4 4 5 5 4 5 4 5 4 5 4 4 5 4 4
## [260] 5 4 5 4 5 4 5 4 4 4 4 4 5 4 5 4 5 4 5 5 4 5 4 4 5 4 4 5 4 5 4 5 4 5 4 5 4
## [297] 5 4 5 4 5 4 5 4 5 4 5 5 4 4 5 4 5 4 5 5 4 5 4 5 5 5 4 5 4 5 5 4 4 5 4 5
##
## Within cluster sum of squares by cluster:
## [1] 35.78330 40.79583 48.73695 19.96303 24.64238
## (between_SS / total_SS = 82.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
Number of units per group:
Group 1: 78
Group 2: 73
Group 3: 63
Group 4: 61
Group 5: 57
Ratio between sum of squares of between groups divided by total sum of squares: 82.7% We want this ratio to be as high as possible, 82.7% is a good number meaning that the cluster groups differentiate between the units.
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic()) #Cluster plot
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("id", "ClusterWard", "ClusterK_Means")])
## id ClusterWard ClusterK_Means
## 1 1 1 2
## 2 2 1 1
## 3 3 1 1
## 4 4 1 1
## 5 5 2 2
## 6 6 1 1
We can see that there were some reclassifications, for example unit 1 was reclassified from group 1 to group 2
#Checking for reclassifications
table(mydata$ClusterWard)
##
## 1 2 3 4 5
## 90 63 61 50 68
Initially, with hierarchical clustering there were 90 units in group 1, 63 units in group 2, 61 units in group 3, 50 units in group 4 and 68 units in group 5.
table(mydata$ClusterK_Means)#Checking for reclassifications
##
## 1 2 3 4 5
## 78 73 63 61 57
If we do K Means clustering there are 78 units in group 1, 73 units in group 2, 63 units in group 3, 61 units in group 4 and 57 units in group 5.
table(mydata$ClusterWard, mydata$ClusterK_Means)#Checking for reclassifications
##
## 1 2 3 4 5
## 1 75 15 0 0 0
## 2 2 58 3 0 0
## 3 1 0 60 0 0
## 4 0 0 0 50 0
## 5 0 0 0 11 57
Explanation of first row: With hierarchical clustering there were 90 units in group 1. 75 units are still clustered in group 1, 15 units were reclassified to group 2.
Explanation of first column: In first group performed my K means clustering there are 78 units. From those 78 units, 75 were already in group 1, 2 units come from group 2 and 1 unit comes from group 3.
Explanation of second row: With hierarchical clustering there were 63 units in group 2. 58 units are still clustered in group 2, 2 units were reclassified to group 1 and 3 units were reclassified to group 3.
Explanation of second column: In second group performed my K means clustering there are 73 units. From those 73 units, 58 were already in group 2 and 15 units come from group 1.
Explanation of third row: With hierarchical clustering there were 61 units in group 3. 60 units are still clustered in group 3, 1 units were reclassified to group 1.
Explanation of third column: In third group performed my K means clustering there are 63 units. From those 63 units, 60 were already in group 3 and 3 units come from group 2.
Explanation of forth row: With hierarchical clustering there were 50 units in group 4 and all of them are still clustered in group 4.
Explanation of forth column: In forth group performed my K means clustering there are 61 units. From those 61 units, 50 were already in group 4 and 11 units come from group 5.
Explanation of fifth row: With hierarchical clustering there were 68 units in group 5. 57 units are still clustered in group 5, 11 units were reclassified to group 4.
Explanation of fifth column: In fifth group performed my K means clustering there are 57 units. From those 57 units, all of them were already in group 5.
Centroids <- K_MEANS$centers
Centroids
## Culmenlength_z Culmendepth_z Bodymass_z
## 1 -1.1533466 0.1484828 -1.0575408
## 2 -0.6829883 1.0269085 -0.2072666
## 3 0.9868701 0.7085037 -0.5286648
## 4 0.2733081 -1.4603317 0.6021079
## 5 1.0195177 -0.7371741 1.6123937
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("Culmenlength_z","Culmendepth_z", "Bodymass_z"))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3, 4, 5),
labels = c("1", "2", "3", "4", "5"))
Figure$nameFactor <- factor(Figure$name,
levels = c("Culmenlength_z","Culmendepth_z", "Bodymass_z"),
labels = c("Culmenlength_z","Culmendepth_z", "Bodymass_z"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
ylim(-2, 2)
Group 1:
This group exhibits the lowest culmen length and body mass among the clusters, with each falling approximately one standard deviation below the mean. However, it notably deviates from the average in culmen depth, showing a measurement slightly above the mean.
Group 2:
In this cluster, both culmen length and body mass are below the average. Conversely, the group stands out with a distinctive feature—a culmen depth above the average, which is the highest among the clusters, measuring one standard deviation higher than the mean.
Group 3:
Group 3 displays a notable characteristic with culmen length measuring one standard deviation above the mean. It shares the highest culmen length with Group 5. Additionally, it exhibits culmen depth above the mean. However, the body mass of this group falls below the mean, marking a distinctive contrast with the other attributes.
Group 4:
Within Group 4, both culmen length and body mass surpass the average, standing at values above the mean. The group is uniquely characterized by its culmen depth, notably the lowest among the clusters, measuring 1.5 standard deviations below the mean.
Group 5:
In Group 5, the culmen length exceeds the mean by 1 standard deviation, signifying a significant deviation from the average. However, the culmen depth falls below the mean, showcasing a distinct characteristic. Notably, this group’s body mass surpasses the mean by 1.5 standard deviations, making it the highest among all clusters. Additionally, Group 5 shares the highest culmen length with Group 3 as mentioned above.
#Are all the cluster variables successful at classifing units into groups? Performing ANOVAs.
fit <- aov(cbind(Culmenlength_z, Culmendepth_z, Bodymass_z) ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 262.94 65.736 353.31 < 2.2e-16 ***
## Residuals 327 60.84 0.186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 271.388 67.847 366.07 < 2.2e-16 ***
## Residuals 327 60.605 0.185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 4 278.266 69.567 469.27 < 2.2e-16 ***
## Residuals 327 48.476 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: arithmetic mean (culmenlength, G1) = arithmetic mean (culmenlength, G2) = arithmetic mean (culmenlength, G3) = arithmetic mean (culmenlength, G4) = arithmetic mean (culmenlength, G5)
H1: At least one arithmetic mean of culmenlength is different among the groups.
We reject the null hypothesis at p < 0.001 and assume that at least one group has different arithmetic mean of culmen length than the others.
H0: Arithmetic mean (culmendepth, G1) = Arithmetic mean (culmendepth, G2) = Arithmetic mean (culmendepth, G3) = Arithmetic mean (culmendepth, G4) = Arithmetic mean (culmendepth, G5)
H1: At least one arithmetic mean of culmendepth is different among the groups.
We reject the null hypothesis at p < 0.001 and assume that at least one group has different arithmetic mean of culmen depth than the others.
H0: Arithmetic mean (bodymass, G1)=Arithmetic mean (bodymass, G2)=Arithmetic mean (bodymass, G3)=Arithmetic mean (bodymass, G4)=Arithmetic mean (bodymass, G5)
H1: At least one arithmetic mean of bodymass is different among the groups.
We reject the null hypothesis at p < 0.001 and assume that at least one group has different arithmetic mean of bodymass than the others.
#Pearson Chi2 test
chisq_results <- chisq.test(mydata$sex, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$sex and as.factor(mydata$ClusterK_Means)
## X-squared = 189.71, df = 4, p-value < 2.2e-16
Null Hypothesis: There is no association between the variables sex and clusters by K Means.
Alternative Hypothesis: There is a significant association between the variables sex and clusters by K Means.
We reject the null hypothesis at p < 0.001 and assume that there is association between the variables sex and clusters by K Means..
addmargins(chisq_results$observed)
##
## mydata$sex 1 2 3 4 5 Sum
## FEMALE 71 8 28 56 2 165
## MALE 7 65 35 5 55 167
## Sum 78 73 63 61 57 332
round(chisq_results$expected, 2)
##
## mydata$sex 1 2 3 4 5
## FEMALE 38.77 36.28 31.31 30.32 28.33
## MALE 39.23 36.72 31.69 30.68 28.67
All expected frequencies are above 5 which is good.
round(chisq_results$res, 2)
##
## mydata$sex 1 2 3 4 5
## FEMALE 5.18 -4.70 -0.59 4.66 -4.95
## MALE -5.15 4.67 0.59 -4.64 4.92
There is less than expected number of penguins in category Male and group 1 (alpha = 0.001).
There is more than expected number of penguins in category Female and group 1 (alpha = 0.001).
There is less than expected number of penguins in category Female and group 2 (alpha = 0.001).
There is more than expected number of penguins in category Male and group 2 (alpha = 0.001).
There is less than expected number of penguins in category Male and group 4 (alpha = 0.001).
There is more than expected number of penguins in category Female and group 4 (alpha = 0.001).
There is less than expected number of penguins in category Female and group 5 (alpha = 0.001).
There is more than expected number of penguins in category Male and group 5 (alpha = 0.001).
Research question: Can we categorize the penguins in the data set into homogeneous groups based on three cluster variables (the length and depth of the penguin’s culmen and its body mass)?
Answer: Yes, we can. We clustered them into 5 groups.
Group 1: After K Means clustering, 78 penguins, constituting 23.5% of the sample, fall into the first group. Within this group, 91% are female, while 9% are male. There are more females in this group than expected and less males. This group is characterized by the lowest culmen length and body mass among the clusters, with each falling approximately one standard deviation below the mean. Additionally, their culmen depth is slightly above the mean.
Group 2: After K Means clustering, 73 penguins, constituting 22% of the sample, fall into the second group. Within this group, 11% are female, while 89% are male. There are less females in this group than expected and more males. This cluster is characterized by below the average of both culmen length and body mass. Conversely, the group stands out with a distinctive feature—a culmen depth above the average, which is the highest among the clusters, measuring one standard deviation above the mean.
Group 3: After K Means clustering, 63 penguins, constituting 19% of the sample, fall into the third group. Within this group, 44% are female, while 56% are male. This group is characterized by culmen length measuring one standard deviation above the mean. It shares the highest culmen length with Group 5. Additionally, it exhibits culmen depth above the mean. However, the body mass of this group falls below the mean, marking a distinctive contrast with the other attributes.
Group 4: After K Means clustering, 61 penguins, constituting 18.5% of the sample, fall into the fourth group. Within this group, 92% are female, while 8% are male. There are more females in this group than expected and less males. This group is characterized by both culmen length and body mass that surpass the average, standing at values above the mean. The group is uniquely characterized by its culmen depth, notably the lowest among the clusters, measuring 1.5 standard deviations below the mean.
Group 5: After K Means clustering, 57 penguins, constituting 17% of the sample, fall into the fifth group. Within this group, 4% are female, while 96% are male. There are less females in this group than expected and more males. This cluster is characterized by the culmen length exceeds the mean by 1 standard deviation, signifying a significant deviation from the average. However, the culmen depth falls below the mean, showcasing a distinct characteristic. Notably, this group’s body mass surpasses the mean by 1.5 standard deviations, making it the highest among all clusters. Additionally, Group 5 shares the highest culmen length with Group 3 as mentioned above.