“Can we discover meaningful clusters of iris flowers based on their sepal and petal measurements, and how do these clusters correspond to the known species?”
library(datasets)
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Unit of Observation: Each row corresponds to a single iris flower.
Sample Size: The dataset contains 150 observations (flowers).
Sepal Length: Length of the sepal (in centimeters).
Sepal Width: Width of the sepal (in centimeters).
Petal Length: Length of the petal (in centimeters).
Petal Width: Width of the petal (in centimeters).
The Iris dataset was introduced by the British biologist and statistician Ronald A. Fisher in 1936.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Mean: The mean sepal length is approximately 5.84 cm, and the mean sepal width is approximately 3.06 cm.
3rd Quartile of Petal Length: 75% of the iris flowers in this sample have a petal length of at most 5.1 cm.
Median Petal Width: Half of the flowers have a petal width of at most 1.3 cm, so the median petal width is 1.3 cm.
#Standardizing the four measurements to z-scores (mean 0, standard deviation 1) so that no variable dominates the distance calculations
iris$Sepal.Length_z <- scale(iris$Sepal.Length)
iris$Sepal.Width_z <- scale(iris$Sepal.Width)
iris$Petal.Length_z <- scale(iris$Petal.Length)
iris$Petal.Width_z <- scale(iris$Petal.Width)
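As a quick sanity check (my own addition, not part of the original output), each standardized variable should now have a mean of roughly 0 and a standard deviation of 1:
#Sanity check: the z-variables should have mean ~0 and standard deviation 1
sapply(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
function(x) c(mean = round(mean(x), 10), sd = sd(x)))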
#Creating a new factor variable for the species
iris$SpeciesFactor <- factor(iris$Species,
levels = c( "setosa", "versicolor", "virginica"),
labels = c( "Setosa", "Versicolor", "Virginica"))
# Display the modified dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_z
## 1 5.1 3.5 1.4 0.2 setosa -0.8976739
## 2 4.9 3.0 1.4 0.2 setosa -1.1392005
## 3 4.7 3.2 1.3 0.2 setosa -1.3807271
## 4 4.6 3.1 1.5 0.2 setosa -1.5014904
## 5 5.0 3.6 1.4 0.2 setosa -1.0184372
## 6 5.4 3.9 1.7 0.4 setosa -0.5353840
## Sepal.Width_z Petal.Length_z Petal.Width_z SpeciesFactor
## 1 1.01560199 -1.335752 -1.311052 Setosa
## 2 -0.13153881 -1.335752 -1.311052 Setosa
## 3 0.32731751 -1.392399 -1.311052 Setosa
## 4 0.09788935 -1.279104 -1.311052 Setosa
## 5 1.24503015 -1.335752 -1.311052 Setosa
## 6 1.93331463 -1.165809 -1.048667 Setosa
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(iris[, c ("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")]),
type = "pearson")
## Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z
## Sepal.Length_z 1.00 -0.12 0.87 0.82
## Sepal.Width_z -0.12 1.00 -0.43 -0.37
## Petal.Length_z 0.87 -0.43 1.00 0.96
## Petal.Width_z 0.82 -0.37 0.96 1.00
##
## n= 150
##
##
## P
## Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z
## Sepal.Length_z 0.1519 0.0000 0.0000
## Sepal.Width_z 0.1519 0.0000 0.0000
## Petal.Length_z 0.0000 0.0000 0.0000
## Petal.Width_z 0.0000 0.0000 0.0000
Ideally, the clustering variables would have pairwise correlations below 0.3, which is clearly not the case here (petal length and petal width correlate at 0.96). Even though the variables are highly correlated, I am continuing with the clustering and will decide later, based on the dendrogram, whether the resulting clusters look reasonable.
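A minimal way to check the 0.3 rule of thumb directly is shown below (my own addition; correlations are unchanged by standardization, so the raw measurements give the same values):
#Largest absolute off-diagonal correlation among the clustering variables
cors <- cor(iris[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
max(abs(cors[upper.tri(cors)])) #0.96 (petal length vs. petal width), far above 0.3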
#Dissimilarity: each flower's Euclidean distance from the overall centroid (the origin in z-score space)
iris$Dissimilarity <- sqrt(iris$"Sepal.Length_z"^2 + iris$"Sepal.Width_z"^2 + iris$"Petal.Length_z"^2 + iris$"Petal.Width_z"^2)
head(iris[order(-iris$Dissimilarity), ], 10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_z
## 118 7.7 3.8 6.7 2.2 virginica 2.2421720
## 132 7.9 3.8 6.4 2.0 virginica 2.4836986
## 16 5.7 4.4 1.5 0.4 setosa -0.1730941
## 119 7.7 2.6 6.9 2.3 virginica 2.2421720
## 34 5.5 4.2 1.4 0.2 setosa -0.4146207
## 33 5.2 4.1 1.5 0.1 setosa -0.7769106
## 123 7.7 2.8 6.7 2.0 virginica 2.2421720
## 42 4.5 2.3 1.3 0.3 setosa -1.6222537
## 110 7.2 3.6 6.1 2.5 virginica 1.6383555
## 136 7.7 3.0 6.1 2.3 virginica 2.2421720
## Sepal.Width_z Petal.Length_z Petal.Width_z SpeciesFactor Dissimilarity
## 118 1.7038865 1.666574 1.312801 Virginica 3.525830
## 132 1.7038865 1.496631 1.050416 Virginica 3.523530
## 16 3.0804554 -1.279104 -1.048667 Setosa 3.500711
## 119 -1.0492515 1.779869 1.443994 Virginica 3.373621
## 34 2.6215991 -1.335752 -1.311052 Setosa 3.247735
## 33 2.3921710 -1.279104 -1.442245 Setosa 3.168951
## 123 -0.5903951 1.666574 1.050416 Virginica 3.042490
## 42 -1.7375359 -1.392399 -1.179859 Setosa 2.996929
## 110 1.2450302 1.326688 1.706379 Virginica 2.984316
## 136 -0.1315388 1.326688 1.443994 Virginica 2.981586
So far I see no compelling reason to remove any unit: even the flowers with the largest distances from the overall centroid do not look like clear abnormalities, so the data are left as they are.
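To back this up visually, a simple boxplot of the computed dissimilarities can be used to scan for extreme units (a sketch I am adding here, not part of the original analysis):
#Each dissimilarity is the flower's Euclidean distance from the overall centroid
boxplot(iris$Dissimilarity,
horizontal = TRUE,
main = "Distance from the overall centroid (z-score space)")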
#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
method = "euclidean")
distance2 <- distance^2
fviz_dist(distance2)
With the help of the squared Euclidean distance heatmap I can already see three cluster groups forming.
get_clust_tendency(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
n = nrow(iris) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8184781
##
## $plot
## NULL
The Hopkins statistic is about 0.82, well above 0.5, meaning that the data have a clustering tendency and are suitable for cluster analysis.
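As an illustrative comparison (my own addition), the same statistic computed on uniformly random data of the same size should land close to 0.5, i.e. show no clustering tendency:
#Hopkins statistic on structure-free random data, for comparison
set.seed(123)
random_data <- as.data.frame(matrix(runif(nrow(iris) * 4), ncol = 4))
get_clust_tendency(random_data, n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat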
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD1 <- iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD1
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 150
library(factoextra)
fviz_dend(WARD1, show_labels = FALSE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
With the help of the dendrogram I can observe that three groups are formed; the left-most one appears the most homogeneous.
fviz_dend(WARD1,
k = 3,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
As previously mentioned, three groups appear to be the most suitable choice in our case.
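For completeness, the visual choice of three groups could also be cross-checked with an elbow plot of the total within-cluster sum of squares; this step is my own addition and uses factoextra's fviz_nbclust:
#Total within-cluster sum of squares for k = 1..10, using k-means on the z-variables
fviz_nbclust(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
kmeans,
method = "wss",
k.max = 10)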
iris$ClusterWard <- cutree(WARD1,
k = 3)
clustering1 <- cbind(ID = 1:nrow(iris), iris)
head(clustering1[c( "ID", "ClusterWard")])
## ID ClusterWard
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
#Calculating the position of the initial leaders (cluster means); the warnings below come from the non-numeric Species columns and can be ignored
initial_leaders <- aggregate(iris,
by = list(iris$ClusterWard),
FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
initial_leaders
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1 5.016327 3.451020 1.465306 0.244898 NA
## 2 2 5.530000 2.566667 3.930000 1.206667 NA
## 3 3 6.546479 2.992958 5.267606 1.854930 NA
## Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z SpeciesFactor
## 1 -0.9987207 0.9032290 -1.29875725 -1.252149314 NA
## 2 -0.3783917 -1.1257275 0.09743396 0.009620796 NA
## 3 0.8491418 -0.1476957 0.85515615 0.860094260 NA
## Dissimilarity ClusterWard
## 1 2.404644 1
## 2 1.352279 2
## 3 1.696173 3
#Hierarchical k-means: hkmeans() repeats the Ward clustering, cuts it into k groups and uses the group centroids as initial leaders for k-means
library(factoextra)
kmeans_clu <- hkmeans(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
kmeans_clu
## Hierarchical K-means clustering with 3 clusters of sizes 50, 53, 47
##
## Cluster means:
## Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z
## 1 -1.01119138 0.85041372 -1.3006301 -1.2507035
## 2 -0.05005221 -0.88042696 0.3465767 0.2805873
## 3 1.13217737 0.08812645 0.9928284 1.0141287
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2
## [75] 2 3 3 3 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 3 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 47.35062 44.08754 47.45019
## (between_SS / total_SS = 76.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
Cluster 1 has 50 units, cluster 2 has 53 and cluster 3 has 47. The ratio of the between-cluster sum of squares to the total sum of squares, 76.7%, indicates that the division into three clusters accounts for about three quarters of the total variability in the clustering variables.
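The 76.7% figure can also be recomputed directly from the components of the returned object (a one-line check, my own addition):
#Share of total variability explained by the three clusters
kmeans_clu$betweenss / kmeans_clu$totss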
library(factoextra)
fviz_cluster(kmeans_clu,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw())
I do not see any outliers here, so I will not adjust the data for now.
iris$ClusterK_Means <- kmeans_clu$cluster
table(iris$ClusterWard)
##
## 1 2 3
## 49 30 71
table(iris$ClusterK_Means)
##
## 1 2 3
## 50 53 47
table(iris$ClusterWard, iris$ClusterK_Means)
##
## 1 2 3
## 1 49 0 0
## 2 1 29 0
## 3 0 24 47
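As a rough measure of agreement between the two solutions (my own addition), the cluster labels happen to line up here, so the share of flowers placed in the matching cluster by both methods can be read off the diagonal of the cross-table:
#Proportion of flowers assigned to the corresponding cluster by both methods
sum(diag(table(iris$ClusterWard, iris$ClusterK_Means))) / nrow(iris) #(49 + 29 + 47) / 150, about 0.83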
Leaders_final <- kmeans_clu$centers
Figure <- as.data.frame(Leaders_final)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z"),
labels = c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2)
The chart shows that the flowers in cluster 1 (red circles) have below-average sepal length, petal length and petal width, but a sepal width that is notably above average. The irises in cluster 2 (green triangles) have a sepal length just below the average, slightly above-average petal length and petal width, and a sepal width that is notably below average. Cluster 3 (blue squares) is above average on all four variables, standing out most on sepal length, while its sepal width is only just above the average.
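For reference, the same profiles can be expressed on the original centimetre scale (my own addition; the plot above uses the standardized variables):
#Cluster means of the raw measurements
aggregate(iris[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")],
by = list(Cluster = iris$ClusterK_Means),
FUN = mean)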
fit <- aov(cbind(Sepal.Length_z, Sepal.Width_z, Petal.Length_z, Petal.Width_z) ~ as.factor(ClusterK_Means),
data = iris)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 111.504 55.752 218.57 < 2.2e-16 ***
## Residuals 147 37.496 0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 77.608 38.804 79.9 < 2.2e-16 ***
## Residuals 147 71.392 0.486
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 137.276 68.638 860.64 < 2.2e-16 ***
## Residuals 147 11.724 0.080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 130.723 65.362 525.7 < 2.2e-16 ***
## Residuals 147 18.277 0.124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA test:
H0: Means of the respective variable are equal across all groups.
H1: At least one of the means of the respective variables differs across the groups.
Through the ANOVA I assess how well each clustering variable separates the three clusters. Since all p-values are below 0.001, the null hypothesis is rejected for every variable: the cluster means differ for each of the four clustering variables, so all of them contribute to the separation of the groups.
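Beyond the p-values, an effect size per variable can be computed from the ANOVA tables above as eta squared = SS_between / (SS_between + SS_residual); this small helper is my own addition:
#Eta squared: share of a variable's variance explained by cluster membership
eta_sq <- function(ss_between, ss_resid) ss_between / (ss_between + ss_resid)
eta_sq(111.504, 37.496) #Sepal.Length_z: about 0.75
eta_sq(77.608, 71.392) #Sepal.Width_z: about 0.52
eta_sq(137.276, 11.724) #Petal.Length_z: about 0.92
eta_sq(130.723, 18.277) #Petal.Width_z: about 0.88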
chisq_results <- chisq.test(iris$SpeciesFactor, as.factor(iris$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: iris$SpeciesFactor and as.factor(iris$ClusterK_Means)
## X-squared = 187.64, df = 4, p-value < 2.2e-16
H0: There is no association between species and the K-means clusters.
H1: There is an association between species and the K-means clusters.
We reject the null hypothesis at p < 0.001 and conclude that there is an association between the species and the K-means clusters.
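To quantify the strength of this association (my own addition), Cramér's V can be computed from the reported chi-squared statistic as V = sqrt(X^2 / (n * (min(rows, cols) - 1))):
#Cramér's V for the 3x3 species-by-cluster table, n = 150
sqrt(187.64 / (150 * (3 - 1))) #about 0.79, a strong association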
addmargins(chisq_results$observed)
## as.factor(iris$ClusterK_Means)
## iris$SpeciesFactor 1 2 3 Sum
## Setosa 50 0 0 50
## Versicolor 0 39 11 50
## Virginica 0 14 36 50
## Sum 50 53 47 150
round(chisq_results$expected, 2)
## as.factor(iris$ClusterK_Means)
## iris$SpeciesFactor 1 2 3
## Setosa 16.67 17.67 15.67
## Versicolor 16.67 17.67 15.67
## Virginica 16.67 17.67 15.67
round(chisq_results$residuals, 2)
## as.factor(iris$ClusterK_Means)
## iris$SpeciesFactor 1 2 3
## Setosa 8.16 -4.20 -3.96
## Versicolor -4.08 5.08 -1.18
## Virginica -4.08 -0.87 5.14
aggregate(iris$Sepal.Length,
by = list(iris$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 5.006000
## 2 2 5.801887
## 3 3 6.780851
This table shows the mean sepal length within each of the three K-means clusters, on the original centimetre scale.
#Checking whether mean sepal length differs significantly across the groups obtained by K-means
fit <- aov(Sepal.Length ~ as.factor(ClusterK_Means),
data = iris)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 76.46 38.23 218.6 <2e-16 ***
## Residuals 147 25.71 0.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: The means of sepal length are the same across 3 cluster groups.
H1: At least one cluster group differs in the mean of sepal length.
We can reject H0 and conclude that at least one cluster group differs from the others in mean sepal length (p < 0.001).
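A natural follow-up (my own addition, using the aov object fitted just above) is a pairwise post-hoc comparison to see which clusters differ in mean sepal length:
#Tukey HSD comparisons of mean sepal length between the three clusters
TukeyHSD(fit)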
Cluster 1 (50 of the 150 flowers): The irises in cluster 1, represented by a red circle, have below-average sepal length, petal length and petal width, but a notably above-average sepal width. In other words, this cluster consists of irises with short but wide sepals and small (short, narrow) petals.
Cluster 2 (53 of the 150 flowers): The irises in cluster 2, depicted by a green triangle, have a sepal length just below the average, slightly above-average petal length and petal width, and a clearly below-average sepal width. This cluster therefore consists of irises with narrow sepals of roughly average length and moderately large petals.
Cluster 3 (47 of the 150 flowers): The third cluster, represented by a blue square, is above average on all four variables; it stands out most on sepal length, petal length and petal width, while its sepal width is only just above the average. This cluster consists of the largest irises, with long sepals and long, wide petals.
In summary:
Cluster 1: short but wide sepals; short, narrow petals.
Cluster 2: average-length, narrow sepals; slightly longer and wider petals than average.
Cluster 3: long sepals of roughly average width; long, wide petals.