data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
I will create a new variable, Sepal Ratio, defined as Sepal Length divided by Sepal Width. Because it describes the sepal’s shape rather than its absolute size, this ratio offers a different perspective on the relationship between the two sepal measurements. Including it as an additional parameter gives the clustering analysis a new dimension that could reveal patterns among the flowers not captured by the individual sepal measurements.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris <- iris %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
## 1 5.1 3.5 1.4 0.2 setosa 1.457143
## 2 4.9 3.0 1.4 0.2 setosa 1.633333
## 3 4.7 3.2 1.3 0.2 setosa 1.468750
## 4 4.6 3.1 1.5 0.2 setosa 1.483871
## 5 5.0 3.6 1.4 0.2 setosa 1.388889
## 6 5.4 3.9 1.7 0.4 setosa 1.384615
library(psych)
describe(iris)
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.30 7.90 3.60 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.00 4.40 2.40 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.00 6.90 5.90 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.10 2.50 2.40 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.00 3.00 2.00 0.00
## Sepal.Ratio 6 150 1.95 0.40 2.03 1.94 0.38 1.27 2.96 1.69 0.02
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
## Sepal.Ratio -0.86 0.03
Let’s analyze a few of these results, starting with the two length variables.
Sepal Length
The mean sepal length is 5.84 cm with a standard deviation of 0.83, indicating that most measurements are relatively close to the mean, with some variability. The skewness of 0.31 indicates a slight rightward skew, suggesting a few flowers with somewhat larger sepal lengths. The range, from 4.3 cm to 7.9 cm, shows that a few flowers have much larger sepals, but the majority fall within a more typical range.
Petal Length
Petal length has a mean of 3.76 cm and a standard deviation of 1.77, notably higher than for the other parameters, reflecting a wider spread in petal lengths. The median of 4.35 cm is higher than the mean, indicating that the distribution is skewed to the left: most petals are relatively long, with a tail of shorter ones pulling the mean down. The range extends from 1.0 cm to 6.9 cm, signifying considerable variation in petal length across the dataset.
RQ: Can clustering techniques reveal natural groupings within the dataset of Iris flowers based on their sepal and petal characteristics?
iris_clu_std <- as.data.frame(scale(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Sepal.Ratio")]))
head(iris_clu_std)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio
## 1 -0.8976739 1.01560199 -1.335752 -1.311052 -1.2398566
## 2 -1.1392005 -0.13153881 -1.335752 -1.311052 -0.7999085
## 3 -1.3807271 0.32731751 -1.392399 -1.311052 -1.2108735
## 4 -1.5014904 0.09788935 -1.279104 -1.311052 -1.1731164
## 5 -1.0184372 1.24503015 -1.335752 -1.311052 -1.4102869
## 6 -0.5353840 1.93331463 -1.165809 -1.048667 -1.4209578
For the clustering analysis of the Iris dataset, I selected five variables: Sepal Length, Sepal Width, Petal Length, Petal Width, and Sepal Ratio. I excluded the Species variable from the clustering process for now; I’ll use the species labels afterwards as a criterion validity check, comparing how well the clusters match the actual flower species.
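Before computing distances, it can help to keep the species labels in a separate object so they remain available for that later check (a small sketch; the species_labels name is my own, not part of the pipeline below):
species_labels <- iris$Species  # true labels, set aside for criterion validity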
iris_clu_std$Dissimilarity <- sqrt(iris_clu_std$Sepal.Length^2 +
iris_clu_std$Sepal.Width^2 +
iris_clu_std$Petal.Length^2 +
iris_clu_std$Petal.Width^2 +
iris_clu_std$Sepal.Ratio^2)
head(data.frame(Flower_ID = 1:nrow(iris), Dissimilarity = iris_clu_std$Dissimilarity)[order(-iris_clu_std$Dissimilarity), ])
## Flower_ID Dissimilarity
## 119 119 4.208884
## 16 16 3.867347
## 123 123 3.634628
## 34 34 3.624215
## 33 33 3.601554
## 132 132 3.537387
In the results of the dissimilarity calculation, flower ID 119 stands out with a score of 4.21, notably higher than any other flower. This suggests that flower 119 is an outlier relative to the centre of the standardized data, so I’ll remove it.
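For reference, since standardization centers every variable at zero, this score is simply each flower’s Euclidean distance from the centroid (the origin) of the standardized data. A compact equivalent of the hand-written formula above:
# Same dissimilarity score via rowSums: Euclidean distance of each
# standardized row from the origin (the centroid of the scaled data)
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Sepal.Ratio")
dissim_check <- unname(sqrt(rowSums(iris_clu_std[, vars]^2)))
all.equal(dissim_check, iris_clu_std$Dissimilarity)  # should be TRUE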
library(dplyr)
iris_cleaned <- iris_clu_std %>%
filter(row_number() != 119)
iris_cleaned$Dissimilarity <- sqrt(
iris_cleaned$Sepal.Length^2 +
iris_cleaned$Sepal.Width^2 +
iris_cleaned$Petal.Length^2 +
iris_cleaned$Petal.Width^2 +
iris_cleaned$Sepal.Ratio^2
)
head(iris_cleaned[order(-iris_cleaned$Dissimilarity), ])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio Dissimilarity
## 16 -0.1730941 3.0804554 -1.279104 -1.048667 -1.6435927 3.867347
## 122 2.2421720 -0.5903951 1.666574 1.050416 1.9884107 3.634628
## 34 -0.4146207 2.6215991 -1.335752 -1.311052 -1.6084617 3.624215
## 33 -0.7769106 2.3921710 -1.279104 -1.442245 -1.7114159 3.601554
## 131 2.4836986 1.7038865 1.496631 1.050416 0.3127907 3.537387
## 118 2.2421720 1.7038865 1.666574 1.312801 0.1813696 3.530492
library(factoextra)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(iris_cleaned,
method = "euclidean")
fviz_dist(Distances, gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
library(factoextra)
get_clust_tendency(iris_cleaned,
n = nrow(iris_cleaned) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8495027
##
## $plot
## NULL
The Hopkins statistic, which measures how well suited the data is for clustering, is 0.849. Values near 0.5 would indicate randomly distributed data, while values close to 1 indicate a strong clustering structure, so this result suggests the data is well suited to clustering analysis. Moreover, the distance matrix shows distinct groupings, further supporting the idea that the data naturally forms clusters.
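As a sanity check (not part of the original output), the same statistic computed on uniformly random data of the same shape should land near 0.5, the no-structure baseline:
# Hopkins statistic for random uniform data of the same dimensions;
# a value near 0.5 would indicate no clustering tendency at all
set.seed(123)
random_data <- as.data.frame(matrix(runif(nrow(iris_cleaned) * ncol(iris_cleaned)),
                                    ncol = ncol(iris_cleaned)))
get_clust_tendency(random_data, n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat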
library(factoextra)
library(NbClust)
fviz_nbclust(iris_cleaned, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
The elbow method helps choose the number of clusters by plotting the total within-cluster sum of squares against the number of clusters. In the plot, there are noticeable bends at 2 and 3 clusters, so dividing the data into either 2 or 3 clusters offers a good balance between simplicity and meaningful separation.
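Under the hood, this plot is just the total within-cluster sum of squares for each candidate k; a minimal sketch of the same computation:
# Total within-cluster SS for k = 1..10; the elbow is where the curve's
# decrease starts to flatten
set.seed(42)
wss <- sapply(1:10, function(k) kmeans(iris_cleaned, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")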
fviz_nbclust(iris_cleaned, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
The silhouette method points in the same direction, identifying 2 clusters as the preferred choice.
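The average silhouette width behind that plot can also be computed directly with the cluster package; a sketch for a single k:
# Average silhouette width for k = 2: values closer to 1 mean points sit
# firmly inside their own cluster rather than near a neighbouring one
library(cluster)
set.seed(42)
km2 <- kmeans(iris_cleaned, centers = 2, nstart = 25)
sil <- silhouette(km2$cluster, dist(iris_cleaned))
mean(sil[, "sil_width"])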
library(NbClust)
nc <- NbClust(iris_cleaned, distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 12 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
Although the majority of the NbClust indices propose 2 clusters, a substantial minority (7 indices) propose 3, and the elbow plot supports either choice. Aiming for more detailed yet still distinct groupings, I decided to go with 3 clusters.
Clustering <- kmeans(iris_cleaned,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 50, 64, 35
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio Dissimilarity
## 1 -1.01119138 0.8504137 -1.300630 -1.2507035 -1.2072836 2.711802
## 2 0.07031946 -0.7266181 0.386691 0.3247565 0.5415159 1.403314
## 3 1.25191290 0.1437750 1.100097 1.1516218 0.6625869 2.320648
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 3 3 2 3 3 3 3
## [112] 2 3 2 2 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3 3
## [149] 2
##
## Within cluster sum of squares by cluster:
## [1] 58.58131 107.41077 54.49020
## (between_SS / total_SS = 73.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
From the K-means clustering results, here are some interesting findings:
Largest Cluster: Cluster 2 is the largest, with 64 units, making it the most populous cluster in the dataset.
Cluster Means: The mean Petal Length in Cluster 1 is -1.30 (in standardized units), indicating that this cluster contains flowers with relatively short petals compared to the other clusters.
SS Ratio: The between-cluster sum of squares accounts for 73.2% of the total (between_SS / total_SS), which shows that the clusters are well separated: most of the variability in the data comes from differences between clusters rather than within them. This high ratio suggests the clustering is effective.
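That 73.2% figure can be pulled straight from the kmeans object as a quick check:
# Between-cluster SS as a share of total SS (~0.732, matching the printout)
Clustering$betweenss / Clustering$totss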
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
ggtheme = theme_bw(),
data = iris_cleaned)
Based on graphical observation, I could remove flower IDs 16, 34 and 40 from cluster 1; 86, 71, 58, 94, 99 and 61 from cluster 2; and 131, 118 and 110 from cluster 3, since they appear to be potential outliers.
library(dplyr)
iris_cleaned <- iris_cleaned %>% filter(!(row_number() %in% c(16, 34, 39, 40, 42, 86, 71, 58, 94, 99, 61, 131, 181, 110)))
iris_cleaned <- as.data.frame(scale(iris_cleaned[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Sepal.Ratio")]))
Clustering <- kmeans(iris_cleaned,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 52, 45, 39
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio
## 1 0.05127646 -0.8884530 0.4145288 0.3411984 0.6045754
## 2 -1.07671389 0.9391076 -1.3274463 -1.2683989 -1.2389273
## 3 1.17399357 0.1010183 0.9789637 1.0086061 0.6234310
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 3 3 3 1 1 1 3 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1
## [75] 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 1 3 1 3 1 1 3 3 3 1 3 1 3 1
## [112] 3 3 1 1 3 3 3 3 1 1 3 3 3 1 3 3 3 1 3 3 3 1 3 3 1
##
## Within cluster sum of squares by cluster:
## [1] 59.93223 37.52378 43.38612
## (between_SS / total_SS = 79.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The between_SS / total_SS ratio improved from 73.2% to 79.1%, which means the clusters are now better separated.
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = iris_cleaned)
The clusters now better satisfy both the within-cluster homogeneity and the between-cluster separation requirements; with the outliers’ influence reduced, the separation between groups is clearer.
Averages <- Clustering$centers
Averages
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio
## 1 0.05127646 -0.8884530 0.4145288 0.3411984 0.6045754
## 2 -1.07671389 0.9391076 -1.3274463 -1.2683989 -1.2389273
## 3 1.17399357 0.1010183 0.9789637 1.0086061 0.6234310
Let’s explain two of the averages for the clusters:
Cluster 1 - Sepal Length: The average Sepal Length for Cluster 1 is 0.05127646. This means that, on average, the flowers in Cluster 1 have a Sepal Length close to the overall average. The value being close to zero suggests that the sepal length of these flowers is pretty typical compared to the entire dataset.
Cluster 2 - Petal Length: The average Petal Length for Cluster 2 is -1.3274463, which is lower than the overall average. This indicates that the flowers in Cluster 2 have smaller petal lengths compared to the rest of the flowers in the dataset. The negative value shows they are below the mean for petal length.
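Since these centers are in standardized (z-score) units, mapping them back to centimetres would require the means and standard deviations stored by the original scale() call. A hedged sketch of how that could look (raw_vars stands for the unstandardized numeric columns, which the pipeline above did not keep):
# Hypothetical: raw_vars = the unstandardized columns passed to scale();
# scale() stores the column means and sds as attributes on its result
scaled_mat <- scale(raw_vars)
centers_cm <- sweep(sweep(Averages, 2, attr(scaled_mat, "scaled:scale"), "*"),
                    2, attr(scaled_mat, "scaled:center"), "+")
centers_cm  # cluster means in the original measurement units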
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width" , "Sepal.Ratio"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width" , "Sepal.Ratio"),
labels = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width" , "Sepal Ratio"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, color = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster Variables") +
ylim(-2, 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
In the line graph comparing the three clusters, each cluster traces a distinct profile: Cluster 2 dips well below average on the petal-related features, Cluster 3 rises above average on most measurements, and Cluster 1 stays more balanced and centered.
iris_cleaned$Group <- Clustering$cluster
fit <- aov(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Sepal.Ratio) ~ as.factor(Group),
data = iris_cleaned)
summary(fit)
## Response Sepal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 106.058 53.029 243.69 < 2.2e-16 ***
## Residuals 133 28.942 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sepal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 81.131 40.565 100.15 < 2.2e-16 ***
## Residuals 133 53.869 0.405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Petal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 125.607 62.803 889.26 < 2.2e-16 ***
## Residuals 133 9.393 0.071
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Petal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 118.125 59.063 465.51 < 2.2e-16 ***
## Residuals 133 16.875 0.127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sepal.Ratio :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 103.237 51.618 216.14 < 2.2e-16 ***
## Residuals 133 31.763 0.239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very low p-values (less than 0.001) for all five flower measurements show that the clusters differ significantly on every variable. Each cluster has its own distinct pattern of petal and sepal sizes, which means the clustering method worked well in identifying groups of flowers with different shapes and sizes.
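For completeness, the same comparison can be run as a single multivariate test, asking whether the clusters differ on all five variables jointly (a sketch, not part of the original output):
# Joint multivariate ANOVA across all five variables (Pillai's trace by default)
man_fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length,
                        Petal.Width, Sepal.Ratio) ~ as.factor(Group),
                  data = iris_cleaned)
summary(man_fit)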
I had an issue when trying to run the chi-square test for criterion validity because my dataset didn’t have the Group column properly assigned. This happened because I had created a standardized version of the iris dataset (iris_clu_std) that didn’t include the Species variable, and then I removed outliers, which made it harder to match everything correctly.
To fix this, I asked ChatGPT for help. It suggested that I reassign the clusters to the original iris dataset by running K-means clustering again and adding the Group column properly. After that, I made sure there were no missing values and created a contingency table to compare the species with the clusters. Finally, I was able to run the chi-square test correctly and proceed with the criterion validity analysis.
set.seed(42)
kmeans_result <- kmeans(iris[, 1:4], centers = 3)
iris$Group <- as.factor(kmeans_result$cluster)
iris_cleaned <- iris[complete.cases(iris$Species, iris$Group), ]
contingency_table <- table(iris_cleaned$Species, iris_cleaned$Group)
chi_square <- chisq.test(contingency_table)
chi_square
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 223.6, df = 4, p-value < 2.2e-16
The null hypothesis states that there is no relationship between species and clusters, i.e., that the clusters do not group flowers according to their species.
Since the p-value is extremely small (p < 0.001), we can reject H0. In simple terms, the test confirms that the clusters are not random: there is a strong relationship between the species and the way the flowers were grouped, which shows the clustering separated the different types of flowers well.
addmargins(chi_square$observed)
##
## 1 2 3 Sum
## setosa 50 0 0 50
## versicolor 0 48 2 50
## virginica 0 14 36 50
## Sum 50 62 38 150
addmargins(round(chi_square$expected, 2))
##
## 1 2 3 Sum
## setosa 16.67 20.67 12.67 50.01
## versicolor 16.67 20.67 12.67 50.01
## virginica 16.67 20.67 12.67 50.01
## Sum 50.01 62.01 38.01 150.03
The results show that all setosa flowers fall in Cluster 1, versicolor is almost entirely in Cluster 2, and virginica is split between Clusters 2 and 3. The expected values show what a random assignment would look like: each species spread across the clusters in proportion to the cluster sizes (16.67, 20.67 and 12.67 flowers per species). Since the observed counts differ greatly from these, the clustering clearly groups similar flowers together.
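The expected counts follow the usual independence formula, row total × column total / grand total; a one-line check against the table above:
# Expected counts under independence: row total * column total / n
outer(rowSums(contingency_table), colSums(contingency_table)) / sum(contingency_table)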
round(chi_square$residuals, 2)
##
## 1 2 3
## setosa 8.16 -4.55 -3.56
## versicolor -4.08 6.01 -3.00
## virginica -4.08 -1.47 6.56
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cramers_v(iris$Species, iris$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.86 | [0.76, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
A high positive residual (like 8.16 for setosa in Cluster 1) means there are more setosa flowers in that cluster than expected, while a large negative residual (like -4.08 for virginica in Cluster 1) means there are fewer virginica flowers there than expected. This confirms that the clusters are meaningfully separating the species.
The high Cramer’s V (0.86) confirms a strong relationship between the species and the clusters, meaning the clustering grouped the flowers very well.
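As a cross-check, the unadjusted Cramér’s V can be computed by hand as sqrt(X² / (n × (k − 1))), where k is the smaller of the table’s dimensions; the adjusted value reported above additionally corrects for small-sample bias, so the two may differ slightly:
# Unadjusted Cramér's V from the chi-square statistic:
# sqrt(223.6 / (150 * 2)) ≈ 0.86, in line with the adjusted estimate
n <- sum(contingency_table)
k <- min(dim(contingency_table))  # 3x3 table, so k = 3
sqrt(unname(chi_square$statistic) / (n * (k - 1)))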