This document performs clustering analysis on the Iris dataset using the K-means clustering algorithm. We’ll focus only on the numerical columns for our analysis.
library(tidyverse) # for data manipulation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster) # for clustering algorithms
library(factoextra) # for clustering visualization
# Load the data
data(iris)
# Select only the numerical columns and standardize them
iris_numeric <- iris %>%
  select(where(is.numeric)) %>%
  scale() # center each variable to mean 0 and scale it to standard deviation 1
# Convert the scaled matrix back to a data frame
iris_numeric <- as.data.frame(iris_numeric)
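K-means clustering is a partitioning method that divides n observations into k clusters, each represented by its centroid (the mean of its members). The algorithm works as follows: it chooses k initial centroids, assigns every observation to its nearest centroid, recomputes each centroid as the mean of the observations assigned to it, and repeats the last two steps until the assignments stop changing.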
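As a rough illustration of the usual iteration (choose centroids, assign, update, repeat), here is a minimal sketch in base R. The function `simple_kmeans` and its arguments are hypothetical names introduced for this example. It follows Lloyd's algorithm and is not the routine behind `stats::kmeans`, whose default method is Hartigan-Wong.

```r
# Minimal sketch of Lloyd's algorithm. simple_kmeans is a hypothetical helper
# written for illustration; it is not what stats::kmeans runs internally.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Choose k observations at random as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each observation to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each centroid to the mean of its assigned observations
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
fit <- simple_kmeans(scale(iris[, 1:4]), k = 3)
table(fit$cluster)
```

A single random start like this can settle in a poor local optimum, which is one reason to run several starts and keep the best one, as the `nstart` argument of `kmeans` does.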
# Elbow method
set.seed(123) # for reproducibility
fviz_nbclust(iris_numeric, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal k")
# Silhouette method
fviz_nbclust(iris_numeric, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for Optimal k")
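To make the elbow criterion concrete, the sketch below computes the total within-cluster sum of squares by hand for k = 1 to 10 and plots it. The variable names `x` and `wss` and the range of k are choices made for this illustration. The "elbow" is the value of k after which further clusters reduce the sum of squares only slightly.

```r
# Standardize the four numeric Iris columns, as in the analysis above
x <- scale(iris[, 1:4])

set.seed(123)
# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# Plot the curve and look for the bend
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```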
# Set k=3 as we know there are 3 species in the dataset
set.seed(123)
km_result <- kmeans(iris_numeric, centers = 3, nstart = 25)
# Add cluster assignments to original data
iris_clustered <- iris %>%
  mutate(Cluster = as.factor(km_result$cluster))
# Calculate silhouette score
sil <- silhouette(km_result$cluster, dist(iris_numeric))
cat("Average silhouette width:", mean(sil[,3]), "\n")
## Average silhouette width: 0.4599482
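For readers who want to see what a single silhouette value measures, the following sketch recomputes it by hand for one observation and compares it with the value from `cluster::silhouette`. For observation i, a(i) is the mean distance to the other members of its own cluster, b(i) is the smallest mean distance to the members of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). The names `x`, `km`, `a_i`, `b_i`, and `s_i` are introduced only for this example.

```r
library(cluster)

x <- scale(iris[, 1:4])
set.seed(123)
km <- kmeans(x, centers = 3, nstart = 25)

d <- as.matrix(dist(x))   # pairwise Euclidean distances
i <- 1                    # observation to inspect
own <- km$cluster[i]

# a(i): mean distance to the other members of the same cluster
a_i <- mean(d[i, km$cluster == own & seq_len(nrow(x)) != i])

# b(i): smallest mean distance to the members of any other cluster
others <- setdiff(unique(km$cluster), own)
b_i <- min(sapply(others, function(cl) mean(d[i, km$cluster == cl])))

s_i <- (b_i - a_i) / max(a_i, b_i)

# Compare with the value reported by cluster::silhouette
sil <- silhouette(km$cluster, dist(x))
all.equal(s_i, unname(sil[i, 3]))
```

Values near 1 mean the observation sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may belong to a neighbouring cluster.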
# Create confusion matrix between clusters and actual species
table(km_result$cluster, iris$Species)
##
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         39        14
##   3      0         11        36
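One way to summarize the table in a single number is the share of observations that fall in the cluster dominated by their own species (sometimes called purity). The sketch below re-enters the counts from the table above by hand. The names `conf` and `agreement` are introduced for this example only.

```r
# Counts copied from the confusion matrix above (rows: clusters, columns: species)
conf <- matrix(c(50,  0,  0,
                  0, 39, 14,
                  0, 11, 36),
               nrow = 3, byrow = TRUE,
               dimnames = list(cluster = as.character(1:3),
                               species = c("setosa", "versicolor", "virginica")))

# Take the largest count in each cluster and divide by the number of observations
agreement <- sum(apply(conf, 1, max)) / sum(conf)
agreement  # 125 / 150 = 0.8333333
```

Cluster labels are arbitrary, so this measure matches each cluster to its majority species rather than relying on the label numbers lining up.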
# Calculate within-cluster sum of squares
cat("Total within-cluster sum of squares:", km_result$tot.withinss, "\n")
## Total within-cluster sum of squares: 138.8884
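The within-cluster sum of squares is easier to interpret relative to the total sum of squares. Because the data were standardized with `scale()`, each of the 4 columns has a sum of squares of n - 1 = 149, so the total is 4 × 149 = 596. The sketch below uses the value printed above to compute the share of variation explained by the clustering. The names `total_ss`, `tot_withinss`, and `between_share` are introduced for this illustration.

```r
x <- scale(iris[, 1:4])

# Total sum of squares of the standardized data: 4 columns * (150 - 1)
total_ss <- sum(x^2)

# Total within-cluster sum of squares, copied from the output above
tot_withinss <- 138.8884

# Share of the total variation that lies between clusters
between_share <- 1 - tot_withinss / total_ss
round(between_share, 3)  # 0.767
```

The same ratio can be obtained from a fitted object as `km_result$betweenss / km_result$totss`.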
# Visualize clusters using first two principal components
fviz_cluster(list(data = iris_numeric, cluster = km_result$cluster),
             geom = "point",
             ellipse.type = "convex",
             palette = "Set2",
             main = "K-means Clustering Results") +
  theme_minimal()
# Scatter plot of Sepal.Length vs Sepal.Width colored by clusters
ggplot(iris_clustered, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Clusters by Sepal Measurements",
       x = "Sepal Length",
       y = "Sepal Width") +
  theme_minimal()
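The analysis reveals:
1. k = 3 matches the known number of species and is consistent with the elbow plot. The silhouette plot deserves a careful reading, because the overlap between versicolor and virginica can make a two-cluster solution score higher.
2. Setosa is separated cleanly (all 50 observations fall in cluster 1), while versicolor and virginica overlap: 25 of their 100 observations land in the cluster dominated by the other species.
3. The average silhouette width of about 0.46 indicates moderate rather than strong cluster structure.
4. The confusion matrix shows that 125 of the 150 observations (about 83%) fall in the cluster dominated by their own species.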
This document demonstrates k-means clustering on the Iris dataset, including:
- Data preprocessing (standardization)
- Optimal cluster number determination
- Cluster analysis
- Evaluation metrics
- Multiple visualization approaches