Introduction

This document performs clustering analysis on the Iris dataset using the K-means clustering algorithm. We’ll focus only on the numerical columns for our analysis.

Loading Required Libraries

library(tidyverse)  # for data manipulation
library(cluster)    # for clustering algorithms
library(factoextra) # for clustering visualization

Data Preparation

# Load the data
data(iris)

# Select only numerical columns
iris_numeric <- iris %>% 
  select(where(is.numeric)) %>%
  scale()  # Standardize the variables

# Convert the scaled matrix back to a data frame (species labels stay in `iris` for later comparison)
iris_numeric <- as.data.frame(iris_numeric)
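
As a quick sanity check on the standardization step, each scaled column should have a mean of approximately 0 and a standard deviation of 1. The following is a minimal sketch using only base R; the name `scaled_check` is introduced here purely for illustration and is not part of the analysis above.

```r
# Illustrative check: scale() centers each column to mean 0 and
# rescales it to unit standard deviation.
scaled_check <- as.data.frame(scale(iris[, sapply(iris, is.numeric)]))

round(colMeans(scaled_check), 10)    # all approximately 0
round(sapply(scaled_check, sd), 10)  # all equal to 1
```

Standardizing matters for k-means because the algorithm relies on Euclidean distance, so a variable measured on a larger scale would otherwise dominate the cluster assignments.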

Theory

K-means clustering is a partitioning method that divides n observations into k clusters. The algorithm works as follows:

  1. Initialize k centroids randomly in the feature space
  2. Assign each data point to the nearest centroid
  3. Recalculate centroids as the mean of all points assigned to that cluster
  4. Repeat steps 2-3 until convergence (minimal centroid movement)
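
The four steps above can be sketched in a few lines of base R. This is a simplified illustration, not the implementation behind `kmeans()`: `simple_kmeans` is a hypothetical helper, and it assumes the initial centroids are k randomly chosen observations, which is one common way to realize step 1.

```r
# Minimal sketch of the four steps (Lloyd's algorithm), base R only.
# `simple_kmeans` is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # Step 1: pick k distinct observations as the initial centroids (assumption)
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid,
    # then assign each point to the nearest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    cluster <- max.col(-d2, ties.method = "first")
    # Step 3: recompute each centroid as the mean of its assigned points
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 4: stop once the centroids have effectively stopped moving
    shift <- max(abs(new_centroids - centroids))
    centroids <- new_centroids
    if (shift < tol) break
  }
  list(cluster = cluster, centers = centroids, iterations = iter)
}

set.seed(123)
fit <- simple_kmeans(scale(iris[, 1:4]), k = 3)
table(fit$cluster)  # cluster sizes
```

Because the result depends on the random starting centroids, a single run can settle in a poor local optimum. That is why the analysis below passes `nstart = 25` to `kmeans()`, which tries several starts and keeps the best one.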

Determining Optimal Number of Clusters

# Elbow method
set.seed(123)  # for reproducibility
fviz_nbclust(iris_numeric, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal k")

# Silhouette method
set.seed(123)  # for reproducibility
fviz_nbclust(iris_numeric, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for Optimal k")

Performing K-means Clustering

# Set k=3 as we know there are 3 species in the dataset
set.seed(123)
km_result <- kmeans(iris_numeric, centers = 3, nstart = 25)

# Add cluster assignments to original data
iris_clustered <- iris %>%
  mutate(Cluster = as.factor(km_result$cluster))

Cluster Evaluation

# Calculate silhouette score
sil <- silhouette(km_result$cluster, dist(iris_numeric))
cat("Average silhouette width:", mean(sil[,3]), "\n")
## Average silhouette width: 0.4599482
# Create confusion matrix between clusters and actual species
table(km_result$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         39        14
##   3      0         11        36
# Calculate within-cluster sum of squares
cat("Total within-cluster sum of squares:", km_result$tot.withinss, "\n")
## Total within-cluster sum of squares: 138.8884
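
The confusion matrix can be condensed into a single agreement figure by crediting each cluster with its most common species, a measure often called cluster purity. The sketch below re-enters the counts printed above so that it runs on its own; `conf_mat` and `purity` are names introduced for this illustration.

```r
# Counts copied from the confusion matrix above (rows = clusters, columns = species)
conf_mat <- matrix(c(50,  0,  0,
                      0, 39, 14,
                      0, 11, 36),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(cluster = c("1", "2", "3"),
                                   species = c("setosa", "versicolor", "virginica")))

# Purity: observations in each cluster's majority species / all observations
purity <- sum(apply(conf_mat, 1, max)) / sum(conf_mat)
purity  # (50 + 39 + 36) / 150
```

In a live session the same calculation can be applied directly to `table(km_result$cluster, iris$Species)` instead of the hard-coded matrix.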

Visualization

# Visualize clusters using first two principal components
fviz_cluster(list(data = iris_numeric, cluster = km_result$cluster),
             geom = "point",
             ellipse.type = "convex",
             palette = "Set2",
             main = "K-means Clustering Results") +
  theme_minimal()

# Scatter plot of Sepal.Length vs Sepal.Width colored by clusters
ggplot(iris_clustered, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Clusters by Sepal Measurements",
       x = "Sepal Length",
       y = "Sepal Width") +
  theme_minimal()

Interpretation

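The analysis shows:

  1. The elbow and silhouette plots guide the choice of k; k = 3 was used here because it matches the known number of species.
  2. Cluster 1 contains all 50 setosa observations and nothing else, while versicolor and virginica overlap: 25 of those 100 observations fall in the cluster dominated by the other species.
  3. The average silhouette width of about 0.46 (on a scale from -1 to 1) indicates that the clusters are only moderately well separated, consistent with that overlap.
  4. Overall, the confusion matrix places 125 of 150 observations (about 83%) in the cluster dominated by their own species.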

This document demonstrates k-means clustering on the Iris dataset, including:

  - Data preprocessing (standardization)
  - Optimal cluster number determination
  - Cluster analysis
  - Evaluation metrics
  - Multiple visualization approaches