Clustering, also referred to as cluster analysis, is an unsupervised machine learning technique aimed at grouping objects in such a way that objects within the same group (cluster) are more similar to each other than to those in different groups. It is widely used for exploratory data analysis, where the goal is to discover hidden patterns or natural groupings in datasets without predefined labels.
The primary objective of this project is to apply clustering methods to a selected dataset to identify inherent groupings and to understand the structure and characteristics of these clusters. The dataset used in this project is the Iris dataset. The goal is to uncover hidden patterns, validate the biological classifications of the dataset, and compare different clustering methods to understand their effectiveness and implications.
The Iris dataset consists of 150 observations of iris flowers from three species: Setosa, Versicolor, and Virginica, with 50 observations per species. Each observation is described by four numerical features that represent physical measurements of the flowers: sepal length, sepal width, petal length, and petal width, all in centimeters. Additionally, the dataset includes a Species column, which is a categorical variable indicating the species of the flower. For clustering purposes, this column was excluded so that the algorithms group observations based solely on the numerical features.
In this project, I will explore several clustering algorithms: k-means, hierarchical clustering, and Partitioning Around Medoids (PAM). Each technique has its own properties, strengths, and limitations, which will be examined to determine its suitability for the dataset at hand.
Furthermore, the project will evaluate the quality of the resulting clusters using metrics such as the silhouette score. Interpreting the results is a crucial component, as it relates the clusters back to the measured features.
library(stats)
library(cluster)
library(ggplot2)
library(GGally)
data(iris)
df <- iris[, -5] # Exclude species for clustering
summary(df)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
scaled_df <- scale(df)
The dataset is scaled so that all features contribute equally to the clustering process. Standardization transforms each feature to have a mean of 0 and a standard deviation of 1, preventing features with larger ranges (here, petal length) from dominating the distance calculations.
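The effect of standardization can be checked directly. The following is a minimal sketch; the name `scaled_check` is introduced here for illustration only, and `datasets::iris` is used so the check does not depend on any columns added to `iris` elsewhere.

```r
# Standardize the four numeric iris columns.
scaled_check <- scale(datasets::iris[, 1:4])

# Column means should be zero up to floating-point error.
round(colMeans(scaled_check), 10)

# Column standard deviations should all equal 1.
apply(scaled_check, 2, sd)

# scale() stores the original means and standard deviations as attributes,
# so the transformation can be reversed if needed.
attr(scaled_check, "scaled:center")
attr(scaled_check, "scaled:scale")
```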
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(scaled_df, centers = k, nstart = 10)$tot.withinss
})
# Plot WSS (Within-Cluster Sum of Squares) vs. Number of Clusters
elbow_plot <- data.frame(Clusters = 1:10, WSS = wss)
ggplot(elbow_plot, aes(x = Clusters, y = WSS)) +
  geom_line() +
  geom_point() +
  ggtitle("Elbow Method for Optimal Clusters") +
  xlab("Number of Clusters") +
  ylab("Within-Cluster Sum of Squares") +
  scale_x_continuous(breaks = 1:10)
The Elbow Method is chosen because it provides a visual and quantitative way to determine the optimal number of clusters for k-means clustering. It involves calculating the Within-Cluster Sum of Squares (WSS), which measures the compactness of clusters. As the number of clusters increases, WSS decreases because points are closer to their respective centroids. However, after a certain point (the “elbow”), adding more clusters yields only marginal improvements, indicating the optimal number of clusters.
Why Use WSS?
Using WSS provides the following advantages:
Compactness Assessment: It quantifies how closely related the data points within each cluster are, which is crucial for ensuring well-defined groupings.
Optimal Balance: By analyzing the trade-off between the number of clusters and the compactness of data, the Elbow Method helps avoid overfitting (too many clusters) or underfitting (too few clusters).
Interpretability: WSS provides an intuitive visual representation of clustering performance, making it easier to decide on the most appropriate k.
In the context of this project, WSS is calculated for k = 1 to 10, and the results are plotted to observe the trend. The analysis focuses on identifying the elbow point to select the most meaningful number of clusters. Note that WSS measures only the compactness of clusters, not how well they are separated from each other, so cluster separation is assessed separately with the silhouette score.
From the elbow plot, we observe that WSS decreases sharply up to k = 3 and more gradually afterwards, suggesting 3 clusters.
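To make the definition of WSS concrete, it can be recomputed by hand from a fitted model and compared with the value reported by `kmeans()`. This is a sketch under the same settings as above; the names `x`, `fit`, and `wss_by_hand` are illustrative.

```r
set.seed(123)
x <- scale(datasets::iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 10)

# For each cluster, sum the squared distances of its points to the centroid.
wss_by_hand <- sum(sapply(1:3, function(j) {
  members <- x[fit$cluster == j, , drop = FALSE]
  centroid <- colMeans(members)
  sum(sweep(members, 2, centroid)^2)
}))

# Should agree with the value stored in the fitted object.
all.equal(wss_by_hand, fit$tot.withinss)

# Total sum of squares (the WSS at k = 1). For standardized data this is
# (n - 1) * p = 149 * 4 = 596.
fit$totss
```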
K-means clustering is a partitioning method that divides the data into k clusters by minimizing the variance within each cluster. The algorithm works iteratively to assign each data point to the nearest cluster center (centroid) and recalculates the centroids until the assignments stabilize.
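The assign-and-update loop described above can be sketched in a few lines. This is a simplified illustration of Lloyd's algorithm, not the routine used in this project: R's `kmeans()` defaults to the Hartigan-Wong algorithm, and the function `simple_kmeans` is a hypothetical helper written only for this example.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen observations as initial centroids.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every centroid
    # (an n x k matrix), then pick the nearest centroid for each point.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d2, ties.method = "first")
    # Stop once no point changes cluster.
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Update step: move each centroid to the mean of its current members.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

set.seed(123)
res <- simple_kmeans(scale(datasets::iris[, 1:4]), k = 3)
table(res$cluster)
```

Because the sketch uses a single random start, its result can differ from the multi-start `kmeans()` fit used in the analysis.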
Key Advantages of K-means
Simplicity: K-means is easy to implement and computationally efficient, making it suitable for large datasets.
Scalability: The algorithm performs well with large datasets as long as the number of clusters remains small.
Interpretability: The resulting clusters are straightforward to interpret and visualize, especially in lower dimensions.
Limitations of K-means
Sensitivity to Initialization: The algorithm’s outcome depends on the initial centroids, which may lead to different results for different runs.
Fixed Number of Clusters: The number of clusters must be predefined, which may require trial-and-error or methods like the Elbow Method for selection.
Assumes Spherical Clusters: K-means works best when clusters are spherical and of similar size, which may not be the case in some datasets.
Outlier Sensitivity: Outliers can significantly affect the placement of centroids and thus the overall clustering result.
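Sensitivity to initialization can be explored by comparing single-start runs under different seeds with a multi-start run. The sketch below only sets up the comparison; the seeds and the names `single_start` and `multi_start` are arbitrary choices for illustration, and no particular outcome is assumed.

```r
x <- scale(datasets::iris[, 1:4])

# Twenty independent runs, each with a single random initialization.
single_start <- sapply(1:20, function(seed) {
  set.seed(seed)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})

# One run that keeps the best of 25 random initializations.
set.seed(123)
multi_start <- kmeans(x, centers = 3, nstart = 25)$tot.withinss

# A wide range here indicates that some starts ended in poorer local optima.
range(single_start)
multi_start
```

Using a larger `nstart`, as done in this project, reduces the chance of reporting a poor local optimum.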
Applying K-means in This Project
Defining k: Based on the Elbow Method, the number of clusters was set to k = 3. This value balances interpretability (few clusters) against compactness (low WSS).
Clustering Execution: The k-means algorithm was applied to the dataset, using the optimal k identified in the previous step. The algorithm iteratively adjusted the centroids and cluster assignments to minimize the WSS.
k <- 3
kmeans_result <- kmeans(scaled_df, centers = k, nstart = 25)
# Adding cluster assignments to the original data
iris$Cluster <- as.factor(kmeans_result$cluster)
Clustering evaluation is a critical step in validating the quality of the clusters produced by the algorithm. Unlike supervised learning, where evaluation relies on ground truth labels, clustering requires metrics that assess the structure and cohesiveness of the clusters based on the data itself. The aim is to determine how well the clustering algorithm has partitioned the dataset into meaningful groups.
To assess the quality of the clustering, I used the silhouette score, which measures how similar each observation is to its assigned cluster compared to other clusters. A higher silhouette score indicates better-defined clusters.
sil <- silhouette(kmeans_result$cluster, dist(scaled_df))
mean_silhouette <- mean(sil[, 3]) # Average silhouette width
mean_silhouette
## [1] 0.4599482
A score of approximately 0.46 suggests that while the clusters are distinguishable, there is some overlap, with points lying near cluster boundaries. This is acceptable for this dataset and analysis.
Practical implications: This clustering is suitable for general insights into the data but may require refinement if precise cluster separation is critical.
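For a single observation i, the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the smallest mean distance to the members of any other cluster. The sketch below recomputes this by hand for the first observation and compares it with `cluster::silhouette()`; the variable names are illustrative.

```r
library(cluster)

set.seed(123)
x <- scale(datasets::iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 25)
d <- as.matrix(dist(x))

i <- 1
own <- fit$cluster[i]

# a(i): mean distance to the other members of the same cluster.
same <- setdiff(which(fit$cluster == own), i)
a_i <- mean(d[i, same])

# b(i): smallest mean distance to the members of any other cluster.
others <- setdiff(unique(fit$cluster), own)
b_i <- min(sapply(others, function(j) mean(d[i, fit$cluster == j])))

s_i <- (b_i - a_i) / max(a_i, b_i)

# Compare with the value computed by the cluster package.
sil_check <- silhouette(fit$cluster, dist(x))
all.equal(s_i, unname(sil_check[i, "sil_width"]))
```

Values close to 1 indicate a well-placed observation, values near 0 an observation on a cluster boundary, and negative values a likely misassignment.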
Visualization is crucial for understanding clustering results. It provides an intuitive understanding of the patterns discovered in the data. While numerical evaluation metrics offer quantitative insights, visual representations allow us to assess the quality of clustering more holistically and identify potential issues such as overlapping clusters, outliers, or cluster imbalances. Principal Component Analysis (PCA) is used to reduce the dataset’s dimensionality to two principal components, allowing for 2D visualization.
pca <- prcomp(scaled_df)
pca_data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], Cluster = iris$Cluster)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Cluster)) +
  geom_point(size = 3) +
  ggtitle("Cluster Visualization using PCA") +
  theme_minimal()
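A 2D PCA plot is only a faithful picture if the first two components retain most of the variance. A quick check, assuming the same standardized data (the names `pca_check` and `var_explained` are illustrative):

```r
pca_check <- prcomp(scale(datasets::iris[, 1:4]))

# Proportion of total variance captured by each principal component.
var_explained <- pca_check$sdev^2 / sum(pca_check$sdev^2)
round(var_explained, 3)

# Cumulative proportion retained by the first two components.
cumsum(var_explained)[2]
```

The same figures are available from `summary(pca_check)`.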
Hierarchical clustering builds a tree-like structure (dendrogram) to represent nested clusters. It provides a visual representation of how clusters are merged or split based on distances.
Ward's method (method = "ward.D2" in R's hclust) was used to build the dendrogram. At each step, this agglomerative approach merges the pair of clusters that produces the smallest increase in total within-cluster variance. In hclust, "ward.D2" squares the dissimilarities before updating the clusters and therefore implements Ward's criterion directly on Euclidean distances, whereas "ward.D" does so only when it is given squared distances. Because Ward's method optimizes a variance criterion similar to that of k-means, its dendrogram is a natural point of comparison for the k-means results.
if (!requireNamespace("dendextend", quietly = TRUE)) {
  install.packages("dendextend")
}
library(dendextend)
# Calculate the distance matrix
dist_matrix <- dist(scaled_df)
# Perform hierarchical clustering using the Ward's method
hc <- hclust(dist_matrix, method = "ward.D2")
# Convert to dendrogram and suppress labels
dend <- as.dendrogram(hc)
dend <- set(dend, "labels", NULL) # Suppress labels
plot(dend, main = "Dendrogram of Hierarchical Clustering", ylab = "Height", axes = TRUE)
Cutting the dendrogram at a suitable height yields three distinct groups, which is consistent with the choice of 3 clusters. This visualization supports the conclusions drawn from the Elbow Method and k-means clustering.
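The three groups can also be extracted from the tree with `cutree()` and cross-tabulated against a k-means solution. This is a sketch using the same settings as the analysis; the object names are illustrative. Cluster numbers are arbitrary labels, so agreement shows up as one dominant cell per row rather than as matching numbers.

```r
set.seed(123)
x <- scale(datasets::iris[, 1:4])

# Hierarchical clustering with Ward's method, cut into 3 groups.
hc_check <- hclust(dist(x), method = "ward.D2")
hc_clusters <- cutree(hc_check, k = 3)

# K-means solution for comparison.
km_clusters <- kmeans(x, centers = 3, nstart = 25)$cluster

# Rows: hierarchical clusters; columns: k-means clusters.
table(Hierarchical = hc_clusters, KMeans = km_clusters)
```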
Partitioning Around Medoids (PAM) is a clustering algorithm that partitions the dataset into k clusters, similar to k-means. However, unlike k-means, PAM selects actual data points (called medoids) as cluster centers, making it more robust to outliers and noise. This makes PAM a suitable alternative for datasets where traditional clustering algorithms like k-means may struggle.
Why Use PAM Clustering?
Robustness to Outliers: K-means minimizes squared Euclidean distances to the centroids, so extreme values have a disproportionate influence. PAM instead minimizes the sum of (unsquared) dissimilarities between each point and its medoid, which makes it more robust to noise and outliers.
Medoids as Cluster Representatives: Since medoids are actual data points, they provide a more interpretable representation of clusters than the artificial centroids of k-means.
Strengths:
Robust to outliers and noise due to the use of medoids.
Medoids provide an interpretable summary of clusters.
Limitations:
pam_result <- pam(scaled_df, k = 3)
# Visualization for PAM Clustering
ggplot(pca_data, aes(x = PC1, y = PC2, color = as.factor(pam_result$clustering))) +
  geom_point(size = 3) +
  ggtitle("PAM Clustering Visualization") +
  theme_minimal()
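Since medoids are actual observations, they can be looked up in the original data. The sketch below refits PAM on the standardized features and inspects the medoids; `pam_check` is an illustrative name, while `id.med`, `medoids`, and `silinfo` are components of the object returned by `cluster::pam()`.

```r
library(cluster)

x <- scale(datasets::iris[, 1:4])
pam_check <- pam(x, k = 3)

# Row indices of the observations chosen as medoids.
pam_check$id.med

# The medoid flowers in their original units, including species.
datasets::iris[pam_check$id.med, ]

# The stored medoid coordinates are exactly those rows of the scaled data.
all.equal(pam_check$medoids, x[pam_check$id.med, ], check.attributes = FALSE)

# Average silhouette width of the PAM solution.
pam_check$silinfo$avg.width
```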
# Adding K-means and PAM clustering assignments to the dataset
iris$KMeans_Cluster <- kmeans_result$cluster
iris$PAM_Cluster <- pam_result$clustering
# Calculate the mean value of each feature within each cluster
cluster_summary_kmeans <- aggregate(df, by = list(Cluster = iris$KMeans_Cluster), FUN = mean)
cluster_summary_pam <- aggregate(df, by = list(Cluster = iris$PAM_Cluster), FUN = mean)
print(cluster_summary_kmeans)
##   Cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       1     5.801887    2.673585     4.369811    1.413208
## 2       2     5.006000    3.428000     1.462000    0.246000
## 3       3     6.780851    3.095745     5.510638    1.972340
print(cluster_summary_pam)
##   Cluster Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       1     5.006000    3.428000     1.462000    0.246000
## 2       2     6.811111    3.082222     5.553333    1.997778
## 3       3     5.812727    2.700000     4.376364    1.412727
Cluster 1:
Cluster 2:
Cluster 3:
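Because the Species column was set aside rather than discarded, the clusters can be compared with the biological labels after the fact. The following sketch refits both models and cross-tabulates them against species and against each other; the object names are illustrative. Cluster numbers are arbitrary, so the two methods may number equivalent clusters differently.

```r
library(cluster)

set.seed(123)
x <- scale(datasets::iris[, 1:4])
species <- datasets::iris$Species

km_check <- kmeans(x, centers = 3, nstart = 25)
pam_check <- pam(x, k = 3)

# How each species is distributed across the k-means clusters.
table(Species = species, KMeans = km_check$cluster)

# The same comparison for PAM.
table(Species = species, PAM = pam_check$clustering)

# Agreement between the two methods, independent of label numbering.
table(KMeans = km_check$cluster, PAM = pam_check$clustering)
```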
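This analysis grouped the Iris dataset into 3 clusters using the k-means and PAM methods.
The elbow plot, PCA visualization, and dendrogram all support a three-cluster solution, while the average silhouette width of about 0.46 indicates that the clusters are only moderately well separated.
The results provide insight into the dataset's underlying structure and into how the clustering methods differ: the k-means and PAM cluster means are very similar, although the two methods number the clusters differently.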