Clustering Analysis using k-means, DBSCAN, and Evaluation Metrics in R

#1. Introduction

In this tutorial, we demonstrate how to perform clustering analysis using the k-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithms. We also use the Elbow Method, the Silhouette Method, and the Gap Statistic to estimate the optimal number of clusters. All examples use the well-known Iris dataset with the species label removed, including the DBSCAN example.

#2. Required Libraries

We first load the libraries needed for the clustering analysis and visualization:

# Load necessary libraries
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(NbClust)
library(fpc)
library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:fpc':
## 
##     dbscan
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
library(ggplot2)

#3. Scaling the Data

df <- iris[, -5]   # drop the Species label (column 5)
df <- scale(df)    # standardize each column to mean 0 and standard deviation 1
head(df, 3)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739   1.0156020    -1.335752   -1.311052
## [2,]   -1.1392005  -0.1315388    -1.335752   -1.311052
## [3,]   -1.3807271   0.3273175    -1.392399   -1.311052
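
Scaling matters because both k-means and DBSCAN rely on Euclidean distances, so a variable measured on a larger scale would otherwise dominate. As a minimal sketch (the names `manual_scaled`, `col_means`, and `col_sds` are illustrative and not part of the original tutorial), the following reproduces `scale()` by hand and checks the result:

```r
# Sketch: reproduce scale() by hand on the numeric Iris columns.
x <- as.matrix(iris[, -5])

# Subtract each column mean, then divide by each column standard deviation.
centered <- sweep(x, 2, colMeans(x), "-")
manual_scaled <- sweep(centered, 2, apply(x, 2, sd), "/")

# Largest absolute difference from the built-in scale() result.
max_abs_diff <- max(abs(manual_scaled - scale(x)))

# After scaling, every column should have mean ~0 and standard deviation 1.
col_means <- colMeans(manual_scaled)
col_sds <- apply(manual_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```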

#4. Elbow Method for k-means

The Elbow Method helps to identify a suitable number of clusters by plotting the total within-cluster sum of squares (WSS) against the number of clusters. WSS always decreases as clusters are added, so we look for a sharp bend (or "elbow") in the curve, beyond which additional clusters yield only small improvements.

fviz_nbclust(df, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 3) +  # 3 is the known number of clusters
  labs(subtitle = "ELBOW Method")
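
To make explicit what the elbow plot shows, here is a rough base-R sketch that computes WSS directly from `kmeans()` for k = 1 to 10. The choice of `nstart = 25` and the variable names are assumptions made for this illustration; `fviz_nbclust()` may use different defaults internally, so the exact values can differ slightly from the plot above.

```r
# Sketch: compute the elbow curve by hand with stats::kmeans().
set.seed(123)
x <- scale(iris[, -5])

k_values <- 1:10
wss <- sapply(k_values, function(k) {
  # tot.withinss is the total within-cluster sum of squares for this k.
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
abline(v = 3, lty = 3)  # reference line at the known number of species
```

With k = 1, WSS equals the total sum of squares of the scaled data, which gives a useful sanity check on the first point of the curve.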

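#5. DBSCAN Clustering

DBSCAN is a density-based clustering algorithm: it groups points that lie in dense regions and labels points in sparse regions as noise. It takes two parameters, the neighborhood radius eps and the minimum number of points MinPts required to form a dense region. Here, we apply it to the scaled Iris data (df) and visualize the clusters.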

set.seed(123)
dbc <- fpc::dbscan(df, eps = 0.5, MinPts = 5)

fviz_cluster(dbc, data = df, stand = FALSE,
             ellipse = FALSE,
             show.clust.cent = FALSE,
             geom = "point", palette = "jco", ggtheme = theme_minimal())

print(dbc)
## dbscan Pts=150 MinPts=5 eps=0.5
##         0  1  2
## border 34  5 18
## seed    0 40 53
## total  34 45 71
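
In the printed summary, column 0 holds the noise points, the "seed" row counts core points, and the "border" row counts non-core points. As a rough illustration of the core-point rule, the sketch below counts neighbors within eps using a plain distance matrix. It assumes that a point counts as its own neighbor and that the boundary is inclusive (`<=`); package implementations may differ on these details, so the counts are not guaranteed to match the output above.

```r
# Sketch: identify DBSCAN core points by hand (illustrative, not fpc's code).
x <- scale(iris[, -5])
eps <- 0.5
min_pts <- 5

# Full pairwise Euclidean distance matrix (150 x 150).
d <- as.matrix(dist(x))

# Neighbors within eps; the self-distance of 0 means each point counts itself.
n_neighbors <- rowSums(d <= eps)

# A core point has at least min_pts neighbors inside its eps-neighborhood.
is_core <- n_neighbors >= min_pts
print(table(is_core))
```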

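#6. k-NN Distance Plot

The k-NN distance plot helps in choosing an appropriate eps value for DBSCAN. It shows the distances from the points to their k nearest neighbors, sorted in increasing order, with k usually set to MinPts. The distance at which the curve bends sharply upward (the "knee") is a reasonable candidate for eps, because points beyond it are far from their neighbors and likely to be noise.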

dbscan::kNNdistplot(df, k = 5)
abline(h = 0.5, lty = 3)  # The chosen eps value
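
The same idea can be sketched in base R by taking, for each point, the distance to its k-th nearest neighbor and sorting those values. This is an approximation for illustration only: the variable names are hypothetical, and depending on its version `dbscan::kNNdistplot()` may plot a slightly different set of distances.

```r
# Sketch: sorted k-th nearest neighbor distances, computed by hand.
x <- scale(iris[, -5])
k <- 5

d <- as.matrix(dist(x))

# Each sorted row starts with the self-distance 0, so the k-th neighbor
# sits at position k + 1.
kth_dist <- apply(d, 1, function(row) sort(row)[k + 1])
sorted_kth <- sort(kth_dist)

plot(sorted_kth, type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))
abline(h = 0.5, lty = 3)  # the eps value used in the DBSCAN example

# Share of points whose k-th neighbor lies within eps = 0.5.
share_below_eps <- mean(kth_dist <= 0.5)
print(share_below_eps)
```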

#7. Silhouette Method

The Silhouette Method evaluates the quality of a clustering by measuring how similar each point is to its own cluster compared with the nearest other cluster. Silhouette widths range from -1 to 1, and the number of clusters with the highest average silhouette width is taken as the best choice.

fviz_nbclust(df, kmeans, method = "silhouette") +
  labs(subtitle = "SILHOUETTE METHOD")
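
For readers who want the numbers behind the plot, the following sketch computes the average silhouette width for k = 2 to 10 with `cluster::silhouette()`. The `cluster` package is not loaded in the original tutorial, and `nstart = 25` is an assumption made here to stabilize the k-means fits.

```r
# Sketch: average silhouette width per k, using the cluster package.
library(cluster)

set.seed(123)
x <- scale(iris[, -5])
d <- dist(x)

avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  # silhouette() returns one row per point; sil_width is the per-point score.
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:10

# The k with the largest average silhouette width.
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
print(round(avg_sil, 3))
print(best_k)
```

The silhouette is undefined for a single cluster, which is why the loop starts at k = 2.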

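#8. Gap Statistic Method

The gap statistic compares the total within-cluster variation for each number of clusters k with its expected value under a null reference distribution that has no cluster structure. A large gap means the observed clustering is much tighter than would be expected by chance, and the number of clusters is typically chosen with a rule based on the gap values and their simulation standard errors.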

set.seed(123)
fviz_nbclust(df, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
  labs(subtitle = "GAP STATISTICS METHOD")
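
The underlying quantity is Gap(k) = E*[log W_k] - log W_k, where W_k is the within-cluster dispersion and the expectation is estimated from B reference datasets. A minimal sketch with `cluster::clusGap()` and `cluster::maxSE()` follows; the `"firstSEmax"` selection rule and the variable names are assumptions for this illustration and are not taken from the original tutorial.

```r
# Sketch: gap statistic computed directly with the cluster package.
library(cluster)

set.seed(123)
x <- scale(iris[, -5])

# B = 50 reference datasets, matching nboot = 50 in the call above;
# nstart is passed through to kmeans().
gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)

# One row per k with columns logW, E.logW, gap, and SE.sim.
tab <- gap$Tab
print(round(tab, 3))

# Pick k with a standard-error rule applied to the gap values.
k_hat <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
print(k_hat)
```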

#9. Conclusion

In this tutorial, we demonstrated how to perform clustering on the scaled Iris dataset using k-means and DBSCAN. We used the Elbow Method, the Silhouette Method, and the Gap Statistic to estimate the optimal number of clusters for k-means, and the k-NN distance plot to guide the choice of eps for DBSCAN. With eps = 0.5 and MinPts = 5, DBSCAN found two clusters and labeled 34 of the 150 points as noise.