Clustering Analysis

Clustering Analysis using k-means, DBSCAN, and Evaluation Metrics in R

#1. Introduction

This tutorial performs clustering analysis with k-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). To choose the number of clusters, we also look at three approaches: the Elbow Method, the Silhouette Method, and Gap Statistics. All of the code below, including the DBSCAN example, runs on the well-known Iris dataset with the species column removed.

#2. Necessary Library Resources

First, we load the libraries required for the visualization and clustering analysis:

# Load necessary libraries
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.3.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(NbClust)
library(fpc)

## Warning: package 'fpc' was built under R version 4.3.3

library(dbscan)

## Warning: package 'dbscan' was built under R version 4.3.3

## 
## Attaching package: 'dbscan'

## The following object is masked from 'package:fpc':
## 
##     dbscan

## The following object is masked from 'package:stats':
## 
##     as.dendrogram

library(ggplot2)

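#3. Scaling the Data

k-means and DBSCAN both rely on distances between points, so we drop the species column and standardize each feature to mean 0 and standard deviation 1.

# Drop the Species column (column 5) and standardize the four numeric features
df <- scale(iris[, -5])
head(df, 3)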

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739   1.0156020    -1.335752   -1.311052
## [2,]   -1.1392005  -0.1315388    -1.335752   -1.311052
## [3,]   -1.3807271   0.3273175    -1.392399   -1.311052
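As a quick sanity check on the standardization, the sketch below recomputes the scaled matrix and confirms that every column has mean 0 and standard deviation 1. It assumes only base R and the built-in iris data; the names col_means and col_sds are illustrative and do not appear elsewhere in this tutorial.

```r
# Recreate the scaled matrix so this snippet runs on its own
df <- scale(iris[, -5])

# After scale(), each column should have mean ~0 and standard deviation 1
col_means <- colMeans(df)
col_sds   <- apply(df, 2, sd)

print(round(col_means, 10))
print(round(col_sds, 10))
```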

#4. Elbow Method for k-means

The Elbow Method helps identify the optimal number of clusters by plotting the total within-cluster sum of squares (WSS) against the number of clusters. A sharp bend (or “elbow”) in the curve suggests a good number of clusters.

fviz_nbclust(df, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 3) +  # 3 is the known number of clusters
  labs(subtitle = "ELBOW Method")
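The curve that fviz_nbclust() draws can be approximated by hand, which makes the definition of WSS concrete. This is a minimal sketch on the same scaled Iris matrix; the range 1 to 10 and nstart = 25 are choices made here for illustration, so the values may differ slightly from the plot above.

```r
set.seed(123)
df <- scale(iris[, -5])

# Total within-cluster sum of squares for k = 1, ..., 10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing WSS by much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With a single cluster, WSS equals the total sum of squares of the data, so the curve always starts at its maximum and can only be judged by where it flattens.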

#5. DBSCAN Clustering

DBSCAN is a density-based clustering algorithm: it groups points that lie in dense regions and labels points in sparse regions as noise. Here, we apply it to the scaled Iris data (df) with eps = 0.5 and MinPts = 5 and visualize the clusters.

set.seed(123)
dbc <- fpc::dbscan(df, eps = 0.5, MinPts = 5)

fviz_cluster(dbc, data = df, stand = FALSE,
             ellipse = FALSE,
             show.clust.cent = FALSE,
             geom = "point", palette = "jco", ggtheme = theme_minimal())

print(dbc)

## dbscan Pts=150 MinPts=5 eps=0.5
##         0  1  2
## border 34  5 18
## seed    0 40 53
## total  34 45 71
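The printed summary can also be read off the fitted object directly. The sketch below assumes the fpc package is installed and relies on fpc::dbscan() storing one label per observation in the cluster component, with 0 marking noise points. The variable names sizes and noise_share are illustrative.

```r
library(fpc)

df <- scale(iris[, -5])
dbc <- fpc::dbscan(df, eps = 0.5, MinPts = 5)

# Cluster sizes; label 0 collects the points DBSCAN treats as noise
sizes <- table(dbc$cluster)
print(sizes)

# Share of observations labeled as noise
noise_share <- mean(dbc$cluster == 0)
print(noise_share)
```

In the summary above, the column headed 0 is therefore the noise group, and only the columns headed 1 and 2 are actual clusters.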

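#6. k-NN Distance Plot

The k-NN distance plot helps choose an appropriate eps value for DBSCAN. It plots the distance from each point to its k-th nearest neighbor, sorted in increasing order; the "knee" where the curve starts to rise sharply is a reasonable candidate for eps.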

dbscan::kNNdistplot(df, k = 5)
abline(h = 0.5, lty = 3)  # The chosen eps value
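The values behind the plot can be inspected numerically as well. The sketch below assumes a recent version of the dbscan package, in which kNNdist() returns a numeric vector holding each point's distance to its k-th nearest neighbor. The names knn_dist and beyond_eps are illustrative.

```r
library(dbscan)

df <- scale(iris[, -5])

# Distance from each point to its 5th nearest neighbor, sorted as in the plot
knn_dist <- sort(dbscan::kNNdist(df, k = 5))

# Summary of the distances and the share of points lying beyond eps = 0.5
print(summary(knn_dist))
beyond_eps <- mean(knn_dist > 0.5)
print(beyond_eps)
```

Points whose k-th neighbor is farther away than eps sit in sparse regions, so a larger share beyond the line generally means more points end up as border or noise points.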

#7. Silhouette Method

The Silhouette Method evaluates clustering quality by measuring how similar each point is to its own cluster compared with the other clusters. A high average silhouette width indicates a good clustering configuration.

fviz_nbclust(df, kmeans, method = "silhouette") +
  labs(subtitle = "SILHOUETTE METHOD")
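For a point i, let a(i) be its mean distance to the other points in its own cluster and b(i) the smallest mean distance to the points of any other cluster; the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)). The sketch below computes the average width by hand with cluster::silhouette(), assuming the cluster package is available (it ships with standard R installations). The range 2 to 6 and nstart = 25 are illustrative choices, not settings taken from the plot above.

```r
library(cluster)

set.seed(123)
df <- scale(iris[, -5])
d  <- dist(df)  # Euclidean distances between all pairs of observations

# Average silhouette width for k = 2, ..., 6
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  km  <- kmeans(df, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- k_values

print(round(avg_sil, 3))
```

The k with the largest average width is the one the Silhouette Method would suggest.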

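#8. Gap Statistics Method

The Gap Statistics method compares the total within-cluster variation for different numbers of clusters with its expected value under a null reference distribution, that is, data with no obvious clustering. A larger gap means the clustering structure is further from that of the reference data, which helps select the number of clusters.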

set.seed(123)
fviz_nbclust(df, kmeans, nstart = 25, method = "gap_stat", nboot = 50) +
  labs(subtitle = "GAP STATISTICS METHOD")
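The same quantity can be computed without the plotting wrapper. This is a minimal sketch using cluster::clusGap() and cluster::maxSE(), assuming the cluster package is available; K.max = 10 and the "firstSEmax" selection rule are assumptions made here to mirror common defaults, and the name k_best is illustrative.

```r
library(cluster)

set.seed(123)
df <- scale(iris[, -5])

# Gap statistic for k = 1, ..., 10 with 50 bootstrap reference datasets
gap <- clusGap(df, FUNcluster = kmeans, nstart = 25,
               K.max = 10, B = 50, verbose = FALSE)

# Table of log(W_k), its expectation under the reference, gap, and standard error
print(round(gap$Tab, 3))

# Smallest k whose gap is within one standard error of the first local maximum
k_best <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
print(k_best)
```

Because the reference datasets are simulated, the result depends on the seed and on B; a larger B gives a more stable estimate at the cost of run time.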

#9. Conclusion

In this tutorial, we applied k-means-based diagnostics and DBSCAN to the scaled Iris dataset. The Elbow Method, Silhouette Method, and Gap Statistics were used to choose the number of clusters for k-means. DBSCAN, run with eps = 0.5 and MinPts = 5, found two clusters (45 and 71 points) and labeled the remaining 34 points as noise.

Meghana

2024-10-24