Steps to follow in clustering analysis
1:-Data preparation
2:-Assessing clustering tendency
3:-Defining the optimal number of clusters
4:-Computing cluster analyses
5:-Validating clustering analyses
Step1:-Data Preparation
library(factoextra)
library(cluster)
# Load the data set
data(USArrests)
# Remove any missing values (i.e., NA, for "not available")
# that might be present in the data
USArrests <- na.omit(USArrests)
# View the first 6 rows of the data
head(USArrests, n = 6)
To inspect the data before the K-means clustering we’ll compute some descriptive statistics such as the mean and the standard deviation of the variables.
The apply() function applies a given function (e.g., min(), max(), mean(), …) to the data set. Its second argument can take the value:
1: to apply the function over the rows
2: to apply the function over the columns
desc_stats <- data.frame(
Min = apply(USArrests, 2, min), # minimum
Med = apply(USArrests, 2, median), # median
Mean = apply(USArrests, 2, mean), # mean
SD = apply(USArrests, 2, sd), # Standard deviation
Max = apply(USArrests, 2, max) # Maximum
)
desc_stats <- round(desc_stats, 1)
head(desc_stats)
Note that the variables have very different means and variances, so they must be standardized to make them comparable.
Standardization consists of transforming the variables so that they have mean zero and standard deviation one. The scale() function can be used as follows:
df <- scale(USArrests)
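As a quick sanity check on the standardization, the sketch below confirms that each scaled column has mean zero and standard deviation one, and that scale() agrees with centering and scaling done by hand. The names df_check and manual are only illustrative.

```r
# Standardize the built-in USArrests data
df_check <- scale(USArrests)

# Column means should be (numerically) zero and standard deviations one
round(colMeans(df_check), 10)
apply(df_check, 2, sd)

# The same transformation by hand: subtract the column mean,
# then divide by the column standard deviation
centered <- sweep(as.matrix(USArrests), 2, colMeans(USArrests), "-")
manual   <- sweep(centered, 2, apply(USArrests, 2, sd), "/")

# TRUE if both approaches agree (ignoring the attributes added by scale())
all.equal(manual, df_check, check.attributes = FALSE)
```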
Step2:-Assessing clustering tendency
The function get_clust_tendency() [in factoextra] can be used. It computes the Hopkins statistic and can also draw an ordered dissimilarity image for visual assessment.
res <- get_clust_tendency(df, 40, graph = FALSE)
# Hopkins statistic
res$hopkins_stat
[1] 0.3440875
# Visualize the dissimilarity matrix
res$plot
NULL
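The value of the Hopkins statistic (0.34) is below 0.5, which, under the convention used by the factoextra version that produced this output, suggests that the data are clusterable. More recent factoextra versions report the statistic on the opposite scale, where values close to 1 indicate clusterable data, so check the documentation of the installed version before interpreting the number. Because graph = FALSE was passed, res$plot is NULL here; calling get_clust_tendency() with graph = TRUE draws the ordered dissimilarity image, in which block patterns along the diagonal indicate clusters.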
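To make the two conventions concrete, here is a minimal base R sketch of a Hopkins-type statistic. It is a simplified illustration, not the code used by get_clust_tendency(): the helper hopkins_sketch() and its arguments are hypothetical, and it uses plain (unpowered) nearest-neighbor distances.

```r
# Hypothetical helper: simplified Hopkins-type statistic (illustration only)
hopkins_sketch <- function(x, n, seed = 123) {
  x <- as.matrix(x)
  set.seed(seed)
  # n real observations sampled without replacement
  idx <- sample(nrow(x), n)
  # n artificial points drawn uniformly within the range of each variable
  rand <- apply(x, 2, function(col) runif(n, min(col), max(col)))
  # Euclidean distance from one point p to every row of a matrix m
  dist_to_rows <- function(p, m) sqrt(rowSums(sweep(m, 2, p)^2))
  # w: nearest-neighbor distance of each sampled real point (itself excluded)
  w <- sapply(idx, function(i) min(dist_to_rows(x[i, ], x[-i, , drop = FALSE])))
  # u: distance from each artificial point to its nearest real observation
  u <- sapply(seq_len(n), function(j) min(dist_to_rows(rand[j, ], x)))
  # Two complementary conventions; they always sum to 1
  c(small_if_clustered = sum(w) / (sum(u) + sum(w)),
    large_if_clustered = sum(u) / (sum(u) + sum(w)))
}

h <- hopkins_sketch(scale(USArrests), n = 40)
h
```

When the data are clustered, real points sit close to each other (small w) while uniform points tend to fall in empty regions (large u), which pushes the first value toward 0 and the second toward 1.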
Step3:-Defining the optimal number of clusters
As k-means clustering requires the number of clusters to be specified in advance, we'll use the function clusGap() [in cluster] to compute the gap statistic for estimating the optimal number of clusters. The function fviz_gap_stat() [in factoextra] is used to visualize the gap statistic plot.
set.seed(123)
# Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25,
K.max = 10, B = 500)
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 500) [one "." per sample]:
.................................................. 50
.................................................. 100
.................................................. 150
.................................................. 200
.................................................. 250
.................................................. 300
.................................................. 350
.................................................. 400
.................................................. 450
.................................................. 500
# Plot the result
fviz_gap_stat(gap_stat)
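The suggested number of clusters can also be read off programmatically rather than from the plot. The sketch below uses maxSE() [in cluster] on the gap values and their simulation standard errors, which clusGap() stores in the Tab component. It assumes the "firstSEmax" selection rule (other rules are listed in ?maxSE and can give a different k) and a smaller B to keep the run short; gap_demo and k_opt are illustrative names.

```r
library(cluster)

df_demo <- scale(USArrests)

set.seed(123)
# Smaller B than in the main analysis, only to keep this example quick
gap_demo <- clusGap(df_demo, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 100, verbose = FALSE)

# Gap values and their standard errors for k = 1, ..., K.max
head(gap_demo$Tab[, c("gap", "SE.sim")])

# Pick k with the "firstSEmax" rule (an assumption; see ?maxSE for alternatives)
k_opt <- maxSE(f = gap_demo$Tab[, "gap"],
               SE.f = gap_demo$Tab[, "SE.sim"],
               method = "firstSEmax")
k_opt
```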
Step4:-Computing cluster analyses
K-means clustering with k = 4
# Compute k-means
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)
Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois Indiana Iowa
4 3 3 4 3 3 2 2 3 4 2 1 3 2 1
Kansas Kentucky Louisiana Maine Maryland
2 1 4 1 3
# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)
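Besides the cluster assignments, the object returned by kmeans() holds the cluster sizes and centers. A short sketch of how one might inspect them, and summarize the original (unscaled) variables per cluster, is given below; km_demo and cluster_means are illustrative names.

```r
df_demo <- scale(USArrests)

set.seed(123)
km_demo <- kmeans(df_demo, centers = 4, nstart = 25)

# Number of observations in each cluster
km_demo$size

# Cluster centers, expressed in standardized units
km_demo$centers

# Mean of each original variable by cluster, easier to interpret
cluster_means <- aggregate(USArrests,
                           by = list(cluster = km_demo$cluster),
                           FUN = mean)
cluster_means

# Share of the total variance explained by the partition
km_demo$betweenss / km_demo$totss
```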
Step5:-Validating clustering analyses
The silhouette width (S_i) measures how similar an object i is to the other objects in its own cluster versus those in the neighbor cluster. S_i values range from 1 to -1:
A value of S_i close to 1 indicates that the object is well clustered. In other words, the object i is similar to the other objects in its group.
A value of S_i close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.
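The definition above can be checked by hand. For an object i, let a be its average distance to the other members of its own cluster and b the smallest average distance to the members of any other cluster; the silhouette width is then (b - a) / max(a, b). The sketch below implements this with a hypothetical helper, silhouette_width(), and compares the result with silhouette() [in cluster].

```r
library(cluster)

df_demo <- scale(USArrests)
set.seed(123)
km_demo <- kmeans(df_demo, centers = 4, nstart = 25)

# Full matrix of pairwise Euclidean distances
d <- as.matrix(dist(df_demo))

# Hypothetical helper: silhouette width of observation i
silhouette_width <- function(i, cluster, d) {
  own  <- cluster[i]
  same <- setdiff(which(cluster == own), i)
  if (length(same) == 0) return(0)  # singleton cluster: width is 0 by convention
  a <- mean(d[i, same])             # average distance within own cluster
  others <- setdiff(unique(cluster), own)
  # smallest average distance to any other cluster (the "neighbor")
  b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
  (b - a) / max(a, b)
}

s_manual <- sapply(seq_len(nrow(df_demo)), silhouette_width,
                   cluster = km_demo$cluster, d = d)

# Reference values from the cluster package
s_pkg <- silhouette(km_demo$cluster, dist(df_demo))[, "sil_width"]

# TRUE if the manual computation matches
all.equal(as.numeric(s_manual), as.numeric(s_pkg))
```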
sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])
cluster neighbor sil_width
Alabama 4 3 0.48577530
Alaska 3 4 0.05825209
Arizona 3 2 0.41548326
Arkansas 4 2 0.11870947
California 3 2 0.43555885
Colorado 3 2 0.32654235
fviz_silhouette(sil)
cluster size ave.sil.width
1 1 13 0.37
2 2 16 0.34
3 3 13 0.27
4 4 8 0.39
It can be seen that some samples have negative silhouette values. Some natural questions are: Which samples are these? To which cluster are they closer?
This can be determined from the output of the function silhouette() as follows:
neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]
Enhanced Clustering Analysis
The function eclust() [in factoextra] provides several advantages compared to the standard packages used for clustering analysis:
- It simplifies the workflow of clustering analysis
- It can be used to compute hierarchical clustering and partitioning clustering in a single function call
- It automatically computes the gap statistic for estimating the right number of clusters
- It automatically provides silhouette information
- It draws beautiful graphs using ggplot2
1. K-means clustering
# Compute k-means
res.km <- eclust(df, "kmeans")
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Silhouette plot
fviz_silhouette(res.km)
cluster size ave.sil.width
1 1 13 0.27
2 2 13 0.37
3 3 8 0.39
4 4 16 0.34
2. Hierarchical clustering
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
fviz_dend(res.hc, rect = TRUE) # dendrogram
The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.
fviz_silhouette(res.hc) # silhouette plot
cluster size ave.sil.width
1 1 7 0.46
2 2 12 0.29
3 3 19 0.26
4 4 12 0.43
fviz_cluster(res.hc) # scatter plot
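For readers who want to see what happens without the wrapper, a rough base R equivalent of the hierarchical step is sketched below. It assumes Euclidean distances, Ward linkage ("ward.D2") and a cut into 4 groups; these are assumptions about the settings behind the eclust() call, so check ?eclust before relying on them. The names hc_demo, grp_demo and km_demo are illustrative.

```r
df_demo <- scale(USArrests)

# Agglomerative clustering on Euclidean distances with Ward linkage
# (assumed settings, see the note above)
hc_demo <- hclust(dist(df_demo, method = "euclidean"), method = "ward.D2")

# Cut the tree into 4 groups and count the members of each group
grp_demo <- cutree(hc_demo, k = 4)
table(grp_demo)

# Cross-tabulate against a k-means partition to see where the methods agree
set.seed(123)
km_demo <- kmeans(df_demo, centers = 4, nstart = 25)
table(hclust = grp_demo, kmeans = km_demo$cluster)
```

Cluster labels are arbitrary in both methods, so agreement shows up as one dominant cell per row of the cross-table rather than as matching label numbers.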