knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
### Create datatable() with filter:
options(DT.options = list(pageLength = 5,
lengthMenu = c(5, 10, 20, 50, 100), # Options offered in the "Show entries" menu
autoWidth = TRUE,
language = list(search = 'Filter:')))
# Load the required library
library(factoextra) # clustering algorithms
library(NbClust) # Package for determining the optimal number of clusters k
library(datasets)
library(cluster) # clustering algorithms
library(DT) # For using datatable()
library(tidyverse)
library(ggpubr) # For combining plots with ggarrange()
library(gridExtra) # For combining plots with grid.arrange()
K-Means clustering is an unsupervised machine learning algorithm that attempts to group observations into different clusters. Specifically, the goal of the algorithm is to minimize the variation within clusters and maximize the separation between clusters. Below is a graphical representation of the clusters that could be formed using the algorithm.
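As a quick illustration (a minimal sketch on hypothetical toy data, not part of the analysis below), kmeans() reports the quantity it tries to minimize as tot.withinss, and the total sum of squares decomposes into a between-cluster and a within-cluster part:
# Minimal sketch on toy data (hypothetical; not used in the analysis below)
set.seed(123)
toy = matrix(rnorm(100), ncol = 2)              # 50 random 2-D points
km_toy = kmeans(toy, centers = 2, nstart = 25)  # fit K-means with 2 clusters
km_toy$tot.withinss                             # the objective K-means minimizes
all.equal(km_toy$totss, km_toy$betweenss + km_toy$tot.withinss)  # totss = betweenss + tot.withinss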
This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
Step-by-step cluster analysis
To perform a cluster analysis in R, generally, the data should be
prepared as follows:
1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have mean zero and standard deviation one.
# Load the dataset:
df = USArrests
#Show the table of the original dataset
datatable(df,
filter = 'top',
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center;',
'Table 1.1: ', htmltools::em('US arrests per 100,000 residents (1973)')))
# Remove any missing value
df <- na.omit(df)
# Normalize/scale/standardize the data:
df = scale(df)
#Show the table of the dataset
datatable(df,
filter = 'top',
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center;',
'Table 1.2: ', htmltools::em('The standardized US arrests data')))
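As a quick sanity check (a small sketch, not part of the original write-up), each standardized column should now have mean 0 and standard deviation 1:
# Verify the standardization: column means ~ 0, standard deviations = 1
round(colMeans(df), 10)
apply(df, 2, sd)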
This step helps answer the question "Can the data be clustered?" using the Hopkins statistic. If the value of the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.
cl= get_clust_tendency(df, n = 49, graph = T)
cl
## $hopkins_stat
## [1] 0.6079684
##
## $plot
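For intuition, here is a rough manual sketch of the Hopkins statistic in the convention used above (values near 1 indicate clusterability). It is an approximation for illustration only, not the exact implementation inside get_clust_tendency():
# Rough sketch of the Hopkins statistic (for intuition only; approximate)
set.seed(123)
n = 49                               # number of sampled points, as in get_clust_tendency() above
m = nrow(df)
D = as.matrix(dist(df))
# w: nearest-neighbour distance of each sampled real point to the rest of the data
idx = sample(m, n)
w = sapply(idx, function(i) min(D[i, -i]))
# u: nearest-neighbour distance of uniformly generated points to the real data
u_pts = apply(df, 2, function(col) runif(n, min(col), max(col)))
u = apply(u_pts, 1, function(p) min(sqrt(colSums((t(df) - p)^2))))
sum(u) / (sum(u) + sum(w))           # clustered data => w small => value close to 1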
Alternatively, we can visualize the dissimilarity matrix with the following code; however, it does not report the Hopkins statistic:
distance = get_dist(df)
fviz_dist(distance, gradient = list(low = "red", mid = "white", high = "blue"))
Discussion:
+ The Hopkins value is 0.60 > 0.5 => the data is clusterable.
+ The graph illustrates the cluster tendency of the data.
A variety of measures have been proposed in the literature for
evaluating clustering results. There are more than thirty indices and
methods for identifying the optimal number of clusters. A few common
methods:
+ Elbow
+ Average silhouette method
+ Gap statistic method (preferable)
In the elbow method, the total within-cluster sum of squares is computed and plotted for each number of clusters, and the user looks for a change of slope from steep to shallow (an "elbow") to determine the optimal number of clusters. This method is inexact, but still potentially helpful.
fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal()
=> The number of clusters can be either 4 or 6.
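The same curve can be reproduced by hand (a small sketch): fviz_nbclust(..., method = "wss") essentially plots kmeans()$tot.withinss for each k (the seed and nstart below are my own choices for stability, not from the original code):
# Manual version of the elbow curve: total within-cluster SS for k = 1..10
set.seed(123)
wss = sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")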
Another visualization that can help determine the optimal number of clusters is the average silhouette method. It computes the average silhouette of the observations for different values of k; the optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values of k.
fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()
=> The number of clusters can be 2.
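The underlying numbers can again be reproduced by hand (a small sketch, with my own seed and nstart): for each k we average the per-observation silhouette widths:
# Manual version of the average-silhouette curve for k = 2..10
set.seed(123)
d = dist(df)
avg_sil = sapply(2:10, function(k) {
  km = kmeans(df, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])   # average silhouette width for this k
})
plot(2:10, avg_sil, type = "b", xlab = "Number of clusters k", ylab = "Average silhouette width")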
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic (i.e., yields the largest gap), meaning the clustering structure is far from a random uniform distribution of points.
gap = clusGap(df, FUN = kmeans, nstart= 25, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()
=> The number of clusters can be 3.
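The suggested k can also be read off programmatically (a small sketch): clusGap() stores the gap values and their standard errors in gap$Tab, and cluster::maxSE() applies the standard-error decision rules:
# Inspect the gap statistic table and extract the suggested k
head(gap$Tab)
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")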
k=2 (Silhouette method)
km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)
km2$betweenss/ km2$totss
## [1] 0.4751918
k=3 (Gap statistic method)
km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)
km3$betweenss/ km3$totss
## [1] 0.6003915
print(km3)
## K-means clustering with 3 clusters of sizes 20, 13, 17
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.0049340 1.0138274 0.1975853 0.8469650
## 2 -0.9615407 -1.1066010 -0.9301069 -0.9667633
## 3 -0.4469795 -0.3465138 0.4788049 -0.2571398
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 1 1 3 1
## Colorado Connecticut Delaware Florida Georgia
## 1 3 3 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 2 1 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 3 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 3
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 3 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 46.74796 11.95246 19.62285
## (between_SS / total_SS = 60.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
k=4 (Elbow method)
km4 = kmeans(df, centers = 4, nstart = 25)
fviz_cluster(km4, df)
km4$betweenss/ km4$totss
## [1] 0.7122287
k=5
km5 = kmeans(df, centers = 5, nstart = 25)
fviz_cluster(km5, df)
km5$betweenss/ km5$totss
## [1] 0.7502847
Or we can use the following command to combine plots:
km2 = kmeans(df, centers = 2, nstart = 25)
km3 = kmeans(df, centers = 3, nstart = 25)
km4 = kmeans(df, centers = 4, nstart = 25)
km5 = kmeans(df, centers = 5, nstart = 25)
# Plot to compare:
p1 = fviz_cluster(km2, geom = "point", df) + ggtitle("k=2")
p2 = fviz_cluster(km3, geom = "point", df) + ggtitle("k=3")
p3 = fviz_cluster(km4, geom = "point", df) + ggtitle("k=4")
p4 = fviz_cluster(km5, geom = "point", df) + ggtitle("k=5")
# Combine the plots
grid.arrange(p1,p2,p3,p4, nrow=2)
In regression, the R-squared value is defined as the ratio of the regression sum of squares to the total sum of squares, which indicates how much of the variability in the data is explained by the regression model. In K-means, the ratio of betweenss to totss has a similar interpretation: it indicates the amount of information retained after clustering.
In this case:
+ With k = 2, the ratio is 0.47
+ With k = 3, the ratio is 0.6
+ With k = 4, the ratio is 0.71
+ With k = 5, the ratio is 0.75
=> k = 3 is a reasonable choice for the optimal number of clusters: the ratio is already fairly high (0.60), and the gains from adding further clusters diminish.
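These ratios can be collected in one pass (a small sketch with my own seed and nstart, not from the original code), which makes the diminishing gains visible:
# betweenss / totss for a range of k
set.seed(123)
ratios = sapply(2:6, function(k) {
  km = kmeans(df, centers = k, nstart = 25)
  km$betweenss / km$totss
})
round(ratios, 2)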
The first two principal components shown in the cluster plots (for all k) explain 62% + 24.7% = 86.7% of the point variability.
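These percentages correspond to the first two principal components that fviz_cluster() uses for plotting; as a quick check (a small sketch, not part of the original analysis), they can be reproduced with prcomp():
# Variance explained by the first two principal components of the scaled data
summary(prcomp(df))$importance[, 1:2]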
We use the silhouette measure (how similar an object is to the other objects in its own cluster, compared with objects in other clusters). Silhouette values range from 1 (well clustered) to -1 (poorly clustered).
sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 20 0.26
## 2 2 13 0.37
## 3 3 17 0.32
sil = silhouette(km4$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 8 0.39
## 2 2 13 0.37
## 3 3 16 0.34
## 4 4 13 0.27
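The single overall average silhouette width used for comparing values of k can also be extracted directly (a small sketch; fviz_silhouette() reports the same number in its plot subtitle):
# Overall average silhouette width of the current clustering (here k = 4)
mean(sil[, "sil_width"])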
For number of clusters = 3
hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)
For number of clusters = 4
hc4 = eclust(df, "hclust", k=4)
fviz_dend(hc4, rect = TRUE)
Using library(dendextend) instead of eclust() and fviz_dend(), we can draw a similar dendrogram:
library(dendextend)
df %>% dist() %>% hclust() %>% as.dendrogram() -> dend
dend %>% set("labels_col", k =4) %>% set("branches_k_color", k =4) %>% plot(axes = FALSE)
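To work with the hierarchical clusters in the same way as the k-means clusters, the dendrogram can be cut into k groups with cutree() (a small sketch; it uses the same default hclust() linkage as the dendextend pipeline above, and the cross-tabulation against km3 is for illustration only):
# Cut the tree into 3 groups and compare with the k-means result
hc_groups = cutree(hclust(dist(df)), k = 3)
table(hc_groups, km3$cluster)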
Assign the cluster number to each observation, then compare it with any group labels you already have.
# Convert back to the dataframe
df = as.data.frame(df)
df$state = rownames(df)
# Check the number of observations in each cluster
table(km3$cluster)
##
## 1 2 3
## 20 13 17
# Assign the cluster number to each observation
df$cluster = as.factor(km3$cluster)
# Compare the cluster result to a class label (USArrests has no class label, so here we simply cross-tabulate the clusters with themselves)
table(df$cluster, df$cluster)
##
## 1 2 3
## 1 20 0 0
## 2 0 13 0
## 3 0 0 17
The clustering result can then be compared with a class label (if we have one, e.g. Species). With the iris data, this step would be:
table(df$Species, df$cluster)
Note that clustering should generally be performed on the normalized data rather than on the un-normalized data.
Here I draw scatter plots illustrating the relationship between Urban Population (%) and the number of people arrested for Murder/Rape/Assault.
# Murder vs UrbanPop - normalized value
a1 = ggplot(df, aes(UrbanPop, Murder, shape = cluster))+
geom_point () +
stat_ellipse(aes(y = Murder, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) +
labs(x = 'UrbanPop', y = 'Murder', subtitle = "Murder vs UrbanPop")+
theme_bw()
# Assault vs UrbanPop - normalized value
a2 = ggplot(df, aes(UrbanPop, Assault, shape = cluster))+
geom_point () +
stat_ellipse(aes(y = Assault, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) +
labs(x = 'UrbanPop', y = 'Assault', subtitle = "Assault vs UrbanPop")+
theme_bw()
# Rape vs UrbanPop - normalized value
a3 = ggplot(df, aes(UrbanPop, Rape, shape = cluster))+
geom_point () +
stat_ellipse(aes(y = Rape, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) +
labs(x = 'UrbanPop', y = 'Rape', subtitle = "Rape vs UrbanPop")+
theme_bw()
# Drawing a density plot of 3 clusters
b1 = ggdensity(df, x = "Murder", fill = "cluster") +
labs(subtitle = "Murder Density")+
theme_bw()
b2 = ggdensity(df, x = "Assault", fill = "cluster") +
labs(subtitle = "Assault Density")+
theme_bw()
b3 = ggdensity(df, x = "Rape", fill = "cluster") +
labs(subtitle = "Rape Density")+
theme_bw()
# Combine the plots
plot = ggarrange(a1,a2, a3, b1, b2, b3,
ncol=3, nrow=2,
common.legend = TRUE,
legend="right",
labels = c("A","B","C", "D", "E", "F"))
annotate_figure(plot, bottom = text_grob("The normalized data of people committing murder/assault/rape vs urban population ",
color = "red", face = "bold", size = 12))
# Data preparation
data = USArrests
data$cluster = as.factor(km3$cluster)
data$state = rownames(data)
# Murder vs UrbanPop
a1 = ggplot(data, aes(UrbanPop, Murder, shape = cluster))+
geom_point () +
stat_ellipse(aes(y = Murder, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) +
labs(x = 'Urban Population (%)', y = 'Murder', subtitle = "Murder vs UrbanPop")+
theme_bw()
# Assault vs UrbanPop - normalized value
a2 = ggplot(data, aes(UrbanPop, Assault, shape = cluster))+
geom_point () +
stat_ellipse(aes(y = Assault, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) +
labs(x = 'Urban Population (%)', y = 'Assault', subtitle = "Assault vs UrbanPop")+
theme_bw()
# Rape vs UrbanPop - normalized value
a3 = ggplot(data, aes(UrbanPop, Rape, shape = cluster))+
geom_point () +
stat_ellipse(aes(y = Rape, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) +
labs(x = 'Urban Population (%)', y = 'Rape', subtitle = "Rape vs UrbanPop")+
theme_bw()
# Drawing a density plot of 3 clusters
b1 = ggdensity(data, x = "Murder", fill = "cluster") +
labs(subtitle = "Murder Density")+
theme_bw()
b2 = ggdensity(data, x = "Assault", fill = "cluster") +
labs(subtitle = "Assault Density")+
theme_bw()
b3 = ggdensity(data, x = "Rape", fill = "cluster") +
labs(subtitle = "Rape Density")+
theme_bw()
# Combine the plots
plot = ggarrange(a1,a2, a3, b1, b2, b3,
ncol=3, nrow=2,
common.legend = TRUE,
legend="right",
labels = c("A","B","C", "D", "E", "F"))
annotate_figure(plot, bottom = text_grob("The data of people committing murder/assault/rape vs urban population ",
color = "red", face = "bold", size = 12))
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Step-by-step cluster analysis
To perform a cluster analysis in R, generally, the data should be
prepared as follows:
1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have mean zero and standard deviation one.
# Load the dataset:
df = iris
#Show the table of the original dataset
datatable(df,
filter = 'top',
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center;',
'Table 2.1: ', htmltools::em('The iris dataset')))
# Remove any missing value
df <- na.omit(df)
# Select only the numeric columns for the computation; remove the Species column:
df = df %>% select (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
# Normalize/scale/standardize the data:
df = scale(df)
#Show the table of the dataset
datatable(df,
filter = 'top',
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center;',
'Table 2.2: ', htmltools::em('The normalized iris dataset')))
This step helps answer the question "Can the data be clustered?" using the Hopkins statistic. If the value of the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.
cl= get_clust_tendency(df, n = 50, graph = T)
cl
## $hopkins_stat
## [1] 0.7997314
##
## $plot
Alternatively, we can visualize the dissimilarity matrix with the following code, although it does not report the Hopkins statistic:
distance = get_dist(df)
fviz_dist(distance, gradient = list(low = "red", mid = "white", high = "blue"))
Discussion:
+ The Hopkins value is 0.80 >> 0.5 => strongly clusterable.
+ The graph illustrates the cluster tendency of the data.
A variety of measures have been proposed in the literature for
evaluating clustering results. There are more than thirty indices and
methods for identifying the optimal number of clusters. A few common
methods:
+ Elbow
+ Average silhouette method
+ Gap statistic method (preferable)
In the elbow method, the total within-cluster sum of squares is computed and plotted for each number of clusters, and the user looks for a change of slope from steep to shallow (an "elbow") to determine the optimal number of clusters. This method is inexact, but still potentially helpful.
fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal()
=> The number of clusters can be 3.
Another visualization that can help determine the optimal number of clusters is the average silhouette method. It computes the average silhouette of the observations for different values of k; the optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values of k.
fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()
=> The number of clusters can be 2.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic (i.e., yields the largest gap), meaning the clustering structure is far from a random uniform distribution of points.
gap = clusGap(df, FUN = kmeans, nstart= 10, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()
=> The number of clusters can be 2.
k=2 (Gap statistic - Silhouette method)
km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)
km2$betweenss/ km2$totss
## [1] 0.6293972
k=3 (Elbow method)
km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)
km3$betweenss/ km3$totss
## [1] 0.7669658
k=4
km4 = kmeans(df, centers = 4, nstart = 25)
fviz_cluster(km4, df)
km4$betweenss/ km4$totss
## [1] 0.8098463
Or we can use the following command to combine plots:
km2 = kmeans(df, centers = 2, nstart = 25)
km3 = kmeans(df, centers = 3, nstart = 25)
km4 = kmeans(df, centers = 4, nstart = 25)
km5 = kmeans(df, centers = 5, nstart = 25)
# Plot to compare:
p1 = fviz_cluster(km2, geom = "point", df) + ggtitle("k=2")
p2 = fviz_cluster(km3, geom = "point", df) + ggtitle("k=3")
p3 = fviz_cluster(km4, geom = "point", df) + ggtitle("k=4")
p4 = fviz_cluster(km5, geom = "point", df) + ggtitle("k=5")
# Combine the plots
grid.arrange(p1,p2,p3,p4, nrow=2)
In regression, the R-squared value is defined as the ratio of the regression sum of squares to the total sum of squares, which indicates how much of the variability in the data is explained by the regression model. In K-means, the ratio of betweenss to totss has a similar interpretation: it indicates the amount of information retained after clustering.
In this case:
+ With k = 2, the ratio is 0.63
+ With k = 3, the ratio is 0.76
+ With k = 4, the ratio is 0.80
=> k = 3 is a strong candidate for the optimal number of clusters: the ratio is already high (0.76), and the gains beyond k = 3 are small. Moreover, we already know the data contains 3 Species => choose k = 3.
The first two principal components shown in the cluster plots (for all k) explain 73% + 22.9% = 95.9% of the point variability.
We use the silhouette measure (how similar an object is to the other objects in its own cluster, compared with objects in other clusters). Silhouette values range from 1 (well clustered) to -1 (poorly clustered). Let's check k = 2, 3, and 4:
sil = silhouette(km2$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 50 0.68
## 2 2 100 0.53
sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 47 0.35
## 2 2 53 0.39
## 3 3 50 0.64
sil = silhouette(km4$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 25 0.47
## 2 2 47 0.35
## 3 3 25 0.37
## 4 4 53 0.38
Assess the average silhouette width:
+ k = 2: 0.58
+ k = 3: 0.46 => preferable, because we already know there are 3 Species (see the reasoning above).
+ k = 4: 0.38
For number of clusters = 3
hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)
Using library(dendextend) instead of eclust() and fviz_dend(), we can draw a similar dendrogram:
library(dendextend)
df %>% dist() %>% hclust() %>% as.dendrogram() -> dend
dend %>% set("labels_col", k = 3) %>% set("branches_k_color", k = 3) %>% plot(axes = FALSE)
Assign the cluster number to each observation, then compare it with any group labels you already have.
# Convert back to the dataframe
df = as.data.frame(df)
# Add the Species column back to df (NOTE: we removed the Species column earlier)
data = iris
df$Species = iris$Species
# Check the number of observations in each cluster
table(km3$cluster)
##
## 1 2 3
## 47 53 50
#Check the total number of Species
table(df$Species)
##
## setosa versicolor virginica
## 50 50 50
# Assign the cluster number to each observation
df$cluster = as.factor(km3$cluster)
# Compare the cluster result to the class label (if we have one)
table(df$cluster, df$Species)
##
## setosa versicolor virginica
## 1 0 11 36
## 2 0 39 14
## 3 50 0 0
Discussion:
+ Each Species has 50 observations.
+ There is a mismatch between the clustering result and the Species labels: setosa is recovered perfectly by cluster 3, while versicolor and virginica are split across clusters 1 and 2.
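One simple way to quantify the mismatch (a rough sketch, not part of the original analysis) is to map each cluster to its majority Species and compute the share of observations that match that label:
# Proportion of observations whose Species matches their cluster's majority Species
tab = table(df$cluster, df$Species)
sum(apply(tab, 1, max)) / sum(tab)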
Note that clustering should generally be performed on the normalized data rather than on the un-normalized data.
Plot the normalized data:
# Change the level order
df$Species = factor(df$Species, levels = c("versicolor", "setosa", "virginica"))
# Sepal.Length vs Sepal.Width - Cluster + Species:
a1 = ggplot(df, aes(x = Sepal.Width, y= Sepal.Length, color = cluster))+
geom_point () +
labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Cluster of S.Length vs S.Width")+
theme_bw()
b1 = ggplot(df, aes(x = Sepal.Width, y= Sepal.Length, color = Species))+
geom_point () +
labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Species of S.Length vs S.Width")+
theme_bw()
# Petal.Length vs Petal.Width - cluster + Species
a2 = ggplot(df, aes(Petal.Width, Petal.Length, color = cluster))+
geom_point () +
labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Cluster of P.Length vs P.Width")+
theme_bw()
b2 = ggplot(df, aes(Petal.Width, Petal.Length, color = Species))+
geom_point () +
labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Species of P.Length vs P.Width")+
theme_bw()
plot = ggarrange(a1, b1, a2, b2,
ncol=2, nrow=2,
common.legend = F,
legend="right",
labels = c("A","B", "C", "D"))
annotate_figure(plot, bottom = text_grob("The scatter plot",
color = "red", face = "bold", size = 12))
# Drawing a density plot of 3 clusters
c1 = ggdensity(df, x = "Sepal.Length", fill = "cluster") +
labs(subtitle = "Sepal.Length Density")+
theme_bw()
c2 = ggdensity(df, x = "Sepal.Width", fill = "cluster") +
labs(subtitle = "Sepal.Width Density")+
theme_bw()
c3 = ggdensity(df, x = "Petal.Length", fill = "cluster") +
labs(subtitle = "Petal.Length Density")+
theme_bw()
c4 = ggdensity(df, x = "Petal.Width", fill = "cluster") +
labs(subtitle = "Petal.Width Density")+
theme_bw()
plot = ggarrange(c1,c2, c3, c4,
ncol=2, nrow=2,
common.legend = TRUE,
legend="right",
labels = c("A","B","C", "D"))
annotate_figure(plot, bottom = text_grob("The density plot",
color = "red", face = "bold", size = 12))
# Data preparation
data = iris
data$cluster = as.factor(km3$cluster)
# Change the level order
data$Species = factor(data$Species, levels = c("versicolor", "setosa", "virginica"))
# Sepal.Length vs Sepal.Width - Cluster + Species:
a1 = ggplot(data, aes(x = Sepal.Width, y= Sepal.Length, color = cluster))+
geom_point () +
labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Cluster of S.Length vs S.Width")+
theme_bw()
b1 = ggplot(data, aes(x = Sepal.Width, y= Sepal.Length, color = Species))+
geom_point () +
labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Species of S.Length vs S.Width")+
theme_bw()
# Petal.Length vs Petal.Width - cluster + Species
a2 = ggplot(data, aes(Petal.Width, Petal.Length, color = cluster))+
geom_point () +
labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Cluster of P.Length vs P.Width")+
theme_bw()
b2 = ggplot(data, aes(Petal.Width, Petal.Length, color = Species))+
geom_point () +
labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Species of P.Length vs P.Width")+
theme_bw()
plot = ggarrange(a1, b1, a2, b2,
ncol=2, nrow=2,
common.legend = F,
legend="right",
labels = c("A","B", "C", "D"))
annotate_figure(plot, bottom = text_grob("The scatter plot",
color = "red", face = "bold", size = 12))
# Drawing a density plot of 3 clusters
c1 = ggdensity(data, x = "Sepal.Length", fill = "cluster") +
labs(subtitle = "Sepal.Length Density")+
theme_bw()
c2 = ggdensity(data, x = "Sepal.Width", fill = "cluster") +
labs(subtitle = "Sepal.Width Density")+
theme_bw()
c3 = ggdensity(data, x = "Petal.Length", fill = "cluster") +
labs(subtitle = "Petal.Length Density")+
theme_bw()
c4 = ggdensity(data, x = "Petal.Width", fill = "cluster") +
labs(subtitle = "Petal.Width Density")+
theme_bw()
plot = ggarrange(c1,c2, c3, c4,
ncol=2, nrow=2,
common.legend = TRUE,
legend="right",
labels = c("A","B","C", "D"))
annotate_figure(plot, bottom = text_grob("The density plot",
color = "red", face = "bold", size = 12))
To test what happens when we do not normalize the data before the analysis:
Step-by-step cluster analysis
# Load the dataset:
df = USArrests
#Show the table of the original dataset
datatable(df,
filter = 'top',
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center;',
'Table 3.1: ', htmltools::em('US arrests per 100,000 residents (1973)')))
This step helps answer the question "Can the data be clustered?" using the Hopkins statistic. If the value of the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.
cl= get_clust_tendency(df, n = 49, graph = T)
cl
## $hopkins_stat
## [1] 0.5638709
##
## $plot
Discussion:
+ The Hopkins value is 0.56 > 0.5 => the data is clusterable.
+ The graph illustrates the cluster tendency of the data.
fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal()
=> The number of clusters can be 3.
fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()
=> The number of clusters can be 2.
gap = clusGap(df, FUN = kmeans, nstart= 25, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()
=> The optimal number of clusters cannot be clearly identified.
k=2
km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)
km2$betweenss/ km2$totss
## [1] 0.72907
k=3
km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)
km3$betweenss/ km3$totss
## [1] 0.8651961
In this case:
+ With k = 2, the ratio is 0.72
+ With k = 3, the ratio is 0.86
=> k = 3 is likely the better choice here: the ratio is already high, and increasing k further yields diminishing gains.
We use the silhouette measure (how similar an object is to the other objects in its own cluster, compared with objects in other clusters). Silhouette values range from 1 (well clustered) to -1 (poorly clustered).
sil = silhouette(km2$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 21 0.58
## 2 2 29 0.60
sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 14 0.54
## 2 2 16 0.54
## 3 3 20 0.52
For number of clusters = 3
hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Step-by-step cluster analysis
# Load the dataset:
df = iris
#Show the table of the original dataset
datatable(df,
filter = 'top',
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center;',
'Table 4.1: ', htmltools::em('The iris dataset')))
To perform a cluster analysis in R, generally, the data should be
prepared as follows:
1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have mean zero and standard deviation one.
# Remove any missing value
df <- na.omit(df)
# Select only the numeric columns for the computation; remove the Species column:
df = df %>% select (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
# NOTE: in this section the data is intentionally NOT scaled/standardized
#Show the table of the dataset
datatable(df,
filter = 'top',
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: center;',
'Table 4.2: ', htmltools::em('The numeric columns of the iris dataset')))
This step helps answer the question "Can the data be clustered?" using the Hopkins statistic. If the value of the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.
cl= get_clust_tendency(df, n = 50, graph = T)
cl
## $hopkins_stat
## [1] 0.8191482
##
## $plot
Discussion:
+ The Hopkins value is 0.81 >> 0.5 => strongly clusterable.
+ The graph illustrates the cluster tendency of the data.
A variety of measures have been proposed in the literature for
evaluating clustering results. There are more than thirty indices and
methods for identifying the optimal number of clusters. A few common
methods:
+ Elbow
+ Average silhouette method
+ Gap statistic method (preferable)
In the elbow method, the total within-cluster sum of squares is computed and plotted for each number of clusters, and the user looks for a change of slope from steep to shallow (an "elbow") to determine the optimal number of clusters. This method is inexact, but still potentially helpful.
fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal()
=> The number of clusters can be 3.
Another visualization that can help determine the optimal number of clusters is the average silhouette method. It computes the average silhouette of the observations for different values of k; the optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values of k.
fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()
=> The number of clusters can be 2.
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value that maximizes the gap statistic (i.e., yields the largest gap), meaning the clustering structure is far from a random uniform distribution of points.
gap = clusGap(df, FUN = kmeans, nstart= 10, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()
=> The method suggests 6 clusters, but based on the plot, I think 3 is more reasonable.
k=2 (Silhouette method)
km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)
km2$betweenss/ km2$totss
## [1] 0.7764096
k=3 (Elbow method)
km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)
km3$betweenss/ km3$totss
## [1] 0.8842753
In this example:
+ With k = 2, the ratio is 0.77
+ With k = 3, the ratio is 0.88
=> We already know there are 3 Species => choose k = 3.
The first two principal components shown in the cluster plots (for all k) explain 73% + 22.9% = 95.9% of the point variability.
We use the silhouette measure (how similar an object is to the other objects in its own cluster, compared with objects in other clusters). Silhouette values range from 1 (well clustered) to -1 (poorly clustered). Let's check k = 3:
sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 62 0.42
## 2 2 50 0.80
## 3 3 38 0.45
Assess the average silhouette width:
+ k = 3: 0.55 => preferable, because we already know there are 3 Species (see the reasoning above).
For number of clusters = 3
hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)
Assign the cluster number to each observation, then compare it with any group labels you already have.
# Convert back to the dataframe
df = as.data.frame(df)
# Add the Species column back to df (NOTE: we removed the Species column earlier)
data = iris
df$Species = iris$Species
# Check the number of observations in each cluster
table(km3$cluster)
##
## 1 2 3
## 62 50 38
#Check the total number of Species
table(df$Species)
##
## setosa versicolor virginica
## 50 50 50
# Assign the cluster number to each observation
df$cluster = as.factor(km3$cluster)
# Compare the cluster result to the class label (if we have one)
table(df$cluster, df$Species)
##
## setosa versicolor virginica
## 1 0 48 14
## 2 50 0 0
## 3 0 2 36
Discussion:
+ Each Species has 50 observations.
+ There is a mismatch between the clustering result and the Species labels: setosa is recovered perfectly by cluster 2, while some versicolor and virginica observations are mixed between clusters 1 and 3.