knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)
### Create datatable() with filter:
options(DT.options = list(pageLength = 5, 
                          lengthMenu = c(5, 10, 20, 50, 100), # Adjust option in Show entries
                          autoWidth = TRUE,
                          language = list(search = 'Filter:'))) 
# Load the required libraries
library(factoextra)  # clustering visualization (fviz_* functions, eclust, get_clust_tendency)
library(NbClust)     # package for determining the number of clusters k
library(datasets) 
library(cluster)     # clustering algorithms (kmeans helpers, silhouette, clusGap)
library(DT)          # for using datatable()
library(tidyverse)
library(ggpubr)      # for combining plots with ggarrange()
library(gridExtra)   # for combining plots with grid.arrange()

K-Means Clustering is an unsupervised machine learning algorithm that attempts to group observations into distinct clusters. Specifically, the goal of the algorithm is to minimize the differences within clusters and maximize the differences between clusters. Below is a graphical representation of the clusters that could be formed using the algorithm.
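As a minimal sketch of this objective (on simulated data, not the dataset analyzed below), kmeans() exposes both quantities directly: tot.withinss is what the algorithm tries to minimize, betweenss is what grows as a consequence, and the two always sum to totss.

# Sketch: the two quantities k-means trades off (simulated data)
set.seed(123)
toy = rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # one cloud of points
            matrix(rnorm(50, mean = 4), ncol = 2))   # a second, shifted cloud
fit = kmeans(toy, centers = 2, nstart = 25)
fit$tot.withinss                                       # within-cluster variation (minimized)
fit$betweenss                                          # between-cluster variation (grows as a consequence)
all.equal(fit$totss, fit$tot.withinss + fit$betweenss) # the parts sum to the total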

1 Dataset: USArrests

This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

Step-by-step cluster analysis

1.1 Data preparation

To perform a cluster analysis in R, generally, the data should be prepared as follows:
1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have mean zero and standard deviation one (a quick check is shown below).
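As a quick sanity check of step 3 (a small sketch using base R only), scale() gives every column mean zero and standard deviation one:

# Sanity check: scale() centers to mean 0 and scales to sd 1, column by column
scaled = scale(USArrests)
round(colMeans(scaled), 10)   # numerically zero
apply(scaled, 2, sd)          # exactly 1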

# Load the dataset: 
df = USArrests

#Show the table of the original dataset
datatable(df, 
          filter = 'top',
          caption = htmltools::tags$caption(     
                    style = 'caption-side: top; text-align: center;',
                    'Table 1.1: ', htmltools::em('US arrests per 100,000 residents (1973)'))) 
# Remove any missing value
df <- na.omit(df)

# Normalize/scale/standardize the data:
df = scale(df)

#Show the table of the dataset
datatable(df, 
          filter = 'top',
          caption = htmltools::tags$caption(     
                    style = 'caption-side: top; text-align: center;',
                    'Table 1.2: ', htmltools::em('The standardized USArrests data'))) 

1.2 Visualization of distance matrix

This step helps answer the question “Can the data be clustered?” using the Hopkins statistic. If the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.
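A common way to define the statistic (the usual convention, which get_clust_tendency() is assumed here to follow) compares nearest-neighbour distances of uniformly generated points with those of sampled data points:

$$H = \frac{\sum_{i=1}^{n} u_i}{\sum_{i=1}^{n} u_i + \sum_{i=1}^{n} w_i}$$

where u_i is the distance from the i-th uniformly generated point to its nearest real observation and w_i is the distance from the i-th sampled observation to its nearest neighbour among the remaining data. Clustered data make the w_i small and push H towards 1, while purely random data give H around 0.5.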

cl= get_clust_tendency(df, n = 49, graph = T)
cl
## $hopkins_stat
## [1] 0.6079684
## 
## $plot

Or we can use the following code to draw the plot; this code, however, does not report the Hopkins value:

distance = get_dist(df)
fviz_dist(distance, gradient = list(low = "red", mid = "white", high = "blue"))

Discuss:
+ The Hopkins statistic is 0.60 > 0.5, so the data are clusterable.
+ The distance-matrix plot illustrates the cluster structure of the data.

1.3 Determining the optimal number of clusters k

A variety of measures have been proposed in the literature for evaluating clustering results. There are more than thirty indices and methods for identifying the optimal number of clusters. A few common methods:
+ Elbow
+ Average silhouette method
+ Gap statistic method (preferable)

1.3.1 The Elbow method

In the elbow method, the total within-cluster sum of squares is computed and plotted for each number of clusters, and the user looks for a change of slope from steep to shallow (an elbow) to determine the optimal number of clusters. This method is inexact, but still potentially helpful.
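For intuition, fviz_nbclust(..., method = "wss") is essentially doing something like this hand-rolled sketch: fit kmeans() for each k and record the total within-cluster sum of squares.

# Hand-rolled elbow curve: total within-cluster SS for k = 1, ..., 10
wss = sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")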

fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal() 

=> The number of clusters can be either 4 or 6.

1.3.2 Silhouette method

Another visualization that can help determine the optimal number of clusters is the silhouette method. The average silhouette method computes the average silhouette width of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values of k.
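The same criterion can be written out by hand (a sketch, equivalent in spirit to what fviz_nbclust(..., method = "silhouette") computes): fit kmeans() for each k and average the silhouette widths.

# Hand-rolled silhouette criterion: average silhouette width for k = 2, ..., 10
avg_sil = sapply(2:10, function(k) {
  km  = kmeans(df, centers = k, nstart = 25)
  sil = silhouette(km$cluster, dist(df))
  mean(sil[, "sil_width"])
})
plot(2:10, avg_sil, type = "b",
     xlab = "Number of clusters k",
     ylab = "Average silhouette width")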

fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()

=> The number of clusters can be 2.

1.3.3 Gap statistic method

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (i.e., that yields the largest gap), meaning that the clustering structure is far from a random uniform distribution of points.
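Formally (following Tibshirani, Walther and Hastie's definition, which clusGap() implements), the gap statistic for k clusters is

$$\mathrm{Gap}_n(k) = E_n^{*}\left[\log(W_k)\right] - \log(W_k)$$

where W_k is the pooled within-cluster sum of squares and the expectation is estimated by Monte Carlo from B reference datasets drawn from a uniform null distribution (B = 100 in the call below); the larger the gap, the further the observed clustering is from that uniform reference.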

gap = clusGap(df, FUN = kmeans, nstart= 25, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()

=> The number of clusters can be 3.

1.4 Computing k-means

k=2 (Silhouette method)

km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)

km2$betweenss/ km2$totss
## [1] 0.4751918

k=3 (Gap statistic method)

km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)

km3$betweenss/ km3$totss
## [1] 0.6003915
print(km3)
## K-means clustering with 3 clusters of sizes 20, 13, 17
## 
## Cluster means:
##       Murder    Assault   UrbanPop       Rape
## 1  1.0049340  1.0138274  0.1975853  0.8469650
## 2 -0.9615407 -1.1066010 -0.9301069 -0.9667633
## 3 -0.4469795 -0.3465138  0.4788049 -0.2571398
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              3              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              3              3              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              2              1              3              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              3              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              2              2              3 
## 
## Within cluster sum of squares by cluster:
## [1] 46.74796 11.95246 19.62285
##  (between_SS / total_SS =  60.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

k=4 (Elbow method)

km4 = kmeans(df, centers = 4, nstart = 25)
fviz_cluster(km4, df)

km4$betweenss/ km4$totss
## [1] 0.7122287

k=5

km5 = kmeans(df, centers = 5, nstart = 25)
fviz_cluster(km5, df)

km5$betweenss/ km5$totss
## [1] 0.7502847

Or we can use the following commands to combine the plots:

km2 = kmeans(df, centers = 2, nstart = 25)
km3 = kmeans(df, centers = 3, nstart = 25)
km4 = kmeans(df, centers = 4, nstart = 25)
km5 = kmeans(df, centers = 5, nstart = 25)

# Plot to compare:
p1 = fviz_cluster(km2, geom = "point", df) + ggtitle("k=2")
p2 = fviz_cluster(km3, geom = "point", df) + ggtitle("k=3")
p3 = fviz_cluster(km4, geom = "point", df) + ggtitle("k=4")
p4 = fviz_cluster(km5, geom = "point", df) + ggtitle("k=5")

# Combine plots
grid.arrange(p1,p2,p3,p4, nrow=2)

In regression, we define the R2 value as the ratio of the regression sum of squares to the total sum of squares, which indicates how much of the variability in the data is explained by the regression model. In K-Means, the ratio of betweenss to totss has a similar interpretation: it indicates the amount of information retained after clustering.
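This analogy can be checked directly from the kmeans() output: totss decomposes into tot.withinss plus betweenss, and the "explained" fraction is just betweenss / totss (a small sketch using the km2-km5 fits from above).

# The decomposition behind the ratio: totss = tot.withinss + betweenss
all.equal(km3$totss, km3$tot.withinss + km3$betweenss)

# "Explained" fraction for each k, analogous to R2 in regression
sapply(list(k2 = km2, k3 = km3, k4 = km4, k5 = km5),
       function(km) km$betweenss / km$totss)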

In this case:
+ With k = 2, the ratio is 0.47
+ With k = 3, the ratio is 0.6
+ With k = 4, the ratio is 0.71
+ With k = 5, the ratio is 0.75
=> k = 3 is a strong candidate for the optimal number of clusters: the ratio necessarily increases with k, so we prefer the smallest k that already retains a large share of the information, and the gap statistic above also points to k = 3.

The first two components of the cluster plot (for all k) explain 62% + 24.7% = 86.7% of the point variability.

1.5 Validation

Use the silhouette measure (how similar an object is to the other objects in its own cluster compared with the nearest other cluster). The silhouette value s(i) ranges from 1 (well clustered) to -1 (poorly clustered).
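For reference, the silhouette of observation i is defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the average distance from i to the other members of its own cluster and b(i) is the smallest average distance from i to the members of any other cluster; this is the value in [-1, 1] referred to above.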

sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   20          0.26
## 2       2   13          0.37
## 3       3   17          0.32

sil = silhouette(km4$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   13          0.37
## 3       3   16          0.34
## 4       4   13          0.27

1.6 Drawing a dendrogram

For number of clusters = 3

hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)

For number of clusters = 4

hc4 = eclust(df, "hclust", k=4)
fviz_dend(hc4, rect = TRUE)

Using library(dendextend) instead of eclust() and fviz_dend(), we can draw a similar dendrogram:

library(dendextend)
df %>% dist() %>% hclust() %>% as.dendrogram() -> dend
dend %>% set("labels_col", k =4) %>% set("branches_k_color", k =4) %>% plot(axes = FALSE)

1.7 Extra: Compare the clusters with available data

Assign the cluster number to each observation, and then compare it with any group labels you already have.

# Convert back to the dataframe
df = as.data.frame(df)
df$state = rownames(df)

# Check the number of observations in each cluster
table(km3$cluster)
## 
##  1  2  3 
## 20 13 17
# Assign the cluster number to each observation
df$cluster = as.factor(km3$cluster)

# Compare the clusters to a class label (USArrests has no label, so here we simply tabulate the clusters against themselves)
table(df$cluster, df$cluster)
##    
##      1  2  3
##   1 20  0  0
##   2  0 13  0
##   3  0  0 17
  1. We can see that with k=3:
  • Cluster 1: 20 observations
  • Cluster 2: 13 observations
  • Cluster 3: 17 observations
  2. The clustering result can then be compared with a class label (if we have one - for example, Species). This step can be done with the iris data:
    table(df$Species, df$cluster)

  3. Consider analyzing normalized data for clustering rather than un-normalized data.

  4. Below are scatter plots illustrating the relationship between urban population and the number of arrests for Murder/Rape/Assault, first on the standardized data and then on the raw data.

# Murder vs UrbanPop - normalized value
a1 = ggplot(df, aes(UrbanPop, Murder, shape = cluster))+
      geom_point () + 
      stat_ellipse(aes(y = Murder, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) + 
      labs(x = 'UrbanPop', y = 'Murder', subtitle = "Murder vs UrbanPop")+
      theme_bw() 

# Assault vs UrbanPop - normalized value
a2 = ggplot(df, aes(UrbanPop, Assault, shape = cluster))+
      geom_point () + 
      stat_ellipse(aes(y = Assault, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) + 
      labs(x = 'UrbanPop', y = 'Assault', subtitle = "Assault vs UrbanPop")+
      theme_bw() 

# Rape vs UrbanPop - normalized value
a3 = ggplot(df, aes(UrbanPop, Rape, shape = cluster))+
      geom_point () + 
      stat_ellipse(aes(y = Rape, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) + 
      labs(x = 'UrbanPop', y = 'Rape', subtitle = "Rape vs UrbanPop")+
      theme_bw() 

# Drawing a density plot of 3 clusters
b1 = ggdensity(df, x = "Murder", fill = "cluster") +
        labs(subtitle = "Murder Density")+
        theme_bw()
  
b2 = ggdensity(df, x = "Assault", fill = "cluster") +
        labs(subtitle = "Assault Density")+
        theme_bw()

b3 = ggdensity(df, x = "Rape", fill = "cluster") +
        labs(subtitle = "Rape Density")+
        theme_bw()

# Combine the 6 plots
plot = ggarrange(a1,a2, a3, b1, b2, b3, 
                 ncol=3, nrow=2, 
                 common.legend = TRUE,
                 legend="right", 
                 labels = c("A","B","C", "D", "E", "F"))

annotate_figure(plot, bottom = text_grob("The normalized data of people committing murder/assault/rape vs urban population ", 
               color = "red", face = "bold", size = 12))

# Data preparation
data = USArrests
data$cluster = as.factor(km3$cluster)
data$state = rownames(data) 

# Murder vs UrbanPop 
a1 = ggplot(data, aes(UrbanPop, Murder, shape = cluster))+
      geom_point () + 
      stat_ellipse(aes(y = Murder, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) + 
      labs(x = 'Urban Population (%)', y = 'Murder', subtitle = "Murder vs UrbanPop")+
      theme_bw() 

# Assault vs UrbanPop - raw value
a2 = ggplot(data, aes(UrbanPop, Assault, shape = cluster))+
      geom_point () + 
      stat_ellipse(aes(y = Assault, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) + 
      labs(x = 'Urban Population (%)', y = 'Assault', subtitle = "Assault vs UrbanPop")+
      theme_bw() 

# Rape vs UrbanPop - raw value
a3 = ggplot(data, aes(UrbanPop, Rape, shape = cluster))+
      geom_point () + 
      stat_ellipse(aes(y = Rape, x = UrbanPop, fill = cluster), geom = 'polygon', alpha = 0.4, level = 0.95) + 
      labs(x = 'Urban Population (%)', y = 'Rape', subtitle = "Rape vs UrbanPop")+
      theme_bw() 

# Drawing a density plot of 3 clusters
b1 = ggdensity(data, x = "Murder", fill = "cluster") +
        labs(subtitle = "Murder Density")+
        theme_bw()
  
b2 = ggdensity(data, x = "Assault", fill = "cluster") +
        labs(subtitle = "Assault Density")+
        theme_bw()

b3 = ggdensity(data, x = "Rape", fill = "cluster") +
        labs(subtitle = "Rape Density")+
        theme_bw()

# Combine the 6 plots
plot = ggarrange(a1,a2, a3, b1, b2, b3, 
                 ncol=3, nrow=2, 
                 common.legend = TRUE,
                 legend="right", 
                 labels = c("A","B","C", "D", "E", "F"))

annotate_figure(plot, bottom = text_grob("The data of people committing murder/assault/rape vs urban population ", 
               color = "red", face = "bold", size = 12))

2 Dataset: iris

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Step-by-step cluster analysis

2.1 Data preparation

To perform a cluster analysis in R, generally, the data should be prepared as follows:
1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have mean zero and standard deviation one.

# Load the dataset: 
df = iris

#Show the table of the original dataset
datatable(df, 
          filter = 'top',
          caption = htmltools::tags$caption(     
                    style = 'caption-side: top; text-align: center;',
                    'Table 2.1: ', htmltools::em('The iris dataset'))) 
# Remove any missing value
df <- na.omit(df)

# Select only numeric value for computing, remove the SPECIES COLUMN:
df = df %>% select (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

# Normalize/scale/standardize the data:
df = scale(df)

#Show the table of the dataset
datatable(df, 
          filter = 'top',
          caption = htmltools::tags$caption(     
                    style = 'caption-side: top; text-align: center;',
                    'Table 2.2: ', htmltools::em('The standardized iris dataset'))) 

2.2 Visualization of distance matrix

This step helps answer the question “Can the data be clustered?” using the Hopkins statistic. If the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.

cl= get_clust_tendency(df, n = 50, graph = T)
cl
## $hopkins_stat
## [1] 0.7997314
## 
## $plot

Or we can use the following command, but we won’t know the Hopkins statistic:

distance = get_dist(df)
fviz_dist(distance, gradient = list(low = "red", mid = "white", high = "blue"))

Discuss:
+ The Hopkins statistic is 0.80, well above 0.5, so the data are strongly clusterable.
+ The distance-matrix plot illustrates the cluster structure of the data.

2.3 Determining the optimal number of clusters k

A variety of measures have been proposed in the literature for evaluating clustering results. There are more than thirty indices and methods for identifying the optimal number of clusters. A few common methods:
+ Elbow
+ Average silhouette method
+ Gap statistic method (preferable)

2.3.1 The Elbow method

In the elbow method, the total within-cluster sum of squares is computed and plotted for each number of clusters, and the user looks for a change of slope from steep to shallow (an elbow) to determine the optimal number of clusters. This method is inexact, but still potentially helpful.

fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal() 

=> The number of clusters can be 3.

2.3.2 Silhouette method

Another visualization that can help determine the optimal number of clusters is the silhouette method. The average silhouette method computes the average silhouette width of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values of k.

fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()

=> The number of clusters can be 2.

2.3.3 Gap statistic method

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (i.e., that yields the largest gap), meaning that the clustering structure is far from a random uniform distribution of points.

gap = clusGap(df, FUN = kmeans, nstart= 10, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()

=> The number of clusters can be 2.

2.4 Computing k-means

k=2 (Gap statistic - Silhouette method)

km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)

km2$betweenss/ km2$totss
## [1] 0.6293972

k=3 (Elbow method)

km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)

km3$betweenss/ km3$totss
## [1] 0.7669658

k=4

km4 = kmeans(df, centers = 4, nstart = 25)
fviz_cluster(km4, df)

km4$betweenss/ km4$totss
## [1] 0.8098463

Or we can use the following commands to combine the plots:

km2 = kmeans(df, centers = 2, nstart = 25)
km3 = kmeans(df, centers = 3, nstart = 25)
km4 = kmeans(df, centers = 4, nstart = 25)
km5 = kmeans(df, centers = 5, nstart = 25)

# Plot to compare:
p1 = fviz_cluster(km2, geom = "point", df) + ggtitle("k=2")
p2 = fviz_cluster(km3, geom = "point", df) + ggtitle("k=3")
p3 = fviz_cluster(km4, geom = "point", df) + ggtitle("k=4")
p4 = fviz_cluster(km5, geom = "point", df) + ggtitle("k=5")

# Combine plots
grid.arrange(p1,p2,p3,p4, nrow=2)

In regression, we define the R2 value as the ratio of the regression sum of squares to the total sum of squares, which indicates how much of the variability in the data is explained by the regression model. In K-Means, the ratio of betweenss to totss has a similar interpretation: it indicates the amount of information retained after clustering.

In this case:
+ With k = 2, the ratio is 0.63
+ With k = 3, the ratio is 0.76
+ With k = 4, the ratio is 0.80
=> k = 3 is a strong candidate for the optimal number of clusters: the jump in the ratio up to k = 3 is large, and, more importantly, we already know the data contain 3 Species => choose k = 3.

The first two components of the cluster plot (for all k) explain 73% + 22.9% = 95.9% of the point variability.

2.5 Validation

Use the silhouette measure (how similar an object is to the other objects in its own cluster compared with the nearest other cluster). The silhouette value s(i) ranges from 1 (well clustered) to -1 (poorly clustered). Let's check k = 2, 3, and 4:

sil = silhouette(km2$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   47          0.35
## 2       2   53          0.39
## 3       3   50          0.64

sil = silhouette(km4$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   25          0.47
## 2       2   47          0.35
## 3       3   25          0.37
## 4       4   53          0.38

Assess the silhouette width:
+ k = 2: 0.58
+ k = 3: 0.46 => preferable, because we already know there are 3 Species (as explained above)
+ k = 4: 0.38
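These overall values are simply the size-weighted averages of the per-cluster widths shown above; a quick sketch to reproduce them from the fitted models:

# Overall average silhouette width for each k
sapply(list(k2 = km2, k3 = km3, k4 = km4),
       function(km) mean(silhouette(km$cluster, dist(df))[, "sil_width"]))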

2.6 Drawing a dendrogram

For number of clusters = 3

hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)

Using library(dendextend) instead of eclust() and fviz_dend(), we can draw a similar dendrogram:

library(dendextend)
df %>% dist() %>% hclust() %>% as.dendrogram() -> dend
dend %>% set("labels_col", k = 3) %>% set("branches_k_color", k = 3) %>% plot(axes = FALSE)

2.7 Extra: Compare the clusters with available data

Assign the cluster number to each observation, and then compare it with any group labels you already have.

# Convert back to the dataframe
df = as.data.frame(df)

# Add the Species column back to df (NOTE: we removed the Species column earlier)
data = iris
df$Species = iris$Species

# Check the number of observations in each cluster
table(km3$cluster)
## 
##  1  2  3 
## 47 53 50
# Check the number of observations per Species
table(df$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
# Assign the cluster number to each observation
df$cluster = as.factor(km3$cluster)

# Compare the cluster result to the class label (if we have one)
table(df$cluster, df$Species)
##    
##     setosa versicolor virginica
##   1      0         11        36
##   2      0         39        14
##   3     50          0         0

Discuss:

  1. With the number of clusters k=3:
  • Cluster 1: 47 observations
  • Cluster 2: 53 observations
  • Cluster 3: 50 observations
  2. Each Species has 50 observations.

  3. There is a mismatch between our clustering result and the Species.

  • Total number of correctly classified instances: 50 + 39 + 36 = 125
  • Total number of incorrectly classified instances: 11 + 14 = 25
  • Accuracy = 125/(125+25) = 0.833, i.e., our clustering achieves 83.3% accuracy.
  4. For the setosa cluster:
  • Sensitivity (true positive rate) = 50/50 = 100%
  • Specificity (true negative rate) = 100/100 = 100%
  • Accuracy = 100%
  5. For the versicolor cluster:
  • Sensitivity (true positive rate) = 39/50 = 78%
  • Specificity (true negative rate) = 86/100 = 86% (14 false positives)
  • Accuracy = (39+86)/150 = 83.3%
  6. For the virginica cluster:
  • Sensitivity (true positive rate) = 36/50 = 72%
  • Specificity (true negative rate) = 89/100 = 89% (11 false positives)
  • Accuracy = (36+89)/150 = 83.3%
  7. Consider analyzing normalized data for clustering rather than un-normalized data.

  8. Plot the normalized data:

# Change the level order
df$Species = factor(df$Species, levels = c("versicolor", "setosa", "virginica"))

# Sepal.Length vs Sepal.Width - Cluster + Species: 
a1 = ggplot(df, aes(x = Sepal.Width, y= Sepal.Length, color = cluster))+
      geom_point () + 
      labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Cluster of S.Length vs S.Width")+
      theme_bw() 


b1 = ggplot(df, aes(x = Sepal.Width, y= Sepal.Length, color = Species))+
      geom_point () + 
      labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Species of S.Length vs S.Width")+
      theme_bw() 

# Petal.Length vs Petal.Width - cluster + Species
a2 = ggplot(df, aes(Petal.Width, Petal.Length, color = cluster))+
      geom_point () + 
      labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Cluster of P.Length vs P.Width")+
      theme_bw() 

b2 = ggplot(df, aes(Petal.Width, Petal.Length, color = Species))+
      geom_point () + 
      labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Species of P.Length vs P.Width")+
      theme_bw() 

plot = ggarrange(a1, b1, a2, b2, 
                 ncol=2, nrow=2, 
                 common.legend = F,
                 legend="right", 
                 labels = c("A","B", "C", "D"))

annotate_figure(plot, bottom = text_grob("The scatter plot", 
               color = "red", face = "bold", size = 12))

# Drawing a density plot of 3 clusters
c1 = ggdensity(df, x = "Sepal.Length", fill = "cluster") +
        labs(subtitle = "Sepal.Length Density")+
        theme_bw()
  
c2 = ggdensity(df, x = "Sepal.Width", fill = "cluster") +
        labs(subtitle = "Sepal.Width Density")+
        theme_bw()

c3 = ggdensity(df, x = "Petal.Length", fill = "cluster") +
        labs(subtitle = "Petal.Length Density")+
        theme_bw()

c4 = ggdensity(df, x = "Petal.Width", fill = "cluster") +
        labs(subtitle = "Petal.Width Density")+
        theme_bw()
  
plot = ggarrange(c1,c2, c3, c4, 
                 ncol=2, nrow=2, 
                 common.legend = TRUE,
                 legend="right", 
                 labels = c("A","B","C", "D"))

annotate_figure(plot, bottom = text_grob("The density plot", 
               color = "red", face = "bold", size = 12))

  9. Plot the un-normalized data:
# Data preparation
data = iris
data$cluster = as.factor(km3$cluster)

# Change the level order
data$Species = factor(data$Species, levels = c("versicolor", "setosa", "virginica"))

# Sepal.Length vs Sepal.Width - Cluster + Species: 
a1 = ggplot(data, aes(x = Sepal.Width, y= Sepal.Length, color = cluster))+
      geom_point () + 
      labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Cluster of S.Length vs S.Width")+
      theme_bw() 

b1 = ggplot(data, aes(x = Sepal.Width, y= Sepal.Length, color = Species))+
      geom_point () + 
      labs(x = 'Sepal.Width', y = 'Sepal.Length', subtitle = "Species of S.Length vs S.Width")+
      theme_bw() 

# Petal.Length vs Petal.Width - cluster + Species
a2 = ggplot(data, aes(Petal.Width, Petal.Length, color = cluster))+
      geom_point () + 
      labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Cluster of P.Length vs P.Width")+
      theme_bw() 

b2 = ggplot(data, aes(Petal.Width, Petal.Length, color = Species))+
      geom_point () + 
      labs(x = 'Petal.Width', y = 'Petal.Length', subtitle = "Species of P.Length vs P.Width")+
      theme_bw() 

plot = ggarrange(a1, b1, a2, b2, 
                 ncol=2, nrow=2, 
                 common.legend = F,
                 legend="right", 
                 labels = c("A","B", "C", "D"))

annotate_figure(plot, bottom = text_grob("The scatter plot", 
               color = "red", face = "bold", size = 12))

# Drawing a density plot of 3 clusters
c1 = ggdensity(data, x = "Sepal.Length", fill = "cluster") +
        labs(subtitle = "Sepal.Length Density")+
        theme_bw()
  
c2 = ggdensity(data, x = "Sepal.Width", fill = "cluster") +
        labs(subtitle = "Sepal.Width Density")+
        theme_bw()

c3 = ggdensity(data, x = "Petal.Length", fill = "cluster") +
        labs(subtitle = "Petal.Length Density")+
        theme_bw()

c4 = ggdensity(data, x = "Petal.Width", fill = "cluster") +
        labs(subtitle = "Petal.Width Density")+
        theme_bw()
  
plot = ggarrange(c1,c2, c3, c4, 
                 ncol=2, nrow=2, 
                 common.legend = TRUE,
                 legend="right", 
                 labels = c("A","B","C", "D"))

annotate_figure(plot, bottom = text_grob("The density plot", 
               color = "red", face = "bold", size = 12))

3 Dataset: USArrests: un-normalized data (trial only)

To test what happens when we don't normalize the data before analyzing it.

Step-by-step cluster analysis

3.1 Data preparation

# Load the dataset: 
df = USArrests

#Show the table of the original dataset
datatable(df, 
          filter = 'top',
          caption = htmltools::tags$caption(     
                    style = 'caption-side: top; text-align: center;',
                    'Table 3.1: ', htmltools::em('US arrests per 100,000 residents (1973)'))) 

3.2 Visualization of distance matrix

This step helps answer the question “Can the data be clustered?” using the Hopkins statistic. If the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.

cl= get_clust_tendency(df, n = 49, graph = T)
cl
## $hopkins_stat
## [1] 0.5638709
## 
## $plot

Discuss:
+ The Hopkins statistic is 0.56 > 0.5, so the data are clusterable.
+ The ordered dissimilarity plot illustrates the cluster structure of the data.

3.3 Determining the optimal number of clusters k

  • Elbow
  • Average silhouette method
  • Gap statistic method (preferable)

3.3.1 The Elbow method

fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal() 

=> The number of clusters can be 3.

3.3.2 Silhouette method

fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()

=> The number of clusters can be 2.

3.3.3 Gap statistic method

gap = clusGap(df, FUN = kmeans, nstart= 25, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()

=> The number of clusters can't be clearly identified.

3.4 Computing k-means

k=2

km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)

km2$betweenss/ km2$totss
## [1] 0.72907

k=3

km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)

km3$betweenss/ km3$totss
## [1] 0.8651961

In this case:
+ With k = 2, the ratio is 0.72
+ With k = 3, the ratio is 0.86
=> k = 3 is a likely choice for the number of clusters: the ratio improves substantially when moving from k = 2 to k = 3.

3.5 Validation

Use the silhouette measure (how similar an object is to the other objects in its own cluster compared with the nearest other cluster). The silhouette value s(i) ranges from 1 (well clustered) to -1 (poorly clustered).

sil = silhouette(km2$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   21          0.58
## 2       2   29          0.60

sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   14          0.54
## 2       2   16          0.54
## 3       3   20          0.52

3.6 Drawing a dendrogram

For number of clusters = 3

hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)

4 Dataset: iris: un-normalized data (trial only)

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Step-by-step cluster analysis

4.1 Data preparation

# Load the dataset: 
df = iris

#Show the table of the original dataset
datatable(df, 
          filter = 'top',
          caption = htmltools::tags$caption(     
                    style = 'caption-side: top; text-align: center;',
                    'Table 4.1: ', htmltools::em('The iris dataset'))) 

To perform a cluster analysis in R, generally, the data should be prepared as follows:
1. Rows are observations (individuals) and columns are variables.
2. Any missing value in the data must be removed or estimated.
3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that standardization consists of transforming the variables so that they have mean zero and standard deviation one.

# Remove any missing value
df <- na.omit(df)

# Select only numeric value for computing, remove the SPECIES COLUMN:
df = df %>% select (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

# No scaling here: we deliberately keep the raw values for this trial

#Show the table of the dataset
datatable(df, 
          filter = 'top',
          caption = htmltools::tags$caption(     
                    style = 'caption-side: top; text-align: center;',
                    'Table 4.2: ', htmltools::em('The numeric columns of the iris dataset'))) 

4.2 Visualization of distance matrix

This step helps answer the question “Can the data be clustered?” using the Hopkins statistic. If the Hopkins statistic is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.

cl= get_clust_tendency(df, n = 50, graph = T)
cl
## $hopkins_stat
## [1] 0.8191482
## 
## $plot

Discuss:
+ The Hopkins statistic is 0.81, well above 0.5, so the data are strongly clusterable.
+ The ordered dissimilarity plot illustrates the cluster structure of the data.

4.3 Determining the optimal number of clusters k

A variety of measures have been proposed in the literature for evaluating clustering results. There are more than thirty indices and methods for identifying the optimal number of clusters. A few common methods:
+ Elbow
+ Average silhouette method
+ Gap statistic method (preferable)

4.3.1 The Elbow method

In the elbow method, the total within-cluster sum of squares is computed and plotted for each number of clusters, and the user looks for a change of slope from steep to shallow (an elbow) to determine the optimal number of clusters. This method is inexact, but still potentially helpful.

fviz_nbclust(df, kmeans, method = "wss") + labs(subtitle = "The Elbow Method") + theme_minimal() 

=> The number of clusters can be 3.

4.3.2 Silhouette method

Another visualization that can help determine the optimal number of clusters is the silhouette method. The average silhouette method computes the average silhouette width of observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values of k.

fviz_nbclust( df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method") + theme_minimal()

=> The number of clusters can be 2.

4.3.3 Gap statistic method

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (i.e., that yields the largest gap), meaning that the clustering structure is far from a random uniform distribution of points.

gap = clusGap(df, FUN = kmeans, nstart= 10, K.max = 10, B = 100)
fviz_gap_stat(gap) + labs(subtitle = "Gap statistic method") + theme_minimal()

=> The method suggests 6 clusters, but looking at the plot I think 3 is more reasonable.

4.4 Computing k-means

k=2 (Silhouette method)

km2 = kmeans(df, centers = 2, nstart = 25)
fviz_cluster(km2, df)

km2$betweenss/ km2$totss
## [1] 0.7764096

k=3 (Elbow method)

km3 = kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km3, df)

km3$betweenss/ km3$totss
## [1] 0.8842753

In this example:
+ With k = 2, the ratio is 0.77
+ With k = 3, the ratio is 0.88
=> We already know there are 3 Species => choose k = 3

The first two components of the cluster plot (for all k) explain 73% + 22.9% = 95.9% of the point variability.

4.5 Validation

Use the silhouette measure (how similar an object is to the other objects in its own cluster compared with the nearest other cluster). The silhouette value s(i) ranges from 1 (well clustered) to -1 (poorly clustered). Let's check k = 3:

sil = silhouette(km3$cluster, dist(df))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   62          0.42
## 2       2   50          0.80
## 3       3   38          0.45

Assess the silhouette width:
+ k = 3: 0.55 => preferable, because we already know there are 3 Species (as explained above)

4.6 Drawing a dendrogram

For number of clusters = 3

hc3 = eclust(df, "hclust", k=3)
fviz_dend(hc3, rect = TRUE)

4.7 Extra: Compare the clusters with available data

Assign the cluster number to each observation, and then compare it with any group labels you already have.

# Convert back to the dataframe
df = as.data.frame(df)

# Add the Species column back to df (NOTE: we removed the Species column earlier)
data = iris
df$Species = iris$Species

# Check the number of observations in each cluster
table(km3$cluster)
## 
##  1  2  3 
## 62 50 38
# Check the number of observations per Species
table(df$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
# Assign the cluster number to each observation
df$cluster = as.factor(km3$cluster)

# Compare the cluster result to the class label (if we have one)
table(df$cluster, df$Species)
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36

Discuss:

  1. With the number of clusters k=3:
  • Cluster 1: 62 observations
  • Cluster 2: 50 observations
  • Cluster 3: 38 observations
  2. Each Species has 50 observations.

  3. There is a mismatch between our clustering result and the Species.

  • Total number of correctly classified instances: 50 + 48 + 36 = 134
  • Total number of incorrectly classified instances: 2 + 14 = 16
  • Accuracy = 134/150 = 0.893, i.e., our clustering achieves 89.3% accuracy (see the sketch after this list).
  4. For the setosa cluster:
  • Sensitivity (true positive rate) = 50/50 = 100%
  • Specificity (true negative rate) = 100/100 = 100%
  • Accuracy = 100%
  5. For the versicolor cluster:
  • Sensitivity (true positive rate) = 48/50 = 96%
  • Specificity (true negative rate) = 86/100 = 86% (14 false positives)
  • Accuracy = (48+86)/150 = 89.3%
  6. For the virginica cluster:
  • Sensitivity (true positive rate) = 36/50 = 72%
  • Specificity (true negative rate) = 98/100 = 98% (2 false positives)
  • Accuracy = (36+98)/150 = 89.3%
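For completeness, the accuracy, sensitivity, and specificity above can be reproduced from the confusion matrix with a short sketch (the cluster-to-species mapping is simply each cluster's majority species, so this is specific to the run shown here):

# Reproduce accuracy / sensitivity / specificity from the confusion matrix
cm = table(df$cluster, df$Species)

# Map every cluster to its majority species, then label each observation accordingly
pred = colnames(cm)[apply(cm, 1, which.max)][df$cluster]

# Overall accuracy
mean(pred == df$Species)

# Per-species sensitivity and specificity
sapply(levels(df$Species), function(sp) {
  tp = sum(pred == sp & df$Species == sp)   # correctly labelled as sp
  tn = sum(pred != sp & df$Species != sp)   # correctly labelled as not sp
  c(sensitivity = tp / sum(df$Species == sp),
    specificity = tn / sum(df$Species != sp))
})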