We will use the DescTools and caret packages, which were previously installed. In addition, we use the ggplot2 and factoextra packages for plotting, as well as the cluster and fpc packages.
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(ggplot2)
library(cluster)
library(factoextra)
library(fpc)
In the lesson that follows we will use a subset of the “carseats.csv” file, “carseat_sub.csv”. This dataset contains sales of child car seats at 81 different stores. The car seat manufacturer would like to group stores for market segmentation and marketing purposes. We load the objects in the Clustering.RData file and use the ls() function to list its contents.
load("Clustering.RData")
ls()
## [1] "cs_dist" "cs_sc" "cs_sub" "facs"
## [5] "kmeans8" "nums" "ords" "ward"
## [9] "wards_clusters"
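Before moving to validation, it can help to confirm what the loaded objects contain. The quick checks below are an optional sketch using base R functions only; they are not part of the original output.
str(cs_sub)             # original car seat subset (81 stores)
head(cs_sc[, nums])     # scaled numeric variables used for clustering
length(wards_clusters)  # one cluster label per store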
In performing External Validation, we compare the clusters to a known categorical variable. We can compare the distributions of the clustering solutions to the Sales_Lev variable using the table() function to evaluate if either method is able to differentiate Low and High Sales Levels.
First, we compare the Ward’s Method HCA clusters (wards_clusters).
table(Sales = cs_sub$Sales_Lev,
Clusters = wards_clusters)
## Clusters
## Sales 1 2 3 4 5 6 7 8
## Low 11 6 11 1 4 4 5 6
## High 3 4 9 4 4 6 3 0
Next, we compare the k-Means clusters (kmeans8$cluster).
table(Sales = cs_sub$Sales_Lev,
Clusters = kmeans8$cluster)
## Clusters
## Sales 1 2 3 4 5 6 7 8
## Low 6 6 7 3 3 13 6 4
## High 7 1 1 7 5 4 3 5
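If proportions are easier to read than raw counts, we can wrap either table in prop.table(). The call below is an optional sketch using the k-Means clusters; margin = 2 gives the Low/High split within each cluster.
round(prop.table(table(Sales = cs_sub$Sales_Lev,
                       Clusters = kmeans8$cluster),
                 margin = 2), 2) # column proportions per cluster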
As shown, neither solution is able to separate the clusters by the Sales_Lev variable.
We can plot the WSS using the fviz_nbclust() function from the factoextra package and specify method = "wss". We look for an ‘elbow’ point, the tradeoff point where adding more clusters no longer produces a substantial drop in WSS.
fviz_nbclust(cs_sc[ ,nums], # scaled data
FUNcluster = hcut, # hierarchical
method = "wss", # SSE
k.max = 15)
Unfortunately, as shown, there is no clear elbow point in this case to help us choose a k value.
fviz_nbclust(cs_sc[ ,nums], # scaled data
FUNcluster = kmeans, # kmeans
method = "wss", # SSE
k.max = 15)
Based on the plot, there appears to be a leveling off, or elbow, occurring at k = 8. For this reason, we would retain 8 clusters.
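To see what the plot summarizes, we could also compute the total within-cluster sum of squares ourselves across a range of k values. The loop below is an illustrative sketch; the seed and nstart values are arbitrary choices, not part of the original lesson.
set.seed(831) # arbitrary seed for reproducibility
wss <- sapply(1:15, function(k) {
  kmeans(cs_sc[, nums], centers = k, nstart = 25)$tot.withinss
})
plot(1:15, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster SS")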
We can plot the output from running the silhouette() function in the cluster package to obtain a visualization of the silhouette coefficients.
Per-observation Silhouette values for the 8-cluster Ward’s solution.
We can use the plot() function on the output from the silhouette() function to visualize the Silhouette values for a given clustering solution.
plot(silhouette(x = wards_clusters,
dist = cs_dist),
main = "") # remove main title
As we did with WSS, we can use the fviz_nbclust() function and specify method = "silhouette" to plot the average silhouette values across a range of k values, choosing the value of k that maximizes the average silhouette value. For HCA, we set FUNcluster = hcut. A vertical dashed line marks the k value with the maximum average silhouette. In some cases, we may have local maxima that are better options for reducing the complexity (number of clusters).
fviz_nbclust(x = cs_sc[ ,nums],
FUNcluster = hcut,
method = "silhouette",
k.max = 15)
Based on the output, k = 15 is suggested, with the average silhouette increasing as k increases. In this case, since we have a small sample, we would prefer a lower k, with k = 2 and k = 8 appearing as local maxima.
Per-observation Silhouette values for the 8-cluster kMC solution.
We can use the plot() function on the output from the silhouette() function to visualize the Silhouette values for a given clustering solution.
plot(silhouette(x = kmeans8$cluster,
dist = cs_dist),
main = "")
Finally, we can use the fviz_nbclust() function and specify method = "silhouette" to plot the average silhouette values across a range of k values and choose the value of k that maximizes the average silhouette value. For kMC, we set FUNcluster = kmeans.
fviz_nbclust(x = cs_sc[ ,nums],
FUNcluster = kmeans,
method = "silhouette",
k.max = 15)
Based on the output, k = 13 yields the maximum average silhouette value. Again, we may prefer to choose a smaller k value, in this case k = 8.
We can use the cluster.stats() function from the fpc package to validate cluster solutions and compare:
- max.diameter: small diameters are preferred (compact clusters)
- min.separation: large separation between clusters is preferred
- average.between: a large average distance between clusters is preferred
- average.within: a small average distance within clusters is preferred
- dunn: computed as minimum separation / maximum diameter; higher values are preferred
We can create a vector of the component names so that we can create a dataframe to compare these values side-by-side.
c_stats <- c("max.diameter", "min.separation",
"average.between", "average.within",
"dunn")
HCA
ward_stats <- cluster.stats(d = cs_dist,
clustering = wards_clusters)
kMC
kmc_stats <- cluster.stats(d = cs_dist,
clustering = kmeans8$cluster)
Compare the values side-by-side using cbind(), matching the component names to the vector (c_stats).
cbind(HCA = ward_stats[names(ward_stats) %in% c_stats],
kMC = kmc_stats[names(kmc_stats) %in% c_stats])
## HCA kMC
## average.between 3.783192 3.773382
## average.within 2.623164 2.562219
## max.diameter 4.499647 4.459896
## min.separation 1.428948 1.201353
## dunn 0.3175688 0.269368
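As a sanity check on the definition given above, the Dunn Index should equal the minimum separation divided by the maximum diameter for each solution; a small sketch:
ward_stats$min.separation / ward_stats$max.diameter # should match 0.3175688
kmc_stats$min.separation / kmc_stats$max.diameter   # should match 0.269368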
Based on the values, HCA has a higher Dunn Index, higher average between distance, and a larger minimum separation distance. This suggests that the clusters are further apart from each other than in the kMC solution. However, the HCA cluster solution has a higher maximum diameter and higher average within distance, suggesting that the clusters themselves are more spread out and less compact than in the kMC solution.
We can also compare cluster solutions to known external references, such as Sales Level. In this case we focus on the Adjusted Rand Index (corrected.rand). This value ranges from -1 to 1, where values close to 1 are preferred.
We run the cluster.stats() function again for both HCA and kMC, this time specifying the Sales_Lev variable in the alt.clustering argument. Note: we convert Sales_Lev to numeric using the as.numeric() function for compatibility. We isolate the corrected.rand component as the output.
HCA
cluster.stats(d = cs_dist,
clustering = wards_clusters,
alt.clustering = as.numeric(cs_sub$Sales_Lev))$corrected.rand
## [1] 0.007263596
kMC
cluster.stats(d = cs_dist,
clustering = kmeans8$cluster,
alt.clustering = as.numeric(cs_sub$Sales_Lev))$corrected.rand
## [1] 0.02242552
As shown, the kMC solution does a better job of aligning with the Sales Level, but both corrected.rand values show poor performance (close to 0).
Our next steps would be to improve the clustering solutions by using the k values suggested by the visualizations produced by the fviz_nbclust() function for both methods, based on the average silhouette and WSS approaches.
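For example, assuming the ward object loaded from Clustering.RData is the hclust fit for Ward's method (an assumption, since only the object name is shown above), a 2-cluster comparison, the local maximum noted earlier, might look like the sketch below.
ward2 <- cutree(ward, k = 2) # assumes ward is an hclust object
cluster.stats(d = cs_dist,
              clustering = ward2,
              alt.clustering = as.numeric(cs_sub$Sales_Lev))$corrected.rand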