We will use the DescTools and caret packages, which were previously installed. In addition, we use the ggplot2 and factoextra packages for plotting, as well as the cluster and fpc packages.
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(ggplot2)
library(cluster)
library(factoextra)
library(fpc)
In the lesson that follows we will use a subset of the “carseats.csv” file, “carseat_sub.csv”. This dataset contains sales of child car seats at 81 different stores. The car seat manufacturer would like to group stores for market segmentation and marketing purposes. We load the objects in the Clustering.RData file and use the ls() function to list its contents.
load("Clustering.RData")
ls()
## [1] "cs_dist" "cs_sc" "cs_sub" "facs"
## [5] "kmeans8" "nums" "ords" "ward"
## [9] "wards_clusters"
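Before moving to validation, it can help to confirm what the loaded objects contain. The quick checks below are an optional sketch using base R functions only; they are not part of the original output.
str(cs_sub)             # original car seat subset (81 stores)
head(cs_sc[, nums])     # scaled numeric variables used for clustering
length(wards_clusters)  # one cluster label per store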
In performing External Validation, we compare the clusters to a known categorical variable. We can compare the distributions of the clustering solutions to the Sales_Lev variable using the table() function to evaluate if either method is able to differentiate Low and High Sales Levels.
First, we compare the Ward’s Method HCA clusters (wards_clusters).
table(Sales = cs_sub$Sales_Lev,
Clusters = wards_clusters)
## Clusters
## Sales 1 2 3 4 5 6 7 8
## Low 11 6 11 1 4 4 5 6
## High 3 4 9 4 4 6 3 0
Next, we compare the k-Means clusters (kmeans8$cluster).
table(Sales = cs_sub$Sales_Lev,
Clusters = kmeans8$cluster)
## Clusters
## Sales 1 2 3 4 5 6 7 8
## Low 6 6 7 3 3 13 6 4
## High 7 1 1 7 5 4 3 5
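If proportions are easier to read than raw counts, we can wrap either table in prop.table(). The call below is an optional sketch using the k-Means clusters; margin = 2 gives the Low/High split within each cluster.
round(prop.table(table(Sales = cs_sub$Sales_Lev,
                       Clusters = kmeans8$cluster),
                 margin = 2), 2) # column proportions per cluster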
As shown, neither solution is able to separate the clusters by the Sales_Lev variable.
We can plot the WSS using the fviz_nbclust() function from the factoextra package and specify method = "wss". We look for an ‘elbow’ point, the tradeoff point where adding more clusters no longer produces a substantial drop in WSS.
fviz_nbclust(cs_sc[ ,nums], # scaled data
FUNcluster = hcut, # hierarchical
method = "wss", # SSE
k.max = 15)
Unfortunately, as shown, there is no clear elbow point in this case to help us choose a k value.
fviz_nbclust(cs_sc[ ,nums], # scaled data
FUNcluster = kmeans, # kmeans
method = "wss", # SSE
k.max = 15)
Based on the plot, there appears to be a leveling off, or elbow, occurring at k = 8. For this reason, we would retain 8 clusters.
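To see what the plot summarizes, we could also compute the total within-cluster sum of squares ourselves across a range of k values. The loop below is an illustrative sketch; the seed and nstart values are arbitrary choices, not part of the original lesson.
set.seed(831) # arbitrary seed for reproducibility
wss <- sapply(1:15, function(k) {
  kmeans(cs_sc[, nums], centers = k, nstart = 25)$tot.withinss
})
plot(1:15, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster SS")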
We can plot the output from running the silhouette() function in the cluster package to obtain a visualization of the silhouette coefficients.
Per-observation Silhouette values for the 8-cluster Ward’s solution.
We can use the plot() function on the output from the silhouette() function to visualize the Silhouette values for a given clustering solution.
plot(silhouette(x = wards_clusters,
dist = cs_dist),
main = "") # remove main title
As we did with WSS, we can use the fviz_nbclust() function and specify method = "silhouette" to plot the average silhouette values across a range of k values, choosing the value of k that maximizes the average silhouette value. For HCA, we set FUNcluster = hcut. A vertical dashed line marks the k value with the maximum average silhouette. In some cases, we may have local maxima that are better options for reducing the complexity (number of clusters).
fviz_nbclust(x = cs_sc[ ,nums],
FUNcluster = hcut,
method = "silhouette",
k.max = 15)
Based on the output, k = 15 is suggested, with the average silhouette increasing as k increases. In this case, since we have a small sample, we would prefer a lower k, with k = 2 and k = 8 appearing as local maxima.
Per-observation Silhouette values for the 8-cluster kMC solution.
We can use the plot() function on the output from the silhouette() function to visualize the Silhouette values for a given clustering solution.
plot(silhouette(x = kmeans8$cluster,
dist = cs_dist),
main = "")
Finally, we can use the fviz_nbclust() function and specify method = "silhouette" to plot the average silhouette values across a range of k values and choose the value of k that maximizes the average silhouette value. For kMC, we set FUNcluster = kmeans.
fviz_nbclust(x = cs_sc[ ,nums],
FUNcluster = kmeans,
method = "silhouette",
k.max = 15)
Based on the output, k = 13 yields the maximum average silhouette value. Again, we may prefer to choose a smaller k value, in this case k = 8.
We can use the cluster.stats() function from the fpc package to validate cluster solutions and compare:
- max.diameter: small diameters are preferred (compact clusters)
- min.separation: large separation between clusters is preferred
- average.between: a large average distance between clusters is preferred
- average.within: a small average distance within clusters is preferred
- dunn: computed as minimum separation / maximum diameter; higher values are preferred
We can create a vector of the component names so that we can create a dataframe to compare these values side-by-side.
c_stats <- c("max.diameter", "min.separation",
"average.between", "average.within",
"dunn")
HCA
ward_stats <- cluster.stats(d = cs_dist,
clustering = wards_clusters)
kMC
kmc_stats <- cluster.stats(d = cs_dist,
clustering = kmeans8$cluster)
Compare the values side-by-side using cbind(), matching the component names to the vector (c_stats).
cbind(HCA = ward_stats[names(ward_stats) %in% c_stats],
kMC = kmc_stats[names(kmc_stats) %in% c_stats])
## HCA kMC
## average.between 3.783192 3.773382
## average.within 2.623164 2.562219
## max.diameter 4.499647 4.459896
## min.separation 1.428948 1.201353
## dunn 0.3175688 0.269368
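As a sanity check on the definition given above, the Dunn Index should equal the minimum separation divided by the maximum diameter for each solution; a small sketch:
ward_stats$min.separation / ward_stats$max.diameter # should match 0.3175688
kmc_stats$min.separation / kmc_stats$max.diameter   # should match 0.269368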
Based on the values, HCA has a higher Dunn Index, higher average between distance, and a larger minimum separation distance. This suggests that the clusters are further apart from each other than in the kMC solution. However, the HCA cluster solution has a higher maximum diameter and higher average within distance, suggesting that the clusters themselves are more spread out and less compact than in the kMC solution.
We can also compare cluster solutions to known external references, such as Sales Level. In this case we focus on the Adjusted Rand Index (corrected.rand). This value ranges from -1 to 1, where values close to 1 are preferred.
We run the cluster.stats() function again for both HCA and kMC, this time specifying the Sales_Lev variable in the alt.clustering argument. Note: we convert Sales_Lev to numeric using the as.numeric() function for compatibility. We isolate the corrected.rand component as the output.
HCA
cluster.stats(d = cs_dist,
clustering = wards_clusters,
alt.clustering = as.numeric(cs_sub$Sales_Lev))$corrected.rand
## [1] 0.007263596
kMC
cluster.stats(d = cs_dist,
clustering = kmeans8$cluster,
alt.clustering = as.numeric(cs_sub$Sales_Lev))$corrected.rand
## [1] 0.02242552
As shown, the kMC solution does a better job of aligning with the Sales Level, but both corrected.rand values show poor performance (close to 0).
Our next steps would be to improve the clustering solutions by using the k values suggested by the visualizations produced by the fviz_nbclust() function for both methods, based on the average silhouette and WSS approaches.
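For example, assuming the ward object loaded from Clustering.RData is the hclust fit for Ward's method (an assumption, since only the object name is shown above), a 2-cluster comparison, the local maximum noted earlier, might look like the sketch below.
ward2 <- cutree(ward, k = 2) # assumes ward is an hclust object
cluster.stats(d = cs_dist,
              clustering = ward2,
              alt.clustering = as.numeric(cs_sub$Sales_Lev))$corrected.rand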