Hierarchical Cluster Analysis (HCA)

Preliminary

We will use the DescTools and caret packages, which were previously installed. In addition, we use the ggplot2 and factoextra for plotting, as well as the cluster and fpc packages.

install.packages(c("cluster", "factoextra", 
                   "fpc", "ggplot2"))

Next, we load all of the necessary libraries for use in the session.

library(caret)
library(DescTools)
library(ggplot2)
library(cluster)
library(factoextra) 
library(fpc)

In the lesson that follows we will use a subset of the “carseats.csv” file, “carseat_sub.csv”. This dataset contains sales of child car seats at 81 different stores. The car seat manufacturer would like to group stores for market segmentation and marketing purposes.

We use the read.csv() function to import the CSV file into R as a dataframe named cs_sub. We set stringsAsFactors = FALSE to keep any character columns as-is.

cs_sub <- read.csv(file = "carseat_sub.csv",
                   stringsAsFactors = FALSE)

First, we obtain the structure of the data using the str() function.

str(cs_sub)

## 'data.frame':    81 obs. of  11 variables:
##  $ CompPrice  : int  117 122 115 118 147 145 107 98 104 136 ...
##  $ Income     : int  100 35 28 32 74 119 115 118 99 58 ...
##  $ Advertising: int  4 2 11 0 13 16 11 0 15 16 ...
##  $ Population : int  466 393 29 284 251 294 496 19 226 241 ...
##  $ Price      : int  97 136 86 110 131 113 131 107 102 131 ...
##  $ ShelveLoc  : chr  "Medium" "Medium" "Good" "Good" ...
##  $ Age        : int  55 62 53 63 52 42 50 64 58 44 ...
##  $ Education  : int  14 18 18 13 10 12 11 17 17 18 ...
##  $ Urban      : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ US         : chr  "Yes" "No" "Yes" "No" ...
##  $ Sales_Lev  : chr  "Low" "Low" "High" "Low" ...

Next, we create our convenience vectors of variable names by type and convert the factor (categorical–ordinal and nominal) variables.

Nominal (Unordered Factors)

facs <- c("Urban", "US") 
cs_sub[ ,facs] <- lapply(X = cs_sub[ ,facs], 
                     FUN = factor)

Ordinal (Ordered Factors)

ords <- c("ShelveLoc", "Sales_Lev")
cs_sub$ShelveLoc <- factor(x = cs_sub$ShelveLoc, 
                       levels = c("Bad", "Medium", "Good"),
                       ordered = TRUE)

cs_sub$Sales_Lev <- factor(x = cs_sub$Sales_Lev,
                       levels = c("Low", "High"),
                       ordered = TRUE)

Numerical

nums <- names(cs_sub)[!names(cs_sub) %in% c(facs, ords)]
nums

## [1] "CompPrice"   "Income"      "Advertising" "Population"  "Price"      
## [6] "Age"         "Education"

Preprocessing

Before clustering we need to:

handle missing values (remove or impute)

PlotMiss(x = cs_sub)

As shown, there are no missing values. If missing values exist, we remove row-wise using the na.omit() function or impute using preProcess() and predict().

rescale the numeric variables A popular transformation for clustering is standardization. We create a new dataframe, cs_sc, which contains all of the categorical variables and the scaled numerical variables.

cs_cs <- preProcess(x = cs_sub,
                    method = c("center", "scale"))
cs_sc <- predict(object = cs_cs,
                 newdata = cs_sub)

Hierarchical Cluster Analysis (HCA)

We apply HCA to a distance matrix, created using the dist() function. We use only the numerical variables, which are in our nums vector. By default, dist() will use Euclidean distance (method = "euclidean").

cs_dist <- dist(x = cs_sc[ ,nums])

To apply HCA to the distance matrix, we use the hclust() function in the stats package.

Single Linkage

Apply Single Linkage HCA (method = "single")

sing <- hclust(d = cs_dist, 
               method = "single")

We can create the dendrogram plot to visualize the cluster solution. By default, using the plot() function on an object created using the hclust() function will produce a dendrogram. We can use the rect.hclust() function, which will overlay k boxes, identifying the clusters. We will set k = 10.

plot(sing, 
     sub = NA, xlab = NA, # remove labels
     main = "Single Linkage") # add title
rect.hclust(tree = sing, 
            k = 10, # 10 clusters
            border = hcl.colors(10)) # 10 colors

We can create a vector of cluster assignments (sin_clusters) based on the clustering solution and chosen k value (k = 10) using the cutree() function.

sin_clusters <- cutree(tree = sing,
                          k = 10)
sin_clusters

##  [1]  1  1  2  1  1  1  3  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [26]  1  1  1  1  1  1  4  1  1  1  5  3  6  1  7  1  1  1  1  1  8  1  1  1  1
## [51]  1  1  1  1  1  1  1  9  1  1  1 10  1  1  1  1  1  1  1  1  1  1  1  1  1
## [76]  8  1  1  1  1  1

We can perform the same steps as in the Single Link example to perform Complete Link, Average Link and Ward’s Method Clustering.

Complete Linkage

Apply Complete Linkage HCA (method = "complete")

com <- hclust(d = cs_dist, 
              method = "complete")

Visualize the dendrogram. Based on the dendrogram, we choose k = 6 and use the rect.hclust() function to identify the 6 clusters (k = 6).

plot(com, 
     sub = NA, xlab = NA, 
     main = "Complete Linkage")
rect.hclust(tree = com, 
            k = 6, 
            border = hcl.colors(6))

We can create a vector of cluster assignments (com_clusters) based on the clustering solution and chosen k value (k = 6) using the cutree() function.

com_clusters <- cutree(tree = com, 
                       k = 6)

Average Linkage

Apply Average Linkage HCA (method = "average")

avg <- hclust(d = cs_dist, 
              method = "average")

Visualize the dendrogram. Based on the dendrogram, we choose k = 8 and use the rect.hclust() function to identify the 8 clusters (k = 8).

plot(avg, 
     sub = NA, xlab = NA, 
     main = "Average Linkage")
rect.hclust(tree = avg, 
            k = 8, 
            border = hcl.colors(8))

We can create a vector of cluster assignments (avg_clusters) based on the clustering solution and chosen k value (k = 8) using the cutree() function.

avg_clusters <- cutree(tree = avg, 
                       k = 8)

Ward’s Method

Apply Ward’s Method HCA (method = "ward.D2").

ward <- hclust(d = cs_dist, 
               method = "ward.D2")

Visualize the dendrogram. Based on the dendrogram, we choose k = 8 and use the rect.hclust() function to identify the 8 clusters (k = 8).

plot(ward, 
     xlab = NA, sub = NA, 
     main = "Ward's Method")
rect.hclust(tree = ward, 
            k = 8, 
            border = hcl.colors(8))

We can create a vector of cluster assignments (wards_clusters) based on the clustering solution and chosen k value (k = 8) using the cutree() function.

wards_clusters <- cutree(tree = ward,
                         k = 8)

Visualize the Cluster Solutions

We can use the fviz_cluster() function in the factoextra package to visualize 2 of our cluster solutions, Complete Linkage and Ward’s Method.

fviz_cluster(object = list(data = cs_sc[,nums], 
                           cluster = com_clusters),
             main = "Complete Linkage Clusters")

fviz_cluster(object = list(data = cs_sc[,nums], 
                           cluster = wards_clusters),
             main = "Ward's Method Clusters")

We prefer clustering solutions that have greater distance between clusters and less distance within clusters. Based on the visualization, we will use Ward’s Method with 8 clusters.

Describing the Cluster Solution

First, we can compare the mean values across the different clusters for our (scaled) input (numeric) variables. We use the aggregate() function to aggregate the average values (FUN = mean) of the numeric variables (nums) by cluster assignment (wards_clusters).

clus_means_HCA <- aggregate(cs_sc[ ,nums],
          by = list(wards_clusters),
          FUN = mean)
clus_means_HCA

##   Group.1  CompPrice     Income Advertising  Population        Price        Age
## 1       1 -0.4812221  0.4366868  -0.8646498  0.25068353 -0.337538059  0.5373090
## 2       2 -0.1368187 -0.6811710   0.2032959 -0.03312042 -0.009914372 -0.7828729
## 3       3  0.1310506  0.2024599  -0.2676093 -0.82398211 -0.118019157 -0.2056942
## 4       4  0.3553133  1.5161468   0.7477800  0.66090105  0.252625822 -0.5520014
## 5       5 -0.7660001  0.5728793   1.3400904  0.40318681 -0.813943715  0.9919518
## 6       6  1.0094128 -0.4459841   0.8949378  0.88896675  0.592383719  0.6600740
## 7       7 -1.0541153 -0.9894339  -0.7936987  0.30932900 -0.881509206 -1.2205668
## 8       8  1.3624188 -0.5232598  -0.2725146 -0.76550372  2.260286125  0.4014124
##    Education
## 1  0.1904665
## 2  1.3415551
## 3 -0.9526284
## 4 -1.1577993
## 5  1.1830139
## 6  0.2224412
## 7 -0.5889164
## 8  0.2970488

We can use the matplot() function to visualize the (scaled) cluster centers (averages) to observe differences across clusters. We use the t() function to transpose the output from the aggregate() function and exclude the cluster information in the input. We use the axis() function to customize the x-axis labels with the nums variable names and use the legend() function to identify the lines corresponding to the clusters.

matplot(t(clus_means_HCA[,-1]), # Ignore cluster # column
        type = "l", # line plot ("l")
        ylab = "", # no y-axis label
        xlim = c(0, 7), # add space for legend
        xaxt = "n", # no x-axis
        col = 1:8, # 8 colors (for k = 8)
        lty = 1:8, # 8 line types (for k = 8)
        main = "Cluster Centers") # main title
# Add custom x-axis labels
axis(side = 1, # x-axis
     at = 1:7, # x values 1-7
     labels = nums, # variable names as labels
     las = 2) # flip text to vertical
legend("left", # left position
       legend = 1:8, # 8 lines
       col = 1:8, # 8 colors
       lty = 1:8, # 8 line types
       cex = 0.6) # reduce text size

Based on the plot, we can describe the clusters as:

Cluster 1: Low Advertising
Cluster 2: Low Income, Low Age, High Education
Cluster 3: Low Population, Low Education
Cluster 4: High Income, Low Education
Cluster 5: Low Competitor Price, High Advertising, Low Price, High Age, High Education
Cluster 6: High Competitor Price, High Population
Cluster 7: Low Competitor Price, Low Income, Low Advertising, Low Price, Low Age
Cluster 8: High Competitor Price, High Price, Low Population

Based on these findings, the car seat manufacturer company can choose particular clusters to target. For instance, the company may want to target the locations in Cluster 6 for coupons that reduce the price, since the Competitor Price is high and the population is high. This may allow them to gain market share. The company may want to reduce the amount of advertising being spent on the locations in Cluster 5, since the average competitor and own prices are low and average age of the local population is high. This may suggest that advertising money can be better spent in the locations in another cluster.

We can also evaluate and look for patterns in the distribution of clusters against our categorical variables (ords and facs), which were not included in the analysis by setting FUN = table in the aggregate() function.

aggregate(x = cs_sub[ ,c(ords,facs)],
          by = list(wards_clusters),
          FUN = table)

##   Group.1 ShelveLoc.Bad ShelveLoc.Medium ShelveLoc.Good Sales_Lev.Low
## 1       1             1                9              4            11
## 2       2             2                6              2             6
## 3       3             4               11              5            11
## 4       4             2                2              1             1
## 5       5             3                3              2             4
## 6       6             1                6              3             4
## 7       7             2                2              4             5
## 8       8             1                2              3             6
##   Sales_Lev.High Urban.No Urban.Yes US.No US.Yes
## 1              3        5         9    10      4
## 2              4        3         7     3      7
## 3              9        4        16     5     15
## 4              4        1         4     0      5
## 5              4        3         5     0      8
## 6              6        2         8     3      7
## 7              3        3         5     7      1
## 8              0        1         5     2      4

Hierarchical Cluster Analysis (HCA)

Dr. Chelsey Hill

Preliminary

Preprocessing

Hierarchical Cluster Analysis (HCA)

Single Linkage

Complete Linkage

Average Linkage

Ward’s Method

Visualize the Cluster Solutions

Describing the Cluster Solution