Hierarchical Cluster Analysis (HCA)

Preliminary

The output that follows was generated using the stats, caret, cluster, and factoextra packages in R.

In this lesson, we use a data set containing car seat sales at 81 different stores. The car seat manufacturer would like to group stores for market segmentation and marketing purposes. The variables in the data set include:

CompPrice: Price charged by competitor at each location
Income: Community income level (in 1,000s of dollars)
Advertising: Local advertising budget for the company at each location (in 1,000s of dollars)
Population: Population size in region (in 1,000s)
Price: Price charged by company for car seats at each location
ShelveLoc: Quality of shelving location for the car seats at each location (Bad, Medium, Good)
Age: Average age of the local population
Education: Education level at each location
Urban: Indicates if the location is urban (Yes) or rural (No)
US: Indicates if the location is in the US (Yes) or outside of the US (No)
Sales_Lev: Indicates the sales level of the location. High sales are those that are greater than 8 units (in 1,000s) and Low sales are those that are less than or equal to 8 units (in 1,000s)

Summary statistics for the numerical variables and frequency counts for the categorical variables are provided below.

Summary Statistics: Numerical Variables
Statistic	CompPrice	Income	Advertising	Population	Price	Age	Education
Min.	86.0	21.00	0.000	14.0	63.0	25.00	10.0
1st Qu.	113.0	42.00	0.000	161.0	99.0	39.00	11.0
Median	122.0	67.00	5.000	284.0	112.0	56.00	14.0
Mean	124.3	69.07	6.519	273.4	115.5	53.41	13.7
3rd Qu.	136.0	93.00	13.000	393.0	131.0	64.00	16.0
Max.	162.0	120.00	25.000	508.0	191.0	79.00	18.0

Summary Statistics: Categorical Variables
Urban	US	ShelveLoc	Sales_Lev
No :22	No :30	Bad :16	Low :48
Yes:59	Yes:51	Medium:41	High:33
		Good :24

Preprocessing

Before clustering we need to:

identify and address missing values
rescale numeric variables
evaluate outliers

Missingness: If missing values exist, we can remove the affected rows or columns, or impute the missing values using a measure of central tendency, such as the mean, median, or mode.
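As a minimal sketch of these two options, assuming a small made-up data frame rather than the store data (the names stores, complete_rows, and imputed are illustrative), we can drop incomplete rows with complete.cases() or fill each gap with the column median (numeric) or mode (categorical):

```r
# Hypothetical toy data with a few missing values (not the lesson's data)
stores <- data.frame(
  Income      = c(73, NA, 35, 100, 64),
  Advertising = c(11, 16, NA, 4, 3),
  ShelveLoc   = factor(c("Bad", "Good", NA, "Medium", "Medium"))
)

# Count missing values per column
colSums(is.na(stores))

# Option 1: drop every row that contains a missing value
complete_rows <- stores[complete.cases(stores), ]

# Option 2: impute numeric columns with the median and factors with the mode
imputed <- stores
for (col in names(imputed)) {
  missing <- is.na(imputed[[col]])
  if (is.numeric(imputed[[col]])) {
    imputed[[col]][missing] <- median(imputed[[col]], na.rm = TRUE)
  } else {
    # Most frequent level among the observed values
    mode_level <- names(which.max(table(imputed[[col]])))
    imputed[[col]][missing] <- mode_level
  }
}
```

Row removal is simplest but shrinks the data set, which matters with only 81 stores; imputation keeps every store at the cost of slightly understating variability.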

Rescaling: We need to rescale numeric variables if there are differences in scale across variables. A popular rescaling approach for cluster analysis is standardization, in which we convert all of the values for each numerical variable to Z-Scores.

Outliers: Because cluster analysis is distance-based, it is also sensitive to outliers. We can use the Z-scores to identify outliers, flagging any observation whose Z-score has an absolute value greater than 3.

A preview of the first 6 observations of the CompPrice variable before and after standardization is below.

Preview of CompPrice before and after Standardization
CompPrice_Before	CompPrice_After
117	-0.45
122	-0.14
115	-0.58
118	-0.39
147	1.41
145	1.29
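The standardization and outlier check can be sketched as follows. This is an illustration under assumptions: the CompPrice values are the six previewed above, the Income values are made up, and the names stores, stores_z, and outlier_rows are hypothetical. Because the Z-scores here are computed from six rows only, they will not match the preview column, which was standardized using all 81 stores.

```r
# Toy values: CompPrice from the preview above, Income made up for illustration
stores <- data.frame(
  CompPrice = c(117, 122, 115, 118, 147, 145),
  Income    = c(73, 48, 35, 100, 64, 113)
)

# scale() subtracts each column's mean and divides by its standard deviation
stores_z <- scale(stores)

# Each standardized column now has mean 0 and standard deviation 1
round(colMeans(stores_z), 10)
apply(stores_z, 2, sd)

# Flag rows in which any variable has |z| > 3
outlier_rows <- which(rowSums(abs(stores_z) > 3) > 0)
```

With so few rows no value can reach |z| > 3, so outlier_rows is empty here; on the full data set the same line would list the stores worth inspecting.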

Hierarchical Cluster Analysis (HCA)

We apply HCA to a distance matrix rather than to the standardized data set directly. The distance matrix is computed from the numerical variables only, using Euclidean distance.
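A minimal sketch of building such a distance matrix, assuming a made-up data frame (stores, stores_z, d, and manual_12 are illustrative names): we keep the numeric columns, standardize them, and pass the result to dist().

```r
# Hypothetical toy data (not the lesson's data)
stores <- data.frame(
  Price     = c(120, 83, 80, 97),
  Age       = c(42, 65, 59, 55),
  ShelveLoc = factor(c("Bad", "Good", "Medium", "Medium"))
)

# Keep only the numeric columns, then standardize them
numeric_cols <- sapply(stores, is.numeric)
stores_z <- scale(stores[, numeric_cols])

# Pairwise Euclidean distances between the standardized rows
d <- dist(stores_z, method = "euclidean")

# Check one entry by hand: distance between stores 1 and 2
manual_12 <- sqrt(sum((stores_z[1, ] - stores_z[2, ])^2))
```

The dist object stores only the lower triangle, so four stores yield six pairwise distances; as.matrix(d) expands it to the full symmetric matrix when that is easier to read.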

Single Linkage

First, we apply Single Linkage HCA, where the distance between 2 clusters is defined as the distance between their closest pair of points, one from each cluster. We can create the dendrogram plot to visualize the cluster solution. Each leaf at the bottom of the plot represents an observation in the data set. The y-axis represents the distance at which points or clusters are merged.

To form clusters, we can either choose the number of clusters (which determines the height at which we ‘cut’ the dendrogram) or choose a height at which to ‘cut’ the dendrogram (which determines the number of clusters). As shown in the dendrogram, there is no natural cluster solution, because single linkage tends to add observations one at a time and has formed long, stringy clusters.

If we choose k = 4, for a 4-cluster solution, the resulting dendrogram with cluster assignments designated by color is shown below. As shown, we have 3 singleton clusters, each containing one observation, and all other observations belong to the remaining cluster.
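The two ways of cutting a dendrogram can be sketched as below. This assumes toy points on a line rather than the store data, and the names pts, hc_single, by_k, and by_h are illustrative. The two isolated points mimic the singleton clusters described above.

```r
# Hypothetical toy points: five close together and two far away
pts <- cbind(x = c(1, 2, 3, 4, 5, 20, 40), y = 0)

# Single linkage HCA on the Euclidean distance matrix
hc_single <- hclust(dist(pts), method = "single")

# Uncomment to draw the dendrogram
# plot(hc_single, main = "Single linkage")

# Cut by number of clusters ...
by_k <- cutree(hc_single, k = 3)

# ... or by height; cutting at 10 falls between the merges at 1 and 15
by_h <- cutree(hc_single, h = 10)

# Cluster sizes: one large cluster and two singletons
table(by_k)
```

Both cuts give the same assignment here because any height between 1 and 15 separates the same three branches.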

Although we cannot be sure without validating the solution, the single linkage solution does not seem to capture meaningful groupings that exist in the data. For this reason, we will try HCA with complete linkage next.

Complete Linkage

We can follow the same steps as in the single linkage example to perform complete linkage. Complete linkage defines the distance between 2 clusters as the distance between their farthest pair of points, one from each cluster. A dendrogram of the HCA solution using complete linkage is shown below.
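To make the difference between the two linkage definitions concrete, here is a small sketch that computes both by hand for two made-up clusters (cluster_a, cluster_b, and pair_dist are illustrative names, not objects from the lesson):

```r
# Two hypothetical clusters of 2-D points (toy values)
cluster_a <- rbind(c(0, 0), c(1, 0))
cluster_b <- rbind(c(4, 0), c(9, 0))

# Euclidean distance for every pair with one point from each cluster
pair_dist <- outer(
  seq_len(nrow(cluster_a)), seq_len(nrow(cluster_b)),
  Vectorize(function(i, j) sqrt(sum((cluster_a[i, ] - cluster_b[j, ])^2)))
)

single_link   <- min(pair_dist)  # closest pair: (1, 0) and (4, 0)
complete_link <- max(pair_dist)  # farthest pair: (0, 0) and (9, 0)
```

Single linkage reports a distance of 3 and complete linkage a distance of 9 for the same two clusters, which is why complete linkage is less prone to chaining clusters together through a few nearby points.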

If we choose k = 6, for a 6 cluster solution, the resulting dendrogram with cluster assignments designated by color is shown below.

Although we do not know for sure yet, this appears to be a better solution than the single linkage HCA solution. We will continue with the complete linkage solution to describe and validate our clustering.

Describing the Cluster Solution

First, we can compare the cluster means of the standardized numeric input variables for the Complete Linkage HCA solution.

Cluster	CompPrice	Income	Advertising	Population	Price	Age	Education
1	-0.50	0.81	-0.38	-0.25	-0.26	0.34	-0.17
2	1.20	-0.64	-0.54	-0.49	1.31	0.30	0.25
3	-0.57	0.28	1.27	0.00	-0.58	0.63	1.30
4	-0.77	-0.67	-0.57	0.75	-0.60	-0.59	0.30
5	0.67	0.18	1.04	0.71	0.54	0.05	-0.50
6	-0.12	-0.12	-0.87	-1.04	-0.61	-0.80	-1.01
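A table of cluster means like this one can be produced with aggregate(). The sketch below assumes a four-store toy data set and a 2-cluster complete linkage solution; the names stores, stores_z, hc_complete, clusters, and centers are illustrative.

```r
# Hypothetical toy data (not the lesson's data)
stores <- data.frame(
  Income = c(20, 25, 90, 100),
  Price  = c(130, 140, 80, 90)
)

# Standardize, cluster with complete linkage, and cut into 2 clusters
stores_z <- as.data.frame(scale(stores))
hc_complete <- hclust(dist(stores_z), method = "complete")
clusters <- cutree(hc_complete, k = 2)

# Mean of each standardized variable within each cluster
centers <- aggregate(stores_z, by = list(clusters), FUN = mean)

# aggregate() names the grouping column Group.1 by default; rename it
names(centers)[1] <- "Cluster"
round(centers, 2)
```

Because the variables are standardized, a positive mean indicates a cluster that sits above the overall average on that variable and a negative mean one that sits below it.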

We can also visualize the (scaled) cluster centers (averages) to observe differences across clusters.

Based on the plot, we can describe the clusters as:

Cluster 1: High Income
Cluster 2: High Competitor Price, High Price, Lower Income
Cluster 3: High Advertising, Lower Price, High Age, High Education
Cluster 4: Low Competitor Price, Low Price, Low Income, High Population
Cluster 5: Higher Population
Cluster 6: Low Advertising, Low Population, Low Price, Low Age, Low Education

Based on these findings, the car seat manufacturing company can choose particular clusters to target (or not) for marketing and advertising purposes. For instance, the company may want to reduce the amount of advertising in Cluster 3, since it already offers a lower price there. The advertising money may be better spent on locations in another cluster, so it may make sense to redirect marketing efforts to Clusters 1 or 2.

Cluster Validation

While we found a cluster solution, it is important to validate and evaluate our solution to ensure that we are picking the best possible solution. We can use external, internal, and relative validation approaches.

External Validation

In performing External Validation, we compare the clusters to a known grouping variable in the data set.

We can compare the distributions of the two clustering solutions (Single and Complete Linkage) to the Sales_Lev variable to evaluate how well our solution is able to differentiate between Low and High sales levels.

Single Linkage Solution
	1	2	3	4
Low	46	0	1	1
High	32	1	0	0

Complete Linkage Solution
	1	2	3	4	5	6
Low	11	11	5	11	6	4
High	5	2	6	3	10	7

Based on the tables above, neither clustering solution is able to distinguish between the two sales levels well. To compare our clustering solutions, we will use internal validation.
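One way to read these tables is to convert counts to within-cluster shares. The sketch below re-enters the Complete Linkage counts by hand (complete_tab and share are illustrative names); with the raw label vectors available, a call such as table(sales_lev, clusters), using hypothetical vector names, would build the count table directly.

```r
# Counts copied from the Complete Linkage table above
complete_tab <- rbind(
  Low  = c(11, 11, 5, 11, 6, 4),
  High = c(5, 2, 6, 3, 10, 7)
)
colnames(complete_tab) <- 1:6

# Share of Low and High stores within each cluster (each column sums to 1)
share <- prop.table(complete_tab, margin = 2)
round(share, 2)
```

A solution that separated the sales levels well would show shares near 0 or 1 in every column; here most clusters contain a substantial mix of both levels.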

Internal Validation

For internal validation, we can use the silhouette coefficient, which is a value that tells us how well matched each observation is to its assigned cluster. We want values close to 1, which indicates that the observation is a good fit for the cluster it is assigned to. We can look at the average of all the silhouette coefficient values for the data when comparing the two clustering solutions.

When interpreting silhouette values:

Values close to 1 suggest that the observation is well matched to the assigned cluster
Values close to 0 suggest that the observation is borderline matched between two clusters
Values close to -1 suggest that the observation may be assigned to the wrong cluster
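A minimal sketch of computing silhouette coefficients with silhouette() from the cluster package, assuming toy points in two well-separated groups rather than the store data (pts, hc, clusters, sil, and avg_sil are illustrative names):

```r
library(cluster)  # provides silhouette()

# Hypothetical toy points forming two well-separated groups
pts <- cbind(x = c(1, 2, 3, 11, 12, 13), y = 0)
d <- dist(pts)

# Complete linkage HCA, cut into 2 clusters
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, k = 2)

# One silhouette width per observation, then the overall average
sil <- silhouette(clusters, d)
sil_widths <- sil[, "sil_width"]
avg_sil <- mean(sil_widths)
```

For the point at x = 2, the average distance to its own cluster is 1 and to the other cluster is 10, so its silhouette width is (10 - 1) / 10 = 0.9. Every width is close to 1 because the toy groups are far apart; real data rarely separate this cleanly.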

The plots below visualize the silhouette coefficients for each observation in the data set for our two clustering solutions. We can compare the average values to pick between the two solutions, preferring higher average silhouette values. As shown below, the Complete Linkage clustering solution with six clusters is preferred.

Relative Validation

Many times we do not know the true clustering nature of the data. After we identify a preferred method and approach (in this case complete linkage HCA), we can use relative validation methods to choose the best value of k, the number of clusters.

When choosing the best number of clusters, we can plot the average silhouette coefficient across many k values and choose the k that produces the highest average silhouette value. As depicted by the red dashed vertical line, we have the highest average silhouette value when k = 2. In fact, k values of 2-4 and 7-10 all have higher silhouette values than our chosen k (6).
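The search over k can be sketched as a short loop. This is an illustration on toy points in three well-separated groups, not a reproduction of the lesson's plot, and the names pts, hc, k_values, avg_sil, and best_k are hypothetical.

```r
library(cluster)  # provides silhouette()

# Hypothetical toy points in three well-separated groups
pts <- cbind(x = c(1, 2, 3, 21, 22, 23, 51, 52, 53), y = 0)
d <- dist(pts)
hc <- hclust(d, method = "complete")

# Average silhouette width for each candidate number of clusters
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  mean(silhouette(cutree(hc, k = k), d)[, "sil_width"])
})

# The k with the highest average silhouette width
best_k <- k_values[which.max(avg_sil)]

# Uncomment to plot the averages against k
# plot(k_values, avg_sil, type = "b", xlab = "k", ylab = "Average silhouette width")
```

The same tree is reused for every k, since cutting a hierarchical solution at a different level does not require refitting it.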

Below, we visualize the dendrogram corresponding to our best complete linkage HCA solution with 2 clusters, which has an average silhouette width of 0.14. Since this is our best solution (among Complete Linkage HCA solutions considering 2-10 clusters), we should use this solution when describing our clusters and making business decisions.

Dr. Chelsey Hill