The output that follows was generated using the stats, caret, cluster, and factoextra packages in R.
In the lesson that follows we use a data set containing car seat sales at 81 different stores. The car seat manufacturer would like to group stores for market segmentation and marketing purposes. The variables in the dataset include:
* CompPrice: Price charged by competitor at each location
* Income: Community income level (in 1,000s of dollars)
* Advertising: Local advertising budget for the company at each location (in 1,000s of dollars)
* Population: Population size in region (in 1,000s)
* Price: Price charged by company for car seats at each location
* ShelveLoc: Quality of shelving location for the car seats at each location (Bad, Medium, Good)
* Age: Average age of the local population
* Education: Education level at each location
* Urban: Indicates if the location is urban (Yes) or rural (No)
* US: Indicates if the location is in the US (Yes) or outside of the US (No)
* Sales_Lev: Indicates the sales level of the location. High sales are those that are greater than 8 units (in 1,000s) and Low sales are those that are less than or equal to 8 units (in 1,000s)

Summary statistics for the numerical variables and frequency counts for the categorical variables are provided below.
| CompPrice | Income | Advertising | Population | Price | Age | Education |
|---|---|---|---|---|---|---|
| Min. : 86.0 | Min. : 21.00 | Min. : 0.000 | Min. : 14.0 | Min. : 63.0 | Min. :25.00 | Min. :10.0 |
| 1st Qu.:113.0 | 1st Qu.: 42.00 | 1st Qu.: 0.000 | 1st Qu.:161.0 | 1st Qu.: 99.0 | 1st Qu.:39.00 | 1st Qu.:11.0 |
| Median :122.0 | Median : 67.00 | Median : 5.000 | Median :284.0 | Median :112.0 | Median :56.00 | Median :14.0 |
| Mean :124.3 | Mean : 69.07 | Mean : 6.519 | Mean :273.4 | Mean :115.5 | Mean :53.41 | Mean :13.7 |
| 3rd Qu.:136.0 | 3rd Qu.: 93.00 | 3rd Qu.:13.000 | 3rd Qu.:393.0 | 3rd Qu.:131.0 | 3rd Qu.:64.00 | 3rd Qu.:16.0 |
| Max. :162.0 | Max. :120.00 | Max. :25.000 | Max. :508.0 | Max. :191.0 | Max. :79.00 | Max. :18.0 |
| Urban | US | ShelveLoc | Sales_Lev |
|---|---|---|---|
| No :22 | No :30 | Bad :16 | Low :48 |
| Yes:59 | Yes:51 | Medium:41 | High:33 |
| | | Good :24 | |
Before clustering, we need to address three preprocessing concerns:

* Missingness: If missing values exist, we can remove the affected rows or columns, or impute the values using a measure of central tendency, such as the mean, median, or mode.
* Rescaling: We need to rescale numeric variables if there are differences in scale across variables. A popular rescaling approach for cluster analysis is standardization, in which we convert all of the values for each numerical variable to Z-scores.
* Outliers: Cluster analysis is sensitive to outliers because it is distance-based. We can use the Z-scores to identify potential outliers, which are observations where the absolute value of the Z-score is greater than 3.

A preview of the first 6 observations of the CompPrice variable before and after standardization is below.
| CompPrice_Before | CompPrice_After |
|---|---|
| 117 | -0.45 |
| 122 | -0.14 |
| 115 | -0.58 |
| 118 | -0.39 |
| 147 | 1.41 |
| 145 | 1.29 |
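The standardization and outlier check described above can be sketched in base R. This is a minimal illustration, not the lesson's actual code: the `stores` data frame is hypothetical (its first six CompPrice values are taken from the preview table; every other value is made up), so the resulting Z-scores will not match the table, which was computed from all 81 stores.

```r
# Hypothetical mini data set: the first six CompPrice values come from the
# preview table; the remaining values and the Income column are made up.
stores <- data.frame(
  CompPrice = c(117, 122, 115, 118, 147, 145, 100, 131, 126, 109),
  Income    = c(73, 48, 35, 100, 64, 113, 21, 81, 110, 42)
)

# Z-score by hand for one column: (value - mean) / standard deviation
z_manual <- (stores$CompPrice - mean(stores$CompPrice)) / sd(stores$CompPrice)

# scale() standardizes every column the same way and returns a matrix
stores_z <- scale(stores)

# Flag potential outliers: any row with |Z| > 3 on at least one variable
outlier_rows <- which(apply(abs(stores_z) > 3, 1, any))

# Missing values, if present, could be counted per column with
# colSums(is.na(stores)) before standardizing.
```

After standardization each column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation simply because of its units.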
We apply hierarchical cluster analysis (HCA) to a distance matrix rather than directly to the standardized data set. The distance matrix is computed from the numerical variables only, using Euclidean distance.
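As a rough sketch of this step, the base R function `dist()` computes pairwise Euclidean distances. The `stores` data frame below is a hypothetical stand-in with simulated values; only the idea of dropping non-numeric columns before computing distances comes from the lesson.

```r
set.seed(1)
# Hypothetical stand-in for the store data: 8 stores, 2 numeric variables
# and 1 categorical variable (all values simulated)
stores <- data.frame(
  Price  = rnorm(8, mean = 115, sd = 20),
  Income = rnorm(8, mean = 69, sd = 25),
  Urban  = factor(rep(c("Yes", "No"), 4))
)

# Keep only the numeric columns, then standardize them
num_vars <- stores[, sapply(stores, is.numeric)]
stores_z <- scale(num_vars)

# Euclidean distance between every pair of stores
d <- dist(stores_z, method = "euclidean")

# Check one entry by hand: distance between store 1 and store 2
d_mat   <- as.matrix(d)
by_hand <- sqrt(sum((stores_z[1, ] - stores_z[2, ])^2))
```

The `dist` object stores one value per pair of observations, so 8 stores produce 28 distances.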
First, we apply Single Linkage HCA, where the distance between 2 clusters is defined as the distance between the two closest points, one from each cluster. We can create the dendrogram plot to visualize the cluster solution. The vertical lines at the bottom of the plot represent observations in the dataset. The y-axis represents the distance at which points or clusters are merged.
To form clusters, we can either choose the number of clusters (which will determine the height that we ‘cut’ the dendrogram) or a height at which to ‘cut’ our dendrogram (which will determine the number of clusters). As shown in the dendrogram, there is no natural cluster solution, as the single-linkage solution formed long, stringy clusters.
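The two ways of cutting a dendrogram can be sketched with `hclust()` and `cutree()` from base R. The data here are simulated, and the cut height is chosen for illustration as the midpoint between the two merges that bracket a 4-cluster solution; neither is taken from the lesson.

```r
set.seed(42)
# Hypothetical standardized data: 20 observations, 3 numeric variables
x <- scale(matrix(rnorm(60), ncol = 3))
d <- dist(x)

# Single linkage HCA
hc_single <- hclust(d, method = "single")

# Option 1: choose the number of clusters
cl_k <- cutree(hc_single, k = 4)

# Option 2: choose a cut height. Merge heights are stored in increasing
# order, and merge i leaves n - i clusters, so any height between merge
# n - 4 and merge n - 3 yields 4 clusters.
n     <- nrow(x)
h_cut <- mean(hc_single$height[c(n - 4, n - 3)])
cl_h  <- cutree(hc_single, h = h_cut)

# plot(hc_single) draws the dendrogram;
# rect.hclust(hc_single, k = 4) outlines the 4 clusters on it.
```

Both calls return one cluster label per observation, and here they describe the same partition because the chosen height corresponds to k = 4.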
If we choose k = 4, for a 4-cluster solution, the resulting dendrogram with cluster assignments designated by color is shown below. As shown, we have 3 singleton clusters, each containing one observation, and all other observations belong to the remaining cluster.
Although we do not know for sure, the single linkage solution does not seem to capture meaningful groupings that exist in the data. For this reason, we will try HCA with complete linkage next.
We can follow the same steps as in the single linkage example to perform complete linkage. Complete linkage defines the distance between 2 clusters as the distance between the two farthest points, one from each cluster. A dendrogram of the HCA solution using complete linkage is shown below.
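To make the difference between the two linkage definitions concrete, here is a small worked example with two made-up clusters of points on a line. The coordinates are hypothetical and chosen so the distances are easy to verify by eye.

```r
# Two hypothetical clusters of 2-D points (made-up coordinates)
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(3, 0), c(6, 0))

# Euclidean distances between each point in A (rows) and each point in B (columns)
pairwise <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

# Single linkage: distance between the closest pair, (1, 0) and (3, 0)
single_link <- min(pairwise)

# Complete linkage: distance between the farthest pair, (0, 0) and (6, 0)
complete_link <- max(pairwise)
```

Here the single linkage distance is 2 and the complete linkage distance is 6. In `hclust()`, the choice between the two is made with `method = "single"` or `method = "complete"`.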
If we choose k = 6, for a 6 cluster solution, the resulting dendrogram with cluster assignments designated by color is shown below.
Although we do not know for sure yet, this appears to be a better solution than the single linkage HCA solution. We will continue with the complete linkage solution to describe and validate our clustering.
First, we can compare the cluster means from Complete Linkage HCA for our scaled numeric input variables.
| Cluster | CompPrice | Income | Advertising | Population | Price | Age | Education |
|---|---|---|---|---|---|---|---|
| 1 | -0.50 | 0.81 | -0.38 | -0.25 | -0.26 | 0.34 | -0.17 |
| 2 | 1.20 | -0.64 | -0.54 | -0.49 | 1.31 | 0.30 | 0.25 |
| 3 | -0.57 | 0.28 | 1.27 | 0.00 | -0.58 | 0.63 | 1.30 |
| 4 | -0.77 | -0.67 | -0.57 | 0.75 | -0.60 | -0.59 | 0.30 |
| 5 | 0.67 | 0.18 | 1.04 | 0.71 | 0.54 | 0.05 | -0.50 |
| 6 | -0.12 | -0.12 | -0.87 | -1.04 | -0.61 | -0.80 | -1.01 |
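A table of cluster means like this one can be produced with `aggregate()` from base R, which labels the grouping column `Group.1` by default. The sketch below uses simulated data and a 3-cluster cut, both hypothetical.

```r
set.seed(7)
# Hypothetical standardized data: 30 observations, 3 named variables
x <- scale(matrix(rnorm(90), ncol = 3,
                  dimnames = list(NULL, c("Price", "Income", "Age"))))

# Complete linkage HCA, cut into 3 clusters
hc_complete <- hclust(dist(x), method = "complete")
cluster     <- cutree(hc_complete, k = 3)

# Mean of each scaled variable within each cluster
centers <- aggregate(as.data.frame(x), by = list(cluster), FUN = mean)

# Rounded for display, similar in layout to the table in the lesson
centers_rounded <- round(centers, 2)
```

Because the variables are standardized, a positive cluster mean indicates the cluster sits above the overall average on that variable and a negative mean indicates it sits below.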
We can also visualize the (scaled) cluster centers (averages) to observe differences across clusters.
Based on the plot, we can describe the clusters as:
Based on these findings, the car seat manufacturing company can choose particular clusters to target (or not) for marketing and advertising purposes. For instance, the company may want to reduce the amount of advertising in Cluster 3, since stores in that cluster already offer a lower price. This suggests that advertising money may be better spent on locations in another cluster, for example by redirecting marketing efforts to Clusters 1 or 2.
While we found a cluster solution, it is important to validate and evaluate our solution to ensure that we are picking the best possible solution. We can use external, internal, and relative validation approaches.
In performing External Validation, we compare the clusters to a known grouping variable in the data set.
We can compare the distributions of the two clustering solutions (Single and Complete Linkage) to the Sales_Lev variable to evaluate how well our solution is able to differentiate between Low and High sales levels.
Single linkage (k = 4):

| Sales_Lev | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Low | 46 | 0 | 1 | 1 |
| High | 32 | 1 | 0 | 0 |

Complete linkage (k = 6):

| Sales_Lev | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Low | 11 | 11 | 5 | 11 | 6 | 4 |
| High | 5 | 2 | 6 | 3 | 10 | 7 |

Based on the tables above, neither clustering solution is able to distinguish between the two sales levels well. To compare our clustering solutions, we will use internal validation.
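A cross-tabulation of this kind can be built with `table()` in base R. The labels below are made up for ten hypothetical stores and deliberately show a cleaner separation than the lesson's data, so the contrast with the actual results is easy to see.

```r
# Hypothetical known labels and cluster assignments for 10 stores
sales_lev <- factor(c("Low", "Low", "High", "Low", "High",
                      "High", "Low", "Low", "High", "Low"),
                    levels = c("Low", "High"))
cluster <- c(1, 1, 2, 1, 2, 2, 1, 3, 2, 1)

# Counts of each sales level within each cluster
tab <- table(Sales_Lev = sales_lev, Cluster = cluster)

# Share of each sales level within each cluster (columns sum to 1)
prop <- prop.table(tab, margin = 2)
```

When a clustering separates the known groups well, each column of `prop` is dominated by a single row. Columns that are split more evenly indicate clusters that mix the two sales levels.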
For internal validation, we can use the silhouette coefficient, which is a value that tells us how well matched each observation is to its assigned cluster. We want values close to 1, which indicates that the observation is a good fit for the cluster it is assigned to. We can look at the average of all the silhouette coefficient values for the data when comparing the two clustering solutions.
When interpreting silhouette values:

* Values close to 1 suggest that the observation is well matched to the assigned cluster
* Values close to 0 suggest that the observation is borderline matched between two clusters
* Values close to -1 suggest that the observation may be assigned to the wrong cluster

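As a sketch of how these values arise, the `silhouette()` function in the cluster package computes one coefficient per observation from the cluster labels and the distance matrix. The data below are simulated as two well-separated groups (an assumption made so the result is easy to interpret), and the by-hand calculation uses the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the observation's own cluster and b is the mean distance to the members of the nearest other cluster.

```r
library(cluster)  # provides silhouette()

set.seed(3)
# Hypothetical data: two well-separated groups of 10 points each
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
d  <- dist(x)
cl <- cutree(hclust(d, method = "complete"), k = 2)

# One silhouette coefficient per observation, plus the overall average
sil       <- silhouette(cl, d)
avg_width <- mean(sil[, "sil_width"])

# Silhouette coefficient for observation 1, computed by hand
dm    <- as.matrix(d)
own   <- setdiff(which(cl == cl[1]), 1)  # other members of its cluster
other <- which(cl != cl[1])              # members of the only other cluster
a1 <- mean(dm[1, own])
b1 <- mean(dm[1, other])
s1 <- (b1 - a1) / max(a1, b1)
```

Calling `plot(sil)` draws a silhouette plot with one bar per observation, grouped by cluster.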
The plots below visualize the silhouette coefficients for each observation in the data set for our two clustering solutions. We can compare the average values to pick between the two solutions, preferring the higher average silhouette value. As shown below, the Complete Linkage clustering solution with six clusters is preferred.
Many times we do not know the true clustering nature of the data. After we identify a preferred method and approach (in this case complete linkage HCA), we can use relative validation methods to choose the best value of k, the number of clusters.
When choosing the best number of clusters, we can plot the average silhouette coefficient across many k values and choose the k that produces the highest average silhouette value. As depicted by the red dashed vertical line, we have the highest average silhouette value when k = 2. In fact, k values of 2-4 and 7-10 all have higher silhouette values than our chosen k (6).
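The search over k can be sketched as a loop that cuts the same complete linkage tree at each candidate k and records the average silhouette width. The data are simulated with three groups, which is an assumption for illustration only; the lesson's plot was presumably produced with its own tooling.

```r
library(cluster)  # provides silhouette()

set.seed(10)
# Hypothetical data: three groups of 15 points each
x <- rbind(matrix(rnorm(30, mean = 0, sd = 0.4), ncol = 2),
           matrix(rnorm(30, mean = 4, sd = 0.4), ncol = 2),
           cbind(rnorm(15, mean = 0, sd = 0.4), rnorm(15, mean = 4, sd = 0.4)))
d  <- dist(x)
hc <- hclust(d, method = "complete")

# Average silhouette width for each candidate number of clusters
k_values <- 2:10
avg_sil <- sapply(k_values, function(k) {
  mean(silhouette(cutree(hc, k = k), d)[, "sil_width"])
})

# The k with the highest average silhouette width
best_k <- k_values[which.max(avg_sil)]

# plot(k_values, avg_sil, type = "b") shows the curve;
# abline(v = best_k, lty = 2) marks the chosen k.
```

Because the tree is built once and only the cut changes, every candidate k is evaluated on the same hierarchy, which keeps the comparison across k values consistent.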
Below, we visualize the dendrogram corresponding to our best complete linkage HCA solution with 2 clusters, which has an average silhouette width of 0.14. Since this is our best solution (among Complete Linkage HCA solutions considering 2-10 clusters), we should use this solution when describing our clusters and making business decisions.