We will use the DescTools and caret packages, which were previously installed. In addition, we use the ggplot2 and factoextra for plotting, as well as the cluster and fpc packages.
install.packages(c("cluster", "factoextra",
"fpc", "ggplot2"))
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(ggplot2)
library(cluster)
library(factoextra)
library(fpc)
In the lesson that follows we will use a subset of the “carseats.csv” file, “carseat_sub.csv”. This dataset contains sales of child car seats at 81 different stores. The car seat manufacturer would like to group stores for market segmentation and marketing purposes.
We use the read.csv() function to import the CSV file into R as a dataframe named cs_sub. We set stringsAsFactors = FALSE to keep any character columns as-is.
cs_sub <- read.csv(file = "carseat_sub.csv",
stringsAsFactors = FALSE)
First, we obtain the structure of the data using the str() function.
str(cs_sub)
## 'data.frame': 81 obs. of 11 variables:
## $ CompPrice : int 117 122 115 118 147 145 107 98 104 136 ...
## $ Income : int 100 35 28 32 74 119 115 118 99 58 ...
## $ Advertising: int 4 2 11 0 13 16 11 0 15 16 ...
## $ Population : int 466 393 29 284 251 294 496 19 226 241 ...
## $ Price : int 97 136 86 110 131 113 131 107 102 131 ...
## $ ShelveLoc : chr "Medium" "Medium" "Good" "Good" ...
## $ Age : int 55 62 53 63 52 42 50 64 58 44 ...
## $ Education : int 14 18 18 13 10 12 11 17 17 18 ...
## $ Urban : chr "Yes" "Yes" "Yes" "Yes" ...
## $ US : chr "Yes" "No" "Yes" "No" ...
## $ Sales_Lev : chr "Low" "Low" "High" "Low" ...
Next, we create our convenience vectors of variable names by type and convert the factor (categorical–ordinal and nominal) variables.
Nominal (Unordered Factors)
facs <- c("Urban", "US")
cs_sub[ ,facs] <- lapply(X = cs_sub[ ,facs],
FUN = factor)
Ordinal (Ordered Factors)
ords <- c("ShelveLoc", "Sales_Lev")
cs_sub$ShelveLoc <- factor(x = cs_sub$ShelveLoc,
levels = c("Bad", "Medium", "Good"),
ordered = TRUE)
cs_sub$Sales_Lev <- factor(x = cs_sub$Sales_Lev,
levels = c("Low", "High"),
ordered = TRUE)
Numerical
nums <- names(cs_sub)[!names(cs_sub) %in% c(facs, ords)]
nums
## [1] "CompPrice" "Income" "Advertising" "Population" "Price"
## [6] "Age" "Education"
Before clustering we need to:
PlotMiss(x = cs_sub)
As shown, there are no missing values. If missing values exist, we remove row-wise using the na.omit() function or impute using preProcess() and predict().
cs_sc, which contains all of the categorical variables and the scaled numerical variables.cs_cs <- preProcess(x = cs_sub,
method = c("center", "scale"))
cs_sc <- predict(object = cs_cs,
newdata = cs_sub)
We apply HCA to a distance matrix, created using the dist() function. We use only the numerical variables, which are in our nums vector. By default, dist() will use Euclidean distance (method = "euclidean").
cs_dist <- dist(x = cs_sc[ ,nums])
To apply HCA to the distance matrix, we use the hclust() function in the stats package.
Apply Single Linkage HCA (method = "single")
sing <- hclust(d = cs_dist,
method = "single")
We can create the dendrogram plot to visualize the cluster solution. By default, using the plot() function on an object created using the hclust() function will produce a dendrogram. We can use the rect.hclust() function, which will overlay k boxes, identifying the clusters. We will set k = 10.
plot(sing,
sub = NA, xlab = NA, # remove labels
main = "Single Linkage") # add title
rect.hclust(tree = sing,
k = 10, # 10 clusters
border = hcl.colors(10)) # 10 colors
We can create a vector of cluster assignments (sin_clusters) based on the clustering solution and chosen k value (k = 10) using the cutree() function.
sin_clusters <- cutree(tree = sing,
k = 10)
sin_clusters
## [1] 1 1 2 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [26] 1 1 1 1 1 1 4 1 1 1 5 3 6 1 7 1 1 1 1 1 8 1 1 1 1
## [51] 1 1 1 1 1 1 1 9 1 1 1 10 1 1 1 1 1 1 1 1 1 1 1 1 1
## [76] 8 1 1 1 1 1
We can perform the same steps as in the Single Link example to perform Complete Link, Average Link and Ward’s Method Clustering.
Apply Complete Linkage HCA (method = "complete")
com <- hclust(d = cs_dist,
method = "complete")
Visualize the dendrogram. Based on the dendrogram, we choose k = 6 and use the rect.hclust() function to identify the 6 clusters (k = 6).
plot(com,
sub = NA, xlab = NA,
main = "Complete Linkage")
rect.hclust(tree = com,
k = 6,
border = hcl.colors(6))
We can create a vector of cluster assignments (com_clusters) based on the clustering solution and chosen k value (k = 6) using the cutree() function.
com_clusters <- cutree(tree = com,
k = 6)
Apply Average Linkage HCA (method = "average")
avg <- hclust(d = cs_dist,
method = "average")
Visualize the dendrogram. Based on the dendrogram, we choose k = 8 and use the rect.hclust() function to identify the 8 clusters (k = 8).
plot(avg,
sub = NA, xlab = NA,
main = "Average Linkage")
rect.hclust(tree = avg,
k = 8,
border = hcl.colors(8))
We can create a vector of cluster assignments (avg_clusters) based on the clustering solution and chosen k value (k = 8) using the cutree() function.
avg_clusters <- cutree(tree = avg,
k = 8)
Apply Ward’s Method HCA (method = "ward.D2").
ward <- hclust(d = cs_dist,
method = "ward.D2")
Visualize the dendrogram. Based on the dendrogram, we choose k = 8 and use the rect.hclust() function to identify the 8 clusters (k = 8).
plot(ward,
xlab = NA, sub = NA,
main = "Ward's Method")
rect.hclust(tree = ward,
k = 8,
border = hcl.colors(8))
We can create a vector of cluster assignments (wards_clusters) based on the clustering solution and chosen k value (k = 8) using the cutree() function.
wards_clusters <- cutree(tree = ward,
k = 8)
We can use the fviz_cluster() function in the factoextra package to visualize 2 of our cluster solutions, Complete Linkage and Ward’s Method.
fviz_cluster(object = list(data = cs_sc[,nums],
cluster = com_clusters),
main = "Complete Linkage Clusters")
fviz_cluster(object = list(data = cs_sc[,nums],
cluster = wards_clusters),
main = "Ward's Method Clusters")
We prefer clustering solutions that have greater distance between clusters and less distance within clusters. Based on the visualization, we will use Ward’s Method with 8 clusters.
First, we can compare the mean values across the different clusters for our (scaled) input (numeric) variables. We use the aggregate() function to aggregate the average values (FUN = mean) of the numeric variables (nums) by cluster assignment (wards_clusters).
clus_means_HCA <- aggregate(cs_sc[ ,nums],
by = list(wards_clusters),
FUN = mean)
clus_means_HCA
## Group.1 CompPrice Income Advertising Population Price Age
## 1 1 -0.4812221 0.4366868 -0.8646498 0.25068353 -0.337538059 0.5373090
## 2 2 -0.1368187 -0.6811710 0.2032959 -0.03312042 -0.009914372 -0.7828729
## 3 3 0.1310506 0.2024599 -0.2676093 -0.82398211 -0.118019157 -0.2056942
## 4 4 0.3553133 1.5161468 0.7477800 0.66090105 0.252625822 -0.5520014
## 5 5 -0.7660001 0.5728793 1.3400904 0.40318681 -0.813943715 0.9919518
## 6 6 1.0094128 -0.4459841 0.8949378 0.88896675 0.592383719 0.6600740
## 7 7 -1.0541153 -0.9894339 -0.7936987 0.30932900 -0.881509206 -1.2205668
## 8 8 1.3624188 -0.5232598 -0.2725146 -0.76550372 2.260286125 0.4014124
## Education
## 1 0.1904665
## 2 1.3415551
## 3 -0.9526284
## 4 -1.1577993
## 5 1.1830139
## 6 0.2224412
## 7 -0.5889164
## 8 0.2970488
We can use the matplot() function to visualize the (scaled) cluster centers (averages) to observe differences across clusters. We use the t() function to transpose the output from the aggregate() function and exclude the cluster information in the input. We use the axis() function to customize the x-axis labels with the nums variable names and use the legend() function to identify the lines corresponding to the clusters.
matplot(t(clus_means_HCA[,-1]), # Ignore cluster # column
type = "l", # line plot ("l")
ylab = "", # no y-axis label
xlim = c(0, 7), # add space for legend
xaxt = "n", # no x-axis
col = 1:8, # 8 colors (for k = 8)
lty = 1:8, # 8 line types (for k = 8)
main = "Cluster Centers") # main title
# Add custom x-axis labels
axis(side = 1, # x-axis
at = 1:7, # x values 1-7
labels = nums, # variable names as labels
las = 2) # flip text to vertical
legend("left", # left position
legend = 1:8, # 8 lines
col = 1:8, # 8 colors
lty = 1:8, # 8 line types
cex = 0.6) # reduce text size
Based on the plot, we can describe the clusters as:
Based on these findings, the car seat manufacturer company can choose particular clusters to target. For instance, the company may want to target the locations in Cluster 6 for coupons that reduce the price, since the Competitor Price is high and the population is high. This may allow them to gain market share. The company may want to reduce the amount of advertising being spent on the locations in Cluster 5, since the average competitor and own prices are low and average age of the local population is high. This may suggest that advertising money can be better spent in the locations in another cluster.
We can also evaluate and look for patterns in the distribution of clusters against our categorical variables (ords and facs), which were not included in the analysis by setting FUN = table in the aggregate() function.
aggregate(x = cs_sub[ ,c(ords,facs)],
by = list(wards_clusters),
FUN = table)
## Group.1 ShelveLoc.Bad ShelveLoc.Medium ShelveLoc.Good Sales_Lev.Low
## 1 1 1 9 4 11
## 2 2 2 6 2 6
## 3 3 4 11 5 11
## 4 4 2 2 1 1
## 5 5 3 3 2 4
## 6 6 1 6 3 4
## 7 7 2 2 4 5
## 8 8 1 2 3 6
## Sales_Lev.High Urban.No Urban.Yes US.No US.Yes
## 1 3 5 9 10 4
## 2 4 3 7 3 7
## 3 9 4 16 5 15
## 4 4 1 4 0 5
## 5 4 3 5 0 8
## 6 6 2 8 3 7
## 7 3 3 5 7 1
## 8 0 1 5 2 4