We will use the DescTools and caret packages, which were previously installed. In addition, we use the ggplot2 and factoextra packages for plotting, as well as the cluster and fpc packages.
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(ggplot2)
library(cluster)
library(factoextra)
library(fpc)
In the lesson that follows, we use “carseat_sub.csv”, a subset of the “carseats.csv” file. This dataset contains sales of child car seats at 81 different stores. The car seat manufacturer would like to group the stores for market segmentation and marketing purposes. We load the objects in the HCA.RData file and use the ls() function to list its contents.
load("HCA.RData")
ls()
## [1] "cs_dist" "cs_sc" "cs_sub" "facs"
## [5] "nums" "ords" "ward" "wards_clusters"
For k-Means clustering (KMC), we use the preprocessed, scaled data directly as input to the kmeans() function. As with HCA, before clustering we need to center and scale the numeric variables. We use the same standardized data prepared for the HCA lesson.
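For reference, here is a minimal sketch of how such a standardized copy could be produced from the raw data (the actual cs_sc object was created in the HCA lesson and is loaded from HCA.RData, so this code is illustrative only):

# A sketch of how cs_sc could be built: center and scale only the
# numeric columns of cs_sub, leaving the factor variables untouched
cs_sc <- cs_sub
cs_sc[ , nums] <- scale(x = cs_sub[ , nums],
                        center = TRUE,
                        scale = TRUE)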
str(cs_sc)
## 'data.frame': 81 obs. of 11 variables:
## $ CompPrice : num -0.455 -0.143 -0.579 -0.392 1.414 ...
## $ Income : num 1.039 -1.145 -1.38 -1.246 0.166 ...
## $ Advertising: num -0.371 -0.665 0.659 -0.959 0.954 ...
## $ Population : num 1.3514 0.8391 -1.7152 0.0742 -0.1573 ...
## $ Price : num -0.713 0.793 -1.137 -0.211 0.6 ...
## $ ShelveLoc : Ord.factor w/ 3 levels "Bad"<"Medium"<..: 2 2 3 3 3 1 3 2 1 2 ...
## $ Age : num 0.1021 0.5511 -0.0261 0.6152 -0.0903 ...
## $ Education : num 0.111 1.603 1.603 -0.263 -1.382 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 2 ...
## $ Sales_Lev : Ord.factor w/ 2 levels "Low"<"High": 1 1 2 1 2 2 2 1 1 2 ...
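The nums character vector loaded from HCA.RData holds the names of the numeric columns. As a sketch, such a vector could be built directly from the data frame:

# Identify the numeric columns by name; the factor and ordered factor
# variables (ShelveLoc, Urban, US, Sales_Lev) are excluded
nums <- names(cs_sc)[sapply(cs_sc, is.numeric)]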
Because kmeans() selects its initial cluster centers at random, we use the set.seed() function to set the random seed and make the results reproducible.
set.seed(831)
We use the kmeans() function from the stats package to perform k-Means cluster analysis. In the kmeans() function, the centers argument specifies how many clusters we want (and therefore how many cluster centers/centroids are needed). The nstart argument tells R how many random sets of initial centroids to try; kmeans() keeps the best solution (the one with the lowest total within-cluster sum of squares).
We use only the numeric variables as input and set k = 8, based on our Ward’s HCA. Printing the kmeans object (kmeans8) displays summary information.
kmeans8 <- kmeans(x = cs_sc[ , nums],
                  centers = 8,
                  trace = FALSE,
                  nstart = 30)
kmeans8
## K-means clustering with 8 clusters of sizes 13, 7, 8, 10, 8, 17, 9, 9
##
## Cluster means:
## CompPrice Income Advertising Population Price Age
## 1 0.02946154 -0.07226949 -0.7102149 -1.21155885 -0.4898521 -0.4355790
## 2 1.49442528 -0.37206821 -0.6649355 -0.39491784 1.5046557 1.0457760
## 3 0.47990366 -0.81304370 0.3467748 -0.16434592 1.1599338 -0.8838792
## 4 0.39269040 1.04913280 0.6153379 0.07775459 0.2410432 -0.2569989
## 5 0.82252719 -0.46446307 1.3584851 0.96370213 0.6290621 0.7113787
## 6 -0.53880587 0.41649973 -0.7428427 0.21046613 -0.3855558 0.4303342
## 7 -1.03594589 -0.77897892 -0.5177777 0.69489648 -0.7426246 -1.4370088
## 8 -0.74523502 0.35589133 1.2644676 0.16781131 -0.8498714 0.8788318
## Education
## 1 -0.95119368
## 2 0.27040318
## 3 0.94986519
## 4 -1.15779932
## 5 0.11052977
## 6 0.11052977
## 7 0.06908111
## 8 1.22964367
##
## Clustering vector:
## [1] 6 3 8 6 4 4 4 6 8 3 2 8 7 6 5 6 2 6 6 5 1 1 5 6 4 6 6 4 4 6 1 7 7 3 8 2 4 2
## [39] 6 2 7 3 6 4 1 5 1 8 6 3 7 3 1 7 1 6 8 5 8 1 1 5 2 6 4 3 1 7 7 4 1 7 8 8 2 5
## [77] 5 1 3 1 6
##
## Within cluster sum of squares by cluster:
## [1] 35.92961 25.29331 27.97937 28.27671 27.05699 51.21635 31.60098 27.33670
## (between_SS / total_SS = 54.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We visualize the cluster solution using the fviz_cluster() function from the factoextra package, which plots the observations in two dimensions using principal components analysis (PCA).
# Use the same (scaled) data that kmeans() was fit on
fviz_cluster(object = kmeans8,
             data = cs_sc[ , nums])
We can inspect the scaled cluster centers via the centers list component of the kmeans object (kmeans8) we created.
kmeans8$centers
## CompPrice Income Advertising Population Price Age
## 1 0.02946154 -0.07226949 -0.7102149 -1.21155885 -0.4898521 -0.4355790
## 2 1.49442528 -0.37206821 -0.6649355 -0.39491784 1.5046557 1.0457760
## 3 0.47990366 -0.81304370 0.3467748 -0.16434592 1.1599338 -0.8838792
## 4 0.39269040 1.04913280 0.6153379 0.07775459 0.2410432 -0.2569989
## 5 0.82252719 -0.46446307 1.3584851 0.96370213 0.6290621 0.7113787
## 6 -0.53880587 0.41649973 -0.7428427 0.21046613 -0.3855558 0.4303342
## 7 -1.03594589 -0.77897892 -0.5177777 0.69489648 -0.7426246 -1.4370088
## 8 -0.74523502 0.35589133 1.2644676 0.16781131 -0.8498714 0.8788318
## Education
## 1 -0.95119368
## 2 0.27040318
## 3 0.94986519
## 4 -1.15779932
## 5 0.11052977
## 6 0.11052977
## 7 0.06908111
## 8 1.22964367
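Because the centers are in standardized units, it can also help to profile the clusters on the original measurement scale. One way to do this (a sketch, assuming cs_sub holds the unscaled data) is to average the original numeric variables within each cluster:

# Mean of each (unscaled) numeric variable by k-means cluster
aggregate(x = cs_sub[ , nums],
          by = list(cluster = kmeans8$cluster),
          FUN = mean)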
Again, we can use the matplot() function to visualize the (scaled) cluster centers and observe differences between them, adding custom x-axis labels and a legend.
matplot(t(kmeans8$centers),
        type = "l",
        ylab = "",
        xlim = c(0, 7),
        xaxt = "n",
        col = 1:8,
        lty = 1:8,
        main = "Cluster Centers")

# Customize the x-axis labels
axis(side = 1,
     at = 1:7,
     labels = nums,
     las = 2)

# Add a legend
legend("left",
       legend = 1:8,
       col = 1:8,
       lty = 1:8,
       cex = 0.6)
Based on the plot, we can describe each cluster by the variables on which its center lies well above or below the overall (scaled) mean of zero. For example, Cluster 2 stores face high competitor and own prices and serve older customers, while Cluster 7 stores have low prices and incomes, younger customers, and a relatively large population.
Based on these findings and the plot, the car seat manufacturer can choose particular clusters to target. For instance, it may want to increase advertising or marketing efforts for the locations in Cluster 7: the population there is the second highest, both the competitor price and the store’s own price are low, and the average customer age is low, all of which suggest a high demand for car seats.
We can use the table() function to compare the k = 8 clustering solutions from HCA and KMC and evaluate whether the clusters are consistent.
table(KMeans = kmeans8$cluster,
      HCA = wards_clusters)
## HCA
## KMeans 1 2 3 4 5 6 7 8
## 1 0 0 11 0 0 0 2 0
## 2 0 0 1 0 0 2 0 4
## 3 0 5 1 0 0 0 0 2
## 4 0 0 5 5 0 0 0 0
## 5 0 0 0 0 0 8 0 0
## 6 14 1 2 0 0 0 0 0
## 7 0 3 0 0 0 0 6 0
## 8 0 1 0 0 8 0 0 0
As shown, the two solutions share some structure (for example, most of HCA cluster 1 falls in k-means cluster 6, and HCA cluster 5 maps entirely to k-means cluster 8), but they are not identical. The next step is to validate the clustering solutions.
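As a preview, one way to quantify the agreement between the two solutions is the corrected (adjusted) Rand index from the cluster.stats() function in the fpc package. This is a sketch, assuming cs_dist is the distance object loaded from HCA.RData:

# Corrected Rand index comparing the KMC and HCA assignments;
# values near 1 indicate strong agreement, values near 0 indicate
# agreement no better than chance
cluster.stats(d = cs_dist,
              clustering = kmeans8$cluster,
              alt.clustering = as.integer(wards_clusters))$corrected.rand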