We will use the DescTools and caret packages, which were previously installed. In addition, we use the ggplot2 and factoextra packages for plotting, as well as the cluster and fpc packages.
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(ggplot2)
library(cluster)
library(factoextra)
library(fpc)
In the lesson that follows, we use “carseat_sub.csv”, a subset of the “carseats.csv” file. This dataset contains sales of child car seats at 81 different stores. The car seat manufacturer would like to group the stores for market segmentation and marketing purposes. We load the objects in the HCA.RData file and use the ls() function to list its contents.
load("HCA.RData")
ls()
## [1] "cs_dist" "cs_sc" "cs_sub" "facs"
## [5] "nums" "ords" "ward" "wards_clusters"
For k-Means clustering (KMC), we use the preprocessed, scaled data directly as input to the kmeans() function. As with HCA, before clustering we need to center and scale the numeric variables. We use the same standardized data prepared for the HCA lesson.
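For reference, here is a minimal sketch of how such a standardized copy could be produced from the raw data (the actual cs_sc object was created in the HCA lesson and is loaded from HCA.RData, so this code is illustrative only):

# A sketch of how cs_sc could be built: center and scale only the
# numeric columns of cs_sub, leaving the factor variables untouched
cs_sc <- cs_sub
cs_sc[ , nums] <- scale(x = cs_sub[ , nums],
                        center = TRUE,
                        scale = TRUE)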
str(cs_sc)
## 'data.frame': 81 obs. of 11 variables:
## $ CompPrice : num -0.455 -0.143 -0.579 -0.392 1.414 ...
## $ Income : num 1.039 -1.145 -1.38 -1.246 0.166 ...
## $ Advertising: num -0.371 -0.665 0.659 -0.959 0.954 ...
## $ Population : num 1.3514 0.8391 -1.7152 0.0742 -0.1573 ...
## $ Price : num -0.713 0.793 -1.137 -0.211 0.6 ...
## $ ShelveLoc : Ord.factor w/ 3 levels "Bad"<"Medium"<..: 2 2 3 3 3 1 3 2 1 2 ...
## $ Age : num 0.1021 0.5511 -0.0261 0.6152 -0.0903 ...
## $ Education : num 0.111 1.603 1.603 -0.263 -1.382 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 2 ...
## $ Sales_Lev : Ord.factor w/ 2 levels "Low"<"High": 1 1 2 1 2 2 2 1 1 2 ...
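The nums character vector loaded from HCA.RData holds the names of the numeric columns. As a sketch, such a vector could be built directly from the data frame:

# Identify the numeric columns by name; the factor and ordered factor
# variables (ShelveLoc, Urban, US, Sales_Lev) are excluded
nums <- names(cs_sc)[sapply(cs_sc, is.numeric)]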
Because kmeans() selects its initial cluster centers at random, we use the set.seed() function to set the random seed and make the results reproducible.
set.seed(831)
We use the kmeans() function from the stats package to perform k-Means cluster analysis. In the kmeans() function, the centers argument specifies how many clusters we want (and therefore how many cluster centers/centroids are needed). The nstart argument tells R how many random sets of initial centroids to try; kmeans() keeps the best solution (the one with the lowest total within-cluster sum of squares).
We use only the numeric variables as input and set k = 8, based on our Ward’s HCA. Printing the kmeans object (kmeans8) displays summary information.
kmeans8 <- kmeans(x = cs_sc[ , nums],
                  centers = 8,
                  trace = FALSE,
                  nstart = 30)
kmeans8
## K-means clustering with 8 clusters of sizes 13, 7, 8, 10, 8, 17, 9, 9
##
## Cluster means:
## CompPrice Income Advertising Population Price Age
## 1 0.02946154 -0.07226949 -0.7102149 -1.21155885 -0.4898521 -0.4355790
## 2 1.49442528 -0.37206821 -0.6649355 -0.39491784 1.5046557 1.0457760
## 3 0.47990366 -0.81304370 0.3467748 -0.16434592 1.1599338 -0.8838792
## 4 0.39269040 1.04913280 0.6153379 0.07775459 0.2410432 -0.2569989
## 5 0.82252719 -0.46446307 1.3584851 0.96370213 0.6290621 0.7113787
## 6 -0.53880587 0.41649973 -0.7428427 0.21046613 -0.3855558 0.4303342
## 7 -1.03594589 -0.77897892 -0.5177777 0.69489648 -0.7426246 -1.4370088
## 8 -0.74523502 0.35589133 1.2644676 0.16781131 -0.8498714 0.8788318
## Education
## 1 -0.95119368
## 2 0.27040318
## 3 0.94986519
## 4 -1.15779932
## 5 0.11052977
## 6 0.11052977
## 7 0.06908111
## 8 1.22964367
##
## Clustering vector:
## [1] 6 3 8 6 4 4 4 6 8 3 2 8 7 6 5 6 2 6 6 5 1 1 5 6 4 6 6 4 4 6 1 7 7 3 8 2 4 2
## [39] 6 2 7 3 6 4 1 5 1 8 6 3 7 3 1 7 1 6 8 5 8 1 1 5 2 6 4 3 1 7 7 4 1 7 8 8 2 5
## [77] 5 1 3 1 6
##
## Within cluster sum of squares by cluster:
## [1] 35.92961 25.29331 27.97937 28.27671 27.05699 51.21635 31.60098 27.33670
## (between_SS / total_SS = 54.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We visualize the cluster solution using the fviz_cluster() function from the factoextra package, which plots the observations in two dimensions using principal components analysis (PCA).
# Use the same (scaled) data that kmeans() was fit on
fviz_cluster(object = kmeans8,
             data = cs_sc[ , nums])
We can inspect the scaled cluster centers via the centers list component of the kmeans object (kmeans8) we created.
kmeans8$centers
## CompPrice Income Advertising Population Price Age
## 1 0.02946154 -0.07226949 -0.7102149 -1.21155885 -0.4898521 -0.4355790
## 2 1.49442528 -0.37206821 -0.6649355 -0.39491784 1.5046557 1.0457760
## 3 0.47990366 -0.81304370 0.3467748 -0.16434592 1.1599338 -0.8838792
## 4 0.39269040 1.04913280 0.6153379 0.07775459 0.2410432 -0.2569989
## 5 0.82252719 -0.46446307 1.3584851 0.96370213 0.6290621 0.7113787
## 6 -0.53880587 0.41649973 -0.7428427 0.21046613 -0.3855558 0.4303342
## 7 -1.03594589 -0.77897892 -0.5177777 0.69489648 -0.7426246 -1.4370088
## 8 -0.74523502 0.35589133 1.2644676 0.16781131 -0.8498714 0.8788318
## Education
## 1 -0.95119368
## 2 0.27040318
## 3 0.94986519
## 4 -1.15779932
## 5 0.11052977
## 6 0.11052977
## 7 0.06908111
## 8 1.22964367
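Because the centers are in standardized units, it can also help to profile the clusters on the original measurement scale. One way to do this (a sketch, assuming cs_sub holds the unscaled data) is to average the original numeric variables within each cluster:

# Mean of each (unscaled) numeric variable by k-means cluster
aggregate(x = cs_sub[ , nums],
          by = list(cluster = kmeans8$cluster),
          FUN = mean)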
Again, we can use the matplot() function to visualize the (scaled) cluster centers and observe differences between them, adding custom x-axis labels and a legend.
matplot(t(kmeans8$centers),
        type = "l",
        ylab = "",
        xlim = c(0, 7),
        xaxt = "n",
        col = 1:8,
        lty = 1:8,
        main = "Cluster Centers")

# Customize the x-axis labels
axis(side = 1,
     at = 1:7,
     labels = nums,
     las = 2)

# Add a legend
legend("left",
       legend = 1:8,
       col = 1:8,
       lty = 1:8,
       cex = 0.6)
Based on the plot, we can describe each cluster by the variables on which its center lies well above or below the overall (scaled) mean of zero. For example, Cluster 2 stores face high competitor and own prices and serve older customers, while Cluster 7 stores have low prices and incomes, younger customers, and a relatively large population.
Based on these findings and the plot, the car seat manufacturer can choose particular clusters to target. For instance, it may want to increase advertising or marketing efforts for the locations in Cluster 7: the population there is the second highest, both the competitor price and the store’s own price are low, and the average customer age is low, all of which suggest a high demand for car seats.
We can use the table() function to compare the k = 8 clustering solutions from HCA and KMC and evaluate whether the clusters are consistent.
table(KMeans = kmeans8$cluster,
      HCA = wards_clusters)
## HCA
## KMeans 1 2 3 4 5 6 7 8
## 1 0 0 11 0 0 0 2 0
## 2 0 0 1 0 0 2 0 4
## 3 0 5 1 0 0 0 0 2
## 4 0 0 5 5 0 0 0 0
## 5 0 0 0 0 0 8 0 0
## 6 14 1 2 0 0 0 0 0
## 7 0 3 0 0 0 0 6 0
## 8 0 1 0 0 8 0 0 0
As shown, the two solutions share some structure (for example, most of HCA cluster 1 falls in k-means cluster 6, and HCA cluster 5 maps entirely to k-means cluster 8), but they are not identical. The next step is to validate the clustering solutions.
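As a preview, one way to quantify the agreement between the two solutions is the corrected (adjusted) Rand index from the cluster.stats() function in the fpc package. This is a sketch, assuming cs_dist is the distance object loaded from HCA.RData:

# Corrected Rand index comparing the KMC and HCA assignments;
# values near 1 indicate strong agreement, values near 0 indicate
# agreement no better than chance
cluster.stats(d = cs_dist,
              clustering = kmeans8$cluster,
              alt.clustering = as.integer(wards_clusters))$corrected.rand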