Learning Objectives

  1. Data Exploration and Preparation

  2. Standardization/Scaling

  3. Clustering Model Evaluation

  4. Hierarchical Clustering

Case Introduction - Customer Segmentation

In the business world, age and income are two crucial features for segmenting potential customers, as they strongly influence purchasing behavior and capacity.

In this session we will use both the k-means and hierarchical clustering methods to analyze the age_income data to segment potential consumers; then we will compare their performance and interpret the results.

First, load the data age_income_data.csv from the Canvas Module for Week 7, or click this link: age_income_data.

market <- read.csv("age_income_data.csv")

Explore Dataset

We will first explore the structure and statistics of the variables in this dataset.

str(market)
## 'data.frame':    8105 obs. of  3 variables:
##  $ bin   : chr  "60-69" "30-39" "20-29" "30-39" ...
##  $ age   : int  64 33 24 33 78 62 88 54 54 31 ...
##  $ income: num  87083 76808 12044 61972 60120 ...
summary(market)
##      bin                 age            income        
##  Length:8105        Min.   :18.00   Min.   :   233.6  
##  Class :character   1st Qu.:28.00   1st Qu.: 43792.7  
##  Mode  :character   Median :39.00   Median : 65060.0  
##                     Mean   :42.85   Mean   : 66223.6  
##                     3rd Qu.:55.00   3rd Qu.: 85944.7  
##                     Max.   :89.00   Max.   :178676.4

Explore Trend with Visualization

Since there are only two variables, we can further explore the potential trend/relationship between them with side-by-side box plots, grouping each variable by age bin.

par(mfrow = c(1, 2)) #Create a partition of the plotting space to plot two graphs side by side
boxplot(market$age ~ market$bin, main = "Explore Age") 
boxplot(market$income ~ market$bin, main = "Explore Income") #Observe the trend of income for different age range

par(mfrow = c(1, 1)) #Reset the plotting partition to the default (one plot)

Correlation Test

We noticed a potential relationship between age and income; we can validate this with a correlation test. cor.test() is a function we can use to test the statistical significance of a correlation.

#Test the correlation between age and income
cor.test(market$age, market$income) 
## 
##  Pearson's product-moment correlation
## 
## data:  market$age and market$income
## t = -5.4055, df = 8103, p-value = 6.648e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08160633 -0.03822020
## sample estimates:
##         cor 
## -0.05994158

We will focus only on the p-value in the output: since it is smaller than 0.05, it indicates a statistically significant correlation between age and income (though, as the estimate shows, a weak and negative one).
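
As a side note, cor.test() returns an object whose components can be extracted directly; a small sketch reusing the same call as above:

#Store the test result and pull out individual components
ct <- cor.test(market$age, market$income)
ct$p.value  #6.648e-08, well below 0.05
ct$estimate #about -0.06: statistically significant, but weak and negative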

K-means Clustering without Scaling

To understand the importance of scaling, we will first create a k-means model without scaling and see what we get:

#K-means Clustering without Scaling
set.seed(789) #Set a random seed to generate the same result

three <- kmeans(market[, 2:3], 3) 
#Specify variables age and income
#Specify k=3 clusters
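
A side note (not used in the code below): kmeans() starts from randomly chosen centers, so results can vary between runs; the nstart argument asks R to try several random starts and keep the best solution. A hedged sketch:

#Optional: run 25 random starts and keep the best clustering
three_stable <- kmeans(market[, 2:3], centers = 3, nstart = 25)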

Visualize K-Means Clustering Results

Then we can visualize the cluster results with the following code:

We first create a scatter plot, then add points for the cluster centers, using color to distinguish the clusters.

plot(market$age, market$income, col = three$cluster, xlab = 'age',
     ylab = 'income', main = 'K-means without Scaling')
#Add points of centers to the plot
points(three$centers[, 1], three$centers[, 2], 
       pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three) #remove the k-means cluster result
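
Why does this happen? Income's spread is orders of magnitude larger than age's, so the Euclidean distance is driven almost entirely by income. A quick check:

#Compare the spread (standard deviation) of the two variables
sapply(market[, 2:3], sd)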

Standardization/Scaling

Since the scales of age and income are drastically different, the distance calculation is dominated by income and the resulting clusters are essentially income bands. We need the scale() function to standardize these two variables.

market$age_scale <- as.numeric(scale(market$age))
market$inc_scale <- as.numeric(scale(market$income))
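
For intuition, scale() centers each variable at its mean and divides by its standard deviation, so a manual computation should match (a quick sanity check):

#scale() standardizes as (x - mean(x)) / sd(x); this should return TRUE
all.equal(market$age_scale, (market$age - mean(market$age)) / sd(market$age))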

Redo K-Means Clustering

Then we will redo the k-means analysis:

#Redo clustering after scaling
set.seed(789)
three <- kmeans(market[, 4:5], 3)
#Visualize results after scaling
plot(market$age_scale, market$inc_scale,
     col=three$cluster, 
     xlab = 'age', ylab = 'income',
     main = 'K-means with Scaling')
points(three$centers[, 1], three$centers[, 2], 
       pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three)

Hierarchical Clustering

Then we will use the hclust() function to conduct hierarchical clustering. We need the dist() function to produce a distance matrix as the input, along with a specified linkage method that defines the distance between clusters (here Ward's method, "ward.D2").

set.seed(456)
hc_mod <- hclust(dist(market[, 4:5]), method = "ward.D2")
hc_mod
## 
## Call:
## hclust(d = dist(market[, 4:5]), method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 8105
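
One caveat worth noting: dist() materializes every pairwise distance, so memory use grows quadratically with the number of observations. For this dataset that is already a sizable vector:

#Number of pairwise distances stored by dist() for n observations
n <- nrow(market)
n * (n - 1) / 2 #about 32.8 million for n = 8105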

Visualize as Dendrogram

Then we can visualize the result using the as.dendrogram() function:

 # Visualizing the Model Output
dend <- as.dendrogram(hc_mod)
#Install and load dendextend if you don't have it
if (!require("dendextend")) {
  install.packages("dendextend")
  library(dendextend)
}
#Add colors to the different branches
dend_six_color <- color_branches(dend, k = 6)
#Plot the dendrogram
plot(dend_six_color, leaflab = "none", horiz = TRUE,
     main = "Age and Income Dendrogram", xlab = "Height")
#Add a line to indicate cluster decision
abline(v = 37.5, lty = 'dashed', col = 'blue')
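
As a sanity check (a small sketch using the hclust object hc_mod), cutree() can cut at a height instead of a cluster count; cutting at the height of the dashed line should yield six clusters:

#Cutting at h = 37.5 should produce six clusters; table() shows their sizes
table(cutree(hc_mod, h = 37.5))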

Visualize Structure of Dendrogram

An alternative way to show the hierarchical structure:

str(cut(dend, h = 37.5)$upper)
## --[dendrogram w/ 2 branches and 6 members at h = 108]
##   |--[dendrogram w/ 2 branches and 2 members at h = 38]
##   |  |--leaf "Branch 1" (h= 15.7 midpoint = 274, x.member = 782 )
##   |  `--leaf "Branch 2" (h= 19.5 midpoint = 628, x.member = 1526 )
##   `--[dendrogram w/ 2 branches and 4 members at h = 93.8]
##      |--[dendrogram w/ 2 branches and 2 members at h = 41.3]
##      |  |--leaf "Branch 3" (h= 17 midpoint = 431, x.member = 905 )
##      |  `--leaf "Branch 4" (h= 18.5 midpoint = 463, x.member = 1473 )
##      `--[dendrogram w/ 2 branches and 2 members at h = 56.4]
##         |--leaf "Branch 5" (h= 13.6 midpoint = 530, x.member = 1323 )
##         `--leaf "Branch 6" (h= 30.8 midpoint = 753, x.member = 2096 )

Build Multiple K-means Models with Different K

Now, to find the best number of clusters, we can build multiple models and evaluate which one performs best.

#Change number of clusters from 2 to 10, then compare their results through elbow plot
set.seed(456)
two <- kmeans(market[, 4:5], 2)
three <- kmeans(market[, 4:5], 3)
four <- kmeans(market[, 4:5], 4)
five <- kmeans(market[, 4:5], 5)
six <- kmeans(market[, 4:5], 6)
seven <- kmeans(market[, 4:5], 7)
eight <- kmeans(market[, 4:5], 8)
nine <- kmeans(market[, 4:5], 9)
ten <- kmeans(market[, 4:5], 10)

Evaluate the Models

To evaluate the models, we first calculate the total within-cluster sum of squares (a measure of how tightly the data points cluster around their centers). tot.withinss stands for the total within-cluster sum of squares, i.e., sum(withinss).

# Evaluating the Models
optimize <- data.frame(clusters = c(2:10), wss = rep(0, 9)) #Create a new data frame to store the wss (total within-cluster sum of squares) from the models above
optimize[1, 2] <- as.numeric(two$tot.withinss) #Store each model's total within-cluster sum of squares in its respective row
optimize[2, 2] <- as.numeric(three$tot.withinss)
optimize[3, 2] <- as.numeric(four$tot.withinss)
optimize[4, 2] <- as.numeric(five$tot.withinss)
optimize[5, 2] <- as.numeric(six$tot.withinss)
optimize[6, 2] <- as.numeric(seven$tot.withinss)
optimize[7, 2] <- as.numeric(eight$tot.withinss)
optimize[8, 2] <- as.numeric(nine$tot.withinss)
optimize[9, 2] <- as.numeric(ten$tot.withinss)
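
As an aside, the nine separate calls and the manual bookkeeping above can be written more compactly. A sketch that should produce the same data frame (same seed, one kmeans() call per k in the same order):

#Compact alternative to the two blocks above
set.seed(456)
wss <- sapply(2:10, function(k) kmeans(market[, 4:5], k)$tot.withinss)
optimize <- data.frame(clusters = 2:10, wss = wss)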

Elbow Plot

Now we can use an elbow plot to identify the “elbow point”, the spot where adding more clusters stops reducing the error substantially; this gives us a sense of the performance of each model.

#Plot out the Elbow plot. type = "b" indicates both lines and points
plot(optimize$wss ~ optimize$clusters, type = "b", 
     ylim = c(0, 12000), ylab = 'Within Sum of Square Error',
     main = 'Finding Optimal Number of Clusters Based on Error',
     xlab = 'Number of Clusters', pch = 17, col = 'black')

Interpreting the Elbow Plot

We notice that the decrease in the within-cluster sum of squares flattens out around 5 or 6 clusters.

Then we will save both the k-means and hierarchical clustering results (with 5 and 6 clusters each) into the original dataset.

#Save k-means cluster results to original dataset
market$clus5 <- five$cluster
market$clus6 <- six$cluster
#Save hierarchical clustering results
dend_five <- cutree(dend, k = 5)
market$dend5 <- dend_five

dend_six <- cutree(dend, k = 6)
market$dend6 <- dend_six

Choosing a Model from Comparison
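
One simple way to compare the two approaches is to cross-tabulate their cluster assignments (a sketch; cluster labels are arbitrary, so look at the pattern of overlap rather than the label values):

#Cross-tabulate k-means and hierarchical assignments (6 clusters each)
table(kmeans = market$clus6, hclust = market$dend6)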

Your turn: Summarizing the Results

After deciding which model to use, please summarize the cluster size and the key statistics of age and income in each cluster for your marketing group. (Hint: the group_by() and summarise() functions)

## # A tibble: 6 × 2
##   dend6 ClusterSize
##   <int>       <int>
## 1     1        1473
## 2     2        2096
## 3     3        1323
## 4     4         782
## 5     5        1526
## 6     6         905

## # A tibble: 6 × 7
##   dend6 min_age med_age max_age med_inc min_inc max_inc
##   <int>   <int>   <dbl>   <int>   <dbl>   <dbl>   <dbl>
## 1     1      35      47      71  88170.  69492. 137557.
## 2     2      24      33      48  67958.    234.  94709.
## 3     3      18      22      31  32329.   1485.  60887.
## 4     4      62      77      89  43044.   2319.  84301.
## 5     5      44      58      74  57806.    973.  81988.
## 6     6      25      31      50 111125.  93827. 178676.
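
For reference, a sketch that produces tables like those above (assuming the dplyr package is installed):

library(dplyr)
#Cluster sizes for the six-cluster hierarchical solution
market %>% group_by(dend6) %>% summarise(ClusterSize = n())
#Key age and income statistics per cluster
market %>% group_by(dend6) %>%
  summarise(min_age = min(age), med_age = median(age), max_age = max(age),
            med_inc = median(income), min_inc = min(income), max_inc = max(income))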