Data Exploration and Preparation
Standardization/Scaling
Clustering Model Evaluation
Hierarchical clustering
In the business world, age and income are two crucial features for segmenting potential customers, as they strongly influence purchasing behavior and purchasing capacity.
In this session we will use both the k-means and hierarchical clustering methods to analyze the age_income data and segment potential consumers; we will then compare their performance and interpret the results.
First, load the data age_income_data.csv from the Canvas Module Week 7, or click this link: age_income_data
market <- read.csv("age_income_data.csv")
We will first explore the structure and statistics of the variables in this dataset.
str(market)
## 'data.frame':    8105 obs. of  3 variables:
##  $ bin   : chr  "60-69" "30-39" "20-29" "30-39" ...
##  $ age   : int  64 33 24 33 78 62 88 54 54 31 ...
##  $ income: num  87083 76808 12044 61972 60120 ...
summary(market)
##      bin                 age            income        
##  Length:8105        Min.   :18.00   Min.   :   233.6  
##  Class :character   1st Qu.:28.00   1st Qu.: 43792.7  
##  Mode  :character   Median :39.00   Median : 65060.0  
##                     Mean   :42.85   Mean   : 66223.6  
##                     3rd Qu.:55.00   3rd Qu.: 85944.7  
##                     Max.   :89.00   Max.   :178676.4  
Since there are only two numeric variables, we can further use box plots grouped by age bin to explore the potential trend/relationship between these two variables.
par(mfrow = c(1, 2)) #Create a partition of the plotting space to plot two graphs side by side
boxplot(market$age ~ market$bin, main = "Explore Age")
boxplot(market$income ~ market$bin, main = "Explore Income") #Observe the trend of income across age ranges
par(mfrow = c(1, 1)) #Reset the plotting partition to default (1 plot)
We noticed a potential relationship between age and income; we can further validate this with a correlation test. cor.test() is a function we can use to test the statistical significance of a correlation.
#Test the correlation between age and income
cor.test(market$age, market$income)
## 
##  Pearson's product-moment correlation
## 
## data:  market$age and market$income
## t = -5.4055, df = 8103, p-value = 6.648e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08160633 -0.03822020
## sample estimates:
##         cor 
## -0.05994158
We will focus only on the p-value output: since it is smaller than 0.05, it indicates a statistically significant (though weak and negative, cor ≈ -0.06) correlation between age and income.
To understand the importance of scaling, we will first create a k-means model without scaling to see what we will get:
#K-means Clustering without Scaling
set.seed(789) #Set a random seed to generate the same result
three <- kmeans(market[, 2:3], 3) #Specify variables age and income; specify k = 3 clusters
Then we can visualize the cluster results with the following code. We first create a scatter plot, then add the cluster centers as points, using color to distinguish the different clusters.
plot(market$age, market$income, col = three$cluster, xlab = 'age',
ylab = 'income', main = 'K-means without Scaling')
#Add points of centers to the plot
points(three$centers[, 1], three$centers[, 2],
pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three) #remove the k-means cluster result
Since the scales of age and income are drastically different, the clusters are driven almost entirely by income and the result is over-simplified. We need to use the scale() function to standardize these two variables.
market$age_scale <- as.numeric(scale(market$age))
market$inc_scale <- as.numeric(scale(market$income))
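As an optional sanity check (a minimal sketch, not part of the required workflow), scale() centers a variable at its mean and divides by its standard deviation, so the scaled column can be reproduced manually:

#Verify that scale() is equivalent to (x - mean(x)) / sd(x)
all.equal(market$age_scale, (market$age - mean(market$age)) / sd(market$age))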
Then we will redo the k-means analysis.
#Redo clustering after scaling
set.seed(789)
three <- kmeans(market[, 4:5], 3)
#Visualize results after scaling
plot(market$age_scale, market$inc_scale,
col=three$cluster,
xlab = 'age', ylab = 'income',
main = 'K-means with Scaling')
points(three$centers[, 1], three$centers[, 2],
pch = 23, col = 'maroon', bg = 'lightblue', cex = 3)
rm(three)
Then we will use the hclust() function to conduct hierarchical clustering. We will need the dist() function to produce a distance matrix as input, as well as a specified method for measuring the distance between clusters; "ward.D2" merges, at each step, the pair of clusters that yields the smallest increase in total within-cluster variance.
set.seed(456)
hc_mod <- hclust(dist(market[, 4:5]), method = "ward.D2")
hc_mod
## 
## Call:
## hclust(d = dist(market[, 4:5]), method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 8105
Then we can visualize the result using the as.dendrogram() function.
# Visualizing the Model Output
dend <- as.dendrogram(hc_mod)
#Install this package if you don't have it
if(!require("dendextend")) install.packages("dendextend")
#Add colors to the different branches
dend_six_color <- color_branches(dend, k = 6)
#Plot the dendrogram
plot(dend_six_color, leaflab = "none", horiz = TRUE,
main = "Age and Income Dendrogram", xlab = "Height")
#Add a line to indicate cluster decision
abline(v = 37.5, lty = 'dashed', col = 'blue')
An alternative way to show the hierarchical structure:
str(cut(dend, h = 37.5)$upper)
## --[dendrogram w/ 2 branches and 6 members at h = 108]
##   |--[dendrogram w/ 2 branches and 2 members at h = 38]
##   |  |--leaf "Branch 1" (h= 15.7  midpoint = 274, x.member = 782 )
##   |  `--leaf "Branch 2" (h= 19.5  midpoint = 628, x.member = 1526 )
##   `--[dendrogram w/ 2 branches and 4 members at h = 93.8]
##      |--[dendrogram w/ 2 branches and 2 members at h = 41.3]
##      |  |--leaf "Branch 3" (h= 17  midpoint = 431, x.member = 905 )
##      |  `--leaf "Branch 4" (h= 18.5  midpoint = 463, x.member = 1473 )
##      `--[dendrogram w/ 2 branches and 2 members at h = 56.4]
##         |--leaf "Branch 5" (h= 13.6  midpoint = 530, x.member = 1323 )
##         `--leaf "Branch 6" (h= 30.8  midpoint = 753, x.member = 2096 )
Now, if we want to determine the best number of clusters, we can build multiple models and evaluate which one performs best.
#Change number of clusters from 2 to 10, then compare their results through elbow plot
set.seed(456)
two <- kmeans(market[, 4:5], 2)
three <- kmeans(market[, 4:5], 3)
four <- kmeans(market[, 4:5], 4)
five <- kmeans(market[, 4:5], 5)
six <- kmeans(market[, 4:5], 6)
seven <- kmeans(market[, 4:5], 7)
eight <- kmeans(market[, 4:5], 8)
nine <- kmeans(market[, 4:5], 9)
ten <- kmeans(market[, 4:5], 10)
To evaluate the models, we first need to calculate the total within-cluster sum of squares (a measure of how tightly the data points in each cluster are grouped). tot.withinss stands for the total within-cluster sum of squares, i.e. sum(withinss).
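As a quick illustrative check (using the five-cluster model fit above), tot.withinss should equal the sum of the per-cluster withinss values:

#tot.withinss is simply sum(withinss)
all.equal(five$tot.withinss, sum(five$withinss))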
# Evaluating the Models
optimize <- data.frame(clusters = c(2:10), wss = rep(0, 9)) #Create a new dataframe to store the wss (tot.withinss) from the above models
optimize[1, 2] <- as.numeric(two$tot.withinss) #Store each total within-cluster sum of squares in its respective row
optimize[2, 2] <- as.numeric(three$tot.withinss)
optimize[3, 2] <- as.numeric(four$tot.withinss)
optimize[4, 2] <- as.numeric(five$tot.withinss)
optimize[5, 2] <- as.numeric(six$tot.withinss)
optimize[6, 2] <- as.numeric(seven$tot.withinss)
optimize[7, 2] <- as.numeric(eight$tot.withinss)
optimize[8, 2] <- as.numeric(nine$tot.withinss)
optimize[9, 2] <- as.numeric(ten$tot.withinss)
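The repeated fit-and-store steps above can also be written more compactly. Here is a minimal loop-based sketch that does the same thing; note that the wss values may differ slightly from the step-by-step version because the random initializations after set.seed() are consumed in a different order:

#Loop-based alternative (illustrative): refit k-means for k = 2 to 10 and store tot.withinss
set.seed(456)
optimize_loop <- data.frame(clusters = 2:10, wss = NA_real_)
for (k in 2:10) {
  optimize_loop[optimize_loop$clusters == k, "wss"] <- kmeans(market[, 4:5], k)$tot.withinss
}
optimize_loop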
Now we can use an elbow plot to identify the "elbow point", the number of clusters beyond which additional clusters yield only marginal reductions in error; this gives us a sense of how each model performs.
#Plot out the Elbow plot. type = "b" indicates both lines and points
plot(optimize$wss ~ optimize$clusters, type = "b",
ylim = c(0, 12000), ylab = 'Within Sum of Square Error',
main = 'Finding Optimal Number of Clusters Based on Error',
xlab = 'Number of Clusters', pch = 17, col = 'black')
We notice that the reduction in within-cluster sum of squares levels off at around 5 or 6 clusters.
Then we will save both the k-means and hierarchical clustering results (with 5 and 6 clusters each) into the original dataset.
#Save k-means cluster results to original dataset
market$clus5 <- five$cluster
market$clus6 <- six$cluster
#Save hierarchical clustering results
dend_five <- cutree(dend, k = 5)
market$dend5 <- dend_five
dend_six <- cutree(dend, k = 6)
market$dend6 <- dend_six
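Since both sets of labels are now stored in the data, one simple way to compare how k-means and hierarchical clustering assign observations (an optional sketch, not required by the exercise) is a cross-tabulation of the two 6-cluster solutions:

#Cross-tabulate the 6-cluster k-means and hierarchical assignments
table(kmeans = market$clus6, hierarchical = market$dend6)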
After deciding which model to use, please summarize the cluster sizes and the key statistics of income in each cluster for your marketing group. (Hint: the group_by() and summarise() functions)
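A minimal dplyr sketch that would produce summaries like the ones shown below, assuming we settle on the 6-cluster hierarchical solution stored in market$dend6 (the exact statistics you report are up to you):

#Install and load dplyr if needed
if(!require("dplyr")) install.packages("dplyr")
library(dplyr)

#Cluster sizes
market %>%
  group_by(dend6) %>%
  summarise(ClusterSize = n())

#Key age and income statistics within each cluster
market %>%
  group_by(dend6) %>%
  summarise(min_age = min(age), med_age = median(age), max_age = max(age),
            med_inc = median(income), min_inc = min(income), max_inc = max(income))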
## # A tibble: 6 × 2
##   dend6 ClusterSize
##   <int>       <int>
## 1     1        1473
## 2     2        2096
## 3     3        1323
## 4     4         782
## 5     5        1526
## 6     6         905
## # A tibble: 6 × 7
##   dend6 min_age med_age max_age med_inc min_inc max_inc
##   <int>   <int>   <dbl>   <int>   <dbl>   <dbl>   <dbl>
## 1     1      35      47      71  88170.  69492. 137557.
## 2     2      24      33      48  67958.    234.  94709.
## 3     3      18      22      31  32329.   1485.  60887.
## 4     4      62      77      89  43044.   2319.  84301.
## 5     5      44      58      74  57806.    973.  81988.
## 6     6      25      31      50 111125.  93827. 178676.