Today’s class is about cluster analysis
Cluster analysis is the art of creating groups whose members are more similar to each other than to the members of other groups.
There are several algorithms that can be used to create these groups, each differing in:
Today we will focus on one algorithm, one of the most common, K-Means.
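To make the idea behind K-Means concrete, here is a minimal sketch of the usual assign-then-update loop in base R. The toy data and the helper name `simple_kmeans` are made up for illustration. This is not how R's built-in `kmeans()` is implemented, which uses a more refined algorithm by default.

```r
# Toy data: two groups of 2-D points (simulated, for illustration only)
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

# Hypothetical helper: a bare-bones K-Means loop
simple_kmeans <- function(x, k, iters = 20) {
  # Start from k randomly chosen observations as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    # Assignment step: squared distance from every point to every center
    d <- vapply(
      seq_len(k),
      function(j) rowSums(sweep(x, 2, centers[j, ])^2),
      numeric(nrow(x))
    )
    cluster <- max.col(-d, ties.method = "first")  # index of the nearest center
    # Update step: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

toy_fit <- simple_kmeans(x, k = 2)
table(toy_fit$cluster)  # cluster sizes
toy_fit$centers         # final cluster centers
```

In practice you would call `kmeans()` directly. The sketch is only meant to show that the algorithm alternates between assigning points to the nearest center and recomputing the centers.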
There are many valid reasons to perform a cluster analysis, but in essence, the objective is to group observations.
In business, it is most common to group similar customers into segments in order to:
Just remember, there needs to be a clear purpose for the segmentation, and a customer segmentation for one purpose may not be well suited for another purpose.
Source: Infidelity data, known as Fair’s Affairs. Cross-section data from a survey conducted by Psychology Today in 1969.
plot_num()
Distances are highly influenced by the range of each variable.
Standardizing variables is a must!
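For today, let’s just think about standardizing as dividing each value by the maximum of its variable. There are other methods, such as z-scores (subtract the mean, then divide by the standard deviation), which is what R's `scale()` computes by default.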
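A small sketch of why the range matters, using three made-up observations (the values are hypothetical and are not taken from the survey data). On the raw scale, age dominates the distance. After dividing by the maximum, the two variables contribute on a comparable footing.

```r
# Made-up values for illustration: age in years, rating on a 1-5 scale
toy <- data.frame(
  age    = c(22, 57, 23),
  rating = c(1, 1, 5)
)

# Raw Euclidean distances: the wide range of age dominates
d_raw <- as.matrix(dist(toy))

# Dividing each variable by its maximum puts both on a 0-1 scale
toy_max <- as.data.frame(lapply(toy, function(v) v / max(v)))
d_max <- as.matrix(dist(toy_max))

# Alternative: z-scores via scale(), giving each variable a standard deviation of 1
toy_z <- scale(toy)

# Which observation is closest to observation 1?
d_raw[1, 2:3]  # raw scale: observation 3 (similar age) is much closer
d_max[1, 2:3]  # max-scaled: observation 2 (same rating) is now closer
```

The nearest neighbour of the first observation changes once the variables are put on a common scale, which is the reason standardizing comes before clustering.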
If you have 100 observations, you can create between 1 and 100 clusters. The answer is somewhere in between, hopefully.
There are several criteria. Regardless of the metric you use, you will want to stop when an additional cluster gives diminishing returns in reducing the variability within the clusters.
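The diminishing-returns idea can be sketched by hand in base R: fit K-Means for a range of k, record the total within-cluster sum of squares, and look for the point where the drop flattens. The data here are simulated with three groups purely for illustration, and `nstart = 25` is an arbitrary choice to reduce the effect of random starts.

```r
# Simulated data with three groups (illustration only)
set.seed(123)
sim <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)
sim_std <- scale(sim)

# Total within-cluster sum of squares for k = 1, ..., 8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(sim_std, centers = k, nstart = 25)$tot.withinss
})

# Reduction in WSS gained by each additional cluster
data.frame(k = ks, wss = round(wss, 1), drop = c(NA, round(-diff(wss), 1)))

# Elbow plot: the bend suggests where extra clusters stop paying off
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With k = 1 the within-cluster sum of squares equals the total sum of squares of the standardized data, so the curve always starts at (n - 1) times the number of variables.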
Below, the within-cluster sum of squares method (`method = "wss"`) is used.
# Packages used below
library(dplyr)       # %>%, group_by(), summarise(), rename()
library(ggplot2)     # geom_vline()
library(factoextra)  # fviz_nbclust()

# Scaling (cheater_ca holds the clustering variables and is assumed to be defined earlier)
cheater_ca_std <- scale(cheater_ca)

# Determining the number of clusters
fviz_nbclust(cheater_ca_std, kmeans, method = "wss") +
  geom_vline(xintercept = 6, linetype = 2)

# Creating 6 clusters
fit <- kmeans(cheater_ca_std, 6)

# Scoring: attach each observation's cluster label to the original data
cheater_scored <- data.frame(cheater, fit$cluster)

# Profiling: cluster size and variable means per cluster
summary <- cheater_scored %>%
  group_by(fit.cluster) %>%
  summarise(cheater = mean(cheater),
            n = n(),
            age = mean(age),
            gender = mean(gender),
            yearsmarried = mean(yearsmarried),
            rating = mean(rating),
            religiousness = mean(religiousness)) %>%
  rename(Cluster = fit.cluster)
| Cluster | cheater | n | age | gender | yearsmarried | rating | religiousness |
|---|---|---|---|---|---|---|---|
| 1 | 0.2421053 | 95 | 27.43158 | 0.0000000 | 6.280705 | 4.305263 | 3.136842 |
| 2 | 0.3106796 | 103 | 29.52913 | 1.0000000 | 5.519019 | 3.941748 | 2.660194 |
| 3 | 0.3972603 | 73 | 37.82192 | 0.2465753 | 12.821918 | 2.082192 | 3.246575 |
| 4 | 0.1392405 | 158 | 24.92089 | 0.3860759 | 2.374487 | 4.329114 | 2.822785 |
| 5 | 0.2857143 | 98 | 42.15306 | 0.9897959 | 13.607143 | 4.071429 | 3.755102 |
| 6 | 0.2162162 | 74 | 41.18919 | 0.0945946 | 14.932432 | 4.229730 | 3.378378 |
Let’s do some bar plots and make a comment or two for each one.
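As a starting point, here is a minimal base R sketch of one such bar plot. The numbers are copied (rounded) from the profile table above. It assumes `cheater` is a 0/1 indicator, so that its cluster mean can be read as a share. The names `profile` and `overall` are introduced here for illustration.

```r
# Cluster profile values copied (rounded) from the table above
profile <- data.frame(
  Cluster = 1:6,
  cheater = c(0.242, 0.311, 0.397, 0.139, 0.286, 0.216),
  n       = c(95, 103, 73, 158, 98, 74),
  rating  = c(4.31, 3.94, 2.08, 4.33, 4.07, 4.23)
)

# Overall share of cheaters, weighting each cluster by its size
overall <- weighted.mean(profile$cheater, profile$n)

# Bar plot of the cheater share by cluster, with the overall share as a dashed line
barplot(profile$cheater, names.arg = profile$Cluster,
        xlab = "Cluster", ylab = "Share of cheaters", ylim = c(0, 0.5))
abline(h = overall, lty = 2)

# Clusters with the highest cheater share and the lowest marriage rating
profile$Cluster[which.max(profile$cheater)]
profile$Cluster[which.min(profile$rating)]
```

A comment for this plot might note that the cluster with the highest share of cheaters is also the one with the lowest average marriage rating. The same pattern of plot plus reference line works for the other profiled variables.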