Cluster Analysis

Introduction

Today’s Class

Today’s class is about cluster analysis.

What is cluster analysis?

Cluster analysis is the art of creating groups whose members are more similar to each other than to the members of other groups.

There are several algorithms that can be used to create these groups, each differing in how it measures similarity and how it forms the groups.

Today we will focus on one of the most common algorithms: K-Means.

Why cluster analysis?

There are countless valid reasons to perform a cluster analysis, but in essence, the objective is always to group observations.

In business, it is most common to group similar customers into segments so that each segment can be understood and targeted differently.

Just remember, there needs to be a clear purpose for the segmentation, and a customer segmentation for one purpose may not be well suited for another purpose.

How it works - K-Means
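In a nutshell: K-Means picks K starting centers, assigns every observation to its nearest center, recomputes each center as the mean of its members, and repeats until the assignments stop changing. Below is a bare-bones sketch of that loop, for intuition only; in practice you call R's built-in kmeans(), as we do later.

# Minimal K-Means loop, for intuition only (single random start,
# no empty-cluster handling); stats::kmeans() is what you'd actually use.
my_kmeans <- function(X, k, iters = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # random starting centers
  for (i in seq_len(iters)) {
    # Step 1: assign each row to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # Step 2: recompute each center as the mean of its members
    new_centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # Stop once the centers no longer move
    if (isTRUE(all.equal(new_centers, centers, check.attributes = FALSE))) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}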

Let’s take a quick look at the data we will be using.

Source: Infidelity data, known as Fair’s Affairs. Cross-section data from a survey conducted by Psychology Today in 1969.
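The slides don’t show how the data was loaded. A plausible sketch, assuming the Affairs dataset shipped with the AER package and a hypothetical recoding (cheater as a 0/1 flag for having had an affair, gender as 0/1); the actual preparation in class may have differed:

# Hypothetical data preparation (the exact recoding used in class is assumed)
library(AER)     # ships the Affairs ("Fair's Affairs") data
library(dplyr)

data("Affairs")

cheater <- Affairs %>%
  transmute(cheater       = as.numeric(affairs > 0),       # 1 = reported an affair
            age,
            gender        = as.numeric(gender == "male"),  # 0/1 coding (assumed)
            yearsmarried,
            rating,        # self-rating of the marriage (5 = very happy)
            religiousness)

# Variables fed into the cluster analysis (assumed to be all of the above)
cheater_ca <- cheater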

# Histograms of each numeric variable (plot_num() is from funModeling;
# the data argument here follows the sketch above)
library(funModeling)
plot_num(cheater)

How the distance is calculated

Distances are highly influenced by the range of each variable.

Standardizing variables is a must!
For today, let’s just think of standardizing as dividing each value by the maximum of its variable. There are other methods.
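A toy example (numbers made up) of how an unscaled variable with a big range dominates the distance, using the divide-by-the-max standardization from above:

x <- c(age = 25, income = 50000)   # two hypothetical customers
y <- c(age = 55, income = 51000)

# Unscaled: the $1,000 income gap swamps the 30-year age gap
sqrt(sum((x - y)^2))    # ~1000.45

# After dividing each variable by its max, age drives the distance
maxes <- pmax(x, y)
sqrt(sum((x / maxes - y / maxes)^2))    # ~0.55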

How many clusters should I have? (Remember this is an art)

If you have 100 observations, you can create between 1 and 100 clusters. The answer is somewhere in between, hopefully.

There are several criteria. Regardless of the metric you use, you will want to stop when an additional cluster gives you diminishing returns in reduced within-cluster variability.

Below, the within-cluster sum of squares ("elbow") method is used: total within-cluster variability always falls as clusters are added, so look for the point where the curve starts to flatten.

# Scaling: scale() standardizes each variable (subtracts the mean and
# divides by the standard deviation; one of the "other methods" above)
cheater_ca_std <- scale(cheater_ca)

# Determining the number of clusters with the elbow (WSS) method
library(factoextra)   # fviz_nbclust()
fviz_nbclust(cheater_ca_std, kmeans, method = "wss") +
  geom_vline(xintercept = 6, linetype = 2)

Creating a cluster analysis with six clusters, scoring, and profiling

# Creating 6 clusters
library(dplyr)

set.seed(2021)   # kmeans starts from random centers; any fixed seed makes this reproducible
fit <- kmeans(cheater_ca_std, 6)

# Scoring: attach each observation's cluster to the unscaled data
cheater_scored <- data.frame(cheater, fit$cluster)

# Profiling: average each variable within each cluster
summary <- cheater_scored %>%
  group_by(fit.cluster) %>%
  summarise(cheater       = mean(cheater),
            n             = n(),
            age           = mean(age),
            gender        = mean(gender),
            yearsmarried  = mean(yearsmarried),
            rating        = mean(rating),
            religiousness = mean(religiousness),
            .groups       = "drop") %>%
  rename(Cluster = fit.cluster)

knitr::kable(summary)
Cluster    cheater      n     age        gender      yearsmarried   rating     religiousness
1          0.2421053    95    27.43158   0.0000000       6.280705   4.305263   3.136842
2          0.3106796   103    29.52913   1.0000000       5.519019   3.941748   2.660194
3          0.3972603    73    37.82192   0.2465753      12.821918   2.082192   3.246575
4          0.1392405   158    24.92089   0.3860759       2.374487   4.329114   2.822785
5          0.2857143    98    42.15306   0.9897959      13.607143   4.071429   3.755102
6          0.2162162    74    41.18919   0.0945946      14.932432   4.229730   3.378378

Let’s do some bar plots, and make a comment or two for each one
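The plots themselves aren’t reproduced here; a sketch of how the first one could be drawn from the summary table built above, assuming ggplot2:

library(ggplot2)

# Share of cheaters by cluster (first bar plot)
ggplot(summary, aes(x = factor(Cluster), y = cheater)) +
  geom_col() +
  labs(x = "Cluster", y = "Mean of cheater (share who cheated)")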

Cluster 3 cheats the most

Younger people seem to cheat less

Gender: a loaded variable here (note that clusters 1 and 2 split almost entirely along it)

Years married: Cheaters have been married for quite a while, but what makes them different from clusters 5 and 6?

Religiousness doesn’t seem to be a major differentiator

However, how someone rates their marriage does seem to make a difference

So what? Managers don’t care about your cluster analysis. Let’s put this into words.