Cluster Analysis

Introduction

Today’s Class

Today’s class is about cluster analysis.

What is cluster analysis?

Cluster analysis is the art of creating groups whose members are more similar to each other than to the members of other groups.

There are several algorithms that can be used to create these groups, each differing in how it measures similarity and how it forms the groups.

Today we will focus on one of the most common algorithms: K-Means.

Why cluster analysis?

There are countless valid reasons to perform a cluster analysis, but in essence, the objective is always to group observations.

In business, it is most common to group similar customers into segments so that each segment can be understood and targeted differently.

Just remember, there needs to be a clear purpose for the segmentation, and a customer segmentation for one purpose may not be well suited for another purpose.

How it works - K-Means
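In a nutshell: K-Means picks K starting centers, assigns every observation to its nearest center, recomputes each center as the mean of its members, and repeats until the assignments stop changing. Below is a bare-bones sketch of that loop, for intuition only; in practice you call R's built-in kmeans(), as we do later.

# Minimal K-Means loop, for intuition only (single random start,
# no empty-cluster handling); stats::kmeans() is what you'd actually use.
my_kmeans <- function(X, k, iters = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # random starting centers
  for (i in seq_len(iters)) {
    # Step 1: assign each row to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # Step 2: recompute each center as the mean of its members
    new_centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # Stop once the centers no longer move
    if (isTRUE(all.equal(new_centers, centers, check.attributes = FALSE))) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}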

Let’s take a quick look at the data we will be using.

Source: Infidelity data, known as Fair’s Affairs. Cross-section data from a survey conducted by Psychology Today in 1969.
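The slides don’t show how the data was loaded. A plausible sketch, assuming the Affairs dataset shipped with the AER package and a hypothetical recoding (cheater as a 0/1 flag for having had an affair, gender as 0/1); the actual preparation in class may have differed:

# Hypothetical data preparation (the exact recoding used in class is assumed)
library(AER)     # ships the Affairs ("Fair's Affairs") data
library(dplyr)

data("Affairs")

cheater <- Affairs %>%
  transmute(cheater       = as.numeric(affairs > 0),       # 1 = reported an affair
            age,
            gender        = as.numeric(gender == "male"),  # 0/1 coding (assumed)
            yearsmarried,
            rating,        # self-rating of the marriage (5 = very happy)
            religiousness)

# Variables fed into the cluster analysis (assumed to be all of the above)
cheater_ca <- cheater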

# Histograms of each numeric variable (plot_num() is from funModeling;
# the data argument here follows the sketch above)
library(funModeling)
plot_num(cheater)

How the distance is calculated

Distances are highly influenced by the range of each variable.

Standardizing variables is a must!
For today, let’s just think of standardizing as dividing each value by the maximum of its variable. There are other methods.
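A toy example (numbers made up) of how an unscaled variable with a big range dominates the distance, using the divide-by-the-max standardization from above:

x <- c(age = 25, income = 50000)   # two hypothetical customers
y <- c(age = 55, income = 51000)

# Unscaled: the $1,000 income gap swamps the 30-year age gap
sqrt(sum((x - y)^2))    # ~1000.45

# After dividing each variable by its max, age drives the distance
maxes <- pmax(x, y)
sqrt(sum((x / maxes - y / maxes)^2))    # ~0.55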

How many clusters should I have? (Remember this is an art)

If you have 100 observations, you can create between 1 and 100 clusters. The answer is somewhere in between, hopefully.

There are several criteria. Regardless of the metric you use, you will want to stop when an additional cluster gives you diminishing returns in reduced within-cluster variability.

Below, the within-cluster sum of squares ("elbow") method is used: total within-cluster variability always falls as clusters are added, so look for the point where the curve starts to flatten.

# Scaling: scale() standardizes each variable (subtracts the mean and
# divides by the standard deviation; one of the "other methods" above)
cheater_ca_std <- scale(cheater_ca)

# Determining the number of clusters with the elbow (WSS) method
library(factoextra)   # fviz_nbclust()
fviz_nbclust(cheater_ca_std, kmeans, method = "wss") +
  geom_vline(xintercept = 6, linetype = 2)

Creating a cluster analysis with six clusters, scoring, and profiling

# Creating 6 clusters
library(dplyr)

set.seed(2021)   # kmeans starts from random centers; any fixed seed makes this reproducible
fit <- kmeans(cheater_ca_std, 6)

# Scoring: attach each observation's cluster to the unscaled data
cheater_scored <- data.frame(cheater, fit$cluster)

# Profiling: average each variable within each cluster
summary <- cheater_scored %>%
  group_by(fit.cluster) %>%
  summarise(cheater       = mean(cheater),
            n             = n(),
            age           = mean(age),
            gender        = mean(gender),
            yearsmarried  = mean(yearsmarried),
            rating        = mean(rating),
            religiousness = mean(religiousness),
            .groups       = "drop") %>%
  rename(Cluster = fit.cluster)

knitr::kable(summary)
Cluster    cheater      n     age        gender      yearsmarried   rating     religiousness
1          0.2421053    95    27.43158   0.0000000       6.280705   4.305263   3.136842
2          0.3106796   103    29.52913   1.0000000       5.519019   3.941748   2.660194
3          0.3972603    73    37.82192   0.2465753      12.821918   2.082192   3.246575
4          0.1392405   158    24.92089   0.3860759       2.374487   4.329114   2.822785
5          0.2857143    98    42.15306   0.9897959      13.607143   4.071429   3.755102
6          0.2162162    74    41.18919   0.0945946      14.932432   4.229730   3.378378

Let’s do some bar plots, and make a comment or two for each one
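The plots themselves aren’t reproduced here; a sketch of how the first one could be drawn from the summary table built above, assuming ggplot2:

library(ggplot2)

# Share of cheaters by cluster (first bar plot)
ggplot(summary, aes(x = factor(Cluster), y = cheater)) +
  geom_col() +
  labs(x = "Cluster", y = "Mean of cheater (share who cheated)")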

Cluster 3 cheats the most

Younger people seem to cheat less

Gender: a loaded variable here (note that clusters 1 and 2 split almost entirely along it)

Years married: Cheaters have been married for quite a while, but what makes them different from clusters 5 and 6?

Religiousness doesn’t seem to be a major differentiator

However, how someone rates their marriage does seem to make a difference

So what? Managers don’t care about your cluster analysis. Let’s put this into words.