The dataset contains a list of customers who use an application on several platforms, together with their age groups, years of experience in the field, gender, and net worth. The point of this exercise is to apply a K-Means clustering algorithm to the dataset and allocate customers to clusters so that they can be targeted with more relevant service offers.
library(readxl)
library(tidyverse)
library(factoextra)
library(skimr)
I was allowed to use this dataset on the condition that the actual customer IDs are removed. Instead, I am adding a column of simulated IDs, taken from row numbers.
df <- read_excel("Clustering_CaseStudy_28042020.xlsx")
df <- df %>% select(-1)
df <- rowid_to_column(df, "ID")
skim(df)
| | |
|---|---|
| Name | df |
| Number of rows | 244291 |
| Number of columns | 14 |
| Column type frequency: | |
| character | 7 |
| numeric | 7 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Service Offering | 0 | 1 | 12 | 12 | 0 | 1 | 0 |
| Sales Channel | 0 | 1 | 6 | 6 | 0 | 1 | 0 |
| Sales Sub Channel | 0 | 1 | 6 | 18 | 0 | 2 | 0 |
| Age Group | 0 | 1 | 4 | 7 | 0 | 10 | 0 |
| Date of Birth | 0 | 1 | 19 | 19 | 0 | 18867 | 0 |
| Gender | 0 | 1 | 2 | 6 | 0 | 3 | 0 |
| Device | 0 | 1 | 2 | 7 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 122146.00 | 70520.88 | 1 | 61073.5 | 122146.0 | 183218.5 | 244291 | ▇▇▇▇▇ |
| Age Score | 0 | 1 | 3.49 | 2.07 | 1 | 2.0 | 3.0 | 4.0 | 10 | ▇▅▂▁▁ |
| Trading Experience Score | 4 | 1 | 2.79 | 4.77 | 0 | 0.0 | 0.0 | 4.0 | 80 | ▇▁▁▁▁ |
| Net Worth Score | 18 | 1 | 2.54 | 1.24 | 1 | 1.5 | 2.5 | 3.5 | 7 | ▇▅▅▁▁ |
| Income Score | 10 | 1 | 2.54 | 1.38 | 1 | 1.0 | 2.0 | 3.0 | 8 | ▇▃▃▁▁ |
| Gender_Sort | 0 | 1 | 1.25 | 0.45 | 1 | 1.0 | 1.0 | 1.0 | 3 | ▇▁▂▁▁ |
| Device_Sort | 0 | 1 | 1.89 | 0.76 | 1 | 1.0 | 2.0 | 2.0 | 4 | ▆▇▁▃▁ |
There is a very small number of NAs (missing values) in the dataset. Because the affected variables are categorical scores rather than continuous measurements, I do not impute them; with so few rows affected, they are simply removed. We also drop the columns that are not required for this exercise.
df <- df %>%
  select(-`Sales Channel`,
         -`Sales Sub Channel`,
         -`Date of Birth`,
         -`Gender`,
         -`Device`,
         -`Service Offering`,
         -`Age Group`) %>%
  drop_na()
Because K-Means uses distance calculations between values, we must standardise our features so that they are all on a comparable scale (mean 0 and standard deviation 1). This prevents a feature with a large spread from dominating the others.
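As a small illustration of what standardisation does, here is a sketch on made-up numbers. The data frame `toy` and its column names are hypothetical and are not part of the customer dataset; only base R is used.

```r
# Toy data (hypothetical values): two features with very different spreads
toy <- data.frame(
  age_score = c(1, 2, 3, 4, 10),
  experience_score = c(0, 0, 4, 20, 80)
)

# On the raw values, the feature with the larger spread contributes most
# to the Euclidean distances that K-Means relies on
raw_dist <- dist(toy)

# scale() centres each column on 0 and divides it by its standard deviation
toy_z <- as.data.frame(scale(toy))
z_dist <- dist(toy_z)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- sapply(toy_z, mean)
col_sds <- sapply(toy_z, sd)
print(round(col_means, 10))
print(col_sds)
```

Comparing `raw_dist` with `z_dist` shows how much the relative distances between rows change once both columns carry equal weight.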
df_clean <- df[2:6] # Select the five features in columns 2 to 6 (drops ID and Device_Sort)
z_df <- as.data.frame(lapply(df_clean, scale))
RNGversion("3.5.2") # Specify version and set seed for reproducibility
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(123)
df_clusters <- kmeans(z_df, 5)
The first time K-Means is applied to a dataset, the number of clusters is often chosen arbitrarily, although a data scientist will frequently have some idea of the expected clusters from practical knowledge of the data. In this instance I have no prior knowledge about the clients, so the value 5 is an arbitrary starting point.
To assess the optimal number of clusters we use the Elbow Method.
# First we define a function that plots the within-cluster sum of squares against k, then apply it to our data
wssplot <- function(data, nc = 15, seed = 1234) {
  # For k = 1 the within-cluster sum of squares equals the total sum of squares
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)  # same seed for every k, for reproducibility
    wss[i] <- sum(kmeans(x = data, centers = i, nstart = 25)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters',
       ylab = 'Within Group Sum of Squares',
       main = 'Elbow Method Plot to Find Optimal Number of Clusters',
       frame.plot = TRUE, col = 'blue', lwd = 1.5)
}
wssplot(z_df)
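To see what an "elbow" looks like when the right answer is known in advance, here is a sketch on simulated data with three well-separated groups. The data, the seed, and the names `sim`, `wss` and `rel_drop` are assumptions made for this illustration only; the within-cluster sum of squares is computed the same way as in `wssplot` above.

```r
set.seed(42)  # arbitrary seed, only for this illustration

# Simulate 3 well-separated groups of 50 points in two dimensions
centres <- rbind(c(0, 0), c(10, 10), c(-10, 10))
sim <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(50, centres[g, 1]), rnorm(50, centres[g, 2]))
}))

# Total within-cluster sum of squares for k = 1..6
k_max <- 6
wss <- numeric(k_max)
wss[1] <- (nrow(sim) - 1) * sum(apply(sim, 2, var))  # k = 1: total sum of squares
for (k in 2:k_max) {
  wss[k] <- kmeans(sim, centers = k, nstart = 10)$tot.withinss
}

# Relative drop in wss when moving from k - 1 to k clusters
rel_drop <- -diff(wss) / wss[-k_max]
print(round(wss, 1))
print(round(rel_drop, 2))
```

The relative drop should be large up to the true number of groups and small afterwards, which is the bend that the plot makes visible. On real data the bend is usually less sharp, so reading it remains a judgment call.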
Let’s set the number of clusters to 3 (where the “elbow” is) and visualise the result.
df_clusters <- kmeans(z_df, 3)
fviz_cluster(df_clusters, data = z_df)
Let’s look at the cluster centres:
df_clusters$centers
## Age.Score Trading.Experience.Score Net.Worth.Score Income.Score Gender_Sort
## 1 -0.2323664 -0.3708659 -0.3821355 -0.3603820 1.7298186
## 2 1.1705044 0.9802016 1.2990420 1.1638865 -0.2593359
## 3 -0.3878635 -0.2581343 -0.3834632 -0.3367572 -0.5488243
In the above table the 3 rows refer to our clusters. The numbers across each row are the cluster’s average value for the feature listed at the top of the column. Because the values are standardised, positive values are above the overall mean and negative values are below it. The values furthest from zero are the ones that distinguish a cluster most strongly. For example, cluster 2 sits roughly one standard deviation above the mean on age, trading experience, net worth and income, while cluster 1 stands out mainly through its high Gender_Sort value.
Now we can assign clusters to the original dataset with client IDs so that we can tell which cluster each client belongs to.
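Standardised centres can be hard to communicate, so it may help to convert them back to the original units with `original = z * sd + mean`. The sketch below does this on hypothetical toy data (the data frame `toy`, its column names and the seed are assumptions, not part of the customer dataset) and cross-checks the result against the per-cluster means of the raw values.

```r
set.seed(1)  # arbitrary seed, only for this illustration

# Hypothetical toy data standing in for the score columns
toy <- data.frame(
  age_score = c(rnorm(30, 2, 0.5), rnorm(30, 6, 0.5)),
  net_worth_score = c(rnorm(30, 1.5, 0.3), rnorm(30, 5, 0.3))
)

toy_z <- scale(toy)  # standardised copy used for clustering
fit <- kmeans(toy_z, centers = 2, nstart = 10)

# Undo the standardisation column by column: original = z * sd + mean
col_means <- attr(toy_z, "scaled:center")
col_sds <- attr(toy_z, "scaled:scale")
centres_original <- sweep(sweep(fit$centers, 2, col_sds, "*"), 2, col_means, "+")

# Cross-check: the same numbers are the per-cluster means of the raw data
cluster_means <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)
print(round(centres_original, 2))
print(cluster_means)
```

With the objects in this post, `z_df` no longer carries the scaling attributes, so the column means and standard deviations would presumably have to be taken from `df_clean` instead, for example with `sapply(df_clean, mean)` and `sapply(df_clean, sd)`.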
df$cluster <- df_clusters$cluster
df %>% select(1,2,4,8)
## # A tibble: 244,269 x 4
## ID `Age Score` `Net Worth Score` cluster
## <int> <dbl> <dbl> <int>
## 1 1 2 1.5 3
## 2 2 3 5.5 2
## 3 3 2 2.5 3
## 4 4 2 3.5 1
## 5 5 2 3.5 3
## 6 6 6 2.5 1
## 7 7 4 7 2
## 8 8 5 2.5 3
## 9 9 2 1.5 1
## 10 10 2 1.5 1
## # ... with 244,259 more rows
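With the cluster labels attached, the customers in a given cluster can be counted and pulled out by ID. The sketch below uses a small stand-in data frame named `toy` (a hypothetical name) that copies only the first ten ID and cluster values printed above; the same calls should work on `df` itself.

```r
# Stand-in for df: the first ten ID / cluster pairs from the output above
toy <- data.frame(
  ID = 1:10,
  cluster = c(3, 2, 3, 1, 3, 1, 2, 3, 1, 1)
)

# How many customers fall into each cluster
cluster_sizes <- table(toy$cluster)
print(cluster_sizes)

# IDs of the customers in one cluster, here cluster 2
ids_cluster_2 <- toy$ID[toy$cluster == 2]
print(ids_cluster_2)
```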