Introduction

The dataset contains a list of customers who use an application on several platforms, along with their age group, years of experience in the field, gender, and net worth. The aim of this exercise is to apply the K-Means clustering algorithm to the dataset and allocate customers to clusters so that they can be targeted with more relevant service offers.

library(readxl)
library(tidyverse)
library(factoextra)
library(skimr)

Load and explore the data

I was allowed to use this dataset on the condition that the actual customer IDs are removed. Instead, I am adding a column of simulated IDs, taken from row numbers.

df <- read_excel("Clustering_CaseStudy_28042020.xlsx")

df <- df %>% select(-1)
df <- rowid_to_column(df, "ID")

skim(df)

Data summary
Name	df
Number of rows	244291
Number of columns	14
_______________________
Column type frequency:
character	7
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Service Offering	1	12	12	1
Sales Channel	1	6	6	1
Sales Sub Channel	1	6	18	2
Age Group	1	4	7	10
Date of Birth	1	19	19	18867
Gender	1	2	6	3
Device	1	2	7	5

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID	0	1	122146.00	70520.88	1	61073.5	122146.0	183218.5	244291	▇▇▇▇▇
Age Score	0	1	3.49	2.07	1	2.0	3.0	4.0	10	▇▅▂▁▁
Trading Experience Score	4	1	2.79	4.77	0	0.0	0.0	4.0	80	▇▁▁▁▁
Net Worth Score	18	1	2.54	1.24	1	1.5	2.5	3.5	7	▇▅▅▁▁
Income Score	10	1	2.54	1.38	1	1.0	2.0	3.0	8	▇▃▃▁▁
Gender_Sort	0	1	1.25	0.45	1	1.0	1.0	1.0	3	▇▁▂▁▁
Device_Sort	0	1	1.89	0.76	1	1.0	2.0	2.0	4	▆▇▁▃▁

Addressing missing values

There is a very small number of NAs (missing values) in the dataset. Because the scores are categorical, we cannot sensibly impute them, so the affected rows are removed. We also remove the columns that are not required for this exercise.

df <- df %>% select(-`Sales Channel`, 
                    -`Sales Sub Channel`, 
                    -`Date of Birth`,
                    -`Gender`, 
                    -`Device`, 
                    -`Service Offering`, 
                    -`Age Group`)  %>% 
             drop_na()
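
Before dropping rows, it can help to see how many values are missing in each column and how many rows would be lost. The following is a minimal base R sketch on a made-up data frame; the `toy` data, its column names and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data standing in for the score columns
toy <- data.frame(
  age_score    = c(2, 3, NA, 5),
  income_score = c(1, NA, 2, 3),
  gender_sort  = c(1, 1, 2, 1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with at least one NA: these are the rows drop_na() would remove
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)

# Base R equivalent of calling tidyr::drop_na() with no column arguments
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Comparing `rows_with_na` with the total number of rows gives a quick sense of whether removing incomplete rows is a negligible loss.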

Standardising Values (Z-score)

Because K-Means relies on distances between observations, we must standardise the features so that they are all on the same scale (mean 0, standard deviation 1). This prevents a feature with a larger spread from dominating the others.

df_clean <- df[2:6] # Select columns 2 to 6: the four scores and Gender_Sort (drops ID and Device_Sort)
z_df <- as.data.frame(lapply(df_clean, scale))
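
To make the standardisation step concrete, here is a small sketch showing that `scale()` with its default arguments gives the usual z-score, (x - mean) / sd. The vector `x` is a hypothetical example, not data from the case study.

```r
# Hypothetical feature on an arbitrary scale
x <- c(1, 2, 2, 3, 7)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() does the same with its defaults (center = TRUE, scale = TRUE)
z_scale <- as.numeric(scale(x))

# The two versions should agree
print(all.equal(z_manual, z_scale))

# After scaling, the mean is (numerically) zero and the standard deviation is one
print(mean(z_scale))
print(sd(z_scale))
```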

Model Training

RNGversion("3.5.2") # Specify version and set seed for reproducibility

## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used

set.seed(123)

df_clusters <- kmeans(z_df, 5)
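
The object returned by `kmeans()` carries several components that are useful for a first sanity check. The sketch below fits a model on hypothetical random data (`toy` is made up for illustration) and inspects them; the `nstart` and `iter.max` values are assumptions chosen for the example, not settings used in the analysis above.

```r
set.seed(42)

# Hypothetical standardised data: 300 observations, 2 features
toy <- as.data.frame(scale(matrix(rnorm(600), ncol = 2)))

# nstart > 1 tries several random initialisations and keeps the best one
fit <- kmeans(toy, centers = 5, nstart = 10, iter.max = 50)

# Number of observations in each cluster
print(fit$size)

# Total within-cluster sum of squares (lower means tighter clusters)
print(fit$tot.withinss)

# Share of the total sum of squares that lies between clusters
print(fit$betweenss / fit$totss)
```

Note that `totss` always equals `tot.withinss + betweenss`, so the last ratio rises as clusters become more compact.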

Model Evaluation

The first time K-Means is applied to a dataset, the number of clusters is often chosen arbitrarily, although a data scientist will frequently have some idea of the expected clusters from practical knowledge of the data. In this instance, I have no prior knowledge about the clients, so the initial value of 5 is arbitrary.

To assess the optimal number of clusters, we use the Elbow Method.

# First we define a function, then apply it to our data

wssplot <- function(data, nc = 15, seed = 1234){
  # For a single cluster, the within-group sum of squares is the total sum of squares
  wss <- (nrow(data) - 1)*sum(apply(data, 2, var))
  for(i in 2:nc) {
    set.seed(seed) # Reset the seed so that every value of k starts from the same RNG state
    wss[i] <- sum(kmeans(x = data, centers = i, nstart = 25)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters', ylab = 'Within Group Sum of Squares',
       main = 'Elbow Method Plot to Find Optimal Number of Clusters', frame.plot = TRUE,
       col = 'blue', lwd = 1.5)
}

wssplot(z_df)
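
The first element of `wss` in the function above is computed from the column variances rather than from a model fit. The sketch below, on hypothetical data with three well-separated groups (`toy` is made up for illustration), checks that this shortcut matches a one-cluster K-Means fit and computes the values that an elbow plot would show.

```r
set.seed(1234)

# Hypothetical data: three groups of 50 points in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# With one cluster, the within-group sum of squares is the total sum of squares
wss_k1 <- (nrow(toy) - 1) * sum(apply(toy, 2, var))
fit_k1 <- kmeans(toy, centers = 1)
print(all.equal(wss_k1, fit_k1$tot.withinss))

# Total within-group sum of squares for k = 1..6
# (tot.withinss is the same quantity as sum(withinss))
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})
print(round(wss, 1))
```

On data like this, the values in `wss` drop sharply up to the true number of groups and flatten afterwards, which is the "elbow" the plot is meant to reveal.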

Improving Model’s Performance

Let’s set the number of clusters to 3 (where the “elbow” is) and visualise the result.

df_clusters <- kmeans(z_df, 3)

fviz_cluster(df_clusters, data = z_df)

Evaluating the Performance

Let’s look at the centres of the clusters.

df_clusters$centers

##    Age.Score Trading.Experience.Score Net.Worth.Score Income.Score Gender_Sort
## 1 -0.2323664               -0.3708659      -0.3821355   -0.3603820   1.7298186
## 2  1.1705044                0.9802016       1.2990420    1.1638865  -0.2593359
## 3 -0.3878635               -0.2581343      -0.3834632   -0.3367572  -0.5488243

In the table above, the three rows correspond to our clusters. Each number is the cluster’s average value for the feature named at the top of the column. Because the values are standardised, positive values are above the overall mean and negative values are below it. The values furthest from zero are the features that most distinguish a cluster: cluster 2 is roughly one standard deviation above the mean on age, trading experience, net worth and income, while cluster 1 stands out mainly on Gender_Sort (1.73).
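
Standardised centres can be hard to read in business terms, so it may help to convert them back to the original units. The sketch below does this on hypothetical data (`raw` and its column names are made up) by reversing the z-score, and cross-checks the result against the per-cluster means of the raw features. It is an illustration under these assumptions, not part of the original analysis.

```r
set.seed(123)

# Hypothetical raw features in their original units
raw <- data.frame(
  age_score       = sample(1:10, 200, replace = TRUE),
  net_worth_score = sample(1:7,  200, replace = TRUE)
)

# Standardise and cluster
z   <- as.data.frame(scale(raw))
fit <- kmeans(z, centers = 3, nstart = 10)

# Undo the z-score: multiply by each feature's sd, then add its mean
means <- sapply(raw, mean)
sds   <- sapply(raw, sd)
centres_raw <- sweep(sweep(fit$centers, 2, sds, "*"), 2, means, "+")
print(round(centres_raw, 2))

# Cross-check: average of the raw features within each cluster
cluster_means <- aggregate(raw, by = list(cluster = fit$cluster), FUN = mean)
print(cluster_means)
```

Both tables describe the same centres, so either approach can be used to report cluster profiles in the original scoring scale.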

Assigning Clusters to the Original Data

Now we can assign the clusters to the original dataset with client IDs, so we can tell which cluster each client belongs to.

df$cluster <- df_clusters$cluster



df %>% select(1,2,4,8)

## # A tibble: 244,269 x 4
##       ID `Age Score` `Net Worth Score` cluster
##    <int>       <dbl>             <dbl>   <int>
##  1     1           2               1.5       3
##  2     2           3               5.5       2
##  3     3           2               2.5       3
##  4     4           2               3.5       1
##  5     5           2               3.5       3
##  6     6           6               2.5       1
##  7     7           4               7         2
##  8     8           5               2.5       3
##  9     9           2               1.5       1
## 10    10           2               1.5       1
## # ... with 244,259 more rows
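
Attaching the labels works because `kmeans()` returns one label per row, in the same order as the rows it was given, so the standardised data and the original data must contain the same rows. A minimal sketch on hypothetical data (`toy` and its columns are made up) that checks this alignment and then summarises how many clients fall into each cluster:

```r
set.seed(123)

# Hypothetical stand-in for the cleaned data and its standardised copy
toy <- data.frame(ID = 1:150, score_a = rnorm(150), score_b = rnorm(150))
z   <- as.data.frame(scale(toy[, c("score_a", "score_b")]))

fit <- kmeans(z, centers = 3, nstart = 10)

# One label per row, in the same row order as the data passed to kmeans()
stopifnot(length(fit$cluster) == nrow(toy))
toy$cluster <- fit$cluster

# Cluster sizes and their share of all clients
cluster_sizes  <- table(toy$cluster)
cluster_shares <- round(prop.table(cluster_sizes), 3)
print(cluster_sizes)
print(cluster_shares)
```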

Clustering Clients with K-Means

Taras Poltorak

02/05/2020