LBB Unsupervised Learning - Customer Segmentation

Theresia Londong

2023-04-19

Photo by Markus Spiske on Unsplash

Intro

Business Cases

The goal of this project is to segment customers based on their profiles and purchasing tendencies. How many groups we target is up to us as analysts, because unsupervised learning does not prescribe a fixed number of clusters. The outcome of this profiling exercise will eventually serve as the foundation for decision-making in numerous customer-related projects.

Project Plan

In this business case, we will group customers into 5 groups / categories based on data from Kaggle containing sales data from an online shopping platform throughout 2021. The data contains the following variables:

  • Customer_id = unique customer id
  • Age = customer’s age
  • Gender = 0: Male, 1: Female
  • Revenue_Total = total sales by customer
  • N_Purchases = number of purchases to date
  • Purchase_DATE = date of latest purchase, in dd.mm.yy format
  • Purchase_VALUE = latest purchase in €
  • Pay_Method = 0: Digital Wallets, 1: Card, 2: PayPal, 3: Other
  • Time_Spent = time spent (in sec) on website
  • Browser = 0: Chrome, 1: Safari, 2: Edge, 3: Other
  • Newsletter = 0: not subscribed, 1: subscribed
  • Voucher = 0: not used, 1: used

These data will be used to cluster our customers into 5 groups based on how similar their profiles are. The resulting customer profiles will come in handy later when we target specific programs, such as paid premium memberships, discounts, or promotions, or when we conduct risk analyses while doing business with our clients.

Data Preparation

Packages Installation

We will need several packages: for data manipulation and cleaning, for the elbow-method analysis, and for visualizing the resulting K-means model.
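If any of these packages are not installed yet, they can be installed once from CRAN:

install.packages(c("tidyverse", "factoextra", "ggiraphExtra"))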

# data manipulation

library(tidyverse) # loads dplyr and tidyr, among others
library(dplyr)
library(tidyr)

# elbow method

library(factoextra)

# cluster profile visualization

library(ggiraphExtra)

# disable scientific notation
options(scipen = 999)

Reading Data

Import the CSV data from the local directory.

cust <- read.csv('data_input/Online Shop Customer Sales Data.csv')
head(cust)
##   Customer_id Age Gender Revenue_Total N_Purchases Purchase_DATE Purchase_VALUE
## 1      504308  53      0          45.3           2      22.06.21         24.915
## 2      504309  18      1          36.2           3      10.12.21          2.896
## 3      504310  52      1          10.6           1      14.03.21         10.600
## 4      504311  29      0          54.1           5      25.10.21         43.280
## 5      504312  21      1          56.9           1      14.09.21         56.900
## 6      504313  55      0          13.7           6      14.05.21         12.467
##   Pay_Method Time_Spent Browser Newsletter Voucher
## 1          1        885       0          0       0
## 2          2        656       0          0       1
## 3          0        761       0          1       0
## 4          1        906       0          1       0
## 5          1        605       0          1       0
## 6          1        364       1          0       0

EDA

Eliminate irrelevant information, starting with Purchase_DATE, which is not needed for distance-based clustering; the categorical variables will be dropped in the next step.

cust_01 <- cust %>% select(-c('Purchase_DATE'))
head(cust_01)
##   Customer_id Age Gender Revenue_Total N_Purchases Purchase_VALUE Pay_Method
## 1      504308  53      0          45.3           2         24.915          1
## 2      504309  18      1          36.2           3          2.896          2
## 3      504310  52      1          10.6           1         10.600          0
## 4      504311  29      0          54.1           5         43.280          1
## 5      504312  21      1          56.9           1         56.900          1
## 6      504313  55      0          13.7           6         12.467          1
##   Time_Spent Browser Newsletter Voucher
## 1        885       0          0       0
## 2        656       0          0       1
## 3        761       0          1       0
## 4        906       0          1       0
## 5        605       0          1       0
## 6        364       1          0       0

Next, remove the categorical variables: Gender, Pay_Method, Browser, Newsletter, and Voucher.

cust_02 <- cust_01 %>% select(-c('Gender', 'Pay_Method', 'Browser', 'Newsletter', 'Voucher'))
head(cust_02)
##   Customer_id Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
## 1      504308  53          45.3           2         24.915        885
## 2      504309  18          36.2           3          2.896        656
## 3      504310  52          10.6           1         10.600        761
## 4      504311  29          54.1           5         43.280        906
## 5      504312  21          56.9           1         56.900        605
## 6      504313  55          13.7           6         12.467        364

Transform the Customer_id column into row names with the help of the column_to_rownames() function.

cust_03 <- cust_02 %>% 
  column_to_rownames(var = "Customer_id")

Our dataframe has 65,796 observations with 5 predictors.

glimpse(cust_03)
## Rows: 65,796
## Columns: 5
## $ Age            <int> 53, 18, 52, 29, 21, 55, 17, 30, 51, 63, 26, 42, 40, 19,…
## $ Revenue_Total  <dbl> 45.3, 36.2, 10.6, 54.1, 56.9, 13.7, 30.7, 8.1, 18.0, 19…
## $ N_Purchases    <int> 2, 3, 1, 5, 1, 6, 6, 7, 4, 4, 5, 4, 2, 4, 1, 7, 3, 3, 6…
## $ Purchase_VALUE <dbl> 24.915, 2.896, 10.600, 43.280, 56.900, 12.467, 2.456, 6…
## $ Time_Spent     <int> 885, 656, 761, 906, 605, 364, 654, 1011, 312, 828, 1029…

To see simple summary statistics of our dataframe, we can use the summary() function.

summary(cust_03)
##       Age        Revenue_Total    N_Purchases    Purchase_VALUE  
##  Min.   :16.00   Min.   : 0.50   Min.   :1.000   Min.   : 0.005  
##  1st Qu.:28.00   1st Qu.:15.30   1st Qu.:2.000   1st Qu.: 4.820  
##  Median :40.00   Median :30.10   Median :4.000   Median :12.640  
##  Mean   :39.59   Mean   :27.73   Mean   :3.992   Mean   :15.969  
##  3rd Qu.:51.00   3rd Qu.:37.60   3rd Qu.:6.000   3rd Qu.:24.752  
##  Max.   :63.00   Max.   :59.90   Max.   :7.000   Max.   :59.900  
##    Time_Spent    
##  Min.   : 120.0  
##  1st Qu.: 358.0  
##  Median : 598.0  
##  Mean   : 598.9  
##  3rd Qu.: 840.0  
##  Max.   :1080.0

We can see a significant difference between the min. and max. of the Purchase_VALUE and Time_Spent variables relative to the other predictors. In distance-based algorithms such as K-means, scaling is necessary because variables with excessively large values can dominate the distance calculation and mask variables with smaller values, leading to bias.

cust_03$Purchase_VALUE <- scale(cust_03$Purchase_VALUE)
cust_03$Time_Spent <- scale(cust_03$Time_Spent)
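As a side note, scale() returns a one-column matrix, which is why later summaries show headers such as Purchase_VALUE.V1. A minimal alternative sketch that keeps plain numeric vectors instead:

cust_03$Purchase_VALUE <- as.numeric(scale(cust_03$Purchase_VALUE))
cust_03$Time_Spent <- as.numeric(scale(cust_03$Time_Spent))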

Next, we check the possibility of missing value using the following methods:

cust_03 %>% is.na() %>% colSums()
##           Age Revenue_Total   N_Purchases                             
##             0             0             0             0             0

No missing values were found in cust_03, so we can use this data as the final version for our model creation and store it in the cust_cln object.

cust_cln <- cust_03
head(cust_cln)
##        Age Revenue_Total N_Purchases Purchase_VALUE  Time_Spent
## 504308  53          45.3           2      0.6762492  1.02966078
## 504309  18          36.2           3     -0.9881629  0.20542652
## 504310  52          10.6           1     -0.4058190  0.58335052
## 504311  29          54.1           5      2.0644562  1.10524558
## 504312  21          56.9           1      3.0939895  0.02186343
## 504313  55          13.7           6     -0.2646928 -0.84556214

Lastly, we check the summary statistics of our cust_cln data.

summary(cust_cln)
##       Age        Revenue_Total    N_Purchases     Purchase_VALUE.V1 
##  Min.   :16.00   Min.   : 0.50   Min.   :1.000   Min.   :-1.206693  
##  1st Qu.:28.00   1st Qu.:15.30   1st Qu.:2.000   1st Qu.:-0.842747  
##  Median :40.00   Median :30.10   Median :4.000   Median :-0.251616  
##  Mean   :39.59   Mean   :27.73   Mean   :3.992   Mean   : 0.000000  
##  3rd Qu.:51.00   3rd Qu.:37.60   3rd Qu.:6.000   3rd Qu.: 0.663928  
##  Max.   :63.00   Max.   :59.90   Max.   :7.000   Max.   : 3.320759  
##     Time_Spent.V1    
##  Min.   :-1.7237855  
##  1st Qu.:-0.8671578  
##  Median :-0.0033315  
##  Mean   : 0.0000000  
##  3rd Qu.: 0.8676933  
##  Max.   : 1.7315196

Modelling

K-means Clustering

K-means is a centroid-based clustering algorithm, meaning that each cluster has one centroid (central point) that represents the cluster.

K-means clustering can be performed using the kmeans() function, with the following parameters:

  • x: the dataset
  • centers: the number of centroids \(k\)

The value of \(k\) can be determined objectively using the elbow method, as we will see in more detail later, or subjectively based on business considerations, as decided earlier: 5 groups.

# k-means with 5 clusters

RNGkind(sample.kind = "Rounding")
set.seed(88) # for reproducible centroid initialization

cust_km <- kmeans(x = cust_cln, # clean numeric data
                  centers = 5)  # k = 5, chosen subjectively
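Printing the fitted object shows the cluster sizes, the cluster means, the clustering vector, and the within-cluster sum of squares in one view:

cust_km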

By running the kmeans() function above, we can find out the following:

  • The model produces 5 clusters with sizes 16724, 17493, 11924, 13275, and 6380, along with the average value of each predictor in each cluster.
  • The component/attribute information can be accessed from the cust_km object as described below.

kmeans() attributes

  1. The number of iterations of the k-means algorithm until stable clusters are produced, given the initial centroid placement, accessed via $iter
cust_km$iter 
## [1] 4
  2. The number of observations in each cluster, accessed via $size
cust_km$size
## [1] 16724 17493 11924 13275  6380
  3. The location of the cluster centers/centroids, commonly used for cluster profiling, accessed via $centers: the average value of each predictor column in each cluster
cust_km$centers
##        Age Revenue_Total N_Purchases Purchase_VALUE   Time_Spent
## 1 27.75837      32.95915    4.025771      0.2145989 -0.002850924
## 2 52.54782      36.70314    3.987767      0.4011333 -0.001716735
## 3 27.53371      11.22610    3.985072     -0.7163134  0.009396063
## 4 51.48603      12.47586    3.954501     -0.6635992  0.001120213
## 5 32.88433      52.03483    4.010972      1.0571511 -0.007711562
  4. The cluster label for each observation, accessed via $cluster
head(cust_km$cluster)
## 504308 504309 504310 504311 504312 504313 
##      2      1      4      5      5      4

Performance Evaluation

Unlike supervised ML models, the performance of a K-means model is measured by goodness of fit. The ideal clustering model has cluster centroids that are far apart from each other, while the observations within each cluster stay close to their own centroid. The goodness of fit of the clustering results can be seen from 3 values (related by the identity shown after this list):

  • Within Sum of Squares / WSS ($withinss): the sum of squared distances from each observation in a cluster to its cluster centroid. The smaller, the better.
  • Between Sum of Squares / BSS ($betweenss): the sum of weighted squared distances from each cluster centroid to the global mean (the average of all observations across all clusters), weighted by the number of observations in the cluster. The bigger, the better.
  • Total Sum of Squares / TSS ($totss): the sum of squared distances from each observation to the global mean. The closer the BSS/TSS ratio is to 1, the better.
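For reference, these three quantities satisfy a standard decomposition (notation added here: observations \(x_i\), clusters \(C_j\) with centroids \(\mu_j\) and sizes \(n_j\), global mean \(\bar{x}\)):

\[
\underbrace{\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}_{\text{TSS}} = \underbrace{\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2}_{\text{WSS}} + \underbrace{\sum_{j=1}^{k} n_j \lVert \mu_j - \bar{x} \rVert^2}_{\text{BSS}}
\]

Since the TSS is fixed by the data, minimizing the WSS (which is what K-means does) is equivalent to maximizing the BSS for a given \(k\).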
# Check wss value, with $withinss
cust_km$withinss
## [1] 1409298 1871297 1101503 1323977  784372
# Check total wss value, with $tot.withinss
cust_km$tot.withinss
## [1] 6490446

The within-cluster sums of squares add up to the total WSS, which measures the compactness (i.e. goodness) of the clustering; we want it to be as small as possible.

# check bss, with $betweenss
cust_km$betweenss
## [1] 21171136
#check TSS
cust_km$totss
## [1] 27661582

The BSS value is noticeably smaller than the TSS, meaning there is still room to improve the between-group variability.

Next, we compute the BSS/TSS ratio, which is expected to be close to 1.

cust_km$betweenss/cust_km$totss
## [1] 0.7653624

In the cust_km model, the BSS/TSS ratio is around 0.77, so further improvement/optimization is needed.

Tuned K-Means

Another way of tuning the K-means model is to optimize the number of clusters \(k\) with the elbow method, as mentioned previously.

The principle of the elbow method is to choose the number of clusters \(k\) at which increasing \(k\) further no longer yields a significant decrease in the total WSS (the curve flattens).

Due to the large amount of data, a random sample of 10% of the total data is taken to determine the optimum \(k\).

# take 10% random sample data from cust_cln

RNGkind(sample.kind = "Rounding")
set.seed(88)

split_index <- sample(x = nrow(cust_cln), size = nrow(cust_cln)*0.1)

sample_cust <- cust_cln[split_index,]
summary(sample_cust)
##       Age        Revenue_Total    N_Purchases     Purchase_VALUE.V1 
##  Min.   :16.00   Min.   : 0.50   Min.   :1.000   Min.   :-1.205937  
##  1st Qu.:28.00   1st Qu.:15.60   1st Qu.:2.000   1st Qu.:-0.828177  
##  Median :40.00   Median :30.40   Median :4.000   Median :-0.231282  
##  Mean   :39.57   Mean   :27.95   Mean   :4.008   Mean   :-0.001075  
##  3rd Qu.:51.00   3rd Qu.:37.60   3rd Qu.:6.000   3rd Qu.: 0.650700  
##  Max.   :63.00   Max.   :59.90   Max.   :7.000   Max.   : 3.320759  
##     Time_Spent.V1    
##  Min.   :-1.7237855  
##  1st Qu.:-0.8491614  
##  Median :-0.0141293  
##  Mean   :-0.0006114  
##  3rd Qu.: 0.8388991  
##  Max.   : 1.7315196

Using our sample data, sample_cust, we will find the optimum \(k\) with the fviz_nbclust() function, as sketched below.
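The chunk that produced the elbow plot is not shown here; a minimal sketch of the call, assuming the default k.max of 10:

# plot total WSS against k = 1..10 on the 10% sample
fviz_nbclust(x = sample_cust,
             FUNcluster = kmeans,
             method = "wss")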

As we can see from the elbow plot, \(k = 4\) to \(7\) is where the total WSS begins to flatten out.

Tuned K-Means Model

Let's run the kmeans() function with the new number of clusters, \(k = 7\), based on the elbow plot above.

# k-means with new k = 7 

RNGkind(sample.kind = "Rounding")
set.seed(88) 

cust_km_tuned <- kmeans(x = cust_cln, 
                    centers = 7 # new k value based on elbow plot observation
                    )

Performance Evaluation

We will use the WSS, total WSS, BSS, and TSS values to assess the performance of the cust_km_tuned model, just as we did with the cust_km model.

# check wss
cust_km_tuned$withinss
## [1] 371966.9 378549.7 968406.4 981904.9 665859.4 606770.2 625649.9
# check total wss
cust_km_tuned$tot.withinss
## [1] 4599107

The total WSS (4,599,107) is noticeably smaller than in the cust_km model (6,490,446) but not 0. We want the WSS to be as small as possible to decrease the within-group variability, meaning that individuals within the same group share great similarity.

# check bss
cust_km_tuned$betweenss
## [1] 23062474
#check TSS
cust_km_tuned$totss
## [1] 27661582

The BSS is now closer to the TSS and bigger than in the cust_km model. The bigger the BSS, the better, as it represents the variability between groups.

Again, we find the BSS/TSS ratio, which is expected to be closer to 1.

cust_km_tuned$betweenss/cust_km_tuned$totss
## [1] 0.8337366

The BSS/TSS ratio of the cust_km_tuned model is about 0.83, up from 0.77, indicating that the tuning was a success.

Conclusion

# attach cluster labels to the data as a new column, grup
cust_cln$grup <- cust_km_tuned$cluster
# profiling with summarized data
cust_center <- cust_cln %>% 
  group_by(grup) %>% # group by cluster
  summarise_all(mean) # mean of each predictor per cluster

cust_center
## # A tibble: 7 × 6
##    grup   Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
##   <int> <dbl>         <dbl>       <dbl>          <dbl>      <dbl>
## 1     1  27.6          51.2        4.01          1.02   -0.00136 
## 2     2  51.4          51.3        3.97          1.04    0.0174  
## 3     3  27.5          10.7        3.99         -0.739   0.00957 
## 4     4  51.5          10.7        3.97         -0.743   0.00753 
## 5     5  39.5          32.4        4.02          0.200   0.000758
## 6     6  23.3          32.5        4.01          0.194  -0.0143  
## 7     7  55.7          32.5        3.97          0.216  -0.0112

Alternatively, we can use $centers:

cust_km_tuned$centers
##        Age Revenue_Total N_Purchases Purchase_VALUE    Time_Spent
## 1 27.58464      51.16587    4.012764      1.0248330 -0.0013605238
## 2 51.39122      51.25485    3.974369      1.0413411  0.0173504984
## 3 27.51321      10.71179    3.992262     -0.7393215  0.0095686521
## 4 51.47645      10.73824    3.967728     -0.7427959  0.0075294550
## 5 39.51273      32.36885    4.017530      0.1996011  0.0007584132
## 6 23.25227      32.54377    4.009632      0.1938494 -0.0143016842
## 7 55.69659      32.48341    3.974656      0.2160706 -0.0111753820
# display cluster with highest and lowest score for each customer characteristic 

cust_center %>% 
  tidyr::pivot_longer(-grup) %>% 
  group_by(name) %>% # column name
  summarize(cluster_min_val = which.min(value),
            cluster_max_val = which.max(value))
## # A tibble: 5 × 3
##   name           cluster_min_val cluster_max_val
##   <chr>                    <int>           <int>
## 1 Age                          6               7
## 2 N_Purchases                  4               5
## 3 Purchase_VALUE               4               2
## 4 Revenue_Total                3               2
## 5 Time_Spent                   6               2

An alternative for group/cluster visualization is ggRadar() from the ggiraphExtra package.

# optional visualization

ggRadar(data = cust_cln,
        aes(color = grup), #define color based on cluster
        interactive = T)
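One caveat (an assumption on my part, not from the original): grup is stored as an integer, and grouping aesthetics generally expect a discrete variable, so casting it to a factor may give cleaner per-cluster colors:

# hypothetical tweak: treat the cluster label as categorical before plotting
ggRadar(data = cust_cln %>% mutate(grup = as.factor(grup)),
        aes(color = grup), # color by cluster
        interactive = T)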

Interpretation:

Cluster 2:

  • Highest score: Purchase_VALUE, Revenue_Total and Time_Spent
  • Lowest score: -

Cluster 3:

  • Highest score: -
  • Lowest score: Revenue_Total

Cluster 4:

  • Highest score: -
  • Lowest score: N_Purchases and Purchase_VALUE

Cluster 5:

  • Highest score: N_Purchases
  • Lowest score: -

Cluster 6:

  • Highest score: -
  • Lowest score: Age, Time_Spent

Cluster 7:

  • Highest score: Age
  • Lowest score: -

We can base our decision-making for customer-focused programs on the interpretation of the profile shown above.

For example:

  • Customers from Cluster 2 have high economic ability and purchasing power, judging from their propensity to make large purchases (Purchase_VALUE and Revenue_Total) and to stay on our shopping platform for extended periods (Time_Spent). We can target this type of client with additional pop-up advertisements while they are engaged on the platform, and suggest more items through a basket-recommendation program.

  • Clients from Cluster 5, by contrast, have the highest average number of purchases, yet do not bring in as much income as the previous cluster, suggesting that they prefer to shop for lower-priced items. By understanding their purchasing patterns and financial capabilities, we can build suggestion lists aligned with their appetite, both on the homepage and during checkout.

  • Customers from Clusters 3 and 4, meanwhile, require more incentive programs to increase their shopping interest on our platform. Further analysis can be done on this group; for example, a time-series analysis could reveal whether there is a certain period when these customers spend more than usual. If so, we can target seasonal promotion programs at this category of customers.