Intro
Business Cases
The goal of this project is to segment customers based on their profiles and purchasing tendencies. How many groups we target is up to us as users, because unsupervised learning does not prescribe a fixed number of clusters. The outcomes of this profiling exercise will eventually serve as the foundation for decision-making in numerous customer-related projects.
Project Plan
In this business case, we will group customers into 5 groups/categories based on data from Kaggle, which contains sales data from an online shopping platform throughout 2021. The data contains the following variables:
- Customer_id = unique customer id
- Age = customer’s age
- Gender = 0: Male, 1: Female
- Revenue_Total = total sales by customer
- N_Purchases = number of purchases to date
- Purchase_DATE = date of latest purchase, dd.mm.yy
- Purchase_VALUE = latest purchase in €
- Pay_Method = 0: Digital Wallets, 1: Card, 2: PayPal, 3: Other
- Time_Spent = time spent (in sec) on website
- Browser = 0: Chrome, 1: Safari, 2: Edge, 3: Other
- Newsletter = 0: not subscribed, 1: subscribed
- Voucher = 0: not used, 1: used
These data will be used to classify our customers into 5 groups depending on how similar their profiles are. These customer profiles will come in handy later when we target certain programs, such as paid premium memberships, discount programs, or promotions, or when we conduct risk analyses while doing business with our clients.
Data Preparation
Packages Installation
We will need several packages: packages for data manipulation and cleaning, a package for the elbow-method analysis, and a package for visualizing the K-Means model results.
#data manipulation
library(tidyverse)
library(dplyr)
library(tidyr)
# elbow method
library(factoextra)
# profile cluster visualization
library(ggiraphExtra)
# disable scientific notation
options(scipen = 999)
Reading Data
Import csv data from local directory.
cust <- read.csv('data_input/Online Shop Customer Sales Data.csv')
head(cust)
## Customer_id Age Gender Revenue_Total N_Purchases Purchase_DATE Purchase_VALUE
## 1 504308 53 0 45.3 2 22.06.21 24.915
## 2 504309 18 1 36.2 3 10.12.21 2.896
## 3 504310 52 1 10.6 1 14.03.21 10.600
## 4 504311 29 0 54.1 5 25.10.21 43.280
## 5 504312 21 1 56.9 1 14.09.21 56.900
## 6 504313 55 0 13.7 6 14.05.21 12.467
## Pay_Method Time_Spent Browser Newsletter Voucher
## 1 1 885 0 0 0
## 2 2 656 0 0 1
## 3 0 761 0 1 0
## 4 1 906 0 1 0
## 5 1 605 0 1 0
## 6 1 364 1 0 0
EDA
Eliminate irrelevant information, starting with Purchase_DATE; the categorical variables are dropped in the next step.
cust_01 <- cust %>% select(-c('Purchase_DATE'))
head(cust_01)
## Customer_id Age Gender Revenue_Total N_Purchases Purchase_VALUE Pay_Method
## 1 504308 53 0 45.3 2 24.915 1
## 2 504309 18 1 36.2 3 2.896 2
## 3 504310 52 1 10.6 1 10.600 0
## 4 504311 29 0 54.1 5 43.280 1
## 5 504312 21 1 56.9 1 56.900 1
## 6 504313 55 0 13.7 6 12.467 1
## Time_Spent Browser Newsletter Voucher
## 1 885 0 0 0
## 2 656 0 0 1
## 3 761 0 1 0
## 4 906 0 1 0
## 5 605 0 1 0
## 6 364 1 0 0
Next, eliminate the categorical variables: Gender, Pay_Method, Browser, Newsletter, and Voucher.
cust_02 <- cust_01 %>% select(-c('Gender', 'Pay_Method', 'Browser', 'Newsletter', 'Voucher'))
head(cust_02)
## Customer_id Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
## 1 504308 53 45.3 2 24.915 885
## 2 504309 18 36.2 3 2.896 656
## 3 504310 52 10.6 1 10.600 761
## 4 504311 29 54.1 5 43.280 906
## 5 504312 21 56.9 1 56.900 605
## 6 504313 55 13.7 6 12.467 364
Transform the Customer_id column into row names with the help of the column_to_rownames() function.
cust_03 <- cust_02 %>%
column_to_rownames(var = "Customer_id")
Our dataframe now has 65,796 data points with 5 predictors.
glimpse(cust_03)
## Rows: 65,796
## Columns: 5
## $ Age <int> 53, 18, 52, 29, 21, 55, 17, 30, 51, 63, 26, 42, 40, 19,…
## $ Revenue_Total <dbl> 45.3, 36.2, 10.6, 54.1, 56.9, 13.7, 30.7, 8.1, 18.0, 19…
## $ N_Purchases <int> 2, 3, 1, 5, 1, 6, 6, 7, 4, 4, 5, 4, 2, 4, 1, 7, 3, 3, 6…
## $ Purchase_VALUE <dbl> 24.915, 2.896, 10.600, 43.280, 56.900, 12.467, 2.456, 6…
## $ Time_Spent <int> 885, 656, 761, 906, 605, 364, 654, 1011, 312, 828, 1029…
To see simple summary statistics of our dataframe, we can use the summary() function.
summary(cust_03)
## Age Revenue_Total N_Purchases Purchase_VALUE
## Min. :16.00 Min. : 0.50 Min. :1.000 Min. : 0.005
## 1st Qu.:28.00 1st Qu.:15.30 1st Qu.:2.000 1st Qu.: 4.820
## Median :40.00 Median :30.10 Median :4.000 Median :12.640
## Mean :39.59 Mean :27.73 Mean :3.992 Mean :15.969
## 3rd Qu.:51.00 3rd Qu.:37.60 3rd Qu.:6.000 3rd Qu.:24.752
## Max. :63.00 Max. :59.90 Max. :7.000 Max. :59.900
## Time_Spent
## Min. : 120.0
## 1st Qu.: 358.0
## Median : 598.0
## Mean : 598.9
## 3rd Qu.: 840.0
## Max. :1080.0
We can see that there is a large spread between the min. and max. of the Purchase_VALUE and Time_Spent variables. In algorithms that work with numerical data, such as K-Means, scaling is necessary, since variables with excessively large values could skew or mask variables with smaller values, leading to bias.
cust_03$Purchase_VALUE <- scale(cust_03$Purchase_VALUE)
cust_03$Time_Spent <- scale(cust_03$Time_Spent)
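For reference, scale() with its default arguments performs z-score standardization: subtract the column mean, then divide by the standard deviation. A minimal hand-rolled equivalent, shown only for illustration on the untouched cust dataframe:
# z-score standardization by hand, equivalent to scale() with defaults
# (illustrative only; z_value is a throwaway name, not used later)
z_value <- (cust$Purchase_VALUE - mean(cust$Purchase_VALUE)) / sd(cust$Purchase_VALUE)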
Next, we check for missing values using the following method:
cust_03 %>% is.na() %>% colSums()
## Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
##   0             0           0              0          0
No missing values were found in cust_03. We can then use cust_03 as the final version of the data for our model and store it in the cust_cln object.
cust_cln <- cust_03
head(cust_cln)
## Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
## 504308 53 45.3 2 0.6762492 1.02966078
## 504309 18 36.2 3 -0.9881629 0.20542652
## 504310 52 10.6 1 -0.4058190 0.58335052
## 504311 29 54.1 5 2.0644562 1.10524558
## 504312 21 56.9 1 3.0939895 0.02186343
## 504313 55 13.7 6 -0.2646928 -0.84556214
Lastly, we check the summary statistics of our cust_cln data.
summary(cust_cln)
## Age Revenue_Total N_Purchases Purchase_VALUE.V1
## Min. :16.00 Min. : 0.50 Min. :1.000 Min. :-1.206693
## 1st Qu.:28.00 1st Qu.:15.30 1st Qu.:2.000 1st Qu.:-0.842747
## Median :40.00 Median :30.10 Median :4.000 Median :-0.251616
## Mean :39.59 Mean :27.73 Mean :3.992 Mean : 0.000000
## 3rd Qu.:51.00 3rd Qu.:37.60 3rd Qu.:6.000 3rd Qu.: 0.663928
## Max. :63.00 Max. :59.90 Max. :7.000 Max. : 3.320759
## Time_Spent.V1
## Min. :-1.7237855
## 1st Qu.:-0.8671578
## Median :-0.0033315
## Mean : 0.0000000
## 3rd Qu.: 0.8676933
## Max. : 1.7315196
Modelling
K-means Clustering
K-means is a centroid-based clustering algorithm, meaning that each cluster has one centroid/central point that represents the cluster.
K-means clustering can be performed using the kmeans() function, with the following parameters:
- x: dataset
- centers: number of centroids \(k\)
The value of \(k\) can be determined objectively using the Elbow method, as we will see in more detail later, or subjectively based on business considerations, as we decided before: 5 groups.
# k-means with 5 clusters
RNGkind(sample.kind = "Rounding")
set.seed(88) # fix the random initialization for reproducibility
cust_km <- kmeans(x = cust_cln, # data clean (numerical)
centers = 5 # k = 5 subjectively
)
By running the kmeans() function above, we find the following:
- It produces 5 clusters with sizes 16724, 17493, 11924, 13275, and 6380, along with the average value of each predictor column in each cluster; see the print call below.
- The component/attribute information can be accessed from the cust_km model as described below.
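This summary can be displayed by simply printing the fitted object:
# printing the model shows cluster sizes, cluster means, and the fit summary
cust_km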
kmeans() attributes
- The number of repetitions (iterations) of the k-means algorithm until stable clusters are produced, given the initial centroid placement, by looking at $iter
cust_km$iter
## [1] 4
- The number of observations in each cluster, by looking at the $size
cust_km$size
## [1] 16724 17493 11924 13275 6380
- Location of cluster centers/centroids, commonly used for cluster profiling, by looking at $centers: the average value of each predictor column in each cluster
cust_km$centers
## Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
## 1 27.75837 32.95915 4.025771 0.2145989 -0.002850924
## 2 52.54782 36.70314 3.987767 0.4011333 -0.001716735
## 3 27.53371 11.22610 3.985072 -0.7163134 0.009396063
## 4 51.48603 12.47586 3.954501 -0.6635992 0.001120213
## 5 32.88433 52.03483 4.010972 1.0571511 -0.007711562
- Cluster label for each observation, by looking at $cluster
head(cust_km$cluster)
## 504308 504309 504310 504311 504312 504313
## 2 1 4 5 5 4
Performance Evaluation
Unlike other ML models, the performance of a K-Means model is measured by goodness of fit. In an ideal clustering model, the distances between cluster centroids are far apart, while the observation points within the same cluster are close to their centroid. The goodness of fit of the clustering results can therefore be seen from 3 values:
- Within Sum of Squares / WSS ($withinss): the sum of the squared distances from each observation point in a cluster to its cluster centroid. The smaller, the better.
- Between Sum of Squares / BSS ($betweenss): the sum of weighted squared distances from the centroid of each cluster to the global mean (the average of all observations from all clusters), weighted by the number of observations in each cluster. The bigger, the better.
- Total Sum of Squares / TSS ($totss): the sum of the squared distances from each observation point to the global mean. The closer the BSS/TSS ratio is to 1, the better.
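In formulas, with clusters \(C_j\), centroids \(\mu_j\), cluster sizes \(n_j\), and global mean \(\bar{x}\), these quantities are
\[
\text{WSS}_j = \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad
\text{BSS} = \sum_{j=1}^{k} n_j \lVert \mu_j - \bar{x} \rVert^2, \qquad
\text{TSS} = \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2,
\]
and they satisfy the decomposition \(\text{TSS} = \text{BSS} + \sum_j \text{WSS}_j\), which you can verify below from $tot.withinss, $betweenss, and $totss.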
# Check wss value, with $withinss
cust_km$withinss
## [1] 1409298 1871297 1101503 1323977 784372
# Check total wss value, with $tot.withinss
cust_km$tot.withinss
## [1] 6490446
The WSS value for each cluster is much smaller than the Total WSS. The total within-cluster sum of squares measures the compactness (i.e., goodness) of the clustering, and we want it to be as small as possible.
# check bss, with $betweenss
cust_km$betweenss
## [1] 21171136
#check TSS
cust_km$totss
## [1] 27661582
The BSS value is noticeably smaller than the TSS, meaning that we can still improve the between-group variability.
Next, compute the BSS/TSS ratio, which is expected to be close to 1.
cust_km$betweenss/cust_km$totss
## [1] 0.7653624
In the cust_km model, however, the BSS/TSS ratio is only about 0.77, so further improvement/optimization is needed.
Tuned K-Means
One way of tuning the K-Means model is by optimizing the number of clusters, \(k\), with the Elbow method described previously.
The principle of the elbow method for finding the optimum \(k\): choose the number of clusters beyond which increasing \(k\) no longer significantly decreases the Total WSS (the curve flattens).
Due to the large amount of data, a random sample of 10% of the total data is taken to determine the optimum \(k\).
# take 10% random sample data from cust_cln
RNGkind(sample.kind = "Rounding")
set.seed(88)
split_index <- sample(x = nrow(cust_cln), size = nrow(cust_cln)*0.1)
sample_cust <- cust_cln[split_index,]
summary(sample_cust)
## Age Revenue_Total N_Purchases Purchase_VALUE.V1
## Min. :16.00 Min. : 0.50 Min. :1.000 Min. :-1.205937
## 1st Qu.:28.00 1st Qu.:15.60 1st Qu.:2.000 1st Qu.:-0.828177
## Median :40.00 Median :30.40 Median :4.000 Median :-0.231282
## Mean :39.57 Mean :27.95 Mean :4.008 Mean :-0.001075
## 3rd Qu.:51.00 3rd Qu.:37.60 3rd Qu.:6.000 3rd Qu.: 0.650700
## Max. :63.00 Max. :59.90 Max. :7.000 Max. : 3.320759
## Time_Spent.V1
## Min. :-1.7237855
## 1st Qu.:-0.8491614
## Median :-0.0141293
## Mean :-0.0006114
## 3rd Qu.: 0.8388991
## Max. : 1.7315196
Using our sample data, sample_cust, we will find the optimum \(k\) with the fviz_nbclust() function as follows.
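A minimal sketch of this call, assuming factoextra's default k.max of 10:
# elbow plot: Total WSS for k = 1..10 on the 10% sample
fviz_nbclust(x = sample_cust,
             FUNcluster = kmeans,
             method = "wss")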
As we can see from the elbow plot above, \(k\) = 4–7 is where the Total WSS begins to converge.
Tuned K-Means Model
Let’s run the kmeans() function with \(k\) = 7, as observed in the elbow plot above.
# k-means with new k = 7
RNGkind(sample.kind = "Rounding")
set.seed(88)
cust_km_tuned <- kmeans(x = cust_cln,
centers = 7 # new k value based on elbow plot observation
)
Performance Evaluation
We will use the WSS, Total WSS, BSS, and TSS values to assess the performance of the cust_km_tuned model, just as we did with the cust_km model before.
# check wss
cust_km_tuned$withinss
## [1] 371966.9 378549.7 968406.4 981904.9 665859.4 606770.2 625649.9
# check total wss
cust_km_tuned$tot.withinss
## [1] 4599107
The WSS value for each cluster is much smaller than the TSS, but not 0. We want the WSS to be as small as possible to decrease the within-group variability, so that individuals within the same group share great similarity.
# check bss
cust_km_tuned$betweenss
## [1] 23062474
#check TSS
cust_km_tuned$totss
## [1] 27661582
The BSS is now closer to the TSS and bigger than in the cust_km model. The bigger the BSS, the better, as it represents the variability between groups.
Find the BSS/TSS ratio, which is expected to be close to 1.
cust_km_tuned$betweenss/cust_km_tuned$totss
## [1] 0.8337366
The BSS/TSS ratio in the cust_km_tuned model is around 0.83, up from 0.77, indicating that the improvement was a success.
Conclusion
# attach cluster labels to the initial data by adding a new column, grup
cust_cln$grup <- cust_km_tuned$cluster
# profiling with summarized data
cust_center <- cust_cln %>%
group_by(grup) %>% # grouping in each cluster
summarise_all(mean) # mean in each cluster
cust_center
## # A tibble: 7 × 6
## grup Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 27.6 51.2 4.01 1.02 -0.00136
## 2 2 51.4 51.3 3.97 1.04 0.0174
## 3 3 27.5 10.7 3.99 -0.739 0.00957
## 4 4 51.5 10.7 3.97 -0.743 0.00753
## 5 5 39.5 32.4 4.02 0.200 0.000758
## 6 6 23.3 32.5 4.01 0.194 -0.0143
## 7 7 55.7 32.5 3.97 0.216 -0.0112
Alternatively, we can use $centers:
cust_km_tuned$centers
## Age Revenue_Total N_Purchases Purchase_VALUE Time_Spent
## 1 27.58464 51.16587 4.012764 1.0248330 -0.0013605238
## 2 51.39122 51.25485 3.974369 1.0413411 0.0173504984
## 3 27.51321 10.71179 3.992262 -0.7393215 0.0095686521
## 4 51.47645 10.73824 3.967728 -0.7427959 0.0075294550
## 5 39.51273 32.36885 4.017530 0.1996011 0.0007584132
## 6 23.25227 32.54377 4.009632 0.1938494 -0.0143016842
## 7 55.69659 32.48341 3.974656 0.2160706 -0.0111753820
# display cluster with highest and lowest score for each customer characteristic
cust_center %>%
tidyr::pivot_longer(-grup) %>%
group_by(name) %>% # column name
summarize(cluster_min_val = which.min(value),
cluster_max_val = which.max(value))
## # A tibble: 5 × 3
## name cluster_min_val cluster_max_val
## <chr> <int> <int>
## 1 Age 6 7
## 2 N_Purchases 4 5
## 3 Purchase_VALUE 4 2
## 4 Revenue_Total 3 2
## 5 Time_Spent 6 2
An alternative for group/cluster visualization is ggRadar() from the ggiraphExtra package.
#optional visualization
ggRadar(data = cust_cln,
aes(color = grup), #define color based on cluster
interactive = T)
Interpretation:
Cluster 2:
- Highest score: Purchase_VALUE, Revenue_Total and Time_Spent
- Lowest score: -
Cluster 3:
- Highest score: -
- Lowest score: Revenue_Total
Cluster 4:
- Highest score: -
- Lowest score: N_Purchases and Purchase_VALUE
Cluster 5:
- Highest score: N_Purchases
- Lowest score: -
Cluster 6:
- Highest score: -
- Lowest score: Age, Time_Spent
Cluster 7:
- Highest score: Age
- Lowest score: -
We can base our decision-making for customer-focused programs on the interpretation of the profiles shown above.
For example:
Customers from Cluster 2 have high economic ability and purchasing power, based on their propensity to make large purchases (Purchase_VALUE & Revenue_Total) and to stay on our shopping platform for extended periods of time. We can target this type of client with additional pop-up advertisements while they are engaged on the platform, and propose more items through a basket-recommendation program.
Clients from Cluster 5, by contrast, have a higher average number of purchases in terms of quantity, yet do not bring in as much income as the prior cluster, which suggests that this sort of customer prefers items at lower prices. By understanding their purchasing patterns and financial capabilities, we can craft suggestion lists in line with these customers’ appetites, both on the homepage and during check-out.
Customers from Clusters 3 and 4, meanwhile, require more incentive programs to increase their shopping interest on our platform. Further analysis can be done on this type of customer: for example, with a time-series analysis, is there a certain period when customers from this group spend more than usual? If so, we can target seasonal promotion programs at this category of customers.