Segmenting Customers with the LRFMP Model
Introduction
The main purpose of this article is to learn how to identify different customer segments in the retail industry using the K-Means clustering method. Customer segmentation enables companies to divide customers into distinct, internally homogeneous groups and to interact with each segment separately. Moreover, customer segmentation is a critical success factor for understanding the behavior of different groups of customers and evaluating their value.
This article is inspired by a publication by Peker et al., who proposed the LRFMP (Length, Recency, Frequency, Monetary, Periodicity) model for classifying customers in the grocery retail industry. All source code and the dataset for this article are provided in my GitHub repo.
Library
The following are the required packages that will be used throughout the article; they cover data wrangling, clustering, and data visualization.
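Here is a minimal sketch of the library chunk, based on the functions used later in the article (the exact set in the source repo may differ):

Code

library(tidyverse)  # data wrangling (dplyr, tidyr, purrr) and ggplot2
library(lubridate)  # parsing transaction dates
library(scales)     # comma_format() and percent() label helpers
library(cluster)    # silhouette() for cluster evaluation
library(factoextra) # PCA-based cluster visualization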
Data Understanding
The dataset is acquired from Kaggle. It consists of transaction records and customer data from Sprocket Central Pty Ltd, a medium-sized bikes & cycling accessories organisation. For the sake of simplicity, we will only segment customers based on their transaction records and ignore the customer demographics.
Read Data
First we will read and inspect the transaction data.
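Here is a minimal sketch of the reading step; the CSV file name is an assumption:

Code

# read the transaction records (file name is illustrative)
df <- read_csv("transactions.csv")

glimpse(df)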
Rows: 20,000
Columns: 13
$ transaction_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ product_id <int> 2, 3, 37, 88, 78, 25, 22, 15, 67, 12, 5, 61, 3…
$ customer_id <int> 2950, 3120, 402, 3135, 787, 2339, 1542, 2459, …
$ transaction_date <chr> "25/02/2017", "21/05/2017", "16/10/2017", "31/…
$ online_order <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, F…
$ order_status <chr> "Approved", "Approved", "Approved", "Approved"…
$ brand <chr> "Solex", "Trek Bicycles", "OHM Cycles", "Norco…
$ product_line <chr> "Standard", "Standard", "Standard", "Standard"…
$ product_class <chr> "medium", "medium", "low", "medium", "medium",…
$ product_size <chr> "medium", "large", "medium", "medium", "large"…
$ list_price <dbl> 71.49, 2091.47, 1793.43, 1198.46, 1765.30, 153…
$ standard_cost <chr> "$53.62", "$388.92", "$248.82", "$381.10", "$7…
$ product_first_sold_date <chr> "12/2/2012", "3/3/2014", "7/20/1999", "12/16/1…
Data description:

- `transaction_id`: unique identifier for each transaction
- `product_id`: unique identifier for each sold product
- `customer_id`: unique identifier for each customer
- `transaction_date`: date of the transaction
- `online_order`: whether the transaction was done online or offline
- `order_status`: the status of the order (approved or cancelled)
- `brand`: the brand of the product
- `product_line`: the product line category of the product
- `product_class`: the product class of the product
- `product_size`: the size of the product
- `list_price`: the listed price during the transaction
- `standard_cost`: the standard cost of the product
- `product_first_sold_date`: the first time the product was sold
Inspect Data
Let’s check the summary of each column from the data to see if there is anything unusual.
transaction_id product_id customer_id transaction_date
Min. : 1 Min. : 0.00 Min. : 1.0 Length:20000
1st Qu.: 5001 1st Qu.: 18.00 1st Qu.: 857.8 Class :character
Median :10000 Median : 44.00 Median :1736.0 Mode :character
Mean :10000 Mean : 45.36 Mean :1738.2
3rd Qu.:15000 3rd Qu.: 72.00 3rd Qu.:2613.0
Max. :20000 Max. :100.00 Max. :5034.0
online_order order_status brand product_line
Mode :logical Length:20000 Length:20000 Length:20000
FALSE:9811 Class :character Class :character Class :character
TRUE :9829 Mode :character Mode :character Mode :character
NA's :360
product_class product_size list_price standard_cost
Length:20000 Length:20000 Min. : 12.01 Length:20000
Class :character Class :character 1st Qu.: 575.27 Class :character
Mode :character Mode :character Median :1163.89 Mode :character
Mean :1107.83
3rd Qu.:1635.30
Max. :2091.47
product_first_sold_date
Length:20000
Class :character
Mode :character
There are 360 transactions with a missing `online_order` value (NA). Next, we will check the order status of each transaction and see how many orders are approved, assuming that an approved order is a completed transaction from which the company gains revenue.
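A quick tabulation shows the counts (a sketch):

Code

df %>%
  count(order_status)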
Most of the transactions are approved, with only a small number of cancelled transactions.
Data Preparation
Data Cleansing
We will exclude the cancelled transactions from the data and transform the transaction date into a proper date data type.
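Here is a minimal sketch of the cleansing step; since the dates are in day/month/year format, we can parse them with lubridate's dmy():

Code

df_clean <- df %>%
  filter(order_status == "Approved") %>%           # keep completed transactions only
  mutate(transaction_date = dmy(transaction_date)) # "25/02/2017" -> Date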
LRFMP Model
We will prepare the data for creating the LRFMP model. The model is an extension of the traditional RFM model with the addition of two new variables: `Length` and `Periodicity`. Below is the definition of each variable from Peker et al.:

- `Length`: the time interval, in days, between the customer's first and last visits. It shows customer loyalty; the higher the length, the more loyal the customer.
- `Recency`: the average number of days between the dates of the customer's N most recent visits and the last date of the observation period.
- `Frequency`: the customer's total number of visits during the observation period. The higher the frequency, the higher the customer loyalty.
- `Monetary`: the average amount of money spent per visit by the customer during the observation period; it reflects the customer's contribution to the company's revenue.
- `Periodicity`: the standard deviation of the customer's inter-visit times. Periodicity indicates the tendency of a customer's visits to occur at regular intervals.
You may create the LRFMP model based on the above definitions. However, for easier interpretation I will change the definition of the following metrics:

- `Recency`: the number of days between the customer's last visit (N = 1) and the last date of the observation period.
- `Periodicity`: the median of the customer's inter-visit times, in days. This changes the perspective: instead of looking at the regularity of customer visits, we look at the typical inter-visit time between customer visits.
First let's calculate the `Length`, `Recency`, `Frequency`, and `Monetary` metrics.
Code
df_agg <- df_clean %>%
  group_by(customer_id) %>%
  summarise(count_visit = n_distinct(transaction_date), # number of distinct visit days
            sales = sum(list_price),                    # total spending
            first_order = min(transaction_date),
            last_order = max(transaction_date)
            ) %>%
  mutate(frequency = count_visit,
         monetary = sales / count_visit, # average spending per visit
         recency = difftime(max(df_clean$transaction_date), last_order, units = "days") %>%
           as.numeric(),
         length = difftime(last_order, first_order, units = "days") %>%
           as.numeric()
         )

head(df_agg, 10)
customer_id <int> | count_visit <int> | sales <dbl> | first_order <date> | …
---|---|---|---|---
1 | 11 | 9084.45 | 2017-01-05 | …
2 | 3 | 4149.07 | 2017-05-04 | …
3 | 8 | 9888.23 | 2017-02-23 | …
4 | 2 | 1047.72 | 2017-04-03 | …
5 | 6 | 5903.20 | 2017-03-03 | …
6 | 5 | 5931.69 | 2017-01-28 | …
7 | 3 | 995.38 | 2017-02-18 | …
8 | 10 | 12024.76 | 2017-01-04 | …
9 | 6 | 5357.55 | 2017-02-04 | …
10 | 6 | 7067.83 | 2017-06-20 | …
Next, calculate the `Periodicity` metric.
Code
df_agg_2 <- df_clean %>%
  distinct(customer_id, transaction_date) %>%
  arrange(customer_id, transaction_date) %>%
  group_by(customer_id) %>%
  mutate(lag_date = lag(transaction_date), # previous visit date
         interval_day = difftime(transaction_date, lag_date, units = "days") %>%
           as.numeric()
         ) %>%
  drop_na() %>% # each customer's first visit has no previous visit
  group_by(customer_id) %>%
  summarise(periodicity = median(interval_day))
Finally, we will combine both dataframes to get the complete LRFMP values for each customer.
Code
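# A sketch of the collapsed chunk: a left join keeps single-purchase
# customers, whose periodicity is still missing at this point.
df_final <- df_agg %>%
  left_join(df_agg_2, by = "customer_id") %>%
  select(customer_id, frequency, monetary, recency, length, periodicity)

head(df_final, 10)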
customer_id <int> | frequency <int> | monetary <dbl> | recency <dbl> | length <dbl> | …
---|---|---|---|---|---
1 | 11 | 825.8591 | 7 | 352 | …
2 | 3 | 1383.0233 | 128 | 112 | …
3 | 8 | 1236.0287 | 102 | 208 | …
4 | 2 | 523.8600 | 195 | 76 | …
5 | 6 | 983.8667 | 16 | 286 | …
6 | 5 | 1186.3380 | 64 | 272 | …
7 | 3 | 331.7933 | 253 | 62 | …
8 | 10 | 1202.4760 | 22 | 338 | …
9 | 6 | 892.9250 | 78 | 251 | …
10 | 6 | 1177.9717 | 33 | 160 | …
Single-Purchase Customers
Let’s check the summary of the data.
customer_id frequency monetary recency
Min. : 1 Min. : 1.000 Min. : 60.34 Min. : 0.00
1st Qu.: 877 1st Qu.: 4.000 1st Qu.: 934.40 1st Qu.: 17.00
Median :1751 Median : 5.000 Median :1113.10 Median : 44.00
Mean :1751 Mean : 5.638 Mean :1113.38 Mean : 61.23
3rd Qu.:2625 3rd Qu.: 7.000 3rd Qu.:1291.84 3rd Qu.: 86.00
Max. :5034 Max. :14.000 Max. :2182.98 Max. :353.00
length periodicity
Min. : 0 Min. : 1.0
1st Qu.:195 1st Qu.: 28.0
Median :257 Median : 41.5
Mean :240 Mean : 53.3
3rd Qu.:302 3rd Qu.: 64.0
Max. :362 Max. :357.0
NA's :50
As we can see, there are 50 customers with missing `Periodicity` values. This indicates that these customers have visited the store only once (`Frequency` = 1) and have not yet returned for a second purchase.
customer_id <int> | frequency <int> | monetary <dbl> | recency <dbl> | length <dbl> | …
---|---|---|---|---|---
71 | 1 | 1415.01 | 131 | 0 | …
161 | 1 | 1148.64 | 159 | 0 | …
191 | 1 | 1065.03 | 105 | 0 | …
278 | 1 | 1228.07 | 248 | 0 | …
373 | 1 | 1466.68 | 223 | 0 | …
774 | 1 | 1311.44 | 143 | 0 | …
822 | 1 | 2005.66 | 328 | 0 | …
862 | 1 | 774.53 | 54 | 0 | …
872 | 1 | 290.62 | 171 | 0 | …
898 | 1 | 478.16 | 205 | 0 | …
We can exclude these customers from further analysis since we don't have enough information about their transaction histories. From a marketing perspective, we can run a separate campaign to encourage them to make their second purchase.
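A minimal sketch of the filtering step, dropping the 50 single-visit customers:

Code

df_final <- df_final %>%
  drop_na(periodicity) # removes the frequency = 1 customers

summary(df_final)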
customer_id frequency monetary recency
Min. : 1.0 Min. : 2.000 Min. : 71.49 Min. : 0.00
1st Qu.: 873.5 1st Qu.: 4.000 1st Qu.: 935.15 1st Qu.: 17.00
Median :1745.0 Median : 6.000 Median :1113.46 Median : 43.00
Mean :1748.1 Mean : 5.705 Mean :1114.74 Mean : 59.73
3rd Qu.:2624.5 3rd Qu.: 7.000 3rd Qu.:1291.68 3rd Qu.: 84.00
Max. :5034.0 Max. :14.000 Max. :2182.98 Max. :321.00
length periodicity
Min. : 1.0 Min. : 1.0
1st Qu.:199.0 1st Qu.: 28.0
Median :258.0 Median : 41.5
Mean :243.5 Mean : 53.3
3rd Qu.:303.0 3rd Qu.: 64.0
Max. :362.0 Max. :357.0
Scaling Variables
Clustering algorithms calculate the distance between data points, commonly using the Euclidean distance:

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

If we feed the data directly into the clustering algorithm, the distance along the `Monetary` variable will have far more influence than the `Frequency` variable, since `Monetary` has a much wider range of values. The algorithm will therefore give better results if all variables are on the same scale. To address this problem, we will standardize the data to z-scores (zero mean, unit standard deviation).
Code
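# A sketch of the scaling step; scale() standardizes each column to z-scores.
# Keeping customer_id as rownames (an assumption about the original chunk)
# lets us recover the ids from the k-means output later.
df_scaled_clean <- df_final %>%
  column_to_rownames("customer_id") %>%
  scale()

head(df_scaled_clean)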
frequency monetary recency length periodicity
1 2.3703012 -1.0316511 -0.94535279 1.4092428 -0.99217252
2 -1.2112215 0.9580634 1.22397328 -1.7079053 0.06566902
3 1.0272302 0.4331244 0.75783710 -0.4610461 -0.90705883
4 -1.6589118 -2.1101337 2.42517036 -2.1754775 0.55203294
5 0.1318495 -0.4673831 -0.78399796 0.5520270 0.06566902
6 -0.3158408 0.2556715 0.07656114 0.3701934 0.46691926
We can check the mean and the standard deviation used to scale each variable:
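They are stored by scale() as attributes:

Code

attr(df_scaled_clean, "scaled:center") # per-variable means
attr(df_scaled_clean, "scaled:scale")  # per-variable standard deviations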
frequency monetary recency length periodicity
5.705489 1114.744305 59.729596 243.497531 53.299593
frequency monetary recency length periodicity
2.233687 280.022208 55.777691 76.993455 41.121471
You can cross-check the values yourself.
Segmenting Customers
Clustering with LRFMP
Determine Number of Clusters
We will start segmenting customers using the K-Means algorithm. If you are not familiar with the algorithm, you may check the video from StatQuest for a complete step-by-step explanation of how each data point is assigned to a cluster.
The first thing we do is determine the optimal number of clusters, which can be evaluated using different metrics. The most common metric is the within sum of squares (WSS), which represents the distance between each data point and its cluster centroid. Another metric is the silhouette score.
The silhouette score is a metric to evaluate the performance of a clustering algorithm. It uses the compactness of individual clusters (intra-cluster distance) and the separation among clusters (inter-cluster distance) to measure an overall representative score of how well the clustering algorithm has performed.
The value of the Silhouette score ranges from -1 to 1 with the following interpretation:
- silhouette score = 1: points are perfectly assigned to their clusters and the clusters are easily distinguishable.
- silhouette score = 0: clusters are overlapping.
- silhouette score = -1: points are wrongly assigned to their clusters.
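For a single point $i$, the score compares $a(i)$, the mean distance from $i$ to the other points in its own cluster, with $b(i)$, the mean distance from $i$ to the points in the nearest other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

The overall silhouette score is the average of $s(i)$ over all points, which is what we compute from the silhouette() output below.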
Here we will show both the WSS and the silhouette score for each number of clusters.
Code
trial_clust <- map_df(2:20,
function(x) {
# cluster the data
set.seed(123)
clust_temp <- kmeans(df_scaled_clean, centers = x, iter.max = 1000)
# calculate the silhouette score
silhouette_score <- mean( silhouette(clust_temp$cluster, dist = dist(df_scaled_clean))[ , 3] )
data.frame(n_clust = x,
wss = clust_temp$tot.withinss,
silhouette_score = silhouette_score
)
}
)
# Highlight Optimal Number of Cluster
p_1 <- trial_clust %>%
filter(silhouette_score == max(silhouette_score))
First let's visualize the silhouette scores. The x-axis shows the number of clusters while the y-axis shows the respective silhouette score. Based on the result, if we determine the number of clusters using the silhouette score, we will choose 3 clusters since it has the highest silhouette score.
Code
# visualize result
trial_clust %>%
ggplot(aes(x = n_clust,
y = silhouette_score
)
) +
geom_line() +
geom_point(size = 2) +
geom_point(data = p_1, size = 5, color = "firebrick3") +
scale_x_continuous(breaks = seq(0, 30, 1)) +
scale_y_continuous(labels = comma_format()) +
theme_minimal() +
labs(x = "Number of Cluster",
y = "Silhouette Score"
)
Finding the optimal number of clusters using WSS is a bit trickier since there is no exact value to choose. Instead, we look for the elbow: the point where the decrease in WSS is no longer significant. The following figure is an example of determining the number of clusters using the elbow method.
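Our own WSS curve can be plotted the same way as the silhouette plot (a sketch):

Code

trial_clust %>%
  ggplot(aes(x = n_clust, y = wss)) +
  geom_line() +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(0, 30, 1)) +
  scale_y_continuous(labels = comma_format()) +
  theme_minimal() +
  labs(x = "Number of Cluster",
       y = "Within Sum of Squares"
       )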
Below is the result for our clustering. There is no clear elbow since the curve is not as steep as in the previous figure. You may choose between 4 and 6 clusters as the optimal number, since beyond 6 clusters the decrease in WSS becomes very small compared to the initial decrease from 2 to 3 clusters.
Clustering
Determining the optimal number of clusters is always hard, especially for higher-dimensional datasets with many variables. It is also influenced by our domain knowledge of the business implications of the result. Thus, one should not stick to a single number but rather focus on the resulting clusters. For this use case, we will use 4 clusters since it gives a simpler result and an easier interpretation: we don't have to explain a large number of clusters, and it gives us more room to group outliers or high/low-performing customers together.
Code
# Clustering with optimal number of cluster
set.seed(123)
k_clust <- kmeans(df_scaled_clean, centers = 4, iter.max = 1000)
# Assign cluster to each customer
list_cluster <- data.frame(customer_id = names(k_clust$cluster),
cluster = k_clust$cluster
)
df_out <- df_final %>%
mutate(customer_id = as.character(customer_id)) %>%
inner_join(list_cluster, by = join_by(customer_id)) %>%
mutate(cluster = as.character(cluster))
head(df_out, 10)
customer_id <chr> | frequency <int> | monetary <dbl> | recency <dbl> | length <dbl> | …
---|---|---|---|---|---
1 | 11 | 825.8591 | 7 | 352 | …
2 | 3 | 1383.0233 | 128 | 112 | …
3 | 8 | 1236.0287 | 102 | 208 | …
4 | 2 | 523.8600 | 195 | 76 | …
5 | 6 | 983.8667 | 16 | 286 | …
6 | 5 | 1186.3380 | 64 | 272 | …
7 | 3 | 331.7933 | 253 | 62 | …
8 | 10 | 1202.4760 | 22 | 338 | …
9 | 6 | 892.9250 | 78 | 251 | …
10 | 6 | 1177.9717 | 33 | 160 | …
Profiling Customers
The next step after assigning a cluster segment to each customer is profiling the different segments and identifying the differences between them. We will use the centroid, i.e. the mean of each variable within each cluster, to profile the members of each cluster.
Code
df_out %>%
group_by(cluster) %>%
summarise(count_member = n_distinct(customer_id),
across(frequency:periodicity, mean)
) %>%
mutate(percent_member = count_member/sum(count_member)) %>%
relocate(percent_member, .after = count_member) %>%
arrange(desc(count_member)) %>%
# format the number
mutate_at(vars(contains("percent")), percent) %>%
mutate_if(is.numeric, comma)
cluster <chr> | count_member <chr> | percent_member <chr> | frequency <chr> | …
---|---|---|---|---
3 | 1,262 | 36.65% | 5.30 | …
4 | 1,229 | 35.70% | 7.81 | …
2 | 578 | 16.79% | 3.83 | …
1 | 374 | 10.86% | 3.06 | …
Let's try to characterize each cluster based on its LRFMP values. You can assign different cluster names based on your own interpretation.
Cluster 4: Most Loyal
Cluster 4 has a high number of members, with 36% of customers belonging to this cluster. They are identified by the highest `Frequency` of visits and are also the most `Recent` visitors to the store. They are also the most loyal, as indicated by the `Length` variable. Based on the `Periodicity` value, they visit our store once a month on average. We must keep this segment of customers since they are the most valuable.
Cluster 3: Regular
Cluster 3 is indicated by a high `Frequency` of visits and is quite loyal, although it has a lower `Length` compared to cluster 4. Based on the `Periodicity` value, they visit our store every 2 months on average.
Cluster 2: Hibernating
Cluster 2 is indicated by its `Recency`, which shows that their last visit to our store was around 4 months ago. This cluster may require special treatment to prevent them from churning to a competitor. Although they have visited us around 4 times based on their `Frequency`, they are also the least loyal, as shown by their `Length` value.
Cluster 1: Seasonal
Cluster 1 has the lowest number of members, with only 10% of customers belonging to this cluster. They are identified by the lowest `Frequency` of visits (only 3 times) and, based on the `Periodicity` value, visit our store only around every 3-4 months on average.
Visualize Cluster
We may also want to visualize the clustering result to help us interpret the data more easily and see if there are any overlapping segments. However, since we have more than 2 variables as the basis of our segmentation, we cannot plot the data directly. We will use Principal Component Analysis (PCA) to reduce our data into fewer dimensions so they can be visualized. If you are interested in learning more about PCA, you can watch the following video.
Code
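# A sketch of the biplot; fviz_pca_biplot() from factoextra is an assumption
# about how the original figure was drawn.
pca_res <- prcomp(df_scaled_clean)

fviz_pca_biplot(pca_res,
                geom.ind = "point",         # draw customers as points
                habillage = df_out$cluster, # color points by cluster
                col.var = "firebrick3",     # variable arrows and labels in red
                repel = TRUE)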
The x-axis shows the first dimension (PC1) from the PCA while the y-axis shows the second dimension (PC2). PC1 carries 41.8% of the information and PC2 carries 26.8%, so the plot contains around 69% of the information in the data while the remaining 31% is not presented. That's why some clusters appear to overlap: the information that would separate them is not shown. The red arrows and labels indicate the direction of each LRFMP variable; data points lying in the direction of an arrow have higher values of that variable, and vice versa. For example, the frequency arrow points to the upper left of the plot, and cluster 4 is located in that direction. We know from the previous table that cluster 4 has the highest `Frequency`. Meanwhile, cluster 1, located opposite the frequency arrow, has the lowest `Frequency`.
From the plot above we can see that the four customer segments are almost perfectly separated, with some segments still overlapping. Cluster 2 is the least recent (high `Recency`) and also the least loyal (low `Length`). Cluster 4 has the highest `Frequency` and lowest `Recency`, while cluster 1 has the lowest `Frequency`.
Clustering with LRFMP and Product Preferences
Product Preferences
We finished segmenting customers using the LRFMP model in the previous section and identified several customer segments from the result. In this section, we will add more variables to consider, namely product preferences. By knowing the preferred products of each segment, we can provide better and more personalized promotions and campaigns to approach the customers.
Let’s once again look at the initial transaction data.
transaction_id <int> | product_id <int> | customer_id <int> | …
---|---|---|---
1 | 2 | 2950 | …
2 | 3 | 3120 | …
3 | 37 | 402 | …
4 | 88 | 3135 | …
5 | 78 | 787 | …
6 | 25 | 2339 | …
There are several pieces of information about the purchased product: `brand`, `product_line`, `product_class`, and `product_size`.
Let's check the product preference based on the product line. We will look at the unique values of the product line.
There is a product line with an empty string ("") value. We will transform this into "Other".
Code
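# A sketch of the recoding step for the empty product_line value.
df_clean <- df_clean %>%
  mutate(product_line = if_else(product_line == "", "Other", product_line))

df_clean %>%
  distinct(product_line)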
product_line <chr>
---
Standard
Road
Mountain
Touring
Other
To represent the product preference of each customer, we will get the `Monetary` spending for each product line. Knowing this will help us understand which product line each segment prefers to buy and at what price range.
Code
df_agg_3 <- df_clean %>%
group_by(customer_id, product_line) %>%
summarise(count_visit = n_distinct(transaction_date),
sales = sum(list_price),
.groups = "drop"
) %>%
mutate(frequency = count_visit,
monetary = sales/count_visit
) %>%
select(customer_id, product_line, monetary) %>%
pivot_wider(names_from = product_line,
values_from = monetary
)
head(df_agg_3, 10)
customer_id <int> | Mountain <dbl> | Road <dbl> | Standard <dbl> | Touring <dbl> | …
---|---|---|---|---|---
1 | 688.63 | 1334.0767 | 627.6557 | NA | …
2 | NA | NA | 1383.0233 | NA | …
3 | 574.64 | 1618.4000 | 1114.5975 | NA | …
4 | NA | NA | 523.8600 | NA | …
5 | 688.63 | 745.9767 | 1488.3200 | NA | …
6 | NA | NA | 1186.3380 | NA | …
7 | 574.64 | NA | 210.3700 | NA | …
8 | NA | 774.5300 | 1235.9050 | 1362.99 | …
9 | NA | 792.9000 | 912.9300 | NA | …
10 | 574.64 | 792.9000 | 1212.0700 | NA | …
Since most customers will not buy every product line, we will replace the missing values with 0.
Code
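# A sketch: a missing value means the customer never bought that product
# line, so we replace it with 0.
df_agg_3 <- df_agg_3 %>%
  mutate(across(Mountain:Other, ~ replace_na(.x, 0)))

head(df_agg_3, 10)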
customer_id <int> | Mountain <dbl> | Road <dbl> | Standard <dbl> | Touring <dbl> | …
---|---|---|---|---|---
1 | 688.63 | 1334.0767 | 627.6557 | 0.00 | …
2 | 0.00 | 0.0000 | 1383.0233 | 0.00 | …
3 | 574.64 | 1618.4000 | 1114.5975 | 0.00 | …
4 | 0.00 | 0.0000 | 523.8600 | 0.00 | …
5 | 688.63 | 745.9767 | 1488.3200 | 0.00 | …
6 | 0.00 | 0.0000 | 1186.3380 | 0.00 | …
7 | 574.64 | 0.0000 | 210.3700 | 0.00 | …
8 | 0.00 | 774.5300 | 1235.9050 | 1362.99 | …
9 | 0.00 | 792.9000 | 912.9300 | 0.00 | …
10 | 574.64 | 792.9000 | 1212.0700 | 0.00 | …
Let's combine this information with our previous LRFMP data.
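A minimal sketch of the join:

Code

df_final_2 <- df_final %>%
  inner_join(df_agg_3, by = "customer_id")

head(df_final_2, 10)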
customer_id <int> | frequency <int> | monetary <dbl> | recency <dbl> | length <dbl> | …
---|---|---|---|---|---
1 | 11 | 825.8591 | 7 | 352 | …
2 | 3 | 1383.0233 | 128 | 112 | …
3 | 8 | 1236.0287 | 102 | 208 | …
4 | 2 | 523.8600 | 195 | 76 | …
5 | 6 | 983.8667 | 16 | 286 | …
6 | 5 | 1186.3380 | 64 | 272 | …
7 | 3 | 331.7933 | 253 | 62 | …
8 | 10 | 1202.4760 | 22 | 338 | …
9 | 6 | 892.9250 | 78 | 251 | …
10 | 6 | 1177.9717 | 33 | 160 | …
Scaling Variables
Code
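# A sketch of the scaling step for the extended feature set, mirroring the
# earlier chunk.
df_scaled_clean_2 <- df_final_2 %>%
  column_to_rownames("customer_id") %>%
  scale()

head(df_scaled_clean_2)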
frequency monetary recency length periodicity Mountain
1 2.3703012 -1.0316511 -0.94535279 1.4092428 -0.99217252 3.0480029
2 -1.2112215 0.9580634 1.22397328 -1.7079053 0.06566902 -0.3603551
3 1.0272302 0.4331244 0.75783710 -0.4610461 -0.90705883 2.4838121
4 -1.6589118 -2.1101337 2.42517036 -2.1754775 0.55203294 -0.3603551
5 0.1318495 -0.4673831 -0.78399796 0.5520270 0.06566902 3.0480029
6 -0.3158408 0.2556715 0.07656114 0.3701934 0.46691926 -0.3603551
Road Standard Touring Other
1 1.02556051 -1.32913484 -0.6352812 -0.211473
2 -1.12511420 0.81252895 -0.6352812 -0.211473
3 1.48392022 0.05147185 -0.6352812 -0.211473
4 -1.12511420 -1.62342271 -0.6352812 -0.211473
5 0.07748021 1.11107240 -0.6352812 -0.211473
6 -1.12511420 0.25487484 -0.6352812 -0.211473
Determine Number of Clusters
Let’s try to find the optimal number of clusters for the new dataset.
Code
trial_clust <- map_df(2:20,
function(x) {
# cluster the data
set.seed(123)
clust_temp <- kmeans(df_scaled_clean_2, centers = x, iter.max = 1000)
# calculate the silhouette score
silhouette_score <- mean( silhouette(clust_temp$cluster, dist = dist(df_scaled_clean_2))[ , 3] )
data.frame(n_clust = x,
wss = clust_temp$tot.withinss,
silhouette_score = silhouette_score
)
}
)
# Highlight Optimal Number of Cluster
p_1 <- trial_clust %>%
filter(silhouette_score == max(silhouette_score))
Let's visualize the silhouette scores. The x-axis shows the number of clusters while the y-axis shows the respective silhouette score.
Code
# visualize result
trial_clust %>%
ggplot(aes(x = n_clust,
y = silhouette_score
)
) +
geom_line() +
geom_point(size = 2) +
geom_point(data = p_1, size = 5, color = "firebrick3") +
scale_x_continuous(breaks = seq(0, 30, 1)) +
scale_y_continuous(labels = comma_format()) +
theme_minimal() +
labs(x = "Number of Cluster",
y = "Silhouette Score"
)
Let's visualize the WSS value for each number of clusters.
Clustering
Based on the silhouette score, the optimal number of clusters is 3, while from the WSS we may choose 8 clusters since there is a big decrease from 7 to 8 clusters. We will use 8 clusters.
Code
# Clustering with optimal number of cluster
set.seed(123)
k_clust <- kmeans(df_scaled_clean_2, centers = 8, iter.max = 1000)
# Assign cluster to each customer
list_cluster <- data.frame(customer_id = names(k_clust$cluster),
cluster = k_clust$cluster
)
df_out <- df_final_2 %>%
mutate(customer_id = as.character(customer_id)) %>%
inner_join(list_cluster, by = join_by(customer_id)) %>%
mutate(cluster = as.character(cluster))
head(df_out, 10)
customer_id <chr> | frequency <int> | monetary <dbl> | recency <dbl> | length <dbl> | …
---|---|---|---|---|---
1 | 11 | 825.8591 | 7 | 352 | …
2 | 3 | 1383.0233 | 128 | 112 | …
3 | 8 | 1236.0287 | 102 | 208 | …
4 | 2 | 523.8600 | 195 | 76 | …
5 | 6 | 983.8667 | 16 | 286 | …
6 | 5 | 1186.3380 | 64 | 272 | …
7 | 3 | 331.7933 | 253 | 62 | …
8 | 10 | 1202.4760 | 22 | 338 | …
9 | 6 | 892.9250 | 78 | 251 | …
10 | 6 | 1177.9717 | 33 | 160 | …
Profiling Customers
The next step after assigning a cluster segment to each customer is profiling the different segments and identifying the differences between them. For the LRFMP metrics we will calculate the mean, while for the monetary value of each product line we will use the median. The reason for using the median is that, since the median reflects the middle value of the distribution, if 50% or more of the customers have a monetary value of 0 then the median will be 0. If we insisted on using the mean, the monetary value would not be 0 but close to it, making it harder to read and interpret.
The following are the characteristics of each of our segments.
Code
df_out %>%
group_by(cluster) %>%
summarise(count_member = n_distinct(customer_id),
across(frequency:periodicity, mean),
across(Mountain:Other, median)
) %>%
mutate(percent_member = count_member/sum(count_member)) %>%
relocate(percent_member, .after = count_member) %>%
arrange(desc(count_member)) %>%
# format the number
mutate_at(vars(contains("percent")), percent) %>%
mutate_if(is.numeric, function(x) x %>%
round(digits = 1) %>%
comma()
)
cluster <chr> | count_member <chr> | percent_member <chr> | frequency <chr> | …
---|---|---|---|---
5 | 695 | 20.19% | 6.9 | …
3 | 649 | 18.85% | 5.5 | …
7 | 637 | 18.50% | 7.1 | …
8 | 419 | 12.17% | 4.0 | …
4 | 366 | 10.63% | 6.6 | …
1 | 329 | 9.56% | 3.6 | …
6 | 228 | 6.62% | 2.8 | …
2 | 120 | 3.49% | 6.6 | …
Based on product preference, we can identify several segments that buy a certain product line. The following 3 clusters have almost similar LRFMP metrics but differ in their product preferences:

- Cluster 2: Buy Other. This small segment is the only one with a significant number of customers who purchased the `Other` product line.
- Cluster 4: Buy Mountain. This segment has a significant number of customers who purchased the `Mountain` product line.
- Cluster 7: Buy Touring. This segment has a significant number of customers who purchased the `Touring` product line.
The above clusters have LRFMP profiles similar to Cluster 5: Most Loyal, the only difference being that cluster 5 doesn't have a particularly high number of customers buying product lines other than `Road` and `Standard`. We can also observe that Cluster 3: Low-End Buyer tends to buy the `Road` and `Standard` product lines at lower prices compared to other segments. The segment Cluster 1: Hibernating is still observed, with customers whose most recent visit was on average around 5 months ago based on their `Recency` value; they also tend to buy the `Road` and `Standard` product lines at lower prices compared to other segments.
Visualize Cluster
Let's try to visualize the clusters. We can see that most of the clusters still overlap, since PC1 and PC2 only provide around 40% of the information, so the plot is not very informative.
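Reusing the earlier biplot approach (a sketch):

Code

pca_res_2 <- prcomp(df_scaled_clean_2)

fviz_pca_biplot(pca_res_2,
                geom.ind = "point",
                habillage = df_out$cluster,
                col.var = "firebrick3",
                repel = TRUE)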
Conclusion
We have learned how to segment retail customers using the K-Means clustering algorithm with the LRFMP model and the customers' product purchases. Using the LRFMP model, we identified very loyal customers with a high frequency of visits, as well as customers who haven't visited our store in quite a long time. By integrating the LRFMP model with the customers' product purchases, we can identify customers in more detailed segments, especially regarding their product preferences. This is important for marketing and the business as a whole, since we can target specific customers based on their preferences, giving them more personalized promotions and offers.