Customer Segments (or Market Segmentation) allow the companies to be able to utilize their resources (time, finance) to serve their’s goals: increasing sales, increasing profits, retaining important customers as well as implementing marketing campaigns more effectively, which based on the understanding of their customer’s behavior, habits, and preferences.
RFM is a method used for analyzing customer value. It is commonly used in database marketing and direct marketing and has received particular attention in retail and professional services industries.
RFM stands for the three dimensions: - R (Recency) The value of how recently a customer purchased at the establishment. - F (Frequency) How frequent the customer’s transactions are at the establishment. - M (Monetary) The dollar (or pounds in our case) value of all the transactions that the customer made at the establishment.
F and M are inputs for a period (1 year, a month, a period). Particularly R depends on the modeler preference and it does not affect the model’s results.
Data can be download here. First, let’s have a look at the data:
# Load data:
rm(list = ls())
library(tidyverse)
library(magrittr)
library(knitr)
my_df <- read_csv("/Users/jennynguyen/Downloads/data.csv")
# See the first 10 transaction with the first 4 columns
set.seed(29)
my_df %>%
sample_n(10) %>%
select(1:4) %>%
kable()| InvoiceNo | StockCode | Description | Quantity |
|---|---|---|---|
| 561197 | 22120 | WELCOME WOODEN BLOCK LETTERS | 4 |
| 542174 | 21326 | AGED GLASS SILVER T-LIGHT HOLDER | 36 |
| 553009 | 23191 | BUNDLE OF 3 RETRO NOTE BOOKS | 2 |
| 576837 | 84661B | BLACK SQUARE TABLE CLOCK | 1 |
| 539836 | 85232D | SET/3 DECOUPAGE STACKING TINS | 1 |
| 556187 | 47590A | BLUE HAPPY BIRTHDAY BUNTING | 3 |
| 578664 | 23267 | SET OF 4 SANTA PLACE SETTINGS | 2 |
| 577358 | 21240 | BLUE POLKADOT CUP | 1 |
| 575324 | 23130 | MISTLETOE HEART WREATH GREEN | 4 |
| 554512 | 21390 | FILIGRIS HEART WITH BUTTERFLY | 1 |
# See the first 10 transaction with the last 4 columns
my_df %>%
sample_n(10) %>%
select(5:8) %>%
kable()| InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|
| 9/20/2011 11:32 | 1.85 | 13767 | United Kingdom |
| 3/28/2011 11:34 | 1.25 | 17841 | United Kingdom |
| 10/12/2011 12:54 | 2.55 | 12691 | France |
| 6/23/2011 11:38 | 0.42 | 13268 | United Kingdom |
| 12/15/2010 15:50 | 8.50 | 17980 | United Kingdom |
| 5/27/2011 12:41 | 8.50 | 16275 | United Kingdom |
| 12/7/2011 14:01 | 1.65 | 16426 | United Kingdom |
| 8/1/2011 11:23 | 2.10 | 14408 | United Kingdom |
| 6/23/2011 18:44 | 1.25 | 15023 | United Kingdom |
| 7/7/2011 16:30 | 4.13 | NA | United Kingdom |
The variables are really descriptive. We should also make a preliminary assessment of the quality of the data. First of all, let have a look at the missing rate:
## # A tibble: 1 x 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0.268 0 0 0 24.9
## # … with 1 more variable: Country <dbl>
Approximately 27% of stockcode is without description. Similarly, ~ 25% customters are not labled with IDs, but it’s not quite important in this situation.
The measures in this data set - if they are quantitative variables, must be non-negative. So we also need to check whether we have any negative record or not.
negative_dectect <- function(x) {100*sum(x <= 0) / length(x)}
sapply(my_df %>% select(Quantity, UnitPrice), negative_dectect)## Quantity UnitPrice
## 1.9604768 0.4644691
Because we need an exact time to calculate R so that InvoiceDate will need to be modifed.
library(lubridate)
my_df %>%
mutate(time_ymd_hm = mdy_hm(InvoiceDate),
time_hour = hour(time_ymd_hm),
time_min = minute(time_ymd_hm),
w_day = wday(time_ymd_hm, label = TRUE, abbr = TRUE),
time_mon = month(time_ymd_hm, label = TRUE, abbr = TRUE),
time_ymd = InvoiceDate %>% str_split(" ", simplify = TRUE) %>% data.frame() %>% pull(X1) %>% mdy) -> my_dfThis analysis is taken to look for some insights. For example, sales (accurate to the minute) have two spike times over 70,000 records. These are unusual:
library(hrbrthemes)
theme_set(theme_modern_rc())
my_df %>%
group_by(time_ymd_hm) %>%
summarise(sales = sum(Quantity)) %>%
ungroup() -> sales_byTime_hm
sales_byTime_hm %>%
ggplot(aes(time_ymd_hm, sales)) +
geom_line() +
labs(title = "Figure 1: Unit Sales by Min", x = "Time", y = "Quantity")Sales by day tended to increase in the last stage. On the other hand, there are two days of abnormal sales (which is predictable):
my_df %>%
group_by(time_ymd) %>%
summarise(sales = sum(Quantity)) %>%
ungroup() -> sales_byTime
sales_byTime %>%
ggplot(aes(time_ymd, sales)) +
geom_line() +
geom_point(color = "firebrick") +
labs(title = "Figure 2: Unit Sales by Day", x = "Time", y = "Quantity")From Figure 2 we also see that there is another anomaly: there are two times where the data is not continuous. Specifically, from 2010-12-22 to 2011-01-03, there are 12 consecutive days of data missing.
sales_byTime %>%
mutate(lag1 = lag(time_ymd, n = 1L)) %>%
mutate(duration_date = time_ymd - lag1) %>%
mutate(duration_date = as.numeric(duration_date)) %>%
slice(which.max(duration_date)) ## # A tibble: 1 x 4
## time_ymd sales lag1 duration_date
## <date> <dbl> <date> <dbl>
## 1 2011-01-04 8639 2010-12-23 12
my_df %>%
mutate(money = Quantity*UnitPrice) -> my_df
my_df %>%
group_by(time_ymd_hm) %>%
summarise(moneySales = sum(money)) %>%
ungroup() -> sales_byTime_hm_money
sales_byTime_hm_money %>%
ggplot(aes(time_ymd_hm, moneySales)) +
geom_line() +
labs(title = "Figure 3: Monetary Sales by Min", x = "Time", y = "")my_df %>%
group_by(time_ymd) %>%
summarise(moneySales = sum(money)) %>%
ungroup() -> sales_byTime_money
sales_byTime_money %>%
ggplot(aes(time_ymd, moneySales)) +
geom_line() +
geom_point(color = "firebrick") +
labs(title = "Figure 4: Monetary Sales by Day", x = "Time", y = "")Doanh thu cả theo số lượng bán và tiền ở 4 tháng cuối năm cao hơn hẳn các tháng còn lại trong năm:
my_df %>%
group_by(time_mon) %>%
summarise_each(funs(sum), Quantity) %>%
mutate(Quantity = Quantity / 1000) %>%
ggplot(aes(time_mon, Quantity)) +
geom_col() +
theme(panel.grid.major.x = element_blank()) +
labs(title = "Figure 5: Unit Sales in thousands by Month", x = "Time", y = "Quanlity") +
scale_y_continuous(limits = c(0, 800))my_df %>%
group_by(time_mon) %>%
summarise_each(funs(sum), money) %>%
mutate(money = money / 1000) %>%
ggplot(aes(time_mon, money)) +
geom_col() +
theme(panel.grid.major.x = element_blank()) +
labs(title = "Figure 6: Monetary Sales in thousands by Month", x = "Time", y = "")More than 4000 items are being sold, but the revenue distribution is not even:
my_df %>%
group_by(Description) %>%
summarise(sales = sum(money)) %>%
ungroup() %>%
arrange(-sales) %>%
mutate(Description = factor(Description, levels = Description)) %>%
mutate(total = sum(sales)) %>%
mutate(money_percent = sales / total) %>%
mutate(cum_money = cumsum(money_percent)) -> moneySales_Item
moneySales_Item %>%
ggplot(aes(Description, sales)) +
geom_col() +
theme(panel.grid.major.x = element_blank()) +
theme(axis.text.x = element_blank()) +
labs(title = "Figure 6: Money Sales by Product", x = "", y = "")Once again we can see the presence of principle 80 - 20: 80% of the sales of this online retailer comes from 827 product codes (corresponding to 20.59% of product codes):
moneySales_Item %>%
filter(cum_money <= 0.8) -> top80_sales
top80_sales %>% nrow() / nrow(moneySales_Item)## [1] 0.2059776
Products which bring the most revenue:
| Description | sales | money_percent | cum_money |
|---|---|---|---|
| DOTCOM POSTAGE | 206248.77 | 0.0193358 | 0.0193358 |
| REGENCY CAKESTAND 3 TIER | 174484.74 | 0.0163579 | 0.0356937 |
| PAPER CRAFT , LITTLE BIRDIE | 168469.60 | 0.0157940 | 0.0514877 |
| WHITE HANGING HEART T-LIGHT HOLDER | 106292.77 | 0.0099649 | 0.0614526 |
| PARTY BUNTING | 99504.33 | 0.0093285 | 0.0707812 |
| JUMBO BAG RED RETROSPOT | 94340.05 | 0.0088444 | 0.0796255 |
The company may use this information for its business purposes. For example, the company may prioritize shipping for orders listed above or prioritize preparing inventory for these codes. In other words, it is necessary to focus on logistics (logistics - warehousing - transportation) for the brands that bring up to 80% of the revenue for the company.
Apply the K-means Clustering for Customer Segmentation with RFM method:
y <- as.duration(ymd_hm("2011-12-31 24:59") - my_df$time_ymd_hm) %>% as.numeric()
y <- round(y / (3600*24), 0)
# Create Recency:
my_df %>% mutate(recency = y) -> my_df
# Purchase ammount for individual customers:
my_df %>%
group_by(CustomerID) %>%
summarise_each(funs(sum), money) %>%
ungroup() -> df_money
# R:
my_df %>%
group_by(CustomerID) %>%
summarise_each(funs(min), recency) %>%
ungroup() -> df_recency
# F:
my_df %>%
group_by(CustomerID) %>%
count() %>%
ungroup() %>%
rename(freq = n) -> df_freq
# Data for EDA:
df_money %>%
full_join(df_recency, by = "CustomerID") %>%
full_join(df_freq, by = "CustomerID") %>%
mutate(CustomerID = as.character(CustomerID)) -> final_df
final_df %>%
head() %>%
kable()| CustomerID | money | recency | freq |
|---|---|---|---|
| 12346 | 77183.60 | 348 | 1 |
| 12347 | 4310.00 | 24 | 182 |
| 12348 | 1797.24 | 97 | 31 |
| 12349 | 1757.55 | 41 | 73 |
| 12350 | 334.40 | 332 | 17 |
| 12352 | 2506.04 | 58 | 85 |
# Scaling dataset:
final_df %>%
mutate_if(is.numeric, function(x) {(x - min(x)) / (max(x) - min(x))}) %>%
select(-CustomerID) -> final_df_scaledOptimal K will be chosen following this method: Elbow Method:
set.seed(29)
wss <- sapply(1:10,
function(k){kmeans(final_df_scaled %>% sample_frac(0.2),
k, nstart = 30)$tot.withinss})
u <- data.frame(k = 1:10, WSS = wss)
u %>%
ggplot(aes(k, WSS)) +
geom_line() +
geom_point() +
geom_point(data = u %>% filter(k == 3), color = "red", size = 3) +
scale_x_continuous(breaks = seq(1, 10, by = 1)) +
labs(title = "Figure 7: The Optimal Number of Clusters, Elbow Method", x = "Number of Clusters (K)") +
theme(panel.grid.minor = element_blank())# Cluster with K = 3
set.seed(123)
km.res <- kmeans(final_df_scaled, 3, nstart = 30)
final_df %>%
mutate(Group = km.res$cluster) %>%
mutate(Group = paste("Group", Group)) -> final_df_clustered
# Groups of customer:
final_df_clustered %>%
group_by(Group) %>%
summarise_each(funs(mean), money, recency, freq) %>%
ungroup() %>%
mutate_if(is.numeric, function(x) {round(x, 0)}) %>%
arrange(-money) ## # A tibble: 3 x 4
## Group money recency freq
## <chr> <dbl> <dbl> <dbl>
## 1 Group 2 3308 54 165
## 2 Group 1 799 176 41
## 3 Group 3 607 316 24
Based on the results of the K-Means Clustering, customers will be classified into the following 3 groups:
Group 2 Group has those customers who spend a lot, make the purchase very often, and have the smallest Recency. This group is called * Champions *. The way of “taking care” of these customers is well described [here] (https://www.putler.com/rfm-analysis/). On average, each customer in this group spends £ 3308.
Group 1 This is Loyal Customers. This is a group of customers with the potential to turn into Champions if the company knows how to implement customer care and promotion strategies appropriately.
Group 3 Less purchase, and bring less money to the company.
Before we can use the above Insights to cater to the business strategies, we should consider these following points: the K-means clustering algorithm is very sensitive to outliers. Although the data has been scaled to minimize the impact of these outliers, it does not guarantee that the results will not be deformed.
final_df %>%
ggplot(aes(CustomerID, money)) +
geom_col() +
theme(panel.grid.minor.x = element_blank()) +
theme(panel.grid.major.x = element_blank()) +
theme(panel.grid.minor.y = element_blank()) +
theme(axis.text.x = element_blank()) +
labs(title = "Figure 8: Spending by Customer", y = "")Figure 8 shows some customers with massive spending. They may not be individual customers but maybe the form of a small store to buy and resell. Similarly, the buying frequency (Figure 9, the unit on the Y-axis is 1000):
final_df %>%
mutate(freq = freq / 1000) %>%
ggplot(aes(CustomerID, freq)) +
geom_col() +
theme(panel.grid.minor.x = element_blank()) +
theme(panel.grid.major.x = element_blank()) +
theme(panel.grid.minor.y = element_blank()) +
theme(axis.text.x = element_blank()) +
labs(title = "Figure 9: Frequency by Customer", y = "")Because the K-Means Clustering is very sensitive to outliers, so we should split these outliers for separate analysis while focusing on those common customers.
# Identify Outlier:
outlier_label <- function(x) {
a <- mean(x)
b <- sd(x)
th1 <- a - 3*b
th2 <- a + 3*b
y <- case_when(x >= th1 & x <= th2 ~ "Normal", TRUE ~ "Outlier")
return(y)
}
# Only apply Normal Observation for K-means Clustering:
final_df %>%
mutate(nor_money = outlier_label(money), nor_freq = outlier_label(freq)) %>%
filter(nor_money == "Normal", nor_freq == "Normal") %>%
select(1:4) -> final_df_normal
final_df_normal %>%
mutate_if(is.numeric, function(x) {(x - min(x)) / (max(x) - min(x))}) -> final_df_normal_scaledRe-apply K means clustering - Find the optimal K:
set.seed(29)
wss <- sapply(1:10,
function(k){kmeans(final_df_normal_scaled %>% select(-CustomerID) %>% sample_frac(0.2),
k, nstart = 30)$tot.withinss})
u <- data.frame(k = 1:10, WSS = wss)
u %>%
ggplot(aes(k, WSS)) +
geom_line() +
geom_point() +
geom_point(data = u %>% filter(k == 4), color = "red", size = 3) +
scale_x_continuous(breaks = seq(1, 10, by = 1)) +
labs(title = "Figure 10: The Optimal Number of Clusters, Elbow Method",
subtitle = "Outliers are are removed from sample.",
x = "Number of Clusters (K)") +
theme(panel.grid.minor = element_blank())After removing outliers, optimal K is 4.
# Grouping with k = 4:
set.seed(123)
km.res4 <- kmeans(final_df_normal_scaled %>% select(-CustomerID), 4, nstart = 30)
final_df_normal %>%
mutate(Group = km.res4$cluster) %>%
mutate(Group = paste("Group", Group)) -> final_df_clustered# Groups of customers description:
final_df_clustered %>%
group_by(Group) %>%
summarise_each(funs(mean), money, recency, freq) %>%
ungroup() %>%
mutate_if(is.numeric, function(x) {round(x, 0)}) %>%
arrange(-money) %>%
kable()| Group | money | recency | freq |
|---|---|---|---|
| Group 4 | 2668 | 41 | 136 |
| Group 2 | 999 | 103 | 51 |
| Group 3 | 726 | 214 | 35 |
| Group 1 | 566 | 331 | 24 |
Calculate the proportion of revenue from these customer groups:
final_df_clustered %>%
group_by(Group) %>%
summarise_each(funs(sum, mean, median, min, max, sd, n()), money) %>%
ungroup() %>%
mutate(per_sale = round(100*sum / sum(sum), 2)) -> sale_group
library(ggthemes)
sale_group %>%
ggplot(aes(reorder(Group, per_sale), per_sale, fill = Group, color = Group)) +
geom_col(width = 0.5, show.legend = FALSE) +
coord_flip() +
geom_text(aes(label = paste(per_sale, paste0(paste0("(", "%")), ")")),
hjust = -0.05, color = "white", size = 5) +
scale_y_continuous(limits = c(0, 90), expand = c(0.01, 0)) +
scale_fill_tableau() +
scale_color_tableau() +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
theme(panel.grid.major = element_blank()) +
theme(panel.grid.minor = element_blank()) +
labs(x = NULL, title = "Figure 11: Share of Sales by Customer Group")Group 4 accounts for 49.1% of total customers and is the group that brings nearly 75% of revenue to the company:
sale_group %>%
select(Group, n) %>%
mutate(total = sum(n)) %>%
mutate(label = 100*n / total) %>%
mutate(label = paste(round(label, 1), "%")) %>%
ggplot(aes(Group, n, fill = Group, color = Group)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_text(aes(label = label), color = "white", vjust = 1.4, size = 5) +
scale_fill_tableau() +
scale_color_tableau() +
theme(panel.grid.minor = element_blank()) +
theme(panel.grid.major.x = element_blank()) +
labs(x = NULL, y = NULL, title = "Figure 12: Number of Customers by Group")To answer the question a consumer whose behavior is described by R, F and M, which group will this customer be?. Many approaches/models can be applied to this problem and one of them is to use classification algorithms, for example, Random Forest:
# Data for ML:
df_forML <- final_df_clustered %>%
select(- CustomerID) %>%
mutate(Group = as.factor(Group))
# Split data into training, testing:
library(caret)
set.seed(1)
id <- createDataPartition(df_forML$Group, p = 0.8, list = FALSE)
df_train <- df_forML[id, ]
df_test <- df_forML[-id, ]
# Train Random Forest:
set.seed(1)
my_rf <- train(Group ~., method = "rf", data = df_train)## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
As the business getting imput for a new customer with M = 1757.55, R = 41, F = 73, which segmentation that customer will be:
| money | recency | freq |
|---|---|---|
| 1757.55 | 41 | 73 |
That customer will be in group 4:
## [1] "Group 4"
Model evaluation:
## Confusion Matrix and Statistics
##
## Reference
## Prediction Group 1 Group 2 Group 3 Group 4
## Group 1 99 0 0 0
## Group 2 0 217 0 1
## Group 3 0 0 123 0
## Group 4 0 0 0 424
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9936, 1)
## No Information Rate : 0.4919
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9983
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Group 1 Class: Group 2 Class: Group 3
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 0.9985 1.0000
## Pos Pred Value 1.0000 0.9954 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.1146 0.2512 0.1424
## Detection Rate 0.1146 0.2512 0.1424
## Detection Prevalence 0.1146 0.2523 0.1424
## Balanced Accuracy 1.0000 0.9992 1.0000
## Class: Group 4
## Sensitivity 0.9976
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 0.9977
## Prevalence 0.4919
## Detection Rate 0.4907
## Detection Prevalence 0.4907
## Balanced Accuracy 0.9988
# Statistical description for the prediction on test data:
df_test %>%
mutate(GroupPredicted = pred) %>%
group_by(GroupPredicted) %>%
summarise_each(funs(mean), money, recency, freq) %>%
mutate_if(is.numeric, function(x) {round(x, 0)}) %>%
kable()| GroupPredicted | money | recency | freq |
|---|---|---|---|
| Group 1 | 453 | 330 | 22 |
| Group 2 | 1018 | 103 | 53 |
| Group 3 | 505 | 217 | 29 |
| Group 4 | 2753 | 42 | 147 |
# Statistical description for the prediction on train data compared to test data:
df_train %>%
group_by(Group) %>%
summarise_each(funs(mean), money, recency, freq) %>%
mutate_if(is.numeric, function(x) {round(x, 0)}) %>%
kable()| Group | money | recency | freq |
|---|---|---|---|
| Group 1 | 594 | 331 | 24 |
| Group 2 | 995 | 103 | 51 |
| Group 3 | 782 | 213 | 36 |
| Group 4 | 2647 | 41 | 133 |