Note: the idea for the Cohort Analysis and K-Means Clustering comes from https://www.kaggle.com/mahmoudelfahl/cohort-analysis-customer-segmentation-with-rfm
Data science is especially relevant in the e-commerce and retail industry. By tracking customer behaviour, retailers can predict purchases, profits and losses, and even manipulate customers into buying things. Retail brands analyse data to create customer profiles, learn each customer's pain points, and market their products accordingly to push the customer towards purchasing. Have you ever imagined what would happen if there were no recommended products, or if the products that popped up in your timeline were random? Would you take any further action, such as clicking or purchasing?
Recommendation engines are among the most important tools on an e-commerce platform. By using these engines, retailers on e-commerce sites can increase their sales. Providing a recommendation engine is one of the ways online marketing improves conversion rates. For instance, once an online buyer has purchased a smartphone on an e-commerce site, related items (such as another brand of smartphone, earphones, etc.) will be recommended from then on.
In retail, customers purchase items based on impulse, and we can work on this principle by predicting the chances of a customer's purchasing behaviour. Much of this depends on how retailers market the product, and in the world of e-commerce, customer data is the best place to look for potential buying impulses. Like search recommendations, market basket analysis also works with machine learning or deep learning algorithms. (source: https://www.mygreatlearning.com/blog/applications-of-data-science-in-e-commerce-industry, accessed on April 19th, 2020)
One measurement that indicates how well an online platform performs is the retention rate. Using cohort analysis, we can compute retention metrics that show how many customers are still active since their first purchase. The retention rate is the percentage of active customers relative to the total number of customers.
As a Marketing Manager, we need to know our customers very well. Knowing how many products they will buy, how long it takes them to purchase again, and how much they have spent on our website will help us build a good, specific strategy to retain customers.
Knowing when customers purchase, and how much, is important too. By analysing the timing of customers' purchases and forecasting the number of transactions, the marketing team can decide when the best time is to run a promotion. By running promotions right on target, we spend the marketing budget more efficiently.
A cohort is a group of subjects who share a defining characteristic. We can observe how a cohort behaves across time and compare it to other cohorts. Cohorts are used in medicine, psychology, econometrics, ecology and many other areas to perform a cross-section (compare difference across subjects) at intervals through time.
First of all, we should import the data. This is a transactional data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer.
This dataset includes information about:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
The first step is to check whether there are NA (Not Available) or missing values in this data.
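A quick way to count missing values per column (the transaction data frame is named retail, as in the later chunks):
# Count NA values in each column of the raw data
colSums(is.na(retail))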
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 1454 0 0 0
## CustomerID Country
## 135080 0
Since CustomerID is important for our model, we need to remove all rows where CustomerID is NA.
After removing all the missing values, we also need to remove duplicated rows that would affect our modeling, as sketched below.
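A minimal sketch of both cleaning steps, assuming the transaction data frame is named retail as in the later chunks:
library(dplyr)

# Drop rows with a missing CustomerID, then drop fully duplicated rows
retail <- retail %>% 
  filter(!is.na(CustomerID)) %>% 
  distinct()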
Once the removals are done, the next step is to check a summary of the data. In this step, we verify whether there are any anomalies.
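For example:
# Summary statistics for every column
summary(retail)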
## InvoiceNo StockCode Description Quantity
## Length:401604 Length:401604 Length:401604 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 2.00
## Mode :character Mode :character Mode :character Median : 5.00
## Mean : 12.18
## 3rd Qu.: 12.00
## Max. : 80995.00
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00 Min. : 0.00 Min. :12346
## 1st Qu.:2011-04-06 15:02:00 1st Qu.: 1.25 1st Qu.:13939
## Median :2011-07-29 15:40:00 Median : 1.95 Median :15145
## Mean :2011-07-10 12:08:23 Mean : 3.47 Mean :15281
## 3rd Qu.:2011-10-20 11:58:30 3rd Qu.: 3.75 3rd Qu.:16784
## Max. :2011-12-09 12:50:00 Max. :38970.00 Max. :18287
## Country
## Length:401604
## Class :character
## Mode :character
##
##
##
The minimum UnitPrice is 0 and the minimum Quantity is negative. We need to subset the data, keeping only rows with UnitPrice above 0 and Quantity above 0.
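A minimal sketch of this subsetting step:
# Keep only rows with a positive price and a positive quantity
retail <- retail %>% 
  filter(UnitPrice > 0, Quantity > 0)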
After subsetting, we can re-check the summary to verify the data.
## InvoiceNo StockCode Description Quantity
## Length:392692 Length:392692 Length:392692 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 2.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 13.12
## 3rd Qu.: 12.00
## Max. :80995.00
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00 Min. : 0.001 Min. :12346
## 1st Qu.:2011-04-07 11:12:00 1st Qu.: 1.250 1st Qu.:13955
## Median :2011-07-31 12:02:00 Median : 1.950 Median :15150
## Mean :2011-07-10 19:13:07 Mean : 3.126 Mean :15288
## 3rd Qu.:2011-10-20 12:53:00 3rd Qu.: 3.750 3rd Qu.:16791
## Max. :2011-12-09 12:50:00 Max. :8142.750 Max. :18287
## Country
## Length:392692
## Class :character
## Mode :character
##
##
##
As we can see from the summary, the minimum UnitPrice and Quantity are now above 0. Let's move to the Cohort Analysis step.
For the cohort analysis, there are a few labels (columns) that we have to create:
library(dplyr)      # data wrangling (pipes, group_by, summarise)
library(tidyr)      # pivot_wider()
library(lubridate)  # date helpers (floor_date, month, year)

# Creating cohort group: each customer's first purchase month
retail_cohort <- retail %>% 
  group_by(CustomerID) %>% 
  summarise(CohortMonth = floor_date(min(InvoiceDate), unit = "month"))

# Extracting InvoiceMonth from InvoiceDate
retail <- retail %>% 
  mutate(InvoiceMonth = floor_date(InvoiceDate, unit = "month"))

# Merge the cohort group data frame into the transaction data frame
retail_merge <- merge(retail, retail_cohort, by.x = "CustomerID", by.y = "CustomerID")
head(retail_merge)

# Finding the difference (in months) between CohortMonth and InvoiceMonth
retail_merge <- retail_merge %>% 
  mutate(InvoiceMonth_Num = month(InvoiceMonth),
         InvoiceYear_Num  = year(InvoiceMonth),
         CohortMonth_Num  = month(CohortMonth),
         CohortYear_Num   = year(CohortMonth),
         CohortIndex = (InvoiceYear_Num - CohortYear_Num) * 12 + (InvoiceMonth_Num - CohortMonth_Num) + 1)

# Count monthly active customers in each cohort
cohort_counts <- retail_merge %>% 
  group_by(CohortMonth, CohortIndex) %>% 
  summarise(Total_Customer = n_distinct(CustomerID)) %>% 
  pivot_wider(names_from = CohortIndex, values_from = Total_Customer) %>% 
  rename(one_mth = 2,
         two_mths = 3,
         three_mths = 4,
         four_mths = 5,
         five_mths = 6,
         six_mths = 7,
         seven_mths = 8,
         eight_mths = 9,
         nine_mths = 10,
         ten_mths = 11,
         elv_mths = 12,
         twlv_mths = 13,
         thrdtn_mths = 14)
cohort_counts
After creating the Invoice Month, Cohort Month, and Cohort Index, we can build a retention table to see the percentage of active users compared to the total users in each cohort.
retention <- cohort_counts %>% 
  mutate(one    = round(one_mth / one_mth, 3) * 100,
         two    = round(two_mths / one_mth, 3) * 100,
         three  = round(three_mths / one_mth, 3) * 100,
         four   = round(four_mths / one_mth, 3) * 100,
         five   = round(five_mths / one_mth, 3) * 100,
         six    = round(six_mths / one_mth, 3) * 100,
         seven  = round(seven_mths / one_mth, 3) * 100,
         eight  = round(eight_mths / one_mth, 3) * 100,
         nine   = round(nine_mths / one_mth, 3) * 100,
         ten    = round(ten_mths / one_mth, 3) * 100,
         elv    = round(elv_mths / one_mth, 3) * 100,
         twlv   = round(twlv_mths / one_mth, 3) * 100,
         thrdtn = round(thrdtn_mths / one_mth, 3) * 100) %>% 
  select(-c(2:14))
retention
After creating the retention table, we can plot it as a heatmap to get a clearer picture.
library("plotly")
library("glue")
data_viz_retention <- retention %>%
pivot_longer(cols = c(2:14),
names_to = "CohortIndex",
values_to = "retention_rate") %>%
mutate(Index = row_number() %>% as.factor()) %>%
arrange(CohortMonth) %>%
ungroup() %>%
mutate(CohortMonth = ymd(CohortMonth) %>% as.factor())
p1 <- data_viz_retention %>%
mutate(CohortMonth = factor(CohortMonth, levels = rev(levels(CohortMonth)))) %>%
ggplot(aes(x=Index,y=CohortMonth,
text = glue("Cohort Month: {CohortMonth}
Cohort Index: {Index}
Retention Rate: {retention_rate}%")
))+
geom_tile(aes(fill = retention_rate))+
geom_text(aes(label = retention_rate),size = 2)+
scale_fill_gradientn(
colours = c("white", "#e6d0a1", "#ff0000"),
na.value = "white"
) +
labs(title = "Retention Rate (in percentage)") +
theme_minimal()+
theme(panel.grid = element_blank())
ggplotly(p1,tooltip = "text")
We find:
- The highest retention rates belong to the Dec 2010 cohort (Cohort Month).
- The overall trend shows retention slowly declining over time.
Besides the retention table, we can also look at the average quantity purchased by active customers in each cohort.
data_viz_avg <- retail_merge %>% 
  group_by(CohortMonth, CohortIndex) %>% 
  summarise(avg_quantity = round(mean(Quantity), 1)) %>% 
  arrange(CohortMonth) %>% 
  ungroup() %>% 
  mutate(CohortMonth = ymd(CohortMonth) %>% as.factor(),
         CohortIndex = CohortIndex %>% as.factor())

p2 <- data_viz_avg %>% 
  mutate(CohortMonth = factor(CohortMonth, levels = rev(levels(CohortMonth)))) %>% 
  ggplot(aes(x = CohortIndex, y = CohortMonth,
             text = glue("Cohort Month: {CohortMonth}
Cohort Index: {CohortIndex}
Average Quantity: {avg_quantity}/item
"))) +
  geom_tile(aes(fill = avg_quantity)) +
  geom_text(aes(label = avg_quantity), size = 2) +
  scale_fill_gradientn(
    colours = c("white", "#188508", "#016000"),
    na.value = "white"
  ) +
  labs(title = "Average Quantity per Each Cohort") +
  theme_minimal() +
  theme(panel.grid = element_blank())

ggplotly(p2, tooltip = "text")
We find:
- The trend of average quantity is going down over time.
- The value at CohortIndex 8 of the May 2011 cohort can be treated as an outlier, since it comes from a single ID with a very large purchased quantity.
Before we build a forecasting model, we need to check what kind of seasonality our data has.
# Aggregating transaction timestamps to hourly
retail_fc <- retail %>% 
  mutate(datetime = floor_date(InvoiceDate, unit = "hour"))

# Summarise data to get the number of transactions per hour
retail_fc <- retail_fc %>% 
  group_by(datetime) %>% 
  summarise(total_purchased = n_distinct(InvoiceNo))

# Time padding: fill in the missing hours
library(padr)

min_date <- min(retail_fc$datetime)
max_date <- max(retail_fc$datetime)

retail_fc <- retail_fc %>% 
  pad(start_val = make_datetime(year = year(min_date), month = month(min_date),
                                day = day(min_date), hour = 0),
      end_val   = make_datetime(year = year(max_date), month = month(max_date),
                                day = day(max_date), hour = 23))
retail_fc
After converting our data to hourly, we can check whether there are transactions in every hour of the day.
retail_fc %>% 
  mutate(hour = as.factor(hour(datetime)),
         total_purchased = replace_na(total_purchased, 0)) %>% 
  group_by(hour) %>% 
  summarise(total = sum(total_purchased))
From the generated table, we can see that there are no transactions from 9 PM to 5 AM. Thus, we can take out the zero-transaction hours to build a better model.
retail_fc <- retail_fc %>% 
  filter(hour(datetime) >= 6 & hour(datetime) <= 20) %>% 
  mutate(total_purchased = replace_na(total_purchased, 0))
retail_fc
As a first step, let's try a single-seasonality time series and check whether all the seasonality in the data can be captured by it.
# Converting the data frame into a time series object
# (frequency = 15 hourly observations per day, i.e. 06:00-20:00)
retail_ts <- ts(retail_fc$total_purchased, start = c(1, 1), frequency = 15)

# Visualization
library(forecast)

retail_ts %>% 
  autoplot() +
  theme_minimal()
We find:
- The plot of retail_ts suggests there is a repeating pattern beyond the single daily seasonality. That means our data most likely have multiple seasonality.

Since we conclude that our time series has multiple seasonality, let's move to a multiple-seasonality time series to inspect it.
# Create a multiple-seasonality time series (daily = 15 hours, weekly = 15 * 7)
retail_msts <- msts(data = retail_fc$total_purchased, seasonal.periods = c(15, 15 * 7))

# Decomposing
retail_msts %>% 
  tail(15 * 7 * 4) %>% 
  stl(s.window = "periodic") %>% 
  autoplot()
As we can see, the trend component is now smooth, indicating that our data set indeed has multiple seasonality.
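The next chunk plots the single (daily) seasonal component. It refers to retail_single_decompose, which is not created in the chunks shown; presumably it comes from a classical decomposition of the single-seasonality series, something like this sketch:
# Assumed (not shown in the original): classical decomposition of retail_ts,
# so that its $seasonal component can be joined back to retail_fc
retail_single_decompose <- decompose(retail_ts)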
# Single seasonality
retail_fc %>% 
  mutate(
    seasonal = retail_single_decompose$seasonal,
    hour = hour(datetime)
  ) %>% 
  distinct(hour, seasonal) %>% 
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col() +
  theme_minimal() +
  scale_x_continuous(breaks = seq(6, 20, 1)) +
  labs(
    title = "Single Seasonality Analysis",
    subtitle = "Daily"
  )
We find:
- Most visitors purchased around 12.00.
- Early morning (6.00 - 8.00) and late afternoon (17.00 - 20.00) were the hours with the fewest purchases.
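Likewise, the chunk below uses retail_double_decompose with Seasonal15 and Seasonal105 columns; those column names match what forecast::mstl() produces, so it was presumably created with something like:
# Assumed (not shown in the original): multiple-seasonality decomposition of retail_msts,
# whose columns include Seasonal15 and Seasonal105
retail_double_decompose <- mstl(retail_msts)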
# Multiple Seasonality
as.data.frame(retail_double_decompose) %>% 
  mutate(datetime = retail_fc$datetime) %>% 
  mutate(
    dow = wday(datetime, label = TRUE, abbr = FALSE),
    hour = as.factor(hour(datetime))
  ) %>% 
  group_by(dow, hour) %>% 
  summarise(seasonal = sum(Seasonal15 + Seasonal105)) %>% 
  ggplot(mapping = aes(x = hour, y = seasonal)) +
  geom_col(aes(fill = dow)) +
  scale_fill_viridis_d(option = "plasma") +
  theme_minimal() +
  labs(
    title = "Multiple Seasonality Analysis",
    subtitle = "Daily & Weekly"
  )
Besides what we already saw from the single seasonality, we find:
- Most visitors purchased at peak hours (12.00 - 13.00) in the middle of the week (Wednesday and Thursday).
- Thursday had more purchasing visitors than other days in the afternoon (16.00 - 17.00).
For K-Means clustering, we are going to build an RFM (Recency, Frequency, Monetary) model first. So, what is RFM?
Recency is about when a customer's last order was: the number of days since the customer made their last purchase. For a website or an app, this could also be interpreted as the last visit day or the last login time.
Frequency is the number of purchases in a given period, for example 3 months, 6 months or 1 year. It tells us how often a customer used the company's product; the bigger the value, the more engaged the customer. Should we call these customers our VIPs? Not necessarily, because we also have to consider how much they actually paid per purchase, which is the monetary value.
Monetary is the total amount of money a customer spent in that given period. It lets us separate big spenders (MVPs or VIPs) from other customers.
The RFM values can be grouped in several ways:
1. Percentiles, e.g. quantiles
2. Pareto 80/20 cut
3. Custom - based on business knowledge

In this project, we are going to group by percentiles.

The process of calculating percentiles is:
1. Sort customers based on that metric
2. Break customers into a pre-defined number of groups of equal size
3. Assign a label to each group
As a first step, let's add a column for the total amount of each purchase.
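The chunk that creates this column is not shown; a minimal sketch, assuming Total_Purchase is simply quantity times unit price:
# Assumed definition of Total_Purchase (the original chunk is not shown)
retail_merge <- retail_merge %>% 
  mutate(Total_Purchase = Quantity * UnitPrice)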
After creating Total_Purchase, let's check the last and first order dates.
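These can be read directly from the invoice dates, for example:
# Last and first transaction timestamps (printed in the order shown below)
max(retail_merge$InvoiceDate)
min(retail_merge$InvoiceDate)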
## [1] "2011-12-09 12:50:00 UTC"
## [1] "2010-12-01 08:26:00 UTC"
Since we want to calculate periods in days, let's set one day after the last transaction (December 9, 2011) as snapshot_date.
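One way to construct it (lubridate's days() provides the one-day offset):
# Snapshot date: one day after the last recorded transaction
snapshot_date <- max(retail_merge$InvoiceDate) + days(1)
snapshot_date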
## [1] "2011-12-10 12:50:00 UTC"
Now that we have snapshot_date, we can create our RFM table.
rfm <- retail_merge %>% 
  group_by(CustomerID) %>% 
  summarise(Recency   = as.numeric(round(snapshot_date - max(InvoiceDate), 0)),
            Frequency = n(),
            Monetary  = sum(Total_Purchase))
head(rfm)
After creating the RFM data frame, we need to label (put into ranges) the Recency, Frequency, and Monetary values. Note that:
- For Recency, customers who have been active more recently are rated better than less recent customers, because every company wants its customers to be recent.
- For Frequency and Monetary value, higher values get the higher label, because we want customers to spend more money and visit more often (note that this is the opposite ordering to Recency).
rfm <- rfm %>% 
  mutate(R = ntile(-Recency, 4),
         Fr = ntile(Frequency, 4),
         M = ntile(Monetary, 4),
         rfm_score = R + Fr + M)
head(rfm)
After getting rfm_score, let's group customers into Gold, Silver, and Bronze segments.
rfm$segment <- ifelse(rfm$rfm_score > 9, "Gold",
                      ifelse(rfm$rfm_score > 5 & rfm$rfm_score <= 9, "Silver", "Bronze"))
head(rfm)
Let's see the characteristics of each segment.
rfm_seg <- rfm %>% 
  group_by(segment) %>% 
  summarise(Recency   = mean(Recency),
            Frequency = mean(Frequency),
            Monetary  = mean(Monetary)) %>% 
  arrange(desc(Recency))
rfm_seg
We find:
- Bronze: average Recency is above 180 days, which means most Bronze customers have not made a transaction in the 6 months since their last purchase. Average Frequency is around 15, so most of them purchased around 15 times before they stopped buying. The average total spend per customer (Monetary) is around GBP 266.
- Silver: average Recency is above 60 days, indicating that most Silver customers have not made a transaction in the 2 months since their last purchase. Average Frequency is around 49 purchases, and Silver customers spent around GBP 1,000 in the online store (Monetary).
- Gold: average Recency is around 20 days, indicating that most Gold customers were still transacting within the last month. Average Frequency is around 226 purchases, and Gold customers spent around GBP 5,259 in the online store (Monetary).

To do K-Means clustering, we need to scale the data first. The columns we are going to use are Recency, Frequency, and Monetary.
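The scaled matrix rfm_scale used by the clustering functions below is created in a chunk that is not shown; a minimal sketch:
# Assumed scaling step (the original chunk is not shown): standardise the RFM columns
rfm_scale <- scale(rfm[, c("Recency", "Frequency", "Monetary")])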
After scaling, we need to decide on a value for 'k', i.e. how many clusters the data will be split into.
wss <- function(data, maxCluster = 9) {
  # Initialize the within sum of squares; the first element is the total
  # sum of squares, i.e. the value for k = 1
  SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
  set.seed(100)
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o",
       xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 19)
}
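The elbow plot referred to below is then produced by calling this function on the scaled RFM data, for example:
# Produce the elbow plot on the scaled RFM data
wss(rfm_scale)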
According to the WSS (elbow) plot above, we can use 3 as our k value, since the curve flattens after 3. We also want to compare the cluster characteristics with the RFM classes we generated, so we should use the same number of groups. Now we can create our K-Means clustering.
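The K-Means fit itself is not echoed in the chunks shown; the cluster column used below could be created with something like this (the seed value is an assumption):
# Assumed K-Means fit (the original chunk is not shown; the seed is arbitrary)
set.seed(100)
rfm_km <- kmeans(rfm_scale, centers = 3)

# Attach the cluster assignment to the RFM table
rfm$cluster <- as.factor(rfm_km$cluster)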
After clustering, let's see the characteristics of each cluster.
rfm_means <- rfm %>% 
  group_by(cluster) %>% 
  summarise(Recency   = mean(Recency),
            Frequency = mean(Frequency),
            Monetary  = mean(Monetary)) %>% 
  arrange(desc(Recency))
rfm_means
From the table above we can see the characteristics of each cluster produced by K-Means clustering.
K-Medoids clustering also requires scaled data, so we can reuse the scaled data from the K-Means step, and we can keep 3 as our k value since we want to compare the result with the other clustering models. Thus, in this section we can move directly to modeling.
library("cluster")
#modeling
set.seed(123)
rfm_med <- pam(rfm_scale,3)
#insert the result into dataframe
rfm$cluster_med <- as.factor(rfm_med$clustering)
rfm
Let's check the characteristics of each cluster.
rfm_med <- rfm %>% 
  group_by(cluster_med) %>% 
  summarise(Recency   = mean(Recency),
            Frequency = mean(Frequency),
            Monetary  = mean(Monetary)) %>% 
  arrange(desc(Recency)) %>% 
  mutate(cluster = c("1", "2", "3")) %>% 
  select(-1)
rfm_med
We find:
To make it easier to compare how each cluster's characteristics are distributed in each of the three result tables, let's divide each value by its column total so that we look at proportions.
# Make proportional tables

# RFM (simple) table
seg_viz <- rfm_seg %>% 
  mutate(Recency   = round(Recency / sum(Recency), 2),
         Frequency = round(Frequency / sum(Frequency), 2),
         Monetary  = round(Monetary / sum(Monetary), 2),
         clustering = factor("Simple"),
         cluster_name = factor(segment, levels = c("Bronze", "Silver", "Gold"))) %>% 
  pivot_longer(
    cols = c("Recency", "Frequency", "Monetary"),
    names_to = "rfm",
    values_to = "value") %>% 
  select(-1)

# K-Means table
means_viz <- rfm_means %>% 
  mutate(Recency   = round(Recency / sum(Recency), 2),
         Frequency = round(Frequency / sum(Frequency), 2),
         Monetary  = round(Monetary / sum(Monetary), 2),
         clustering = factor("K-Means"),
         cluster_name = factor(cluster, levels = c("1", "2", "3"))) %>% 
  pivot_longer(
    cols = c("Recency", "Frequency", "Monetary"),
    names_to = "rfm",
    values_to = "value") %>% 
  select(-1)

# K-Medoids table
med_viz <- rfm_med %>% 
  mutate(Recency   = round(Recency / sum(Recency), 2),
         Frequency = round(Frequency / sum(Frequency), 2),
         Monetary  = round(Monetary / sum(Monetary), 2),
         clustering = factor("Medoids"),
         cluster_name = factor(cluster, levels = c("1", "2", "3"))) %>% 
  pivot_longer(
    cols = c("Recency", "Frequency", "Monetary"),
    names_to = "rfm",
    values_to = "value") %>% 
  select(-1)

# Combine the three data frames above into one data frame
cluster_merge <- do.call(rbind, list(seg_viz, means_viz, med_viz))
cluster_merge
p3 <- cluster_merge %>% 
  ggplot(aes(x = factor(rfm, levels = c("Recency", "Frequency", "Monetary")), y = cluster_name,
             text = glue("{clustering} Model
Group: {cluster_name}
{rfm} proportion: {value}
"))) +
  geom_tile(aes(fill = value)) +
  facet_wrap(~clustering,
             scales = "free_y") +
  geom_text(aes(label = value), size = 3) +
  scale_fill_gradientn(
    colours = c("white", "#6497b1", "#04396c")) +
  labs(
    title = "Heatmap of Proportion Values",
    x = "RFM (Recency, Frequency, Monetary)",
    y = "Clustering"
  ) +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        strip.background = element_rect(fill = "light blue"))

ggplotly(p3, tooltip = "text")
We find:
- Simple clustering (plot 1) and K-Medoids clustering (plot 3) look alike.
- In K-Means clustering (plot 2), cluster 3 has very different values from the other clusters in terms of Frequency and Monetary.

Since the 3rd cluster of K-Means has significantly different values from the other clusters, let's check how many customers are in it.
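One way to count them, assuming the high-frequency group carries the label 3 in rfm$cluster:
# Number of customers assigned to the high-frequency K-Means cluster
sum(rfm$cluster == 3)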
## [1] 13
As we can see from the result, only 13 of 4,338 customers (less than 1%) are in this cluster, so we can treat the customers in this cluster as outliers.
At the end of this report, we can conclude that:
- Cluster 1: average Recency is above 180 days, which means most customers in Cluster 1 have not made a transaction in the 6 months since their last purchase. Average Frequency is around 25, so most of them purchased around 25 times before they stopped buying. The average total purchase value (Monetary) is around GBP 595.
- Cluster 2: average Recency is above 90 days, indicating that most customers in Cluster 2 have not made a transaction in the 3 months since their last purchase. Average Frequency is around 41 purchases, and customers in Cluster 2 spent around GBP 862 in the online store (Monetary).
- Cluster 3: average Recency is around 22 days, indicating that most customers in Cluster 3 were still transacting within the last month. Average Frequency is around 142 purchases, and these customers spent around GBP 3,251 in the online store (Monetary).