Dataset Made Available for the task
A transactional dataset has been provided from a luxury beauty brand consisting of 1,048,301 observations on customer’s instore and online purchases since 2017 in 85 cities across 22 countries. It has 16 variables which characterize the place, date, type and amount of purchase along with customer ID and transaction ID. Figure 1 describes the data dictionary that accompanies the dataset.
Data Dictionary
Fig-1: Data Dictionary
The reference process model used for the task is Cross-Industry Standard Process for Data Mining (CRISP-DM).
Fig-2: CRISP-DM
source: CRISP-DM 1.0
This section provides a step by step account of the techniques applied to cleanse and prepare the data such that it’s fit for the modelling task.
Evaluation of descriptive statistics revealed that variables Category, Channel, and City had missing values.
There were c.11K records where Revenue =0. These were mostly Skin Care products; One can impute values based on historical records of the same product code, but these might be promotional sales (accumulated loyalty points or buy 1 get 1 free etc.) which didn’t contribute to actual revenue. Also, these records constituted less than c.1% of the entire dataset, hence, they were deleted.
library(RODBC)
con <- odbcDriverConnect(connection ="Driver={SQL Server Native Client 11.0};Server=LAPTOP-R8GPLO2U\\PEDATASCIENCE;Database=IrisConcise;uid=LAPTOP-R8GPLO2U\\soura;Trusted_Connection=yes;")
SQLCOMMAND_0 <- sqlQuery(con, "
Select category,
count(*) as records
from datasciencetask
where revenue = 0
group by category
")
records_zero_revenue <- SQLCOMMAND_0
#head(records_zero_revenue, n=5)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(magrittr)
## Warning: package 'magrittr' was built under R version 3.6.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(records_zero_revenue, aes(reorder(category, records), records)) +
geom_col(fill = "steelblue4") + coord_flip() +
labs(x = "Category", y = "Records") +
geom_hline(aes(yintercept = mean(records)), color = "red")
Fig 3 Records with Revenue = 0
Fig-3: Outlier
Treemap: Catgories
Barchart: Revenue by Country
Barchar: YOY Trend
Barchar: Revenue by Channel
A single RFM score is computed for each customer by concatenating the individual’s recency, frequency and monetary scores as shown in the table below.
Creating a Data Frame for the Analysis
SQLCOMMAND <- sqlQuery(con, "
Select URN as Customer_ID,
Country,
count(transaction_id) as Trans_count,
min(datediff(day,Transaction_Date,getdate())) as Recency_Days,
sum(Revenue) as Total_Revenue
from datasciencetask
where revenue <> 0 and country = 'Australia' and URN <> '-10'
group by URN ,
Country
")
customers <- SQLCOMMAND
head(customers, n=5)
## Customer_ID Country Trans_count Recency_Days Total_Revenue
## 1 -1.000042e+19 Australia 1 706 51.82
## 2 -1.000109e+19 Australia 2 964 56.36
## 3 -1.000174e+19 Australia 2 548 136.36
## 4 -1.000198e+19 Australia 1 819 26.36
## 5 -1.000232e+19 Australia 1 942 9.09
Feeding the data to rfm package function rfm_table_customer to generate rfm scores
library(rfm)
analysis_date <- lubridate::as_date('2020-02-01', tz = 'GMT')
analysis_date
## [1] "2020-02-01"
rfm_result <- rfm_table_customer(customers, Customer_ID, Trans_count,
Recency_Days, Total_Revenue, analysis_date)
#head(rfm_result, n = 5)
Fig-4: RFM Score Table
#rfm_heatmap(rfm_result)
Fig-5:RFM Heatmap
#rfm_bar_chart(rfm_result)
Fig-6:Recency vs Frequency vs Monetary Distribution
rfm_histograms(rfm_result)
Fig- 7: Histogram
rfm_order_dist(rfm_result)
Fig-8:Distribution of customers across purchase frequency
1. Recency vs. Monetary value
rfm_rm_plot(rfm_result)
Fig-9
2. Frequency vs. monetary value
rfm_fm_plot(rfm_result)
Fig-10
3. Recency vs. Frequency
rfm_rf_plot(rfm_result)
Fig-11
segment_names <- c("Premium", "Loyal Customers", "Potential Loyalist",
"New Customers", "Promising", "Need Attention", "About To Churn",
"At Risk", "High Value Churners/Resurrection", "Low Value Churners")
recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
rfm_segments <- rfm_segment(rfm_result, segment_names, recency_lower, recency_upper,
frequency_lower, frequency_upper, monetary_lower, monetary_upper)
head(rfm_segments, n = 5)
## customer_id segment rfm_score transaction_count recency_days
## 1 -1.000042e+19 Potential Loyalist 313 1 706
## 2 -1.000109e+19 At Risk 143 2 964
## 3 -1.000174e+19 Loyal Customers 545 2 548
## 4 -1.000198e+19 Low Value Churners 211 1 819
## 5 -1.000232e+19 Low Value Churners 211 1 942
## amount recency_score frequency_score monetary_score
## 1 51.82 3 1 3
## 2 56.36 1 4 3
## 3 136.36 5 4 5
## 4 26.36 2 1 1
## 5 9.09 2 1 1
rfm_segments %>%
count(segment) %>%
arrange(desc(n)) %>%
rename(Segment = segment, Count = n)
## # A tibble: 9 x 2
## Segment Count
## <chr> <int>
## 1 Loyal Customers 27395
## 2 Others 24488
## 3 Low Value Churners 22491
## 4 Potential Loyalist 17576
## 5 About To Churn 9878
## 6 At Risk 8870
## 7 Promising 6177
## 8 New Customers 5886
## 9 High Value Churners/Resurrection 4379
Count of customers and RFM score distribution for each cohort
Table-1:Segment Distribution
data <-
rfm_segments %>%
group_by(segment) %>%
select(segment, recency_days) %>%
summarize(median(recency_days)) %>%
rename(segment = segment, avg_recency = `median(recency_days)`) %>%
arrange(avg_recency)
n_fill <- nrow(data)
ggplot(data, aes(segment, avg_recency)) +
geom_bar(stat = "identity", fill = brewer.pal(n = n_fill, name = "Set1")) +
xlab("Segment") + ylab("Median Recency") +
ggtitle("Median Recency by Segment") +
coord_flip() +
theme(
plot.title = element_text(hjust = 0.5)
)
Median Recency by Segment
data <-
rfm_segments %>%
group_by(segment) %>%
select(segment, transaction_count) %>%
summarize(median(transaction_count)) %>%
rename(segment = segment, avg_frequency = `median(transaction_count)`) %>%
arrange(avg_frequency)
n_fill <- nrow(data)
ggplot(data, aes(segment, avg_frequency)) +
geom_bar(stat = "identity", fill = brewer.pal(n = n_fill, name = "Set1")) +
xlab("Segment") + ylab("Median Frequency") +
ggtitle("Median Frequency by Segment") +
coord_flip() +
theme(
plot.title = element_text(hjust = 0.5)
)
Median Frequency by Segment
data <-
rfm_segments %>%
group_by(segment) %>%
select(segment, amount) %>%
summarize(median(amount)) %>%
rename(segment = segment, avg_monetary = `median(amount)`) %>%
arrange(avg_monetary)
n_fill <- nrow(data)
ggplot(data, aes(segment, avg_monetary)) +
geom_bar(stat = "identity", fill = brewer.pal(n = n_fill, name = "Set1")) +
xlab("Segment") + ylab("Median Monetary Value") +
ggtitle("Median Monetary Value by Segment") +
coord_flip() +
theme(
plot.title = element_text(hjust = 0.5)
)
Median Monetary by Segment
For any luxury brand a single global communication strategy is difficult. Given more time, a deep dive in customer segments by region, country, product category and channel could yield very different marketing strategies.
From our analysis of the sales data in Australia, we should engage distinct strategies for the segments of interest. If demographics data is overlaid to the behavioural segments, the resolution of the most attractive prospects could become clearer.
The loyal customers with repeat buys must be informed of the latest releases and offers. Below the line communication (direct mail or email) could be an effective channel of reaching these customers and letting them know of the new arrivals and sales offers.
The younger prospects can be acquired using Social Media influencer promotions.
Reduced advertising spend on Low value churners as it is less likely to yield any meaningful ROI.
Even in the age of digital media, TV is still considered to be one of the most effective channels of advertisement. However, much thought must be given on the right target demographic before buying of TV advertising space. Addressable TV like AdSmart can provide more targeting capability over traditional linear TV.
Media strategy to effectively increase & consolidate Share of Voice across all media channels.
Utilising Digital media more effectively.
Reduce advertisement wastage by buying ad-space on platforms where the prospects and loyal customers of the luxury brand over index.
Econometrics can attribute return on investment specific to a channel; however, we can use Linear Programming (Simplex Method) to optimise media planning (budget allocation across channels for a campaign).
The RFM model is not a black-box algorithm. It is rule based and hence, can be easily implemented in database using SQL or batch processes in R.