Modelling Task Objective

Dataset Made Available for the task

A transactional dataset has been provided from a luxury beauty brand consisting of 1,048,301 observations on customer’s instore and online purchases since 2017 in 85 cities across 22 countries. It has 16 variables which characterize the place, date, type and amount of purchase along with customer ID and transaction ID. Figure 1 describes the data dictionary that accompanies the dataset.

Data Dictionary

Fig-1: Data Dictionary

Fig-1: Data Dictionary


Problem Definition and Algorithm Choice

  • The business needs to identify potential risk cohorts within its customer base to direct appropriate strategies and drive their purchase frequency.
  • Also, they would like to boost repeat purchase from existing/ loyal customers and increase brand awareness.
  • Each cohort has a distinct financial performance with 3 main KPIs: Recency, Frequency and Monetary Value of Purchase. Varying across countries, channel type and product category, these KPIs will be fed into an RFM model; for simplicity and due to time constraint, as part of this exercise, this model will be applied to segment customers based in Australia.
  • The cohorts obtained from above model will be subjected to further analysis to understand:
    1. Customer-Product Lifecycle
    2. Purchase behavior like average order value (AOV), increase or decrease in spend in consecutive purchases etc.
    3. Lifetime Value which reflects both AOV and no. of orders per customer.
  • The business requirements could also be realized by applying segmentation algorithms like K-means or K-Prototype. Cohort analysis of products can also help to identify which products drive the maximum customer satisfaction or least repeat purchases. Business can strategize to optimize the inventory and product research/innovation. Due to time constraint, this exercise hasn’t been done here.

Modelling framework

The reference process model used for the task is Cross-Industry Standard Process for Data Mining (CRISP-DM).

Fig-2: CRISP-DM

Fig-2: CRISP-DM

source: CRISP-DM 1.0


Data Preparation

This section provides a step by step account of the techniques applied to cleanse and prepare the data such that it’s fit for the modelling task.

  • Evaluation of descriptive statistics revealed that variables Category, Channel, and City had missing values.

  • There were c.11K records where Revenue =0. These were mostly Skin Care products; One can impute values based on historical records of the same product code, but these might be promotional sales (accumulated loyalty points or buy 1 get 1 free etc.) which didn’t contribute to actual revenue. Also, these records constituted less than c.1% of the entire dataset, hence, they were deleted.

library(RODBC)
con <- odbcDriverConnect(connection ="Driver={SQL Server Native Client 11.0};Server=LAPTOP-R8GPLO2U\\PEDATASCIENCE;Database=IrisConcise;uid=LAPTOP-R8GPLO2U\\soura;Trusted_Connection=yes;")

SQLCOMMAND_0 <- sqlQuery(con, "
  Select  category,
          count(*) as records
   from datasciencetask
  where revenue = 0 
  group by category
                 ")

records_zero_revenue <- SQLCOMMAND_0
#head(records_zero_revenue, n=5)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(magrittr)
## Warning: package 'magrittr' was built under R version 3.6.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggplot(records_zero_revenue, aes(reorder(category, records), records)) + 
  geom_col(fill = "steelblue4") + coord_flip() + 
  labs(x = "Category", y = "Records") + 
  geom_hline(aes(yintercept = mean(records)), color = "red")

Fig 3 Records with Revenue = 0

  • Customer ID ‘-10’ was detected as an outlier, since it alone generated c. 1 million revenue across different countries. Therefore, it can be assumed to be a franchise or distributor which can skew the cohort performance. Those records were also ignored.
Fig-3: Outlier

Fig-3: Outlier


Key Insights

  1. c.74% of revenue comes from the sales of Skin Care and Body Care products
Treemap: Catgories

Treemap: Catgories


  1. Australia generates the highest sales revenue (12.7m, 21%), followed by the United States (10.7m, 18%) with Japan in close third (10m, 17%)
Barchart: Revenue by Country

Barchart: Revenue by Country


  1. There has been a steady decline in yoy sales revenue in ANZ region. (Sales data for 2019 is only available until 1st week of May. A forecast was done based on the avg sales revenue from May-Dec in the previous two years. The dotted bars are the forecasted revenue for 2019. A trend and seasonal decomposition and weighting days in a month, could yield better forecasts. It couldn’t be done due to time constraint)
Barchar: YOY Trend

Barchar: YOY Trend


  1. Negligible sales revenue is generated from Online sales. However, a channel ‘Null’ is seen to contribute significantly to overall sales revenue. This could be a labelling anomaly and needs to be investigated further.
Barchar: Revenue by Channel

Barchar: Revenue by Channel


Modelling with RFM Algorithm

A single RFM score is computed for each customer by concatenating the individual’s recency, frequency and monetary scores as shown in the table below.

Creating a Data Frame for the Analysis
SQLCOMMAND <- sqlQuery(con, "
  Select  URN as Customer_ID,
          Country,
              count(transaction_id) as Trans_count,
              min(datediff(day,Transaction_Date,getdate())) as Recency_Days,          
              sum(Revenue) as Total_Revenue
  from datasciencetask
 where revenue <> 0 and country = 'Australia' and URN <> '-10'
 group by URN ,
          Country
                 ")

customers <- SQLCOMMAND
head(customers, n=5)
##     Customer_ID   Country Trans_count Recency_Days Total_Revenue
## 1 -1.000042e+19 Australia           1          706         51.82
## 2 -1.000109e+19 Australia           2          964         56.36
## 3 -1.000174e+19 Australia           2          548        136.36
## 4 -1.000198e+19 Australia           1          819         26.36
## 5 -1.000232e+19 Australia           1          942          9.09
Feeding the data to rfm package function rfm_table_customer to generate rfm scores
library(rfm)
analysis_date <- lubridate::as_date('2020-02-01', tz = 'GMT')
analysis_date
## [1] "2020-02-01"
rfm_result <- rfm_table_customer(customers, Customer_ID, Trans_count,
                                 Recency_Days, Total_Revenue, analysis_date)
#head(rfm_result, n = 5)
Fig-4: RFM Score Table

Fig-4: RFM Score Table


Heatmap: The heat map shows the average revenue for different bins of recency and frequency.

#rfm_heatmap(rfm_result)
Fig-5:RFM Heatmap

Fig-5:RFM Heatmap


Barchart: Alternatively, one can generate a distribution of monetary scores for different combination of frequency and recency scores.

#rfm_bar_chart(rfm_result)
Fig-6:Recency vs Frequency vs Monetary Distribution

Fig-6:Recency vs Frequency vs Monetary Distribution


Histogram: Histogram examines the relative distribution of:

  • Total revenue generated by each customer
  • No. of days since the most recent visit by each customer
  • No. of transactions by each customer
rfm_histograms(rfm_result)

Fig- 7: Histogram


Customers by orders

rfm_order_dist(rfm_result)

Fig-8:Distribution of customers across purchase frequency


Scatter plots

1. Recency vs. Monetary value

rfm_rm_plot(rfm_result)

Fig-9

2. Frequency vs. monetary value

rfm_fm_plot(rfm_result)

Fig-10

3. Recency vs. Frequency

rfm_rf_plot(rfm_result)

Fig-11


Cohort Analysis: 10 cohorts were designed based on various combination of the RFM score and possible communications strategy that need to be adopted for each such segment and customers were classified accordingly.

segment_names <- c("Premium", "Loyal Customers", "Potential Loyalist",
                   "New Customers", "Promising", "Need Attention", "About To Churn",
                   "At Risk", "High Value Churners/Resurrection", "Low Value Churners")

recency_lower <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
recency_upper <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
frequency_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
frequency_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
monetary_lower <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
monetary_upper <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)

rfm_segments <- rfm_segment(rfm_result, segment_names, recency_lower, recency_upper,
            frequency_lower, frequency_upper, monetary_lower, monetary_upper)

head(rfm_segments, n = 5)
##     customer_id            segment rfm_score transaction_count recency_days
## 1 -1.000042e+19 Potential Loyalist       313                 1          706
## 2 -1.000109e+19            At Risk       143                 2          964
## 3 -1.000174e+19    Loyal Customers       545                 2          548
## 4 -1.000198e+19 Low Value Churners       211                 1          819
## 5 -1.000232e+19 Low Value Churners       211                 1          942
##   amount recency_score frequency_score monetary_score
## 1  51.82             3               1              3
## 2  56.36             1               4              3
## 3 136.36             5               4              5
## 4  26.36             2               1              1
## 5   9.09             2               1              1
rfm_segments %>%
  count(segment) %>%
  arrange(desc(n)) %>%
  rename(Segment = segment, Count = n)
## # A tibble: 9 x 2
##   Segment                          Count
##   <chr>                            <int>
## 1 Loyal Customers                  27395
## 2 Others                           24488
## 3 Low Value Churners               22491
## 4 Potential Loyalist               17576
## 5 About To Churn                    9878
## 6 At Risk                           8870
## 7 Promising                         6177
## 8 New Customers                     5886
## 9 High Value Churners/Resurrection  4379

Count of customers and RFM score distribution for each cohort

Table-1:Segment Distribution

Table-1:Segment Distribution


Median Recency

data <- 
  rfm_segments %>%
  group_by(segment) %>%
  select(segment, recency_days) %>%
  summarize(median(recency_days)) %>%
  rename(segment = segment, avg_recency = `median(recency_days)`) %>%
  arrange(avg_recency) 

n_fill <- nrow(data)

ggplot(data, aes(segment, avg_recency)) +
  geom_bar(stat = "identity", fill = brewer.pal(n = n_fill, name = "Set1")) +
  xlab("Segment") + ylab("Median Recency") +
  ggtitle("Median Recency by Segment") +
  coord_flip() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
Median Recency by Segment

Median Recency by Segment


Median Frequency

data <- 
  rfm_segments %>%
  group_by(segment) %>%
  select(segment, transaction_count) %>%
  summarize(median(transaction_count)) %>%
  rename(segment = segment, avg_frequency = `median(transaction_count)`) %>%
  arrange(avg_frequency) 

n_fill <- nrow(data)

ggplot(data, aes(segment, avg_frequency)) +
  geom_bar(stat = "identity", fill = brewer.pal(n = n_fill, name = "Set1")) +
  xlab("Segment") + ylab("Median Frequency") +
  ggtitle("Median Frequency by Segment") +
  coord_flip() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
Median Frequency by Segment

Median Frequency by Segment


Median Monetary value

data <- 
  rfm_segments %>%
  group_by(segment) %>%
  select(segment, amount) %>%
  summarize(median(amount)) %>%
  rename(segment = segment, avg_monetary = `median(amount)`) %>%
  arrange(avg_monetary) 

n_fill <- nrow(data)

ggplot(data, aes(segment, avg_monetary)) +
  geom_bar(stat = "identity", fill = brewer.pal(n = n_fill, name = "Set1")) +
  xlab("Segment") + ylab("Median Monetary Value") +
  ggtitle("Median Monetary Value by Segment") +
  coord_flip() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
Median Monetary by Segment

Median Monetary by Segment


Limitations of the Dataset

  • Lack of complete understanding of the business process posed a problem in treatment of missing values in ‘Channel’, ‘City’, ‘Category’ and ‘Revenue’.

Enhancement to the Analysis

  • Year on year trends in purchase across countries, channel type and product categories
  • Seasonality in purchase (Seasonality and Trend Decomopistion)
  • Cohort analysis of products
  • Segmentation with K-prototype algorithm
  • More granular segmentation varying across countries, channel type and product categories

Communication Strategy

  • For any luxury brand a single global communication strategy is difficult. Given more time, a deep dive in customer segments by region, country, product category and channel could yield very different marketing strategies.

  • From our analysis of the sales data in Australia, we should engage distinct strategies for the segments of interest. If demographics data is overlaid to the behavioural segments, the resolution of the most attractive prospects could become clearer.

  • The loyal customers with repeat buys must be informed of the latest releases and offers. Below the line communication (direct mail or email) could be an effective channel of reaching these customers and letting them know of the new arrivals and sales offers.

  • The younger prospects can be acquired using Social Media influencer promotions.

  • Reduced advertising spend on Low value churners as it is less likely to yield any meaningful ROI.

  • Even in the age of digital media, TV is still considered to be one of the most effective channels of advertisement. However, much thought must be given on the right target demographic before buying of TV advertising space. Addressable TV like AdSmart can provide more targeting capability over traditional linear TV.

  • Media strategy to effectively increase & consolidate Share of Voice across all media channels.

  • Utilising Digital media more effectively.

  • Reduce advertisement wastage by buying ad-space on platforms where the prospects and loyal customers of the luxury brand over index.

  • Econometrics can attribute return on investment specific to a channel; however, we can use Linear Programming (Simplex Method) to optimise media planning (budget allocation across channels for a campaign).


Implementation Plan

The RFM model is not a black-box algorithm. It is rule based and hence, can be easily implemented in database using SQL or batch processes in R.