In my prior experience, I worked in the marketing & analytics team. I helped brands/clients like Footlocker, Macy’s with large scale marketing campaigns, strategic & creative direction, budget planning, ad optimization and reporting, and technology integration with the sole aim of driving awareness and online sales of products on their eCommerce platforms. In this Vignette, I am working on a sample sales record that you’d get from a typical eCommerce site to identify the most loyal customers using the RFM method.
I am also attempting to plot the data using ggplot to show recency, frequency and monetary visualization. I am loading the tidyverse library in order to use ggplot.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
To import the CSV to R Studio IDE-
I downloaded the Sales.csv (containing the sales record) from Kaggle. If you go to File> Import Dataset> From Text (base), you can locate the CSV file saved to your computer. After I click Import, it produces the R Code- read.csv which is a command that can read obviously .csv files.So that’s comma separated value files. Commas act as separators in the file. The same GUI can be used to import .txt files. Now, if you are importing files from an excel sheet, this GUI will not work. However, you can open an excel and save your file as .csv and then run this command.
Alternatively, you can import the dataset (using the same command) by specifying the path name as shown below.-
Sales <- read.csv("~/Desktop/Sales.csv")
# To sort the data, use the order( ) function. By default, sorting is ASCENDING.
Salesorted <- Sales[order(Sales$index), ]
# To show the first 5 or any number of rows from the beginning, use head(df, number of rows)
head(Salesorted, 5)
## index CustomerID TOTAL_ORDERS REVENUE AVERAGE_ORDER_VALUE CARRIAGE_REVENUE
## 1 0 2354 124 11986.54 96.67 529.59
## 2 1 2361 82 11025.96 134.46 97.92
## 3 2 2415 43 7259.69 168.83 171.69
## 4 3 2427 44 6992.27 158.92 92.82
## 5 4 2456 55 6263.44 113.88 179.04
## AVERAGESHIPPING FIRST_ORDER_DATE LATEST_ORDER_DATE AVGDAYSBETWEENORDERS
## 1 4.27 2016-12-30 2021-10-24 14.19
## 2 1.19 2018-03-31 2021-10-24 15.89
## 3 3.99 2017-11-30 2021-10-24 33.12
## 4 2.11 2019-04-09 2021-10-24 21.11
## 5 3.26 2020-10-23 2021-10-24 6.65
## DAYSSINCELASTORDER MONDAY_ORDERS TUESDAY_ORDERS WEDNESDAY_ORDERS
## 1 1 13 13 29
## 2 1 11 13 10
## 3 1 5 4 3
## 4 1 10 8 5
## 5 1 2 3 8
## THURSDAY_ORDERS FRIDAY_ORDERS SATURDAY_ORDERS SUNDAY_ORDERS MONDAY_REVENUE
## 1 25 19 15 10 1287.50
## 2 13 14 10 11 1253.08
## 3 5 5 8 13 698.95
## 4 8 5 3 5 1753.95
## 5 8 12 10 12 150.98
## TUESDAY_REVENUE WEDNESDAY_REVENUE THURSDAY_REVENUE FRIDAY_REVENUE
## 1 1054.15 2842.52 2133.79 2051.94
## 2 2003.25 1819.75 1114.83 1791.86
## 3 504.97 197.97 1182.97 733.96
## 4 1400.23 570.48 1154.06 798.07
## 5 330.38 1090.23 563.63 1339.44
## SATURDAY_REVENUE SUNDAY_REVENUE WEEK1_DAY01_DAY07_ORDERS
## 1 1496.33 1120.31 28
## 2 1624.34 1418.85 18
## 3 1223.96 2716.91 9
## 4 630.00 685.48 12
## 5 956.56 1832.22 10
## WEEK2_DAY08_DAY15_ORDERS WEEK3_DAY16_DAY23_ORDERS WEEK4_DAY24_DAY31_ORDERS
## 1 42 30 24
## 2 19 19 26
## 3 11 6 17
## 4 15 9 8
## 5 18 21 6
## WEEK1_DAY01_DAY07_REVENUE WEEK2_DAY08_DAY15_REVENUE WEEK3_DAY16_DAY23_REVENUE
## 1 2685.37 4299.28 2592.18
## 2 1336.09 2776.02 2807.66
## 3 2299.93 1383.92 713.94
## 4 2317.95 2417.22 997.02
## 5 831.14 1938.18 2725.66
## WEEK4_DAY24_DAY31_REVENUE TIME_0000_0600_ORDERS TIME_0601_1200_ORDERS
## 1 2409.71 12 61
## 2 4106.19 1 13
## 3 2861.90 0 6
## 4 1260.08 0 10
## 5 768.46 6 9
## TIME_1200_1800_ORDERS TIME_1801_2359_ORDERS TIME_0000_0600_REVENUE
## 1 37 14 1067.92
## 2 24 44 30.25
## 3 12 25 0.00
## 4 13 21 0.00
## 5 27 13 501.34
## TIME_0601_1200_REVENUE TIME_1200_1800_REVENUE TIME_1801_2359_REVENUE
## 1 5674.65 3916.53 1327.44
## 2 3143.69 2205.47 5646.55
## 3 707.98 2560.91 3990.80
## 4 1908.99 2025.05 3058.23
## 5 1197.83 3004.77 1559.50
The Sales.csv dataset has 5000 observations across 41 variables, also referred to as columns.
We will use a few variables from the dataset, namely: CustomerID, Revenue, Latest Order Date, and Average Days Between Orders.
RFM <- Salesorted[,c("CustomerID","TOTAL_ORDERS","REVENUE","LATEST_ORDER_DATE","AVGDAYSBETWEENORDERS")]
head(RFM, 10)
## CustomerID TOTAL_ORDERS REVENUE LATEST_ORDER_DATE AVGDAYSBETWEENORDERS
## 1 2354 124 11986.54 2021-10-24 14.19
## 2 2361 82 11025.96 2021-10-24 15.89
## 3 2415 43 7259.69 2021-10-24 33.12
## 4 2427 44 6992.27 2021-10-24 21.11
## 5 2456 55 6263.44 2021-10-24 6.65
## 6 2485 49 5841.24 2021-10-24 46.57
## 7 2519 43 5470.27 2021-10-24 23.58
## 8 2551 54 5200.53 2021-10-24 13.06
## 9 2590 19 4967.06 2021-10-24 12.37
## 10 2640 21 4726.38 2021-10-24 31.29
1. To find Recency score: We calculate the difference between today’s date and date of last purchase
For example: If today is 22nd October and last date of purchase on the record of a particular customer is 20th October, the Recency Score would be (24-22)= 2
2. To find Frequency score: We will use the the data of the column ‘AVGDAYSBETWEENORDERS’ which gives the average numbers of days between each order. We will call this variable Frequency.
For example: If there is a product being purchased every week by a customer, their Frequency score will be 7.
3. To find Monetary score: We Calculate the Total Revenue generated by a customer.
This doesn’t need to be calculated. There is a separate column for total revenue generated by the customer.
Here is the code that bungs in all the three components to calculate the RFM score of a customer.
#The as.Date() is a built-in R function that converts between character representations and class “Date” objects representing the calendar dates.
today <- as.Date("2022-08-10")
RFMCalculation <- RFM %>%
mutate(
Recency = today - as.Date(LATEST_ORDER_DATE),
#Rounds values to specified number of decimal places.
Frequency = round(AVGDAYSBETWEENORDERS, 0),
Monetary= REVENUE,
)
#The following code is used to display only a number of variables of your choosing.
RFMScore <- RFMCalculation[,c("CustomerID","Monetary","Recency","Frequency")]
head(RFMScore, 10)
## CustomerID Monetary Recency Frequency
## 1 2354 11986.54 290 days 14
## 2 2361 11025.96 290 days 16
## 3 2415 7259.69 290 days 33
## 4 2427 6992.27 290 days 21
## 5 2456 6263.44 290 days 7
## 6 2485 5841.24 290 days 47
## 7 2519 5470.27 290 days 24
## 8 2551 5200.53 290 days 13
## 9 2590 4967.06 290 days 12
## 10 2640 4726.38 290 days 31
Summary <- RFMScore %>%
#The summary is a built-in R function used to produce result summaries of various model fitting functions.
summarise(
Number_of_Rows = n(),
Monetary_mean = mean(Monetary),
Monetary_SD = sd(Monetary),
Monetary_Max = max(Monetary),
Monetary_Min = min(Monetary),
# rnorm(number_of_rows,Rev_mean, Rev_SD )
)
# head() shows the first 5 rows, knitr::kable() makes nice tables in RMarkdown
Summary %>% head() %>% knitr::kable()
| Number_of_Rows | Monetary_mean | Monetary_SD | Monetary_Max | Monetary_Min |
|---|---|---|---|---|
| 5000 | 1681.524 | 1998.619 | 34847.4 | 38.5 |
Density Distribution:
Here we are showing a density plot which is a representation of the
distribution of a numeric variable that uses a kernel density estimate
to show the probability density function of the variable. It shows the
probability of occurrence of a variable within a specified range. Larger
the area (called Probability area), higher the probability of its
occurrence.
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
# Setting the x axis-range using the Filter functin
# You could spruce up your distribution graph by passing parameters like color code to Geom_density function
RFMScore %>%
filter(Monetary<10000 ) %>%
ggplot( aes(x=Monetary)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
Plotting the same density plot for Frequency variable.
Summary <- RFMScore %>%
summarise(
Number_of_Rows = n(),
Frequency_mean = mean(Frequency),
Frequency_SD = sd(Frequency),
Frequency_Max = max(Frequency),
Frequency_Min = min(Frequency),
# rnorm(number_of_rows,Rev_mean, Rev_SD )
)
Summary %>% head() %>% knitr::kable()
| Number_of_Rows | Frequency_mean | Frequency_SD | Frequency_Max | Frequency_Min |
|---|---|---|---|---|
| 5000 | 163.1568 | 259.7011 | 1410 | 0 |
Density Distribution:
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Frequency <- RFMScore$Frequency
#hist(Frequency)
# Data
RFMScore %>%
filter(Frequency<1000 ) %>%
ggplot( aes(x=Frequency)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
Recency- Data Visualisation:
Plotting the same density plot for Recency variable.
Summary <- RFMScore %>%
summarise(
Number_of_Rows = n(),
Recency_mean = mean(Recency),
Recency_SD = sd(Recency),
Recency_Max = max(Recency),
Recency_Min = min(Recency),
# rnorm(number_of_rows,Rev_mean, Rev_SD )
)
Summary %>% head() %>% knitr::kable()
| Number_of_Rows | Recency_mean | Recency_SD | Recency_Max | Recency_Min |
|---|---|---|---|---|
| 5000 | 376.42 days | 80.15651 | 496 days | 290 days |
Density Distribution:
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Recency <- RFMScore$Recency
#hist(Recency)
# Data
RFMScore %>%
filter(Recency<1000 ) %>%
ggplot( aes(x=Recency)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. All the observations of a variable get sorted in an ascending order of their numerical value. In the example code below; [0-25] is the first quantile, [25-50] is the second quantile and so on. So, at the 25%, the observation of the Recency variable is 296. Second quantile can also be a median if there are odd number of observations.
quantile(RFMScore$Recency, probs = seq(0, 1, 1/4))
## Time differences in days
## 0% 25% 50% 75% 100%
## 290.00 296.00 357.00 460.25 496.00
We can also use Quantiles to split observations in terms of percentiles. For eg: we can find out a range of observations that are above 99th percentile. This can help us assign a score to observations depending on the percentile range (score of 1 for 15th to 29th, 2 for 29th to 59th) that they belong to.
Calculating the Recency score of the customer.
#Quantiles are given as decimal values, values such as 0.9992, 0.8492, and 0.5992. A 0.95 quantile point is exactly the same as the 95th percentile point.
interested.quantiles <-
quantile( x = RFMScore$Recency
, probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
)
# A score is getting assigned to all observations in the Recency table.
# However, higher the Recency (number of days), lower the loyalty score a customer will get. So the top percentiles will have low scores.
#sapply() isuseful for operations on list objects and returns a list object of same length of original set.
RFMScore$RScore <-
sapply( X = RFMScore$Recency
, FUN = function( i )
if( i >= interested.quantiles[ 1 ] ){
Value=1
return(Value )
} else if( i < interested.quantiles[ 1 ] &&
i >= interested.quantiles[ 2 ] ){
Value=2
return( Value )
} else if( i < interested.quantiles[ 2 ] &&
i >= interested.quantiles[ 3 ] ){
Value=3
return( Value )
} else if( i < interested.quantiles[ 3 ] &&
i >= interested.quantiles[ 4 ] ){
Value=4
return( Value )
} else if( i < interested.quantiles[ 4 ] &&
i >= interested.quantiles[ 5 ] ){
Value=5
return( Value )
} else if( i < interested.quantiles[ 5 ] ){
Value=6
return( Value)
}
)
# view the results
head(RFMScore, 10)
## CustomerID Monetary Recency Frequency RScore
## 1 2354 11986.54 290 days 14 5
## 2 2361 11025.96 290 days 16 5
## 3 2415 7259.69 290 days 33 5
## 4 2427 6992.27 290 days 21 5
## 5 2456 6263.44 290 days 7 5
## 6 2485 5841.24 290 days 47 5
## 7 2519 5470.27 290 days 24 5
## 8 2551 5200.53 290 days 13 5
## 9 2590 4967.06 290 days 12 5
## 10 2640 4726.38 290 days 31 5
Calculating the Frequency score.
interested.quantiles <-
quantile( x = RFMScore$Recency
, probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
)
#Here, greater the Frequency value, higher the score.
RFMScore$FScore <-
sapply( X = RFMScore$Frequency
, FUN = function( i )
if( i >= interested.quantiles[ 1 ] ){
Value=6
return(Value )
} else if( i < interested.quantiles[ 1 ] &&
i >= interested.quantiles[ 2 ] ){
Value=5
return( Value )
} else if( i < interested.quantiles[ 2 ] &&
i >= interested.quantiles[ 3 ] ){
Value=4
return( Value )
} else if( i < interested.quantiles[ 3 ] &&
i >= interested.quantiles[ 4 ] ){
Value=3
return( Value )
} else if( i < interested.quantiles[ 4 ] &&
i >= interested.quantiles[ 5 ] ){
Value=2
return( Value )
} else if( i < interested.quantiles[ 5 ] ){
Value=1
return( Value)
}
)
# view the results
head(RFMScore, 10)
## CustomerID Monetary Recency Frequency RScore FScore
## 1 2354 11986.54 290 days 14 5 1
## 2 2361 11025.96 290 days 16 5 1
## 3 2415 7259.69 290 days 33 5 1
## 4 2427 6992.27 290 days 21 5 1
## 5 2456 6263.44 290 days 7 5 1
## 6 2485 5841.24 290 days 47 5 1
## 7 2519 5470.27 290 days 24 5 1
## 8 2551 5200.53 290 days 13 5 1
## 9 2590 4967.06 290 days 12 5 1
## 10 2640 4726.38 290 days 31 5 1
Calculating the Monetary score.
interested.quantiles <-
quantile( x = RFMScore$Monetary
, probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
)
# More revenue, more points!
RFMScore$MScore <-
sapply( X = RFMScore$Monetary
, FUN = function( i )
if( i >= interested.quantiles[ 1 ] ){
Value=6
return(Value )
} else if( i < interested.quantiles[ 1 ] &&
i >= interested.quantiles[ 2 ] ){
Value=5
return( Value )
} else if( i < interested.quantiles[ 2 ] &&
i >= interested.quantiles[ 3 ] ){
Value=4
return( Value )
} else if( i < interested.quantiles[ 3 ] &&
i >= interested.quantiles[ 4 ] ){
Value=3
return( Value )
} else if( i < interested.quantiles[ 4 ] &&
i >= interested.quantiles[ 5 ] ){
Value=2
return( Value )
} else if( i < interested.quantiles[ 5 ] ){
Value=1
return( Value)
}
)
# view the results
head(RFMScore, 10)
## CustomerID Monetary Recency Frequency RScore FScore MScore
## 1 2354 11986.54 290 days 14 5 1 5
## 2 2361 11025.96 290 days 16 5 1 5
## 3 2415 7259.69 290 days 33 5 1 5
## 4 2427 6992.27 290 days 21 5 1 5
## 5 2456 6263.44 290 days 7 5 1 5
## 6 2485 5841.24 290 days 47 5 1 5
## 7 2519 5470.27 290 days 24 5 1 5
## 8 2551 5200.53 290 days 13 5 1 5
## 9 2590 4967.06 290 days 12 5 1 5
## 10 2640 4726.38 290 days 31 5 1 5
We will add all the three scores. Needless to say, customers with the highest score are the most loyal.
Loyalty <- RFMScore %>%
mutate(
Score= RScore + FScore + MScore
)
#The following code is used to display only a number of variables of your choosing.
head(Loyalty[,c("CustomerID","Score")],30)
## CustomerID Score
## 1 2354 11
## 2 2361 11
## 3 2415 11
## 4 2427 11
## 5 2456 11
## 6 2485 11
## 7 2519 11
## 8 2551 11
## 9 2590 11
## 10 2640 11
## 11 2823 11
## 12 2824 11
## 13 2904 11
## 14 2927 11
## 15 2945 11
## 16 3001 11
## 17 3003 11
## 18 3072 11
## 19 3082 11
## 20 3117 10
## 21 3127 10
## 22 3138 10
## 23 3186 10
## 24 3196 10
## 25 3234 10
## 26 3260 10
## 27 3279 10
## 28 3304 10
## 29 3332 10
## 30 3407 10
Segmenting Customers
df = Loyalty
table(Loyalty$Score)
##
## 3 4 5 6 7 8 9 10 11 12 13 14
## 475 225 234 33 293 829 1362 909 367 168 78 27
Assigning a badge to customers 1. Customers with a score of 3-6 will be assigned a Bronze badge. 2. Customers with a score of 7-10 will be assigned a Gold badge. 3. Customers with a score of 11-14 will be assigned a Platinum badge.
# More revenue, more points!
Loyalty$Badge <-
sapply( X = Loyalty$Score
, FUN = function( i )
if( i >=3 &&
i <= 6 ){
return("Bronze" )
} else if(i >=7 &&
i <= 10 ){
return( "Gold" )
} else if(i >=11 &&
i <= 14 ){
return( "Platinum" )
}
)
# view the results
head(Loyalty, 10)
## CustomerID Monetary Recency Frequency RScore FScore MScore Score Badge
## 1 2354 11986.54 290 days 14 5 1 5 11 Platinum
## 2 2361 11025.96 290 days 16 5 1 5 11 Platinum
## 3 2415 7259.69 290 days 33 5 1 5 11 Platinum
## 4 2427 6992.27 290 days 21 5 1 5 11 Platinum
## 5 2456 6263.44 290 days 7 5 1 5 11 Platinum
## 6 2485 5841.24 290 days 47 5 1 5 11 Platinum
## 7 2519 5470.27 290 days 24 5 1 5 11 Platinum
## 8 2551 5200.53 290 days 13 5 1 5 11 Platinum
## 9 2590 4967.06 290 days 12 5 1 5 11 Platinum
## 10 2640 4726.38 290 days 31 5 1 5 11 Platinum
Create a piechart for customer representation.
#df = Loyalty
#table(Loyalty$Badge)
#Manually extracted the value of each badge for now!
x <- c(967, 3393, 640)
labels <- c("Bronze", "Gold", "Platinum")
# Plot the chart.
pie(x, labels, main = "Customer Segmentation", col = rainbow(length(x)))