In my prior experience, I worked in the marketing & analytics team. I helped brands/clients like Footlocker, Macy’s with large scale marketing campaigns, strategic & creative direction, budget planning, ad optimization and reporting, and technology integration with the sole aim of driving awareness and online sales of products on their eCommerce platforms. In this Vignette, I am working on a sample sales record that you’d get from a typical eCommerce site to identify the most loyal customers using the RFM method.
About me:
I did my engineering in computer science. While I worked the first couple of years as a mobile developer, I eventually moved to project management/ consulting role. I had been in the digital marketing space for many years. Wearing the PM hat allowed me to work on various aspects of digital marketing (marketing analytics, development, content seeding, branding, etc). In the last 4 years of working in Deloitte, I worked largely in the marketing team running large scale ad campaigns for Costco, Macy’s and many other e-Retail brands. My strength lies in my resourcefulness to find a solution to a problem. I’d learn the topic in-depth. As part of the process, I learnt basics statistics and a few ML techniques (that were implemented by the technical team) to be able to work on marketing reports for my clients and help them in the decision making process. I have some programming background. I am well versed in digital advertising especially web/ad analytics, clubbed with project management skills. R and Statistics is new for me. I am looking forward to learning the STDS course so that I can come up with interesting insights that are backed by numbers and based on visualization techniques which I will be learning in R and elsewhere.
I am also attempting to plot the data using ggplot to show recency, frequency and monetary visualization. I am loading the tidyverse library in order to use ggplot.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
To import the CSV to R Studio IDE-
I downloaded the Sales.csv (containing the sales record) from Kaggle. If you go to File> Import Dataset> From Text (base), you can locate the CSV file saved to your computer. After I click Import, it produces the R Code- read.csv which is a command that can read obviously .csv files.So that’s comma separated value files. Commas act as separators in the file. The same GUI can be used to import .txt files. Now, if you are importing files from an excel sheet, this GUI will not work. However, you can open an excel and save your file as .csv and then run this command.
Alternatively, you can import the dataset (using the same command) by specifying the path name as shown below.-
Sales <- read.csv("~/Desktop/Sales.csv")
# To sort the data, use the order( ) function. By default, sorting is ASCENDING.
Salesorted <- Sales[order(Sales$index), ]
# To show the first 5 or any number of rows from the beginning, use head(df, number of rows)
head(Salesorted, 5)
The Sales.csv dataset has 5000 observations across 41 variables, also referred to as columns.
We will use a few variables from the dataset, namely: CustomerID, Revenue, Latest Order Date, and Average Days Between Orders.
RFM <- Salesorted[,c("CustomerID","TOTAL_ORDERS","REVENUE","LATEST_ORDER_DATE","AVGDAYSBETWEENORDERS")]
head(RFM, 10)
1. To find Recency score: We calculate the difference between today’s date and date of last purchase
For example: If today is 22nd October and last date of purchase on the record of a particular customer is 20th October, the Recency Score would be (24-22)= 2
2. To find Frequency score: We will use the the data of the column ‘AVGDAYSBETWEENORDERS’ which gives the average numbers of days between each order. We will call this variable Frequency.
For example: If there is a product being purchased every week by a customer, their Frequency score will be 7.
3. To find Monetary score: We Calculate the Total Revenue generated by a customer.
This doesn’t need to be calculated. There is a separate column for total revenue generated by the customer.
Here is the code that bungs in all the three components to calculate the RFM score of a customer.
#The as.Date() is a built-in R function that converts between character representations and class “Date” objects representing the calendar dates.
today <- as.Date("2022-08-10")
RFMCalculation <- RFM %>%
mutate(
Recency = today - as.Date(LATEST_ORDER_DATE),
#Rounds values to specified number of decimal places.
Frequency = round(AVGDAYSBETWEENORDERS, 0),
Monetary= REVENUE,
)
#The following code is used to display only a number of variables of your choosing.
RFMScore <- RFMCalculation[,c("CustomerID","Monetary","Recency","Frequency")]
head(RFMScore, 10)
Summary <- RFMScore %>%
#The summary is a built-in R function used to produce result summaries of various model fitting functions.
summarise(
Number_of_Rows = n(),
Monetary_mean = mean(Monetary),
Monetary_SD = sd(Monetary),
Monetary_Max = max(Monetary),
Monetary_Min = min(Monetary),
# rnorm(number_of_rows,Rev_mean, Rev_SD )
)
# head() shows the first 5 rows, knitr::kable() makes nice tables in RMarkdown
Summary %>% head() %>% knitr::kable()
| Number_of_Rows | Monetary_mean | Monetary_SD | Monetary_Max | Monetary_Min |
|---|---|---|---|---|
| 5000 | 1681.524 | 1998.619 | 34847.4 | 38.5 |
Density Distribution:
Here we are showing a density plot which is a representation of the
distribution of a numeric variable that uses a kernel density estimate
to show the probability density function of the variable. It shows the
probability of occurrence of a variable within a specified range. Larger
the area (called Probability area), higher the probability of its
occurrence.
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
# Setting the x axis-range using the Filter functin
# You could spruce up your distribution graph by passing parameters like color code to Geom_density function
RFMScore %>%
filter(Monetary<10000 ) %>%
ggplot( aes(x=Monetary)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
Density Distribution:
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Frequency <- RFMScore$Frequency
#hist(Frequency)
# Data
RFMScore %>%
filter(Frequency<1000 ) %>%
ggplot( aes(x=Frequency)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
Recency- Data Visualisation:
Plotting the same density plot for Recency variable.
Summary <- RFMScore %>%
summarise(
Number_of_Rows = n(),
Recency_mean = mean(Recency),
Recency_SD = sd(Recency),
Recency_Max = max(Recency),
Recency_Min = min(Recency),
# rnorm(number_of_rows,Rev_mean, Rev_SD )
)
Summary %>% head() %>% knitr::kable()
| Number_of_Rows | Recency_mean | Recency_SD | Recency_Max | Recency_Min |
|---|---|---|---|---|
| 5000 | 376.42 days | 80.15651 | 496 days | 290 days |
Density Distribution:
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Recency <- RFMScore$Recency
#hist(Recency)
# Data
RFMScore %>%
filter(Recency<1000 ) %>%
ggplot( aes(x=Recency)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. All the observations of a variable get sorted in an ascending order of their numerical value. In the example code below; [0-25] is the first quantile, [25-50] is the second quantile and so on. So, at the 25%, the observation of the Recency variable is 296. Second quantile can also be a median if there are odd number of observations.
quantile(RFMScore$Recency, probs = seq(0, 1, 1/4))
## Time differences in days
## 0% 25% 50% 75% 100%
## 290.00 296.00 357.00 460.25 496.00
We can also use Quantiles to split observations in terms of percentiles. For eg: we can find out a range of observations that are above 99th percentile. This can help us assign a score to observations depending on the percentile range (score of 1 for 15th to 29th, 2 for 29th to 59th) that they belong to.
Calculating the Recency score of the customer.
#Quantiles are given as decimal values, values such as 0.9992, 0.8492, and 0.5992. A 0.95 quantile point is exactly the same as the 95th percentile point.
interested.quantiles <-
quantile( x = RFMScore$Recency
, probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
)
# A score is getting assigned to all observations in the Recency table.
# However, higher the Recency (number of days), lower the loyalty score a customer will get. So the top percentiles will have low scores.
#sapply() isuseful for operations on list objects and returns a list object of same length of original set.
RFMScore$RScore <-
sapply( X = RFMScore$Recency
, FUN = function( i )
if( i >= interested.quantiles[ 1 ] ){
Value=1
return(Value )
} else if( i < interested.quantiles[ 1 ] &&
i >= interested.quantiles[ 2 ] ){
Value=2
return( Value )
} else if( i < interested.quantiles[ 2 ] &&
i >= interested.quantiles[ 3 ] ){
Value=3
return( Value )
} else if( i < interested.quantiles[ 3 ] &&
i >= interested.quantiles[ 4 ] ){
Value=4
return( Value )
} else if( i < interested.quantiles[ 4 ] &&
i >= interested.quantiles[ 5 ] ){
Value=5
return( Value )
} else if( i < interested.quantiles[ 5 ] ){
Value=6
return( Value)
}
)
# view the results
head(RFMScore, 10)
Calculating the Frequency score.
interested.quantiles <-
quantile( x = RFMScore$Recency
, probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
)
#Here, greater the Frequency value, higher the score.
RFMScore$FScore <-
sapply( X = RFMScore$Frequency
, FUN = function( i )
if( i >= interested.quantiles[ 1 ] ){
Value=6
return(Value )
} else if( i < interested.quantiles[ 1 ] &&
i >= interested.quantiles[ 2 ] ){
Value=5
return( Value )
} else if( i < interested.quantiles[ 2 ] &&
i >= interested.quantiles[ 3 ] ){
Value=4
return( Value )
} else if( i < interested.quantiles[ 3 ] &&
i >= interested.quantiles[ 4 ] ){
Value=3
return( Value )
} else if( i < interested.quantiles[ 4 ] &&
i >= interested.quantiles[ 5 ] ){
Value=2
return( Value )
} else if( i < interested.quantiles[ 5 ] ){
Value=1
return( Value)
}
)
# view the results
head(RFMScore, 10)
Calculating the Monetary score.
interested.quantiles <-
quantile( x = RFMScore$Monetary
, probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
)
# More revenue, more points!
RFMScore$MScore <-
sapply( X = RFMScore$Monetary
, FUN = function( i )
if( i >= interested.quantiles[ 1 ] ){
Value=6
return(Value )
} else if( i < interested.quantiles[ 1 ] &&
i >= interested.quantiles[ 2 ] ){
Value=5
return( Value )
} else if( i < interested.quantiles[ 2 ] &&
i >= interested.quantiles[ 3 ] ){
Value=4
return( Value )
} else if( i < interested.quantiles[ 3 ] &&
i >= interested.quantiles[ 4 ] ){
Value=3
return( Value )
} else if( i < interested.quantiles[ 4 ] &&
i >= interested.quantiles[ 5 ] ){
Value=2
return( Value )
} else if( i < interested.quantiles[ 5 ] ){
Value=1
return( Value)
}
)
# view the results
head(RFMScore, 10)
We will add all the three scores. Needless to say, customers with the highest score are the most loyal.
Loyalty <- RFMScore %>%
mutate(
Score= RScore + FScore + MScore
)
#The following code is used to display only a number of variables of your choosing.
head(Loyalty[,c("CustomerID","Score")],30)
Segmenting Customers
df = Loyalty
table(Loyalty$Score)
##
## 3 4 5 6 7 8 9 10 11 12 13 14
## 475 225 234 33 293 829 1362 909 367 168 78 27
Assigning a badge to customers
# More revenue, more points!
Loyalty$Badge <-
sapply( X = Loyalty$Score
, FUN = function( i )
if( i >=3 &&
i <= 6 ){
return("Bronze" )
} else if(i >=7 &&
i <= 10 ){
return( "Gold" )
} else if(i >=11 &&
i <= 14 ){
return( "Platinum" )
}
)
# view the results
head(Loyalty, 10)
Create a piechart for customer representation.
#df = Loyalty
#table(Loyalty$Badge)
#Manually extracted the value of each badge for now!
x <- c(967, 3393, 640)
labels <- c("Bronze", "Gold", "Platinum")
# Plot the chart.
pie(x, labels, main = "Customer Segmentation", col = rainbow(length(x)))
While this may not be the most efficient way of segmenting customers and predicting behaviors, it gives some insights in terms whom we should be
This will allow us to drive online sales while still keeping the ad spends in check.