Overview

In my prior experience, I worked in the marketing & analytics team. I helped brands/clients like Footlocker, Macy’s with large scale marketing campaigns, strategic & creative direction, budget planning, ad optimization and reporting, and technology integration with the sole aim of driving awareness and online sales of products on their eCommerce platforms. In this Vignette, I am working on a sample sales record that you’d get from a typical eCommerce site to identify the most loyal customers using the RFM method.

About me:

I did my engineering in computer science. While I worked the first couple of years as a mobile developer, I eventually moved to project management/ consulting role. I had been in the digital marketing space for many years. Wearing the PM hat allowed me to work on various aspects of digital marketing (marketing analytics, development, content seeding, branding, etc). In the last 4 years of working in Deloitte, I worked largely in the marketing team running large scale ad campaigns for Costco, Macy’s and many other e-Retail brands. My strength lies in my resourcefulness to find a solution to a problem. I’d learn the topic in-depth. As part of the process, I learnt basics statistics and a few ML techniques (that were implemented by the technical team) to be able to work on marketing reports for my clients and help them in the decision making process. I have some programming background. I am well versed in digital advertising especially web/ad analytics, clubbed with project management skills. R and Statistics is new for me. I am looking forward to learning the STDS course so that I can come up with interesting insights that are backed by numbers and based on visualization techniques which I will be learning in R and elsewhere.

RFM Method

RFM stands for Recency, Frequency, and Monetary Score which is a method to find your loyal customers.
It gives score to each customers based on their order frequency, total order cost & Last order date.
We will assign an RFM Score to each customer using a certain rating method which is explained in the later sections of this tutorial.

Loading tidyverse library and importing a sample data set

I am also attempting to plot the data using ggplot to show recency, frequency and monetary visualization. I am loading the tidyverse library in order to use ggplot.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

To import the CSV to R Studio IDE-

I downloaded the Sales.csv (containing the sales record) from Kaggle. If you go to File> Import Dataset> From Text (base), you can locate the CSV file saved to your computer. After I click Import, it produces the R Code- read.csv which is a command that can read obviously .csv files.So that’s comma separated value files. Commas act as separators in the file. The same GUI can be used to import .txt files. Now, if you are importing files from an excel sheet, this GUI will not work. However, you can open an excel and save your file as .csv and then run this command.

Alternatively, you can import the dataset (using the same command) by specifying the path name as shown below.-

Sales <- read.csv("~/Desktop/Sales.csv")

# To sort the data, use the order( ) function. By default, sorting is ASCENDING.

Salesorted <- Sales[order(Sales$index), ] 

# To show the first 5 or any number of rows from the beginning, use head(df, number of rows)
head(Salesorted, 5)

We will store the parameters used in RFM analysis in a new dataset

The Sales.csv dataset has 5000 observations across 41 variables, also referred to as columns.

We will use a few variables from the dataset, namely: CustomerID, Revenue, Latest Order Date, and Average Days Between Orders.

RFM <- Salesorted[,c("CustomerID","TOTAL_ORDERS","REVENUE","LATEST_ORDER_DATE","AVGDAYSBETWEENORDERS")]
head(RFM, 10)

1. To find Recency score: We calculate the difference between today’s date and date of last purchase

For example: If today is 22nd October and last date of purchase on the record of a particular customer is 20th October, the Recency Score would be (24-22)= 2

2. To find Frequency score: We will use the the data of the column ‘AVGDAYSBETWEENORDERS’ which gives the average numbers of days between each order. We will call this variable Frequency.

For example: If there is a product being purchased every week by a customer, their Frequency score will be 7.

3. To find Monetary score: We Calculate the Total Revenue generated by a customer.

This doesn’t need to be calculated. There is a separate column for total revenue generated by the customer.

Here is the code that bungs in all the three components to calculate the RFM score of a customer.

#The as.Date() is a built-in R function that converts between character representations and class “Date” objects representing the calendar dates. 
today <- as.Date("2022-08-10")
RFMCalculation <- RFM %>%
  mutate(
   
    Recency = today - as.Date(LATEST_ORDER_DATE),
#Rounds values to specified number of decimal places.
        Frequency = round(AVGDAYSBETWEENORDERS, 0),
    Monetary= REVENUE,
  
         ) 

#The following code is used to display only a number of variables of your choosing. 

RFMScore <- RFMCalculation[,c("CustomerID","Monetary","Recency","Frequency")]

head(RFMScore, 10)

Monetary- Data Visualization

Summary <- RFMScore %>%
#The summary is a built-in R function used to produce result summaries of various model fitting functions. 
  summarise(
    Number_of_Rows = n(),
            Monetary_mean = mean(Monetary),
            Monetary_SD = sd(Monetary),
            Monetary_Max = max(Monetary),
            Monetary_Min = min(Monetary),
           # rnorm(number_of_rows,Rev_mean, Rev_SD )
          )
# head() shows the first 5 rows, knitr::kable() makes nice tables in RMarkdown

Summary %>% head() %>% knitr::kable()

Number_of_Rows	Monetary_mean	Monetary_SD	Monetary_Max	Monetary_Min
5000	1681.524	1998.619	34847.4	38.5

Density Distribution:
Here we are showing a density plot which is a representation of the distribution of a numeric variable that uses a kernel density estimate to show the probability density function of the variable. It shows the probability of occurrence of a variable within a specified range. Larger the area (called Probability area), higher the probability of its occurrence.

# install.packages("ggplot2")
library(ggplot2)
library(dplyr)

# Setting the x axis-range using the Filter functin
# You could spruce up your distribution graph by passing parameters like color code to Geom_density function
RFMScore %>%
  filter(Monetary<10000 ) %>%
  ggplot( aes(x=Monetary)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

Density Distribution:

# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Frequency <- RFMScore$Frequency
#hist(Frequency)

# Data
RFMScore %>%
  filter(Frequency<1000 ) %>%
  ggplot( aes(x=Frequency)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

Recency- Data Visualisation:

Plotting the same density plot for Recency variable.

Summary <- RFMScore %>%
 
  summarise(
    Number_of_Rows = n(),
            Recency_mean = mean(Recency),
            Recency_SD = sd(Recency),
            Recency_Max = max(Recency),
            Recency_Min = min(Recency),
           # rnorm(number_of_rows,Rev_mean, Rev_SD )
            )
Summary %>% head() %>% knitr::kable()

Number_of_Rows	Recency_mean	Recency_SD	Recency_Max	Recency_Min
5000	376.42 days	80.15651	496 days	290 days

Density Distribution:

# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Recency <- RFMScore$Recency
#hist(Recency)


# Data
RFMScore %>%
  filter(Recency<1000 ) %>%
  ggplot( aes(x=Recency)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

Calculating Quantiles

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. All the observations of a variable get sorted in an ascending order of their numerical value. In the example code below; [0-25] is the first quantile, [25-50] is the second quantile and so on. So, at the 25%, the observation of the Recency variable is 296. Second quantile can also be a median if there are odd number of observations.

quantile(RFMScore$Recency, probs = seq(0, 1, 1/4))

## Time differences in days
##     0%    25%    50%    75%   100% 
## 290.00 296.00 357.00 460.25 496.00

Assigning a score- By calculating percentile using Quantiles

We can also use Quantiles to split observations in terms of percentiles. For eg: we can find out a range of observations that are above 99th percentile. This can help us assign a score to observations depending on the percentile range (score of 1 for 15th to 29th, 2 for 29th to 59th) that they belong to.

Calculating the Recency score of the customer.

#Quantiles are given as decimal values, values such as 0.9992, 0.8492, and 0.5992. A 0.95 quantile point is exactly the same as the 95th percentile point.

interested.quantiles <-
  quantile( x = RFMScore$Recency
            , probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
  )
  
# A score is getting assigned to all observations in the Recency table. 
# However, higher the Recency (number of days), lower the loyalty score a customer will get. So the top percentiles will have low scores. 
#sapply() isuseful for operations on list objects and returns a list object of same length of original set. 
RFMScore$RScore <-
  sapply( X = RFMScore$Recency
          , FUN = function( i )
            if( i >= interested.quantiles[ 1 ] ){
Value=1
              return(Value )

            } else if( i < interested.quantiles[ 1 ] &&
                       i >= interested.quantiles[ 2 ] ){
Value=2

              return( Value )

            } else if( i < interested.quantiles[ 2 ] &&
                       i >= interested.quantiles[ 3 ] ){
Value=3
              return( Value )

            } else if( i < interested.quantiles[ 3 ] &&
                       i >= interested.quantiles[ 4 ] ){
Value=4
              return( Value )

            } else if( i < interested.quantiles[ 4 ] &&
                       i >= interested.quantiles[ 5 ] ){
Value=5
              return( Value )

            } else if( i < interested.quantiles[ 5 ] ){
Value=6
              return( Value)
            }

  )

# view the results


head(RFMScore, 10)

Calculating the Frequency score.

interested.quantiles <-
  quantile( x = RFMScore$Recency
            , probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
  )
  
#Here, greater the Frequency value, higher the score. 
RFMScore$FScore <-
  sapply( X = RFMScore$Frequency
          , FUN = function( i )
            if( i >= interested.quantiles[ 1 ] ){
Value=6
              return(Value )

            } else if( i < interested.quantiles[ 1 ] &&
                       i >= interested.quantiles[ 2 ] ){
Value=5

              return( Value )

            } else if( i < interested.quantiles[ 2 ] &&
                       i >= interested.quantiles[ 3 ] ){
Value=4
              return( Value )

            } else if( i < interested.quantiles[ 3 ] &&
                       i >= interested.quantiles[ 4 ] ){
Value=3
              return( Value )

            } else if( i < interested.quantiles[ 4 ] &&
                       i >= interested.quantiles[ 5 ] ){
Value=2
              return( Value )

            } else if( i < interested.quantiles[ 5 ] ){
Value=1
             return( Value)
            }

  )

# view the results


head(RFMScore, 10)

Calculating the Monetary score.

interested.quantiles <-
  quantile( x = RFMScore$Monetary
            , probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
  )
  

# More revenue, more points!
RFMScore$MScore <-
  sapply( X = RFMScore$Monetary
          , FUN = function( i )
            if( i >= interested.quantiles[ 1 ] ){
Value=6
              return(Value )

            } else if( i < interested.quantiles[ 1 ] &&
                       i >= interested.quantiles[ 2 ] ){
Value=5

              return( Value )

            } else if( i < interested.quantiles[ 2 ] &&
                       i >= interested.quantiles[ 3 ] ){
Value=4
              return( Value )

            } else if( i < interested.quantiles[ 3 ] &&
                       i >= interested.quantiles[ 4 ] ){
Value=3
              return( Value )

            } else if( i < interested.quantiles[ 4 ] &&
                       i >= interested.quantiles[ 5 ] ){
Value=2
              return( Value )

            } else if( i < interested.quantiles[ 5 ] ){
Value=1
             return( Value)
            }

  )

# view the results


head(RFMScore, 10)

Summing Up (Literally!)

We will add all the three scores. Needless to say, customers with the highest score are the most loyal.

Loyalty <- RFMScore %>%
  mutate(
   
   Score= RScore + FScore + MScore
  
         ) 

#The following code is used to display only a number of variables of your choosing. 

head(Loyalty[,c("CustomerID","Score")],30)

Segmenting Customers

df = Loyalty
table(Loyalty$Score)

## 
##    3    4    5    6    7    8    9   10   11   12   13   14 
##  475  225  234   33  293  829 1362  909  367  168   78   27

Assigning a badge to customers

Customers with a score of 3-6 will be assigned a Bronze badge.
Customers with a score of 7-10 will be assigned a Gold badge.
Customers with a score of 11-14 will be assigned a Platinum badge.

# More revenue, more points!
Loyalty$Badge <-
  sapply( X = Loyalty$Score
          , FUN = function( i )
            if( i >=3 &&
                       i <= 6 ){

              return("Bronze" )

            } else if(i >=7 &&
                       i <= 10 ){

              return( "Gold" )

            } else if(i >=11 &&
                       i <= 14 ){

              return( "Platinum"  )

            } 

  )

# view the results


head(Loyalty, 10)

Create a piechart for customer representation.

#df = Loyalty
#table(Loyalty$Badge)
#Manually extracted the value of each badge for now!

x <- c(967, 3393, 640)

labels <- c("Bronze", "Gold", "Platinum")



# Plot the chart.
pie(x, labels, main = "Customer Segmentation", col = rainbow(length(x)))

While this may not be the most efficient way of segmenting customers and predicting behaviors, it gives some insights in terms whom we should be

Retargeting- customers who haven’t purchased recently and have the propensity to buy from the platform again.
Remaketing- repeat customers who need incentives to want to buy more from the platform.

This will allow us to drive online sales while still keeping the ad spends in check.

AT1 - R Vignette

Ecommerce - RFM Model

Kiran Das