Overview

In my prior experience, I worked in the marketing & analytics team. I helped brands/clients like Footlocker, Macy’s with large scale marketing campaigns, strategic & creative direction, budget planning, ad optimization and reporting, and technology integration with the sole aim of driving awareness and online sales of products on their eCommerce platforms. In this Vignette, I am working on a sample sales record that you’d get from a typical eCommerce site to identify the most loyal customers using the RFM method.

RFM Method

Loading tidyverse library and importing a sample data set

I am also attempting to plot the data using ggplot to show recency, frequency and monetary visualization. I am loading the tidyverse library in order to use ggplot.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

To import the CSV to R Studio IDE-

I downloaded the Sales.csv (containing the sales record) from Kaggle. If you go to File> Import Dataset> From Text (base), you can locate the CSV file saved to your computer. After I click Import, it produces the R Code- read.csv which is a command that can read obviously .csv files.So that’s comma separated value files. Commas act as separators in the file. The same GUI can be used to import .txt files. Now, if you are importing files from an excel sheet, this GUI will not work. However, you can open an excel and save your file as .csv and then run this command.

Alternatively, you can import the dataset (using the same command) by specifying the path name as shown below.-

Sales <- read.csv("~/Desktop/Sales.csv")

# To sort the data, use the order( ) function. By default, sorting is ASCENDING.

Salesorted <- Sales[order(Sales$index), ] 

# To show the first 5 or any number of rows from the beginning, use head(df, number of rows)
head(Salesorted, 5)
##   index CustomerID TOTAL_ORDERS  REVENUE AVERAGE_ORDER_VALUE CARRIAGE_REVENUE
## 1     0       2354          124 11986.54               96.67           529.59
## 2     1       2361           82 11025.96              134.46            97.92
## 3     2       2415           43  7259.69              168.83           171.69
## 4     3       2427           44  6992.27              158.92            92.82
## 5     4       2456           55  6263.44              113.88           179.04
##   AVERAGESHIPPING FIRST_ORDER_DATE LATEST_ORDER_DATE AVGDAYSBETWEENORDERS
## 1            4.27       2016-12-30        2021-10-24                14.19
## 2            1.19       2018-03-31        2021-10-24                15.89
## 3            3.99       2017-11-30        2021-10-24                33.12
## 4            2.11       2019-04-09        2021-10-24                21.11
## 5            3.26       2020-10-23        2021-10-24                 6.65
##   DAYSSINCELASTORDER MONDAY_ORDERS TUESDAY_ORDERS WEDNESDAY_ORDERS
## 1                  1            13             13               29
## 2                  1            11             13               10
## 3                  1             5              4                3
## 4                  1            10              8                5
## 5                  1             2              3                8
##   THURSDAY_ORDERS FRIDAY_ORDERS SATURDAY_ORDERS SUNDAY_ORDERS MONDAY_REVENUE
## 1              25            19              15            10        1287.50
## 2              13            14              10            11        1253.08
## 3               5             5               8            13         698.95
## 4               8             5               3             5        1753.95
## 5               8            12              10            12         150.98
##   TUESDAY_REVENUE WEDNESDAY_REVENUE THURSDAY_REVENUE FRIDAY_REVENUE
## 1         1054.15           2842.52          2133.79        2051.94
## 2         2003.25           1819.75          1114.83        1791.86
## 3          504.97            197.97          1182.97         733.96
## 4         1400.23            570.48          1154.06         798.07
## 5          330.38           1090.23           563.63        1339.44
##   SATURDAY_REVENUE SUNDAY_REVENUE WEEK1_DAY01_DAY07_ORDERS
## 1          1496.33        1120.31                       28
## 2          1624.34        1418.85                       18
## 3          1223.96        2716.91                        9
## 4           630.00         685.48                       12
## 5           956.56        1832.22                       10
##   WEEK2_DAY08_DAY15_ORDERS WEEK3_DAY16_DAY23_ORDERS WEEK4_DAY24_DAY31_ORDERS
## 1                       42                       30                       24
## 2                       19                       19                       26
## 3                       11                        6                       17
## 4                       15                        9                        8
## 5                       18                       21                        6
##   WEEK1_DAY01_DAY07_REVENUE WEEK2_DAY08_DAY15_REVENUE WEEK3_DAY16_DAY23_REVENUE
## 1                   2685.37                   4299.28                   2592.18
## 2                   1336.09                   2776.02                   2807.66
## 3                   2299.93                   1383.92                    713.94
## 4                   2317.95                   2417.22                    997.02
## 5                    831.14                   1938.18                   2725.66
##   WEEK4_DAY24_DAY31_REVENUE TIME_0000_0600_ORDERS TIME_0601_1200_ORDERS
## 1                   2409.71                    12                    61
## 2                   4106.19                     1                    13
## 3                   2861.90                     0                     6
## 4                   1260.08                     0                    10
## 5                    768.46                     6                     9
##   TIME_1200_1800_ORDERS TIME_1801_2359_ORDERS TIME_0000_0600_REVENUE
## 1                    37                    14                1067.92
## 2                    24                    44                  30.25
## 3                    12                    25                   0.00
## 4                    13                    21                   0.00
## 5                    27                    13                 501.34
##   TIME_0601_1200_REVENUE TIME_1200_1800_REVENUE TIME_1801_2359_REVENUE
## 1                5674.65                3916.53                1327.44
## 2                3143.69                2205.47                5646.55
## 3                 707.98                2560.91                3990.80
## 4                1908.99                2025.05                3058.23
## 5                1197.83                3004.77                1559.50

We will store the parameters used in RFM analysis in a new dataset

The Sales.csv dataset has 5000 observations across 41 variables, also referred to as columns.

We will use a few variables from the dataset, namely: CustomerID, Revenue, Latest Order Date, and Average Days Between Orders.

RFM <- Salesorted[,c("CustomerID","TOTAL_ORDERS","REVENUE","LATEST_ORDER_DATE","AVGDAYSBETWEENORDERS")]
head(RFM, 10)
##    CustomerID TOTAL_ORDERS  REVENUE LATEST_ORDER_DATE AVGDAYSBETWEENORDERS
## 1        2354          124 11986.54        2021-10-24                14.19
## 2        2361           82 11025.96        2021-10-24                15.89
## 3        2415           43  7259.69        2021-10-24                33.12
## 4        2427           44  6992.27        2021-10-24                21.11
## 5        2456           55  6263.44        2021-10-24                 6.65
## 6        2485           49  5841.24        2021-10-24                46.57
## 7        2519           43  5470.27        2021-10-24                23.58
## 8        2551           54  5200.53        2021-10-24                13.06
## 9        2590           19  4967.06        2021-10-24                12.37
## 10       2640           21  4726.38        2021-10-24                31.29

1. To find Recency score: We calculate the difference between today’s date and date of last purchase

For example: If today is 22nd October and last date of purchase on the record of a particular customer is 20th October, the Recency Score would be (24-22)= 2

2. To find Frequency score: We will use the the data of the column ‘AVGDAYSBETWEENORDERS’ which gives the average numbers of days between each order. We will call this variable Frequency.

For example: If there is a product being purchased every week by a customer, their Frequency score will be 7.

3. To find Monetary score: We Calculate the Total Revenue generated by a customer.

This doesn’t need to be calculated. There is a separate column for total revenue generated by the customer.

Here is the code that bungs in all the three components to calculate the RFM score of a customer.

#The as.Date() is a built-in R function that converts between character representations and class “Date” objects representing the calendar dates. 
today <- as.Date("2022-08-10")
RFMCalculation <- RFM %>%
  mutate(
   
    Recency = today - as.Date(LATEST_ORDER_DATE),
#Rounds values to specified number of decimal places.
        Frequency = round(AVGDAYSBETWEENORDERS, 0),
    Monetary= REVENUE,
  
         ) 

#The following code is used to display only a number of variables of your choosing. 

RFMScore <- RFMCalculation[,c("CustomerID","Monetary","Recency","Frequency")]

head(RFMScore, 10)
##    CustomerID Monetary  Recency Frequency
## 1        2354 11986.54 290 days        14
## 2        2361 11025.96 290 days        16
## 3        2415  7259.69 290 days        33
## 4        2427  6992.27 290 days        21
## 5        2456  6263.44 290 days         7
## 6        2485  5841.24 290 days        47
## 7        2519  5470.27 290 days        24
## 8        2551  5200.53 290 days        13
## 9        2590  4967.06 290 days        12
## 10       2640  4726.38 290 days        31

Monetary- Data Visualization

Summary <- RFMScore %>%
#The summary is a built-in R function used to produce result summaries of various model fitting functions. 
  summarise(
    Number_of_Rows = n(),
            Monetary_mean = mean(Monetary),
            Monetary_SD = sd(Monetary),
            Monetary_Max = max(Monetary),
            Monetary_Min = min(Monetary),
           # rnorm(number_of_rows,Rev_mean, Rev_SD )
          )
# head() shows the first 5 rows, knitr::kable() makes nice tables in RMarkdown

Summary %>% head() %>% knitr::kable()
Number_of_Rows Monetary_mean Monetary_SD Monetary_Max Monetary_Min
5000 1681.524 1998.619 34847.4 38.5

Density Distribution:
Here we are showing a density plot which is a representation of the distribution of a numeric variable that uses a kernel density estimate to show the probability density function of the variable. It shows the probability of occurrence of a variable within a specified range. Larger the area (called Probability area), higher the probability of its occurrence.

# install.packages("ggplot2")
library(ggplot2)
library(dplyr)

# Setting the x axis-range using the Filter functin
# You could spruce up your distribution graph by passing parameters like color code to Geom_density function
RFMScore %>%
  filter(Monetary<10000 ) %>%
  ggplot( aes(x=Monetary)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

Frequency- Data Visualisation

Plotting the same density plot for Frequency variable.

Summary <- RFMScore %>%
 
  summarise(
    Number_of_Rows = n(),
            Frequency_mean = mean(Frequency),
            Frequency_SD = sd(Frequency),
            Frequency_Max = max(Frequency),
            Frequency_Min = min(Frequency),
           # rnorm(number_of_rows,Rev_mean, Rev_SD )
            )
Summary %>% head() %>% knitr::kable()  
Number_of_Rows Frequency_mean Frequency_SD Frequency_Max Frequency_Min
5000 163.1568 259.7011 1410 0

Density Distribution:

# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Frequency <- RFMScore$Frequency
#hist(Frequency)

# Data
RFMScore %>%
  filter(Frequency<1000 ) %>%
  ggplot( aes(x=Frequency)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

Recency- Data Visualisation:

Plotting the same density plot for Recency variable.

Summary <- RFMScore %>%
 
  summarise(
    Number_of_Rows = n(),
            Recency_mean = mean(Recency),
            Recency_SD = sd(Recency),
            Recency_Max = max(Recency),
            Recency_Min = min(Recency),
           # rnorm(number_of_rows,Rev_mean, Rev_SD )
            )
Summary %>% head() %>% knitr::kable()  
Number_of_Rows Recency_mean Recency_SD Recency_Max Recency_Min
5000 376.42 days 80.15651 496 days 290 days

Density Distribution:

# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
#Recency <- RFMScore$Recency
#hist(Recency)


# Data
RFMScore %>%
  filter(Recency<1000 ) %>%
  ggplot( aes(x=Recency)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

Calculating Quantiles

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. All the observations of a variable get sorted in an ascending order of their numerical value. In the example code below; [0-25] is the first quantile, [25-50] is the second quantile and so on. So, at the 25%, the observation of the Recency variable is 296. Second quantile can also be a median if there are odd number of observations.

quantile(RFMScore$Recency, probs = seq(0, 1, 1/4))
## Time differences in days
##     0%    25%    50%    75%   100% 
## 290.00 296.00 357.00 460.25 496.00

Assigning a score- By calculating percentile using Quantiles

We can also use Quantiles to split observations in terms of percentiles. For eg: we can find out a range of observations that are above 99th percentile. This can help us assign a score to observations depending on the percentile range (score of 1 for 15th to 29th, 2 for 29th to 59th) that they belong to.

Calculating the Recency score of the customer.

#Quantiles are given as decimal values, values such as 0.9992, 0.8492, and 0.5992. A 0.95 quantile point is exactly the same as the 95th percentile point.

interested.quantiles <-
  quantile( x = RFMScore$Recency
            , probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
  )
  
# A score is getting assigned to all observations in the Recency table. 
# However, higher the Recency (number of days), lower the loyalty score a customer will get. So the top percentiles will have low scores. 
#sapply() isuseful for operations on list objects and returns a list object of same length of original set. 
RFMScore$RScore <-
  sapply( X = RFMScore$Recency
          , FUN = function( i )
            if( i >= interested.quantiles[ 1 ] ){
Value=1
              return(Value )

            } else if( i < interested.quantiles[ 1 ] &&
                       i >= interested.quantiles[ 2 ] ){
Value=2

              return( Value )

            } else if( i < interested.quantiles[ 2 ] &&
                       i >= interested.quantiles[ 3 ] ){
Value=3
              return( Value )

            } else if( i < interested.quantiles[ 3 ] &&
                       i >= interested.quantiles[ 4 ] ){
Value=4
              return( Value )

            } else if( i < interested.quantiles[ 4 ] &&
                       i >= interested.quantiles[ 5 ] ){
Value=5
              return( Value )

            } else if( i < interested.quantiles[ 5 ] ){
Value=6
              return( Value)
            }

  )

# view the results


head(RFMScore, 10)
##    CustomerID Monetary  Recency Frequency RScore
## 1        2354 11986.54 290 days        14      5
## 2        2361 11025.96 290 days        16      5
## 3        2415  7259.69 290 days        33      5
## 4        2427  6992.27 290 days        21      5
## 5        2456  6263.44 290 days         7      5
## 6        2485  5841.24 290 days        47      5
## 7        2519  5470.27 290 days        24      5
## 8        2551  5200.53 290 days        13      5
## 9        2590  4967.06 290 days        12      5
## 10       2640  4726.38 290 days        31      5

Calculating the Frequency score.

interested.quantiles <-
  quantile( x = RFMScore$Recency
            , probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
  )
  
#Here, greater the Frequency value, higher the score. 
RFMScore$FScore <-
  sapply( X = RFMScore$Frequency
          , FUN = function( i )
            if( i >= interested.quantiles[ 1 ] ){
Value=6
              return(Value )

            } else if( i < interested.quantiles[ 1 ] &&
                       i >= interested.quantiles[ 2 ] ){
Value=5

              return( Value )

            } else if( i < interested.quantiles[ 2 ] &&
                       i >= interested.quantiles[ 3 ] ){
Value=4
              return( Value )

            } else if( i < interested.quantiles[ 3 ] &&
                       i >= interested.quantiles[ 4 ] ){
Value=3
              return( Value )

            } else if( i < interested.quantiles[ 4 ] &&
                       i >= interested.quantiles[ 5 ] ){
Value=2
              return( Value )

            } else if( i < interested.quantiles[ 5 ] ){
Value=1
             return( Value)
            }

  )

# view the results


head(RFMScore, 10)
##    CustomerID Monetary  Recency Frequency RScore FScore
## 1        2354 11986.54 290 days        14      5      1
## 2        2361 11025.96 290 days        16      5      1
## 3        2415  7259.69 290 days        33      5      1
## 4        2427  6992.27 290 days        21      5      1
## 5        2456  6263.44 290 days         7      5      1
## 6        2485  5841.24 290 days        47      5      1
## 7        2519  5470.27 290 days        24      5      1
## 8        2551  5200.53 290 days        13      5      1
## 9        2590  4967.06 290 days        12      5      1
## 10       2640  4726.38 290 days        31      5      1

Calculating the Monetary score.

interested.quantiles <-
  quantile( x = RFMScore$Monetary
            , probs = c( 0.9992, 0.8492, 0.5992, 0.2992, 0.1592 )
  )
  

# More revenue, more points!
RFMScore$MScore <-
  sapply( X = RFMScore$Monetary
          , FUN = function( i )
            if( i >= interested.quantiles[ 1 ] ){
Value=6
              return(Value )

            } else if( i < interested.quantiles[ 1 ] &&
                       i >= interested.quantiles[ 2 ] ){
Value=5

              return( Value )

            } else if( i < interested.quantiles[ 2 ] &&
                       i >= interested.quantiles[ 3 ] ){
Value=4
              return( Value )

            } else if( i < interested.quantiles[ 3 ] &&
                       i >= interested.quantiles[ 4 ] ){
Value=3
              return( Value )

            } else if( i < interested.quantiles[ 4 ] &&
                       i >= interested.quantiles[ 5 ] ){
Value=2
              return( Value )

            } else if( i < interested.quantiles[ 5 ] ){
Value=1
             return( Value)
            }

  )

# view the results


head(RFMScore, 10)
##    CustomerID Monetary  Recency Frequency RScore FScore MScore
## 1        2354 11986.54 290 days        14      5      1      5
## 2        2361 11025.96 290 days        16      5      1      5
## 3        2415  7259.69 290 days        33      5      1      5
## 4        2427  6992.27 290 days        21      5      1      5
## 5        2456  6263.44 290 days         7      5      1      5
## 6        2485  5841.24 290 days        47      5      1      5
## 7        2519  5470.27 290 days        24      5      1      5
## 8        2551  5200.53 290 days        13      5      1      5
## 9        2590  4967.06 290 days        12      5      1      5
## 10       2640  4726.38 290 days        31      5      1      5

Summing Up (Literally!)

We will add all the three scores. Needless to say, customers with the highest score are the most loyal.

Loyalty <- RFMScore %>%
  mutate(
   
   Score= RScore + FScore + MScore
  
         ) 

#The following code is used to display only a number of variables of your choosing. 

head(Loyalty[,c("CustomerID","Score")],30) 
##    CustomerID Score
## 1        2354    11
## 2        2361    11
## 3        2415    11
## 4        2427    11
## 5        2456    11
## 6        2485    11
## 7        2519    11
## 8        2551    11
## 9        2590    11
## 10       2640    11
## 11       2823    11
## 12       2824    11
## 13       2904    11
## 14       2927    11
## 15       2945    11
## 16       3001    11
## 17       3003    11
## 18       3072    11
## 19       3082    11
## 20       3117    10
## 21       3127    10
## 22       3138    10
## 23       3186    10
## 24       3196    10
## 25       3234    10
## 26       3260    10
## 27       3279    10
## 28       3304    10
## 29       3332    10
## 30       3407    10

Segmenting Customers

df = Loyalty
table(Loyalty$Score)
## 
##    3    4    5    6    7    8    9   10   11   12   13   14 
##  475  225  234   33  293  829 1362  909  367  168   78   27

Assigning a badge to customers 1. Customers with a score of 3-6 will be assigned a Bronze badge. 2. Customers with a score of 7-10 will be assigned a Gold badge. 3. Customers with a score of 11-14 will be assigned a Platinum badge.

# More revenue, more points!
Loyalty$Badge <-
  sapply( X = Loyalty$Score
          , FUN = function( i )
            if( i >=3 &&
                       i <= 6 ){

              return("Bronze" )

            } else if(i >=7 &&
                       i <= 10 ){

              return( "Gold" )

            } else if(i >=11 &&
                       i <= 14 ){

              return( "Platinum"  )

            } 

  )

# view the results


head(Loyalty, 10)
##    CustomerID Monetary  Recency Frequency RScore FScore MScore Score    Badge
## 1        2354 11986.54 290 days        14      5      1      5    11 Platinum
## 2        2361 11025.96 290 days        16      5      1      5    11 Platinum
## 3        2415  7259.69 290 days        33      5      1      5    11 Platinum
## 4        2427  6992.27 290 days        21      5      1      5    11 Platinum
## 5        2456  6263.44 290 days         7      5      1      5    11 Platinum
## 6        2485  5841.24 290 days        47      5      1      5    11 Platinum
## 7        2519  5470.27 290 days        24      5      1      5    11 Platinum
## 8        2551  5200.53 290 days        13      5      1      5    11 Platinum
## 9        2590  4967.06 290 days        12      5      1      5    11 Platinum
## 10       2640  4726.38 290 days        31      5      1      5    11 Platinum

Create a piechart for customer representation.

#df = Loyalty
#table(Loyalty$Badge)
#Manually extracted the value of each badge for now!

x <- c(967, 3393, 640)

labels <- c("Bronze", "Gold", "Platinum")



# Plot the chart.
pie(x, labels, main = "Customer Segmentation", col = rainbow(length(x)))