Customer Segmentation using RFM Analysis (R)

RFM (Recency, Frequency, Monetary) analysis is a marketing technique used to determine quantitatively which customers are the most valuable by examining how recently a customer has purchased (Recency), how often they purchase (Frequency), and how much they spend (Monetary).

https://help.synerise.com/use-cases/all-cases/_gfx/rfm1.png

One caveat: identifying the most valuable RFM segments can capitalize on chance relationships in the data used for this analysis, so any segments found should be validated before acting on them.

Load libraries:

library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
#library(stringr)
#library(DT)
library(tidyr)
library(knitr)
library(rmarkdown)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Data Loading:

#data <- read.csv("E:/Project/CustomerSegmentataion/data.csv")

data <- read.csv("D:/VisualStudioCode/R/CustomerSegmentationusingR/Project_1_Customer_Segmentation_using_R/Project_CustomerSegmentationUsingR/data.csv")
head(data)  # preview the first rows instead of printing the full data frame
data <- data.frame(data)
summary(data)
##    InvoiceNo        StockCode                                  Description    
##  573585 :  1114   85123A :  2313   WHITE HANGING HEART T-LIGHT HOLDER:  2369  
##  581219 :   749   22423  :  2203   REGENCY CAKESTAND 3 TIER          :  2200  
##  581492 :   731   85099B :  2159   JUMBO BAG RED RETROSPOT           :  2159  
##  580729 :   721   47566  :  1727   PARTY BUNTING                     :  1727  
##  558475 :   705   20725  :  1639   LUNCH BAG RED RETROSPOT           :  1638  
##  579777 :   687   84879  :  1502   ASSORTED COLOUR BIRD ORNAMENT     :  1501  
##  (Other):537202   (Other):530366   (Other)                           :530315  
##     Quantity                   InvoiceDate       UnitPrice        
##  Min.   :-80995.00   10/31/2011 14:41:  1114   Min.   :-11062.06  
##  1st Qu.:     1.00   12/08/2011 9:28 :   749   1st Qu.:     1.25  
##  Median :     3.00   12/09/2011 10:03:   731   Median :     2.08  
##  Mean   :     9.55   12/05/2011 17:24:   721   Mean   :     4.61  
##  3rd Qu.:    10.00   6/29/2011 15:58 :   705   3rd Qu.:     4.13  
##  Max.   : 80995.00   11/30/2011 15:13:   687   Max.   : 38970.00  
##                      (Other)         :537202                      
##    CustomerID               Country      
##  Min.   :12346    United Kingdom:495478  
##  1st Qu.:13953    Germany       :  9495  
##  Median :15152    France        :  8557  
##  Mean   :15288    EIRE          :  8196  
##  3rd Qu.:16791    Spain         :  2533  
##  Max.   :18287    Netherlands   :  2371  
##  NA's   :135080   (Other)       : 15279

Data cleaning:

glimpse(data)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo   <fct> 536365, 536365, 536365, 536365, 536365, 536365, 536365,...
## $ StockCode   <fct> 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730, 22...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, WHITE METAL LANTERN...
## $ Quantity    <int> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, ...
## $ InvoiceDate <fct> 12/01/2010 8:26, 12/01/2010 8:26, 12/01/2010 8:26, 12/0...
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1...
## $ CustomerID  <int> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
## $ Country     <fct> United Kingdom, United Kingdom, United Kingdom, United ...

Checking NA values:

sum(is.na(data))
## [1] 135080

Removing negative values from Quantity:

data <- data %>% filter(Quantity>=0)
sum(is.na(data))
## [1] 133361

Filtering the data frame: removing rows with NA values (missing CustomerID)

data <- data %>% filter(!is.na(CustomerID))
sum(is.na(data))
## [1] 0
#summary(data)

Number of customer records:

nrow(data)
## [1] 397924

Number of countries:

length(unique(data$Country))
## [1] 37

The customers are from 37 different countries. Let's visualize this.

#reorder the table and reset the factor to that ordering
data %>%
  group_by(Country) %>%
  summarize(counts = n()) %>%                        # count the records per country
  arrange(counts) %>%                                # sort by counts
  mutate(Country = factor(Country, Country)) %>%     # reset factor levels to that order
  ggplot(aes(x=Country, y=counts)) +                 # plot
    geom_bar(stat="identity") +                      # draw a bar chart
    coord_flip()                                     # flip the coordinates

The goods are shipped to 37 unique countries. The majority of goods are shipped to the United Kingdom. After the United Kingdom, Germany, France and EIRE (Ireland) are the most important countries.
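
Because the United Kingdom dominates this chart, it is hard to compare the remaining countries. A small sketch (reusing the same data frame) that redraws the plot without the UK:

data %>%
  filter(Country != "United Kingdom") %>%            # drop the dominant country
  count(Country, name = "counts") %>%                # records per country
  mutate(Country = reorder(Country, counts)) %>%     # order the bars by count
  ggplot(aes(x=Country, y=counts)) +
    geom_bar(stat="identity") +
    coord_flip()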

Feature Engineering:

Create a new variable called TotalPrice, which gives the total price (Quantity * UnitPrice) for each entry.

data$TotalPrice <- data$Quantity * data$UnitPrice
head(data)
#range 
#range(data$InvoiceDate)
data$InvoiceDateTime <- mdy_hm(data$InvoiceDate) #make datetime object
data$InvoiceDate <- ymd(date(data$InvoiceDateTime)) #make date variable
head(data)
data$InvoiceYear <- year(data$InvoiceDate)
data$InvoiceMonth <- month(data$InvoiceDate, label=T)
data$InvoiceWeekday <- wday(data$InvoiceDate, label=T)
data$InvoiceHour <- hour(data$InvoiceDateTime) #take the hour from the datetime, not the date, so it is not always 0
#data

Here we have the number of transactions per month for 2011.

timedata <- data %>% 
  filter(InvoiceYear==2011) %>% 
  count(InvoiceMonth)  #count the number of invoices per month for 2011

ggplot(timedata, aes(InvoiceMonth, n)) +  #plot the number of invoices per month
  geom_col() +
  labs(x="Month", y="Number of invoices")

The number of transactions rises from September onwards and peaks in November. December shows the lowest count, but note that the data for December 2011 only runs up to 9 December (see the check below).
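
A quick sanity check (a small sketch reusing the data frame built above) shows the last invoice date in December 2011:

data %>%
  filter(InvoiceYear == 2011, InvoiceMonth == "Dec") %>%
  summarise(first_invoice = min(InvoiceDate),
            last_invoice  = max(InvoiceDate))   # last invoice in December 2011 is 2011-12-09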

timedata <- data %>% 
  filter(InvoiceYear==2011) %>% 
  count(InvoiceWeekday)

ggplot(timedata, aes(InvoiceWeekday, n)) +  #plot the number of invoices per weekday
  geom_col() +
  labs(x="Weekday", y="Number of invoices") 

Most transactions are placed on Monday, Tuesday, Wednesday and Thursday; no transactions appear on Saturday.

timedata <- data %>% 
  filter(InvoiceYear==2011) %>% 
  count(InvoiceHour)                        #count the number of invoices per hour

ggplot(timedata, aes(InvoiceHour, n)) +     #plot the number of invoices per hour
  geom_col() +
  labs(x="Hour", y="Number of invoices") 
#range(data$InvoiceYear)

Most transactions are performed between 10:00 and 16:00, with a peak around 12:00.

##########################################################

retail <- data.frame(na.omit(data))
sum(is.na(retail))
## [1] 0
head(retail)
retail$InvoiceDate <- as.Date(retail$InvoiceDate) # InvoiceDate is already a Date from the lubridate conversion above, so no format string is needed
range(retail$InvoiceDate)
## [1] "2010-12-01" "2011-12-09"
summary(retail)
##    InvoiceNo        StockCode                                  Description    
##  576339 :   542   85123A :  2035   WHITE HANGING HEART T-LIGHT HOLDER:  2028  
##  579196 :   533   22423  :  1724   REGENCY CAKESTAND 3 TIER          :  1724  
##  580727 :   529   85099B :  1618   JUMBO BAG RED RETROSPOT           :  1618  
##  578270 :   442   84879  :  1408   ASSORTED COLOUR BIRD ORNAMENT     :  1408  
##  573576 :   435   47566  :  1397   PARTY BUNTING                     :  1397  
##  567656 :   421   20725  :  1317   LUNCH BAG RED RETROSPOT           :  1316  
##  (Other):395022   (Other):388425   (Other)                           :388433  
##     Quantity         InvoiceDate           UnitPrice          CustomerID   
##  Min.   :    1.00   Min.   :2010-12-01   Min.   :   0.000   Min.   :12346  
##  1st Qu.:    2.00   1st Qu.:2011-04-07   1st Qu.:   1.250   1st Qu.:13969  
##  Median :    6.00   Median :2011-07-31   Median :   1.950   Median :15159  
##  Mean   :   13.02   Mean   :2011-07-10   Mean   :   3.116   Mean   :15294  
##  3rd Qu.:   12.00   3rd Qu.:2011-10-20   3rd Qu.:   3.750   3rd Qu.:16795  
##  Max.   :80995.00   Max.   :2011-12-09   Max.   :8142.750   Max.   :18287  
##                                                                            
##            Country         TotalPrice        InvoiceDateTime              
##  United Kingdom:354345   Min.   :     0.00   Min.   :2010-12-01 08:26:00  
##  Germany       :  9042   1st Qu.:     4.68   1st Qu.:2011-04-07 11:12:00  
##  France        :  8342   Median :    11.80   Median :2011-07-31 14:39:00  
##  EIRE          :  7238   Mean   :    22.39   Mean   :2011-07-10 23:43:36  
##  Spain         :  2485   3rd Qu.:    19.80   3rd Qu.:2011-10-20 14:33:00  
##  Netherlands   :  2363   Max.   :168469.60   Max.   :2011-12-09 12:50:00  
##  (Other)       : 14109                                                    
##   InvoiceYear    InvoiceMonth    InvoiceWeekday  InvoiceHour
##  Min.   :2010   Nov    : 64545   Sun:62775      Min.   :0   
##  1st Qu.:2011   Oct    : 49557   Mon:64899      1st Qu.:0   
##  Median :2011   Dec    : 43464   Tue:66476      Median :0   
##  Mean   :2011   Sep    : 40030   Wed:68888      Mean   :0   
##  3rd Qu.:2011   May    : 28322   Thu:80052      3rd Qu.:0   
##  Max.   :2011   Jun    : 27185   Fri:54834      Max.   :0   
##                 (Other):144821   Sat:    0
retail <- subset(retail, InvoiceDate >= "2010-12-09")
range(retail$InvoiceDate)
## [1] "2010-12-09" "2011-12-09"
table(retail$Country)
## 
##            Australia              Austria              Bahrain 
##                 1163                  398                   17 
##              Belgium               Brazil               Canada 
##                 2019                   32                  151 
##      Channel Islands               Cyprus       Czech Republic 
##                  748                  614                   25 
##              Denmark                 EIRE   European Community 
##                  380                 7106                   60 
##              Finland               France              Germany 
##                  685                 8164                 8838 
##               Greece            Hong Kong              Iceland 
##                  145                    0                  151 
##               Israel                Italy                Japan 
##                  248                  734                  305 
##              Lebanon            Lithuania                Malta 
##                   45                    0                  112 
##          Netherlands               Norway               Poland 
##                 2361                  925                  322 
##             Portugal                  RSA         Saudi Arabia 
##                 1395                   58                    9 
##            Singapore                Spain               Sweden 
##                  222                 2480                  451 
##          Switzerland United Arab Emirates       United Kingdom 
##                 1836                   68               342478 
##          Unspecified                  USA 
##                  244                  179
countries <- as.data.frame(table(retail$Country))
#countries
#names(countries)[names(countries) == 'Var1'] <- 'country'
#countries
#retail <- subset(retail, Country == "United Kingdom")
print("Invoice No ")
## [1] "Invoice No "
length(unique(retail$InvoiceNo))
## [1] 17858
print("Customer id")
## [1] "Customer id"
length(unique(retail$CustomerID))
## [1] 4275
# Identify returns: cancellation invoices start with "C". None remain here,
# because rows with negative Quantity (the cancellations) were removed earlier.
retail$item.return <- grepl("C", retail$InvoiceNo, fixed=TRUE)
retail$purchase.invoice <- ifelse(retail$item.return, 0, 1)
print("return ")
## [1] "return "
table(retail$item.return)
## 
##  FALSE 
## 385168
print("Purchase")
## [1] "Purchase"
table(retail$purchase.invoice)
## 
##      1 
## 385168
#retail

RFM:

RECENCY: how recently a customer bought something
FREQUENCY: how often a customer buys something
MONETARY VALUE: the total value of the purchased items
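
As a tiny worked example of these three metrics (a minimal sketch on a made-up toy transaction table, not the retail data), R, F and M per customer could be computed like this:

toy <- data.frame(
  CustomerID = c(1, 1, 2, 3, 3, 3),
  InvoiceNo  = c("A1", "A2", "B1", "C1", "C2", "C3"),
  Date       = ymd(c("2011-11-01", "2011-12-01", "2011-06-15",
                     "2011-10-05", "2011-11-20", "2011-12-05")),
  Amount     = c(20, 35, 120, 10, 15, 30)
)

toy %>%
  group_by(CustomerID) %>%
  summarise(Recency   = as.numeric(ymd("2012-01-01") - max(Date)),  # days since last purchase
            Frequency = n_distinct(InvoiceNo),                      # number of invoices
            Monetary  = sum(Amount))                                # total spend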

customers <- as.data.frame(unique(retail$CustomerID))
names(customers) <- "CustomerID"
customers

The retail subset covers the period 2010/12/09 to 2011/12/09.

We take "01-01-2012" as the reference ("current") date for the recency calculation.

# Recency #
###########

#retail$recency <- as.Date("2011-12-10") - retail$InvoiceDate
#data$InvoiceDateTime <- mdy_hm(data$InvoiceDate) #make datetime variable
#data$InvoiceDate <- ymd(date(data$InvoiceDate)) #make date variable


data$Recency <-  as.numeric(mdy("01-01-2012") - data$InvoiceDate)
#data
r_data <- data %>% 
  group_by(CustomerID) %>% 
  summarize(Recency=min(Recency))

r_data

Recency was calculated as one of the features for the segmentation analysis. Here, recency is the number of days between the reference date (2012-01-01) and the customer's most recent purchase: the difference is computed for every transaction and the minimum is taken per customer.

# Frequency #
#############

#customer.invoices <- subset(retail, select = c("CustomerID","InvoiceNo", "purchase.invoice"))

f_data <- data %>% 
  group_by(CustomerID) %>%  
  summarize(Frequency = n_distinct(InvoiceNo)) %>%  # number of distinct invoices per customer
  filter(Frequency > 0)                             # keep only customers who made at least one purchase in the past year


#customers <- subset(f_data, Frequency > 0)

f_data

Frequency of purchase. Frequency was calculated by counting the number of transactions (distinct invoices) a customer made with the online retailer in a year. It is important to calculate the frequency of purchases because the online retailer wants its customers to buy as many times as possible, but customer behavior can differ widely: some may purchase a few times in bulk while others purchase small quantities frequently. The objective is to understand this behavior to serve them better.

There is a huge difference between the 3rd quartile and the maximum number of purchases. Let's investigate this further and visualize it with boxplots.
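
One way to do that (a minimal sketch, reusing f_data from above) is to draw the distribution of purchase frequency twice: once for all customers and once with the most extreme buyers excluded:

# Boxplot of purchase frequency for all customers
ggplot(f_data, aes(y = Frequency)) +
  geom_boxplot() +
  labs(title = "Purchase frequency per customer - all customers")

# Same plot with the top 5% of buyers excluded, to see the bulk of customers
f_data %>%
  filter(Frequency <= quantile(Frequency, 0.95)) %>%
  ggplot(aes(y = Frequency)) +
  geom_boxplot() +
  labs(title = "Purchase frequency per customer - bottom 95%")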

###############################
# Monetary Value of Customers #
###############################

# Total spent on each item on an invoice
#retail$Amount <- retail$Quantity * retail$UnitPrice
m_data <- data %>% 
  group_by(CustomerID) %>% 
  summarise(Monetary = sum(TotalPrice))  # total revenue per customer

# Negative monetary values cannot occur here, since rows with negative Quantity were removed earlier

m_data

Monetary value. Finally, the last feature to build before the cluster segmentation model is monetary value. This refers to the total sum of revenue generated by the user over the course of a year.

It is calculated by multiplying UnitPrice by Quantity for each transaction (TotalPrice) and summing per CustomerID.

MV_3Q <- m_data %>%
  filter( Monetary <= 15000)

MV_Outliers <- m_data %>%
  filter(Monetary > 15000)

# Visualizing a histogram of revenue generated by user
MV_3Q_Visz <- ggplot(MV_3Q, aes(Monetary)) +
  geom_histogram() +
  ggtitle('Revenue of Users - Below $15K') +
  ylab('Number of Users') +
  xlab('Revenue') +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::comma)
print(MV_3Q_Visz)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Visualizing histogram of Revenue Outliers
Outliers_Visz <- ggplot(MV_Outliers, aes(Monetary)) +
  geom_histogram() +
  ggtitle('High Revenue Users - Outliers') +
  ylab('Number of Users') +
  xlab('Revenue') +
  scale_x_continuous(labels = scales::dollar, breaks = c(50000, 100000, 150000, 200000, 250000, 300000, 350000)) +
  scale_y_continuous(labels = scales::comma)
print(Outliers_Visz)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Merging Recency, Frequency and Monetary Value (RFM). Time to start merging the data for the cluster segmentation. So far, three features have been constructed for the model: the Recency, Frequency and Monetary Value of each customer. These three variables are now linked to their respective CustomerID.

Combine the data sets r_data, f_data and m_data:

new_data <- r_data %>% 
  full_join(m_data, by="CustomerID") %>% 
  full_join(f_data, by="CustomerID")

new_data

Calculate RFM

To implement the RFM analysis, we need to further process the data set in a few steps.

We now have a new data set containing four variables: 1) CustomerID, 2) Recency, 3) Monetary and 4) Frequency.

The Recency, Frequency and Monetary variables will each be segmented into groups. Let's start with the first variable, recency:

#print(summary(customers$recency))
#kable(head(customers))

new_data$Recency_group <- cut(new_data$Recency, 
                              quantile(new_data$Recency, 
                              probs =seq(0,1,0.25)), #0 0.25 0.5 0.75 1
                              ordered_result=T,  #  should the result be an ordered factor?
                              include.lowest=T) # segment data into groups

new_data$Recency_group <- factor(new_data$Recency_group, 
                                 labels=c("very recent", "recent", "old", "oldest")) # rename levels

new_data

Now we can proceed with the variable frequency. This variable will also be divided into four groups.

table(new_data$Frequency)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1494  835  508  387  243  172  143   98   68   54   52   45   30   20   28   11 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##   18   14   12   12   11    5    5    3    8    7    3    6    1    4    3    3 
##   33   34   35   37   38   39   41   44   45   46   47   48   50   51   55   57 
##    2    3    1    3    2    2    1    1    1    1    2    1    1    1    2    1 
##   60   62   63   74   86   91   93   97  124  201  210 
##    1    1    1    1    1    1    1    1    1    1    1

As shown in the table above, it is not possible to use quantiles to divide the population into four equal groups. This is because the majority of customers place only 1 or 2 orders (n=2305) and very few customers place many orders. Let's try the following segmentation:

1 order: n=1494
2-3 orders: n=1340
4-9 orders: n=1111
10 or more orders: n=378

new_data$Frequency_group <- cut(new_data$Frequency, 
                              c(0,1,3,10,Inf),  # upper break of Inf so the heaviest buyers (max = 210) are not turned into NA
                              ordered_result=T) #segment into four groups

new_data$Frequency_group <- factor(new_data$Frequency_group, 
                                 labels=c("very rare", "rare", "frequent", "very frequent"))

new_data

Now proceed with the monetary value

#new_data$Monetary
new_data$Monetary_group <- cut(new_data$Monetary, 
                              quantile(new_data$Monetary, probs=seq(0,1,0.25)), 
                              ordered_result=T, include.lowest=T) #segment into groups

new_data$Monetary_group <- factor(new_data$Monetary_group, 
                                 labels=c("small", "medium", "large", "very large")) #rename levels
new_data

Visualize results

We have now segmented our customers into different groups. Let's visualize this:

ggplot(new_data, aes(Recency_group, Frequency_group)) +
  geom_count() +
  facet_grid(Monetary_group ~ .) +  # one panel per monetary group
  labs(x="Recency", y="Frequency", title="RFM analysis 2011") 

In this figure, recency is shown on the x-axis: customers placed their last order very recently, recently, a while ago ("old") or a long time ago ("oldest"). Frequency is shown on the y-axis.

Customers ordered items very frequently (more than 10 times), frequently (4 to 10 times), rarely (2 or 3 times), or only once ("very rare"). Customers spent a small, medium, large or very large amount of money in total.

From the figure it can be observed that customers who placed a recent order were more likely to be frequent buyers and to spend a larger amount of money. Customers whose last order was a long time ago were more likely to have placed only one or a few orders and to have spent a small amount of money.

This information can be helpful for targeting specific groups of customers, as sketched below.
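
For example (a small sketch on top of new_data, using the group labels defined above), one could count the customers in each combined R/F/M segment and pull out a concrete target list such as the very recent, very frequent, very large spenders:

# Number of customers per combined RFM segment, largest segments first
new_data %>%
  count(Recency_group, Frequency_group, Monetary_group, sort = TRUE)

# Example target list: the best customers on all three dimensions
new_data %>%
  filter(Recency_group == "very recent",
         Frequency_group == "very frequent",
         Monetary_group == "very large") %>%
  select(CustomerID, Recency, Frequency, Monetary)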

Statistical Clustering - KMeans

# Creating Clusters based on the RFM Table using Unsupervised Statistical Learning
set.seed(2020)
clusters <- kmeans(scale(new_data[,2:4]), 3, nstart = 1) # Performing kmeans with RFM variables and creating 3 clusters. 

new_data$Cluster <- as.factor(clusters$cluster) # Attaching the results to CustomersID to identify each customer's cluster
KMeans_Results <- new_data %>%
  group_by(Cluster) %>%
  summarise('Number of Users' = n(),
            'Recency Mean' = round(mean(Recency)),
            'Frequency Mean' = scales::comma(round(mean(Frequency))),
            'Monetary Value Mean' = scales::dollar(round(mean(Monetary))),
            'Cluster Revenue' = scales::dollar(sum(Monetary))
            )
KMeans_Results
#DT::datatable((KMeans_Results),
 #             rownames = FALSE) # Display cluster means to identify their value to the business

This is how we can describe the 3 segments of customers:

High Value Customers - Cluster 1

8 customers. Avg 24 days of inactivity. Avg number of purchases: 113 a year. Avg monetary revenue: $4,025 a year.

Medium Value Customers - Cluster 3

3,219 customers. Avg 63 days of inactivity. Avg number of purchases: 5. Avg revenue: $104 a year. This is the cluster with the highest total revenue generated in 2011.

Low Value Customers - Cluster 2

1,112 customers. Avg 267 days of inactivity. Avg number of purchases: 2. Avg revenue: $28 a year.

Cluster_size_visz <- ggplot(KMeans_Results, aes(Cluster, `Number of Users`)) +
  geom_text(aes(label = `Number of Users`), vjust = -0.3) +
  geom_bar(aes(fill=Cluster), stat='identity') +
  ggtitle('Number of Users per Cluster') + 
  xlab("Cluster Number") +
  theme_classic()
print(Cluster_size_visz)

Summary - KMeans Segmentation

This analysis shows the importance of using data mining techniques for customer-centric strategies based on purchase behavior. The customers of the business have been clearly segmented into 3 groups using the KMeans clustering algorithm.

The clusters can help the business understand its customers in terms of revenue generated, frequency of purchase and days of inactivity, and therefore design marketing strategies to retain and/or improve the profitability of the different types of customers. A quick check on the choice of the number of clusters is sketched below.
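
The number of clusters (3) was fixed up front. A quick way to sanity-check that choice (a minimal sketch using the same scaled RFM columns) is the elbow method, plotting the total within-cluster sum of squares for a range of k and looking for the bend:

set.seed(2020)
rfm_scaled <- scale(new_data[, c("Recency", "Frequency", "Monetary")])

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(rfm_scaled, centers = k, nstart = 25)$tot.withinss)

ggplot(data.frame(k = 1:10, wss = wss), aes(k, wss)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of clusters k",
       y = "Total within-cluster sum of squares",
       title = "Elbow method for choosing k")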
