RFM (recency, frequency, monetary) analysis is a marketing technique used to determine quantitatively which customers are the best ones by examining
https://help.synerise.com/use-cases/all-cases/_gfx/rfm1.png
Identifying the most valuable RFM segments can capitalize on chance relationships in the data used for this analysis.
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#library(stringr)
#library(DT)
library(tidyr)
library(knitr)
library(rmarkdown)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
##
## intersect, setdiff, union
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
#Data Loading :
#data <- read.csv("E:/Project/CustomerSegmentataion/data.csv")
data <- read.csv("D:/VisualStudioCode/R/CustomerSegmentationusingR/Project_1_Customer_Segmentation_using_R/Project_CustomerSegmentationUsingR/data.csv")
data
data <- data.frame(data)
summary(data)
## InvoiceNo StockCode Description
## 573585 : 1114 85123A : 2313 WHITE HANGING HEART T-LIGHT HOLDER: 2369
## 581219 : 749 22423 : 2203 REGENCY CAKESTAND 3 TIER : 2200
## 581492 : 731 85099B : 2159 JUMBO BAG RED RETROSPOT : 2159
## 580729 : 721 47566 : 1727 PARTY BUNTING : 1727
## 558475 : 705 20725 : 1639 LUNCH BAG RED RETROSPOT : 1638
## 579777 : 687 84879 : 1502 ASSORTED COLOUR BIRD ORNAMENT : 1501
## (Other):537202 (Other):530366 (Other) :530315
## Quantity InvoiceDate UnitPrice
## Min. :-80995.00 10/31/2011 14:41: 1114 Min. :-11062.06
## 1st Qu.: 1.00 12/08/2011 9:28 : 749 1st Qu.: 1.25
## Median : 3.00 12/09/2011 10:03: 731 Median : 2.08
## Mean : 9.55 12/05/2011 17:24: 721 Mean : 4.61
## 3rd Qu.: 10.00 6/29/2011 15:58 : 705 3rd Qu.: 4.13
## Max. : 80995.00 11/30/2011 15:13: 687 Max. : 38970.00
## (Other) :537202
## CustomerID Country
## Min. :12346 United Kingdom:495478
## 1st Qu.:13953 Germany : 9495
## Median :15152 France : 8557
## Mean :15288 EIRE : 8196
## 3rd Qu.:16791 Spain : 2533
## Max. :18287 Netherlands : 2371
## NA's :135080 (Other) : 15279
glimpse(data)
## Rows: 541,909
## Columns: 8
## $ InvoiceNo <fct> 536365, 536365, 536365, 536365, 536365, 536365, 536365,...
## $ StockCode <fct> 85123A, 71053, 84406B, 84029G, 84029E, 22752, 21730, 22...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, WHITE METAL LANTERN...
## $ Quantity <int> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2, 3, ...
## $ InvoiceDate <fct> 12/01/2010 8:26, 12/01/2010 8:26, 12/01/2010 8:26, 12/0...
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1.85, 1...
## $ CustomerID <int> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850,...
## $ Country <fct> United Kingdom, United Kingdom, United Kingdom, United ...
checking NA values :
sum(is.na(data))
## [1] 135080
Removing Negative values from Quantity :
data <- data %>% filter(Quantity>=0)
sum(is.na(data))
## [1] 133361
Filtering the Dataframe : Removing NA values
data <- data %>% filter(!is.na(CustomerID))
sum(is.na(data))
## [1] 0
#summary(data)
No of customer records :
nrow(data)
## [1] 397924
Number Of countries :
length(unique(data$Country))
## [1] 37
The custumers are from 37 different countries. Lets visualize this.
#reorder the table and reset the factor to that ordering
data %>%
group_by(Country) %>% # calculate the counts
summarize(counts = n()) %>%
arrange(counts) %>% # sort by counts
mutate(Country = factor(Country, Country)) %>% # reset factor
ggplot(aes(x=Country, y=counts)) + # plot
geom_bar(stat="identity") + # plot histogram
coord_flip() # flip the coordinates
The goods are shipped to 37 unique countries. The majority of goods is shipped to the United Kingdom. After the United Kingdom, Germany, France and EIRE (=Ireland) are the most important countries.
Make a new variable called TotalPrice, this variable gives the total price for each entry
data$TotalPrice <- data$Quantity * data$UnitPrice
data
#range
#range(data$InvoiceDate)
data$InvoiceDateTime <- mdy_hm(data$InvoiceDate) #make datetime object
data$InvoiceDate <- ymd(date(data$InvoiceDateTime)) #make date variable
head(data)
data$InvoiceYear <- year(data$InvoiceDate)
data$InvoiceMonth <- month(data$InvoiceDate,label=T)
data$InvoiceWeekday <- wday(data$InvoiceDate, label=T)
data$InvoiceHour <- hour(data$InvoiceDate)
#data
Here we have the number of transactions per month for 2011.
timedata <- data %>%
filter(InvoiceYear==2011) %>%
count(InvoiceMonth) #count the number of invoices per month for 2011
ggplot(timedata, aes(InvoiceMonth, n)) + #plot the number of invoices per day
geom_col() +
labs(x="Month", y="Number of invoices")
It seems that the number of transactions is rising from September and the highest in November. In december the lowest number of transactions is performed.
timedata <- data %>%
filter(InvoiceYear==2011) %>%
count(InvoiceWeekday)
ggplot(timedata, aes(InvoiceWeekday, n)) + #plot the number of invoices per day
geom_col() +
labs(x="Week", y="Number of invoices")
Most transactions are placed on monday, tuesday, wednesday and thursday.
timedata <- data %>%
filter(InvoiceYear==2011) %>%
count(InvoiceHour)
#data
#ggplot(timedata, aes(InvoiceHour, n)) + #plot the number of invoices per day
#geom_col() +
#labs(x="hour", y="Number of invoices")
#The most transactions are performed between 10 en 16:00 hours. At 12 hours the most transactions are performed.
#range(data$InvoiceYear)
##########################################################33
retail <- data.frame(na.omit(data))
sum(is.na(retail))
## [1] 0
retail
retail$InvoiceDate <- as.Date(retail$InvoiceDate, '%m/%d/%Y %H:%M')
range(retail$InvoiceDate)
## [1] "2010-12-01" "2011-12-09"
summary(retail)
## InvoiceNo StockCode Description
## 576339 : 542 85123A : 2035 WHITE HANGING HEART T-LIGHT HOLDER: 2028
## 579196 : 533 22423 : 1724 REGENCY CAKESTAND 3 TIER : 1724
## 580727 : 529 85099B : 1618 JUMBO BAG RED RETROSPOT : 1618
## 578270 : 442 84879 : 1408 ASSORTED COLOUR BIRD ORNAMENT : 1408
## 573576 : 435 47566 : 1397 PARTY BUNTING : 1397
## 567656 : 421 20725 : 1317 LUNCH BAG RED RETROSPOT : 1316
## (Other):395022 (Other):388425 (Other) :388433
## Quantity InvoiceDate UnitPrice CustomerID
## Min. : 1.00 Min. :2010-12-01 Min. : 0.000 Min. :12346
## 1st Qu.: 2.00 1st Qu.:2011-04-07 1st Qu.: 1.250 1st Qu.:13969
## Median : 6.00 Median :2011-07-31 Median : 1.950 Median :15159
## Mean : 13.02 Mean :2011-07-10 Mean : 3.116 Mean :15294
## 3rd Qu.: 12.00 3rd Qu.:2011-10-20 3rd Qu.: 3.750 3rd Qu.:16795
## Max. :80995.00 Max. :2011-12-09 Max. :8142.750 Max. :18287
##
## Country TotalPrice InvoiceDateTime
## United Kingdom:354345 Min. : 0.00 Min. :2010-12-01 08:26:00
## Germany : 9042 1st Qu.: 4.68 1st Qu.:2011-04-07 11:12:00
## France : 8342 Median : 11.80 Median :2011-07-31 14:39:00
## EIRE : 7238 Mean : 22.39 Mean :2011-07-10 23:43:36
## Spain : 2485 3rd Qu.: 19.80 3rd Qu.:2011-10-20 14:33:00
## Netherlands : 2363 Max. :168469.60 Max. :2011-12-09 12:50:00
## (Other) : 14109
## InvoiceYear InvoiceMonth InvoiceWeekday InvoiceHour
## Min. :2010 Nov : 64545 Sun:62775 Min. :0
## 1st Qu.:2011 Oct : 49557 Mon:64899 1st Qu.:0
## Median :2011 Dec : 43464 Tue:66476 Median :0
## Mean :2011 Sep : 40030 Wed:68888 Mean :0
## 3rd Qu.:2011 May : 28322 Thu:80052 3rd Qu.:0
## Max. :2011 Jun : 27185 Fri:54834 Max. :0
## (Other):144821 Sat: 0
retail <- subset(retail, InvoiceDate >= "2010-12-09")
range(retail$InvoiceDate)
## [1] "2010-12-09" "2011-12-09"
table(retail$Country)
##
## Australia Austria Bahrain
## 1163 398 17
## Belgium Brazil Canada
## 2019 32 151
## Channel Islands Cyprus Czech Republic
## 748 614 25
## Denmark EIRE European Community
## 380 7106 60
## Finland France Germany
## 685 8164 8838
## Greece Hong Kong Iceland
## 145 0 151
## Israel Italy Japan
## 248 734 305
## Lebanon Lithuania Malta
## 45 0 112
## Netherlands Norway Poland
## 2361 925 322
## Portugal RSA Saudi Arabia
## 1395 58 9
## Singapore Spain Sweden
## 222 2480 451
## Switzerland United Arab Emirates United Kingdom
## 1836 68 342478
## Unspecified USA
## 244 179
countries <- as.data.frame(table(retail$Country))
#countries
#names(countries)[names(countries) == 'Var1'] <- 'country'
#countries
#retail <- subset(retail, Country == "United Kingdom")
print("Invoice No ")
## [1] "Invoice No "
length(unique(retail$InvoiceNo))
## [1] 17858
print("Customer id")
## [1] "Customer id"
length(unique(retail$CustomerID))
## [1] 4275
# Identify returns
retail$item.return <- grepl("C", retail$InvoiceNo, fixed=TRUE)
retail$purchase.invoice <- ifelse(retail$item.return=="TRUE", 0, 1)
print("return ")
## [1] "return "
table(retail$item.return)
##
## FALSE
## 385168
print("Purchase")
## [1] "Purchase"
table(retail$purchase.invoice)
##
## 1
## 385168
#retail
RECENCY - how recent did a customer bought something FREQUENCY - how often does a customer buy something MONETARY VALUE - what is the value of the purchased items
customers <- as.data.frame(unique(retail$CustomerID))
names(customers) <- "CustomerID"
customers
2010/12/09 - 2011/12/09
consider current date is “01-01-2012”
# Recency #
###########
#retail$recency <- as.Date("2011-12-10") - retail$InvoiceDate
#data$InvoiceDateTime <- mdy_hm(data$InvoiceDate) #make datetime variable
#data$InvoiceDate <- ymd(date(data$InvoiceDate)) #make date variable
data$Recency <- as.numeric(mdy("01-01-2012") - data$InvoiceDate)
#data
r_data <- data %>%
group_by(CustomerID) %>%
summarize(Recency=min(Recency))
r_data
Recency Recency was calculated as one of the features for the segmentation analysis. In this case recency has been calculated as follows, time of customer’s last purchase minus Latest date
# Frequency #
#############
#customer.invoices <- subset(retail, select = c("CustomerID","InvoiceNo", "purchase.invoice"))
f_data <- data %>%
group_by(CustomerID) %>%
count(Frequency=n_distinct(InvoiceNo)) %>%
mutate(Frequency > 0)%>%
select(CustomerID, Frequency ) # Remove customers who have not made any purchases in the past year
#customers <- subset(f_data, Frequency > 0)
f_data
Frequency of Purchase Frequency was calculated counting the number of times a customer has made a transaction with the Online Retailer in a year. It is important to calculate the frequency of purchases, the online retailer wants it’s customers to buy as many times as possible, but the behavior of customers may be very different, some may a purchase a few times in bulk while other purchase low quantities frequently. The objective is to understand this behavior to serve them better.
Huge difference between the 3rd and maximum number of purcchases (7,812) Let’s investigate this further and visualize it in two different boxplots
###############################
# Monetary Value of Customers #
###############################
# Total spent on each item on an invoice
#retail$Amount <- retail$Quantity * retail$UnitPrice
m_data <- data %>%
group_by(CustomerID) %>%
summarise(Monetary=sum(TotalPrice >0))
# Identify customers with negative monetary value numbers, as they were presumably returning purchases from the preceding year
m_data
Monetary Value Finally, the last calculation to build before the cluster segmentation model is Monetary Value. This refers to the total sum of revenue generated by the user over the course of a year.
It has been estimated calculating the Unit Price and Quantity per transaction and grouping by CustomerID.
MV_3Q <- m_data %>%
filter( Monetary <= 15000)
MV_Outliers <- m_data %>%
filter(Monetary > 15000)
# Visualizing a histogram of revenue generated by user
MV_3Q_Visz <- ggplot(MV_3Q, aes(Monetary)) +
geom_histogram() +
ggtitle('Revenue of Users - Below $15K') +
ylab('Number of Users') +
xlab('Revenue') +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::comma)
print(MV_3Q_Visz)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Visualizing histogram of Revenue Outliers
Outliers_Visz <- ggplot(m_data, aes(Monetary)) +
geom_histogram() +
ggtitle('High Revenue Users - Outliers') +
ylab('Number of Users') +
xlab('Revenue') +
scale_x_continuous(labels = scales::dollar, breaks = c(50000, 100000, 150000, 200000, 250000, 300000, 350000)) +
scale_y_continuous(labels = scales::comma)
print(Outliers_Visz)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Merging Recency, Frequency and Monetary Value. RFM Time to start merging the dataset for the cluster segmentation. So far, there has been three features constructed for the model. Recency, Frequency and Monetary Value of each customer. The three of these variables are now linked to the respective CustomerID.
Combine these the data sets r_data, f_data and m_data
new_data <- r_data %>%
full_join(m_data, by="CustomerID") %>%
full_join(f_data, by="CustomerID")
new_data
To implement the RFM analysis, we need to further process the data set in by the following steps:
Now we have a new data set, containing four variables called 1) customerID, 2)Recency, 3)Monatory and 4) Frequency.
These four variables should be segmented in equal groups. Lets start with the first variable recency:
#print(summary(customers$recency))
#kable(head(customers))
new_data$Recency_group <- cut(new_data$Recency,
quantile(new_data$Recency,
probs =seq(0,1,0.25)), #0 0.25 0.5 0.75 1
ordered_result=T, # should the result be an ordered factor?
include.lowest=T) # segment data into groups
new_data$Recency_group <- factor(new_data$Recency_group,
labels=c("very recent", "recent", "old", "oldest")) # rename levels
new_data
Now we can proceed with the variable called frequency. This variable will also be divided in four groups.
table(new_data$Frequency)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1494 835 508 387 243 172 143 98 68 54 52 45 30 20 28 11
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 18 14 12 12 11 5 5 3 8 7 3 6 1 4 3 3
## 33 34 35 37 38 39 41 44 45 46 47 48 50 51 55 57
## 2 3 1 3 2 2 1 1 1 1 2 1 1 1 2 1
## 60 62 63 74 86 91 93 97 124 201 210
## 1 1 1 1 1 1 1 1 1 1 1
Aas shown in the table above, it is not possible to use quantile to divide the population in four equal groups. This is because the majority of customers only places 1 or 2 orders (n=2305) and there are very few customers that place many orders. Lets try the following segmentation
1 order: n=1494 2-3 orders: n=1340 4-9 orders: n=1111 10 or more: n=378
new_data$Frequency_group <- cut(new_data$Frequency,
c(0,1,3,10,188),
ordered_result=T) #segment into four groups
new_data$Frequency_group <- factor(new_data$Frequency_group,
labels=c("very rare", "rare", "frequent", "very frequent"))
new_data
Now proceed with the monetary value
#new_data$Monetary
new_data$Monetory_group <- cut(new_data$Monetary,
quantile(new_data$Monetary, probs=seq(0,1,0.25)),
ordered_result=T, include.lowest=T) #segment into groups
new_data$Monetory_group <- factor(new_data$Monetory_group,
labels=c("small", "medium", "large", "very large")) #rename levels
new_data
We now have segmented our customers in different groups. Lets visualize this:
ggplot(new_data, aes(Recency_group, Frequency_group)) +
geom_count() +
facet_grid(Monetory_group ~ .) + #If you have only one variable with many levels,
labs(x="Recency", y="Frequency", title="RFM analysis 2011")
In this figure the Recency is put on the x-axis. Customers placed an order very recent, recent, less recent or not recent. Frequency is placed on the y-axis.
Customers ordered items very frequent (more than 10 times), frequent (between 4 and 9 times), rare (2 or 3 times) or they placed only one order (very rare). Customers spend a small, medium, large or very large amount of money in total.
From the figure it can be observed, that customers that placed a recent order, were more likely to be frequent buyers and they were also more likely to spend a larger amount of money. When customers placed an order a long time ago, they were more likely to place only one or a few orders and they were also more likely to spend a small amount of money.
This information could be helpfull to target specific groups of customers.
# Creating Clusters based on the RFM Table using Unsupervised Statistical Learning
set.seed(2020)
clusters <- kmeans(scale(new_data[,2:4]), 3, nstart = 1) # Performing kmeans with RFM variables and creating 3 clusters.
new_data$Cluster <- as.factor(clusters$cluster) # Attaching the results to CustomersID to identify each customer's cluster
KMeans_Results <- new_data %>%
group_by(Cluster) %>%
summarise('Number of Users' = n(),
'Recency Mean' = round(mean(Recency)),
'Frequency Mean' = scales::comma(round(mean(Frequency))),
'Monetary Value Mean' = scales::dollar(round(mean(Monetary))),
'Cluster Revenue' = scales::dollar(sum(Monetary))
)
KMeans_Results
#DT::datatable((KMeans_Results),
# rownames = FALSE) # Display cluster means to identify their value to the business
This is how we can describe the 3 segments of customers:
8 customers Avg 24 days of inactivity. Avg number of Purchases, 113 a year. Avg Monetary Revenue of $4025 a year.
3219 customers. Avg 63 days of inactivity Avg Number of Purchases, 5. Avg Revenue of $104 a year Cluster with highest revenue generated in 2011
1112 customers. Avg 267 days of inactivity Avg Number of Purchases, 2. Avg Revenue of $28 a year
Cluster_size_visz <- ggplot(KMeans_Results, aes(Cluster, `Number of Users`)) +
geom_text(aes(label = `Number of Users`), vjust = -0.3) +
geom_bar(aes(fill=Cluster), stat='identity') +
ggtitle('Number of Users per Cluster') +
xlab("Cluster Number") +
theme_classic()
print(Cluster_size_visz)
The importance of using data mining techniques for customer-centric strategies based on purchase behavior. Customers of the business have been clearly identified into 3 using KMeans clustering algorithm.
The clusters created can help the business understand its customers in terms of revenue generated, frequency of purchase and days of inactivity. Therefore, perform marketing strategies to retain and/or improve the profitability of different type of customers.
====================