Introduction

This article is the second part of an analysis conducted using the Online Retailer Dataset. This analysis focuses on customer segmentation using statistical analysis in R.

Customer Segementation is used in marketing to better understand customers of a business and target them accordingly. Segmentation of customer can take many forms, based on demographic, geographic, interest, behavior or a combination of these characteristics. Segmentation for this analysis was conducted based on their purchase behavior, the features to be analyzed were Recency, Frequency and Monetary Value, (RFM) for short.

For the scope of this project RFM was defined as follow:

Recency: Number of days the user has been inactive, from the moment of last purchase to the latest time in the dataset.
Frequency: Total number of transactions completed by a customer in a year.
Monetary Value: Total revenue generated by the customer in a year.

Ideally, at least 3 or more clusters will be created. This helps the Online Retailer define Low, Medium and High values users.

For the first article on metrics and more of my data science projects visit: https://www.luisnoguera.com/data-science-projects

Loading and Preprocessing the Dataset

Importing and cleaning process has been explained in the previous article on Growth and Marketing Analytics Metrics.

# Importing libraries 
library(readxl)
library(dplyr)
library(ggplot2)
library(DT)
library(lubridate)
library(scales)
library(plotly)

# Importing Data into R
OnlineRetailData1 <- read_excel("Online Retail.xlsx", 
                               col_types = c("text", "text", "text", 
                                             "numeric", "date", "numeric", "numeric", 
                                             "text"))

# Preprocessing - Eliminating duplicates
# observations with duplicated values
dupes <- which(duplicated(OnlineRetailData1))
# Subsetting out the duplicated values
OnlineRetailData2 <- OnlineRetailData1[-dupes,]
# Preprocessing for Missing Values
OnlineRetailData <- OnlineRetailData2[complete.cases(OnlineRetailData2),]
# Removing not useful datasets
rm(OnlineRetailData1, dupes)
rm(OnlineRetailData2)


# Showing the first 20 observations of the dataset
DT::datatable(head(OnlineRetailData, 20),
              rownames = FALSE,
              options = list(
                pageLength = 5))

Transactions per Country

As seen in the first article, most of the transactions for this Online Retailer happend in the United Kingdom. 88% of the total transactions.

# Filtering for top 10 countries by transaction
Transactions_per_Country <- OnlineRetailData %>%
  group_by(Country) %>%
  summarise('Number of Transcations' = n()) %>%
  arrange(desc(`Number of Transcations`)) %>%
  top_n(10)

names(Transactions_per_Country) <- c("Country", "Number of Transactions")

library(ggthemes)


Transaction_per_Country_Visz <- ggplot(head(Transactions_per_Country,5), aes(x=reorder(Country,-`Number of Transactions`), y=`Number of Transactions`)) +
  geom_bar(stat = 'identity', fill = 'steelblue') +
  scale_y_continuous(labels = comma) +
  geom_text(aes(label = `Number of Transactions`), vjust = -0.5) +
  ggtitle('Top 5 Countries by Number of Transactions') +
  xlab('Countries') +
  ylab('Number of Transactions') +
  theme_minimal() 
print(Transaction_per_Country_Visz)

Transactions per Country Table

Below, the complete list of countries with total number of transactions in a year. Most of the transactions happened in EU countries, turns out the Online Retailer is based in UK.

# Display table of transactions per Country
DT::datatable((Transactions_per_Country),
              rownames = FALSE,
              options = list(
                pageLength = 10))

Filter Transactions in UK

for the scope of this segmentation analysis, only users and transactions that happened in the UK were analyzed.

Code to filter UK customers below.

UK_OnlineRetail <- OnlineRetailData %>%
  filter(Country == 'United Kingdom')

UK_OnlineRetail <- OnlineRetailData %>%
  filter(Country == 'United Kingdom')

Recency

Recency was calculated as one of the features for the segmentation analysis. In this case recency has been calculated as follows, time of customer’s last purchase minus the last transaction date in days.

Based on the histogram, most of the clients have been active in the last 90-100 days. There are not benchmarks to compare this to, but 3 months of customer inactivity does not sound terrible, specially after it was discovered in the previous analysis that the company has not been acquired users as quick as it used to. However, some of this customers have over 140 days without making a single purchase.

Recency Statistics

The summary statistics below gives a goo rough undersating of the users behavior.

50% of the users are below 49 days of inactivity
Average of 3 months without making a single purchase.
Small group of customers that have not made a single transaction in over a year!

# Lets dive into the statistics of how long customer have been inactive. 

# Creating a dataframe with unique CustomersID
Users_Recency <- UK_OnlineRetail %>% 
  group_by(CustomerID) %>%
  summarise(Last_Customer_Activity = max(InvoiceDate)) %>%
  mutate(Last_Invoice = max(Last_Customer_Activity)) 
  
# WORK ON RECENCY! 
Users_Recency$Recency<- round(as.numeric(difftime(Users_Recency$Last_Invoice, Users_Recency$Last_Customer_Activity , units = c("days"))))

Users_Recency <- Users_Recency %>%
  select(CustomerID, Recency)

# Recency refers to the amount of days that a customer has remained inactive. 
# From the moment of it's last purchase up to the moment of the last invoice in the dataset
# Build a histogram of Recency
print(summary(Users_Recency$Recency))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   16.00   50.00   91.32  143.00  373.00

Recency Histogram

ggplot(Users_Recency, aes(Recency)) +
  geom_histogram() +
  ylab('Number of Customers') +
  theme_minimal()

Frequency of Purchase

Frequency was calculated counting the number of times a customer has made a transaction with the Online Retailer in a year. It is important to calculate the frequency of purchases, the online retailer wants it’s customers to buy as many times as possible, but the behavior of customers may be very different, some may a purchase a few times in bulk while other purchase low quantities frequently. The objective is to understand this behavior to serve them better.

Frequency Statistics

Diving into how often the customers purchase from the online retailer.

Average of 90 transactions a year
75% of users have less than 100 purchases a year.
Huge difference between the 3rd and maximum number of purcchases (7,812) Let’s investigate this further and visualize it in two different boxplots.

# Calculating the number of transaction per CustomerID
User_Frequency <- UK_OnlineRetail %>%
  group_by(CustomerID) %>%
  summarise(Frequency = n())

# Summary Statistics of Number of Purchases for each user
summary(User_Frequency$Frequency)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   41.00   90.31   99.00 7812.00

Frequency Boxplot

Since there is a remmarkable difference between the 3rd Quartile and Max value, the boxplot graph will be divided in two.

1st Boxplot Graph - Customers below the 3rd Quartile

2nd Boxplot Graph - Customers in the 4th Quartile.

Below_3Q <- User_Frequency %>%
  filter(Frequency <= 99)

Outliers <- User_Frequency %>%
  filter(Frequency >= 500)

# Plotting first 3 Quartile
Below_3Q_Visz <- ggplot(Below_3Q, aes(CustomerID, Frequency)) +
  geom_boxplot() +
  ylab('Number of Purchases per Customer') +
  ggtitle('Purchase Frequency - First 3 Quartiles') +
  theme(axis.ticks.x = element_blank()) +
  theme_minimal() 
  
print(Below_3Q_Visz)

# Plotting last Quartile
Outliers_Visz <- ggplot(Outliers, aes(CustomerID, Frequency)) +
  geom_boxplot() +
  ylab('Number of Purchases per Customer') +
  scale_y_continuous(labels= scales::comma) + 
  ggtitle('Purchase Frequency - Outliers') +
  theme(axis.ticks.x = element_blank()) +
  theme_minimal()
print(Outliers_Visz)

Monetary Value

Finally, the last calculation to build before the cluster segmentation model is Monetary Value. This refers to the total sum of revenue generated by the user over the course of a year.

It has been estimated calculating the Unit Price and Quantity per transaction and grouping by CustomerID.

Monetary Value Statistics

There are customers who have Let’s visualize them in two plots like it was done previously.

There seems to be customers with negative revenue, apparently there are negative numbers in the price column. It could suggest purchase returns, this is an important assumption as we keep working on the segmentation.

# Calculate Revenue per CustomerID
Users_Monetary_Value <- UK_OnlineRetail %>%
  mutate(Revenue = Quantity * UnitPrice) %>%
  group_by(CustomerID) %>%
  summarise(Monetary_Value=sum(Revenue))

# Summary Statistics
summary(Users_Monetary_Value$Monetary_Value)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -4287.6    281.0    623.3   1708.1   1517.8 256438.5

Monetary Value Histograms

Plotting two sets of users.

1st Histogram - Customers below the 3rd Quartile

2nd Histogram - Customers with revenue greater than 15k a year.

MV_3Q <- Users_Monetary_Value %>%
  filter(Monetary_Value <= 15000)

MV_Outliers <- Users_Monetary_Value %>%
  filter(Monetary_Value > 15000)

# Visualizing a histogram of revenue generated by user
MV_3Q_Visz <- ggplot(MV_3Q, aes(Monetary_Value)) +
  geom_histogram() +
  ggtitle('Revenue of Users - Below $15K') +
  ylab('Number of Users') +
  xlab('Revenue') +
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::comma)
print(MV_3Q_Visz)

# Visualizing histogram of Revenue Outliers
Outliers_Visz <- ggplot(MV_Outliers, aes(Monetary_Value)) +
  geom_histogram() +
  ggtitle('High Revenue Users - Outliers') +
  ylab('Number of Users') +
  xlab('Revenue') +
  scale_x_continuous(labels = scales::dollar, breaks = c(50000, 100000, 150000, 200000, 250000, 300000, 350000)) +
  scale_y_continuous(labels = scales::comma)
print(Outliers_Visz)

Merging Recency, Frequency and Monetary Value. RFM

Time to start merging the dataset for the cluster segmentation. So far, there has been three features constructed for the model. Recency, Frequency and Monetary Value of each customer. The three of these variables are now linked to the respective CustomerID.

Below,the Users RFM table, this table includes each CustomerID and their respective RFM features. The information in this table will be used for the Customer Segmentation.

Users_RFM <- merge(Users_Recency, User_Frequency) # Merging Recency and Frequency
Users_RFM <- merge(Users_RFM, Users_Monetary_Value) # Merging Monetary Value
DT::datatable((Users_RFM),
              rownames = FALSE,
              options = list(
                pageLength = 10))

Statistical Clustering - KMeans

The segmentation will be performed using KMeans clustering, which is a simple and elegant way of subsetting the customers into non-overlapping segments. There are advantages and disadvantages of this type of clustering.

Pros

The number of clusters $k$ can be defined before segmenting the users. In this case we will select 3 clusters, hoping to get Low, Medium and High Value customers.

Cons

As numbers of clusters needs to be defined by the analyst, how many should there be? There might be more variances and more clusters that escape the segmentation, so this is up to the discretion of the business and strategy.

Importance of Scaling the Data before Performing KMeans

In the RFM data, the variables are measured in different units, where a unit increase or decrease in one day for the Recency (days inactive) is completely different than a unit increase or decrease in dollars for the Monetary Value feature. Therefore the importance of scaling the data, to represent the true distance among variables. The data has been scaled using the function $scale()$ in the kmeans algorithm. Check the code below.

Applying KMeans Clustering.

3 clusters have been constrcuted based on the following features:

Recency (Days of inactiviy)
Frequency (Number of purchases)
Monetary Value (Revenue generated).

The data has been scaled using the function scale() and CustomerId has been merged to it’s respective cluster number.

KMeans Cluster - Code

# Creating Clusters based on the RFM Table using Unsupervised Statistical Learning
set.seed(415)
clusters <- kmeans(scale(Users_RFM[,2:4]), 3, nstart = 1) # Performing kmeans with RFM variables and creating 3 clusters. 

Users_RFM$Cluster <- as.factor(clusters$cluster) # Attaching the results to CustomersID to identify each customer's cluster

# Creating Clusters based on the RFM Table using Unsupervised Statistical
set.seed(415)
clusters <- kmeans(scale(Users_RFM[,2:4]), 3, nstart = 1) # Performing kmeans with RFM variables and creating 3 clusters. 

Users_RFM$Cluster <- as.factor(clusters$cluster) # Attaching the results to CustomersID to identify each customer's cluster

Defining Users Value to the Business - KMeans

Amazing! There are three different clusters based on customers behavior with the Online Retailer.

Let’s look at the differences of these clusters to identify

High Value Customers.
Medium Value Customers.
Low Value Custommers.

KMeans_Results <- Users_RFM %>%
  group_by(Cluster) %>%
  summarise('Number of Users' = n(),
            'Recency Mean' = round(mean(Recency)),
            'Frequency Mean' = scales::comma(round(mean(Frequency))),
            'Monetary Value Mean' = scales::dollar(round(mean(Monetary_Value))),
            'Cluster Revenue' = scales::dollar(sum(Monetary_Value))
            )

DT::datatable((KMeans_Results),
              rownames = FALSE) # Display cluster means to identify their value to the business

This is how we can describe the 3 segments of customers:

Low Value Customers - Cluster 1

1,006 customers
Avg 244 days of inactivity.
Avg number of Purchases, 27 a year.
Avg Monetary Revenue of $430 a year.

Medium Value Customers - Cluster 3

2,934 customers.
Avg 39 days of inactivity
Avg Number of Purchases, 103.
Avg Revenue of $1,868 a year
Cluster with highest revenue generated in 2011

High Value Customers - Cluster 2

10 customers.
Avg 10 days of inactivity
Avg Number of Purchases, 2,799.
Avg Revenue of $83,638 a year

Cluster_size_visz <- ggplot(KMeans_Results, aes(Cluster, `Number of Users`)) +
  geom_text(aes(label = `Number of Users`), vjust = -0.3) +
         geom_bar(aes(fill=Cluster), stat='identity') +
  ggtitle('Number of Users per Cluster') + 
  xlab("Cluster Number") +
  theme_classic()
         print(Cluster_size_visz)

Summary - KMeans Segmentation

This short article showcased the importance of using data mining techniques for customer-centric strategies based on purchase behavior. Customers of the business have been clearly identified into 3 using KMeans clustering algorithm.

The clusters created can help the business understand its customers in terms of revenue generated, frequency of purchase and days of inactivity. Therefore, perform marketing strategies to retain and/or improve the profitability of different type of customers.

Further analysis can be conducted on clusters to identify more narrowed characteristics of the customers, undeerstand relationship between cluster and types of product purchased or predicting each cluster and customer’s lifetime value.

Extra Segmentation - Hiererchical Clustering

To enhance this clustering analysis it was decided to further segment the largest cluster of customer in the first segementation (Cluster 3), this further sub-segmentation was performed using hierechical clustering to further understand the customers the characteristics of this group.

Monetary Value was selected as the value for the further segmentation, using frequency and recency as estimator for it.

How to read the graph below

Read the statement on top of the model and move right for “no” or left for “yes”.
The top number in the bubble represents the average monetary value for the group
n = represents the number of variables that are part of the group

library(rpart)
library(rpart.plot)
Cluster_3_Tree <- Users_RFM %>%
  filter(Cluster == '3') %>%
  select(Frequency, Monetary_Value, Recency)

fit_tree <-rpart(Monetary_Value ~ ., 
                 data=Cluster_3_Tree,
                 method = 'anova', 
                 control= rpart.control(cp=0.0127102))

rpart.plot(fit_tree, type=1,extra=1, box.palette=c("gray","lightblue"))

Summary Hiererchical Segmentation - (Cluster 3)

This sub-segmentation of Cluster 3, divided the segment into 4 smaller different clusters.

Results: (From low value to high value customers)

1,724 customers that purchase less than 75 times, average monetary value of $733.
854 customers that purchase more than 75 times, average monetary value of $2,383 (Significantly higher than the previous group)
231 customers that purchase more than 215 times and have not purchased in the last 5 or more days. Average Monetary Value of $4,813.
125 customers that purchase more than 215 times and have purchased within the last 5 days. Highest average monetary value of $8,556.

This last sub-segment of 125 customers represents the most valuable customers within Cluster 3. From these insights, executive and management team can take further strategic actions to increase the averague monetary value of lower sub-segments within this cluster of customers.

Customer Segmentation

Statistical Analysis using R

Luis Noguera

11/19/2019