1.0 Introduction

Effective Customer Segmentation can have several benefits for your business, such as:

  1. Building a More Effective Marketing Strategy
  2. Optimizing the Customer Journey
  3. Predicting Customer Behavior
  4. Personalizing the Customer Experience
  5. Improving Customer Loyalty & Retention
  6. Improving Conversion Metrics
  7. Supporting Product Development

1.1 Problem Statement

1.2 Dataset Information

Here is a brief version of the data description file.

1.3 Analysis Action Plan

The structure I will follow for this analysis includes the following steps and sections:

  1. Loading and Checking the dataset

# suppress warning message / global option output
options(warn = -1)
# Loading the required Packages
library(tidyverse)
library(readxl)
library(DataExplorer)
library(visdat)
library(naniar)
library(rmarkdown)
library(knitr)
library(DT)
library(wordcloud2)
library(wordcloud)
library(extrafont)
library(SnowballC) 
library(tm) 
library(dlookr) 
library(gridExtra)
library(scales)
library(lubridate)
library(ggThemeAssist)
library(plotly)
library(hrbrthemes)
library(xts)
library(highcharter)
library(countrycode)
library(factoextra) 

Loading Dataset

# Loading the dataset
retail <- read_excel("Online Retail.xlsx")

# viewing the data (interactive; commented out in the rendered report)
# View(retail)
# displaying the column names
colnames(retail)
## [1] "InvoiceNo"   "StockCode"   "Description" "Quantity"    "InvoiceDate"
## [6] "UnitPrice"   "CustomerID"  "Country"

Structure of the dataset

# getting information about the structure of the data
str(retail)
## tibble [541,909 x 8] (S3: tbl_df/tbl/data.frame)
##  $ InvoiceNo  : chr [1:541909] "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
##  $ UnitPrice  : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
##  $ Country    : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

Observations

The dataset contains 541,909 transaction records across 8 features. InvoiceNo, StockCode, Description and Country are stored as character, Quantity, UnitPrice and CustomerID as numeric, and InvoiceDate as POSIXct.

Summary of the data

# getting summarized information about the features of the data
summary(retail)
##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##   InvoiceDate                    UnitPrice           CustomerID    
##  Min.   :2010-12-01 08:26:00   Min.   :-11062.06   Min.   :12346   
##  1st Qu.:2011-03-28 11:34:00   1st Qu.:     1.25   1st Qu.:13953   
##  Median :2011-07-19 17:17:00   Median :     2.08   Median :15152   
##  Mean   :2011-07-04 13:34:57   Mean   :     4.61   Mean   :15288   
##  3rd Qu.:2011-10-19 11:27:00   3rd Qu.:     4.13   3rd Qu.:16791   
##  Max.   :2011-12-09 12:50:00   Max.   : 38970.00   Max.   :18287   
##                                                    NA's   :135080  
##    Country         
##  Length:541909     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Observations

The summary already reveals data-quality issues: both Quantity and UnitPrice contain negative values, and CustomerID has 135,080 missing values.

First 5 rows

Peeking at the first 5 rows of the dataset:

# Peek on the first 5 rows
head(retail, 5)
## # A tibble: 5 x 8
##   InvoiceNo StockCode Description         Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>                  <dbl> <dttm>                  <dbl>
## 1 536365    85123A    WHITE HANGING HEAR~        6 2010-12-01 08:26:00      2.55
## 2 536365    71053     WHITE METAL LANTERN        6 2010-12-01 08:26:00      3.39
## 3 536365    84406B    CREAM CUPID HEARTS~        8 2010-12-01 08:26:00      2.75
## 4 536365    84029G    KNITTED UNION FLAG~        6 2010-12-01 08:26:00      3.39
## 5 536365    84029E    RED WOOLLY HOTTIE ~        6 2010-12-01 08:26:00      3.39
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>

Unique customers & products

# Counting the number of distinct customers and product descriptions
n_distinct(retail$CustomerID)
## [1] 4373
n_distinct(retail$Description)
## [1] 4212

Observations

There are 4,373 distinct customers and 4,212 distinct product descriptions in the dataset.

2. Data Preparation

Data Cleaning

Checking & Dealing with missing values

Plotting percentage of missing values per feature

# Plotting percentage of missing values per feature
gg_miss_var(retail, show_pct = TRUE)

Observations

CustomerID is missing for roughly a quarter of the rows, and Description has a small share of missing values as well. Transactions without a customer identifier cannot be used for customer segmentation, so we drop them.

# Deleting rows where CustomerID is missing
retail <- retail %>% 
  filter(!is.na(CustomerID))

# Replacing NA Description values with the string "empty"
retail$Description <- replace_na(retail$Description, "empty")

Checking & Treating Outliers

# plotting outliers for Quantity and UnitPrice

plot_outlier(retail, Quantity, UnitPrice, col = "#ACAC89")

Let's check the minimum and maximum unit price of the products.

min(retail$UnitPrice)
## [1] 0
max(retail$UnitPrice)
## [1] 38970

Observations

Unit prices range from 0 to 38,970. Zero prices are suspicious (free items or data errors), and the earlier summary also showed negative prices, so UnitPrice needs cleaning alongside Quantity.

# checking quantities that are negative, arranged by descending order
quantity_check <- retail %>% 
  filter(Quantity < 0) %>% 
  arrange(Quantity)

head(quantity_check, 3)
## # A tibble: 3 x 8
##   InvoiceNo StockCode Description         Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>                  <dbl> <dttm>                  <dbl>
## 1 C581484   23843     PAPER CRAFT , LITT~   -80995 2011-12-09 09:27:00      2.08
## 2 C541433   23166     MEDIUM CERAMIC TOP~   -74215 2011-01-18 10:17:00      1.04
## 3 C536757   84347     ROTATING SILVER AN~    -9360 2010-12-02 14:23:00      0.03
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>

Observations

The largest negative quantities appear on invoices whose number starts with "C", which marks cancellations. We should check whether they cancel matching purchases.

retail %>% 
  filter(CustomerID == 16446)
## # A tibble: 4 x 8
##   InvoiceNo StockCode Description         Quantity InvoiceDate         UnitPrice
##   <chr>     <chr>     <chr>                  <dbl> <dttm>                  <dbl>
## 1 553573    22980     PANTRY SCRUBBING B~        1 2011-05-18 09:52:00      1.65
## 2 553573    22982     PANTRY PASTRY BRUSH        1 2011-05-18 09:52:00      1.25
## 3 581483    23843     PAPER CRAFT , LITT~    80995 2011-12-09 09:15:00      2.08
## 4 C581484   23843     PAPER CRAFT , LITT~   -80995 2011-12-09 09:27:00      2.08
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>

Observations

Customer 16446 bought 80,995 units and cancelled the order twelve minutes later, so both the purchase (invoice 581483) and its cancellation are outliers we can safely remove. The same reasoning applies to the 74,215-unit purchase on invoice 541431 and its cancellation.

# deleting the outlier purchases by their invoice number, InvoiceNo
retail <- retail[!(retail$InvoiceNo %in% c("581483", "541431")), ]

Next, we clean the dataset by filtering out negative and zero values of the Quantity and UnitPrice features, and then re-check the outlier plots on the filtered data to get one more view.

# keeping only rows with positive Quantity and UnitPrice
retail <- retail %>% 
  filter(Quantity > 0) %>% 
  filter(UnitPrice > 0)

plot_outlier(retail, Quantity, UnitPrice, col = "#ACAC89")

Observations

Both features remain strongly right-skewed, but the negative and zero values are gone.

I will check the unique countries that the dataset contains transaction data from.

unique(retail$Country)
##  [1] "United Kingdom"       "France"               "Australia"           
##  [4] "Netherlands"          "Germany"              "Norway"              
##  [7] "EIRE"                 "Switzerland"          "Spain"               
## [10] "Poland"               "Portugal"             "Italy"               
## [13] "Belgium"              "Lithuania"            "Japan"               
## [16] "Iceland"              "Channel Islands"      "Denmark"             
## [19] "Cyprus"               "Sweden"               "Finland"             
## [22] "Austria"              "Greece"               "Singapore"           
## [25] "Lebanon"              "United Arab Emirates" "Israel"              
## [28] "Saudi Arabia"         "Czech Republic"       "Canada"              
## [31] "Unspecified"          "Brazil"               "USA"                 
## [34] "European Community"   "Bahrain"              "Malta"               
## [37] "RSA"

Our dataset is now clean, and we can move on to creating new features to work with.

Feature Engineering

Creation of new features:

# new feature Spent: the amount spent per line item (Quantity x UnitPrice)
retail <- mutate(retail, Spent = Quantity * UnitPrice)
# new customer data frame with total Spent and Quantity per customer
customer <- retail %>% 
  group_by(CustomerID, Country) %>% 
  summarise(across(c(Spent, Quantity), ~ sum(.x, na.rm = TRUE)))
# new date and time columns extracted from the InvoiceDate column
retail$InvoiceDate <- as.character(retail$InvoiceDate)
retail$date <- sapply(retail$InvoiceDate, FUN = function(x) {strsplit(x, split = '[ ]')[[1]][1]})
retail$time <- sapply(retail$InvoiceDate, FUN = function(x) {strsplit(x, split = '[ ]')[[1]][2]})
# new year, month and hour columns
retail$year <- sapply(retail$date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][1]})
retail$month <- sapply(retail$date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][2]})
retail$hour <- sapply(retail$time, FUN = function(x) {strsplit(x, split = '[:]')[[1]][1]})
# turning date feature to date type
retail$date <- as.Date(retail$date, "%Y-%m-%d")
# creating day of the week feature
retail$day_week <- wday(retail$date, label = TRUE)
# setting up a frame with unique descriptions for further exploration of products offered
products_list <- unique(retail$Description)

Checking the new structure with the new features

# checking the new structure with the new features
head(retail, 2)
## # A tibble: 2 x 15
##   InvoiceNo StockCode Description     Quantity InvoiceDate  UnitPrice CustomerID
##   <chr>     <chr>     <chr>              <dbl> <chr>            <dbl>      <dbl>
## 1 536365    85123A    WHITE HANGING ~        6 2010-12-01 ~      2.55      17850
## 2 536365    71053     WHITE METAL LA~        6 2010-12-01 ~      3.39      17850
## # ... with 8 more variables: Country <chr>, Spent <dbl>, date <date>,
## #   time <chr>, year <chr>, month <chr>, hour <chr>, day_week <ord>

Customer data by unique Customer

# creation of interactive customer data table

datatable(customer, class = 'cell-border stripe',filter = 'top', options = list(
  pageLength = 10, autoWidth = TRUE, 
  columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    'Table 1: ', htmltools::em('Customer datatable Information '))
  )

3. Exploratory Data Analysis (EDA)

plot_clean1 <- retail %>% 
  group_by(Country) %>% 
  dplyr::summarise(n = n()) 


highchart() %>% 
  hc_chart(type ="column",
           options3d = list(enabled = TRUE, alpha = 15, beta = 15)) %>%
  hc_xAxis(categories = plot_clean1$Country) %>% 
  hc_add_series(data = plot_clean1$n, name = "Total Invoices") %>%
  hc_add_theme(hc_theme_google()) %>%
  hc_title(
    text="Total Invoices - Transaction per Country"
    ) %>%
    hc_chart(
      borderColor = '#EBBA95',
      borderRadius = 10,
      borderWidth = 1,
      backgroundColor = list(
        linearGradient = c(0, 0, 500, 500), stops = list(
               list(0, 'rgb(255, 255, 255)'),
               list(1, 'rgb(200, 200, 255)')
             )))

Observations

The vast majority of invoices come from the United Kingdom, which dwarfs every other country in the chart.

To get a better look at the transactions that happen outside the UK, we will plot the total invoices again, this time excluding the UK and arranging the countries in descending order.

plot_clean <- retail %>% 
  group_by(Country) %>% 
  filter(Country != "United Kingdom") %>% 
  dplyr::summarise(n = n()) %>% 
  arrange(-n)

highchart() %>% 
  hc_chart(type ="column",
           options3d = list(enabled = TRUE, alpha = 15, beta = 15)) %>%
  hc_xAxis(categories = plot_clean$Country) %>% 
  hc_add_series(data = plot_clean$n, name = "Total Invoices") %>%
  hc_add_theme(hc_theme_sandsignika()) %>%
  hc_title(
    text="Total Invoices - Transaction per Country excl. UK (descending)"
    ) %>%
    hc_chart(
      borderColor = '#EBBA95',
      borderRadius = 10,
      borderWidth = 1,
      backgroundColor = list(
        linearGradient = c(0, 0, 500, 500), stops = list(
               list(0, 'rgb(255, 255, 255)'),
               list(1, 'rgb(200, 200, 255)')
             )))

Observations

I will take the top 5 countries by Revenue to get some more insight about them.

retail_country <- retail %>%
  group_by(Country) %>%
  dplyr::summarise(revenue = sum(Spent), transactions = n_distinct(InvoiceNo)) %>%
  mutate(aveOrdVal = (round((revenue / transactions),2))) %>%
  ungroup() %>%
  arrange(desc(revenue))

top_countries_filter <- retail %>%
  filter(Country %in% c('Netherlands', 'EIRE', 'Germany', 'France', 'Australia'))

top_5 <- top_countries_filter %>%
  group_by(Country, date) %>%
  dplyr::summarise(revenue = sum(Spent), transactions = n_distinct(InvoiceNo), 
                   customers = n_distinct(CustomerID)) %>%
  mutate(aveOrdVal = (round((revenue / transactions),2))) %>%
  ungroup() %>%
  arrange(desc(revenue))

top_5 %>% 
  group_by(Country) %>%
  dplyr::summarise(revenue = sum(revenue)) %>% 
  hchart('treemap', hcaes(x = 'Country', value = 'revenue', color = 'revenue')) %>%
  hc_add_theme(hc_theme_elementary()) %>%
  hc_title(text=" Top 5 Countries by Revenue excl. UK")

Observations

We can see that the top 5 countries by revenue, excluding the UK, are in descending order:

  1. Netherlands
  2. EIRE (Ireland)
  3. Germany
  4. France
  5. Australia

I will check the revenue of these countries over time.

ggplot(top_5, aes(x = date, y = revenue, colour = Country)) + geom_smooth(se = FALSE) + 
  labs(x = 'Date', y = 'Revenue', title = 'Revenue by Country over Time') + 
  theme(panel.grid.major = element_line(colour = NA),
    legend.text = element_text(colour = "skyblue4"),
    legend.title = element_text(face = "bold"),
    panel.background = element_rect(fill = NA),
    plot.background = element_rect(fill = "seashell1",
        colour = NA), legend.key = element_rect(fill = "gray71"),
    legend.background = element_rect(fill = NA))

Observations

I will take a look at total revenue by date, based on the total amount spent by customers.

revenue_date1 <- retail %>%
  group_by(date) %>%
  dplyr::summarise(revenue = sum(Spent))

time_series <- xts(
  revenue_date1$revenue, order.by = as.POSIXct(revenue_date1$date))


highchart(type = "stock") %>% 
  hc_title(text = "Revenue by Date") %>% 
  hc_subtitle(text = "Revenue generated from the online store") %>% 
  hc_add_series(time_series) %>%
  hc_add_theme(hc_theme_darkunica()) 

Observations

Now I will create some time-focused plots to explore transaction and revenue patterns.

Revenue by day of the week

retail %>%
  group_by(day_week) %>%
  dplyr::summarise(revenue = sum(Spent)) %>%
  hchart(type = 'column', hcaes(x = day_week, y = revenue)) %>% 
  hc_yAxis(title = list(text = "Revenue")) %>%  
  hc_xAxis(title = list(text = "Day of the Week")) %>% 
  hc_title(text = "Revenue by Day of Week")

From the graph above we can observe that customers buy considerably fewer products on Sundays.

Revenue and transactions by Hour of the Day

retail %>%
  group_by(hour) %>%
  dplyr::summarise(revenue = sum(Spent)) %>%
  hchart(type = 'column', hcaes(x = hour, y = revenue)) %>% 
  hc_yAxis(title = list(text = "Revenue")) %>%  
  hc_xAxis(title = list(text = "Hour Of Day")) %>% 
  hc_title(text = "Revenue by Hour Of Day")
retail %>%
  group_by(hour) %>%
  dplyr::summarise(transactions = n_distinct(InvoiceNo)) %>%
  hchart(type = 'column', hcaes(x = hour, y = transactions)) %>% 
  hc_yAxis(title = list(text = "Number of Transactions")) %>%  
  hc_xAxis(title = list(text = "Hour Of Day")) %>% 
  hc_title(text = "Transactions by Hour Of Day")

Observations

We can see that the peak hours of the store for transactions, and consequently for revenue, occur around the middle of the day. The busiest hour is around 12 noon, where both plots reach their peak.

Let's also visualize where most of our unique customers come from, excluding the United Kingdom, which we know accounts for the majority of our customer base.

# Wordcloud
topic.corpus <- Corpus(VectorSource(customer$Country))

removeHTML <- function(text){
  text <- gsub(pattern = '<.+\\">', "",text)
  text <- gsub(pattern = '</.+>', "", text)
  return(text)
}

topic.corpus <- topic.corpus %>% 
  tm_map(content_transformer(removeHTML)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removeWords, stopwords("SMART")) 

tdm = TermDocumentMatrix(topic.corpus) %>%
  as.matrix()
words = sort(rowSums(tdm), decreasing = TRUE)
df = data.frame(word = names(words), freq = words)

df = df %>%
  filter(nchar(as.character(word)) > 1,
         word != "united" , word != "kingdom")

uxc.colors = c("#fefefe", "#f4f2a8", "#030303")
uxc.background = "#00ccff"


wordcloud2(df,
           color = rep_len(uxc.colors, nrow(df)),
           backgroundColor = uxc.background,
           fontFamily = "DM Sans",
           size = 1,
           minSize = 3,
           rotateRatio = 0)

The main customer base outside the UK comes from Germany and France, followed by Spain and Belgium.

Product Exploration

Unfortunately, the wordcloud2 package appears to have the limitation that only one of its wordclouds can be rendered per output, so for the next wordclouds I will use the regular R wordcloud package.

The dataset does not provide product categories, so I will try to engineer some categories from the available product descriptions so that we can group the products.

Let's explore the 20 most common words in the product descriptions, visualizing with a wordcloud how often each word appears across the unique product descriptions.

# preparing and cleaning the text

docs <- Corpus(VectorSource(products_list))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=20, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Observations

Generic colour and shape words (pink, blue, red, set, white, metal, glass, large, small, holder) dominate the descriptions without saying much about product categories, so I will remove them as custom stopwords and re-plot.

# preparing and cleaning the text

docs <- Corpus(VectorSource(products_list))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("pink", "blue","red","set","white","metal", "glass","large","small","holder"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=20, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

It seems that the following product categories are mainly offered:

  • Decoration products
  • Christmas-related products
  • Bags
  • Vintage-design products
  • Flowers and presents
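
As a rough sketch of how such grouping could be engineered (the keyword lists below are illustrative assumptions, not derived rules), keyword matching on Description with case_when could tag each line item:

# hypothetical keyword-based category tagging (illustrative keyword lists)
retail_cat <- retail %>% 
  mutate(category = case_when(
    str_detect(Description, "CHRISTMAS")               ~ "Christmas",
    str_detect(Description, "BAG")                     ~ "Bags",
    str_detect(Description, "VINTAGE")                 ~ "Vintage design",
    str_detect(Description, "FLOWER|GIFT")             ~ "Flowers & presents",
    str_detect(Description, "DECORATION|HANGING|SIGN") ~ "Decoration",
    TRUE                                               ~ "Other"
  ))
table(retail_cat$category)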

World Map

To visualize the revenue streams of the store from around the world, I will plot a world map showing the different countries where transactions have occurred, on a logarithmic revenue scale.

map_info <- retail %>% 
  group_by(Country) %>% 
  dplyr::summarise(revenue = sum(Spent))
  
highchart(type = "map") %>%
  hc_add_series_map(worldgeojson,
                    map_info %>% 
                      bind_cols(as_tibble(map_info$revenue)) %>% 
                      group_by(map_info$Country) %>% 
                      dplyr::summarise(revenue = log1p(sum(value))) %>% 
                      ungroup() %>% 
                      mutate(iso2 = countrycode(sourcevar = map_info$Country, 
                          origin="country.name", destination="iso2c")),
                    value = "revenue", joinBy = "iso2") %>%
  hc_title(text = "Revenue by country (log)") %>%
  hc_tooltip(useHTML = TRUE, headerFormat = "",
             pointFormat = "{point.map_info$Country}") %>% 
  hc_colorAxis(stops = color_stops(colors = viridisLite::inferno(10, begin = 0.1)))

Observations

We see from the world map that the online store has transactions and customers all around the globe, but its main markets are:

  • UK
  • Europe (mainly Germany and France)
  • Australia
  • Japan

Segmentation - Clustering

Clustering:

RFM (Recency, Frequency, Monetary value): RFM analysis is a prominent customer segmentation and identification tool in database marketing, and it is especially important in the retail industry. RFM assigns each customer a score based on three elements: how recently they purchased, how often, and how much they spent.
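
As an illustration of the idea (not the exact scoring scheme of the rfm package used later), a quantile-based RFM score could be computed as in the sketch below, assuming the rfm data frame with recency, frequency and Spent columns that we engineer later in this section:

# a minimal sketch of quantile-based RFM scoring (illustrative assumption,
# not the scheme used by the rfm package later in this analysis)
rfm_scored <- rfm %>% 
  mutate(r_score = ntile(desc(recency), 5),  # more recent purchase -> higher score
         f_score = ntile(frequency, 5),      # more purchases -> higher score
         m_score = ntile(Spent, 5),          # higher spend -> higher score
         rfm_score = r_score * 100 + f_score * 10 + m_score)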

K-means clustering: K-means is an iterative clustering approach. The algorithm uses the Euclidean distance metric to partition the customers by their RFM values: the number of groups K is predefined, and each data point is repeatedly assigned to the nearest of the K cluster centroids based on feature similarity.
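
To make the assignment step concrete, here is a small self-contained sketch (with toy data) of how each point is assigned to the centroid with the smallest Euclidean distance:

# toy sketch of the k-means assignment step
assign_clusters <- function(X, centers) {
  # pairwise Euclidean distances between every point and every centroid
  d <- as.matrix(dist(rbind(centers, X)))[-(1:nrow(centers)), 1:nrow(centers)]
  max.col(-d)  # index of the nearest centroid for each point
}
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)    # 10 toy points
centers <- X[sample(nrow(X), 3), ]  # 3 initial centroids
assign_clusters(X, centers)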

Choosing the best number of clusters K: The number of clusters is the essential input for k-means clustering. It is chosen using the principle of minimizing the within-cluster sum of squares (WCSS). The number of clusters is plotted on the x-axis, and the WCSS for each cluster number is plotted on the y-axis.

To determine the optimal number of clusters, we use the scree plot/elbow approach. The WCSS decreases as the number of clusters increases; the decline is sharp at first, but the rate of decrease eventually slows, producing an elbow in the plot. The number of clusters at the elbow usually indicates the ideal choice.
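
As a minimal sketch of the elbow computation (assuming the scaled matrix rfm_norm built later in this section), the WCSS curve can be produced directly with base R:

# minimal sketch: total within-cluster sum of squares (WCSS) for k = 1..10
set.seed(123)
wcss <- sapply(1:10, function(k) kmeans(rfm_norm, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wcss, type = "b", xlab = "Number of clusters k", ylab = "WCSS")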

RFM feature engineering

We will create customer categories

Before we dive into clustering our customers, we have to feature engineer our RFM values to have a solid dataframe for the clustering.

Recency

Recency: This refers to the number of days between a customer's last purchase and the reference date. The lower the recency value, the more recently the customer has bought from the store.

recency <- retail %>% 
  dplyr::select(CustomerID, InvoiceDate) %>% 
  mutate(recency = as.Date("2011-12-09") - as.Date(InvoiceDate))

recency <- recency %>% 
  dplyr::select(CustomerID, recency) %>% 
  group_by(CustomerID) %>% 
  slice(which.min(recency))

head(recency,3)
## # A tibble: 3 x 2
## # Groups:   CustomerID [3]
##   CustomerID recency
##        <dbl> <drtn> 
## 1      12347  2 days
## 2      12348 75 days
## 3      12349 18 days

Observations

We have constructed our recency feature, and we can see the number of days since the last purchase for each unique customer of the store.

Frequency

Frequency: This is the number of purchases a customer has made over the period. The higher the frequency value, the more often the customer buys from the company.

amount_products <- retail %>%
    dplyr::select(CustomerID, InvoiceDate) %>% 
    group_by(CustomerID, InvoiceDate) %>% 
    summarize(n_prod = n())

df_frequency <- amount_products %>% 
    dplyr::select(CustomerID) %>%
    group_by(CustomerID) %>% 
    summarize(frequency = n())

head(df_frequency,3)
## # A tibble: 3 x 2
##   CustomerID frequency
##        <dbl>     <int>
## 1      12347         7
## 2      12348         4
## 3      12349         1

Monetary value

Monetary: This refers to the amount of money spent by a customer during a specific period of time. The higher the value, the more revenue the customer generates for the company.

  • We have the total revenue per customer ready from the earlier feature engineering (the Spent feature), so we will just drop the other columns.

monetary <- select(customer, c("CustomerID", "Spent"))


head(monetary,3)
## # A tibble: 3 x 2
## # Groups:   CustomerID [3]
##   CustomerID Spent
##        <dbl> <dbl>
## 1      12347 4310 
## 2      12348 1797.
## 3      12349 1758.

RFM complete table

  • Now that our RFM engineering is complete, we can join the three data frames into one complete dataframe that will serve the RFM clustering analysis for the customers of the store.

# inner join the three RFM data frames by CustomerID
rfm <- recency %>% 
  dplyr::inner_join(., df_frequency, by = "CustomerID") %>% 
  dplyr::inner_join(., monetary, by = "CustomerID")

# drop the days from recency column and transform it into numeric data type
rfm <- rfm %>% 
  mutate(recency = str_replace(recency, " days", "")) %>% 
  mutate(recency = as.numeric(recency)) %>% 
  ungroup()

head(rfm, 3)
## # A tibble: 3 x 4
##   CustomerID recency frequency Spent
##        <dbl>   <dbl>     <int> <dbl>
## 1      12347       2         7 4310 
## 2      12348      75         4 1797.
## 3      12349      18         1 1758.

# Deleting the CustomerID column to keep only our 3 RFM features for the modelling data frame
rfm_clean <- select(rfm, -CustomerID)

Plotting RFM values

h1 <- hchart(
  rfm_clean$recency, 
  color = "#B71C1C", name = "Recency")

h2 <- hchart(
  rfm_clean$frequency, 
  color = "#668F80", name = "Frequency")

h3 <- hchart(
  rfm_clean$Spent, 
  color = "#4A6670", name = "Monetary Value")

htmltools::browsable(hw_grid(h1, h2, h3, ncol = 3, rowheight = 500))

Observations

  • From the plots we can see the different distributions that occur for the 3 RFM attributes.

K-means Clustering

Scaling the data

For the k-means approach it is best practice to scale your data.

# scaling
rfm_norm <- scale(rfm_clean)
summary(rfm_norm)
##     recency          frequency            Spent         
##  Min.   :-0.9207   Min.   :-0.42653   Min.   :-0.23309  
##  1st Qu.:-0.7505   1st Qu.:-0.42653   1st Qu.:-0.19745  
##  Median :-0.4202   Median :-0.29549   Median :-0.15480  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.4908   3rd Qu.: 0.09762   3rd Qu.:-0.03904  
##  Max.   : 2.8131   Max.   :26.96062   Max.   :32.56198

Calculating how many clusters we need

Before we do the actual clustering, we need to identify the optimal number of clusters (k) for this dataset of retail customers. Popular ways of determining the number of clusters are:

  1. Elbow Method
  2. Silhouette Method
  3. Gap Statistic Method

The Elbow and Silhouette methods are direct methods, while the gap statistic is a statistical testing method.

Elbow method

fviz_nbclust(rfm_norm, kmeans, method="wss")+geom_vline(xintercept=3,linetype=2) + labs(subtitle = "Elbow method")

Based on the Elbow method chart we could use 3, 4, 5 or even 6 clusters, judging by where the slope flattens out. The choice is subjective, since it depends on where one places the elbow.

Let's also test the optimal number of clusters with the Silhouette method:

Silhouette method

The silhouette coefficient measures how similar each point is to its own cluster compared with the other clusters; averaging it over all points gives a quality score for a clustering. The silhouette plot shows the average silhouette coefficient over values of k ranging from 1 to 10.

fviz_nbclust(rfm_norm,kmeans, method = "silhouette")+labs(subtitle = "Silhouette method") 

Observations

  • The plot shows the highest average silhouette coefficient occurring at k = 4.
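
As a quick cross-check of this result, the average silhouette width for k = 4 can be computed directly with the cluster package (a sketch that re-runs k-means on the scaled data):

# cross-check: average silhouette width for k = 4
library(cluster)
set.seed(123)
km4 <- kmeans(rfm_norm, centers = 4, nstart = 10)
sil <- silhouette(km4$cluster, dist(rfm_norm))
mean(sil[, "sil_width"])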

I will test the optimal number with the gap statistic as well:

Gap statistic method

The gap statistic compares the intra-cluster variation for different values of k with the expected intra-cluster variation under a null reference distribution. Ideally we should choose the value of k that maximizes the gap statistic. However, in real-world datasets where clusters are not so well defined, it may be more parsimonious to choose the k at which the rate of increase begins to slow down, i.e. the smallest k whose gap statistic is within one standard error of the gap statistic at k + 1.

Let's see what we get:

library(cluster)
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(rfm_norm, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)

fviz_gap_stat(gap_stat) + labs(subtitle = "Gap statistic method")

Observations

As we can see from the graph, the rate of increase begins to slow down at k = 3; as such, the gap statistic proposes k = 3 as the optimal number of clusters.
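
The same decision rule can also be applied programmatically with cluster::maxSE, a sketch using the gap values and their simulation standard errors stored in the clusGap result:

# extracting the k suggested by the "first SE max" rule from the clusGap table
gap_tab <- gap_stat$Tab
maxSE(gap_tab[, "gap"], gap_tab[, "SE.sim"], method = "firstSEmax")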

Finding the optimum number of clusters

  • One of the most discussed disadvantages of k-means clustering, is the subjective nature of choosing the optimal k number of clusters.

  • To avoid the subjective guesswork that follows the visual observation of cluster results, I will use the NbClust R package. While there are several programs and packages that do validation for a particular clustering solution, different validation metrics yield different results for different clustering algorithms.

  • The NbClust package can help us with determining the best number of clusters for our data.

# NbClust package for determining the best number of clusters
library(NbClust)

nb_res <- NbClust(rfm_norm, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")

The Hubert index is a graphical method of determining the number of clusters: in its plot we look for a significant knee that corresponds to a significant increase in the value of the measure, i.e. the significant peak in the plot of the second differences of the Hubert index.

From the NbClust summary:

Among all indices:

  • 3 proposed 2 as the best number of clusters
  • 6 proposed 3 as the best number of clusters
  • 3 proposed 4 as the best number of clusters
  • 3 proposed 6 as the best number of clusters
  • 3 proposed 9 as the best number of clusters
  • 5 proposed 10 as the best number of clusters

Consequently, according to the majority rule, the ideal number of clusters is three (k = 3). However, 5 indices proposed 10 as the best number of clusters, a close second. While 10 clusters would give more distinct groups, the categorization might become too specific, and with many clusters there is a danger of overfitting the data.
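
The vote counts above can also be tallied programmatically from the stored NbClust result (a sketch, assuming nb_res from the call above):

# counting how many indices proposed each number of clusters
table(nb_res$Best.nc["Number_clusters", ])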

Clusters - Visualizing

k = 3 clusters

After running several tests above, we can conclude that the most appropriate optimal number of clusters for our data set seems to be k = 3, so we will compute k-means with k = 3 to segment the customers into groups.

# Compute k-means with k = 3 

set.seed(123)
km.res <- kmeans(rfm_norm, 3, nstart = 10) # 3 centers, and nstart set to 10 random sets

km.res$centers # centroids from model on normalized data
##      recency   frequency       Spent
## 1  1.5389291 -0.35067434 -0.16793154
## 2 -0.8638170  8.35832179  9.43429552
## 3 -0.5126431  0.05364613 -0.01635047

Observations

  • As we can see from the results, Cluster 1 has the highest recency (its customers purchased longest ago), but what is really interesting is that Cluster 2 has the highest transaction frequency as well as the highest spend compared to the other two clusters.

Let's also check the number of customers in each cluster.

km.res$size
## [1] 1090   25 3230
km.res$betweenss/km.res$totss
## [1] 0.5828409

Observations

  • The score (between_SS / total_SS) is 58.3%, which is relatively good. This ratio measures the share of the total variance in the data that is explained by the separation between the clusters.
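
As a quick sanity check, the same ratio can be recomputed from the per-cluster within-cluster sums of squares, since total_SS = between_SS + sum(within_SS):

# equivalent computation of the (between_SS / total_SS) ratio
1 - sum(km.res$withinss) / km.res$totss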

Visualization of the clustering algorithm results

set.seed(123)
fviz_cluster(km.res, geom = "point", data = rfm_norm) + ggtitle("k = 3")

Observations

  • First of all, we see this stretched distribution because of the known disadvantage of k-means in dealing with outliers.

We have 3 distinguishable clusters, with cluster 2 containing the most valuable customers.

frm_k3 <- rfm %>% 
  mutate(cluster = km.res$cluster)

frm_k3
## # A tibble: 4,345 x 5
##    CustomerID recency frequency Spent cluster
##         <dbl>   <dbl>     <int> <dbl>   <int>
##  1      12347       2         7 4310        3
##  2      12348      75         4 1797.       3
##  3      12349      18         1 1758.       3
##  4      12350     310         1  334.       1
##  5      12352      36         8 2506.       3
##  6      12353     204         1   89        1
##  7      12354     232         1 1079.       1
##  8      12355     214         1  459.       1
##  9      12356      22         3 2811.       3
## 10      12357      33         1 6208.       3
## # ... with 4,335 more rows

Observations

  • So we can see that cluster 2 contains our MVP customers, who have generated the most revenue by spending the most money on the store's products. Cluster 3 holds a mixed group of potentially valuable customers, as they either purchase often or have made a transaction recently, signaling new opportunities for the business.

k = 10 clusters

# Compute k-means with k = 10 

set.seed(123)
km.res_10 <- kmeans(rfm_norm, 10, nstart = 10) # 10 centers, and nstart set to 10 random starts

km.res_10$centers # centroids from model on normalized data
##       recency  frequency      Spent
## 1  -0.9156694 26.3709426 10.1565245
## 2   1.5566561 -0.3731575 -0.1653148
## 3  -0.8546077  6.7544044  6.5144269
## 4  -0.7103538 -0.1627713 -0.1096455
## 5  -0.8160590  2.6978481  1.2252349
## 6   2.4395250 -0.3942279 -0.1901048
## 7  -0.7278237  0.7546869  0.2267256
## 8  -0.1731544 -0.2640112 -0.1325608
## 9  -0.8906441  7.0426902 28.4186004
## 10  0.6849949 -0.2770338 -0.1449719

Let's also check the number of customers in each cluster.

km.res_10$size
##  [1]    2  437   20 1479  102  288  561  949    3  504
km.res_10$betweenss/km.res_10$totss
## [1] 0.89463

As expected, with more clusters the ratio increases, here to 89.5%.

Visualization of the clustering algorithm results

set.seed(123)
fviz_cluster(km.res_10, geom = "point", data = rfm_norm) + ggtitle("k = 10")

frm_k10 <- rfm %>% 
  mutate(cluster = km.res_10$cluster)

head(frm_k10)
## # A tibble: 6 x 5
##   CustomerID recency frequency Spent cluster
##        <dbl>   <dbl>     <int> <dbl>   <int>
## 1      12347       2         7 4310        7
## 2      12348      75         4 1797.       8
## 3      12349      18         1 1758.       4
## 4      12350     310         1  334.       6
## 5      12352      36         8 2506.       7
## 6      12353     204         1   89        2

frm_k10 %>%
  group_by(cluster) %>% 
  summarize(number = n(),
            frequency = round(mean(frequency)),
            recency = round(mean(recency)),
            monetary = round(mean(Spent)))
## # A tibble: 10 x 5
##    cluster number frequency recency monetary
##      <int>  <int>     <dbl>   <dbl>    <dbl>
##  1       1      2       206       0    88772
##  2       2    437         1     247      582
##  3       3     20        56       7    57654
##  4       4   1479         3      21     1058
##  5       5    102        25      10    12463
##  6       6    288         1     336      370
##  7       7    561        10      19     3932
##  8       8    949         2      75      862
##  9       9      3        58       3   244805
## 10      10    504         2     160      756

With 10 clusters we have to be careful, as there is a chance of overfitting the data. As such, k = 3 clusters is preferred.

RFM analysis

To find our most valuable customers based on RFM analysis I will also use the rfm package, which will automatically compute RFM scores for our customers from the input data.

According to the Pareto principle, 20% of the customers (the vital few) contribute more to the revenue of the company than the rest. It argues that 80 percent of effects can be traced back to as few as 20% of all causes: these 20% of causes are vital, while the remaining 80% of causes account for only the remaining 20% of effects and are the trivial many. The vital 20% are the high-value, important customers that the company would like to keep.
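
As a quick check of the Pareto idea on our own data, here is a sketch computing the share of total revenue generated by the top 20% of customers by spend (using the customer frame built earlier):

# share of total revenue generated by the top 20% of customers by spend
customer %>% 
  ungroup() %>% 
  arrange(desc(Spent)) %>% 
  mutate(rank = row_number()) %>% 
  summarise(revenue_share_top20 = sum(Spent[rank <= 0.2 * n()]) / sum(Spent))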

recency_s <- retail %>% 
  dplyr::select(CustomerID, date) %>% 
  group_by(CustomerID) %>% 
  slice(which.max(date))

rfm_test <- customer %>% inner_join(recency_s, by = "CustomerID")

rfm_test <- rfm_test %>% left_join(rfm, by = "CustomerID") 

library(rfm)
rfm_result <- rfm_table_customer(rfm_test, CustomerID, Quantity, recency, Spent.x, as.Date("2011-12-09"))
rfm_datatable <- rfm_result$rfm

Creation of an interactive customer data table with RFM scores, to observe each customer's values.

# creation of an interactive customer data table with RFM scores
rfm_datatable_print <- rfm_datatable %>% 
  select(-recency_days,-transaction_count, - amount)

datatable(rfm_datatable_print, class = 'cell-border stripe',filter = 'top', options = list(
  pageLength = 10, autoWidth = TRUE, 
  columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    'Table 1: ', htmltools::em('Customer Values - RFM scores '))
 )

Creation of an interactive customer data table with RFM scores for approximately the top 20% of customers, who are the most vital for the business.

# get the top 20% of rfm_score of Customers, to find the most valuable customers 
n <- 20
top_20 <- rfm_datatable %>% 
  subset(rfm_score >  quantile(rfm_score, prob = 1 - n/100))

top_20 <- top_20 %>% 
  select(customer_id, recency_days, transaction_count, amount, rfm_score)

datatable(top_20, class = 'cell-border stripe',filter = 'top', options = list(
  pageLength = 10, autoWidth = TRUE, 
  columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
  'Table 1: ', htmltools::em('Most valuable customers (~20% of total) '))
  )

Observations

Note also that the data contains transactions from just over one year of business, which is not much history to distinguish the customer base with full accuracy.
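
A quick check of the time span actually covered by the transactions:

# the date range covered by the data
range(retail$date)
diff(range(retail$date))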

Now I will create customer segments based on the segment definitions shown in the image below.

[Image: Segments - RFM segment definitions from rsquareacademy.com]

We will segment our customers based on these proposed definitions from rsquareacademy.com; however, there is some overlap between the conditions, so I will adjust them.

# assigning segments with case_when (conditions evaluated in order, first match wins)
rfm_segments <- rfm_datatable %>% 
  mutate(segment = case_when(
    recency_score >= 4 & frequency_score >= 4 & monetary_score >= 4 ~ "Champion",
    recency_score >= 2 & frequency_score >= 3 & monetary_score >= 3 ~ "Loyal Customer",
    recency_score >= 3 & frequency_score <= 3 & monetary_score <= 3 ~ "Potential Loyalist",
    recency_score >= 4 & frequency_score <= 1 & monetary_score <= 1 ~ "New Customer",
    recency_score %in% c(3, 4) & frequency_score <= 1 & monetary_score <= 1 ~ "Promising",
    recency_score %in% c(2, 3) & frequency_score %in% c(2, 3) & monetary_score %in% c(2, 3) ~ "Need Attention",
    recency_score %in% c(2, 3) & frequency_score <= 2 & monetary_score <= 2 ~ "About to Sleep",
    recency_score <= 2 & frequency_score > 2 & monetary_score > 2 ~ "At Risk",
    recency_score <= 1 & frequency_score >= 4 & monetary_score >= 4 ~ "Can't lose them",
    recency_score <= 2 & frequency_score == 2 & monetary_score == 2 ~ "Hibernating",
    TRUE ~ "Lost"
  ))

Now we have our customer categories/segments ready, so the business can differentiate its customers and act accordingly.

Let's first look at the number of customers in each segment.

# checking segments size

rfm_segments %>%
  group_by(segment) %>% 
  count(segment) %>%
  arrange(desc(n)) %>%
  rename(Segment = segment, Count = n)
## # A tibble: 8 x 2
## # Groups:   Segment [8]
##   Segment            Count
##   <chr>              <int>
## 1 Loyal Customer      1242
## 2 Champion             981
## 3 Potential Loyalist   813
## 4 Lost                 623
## 5 About to Sleep       204
## 6 Need Attention       198
## 7 At Risk              158
## 8 Hibernating          142

rfm_plot_median_monetary(rfm_segments)

Observations

Let's produce a table that identifies each of the store's customers by segment, so that the business can treat them accordingly and focus on the vital customers that will drive its prosperity.

plot_segments <- rfm_segments %>% 
  select(customer_id, segment)

datatable(plot_segments, class = 'cell-border stripe',filter = 'top', options = list(
  pageLength = 10, autoWidth = TRUE, 
  columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
  'Table 1: ', htmltools::em('Customers by segment'))
  )

Conclusion

In this analysis I explored the online retail data to try and observe important factors for the business and segment the customer base into insightful information that can be implemented for better customer relationship management.

More specifically, I identified 3 important customer clusters through the k-means approach, with one cluster containing the 25 most valuable (MVP) customers of the business.

Consequently, I implemented an RFM analysis and segmented the customers into more specific groups to drive actions. The interactive datatable above can serve as an indicative reference for CRM practices.

Other key differentiators

For a more in-depth division of customers into appropriate groups, other data that could be used include:

K-means clustering algorithm & Disadvantages:

K-means clustering is a straightforward and quick technique, and it is capable of handling massive datasets with ease. The k-means strategy, on the other hand, has some flaws: it requires us to define the number of clusters ahead of time, it is sensitive to outliers, and different results can be obtained if the data is reordered.

The same stands for RFM analysis. In the end, the way a company chooses to segment its customer base depends on its philosophy and culture. The unique type of product or service it provides also plays a major role in the approach it adopts, the categorization of its customers, and where it focuses its efforts for better customer relationships and thus a profitable future.

To sum up, whatever the results, before making business decisions we should never forget that clustering is a model, and while in most cases it is able to generate value, it should always be treated as such.