Customer segmentation is the process of dividing customers into groups based on common behaviors or other characteristics. The groupings should be homogeneous within themselves and heterogeneous in relation to one another. The ultimate goal of this approach is to find a high-value client base, or customers with the greatest potential for growth or profit.
Customer segmentation insights are used to create tailored marketing campaigns as well as the overall marketing strategy and planning. Broadly, customer segmentation is the practice of dividing a company's customers into groups that reflect similarities among the customers in each group.
Effective customer segmentation can bring several benefits to a business.
Here is a brief version of the data description file.
The analysis follows the steps and sections outlined below.
# suppress warning messages (global option)
options(warn = -1)
# Loading the required Packages
library(tidyverse)
library(readxl)
library(DataExplorer)
library(visdat)
library(naniar)
library(rmarkdown)
library(knitr)
library(DT)
library(wordcloud2)
library(wordcloud)
library(extrafont)
library(SnowballC)
library(tm)
library(dlookr)
library(gridExtra)
library(scales)
library(lubridate)
library(ggThemeAssist)
library(plotly)
library(hrbrthemes)
library(xts)
library(highcharter)
library(countrycode)
library(factoextra)
# Loading the dataset
retail <- read_excel("Online Retail.xlsx")
# getting data View
# View(retail)
#display the colnames
colnames(retail)
## [1] "InvoiceNo" "StockCode" "Description" "Quantity" "InvoiceDate"
## [6] "UnitPrice" "CustomerID" "Country"
# getting information about the structure of the data
str(retail)
## tibble [541,909 x 8] (S3: tbl_df/tbl/data.frame)
## $ InvoiceNo : chr [1:541909] "536365" "536365" "536365" "536365" ...
## $ StockCode : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
## $ Country : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
Observations
# getting summarized information about the features of the data
summary(retail)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 9.55
## 3rd Qu.: 10.00
## Max. : 80995.00
##
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00 Min. :-11062.06 Min. :12346
## 1st Qu.:2011-03-28 11:34:00 1st Qu.: 1.25 1st Qu.:13953
## Median :2011-07-19 17:17:00 Median : 2.08 Median :15152
## Mean :2011-07-04 13:34:57 Mean : 4.61 Mean :15288
## 3rd Qu.:2011-10-19 11:27:00 3rd Qu.: 4.13 3rd Qu.:16791
## Max. :2011-12-09 12:50:00 Max. : 38970.00 Max. :18287
## NA's :135080
## Country
## Length:541909
## Class :character
## Mode :character
##
##
##
##
Observations
Peek on the first 5 rows
# Peek on the first 5 rows
head(retail, 5)
## # A tibble: 5 x 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 536365 85123A WHITE HANGING HEAR~ 6 2010-12-01 08:26:00 2.55
## 2 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39
## 3 536365 84406B CREAM CUPID HEARTS~ 8 2010-12-01 08:26:00 2.75
## 4 536365 84029G KNITTED UNION FLAG~ 6 2010-12-01 08:26:00 3.39
## 5 536365 84029E RED WOOLLY HOTTIE ~ 6 2010-12-01 08:26:00 3.39
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>
# Getting the amount of distinct CustomerID values
n_distinct(retail$CustomerID)
## [1] 4373
n_distinct(retail$Description)
## [1] 4212
Observations
Checking & Dealing with missing values
Plotting percentage of missing values per feature
# Plotting percentage of missing values per feature
gg_miss_var(retail, show_pct = TRUE)
Observations
# Deleting rows where CustomerID is missing
retail <- retail %>%
  filter(!is.na(CustomerID))
# Replacing NA Description values with the string "empty"
retail$Description <- replace_na(retail$Description, "empty")
Checking & Treating Outliers
# plotting outliers for Quantity and UnitPrice
plot_outlier(retail, Quantity, UnitPrice, col = "#ACAC89")
Let's check the minimum and maximum unit price of the products.
min(retail$UnitPrice)
## [1] 0
max(retail$UnitPrice)
## [1] 38970
Observations
# checking quantities that are negative, arranged by descending order
quantity_check <- retail %>%
filter(Quantity < 0) %>%
arrange(Quantity)
head(quantity_check, 3)
## # A tibble: 3 x 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 C581484 23843 PAPER CRAFT , LITT~ -80995 2011-12-09 09:27:00 2.08
## 2 C541433 23166 MEDIUM CERAMIC TOP~ -74215 2011-01-18 10:17:00 1.04
## 3 C536757 84347 ROTATING SILVER AN~ -9360 2010-12-02 14:23:00 0.03
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>
Observations
retail %>%
filter(CustomerID == 16446)
## # A tibble: 4 x 8
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## <chr> <chr> <chr> <dbl> <dttm> <dbl>
## 1 553573 22980 PANTRY SCRUBBING B~ 1 2011-05-18 09:52:00 1.65
## 2 553573 22982 PANTRY PASTRY BRUSH 1 2011-05-18 09:52:00 1.25
## 3 581483 23843 PAPER CRAFT , LITT~ 80995 2011-12-09 09:15:00 2.08
## 4 C581484 23843 PAPER CRAFT , LITT~ -80995 2011-12-09 09:27:00 2.08
## # ... with 2 more variables: CustomerID <dbl>, Country <chr>
Observations
# deleting the outliers by their Invoice Number, InvoiceNo
retail <- retail[!(retail$InvoiceNo == 581483 | retail$InvoiceNo == 541431),]
We then clean the dataset by filtering out negative and zero values of the Quantity and UnitPrice features, and check the outlier plots again on the filtered data for one more view.
# keeping only positive Quantity and UnitPrice values
retail <- retail %>%
  filter(Quantity > 0) %>%
  filter(UnitPrice > 0)
plot_outlier(retail, Quantity, UnitPrice, col = "#ACAC89")
Observations
I will check the unique countries that the dataset contains transaction data from.
unique(retail$Country)
## [1] "United Kingdom" "France" "Australia"
## [4] "Netherlands" "Germany" "Norway"
## [7] "EIRE" "Switzerland" "Spain"
## [10] "Poland" "Portugal" "Italy"
## [13] "Belgium" "Lithuania" "Japan"
## [16] "Iceland" "Channel Islands" "Denmark"
## [19] "Cyprus" "Sweden" "Finland"
## [22] "Austria" "Greece" "Singapore"
## [25] "Lebanon" "United Arab Emirates" "Israel"
## [28] "Saudi Arabia" "Czech Republic" "Canada"
## [31] "Unspecified" "Brazil" "USA"
## [34] "European Community" "Bahrain" "Malta"
## [37] "RSA"
Our dataset is now clean, and we can move on to creating some new features to work with.
Creation of new features:
# new feature Spent: the amount spent on each transaction line (Quantity * UnitPrice)
retail <- mutate(retail, Spent = Quantity * UnitPrice)
# new customer dataframe
customer <- retail %>%
  group_by(CustomerID, Country) %>%
  summarise(across(c(Spent, Quantity), ~ sum(.x, na.rm = TRUE)))
# new date and time columns extracted from InvoiceDate column
retail$InvoiceDate <- as.character(retail$InvoiceDate)
retail$date <- sapply(retail$InvoiceDate, FUN = function(x) {strsplit(x, split = '[ ]')[[1]][1]})
retail$time <- sapply(retail$InvoiceDate, FUN = function(x) {strsplit(x, split = '[ ]')[[1]][2]})
# new month, year and hour columns
retail$year <- sapply(retail$date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][1]})
retail$month <- sapply(retail$date, FUN = function(x) {strsplit(x, split = '[-]')[[1]][2]})
retail$hour <- sapply(retail$time, FUN = function(x) {strsplit(x, split = '[:]')[[1]][1]})
# turning date feature to date type
retail$date <- as.Date(retail$date, "%Y-%m-%d")
# creating day of the week feature
retail$day_week <- wday(retail$date, label = TRUE)
# setting up a frame with unique descriptions for further exploration of products offered
products_list <- unique(retail$Description)
Checking the new structure with the new features
# checking the new structure with the new features
head(retail, 2)
## # A tibble: 2 x 15
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 536365 85123A WHITE HANGING ~ 6 2010-12-01 ~ 2.55 17850
## 2 536365 71053 WHITE METAL LA~ 6 2010-12-01 ~ 3.39 17850
## # ... with 8 more variables: Country <chr>, Spent <dbl>, date <date>,
## # time <chr>, year <chr>, month <chr>, hour <chr>, day_week <ord>
Customer data by unique Customer
# creation of interactive customer data table
datatable(customer, class = 'cell-border stripe',filter = 'top', options = list(
pageLength = 10, autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
caption = htmltools::tags$caption(
style = 'caption-side: bottom; text-align: center;',
'Table 1: ', htmltools::em('Customer datatable Information '))
)
In this section I will perform an Exploratory Data Analysis (EDA). EDA is an important first step for investigating the dataset: it helps us observe common patterns, spot anomalies and retrieve useful information about the data in a graphical way.
Getting a visual of the countries with the most transactions (total invoices).
plot_clean1 <- retail %>%
group_by(Country) %>%
dplyr::summarise(n = n())
highchart() %>%
hc_chart(type ="column",
options3d = list(enabled = TRUE, alpha = 15, beta = 15)) %>%
hc_xAxis(categories = plot_clean1$Country) %>%
hc_add_series(data = plot_clean1$n, name = "Total Invoices") %>%
hc_add_theme(hc_theme_google()) %>%
hc_title(
text="Total Invoices - Transaction per Country"
) %>%
hc_chart(
borderColor = '#EBBA95',
borderRadius = 10,
borderWidth = 1,
backgroundColor = list(
linearGradient = c(0, 0, 500, 500), stops = list(
list(0, 'rgb(255, 255, 255)'),
list(1, 'rgb(200, 200, 255)')
)))
Observations
To get a better look at the transactions that happen outside the UK, we will plot the total invoices again, this time excluding the UK and arranged in descending order.
plot_clean <- retail %>%
group_by(Country) %>%
filter(Country != "United Kingdom") %>%
dplyr::summarise(n = n()) %>%
arrange(-n)
highchart() %>%
hc_chart(type ="column",
options3d = list(enabled = TRUE, alpha = 15, beta = 15)) %>%
hc_xAxis(categories = plot_clean$Country) %>%
hc_add_series(data = plot_clean$n, name = "Total Invoices") %>%
hc_add_theme(hc_theme_sandsignika()) %>%
hc_title(
text="Total Invoices - Transaction per Country excl. UK (descending)"
) %>%
hc_chart(
borderColor = '#EBBA95',
borderRadius = 10,
borderWidth = 1,
backgroundColor = list(
linearGradient = c(0, 0, 500, 500), stops = list(
list(0, 'rgb(255, 255, 255)'),
list(1, 'rgb(200, 200, 255)')
)))
Observations
I will take the top 5 countries by Revenue to get some more insight about them.
retail_country <- retail %>%
group_by(Country) %>%
dplyr::summarise(revenue = sum(Spent), transactions = n_distinct(InvoiceNo)) %>%
mutate(aveOrdVal = (round((revenue / transactions),2))) %>%
ungroup() %>%
arrange(desc(revenue))
top_countries_filter <- retail %>%
filter(Country == 'Netherlands' | Country == 'EIRE' | Country == 'Germany' | Country == 'France'
| Country == 'Australia')
top_5 <- top_countries_filter %>%
group_by(Country, date) %>%
dplyr::summarise(revenue = sum(Spent), transactions = n_distinct(InvoiceNo),
customers = n_distinct(CustomerID)) %>%
mutate(aveOrdVal = (round((revenue / transactions),2))) %>%
ungroup() %>%
arrange(desc(revenue))
top_5 %>%
group_by(Country) %>%
dplyr::summarise(revenue = sum(revenue)) %>%
hchart('treemap', hcaes(x = 'Country', value = 'revenue', color = 'revenue')) %>%
hc_add_theme(hc_theme_elementary()) %>%
hc_title(text=" Top 5 Countries by Revenue excl. UK")
Observations
- We can see that the top 5 countries by revenue, excluding the UK, are the Netherlands, EIRE, Germany, France and Australia.
I will check the revenue of these countries over time.
ggplot(top_5, aes(x = date, y = revenue, colour = Country)) + geom_smooth(method = 'auto', se = FALSE) +
labs(x = 'Date', y = 'Revenue', title = 'Revenue by Country over Time') +
theme(panel.grid.major = element_line(colour = NA),
legend.text = element_text(colour = "skyblue4"),
legend.title = element_text(face = "bold"),
panel.background = element_rect(fill = NA),
plot.background = element_rect(fill = "seashell1",
colour = NA), legend.key = element_rect(fill = "gray71"),
legend.background = element_rect(fill = NA))
Observations
I will take a look at total revenue by date, based on the total amount spent by customers.
revenue_date1 <- retail %>%
group_by(date) %>%
dplyr::summarise(revenue = sum(Spent))
time_series <- xts (
revenue_date1$revenue,order.by = as.POSIXct(revenue_date1$date))
highchart(type = "stock") %>%
hc_title(text = "Revenue by Date") %>%
hc_subtitle(text = "Revenue generated from the online store") %>%
hc_add_series(time_series) %>%
hc_add_theme(hc_theme_darkunica())
Observations
Now I will create some time-focused plots to explore transaction and revenue patterns.
Revenue by day of the week
retail %>%
group_by(day_week) %>%
dplyr::summarise(revenue = sum(Spent)) %>%
hchart(type = 'column', hcaes(x = day_week, y = revenue)) %>%
hc_yAxis(title = list(text = "Revenue")) %>%
hc_xAxis(title = list(text = "Day of the Week")) %>%
hc_title(text = "Revenue by Day of Week")
From the graph above we can observe that customers buy considerably fewer products on Sundays.
Revenue and transactions by Hour of the Day
retail %>%
group_by(hour) %>%
dplyr::summarise(revenue = sum(Spent)) %>%
hchart(type = 'column', hcaes(x = hour, y = revenue)) %>%
hc_yAxis(title = list(text = "Revenue")) %>%
hc_xAxis(title = list(text = "Hour Of Day")) %>%
hc_title(text = "Revenue by Hour Of Day")
retail %>%
group_by(hour) %>%
dplyr::summarise(transactions = n_distinct(InvoiceNo)) %>%
hchart(type = 'column', hcaes(x = hour, y = transactions)) %>%
hc_yAxis(title = list(text = "Number of Transactions")) %>%
hc_xAxis(title = list(text = "Hour Of Day")) %>%
hc_title(text = "Transactions by Hour Of Day")
Observations
- We can see that the store's peak hours for transactions, and consequently for revenue, occur during the middle of the day. The busiest hour is around 12 noon, where both charts peak.
Let's also visualize where most of our unique customers come from, excluding the United Kingdom, which we already know accounts for the majority of the customer base.
# Wordcloud
topic.corpus <- Corpus(VectorSource(customer$Country))
removeHTML <- function(text){
text <- gsub(pattern = '<.+\\">', "",text)
text <- gsub(pattern = '</.+>', "", text)
return(text)
}
topic.corpus <- topic.corpus %>%
tm_map(content_transformer(removeHTML)) %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(removeWords, stopwords("SMART"))
tdm = TermDocumentMatrix(topic.corpus) %>%
as.matrix()
words = sort(rowSums(tdm), decreasing = TRUE)
df = data.frame(word = names(words), freq = words)
df = df %>%
filter(nchar(as.character(word)) > 1,
word != "united" , word != "kingdom")
uxc.colors = c("#fefefe", "#f4f2a8", "#030303")
uxc.background = "#00ccff"
wordcloud2(df,
color = rep_len(uxc.colors, nrow(df)),
backgroundColor = uxc.background,
fontFamily = "DM Sans",
size = 1,
minSize = 3,
rotateRatio = 0)
The main customer base outside the UK comes from Germany and France, followed by Spain and Belgium.
Product Exploration
Unfortunately, the wordcloud2 package seems to have a limitation: only one wordcloud from the package can appear in the output. For the next wordclouds I will therefore use the regular R wordcloud package.
The dataset does not provide product categories, so I will try to engineer some categories based on the available product descriptions, allowing us to put the products into groups.
Exploring the 20 most common words from the product descriptions and visualizing them with a wordcloud, based on how often they appear across unique product descriptions.
# preparing and cleaning the text
docs <- Corpus(VectorSource(products_list))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=20, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Observations
# preparing and cleaning the text
docs <- Corpus(VectorSource(products_list))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("pink", "blue","red","set","white","metal", "glass","large","small","holder"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=20, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
It seems that the store mainly offers the following product categories:
- Decoration products
- Christmas-related products
- Bags
- Vintage-design products
- Flowers and presents
World Map
To visualize the store's revenue streams from around the world, I will plot a world map showing the different countries where transactions have occurred, using a logarithmic revenue scale.
map_info <- retail %>%
group_by(Country) %>%
dplyr::summarise(revenue = sum(Spent))
highchart(type = "map") %>%
hc_add_series_map(worldgeojson,
map_info %>%
bind_cols(as_tibble(map_info$revenue)) %>%
group_by(map_info$Country) %>%
dplyr::summarise(revenue = log1p(sum(value))) %>%
ungroup() %>%
mutate(iso2 = countrycode(sourcevar = map_info$Country,
origin="country.name", destination="iso2c")),
value = "revenue", joinBy = "iso2") %>%
hc_title(text = "Revenue by country (log)") %>%
hc_tooltip(useHTML = TRUE, headerFormat = "",
pointFormat = "{point.map_info$Country}") %>%
hc_colorAxis(stops = color_stops(colors = viridisLite::inferno(10, begin = 0.1)))
Observations
We see from the world map that the online store has transactions and customers all around the globe, but its main markets are:
- UK
- Europe (mainly Germany and France)
- Australia
- Japan
Clustering:
In its most basic form, the clustering problem is that of locating homogeneous groupings of data points in a given data collection.
Any clustering algorithm aims to keep the distance between data points within a cluster as small as possible relative to the distance between clusters. In other words, members of one group are very similar, whereas members of separate groups are very distinct.
One of the most widely used and researched clustering methods is the K-means clustering algorithm. Thus, I am utilizing it for clustering the customers.
RFM: Recency, Frequency, Monetary Value: RFM analysis is a prominent client segmentation and identification tool in database marketing. It is crucial, especially in the retail industry. RFM assigns a score to each consumer based on three elements.
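To make the scoring idea concrete, here is a toy sketch of turning raw R, F and M values into 1-5 quintile-style scores with dplyr::ntile(). The toy_rfm frame and the concatenated rfm_score are purely illustrative (the values mirror the first rows of the RFM table built later in this notebook); the actual scoring used later relies on the rfm package.
# toy RFM table: one row per customer, values for illustration only
toy_rfm <- tibble(
  CustomerID = c(12347, 12348, 12349, 12350, 12352, 12353),
  recency    = c(2, 75, 18, 310, 36, 204),     # days since last purchase
  frequency  = c(7, 4, 1, 1, 8, 1),            # number of purchases
  monetary   = c(4310, 1797, 1758, 334, 2506, 89)
)
toy_rfm %>%
  mutate(r_score = ntile(-recency, 5),          # fewer days is better, hence the minus sign
         f_score = ntile(frequency, 5),
         m_score = ntile(monetary, 5),
         rfm_score = r_score * 100 + f_score * 10 + m_score)   # one common scoring convention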
K-means clustering: K-means clustering is an iterative clustering approach. The k-means algorithm uses the Euclidean distance metric to partition the customers based on their RFM values. The number of groups K is predefined, and each data point is repeatedly assigned to one of the K clusters based on feature similarity.
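As a minimal illustration of the assign/update loop, here is a sketch on a tiny synthetic data set (the objects x, centers and cluster exist only for this example; the actual clustering below uses the built-in kmeans() function):
set.seed(42)
# two well-separated 2-D blobs of 10 points each
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
centers <- x[c(1, 11), ]   # initial centroids: one point taken from each blob
for (i in 1:10) {
  # assignment step: each point joins the cluster of its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centers, x)))[-(1:2), 1:2]
  cluster <- max.col(-d)
  # update step: each centroid becomes the mean of the points assigned to it
  centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
}
table(cluster)   # cluster sizes after the loop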
Choosing the best number of clusters K: The number of clusters is the essential input for k-means clustering. It is commonly chosen by examining the within-cluster sum of squares (WCSS): the number of clusters is plotted on the x-axis, and the WCSS for each cluster number is plotted on the y-axis.
To determine the optimal number of clusters we can use the scree plot / Elbow approach. The WCSS decreases as the number of clusters increases; the decline is sharp at first, but the rate of decrease eventually slows, producing an elbow shape in the plot. The number of clusters at the elbow usually indicates the ideal number of clusters.
We will create customer categories
Before we dive into clustering our customers, we have to feature engineer our RFM values to have a solid dataframe for the clustering.
Recency: the number of days between the reference date and a customer's last purchase. The lower the recency, the more recently the customer has visited the store.
recency <- retail %>%
dplyr::select(CustomerID, InvoiceDate) %>%
mutate(recency = as.Date("2011-12-09") - as.Date(InvoiceDate))
recency <- recency %>%
dplyr::select(CustomerID, recency) %>%
group_by(CustomerID) %>%
slice(which.min(recency))
head(recency,3)
## # A tibble: 3 x 2
## # Groups: CustomerID [3]
## CustomerID recency
## <dbl> <drtn>
## 1 12347 2 days
## 2 12348 75 days
## 3 12349 18 days
Observations
We have constructed our recency feature: the number of days since the last purchase for each individual unique customer of the store.
Frequency: the number of purchases a customer has made over the period under study (here, the number of distinct invoice dates per customer). The higher the frequency, the more often the customer buys from the store.
amount_products <- retail %>%
dplyr::select(CustomerID, InvoiceDate) %>%
group_by(CustomerID, InvoiceDate) %>%
summarize(n_prod = n())
df_frequency <- amount_products %>%
dplyr::select(CustomerID) %>%
group_by(CustomerID) %>%
summarize(frequency = n())
head(df_frequency,3)
## # A tibble: 3 x 2
## CustomerID frequency
## <dbl> <int>
## 1 12347 7
## 2 12348 4
## 3 12349 1
Monetary: the amount of money spent by a customer during a specific period of time. The higher the value, the more revenue the customer generates for the company.
monetary <- select(customer, c("CustomerID", "Spent"))
head(monetary,3)
## # A tibble: 3 x 2
## # Groups: CustomerID [3]
## CustomerID Spent
## <dbl> <dbl>
## 1 12347 4310
## 2 12348 1797.
## 3 12349 1758.
# inner join the three RFM data frames by CustomerID
rfm <- recency %>%
dplyr::inner_join(., df_frequency, by = "CustomerID") %>%
dplyr::inner_join(., monetary, by = "CustomerID")
# drop the days from recency column and transform it into numeric data type
rfm <- rfm %>%
mutate(recency = str_replace(recency, " days", "")) %>%
mutate(recency = as.numeric(recency)) %>%
ungroup()
head(rfm, 3)
## # A tibble: 3 x 4
## CustomerID recency frequency Spent
## <dbl> <dbl> <int> <dbl>
## 1 12347 2 7 4310
## 2 12348 75 4 1797.
## 3 12349 18 1 1758.
# Deleting the CustomerID column to have only our 3 RFM features for our modelling data frame
rfm_clean <- select(rfm, -CustomerID)
Plotting RFM values
h1 <- hchart(
rfm_clean$recency,
color = "#B71C1C", name = "Recency")
h2 <- hchart(
rfm_clean$frequency,
color = "#668F80", name = "Frequency")
h3 <- hchart(
rfm_clean$Spent,
color = "#4A6670", name = "Monetary Value")
htmltools::browsable(hw_grid(h1, h2, h3, ncol = 3, rowheight = 500))
Observations
Scaling the data
Why we need to normalize the data
Clustering calculates the similarity between two samples by combining all of their feature values into a single numeric distance. When features are combined this way, they must be on the same scale; scaling the data brings the different features onto a common scale.
For the k-means approach it is therefore best practice to scale the data.
# scaling
rfm_norm <- scale(rfm_clean)
summary(rfm_norm)
## recency frequency Spent
## Min. :-0.9207 Min. :-0.42653 Min. :-0.23309
## 1st Qu.:-0.7505 1st Qu.:-0.42653 1st Qu.:-0.19745
## Median :-0.4202 Median :-0.29549 Median :-0.15480
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.4908 3rd Qu.: 0.09762 3rd Qu.:-0.03904
## Max. : 2.8131 Max. :26.96062 Max. :32.56198
Before we do the actual clustering, we need to identify the optimal number of clusters (k) for this customer data set. Popular ways of determining the number of clusters are:
The Elbow and Silhouette methods, which are direct methods, and the gap statistic, which is a statistical testing method.
fviz_nbclust(rfm_norm, kmeans, method="wss")+geom_vline(xintercept=3,linetype=2) + labs(subtitle = "Elbow method")
Based on the Elbow method chart we could use 3, 4, 5 or even 6 clusters, judging by where the slope flattens out towards zero. The choice, however, is somewhat subjective.
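For reference, the quantity plotted by fviz_nbclust(..., method = "wss") is the total within-cluster sum of squares returned by kmeans(); a sketch reproducing the elbow curve by hand, assuming the rfm_norm matrix from the chunk above (wcss is just an illustrative name):
set.seed(123)
# total within-cluster sum of squares for k = 1 to 10
wcss <- sapply(1:10, function(k) kmeans(rfm_norm, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wcss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares (WCSS)")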
Let's also test the optimal number of clusters with the Silhouette method:
The silhouette coefficient measures how similar each point is to its own cluster compared with the other clusters. The silhouette plot shows the average silhouette coefficient over values of k ranging from 1 to 10.
fviz_nbclust(rfm_norm,kmeans, method = "silhouette")+labs(subtitle = "Silhouette method")
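The average silhouette width that this method maximizes can also be computed directly for a single candidate k; a sketch, again assuming rfm_norm from above:
library(cluster)
set.seed(123)
km3 <- kmeans(rfm_norm, centers = 3, nstart = 10)
sil <- silhouette(km3$cluster, dist(rfm_norm))
mean(sil[, "sil_width"])   # average silhouette width for k = 3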
Observations
I will test the optimal number of clusters with the gap statistic as well:
The gap statistic compares the intra-cluster variation for different values of k with the expected intra-cluster variation under a null reference distribution. Ideally we would choose the value of k that maximizes the gap statistic; however, in real-world datasets where clusters are not so well defined, it can be more parsimonious to choose the k at which the rate of increase begins to slow down (i.e. the lowest k whose gap statistic is greater than or equal to the gap at k + 1 minus its standard error).
Let's see what we get.
library(cluster)
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(rfm_norm, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
fviz_gap_stat(gap_stat) + labs(subtitle = "Gap statistic method")
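The one-standard-error decision rule described above can also be applied programmatically with cluster::maxSE; a sketch, assuming the gap_stat object computed in the previous chunk:
# k chosen by the Tibshirani et al. rule: smallest k with gap(k) >= gap(k + 1) - SE(k + 1)
maxSE(f = gap_stat$Tab[, "gap"],
      SE.f = gap_stat$Tab[, "SE.sim"],
      method = "Tibs2001SEmax")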
Observations
As we can see from the graph, the rate of increase begins to slow down at k = 3; as such, the gap statistic suggests that the optimal number of clusters is k = 3.
One of the most discussed disadvantages of k-means clustering is the subjective nature of choosing the optimal number of clusters k.
To avoid the subjective guesswork that follows the visual observation of cluster results, I will use the NbClust R package. While there are several programs and packages that do validation for a particular clustering solution, different validation metrics yield different results for different clustering algorithms.
The NbClust package can help us with determining the best number of clusters for our data.
# NbClust package for determining the best number of clusters
library(NbClust)
NbClust(rfm_norm, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
The Hubert index is a graphical method of determining the number of clusters. In the Hubert index plot, we look for a significant knee that corresponds to a significant increase in the value of the measure, i.e. a significant peak in the plot of the second differences of the Hubert index.
Looking at the Hubert index results and the votes among all the indices computed by NbClust:
Consequently, according to the majority rule, the ideal number of clusters is three (k = 3). However, five indices proposed 10 as the best number of clusters, the next most popular choice. While 10 clusters would give more distinct, demonstrable groups, the categorization might become too specific, and with that many clusters there is a danger of overfitting the data.
After running the tests above, we can conclude that the most appropriate number of clusters for our data set is k = 3, so we will compute k-means with k = 3 to segment the customers into groups.
# Compute k-means with k = 3
set.seed(123)
km.res <- kmeans(rfm_norm, 3, nstart = 10) # 3 centers, and nstart set to 10 random sets
km.res$centers # centroids from model on normalized data
## recency frequency Spent
## 1 1.5389291 -0.35067434 -0.16793154
## 2 -0.8638170 8.35832179 9.43429552
## 3 -0.5126431 0.05364613 -0.01635047
Observations
Let's also check the number of customers in each cluster.
km.res$size
## [1] 1090 25 3230
km.res$betweenss/km.res$totss
## [1] 0.5828409
Observations
Visualization of the clustering algorithm results
set.seed(123)
fviz_cluster(km.res, geom = "point", data = rfm_norm) + ggtitle("k = 3")
Observations
We have 3 clearly distinguishable clusters, with cluster 2 containing the most valuable customers.
frm_k3 <- rfm %>%
mutate(cluster = km.res$cluster)
frm_k3
## # A tibble: 4,345 x 5
## CustomerID recency frequency Spent cluster
## <dbl> <dbl> <int> <dbl> <int>
## 1 12347 2 7 4310 3
## 2 12348 75 4 1797. 3
## 3 12349 18 1 1758. 3
## 4 12350 310 1 334. 1
## 5 12352 36 8 2506. 3
## 6 12353 204 1 89 1
## 7 12354 232 1 1079. 1
## 8 12355 214 1 459. 1
## 9 12356 22 3 2811. 3
## 10 12357 33 1 6208. 3
## # ... with 4,335 more rows
Observations
# Compute k-means with k = 10
set.seed(123)
km.res_10 <- kmeans(rfm_norm, 10, nstart = 10) # 10 centers, and nstart set to 10 random starts
km.res_10$centers # centroids from model on normalized data
## recency frequency Spent
## 1 -0.9156694 26.3709426 10.1565245
## 2 1.5566561 -0.3731575 -0.1653148
## 3 -0.8546077 6.7544044 6.5144269
## 4 -0.7103538 -0.1627713 -0.1096455
## 5 -0.8160590 2.6978481 1.2252349
## 6 2.4395250 -0.3942279 -0.1901048
## 7 -0.7278237 0.7546869 0.2267256
## 8 -0.1731544 -0.2640112 -0.1325608
## 9 -0.8906441 7.0426902 28.4186004
## 10 0.6849949 -0.2770338 -0.1449719
Let's also check the number of customers in each cluster.
km.res_10$size
## [1] 2 437 20 1479 102 288 561 949 3 504
km.res_10$betweenss/km.res_10$totss
## [1] 0.89463
As expected, with more clusters the between-cluster to total sum-of-squares ratio increases, here to 89.4%.
Visualization of the clustering algorithm results
set.seed(123)
fviz_cluster(km.res_10, geom = "point", data = rfm_norm) + ggtitle("k = 10")
frm_k10 <- rfm %>%
mutate(cluster = km.res_10$cluster)
head(frm_k10)
## # A tibble: 6 x 5
## CustomerID recency frequency Spent cluster
## <dbl> <dbl> <int> <dbl> <int>
## 1 12347 2 7 4310 7
## 2 12348 75 4 1797. 8
## 3 12349 18 1 1758. 4
## 4 12350 310 1 334. 6
## 5 12352 36 8 2506. 7
## 6 12353 204 1 89 2
frm_k10 %>%
group_by(cluster) %>%
summarize(number = n(),
frequency = round(mean(frequency)),
recency = round(mean(recency)),
monetary = round(mean(Spent)))
## # A tibble: 10 x 5
## cluster number frequency recency monetary
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 2 206 0 88772
## 2 2 437 1 247 582
## 3 3 20 56 7 57654
## 4 4 1479 3 21 1058
## 5 5 102 25 10 12463
## 6 6 288 1 336 370
## 7 7 561 10 19 3932
## 8 8 949 2 75 862
## 9 9 3 58 3 244805
## 10 10 504 2 160 756
With 10 clusters we have to be careful, as there is a chance that we are overfitting the data. As such, k = 3 clusters is preferred.
To find our most valuable customers based on RFM analysis, I will also use the rfm package, which computes automated RFM scores for our customers from the input data.
According to the Pareto principle, 20% of the customers (the vital few) contribute more to the revenue of the company than the rest. It argues that roughly 80 percent of effects can be traced back to as few as 20% of all causes; these 20% of causes are the vital few, while the remaining 20% of effects is naturally dispersed across the remaining 80% of causes, the trivial many. These 20% are the high-value, important customers that the company would like to keep.
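As a quick sanity check of the Pareto idea on this data, here is a sketch that computes the share of total revenue contributed by the top 20% of customers, using the per-customer customer frame built earlier (the exact figure will rarely be a perfect 80/20 split):
customer %>%
  ungroup() %>%
  arrange(desc(Spent)) %>%
  mutate(rank = row_number(),
         cum_revenue_share = cumsum(Spent) / sum(Spent)) %>%
  filter(rank == floor(0.2 * n()))   # cumulative revenue share of the top 20% of customers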
recency_s <- retail %>%
dplyr::select(CustomerID, date) %>%
group_by(CustomerID) %>%
slice(which.max(date))
rfm_test <- customer %>% inner_join(recency_s, by = "CustomerID")
rfm_test <- rfm_test %>% left_join(rfm, by = "CustomerID")
library(rfm)
rfm_result <- rfm_table_customer(rfm_test, CustomerID, Quantity, recency, Spent.x, as.Date("2011-12-09"))
rfm_datatable <- rfm_result$rfm
Creation of interactive customer data table with RFM scores, to observe values
# creation of an interactive customer data table with RFM scores, to observe their values
rfm_datatable_print <- rfm_datatable %>%
select(-recency_days,-transaction_count, - amount)
datatable(rfm_datatable_print, class = 'cell-border stripe',filter = 'top', options = list(
pageLength = 10, autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
caption = htmltools::tags$caption(
style = 'caption-side: bottom; text-align: center;',
'Table 2: ', htmltools::em('Customer Values - RFM scores '))
)
Creation of interactive customer data table with RFM scores for approximately the top 20% of customers, who are the most vital for the business
# get the top 20% of rfm_score of Customers, to find the most valuable customers
n <- 20
top_20 <- rfm_datatable %>%
subset(rfm_score > quantile(rfm_score, prob = 1 - n/100))
top_20 <- top_20 %>%
select(customer_id, recency_days, transaction_count, amount, rfm_score)
datatable(top_20, class = 'cell-border stripe',filter = 'top', options = list(
pageLength = 10, autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
caption = htmltools::tags$caption(
style = 'caption-side: bottom; text-align: center;',
'Table 3: ', htmltools::em('Most valuable customers (~20% of total) '))
)
Observations
- Also, the data contains transactions from just over one year of business, which is not much to distinguish the customer base accurately.
Now I will create customer segments based on the image below
Segments
We will segment our customers based on this proposed scheme from rsquareacademy.com; however, there seem to be some overlaps between the conditions, so I will adjust them.
rfm_segments <- rfm_datatable
rfm_segments <- rfm_segments %>%
mutate(segment = ifelse(recency_score >= 4 & frequency_score >= 4 & monetary_score >= 4, "Champion",
ifelse(recency_score >= 2 & frequency_score >= 3 & monetary_score >= 3, "Loyal Customer",
ifelse(recency_score >= 3 & frequency_score <= 3 & monetary_score <= 3, "Potential Loyalist",
ifelse(recency_score >= 4 & frequency_score <= 1 & monetary_score <= 1, "New Customer",
ifelse((recency_score == 3 | recency_score == 4) & frequency_score <= 1 & monetary_score <= 1, "Promising",
ifelse((recency_score == 2 | recency_score == 3) & (frequency_score == 2 | frequency_score == 3) &
(monetary_score == 2 | monetary_score == 3), "Need Attention",
ifelse((recency_score == 2 | recency_score == 3) & frequency_score <= 2 & monetary_score <= 2, "About to Sleep",
ifelse(recency_score <= 2 & frequency_score > 2 & monetary_score > 2, "At Risk",
ifelse(recency_score <= 1 & frequency_score >= 4 & monetary_score >= 4, "Can't lose them",
ifelse(recency_score <= 2 & frequency_score == 2 & monetary_score == 2, "Hibernating", "Lost")))))))))))
Now we have our customer categories/segments ready, so we can differentiate the customers and act accordingly.
Let's first check the number of customers we have in each segment.
# checking segments size
rfm_segments %>%
group_by(segment) %>%
count(segment) %>%
arrange(desc(n)) %>%
rename(Segment = segment, Count = n)
## # A tibble: 8 x 2
## # Groups: Segment [8]
## Segment Count
## <chr> <int>
## 1 Loyal Customer 1242
## 2 Champion 981
## 3 Potential Loyalist 813
## 4 Lost 623
## 5 About to Sleep 204
## 6 Need Attention 198
## 7 At Risk 158
## 8 Hibernating 142
rfm_plot_median_monetary(rfm_segments)
Observations
Let's build a table that assigns each of the store's customers to a segment, so the business can treat them accordingly and focus on the vital customers that will drive its prosperity.
plot_segments <- rfm_segments %>%
select(customer_id, segment)
datatable(plot_segments, class = 'cell-border stripe',filter = 'top', options = list(
pageLength = 10, autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = "_all")), searchHighlight = TRUE),
caption = htmltools::tags$caption(
style = 'caption-side: bottom; text-align: center;',
'Table 4: ', htmltools::em('Customers by segment'))
)
In this analysis I explored the online retail data to identify important factors for the business and to segment the customer base into insightful groups that can be used for better customer relationship management.
More specifically, I identified 3 important customer clusters through the k-means approach, with one cluster containing the 25 most valuable (MVP) customers of the business.
Subsequently, I implemented an RFM analysis and segmented the customers into more specific groups to drive actions. The interactive datatable above can serve as a guide for CRM practices.
Other key differentiators
For a more in-depth division of customers into appropriate groups, other data that could be used include:
Demographics (age, race, religion, gender, family size, ethnicity, income, education level)
Psychographic (social class, lifestyle and personality characteristics)
Geography (more specific locations)
K-means clustering algorithm & Disadvantages:
K-means clustering is a straightforward and quick technique, and it is capable of handling massive data sets with ease. The k-means strategy does, however, have some flaws: it requires us to define the number of clusters ahead of time, it is sensitive to outliers, and different results can be obtained if the data is reordered.
The same stands for RFM analysis. In the end, the way a company chooses to segment its customer base depends on its philosophy and culture. The type of product or service it provides also plays a major role in the approach it uses, the categorization of its customers, and where it focuses its efforts for better customer relationships and, ultimately, a profitable future.
To sum up, whatever the results, before making business decisions we should never forget that clustering is a model, and while in most cases it is able to generate value, it should always be treated as one.