library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(purrr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggpubr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(viridis)
## Loading required package: viridisLite
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:viridis':
##
## unemp
## The following object is masked from 'package:purrr':
##
## map
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:tidyr':
##
## fill
library(vegan)
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.1.2
## Loading required package: lattice
## This is vegan 2.5-7
##
## Attaching package: 'vegan'
## The following object is masked from 'package:VGAM':
##
## calibrate
library(forecast)
## Warning: package 'forecast' was built under R version 4.1.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:ggpubr':
##
## gghistogram
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Modelling utilities
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:vegan':
##
## tolerance
## The following object is masked from 'package:VGAM':
##
## predictors
## The following object is masked from 'package:purrr':
##
## lift
library(textclean)
options(warn = -1) # remove warning messages
commerce_bc <- read.csv('data/online_retail.csv', encoding = "UTF-8")
In this analysis we will be working with the E-commerce dataset from the UCI Machine Learning Repository, which contains around 540,000 transactions for a UK-based online retailer. The dataset contains actual sales for the 2010/2011 period, recorded by country and transaction day.
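As a quick first look at its structure (a minimal check; str(commerce_bc) would work just as well):
# Row/column count and column types
dim(commerce_bc)
glimpse(commerce_bc)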
EDA - Let's start the analysis by taking a look at the missing values.
# EDA ---------------------------------------------------------------------
# Missing values
pMiss <- function(x){sum(is.na(x))/length(x)*100} #function to calculate missing values
apply(commerce_bc, 2, pMiss) # almost 25% of missing values in CustomerID column
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## CustomerID Country
## 24.92669 0.00000
Around 25% of the values in the CustomerID column are missing. These correspond to entries for fees, discounts or returns, which were recorded differently from regular sales.
We are going to remove those entries, as they do not provide actual information about the products.
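Before dropping them, a quick illustrative check of what those rows actually contain:
# Most common descriptions among rows without a CustomerID
commerce_bc %>%
  filter(is.na(CustomerID)) %>%
  count(Description, sort = TRUE) %>%
  head(10)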
Next up, we need to do some feature engineering. We decided to make the following changes:
-> Remove Amazon-related fees
-> Remove rows with missing customer information
-> Convert categorical columns into factor columns
-> Convert the date column into a POSIXct object
-> Create grouping date columns such as year, month, quarter and part of the week
# Cleanup -----------------------------------------------------------------
# Remove fees and adjustments
commerce <- commerce_bc %>%
filter(!grepl('amazon', Description, ignore.case = T) & !grepl('adjust', Description, ignore.case = T)) %>%
filter(!is.na(CustomerID)) %>% #remove entries without customer encoding
filter(!grepl('C', StockCode)) %>% #drop special stock codes containing 'C' (note: this does not remove cancelled orders, whose InvoiceNo starts with 'C')
distinct() #remove duplicates (around 150,000 rows removed from the original dataset)
# Retype columns
commerce$Country <- factor(commerce$Country)
commerce$InvoiceDate <- as.POSIXct(commerce$InvoiceDate, format = "%m/%d/%y %H:%M") # two-digit year in the raw data
# Change factor labels
commerce$StockCode <- fct_recode(commerce$StockCode, "Postage" = "POST", "Discount" = "D",
"Carriage" = "C2", "Manual" = "M", "Bank Charges" = "BANK CHARGES",
"Doctom Postage" = "DOT")
# Time related variables
commerce$date <- as.Date(commerce$InvoiceDate)
commerce$year <- format(commerce$date, '%Y')
commerce$month <- factor(format(commerce$date, '%b'), ordered = T,
levels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
'Oct', 'Nov', 'Dec'))
commerce$quarter <- case_when(
commerce$month %in% c('Jan', 'Feb', 'Mar') ~ 'Q1',
commerce$month %in% c('Apr', 'May', 'Jun') ~ 'Q2',
commerce$month %in% c('Jul', 'Aug', 'Sep') ~ 'Q3',
commerce$month %in% c('Oct', 'Nov', 'Dec') ~ 'Q4',
TRUE ~ 'Unknown'
)
commerce$quarter <- factor(commerce$quarter, ordered = T, levels =c('Q1', 'Q2', 'Q3', 'Q4'))
commerce$weeknd <- ifelse(weekdays(commerce$date) %in% c('Saturday', 'Sunday'), 'weekend', 'weekday')
Another important topic is outlier analysis, especially before clustering. A good way to look into this is with boxplots.
# EDA plot settings
options(repr.plot.width = 10, repr.plot.height = 9)
box_theme <- theme(axis.text = element_text(size = 14),
axis.title.y = element_text(size = 12),
axis.title.x = element_blank())
commerce %>%
ggplot() +
aes(x = 1, y = UnitPrice) +
geom_boxplot() +
geom_text(data = . %>% filter(UnitPrice > 750), aes(label = CustomerID), hjust = 1.2) +
theme_minimal() +
box_theme
There are some entries with extremely high Unit Prices, but are they actual revenues?
commerce %>%
filter(CustomerID %in% c('15098', '16029')) %>%
arrange(desc(UnitPrice)) %>%
head(15)
## InvoiceNo StockCode Description Quantity
## 1 C556445 Manual Manual -1
## 2 C551685 Postage POSTAGE -1
## 3 551697 Postage POSTAGE 1
## 4 C551699 Manual Manual -1
## 5 556444 22502 PICNIC BASKET WICKER 60 PIECES 60
## 6 556446 22502 PICNIC BASKET WICKER 60 PIECES 1
## 7 C566614 Discount Discount -1
## 8 C571634 Manual Manual -1
## 9 C561225 Discount Discount -1
## 10 C566613 Discount Discount -1
## 11 C557927 Discount Discount -1
## 12 C563592 Discount Discount -1
## 13 547726 72760B VINTAGE CREAM 3 BASKET CAKE STAND 16
## 14 553469 72760B VINTAGE CREAM 3 BASKET CAKE STAND 24
## 15 558362 22625 RED KITCHEN SCALES 24
## InvoiceDate UnitPrice CustomerID Country date year
## 1 2011-06-10 15:31:00 38970.00 15098 United Kingdom 2011-06-10 2011
## 2 2011-05-03 12:51:00 8142.75 16029 United Kingdom 2011-05-03 2011
## 3 2011-05-03 13:46:00 8142.75 16029 United Kingdom 2011-05-03 2011
## 4 2011-05-03 14:12:00 6930.00 16029 United Kingdom 2011-05-03 2011
## 5 2011-06-10 15:28:00 649.50 15098 United Kingdom 2011-06-10 2011
## 6 2011-06-10 15:33:00 649.50 15098 United Kingdom 2011-06-10 2011
## 7 2011-09-13 17:28:00 102.24 16029 United Kingdom 2011-09-13 2011
## 8 2011-10-18 11:34:00 79.20 16029 United Kingdom 2011-10-18 2011
## 9 2011-07-26 10:06:00 45.60 16029 United Kingdom 2011-07-26 2011
## 10 2011-09-13 17:17:00 15.00 16029 United Kingdom 2011-09-13 2011
## 11 2011-06-23 14:56:00 10.00 16029 United Kingdom 2011-06-23 2011
## 12 2011-08-18 06:12:00 9.60 16029 United Kingdom 2011-08-18 2011
## 13 2011-03-25 10:49:00 8.49 16029 United Kingdom 2011-03-25 2011
## 14 2011-05-17 11:15:00 8.49 16029 United Kingdom 2011-05-17 2011
## 15 2011-06-28 15:14:00 7.65 16029 United Kingdom 2011-06-28 2011
## month quarter weeknd
## 1 Jun Q2 weekday
## 2 May Q2 weekday
## 3 May Q2 weekday
## 4 May Q2 weekday
## 5 Jun Q2 weekday
## 6 Jun Q2 weekday
## 7 Sep Q3 weekday
## 8 Oct Q4 weekday
## 9 Jul Q3 weekday
## 10 Sep Q3 weekday
## 11 Jun Q2 weekday
## 12 Aug Q3 weekday
## 13 Mar Q1 weekday
## 14 May Q2 weekday
## 15 Jun Q2 weekday
As we can see in the data frame, those extreme values are not actual products but manually introduced values, discounts and bank fees, so it's better to get rid of these rows. To do so, we will use the z-score to remove all values that lie more than 3 standard deviations from the mean, using the formula $z = \frac{x_i - \mu}{\sigma}$.
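The same rule can be wrapped in a small reusable helper (a sketch; z_keep is a name of our own, and the chunks below simply inline the same logic):
# Keep only rows whose z-score for a given column is within `threshold` SDs
z_keep <- function(df, col, threshold = 3) {
  z <- abs(df[[col]] - mean(df[[col]], na.rm = TRUE)) / sd(df[[col]], na.rm = TRUE)
  df[!is.na(z) & z <= threshold, ]
}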
commerce %<>% filter(!StockCode %in% c('Postage', 'Discount', 'Carriage', 'Manual', 'Bank Charges', 'Doctom Postage'))
up_median <- mean(commerce$UnitPrice, na.rm = T)
up_sd <- sd(commerce$UnitPrice, na.rm = T)
commerce %>%
filter(!(abs(UnitPrice - up_median) / up_sd) > 3) %>%
ggplot() +
aes(x = 1, y = UnitPrice) +
geom_boxplot() +
geom_text(data = . %>% filter(UnitPrice > 5), aes(label = CustomerID), hjust = 1.2) +
theme_minimal() +
box_theme
Now the number of outliers has been significantly reduced, so our clustering algorithm should work better. But first we need to group the data by customer and day, so we can cluster on the total amount purchased in each visit.
# Total amount per customer and day
comm_clean <- commerce %>%
group_by(CustomerID, Country, date, year, month, quarter, weeknd) %>%
summarise(InvoiceAmount = sum(Quantity * UnitPrice)) %>%
filter(InvoiceAmount > 0) # drop returns
## `summarise()` has grouped output by 'CustomerID', 'Country', 'date', 'year', 'month', 'quarter'. You can override using the `.groups` argument.
# Outlier analysis
comm_clean %>%
ggplot() +
aes(x = 1, y = InvoiceAmount) +
geom_boxplot() +
theme_minimal() +
box_theme
The effect of the unit price outliers is obvious when aggregating the data. Let's see how it looks after removing the outliers with the z-score formula.
# Z-score outlier removal
invoice_mean <- mean(comm_clean$InvoiceAmount, na.rm = T)
invoice_sd <- sd(comm_clean$InvoiceAmount, na.rm = T)
comm_clean %>%
filter(abs(InvoiceAmount - invoice_mean) / invoice_sd <= 3) %>%
ggplot() +
aes(x = 1, y = InvoiceAmount) +
geom_boxplot() +
theme_minimal() +
box_theme
# Outlier extraction (z-score formula)
x_mean <- mean(comm_clean$InvoiceAmount, na.rm = T)
x_sd <- sd(comm_clean$InvoiceAmount, na.rm = T)
comm_clean %<>%
filter(abs(InvoiceAmount - x_mean) / x_sd <= 3)
y_mean <- mean(commerce$UnitPrice, na.rm = T)
y_sd <- sd(commerce$UnitPrice, na.rm = T)
commerce %<>%
filter(abs(UnitPrice - y_mean) / y_sd <= 3)
Visual Exploration - Now we turn our attention to visualizing the data. First things first, let's look at the Invoice Amount as a kernel density plot.
options(repr.plot.width = 14, repr.plot.height = 8) #resize
xmean = round(mean(comm_clean$InvoiceAmount, na.rm = T), 2)
xmedian = round(median(comm_clean$InvoiceAmount, na.rm = T), 2)
o1 <- comm_clean %>%
ggplot() +
aes(x = InvoiceAmount) +
geom_density(alpha = 0.5, fill = 'steelblue') +
geom_vline(aes(xintercept = xmean), lty = 2, col = 'red', size = 1.2) +
geom_vline(aes(xintercept = xmedian), lty = 2, col = 'blue', size = 1.2) +
theme_bw() +
theme(strip.text = element_text(size = 14),
legend.position = 'top',
legend.text = element_text(size = 14),
legend.title = element_blank(),
plot.title = element_text(size = 24, face = 'bold.italic'),
plot.subtitle = element_text(size = 14),
axis.title = element_blank(),
axis.text = element_text(size = 12)) +
labs(title = 'Invoice Amount Kernel Density',
subtitle = paste('Mean (red line): ',xmean, '; Median (blue line): ', xmedian)) +
xlim(0, 2500)
o2 <- comm_clean %>%
ggplot() +
aes(x = InvoiceAmount) +
geom_boxplot() +
theme_bw() +
theme(strip.text = element_text(size = 14),
legend.position = 'top',
legend.text = element_text(size = 14),
legend.title = element_blank(),
#plot.title = element_text(size = 20, face = 'bold'),
#plot.subtitle = element_text(size = 12, face = 'italic'),
axis.title = element_blank(),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 14)) +
xlim(0, 2500)
grid.arrange(o1, o2, layout_matrix = rbind(c(1, 1, 1),
c(1, 1, 1),
c(1, 1, 1),
c(2, 2, 2)))
The distribution of this variable shows a heavy right tail, which suggests that we will need to standardize the data before clustering it.
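One quick way to confirm the skew (a sketch) is to redraw the density on a log10 scale, where a heavy right tail flattens out:
comm_clean %>%
  ggplot(aes(x = InvoiceAmount)) +
  geom_density(fill = 'steelblue', alpha = 0.5) +
  scale_x_log10() + # multiplicative spread becomes additive on a log axis
  theme_minimal()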
Another grouping variable is country. Let's visualize it.
# by country
table(comm_clean$Country) # Countries with too few entries
##
## Australia Austria Bahrain
## 38 17 2
## Belgium Brazil Canada
## 95 1 5
## Channel Islands Cyprus Czech Republic
## 20 13 2
## Denmark EIRE European Community
## 18 165 3
## Finland France Germany
## 37 354 406
## Greece Iceland Israel
## 5 7 3
## Italy Japan Lebanon
## 34 16 1
## Lithuania Malta Netherlands
## 2 5 44
## Norway Poland Portugal
## 27 18 48
## RSA Saudi Arabia Singapore
## 1 1 4
## Spain Sweden Switzerland
## 85 28 41
## United Arab Emirates United Kingdom Unspecified
## 2 14827 7
## USA
## 5
filter_few <- table(comm_clean$Country) > 30
countries <- rownames(filter_few)
countries <- countries[filter_few]
comm_clean %>%
filter(Country %in% countries) %>%
group_by(Country) %>%
mutate(xmean = mean(InvoiceAmount, na.rm = T),
xmedian = median(InvoiceAmount, na.rm = T)) %>%
ggplot() +
aes(x = InvoiceAmount, fill = Country) +
geom_density(alpha = 0.5) +
geom_vline(aes(xintercept = xmean), lty = 2, col = 'red', size = 1.2) +
geom_vline(aes(xintercept = xmedian), lty = 2, col = 'blue', size = 1.2) +
geom_text(aes(x = xmedian + 150, y = 0.0015, label = round(xmedian, 2)), col = 'blue', size= 4.5) +
geom_text(aes(x = xmean + 150, y = 0.001, label = round(xmean, 2)), col = 'red', size= 4.5) +
theme_minimal() +
theme(strip.text = element_text(size = 14),
legend.position = 'none',
plot.title = element_text(size = 20, face = 'bold'),
plot.subtitle = element_text(size = 12, face = 'italic'),
axis.title = element_blank(),
axis.text = element_text(size = 10, family = 'sans')) +
xlim(0, 2500) +
labs(title = 'Results by Country (2011)') +
facet_wrap(~Country, nrow = 3)
world_map = map_data("world")
to_world <- comm_clean %>%
group_by(Country) %>%
summarise(InvoiceAmount = log(sum(InvoiceAmount, na.rm = T))) %>%
mutate(Country = fct_recode(Country, "UK" = "United Kingdom", 'Ireland' = 'EIRE')) %>%
right_join(world_map, by = c("Country" = "region"))
# by country worldmap
options(repr.plot.width = 20, repr.plot.height = 11) #resize for worldmap
to_world %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill=InvoiceAmount), colour = "grey50") +
theme_void() +
scale_fill_viridis(option = 'C', na.value = 'grey90')
Most of the sales are concentrated in the European Union, especially in the UK. As for the mean invoice amount, there are clear differences between countries: in countries such as Switzerland or Ireland it is over two times higher than in countries such as Belgium or Italy.
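This comparison can be backed with a quick table (illustrative):
# Mean invoice amount by country, highest first
comm_clean %>%
  group_by(Country) %>%
  summarise(mean_amount = mean(InvoiceAmount, na.rm = TRUE)) %>%
  arrange(desc(mean_amount)) %>%
  head(10)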
Another way to visualize the same variable is as a time series.
options(repr.plot.width = 21, repr.plot.height = 6)
# Plotting the time series
comm2 <- comm_clean %>%
ungroup() %>% # drop the grouping left over from summarise()
dplyr::select(date, InvoiceAmount) %>%
group_by(date) %>%
summarise(InvoiceAmount = sum(InvoiceAmount, na.rm = T))
comm_ts <- xts(comm2$InvoiceAmount, order.by = as.POSIXct(comm2$date))
autoplot(comm_ts, ts.colour = 'green') +
geom_line(aes(y = ma(comm_ts, 10)), size = 1.5, col = 'orange') + # order-10 moving average
ggtitle("Daily Revenue with 10-Day Moving Average") +
labs(col = '') +
theme_dark() +
theme(axis.text = element_text(size = 14),
plot.title = element_text(size = 22, face = 'italic'),
legend.text = element_text(size = 14),
legend.position = 'top',
axis.title.y = element_blank())
As the orange line depicting the 10-day moving average shows, there was a sales increase towards the end of the year.
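A simple aggregation by month (a sketch) makes that ramp-up explicit:
# Total invoice amount per month
comm_clean %>%
  group_by(month) %>%
  summarise(total = sum(InvoiceAmount, na.rm = TRUE))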
Finally, we can tokenize the product descriptions and take a look at the most frequently used words.
# Wordcloud ---------------------------------------------------------------
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(tidytext)
library(ggwordcloud)
# Remove: stopwords in ENG
eng_stop = stopwords("eng")
# tokenize Commerce db sentences
comm_txt <- commerce %>%
mutate(Description = as.character(Description)) %>%
unnest_tokens(input = Description, text, token = 'words') %>%
mutate(text = tolower(text)) %>%
group_by(text) %>%
filter(!text %in% eng_stop) %>% # remove stopwords
filter(!grepl("[0-9]", text)) %>% # remove rows with some numbers
filter(!is.na(CustomerID)) %>% # Remove unknown customers
filter(text != 'abc') %>%
filter(nchar(text) > 1) %>% # remove characters
summarise(freq = n()) %>%
arrange(desc(freq))
set.seed(5)
options(repr.plot.width = 21, repr.plot.height = 6)
comm_txt %>%
filter(freq > 750) %>%
ggplot() +
aes(label = text, size = freq, col = freq) +
geom_text_wordcloud() +
scale_size_area(max_size = 22) +
#scale_color_gradient(low = '#66e6ff', high = '#003ca1') +
scale_color_viridis(option = 'D', direction = -1) +
theme_minimal()
The most frequently used words relate to how the products are sold (bag, set, box, pack, ...) or to the product design (vintage, hanging, design, glass, ...).
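The same information can be read off in tabular form (illustrative):
head(comm_txt, 10) # ten most frequent tokens with their counts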
For clustering we are going to keep it simple, starting again from the raw data frame and doing some light feature engineering.
head(commerce_bc)
## InvoiceNo StockCode Description Quantity InvoiceDate
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/10 8:26
## 2 536365 71053 WHITE METAL LANTERN 6 12/1/10 8:26
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/10 8:26
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/10 8:26
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/10 8:26
## 6 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 12/1/10 8:26
## UnitPrice CustomerID Country
## 1 2.55 17850 United Kingdom
## 2 3.39 17850 United Kingdom
## 3 2.75 17850 United Kingdom
## 4 3.39 17850 United Kingdom
## 5 3.39 17850 United Kingdom
## 6 7.65 17850 United Kingdom
Data cleaning and preparation
There are a few data preparation steps to take before I can start segmenting customers:
-> Filtering out negative (cancelled) orders. If you look at the Quantity column, you will notice some negative values - cancelled or returned orders. I need to remove those.
-> Handling NA values in the CustomerID field. The customer ID is crucial for this analysis, so I will have to drop NA values there.
-> Changing the InvoiceDate column to DateTime format. To prevent any issues with dates, I like to make sure all the data is in the appropriate format.
-> Removing incomplete data. December 2011 is not a full month in this data set.
-> Calculating the sales column by multiplying the Quantity and UnitPrice columns.
-> Transforming the data so that each record represents the purchase history of an individual customer.
#filtering out canceled orders
commerce_bc <- commerce_bc %>%
filter(Quantity > 0)
#drop NA values
sum(is.na(commerce_bc$CustomerID))
## [1] 133361
commerce_bc <- na.omit(commerce_bc)
#changing to datetime format
commerce_bc$InvoiceDate=as.POSIXct(commerce_bc$InvoiceDate,
format="%m/%d/%y %H:%M", # two-digit year in the raw data
tz=Sys.timezone())
#removing incomplete data
commerce_bc <- commerce_bc %>%
filter(InvoiceDate < "2011-12-01")
#calculate the sales column
commerce_bc <- commerce_bc %>%
mutate(Sales=Quantity*UnitPrice)
#transform data: one row per customer
customer_data <- commerce_bc %>%
group_by(CustomerID) %>%
summarize(TotalSales=sum(Sales),
OrderCount=length(unique(InvoiceDate))) %>% # unique purchase timestamps as a proxy for order count
mutate(AvgOrderValue=TotalSales/OrderCount)
head(customer_data)
## # A tibble: 6 × 4
## CustomerID TotalSales OrderCount AvgOrderValue
## <int> <dbl> <int> <dbl>
## 1 12346 77184. 1 77184.
## 2 12347 4310 7 616.
## 3 12348 1797. 4 449.
## 4 12349 1758. 1 1758.
## 5 12350 334. 1 334.
## 6 12352 2506. 8 313.
Data normalization
Clustering algorithms (including k-means) are highly affected by the scale of the data, so I need to normalize the data to put all features on the same scale.
#rank the data, values from 1 to 4298 (total number of records); ranking removes the heavy skew before scaling
rank_data <- customer_data %>%
mutate(TotalSales=rank(TotalSales), OrderCount=rank(OrderCount,
ties.method = "first"), AvgOrderValue=rank(AvgOrderValue))
#normalizing data
normalized_data <- rank_data %>%
mutate(TotalSales=scale(TotalSales), OrderCount=scale(OrderCount),
AvgOrderValue=scale(AvgOrderValue))
#look at the summary of the normalized data
summary(normalized_data)
## CustomerID TotalSales.V1 OrderCount.V1 AvgOrderValue.V1
## Min. :12346 Min. :-1.7314521 Min. :-1.7314521 Min. :-1.7314521
## 1st Qu.:13812 1st Qu.:-0.8657261 1st Qu.:-0.8657261 1st Qu.:-0.8657261
## Median :15299 Median : 0.0000000 Median : 0.0000000 Median : 0.0000000
## Mean :15300 Mean : 0.0000000 Mean : 0.0000000 Mean : 0.0000000
## 3rd Qu.:16778 3rd Qu.: 0.8657261 3rd Qu.: 0.8657261 3rd Qu.: 0.8657261
## Max. :18287 Max. : 1.7314521 Max. : 1.7314521 Max. : 1.7314521
sapply(normalized_data, sd)
## CustomerID TotalSales OrderCount AvgOrderValue
## 1721.89 1.00 1.00 1.00
In the summary, we see the values are centered around 0 (mean = 0) and have a standard deviation of 1. This data is now ready for clustering analysis.
K-means clustering
K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (Wikipedia). It is an unsupervised learning method that learns the commonalities between groups of data without any target or labeled variable.
The k-means algorithm splits the records in the data into a pre-defined number of clusters, where the data points within each cluster are close to each other. One difficulty of using k-means for customer segmentation is that you need to know the number of clusters beforehand.
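Formally, given clusters $C_1, \dots, C_k$ with centroids $\mu_j$, k-means searches for the partition that minimizes the within-cluster sum of squares:
$$\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$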
Luckily, the silhouette coefficient can help you. The silhouette coefficient measures how close the data points are to their clusters compared to other clusters. The silhouette coefficient values range from -1 to 1, where the closer the values are to 1, the better they are.
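For a single point $i$, with $a(i)$ the mean distance to the other points in its own cluster and $b(i)$ the mean distance to the points in the nearest other cluster, the coefficient is:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$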
Let's find the best number of clusters:
library(cluster)
##
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
##
## votes.repub
for(n_cluster in 4:8) {
cluster <- kmeans(normalized_data[c("TotalSales", "OrderCount",
"AvgOrderValue")], n_cluster)
silhouetteScore <- mean(
silhouette(
cluster$cluster,
dist(normalized_data[c("TotalSales", "OrderCount",
"AvgOrderValue")],
method = "euclidean")
)[,3]
)
print(sprintf("Silhouette Score for %i Clusters: %0.4f", n_cluster,
silhouetteScore))
}
## [1] "Silhouette Score for 4 Clusters: 0.4122"
## [1] "Silhouette Score for 5 Clusters: 0.3768"
## [1] "Silhouette Score for 6 Clusters: 0.3711"
## [1] "Silhouette Score for 7 Clusters: 0.3912"
## [1] "Silhouette Score for 8 Clusters: 0.3802"
The best number of clusters with the highest silhouette score is 4. So I'm going to use this number of clusters for the following analysis.
Finally, it's time to perform a k-means clustering algorithm.
#k-means clustering
cluster <- kmeans(normalized_data[c("TotalSales", "OrderCount",
"AvgOrderValue")], 4)
#cluster labels
normalized_data$Cluster <- cluster$cluster
#visualizing clusters
options(repr.plot.width = 16, repr.plot.height = 7)
ggplot(normalized_data, aes(x=AvgOrderValue, y=OrderCount, color=factor(Cluster))) +
geom_point() + labs(color="Cluster") + # factor() gives a discrete palette, one colour per cluster
theme(legend.position = "bottom", legend.text=element_text(size=25),
axis.title = element_text(size = 20))
Looking at the plot above, you can see four customer clusters.
-> The bottom-left cluster is a group of low-value customers with a low average order value and a low number of orders. The company probably should not spend too much effort on this group.
-> The upper-right cluster is a group of high-value customers who spend a lot and buy often. This is the cluster the company should focus on to increase sales and ROI.
cluster$centers
## TotalSales OrderCount AvgOrderValue
## 1 1.1995480 1.0041526 0.8628554
## 2 0.2203286 0.7203233 -0.6434174
## 3 -1.2314808 -0.7949971 -1.0527355
## 4 -0.1383505 -0.8579406 0.8030855
The 3rd cluster has the lowest overall numbers: the lowest Total Sales and Average Order Value, and one of the lowest Order Counts. It contains the customers with the least sales, few orders, and the lowest average per-order value.
The 1st cluster has the highest numbers for all attributes. The customers in this cluster purchase expensive items and give the business the highest revenue. The company should focus on this cluster, as it will result in the highest ROI.
Customers in the 2nd cluster have a high value for OrderCount, but their AvgOrderValue is low. Basically, these customers make frequent purchases of low-value items. The business can recommend low-value items to this segment in order to get a high engagement rate.
The customers in the 4th cluster have low numbers of orders and total revenue, but their average per-order value is high, so they buy expensive items infrequently. It would make sense to recommend expensive products to this segment. (Note that kmeans cluster labels are arbitrary, so the numbering should always be checked against cluster$centers, as done here.)
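To make the segments easier to use downstream, we can attach readable labels (a sketch; since kmeans numbering changes between runs, always re-check cluster$centers before mapping):
# Hypothetical label mapping based on the centers printed above
segment_names <- c('1' = 'High value', '2' = 'Frequent, low basket',
                   '3' = 'Low value', '4' = 'Infrequent, high basket')
normalized_data$Segment <- segment_names[as.character(normalized_data$Cluster)]
table(normalized_data$Segment)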
Now we know the customer segments, how customers in those segments behave, and what they are interested in. This information can help the company target its audience better and, therefore, increase sales and ROI.