library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(purrr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggpubr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(viridis)
## Loading required package: viridisLite
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:viridis':
## 
##     unemp
## The following object is masked from 'package:purrr':
## 
##     map
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:tidyr':
## 
##     fill
library(vegan)
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.1.2
## Loading required package: lattice
## This is vegan 2.5-7
## 
## Attaching package: 'vegan'
## The following object is masked from 'package:VGAM':
## 
##     calibrate
library(forecast)
## Warning: package 'forecast' was built under R version 4.1.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:ggpubr':
## 
##     gghistogram
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Modelling and text-cleaning helpers
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:vegan':
## 
##     tolerance
## The following object is masked from 'package:VGAM':
## 
##     predictors
## The following object is masked from 'package:purrr':
## 
##     lift
library(textclean)

options(warn = -1) # remove warning messages
commerce_bc <- read.csv('data/online_retail.csv', encoding = "UTF-8")

In this analysis we will be working with the E-commerce dataset from the UCI Machine Learning Repository, which contains over 500,000 transactions for a UK-based online retailer. The dataset covers actual transactions for the 2010/2011 period, with the country and the invoice date recorded for each entry.
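Before going further, it is worth taking a quick look at the structure of the raw table (a minimal sketch using base dim() and dplyr's glimpse(); column names follow the standard UCI file):

# Quick structural overview of the raw data
dim(commerce_bc)      # number of rows and columns
glimpse(commerce_bc)  # column names and types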

EDA - Let's start the analysis by taking a look at the missing values.

# EDA ---------------------------------------------------------------------
# Missing values 
pMiss <- function(x){sum(is.na(x))/length(x)*100} #function to calculate missing values
apply(commerce_bc, 2, pMiss) # almost 25% of missing values in CustomerID column
##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##     0.00000     0.00000     0.00000     0.00000     0.00000     0.00000 
##  CustomerID     Country 
##    24.92669     0.00000

There are around 25% missing values in the CustomerID column. These correspond to entries such as fees, discounts or returns, which were recorded differently from regular sales.

We are going to remove those entries, as they do not provide actual information about the products.
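As a quick sanity check (a sketch, not part of the original pipeline), we can list the most common descriptions among the rows with a missing CustomerID to see what kind of entries they are:

# Most frequent descriptions among rows without a CustomerID
commerce_bc %>%
  filter(is.na(CustomerID)) %>%
  count(Description, sort = TRUE) %>%
  head(10)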

Next up, we need to do some feature engineering. Among other minor adjustments, we decided to make the following changes:

-> Remove Amazon-related fees
-> Remove rows with missing customer information
-> Convert categorical columns into factor columns
-> Convert the date column into a POSIXct object
-> Create grouping date columns such as year, month, quarter and part of the week

# Cleanup -----------------------------------------------------------------
# Remove fees and adjustments 
commerce <- commerce_bc %>%
    filter(!grepl('amazon', Description, ignore.case = T) & !grepl('adjust', Description, ignore.case = T)) %>%
    filter(!is.na(CustomerID)) %>% #remove entries without customer encoding
    filter(!grepl('C', StockCode)) %>% #drop stock codes containing 'C' (cancelled invoices are flagged in InvoiceNo)
    distinct() #remove duplicates (around 150,000 rows removed from the original dataset in total)

# Retype columns 
commerce$Country <- factor(commerce$Country)
commerce$InvoiceDate <- as.POSIXct(commerce$InvoiceDate, tryFormats = "%m/%d/%y %H:%M") # two-digit years in the raw data

# Change factor labels 
commerce$StockCode <- fct_recode(commerce$StockCode, "Postage" = "POST", "Discount" = "D",
                                 "Carriage" = "C2", "Manual" = "M", "Bank Charges" = "BANK CHARGES",
                                 "Doctom Postage" = "DOT")
# Time related variables
commerce$date <- as.Date(commerce$InvoiceDate)
commerce$year <- format(commerce$date, '%Y')
commerce$month <- factor(format(commerce$date, '%b'), ordered = T, 
                         levels = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
                                    'Oct', 'Nov', 'Dec'))
commerce$quarter <- case_when(
  commerce$month %in% c('Jan', 'Feb', 'Mar') ~ 'Q1',
  commerce$month %in% c('Apr', 'May', 'Jun') ~ 'Q2',
  commerce$month %in% c('Jul', 'Aug', 'Sep') ~ 'Q3',
  commerce$month %in% c('Oct', 'Nov', 'Dec') ~ 'Q4',
  TRUE ~ 'Unknown'
)

commerce$quarter <- factor(commerce$quarter, ordered = T, levels =c('Q1', 'Q2', 'Q3', 'Q4'))
commerce$weeknd <- ifelse(weekdays(commerce$date) %in% c('Saturday', 'Sunday'), 'weekend', 'weekday')

Another important topic is outlier analysis, especially before clustering. A good way to look into this is with boxplots.

# EDA plot settings
options(repr.plot.width = 10, repr.plot.height = 9)
box_theme <- theme(axis.text = element_text(size = 14),
                   axis.title.y = element_text(size = 12),
                   axis.title.x = element_blank())

commerce %>% 
    ggplot() +
        aes(x = 1, y = UnitPrice) +
        geom_boxplot() +
        geom_text(data = . %>% filter(UnitPrice > 750), aes(label = CustomerID), hjust = 1.2) +
        theme_minimal() +
        box_theme

There are some entries with extremely high Unit Prices, but are they actual revenues?

commerce %>%
    filter(CustomerID %in% c('15098', '16029')) %>% 
    arrange(desc(UnitPrice)) %>%
    head(15)
##    InvoiceNo StockCode                       Description Quantity
## 1    C556445    Manual                            Manual       -1
## 2    C551685   Postage                           POSTAGE       -1
## 3     551697   Postage                           POSTAGE        1
## 4    C551699    Manual                            Manual       -1
## 5     556444     22502    PICNIC BASKET WICKER 60 PIECES       60
## 6     556446     22502    PICNIC BASKET WICKER 60 PIECES        1
## 7    C566614  Discount                          Discount       -1
## 8    C571634    Manual                            Manual       -1
## 9    C561225  Discount                          Discount       -1
## 10   C566613  Discount                          Discount       -1
## 11   C557927  Discount                          Discount       -1
## 12   C563592  Discount                          Discount       -1
## 13    547726    72760B VINTAGE CREAM 3 BASKET CAKE STAND       16
## 14    553469    72760B VINTAGE CREAM 3 BASKET CAKE STAND       24
## 15    558362     22625                RED KITCHEN SCALES       24
##            InvoiceDate UnitPrice CustomerID        Country       date year
## 1  2011-06-10 15:31:00  38970.00      15098 United Kingdom 2011-06-10 2011
## 2  2011-05-03 12:51:00   8142.75      16029 United Kingdom 2011-05-03 2011
## 3  2011-05-03 13:46:00   8142.75      16029 United Kingdom 2011-05-03 2011
## 4  2011-05-03 14:12:00   6930.00      16029 United Kingdom 2011-05-03 2011
## 5  2011-06-10 15:28:00    649.50      15098 United Kingdom 2011-06-10 2011
## 6  2011-06-10 15:33:00    649.50      15098 United Kingdom 2011-06-10 2011
## 7  2011-09-13 17:28:00    102.24      16029 United Kingdom 2011-09-13 2011
## 8  2011-10-18 11:34:00     79.20      16029 United Kingdom 2011-10-18 2011
## 9  2011-07-26 10:06:00     45.60      16029 United Kingdom 2011-07-26 2011
## 10 2011-09-13 17:17:00     15.00      16029 United Kingdom 2011-09-13 2011
## 11 2011-06-23 14:56:00     10.00      16029 United Kingdom 2011-06-23 2011
## 12 2011-08-18 06:12:00      9.60      16029 United Kingdom 2011-08-18 2011
## 13 2011-03-25 10:49:00      8.49      16029 United Kingdom 2011-03-25 2011
## 14 2011-05-17 11:15:00      8.49      16029 United Kingdom 2011-05-17 2011
## 15 2011-06-28 15:14:00      7.65      16029 United Kingdom 2011-06-28 2011
##    month quarter  weeknd
## 1    Jun      Q2 weekday
## 2    May      Q2 weekday
## 3    May      Q2 weekday
## 4    May      Q2 weekday
## 5    Jun      Q2 weekday
## 6    Jun      Q2 weekday
## 7    Sep      Q3 weekday
## 8    Oct      Q4 weekday
## 9    Jul      Q3 weekday
## 10   Sep      Q3 weekday
## 11   Jun      Q2 weekday
## 12   Aug      Q3 weekday
## 13   Mar      Q1 weekday
## 14   May      Q2 weekday
## 15   Jun      Q2 weekday

As we can see in the data frame, those extreme values are not actual products but manually entered values, discounts and bank fees, so it is better to get rid of these rows. To do so, we will apply the z-score rule and drop every value that lies more than 3 standard deviations from the mean of the distribution, i.e. every value for which z = (x_i − μ) / σ exceeds 3 in absolute value.
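The same rule can also be written as a small reusable helper (a sketch; the pipeline below applies the equivalent filter inline):

# Flag values lying more than k standard deviations from the mean
is_outlier_z <- function(x, k = 3) {
  abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) > k
}
# usage: commerce %>% filter(!is_outlier_z(UnitPrice))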

commerce %<>% filter(!StockCode %in% c('Postage', 'Discount', 'Carriage', 'Manual', 'Bank Charges', 'Dotcom Postage'))
up_mean <- mean(commerce$UnitPrice, na.rm = T) # centre of the z-score
up_sd <- sd(commerce$UnitPrice, na.rm = T)
commerce %>% 
    filter(!(abs(UnitPrice - up_mean) / up_sd) > 3) %>%
    ggplot() +
        aes(x = 1, y = UnitPrice) +
        geom_boxplot() +
        geom_text(data = . %>% filter(UnitPrice > 5), aes(label = CustomerID), hjust = 1.2) +
        theme_minimal() +
        box_theme

Now the number of outliers has been significantly reduced, so our clustering algorithm should work better. Next, we need to aggregate the line items by customer and invoice date so we can work with the total amount purchased per order.

# Total amount by invoice
comm_clean <- commerce %>% 
  group_by(CustomerID, Country, date, year, month, quarter, weeknd) %>% 
  summarise(InvoiceAmount = sum(Quantity * UnitPrice)) %>%
  filter(InvoiceAmount > 0) # drop returns (negative totals)
## `summarise()` has grouped output by 'CustomerID', 'Country', 'date', 'year', 'month', 'quarter'. You can override using the `.groups` argument.
# Outlier analysis 
comm_clean %>% 
    ggplot() +
        aes(x = 1, y = InvoiceAmount) +
        geom_boxplot() +
        theme_minimal() +
        box_theme

The effect of the Unit Price outliers is obvious when aggregating the data. Let's see how it would look after removing the outliers with the z-score formula.

# Z-score outlier removal (centred on the median here)
invoice_ave <- median(comm_clean$InvoiceAmount, na.rm = T)
invoice_sd = sd(comm_clean$InvoiceAmount, na.rm = T)
comm_clean %>% 
    filter(!(abs(InvoiceAmount - invoice_ave) / invoice_sd) > 3) %>%
    ggplot() +
        aes(x = 1, y = InvoiceAmount) +
        geom_boxplot() +
        theme_minimal() +
        box_theme

# Outlier extraction (z-score formula)
x_median <- median(comm_clean$InvoiceAmount)
x_sd <- sd(comm_clean$InvoiceAmount, na.rm = T)
comm_clean %<>%
    filter(!(abs(InvoiceAmount - x_median) / x_sd) > 3) 

y_median <- median(commerce$UnitPrice, na.rm = T) 
y_sd <- sd(commerce$UnitPrice, na.rm = T)
commerce %<>%
    filter(!(abs(UnitPrice - y_median) / y_sd) > 3) 

Visual Exploration - Now we are going to turn our attention to visualizing the data. First things first, let's look at the Invoice Amount distribution as a kernel density plot.

options(repr.plot.width = 14, repr.plot.height = 8) #resize
xmean = round(mean(comm_clean$InvoiceAmount, na.rm = T), 2)
xmedian = round(median(comm_clean$InvoiceAmount, na.rm = T), 2)
o1 <- comm_clean %>% 
             ggplot() +
                aes(x = InvoiceAmount) +
                geom_density(alpha = 0.5, fill = 'steelblue') +
                geom_vline(aes(xintercept = xmean), lty = 2, col = 'red', size = 1.2) +
                geom_vline(aes(xintercept = xmedian), lty = 2, col = 'blue', size = 1.2) +
                theme_bw() +
                theme(strip.text = element_text(size = 14),
                      legend.position = 'top',
                      legend.text = element_text(size = 14),
                      legend.title = element_blank(),
                      plot.title = element_text(size = 24, face = 'bold.italic'),
                      plot.subtitle = element_text(size = 14),
                      axis.title = element_blank(),
                      axis.text = element_text(size = 12)) +
                labs(title = 'Invoice Amount Kernel Density', 
                     subtitle = paste('Mean (red line): ',xmean, '; Median (blue line): ', xmedian)) +
    xlim(0, 2500)

o2 <- comm_clean %>% 
  ggplot() +
    aes(x = InvoiceAmount) +
    geom_boxplot() +
    theme_bw() +
    theme(strip.text = element_text(size = 14),
          legend.position = 'top',
          legend.text = element_text(size = 14),
          legend.title = element_blank(),
          #plot.title = element_text(size = 20, face = 'bold'),
          #plot.subtitle = element_text(size = 12, face = 'italic'),
          axis.title = element_blank(),
          axis.text.x = element_text(size = 12),
          axis.text.y = element_text(size = 14)) +
    xlim(0, 2500)

grid.arrange(o1, o2, layout_matrix = rbind(c(1, 1, 1),
                                           c(1, 1, 1),
                                           c(1, 1, 1),
                                           c(2, 2, 2)))

The distribution of this variable shows a heavy right tail, which suggests that we are going to need to standardize the data before clustering it.
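As a small preview of the normalization applied later in the clustering section, one way to tame such a tail is to rank the values first and then scale the ranks to zero mean and unit variance (a sketch on the invoice totals):

# Rank-then-scale the invoice totals (preview of the later normalization step)
comm_clean %>%
  ungroup() %>%
  mutate(InvoiceAmountZ = as.numeric(scale(rank(InvoiceAmount)))) %>%
  dplyr::select(InvoiceAmount, InvoiceAmountZ) %>%
  head()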

Another grouping variable is the country. Let's visualize it.

# by country
table(comm_clean$Country) # Countries with too few entries
## 
##            Australia              Austria              Bahrain 
##                   38                   17                    2 
##              Belgium               Brazil               Canada 
##                   95                    1                    5 
##      Channel Islands               Cyprus       Czech Republic 
##                   20                   13                    2 
##              Denmark                 EIRE   European Community 
##                   18                  165                    3 
##              Finland               France              Germany 
##                   37                  354                  406 
##               Greece              Iceland               Israel 
##                    5                    7                    3 
##                Italy                Japan              Lebanon 
##                   34                   16                    1 
##            Lithuania                Malta          Netherlands 
##                    2                    5                   44 
##               Norway               Poland             Portugal 
##                   27                   18                   48 
##                  RSA         Saudi Arabia            Singapore 
##                    1                    1                    4 
##                Spain               Sweden          Switzerland 
##                   85                   28                   41 
## United Arab Emirates       United Kingdom          Unspecified 
##                    2                14827                    7 
##                  USA 
##                    5
filter_few <- table(comm_clean$Country) > 30
countries <- rownames(filter_few)
countries <- countries[filter_few]

comm_clean %>%
  filter(Country %in% countries) %>% 
  group_by(Country) %>% 
  mutate(xmean = mean(InvoiceAmount, na.rm = T),
         xmedian = median(InvoiceAmount, na.rm = T)) %>% 
  ggplot() +
    aes(x = InvoiceAmount, fill = Country) +
    geom_density(alpha = 0.5) +
    geom_vline(aes(xintercept = xmean), lty = 2, col = 'red', size = 1.2) +
    geom_vline(aes(xintercept = xmedian), lty = 2, col = 'blue', size = 1.2) +
    geom_text(aes(x = xmedian + 150, y = 0.0015, label = round(xmedian, 2)), col = 'blue', size= 4.5) +
    geom_text(aes(x = xmean + 150, y = 0.001, label = round(xmean, 2)), col = 'red', size= 4.5) +
    theme_minimal() +
    theme(strip.text = element_text(size = 14),
          legend.position = 'none',
          plot.title = element_text(size = 20, face = 'bold'),
          plot.subtitle = element_text(size = 12, face = 'italic'),
          axis.title = element_blank(),
          axis.text = element_text(size = 10, family = 'sans')) +
    xlim(0, 2500) +
    labs(title = 'Results by Country (2011)') +
    facet_wrap(~Country, nrow = 3)

world_map = map_data("world")
to_world <- comm_clean %>% 
  group_by(Country) %>% 
  summarise(InvoiceAmount = log(sum(InvoiceAmount, na.rm = T))) %>% 
  mutate(Country = fct_recode(Country, "UK" = "United Kingdom", 'Ireland' = 'EIRE')) %>% 
  right_join(world_map, by = c("Country" = "region")) 
# by country worldmap
options(repr.plot.width = 20, repr.plot.height = 11) #resize for worldmap
to_world %>%
      ggplot(aes(x = long, y = lat, group = group)) +
        geom_polygon(aes(fill=InvoiceAmount), colour = "grey50") +
        theme_void() +
        scale_fill_viridis(option = 'C', na.value = 'grey90') 

Most of the sales are concentrated in the European Union, especially in the UK. Regarding the mean Invoice Amount, there are some clear differences between countries: the means for countries such as Switzerland or Ireland are more than twice as high as those for countries such as Belgium or Italy.
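The claim about the per-country means can be checked directly (a quick sketch reusing the countries vector created above):

# Mean and median invoice amount per country (countries with more than 30 invoices)
comm_clean %>%
  ungroup() %>%
  filter(Country %in% countries) %>%
  group_by(Country) %>%
  summarise(mean_amount = mean(InvoiceAmount, na.rm = TRUE),
            median_amount = median(InvoiceAmount, na.rm = TRUE)) %>%
  arrange(desc(mean_amount))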

Another way to visualize the same variable is as a time series.

options(repr.plot.width = 21, repr.plot.height = 6)
# Plotting the time series
comm2 <- comm_clean %>%
            dplyr::select(date, InvoiceAmount) %>% 
            group_by(date) %>%
            summarise(InvoiceAmount = sum(InvoiceAmount, na.rm = T))
## Adding missing grouping variables: `CustomerID`, `Country`, `year`, `month`, `quarter`
comm_ts <- xts(comm2$InvoiceAmount, order.by = as.POSIXct(comm2$date))

autoplot(comm_ts, ts.colour = 'green') +
    geom_line(aes(y = ma(comm_ts, 10)), series = '10 Moving Average', size = 1.5, col = 'orange') +
    ggtitle("Daily Profits with 10th Moving Average") + 
    labs(col = '') +
    theme_dark() +
    theme(axis.text = element_text(size = 14),
         plot.title = element_text(size = 22, face = 'italic'),
         legend.text = element_text(size = 14),
         legend.position = 'top',
         axis.title.y = element_blank()) 

As we can see from the orange line depicting the 10-period moving average, there was a sales increase towards the end of the year.
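A simple monthly aggregate tells the same story (a sketch; note that December mixes 2010 and 2011 entries):

# Total invoice amount per calendar month
comm_clean %>%
  ungroup() %>%
  group_by(month) %>%
  summarise(total = sum(InvoiceAmount, na.rm = TRUE)) %>%
  arrange(month)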

Finally, we can tokenize the product descriptions and take a look at the most frequently used words.

# Wordcloud ---------------------------------------------------------------
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tidytext)
library(ggwordcloud)

# Remove: stopwords in ENG
eng_stop = stopwords("eng")

# tokenize Commerce db sentences 
comm_txt <- commerce %>% 
  mutate(Description = as.character(Description)) %>%
  unnest_tokens(input = Description, text, token = 'words') %>% 
  mutate(text = tolower(text)) %>% 
  group_by(text) %>% 
  filter(!text %in% eng_stop) %>% # remove stopwords 
  filter(!grepl("[0-9]", text)) %>%  # remove rows with some numbers 
  filter(!is.na(CustomerID)) %>% # Remove unknown customers 
  filter(text != 'abc') %>%
  filter(nchar(text) > 1) %>% # remove single-character tokens
  summarise(freq = n()) %>% 
  arrange(desc(freq))
set.seed(5)
options(repr.plot.width = 21, repr.plot.height = 6)
comm_txt %>% 
  filter(freq > 750) %>% 
  ggplot() +
    aes(label = text, size = freq, col = freq) +
    geom_text_wordcloud() +
    scale_size_area(max_size = 22) +
    #scale_color_gradient(low = '#66e6ff', high = '#003ca1') +
    scale_color_viridis(option = 'D', direction = -1) +    
    theme_minimal()

The most frequently used words are related to how the products are sold (bag, set, box, pack, ...) or to the product design (vintage, hanging, design, glass, ...).

For the clustering we are going to keep it simple.

We start again from the original data frame and do some light feature engineering on it.

head(commerce_bc)
##   InvoiceNo StockCode                         Description Quantity  InvoiceDate
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26
## 2    536365     71053                 WHITE METAL LANTERN        6 12/1/10 8:26
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26
##   UnitPrice CustomerID        Country
## 1      2.55      17850 United Kingdom
## 2      3.39      17850 United Kingdom
## 3      2.75      17850 United Kingdom
## 4      3.39      17850 United Kingdom
## 5      3.39      17850 United Kingdom
## 6      7.65      17850 United Kingdom

Data cleaning and preparation

There are a few data preparation steps to take before I can start segmenting customers:

-> Filtering out negative (canceled) orders. If you look at the Quantity column, you will notice some negative values there - canceled or returned orders. I need to remove those orders.

-> Handling NA values in the CustomerID field. The customer ID is crucial for this analysis, so I will have to drop the NA values there.

-> Changing the InvoiceDate column to DateTime format. In order to prevent any issues with dates, I like to make sure all the data is in the appropriate format.

-> Removing incomplete data. December 2011 is not a full month in this data set.

-> Calculating the sales column by multiplying the Quantity and UnitPrice columns.

-> Transforming the data so that each record represents the purchase history of an individual customer.

#filtering out canceled orders
commerce_bc <- commerce_bc %>%
  filter(Quantity > 0)

#drop NA values
sum(is.na(commerce_bc$CustomerID))
## [1] 133361
commerce_bc <- na.omit(commerce_bc)

#changing to datetime format
commerce_bc$InvoiceDate=as.POSIXct(commerce_bc$InvoiceDate,
                            format="%m/%d/%Y %H:%M",
                            tz=Sys.timezone())
#removing incomplete data
commerce_bc <- commerce_bc %>%
  filter(InvoiceDate < "2011-12-01")

#calculate the sales column
commerce_bc <- commerce_bc %>%
  mutate(Sales=Quantity*UnitPrice)

#transform data
customer_data <- commerce_bc %>%
  group_by(CustomerID) %>%
  summarize(TotalSales=sum(Sales),
            OrderCount=length(unique(InvoiceDate))) %>%
  mutate(AvgOrderValue=TotalSales/OrderCount)

head(customer_data)
## # A tibble: 6 × 4
##   CustomerID TotalSales OrderCount AvgOrderValue
##        <int>      <dbl>      <int>         <dbl>
## 1      12346     77184.          1        77184.
## 2      12347      4310           7          616.
## 3      12348      1797.          4          449.
## 4      12349      1758.          1         1758.
## 5      12350       334.          1          334.
## 6      12352      2506.          8          313.

Data normalization - Clustering algorithms (k-means included) are highly affected by the scale of the data, so I need to normalize the data to get all features onto the same scale.

#rank the data, values from 1 to 4298 (total number of records)
rank_data <- customer_data %>%
  mutate(TotalSales=rank(TotalSales), OrderCount=rank(OrderCount, 
         ties.method = "first"), AvgOrderValue=rank(AvgOrderValue))

#normalizing data
normalized_data <- rank_data %>%
  mutate(TotalSales=scale(TotalSales), OrderCount=scale(OrderCount), 
         AvgOrderValue=scale(AvgOrderValue))

#look at the summary of the normalized data
summary(normalized_data)
##    CustomerID       TotalSales.V1        OrderCount.V1       AvgOrderValue.V1  
##  Min.   :12346   Min.   :-1.7314521   Min.   :-1.7314521   Min.   :-1.7314521  
##  1st Qu.:13812   1st Qu.:-0.8657261   1st Qu.:-0.8657261   1st Qu.:-0.8657261  
##  Median :15299   Median : 0.0000000   Median : 0.0000000   Median : 0.0000000  
##  Mean   :15300   Mean   : 0.0000000   Mean   : 0.0000000   Mean   : 0.0000000  
##  3rd Qu.:16778   3rd Qu.: 0.8657261   3rd Qu.: 0.8657261   3rd Qu.: 0.8657261  
##  Max.   :18287   Max.   : 1.7314521   Max.   : 1.7314521   Max.   : 1.7314521
sapply(normalized_data, sd)
##    CustomerID    TotalSales    OrderCount AvgOrderValue 
##       1721.89          1.00          1.00          1.00

In the summary, we see the values are centered around 0 (mean = 0) and have a standard deviation of 1. Now the data is ready for the clustering analysis.

K-means clustering

K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (Wiki). This is a method of unsupervised learning that learns the commonalities between groups of data without any target or labeled variable.

The k-means clustering algorithm splits the records in the data into a pre-defined number of clusters, where the data points within each cluster are close to each other. One difficulty of using k-means clustering for customer segmentation is that you need to know the number of clusters beforehand.
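Formally, for a fixed k the algorithm looks for cluster assignments and centers μ_1, ..., μ_k that minimize the total within-cluster sum of squares, i.e. the sum over all clusters of Σ ||x − μ_k||² taken over the points x assigned to cluster k. The cluster numbering itself carries no meaning: it depends on the random initialization, so the labels can change between runs.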

Luckily, the silhouette coefficient can help. The silhouette coefficient measures how close the data points are to their own cluster compared to the other clusters. Silhouette values range from -1 to 1, and the closer they are to 1, the better the clustering.
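For a single point i, the silhouette value is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points of the nearest neighbouring cluster; the score reported below is the average of s(i) over all points.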

Let's find the best number of clusters:

library(cluster)
## 
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
## 
##     votes.repub
for(n_cluster in 4:8) {
  cluster <- kmeans(normalized_data[c("TotalSales", "OrderCount",
                "AvgOrderValue")], n_cluster)
  silhouetteScore <- mean(
    silhouette(
      cluster$cluster,
      dist(normalized_data[c("TotalSales", "OrderCount",
                             "AvgOrderValue")],
           method = "euclidean")
    )[,3]
  )
  print(sprintf("Silhouette Score for %i Clusters: %0.4f", n_cluster, 
                silhouetteScore))
}
## [1] "Silhouette Score for 4 Clusters: 0.4122"
## [1] "Silhouette Score for 5 Clusters: 0.3768"
## [1] "Silhouette Score for 6 Clusters: 0.3711"
## [1] "Silhouette Score for 7 Clusters: 0.3912"
## [1] "Silhouette Score for 8 Clusters: 0.3802"

The best number of clusters with the highest silhouette score is 4. So I'm going to use this number of clusters for the following analysis.

Finally, it's time to perform a k-means clustering algorithm.

#k-means clustering
cluster <- kmeans(normalized_data[c("TotalSales", "OrderCount",
                                    "AvgOrderValue")], 4)
#cluster labels
normalized_data$Cluster <- cluster$cluster

#visualizing clusters
options(repr.plot.width = 16, repr.plot.height = 7)
ggplot(normalized_data, aes(x=AvgOrderValue, y=OrderCount, color=factor(Cluster))) +
  geom_point() + theme(legend.position = "bottom", legend.text=element_text(size=25),
                      axis.title = element_text(size = 20))

Looking at the plot above, you see 4 customer clusters.

-> The bottom-left cluster is a group of low-value customers with a low Average Order Value and a low number of orders. The company probably should not spend too much effort on this customer group.

-> The upper-right cluster is a group of high-value customers who spend a lot and buy often. This is the cluster the company should focus on to increase sales and ROI.

cluster$centers
##   TotalSales OrderCount AvgOrderValue
## 1  1.1995480  1.0041526     0.8628554
## 2  0.2203286  0.7203233    -0.6434174
## 3 -1.2314808 -0.7949971    -1.0527355
## 4 -0.1383505 -0.8579406     0.8030855

The 3rd cluster has the lowest values for TotalSales and AvgOrderValue, together with a very low OrderCount. It contains the customers with the lowest spending, few orders and the lowest average per-order value.

The 1st cluster has the highest values for all three attributes. The customers in this cluster purchase expensive items and give the business the highest revenue. The company should focus on this cluster, as it will result in the highest ROI.

Customers in the 2nd cluster have a high value for OrderCount, but their AvgOrderValue is low. Basically, these customers make frequent purchases of low-value items. The business can recommend low-value items to this segment in order to keep the engagement rate high.

The customers in the 4th cluster have the lowest number of orders and a below-average total spend, but their average per-order value is high. These customers buy expensive items infrequently, so it would make sense to recommend expensive products to this segment.

Now we know the customer segments, how the customers in those segments behave, and what they are interested in. This information can help the company target its audience better and, therefore, increase sales and ROI.
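As a final, hypothetical usage sketch (high_value_ids is a name introduced here, and the cluster number must be matched against cluster$centers for the current run), the customer IDs of a chosen segment can be pulled out for targeting:

# Extract the customer IDs of the high-value segment
# (replace 1 with whichever cluster has the highest centers in this run)
high_value_ids <- normalized_data %>%
  filter(Cluster == 1) %>%
  pull(CustomerID)
head(high_value_ids)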