Problem Motivation and Summary of Key Findings

  1. We are an online retailer based in the UK. The UK is our primary market, but which other countries are our customers in? Should we expand our marketing efforts internationally? If so, where?

    • In any given month, international sales account for 9-21% of our total sales
    • Ireland, Germany, and France account for the highest sales outside of the UK
    • Ireland also has the highest spending per customer, making it an attractive target  

  2. Product analysis based on 80/20 rule

    • 18% of our products account for 80% of our sales
    • 50% of our products account for nearly all of our sales
    • We should review our product range and consider removing the portion that does not contribute to sales

  3. There are products with very high return rates (as high as 50%)

    • This could indicate quality issues or a discrepancy between our product descriptions and customers’ expectations
    • High return rates lead to unnecessary costs and should be addressed with the product/marketing teams

  4. Customer analysis based on 80/20 rule

    • 27% of our customers account for 80% of our sales
    • These customers, on average, spend 10x as much as other customers
    • We will follow up with cluster analysis (see the RFM section)

Detailed Analysis

library(readxl)
library(tidyverse)   # loads ggplot2 and dplyr, among others
library(feather)
library(RColorBrewer)
library(scales)

data <- read_feather("online_retail.feather")

# quick look at the structure of the data
head(data)
summary(data)
str(data)

# standardize the customer ID column name and treat it as categorical
data <- rename(data, 'CustomerID' = 'Customer ID')
data$CustomerID <- as.factor(data$CustomerID)

# derive transaction type, sale amount, and date/time components
data <- data %>% 
  mutate(Type = if_else(Quantity < 0, 'Return', 'Purchase'),
         Sale = Quantity * Price,
         Date = parse_date(format(InvoiceDate, '%Y-%m-%d')),
         Year = format(InvoiceDate, '%Y'),
         Month = format(InvoiceDate, '%m'),
         Time = parse_time(format(InvoiceDate, '%H:%M')))
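
As a sanity check on the derived Type column: in this dataset, cancelled invoices are conventionally prefixed with ‘C’, so Returns and ‘C’ invoices should line up (a quick sketch; the prefix convention is an assumption about the raw data):

# cross-tabulate derived Type against the 'C' invoice prefix
# (assumption: cancellations carry a 'C' prefix in the Invoice column)
data %>% 
  count(Type, cancelled = startsWith(Invoice, 'C'))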

1. We are based in the UK. But how much business comes from abroad?

Q: What percentage of the business is in the UK?

As expected, a consistently high share of transactions comes from the UK. However, the overseas contribution is not insignificant: in January it accounted for 20% of sales. International orders also tend to be higher in value on average.

# monthly share of transactions and of sales coming from the UK
within_UK <- data %>%
  mutate(YnM = paste(Year, Month, sep = '-'),
         is_UK = ifelse(Country == 'United Kingdom', 1, 0)) %>% 
  group_by(YnM, is_UK) %>% 
  summarize(n = n(), agg_sales = sum(Sale)) %>% 
  mutate(prop = n / sum(n),
         prop_sales = agg_sales / sum(agg_sales)) %>% 
  filter(is_UK == 1)  

within_UK %>% 
  ggplot(aes(x = YnM)) +
  geom_bar(aes(y = prop), stat = 'identity', alpha = 0.8, fill = '#1b9e77') + 
  geom_bar(aes(y = prop_sales), stat = 'identity', alpha = 0.8, fill = '#7570b3') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = 'Year-Month', y = 'UK share of transactions (green) and sales (purple)')
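
Two quick numeric checks against the summary (a sketch using the objects defined above): the monthly range of the international share of sales, and average order value inside vs. outside the UK.

# monthly international share of sales; should match the 9-21% range in the summary
range(1 - within_UK$prop_sales)

# average order value, UK vs international
data %>% 
  mutate(is_UK = Country == 'United Kingdom') %>% 
  group_by(is_UK) %>% 
  summarize(avg_order_value = sum(Sale) / n_distinct(Invoice))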

Q: Which countries outside of the UK have the highest sales?

#note: with stat = 'identity', the stacked bar heights sum Sale within each country
data %>% 
  filter(Country != 'United Kingdom') %>% 
  ggplot(aes(x = Country, y = Sale / 1000000, fill = Type)) +
  geom_bar(stat = 'identity') +
  scale_fill_brewer(palette = 'Dark2') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(y = 'Sales (in millions)', x = '')
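
The plot is easier to read next to a ranked table; a quick tally of the top five non-UK countries by net sales (sketch):

# top five countries outside the UK by net sales
data %>% 
  filter(Country != 'United Kingdom') %>% 
  group_by(Country) %>% 
  summarize(sales = sum(Sale)) %>% 
  slice_max(sales, n = 5)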

Q: Which countries have the highest sales-per-customer?

# per-country aggregates: customers, sales, volume, orders
sales_per_cust <- data %>% 
  group_by(Country) %>% 
  summarise(n_customers = n_distinct(CustomerID),
            sales = sum(Sale), 
            volume = sum(Quantity),
            orders = n()) %>% 
  mutate(top_sales = min_rank(desc(sales)),
         top_volume = min_rank(desc(volume)),
         top_orders = min_rank(desc(orders)),
         vol_per_order = volume / orders,
         sales_per_order = sales / orders,
         sales_per_customer = sales / n_customers) %>% 
  arrange(desc(sales_per_customer)) %>% 
  select(Country, n_customers, sales_per_customer) 

sales_per_cust %>% 
  filter(n_customers > 1 & sales_per_customer > 2000) %>% 
  ggplot(aes(x = reorder(Country, -sales_per_customer), y = sales_per_customer)) +
  geom_bar(stat = 'identity', fill = '#1b9e77') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = '', y = 'Sales per customer')

2. Seasonal patterns of sales

note: we only have one year’s worth of data, so it is difficult to generalize these patterns

Q: Are there patterns based on time of year or time of day?

#seasonal trend 
data %>% 
  ggplot(aes(x = Date, col = Type)) +
  geom_freqpoly(bins = 80, size = 1) + 
  facet_grid(Type~., scales = 'free') +
  scale_color_brewer(palette = 'Dark2') +
  theme(legend.position = 'none') +
  labs(y = 'Transaction volume')

#time of the day trend 
data %>% 
  ggplot(aes(x = Time, col = Type))+
  geom_freqpoly(bins = 80, size = 1) + 
  facet_grid(Type~., scales = 'free') +
  scale_color_brewer(palette = 'Dark2') +
  theme(legend.position = 'none')  +
  labs(y = 'Transaction volume')

Before we go any further, we need to address a data issue:

There are customers whose first recorded order is a Return, meaning the original purchase happened before our data begins. These rows need to be removed.

#find instances where the first order by a customer is a Return 
first_orders <- data %>% 
  filter(!is.na(CustomerID)) %>% 
  group_by(CustomerID, InvoiceDate) %>% 
  summarize(sales = sum(Sale)) %>% 
  filter(InvoiceDate == first(InvoiceDate),
         sales < 0) %>% 
  mutate(first_order_neg = 'Yes') %>% 
  select(CustomerID, InvoiceDate, first_order_neg)
first_orders
## # A tibble: 272 x 3
## # Groups:   CustomerID [272]
##    CustomerID InvoiceDate         first_order_neg
##    <fct>      <dttm>              <chr>          
##  1 12349      2009-12-04 12:49:00 Yes            
##  2 12382      2010-01-26 16:25:00 Yes            
##  3 12424      2009-12-07 14:43:00 Yes            
##  4 12454      2010-01-26 16:24:00 Yes            
##  5 12468      2010-01-19 12:13:00 Yes            
##  6 12471      2009-12-02 14:43:00 Yes            
##  7 12472      2009-12-09 10:47:00 Yes            
##  8 12484      2009-12-03 10:57:00 Yes            
##  9 12487      2009-12-09 09:16:00 Yes            
## 10 12497      2010-01-26 17:21:00 Yes            
## # ... with 262 more rows
#join onto the original data and filter out the flagged rows 
data <- left_join(data, first_orders, by = c('CustomerID' = 'CustomerID', 'InvoiceDate' = 'InvoiceDate'))
data <- filter(data, is.na(first_order_neg))
data$first_order_neg <- NULL

#removing these rows can expose new 'first orders' that are also Returns,
#so repeat the check-and-remove step until none remain
repeat {
  first_orders <- data %>% 
    filter(!is.na(CustomerID)) %>% 
    group_by(CustomerID, InvoiceDate) %>% 
    summarize(sales = sum(Sale)) %>% 
    filter(InvoiceDate == first(InvoiceDate),
           sales < 0) %>% 
    mutate(first_order_neg = 'Yes') %>% 
    select(CustomerID, InvoiceDate, first_order_neg)
  if (nrow(first_orders) == 0) break
  data <- left_join(data, first_orders, by = c('CustomerID' = 'CustomerID', 'InvoiceDate' = 'InvoiceDate'))
  data <- filter(data, is.na(first_order_neg))
  data$first_order_neg <- NULL
}
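
As an aside, each join-filter-drop pass above can be written as a single anti_join, which keeps only the rows of data with no match in first_orders (an equivalent, slightly more idiomatic sketch):

# one-step equivalent of the join-filter-drop pattern
data <- anti_join(data, first_orders, by = c('CustomerID', 'InvoiceDate'))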

3. 80/20 rule - our products

The 80/20 rule says that 80% of the results come from 20% of the inputs. In this case, it would imply that 80% of our sales are generated by 20% of our products. Is this true?

Let’s group the data by product and take a glimpse

products <- data %>% 
  group_by(StockCode) %>% 
  summarise(sales = sum(Sale), 
            volume = sum(Quantity),
            orders = n_distinct(Invoice),
            cost = sales/volume) %>%   # note: this is average unit price, not true cost
  mutate(top_sales = min_rank(desc(sales)),
         top_volume = min_rank(desc(volume)),
         top_orders = min_rank(desc(orders))) %>% 
  arrange(top_sales) 

products[1:10,]
## # A tibble: 10 x 8
##    StockCode   sales volume orders  cost top_sales top_volume top_orders
##    <chr>       <dbl>  <dbl>  <int> <dbl>     <int>      <int>      <int>
##  1 22423     163051.  12784   2195 12.8          1         61          2
##  2 85123A    155891.  57259   3368  2.72         2          2          1
##  3 DOT       116402.   1231    736 94.6          3       1015         77
##  4 85099B     88924.  48377   2010  1.84         4          4          3
##  5 M          73495.   1515    736 48.5          5        870         77
##  6 84879      72468.  44492   1420  1.63         6          6          7
##  7 22086      57876.  17025    965  3.40         7         35         41
##  8 47566      49650.   9270   1019  5.36         8        101         37
##  9 84347      47672.  13048    356  3.65         9         58        375
## 10 POST       46222.   3763    858 12.3         10        358         56

note: StockCode ‘M’ indicates manual adjustments to the data; ‘DOT’ and ‘POST’ appear to be postage charges rather than products
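
A quick way to verify what these special codes are is to look up their descriptions, assuming the Description column from the raw data is still present:

# look up descriptions for the non-product stock codes
data %>% 
  filter(StockCode %in% c('M', 'DOT', 'POST')) %>% 
  distinct(StockCode, Description)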

Q: What % of products make up 80% of sales?

Perhaps we need to look at the breadth of our product offering. If we could reduce the number of products while retaining nearly all of our sales, that could mean lower costs and higher profitability.

topprod <- products %>% 
  mutate(temp = 1/n(),   # each product's share of the product count
         cumprod = cumsum(temp),
         cumsales = cumsum(sales / sum(sales)),
         top18prod = if_else(cumsales < 0.801, 'yes', 'no')) 

data <- left_join(data, topprod[, c('StockCode', 'top18prod')], by = c('StockCode' = 'StockCode')) 

topprod %>% 
  ggplot(aes(x = cumprod*100, y = cumsales*100)) + 
  geom_line(size = 1)+
  geom_hline(linetype = 'dashed', color = 'red', size = 0.8, yintercept = 80) + 
  theme(axis.title.x = element_text(size = 14), axis.title.y = element_text(size = 14)) + 
  labs(y = '% of sales', x = '% of products')
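
Rather than eyeballing the curve, the headline numbers can be read off directly (a quick sketch):

# share of products needed to reach 80% of sales (the '18%' in the summary)
mean(topprod$top18prod == 'yes')

# cumulative share of sales captured by the top 50% of products
approx(topprod$cumprod, topprod$cumsales, xout = 0.5)$y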

Q: Are some products more likely to be returned than others?

We should look into what’s going on with these products. Possible problems include quality issues, a mismatch between the website description and customer expectations, and so on. High return rates lead to higher costs and lower profitability.

freq_returns <- data %>% 
  filter(top18prod == 'yes' & StockCode != 'M') %>% 
  group_by(StockCode) %>% 
  summarize(purchased = sum(Sale[Type == 'Purchase']),
            returned = sum(Sale[Type == 'Return']),
            prop = -returned / purchased) %>% 
  filter(prop > 0.2) %>% 
  arrange(desc(prop)) 

freq_returns
## # A tibble: 10 x 4
##    StockCode purchased returned  prop
##    <chr>         <dbl>    <dbl> <dbl>
##  1 85220         6875.   -3646. 0.530
##  2 22341         4901.   -1838. 0.375
##  3 85184D        4550.   -1665. 0.366
##  4 79323W       18049.   -6306. 0.349
##  5 79323LP       8481.   -2538. 0.299
##  6 79323P       13761.   -3934. 0.286
##  7 22085         5340.   -1502. 0.281
##  8 22656         5605    -1568. 0.280
##  9 21735         8926.   -2114. 0.237
## 10 20829         7167.   -1597. 0.223
freq_returns %>% 
  ggplot(aes(x = reorder(StockCode, -prop)))+
  geom_bar(aes(y = purchased), stat = 'identity') +
  geom_bar(aes(y = returned), stat = 'identity', fill = 'red') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(y = 'Purchased (grey) vs returned (red) sales', x = 'Product code')

4. 80/20 rule - our customers

The 80/20 rule held for our products; does it also apply to our customers?

Let’s group the data by customer. For example, over this period customer 18102 generated the most sales for us, placing 95 orders, the most recent on 2010-12-09.

customers <- data %>% 
  filter(!is.na(CustomerID)) %>% ##there are missing values in CustomerID 
  group_by(CustomerID) %>% 
  summarise(sales = sum(Sale), 
            volume = sum(Quantity),
            orders = n_distinct(Invoice),
            first_order = first(Date),
            last_order = last(Date)) %>% 
  mutate(rank_sales = min_rank(desc(sales)),
         rank_volume = min_rank(desc(volume)),
         rank_orders = min_rank(desc(orders))) %>% 
  arrange(rank_sales) %>% 
  select(CustomerID, sales, orders, first_order, last_order, rank_sales, rank_orders)

customers[1:10,]
## # A tibble: 10 x 7
##    CustomerID   sales orders first_order last_order rank_sales rank_orders
##    <fct>        <dbl>  <int> <date>      <date>          <int>       <int>
##  1 18102      341777.     95 2009-12-01  2010-12-09          1          11
##  2 14646      243853.     87 2009-12-02  2010-11-30          2          13
##  3 14156      183181.    138 2009-12-01  2010-12-03          3           5
##  4 14911      137676.    270 2009-12-01  2010-12-09          4           1
##  5 13694      128172.    105 2009-12-04  2010-12-01          5          10
##  6 17511       83761.     42 2009-12-02  2010-12-07          6          38
##  7 15061       82163.     90 2009-12-01  2010-12-07          7          12
##  8 16684       75610.     34 2009-12-07  2010-11-25          8          51
##  9 13089       55828.    132 2009-12-02  2010-12-06          9           7
## 10 16754       54558.     35 2010-03-08  2010-12-02         10          46

Q: What % of customers make up 80% of sales?

#top 27% of the customers generated 80% of the sales 
topcust <- customers %>% 
  mutate(temp = 1/n(),   # each customer's share of the customer count
         cumcustomers = cumsum(temp),
         cumsales = cumsum(sales / sum(sales)),
         top27cust = if_else(cumsales < 0.801, 'yes', 'no')) 

data <- left_join(data, topcust[, c('CustomerID', 'top27cust')], by = c('CustomerID' = 'CustomerID')) 

topcust %>% 
  ggplot(aes(x = cumcustomers*100, y = cumsales*100)) + 
  geom_line(size = 1)+
  geom_hline(linetype = 'dashed', color = 'red', size = 0.8, yintercept = 80) + 
  theme(axis.title.x = element_text(size = 14), axis.title.y = element_text(size = 14)) + 
  labs(y = '% of sales', x = '% of customers')
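
As with the products, the headline number can be read off directly (sketch):

# share of customers needed to reach 80% of sales (the '27%' in the summary)
mean(topcust$top27cust == 'yes')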

Q: Where do the top 27% of the customers come from?

Almost 90% are from the UK, followed by Germany and France.

data %>% 
  filter(top27cust == 'yes') %>% 
  group_by(Country) %>% 
  summarize(n_cust = n_distinct(CustomerID)) %>% 
  arrange(desc(n_cust)) %>% 
  mutate(prop = round(n_cust / sum(n_cust), 2))
## # A tibble: 29 x 3
##    Country         n_cust  prop
##    <chr>            <int> <dbl>
##  1 United Kingdom    1055  0.89
##  2 Germany             33  0.03
##  3 France              18  0.02
##  4 Switzerland          8  0.01
##  5 Spain                7  0.01
##  6 Belgium              6  0.01
##  7 Channel Islands      6  0.01
##  8 Austria              5  0   
##  9 Italy                5  0   
## 10 Netherlands          5  0   
## # ... with 19 more rows

Q: How often do they shop? And how much do they spend?

# average spend and order count per customer: top 27% vs everyone else
a <- data %>% 
  filter(top27cust == 'yes') %>% 
  summarize(avg_sales = round(sum(Sale) / n_distinct(CustomerID)),
            avg_n_orders = round(n_distinct(Invoice) / n_distinct(CustomerID))) 

b <- data %>% 
  filter(top27cust == 'no') %>%
  summarize(avg_sales = round(sum(Sale) / n_distinct(CustomerID)),
            avg_n_orders = round(n_distinct(Invoice) / n_distinct(CustomerID)))

comparison <- as.data.frame(rbind(a, b))
row.names(comparison) <- c('top27', 'others')
comparison
##        avg_sales avg_n_orders
## top27       5695           13
## others       538            3
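
The ‘10x’ figure quoted in the summary comes straight from this table:

# top customers spend roughly ten times as much on average
a$avg_sales / b$avg_sales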

5. RFM - Recency, Frequency, and Monetary

We saw from the analysis above that roughly a quarter of our customers accounted for 80% of sales. We can calculate the RFM of each customer and proceed with K-means clustering to identify segments and generate customer-centric business intelligence.

note: we will exclude customers with a negative Monetary value; these are caused by unexplained manual adjustments (StockCode ‘M’)

today <- last(data$Date)   # treat the most recent date in the data as 'today'
RFM <- customers %>% 
  transmute(CustomerID, 
            Recency = as.numeric(today-last_order),
            Frequency = orders,
            Monetary = sales)  %>% 
  filter(Monetary > 0)
  
RFM
## # A tibble: 4,288 x 4
##    CustomerID Recency Frequency Monetary
##    <fct>        <dbl>     <int>    <dbl>
##  1 18102            0        95  341777.
##  2 14646            9        87  243853.
##  3 14156            6       138  183181.
##  4 14911            0       270  137676.
##  5 13694            8       105  128172.
##  6 17511            2        42   83761.
##  7 15061            2        90   82163.
##  8 16684           14        34   75610.
##  9 13089            3       132   55828.
## 10 16754            7        35   54558.
## # ... with 4,278 more rows
write_feather(RFM, "data_RFM.feather")
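
As a preview of the clustering step, here is a minimal K-means sketch on the RFM table; the log transform, standardization, and k = 4 are assumptions to be tuned in the follow-up analysis:

# log-transform to tame the heavy skew, then standardize each dimension
rfm_scaled <- RFM %>% 
  mutate(across(c(Recency, Frequency, Monetary), ~ as.numeric(scale(log1p(.x)))))

set.seed(42)   # K-means depends on random starts
km <- kmeans(select(rfm_scaled, Recency, Frequency, Monetary), centers = 4, nstart = 25)
table(km$cluster)   # segment sizes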