We are an online retailer based in the UK. UK is our primary market, what other countries are our customers in? Should we expand our marketing efforts internationally? If so, where?
Product analysis based on 80/20 rule
There are products with very high return rates (as high as 50%)
Customer analysis based on 80/20 rule
library(readxl)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(feather)
library(RColorBrewer)
library(scales)
data <- read_feather("online_retail.feather")
head(data)
summary(data)
str(data)
data <- rename(data, 'CustomerID' = 'Customer ID')
data$CustomerID <- as.factor(data$CustomerID)
data <- data %>%
mutate(Type = if_else(Quantity < 0, 'Return', 'Purchase'),
Sale = Quantity * Price,
Date = parse_date(format(InvoiceDate, '%Y-%m-%d')),
Year = format(InvoiceDate, '%Y'),
Month = format(InvoiceDate, '%m'),
Time = parse_time(format(InvoiceDate, '%H:%M')))
Q: What percentage of the business is in the UK?
As expected, a consistently high portion of the transactions come from the UK. However, overseas contribution is not insignificant. In January, it accounted for 20% of sales. Also, on average, international orders tend to be higher in value.
within_UK <- data %>%
mutate(YnM = paste(Year, Month, sep = '-'),
is_UK = ifelse(Country == 'United Kingdom', 1, 0)) %>%
group_by(YnM, is_UK) %>%
summarize(n = n(), agg_sales = sum(Sale)) %>%
mutate(prop = n / sum(n),
prop_sales = agg_sales / sum(agg_sales)) %>%
filter(is_UK == 1)
within_UK %>%
ggplot(aes(x = YnM)) +
geom_bar(aes(y = prop), stat = 'identity', alpha = 0.8, fill = '#1b9e77') +
geom_bar(aes(y = prop_sales), stat = 'identity', alpha = 0.8, fill = '#7570b3') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = 'n-th month', y = 'UK share of transactions')
Q: Which countries outside of the UK have the highest sales?
data %>%
filter(Country != 'United Kingdom') %>%
ggplot(aes(x = Country, y = sum(Sale)/1000000, fill = Type)) +
geom_bar(stat = 'identity') +
scale_fill_brewer(palette = 'Dark2') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(y = 'Sales (in millions)', x = '')
Q: Which countries have the highest sales-per-customer?
sales_per_cust <- data %>%
group_by(Country) %>%
summarise(n_customers = n_distinct(CustomerID),
sales = sum(Sale),
volume = sum(Quantity),
orders = n()) %>%
mutate(top_sales = min_rank(desc(sales)),
top_volume = min_rank(desc(volume)),
top_orders = min_rank(desc(orders)),
vol_per_order = volume / orders,
sales_per_order = sales / orders,
sales_per_customer = sales / n_customers) %>%
arrange(desc(sales_per_customer)) %>%
select(Country, n_customers, sales_per_customer)
sales_per_cust %>%
filter(n_customers > 1 & sales_per_customer > 2000) %>%
ggplot(aes(x = reorder(Country, -sales_per_customer), y = sales_per_customer)) +
geom_bar(stat = 'identity', fill = '#1b9e77') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = '', y = 'Sales per customer')
note: we only have 1 year worth of data, difficult to generalize this pattern
Q: Are there patterns based on time of year or time of day?
#seasonal trend
data %>%
ggplot(aes(x = Date, col = Type)) +
geom_freqpoly(bins = 80, size = 1) +
facet_grid(Type~., scales = 'free') +
scale_color_brewer(palette = 'Dark2') +
theme(legend.position = 'none') +
labs(y = 'Transaction volume')
#time of the day trend
data %>%
ggplot(aes(x = Time, col = Type))+
geom_freqpoly(bins = 80, size = 1) +
facet_grid(Type~., scales = 'free') +
scale_color_brewer(palette = 'Dark2') +
theme(legend.position = 'none') +
labs(y = 'Transaction volume')
Before we go any further, we need to address this problem:
There are instances where the first order of a customer is a Return (meaning the original purchases are not in our data). These need to be deleted.
#find instances where the first order by a customer is a Return
first_orders <- data %>%
filter(!is.na(CustomerID)) %>%
group_by(CustomerID, InvoiceDate) %>%
summarize(sales = sum(Sale)) %>%
filter(InvoiceDate == first(InvoiceDate),
sales < 0) %>%
mutate(first_order_neg = 'Yes') %>%
select(CustomerID, InvoiceDate, first_order_neg)
first_orders
## # A tibble: 272 x 3
## # Groups: CustomerID [272]
## CustomerID InvoiceDate first_order_neg
## <fct> <dttm> <chr>
## 1 12349 2009-12-04 12:49:00 Yes
## 2 12382 2010-01-26 16:25:00 Yes
## 3 12424 2009-12-07 14:43:00 Yes
## 4 12454 2010-01-26 16:24:00 Yes
## 5 12468 2010-01-19 12:13:00 Yes
## 6 12471 2009-12-02 14:43:00 Yes
## 7 12472 2009-12-09 10:47:00 Yes
## 8 12484 2009-12-03 10:57:00 Yes
## 9 12487 2009-12-09 09:16:00 Yes
## 10 12497 2010-01-26 17:21:00 Yes
## # ... with 262 more rows
#join onto the orignal and filter out
data <- left_join(data, first_orders, by = c('CustomerID' = 'CustomerID', 'InvoiceDate' = 'InvoiceDate'))
data <- filter(data, is.na(first_order_neg))
data$first_order_neg <- NULL
#Let's check again:
first_orders <- data %>%
filter(!is.na(CustomerID)) %>%
group_by(CustomerID, InvoiceDate) %>%
summarize(sales = sum(Sale)) %>%
filter(InvoiceDate == first(InvoiceDate),
sales < 0) %>%
mutate(first_order_neg = 'Yes') %>%
select(CustomerID, InvoiceDate, first_order_neg)
data <- left_join(data, first_orders, by = c('CustomerID' = 'CustomerID', 'InvoiceDate' = 'InvoiceDate'))
data <- filter(data, is.na(first_order_neg))
data$first_order_neg <- NULL
first_orders <- data %>%
filter(!is.na(CustomerID)) %>%
group_by(CustomerID, InvoiceDate) %>%
summarize(sales = sum(Sale)) %>%
filter(InvoiceDate == first(InvoiceDate),
sales < 0) %>%
mutate(first_order_neg = 'Yes') %>%
select(CustomerID, InvoiceDate, first_order_neg)
data <- left_join(data, first_orders, by = c('CustomerID' = 'CustomerID', 'InvoiceDate' = 'InvoiceDate'))
data <- filter(data, is.na(first_order_neg))
data$first_order_neg <- NULL
The 80/20 rule says that 80% of the results come from 20% of the inputs. In this case, it would imply that 80% of our sales are generated by 20% of our products. Is this true?
Let’s group together our products and take a glimpse
products <- data %>%
group_by(StockCode) %>%
summarise(sales = sum(Sale),
volume = sum(Quantity),
orders = n_distinct(Invoice),
cost = sales/volume) %>%
mutate(top_sales = min_rank(desc(sales)),
top_volume = min_rank(desc(volume)),
top_orders = min_rank(desc(orders))) %>%
arrange(top_sales)
products[1:10,]
## # A tibble: 10 x 8
## StockCode sales volume orders cost top_sales top_volume top_orders
## <chr> <dbl> <dbl> <int> <dbl> <int> <int> <int>
## 1 22423 163051. 12784 2195 12.8 1 61 2
## 2 85123A 155891. 57259 3368 2.72 2 2 1
## 3 DOT 116402. 1231 736 94.6 3 1015 77
## 4 85099B 88924. 48377 2010 1.84 4 4 3
## 5 M 73495. 1515 736 48.5 5 870 77
## 6 84879 72468. 44492 1420 1.63 6 6 7
## 7 22086 57876. 17025 965 3.40 7 35 41
## 8 47566 49650. 9270 1019 5.36 8 101 37
## 9 84347 47672. 13048 356 3.65 9 58 375
## 10 POST 46222. 3763 858 12.3 10 358 56
note: StockCode ‘M’ indicates manual adjustments to the data
Q: What % of products make up 80% of sales?
Perhaps we need to look at the diversity of our product offerings. If we reduce the number of products, and still retain nearly all of sales, that could lead to lower cost and higher profitability.
topprod <- products %>%
mutate(temp = 1/dim(products)[1],
cumprod = cumsum(temp),
cumsales = cumsum(sales / sum(sales)),
top18prod = if_else(cumsales < 0.801, 'yes', 'no'))
data <- left_join(data, topprod[, c('StockCode', 'top18prod')], by = c('StockCode' = 'StockCode'))
topprod %>%
ggplot(aes(x = cumprod*100, y = cumsales*100)) +
geom_line(size = 1)+
geom_hline(linetype = 'dashed', color = 'red', size = 0.8, yintercept = 80) +
theme(axis.title.x = element_text(size = 14), axis.title.y = element_text(size = 14)) +
labs(y = '% of sales', x = '% of products')
Q: Are some products more likely to be returned than others?
We should look into what’s going on with these products. Possible porblems could be quality issue, mismatch of website description and customer expectation, etc. Having high return rate could lead to higher cost and lower profitability.
freq_returns <- data %>%
filter(top18prod == 'yes' & StockCode != 'M') %>%
group_by(StockCode) %>%
summarize(purchased = sum(Sale[Type == 'Purchase']),
returned = sum(Sale[Type == 'Return']),
prop = -returned / purchased) %>%
filter(prop > 0.2) %>%
arrange(desc(prop))
freq_returns
## # A tibble: 10 x 4
## StockCode purchased returned prop
## <chr> <dbl> <dbl> <dbl>
## 1 85220 6875. -3646. 0.530
## 2 22341 4901. -1838. 0.375
## 3 85184D 4550. -1665. 0.366
## 4 79323W 18049. -6306. 0.349
## 5 79323LP 8481. -2538. 0.299
## 6 79323P 13761. -3934. 0.286
## 7 22085 5340. -1502. 0.281
## 8 22656 5605 -1568. 0.280
## 9 21735 8926. -2114. 0.237
## 10 20829 7167. -1597. 0.223
freq_returns %>%
ggplot(aes(x = reorder(StockCode, -prop)))+
geom_bar(aes(y = purchased), stat = 'identity') +
geom_bar(aes(y = returned), stat = 'identity', fill = 'red') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(y = 'Sales vs Returned', x = 'Product code')
The 80/20 rule applied to products, does it also apply to our customeres?
Let’s group the data by customers: ex. for this period of data, customer 18102 generated the most sales for us; the customer made 95 orders: the most recent on Dec 09
customers <- data %>%
filter(!is.na(CustomerID)) %>% ##there are missing values in CustomerID
group_by(CustomerID) %>%
summarise(sales = sum(Sale),
volume = sum(Quantity),
orders = n_distinct(Invoice),
first_order = first(Date),
last_order = last(Date)) %>%
mutate(rank_sales = min_rank(desc(sales)),
rank_volume = min_rank(desc(volume)),
rank_orders = min_rank(desc(orders))) %>%
arrange(rank_sales) %>%
select(CustomerID, sales, orders, first_order, last_order, rank_sales, rank_orders)
customers[1:10,]
## # A tibble: 10 x 7
## CustomerID sales orders first_order last_order rank_sales rank_orders
## <fct> <dbl> <int> <date> <date> <int> <int>
## 1 18102 341777. 95 2009-12-01 2010-12-09 1 11
## 2 14646 243853. 87 2009-12-02 2010-11-30 2 13
## 3 14156 183181. 138 2009-12-01 2010-12-03 3 5
## 4 14911 137676. 270 2009-12-01 2010-12-09 4 1
## 5 13694 128172. 105 2009-12-04 2010-12-01 5 10
## 6 17511 83761. 42 2009-12-02 2010-12-07 6 38
## 7 15061 82163. 90 2009-12-01 2010-12-07 7 12
## 8 16684 75610. 34 2009-12-07 2010-11-25 8 51
## 9 13089 55828. 132 2009-12-02 2010-12-06 9 7
## 10 16754 54558. 35 2010-03-08 2010-12-02 10 46
Q: What % of customers make up 80% of sales?
#top 27% of the customers generated 80% of the sales
topcust <- customers %>%
mutate(temp = 1/dim(customers)[1],
cumcustomers = cumsum(temp),
cumsales = cumsum(sales / sum(sales)),
top27cust = if_else(cumsales < 0.801, 'yes', 'no'))
#top25cust = if_else(cumsales<0.205, 'yes', 'no'))
data <- left_join(data, topcust[, c('CustomerID', 'top27cust')], by = c('CustomerID' = 'CustomerID'))
topcust %>%
ggplot(aes(x = cumcustomers*100, y = cumsales*100)) +
geom_line(size = 1)+
geom_hline(linetype = 'dashed', color = 'red', size = 0.8, yintercept = 80) +
theme(axis.title.x = element_text(size = 14), axis.title.y = element_text(size = 14)) +
labs(y = '% of sales', x = '% of customers')
Q: Where do the top 27% of the customers come from?
Almost 90% are from the UK, followed by Germany and France
data %>%
filter(top27cust == 'yes') %>%
group_by(Country) %>%
summarize(n_cust = n_distinct(CustomerID)) %>%
arrange(desc(n_cust)) %>%
mutate(prop = round(n_cust / sum(n_cust), 2))
## # A tibble: 29 x 3
## Country n_cust prop
## <chr> <int> <dbl>
## 1 United Kingdom 1055 0.89
## 2 Germany 33 0.03
## 3 France 18 0.02
## 4 Switzerland 8 0.01
## 5 Spain 7 0.01
## 6 Belgium 6 0.01
## 7 Channel Islands 6 0.01
## 8 Austria 5 0
## 9 Italy 5 0
## 10 Netherlands 5 0
## # ... with 19 more rows
Q: How often do they shop? And how much do they spend?
a <- data %>%
filter(top27cust == 'yes') %>%
summarize(avg_sales = round(sum(Sale) / n_distinct(CustomerID)),
avg_n_orders = round(n_distinct(Invoice) / n_distinct(CustomerID)))
b <- data %>%
filter(top27cust == 'no') %>%
summarize(avg_sales = round(sum(Sale) / n_distinct(CustomerID)),
avg_n_orders = round(n_distinct(Invoice) / n_distinct(CustomerID)))
c <- as.data.frame(rbind(a,b))
row.names(c) <- c('top27', 'others')
c
## avg_sales avg_n_orders
## top27 5695 13
## others 538 3
We saw from the analysis above that a quarter of our customers accounted for 80% of the sales. We can calculate the RFM of each customer and proceed with K-means clustering to identify segments and generating customer-centric business intelligence.
note: will exclude customers with negative Monetary value; they are caused unexplained Manual adjustments
today <- last(data$Date)
RFM <- customers %>%
transmute(CustomerID,
Recency = as.numeric(today-last_order),
Frequency = orders,
Monetary = sales) %>%
filter(Monetary > 0)
RFM
## # A tibble: 4,288 x 4
## CustomerID Recency Frequency Monetary
## <fct> <dbl> <int> <dbl>
## 1 18102 0 95 341777.
## 2 14646 9 87 243853.
## 3 14156 6 138 183181.
## 4 14911 0 270 137676.
## 5 13694 8 105 128172.
## 6 17511 2 42 83761.
## 7 15061 2 90 82163.
## 8 16684 14 34 75610.
## 9 13089 3 132 55828.
## 10 16754 7 35 54558.
## # ... with 4,278 more rows
write_feather(RFM, "data_RFM.feather")