Clustering

Purpose

This script shows how I segment my company’s customer base by clustering average order values.

Clustering Inspiration

“Clustering algorithms for customer segmentation” https://towardsdatascience.com/clustering-algorithms-for-customer-segmentation-af637c6830ac

“Finding Optimal Number of Clusters” https://www.r-bloggers.com/finding-optimal-number-of-clusters/

“UC Business Analytics R Programming Guide - K-means Cluster Analysis” https://uc-r.github.io/kmeans_clustering

Load libraries

library(dplyr)
library(readxl)
library(stringr)
library(ggplot2)

# some homemade functions, such as a function to format dataframe columns for petty viewing
source('/Users/jarad/Fake Folder/R Libraries/jb_functions.R')

Load data

path <- '/Users/jarad/Desktop/Desktop/Portfolio\ Scripts/Clustering/Data\ for\ Clustering.xlsx'

data_super_main <- read_excel(path)

# fix column names
colnames(data_super_main) <- str_replace_all(colnames(data_super_main), ' ', '_')

# remove where revenue = 0, and reseller orders
data_super_main <- data_super_main %>% subset(revenue > 0 & customer_type == 'non reseller')

min_ <- min(data_super_main$date_purchased)
max_ <- max(data_super_main$date_purchased)
n <- floor((max_-min_)/30)
print(paste('data is from ',min_,' to ',max_,', which is ',n, ' months', sep = '') )

## [1] "data is from 2018-10-01 to 2019-03-31, which is 6 months"

Check it out

head(data_super_main) %>%
  jb_pretty_df(.)

date_purchased	orders_id	account_type	customers_email_address	part_id	products_quantity	products_price	revenue	customer_type	products_category
2018-10-01	3735272	account	email #25538	937	2	6.416164	6.416164	non reseller	cables
2018-10-01	3735272	account	email #25538	3258	2	8.040510	8.040510	non reseller	cables
2018-10-01	3735274	guest	email #82009	2886	2	1.543128	1.543128	non reseller	arduino
2018-10-01	3735274	guest	email #82009	2830	2	2.030432	2.030432	non reseller	arduino
2018-10-01	3735274	guest	email #82009	2884	3	8.040510	16.081019	non reseller	arduino
2018-10-01	3735276	account	email #72846	3806	2	11.289200	11.289200	non reseller	tools

Get some stats

On average, account holders spend over one and a half times as much as guests.

data_super_main %>%
  group_by(account_type, orders_id) %>%
  summarise(order_subtotal = sum(revenue)) %>%
  group_by(account_type) %>%
  summarise(mean = mean(order_subtotal),
            count = n()) %>%
  jb_format(., c('nothing','money','number0')) %>%
  jb_pretty_df(.)

account_type	mean	count
account	$163.91	72,696
guest	$95.57	53,490

The density plot below shows that the collection of values for guests are typically lower than those of account holders.

df <- data_super_main %>%
        group_by(account_type, orders_id) %>%
        summarise(revenue = sum(revenue))

ggplot(df, aes(x = revenue, color = account_type)) + 
  geom_density() + 
  ggtitle('Density Plot of Revenue per Customer Type') + 
  scale_x_continuous('Revenue', limits = c(-100, 200), labels = scales::dollar) +
  ylab('Density')

## Warning: Removed 17730 rows containing non-finite values (stat_density).

Get avg days between orders

Account holders order less frequently, with an average of six days between orders, which is higher than guests who average one order every two days.

for (t in c('account','guest')) {
  df <- data_super_main %>% 
        subset(account_type == t) %>%
        group_by(customers_email_address) %>%
        summarise(min = min(date_purchased),
                  max = max(date_purchased),
                  order_count = length(unique(orders_id))) %>%
        mutate(days_elapsed = as.numeric(floor((max-min)/86400))) %>%
        mutate(avg_days_between_orders = days_elapsed/order_count)  
  
  avg <- round(mean(df$avg_days_between_orders))
  print(paste(t,': ',avg, sep = ''))
}

## [1] "account: 6"
## [1] "guest: 2"

Group data by account type and get average order value per customer

data_main <- data_super_main %>%
  group_by(account_type, customers_email_address) %>%
  summarise(order_count = length(unique(orders_id)),
            revenue = sum(revenue)) %>%
  mutate(avg_order_value = revenue/order_count)

accounts <- data_main %>% subset(account_type == 'account')
guests <- data_main %>% subset(account_type == 'guest')

Determine number of clusters using the Elbow method

The plot below compares the total “with-in clusters sum of squares” (WCSS) with the number of clusters. A single WCSS for a single cluster shows how much variation is in that cluster; so, in general, a smaller WCSS means a better cluster. When you have more than one cluster, you can add these WCSS’s up and get the total per each set of clusters.

We want to choose the number of clusters right before the line in the plot begins to level out. At this point, the point of leveling out, adding more clusters would not explain much more than what we have already explained with the previous clusters.

# scale the data
accounts_scaled <- scale(accounts$avg_order_value)

set.seed(1)
k.max <- 10
wss <- sapply(1:k.max,
              function(k){kmeans(accounts_scaled, k)$tot.withinss})

plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
title('Elbow Method to Determine Optimal Number of Clusters')

Cluster

The plot suggests that 5 clusters is ideal.

C <- 5
set.seed(1)
k <- kmeans(accounts_scaled, 
            centers = C,
            nstart = 25)

# get and print model fit
model_fit <- k$betweenss/k$totss
print(paste('the model fit with ',C,' clusters is ', scales::percent(model_fit), sep = ''))

## [1] "the model fit with 5 clusters is 90.4%"

The model fit is nice and high :)

Add clusters to main data and group by cluster to get some stats

The table below shows the min, median, avg, max, count, and proportion of average order values, broken down by cluster.

The table shows that 96% of account holders have average order values between $1.20 and $552, while 3.7% have average order values between $553 and $3,487, etc.

accounts$cluster <- k$cluster

by_cluster <- accounts %>% 
      group_by(cluster) %>%
      summarise(min = min(avg_order_value),
                median = median(avg_order_value),
                avg = mean(avg_order_value),
                max = max(avg_order_value),
                count = n())
by_cluster <- by_cluster[order(by_cluster$min),]
by_cluster$count_as_percentage <- by_cluster$count/sum(by_cluster$count)

by_cluster %>%
  jb_format(., c('nothing','money','money','money','money','number0','percent1')) %>%
  jb_pretty_df()

cluster	min	median	avg	max	count	count_as_percentage
1	$1.22	$74.33	$106.68	$552.11	41,798	96.1%
2	$552.93	$762.91	$998.76	$3,486.93	1,625	3.7%
3	$3,596.56	$5,134.67	$6,065.69	$15,554.73	87	0.2%
5	$21,615.98	$30,032.12	$29,903.47	$37,933.66	4	0.0%
4	$62,537.30	$74,703.64	$74,703.64	$86,869.99	2	0.0%

The plot below shows that cluster 1 (red) has many members with low average order values, while cluster 4 (blue) has two members who have large average order values.

ggplot(accounts, aes(x = 1:nrow(accounts), 
                     y = avg_order_value, 
                     color = factor(cluster))) +
  geom_point(size = 5) + 
  ggtitle('Average Order Values and their Clusters') +
  scale_y_continuous(name = 'Avg Order Value', labels = scales::dollar_format()) +
  scale_x_continuous(name = 'Unique Customer ID', labels = scales::comma) +
  labs(color = 'Cluster')

The box plot below shows cluster 4 towering above all others.

ggplot(accounts, aes(x = cluster, y = avg_order_value, group = cluster)) +
  geom_boxplot() +
  ggtitle('Box Plot of Avg Order Value by Cluster') + 
  scale_y_continuous(name = 'Avg Order Value', labels = scales::dollar_format()) +
  scale_x_continuous(name = 'Cluster', labels = scales::comma)

Summary and Next Steps

This cluster analysis shows that 96% of customers have average order values between $1.22 and $552, which is a huge range. A next step could be to cluster on this segment only; it’s worth understanding, since it makes up most of our customer base.

Another approach is to break down the entire data set by US state, and then cluster within each state. The results from this method could be used for marketing campaigns or incentive programs meant to increase the average order value of account holders.

Yet another approach is to cluster the guest data along with the account holder data and compare the two. These results could be used to develop an incentive program to encourage guests to create accounts.