Purpose
This script shows how I segment my company’s customer base by clustering average order values.
Clustering Inspiration
“Clustering algorithms for customer segmentation” https://towardsdatascience.com/clustering-algorithms-for-customer-segmentation-af637c6830ac
“Finding Optimal Number of Clusters” https://www.r-bloggers.com/finding-optimal-number-of-clusters/
“UC Business Analytics R Programming Guide - K-means Cluster Analysis” https://uc-r.github.io/kmeans_clustering
Load libraries
Load data
path <- '/Users/jarad/Desktop/Desktop/Portfolio\ Scripts/Clustering/Data\ for\ Clustering.xlsx'
data_super_main <- read_excel(path)
# fix column names
colnames(data_super_main) <- str_replace_all(colnames(data_super_main), ' ', '_')
# remove where revenue = 0, and reseller orders
data_super_main <- data_super_main %>% subset(revenue > 0 & customer_type == 'non reseller')min_ <- min(data_super_main$date_purchased)
max_ <- max(data_super_main$date_purchased)
n <- floor((max_-min_)/30)
print(paste('data is from ',min_,' to ',max_,', which is ',n, ' months', sep = '') )## [1] "data is from 2018-10-01 to 2019-03-31, which is 6 months"
Check it out
| date_purchased | orders_id | account_type | customers_email_address | part_id | products_quantity | products_price | revenue | customer_type | products_category |
|---|---|---|---|---|---|---|---|---|---|
| 2018-10-01 | 3735272 | account | email #25538 | 937 | 2 | 6.416164 | 6.416164 | non reseller | cables |
| 2018-10-01 | 3735272 | account | email #25538 | 3258 | 2 | 8.040510 | 8.040510 | non reseller | cables |
| 2018-10-01 | 3735274 | guest | email #82009 | 2886 | 2 | 1.543128 | 1.543128 | non reseller | arduino |
| 2018-10-01 | 3735274 | guest | email #82009 | 2830 | 2 | 2.030432 | 2.030432 | non reseller | arduino |
| 2018-10-01 | 3735274 | guest | email #82009 | 2884 | 3 | 8.040510 | 16.081019 | non reseller | arduino |
| 2018-10-01 | 3735276 | account | email #72846 | 3806 | 2 | 11.289200 | 11.289200 | non reseller | tools |
Get some stats
On average, account holders spend over one and a half times as much as guests.
data_super_main %>%
group_by(account_type, orders_id) %>%
summarise(order_subtotal = sum(revenue)) %>%
group_by(account_type) %>%
summarise(mean = mean(order_subtotal),
count = n()) %>%
jb_format(., c('nothing','money','number0')) %>%
jb_pretty_df(.)| account_type | mean | count |
|---|---|---|
| account | $163.91 | 72,696 |
| guest | $95.57 | 53,490 |
The density plot below shows that the collection of values for guests are typically lower than those of account holders.
df <- data_super_main %>%
group_by(account_type, orders_id) %>%
summarise(revenue = sum(revenue))
ggplot(df, aes(x = revenue, color = account_type)) +
geom_density() +
ggtitle('Density Plot of Revenue per Customer Type') +
scale_x_continuous('Revenue', limits = c(-100, 200), labels = scales::dollar) +
ylab('Density')## Warning: Removed 17730 rows containing non-finite values (stat_density).
Get avg days between orders
Account holders order less frequently, with an average of six days between orders, which is higher than guests who average one order every two days.
for (t in c('account','guest')) {
df <- data_super_main %>%
subset(account_type == t) %>%
group_by(customers_email_address) %>%
summarise(min = min(date_purchased),
max = max(date_purchased),
order_count = length(unique(orders_id))) %>%
mutate(days_elapsed = as.numeric(floor((max-min)/86400))) %>%
mutate(avg_days_between_orders = days_elapsed/order_count)
avg <- round(mean(df$avg_days_between_orders))
print(paste(t,': ',avg, sep = ''))
}## [1] "account: 6"
## [1] "guest: 2"
Group data by account type and get average order value per customer
data_main <- data_super_main %>%
group_by(account_type, customers_email_address) %>%
summarise(order_count = length(unique(orders_id)),
revenue = sum(revenue)) %>%
mutate(avg_order_value = revenue/order_count)
accounts <- data_main %>% subset(account_type == 'account')
guests <- data_main %>% subset(account_type == 'guest')Determine number of clusters using the Elbow method
The plot below compares the total “with-in clusters sum of squares” (WCSS) with the number of clusters. A single WCSS for a single cluster shows how much variation is in that cluster; so, in general, a smaller WCSS means a better cluster. When you have more than one cluster, you can add these WCSS’s up and get the total per each set of clusters.
We want to choose the number of clusters right before the line in the plot begins to level out. At this point, the point of leveling out, adding more clusters would not explain much more than what we have already explained with the previous clusters.
# scale the data
accounts_scaled <- scale(accounts$avg_order_value)
set.seed(1)
k.max <- 10
wss <- sapply(1:k.max,
function(k){kmeans(accounts_scaled, k)$tot.withinss})
plot(1:k.max, wss,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of squares")
title('Elbow Method to Determine Optimal Number of Clusters')Cluster
The plot suggests that 5 clusters is ideal.
C <- 5
set.seed(1)
k <- kmeans(accounts_scaled,
centers = C,
nstart = 25)
# get and print model fit
model_fit <- k$betweenss/k$totss
print(paste('the model fit with ',C,' clusters is ', scales::percent(model_fit), sep = ''))## [1] "the model fit with 5 clusters is 90.4%"
The model fit is nice and high :)
Add clusters to main data and group by cluster to get some stats
The table below shows the min, median, avg, max, count, and proportion of average order values, broken down by cluster.
The table shows that 96% of account holders have average order values between $1.20 and $552, while 3.7% have average order values between $553 and $3,487, etc.
accounts$cluster <- k$cluster
by_cluster <- accounts %>%
group_by(cluster) %>%
summarise(min = min(avg_order_value),
median = median(avg_order_value),
avg = mean(avg_order_value),
max = max(avg_order_value),
count = n())
by_cluster <- by_cluster[order(by_cluster$min),]
by_cluster$count_as_percentage <- by_cluster$count/sum(by_cluster$count)
by_cluster %>%
jb_format(., c('nothing','money','money','money','money','number0','percent1')) %>%
jb_pretty_df()| cluster | min | median | avg | max | count | count_as_percentage |
|---|---|---|---|---|---|---|
| 1 | $1.22 | $74.33 | $106.68 | $552.11 | 41,798 | 96.1% |
| 2 | $552.93 | $762.91 | $998.76 | $3,486.93 | 1,625 | 3.7% |
| 3 | $3,596.56 | $5,134.67 | $6,065.69 | $15,554.73 | 87 | 0.2% |
| 5 | $21,615.98 | $30,032.12 | $29,903.47 | $37,933.66 | 4 | 0.0% |
| 4 | $62,537.30 | $74,703.64 | $74,703.64 | $86,869.99 | 2 | 0.0% |
The plot below shows that cluster 1 (red) has many members with low average order values, while cluster 4 (blue) has two members who have large average order values.
ggplot(accounts, aes(x = 1:nrow(accounts),
y = avg_order_value,
color = factor(cluster))) +
geom_point(size = 5) +
ggtitle('Average Order Values and their Clusters') +
scale_y_continuous(name = 'Avg Order Value', labels = scales::dollar_format()) +
scale_x_continuous(name = 'Unique Customer ID', labels = scales::comma) +
labs(color = 'Cluster')The box plot below shows cluster 4 towering above all others.
ggplot(accounts, aes(x = cluster, y = avg_order_value, group = cluster)) +
geom_boxplot() +
ggtitle('Box Plot of Avg Order Value by Cluster') +
scale_y_continuous(name = 'Avg Order Value', labels = scales::dollar_format()) +
scale_x_continuous(name = 'Cluster', labels = scales::comma)Summary and Next Steps
This cluster analysis shows that 96% of customers have average order values between $1.22 and $552, which is a huge range. A next step could be to cluster on this segment only; it’s worth understanding, since it makes up most of our customer base.
Another approach is to break down the entire data set by US state, and then cluster within each state. The results from this method could be used for marketing campaigns or incentive programs meant to increase the average order value of account holders.
Yet another approach is to cluster the guest data along with the account holder data and compare the two. These results could be used to develop an incentive program to encourage guests to create accounts.