Clustering

Jarad T. Bushnell

Apr 18, 2019

Purpose

This script shows how I segment my company’s customer base by clustering average order values.

Clustering Inspiration

“Clustering algorithms for customer segmentation” https://towardsdatascience.com/clustering-algorithms-for-customer-segmentation-af637c6830ac

“Finding Optimal Number of Clusters” https://www.r-bloggers.com/finding-optimal-number-of-clusters/

“UC Business Analytics R Programming Guide - K-means Cluster Analysis” https://uc-r.github.io/kmeans_clustering

Check it out

date_purchased | orders_id | account_type | customers_email_address | part_id | products_quantity | products_price | revenue   | customer_type | products_category
2018-10-01     | 3735272   | account      | email #25538            | 937     | 2                 | 6.416164       | 6.416164  | non reseller  | cables
2018-10-01     | 3735272   | account      | email #25538            | 3258    | 2                 | 8.040510       | 8.040510  | non reseller  | cables
2018-10-01     | 3735274   | guest        | email #82009            | 2886    | 2                 | 1.543128       | 1.543128  | non reseller  | arduino
2018-10-01     | 3735274   | guest        | email #82009            | 2830    | 2                 | 2.030432       | 2.030432  | non reseller  | arduino
2018-10-01     | 3735274   | guest        | email #82009            | 2884    | 3                 | 8.040510       | 16.081019 | non reseller  | arduino
2018-10-01     | 3735276   | account      | email #72846            | 3806    | 2                 | 11.289200      | 11.289200 | non reseller  | tools
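
The sample rows are individual order lines, while the clustering works on one average order value per customer. The original script is not shown, so the following is only a Python sketch of one plausible way to get from one to the other. It assumes that an order's value is the sum of its `revenue` lines and that a customer's average order value is the mean of their order totals. The function name `average_order_values` is made up for illustration, and no filtering by `account_type` is applied.

```python
from collections import defaultdict

# Order lines copied from the sample rows above:
# (orders_id, customers_email_address, revenue)
order_lines = [
    (3735272, "email #25538", 6.416164),
    (3735272, "email #25538", 8.040510),
    (3735274, "email #82009", 1.543128),
    (3735274, "email #82009", 2.030432),
    (3735274, "email #82009", 16.081019),
    (3735276, "email #72846", 11.289200),
]


def average_order_values(lines):
    """Return {customer: mean of that customer's order totals}.

    Assumption: order total = sum of the revenue of its lines.
    """
    # Step 1: add up revenue per (customer, order).
    order_totals = defaultdict(float)
    for order_id, customer, revenue in lines:
        order_totals[(customer, order_id)] += revenue

    # Step 2: collect each customer's order totals.
    per_customer = defaultdict(list)
    for (customer, _order_id), total in order_totals.items():
        per_customer[customer].append(total)

    # Step 3: average the order totals per customer.
    return {c: sum(t) / len(t) for c, t in per_customer.items()}


aov = average_order_values(order_lines)
for customer, value in sorted(aov.items()):
    print(f"{customer}: {value:.2f}")
```

With only one order per customer in this sample, each average order value is simply that order's total.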

Determine number of clusters using the Elbow method

The plot below compares the total within-cluster sum of squares (WCSS) with the number of clusters. The WCSS of a single cluster measures how much variation there is within that cluster, so in general a smaller WCSS means a tighter cluster. With more than one cluster, summing the individual WCSS values gives the total for that number of clusters.

We want to choose the number of clusters at which the line in the plot begins to level out (the "elbow"). Past this point, adding more clusters explains little more variation than the clusters we already have.
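
To make the elbow idea concrete, here is a minimal pure-Python sketch that runs one-dimensional k-means for several values of k and records the total WCSS for each. It is not the script behind this report. The data values are hypothetical, the helper names `kmeans_1d` and `total_wcss` are made up, and the deterministic quantile-based starting centroids are a simplifying assumption (library implementations typically use random or k-means++ starts).

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on 1-D data; returns (labels, centroids)."""
    data = sorted(values)
    n = len(data)
    # Assumption: start centroids at evenly spaced quantiles so the
    # result is deterministic.
    centroids = [data[int((i + 0.5) * n / k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[j]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def total_wcss(values, labels, centroids):
    """Sum of squared distances from each value to its cluster centroid."""
    return sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))


# Hypothetical average order values with three obvious groups.
values = [20, 25, 30, 35, 600, 650, 700, 5000, 5200]

wcss_by_k = {}
for k in range(1, 6):
    labels, centroids = kmeans_1d(values, k)
    wcss_by_k[k] = total_wcss(values, labels, centroids)

for k, wcss in wcss_by_k.items():
    print(f"k={k}: total WCSS = {wcss:,.1f}")
```

On this toy data the total WCSS falls sharply up to k = 3 and barely changes afterwards, which is the leveling-out pattern described above.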

Add clusters to main data and group by cluster to get some stats

The table below shows the min, median, avg, max, count, and proportion of average order values, broken down by cluster.

The table shows that 96.1% of account holders have average order values between $1.22 and $552.11, while 3.7% have average order values between $552.93 and $3,486.93. The remaining 0.2% or so fall into three small clusters with much higher values.

cluster | min        | median     | avg        | max        | count  | count_as_percentage
1       | $1.22      | $74.33     | $106.68    | $552.11    | 41,798 | 96.1%
2       | $552.93    | $762.91    | $998.76    | $3,486.93  | 1,625  | 3.7%
3       | $3,596.56  | $5,134.67  | $6,065.69  | $15,554.73 | 87     | 0.2%
5       | $21,615.98 | $30,032.12 | $29,903.47 | $37,933.66 | 4      | 0.0%
4       | $62,537.30 | $74,703.64 | $74,703.64 | $86,869.99 | 2      | 0.0%
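
A summary of this shape can be produced by grouping the per-customer values by cluster label and computing each statistic per group. The sketch below uses only Python's standard library. The input pairs are hypothetical and the function name `summarize` is made up, so it illustrates the grouping step rather than reproducing the numbers above.

```python
from statistics import mean, median

# Hypothetical (cluster, average_order_value) pairs, for illustration only.
rows = [
    (1, 20.0), (1, 25.0), (1, 30.0), (1, 35.0),
    (2, 600.0), (2, 650.0), (2, 700.0),
    (3, 5000.0), (3, 5200.0),
]


def summarize(rows):
    """Return one summary dict per cluster, ordered by minimum value."""
    by_cluster = {}
    for cluster, value in rows:
        by_cluster.setdefault(cluster, []).append(value)

    total = len(rows)
    summary = []
    for cluster, values in by_cluster.items():
        summary.append({
            "cluster": cluster,
            "min": min(values),
            "median": median(values),
            "avg": mean(values),
            "max": max(values),
            "count": len(values),
            "count_as_percentage": 100 * len(values) / total,
        })
    # Sorting by minimum value matches the row order of the table above.
    return sorted(summary, key=lambda r: r["min"])


summary = summarize(rows)
for r in summary:
    print(
        f'{r["cluster"]}  ${r["min"]:,.2f}  ${r["median"]:,.2f}  '
        f'${r["avg"]:,.2f}  ${r["max"]:,.2f}  {r["count"]:,}  '
        f'{r["count_as_percentage"]:.1f}%'
    )
```

Sorting by the minimum rather than by the cluster label explains why a table like the one above can list cluster 5 before cluster 4: k-means labels are arbitrary and carry no order.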

The plot below shows that cluster 1 (red) has many members with low average order values, while cluster 4 (blue) has two members who have large average order values.

The box plot below shows cluster 4 towering above all others.

Summary and Next Steps

This cluster analysis shows that 96% of customers have average order values between $1.22 and $552, which is a huge range. A next step could be to cluster on this segment only; it’s worth understanding, since it makes up most of our customer base.

Another approach is to break down the entire data set by US state, and then cluster within each state. The results from this method could be used for marketing campaigns or incentive programs meant to increase the average order value of account holders.

Yet another approach is to cluster the guest data along with the account holder data and compare the two. These results could be used to develop an incentive program to encourage guests to create accounts.