RFM Customer segmentation with K-Means

This article is part of a bigger project for the development of a basic marketing analytics stack. The idea is to build a stack of reusable and replicable templates for ML-based marketing analysis. Template is used here in the sense of process logic, reusable code snippets and generalised business case scenarios for which the model is applicable.

For every analytical model in the stack an interactive Shiny application is suggested. The idea of the app is to serve as a reusable framework for model presentation. This allows non-technical users to gain a visual understanding of the data and play with different scenarios directly in the app. In addition, the application allows for automation of the analysis: once the application is integrated with the data source, every change in the data will automatically update the output figures of the analysis. This enables near-instantaneous marketing action, such as grasping time-sensitive opportunities for up-sell promotions.

Structure of the article:

  1. Business objectives of the model;
  2. Algorithm behind the model;
  3. Model development process;
  4. Ideas for further analysis of the model output.

Business objectives of customer segmentation

Customer segmentation helps you understand how, why, and when your product or service is purchased/used. These insights are crucial for efficient allocation of marketing resources.

Recency, frequency and monetary (RFM) customer segmentation gives you a base model to analyse customers according to their transactional behaviour - how recent their last transaction was, how often they purchase, and how much they spend.

RFM customer segmentation is therefore an effective way to prevent customer churn and to identify and act on up-sell and cross-sell opportunities.

Why K-Means clustering

K-Means is an unsupervised algorithm and probably the most widely used algorithm for clustering. One of its major advantages is that it can handle larger data sets than, for example, hierarchical clustering approaches.

However, K-Means comes with some disadvantages as well. It is sensitive to outliers, because the model is based on measuring distances between data points: the closer two data points are, the more similar they are considered to be. For the same reason, the data used with K-Means needs to be scaled before running the algorithm.

Why R

The advantages of using R:

  • Availability of well-developed analytical packages;
  • Ease of creating beautiful layered visualisations;
  • Ease of presenting ML models through interactive applications (Shiny apps).

Model development process

The development process of the model has 2 main stages:

Data Preparation

  1. Data preparation;
  2. Feature engineering;
  3. Summary statistics of the engineered data set:
     • remove outliers;
     • scale the data;
  4. Defining any correlations between the variables.

Model Development

  1. K-Means model development:
     • define the method for measuring the distances between the data points;
     • define the optimal number of clusters;
     • build the clusters;
     • visualise the clusters.
  2. Ideas for cluster analysis and persona building:
     • categorise the clusters according to their value for the business;
     • provide geospatial visualisation of the most dominant clusters per country.

Data preparation

The data used to build the current model is the overused transactional data set of the UK online retailer that you can find here. We join the original data set with a table of world countries' geospatial coordinates so that we can produce some nice geospatial visualisations later. The main objective of this stage is to remove missing values, convert data types where necessary and examine the data for any discrepancies.

After we clean the data we can run some basic statistics to get an idea of the main data features and their distribution. The R summary() function is a perfect tool for that purpose. The stats of the data set give us two main insights: the time dimension of the data and the fact that there are some discrepancies in the values of the transactions.
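A minimal sketch of this stage is shown below, assuming the transactions come as a CSV with the usual columns of this data set (Quantity, InvoiceDate, UnitPrice, CustomerID, Country) and a hypothetical countries look-up file with coordinates; the file names and the date parser are assumptions.

  library(readr)
  library(dplyr)
  library(lubridate)

  # load the transactional data (file name is an assumption)
  retail <- read_csv("online_retail.csv")

  # drop records without a customer id and convert data types
  retail <- retail %>%
    filter(!is.na(CustomerID)) %>%
    mutate(InvoiceDate = mdy_hm(InvoiceDate),   # adjust the parser to the raw date format
           CustomerID  = as.character(CustomerID))

  # join the geospatial coordinates of the countries (hypothetical look-up table)
  countries <- read_csv("world_countries_coordinates.csv")
  retail <- retail %>% left_join(countries, by = "Country")

  # basic descriptive statistics
  summary(retail)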

The time span of the data set is about a year:

  • maxDate = 2011-12-09
  • minDate = 2010-12-01

The summary of the data shows that there are some discrepancies in the values of the transactions. There are some transactions with total product price of 0 and products with negative quantities. This means that we might be dealing with refunds and discounts.

The way we deal with refunds and discounts in transactional data depends very much on the business case. Are we interested in all purchases, as they register the frequency of touch points with the customer, or only in the monetised transactions, as they add actual net value to the business?

For the purpose of the current analysis we will remove the refunds and discounts together with their original transactions. These can be analysed later on segment level, for example which of the segments claimed the most refunds or which of the segments were the most responsive to promotions. This type of analysis can turn out to be quite insightful not only for the development of the products and services, but also for the further profiling of the segments.
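A simple sketch of this filter, assuming the Quantity and UnitPrice columns from above; matching each refund back to its original transaction is left out here and would need an additional join on customer and product.

  library(dplyr)

  # keep only monetised sales: refunds carry negative quantities,
  # discounts and adjustments carry a zero or negative unit price
  retail_clean <- retail %>%
    filter(Quantity > 0, UnitPrice > 0)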

Feature engineering

We want to build an RFM-based segmentation of the customers. Therefore we first need to generate the values for the recency, frequency and monetary variables. We can use the definitions below.

recency = the most recent date in the data set minus the date of the customer's last purchase. The lower the value, the more recent the customer's purchase.

frequency = the number of distinct purchases. The higher the value, the more often the customer interacts with the business.

monetary = the total amount spent per customer. The higher the value, the more valuable the customer is for the business.

These variables, together with the customer id, will form the data frame for our clusters, as sketched below.
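A dplyr sketch of the three features, assuming the column names used above (InvoiceNo, InvoiceDate, Quantity, UnitPrice, CustomerID):

  library(dplyr)

  # reference date: the most recent date in the data set
  analysis_date <- max(retail_clean$InvoiceDate)

  rfm <- retail_clean %>%
    group_by(CustomerID) %>%
    summarise(
      recency   = as.numeric(difftime(analysis_date, max(InvoiceDate), units = "days")),
      frequency = n_distinct(InvoiceNo),
      monetary  = sum(Quantity * UnitPrice)
    )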

As we mentioned earlier, K-Means is calculated based on the distances between the data points, which makes it sensitive to the presence of outliers. Running summary statistics of the data set will help us determine whether there are any extreme values in the model features. Again, the quickest way to do that is to use the summary() function.

The summary statistics show that we have outliers in the frequency and monetary variables. For the purpose of getting the clustering right we need to remove these outliers. However, we should not get rid of them completely, as they represent our most valuable customers. These customers can even be defined as our loyal customers - a high level of interaction and purchasing value. We can mark them as a separate segment and bring them back when analysing the segments.

To remove the outliers we need to identify a numerical cut-off range that differentiates an outlier from a non-outlier. We will use the 1.5*iqr rule for that purpose.

We can calculate the interquartile range by finding first the quantiles. Here we use the quantile() function to find the 25th and the 75th percentile of the dataset and the IQR() function, which gives the difference between the 75th and 25th percentiles.

Then we calculate the upper and lower range limits of our data:

  upper <- Q3 + 1.5 * iqr
  lower <- Q1 - 1.5 * iqr

We can use the subset() function to apply the 1.5 * IQR rule and remove the outliers from the data set.

Now we can check whether the value range looks more evenly spread by visualising it with a boxplot, as in the sketch below.
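Putting the steps together, here is a sketch for the monetary variable; the same trimming would be repeated for frequency:

  # 1.5 * IQR rule, shown here for the monetary variable
  Q1  <- quantile(rfm$monetary, 0.25)
  Q3  <- quantile(rfm$monetary, 0.75)
  iqr <- IQR(rfm$monetary)

  upper <- Q3 + 1.5 * iqr
  lower <- Q1 - 1.5 * iqr

  rfm_trimmed <- subset(rfm, monetary > lower & monetary < upper)

  # compare the spread before and after trimming
  boxplot(rfm$monetary, rfm_trimmed$monetary,
          names = c("with outliers", "trimmed"))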

Scaling (normalising) the data

The next step is to normalise the variables to a state where they have a mean of 0 and a variance of 1. For this we will use the R scale() function. This step is important, as we want the K-Means algorithm to weight the variables equally when creating the clusters.
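A short sketch, keeping the customer ids aside so that only the numeric features are scaled:

  # scale the numeric features; the ids are kept separately
  rfm_ids    <- rfm_trimmed$CustomerID
  rfm_scaled <- scale(rfm_trimmed[, c("recency", "frequency", "monetary")])

  summary(rfm_scaled)   # the means should now be approximately 0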

Now we can use a heatmap to identify any correlations between the variables. This step is important, as any strong correlations will lead to bias when building the model.
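One way to produce such a heatmap is cor() combined with the corrplot package (an assumed choice):

  library(corrplot)

  # correlation matrix of the scaled features, drawn as a heatmap
  cor_mat <- cor(rfm_scaled)
  corrplot(cor_mat, method = "color", addCoef.col = "black")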

We have only three variables. Of them, only recency and frequency show a negative correlation. This correlation should actually be read as positive because, as we already mentioned, the recency value is measured the opposite way, i.e. the lower, the better.

Another observation is that the more frequent transactions have a higher value, which makes sense. We can take a closer look at these correlations.

To sum it up, it turns out that the more recent transactions are more frequent and, as a result, have a higher value. This is already an anomaly in the customer behaviour. It might be due to a very successful marketing campaign, some sort of seasonality of the business, or both.

As the cut-off date of the data set is in December (the Christmas period), the closer to Christmas we get, the more frequent the orders per customer become. This is already a hint of existing seasonality in the business, so in order to be accurate in our segmentation we would need data from at least two years, i.e. two shopping seasons. Unfortunately, the available data set covers only one year. However, we will stick to it, with the idea that it is just a showcase of the way the algorithm works.

Building the K-Means model

Choose a method to measure the distances between the data points

The main objective of K-Means is to group the data into uniform sets of similar observations that are distinguishable from other sets of similar observations. The similarity and dissimilarity between the observations is defined by computing the distances between the data points. There are different ways of measuring the distances between data points. The standard K-Means algorithm, however, minimises the within-cluster sum of squared Euclidean distances, so the Euclidean distance between the data points will define our clusters.

Define the number of clusters

The next step is to specify the optimal number of clusters we can split the data into. First we can try to find the number of clusters manually, just to get a feeling of how K-Means works. Then we can automate this process by using an algorithm. For the manual exploration we can use the kmeans() function. Let's calculate and visualise our options with clusters from 2 to 8, as in the sketch below. The “centers” argument of the function defines the number of clusters, while “nstart” defines the number of times the data will be reshuffled to find the most cohesive clusters given the set-up. Here we will set nstart = 25. This means that R will run with 25 random assignments and will select the one with the lowest within-cluster variation.
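A sketch of this exploration; fviz_cluster() from factoextra is an assumed choice for plotting the cluster solutions:

  library(factoextra)

  set.seed(123)

  # try between 2 and 8 clusters and inspect the solutions visually
  for (k in 2:8) {
    km <- kmeans(rfm_scaled, centers = k, nstart = 25)
    print(fviz_cluster(km, data = rfm_scaled) + ggplot2::ggtitle(paste("k =", k)))
  }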

There are different automated ways to define the optimal number of clusters. We will compare the output of two methods - the Elbow and the Average Silhouette method. For both of them we use the fviz_nbclust() function from the factoextra package, as shown after the two definitions below.

Elbow method - the basic idea of this method is that the total intra-cluster variation, i.e. the total within-cluster sum of squares, is minimised.

Average Silhouette method - measures the quality of the clustering, i.e. how well each object lies within its cluster.
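Both checks are one-liners with fviz_nbclust():

  library(factoextra)

  # Elbow method: total within-cluster sum of squares against the number of clusters
  fviz_nbclust(rfm_scaled, kmeans, method = "wss")

  # Average silhouette method: average silhouette width against the number of clusters
  fviz_nbclust(rfm_scaled, kmeans, method = "silhouette")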

Both of the methods suggest that the best number of clusters is 4. Let's apply this result in a final K-Means calculation, as sketched below. Use the set.seed() function to get a reproducible result - remember from above that K-Means uses random reshuffling of the observations to get the optimal cluster split.
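The final model (the seed value itself is arbitrary, and fviz_cluster() from factoextra is again an assumed choice for the plot):

  set.seed(123)

  # final model with the suggested 4 clusters
  km_final <- kmeans(rfm_scaled, centers = 4, nstart = 25)

  km_final$size                          # number of customers per cluster
  fviz_cluster(km_final, data = rfm_scaled)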

After running the algorithm we get 4 clusters of pretty similar sizes, apart from one, which is about double the size of the rest. In the remaining part of this article we will explore some ideas for further analysis of the segments.

Segment Analysis

To begin with, we can get some basic figures describing the size of the segments and the way they score within the RFM model.

We can define the size by the number of observations assigned to each cluster, while the mean value of each feature (recency, frequency and monetary) will give us an idea of which are our best and worst performing segments.

We can get these figures by running a summary table of the clustered data using the summarise_all() function. This function works only with numeric data, so before running it we need to exclude the customer id column, as it has character format. After we get our summary table we transpose it to facilitate its visualisation, as in the sketch below.
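A dplyr sketch, attaching the cluster assignment back to the unscaled RFM data frame first (the object names follow the earlier sketches):

  library(dplyr)

  # attach the cluster assignment to the (unscaled) RFM data frame
  rfm_segments <- rfm_trimmed %>%
    mutate(segment = km_final$cluster)

  # mean feature values per segment (numeric columns only)
  segment_summary <- rfm_segments %>%
    select(-CustomerID) %>%
    group_by(segment) %>%
    summarise_all(mean)

  # segment sizes
  rfm_segments %>% count(segment)

  # transpose the summary to ease visualisation
  t(segment_summary)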

Now we can visualise the segments to determine which are the most and the least valuable ones. This will give us an idea of how and where to focus our marketing efforts.
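One possible visualisation is a faceted bar chart of the mean feature values per segment (ggplot2 and tidyr are assumed here):

  library(tidyr)
  library(ggplot2)

  segment_summary %>%
    pivot_longer(-segment, names_to = "feature", values_to = "mean_value") %>%
    ggplot(aes(x = factor(segment), y = mean_value)) +
    geom_col() +
    facet_wrap(~ feature, scales = "free_y") +
    labs(x = "Segment", y = "Mean value")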

From the visualisation of the summary statistics we can conclude that our most active and most recent segment is segment 3.

Segment 1 can be defined as totally lapsed, with on average one purchase made almost a year ago. The interesting thing about this segment is that with one purchase it generates about half the value of segment 3, the most active one. Therefore it might be worth the effort to try to re-engage this segment.

Our largest segment is segment 2. However, with only 2 purchases on average this segment is not particularly active, and it also has the lowest propensity to spend. This might be the seasonal shopper. Whatever the case, this segment represents the largest part of the customer base, yet it does not represent the best customer for the business. We need to analyse the demographics and the behaviour of this segment further to see whether it has the propensity to become the best segment, or whether the business needs to change its product or market.

Finally, segment 4 is our best segment in terms of value generation. Unfortunately, it is also our smallest segment. Like segment 2, this segment needs deeper analysis, as we would like to acquire more customers like it.

Geographic Profiling of the segments

Another interesting analysis of the segments is their geographic distribution. To perform this analysis we need to attach the segments to the initial data frame and get the geographic attributes of the customers. Also do not forget to deduplicate your original data set and leave only unique customer records.

We can run some initial bar chart visualisations of the segments performance per country.
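A sketch of such a chart, building on the objects from the previous sketches; the deduplication keeps one country record per customer:

  library(dplyr)
  library(ggplot2)

  # one row per customer with their country and segment
  customer_countries <- retail_clean %>%
    distinct(CustomerID, Country) %>%
    inner_join(rfm_segments %>% select(CustomerID, segment), by = "CustomerID")

  # customers per segment and country, excluding the dominant UK market
  customer_countries %>%
    filter(Country != "United Kingdom") %>%
    count(Country, segment) %>%
    ggplot(aes(x = Country, y = n, fill = factor(segment))) +
    geom_col(position = "dodge") +
    coord_flip() +
    labs(fill = "Segment", y = "Customers")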

We can see that the number of customers for the UK is disproportionately higher compared to the rest of the countries. This makes sense, as the data set is from a UK-based online retailer. We will separate the analysis of the segment distribution for the UK from the rest of the countries to get a clear picture of the segment performance per country.

Our best segment 4 is well represented in Germany and France. It is a dominant segment in Germany and in some interesting countries like Austria, Portugal and Switzerland.

Segment 2 has a dominant representation in the UK market. This was our largest, but not best, segment. The lapsed segment 1 also has a higher representation than our best segment 4.

Finally we can get a geospatial representation of the dominant segments per country.
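One way to sketch such a map is to compute the dominant segment per country and fill ggplot2's built-in world polygons with it; this is an assumed alternative to plotting the joined coordinates table directly, map_data() requires the maps package, and some country names may need recoding to match the map's region names:

  library(dplyr)
  library(ggplot2)

  # dominant segment per country
  dominant <- customer_countries %>%
    count(Country, segment, name = "customers") %>%
    group_by(Country) %>%
    slice_max(customers, n = 1, with_ties = FALSE) %>%
    ungroup()

  # join onto the world polygons (country names may need recoding to match)
  world <- map_data("world") %>%
    left_join(dominant, by = c("region" = "Country"))

  ggplot(world, aes(x = long, y = lat, group = group, fill = factor(segment))) +
    geom_polygon(colour = "grey70") +
    labs(fill = "Dominant segment")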

The map gives a clear picture of the segment dominance. It can be a perfect starting point for market analysis and business performance measures. This type of analysis, together with a demographic profiling of the segments, will generate enough information for the business to build hypotheses about needed improvements in the product features or marketing communications.

We can conclude by saying that K-Means clustering is a great starting point for understanding the customer base, improving marketing communications and, most of all, testing different feature developments of the business's products and services.