1. Introduction

This is the first in a series of four dynamic documents written to demonstrate the value of customer analytics to the senior management of a diversified holding company with subsidiaries in key emerging markets. The documents support the case for significant investment in data and analytics capabilities as a means of driving revenue and profit growth through improved customer relationship management.

These reports are meant for analytics practitioners; the key insights and recommendations are presented in a separate PowerPoint deck prepared for non-technical executives.

The documents implement advanced analytics methods that attempt to:

1. Identify managerially relevant subgroups in a customer database (segmentation).

2. Build a model that predicts customer churn.

3. Build a model that predicts the spending level of customers predicted to remain loyal.

4. Estimate the lifetime value of a customer database.

Analyses are performed on a very narrow dataset. Although this will likely restrict the performance of the predictive models, the dataset is representative of the limited data currently available within most of our subsidiaries.


2. Objective

This report focuses on the first task: identifying valid subgroups (clusters) within customer data.

From a managerial point of view, clusters must be characterized by sufficient within-group similarity and between-group dissimilarity to justify distinct treatment.

**Performance target**

This is an unsupervised learning task. The objective is not to predict a given outcome but to discover useful patterns in the data.

It is therefore not possible to set a specific performance target in this context, save perhaps to ensure that the identified clusters are valid, i.e. non-random.


3. Set Up

3.1 Load Packages

library(ggplot2)      # plotting
library(dplyr)        # data wrangling
library(pastecs)      # descriptive statistics (stat.desc)
library(fpc)          # cluster evaluation (clusterboot)
library(FactoMineR)   # factor analysis and cluster description
library(plotly)       # interactive plots (ggplotly)

3.2 Load Data

cust_data <- read.delim(file = "transactions.txt", header = FALSE, sep = "\t", dec = ".")

4. Dataset Description

The dataset is proprietary customer data consisting of 51,243 observations across 3 variables:

  • customer ID
  • purchase amount in USD
  • transaction date

The data covers the period January 2nd 2005 to December 31st 2015.

The dataset is tidy and ready for analysis.
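A quick sanity check of the raw load confirms the row count and date coverage stated above (note that at this stage the columns still carry their default names V1-V3):

nrow(cust_data)                # 51,243 transactions expected
range(as.Date(cust_data$V3))   # "2005-01-02" to "2015-12-31" expected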

5. Data Preparation

Perform various data wrangling steps and compute additional variables (key marketing indicators):

  • recency - time since last purchase (days)

  • frequency - number of purchase transactions

  • avg_amount - average amount spent per transaction

## Name original columns
cust_data <- rename(cust_data, 
                     customer_id = V1, purchase_amount = V2, date_of_purchase = V3)

## Convert purchase date to R date object
cust_data$date_of_purchase <- as.Date(cust_data$date_of_purchase, "%Y-%m-%d")

## Create recency variable: days between a reference date (the day after
## the observation window ends) and each transaction
cust_data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
                                            time2 = cust_data$date_of_purchase,
                                            units = "days"))

## Create analysis dataset including 'frequency' variable
customers <- cust_data %>%
                group_by(customer_id) %>%
                summarise(recency = min(days_since), 
                          frequency = n(), 
                          avg_amount = mean(purchase_amount))
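A quick check that the aggregation behaved as expected (one row per customer, with transaction counts summing back to the raw total):

nrow(customers)            # 18,417 unique customers expected
sum(customers$frequency)   # 51,243 transactions in total expected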

6. Data Exploration

6.1 Descriptive statistics

options(digits = 2, scipen = 99)
stat.desc(select(customers, -customer_id))
##                  recency frequency avg_amount
## nbr.val         18417.00 18417.000    18417.0
## nbr.null            0.00     0.000        0.0
## nbr.na              0.00     0.000        0.0
## min                 1.33     1.000        5.0
## max              4014.33    45.000     4500.0
## range            4013.00    44.000     4495.0
## sum          23083338.00 51243.000  1064373.4
## median           1070.33     2.000       30.0
## mean             1253.37     2.782       57.8
## SE.mean             7.97     0.022        1.1
## CI.mean.0.95       15.62     0.042        2.2
## var           1169507.86     8.625    23827.0
## std.dev          1081.44     2.937      154.4
## coef.var            0.86     1.056        2.7

Data types

str(customers)
## Classes 'tbl_df', 'tbl' and 'data.frame':    18417 obs. of  4 variables:
##  $ customer_id: int  10 80 90 120 130 160 190 220 230 240 ...
##  $ recency    : num  3829 343 758 1401 2970 ...
##  $ frequency  : int  1 7 10 1 2 2 5 2 1 4 ...
##  $ avg_amount : num  30 71.4 115.8 20 50 ...

6.2 Distributions

frequency

The distribution is severely right-skewed. Most customers have made between 1 and 15 purchases; a few outliers lie beyond this point, up to the extreme of 45.

ggplot(customers, aes(x = frequency)) +
   geom_histogram(bins = 20, fill = "steelblue") +
   theme_classic() +
   labs(x = "", title = "Purchase frequency") +
   scale_x_continuous(breaks = c(0, 5, 10, 15, 25, 35, 45))

Almost 50% of customers have only ever made 1 purchase.

mean(customers$frequency == 1)
## [1] 0.49

recency

Apart from a spike on the extreme left, the distribution is fairly uniform.

ggplot(customers, aes(x = recency)) +
   geom_histogram(fill = "steelblue") +
   theme_classic() +
   labs(x = "", y = "", title = "Recency in Days") +
  scale_x_continuous(breaks = c(0, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000))

Just below 30% of customers have been active in the last year.

mean(customers$recency < 365)
## [1] 0.29

amount

The series has a mean of 58, with 50% of observations between 22 and 50. The mean is pulled above the median by the severe right skew caused by some extreme outliers; the distribution stretches all the way to 4,500.

summary(customers$avg_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5      22      30      58      50    4500
ggplot(customers, aes(x = avg_amount)) +
   geom_histogram(bins = 20, fill = "steelblue") +
   theme_classic() +
   labs(x = "average amount", title = "Average purchase amount")

ggplot(customers, aes(x = avg_amount)) +
   geom_histogram(bins = 20, fill = "steelblue") +
   theme_classic() +
   labs(x = "amount (log10)", 
        title = "Average purchase amount\n (Data are log transformed to ease visualization)") +
   scale_x_log10()


7. Statistical Segmentation

7.1 Data preparation for modelling

## Convert to a plain data frame (tibbles do not support row names),
## then set customer_id as row names and drop the column
customers <- as.data.frame(customers)
row.names(customers) <- customers$customer_id
customers <- select(customers, -customer_id)

## Scale the data to balance the influence of variables measured on
## different scales and/or with vastly different variances
clust_data <- scale(customers)
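A quick confirmation that scaling behaved as expected (each column now has mean 0 and unit standard deviation):

round(colMeans(clust_data), 10)   # all effectively 0 after centering
apply(clust_data, 2, sd)          # all equal to 1 after scaling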

7.2 Clustering Method

Several methods for finding subgroups in data are available. I will implement hierarchical clustering, which partitions the data by creating a cluster tree (dendrogram). Each leaf of the tree (terminal node) represents an observation; the algorithm progressively merges observations into ever larger clusters (upward branching). See the dendrograms below.

The hierarchical clustering algorithm requires the analyst to make a number of crucial decisions.

  • Dissimilarity measure - I use Euclidean distance for this analysis, as it is well suited to partitioning observations measured on continuous variables once these have been scaled to comparable units (see the formula after this list).

  • Linkage - As the algorithm moves above the bottom level of the dendrogram, it must fuse clusters rather than individual samples, so a different kind of dissimilarity measure is required between groups. Linkage defines such measures. There are no hard rules guiding their selection, so I shall try 3 different linkage measures.

  • Dendrogram height of fusion - The tree builds up from individual samples to a single cluster representing the entire dataset, so it must be cut at a chosen height to obtain the desired number of clusters. A number of methods have been developed to find the optimal number of clusters in a dataset, but they are complex and computationally expensive. A simple and often useful alternative is to inspect the tree and make a judgement.
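For reference, the Euclidean distance between two observations \(x_i\) and \(x_j\) measured on \(p\) variables is

\[ d(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \]

which is why the variables were scaled in Section 7.1: without scaling, the variable with the largest variance would dominate the sum.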

I can imagine scenarios where the decision is influenced by the business context. Inexpensive marketing messages could be customized for a large number of clusters. New products with high development costs may only be developed to target a limited number of sub-groups.

Implement hierarchical clustering with 3 linkage types using Euclidean distance.

set.seed(11)
d <- dist(clust_data)   # Euclidean distance matrix (computed once)
hc.complete <- hclust(d, method = "complete")
hc.average  <- hclust(d, method = "average")
hc.ward     <- hclust(d, method = "ward.D2")

7.3 Dendrograms

plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = .9)
plot(hc.average, main = "Average Linkage", xlab = "", sub = "", cex = .9)
plot(hc.ward, main = "Ward Linkage", xlab = "", sub = "", cex = .9)
rect.hclust(hc.ward, k = 5, border = "darkred")

The Ward.D2 linkage provides the most balanced dendrogram and will be used to obtain the clusters.

Cut the dendrogram to produce 5 clusters. This choice is partly based on cluster validity evaluation with the R packages {NbClust} and {clValid}; those tests are unfortunately not reproduced here due to memory constraints.
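A minimal sketch of how such a validity check could be run on a random subsample to stay within memory (the subsample size of 2,000 and the seed are illustrative assumptions; results are not reproduced here):

library(NbClust)
set.seed(13)
idx <- sample(nrow(clust_data), 2000)   # subsample to keep memory use manageable
nb  <- NbClust(clust_data[idx, ], distance = "euclidean",
               min.nc = 2, max.nc = 10, method = "ward.D2")
## NbClust computes a battery of validity indices and reports the number
## of clusters preferred by the majority of them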

plot(hc.ward, main = "Cluster Tree cut at 5 clusters", xlab = "", sub = "", cex = .9)
abline(h = 82, col = "red")

clusters <- cutree(hc.ward, 5)
table(clusters)
## clusters
##    1    2    3    4    5 
## 5890 3445 1295 7733   54
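Expressed as shares of the customer base (a one-line check):

round(prop.table(table(clusters)), 3)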

Assign customers to segments

customers$cluster <- clusters

7.4 Segment Centroids

We see the average recency, frequency and amount spent for each segment. Customers in the 5th segment spend roughly $2,300 per purchase on average, whilst those in the 1st spend only $35 and are the least active; indeed, many appear to have lapsed completely, and none have made a purchase in the last year. Note, however, that the 5th segment contains only 54 customers, against 5,890 in the 1st.

Members of the 3rd segment are the most frequent shoppers and have also been the most active during the last year.

The remaining 2 segments can be similarly analyzed.

profiles <- customers %>% 
  group_by(cluster) %>%
  summarise(recency = mean(recency),
            frequency = mean(frequency),
            avg_amount = mean(avg_amount),
            counts = n())
profiles
## # A tibble: 5 × 5
##   cluster recency frequency avg_amount counts
##     <int>   <dbl>     <dbl>      <dbl>  <int>
## 1       1    2607       1.5         35   5890
## 2       2     678       4.9         94   3445
## 3       3     232      10.9         52   1295
## 4       4     649       1.5         45   7733
## 5       5    1249       2.1       2305     54

It is possible to gain deeper insights into the segments through more detailed analyses with R packages such as {FactoMineR}.
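As a minimal sketch of what such an analysis could look like, catdes() from {FactoMineR} describes each cluster by the variables that most distinguish it (the column index 4 assumes the cluster label is the fourth column, as constructed above):

seg_data <- data.frame(customers[, c("recency", "frequency", "avg_amount")],
                       cluster = as.factor(customers$cluster))
seg_desc <- catdes(seg_data, num.var = 4)   # column 4 holds the cluster label
seg_desc$quanti                             # per-segment tests on the numeric variables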

8. Cluster Visualizations

8.1 Pairwise

avg_amount against frequency with segment as the grouping variable.

customers$cluster <- as.factor(customers$cluster)
ggplot(data = customers,
       aes(x = frequency, y = avg_amount, colour = cluster, shape = cluster)) +
  geom_point() +
  theme_dark() +
  labs(title = "average amount spent vs. frequency", 
       shape = "customer\nclusters", colour = "customer\nclusters")

Distinct clusters are apparent in the data, though overlaps do occur, notably between the 1st and 4th clusters. Nevertheless, the justification for keeping both subgroups separate becomes apparent when examining the avg_amount vs. recency chart below.


avg_amount against recency with segment as the grouping variable.

ggplot(data = customers,
       aes(x = recency, y = avg_amount, colour = cluster, shape = cluster)) +
  geom_point() +
  theme_dark() +
  labs(title = "average amount spent vs recency",  shape = "cluster", colour = "cluster")

The clusters are well separated along these 2 dimensions with some overlap between segments 2 and 4.

8.2 First 2 Principal Components

The first 2 principal components describe a plane in the \(N\)-dimensional feature space that captures as much of the variation in the data as is possible to capture in 2 dimensions.

This is a more faithful 2-dimensional visualization of the clusters than the plots above, which only show pairwise combinations of the variables.

## Project the observations onto the first 2 principal components
prin_comp <- princomp(clust_data)
nComp <- 2
project <- predict(prin_comp, newdata = clust_data)[, 1:nComp]

## Create a data frame holding the projected data and cluster labels
project_data <- cbind(as.data.frame(project),
                      cluster = as.factor(clusters))

ggplot(project_data, aes(x = Comp.1, y = Comp.2)) +
   geom_point(aes(shape = cluster, colour = cluster)) +
   theme_dark() +
   labs(x = "Principal Component 1", y = "Principal Component 2")
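Before interpreting the plot, it is worth checking how much of the total variance this 2-dimensional plane actually captures (a quick check on the fitted princomp object):

## Cumulative proportion of variance captured by the first 2 components
pc_var <- prin_comp$sdev^2
cumsum(pc_var / sum(pc_var))[1:2]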

Segment 5 clearly separates itself out and shows significant variability. The 3rd segment shows similar variability in a different direction.

Segments 1 to 4 overlap with each other, with the 2nd segment overlapping significantly with the 3rd and 4th; perhaps a clue as to its stability, which is tested in the next section.


The visualizations offer interesting initial insights into patterns in the data. The overlaps between segments - significant in some cases - should be a cause for some concern.

It is worth trying to establish the reliability of the clusters. Are they ‘real’ - in the sense that they capture non-random structure in the data?

9. Cluster Evaluation

The clusterboot algorithm (from the {fpc} package) attempts to answer this question by evaluating the robustness of each cluster to perturbations in the data. Each cluster's stability is measured as the mean Jaccard similarity between the original cluster and the most similar cluster found in each of 100 bootstrap resamples.
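For two clusters viewed as sets of observations \(A\) and \(B\), the Jaccard similarity is

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

so a mean value (bootmean) close to 1 indicates that a cluster is recovered almost intact in most resamples.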

clust_no <- 5

set.seed(12)
cboot_hclust <- clusterboot(clust_data, 
                            clustermethod = hclustCBI,
                            method = "ward.D2", 
                            k = clust_no)
bootMean_data <- data.frame(cluster = 1:5, bootMeans = cboot_hclust$bootmean) 

Visualize evaluation results

A rule of thumb for interpreting stability values suggests that clusters scoring below 0.6 should be considered unstable, those between 0.6 and 0.75 are capturing a pattern to a reasonable degree, and scores above 0.8 indicate highly stable clusters.

p <- ggplot(data = bootMean_data, aes(x = cluster, y = bootMeans)) +
     geom_point(colour = "darkred", size = 2) +   # constants set outside aes()
     geom_hline(yintercept = c(0.6, 0.8)) +
     labs(y = "stability", title = "Cluster Stability Evaluation")

ggplotly(p)

By those rules:

  • segment 1 is highly stable

  • segments 4 and 5 are reasonably so

  • segment 3 is borderline, and

  • segment 2 is highly unstable. It is likely made up of unusual cases that do not comfortably fit anywhere else.

10. Critique of Statistical Clustering Methods

There are a number of important limitations to deploying statistical clustering methods in industrial environments.

  1. The method captures patterns in a dataset at a specific moment in time. This poses some challenges:

  • customers continuously enter and leave the database, and new customers may have different characteristics from old ones

  • customer behaviour may evolve over time

  2. A particular clustering scheme may capture seasonal effects, making it inapplicable at other periods of the year.

  3. Most clustering methods are not very robust to disturbances in the data. The cluster evaluation in Section 9 illustrates this.

  4. All of this means the segmentation must be re-estimated frequently. However, most clustering algorithms cannot be fully automated and require the supervision of an analyst, a significant handicap in industrial settings.

11. Conclusion

Statistical segmentation nevertheless remains a powerful and useful approach to finding subgroups in customer databases. For one thing, these methods identify natural patterns in multivariate data, unlike the most commonly used non-statistical alternatives, which rely heavily on judgement. The latter approaches can be severely compromised by various types of bias, personal agendas and cognitive limitations in processing complex data.

In practical terms the approach demonstrated here can be used to obtain a faithful description of customer subgroups at a particular point in time and could serve to formulate hypotheses, guide further investigations and generally inform non-statistical managerial segmentation solutions.