This is the first in a series of 4 dynamic documents written to demonstrate the value of customer analytics to the senior management of a diversified holding company with subsidiaries in key emerging markets. The documents aim to support the argument for significant investments in data and analytics capabilities as a means of driving revenue and profit growth through improved customer relationship management.
These reports are meant for analytics practitioners.
The key insights and recommendations are presented in a separate PowerPoint deck prepared for non-technical executives.
The documents implement advanced analytics methods that attempt to:
1. Identify managerially relevant subgroups in a customer database (segmentation).
2. Build a model that predicts customer churn.
3. Build a model that predicts the spending level of customers predicted to remain loyal.
4. Estimate the lifetime value of a customer database.
Analyses are performed on a very narrow dataset. Although this will likely restrict the performance of the predictive models, the dataset is representative of the limited data currently available within most of our subsidiaries.
This report focuses on the first task: identifying valid subgroups (clusters) within customer data.
From a managerial point of view, clusters must be characterized by sufficient within-group similarity and between-group dissimilarity to justify distinct treatment.
**Performance target**
This is an unsupervised learning task. The objective is not to predict a given outcome but to discover useful patterns in the data.
It is therefore not possible to set a specific performance target in this context, save perhaps to ensure that the identified clusters are indeed valid, i.e. non-random.
library(ggplot2)
library(dplyr)
library(pastecs)
library(fpc)
library(FactoMineR)
library(plotly)
## Load the raw transaction data (tab-delimited, no header row)
cust_data <- read.delim(file = 'transactions.txt', header = FALSE, sep = '\t', dec = '.')
The dataset is proprietary customer data consisting of 51,243 observations of 3 variables: customer ID, purchase amount and date of purchase.
The data covers the period January 2nd 2005 to December 31st 2015.
The dataset is tidy and ready for analysis.
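A quick structural check of the raw import (a minimal sketch; the three columns are still unnamed V1-V3 at this point and are renamed below):
str(cust_data)
head(cust_data, 3)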
Perform various data wrangling actions and compute additional variables (key marketing indicators):
- recency: time since last purchase (days)
- frequency: number of purchase transactions
- avg_amount: average amount spent per transaction
## Name original columns
cust_data <- rename(cust_data,
customer_id = V1, purchase_amount = V2, date_of_purchase = V3)
## Convert purchase date to R date object
cust_data$date_of_purchase <- as.Date(cust_data$date_of_purchase, "%Y-%m-%d")
## Create recency variable
cust_data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
time2 = cust_data$date_of_purchase,
units = "days"))
## Create analysis dataset including 'frequency' variable
customers <- cust_data %>%
group_by(customer_id) %>%
summarise(recency = min(days_since),
frequency = n(),
avg_amount = mean(purchase_amount))
options(digits = 2, scipen = 99)
stat.desc(select(customers, -customer_id))
## recency frequency avg_amount
## nbr.val 18417.00 18417.000 18417.0
## nbr.null 0.00 0.000 0.0
## nbr.na 0.00 0.000 0.0
## min 1.33 1.000 5.0
## max 4014.33 45.000 4500.0
## range 4013.00 44.000 4495.0
## sum 23083338.00 51243.000 1064373.4
## median 1070.33 2.000 30.0
## mean 1253.37 2.782 57.8
## SE.mean 7.97 0.022 1.1
## CI.mean.0.95 15.62 0.042 2.2
## var 1169507.86 8.625 23827.0
## std.dev 1081.44 2.937 154.4
## coef.var 0.86 1.056 2.7
Data types
str(customers)
## Classes 'tbl_df', 'tbl' and 'data.frame': 18417 obs. of 4 variables:
## $ customer_id: int 10 80 90 120 130 160 190 220 230 240 ...
## $ recency : num 3829 343 758 1401 2970 ...
## $ frequency : int 1 7 10 1 2 2 5 2 1 4 ...
## $ avg_amount : num 30 71.4 115.8 20 50 ...
frequency
The distribution is severely right skewed. Most customers have made between 1 and 15 purchases. A few outliers lie beyond this point, up to the extreme of 45.
ggplot(customers, aes(x = frequency)) +
geom_histogram(bins = 20, fill = "steelblue") +
theme_classic() +
labs(x = "", title = "Purchase frequency") +
scale_x_continuous(breaks = c(0, 5, 10, 15, 25, 35, 45))
Almost 50% of customers have only ever made 1 purchase.
dim(filter(customers, frequency == 1))[1]/dim(customers)[1]
## [1] 0.49
recency
Apart from a spike on the extreme left, the distribution is fairly uniform.
ggplot(customers, aes(x = recency)) +
geom_histogram(fill = "steelblue") +
theme_classic() +
labs(x = "", y = "", title = "Recency in Days") +
scale_x_continuous(breaks = c(0, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000))
Just below 30% of customers have been active in the last year.
dim(filter(customers, recency < 365))[1]/dim(customers)[1]
## [1] 0.29
amount
The series has a mean of 58, with 50% of observations between 22 and 50. The mean is pulled above the median (30) by the severe right skew caused by some extreme outliers. The distribution stretches to 4,500.
summary(customers$avg_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 22 30 58 50 4500
ggplot(customers, aes(x = avg_amount)) +
geom_histogram(bins = 20, fill = "steelblue") +
theme_classic() +
labs(x = "average amount", title = "Average purchase amount")
ggplot(customers, aes(x = avg_amount)) +
geom_histogram(bins = 20, fill = "steelblue") +
theme_classic() +
labs(x = "amount (log10)",
title = "Average purchase amount\n (Data are log transformed to ease visualization)") +
scale_x_log10()
## Remove customer_id column and set as row names
row.names(customers) <- customers$customer_id
customers <- select(customers, -customer_id)
## Scale data to balance the influence of variables measured on different
## scales and/or with vastly different variances
clust_data <- scale(customers)
Several methods for finding sub-groups in data are available. I will implement hierarchical clustering, which partitions the data by creating a cluster tree (dendrogram). Each leaf of the tree (terminal node) represents an observation. The algorithm progressively merges observations into ever larger clusters (upward branching). See the dendrogram below.
The hierarchical clustering algorithm requires the analyst to make a number of crucial decisions.
- **Dissimilarity measure** - I use Euclidean distance for this analysis because I believe it is best suited to the task of partitioning rows of observations into sub-groups across diverse variables (a short illustration follows below).
- **Linkage** - As the algorithm moves above the bottom level of the dendrogram and is required to fuse clusters - no longer individual samples - a different dissimilarity measure is required. Linkage defines such measures. There are no hard rules guiding their selection. I shall try 3 different linkage measures.
- **Dendrogram height of fusion** - The tree builds up from individual samples to a single cluster representing the entire dataset. It must therefore be cut at a given height to obtain the desired number of clusters. A number of methods have been developed to find the optimal number of clusters in a dataset, but they are complex and computationally expensive. A simple and often useful alternative is to inspect the tree and make a judgement.
I can imagine scenarios where the decision is influenced by the business context. Inexpensive marketing messages could be customized for a large number of clusters. New products with high development costs may only be developed to target a limited number of sub-groups.
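To make the dissimilarity measure concrete, here is a minimal illustration using the scaled data prepared above: the Euclidean distance between the first two customers, computed by hand and via dist().
## Euclidean distance between the first two (scaled) customers
sqrt(sum((clust_data[1, ] - clust_data[2, ])^2))
## the same value as returned by dist()
dist(clust_data[1:2, ])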
Implement hierarchical clustering with 3 linkage types using Euclidean distance.
set.seed(11)
hc.complete <- hclust(dist(clust_data), method = "complete")
hc.average <- hclust(dist(clust_data), method = "average")
hc.ward <- hclust(dist(clust_data), method = "ward.D2")
# plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = .9)
# plot(hc.average, main = "Average Linkage", xlab = "", sub = "", cex = .9)
# plot(hc.ward, main = "Cluster Tree", xlab = "", sub = "", cex = .9)
# rect.hclust(hc.ward, k = 5, border = "darkred")
The Ward.D2 linkage provides the most balanced dendrogram and will be used to obtain the clusters.
Cut the dendrogram to produce 5 clusters. This choice is partly based on cluster validity evaluation by R packages {NbClust} and {clValid}. The tests are unfortunately not reproducible here due to memory constraints.
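For reference, a sketch of how those diagnostics could be re-run on a random subsample (left commented out because of the memory constraints mentioned above; the subsample size of 2,000 and the 2-10 cluster range are arbitrary choices):
# library(NbClust)
# set.seed(11)
# sub_sample <- clust_data[sample(nrow(clust_data), 2000), ]
# nb <- NbClust(sub_sample, distance = "euclidean", min.nc = 2, max.nc = 10,
#               method = "ward.D2", index = "all")
# table(nb$Best.nc["Number_clusters", ])  ## votes for each candidate number of clusters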
plot(hc.ward, main = "Cluster Tree cut at 5 clusters", xlab = "", sub = "", cex = .9)
abline(h = 82, col = "red")
clusters <- cutree(hc.ward, 5)
table(clusters)
## clusters
## 1 2 3 4 5
## 5890 3445 1295 7733 54
Assign customers to segments
customers$cluster <- clusters
We see the average recency, frequency and amount spent for each segment. Customers in the 5th segment spend roughly $2,300 per purchase whilst those in the 1st spend only about $35 on average and are the least active. Indeed, many seem to have lapsed completely: none have made any purchases in the last year. It should be noted that there are only 54 customers in the 5th segment and 5,890 in the 1st.
Members of the 3rd segment are the most frequent shoppers and have also been the most active during the last year.
The remaining 2 segments can be similarly analyzed.
profiles <- customers %>%
group_by(cluster) %>%
summarise_each(funs(mean))
cluster_counts <- customers %>%
group_by(cluster) %>%
summarise(counts = n())
profiles <- mutate(profiles, counts = cluster_counts$counts)
profiles
## # A tibble: 5 × 5
## cluster recency frequency avg_amount counts
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 2607 1.5 35 5890
## 2 2 678 4.9 94 3445
## 3 3 232 10.9 52 1295
## 4 4 649 1.5 45 7733
## 5 5 1249 2.1 2305 54
It is possible to gain deeper insights into the segments through more detailed analyses with R packages such as {FactoMineR}.
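As a minimal sketch of that kind of follow-up (using the {FactoMineR} package loaded above; the object names are illustrative), a PCA can be run on the scaled variables with the cluster label supplied as a supplementary qualitative variable:
## PCA on the scaled RFM variables; the cluster assignment is supplementary
## (qualitative) and does not influence the components
pca_segments <- PCA(data.frame(clust_data, cluster = as.factor(clusters)),
                    quali.sup = 4, graph = FALSE)
## individuals on the first factorial plane, coloured by cluster
plot(pca_segments, choix = "ind", habillage = 4, label = "none")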
Plot frequency against avg_amount with cluster as the grouping variable.
customers$cluster <- as.factor(customers$cluster)
ggplot(data = customers,
aes(x = frequency, y = avg_amount, colour = cluster, shape = cluster)) +
geom_point() +
theme_dark() +
labs(title = "average amount spent vs. frequency",
shape = "customer\nclusters", colour = "customer\nclusters")
Distinct clusters are apparent in the data. Overlaps do occur, notably between the 1st and 4th clusters. Nevertheless, the justification for keeping both sub-groups separate is apparent when examining the avg_amount by recency chart below.
Plot avg_amount against recency with cluster as the grouping variable.
ggplot(data = customers,
aes(x = recency, y = avg_amount, colour = cluster, shape = cluster)) +
geom_point() +
theme_dark() +
labs(title = "average amount spent vs recency", shape = "cluster", colour = "cluster")
The clusters are well separated along these 2 dimensions with some overlap between segments 2 and 4.
The first 2 principal components describe a plane in the \(N\)-dimensional feature space that captures as much information about the data as can be captured in 2 dimensions.
This is a more faithful 2-dimensional visualization of the clusters than the plots above, which only show pairwise combinations of the variables.
## Project clusters onto the first 2 principal components
prin_comp <- princomp(clust_data)
nComp <- 2
project <- predict(prin_comp, newdata=clust_data)[,1:nComp]
## Create dataframe with transformed data
project_data <- cbind(as.data.frame(project),
cluster = as.factor(clusters))
ggplot(project_data, aes(x=Comp.1, y=Comp.2)) +
geom_point(aes(shape=cluster, colour = cluster)) +
theme_dark() +
labs(x = "Principal Component 1", y = "Principal Component 2")
Segment 5 clearly separates itself out and shows significant variability. The 3rd segment shows similar variability in a different direction.
Segments 1 to 4 overlap with each other with the 2nd segment significantly overlapping with the 3rd and 4th; perhaps a clue as to its stability.
The visualizations offer interesting initial insights into patterns in the data. The overlaps between segments - significant in some cases - should be a cause for some concern.
It is worth trying to establish the reliability of the clusters. Are they ‘real’ - in the sense that they capture non-random structure in the data?
The clusterboot algorithm attempts to answer this question by evaluating the robustness of each cluster to perturbations in the data. Each cluster's stability is measured by the mean Jaccard similarity between the original cluster and the most similar cluster found in each of 100 bootstrap resamples.
clust_no <- 5
set.seed(12)
cboot_hclust <- clusterboot(clust_data,
clustermethod = hclustCBI,
method = "ward.D2",
k = clust_no)
bootMean_data <- data.frame(cluster = 1:5, bootMeans = cboot_hclust$bootmean)
Visualize evaluation results
A rule of thumb for interpreting stability values suggests that clusters scored below 0.6 should be considered unstable. Those between 0.6 and 0.75 are measuring patterns to a reasonable degree. Scores above 0.8 indicate highly stable clusters.
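The same rule of thumb can also be applied programmatically. A minimal sketch (assumption: the 0.6 and 0.8 cut points simply mirror the thresholds drawn on the chart below):
## label each cluster with its rule-of-thumb stability band
bootMean_data$assessment <- cut(bootMean_data$bootMeans,
                                breaks = c(0, 0.6, 0.8, 1),
                                labels = c("unstable", "reasonable", "highly stable"))
bootMean_data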
p <- ggplot(data = bootMean_data, aes(x = cluster, y = bootMeans)) +
  geom_point(colour = "darkred", size = 2) +
  geom_hline(yintercept = c(0.6, 0.8)) +
  labs(y = "stability", title = "Cluster Stability Evaluation") +
  theme(legend.position = "none")
ggplotly(p)
By those rules:
- segment 1 is highly stable
- segments 4 and 5 are reasonably stable
- segment 3 is borderline, and
- segment 2 is highly unstable; it is likely made up of unusual cases that do not fit comfortably anywhere else.
There are a number of important limitations with the deployment of statistical clustering methods in industrial environments.
- customers continuously enter and leave the database, and new customers may have different characteristics from old ones
- customer behaviour may evolve over time
- a particular cluster scheme may capture seasonal effects, making it inapplicable at other times of the year
- most clustering methods are not very robust to disturbances in the data; the cluster evaluation in section 9 illustrates this
All of this requires that the segmentation be refreshed frequently. However, most clustering algorithms cannot be fully automated and require the supervision of an analyst, a significant handicap in industrial settings.
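As a partial mitigation, the steps above could be wrapped into a single function so that the segmentation can at least be re-estimated on a schedule, even though its interpretation still requires an analyst. A minimal sketch (assumptions: the column names match those used earlier, the reference_date default mirrors the cut-off used above, and the function name is illustrative):
segment_customers <- function(transactions, k = 5, reference_date = "2016-01-01") {
  ## recompute the RFM indicators from raw transactions
  customers <- transactions %>%
    group_by(customer_id) %>%
    summarise(recency = min(as.numeric(difftime(reference_date, date_of_purchase,
                                                units = "days"))),
              frequency = n(),
              avg_amount = mean(purchase_amount))
  ## scale and cluster exactly as in the analysis above
  scaled <- scale(select(customers, -customer_id))
  hc <- hclust(dist(scaled), method = "ward.D2")
  mutate(customers, cluster = cutree(hc, k))
}
## example refresh on the current data
## new_segments <- segment_customers(cust_data, k = 5)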
Statistical segmentation, nevertheless, remains a powerful and useful approach to finding subgroups in customer databases. For one thing, these methods identify natural patterns in multivariate datasets, unlike most commonly used non-statistical alternatives that rely heavily on judgement. The latter approaches can be severely compromised by various types of bias, personal agendas and cognitive limitations in processing complex data.
In practical terms the approach demonstrated here can be used to obtain a faithful description of customer subgroups at a particular point in time and could serve to formulate hypotheses, guide further investigations and generally inform non-statistical managerial segmentation solutions.