This is the third of a series of 4 dynamic documents written to demonstrate the value of customer analytics to the senior management of a diversified holding company with subsidiaries in key emerging markets. The documents aim to support the argument for significant investments in data and analytics capabilities as a means of driving revenue and profit growth through improved customer relationship management.
These reports are meant for analytics practitioners.
The key insights and arguments are presented in a separate powerpoint deck prepared for non-technical executives.
The documents implement advanced analytics methods that attempt to:
Identify managerially relevant subgroups in a customer database.
Build a model that predicts customer churn.
3. Build a model that predicts the spending level of customers predicted to remain loyal.
Analyses are performed in a small feature space. Although this will likely restrict the performance of some of the predictive models implemented in this report, the dataset is representative of the limited data currently available across most our subsidiaries.
This report focuses on the third task - predicting the spending level of the 2014 customers who remain loyal 2015.
Model performance target
There is currently no working predictive model to serve as a baseline.
I shall therefore use the simplest possible model as a lower bound. In this case the null model is set to predict all customers will spend the mean amount.
Performance metrics considered: Root mean squared error (RMSE) and \(R^2\).
A good model should:
Statistical significance tests will be used to compare performance.
There is no such objective for this demonstration.
library(ggplot2)
library(dplyr)
library(pastecs)
library(car)
library(caret)
library(gvlma)cust_data <- read.delim(file = 'transactions.txt', header = FALSE, sep = '\t', dec = '.')Description
The dataset is a proprietary customer database consisting of 51,243 observations across 3 features:
The data covers the period January 2nd 2005 to December 31st 2015.
The dataset is tidy and ready for analysis.
Preparation
Compute additional variables (marketing indicators):
recency - time since last purchase (days)
frequency - number of purchase transactions
revenue - total amount spent
year_of_purchase
avg_amount - average amount spent
## Name columns
cust_data <- rename(cust_data, customer_id = V1, purchase_amount = V2, date_of_purchase = V3)
## Convert purchase date to R date object
cust_data$date_of_purchase <- as.Date(cust_data$date_of_purchase, "%Y-%m-%d")## Create recency variable
cust_data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
time2 = cust_data$date_of_purchase,
units = "days"))
cust_data$year_of_purchase <- as.numeric(format(cust_data$date_of_purchase, "%Y"))Analysis dataset
Filter customers active in 2014 and compute remaining marketing indicators.
customers_2014 <- cust_data %>%
filter(year_of_purchase == 2014) %>%
group_by(customer_id) %>%
summarise(recency = min(days_since) - 365,
frequency = n(),
first_purchase = max(days_since) - 365,
avg_amount = mean(purchase_amount),
revenue = sum(purchase_amount))
summary(customers_2014)## customer_id recency frequency first_purchase
## Min. : 80 Min. : 1.333 Min. : 1.000 Min. : 1.333
## 1st Qu.: 97840 1st Qu.: 20.333 1st Qu.: 1.000 1st Qu.: 23.333
## Median :175080 Median : 62.333 Median : 1.000 Median : 85.333
## Mean :153909 Mean :107.117 Mean : 1.166 Mean :128.750
## 3rd Qu.:218295 3rd Qu.:204.333 3rd Qu.: 1.000 3rd Qu.:240.333
## Max. :245840 Max. :355.333 Max. :15.000 Max. :362.333
## avg_amount revenue
## Min. : 5.00 Min. : 5.00
## 1st Qu.: 30.00 1st Qu.: 30.00
## Median : 40.00 Median : 45.00
## Mean : 77.74 Mean : 87.89
## 3rd Qu.: 60.00 3rd Qu.: 80.00
## Max. :4500.00 Max. :5000.00
Dataset of customers active in 2015
customers_2015 <- cust_data %>%
filter(year_of_purchase == 2015) %>%
group_by(customer_id) %>%
summarise(recency = min(days_since),
frequency = n(),
first_purchase = max(days_since),
avg_amount = mean(purchase_amount),
revenue = sum(purchase_amount))
summary(customers_2015)## customer_id recency frequency first_purchase
## Min. : 80 Min. : 1.333 Min. : 1.000 Min. : 1.333
## 1st Qu.:104105 1st Qu.: 16.333 1st Qu.: 1.000 1st Qu.: 24.333
## Median :185495 Median : 59.333 Median : 1.000 Median : 65.333
## Mean :167782 Mean :100.074 Mean : 1.148 Mean :120.129
## 3rd Qu.:246058 3rd Qu.:174.333 3rd Qu.: 1.000 3rd Qu.:216.333
## Max. :264200 Max. :360.333 Max. :11.000 Max. :360.333
## avg_amount revenue
## Min. : 5.00 Min. : 5.00
## 1st Qu.: 30.00 1st Qu.: 30.00
## Median : 40.00 Median : 50.00
## Mean : 79.49 Mean : 88.62
## 3rd Qu.: 60.00 3rd Qu.: 85.00
## Max. :4500.00 Max. :4500.00
Training set
2014 customers who stayed loyal in 2015
active_2014 <- semi_join(customers_2014, customers_2015, by = "customer_id")
summary(active_2014)## customer_id recency frequency first_purchase
## Min. : 80 Min. : 1.333 Min. : 1.000 Min. : 1.333
## 1st Qu.: 83590 1st Qu.: 13.333 1st Qu.: 1.000 1st Qu.: 20.333
## Median :150660 Median : 49.333 Median : 1.000 Median : 71.333
## Mean :139562 Mean : 95.128 Mean : 1.209 Mean :122.061
## 3rd Qu.:203480 3rd Qu.:176.333 3rd Qu.: 1.000 3rd Qu.:236.333
## Max. :234760 Max. :342.333 Max. :15.000 Max. :362.333
## avg_amount revenue
## Min. : 5.00 Min. : 5.0
## 1st Qu.: 30.00 1st Qu.: 30.0
## Median : 45.00 Median : 50.0
## Mean : 79.16 Mean : 91.7
## 3rd Qu.: 75.00 3rd Qu.: 100.0
## Max. :4500.00 Max. :4500.0
row.names(active_2014) <- active_2014$customer_id
active_2014 <- select(active_2014, -customer_id)Test set
Actual spending by active 2014 customers in 2015
active_2015 <- semi_join(customers_2015, customers_2014, by = "customer_id")
summary(active_2015)## customer_id recency frequency first_purchase
## Min. : 80 Min. : 1.333 Min. : 1.000 Min. : 1.333
## 1st Qu.: 83590 1st Qu.: 14.333 1st Qu.: 1.000 1st Qu.: 15.333
## Median :150660 Median : 50.333 Median : 1.000 Median : 76.333
## Mean :139562 Mean : 96.184 Mean : 1.206 Mean :123.609
## 3rd Qu.:203480 3rd Qu.:177.333 3rd Qu.: 1.000 3rd Qu.:224.333
## Max. :234760 Max. :359.333 Max. :10.000 Max. :359.333
## avg_amount revenue
## Min. : 5.00 Min. : 5.0
## 1st Qu.: 30.00 1st Qu.: 30.0
## Median : 50.00 Median : 50.0
## Mean : 84.78 Mean : 98.3
## 3rd Qu.: 80.00 3rd Qu.: 100.0
## Max. :4500.00 Max. :4500.0
options(digits = 2, scipen = 99)
stat.desc(active_2014)## recency frequency first_purchase avg_amount revenue
## nbr.val 2965.0 2965.000 2965.00 2965.0 2965.0
## nbr.null 0.0 0.000 0.00 0.0 0.0
## nbr.na 0.0 0.000 0.00 0.0 0.0
## min 1.3 1.000 1.33 5.0 5.0
## max 342.3 15.000 362.33 4500.0 4500.0
## range 341.0 14.000 361.00 4495.0 4495.0
## sum 282053.3 3584.000 361909.33 234707.8 271882.4
## median 49.3 1.000 71.33 45.0 50.0
## mean 95.1 1.209 122.06 79.2 91.7
## SE.mean 1.9 0.012 2.10 3.5 3.9
## CI.mean.0.95 3.6 0.024 4.11 6.8 7.6
## var 10197.2 0.459 13016.99 35754.1 44218.8
## std.dev 101.0 0.678 114.09 189.1 210.3
## coef.var 1.1 0.561 0.93 2.4 2.3
Revenue per customer
ggplot(active_2014, aes(x = revenue)) +
geom_histogram(bins = 40, fill = "steelblue") +
theme_classic() +
labs(x = "revenue", title = "Annual revenue p/customer") The distribution is severely right skewed. Half of all customers spent below $50 during the year. A small number of outliers make up a long tail stretching up to $4,500
summary(active_2014$revenue)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 30 50 92 100 4500
Frequency and Recency
ggplot(active_2014, aes(x = frequency)) +
geom_histogram(bins = 10, fill = "steelblue") +
theme_classic() +
labs(x = "frequency", title = "Purchase frequency")
ggplot(active_2014, aes(x = recency)) +
geom_histogram(bins = 40,fill = "steelblue") +
theme_classic() +
labs(x = "recency", title = "Recency in Days")frequency is extremely right skewed. Although the data range is 1, 15 at least three quarters of customers have made just 1 purchase during the year.
summary(active_2014$frequency)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 1.0 1.2 1.0 15.0
recency is bimodal suggesting the presence of distinct subgroups in the data. The wide range of the series together with the shape of the distribution indicate a large number of lapsed customers.
First purchase and Average amount spent
ggplot(active_2014, aes(x = first_purchase)) +
geom_histogram(fill = "steelblue") +
theme_classic() +
labs(x = "first purchase (no.of days since)", title = "Days since first purchase")
ggplot(active_2014, aes(x = avg_amount)) +
geom_histogram(bins = 40, fill = "steelblue") +
theme_classic() +
labs(x = "average amount", title = "Average amount spent per transaction")Unsurprisingly the histograms for first_purchase and avg_amount look almost identical to those for recency and revenue. They contain essentially the same information and may well be redundant in a predictive model.
cor1 <- cor(active_2014[,-1])
cor1## frequency first_purchase avg_amount revenue
## frequency 1.000 0.384 -0.031 0.111
## first_purchase 0.384 1.000 -0.015 0.051
## avg_amount -0.031 -0.015 1.000 0.952
## revenue 0.111 0.051 0.952 1.000
Univariate density plots are shown on the diagonal whilst pairwise plots are shown on either side. The green lines are non-parametric regression smoothers.
cor_plot <- scatterplotMatrix(active_2014[,-1], spread = FALSE, main = "Scatterplot Matrix")Some clear linear relationships are visible in the pairwise plots. Non-linear relationships are also present - such as between recency and revenue.
The objective is to predict the spending level of customers in 2015 based on a set of predictor variables. Regression models are best suited to predicting numeric outcomes.
There are scores of algorithms that attempt to model a continuous response from a given set of features. Model selection is a challenge that requires skill, experience and a fair amount of trial and error.
I will first set a performance ceiling by fitting some flexible, non-linear models. These models are generally less interpretable but show better predictive performance than linear models. As the main goal here is predictive accuracy, the performance of the more flexible models will serve as a useful guide.
Performance metrics; RMSE; \(R^2\)
Next I will try a simpler, more interpretable linear model to see if its performance is acceptably close to the non-linear models. This is possible if the relationships between predictors and the response is essentially linear.
The combination of accuracy and simplicity is ideal in a statistical model.
The ‘best’ model will then be compared to the baseline model described above. If my best model performs significantly better than the baseline, I will investigate the feasibility of further tuning to approach the bayes model; the most accurate possible model given the available set of predictors.
Models will be trained on a dataset of 2014 customers. Initial evaluation will be based on resampling of the training set. The selected model will be tested against the actual values in the 2015 dataset. Considering the limited size of the dataset it is best to maximize the data available for the algorithms to learn the structure of the data. Given a larger dataset I would have split 3 ways into training, validation and test sets.
Results are evaluated in section 7.
Cross validation settings for all models
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5)Non-linear models
Random forests
set.seed(32)
rF_fit <- train(revenue ~.,
data = active_2014,
method = "rf",
metric = "RMSE",
trControl = ctrl,
importance = TRUE)
rF_fit## Random Forest
##
## 2965 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2669, 2670, 2669, 2669, 2668, 2668, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 53 0.96
## 3 43 0.97
## 4 40 0.97
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
Multivariate adaptive regression splines (MARS)
set.seed(32)
mars_fit <- train(revenue ~.,
data = active_2014,
method = "earth",
metric = "RMSE",
tuneLength = 10,
trControl = ctrl)
mars_fit## Multivariate Adaptive Regression Spline
##
## 2965 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2669, 2670, 2669, 2669, 2668, 2668, ...
## Resampling results across tuning parameters:
##
## nprune RMSE Rsquared
## 2 103 0.80
## 3 58 0.90
## 4 49 0.94
## 5 49 0.94
##
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.
Support vector machines
set.seed(32)
svm_fit <- train(revenue~.,
data = active_2014,
method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 10,
trControl = ctrl)
svm_fit## Support Vector Machines with Radial Basis Function Kernel
##
## 2965 samples
## 4 predictor
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2669, 2670, 2669, 2669, 2668, 2668, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared
## 0.25 165 0.41
## 0.50 157 0.47
## 1.00 148 0.54
## 2.00 140 0.58
## 4.00 134 0.62
## 8.00 129 0.65
## 16.00 127 0.66
## 32.00 126 0.66
## 64.00 126 0.66
## 128.00 126 0.66
##
## Tuning parameter 'sigma' was held constant at a value of 4.7
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 4.7 and C = 32.
** Linear models**
Ordinary least squares
lm_fit <- train(revenue ~.,
data = active_2014,
method = "lm",
trControl = ctrl)
lm_fit## Linear Regression
##
## 2965 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 2669, 2668, 2667, 2669, 2669, 2670, ...
## Resampling results:
##
## RMSE Rsquared
## 49 0.93
##
##
results <- resamples(list(RandomForest = rF_fit,
MARS = mars_fit,
SVM = svm_fit,
OLS = lm_fit))
summary(results)##
## Call:
## summary.resamples(object = results)
##
## Models: RandomForest, MARS, SVM, OLS
## Number of resamples: 50
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RandomForest 5.7 16 26 40 51 138 0
## MARS 21.9 35 39 49 51 136 0
## SVM 23.3 55 97 126 196 278 0
## OLS 17.0 29 42 49 58 131 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RandomForest 0.87 0.96 0.98 0.97 0.99 1.00 0
## MARS 0.82 0.91 0.95 0.94 0.98 0.99 0
## SVM 0.38 0.51 0.68 0.66 0.81 0.96 0
## OLS 0.76 0.89 0.94 0.93 0.98 0.99 0
The support vector machine performs much worse than all the other models on both RMSE and \(R^2\).
The random forest model performs best on both evaluation metrics: lowest RMSE of 40 and highest \(R^2\) of 0.97. This model seems to fit the data well. According to \(R^2\), 97% of the variation in the response is explained by the model.
The linear model’s perfromance is close to that of the MARS. However, it should be noted that the ordinary least squares method makes several highly restrictive assumptions that must be satisfied to ensure the validity such models.
bwplot(results, metric = "RMSE", main = "Comparative Performance\nRoot Mean Squared Error")
bwplot(results, metric = "Rsquared", main = "Comparative Performance\nR Squared")It is worth running hypothesis tests to confirm that the performance differences observed above are statistically significant.
modelDiff <- diff(results)
summary(modelDiff)##
## Call:
## summary.diff.resamples(object = modelDiff)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## RMSE
## RandomForest MARS SVM OLS
## RandomForest -8.519 -86.232 -9.210
## MARS 0.0123 -77.713 -0.691
## SVM 0.000000000000023 0.000000000989258 77.022
## OLS 1.0000 1.0000 0.000000862237572
##
## Rsquared
## RandomForest MARS SVM
## RandomForest 0.03501 0.30847
## MARS 0.00000023376818 0.27346
## SVM < 0.0000000000000002 0.00000000000201
## OLS 0.000254 1.000000 0.00000000000231
## OLS
## RandomForest 0.04160
## MARS 0.00659
## SVM -0.26687
## OLS
P-values are close to 0 for almost all pairwise comparisons confirming statistical differences between the models.
Some tests involving the OLS model return high values notably when compared with MARS, confirming the results in the boxplots above.
Use random forest model to predict 2015 spending
rF_pred <- predict(rF_fit, newdata = active_2015)
head(rF_pred)## 1 2 3 4 5 6
## 80 45 50 60 60 85
Prediction error
MSE - mean squared error
pred_MSE <- mean((rF_pred - active_2015$revenue)^2)
pred_MSE## [1] 2195
RMSE
sqrt(pred_MSE)## [1] 47
Build null model that predicts every outcome as mean revenue
useless <- nullModel(x = active_2014[,-5], y = active_2014$revenue)
useless## Null Classification Model
##
## Call:
## nullModel.default(x = active_2014[, -5], y = active_2014$revenue)
##
## Predicted Value: 92
uselessPred <- predict(useless, newdata = active_2015, type = "raw")Null model prediction error
MSE
uselessMSE <- mean((uselessPred - active_2015$revenue)^2)
uselessMSE## [1] 54969
RMSE
sqrt(uselessMSE)## [1] 234
The random forest model prediction error is 25.05 times smaller than the null model prediction error.
The non linear random forests model does a good job of recognizing true patterns in the data. This learning serves to make accurate predictions about the spending levels of customers in the following year.
It should be noted that this database shows high rates of customer attrition from year to year. In this report I predicted the spending levels of customers I knew remained loyal in the subsequent year. This is of course not possible going forward.
Arguably the more pressing business need lies in predicting which customers will lapse. I cover the issue and its challenges in the context of our business in the second report of the series - (Customer Analytics II - Predicting Customer Attrition).
I recommend we work towards building a highly accurate classifier that correctly predicts customer attrition combined with a strong regression model predicting customer spending for each customer.