1. Introduction

This is the third of a series of 4 dynamic documents written to demonstrate the value of customer analytics to the senior management of a diversified holding company with subsidiaries in key emerging markets. The documents aim to support the argument for significant investments in data and analytics capabilities as a means of driving revenue and profit growth through improved customer relationship management.

These reports are meant for analytics practitioners.

The key insights and arguments are presented in a separate powerpoint deck prepared for non-technical executives.

The documents implement advanced analytics methods that attempt to:

  1. Identify managerially relevant subgroups in a customer database.

  2. Build a model that predicts customer churn.

3. Build a model that predicts the spending level of customers predicted to remain loyal.

  1. Estimate the lifetime value of a customer database.

Analyses are performed in a small feature space. Although this will likely restrict the performance of some of the predictive models implemented in this report, the dataset is representative of the limited data currently available across most our subsidiaries.


2. Objectives

This report focuses on the third task - predicting the spending level of the 2014 customers who remain loyal 2015.

Model performance target

There is currently no working predictive model to serve as a baseline.

I shall therefore use the simplest possible model as a lower bound. In this case the null model is set to predict all customers will spend the mean amount.

Performance metrics considered: Root mean squared error (RMSE) and \(R^2\).

A good model should:

  • At least outperform the baseline model.

Statistical significance tests will be used to compare performance.

  • Meet the managerial objective determined by project sponsors.

There is no such objective for this demonstration.


3. Set Up

3.1 Load Packages

library(ggplot2)
library(dplyr)
library(pastecs)
library(car)
library(caret)
library(gvlma)

3.2 Load Data

cust_data <- read.delim(file = 'transactions.txt', header = FALSE, sep = '\t', dec = '.')

4. Data

Description

The dataset is a proprietary customer database consisting of 51,243 observations across 3 features:

  • customer i.d
  • purchase amount in USD
  • transaction date

The data covers the period January 2nd 2005 to December 31st 2015.

The dataset is tidy and ready for analysis.

Preparation

Compute additional variables (marketing indicators):

recency - time since last purchase (days)

frequency - number of purchase transactions

revenue - total amount spent

year_of_purchase

avg_amount - average amount spent

## Name columns
cust_data <- rename(cust_data, customer_id = V1, purchase_amount = V2, date_of_purchase = V3)

## Convert purchase date to R date object
cust_data$date_of_purchase <- as.Date(cust_data$date_of_purchase, "%Y-%m-%d")
## Create recency variable
cust_data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
                                            time2 = cust_data$date_of_purchase,
                                            units = "days"))  

cust_data$year_of_purchase <- as.numeric(format(cust_data$date_of_purchase, "%Y"))

Analysis dataset

Filter customers active in 2014 and compute remaining marketing indicators.

customers_2014 <- cust_data %>%
                    filter(year_of_purchase == 2014) %>%
                       group_by(customer_id) %>%
                          summarise(recency = min(days_since) - 365, 
                             frequency = n(), 
                             first_purchase = max(days_since) - 365,
                             avg_amount = mean(purchase_amount),
                             revenue = sum(purchase_amount))

summary(customers_2014)
##   customer_id        recency          frequency      first_purchase   
##  Min.   :    80   Min.   :  1.333   Min.   : 1.000   Min.   :  1.333  
##  1st Qu.: 97840   1st Qu.: 20.333   1st Qu.: 1.000   1st Qu.: 23.333  
##  Median :175080   Median : 62.333   Median : 1.000   Median : 85.333  
##  Mean   :153909   Mean   :107.117   Mean   : 1.166   Mean   :128.750  
##  3rd Qu.:218295   3rd Qu.:204.333   3rd Qu.: 1.000   3rd Qu.:240.333  
##  Max.   :245840   Max.   :355.333   Max.   :15.000   Max.   :362.333  
##    avg_amount         revenue       
##  Min.   :   5.00   Min.   :   5.00  
##  1st Qu.:  30.00   1st Qu.:  30.00  
##  Median :  40.00   Median :  45.00  
##  Mean   :  77.74   Mean   :  87.89  
##  3rd Qu.:  60.00   3rd Qu.:  80.00  
##  Max.   :4500.00   Max.   :5000.00

Dataset of customers active in 2015

customers_2015 <- cust_data %>%
                    filter(year_of_purchase == 2015) %>%
                       group_by(customer_id) %>%
                          summarise(recency = min(days_since), 
                             frequency = n(), 
                             first_purchase = max(days_since),
                             avg_amount = mean(purchase_amount),
                             revenue = sum(purchase_amount)) 

summary(customers_2015)
##   customer_id        recency          frequency      first_purchase   
##  Min.   :    80   Min.   :  1.333   Min.   : 1.000   Min.   :  1.333  
##  1st Qu.:104105   1st Qu.: 16.333   1st Qu.: 1.000   1st Qu.: 24.333  
##  Median :185495   Median : 59.333   Median : 1.000   Median : 65.333  
##  Mean   :167782   Mean   :100.074   Mean   : 1.148   Mean   :120.129  
##  3rd Qu.:246058   3rd Qu.:174.333   3rd Qu.: 1.000   3rd Qu.:216.333  
##  Max.   :264200   Max.   :360.333   Max.   :11.000   Max.   :360.333  
##    avg_amount         revenue       
##  Min.   :   5.00   Min.   :   5.00  
##  1st Qu.:  30.00   1st Qu.:  30.00  
##  Median :  40.00   Median :  50.00  
##  Mean   :  79.49   Mean   :  88.62  
##  3rd Qu.:  60.00   3rd Qu.:  85.00  
##  Max.   :4500.00   Max.   :4500.00

Training set

2014 customers who stayed loyal in 2015

active_2014 <- semi_join(customers_2014, customers_2015, by = "customer_id")
summary(active_2014)
##   customer_id        recency          frequency      first_purchase   
##  Min.   :    80   Min.   :  1.333   Min.   : 1.000   Min.   :  1.333  
##  1st Qu.: 83590   1st Qu.: 13.333   1st Qu.: 1.000   1st Qu.: 20.333  
##  Median :150660   Median : 49.333   Median : 1.000   Median : 71.333  
##  Mean   :139562   Mean   : 95.128   Mean   : 1.209   Mean   :122.061  
##  3rd Qu.:203480   3rd Qu.:176.333   3rd Qu.: 1.000   3rd Qu.:236.333  
##  Max.   :234760   Max.   :342.333   Max.   :15.000   Max.   :362.333  
##    avg_amount         revenue      
##  Min.   :   5.00   Min.   :   5.0  
##  1st Qu.:  30.00   1st Qu.:  30.0  
##  Median :  45.00   Median :  50.0  
##  Mean   :  79.16   Mean   :  91.7  
##  3rd Qu.:  75.00   3rd Qu.: 100.0  
##  Max.   :4500.00   Max.   :4500.0
row.names(active_2014) <- active_2014$customer_id
active_2014 <- select(active_2014, -customer_id)

Test set

Actual spending by active 2014 customers in 2015

active_2015 <- semi_join(customers_2015, customers_2014, by = "customer_id")
summary(active_2015)
##   customer_id        recency          frequency      first_purchase   
##  Min.   :    80   Min.   :  1.333   Min.   : 1.000   Min.   :  1.333  
##  1st Qu.: 83590   1st Qu.: 14.333   1st Qu.: 1.000   1st Qu.: 15.333  
##  Median :150660   Median : 50.333   Median : 1.000   Median : 76.333  
##  Mean   :139562   Mean   : 96.184   Mean   : 1.206   Mean   :123.609  
##  3rd Qu.:203480   3rd Qu.:177.333   3rd Qu.: 1.000   3rd Qu.:224.333  
##  Max.   :234760   Max.   :359.333   Max.   :10.000   Max.   :359.333  
##    avg_amount         revenue      
##  Min.   :   5.00   Min.   :   5.0  
##  1st Qu.:  30.00   1st Qu.:  30.0  
##  Median :  50.00   Median :  50.0  
##  Mean   :  84.78   Mean   :  98.3  
##  3rd Qu.:  80.00   3rd Qu.: 100.0  
##  Max.   :4500.00   Max.   :4500.0

5. Data Exploration

5.1 Descriptive statistics

options(digits = 2, scipen = 99)
stat.desc(active_2014)
##               recency frequency first_purchase avg_amount  revenue
## nbr.val        2965.0  2965.000        2965.00     2965.0   2965.0
## nbr.null          0.0     0.000           0.00        0.0      0.0
## nbr.na            0.0     0.000           0.00        0.0      0.0
## min               1.3     1.000           1.33        5.0      5.0
## max             342.3    15.000         362.33     4500.0   4500.0
## range           341.0    14.000         361.00     4495.0   4495.0
## sum          282053.3  3584.000      361909.33   234707.8 271882.4
## median           49.3     1.000          71.33       45.0     50.0
## mean             95.1     1.209         122.06       79.2     91.7
## SE.mean           1.9     0.012           2.10        3.5      3.9
## CI.mean.0.95      3.6     0.024           4.11        6.8      7.6
## var           10197.2     0.459       13016.99    35754.1  44218.8
## std.dev         101.0     0.678         114.09      189.1    210.3
## coef.var          1.1     0.561           0.93        2.4      2.3

5.2 Distributions

Revenue per customer

ggplot(active_2014, aes(x = revenue)) +
   geom_histogram(bins = 40, fill = "steelblue") +
   theme_classic() +
   labs(x = "revenue", title = "Annual revenue p/customer") 

The distribution is severely right skewed. Half of all customers spent below $50 during the year. A small number of outliers make up a long tail stretching up to $4,500

summary(active_2014$revenue)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5      30      50      92     100    4500

Frequency and Recency

ggplot(active_2014, aes(x = frequency)) +
   geom_histogram(bins = 10, fill = "steelblue") +
   theme_classic() +
   labs(x = "frequency", title = "Purchase frequency") 

ggplot(active_2014, aes(x = recency)) +
   geom_histogram(bins = 40,fill = "steelblue") +
   theme_classic() +
   labs(x = "recency", title = "Recency in Days")

frequency is extremely right skewed. Although the data range is 1, 15 at least three quarters of customers have made just 1 purchase during the year.

summary(active_2014$frequency)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     1.2     1.0    15.0

recency is bimodal suggesting the presence of distinct subgroups in the data. The wide range of the series together with the shape of the distribution indicate a large number of lapsed customers.

First purchase and Average amount spent

ggplot(active_2014, aes(x = first_purchase)) +
   geom_histogram(fill = "steelblue") +
   theme_classic() +
   labs(x = "first purchase (no.of days since)", title = "Days since first purchase") 

ggplot(active_2014, aes(x = avg_amount)) +
   geom_histogram(bins = 40, fill = "steelblue") +
   theme_classic() +
   labs(x = "average amount", title = "Average amount spent per transaction")

Unsurprisingly the histograms for first_purchase and avg_amount look almost identical to those for recency and revenue. They contain essentially the same information and may well be redundant in a predictive model.

5.3 Correlations

cor1 <- cor(active_2014[,-1])
cor1
##                frequency first_purchase avg_amount revenue
## frequency          1.000          0.384     -0.031   0.111
## first_purchase     0.384          1.000     -0.015   0.051
## avg_amount        -0.031         -0.015      1.000   0.952
## revenue            0.111          0.051      0.952   1.000

Univariate density plots are shown on the diagonal whilst pairwise plots are shown on either side. The green lines are non-parametric regression smoothers.

cor_plot <- scatterplotMatrix(active_2014[,-1], spread = FALSE, main = "Scatterplot Matrix")

Some clear linear relationships are visible in the pairwise plots. Non-linear relationships are also present - such as between recency and revenue.

6. Modelling

6.1 Objective

The objective is to predict the spending level of customers in 2015 based on a set of predictor variables. Regression models are best suited to predicting numeric outcomes.

There are scores of algorithms that attempt to model a continuous response from a given set of features. Model selection is a challenge that requires skill, experience and a fair amount of trial and error.

6.2 Strategy

I will first set a performance ceiling by fitting some flexible, non-linear models. These models are generally less interpretable but show better predictive performance than linear models. As the main goal here is predictive accuracy, the performance of the more flexible models will serve as a useful guide.

Performance metrics; RMSE; \(R^2\)

Next I will try a simpler, more interpretable linear model to see if its performance is acceptably close to the non-linear models. This is possible if the relationships between predictors and the response is essentially linear.

The combination of accuracy and simplicity is ideal in a statistical model.

The ‘best’ model will then be compared to the baseline model described above. If my best model performs significantly better than the baseline, I will investigate the feasibility of further tuning to approach the bayes model; the most accurate possible model given the available set of predictors.

Models will be trained on a dataset of 2014 customers. Initial evaluation will be based on resampling of the training set. The selected model will be tested against the actual values in the 2015 dataset. Considering the limited size of the dataset it is best to maximize the data available for the algorithms to learn the structure of the data. Given a larger dataset I would have split 3 ways into training, validation and test sets.

Results are evaluated in section 7.

6.3 Model fitting

Cross validation settings for all models

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 5)

Non-linear models

Random forests

set.seed(32)
rF_fit <- train(revenue ~., 
                 data = active_2014, 
                 method = "rf", 
                 metric = "RMSE", 
                 trControl = ctrl,
                 importance = TRUE)

rF_fit
## Random Forest 
## 
## 2965 samples
##    4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2669, 2670, 2669, 2669, 2668, 2668, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE  Rsquared
##   2     53    0.96    
##   3     43    0.97    
##   4     40    0.97    
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 4.

Multivariate adaptive regression splines (MARS)

set.seed(32)
mars_fit <- train(revenue ~., 
                   data = active_2014, 
                   method = "earth", 
                   metric = "RMSE",
                   tuneLength = 10,
                   trControl = ctrl)

mars_fit
## Multivariate Adaptive Regression Spline 
## 
## 2965 samples
##    4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2669, 2670, 2669, 2669, 2668, 2668, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE  Rsquared
##   2       103   0.80    
##   3        58   0.90    
##   4        49   0.94    
##   5        49   0.94    
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were nprune = 5 and degree = 1.

Support vector machines

set.seed(32)
svm_fit <- train(revenue~.,
                  data = active_2014,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 10,
                  trControl = ctrl)
svm_fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2965 samples
##    4 predictor
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2669, 2670, 2669, 2669, 2668, 2668, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE  Rsquared
##     0.25  165   0.41    
##     0.50  157   0.47    
##     1.00  148   0.54    
##     2.00  140   0.58    
##     4.00  134   0.62    
##     8.00  129   0.65    
##    16.00  127   0.66    
##    32.00  126   0.66    
##    64.00  126   0.66    
##   128.00  126   0.66    
## 
## Tuning parameter 'sigma' was held constant at a value of 4.7
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 4.7 and C = 32.

** Linear models**

Ordinary least squares

lm_fit <- train(revenue ~.,
                 data = active_2014,
                 method = "lm",
                 trControl = ctrl)
lm_fit
## Linear Regression 
## 
## 2965 samples
##    4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 2669, 2668, 2667, 2669, 2669, 2670, ... 
## Resampling results:
## 
##   RMSE  Rsquared
##   49    0.93    
## 
## 

7. Model evaluation and selection

7.1 Compare models

results <- resamples(list(RandomForest = rF_fit,
                          MARS = mars_fit,
                          SVM = svm_fit,
                          OLS = lm_fit))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RandomForest, MARS, SVM, OLS 
## Number of resamples: 50 
## 
## RMSE 
##              Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RandomForest  5.7      16     26   40      51  138    0
## MARS         21.9      35     39   49      51  136    0
## SVM          23.3      55     97  126     196  278    0
## OLS          17.0      29     42   49      58  131    0
## 
## Rsquared 
##              Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RandomForest 0.87    0.96   0.98 0.97    0.99 1.00    0
## MARS         0.82    0.91   0.95 0.94    0.98 0.99    0
## SVM          0.38    0.51   0.68 0.66    0.81 0.96    0
## OLS          0.76    0.89   0.94 0.93    0.98 0.99    0

The support vector machine performs much worse than all the other models on both RMSE and \(R^2\).

The random forest model performs best on both evaluation metrics: lowest RMSE of 40 and highest \(R^2\) of 0.97. This model seems to fit the data well. According to \(R^2\), 97% of the variation in the response is explained by the model.

The linear model’s perfromance is close to that of the MARS. However, it should be noted that the ordinary least squares method makes several highly restrictive assumptions that must be satisfied to ensure the validity such models.

bwplot(results, metric = "RMSE", main = "Comparative Performance\nRoot Mean Squared Error")

bwplot(results, metric = "Rsquared", main = "Comparative Performance\nR Squared")

It is worth running hypothesis tests to confirm that the performance differences observed above are statistically significant.

modelDiff <- diff(results)

summary(modelDiff)
## 
## Call:
## summary.diff.resamples(object = modelDiff)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## RMSE 
##              RandomForest      MARS              SVM               OLS    
## RandomForest                    -8.519           -86.232            -9.210
## MARS         0.0123                              -77.713            -0.691
## SVM          0.000000000000023 0.000000000989258                    77.022
## OLS          1.0000            1.0000            0.000000862237572        
## 
## Rsquared 
##              RandomForest         MARS             SVM             
## RandomForest                       0.03501          0.30847        
## MARS         0.00000023376818                       0.27346        
## SVM          < 0.0000000000000002 0.00000000000201                 
## OLS          0.000254             1.000000         0.00000000000231
##              OLS     
## RandomForest  0.04160
## MARS          0.00659
## SVM          -0.26687
## OLS

P-values are close to 0 for almost all pairwise comparisons confirming statistical differences between the models.

Some tests involving the OLS model return high values notably when compared with MARS, confirming the results in the boxplots above.

7.2 Predictions

Use random forest model to predict 2015 spending

rF_pred <- predict(rF_fit, newdata = active_2015)
head(rF_pred)
##  1  2  3  4  5  6 
## 80 45 50 60 60 85

Prediction error

MSE - mean squared error

pred_MSE <- mean((rF_pred - active_2015$revenue)^2)
pred_MSE
## [1] 2195

RMSE

sqrt(pred_MSE)
## [1] 47

7.3 Best model vs. null model

Build null model that predicts every outcome as mean revenue

useless <- nullModel(x = active_2014[,-5], y = active_2014$revenue)
useless
## Null Classification Model
## 
## Call:
## nullModel.default(x = active_2014[, -5], y = active_2014$revenue)
## 
## Predicted Value: 92
uselessPred <- predict(useless, newdata = active_2015, type = "raw")

Null model prediction error

MSE

uselessMSE <- mean((uselessPred - active_2015$revenue)^2)
uselessMSE
## [1] 54969

RMSE

sqrt(uselessMSE)
## [1] 234

The random forest model prediction error is 25.05 times smaller than the null model prediction error.

8. Conclusion

The non linear random forests model does a good job of recognizing true patterns in the data. This learning serves to make accurate predictions about the spending levels of customers in the following year.

It should be noted that this database shows high rates of customer attrition from year to year. In this report I predicted the spending levels of customers I knew remained loyal in the subsequent year. This is of course not possible going forward.

Arguably the more pressing business need lies in predicting which customers will lapse. I cover the issue and its challenges in the context of our business in the second report of the series - (Customer Analytics II - Predicting Customer Attrition).

I recommend we work towards building a highly accurate classifier that correctly predicts customer attrition combined with a strong regression model predicting customer spending for each customer.