1. Introduction

This is the second in a series of four dynamic documents written to demonstrate the value of customer analytics to the senior management of a diversified holding company with subsidiaries in key emerging markets. The documents aim to support the argument for significant investments in data and analytics capabilities as a means of driving revenue and profit growth through improved customer relationship management.

The key insights and recommendations are presented in a separate PowerPoint deck prepared for non-technical executives.

The papers implement advanced analytics methods that attempt to:

  1. Identify managerially relevant subgroups in a customer database.

  2. Build a model that predicts customer attrition (churn).

  3. Build a model that predicts the spending level of customers predicted to remain loyal.

  4. Estimate the lifetime value of each customer.

Analyses are performed on a very narrow dataset. Although this will likely restrict the performance of the predictive models, the dataset is representative of the limited data currently available within most of our subsidiaries.


2. Objective

This report focuses on the second task - building a binary classifier that predicts customer churn.

The aim is to predict, for each customer active in 2014, whether or not they will remain loyal in 2015.

Model performance target

There is currently no working predictive model to serve as a baseline against which to compare my models.

I shall therefore use the simplest possible model as a lower bound. In this case the null model predicts that all customers remain loyal - zero customer churn.

Performance metrics considered: ROC, accuracy, sensitivity and specificity.

A good model should:

  • At least outperform the baseline model. Statistical significance tests will be used to compare performance.

  • Meet the managerial objective determined by project sponsors. There is no such objective for this demonstration.


3. Set Up

Load Packages

library(ggplot2)
library(dplyr)
library(pastecs)
library(kernlab)
library(caret)

Load Data

cust_data <- read.delim(file = 'transactions.txt', header = FALSE, sep = '\t', dec = '.')
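A quick structural check of the raw import (a minimal sketch; the columns are unnamed at this stage and are renamed in section 4):

## Expect 51,243 rows and 3 unnamed columns (V1-V3)
str(cust_data)
head(cust_data)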

4. Dataset Description

The dataset is proprietary customer data consisting of 51,243 observations across 3 features:

  • customer ID
  • purchase amount in USD
  • transaction date

The data covers the period January 2nd 2005 to December 31st 2015.

The dataset is tidy and ready for analysis.

Data Preparation

Create additional variables:

recency - time since last purchase (days)

frequency - number of purchase transactions

avg_amount - average amount spent (USD)

first_purchase - time since first purchase (days)

max_amount - largest single purchase amount (USD)

## Name columns
cust_data <- rename(cust_data, customer_id = V1, purchase_amount = V2, date_of_purchase = V3)

## Convert purchase date to R date object
cust_data$date_of_purchase <- as.Date(cust_data$date_of_purchase, "%Y-%m-%d")

## Extract purchase year - needed to identify active customers in section 6.3
cust_data$year_of_purchase <- as.numeric(format(cust_data$date_of_purchase, "%Y"))

## Create recency variable
cust_data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
                                            time2 = cust_data$date_of_purchase,
                                            units = "days"))

## Create dataset for rfm analysis including frequency variable
customers <- cust_data %>%
                group_by(customer_id) %>%
                summarise(recency = min(days_since), frequency = n(), 
                          avg_amount = mean(purchase_amount),
                          first_purchase = max(days_since),
                          max_amount = max(purchase_amount))
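As a quick check (sketch), the derived table should contain one row per customer with the five summary variables:

## One row per customer
head(customers)
nrow(customers)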

5. Data Exploration

5.1 Descriptive statistics

options(digits = 2, scipen = 99)
stat.desc(customers)
##                customer_id     recency frequency avg_amount first_purchase
## nbr.val           18417.00    18417.00 18417.000    18417.0       18417.00
## nbr.null              0.00        0.00     0.000        0.0           0.00
## nbr.na                0.00        0.00     0.000        0.0           0.00
## min                  10.00        1.33     1.000        5.0           1.33
## max              264200.00     4014.33    45.000     4500.0        4016.33
## range            264190.00     4013.00    44.000     4495.0        4015.00
## sum                     NA 23083338.00 51243.000  1064373.4    36545649.00
## median           136430.00     1070.33     2.000       30.0        2087.33
## mean             137573.51     1253.37     2.782       57.8        1984.34
## SE.mean             512.16        7.97     0.022        1.1           8.35
## CI.mean.0.95       1003.88       15.62     0.042        2.2          16.37
## var          4830889404.55  1169507.86     8.625    23827.0     1284607.89
## std.dev           69504.60     1081.44     2.937      154.4        1133.41
## coef.var              0.51        0.86     1.056        2.7           0.57
##              max_amount
## nbr.val         18417.0
## nbr.null            0.0
## nbr.na              0.0
## min                 5.0
## max              4500.0
## range            4495.0
## sum           1266285.0
## median             30.0
## mean               68.8
## SE.mean             1.4
## CI.mean.0.95        2.8
## var             37759.5
## std.dev           194.3
## coef.var            2.8

5.2 Distributions

frequency

The distribution is heavily right skewed. Most customers have made between 1 and 15 purchases. A few outliers can be found beyond this point, up to a maximum of 45.

ggplot(customers, aes(x = frequency)) +
   geom_histogram(bins = 20, fill = "steelblue") +
   theme_classic() +
   labs(x = "frequency", title = "Purchase frequency") +
   scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45))

Almost 50% of customers have only ever made 1 purchase.

dim(filter(customers, frequency == 1))[1]/dim(customers)[1]
## [1] 0.49

recency

Apart from a spike on the extreme left, the distribution is fairly uniform.

ggplot(customers, aes(x = recency)) +
   geom_histogram(fill = "steelblue") +
   theme_classic() +
   labs(x = "recency", title = "Recency in Days") +
   scale_x_continuous(breaks = c(0, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000))

About 30% of customers have been active in the last year.

dim(filter(customers, recency < 365))[1]/dim(customers)[1]
## [1] 0.29

amount

The series has a mean of 58, with the middle 50% of values lying between 22 and 50. The mean is pulled above the median (30) by the severe right skew caused by some extreme outliers.

summary(customers$avg_amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5      22      30      58      50    4500
ggplot(customers, aes(x = avg_amount)) +
   geom_histogram(bins = 20, fill = "steelblue") +
   theme_classic() +
   labs(x = "amount", title = "average purchase amount")

ggplot(customers, aes(x = avg_amount)) +
   geom_histogram(bins = 20, fill = "steelblue") +
   theme_classic() +
   labs(x = "amount (log10)", title = "Log transformed average purchase amount") +
   scale_x_log10()

6. Modelling

6.1 Objective

The objective is to assign each customer to one of two categories (churn/no churn).

There are scores of algorithms that attempt to model a binary response from a given set of features. Model selection is a challenge that requires skill, experience and a fair bit of trial and error.

6.2 Strategy

I will first set a performance ceiling by fitting some flexible, non-linear models. These models are generally less interpretable but show better predictive performance than linear models. As the principal goal here is predictive accuracy, the performance of the more flexible models will serve as a useful guide.

Next I will try some simpler, more interpretable linear models to see if their performance is acceptably close to that of the non-linear models, as can be the case when the relationships between predictors and response are essentially linear.

The combination of accuracy and simplicity is ideal in a statistical model.

The ‘best’ model will then be compared to the baseline model described above. If my best model performs significantly better than the baseline, I will investigate the feasibility of further tuning to approach the Bayes model: the most accurate model possible given the available set of predictors.

Results are evaluated in section 7.

Models are fitted on a dataset made up of both customers active in 2014 and 2015 (did not churn) and those active in 2014 but not in 2015 (churned). Customers acquired in 2015 are therefore excluded and cannot interfere with the analysis.

6.3 Data pre-processing

Customers active in 2014

active_2014 <- cust_data %>%
                  filter(year_of_purchase == 2014) %>%
                  group_by(customer_id) %>%
                  summarise(recency = min(days_since),
                            frequency = n(),
                            avg_amount = mean(purchase_amount),
                            first_purchase = max(days_since),
                            max_amount = max(purchase_amount))

summary(active_2014)
##   customer_id        recency      frequency      avg_amount  
##  Min.   :    80   Min.   :366   Min.   : 1.0   Min.   :   5  
##  1st Qu.: 97840   1st Qu.:385   1st Qu.: 1.0   1st Qu.:  30  
##  Median :175080   Median :427   Median : 1.0   Median :  40  
##  Mean   :153909   Mean   :472   Mean   : 1.2   Mean   :  78  
##  3rd Qu.:218295   3rd Qu.:569   3rd Qu.: 1.0   3rd Qu.:  60  
##  Max.   :245840   Max.   :720   Max.   :15.0   Max.   :4500  
##  first_purchase   max_amount  
##  Min.   :366    Min.   :   5  
##  1st Qu.:388    1st Qu.:  30  
##  Median :450    Median :  40  
##  Mean   :494    Mean   :  80  
##  3rd Qu.:605    3rd Qu.:  60  
##  Max.   :727    Max.   :4500

Customers active in 2015

active_2015 <- cust_data %>%
                  filter(year_of_purchase == 2015) %>%
                  group_by(customer_id) %>%
                  summarise(recency = min(days_since),
                            frequency = n(),
                            avg_amount = mean(purchase_amount),
                            first_purchase = max(days_since),
                            max_amount = max(purchase_amount))

summary(active_2015)
##   customer_id        recency      frequency      avg_amount  
##  Min.   :    80   Min.   :  1   Min.   : 1.0   Min.   :   5  
##  1st Qu.:104105   1st Qu.: 16   1st Qu.: 1.0   1st Qu.:  30  
##  Median :185495   Median : 59   Median : 1.0   Median :  40  
##  Mean   :167782   Mean   :100   Mean   : 1.1   Mean   :  79  
##  3rd Qu.:246058   3rd Qu.:174   3rd Qu.: 1.0   3rd Qu.:  60  
##  Max.   :264200   Max.   :360   Max.   :11.0   Max.   :4500  
##  first_purchase   max_amount  
##  Min.   :  1    Min.   :   5  
##  1st Qu.: 24    1st Qu.:  30  
##  Median : 65    Median :  40  
##  Mean   :120    Mean   :  81  
##  3rd Qu.:216    3rd Qu.:  65  
##  Max.   :360    Max.   :4500

New customers acquired in 2015

new_cust2015 <- customers %>%
                  filter(first_purchase <= 365)

Remove new customers from active_2015

old_2015 <- subset(active_2015, !(active_2015$customer_id %in% new_cust2015$customer_id))

Identify customers active in 2014 but not in 2015 - the churners.

churners <- anti_join(active_2014, old_2015, by = "customer_id")

Note there are some old customers who were active in 2015 but not in 2014. They were active before 2014, skipped that year and returned in 2015. Those customers are excluded from the model data.

Label customers in active_2014 to create the binary response variable y (churn/loyal).

Dataset used for model fitting

model_data <- mutate(active_2014,
                     churn = ifelse(customer_id %in% churners$customer_id,
                                    "churn", "loyal"))

model_data$churn <- factor(model_data$churn)

Set customer_id as row names and remove the customer_id variable.

model_data <- as.data.frame(model_data)  ## tibbles do not support row names, so coerce first
row.names(model_data) <- model_data$customer_id
model_data <- select(model_data, -customer_id)
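One detail worth checking (a minimal sketch): caret treats the first factor level as the positive class, and alphabetical ordering puts "churn" first, which is what the sensitivity figures below assume.

levels(model_data$churn)
## expected: "churn" "loyal"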

Response variable

churn

Predictors

recency, first_purchase, frequency, avg_amount, max_amount

High churn rate - 40%

table(model_data$churn)
## 
## churn loyal 
##  1958  2965
dim(filter(model_data, churn == "churn"))[1]/dim(model_data)[1]
## [1] 0.4

Partition data into training and test sets

Create a stratified split that preserves the class proportions

set.seed(11)
trainIndex <- createDataPartition(model_data$churn, p = .75,
                                  list = FALSE,
                                  times = 1)
head(trainIndex)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         3
## [4,]         4
## [5,]         5
## [6,]         7
train_data <- model_data[ trainIndex,]
test_data  <- model_data[-trainIndex,]
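As a sanity check (sketch), the stratified split produced by createDataPartition should preserve the roughly 40% churn rate in both partitions:

## Class proportions should be near-identical in both partitions
prop.table(table(train_data$churn))
prop.table(table(test_data$churn))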

6.4 Model Fitting

Non-linear models

Random forests

Cross validation settings

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 5, 
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

Model fit

set.seed(12)
rF.fit <- train(churn ~., 
            data = train_data, 
            method = "rf", 
            metric = "ROC", 
            trControl = ctrl)
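Before moving on, the resampling profile can be inspected; a minimal sketch using caret's built-in methods:

## Resampled ROC across the mtry values tried during tuning
rF.fit
ggplot(rF.fit)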

Support Vector Machines

Tuning grid - the cross validation settings (ctrl) are reused from the random forest fit

set.seed(13)
sigmaRange <- sigest(churn ~., data = train_data)
svmRGrid <- expand.grid(.sigma = sigmaRange,
                        .C = 2^(seq(-4, 4)))

Model fit

set.seed(12)
svm.fit <- train(churn ~., data = train_data,
                 method = "svmRadial",
                 metric = "ROC",
                 preProc = c("center", "scale"),
                 tuneGrid = svmRGrid,
                 fit = FALSE,
                 trControl = ctrl)

K-Nearest Neighbour

Model fit

set.seed(12)
knn.fit <- train(churn ~ ., data = train_data, method = "knn",
                 trControl = ctrl, metric = "ROC", tuneLength = 20,
                 preProc = c("center", "scale"))

Linear models

Cross validation settings for both linear models

ctrl_lM <- trainControl(method = "LGOCV",
                         summaryFunction = twoClassSummary,
                         classProbs = TRUE)

Generalized linear model

set.seed(12)
lM.fit <- train(churn ~ .,
            data = train_data,
            method = "glm",
            metric = "ROC",
            trControl = ctrl_lM)

Linear discriminant analysis

set.seed(12)
lda.fit <- train(churn ~ .,
                 data = train_data,
                 method = "lda",
                 metric = "ROC",
                 preProc = c("center", "scale"),
                 trControl = ctrl_lM)
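One payoff of the simpler models is interpretability. As a sketch, the logistic regression coefficients can be read directly on the log-odds scale:

## Coefficients of the fitted glm (log-odds scale)
summary(lM.fit$finalModel)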

6.5 Model selection

Compare non-linear models

model_results1 <- resamples(list(rF = rF.fit,
                                 svm = svm.fit,
                                 knn = knn.fit))

summary(model_results1)
## 
## Call:
## summary.resamples(object = model_results1)
## 
## Models: rF, svm, knn 
## Number of resamples: 50 
## 
## ROC 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rF  0.56    0.61   0.62 0.62    0.65 0.67    0
## svm 0.55    0.60   0.62 0.62    0.64 0.70    0
## knn 0.57    0.62   0.63 0.63    0.64 0.68    0
## 
## Sens 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rF  0.40    0.45   0.48 0.48    0.52 0.56    0
## svm 0.16    0.19   0.21 0.21    0.23 0.29    0
## knn 0.30    0.35   0.37 0.37    0.39 0.46    0
## 
## Spec 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rF  0.64    0.68   0.70 0.71    0.73 0.80    0
## svm 0.83    0.87   0.88 0.88    0.90 0.94    0
## knn 0.72    0.78   0.80 0.79    0.81 0.84    0

Visualize results

bwplot(model_results1, 
       fill = "red",
       main = "Comparative performance of non-linear models")

dotplot(model_results1, 
       fill = "red",
       main = "Comparative performance of non-linear models")

Compare linear models

model_results2 <- resamples(list(glm = lM.fit,
                                 lda = lda.fit))

summary(model_results2)
## 
## Call:
## summary.resamples(object = model_results2)
## 
## Models: glm, lda 
## Number of resamples: 25 
## 
## ROC 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.57     0.6   0.61 0.61    0.62 0.64    0
## lda 0.57     0.6   0.60 0.60    0.62 0.64    0
## 
## Sens 
##      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.052   0.131   0.16 0.16    0.18 0.26    0
## lda 0.030   0.071   0.09 0.11    0.15 0.22    0
## 
## Spec 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.83    0.87   0.88 0.89    0.93 0.96    0
## lda 0.86    0.90   0.92 0.92    0.95 0.97    0
bwplot(model_results2, 
       fill = "red",
       main = "Comparative performance of linear models")

Apply selected model to the test set

Select the random forest model and use it to predict on the test set

rFPred.test <- predict(rF.fit, test_data)

Confusion Matrix

cMatrix <- confusionMatrix(rFPred.test, test_data$churn)
print(cMatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction churn loyal
##      churn   207   232
##      loyal   282   509
##                                        
##                Accuracy : 0.582        
##                  95% CI : (0.554, 0.61)
##     No Information Rate : 0.602        
##     P-Value [Acc > NIR] : 0.9310       
##                                        
##                   Kappa : 0.112        
##  Mcnemar's Test P-Value : 0.0307       
##                                        
##             Sensitivity : 0.423        
##             Specificity : 0.687        
##          Pos Pred Value : 0.472        
##          Neg Pred Value : 0.643        
##              Prevalence : 0.398        
##          Detection Rate : 0.168        
##    Detection Prevalence : 0.357        
##       Balanced Accuracy : 0.555        
##                                        
##        'Positive' Class : churn        
## 
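The test-set ROC AUC can also be computed for the selected model; a sketch assuming the pROC package (not loaded in section 3) is available:

library(pROC)
## Class probabilities on the test set, with "churn" as the positive class
rFProbs.test <- predict(rF.fit, test_data, type = "prob")
roc(response = test_data$churn, predictor = rFProbs.test$churn,
    levels = c("loyal", "churn"))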

Attach class probabilities and class assignments to the active 2015 dataset

Store class probabilities

rFProbs <- predict(rF.fit, active_2015, type = "prob")

Attach probabilities to active 2015 dataframe

active_2015$churn_prob <- rFProbs$churn

Attach class assignments to active 2015 dataframe

active_2015$class <- predict(rF.fit, active_2015)

head(active_2015[,-c(5,6)])
## Source: local data frame [6 x 6]
## 
##   customer_id recency frequency avg_amount churn_prob  class
##         (int)   (dbl)     (int)      (dbl)      (dbl) (fctr)
## 1          80   343.3         1         80      0.114  loyal
## 2         480    21.3         1         45      0.296  loyal
## 3         830   321.3         1         50      0.004  loyal
## 4         850    24.3         2         30      0.340  loyal
## 5         860   237.3         1         60      0.030  loyal
## 6        1020     1.3         2         60      0.068  loyal
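These scores are directly actionable. A sketch ranking active customers by predicted churn probability, for example to prioritise retention offers:

## Customers most at risk of churning
active_2015 %>%
   arrange(desc(churn_prob)) %>%
   select(customer_id, churn_prob, class) %>%
   head()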

7. Evaluation and critique

7.1 Summary

Although the results vary modestly across the different models, none of these classifiers fully satisfies the performance criteria set out in section 2.

Of the five classifiers, the random forest performs best overall, although the KNN and support vector machine models each do better on one metric or another.

The linear models perform similarly to the non-linear ones on ROC but very poorly on sensitivity, arguably the most important metric in a context such as ours.

Results are summarized in the tables and plots above.

7.2 Comparison to the null model

The null model classifies all customers as loyal - it predicts zero churn. Given our known churn rate of 40%, this model will have an overall error rate of 40%.

The random forest model reports an out-of-bag (OOB) error rate of 38%, barely below the null model's 40%.

rF.fit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 38%
## Confusion matrix:
##       churn loyal class.error
## churn   696   773        0.53
## loyal   638  1586        0.29
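For a direct comparison on the held-out test set, the null model can be scored with the same machinery (a sketch):

## Null model: predict "loyal" for every test-set customer
null_pred <- factor(rep("loyal", nrow(test_data)),
                    levels = levels(test_data$churn))
confusionMatrix(null_pred, test_data$churn)
## Accuracy equals the no-information rate (0.602); sensitivity is 0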

However, it is important to note that the null model, always predicting that no customer churns, has a sensitivity of 0, compared to the random forest model's 42.3% on the test set. Nevertheless, 42.3% sensitivity, despite being much better than the null model, is not very impressive in this context.

My best model slightly outperforms the null model and offers modest value over the status quo.

7.3 Critique

This modest performance could be due to deficiencies in either the models or the data.

Possible model deficiencies

Classifiers could perform poorly due to a number of model related reasons:

  • Mismatches between the inductive biases of particular models and the data. For example, linear models, which assume linear relationships between predictors and the response variable, will struggle to identify non-linear decision boundaries in a feature space.

In addition to simple linear models, I applied flexible models - a random forest and a radial-kernel support vector machine - with reputations for accurately modelling complex, non-linear relationships.

  • Poor parameter tuning

  • Outliers are known to strongly prejudice some models.

Data exploration (section 5) does indeed show the presence of unusual values. However, tree-based models such as the random forest are largely insensitive to outlying predictor values, so outliers alone are unlikely to explain the uniformly modest results.

Beyond these model-related factors, the fact that the key performance metrics are broadly similar across the very different types of models implemented here suggests the underperformance is most likely due to deficiencies in the data.

Data deficiencies

Data quality

Inaccurate or inconsistently recorded data can have dramatic effects on predictive models. I have no reason to suspect the quality of the data used here.

Feature space

My classifiers rely on a very narrow feature space to predict classes: just five input variables.

Context: for the KDD Cup 2009 data-mining competition, the telecommunications company Orange provided 230 features to serve as inputs to classification models predicting customer churn.

It is more than likely that our five features do not sufficiently explain the variation observed in the response.

My conclusion is that the models’ relatively poor performance is mainly due to insufficient data.


8. Conclusion

One of the key objectives of this series of papers is to motivate a data collection and management program at some of our subsidiaries.

The models presented here can be highly predictive. Some of them are highly sophisticated and capable of drawing complex decision boundaries to separate classes that are not linearly separable.

However, they require data to be effective - in most cases, the more the better. If input variables can be combined to explain variation in a response, it stands to reason that a larger number of informative inputs will lead to more accurate predictions.