This is the second in a series of four dynamic documents written to demonstrate the value of customer analytics to the senior management of a diversified holding company with subsidiaries in key emerging markets. The documents support the argument for significant investment in data and analytics capabilities as a means of driving revenue and profit growth through improved customer relationship management.
The key insights and recommendations are presented in a separate PowerPoint deck prepared for non-technical executives.
The papers implement advanced analytics methods that attempt to:

2. Build a model that predicts customer attrition (churn).
3. Build a model that predicts the spending level of customers predicted to remain loyal.
4. Estimate the lifetime value of each customer.
Analyses are performed on a very narrow dataset. Although this will likely restrict the performance of the predictive models, the dataset is representative of the limited data currently available within most of our subsidiaries.
This report focuses on the second task: building a binary classifier that predicts customer churn.
The aim is to predict, for each customer active in 2014, whether or not they will remain loyal in 2015.
Model performance target
There is currently no working predictive model to serve as a baseline against which to compare my models.
I shall therefore use the simplest possible model as a lower bound. In this case the null model is set to predict that all customers remain loyal: zero customer churn.
Performance metrics considered: area under the ROC curve (reported as ROC below), accuracy, sensitivity and specificity.
A good model should significantly outperform this null baseline on these metrics; statistical significance tests will be used to compare performance. In a live engagement a specific business performance target would also be agreed with stakeholders, but there is no such objective for this demonstration.
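To make the baseline concrete, the following sketch (an added illustration, not part of the original pipeline) spells out the metrics the null model achieves by construction, treating churn as the positive class:

## Metrics of the null model that labels every customer "loyal",
## for a given overall churn rate p_churn
null_model_metrics <- function(p_churn) {
  c(accuracy    = 1 - p_churn,  ## only the loyal majority is classified correctly
    sensitivity = 0,            ## no churner is ever flagged
    specificity = 1)            ## every loyal customer is correctly labelled
}
null_model_metrics(p_churn = 0.40)  ## the churn rate observed later in this report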
library(ggplot2)  ## plotting
library(dplyr)    ## data manipulation
library(pastecs)  ## descriptive statistics (stat.desc)
library(kernlab)  ## sigest() for the SVM tuning grid
library(caret)    ## model training, tuning and evaluation
cust_data <- read.delim(file = 'transactions.txt', header = FALSE, sep = '\t', dec = '.')
The dataset is proprietary customer data consisting of 51,243 observations across 3 features: customer_id, purchase_amount and date_of_purchase.
The data covers the period January 2nd 2005 to December 31st 2015.
The dataset is tidy and ready for analysis.
Data Preparation
Create additional variables:

- recency: time since last purchase (days)
- frequency: number of purchase transactions
- avg_amount: average amount spent
## Name columns
cust_data <- rename(cust_data, customer_id = V1, purchase_amount = V2, date_of_purchase = V3)
## Convert purchase date to R date object
cust_data$date_of_purchase <- as.Date(cust_data$date_of_purchase, "%Y-%m-%d")
## Create recency variable
cust_data$days_since <- as.numeric(difftime(time1 = "2016-01-01",
time2 = cust_data$date_of_purchase,
units = "days"))
## Create dataset for rfm analysis including frequency variable
customers <- cust_data %>%
group_by(customer_id) %>%
summarise(recency = min(days_since), frequency = n(),
avg_amount = mean(purchase_amount),
first_purchase = max(days_since),
max_amount = max(purchase_amount))
options(digits = 2, scipen = 99)
stat.desc(customers)
## customer_id recency frequency avg_amount first_purchase
## nbr.val 18417.00 18417.00 18417.000 18417.0 18417.00
## nbr.null 0.00 0.00 0.000 0.0 0.00
## nbr.na 0.00 0.00 0.000 0.0 0.00
## min 10.00 1.33 1.000 5.0 1.33
## max 264200.00 4014.33 45.000 4500.0 4016.33
## range 264190.00 4013.00 44.000 4495.0 4015.00
## sum NA 23083338.00 51243.000 1064373.4 36545649.00
## median 136430.00 1070.33 2.000 30.0 2087.33
## mean 137573.51 1253.37 2.782 57.8 1984.34
## SE.mean 512.16 7.97 0.022 1.1 8.35
## CI.mean.0.95 1003.88 15.62 0.042 2.2 16.37
## var 4830889404.55 1169507.86 8.625 23827.0 1284607.89
## std.dev 69504.60 1081.44 2.937 154.4 1133.41
## coef.var 0.51 0.86 1.056 2.7 0.57
## max_amount
## nbr.val 18417.0
## nbr.null 0.0
## nbr.na 0.0
## min 5.0
## max 4500.0
## range 4495.0
## sum 1266285.0
## median 30.0
## mean 68.8
## SE.mean 1.4
## CI.mean.0.95 2.8
## var 37759.5
## std.dev 194.3
## coef.var 2.8
frequency
The distribution is heavily right skewed. Most customers have made between 1 and 15 purchases. A few outliers can be found beyond this point, up to a maximum of 45.
ggplot(customers, aes(x = frequency)) +
geom_histogram(bins = 20, fill = "steelblue") +
theme_classic() +
labs(x = "frequency", title = "Purchase frequency") +
scale_x_continuous(breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45))
Almost 50% of customers have only ever made a single purchase.
dim(filter(customers, frequency == 1))[1]/dim(customers)[1]
## [1] 0.49
recency
Apart from a spike on the extreme left, the distribution is fairly uniform.
ggplot(customers, aes(x = recency)) +
geom_histogram(fill = "steelblue") +
theme_classic() +
labs(x = "recency", title = "Recency in Days") +
scale_x_continuous(breaks = c(0, 250, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000))
About 30% of customers have been active in the last year.
dim(filter(customers, recency < 365))[1]/dim(customers)[1]
## [1] 0.29
amount
The series has a mean of 58, with the middle 50% of values lying between 22 and 50. The mean is pulled above the median by the severe right skew caused by some extreme outliers.
summary(customers$avg_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 22 30 58 50 4500
ggplot(customers, aes(x = avg_amount)) +
geom_histogram(bins = 20, fill = "steelblue") +
theme_classic() +
labs(x = "amount", title = "average purchase amount")
ggplot(customers, aes(x = avg_amount)) +
geom_histogram(bins = 20, fill = "steelblue") +
theme_classic() +
labs(x = "amount (log10)", title = "Log transformed average purchase amount") +
scale_x_log10()
cust_data$year_of_purchase <- as.numeric(format(cust_data$date_of_purchase, "%Y"))
Number of customers active in each year and corresponding total annual revenues.
active <- cust_data %>%
group_by(year_of_purchase) %>%
summarise(active_customers = length(unique(customer_id)),
revenues = sum(purchase_amount))
print(active)
## Source: local data frame [11 x 3]
##
## year_of_purchase active_customers revenues
## (dbl) (int) (dbl)
## 1 2005 1225 82064
## 2 2006 2036 114010
## 3 2007 4112 230260
## 4 2008 3913 229854
## 5 2009 4423 256467
## 6 2010 4337 290117
## 7 2011 4123 303940
## 8 2012 5289 374963
## 9 2013 5187 401610
## 10 2014 4923 432665
## 11 2015 5398 478394
Number of customers
ggplot(data = active, aes(x = year_of_purchase, y = active_customers)) +
geom_bar(stat = "identity", fill = 'steelblue') +
theme_classic() +
labs(x = "Year", y = "No. of Customers", title = "Flat customer growth") +
scale_x_continuous(breaks = c(2007, 2010, 2013, 2015))
Total revenue
ggplot(data = active, aes(x = year_of_purchase, y = revenues/1000)) +
geom_line(colour = "darkred", size = 1.5) +
theme_gray() +
labs(x = "Year", y = "Revenues\n(thousands)", title = "Revenues")
The objective is to assign each customer to one of two categories (churn/no churn).
There are scores of algorithms that attempt to model a binary response from a given set of features. Model selection is a challenge that requires skill, experience and a fair bit of trial and error.
I will first set a performance ceiling by fitting some flexible, non-linear models. These models are generally less interpretable but often show better predictive performance than linear models. As the principal goal here is predictive accuracy, the performance of the more flexible models will serve as a useful guide.
Next I will try some simpler, more interpretable linear models to see if their performance is acceptably close to that of the non-linear models, as can be the case when the relationships between predictors and response are essentially linear.
The combination of accuracy and simplicity is ideal in a statistical model.
The ‘best’ model will then be compared to the baseline model described above. If my best model performs significantly better than the baseline, I will investigate the feasibility of further tuning to approach the Bayes classifier: the most accurate model possible given the available set of predictors.
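For reference, the Bayes classifier assigns each customer to the most probable class given its predictor values:

\[ C^{\mathrm{Bayes}}(x) = \arg\max_{k \in \{\mathrm{churn},\ \mathrm{loyal}\}} P(Y = k \mid X = x) \]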
Results are evaluated in section 7.
Models are fitted on a dataset made up of both customers active in 2014 and 2015 (did not churn) and those active in 2014 but not in 2015 (churned). Customers acquired in 2015 will therefore not be allowed to interfere with the analysis.
Customers active in 2014
active_2014 <- cust_data %>%
filter(year_of_purchase == 2014) %>%
group_by(customer_id) %>%
summarise(recency = min(days_since),
frequency = n(),
avg_amount = mean(purchase_amount),
first_purchase = max(days_since),
max_amount = max(purchase_amount))
summary(active_2014)
## customer_id recency frequency avg_amount
## Min. : 80 Min. :366 Min. : 1.0 Min. : 5
## 1st Qu.: 97840 1st Qu.:385 1st Qu.: 1.0 1st Qu.: 30
## Median :175080 Median :427 Median : 1.0 Median : 40
## Mean :153909 Mean :472 Mean : 1.2 Mean : 78
## 3rd Qu.:218295 3rd Qu.:569 3rd Qu.: 1.0 3rd Qu.: 60
## Max. :245840 Max. :720 Max. :15.0 Max. :4500
## first_purchase max_amount
## Min. :366 Min. : 5
## 1st Qu.:388 1st Qu.: 30
## Median :450 Median : 40
## Mean :494 Mean : 80
## 3rd Qu.:605 3rd Qu.: 60
## Max. :727 Max. :4500
Customers active in 2015
active_2015 <- cust_data %>%
filter(year_of_purchase == 2015) %>%
group_by(customer_id) %>%
summarise(recency = min(days_since),
frequency = n(),
avg_amount = mean(purchase_amount),
first_purchase = max(days_since),
max_amount = max(purchase_amount))
summary(active_2015)
## customer_id recency frequency avg_amount
## Min. : 80 Min. : 1 Min. : 1.0 Min. : 5
## 1st Qu.:104105 1st Qu.: 16 1st Qu.: 1.0 1st Qu.: 30
## Median :185495 Median : 59 Median : 1.0 Median : 40
## Mean :167782 Mean :100 Mean : 1.1 Mean : 79
## 3rd Qu.:246058 3rd Qu.:174 3rd Qu.: 1.0 3rd Qu.: 60
## Max. :264200 Max. :360 Max. :11.0 Max. :4500
## first_purchase max_amount
## Min. : 1 Min. : 5
## 1st Qu.: 24 1st Qu.: 30
## Median : 65 Median : 40
## Mean :120 Mean : 81
## 3rd Qu.:216 3rd Qu.: 65
## Max. :360 Max. :4500
New customers acquired in 2015
new_cust2015 <- customers %>%
filter(first_purchase <= 365)
Remove new customers from active_2015
old_2015 <- subset(active_2015, !(active_2015$customer_id %in% new_cust2015$customer_id))
Identify customers active in 2014 but not in 2015 - the churners.
churners <- anti_join(active_2014, old_2015, by = "customer_id")
Note that there are some old customers who were active in 2015 but not in 2014: they were active before 2014, skipped that year and returned in 2015. Those customers are excluded from the model data.
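As an added sanity check, these returning customers can be identified directly:

## Customers active in 2015 (excluding new acquisitions) with no 2014
## activity: the returners excluded from the model data
returning <- anti_join(old_2015, active_2014, by = "customer_id")
nrow(returning)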
Label customers in active_2014 to create the binary response variable \(y \in \{\mathrm{churn},\ \mathrm{loyal}\}\).
Dataset used for model fitting
model_data <- mutate(active_2014,
churn = ifelse(active_2014$customer_id %in% churners$customer_id,
"churn", "loyal"))
model_data$churn <- factor(model_data$churn)
Set customer_id as row names and remove the customer_id variable.
row.names(model_data) <- model_data$customer_id
model_data <- model_data[,-1]
Response variable: churn

Predictors: recency, first_purchase, frequency, avg_amount, max_amount
High churn rate: 40%
table(model_data$churn)
##
## churn loyal
## 1958 2965
dim(filter(model_data, churn == "churn"))[1]/dim(model_data)[1]
## [1] 0.4
Partition data into training and test sets
Create balanced data splits
set.seed(11)
trainIndex <- createDataPartition(model_data$churn, p = .75,
list = FALSE,
times = 1)
head(trainIndex)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
## [6,] 7
train_data <- model_data[ trainIndex,]
test_data <- model_data[-trainIndex,]
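A quick check, added for illustration, confirms that the stratified split preserves the roughly 40% churn rate in both sets:

## Class proportions in the training and test sets
prop.table(table(train_data$churn))
prop.table(table(test_data$churn))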
Non-linear models
Random forests
Cross validation settings

## 10-fold cross validation repeated 5 times; twoClassSummary returns
## ROC, sensitivity and specificity computed from class probabilities
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE)
Model fit
set.seed(12)
rF.fit <- train(churn ~.,
data = train_data,
method = "rf",
metric = "ROC",
trControl = ctrl)
Support Vector Machines
Tuning grid: estimate a plausible range for the kernel parameter sigma with sigest() and cross it with a range of cost values.
set.seed(13)
sigmaRange <- sigest(churn ~., data = train_data)
svmRGrid <- expand.grid(.sigma = sigmaRange,
.C = 2^(seq(-4, 4)))
Model fit
set.seed(12)
svm.fit <- train(churn ~., data = train_data,
method = "svmRadial",
metric = "ROC",
preProc = c("center", "scale"),
tuneGrid = svmRGrid,
fit = FALSE,
trControl = ctrl)
K-Nearest Neighbour
Model fit
set.seed(12)
knn.fit <- train(churn ~ ., data = train_data, method = "knn",
trControl = ctrl, metric = "ROC", tuneLength = 20,
preProc = c("center", "scale"))
Linear models
Cross validation settings for both linear models

## LGOCV: repeated leave-group-out cross validation (by default, 25
## repeats of a 75/25 training/holdout split)
ctrl_lM <- trainControl(method = "LGOCV",
summaryFunction = twoClassSummary,
classProbs = TRUE)
Generalized linear model (logistic regression)
set.seed(12)
lM.fit <- train(churn ~ .,
data = train_data,
method = "glm",
metric = "ROC",
trControl = ctrl_lM)
Linear discriminant analysis
set.seed(12)
lda.fit <- train(churn ~ .,
data = train_data,
method = "lda",
metric = "ROC",
preProc = c("center", "scale"),
trControl = ctrl_lM)
Compare non-linear models
model_results1 <- resamples(list(rF = rF.fit,
svm = svm.fit,
knn = knn.fit))
summary(model_results1)
##
## Call:
## summary.resamples(object = model_results1)
##
## Models: rF, svm, knn
## Number of resamples: 50
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rF 0.56 0.61 0.62 0.62 0.65 0.67 0
## svm 0.55 0.60 0.62 0.62 0.64 0.70 0
## knn 0.57 0.62 0.63 0.63 0.64 0.68 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rF 0.40 0.45 0.48 0.48 0.52 0.56 0
## svm 0.16 0.19 0.21 0.21 0.23 0.29 0
## knn 0.30 0.35 0.37 0.37 0.39 0.46 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rF 0.64 0.68 0.70 0.71 0.73 0.80 0
## svm 0.83 0.87 0.88 0.88 0.90 0.94 0
## knn 0.72 0.78 0.80 0.79 0.81 0.84 0
Visualize results
bwplot(model_results1,
fill = "red",
main = "Comparative performance of non-linear models")
dotplot(model_results1,
fill = "red",
main = "Comparative performance of non-linear models")
Compare linear models
model_results2 <- resamples(list(glm = lM.fit,
lda = lda.fit))
summary(model_results2)
##
## Call:
## summary.resamples(object = model_results2)
##
## Models: glm, lda
## Number of resamples: 25
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.57 0.6 0.61 0.61 0.62 0.64 0
## lda 0.57 0.6 0.60 0.60 0.62 0.64 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.052 0.131 0.16 0.16 0.18 0.26 0
## lda 0.030 0.071 0.09 0.11 0.15 0.22 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.83 0.87 0.88 0.89 0.93 0.96 0
## lda 0.86 0.90 0.92 0.92 0.95 0.97 0
bwplot(model_results2,
fill = "red",
main = "Comparative performance of linear models")
Apply selected model to the test set
The random forest is selected and used to predict on the held-out test set.
rFPred.test <- predict(rF.fit, test_data)
Confusion Matrix
cMatrix <- confusionMatrix(rFPred.test, test_data$churn)
print(cMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction churn loyal
## churn 207 232
## loyal 282 509
##
## Accuracy : 0.582
## 95% CI : (0.554, 0.61)
## No Information Rate : 0.602
## P-Value [Acc > NIR] : 0.9310
##
## Kappa : 0.112
## Mcnemar's Test P-Value : 0.0307
##
## Sensitivity : 0.423
## Specificity : 0.687
## Pos Pred Value : 0.472
## Neg Pred Value : 0.643
## Prevalence : 0.398
## Detection Rate : 0.168
## Detection Prevalence : 0.357
## Balanced Accuracy : 0.555
##
## 'Positive' Class : churn
##
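As a complementary check, the test-set AUC can be computed from the predicted class probabilities. This is a sketch assuming the pROC package is installed; it is not used elsewhere in this report.

library(pROC)
## Test-set ROC curve built from the predicted probability of churn
rf_test_probs <- predict(rF.fit, test_data, type = "prob")
roc(response = test_data$churn, predictor = rf_test_probs$churn)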
Attach class probabilities and assignments to active 2015 dataset
Store class probabilities
rFProbs <- predict(rF.fit, active_2015, type = "prob")
Attach probabilities to active 2015 dataframe
active_2015$churn_prob <- rFProbs$churn
Attach class assignments to active 2015 dataframe
active_2015$class <- predict(rF.fit, active_2015)
head(active_2015[,-c(5,6)])
## Source: local data frame [6 x 6]
##
## customer_id recency frequency avg_amount churn_prob class
## (int) (dbl) (int) (dbl) (dbl) (fctr)
## 1 80 343.3 1 80 0.114 loyal
## 2 480 21.3 1 45 0.296 loyal
## 3 830 321.3 1 50 0.004 loyal
## 4 850 24.3 2 30 0.340 loyal
## 5 860 237.3 1 60 0.030 loyal
## 6 1020 1.3 2 60 0.068 loyal
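To illustrate how these scores would be used operationally, customers can be ranked by predicted churn probability so that retention efforts target the highest-risk accounts first (an added sketch):

## Top 10 highest-risk active customers by predicted churn probability
active_2015 %>%
  arrange(desc(churn_prob)) %>%
  select(customer_id, churn_prob, class) %>%
  head(10)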
Although the results vary modestly across the different models, none of these classifiers satisfies the performance criteria set out in section 2.
Of my 5 classifiers, the random forest model performs the best overall, although the knn and support vector machine models each perform better on one metric or another.
The linear models perform similarly to the non-linear ones on ROC but very poorly on sensitivity, arguably the most important metric in a context such as ours.
Results are summarized in the tables and plots above.
The null model classifies all customers as loyal, predicting zero churn. Given our known churn rate of 40%, this model has an overall error rate of 40%.
The random forest model has a cross validation error rate of 38%, barely below the null model's.
rF.fit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 38%
## Confusion matrix:
## churn loyal class.error
## churn 696 773 0.53
## loyal 638 1586 0.29
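The accuracy shortfall can be tested formally. The following sketch, added for illustration, reproduces caret's reported P-Value [Acc > NIR] from the confusion-matrix counts above:

## One-sided binomial test of test-set accuracy (716 of 1230 correct)
## against the no-information rate of 0.602
binom.test(x = 207 + 509, n = 207 + 232 + 282 + 509,
p = 0.602, alternative = "greater")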
However, it is important to note that the null model, always predicting that no customer churns, has a sensitivity of 0 compared to the random forest model's 42.3% on the test set. Nevertheless, 42.3% sensitivity, despite being much better than the null model, is not very impressive in this context.
My best model slightly outperforms the null model. It offers modest value beyond the status quo.
This modest performance could be due to deficiencies either in the models or in the data.
Possible model deficiencies
Classifiers could perform poorly due to a number of model-related reasons:

- Inappropriate model type. In addition to simple linear models, I applied some flexible models, such as a tree-based model and a support vector machine, with reputations for accurately modelling complex, non-linear relationships.
- Poor parameter tuning. Model parameters were tuned by cross validation over explicit tuning grids; see the check after this list.
- Outliers, which are known to strongly prejudice some models. Data exploration (part 5) does indeed show the presence of unusual values. However, tree-based models are relatively robust to outliers, and the predictors were centred and scaled for the distance-based models.
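As a quick, illustrative check on the tuning point, the random forest's cross-validated performance can be inspected across its tuning grid:

## Resampled performance across the mtry grid; a flat profile suggests
## parameter tuning is not the bottleneck
rF.fit$results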
Beyond the mitigating steps described above, the fact that the key performance metrics are broadly similar across the very different types of models implemented here suggests the underperformance is most likely due to deficiencies in the data.
Data deficiencies
Data quality
Inaccurate or inconsistently recorded data can have dramatic effects on predictive models. I have no reason to suspect the quality of the data used here.
Feature space
My classifiers rely on a very narrow feature space: just five input variables.
Context: for the KDD Cup 2009 data mining competition, the French telecommunications company Orange provided 230 features to serve as inputs in classification models predicting customer churn.
It is more than likely that our five features do not sufficiently explain the variation observed in the response.
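One way to probe this, added here as an illustrative check, is the fitted random forest's variable importance, which indicates how much each of the few available predictors contributes:

## Relative importance of the predictors in the fitted random forest
## (scaled to 100 by caret)
varImp(rF.fit)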
My conclusion is that the models’ relatively poor performance is mainly due to insufficient data.
One of the key objectives of this series of papers is to motivate a data collection and management program at some of our subsidiaries.
The models presented here can be highly predictive. Some of them are highly sophisticated and capable of drawing complex decision boundaries to assign non-linear data to separate classes.
However, they require data to be effective, and in most cases the more the better. If input variables can be combined to explain variation in a response, it stands to reason that a larger number of informative inputs will generally lead to more accurate predictions.