library(SMCRM) # CRM data
library(dplyr) # data wrangling
library(tidyr) # data wrangling
library(ggplot2) # plotting
library(survival) # survival
library(rpart) # DT
library(randomForestSRC) # RF
library(tidyverse)
library(tree)
library(Metrics)
library(caret)
library(car)
library(kernlab)
library(MASS)
library(performance)
library(ROCR)
library(lmtest)
library(randomForest)
library(corrplot)
theme_nice = theme_classic()+
theme(
axis.line.y.left = element_line(colour = "black"),
axis.line.y.right = element_line(colour = "black"),
axis.line.x.bottom = element_line(colour = "black"),
axis.line.x.top = element_line(colour = "black"),
axis.text.y = element_text(colour = "black", size = 12),
axis.text.x = element_text(color = "black", size = 12),
axis.ticks = element_line(color = "black")) +
theme(
axis.ticks.length = unit(-0.25, "cm"),
axis.text.x = element_text(margin=unit(c(0.5,0.5,0.5,0.5), "cm")),
axis.text.y = element_text(margin=unit(c(0.5,0.5,0.5,0.5), "cm")))The goal of this case study is to explore and model the behavior of the acquiring process of a customer and prevent them from abandoning the company. This report will explore different predictive models for managing customer retention and acquisition for later developing and maintaining customer relationships.
For this task we build three different predictive models: Random Forest, Decision Trees and Logistic Regression. All three are used for the classification problem (customer acquisition), and Random Forest is also used for the regression problem (the duration of the customer's relationship with the company). The report therefore covers both types of models to predict which customers will be acquired and for how long (duration).
Somewhat surprisingly, our case study yielded an accuracy of 83% for Logistic Regression, outperforming Random Forest (77%) and Decision Trees (71%).
For the interpretation of the results we also use Partial Dependence Plots (PDPs) to visualize how the predictors behave.
In a competitive business world, managing customer retention and acquisition is essential to developing and maintaining customer relationships and revenue. Retaining a customer is cost efficient, as keeping a customer costs 5 to 25 times less than acquiring a new one. Identifying customers who are likely to leave the company is essential in order to plan actions that encourage customers to stay and increase their loyalty. However, maximizing revenue and growing the market base also requires efficient customer acquisition in addition to retention strategies. Models that accurately predict customer retention and acquisition are pivotal in targeting the right customers, thereby decreasing the cost of marketing campaigns and using scarce firm resources more efficiently. The “acquisitionRetention” dataset from the SMCRM package is used in this case study.
The main objective of this case study is to build an accurate machine learning (ML) model by comparing Random Forest, Decision Tree and Logistic Regression models. More specifically, the project aims to predict which prospective customers can be acquired and the duration they are likely to stay with the company. Variable importance will be computed to detect interactions and to optimize hyperparameters for the acquired customers.
Customer retention is one of the main concerns of marketing. How customer retention is understood and applied is highly valuable for companies, so in order to apply marketing principles in practice it is important to ground customer retention theoretically and assess it empirically. A customer is acquired by the company as part of the ambition to grow the business, and acquired customers form the base of customer retention: without any customers, there is no churn to prevent or value to enhance (Buttle, 2009).
Managing customer retention is divided into three parts: retained customers, at-risk customers and lost customers. The way each of these customer groups is handled and managed differs (Griffin & Lowenstein, 2001). Retained customers keep re-purchasing, acquiring new services and referring the services to their inner circle more than other customers do (Griffin & Lowenstein, 2001; Peppers & Rogers, 2011). However, even retained customers tend, at some point, to show indicators of churning, so their activity is monitored and nurtured to lengthen the customer relationship. At-risk customers are flagged as a separate group for which re-activation programs are executed in order to stabilize the endangered relationship (Griffin & Lowenstein, 2001). Lost customers have either silently churned without notifying the company or actively terminated their relationship with it. These lost customers are evaluated based on their future value to the company: the unprofitable ones are left to churn, while the potential ones are moved to a win-back program (Griffin & Lowenstein, 2001).
Exploratory data analysis was performed to understand the data set. The problem for this project has two parts. First, we perform a classification task to determine whether a prospective customer is acquired or not. Then we determine the duration (in number of days) an acquired customer remains a customer. Predictions are performed using Random Forest, Decision Tree and Logistic Regression.
A decision tree is a supervised learning algorithm that can be used for both regression and classification problems. The data is split into two or more homogeneous sets based on the most significant input variables. In a decision tree the root node represents the entire population or sample and is split further into sub-nodes. Decision nodes are points where a split decision is made, and nodes that cannot be split further are called terminal nodes.
In order to build a regression tree, recursive binary splitting is used to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations. Recursive binary splitting chooses splits that minimize the residual sum of squares (RSS). Cost complexity pruning can then be applied to the large tree to obtain a sequence of best sub-trees, as a function of the penalty parameter α, in order to reduce overfitting. Classification trees are used to predict a qualitative response rather than a quantitative one: while in a regression tree the predicted response for an observation is the mean response of the training observations in the same terminal node, in a classification tree each observation is assigned the most commonly occurring class among the training observations in its region. The main advantage of decision trees is that they are easy to understand, with a very intuitive graphical representation. This graphical advantage is also useful in data exploration, as it helps identify the most significant variables and the relationships between two or more variables. Moreover, trees require less data cleaning, handle both numerical and categorical variables, and require no assumptions about the space distribution or the classifier structure. The main disadvantages of decision trees are their susceptibility to overfitting and their lack of robustness, as small changes in the data can have a significant impact on the resulting splits.
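To make the pruning step concrete, the sketch below (illustrative only; df and y are generic placeholders for a data frame and a numeric response, not objects from this case study) grows a deliberately large regression tree with rpart and then keeps the sub-tree with the lowest cross-validated error:

## Illustrative sketch: grow a large regression tree, then apply cost complexity pruning
## (df and y are placeholders, not objects used in this case study)
big_tree = rpart(y ~ ., data = df,
                 control = rpart.control(cp = 0.001, minsplit = 10)) # grow a deliberately large tree
printcp(big_tree) # cross-validated error for each value of the complexity parameter
best_cp = big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned_tree = prune(big_tree, cp = best_cp) # best sub-tree for that penalty value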
Random forest is a tree-based algorithm that builds several decision trees and combines their output to improve the generalization ability of the model. Combining trees in this way is known as an ensemble method: weak learners (the individual trees) are combined to produce a strong learner. It is a versatile machine learning method capable of performing both regression and classification tasks. It also supports dimensionality reduction, handles missing values and outliers, and does a fairly good job across the usual steps of data exploration. Random Forest is effective at estimating missing data and maintaining accuracy when a large proportion of the data is missing. It can also balance errors in datasets where the classes are imbalanced, and it can handle large datasets with high dimensionality. The main disadvantages of the Random Forest model are that it is difficult to interpret, it can easily overfit on noisy datasets, and it can take a long time to compute a large number of trees. Variable importance and partial dependence plots will be used to determine the most important variables and the interactions between predictors.
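As a minimal sketch of the ensemble idea (again with df as a placeholder data frame, y as a placeholder factor response and x1 as a placeholder predictor, none of which are objects from this case study), a random forest aggregates many trees grown on bootstrap samples, and the out-of-bag (OOB) observations give an internal error estimate:

## Minimal sketch of a random forest ensemble (df, y and x1 are placeholders)
rf_sketch = randomForest(y ~ ., data = df,
                         ntree = 1000,      # number of trees combined in the ensemble
                         importance = TRUE) # permutation-based variable importance
rf_sketch$err.rate[nrow(rf_sketch$err.rate), ] # OOB error estimate after the last tree
varImpPlot(rf_sketch)                          # which predictors matter most
partialPlot(rf_sketch, df, x.var = "x1")       # partial dependence on the placeholder predictor x1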
The first step in analyzing how customer retention behaves for the company is to predict which customers have a high probability of terminating their relationship with the firm, as well as the probability of acquiring a new customer. Note that for both response variables, “acquisition” and “duration”, a value of 0 means that the prospective customer was not acquired (acquisition) and that the customer stayed zero days with the company (duration).
The data set is part of the SMCRM package and contains 500 observations with 2 response variables, “duration” and “acquisition”, and 13 numerical predictors. These predictors include expenditures for acquiring and retaining customers, the number of purchases, the number of product categories purchased, whether the customer is in the B2B (business-to-business) industry, share of wallet, number of employees, the revenue of the prospective firm, and its profit (customer lifetime value).
Besides the data inconsistency for the variable sow, which has 5 observations with values greater than the maximum possible 100%, the data set was fairly clean. There were no missing values, and column data types were adjusted wherever necessary.
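Once the data set is loaded, a quick check along the lines of the sketch below confirms the count of out-of-range share-of-wallet values:

## Count share-of-wallet observations above the theoretical 100% maximum
sum(acquisitionRetention$sow > 100, na.rm = TRUE)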
As for data limitations, a couple of variables were perfectly correlated with the response and led to perfect separation; this is explained further in the report.
data("acquisitionRetention")str(acquisitionRetention)## 'data.frame': 500 obs. of 15 variables:
## $ customer : num 1 2 3 4 5 6 7 8 9 10 ...
## $ acquisition: num 1 1 1 0 1 1 1 1 0 0 ...
## $ duration : num 1635 1039 1288 0 1631 ...
## $ profit : num 6134 3524 4081 -638 5446 ...
## $ acq_exp : num 694 460 249 638 589 ...
## $ ret_exp : num 972 450 805 0 920 ...
## $ acq_exp_sq : num 480998 211628 62016 407644 346897 ...
## $ ret_exp_sq : num 943929 202077 648089 0 846106 ...
## $ freq : num 6 11 21 0 2 7 15 13 0 0 ...
## $ freq_sq : num 36 121 441 0 4 49 225 169 0 0 ...
## $ crossbuy : num 5 6 6 0 9 4 5 5 0 0 ...
## $ sow : num 95 22 90 0 80 48 51 23 0 0 ...
## $ industry : num 1 0 0 0 0 1 0 1 0 1 ...
## $ revenue : num 47.2 45.1 29.1 40.6 48.7 ...
## $ employees : num 898 686 1423 181 631 ...
We replaced the “sow” values above 100% with the median of the observations. We used the median instead of the mean due to the skewness of the data.
## Replace sow > 100 value with median
acquisitionRetention = acquisitionRetention %>%
  mutate(sow = ifelse(sow > 100, median(sow, na.rm=T), sow))

Exploratory Analysis
The exploratory data analysis shows that the majority of the predictors have a strong correlation with the response variables. When the customer is not acquired, profit, which represents the total lifetime value of the customer, is negative and equal in magnitude to the acquisition expense (acq_exp). Similarly, when duration is equal to zero, ret_exp, freq, crossbuy and sow are zero as well, so these variables perfectly predict whether the prospective customer is acquired or not. This is visible in the variable importance plots of the Random Forest model, where these variables have the highest importance. They were therefore excluded from the prediction models. Running the find.interaction function of the rfsrc package on the duration model of the acquired customers reveals a strong association between the variables acq_exp and employees. This interaction was added to the final hyper-tuned Random Forest, Decision Tree and Logistic Regression models.
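For reference, the interaction screen mentioned above can be run roughly as in the sketch below; acquired_df and rf_dur are placeholder names for the subset of acquired customers and a duration forest fitted on it, since only the conclusion of this step is reported here.

## Sketch of the pairwise interaction screen on a duration forest
## (acquired_df and rf_dur are placeholder names, not objects used elsewhere in this report)
acquired_df = subset(acquisitionRetention, acquisition == 1)
rf_dur = rfsrc(duration ~ acq_exp + industry + revenue + employees,
               data = acquired_df, ntree = 1000)
find.interaction(rf_dur, method = "vimp", nrep = 3) # joint vs. additive importance of variable pairs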
corrplot(cor(acquisitionRetention [,-c(1,7,8,10)]), method = "number", diag = F,
addgrid.col = "blue", number.cex = 0.8, tl.col = "Blue", outline = "blue")## Bar plot to check the relationship between selected predictors and response, acquisition
par(mfrow = c(3, 2))
boxplot(duration ~ acquisition, data = acquisitionRetention, xlab = "acquisition", ylab = "duration")
boxplot(profit~ acquisition, data = acquisitionRetention, xlab = "acquisition", ylab = "profit")
boxplot(ret_exp~ acquisition, data = acquisitionRetention, xlab = "acquisition", ylab = "ret_exp")
boxplot(freq~ acquisition, data = acquisitionRetention, xlab = "acquisition", ylab = "freq")
boxplot(crossbuy~ acquisition, data = acquisitionRetention, xlab = "acquisition", ylab = "crossbuy")
boxplot(sow ~ acquisition, data = acquisitionRetention, xlab = "acquisition", ylab = "sow")

We subset the customers where acquisition and duration equal 0 to see how the predictor variables behave when the customer was not acquired and therefore had no duration with the company. We can see that the predictors ret_exp, freq, crossbuy and sow are all 0 as well.
subset = acquisitionRetention[,c(1:6,9,11,12)] %>% subset(acquisition==0)
head(subset,9)
## customer acquisition duration profit acq_exp ret_exp freq crossbuy sow
## 4 4 0 0 -638.47 638.47 0 0 0 0
## 9 9 0 0 -284.96 284.96 0 0 0 0
## 10 10 0 0 -292.54 292.54 0 0 0 0
## 15 15 0 0 -610.43 610.43 0 0 0 0
## 16 16 0 0 -397.93 397.93 0 0 0 0
## 17 17 0 0 -615.95 615.95 0 0 0 0
## 21 21 0 0 -400.60 400.60 0 0 0 0
## 22 22 0 0 -616.10 616.10 0 0 0 0
## 26 26 0 0 -440.27 440.27 0 0 0 0
The following plot explores other visible relationships between the predictors, such as employees, and our response variable duration.
acquisitionRetention %>%
ggplot(aes(y=duration, x=customer, color=profit, size=employees)) +
geom_point(alpha=0.5) +
labs(x="Observations", y="Duration of customer (days)", title="Days the customer is with the company VS the profit obtained")We can see for sure that there is expected to have more profit the more the customer is with the company, but with the number of employees, this relationship is not clear.
Next we check for how many observations the duration of the customer-company relationship is 0, meaning there is no retention at all: out of 500 observations, 162 clients did not stay with the company.
sum(acquisitionRetention$duration == 0, na.rm=TRUE)
## [1] 162
We verify that those 162 clients correspond to the prospects who were never acquired and therefore never became customers.
sum(acquisitionRetention$acquisition == 0, na.rm=TRUE)
## [1] 162
We also want to examine the relationship between customer duration and the retention expenditure, and especially whether using the squared columns would be more helpful.
par(mfrow = c(1,2))
plot(acquisitionRetention$ret_exp, acquisitionRetention$duration)
plot(acquisitionRetention$ret_exp_sq, acquisitionRetention$duration)
par(mfrow = c(1,1))

In the following lines of code we split our dataset into train and test sets.
set.seed(123)
idx.train = sample(1:nrow(acquisitionRetention), size = 0.7 * nrow(acquisitionRetention))
train.df = acquisitionRetention[idx.train,]
test.df = acquisitionRetention[-idx.train,]
acquisitionRetention$industry = as.factor(acquisitionRetention$industry)
set.seed(123)
rf_model0 = rfsrc(as.factor(acquisition) ~ profit +
acq_exp +
ret_exp +
freq +
crossbuy +
sow +
industry +
revenue +
employees,
data = acquisitionRetention,
importance = TRUE,
ntree = 4000)
set.seed(123)
rf_model00 = randomForest(as.factor(acquisition) ~ profit +
acq_exp +
ret_exp +
freq +
crossbuy +
sow +
industry +
revenue +
employees,
data = acquisitionRetention,
importance = TRUE,
ntree = 4000)

After running this first Random Forest model we would like to get some insights from it, so that we understand the data set before training the tuned model, applying it to the test set and generating predictions.
1. Variable Importance
plot(rf_model0, m.target = "NULL",
     plots.one.page = TRUE, sorted = TRUE, verbose = F)

2. Minimal depth
mindepth = max.subtree(rf_model0,
sub.order = TRUE)
# first order depths
print(round(mindepth$order, 3)[,1])
## profit acq_exp ret_exp freq crossbuy sow industry revenue
## 1.253 1.654 1.290 1.215 1.200 1.209 1.662 1.649
## employees
## 1.536
Below we examine the interactions among all the variables using the pairwise minimal depth matrix.
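A simple way to look at these pairwise relationships is to print the normalized minimal depth matrix returned by max.subtree, as in the sketch below (the optional corrplot view is a stylistic choice, not part of the original output):

## Pairwise (second order) minimal depth: lower values suggest a stronger
## interaction between the corresponding pair of variables
round(as.matrix(mindepth$sub.order), 3)
# corrplot(as.matrix(mindepth$sub.order), is.corr = FALSE) # optional heat-map style view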
3. Partial dependence
par(mfrow = c(2,2))
partialPlot(rf_model00, acquisitionRetention, x.var = ret_exp, plot = TRUE, add = FALSE,
            rug = TRUE, xlab = "ret_exp", ylab = "",
            main = paste("Partial Dependence on ret_exp"))
partialPlot(rf_model00, acquisitionRetention, x.var = acq_exp, plot = TRUE, add = FALSE,
            rug = TRUE, xlab = "acq_exp", ylab = "",
            main = paste("Partial Dependence on acq_exp"))
partialPlot(rf_model00, acquisitionRetention, x.var = revenue, plot = TRUE, add = FALSE,
            rug = TRUE, xlab = "revenue", ylab = "",
            main = paste("Partial Dependence on revenue"))

We will run three different models: Random Forest, Logistic Regression and Decision Tree.
## Tuning forest hyper-parameters for predictive accuracy
mtry.values = seq(2,6,1)
nodesize.values = seq(2,8,2)
ntree.values = seq(1e3,6e3,1e3)
hyper_grid = expand.grid(mtry = mtry.values, nodesize = nodesize.values, ntree = ntree.values) #df with all combinations
oob_err = c()

for (i in 1:nrow(hyper_grid)) {
set.seed(123)
# Train a Random Forest model
model = randomForest(as.factor(acquisition) ~ acq_exp +
industry +
revenue +
employees,
data = train.df,
mtry = hyper_grid$mtry[i],
nodesize = hyper_grid$nodesize[i],
ntree = hyper_grid$ntree[i])
# Store OOB error for the model
  oob_err[i] = model$err.rate[nrow(model$err.rate), "OOB"]
}
# Identify optimal set of hyperparmeters based on OOB error
opt_i = which.min(oob_err)
print(hyper_grid[opt_i,])
## mtry nodesize ntree
## 1 2 2 1000
## Build randomforest model using best hyperparameter
set.seed(123)
rf_model01 <- randomForest(as.factor(acquisition) ~ acq_exp +
industry +
revenue +
employees,
data = train.df,
mtry=2,
ntree=1000,
nodesize=2,
importance=TRUE)

partialPlot(rf_model01, train.df, x.var = employees, plot = TRUE, add = FALSE,
            rug = TRUE, xlab = "employees", ylab = "",
            main = paste("Partial Dependence on employees"))

## Predict acquisitions of customers (classification)
test.df$pred_acq <- predict(rf_model01, newdata = test.df) # predict class
confusionMatrix.rf <- caret::confusionMatrix(as.factor(test.df$pred_acq), as.factor(test.df$acquisition))
confusionMatrix.rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 9
## 1 26 84
##
## Accuracy : 0.7667
## 95% CI : (0.6907, 0.8318)
## No Information Rate : 0.62
## P-Value [Acc > NIR] : 9.448e-05
##
## Kappa : 0.4745
##
## Mcnemar's Test P-Value : 0.006841
##
## Sensitivity : 0.5439
## Specificity : 0.9032
## Pos Pred Value : 0.7750
## Neg Pred Value : 0.7636
## Prevalence : 0.3800
## Detection Rate : 0.2067
## Detection Prevalence : 0.2667
## Balanced Accuracy : 0.7235
##
## 'Positive' Class : 0
##
accuracy.rf <- confusionMatrix.rf$overall[1]

We now fit a duration model using only the subset of customers predicted to be acquired, in order to check the interaction of the variables.
rf.acq_df = test.df %>% subset(pred_acq == 1)
set.seed(123)
rf_model02 = rfsrc(duration ~ acq_exp +
industry +
revenue +
employees,
data = train.df,
mtry = 2,
                 nodesize = 5,
ntree = 1000)
rf_model02
## Sample size: 350
## Number of trees: 1000
## Forest terminal node size: 5
## Average no. of terminal nodes: 45.6
## No. of variables tried at each split: 2
## Total no. of variables: 4
## Resampling used to grow trees: swor
## Resample size used to grow trees: 221
## Analysis: RF-R
## Family: regr
## Splitting rule: mse *random*
## Number of random split points: 10
## (OOB) R squared: 0.28293978
## (OOB) Requested performance error: 203772.52122159
rf.acq_df$pred_duration = predict(rf_model02, rf.acq_df)$predicted

Now we include the interaction in the duration model and apply it to the subset of predicted “acquired customers”.
New Random Forest model to predict duration, including interactions
set.seed(123)
rf_model03 = rfsrc(duration ~ acq_exp*employees +
acq_exp +
industry +
revenue +
employees,
data = train.df,
mtry = 2,
nodesize = 2,
ntree = 4000)
rf_model03
## Sample size: 350
## Number of trees: 4000
## Forest terminal node size: 2
## Average no. of terminal nodes: 100.1942
## No. of variables tried at each split: 2
## Total no. of variables: 4
## Resampling used to grow trees: swor
## Resample size used to grow trees: 221
## Analysis: RF-R
## Family: regr
## Splitting rule: mse *random*
## Number of random split points: 10
## (OOB) R squared: 0.26537034
## (OOB) Requested performance error: 208765.3635088
## Predict duration
rf.acq_df$rf.duration.pred = predict(rf_model03, rf.acq_df)$predicted

rf.abs_error = abs(rf.acq_df$duration - rf.acq_df$rf.duration.pred)
rf.mae = mean(rf.abs_error)
rf.mae
## [1] 377.5252
Fit logistic regression with acq_exp_sq and interaction term between acq_exp and employees
glm_model0 = glm(acquisition ~ acq_exp_sq + acq_exp:employees +
                   industry +
                   revenue +
                   employees,
                 data = train.df) # no family argument is supplied, so glm() uses the gaussian default here
summary(glm_model0)
##
## Call:
## glm(formula = acquisition ~ acq_exp_sq + acq_exp:employees +
## industry + revenue + employees, data = train.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.83937 -0.24914 0.05711 0.27690 0.71840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.862e-02 1.242e-01 -0.230 0.818
## acq_exp_sq -1.490e-06 2.697e-07 -5.524 6.54e-08 ***
## industry 2.071e-01 3.895e-02 5.317 1.90e-07 ***
## revenue 1.106e-02 1.952e-03 5.666 3.10e-08 ***
## employees -2.742e-04 2.075e-04 -1.321 0.187
## acq_exp:employees 2.314e-06 3.989e-07 5.801 1.50e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1306472)
##
## Null deviance: 73.500 on 349 degrees of freedom
## Residual deviance: 44.943 on 344 degrees of freedom
## AIC: 288.87
##
## Number of Fisher Scoring iterations: 2
# Compute optimal cutoff
PredProb = prediction(predict.lm(glm_model0, newdata = test.df, type = "response"), as.factor(test.df$acquisition))
# Computing threshold for cutoff to best trade off sensitivity and specificity
plot(unlist(performance(PredProb,'sens')@x.values),
unlist(performance(PredProb,'sens')@y.values),
type='l', lwd=2, ylab = "", xlab = 'Cutoff')
mtext('Sensitivity',side=2)
mtext('Sensitivity vs. Specificity Plot', side=3)
# Second specificity in same plot
par(new=TRUE)
plot(unlist(performance(PredProb,'spec')@x.values),
unlist(performance(PredProb,'spec')@y.values),
type='l', lwd=2,col='red', ylab = "", xlab = 'Cutoff')
axis(4,at=seq(0,1,0.2))
mtext('Specificity',side=4, col='red')
par(new=TRUE)
min.diff <-which.min(abs(unlist(performance(PredProb, "sens")@y.values) - unlist(performance(PredProb, "spec")@y.values)))
min.x<-unlist(performance(PredProb, "sens")@x.values)[min.diff]
min.y<-unlist(performance(PredProb, "spec")@y.values)[min.diff]
optimal <-min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x, 0, paste("optimal threshold=", round(optimal, 5)), pos = 3)

## Predict classes and convert probabilities to binary
logr_prob = predict(glm_model0, newdata = test.df, type = "response")
logr_pred <- ifelse(logr_prob >= optimal, 1, 0)

## Check the accuracy of logistic model
confusionMatrix.glm <- confusionMatrix(as.factor(logr_pred), as.factor(test.df$acquisition ), positive = '1')
confusionMatrix.glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 16
## 1 10 77
##
## Accuracy : 0.8267
## 95% CI : (0.7564, 0.8835)
## No Information Rate : 0.62
## P-Value [Acc > NIR] : 3.017e-08
##
## Kappa : 0.6395
##
## Mcnemar's Test P-Value : 0.3268
##
## Sensitivity : 0.8280
## Specificity : 0.8246
## Pos Pred Value : 0.8851
## Neg Pred Value : 0.7460
## Prevalence : 0.6200
## Detection Rate : 0.5133
## Detection Prevalence : 0.5800
## Balanced Accuracy : 0.8263
##
## 'Positive' Class : 1
##
accuracy.glm <- confusionMatrix.glm$overall[1]

## Build model to predict acquisition
dt.acquired = rpart(as.factor(acquisition) ~ acq_exp +
industry +
revenue +
employees,
data = train.df)
rattle::fancyRpartPlot(dt.acquired, sub = "")

dt.acquired.pred = predict(dt.acquired, test.df, type = "class")

## Confusion matrix and accuracy of the model
confusionMatrix.dt <- confusionMatrix(as.factor(dt.acquired.pred), as.factor(test.df$acquisition), positive = '1')
confusionMatrix.dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 29 15
## 1 28 78
##
## Accuracy : 0.7133
## 95% CI : (0.6339, 0.7841)
## No Information Rate : 0.62
## P-Value [Acc > NIR] : 0.01055
##
## Kappa : 0.3635
##
## Mcnemar's Test P-Value : 0.06725
##
## Sensitivity : 0.8387
## Specificity : 0.5088
## Pos Pred Value : 0.7358
## Neg Pred Value : 0.6591
## Prevalence : 0.6200
## Detection Rate : 0.5200
## Detection Prevalence : 0.7067
## Balanced Accuracy : 0.6737
##
## 'Positive' Class : 1
##
accuracy.dt <- confusionMatrix.dt$overall[1]

## Predict duration in number of days for acquired customers
dt.duration = rpart(duration ~ acq_exp +
industry +
revenue +
employees,
data = train.df)
rattle::fancyRpartPlot(dt.duration, sub = "")

test.df = cbind(test.df, dt.acquired.pred) # combine classification to test data frame
dt.acq_df = test.df %>% subset(dt.acquired.pred == 1) # subset of acquired customers
dt.acq_df$dt.duration.pred = predict(dt.duration, dt.acq_df) # Predict duration

## MAE: Mean Absolute Error
dt.abs_error = abs(dt.acq_df$duration - dt.acq_df$dt.duration.pred)
dt.mae = mean(dt.abs_error)
dt.mae
## [1] 395.1868
Model Comparison
In the table below we display the accuracy obtained from the Random Forest, Logistic Regression and Decision Tree models, all for the classification (customer acquisition) problem.
We can observe that the highest accuracy comes from the Logistic Regression model (83%) rather than from the Random Forest (77%).
accuracy.df = data.frame(round(accuracy.rf,2), round(accuracy.glm,2), round(accuracy.dt,2))
colnames(accuracy.df) <- c("Random Forest", "Logistic Regression", "Decision Tree")
accuracy.df
## Random Forest Logistic Regression Decision Tree
## Accuracy 0.77 0.83 0.71
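For the duration (regression) task only the Random Forest and Decision Tree models were fitted; their mean absolute errors computed above can be collected in the same way, for example:

## Collect the MAE of the two duration models reported above
mae.df = data.frame(round(rf.mae, 1), round(dt.mae, 1))
colnames(mae.df) <- c("Random Forest", "Decision Tree")
mae.df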
In this case study we explored three models: Random Forest, Decision Trees and Logistic Regression. All three were used to model the classification problem (customer acquisition), and Random Forest was used for the regression problem (duration). As reported earlier in the Summary, Logistic Regression achieved the highest accuracy at 83%, outperforming both the Random Forest and Decision Tree models.
Usually we would expect Random Forest to outperform the other models when the data set has missing values, outliers or unbalanced classes; however, as explained earlier, this data set is fairly clean, and Logistic Regression yields a simple but effective model. Random Forest, on the other hand, would require more exhaustive pruning and tuning to reach its best accuracy.
Ang, L. & Buttle, F., 2006. Customer retention management processes: A quantitative study. European Journal of Marketing, Vol. 40, No. 1-2, pp. 83-99.
Buttle, F., 2009. Customer Relationship Management: Concepts and Technologies. 2nd edition, Elsevier Ltd.
Griffin, J. & Lowenstein, M.W., 2001. Customer Winback: How to Recapture Lost Customers And Keep Them Loyal. John Wiley & Sons, Incorporated, Somerset. Available from: ProQuest Ebook Central. [ 9 April 2019].
Peppers, D. & Rogers, M., 2011. Managing customer relationships: A Strategic Framework. 2nd edition, John Wiley & Sons, Incorporated.
Kumar, V. & Petersen, J.A., 2012. Statistical Methods in Customer Relationship Management. John Wiley & Sons, Incorporated, New York. Available from: ProQuest Ebook Central. [6 April 2019].