Bank Customer Churn Prediction

Introduction

Customer churn has become a major problem in banking industry and banks have always tried to track customer interaction with the company to detect early warning signs in customer’s behavior.

Marketing literature states that it is more costly to engage a new customer than to retain an existing loyal customer (Sharma & Kumar Panigrahi, 2011). Khan et al. (2010) state that cost of obtaining new customers is five times higher than retaining existing customers. Therefore, banks need to shift their attention from customer acquisition to customer retention, provide accurate prediction models, and effective churn prediction strategies as customer retention solution, to prevent churn.

This project aims to present machine learning models that can be used to predict which customers are most likely to churn and how long (years) the customer will stay with the bank. The models were trained with real-life U.S. bank customers dataset.

Objectives

To predict if a bank’s customers will churn or stay with the bank.

To predict how many years would a customer stay with the bank.

Workflow Overview

# The figure above shows workflow of Bank Customers Churn Prediction Model

Import Libraries & Dataset

The dataset that will be used in this project is obtained from https://www.kaggle.com/datasets/shantanudhakadd/bank-customer-churn-prediction?resource=download. Import the dataset into our R markdown.

library(tidyr)
library(ggplot2)
library(purrr)
library(printr)
library(pROC) 
library(ROCR) 
library(caret)
library(car)
library(rpart)
library(rpart.plot)

data = read.csv('Churn_Modelling.csv', stringsAsFactors = TRUE)

Descriptive Statistic

By taking a glimpse on our dataset, we have total of 10,000 and 14 columns. Three non-useful variables are identified: RowNumber, CustomerID, and Surname. Two categorical variables: Geography and Gender need to be encoded into numbers because machine learning models can only work with numerical input.

# show dimension, datatype, content of the data set
str(data)

## 'data.frame':    10000 obs. of  14 variables:
##  $ RowNumber      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CustomerId     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ Surname        : Factor w/ 2932 levels "Abazu","Abbie",..: 1116 1178 2041 290 1823 538 178 2001 1147 1082 ...
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
##  $ Gender         : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : int  1 0 1 0 0 1 0 1 0 0 ...

No missing value is detected for all of the variables.

# detect missing value
knitr::kable(sapply(data, function(x) sum(is.na(x))), col.names = c("Missing Value Count"))

	Missing Value Count
RowNumber	0
CustomerId	0
Surname	0
CreditScore	0
Geography	0
Gender	0
Age	0
Tenure	0
Balance	0
NumOfProducts	0
HasCrCard	0
IsActiveMember	0
EstimatedSalary	0
Exited	0

This is the summary statistics of the variables. Looking at the Min and Max of the continuous variables, we can see that their scales are different hugely, e.g., Age and EstimatedSalary. The variables with larger scale would overshadow the smaller scale one, so scaling is needed to scale these variables to the same 0 - 1 range.

# show summary statistics of the variables 
summary(data[, !names(data) %in% c('RowNumber', 'CustomerId', 'Surname')])

CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
Min. :350.0	France :5014	Female:4543	Min. :18.00	Min. : 0.000	Min. : 0	Min. :1.00	Min. :0.0000	Min. :0.0000	Min. : 11.58	Min. :0.0000
1st Qu.:584.0	Germany:2509	Male :5457	1st Qu.:32.00	1st Qu.: 3.000	1st Qu.: 0	1st Qu.:1.00	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.: 51002.11	1st Qu.:0.0000
Median :652.0	Spain :2477	NA	Median :37.00	Median : 5.000	Median : 97199	Median :1.00	Median :1.0000	Median :1.0000	Median :100193.91	Median :0.0000
Mean :650.5	NA	NA	Mean :38.92	Mean : 5.013	Mean : 76486	Mean :1.53	Mean :0.7055	Mean :0.5151	Mean :100090.24	Mean :0.2037
3rd Qu.:718.0	NA	NA	3rd Qu.:44.00	3rd Qu.: 7.000	3rd Qu.:127644	3rd Qu.:2.00	3rd Qu.:1.0000	3rd Qu.:1.0000	3rd Qu.:149388.25	3rd Qu.:0.0000
Max. :850.0	NA	NA	Max. :92.00	Max. :10.000	Max. :250898	Max. :4.00	Max. :1.0000	Max. :1.0000	Max. :199992.48	Max. :1.0000

Box plot is plotted to show the data distribution of continuous variables and check if there is any outlier. Outliers are detected in Age and CreditScore, but they are not erroneous outliers and this outlier situation occurs because of the small sample number between these outliers range. Age, CreditScore, Balance variables are skewed toward the majority values while EstimatedSalary seems to be normally distributed.
Log transformation can be applied to these three variables to solve the skewed and outliers data issues.

# plot box plot
data[, names(data) %in% c('Age', 'Balance', 'CreditScore', 'EstimatedSalary')] %>%
  gather() %>%
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_boxplot() +
    theme(axis.text.x = element_text(size = 7, angle=90), axis.text.y = element_text(size = 7))

Data Processing

The data processing that need to be done include:
1) Drop RowNumber, CustomerID, and Surname.
2) Encode Geography and Gender.
3) Log Transform Age, CreditScore, and Balance.
4) Scale range of Age, CreditScore, Balance, EstimatedSalary from 0 to 1.

# drop non-useful variables
data = data[, !names(data) %in% c('RowNumber', 'CustomerId', 'Surname')]

# data encoding
data$Geography = factor(data$Geography, labels=c(0, 1, 2))
data$Gender = factor(data$Gender, labels=c(0, 1))

# data transformation
data$Age = log(data$Age)
data$CreditScore = log(data$CreditScore)
data$Balance = log(data$Balance)
data[data$Balance == -Inf, 'Balance'] <- 0

# scaling
fun_scale_0to1 <- function(x) {                           
  (x - min(x)) / (max(x) - min(x))
}
data$Age = fun_scale_0to1(data$Age)
data$CreditScore = fun_scale_0to1(data$CreditScore)
data$Balance = fun_scale_0to1(data$Balance)
data$EstimatedSalary = fun_scale_0to1(data$EstimatedSalary)

head(data, 5)

CreditScore	Geography	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
0.6425900	0	0.5193632	2	0.0000000	1	1	1	0.5067349	1
0.6223822	2	0.5045923	1	0.9118043	1	0	1	0.5627087	0
0.4064754	0	0.5193632	8	0.9636449	3	1	0	0.5696544	1
0.7795730	0	0.4739377	1	0.0000000	2	0	0	0.4691201	0
1.0000000	2	0.5337866	2	0.9442881	1	1	1	0.3954004	0

Machine Learning - Classification

To train a classification model, there is mainly three steps:
1. Splitting Data into Training and Testing Set
2. Model Training/ Tuning
3. Model Testing

The Exited variable will be used as the target variable to predict whether a bank customer will churn or not.

Splitting Data into training set and testing set

set.seed(1000)
trainIndex <- createDataPartition(data$Exited, p = 0.8, list = FALSE, times = 1)
training_data <- data[ trainIndex,]
testing_data  <- data[-trainIndex,]

Data From Traing Set and Testing Set

# Check if the splitting process is correct
prop.table(table(training_data$Exited))

0	1
0.798625	0.201375

prop.table(table(testing_data$Exited))

0	1
0.787	0.213

Model Training
1. Logistic Regression:
Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set.
We will first fit all the features into the logistic regression to identify which are the important feature that contribute to the result.

LR_model = glm(Exited ~ ., data = training_data, family = "binomial")
summary(LR_model)

## 
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = training_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0640  -0.6641  -0.4465  -0.2455   3.2877  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.19650    0.21483 -14.879  < 2e-16 ***
## CreditScore     -0.37416    0.17495  -2.139   0.0325 *  
## Geography1       0.76241    0.07817   9.753  < 2e-16 ***
## Geography2       0.04367    0.07911   0.552   0.5809    
## Gender1         -0.49592    0.06104  -8.124 4.51e-16 ***
## Age              5.30861    0.21240  24.993  < 2e-16 ***
## Tenure          -0.02405    0.01048  -2.295   0.0217 *  
## Balance          0.35903    0.08468   4.240 2.24e-05 ***
## NumOfProducts   -0.10872    0.05379  -2.021   0.0433 *  
## HasCrCard       -0.06089    0.06636  -0.917   0.3589    
## IsActiveMember  -1.02682    0.06407 -16.028  < 2e-16 ***
## EstimatedSalary  0.15150    0.10569   1.433   0.1517    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8036.8  on 7999  degrees of freedom
## Residual deviance: 6790.3  on 7988  degrees of freedom
## AIC: 6814.3
## 
## Number of Fisher Scoring iterations: 5

From the summary above, we can drop feature of HasCrCard, EstimatedSalary from the training model, thus they’re not statistical significance to the target column (p-value > 0.05)

LR_model = glm(Exited ~ CreditScore + Geography + Gender + Age +Tenure+ Balance + NumOfProducts + IsActiveMember, data = training_data, family = "binomial")
summary(LR_model)

## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + IsActiveMember, family = "binomial", 
##     data = training_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0473  -0.6630  -0.4471  -0.2474   3.2725  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.16576    0.20285 -15.607  < 2e-16 ***
## CreditScore    -0.37706    0.17490  -2.156   0.0311 *  
## Geography1      0.76132    0.07815   9.742  < 2e-16 ***
## Geography2      0.04457    0.07909   0.564   0.5731    
## Gender1        -0.49718    0.06102  -8.147 3.72e-16 ***
## Age             5.30976    0.21232  25.008  < 2e-16 ***
## Tenure         -0.02410    0.01047  -2.301   0.0214 *  
## Balance         0.36277    0.08464   4.286 1.82e-05 ***
## NumOfProducts  -0.10664    0.05377  -1.983   0.0474 *  
## IsActiveMember -1.02653    0.06405 -16.028  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8036.8  on 7999  degrees of freedom
## Residual deviance: 6793.2  on 7990  degrees of freedom
## AIC: 6813.2
## 
## Number of Fisher Scoring iterations: 5

After dropping those features, we can notice that the statistical significance of credit score,Geography, Gender,Balance has significantly increase. Apart from that, the deviance residuals has also move closer to 0 and AIC reduces as well.

Apart from checking the p-value, we can also check on the VIF of features. Variance inflation factor (VIF) provides a measure of multicollinearity among the independent variables in a multiple regression model. Multicollinearity exist when two/ more predictor are highly relative to each other and it will become difficult to understand the impact of an independent variable.

One of the assumptions fron logistic regression is the feature should be independent.A predictor having a VIF of 2 or less is generally considered safe and it can be assumed that it is not correlated with other predictor variables. Higher the VIF, greater is the correlation of the predictor variable with other predictor variables.

From the result below, all the feature selected is good to use for training the model.

vif(LR_model)

	GVIF	Df	GVIF^(1/(2*Df))
CreditScore	1.001321	1	1.000660
Geography	1.298090	2	1.067397
Gender	1.002371	1	1.001185
Age	1.069003	1	1.033926
Tenure	1.002698	1	1.001348
Balance	1.390988	1	1.179401
NumOfProducts	1.103875	1	1.050654
IsActiveMember	1.067723	1	1.033307

Logistic Regression Result
The model has achieved 81% of accuracy, 70% of sensitivity and 82.77% of specificity. The Area Under Curve for this model achieves 80% which is considered a good result.

# Performance of model on testing data set
pred2 <- predict(LR_model,testing_data,type="response")
cutoff_churn <- ifelse(pred2>=0.50, 1,0)
cm <- confusionMatrix(as.factor(testing_data$Exited),as.factor(cutoff_churn),positive ='1')
cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1528   46
##          1  318  108
##                                           
##                Accuracy : 0.818           
##                  95% CI : (0.8004, 0.8347)
##     No Information Rate : 0.923           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2924          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7013          
##             Specificity : 0.8277          
##          Pos Pred Value : 0.2535          
##          Neg Pred Value : 0.9708          
##              Prevalence : 0.0770          
##          Detection Rate : 0.0540          
##    Detection Prevalence : 0.2130          
##       Balanced Accuracy : 0.7645          
##                                           
##        'Positive' Class : 1               
##

# Plot ROC Curve
ROCpred = prediction(pred2, testing_data$Exited)
ROCperf <- performance(ROCpred, "tpr", "fpr")
plot(ROCperf, colorize=TRUE)
abline(a=0, b=1)
auc_train <- round(as.numeric(performance(ROCpred, "auc")@y.values),2)
legend(.8, .2, auc_train, title = "AUC", cex=1)

2. Decision Tree
A supervised machine learning model that works as flow chat that used to visualize the decision-making process by mapping out different courses of action, as well as their potential outcomes.

We first build the decision tree with all the feature. However, fitting all the features into the model is always not the best choice. From the summary of the model, we obtain the result of CP, which stands for Complexity Parameter. It refers to the trade-off between the size of a tree and the error rate that help to prevent overfitting. So we want the cp value of the smallest tree that is having the smallest cross validation error.

Dtree = rpart(Exited ~., data = training_data, method = "class")
printcp(Dtree)

## 
## Classification tree:
## rpart(formula = Exited ~ ., data = training_data, method = "class")
## 
## Variables actually used in tree construction:
## [1] Age            Balance        Geography      IsActiveMember NumOfProducts 
## 
## Root node error: 1611/8000 = 0.20137
## 
## n= 8000 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.048727      0   1.00000 1.00000 0.022265
## 2 0.044693      2   0.90255 0.88951 0.021290
## 3 0.040348      3   0.85785 0.85289 0.020940
## 4 0.030416      4   0.81750 0.81316 0.020545
## 5 0.029174      5   0.78709 0.80323 0.020444
## 6 0.023588      6   0.75791 0.78399 0.020244
## 7 0.014898      7   0.73433 0.75916 0.019980
## 8 0.010000      9   0.70453 0.71881 0.019535

# Plot Full Tree
prp(Dtree, type = 1, extra = 1, under = TRUE, split.font = 2, varlen = 0)

Find the best pruned Decision Tree by selecting the tree that is having least cross validation error.

set.seed(12345)
cv.ct <- rpart(Exited ~., data = training_data, method = "class", 
               cp = 0.00001, minsplit = 5, xval = 5)
printcp(cv.ct)

## 
## Classification tree:
## rpart(formula = Exited ~ ., data = training_data, method = "class", 
##     cp = 1e-05, minsplit = 5, xval = 5)
## 
## Variables actually used in tree construction:
##  [1] Age             Balance         CreditScore     EstimatedSalary
##  [5] Gender          Geography       HasCrCard       IsActiveMember 
##  [9] NumOfProducts   Tenure         
## 
## Root node error: 1611/8000 = 0.20137
## 
## n= 8000 
## 
##            CP nsplit rel error  xerror     xstd
## 1  4.8727e-02      0   1.00000 1.00000 0.022265
## 2  4.4693e-02      2   0.90255 0.85040 0.020916
## 3  4.0348e-02      3   0.85785 0.84668 0.020879
## 4  3.0416e-02      4   0.81750 0.81937 0.020608
## 5  2.9174e-02      5   0.78709 0.80199 0.020431
## 6  2.3588e-02      6   0.75791 0.77654 0.020165
## 7  1.4898e-02      7   0.73433 0.75605 0.019946
## 8  4.9659e-03      9   0.70453 0.71508 0.019493
## 9  3.7244e-03     10   0.69957 0.70515 0.019379
## 10 3.1037e-03     15   0.67846 0.71136 0.019450
## 11 2.7933e-03     19   0.66605 0.70888 0.019422
## 12 2.4829e-03     21   0.66046 0.71012 0.019436
## 13 2.3277e-03     25   0.65053 0.71012 0.019436
## 14 2.1726e-03     31   0.63315 0.71633 0.019507
## 15 1.8622e-03     33   0.62880 0.71757 0.019521
## 16 1.6553e-03     36   0.62322 0.71074 0.019443
## 17 1.5518e-03     39   0.61825 0.71570 0.019500
## 18 1.2415e-03     41   0.61515 0.72564 0.019611
## 19 9.7544e-04     94   0.53942 0.74115 0.019784
## 20 9.3110e-04    113   0.51645 0.76660 0.020060
## 21 8.2764e-04    126   0.50403 0.77778 0.020179
## 22 7.7592e-04    133   0.49783 0.77778 0.020179
## 23 7.5867e-04    149   0.48417 0.83116 0.020726
## 24 7.2419e-04    165   0.46927 0.83116 0.020726
## 25 7.0941e-04    183   0.45376 0.83116 0.020726
## 26 6.2073e-04    198   0.43886 0.84606 0.020873
## 27 4.9659e-04    337   0.34451 0.86716 0.021078
## 28 4.6555e-04    342   0.34202 0.90006 0.021388
## 29 4.1382e-04    362   0.33271 0.90130 0.021399
## 30 3.7244e-04    433   0.29795 0.90317 0.021416
## 31 3.1037e-04    463   0.27623 0.92489 0.021614
## 32 2.7588e-04    511   0.26071 0.93544 0.021709
## 33 2.6603e-04    520   0.25822 0.93917 0.021742
## 34 2.4829e-04    533   0.25388 0.94600 0.021802
## 35 2.2572e-04    543   0.25140 0.94600 0.021802
## 36 2.0691e-04    554   0.24891 0.95158 0.021852
## 37 1.5518e-04    586   0.24209 0.95717 0.021900
## 38 1.2415e-04    616   0.23588 0.96276 0.021949
## 39 1.1286e-04    628   0.23277 0.96524 0.021970
## 40 8.8676e-05    639   0.23153 0.96586 0.021976
## 41 7.7592e-05    646   0.23091 0.96586 0.021976
## 42 1.0000e-05    654   0.23029 0.96586 0.021976

From the result above, the CP with value of 2.7933e-03 is having the least cross validation error.

# Prune by lowest cp
prune_dt <- prune(cv.ct,cp=cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"])
predict_dt <- predict(prune_dt, testing_data,type="class") 
length(prune_dt$frame$var[prune_dt$frame$var == "<leaf>"])

## [1] 11

prp(prune_dt, type = 1, extra = 1, split.font = 1, varlen = -10)

cm_dt <- confusionMatrix(as.factor(testing_data$Exited),as.factor(predict_dt),positive='1')
cm_dt

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1538   36
##          1  236  190
##                                           
##                Accuracy : 0.864           
##                  95% CI : (0.8482, 0.8787)
##     No Information Rate : 0.887           
##     P-Value [Acc > NIR] : 0.9993          
##                                           
##                   Kappa : 0.5105          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8407          
##             Specificity : 0.8670          
##          Pos Pred Value : 0.4460          
##          Neg Pred Value : 0.9771          
##              Prevalence : 0.1130          
##          Detection Rate : 0.0950          
##    Detection Prevalence : 0.2130          
##       Balanced Accuracy : 0.8538          
##                                           
##        'Positive' Class : 1               
##

Decision Tree Result
The model has achieved 86.4% of accuracy, 84% of sensitivity and 86% of specificity. The Area Under Curve for this model achieves 79% slightly lower compared to logistic regression.

pred_dt <- predict(prune_dt, newdata= testing_data,type = "prob")[, 2]
Pred_val = prediction(pred_dt, testing_data$Exited) 
plot(performance(Pred_val, "tpr", "fpr"),colorize=TRUE)
abline(0, 1, lty = 2)
auc_train <- round(as.numeric(performance(Pred_val, "auc")@y.values),2)
legend(.8, .2, auc_train, title = "AUC", cex=1)

3. Support Vector Machine
A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems

One of the important stuff in building SVM model is feature Scaling. In the previous pre-processing section, we had already normalized all the range of the numeric values. Thus, the data is ready to use to train the model.

library(e1071) 
library(ISLR) 
learn_svm <- svm(factor(Exited)~.,data=training_data) 
predict_svm <- predict(learn_svm, testing_data,type ="response")

Support Vector Machine Result
The model has achieved 86.45% of accuracy, 86.38% of sentivity and 86.46% of specificity. The Area Under Curve for this model achieves 71% which is the lowest among all three.

cm_svm <- confusionMatrix(as.factor(testing_data$Exited),as.factor(predict_svm),positive='1')
cm_svm$byClass

##          Sensitivity          Specificity       Pos Pred Value 
##            0.8638498            0.8645775            0.4319249 
##       Neg Pred Value            Precision               Recall 
##            0.9815756            0.4319249            0.8638498 
##                   F1           Prevalence       Detection Rate 
##            0.5758998            0.1065000            0.0920000 
## Detection Prevalence    Balanced Accuracy 
##            0.2130000            0.8642136

pred_ROCR  <- prediction(as.numeric(predict_svm), as.numeric(testing_data$Exited))
roc_ROCR <- performance(pred_ROCR, measure = "tpr", x.measure = "fpr")
auc_train <- round(as.numeric(performance(pred_ROCR, "auc")@y.values),2)
plot(roc_ROCR, main = "ROC curve", colorize = T)
abline(a = 0, b = 1)
legend(.8, .2, auc_train, title = "AUC", cex=1)

Machine Learning - Regression

Similar to training a classification model, there is mainly three steps to train a regression model:
1. Splitting Data into Training and Testing Set
2. Model Training
3. Model Testing

The Tenure variable will be used as the target variable to predict how long (year) the bank customer will stay with the bank.

Splitting Data into training set and testing set

set.seed(1000)
data[, names(data)] = apply(data[, names(data)], 2, function(x) as.numeric(as.character(x)))
trainIndex <- createDataPartition(data$Tenure, p=0.8, list=FALSE, times=1)
data_train <- data[trainIndex,]
data_test <- data[-trainIndex,]

Model Training
1. Linear Regression

# set method as "lm" to train a linear regression model using the training data.
linRegModel <- train(Tenure ~., data = data_train, method = "lm")
summary(linRegModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3379 -2.2698 -0.0239  2.3653  5.3673 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.88230    0.21785  22.412  < 2e-16 ***
## CreditScore      0.10882    0.18778   0.580  0.56227    
## Geography        0.02287    0.03909   0.585  0.55857    
## Gender           0.07857    0.06533   1.203  0.22912    
## Age             -0.10005    0.21436  -0.467  0.64069    
## Balance         -0.05492    0.07694  -0.714  0.47536    
## NumOfProducts    0.05931    0.05942   0.998  0.31822    
## HasCrCard        0.10421    0.07129   1.462  0.14382    
## IsActiveMember  -0.17016    0.06603  -2.577  0.00998 ** 
## EstimatedSalary  0.05971    0.11279   0.529  0.59654    
## Exited          -0.13193    0.08710  -1.515  0.12987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.89 on 7990 degrees of freedom
## Multiple R-squared:  0.00211,    Adjusted R-squared:  0.0008615 
## F-statistic:  1.69 on 10 and 7990 DF,  p-value: 0.07686

From the summary above, only the IsActiceMember variable is statistically significant in predicting the Tenure target outcome. The adjusted R-square achieved by the model is 0.0008615, which is considered extremely low as it is far from the perfect score of 1.

In the following code, 5-fold cross validation is used for the linear regression model to see if the model performance can be improved.

# set method as "lm" to train a linear regression model and use 5-fold cross validation on the whole data.
linRegModelcv <- train(Tenure ~., data = data, method = "lm", trControl = trainControl(method="cv", number=5))
summary(linRegModelcv$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3126 -2.3105 -0.0185  2.3756  5.4001 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.91197    0.19404  25.314  < 2e-16 ***
## CreditScore      0.01898    0.16761   0.113  0.90984    
## Geography        0.01892    0.03507   0.539  0.58959    
## Gender           0.08102    0.05842   1.387  0.16548    
## Age             -0.06739    0.19234  -0.350  0.72607    
## Balance         -0.06535    0.06858  -0.953  0.34063    
## NumOfProducts    0.04804    0.05270   0.912  0.36199    
## HasCrCard        0.13900    0.06344   2.191  0.02847 *  
## IsActiveMember  -0.17692    0.05899  -2.999  0.00272 ** 
## EstimatedSalary  0.07984    0.10057   0.794  0.42730    
## Exited          -0.10477    0.07738  -1.354  0.17576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.89 on 9989 degrees of freedom
## Multiple R-squared:  0.002178,   Adjusted R-squared:  0.001179 
## F-statistic:  2.18 on 10 and 9989 DF,  p-value: 0.01622

From the summary above, we have now two variables that are statistically significant which are IsActiveMember and HasCrCard. The adjusted R-square (0.001179) is slightly improved but the score achieved is still considered very low.

Linear Regression Result
The linear regression model has achieved RMSE score of 2.945562.

# root mean square error function
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# performance of model on testing data set 
pred_tenure = as.integer(predict(linRegModel, data_test))
rmse(data_test$Tenure, pred_tenure)

## [1] 2.946072

2. Regression Tree

library(rpart)
library(rpart.plot)

# set method as anova for regression tree
modelTree <- rpart(Tenure ~., data= data_train, method="anova")
summary(modelTree)

## Call:
## rpart(formula = Tenure ~ ., data = data_train, method = "anova")
##   n= 8001 
## 
##             CP nsplit rel error xerror xstd
## 1 0.0007559446      0         1      0    0
## 
## Node number 1: 8001 observations
##   mean=5.019248, MSE=8.360584

rpart.plot(modelTree)

From the model summary and tree plot above, we notice that the regression tree is not able to grow. This happens because the independent variables are not useful in predicting the Tenure target outcome, hence the information provided by the independent variables are insufficient to grow the tree.

Regression Tree Result
The regression tree model has achieved RMSE score of 2.894164. The RMSE score of this model is slightly better than the RMSE of linear regression model. However, neither of the two models achieved a good model performance that is ready for deployment because the independent variables in this data set are not contributing to the prediction of the Tenure target variable.

pred_tenure2 <- as.integer(predict(modelTree,  data_test, type="vector"))

#Accuracy of model on testing data set
rmse(data_test$Tenure, pred_tenure2)

## [1] 2.894164

Conclusion

Some interesting findings in the dataset:

Older customers are churning more than younger ones alluding to a difference in service preference in the age categories. The bank may need to review their target market or review the strategy for retention between the different age groups.
Having a credit card is not a good predictor for churn status mainly due to the high credit card ownership in Germany, France and Spain.
Credit Score may be perceived as an important factor, but its significance is minimal among the other factors given that the distribution of credit score is similar for churned and retained customers.
Clients with the longest and shortest tenure are more likely to churn compared to those that are of average tenure.
Surprisingly, the churning customers are also those who have a greater bank balance with the bank. This should be concerning to the bank as they are losing customers that provide higher capitals.

In predicting if a customer will churn or not, we employed 3 types of models: Logistics Regression, Decision Tree and Support Vector Machine. The performances of the models are fairly good with accuracies ranging from 81% - 86%. Other performance metrics that we considered are sensitivity(recall), precision f1-scores and the Area Under Curve of ROC.

Overall, Decision Tree is the best model for predicting churn among the three models.

For objective 2, neither of the regression models can a good prediction on how long a bank customer will stay with the bank because the independent variables are not useful in contributing to the prediction of our Tenure target variable.

Recommendations:

Regression works better with more continuous data as features. Active Member status is quite arbitrary and it is better to replaced with other useful features such as Recency, Frequency and Lifetime Value of customers that captures customer behaviour in interacting with the services provided by the bank.

Through this project, we learn that a good quality data set is important as it directly influence the performance of machine learning models and the independent variables have to be relevant in order to contribute to the prediction of the target variable.

References

Khan, A. A., Jamwal, S., & Sepehri, M. M. (2010). Applying Data Mining to Customer Churn Prediction in an Internet Service Provider. International Journal of Computer Applications, 9(7), 8–14. https://doi.org/10.5120/1400-1889

Sharma, A., & Kumar Panigrahi, P. (2011). A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services. International Journal of Computer Applications, 27(11), 26–31. https://doi.org/10.5120/3344-4605