Customer churn has become a major problem in banking industry and banks have always tried to track customer interaction with the company to detect early warning signs in customer’s behavior.
Marketing literature states that it is more costly to engage a new customer than to retain an existing loyal customer (Sharma & Kumar Panigrahi, 2011). Khan et al. (2010) state that cost of obtaining new customers is five times higher than retaining existing customers. Therefore, banks need to shift their attention from customer acquisition to customer retention, provide accurate prediction models, and effective churn prediction strategies as customer retention solution, to prevent churn.
This project aims to present machine learning models that can be used to predict which customers are most likely to churn and how long (years) the customer will stay with the bank. The models were trained with real-life U.S. bank customers dataset.
#
The dataset that will be used in this project is obtained from https://www.kaggle.com/datasets/shantanudhakadd/bank-customer-churn-prediction?resource=download. Import the dataset into our R markdown.
library(tidyr)
library(ggplot2)
library(purrr)
library(printr)
library(pROC)
library(ROCR)
library(caret)
library(car)
library(rpart)
library(rpart.plot)
data = read.csv('Churn_Modelling.csv', stringsAsFactors = TRUE)
By taking a glimpse on our dataset, we have total of 10,000 and 14 columns. Three non-useful variables are identified: RowNumber, CustomerID, and Surname. Two categorical variables: Geography and Gender need to be encoded into numbers because machine learning models can only work with numerical input.
# show dimension, datatype, content of the data set
str(data)
## 'data.frame': 10000 obs. of 14 variables:
## $ RowNumber : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CustomerId : int 15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
## $ Surname : Factor w/ 2932 levels "Abazu","Abbie",..: 1116 1178 2041 290 1823 538 178 2001 1147 1082 ...
## $ CreditScore : int 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 2 2 ...
## $ Age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : int 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : int 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : int 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : int 1 0 1 0 0 1 0 1 0 0 ...
No missing value is detected for all of the variables.
# detect missing value
knitr::kable(sapply(data, function(x) sum(is.na(x))), col.names = c("Missing Value Count"))
Missing Value Count | |
---|---|
RowNumber | 0 |
CustomerId | 0 |
Surname | 0 |
CreditScore | 0 |
Geography | 0 |
Gender | 0 |
Age | 0 |
Tenure | 0 |
Balance | 0 |
NumOfProducts | 0 |
HasCrCard | 0 |
IsActiveMember | 0 |
EstimatedSalary | 0 |
Exited | 0 |
This is the summary statistics of the variables. Looking at the Min and Max of the continuous variables, we can see that their scales are different hugely, e.g., Age and EstimatedSalary. The variables with larger scale would overshadow the smaller scale one, so scaling is needed to scale these variables to the same 0 - 1 range.
# show summary statistics of the variables
summary(data[, !names(data) %in% c('RowNumber', 'CustomerId', 'Surname')])
CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|
Min. :350.0 | France :5014 | Female:4543 | Min. :18.00 | Min. : 0.000 | Min. : 0 | Min. :1.00 | Min. :0.0000 | Min. :0.0000 | Min. : 11.58 | Min. :0.0000 | |
1st Qu.:584.0 | Germany:2509 | Male :5457 | 1st Qu.:32.00 | 1st Qu.: 3.000 | 1st Qu.: 0 | 1st Qu.:1.00 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.: 51002.11 | 1st Qu.:0.0000 | |
Median :652.0 | Spain :2477 | NA | Median :37.00 | Median : 5.000 | Median : 97199 | Median :1.00 | Median :1.0000 | Median :1.0000 | Median :100193.91 | Median :0.0000 | |
Mean :650.5 | NA | NA | Mean :38.92 | Mean : 5.013 | Mean : 76486 | Mean :1.53 | Mean :0.7055 | Mean :0.5151 | Mean :100090.24 | Mean :0.2037 | |
3rd Qu.:718.0 | NA | NA | 3rd Qu.:44.00 | 3rd Qu.: 7.000 | 3rd Qu.:127644 | 3rd Qu.:2.00 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:149388.25 | 3rd Qu.:0.0000 | |
Max. :850.0 | NA | NA | Max. :92.00 | Max. :10.000 | Max. :250898 | Max. :4.00 | Max. :1.0000 | Max. :1.0000 | Max. :199992.48 | Max. :1.0000 |
Box plot is plotted to show the data distribution of continuous variables and check if there is any outlier. Outliers are detected in Age and CreditScore, but they are not erroneous outliers and this outlier situation occurs because of the small sample number between these outliers range. Age, CreditScore, Balance variables are skewed toward the majority values while EstimatedSalary seems to be normally distributed.
Log transformation can be applied to these three variables to solve the skewed and outliers data issues.
# plot box plot
data[, names(data) %in% c('Age', 'Balance', 'CreditScore', 'EstimatedSalary')] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_boxplot() +
theme(axis.text.x = element_text(size = 7, angle=90), axis.text.y = element_text(size = 7))
The data processing that need to be done include:
1) Drop RowNumber, CustomerID, and Surname.
2) Encode Geography and Gender.
3) Log Transform Age, CreditScore, and Balance.
4) Scale range of Age, CreditScore, Balance, EstimatedSalary from 0 to 1.
# drop non-useful variables
data = data[, !names(data) %in% c('RowNumber', 'CustomerId', 'Surname')]
# data encoding
data$Geography = factor(data$Geography, labels=c(0, 1, 2))
data$Gender = factor(data$Gender, labels=c(0, 1))
# data transformation
data$Age = log(data$Age)
data$CreditScore = log(data$CreditScore)
data$Balance = log(data$Balance)
data[data$Balance == -Inf, 'Balance'] <- 0
# scaling
fun_scale_0to1 <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
data$Age = fun_scale_0to1(data$Age)
data$CreditScore = fun_scale_0to1(data$CreditScore)
data$Balance = fun_scale_0to1(data$Balance)
data$EstimatedSalary = fun_scale_0to1(data$EstimatedSalary)
head(data, 5)
CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
---|---|---|---|---|---|---|---|---|---|---|
0.6425900 | 0 | 0 | 0.5193632 | 2 | 0.0000000 | 1 | 1 | 1 | 0.5067349 | 1 |
0.6223822 | 2 | 0 | 0.5045923 | 1 | 0.9118043 | 1 | 0 | 1 | 0.5627087 | 0 |
0.4064754 | 0 | 0 | 0.5193632 | 8 | 0.9636449 | 3 | 1 | 0 | 0.5696544 | 1 |
0.7795730 | 0 | 0 | 0.4739377 | 1 | 0.0000000 | 2 | 0 | 0 | 0.4691201 | 0 |
1.0000000 | 2 | 0 | 0.5337866 | 2 | 0.9442881 | 1 | 1 | 1 | 0.3954004 | 0 |
To train a classification model, there is mainly three steps:
1. Splitting Data into Training and Testing Set
2. Model Training/ Tuning
3. Model Testing
The Exited variable will be used as the target variable to predict whether a bank customer will churn or not.
Splitting Data into training set and testing set
set.seed(1000)
trainIndex <- createDataPartition(data$Exited, p = 0.8, list = FALSE, times = 1)
training_data <- data[ trainIndex,]
testing_data <- data[-trainIndex,]
Data From Traing Set and Testing Set
# Check if the splitting process is correct
prop.table(table(training_data$Exited))
0 | 1 |
---|---|
0.798625 | 0.201375 |
prop.table(table(testing_data$Exited))
0 | 1 |
---|---|
0.787 | 0.213 |
Model Training
1. Logistic Regression:
Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set.
We will first fit all the features into the logistic regression to identify which are the important feature that contribute to the result.
LR_model = glm(Exited ~ ., data = training_data, family = "binomial")
summary(LR_model)
##
## Call:
## glm(formula = Exited ~ ., family = "binomial", data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0640 -0.6641 -0.4465 -0.2455 3.2877
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.19650 0.21483 -14.879 < 2e-16 ***
## CreditScore -0.37416 0.17495 -2.139 0.0325 *
## Geography1 0.76241 0.07817 9.753 < 2e-16 ***
## Geography2 0.04367 0.07911 0.552 0.5809
## Gender1 -0.49592 0.06104 -8.124 4.51e-16 ***
## Age 5.30861 0.21240 24.993 < 2e-16 ***
## Tenure -0.02405 0.01048 -2.295 0.0217 *
## Balance 0.35903 0.08468 4.240 2.24e-05 ***
## NumOfProducts -0.10872 0.05379 -2.021 0.0433 *
## HasCrCard -0.06089 0.06636 -0.917 0.3589
## IsActiveMember -1.02682 0.06407 -16.028 < 2e-16 ***
## EstimatedSalary 0.15150 0.10569 1.433 0.1517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8036.8 on 7999 degrees of freedom
## Residual deviance: 6790.3 on 7988 degrees of freedom
## AIC: 6814.3
##
## Number of Fisher Scoring iterations: 5
From the summary above, we can drop feature of HasCrCard, EstimatedSalary from the training model, thus they’re not statistical significance to the target column (p-value > 0.05)
LR_model = glm(Exited ~ CreditScore + Geography + Gender + Age +Tenure+ Balance + NumOfProducts + IsActiveMember, data = training_data, family = "binomial")
summary(LR_model)
##
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age +
## Tenure + Balance + NumOfProducts + IsActiveMember, family = "binomial",
## data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0473 -0.6630 -0.4471 -0.2474 3.2725
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.16576 0.20285 -15.607 < 2e-16 ***
## CreditScore -0.37706 0.17490 -2.156 0.0311 *
## Geography1 0.76132 0.07815 9.742 < 2e-16 ***
## Geography2 0.04457 0.07909 0.564 0.5731
## Gender1 -0.49718 0.06102 -8.147 3.72e-16 ***
## Age 5.30976 0.21232 25.008 < 2e-16 ***
## Tenure -0.02410 0.01047 -2.301 0.0214 *
## Balance 0.36277 0.08464 4.286 1.82e-05 ***
## NumOfProducts -0.10664 0.05377 -1.983 0.0474 *
## IsActiveMember -1.02653 0.06405 -16.028 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8036.8 on 7999 degrees of freedom
## Residual deviance: 6793.2 on 7990 degrees of freedom
## AIC: 6813.2
##
## Number of Fisher Scoring iterations: 5
After dropping those features, we can notice that the statistical significance of credit score,Geography, Gender,Balance has significantly increase. Apart from that, the deviance residuals has also move closer to 0 and AIC reduces as well.
Apart from checking the p-value, we can also check on the VIF of features. Variance inflation factor (VIF) provides a measure of multicollinearity among the independent variables in a multiple regression model. Multicollinearity exist when two/ more predictor are highly relative to each other and it will become difficult to understand the impact of an independent variable.
One of the assumptions fron logistic regression is the feature should be independent.A predictor having a VIF of 2 or less is generally considered safe and it can be assumed that it is not correlated with other predictor variables. Higher the VIF, greater is the correlation of the predictor variable with other predictor variables.
From the result below, all the feature selected is good to use for training the model.
vif(LR_model)
GVIF | Df | GVIF^(1/(2*Df)) | |
---|---|---|---|
CreditScore | 1.001321 | 1 | 1.000660 |
Geography | 1.298090 | 2 | 1.067397 |
Gender | 1.002371 | 1 | 1.001185 |
Age | 1.069003 | 1 | 1.033926 |
Tenure | 1.002698 | 1 | 1.001348 |
Balance | 1.390988 | 1 | 1.179401 |
NumOfProducts | 1.103875 | 1 | 1.050654 |
IsActiveMember | 1.067723 | 1 | 1.033307 |
Logistic Regression Result
The model has achieved 81% of accuracy, 70% of sensitivity and 82.77% of specificity. The Area Under Curve for this model achieves 80% which is considered a good result.
# Performance of model on testing data set
pred2 <- predict(LR_model,testing_data,type="response")
cutoff_churn <- ifelse(pred2>=0.50, 1,0)
cm <- confusionMatrix(as.factor(testing_data$Exited),as.factor(cutoff_churn),positive ='1')
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1528 46
## 1 318 108
##
## Accuracy : 0.818
## 95% CI : (0.8004, 0.8347)
## No Information Rate : 0.923
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2924
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7013
## Specificity : 0.8277
## Pos Pred Value : 0.2535
## Neg Pred Value : 0.9708
## Prevalence : 0.0770
## Detection Rate : 0.0540
## Detection Prevalence : 0.2130
## Balanced Accuracy : 0.7645
##
## 'Positive' Class : 1
##
# Plot ROC Curve
ROCpred = prediction(pred2, testing_data$Exited)
ROCperf <- performance(ROCpred, "tpr", "fpr")
plot(ROCperf, colorize=TRUE)
abline(a=0, b=1)
auc_train <- round(as.numeric(performance(ROCpred, "auc")@y.values),2)
legend(.8, .2, auc_train, title = "AUC", cex=1)
2. Decision Tree
A supervised machine learning model that works as flow chat that used to visualize the decision-making process by mapping out different courses of action, as well as their potential outcomes.
We first build the decision tree with all the feature. However, fitting all the features into the model is always not the best choice. From the summary of the model, we obtain the result of CP, which stands for Complexity Parameter. It refers to the trade-off between the size of a tree and the error rate that help to prevent overfitting. So we want the cp value of the smallest tree that is having the smallest cross validation error.
Dtree = rpart(Exited ~., data = training_data, method = "class")
printcp(Dtree)
##
## Classification tree:
## rpart(formula = Exited ~ ., data = training_data, method = "class")
##
## Variables actually used in tree construction:
## [1] Age Balance Geography IsActiveMember NumOfProducts
##
## Root node error: 1611/8000 = 0.20137
##
## n= 8000
##
## CP nsplit rel error xerror xstd
## 1 0.048727 0 1.00000 1.00000 0.022265
## 2 0.044693 2 0.90255 0.88951 0.021290
## 3 0.040348 3 0.85785 0.85289 0.020940
## 4 0.030416 4 0.81750 0.81316 0.020545
## 5 0.029174 5 0.78709 0.80323 0.020444
## 6 0.023588 6 0.75791 0.78399 0.020244
## 7 0.014898 7 0.73433 0.75916 0.019980
## 8 0.010000 9 0.70453 0.71881 0.019535
# Plot Full Tree
prp(Dtree, type = 1, extra = 1, under = TRUE, split.font = 2, varlen = 0)
Find the best pruned Decision Tree by selecting the tree that is having least cross validation error.
set.seed(12345)
cv.ct <- rpart(Exited ~., data = training_data, method = "class",
cp = 0.00001, minsplit = 5, xval = 5)
printcp(cv.ct)
##
## Classification tree:
## rpart(formula = Exited ~ ., data = training_data, method = "class",
## cp = 1e-05, minsplit = 5, xval = 5)
##
## Variables actually used in tree construction:
## [1] Age Balance CreditScore EstimatedSalary
## [5] Gender Geography HasCrCard IsActiveMember
## [9] NumOfProducts Tenure
##
## Root node error: 1611/8000 = 0.20137
##
## n= 8000
##
## CP nsplit rel error xerror xstd
## 1 4.8727e-02 0 1.00000 1.00000 0.022265
## 2 4.4693e-02 2 0.90255 0.85040 0.020916
## 3 4.0348e-02 3 0.85785 0.84668 0.020879
## 4 3.0416e-02 4 0.81750 0.81937 0.020608
## 5 2.9174e-02 5 0.78709 0.80199 0.020431
## 6 2.3588e-02 6 0.75791 0.77654 0.020165
## 7 1.4898e-02 7 0.73433 0.75605 0.019946
## 8 4.9659e-03 9 0.70453 0.71508 0.019493
## 9 3.7244e-03 10 0.69957 0.70515 0.019379
## 10 3.1037e-03 15 0.67846 0.71136 0.019450
## 11 2.7933e-03 19 0.66605 0.70888 0.019422
## 12 2.4829e-03 21 0.66046 0.71012 0.019436
## 13 2.3277e-03 25 0.65053 0.71012 0.019436
## 14 2.1726e-03 31 0.63315 0.71633 0.019507
## 15 1.8622e-03 33 0.62880 0.71757 0.019521
## 16 1.6553e-03 36 0.62322 0.71074 0.019443
## 17 1.5518e-03 39 0.61825 0.71570 0.019500
## 18 1.2415e-03 41 0.61515 0.72564 0.019611
## 19 9.7544e-04 94 0.53942 0.74115 0.019784
## 20 9.3110e-04 113 0.51645 0.76660 0.020060
## 21 8.2764e-04 126 0.50403 0.77778 0.020179
## 22 7.7592e-04 133 0.49783 0.77778 0.020179
## 23 7.5867e-04 149 0.48417 0.83116 0.020726
## 24 7.2419e-04 165 0.46927 0.83116 0.020726
## 25 7.0941e-04 183 0.45376 0.83116 0.020726
## 26 6.2073e-04 198 0.43886 0.84606 0.020873
## 27 4.9659e-04 337 0.34451 0.86716 0.021078
## 28 4.6555e-04 342 0.34202 0.90006 0.021388
## 29 4.1382e-04 362 0.33271 0.90130 0.021399
## 30 3.7244e-04 433 0.29795 0.90317 0.021416
## 31 3.1037e-04 463 0.27623 0.92489 0.021614
## 32 2.7588e-04 511 0.26071 0.93544 0.021709
## 33 2.6603e-04 520 0.25822 0.93917 0.021742
## 34 2.4829e-04 533 0.25388 0.94600 0.021802
## 35 2.2572e-04 543 0.25140 0.94600 0.021802
## 36 2.0691e-04 554 0.24891 0.95158 0.021852
## 37 1.5518e-04 586 0.24209 0.95717 0.021900
## 38 1.2415e-04 616 0.23588 0.96276 0.021949
## 39 1.1286e-04 628 0.23277 0.96524 0.021970
## 40 8.8676e-05 639 0.23153 0.96586 0.021976
## 41 7.7592e-05 646 0.23091 0.96586 0.021976
## 42 1.0000e-05 654 0.23029 0.96586 0.021976
From the result above, the CP with value of 2.7933e-03 is having the least cross validation error.
# Prune by lowest cp
prune_dt <- prune(cv.ct,cp=cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"])
predict_dt <- predict(prune_dt, testing_data,type="class")
length(prune_dt$frame$var[prune_dt$frame$var == "<leaf>"])
## [1] 11
prp(prune_dt, type = 1, extra = 1, split.font = 1, varlen = -10)
cm_dt <- confusionMatrix(as.factor(testing_data$Exited),as.factor(predict_dt),positive='1')
cm_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1538 36
## 1 236 190
##
## Accuracy : 0.864
## 95% CI : (0.8482, 0.8787)
## No Information Rate : 0.887
## P-Value [Acc > NIR] : 0.9993
##
## Kappa : 0.5105
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8407
## Specificity : 0.8670
## Pos Pred Value : 0.4460
## Neg Pred Value : 0.9771
## Prevalence : 0.1130
## Detection Rate : 0.0950
## Detection Prevalence : 0.2130
## Balanced Accuracy : 0.8538
##
## 'Positive' Class : 1
##
Decision Tree Result
The model has achieved 86.4% of accuracy, 84% of sensitivity and 86% of specificity. The Area Under Curve for this model achieves 79% slightly lower compared to logistic regression.
pred_dt <- predict(prune_dt, newdata= testing_data,type = "prob")[, 2]
Pred_val = prediction(pred_dt, testing_data$Exited)
plot(performance(Pred_val, "tpr", "fpr"),colorize=TRUE)
abline(0, 1, lty = 2)
auc_train <- round(as.numeric(performance(Pred_val, "auc")@y.values),2)
legend(.8, .2, auc_train, title = "AUC", cex=1)
3. Support Vector Machine
A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems
One of the important stuff in building SVM model is feature Scaling. In the previous pre-processing section, we had already normalized all the range of the numeric values. Thus, the data is ready to use to train the model.
library(e1071)
library(ISLR)
learn_svm <- svm(factor(Exited)~.,data=training_data)
predict_svm <- predict(learn_svm, testing_data,type ="response")
Support Vector Machine Result
The model has achieved 86.45% of accuracy, 86.38% of sentivity and 86.46% of specificity. The Area Under Curve for this model achieves 71% which is the lowest among all three.
cm_svm <- confusionMatrix(as.factor(testing_data$Exited),as.factor(predict_svm),positive='1')
cm_svm$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8638498 0.8645775 0.4319249
## Neg Pred Value Precision Recall
## 0.9815756 0.4319249 0.8638498
## F1 Prevalence Detection Rate
## 0.5758998 0.1065000 0.0920000
## Detection Prevalence Balanced Accuracy
## 0.2130000 0.8642136
pred_ROCR <- prediction(as.numeric(predict_svm), as.numeric(testing_data$Exited))
roc_ROCR <- performance(pred_ROCR, measure = "tpr", x.measure = "fpr")
auc_train <- round(as.numeric(performance(pred_ROCR, "auc")@y.values),2)
plot(roc_ROCR, main = "ROC curve", colorize = T)
abline(a = 0, b = 1)
legend(.8, .2, auc_train, title = "AUC", cex=1)
Similar to training a classification model, there is mainly three steps to train a regression model:
1. Splitting Data into Training and Testing Set
2. Model Training
3. Model Testing
The Tenure variable will be used as the target variable to predict how long (year) the bank customer will stay with the bank.
Splitting Data into training set and testing set
set.seed(1000)
data[, names(data)] = apply(data[, names(data)], 2, function(x) as.numeric(as.character(x)))
trainIndex <- createDataPartition(data$Tenure, p=0.8, list=FALSE, times=1)
data_train <- data[trainIndex,]
data_test <- data[-trainIndex,]
Model Training
1. Linear Regression
# set method as "lm" to train a linear regression model using the training data.
linRegModel <- train(Tenure ~., data = data_train, method = "lm")
summary(linRegModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3379 -2.2698 -0.0239 2.3653 5.3673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.88230 0.21785 22.412 < 2e-16 ***
## CreditScore 0.10882 0.18778 0.580 0.56227
## Geography 0.02287 0.03909 0.585 0.55857
## Gender 0.07857 0.06533 1.203 0.22912
## Age -0.10005 0.21436 -0.467 0.64069
## Balance -0.05492 0.07694 -0.714 0.47536
## NumOfProducts 0.05931 0.05942 0.998 0.31822
## HasCrCard 0.10421 0.07129 1.462 0.14382
## IsActiveMember -0.17016 0.06603 -2.577 0.00998 **
## EstimatedSalary 0.05971 0.11279 0.529 0.59654
## Exited -0.13193 0.08710 -1.515 0.12987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.89 on 7990 degrees of freedom
## Multiple R-squared: 0.00211, Adjusted R-squared: 0.0008615
## F-statistic: 1.69 on 10 and 7990 DF, p-value: 0.07686
From the summary above, only the IsActiceMember variable is statistically significant in predicting the Tenure target outcome. The adjusted R-square achieved by the model is 0.0008615, which is considered extremely low as it is far from the perfect score of 1.
In the following code, 5-fold cross validation is used for the linear regression model to see if the model performance can be improved.
# set method as "lm" to train a linear regression model and use 5-fold cross validation on the whole data.
linRegModelcv <- train(Tenure ~., data = data, method = "lm", trControl = trainControl(method="cv", number=5))
summary(linRegModelcv$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3126 -2.3105 -0.0185 2.3756 5.4001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.91197 0.19404 25.314 < 2e-16 ***
## CreditScore 0.01898 0.16761 0.113 0.90984
## Geography 0.01892 0.03507 0.539 0.58959
## Gender 0.08102 0.05842 1.387 0.16548
## Age -0.06739 0.19234 -0.350 0.72607
## Balance -0.06535 0.06858 -0.953 0.34063
## NumOfProducts 0.04804 0.05270 0.912 0.36199
## HasCrCard 0.13900 0.06344 2.191 0.02847 *
## IsActiveMember -0.17692 0.05899 -2.999 0.00272 **
## EstimatedSalary 0.07984 0.10057 0.794 0.42730
## Exited -0.10477 0.07738 -1.354 0.17576
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.89 on 9989 degrees of freedom
## Multiple R-squared: 0.002178, Adjusted R-squared: 0.001179
## F-statistic: 2.18 on 10 and 9989 DF, p-value: 0.01622
From the summary above, we have now two variables that are statistically significant which are IsActiveMember and HasCrCard. The adjusted R-square (0.001179) is slightly improved but the score achieved is still considered very low.
Linear Regression Result
The linear regression model has achieved RMSE score of 2.945562.
# root mean square error function
rmse = function(actual, predicted) {
sqrt(mean((actual - predicted) ^ 2))
}
# performance of model on testing data set
pred_tenure = as.integer(predict(linRegModel, data_test))
rmse(data_test$Tenure, pred_tenure)
## [1] 2.946072
2. Regression Tree
library(rpart)
library(rpart.plot)
# set method as anova for regression tree
modelTree <- rpart(Tenure ~., data= data_train, method="anova")
summary(modelTree)
## Call:
## rpart(formula = Tenure ~ ., data = data_train, method = "anova")
## n= 8001
##
## CP nsplit rel error xerror xstd
## 1 0.0007559446 0 1 0 0
##
## Node number 1: 8001 observations
## mean=5.019248, MSE=8.360584
rpart.plot(modelTree)
From the model summary and tree plot above, we notice that the regression tree is not able to grow. This happens because the independent variables are not useful in predicting the Tenure target outcome, hence the information provided by the independent variables are insufficient to grow the tree.
Regression Tree Result
The regression tree model has achieved RMSE score of 2.894164. The RMSE score of this model is slightly better than the RMSE of linear regression model. However, neither of the two models achieved a good model performance that is ready for deployment because the independent variables in this data set are not contributing to the prediction of the Tenure target variable.
pred_tenure2 <- as.integer(predict(modelTree, data_test, type="vector"))
#Accuracy of model on testing data set
rmse(data_test$Tenure, pred_tenure2)
## [1] 2.894164
Some interesting findings in the dataset:
Older customers are churning more than younger ones alluding to a difference in service preference in the age categories. The bank may need to review their target market or review the strategy for retention between the different age groups.
Having a credit card is not a good predictor for churn status mainly due to the high credit card ownership in Germany, France and Spain.
Credit Score may be perceived as an important factor, but its significance is minimal among the other factors given that the distribution of credit score is similar for churned and retained customers.
Clients with the longest and shortest tenure are more likely to churn compared to those that are of average tenure.
Surprisingly, the churning customers are also those who have a greater bank balance with the bank. This should be concerning to the bank as they are losing customers that provide higher capitals.
In predicting if a customer will churn or not, we employed 3 types of models: Logistics Regression, Decision Tree and Support Vector Machine. The performances of the models are fairly good with accuracies ranging from 81% - 86%. Other performance metrics that we considered are sensitivity(recall), precision f1-scores and the Area Under Curve of ROC.
Overall, Decision Tree is the best model for predicting churn among the three models.
For objective 2, neither of the regression models can a good prediction on how long a bank customer will stay with the bank because the independent variables are not useful in contributing to the prediction of our Tenure target variable.
Recommendations:
Regression works better with more continuous data as features. Active Member status is quite arbitrary and it is better to replaced with other useful features such as Recency, Frequency and Lifetime Value of customers that captures customer behaviour in interacting with the services provided by the bank.
Through this project, we learn that a good quality data set is important as it directly influence the performance of machine learning models and the independent variables have to be relevant in order to contribute to the prediction of the target variable.
Khan, A. A., Jamwal, S., & Sepehri, M. M. (2010). Applying Data Mining to Customer Churn Prediction in an Internet Service Provider. International Journal of Computer Applications, 9(7), 8–14. https://doi.org/10.5120/1400-1889
Sharma, A., & Kumar Panigrahi, P. (2011). A Neural Network based Approach for Predicting Customer Churn in Cellular Network Services. International Journal of Computer Applications, 27(11), 26–31. https://doi.org/10.5120/3344-4605