Firstly, we download all the needed libraries:

library(foreign)
library(ggplot2)
library(dplyr)
library(boot)
library(plyr)
library(MASS)
library(descr)
library(SDMTools)
library(knitr)
library(corrplot)
library(survival)

Importing data:

ChurnM <- read.csv("Churn_Modelling.csv")

EXPLORING THE DATA

How many churners do we have?

count(ChurnM, vars="Exited")
##   Exited freq
## 1      0 7963
## 2      1 2037

The result shows, that 2037 customers out 0f 10000 have churned and 7963 have not.

How much on average do churners and no-churners last as bank customers?

subset(ChurnM, ChurnM$Exited =="1") %>% summarise(mean=mean(Tenure))
##       mean
## 1 4.932744
subset(ChurnM, ChurnM$Exited =="0") %>% summarise(mean=mean(Tenure))
##       mean
## 1 5.033279
  • Churners on average last as bank customers for 4.932744
  • No-Churners on average last as bank customers for 5.033279

PREDICTING CHURN

Building a logistic model:

Model1 <- glm(Exited ~ Age+Gender+CreditScore+Geography+Tenure+Balance+NumOfProducts+HasCrCard+IsActiveMember+EstimatedSalary,family=binomial, ChurnM)
summary(Model1)
## 
## Call:
## glm(formula = Exited ~ Age + Gender + CreditScore + Geography + 
##     Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + 
##     EstimatedSalary, family = binomial, data = ChurnM)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3097  -0.6589  -0.4560  -0.2697   2.9940  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.392e+00  2.448e-01 -13.857  < 2e-16 ***
## Age               7.271e-02  2.576e-03  28.230  < 2e-16 ***
## GenderMale       -5.285e-01  5.449e-02  -9.699  < 2e-16 ***
## CreditScore      -6.683e-04  2.803e-04  -2.384   0.0171 *  
## GeographyGermany  7.747e-01  6.767e-02  11.448  < 2e-16 ***
## GeographySpain    3.522e-02  7.064e-02   0.499   0.6181    
## Tenure           -1.595e-02  9.355e-03  -1.705   0.0882 .  
## Balance           2.637e-06  5.142e-07   5.128 2.92e-07 ***
## NumOfProducts    -1.015e-01  4.713e-02  -2.154   0.0312 *  
## HasCrCard        -4.468e-02  5.934e-02  -0.753   0.4515    
## IsActiveMember   -1.075e+00  5.769e-02 -18.643  < 2e-16 ***
## EstimatedSalary   4.807e-07  4.737e-07   1.015   0.3102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8561.4  on 9988  degrees of freedom
## AIC: 8585.4
## 
## Number of Fisher Scoring iterations: 5

For the vatiables Age,GenderMale,GeographyGermany,Balance and IsActiveMember three stars indicate that the coefficients are highly significant

Extracting coefficients and removing logorifm:

coefExp <- coef(Model1) %>% exp() %>% round(2)
coefExp
##      (Intercept)              Age       GenderMale      CreditScore 
##             0.03             1.08             0.59             1.00 
## GeographyGermany   GeographySpain           Tenure          Balance 
##             2.17             1.04             0.98             1.00 
##    NumOfProducts        HasCrCard   IsActiveMember  EstimatedSalary 
##             0.90             0.96             0.34             1.00
  • Interpreation of coefficients:
  • One unit increase in Age increases the odds of churning by a factor of 1.08 (8%)
  • Being a male customer decreases the odds of churning by a factor of 0.59
  • Being a customer from Germany increases the odds of churning by a factor of 2.17
  • Being an active member decreases the odds of churning by a factor of 0.34
  • One unit increase in Balance increases the odds of churning by a factor of 1.00

Deciding which variables to include in the model:

Model2 <- stepAIC(Model1, trace = 0)
summary(Model2)
## 
## Call:
## glm(formula = Exited ~ Age + Gender + CreditScore + Geography + 
##     Tenure + Balance + NumOfProducts + IsActiveMember, family = binomial, 
##     data = ChurnM)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3270  -0.6592  -0.4553  -0.2682   2.9826  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.377e+00  2.359e-01 -14.312  < 2e-16 ***
## Age               7.268e-02  2.575e-03  28.230  < 2e-16 ***
## GenderMale       -5.291e-01  5.448e-02  -9.712  < 2e-16 ***
## CreditScore      -6.682e-04  2.803e-04  -2.384   0.0171 *  
## GeographyGermany  7.741e-01  6.766e-02  11.441  < 2e-16 ***
## GeographySpain    3.586e-02  7.063e-02   0.508   0.6116    
## Tenure           -1.595e-02  9.350e-03  -1.706   0.0879 .  
## Balance           2.653e-06  5.140e-07   5.162 2.45e-07 ***
## NumOfProducts    -1.007e-01  4.712e-02  -2.137   0.0326 *  
## IsActiveMember   -1.075e+00  5.767e-02 -18.648  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10110  on 9999  degrees of freedom
## Residual deviance:  8563  on 9990  degrees of freedom
## AIC: 8583
## 
## Number of Fisher Scoring iterations: 5

Comparing Model2 with Model1 we can see, that variables HasCrCard and EstimatedSalary have been removed from the model, as they are less significant, and those varibles, which are left have superior AIC values

Evaluating the new model (Model2):

LogRegR2(Model2)
## Chi2                 1546.83 
## Df                   9 
## Sig.                 0 
## Cox and Snell Index  0.1433133 
## Nagelkerke Index     0.2252868 
## McFadden's R2        0.1530033

By looking at Cox and Snell index, Nagelkerke index and McFadden’s R2 it is possible to say that our new model (Model2) is in the best case reasonable, as we have only Nagelkerke index bigger than 0.2, while other two just more or less close to this value

Evalution of the new model (Model2) by putting correct predicitions in relation to overall number of observations(accuracy):

ChurnM$predNew <- predict(Model2, type="response")
ChurnM %>% dplyr:: select(Exited, predNew) %>% tail()
##       Exited    predNew
## 9995       0 0.11540322
## 9996       0 0.13383612
## 9997       0 0.05264931
## 9998       1 0.07436407
## 9999       1 0.34413235
## 10000      0 0.15578079
confMatrixNew <- confusion.matrix(ChurnM$Exited, ChurnM$predNew, threshold = 0.5)
confMatrixNew
##     obs
## pred    0    1
##    0 7676 1597
##    1  287  440
## attr(,"class")
## [1] "confusion.matrix"

Now observations are classified according to a threshold of 0.5 and the predicted and observed outcomes are compared in a so called confusion matrix.

  • In this confusion matrix we can see that:
  • 7676 customers were correctly classified as no churners
  • 440 were correctly classified as churners
  • 1597 customers were classified as no-churners, while in fact they did churn
  • 287 customers were classified as churners, while in fact they’re not
accuracyNew <- sum(diag(confMatrixNew))/sum(confMatrixNew)
accuracyNew
## [1] 0.8116

Comparing our classifications to the actual observations we find that we have accuracy of 0.8116 (~81%), which is pretty high.

CROSS-VALIDATION

Accuracy function with threshold 0.3:

Acc03 <- function (r, pi = 0) {
  cm <- confusion.matrix(r, pi, threshold = 0.3) 
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

Calculating cross-validated accuracy:

set.seed(534381)
cv.glm(ChurnM, Model2, cost=Acc03, K=6)$delta
## [1] 0.7817000 0.7818502

Here we get accuracy of ~78%, which is lower, than it was in in-sample validation (~81%), but stil pretty close to it.

PREDICTING TIME TO CHURN (SURVIVAL ANALYSIS)

Tenure time:

plotChurn <- ChurnM %>% 
  mutate(Exited = Exited %>% factor(labels = c("No-churners", "Churners"))) %>%
  
  ggplot() +
  geom_histogram(aes(x = Tenure, fill = factor(Exited)), binwidth = 0.5) +
  facet_grid( ~ Exited) +
  theme(legend.position = "none") +
  ggtitle("Tenure of no-churners and churners")

plotChurn

Comparing these two histograms we can say, that in general there are a lot more No-Churners, than Churners in each Tenure value.

Survival curve analysis by Kaplan Meier:

cbind(ChurnM %>% dplyr::select(Tenure, Exited),
      surv = Surv(ChurnM$Tenure, ChurnM$Exited)) %>% head(10)
##    Tenure Exited surv
## 1       2      1    2
## 2       1      0   1+
## 3       8      1    8
## 4       1      0   1+
## 5       2      0   2+
## 6       8      1    8
## 7       7      0   7+
## 8       4      1    4
## 9       4      0   4+
## 10      2      0   2+
fitKM <- survfit(Surv(ChurnM$Tenure, ChurnM$Exited) ~ 1,
                 type="kaplan-meier")
fitKM$surv
##  [1] 0.9905000 0.9665305 0.9438138 0.9170238 0.8883624 0.8546414 0.8173673
##  [8] 0.7763483 0.7151476 0.6118054 0.4856986
print(fitKM)
## Call: survfit(formula = Surv(ChurnM$Tenure, ChurnM$Exited) ~ 1, type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##   10000    2037      10      10      NA
  • Here we can see, that:
  • Our dataset holds 10000 customers, of which about 2037 churned in the time under observation.
  • The median survival time is 10, that is, about 50% of the customers do not churn before they reach a Tenure duration of 10.
plot(fitKM, conf.int = FALSE, xlab = "Tenure duration of a client",  ylab = "Survival function", main = "Survival function")

This function gives us the probability that a client will not churn in a period leading up to the time point T. So, it can be seen that the function keeps deccreasing. Therefore, we can cconclude, that the longer the person continues to be the client of the bank the higher the probobility, that he/she will not churn in the future