Firstly, we download all the needed libraries:
library(foreign)
library(ggplot2)
library(dplyr)
library(boot)
library(plyr)
library(MASS)
library(descr)
library(SDMTools)
library(knitr)
library(corrplot)
library(survival)
ChurnM <- read.csv("Churn_Modelling.csv")
count(ChurnM, vars="Exited")
## Exited freq
## 1 0 7963
## 2 1 2037
The result shows, that 2037 customers out 0f 10000 have churned and 7963 have not.
subset(ChurnM, ChurnM$Exited =="1") %>% summarise(mean=mean(Tenure))
## mean
## 1 4.932744
subset(ChurnM, ChurnM$Exited =="0") %>% summarise(mean=mean(Tenure))
## mean
## 1 5.033279
- Churners on average last as bank customers for 4.932744
- No-Churners on average last as bank customers for 5.033279
Model1 <- glm(Exited ~ Age+Gender+CreditScore+Geography+Tenure+Balance+NumOfProducts+HasCrCard+IsActiveMember+EstimatedSalary,family=binomial, ChurnM)
summary(Model1)
##
## Call:
## glm(formula = Exited ~ Age + Gender + CreditScore + Geography +
## Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember +
## EstimatedSalary, family = binomial, data = ChurnM)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3097 -0.6589 -0.4560 -0.2697 2.9940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.392e+00 2.448e-01 -13.857 < 2e-16 ***
## Age 7.271e-02 2.576e-03 28.230 < 2e-16 ***
## GenderMale -5.285e-01 5.449e-02 -9.699 < 2e-16 ***
## CreditScore -6.683e-04 2.803e-04 -2.384 0.0171 *
## GeographyGermany 7.747e-01 6.767e-02 11.448 < 2e-16 ***
## GeographySpain 3.522e-02 7.064e-02 0.499 0.6181
## Tenure -1.595e-02 9.355e-03 -1.705 0.0882 .
## Balance 2.637e-06 5.142e-07 5.128 2.92e-07 ***
## NumOfProducts -1.015e-01 4.713e-02 -2.154 0.0312 *
## HasCrCard -4.468e-02 5.934e-02 -0.753 0.4515
## IsActiveMember -1.075e+00 5.769e-02 -18.643 < 2e-16 ***
## EstimatedSalary 4.807e-07 4.737e-07 1.015 0.3102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8561.4 on 9988 degrees of freedom
## AIC: 8585.4
##
## Number of Fisher Scoring iterations: 5
For the vatiables Age,GenderMale,GeographyGermany,Balance and IsActiveMember three stars indicate that the coefficients are highly significant
coefExp <- coef(Model1) %>% exp() %>% round(2)
coefExp
## (Intercept) Age GenderMale CreditScore
## 0.03 1.08 0.59 1.00
## GeographyGermany GeographySpain Tenure Balance
## 2.17 1.04 0.98 1.00
## NumOfProducts HasCrCard IsActiveMember EstimatedSalary
## 0.90 0.96 0.34 1.00
- Interpreation of coefficients:
- One unit increase in Age increases the odds of churning by a factor of 1.08 (8%)
- Being a male customer decreases the odds of churning by a factor of 0.59
- Being a customer from Germany increases the odds of churning by a factor of 2.17
- Being an active member decreases the odds of churning by a factor of 0.34
- One unit increase in Balance increases the odds of churning by a factor of 1.00
Model2 <- stepAIC(Model1, trace = 0)
summary(Model2)
##
## Call:
## glm(formula = Exited ~ Age + Gender + CreditScore + Geography +
## Tenure + Balance + NumOfProducts + IsActiveMember, family = binomial,
## data = ChurnM)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3270 -0.6592 -0.4553 -0.2682 2.9826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.377e+00 2.359e-01 -14.312 < 2e-16 ***
## Age 7.268e-02 2.575e-03 28.230 < 2e-16 ***
## GenderMale -5.291e-01 5.448e-02 -9.712 < 2e-16 ***
## CreditScore -6.682e-04 2.803e-04 -2.384 0.0171 *
## GeographyGermany 7.741e-01 6.766e-02 11.441 < 2e-16 ***
## GeographySpain 3.586e-02 7.063e-02 0.508 0.6116
## Tenure -1.595e-02 9.350e-03 -1.706 0.0879 .
## Balance 2.653e-06 5.140e-07 5.162 2.45e-07 ***
## NumOfProducts -1.007e-01 4.712e-02 -2.137 0.0326 *
## IsActiveMember -1.075e+00 5.767e-02 -18.648 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10110 on 9999 degrees of freedom
## Residual deviance: 8563 on 9990 degrees of freedom
## AIC: 8583
##
## Number of Fisher Scoring iterations: 5
Comparing Model2 with Model1 we can see, that variables HasCrCard and EstimatedSalary have been removed from the model, as they are less significant, and those varibles, which are left have superior AIC values
LogRegR2(Model2)
## Chi2 1546.83
## Df 9
## Sig. 0
## Cox and Snell Index 0.1433133
## Nagelkerke Index 0.2252868
## McFadden's R2 0.1530033
By looking at Cox and Snell index, Nagelkerke index and McFadden’s R2 it is possible to say that our new model (Model2) is in the best case reasonable, as we have only Nagelkerke index bigger than 0.2, while other two just more or less close to this value
ChurnM$predNew <- predict(Model2, type="response")
ChurnM %>% dplyr:: select(Exited, predNew) %>% tail()
## Exited predNew
## 9995 0 0.11540322
## 9996 0 0.13383612
## 9997 0 0.05264931
## 9998 1 0.07436407
## 9999 1 0.34413235
## 10000 0 0.15578079
confMatrixNew <- confusion.matrix(ChurnM$Exited, ChurnM$predNew, threshold = 0.5)
confMatrixNew
## obs
## pred 0 1
## 0 7676 1597
## 1 287 440
## attr(,"class")
## [1] "confusion.matrix"
Now observations are classified according to a threshold of 0.5 and the predicted and observed outcomes are compared in a so called confusion matrix.
- In this confusion matrix we can see that:
- 7676 customers were correctly classified as no churners
- 440 were correctly classified as churners
- 1597 customers were classified as no-churners, while in fact they did churn
- 287 customers were classified as churners, while in fact they’re not
accuracyNew <- sum(diag(confMatrixNew))/sum(confMatrixNew)
accuracyNew
## [1] 0.8116
Comparing our classifications to the actual observations we find that we have accuracy of 0.8116 (~81%), which is pretty high.
Acc03 <- function (r, pi = 0) {
cm <- confusion.matrix(r, pi, threshold = 0.3)
acc <- sum(diag(cm)) / sum(cm)
return(acc)
}
set.seed(534381)
cv.glm(ChurnM, Model2, cost=Acc03, K=6)$delta
## [1] 0.7817000 0.7818502
Here we get accuracy of ~78%, which is lower, than it was in in-sample validation (~81%), but stil pretty close to it.
plotChurn <- ChurnM %>%
mutate(Exited = Exited %>% factor(labels = c("No-churners", "Churners"))) %>%
ggplot() +
geom_histogram(aes(x = Tenure, fill = factor(Exited)), binwidth = 0.5) +
facet_grid( ~ Exited) +
theme(legend.position = "none") +
ggtitle("Tenure of no-churners and churners")
plotChurn
Comparing these two histograms we can say, that in general there are a lot more No-Churners, than Churners in each Tenure value.
cbind(ChurnM %>% dplyr::select(Tenure, Exited),
surv = Surv(ChurnM$Tenure, ChurnM$Exited)) %>% head(10)
## Tenure Exited surv
## 1 2 1 2
## 2 1 0 1+
## 3 8 1 8
## 4 1 0 1+
## 5 2 0 2+
## 6 8 1 8
## 7 7 0 7+
## 8 4 1 4
## 9 4 0 4+
## 10 2 0 2+
fitKM <- survfit(Surv(ChurnM$Tenure, ChurnM$Exited) ~ 1,
type="kaplan-meier")
fitKM$surv
## [1] 0.9905000 0.9665305 0.9438138 0.9170238 0.8883624 0.8546414 0.8173673
## [8] 0.7763483 0.7151476 0.6118054 0.4856986
print(fitKM)
## Call: survfit(formula = Surv(ChurnM$Tenure, ChurnM$Exited) ~ 1, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## 10000 2037 10 10 NA
- Here we can see, that:
- Our dataset holds 10000 customers, of which about 2037 churned in the time under observation.
- The median survival time is 10, that is, about 50% of the customers do not churn before they reach a Tenure duration of 10.
plot(fitKM, conf.int = FALSE, xlab = "Tenure duration of a client", ylab = "Survival function", main = "Survival function")
This function gives us the probability that a client will not churn in a period leading up to the time point T. So, it can be seen that the function keeps deccreasing. Therefore, we can cconclude, that the longer the person continues to be the client of the bank the higher the probobility, that he/she will not churn in the future