setwd("C:/Users/evdya/Desktop")
halloween <- read.csv("Churn_Modelling.csv") 

summary(halloween)
##    RowNumber       CustomerId           Surname      CreditScore   
##  Min.   :    1   Min.   :15565701   Smith   :  32   Min.   :350.0  
##  1st Qu.: 2501   1st Qu.:15628528   Martin  :  29   1st Qu.:584.0  
##  Median : 5000   Median :15690738   Scott   :  29   Median :652.0  
##  Mean   : 5000   Mean   :15690941   Walker  :  28   Mean   :650.5  
##  3rd Qu.: 7500   3rd Qu.:15753234   Brown   :  26   3rd Qu.:718.0  
##  Max.   :10000   Max.   :15815690   Genovese:  25   Max.   :850.0  
##                                     (Other) :9831                  
##    Geography       Gender          Age            Tenure      
##  France :5014   Female:4543   Min.   :18.00   Min.   : 0.000  
##  Germany:2509   Male  :5457   1st Qu.:32.00   1st Qu.: 3.000  
##  Spain  :2477                 Median :37.00   Median : 5.000  
##                               Mean   :38.92   Mean   : 5.013  
##                               3rd Qu.:44.00   3rd Qu.: 7.000  
##                               Max.   :92.00   Max.   :10.000  
##                                                               
##     Balance       NumOfProducts    HasCrCard      IsActiveMember  
##  Min.   :     0   Min.   :1.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:     0   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 97199   Median :1.00   Median :1.0000   Median :1.0000  
##  Mean   : 76486   Mean   :1.53   Mean   :0.7055   Mean   :0.5151  
##  3rd Qu.:127644   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :250898   Max.   :4.00   Max.   :1.0000   Max.   :1.0000  
##                                                                   
##  EstimatedSalary         Exited      
##  Min.   :    11.58   Min.   :0.0000  
##  1st Qu.: 51002.11   1st Qu.:0.0000  
##  Median :100193.91   Median :0.0000  
##  Mean   :100090.24   Mean   :0.2037  
##  3rd Qu.:149388.25   3rd Qu.:0.0000  
##  Max.   :199992.48   Max.   :1.0000  
## 

STEP 1: Exploring the data

How many customers churn?

library(dplyr)
count(halloween, Exited)
## # A tibble: 2 x 2
##   Exited     n
##    <int> <int>
## 1      0  7963
## 2      1  2037
library(ggplot2) 
ggplot(halloween, aes(x = Exited)) + 
  geom_histogram(stat = "count", fill = "tan1") +
  ggtitle("Number of no-churners and churners")

2037 customers churned from Halloween BI

How much on average do churners and no-churners last as bank customers?

Churners <- halloween %>%
  filter(Exited==1)
Nochurners <- halloween %>%
  filter(Exited==0)

mean(Churners$Tenure)  
## [1] 4.932744
mean(Nochurners$Tenure)
## [1] 5.033279

Both churners and no-churners last as customers of Halloween BI for about 5 years

STEP 2: Predicting churn

Prediciting churn with logistic regression

Firstly, we need to build a logistic regression model
Initially, we assume that all of the explanatory variables, which we included in our model, do not have influence on churn of the customers.

logModel <- glm(Exited ~ CreditScore + Geography + Gender + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary, 
                family = binomial, data = halloween)

summary(logModel)
## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + 
##     EstimatedSalary, family = binomial, data = halloween)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3097  -0.6589  -0.4560  -0.2697   2.9940  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.392e+00  2.448e-01 -13.857  < 2e-16 ***
## CreditScore      -6.683e-04  2.803e-04  -2.384   0.0171 *  
## GeographyGermany  7.747e-01  6.767e-02  11.448  < 2e-16 ***
## GeographySpain    3.522e-02  7.064e-02   0.499   0.6181    
## GenderMale       -5.285e-01  5.449e-02  -9.699  < 2e-16 ***
## Age               7.271e-02  2.576e-03  28.230  < 2e-16 ***
## Tenure           -1.595e-02  9.355e-03  -1.705   0.0882 .  
## Balance           2.637e-06  5.142e-07   5.128 2.92e-07 ***
## NumOfProducts    -1.015e-01  4.713e-02  -2.154   0.0312 *  
## HasCrCard        -4.468e-02  5.934e-02  -0.753   0.4515    
## IsActiveMember   -1.075e+00  5.769e-02 -18.643  < 2e-16 ***
## EstimatedSalary   4.807e-07  4.737e-07   1.015   0.3102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8561.4  on 9988  degrees of freedom
## AIC: 8585.4
## 
## Number of Fisher Scoring iterations: 5

From the summary of our model, we can see that variables CreditScore, GeographyGermany, GenderMale, Age, Balance, NumOfProducts, and IsActiveMember have significant effect on customer churn because their p-values are statistically significant.
Significant p-value means that the probability of getting the given result or larger for each of the mentioned variables, assuming that they have no effect on customer churn, is less that 0.05.

Secondly, we need to remove logarithm and transform our coefficients to the odds

odds <- coef(logModel) %>% exp() %>% round(2)
odds
##      (Intercept)      CreditScore GeographyGermany   GeographySpain 
##             0.03             1.00             2.17             1.04 
##       GenderMale              Age           Tenure          Balance 
##             0.59             1.08             0.98             1.00 
##    NumOfProducts        HasCrCard   IsActiveMember  EstimatedSalary 
##             0.90             0.96             0.34             1.00

We will interpret all of the variables to better understand what is going on.
Each one-unit increase in customer’s credit score increases the odds of churning by a factor of 1.
If a customer is from Germany, it increases the odds of churning by a factor of 2.17 compared to customers from other countries.
If a customer is from Spain it increases the odds of churning by a factor of 1.04 compared to customers from other countries.
If a customer is male, the odds of him churning decrease by a factor of 0.59 compared to female customers.
Each one-unit increase in customer’s age increases the odds of churning by a factor of 1.08.
Each one-unit increase in tenure of a customer decreases the odds of churning by a factor of 0.98.
Each one-unit increase in customer’s balance increases the odds of churning by a factor of 1.
Each one-unit increase in number of products of a customer decreases the odds of churning by a factor of 0.90.
Having a credit card in our bank decreases the odds of a customer’s churning by a factor of 0.96 compared to customers who do not have it.
Being an active member of our bank decreases the odd of a customer’s churning by a factor 0.34 compared to not active members.
Each one-unit increase in customer’s estimated salary increases the odds of churning by a factor of 1.

Thirdly, we need to select which variables to include in our model

library(MASS)
logModelNew <- stepAIC(logModel, trace = 0)
summary(logModelNew)
## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + IsActiveMember, family = binomial, 
##     data = halloween)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3270  -0.6592  -0.4553  -0.2682   2.9826  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.377e+00  2.359e-01 -14.312  < 2e-16 ***
## CreditScore      -6.682e-04  2.803e-04  -2.384   0.0171 *  
## GeographyGermany  7.741e-01  6.766e-02  11.441  < 2e-16 ***
## GeographySpain    3.586e-02  7.063e-02   0.508   0.6116    
## GenderMale       -5.291e-01  5.448e-02  -9.712  < 2e-16 ***
## Age               7.268e-02  2.575e-03  28.230  < 2e-16 ***
## Tenure           -1.595e-02  9.350e-03  -1.706   0.0879 .  
## Balance           2.653e-06  5.140e-07   5.162 2.45e-07 ***
## NumOfProducts    -1.007e-01  4.712e-02  -2.137   0.0326 *  
## IsActiveMember   -1.075e+00  5.767e-02 -18.648  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10110  on 9999  degrees of freedom
## Residual deviance:  8563  on 9990  degrees of freedom
## AIC: 8583
## 
## Number of Fisher Scoring iterations: 5

According to the summary of our new model, only the remaining variables have significant effect on customer churn: CreditScore, GeographyGermany, GeographySpain, GenderMale, Age, Tenure, Balance, NumOfProducts, and IsActiveMember.

Finally, we will evaluate the created model

library(descr)
LogRegR2(logModelNew)
## Chi2                 1546.83 
## Df                   9 
## Sig.                 0 
## Cox and Snell Index  0.1433133 
## Nagelkerke Index     0.2252868 
## McFadden's R2        0.1530033

Two out of three goodness-of-fit measures have coefficients smaller that 0.2, but they are nearly close to it. It means that the explanatory power of the model is relatively normal. We would conclude that our model is quite reasonable.

Assessing model quality with confusion matrix

Firstly, we make the predictions about our model and construct the in-sample matrix according to them

library(SDMTools)
halloween$pred <- predict(logModelNew, type = "response", na.action = na.exclude)

confMatModel <- confusion.matrix(halloween$Exited, halloween$pred, threshold = 0.5)
confMatModel
##     obs
## pred    0    1
##    0 7676 1597
##    1  287  440
## attr(,"class")
## [1] "confusion.matrix"

By the confusion matrix we can conclude that 7676 customers were correctly classified as no-churners and 440 customers were correctly classified as churners. Others were misclassified by our model: 1597 customers churned while we predicted them not to churn, and 287 customers didn’t churn but they were predicted to churn by our model.

Secondly, we calculate the accuracy

accModel <- sum(diag(confMatModel)) / sum(confMatModel)
accModel
## [1] 0.8116

The accuracy of our model is 81%, which is very high.
We can conclude that our model is pretty accurate.

Assessing model quality with сross-validation

At first, we need an accuracy function based on a threshold of 0.5

library(boot)
acc03 <- function(r, pi = 0) {
  cm <- confusion.matrix(r, pi, threshold = 0.5)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

Then, we calculate cross-validated accuracy

set.seed(1234)
cv.glm(halloween, logModelNew, cost = acc03, K = 6)$delta
## [1] 0.81110 0.81105

The accuracy we got is equal to 81% which is the same as our in-sample accuracy (81%).
Thus, we can conclude that quality of our model is good.

The most important factors of churn

According to the information we got from our data, the factors which influence customer churn in Halloween BI the most are GeographyGermany, GenderMale, and IsActiveMember.
Customers from Germany are much more likely to churn than customers from other countries, while men and active customers are much less likely to churn. So, we think that the company should pay more attention to German customers as well as to female ones and those who are not atively using the services of the bank to prevent then churning.

STEP 3: Predicting time to churn

Tenure time

We begin by buiding two histograms to look at the distribution of the tenure time dependent on whether or not a customer churned

plotChurn <- halloween %>% 
  mutate(Exited = Exited %>% factor(labels = c("No-churners", "Churners"))) %>%
  
ggplot() +
  geom_histogram(binwidth = 1, aes(x = Tenure, fill = factor(Exited))) +
  facet_grid( ~ Exited) +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("tan1", "tan2")) +
  ggtitle("Tenure of no-churners and churners")

plotChurn

From the histograms we can see that for both churners’ and no-churners’ tenure time is almost the same.
The least amount of people were customers of Halloween BI for less than a year or 10 years. The most amount of customers are using the bank for the period between 1 and 9 years.

Survival analysis

First step is to create a new column that holds a survival object

library(survival)
cbind(halloween %>% dplyr::select(Tenure, Exited), 
      Surv = Surv(halloween$Tenure, halloween$Exited)) %>% head(10)
##    Tenure Exited Surv
## 1       2      1    2
## 2       1      0   1+
## 3       8      1    8
## 4       1      0   1+
## 5       2      0   2+
## 6       8      1    8
## 7       7      0   7+
## 8       4      1    4
## 9       4      0   4+
## 10      2      0   2+

Second step is to estimate our survival function using Kaplan-Meier analysis

SurvObj <- Surv(halloween$Tenure, halloween$Exited) 

fitKM <- survfit(SurvObj ~ 1, type = "kaplan-meier")
print(fitKM)
## Call: survfit(formula = SurvObj ~ 1, type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##   10000    2037      10      10      NA

The results of the analysis show us that we have 10000 customers, of which 2037 churned in the time under observation.
The median survival time is 10 years. It means that about 50% of our customers do not churn before they reach a tenure duration of 10 years.

Now we plot the survival function

plot(fitKM, conf.int = FALSE, xlab = "Tenure",  ylab = "Survival function", main = "Survival function")

The function gives us the information about the probability that a customer will not churn in the period leading up to the time point of 10 years.
The longer a person is a customer of Halloween BI, the higher the probability of him not churning.