library(readr)
db <- read_csv("Churn_Modelling.csv")
str(db)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of  14 variables:
##  $ RowNumber      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ CustomerId     : num  15634602 15647311 15619304 15701354 15737888 ...
##  $ Surname        : chr  "Hargrave" "Hill" "Onio" "Boni" ...
##  $ CreditScore    : num  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : chr  "France" "Spain" "France" "France" ...
##  $ Gender         : chr  "Female" "Female" "Female" "Female" ...
##  $ Age            : num  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : num  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : num  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : num  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : num  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : num  1 0 1 0 0 1 0 1 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   RowNumber = col_double(),
##   ..   CustomerId = col_double(),
##   ..   Surname = col_character(),
##   ..   CreditScore = col_double(),
##   ..   Geography = col_character(),
##   ..   Gender = col_character(),
##   ..   Age = col_double(),
##   ..   Tenure = col_double(),
##   ..   Balance = col_double(),
##   ..   NumOfProducts = col_double(),
##   ..   HasCrCard = col_double(),
##   ..   IsActiveMember = col_double(),
##   ..   EstimatedSalary = col_double(),
##   ..   Exited = col_double()
##   .. )
library(dplyr)
library(corrplot)
db %>% select_if(is.numeric) %>%
  dplyr::select(-CustomerId) %>%
  cor() %>%
  corrplot()

Let’s also have a quick look on the correlation graph with our variables. There we may notice a slight positive correlation between variables “Exited” and “Age”. And also very slight negative correlation between variables “Exited” and “IsActiveMember”. BUT the coefficients here are small. Anyway, we’ll get to it a bit later. And also there’s no multicollinearity, which is good.

To-do list

  1. explore the data to find out how many customers churn at all and how much on average do churners and no-churners last as bank customers;
  2. predict churn (variable “Exited”) with a logistic regression model, with a confusion matrix and cross-validation to assess model quality; and pick up 2-3 most important factors of churn;
  3. predict time to churn (variables “Tenure”, “Exited”).

1 - quick overview

How many customers churn at all?

Observing a churn rate: here we see that 2037 customers have already churned (left the the bank after all the bank after all), which is 20% of all number of observations.

library("janitor")
churn = tabyl(db$Exited)
churn

How much on average do churners and no-churners last as bank customers?

As for all of the customers, the average tenure value is 5.01 (which is 5 years). As it is seen from the historam the same can be said about both groups (churners and no-churners).

mean(db$Tenure) # to look at the average tenure value for all of the customers 
## [1] 5.0128
library(ggplot2)
plotTenure <- db %>% 
    mutate(Exited = Exited %>% factor(labels = c("No", "Yes"))) %>% 
ggplot() +
  geom_histogram(aes(x = Tenure,
                 fill = factor(Exited)), binwidth = 1) +
  geom_vline(aes(xintercept=mean(Tenure))) +
  facet_grid( ~ Exited) +
  theme(legend.position = "none")
plotTenure # to look at the average tenure value for two groups: churners and no-churners 

2 - Predicting churn with Logistic Regression

full model

logitModelFull <- glm(Exited ~ CreditScore + Geography + Gender + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary, family = binomial, data = db)  
summary(logitModelFull) # the model
## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + 
##     EstimatedSalary, family = binomial, data = db)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3097  -0.6589  -0.4560  -0.2697   2.9940  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.392e+00  2.448e-01 -13.857  < 2e-16 ***
## CreditScore      -6.683e-04  2.803e-04  -2.384   0.0171 *  
## GeographyGermany  7.747e-01  6.767e-02  11.448  < 2e-16 ***
## GeographySpain    3.522e-02  7.064e-02   0.499   0.6181    
## GenderMale       -5.285e-01  5.449e-02  -9.699  < 2e-16 ***
## Age               7.271e-02  2.576e-03  28.230  < 2e-16 ***
## Tenure           -1.595e-02  9.355e-03  -1.705   0.0882 .  
## Balance           2.637e-06  5.142e-07   5.128 2.92e-07 ***
## NumOfProducts    -1.015e-01  4.713e-02  -2.154   0.0312 *  
## HasCrCard        -4.468e-02  5.934e-02  -0.753   0.4515    
## IsActiveMember   -1.075e+00  5.769e-02 -18.643  < 2e-16 ***
## EstimatedSalary   4.807e-07  4.737e-07   1.015   0.3102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8561.4  on 9988  degrees of freedom
## AIC: 8585.4
## 
## Number of Fisher Scoring iterations: 5
coefsexp <- coef(logitModelFull) %>% exp() %>% round(2)
coefsexp # the odds
##      (Intercept)      CreditScore GeographyGermany   GeographySpain 
##             0.03             1.00             2.17             1.04 
##       GenderMale              Age           Tenure          Balance 
##             0.59             1.08             0.98             1.00 
##    NumOfProducts        HasCrCard   IsActiveMember  EstimatedSalary 
##             0.90             0.96             0.34             1.00

Coefficient interpretation:

From the full model we can see some factors, that have an influence on customers’ churning. For example, customer’s geographical location in Germany have an influence on exiting (we refuse from H0 which states that there is no geographical location does not influence customers churning). By looking at the odds we may add: location in Germany increases the odds of exiting the bank by a factor of 2.17, so 117% compared to someone who is not located in Germany.

There also can be seen some influence of gender, age, and customers’ being an active member. BUT before interpretting everything it’s better to use stepAIC function and find the best model. Moreover, it’s important to check it if it is good enought.

new model

library(MASS)
logitModelNew <- stepAIC(logitModelFull, trace = 0)
summary(logitModelNew)
## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + IsActiveMember, family = binomial, 
##     data = db)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3270  -0.6592  -0.4553  -0.2682   2.9826  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.377e+00  2.359e-01 -14.312  < 2e-16 ***
## CreditScore      -6.682e-04  2.803e-04  -2.384   0.0171 *  
## GeographyGermany  7.741e-01  6.766e-02  11.441  < 2e-16 ***
## GeographySpain    3.586e-02  7.063e-02   0.508   0.6116    
## GenderMale       -5.291e-01  5.448e-02  -9.712  < 2e-16 ***
## Age               7.268e-02  2.575e-03  28.230  < 2e-16 ***
## Tenure           -1.595e-02  9.350e-03  -1.706   0.0879 .  
## Balance           2.653e-06  5.140e-07   5.162 2.45e-07 ***
## NumOfProducts    -1.007e-01  4.712e-02  -2.137   0.0326 *  
## IsActiveMember   -1.075e+00  5.767e-02 -18.648  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10110  on 9999  degrees of freedom
## Residual deviance:  8563  on 9990  degrees of freedom
## AIC: 8583
## 
## Number of Fisher Scoring iterations: 5
coefsexp <- coef(logitModelNew) %>% exp() %>% round(2)
coefsexp 
##      (Intercept)      CreditScore GeographyGermany   GeographySpain 
##             0.03             1.00             2.17             1.04 
##       GenderMale              Age           Tenure          Balance 
##             0.59             1.08             0.98             1.00 
##    NumOfProducts   IsActiveMember 
##             0.90             0.34

Goodness of fit measures

  • LogRegR2
library(descr)
LogRegR2(logitModelNew)
## Chi2                 1546.83 
## Df                   9 
## Sig.                 0 
## Cox and Snell Index  0.1433133 
## Nagelkerke Index     0.2252868 
## McFadden's R2        0.1530033

Using LogRegR2() function we see, that the model is not very good (because most of the values are <0.2), the algorithm has troubles with explaining lots of the varibles.

  • Accuracy and confusion matrix
library(SDMTools)
db$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)
db %>% dplyr::select(Exited, predNew) %>% tail()
confMatrixNew <- confusion.matrix(db$Exited, 
                    db$predNew, threshold = 0.5)
confMatrixNew
##     obs
## pred    0    1
##    0 7676 1597
##    1  287  440
## attr(,"class")
## [1] "confusion.matrix"

Here we see that 7676 customers were correctly predicted not to exit the bank (true-negative) and 440 were correctly predicted to churn (true-positive).

accuracyNew <- sum(diag(confMatrixNew)) / sum(confMatrixNew)
accuracyNew
## [1] 0.8116

And as for the accuracy we have 81% of it, which is very good. But we can also check another variants, look at payoffs and choosу the best option.

payoffMatrix <- data.frame(threshold = seq(from = 0.1, to = 0.5, by = 0.1),
                           payoff = NA) 
for(i in 1:length(payoffMatrix$threshold)) {
  confMatrix <- confusion.matrix(db$Exited,
                db$predNew, 
                threshold = payoffMatrix$threshold[i])
  payoffMatrix$payoff[i] <- confMatrix[2,2]*1000 + confMatrix[2,1]*(-250)
}
payoffMatrix

From this matrix we see, that the best payoff the company will get with threshold 0.2, so let’s stick to it.

confMatrixNew <- confusion.matrix(db$Exited, 
                    db$predNew, threshold = 0.2)
confMatrixNew
##     obs
## pred    0    1
##    0 5555  608
##    1 2408 1429
## attr(,"class")
## [1] "confusion.matrix"
accuracyNew <- sum(diag(confMatrixNew)) / sum(confMatrixNew)
accuracyNew
## [1] 0.6984

This time from the confusion matrix we see that 5555 customers were correctly predicted not to exit the bank (true-negative) and 1429 were correctly predicted to churn (true-positive). And as for the accuracy it is ~70%. The same result gives the calculatation of cross-validated accuracy (which is below). Not 80% like it was before, but still pretty high. So we can come back to interpretation of the coefficients.

library(boot)

Acc03 <- function(r, pi = 0) {
  cm <- confusion.matrix(r, pi, threshold = 0.2)
  acc <- sum(diag(cm)) / sum(cm)
  return(acc)
}

set.seed(534381)
cv.glm(db, logitModelNew, cost = Acc03, K = 6)$delta
## [1] 0.6967000 0.6969337

Gender of the customer has an influence on their exiting: ‘Male’ gender of a customer decreases the odds of churn by a factor of 0.59, so 41% compared to female customers.

Being an active customer also has influence on churning in the same negative way as being a man: status as an “active” customer decreases the odds of exiting the bank by a factor of 0.34, so 76% compared to “unactive” customers.

And we also can recall that customers’ location in Germany increases the odds of exiting the bank by a factor of 2.17, so 117% compared to someone who is not located in Germany.

3 - Survival analysis

  • remind two histograms
library(ggplot2)
plotTenure <- db %>% 
    mutate(Exited = Exited %>% factor(labels = c("Did not churn", "Churned"))) %>% 
ggplot() +
    geom_histogram(aes(x = Tenure,
                 fill = factor(Exited)), binwidth = 1) +
   facet_grid( ~ Exited) +
   theme(legend.position = "none")
plotTenure

Predict time to churn

We want to predict the time when the customer will churn. First, let’s create a survial object.

library(survival)
cbind(db %>% dplyr::select(Tenure, Exited),
      surv = Surv(db$Tenure, db$Exited)) %>% head(10)
survObj <- Surv(db$Tenure, db$Exited)
str(survObj)
##  'Surv' num [1:10000, 1:2]  2   1+  8   1+  2+  8   7+  4   4+  2+ ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "time" "status"
##  - attr(*, "type")= chr "right"
# Compute and print fit
fitKMSimple <- survfit(survObj ~ 1)
print(fitKMSimple)
## Call: survfit(formula = survObj ~ 1)
## 
##       n  events  median 0.95LCL 0.95UCL 
##   10000    2037      10      10      NA
plot(fitKMSimple, xlab = "Tenure", ylab = "Survival function", main = "Survival function")

Here we see that we observe 10000 customers, 2037 of which have exited our bank (churned). The median here is 10 which means that 50% of the customers do not exit our service (churn) before they reach a tenure duration of 10 years.

The survival function gives us the probability that a customer will churn in the period leading up to the time point t. Here, we see, that longer customer stays in our bank, less the probability that they will churn.

But that’s not the end actually… even if it could be ~

Kaplan-Meier with Categorial Covariate

I’m really interested in conducting Kaplan-Meier with Categorial Covariate! Let’s try geographical position at first (as Germany play not the last “role” in increasing customer’s churn)

fitKMCov <- survfit(survObj ~ Geography, data = db)
print(fitKMCov)
## Call: survfit(formula = survObj ~ Geography, data = db)
## 
##                      n events median 0.95LCL 0.95UCL
## Geography=France  5014    810     NA      NA      NA
## Geography=Germany 2509    814      9       9       9
## Geography=Spain   2477    413     NA      NA      NA

NA in rows of France and Spain mean that only few customers churned within the time under observation. Let’s also look at the graph where we can wee how “Germany” like is decreasing (more, than others), which means that customers from Germany tend to churn more than customers from Spain or France.

plot(fitKMCov, lty = 2:3,
     xlab = "Tenure", ylab = "Survival function", main = "Survival function")
legend(1, .4, c("France", "Germany", "Spain"), lty = 2:3)

Let’s try the same with Gender:

fitKMCov <- survfit(survObj ~ Gender, data = db)
print(fitKMCov)
## Call: survfit(formula = survObj ~ Gender, data = db)
## 
##                  n events median 0.95LCL 0.95UCL
## Gender=Female 4543   1139     10      10      10
## Gender=Male   5457    898     NA      NA      NA
plot(fitKMCov, lty = 2:3,
     xlab = "Tenure", ylab = "Survival function", main = "Survival function")
legend(1, .4, c("Female", "Male"), lty = 2:3)

Again our analysis from the second part (with logistic regression) is confirmed: according to the graph Male customers stay loyal to our bank longer than Female.

And now for the status of an Active Member:

fitKMCov <- survfit(survObj ~ IsActiveMember  , data = db)
print(fitKMCov)
## Call: survfit(formula = survObj ~ IsActiveMember, data = db)
## 
##                     n events median 0.95LCL 0.95UCL
## IsActiveMember=0 4849   1302     10       9      10
## IsActiveMember=1 5151    735     NA      NA      NA
plot(fitKMCov, lty = 2:3,
     xlab = "Tenure", ylab = "Survival function", main = "Survival function")
legend(1, .4, c("Not active", "Active"), lty = 2:3)

Q.E.D. members of the bank, which are considerend to be active tend to stay with the bank longer than non-active members (who churn more often).

Cox Proportional Hazards Model

Now let’s try to specify the Cox PH model :

library(rms)
fitCPH1 <- cph(Surv(Tenure, Exited) ~ CreditScore + Geography + Gender + Age + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary,
               data = db, 
               x = TRUE, y = TRUE, surv = TRUE, 
               time.inc = 1)

cph(formula = Surv(Tenure, Exited) ~ CreditScore + Geography + Gender + Age + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary, data = db,
    x = TRUE, y = TRUE, surv = TRUE, time.inc = 1)
## Cox Proportional Hazards Model
##  
##  cph(formula = Surv(Tenure, Exited) ~ CreditScore + Geography + 
##      Gender + Age + Balance + NumOfProducts + HasCrCard + IsActiveMember + 
##      EstimatedSalary, data = db, x = TRUE, y = TRUE, surv = TRUE, 
##      time.inc = 1)
##  
##                      Model Tests        Discrimination    
##                                            Indexes        
##  Obs     10000    LR chi2    1156.95    R2       0.113    
##  Events   2037    d.f.            10    Dxy      0.436    
##  Center 1.0789    Pr(> chi2)  0.0000    g        0.806    
##                   Score chi2 1246.04    gr       2.240    
##                   Pr(> chi2)  0.0000                      
##  
##                    Coef    S.E.   Wald Z Pr(>|Z|)
##  CreditScore       -0.0005 0.0002  -2.23 0.0255  
##  Geography=Germany  0.4883 0.0544   8.98 <0.0001 
##  Geography=Spain    0.0496 0.0605   0.82 0.4123  
##  Gender=Male       -0.3868 0.0448  -8.64 <0.0001 
##  Age                0.0475 0.0018  26.75 <0.0001 
##  Balance            0.0000 0.0000   4.79 <0.0001 
##  NumOfProducts     -0.0666 0.0387  -1.72 0.0852  
##  HasCrCard         -0.0634 0.0484  -1.31 0.1900  
##  IsActiveMember    -0.7377 0.0475 -15.54 <0.0001 
##  EstimatedSalary    0.0000 0.0000  -0.01 0.9920  
## 
exp(fitCPH1$coefficients) %>% round(2)
##       CreditScore Geography=Germany   Geography=Spain       Gender=Male 
##              1.00              1.63              1.05              0.68 
##               Age           Balance     NumOfProducts         HasCrCard 
##              1.05              1.00              0.94              0.94 
##    IsActiveMember   EstimatedSalary 
##              0.48              1.00

Here we see pretty much the same picture as we had with logistic regression in the secind part. The interpretation for that will be :

  • the hazard to churn for German customers is 63% higher than for others (French/Spanish) or increases by a factor of about 1.63 as compared to customers from other countries;
  • the hazard to churn for Male customers is 32% lower than for female customers or decreases by a factor about 0.68 as compared to female customers;
  • the hazard to churn for Active customers is 52% lower than for non-active ones or decreases by a factor about 0.48 as compared to non-active customers.

I’ve been using only three factors in all parts of the analysis, because these ones happened to be the most important, according to coefficients.

SO, dear client! If you meet german girl who is not considered as an active one [I mean customer] (here i don’t want to make you sad, but) she will probably churn…