library(readr)
db <- read_csv("Churn_Modelling.csv")
str(db)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of 14 variables:
## $ RowNumber : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CustomerId : num 15634602 15647311 15619304 15701354 15737888 ...
## $ Surname : chr "Hargrave" "Hill" "Onio" "Boni" ...
## $ CreditScore : num 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : chr "France" "Spain" "France" "France" ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Age : num 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : num 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : num 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : num 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : num 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : num 1 0 1 0 0 1 0 1 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. RowNumber = col_double(),
## .. CustomerId = col_double(),
## .. Surname = col_character(),
## .. CreditScore = col_double(),
## .. Geography = col_character(),
## .. Gender = col_character(),
## .. Age = col_double(),
## .. Tenure = col_double(),
## .. Balance = col_double(),
## .. NumOfProducts = col_double(),
## .. HasCrCard = col_double(),
## .. IsActiveMember = col_double(),
## .. EstimatedSalary = col_double(),
## .. Exited = col_double()
## .. )
library(dplyr)
library(corrplot)
db %>% select_if(is.numeric) %>%
dplyr::select(-CustomerId) %>%
cor() %>%
corrplot()
Let’s also have a quick look on the correlation graph with our variables. There we may notice a slight positive correlation between variables “Exited” and “Age”. And also very slight negative correlation between variables “Exited” and “IsActiveMember”. BUT the coefficients here are small. Anyway, we’ll get to it a bit later. And also there’s no multicollinearity, which is good.
Observing a churn rate: here we see that 2037 customers have already churned (left the the bank after all the bank after all), which is 20% of all number of observations.
library("janitor")
churn = tabyl(db$Exited)
churn
As for all of the customers, the average tenure value is 5.01 (which is 5 years). As it is seen from the historam the same can be said about both groups (churners and no-churners).
mean(db$Tenure) # to look at the average tenure value for all of the customers
## [1] 5.0128
library(ggplot2)
plotTenure <- db %>%
mutate(Exited = Exited %>% factor(labels = c("No", "Yes"))) %>%
ggplot() +
geom_histogram(aes(x = Tenure,
fill = factor(Exited)), binwidth = 1) +
geom_vline(aes(xintercept=mean(Tenure))) +
facet_grid( ~ Exited) +
theme(legend.position = "none")
plotTenure # to look at the average tenure value for two groups: churners and no-churners
logitModelFull <- glm(Exited ~ CreditScore + Geography + Gender + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary, family = binomial, data = db)
summary(logitModelFull) # the model
##
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age +
## Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember +
## EstimatedSalary, family = binomial, data = db)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3097 -0.6589 -0.4560 -0.2697 2.9940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.392e+00 2.448e-01 -13.857 < 2e-16 ***
## CreditScore -6.683e-04 2.803e-04 -2.384 0.0171 *
## GeographyGermany 7.747e-01 6.767e-02 11.448 < 2e-16 ***
## GeographySpain 3.522e-02 7.064e-02 0.499 0.6181
## GenderMale -5.285e-01 5.449e-02 -9.699 < 2e-16 ***
## Age 7.271e-02 2.576e-03 28.230 < 2e-16 ***
## Tenure -1.595e-02 9.355e-03 -1.705 0.0882 .
## Balance 2.637e-06 5.142e-07 5.128 2.92e-07 ***
## NumOfProducts -1.015e-01 4.713e-02 -2.154 0.0312 *
## HasCrCard -4.468e-02 5.934e-02 -0.753 0.4515
## IsActiveMember -1.075e+00 5.769e-02 -18.643 < 2e-16 ***
## EstimatedSalary 4.807e-07 4.737e-07 1.015 0.3102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8561.4 on 9988 degrees of freedom
## AIC: 8585.4
##
## Number of Fisher Scoring iterations: 5
coefsexp <- coef(logitModelFull) %>% exp() %>% round(2)
coefsexp # the odds
## (Intercept) CreditScore GeographyGermany GeographySpain
## 0.03 1.00 2.17 1.04
## GenderMale Age Tenure Balance
## 0.59 1.08 0.98 1.00
## NumOfProducts HasCrCard IsActiveMember EstimatedSalary
## 0.90 0.96 0.34 1.00
Coefficient interpretation:
From the full model we can see some factors, that have an influence on customers’ churning. For example, customer’s geographical location in Germany have an influence on exiting (we refuse from H0 which states that there is no geographical location does not influence customers churning). By looking at the odds we may add: location in Germany increases the odds of exiting the bank by a factor of 2.17, so 117% compared to someone who is not located in Germany.
There also can be seen some influence of gender, age, and customers’ being an active member. BUT before interpretting everything it’s better to use stepAIC function and find the best model. Moreover, it’s important to check it if it is good enought.
library(MASS)
logitModelNew <- stepAIC(logitModelFull, trace = 0)
summary(logitModelNew)
##
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age +
## Tenure + Balance + NumOfProducts + IsActiveMember, family = binomial,
## data = db)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3270 -0.6592 -0.4553 -0.2682 2.9826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.377e+00 2.359e-01 -14.312 < 2e-16 ***
## CreditScore -6.682e-04 2.803e-04 -2.384 0.0171 *
## GeographyGermany 7.741e-01 6.766e-02 11.441 < 2e-16 ***
## GeographySpain 3.586e-02 7.063e-02 0.508 0.6116
## GenderMale -5.291e-01 5.448e-02 -9.712 < 2e-16 ***
## Age 7.268e-02 2.575e-03 28.230 < 2e-16 ***
## Tenure -1.595e-02 9.350e-03 -1.706 0.0879 .
## Balance 2.653e-06 5.140e-07 5.162 2.45e-07 ***
## NumOfProducts -1.007e-01 4.712e-02 -2.137 0.0326 *
## IsActiveMember -1.075e+00 5.767e-02 -18.648 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10110 on 9999 degrees of freedom
## Residual deviance: 8563 on 9990 degrees of freedom
## AIC: 8583
##
## Number of Fisher Scoring iterations: 5
coefsexp <- coef(logitModelNew) %>% exp() %>% round(2)
coefsexp
## (Intercept) CreditScore GeographyGermany GeographySpain
## 0.03 1.00 2.17 1.04
## GenderMale Age Tenure Balance
## 0.59 1.08 0.98 1.00
## NumOfProducts IsActiveMember
## 0.90 0.34
library(descr)
LogRegR2(logitModelNew)
## Chi2 1546.83
## Df 9
## Sig. 0
## Cox and Snell Index 0.1433133
## Nagelkerke Index 0.2252868
## McFadden's R2 0.1530033
Using LogRegR2() function we see, that the model is not very good (because most of the values are <0.2), the algorithm has troubles with explaining lots of the varibles.
library(SDMTools)
db$predNew <- predict(logitModelNew, type = "response", na.action = na.exclude)
db %>% dplyr::select(Exited, predNew) %>% tail()
confMatrixNew <- confusion.matrix(db$Exited,
db$predNew, threshold = 0.5)
confMatrixNew
## obs
## pred 0 1
## 0 7676 1597
## 1 287 440
## attr(,"class")
## [1] "confusion.matrix"
Here we see that 7676 customers were correctly predicted not to exit the bank (true-negative) and 440 were correctly predicted to churn (true-positive).
accuracyNew <- sum(diag(confMatrixNew)) / sum(confMatrixNew)
accuracyNew
## [1] 0.8116
And as for the accuracy we have 81% of it, which is very good. But we can also check another variants, look at payoffs and choosу the best option.
payoffMatrix <- data.frame(threshold = seq(from = 0.1, to = 0.5, by = 0.1),
payoff = NA)
for(i in 1:length(payoffMatrix$threshold)) {
confMatrix <- confusion.matrix(db$Exited,
db$predNew,
threshold = payoffMatrix$threshold[i])
payoffMatrix$payoff[i] <- confMatrix[2,2]*1000 + confMatrix[2,1]*(-250)
}
payoffMatrix
From this matrix we see, that the best payoff the company will get with threshold 0.2, so let’s stick to it.
confMatrixNew <- confusion.matrix(db$Exited,
db$predNew, threshold = 0.2)
confMatrixNew
## obs
## pred 0 1
## 0 5555 608
## 1 2408 1429
## attr(,"class")
## [1] "confusion.matrix"
accuracyNew <- sum(diag(confMatrixNew)) / sum(confMatrixNew)
accuracyNew
## [1] 0.6984
This time from the confusion matrix we see that 5555 customers were correctly predicted not to exit the bank (true-negative) and 1429 were correctly predicted to churn (true-positive). And as for the accuracy it is ~70%. The same result gives the calculatation of cross-validated accuracy (which is below). Not 80% like it was before, but still pretty high. So we can come back to interpretation of the coefficients.
library(boot)
Acc03 <- function(r, pi = 0) {
cm <- confusion.matrix(r, pi, threshold = 0.2)
acc <- sum(diag(cm)) / sum(cm)
return(acc)
}
set.seed(534381)
cv.glm(db, logitModelNew, cost = Acc03, K = 6)$delta
## [1] 0.6967000 0.6969337
Gender of the customer has an influence on their exiting: ‘Male’ gender of a customer decreases the odds of churn by a factor of 0.59, so 41% compared to female customers.
Being an active customer also has influence on churning in the same negative way as being a man: status as an “active” customer decreases the odds of exiting the bank by a factor of 0.34, so 76% compared to “unactive” customers.
And we also can recall that customers’ location in Germany increases the odds of exiting the bank by a factor of 2.17, so 117% compared to someone who is not located in Germany.
library(ggplot2)
plotTenure <- db %>%
mutate(Exited = Exited %>% factor(labels = c("Did not churn", "Churned"))) %>%
ggplot() +
geom_histogram(aes(x = Tenure,
fill = factor(Exited)), binwidth = 1) +
facet_grid( ~ Exited) +
theme(legend.position = "none")
plotTenure
We want to predict the time when the customer will churn. First, let’s create a survial object.
library(survival)
cbind(db %>% dplyr::select(Tenure, Exited),
surv = Surv(db$Tenure, db$Exited)) %>% head(10)
survObj <- Surv(db$Tenure, db$Exited)
str(survObj)
## 'Surv' num [1:10000, 1:2] 2 1+ 8 1+ 2+ 8 7+ 4 4+ 2+ ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "time" "status"
## - attr(*, "type")= chr "right"
# Compute and print fit
fitKMSimple <- survfit(survObj ~ 1)
print(fitKMSimple)
## Call: survfit(formula = survObj ~ 1)
##
## n events median 0.95LCL 0.95UCL
## 10000 2037 10 10 NA
plot(fitKMSimple, xlab = "Tenure", ylab = "Survival function", main = "Survival function")
Here we see that we observe 10000 customers, 2037 of which have exited our bank (churned). The median here is 10 which means that 50% of the customers do not exit our service (churn) before they reach a tenure duration of 10 years.
The survival function gives us the probability that a customer will churn in the period leading up to the time point t. Here, we see, that longer customer stays in our bank, less the probability that they will churn.
But that’s not the end actually… even if it could be ~
I’m really interested in conducting Kaplan-Meier with Categorial Covariate! Let’s try geographical position at first (as Germany play not the last “role” in increasing customer’s churn)
fitKMCov <- survfit(survObj ~ Geography, data = db)
print(fitKMCov)
## Call: survfit(formula = survObj ~ Geography, data = db)
##
## n events median 0.95LCL 0.95UCL
## Geography=France 5014 810 NA NA NA
## Geography=Germany 2509 814 9 9 9
## Geography=Spain 2477 413 NA NA NA
NA in rows of France and Spain mean that only few customers churned within the time under observation. Let’s also look at the graph where we can wee how “Germany” like is decreasing (more, than others), which means that customers from Germany tend to churn more than customers from Spain or France.
plot(fitKMCov, lty = 2:3,
xlab = "Tenure", ylab = "Survival function", main = "Survival function")
legend(1, .4, c("France", "Germany", "Spain"), lty = 2:3)
Let’s try the same with Gender:
fitKMCov <- survfit(survObj ~ Gender, data = db)
print(fitKMCov)
## Call: survfit(formula = survObj ~ Gender, data = db)
##
## n events median 0.95LCL 0.95UCL
## Gender=Female 4543 1139 10 10 10
## Gender=Male 5457 898 NA NA NA
plot(fitKMCov, lty = 2:3,
xlab = "Tenure", ylab = "Survival function", main = "Survival function")
legend(1, .4, c("Female", "Male"), lty = 2:3)
Again our analysis from the second part (with logistic regression) is confirmed: according to the graph Male customers stay loyal to our bank longer than Female.
And now for the status of an Active Member:
fitKMCov <- survfit(survObj ~ IsActiveMember , data = db)
print(fitKMCov)
## Call: survfit(formula = survObj ~ IsActiveMember, data = db)
##
## n events median 0.95LCL 0.95UCL
## IsActiveMember=0 4849 1302 10 9 10
## IsActiveMember=1 5151 735 NA NA NA
plot(fitKMCov, lty = 2:3,
xlab = "Tenure", ylab = "Survival function", main = "Survival function")
legend(1, .4, c("Not active", "Active"), lty = 2:3)
Q.E.D. members of the bank, which are considerend to be active tend to stay with the bank longer than non-active members (who churn more often).
Now let’s try to specify the Cox PH model :
library(rms)
fitCPH1 <- cph(Surv(Tenure, Exited) ~ CreditScore + Geography + Gender + Age + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary,
data = db,
x = TRUE, y = TRUE, surv = TRUE,
time.inc = 1)
cph(formula = Surv(Tenure, Exited) ~ CreditScore + Geography + Gender + Age + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary, data = db,
x = TRUE, y = TRUE, surv = TRUE, time.inc = 1)
## Cox Proportional Hazards Model
##
## cph(formula = Surv(Tenure, Exited) ~ CreditScore + Geography +
## Gender + Age + Balance + NumOfProducts + HasCrCard + IsActiveMember +
## EstimatedSalary, data = db, x = TRUE, y = TRUE, surv = TRUE,
## time.inc = 1)
##
## Model Tests Discrimination
## Indexes
## Obs 10000 LR chi2 1156.95 R2 0.113
## Events 2037 d.f. 10 Dxy 0.436
## Center 1.0789 Pr(> chi2) 0.0000 g 0.806
## Score chi2 1246.04 gr 2.240
## Pr(> chi2) 0.0000
##
## Coef S.E. Wald Z Pr(>|Z|)
## CreditScore -0.0005 0.0002 -2.23 0.0255
## Geography=Germany 0.4883 0.0544 8.98 <0.0001
## Geography=Spain 0.0496 0.0605 0.82 0.4123
## Gender=Male -0.3868 0.0448 -8.64 <0.0001
## Age 0.0475 0.0018 26.75 <0.0001
## Balance 0.0000 0.0000 4.79 <0.0001
## NumOfProducts -0.0666 0.0387 -1.72 0.0852
## HasCrCard -0.0634 0.0484 -1.31 0.1900
## IsActiveMember -0.7377 0.0475 -15.54 <0.0001
## EstimatedSalary 0.0000 0.0000 -0.01 0.9920
##
exp(fitCPH1$coefficients) %>% round(2)
## CreditScore Geography=Germany Geography=Spain Gender=Male
## 1.00 1.63 1.05 0.68
## Age Balance NumOfProducts HasCrCard
## 1.05 1.00 0.94 0.94
## IsActiveMember EstimatedSalary
## 0.48 1.00
Here we see pretty much the same picture as we had with logistic regression in the secind part. The interpretation for that will be :
I’ve been using only three factors in all parts of the analysis, because these ones happened to be the most important, according to coefficients.
SO, dear client! If you meet german girl who is not considered as an active one [I mean customer] (here i don’t want to make you sad, but) she will probably churn…