setwd("C:/Users/evdya/Desktop")
halloween <- read.csv("Churn_Modelling.csv")
summary(halloween)
## RowNumber CustomerId Surname CreditScore
## Min. : 1 Min. :15565701 Smith : 32 Min. :350.0
## 1st Qu.: 2501 1st Qu.:15628528 Martin : 29 1st Qu.:584.0
## Median : 5000 Median :15690738 Scott : 29 Median :652.0
## Mean : 5000 Mean :15690941 Walker : 28 Mean :650.5
## 3rd Qu.: 7500 3rd Qu.:15753234 Brown : 26 3rd Qu.:718.0
## Max. :10000 Max. :15815690 Genovese: 25 Max. :850.0
## (Other) :9831
## Geography Gender Age Tenure
## France :5014 Female:4543 Min. :18.00 Min. : 0.000
## Germany:2509 Male :5457 1st Qu.:32.00 1st Qu.: 3.000
## Spain :2477 Median :37.00 Median : 5.000
## Mean :38.92 Mean : 5.013
## 3rd Qu.:44.00 3rd Qu.: 7.000
## Max. :92.00 Max. :10.000
##
## Balance NumOfProducts HasCrCard IsActiveMember
## Min. : 0 Min. :1.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 97199 Median :1.00 Median :1.0000 Median :1.0000
## Mean : 76486 Mean :1.53 Mean :0.7055 Mean :0.5151
## 3rd Qu.:127644 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :250898 Max. :4.00 Max. :1.0000 Max. :1.0000
##
## EstimatedSalary Exited
## Min. : 11.58 Min. :0.0000
## 1st Qu.: 51002.11 1st Qu.:0.0000
## Median :100193.91 Median :0.0000
## Mean :100090.24 Mean :0.2037
## 3rd Qu.:149388.25 3rd Qu.:0.0000
## Max. :199992.48 Max. :1.0000
##
library(dplyr)
count(halloween, Exited)
## # A tibble: 2 x 2
## Exited n
## <int> <int>
## 1 0 7963
## 2 1 2037
library(ggplot2)
ggplot(halloween, aes(x = Exited)) +
geom_histogram(stat = "count", fill = "tan1") +
ggtitle("Number of no-churners and churners")
2037 customers churned from Halloween BI
Churners <- halloween %>%
filter(Exited==1)
Nochurners <- halloween %>%
filter(Exited==0)
mean(Churners$Tenure)
## [1] 4.932744
mean(Nochurners$Tenure)
## [1] 5.033279
Both churners and no-churners last as customers of Halloween BI for about 5 years
Firstly, we need to build a logistic regression model
Initially, we assume that all of the explanatory variables, which we included in our model, do not have influence on churn of the customers.
logModel <- glm(Exited ~ CreditScore + Geography + Gender + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary,
family = binomial, data = halloween)
summary(logModel)
##
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age +
## Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember +
## EstimatedSalary, family = binomial, data = halloween)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3097 -0.6589 -0.4560 -0.2697 2.9940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.392e+00 2.448e-01 -13.857 < 2e-16 ***
## CreditScore -6.683e-04 2.803e-04 -2.384 0.0171 *
## GeographyGermany 7.747e-01 6.767e-02 11.448 < 2e-16 ***
## GeographySpain 3.522e-02 7.064e-02 0.499 0.6181
## GenderMale -5.285e-01 5.449e-02 -9.699 < 2e-16 ***
## Age 7.271e-02 2.576e-03 28.230 < 2e-16 ***
## Tenure -1.595e-02 9.355e-03 -1.705 0.0882 .
## Balance 2.637e-06 5.142e-07 5.128 2.92e-07 ***
## NumOfProducts -1.015e-01 4.713e-02 -2.154 0.0312 *
## HasCrCard -4.468e-02 5.934e-02 -0.753 0.4515
## IsActiveMember -1.075e+00 5.769e-02 -18.643 < 2e-16 ***
## EstimatedSalary 4.807e-07 4.737e-07 1.015 0.3102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8561.4 on 9988 degrees of freedom
## AIC: 8585.4
##
## Number of Fisher Scoring iterations: 5
From the summary of our model, we can see that variables CreditScore, GeographyGermany, GenderMale, Age, Balance, NumOfProducts, and IsActiveMember have significant effect on customer churn because their p-values are statistically significant.
Significant p-value means that the probability of getting the given result or larger for each of the mentioned variables, assuming that they have no effect on customer churn, is less that 0.05.
Secondly, we need to remove logarithm and transform our coefficients to the odds
odds <- coef(logModel) %>% exp() %>% round(2)
odds
## (Intercept) CreditScore GeographyGermany GeographySpain
## 0.03 1.00 2.17 1.04
## GenderMale Age Tenure Balance
## 0.59 1.08 0.98 1.00
## NumOfProducts HasCrCard IsActiveMember EstimatedSalary
## 0.90 0.96 0.34 1.00
We will interpret all of the variables to better understand what is going on.
Each one-unit increase in customer’s credit score increases the odds of churning by a factor of 1.
If a customer is from Germany, it increases the odds of churning by a factor of 2.17 compared to customers from other countries.
If a customer is from Spain it increases the odds of churning by a factor of 1.04 compared to customers from other countries.
If a customer is male, the odds of him churning decrease by a factor of 0.59 compared to female customers.
Each one-unit increase in customer’s age increases the odds of churning by a factor of 1.08.
Each one-unit increase in tenure of a customer decreases the odds of churning by a factor of 0.98.
Each one-unit increase in customer’s balance increases the odds of churning by a factor of 1.
Each one-unit increase in number of products of a customer decreases the odds of churning by a factor of 0.90.
Having a credit card in our bank decreases the odds of a customer’s churning by a factor of 0.96 compared to customers who do not have it.
Being an active member of our bank decreases the odd of a customer’s churning by a factor 0.34 compared to not active members.
Each one-unit increase in customer’s estimated salary increases the odds of churning by a factor of 1.
Thirdly, we need to select which variables to include in our model
library(MASS)
logModelNew <- stepAIC(logModel, trace = 0)
summary(logModelNew)
##
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age +
## Tenure + Balance + NumOfProducts + IsActiveMember, family = binomial,
## data = halloween)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3270 -0.6592 -0.4553 -0.2682 2.9826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.377e+00 2.359e-01 -14.312 < 2e-16 ***
## CreditScore -6.682e-04 2.803e-04 -2.384 0.0171 *
## GeographyGermany 7.741e-01 6.766e-02 11.441 < 2e-16 ***
## GeographySpain 3.586e-02 7.063e-02 0.508 0.6116
## GenderMale -5.291e-01 5.448e-02 -9.712 < 2e-16 ***
## Age 7.268e-02 2.575e-03 28.230 < 2e-16 ***
## Tenure -1.595e-02 9.350e-03 -1.706 0.0879 .
## Balance 2.653e-06 5.140e-07 5.162 2.45e-07 ***
## NumOfProducts -1.007e-01 4.712e-02 -2.137 0.0326 *
## IsActiveMember -1.075e+00 5.767e-02 -18.648 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10110 on 9999 degrees of freedom
## Residual deviance: 8563 on 9990 degrees of freedom
## AIC: 8583
##
## Number of Fisher Scoring iterations: 5
According to the summary of our new model, only the remaining variables have significant effect on customer churn: CreditScore, GeographyGermany, GeographySpain, GenderMale, Age, Tenure, Balance, NumOfProducts, and IsActiveMember.
Finally, we will evaluate the created model
library(descr)
LogRegR2(logModelNew)
## Chi2 1546.83
## Df 9
## Sig. 0
## Cox and Snell Index 0.1433133
## Nagelkerke Index 0.2252868
## McFadden's R2 0.1530033
Two out of three goodness-of-fit measures have coefficients smaller that 0.2, but they are nearly close to it. It means that the explanatory power of the model is relatively normal. We would conclude that our model is quite reasonable.
Firstly, we make the predictions about our model and construct the in-sample matrix according to them
library(SDMTools)
halloween$pred <- predict(logModelNew, type = "response", na.action = na.exclude)
confMatModel <- confusion.matrix(halloween$Exited, halloween$pred, threshold = 0.5)
confMatModel
## obs
## pred 0 1
## 0 7676 1597
## 1 287 440
## attr(,"class")
## [1] "confusion.matrix"
By the confusion matrix we can conclude that 7676 customers were correctly classified as no-churners and 440 customers were correctly classified as churners. Others were misclassified by our model: 1597 customers churned while we predicted them not to churn, and 287 customers didn’t churn but they were predicted to churn by our model.
Secondly, we calculate the accuracy
accModel <- sum(diag(confMatModel)) / sum(confMatModel)
accModel
## [1] 0.8116
The accuracy of our model is 81%, which is very high.
We can conclude that our model is pretty accurate.
At first, we need an accuracy function based on a threshold of 0.5
library(boot)
acc03 <- function(r, pi = 0) {
cm <- confusion.matrix(r, pi, threshold = 0.5)
acc <- sum(diag(cm)) / sum(cm)
return(acc)
}
Then, we calculate cross-validated accuracy
set.seed(1234)
cv.glm(halloween, logModelNew, cost = acc03, K = 6)$delta
## [1] 0.81110 0.81105
The accuracy we got is equal to 81% which is the same as our in-sample accuracy (81%).
Thus, we can conclude that quality of our model is good.
According to the information we got from our data, the factors which influence customer churn in Halloween BI the most are GeographyGermany, GenderMale, and IsActiveMember.
Customers from Germany are much more likely to churn than customers from other countries, while men and active customers are much less likely to churn. So, we think that the company should pay more attention to German customers as well as to female ones and those who are not atively using the services of the bank to prevent then churning.
We begin by buiding two histograms to look at the distribution of the tenure time dependent on whether or not a customer churned
plotChurn <- halloween %>%
mutate(Exited = Exited %>% factor(labels = c("No-churners", "Churners"))) %>%
ggplot() +
geom_histogram(binwidth = 1, aes(x = Tenure, fill = factor(Exited))) +
facet_grid( ~ Exited) +
theme(legend.position = "none") +
scale_fill_manual(values = c("tan1", "tan2")) +
ggtitle("Tenure of no-churners and churners")
plotChurn
From the histograms we can see that for both churners’ and no-churners’ tenure time is almost the same.
The least amount of people were customers of Halloween BI for less than a year or 10 years. The most amount of customers are using the bank for the period between 1 and 9 years.
First step is to create a new column that holds a survival object
library(survival)
cbind(halloween %>% dplyr::select(Tenure, Exited),
Surv = Surv(halloween$Tenure, halloween$Exited)) %>% head(10)
## Tenure Exited Surv
## 1 2 1 2
## 2 1 0 1+
## 3 8 1 8
## 4 1 0 1+
## 5 2 0 2+
## 6 8 1 8
## 7 7 0 7+
## 8 4 1 4
## 9 4 0 4+
## 10 2 0 2+
Second step is to estimate our survival function using Kaplan-Meier analysis
SurvObj <- Surv(halloween$Tenure, halloween$Exited)
fitKM <- survfit(SurvObj ~ 1, type = "kaplan-meier")
print(fitKM)
## Call: survfit(formula = SurvObj ~ 1, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## 10000 2037 10 10 NA
The results of the analysis show us that we have 10000 customers, of which 2037 churned in the time under observation.
The median survival time is 10 years. It means that about 50% of our customers do not churn before they reach a tenure duration of 10 years.
Now we plot the survival function
plot(fitKM, conf.int = FALSE, xlab = "Tenure", ylab = "Survival function", main = "Survival function")
The function gives us the information about the probability that a customer will not churn in the period leading up to the time point of 10 years.
The longer a person is a customer of Halloween BI, the higher the probability of him not churning.