‘Churn for Bank Customers’ is a Kaggle Dataset which contains artificial information about bank clients for 10 thousand clients from 3 countries: France, Germany, Spain. It is important for banks to know in advance which client tends to leave the company, in other words to churn, since new clients are more expensive to get than to prevent churning. In the analysis the churn will be predicted using Binary Logistic Regression model.
cat("Number of Observations in Dataset:", nrow(ds),
"\nNumber of Variables in Dataset:", ncol(ds),
"\nNumber of Observations with Missing Data in Dataset:", nrow(ds)-nrow(na.omit(ds)))
## Number of Observations in Dataset: 10000
## Number of Variables in Dataset: 14
## Number of Observations with Missing Data in Dataset: 0
Variables Description (source):
1. RowNumber — corresponds to the record (row) number and has
no effect on the output.
2. CustomerId — contains random values and has no effect on
customer leaving the bank.
3. Surname — the surname of a customer has no impact on their
decision to leave the bank.
4. CreditScore — can have an effect on customer churn, since a
customer with a higher credit score is less likely to leave the
bank.
5. Geography — a customer’s location can affect their decision
to leave the bank.
6. Gender — it’s interesting to explore whether gender plays a
role in a customer leaving the bank.
7. Age — this is certainly relevant, since older customers are
less likely to leave their bank than younger ones.
8. Tenure — refers to the number of years that the customer has
been a client of the bank. Normally, older clients are more loyal and
less likely to leave a bank.
9. Balance — also a very good indicator of customer churn, as
people with a higher balance in their accounts are less likely to leave
the bank 10. compared to those with lower balances.
10. NumOfProducts — refers to the number of products that a
customer has purchased through the bank.
11. HasCrCard — denotes whether or not a customer has a credit
card. This column is also relevant, since people with a credit card are
less likely to leave the bank.
12. IsActiveMember — active customers are less likely to leave
the bank.
13. EstimatedSalary — as with balance, people with lower
salaries are more likely to leave the bank compared to those with higher
salaries.
14. Exited — whether or not the customer left the bank.
Variables class:
ds$RowNumber = ds$RowNumber %>% as.integer()
ds$CustomerId = as.integer(ds$CustomerId)
ds$Geography = ds$Geography %>% as.factor()
ds$Gender = ds$Gender %>% as.factor()
ds$HasCrCard = ds$HasCrCard %>% as.factor()
ds$IsActiveMember = ds$IsActiveMember %>% as.factor()
ds$Exited = ds$Exited %>% as.factor()
cl = sapply(ds,class)
col = as.vector(colnames(ds))
data.frame(variable = col,class = cl) %>%
group_by(class,variable) %>%
summarise()
## # A tibble: 14 × 2
## # Groups: class [4]
## class variable
## <chr> <chr>
## 1 character Surname
## 2 factor Exited
## 3 factor Gender
## 4 factor Geography
## 5 factor HasCrCard
## 6 factor IsActiveMember
## 7 integer Age
## 8 integer CreditScore
## 9 integer CustomerId
## 10 integer NumOfProducts
## 11 integer RowNumber
## 12 integer Tenure
## 13 numeric Balance
## 14 numeric EstimatedSalary
The dependent variable is “Exited” [binary] - whether a client has
churn or not
Independent variables are:
1. Balance [continuous]
2. IsActiveMember [binary]
3. HasCrCard [binary]
4. Geography (control) [factor]
5. Age (control) [continuous]
6. Gender (control) [binary]
RQ: Which categories of client data would influence the churn of the bank client?
Hypotheses:
1. People with a higher balance in their accounts are less likely to
stop using the bank services compared to those with lower balances
2. Clients who are active members will be less likely to stop using the
bank services
3. Clients who had a credit card are less likely to stop using the bank
services
# data preparation
ds1 = ds %>% dplyr::select(Exited, Balance, IsActiveMember, HasCrCard, Geography, Age, Gender) %>% na.omit()
summary(ds1)
## Exited Balance IsActiveMember HasCrCard Geography
## 0:7963 Min. : 0 0:4849 0:2945 France :5014
## 1:2037 1st Qu.: 0 1:5151 1:7055 Germany:2509
## Median : 97199 Spain :2477
## Mean : 76486
## 3rd Qu.:127644
## Max. :250898
## Age Gender
## Min. :18.00 Female:4543
## 1st Qu.:32.00 Male :5457
## Median :37.00
## Mean :38.92
## 3rd Qu.:44.00
## Max. :92.00
g = ggplot(ds1, aes(as.logical(as.integer(Exited)-1))) + labs(title = "Distribution of Churned Client", x = "Is Client is Churned")
p = g + geom_bar(aes(fill = Exited)) + theme(legend.position = 'none')
ggplotly(p)
80% of observations in data is non-churned clients and only 20% are clients who left the company. So, it certainly can be said that non-churned clients is the overrepresented group in the sample.
fig = plot_ly(x = ds1$Balance, type = "histogram") %>% layout(xaxis = list(title = 'Balance ($)'), title = "Balance Distribution")
fig
By the graph it can be seen that many client had zero-balance, it can affect the model in the future.
cat("non-zero obs:",sum(ds1$Balance > 0), "|", (sum(ds1$Balance > 0)/nrow(ds1))*100, "%")
## non-zero obs: 6383 | 63.83 %
cat("\n")
cat("zero obs:", sum(ds1$Balance == 0), "|", (sum(ds1$Balance == 0)/nrow(ds1))*100, "%")
## zero obs: 3617 | 36.17 %
36% of the observations were zero-balance, which is not critical, however it have to be considered while interpreting the model. Obviously, it is non-normal distribution. However, it is a large sample, so it does not matter so much.
fig <- plot_ly(alpha = 0.6)
fig <- fig %>% add_histogram(x = as.numeric(unlist((filter(ds1, ds1$Exited == 0))[2])), name = "Non-churned")
fig <- fig %>% add_histogram(x = as.numeric(unlist((filter(ds1, ds1$Exited == 1))[2])), name = "Churned")
fig <- fig %>% layout(barmode = "overlay", xaxis = list(title = 'Balance ($)'), title = "Balance Distribution")
fig
This is a double histogram graph for Churned and Non-churned clients, as it can be seen within churned clients there are much less those who have zero-balance.
Also, let’s assume that there is no zero-balance users, in this case, would the distribution be normal?
check = (filter(ds1, ds1$Balance > 0))[2] %>% unlist() %>% as.numeric()
plot_ly(x = check, type = "histogram") %>% layout(xaxis = list(title = 'Balance ($)'), title = "Balance Distribution")
qqnorm(check)
Indeed, excluding zero-balance observations the Balance distribution would be normal. So, an additional factor should be added.
g <- ggplot(ds1, aes(as.logical(as.integer(IsActiveMember)-1))) + labs(title = "Distribution of Active Members", x = "Was an Active Member?")
p <- g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)
Active and non-Active members are distributed quite similar, however in terms of churned and non-churned clients there is a big difference. Non-Active members include almost double amount of churned clients in comparison with Active members group.
g <- ggplot(ds1, aes(as.logical(as.integer(HasCrCard)-1))) + labs(title = "Distribution of Card Owners", x = "Had a Credit Card")
p <- g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)
So, according to the bar plot there are much more clients who has Credit Card. Not much can be concluded about churned and non-churned distribution among owners and non-owners of Credit Card.
g <- ggplot(ds1, aes(x = Geography)) + labs(title = "Geographical Distribution", x = "Country")
p <- g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)
Half of the data is about clients from France, the other half is divided equally by Germany and Spain. In France and Germany there are similar quantity of clients that churned, but proportionally the country with the most amount of churned clients is Germany, the least amount is in Spain.
fig <- plot_ly(alpha = 0.6)
fig <- fig %>% add_histogram(x = as.numeric(unlist((filter(ds1, ds1$Exited == 0))[6])), name = "Non-churned")
fig <- fig %>% add_histogram(x = as.numeric(unlist((filter(ds1, ds1$Exited == 1))[6])), name = "Churned")
fig <- fig %>% layout(barmode = "overlay", xaxis = list(title = 'Age (years)'), title = "Age Distribution")
fig
Most of the clients are people from 30 to 40 years old, however the churned clients are not in this range, which maybe an indicator of influence of age on exited in the future model. Most of the churned clients are people from 35 to 50.
g <- ggplot(ds1, aes(Gender)) + labs(title = "Distribution of Gender", x = "Gender")
p <- g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)
Major part of clients of the bank are men, however there are more woman who exit the bank (25% of women vs 16% of male)
g = ggplot(ds1, aes(x = Exited, y = Balance)) + geom_boxplot()
ggplotly(g)
On the graph it can be seen that there is no such visible difference in median, however the difference can be seen in first quartile of two boxes. So, it can be assumed that there is some relationship between variables.
However, in 3.1.2 interesting pattern was found: if we remove all zero-balance clients from sample we will get a normal distribution, so let’s divide the variable by two groups.
ds2 = ds1
ds2= filter(ds2, Balance > 0)
g = ggplot(ds2, aes(x = Exited, y = Balance)) + geom_boxplot()
ggplotly(g)
As it can be seen the boxplot graph is much different from the first. Let us test independence using statistical test
# F Test
group0 = (filter(ds2, Exited == 0))[2] %>% unlist() %>% as.numeric()
group1 = (filter(ds2, Exited == 1))[2] %>% unlist() %>% as.numeric()
var.test(group0,group1, alternative = "two.sided")
##
## F test to compare two variances
##
## data: group0 and group1
## F = 0.96045, num df = 4845, denom df = 1536, p-value = 0.3247
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8847912 1.0407821
## sample estimates:
## ratio of variances
## 0.9604538
The p-value of F-test is p = 0.3247 which is greater than the significance level 0.05. There is no significant difference between the two variances. As the sample is large normality assumption can be ignored => use unpaired t-test
t.test(Balance ~ Exited ,data = ds2, alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: Balance by Exited
## t = -1.3605, df = 2539.6, p-value = 0.1738
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -2956.7129 534.4991
## sample estimates:
## mean in group 0 mean in group 1
## 119535.9 120747.0
p-value>0.05 ==> the Balance of churned clients is not significantly different from the Balance of non-churned clients
In conclusion, it can be assumed that there is no significant relationship between Balance and Exited if Balance is not zero.
Well, for this time boxplots will be useless since the value of variable balance is static and equals to zero for all of the observations. So, I will use another approach and create new binary variable which will be used as an indicator whether the client had zero balance or not. Then I will compute chi-square test of independency.
ds2 = ds1
ds2$isZeroBalance = ifelse(ds2$Balance == 0, 1, 0)
ds2$isZeroBalance = as.logical(as.integer(ds2$isZeroBalance))
g <- ggplot(ds2, aes(isZeroBalance)) + labs(title = "Distribution of Zero and Non-Zero Balance Clients")
p <- g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) +
guides(fill=guide_legend(title="Exited"))
ggplotly(p) %>% layout(xaxis = list(title = 'Is client has Zero Balance'))
NON-Zero Balance clients includes three times more churned clients in absolute numbers than Zero balance clients or 24% churned in NON-Zero Balance group against 13.8% churned in Zero balance group. Testing it with chi-square test of independency.
H0: isZeroBalance is independent of Exited H1: isZeroBalance is not independent of Exited
contingency = table(ds2$Exited, ds2$isZeroBalance)
chisq.test(contingency)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency
## X-squared = 149.08, df = 1, p-value < 2.2e-16
p-value<0.05 ==> isZeroBalance may not be independent of Exited
In conclusion, it can be assumed that there is a significant relationship between isZeroBalance and Exited. So, this variable will be included into dataset.
ds1$isZeroBalance = ifelse(ds1$Balance == 0, 1, 0) %>% as.factor()
H0: IsActiveMember is independent of Exited H1: IsActiveMember is not independent of Exited
contingency = table(ds1$Exited, ds1$IsActiveMember)
chisq.test(contingency)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency
## X-squared = 242.99, df = 1, p-value < 2.2e-16
p-value<0.05 ==> IsActiveMember may not be independent of Exited
H0: HasCrCard is independent of Exited H1: HasCrCard is not independent of Exited
contingency = table(ds1$Exited, ds1$HasCrCard)
chisq.test(contingency)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency
## X-squared = 0.47134, df = 1, p-value = 0.4924
p-value>0.05 ==> HasCrCard is independent of Exited
H0: Geography is independent of Exited H1: Geography is not independent of Exited
contingency = table(ds1$Exited, ds1$Geography)
chisq.test(contingency)
##
## Pearson's Chi-squared test
##
## data: contingency
## X-squared = 301.26, df = 2, p-value < 2.2e-16
p-value<0.05 ==> Geography may not be independent of Exited
g = ggplot(ds1, aes(x = Exited, y = Age)) + geom_boxplot()
ggplotly(g)
So, median differs from group to group quite visibly, so it can be assumed that there is a relationship between two variables.
H0: Gender is independent of Exited H1: Gender is not independent of Exited
contingency = table(ds1$Exited, ds1$Gender)
chisq.test(contingency)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency
## X-squared = 112.92, df = 1, p-value < 2.2e-16
p-value<0.05 ==> Gender may not be independent of Exited
Now, finally to the model creation.
Firstly, I will create two models with just one independent variable to decide whether to use Balance or isZeroBalance variable. Both of them cannot be used because they highly correlated initially which will lead to multicollinearity problem in the model.
blr_a = glm(Exited ~ Balance , data = ds1, family = "binomial")
summary(blr_a)
##
## Call:
## glm(formula = Exited ~ Balance, family = "binomial", data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9029 -0.7300 -0.5631 -0.5631 1.9596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.761e+00 4.395e-02 -40.08 <2e-16 ***
## Balance 4.852e-06 4.127e-07 11.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10110 on 9999 degrees of freedom
## Residual deviance: 9967 on 9998 degrees of freedom
## AIC: 9971
##
## Number of Fisher Scoring iterations: 4
Estimate for Balance is too low to be really influential.
blr_b = glm(Exited ~ isZeroBalance , data = ds1, family = "binomial")
summary(blr_b)
##
## Call:
## glm(formula = Exited ~ isZeroBalance, family = "binomial", data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7423 -0.7423 -0.5455 -0.5455 1.9894
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.14832 0.02927 -39.23 <2e-16 ***
## isZeroBalance1 -0.68170 0.05637 -12.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10110 on 9999 degrees of freedom
## Residual deviance: 9953 on 9998 degrees of freedom
## AIC: 9957
##
## Number of Fisher Scoring iterations: 4
Binary variable have estimate -0.68 which tells that isZeroBalance is more influential than Balance
So, by summaries of two model the second model is better in terms of AIC (9971 in model 1 and 9957 in model 2, the lower the AIC the better the model). Consider comparison with compareGLM function.
compareGLM(blr_a, blr_b)
## $Models
## Formula
## 1 "Exited ~ Balance"
## 2 "Exited ~ isZeroBalance"
##
## $Fit.criteria
## Rank Df.res AIC AICc BIC McFadden Cox.and.Snell Nagelkerke p.value
## 1 2 9998 9973 9973 9995 0.01412 0.01418 0.02228 3.317e-33
## 2 2 9998 9959 9959 9981 0.01551 0.01555 0.02445 2.909e-36
All values states that it is better to use isZeroBalance as variable.
I will create 5 models: 1. The first one will include only two the most influential (according to the assumptions in exploratory analysis in previous part) variables:
model1 = glm(Exited ~ isZeroBalance + IsActiveMember, data = ds1, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember, family = "binomial",
## data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8692 -0.6424 -0.6122 -0.4427 2.1779
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.77875 0.03649 -21.34 <2e-16 ***
## isZeroBalance1 -0.69438 0.05700 -12.18 <2e-16 ***
## IsActiveMember1 -0.80054 0.05175 -15.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10110 on 9999 degrees of freedom
## Residual deviance: 9705 on 9997 degrees of freedom
## AIC: 9711
##
## Number of Fisher Scoring iterations: 4
model2 = glm(Exited ~ isZeroBalance + IsActiveMember + Age , data = ds1, family = "binomial")
summary(model2)
##
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age,
## family = "binomial", data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0512 -0.6812 -0.4715 -0.2885 2.8707
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.627918 0.107702 -33.69 <2e-16 ***
## isZeroBalance1 -0.696366 0.059746 -11.66 <2e-16 ***
## IsActiveMember1 -1.088719 0.056860 -19.15 <2e-16 ***
## Age 0.072721 0.002536 28.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8807.8 on 9996 degrees of freedom
## AIC: 8815.8
##
## Number of Fisher Scoring iterations: 5
model3 = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography , data = ds1, family = "binomial")
summary(model3)
##
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age +
## Geography, family = "binomial", data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2265 -0.6655 -0.4622 -0.2782 2.8742
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.96715 0.11617 -34.150 < 2e-16 ***
## isZeroBalance1 -0.37214 0.06707 -5.548 2.88e-08 ***
## IsActiveMember1 -1.08550 0.05729 -18.947 < 2e-16 ***
## Age 0.07280 0.00256 28.441 < 2e-16 ***
## GeographyGermany 0.75250 0.06807 11.055 < 2e-16 ***
## GeographySpain 0.03080 0.07019 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8671.8 on 9994 degrees of freedom
## AIC: 8683.8
##
## Number of Fisher Scoring iterations: 5
model4 = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender , data = ds1, family = "binomial")
summary(model4)
##
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age +
## Geography + Gender, family = "binomial", data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3307 -0.6582 -0.4554 -0.2712 2.9625
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.693484 0.118874 -31.070 < 2e-16 ***
## isZeroBalance1 -0.389417 0.067463 -5.772 7.82e-09 ***
## IsActiveMember1 -1.078399 0.057562 -18.735 < 2e-16 ***
## Age 0.072754 0.002573 28.279 < 2e-16 ***
## GeographyGermany 0.739126 0.068480 10.793 < 2e-16 ***
## GeographySpain 0.033658 0.070576 0.477 0.633
## GenderMale -0.524079 0.054381 -9.637 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8578.2 on 9993 degrees of freedom
## AIC: 8592.2
##
## Number of Fisher Scoring iterations: 5
model5 = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender + HasCrCard, data = ds1, family = "binomial")
summary(model5)
##
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age +
## Geography + Gender + HasCrCard, family = "binomial", data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3256 -0.6578 -0.4562 -0.2715 2.9668
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.662553 0.125776 -29.120 < 2e-16 ***
## isZeroBalance1 -0.388067 0.067487 -5.750 8.91e-09 ***
## IsActiveMember1 -1.079226 0.057579 -18.743 < 2e-16 ***
## Age 0.072747 0.002573 28.274 < 2e-16 ***
## GeographyGermany 0.740231 0.068503 10.806 < 2e-16 ***
## GeographySpain 0.033213 0.070578 0.471 0.638
## GenderMale -0.523909 0.054383 -9.634 < 2e-16 ***
## HasCrCard1 -0.044234 0.059272 -0.746 0.455
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8577.6 on 9992 degrees of freedom
## AIC: 8593.6
##
## Number of Fisher Scoring iterations: 5
tab_model(model1, model2, model3, model4, model5,
show.aic = T, show.loglik = T,
dv.labels = c('model1', 'model2', 'model3', 'model4', 'model5'))
| model1 | model2 | model3 | model4 | model5 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Odds Ratios | CI | p | Odds Ratios | CI | p | Odds Ratios | CI | p | Odds Ratios | CI | p | Odds Ratios | CI | p |
| (Intercept) | 0.46 | 0.43 – 0.49 | <0.001 | 0.03 | 0.02 – 0.03 | <0.001 | 0.02 | 0.02 – 0.02 | <0.001 | 0.02 | 0.02 – 0.03 | <0.001 | 0.03 | 0.02 – 0.03 | <0.001 |
| isZeroBalance [1] | 0.50 | 0.45 – 0.56 | <0.001 | 0.50 | 0.44 – 0.56 | <0.001 | 0.69 | 0.60 – 0.79 | <0.001 | 0.68 | 0.59 – 0.77 | <0.001 | 0.68 | 0.59 – 0.77 | <0.001 |
| IsActiveMember [1] | 0.45 | 0.41 – 0.50 | <0.001 | 0.34 | 0.30 – 0.38 | <0.001 | 0.34 | 0.30 – 0.38 | <0.001 | 0.34 | 0.30 – 0.38 | <0.001 | 0.34 | 0.30 – 0.38 | <0.001 |
| Age | 1.08 | 1.07 – 1.08 | <0.001 | 1.08 | 1.07 – 1.08 | <0.001 | 1.08 | 1.07 – 1.08 | <0.001 | 1.08 | 1.07 – 1.08 | <0.001 | |||
| Geography [Germany] | 2.12 | 1.86 – 2.43 | <0.001 | 2.09 | 1.83 – 2.40 | <0.001 | 2.10 | 1.83 – 2.40 | <0.001 | ||||||
| Geography [Spain] | 1.03 | 0.90 – 1.18 | 0.661 | 1.03 | 0.90 – 1.19 | 0.633 | 1.03 | 0.90 – 1.19 | 0.638 | ||||||
| Gender [Male] | 0.59 | 0.53 – 0.66 | <0.001 | 0.59 | 0.53 – 0.66 | <0.001 | |||||||||
| HasCrCard [1] | 0.96 | 0.85 – 1.08 | 0.455 | ||||||||||||
| Observations | 10000 | 10000 | 10000 | 10000 | 10000 | ||||||||||
| R2 Tjur | 0.040 | 0.137 | 0.153 | 0.163 | 0.163 | ||||||||||
| AIC | 9711.032 | 8815.799 | 8683.810 | 8592.194 | 8593.639 | ||||||||||
| log-Likelihood | -4852.516 | -4403.899 | -4335.905 | -4289.097 | -4288.819 | ||||||||||
Comparison: 1. Pseudo R2 (R2 Tjur) The best model - model4/model5 (since it has the higher r2) 2. AIC The best model - model4 or model2 (since they has the lowest value of AIC) 3. Log-Likelihood The best model - model5 (since it has higher log-Likelihood)
So, choosing between model4 and model5. Comparing them by ANOVA (analysis of deviance)
anova(model4, test ="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Exited
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 9999 10109.8
## isZeroBalance 1 156.76 9998 9953.0 < 2.2e-16 ***
## IsActiveMember 1 247.99 9997 9705.0 < 2.2e-16 ***
## Age 1 897.23 9996 8807.8 < 2.2e-16 ***
## Geography 2 135.99 9994 8671.8 < 2.2e-16 ***
## Gender 1 93.62 9993 8578.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model5, test ="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Exited
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 9999 10109.8
## isZeroBalance 1 156.76 9998 9953.0 <2e-16 ***
## IsActiveMember 1 247.99 9997 9705.0 <2e-16 ***
## Age 1 897.23 9996 8807.8 <2e-16 ***
## Geography 2 135.99 9994 8671.8 <2e-16 ***
## Gender 1 93.62 9993 8578.2 <2e-16 ***
## HasCrCard 1 0.56 9992 8577.6 0.4561
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model4,model5, test ="Chisq")
## Analysis of Deviance Table
##
## Model 1: Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender
## Model 2: Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender +
## HasCrCard
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9993 8578.2
## 2 9992 8577.6 1 0.55551 0.4561
The null deviance tells us how well the response variable can be predicted by a model with only an intercept term. And HasCrCard really poorly predicts the Exited, moreover its effect is insignificant.
Although, the function from rcompanion library has slightly more detailed results.
compareGLM(model4, model5)
## $Models
## Formula
## 1 "Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender"
## 2 "Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender + HasCrCard"
##
## $Fit.criteria
## Rank Df.res AIC AICc BIC McFadden Cox.and.Snell Nagelkerke p.value
## 1 7 9993 8594 8594 8652 0.1515 0.1420 0.2232 0
## 2 8 9992 8596 8596 8661 0.1516 0.1421 0.2233 0
Comparison: 1. Pseudo R2 The best model - model5 (since it has the higher r2) 2. AIC The best model - model4 (since it has the lowest value of AIC) 3. BIC The best model - model4 (since it has lower BIC)
So, in terms of Explained Variance model5 is better, but in terms of AIC and BIC (indicate a better balance of goodness-of-fit of the model and the complexity of the model) model4 is better. So, model4.
A well-fitting model shows no significant difference between the model and the observed data - Hosmer and Lemeshow test
logitgof(ds1$Exited, fitted(model4), g = 10) #g should be larger than the number of predictors; df = g - 2
##
## Hosmer and Lemeshow test (binary model)
##
## data: ds1$Exited, fitted(model4)
## X-squared = 7.9215, df = 8, p-value = 0.4412
For model4 there is not enough evidence to say it is a poor fit.
summary(model4)
##
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age +
## Geography + Gender, family = "binomial", data = ds1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3307 -0.6582 -0.4554 -0.2712 2.9625
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.693484 0.118874 -31.070 < 2e-16 ***
## isZeroBalance1 -0.389417 0.067463 -5.772 7.82e-09 ***
## IsActiveMember1 -1.078399 0.057562 -18.735 < 2e-16 ***
## Age 0.072754 0.002573 28.279 < 2e-16 ***
## GeographyGermany 0.739126 0.068480 10.793 < 2e-16 ***
## GeographySpain 0.033658 0.070576 0.477 0.633
## GenderMale -0.524079 0.054381 -9.637 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8578.2 on 9993 degrees of freedom
## AIC: 8592.2
##
## Number of Fisher Scoring iterations: 5
The only non-significant predictor is GeographySpain
isZeroBalance1 - the log of odds ratio for people with zero balance will be lower by 0.39 in comparison with non-zero balance group IsActiveMember1 - the log of odds ratio for customers who are active members of the bank will be lower by 1 in comparison with non-active members Age - With increase on 1 in Age variable the log of odds will increase on 0.07 GeographyGermany - the log of odds ratio for customers from Germany will be higher by 0.7 in comparison with clients from France GeographySpain - the log of odds ratio for customers from Spain will be higher by 0.03 in comparison with clients from France GenderMale - the log of odds ratio for customers with male gender will be higher by 0.5 in comparison with female customers
# Splitting the data
ds3 = ds1[,-2]
set.seed(1)
# using 70% of ds as training set and 30% as test set
sample = sample(c(TRUE, FALSE), nrow(ds3), replace=TRUE, prob=c(0.7,0.3))
train = ds3[sample, ]
test = ds3[!sample, ]
# Creating ROC Curve Plot
model4a = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender , data = train, family = "binomial")
test_prob4 = predict(model4a, newdata = test, type = "response")
test_roc4 = roc(test$Exited ~ test_prob4, plot = TRUE, print.auc = TRUE)
So, as it can be seen by the ROC plot the model is quite good. It is not an amazing prediction quality, but the model is far away (AUC = 0.765) from the baseline (AUC = 0.5).
There would not be much description, since the plots tell all. The general thing: these plots are about how different level of certain factors affect the slope of predicted probabilities of continuous variable Age.
plot_model(model4, type = "pred", terms = c("Age", "Geography"))
plot_model(model4, type = "pred", terms = c("Age", "Gender"))
plot_model(model4, type = "pred", terms = c("Age", "IsActiveMember"))
IsActiveMember has the most visible differences between two slopes
plot_model(model4, type = "pred", terms = c("Age", "isZeroBalance"))
outlierTest(model4)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 2142 2.964319 0.0030335 NA
Outlier test suggests that observation 2142 is outlier.
influenceIndexPlot(model4)
According to diagnostic plots, it can be assumed that: observations 6444 and 9573 are outliers by cook’s distance, observations 2141 and 9573 are outliers by studentized residuals, observations 2459 and 3995 are outliers by hat-values
influencePlot(model4, col = "red")
## StudRes Hat CookD
## 2142 2.964319 0.0001328598 0.0015093895
## 2459 -1.510892 0.0038653080 0.0011787911
## 3995 -1.251423 0.0034397229 0.0005856426
## 6444 -1.802326 0.0028927740 0.0016811639
## 9573 2.953355 0.0001580015 0.0017356948
According to influenceplot it can be assumed that observations 2142, 2459, 3995, 6444, 9573 are outliers.
vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## isZeroBalance 1.255101 1 1.120313
## IsActiveMember 1.076039 1 1.037323
## Age 1.080528 1 1.039484
## Geography 1.256073 2 1.058653
## Gender 1.002214 1 1.001106
Values are ~ 1 -> there is no multicollinearity
model4fix <- update(model4, subset = c(-2142,-2459,-3995,-6444,-9573))
compareCoefs(model4, model4fix)
## Calls:
## 1: glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + Geography
## + Gender, family = "binomial", data = ds1)
## 2: glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + Geography
## + Gender, family = "binomial", data = ds1, subset = c(-2142, -2459, -3995,
## -6444, -9573))
##
## Model 1 Model 2
## (Intercept) -3.693 -3.729
## SE 0.119 0.120
##
## isZeroBalance1 -0.3894 -0.3935
## SE 0.0675 0.0676
##
## IsActiveMember1 -1.0784 -1.0838
## SE 0.0576 0.0577
##
## Age 0.07275 0.07369
## SE 0.00257 0.00259
##
## GeographyGermany 0.7391 0.7389
## SE 0.0685 0.0686
##
## GeographySpain 0.0337 0.0345
## SE 0.0706 0.0707
##
## GenderMale -0.5241 -0.5247
## SE 0.0544 0.0545
##
No coefficients change significantly, none of them changes significance level.
There are no changes in the sign or magnitude of coefficients, which means there is no need to exclude any observations from model 4.
So, about the hypotheses:
People with a higher balance in their accounts are less likely to stop using the bank services compared to those with lower balances - FALSE, however the balance factor is worse predictor in comparison with isZeroBalance (that divides clients by two groups according to their balance), isZeroBalance shows that people with non-zero balance are more likely to churn
The most active customers will be less likely to stop using the bank services - TRUE - Most active customers are less likely to churn, while People with higher balance tend to churn
People with a credit card are less likely to stop using the bank services - FALSE - The presence or absence of a credit card among bank clients does not in any way determine their likelihood of stopping using the bank’s services.
Age, Geography, Gender are significant predictors of churn
RQ: Which categories of client data would influence the churn of the
bank client?
The answer: Age, Geography, Gender, isZeroBalance and isActiveMember