1. Dataset Contents & Project Idea

‘Churn for Bank Customers’ is a Kaggle dataset containing synthetic information about 10,000 bank clients from three countries: France, Germany, and Spain. It is important for banks to know in advance which clients are likely to leave the company, in other words to churn, because acquiring new clients is more expensive than preventing existing ones from churning. In this analysis, churn is predicted with a binary logistic regression model.
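Throughout the report the data is assumed to be loaded into a data frame called ds. A minimal setup sketch (the file name churn.csv and the exact package list are assumptions, not part of the original code):

library(dplyr)     # data manipulation: %>%, filter(), select(), mutate()
library(ggplot2)   # static plots
library(plotly)    # interactive plots: ggplotly(), plot_ly()
# later sections also use rcompanion (compareGLM), sjPlot (tab_model, plot_model),
# pROC (roc), car (outlierTest, influencePlot, vif, compareCoefs) and generalhoslem (logitgof)

ds = read.csv("churn.csv", stringsAsFactors = FALSE)   # hypothetical file name for the Kaggle CSV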

cat("Number of Observations in Dataset:", nrow(ds),
    "\nNumber of Variables in Dataset:", ncol(ds),
    "\nNumber of Observations with Missing Data in Dataset:", nrow(ds)-nrow(na.omit(ds)))
## Number of Observations in Dataset: 10000 
## Number of Variables in Dataset: 14 
## Number of Observations with Missing Data in Dataset: 0

Variables Description (source):
1. RowNumber — corresponds to the record (row) number and has no effect on the output.
2. CustomerId — contains random values and has no effect on customer leaving the bank.
3. Surname — the surname of a customer has no impact on their decision to leave the bank.
4. CreditScore — can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.
5. Geography — a customer’s location can affect their decision to leave the bank.
6. Gender — it’s interesting to explore whether gender plays a role in a customer leaving the bank.
7. Age — this is certainly relevant, since older customers are less likely to leave their bank than younger ones.
8. Tenure — refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
9. Balance — also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
10. NumOfProducts — refers to the number of products that a customer has purchased through the bank.
11. HasCrCard — denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
12. IsActiveMember — active customers are less likely to leave the bank.
13. EstimatedSalary — as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
14. Exited — whether or not the customer left the bank.

Variable classes:

# convert the ID columns to integer and the categorical columns to factors
ds$RowNumber = ds$RowNumber %>% as.integer()
ds$CustomerId = ds$CustomerId %>% as.integer()
ds$Geography = ds$Geography %>% as.factor()
ds$Gender = ds$Gender %>% as.factor()
ds$HasCrCard = ds$HasCrCard %>% as.factor()
ds$IsActiveMember = ds$IsActiveMember %>% as.factor()
ds$Exited = ds$Exited %>% as.factor()

cl = sapply(ds,class)
col = as.vector(colnames(ds))

data.frame(variable = col,class = cl) %>%
  group_by(class,variable) %>%
  summarise()
## # A tibble: 14 × 2
## # Groups:   class [4]
##    class     variable       
##    <chr>     <chr>          
##  1 character Surname        
##  2 factor    Exited         
##  3 factor    Gender         
##  4 factor    Geography      
##  5 factor    HasCrCard      
##  6 factor    IsActiveMember 
##  7 integer   Age            
##  8 integer   CreditScore    
##  9 integer   CustomerId     
## 10 integer   NumOfProducts  
## 11 integer   RowNumber      
## 12 integer   Tenure         
## 13 numeric   Balance        
## 14 numeric   EstimatedSalary

2. Research Question & Hypotheses

The dependent variable is “Exited” [binary]: whether a client has churned or not.
The independent variables are:
1. Balance [continuous]
2. IsActiveMember [binary]
3. HasCrCard [binary]
4. Geography (control) [factor]
5. Age (control) [continuous]
6. Gender (control) [binary]

RQ: Which categories of client data influence whether a bank client churns?

Hypotheses:
1. People with a higher balance in their accounts are less likely to stop using the bank's services than those with lower balances.
2. Clients who are active members are less likely to stop using the bank's services.
3. Clients who have a credit card are less likely to stop using the bank's services.

3. Data Descriptives

# data preparation
ds1 = ds %>% dplyr::select(Exited, Balance, IsActiveMember, HasCrCard, Geography, Age, Gender) %>% na.omit()
summary(ds1)
##  Exited      Balance       IsActiveMember HasCrCard   Geography   
##  0:7963   Min.   :     0   0:4849         0:2945    France :5014  
##  1:2037   1st Qu.:     0   1:5151         1:7055    Germany:2509  
##           Median : 97199                            Spain  :2477  
##           Mean   : 76486                                          
##           3rd Qu.:127644                                          
##           Max.   :250898                                          
##       Age           Gender    
##  Min.   :18.00   Female:4543  
##  1st Qu.:32.00   Male  :5457  
##  Median :37.00                
##  Mean   :38.92                
##  3rd Qu.:44.00                
##  Max.   :92.00

3.1. Variables Distribution

3.1.1. Exited

g = ggplot(ds1, aes(as.logical(as.integer(Exited)-1))) + labs(title = "Distribution of Churned Clients", x = "Is Client Churned")
p = g + geom_bar(aes(fill = Exited)) + theme(legend.position = 'none')
ggplotly(p)

80% of the observations are non-churned clients and only 20% are clients who left the bank, so non-churned clients are clearly the overrepresented group in the sample.
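A quick numeric check of these shares (not part of the original code; output omitted):

# proportion of non-churned (0) and churned (1) clients
prop.table(table(ds1$Exited))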

3.1.2. Balance

fig = plot_ly(x = ds1$Balance, type = "histogram") %>% layout(xaxis = list(title = 'Balance ($)'), title = "Balance Distribution")
fig

The histogram shows that many clients have a zero balance, which may affect the model later.

cat("non-zero obs:",sum(ds1$Balance > 0), "|", (sum(ds1$Balance > 0)/nrow(ds1))*100, "%")
## non-zero obs: 6383 | 63.83 %
cat("\n")
cat("zero obs:", sum(ds1$Balance == 0), "|", (sum(ds1$Balance == 0)/nrow(ds1))*100, "%")
## zero obs: 3617 | 36.17 %

36% of the observations have a zero balance, which is not critical, but it has to be kept in mind when interpreting the model. The distribution is clearly non-normal; however, since the sample is large, this does not matter much.

fig <- plot_ly(alpha = 0.6)
fig <- fig %>% add_histogram(x = filter(ds1, Exited == 0)$Balance, name = "Non-churned")
fig <- fig %>% add_histogram(x = filter(ds1, Exited == 1)$Balance, name = "Churned")
fig <- fig %>% layout(barmode = "overlay", xaxis = list(title = 'Balance ($)'), title = "Balance Distribution")
fig

This is an overlaid histogram for churned and non-churned clients; among the churned clients there are far fewer with a zero balance.

Also, suppose there were no zero-balance clients: would the distribution be normal in that case?

check = filter(ds1, Balance > 0)$Balance
plot_ly(x = check, type = "histogram") %>% layout(xaxis = list(title = 'Balance ($)'), title = "Balance Distribution")
qqnorm(check)

Indeed, once the zero-balance observations are excluded, the Balance distribution looks approximately normal. So an additional factor indicating whether the balance is zero should be added.
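A hypothetical formal check that is not part of the original analysis (shapiro.test accepts at most 5000 values, so a random subsample of the non-zero balances is used; output omitted):

set.seed(1)
shapiro.test(sample(check, 5000))   # Shapiro-Wilk normality test on a subsample
qqnorm(check); qqline(check)        # same Q-Q plot as above, now with a reference line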

3.1.3. IsActiveMember

g <- ggplot(ds1, aes(as.logical(as.integer(IsActiveMember)-1))) + labs(title = "Distribution of Active Members", x = "Was an Active Member?")
p <-  g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)

Active and non-active members are distributed quite similarly; however, in terms of churned and non-churned clients there is a big difference: the non-active group contains almost twice as many churned clients as the active group.

3.1.4. HasCrCard

g <- ggplot(ds1, aes(as.logical(as.integer(HasCrCard)-1))) + labs(title = "Distribution of Card Owners", x = "Had a Credit Card")
p <-  g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)

According to the bar plot, far more clients have a credit card than not. Not much can be concluded about the churned/non-churned split between credit card owners and non-owners.

3.1.5. Geography (control)

g <- ggplot(ds1, aes(x = Geography)) + labs(title = "Geographical Distribution", x = "Country")
p <-  g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)

Half of the data concerns clients from France; the other half is split roughly equally between Germany and Spain. France and Germany have similar absolute numbers of churned clients, but proportionally Germany has the largest share of churned clients and Spain the smallest.

3.1.6. Age (control)

fig <- plot_ly(alpha = 0.6)
fig <- fig %>% add_histogram(x = filter(ds1, Exited == 0)$Age, name = "Non-churned")
fig <- fig %>% add_histogram(x = filter(ds1, Exited == 1)$Age, name = "Churned")
fig <- fig %>% layout(barmode = "overlay", xaxis = list(title = 'Age (years)'), title = "Age Distribution")
fig

Most of the clients are between 30 and 40 years old; however, the churned clients mostly fall outside this range, which may indicate an influence of age on Exited in the future model. Most of the churned clients are between 35 and 50.

3.1.7. Gender (control)

g <- ggplot(ds1, aes(Gender)) + labs(title = "Distribution of Gender", x = "Gender")
p <-  g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + guides(fill=guide_legend(title="Exited"))
ggplotly(p)

The majority of the bank's clients are men; however, women churn more often (25% of women vs. 16% of men).

3.2. Relationship between the independent and dependent variables (a deeper look)

3.2.1. Exited & Balance

g = ggplot(ds1, aes(x = Exited, y = Balance)) + geom_boxplot()
ggplotly(g)

The boxplot shows no clear difference in the medians, but a difference is visible in the first quartiles of the two boxes. So it can be assumed that there is some relationship between the variables.

However, in 3.1.2 an interesting pattern was found: if all zero-balance clients are removed from the sample, the distribution becomes approximately normal, so let's split the variable into two groups.

3.2.1.1. Non-Zero Balance & Exited

ds2 = ds1
ds2 = filter(ds2, Balance > 0)
g = ggplot(ds2, aes(x = Exited, y = Balance)) + geom_boxplot()
ggplotly(g)

As can be seen, this boxplot looks quite different from the first one. Let us test the difference between the groups with a statistical test.

# F Test
group0 = filter(ds2, Exited == 0)$Balance
group1 = filter(ds2, Exited == 1)$Balance
var.test(group0,group1, alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  group0 and group1
## F = 0.96045, num df = 4845, denom df = 1536, p-value = 0.3247
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8847912 1.0407821
## sample estimates:
## ratio of variances 
##          0.9604538

The p-value of the F-test is p = 0.3247, which is greater than the significance level of 0.05, so there is no significant difference between the two variances. Since the sample is large, the normality assumption can be relaxed => use an unpaired t-test.

t.test(Balance ~ Exited ,data = ds2, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Balance by Exited
## t = -1.3605, df = 2539.6, p-value = 0.1738
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -2956.7129   534.4991
## sample estimates:
## mean in group 0 mean in group 1 
##        119535.9        120747.0

p-value > 0.05 ==> the mean Balance of churned clients is not significantly different from the mean Balance of non-churned clients.

In conclusion, it can be assumed that there is no significant relationship between Balance and Exited if Balance is not zero.
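As a robustness check that is not part of the original analysis, a non-parametric test could be run on the same data (output omitted):

# Wilcoxon rank-sum test: compares the Balance distributions of churned and
# non-churned clients without relying on normality
wilcox.test(Balance ~ Exited, data = ds2)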

3.2.1.2. Zero Balance & Exited

This time boxplots would be useless, since Balance is constant (zero) for all of these observations. So I will take another approach: create a new binary variable indicating whether the client has a zero balance, and then run a chi-square test of independence.

ds2 = ds1 
ds2$isZeroBalance = ds2$Balance == 0   # TRUE if the client has a zero balance
g <- ggplot(ds2, aes(isZeroBalance)) + labs(title = "Distribution of Zero and Non-Zero Balance Clients")
p <-  g + geom_bar(aes(fill = as.logical(as.integer(Exited)-1))) + 
          guides(fill=guide_legend(title="Exited"))
ggplotly(p) %>% layout(xaxis = list(title = 'Does the client have a zero balance?'))

The non-zero balance group contains about three times as many churned clients in absolute numbers as the zero-balance group, or 24% churned in the non-zero balance group versus 13.8% in the zero-balance group. Let's test this with a chi-square test of independence.
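A quick check of those shares (not part of the original code; output omitted):

# row-wise churn proportions for non-zero (FALSE) and zero (TRUE) balance clients
prop.table(table(ds2$isZeroBalance, ds2$Exited), margin = 1)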

H0: isZeroBalance is independent of Exited
H1: isZeroBalance is not independent of Exited

contingency = table(ds2$Exited, ds2$isZeroBalance) 
chisq.test(contingency)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency
## X-squared = 149.08, df = 1, p-value < 2.2e-16

p-value<0.05 ==> isZeroBalance may not be independent of Exited

In conclusion, it can be assumed that there is a significant relationship between isZeroBalance and Exited, so this variable will be added to the dataset.

ds1$isZeroBalance = ifelse(ds1$Balance == 0, 1, 0) %>% as.factor()

3.2.2. Exited & IsActiveMember

H0: IsActiveMember is independent of Exited
H1: IsActiveMember is not independent of Exited

contingency = table(ds1$Exited, ds1$IsActiveMember) 
chisq.test(contingency)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency
## X-squared = 242.99, df = 1, p-value < 2.2e-16

p-value<0.05 ==> IsActiveMember may not be independent of Exited

3.2.3. Exited & HasCrCard

H0: HasCrCard is independent of Exited
H1: HasCrCard is not independent of Exited

contingency = table(ds1$Exited, ds1$HasCrCard) 
chisq.test(contingency)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency
## X-squared = 0.47134, df = 1, p-value = 0.4924

p-value > 0.05 ==> there is no evidence against the independence of HasCrCard and Exited

3.2.4. Exited & Geography

H0: Geography is independent of Exited
H1: Geography is not independent of Exited

contingency = table(ds1$Exited, ds1$Geography) 
chisq.test(contingency)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency
## X-squared = 301.26, df = 2, p-value < 2.2e-16

p-value<0.05 ==> Geography may not be independent of Exited

3.2.5. Exited & Age

g = ggplot(ds1, aes(x = Exited, y = Age)) + geom_boxplot()
ggplotly(g)

The medians differ quite visibly between the groups, so it can be assumed that there is a relationship between the two variables.
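As a supplementary check that is not part of the original analysis, the same two-sample approach used for Balance could be applied to Age (output omitted):

# Welch two-sample t-test of Age between churned and non-churned clients
t.test(Age ~ Exited, data = ds1, alternative = "two.sided", var.equal = FALSE)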

3.2.6. Exited & Gender

H0: Gender is independent of Exited
H1: Gender is not independent of Exited

contingency = table(ds1$Exited, ds1$Gender) 
chisq.test(contingency)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency
## X-squared = 112.92, df = 1, p-value < 2.2e-16

p-value<0.05 ==> Gender may not be independent of Exited

3.3. Summary of Descriptive Statistics

  1. Most of the observations concern non-churned clients (80%).
  2. An interesting pattern occurs in the Balance variable: its distribution is non-normal only because of the zero balances. Further exploration of the relationship between the dependent variable and Balance shows:
    • 2a. There appears to be no significant relationship between non-zero Balance and Exited.
    • 2b. There appears to be a significant relationship between zero balance (isZeroBalance) and Exited.
      However, I will consider both the Balance and isZeroBalance variables in the modelling step to test the assumptions made during the exploratory analysis.
  3. HasCrCard turned out to be independent of Exited.
  4. IsActiveMember is presumably one of the most influential factors.
  5. Age & Gender also seem to be influential factors.
  6. No initial expectation about the level of influence of the Geography variable.

Now, finally, on to building the models.

4. Models

4.1 Whether to use Balance or isZeroBalance

First, I will create two models with just one independent variable each to decide whether to use the Balance or the isZeroBalance variable. They cannot both be used, because they are strongly related by construction, which would lead to a multicollinearity problem in the model.
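A quick look at how strongly the two variables are related (a hypothetical addition; output omitted):

# point-biserial correlation between Balance and the zero-balance indicator
cor(ds1$Balance, as.numeric(as.character(ds1$isZeroBalance)))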

blr_a = glm(Exited ~ Balance , data = ds1, family = "binomial")
summary(blr_a)
## 
## Call:
## glm(formula = Exited ~ Balance, family = "binomial", data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9029  -0.7300  -0.5631  -0.5631   1.9596  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.761e+00  4.395e-02  -40.08   <2e-16 ***
## Balance      4.852e-06  4.127e-07   11.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10110  on 9999  degrees of freedom
## Residual deviance:  9967  on 9998  degrees of freedom
## AIC: 9971
## 
## Number of Fisher Scoring iterations: 4

The estimate for Balance is very small, though this partly reflects the scale of the variable (it is per dollar of balance), so its practical influence is hard to judge from the raw coefficient alone.

blr_b = glm(Exited ~ isZeroBalance , data = ds1, family = "binomial")
summary(blr_b)
## 
## Call:
## glm(formula = Exited ~ isZeroBalance, family = "binomial", data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7423  -0.7423  -0.5455  -0.5455   1.9894  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.14832    0.02927  -39.23   <2e-16 ***
## isZeroBalance1 -0.68170    0.05637  -12.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10110  on 9999  degrees of freedom
## Residual deviance:  9953  on 9998  degrees of freedom
## AIC: 9957
## 
## Number of Fisher Scoring iterations: 4

The binary variable has an estimate of -0.68, which suggests that isZeroBalance carries more predictive information than the raw Balance.

Judging by the two model summaries, the second model is better in terms of AIC (9971 for model 1 vs. 9957 for model 2; the lower the AIC, the better the model). Let's also compare them with the compareGLM function.

compareGLM(blr_a, blr_b)
## $Models
##   Formula                 
## 1 "Exited ~ Balance"      
## 2 "Exited ~ isZeroBalance"
## 
## $Fit.criteria
##   Rank Df.res  AIC AICc  BIC McFadden Cox.and.Snell Nagelkerke   p.value
## 1    2   9998 9973 9973 9995  0.01412       0.01418    0.02228 3.317e-33
## 2    2   9998 9959 9959 9981  0.01551       0.01555    0.02445 2.909e-36

All criteria indicate that it is better to use isZeroBalance as the predictor.

4.2. Nested Models

I will create five nested models.
  1. The first one will include only the two presumably most influential variables (according to the exploratory analysis in the previous part):

model1 = glm(Exited ~ isZeroBalance + IsActiveMember, data = ds1, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember, family = "binomial", 
##     data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8692  -0.6424  -0.6122  -0.4427   2.1779  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.77875    0.03649  -21.34   <2e-16 ***
## isZeroBalance1  -0.69438    0.05700  -12.18   <2e-16 ***
## IsActiveMember1 -0.80054    0.05175  -15.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10110  on 9999  degrees of freedom
## Residual deviance:  9705  on 9997  degrees of freedom
## AIC: 9711
## 
## Number of Fisher Scoring iterations: 4
  2. The second one will include all of the variables from model1 and also Age:
model2 = glm(Exited ~ isZeroBalance + IsActiveMember + Age , data = ds1, family = "binomial")
summary(model2)
## 
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age, 
##     family = "binomial", data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0512  -0.6812  -0.4715  -0.2885   2.8707  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.627918   0.107702  -33.69   <2e-16 ***
## isZeroBalance1  -0.696366   0.059746  -11.66   <2e-16 ***
## IsActiveMember1 -1.088719   0.056860  -19.15   <2e-16 ***
## Age              0.072721   0.002536   28.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8807.8  on 9996  degrees of freedom
## AIC: 8815.8
## 
## Number of Fisher Scoring iterations: 5
  3. The third model will include all of the variables from model2 and also Geography:
model3 = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography , data = ds1, family = "binomial")
summary(model3)
## 
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + 
##     Geography, family = "binomial", data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2265  -0.6655  -0.4622  -0.2782   2.8742  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.96715    0.11617 -34.150  < 2e-16 ***
## isZeroBalance1   -0.37214    0.06707  -5.548 2.88e-08 ***
## IsActiveMember1  -1.08550    0.05729 -18.947  < 2e-16 ***
## Age               0.07280    0.00256  28.441  < 2e-16 ***
## GeographyGermany  0.75250    0.06807  11.055  < 2e-16 ***
## GeographySpain    0.03080    0.07019   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8671.8  on 9994  degrees of freedom
## AIC: 8683.8
## 
## Number of Fisher Scoring iterations: 5
  4. The fourth one will include all of the variables from model3 and also Gender:
model4 = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender , data = ds1, family = "binomial")
summary(model4)
## 
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + 
##     Geography + Gender, family = "binomial", data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3307  -0.6582  -0.4554  -0.2712   2.9625  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.693484   0.118874 -31.070  < 2e-16 ***
## isZeroBalance1   -0.389417   0.067463  -5.772 7.82e-09 ***
## IsActiveMember1  -1.078399   0.057562 -18.735  < 2e-16 ***
## Age               0.072754   0.002573  28.279  < 2e-16 ***
## GeographyGermany  0.739126   0.068480  10.793  < 2e-16 ***
## GeographySpain    0.033658   0.070576   0.477    0.633    
## GenderMale       -0.524079   0.054381  -9.637  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8578.2  on 9993  degrees of freedom
## AIC: 8592.2
## 
## Number of Fisher Scoring iterations: 5
  5. The fifth model will include all of the variables from model4 and also HasCrCard:
model5 = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender + HasCrCard, data = ds1, family = "binomial")
summary(model5)
## 
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + 
##     Geography + Gender + HasCrCard, family = "binomial", data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3256  -0.6578  -0.4562  -0.2715   2.9668  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.662553   0.125776 -29.120  < 2e-16 ***
## isZeroBalance1   -0.388067   0.067487  -5.750 8.91e-09 ***
## IsActiveMember1  -1.079226   0.057579 -18.743  < 2e-16 ***
## Age               0.072747   0.002573  28.274  < 2e-16 ***
## GeographyGermany  0.740231   0.068503  10.806  < 2e-16 ***
## GeographySpain    0.033213   0.070578   0.471    0.638    
## GenderMale       -0.523909   0.054383  -9.634  < 2e-16 ***
## HasCrCard1       -0.044234   0.059272  -0.746    0.455    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8577.6  on 9992  degrees of freedom
## AIC: 8593.6
## 
## Number of Fisher Scoring iterations: 5

4.3. Model Comparison

tab_model(model1, model2, model3, model4, model5, 
          show.aic = T, show.loglik = T,
          dv.labels = c('model1', 'model2', 'model3', 'model4', 'model5'))
Values in each cell are odds ratios (95% CI), p-value:

Predictors           | model1                  | model2                  | model3                  | model4                  | model5
(Intercept)          | 0.46 (0.43–0.49) <0.001 | 0.03 (0.02–0.03) <0.001 | 0.02 (0.02–0.02) <0.001 | 0.02 (0.02–0.03) <0.001 | 0.03 (0.02–0.03) <0.001
isZeroBalance [1]    | 0.50 (0.45–0.56) <0.001 | 0.50 (0.44–0.56) <0.001 | 0.69 (0.60–0.79) <0.001 | 0.68 (0.59–0.77) <0.001 | 0.68 (0.59–0.77) <0.001
IsActiveMember [1]   | 0.45 (0.41–0.50) <0.001 | 0.34 (0.30–0.38) <0.001 | 0.34 (0.30–0.38) <0.001 | 0.34 (0.30–0.38) <0.001 | 0.34 (0.30–0.38) <0.001
Age                  |                         | 1.08 (1.07–1.08) <0.001 | 1.08 (1.07–1.08) <0.001 | 1.08 (1.07–1.08) <0.001 | 1.08 (1.07–1.08) <0.001
Geography [Germany]  |                         |                         | 2.12 (1.86–2.43) <0.001 | 2.09 (1.83–2.40) <0.001 | 2.10 (1.83–2.40) <0.001
Geography [Spain]    |                         |                         | 1.03 (0.90–1.18) 0.661  | 1.03 (0.90–1.19) 0.633  | 1.03 (0.90–1.19) 0.638
Gender [Male]        |                         |                         |                         | 0.59 (0.53–0.66) <0.001 | 0.59 (0.53–0.66) <0.001
HasCrCard [1]        |                         |                         |                         |                         | 0.96 (0.85–1.08) 0.455
Observations         | 10000                   | 10000                   | 10000                   | 10000                   | 10000
R2 Tjur              | 0.040                   | 0.137                   | 0.153                   | 0.163                   | 0.163
AIC                  | 9711.032                | 8815.799                | 8683.810                | 8592.194                | 8593.639
log-Likelihood       | -4852.516               | -4403.899               | -4335.905               | -4289.097               | -4288.819

Comparison:
1. Pseudo R2 (R2 Tjur): the best models are model4/model5, since they have the highest R2 (0.163).
2. AIC: the best model is model4, since it has the lowest AIC.
3. Log-likelihood: the best model is model5, since it has the highest log-likelihood.
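The same criteria can also be pulled out directly with base R; a hypothetical convenience snippet (output omitted, the values are already shown in the table above):

models = list(model1 = model1, model2 = model2, model3 = model3,
              model4 = model4, model5 = model5)
data.frame(AIC    = sapply(models, AIC),
           BIC    = sapply(models, BIC),
           logLik = sapply(models, function(m) as.numeric(logLik(m))))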

So the choice is between model4 and model5. Let's compare them with an analysis of deviance (ANOVA).

anova(model4, test ="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Exited
## 
## Terms added sequentially (first to last)
## 
## 
##                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            9999    10109.8              
## isZeroBalance   1   156.76      9998     9953.0 < 2.2e-16 ***
## IsActiveMember  1   247.99      9997     9705.0 < 2.2e-16 ***
## Age             1   897.23      9996     8807.8 < 2.2e-16 ***
## Geography       2   135.99      9994     8671.8 < 2.2e-16 ***
## Gender          1    93.62      9993     8578.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model5, test ="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Exited
## 
## Terms added sequentially (first to last)
## 
## 
##                Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                            9999    10109.8             
## isZeroBalance   1   156.76      9998     9953.0   <2e-16 ***
## IsActiveMember  1   247.99      9997     9705.0   <2e-16 ***
## Age             1   897.23      9996     8807.8   <2e-16 ***
## Geography       2   135.99      9994     8671.8   <2e-16 ***
## Gender          1    93.62      9993     8578.2   <2e-16 ***
## HasCrCard       1     0.56      9992     8577.6   0.4561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model4,model5, test ="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender
## Model 2: Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender + 
##     HasCrCard
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      9993     8578.2                     
## 2      9992     8577.6  1  0.55551   0.4561

The null deviance shows how well the response can be predicted by an intercept-only model; the table then shows how much each term reduces the residual deviance when added sequentially. HasCrCard barely reduces the deviance (by 0.56), and its effect is not significant (p ≈ 0.456).

The compareGLM function from the rcompanion library gives slightly more detailed results.

compareGLM(model4, model5)
## $Models
##   Formula                                                                         
## 1 "Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender"            
## 2 "Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender + HasCrCard"
## 
## $Fit.criteria
##   Rank Df.res  AIC AICc  BIC McFadden Cox.and.Snell Nagelkerke p.value
## 1    7   9993 8594 8594 8652   0.1515        0.1420     0.2232       0
## 2    8   9992 8596 8596 8661   0.1516        0.1421     0.2233       0

Comparison:
1. Pseudo R2: the best model is model5, since it has slightly higher values.
2. AIC: the best model is model4, since it has the lowest AIC.
3. BIC: the best model is model4, since it has the lower BIC.

So in terms of explained variance model5 is marginally better, but in terms of AIC and BIC (which balance goodness of fit against model complexity) model4 is better. Therefore, model4 is chosen.

A well-fitting model shows no significant difference between the model's predictions and the observed data; this is checked with the Hosmer-Lemeshow test.

logitgof(ds1$Exited, fitted(model4), g = 10) #g should be larger than the number of predictors; df = g - 2
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  ds1$Exited, fitted(model4)
## X-squared = 7.9215, df = 8, p-value = 0.4412

For model4 there is not enough evidence to say it is a poor fit.

summary(model4)
## 
## Call:
## glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + 
##     Geography + Gender, family = "binomial", data = ds1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3307  -0.6582  -0.4554  -0.2712   2.9625  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.693484   0.118874 -31.070  < 2e-16 ***
## isZeroBalance1   -0.389417   0.067463  -5.772 7.82e-09 ***
## IsActiveMember1  -1.078399   0.057562 -18.735  < 2e-16 ***
## Age               0.072754   0.002573  28.279  < 2e-16 ***
## GeographyGermany  0.739126   0.068480  10.793  < 2e-16 ***
## GeographySpain    0.033658   0.070576   0.477    0.633    
## GenderMale       -0.524079   0.054381  -9.637  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8578.2  on 9993  degrees of freedom
## AIC: 8592.2
## 
## Number of Fisher Scoring iterations: 5

The only non-significant predictor is GeographySpain.

Coefficient interpretation (on the log-odds scale):
• isZeroBalance1: clients with a zero balance have log odds of churning lower by 0.39 compared with the non-zero balance group.
• IsActiveMember1: active members have log odds of churning lower by about 1.08 compared with non-active members.
• Age: each additional year of age increases the log odds of churning by about 0.07.
• GeographyGermany: clients from Germany have log odds of churning higher by about 0.74 compared with clients from France.
• GeographySpain: clients from Spain have log odds of churning higher by about 0.03 compared with clients from France (not significant).
• GenderMale: male clients have log odds of churning lower by about 0.52 compared with female clients.
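To express these effects as odds ratios (as in the comparison table above), the coefficients can be exponentiated. A small sketch using Wald-type confidence intervals via confint.default (not part of the original code; output omitted):

# odds ratios with Wald 95% confidence intervals for model4
exp(cbind(OR = coef(model4), confint.default(model4)))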

5. ROC Analysis

# Splitting the data
ds3 = ds1[,-2]   # drop the raw Balance column; isZeroBalance is used instead
set.seed(1)

# using 70% of ds as training set and 30% as test set
sample = sample(c(TRUE, FALSE), nrow(ds3), replace=TRUE, prob=c(0.7,0.3))
train  = ds3[sample, ]
test   = ds3[!sample, ]

# Creating ROC Curve Plot

model4a = glm(Exited ~ isZeroBalance + IsActiveMember + Age + Geography + Gender , data = train, family = "binomial")
test_prob4 = predict(model4a, newdata = test, type = "response")
test_roc4 = roc(test$Exited ~ test_prob4, plot = TRUE, print.auc = TRUE)

As the ROC plot shows, the model is reasonably good. The prediction quality is not outstanding, but the model (AUC = 0.765) is well above the baseline (AUC = 0.5).
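A hypothetical follow-up that is not part of the original analysis: classify with a 0.5 cutoff and inspect the confusion matrix and accuracy on the test set (output omitted):

pred_class = ifelse(test_prob4 > 0.5, 1, 0)        # hard classification at a 0.5 threshold
actual     = as.numeric(as.character(test$Exited))
table(Predicted = pred_class, Actual = actual)     # confusion matrix
mean(pred_class == actual)                         # overall accuracy
# note: with ~80% non-churned clients, accuracy alone is an optimistic metric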

6. Plotting Predicted Probabilities

There will not be much commentary here, since the plots speak for themselves. The general idea: these plots show how different levels of each factor shift the curve of predicted churn probability over the continuous variable Age.

plot_model(model4, type = "pred", terms = c("Age", "Geography"))

plot_model(model4, type = "pred", terms = c("Age", "Gender"))

plot_model(model4, type = "pred", terms = c("Age", "IsActiveMember"))

IsActiveMember shows the most visible difference between the two curves.

plot_model(model4, type = "pred", terms = c("Age", "isZeroBalance"))
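For reference, the curves above can be reproduced manually with predict(); a hypothetical sketch in which the remaining predictors are held at reference levels of my choosing (output omitted):

# predicted churn probability over Age for active vs. non-active members
grid = expand.grid(Age = 18:92,
                   IsActiveMember = factor(c(0, 1)),
                   isZeroBalance  = factor(0, levels = c(0, 1)),
                   Geography      = factor("France", levels = levels(ds1$Geography)),
                   Gender         = factor("Female", levels = levels(ds1$Gender)))
grid$prob = predict(model4, newdata = grid, type = "response")
head(grid)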

7. Leverage Effect Tests

outlierTest(model4)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 2142 2.964319          0.0030335           NA

The outlier test does not flag any observation at Bonferroni-adjusted p < 0.05; observation 2142 simply has the largest studentized residual.

influenceIndexPlot(model4)

According to the diagnostic plots, it can be assumed that observations 6444 and 9573 stand out by Cook's distance, observations 2142 and 9573 by studentized residuals, and observations 2459 and 3995 by hat-values.

influencePlot(model4, col = "red")

##        StudRes          Hat        CookD
## 2142  2.964319 0.0001328598 0.0015093895
## 2459 -1.510892 0.0038653080 0.0011787911
## 3995 -1.251423 0.0034397229 0.0005856426
## 6444 -1.802326 0.0028927740 0.0016811639
## 9573  2.953355 0.0001580015 0.0017356948

According to the influence plot, observations 2142, 2459, 3995, 6444, and 9573 can be treated as potentially influential.

vif(model4)
##                    GVIF Df GVIF^(1/(2*Df))
## isZeroBalance  1.255101  1        1.120313
## IsActiveMember 1.076039  1        1.037323
## Age            1.080528  1        1.039484
## Geography      1.256073  2        1.058653
## Gender         1.002214  1        1.001106

All values are close to 1, so there is no sign of multicollinearity.

8. Model Editing

model4fix <- update(model4, subset = c(-2142,-2459,-3995,-6444,-9573))
compareCoefs(model4, model4fix)
## Calls:
## 1: glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + Geography 
##   + Gender, family = "binomial", data = ds1)
## 2: glm(formula = Exited ~ isZeroBalance + IsActiveMember + Age + Geography 
##   + Gender, family = "binomial", data = ds1, subset = c(-2142, -2459, -3995, 
##   -6444, -9573))
## 
##                  Model 1 Model 2
## (Intercept)       -3.693  -3.729
## SE                 0.119   0.120
##                                 
## isZeroBalance1   -0.3894 -0.3935
## SE                0.0675  0.0676
##                                 
## IsActiveMember1  -1.0784 -1.0838
## SE                0.0576  0.0577
##                                 
## Age              0.07275 0.07369
## SE               0.00257 0.00259
##                                 
## GeographyGermany  0.7391  0.7389
## SE                0.0685  0.0686
##                                 
## GeographySpain    0.0337  0.0345
## SE                0.0706  0.0707
##                                 
## GenderMale       -0.5241 -0.5247
## SE                0.0544  0.0545
## 

No coefficient changes substantially, and none changes its significance level.

There are no changes in sign and only negligible changes in magnitude, so there is no need to exclude these observations from model4.

9. Results

So, regarding the hypotheses:

  1. People with a higher balance in their accounts are less likely to stop using the bank's services compared to those with lower balances - FALSE. The raw Balance variable is a worse predictor than isZeroBalance (which splits clients into two groups by balance), and isZeroBalance shows that clients with a non-zero balance are in fact more likely to churn.

  2. Clients who are active members are less likely to stop using the bank's services - TRUE. Active members are less likely to churn.

  3. Clients who have a credit card are less likely to stop using the bank's services - FALSE. Having or not having a credit card does not appear to affect a client's likelihood of stopping to use the bank's services.

Age, Geography, and Gender are also significant predictors of churn.

RQ: Which categories of client data influence whether a bank client churns?
The answer: Age, Geography, Gender, isZeroBalance, and IsActiveMember.