A bank is worried about customer turnover. It believes that a person’s credit score is a factor in the probability that a customer stays with the bank. This analysis fits a logistic regression to determine whether credit score is related to that probability. Furthermore, it determines how significant that variable is.
Listed below is a table of the data:
datatable(Churn_Modelling, options=list(lengthMenu = c(5, 10, 100)))
The mathematical model for this logistic regression is as follows:
\(P(Y_i = 1|\, x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}} = \pi_i\)
where \(x_i\) is the credit score of customer \(i\) and \(P(Y_i = 1|\, x_i)\) is the probability that the customer leaves the bank (\(Y_i = 1\) when the customer left). When \(\beta_1\) equals zero, credit score gives no insight into whether a person leaves the bank.
Formally, the null and alternative hypotheses are as follows:
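The role of \(\beta_1\) is easier to see on the log-odds scale. Solving the model equation for the linear part gives the equivalent form below; this is a standard rearrangement rather than something stated in the original analysis:
\(\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i\)
If \(\beta_1 = 0\), the log-odds (and therefore \(\pi_i\)) are the same for every credit score.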
\(H_0: \beta_1 = 0\) \(H_a: \beta_1 \neq 0\)
In this analysis, \(\alpha\) is set at 0.05.
Listed below are the estimates of \(\beta_0\) and \(\beta_1\) based on this data. These values describe the logistic regression curve that best fits the data. Each estimate is shown with its p-value. We will use the p-value associated with credit score to determine whether there is evidence that credit score is a significant variable.
| | Estimate | Std. Error | z value | p value |
|---|---|---|---|---|
| (Intercept) | -0.9122 | 0.1679 | -5.432 | 5.578e-08 |
| Credit Score | -0.0006956 | 0.0002568 | -2.708 | 0.006762 |
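A table like this is what `glm()` reports for a binomial model. The sketch below shows one way such a fit could be produced; it assumes that `Churn_Modelling` has a 0/1 column named `Exited` and a numeric column named `CreditScore`, neither of which is confirmed by the text, and `fit_credit_score_model` is a hypothetical helper name.

```r
# Hypothetical helper: fit a simple logistic regression of churn on credit score.
# Assumed column names (not confirmed by the text): Exited (0/1), CreditScore.
fit_credit_score_model <- function(data) {
  glm(Exited ~ CreditScore, data = data, family = binomial)
}

# Only run the fit if the data frame from the analysis is actually loaded.
if (exists("Churn_Modelling")) {
  churn_glm <- fit_credit_score_model(Churn_Modelling)
  # Columns: Estimate, Std. Error, z value, Pr(>|z|)
  print(coef(summary(churn_glm)))
} else {
  message("Churn_Modelling is not loaded; skipping the fit.")
}
```

The `z value` column is the estimate divided by its standard error, and the p-value is the two-sided normal tail probability for that ratio.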
Based on this data, we believe the best-fitting logistic regression model is as follows:
\(P(Y_i = 1|x_i) \approx \frac{e^{-0.9122-0.0006956 x_i}}{1+e^{-0.9122-0.0006956 x_i}} = \hat{\pi}_i\)
The p-value for credit score is 0.006762. Since that is smaller than our \(\alpha\) (0.05), we reject the null hypothesis. In other words, there is sufficient evidence that credit score is related to how likely someone is to leave their bank.
As shown below, this is a visualization of our mathematical model:
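A curve of this kind can be drawn directly from the reported coefficients. The sketch below uses base R graphics over the observed credit score range of 350 to 850; the choice of plotting function and axis labels is an assumption, not necessarily how the original figure was made.

```r
# Coefficients as reported in the results table
b0 <- -0.9122
b1 <- -0.0006956

# plogis(z) computes exp(z) / (1 + exp(z)), the logistic function
curve(plogis(b0 + b1 * x), from = 350, to = 850,
      xlab = "Credit score",
      ylab = "Estimated probability of leaving",
      ylim = c(0, 1))
```

Plotted on the full 0 to 1 probability scale, the curve is nearly flat across this range, since the slope coefficient is small.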
We also need to determine whether a logistic regression is a good fit for this data. To do this we conduct a deviance goodness-of-fit test, which compares the residual deviance to a chi-squared distribution. Our model has a residual deviance of 10102 on 9998 degrees of freedom, which gives a p-value of 0.2303665. Since the p-value is greater than 0.05, we fail to reject the hypothesis that the model fits the data; that is, there is no evidence of lack of fit.
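The p-value for this test is the upper-tail probability of a chi-squared distribution evaluated at the residual deviance. A minimal sketch using the two numbers quoted above (the variable names are illustrative):

```r
# Residual deviance and degrees of freedom as reported in the text
resid_dev <- 10102
resid_df  <- 9998

# Upper-tail probability: P(chi-squared with resid_df df > resid_dev)
p_value <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
print(p_value)
```

With a fitted `glm` object, the same two inputs are available from `deviance()` and `df.residual()`.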
There is an interesting point worth discussing. Recall that \(b_1\) equals -0.0006956, so \(e^{b_1} = e^{-0.0006956} \approx 0.999305\). That means for every 1-point increase in credit score, the odds of a customer leaving change by a factor of 0.999305. While a number like that is open to interpretation, I wouldn’t consider it a practically meaningful change. To put things in perspective, the highest and lowest credit scores in this data set were 850 and 350 respectively. The model estimates that someone with a credit score of 850 has a \(\approx\) 18.2% chance of leaving, while someone with 350 has a \(\approx\) 23.9% chance of leaving.
If we really wanted to predict how likely someone is to leave a bank, a multiple logistic regression might be more appropriate. A future analysis should consider including the balance in the customer’s bank account, whether they are an active member, and the estimated salary of the individual as additional explanatory variables.
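The odds ratio and the two probabilities quoted above can be checked from the reported coefficients alone. The sketch below also computes the odds ratio for a 100-point increase, which is not in the original analysis and is included only to give the per-point factor a more tangible scale.

```r
# Coefficients as reported in the results table
b0 <- -0.9122
b1 <- -0.0006956

odds_ratio_1   <- exp(b1)        # odds multiplier per 1-point increase
odds_ratio_100 <- exp(100 * b1)  # odds multiplier per 100-point increase

# plogis(z) = exp(z) / (1 + exp(z)), the fitted probability of leaving
p_850 <- plogis(b0 + b1 * 850)
p_350 <- plogis(b0 + b1 * 350)

print(round(c(or_1 = odds_ratio_1, or_100 = odds_ratio_100,
              p_850 = p_850, p_350 = p_350), 4))
```

A 100-point increase multiplies the odds of leaving by roughly 0.93, which is still a modest effect.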