This research has several objectives, chief among them exploring the association between Saving_amount and the explanatory variables in the model. The research question is therefore: is the probability that a borrower has a savings amount greater than or equal to the mean savings amount for borrowers related to the explanatory variables in the model?
Exploratory data analysis was conducted first. Saving_amount is a continuous variable, but multiple logistic regression requires a dichotomous response. It was therefore dichotomized at its mean, with \(Y=1\) assigned to observations with Saving_amount >= mean(Saving_amount) and \(Y=0\) assigned to all other observations. The dichotomized variable was named SA.g. Afterwards, the numeric explanatory variables were z-score normalized and the categorical variables were set as factors. Lastly, pairwise correlations among the numeric explanatory variables were examined. There seemed to be no major correlations between them; the largest was about 0.33, between Credit_score and Age.
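The preprocessing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the values are made up (they are not the study data), the helper names are hypothetical, and the z-score is assumed to use the sample standard deviation.

```python
from statistics import mean, stdev

def dichotomize_at_mean(values):
    """Return 1 where a value is >= the mean of all values, else 0."""
    m = mean(values)
    return [1 if v >= m else 0 for v in values]

def z_score(values):
    """Centre on the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical values for illustration only, not taken from the study data.
saving_amount = [1200.0, 3400.0, 2100.0, 5000.0, 2800.0]
credit_score = [610.0, 720.0, 680.0, 790.0, 650.0]

sa_g = dichotomize_at_mean(saving_amount)   # binary response, 1 = at or above the mean
credit_score_z = z_score(credit_score)      # mean 0, standard deviation 1
```

After this step, a one-unit change in a normalized predictor corresponds to a change of one standard deviation in the original variable, which matters when the odds ratios are interpreted later.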
| | Checking_amount | Credit_score | Age |
|---|---|---|---|
| Checking_amount | 1.0000000 | 0.1892957 | 0.2974109 |
| Credit_score | 0.1892957 | 1.0000000 | 0.3280754 |
| Age | 0.2974109 | 0.3280754 | 1.0000000 |
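As a rough illustration of how such a matrix is read, the sketch below defines a plain Pearson correlation and scans the reported off-diagonal values for the strongest pair. The function and variable names are hypothetical; only the three coefficients are copied from the matrix above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Off-diagonal correlations as reported in the matrix above.
reported = {
    ("Checking_amount", "Credit_score"): 0.1892957,
    ("Checking_amount", "Age"): 0.2974109,
    ("Credit_score", "Age"): 0.3280754,
}

# The pair with the largest absolute correlation.
strongest_pair = max(reported, key=lambda pair: abs(reported[pair]))
strongest_r = reported[strongest_pair]
```

The strongest reported pair is Credit_score and Age, at roughly 0.33, which is consistent with the conclusion that no pair of numeric predictors is strongly correlated.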
Note that the response variable is a binary factor that was stored as an integer, with 1 corresponding to Saving_amount >= mean(Saving_amount), so \(P(Y=1)=P(SA.g=1)\). Next, the multiple logistic regression is fitted.
Multiple logistic regression (MLR) was conducted. Interpretation of the results is withheld until after a model selection procedure has been carried out. The final model will be the one with the lowest AIC. First, the full model was investigated. Then a reduced model, based on the results of that MLR, and a model created through automatic variable selection were investigated. The summaries of their relevant statistics follow. The trace of the automatic variable selection procedure, which itself settles on the model with the lowest AIC, is shown before the goodness-of-fit measures are compared. Lastly, the final model chosen on the basis of these measures is interpreted. To clarify, the initial model explored for this research is \(\operatorname{logit}P(SA.g=1)\) = \(\beta_0\) + \(\beta_1*\)(Credit_score) + \(\beta_2*\)(Checking_amount) + \(\beta_3*(Age)\) + \(\beta_4*(Gender)\) + \(\beta_5*(Marital_status)\). The model is written on the log-odds scale because logistic regression models the probability of the response and has no additive error term.
| | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % | odds.ratio |
|---|---|---|---|---|---|---|---|
| (Intercept) | -0.0481053 | 0.2293374 | -0.2097576 | 0.8338569 | -0.4991149 | 0.4006800 | 0.9530335 |
| Credit_score | 0.1745227 | 0.0712296 | 2.4501438 | 0.0142799 | 0.0358639 | 0.3154724 | 1.1906777 |
| Checking_amount | 0.1745909 | 0.0695744 | 2.5094125 | 0.0120932 | 0.0387253 | 0.3117481 | 1.1907590 |
| Age | 0.4041335 | 0.0739754 | 5.4630834 | 0.0000000 | 0.2605115 | 0.5507527 | 1.4980040 |
| GenderMale | 0.2520317 | 0.2114313 | 1.1920265 | 0.2332508 | -0.1615128 | 0.6682551 | 1.2866368 |
| Marital_statusSingle | 0.0012250 | 0.1963517 | 0.0062390 | 0.9950221 | -0.3827033 | 0.3880781 | 1.0012258 |
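The z value, p-value, and odds ratio columns can be reproduced from the Estimate and Std. Error columns. The sketch below does this for the Credit_score row of the full model; the helper name is hypothetical. The 2.5 % and 97.5 % columns are not recomputed here, since they do not match a simple estimate plus or minus 1.96 standard errors and were presumably obtained by another method.

```python
from math import erfc, exp, sqrt

def wald_summary(estimate, std_error):
    """Return (z, two-sided p-value, odds ratio) for one logistic coefficient."""
    z = estimate / std_error
    p = erfc(abs(z) / sqrt(2))      # two-sided p-value under the standard normal
    odds_ratio = exp(estimate)      # exponentiated log-odds coefficient
    return z, p, odds_ratio

# Credit_score row of the full model, copied from the table above.
z, p, odds_ratio = wald_summary(0.1745227, 0.0712296)
```

These values agree with the table (z of about 2.45, p of about 0.014, odds ratio of about 1.19) up to rounding.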
| | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % | odds.ratio |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0.1264800 | 0.0658286 | 1.921353 | 0.0546872 | -0.0024301 | 0.2557216 | 1.134827 |
| Credit_score | 0.1804640 | 0.0710908 | 2.538500 | 0.0111329 | 0.0421039 | 0.3211636 | 1.197773 |
| Checking_amount | 0.1751612 | 0.0695342 | 2.519067 | 0.0117666 | 0.0393891 | 0.3122545 | 1.191438 |
| Age | 0.4051681 | 0.0738783 | 5.484264 | 0.0000000 | 0.2617485 | 0.5516090 | 1.499555 |
Start: AIC=1316.59
SA.g ~ Credit_score + Checking_amount + Age + Gender + Marital_status
Df Deviance AIC
- Marital_status 1 1304.6 1314.6
- Gender 1 1306.0 1316.0
<none> 1304.6 1316.6
- Credit_score 1 1310.7 1320.7
- Checking_amount 1 1310.9 1320.9
- Age 1 1335.8 1345.8
Step: AIC=1314.59
SA.g ~ Credit_score + Checking_amount + Age + Gender
Df Deviance AIC
<none> 1304.6 1314.6
- Gender 1 1307.7 1315.7
+ Marital_status 1 1304.6 1316.6
- Credit_score 1 1310.7 1318.7
- Checking_amount 1 1310.9 1318.9
- Age 1 1335.8 1343.8
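The selection trace above can be read as a simple rule: at each step, take the move with the lowest AIC, and stop when keeping the current model (`<none>`) is the best option. The sketch below applies that rule to the AIC values printed in the trace. It is only a reading aid with a hypothetical helper name, not the selection routine that produced the output.

```python
def next_move(candidates):
    """Return the move with the lowest AIC; '<none>' means keep the current model."""
    return min(candidates, key=candidates.get)

# AIC values copied from the first step of the trace.
step_1 = {
    "- Marital_status": 1314.6,
    "- Gender": 1316.0,
    "<none>": 1316.6,
    "- Credit_score": 1320.7,
    "- Checking_amount": 1320.9,
    "- Age": 1345.8,
}

# AIC values copied from the second step of the trace.
step_2 = {
    "<none>": 1314.6,
    "- Gender": 1315.7,
    "+ Marital_status": 1316.6,
    "- Credit_score": 1318.7,
    "- Checking_amount": 1318.9,
    "- Age": 1343.8,
}

first_move = next_move(step_1)    # drop Marital_status
second_move = next_move(step_2)   # no move improves AIC, so the procedure stops
```

This also shows why Gender stays in the selected model: dropping it would raise the AIC from 1314.6 to 1315.7.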
| | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % | odds.ratio |
|---|---|---|---|---|---|---|---|
| (Intercept) | -0.0468802 | 0.1184775 | -0.3956882 | 0.6923351 | -0.2795137 | 0.1853996 | 0.9542017 |
| Credit_score | 0.1745289 | 0.0712223 | 2.4504831 | 0.0142665 | 0.0358867 | 0.3154664 | 1.1906852 |
| Checking_amount | 0.1745906 | 0.0695744 | 2.5094103 | 0.0120933 | 0.0387251 | 0.3117477 | 1.1907586 |
| Age | 0.4041213 | 0.0739493 | 5.4648465 | 0.0000000 | 0.2605487 | 0.5506876 | 1.4979857 |
| GenderMale | 0.2510583 | 0.1426943 | 1.7594135 | 0.0785073 | -0.0286054 | 0.5310924 | 1.2853850 |
| | null.dev | resid.dev | aic |
|---|---|---|---|
| full | 1382.448 | 1304.590 | 1316.590 |
| reduced | 1382.448 | 1307.686 | 1315.686 |
| auto | 1382.448 | 1304.590 | 1314.590 |
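For a binary response, the AIC values in this table follow from the residual deviance plus twice the number of estimated coefficients (intercept included). The sketch below checks this for the three models. The coefficient counts are read off the coefficient tables above, and the helper name is hypothetical.

```python
def aic_from_deviance(residual_deviance, n_coefficients):
    """AIC for a binary logistic model: residual deviance + 2 * number of coefficients."""
    return residual_deviance + 2 * n_coefficients

# (residual deviance, number of coefficients including the intercept)
models = {
    "full": (1304.590, 6),      # five predictors + intercept
    "reduced": (1307.686, 4),   # three predictors + intercept
    "auto": (1304.590, 5),      # four predictors + intercept
}

aic = {name: aic_from_deviance(dev, k) for name, (dev, k) in models.items()}
best = min(aic, key=aic.get)    # model with the lowest AIC
```

The full and auto models have the same residual deviance to three decimals, so the auto model wins purely because it estimates one fewer coefficient.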
As the comparison shows, the automatic variable selection procedure produced the model with the lowest AIC (1314.590). This model was chosen as the final model.
The final model for this analysis is \(\operatorname{logit}P(SA.g=1)\) = -0.047 + 0.175(Credit_score) + 0.175(Checking_amount) + 0.404(Age) + 0.251(GenderMale). Based on the output from that model, the association with SA.g may be statistically significant for Credit_score (z=2.4505, p <.05, 95% CI=[0.0359, 0.3155]), for Checking_amount (z=2.5094, p <.05, 95% CI=[0.0387, 0.3117]), and for Age (z=5.4648, p <.05, 95% CI=[0.2605, 0.5507]). Gender does not reach significance at the .05 level (z=1.7594, p=.0785, 95% CI=[-0.0286, 0.5311]).

The odds ratio associated with Credit_score is 1.1907, the odds ratio associated with Checking_amount is 1.1908, and the odds ratio associated with Age is 1.4980. Because the numeric explanatory variables were z-score normalized, each odds ratio describes a one-standard-deviation increase in the original variable (one raw unit is about 0.0129 standard deviations for Credit_score, about 0.0033 for Checking_amount, and about 0.2443 for Age). This may mean that, holding the other variables in the model constant:

- as Credit_score increases by one standard deviation, the odds of a borrower having a savings amount greater than or equal to the mean savings amount for borrowers are multiplied by about 1.19;
- as Checking_amount increases by one standard deviation, those odds are multiplied by about 1.19;
- as Age increases by one standard deviation, those odds are multiplied by about 1.50.
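To make the final model concrete, the sketch below turns the fitted coefficients into a predicted probability with the inverse logit. It rests on assumptions that the text does not state explicitly: the numeric inputs are on the z-score scale, GenderMale is an indicator equal to 1 for male borrowers and 0 otherwise, and the two example borrowers are hypothetical.

```python
from math import exp

# Coefficients of the final (auto) model, copied from its coefficient table.
COEF = {
    "intercept": -0.0468802,
    "Credit_score": 0.1745289,
    "Checking_amount": 0.1745906,
    "Age": 0.4041213,
    "GenderMale": 0.2510583,
}

def predict_probability(credit_score_z, checking_amount_z, age_z, male):
    """P(SA.g = 1) for standardized numeric inputs and a 0/1 male indicator."""
    log_odds = (
        COEF["intercept"]
        + COEF["Credit_score"] * credit_score_z
        + COEF["Checking_amount"] * checking_amount_z
        + COEF["Age"] * age_z
        + COEF["GenderMale"] * male
    )
    return 1.0 / (1.0 + exp(-log_odds))   # inverse logit

# Hypothetical borrowers: a female borrower at the mean of every numeric predictor,
# and a male borrower who is one standard deviation older than average.
p_baseline = predict_probability(0.0, 0.0, 0.0, 0)
p_older_male = predict_probability(0.0, 0.0, 1.0, 1)
```

Under these assumptions the baseline probability sits just below one half, and the older male borrower's probability is noticeably higher, in line with the positive Age and GenderMale coefficients.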
This analysis focused on the association between the savings amount of a borrower and a handful of explanatory variables: Credit_score, Checking_amount, Age, Gender, and Marital_status. Exploratory data analysis was conducted first. It dichotomized the response variable, normalized the numeric explanatory variables, and converted the categorical explanatory variables into factors. It also explored multicollinearity between the numeric explanatory variables. Next, an MLR was conducted on the full model. Based on the results of that analysis, some variables were removed and another MLR was conducted on the reduced model. An automatic variable selection procedure was then run, which produced a third model. The goodness-of-fit measures for these three models were compared, and the model with the lowest AIC was chosen as the final model. The results from that final model suggest that the association between Credit_score and SA.g may be statistically significant (z=2.4505, p <.05, 95% CI=[0.0359, 0.3155]). The association seemed to be positive, \(\beta_1=0.1745\). The association between Checking_amount and SA.g may also be statistically significant (z=2.5094, p <.05, 95% CI=[0.0387, 0.3117]), and it seemed to be positive too, \(\beta_2=0.1746\). Lastly, the association between Age and SA.g may be statistically significant (z=5.4648, p <.05, 95% CI=[0.2605, 0.5507]), and it also seemed to be positive, \(\beta_3=0.4041\). Although Gender was not found to be statistically significant, the automatic variable selection procedure kept it in the model. This may be because the AIC of the model with Gender (1314.59) was still lower than the AIC of the model without it (1315.69). Further research may be directed into why that was the case.