STA321: Week #07 Assignment

Data and Variable Descriptions

The data set used is called “Loan Default Data” and is taken from the book “Applied Analytics through Case Studies Using SAS and R, Deepti Gupta by APress, ISBN - 978-1-4842-3525-6”. Where it was posted, it was described as “…a subset of a large o data set. The structure of the data set is simple. It can be used for logistic and binary classification / predictive models and algorithms”. The data set contains 1000 observations and 16 variables. Saving_amount is explored as the response, and Credit_score, Checking_amount, Age, Gender, and Marital_status are used as the explanatory variables. These variables were chosen because they satisfy the minimum variable requirements for this research.
- Saving_amount\((y)\): the amount of savings a borrower has.
- Credit_score\((x_1)\): the credit score of a borrower.
- Checking_amount\((x_2)\): the amount a borrower has in checking.
- Age\((x_3)\): the age of a borrower.
- Gender\((x_4)\): the gender of a borrower.
- Marital_status\((x_5)\): the marital status of a borrower.
- Default
- Term
- Car_loan
- Personal_loan
- Home_loan
- Education_loan
- Emp_status
- Amount
- Emp_duration
- No_of_credit_acc

Research Question

This research has several objectives; chief among them is exploring the association between Saving_amount and the explanatory variables in the model. The research question is therefore: is the probability that a borrower has a savings amount greater than or equal to the mean savings amount for borrowers related to the explanatory variables in the model?

Exploratory Data Analysis

Exploratory data analysis is now conducted. Saving_amount is a continuous variable, but logistic regression requires a dichotomous response. It was therefore dichotomized at its mean, with \(Y=1\) assigned to Saving_amount values >= mean(Saving_amount) and \(Y=0\) assigned to all other values. The dichotomized response was named SA.g. Afterwards, the numeric explanatory variables were z-score normalized and the categorical variables were set as factors. Lastly, correlations among the numeric explanatory variables were investigated. All pairwise correlations are below 0.33, so there appear to be no major correlations between the explanatory variables.

                Checking_amount Credit_score       Age
Checking_amount       1.0000000    0.1892957 0.2974109
Credit_score          0.1892957    1.0000000 0.3280754
Age                   0.2974109    0.3280754 1.0000000
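
The preprocessing steps described before the correlation table can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code used for the analysis; the function names and the toy values are hypothetical, and the z-score here assumes the sample standard deviation (n - 1 denominator).

```python
from statistics import mean, stdev

def dichotomize_at_mean(values):
    """Return 1 where a value is >= the mean of all values, else 0."""
    m = mean(values)
    return [1 if v >= m else 0 for v in values]

def z_score(values):
    """Center by the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical toy values, NOT taken from the loan data set.
saving = [2900, 3100, 3300, 3500, 4200]
age = [25, 29, 31, 33, 38]

sa_g = dichotomize_at_mean(saving)  # binary response, analogous to SA.g
age_z = z_score(age)                # normalized predictor
r = pearson(saving, age)            # one entry of a correlation matrix
```

After normalization, a one-unit change in a predictor corresponds to one standard deviation on its original scale, which matters when the fitted coefficients are interpreted.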

Note that the response variable is binary and stored as an integer, with 1 indicating Saving_amount >= mean(Saving_amount), so \(P(Y=1)=P(SA.g=1)\). Next, the multiple logistic regression is fitted.

Multiple Logistic Regression Model

Multiple logistic regression was conducted. Interpretation of the results is withheld until after a model selection procedure is conducted. The final model is the one with the lowest AIC. First, the full model was investigated. Then a reduced model, based on the results of the full model, and a model created through automatic variable selection were investigated. A summary of their relevant statistics follows, with the trace of the automatic variable selection procedure shown ahead of its summary; that procedure also selects the model that minimizes AIC. The goodness-of-fit measures of the three models are then compared, and the final model chosen on the basis of these measures is interpreted. To clarify, the initial model explored for this research is logit(\(P(SA.g=1)\)) = \(\beta_0\) + \(\beta_1*\)(Credit_score) + \(\beta_2*\)(Checking_amount) + \(\beta_3*(Age)\) + \(\beta_4*(Gender)\) + \(\beta_5*(Marital_status)\).
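
Since logistic regression models the log-odds of the event rather than the response itself, the link between the linear predictor and the probability may be worth spelling out. As a sketch of the standard formulation, writing \(p = P(SA.g=1)\) and \(\eta\) for the linear predictor (both symbols are introduced here for illustration only), with \(x_4\) and \(x_5\) entering as 0/1 indicators:

\[
\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5,
\qquad
\log\left(\frac{p}{1-p}\right) = \eta,
\qquad
p = \frac{1}{1 + e^{-\eta}}
\]

Under this formulation, \(e^{\beta_j}\) is the odds ratio for a one-unit increase in \(x_j\) with the other variables held fixed, which is the quantity reported in the odds.ratio column of the tables that follow.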

MODEL FULL

Summary statistics: estimates, 95% CI, and odds ratios
	Estimate	Std. Error	z value	Pr(>\|z\|)	2.5 %	97.5 %	odds.ratio
(Intercept)	-0.0481053	0.2293374	-0.2097576	0.8338569	-0.4991149	0.4006800	0.9530335
Credit_score	0.1745227	0.0712296	2.4501438	0.0142799	0.0358639	0.3154724	1.1906777
Checking_amount	0.1745909	0.0695744	2.5094125	0.0120932	0.0387253	0.3117481	1.1907590
Age	0.4041335	0.0739754	5.4630834	0.0000000	0.2605115	0.5507527	1.4980040
GenderMale	0.2520317	0.2114313	1.1920265	0.2332508	-0.1615128	0.6682551	1.2866368
Marital_statusSingle	0.0012250	0.1963517	0.0062390	0.9950221	-0.3827033	0.3880781	1.0012258
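
Several columns of the table above follow directly from the estimate and its standard error. The sketch below, in Python with only the standard library, recomputes the z value, the two-sided p-value, and the odds ratio for the Credit_score row; the function name `wald_summary` is hypothetical. It also computes a Wald interval (estimate ± 1.96 standard errors) for comparison only.

```python
import math

def wald_summary(estimate, std_error, z_crit=1.959964):
    """Recompute z, two-sided normal p-value, odds ratio and a Wald 95% CI."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    half_width = z_crit * std_error
    return {
        "z": z,
        "p": p,
        "odds_ratio": math.exp(estimate),
        "wald_ci": (estimate - half_width, estimate + half_width),
    }

# Estimate and standard error copied from the Credit_score row of the full model.
credit = wald_summary(0.1745227, 0.0712296)
print(credit)
```

The Wald interval for Credit_score comes out at roughly [0.0349, 0.3141], which differs slightly from the [0.0359, 0.3155] printed in the table. That is consistent with the table using a profile-likelihood interval, although the report does not state which method produced it.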

MODEL REDUCED

Summary statistics: estimates, 95% CI, and odds ratios
	Estimate	Std. Error	z value	Pr(>\|z\|)	2.5 %	97.5 %	odds.ratio
(Intercept)	0.1264800	0.0658286	1.921353	0.0546872	-0.0024301	0.2557216	1.134827
Credit_score	0.1804640	0.0710908	2.538500	0.0111329	0.0421039	0.3211636	1.197773
Checking_amount	0.1751612	0.0695342	2.519067	0.0117666	0.0393891	0.3122545	1.191438
Age	0.4051681	0.0738783	5.484264	0.0000000	0.2617485	0.5516090	1.499555

MODEL AUTO

Start:  AIC=1316.59
SA.g ~ Credit_score + Checking_amount + Age + Gender + Marital_status

                  Df Deviance    AIC
- Marital_status   1   1304.6 1314.6
- Gender           1   1306.0 1316.0
<none>                 1304.6 1316.6
- Credit_score     1   1310.7 1320.7
- Checking_amount  1   1310.9 1320.9
- Age              1   1335.8 1345.8

Step:  AIC=1314.59
SA.g ~ Credit_score + Checking_amount + Age + Gender

                  Df Deviance    AIC
<none>                 1304.6 1314.6
- Gender           1   1307.7 1315.7
+ Marital_status   1   1304.6 1316.6
- Credit_score     1   1310.7 1318.7
- Checking_amount  1   1310.9 1318.9
- Age              1   1335.8 1343.8

Summary statistics: estimates, 95% CI, and odds ratios
	Estimate	Std. Error	z value	Pr(>\|z\|)	2.5 %	97.5 %	odds.ratio
(Intercept)	-0.0468802	0.1184775	-0.3956882	0.6923351	-0.2795137	0.1853996	0.9542017
Credit_score	0.1745289	0.0712223	2.4504831	0.0142665	0.0358867	0.3154664	1.1906852
Checking_amount	0.1745906	0.0695744	2.5094103	0.0120933	0.0387251	0.3117477	1.1907586
Age	0.4041213	0.0739493	5.4648465	0.0000000	0.2605487	0.5506876	1.4979857
GenderMale	0.2510583	0.1426943	1.7594135	0.0785073	-0.0286054	0.5310924	1.2853850
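
The selection trace at the start of this section follows a simple rule: at each step, take the move (dropping or adding one term, or doing nothing) with the lowest AIC, and stop when doing nothing is best. The Python sketch below only replays the AIC values printed in the trace; it does not refit any model, and the helper name `best_move` is hypothetical.

```python
def best_move(candidates):
    """Return the label of the candidate move with the smallest AIC."""
    return min(candidates, key=candidates.get)

# AIC values copied from the two tables of the selection trace.
step_1 = {
    "- Marital_status": 1314.6,
    "- Gender": 1316.0,
    "<none>": 1316.6,
    "- Credit_score": 1320.7,
    "- Checking_amount": 1320.9,
    "- Age": 1345.8,
}
step_2 = {
    "<none>": 1314.6,
    "- Gender": 1315.7,
    "+ Marital_status": 1316.6,
    "- Credit_score": 1318.7,
    "- Checking_amount": 1318.9,
    "- Age": 1343.8,
}

moves = []
for table in (step_1, step_2):
    move = best_move(table)
    moves.append(move)
    if move == "<none>":  # no change improves AIC, so the search stops
        break

print(moves)
```

Replaying the trace this way gives the sequence "drop Marital_status, then stop", which matches the model summarized in the table above.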

GOODNESS OF FIT

GOF for the models
	null.dev	resid.dev	aic
full	1382.448	1304.590	1316.590
reduced	1382.448	1307.686	1315.686
auto	1382.448	1304.590	1314.590

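As the table shows, the model produced by the automatic variable selection procedure has the lowest AIC (1314.590, compared with 1315.686 for the reduced model and 1316.590 for the full model). This model was chosen as the final model.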
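
The AIC column can be cross-checked against the residual deviance. For a logistic regression on ungrouped 0/1 responses, the residual deviance equals minus twice the log-likelihood, so AIC = residual deviance + 2k, where k is the number of estimated coefficients including the intercept. The Python sketch below applies that identity to the values in the table; the parameter counts are read off the coefficient tables, and the function name is hypothetical.

```python
def aic_from_deviance(resid_dev, n_params):
    """AIC for a binary (0/1) logistic model: deviance plus 2 per coefficient."""
    return resid_dev + 2 * n_params

# (residual deviance, number of coefficients including the intercept)
models = {
    "full": (1304.590, 6),
    "reduced": (1307.686, 4),
    "auto": (1304.590, 5),
}

aic = {name: aic_from_deviance(dev, k) for name, (dev, k) in models.items()}
best = min(aic, key=aic.get)  # model with the lowest AIC
print(aic, best)
```

This also shows why the automatic model beats the full model despite an almost identical deviance: it reaches the same fit with one fewer coefficient, so its penalty term is smaller by 2.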

Final Model

The final model for this analysis is: logit(\(P(SA.g=1)\)) = -0.047 + 0.175(Credit_score) + 0.175(Checking_amount) + 0.404(Age) + 0.251(GenderMale). Based on the output from that model, the association between Credit_score and SA.g appears to be statistically significant (z=2.4505, p <.05, 95% CI=[0.0359, 0.3155]), as do the association between Checking_amount and SA.g (z=2.5094, p <.05, 95% CI=[0.0387, 0.3117]) and the association between Age and SA.g (z=5.4648, p <.05, 95% CI=[0.2605, 0.5507]). The odds ratios associated with Credit_score, Checking_amount, and Age are 1.1907, 1.1908, and 1.4980, respectively.

Because the numeric explanatory variables were z-score normalized before fitting, each odds ratio refers to an increase of one standard deviation in the corresponding variable (one unit on the original scale is about 0.0129 standard deviations for Credit_score, 0.0033 for Checking_amount, and 0.2443 for Age). This may mean that, as Credit_score increases by one standard deviation, the odds of a borrower having a savings amount greater than or equal to the mean savings amount for borrowers are multiplied by about 1.19, holding Checking_amount, Age, and Gender constant. Similarly, as Checking_amount increases by one standard deviation, those odds may be multiplied by about 1.19, holding Credit_score, Age, and Gender constant. Lastly, as Age increases by one standard deviation, those odds may be multiplied by about 1.50, holding Credit_score, Checking_amount, and Gender constant.
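
To make the fitted equation more concrete, it can be turned into predicted probabilities. The Python sketch below uses the coefficients from the automatic-selection table; the function name `predict_prob` and the example borrowers are hypothetical, and the inputs are assumed to be z-scores, since the numeric predictors were normalized before fitting.

```python
import math

# Coefficients copied from the automatic-selection model table.
COEF = {
    "(Intercept)": -0.0468802,
    "Credit_score": 0.1745289,
    "Checking_amount": 0.1745906,
    "Age": 0.4041213,
    "GenderMale": 0.2510583,
}

def predict_prob(credit_z, checking_z, age_z, male):
    """Predicted P(SA.g = 1) for z-scored inputs and a male indicator."""
    eta = (COEF["(Intercept)"]
           + COEF["Credit_score"] * credit_z
           + COEF["Checking_amount"] * checking_z
           + COEF["Age"] * age_z
           + COEF["GenderMale"] * (1 if male else 0))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

# Hypothetical borrowers: average on every numeric predictor (all z-scores 0).
p_ref = predict_prob(0, 0, 0, male=False)    # reference gender level
p_male = predict_prob(0, 0, 0, male=True)
p_older = predict_prob(0, 0, 1, male=True)   # one SD older, otherwise the same

# The ratio of odds for a one-SD age increase reproduces the Age odds ratio.
age_or = odds(p_older) / odds(p_male)
print(p_ref, p_male, p_older, age_or)
```

The odds ratio is constant across borrowers, but the change in probability it implies depends on where the borrower starts, which is why the interpretation is stated in terms of odds.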

Summary

This analysis focused on the association between the savings amount of a borrower and a handful of explanatory variables: Credit_score, Checking_amount, Age, Gender, and Marital_status. First, exploratory data analysis was conducted. This dichotomized the response variable, normalized the numeric explanatory variables, and converted the categorical explanatory variables into factors. It also explored multicollinearity among the numeric explanatory variables. Next, an MLR was conducted. Based on the results of that analysis, some variables were removed and another MLR was conducted on this reduced model. An automatic variable selection procedure then produced a third model. The goodness-of-fit measures for these models were compared, and the final model was chosen on the basis of lowest AIC.

The results from the final model suggest that the association between Credit_score and SA.g may be statistically significant (z=2.4505, p <.05, 95% CI=[0.0359, 0.3155]) and positive, \(\beta_1=0.1745\). The association between Checking_amount and SA.g may also be statistically significant (z=2.5094, p <.05, 95% CI=[0.0387, 0.3117]) and positive, \(\beta_2=0.1746\). Lastly, the association between Age and SA.g may be statistically significant (z=5.4648, p <.05, 95% CI=[0.2605, 0.5507]) and positive, \(\beta_3=0.4041\). Although Gender was not found to be statistically significant (p=0.0785), the automatic variable selection procedure kept it in the model. This may be because the AIC of the model with Gender (1314.590) was still lower than that of the model without it (1315.686). Further research may be directed into why that was the case.