STA321: Week #08 Assignment

Data Description

The data set used is called “Loan Default Data” and is taken from the book “Applied Analytics through Case Studies Using SAS and R” by Deepti Gupta (Apress, ISBN 978-1-4842-3525-6). Where it was posted, it was described as “…a subset of a large o data set. The structure of the data set is simple. It can be used for logistic and binary classification / predictive models and algorithms”. The data set contains 1000 observations and 16 variables. Saving_amount is explored as the response. The variables are:
- Saving_amount\((y)\): the amount of savings the borrower has.
- Credit_score\((x_1)\): the credit score of the borrower.
- Checking_amount\((x_2)\): the amount in the borrower's checking account.
- Age\((x_3)\): the age of the borrower.
- Gender\((x_4)\): the gender of the borrower.
- Marital_status\((x_5)\): the marital status of the borrower.
- Default\((x_6)\): whether the borrower defaulted on the loan.
- Term\((x_7)\): the term of the loan.
- Car_loan\((x_8)\): whether the borrower has a car loan.
- Personal_loan\((x_9)\): whether the borrower has a personal loan.
- Home_loan\((x_{10})\): whether the borrower has a home loan.
- Education_loan\((x_{11})\): whether the borrower has an education loan.
- Emp_status\((x_{12})\): the employment status of the borrower.
- Amount\((x_{13})\): the loan amount.
- Emp_duration\((x_{14})\): how long the borrower has been employed.
- No_of_credit_acc\((x_{15})\): the number of credit accounts the borrower holds.

Research Question

This research has several objectives. Chief among them is building a logistic regression model to predict whether a borrower has a savings amount greater than or equal to the mean savings amount across all borrowers.

Exploratory Data Analysis

Exploratory data analysis was conducted. Saving_amount is originally a continuous variable, but logistic regression requires a dichotomous response. It was therefore dichotomized at its mean, with \(Y=1\) assigned to Saving_amount values >= mean(Saving_amount) and \(Y=0\) assigned to all other values. The dichotomized variable was named SA.g. Next, the numeric explanatory variables were z-score normalized and the categorical variables were set as factors. Lastly, pairwise correlations among the numeric explanatory variables were examined. No strong correlations were found: the largest in absolute value is about 0.33, between Credit_score and Age.

                Credit_score Checking_amount         Age        Term      Amount Emp_duration
Credit_score      1.00000000      0.18929567  0.32807535 -0.19543628 -0.07839842   0.06762275
Checking_amount   0.18929567      1.00000000  0.29741086 -0.19162919 -0.11533011   0.06980798
Age               0.32807535      0.29741086  1.00000000 -0.24438528 -0.10776977   0.07980933
Term             -0.19543628     -0.19162919 -0.24438528  1.00000000  0.05407017  -0.06373561
Amount           -0.07839842     -0.11533011 -0.10776977  0.05407017  1.00000000   0.01793938
Emp_duration      0.06762275      0.06980798  0.07980933 -0.06373561  0.01793938   1.00000000

Note that the response variable is a binary factor variable stored as an integer, with 1 indicating Saving_amount >= mean(Saving_amount), so \(P(Y=1)=P(SA.g=1)\). Next, multiple logistic regression was fit to obtain the models used for prediction, and the data were then split for k-fold cross-validation.
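The two preprocessing steps described in this section, dichotomizing at the mean and z-score normalization, can be sketched as follows. This is a minimal illustration in Python rather than the R used for the analysis; the function names and the sample values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption about how the normalization was done.

```python
from statistics import mean, stdev


def dichotomize_at_mean(values):
    """Return 1 where a value is >= the mean of all values, else 0."""
    m = mean(values)
    return [1 if v >= m else 0 for v in values]


def z_score(values):
    """Centre on the mean and scale by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: n - 1 denominator
    return [(v - m) / s for v in values]


# Hypothetical values, for illustration only (not taken from the data set)
saving_amount = [2800.0, 3300.0, 3100.0, 3600.0, 3200.0]
age = [25, 31, 28, 40, 36]

sa_g = dichotomize_at_mean(saving_amount)  # plays the role of SA.g
age_z = z_score(age)

print(sa_g)
```

Because the criterion is >=, a borrower whose savings equal the mean exactly is coded as 1.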

Data Split

Previously, three models were built to explore the association between Saving_amount and its explanatory variables. After an MLR was fit for each, goodness-of-fit measures were used to choose among them. Those models are not reused here. Instead, different models were fit and compared using k-fold cross-validation. For each of the five folds, every model was fit and its predictive error and accuracy were stored. The averages of the predictive error and accuracy over these five iterations were then output to tables.

The following table gives the number of observations in each non-overlapping partition used in the k-fold cross-validation procedure.

folds
  1   2   3   4   5 
205 181 214 207 193
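The fold sizes above are unequal but sum to 1000, which is what one would expect if each observation were given a fold label independently at random. The sketch below shows one such assignment in Python; it is an assumption about the mechanism, not the procedure actually used, and the function names and the seed are hypothetical.

```python
import random
from collections import Counter


def assign_folds(n_obs, k, seed):
    """Give each observation a fold label in 1..k, chosen uniformly at random."""
    rng = random.Random(seed)
    return [rng.randint(1, k) for _ in range(n_obs)]


def split_indices(folds, held_out):
    """Return (training indices, test indices) when one fold is held out."""
    test = [i for i, f in enumerate(folds) if f == held_out]
    train = [i for i, f in enumerate(folds) if f != held_out]
    return train, test


folds = assign_folds(n_obs=1000, k=5, seed=321)  # seed is arbitrary
fold_sizes = Counter(folds)

for label in sorted(fold_sizes):
    print(label, fold_sizes[label])

train_idx, test_idx = split_indices(folds, held_out=1)
```

Each observation lands in exactly one fold, so the partitions do not overlap and every observation is used for testing exactly once.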

Model Selection: Multiple Logistic Regression

Multiple logistic regression was conducted to produce the models used for prediction. Interpretations of these fits, and their output, are omitted. The full model is fit using all variables in the data set. The reduced model is fit using only those variables that were found to be statistically significant in the full model. Lastly, the auto model is fit using an automatic variable selection procedure, which chooses as its final model the candidate with the lowest AIC.
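For reference, the AIC that the automatic procedure minimizes has the standard definition below, where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the candidate model. The symbols \(k\) and \(\hat{L}\) are introduced here for illustration and are unrelated to the k in k-fold cross-validation.

```latex
\[
\mathrm{AIC} = 2k - 2\ln\hat{L}
\]
```

A lower AIC is preferred: the \(2k\) term penalizes additional parameters, so a variable is kept only if it improves the likelihood enough to offset that penalty.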

Models

The theoretical forms of the models used in the predictive analysis follow.

model full: \(\log\left(\frac{E[SA.g]}{1-E[SA.g]}\right)\) = \(\beta_0\) + \(\beta_1*\)(Credit_score) + \(\beta_2*\)(Checking_amount) + \(\beta_3*\)(Age) + \(\beta_4*\)(Gender) + \(\beta_5*\)(Marital_status) + \(\beta_6*\)(Default) + \(\beta_7*\)(Term) + \(\beta_8*\)(Car_loan) + \(\beta_9*\)(Personal_loan) + \(\beta_{10}*\)(Home_loan) + \(\beta_{11}*\)(Education_loan) + \(\beta_{12}*\)(Emp_status) + \(\beta_{13}*\)(Amount) + \(\beta_{14}*\)(Emp_duration) + \(\beta_{15}*\)(No_of_credit_acc)

model reduced: \(\log\left(\frac{E[SA.g]}{1-E[SA.g]}\right)\) = \(\beta_0\) + \(\beta_1*\)(Default) + \(\beta_2*\)(Amount) + \(\beta_3*\)(No_of_credit_acc)

model auto: varies based on the results of the stepAIC() procedure.
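Under the standard logit link, the linear predictor is the log-odds of \(SA.g=1\), and it maps back to a probability as shown below. The symbols \(p\) and \(\eta\) are shorthand introduced here: \(p = E[SA.g] = P(SA.g=1)\), and \(\eta\) is the right-hand side of a model equation.

```latex
\[
\log\left(\frac{p}{1-p}\right) = \eta
\quad\Longleftrightarrow\quad
p = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}}
\]
```

In particular, \(p \ge 0.5\) exactly when \(\eta \ge 0\), so a probability cut-off of 0.5 corresponds to a cut-off of 0 on the log-odds scale.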

k-Fold Cross Validation

Tabular output from the 5-fold cross-validation procedure follows. The first table contains the average predictive error of each model, where PE1, PE2, and PE3 represent the full, reduced, and auto models respectively. ACC1, ACC2, and ACC3 give the average accuracy of each model in the same order. The cut-off probability used to dichotomize each model's predictions was 0.5. The final model is the one with the lowest average predictive error. In the tables below that is the auto model (PE3 = 0.3092), with the reduced model a close second (PE2 = 0.3106).

Average prediction errors of candidate models
PE1	PE2	PE3
0.3188	0.3106	0.3092

Average prediction accuracy of candidate models
ACC1	ACC2	ACC3
0.6812	0.6894	0.6908
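In both tables each accuracy equals one minus the corresponding predictive error (for example, 0.3188 + 0.6812 = 1), which is consistent with predictive error being the misclassification rate. The sketch below shows how such averages could be computed in Python. The fold data and function names are hypothetical, and classifying a probability strictly greater than the cut-off as 1 is an assumption.

```python
def classify(probabilities, cutoff=0.5):
    """Turn predicted probabilities into 0/1 labels (assumption: p > cutoff -> 1)."""
    return [1 if p > cutoff else 0 for p in probabilities]


def prediction_error(actual, predicted):
    """Proportion of observations whose predicted label differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical held-out folds: (observed labels, predicted probabilities)
held_out_folds = [
    ([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.6]),
    ([0, 0, 1, 1], [0.2, 0.7, 0.9, 0.55]),
]

errors = [prediction_error(y, classify(p)) for y, p in held_out_folds]
avg_pe = sum(errors) / len(errors)  # average over folds
avg_acc = 1 - avg_pe                # accuracy is the complement of the error

print(avg_pe, avg_acc)
```

Passing a different `cutoff` to `classify` is all that is needed to study how the cut-off level changes the predictive error.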

Summary

This research had several objectives. Chief among them was building a logistic regression model to predict whether a borrower has a savings amount greater than or equal to the mean savings amount across all borrowers. Exploratory data analysis was conducted first: the response variable was dichotomized, the numeric explanatory variables were normalized, and the categorical explanatory variables were converted to factors. Multicollinearity among the numeric explanatory variables was also examined. Three models from a previous analysis were then set aside in favor of three models fit with different sets of variables. These three models were compared using 5-fold cross-validation, and the model with the lowest average predictive error was chosen as the predictive model for this analysis. Further exploration is recommended, in particular of the effect of different probability cut-off levels on the predictive error of the models.