Exploring Predictive Models

TMRB



2024-02-18

West Chester University

Agenda

Introduction

Introduction: Three V’s

Introduction: Banking

Introduction: Loan default

Data Set Description



Variables Descriptions
Default \((y)\): 1 indicates bank loan default, 0 indicates no default
Checking_amount \((x_1)\): numeric
Term \((x_2)\): in months (numeric)
Credit_score \((x_3)\): numeric
Gender \((x_4)\): categorical
Marital_status \((x_5)\): categorical
Car_loan \((x_6)\): 1 = has a car loan, 0 = no car loan (coded numerically)
Personal_loan \((x_7)\): 1 = has a personal loan, 0 = no personal loan (coded numerically)
Home_loan \((x_8)\): 1 = has a home loan, 0 = no home loan (coded numerically)
Education_loan \((x_9)\): 1 = has an education loan, 0 = no education loan (coded numerically)
Emp_status \((x_{10})\): categorical
Amount \((x_{11})\): numeric
Saving_amount \((x_{12})\): numeric
Emp_duration \((x_{13})\): in months (numeric)
Age \((x_{14})\): in years (numeric)
No_of_credit_acc \((x_{15})\): numeric

Research Questions

The primary question for this analysis is how well the explored predictive models perform when predicting loan default.

Exploratory Data Analysis



Exploratory Data Analysis: Data Structure Table

'data.frame':   1000 obs. of  16 variables:
 $ Default         : int  0 0 0 1 1 0 0 0 0 1 ...
 $ Checking_amount : int  988 458 158 300 63 1071 -192 172 585 189 ...
 $ Term            : int  15 15 14 25 24 20 13 16 20 19 ...
 $ Credit_score    : int  796 813 756 737 662 828 856 763 778 649 ...
 $ Gender          : chr  "Female" "Female" "Female" "Female" ...
 $ Marital_status  : chr  "Single" "Single" "Single" "Single" ...
 $ Car_loan        : int  1 1 0 0 0 1 1 1 1 1 ...
 $ Personal_loan   : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Home_loan       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Education_loan  : int  0 0 0 1 1 0 0 0 0 0 ...
 $ Emp_status      : chr  "employed" "employed" "employed" "employed" ...
 $ Amount          : int  1536 947 1678 1804 1184 475 626 1224 1162 786 ...
 $ Saving_amount   : int  3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
 $ Emp_duration    : int  12 25 43 0 4 12 11 12 12 0 ...
 $ Age             : int  38 36 34 29 30 32 38 36 36 29 ...
 $ No_of_credit_acc: int  1 1 1 1 1 2 1 1 1 1 ...

Exploratory Data Analysis: Summary Statistics

Exploratory Data Analysis: Data

Data Summary

Summary statistics by variable
Variable          Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
Default           0.0      0.0      0.0      0.3      1.0      1.0
Checking_amount   -665.0   164.8    351.5    362.4    553.5    1319.0
Term              9.00     16.00    18.00    17.82    20.00    27.00
Credit_score      376.0    725.8    770.5    760.5    812.0    1029.0
Gender            (character, length 1000)
Marital_status    (character, length 1000)
Car_loan          0.000    0.000    0.000    0.353    1.000    1.000
Personal_loan     0.000    0.000    0.000    0.474    1.000    1.000
Home_loan         0.000    0.000    0.000    0.056    0.000    1.000
Education_loan    0.000    0.000    0.000    0.112    0.000    1.000
Emp_status        (character, length 1000)
Amount            244      1016     1226     1219     1420     2362
Saving_amount     2082     2951     3203     3179     3402     4108
Emp_duration      0.00     15.00    41.00    49.39    85.00    120.00
Age               18.00    29.00    32.00    31.21    34.00    42.00
No_of_credit_acc  1.000    1.000    2.000    2.546    3.000    9.000



FREQ Table: No_of_credit_acc

Frequency counts for No_of_credit_acc
No_of_credit_acc  Freq
1                 308
2                 325
3                 119
4                 105
5                 109
6                 6
7                 8
8                 6
9                 14

Exploratory Data Analysis: Response Variable

Exploratory Data Analysis: Response Variable Proportion Table

Proportions of Default
Default  Proportion
0        0.7
1        0.3

Exploratory Data Analysis: Correlations

Exploratory Data Analysis: Correlations Matrix

Correlation matrix
Variable         Checking_amount Term            Credit_score    Amount          Saving_amount   Emp_duration    Age
Checking_amount   1.0000000      -0.1916292       0.1892957      -0.1153301       0.2013942       0.0698080       0.2974109
Term             -0.1916292       1.0000000      -0.1954363       0.0540702      -0.1868427      -0.0637356      -0.2443853
Credit_score      0.1892957      -0.1954363       1.0000000      -0.0783984       0.2138242       0.0676228       0.3280754
Amount           -0.1153301       0.0540702      -0.0783984       1.0000000      -0.0097196       0.0179394      -0.1077698
Saving_amount     0.2013942      -0.1868427       0.2138242      -0.0097196       1.0000000       0.0909485       0.3430830
Emp_duration      0.0698080      -0.0637356       0.0676228       0.0179394       0.0909485       1.0000000       0.0798093
Age               0.2974109      -0.2443853       0.3280754      -0.1077698       0.3430830       0.0798093       1.0000000
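
Each entry in the matrix is a Pearson correlation between two numeric variables. The printouts in this deck appear to come from R; the sketch below uses Python purely for illustration. The function name `pearson` is hypothetical, and the sample uses only the first ten Age and Credit_score values shown in the data structure printout, so its result will not match the full-data value in the matrix.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations (covariance numerator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# First ten Age and Credit_score values from the data structure printout.
age = [38, 36, 34, 29, 30, 32, 38, 36, 36, 29]
credit_score = [796, 813, 756, 737, 662, 828, 856, 763, 778, 649]

r_sample = pearson(age, credit_score)
print(f"Sample correlation (first 10 rows only): {r_sample:.4f}")
```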

Exploratory Data Analysis: Variables Transformations

After the categorical and binary variables were converted to factors, the data set was re-examined. The updated structure is printed below.

'data.frame':   1000 obs. of  16 variables:
 $ Default         : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 2 ...
 $ Checking_amount : int  988 458 158 300 63 1071 -192 172 585 189 ...
 $ Term            : int  15 15 14 25 24 20 13 16 20 19 ...
 $ Credit_score    : int  796 813 756 737 662 828 856 763 778 649 ...
 $ Gender          : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 1 1 2 ...
 $ Marital_status  : Factor w/ 2 levels "Married","Single": 2 2 2 2 2 1 2 2 2 1 ...
 $ Car_loan        : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 2 2 2 ...
 $ Personal_loan   : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
 $ Home_loan       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Education_loan  : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 1 1 ...
 $ Emp_status      : Factor w/ 2 levels "employed","unemployed": 1 1 1 1 2 1 1 1 2 1 ...
 $ Amount          : int  1536 947 1678 1804 1184 475 626 1224 1162 786 ...
 $ Saving_amount   : int  3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
 $ Emp_duration    : int  12 25 43 0 4 12 11 12 12 0 ...
 $ Age             : int  38 36 34 29 30 32 38 36 36 29 ...
 $ No_of_credit_acc: Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 2 1 1 1 1 ...

Model Selection: Multiple Logistic Regression

Model Selection: Multicollinearity

Model Selection: GVIF Model 1

VIF: Model 1
Variable          GVIF       Df  GVIF^(1/(2*Df))
Checking_amount   1.185956   1   1.089016
Term              1.045746   1   1.022617
Credit_score      1.099802   1   1.048714
Gender            2.573097   1   1.604088
Marital_status    2.721962   1   1.649837
Car_loan          83.354431  1   9.129865
Personal_loan     81.896071  1   9.049645
Home_loan         14.354499  1   3.788733
Education_loan    32.253932  1   5.679254
Emp_status        1.143766   1   1.069470
Amount            1.076582   1   1.037585
Saving_amount     1.228646   1   1.108443
Emp_duration      1.166378   1   1.079990
Age               1.242704   1   1.114766
No_of_credit_acc  1.338116   8   1.018371
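
The last column of the Model 1 table is the GVIF raised to the power 1/(2*Df), which puts terms with different degrees of freedom on a comparable scale; for a term with Df = 1 it is simply the square root of the GVIF. The following Python sketch recomputes that column for a few rows as a sanity check. The function name `adjusted_gvif` is hypothetical, and the inputs are copied from the table rather than derived from a fitted model.

```python
def adjusted_gvif(gvif, df):
    """Return GVIF^(1/(2*Df)), the degrees-of-freedom-adjusted GVIF."""
    return gvif ** (1.0 / (2 * df))

# (GVIF, Df) pairs copied from the Model 1 table.
model_1 = {
    "Car_loan": (83.354431, 1),
    "Personal_loan": (81.896071, 1),
    "Education_loan": (32.253932, 1),
    "Home_loan": (14.354499, 1),
    "No_of_credit_acc": (1.338116, 8),
}

adjusted = {name: adjusted_gvif(g, df) for name, (g, df) in model_1.items()}

for name, value in adjusted.items():
    print(f"{name:<18}{value:.6f}")
```

The four loan indicators stand out on either scale, while No_of_credit_acc, despite its 8 degrees of freedom, stays close to 1 after adjustment.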

Model Selection: GVIF Model 2

VIF: Model 2
Variable         VIF
Checking_amount  1.115226
Term             1.015942
Credit_score     1.047873
Saving_amount    1.120649
Age              1.146111

Model Selection: GVIF Model 3

VIF: Model 3
Variable         VIF
Checking_amount  1.177743
Term             1.020826
Credit_score     1.072932
Personal_loan    1.187600
Home_loan        1.176122
Education_loan   1.200096
Emp_status       1.031324
Amount           1.049029
Saving_amount    1.204225
Age              1.169251

Data Split



The following table shows the number of observations in each non-overlapping fold used in the k-fold cross-validation procedure.



Fold Quantities

Fold  Observations
1     195
2     186
3     223
4     199
5     197
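
The fold sizes are unequal, which is what one would expect if each observation were assigned a fold label independently at random. The deck does not say how the folds were created, so the Python sketch below is only an assumed illustration of that approach; the function name `assign_folds` and the seed are hypothetical, and the resulting counts will not reproduce the table.

```python
import random
from collections import Counter

def assign_folds(n_obs, k, seed=None):
    """Assign each observation a fold label in 1..k, independently at random."""
    rng = random.Random(seed)
    return [rng.randint(1, k) for _ in range(n_obs)]

# 1000 observations and 5 folds, as in the table; the seed is arbitrary.
folds = assign_folds(n_obs=1000, k=5, seed=2024)
fold_sizes = Counter(folds)

for fold in sorted(fold_sizes):
    print(fold, fold_sizes[fold])
```

Because every observation receives exactly one label, the folds are non-overlapping and their sizes always sum to the number of observations.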

k-Fold Cross Validation



Average prediction errors

PE1 PE2 PE3
0.0741 0.0713 0.0631
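
An average prediction error of this kind can be obtained by holding out each fold in turn, fitting on the remaining folds, and averaging the held-out misclassification rates. The Python sketch below assumes that per-fold averaging (the deck does not state whether errors were averaged per fold or pooled) and uses a majority-class predictor as a stand-in for the logistic regression models. All function names are hypothetical, and the toy labels are the first ten Default values from the data structure printout.

```python
from collections import Counter

def fit_majority(X_train, y_train):
    """Stand-in 'model': remember the most common training label."""
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    """Predict the stored majority label, ignoring the features."""
    return model

def cv_prediction_error(X, y, folds, fit, predict):
    """Average misclassification rate across held-out folds."""
    errors = []
    for fold in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != fold]
        test = [i for i, f in enumerate(folds) if f == fold]
        # Fit only on observations outside the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)

# Toy example: ten observations split into two alternating folds.
X = [[0]] * 10                        # placeholder features, unused here
y = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]    # first ten Default values
folds = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]

pe = cv_prediction_error(X, y, folds, fit_majority, predict_majority)
print(f"Average prediction error: {pe:.4f}")
```

Swapping `fit_majority` and `predict_majority` for a real classifier's fit and predict routines leaves the cross-validation loop unchanged.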



Average prediction accuracy

ACC1 ACC2 ACC3
0.9259 0.9287 0.9369
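
The accuracy figures are the complements of the prediction errors reported earlier (accuracy = 1 - prediction error). A short Python check, using the reported values and hypothetical model labels:

```python
# Average prediction errors as reported for the three models.
prediction_error = {"model 1": 0.0741, "model 2": 0.0713, "model 3": 0.0631}

# Accuracy is the complement of the misclassification rate.
accuracy = {name: round(1 - pe, 4) for name, pe in prediction_error.items()}

# The model with the lowest prediction error has the highest accuracy.
best_model = min(prediction_error, key=prediction_error.get)

for name in prediction_error:
    print(name, prediction_error[name], accuracy[name])
print("Lowest prediction error:", best_model)
```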

Discussion: Results

Discussion: Process

Discussion: Models

The models were not used to make inferences, so no model summaries or interpretations are given.

Note that both the automatic model and the second model might have differed at each iteration of k-fold cross-validation, depending on their selection criteria. That is, had the full model been fitted to training data from a different set of folds, it might have produced a reduced model entirely different from model 2. The same can be said for model 3.

Conclusion

Prediction error differed little among the three models. However, if one were to pick a model for prediction based on the lowest prediction error, one might choose the third model. Further research might explore the cause of the high GVIF values in model 1.
