Introduction

Lenders such as credit card companies use models to predict the likelihood that an applicant will default on a bank loan. The purpose of this project is to build a model that predicts whether an applicant will default. The data set, taken from Applied Analytics through Case Studies Using SAS and R by Deepti Gupta, contains 1000 observations and 16 variables: 8 categorical variables, including the outcome variable of interest, “Default,” and 8 numerical variables.

Exploratory Data Analysis

Summary Statistics

The initial summary statistics show that Default, Car_loan, Personal_loan, Home_loan, and Education_loan are stored as numeric variables even though they are categorical. These five variables were converted to factors, along with Emp_status, Marital_status, and Gender, for ease of analysis. The summary function was then rerun, and its output is displayed below.
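The conversion step can be sketched in base R as follows. This is a minimal illustration on a four-row toy data frame, not the report's actual code; the data frame name `default` matches the name that appears in the test output later in the report, but the rows and the helper vector `categorical_cols` are made up for the example.

```r
# Toy stand-in for the loan data (hypothetical rows, only a few of the 16 columns).
default <- data.frame(
  Default       = c(0, 1, 0, 0),
  Car_loan      = c(1, 0, 0, 1),
  Personal_loan = c(0, 1, 1, 0),
  Gender        = c("Male", "Female", "Male", "Male"),
  Age           = c(31, 28, 35, 30),
  stringsAsFactors = FALSE
)

# Columns that are categorical but were read in as numeric or character.
categorical_cols <- c("Default", "Car_loan", "Personal_loan", "Gender")

# Convert each listed column to a factor; numeric columns such as Age are untouched.
default[categorical_cols] <- lapply(default[categorical_cols], factor)

# summary() now reports level counts for factors and quantiles for numerics.
summary(default)
```

After the conversion, `summary()` prints counts per level for the factor columns, which is the layout seen in the output below.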

##  Default Checking_amount       Term        Credit_score       Gender   
##  0:700   Min.   :-665.0   Min.   : 9.00   Min.   : 376.0   Female:310  
##  1:300   1st Qu.: 164.8   1st Qu.:16.00   1st Qu.: 725.8   Male  :690  
##          Median : 351.5   Median :18.00   Median : 770.5               
##          Mean   : 362.4   Mean   :17.82   Mean   : 760.5               
##          3rd Qu.: 553.5   3rd Qu.:20.00   3rd Qu.: 812.0               
##          Max.   :1319.0   Max.   :27.00   Max.   :1029.0               
##  Marital_status Car_loan Personal_loan Home_loan Education_loan
##  Married:548    0:647    0:526         0:944     0:888         
##  Single :452    1:353    1:474         1: 56     1:112         
##                                                                
##                                                                
##                                                                
##                                                                
##       Emp_status      Amount     Saving_amount   Emp_duration   
##  employed  :308   Min.   : 244   Min.   :2082   Min.   :  0.00  
##  unemployed:692   1st Qu.:1016   1st Qu.:2951   1st Qu.: 15.00  
##                   Median :1226   Median :3203   Median : 41.00  
##                   Mean   :1219   Mean   :3179   Mean   : 49.39  
##                   3rd Qu.:1420   3rd Qu.:3402   3rd Qu.: 85.00  
##                   Max.   :2362   Max.   :4108   Max.   :120.00  
##       Age        No_of_credit_acc
##  Min.   :18.00   Min.   :1.000   
##  1st Qu.:29.00   1st Qu.:1.000   
##  Median :32.00   Median :2.000   
##  Mean   :31.21   Mean   :2.546   
##  3rd Qu.:34.00   3rd Qu.:3.000   
##  Max.   :42.00   Max.   :9.000

Exploratory Analysis of Categorical Variables

Employment Status, Gender, and Marital Status

What stands out from the summary table above is that 692 of the 1000 applicants were unemployed. A cross-tabulation of employment status by gender, together with a chi-square test of association, shows a significant association between employment status and gender. There is, however, no significant association between employment status and default (see the Univariate Chi-Square Tests section). A cross-tabulation of marital status by employment status also shows a significant association, with married individuals more likely to be unemployed: 442 of the 548 married applicants (81%) were unemployed, compared with 250 of the 452 single applicants (55%). The cross-tabulations between categorical variables revealed another notable pattern: within the data set, none of the females are married. A Mantel-Haenszel chi-squared test shows that, when controlling for gender, there is a significant association between employment status and marital status. Therefore, within the data set, males are more likely than females to be married and unemployed.

##             
##              Female Male
##   employed      142  166
##   unemployed    168  524
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Gender and default$Emp_status
## X-squared = 46.454, df = 1, p-value = 0.000000000009378

statistic p.value parameter method
46.5 0.0000*** 1 Pearson's Chi-squared test with Yates' continuity correction

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

##             
##              Married Single
##   employed       106    202
##   unemployed     442    250
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Emp_status and default$Marital_status
## X-squared = 73.481, df = 1, p-value < 0.00000000000000022
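The two chi-square statistics above can be reproduced directly from the printed cross-tabulations. The sketch below rebuilds both 2x2 tables from the counts shown and also recomputes the first statistic by hand; the object names are illustrative and not taken from the report.

```r
# Counts copied from the two cross-tabulations printed above.
emp_by_gender <- matrix(
  c(142, 166,
    168, 524),
  nrow = 2, byrow = TRUE,
  dimnames = list(Emp_status = c("employed", "unemployed"),
                  Gender     = c("Female", "Male"))
)
emp_by_marital <- matrix(
  c(106, 202,
    442, 250),
  nrow = 2, byrow = TRUE,
  dimnames = list(Emp_status     = c("employed", "unemployed"),
                  Marital_status = c("Married", "Single"))
)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default.
test_gender  <- chisq.test(emp_by_gender)
test_marital <- chisq.test(emp_by_marital)

# Manual check: expected count = row total * column total / grand total,
# then sum (|observed - expected| - 0.5)^2 / expected over the four cells.
expected_gender <- outer(rowSums(emp_by_gender), colSums(emp_by_gender)) /
  sum(emp_by_gender)
manual_stat <- sum((abs(emp_by_gender - expected_gender) - 0.5)^2 / expected_gender)

c(gender  = unname(test_gender$statistic),
  marital = unname(test_marital$statistic),
  manual  = manual_stat)
```

Working from the tables rather than the raw data is enough here because the chi-square test depends only on the cell counts.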

Number of Credit Accounts

The frequency table for No_of_credit_acc shows that few applicants have more than five credit accounts. For the time being, No_of_credit_acc will remain numeric and uncollapsed. In later stages, it may prove advantageous to convert No_of_credit_acc to a factor and collapse the values 6-9 into a single 6+ category.

## 
##   1   2   3   4   5   6   7   8   9 
## 308 325 119 105 109   6   8   6  14
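If the collapse into a 6+ category is carried out later, it could look like the sketch below. The vector is rebuilt from the frequencies printed above so the example is self-contained; the variable names are illustrative.

```r
# Frequencies copied from the table above (number of credit accounts 1..9).
counts <- c(308, 325, 119, 105, 109, 6, 8, 6, 14)
no_of_credit_acc <- rep(1:9, times = counts)

# Collapse the values 6-9 into a single "6+" level and keep 1-5 as they are.
collapsed <- factor(
  ifelse(no_of_credit_acc >= 6, "6+", as.character(no_of_credit_acc)),
  levels = c("1", "2", "3", "4", "5", "6+")
)

table(collapsed)
```

The combined 6+ level holds 6 + 8 + 6 + 14 = 34 applicants, which is still a small group compared with the other levels.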

Univariate Chi-Square Tests

In addition, chi-square tests of association were performed between each categorical variable and Default to identify significant associations.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Default and default$Gender
## X-squared = 5.3485, df = 1, p-value = 0.02074
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Default and default$Marital_status
## X-squared = 6.1598, df = 1, p-value = 0.01307
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Default and default$Car_loan
## X-squared = 5.074, df = 1, p-value = 0.02429
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Default and default$Personal_loan
## X-squared = 45.298, df = 1, p-value = 0.00000000001693
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Default and default$Home_loan
## X-squared = 7.7909, df = 1, p-value = 0.005251
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Default and default$Education_loan
## X-squared = 80.093, df = 1, p-value < 0.00000000000000022
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  default$Default and default$Emp_status
## X-squared = 0.080655, df = 1, p-value = 0.7764

All categorical variables except Emp_status showed a statistically significant association with Default at the .05 significance level. The strongest association was between Default and Education_loan (X-squared = 80.093), where applicants without education loans were less likely to default. The results are shown in the bar chart below.

Exploratory Analysis of Numeric Variables

Correlation Matrix

A correlation matrix was used to assess correlations between the numeric variables. The strongest positive correlation is between Saving_amount and Age: as age increases, savings amount increases. The strongest negative correlation is between Age and Term: as age increases, term decreases. Squares marked with an X indicate that the correlation between the two variables is not significant at the .05 significance level. All squares without an X indicate a significant correlation at the .05 level.

Density Plots

The density plots of the numerical variables, split by Default, show several visible patterns. Applicants who default tend to have lower checking amounts, longer terms, lower credit scores, greater Amount values, lower savings amounts, and younger ages.

Scatter Plots

Scatter plots were prepared for the numerical variables, with points distinguished by the two levels of Default. These plots suggest visually that there may be some interaction between the variables. For example, among those who default, there appears to be a negative correlation between credit score and savings amount, whereas among those who do not default, the correlation between the two variables appears to be positive.

Logistic Regression Model Building

To investigate whether someone will default on their loan, three logistic regression models were fit to the data and compared on predictive performance. The logistic regression models also indicate which variables are most strongly associated with defaulting on a loan.

Splitting Data into Training and Test Samples

The original data set was split into a training data set, used to fit the logistic regression models, and a test data set, used to assess their performance.
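A random split of this kind can be sketched as follows. The 70/30 proportion is an assumption inferred from the 300-observation confusion matrix reported later for the test data; the seed, the placeholder data frame, and the object names are hypothetical.

```r
set.seed(123)  # assumed seed, used only to make this sketch reproducible

# Placeholder data with the same size and class balance as the loan data.
n <- 1000
default <- data.frame(
  id      = seq_len(n),
  Default = factor(rep(c(0, 1), times = c(700, 300)))
)

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- default[train_idx, ]
test  <- default[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

Every model in the following sections is fit on `train` only, so that `test` gives an estimate of performance on unseen applicants.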

Fitting a Full Logistic Regression Model

Next, a full logistic regression model is fit to the training data set with “Default” as the response variable and the other fifteen variables as predictors.

Coefficients of Full Logistic Regression Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 39.9025787 5.9240627 6.7356780 0.0000000
Checking_amount -0.0049317 0.0008038 -6.1354975 0.0000000
Term 0.2342254 0.0684103 3.4238316 0.0006174
Credit_score -0.0105484 0.0024878 -4.2401139 0.0000223
GenderMale -0.0468614 0.6260959 -0.0748471 0.9403364
Marital_statusSingle 0.3845520 0.6244803 0.6157953 0.5380297
Car_loan1 -1.3779309 3.5405050 -0.3891905 0.6971352
Personal_loan1 -2.4773232 3.5407698 -0.6996567 0.4841417
Home_loan1 -4.4749376 3.6213878 -1.2356969 0.2165712
Education_loan1 0.4208259 3.5738732 0.1177506 0.9062652
Emp_statusunemployed 0.2463111 0.4065920 0.6057943 0.5446513
Amount 0.0005281 0.0006678 0.7907602 0.4290840
Saving_amount -0.0048328 0.0007323 -6.5990257 0.0000000
Emp_duration 0.0056039 0.0056208 0.9969792 0.3187746
Age -0.6596863 0.0788019 -8.3714480 0.0000000
No_of_credit_acc -0.0799863 0.1207618 -0.6623480 0.5077482

From the table above, it can be seen that the variables Checking_amount, Term, Credit_score, Saving_amount, and Age have a highly significant association with whether someone will default. These five variables are also the only ones that are statistically significant at the .05 significance level. Emp_duration, by contrast, is not significant (p = 0.319).
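Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios that are easier to interpret. The sketch below does this for the significant estimates copied from the coefficient table above; the interpretation assumes the other predictors are held fixed.

```r
# Estimates copied from the full-model coefficient table above.
coefs <- c(
  Checking_amount = -0.0049317,
  Term            =  0.2342254,
  Credit_score    = -0.0105484,
  Saving_amount   = -0.0048328,
  Age             = -0.6596863
)

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(coefs)
round(odds_ratios, 4)

# Percent change in the odds of default per one-unit increase.
round(100 * (odds_ratios - 1), 1)
```

For example, the Age estimate corresponds to an odds ratio of about 0.52, so each additional year of age is associated with roughly 48% lower odds of default, while each additional unit of Term is associated with roughly 26% higher odds.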

Performance of Full Logistic Regression Model

For prediction, a cut-off probability first needed to be established. Five-fold cross-validation on the training data set determined that the optimal cut-off probability for the full model was .33, meaning that any individual with a predicted probability of default greater than .33 was predicted to default on their loan. On the training data set, the full model correctly predicted whether someone would default 94.07% of the time. Applying the same cut-off of 0.33 to the testing data set yielded an accuracy of 94%, as reported in the table below.
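One way to choose a cut-off by five-fold cross-validation is sketched below. The report does not state the selection criterion, so this example assumes the cut-off is chosen to maximize mean held-out accuracy; the data are simulated and every name in the block is hypothetical.

```r
set.seed(42)  # assumed seed for this sketch only

# Simulated stand-in data: two predictors and a binary outcome.
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-1 + 1.5 * sim$x1 - sim$x2))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold labels 1..5
cutoffs <- seq(0.05, 0.95, by = 0.01)            # candidate cut-off probabilities

# acc[i, j] = accuracy on held-out fold i when using cut-off j.
acc <- matrix(NA_real_, nrow = k, ncol = length(cutoffs))
for (i in seq_len(k)) {
  fit   <- glm(y ~ x1 + x2, family = binomial, data = sim[fold != i, ])
  prob  <- predict(fit, newdata = sim[fold == i, ], type = "response")
  truth <- sim$y[fold == i]
  acc[i, ] <- sapply(cutoffs, function(cutoff) mean((prob > cutoff) == truth))
}

# Average over folds and pick the cut-off with the highest mean accuracy.
mean_acc    <- colMeans(acc)
best_cutoff <- cutoffs[which.max(mean_acc)]
best_cutoff
```

The chosen cut-off is then held fixed and applied once to the test data, so the test accuracy is not used to tune it.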

Proportion of Correct predictions from Full model
Dataset Percent.Correct
Testing 94

The C-statistic, or area under the receiver operating characteristic (ROC) curve, indicates outstanding discrimination on both the training and testing data sets. The full model does an outstanding job of distinguishing applicants who will default from those who will not in both data sets.

Local performance metrics
sensitivity specificity precision recall F1
0.9156627 0.9493088 0.8735632 0.9156627 0.8941176

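The full model's VIF values are all below five for the numeric variables, indicating no multicollinearity issues among them. The loan-type indicators are a different matter: Car_loan, Personal_loan, Home_loan, and Education_loan have VIF values between roughly 21 and 92, which indicates strong collinearity among these indicators and is consistent with their large standard errors in the coefficient table above.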

Vif Values

##  Checking_amount             Term     Credit_score           Gender 
##         1.178541         1.118359         1.144263         2.712260 
##   Marital_status         Car_loan    Personal_loan        Home_loan 
##         2.931119        92.365892        92.220960        21.413613 
##   Education_loan       Emp_status           Amount    Saving_amount 
##        31.926009         1.105981         1.075671         1.217359 
##     Emp_duration              Age No_of_credit_acc 
##         1.255676         1.236782         1.096510
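For reference, the variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data in which two predictors are nearly identical; it illustrates the textbook definition for numeric predictors and is not the function the report used.

```r
set.seed(1)  # assumed seed for this sketch only

n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so x1 and x2 are collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

The collinear pair produces very large values while the unrelated predictor stays near 1, which mirrors the contrast in the output above between the loan indicators and the remaining variables.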

Outliers:

Significant outliers were discovered for observations 766, 391, and 225. None of these outliers appear to be the result of any type of error and thus remained in the model.

Stepwise AIC Model

To construct a reduced model, a stepwise selection algorithm was used to select features. Backward and forward-backward (both) selection yielded the same features, whereas forward selection retained all but one feature. Based on this, the features chosen by backward and forward-backward selection were incorporated into the reduced Stepwise AIC model.
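The three selection directions can be sketched with base R's `step()`, which also selects by AIC. Whether the report used `step()` or a similar function is not stated, so this is an illustration on simulated data with two informative predictors and two pure-noise predictors; all names are hypothetical.

```r
set.seed(7)  # assumed seed for this sketch only

n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

full <- glm(y ~ x1 + x2 + noise1 + noise2, family = binomial, data = sim)
null <- glm(y ~ 1, family = binomial, data = sim)
search_scope <- list(lower = ~ 1, upper = ~ x1 + x2 + noise1 + noise2)

# Backward and "both" start from the full model; forward starts from the null model.
backward <- step(full, direction = "backward", trace = 0)
both     <- step(full, scope = search_scope, direction = "both", trace = 0)
forward  <- step(null, scope = search_scope, direction = "forward", trace = 0)

# Compare the terms retained by each direction.
lapply(list(backward = backward, both = both, forward = forward),
       function(m) attr(terms(m), "term.labels"))
```

The directions need not agree, since each follows a different greedy path through the candidate models, which is why the report compares them before settling on a feature set.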

Coefficients of Stepwise Logistic Regression Model
Estimate Std. Error z value Pr(>|z|)
(Intercept) 40.8381151 4.6231540 8.833388 0.0000000
Checking_amount -0.0050678 0.0008035 -6.307110 0.0000000
Term 0.2370735 0.0677178 3.500905 0.0004637
Credit_score -0.0101541 0.0024044 -4.223202 0.0000241
Car_loan1 -1.8782406 0.7065856 -2.658193 0.0078561
Personal_loan1 -2.9301812 0.7097100 -4.128702 0.0000365
Home_loan1 -4.9573357 1.0644451 -4.657202 0.0000032
Saving_amount -0.0047697 0.0007217 -6.608820 0.0000000
Age -0.6575394 0.0765847 -8.585783 0.0000000

Performance of stepwise AIC Model

Five-fold cross-validation was performed on the training data set for the reduced Stepwise AIC model. The results yielded an optimal cut-off of .29 for this model and a training accuracy of 93.93%. The cut-off from cross-validation was then applied to the testing data set to determine the accuracy of the reduced model. The model was accurate 93.67% of the time in predicting whether someone would default on their loan, slightly below the 94% achieved by the full model.

Proportion of Correct Predictions from Stepwise Model
Dataset Percent.Correct
Testing 93.66667

Although the specificity differs slightly between the full model (0.9493) and the Stepwise AIC model (0.9447), with the sensitivity the same for both (0.9157), both models have an identical area under the curve of 0.9783. This indicates that both models perform identically in terms of discrimination. Furthermore, these results indicate that the reduced stepwise model is not underfit in comparison to the full model.

Local performance metrics
sensitivity specificity precision recall F1
0.9156627 0.9447005 0.8636364 0.9156627 0.8888889

The reduced model also showed outstanding discrimination in predicting whether someone will default on their loan. What stands out is that the reduced model performed almost as well as the full model while using fewer predictors.

VIF values for the reduced model are all below five indicating multicollinearity is not an issue with the model.

Vif Values

## Checking_amount            Term    Credit_score        Car_loan   Personal_loan 
##        1.183323        1.085144        1.105634        3.756111        3.803263 
##       Home_loan   Saving_amount             Age 
##        1.831949        1.204715        1.190951

Outliers:

The reduced model produced the same outliers as the full model, and they were again retained in the reduced model.

Interaction Included Model

After feature selection with the Step AIC method in the prior section, the selected features in the reduced model were tested for pairwise interactions one at a time. The following interaction terms were significant at the .05 significance level (p-values in parentheses): Term:Credit_score (0.00696), Credit_score:Saving_amount (0.028013), Car_loan:Saving_amount (0.01270), and Personal_loan:Saving_amount (0.02896). These interaction terms were then added to the model and passed to a new stepwise AIC algorithm using backward selection. The results are found below.

Coefficients of Stepwise Logistic Regression Model with Interaction Terms
Estimate Std. Error z value Pr(>|z|)
(Intercept) -38.3831883 26.6421948 -1.440692 0.1496718
Checking_amount -0.0054646 0.0008927 -6.121128 0.0000000
Term 2.1847681 0.8051805 2.713389 0.0066599
Credit_score 0.0914907 0.0356429 2.566874 0.0102620
Car_loan1 9.9227909 4.9684127 1.997175 0.0458061
Personal_loan1 -3.3529294 0.7408941 -4.525518 0.0000060
Home_loan1 -4.9800002 1.0525285 -4.731464 0.0000022
Saving_amount 0.0102921 0.0079472 1.295060 0.1952997
Age -0.6925115 0.0811738 -8.531225 0.0000000
Term:Credit_score -0.0025649 0.0010494 -2.444056 0.0145232
Credit_score:Saving_amount -0.0000184 0.0000105 -1.743303 0.0812807
Car_loan1:Saving_amount -0.0039166 0.0015985 -2.450170 0.0142789

Performance of interaction Model

Cross-validation of the interaction model yielded an optimal cut-off probability of 0.52 with a training accuracy of 94.96%. Applying this cut-off to the testing data set, the model recorded an accuracy of 93.67%, which is identical to the accuracy of the reduced Stepwise AIC model and slightly below the 94% of the full model.

Proportion of Correct Predictions from Stepwise Model with Interaction Terms
Dataset Percent.Correct
Testing 93.66667

The Interaction Terms model did not perform as well as the full model and reduced model in terms of discrimination within the testing data set. Although the AUC for the reduced interaction terms model was 0.9656, which is still outstanding discrimination, the AUC is 0.0127 less than that of the Full Model and Reduced Stepwise Model.

Local performance metrics
sensitivity specificity precision recall F1
0.8674699 0.9631336 0.9 0.8674699 0.8834356

Because this model contains interaction terms, several VIF values are well above the 5-10 range, which indicates significant multicollinearity.

Vif Values

##            Checking_amount                       Term 
##                   1.262488                 135.626848 
##               Credit_score                   Car_loan 
##                 190.751796                 161.014114 
##              Personal_loan                  Home_loan 
##                   3.814218                   1.964400 
##              Saving_amount                        Age 
##                 143.601513                   1.365835 
##          Term:Credit_score Credit_score:Saving_amount 
##                 181.089179                 267.646853 
##     Car_loan:Saving_amount 
##                 163.384862

Outliers:

The interaction model yielded the same outliers as the other two models and the outliers were maintained within the model.

Discussion and Final Selection of Model

After reviewing the three models, the reduced stepwise model appears to be the most desirable for predicting whether someone will default on their loan. It matched the full model in discrimination (area under the ROC curve of 0.9783 for both) and came within a third of a percentage point of its test accuracy (93.67% versus 94%). Essentially no performance was lost by paring the model down from fifteen predictors to eight. The reduced stepwise model with interaction terms matched the reduced model in test accuracy but did not perform as well in terms of discrimination, so adding interaction terms did not improve on the reduced stepwise model. A likely explanation is that the interaction terms “overfit” the model to the training data. Within a logistic regression model-building approach, the reduced model with the variables Checking_amount, Term, Credit_score, Car_loan, Personal_loan, Home_loan, Saving_amount, and Age proves to be the most parsimonious and effective model in terms of prediction and association.

Prediction with Neural Network Models

Creating a Neural Network for Prediction of Default

The initial step in developing a neural network is building a model matrix so that the names of all feature variables, including implicitly defined dummy variables, are defined and extracted. This was performed on both the training and test data sets. Feature engineering in the form of normalization was also performed on all numerical variables to prepare them for the neural network.

The dummy feature names produced by the model matrix contain special characters that cause problems for neural network modeling (although they are fine for linear and generalized linear regression models). These feature variables were therefore renamed without special characters before building the neural network model.
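The normalization, model-matrix, and renaming steps described above can be sketched in base R as follows. Min-max scaling is assumed here as the normalization method, since the report does not name one, and the toy data and the simple character-stripping rule are illustrative; the report's own renamed features use a camelCase style instead.

```r
# Toy data frame with one numeric and two categorical predictors (hypothetical rows).
toy <- data.frame(
  Default         = factor(c(0, 1, 0, 1)),
  Checking_amount = c(-100, 250, 400, 900),
  Marital_status  = factor(c("Single", "Married", "Single", "Married")),
  Emp_status      = factor(c("employed", "unemployed", "unemployed", "employed"))
)

# Min-max normalization to [0, 1] (an assumed choice of normalization).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
toy$Checking_amount <- min_max(toy$Checking_amount)

# model.matrix() expands each factor into 0/1 dummy columns.
mm <- model.matrix(~ ., data = toy)
colnames(mm)

# Drop every character that is not a letter or digit, e.g. "(Intercept)" -> "Intercept".
colnames(mm) <- gsub("[^A-Za-z0-9]", "", colnames(mm))
colnames(mm)
```

In practice the minimum and maximum from the training data would normally be reused when scaling the test data, so that both sets are on the same scale.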

Next, the neural network model was built using the neuralnet function. A neural network with a single hidden layer was developed.

error 14.9974768
reached.threshold 0.0094634
steps 3617.0000000
Intercept.to.1layhid1 16.6141178
checkingAmount.to.1layhid1 -8.1722715
Term.to.1layhid1 4.2904723
creditScore.to.1layhid1 -6.5354810
GenderMale.to.1layhid1 0.3248997
maritalStatusSingle.to.1layhid1 0.7001326
carLoan1.to.1layhid1 -0.4712334
personalLoan1.to.1layhid1 -1.7527155
homeLoan1.to.1layhid1 -3.3455897
educationLoan1.to.1layhid1 0.8317540
empStatusUnemployed.to.1layhid1 0.3219977
Amount.to.1layhid1 -0.1122575
savingAmount.to.1layhid1 -9.1276846
empDuration.to.1layhid1 0.7902562
Age.to.1layhid1 -14.1673891
noOfCreditAcc.to.1layhid1 -0.7896480
Intercept.to.Default1 -0.0046067
1layhid1.to.Default1 1.0169859

A diagram of the neural network can be viewed below.

Figure 12. Single-layer backpropagation Neural network model for Default

Performance Metrics

Five-fold cross-validation was performed on the neural network to determine the optimal cut-off. The optimal cut-off for the model was 0.33, with an accuracy of 94.22% on the training data.

On the testing data, the neural network model was accurate 91.67% of the time, as shown in the output below. Based on the ROC curve, the neural network performed much better than random guessing. Even so, its results are underwhelming compared to the predictive accuracy of the logistic regression models.

## $confusion.matrix
##        
##           0   1
##   FALSE 198   6
##   TRUE   19  77
## 
## $accuracy
## [1] 0.9166667
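
The accuracy above, along with the other metrics reported for the logistic models, follows directly from the confusion matrix. The sketch below assumes that the rows (FALSE/TRUE) are the predicted classes and the columns (0/1) are the actual Default values; the object names are illustrative.

```r
# Confusion matrix copied from the output above.
# Assumption: rows are predictions, columns are actual Default values.
cm <- matrix(
  c(198,  6,
     19, 77),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("FALSE", "TRUE"), actual = c("0", "1"))
)

tp <- cm["TRUE", "1"];  tn <- cm["FALSE", "0"]   # correct predictions
fp <- cm["TRUE", "0"];  fn <- cm["FALSE", "1"]   # errors

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual defaults that were caught
specificity <- tn / (tn + fp)   # share of non-defaults correctly cleared
precision   <- tp / (tp + fp)   # share of predicted defaults that were real
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision, F1 = f1), 4)
```

Under that reading of the matrix, the neural network has a sensitivity of about 0.93 and a precision of about 0.80, so most of its errors are false positives.
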
Figure 14: ROC Curve of the neural network model.

Decision Tree Models for the Prediction of Default Status.

Six decision tree models were fit to the data set to see which variables perform best in predicting whether someone will default on their loan. The models are:
- Tree with Gini index: non-penalization
- Tree with entropy: non-penalization
- Tree with Gini index: penalization of false negatives
- Tree with entropy: penalization of false negatives
- Tree with Gini index: penalization of false positives
- Tree with entropy: penalization of false positives

The trees themselves can be viewed below.

Figure 14. Non-penalized decision tree models using Gini index (left) and entropy (right).

Figure 15. penalized decision tree models using Gini index (left) and entropy (right).

Figure 15. penalized decision tree models using Gini index (left) and entropy (right).

All trees used Age as the main branch, with a cut-off between 30 and 32 years old. From there, Checking_amount, Credit_score, Saving_amount, Amount, and Term were incorporated into branches within the models.
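The two splitting criteria compared in these trees measure node impurity in different ways. The sketch below computes both for the class balance of the full data set (700 non-defaults, 300 defaults) and then for a made-up split, to show how a candidate split is scored; the split counts are hypothetical and do not come from the fitted trees.

```r
# Gini index: 1 - sum(p^2). Entropy: -sum(p * log2(p)).
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)  # skip empty classes to avoid log2(0)
  -sum(p * log2(p))
}

root <- c(no_default = 700, default = 300)  # class counts in the full data set
gini(root)
entropy(root)

# Hypothetical split of the root into two child nodes (counts are made up).
left  <- c(no_default = 400, default = 50)
right <- c(no_default = 300, default = 250)

# Impurity after the split is the size-weighted average of the child impurities.
weighted_gini <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(root)
gini_reduction <- gini(root) - weighted_gini
gini_reduction
```

A tree-growing algorithm evaluates candidate splits in this way and keeps the one with the largest impurity reduction, so the choice between Gini and entropy can change which variable and cut-off are selected at each node.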

Performance of Decision Tree Models.

The ROC curves for the decision tree models are shown in the graph below. The tree with entropy and penalization of false negatives recorded the highest AUC on the testing data, 0.974, which is very close to the AUC of the full and stepwise logistic regression models.

Figure 16. Comparison of ROC curves

Five-fold cross-validation was performed to determine the optimal cut-off for each decision tree. The results of the cross-validation can be seen below.

Figure 17: Plot of optimal cut-off determination
