1 Introduction

For this report, I will build several logistic regression models and determine which one is best at predicting whether a loan has been defaulted on. I will use a data set that documents 1000 loans with 16 variables.

The variables in this data set are as follows:
1) Checking_amount - Numeric
2) Term (in months) - Numeric
3) Credit_score - Numeric
4) Gender - Categorical
5) Marital_status - Categorical
6) Car_loan (1 - has a car loan, 0 - does not have a car loan) - Numeric
7) Personal_loan (1 - has a personal loan, 0 - does not have a personal loan) - Numeric
8) Home_loan (1 - has a home loan, 0 - does not have a home loan) - Numeric
9) Education_loan (1 - has an education loan, 0 - does not have an education loan) - Numeric
10) Emp_status - Categorical
11) Amount - Numeric
12) Saving_amount - Numeric
13) Emp_duration (in months) - Numeric
14) Age (in years) - Numeric
15) No_of_credit_acc - Numeric
16) Default (response variable; 0 if the loan was not defaulted on, 1 if it was) - Numeric

Loan <- read.csv("BankLoanDefaultDataset.csv")

2 Recap of Building the models

To start, a pairwise scatterplot was generated to check for issues with the predictor variables. Since none of the variable distributions was heavily skewed, and 2 of the 3 ‘correlated’ variable pairs consist of categorical variables, none of the variables were transformed (this changes later for the models built for prediction).

library(psych)   # provides pairs.panels()
pairs.panels(Loan[,-16],   # all predictors; column 16 (Default) is dropped
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

Next, the full model was fitted with Default as the response variable and all of the other variables as explanatory variables. Backward stepwise selection then produced a reduced model, still with Default as the response variable, and with Checking_amount + Term + Credit_score + Personal_loan + Home_loan + Education_loan + Emp_status + Amount + Saving_amount + Age as explanatory variables.

library(knitr)   # provides kable()
full.model <- glm(Default ~ ., family = binomial(link = "logit"), data = Loan)
kable(summary(full.model)$coef, 
      caption="Summary of inferential statistics of the full model")
Summary of inferential statistics of the full model
                        Estimate  Std. Error      z value   Pr(>|z|)
(Intercept)           39.6415229   4.7284136    8.3836834  0.0000000
Checking_amount       -0.0050880   0.0006759   -7.5283125  0.0000000
Term                   0.1703676   0.0520728    3.2717189  0.0010690
Credit_score          -0.0109793   0.0020746   -5.2922299  0.0000001
GenderMale             0.1950806   0.5095698    0.3828338  0.7018430
Marital_statusSingle   0.3351480   0.4920120    0.6811786  0.4957585
Car_loan              -0.6004643   2.7585197   -0.2176763  0.8276814
Personal_loan         -1.5540876   2.7585124   -0.5633789  0.5731769
Home_loan             -3.5684378   2.8457131   -1.2539696  0.2098531
Education_loan         0.6498873   2.7894965    0.2329766  0.8157796
Emp_statusunemployed   0.5872532   0.3474376    1.6902407  0.0909819
Amount                 0.0008026   0.0005114    1.5694940  0.1165329
Saving_amount         -0.0048212   0.0006085   -7.9224872  0.0000000
Emp_duration           0.0029178   0.0044391    0.6572906  0.5109941
Age                   -0.6475369   0.0646616  -10.0142428  0.0000000
No_of_credit_acc      -0.0968614   0.1006467   -0.9623902  0.3358536
library(MASS)   # provides stepAIC()
final.model = stepAIC(full.model, direction = "backward",   # backward elimination
                      trace = 0)   # do not show the details
kable(summary(final.model)$coef, 
      caption="Summary of inferential statistics of the final model")
Summary of inferential statistics of the final model
                        Estimate  Std. Error     z value   Pr(>|z|)
(Intercept)           39.0684818   3.8424298   10.167650  0.0000000
Checking_amount       -0.0050961   0.0006725   -7.577703  0.0000000
Term                   0.1748348   0.0516008    3.388216  0.0007035
Credit_score          -0.0108236   0.0020559   -5.264700  0.0000001
Personal_loan         -0.9656327   0.3346546   -2.885460  0.0039084
Home_loan             -2.9990106   0.7783531   -3.853020  0.0001167
Education_loan         1.2333000   0.5425465    2.273169  0.0230160
Emp_statusunemployed   0.5517095   0.3352349    1.645740  0.0998174
Amount                 0.0007966   0.0005100    1.561997  0.1182888
Saving_amount         -0.0048470   0.0006068   -7.987851  0.0000000
Age                   -0.6446881   0.0634579  -10.159300  0.0000000

Some goodness-of-fit measures were checked to determine which of the two models was better. The parameter estimates, which are on the log-odds scale, were then converted to odds ratios and interpreted accordingly.
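As a rough illustration of that step, the sketch below fits a full and a reduced logistic model to simulated data, compares them by AIC and residual deviance, and exponentiates the coefficients (which glm() reports on the log-odds scale) to obtain odds ratios. The data frame `toy`, its columns `x1`, `x2` and `y`, and the seed are assumptions made for this example. They are not the loan data, and the goodness-of-fit measures used in the original analysis may have differed.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
## simulated binary response: only x1 truly affects the log-odds
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1))

full.fit    <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = toy)
reduced.fit <- glm(y ~ x1,      family = binomial(link = "logit"), data = toy)

## goodness-of-fit summaries: a smaller AIC indicates a better trade-off
## between fit and model size
gof <- data.frame(model    = c("full", "reduced"),
                  AIC      = c(AIC(full.fit), AIC(reduced.fit)),
                  deviance = c(deviance(full.fit), deviance(reduced.fit)))
print(gof)

## coefficients are log-odds; exp() turns them into odds ratios
odds.ratios <- exp(coef(reduced.fit))
print(odds.ratios)
```

An odds ratio above 1 means the odds of the outcome rise as the predictor increases by one unit, and a value below 1 means they fall.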

3 New Models

3.1 Change of Variables

Now that new models are being built to find the best one for prediction, all continuous variables in the data set will be standardized, since parameter interpretation is no longer important.

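## z-score each continuous variable: subtract its mean, divide by its standard deviation
Loan$Checking_amount = (Loan$Checking_amount - mean(Loan$Checking_amount))/sd(Loan$Checking_amount)
Loan$Term = (Loan$Term - mean(Loan$Term))/sd(Loan$Term)
Loan$Credit_score = (Loan$Credit_score - mean(Loan$Credit_score))/sd(Loan$Credit_score)
Loan$Amount = (Loan$Amount - mean(Loan$Amount))/sd(Loan$Amount)
Loan$Saving_amount = (Loan$Saving_amount - mean(Loan$Saving_amount))/sd(Loan$Saving_amount)
Loan$Emp_duration = (Loan$Emp_duration - mean(Loan$Emp_duration))/sd(Loan$Emp_duration)
Loan$Age = (Loan$Age - mean(Loan$Age))/sd(Loan$Age)
Loan$No_of_credit_acc = (Loan$No_of_credit_acc - mean(Loan$No_of_credit_acc))/sd(Loan$No_of_credit_acc)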
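
The manual formula above computes the same z-scores as base R's scale(). A minimal sketch on a made-up vector (the values in `x` are hypothetical, not taken from the loan data):

```r
x <- c(12, 24, 36, 48, 60)                 # hypothetical values, e.g. loan terms in months

z.manual <- (x - mean(x)) / sd(x)          # same formula as used for the loan variables
z.scale  <- as.vector(scale(x))            # scale() centers and divides by the SD by default

print(round(z.manual, 4))
print(all.equal(z.manual, z.scale))        # checks that the two approaches agree
```

After standardization each variable has mean 0 and standard deviation 1, so coefficients are expressed per standard deviation rather than per original unit.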

3.2 Training and Test Data Sets

Cross-validation (CV) is a good way to estimate how well a given model predicts new data. For both the full model and the reduced model generated earlier, five-fold CV will be used to estimate predictive performance.
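
To make the idea concrete, the sketch below shows how a five-fold split partitions a set of rows so that every observation is held out exactly once. The sample size `n.obs`, the seed, and the random fold assignment are assumptions for illustration only.

```r
set.seed(2)                                   # arbitrary seed
n.obs <- 20                                   # hypothetical number of observations
k     <- 5

## randomly assign each row to one of k equally sized folds
fold.id <- sample(rep(1:k, length.out = n.obs))

for (i in 1:k) {
  valid.rows <- which(fold.id == i)           # rows held out in fold i
  train.rows <- which(fold.id != i)           # rows used to fit the model in fold i
  cat("Fold", i, ": validate on", length(valid.rows),
      "rows, train on", length(train.rows), "rows\n")
}

## number of rows assigned to each fold
print(table(fold.id))
```

In each of the k rounds the model is fitted on the training rows and scored on the held-out rows, and the k scores are then averaged.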

First, the data set is split into a training set and a test set: 80% of the observations go into the training set and 20% go into the test set.

set.seed(123)   # arbitrary seed (an assumption) so that the random split is reproducible
n <- dim(Loan)[1]
train.n <- round(0.8*n)
train.id <- sample(1:n, train.n, replace = FALSE)
## training and testing data sets
train <- Loan[train.id, ]
test <- Loan[-train.id, ]

And now, to run the CV:

k = 5
fold.size = round(dim(train)[1]/k)
## prediction-error (PE) vectors for the candidate models, one entry per fold
PE1 = rep(0, k)
PE2 = rep(0, k)

for(i in 1:k){
  ## Training and validation folds
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = train[valid.id, ]
  train.dat = train[-valid.id, ]

  ## full model
  candidate01 = glm(Default ~ ., family = binomial(link = "logit"),
                    data = train.dat)
  ## reduced model: backward selection, details not shown
  candidate02 = stepAIC(candidate01, direction = "backward", trace = 0)

  ## predicted probabilities of each candidate model
  pred01 = predict(candidate01, newdata = valid, type = "response")
  pred02 = predict(candidate02, newdata = valid, type = "response")

  ## predicted classes at a 0.5 cut-off (1 = default, 0 = no default)
  pre.outcome01 = ifelse(as.vector(pred01) > 0.5, 1, 0)
  pre.outcome02 = ifelse(as.vector(pred02) > 0.5, 1, 0)

  ## prediction error: proportion of misclassified loans in the validation fold
  PE1[i] = mean(pre.outcome01 != valid$Default)
  PE2[i] = mean(pre.outcome02 != valid$Default)
}

avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2))
kable(avg.pe, caption = "Average of prediction errors of candidate models")
Average of prediction errors of candidate models
PE1 PE2
0 0

According to the table, the average prediction errors of the full and reduced models are the same. Since the reduced model has fewer variables, it is chosen as the final predictive model.

The prediction accuracy of the final model on the held-out test set is given in the following table:

pred02 = predict(candidate02, newdata = test, type="response")
pred02.outcome = ifelse(as.vector(pred02) > 0.5, 1, 0)   # 1 = default, 0 = no default

## accuracy: proportion of test loans classified correctly
accuracy = mean(pred02.outcome == test$Default)
kable(accuracy, caption = "The actual accuracy of the final model")
The actual accuracy of the final model
x
0
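
Accuracy alone does not show which kind of error a classifier makes. As a sketch, a confusion matrix can be tabulated from predicted and observed classes. The vectors `actual` and `prob` below are invented for illustration and are not results from the loan data.

```r
actual <- c(0, 0, 1, 1, 0, 1, 0, 1, 0, 0)                        # hypothetical observed outcomes
prob   <- c(0.1, 0.4, 0.8, 0.3, 0.2, 0.9, 0.6, 0.7, 0.1, 0.3)    # hypothetical predicted probabilities

predicted <- ifelse(prob > 0.5, 1, 0)                            # 0.5 cut-off, as in the code above

## rows = predicted class, columns = observed class
conf.mat <- table(predicted = factor(predicted, levels = c(0, 1)),
                  actual    = factor(actual,    levels = c(0, 1)))
print(conf.mat)

accuracy    <- sum(diag(conf.mat)) / sum(conf.mat)               # proportion classified correctly
sensitivity <- conf.mat["1", "1"] / sum(conf.mat[, "1"])         # share of defaults correctly flagged
specificity <- conf.mat["0", "0"] / sum(conf.mat[, "0"])         # share of non-defaults correctly cleared
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

For loan defaults, sensitivity is often the more relevant figure, since a missed default is usually costlier than a false alarm.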

4 Summary of report

This study focused on predicting loan defaults. To find out which model was best for predicting loan defaults, two candidate models (the full model and the reduced model generated via backward stepwise selection) were compared using five-fold cross-validation. The reduced model was selected as the final model because it matched the full model's average prediction error with fewer variables.