For this report, I will build several logistic regression models and determine which one is best for predicting whether a loan has been defaulted on. I will be using a data set that documents 1000 loans and 16 variables.
The variables in this data set are as follows:
1) Checking_amount - Numeric
2) Term (in months) - Numeric
3) Credit_score - Numeric
4) Gender - Categorical
5) Marital_status - Categorical
6) Car_loan (1 = has a car loan, 0 = does not) - Numeric
7) Personal_loan (1 = has a personal loan, 0 = does not) - Numeric
8) Home_loan (1 = has a home loan, 0 = does not) - Numeric
9) Education_loan (1 = has an education loan, 0 = does not) - Numeric
10) Emp_status - Categorical
11) Amount - Numeric
12) Saving_amount - Numeric
13) Emp_duration (in months) - Numeric
14) Age (in years) - Numeric
15) No_of_credit_acc - Numeric
16) Default (response variable; 0 if the loan was not defaulted on, 1 if it was) - Numeric
library(psych); library(knitr); library(MASS) # pairs.panels(), kable(), stepAIC()
Loan <- read.csv("BankLoanDefaultDataset.csv")
To start, a pairwise scatterplot was generated to check for issues with the predictor variables. Since none of the variables had a heavily skewed distribution, and 2 of the 3 ‘correlated’ variable pairs consist of categorical variables, none of the variables were transformed (this will change later for the models built for prediction).
pairs.panels(Loan[,-16],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
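As a numerical complement to the scatterplot matrix, the pairwise Pearson correlations can also be tabulated directly. The sketch below runs on a small simulated data frame (the columns `x1`, `x2`, `x3` and their values are invented for illustration and are not taken from the loan data) and lists the pairs whose absolute correlation exceeds a cutoff of 0.7, which is an arbitrary choice.

```r
# Simulated stand-in data; not the loan data set
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),  # built to be strongly correlated with x1
  x3 = rnorm(200)
)

cor.mat <- cor(toy, method = "pearson")

# Keep each pair once (upper triangle) and flag |r| above the assumed cutoff
cutoff <- 0.7
idx <- which(abs(cor.mat) > cutoff & upper.tri(cor.mat), arr.ind = TRUE)
high.pairs <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]
)
print(high.pairs)
```

On real data the same steps would be applied to the numeric columns only, since `cor()` does not accept categorical columns.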
Next, the full model was fitted with Default as the response variable and all of the other variables as explanatory variables. Via backward stepwise selection, a reduced model was then generated, still with Default as the response variable and with Checking_amount + Term + Credit_score + Personal_loan + Home_loan + Education_loan + Emp_status + Amount + Saving_amount + Age as explanatory variables.
full.model <- glm(Default ~ ., family = binomial(link = "logit"), data = Loan)
kable(summary(full.model)$coef,
caption="Summary of inferential statistics of the full model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 39.6415229 | 4.7284136 | 8.3836834 | 0.0000000 |
| Checking_amount | -0.0050880 | 0.0006759 | -7.5283125 | 0.0000000 |
| Term | 0.1703676 | 0.0520728 | 3.2717189 | 0.0010690 |
| Credit_score | -0.0109793 | 0.0020746 | -5.2922299 | 0.0000001 |
| GenderMale | 0.1950806 | 0.5095698 | 0.3828338 | 0.7018430 |
| Marital_statusSingle | 0.3351480 | 0.4920120 | 0.6811786 | 0.4957585 |
| Car_loan | -0.6004643 | 2.7585197 | -0.2176763 | 0.8276814 |
| Personal_loan | -1.5540876 | 2.7585124 | -0.5633789 | 0.5731769 |
| Home_loan | -3.5684378 | 2.8457131 | -1.2539696 | 0.2098531 |
| Education_loan | 0.6498873 | 2.7894965 | 0.2329766 | 0.8157796 |
| Emp_statusunemployed | 0.5872532 | 0.3474376 | 1.6902407 | 0.0909819 |
| Amount | 0.0008026 | 0.0005114 | 1.5694940 | 0.1165329 |
| Saving_amount | -0.0048212 | 0.0006085 | -7.9224872 | 0.0000000 |
| Emp_duration | 0.0029178 | 0.0044391 | 0.6572906 | 0.5109941 |
| Age | -0.6475369 | 0.0646616 | -10.0142428 | 0.0000000 |
| No_of_credit_acc | -0.0968614 | 0.1006467 | -0.9623902 | 0.3358536 |
final.model = stepAIC(full.model, direction = "backward", # backward elimination
trace = 0) # do not show the details
kable(summary(final.model)$coef,
caption="Summary of inferential statistics of the final model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 39.0684818 | 3.8424298 | 10.167650 | 0.0000000 |
| Checking_amount | -0.0050961 | 0.0006725 | -7.577703 | 0.0000000 |
| Term | 0.1748348 | 0.0516008 | 3.388216 | 0.0007035 |
| Credit_score | -0.0108236 | 0.0020559 | -5.264700 | 0.0000001 |
| Personal_loan | -0.9656327 | 0.3346546 | -2.885460 | 0.0039084 |
| Home_loan | -2.9990106 | 0.7783531 | -3.853020 | 0.0001167 |
| Education_loan | 1.2333000 | 0.5425465 | 2.273169 | 0.0230160 |
| Emp_statusunemployed | 0.5517095 | 0.3352349 | 1.645740 | 0.0998174 |
| Amount | 0.0007966 | 0.0005100 | 1.561997 | 0.1182888 |
| Saving_amount | -0.0048470 | 0.0006068 | -7.987851 | 0.0000000 |
| Age | -0.6446881 | 0.0634579 | -10.159300 | 0.0000000 |
Some goodness-of-fit measures were checked to determine which of the two models was better, and then the parameter estimates, which are on the log-odds scale, were interpreted accordingly.
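A comparison of this kind might look like the sketch below, which runs on simulated data (the variables `x1`, `x2` and the true coefficients are invented). AIC and residual deviance are used here as example fit measures; that is an assumption, since the report does not say which measures were checked. The last step exponentiates the coefficients, which turns log-odds into odds ratios.

```r
# Simulated data: x1 affects the outcome, x2 is pure noise
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x1))
sim <- data.frame(y, x1, x2)

full    <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = sim)
reduced <- glm(y ~ x1,      family = binomial(link = "logit"), data = sim)

# Smaller AIC is better; residual deviance cannot increase when terms are added
gof <- data.frame(
  model    = c("full", "reduced"),
  deviance = c(deviance(full), deviance(reduced)),
  AIC      = c(AIC(full), AIC(reduced))
)
print(gof)

# Coefficients are on the log-odds scale; exp() gives odds ratios
odds.ratio <- exp(coef(reduced))
print(odds.ratio)
```

An odds ratio above 1 means the odds of the outcome rise as the predictor increases by one unit; below 1 means they fall.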
Now that new models are being built to determine which is best for prediction, the numeric variables listed below will be standardized, since parameter interpretation is no longer important.
Loan$Checking_amount = (Loan$Checking_amount - mean(Loan$Checking_amount))/sd(Loan$Checking_amount)
Loan$Term = (Loan$Term - mean(Loan$Term))/sd(Loan$Term)
Loan$Credit_score = (Loan$Credit_score - mean(Loan$Credit_score))/sd(Loan$Credit_score)
Loan$Age = (Loan$Age - mean(Loan$Age))/sd(Loan$Age)
Loan$No_of_credit_acc = (Loan$No_of_credit_acc - mean(Loan$No_of_credit_acc))/sd(Loan$No_of_credit_acc)
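The same z-score standardization can be written more compactly with base R's `scale()`. The sketch below uses a toy data frame with made-up columns `a`, `b`, and `g` (hypothetical names, not from the loan data) and checks the result against the manual formula.

```r
# Toy data frame standing in for the loan data (values are made up)
toy <- data.frame(
  a = c(10, 20, 30, 40, 50),
  b = c(1.5, 2.5, 2.0, 4.0, 3.0),
  g = c("x", "y", "x", "y", "x")   # categorical column, left untouched
)

num.vars <- c("a", "b")

# scale() centers each column at its mean and divides by its sd
toy[num.vars] <- lapply(toy[num.vars], function(v) as.numeric(scale(v)))

# Manual check against (v - mean(v)) / sd(v) for column a
manual.a <- (c(10, 20, 30, 40, 50) - 30) / sd(c(10, 20, 30, 40, 50))
print(all.equal(toy$a, manual.a))
```

Listing the columns once in `num.vars` avoids repeating the formula per variable and makes it easy to see which variables were standardized.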
Cross-validation (CV) is a good way to estimate how well a given model predicts new data. For both the full and reduced models generated earlier, five-fold CV will be used to estimate predictive performance.
First, the data set will be split into both a training set and test set. 80% of the observations will go into the training set and 20% will go into the test set.
n <- dim(Loan)[1]
train.n <- round(0.8*n)
train.id <- sample(1:n, train.n, replace = FALSE)
## training and testing data sets
train <- Loan[train.id, ]
test <- Loan[-train.id, ]
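Because `sample()` draws at random, the split, and every number computed from it, changes from run to run unless the random seed is fixed. A minimal sketch on a placeholder data frame follows; the seed value 2023 and the names `toy`, `toy.train`, and `toy.test` are arbitrary choices for illustration.

```r
# Placeholder data with 1000 rows, mirroring the size of the loan data
toy <- data.frame(id = 1:1000, y = rep(c(0, 1), times = 500))

set.seed(2023)                       # arbitrary seed, fixed for reproducibility
n        <- nrow(toy)
train.id <- sample(seq_len(n), size = round(0.8 * n), replace = FALSE)

toy.train <- toy[train.id, ]
toy.test  <- toy[-train.id, ]

# The two sets partition the data: 800 + 200 rows, no overlap
print(c(train = nrow(toy.train), test = nrow(toy.test)))
print(length(intersect(toy.train$id, toy.test$id)))
```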
And now, to run the CV:
k = 5
fold.size = round(dim(train)[1]/k)
## PE vectors for candidate models
PE1 = rep(0, k)
PE2 = rep(0, k)
for(i in 1:k){
  ## training and validation folds
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = train[valid.id, ]
  train.dat = train[-valid.id, ]
  ## full model
  candidate01 = glm(Default ~ ., family = binomial(link = "logit"),
                    data = train.dat)
  ## reduced model: backward selection, details not shown
  candidate02 = stepAIC(candidate01, direction = "backward", trace = 0)
  ## predicted probabilities of each candidate model
  pred01 = predict(candidate01, newdata = valid, type = "response")
  pred02 = predict(candidate02, newdata = valid, type = "response")
  ## predicted classes, coded 0/1 to match Default (cutoff 0.5)
  pre.outcome01 = ifelse(as.vector(pred01) > 0.5, 1, 0)
  pre.outcome02 = ifelse(as.vector(pred02) > 0.5, 1, 0)
  ## prediction error: proportion of misclassified validation cases
  PE1[i] = sum(pre.outcome01 != valid$Default)/length(pred01)
  PE2[i] = sum(pre.outcome02 != valid$Default)/length(pred02)
}
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2))
kable(avg.pe, caption = "Average of prediction errors of candidate models")
| PE1 | PE2 |
|---|---|
| 0 | 0 |
According to the table, the average prediction errors of the full and reduced models are the same. Since the reduced model has fewer variables, it is chosen as the final predictive model.
The prediction accuracy of the final model on the test set is given in the table below.
pred02 = predict(candidate02, newdata = test, type="response")
pred02.outcome = ifelse(as.vector(pred02) > 0.5, 1, 0) # coded 0/1 to match Default
accuracy = sum(pred02.outcome == test$Default)/length(pred02)
kable(accuracy, caption = "The actual accuracy of the final model")
| x |
|---|
| 0 |
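Overall accuracy does not show how the errors are split between missed defaults and false alarms, which a confusion matrix makes explicit. The sketch below uses made-up labels and probabilities (not output from the models above) with the same 0.5 cutoff; the names `conf.mat`, `acc`, `sens`, and `spec` are illustrative.

```r
# Made-up true labels and predicted probabilities (not from the loan data)
truth <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
prob  <- c(0.1, 0.4, 0.7, 0.2, 0.9, 0.6, 0.3, 0.05, 0.8, 0.55)

# Same 0.5 cutoff as in the report, coded 0/1 to match the response
pred <- ifelse(prob > 0.5, 1, 0)

conf.mat <- table(
  predicted = factor(pred,  levels = c(0, 1)),
  actual    = factor(truth, levels = c(0, 1))
)
print(conf.mat)

acc  <- sum(diag(conf.mat)) / sum(conf.mat)
sens <- conf.mat["1", "1"] / sum(conf.mat[, "1"])  # share of actual 1s predicted as 1
spec <- conf.mat["0", "0"] / sum(conf.mat[, "0"])  # share of actual 0s predicted as 0
print(c(accuracy = acc, sensitivity = sens, specificity = spec))
```

For loan data, sensitivity would correspond to the share of actual defaults that the model flags, which may matter more to a lender than overall accuracy.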
This study focused on predicting loan defaults. To find out which model was best for this purpose, two candidate models (the full model and the reduced model generated via backward stepwise selection) were compared using five-fold cross-validation. The reduced model was selected as the final model.