1 Introduction

For this report, I will analyze bank loans and whether or not each loan was defaulted on. I will use a data set that documents 1000 loans and 16 variables.

The variables in this data set are as follows:

1) Checking_amount - Numeric
2) Term (in months) - Numeric
3) Credit_score - Numeric
4) Gender - Categorical
5) Marital_status - Categorical
6) Car_loan (1 - Own car loan, 0 - Does not own car loan) - Numeric
7) Personal_loan (1 - Own Personal loan, 0 - Does not own Personal loan) - Numeric
8) Home_loan (1 - Own Home loan, 0 - Does not own Home loan) - Numeric
9) Education_loan (1 - Own Education loan, 0 - Does not own Education loan) - Numeric
10) Emp_status - Categorical
11) Amount - Numeric
12) Saving_amount - Numeric
13) Emp_duration (in months) - Numeric
14) Age (in years) - Numeric
15) No_of_credit_account - Numeric
16) Default (response variable; takes on values of 0 if loan was not defaulted and 1 if defaulted) - Numeric

Loan <- read.csv("BankLoanDefaultDataset.csv")
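Before modeling, it may help to confirm that the columns were read with the expected types and without missing values. The sketch below is only an illustration: `summarize_columns()` is a hypothetical helper and `toy` is a small made-up data frame standing in for the real file, so that the code runs on its own. With the real data, `Loan` would be passed instead.

```r
# Hypothetical helper: report the class and the number of missing values
# for each column of a data frame.
summarize_columns <- function(df) {
  data.frame(
    variable  = names(df),
    class     = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for the loan data (values are invented for illustration).
toy <- data.frame(
  Checking_amount = c(988, 458, NA, 321),
  Gender          = c("Female", "Male", "Male", "Female"),
  Default         = c(0L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

col_summary <- summarize_columns(toy)
print(col_summary)

# With the real data (assumes Loan has been loaded as above):
# summarize_columns(Loan)
```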

2 Exploratory Data Analysis

Before the model is fitted, a pairwise scatterplot matrix of the explanatory variables (every column except the response, Default) will be generated to check for problems such as strongly skewed distributions or strongly correlated explanatory variables.

library(psych)  # provides pairs.panels()
pairs.panels(Loan[,-16],         # drop column 16 (Default, the response)
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
)

Some of the distributions are slightly skewed, but none appears skewed enough to cause concern, so the analysis will continue without any transformations for now.
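The visual judgment about skewness could be backed up with a number. The sketch below computes a moment-based sample skewness; `sample_skewness()` is a hypothetical helper written for this illustration, and the two vectors are invented. Values near 0 suggest a roughly symmetric distribution, while large positive values suggest a long right tail.

```r
# Moment-based sample skewness: near 0 for symmetric data,
# positive for a long right tail, negative for a long left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # standard deviation with divisor n
  mean((x - m)^3) / s^3
}

# Invented vectors for illustration only.
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 20)

skew_sym   <- sample_skewness(symmetric)
skew_right <- sample_skewness(right_skewed)
print(c(symmetric = skew_sym, right_skewed = skew_right))

# With the real data, something like the following could be applied
# to the numeric columns (assumes Loan has been loaded):
# sapply(Loan[sapply(Loan, is.numeric)], sample_skewness)
```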

3 Fitting the model

3.1 Full Model

Now, the full model will be fitted with Default as the response variable and the rest of the variables as the explanatory variables.
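For reference, a logistic regression of this kind can be written in the general form below. The symbols are notation introduced here for illustration: $p$ is the probability that Default equals 1, $x_1, \dots, x_k$ are the explanatory variables (with dummy variables for the categorical ones), and $\beta_0, \dots, \beta_k$ are the coefficients to be estimated.

```latex
\[
  \log\frac{p}{1-p} \;=\; \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
\]
```

Each estimated coefficient is therefore a change in the log-odds of default for a one-unit increase in the corresponding variable, with the other variables held constant.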

library(knitr)  # provides kable()
# Logistic regression of Default on all other variables
full.model <- glm(Default ~ ., family = binomial(link = "logit"), data = Loan)
kable(summary(full.model)$coef, 
      caption="Summary of inferential statistics of the full model")
Summary of inferential statistics of the full model

                          Estimate  Std. Error     z value    Pr(>|z|)
(Intercept)             39.6415229   4.7284136   8.3836834   0.0000000
Checking_amount         -0.0050880   0.0006759  -7.5283125   0.0000000
Term                     0.1703676   0.0520728   3.2717189   0.0010690
Credit_score            -0.0109793   0.0020746  -5.2922299   0.0000001
GenderMale               0.1950806   0.5095698   0.3828338   0.7018430
Marital_statusSingle     0.3351480   0.4920120   0.6811786   0.4957585
Car_loan                -0.6004643   2.7585197  -0.2176763   0.8276814
Personal_loan           -1.5540876   2.7585124  -0.5633789   0.5731769
Home_loan               -3.5684378   2.8457131  -1.2539696   0.2098531
Education_loan           0.6498873   2.7894965   0.2329766   0.8157796
Emp_statusunemployed     0.5872532   0.3474376   1.6902407   0.0909819
Amount                   0.0008026   0.0005114   1.5694940   0.1165329
Saving_amount           -0.0048212   0.0006085  -7.9224872   0.0000000
Emp_duration             0.0029178   0.0044391   0.6572906   0.5109941
Age                     -0.6475369   0.0646616 -10.0142428   0.0000000
No_of_credit_acc        -0.0968614   0.1006467  -0.9623902   0.3358536

3.2 The optimal model

But what is the optimal model to use? Stepwise regression, here backward elimination based on AIC, can be used to determine which variables to keep.

library(MASS)  # provides stepAIC()
final.model = stepAIC(full.model, direction = "backward",   # backward elimination
                      trace = 0)   # do not show the details
kable(summary(final.model)$coef, 
      caption="Summary of inferential statistics of the final model")
Summary of inferential statistics of the final model

                          Estimate  Std. Error     z value    Pr(>|z|)
(Intercept)             39.0684818   3.8424298   10.167650   0.0000000
Checking_amount         -0.0050961   0.0006725   -7.577703   0.0000000
Term                     0.1748348   0.0516008    3.388216   0.0007035
Credit_score            -0.0108236   0.0020559   -5.264700   0.0000001
Personal_loan           -0.9656327   0.3346546   -2.885460   0.0039084
Home_loan               -2.9990106   0.7783531   -3.853020   0.0001167
Education_loan           1.2333000   0.5425465    2.273169   0.0230160
Emp_statusunemployed     0.5517095   0.3352349    1.645740   0.0998174
Amount                   0.0007966   0.0005100    1.561997   0.1182888
Saving_amount           -0.0048470   0.0006068   -7.987851   0.0000000
Age                     -0.6446881   0.0634579  -10.159300   0.0000000

With an AIC score of 321.41, the new and reduced model is Default ~ Checking_amount + Term + Credit_score + Personal_loan + Home_loan + Education_loan + Emp_status + Amount + Saving_amount + Age.
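To illustrate how backward elimination by AIC behaves, here is a minimal sketch on simulated data rather than the loan data. All variable names (`x1`, `x2`, `noise`, `sim`) and the simulated effect sizes are assumptions made for this example. Only `x1` is truly related to the outcome, so the procedure would be expected to retain it, and because a term is dropped only when doing so lowers AIC, the reduced model's AIC cannot exceed that of the full model.

```r
library(MASS)  # provides stepAIC()

set.seed(1)
n     <- 500
x1    <- rnorm(n)   # truly related to the outcome
x2    <- rnorm(n)   # unrelated to the outcome
noise <- rnorm(n)   # unrelated to the outcome
y     <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1))
sim   <- data.frame(y, x1, x2, noise)

# Full logistic model, then backward elimination based on AIC.
full_fit    <- glm(y ~ ., family = binomial(link = "logit"), data = sim)
reduced_fit <- stepAIC(full_fit, direction = "backward", trace = 0)

print(formula(reduced_fit))
print(c(full = AIC(full_fit), reduced = AIC(reduced_fit)))
```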

3.3 Goodness of Fit

Next, we can check the goodness-of-fit measures for the two models.

global.measure = function(s.logit){
  dev.resid = s.logit$deviance
  dev.0.resid = s.logit$null.deviance
  aic = s.logit$aic
  goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
                   AIC = aic)
  goodness
}
goodness=rbind(full.model = global.measure(full.model),
               final.model=global.measure(final.model))
row.names(goodness) = c("full.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")
Comparison of global goodness-of-fit statistics

              Deviance.residual  Null.Deviance.Residual       AIC
full.model             297.6479                1221.729  329.6479
final.model            299.4105                1221.729  321.4105
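The numbers in this table can be cross-checked by hand. For a logistic regression with a 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients (16 in the full model and 11 in the final model, counting the intercept). The difference in residual deviance between the two nested models can also be compared with a chi-square distribution, which is a likelihood-ratio test of the five dropped terms. The sketch below uses only values copied from the tables in this report; the variable names are introduced here for illustration.

```r
# Values copied from the goodness-of-fit table in this report.
dev_full  <- 297.6479
dev_final <- 299.4105
k_full    <- 16   # estimated coefficients in the full model (incl. intercept)
k_final   <- 11   # estimated coefficients in the final model (incl. intercept)

# AIC = residual deviance + 2 * (number of coefficients) for a 0/1 response.
aic_full  <- dev_full  + 2 * k_full
aic_final <- dev_final + 2 * k_final

# Likelihood-ratio test of the terms removed by backward elimination.
lr_stat <- dev_final - dev_full
df_diff <- k_full - k_final
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

print(c(aic_full = aic_full, aic_final = aic_final))
print(c(lr_stat = lr_stat, df = df_diff, p_value = p_value))
```

A large p-value here would suggest that the dropped variables do not significantly improve the fit, which is consistent with the lower AIC of the final model.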

3.4 Converting parameter estimates to odds ratios

The slope estimates in the model are on the log-odds scale. Exponentiating them converts them to odds ratios.

model.coef.stats = summary(final.model)$coef
odds.ratio = exp(coef(final.model))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios

                          Estimate  Std. Error     z value    Pr(>|z|)    odds.ratio
(Intercept)             39.0684818   3.8424298   10.167650   0.0000000  9.273124e+16
Checking_amount         -0.0050961   0.0006725   -7.577703   0.0000000  9.949169e-01
Term                     0.1748348   0.0516008    3.388216   0.0007035  1.191049e+00
Credit_score            -0.0108236   0.0020559   -5.264700   0.0000001  9.892348e-01
Personal_loan           -0.9656327   0.3346546   -2.885460   0.0039084  3.807422e-01
Home_loan               -2.9990106   0.7783531   -3.853020   0.0001167  4.983640e-02
Education_loan           1.2333000   0.5425465    2.273169   0.0230160  3.432538e+00
Emp_statusunemployed     0.5517095   0.3352349    1.645740   0.0998174  1.736219e+00
Amount                   0.0007966   0.0005100    1.561997   0.1182888  1.000797e+00
Saving_amount           -0.0048470   0.0006068   -7.987851   0.0000000  9.951647e-01
Age                     -0.6446881   0.0634579  -10.159300   0.0000000  5.248262e-01

There are some interesting numbers in this table. Apart from the intercept, the largest odds ratios come from education-loan status and unemployment status. Holding all of the other variables constant, the odds of default for a person who has an education loan are about 3.43 times the odds for a person who does not, and the odds of default for an unemployed person are about 1.74 times the odds for a person in the reference employment category. These are ratios of odds, not of probabilities, so they should not be read as "x times more likely". Note also that the unemployment effect is not statistically significant at the 5% level (p = 0.0998).
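The arithmetic behind one of these odds ratios can be sketched as follows, using the Education_loan estimate and standard error copied from the table above. The Wald confidence interval and the baseline default probability of 10% are assumptions added for illustration; neither is reported in the original analysis. The last lines show why an odds ratio of about 3.43 does not translate into a 3.43-fold increase in probability.

```r
# Estimate and standard error for Education_loan, copied from the table above.
beta <- 1.2333000
se   <- 0.5425465

# Odds ratio and an approximate 95% Wald confidence interval.
odds_ratio <- exp(beta)
ci_95      <- exp(beta + c(-1, 1) * qnorm(0.975) * se)

# An odds ratio multiplies odds, not probabilities.
baseline_prob <- 0.10                                # hypothetical baseline
baseline_odds <- baseline_prob / (1 - baseline_prob)
new_odds      <- baseline_odds * odds_ratio
new_prob      <- new_odds / (1 + new_odds)

print(c(odds_ratio = odds_ratio, lower = ci_95[1], upper = ci_95[2]))
print(c(baseline_prob = baseline_prob, new_prob = new_prob,
        prob_ratio = new_prob / baseline_prob))
```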

4 Summary and Conclusions

This study focused on potential contributing factors to loan default. The initial data set had 1000 observations and 16 variables.

Exploratory analysis was of limited use. A pairwise scatterplot matrix was generated, but it was not very informative because the variable pairs most likely to be associated were mostly categorical, and no chi-square test of independence was run to check whether they are associated.

The full model was fitted, and then backward stepwise selection based on AIC was used to produce the final model with 10 explanatory variables.

This model is not yet intended for prediction. Adjustments will be made as necessary when predictions are needed.