For this report, I will be analyzing bank loans and whether or not each loan was defaulted on. I will be using a data set that documents 1000 loans and 16 variables.
The variables in this data set are as follows:
1) Checking_amount - Numeric
2) Term (in months) - Numeric
3) Credit_score - Numeric
4) Gender - Categorical
5) Marital_status - Categorical
6) Car_loan (1 - has a car loan, 0 - does not have a car loan) - Numeric
7) Personal_loan (1 - has a personal loan, 0 - does not have a personal loan) - Numeric
8) Home_loan (1 - has a home loan, 0 - does not have a home loan) - Numeric
9) Education_loan (1 - has an education loan, 0 - does not have an education loan) - Numeric
10) Emp_status - Categorical
11) Amount - Numeric
12) Saving_amount - Numeric
13) Emp_duration (in months) - Numeric
14) Age (in years) - Numeric
15) No_of_credit_acc (number of credit accounts) - Numeric
16) Default (response variable; 0 if the loan was not defaulted on, 1 if it was) - Numeric
library(psych)  # pairs.panels()
library(knitr)  # kable()
library(MASS)   # stepAIC()
Loan <- read.csv("BankLoanDefaultDataset.csv")
Before the model is fitted, a pairwise scatterplot of the explanatory variables will be generated to check whether any of them raises an issue, such as strong skewness or high correlation with another variable.
pairs.panels(Loan[,-16],          # drop column 16 (Default, the response)
             method = "pearson",  # correlation method
             hist.col = "#00AFBB",
             density = TRUE,      # show density plots
             ellipses = TRUE      # show correlation ellipses
             )
Some of the distributions are slightly skewed, but I do not think any of them are skewed enough to be a concern, so the analysis will continue without any transformations for now.
Now, the full model will be fitted with Default as the response variable and the rest of the variables as the explanatory variables.
full.model <- glm(Default ~ ., family = binomial(link = "logit"), data = Loan)
kable(summary(full.model)$coef,
caption="Summary of inferential statistics of the full model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 39.6415229 | 4.7284136 | 8.3836834 | 0.0000000 |
| Checking_amount | -0.0050880 | 0.0006759 | -7.5283125 | 0.0000000 |
| Term | 0.1703676 | 0.0520728 | 3.2717189 | 0.0010690 |
| Credit_score | -0.0109793 | 0.0020746 | -5.2922299 | 0.0000001 |
| GenderMale | 0.1950806 | 0.5095698 | 0.3828338 | 0.7018430 |
| Marital_statusSingle | 0.3351480 | 0.4920120 | 0.6811786 | 0.4957585 |
| Car_loan | -0.6004643 | 2.7585197 | -0.2176763 | 0.8276814 |
| Personal_loan | -1.5540876 | 2.7585124 | -0.5633789 | 0.5731769 |
| Home_loan | -3.5684378 | 2.8457131 | -1.2539696 | 0.2098531 |
| Education_loan | 0.6498873 | 2.7894965 | 0.2329766 | 0.8157796 |
| Emp_statusunemployed | 0.5872532 | 0.3474376 | 1.6902407 | 0.0909819 |
| Amount | 0.0008026 | 0.0005114 | 1.5694940 | 0.1165329 |
| Saving_amount | -0.0048212 | 0.0006085 | -7.9224872 | 0.0000000 |
| Emp_duration | 0.0029178 | 0.0044391 | 0.6572906 | 0.5109941 |
| Age | -0.6475369 | 0.0646616 | -10.0142428 | 0.0000000 |
| No_of_credit_acc | -0.0968614 | 0.1006467 | -0.9623902 | 0.3358536 |
But what is the optimal model to use? Backward stepwise selection based on AIC can be used to decide which variables to keep.
final.model = stepAIC(full.model, direction = "backward", # backward elimination
                      trace = 0) # do not show the details
kable(summary(final.model)$coef,
      caption="Summary of inferential statistics of the final model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 39.0684818 | 3.8424298 | 10.167650 | 0.0000000 |
| Checking_amount | -0.0050961 | 0.0006725 | -7.577703 | 0.0000000 |
| Term | 0.1748348 | 0.0516008 | 3.388216 | 0.0007035 |
| Credit_score | -0.0108236 | 0.0020559 | -5.264700 | 0.0000001 |
| Personal_loan | -0.9656327 | 0.3346546 | -2.885460 | 0.0039084 |
| Home_loan | -2.9990106 | 0.7783531 | -3.853020 | 0.0001167 |
| Education_loan | 1.2333000 | 0.5425465 | 2.273169 | 0.0230160 |
| Emp_statusunemployed | 0.5517095 | 0.3352349 | 1.645740 | 0.0998174 |
| Amount | 0.0007966 | 0.0005100 | 1.561997 | 0.1182888 |
| Saving_amount | -0.0048470 | 0.0006068 | -7.987851 | 0.0000000 |
| Age | -0.6446881 | 0.0634579 | -10.159300 | 0.0000000 |
With an AIC score of 321.41, the new and reduced model is Default ~ Checking_amount + Term + Credit_score + Personal_loan + Home_loan + Education_loan + Emp_status + Amount + Saving_amount + Age.
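To see which terms backward elimination drops and in what order, the selection path stored by `stepAIC` can be inspected. The following is only a minimal sketch: it uses R's built-in `infert` data set instead of the loan data, so that it runs without the CSV file, and the names `toy.full` and `toy.final` are made up for the illustration.

```r
library(MASS)  # provides stepAIC()

# Illustrative data only: the built-in `infert` data set, NOT the loan data.
toy.full <- glm(case ~ age + parity + induced + spontaneous,
                family = binomial(link = "logit"), data = infert)

# Backward elimination: start from the full model and drop one term at a
# time for as long as dropping a term lowers the AIC.
toy.final <- stepAIC(toy.full, direction = "backward", trace = 0)

# The `anova` component records each step and the AIC after it.
print(toy.final$anova)

# Backward elimination by AIC never ends with a higher AIC than it started with.
print(c(full = AIC(toy.full), final = AIC(toy.final)))
```

The same `$anova` component is available on `final.model` above, assuming it was produced by `stepAIC` as shown.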
Next, we can check the goodness-of-fit measures for the two models.
# Collect global goodness-of-fit statistics from a fitted glm object
global.measure = function(s.logit){
  dev.resid = s.logit$deviance         # residual deviance of the fitted model
  dev.0.resid = s.logit$null.deviance  # deviance of the intercept-only model
  aic = s.logit$aic                    # Akaike information criterion
  goodness = cbind(Deviance.residual = dev.resid, Null.Deviance.Residual = dev.0.resid,
                   AIC = aic)
  goodness
}
goodness = rbind(full.model = global.measure(full.model),
                 final.model = global.measure(final.model))
row.names(goodness) = c("full.model", "final.model")
kable(goodness, caption = "Comparison of global goodness-of-fit statistics")
| | Deviance.residual | Null.Deviance.Residual | AIC |
|---|---|---|---|
| full.model | 297.6479 | 1221.729 | 329.6479 |
| final.model | 299.4105 | 1221.729 | 321.4105 |
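The numbers in this table can be cross-checked by hand. The sketch below makes two assumptions: the parameter counts (16 and 11) are read off the two coefficient tables, and the residual deviance equals minus twice the log-likelihood, which holds for ungrouped 0/1 responses such as Default. Under those assumptions, AIC is the deviance plus twice the number of estimated coefficients, and the difference in deviances gives a likelihood-ratio statistic for the five dropped terms. Because the reduced model was itself chosen by a selection procedure, the resulting p-value is best read as a rough check, not a formal test.

```r
# Residual deviances copied from the goodness-of-fit table above
dev <- c(full.model = 297.6479, final.model = 299.4105)

# Number of estimated coefficients (intercept included), counted from the
# two coefficient tables; this count is an assumption of the sketch
k <- c(full.model = 16, final.model = 11)

# For ungrouped binary data: AIC = residual deviance + 2 * (number of coefficients)
aic <- dev + 2 * k
print(aic)

# Likelihood-ratio statistic for the terms dropped by the stepwise procedure
lr.stat <- unname(dev["final.model"] - dev["full.model"])
df.diff <- unname(k["full.model"] - k["final.model"])
p.value <- pchisq(lr.stat, df = df.diff, lower.tail = FALSE)
print(c(LR = lr.stat, df = df.diff, p.value = p.value))
```

With the fitted objects in memory, `anova(final.model, full.model, test = "Chisq")` should give the same comparison directly.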
Now we can convert the slope estimates in the model, which are on the log-odds scale, to odds ratios by exponentiating them.
model.coef.stats = summary(final.model)$coef
odds.ratio = exp(coef(final.model))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | 39.0684818 | 3.8424298 | 10.167650 | 0.0000000 | 9.273124e+16 |
| Checking_amount | -0.0050961 | 0.0006725 | -7.577703 | 0.0000000 | 9.949169e-01 |
| Term | 0.1748348 | 0.0516008 | 3.388216 | 0.0007035 | 1.191049e+00 |
| Credit_score | -0.0108236 | 0.0020559 | -5.264700 | 0.0000001 | 9.892348e-01 |
| Personal_loan | -0.9656327 | 0.3346546 | -2.885460 | 0.0039084 | 3.807422e-01 |
| Home_loan | -2.9990106 | 0.7783531 | -3.853020 | 0.0001167 | 4.983640e-02 |
| Education_loan | 1.2333000 | 0.5425465 | 2.273169 | 0.0230160 | 3.432538e+00 |
| Emp_statusunemployed | 0.5517095 | 0.3352349 | 1.645740 | 0.0998174 | 1.736219e+00 |
| Amount | 0.0007966 | 0.0005100 | 1.561997 | 0.1182888 | 1.000797e+00 |
| Saving_amount | -0.0048470 | 0.0006068 | -7.987851 | 0.0000000 | 9.951647e-01 |
| Age | -0.6446881 | 0.0634579 | -10.159300 | 0.0000000 | 5.248262e-01 |
There are some interesting numbers in this table. The largest odds ratios belong to education loan and unemployment status. Holding all of the other variables constant, the odds of default for a person with an education loan are about 3.43 times the odds for a person without one, and the odds of default for an unemployed person are about 1.74 times the odds for an employed person, although the unemployment effect is not significant at the 5% level (p = 0.0998). These are ratios of odds, not of probabilities, so they should not be read as a person being that many times more likely to default.
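A point estimate of an odds ratio is easier to judge alongside an interval. As a minimal sketch, approximate 95% Wald intervals can be computed by exponentiating the estimate plus or minus `qnorm(0.975)` standard errors. The estimates and standard errors below are copied from the table above; the object names and the choice of a 50-point credit score change are made up for the illustration.

```r
# Estimates and standard errors copied from the final-model table above
est <- c(Education_loan = 1.2333000,
         Emp_statusunemployed = 0.5517095,
         Home_loan = -2.9990106)
se <- c(Education_loan = 0.5425465,
        Emp_statusunemployed = 0.3352349,
        Home_loan = 0.7783531)

# Wald interval on the log-odds scale, then exponentiate to the odds-ratio scale
z <- qnorm(0.975)
or.table <- cbind(odds.ratio = exp(est),
                  lower.95 = exp(est - z * se),
                  upper.95 = exp(est + z * se))
print(round(or.table, 3))

# For a numeric predictor, the odds ratio for a change of c units is exp(c * estimate).
# Example: a 50-point increase in Credit_score (the 50 is an arbitrary illustration).
or.credit.50 <- exp(50 * -0.0108236)
print(or.credit.50)
```

An interval that contains 1 corresponds to a coefficient that is not significant at the 5% level under the Wald test. With the fitted model in memory, `exp(confint.default(final.model))` should produce the same kind of interval for every coefficient.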
This study focused on potential contributing factors to loan default. The initial data set had 1000 observations and 16 variables.
Exploratory analysis was of limited use. A pairwise scatterplot was generated, but it was not very informative because the variable pairs most likely to be associated were mostly categorical, and no chi-square test of independence was run to check whether they are associated.
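For reference, a chi-square test of independence between two categorical variables could look like the sketch below. The data here are randomly generated toy values, not the loan data, and the level names are assumptions. On the real data, the table would be built from the corresponding columns of `Loan`.

```r
# Toy data only: randomly generated labels, NOT the loan data set
set.seed(1)
toy <- data.frame(
  Gender = sample(c("Female", "Male"), 200, replace = TRUE),
  Marital_status = sample(c("Married", "Single"), 200, replace = TRUE)
)

# Cross-tabulate the two categorical variables
tab <- table(toy$Gender, toy$Marital_status)
print(tab)

# Chi-square test of independence
# (for a 2x2 table, chisq.test applies Yates' continuity correction by default)
indep.test <- chisq.test(tab)
print(indep.test)
```

A small p-value would suggest that the two variables are associated, which is the kind of relationship a scatterplot of categorical variables cannot show well.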
The full model was fitted, and then using stepwise regression, the final model was generated with 10 explanatory variables.
This model is not intended for prediction, at least not yet. Adjustments will be made as needed when predictions are required.