1 Introduction

In this report, I analyze bank loans and whether or not each loan was defaulted on. The data set documents 1000 loans and contains 16 variables.

The variables in this data set are as follows:

  1. Checking_amount - Numeric
  2. Term (in months) - Numeric
  3. Credit_score - Numeric
  4. Gender - Categorical
  5. Marital_status - Categorical
  6. Car_loan (1 - Has car loan, 0 - Does not have car loan) - Numeric
  7. Personal_loan (1 - Has personal loan, 0 - Does not have personal loan) - Numeric
  8. Home_loan (1 - Has home loan, 0 - Does not have home loan) - Numeric
  9. Education_loan (1 - Has education loan, 0 - Does not have education loan) - Numeric
  10. Emp_status - Categorical
  11. Amount - Numeric
  12. Saving_amount - Numeric
  13. Emp_duration (in months) - Numeric
  14. Age (in years) - Numeric
  15. No_of_credit_account - Numeric
  16. Default (response variable; takes on values of 0 if loan was not defaulted and 1 if defaulted) - Numeric

The goal for this analysis is to explore any potential relationships between defaulting on a loan and any of the variables in this data set.

Loan <- read.csv("BankLoanDefaultDataset.csv")

2 Fitting the model

What interests me most is the potential relationship between defaulting and a borrower’s age, so I will fit a model with Default as the response variable and Age as the explanatory variable. Before fitting the model, we should check whether the distribution of Age is severely skewed.

# Histogram of Age on the density scale, with a smoothed kernel density overlaid
hist(Loan$Age, probability = TRUE, main = "Ages of Borrowers", xlab="", 
     col = "azure1", border="lightseagreen")
lines(density(Loan$Age, adjust=2), col="blue")

There is some skew, but not enough to be a concern, so Age will not be transformed and the model can be fitted as usual:
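One way to put a number on the visual impression of skew is the moment-based sample skewness coefficient. The sketch below is only an illustration: the `skewness` helper and the simulated `ages` vector (including its mean and standard deviation) are assumptions introduced here and are not part of the original analysis. With the real data, `Loan$Age` would take the place of `ages`.

```r
# Moment-based sample skewness (hypothetical helper, not from the report)
skewness <- function(x) {
  x  <- x[!is.na(x)]           # ignore missing values
  m  <- mean(x)
  m2 <- mean((x - m)^2)        # second central moment
  m3 <- mean((x - m)^3)        # third central moment
  m3 / m2^(3/2)
}

# Simulated stand-in for Loan$Age (assumed values, for illustration only)
set.seed(1)
ages <- round(rnorm(1000, mean = 31, sd = 4))
skewness(ages)
```

A value near 0 indicates a roughly symmetric distribution. A common rule of thumb treats absolute values below about 1 as mild skew, though that cutoff is a convention and not a formal test.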

# Logistic regression: binomial family with the logit link, logit(p) = log(p/(1-p))
# Loan is the full data set of 1000 loans read in above
simple.logit = glm(Default ~ Age, 
          family = binomial(link = "logit"), data = Loan)
# summary(simple.logit)

Next, we can check the coefficient estimates of the model, as well as a 95% confidence interval for the estimates:

library(knitr)                                      # provides kable()
model.coef.stats = summary(simple.logit)$coef       # estimates, standard errors, z values, p-values
conf.ci = confint(simple.logit)                     # 95% confidence intervals
## Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)   # append the interval columns
kable(sum.stats,caption = "The summary stats of regression coefficients")  

The summary stats of regression coefficients

              Estimate     Std. Error   z value     Pr(>|z|)   2.5 %       97.5 %
(Intercept)   18.7147703   1.264302     14.80245    0          16.343467   21.3066192
Age           -0.6516579   0.042711     -15.25738   0          -0.739301   -0.5716309
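For a `glm` fit, `confint()` reports profile-likelihood intervals, which is why the profiling message appears. As a rough cross-check, a Wald interval (estimate plus or minus a normal quantile times the standard error) can be computed by hand from the table. The sketch below copies the estimates and standard errors from the output above; the variable names are introduced here for illustration.

```r
# Estimates and standard errors copied from the summary table
est <- c("(Intercept)" = 18.7147703, Age = -0.6516579)
se  <- c("(Intercept)" = 1.264302,   Age = 0.042711)

z <- qnorm(0.975)                                   # two-sided 95% normal quantile
wald.ci <- cbind(lower = est - z * se, upper = est + z * se)
wald.ci
```

With the fitted model available, `confint.default(simple.logit)` should give the same Wald intervals. The two kinds of interval are expected to be close but not identical; both exclude zero for Age here.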

Both the estimate and its confidence interval, which lies entirely below zero, show that Age is negatively associated with the chance of defaulting on a loan. Because the p-value is essentially 0, we can conclude that the relationship between the two is statistically significant.
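Written out with the estimates from the table (rounded to three decimals), the fitted model has the form below. The symbol \hat{p} is notation introduced here for the estimated probability of default.

```latex
\[
\operatorname{logit}(\hat{p})
  = \log\frac{\hat{p}}{1-\hat{p}}
  = 18.715 - 0.652\,\mathrm{Age},
\qquad
\hat{p} = \frac{\exp(18.715 - 0.652\,\mathrm{Age})}{1 + \exp(18.715 - 0.652\,\mathrm{Age})}.
\]
```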

Next, the coefficient estimates will be converted into odds-ratios:

model.coef.stats = summary(simple.logit)$coef
odds.ratio = exp(coef(simple.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios

              Estimate     Std. Error   z value     Pr(>|z|)   odds.ratio
(Intercept)   18.7147703   1.264302     14.80245    0          1.341904e+08
Age           -0.6516579   0.042711     -15.25738   0          5.211810e-01

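For each additional year of Age, the odds of defaulting on a loan are multiplied by about 0.52, a decrease of about 48%. This is a change in the odds, not in the probability; the change in probability for one extra year depends on the borrower’s age.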
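The arithmetic behind the odds ratio can be checked directly from the coefficient table. The sketch below also illustrates that the 48% figure describes the odds: the change in probability for one extra year differs from age to age. The helper `p.at` and the ages 28, 29, 39, and 40 are arbitrary choices made here for illustration.

```r
# Coefficients copied from the summary table
beta0    <- 18.7147703
beta.age <- -0.6516579

odds.ratio      <- exp(beta.age)             # multiplicative change in odds per year
pct.change.odds <- 100 * (odds.ratio - 1)    # percent change in odds per year
c(odds.ratio = odds.ratio, pct.change.odds = pct.change.odds)

# Estimated default probability at a given age (hypothetical helper)
p.at <- function(age) plogis(beta0 + beta.age * age)

# The change in probability for one extra year is not constant across ages
p.at(29) - p.at(28)
p.at(40) - p.at(39)
```

The odds ratio is the same at every age, while the two probability differences are not, which is why the interpretation is stated in terms of odds.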

Finally, we can plot the success probability curve (or S curve) to check both the probabilities of defaulting on a loan as well as the rate of change of the probabilities.

Age.range = range(Loan$Age)
x = seq(Age.range[1], Age.range[2], length = 200)
beta.x = coef(simple.logit)[1] + coef(simple.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(simple.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The Probability of \n  Defaulting on a Loan", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Age",
     ylab = "Probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
# legend(30, 1, c("Success Probability", "Failure Probability"), lwd = rep(2,2), 
#       col = c("navy", "darkred"), cex = 0.7, bty = "n")
##
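y.rate = min(success.prob.rate)   # most negative rate, i.e. the steepest decline
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
     main = "The Rate of Change in the Probability \n  of Defaulting on a Loan", 
     xlab = "Age",
     ylab = "Rate of Change",
     ylim = c(1.1*y.rate, 0),     # the rate is negative because the Age coefficient is negative
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8
     )
axis(1)
axis(2)

On the left, the curve has a clear negative slope: the estimated probability of defaulting falls as Age increases. On the right, the rate of change is negative at every age because the coefficient on Age is negative, so the y-axis has to run from the most negative rate up to 0. Limits of the form c(0, k*max(success.prob.rate)) cover only non-negative values, which is why multipliers of 10, 20, 50, and even 100 never captured the curve. The turning point is where the estimated probability equals 0.5, at an age of about 18.71/0.652 ≈ 28.7 years, where the rate is about -0.652/4 ≈ -0.163 per year. It is visible in the plot only if that age falls within the observed range of Age.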
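The location of the steepest point can also be worked out from the coefficients alone. For a logistic curve the rate of change is beta1*p*(1-p), which is largest in magnitude where p = 0.5, that is, at Age = -beta0/beta1. The sketch below copies the estimates from the coefficient table; the variable names are introduced here for illustration.

```r
# Coefficients copied from the summary table
b0 <- 18.7147703
b1 <- -0.6516579

age.star  <- -b0 / b1                        # age at which the fitted probability is 0.5
p.star    <- plogis(b0 + b1 * age.star)      # fitted probability at that age
rate.star <- b1 * p.star * (1 - p.star)      # rate of change there, equal to b1/4

c(age = age.star, prob = p.star, rate = rate.star)
```

Whether this point shows up in the plot depends on whether `age.star` lies inside `range(Loan$Age)`, which can be checked against the data.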

3 Summary and Conclusion

This study focused on the potential relationship between Age and the probability of defaulting on a loan. A logistic regression model was fitted and the coefficients were then converted to odds ratios. A one-year increase in a borrower’s age was estimated to lower the odds of defaulting by about 48%.

An S curve was then plotted and showed that the estimated probability of defaulting decreases markedly as borrowers get older.