Similar to Hometap’s model, home equity loans, while slower and less-customer oriented, depend on the credit level of the borrower. Below, I analyze a dataset of loan repayment results–completed payment or defaulted–and predictor variables, from the borrower’s debt to income ratio to the value of his or her current property. The CreditRiskAnalytics.com dataset provides 5960 loans with insightful categorical and quantitative predictors, which can be extrapolated to predict future loan status results.
Numerical Insights: About 1/5 of the loan recipients either defaulted or were seriously delayed in their payments. As most home equity borrowers made payments for their loans on time, the majority of those recipients had a Debt to Income Ratio below the typical threshold of 36%. For context, most loans fall around the $15,000 mark, ranging from $1,100 to $89,900. For a breakdown of the jobs held by the loan recipients, the 3D pie chart shows that the largest defined category is professional executives.
#Percentage of Loans Defaulted On
mean(~BAD, data=hmeq)
## [1] 0.1994966
#Distribution of Debt to Income Ratios
favstats(~DEBTINC, data=hmeq)
## min Q1 median Q3 max mean sd n missing
## 0.5244992 29.14003 34.81826 39.00314 203.3121 33.77992 8.601746 4693 1267
#Boxplot of Loan Amounts
gf_boxplot(~LOAN, data=hmeq)
#3D Pie Chart of Job Categories
labels <- c("Manager", "Office", "Executive", "Sales", "Other", "Self-Employed")
pie3D(c(767, 948, 1276, 109, 2388, 193), labels=labels, explode=0.1)
While stand-alone predictor variables might not tell a comprehensive story, finding an overall model of default probability paints a bigger picture of any relationships at play in this case. Multiple logistic regression predicts the estimated effect on the log odds of an outcome when predictor values change; using an automated selection techinque, we can generate a model that best predicts whether a borrower defaults based on our in-hand data.
After numerous tests and selection techniques, the best model yielded statistically significant predictors: Age in Months of Oldest Credit Source, Debt to Income Ratio, and Loan Size. A one point increase in the Debt/Income ratio predicts an 11.2% increase in default probability, while an additional year on one’s oldest credit line predicts a 7% decrease in default probability. Defaulting is not as skewed by loan size, as a one thousand dollar increase in loan size only predicts a 2% increase in default probability. To test the model, one random observation was selected: as its predicted probability was less than 50% and the borrower in question did not default, the model passes that test. To determine the model’s value, the data should be evaluated using a test/train function to complete the previous test for all observations.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
hmeq <- hmeq %>%
clean_names()
basemodel <- glm(family="binomial", data=hmeq, bad~debtinc+mortdue+clno)
summary(basemodel)
##
## Call:
## glm(formula = bad ~ debtinc + mortdue + clno, family = "binomial",
## data = hmeq)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7182 -0.4728 -0.3855 -0.2856 3.4406
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.381e+00 3.316e-01 -16.228 <2e-16 ***
## debtinc 9.536e-02 8.544e-03 11.161 <2e-16 ***
## mortdue -2.230e-06 1.401e-06 -1.592 0.1115
## clno -1.126e-02 6.124e-03 -1.838 0.0661 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2477.8 on 4183 degrees of freedom
## Residual deviance: 2304.7 on 4180 degrees of freedom
## (1776 observations deleted due to missingness)
## AIC: 2312.7
##
## Number of Fisher Scoring iterations: 6
#Find five best models based on AIC (relative prediction quality)
library(glmulti)
## Loading required package: rJava
glmulti.lm.out <-
glmulti(bad ~ debtinc + loan + mortdue + value + yoj + clage + ninq, data = hmeq,
level = 1,
method = "h",
crit = "aic",
confsetsize = 5,
plotty = F, report = F,
fitfunction = "glm")
glmulti.lm.out@formulas
## [[1]]
## bad ~ 1 + debtinc + loan + mortdue + value + yoj + clage
## <environment: 0x7f9a1f216c00>
##
## [[2]]
## bad ~ 1 + debtinc + loan + mortdue + value + clage + ninq
## <environment: 0x7f9a1f216c00>
##
## [[3]]
## bad ~ 1 + debtinc + loan + mortdue + value + clage
## <environment: 0x7f9a1f216c00>
##
## [[4]]
## bad ~ 1 + debtinc + mortdue + value + yoj + clage
## <environment: 0x7f9a1f216c00>
##
## [[5]]
## bad ~ 1 + debtinc + mortdue + value + clage
## <environment: 0x7f9a1f216c00>
#Compare two best models
model1 <- glm(bad~debtinc + loan + mortdue + value + yoj + clage, family="binomial", data=hmeq)
summary(model1)
##
## Call:
## glm(formula = bad ~ debtinc + loan + mortdue + value + yoj +
## clage, family = "binomial", data = hmeq)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8848 -0.4463 -0.3458 -0.2373 3.9660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.833e+00 3.874e-01 -12.475 < 2e-16 ***
## debtinc 1.062e-01 9.536e-03 11.138 < 2e-16 ***
## loan -1.875e-05 6.882e-06 -2.725 0.00643 **
## mortdue -6.321e-06 3.394e-06 -1.863 0.06252 .
## value 4.915e-06 2.804e-06 1.753 0.07967 .
## yoj -1.283e-02 9.200e-03 -1.394 0.16326
## clage -6.050e-03 9.187e-04 -6.585 4.54e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2240.1 on 3915 degrees of freedom
## Residual deviance: 1988.7 on 3909 degrees of freedom
## (2044 observations deleted due to missingness)
## AIC: 2002.7
##
## Number of Fisher Scoring iterations: 6
model2 <- glm(bad~debtinc + loan + value + yoj + clage, family="binomial", data=hmeq)
summary(model2)
##
## Call:
## glm(formula = bad ~ debtinc + loan + value + yoj + clage, family = "binomial",
## data = hmeq)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8575 -0.4488 -0.3482 -0.2339 4.0181
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.712e+00 3.604e-01 -13.074 < 2e-16 ***
## debtinc 9.627e-02 8.812e-03 10.924 < 2e-16 ***
## loan -1.410e-05 6.338e-06 -2.225 0.0261 *
## value 2.154e-06 1.001e-06 2.152 0.0314 *
## yoj -1.191e-02 8.902e-03 -1.338 0.1808
## clage -6.217e-03 8.884e-04 -6.998 2.6e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2353.5 on 4126 degrees of freedom
## Residual deviance: 2106.0 on 4121 degrees of freedom
## (1833 observations deleted due to missingness)
## AIC: 2118
##
## Number of Fisher Scoring iterations: 6
#Drop In Deviance Test of One Predictor in Question: Mortgage Due.
g.value.mortdue <- model2$aic - model1$aic
xpchisq(q=g.value.mortdue, df=1)
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## [1] 1
#Mortgage Due is significant at alpha of 0.001 according to the xpchisq function.
#It will be kept in final model.
#Final model
finalmodel <- glm(bad~debtinc + loan + mortdue + value + yoj + clage, family="binomial", data=hmeq)
summary(finalmodel)
##
## Call:
## glm(formula = bad ~ debtinc + loan + mortdue + value + yoj +
## clage, family = "binomial", data = hmeq)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8848 -0.4463 -0.3458 -0.2373 3.9660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.833e+00 3.874e-01 -12.475 < 2e-16 ***
## debtinc 1.062e-01 9.536e-03 11.138 < 2e-16 ***
## loan -1.875e-05 6.882e-06 -2.725 0.00643 **
## mortdue -6.321e-06 3.394e-06 -1.863 0.06252 .
## value 4.915e-06 2.804e-06 1.753 0.07967 .
## yoj -1.283e-02 9.200e-03 -1.394 0.16326
## clage -6.050e-03 9.187e-04 -6.585 4.54e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2240.1 on 3915 degrees of freedom
## Residual deviance: 1988.7 on 3909 degrees of freedom
## (2044 observations deleted due to missingness)
## AIC: 2002.7
##
## Number of Fisher Scoring iterations: 6
#Computing predicted odds
debtincodds <- exp(1.062126e-01)
loanodds <- exp(1000*1.875197e-05)
clageodds <- exp(12*-6.049683e-03)
Odds <- c(debtincodds, loanodds, clageodds)
Predictors <- c("Debt to Income Ratio", "Loan Size (Thousands)", "Years on Oldest Credit Line")
table <- data.frame(Predictors, Odds)
library(DT)
datatable(table, options = list(pageLength = 3))
#Testing the model
#Three random observation numbers, choose the first
runif(3, min = 0, max = 100)
## [1] 47.724276 1.815805 65.033285
#Test of "nth" observation:
new_data1 <- data.frame(loan = 3600, debtinc = 41.5163897, mortdue = 61327, value = 76484, clage=202.51078, yoj=9)
predicted_logodds <- predict(finalmodel, new_data1)
odds <- exp(predicted_logodds)
odds / (1+odds)
## 1
## 0.1367019
#49 percent chance of default: borrower did not default, so it is a correct prediction.
#3D look at all cases
x <- hmeq$loan
y <- hmeq$debtinc
z <- hmeq$clage
library(scatterplot3d)
plot3D <- scatterplot3d(x,y,z,highlight.3d = TRUE, pch = 19, type = "h", xlab = "Loan Size", ylab = "Debt to Income Ratio", zlab = "CLAGE")