When a bank lends money to an individual or a business, it may transfer the entire loan amount to the borrower, who pays it back in installments over time. In some cases the borrower does not pay back, i.e., defaults on the loan. This report considers that repayment indicator and studies its relationship with other variables. Our goal is to build a statistical model that predicts loan default from information available at the time of application, so that the bank can evaluate the risk of granting credit and ultimately minimize its losses. We use a binomial regression model within the generalized linear modeling (GLM) framework, with different link functions, to predict the response variable ‘repay_fail’ (0 = no default, 1 = default) from the 36 potential predictor variables in the data set.
In this section, all the packages required for this report are loaded into R.
The data set is loaded into R and its dimensions are printed.
## New names:
## Rows: 38480 Columns: 37
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (14): term, emp_length, home_ownership, verification_status, issue_d, lo... dbl
## (23): ...1, id, member_id, loan_amnt, funded_amnt, funded_amnt_inv, int_...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## [1] 38480 37
The response variable is repay_fail and there are 36 other predictor/explanatory variables; however, most of them are likely to be irrelevant (statistically insignificant) for predicting repay_fail. We only consider variables that are known at the time the loan application is made; post-loan variables are not considered. In our view, the candidate predictors of the response variable are therefore: loan_amnt, funded_amnt, funded_amnt_inv, term, int_rate, installment, emp_length, home_ownership and annual_inc. Since loan_amnt and funded_amnt carry very similar information, that is, both have a similar effect on the response variable, one of them (funded_amnt) is dropped from further investigation in order to avoid collinearity in the model.
This section covers the data manipulation needed to clean the data: checking that the variables have proper data types, dealing with missing values, and exploring the statistical and structural characteristics of the data.
The first step in data pre-processing is to clean the dataset so that all further calculations run without problems. For this purpose, only the relevant columns are selected and all irrelevant variables are excluded from the data.
## [1] 38480 9
Here, we have a dataset containing 38,480 rows (observations) and 9 columns (variables).
The data type is one of the most important properties to check. For example, a categorical variable must be of class factor, so all categorical variables in the dataset are converted with the as.factor() function. The structure function str() in R is helpful for a thorough inspection of the data types.
The data structure shows 3 character variables that should be factors because they are qualitative, so all of them are converted to factor objects.
## [1] 1
## emp_length
## repay_fail < 1 year 1 year 10+ years 2 years 3 years 4 years 5 years 6 years
## 0 3883 2762 7103 3707 3361 2837 2694 1826
## 1 682 492 1362 585 578 477 477 318
## emp_length
## repay_fail 7 years 8 years 9 years n/a
## 0 1441 1229 1034 774
## 1 261 216 162 219
## home_ownership
## repay_fail MORTGAGE NONE OTHER OWN RENT
## 0 14692 3 96 2497 15363
## 1 2448 1 29 461 2890
## term
## repay_fail 36 months 60 months
## 0 25072 7579
## 1 3521 2308
In this data set, emp_length has a number of missing values represented by “n/a”. Cells containing “n/a” are therefore treated as missing values (NA) and the corresponding rows are omitted from the dataset. In addition, observations with an interest rate greater than 80% are removed, because charging interest at such a level is not a realistic option for a bank these days. Finally, the first row has values of zero in some columns, so this row is removed from the dataset.
Depending on the research purpose, the number of missing values and other factors, one can either impute the missing values (replace them with a suitable statistic) or exclude the rows that contain them. Here, the latter approach is used to clean the data.
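The cleaning steps described above can be sketched in base R as follows. This is only an illustration on a small made-up data frame (the values and the object name `loans` are assumptions, not the report's data or code); the column names follow the report.

```r
# Hypothetical toy data; column names follow the report, values are made up.
loans <- data.frame(
  repay_fail = c(0, 0, 1, 0, 1),
  emp_length = c("n/a", "2 years", "10+ years", "n/a", "< 1 year"),
  int_rate   = c(7.5, 11.5, 95, 9.8, 14.2),
  stringsAsFactors = FALSE
)

# Recode the "n/a" strings as proper missing values.
loans$emp_length[loans$emp_length == "n/a"] <- NA

# Exclude rows with any missing value (row deletion rather than imputation).
loans <- na.omit(loans)

# Remove implausible interest rates (greater than 80%).
loans <- loans[loans$int_rate <= 80, ]

# Convert the qualitative column to a factor after cleaning,
# so that no empty "n/a" level remains.
loans$emp_length <- as.factor(loans$emp_length)

str(loans)
```

Converting to a factor after the missing values are removed avoids carrying an unused "n/a" level into the model matrix.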
The last three sub-sections presented the overall data pre-processing used to clean the dataset. The final dataset now meets the technical requirements for statistical modeling.
Looking at the distribution of the variables is a helpful way to study their behavior, and data visualization is a powerful tool for examining the shape and behavior of each numerical variable.
The box plots show that some variables contain outliers. For interest rate, values in the range of 22 to 25 appear as outliers, but we decided not to remove them because banks commonly charge interest rates in this range; the loan amount is in a similar situation. No outliers are deleted for dti either. For annual income, two values stand out. When the annual income column was sorted, these two incomes turned out to be 600,000 and 390,000. Other applicants also have incomes in this range, but these two customers applied for loans of only 5,000 and 15,500 dollars, respectively, which seems implausible for such high incomes. The observations with annual incomes of 600,000 and 390,000 are therefore excluded from the dataset.
Now the dataset is updated and can be used in further modeling steps.
Here, the final explanatory variables are plotted against the response variable to check for a linear relationship. Variables that show a clear linear relationship are retained, while variables with a poor linear relationship are excluded from the model-fitting stage. Plotting a categorical variable such as home_ownership in this way is not a good choice, because its qualitative nature does not give meaningful results; plotting a numerical variable for the different levels of a categorical variable, however, makes perfect sense.
From the scatter plots, loan_amnt, installment and int_rate show evidence of a positive linear relationship, while annual_inc shows a strongly negative linear relationship. Other variables, such as funded_amnt_inv, seem to have no linear relationship at all. It was therefore decided to exclude funded_amnt_inv before modeling, also because it is highly correlated with loan_amnt.
Correlation analysis is performed to check whether two or more variables (especially the independent variables) are correlated. If independent variables are highly correlated, there is a risk of collinearity and one of them should be excluded to remedy the problem. There appears to be substantial correlation between some of the variables, so only one variable from each correlated group is considered. Variables that have similar correlation coefficients with a given variable carry much the same information, and therefore one of them can be removed from the study.
| | repay_fail | loan_amnt | funded_amnt_inv | int_rate | annual_inc | dti |
|---|---|---|---|---|---|---|
| repay_fail | 1.00 | 0.05 | 0.01 | 0.20 | -0.05 | 0.04 |
| loan_amnt | 0.05 | 1.00 | 0.93 | 0.29 | 0.41 | 0.07 |
| funded_amnt_inv | 0.01 | 0.93 | 1.00 | 0.28 | 0.38 | 0.07 |
| int_rate | 0.20 | 0.29 | 0.28 | 1.00 | 0.08 | 0.12 |
| annual_inc | -0.05 | 0.41 | 0.38 | 0.08 | 1.00 | -0.12 |
| dti | 0.04 | 0.07 | 0.07 | 0.12 | -0.12 | 1.00 |
It is clear from the correlation table that loan_amnt and funded_amnt_inv are highly and positively correlated (0.93). Statistically speaking, one of these two variables is redundant because of the collinearity it would bring to the model, so one of them should be dropped. Since funded_amnt_inv showed no evidence of a linear relationship in section 5.1 above, it is the variable excluded from further modeling. annual_inc is also negatively correlated with dti, though the correlation is very weak (-0.12). Based on these relationships, a number of predictors are selected for model fitting.
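A collinearity check of this kind can be sketched with `cor()` in base R. The data below are simulated for illustration only, and the 0.9 threshold is an assumed cutoff rather than a value taken from this report.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three numeric columns (values are made up).
loan_amnt       <- runif(n, 500, 35000)
funded_amnt_inv <- loan_amnt * runif(n, 0.9, 1)  # nearly a copy of loan_amnt
int_rate        <- runif(n, 5, 25)
num_vars <- data.frame(loan_amnt, funded_amnt_inv, int_rate)

# Pairwise Pearson correlations, rounded as in the table above.
corr <- round(cor(num_vars), 2)

# Flag each pair (upper triangle only) whose absolute correlation exceeds 0.9.
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)
data.frame(var1 = rownames(corr)[high[, "row"]],
           var2 = colnames(corr)[high[, "col"]],
           r    = corr[high])
```

Only one variable from each flagged pair would then be kept as a candidate predictor.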
Data exploration means extracting the numerical information that is useful for further modeling. Here, a statistical summary of the data is provided:
| | |
|---|---|
| Name | updated_data1 |
| Number of rows | 37372 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| term | 0 | 1 | FALSE | 2 | 36 : 27760, 60 : 9612 |
| emp_length | 0 | 1 | FALSE | 11 | 10+: 8423, < 1: 4560, 2 y: 4277, 3 y: 3927 |
| home_ownership | 0 | 1 | FALSE | 5 | REN: 17845, MOR: 16638, OWN: 2761, OTH: 124 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| repay_fail | 0 | 1 | 0.15 | 0.36 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| loan_amnt | 0 | 1 | 11142.57 | 7391.71 | 500.00 | 5400.00 | 10000.00 | 15000.00 | 35000.00 | ▇▇▃▂▁ |
| int_rate | 0 | 1 | 12.19 | 3.70 | 5.42 | 9.63 | 11.99 | 14.72 | 24.11 | ▆▇▇▂▁ |
| annual_inc | 0 | 1 | 67630.96 | 40374.32 | 1896.00 | 41000.00 | 59706.00 | 82800.00 | 375000.00 | ▇▃▁▁▁ |
| dti | 0 | 1 | 13.41 | 6.71 | 0.00 | 8.27 | 13.51 | 18.70 | 29.99 | ▅▇▇▆▁ |
The dataset is split into training (70%) and validation (30%) sets.
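A random 70/30 split can be sketched as below. The seed, the toy data frame and the object names are assumptions made for this illustration; the report does not state how the split was drawn.

```r
set.seed(123)  # assumed seed, used only to make this sketch reproducible
n <- 1000

# Toy data standing in for the cleaned loan data (values are made up).
loans <- data.frame(id = seq_len(n), repay_fail = rbinom(n, 1, 0.15))

# Draw 70% of the row indices at random for training; the rest is validation.
train_idx       <- sample(seq_len(n), size = round(0.7 * n))
training_data   <- loans[train_idx, ]
validation_data <- loans[-train_idx, ]

c(train = nrow(training_data), validation = nrow(validation_data))
```

Because the rows are drawn without replacement, no observation appears in both sets.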
Since we are interested in prediction and in modeling risk factors, a variable selection method is employed first to determine which variables are significantly related to the response. We believe that these variable selection methods are capable of retaining important and confounding variables, potentially resulting in a slightly richer and more reliable model. The following section concerns choosing the most significant independent variables.
Three different models are fitted to the data; their performance is then compared using suitable criteria and one of them is chosen as the best model.
Stepwise regression can be used to obtain an optimized model, that is, a model in which the retained independent variables are all significant and the insignificant variables are excluded.
| model | full_model | null_model | backward_model | forward_model | stepwise_model |
|---|---|---|---|---|---|
| AIC | 20770.61 | 22020.24 | 20767.27 | 20767.27 | 20767.27 |
## repay_fail ~ term + int_rate + emp_length + annual_inc
After running the variable selection methods, we noted that the AIC of the full model (20770.61) was higher than that of the forward, backward and stepwise (both directions) selections (20767.27). In addition, all three selection methods chose the same variables with the same performance values. These selected covariates (term, int_rate, emp_length and annual_inc) were therefore expected to have an impact on loan default, and they were used to build the models on the training data set. In the following sections, GLMs with different link functions are fitted to the data and the results are analyzed.
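AIC-based selection of this kind can be sketched with `step()` from base R. The data are simulated and the variable names (`x1`, `x2`, `noise`) are hypothetical; the sketch only shows the mechanics of comparing full, null, backward, forward and stepwise fits, not the report's actual code.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); noise <- rnorm(n)

# Simulated binary response that depends on x1 only.
y <- rbinom(n, 1, plogis(-1 + 1.2 * x1))
d <- data.frame(y, x1, x2, noise)

full_model <- glm(y ~ x1 + x2 + noise, family = binomial, data = d)
null_model <- glm(y ~ 1, family = binomial, data = d)

# Backward elimination starts from the full model; forward and stepwise
# selection start from the null model with the full formula as upper scope.
backward_model <- step(full_model, direction = "backward", trace = 0)
forward_model  <- step(null_model, direction = "forward",
                       scope = formula(full_model), trace = 0)
stepwise_model <- step(null_model, direction = "both",
                       scope = formula(full_model), trace = 0)

# Compare the AIC of all candidate models.
sapply(list(full = full_model, null = null_model, backward = backward_model,
            forward = forward_model, stepwise = stepwise_model), AIC)
```

When the three procedures end at the same formula, as they did in this report, their AIC values coincide.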
Considering the nature of the response (which is binary), a binomial model with logit link is fitted as the first model.
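For reference, the three link functions compared in this report can be written as follows. The notation is introduced here for illustration and is not taken from the report: \(p = P(\operatorname{repay\_fail} = 1)\), \(\eta\) is the linear predictor and \(\Phi\) is the standard normal distribution function.

\[ \begin{aligned} \text{logit:}\quad & \eta = \log\left(\frac{p}{1-p}\right) \\ \text{probit:}\quad & \eta = \Phi^{-1}(p) \\ \text{cloglog:}\quad & \eta = \log\left(-\log(1-p)\right) \end{aligned} \]

In each case the fitted model is linear in the predictors on the \(\eta\) scale, so coefficients from different links are not directly comparable in size.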
##
## Call:
## glm(formula = repay_fail ~ term + int_rate + emp_length + annual_inc,
## family = binomial(link = "logit"), data = training_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.324e+00 8.810e-02 -37.726 < 2e-16 ***
## term60 months 2.741e-01 4.205e-02 6.520 7.02e-11 ***
## int_rate 1.462e-01 5.490e-03 26.627 < 2e-16 ***
## emp_length1 year 1.919e-02 7.865e-02 0.244 0.8072
## emp_length10+ years 1.530e-01 6.393e-02 2.393 0.0167 *
## emp_length2 years -1.364e-01 7.491e-02 -1.820 0.0687 .
## emp_length3 years -5.935e-03 7.566e-02 -0.078 0.9375
## emp_length4 years -1.172e-01 8.040e-02 -1.458 0.1447
## emp_length5 years -2.928e-02 8.030e-02 -0.365 0.7154
## emp_length6 years 1.476e-02 9.039e-02 0.163 0.8703
## emp_length7 years 3.636e-02 9.679e-02 0.376 0.7072
## emp_length8 years -6.004e-02 1.085e-01 -0.554 0.5799
## emp_length9 years -1.501e-01 1.179e-01 -1.273 0.2032
## annual_inc -6.066e-06 5.184e-07 -11.702 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22018 on 26179 degrees of freedom
## Residual deviance: 20739 on 26166 degrees of freedom
## AIC: 20767
##
## Number of Fisher Scoring iterations: 5
After fitting the model, we check whether over-dispersion is present (over-dispersion occurs when the variance of the response is larger than the variance implied by the binomial model). As a rough check, over-dispersion is suspected when the residual deviance is much larger than the residual degrees of freedom, and under-dispersion when it is much smaller. In this model the residual deviance (20739) is less than the residual degrees of freedom (26166), which suggests that over-dispersion is not present. All predictor variables are significant, although some levels of emp_length are not.
The coefficient matrix in the summary output of the logit model shows the average change in the log odds of a customer defaulting on the loan for the different (levels of the) independent variables. For instance, a one-unit increase in annual_inc is associated with an average decrease of \(6.066 \times 10^{-6}\) in the log odds of a customer defaulting on the loan. Likewise, a one-unit increase in int_rate is associated with an average increase of 0.1461805 in the log odds of a customer defaulting on the loan.
## 1 2 3 4 5 6
## 0.07966139 0.28793523 0.15803798 0.06678672 0.08332566 0.09593395
A number of predictions are calculated from the logit model; they are used in the following steps to calculate accuracy criteria.
Every statistical model rests on assumptions that should be met before the model is used. Sometimes these assumptions are not met, so the researcher must check them and consider other tools if the model cannot be used. Residual analysis, the confusion matrix, predictive performance using ROC/GINI, goodness-of-fit tests and the over-dispersion test (Christensen, 2020) are the most common diagnostic tools for assessing the reliability and accuracy of the model.
##
## logit_model_pred_repayfail 0 1
## 0 9487 1698
## 1 5 2
## [1] 0.152162 is a misclassification error on validation dataset
From the confusion matrix, out of 11,192 loan customers the model correctly predicts that 9,487 repaid the loan and that 2 did not. The other 5 and 1,698 customers are incorrectly classified, so the misclassification error is roughly 15%. Because this error rate is low, the model is considered reasonably accurate, although it should be noted that at this cutoff only 2 of the 1,700 customers who actually defaulted are identified as defaulters.
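The figures quoted from the confusion matrix can be reproduced with a few lines of base R. The counts are copied from the output above; treating rows as predicted classes and columns as observed classes is an assumption based on how the table is printed.

```r
# Logit confusion matrix on the validation set.
# Assumed orientation: rows = predicted class, columns = observed class.
cm <- matrix(c(9487, 5, 1698, 2), nrow = 2,
             dimnames = list(predicted = c("0", "1"), observed = c("0", "1")))

n_total     <- sum(cm)
error_rate  <- (cm["0", "1"] + cm["1", "0"]) / n_total
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # share of defaulters detected
specificity <- cm["0", "0"] / sum(cm[, "0"])  # share of non-defaulters detected

round(c(error_rate = error_rate, sensitivity = sensitivity,
        specificity = specificity), 4)
```

Reporting sensitivity and specificity next to the error rate shows how the errors are distributed between the two classes.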
A ROC curve is plotted with False Positive Rate (FPR) on the x-axis and True Positive Rate (TPR) on the y-axis. It displays the percentage of true positives predicted by the model as the prediction probability cutoff is decreased from 1 to 0. The higher the AUC (Area Under the Curve), the more accurately our model predicts the values for the response variable.
## [1] 0.358186 is a gini for validation dataset
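The GINI value reported here is linked to the AUC by a standard identity; using the logit model's validation AUC of 0.6790929 it can be checked as:

\[ \text{GINI} = 2 \times \text{AUC} - 1 = 2 \times 0.6790929 - 1 \approx 0.358186 \]

A model with no discriminating power has an AUC of 0.5 and therefore a GINI of 0.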
Here, a goodness-of-fit test and an over-dispersion test are performed on the results obtained from the logit model.
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.0026, p-value = 0.408
## alternative hypothesis: greater
The null hypothesis of the dispersion test is that the model residuals are equi-dispersed. If the p-value is less than the significance level (0.05 by default), the null is rejected in favour of over-dispersion (not under-dispersion, because the alternative is set to “greater”, so a one-sided test is performed). Here the p-value (0.408) is much greater than 0.05, so there is no evidence of over-dispersion.
Now let's look at a binomial model with a probit link to see how the data behave under a different link function.
##
## Call:
## glm(formula = repay_fail ~ term + int_rate + emp_length + annual_inc,
## family = binomial(link = "probit"), data = training_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.921e+00 4.736e-02 -40.553 < 2e-16 ***
## term60 months 1.556e-01 2.353e-02 6.614 3.75e-11 ***
## int_rate 8.133e-02 2.991e-03 27.194 < 2e-16 ***
## emp_length1 year 6.840e-03 4.315e-02 0.159 0.8740
## emp_length10+ years 7.703e-02 3.514e-02 2.192 0.0284 *
## emp_length2 years -7.874e-02 4.078e-02 -1.931 0.0535 .
## emp_length3 years -8.853e-03 4.143e-02 -0.214 0.8308
## emp_length4 years -6.631e-02 4.380e-02 -1.514 0.1301
## emp_length5 years -1.760e-02 4.386e-02 -0.401 0.6883
## emp_length6 years 3.596e-03 4.961e-02 0.072 0.9422
## emp_length7 years 1.641e-02 5.334e-02 0.308 0.7583
## emp_length8 years -3.456e-02 5.879e-02 -0.588 0.5566
## emp_length9 years -8.062e-02 6.346e-02 -1.270 0.2040
## annual_inc -3.134e-06 2.725e-07 -11.501 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22018 on 26179 degrees of freedom
## Residual deviance: 20729 on 26166 degrees of freedom
## AIC: 20757
##
## Number of Fisher Scoring iterations: 5
The model performance is essentially the same as for the logit model, because the AIC, the residual deviance and the signs and significance pattern of the coefficients are very similar.
## 1 2 3 4 5 6 7
## 0.07634873 0.28893915 0.15885497 0.06322380 0.08153112 0.09749245 0.18017971
## 8 9 10
## 0.38983898 0.18212017 0.21170907
As mentioned earlier in this report, model diagnostics contains a few methods. Of course there is no necessity to use them all, but we covered most of them to make sure the results are solid.
##
## probit_model_pred_repayfail 0 1
## 0 9488 1700
## 1 4 0
## [1] 0.152252 is a misclassification error for validation dataset
From the confusion matrix above, out of 11,192 loan customers the model correctly predicts that 9,488 repaid the loan, but it does not correctly identify any of the customers who failed to repay. The other 4 and 1,700 customers are incorrectly classified, so the misclassification error is roughly 15%, about the same as for the logit model.
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.0023, p-value = 0.428
## alternative hypothesis: greater
Again, the p-value of the dispersion test (0.428) is far greater than the significance level, so we fail to reject the null hypothesis that the residuals of the probit model are not over-dispersed. This result suggests that the model is appropriately fitted to the data.
The AUC and GINI values are calculated from this model to be used in comparison in the following sections.
## [1] 0.358412 is a gini for validation dataset
Now the binomial model with cloglog link function is used to model the data.
##
## Call:
## glm(formula = repay_fail ~ term + int_rate + emp_length + annual_inc,
## family = binomial(link = "cloglog"), data = training_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.246e+00 7.993e-02 -40.609 < 2e-16 ***
## term60 months 2.413e-01 3.785e-02 6.376 1.82e-10 ***
## int_rate 1.311e-01 4.899e-03 26.765 < 2e-16 ***
## emp_length1 year 2.426e-02 7.108e-02 0.341 0.7328
## emp_length10+ years 1.450e-01 5.763e-02 2.515 0.0119 *
## emp_length2 years -1.198e-01 6.811e-02 -1.759 0.0785 .
## emp_length3 years 5.152e-04 6.844e-02 0.008 0.9940
## emp_length4 years -1.044e-01 7.300e-02 -1.430 0.1527
## emp_length5 years -2.461e-02 7.275e-02 -0.338 0.7351
## emp_length6 years 1.688e-02 8.150e-02 0.207 0.8360
## emp_length7 years 3.427e-02 8.702e-02 0.394 0.6937
## emp_length8 years -5.174e-02 9.867e-02 -0.524 0.6000
## emp_length9 years -1.354e-01 1.078e-01 -1.256 0.2090
## annual_inc -5.623e-06 4.765e-07 -11.799 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22018 on 26179 degrees of freedom
## Residual deviance: 20747 on 26166 degrees of freedom
## AIC: 20775
##
## Number of Fisher Scoring iterations: 5
The model performance is essentially the same as for the probit and logit models, because the AIC, the residual deviance and the significance pattern of the coefficients are very similar.
Here, the response variable is predicted using the binomial model with cloglog link function as for other models.
## 1 2 3 4 5 6 7
## 0.08187803 0.28535118 0.15695383 0.06900021 0.08479532 0.09625984 0.17753554
## 8 9 10
## 0.41303988 0.17880359 0.20312342
##
## cloglog_model_pred_repayfail 0 1
## 0 9482 1692
## 1 10 8
## [1] 0.152073 is a misclassification rate for validation dataset
From the results, out of 11,192 loan customers the model correctly predicts that 9,482 repaid the loan and that 8 did not. The other 10 and 1,692 customers are incorrectly classified, so the misclassification error is roughly 15%, almost the same as for the logit and probit models.
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.0035, p-value = 0.376
## alternative hypothesis: greater
Again, the null hypothesis of the dispersion test (there is no over-dispersion in the residuals) is not rejected, so we can regard this model as appropriate.
As mentioned earlier, the greater the AUC, the better the performance of the model. This criterion and the GINI value are calculated for this model.
These values can be plotted or tabulated. Note that sensitivity is the TPR and specificity is the true negative rate, i.e. 1 - FPR.
| Model | AIC | BIC | log-Likelihood | GINI | AUC | Error_Rate |
|---|---|---|---|---|---|---|
| logit | 20767.27 | 20881.68 | -10369.63 | 0.3581857 | 0.6790929 | 0.1521623 |
| probit | 20756.87 | 20871.28 | -10364.43 | 0.3584123 | 0.6792061 | 0.1522516 |
| cloglog | 20775.03 | 20889.45 | -10373.51 | 0.3579239 | 0.6789620 | 0.1520729 |
In terms of AIC/BIC, probit_model has the lowest values of the three models, although the differences between the candidates are small. As mentioned earlier in section 7.1, since the response variable is binary (0 and 1), the logit_model was considered the more natural choice of the three. However, the confusion matrix, ROC curves, GINI value, goodness-of-fit results and dispersion test results are all considered before a model is finally chosen. There is no meaningful difference between the three models in terms of ROC, GINI and prediction error rates. The misclassification rate is roughly 15% and the AUC, which summarizes the ROC curve (the true positive rate against the false positive rate over all cutoffs), is roughly 0.68 for all three models. In addition, GINI is almost 0.36 across the three models. It was therefore concluded that the three models differ only in their AIC and BIC values, as stated above.
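The AIC and BIC columns of the comparison table follow directly from the log-likelihood. As a check for the logit model, with log-likelihood \(\ell = -10369.63\), \(k = 14\) estimated coefficients and \(n = 26180\) training observations (the symbols \(\ell\), \(k\) and \(n\) are notation introduced here; the values are read from the output above):

\[ \begin{aligned} \text{AIC} &= -2\ell + 2k = 20739.26 + 28 = 20767.26 \\ \text{BIC} &= -2\ell + k\ln n \approx 20739.26 + 14 \times 10.1727 \approx 20881.68 \end{aligned} \]

The small difference from the tabulated AIC of 20767.27 comes from rounding the log-likelihood to two decimals. Since all three models have the same \(k\) and \(n\), their ranking by AIC and by BIC is necessarily the same as their ranking by log-likelihood.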
It was finally decided that the logit model is the best fit for this dataset, and we retained it as the final model.
The additional model statistics are interpreted in the following paragraphs, with the understanding that the values of the other predictors in the model are held constant. Additionally, “Confidence intervals provide additional information as to the certainty of our results of a study, and to the likely effect size of any intervention or risk factor” (Smith, 2012; pp. 141-142). Note that the confidence intervals are based on the log-likelihood function of the logistic model. The width of a confidence interval gives some idea of how uncertain we are about the credit risk: a wide interval means the population parameter is estimated less precisely, while a narrow interval means the opposite. Here the confidence intervals for the log odds are exponentiated to give intervals for the odds ratios, so their endpoints are compared with 1 (no effect) rather than with 0, the no-effect value on the log-odds scale. The lower and upper bounds of each interval are the 2.5% and 97.5% quantiles, respectively.
Some examples are:
Customers whose loan term is 60 months have about 32% greater odds of defaulting than customers with a 36-month term. The confidence interval for this odds ratio is 1.2114 to 1.4284, i.e. an increase in odds of between 21% and 43%. Because the interval is about 22 percentage points wide, this estimate is not very precise.
For annual income, the odds ratio per one-unit increase is \(e^{-6.066 \times 10^{-6}} \approx 0.999994\), which is why its confidence interval is displayed as \((1.000, 1.000)\) after rounding. The interval appears to have no width only because income is measured in single units; the effect is tiny per unit but clearly negative. Per 10,000 units of additional annual income, the odds of defaulting fall by about 6% (\(e^{-0.0607} \approx 0.941\)) (see Appendix as well).
A one-unit (one percentage point) increase in the interest rate is associated with about 16% greater odds of defaulting (\(e^{0.1462} \approx 1.157\)). The confidence interval for this increase is 15% to 17%. Because the interval is only about 2 percentage points wide, this estimate is fairly precise (see Appendix as well).
Customers whose employment length is 10+ years have about 17% greater odds of defaulting than customers in the reference category (less than 1 year). The confidence interval for this increase is 3% to 32%. Because the interval is about 29 percentage points wide, this estimate is not very precise.
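The odds ratios discussed above can be recomputed from the coefficient table of the logit model. The sketch below uses Wald intervals (estimate plus or minus 1.96 standard errors), which is an assumption made for simplicity; they may differ slightly from the likelihood-based intervals reported in this section. The short names in `est` and `se` are labels chosen for this example.

```r
# Estimates and standard errors copied from the logit summary output.
est <- c(term60 = 0.2741, int_rate = 0.1462, emp10plus = 0.1530,
         annual_inc = -6.066e-06)
se  <- c(term60 = 0.04205, int_rate = 0.005490, emp10plus = 0.06393,
         annual_inc = 5.184e-07)

# Exponentiate the estimates and Wald limits to obtain odds ratios.
z <- qnorm(0.975)
or_table <- cbind(OR    = exp(est),
                  lower = exp(est - z * se),
                  upper = exp(est + z * se))
round(or_table, 4)

# Rescale annual_inc: odds ratio for a 10,000-unit increase in income.
or_income_10k <- exp(10000 * est["annual_inc"])
round(or_income_10k, 3)
```

Rescaling a predictor in this way changes only the unit in which its effect is expressed, not the fitted model.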
Now that the final model is selected, hold-out validation is used to assess its performance on the validation dataset. This is done by calculating the ROC, GINI and rate of correct prediction on both the training and validation datasets.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## [1] 0.348455 is a gini for training dataset
## [1] 0.358186 is a gini for validation dataset
##
## confusion_matrix_validation 0 1
## 0 9446 1689
## 1 5 2
## [1] 0.152037 is a misclassification error for validation dataset
##
## confusion_matrix_training 0 1
## 0 22274 3890
## 1 12 4
## [1] 0.149045 is a misclassification error for training dataset
From the results, the GINI of the logit model is 0.348455 on the training data and 0.358186 on the validation data, which corresponds to AUC values of about 0.674 and 0.679 respectively. The model is therefore neither overfitting nor underfitting, because its performance on the two data sets is almost the same.
On classification accuracy, the number of correctly classified persons in the training set is \(22,274 + 4 = 22,278\) and the number of misclassified persons is \(12 + 3,890 = 3,902\). The classification accuracy is roughly 85% (\(\frac{22,278}{26,180} \times 100\)).
In the validation data set, the number of correctly classified persons is \(9,446 + 2 = 9,448\), while \(5 + 1,689 = 1,694\) persons are misclassified. The classification accuracy is roughly 85% (\(\frac{9,448}{11,142} \times 100\)). The rate of correct prediction is thus about 85% on both datasets.
In conclusion, because the ROC/GINI and the rate of correct prediction are almost the same on the two datasets, the model appears to generalize to new input: it performs similarly on the training and validation data sets. On this basis, the model's classification of loan defaults from the given variables was judged to be reliable. However, there is one concerning finding: the AUC is only about 0.68, meaning that a randomly chosen defaulter receives a higher predicted risk than a randomly chosen non-defaulter only about 68% of the time. The model's ability to separate defaulters from non-defaulters on new loan applications is therefore limited, which leaves the bank with some risk of loan default, and it is up to the bank to devise a strategy to mitigate this risk.
In summary, the theoretical formula for the final model can be written as below:
\[ \begin{aligned} \log\left[ \frac { \widehat{P( \operatorname{repay\_fail} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{repay\_fail} = \operatorname{1} )} } \right] &= -3.32 + 0.27(\operatorname{term}_{\operatorname{60\ months}}) + 0.15(\operatorname{int\_rate}) + 0.02(\operatorname{emp\_length}_{\operatorname{1\ year}})\ + \\ &\quad 0.15(\operatorname{emp\_length}_{\operatorname{10+\ years}}) - 0.14(\operatorname{emp\_length}_{\operatorname{2\ years}}) - 0.01(\operatorname{emp\_length}_{\operatorname{3\ years}}) - 0.12(\operatorname{emp\_length}_{\operatorname{4\ years}})\ - \\ &\quad 0.03(\operatorname{emp\_length}_{\operatorname{5\ years}}) + 0.01(\operatorname{emp\_length}_{\operatorname{6\ years}}) + 0.04(\operatorname{emp\_length}_{\operatorname{7\ years}}) - 0.06(\operatorname{emp\_length}_{\operatorname{8\ years}})\ - \\ &\quad 0.15(\operatorname{emp\_length}_{\operatorname{9\ years}}) - 6.07 \times 10^{-6}(\operatorname{annual\_inc}) \end{aligned} \]
This explanation repeats section 11. Note that every variable is analyzed on the understanding that the others are kept constant.
Based on this theoretical equation for the logit model, there is evidence (positive coefficients) to suggest that covariates such as int_rate, the 60-month term and emp_length of 1, 6, 7 and 10+ years increase the log odds of defaulting on loan repayment. A possible reason is that bank customers may have engaged in or obtained:
However, emp_length levels of 2, 3, 4, 5, 8 and 9 years decrease the log odds of defaulting on loan repayment. Therefore, emp_length affects loan repayment in both positive and negative directions.
As expected, annual_inc reduces the risk of default. Its coefficient is very small per unit of income (\(-6.066 \times 10^{-6}\)) and rounds to zero at two decimal places, but it is negative and statistically significant: each one-unit increase in annual_inc lowers the log odds of defaulting by that amount. The higher the annual income, the less likely the customer is to default on the loan.
## AUC_NEW GINI_NEW AUC_OLD GINI_OLD
## Training Dataset 0.6741824 0.3483648 0.5570000 0.1140000
## Validation Dataset 0.6792061 0.3584123 0.5550000 0.1100000
The new (logit) model's GINI on the training and validation datasets is substantially greater than that of the old model. It was therefore concluded that the new model performs better than the old model.
The two questions are:
By comparing the old and the new model in terms of GINI and AUC/ROC from section 11, it was found that the new (logit) model’s:
Overall, it was therefore concluded that the logit model performed about 3 times better than the old model by GINI and about 1.2 times better by AUC. In other words, on new data the logit model's AUC is about 1.2 times that of the old model. However, there is one concerning finding: the AUC is only about 0.68, so the model's ability to discriminate between defaulters and non-defaulters on new loan applications is limited. This leaves the bank with some risk of loan default, and it is up to the bank to devise a strategy to mitigate this risk.
The important variables in the model are term, int_rate, emp_length and annual_inc. To compare the model against the traditional variables, we adopted the evaluation of creditworthiness that follows the 5 Cs of Credit and their variables (CFI, 2005). These 5 Cs are:
* ‘character’ (propensity to repay a debt on time), such as past defaults, credit type, payment terms and FICO score;
* ‘capacity’ (debt repayment ability), namely income, employment history, job stability and Debt-to-Income (DTI) ratio;
* ‘collateral’ (for secured loan assessment), namely collateral value;
* ‘capital’ (total assets owned by the borrower), for instance investments and liquid assets such as savings; and finally
* ‘conditions’ (loan transaction specifics), such as principal amount, interest rate, the borrower’s purpose for the funds, and economic and political conditions.
This comparison reveals that no variables from the character category were used, while some variables from the other four Cs were incorporated to build the model for predicting credit risk. This could be the reason why the classification accuracy of the model is only between 68% and 85%, and why annual income appears to have no impact on credit risk.
Based on the fitted coefficients and confidence intervals of the logit model, the positive coefficients provide credible evidence that int_rate, a term of 60 months and emp_length levels of 1, 6, 7 and 10+ years increase the log odds of customers defaulting on loan repayment. More specifically:
Customers whose loan term is 60 months have 32% greater odds of defaulting on the loan than those with the reference (shorter) term. The confidence interval for this odds ratio is 1.2114 to 1.4284, that is, an increase in odds of between 21% and 43%. Because the interval is about 22 percentage points wide, this estimate is not very precise.
As expected, an increase in annual income does not raise the odds of defaulting on the loan. Its odds ratio is 1.000 at the displayed precision, with a confidence interval of \((1.000, 1.000)\). The interval shows no width at this precision because the effect of a one-unit increase in annual income is very small, so the estimate is very precise. The direction of the effect is that customers with higher annual income are less likely to default on the loan.
Each one-unit increase in interest rate is associated with 15% greater odds of defaulting on the loan. The confidence interval for this increase is 15% to 17%. Because the interval is only 2 percentage points wide, this estimate is fairly precise.
Customers whose employment length is 10+ years have 17% greater odds of defaulting on the loan than those in the reference employment-length category. The confidence interval for this increase is 3% to 32%. Because the interval is 29 percentage points wide, this estimate is not very precise.
A possible reason is that:
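The percentage statements above come from exponentiating logit coefficients and their interval limits, then subtracting 1. The sketch below first converts the interval quoted for the 60-month term, then shows the same conversion from a fitted model on simulated data. The simulated data, the coefficient values and the "36 months" reference level are assumptions for illustration, not the bank's data.

```r
# Odds-ratio limits for the 60-month term, as quoted in the text above.
or_ci  <- c(lower = 1.2114, upper = 1.4284)
pct_ci <- (or_ci - 1) * 100                 # percentage change in the odds
print(round(pct_ci, 1))
print(round(unname(diff(pct_ci)), 1))       # interval width, percentage points

# Same conversion from a fitted model (simulated data, illustrative only).
set.seed(42)
n   <- 5000
sim <- data.frame(
  int_rate = runif(n, 5, 25),
  term     = factor(sample(c("36 months", "60 months"), n, replace = TRUE))
)
eta <- -3.5 + 0.14 * sim$int_rate + 0.28 * (sim$term == "60 months")
sim$repay_fail <- rbinom(n, 1, plogis(eta))

fit <- glm(repay_fail ~ int_rate + term,
           family = binomial(link = "logit"), data = sim)

# Odds ratios with Wald confidence limits.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 3))
```

An odds ratio whose interval excludes 1 indicates an effect that is distinguishable from zero at the chosen confidence level; the narrower the interval, the more precise the estimate.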
However, emp_length levels of 2, 3, 4, 5, 8 and 9 years decrease the log odds of a customer defaulting on loan repayment. Therefore, emp_length affects loan repayment in both directions, depending on the level.
Therefore, the explanation that lenders give to their management about the assessment of loan applications should centre on the term of loan repayment (60 months), the interest rate and an employment length of 10+ years, because these were the main factors contributing to customers defaulting on loan repayment. The main focus should be on the interest rate because, as its confidence interval suggests, the estimate (15% greater odds of default) is very precise (an interval width of 2 percentage points). Annual income also needs to be discussed because it was found to have had no adverse impact on default. This suggests that the bank can favour customers with very high income, as they are less likely to default on loan repayment (see the confidence interval of (1.000, 1.000) and the appendix).
One way to address this is for the bank to reconsider its interest rate on the principal amount, the term of loan repayment and its treatment of applicants with an emp_length of 10+ years. In addition, lenders need to verify each customer's source of income, the kind of collateral pledged as security for the loan and the type of account he or she maintains.
The objective of the report was to build a binomial regression model for predicting loan default using available bank data, comprising 36 predictor variables and 1 response variable (repay_fail). It was also intended to identify and evaluate potential risks associated with granting loans to customers, so that the lenders and bank management can take countermeasures to minimise the impact of these risks, one of which is revenue loss. A further aim was to determine whether there is evidence about which factors contribute more or less seriously to default on loan repayment.
Based on the model, it was found (from the positive coefficients) that the interest rate, a repayment term of 60 months and an employment length of more than 10 years are associated with a higher risk of default, while annual income was found to have had no adverse impact on loan repayment, meaning that customers with very high annual income tend to repay their loans successfully. The developed logit model’s prediction accuracy (whether or not a customer defaults on the loan) is between 76% and 85%, which is much better than the old model. This suggests that there remains a risk of customers defaulting, so the lenders and bank management must make informed decisions around these potential risk factors in order to minimise the risk of revenue loss.
Although the model generalises well to new data, as there was no evidence of overfitting or underfitting (see cross-validation in section 10), the 15% misclassification error suggests that the model may not always correctly classify a customer as a potential loan defaulter or not. This limitation could partly be attributed to:
CFI. C. E. (2005). 5 Cs of Credit. Retrieved 2005 - 2021 from https://corporatefinanceinstitute.com/resources/knowledge/credit/5-cs-of-credit/
Hilbe, J. M. (2015). Practical guide to logistic regression. CRC Press LLC.
Leblebici, H., & Salancik, G. R. (1981). Effects of environmental uncertainty on information and decision processes in banks. Administrative Science Quarterly, 578-596.
Smith, C. J. (2012). Interpreting confidence intervals. Phlebology, 27(3), 141–142. https://doi.org/10.1258/phleb.2012.012j02
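# Predict on the link (log-odds) scale with standard errors, so that plogis() below maps fit and its limits back to probabilities
logit_model_pred_training$termP <- predict(logit_model, training_data, type = "link", se.fit = TRUE)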
## Warning in logit_model_pred_training$termP <- predict(logit_model,
## training_data, : Coercing LHS to a list
data_termP <- logit_model_pred_training$termP
newdata3 <- cbind( training_data,data_termP)
newdata3$emp_lengthP <-predict(logit_model, training_data, type = "response")
head(newdata3)
# upper limit and lower limits of the predictions using plogis
newdata5 <- within(newdata3,{
PredictedProb <- plogis(fit)
LL <- plogis(fit-(1.96*se.fit))
UL <- plogis(fit+(1.96*se.fit))
})
head(newdata5)
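Confidence limits built with `plogis()` assume that `fit` and `se.fit` are on the link (log-odds) scale, that is, that they come from `predict(..., type = "link", se.fit = TRUE)`. A minimal sketch of this back-transformation on simulated data follows; the objects `sim`, `toy_model` and `bands` and the simulated coefficients are assumptions for illustration, not the report's fitted model.

```r
# Simulated data and a small logit model (illustrative only).
set.seed(7)
n   <- 1000
sim <- data.frame(int_rate = runif(n, 5, 25))
sim$repay_fail <- rbinom(n, 1, plogis(-3.5 + 0.14 * sim$int_rate))
toy_model <- glm(repay_fail ~ int_rate, family = binomial, data = sim)

# Predictions and standard errors on the log-odds scale.
pr <- predict(toy_model, newdata = sim, type = "link", se.fit = TRUE)

# Back-transform the point prediction and the 95% Wald limits to probabilities.
bands <- data.frame(
  PredictedProb = plogis(pr$fit),
  LL = plogis(pr$fit - 1.96 * pr$se.fit),
  UL = plogis(pr$fit + 1.96 * pr$se.fit)
)
print(head(bands))
```

Building the interval on the link scale and transforming afterwards keeps the limits inside (0, 1), which is not guaranteed when 1.96 standard errors are added directly on the probability scale.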
# Predicted repay_fail versus int_rate, with confidence ribbon by term
A <- ggplot(newdata5, aes(x = int_rate, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = term), alpha = 0.5) +
geom_line(aes(colour = term), linewidth = 1)
# Predicted repay_fail versus annual_inc, with confidence ribbon by term
B <- ggplot(newdata5, aes(x = annual_inc, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = term), alpha = 0.5) +
geom_line(aes(colour = term), linewidth = 1)
ggarrange(A, B,ncol = 2,nrow = 1)
We have a large number of observations, so we group the data into quantile bins and plot the mean within each bin. The plot of predicted repay_fail versus annual_inc shows that as the annual income of the customer increases, the predicted risk of default decreases roughly linearly, because the predicted repay_fail value moves closer to zero. The rate of change in the predicted repay_fail is small, which is why it appears almost equal to zero in our summary table. However, the change is significant.
y <- newdata5$PredictedProb
x <- newdata5$annual_inc
# Bin x into 2%-quantile groups; include.lowest keeps the minimum value in the first bin
g <- cut(x, breaks = quantile(x, seq(0, 100, 2) / 100, na.rm = TRUE), include.lowest = TRUE)
# Mean of y and x within each bin
ym <- tapply(y, g, mean, na.rm = TRUE)
xm <- tapply(x, g, mean, na.rm = TRUE)
datas <- data.frame(xm1 = xm, ym1 = ym)
plot(xm, ym, xlab = "annual income in $", ylab = "predicted repay_fail", main = "predicted repay_fail - annual_inc relationship")
We have a large number of observations, so we group the data into quantile bins and plot the mean within each bin. The plot of predicted repay_fail versus the interest rate shows that as the interest rate increases, the risk of default increases accordingly, because the predicted repay_fail value moves closer to 1, which denotes default. The rate of change is small but significant.
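y <- newdata5$PredictedProb
x <- newdata5$int_rate
# Bin x into 2%-quantile groups; include.lowest keeps the minimum value in the first bin
g <- cut(x, breaks = quantile(x, seq(0, 100, 2) / 100, na.rm = TRUE), include.lowest = TRUE)
# Mean of y and x within each bin
ym <- tapply(y, g, mean, na.rm = TRUE)
xm <- tapply(x, g, mean, na.rm = TRUE)
plot(xm, ym, xlab = "interest rate %", ylab = "predicted repay_fail", main = "predicted repay_fail - interest rate relationship")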
We have a large number of observations, so we group the data into quantile bins and plot the mean within each bin. The plot of the mean observed repay_fail from the bank’s data versus the mean predicted repay_fail shows a linear relationship, which means our prediction of default is good. There are slight differences between the observed and predicted values, which might be caused by the limitations of our model stated in the conclusion above.
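# Use bin means to simplify the plot of observed versus predicted default
y <- newdata5$repay_fail
x <- newdata5$PredictedProb
# Bin predictions into 2%-quantile groups; include.lowest keeps the minimum value in the first bin
g <- cut(x, breaks = quantile(x, seq(0, 100, 2) / 100, na.rm = TRUE), include.lowest = TRUE)
# Observed default rate and mean predicted probability within each bin
ym <- tapply(y, g, mean, na.rm = TRUE)
xm <- tapply(x, g, mean, na.rm = TRUE)
plot(xm, ym, xlab = "predicted repay_fail", ylab = "repay_fail", main = "repay_fail versus predicted probability")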