Bank Loan Data Analysis - Logistic Regression Model

1. Introduction

When a bank lends money to an individual or a business, it may transfer the entire loan amount to the borrower, who pays it back in installments over time. In some cases the borrower does not pay back, i.e., defaults on the loan. This report takes that repayment indicator as the response and studies its relationship with other variables. Our goal is to build a statistical model that predicts loan default from information available at the time of application, so that the bank can evaluate the risk of granting credit and ultimately minimize its losses. We use a binomial regression model within the generalized linear model (GLM) framework, with different link functions, to predict the response variable ‘repay_fail’ (0 = no default, 1 = default) from the 36 potential predictor variables in the data set.

2. Loading Required Packages

In this section, all the packages required for this report are loaded into R.
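The loading chunk itself is not shown in this report. As a minimal sketch, assuming the package list below (inferred from the console messages that appear later: readr, skimr, DHARMa, pROC and equatiomatic), availability can be checked before attaching:

```r
# Assumed package list, inferred from console messages in this report;
# the original loading chunk is not shown.
pkgs <- c("readr", "skimr", "DHARMa", "pROC", "equatiomatic")

# requireNamespace() returns FALSE instead of stopping when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```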

3. Loading the Data into R

The data set is loaded into R and its dimension is printed.

## New names:
## Rows: 38480 Columns: 37
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (14): term, emp_length, home_ownership, verification_status, issue_d, lo... dbl
## (23): ...1, id, member_id, loan_amnt, funded_amnt, funded_amnt_inv, int_...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

## [1] 38480    37

The response variable is repay_fail and there are 36 other predictor/explanatory variables; however, most of them are expected to be irrelevant (statistically insignificant) for predicting repay_fail. We consider only variables known at the time the loan application was made; post-loan variables are excluded. In our view, the candidate predictors are therefore: loan_amnt, funded_amnt, funded_amnt_inv, term, int_rate, installment, emp_length, home_ownership and annual_inc. Since loan_amnt and funded_amnt appear to carry nearly the same information, that is, both have a similar effect on the response variable, one of them (funded_amnt) is dropped from further investigation to avoid collinearity in the model.

4. Data Pre-Processing

This section covers the data manipulation needed to clean the data: checking that variables have proper data types, dealing with missing values, and exploring the statistical and structural characteristics of the data.

4.1 Filtering Irrelevant Variables

The first step in data pre-processing is to clean the dataset so that all further calculations run without problems. To that end, only the relevant columns are selected and all irrelevant variables are excluded from the data.

## [1] 38480     9

Here, we have a dataset containing 38,480 rows (observations) and 9 columns (variables).

4.2 Convert the Class of Variables

Data type is one of the most important properties to check. For example, the class of a categorical variable must be factor, so all categorical variables in the dataset are converted using the as.factor() function. For a thorough inspection of data types, the structure function (str) in R is helpful.

The data structure shows 3 character variables that should be of factor type (because they are qualitative), so all of them are converted to factor objects.
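The conversion described above can be sketched in base R as follows. The small data frame is a made-up stand-in for the loan data (its values are illustrative only), and the object name `loans` is an assumption.

```r
# Toy data frame standing in for the loan data (values are illustrative only)
loans <- data.frame(
  term           = c("36 months", "60 months", "36 months"),
  emp_length     = c("10+ years", "< 1 year", "2 years"),
  home_ownership = c("RENT", "MORTGAGE", "OWN"),
  int_rate       = c(10.5, 14.2, 7.9),
  stringsAsFactors = FALSE
)

str(loans)  # the three qualitative columns show up as chr

# Convert every character column to a factor in one pass
is_chr <- vapply(loans, is.character, logical(1))
loans[is_chr] <- lapply(loans[is_chr], as.factor)

str(loans)  # the same columns are now Factor
```

Selecting the columns by type avoids listing them by name, which may be convenient if further character columns are added later.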

4.3 Dealing with Missing Values

## [1] 1

##           emp_length
## repay_fail < 1 year 1 year 10+ years 2 years 3 years 4 years 5 years 6 years
##          0     3883   2762      7103    3707    3361    2837    2694    1826
##          1      682    492      1362     585     578     477     477     318
##           emp_length
## repay_fail 7 years 8 years 9 years  n/a
##          0    1441    1229    1034  774
##          1     261     216     162  219

##           home_ownership
## repay_fail MORTGAGE  NONE OTHER   OWN  RENT
##          0    14692     3    96  2497 15363
##          1     2448     1    29   461  2890

##           term
## repay_fail 36 months 60 months
##          0     25072      7579
##          1      3521      2308

In this data set, emp_length has a number of missing values recorded as “n/a”. Cells containing “n/a” are treated as missing values (NA) and the corresponding rows are omitted from the dataset. In addition, rows with an interest rate greater than 80% are removed, because charging a rate that high is not a realistic option for a bank these days. Finally, the first row has values of zero for some columns, so it is removed from the dataset as well.

Depending on the research purpose, the number of missing values and other factors, one can either impute the missing values (replace them with a valid statistic) or exclude the rows that contain them. Here, the latter approach is used to clean the data.
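A minimal sketch of this complete-case cleaning, using a made-up data frame; the object names and the toy values are assumptions, and only the “n/a” recoding and the 80% interest-rate rule come from the text above.

```r
# Toy data (illustrative values only)
loans <- data.frame(
  emp_length = c("2 years", "n/a", "10+ years", "n/a", "< 1 year"),
  int_rate   = c(11.2, 9.8, 85.0, 13.1, 7.4),
  stringsAsFactors = FALSE
)

# Recode the "n/a" label as a true missing value
loans$emp_length[loans$emp_length == "n/a"] <- NA

colSums(is.na(loans))   # number of missing values per column

# Complete-case analysis: drop rows with any NA,
# then drop rows with an interest rate greater than 80%
clean <- na.omit(loans)
clean <- clean[clean$int_rate <= 80, ]
nrow(clean)
```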

The last three sub-sections presented the overall data pre-processing used to clean the dataset. The final dataset now meets the technical criteria needed for statistical modeling.

5. Data Visualization

Looking at the distribution of the variables is a helpful way to study their behavior, and data visualization is a powerful tool for examining the shape and behavior of each numerical variable.

The box plots show that some variables contain outliers. For interest rate, although values in the range of 22 to 25 appear as outliers, we decided not to remove them, because many banks do charge interest rates in this range; loan amount is in a similar situation. No outliers are deleted for dti either. For annual income, two values look like outliers. When the annual income column was arranged in ascending order, these two incomes turned out to be 600,000 and 390,000. Although other people in this income range applied for loans, these two persons applied for loans as small as 5,000 and 15,500 dollars, respectively, which is implausible given their very high annual income. Therefore the observations with annual incomes of 600,000 and 390,000 are excluded from the dataset.

Now the dataset is updated and can be used in further modeling steps.

5.1 Checking Logistic Regression Assumptions

Here, all the final explanatory variables are plotted to check for a linear relationship between each explanatory variable and the response variable. Variables that demonstrate strong linearity are retained; variables with poor linearity are excluded from the model-fitting stage. Plotting a categorical variable such as home_ownership in this way is not sensible, because its qualitative nature does not give meaningful results, but plotting a numerical variable for different levels of a categorical variable makes perfect sense.

From the scatter plots, variables such as loan_amnt, installment and int_rate show evidence of a positive linear relationship, while annual_inc shows a strongly negative linear relationship. Other variables, such as funded_amnt_inv, seem to have no linear relationship at all. It was therefore decided that funded_amnt_inv can be excluded from the data before modeling, also because it is highly correlated with loan_amnt.

5.2 Correlation Analysis

Correlation analysis is performed to check whether two or more variables (especially the independent variables) are correlated. If independent variables are highly correlated, there is a risk of collinearity and one of them should be excluded to remedy the problem. Some of the variables do appear to be strongly correlated, so only one variable from each correlated group is retained. Variables that are highly correlated with each other have largely the same effect, and therefore one of them can be removed from the study.

	repay_fail	loan_amnt	funded_amnt_inv	int_rate	annual_inc	dti
repay_fail	1.00	0.05	0.01	0.20	-0.05	0.04
loan_amnt	0.05	1.00	0.93	0.29	0.41	0.07
funded_amnt_inv	0.01	0.93	1.00	0.28	0.38	0.07
int_rate	0.20	0.29	0.28	1.00	0.08	0.12
annual_inc	-0.05	0.41	0.38	0.08	1.00	-0.12
dti	0.04	0.07	0.07	0.12	-0.12	1.00

It is quite clear from the correlation table that loan_amnt and funded_amnt_inv are highly and positively correlated (r = 0.93). Statistically speaking, one of these two variables is redundant because of the collinearity they might bring to the model, so one of them should be disregarded. Since funded_amnt_inv showed no evidence of a linear relationship in 5.1 above, it is the variable excluded from further modeling. annual_inc is also negatively correlated with dti (r = -0.12), though the correlation is very weak. Based on these relationships, a number of predictors are selected for model fitting.
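The screening for highly correlated pairs can be sketched as below. The data are simulated so that two columns track each other closely, and the 0.9 cutoff is an assumed threshold, not one stated in this report.

```r
set.seed(1)
n <- 500
loan_amnt       <- runif(n, 500, 35000)
funded_amnt_inv <- loan_amnt * runif(n, 0.8, 1)  # built to track loan_amnt closely
int_rate        <- rnorm(n, 12, 3.7)
num_vars <- data.frame(loan_amnt, funded_amnt_inv, int_rate)

cor_mat <- round(cor(num_vars), 2)

# List each pair once (upper triangle) whose |r| exceeds the cutoff
cutoff <- 0.9                                    # assumed threshold
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```

Which member of a flagged pair to drop remains a judgment call; here the report drops the one that showed no linear relationship with the response.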

6. Data Exploration

Data exploration means extracting the numerical information that is useful for further modeling. Here, a statistical summary of the data is provided:

Data summary
Name	updated_data1
Number of rows	37372
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
term	1	FALSE	2	36 : 27760, 60 : 9612
emp_length	1	FALSE	11	10+: 8423, < 1: 4560, 2 y: 4277, 3 y: 3927
home_ownership	1	FALSE	5	REN: 17845, MOR: 16638, OWN: 2761, OTH: 124

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
repay_fail	1	0.15	0.36	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
loan_amnt	1	11142.57	7391.71	500.00	5400.00	10000.00	15000.00	35000.00	▇▇▃▂▁
int_rate	1	12.19	3.70	5.42	9.63	11.99	14.72	24.11	▆▇▇▂▁
annual_inc	1	67630.96	40374.32	1896.00	41000.00	59706.00	82800.00	375000.00	▇▃▁▁▁
dti	1	13.41	6.71	0.00	8.27	13.51	18.70	29.99	▅▇▇▆▁

7. Training and Validation Sets

The dataset is split into training (70%) and validation (30%) sets.
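A 70/30 split can be sketched with base R sampling. The seed, the simulated data and the use of `sample()` are assumptions; the report does not show which function produced its split.

```r
set.seed(123)                       # assumed seed, for reproducibility only
n <- 1000
dat <- data.frame(id = seq_len(n), repay_fail = rbinom(n, 1, 0.15))

# Draw 70% of the row indices without replacement for training
train_idx <- sample(seq_len(n), size = round(0.7 * n))
training_data   <- dat[train_idx, ]
validation_data <- dat[-train_idx, ]   # the remaining 30%

c(train = nrow(training_data), validation = nrow(validation_data))
```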

Since we are interested in prediction and in modeling risk factors, a variable selection method is employed first to determine which variables are significantly related to the response. We believe these variable selection methods are capable of retaining important and confounding variables, potentially resulting in a slightly richer and more reliable model. The following section concerns choosing the most significant independent variables.

Three different models are then fitted to the data; finally, the performance of all the models is compared using appropriate criteria and the best one is chosen.

8. Variable Selection Methods - Forward, Backward and Stepwise.

Stepwise regression can be used to obtain a more parsimonious model, in which the retained independent variables contribute to the fit and the insignificant ones are excluded.

model	full_model	null_model	backward_model	forward_model	stepwise_model
AIC	20770.61	22020.24	20767.27	20767.27	20767.27

## repay_fail ~ term + int_rate + emp_length + annual_inc

After running the variable selection methods, we noted that the AIC of the full model (20770.61) was higher than that of the forward, backward and stepwise (both directions) selections (20767.27). In addition, all three procedures selected the same variables with the same performance values. The selected covariates (term, int_rate, emp_length and annual_inc) are therefore expected to have an impact on loan default, and they were used to build the model on the training data set. In the following sections, GLMs with different link functions are fitted to the data and the results are analyzed.
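The three selection procedures can be sketched with `step()` from base R, which selects by AIC. The data below are simulated (including a deliberately irrelevant `noise` column), so the AIC values will not match the table above; the model object names merely mirror its column labels.

```r
set.seed(42)
n <- 2000
sim <- data.frame(
  int_rate   = rnorm(n, 12, 3.7),
  annual_inc = rlnorm(n, meanlog = 11, sdlog = 0.5),
  noise      = rnorm(n)                       # unrelated to the response
)
eta <- -3.3 + 0.15 * sim$int_rate - 6e-06 * sim$annual_inc
sim$repay_fail <- rbinom(n, 1, plogis(eta))

null_model <- glm(repay_fail ~ 1, family = binomial, data = sim)
full_model <- glm(repay_fail ~ int_rate + annual_inc + noise,
                  family = binomial, data = sim)

# Backward: start from the full model and drop terms
backward_model <- step(full_model, direction = "backward", trace = 0)
# Forward: start from the null model and add terms up to the full formula
forward_model  <- step(null_model, direction = "forward", trace = 0,
                       scope = formula(full_model))
# Stepwise: terms may be added or dropped at each step
stepwise_model <- step(null_model, direction = "both", trace = 0,
                       scope = list(lower = formula(null_model),
                                    upper = formula(full_model)))

sapply(list(full = full_model, null = null_model, backward = backward_model,
            forward = forward_model, stepwise = stepwise_model), AIC)
```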

8.1 Logit Model

Considering the nature of the response (which is binary), a binomial model with logit link is fitted as the first model.

## 
## Call:
## glm(formula = repay_fail ~ term + int_rate + emp_length + annual_inc, 
##     family = binomial(link = "logit"), data = training_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.324e+00  8.810e-02 -37.726  < 2e-16 ***
## term60 months        2.741e-01  4.205e-02   6.520 7.02e-11 ***
## int_rate             1.462e-01  5.490e-03  26.627  < 2e-16 ***
## emp_length1 year     1.919e-02  7.865e-02   0.244   0.8072    
## emp_length10+ years  1.530e-01  6.393e-02   2.393   0.0167 *  
## emp_length2 years   -1.364e-01  7.491e-02  -1.820   0.0687 .  
## emp_length3 years   -5.935e-03  7.566e-02  -0.078   0.9375    
## emp_length4 years   -1.172e-01  8.040e-02  -1.458   0.1447    
## emp_length5 years   -2.928e-02  8.030e-02  -0.365   0.7154    
## emp_length6 years    1.476e-02  9.039e-02   0.163   0.8703    
## emp_length7 years    3.636e-02  9.679e-02   0.376   0.7072    
## emp_length8 years   -6.004e-02  1.085e-01  -0.554   0.5799    
## emp_length9 years   -1.501e-01  1.179e-01  -1.273   0.2032    
## annual_inc          -6.066e-06  5.184e-07 -11.702  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22018  on 26179  degrees of freedom
## Residual deviance: 20739  on 26166  degrees of freedom
## AIC: 20767
## 
## Number of Fisher Scoring iterations: 5

After fitting the model, we check whether over-dispersion is present (over-dispersion is the situation where the variance of the response is larger than the model assumes). As a rough check, over-dispersion is indicated when the residual deviance is considerably larger than the residual degrees of freedom, and under-dispersion in the opposite case. In this model, the residual deviance (20739) is less than the residual degrees of freedom (26166), which suggests that over-dispersion is not present. All predictor variables are significant, although some levels of emp_length are not individually significant.

8.1.1 Model Interpretation

The coefficient table in the summary output of the logit model shows the average change in the log odds of a customer defaulting for each (level of an) independent variable, holding the others fixed. For instance, a one-unit increase in annual_inc is associated with an average decrease of \(6.066 \times 10^{-6}\) in the log odds of a customer defaulting. Likewise, a one-unit increase in int_rate is associated with an average increase of 0.1462 in the log odds of a customer defaulting.

8.1.2 Make predictions on the model using the validation dataset

##          1          2          3          4          5          6 
## 0.07966139 0.28793523 0.15803798 0.06678672 0.08332566 0.09593395

A number of predictions are calculated from the logit model to be used in the following steps to calculate some accuracy criteria.
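A sketch of how such predictions can be obtained with `predict()` on a fitted `glm`. The data are simulated, the single-predictor formula is a simplification, and the 0.5 classification cutoff is an assumption (the report does not state the cutoff it used).

```r
set.seed(11)
n <- 1500
dat <- data.frame(int_rate = rnorm(n, 12, 3.7))
dat$repay_fail <- rbinom(n, 1, plogis(-3.3 + 0.15 * dat$int_rate))

idx <- sample(seq_len(n), size = round(0.7 * n))
training_data   <- dat[idx, ]
validation_data <- dat[-idx, ]

logit_model <- glm(repay_fail ~ int_rate,
                   family = binomial(link = "logit"), data = training_data)

# type = "response" returns probabilities rather than log odds
pred_prob <- predict(logit_model, newdata = validation_data, type = "response")
head(pred_prob)

# Turn probabilities into classes with an assumed 0.5 cutoff
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
table(predicted = pred_class, observed = validation_data$repay_fail)
```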

8.1.3 Model Diagnostics

Every statistical model rests on assumptions that should be met before the model is used. Sometimes these assumptions cannot be met, so they must be checked by the researcher, who can then consider other tools if the model turns out to be unusable. Residual analysis, the confusion matrix, predictive performance using ROC/GINI, goodness-of-fit tests and the over-dispersion test (Christensen, 2020) are the most common tools for assessing the reliability and accuracy of the model.

8.1.3.1 Confusion Matrix

##                           
## logit_model_pred_repayfail    0    1
##                          0 9487 1698
##                          1    5    2

## [1] 0.152162 is a misclassification error on validation dataset

From the confusion matrix, we see that out of 11192 loan customers, the model correctly predicted that 9487 repaid the loan and that 2 defaulted. The other 5 and 1698 are incorrectly classified, so the misclassification error is roughly 15%. Since the overall classification error is low, this model is thought to be realistic, accurate and reliable; note, however, that it identifies only 2 of the 1700 observed defaults.

8.1.3.2 Predictive Performance
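The figures quoted above can be re-derived from the four cell counts. The sketch below copies the counts from the logit confusion matrix (rows are predicted classes, columns are observed values); the variable names are illustrative.

```r
# Counts copied from the logit confusion matrix above
# (rows = predicted class, columns = observed repay_fail)
cm <- matrix(c(9487, 5, 1698, 2), nrow = 2,
             dimnames = list(predicted = c("0", "1"), observed = c("0", "1")))

total       <- sum(cm)
error_rate  <- (cm["1", "0"] + cm["0", "1"]) / total   # off-diagonal share
sensitivity <- cm["1", "1"] / sum(cm[, "1"])           # defaulters caught
specificity <- cm["0", "0"] / sum(cm[, "0"])           # non-defaulters kept
base_rate   <- sum(cm[, "1"]) / total                  # observed default share

round(c(error_rate = error_rate, sensitivity = sensitivity,
        specificity = specificity, base_rate = base_rate), 4)
```

Comparing `error_rate` with `base_rate` is a useful sanity check: when almost every customer is classified as non-default, the error rate necessarily sits close to the observed default share.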

A ROC curve is plotted with False Positive Rate (FPR) on the x-axis and True Positive Rate (TPR) on the y-axis. It displays the percentage of true positives predicted by the model as the prediction probability cutoff is decreased from 1 to 0. The higher the AUC (Area Under the Curve), the more accurately our model predicts the values for the response variable.

## [1] 0.358186 is a gini for validation dataset
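The AUC and GINI can also be computed without a dedicated package. The sketch below uses the rank (Mann-Whitney) form of the AUC and the relation GINI = 2 × AUC - 1; the helper `auc_rank` and the toy scores are hypothetical and are not the code used in this report.

```r
# AUC as the Mann-Whitney statistic: the probability that a randomly chosen
# defaulter receives a higher score than a randomly chosen non-defaulter
auc_rank <- function(score, label) {
  r  <- rank(score)                 # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy scores and outcomes (illustrative only)
score <- c(0.08, 0.29, 0.16, 0.07, 0.12, 0.10)
label <- c(0,    1,    0,    0,    1,    0)

auc  <- auc_rank(score, label)
gini <- 2 * auc - 1
c(AUC = auc, GINI = gini)
```

Applied to the figures in this report, an AUC of 0.6790929 gives 2 × 0.6790929 - 1 = 0.3581858, which agrees with the GINI printed above.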

8.1.3.3 Goodness_fit and over-dispersion for logit model

Here, a goodness-of-fit test and an over-dispersion test are performed on the logit model.

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.0026, p-value = 0.408
## alternative hypothesis: greater

The null hypothesis of the dispersion test is that the model residuals are equi-dispersed. If the p-value is less than the significance level (0.05 by default), the null is rejected and there is over-dispersion in the residuals (not under-dispersion, because the alternative is set to greater, so a one-sided test is performed). Here the p-value (0.408) is much greater than 0.05, so there is no evidence of over-dispersion.
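A sketch of how a test of this kind can be run with the DHARMa package, on a small simulated logit model. The simulated data, the number of simulations (250) and the guard around the package call are assumptions; only the one-sided alternative is taken from the output above.

```r
set.seed(7)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1.5 + 0.8 * x))    # simulated binary response
fit <- glm(y ~ x, family = binomial(link = "logit"))

if (requireNamespace("DHARMa", quietly = TRUE)) {
  # Simulate new responses from the fitted model and build scaled residuals
  simulationOutput <- DHARMa::simulateResiduals(fittedModel = fit, n = 250)
  # One-sided test: the alternative hypothesis is over-dispersion
  disp <- DHARMa::testDispersion(simulationOutput,
                                 alternative = "greater", plot = FALSE)
  print(disp)
} else {
  message("DHARMa is not installed; skipping the simulation-based test.")
}
```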

8.2 Probit model

Now let's look at the binomial model with a probit link to see how the data behave under a different link function.

## 
## Call:
## glm(formula = repay_fail ~ term + int_rate + emp_length + annual_inc, 
##     family = binomial(link = "probit"), data = training_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.921e+00  4.736e-02 -40.553  < 2e-16 ***
## term60 months        1.556e-01  2.353e-02   6.614 3.75e-11 ***
## int_rate             8.133e-02  2.991e-03  27.194  < 2e-16 ***
## emp_length1 year     6.840e-03  4.315e-02   0.159   0.8740    
## emp_length10+ years  7.703e-02  3.514e-02   2.192   0.0284 *  
## emp_length2 years   -7.874e-02  4.078e-02  -1.931   0.0535 .  
## emp_length3 years   -8.853e-03  4.143e-02  -0.214   0.8308    
## emp_length4 years   -6.631e-02  4.380e-02  -1.514   0.1301    
## emp_length5 years   -1.760e-02  4.386e-02  -0.401   0.6883    
## emp_length6 years    3.596e-03  4.961e-02   0.072   0.9422    
## emp_length7 years    1.641e-02  5.334e-02   0.308   0.7583    
## emp_length8 years   -3.456e-02  5.879e-02  -0.588   0.5566    
## emp_length9 years   -8.062e-02  6.346e-02  -1.270   0.2040    
## annual_inc          -3.134e-06  2.725e-07 -11.501  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22018  on 26179  degrees of freedom
## Residual deviance: 20729  on 26166  degrees of freedom
## AIC: 20757
## 
## Number of Fisher Scoring iterations: 5

The model performance is essentially the same as for the logit model: the AIC, residual deviance and coefficient z-statistics are very similar.

8.2.1 Prediction on the Validation Set

##          1          2          3          4          5          6          7 
## 0.07634873 0.28893915 0.15885497 0.06322380 0.08153112 0.09749245 0.18017971 
##          8          9         10 
## 0.38983898 0.18212017 0.21170907

8.2.2 Model Diagnostics

As mentioned earlier in this report, model diagnostics contains a few methods. Of course there is no necessity to use them all, but we covered most of them to make sure the results are solid.

8.2.2.1 Confusion Matrix

##                            
## probit_model_pred_repayfail    0    1
##                           0 9488 1700
##                           1    4    0

## [1] 0.152252 is a misclassification error for validation dataset

From the confusion matrix above, we see that out of 11192 loan customers, the model correctly predicted that 9488 repaid the loan, while it correctly identified none of those who failed to repay. The other 4 and 1700 are incorrectly classified by the model, so the misclassification error is roughly 15%, about the same as for the logit model.

8.2.2.2 Goodness_fit and over-dispersion for probit model

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.0023, p-value = 0.428
## alternative hypothesis: greater

Again, the p-value of the dispersion test (0.428) is far greater than the significance level, so we fail to reject the null hypothesis that the residuals of the probit model are not over-dispersed. This result suggests that the model is appropriately fitted to the data.

8.2.2.3 Predictive Performance

The AUC and GINI values are calculated from this model to be used in comparison in the following sections.

## [1] 0.358412 is a gini for validation dataset

8.3 Cloglog Model

Now the binomial model with cloglog link function is used to model the data.

## 
## Call:
## glm(formula = repay_fail ~ term + int_rate + emp_length + annual_inc, 
##     family = binomial(link = "cloglog"), data = training_data)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.246e+00  7.993e-02 -40.609  < 2e-16 ***
## term60 months        2.413e-01  3.785e-02   6.376 1.82e-10 ***
## int_rate             1.311e-01  4.899e-03  26.765  < 2e-16 ***
## emp_length1 year     2.426e-02  7.108e-02   0.341   0.7328    
## emp_length10+ years  1.450e-01  5.763e-02   2.515   0.0119 *  
## emp_length2 years   -1.198e-01  6.811e-02  -1.759   0.0785 .  
## emp_length3 years    5.152e-04  6.844e-02   0.008   0.9940    
## emp_length4 years   -1.044e-01  7.300e-02  -1.430   0.1527    
## emp_length5 years   -2.461e-02  7.275e-02  -0.338   0.7351    
## emp_length6 years    1.688e-02  8.150e-02   0.207   0.8360    
## emp_length7 years    3.427e-02  8.702e-02   0.394   0.6937    
## emp_length8 years   -5.174e-02  9.867e-02  -0.524   0.6000    
## emp_length9 years   -1.354e-01  1.078e-01  -1.256   0.2090    
## annual_inc          -5.623e-06  4.765e-07 -11.799  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22018  on 26179  degrees of freedom
## Residual deviance: 20747  on 26166  degrees of freedom
## AIC: 20775
## 
## Number of Fisher Scoring iterations: 5

The model performance is essentially the same as for the probit and logit models: the AIC, residual deviance and coefficient z-statistics are very similar.

8.3.1 Model Prediction

Here, the response variable is predicted using the binomial model with cloglog link function as for other models.

##          1          2          3          4          5          6          7 
## 0.08187803 0.28535118 0.15695383 0.06900021 0.08479532 0.09625984 0.17753554 
##          8          9         10 
## 0.41303988 0.17880359 0.20312342

8.3.2 Model Diagnosis

8.3.2.1 Confusion Matrix

##                             
## cloglog_model_pred_repayfail    0    1
##                            0 9482 1692
##                            1   10    8

## [1] 0.152073 is a misclassification rate for validation dataset

From the results, out of 11192 loan customers, the model correctly predicted that 9482 repaid the loan and that 8 defaulted. The other 10 and 1692 are incorrectly classified, so the misclassification error is roughly 15%, almost the same as for the logit and probit models.

8.3.2.2 Goodness of Fit and over-dispersion test

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.0035, p-value = 0.376
## alternative hypothesis: greater

Again, the null hypothesis of the dispersion test (there is no over-dispersion in the residuals) is not rejected, so we can regard this model as appropriate.

8.3.2.3 Predictive Performance

As mentioned earlier, the greater the AUC, the better the performance of the model. This criterion and the GINI value are calculated for this model.

9. Compare the models and choosing the best model

These values can be tabulated or plotted. Note that sensitivity is the TPR and specificity is 1 - FPR.

Model	AIC	BIC	log-Likelihood	GINI	AUC	Error_Rate
logit	20767.27	20881.68	-10369.63	0.3581857	0.6790929	0.1521623
probit	20756.87	20871.28	-10364.43	0.3584123	0.6792061	0.1522516
cloglog	20775.03	20889.45	-10373.51	0.3579239	0.6789620	0.1520729

In terms of AIC/BIC, the probit model has the lowest values, although the differences between the candidates are small. As mentioned in section 8.1, since the response variable is binary (0 and 1), the logit model was considered the most natural choice of the three. However, the confusion matrices, ROC curves, GINI values, goodness-of-fit results and dispersion test results were all considered before a model was finally chosen. There is no meaningful difference between the three models in terms of ROC, GINI and prediction error rates. The misclassification rate is roughly 15%, and the AUC, which summarizes the ROC curve over all cutoffs, is roughly 0.68 for all three models. In addition, GINI is about 0.36 across the three models. We therefore conclude that the three models differ only slightly, and only in their AIC and BIC values, as stated above.

It was finally decided that logit model is the best fit for this dataset. As such, we retained it as final model.
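A comparison table of this kind can be assembled by fitting the same formula under each link and collecting `AIC()`, `BIC()` and `logLik()`. The sketch uses simulated data and a single predictor, so its numbers are not those of the table above.

```r
set.seed(99)
n <- 3000
sim <- data.frame(int_rate = rnorm(n, 12, 3.7))
sim$repay_fail <- rbinom(n, 1, plogis(-3.3 + 0.15 * sim$int_rate))

# Fit the same formula under each of the three link functions
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(l) {
  glm(repay_fail ~ int_rate, family = binomial(link = l), data = sim)
})
names(fits) <- links

comparison <- data.frame(
  Model  = links,
  AIC    = sapply(fits, AIC),
  BIC    = sapply(fits, BIC),
  logLik = sapply(fits, function(m) as.numeric(logLik(m)))
)
comparison
```

The columns are linked by AIC = -2 × logLik + 2k, where k is the number of estimated coefficients. For the logit row of the table above, -2 × (-10369.63) + 2 × 14 = 20767.26, in line with the reported AIC of 20767.27.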

10 Final Model with exponentiated confidence intervals, odds ratios (OR) and other parameters

The additional model statistics are interpreted in the following paragraphs, with the understanding that the values of the other predictors in the model are held constant. Additionally, “Confidence intervals provide additional information as to the certainty of our results of a study, and to the likely effect size of any intervention or risk factor” (Smith, 2012; pp. 141 - 142). Note that the confidence intervals are based on the log-likelihood function of the logistic model. The width of a confidence interval gives us some idea of how uncertain we are about the credit risk: a wide interval means the population parameter is estimated less precisely, while a narrow interval means the opposite. Here the confidence intervals for the log odds ratios are exponentiated, so they are intervals for odds ratios. That is why they are compared with 1, the value indicating no effect, whereas intervals for log odds ratios are compared with 0. The lower and upper bounds of each interval are given by the 2.5% and 97.5% quantiles, respectively.

Some example interpretations are:

Customers whose loan term is 60 months have 32% greater odds of defaulting than those with the shorter 36-month term. The confidence interval for this odds ratio is 1.2114 to 1.4284, that is, an increase in odds of between 21% and 43%. Because the interval spans about 22 percentage points, this estimate is not very precise.
For annual income, the odds ratio per one-unit increase rounds to 1, and its confidence interval rounds to \((1.000, 1.000)\). This does not mean that income has no effect: the coefficient is negative and significant, but a single unit of income is so small that the change in odds per unit is tiny. The very narrow interval indicates a precise estimate (see Appendix as well).
Each one-unit increase in interest rate is associated with roughly 16% greater odds of defaulting (\(e^{0.1462} \approx 1.157\)). The confidence interval for this increase is 15% to 17%. Because the interval spans only about 2 percentage points, this estimate is quite precise (see Appendix as well).
Customers whose employment length is 10+ years have 17% greater odds of defaulting than those in the reference category (less than 1 year). The confidence interval for this increase is 3% to 32%. Because the interval spans about 29 percentage points, this estimate is not very precise.
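The odds ratios above can be approximated directly from the coefficient table. The sketch below uses Wald intervals (estimate ± 1.96 standard errors, then exponentiated), which is an assumption made for illustration: the intervals reported in this section are likelihood-based, so the bounds may differ slightly in the last decimals. The vector names are illustrative.

```r
# Estimates and standard errors copied from the logit summary table
est <- c(term60        = 2.741e-01,
         int_rate      = 1.462e-01,
         emp_length10p = 1.530e-01,
         annual_inc    = -6.066e-06)
se  <- c(4.205e-02, 5.490e-03, 6.393e-02, 5.184e-07)

z <- qnorm(0.975)                       # about 1.96 for a 95% interval
or_table <- data.frame(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
round(or_table, 4)

# Rescaling annual_inc makes its effect visible: odds ratio per 10,000 units
exp(1e4 * est["annual_inc"])
```

Reporting the income effect per 10,000 units rather than per single unit is one way to avoid an odds ratio that rounds to exactly 1.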

11 Cross Validation on Final Model

Now that the final model is selected, cross-validation is used to assess its performance on the validation dataset. It is performed by calculating the ROC/AUC, GINI and rate of correct prediction on both the training and validation datasets.

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## [1] 0.348455 is a gini for training dataset

## [1] 0.358186 is a gini for validation dataset

##                            
## confusion_matrix_validation    0    1
##                           0 9446 1689
##                           1    5    2

## [1] 0.152037 is a misclassification error for validation dataset

##                          
## confusion_matrix_training     0     1
##                         0 22274  3890
##                         1    12     4

## [1] 0.149045 is a misclassification error for training dataset

From the results, the GINI of the logit model on the training data is 0.348455, which corresponds to an AUC of about 0.674 (AUC = (GINI + 1)/2). On the validation set the GINI is 0.358186, an AUC of about 0.679. The model is therefore neither overfitting nor underfitting, because its performance on the two data sets is almost the same.

On classification accuracy, the number of correctly classified persons in the training set is \(22,274+4=22,278\) and the number of misclassified persons is \(12+3890 = 3,902\). The classification accuracy is roughly 85% (\(\frac{22,278}{26180}*100\)).
In the validation data set, the number of correctly classified persons is \(9446 + 2 = 9,448\), while \(1,694\) persons are misclassified. The classification accuracy is roughly 85% (\(\frac{9,448}{11142} *100\)). The rate of correct prediction is thus about 85% on both datasets.

In conclusion, because the ROC/GINI and the rate of correct prediction are almost the same on the two datasets, the model appears to generalize well to new input: it performs similarly on the training and validation data sets. On this basis, the model's classification of loan default from the given variables is considered accurate, reliable and valid. However, there is one concerning finding: the AUC is only about 0.68, meaning that the model ranks a randomly chosen defaulter above a randomly chosen non-defaulter only about 68% of the time. This limited discriminative ability leaves the bank exposed to some default risk, so it is up to the bank to devise strategies to mitigate that risk.

12 Extracting theoretical equation of coefficients

In summary, the theoretical formula for the final model can be written as below:

## 
## Attaching package: 'equatiomatic'

## The following object is masked from 'package:datasets':
## 
##     penguins

\[ \begin{aligned} \log\left[ \frac { \widehat{P( \operatorname{repay\_fail} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{repay\_fail} = \operatorname{1} )} } \right] &= -3.32 + 0.27(\operatorname{term}_{\operatorname{60\ months}}) + 0.15(\operatorname{int\_rate}) + 0.02(\operatorname{emp\_length}_{\operatorname{1\ year}})\ + \\ &\quad 0.15(\operatorname{emp\_length}_{\operatorname{10+\ years}}) - 0.14(\operatorname{emp\_length}_{\operatorname{2\ years}}) - 0.01(\operatorname{emp\_length}_{\operatorname{3\ years}}) - 0.12(\operatorname{emp\_length}_{\operatorname{4\ years}})\ - \\ &\quad 0.03(\operatorname{emp\_length}_{\operatorname{5\ years}}) + 0.01(\operatorname{emp\_length}_{\operatorname{6\ years}}) + 0.04(\operatorname{emp\_length}_{\operatorname{7\ years}}) - 0.06(\operatorname{emp\_length}_{\operatorname{8\ years}})\ - \\ &\quad 0.15(\operatorname{emp\_length}_{\operatorname{9\ years}}) - 6.07 \times 10^{-6}(\operatorname{annual\_inc}) \end{aligned} \]

The explanation below repeats that of section 11. Note that each variable is interpreted with all other variables held constant.

Based on this equation for the logit model, the positive coefficients suggest that int_rate, a term of 60 months and emp_length of 1, 6, 7 and 10+ years increase the log odds of a customer defaulting on loan repayment, relative to the reference levels of term and emp_length. Possible reasons are:

the bank may have imposed high interest rates on customers without stable employment;
customers may have lost their jobs while the loan balance accumulated.

However, emp_length of 2, 3, 4, 5, 8 and 9 years decreases the log odds of default. Therefore, emp_length affects loan repayment in both directions, depending on its level.

The coefficient of annual_inc appears as 0 in the final model only because it is rounded to two decimal places: annual_inc is measured in dollars, so the change in the log odds of default per one-dollar increase is very small. As expected, the higher the annual income, the less likely the customer is to default on the loan (see the appendix).

13. Comparing the Old and the New Model

##                      AUC_NEW  GINI_NEW   AUC_OLD  GINI_OLD
## Training Dataset   0.6741824 0.3483648 0.5570000 0.1140000
## Validation Dataset 0.6792061 0.3584123 0.5550000 0.1100000

The Gini of the new (logit) model on both the training and validation data sets is substantially greater than that of the old model. Therefore, it was concluded that the new model performs better than the old model.

14. Answering the Two Main Questions

The two questions are:

How does this model perform compared to the one you used previously? How can it be expected to perform on new loan applications? Note that there are some performance benchmarks (presented as RoC curves) available for your old model, see the project folder on Blackboard.

Solution 1

Comparing the old and the new model in terms of Gini and AUC/ROC (see the table in section 13), the new (logit) model's:

Gini on the training dataset is about 3.1 times the old training Gini (0.348 versus 0.114);
Gini on the validation dataset is about 3.3 times the old validation Gini (0.358 versus 0.110);
ROC/AUC on the training dataset is about 1.2 times the old training AUC (0.674 versus 0.557);
ROC/AUC on the validation dataset is about 1.2 times the old validation AUC (0.679 versus 0.555).

Therefore, overall, the logit model's Gini is about 3 times, and its AUC about 1.2 times, that of the old model. Because the training and validation results are almost the same, the new model can be expected to rank new loan applications about as well as it ranks the validation data, and better than the old model. However, there is one concerning finding: the AUC is only about 0.68. This is not a classification accuracy of 68%; it means that the model ranks a randomly chosen defaulter above a randomly chosen non-defaulter about 68% of the time. The model's ability to separate defaulters from non-defaulters is therefore only moderate, which leaves the bank with some residual risk of default, and it is up to the bank to strategise to mitigate this risk.
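The ratios quoted in this solution can be reproduced from the AUC values in the comparison table, using the standard relationship Gini = 2 × AUC − 1. The sketch below is only a check of that arithmetic; the variable names are chosen for illustration.

```r
# AUC values copied from the comparison table in section 13
auc_new <- c(training = 0.6741824, validation = 0.6792061)
auc_old <- c(training = 0.5570000, validation = 0.5550000)

# Gini coefficient implied by each AUC
gini_new <- 2 * auc_new - 1
gini_old <- 2 * auc_old - 1

# How many times larger the new model's metrics are than the old model's
gini_ratio <- gini_new / gini_old
auc_ratio  <- auc_new / auc_old

print(round(rbind(gini_new, gini_old, gini_ratio, auc_ratio), 3))
```

The Gini ratio is much larger than the AUC ratio because Gini measures performance above the random-guess baseline (AUC of 0.5), so the two ratios describe the same improvement on different scales.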

What are the important variables in this model and how do they compare to variables that are traditionally important for predicting credit risk in the banking sector? One regulatory requirement for lenders is that they need to clearly explain how a loan application was assessed. To demonstrate to management that this can be achieved, clearly interpret all covariates in your model in terms of their effect on predicting credit risk.

Solution 2

The important variables in the model are term, int_rate, emp_length and annual_inc. To compare the model against the traditional variables, we adopted the evaluation of creditworthiness that follows the 5 Cs of Credit and their variables (CFI, 2005). These 5 Cs are:

‘character’ (propensity to repay a debt on time), such as past defaults, credit type, payment terms and FICO score;
‘capacity’ (debt repayment ability), namely income, employment history, job stability and Debt-to-Income (DTI) ratio;
‘collateral’ (for secured loan assessment), namely collateral value;
‘capital’ (total assets owned by the borrower), for instance investments and liquid assets such as savings;
‘conditions’ (loan transaction specifics), such as principal amount, interest rate, the borrower’s purpose of funds, and economic and political conditions.

This comparison reveals that no variables from the ‘character’ category were used, while some variables from the other 4 Cs were incorporated to build the model for predicting credit risk. This could partly explain why the model's performance is only moderate (an AUC of about 0.68 and a classification accuracy of about 85%) and why annual income appears to have almost no impact on credit risk.

Based on the coefficients and confidence intervals of the logit model, the positive coefficients suggest that int_rate, a term of 60 months and emp_length of 1, 6, 7 and 10+ years increase the log odds of customers defaulting on loan repayment. More specifically, with all other variables held constant:

Customers whose loan term is 60 months have 32% greater odds of defaulting than customers with the shorter (reference) loan term. The confidence interval for this odds ratio is (1.2114, 1.4284), that is, an increase in odds of between 21% and 43%. Because the interval is about 22 percentage points wide, this estimate is not very precise.
As expected, an increase in annual income does not increase the odds of defaulting. The odds ratio per one-unit increase in annual_inc is 1.000 to three decimal places, with a confidence interval of \((1.000, 1.000)\). The interval is extremely narrow, so the estimate is very precise, but the effect of a single unit of income is negligible; over larger differences in income, customers with higher income are less likely to default (see the appendix).
Each one-unit increase in int_rate is associated with 15% greater odds of defaulting. The confidence interval for this increase is 15% to 17%. Because the interval is only about 2 percentage points wide, this estimate is fairly precise.
Customers whose employment length is 10+ years have 17% greater odds of defaulting than customers in the reference employment-length category. The confidence interval for this increase is 3% to 32%. Because the interval is about 29 percentage points wide, this estimate is not very precise.
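The percentages above come from exponentiating the logit coefficients to obtain odds ratios. The sketch below repeats that conversion using the rounded coefficients from the equation in section 12, so its results differ slightly from the figures reported above, which were presumably computed from the unrounded estimates. The variable names are assumptions for illustration.

```r
# Rounded logit coefficients taken from the equation in section 12
coefs <- c(term_60_months = 0.27, int_rate = 0.15, emp_length_10plus = 0.15)

# Odds ratio = exp(coefficient); percentage change in odds = 100 * (OR - 1)
odds_ratio <- exp(coefs)
pct_change <- 100 * (odds_ratio - 1)
print(round(cbind(odds_ratio, pct_change), 2))

# Reported confidence interval for the 60-month term, on the odds-ratio scale
ci_term <- c(lower = 1.2114, upper = 1.4284)
ci_term_pct <- 100 * (ci_term - 1)      # limits as percentage change in odds
ci_width <- unname(diff(ci_term_pct))   # interval width in percentage points
print(round(c(ci_term_pct, width = ci_width), 1))
```

A narrower interval on this scale indicates a more precisely estimated effect, which is the basis for the precision statements above.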

Possible reasons for these effects are:

the bank may have imposed high interest rates on customers with no formal employment;
customers may have lost their jobs while the loan balance accumulated.

Therefore, the lenders' explanation to management of how a loan application is assessed should be centered on the term of loan repayment (60 months), the interest rate and the employment length of 10+ years, because these were the main factors contributing to customers defaulting on loan repayment. The main focus should be on the interest rate because, as its confidence interval suggests, the estimated 15% increase in the odds of default is quite precise (an interval width of about 2 percentage points). Annual income should also be discussed: its odds ratio per unit of income is 1.000, with a confidence interval of (1.000, 1.000), so a small change in income has virtually no effect, while customers with very high income are less likely to default (see the appendix) and are therefore lower-risk borrowers.

One way to address these risks is for the bank to reconsider the interest rate it charges on the principal amount, the term of loan repayment, and its lending to customers with emp_length of 10+ years. In addition, lenders should verify the customer's source of income, the kind of collateral pledged as security for the loan and the type of account the customer maintains.

15. Conclusion

The objective of the report was to build a binomial regression model for predicting loan default using the available bank data, comprising 36 predictor variables and 1 response variable (repay_fail). It was also intended to identify and evaluate potential risks associated with granting loans to customers, so that the lenders and bank management can take measures to minimize the impact of these risks, one of which is revenue loss. A further aim was to assess which factors contribute more or less strongly to customers defaulting on loan repayment.

Based on the model, it was found (from the positive coefficients) that the interest rate, a repayment term of 60 months and an employment length of more than 10 years are associated with higher odds of default, while annual income was found to have almost no impact per unit of income, with customers on very high annual incomes tending to repay their loans successfully. The logit model's prediction accuracy (customer defaulting or not) is between 76% and 85%, which is much better than the old model. There nevertheless remains a risk of customers defaulting, so the lenders and bank management must make informed decisions around these risk factors in order to minimize the risk of revenue loss.

Although the model generalizes to new data well, with no evidence of overfitting or underfitting (see the cross-validation in section 10), the 15% misclassification error suggests that the model will not always classify customers correctly as potential loan defaulters or not. This limitation could partly be attributed to:

some variables having a large proportion of unknown or unverified values, for example verification status;
some variables having many values of “NOT AVAILABLE”.

16. References

CFI. C. E. (2005). 5 Cs of Credit. Retrieved 2005 - 2021 from https://corporatefinanceinstitute.com/resources/knowledge/credit/5-cs-of-credit/

Hilbe, J. M. (2015). Practical guide to logistic regression. CRC Press LLC.

Leblebici, H., & Salancik, G. R. (1981). Effects of environmental uncertainty on information and decision processes in banks. Administrative Science Quarterly, 578-596.

Smith, C. J. (2012). Interpreting confidence intervals. Phlebology, 27(3), 141–142. https://doi.org/10.1258/phleb.2012.012j02

17. Appendix

Verifying the predictions from our model using plots

# Predict on the link (log-odds) scale with standard errors, so that plogis()
# below can back-transform the fit and its confidence limits to probabilities
data_termP <- predict(logit_model, training_data, type = "link", se.fit = TRUE)
newdata3 <- cbind(training_data, data_termP)  # adds columns fit, se.fit and residual.scale

# Predicted probabilities on the response scale
newdata3$emp_lengthP <- predict(logit_model, training_data, type = "response")
head(newdata3)

# upper limit and lower limits of the predictions using plogis
newdata5 <- within(newdata3,{
  PredictedProb <- plogis(fit)
  LL <- plogis(fit-(1.96*se.fit))
  UL <- plogis(fit+(1.96*se.fit))
})
head(newdata5)

# Predicted probability of repay_fail versus int_rate
A <- ggplot(newdata5, aes(x = int_rate, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = term), alpha = 0.5) +
  geom_line(aes(colour = term), linewidth = 1)  # linewidth replaces size for lines (ggplot2 >= 3.4.0)

# Predicted probability of repay_fail versus annual_inc
B <- ggplot(newdata5, aes(x = annual_inc, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = term), alpha = 0.5) +
  geom_line(aes(colour = term), linewidth = 1)

# Show both plots side by side
ggarrange(A, B, ncol = 2, nrow = 1)

Because we have a large number of observations, we group the data into quantile bins and plot the mean within each bin.
The plot of predicted repay_fail versus annual_inc shows that, as the annual income of the customer increases, the predicted risk of default decreases linearly, because the predicted repay_fail value moves closer to zero. The rate of change in the predicted repay_fail is small, which is why it appears almost equal to zero in our summary table. However, the change is statistically significant.

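# Group annual_inc into 50 quantile bins (2% each) and average within each bin
y <- newdata5$PredictedProb
x <- newdata5$annual_inc
g <- cut(x, breaks = quantile(x, seq(0, 100, 2) / 100, na.rm = TRUE), include.lowest = TRUE)  # keep the minimum in the first bin
ym <- tapply(y, g, mean, na.rm = TRUE)  # mean predicted probability per bin
xm <- tapply(x, g, mean, na.rm = TRUE)  # mean annual income per bin
datas <- data.frame(xm1 = xm, ym1 = ym)

plot(xm, ym, xlab = "annual income in $", ylab = "predicted repay_fail", main = "predicted repay_fail - annual_inc relationship")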

Again, because we have a large number of observations, we group the data into quantile bins and plot the mean within each bin. The plot of predicted repay_fail versus the interest rate shows that, as the interest rate increases, the risk of default increases accordingly, because the predicted repay_fail value moves closer to 1, which indicates default. The rate of change is small but statistically significant.

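# Group int_rate into 50 quantile bins (2% each) and average within each bin
y <- newdata5$PredictedProb
x <- newdata5$int_rate
g <- cut(x, breaks = quantile(x, seq(0, 100, 2) / 100, na.rm = TRUE), include.lowest = TRUE)  # keep the minimum in the first bin
ym <- tapply(y, g, mean, na.rm = TRUE)  # mean predicted probability per bin
xm <- tapply(x, g, mean, na.rm = TRUE)  # mean interest rate per bin

plot(xm, ym, xlab = "interest rate %", ylab = "predicted repay_fail", main = "predicted repay_fail - interest rate relationship")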

As before, we group the data into quantile bins and plot the mean within each bin. The plot of the mean observed repay_fail from the bank's data versus the mean predicted repay_fail shows a linear relationship, which means that our predicted default probabilities agree well with the observed default rates. There are slight differences between the two, which might be caused by the limitations of our model stated in the conclusion above.

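# Group the predicted probabilities into 50 quantile bins and compare the
# mean observed repay_fail with the mean predicted probability in each bin
y <- newdata5$repay_fail
x <- newdata5$PredictedProb
g <- cut(x, breaks = quantile(x, seq(0, 100, 2) / 100, na.rm = TRUE), include.lowest = TRUE)  # keep the minimum in the first bin
ym <- tapply(y, g, mean, na.rm = TRUE)  # observed default rate per bin
xm <- tapply(x, g, mean, na.rm = TRUE)  # mean predicted probability per bin
plot(xm, ym, xlab = "predicted repay_fail", ylab = "repay_fail", main = "repay_fail versus predicted probability")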