Predicting Loan Defaults with Logistic Regression

Executive Summary

The purpose of this report is to show how computer-based predictive modeling can be used at a bank to predict bad loans, loans that are to default, vital information that can be used to reject the loan at the time of application, and, thereby saving the bank from loses incurred otherwise. Any bank with such a capability to reject potential bad loans and only approve potential good loans (loans that are paid-off in full), can tremendously increase its profit margin.

The method used in this analysis is binary prediction (only having two possible outcomes) using Logistic Regression. A logistic model, aka logit model, will produce probabilities for a certain outcome. In this case, the outcome being a loan to be good. When we consider all loans with a probability greater than 0.5 as good loans and the rest as bad, we get 97.38% of good loans rightly predicted as good and 13.69% of bad loans rightly predicted as bad. Giving a total accuracy of 79.46%. This also means there is a percentage of error, little above 20%. Analyzing the accuracy at different levels of threshold probability, the probability greater than which is a loan good, we perceive some interesting results. At a threshold of .49 and .51 i.e., all loans greater than 0.49 or 0.51 probability marked as good and rest as bad, an overall maximum accuracy of 79.56% is achieved. But, it may be not at the highest prediction accuracy that the bank would make the highest profit. The analysis shows it is at a threshold probability of 0.67 that the profit is maximized. The overall accuracy at this threshold is reduced to 75.72%. Also is a reduced 84.70% of good loans rightly predicted as good and an increased 42.78% of bad loans rightly predicted as bad. The resulting profit is 2.3 times the current profit; considering profit as a simple equation of the sum of subtraction of total amount paid back by the loanee and amount originally loaned. Not to mention, the money saved from rejecting 42.78% of current bad loans can be reinvested in the business, taking the profits further than the above mentioned 2.3 times of the current bottom line.

In this analysis, we are using multiple predictors to build the logit model. But the model itself is not complex and uninterpretable. Rather the relationship between each predictor and the dependent variable is examined at a linear level. There are no interactions between the different predictors incorporated, nor any unreasonable assumptions made. Basically, such a simple model, very much human understandable, is able to produce such remarkable results. This stands out as easy reasoning to utilize this model at the bank.

Yes, there are future improvements that can be made to the modeling exercise. More complex procedures can be adopted. Quality of data collection can be improved. There are multiple steps that can be taken by an organization to create an environment to embrace data and data-driven decision making. The simple model developed in this presentation can more than double the profits. Had the bank rejected all the truly bad loans and only approved the good loans, the profit would have been 7.9 times the current profit. There is clearly a tremendous scope for model improvement. Profits can be tripled, quadrupled.. up to a maximum of eight times. The bank investing in predictive modeling is equivalent to investing in the future. Continuously retraining (re-train not retain!) the model is another requirement to keep up to date with changing trends. Short-term and long-term societal and ethical effects should be studied during the course of reaping fruits from predictive analytics.

Setup

Load packages

library(ggplot2) # data visualization 
library(dplyr)   # data manipulation 
library(tidyr)   # data manipulation 
library(gridExtra) # multiple plots in one graph
library(readr) # to read in file types like csv

# apply general knitr options
knitr::opts_chunk$set(comment=NA, fig.align='center')

Part 2: Introduction

This project is to build a predictive model using Logistic Regression to predict which applicants for a loan are likely to default. A dataset with 30 features (variables) and fifty thousand observations (records) are available. The goal of this project is to build a predictive model based on the available historic data and thereafter be able to predict if a new loan application was to default or not. Not all of the features need to be used for building the model. Also, not all of the observations need to be used. The rationale behind selecting just enough features, removing/imputing missing/bad data etc. will be explained in the coming steps.

The libraries required to run this R markdown are loaded. The dataset used in this research project can be downloaded from here: Dataset: loans50k.csv. And explanation of each field in the dataset from here: Explanation.

Load data

# load the data set
loans = read_csv("loans50k.csv")

Part 3: Preparing and Cleaning the Data

The dataset we loaded has 30 features (columns) and 50,000 observations (records). First of all we need to remove observations with status: Current, In Grace Period and Late. Then add a column to the dataset called response, which will be the response variable with two values Good and Bad. All observations with status ‘Fully Paid’ should be marked Good. And all observations with status ‘Charged Off’ and ‘Default’ should be marked Bad.

loans = loans %>% filter(status == "Fully Paid" | status == "Charged Off" | status == "Default")
loans = loans %>% mutate(response = case_when(status == "Fully Paid" ~ "Good", TRUE ~ "Bad"))
loans$response <- as.factor(loans$response)

Lets convert length of employment to a quantitative field. This may yield better performance while building the model. There are 1823 observations missing a value for length of employment. There is a relation seen between length and home ownership. The majority of the loan applicants with longer length of employment seem to have a mortgage and the ones with shortest length of employment seem to be living on RENT. Ones with in between length of employment seem to OWN a house. This can be seen on the below boxplot (fig 1.). Median length of employment for those with a mortgage is 8. Owning a home is 7. And those on rent is a 4. This relation can be used to impute missing values for length of employment.

loans = loans %>% mutate(length = case_when(length == "< 1 year" ~ 0, length == "1 year" ~ 1, 
                                            length == "2 years" ~ 2, length == "3 years" ~ 3,
                                            length == "4 years" ~ 4,length == "5 years" ~ 5, 
                                            length == "6 years" ~ 6, length == "7 years" ~ 7,
                                            length == "8 years" ~ 8, length == "9 years" ~ 9, 
                                            length == "10+ years" ~ 10, TRUE ~ NA_real_))

Loan amount and monthly payments are highly correlated. Only one of the predictors need to be in the dataset. The linear correlation between amount and payment is shown above in the scatter plot (fig 2.). Field payment can be deleted from the dataset. The predictor employment has 15,288 categorical values. Of which 12,572 have a single observation. 1,238 of the categorical values have two observations (cases). This makes model building with predictor employment impossible. Feature engineering required to reduce the number of categorical values in this predictor is being considered out of scope for this exercise. Hence field employment can be removed from the dataset as well.

loans = loans %>% select(-c(employment,payment))

There are only very few individuals with more than 20 plus accounts open in the last 24 months. Lumping them all to make value 20 (which will now act as 20 and greater) will help model building process. Also there are multiple states with very less loan applications. Here I am combining states with less than 150 applications to one. Now this new category has 648 cases. Similarly grouping all states that have number of applications between 150 and 175 as one to get 308 cases. Also are states with number of applications between 175 and 200 grouped as one to get 518 cases.

loans = loans %>% 
  mutate(accOpen24 = case_when(accOpen24 > 19 ~ 20, TRUE ~ as.numeric(accOpen24)))

other1 = loans %>% group_by(state) %>% summarise(count = n()) %>% 
  filter(count < 150) %>% pull(1)
other2 = loans %>% group_by(state) %>% summarise(count = n()) %>% 
  filter(count >= 150 & count < 175) %>% pull(1)
other3 = loans %>% group_by(state) %>% summarise(count = n()) %>% 
  filter(count >= 175 & count < 200) %>% pull(1)
loans = loans %>% mutate(state = case_when(state %in% other1~"OT",
                                   state %in% other2~"O2",
                                   state %in% other3~"O3",
                                   TRUE~state))

Part 4: Exploring and Transforming Data

Figure 3 above shows five non-categorical variables that can be transformed to again independent variables, but that seems to have be drawn from a more normally distributed population. The transformed predictors will help in model building.

Figure 4 above shows three predictors that are not from a normal distribution, even after log transformation. This doesn’t mean we are removing them as predictors.

Transforming independent variables

Note: log(x+1) transformation is used to eliminate infinite values as observation values.

loans = loans %>% mutate(income = log(income), totalBal = log(totalBal+1), totalRevLim = log(totalRevLim), 
                         avgBal = log(avgBal+1), bcOpen = log(bcOpen+1), pubRec = log(pubRec+1), 
                         inq6mth = log(inq6mth+1), delinq2yr = log(delinq2yr+1))

Remove rest of the missing values

sapply(loans, function(x) sum(is.na(x)))

     loanID      amount        term        rate       grade      length 
          0           0           0           0           0           0 
       home      income    verified      status      reason       state 
          0           0           0           0           0           0 
 debtIncRat   delinq2yr     inq6mth     openAcc      pubRec  revolRatio 
          0           0           0           0           0          15 
   totalAcc   totalPaid    totalBal totalRevLim   accOpen24      avgBal 
          0           0           0           0           0           0 
     bcOpen     bcRatio    totalLim totalRevBal  totalBcLim  totalIlLim 
        360         384           0           0           0           0 
   response 
          0

loans = loans %>% drop_na()

The above table shows the number of missing values for each predictor variable. The number of missing values are too small compared to the sample size. Hence the rows with missing values are removed.

Exploring Categorical variables

Figure 5. Distribution of response.
The figure shows the number of good loans in the sample is more than thrice that of the bad loans. As we are considering the sample to be random. The population of all loans would have a similar mix (i.e three in very four loans are good).

Fig 6: From above interleaved histogram you can see, a high percentage of good loans have 36 months term. Term can be a strong factor to predict good loans.

Fig 7: From above overlaid histogram you can see, with lesser interest rate, chances of loan being good increases. Whereas chances of being a bad loan follows a normal curve. Rate can be a strong factor to predict good and bad loans.

Fig 8: From above interleaved histogram you can see, with a smaller grade (as in B is smaller than C), chances of loan being good increases. Whereas chances of being a bad loan follows more of a normal curve. Grade can be a strong factor to predict good and bad loans.

Fig 9: From above interleaved histogram you can see, those applicants having a home yet to be paid-off tend to pay-fully their loans compared to those living in a rented house.

Fig 10: From above interleaved histogram you can see, number of good loans for applicants whose income is not verified is greater than those whose income is verified. Similarly for bad loans, the number of bad loans for applicants whose income is verified is greater than those whose income is NOT verified. You would have expected this the other way. But data shows another picture.

Part 5: The Logistic Model

Let’s first split the dataset randomly into two parts - training dataset (with 80% of data) and testing dataset (with rest 20%). We can then build a logistic regression model to predict the outcome of a loan as Good or Bad using training dataset. The model’s accuracy can be validated using testing dataset.

smp_size = floor(0.8 * nrow(loans)) ## 80% of the sample size
set.seed(10)
train_ind = sample(seq_len(nrow(loans)), size = smp_size)
train = loans[train_ind, ]
test = loans[-train_ind, ]
train = train %>% select(-c(totalPaid,status, loanID)) ## remove fields not required for model training

Lets build the model and display the coefficient for each predictor. Larger the value of the coefficient, the more linearly related is the predictor with the response variable.

full <- glm(response~.,data=train, family = "binomial") ## building the model using all the predictors in the dataset
full$coefficients

             (Intercept)                   amount            term60 months 
            2.428576e-02            -1.558246e-05            -6.592648e-01 
                    rate                   gradeB                   gradeC 
           -1.310808e+00            -3.731285e-01            -7.596691e-01 
                  gradeD                   gradeE                   gradeF 
           -9.372763e-01            -1.029361e+00            -1.127413e+00 
                  gradeG                   length                  homeOWN 
           -1.173863e+00             5.878831e-03            -6.965745e-02 
                homeRENT                   income  verifiedSource Verified 
           -1.751298e-01             1.060732e-01            -8.948209e-02 
        verifiedVerified        reasoncredit_card reasondebt_consolidation 
           -9.864499e-02            -2.522726e-01            -1.930330e-01 
  reasonhome_improvement              reasonhouse     reasonmajor_purchase 
           -2.394693e-01             5.389878e-02            -1.821646e-01 
           reasonmedical             reasonmoving              reasonother 
           -3.843231e-01            -7.134167e-01            -2.591611e-01 
  reasonrenewable_energy     reasonsmall_business           reasonvacation 
           -1.213306e-01            -7.810495e-01            -1.919047e-01 
           reasonwedding                  stateAR                  stateAZ 
            7.817590e+00            -2.903484e-02             1.386271e-01 
                 stateCA                  stateCO                  stateCT 
            1.846190e-01             7.933552e-01             1.268808e-01 
                 stateFL                  stateGA                  stateIL 
            1.156282e-01             2.082662e-01             1.299632e-01 
                 stateIN                  stateKS                  stateKY 
            8.062240e-02             6.286519e-01             5.718082e-02 
                 stateLA                  stateMA                  stateMD 
            1.195079e-01             1.108594e-01             2.398850e-02 
                 stateMI                  stateMN                  stateMO 
            1.732192e-01             1.701641e-01            -4.145073e-02 
                 stateNC                  stateNJ                  stateNM 
            6.043483e-02            -3.108029e-02            -1.328382e-01 
                 stateNV                  stateNY                  stateO2 
            3.262498e-02            -4.752639e-02             7.364619e-02 
                 stateO3                  stateOH                  stateOK 
            2.791455e-01             1.914548e-01            -1.300323e-01 
                 stateOR                  stateOT                  statePA 
            3.213482e-01             2.393255e-01             1.423103e-01 
                 stateSC                  stateTN                  stateTX 
            4.836058e-01             5.493670e-02             1.509311e-01 
                 stateUT                  stateVA                  stateWA 
            6.517743e-01             2.731042e-01             5.318278e-01 
                 stateWI               debtIncRat                delinq2yr 
            2.239176e-01            -2.618465e-02            -1.841339e-01 
                 inq6mth                  openAcc                   pubRec 
           -1.635515e-01            -1.246498e-02             8.255689e-03 
              revolRatio                 totalAcc                 totalBal 
           -4.252571e-01             1.031602e-02             5.696605e-02 
             totalRevLim                accOpen24                   avgBal 
            1.433371e-01            -8.433119e-02             1.755811e-02 
                  bcOpen                  bcRatio                 totalLim 
            2.494401e-02             2.037354e-03             3.738456e-07 
             totalRevBal               totalBcLim               totalIlLim 
           -2.637391e-06             2.399392e-06             4.002850e-06

Now lets use this model to predict the outcomes (response variable) for the observations in the test dataset. We have the actual response variable, i.e. if the loan is Good or Bad in the test dataset. Using the predict function, we will get the probability of a loan to be Good or Bad. By using a threshold between 0 and 1, we can determine all the loan applications with a predicted probability greater than the threshold as predicted Good and all the loan applications with a predicted probability less than or equal to the threshold as predicted Bad. We can then compare the actual response with the predicted response to calculate the accuracy of the model. Let’s assume the threshold as 0.5.

probs = predict(full,newdata = test,type = "response")

probs_df = data.frame(prob = probs)
probs_df$ID = seq.int(nrow(probs_df))
test$ID = seq.int(nrow(test))
test2 = test %>% select(ID, response, totalPaid, amount)
predictions <- merge(probs_df,test2,by="ID")

threshold = 0.5

predictions = predictions %>% mutate(predicted = case_when((prob > threshold)~'Good',TRUE~'Bad'))
colnames(predictions)[colnames(predictions)=="response"] <- "actual"
predictions = predictions %>% 
  mutate(confusion = case_when((prob >  threshold & actual == 'Good')~'TP',
                               (prob <= threshold & actual == 'Bad')~'TN',
                               (prob >  threshold & actual == 'Bad')~'FP',
                               (prob <= threshold & actual == 'Good')~'FN',TRUE~NA_character_))

conf_matrix = with(predictions,table(actual, predicted))
addmargins(conf_matrix)

      predicted
actual  Bad Good  Sum
  Bad   209 1289 1498
  Good  144 5213 5357
  Sum   353 6502 6855

TN = conf_matrix[1]
FN = conf_matrix[2]
FP = conf_matrix[3]
TP = conf_matrix[4]

accuracy = (TP + TN)/(FP+FN+TN+TP) * 100
sensitivity = TP/(TP + FN)  * 100
specificity = TN/(TN + FP)  * 100

The model accuracy that we obtained is: 79.1 The sensitivity of the model is: 97.31 The specificity of the model is: 13.95

This means that, the model that we built has an overall accuracy of 79.1. This is the percentage of correct predictions this model offers. This model has a high sensitivity. Look at the confusion matrix above to understand sensitivity and specificity. Sensitivity is the total correct predictions of Good loans out of the total Good loans. Here 5213 out of 5357 were predicted correctly. But the specificity for the model is very low. Specificity is the total correct predictions of Bad loans out of the total Bad loans. 209 out of 1498 were predicted correctly. This means a large number of bad loans, 1289, were predicted as good loans. This can mean a large number of potential defaulters could be approved for a loan. This is not good for the loan provider. It can affect the profit of the company.

Part 6: Optimizing the Threshold for Accuracy

Now lets vary the threshold, earlier taken as 0.5, from 0 to 1 and see the effect on overall accuracy and proportions of correctly predicted good and bad loans. We can also find the threshold at which the model accuracy is the highest. For this exercise, let’s increment the threshold by 0.005 from the lowest value of the probability for a loan to be good, to, highest probability for a loan to be good.

probs = predict(full,newdata = test,type = "response")
probs_df = data.frame(prob = probs)
min = min(probs)
max = max(probs)

probs_df$ID = seq.int(nrow(probs_df))
test$ID = seq.int(nrow(test))
test2 = test %>% select(ID, response, totalPaid, amount)
predictions <- merge(probs_df,test2,by="ID")

accur_calc = function(predictions, th){
  predictions = predictions %>% mutate(predicted = case_when((prob > th)~'Good',TRUE~'Bad'))
  colnames(predictions)[colnames(predictions)=="response"] <- "actual"
  predictions = predictions %>% 
    mutate(confusion = case_when((prob >  th & actual == 'Good')~'TP',
                                 (prob <= th & actual == 'Bad')~'TN',
                                 (prob >  th & actual == 'Bad')~'FP',
                                 (prob <= th & actual == 'Good')~'FN',TRUE~NA_character_))
  conf_matrix = with(predictions,table(actual, predicted))
  TN = conf_matrix[1]
  FN = conf_matrix[2]
  FP = conf_matrix[3]
  TP = conf_matrix[4]
  accuracy = (TP + TN)/(FP+FN+TN+TP) * 100
  sensitivity = TP/(TP + FN) * 100
  specificity = TN/(TN + FP) * 100
  
  profit_df = predictions %>% filter(predicted == 'Good') %>% 
    summarise(profit = sum(totalPaid - amount))
  profit = profit_df[1,1]/(1000000)  #in millions
  
  return(c(th, accuracy, sensitivity, specificity, profit))
}

range = seq(min, max, by = .005)
conf_df = data.frame(threshold = double(),accuracy= double(),
                     sensitivity= double(),specificity= double(),profit= double())
jj = 0

for(ii in range){
  jj = jj+1
  vals = accur_calc(predictions, ii)
  conf_df [jj,] = vals
}

Fig 11a. shows the effect of accuracy over threshold. Fig 11a. and 11b. are similar because, this model has a high sensitivity when the accuracy is high and a very low sensitivity when the accuracy is low. It means this model predicts good loans at a high accuracy when the overall accuracy is high, but the trade off being, it predicts bad loans at a very low accuracy when the overall accuracy is high.

The max accuracy and the threshold at which, this model can obtain is in the below table. There can be two thresholds at which overall accuracy is at maximum. But sensitivity and specificity change.

max_acc = conf_df %>% filter(accuracy == max(accuracy)) %>% 
  select(accuracy, threshold, sensitivity, specificity)
max_acc

  accuracy threshold sensitivity specificity
1 79.27061 0.4803775    98.07728    12.01602

Part 7: Optimizing the Threshold for Profit

From the bank’s perspective the most important feature of the model is how it changes the overall profit. Let’s assume profit is calculated as the sum of subtraction of total loan amount paid and initial loan amount (the sum of totalPaid - amount). Let’s again change the threshold from the lowest to highest probability of a loan being good, and study the affect of threshold on profit. We can also find the accuracy at which the profit is maximum.

range = seq(min, max, by = .005)
conf_df = data.frame(threshold = double(),accuracy= double(),
                     sensitivity= double(),specificity= double(),profit= double())
jj = 0

for(ii in range){
  jj = jj+1
  vals = accur_calc(predictions, ii)
  conf_df [jj,] = vals
}

Fig 12a. shows the effect of profit over threshold. Fig 12b. shows the effect of profit over accuracy. You can see from fig 12b. that profit is not maximum when accuracy is at maximum. Profit is maximum when accuracy is slightly lower than max accuracy, which also means at a slightly lower sensitivity than its value at max accuracy and slightly higher specificity than its value at max accuracy.

If the bank chose a higher accuracy model, then the bank would lose money on loans predicted as good, but are actually bad, yielding less profits.

For the best profit threshold, the profit in millions, percentage of overall accuracy, sensitivity and specificity are in the below table:

max_profit = conf_df %>% filter(profit == max(profit)) %>% 
  select(threshold, profit, accuracy, sensitivity, specificity)
max_profit

  threshold   profit accuracy sensitivity specificity
1 0.6503775 3.640309 76.29468    87.25033    37.11615

For the best profit threshold, the profit in millions, overall accuracy, sensitivity and specificity are in the below table:

max_profit = conf_df %>% filter(profit == max(profit)) %>% 
  mutate(accuracy = round(accuracy,2), threshold = round(threshold,4), 
         sensitivity = round(sensitivity,2), specificity = round(specificity,2)) %>% 
  select(profit, accuracy, threshold, sensitivity, specificity)
max_profit

    profit accuracy threshold sensitivity specificity
1 3.640309    76.29    0.6504       87.25       37.12

no_model = test %>% summarise(profit = sum(totalPaid - amount))
curr_profit = no_model$profit / 1000000
increase_pct = round(max_profit$profit / curr_profit * 100,2)

The maximum percentage increase in profit that can be expected by deploying the max profit model is: 270.7 % Note: percentage is calculated by looking at the test dataset, which is a good approximation.

How does this increase in profit compare to the increase in profit from a perfect model that denies all of the truly bad loans?

ultimate_model = test %>% filter(response == 'Good') %>%
  summarise(profit = sum(totalPaid - amount))
ultimate_profit = ultimate_model$profit / 1000000
compare_pct = round(max_profit$profit / ultimate_profit * 100,2)
this_model_times_curr = round(max_profit$profit/curr_profit,1)
ultimate_times_curr = round(ultimate_profit/curr_profit,1)

The profit from the max profit model is 29.26 % of the maximum profit that can be obtained say the bank denied all bad loans, but approved all good loans.

Part 8: Results Summary

The bank should use this predictive model at the threshold that maximizes the profits. The threshold at which the profits are maximized would be 0.6504. The overall accuracy of the model that maximize profit would be 76.29 %. The percentage of good loans that will be correctly predicted would be 87.25. The percentage of bad loans that will be correctly predicted would be 37.12. By using this model, the bank would increase its profit by 270.7%.

This proposed model can more than double the profit (by 2.7 times). Had the bank rejected all the truly bad loans and only approved the good loans, the profit would have been 9.3 times the current profit. This means, the proposed model is only 29 % of the achievable maximum profit - this also means that there is scope for model improvement. The bank can take several steps to further improve the predictive model and make more money. Like:

Improve data collection processes:
- Collect more historical data for model training
- Improve the quality of the data at the source. Like- eliminate missing data, inaccurate data etc..
- Determine and collect other useful predictors like loan applicant’s credit score
Conduct more complex modelling techniques
- Like using ‘regression model selection’ techniques
- Employing dimension reduction techniques
- Trying non-regression methods like decision trees (random forest) etc..