Introduction

The analysis to create a decision making granting loans is one of the most important operations for financial institutions. By taking into account past results of a pertain, a model is trained to accurately predict future outcomes of individual loan approval success rate.

Load data and Libraries

library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggthemes)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(shiny)
library(DT)
## 
## Attaching package: 'DT'
## The following objects are masked from 'package:shiny':
## 
##     dataTableOutput, renderDataTable
library(DBI)
## Warning: package 'DBI' was built under R version 3.6.2
library(odbc)
## Warning: package 'odbc' was built under R version 3.6.2
library(rsconnect)
## Warning: package 'rsconnect' was built under R version 3.6.2
## 
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
## 
##     serverInfo
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(callr)
# Set the blank spaces to NA's
loan = read_csv("loan.csv" , na = "")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   id = col_logical(),
##   member_id = col_logical(),
##   term = col_character(),
##   grade = col_character(),
##   sub_grade = col_character(),
##   emp_title = col_character(),
##   emp_length = col_character(),
##   home_ownership = col_character(),
##   verification_status = col_character(),
##   issue_d = col_character(),
##   loan_status = col_character(),
##   pymnt_plan = col_character(),
##   url = col_logical(),
##   desc = col_logical(),
##   purpose = col_character(),
##   title = col_character(),
##   zip_code = col_character(),
##   addr_state = col_character(),
##   earliest_cr_line = col_character(),
##   initial_list_status = col_character()
##   # ... with 29 more columns
## )
## See spec(...) for full column specifications.
## Warning: 462349 parsing failures.
##   row                       col           expected   actual       file
## 92797 debt_settlement_flag_date 1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_status         1/0/T/F/TRUE/FALSE ACTIVE   'loan.csv'
## 92797 settlement_date           1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_amount         1/0/T/F/TRUE/FALSE 5443     'loan.csv'
## 92797 settlement_percentage     1/0/T/F/TRUE/FALSE 65       'loan.csv'
## ..... ......................... .................. ........ ..........
## See problems(...) for more details.
rsconnect::setAccountInfo(name='rode', token='FF0050CE796E6D152241FCDBD70150F1', secret='8CfkDnpsOULx5SVvf1OF/n+YvaLOvK7ImOz5/480')

Feature Selection & Engineering

The dataset contains of information of age, annual income, grade of employee, home ownership that affect the probability of default of the borrower. The columns we are going to use are namely:

# Select only the columns mentioned above.
loan = loan %>%
        select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership , 
               annual_inc , term)
loan
## # A tibble: 2,260,668 x 8
##    loan_status loan_amnt int_rate grade emp_length home_ownership
##    <chr>           <dbl>    <dbl> <chr> <chr>      <chr>         
##  1 Current          2500     13.6 C     10+ years  RENT          
##  2 Current         30000     18.9 D     10+ years  MORTGAGE      
##  3 Current          5000     18.0 D     6 years    MORTGAGE      
##  4 Current          4000     18.9 D     10+ years  MORTGAGE      
##  5 Current         30000     16.1 C     10+ years  MORTGAGE      
##  6 Current          5550     15.0 C     10+ years  MORTGAGE      
##  7 Current          2000     18.0 D     4 years    RENT          
##  8 Current          6000     13.6 C     10+ years  RENT          
##  9 Current          5000     18.0 D     10+ years  MORTGAGE      
## 10 Current          6000     14.5 C     < 1 year   OWN           
## # ... with 2,260,658 more rows, and 2 more variables: annual_inc <dbl>,
## #   term <chr>

Missing Values:

sapply(loan , function(x) sum(is.na(x)))
##    loan_status      loan_amnt       int_rate          grade     emp_length 
##              0              0              0              0              0 
## home_ownership     annual_inc           term 
##              0              4              0
# Remove the 4 rows with missing annual income, 49 rows where home ownership is 'NONE' or 'ANY' and rows where emp_length is 'n/a'.

loan = loan %>%
        filter(!is.na(annual_inc) , 
               !(home_ownership %in% c('NONE' , 'ANY')) , 
               emp_length != 'n/a')

Exploratory Data Analysis

loan %>%
       count(loan_status) %>%
        ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) + 
        geom_col() + 
        coord_flip() + labs(x = 'Loan Status' , y = 'Count')

We want to convert this variable to binary (1 for default and 0 for non-default) but we have 10 different levels. Loans with status Current, Late payments, In grace period need to be removed. Therefore, we create a new variable called loan_outcome where

loan_outcome -> 1 if loan_status = ‘Charged Off’ or ‘Default’ loan_outcome -> 0 if loan_status = ‘Fully Paid’

loan = loan %>%
        mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') , 
                                     1, 
                                     ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
                                     ))

barplot(table(loan$loan_outcome) , col = 'lightblue')

We will create a new dataset which contains only rows with 0 or 1 in loan_outcome feature for better modelling.

# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status column for the modelling
loan2 = loan %>%
        select(-loan_status) %>%
        filter(loan_outcome %in% c(0 , 1))

Our new dataset contains of 1227885 rows.

Let’s observe how useful these variables would be for credit risk modelling. It is known that the better the grade the lowest the interest rate. We can nicely visualise this with boxplots.

ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) + 
        geom_boxplot() + 
        theme_igray() + 
        labs(y = 'Interest Rate' , x = 'Grade')

We assume that grade is a great predictor for the volume of non-performing loans. But how many of them did not performed grouped by grade?

table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))
##    
##     Fully Paid Default
##   A     201708   12424
##   B     311584   46655
##   C     271318   76680
##   D     128042   54771
##   E      53267   33064
##   F      16447   13413
##   G       4285    4227
ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) + 
        geom_bar() + 
        theme(legend.title = element_blank())

Now let’s try to find out what impact the annual income of the borrower has on the other variables.

ggplot(loan2[sample(244179 , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
        geom_point(alpha = 0.5 , size = 1.5) + 
        geom_smooth(se = F , color = 'darkred' , method = 'loess') +
        xlim(c(0 , 300000)) + 
        labs(x = 'Annual Income' , y = 'Loan Ammount' , color = 'Interest Rate')
## Warning: Removed 67 rows containing non-finite values (stat_smooth).
## Warning: Removed 67 rows containing missing values (geom_point).

As expected the larger the annual income the larger the demanded ammount by the borrower.

Data modelling

Modelling Process:

Because of the binary response variable we can use logistic regression. Rather than modelling the response Y directly, logistic regression models the probability that Y belongs to a particular category, in our case the probability of a non-performing loan. This probability can be computed by the logistic function,

P = exp(b0 + b1X1 + … + bNXN) / [ 1 + exp(b0 + b1X1 + … + bNXN) ]

where

# Split dataset 
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]

# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)
## 
## Call:
## glm(formula = loan_outcome ~ ., family = binomial(link = "logit"), 
##     data = trainset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4530  -0.7125  -0.5319  -0.3228   6.9008  
## 
## Coefficients:
##                       Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         -3.128e+00  1.999e-02 -156.488  < 2e-16 ***
## loan_amnt            1.095e-05  3.803e-07   28.781  < 2e-16 ***
## int_rate             3.909e-02  1.759e-03   22.229  < 2e-16 ***
## gradeB               6.452e-01  1.364e-02   47.312  < 2e-16 ***
## gradeC               1.061e+00  1.687e-02   62.876  < 2e-16 ***
## gradeD               1.267e+00  2.226e-02   56.912  < 2e-16 ***
## gradeE               1.390e+00  2.809e-02   49.466  < 2e-16 ***
## gradeF               1.435e+00  3.576e-02   40.140  < 2e-16 ***
## gradeG               1.470e+00  4.535e-02   32.419  < 2e-16 ***
## emp_length1 year    -2.555e-03  1.368e-02   -0.187 0.851873    
## emp_length10+ years -6.063e-02  1.044e-02   -5.810 6.27e-09 ***
## emp_length2 years   -5.248e-02  1.271e-02   -4.127 3.67e-05 ***
## emp_length3 years   -2.085e-02  1.308e-02   -1.595 0.110822    
## emp_length4 years   -3.501e-02  1.416e-02   -2.471 0.013456 *  
## emp_length5 years   -3.463e-02  1.401e-02   -2.471 0.013467 *  
## emp_length6 years   -6.319e-02  1.538e-02   -4.109 3.97e-05 ***
## emp_length7 years   -5.558e-02  1.559e-02   -3.565 0.000364 ***
## emp_length8 years   -9.062e-03  1.548e-02   -0.585 0.558276    
## emp_length9 years   -1.821e-02  1.641e-02   -1.110 0.266955    
## home_ownershipOTHER  4.478e-01  2.441e-01    1.834 0.066594 .  
## home_ownershipOWN    1.926e-01  9.497e-03   20.274  < 2e-16 ***
## home_ownershipRENT   3.498e-01  6.086e-03   57.476  < 2e-16 ***
## annual_inc          -2.276e-06  7.227e-08  -31.496  < 2e-16 ***
## term60 months        4.317e-01  6.839e-03   63.131  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 914196  on 920912  degrees of freedom
## Residual deviance: 838717  on 920889  degrees of freedom
## AIC: 838765
## 
## Number of Fisher Scoring iterations: 5

The coefficients of the following features are positive:

  1. Loan Ammount
  2. Interest Rate
  3. Home Ownership - Other
  4. Term
  5. The better the grade the more difficult to default

This means the probability of defaulting on the given credit varies directly with these factors. For example more the given ammount of the loan, more the risk of losing credit.

The coefficients of the following features are negative:

  1. Annual Income
  2. Home Ownership - Own
  3. Home Ownership - Rent
  4. Borrowers with 10+ years of experience are more likely to pay their debt
  5. There is no significant difference in the early years of employment

This means that the probability of defaulting is inversely proportional to the factors mentioned above.

# Prediction on test set
preds = predict(glm.model , testset , type = 'response')

# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) + 
        geom_density(fill = 'lightblue' , alpha = 0.4) +
        labs(x = 'Predicted Probabilities on test set')

But now let’s see how the accuracy, sensitivity and specificity are transformed for given threshold. We can use a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold. So we will consider these three metrics for threshold levels from 1% up to 50%.

k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
        k = k + 1
        preds_binomial = ifelse(preds > i , 1 , 0)
        confmat = table(testset$loan_outcome , preds_binomial)
        accuracy[k] = sum(diag(confmat)) / sum(confmat)
        sensitivity[k] = confmat[1 , 1] / sum(confmat[ , 1])
        specificity[k] = confmat[2 , 2] / sum(confmat[ , 2])
}

If we plot our results we get this visualization.

threshold = seq(from = 0.01 , to = 0.5 , by = 0.01)

data = data.frame(threshold , accuracy , sensitivity , specificity)
head(data)
##   threshold  accuracy sensitivity specificity
## 1      0.01 0.1946823   0.8958333   0.1945726
## 2      0.02 0.1949103   0.9218750   0.1946070
## 3      0.03 0.1955292   0.9434524   0.1947097
## 4      0.04 0.1995329   0.9641337   0.1954134
## 5      0.05 0.2399307   0.9602776   0.2025856
## 6      0.06 0.2887495   0.9523467   0.2116308
# Gather accuracy , sensitivity and specificity in one column
ggplot(gather(data , key = 'Metric' , value = 'Value' , 2:4) , 
       aes(x = threshold , y = Value , color = Metric)) + 
        geom_line(size = 1.5)

A threshold of 25% - 30% seems ideal cause further increase of the cut off percentage does not have significant impact on the accuracy of the model. The Confusion Matrix for cut off point at 30% will be this,

preds.for.30 = ifelse(preds > 0.3 , 1 , 0)
confusion_matrix_30 = table(Predicted = preds.for.30 , Actual = testset$loan_outcome)
confusion_matrix_30
##          Actual
## Predicted      0      1
##         0 211430  37757
##         1  35818  21967
## [1] "Accuracy : 0.7603"

The ROC (Receiver Operating Characteristics) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.

library(pROC)
## Warning: package 'pROC' was built under R version 3.6.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Area Under Curve
auc(roc(testset$loan_outcome , preds))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7018
# Plot ROC curve
plot.roc(testset$loan_outcome , preds , main = "Confidence interval of a threshold" , percent = TRUE , 
         ci = TRUE , of = "thresholds" , thresholds = "best" , print.thres = "best" , col = 'blue')
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Conclusion

A logistic regression model was used to predict the loan status. Different cut off’s were used to decide if the loan should be granted or not. Cut off of 30% gave a good accuracy of 76.03%. The decision to set a cut off is arbitrary and higher levels of threshold increases the risk. The Area Under Curve also gives a measure of accuracy, which came out to be 70.18%.