Introduction
The analysis to create a decision making granting loans is one of the most important operations for financial institutions. By taking into account past results of a pertain, a model is trained to accurately predict future outcomes of individual loan approval success rate.
Load data and Libraries
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggthemes)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(shiny)
library(DT)
##
## Attaching package: 'DT'
## The following objects are masked from 'package:shiny':
##
## dataTableOutput, renderDataTable
library(DBI)
## Warning: package 'DBI' was built under R version 3.6.2
library(odbc)
## Warning: package 'odbc' was built under R version 3.6.2
library(rsconnect)
## Warning: package 'rsconnect' was built under R version 3.6.2
##
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
##
## serverInfo
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(callr)
# Set the blank spaces to NA's
loan = read_csv("loan.csv" , na = "")
## Parsed with column specification:
## cols(
## .default = col_double(),
## id = col_logical(),
## member_id = col_logical(),
## term = col_character(),
## grade = col_character(),
## sub_grade = col_character(),
## emp_title = col_character(),
## emp_length = col_character(),
## home_ownership = col_character(),
## verification_status = col_character(),
## issue_d = col_character(),
## loan_status = col_character(),
## pymnt_plan = col_character(),
## url = col_logical(),
## desc = col_logical(),
## purpose = col_character(),
## title = col_character(),
## zip_code = col_character(),
## addr_state = col_character(),
## earliest_cr_line = col_character(),
## initial_list_status = col_character()
## # ... with 29 more columns
## )
## See spec(...) for full column specifications.
## Warning: 462349 parsing failures.
## row col expected actual file
## 92797 debt_settlement_flag_date 1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_status 1/0/T/F/TRUE/FALSE ACTIVE 'loan.csv'
## 92797 settlement_date 1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_amount 1/0/T/F/TRUE/FALSE 5443 'loan.csv'
## 92797 settlement_percentage 1/0/T/F/TRUE/FALSE 65 'loan.csv'
## ..... ......................... .................. ........ ..........
## See problems(...) for more details.
rsconnect::setAccountInfo(name='rode', token='FF0050CE796E6D152241FCDBD70150F1', secret='8CfkDnpsOULx5SVvf1OF/n+YvaLOvK7ImOz5/480')
The dataset contains of information of age, annual income, grade of employee, home ownership that affect the probability of default of the borrower. The columns we are going to use are namely:
# Select only the columns mentioned above.
loan = loan %>%
select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership ,
annual_inc , term)
loan
## # A tibble: 2,260,668 x 8
## loan_status loan_amnt int_rate grade emp_length home_ownership
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Current 2500 13.6 C 10+ years RENT
## 2 Current 30000 18.9 D 10+ years MORTGAGE
## 3 Current 5000 18.0 D 6 years MORTGAGE
## 4 Current 4000 18.9 D 10+ years MORTGAGE
## 5 Current 30000 16.1 C 10+ years MORTGAGE
## 6 Current 5550 15.0 C 10+ years MORTGAGE
## 7 Current 2000 18.0 D 4 years RENT
## 8 Current 6000 13.6 C 10+ years RENT
## 9 Current 5000 18.0 D 10+ years MORTGAGE
## 10 Current 6000 14.5 C < 1 year OWN
## # ... with 2,260,658 more rows, and 2 more variables: annual_inc <dbl>,
## # term <chr>
Missing Values:
sapply(loan , function(x) sum(is.na(x)))
## loan_status loan_amnt int_rate grade emp_length
## 0 0 0 0 0
## home_ownership annual_inc term
## 0 4 0
# Remove the 4 rows with missing annual income, 49 rows where home ownership is 'NONE' or 'ANY' and rows where emp_length is 'n/a'.
loan = loan %>%
filter(!is.na(annual_inc) ,
!(home_ownership %in% c('NONE' , 'ANY')) ,
emp_length != 'n/a')
loan %>%
count(loan_status) %>%
ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) +
geom_col() +
coord_flip() + labs(x = 'Loan Status' , y = 'Count')
We want to convert this variable to binary (1 for default and 0 for non-default) but we have 10 different levels. Loans with status Current, Late payments, In grace period need to be removed. Therefore, we create a new variable called loan_outcome where
loan_outcome -> 1 if loan_status = ‘Charged Off’ or ‘Default’ loan_outcome -> 0 if loan_status = ‘Fully Paid’
loan = loan %>%
mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') ,
1,
ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
))
barplot(table(loan$loan_outcome) , col = 'lightblue')
We will create a new dataset which contains only rows with 0 or 1 in loan_outcome feature for better modelling.
# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status column for the modelling
loan2 = loan %>%
select(-loan_status) %>%
filter(loan_outcome %in% c(0 , 1))
Our new dataset contains of 1227885 rows.
Let’s observe how useful these variables would be for credit risk modelling. It is known that the better the grade the lowest the interest rate. We can nicely visualise this with boxplots.
ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) +
geom_boxplot() +
theme_igray() +
labs(y = 'Interest Rate' , x = 'Grade')
We assume that grade is a great predictor for the volume of non-performing loans. But how many of them did not performed grouped by grade?
table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))
##
## Fully Paid Default
## A 201708 12424
## B 311584 46655
## C 271318 76680
## D 128042 54771
## E 53267 33064
## F 16447 13413
## G 4285 4227
ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) +
geom_bar() +
theme(legend.title = element_blank())
Now let’s try to find out what impact the annual income of the borrower has on the other variables.
ggplot(loan2[sample(244179 , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
geom_point(alpha = 0.5 , size = 1.5) +
geom_smooth(se = F , color = 'darkred' , method = 'loess') +
xlim(c(0 , 300000)) +
labs(x = 'Annual Income' , y = 'Loan Ammount' , color = 'Interest Rate')
## Warning: Removed 67 rows containing non-finite values (stat_smooth).
## Warning: Removed 67 rows containing missing values (geom_point).
As expected the larger the annual income the larger the demanded ammount by the borrower.
Modelling Process:
Because of the binary response variable we can use logistic regression. Rather than modelling the response Y directly, logistic regression models the probability that Y belongs to a particular category, in our case the probability of a non-performing loan. This probability can be computed by the logistic function,
P = exp(b0 + b1X1 + … + bNXN) / [ 1 + exp(b0 + b1X1 + … + bNXN) ]
where
# Split dataset
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]
# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)
##
## Call:
## glm(formula = loan_outcome ~ ., family = binomial(link = "logit"),
## data = trainset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4530 -0.7125 -0.5319 -0.3228 6.9008
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.128e+00 1.999e-02 -156.488 < 2e-16 ***
## loan_amnt 1.095e-05 3.803e-07 28.781 < 2e-16 ***
## int_rate 3.909e-02 1.759e-03 22.229 < 2e-16 ***
## gradeB 6.452e-01 1.364e-02 47.312 < 2e-16 ***
## gradeC 1.061e+00 1.687e-02 62.876 < 2e-16 ***
## gradeD 1.267e+00 2.226e-02 56.912 < 2e-16 ***
## gradeE 1.390e+00 2.809e-02 49.466 < 2e-16 ***
## gradeF 1.435e+00 3.576e-02 40.140 < 2e-16 ***
## gradeG 1.470e+00 4.535e-02 32.419 < 2e-16 ***
## emp_length1 year -2.555e-03 1.368e-02 -0.187 0.851873
## emp_length10+ years -6.063e-02 1.044e-02 -5.810 6.27e-09 ***
## emp_length2 years -5.248e-02 1.271e-02 -4.127 3.67e-05 ***
## emp_length3 years -2.085e-02 1.308e-02 -1.595 0.110822
## emp_length4 years -3.501e-02 1.416e-02 -2.471 0.013456 *
## emp_length5 years -3.463e-02 1.401e-02 -2.471 0.013467 *
## emp_length6 years -6.319e-02 1.538e-02 -4.109 3.97e-05 ***
## emp_length7 years -5.558e-02 1.559e-02 -3.565 0.000364 ***
## emp_length8 years -9.062e-03 1.548e-02 -0.585 0.558276
## emp_length9 years -1.821e-02 1.641e-02 -1.110 0.266955
## home_ownershipOTHER 4.478e-01 2.441e-01 1.834 0.066594 .
## home_ownershipOWN 1.926e-01 9.497e-03 20.274 < 2e-16 ***
## home_ownershipRENT 3.498e-01 6.086e-03 57.476 < 2e-16 ***
## annual_inc -2.276e-06 7.227e-08 -31.496 < 2e-16 ***
## term60 months 4.317e-01 6.839e-03 63.131 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 914196 on 920912 degrees of freedom
## Residual deviance: 838717 on 920889 degrees of freedom
## AIC: 838765
##
## Number of Fisher Scoring iterations: 5
The coefficients of the following features are positive:
This means the probability of defaulting on the given credit varies directly with these factors. For example more the given ammount of the loan, more the risk of losing credit.
The coefficients of the following features are negative:
This means that the probability of defaulting is inversely proportional to the factors mentioned above.
# Prediction on test set
preds = predict(glm.model , testset , type = 'response')
# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) +
geom_density(fill = 'lightblue' , alpha = 0.4) +
labs(x = 'Predicted Probabilities on test set')
But now let’s see how the accuracy, sensitivity and specificity are transformed for given threshold. We can use a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold. So we will consider these three metrics for threshold levels from 1% up to 50%.
k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
k = k + 1
preds_binomial = ifelse(preds > i , 1 , 0)
confmat = table(testset$loan_outcome , preds_binomial)
accuracy[k] = sum(diag(confmat)) / sum(confmat)
sensitivity[k] = confmat[1 , 1] / sum(confmat[ , 1])
specificity[k] = confmat[2 , 2] / sum(confmat[ , 2])
}
If we plot our results we get this visualization.
threshold = seq(from = 0.01 , to = 0.5 , by = 0.01)
data = data.frame(threshold , accuracy , sensitivity , specificity)
head(data)
## threshold accuracy sensitivity specificity
## 1 0.01 0.1946823 0.8958333 0.1945726
## 2 0.02 0.1949103 0.9218750 0.1946070
## 3 0.03 0.1955292 0.9434524 0.1947097
## 4 0.04 0.1995329 0.9641337 0.1954134
## 5 0.05 0.2399307 0.9602776 0.2025856
## 6 0.06 0.2887495 0.9523467 0.2116308
# Gather accuracy , sensitivity and specificity in one column
ggplot(gather(data , key = 'Metric' , value = 'Value' , 2:4) ,
aes(x = threshold , y = Value , color = Metric)) +
geom_line(size = 1.5)
A threshold of 25% - 30% seems ideal cause further increase of the cut off percentage does not have significant impact on the accuracy of the model. The Confusion Matrix for cut off point at 30% will be this,
preds.for.30 = ifelse(preds > 0.3 , 1 , 0)
confusion_matrix_30 = table(Predicted = preds.for.30 , Actual = testset$loan_outcome)
confusion_matrix_30
## Actual
## Predicted 0 1
## 0 211430 37757
## 1 35818 21967
## [1] "Accuracy : 0.7603"
The ROC (Receiver Operating Characteristics) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
library(pROC)
## Warning: package 'pROC' was built under R version 3.6.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Area Under Curve
auc(roc(testset$loan_outcome , preds))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7018
# Plot ROC curve
plot.roc(testset$loan_outcome , preds , main = "Confidence interval of a threshold" , percent = TRUE ,
ci = TRUE , of = "thresholds" , thresholds = "best" , print.thres = "best" , col = 'blue')
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
A logistic regression model was used to predict the loan status. Different cut off’s were used to decide if the loan should be granted or not. Cut off of 30% gave a good accuracy of 76.03%. The decision to set a cut off is arbitrary and higher levels of threshold increases the risk. The Area Under Curve also gives a measure of accuracy, which came out to be 70.18%.