You should install the following packages before running the code in this work:

install.packages("dplyr")
install.packages("pROC")
install.packages("ggplot2")

Basic model estimate: the case with FICO score only

First, we will load the Assignment_1_data.csv to R. dplyr R library will be used in the process in order to manipulate the data frame.

library(dplyr)
loan_df<-read.csv("Assignment_1_data.csv") #Loading the CSV to R
head(loan_df,5) #Just to see if the data uploaded in the right way
##   id loan_amnt int_rate grade sub_grade emp_length revol_bal revol_util
## 1  1     17625    18.49     D        D2    3 years     12002       88.9
## 2  2      2800     7.62     A        A3    4 years      3897       73.5
## 3  3      5375    15.31     C        C2    3 years      6070       38.4
## 4  4     20000    14.09     B        B5   < 1 year     12174       49.9
## 5  5     10000    12.12     B        B3  10+ years     13547       88.6
##   fico home_ownership annual_inc verification_status loan_status default
## 1  672           RENT      45000            Verified  Fully Paid    Paid
## 2  727           RENT      44500        Not Verified  Fully Paid    Paid
## 3  682            OWN      22880            Verified  Fully Paid    Paid
## 4  702           RENT      95000            Verified  Fully Paid    Paid
## 5  707           RENT      68000        Not Verified  Fully Paid    Paid
  1. We will add another column to the data set with dummy variables , defaulted will be 1 and Paid will be 0.
new_loan_df<-mutate(loan_df,new_default=ifelse(loan_df$default=="Defaulted",1,0)) #creating one more column under the name new_default which assign 0 to paid and 1 to default.
head(new_loan_df,4)
##   id loan_amnt int_rate grade sub_grade emp_length revol_bal revol_util
## 1  1     17625    18.49     D        D2    3 years     12002       88.9
## 2  2      2800     7.62     A        A3    4 years      3897       73.5
## 3  3      5375    15.31     C        C2    3 years      6070       38.4
## 4  4     20000    14.09     B        B5   < 1 year     12174       49.9
##   fico home_ownership annual_inc verification_status loan_status default
## 1  672           RENT      45000            Verified  Fully Paid    Paid
## 2  727           RENT      44500        Not Verified  Fully Paid    Paid
## 3  682            OWN      22880            Verified  Fully Paid    Paid
## 4  702           RENT      95000            Verified  Fully Paid    Paid
##   new_default
## 1           0
## 2           0
## 3           0
## 4           0
  1. Logistic regression model the probability of default given Fico score. Fico score definition: “is best known for producing the most widely used consumer credit scores that financial institutions use in deciding whether to lend money or issue credit” (Investopedia). Fico range between 350 to 850, the coefficient sign will be minus, as Fico score goes up the probability to default goes down. It is important to know that there is no linear relationship between Fico score and the probability to default. Every Fico score need to be calculated based on the model parameters to get the probability to default.
plot(new_loan_df$fico,new_loan_df$new_default,xlab = "Fico score",ylab = "Paid or Default")

log_regression_fico<-glm(new_default~fico,data = new_loan_df,family = binomial)
summary(log_regression_fico)
## 
## Call:
## glm(formula = new_default ~ fico, family = binomial, data = new_loan_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7259  -0.6441  -0.5698  -0.4430   2.6000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.681073   0.755259   10.17   <2e-16 ***
## fico        -0.013414   0.001092  -12.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8869.3  on 9999  degrees of freedom
## Residual deviance: 8694.0  on 9998  degrees of freedom
## AIC: 8698
## 
## Number of Fisher Scoring iterations: 5
coef(log_regression_fico)
## (Intercept)        fico 
##  7.68107344 -0.01341421

3.As you can see the Intercept is 7.681 and the fico coefficient is -0.0134.Fico score has a significant explanatory factor, p value under 0.05. From the first graph you can notice that in general, as fico score goes up more people tend not to pay.

Model evaluation

Using pROC library we will create the ROC curve.

library(pROC)
roc(new_loan_df$new_default,log_regression_fico$fitted.values,plot = TRUE,legacy.axes=TRUE,xlab="False Positive Percentage",ylab="True Positive Percentage",print.auc=TRUE,percent = TRUE) #this line of code is to produce the ROC curve

## 
## Call:
## roc.default(response = new_loan_df$new_default, predictor = log_regression_fico$fitted.values,     percent = TRUE, plot = TRUE, legacy.axes = TRUE, xlab = "False Positive Percentage",     ylab = "True Positive Percentage", print.auc = TRUE)
## 
## Data: log_regression_fico$fitted.values in 8377 controls (new_loan_df$new_default 0) < 1623 cases (new_loan_df$new_default 1).
## Area under the curve: 59.81%

The area under the curve is 0.5981, the area is greater than 0.5.

  1. We will set the 10% probability for default threshold and create confusion matrix
predict_vector<-predict(log_regression_fico,type = "response") 
glm.pred<-rep(0,10000) #create vector with 10000 zero (10000zero is like 10000 No default)
glm.pred[predict_vector>0.1]=1 #taking the glm.predict and for each probability above 10% assign 1 (1 is like Defaulted) 
table(glm.pred,new_loan_df$new_default) #confusion matrix 
##         
## glm.pred    0    1
##        0 1095   89
##        1 7282 1534

We create a confusion matrix: The proportion of mistakenly rejected consumers (predicted as default but actually paid from the ones that paid) is 0.8692849. The proportion of correctly defaulted consumers (predicted as default and indeed default from the people that default) is 0.9451633.

With slightly richer data

The variables that should be transformed to factor in R from the mentioned variables are detailed credit grade and employment verification. But in the data frame R is already consider them to be factor variables. We will only replace employment verification to be two groups - non-verified and everything else (non-verified =1, else=0)

loan_multiple_log<-mutate(new_loan_df,employment_ver_dummy=ifelse(new_loan_df$verification_status=="Not Verified",1,0)) #addind column to the data base of one=not verified, else =0
head(loan_multiple_log) #only to check if it worked
##   id loan_amnt int_rate grade sub_grade emp_length revol_bal revol_util
## 1  1     17625    18.49     D        D2    3 years     12002       88.9
## 2  2      2800     7.62     A        A3    4 years      3897       73.5
## 3  3      5375    15.31     C        C2    3 years      6070       38.4
## 4  4     20000    14.09     B        B5   < 1 year     12174       49.9
## 5  5     10000    12.12     B        B3  10+ years     13547       88.6
## 6  6     20000    18.49     D        D2  10+ years     23178       87.8
##   fico home_ownership annual_inc verification_status loan_status default
## 1  672           RENT      45000            Verified  Fully Paid    Paid
## 2  727           RENT      44500        Not Verified  Fully Paid    Paid
## 3  682            OWN      22880            Verified  Fully Paid    Paid
## 4  702           RENT      95000            Verified  Fully Paid    Paid
## 5  707           RENT      68000        Not Verified  Fully Paid    Paid
## 6  677       MORTGAGE      60000            Verified  Fully Paid    Paid
##   new_default employment_ver_dummy
## 1           0                    0
## 2           0                    1
## 3           0                    0
## 4           0                    0
## 5           0                    1
## 6           0                    0
## 
## Call:
## glm(formula = new_default ~ fico + loan_amnt + int_rate + sub_grade + 
##     revol_bal + employment_ver_dummy, family = binomial, data = loan_multiple_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1934  -0.6379  -0.5533  -0.4118   2.5769  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          -7.809e-01  1.287e+00  -0.607  0.54398   
## fico                 -2.248e-03  1.475e-03  -1.524  0.12760   
## loan_amnt            -2.101e-06  4.107e-06  -0.512  0.60893   
## int_rate             -9.319e-02  9.231e-02  -1.010  0.31271   
## sub_gradeA2          -4.988e-02  4.238e-01  -0.118  0.90631   
## sub_gradeA3           7.080e-01  3.802e-01   1.862  0.06260 . 
## sub_gradeA4           8.152e-01  3.746e-01   2.176  0.02956 * 
## sub_gradeA5           9.053e-01  4.273e-01   2.119  0.03413 * 
## sub_gradeB1           1.316e+00  4.872e-01   2.702  0.00689 **
## sub_gradeB2           1.371e+00  5.610e-01   2.444  0.01452 * 
## sub_gradeB3           1.747e+00  6.391e-01   2.733  0.00628 **
## sub_gradeB4           1.788e+00  7.172e-01   2.492  0.01268 * 
## sub_gradeB5           2.144e+00  7.886e-01   2.719  0.00655 **
## sub_gradeC1           2.007e+00  8.219e-01   2.442  0.01461 * 
## sub_gradeC2           2.207e+00  8.970e-01   2.460  0.01388 * 
## sub_gradeC3           2.601e+00  9.451e-01   2.752  0.00593 **
## sub_gradeC4           2.614e+00  9.919e-01   2.635  0.00841 **
## sub_gradeC5           2.601e+00  1.071e+00   2.428  0.01519 * 
## sub_gradeD1           2.798e+00  1.124e+00   2.489  0.01280 * 
## sub_gradeD2           2.891e+00  1.190e+00   2.428  0.01518 * 
## sub_gradeD3           3.064e+00  1.223e+00   2.505  0.01224 * 
## sub_gradeD4           3.263e+00  1.258e+00   2.595  0.00947 **
## sub_gradeD5           3.233e+00  1.330e+00   2.431  0.01505 * 
## sub_gradeE1           3.500e+00  1.400e+00   2.501  0.01239 * 
## sub_gradeE2           2.953e+00  1.443e+00   2.046  0.04075 * 
## sub_gradeE3           2.762e+00  1.510e+00   1.829  0.06734 . 
## sub_gradeE4           2.967e+00  1.586e+00   1.871  0.06136 . 
## sub_gradeE5           3.608e+00  1.643e+00   2.196  0.02811 * 
## sub_gradeF1           3.700e+00  1.686e+00   2.194  0.02824 * 
## sub_gradeF2           4.309e+00  1.747e+00   2.467  0.01363 * 
## sub_gradeF3           3.466e+00  1.791e+00   1.935  0.05293 . 
## sub_gradeF4          -8.940e+00  2.677e+02  -0.033  0.97336   
## sub_gradeF5          -8.887e+00  5.354e+02  -0.017  0.98676   
## sub_gradeG1           4.703e+00  2.337e+00   2.012  0.04417 * 
## sub_gradeG2          -8.985e+00  3.786e+02  -0.024  0.98107   
## sub_gradeG5          -9.042e+00  5.354e+02  -0.017  0.98653   
## revol_bal             1.133e-06  9.845e-07   1.151  0.24980   
## employment_ver_dummy -1.500e-02  5.899e-02  -0.254  0.79932   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8869.3  on 9999  degrees of freedom
## Residual deviance: 8525.4  on 9962  degrees of freedom
## AIC: 8601.4
## 
## Number of Fisher Scoring iterations: 12
##          (Intercept)                 fico            loan_amnt 
##        -7.808947e-01        -2.247923e-03        -2.101344e-06 
##             int_rate          sub_gradeA2          sub_gradeA3 
##        -9.319442e-02        -4.988157e-02         7.079689e-01 
##          sub_gradeA4          sub_gradeA5          sub_gradeB1 
##         8.151579e-01         9.053256e-01         1.316347e+00 
##          sub_gradeB2          sub_gradeB3          sub_gradeB4 
##         1.371108e+00         1.746601e+00         1.787740e+00 
##          sub_gradeB5          sub_gradeC1          sub_gradeC2 
##         2.143986e+00         2.006986e+00         2.206848e+00 
##          sub_gradeC3          sub_gradeC4          sub_gradeC5 
##         2.600532e+00         2.613798e+00         2.601122e+00 
##          sub_gradeD1          sub_gradeD2          sub_gradeD3 
##         2.798020e+00         2.890547e+00         3.064440e+00 
##          sub_gradeD4          sub_gradeD5          sub_gradeE1 
##         3.263211e+00         3.233399e+00         3.500311e+00 
##          sub_gradeE2          sub_gradeE3          sub_gradeE4 
##         2.952968e+00         2.761600e+00         2.966595e+00 
##          sub_gradeE5          sub_gradeF1          sub_gradeF2 
##         3.608298e+00         3.699685e+00         4.308719e+00 
##          sub_gradeF3          sub_gradeF4          sub_gradeF5 
##         3.465906e+00        -8.940223e+00        -8.886938e+00 
##          sub_gradeG1          sub_gradeG2          sub_gradeG5 
##         4.702808e+00        -8.985002e+00        -9.041814e+00 
##            revol_bal employment_ver_dummy 
##         1.133040e-06        -1.499731e-02

The only variable which have significant power to differentiate between Paid and Default is the detailed grade variable and specifically sub_grade C3.

Producing ROC curve to this model:

roc1<-roc(loan_multiple_log$new_default,multiple_log_regression$fitted.values,legacy.axes=TRUE,xlab="False Positive Percentage",ylab="True Positive Percentage",print.auc=TRUE,col="#377eb8",percent = TRUE,plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc2<-roc(new_loan_df$new_default,log_regression_fico$fitted.values,percent=TRUE,print.auc=TRUE,col="#4daf4a",lwd=4,xlab="False Positive Percentage",ylab="True Positive Percentage",legacy.axes=TRUE,plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

plot(roc1,legacy.axes=TRUE,xlab="False Positive Percentage",ylab="True Positive Percentage",col="#377eb8",percent=TRUE)
plot(roc2,add=TRUE)

The area under the curve with this model is greater than the first model. It means that the multiple logistic model regression that taking into account many variables is able to distinguish between Paid and defaulted in 64% chance and the logistic regression model with only Fico can do it with almost 59.8%, which is less.

Other ways to produce the ROC curves

library(ggplot2)
prediction_mul<-predict(multiple_log_regression,type="response")
roc_fico_gg<-roc(loan_multiple_log$new_default,predict_vector)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_mul_gg<-roc(loan_multiple_log$new_default,prediction_mul)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
g_fico<-ggroc(roc_fico_gg, alpha=0.5,colour="red",size=2,legacy.axes = TRUE)
g_mul<-ggroc(roc_mul_gg,alpha=0.5,colour="blue",size=2,legacy.axes=TRUE)
g_all<-ggroc(list(fico_log_regression=roc_fico_gg,mul_log_regression=roc_mul_gg),legacy.axes = TRUE)
g_all+ggtitle("Two model ROC curves - Log regression - fico and multiple variables")+geom_segment(aes(x=0,xend=1,y=0,yend=1),color="grey",linetype="dashed")