Analysis Objective

This analysis is done to demonstrate the use of logistic regression on the lbb_loans.csv dataset, correctly interpret the negative coefficients obtained from logistic regression model, understand which of the variables are more statistically significant as predictors, and demonstrate some strategies to improve the model that has been built.

Libraries and Setup

knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
rm(list=ls())
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units

Import the lbb_loans.csv data

# getwd()
loan.data <- read.csv("lbb_loans.csv")
str(loan.data)
## 'data.frame':    1500 obs. of  16 variables:
##  $ initial_list_status: Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
##  $ purpose            : Factor w/ 5 levels "credit_card",..: 2 2 3 1 1 2 2 2 2 1 ...
##  $ int_rate           : num  12 12.6 19 14.1 25.8 ...
##  $ installment        : num  399 121 727 291 389 ...
##  $ annual_inc         : num  55000 62000 29000 40000 31500 44000 80000 104000 75000 42000 ...
##  $ dti                : num  16 30.7 10.2 22.5 38.4 ...
##  $ verification_status: Factor w/ 3 levels "Not Verified",..: 2 2 1 2 3 1 1 1 2 3 ...
##  $ grade              : Factor w/ 7 levels "A","B","C","D",..: 2 3 4 3 5 3 5 4 1 5 ...
##  $ revol_bal          : int  754 2194 5279 6600 10489 5217 14036 27942 3885 5768 ...
##  $ inq_last_12m       : int  1 2 5 0 4 3 1 2 0 6 ...
##  $ delinq_2yrs        : int  1 0 3 0 0 1 1 0 0 3 ...
##  $ home_ownership     : Factor w/ 3 levels "MORTGAGE","OWN",..: 3 3 1 3 2 1 3 1 3 1 ...
##  $ not_paid           : int  0 1 1 0 0 1 1 1 0 0 ...
##  $ log_inc            : num  10.9 11 10.3 10.6 10.4 ...
##  $ verified           : int  1 1 0 1 1 0 0 0 1 1 ...
##  $ grdCtoA            : int  1 0 0 0 0 0 0 0 1 0 ...
head(loan.data, 12)
##    initial_list_status            purpose int_rate installment annual_inc
## 1                    w debt_consolidation    11.99      398.52      55000
## 2                    w debt_consolidation    12.62      120.65      62000
## 3                    w   home_improvement    19.03      726.80      29000
## 4                    w        credit_card    14.08      291.38      40000
## 5                    w        credit_card    25.82      389.34      31500
## 6                    w debt_consolidation    16.02      140.67      44000
## 7                    f debt_consolidation    22.91      773.26      80000
## 8                    w debt_consolidation    18.06     1447.31     104000
## 9                    w debt_consolidation     7.35       31.04      75000
## 10                   w        credit_card    25.82      355.03      42000
## 11                   w        credit_card    14.08      218.99      50000
## 12                   w debt_consolidation    13.59      424.74      55000
##      dti verification_status grade revol_bal inq_last_12m delinq_2yrs
## 1  15.99     Source Verified     B       754            1           1
## 2  30.66     Source Verified     C      2194            2           0
## 3  10.22        Not Verified     D      5279            5           3
## 4  22.53     Source Verified     C      6600            0           0
## 5  38.36            Verified     E     10489            4           0
## 6  26.60        Not Verified     C      5217            3           1
## 7  26.43        Not Verified     E     14036            1           1
## 8  26.70        Not Verified     D     27942            2           0
## 9   4.65     Source Verified     A      3885            0           0
## 10 48.60            Verified     E      5768            6           3
## 11 29.07        Not Verified     C     14780            0           0
## 12 20.39        Not Verified     C     11809            0           0
##    home_ownership not_paid  log_inc verified grdCtoA
## 1            RENT        0 10.91509        1       1
## 2            RENT        1 11.03489        1       0
## 3        MORTGAGE        1 10.27505        0       0
## 4            RENT        0 10.59663        1       0
## 5             OWN        0 10.35774        1       0
## 6        MORTGAGE        1 10.69194        0       0
## 7            RENT        1 11.28978        0       0
## 8        MORTGAGE        1 11.55215        0       0
## 9            RENT        0 11.22524        1       1
## 10       MORTGAGE        0 10.64542        1       0
## 11       MORTGAGE        1 10.81978        0       0
## 12           RENT        0 10.91509        0       0
anyNA(loan.data)
## [1] FALSE

Exploratory Data Analysis

Let’s investigate the relationships and discover rough structures of the data. The variable of interest is the not_paid variable, a binary variable that indicate whether a loan is fully paid or not. A loan is considered “not paid” (not paid = 1) when it is Defaulted, Charged Off, or past due date (Grace Period).

names(loan.data)
##  [1] "initial_list_status" "purpose"             "int_rate"           
##  [4] "installment"         "annual_inc"          "dti"                
##  [7] "verification_status" "grade"               "revol_bal"          
## [10] "inq_last_12m"        "delinq_2yrs"         "home_ownership"     
## [13] "not_paid"            "log_inc"             "verified"           
## [16] "grdCtoA"

The explaination about the variables on the dataset are presented below.

Use describe() function from Hmisc package to see the overall data

describe(loan.data)
## loan.data 
## 
##  16  Variables      1500  Observations
## ---------------------------------------------------------------------------
## initial_list_status 
##        n  missing distinct 
##     1500        0        2 
##                       
## Value          f     w
## Frequency    308  1192
## Proportion 0.205 0.795
## ---------------------------------------------------------------------------
## purpose 
##        n  missing distinct 
##     1500        0        5 
##                                                                    
## Value             credit_card debt_consolidation   home_improvement
## Frequency                 314                954                157
## Proportion              0.209              0.636              0.105
##                                                 
## Value          major_purchase     small_business
## Frequency                  49                 26
## Proportion              0.033              0.017
## ---------------------------------------------------------------------------
## int_rate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1500        0       36    0.998    14.74    6.413     7.21     7.97 
##      .25      .50      .75      .90      .95 
##    10.42    14.08    18.06    23.88    25.84 
## 
## lowest :  5.32  6.08  6.72  7.07  7.21, highest: 30.75 30.79 30.84 30.89 30.94
## ---------------------------------------------------------------------------
## installment 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1500        0      977        1    469.1    328.7    105.4    148.8 
##      .25      .50      .75      .90      .95 
##    248.2    382.3    635.9    903.5   1071.0 
## 
## lowest :   31.04   36.68   38.67   39.24   39.57
## highest: 1415.78 1447.31 1501.96 1502.92 1503.89
## ---------------------------------------------------------------------------
## annual_inc 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1500        0      372        1    81554    53755    25180    32384 
##      .25      .50      .75      .90      .95 
##    48000    67890    95555   140000   180000 
## 
## lowest :    2500    3000   10000   10776   12000
## highest:  494000  600000  750000  950000 1200000
## ---------------------------------------------------------------------------
## dti 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1500        0     1219        1     18.9    12.29    4.096    6.514 
##      .25      .50      .75      .90      .95 
##   11.010   17.160   24.550   31.938   36.522 
## 
## lowest :   0.00   0.14   0.41   0.44   0.45, highest:  94.72 107.60 115.85 163.15 198.56
## ---------------------------------------------------------------------------
## verification_status 
##        n  missing distinct 
##     1500        0        3 
##                                                           
## Value         Not Verified Source Verified        Verified
## Frequency              555             523             422
## Proportion           0.370           0.349           0.281
## ---------------------------------------------------------------------------
## grade 
##        n  missing distinct 
##     1500        0        7 
##                                                     
## Value          A     B     C     D     E     F     G
## Frequency    203   366   451   311   116    38    15
## Proportion 0.135 0.244 0.301 0.207 0.077 0.025 0.010
## ---------------------------------------------------------------------------
## revol_bal 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1500        0     1441        1    15207    15859      924     1900 
##      .25      .50      .75      .90      .95 
##     4994    10139    18125    31401    44265 
## 
## lowest :      0      3      4      9     10, highest: 176233 178902 211088 228292 258897
## ---------------------------------------------------------------------------
## inq_last_12m 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1500        0       19    0.967    2.357    2.522      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      1.0      2.0      3.0      5.1      8.0 
##                                                                       
## Value          0     1     2     3     4     5     6     7     8     9
## Frequency    356   330   274   204   118    68    42    32    31    16
## Proportion 0.237 0.220 0.183 0.136 0.079 0.045 0.028 0.021 0.021 0.011
##                                                                 
## Value         10    11    12    13    14    15    16    17    18
## Frequency     10     5     3     1     2     1     3     2     2
## Proportion 0.007 0.003 0.002 0.001 0.001 0.001 0.002 0.001 0.001
## ---------------------------------------------------------------------------
## delinq_2yrs 
##        n  missing distinct     Info     Mean      Gmd 
##     1500        0        9     0.48   0.3127   0.5376 
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency   1204   199    55    26     9     1     2     3     1
## Proportion 0.803 0.133 0.037 0.017 0.006 0.001 0.001 0.002 0.001
## ---------------------------------------------------------------------------
## home_ownership 
##        n  missing distinct 
##     1500        0        3 
##                                      
## Value      MORTGAGE      OWN     RENT
## Frequency       754      192      554
## Proportion    0.503    0.128    0.369
## ---------------------------------------------------------------------------
## not_paid 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1500        0        2     0.75      751   0.5007   0.5003 
## 
## ---------------------------------------------------------------------------
## log_inc 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1500        0      372        1    11.12   0.6494    10.13    10.39 
##      .25      .50      .75      .90      .95 
##    10.78    11.13    11.47    11.85    12.10 
## 
## lowest :  7.824046  8.006368  9.210340  9.285077  9.392662
## highest: 13.110291 13.304685 13.527828 13.764217 13.997832
## ---------------------------------------------------------------------------
## verified 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1500        0        2    0.699      945     0.63   0.4665 
## 
## ---------------------------------------------------------------------------
## grdCtoA 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1500        0        2    0.706      569   0.3793   0.4712 
## 
## ---------------------------------------------------------------------------

Frequency of defaulted loans (not_paid == 0) vs fully-paid loans (not_paid == 1)

table(loan.data$not_paid)
## 
##   0   1 
## 749 751

Summary of Debt-to-Income Ratio

summary(loan.data$dti)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.01   17.16   18.90   24.55  198.56

Debt-to-Income ratio dti summary when not_paid is 0 (fully-paid loans)

summary(loan.data[loan.data$not_paid == 0, "dti"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.05   16.78   18.41   23.80  198.56

Debt-to-Income ratio dti summary when not_paid is 1 (charged-off, past-due / grace period or defaulted loans)

summary(loan.data[loan.data$not_paid == 1, "dti"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.92   17.52   19.40   25.05  115.85

Summary of Revolving Balance

summary(loan.data$revol_bal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    4994   10139   15207   18125  258897

We change the revolving balance revol_bal variable to log(revol_bal) as the data has a big variance, like the annual income annual_inc variable.

loan.data$log_revol <- ifelse(
                          loan.data$revol_bal == 0,
                          0,
                          log(loan.data$revol_bal)
)

Connection of Loan Grade with Average Interest Rate

aggregate(loan.data$int_rate ~ loan.data$grade, FUN = mean)
##   loan.data$grade loan.data$int_rate
## 1               A           7.146847
## 2               B          10.484945
## 3               C          14.297494
## 4               D          19.066495
## 5               E          24.698534
## 6               F          29.516316
## 7               G          30.836667

Subsetting Data for Analysis

From the dataset, we know that the highest the grade (G), the higher int_rate (interest rate). The higher interest rate, we can assume that the probability of defaulted loan will be higher.

After checking the grdCtoA column, apparently it is stated that A and B are 1, 0 otherwise. The column is omitted and corrected as gradeAtoE and stating that A, B or C are 0, 1 otherwise.

loan.data$grdCtoA <- NULL
loan.data$gradeDtoG <- as.integer(
                          case_when(
                            loan.data$grade == "A" ~ 0,
                            loan.data$grade == "B" ~ 0,
                            loan.data$grade == "C" ~ 0,
                            loan.data$grade == "D" ~ 1,
                            loan.data$grade == "E" ~ 1,
                            loan.data$grade == "F" ~ 1,
                            loan.data$grade == "G" ~ 1)
                          )

As the “annual_inc”, “revol_bal”, “verification_status”, and “grade” are now presented by “log_inc”, “log_revol”, “verified”, and “gradeDtoG”, the four variables will be omitted from the analysis.

# exclude "annual_inc", "revol_bal", "verification_status", "grade"
loan.data$annual_inc <- loan.data$revol_bal <- loan.data$verification_status <- loan.data$grade <- NULL

We’ll check whether the loan.data now has 13 variables instead of 16 variables.

str(loan.data)
## 'data.frame':    1500 obs. of  13 variables:
##  $ initial_list_status: Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
##  $ purpose            : Factor w/ 5 levels "credit_card",..: 2 2 3 1 1 2 2 2 2 1 ...
##  $ int_rate           : num  12 12.6 19 14.1 25.8 ...
##  $ installment        : num  399 121 727 291 389 ...
##  $ dti                : num  16 30.7 10.2 22.5 38.4 ...
##  $ inq_last_12m       : int  1 2 5 0 4 3 1 2 0 6 ...
##  $ delinq_2yrs        : int  1 0 3 0 0 1 1 0 0 3 ...
##  $ home_ownership     : Factor w/ 3 levels "MORTGAGE","OWN",..: 3 3 1 3 2 1 3 1 3 1 ...
##  $ not_paid           : int  0 1 1 0 0 1 1 1 0 0 ...
##  $ log_inc            : num  10.9 11 10.3 10.6 10.4 ...
##  $ verified           : int  1 1 0 1 1 0 0 0 1 1 ...
##  $ log_revol          : num  6.63 7.69 8.57 8.79 9.26 ...
##  $ gradeDtoG          : int  0 0 1 0 1 0 1 1 0 1 ...
anyNA(loan.data)
## [1] FALSE
summary(loan.data)
##  initial_list_status               purpose       int_rate    
##  f: 308              credit_card       :314   Min.   : 5.32  
##  w:1192              debt_consolidation:954   1st Qu.:10.42  
##                      home_improvement  :157   Median :14.08  
##                      major_purchase    : 49   Mean   :14.74  
##                      small_business    : 26   3rd Qu.:18.06  
##                                               Max.   :30.94  
##   installment           dti          inq_last_12m     delinq_2yrs    
##  Min.   :  31.04   Min.   :  0.00   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.: 248.15   1st Qu.: 11.01   1st Qu.: 1.000   1st Qu.:0.0000  
##  Median : 382.30   Median : 17.16   Median : 2.000   Median :0.0000  
##  Mean   : 469.15   Mean   : 18.90   Mean   : 2.357   Mean   :0.3127  
##  3rd Qu.: 635.86   3rd Qu.: 24.55   3rd Qu.: 3.000   3rd Qu.:0.0000  
##  Max.   :1503.89   Max.   :198.56   Max.   :18.000   Max.   :8.0000  
##   home_ownership    not_paid         log_inc          verified   
##  MORTGAGE:754    Min.   :0.0000   Min.   : 7.824   Min.   :0.00  
##  OWN     :192    1st Qu.:0.0000   1st Qu.:10.779   1st Qu.:0.00  
##  RENT    :554    Median :1.0000   Median :11.126   Median :1.00  
##                  Mean   :0.5007   Mean   :11.123   Mean   :0.63  
##                  3rd Qu.:1.0000   3rd Qu.:11.467   3rd Qu.:1.00  
##                  Max.   :1.0000   Max.   :13.998   Max.   :1.00  
##    log_revol        gradeDtoG   
##  Min.   : 0.000   Min.   :0.00  
##  1st Qu.: 8.516   1st Qu.:0.00  
##  Median : 9.224   Median :0.00  
##  Mean   : 8.974   Mean   :0.32  
##  3rd Qu.: 9.805   3rd Qu.:1.00  
##  Max.   :12.464   Max.   :1.00

Train and Test Data

In this part, we will split our dataset into train and test sets and build our machine learning model using data only from our train set to obtain an unbiased measurmenet of the model’s accuracy by predicting on test set.

The idea of obtaining an unbiased estimate of our model’s out-of-sample performance is an important one as it is often the case that the in-sample error is optimistic and tuned / adapted in a particular way to minimize the error in the training sample. Therefore - the in-sample error is not a good representation or indication of how our model will perform when it is applied on unseen data.

We’ll split our data into the loan.data.train and loan.data.test set. (85-15)

set.seed(2018)
intrain <- sample(nrow(loan.data), nrow(loan.data)*0.85)
loan.data.train <- loan.data[intrain, ]
loan.data.test <- loan.data[-intrain, ]

Verify to make sure our data is randomly shuffled

prop.table(table(loan.data.train$not_paid))
## 
##         0         1 
## 0.4878431 0.5121569
prop.table(table(loan.data.test$not_paid))
## 
##         0         1 
## 0.5644444 0.4355556

As the data is parted almost at 50-50, we can move on to the next step.

Logistic Regression Model of Credit Risk Analysis

Stepwise Regression

We do stepwise regression where the choice of predictive variables is carried out by an automatic procedure.

glm.all <- glm(not_paid ~., family = "binomial", loan.data.train)
# we only use it as a guidance when building our model
step(glm.all, direction = "both")
## Start:  AIC=1725.61
## not_paid ~ initial_list_status + purpose + int_rate + installment + 
##     dti + inq_last_12m + delinq_2yrs + home_ownership + log_inc + 
##     verified + log_revol + gradeDtoG
## 
##                       Df Deviance    AIC
## - int_rate             1   1691.7 1723.7
## - log_revol            1   1691.7 1723.7
## - dti                  1   1691.7 1723.7
## - initial_list_status  1   1691.9 1723.9
## - verified             1   1692.4 1724.4
## - home_ownership       2   1695.3 1725.3
## <none>                     1691.6 1725.6
## - purpose              4   1699.7 1725.7
## - inq_last_12m         1   1693.9 1725.9
## - gradeDtoG            1   1694.6 1726.6
## - log_inc              1   1696.6 1728.6
## - delinq_2yrs          1   1698.9 1730.9
## - installment          1   1712.4 1744.4
## 
## Step:  AIC=1723.7
## not_paid ~ initial_list_status + purpose + installment + dti + 
##     inq_last_12m + delinq_2yrs + home_ownership + log_inc + verified + 
##     log_revol + gradeDtoG
## 
##                       Df Deviance    AIC
## - log_revol            1   1691.8 1721.8
## - dti                  1   1691.8 1721.8
## - initial_list_status  1   1692.0 1722.0
## - verified             1   1692.6 1722.6
## - home_ownership       2   1695.4 1723.4
## <none>                     1691.7 1723.7
## - purpose              4   1699.8 1723.8
## - inq_last_12m         1   1693.9 1723.9
## + int_rate             1   1691.6 1725.6
## - log_inc              1   1696.8 1726.8
## - delinq_2yrs          1   1699.2 1729.2
## - gradeDtoG            1   1702.5 1732.5
## - installment          1   1713.3 1743.3
## 
## Step:  AIC=1721.81
## not_paid ~ initial_list_status + purpose + installment + dti + 
##     inq_last_12m + delinq_2yrs + home_ownership + log_inc + verified + 
##     gradeDtoG
## 
##                       Df Deviance    AIC
## - dti                  1   1691.9 1719.9
## - initial_list_status  1   1692.2 1720.2
## - verified             1   1692.8 1720.8
## - home_ownership       2   1695.5 1721.5
## <none>                     1691.8 1721.8
## - inq_last_12m         1   1694.0 1722.0
## - purpose              4   1700.5 1722.5
## + log_revol            1   1691.7 1723.7
## + int_rate             1   1691.7 1723.7
## - log_inc              1   1698.2 1726.2
## - delinq_2yrs          1   1699.3 1727.3
## - gradeDtoG            1   1702.6 1730.6
## - installment          1   1713.3 1741.3
## 
## Step:  AIC=1719.9
## not_paid ~ initial_list_status + purpose + installment + inq_last_12m + 
##     delinq_2yrs + home_ownership + log_inc + verified + gradeDtoG
## 
##                       Df Deviance    AIC
## - initial_list_status  1   1692.2 1718.2
## - verified             1   1692.8 1718.8
## - home_ownership       2   1695.5 1719.5
## <none>                     1691.9 1719.9
## - inq_last_12m         1   1694.0 1720.0
## - purpose              4   1700.5 1720.5
## + int_rate             1   1691.8 1721.8
## + dti                  1   1691.8 1721.8
## + log_revol            1   1691.8 1721.8
## - delinq_2yrs          1   1699.3 1725.3
## - log_inc              1   1699.9 1725.9
## - gradeDtoG            1   1703.0 1729.0
## - installment          1   1714.7 1740.7
## 
## Step:  AIC=1718.24
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + 
##     home_ownership + log_inc + verified + gradeDtoG
## 
##                       Df Deviance    AIC
## - verified             1   1693.2 1717.2
## - home_ownership       2   1696.0 1718.0
## - inq_last_12m         1   1694.2 1718.2
## <none>                     1692.2 1718.2
## - purpose              4   1700.7 1718.7
## + initial_list_status  1   1691.9 1719.9
## + int_rate             1   1692.1 1720.1
## + dti                  1   1692.2 1720.2
## + log_revol            1   1692.2 1720.2
## - delinq_2yrs          1   1699.6 1723.6
## - log_inc              1   1700.4 1724.4
## - gradeDtoG            1   1703.4 1727.4
## - installment          1   1715.2 1739.2
## 
## Step:  AIC=1717.17
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + 
##     home_ownership + log_inc + gradeDtoG
## 
##                       Df Deviance    AIC
## - home_ownership       2   1696.9 1716.9
## - inq_last_12m         1   1695.0 1717.0
## <none>                     1693.2 1717.2
## - purpose              4   1701.9 1717.9
## + verified             1   1692.2 1718.2
## + initial_list_status  1   1692.8 1718.8
## + int_rate             1   1693.0 1719.0
## + log_revol            1   1693.1 1719.1
## + dti                  1   1693.1 1719.1
## - delinq_2yrs          1   1700.6 1722.6
## - log_inc              1   1701.7 1723.7
## - gradeDtoG            1   1704.9 1726.9
## - installment          1   1718.8 1740.8
## 
## Step:  AIC=1716.86
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + 
##     log_inc + gradeDtoG
## 
##                       Df Deviance    AIC
## <none>                     1696.9 1716.9
## - inq_last_12m         1   1699.0 1717.0
## + home_ownership       2   1693.2 1717.2
## + verified             1   1696.0 1718.0
## - purpose              4   1706.2 1718.2
## + initial_list_status  1   1696.4 1718.4
## + int_rate             1   1696.6 1718.6
## + log_revol            1   1696.8 1718.8
## + dti                  1   1696.8 1718.8
## - delinq_2yrs          1   1704.1 1722.1
## - log_inc              1   1706.7 1724.7
## - gradeDtoG            1   1709.0 1727.0
## - installment          1   1722.7 1740.7
## 
## Call:  glm(formula = not_paid ~ purpose + installment + inq_last_12m + 
##     delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
## 
## Coefficients:
##               (Intercept)  purposedebt_consolidation  
##                  3.048714                   0.091903  
##   purposehome_improvement      purposemajor_purchase  
##                  0.239558                   0.844807  
##     purposesmall_business                installment  
##                  0.876574                   0.001081  
##              inq_last_12m                delinq_2yrs  
##                 -0.033791                   0.197173  
##                   log_inc                  gradeDtoG  
##                 -0.337676                   0.444379  
## 
## Degrees of Freedom: 1274 Total (i.e. Null);  1265 Residual
## Null Deviance:       1767 
## Residual Deviance: 1697  AIC: 1717

From stepwise regression, we get the optimized regression formula for our model :

formula = not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + log_inc + gradeDtoG

Let’s try to run this formula to glm() function and we will inspect the p-value and the AIC value of the model.

glm.loan <- glm(formula = not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)

summary(glm.loan)
## 
## Call:
## glm(formula = not_paid ~ purpose + installment + inq_last_12m + 
##     delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9241  -1.1167   0.6801   1.1417   1.6730  
## 
## Coefficients:
##                            Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)                3.048714   1.181895   2.580    0.009894 ** 
## purposedebt_consolidation  0.091903   0.145733   0.631    0.528285    
## purposehome_improvement    0.239558   0.223945   1.070    0.284747    
## purposemajor_purchase      0.844807   0.356926   2.367    0.017938 *  
## purposesmall_business      0.876574   0.480180   1.826    0.067924 .  
## installment                0.001081   0.000217   4.983 0.000000628 ***
## inq_last_12m              -0.033791   0.023068  -1.465    0.142972    
## delinq_2yrs                0.197173   0.075917   2.597    0.009399 ** 
## log_inc                   -0.337676   0.108858  -3.102    0.001922 ** 
## gradeDtoG                  0.444379   0.127694   3.480    0.000501 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.8  on 1274  degrees of freedom
## Residual deviance: 1696.9  on 1265  degrees of freedom
## AIC: 1716.9
## 
## Number of Fisher Scoring iterations: 4

Improving Our Model

From summary, there are some variables that are not giving high significance to the model, so we will omit them from our model. We will omit purpose and inq_last_12m.

glm.loan1 <- glm(formula = not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)

summary(glm.loan1)
## 
## Call:
## glm(formula = not_paid ~ installment + delinq_2yrs + log_inc + 
##     gradeDtoG, family = "binomial", data = loan.data.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9302  -1.1216   0.7186   1.1442   1.5717  
## 
## Coefficients:
##               Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)  3.1734626  1.1652508   2.723    0.006461 ** 
## installment  0.0011120  0.0002154   5.163 0.000000243 ***
## delinq_2yrs  0.1971310  0.0756659   2.605    0.009180 ** 
## log_inc     -0.3455030  0.1073870  -3.217    0.001294 ** 
## gradeDtoG    0.4241966  0.1252241   3.387    0.000705 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1766.8  on 1274  degrees of freedom
## Residual deviance: 1707.7  on 1270  degrees of freedom
## AIC: 1717.7
## 
## Number of Fisher Scoring iterations: 4

The difference of AIC score between glm.loan and glm.loan1 is only 0.8, so it’s okay to omit purpose and inq_last_12m from our model.

The formula used is : formula = not_paid ~ not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG and rename it CreditRisk.

CreditRisk <- glm(formula = not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
CreditRisk$aic
## [1] 1717.693

Negative Coefficient from Logistic Regression

CreditRisk$coefficients
##  (Intercept)  installment  delinq_2yrs      log_inc    gradeDtoG 
##  3.173462603  0.001111953  0.197131024 -0.345503016  0.424196622

As we can see from above, the log_inc have negative coefficients. The coefficients in logistic regression represent the log of odds ratio log(p/1-p) of an event. (p is the probability of an event) As long as the log of our odd ratio is positive, the probability of success will always be more than 50%. When we have negative value for our coefficient, the probability of success is less than 50%.

We refer back to Sigmoid Curve and plot the log_inc coefficient as an example.

plot(function(x)(log(x/(1-x))),
                  xlab = "Probability", 
                  ylab = "Log of Odds", 
                  main = "Sigmoid Curve")
abline(h = CreditRisk$coefficients["log_inc"], col = "red")

As the log_inc has negative coefficient, the higher the income, the probability of the defaulted loan will be lower.

For gradeDtoG variables which has positive coefficient, loan grade A, B, and C (0) will give higher probability of the loan getting paid compared to grade D to G.

Test the Consistancy of Our Model with the Train Data

We want to obtain an unbiased measurement of the model’s accuracy by predicting on our train set before use it on our test set.

loan.data.train$pred.train <- predict(CreditRisk, 
                                loan.data.train, 
                                type = "response")
head(loan.data.train$pred.train)
## [1] 0.5717765 0.5344906 0.4668034 0.5071598 0.6205931 0.4210173

Change the output from pred.train to logical (1 or 0)

We assume if the not_paid value is 0.5 or above, we say it’s a default.

Confusion Table of the Predicted Data with the Train Set of lbb_loan.csv

(conft.train <- table("predicted" = as.numeric(loan.data.train$pred.train >= 0.5),
                      "actual" = loan.data.train$not_paid))
##          actual
## predicted   0   1
##         0 378 279
##         1 244 374

Accuracy, Precision, and Recall from Our Train Set

Accu.Train <- round((conft.train[1]+conft.train[4])/sum(conft.train[1:4]),4)
Prec.Train <- round(conft.train[4]/(conft.train[2]+conft.train[4]), 4)
Reca.Train <- round(conft.train[4]/(conft.train[3]+conft.train[4]), 4)
paste("Accuracy:", Accu.Train*100,"%")
## [1] "Accuracy: 58.98 %"
paste("Precision:", Prec.Train*100,"%")
## [1] "Precision: 60.52 %"
paste("Recall:", Reca.Train*100,"%")
## [1] "Recall: 57.27 %"

More on Improving Our Model

For credit risk analysis, we would rather sacrifice some level of specificity or precision in favor of higher recall (or sensitivity). In simpler words, we want to be more sensitive to “loan defaults”. One thing we can do is to set the treshold to be more sensitive to “positive cases”, by predict a “default” when the probability exceed 0.4 (20% more sensitive than our previous classifier)

Confusion Table of the Predicted Data with the Train Data of lbb_loan.csv

(conft.train2 <- table("predicted" = as.numeric(loan.data.train$pred.train >= 0.4),
                      "actual" = loan.data.train$not_paid))
##          actual
## predicted   0   1
##         0 129  57
##         1 493 596

Accuracy, Precision, and Recall

Accu.Train2 <- round((conft.train2[1]+conft.train2[4])/sum(conft.train2[1:4]),4)
Prec.Train2 <- round(conft.train2[4]/(conft.train2[2]+conft.train2[4]), 4)
Reca.Train2 <- round(conft.train2[4]/(conft.train2[3]+conft.train2[4]), 4)
paste("Accuracy:", Accu.Train2*100,"%")
## [1] "Accuracy: 56.86 %"
paste("Precision:", Prec.Train2*100,"%")
## [1] "Precision: 54.73 %"
paste("Recall:", Reca.Train2*100,"%")
## [1] "Recall: 91.27 %"

From this, we improve our model by increasing our Recall rate to 91.27% from 57.27%, but our Presicion rate goes down from 60.52% to 54.73%. The Accuracy rate goes down a little from 58.98% to 56.86%.

Using Our Model to Test Set

Now we’re going to use our CreditRisk model : formula = not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, with the threshold of not_paid more than 0.4 is assumed as a default loan. (see conft.train2)

Use predict() for our test set

loan.data.test$pred.test <- predict(CreditRisk, 
                                loan.data.test, 
                                type = "response")
head(loan.data.test$pred.test)
## [1] 0.5106317 0.6800422 0.4686805 0.4447727 0.4670663 0.3176890

Confusion Table of the Predicted Data with the Test Data of lbb_loan.csv

(conft.test <- table("predicted" = as.numeric(loan.data.test$pred.test >= 0.4),
                      "actual" = loan.data.test$not_paid))
##          actual
## predicted  0  1
##         0 32  7
##         1 95 91

Accuracy, Precision, and Recall

Accu.Test <- round((conft.test[1]+conft.test[4])/sum(conft.test[1:4]),4)
Prec.Test <- round(conft.test[4]/(conft.test[2]+conft.test[4]), 4)
Reca.Test <- round(conft.test[4]/(conft.test[3]+conft.test[4]), 4)

By using Test Set to our improved model, we get:

paste("Accuracy:", Accu.Test*100,"%")
## [1] "Accuracy: 54.67 %"
paste("Precision:", Prec.Test*100,"%")
## [1] "Precision: 48.92 %"
paste("Recall:", Reca.Test*100,"%")
## [1] "Recall: 92.86 %"

The Recall rate for our test set is 92.86%, which is better than the Recall rate for our train set (91.27%). The value is quite consistent with out train set, so we can say that our model is acceptable to be used to predict more unseen (future) data.