This analysis demonstrates the use of logistic regression on the lbb_loans.csv dataset: we interpret the negative coefficients obtained from the logistic regression model, identify which variables are statistically significant predictors, and demonstrate some strategies to improve the model that has been built.
knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
rm(list=ls())
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
# getwd()
loan.data <- read.csv("lbb_loans.csv")
str(loan.data)
## 'data.frame': 1500 obs. of 16 variables:
## $ initial_list_status: Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
## $ purpose : Factor w/ 5 levels "credit_card",..: 2 2 3 1 1 2 2 2 2 1 ...
## $ int_rate : num 12 12.6 19 14.1 25.8 ...
## $ installment : num 399 121 727 291 389 ...
## $ annual_inc : num 55000 62000 29000 40000 31500 44000 80000 104000 75000 42000 ...
## $ dti : num 16 30.7 10.2 22.5 38.4 ...
## $ verification_status: Factor w/ 3 levels "Not Verified",..: 2 2 1 2 3 1 1 1 2 3 ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 4 3 5 3 5 4 1 5 ...
## $ revol_bal : int 754 2194 5279 6600 10489 5217 14036 27942 3885 5768 ...
## $ inq_last_12m : int 1 2 5 0 4 3 1 2 0 6 ...
## $ delinq_2yrs : int 1 0 3 0 0 1 1 0 0 3 ...
## $ home_ownership : Factor w/ 3 levels "MORTGAGE","OWN",..: 3 3 1 3 2 1 3 1 3 1 ...
## $ not_paid : int 0 1 1 0 0 1 1 1 0 0 ...
## $ log_inc : num 10.9 11 10.3 10.6 10.4 ...
## $ verified : int 1 1 0 1 1 0 0 0 1 1 ...
## $ grdCtoA : int 1 0 0 0 0 0 0 0 1 0 ...
head(loan.data, 12)
## initial_list_status purpose int_rate installment annual_inc
## 1 w debt_consolidation 11.99 398.52 55000
## 2 w debt_consolidation 12.62 120.65 62000
## 3 w home_improvement 19.03 726.80 29000
## 4 w credit_card 14.08 291.38 40000
## 5 w credit_card 25.82 389.34 31500
## 6 w debt_consolidation 16.02 140.67 44000
## 7 f debt_consolidation 22.91 773.26 80000
## 8 w debt_consolidation 18.06 1447.31 104000
## 9 w debt_consolidation 7.35 31.04 75000
## 10 w credit_card 25.82 355.03 42000
## 11 w credit_card 14.08 218.99 50000
## 12 w debt_consolidation 13.59 424.74 55000
## dti verification_status grade revol_bal inq_last_12m delinq_2yrs
## 1 15.99 Source Verified B 754 1 1
## 2 30.66 Source Verified C 2194 2 0
## 3 10.22 Not Verified D 5279 5 3
## 4 22.53 Source Verified C 6600 0 0
## 5 38.36 Verified E 10489 4 0
## 6 26.60 Not Verified C 5217 3 1
## 7 26.43 Not Verified E 14036 1 1
## 8 26.70 Not Verified D 27942 2 0
## 9 4.65 Source Verified A 3885 0 0
## 10 48.60 Verified E 5768 6 3
## 11 29.07 Not Verified C 14780 0 0
## 12 20.39 Not Verified C 11809 0 0
## home_ownership not_paid log_inc verified grdCtoA
## 1 RENT 0 10.91509 1 1
## 2 RENT 1 11.03489 1 0
## 3 MORTGAGE 1 10.27505 0 0
## 4 RENT 0 10.59663 1 0
## 5 OWN 0 10.35774 1 0
## 6 MORTGAGE 1 10.69194 0 0
## 7 RENT 1 11.28978 0 0
## 8 MORTGAGE 1 11.55215 0 0
## 9 RENT 0 11.22524 1 1
## 10 MORTGAGE 0 10.64542 1 0
## 11 MORTGAGE 1 10.81978 0 0
## 12 RENT 0 10.91509 0 0
anyNA(loan.data)
## [1] FALSE
Let's investigate the relationships and discover the rough structure of the data. The variable of interest is the not_paid
variable, a binary variable that indicates whether a loan is fully paid or not. A loan is considered "not paid" (not_paid = 1) when it is Defaulted, Charged Off, or past its due date (Grace Period).
names(loan.data)
## [1] "initial_list_status" "purpose" "int_rate"
## [4] "installment" "annual_inc" "dti"
## [7] "verification_status" "grade" "revol_bal"
## [10] "inq_last_12m" "delinq_2yrs" "home_ownership"
## [13] "not_paid" "log_inc" "verified"
## [16] "grdCtoA"
The explanation of the variables in the dataset is presented below.
- initial_list_status: either w (whole) or f (fractional). This variable indicates whether the loan was a whole loan or a fractional loan. For background: some institutional investors prefer to purchase loans in their entirety to obtain legal and accounting treatment specific to their situation, with the added benefit of "instant funding" to borrowers.
- purpose: simplified from the original data; one of credit_card, debt_consolidation, home_improvement, major_purchase and small_business.
- int_rate: interest rate in percentages.
- installment: monthly payment owed by the borrower.
- annual_inc: self-reported annual income provided by the borrower / co-borrowers during application.
- dti: ratio of the borrower's total monthly debt payments on his/her total obligations to the self-reported monthly income (debt-to-income ratio).
- verification_status: whether the reported income was verified, not verified, or whether the income source was verified.
- grade: software-assigned loan grade.
- revol_bal: total credit revolving balance (in the case of a credit card, the portion of credit card spending that goes unpaid at the end of a billing cycle).
- inq_last_12m: number of credit inquiries in the last 12 months.
- delinq_2yrs: number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.
- home_ownership: one of MORTGAGE, OWN and RENT.
- not_paid: 1 for charged-off, past-due / grace period or defaulted loans, 0 for fully-paid loans.
- log_inc: log of annual_inc.
- verified: 0 for "Not Verified" under verification_status, 1 otherwise.
- grdCtoA: 0 for a grade of A, B or C, 1 otherwise (this coding is checked, and corrected, later in the analysis).

We use the describe() function from the Hmisc package to see the overall data.
describe(loan.data)
## loan.data
##
## 16 Variables 1500 Observations
## ---------------------------------------------------------------------------
## initial_list_status
## n missing distinct
## 1500 0 2
##
## Value f w
## Frequency 308 1192
## Proportion 0.205 0.795
## ---------------------------------------------------------------------------
## purpose
## n missing distinct
## 1500 0 5
##
## Value credit_card debt_consolidation home_improvement
## Frequency 314 954 157
## Proportion 0.209 0.636 0.105
##
## Value major_purchase small_business
## Frequency 49 26
## Proportion 0.033 0.017
## ---------------------------------------------------------------------------
## int_rate
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 36 0.998 14.74 6.413 7.21 7.97
## .25 .50 .75 .90 .95
## 10.42 14.08 18.06 23.88 25.84
##
## lowest : 5.32 6.08 6.72 7.07 7.21, highest: 30.75 30.79 30.84 30.89 30.94
## ---------------------------------------------------------------------------
## installment
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 977 1 469.1 328.7 105.4 148.8
## .25 .50 .75 .90 .95
## 248.2 382.3 635.9 903.5 1071.0
##
## lowest : 31.04 36.68 38.67 39.24 39.57
## highest: 1415.78 1447.31 1501.96 1502.92 1503.89
## ---------------------------------------------------------------------------
## annual_inc
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 372 1 81554 53755 25180 32384
## .25 .50 .75 .90 .95
## 48000 67890 95555 140000 180000
##
## lowest : 2500 3000 10000 10776 12000
## highest: 494000 600000 750000 950000 1200000
## ---------------------------------------------------------------------------
## dti
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 1219 1 18.9 12.29 4.096 6.514
## .25 .50 .75 .90 .95
## 11.010 17.160 24.550 31.938 36.522
##
## lowest : 0.00 0.14 0.41 0.44 0.45, highest: 94.72 107.60 115.85 163.15 198.56
## ---------------------------------------------------------------------------
## verification_status
## n missing distinct
## 1500 0 3
##
## Value Not Verified Source Verified Verified
## Frequency 555 523 422
## Proportion 0.370 0.349 0.281
## ---------------------------------------------------------------------------
## grade
## n missing distinct
## 1500 0 7
##
## Value A B C D E F G
## Frequency 203 366 451 311 116 38 15
## Proportion 0.135 0.244 0.301 0.207 0.077 0.025 0.010
## ---------------------------------------------------------------------------
## revol_bal
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 1441 1 15207 15859 924 1900
## .25 .50 .75 .90 .95
## 4994 10139 18125 31401 44265
##
## lowest : 0 3 4 9 10, highest: 176233 178902 211088 228292 258897
## ---------------------------------------------------------------------------
## inq_last_12m
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 19 0.967 2.357 2.522 0.0 0.0
## .25 .50 .75 .90 .95
## 1.0 2.0 3.0 5.1 8.0
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 356 330 274 204 118 68 42 32 31 16
## Proportion 0.237 0.220 0.183 0.136 0.079 0.045 0.028 0.021 0.021 0.011
##
## Value 10 11 12 13 14 15 16 17 18
## Frequency 10 5 3 1 2 1 3 2 2
## Proportion 0.007 0.003 0.002 0.001 0.001 0.001 0.002 0.001 0.001
## ---------------------------------------------------------------------------
## delinq_2yrs
## n missing distinct Info Mean Gmd
## 1500 0 9 0.48 0.3127 0.5376
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 1204 199 55 26 9 1 2 3 1
## Proportion 0.803 0.133 0.037 0.017 0.006 0.001 0.001 0.002 0.001
## ---------------------------------------------------------------------------
## home_ownership
## n missing distinct
## 1500 0 3
##
## Value MORTGAGE OWN RENT
## Frequency 754 192 554
## Proportion 0.503 0.128 0.369
## ---------------------------------------------------------------------------
## not_paid
## n missing distinct Info Sum Mean Gmd
## 1500 0 2 0.75 751 0.5007 0.5003
##
## ---------------------------------------------------------------------------
## log_inc
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 372 1 11.12 0.6494 10.13 10.39
## .25 .50 .75 .90 .95
## 10.78 11.13 11.47 11.85 12.10
##
## lowest : 7.824046 8.006368 9.210340 9.285077 9.392662
## highest: 13.110291 13.304685 13.527828 13.764217 13.997832
## ---------------------------------------------------------------------------
## verified
## n missing distinct Info Sum Mean Gmd
## 1500 0 2 0.699 945 0.63 0.4665
##
## ---------------------------------------------------------------------------
## grdCtoA
## n missing distinct Info Sum Mean Gmd
## 1500 0 2 0.706 569 0.3793 0.4712
##
## ---------------------------------------------------------------------------
Let's check the number of charged-off, past-due or defaulted loans (not_paid == 1) vs fully-paid loans (not_paid == 0).
table(loan.data$not_paid)
##
## 0 1
## 749 751
summary(loan.data$dti)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.01 17.16 18.90 24.55 198.56
dti summary when not_paid is 0 (fully-paid loans):
summary(loan.data[loan.data$not_paid == 0, "dti"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.05 16.78 18.41 23.80 198.56
dti summary when not_paid is 1 (charged-off, past-due / grace period or defaulted loans):
summary(loan.data[loan.data$not_paid == 1, "dti"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.92 17.52 19.40 25.05 115.85
summary(loan.data$revol_bal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4994 10139 15207 18125 258897
Like the annual income (annual_inc) variable, the revolving balance revol_bal has a very large variance, so we transform it to log(revol_bal). Because revol_bal can be 0 and log(0) is undefined, zero balances are mapped to 0.
loan.data$log_revol <- ifelse(
loan.data$revol_bal == 0,
0,
log(loan.data$revol_bal)
)
aggregate(loan.data$int_rate ~ loan.data$grade, FUN = mean)
## loan.data$grade loan.data$int_rate
## 1 A 7.146847
## 2 B 10.484945
## 3 C 14.297494
## 4 D 19.066495
## 5 E 24.698534
## 6 F 29.516316
## 7 G 30.836667
From the dataset, we can see that the further the grade moves from A towards G, the higher the interest rate (int_rate). With a higher interest rate, we can assume that the probability of a defaulted loan is also higher.
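As a quick sanity check of that assumption, one option (a sketch; output not shown here) is to look at the observed default rate per grade while the grade column is still in the data:
# quick check (sketch): observed default rate per grade
# run this before the grade column is dropped further below
aggregate(not_paid ~ grade, data = loan.data, FUN = mean)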
After checking the grdCtoA column, it turns out that grades A and B are coded as 1 and everything else as 0, which does not match the description above. The column is therefore dropped and replaced by a new column, gradeDtoG, coded 0 for a grade of A, B or C and 1 otherwise.
loan.data$grdCtoA <- NULL
loan.data$gradeDtoG <- as.integer(
case_when(
loan.data$grade == "A" ~ 0,
loan.data$grade == "B" ~ 0,
loan.data$grade == "C" ~ 0,
loan.data$grade == "D" ~ 1,
loan.data$grade == "E" ~ 1,
loan.data$grade == "F" ~ 1,
loan.data$grade == "G" ~ 1)
)
As "annual_inc", "revol_bal", "verification_status", and "grade" are now represented by "log_inc", "log_revol", "verified", and "gradeDtoG", the four original variables are removed from the analysis.
# exclude "annual_inc", "revol_bal", "verification_status", "grade"
loan.data$annual_inc <- loan.data$revol_bal <- loan.data$verification_status <- loan.data$grade <- NULL
We check that loan.data now has 13 variables instead of the original 16.
str(loan.data)
## 'data.frame': 1500 obs. of 13 variables:
## $ initial_list_status: Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
## $ purpose : Factor w/ 5 levels "credit_card",..: 2 2 3 1 1 2 2 2 2 1 ...
## $ int_rate : num 12 12.6 19 14.1 25.8 ...
## $ installment : num 399 121 727 291 389 ...
## $ dti : num 16 30.7 10.2 22.5 38.4 ...
## $ inq_last_12m : int 1 2 5 0 4 3 1 2 0 6 ...
## $ delinq_2yrs : int 1 0 3 0 0 1 1 0 0 3 ...
## $ home_ownership : Factor w/ 3 levels "MORTGAGE","OWN",..: 3 3 1 3 2 1 3 1 3 1 ...
## $ not_paid : int 0 1 1 0 0 1 1 1 0 0 ...
## $ log_inc : num 10.9 11 10.3 10.6 10.4 ...
## $ verified : int 1 1 0 1 1 0 0 0 1 1 ...
## $ log_revol : num 6.63 7.69 8.57 8.79 9.26 ...
## $ gradeDtoG : int 0 0 1 0 1 0 1 1 0 1 ...
anyNA(loan.data)
## [1] FALSE
summary(loan.data)
## initial_list_status purpose int_rate
## f: 308 credit_card :314 Min. : 5.32
## w:1192 debt_consolidation:954 1st Qu.:10.42
## home_improvement :157 Median :14.08
## major_purchase : 49 Mean :14.74
## small_business : 26 3rd Qu.:18.06
## Max. :30.94
## installment dti inq_last_12m delinq_2yrs
## Min. : 31.04 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.: 248.15 1st Qu.: 11.01 1st Qu.: 1.000 1st Qu.:0.0000
## Median : 382.30 Median : 17.16 Median : 2.000 Median :0.0000
## Mean : 469.15 Mean : 18.90 Mean : 2.357 Mean :0.3127
## 3rd Qu.: 635.86 3rd Qu.: 24.55 3rd Qu.: 3.000 3rd Qu.:0.0000
## Max. :1503.89 Max. :198.56 Max. :18.000 Max. :8.0000
## home_ownership not_paid log_inc verified
## MORTGAGE:754 Min. :0.0000 Min. : 7.824 Min. :0.00
## OWN :192 1st Qu.:0.0000 1st Qu.:10.779 1st Qu.:0.00
## RENT :554 Median :1.0000 Median :11.126 Median :1.00
## Mean :0.5007 Mean :11.123 Mean :0.63
## 3rd Qu.:1.0000 3rd Qu.:11.467 3rd Qu.:1.00
## Max. :1.0000 Max. :13.998 Max. :1.00
## log_revol gradeDtoG
## Min. : 0.000 Min. :0.00
## 1st Qu.: 8.516 1st Qu.:0.00
## Median : 9.224 Median :0.00
## Mean : 8.974 Mean :0.32
## 3rd Qu.: 9.805 3rd Qu.:1.00
## Max. :12.464 Max. :1.00
In this part, we split our dataset into train and test sets and build our machine learning model using data only from the train set, so that we can obtain an unbiased measurement of the model's accuracy by predicting on the test set.
Obtaining an unbiased estimate of the model's out-of-sample performance matters because the in-sample error is usually optimistic: the model is tuned / adapted to minimize the error on the training sample. The in-sample error is therefore not a good indication of how the model will perform on unseen data.
We'll split our data into the loan.data.train and loan.data.test sets (85:15).
set.seed(2018)
intrain <- sample(nrow(loan.data), nrow(loan.data)*0.85)
loan.data.train <- loan.data[intrain, ]
loan.data.test <- loan.data[-intrain, ]
Verify that the split is well shuffled by checking the class proportions in each set.
prop.table(table(loan.data.train$not_paid))
##
## 0 1
## 0.4878431 0.5121569
prop.table(table(loan.data.test$not_paid))
##
## 0 1
## 0.5644444 0.4355556
As both sets are split roughly 50-50 between the two classes, we can move on to the next step.
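If we wanted the not_paid proportion preserved exactly in both sets, a stratified split is one option (a sketch; the rest of this analysis keeps the random split above):
# stratified split (sketch): sample 85% of the rows within each not_paid class
set.seed(2018)
idx0 <- which(loan.data$not_paid == 0)
idx1 <- which(loan.data$not_paid == 1)
intrain.strat <- c(sample(idx0, floor(length(idx0) * 0.85)),
                   sample(idx1, floor(length(idx1) * 0.85)))
# loan.data[intrain.strat, ] and loan.data[-intrain.strat, ] would then
# have (nearly) identical class proportions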
We use stepwise regression, in which the choice of predictor variables is carried out by an automatic procedure.
glm.all <- glm(not_paid ~., family = "binomial", loan.data.train)
# we only use it as guidance when building our model
step(glm.all, direction = "both")
## Start: AIC=1725.61
## not_paid ~ initial_list_status + purpose + int_rate + installment +
## dti + inq_last_12m + delinq_2yrs + home_ownership + log_inc +
## verified + log_revol + gradeDtoG
##
## Df Deviance AIC
## - int_rate 1 1691.7 1723.7
## - log_revol 1 1691.7 1723.7
## - dti 1 1691.7 1723.7
## - initial_list_status 1 1691.9 1723.9
## - verified 1 1692.4 1724.4
## - home_ownership 2 1695.3 1725.3
## <none> 1691.6 1725.6
## - purpose 4 1699.7 1725.7
## - inq_last_12m 1 1693.9 1725.9
## - gradeDtoG 1 1694.6 1726.6
## - log_inc 1 1696.6 1728.6
## - delinq_2yrs 1 1698.9 1730.9
## - installment 1 1712.4 1744.4
##
## Step: AIC=1723.7
## not_paid ~ initial_list_status + purpose + installment + dti +
## inq_last_12m + delinq_2yrs + home_ownership + log_inc + verified +
## log_revol + gradeDtoG
##
## Df Deviance AIC
## - log_revol 1 1691.8 1721.8
## - dti 1 1691.8 1721.8
## - initial_list_status 1 1692.0 1722.0
## - verified 1 1692.6 1722.6
## - home_ownership 2 1695.4 1723.4
## <none> 1691.7 1723.7
## - purpose 4 1699.8 1723.8
## - inq_last_12m 1 1693.9 1723.9
## + int_rate 1 1691.6 1725.6
## - log_inc 1 1696.8 1726.8
## - delinq_2yrs 1 1699.2 1729.2
## - gradeDtoG 1 1702.5 1732.5
## - installment 1 1713.3 1743.3
##
## Step: AIC=1721.81
## not_paid ~ initial_list_status + purpose + installment + dti +
## inq_last_12m + delinq_2yrs + home_ownership + log_inc + verified +
## gradeDtoG
##
## Df Deviance AIC
## - dti 1 1691.9 1719.9
## - initial_list_status 1 1692.2 1720.2
## - verified 1 1692.8 1720.8
## - home_ownership 2 1695.5 1721.5
## <none> 1691.8 1721.8
## - inq_last_12m 1 1694.0 1722.0
## - purpose 4 1700.5 1722.5
## + log_revol 1 1691.7 1723.7
## + int_rate 1 1691.7 1723.7
## - log_inc 1 1698.2 1726.2
## - delinq_2yrs 1 1699.3 1727.3
## - gradeDtoG 1 1702.6 1730.6
## - installment 1 1713.3 1741.3
##
## Step: AIC=1719.9
## not_paid ~ initial_list_status + purpose + installment + inq_last_12m +
## delinq_2yrs + home_ownership + log_inc + verified + gradeDtoG
##
## Df Deviance AIC
## - initial_list_status 1 1692.2 1718.2
## - verified 1 1692.8 1718.8
## - home_ownership 2 1695.5 1719.5
## <none> 1691.9 1719.9
## - inq_last_12m 1 1694.0 1720.0
## - purpose 4 1700.5 1720.5
## + int_rate 1 1691.8 1721.8
## + dti 1 1691.8 1721.8
## + log_revol 1 1691.8 1721.8
## - delinq_2yrs 1 1699.3 1725.3
## - log_inc 1 1699.9 1725.9
## - gradeDtoG 1 1703.0 1729.0
## - installment 1 1714.7 1740.7
##
## Step: AIC=1718.24
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs +
## home_ownership + log_inc + verified + gradeDtoG
##
## Df Deviance AIC
## - verified 1 1693.2 1717.2
## - home_ownership 2 1696.0 1718.0
## - inq_last_12m 1 1694.2 1718.2
## <none> 1692.2 1718.2
## - purpose 4 1700.7 1718.7
## + initial_list_status 1 1691.9 1719.9
## + int_rate 1 1692.1 1720.1
## + dti 1 1692.2 1720.2
## + log_revol 1 1692.2 1720.2
## - delinq_2yrs 1 1699.6 1723.6
## - log_inc 1 1700.4 1724.4
## - gradeDtoG 1 1703.4 1727.4
## - installment 1 1715.2 1739.2
##
## Step: AIC=1717.17
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs +
## home_ownership + log_inc + gradeDtoG
##
## Df Deviance AIC
## - home_ownership 2 1696.9 1716.9
## - inq_last_12m 1 1695.0 1717.0
## <none> 1693.2 1717.2
## - purpose 4 1701.9 1717.9
## + verified 1 1692.2 1718.2
## + initial_list_status 1 1692.8 1718.8
## + int_rate 1 1693.0 1719.0
## + log_revol 1 1693.1 1719.1
## + dti 1 1693.1 1719.1
## - delinq_2yrs 1 1700.6 1722.6
## - log_inc 1 1701.7 1723.7
## - gradeDtoG 1 1704.9 1726.9
## - installment 1 1718.8 1740.8
##
## Step: AIC=1716.86
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs +
## log_inc + gradeDtoG
##
## Df Deviance AIC
## <none> 1696.9 1716.9
## - inq_last_12m 1 1699.0 1717.0
## + home_ownership 2 1693.2 1717.2
## + verified 1 1696.0 1718.0
## - purpose 4 1706.2 1718.2
## + initial_list_status 1 1696.4 1718.4
## + int_rate 1 1696.6 1718.6
## + log_revol 1 1696.8 1718.8
## + dti 1 1696.8 1718.8
## - delinq_2yrs 1 1704.1 1722.1
## - log_inc 1 1706.7 1724.7
## - gradeDtoG 1 1709.0 1727.0
## - installment 1 1722.7 1740.7
##
## Call: glm(formula = not_paid ~ purpose + installment + inq_last_12m +
## delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
##
## Coefficients:
## (Intercept) purposedebt_consolidation
## 3.048714 0.091903
## purposehome_improvement purposemajor_purchase
## 0.239558 0.844807
## purposesmall_business installment
## 0.876574 0.001081
## inq_last_12m delinq_2yrs
## -0.033791 0.197173
## log_inc gradeDtoG
## -0.337676 0.444379
##
## Degrees of Freedom: 1274 Total (i.e. Null); 1265 Residual
## Null Deviance: 1767
## Residual Deviance: 1697 AIC: 1717
From stepwise regression, we get the optimized regression formula for our model:
formula = not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + log_inc + gradeDtoG
Let's run this formula through the glm() function and inspect the p-values and the AIC of the resulting model.
glm.loan <- glm(formula = not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
summary(glm.loan)
##
## Call:
## glm(formula = not_paid ~ purpose + installment + inq_last_12m +
## delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9241 -1.1167 0.6801 1.1417 1.6730
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.048714 1.181895 2.580 0.009894 **
## purposedebt_consolidation 0.091903 0.145733 0.631 0.528285
## purposehome_improvement 0.239558 0.223945 1.070 0.284747
## purposemajor_purchase 0.844807 0.356926 2.367 0.017938 *
## purposesmall_business 0.876574 0.480180 1.826 0.067924 .
## installment 0.001081 0.000217 4.983 0.000000628 ***
## inq_last_12m -0.033791 0.023068 -1.465 0.142972
## delinq_2yrs 0.197173 0.075917 2.597 0.009399 **
## log_inc -0.337676 0.108858 -3.102 0.001922 **
## gradeDtoG 0.444379 0.127694 3.480 0.000501 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1766.8 on 1274 degrees of freedom
## Residual deviance: 1696.9 on 1265 degrees of freedom
## AIC: 1716.9
##
## Number of Fisher Scoring iterations: 4
From the summary, some variables do not contribute significantly to the model, so we will omit them: purpose and inq_last_12m.
glm.loan1 <- glm(formula = not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
summary(glm.loan1)
##
## Call:
## glm(formula = not_paid ~ installment + delinq_2yrs + log_inc +
## gradeDtoG, family = "binomial", data = loan.data.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9302 -1.1216 0.7186 1.1442 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1734626 1.1652508 2.723 0.006461 **
## installment 0.0011120 0.0002154 5.163 0.000000243 ***
## delinq_2yrs 0.1971310 0.0756659 2.605 0.009180 **
## log_inc -0.3455030 0.1073870 -3.217 0.001294 **
## gradeDtoG 0.4241966 0.1252241 3.387 0.000705 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1766.8 on 1274 degrees of freedom
## Residual deviance: 1707.7 on 1270 degrees of freedom
## AIC: 1717.7
##
## Number of Fisher Scoring iterations: 4
The difference in AIC between glm.loan and glm.loan1 is only about 0.8, so it is acceptable to omit purpose and inq_last_12m from our model.
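A more formal way to compare the two nested models is a likelihood-ratio test (a sketch; output not shown):
# likelihood-ratio test between the reduced model and the full stepwise model
anova(glm.loan1, glm.loan, test = "Chisq")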
The formula used is: not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, and we name the resulting model CreditRisk.
CreditRisk <- glm(formula = not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
CreditRisk$aic
## [1] 1717.693
CreditRisk$coefficients
## (Intercept) installment delinq_2yrs log_inc gradeDtoG
## 3.173462603 0.001111953 0.197131024 -0.345503016 0.424196622
As we can see above, log_inc has a negative coefficient. In logistic regression the model's linear predictor is the log of the odds, log(p / (1 - p)), where p is the probability of the event. Whenever the log-odds is positive, the probability is above 50%; whenever it is negative, the probability is below 50%. A negative coefficient therefore means that increasing that predictor lowers the log-odds, and with it the predicted probability.
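To make the link between log-odds and probability concrete, the inverse-logit function plogis() maps log-odds back to probabilities (an illustration, not part of the original analysis):
# plogis(x) = 1 / (1 + exp(-x)) converts a log-odds value into a probability
plogis(0)   # log-odds of 0     -> probability 0.5
plogis(1)   # positive log-odds -> probability above 0.5 (about 0.73)
plogis(-1)  # negative log-odds -> probability below 0.5 (about 0.27)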
We refer back to the sigmoid relationship (plotted here as log-odds against probability) and mark the log_inc coefficient on it as an example.
plot(function(x)(log(x/(1-x))),
xlab = "Probability",
ylab = "Log of Odds",
main = "Sigmoid Curve")
abline(h = CreditRisk$coefficients["log_inc"], col = "red")
Since log_inc has a negative coefficient, the higher the income, the lower the probability that the loan defaults.
For gradeDtoG, which has a positive coefficient, loans graded A, B or C (gradeDtoG = 0) have a higher probability of being paid than loans graded D to G (gradeDtoG = 1).
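As an illustration of both effects, we can score two hypothetical borrowers that differ only in income, and read the exponentiated coefficients as odds ratios (a sketch with assumed example values; output not shown):
# two hypothetical borrowers, identical except for log_inc (assumed values)
example.borrowers <- data.frame(
  installment = 400,
  delinq_2yrs = 0,
  log_inc     = c(10.5, 11.5),  # lower vs higher income on the log scale
  gradeDtoG   = 0
)
predict(CreditRisk, example.borrowers, type = "response")
# the higher-income borrower should get the lower predicted probability
# exponentiating the coefficients gives the multiplicative change in odds
# per one-unit increase of each predictor
exp(CreditRisk$coefficients)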
Before applying the model to the test set, we first measure its performance on the train set as a baseline (only the test-set results give an unbiased measurement of accuracy).
loan.data.train$pred.train <- predict(CreditRisk,
loan.data.train,
type = "response")
head(loan.data.train$pred.train)
## [1] 0.5717765 0.5344906 0.4668034 0.5071598 0.6205931 0.4210173
Next, we convert the predicted probabilities in pred.train to class labels (1 or 0): if the predicted probability of not_paid is 0.5 or above, we classify the loan as a default.
Confusion Table of the Predicted Data with the Train Set of lbb_loans.csv
(conft.train <- table("predicted" = as.numeric(loan.data.train$pred.train >= 0.5),
"actual" = loan.data.train$not_paid))
## actual
## predicted 0 1
## 0 378 279
## 1 244 374
Accu.Train <- round((conft.train[1]+conft.train[4])/sum(conft.train[1:4]),4)
Prec.Train <- round(conft.train[4]/(conft.train[2]+conft.train[4]), 4)
Reca.Train <- round(conft.train[4]/(conft.train[3]+conft.train[4]), 4)
paste("Accuracy:", Accu.Train*100,"%")
## [1] "Accuracy: 58.98 %"
paste("Precision:", Prec.Train*100,"%")
## [1] "Precision: 60.52 %"
paste("Recall:", Reca.Train*100,"%")
## [1] "Recall: 57.27 %"
For credit risk analysis, we would rather sacrifice some precision (or specificity) in favor of higher recall (sensitivity). In simpler words, we want to be more sensitive to loan defaults. One thing we can do is set the threshold to be more sensitive to positive cases, by predicting a default whenever the probability exceeds 0.4 instead of the 0.5 used by our previous classifier.
Confusion Table of the Predicted Data with the Train Data of lbb_loans.csv
(conft.train2 <- table("predicted" = as.numeric(loan.data.train$pred.train >= 0.4),
"actual" = loan.data.train$not_paid))
## actual
## predicted 0 1
## 0 129 57
## 1 493 596
Accuracy, Precision, and Recall
Accu.Train2 <- round((conft.train2[1]+conft.train2[4])/sum(conft.train2[1:4]),4)
Prec.Train2 <- round(conft.train2[4]/(conft.train2[2]+conft.train2[4]), 4)
Reca.Train2 <- round(conft.train2[4]/(conft.train2[3]+conft.train2[4]), 4)
paste("Accuracy:", Accu.Train2*100,"%")
## [1] "Accuracy: 56.86 %"
paste("Precision:", Prec.Train2*100,"%")
## [1] "Precision: 54.73 %"
paste("Recall:", Reca.Train2*100,"%")
## [1] "Recall: 91.27 %"
With this change we improve the model's Recall from 57.27% to 91.27%, but Precision goes down from 60.52% to 54.73% and Accuracy drops slightly from 58.98% to 56.86%.
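Rather than picking 0.4 by eye, one strategy is to sweep over several cutoffs on the train set and inspect the recall/precision trade-off directly (a sketch; output not shown):
# recall and precision on the train set for a range of candidate thresholds
thresholds <- seq(0.3, 0.7, by = 0.05)
t(sapply(thresholds, function(th) {
  pred <- as.numeric(loan.data.train$pred.train >= th)
  tp <- sum(pred == 1 & loan.data.train$not_paid == 1)
  fp <- sum(pred == 1 & loan.data.train$not_paid == 0)
  fn <- sum(pred == 0 & loan.data.train$not_paid == 1)
  c(threshold = th, recall = tp / (tp + fn), precision = tp / (tp + fp))
}))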
Now we're going to apply our CreditRisk model (not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG) to the test set, classifying a loan as a default when the predicted probability of not_paid exceeds 0.4 (see conft.train2).
Use predict() for our test set
loan.data.test$pred.test <- predict(CreditRisk,
loan.data.test,
type = "response")
head(loan.data.test$pred.test)
## [1] 0.5106317 0.6800422 0.4686805 0.4447727 0.4670663 0.3176890
Confusion Table of the Predicted Data with the Test Data of lbb_loans.csv
(conft.test <- table("predicted" = as.numeric(loan.data.test$pred.test >= 0.4),
"actual" = loan.data.test$not_paid))
## actual
## predicted 0 1
## 0 32 7
## 1 95 91
Accuracy, Precision, and Recall
Accu.Test <- round((conft.test[1]+conft.test[4])/sum(conft.test[1:4]),4)
Prec.Test <- round(conft.test[4]/(conft.test[2]+conft.test[4]), 4)
Reca.Test <- round(conft.test[4]/(conft.test[3]+conft.test[4]), 4)
By applying the improved model to the test set, we get:
paste("Accuracy:", Accu.Test*100,"%")
## [1] "Accuracy: 54.67 %"
paste("Precision:", Prec.Test*100,"%")
## [1] "Precision: 48.92 %"
paste("Recall:", Reca.Test*100,"%")
## [1] "Recall: 92.86 %"
The Recall on the test set is 92.86%, slightly better than the Recall on the train set (91.27%). Since the value is consistent with the train set, we can say that the model is acceptable for predicting unseen (future) data.