This analysis demonstrates the use of logistic regression on the lbb_loans.csv dataset: we interpret the negative coefficients obtained from the logistic regression model, identify which variables are statistically significant predictors, and demonstrate some strategies to improve the model that has been built.
knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
rm(list=ls())
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
# getwd()
loan.data <- read.csv("lbb_loans.csv")
str(loan.data)
## 'data.frame': 1500 obs. of 16 variables:
## $ initial_list_status: Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
## $ purpose : Factor w/ 5 levels "credit_card",..: 2 2 3 1 1 2 2 2 2 1 ...
## $ int_rate : num 12 12.6 19 14.1 25.8 ...
## $ installment : num 399 121 727 291 389 ...
## $ annual_inc : num 55000 62000 29000 40000 31500 44000 80000 104000 75000 42000 ...
## $ dti : num 16 30.7 10.2 22.5 38.4 ...
## $ verification_status: Factor w/ 3 levels "Not Verified",..: 2 2 1 2 3 1 1 1 2 3 ...
## $ grade : Factor w/ 7 levels "A","B","C","D",..: 2 3 4 3 5 3 5 4 1 5 ...
## $ revol_bal : int 754 2194 5279 6600 10489 5217 14036 27942 3885 5768 ...
## $ inq_last_12m : int 1 2 5 0 4 3 1 2 0 6 ...
## $ delinq_2yrs : int 1 0 3 0 0 1 1 0 0 3 ...
## $ home_ownership : Factor w/ 3 levels "MORTGAGE","OWN",..: 3 3 1 3 2 1 3 1 3 1 ...
## $ not_paid : int 0 1 1 0 0 1 1 1 0 0 ...
## $ log_inc : num 10.9 11 10.3 10.6 10.4 ...
## $ verified : int 1 1 0 1 1 0 0 0 1 1 ...
## $ grdCtoA : int 1 0 0 0 0 0 0 0 1 0 ...
head(loan.data, 12)
## initial_list_status purpose int_rate installment annual_inc
## 1 w debt_consolidation 11.99 398.52 55000
## 2 w debt_consolidation 12.62 120.65 62000
## 3 w home_improvement 19.03 726.80 29000
## 4 w credit_card 14.08 291.38 40000
## 5 w credit_card 25.82 389.34 31500
## 6 w debt_consolidation 16.02 140.67 44000
## 7 f debt_consolidation 22.91 773.26 80000
## 8 w debt_consolidation 18.06 1447.31 104000
## 9 w debt_consolidation 7.35 31.04 75000
## 10 w credit_card 25.82 355.03 42000
## 11 w credit_card 14.08 218.99 50000
## 12 w debt_consolidation 13.59 424.74 55000
## dti verification_status grade revol_bal inq_last_12m delinq_2yrs
## 1 15.99 Source Verified B 754 1 1
## 2 30.66 Source Verified C 2194 2 0
## 3 10.22 Not Verified D 5279 5 3
## 4 22.53 Source Verified C 6600 0 0
## 5 38.36 Verified E 10489 4 0
## 6 26.60 Not Verified C 5217 3 1
## 7 26.43 Not Verified E 14036 1 1
## 8 26.70 Not Verified D 27942 2 0
## 9 4.65 Source Verified A 3885 0 0
## 10 48.60 Verified E 5768 6 3
## 11 29.07 Not Verified C 14780 0 0
## 12 20.39 Not Verified C 11809 0 0
## home_ownership not_paid log_inc verified grdCtoA
## 1 RENT 0 10.91509 1 1
## 2 RENT 1 11.03489 1 0
## 3 MORTGAGE 1 10.27505 0 0
## 4 RENT 0 10.59663 1 0
## 5 OWN 0 10.35774 1 0
## 6 MORTGAGE 1 10.69194 0 0
## 7 RENT 1 11.28978 0 0
## 8 MORTGAGE 1 11.55215 0 0
## 9 RENT 0 11.22524 1 1
## 10 MORTGAGE 0 10.64542 1 0
## 11 MORTGAGE 1 10.81978 0 0
## 12 RENT 0 10.91509 0 0
anyNA(loan.data)
## [1] FALSE
Let's investigate the relationships and discover the rough structure of the data. The variable of interest is the not_paid
variable, a binary variable that indicates whether a loan is fully paid or not. A loan is considered "not paid" (not_paid = 1) when it is Defaulted, Charged Off, or past its due date (Grace Period).
names(loan.data)
## [1] "initial_list_status" "purpose" "int_rate"
## [4] "installment" "annual_inc" "dti"
## [7] "verification_status" "grade" "revol_bal"
## [10] "inq_last_12m" "delinq_2yrs" "home_ownership"
## [13] "not_paid" "log_inc" "verified"
## [16] "grdCtoA"
The explanation of the variables in the dataset is presented below.
- initial_list_status: either w (whole) or f (fractional). This variable indicates whether the loan was a whole loan or a fractional loan. For background: some institutional investors prefer to purchase loans in their entirety to obtain legal and accounting treatment specific to their situation, with the added benefit of "instant funding" to borrowers.
- purpose: simplified from the original data; one of credit_card, debt_consolidation, home_improvement, major_purchase and small_business.
- int_rate: interest rate in percentages.
- installment: monthly payment owed by the borrower.
- annual_inc: self-reported annual income provided by the borrower / co-borrowers during application.
- dti: ratio of the borrower's total monthly debt payments on his/her total obligations to the self-reported monthly income (debt-to-income ratio).
- verification_status: whether the reported income was verified, not verified, or whether the income source was verified.
- grade: software-assigned loan grade.
- revol_bal: total credit revolving balance (in the case of a credit card, the portion of credit card spending that goes unpaid at the end of a billing cycle).
- inq_last_12m: number of credit inquiries in the last 12 months.
- delinq_2yrs: number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.
- home_ownership: one of MORTGAGE, OWN and RENT.
- not_paid: 1 for charged-off, past-due / grace period or defaulted loans, 0 for fully-paid loans.
- log_inc: log of annual_inc.
- verified: 0 for "Not Verified" under verification_status, 1 otherwise.
- grdCtoA: 0 for a grade of A, B or C, 1 otherwise (this coding is checked, and corrected, later in the analysis).

We use the describe() function from the Hmisc package to see the overall data.
describe(loan.data)
## loan.data
##
## 16 Variables 1500 Observations
## ---------------------------------------------------------------------------
## initial_list_status
## n missing distinct
## 1500 0 2
##
## Value f w
## Frequency 308 1192
## Proportion 0.205 0.795
## ---------------------------------------------------------------------------
## purpose
## n missing distinct
## 1500 0 5
##
## Value credit_card debt_consolidation home_improvement
## Frequency 314 954 157
## Proportion 0.209 0.636 0.105
##
## Value major_purchase small_business
## Frequency 49 26
## Proportion 0.033 0.017
## ---------------------------------------------------------------------------
## int_rate
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 36 0.998 14.74 6.413 7.21 7.97
## .25 .50 .75 .90 .95
## 10.42 14.08 18.06 23.88 25.84
##
## lowest : 5.32 6.08 6.72 7.07 7.21, highest: 30.75 30.79 30.84 30.89 30.94
## ---------------------------------------------------------------------------
## installment
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 977 1 469.1 328.7 105.4 148.8
## .25 .50 .75 .90 .95
## 248.2 382.3 635.9 903.5 1071.0
##
## lowest : 31.04 36.68 38.67 39.24 39.57
## highest: 1415.78 1447.31 1501.96 1502.92 1503.89
## ---------------------------------------------------------------------------
## annual_inc
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 372 1 81554 53755 25180 32384
## .25 .50 .75 .90 .95
## 48000 67890 95555 140000 180000
##
## lowest : 2500 3000 10000 10776 12000
## highest: 494000 600000 750000 950000 1200000
## ---------------------------------------------------------------------------
## dti
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 1219 1 18.9 12.29 4.096 6.514
## .25 .50 .75 .90 .95
## 11.010 17.160 24.550 31.938 36.522
##
## lowest : 0.00 0.14 0.41 0.44 0.45, highest: 94.72 107.60 115.85 163.15 198.56
## ---------------------------------------------------------------------------
## verification_status
## n missing distinct
## 1500 0 3
##
## Value Not Verified Source Verified Verified
## Frequency 555 523 422
## Proportion 0.370 0.349 0.281
## ---------------------------------------------------------------------------
## grade
## n missing distinct
## 1500 0 7
##
## Value A B C D E F G
## Frequency 203 366 451 311 116 38 15
## Proportion 0.135 0.244 0.301 0.207 0.077 0.025 0.010
## ---------------------------------------------------------------------------
## revol_bal
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 1441 1 15207 15859 924 1900
## .25 .50 .75 .90 .95
## 4994 10139 18125 31401 44265
##
## lowest : 0 3 4 9 10, highest: 176233 178902 211088 228292 258897
## ---------------------------------------------------------------------------
## inq_last_12m
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 19 0.967 2.357 2.522 0.0 0.0
## .25 .50 .75 .90 .95
## 1.0 2.0 3.0 5.1 8.0
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 356 330 274 204 118 68 42 32 31 16
## Proportion 0.237 0.220 0.183 0.136 0.079 0.045 0.028 0.021 0.021 0.011
##
## Value 10 11 12 13 14 15 16 17 18
## Frequency 10 5 3 1 2 1 3 2 2
## Proportion 0.007 0.003 0.002 0.001 0.001 0.001 0.002 0.001 0.001
## ---------------------------------------------------------------------------
## delinq_2yrs
## n missing distinct Info Mean Gmd
## 1500 0 9 0.48 0.3127 0.5376
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 1204 199 55 26 9 1 2 3 1
## Proportion 0.803 0.133 0.037 0.017 0.006 0.001 0.001 0.002 0.001
## ---------------------------------------------------------------------------
## home_ownership
## n missing distinct
## 1500 0 3
##
## Value MORTGAGE OWN RENT
## Frequency 754 192 554
## Proportion 0.503 0.128 0.369
## ---------------------------------------------------------------------------
## not_paid
## n missing distinct Info Sum Mean Gmd
## 1500 0 2 0.75 751 0.5007 0.5003
##
## ---------------------------------------------------------------------------
## log_inc
## n missing distinct Info Mean Gmd .05 .10
## 1500 0 372 1 11.12 0.6494 10.13 10.39
## .25 .50 .75 .90 .95
## 10.78 11.13 11.47 11.85 12.10
##
## lowest : 7.824046 8.006368 9.210340 9.285077 9.392662
## highest: 13.110291 13.304685 13.527828 13.764217 13.997832
## ---------------------------------------------------------------------------
## verified
## n missing distinct Info Sum Mean Gmd
## 1500 0 2 0.699 945 0.63 0.4665
##
## ---------------------------------------------------------------------------
## grdCtoA
## n missing distinct Info Sum Mean Gmd
## 1500 0 2 0.706 569 0.3793 0.4712
##
## ---------------------------------------------------------------------------
Let's check the number of charged-off, past-due or defaulted loans (not_paid == 1) vs fully-paid loans (not_paid == 0).
table(loan.data$not_paid)
##
## 0 1
## 749 751
summary(loan.data$dti)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.01 17.16 18.90 24.55 198.56
dti summary when not_paid is 0 (fully-paid loans):
summary(loan.data[loan.data$not_paid == 0, "dti"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.05 16.78 18.41 23.80 198.56
dti summary when not_paid is 1 (charged-off, past-due / grace period or defaulted loans):
summary(loan.data[loan.data$not_paid == 1, "dti"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.92 17.52 19.40 25.05 115.85
summary(loan.data$revol_bal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4994 10139 15207 18125 258897
Like the annual income (annual_inc) variable, the revolving balance revol_bal has a very large variance, so we transform it to log(revol_bal). Because revol_bal can be 0 and log(0) is undefined, zero balances are mapped to 0.
loan.data$log_revol <- ifelse(
loan.data$revol_bal == 0,
0,
log(loan.data$revol_bal)
)
aggregate(loan.data$int_rate ~ loan.data$grade, FUN = mean)
## loan.data$grade loan.data$int_rate
## 1 A 7.146847
## 2 B 10.484945
## 3 C 14.297494
## 4 D 19.066495
## 5 E 24.698534
## 6 F 29.516316
## 7 G 30.836667
From the dataset, we can see that the further the grade moves from A towards G, the higher the interest rate (int_rate). With a higher interest rate, we can assume that the probability of a defaulted loan is also higher.
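As a quick sanity check of that assumption, one option (a sketch; output not shown here) is to look at the observed default rate per grade while the grade column is still in the data:
# quick check (sketch): observed default rate per grade
# run this before the grade column is dropped further below
aggregate(not_paid ~ grade, data = loan.data, FUN = mean)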
After checking the grdCtoA column, it turns out that grades A and B are coded as 1 and everything else as 0, which does not match the description above. The column is therefore dropped and replaced by a new column, gradeDtoG, coded 0 for a grade of A, B or C and 1 otherwise.
loan.data$grdCtoA <- NULL
loan.data$gradeDtoG <- as.integer(
case_when(
loan.data$grade == "A" ~ 0,
loan.data$grade == "B" ~ 0,
loan.data$grade == "C" ~ 0,
loan.data$grade == "D" ~ 1,
loan.data$grade == "E" ~ 1,
loan.data$grade == "F" ~ 1,
loan.data$grade == "G" ~ 1)
)
As "annual_inc", "revol_bal", "verification_status", and "grade" are now represented by "log_inc", "log_revol", "verified", and "gradeDtoG", the four original variables are removed from the analysis.
# exclude "annual_inc", "revol_bal", "verification_status", "grade"
loan.data$annual_inc <- loan.data$revol_bal <- loan.data$verification_status <- loan.data$grade <- NULL
We check that loan.data now has 13 variables instead of the original 16.
str(loan.data)
## 'data.frame': 1500 obs. of 13 variables:
## $ initial_list_status: Factor w/ 2 levels "f","w": 2 2 2 2 2 2 1 2 2 2 ...
## $ purpose : Factor w/ 5 levels "credit_card",..: 2 2 3 1 1 2 2 2 2 1 ...
## $ int_rate : num 12 12.6 19 14.1 25.8 ...
## $ installment : num 399 121 727 291 389 ...
## $ dti : num 16 30.7 10.2 22.5 38.4 ...
## $ inq_last_12m : int 1 2 5 0 4 3 1 2 0 6 ...
## $ delinq_2yrs : int 1 0 3 0 0 1 1 0 0 3 ...
## $ home_ownership : Factor w/ 3 levels "MORTGAGE","OWN",..: 3 3 1 3 2 1 3 1 3 1 ...
## $ not_paid : int 0 1 1 0 0 1 1 1 0 0 ...
## $ log_inc : num 10.9 11 10.3 10.6 10.4 ...
## $ verified : int 1 1 0 1 1 0 0 0 1 1 ...
## $ log_revol : num 6.63 7.69 8.57 8.79 9.26 ...
## $ gradeDtoG : int 0 0 1 0 1 0 1 1 0 1 ...
anyNA(loan.data)
## [1] FALSE
summary(loan.data)
## initial_list_status purpose int_rate
## f: 308 credit_card :314 Min. : 5.32
## w:1192 debt_consolidation:954 1st Qu.:10.42
## home_improvement :157 Median :14.08
## major_purchase : 49 Mean :14.74
## small_business : 26 3rd Qu.:18.06
## Max. :30.94
## installment dti inq_last_12m delinq_2yrs
## Min. : 31.04 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.: 248.15 1st Qu.: 11.01 1st Qu.: 1.000 1st Qu.:0.0000
## Median : 382.30 Median : 17.16 Median : 2.000 Median :0.0000
## Mean : 469.15 Mean : 18.90 Mean : 2.357 Mean :0.3127
## 3rd Qu.: 635.86 3rd Qu.: 24.55 3rd Qu.: 3.000 3rd Qu.:0.0000
## Max. :1503.89 Max. :198.56 Max. :18.000 Max. :8.0000
## home_ownership not_paid log_inc verified
## MORTGAGE:754 Min. :0.0000 Min. : 7.824 Min. :0.00
## OWN :192 1st Qu.:0.0000 1st Qu.:10.779 1st Qu.:0.00
## RENT :554 Median :1.0000 Median :11.126 Median :1.00
## Mean :0.5007 Mean :11.123 Mean :0.63
## 3rd Qu.:1.0000 3rd Qu.:11.467 3rd Qu.:1.00
## Max. :1.0000 Max. :13.998 Max. :1.00
## log_revol gradeDtoG
## Min. : 0.000 Min. :0.00
## 1st Qu.: 8.516 1st Qu.:0.00
## Median : 9.224 Median :0.00
## Mean : 8.974 Mean :0.32
## 3rd Qu.: 9.805 3rd Qu.:1.00
## Max. :12.464 Max. :1.00
In this part, we split our dataset into train and test sets and build our machine learning model using data only from the train set, so that we can obtain an unbiased measurement of the model's accuracy by predicting on the test set.
Obtaining an unbiased estimate of the model's out-of-sample performance matters because the in-sample error is usually optimistic: the model is tuned / adapted to minimize the error on the training sample. The in-sample error is therefore not a good indication of how the model will perform on unseen data.
We'll split our data into the loan.data.train and loan.data.test sets (85:15).
set.seed(2018)
intrain <- sample(nrow(loan.data), nrow(loan.data)*0.85)
loan.data.train <- loan.data[intrain, ]
loan.data.test <- loan.data[-intrain, ]
Verify that the split is well shuffled by checking the class proportions in each set.
prop.table(table(loan.data.train$not_paid))
##
## 0 1
## 0.4878431 0.5121569
prop.table(table(loan.data.test$not_paid))
##
## 0 1
## 0.5644444 0.4355556
As both sets are split roughly 50-50 between the two classes, we can move on to the next step.
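If we wanted the not_paid proportion preserved exactly in both sets, a stratified split is one option (a sketch; the rest of this analysis keeps the random split above):
# stratified split (sketch): sample 85% of the rows within each not_paid class
set.seed(2018)
idx0 <- which(loan.data$not_paid == 0)
idx1 <- which(loan.data$not_paid == 1)
intrain.strat <- c(sample(idx0, floor(length(idx0) * 0.85)),
                   sample(idx1, floor(length(idx1) * 0.85)))
# loan.data[intrain.strat, ] and loan.data[-intrain.strat, ] would then
# have (nearly) identical class proportions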
We use stepwise regression, in which the choice of predictor variables is carried out by an automatic procedure.
glm.all <- glm(not_paid ~., family = "binomial", loan.data.train)
# we only use it as guidance when building our model
step(glm.all, direction = "both")
## Start: AIC=1725.61
## not_paid ~ initial_list_status + purpose + int_rate + installment +
## dti + inq_last_12m + delinq_2yrs + home_ownership + log_inc +
## verified + log_revol + gradeDtoG
##
## Df Deviance AIC
## - int_rate 1 1691.7 1723.7
## - log_revol 1 1691.7 1723.7
## - dti 1 1691.7 1723.7
## - initial_list_status 1 1691.9 1723.9
## - verified 1 1692.4 1724.4
## - home_ownership 2 1695.3 1725.3
## <none> 1691.6 1725.6
## - purpose 4 1699.7 1725.7
## - inq_last_12m 1 1693.9 1725.9
## - gradeDtoG 1 1694.6 1726.6
## - log_inc 1 1696.6 1728.6
## - delinq_2yrs 1 1698.9 1730.9
## - installment 1 1712.4 1744.4
##
## Step: AIC=1723.7
## not_paid ~ initial_list_status + purpose + installment + dti +
## inq_last_12m + delinq_2yrs + home_ownership + log_inc + verified +
## log_revol + gradeDtoG
##
## Df Deviance AIC
## - log_revol 1 1691.8 1721.8
## - dti 1 1691.8 1721.8
## - initial_list_status 1 1692.0 1722.0
## - verified 1 1692.6 1722.6
## - home_ownership 2 1695.4 1723.4
## <none> 1691.7 1723.7
## - purpose 4 1699.8 1723.8
## - inq_last_12m 1 1693.9 1723.9
## + int_rate 1 1691.6 1725.6
## - log_inc 1 1696.8 1726.8
## - delinq_2yrs 1 1699.2 1729.2
## - gradeDtoG 1 1702.5 1732.5
## - installment 1 1713.3 1743.3
##
## Step: AIC=1721.81
## not_paid ~ initial_list_status + purpose + installment + dti +
## inq_last_12m + delinq_2yrs + home_ownership + log_inc + verified +
## gradeDtoG
##
## Df Deviance AIC
## - dti 1 1691.9 1719.9
## - initial_list_status 1 1692.2 1720.2
## - verified 1 1692.8 1720.8
## - home_ownership 2 1695.5 1721.5
## <none> 1691.8 1721.8
## - inq_last_12m 1 1694.0 1722.0
## - purpose 4 1700.5 1722.5
## + log_revol 1 1691.7 1723.7
## + int_rate 1 1691.7 1723.7
## - log_inc 1 1698.2 1726.2
## - delinq_2yrs 1 1699.3 1727.3
## - gradeDtoG 1 1702.6 1730.6
## - installment 1 1713.3 1741.3
##
## Step: AIC=1719.9
## not_paid ~ initial_list_status + purpose + installment + inq_last_12m +
## delinq_2yrs + home_ownership + log_inc + verified + gradeDtoG
##
## Df Deviance AIC
## - initial_list_status 1 1692.2 1718.2
## - verified 1 1692.8 1718.8
## - home_ownership 2 1695.5 1719.5
## <none> 1691.9 1719.9
## - inq_last_12m 1 1694.0 1720.0
## - purpose 4 1700.5 1720.5
## + int_rate 1 1691.8 1721.8
## + dti 1 1691.8 1721.8
## + log_revol 1 1691.8 1721.8
## - delinq_2yrs 1 1699.3 1725.3
## - log_inc 1 1699.9 1725.9
## - gradeDtoG 1 1703.0 1729.0
## - installment 1 1714.7 1740.7
##
## Step: AIC=1718.24
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs +
## home_ownership + log_inc + verified + gradeDtoG
##
## Df Deviance AIC
## - verified 1 1693.2 1717.2
## - home_ownership 2 1696.0 1718.0
## - inq_last_12m 1 1694.2 1718.2
## <none> 1692.2 1718.2
## - purpose 4 1700.7 1718.7
## + initial_list_status 1 1691.9 1719.9
## + int_rate 1 1692.1 1720.1
## + dti 1 1692.2 1720.2
## + log_revol 1 1692.2 1720.2
## - delinq_2yrs 1 1699.6 1723.6
## - log_inc 1 1700.4 1724.4
## - gradeDtoG 1 1703.4 1727.4
## - installment 1 1715.2 1739.2
##
## Step: AIC=1717.17
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs +
## home_ownership + log_inc + gradeDtoG
##
## Df Deviance AIC
## - home_ownership 2 1696.9 1716.9
## - inq_last_12m 1 1695.0 1717.0
## <none> 1693.2 1717.2
## - purpose 4 1701.9 1717.9
## + verified 1 1692.2 1718.2
## + initial_list_status 1 1692.8 1718.8
## + int_rate 1 1693.0 1719.0
## + log_revol 1 1693.1 1719.1
## + dti 1 1693.1 1719.1
## - delinq_2yrs 1 1700.6 1722.6
## - log_inc 1 1701.7 1723.7
## - gradeDtoG 1 1704.9 1726.9
## - installment 1 1718.8 1740.8
##
## Step: AIC=1716.86
## not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs +
## log_inc + gradeDtoG
##
## Df Deviance AIC
## <none> 1696.9 1716.9
## - inq_last_12m 1 1699.0 1717.0
## + home_ownership 2 1693.2 1717.2
## + verified 1 1696.0 1718.0
## - purpose 4 1706.2 1718.2
## + initial_list_status 1 1696.4 1718.4
## + int_rate 1 1696.6 1718.6
## + log_revol 1 1696.8 1718.8
## + dti 1 1696.8 1718.8
## - delinq_2yrs 1 1704.1 1722.1
## - log_inc 1 1706.7 1724.7
## - gradeDtoG 1 1709.0 1727.0
## - installment 1 1722.7 1740.7
##
## Call: glm(formula = not_paid ~ purpose + installment + inq_last_12m +
## delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
##
## Coefficients:
## (Intercept) purposedebt_consolidation
## 3.048714 0.091903
## purposehome_improvement purposemajor_purchase
## 0.239558 0.844807
## purposesmall_business installment
## 0.876574 0.001081
## inq_last_12m delinq_2yrs
## -0.033791 0.197173
## log_inc gradeDtoG
## -0.337676 0.444379
##
## Degrees of Freedom: 1274 Total (i.e. Null); 1265 Residual
## Null Deviance: 1767
## Residual Deviance: 1697 AIC: 1717
From stepwise regression, we get the optimized regression formula for our model:
formula = not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + log_inc + gradeDtoG
Let's run this formula through the glm() function and inspect the p-values and the AIC of the resulting model.
glm.loan <- glm(formula = not_paid ~ purpose + installment + inq_last_12m + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
summary(glm.loan)
##
## Call:
## glm(formula = not_paid ~ purpose + installment + inq_last_12m +
## delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9241 -1.1167 0.6801 1.1417 1.6730
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.048714 1.181895 2.580 0.009894 **
## purposedebt_consolidation 0.091903 0.145733 0.631 0.528285
## purposehome_improvement 0.239558 0.223945 1.070 0.284747
## purposemajor_purchase 0.844807 0.356926 2.367 0.017938 *
## purposesmall_business 0.876574 0.480180 1.826 0.067924 .
## installment 0.001081 0.000217 4.983 0.000000628 ***
## inq_last_12m -0.033791 0.023068 -1.465 0.142972
## delinq_2yrs 0.197173 0.075917 2.597 0.009399 **
## log_inc -0.337676 0.108858 -3.102 0.001922 **
## gradeDtoG 0.444379 0.127694 3.480 0.000501 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1766.8 on 1274 degrees of freedom
## Residual deviance: 1696.9 on 1265 degrees of freedom
## AIC: 1716.9
##
## Number of Fisher Scoring iterations: 4
From the summary, some variables do not contribute significantly to the model, so we will omit them: purpose and inq_last_12m.
glm.loan1 <- glm(formula = not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
summary(glm.loan1)
##
## Call:
## glm(formula = not_paid ~ installment + delinq_2yrs + log_inc +
## gradeDtoG, family = "binomial", data = loan.data.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9302 -1.1216 0.7186 1.1442 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1734626 1.1652508 2.723 0.006461 **
## installment 0.0011120 0.0002154 5.163 0.000000243 ***
## delinq_2yrs 0.1971310 0.0756659 2.605 0.009180 **
## log_inc -0.3455030 0.1073870 -3.217 0.001294 **
## gradeDtoG 0.4241966 0.1252241 3.387 0.000705 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1766.8 on 1274 degrees of freedom
## Residual deviance: 1707.7 on 1270 degrees of freedom
## AIC: 1717.7
##
## Number of Fisher Scoring iterations: 4
The difference in AIC between glm.loan and glm.loan1 is only about 0.8, so it is acceptable to omit purpose and inq_last_12m from our model.
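A more formal way to compare the two nested models is a likelihood-ratio test (a sketch; output not shown):
# likelihood-ratio test between the reduced model and the full stepwise model
anova(glm.loan1, glm.loan, test = "Chisq")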
The formula used is: not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, and we name the resulting model CreditRisk.
CreditRisk <- glm(formula = not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG, family = "binomial", data = loan.data.train)
CreditRisk$aic
## [1] 1717.693
CreditRisk$coefficients
## (Intercept) installment delinq_2yrs log_inc gradeDtoG
## 3.173462603 0.001111953 0.197131024 -0.345503016 0.424196622
As we can see above, log_inc has a negative coefficient. In logistic regression the model's linear predictor is the log of the odds, log(p / (1 - p)), where p is the probability of the event. Whenever the log-odds is positive, the probability is above 50%; whenever it is negative, the probability is below 50%. A negative coefficient therefore means that increasing that predictor lowers the log-odds, and with it the predicted probability.
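To make the link between log-odds and probability concrete, the inverse-logit function plogis() maps log-odds back to probabilities (an illustration, not part of the original analysis):
# plogis(x) = 1 / (1 + exp(-x)) converts a log-odds value into a probability
plogis(0)   # log-odds of 0     -> probability 0.5
plogis(1)   # positive log-odds -> probability above 0.5 (about 0.73)
plogis(-1)  # negative log-odds -> probability below 0.5 (about 0.27)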
We refer back to the sigmoid relationship (plotted here as log-odds against probability) and mark the log_inc coefficient on it as an example.
plot(function(x)(log(x/(1-x))),
xlab = "Probability",
ylab = "Log of Odds",
main = "Sigmoid Curve")
abline(h = CreditRisk$coefficients["log_inc"], col = "red")
Since log_inc has a negative coefficient, the higher the income, the lower the probability that the loan defaults.
For gradeDtoG, which has a positive coefficient, loans graded A, B or C (gradeDtoG = 0) have a higher probability of being paid than loans graded D to G (gradeDtoG = 1).
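As an illustration of both effects, we can score two hypothetical borrowers that differ only in income, and read the exponentiated coefficients as odds ratios (a sketch with assumed example values; output not shown):
# two hypothetical borrowers, identical except for log_inc (assumed values)
example.borrowers <- data.frame(
  installment = 400,
  delinq_2yrs = 0,
  log_inc     = c(10.5, 11.5),  # lower vs higher income on the log scale
  gradeDtoG   = 0
)
predict(CreditRisk, example.borrowers, type = "response")
# the higher-income borrower should get the lower predicted probability
# exponentiating the coefficients gives the multiplicative change in odds
# per one-unit increase of each predictor
exp(CreditRisk$coefficients)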
Before applying the model to the test set, we first measure its performance on the train set as a baseline (only the test-set results give an unbiased measurement of accuracy).
loan.data.train$pred.train <- predict(CreditRisk,
loan.data.train,
type = "response")
head(loan.data.train$pred.train)
## [1] 0.5717765 0.5344906 0.4668034 0.5071598 0.6205931 0.4210173
Next, we convert the predicted probabilities in pred.train to class labels (1 or 0): if the predicted probability of not_paid is 0.5 or above, we classify the loan as a default.
Confusion Table of the Predicted Data with the Train Set of lbb_loans.csv
(conft.train <- table("predicted" = as.numeric(loan.data.train$pred.train >= 0.5),
"actual" = loan.data.train$not_paid))
## actual
## predicted 0 1
## 0 378 279
## 1 244 374
Accu.Train <- round((conft.train[1]+conft.train[4])/sum(conft.train[1:4]),4)
Prec.Train <- round(conft.train[4]/(conft.train[2]+conft.train[4]), 4)
Reca.Train <- round(conft.train[4]/(conft.train[3]+conft.train[4]), 4)
paste("Accuracy:", Accu.Train*100,"%")
## [1] "Accuracy: 58.98 %"
paste("Precision:", Prec.Train*100,"%")
## [1] "Precision: 60.52 %"
paste("Recall:", Reca.Train*100,"%")
## [1] "Recall: 57.27 %"
For credit risk analysis, we would rather sacrifice some precision (or specificity) in favor of higher recall (sensitivity). In simpler words, we want to be more sensitive to loan defaults. One thing we can do is set the threshold to be more sensitive to positive cases, by predicting a default whenever the probability exceeds 0.4 instead of the 0.5 used by our previous classifier.
Confusion Table of the Predicted Data with the Train Data of lbb_loans.csv
(conft.train2 <- table("predicted" = as.numeric(loan.data.train$pred.train >= 0.4),
"actual" = loan.data.train$not_paid))
## actual
## predicted 0 1
## 0 129 57
## 1 493 596
Accuracy, Precision, and Recall
Accu.Train2 <- round((conft.train2[1]+conft.train2[4])/sum(conft.train2[1:4]),4)
Prec.Train2 <- round(conft.train2[4]/(conft.train2[2]+conft.train2[4]), 4)
Reca.Train2 <- round(conft.train2[4]/(conft.train2[3]+conft.train2[4]), 4)
paste("Accuracy:", Accu.Train2*100,"%")
## [1] "Accuracy: 56.86 %"
paste("Precision:", Prec.Train2*100,"%")
## [1] "Precision: 54.73 %"
paste("Recall:", Reca.Train2*100,"%")
## [1] "Recall: 91.27 %"
With this change we improve the model's Recall from 57.27% to 91.27%, but Precision goes down from 60.52% to 54.73% and Accuracy drops slightly from 58.98% to 56.86%.
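Rather than picking 0.4 by eye, one strategy is to sweep over several cutoffs on the train set and inspect the recall/precision trade-off directly (a sketch; output not shown):
# recall and precision on the train set for a range of candidate thresholds
thresholds <- seq(0.3, 0.7, by = 0.05)
t(sapply(thresholds, function(th) {
  pred <- as.numeric(loan.data.train$pred.train >= th)
  tp <- sum(pred == 1 & loan.data.train$not_paid == 1)
  fp <- sum(pred == 1 & loan.data.train$not_paid == 0)
  fn <- sum(pred == 0 & loan.data.train$not_paid == 1)
  c(threshold = th, recall = tp / (tp + fn), precision = tp / (tp + fp))
}))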
Now we're going to apply our CreditRisk model (not_paid ~ installment + delinq_2yrs + log_inc + gradeDtoG) to the test set, classifying a loan as a default when the predicted probability of not_paid exceeds 0.4 (see conft.train2).
Use predict() for our test set
loan.data.test$pred.test <- predict(CreditRisk,
loan.data.test,
type = "response")
head(loan.data.test$pred.test)
## [1] 0.5106317 0.6800422 0.4686805 0.4447727 0.4670663 0.3176890
Confusion Table of the Predicted Data with the Test Data of lbb_loans.csv
(conft.test <- table("predicted" = as.numeric(loan.data.test$pred.test >= 0.4),
"actual" = loan.data.test$not_paid))
## actual
## predicted 0 1
## 0 32 7
## 1 95 91
Accuracy, Precision, and Recall
Accu.Test <- round((conft.test[1]+conft.test[4])/sum(conft.test[1:4]),4)
Prec.Test <- round(conft.test[4]/(conft.test[2]+conft.test[4]), 4)
Reca.Test <- round(conft.test[4]/(conft.test[3]+conft.test[4]), 4)
By applying the improved model to the test set, we get:
paste("Accuracy:", Accu.Test*100,"%")
## [1] "Accuracy: 54.67 %"
paste("Precision:", Prec.Test*100,"%")
## [1] "Precision: 48.92 %"
paste("Recall:", Reca.Test*100,"%")
## [1] "Recall: 92.86 %"
The Recall on the test set is 92.86%, slightly better than the Recall on the train set (91.27%). Since the value is consistent with the train set, we can say that the model is acceptable for predicting unseen (future) data.