I am going to demonstrate the practical application of the logistic regression model to predict whether an individual will default on a loan or pay it back. We will use real data from Lending Club, imported into RStudio. The data consists of 9,578 observations of 14 variables.
loan <- read.csv("loan_data.csv")
str(loan)
## 'data.frame': 9578 obs. of 14 variables:
## $ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
## $ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
## $ installment : num 829 228 367 162 103 ...
## $ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
## $ dti : num 19.5 14.3 11.6 8.1 15 ...
## $ fico : int 737 707 682 712 667 727 667 722 682 707 ...
## $ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
## $ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
## $ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
## $ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
## $ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
## $ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
## $ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
Here are the variable definitions:
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments ($) owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
The dependent variable we use from the data is “not.fully.paid”, coded 1 if the corresponding individual defaulted on their loan and 0 if they paid it back. This is a binary classification problem, where the regressand takes only the values 0 or 1. Given regressors \(X_i\), we want to predict the probability that a person defaults, \(P(Y = 1 \mid X_i)\), or pays, \(P(Y = 0 \mid X_i)\). If the probability of default is given by \[ P(Y = 1 \mid X_i) = P_i \] then the probability of paying is \[ P(Y = 0 \mid X_i) = 1 - P_i \] and Y follows the Bernoulli distribution.
Logistic regression is based on the logistic distribution function. Given a linear predictor \(\beta_1 + \beta_2 X_i\), the logistic function is \[ P_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 X_i)}} \] or equivalently \[ P_i = \frac{\exp(\beta_1 + \beta_2 X_i)}{1 + \exp(\beta_1 + \beta_2 X_i)} \] This is the probability that a person defaults. Conversely, the probability of the loan being paid (Y = 0) is \[ 1 - P_i = \frac{1}{1 + \exp(\beta_1 + \beta_2 X_i)} \] Dividing the two, \[ \frac{P_i}{1 - P_i} = \frac{\exp(\beta_1 + \beta_2 X_i)\,/\,(1 + \exp(\beta_1 + \beta_2 X_i))}{1\,/\,(1 + \exp(\beta_1 + \beta_2 X_i))} = \exp(\beta_1 + \beta_2 X_i) \] This is exactly the odds in favor of a person defaulting on their loan; e.g. \(P_i = 0.8\) means the odds are 4 to 1 in favor of default. Taking the natural log of the equation above, we have \[ \ln\!\left(\frac{P_i}{1 - P_i}\right) = \beta_1 + \beta_2 X_i \] This is the framework of the logistic regression model. The coefficient parameters are estimated by maximum likelihood.
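To make the link concrete, here is a minimal numeric sketch with made-up coefficients (β1 = -1, β2 = 2 are illustrative only), showing that base R's plogis() computes exactly this logistic function and that the logit recovers the linear predictor:
# Hypothetical coefficients, for illustration only
b1 <- -1; b2 <- 2
x <- 0.5                               # an arbitrary regressor value
p <- 1 / (1 + exp(-(b1 + b2 * x)))     # logistic function by hand
p                                      # 0.5, since b1 + b2*x = 0
all.equal(p, plogis(b1 + b2 * x))      # TRUE: plogis() is the logistic CDF
log(p / (1 - p))                       # the logit gives back b1 + b2*x = 0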
Here is a visual representation of default vs. paid loans, as shown in the graph below. Defaults make up roughly 16% of the sample, about one default for every five paid loans.
library(ggplot2)
ggplot(loan, aes(x = factor(not.fully.paid))) + geom_bar(aes(fill = factor(not.fully.paid)))
First, we should prepare the data so that the class of each column matches the type of its values. Variables whose values are qualitative, whether binary or a discrete code like 0, 1, 2, 3 (commonly referred to as dummy variables), are converted to factors. Since log.annual.inc was log-transformed in the raw data, we take its antilog to bring it back to actual dollar values, which keeps the coefficient interpretation simple later on. Fortunately, there are no missing values in our data, so no further manipulation is needed before modeling.
# Convert categorical variables to factors #
loan$credit.policy <- as.factor(loan$credit.policy)
loan$purpose <- as.factor(loan$purpose)
loan$not.fully.paid <- as.factor(loan$not.fully.paid)
# Ensure the count variables are stored as integers #
loan$inq.last.6mths <- as.integer(loan$inq.last.6mths)
loan$delinq.2yrs <- as.integer(loan$delinq.2yrs)
loan$pub.rec <- as.integer(loan$pub.rec)
# Antilog of log.annual.inc to recover income in dollars #
loan$log.annual.inc <- exp(loan$log.annual.inc)
# checking if there's any missing values
any(is.na(loan))
## [1] FALSE
The distribution curves for installment, fico, and interest rate are roughly bell-shaped. Inspecting the distributions before building any model is important for spotting skew and outliers; in many cases, standardization (stabilizing the mean and variance) helps a model perform better. Annual income, however, is right-skewed; I leave it unstandardized here to keep things simple and the coefficients easy to interpret later.
# Histograms with overlaid normal curves for comparison #
hist_with_normal <- function(x, xlab, main) {
  h <- hist(x, breaks = 10, col = "red", xlab = xlab, main = main)
  xfit <- seq(min(x), max(x), length = 40)
  # Scale the density to the histogram's count axis
  yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * diff(h$mids[1:2]) * length(x)
  lines(xfit, yfit, col = "blue", lwd = 2)
}
par(mfrow = c(2, 2))
hist_with_normal(loan$installment, "Installment", "Installment distribution curve")
hist_with_normal(loan$fico, "FICO score", "FICO score distribution curve")
hist_with_normal(loan$log.annual.inc, "Annual income", "Annual income distribution curve")
hist_with_normal(loan$int.rate, "Interest rate", "Interest rate distribution curve")
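For reference, if we did want to standardize the continuous predictors as mentioned above, base R's scale() would do it. This is only a sketch; the standardized copy loan_std is hypothetical and not used in what follows:
# Standardize continuous predictors to zero mean, unit variance (not applied below)
num_cols <- c("int.rate", "installment", "dti", "fico",
              "days.with.cr.line", "revol.bal", "revol.util")
loan_std <- loan                                  # hypothetical standardized copy
loan_std[num_cols] <- scale(loan_std[num_cols])   # center and rescale each column
summary(loan_std$fico)                            # mean is now ~0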
Before building the model, we must split the data into training and testing sets. The training set is used to fit the model, and the testing set is used to assess how accurately the model predicts. This is crucial for any machine learning algorithm, including the logistic regression we use here. First we feed the model a subset of the data so it can learn the coefficients necessary to predict the regressand; then we feed the trained model the test dataset, which it has not seen before, to evaluate how well it performs. Here I split the data into 80% for training and 20% for testing.
library(caTools)
set.seed(12345) # make the split reproducible
sample <- sample.split(loan$not.fully.paid, SplitRatio = 0.8)
train_data <- subset(loan, sample == TRUE)
test_data <- subset(loan, sample == FALSE)
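sample.split stratifies on the outcome variable, so both sets should keep roughly the same ~16% default rate; a quick check:
# Both sets should show roughly the same class proportions
prop.table(table(train_data$not.fully.paid))
prop.table(table(test_data$not.fully.paid))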
Building a logistic regression in R is quite simple: we use the glm() function ("generalized linear model") with family = "binomial". Here we include all the independent variables in the model.
#Logistic regression
log.reg <- glm(not.fully.paid ~ ., data = train_data, family = "binomial")
summary(log.reg)
##
## Call:
## glm(formula = not.fully.paid ~ ., family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6476 -0.6207 -0.4979 -0.3623 2.6978
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.082e+00 1.280e+00 3.972 7.13e-05 ***
## credit.policy1 -3.692e-01 9.363e-02 -3.943 8.04e-05 ***
## purposecredit_card -5.252e-01 1.234e-01 -4.255 2.09e-05 ***
## purposedebt_consolidation -3.004e-01 8.742e-02 -3.436 0.000590 ***
## purposeeducational 1.991e-01 1.694e-01 1.176 0.239720
## purposehome_improvement 1.883e-01 1.408e-01 1.338 0.180950
## purposemajor_purchase -3.750e-01 1.863e-01 -2.013 0.044068 *
## purposesmall_business 6.391e-01 1.292e-01 4.945 7.63e-07 ***
## int.rate 9.415e-01 1.934e+00 0.487 0.626359
## installment 1.053e-03 1.895e-04 5.559 2.71e-08 ***
## log.annual.inc -3.224e-06 8.921e-07 -3.614 0.000302 ***
## dti 3.022e-03 5.190e-03 0.582 0.560370
## fico -9.835e-03 1.610e-03 -6.108 1.01e-09 ***
## days.with.cr.line 3.612e-07 1.480e-05 0.024 0.980530
## revol.bal 2.280e-06 1.048e-06 2.176 0.029564 *
## revol.util 2.366e-03 1.433e-03 1.651 0.098671 .
## inq.last.6mths 7.792e-02 1.574e-02 4.950 7.43e-07 ***
## delinq.2yrs -8.283e-02 6.207e-02 -1.334 0.182062
## pub.rec 2.704e-01 1.053e-01 2.568 0.010234 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6737.8 on 7661 degrees of freedom
## Residual deviance: 6279.8 on 7643 degrees of freedom
## AIC: 6317.8
##
## Number of Fisher Scoring iterations: 5
From the summary of our model log.reg we can see, judging by the significance codes (asterisks), that some variables are insignificant. We can therefore drop these insignificant variables and keep only the significant ones. We modify our model as
#TUNED GLM
log.reg.rev <- glm(not.fully.paid ~ purpose + installment + log.annual.inc +
fico + revol.bal + inq.last.6mths + pub.rec, data = train_data, family = "binomial")
summary(log.reg.rev)
##
## Call:
## glm(formula = not.fully.paid ~ purpose + installment + log.annual.inc +
## fico + revol.bal + inq.last.6mths + pub.rec, family = "binomial",
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0930 -0.6271 -0.5056 -0.3628 2.7274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.734e+00 7.062e-01 9.536 < 2e-16 ***
## purposecredit_card -5.016e-01 1.219e-01 -4.116 3.86e-05 ***
## purposedebt_consolidation -2.803e-01 8.575e-02 -3.269 0.001081 **
## purposeeducational 1.929e-01 1.688e-01 1.143 0.253018
## purposehome_improvement 1.783e-01 1.402e-01 1.272 0.203363
## purposemajor_purchase -4.031e-01 1.862e-01 -2.164 0.030455 *
## purposesmall_business 6.416e-01 1.255e-01 5.114 3.16e-07 ***
## installment 1.071e-03 1.720e-04 6.223 4.87e-10 ***
## log.annual.inc -3.591e-06 8.798e-07 -4.082 4.47e-05 ***
## fico -1.231e-02 1.000e-03 -12.311 < 2e-16 ***
## revol.bal 3.855e-06 1.051e-06 3.668 0.000244 ***
## inq.last.6mths 1.110e-01 1.322e-02 8.399 < 2e-16 ***
## pub.rec 2.704e-01 1.042e-01 2.594 0.009494 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6737.8 on 7661 degrees of freedom
## Residual deviance: 6301.2 on 7649 degrees of freedom
## AIC: 6327.2
##
## Number of Fisher Scoring iterations: 5
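As an aside, instead of pruning variables by hand, one could let base R's step() perform stepwise selection by AIC; a sketch of that alternative (not the approach used in this post):
# Stepwise AIC selection starting from the full model
log.reg.step <- step(log.reg, trace = 0)  # hypothetical alternative model
formula(log.reg.step)                     # the formula step() settles on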
Interpreting a logistic regression differs from other regression methods such as OLS, where the coefficients' effects on the regressand can be read directly. For example, the fico coefficient of -0.0123 means that a one-point increase in fico decreases the log odds in favor of default by 0.0123 units. However, log odds are hard to digest. Another way is to take the antilog of the logit. Remember that earlier we defined \[ \frac{P_i}{1 - P_i} = \exp(\beta_1 + \beta_2 X_i) \] so the income term contributes a factor of \(e^{-0.000003591 \times 50000} = 0.8356\) for a borrower earning $50,000 per year: holding the other regressors fixed, their odds of default are about 0.84 times (roughly 16% lower than) those of an otherwise identical borrower with zero income. Below I simulate how this income factor shrinks the odds of default as income increases. Rather than interpreting a single \(X_i\), it can be better to simulate a range of values and visualize the effect on the odds.
library(dplyr)
# Income contribution to the odds of default: exp(coefficient * income)
func3 <- function(x){
  exp(-0.000003591 * x)
}
u <- c(20000, 30000, 40000, 50000, 70000, 100000, 150000, 200000,
       250000, 300000, 350000, 400000, 450000, 500000, 550000, 600000,
       800000, 1000000, 1200000)
K <- data.frame(u) %>%
  mutate(odds = func3(u))
K
## u odds
## 1 20000 0.93069841
## 2 30000 0.89786999
## 3 40000 0.86619952
## 4 50000 0.83564617
## 5 70000 0.77773456
## 6 100000 0.69830452
## 7 150000 0.58353549
## 8 200000 0.48762920
## 9 250000 0.40748547
## 10 300000 0.34051367
## 11 350000 0.28454895
## 12 400000 0.23778224
## 13 450000 0.19870181
## 14 500000 0.16604441
## 15 550000 0.13875437
## 16 600000 0.11594956
## 17 800000 0.05654039
## 18 1000000 0.02757075
## 19 1200000 0.01344430
r <- ggplot(K, aes(x = u, y = odds)) + geom_line() +
  labs(x = "Annual income ($)", y = "Odds factor for default")
r
The graph above visualizes the effect of increasing income on the odds of default: as income increases, the odds of default decrease exponentially. This is an intuitive result, as we expect people with higher incomes to be more likely to pay off their loans, and vice versa.
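The same antilog trick works for every coefficient at once: exponentiating the fitted coefficients turns them into multiplicative effects on the odds. A quick sketch:
# Each value is the factor by which the odds of default change
# per one-unit increase in the corresponding regressor
exp(coef(log.reg.rev))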
Now we can test the model on our test dataset. We do not fit another model on the test data; instead we use our trained model to make predictions on it. The prediction syntax is simple in R: the predict() function generates predictions from our model.
# MAKING PREDICTIONS: drop the response (column 14) and predict probabilities
predict_loan <- predict(object = log.reg.rev,
newdata = test_data[-14], type = "response")
head(predict_loan, n = 30)
## 11 15 16 17 18 19 29
## 0.25525891 0.09156507 0.11983393 0.17076668 0.03477595 0.14392812 0.03525650
## 30 31 34 37 39 40 45
## 0.02977701 0.05549851 0.06002100 0.02675648 0.11617745 0.11455656 0.09277177
## 56 59 60 61 74 75 81
## 0.21075875 0.18184797 0.11779812 0.15398311 0.22018983 0.08012999 0.06090537
## 83 85 93 94 109 120 123
## 0.06953879 0.09387963 0.05329096 0.09921972 0.28696441 0.16196419 0.04741813
## 137 142
## 0.07439255 0.17271198
Above is a sample of 30 prediction results. Notice that they are probabilities, not the 0/1 values the dependent variable (not.fully.paid) takes. We can convert them to 0 or 1 by assigning any value below 0.5 to 0 and any value above 0.5 to 1.
binary_predict <- as.factor(ifelse(predict_loan > 0.5, 1, 0))
head(binary_predict, n = 30)
## 11 15 16 17 18 19 29 30 31 34 37 39 40 45 56 59 60 61 74 75
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 81 83 85 93 94 109 120 123 137 142
## 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
The code reads: if the predict_loan result is above 0.5, assign 1; otherwise (the result is below 0.5), assign 0. As you can see, the fitted values have been transformed to either 0 or 1, with 1 representing default and 0 representing paid.
We have built the model and made predictions; now it is time to see how accurate those predictions are on the test dataset. The standard tool for evaluating a classifier's predictions is the confusion matrix, which shows where the model's predictions match the actual values from the test dataset and where they do not.
### CONFUSION MATRIX
library(caret)
confusionMatrix(data = binary_predict,
reference = test_data$not.fully.paid)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1601 294
## 1 8 13
##
## Accuracy : 0.8424
## 95% CI : (0.8253, 0.8584)
## No Information Rate : 0.8398
## P-Value [Acc > NIR] : 0.3921
##
## Kappa : 0.06
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99503
## Specificity : 0.04235
## Pos Pred Value : 0.84485
## Neg Pred Value : 0.61905
## Prevalence : 0.83977
## Detection Rate : 0.83559
## Detection Prevalence : 0.98904
## Balanced Accuracy : 0.51869
##
## 'Positive' Class : 0
##
The confusion matrix above tells us that our model correctly predicted 1601 individuals who paid their loan (true positives, since the 'positive' class here is 0) and 13 who defaulted (true negatives). Overall accuracy is \(\frac{\text{correct predictions}}{\text{total observations}} = \frac{1601 + 13}{1601 + 294 + 8 + 13} = 0.8424\); of the 1,916 observations in the test dataset, the model classified 1,614 correctly. But the matrix also shows 8 false negatives, people the model predicted would default but who actually paid, and 294 false positives, people the model predicted would pay but who actually defaulted. Note that the accuracy barely beats the no-information rate of 0.8398, and the specificity of 0.042 means the model catches only 13 of the 307 actual defaults, a consequence of the class imbalance.
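If catching defaulters matters more than raw accuracy, one can lower the classification cutoff; a sketch with a hypothetical cutoff of 0.3 (the value is illustrative, not tuned):
# Flag a loan as default at probability > 0.3 instead of 0.5
binary_predict_03 <- as.factor(ifelse(predict_loan > 0.3, 1, 0))
table(Predicted = binary_predict_03, Actual = test_data$not.fully.paid)
This trades false alarms for more detected defaults; the ROC curve below summarizes that trade-off across all possible cutoffs.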
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred_ROCR <- prediction(predict_loan, test_data$not.fully.paid)
auc_ROCR <- performance(pred_ROCR, measure = 'auc')
plot(performance(pred_ROCR, measure = 'tpr', x.measure = 'fpr'), colorize = TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
paste('Area under Curve :', signif(auc_ROCR@y.values[[1]]))
## [1] "Area under Curve : 0.64403"