In this lab we will revisit two data sets we saw previously and apply logistic regression. Let’s begin with the Lending Club data. We already cleaned up the data somewhat in previous labs, so this time we will load in the clean data set. To create the cleaned-up data we eliminated observations with missing values for loan_status, created a good loan/bad loan indicator named good, calculated the average fico score, consolidated small or similar purpose categories, and created a new variable income which equals income if either the source of income or income itself is verified, and NA otherwise. The cleaning script is here.
library(dplyr)
library(stargazer)
library(caret)
loan <- read.csv("https://www.dropbox.com/s/89g1yyhwpcqwjn9/lending_club_cleaned.csv?raw=1")
#loan <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab9/lending_club_cleaned.csv")
summary(loan)
## good purpose fico dti
## bad : 6371 debt_consolidation:25253 Min. :612.0 Min. : 0.00
## good:36164 other : 5160 1st Qu.:687.0 1st Qu.: 8.20
## major_purchase : 3926 Median :712.0 Median :13.47
## home_improvement : 3625 Mean :715.1 Mean :13.37
## small_business : 1992 3rd Qu.:742.0 3rd Qu.:18.68
## vacation_wedding : 1404 Max. :827.0 Max. :29.99
## (Other) : 1175
## loan_amnt income
## Min. : 500 Min. : 4800
## 1st Qu.: 5200 1st Qu.: 44995
## Median : 9700 Median : 63000
## Mean :11090 Mean : 75186
## 3rd Qu.:15000 3rd Qu.: 90000
## Max. :35000 Max. :6000000
## NA's :18758
We will use the function glm() (part of base R) to estimate a logit model. GLM stands for generalized linear model. Its first argument is a formula in the Y ~ X1 + X2 format, where Y is the variable we try to predict (the dependent variable) and X1, X2, etc. are the predictor (independent) variables. The second argument specifies the data, and in the third argument we specify that we want a “binomial” model, which tells glm() to fit a logistic function to the data.
logit1 <- glm(good ~ fico, data = loan, family = "binomial")
summary(logit1)
##
## Call:
## glm(formula = good ~ fico, family = "binomial", data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5172 0.4078 0.5306 0.6294 0.9622
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.033068 0.296922 -23.69 <2e-16 ***
## fico 0.012358 0.000421 29.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34985 on 42533 degrees of freedom
## AIC: 34989
##
## Number of Fisher Scoring iterations: 5
Note that good is a factor variable with two levels (“bad” and “good”). glm() will treat the first level as 0 and the second level as 1. Thus, in our case we are estimating the probability of “good”.
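If you are ever unsure which level glm() treats as 1, a quick check is to inspect the factor’s levels directly:
levels(loan$good) # "bad" comes first and is coded 0, so glm() models the probability of "good"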
The output includes a summary of deviance residuals and the AIC, which are both measures of fit. It also includes coefficients associated with each independent variable and the intercept. The coefficient on fico is positive and statistically significant. It tells us that a one point increase in fico score raises the log of the odds ratio by about 0.012. The log of odds is rather hard to interpret; therefore, we often take the exponential of the coefficients.
exp(coef(logit1))
## (Intercept) fico
## 0.0008822209 1.0124345145
The exponential of the slope coefficient tells us the factor by which the odds increase if the independent variable increases by one. So, if a fico score increases by 1 point, the odds of the loan being good increase by a factor of 1.012, which means the odds increase by 1.2 percent. Here is the math behind this: the logistic model estimates the following equation: \(log(\tfrac{p}{1-p})=\beta_0 + \beta_1 \cdot X\). Taking the exponential of both sides we get \(\tfrac{p}{1-p}=e^{\beta_0 + \beta_1 \cdot X}\). We call \(\tfrac{p}{1-p}\) the odds of a good loan. The odds of a good loan depend on \(X\) (fico score). When \(X\) increases by one, \(odds(X+1)=e^{\beta_0} \cdot e^{\beta_1 \cdot X} \cdot e^{\beta_1} = odds(X)\cdot e^{\beta_1}\).
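We can verify this relationship numerically (a quick sketch using the logit1 model estimated above; the fico values 700 and 701 are arbitrary): the ratio of the odds at two fico scores one point apart should equal the exponential of the fico coefficient.
p <- predict(logit1, data.frame(fico = c(700, 701)), type = "response") # predicted probabilities
odds <- p / (1 - p)       # convert probabilities to odds
odds[2] / odds[1]         # ratio of odds at fico 701 vs 700
exp(coef(logit1)["fico"]) # should match the ratio above, about 1.012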
IN-CLASS EXERCISE 1: What is the effect of debt to income ratio (dti) on the odds of a loan being good?
While the interpretation in terms of odds is relatively easy, sometimes interpretation in terms of probability is desired. However, calculating how a change in fico score affects the probability of a good loan is a bit more complicated. The relationship is non-linear, and therefore the effect depends on the value of the fico score (and of other variables if they are included). Therefore, we need to specify the levels of fico at which we want to see the effect. For example, suppose we wanted to know the effect of fico going from 700 to 750 on the probability of a loan being good. We could create a test data set with these two values. We then feed these values to our logit1 model using the function predict(). As an option we specify type="response", which tells predict() to return probabilities.
test <- data.frame(fico=c(700,750))
test$pred <- predict(logit1,test, type="response")
test
## fico pred
## 1 700 0.8344391
## 2 750 0.9033761
The fico score going from 700 to 750 increases the probability of a good loan by seven percentage points.
IN-CLASS EXERCISE 2: What is the effect of fico going from 750 to 800?
Let’s add an additional explanatory/independent variable, loan_amnt.
logit2 <- glm(good ~ fico + loan_amnt, data = loan, family = "binomial")
summary(logit2)
##
## Call:
## glm(formula = good ~ fico + loan_amnt, family = "binomial", data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6261 0.4011 0.5256 0.6261 0.9423
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.367e+00 3.007e-01 -24.50 <2e-16 ***
## fico 1.319e-02 4.306e-04 30.62 <2e-16 ***
## loan_amnt -2.229e-05 1.815e-06 -12.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34838 on 42532 degrees of freedom
## AIC: 34844
##
## Number of Fisher Scoring iterations: 5
exp(coef(logit2))
## (Intercept) fico loan_amnt
## 0.0006316636 1.0132732517 0.9999777084
It is important to keep in mind that once you include more than one independent variable, the interpretation of the coefficient on an independent variable is the effect of that variable holding all other independent variables constant. For example, holding loan amount constant, raising the fico score by 1 point raises the odds of a good loan by 1.3 percent. This is a bit higher than when we did not control for loan amount, which suggests that fico and loan amount are related. Perhaps people with higher fico scores get approved for larger loans, but larger loans are also more likely to default.
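We can again use predict() to translate this into probabilities while holding loan amount fixed (a sketch in the spirit of the earlier example; fixing loan_amnt at its median is our choice):
test2 <- data.frame(fico = c(700, 750), loan_amnt = median(loan$loan_amnt)) # hold loan_amnt constant
test2$pred <- predict(logit2, test2, type = "response")
test2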
Let’s see how glm() handles categorical/factor variables. For example, let’s include purpose in the model.
logit3 <- glm(good ~ fico + loan_amnt + purpose, data = loan, family = "binomial")
summary(logit3)
##
## Call:
## glm(formula = good ~ fico + loan_amnt + purpose, family = "binomial",
## data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6865 0.3882 0.5136 0.6193 1.2974
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.508e+00 3.030e-01 -24.780 < 2e-16 ***
## fico 1.356e-02 4.361e-04 31.093 < 2e-16 ***
## loan_amnt -2.311e-05 1.898e-06 -12.177 < 2e-16 ***
## purposeeducational -5.552e-01 1.233e-01 -4.503 6.69e-06 ***
## purposehome_improvement -1.404e-01 5.245e-02 -2.677 0.007418 **
## purposemajor_purchase 5.918e-02 5.651e-02 1.047 0.295023
## purposemedical -3.498e-01 1.005e-01 -3.481 0.000499 ***
## purposeother -3.347e-01 4.276e-02 -7.827 4.99e-15 ***
## purposesmall_business -9.380e-01 5.472e-02 -17.140 < 2e-16 ***
## purposevacation_wedding 1.160e-01 8.625e-02 1.345 0.178738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35928 on 42534 degrees of freedom
## Residual deviance: 34502 on 42525 degrees of freedom
## AIC: 34522
##
## Number of Fisher Scoring iterations: 5
Note that purpose is a factor with eight levels, but glm() is showing us only seven coefficients associated with purpose. Why? Do you recall the term dummy variable trap? To capture the effect of \(k\) categories you only need \(k-1\) dummies. The interpretation of the coefficient on each of the included dummies is the effect of that category relative to the ‘left out’ category. In our case the category that is left out is debt_consolidation, so the coefficients on educational, home_improvement, etc. are relative to debt_consolidation loans. Let’s take the exponential of the coefficients so we can interpret their magnitude.
round(exp(coef(logit3)),3)
## (Intercept) fico loan_amnt
## 0.001 1.014 1.000
## purposeeducational purposehome_improvement purposemajor_purchase
## 0.574 0.869 1.061
## purposemedical purposeother purposesmall_business
## 0.705 0.716 0.391
## purposevacation_wedding
## 1.123
The coefficients show that the odds of a loan being good are lower for educational loans relative to debt consolidation loans by a factor of 0.57, or about 43% lower. As another example, let’s take vacation and wedding loans. These have, shockingly, 12% higher odds of being good than debt consolidation loans.
In many cases we would like to have control over which category is left out of the regression. By default the category that is left out is the first level of the factor. In this case it was debt_consolidation, since it comes first in the alphabetical order of the eight factor levels. It is typical to leave out the category with the largest number of observations, since the largest category is normally a good reference to which other categories can be compared. (Think of comparing earnings of black, Hispanic and Asian workers to those of white workers.) Here debt consolidation happens to be the biggest category, so we did not feel the need to change the order of the factor levels. If we needed to, we could have done it using the function factor(factorname, levels = c("","",...)), or, if we wanted to automatically order the factor levels by the number of cases in each level, we could have done something like this:
loan <- loan %>% group_by(purpose) %>% mutate(nobs=n())
loan$purpose <- reorder(loan$purpose, -loan$nobs)
levels(loan$purpose)
## [1] "debt_consolidation" "other" "major_purchase"
## [4] "home_improvement" "small_business" "vacation_wedding"
## [7] "medical" "educational"
The levels are no longer in alphabetical order. Instead, they are now ordered from the category with the largest number of observations (debt_consolidation) to the category with the smallest number of observations (educational).
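If all we need is to change the reference category, rather than sort all the levels, base R’s relevel() is a simpler option (a sketch; making small_business the reference here is arbitrary, and we assign to a throwaway variable so the models above are unaffected):
purpose2 <- relevel(loan$purpose, ref = "small_business") # small_business becomes the left-out category
levels(purpose2)[1] # confirm the new first (reference) level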
Suppose we wanted to include income as an explanatory variable. Recall that income is verified in only about half of the cases and therefore about half of the values for this variable are missing.
logit4 <- glm(good ~ fico + loan_amnt + income + purpose, data = loan, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit4)
##
## Call:
## glm(formula = good ~ fico + loan_amnt + income + purpose, family = "binomial",
## data = loan)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3659 0.3840 0.5224 0.6320 1.2589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.482e+00 4.119e-01 -18.165 < 2e-16 ***
## fico 1.303e-02 5.906e-04 22.061 < 2e-16 ***
## loan_amnt -3.663e-05 2.580e-06 -14.200 < 2e-16 ***
## income 7.203e-06 5.426e-07 13.275 < 2e-16 ***
## purposeother -2.988e-01 6.034e-02 -4.952 7.34e-07 ***
## purposemajor_purchase 1.388e-02 7.689e-02 0.180 0.8568
## purposehome_improvement -1.077e-01 7.108e-02 -1.515 0.1298
## purposesmall_business -9.158e-01 7.016e-02 -13.052 < 2e-16 ***
## purposevacation_wedding 1.114e-01 1.126e-01 0.989 0.3226
## purposemedical -2.678e-01 1.426e-01 -1.878 0.0604 .
## purposeeducational -5.076e-01 2.309e-01 -2.198 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20592 on 23776 degrees of freedom
## Residual deviance: 19677 on 23766 degrees of freedom
## (18758 observations deleted due to missingness)
## AIC: 19699
##
## Number of Fisher Scoring iterations: 5
glm() omits observations/rows with a missing value for any of the independent variables. Thus, the model above is estimated on only about half of the observations in our data set. Perhaps most importantly, this model cannot be used to predict the status of loans that have missing income.
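One common workaround (a sketch only; we do not pursue it in this lab, and the variable names inc_filled and inc_missing are ours) is the missing-indicator approach: fill in the missing incomes with a constant and add a dummy flagging which values were missing, so that glm() keeps all the rows.
loan$inc_missing <- is.na(loan$income)                       # flag rows with missing income
loan$inc_filled <- ifelse(loan$inc_missing, 0, loan$income)  # fill missing income with 0
logit4b <- glm(good ~ fico + loan_amnt + inc_filled + inc_missing + purpose,
               data = loan, family = "binomial")             # now estimated on all 42,535 rows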
Let’s do our normal routine of splitting the data into train and test sets and see how logistic regression does predicting out of sample. As our model, let’s use fico, dti, loan_amnt and purpose as predictors.
set.seed(364)
sample <- sample(nrow(loan),floor(nrow(loan)*0.8))
train <- loan[sample,]
test <- loan[-sample,]
logit4 <- glm(good ~ fico + dti + loan_amnt + purpose, data = train, family = "binomial")
test$pred <- predict(logit4, test, type="response")
test$good_pred <- ifelse(test$pred > 0.80, "good", "bad")
# newer versions of caret require both arguments to be factors with the same levels
confusionMatrix(factor(test$good_pred, levels = levels(test$good)), test$good)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 427 1265
## good 818 5997
##
## Accuracy : 0.7551
## 95% CI : (0.7459, 0.7643)
## No Information Rate : 0.8536
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1469
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.34297
## Specificity : 0.82581
## Pos Pred Value : 0.25236
## Neg Pred Value : 0.87997
## Prevalence : 0.14635
## Detection Rate : 0.05019
## Detection Prevalence : 0.19890
## Balanced Accuracy : 0.58439
##
## 'Positive' Class : bad
##
We have an accuracy of 75%. We detect bad loans in 34% of cases, and we label good loans as good in 83% of cases. Kappa is nearly 0.15, which is quite a bit better than when we used Naive Bayes.
We set the cutoff for classifying a loan as good at a probability of 0.80. This gave us a sensitivity (probability of detecting bad loans among bad loans) of 0.34, and a specificity (probability of detecting good loans among good loans) of 0.83. What would happen to sensitivity if we raised the cutoff for classifying a loan as good? What would happen to specificity? What would happen to sensitivity if we classified all loans as bad? What would our specificity be? What if we classified all loans as good?
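One way to build intuition for these questions is to recompute sensitivity and specificity at several cutoffs (a sketch; the cutoffs 0.7, 0.8 and 0.9 are arbitrary):
for (cutoff in c(0.7, 0.8, 0.9)) {
  pred <- factor(ifelse(test$pred > cutoff, "good", "bad"), levels = levels(test$good))
  cm <- confusionMatrix(pred, test$good)
  cat("cutoff:", cutoff,
      "sensitivity:", round(cm$byClass["Sensitivity"], 3),
      "specificity:", round(cm$byClass["Specificity"], 3), "\n")
}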
The tradeoff between sensitivity and specificity is displayed in the ROC curve. We will use the package pROC to create a set of sensitivities and specificities and plot the tradeoff between them.
library(pROC)
library(ggplot2) # needed for plotting the ROC curve below
roc <- roc(test$good, test$pred) #creates an object with all sorts of diagnostics including sensitivities and specificities
test$sens <- roc$sensitivities[2:8508] #include sensitivities in test data (drop the first threshold so the length matches nrow(test))
test$spec <- roc$specificities[2:8508]
ggplot(test, aes(x=spec, y=sens)) + geom_line()
Let’s estimate a Naive Bayes classifier and compare it to our logistic regression.
library(e1071)
#create the categorical variables that will be used in the Naive Bayes
loan$ficocat <- cut(loan$fico, breaks=c(0,687,742,1000),
labels=c("bottom 25%","middle 50%", "top 25%"))
loan$dticat <- cut(loan$dti, breaks=c(0,8.2,18.68,100),
labels=c("bottom 25%","middle 50%", "top 25%"))
#put those categorical variables into train and test
train$ficocat <- loan$ficocat[sample]
test$ficocat <- loan$ficocat[-sample]
train$dticat <- loan$dticat[sample]
test$dticat <- loan$dticat[-sample]
NB <- naiveBayes(good ~ ficocat + dticat + purpose,train)
test$predNB <- predict(NB, select(test, ficocat, dticat, purpose), type="raw")[,"good"]
rocNB <- roc(test$good, test$predNB)
test$sensNB <- rocNB$sensitivities[2:8508] #predNB has only a few unique values, so positions beyond the available thresholds are filled with NA
test$specNB <- rocNB$specificities[2:8508]
ggplot(test) + geom_line(aes(x=spec, y=sens), color="blue") + geom_line(aes(x=specNB,y=sensNB), color="red")
The logistic regression (in blue) is slightly better, having higher specificity for a given sensitivity, and higher sensitivity for a given specificity.
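A single-number summary of this comparison is the area under the ROC curve (AUC), which pROC computes directly (a quick check; a larger area means a better classifier overall):
auc(roc)   # area under the logistic regression ROC curve
auc(rocNB) # area under the Naive Bayes ROC curve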
IN-CLASS EXERCISE 3: Eliminate ficocat from the Naive Bayes classifier. Are the ROC curves different?
Let’s load the Titanic training data. What are the odds of surviving the shipwreck?
Using the logit model, estimate how much lower the odds of survival are for men relative to women.
Controlling for gender, does age have a statistically significant effect on the odds of survival? If so, what is the magnitude of that effect?
Controlling for gender, does passenger class have a statistically significant effect on the odds of survival? If so, what is the magnitude of that effect?
Controlling for gender, estimate the effect of being in the second class relative to the first class, and the effect of being in the third class relative to the first.
Add fare to the regression you estimated above. Is fare a significant determinant of survival controlling for gender and passenger class? Do you think that if we regressed survival on just gender and fare, fare would be significant? Explain.
As we know from the movie, Jack traveled in the third class and paid 5 pounds (I know that Jack actually won the ticket in poker, but Swen, from whom Jack won the ticket, paid …). Rose traveled in the first class and paid 500 for her ticket (I know that her fiancé, Cal Hockley, a Pittsburgh steel tycoon, actually bought the ticket, but …). What is the probability that Jack will survive? What is the probability that Rose will survive?
Create your own logistic model and make predictions for passengers in the Titanic test data set. Keep in mind that you must make predictions for all passengers in the test data (even those with missing values). Use your own probability cut off for predicting survival (0.5 is a natural start). Did you do better with logistic regression than with decision trees? Which algorithm do you like better?