Logistic Regression Exercises

Exploratory Data Analysis

We will be working with the Default dataset to explore how we can use logistic regression to predict credit card default.

# Load the packages and data
library(ISLR)      # Default dataset
library(ggplot2)   # plotting
library(dplyr)     # data manipulation (%>%, mutate)
library(knitr)     # kable tables
library(faraway)   # halfnorm plot for leverage diagnostics
data(Default)

Let’s start by examining the data:

head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554
str(Default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
?Default
# Make some plots of the data that you think are helpful in understanding the data (don't worry about running regression lines through them, just plot)
# then I will live code the regression line

# how does income relate to balance?
sp0 = ggplot(data=Default, aes(x=income, y=balance)) + # add color=student
  geom_point() +
  geom_smooth(method = "lm")

sp1 = ggplot(data=Default, aes(x=income, y=balance, color=student)) + 
  geom_point() +
  geom_smooth(method = "lm") + 
  scale_color_manual(values=c("#999999", "#E69F00"))

cor(Default$income, Default$balance)
## [1] -0.1522434
model <- lm(balance ~ income, data=Default)
summary(model)$coefficient
##                  Estimate   Std. Error   t value    Pr(>|t|)
## (Intercept)  1.020449e+03 1.293220e+01  78.90765 0.00000e+00
## income      -5.521812e-03 3.585042e-04 -15.40236 6.38044e-53
# weak negative relationship (correlation of about -0.15)
# whether or not you are a student doesn't strongly change the relationship between income and balance.

# But whether you're a student or not does affect your income (which makes sense: you're not really making money when you're in school!)
sp2 = ggplot(data=Default, aes(x=balance,y=default, color=student)) +
  geom_point() #+ geom_smooth(method="lm")
# Whether you're a student or not does not seem to affect credit balance

sp3 =ggplot(data=Default, aes(x=income,y=default, color=student)) +
  geom_point() #+ geom_smooth(method="lm")
# Again, we see that whether or not you're a student affects your income.

# this tells me nothing:
sp4 = ggplot(data=Default, aes(x=student, y=default)) + 
  geom_point() 
# can have only 4 points (yes, yes) (yes, no) (no, yes) (no, no)

What do your plots tell you? Are there any interesting observations?

I noticed there are 10,000 observations, where each row is a person, and 4 columns: default, the categorical response variable, which identifies who failed to make a payment on their credit card debt by the due date; student, a categorical predictor variable; and two continuous predictor variables, balance (the average balance the customer has remaining on their credit card after making their monthly payment) and income (the customer's income).

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(sp0, sp1, sp2, sp3, nrow=2) #sp4 tells me nothing
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Fitting a linear model to the Default data

In the remainder of this exercise, we will be primarily interested in the probability that an individual defaults based on a handful of characteristics about them. The variable default will be the response, and the others will be covariates.

Before fitting the model in the next code chunk, consider some ways in which your model will work well and some ways in which it won’t.

data(Default)
# Convert the response variable from a factor to numeric (recoding levels 1/2 to 0/1)
Default_ <- Default # unique(Default$default) # how to check your levels
Default_$default <- as.numeric(Default_$default)-1
Default$default_yes <- as.numeric(Default$default)-1
# Fit a linear model with default as a response
# and the other variables as predictors

# mod.lm <- lm(default ~ student + balance + income, data=Default_)
mod.lm <- lm(default_yes ~  student + balance + income, data=Default)

# Review the summary of the model
kable(round(summary(mod.lm)$coefficients,2))
                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)        -0.08         0.01     -9.68       0.00
studentYes         -0.01         0.01     -1.82       0.07
balance             0.00         0.00     37.41       0.00
income              0.00         0.00      1.04       0.30
round(summary(mod.lm)$coefficients,2)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.08       0.01   -9.68     0.00
## studentYes     -0.01       0.01   -1.82     0.07
## balance         0.00       0.00   37.41     0.00
## income          0.00       0.00    1.04     0.30
# Plot the results of the model, specifically default vs. balance
ggplot(data=Default_, aes(x=balance,y=default)) +
  geom_point() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

How does your model look? Are there some things that you wouldn’t expect?

The probability of default can’t go below 0, yet the linear model predicts values below 0. You can also see that for very large and very small balances the likelihood of default is fairly certain, while there is more uncertainty in the middle.
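We can check this directly from the linear fit (a quick sketch using the mod.lm object fitted above):

range(predict(mod.lm))    # the linear model's fitted values are not constrained to [0, 1]
mean(predict(mod.lm) < 0) # fraction of observations with a fitted "probability" below 0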

Fitting a logistic (Bernoulli) model to the Default data

To address the inadequacies of the linear model for this dataset, we try using logistic regression. For this we need a new function in \(\mathsf{R}\) called glm. The format is for the most part the same as lm, but we now need to specify a family. Since our data takes on 0s and 1s, we are making the assumption that the noise is Bernoulli, or more precisely

\[ Y_i\mid X=x \sim \text{Bernoulli}(p_i) \] This means we need to specify that the family is binomial. While we won’t get into this too much unless time permits, we can also specify a link, which here we use link=logit. This tells glm to use the logistic cumulative distribution function (CDF) in the model.

Now try running it yourself!

# Fit the logistic model with the same formula as the linear model
binDef <- glm(default ~ student + balance + income,
              family=binomial(link="logit"),
              data=Default)

# Review the summary output
kable(round(summary(binDef)$coefficient,2))
                Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)       -10.87         0.49    -22.08       0.00
studentYes         -0.65         0.24     -2.74       0.01
balance             0.01         0.00     24.74       0.00
income              0.00         0.00      0.37       0.71

How would you interpret these coefficients?

The rounded coefficients for balance and income are close to zero. For income the coefficient is also not significant, so the logit of p is essentially flat in income, meaning default is roughly independent of income (balance, while small in magnitude, is highly significant).

However, the odds of defaulting on credit card debt are multiplied by \(e^{\beta}\):

exp(-0.65)
## [1] 0.5220458

as you go from a non-student to a student (the studentYes coefficient), holding balance and income fixed.

The odds of defaulting on credit card debt are multiplied by

exp(.01)
## [1] 1.01005

for every 1 unit increase in balance.
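(The table above is rounded; as a quick sketch, the exact odds multipliers can be read off the binDef fit by exponentiating its coefficients.)

round(exp(coef(binDef)), 4)  # odds multipliers for a one-unit change in each predictor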

confint(binDef, level=0.95)
## Waiting for profiling to be done...
##                     2.5 %        97.5 %
## (Intercept) -1.185902e+01 -9.928174e+00
## studentYes  -1.109018e+00 -1.822147e-01
## balance      5.294898e-03  6.204587e-03
## income      -1.304712e-05  1.912447e-05

Remember you are working on probability scales instead of unit scales.

\(p_i = \dfrac{\exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots)}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots)}\), where \(i\) indexes the observation.

\(\text{logit}(p_i) = \log\!\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots\), i.e., the logit of \(p_i\) is linear in the covariates.

# Make predictions and plot the result
predDef <- predict(binDef,
                   type="response") # <-- why do we need this??
predDef2 <- predict(binDef)


max(predDef) # this is the probability
## [1] 0.9776263
max(predDef2) # this is the logit
## [1] 3.777239
logit = log(.977/(1-.977))        # how to calculate the logit of p (here p is about 0.977, the max predicted probability)
p = exp(logit)/(1 + exp(logit))   # invert the logit to recover p

Default <- Default %>%
  mutate(defaultNum = ifelse(default=="Yes",1,0))
Default <- Default %>%
  mutate(predDef = predDef)

pBinDef <- ggplot(data=Default, aes(x=balance, y=defaultNum)) +
  geom_point() +
  geom_line(data=Default, aes(x=balance, y=predDef, color=student)) +
  xlab("Balance") +
  ylab("Likelihood of Default") +
  labs(title="Probability of Default given Debt Balance") +
  guides(color=guide_legend("Student"))
pBinDef

Some discussion questions:

  • Describe the conclusions of your model using the summary output and the plot. Be careful on interpreting the coefficients!
  • Why do we need to specify type=response in the prediction?

The type=“response” option tells R to output probabilities of the form P(Y = 1|X) , as opposed to other information such as the logit.
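One way to see this (a small sketch using the binDef fit): the default predict() output is on the link (logit) scale, and applying the inverse logit recovers the probabilities returned by type="response".

linkPred <- predict(binDef)                    # logit (link) scale, the default
probPred <- predict(binDef, type="response")   # probability scale
all.equal(plogis(linkPred), probPred)          # plogis() is the inverse logit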

  • Do you think the model overcomes the difficulties of the lm model? In what ways?

  • Are there any drawbacks to this model? It would be important to report, but we can try removing income, since it has no significant effect on the probability of defaulting.

binDef2 <- glm(default ~ student + balance,
              family=binomial(link="logit"),
              data=Default)

kable(round(summary(binDef2)$coefficient,2))
                Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)       -10.75         0.37    -29.12          0
studentYes         -0.71         0.15     -4.85          0
balance             0.01         0.00     24.75          0

End of section 1

Model Fit and Selection

In this section, we are going to fit a larger model that looks not only at the main effects of student, balance, and income on default, but also at their interaction effects. Start by fitting the larger model that incorporates interaction effects (still using family=binomial and link="logit"). Then compute the difference in deviance between the models (remember, smaller models have larger deviance!). Also compute the difference in residual degrees of freedom, and conduct a hypothesis test.

# deviance is a goodness-of-fit statistic that plays the role of the residual sum of squares for a glm; you calculate it as twice the difference in log-likelihoods between the saturated model and the fitted model.
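# (Side note, a small sketch using the binDef model fitted earlier: with 0/1
# Bernoulli data the saturated model predicts each observation perfectly, so its
# log-likelihood is 0 and the residual deviance reduces to -2 * logLik of the fit.)
deviance(binDef)                  # residual deviance reported by glm
-2 * as.numeric(logLik(binDef))   # same value, computed from the log-likelihood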

# Fit a larger model that includes all 2-way interactions
# In R, you can do this by replacing Z ~ x + y + Z with Z ~ x * y * z
binDefLarge <- glm(default ~ student*balance*income,
              family=binomial(link="logit"),
              data=Default) # largest model we consider here (all main effects plus 2- and 3-way interactions); a fully saturated model would go further and fit one parameter per observation, essentially reproducing the sample itself

binDefNull <- glm(default ~ 1,
              family=binomial(link="logit"),
              data=Default)

binDefB <- glm(default ~ balance,
              family=binomial(link="logit"),
              data=Default)

# Compute difference in deviances and degrees of freedom
# Hint: Once you fit binDef/binDefLarge, take a look at the available attributes
# for each model by typing binDef$, and R should autocomplete the available options
# If you're stuck, review the slides
diffDev <- binDef$deviance - binDefLarge$deviance    
diffDF <- binDef$df.residual - binDefLarge$df.residual  

# diffDev <- binDefNull$deviance - binDefB$deviance
# diffDF <- binDefNull$df.residual - binDefB$df.residual

# Difference in degrees of freedom
diffDF
## [1] 4
# Difference in deviances
round(diffDev,3) # small difference in deviance
## [1] 0.586
# Compute p-value
round(1-pchisq(diffDev, diffDF),2) # large p-value, so no evidence the larger model improves the fit; prefer the smaller model
## [1] 0.96
# the interaction model is not a better fit than the simpler model here
# in general, a large test statistic and small p-value would suggest that the larger model fits better (i.e., the smaller model fits significantly worse than the larger model)
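As a cross-check (not part of the original exercise), the same likelihood-ratio comparison can be run with anova(), which reports the deviance difference, its degrees of freedom, and the chi-squared p-value in one table:

anova(binDef, binDefLarge, test = "Chisq")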

Diagnostics: Deviance Residuals

In this last section, we are going to look at model diagnostics, which give us a way to assess the model. The following plots look at the deviance residuals as a function of the covariates.

res1 <- ggplot(data=Default, aes(x=income, y=residuals(binDef), color=student)) +
  geom_point() +
  xlab("Income") + 
  ylab("Deviance Residuals") +
  labs(title="Deviance Residuals by Income")
res2 <- ggplot(data=Default, aes(x=student, y=residuals(binDef))) +
  geom_boxplot() +
  xlab("Student") + 
  ylab("Deviance Residuals") +
  labs(title="Deviance Residuals by Student Type")
res3 <- ggplot(data=Default, aes(x=balance, y=residuals(binDef), color=student)) +
  geom_point() +
  xlab("Balance") + 
  ylab("Deviance Residuals") +
  labs(title="Deviance Residuals by Balance")
res1

res2

res3

What do you think these plots suggest about the quality of the model? Do they suggest any major issues with the model?

As balance increases, the deviance residuals shift from positive to negative values. Across the student categories, the deviance residuals look similar. For income, there are a lot of negative residuals (which seems to be driving the overshooting of the model).

Histogram of Deviance Residuals

Plot the histogram of the residuals.

resHist <- ggplot(data=NULL, aes(x=residuals(binDef))) +
  geom_histogram() +
  xlab("Residuals") +
  ylab("Counts") +
  labs(title="Distribution of Deviance Residuals")
resHist

A residual is the actual (observed) value minus the predicted value.

If you have a negative value for a residual it means the actual value was LESS than the predicted value. Our model is overshooting the actual values.

Our current model is

glm(formula = default ~ student + balance + income, family = binomial(link = "logit"), data = Default)

so we might want to consider a simpler model by removing income.

Diagnostics: QQ Plot

Construct a QQ Plot of the residuals.

qqnorm(residuals(binDef))

A QQ plot is a graphical tool to help us assess whether a set of data plausibly came from some theoretical distribution such as a normal or an exponential. Normal QQ plots that look like this usually mean your sample data are skewed, which makes sense because we had a lot of No’s (downward concavity corresponds to negative skewness, a long tail to the left, and upward concavity indicates positive skewness).
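If it helps in judging the skewness, a reference line can be added to the same plot (a small optional sketch; qqline() is base graphics):

qqnorm(residuals(binDef))
qqline(residuals(binDef))  # points bending away from this line indicate skewed residuals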

Diagnostics: Leverage Points

Run the following code to create a plot of the leverage points.

halfnorm(hatvalues(binDef))

head(hatvalues(binDef))
##            1            2            3            4            5            6 
## 6.375400e-05 6.575620e-05 2.344373e-04 2.842522e-05 7.491062e-05 1.075051e-04

Based on the previous few plots, has your interpretation of the model quality changed at all? If so, how?

By sorting the hat values, the half-normal plot isolates the points with the most leverage. If an observation has a response value that is very different from the value predicted by the model, that observation is called an outlier. On the other hand, if an observation has a particularly unusual combination of predictor values (e.g., one predictor has a very different value for that observation compared with all the other observations), then that observation is said to have high leverage. A point can be both high leverage and influential, but the concepts are different.

High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables, such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.

An outlier is a data point whose response y does not follow the general trend of the rest of the data.

A data point has high leverage if it has “extreme” predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be “unusual” combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).