library(dplyr) # for manipulating data frames
library(ggplot2) # for data viz
library(here) # for simplifying folder references
library(readr) # reading and writing data files
library(GGally) # extension to ggplot2
library(ISLR) # package from the book Intro to Stat Learning
library(broom) # for cleaning up model results
library(tidyr) # for tidying data
library(knitr) # for kable() tables
library(kableExtra) # for kable_styling()
The linear models that we have looked at so far have 2 basic components. Recall the Boston house prices model we worked on last session. The components of the model we built were:

1. A linear predictor: a linear combination of the explanatory variables
2. A normally distributed error term around the value given by the linear predictor
We’re now going to move to a more general type of model, which has three components:

1. A probability distribution for the response variable (which no longer has to be normal)
2. A linear predictor: a linear combination of the explanatory variables
3. A link function that connects the linear predictor to the mean of the response distribution
Models with these three components are called generalized linear models (GLMs).1
The link function is necessary because in certain circumstances it doesn’t make sense to directly connect the linear predictor with the Y-variable - we have to first transform it in some way.
(Figure: the logistic link function)
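To make the link function idea concrete, here's a minimal sketch (my own illustration, using base R's `plogis()`, the inverse of the logit link) of how it maps any value of the linear predictor onto the 0-1 probability scale:

```r
# the inverse of the logit link maps the linear predictor (any real number)
# onto the (0, 1) probability scale
data.frame(linear.predictor = seq(-6, 6, by = 0.1)) %>%
  mutate(probability = plogis(linear.predictor)) %>%
  ggplot(aes(x = linear.predictor,
             y = probability)) +
  geom_line() +
  labs(title = "The logistic (inverse logit) function")
```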
By considering different types of distributions for the response variable, and different link functions, we can study several different GLMs. Two of the most common ones are:

- Logistic regression: a binary response with a logit link
- Poisson regression: a count response with a log link
In this session we’ll be covering logistic regression. In R, the mechanics of fitting various GLMs are basically the same (however, interpreting the models and diagnosing problems will require you to understand how the models work in each case).
Logistic regression is widely used in classification problems in healthcare and in some circumstances performs as well as or better than “sexier” models like deep learning.
The default dataset is built into the ISLR package. Much of this example is drawn from here - it’s a good tutorial, and I don’t have too much to add to it, to be honest.
Our goal is to find the probability that a certain credit card holder will default, given their income, balance, student status, etc.
df1.default.dataset <- Default # rename for convenience
# str(df1.default.dataset)
# levels(df1.default.dataset$student) # No is the first level, 1
p1.pairs <- df1.default.dataset %>%
ggpairs(); p1.pairs
# split by default category: Yes vs No
df2.summary <-
df1.default.dataset %>%
group_by(default) %>%
summarize(avg.bal = mean(balance),
avg.inc = mean(income),
# dplyr::n() is used to count rows
num.cases = n(),
num.student = sum(student=="Yes"),
perc.student = round(100*sum(student=="Yes")/n(), 2))
# print result:
df2.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| default | avg.bal | avg.inc | num.cases | num.student | perc.student |
|---|---|---|---|---|---|
| No | 803.9438 | 33566.17 | 9667 | 2817 | 29.14 |
| Yes | 1747.8217 | 32089.15 | 333 | 127 | 38.14 |
# split by Student category: Yes vs No
df1.default.dataset %>%
group_by(student) %>%
summarize(avg.bal = mean(balance),
avg.inc = mean(income),
num = n(),
perc.default = round(100*sum(default=="Yes")/n(), 2)) %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| student | avg.bal | avg.inc | num | perc.default |
|---|---|---|---|---|
| No | 771.7704 | 40011.95 | 7056 | 2.92 |
| Yes | 987.8182 | 17950.23 | 2944 | 4.31 |
There are several interesting things to see here: customers who defaulted carry a much higher average balance, and students default at a higher overall rate (4.31%) than non-students (2.92%).
Let’s create some models to examine the probability of default, given the characteristics of each customer.
Let’s start by examining the relationship between default and credit card balance.
# m1.linear.regression <- lm(as.numeric(default) ~ balance,
# data = df1.default.dataset)
# summary(m1.linear.regression)
# plot result of model:
df1.default.dataset %>%
mutate(is.default.numeric = ifelse(default == "Yes",
1,
0)) %>%
ggplot(aes(x = balance,
y = is.default.numeric)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm") +
labs(title = "Ordinary linear regression doesn't really make sense when you have \na binary response",
subtitle = "We want to interpret the regression line as 'Probability of default', but notice that we're seeing negative \nvalues at the far left.")
## glm

m2.logistic <- glm(default ~ balance,
# specify which specific GLM you want using "family":
family = "binomial",
data = df1.default.dataset)
summary(m2.logistic)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = df1.default.dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
Unlike in linear regression, there is no R-squared stat here. Instead, we use residual deviance. Lower deviance is better.
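If you want a single rough analogue of R-squared, one common option (not shown in the model output above, so treat this as an aside) is McFadden's pseudo R-squared:

```r
# McFadden's pseudo R-squared: 1 - residual deviance / null deviance
# (higher is better; for this model it comes out to roughly 0.45)
1 - m2.logistic$deviance / m2.logistic$null.deviance
```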
Let’s extract coefficients using broom::tidy. Note that the coefficients are given in terms of log odds, which are hard to interpret, so we convert them to odds factors, which are slightly easier to interpret.
tidy(m2.logistic) %>%
mutate(estimate.odds.factor = exp(estimate)) %>%
# rearrange column order:
select(term,
estimate,
estimate.odds.factor,
everything()) %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| term | estimate | estimate.odds.factor | std.error | statistic | p.value |
|---|---|---|---|---|---|
| (Intercept) | -10.6513306 | 0.0000237 | 0.3611574 | -29.49221 | 0 |
| balance | 0.0054989 | 1.0055141 | 0.0002204 | 24.95309 | 0 |
Every unit increase in balance is associated with the odds of default increasing by a factor of 1.0055 (i.e. by about 0.55%). If the odds factor is greater than 1, then this variable is associated with an increase in the probability of default.
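Since a one-dollar change in balance is tiny, it can help to look at the multiplicative effect over a larger change. Here's a quick sketch of my own using the fitted coefficient:

```r
# multiplicative effect on the odds of default
beta.balance <- coef(m2.logistic)["balance"]
exp(beta.balance)        # ~1.0055: odds multiply by about 1.0055 per extra dollar of balance
exp(100 * beta.balance)  # ~1.73: odds multiply by about 1.73 per extra $100 of balance
```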
Let’s create a dataset with five rows, using the 25th, 50th, 75th, 90th and 99th percentile values of balance.
# use quantile( ) to get percentiles
balance.quantiles <- df1.default.dataset %>%
pull(balance) %>%
quantile(probs = c(.25, .50, .75, .90, .99))
df4.fake.data <- data.frame(balance = balance.quantiles)
# view result:
df4.fake.data %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| | balance |
|---|---|
| 25% | 481.7311 |
| 50% | 823.6370 |
| 75% | 1166.3084 |
| 90% | 1471.6253 |
| 99% | 2008.4709 |
Now let’s use predict to see what probability our logistic model predicts for customers at these five percentiles of balance.
# in order to get output in terms of proportions (instead of log odds), use the type = "response" argument
(predict(m2.logistic,
newdata = df4.fake.data,
type = "response")) * 100 # convert proportions to percentages
## 25% 50% 75% 90% 99%
## 0.03345695 0.21887810 1.42324265 7.18251470 59.70248510
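As a sanity check (a small sketch of my own, not part of the original analysis), we can reproduce the 99th-percentile prediction by hand by applying the inverse link to the linear predictor:

```r
# manual check of the 99th-percentile prediction:
# inverse-logit(intercept + slope * balance), expressed as a percentage
b <- coef(m2.logistic)
plogis(b["(Intercept)"] + b["balance"] * balance.quantiles["99%"]) * 100
# ~59.7, matching the predict() output above
```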
Conclusions:

- For customers at or below the median balance, the predicted probability of default is negligible (well under 1%).
- The predicted probability rises sharply in the upper tail: about 7% at the 90th percentile of balance and roughly 60% at the 99th percentile.
augment(m2.logistic) %>%
# create an ID column, change fitted value to probability, etc:
mutate(id = row_number(),
default.numeric = ifelse(default == "Yes",
1,
0),
predicted.prob = m2.logistic$family$linkinv(.fitted)) %>%
# ^^ linkinv is used to invert the logistic function to get probabilities instead of log odds
# select columns for plotting:
select(id,
default.numeric,
balance,
predicted.prob) %>%
# reshape the data:
gather(key = "variable",
value = "value",
-c(id, balance)) %>%
# now plot it:
ggplot(aes(x = balance,
y = value,
group = variable,
colour = variable)) +
geom_point(alpha = 0.2)
It’s kind of fun to be able to see the predictions for your actual data points, so that’s the approach I showed above. In practice, you’ll usually just let ggplot plot a smooth curve, like this:
df1.default.dataset %>%
mutate(is.default.numeric = ifelse(default == "Yes",
1,
0)) %>%
ggplot(aes(x = balance,
y = is.default.numeric)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"))
## student as a predictor

First we'll fit a new model. Then we'll check whether the new model really is better. Finally we'll plot the new model's predictions.
# fit logistic model with 2 predictors:
m3.logistic.with.student <-
glm(default ~ balance + student,
data = df1.default.dataset,
family = "binomial")
summary(m3.logistic.with.student)
##
## Call:
## glm(formula = default ~ balance + student, family = "binomial",
## data = df1.default.dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4578 -0.1422 -0.0559 -0.0203 3.7435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
## balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
## studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
# examine coefficients:
tidy(m3.logistic.with.student) %>%
mutate(estimate.odds.factor = exp(estimate)) %>%
# rearrange column order:
select(term,
estimate,
estimate.odds.factor,
everything()) %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| term | estimate | estimate.odds.factor | std.error | statistic | p.value |
|---|---|---|---|---|---|
| (Intercept) | -10.7494959 | 0.0000215 | 0.3691914 | -29.116326 | 0.0e+00 |
| balance | 0.0057381 | 1.0057546 | 0.0002318 | 24.749526 | 0.0e+00 |
| studentYes | -0.7148776 | 0.4892520 | 0.1475190 | -4.846003 | 1.3e-06 |
Huh. The odds factor for student after controlling for balance is less than one: for the same balance, a student is less likely to default than a non-student. This result is the opposite of what we concluded from the naive analysis above.
Explanation for this change in conclusion:2
The variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with higher probability of default. In other words, students are more likely to have large credit card balances, which, as we know from the left-hand panel of the below figure, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students. This is an important distinction for a credit card company that is trying to determine to whom they should offer credit.
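One way to see this confounding in the data itself is to compare default rates for students and non-students within similar balance ranges. Here's a rough sketch (the bin cut-points are arbitrary choices of mine):

```r
# default rates by student status within balance bins:
# within comparable balance ranges, students tend to default less often,
# even though they default more often overall
df1.default.dataset %>%
  mutate(balance.bin = cut(balance,
                           breaks = c(0, 500, 1000, 1500, 2000, Inf),
                           include.lowest = TRUE)) %>%
  group_by(balance.bin, student) %>%
  summarize(num = n(),
            perc.default = round(100 * sum(default == "Yes") / n(), 2))
```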
Note that deviance is lower for this model. But is it significantly lower?
anova(m2.logistic,
m3.logistic.with.student,
test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: default ~ balance
## Model 2: default ~ balance + student
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9998 1596.5
## 2 9997 1571.7 1 24.77 6.459e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, it is significantly lower.
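The AIC values reported in the two summaries point the same way; if you prefer, you can pull them out side by side (lower is better):

```r
# compare AIC across the two models (lower is better)
AIC(m2.logistic, m3.logistic.with.student)
```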
## balance and student as predictors

df1.default.dataset %>%
mutate(is.default.numeric = ifelse(default == "Yes",
1,
0)) %>%
ggplot(aes(x = balance,
y = is.default.numeric,
group = student,
colour = student)) +
geom_jitter(alpha = 0.1,
height = .1) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"))
Add income to the model as well and check whether the model improves.
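Here's one possible starting point for that exercise (the model name m4.logistic.full is just a placeholder of mine, and the conclusion is left to you):

```r
# add income as a third predictor, then compare against the two-predictor model
m4.logistic.full <- glm(default ~ balance + student + income,
                        family = "binomial",
                        data = df1.default.dataset)

anova(m3.logistic.with.student,
      m4.logistic.full,
      test = "Chisq")
```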
Actually, ordinary linear regression is also a specific case of a GLM, where the link function is just the identity function, \(f(x) = x\)↩