library(dplyr) # for manipulating data frames
library(ggplot2) # for data viz
library(here) # for simplifying folder references
library(readr) # reading and writing data files
library(GGally) # extension to ggplot2
library(ISLR) # package from the book Intro to Stat Learning
library(broom) # for cleaning up model results
library(tidyr) # for tidying data
library(knitr) # for kable() tables
library(kableExtra) # for kable_styling()
The linear models that we have looked at so far have 2 basic components. Recall the Boston house prices model we worked on last session. The components of the model we built were:

1. A linear predictor: a linear combination of the explanatory variables
2. A normally distributed error term around the value given by the linear predictor
We’re now going to move to a more general type of model, which has three components:

1. A probability distribution for the response variable (which no longer has to be normal)
2. A linear predictor: a linear combination of the explanatory variables
3. A link function that connects the linear predictor to the mean of the response distribution
Models with these three components are called generalized linear models (GLMs).1
The link function is necessary because in certain circumstances it doesn’t make sense to directly connect the linear predictor with the Y-variable - we have to first transform it in some way.
(Figure: the logistic link function)
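To make the link function idea concrete, here's a minimal sketch (my own illustration, using base R's `plogis()`, the inverse of the logit link) of how it maps any value of the linear predictor onto the 0-1 probability scale:

```r
# the inverse of the logit link maps the linear predictor (any real number)
# onto the (0, 1) probability scale
data.frame(linear.predictor = seq(-6, 6, by = 0.1)) %>%
  mutate(probability = plogis(linear.predictor)) %>%
  ggplot(aes(x = linear.predictor,
             y = probability)) +
  geom_line() +
  labs(title = "The logistic (inverse logit) function")
```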
By considering different types of distributions for the response variable, and different link functions, we can study several different GLMs. Two of the most common ones are:

- Logistic regression: a binary response with a logit link
- Poisson regression: a count response with a log link
In this session we’ll be covering logistic regression. In R, the mechanics of fitting various GLMs are basically the same (however, interpreting the models and diagnosing problems will require you to understand how the models work in each case).
Logistic regression is widely used in classification problems in healthcare and in some circumstances performs as well as or better than “sexier” models like deep learning.
The default dataset is built into the ISLR package. Much of this example is drawn from here - it’s a good tutorial, and I don’t have too much to add to it, to be honest.
Our goal is to find the probability that a certain credit card holder will default, given their income, balance, student status, etc.
df1.default.dataset <- Default # rename for convenience
# str(df1.default.dataset)
# levels(df1.default.dataset$student) # No is the first level, 1
p1.pairs <- df1.default.dataset %>%
ggpairs(); p1.pairs
# split by default category: Yes vs No
df2.summary <-
df1.default.dataset %>%
group_by(default) %>%
summarize(avg.bal = mean(balance),
avg.inc = mean(income),
# dplyr::n() is used to count rows
num.cases = n(),
num.student = sum(student=="Yes"),
perc.student = round(100*sum(student=="Yes")/n(), 2))
# print result:
df2.summary %>%
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| default | avg.bal | avg.inc | num.cases | num.student | perc.student |
|---|---|---|---|---|---|
| No | 803.9438 | 33566.17 | 9667 | 2817 | 29.14 |
| Yes | 1747.8217 | 32089.15 | 333 | 127 | 38.14 |
# split by Student category: Yes vs No
df1.default.dataset %>%
group_by(student) %>%
summarize(avg.bal = mean(balance),
avg.inc = mean(income),
num = n(),
perc.default = round(100*sum(default=="Yes")/n(), 2)) %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| student | avg.bal | avg.inc | num | perc.default |
|---|---|---|---|---|
| No | 771.7704 | 40011.95 | 7056 | 2.92 |
| Yes | 987.8182 | 17950.23 | 2944 | 4.31 |
There are several interesting things to see here: customers who defaulted carry a much higher average balance, and students default at a higher overall rate (4.31%) than non-students (2.92%).
Let’s create some models to examine the probability of default, given the characteristics of each customer.
Let’s start by examining the relationship between default and credit card balance.
# m1.linear.regression <- lm(as.numeric(default) ~ balance,
# data = df1.default.dataset)
# summary(m1.linear.regression)
# plot result of model:
df1.default.dataset %>%
mutate(is.default.numeric = ifelse(default == "Yes",
1,
0)) %>%
ggplot(aes(x = balance,
y = is.default.numeric)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm") +
labs(title = "Ordinary linear regression doesn't really make sense when you have \na binary response",
subtitle = "We want to interpret the regression line as 'Probability of default', but notice that we're seeing negative \nvalues at the far left.")
## glm

m2.logistic <- glm(default ~ balance,
# specify which specific GLM you want using "family":
family = "binomial",
data = df1.default.dataset)
summary(m2.logistic)
##
## Call:
## glm(formula = default ~ balance, family = "binomial", data = df1.default.dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
Unlike in linear regression, there is no R-squared stat here. Instead, we use residual deviance. Lower deviance is better.
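If you want a single rough analogue of R-squared, one common option (not shown in the model output above, so treat this as an aside) is McFadden's pseudo R-squared:

```r
# McFadden's pseudo R-squared: 1 - residual deviance / null deviance
# (higher is better; for this model it comes out to roughly 0.45)
1 - m2.logistic$deviance / m2.logistic$null.deviance
```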
Let’s extract coefficients using broom::tidy. Note that the coefficients are given in terms of log odds, which are hard to interpret, so we convert them to odds factors, which are slightly easier to interpret.
tidy(m2.logistic) %>%
mutate(estimate.odds.factor = exp(estimate)) %>%
# rearrange column order:
select(term,
estimate,
estimate.odds.factor,
everything()) %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| term | estimate | estimate.odds.factor | std.error | statistic | p.value |
|---|---|---|---|---|---|
| (Intercept) | -10.6513306 | 0.0000237 | 0.3611574 | -29.49221 | 0 |
| balance | 0.0054989 | 1.0055141 | 0.0002204 | 24.95309 | 0 |
Every unit increase in balance is associated with the odds of default increasing by a factor of 1.0055 (i.e. by about 0.55%). If the odds factor is greater than 1, then this variable is associated with an increase in the probability of default.
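Since a one-dollar change in balance is tiny, it can help to look at the multiplicative effect over a larger change. Here's a quick sketch of my own using the fitted coefficient:

```r
# multiplicative effect on the odds of default
beta.balance <- coef(m2.logistic)["balance"]
exp(beta.balance)        # ~1.0055: odds multiply by about 1.0055 per extra dollar of balance
exp(100 * beta.balance)  # ~1.73: odds multiply by about 1.73 per extra $100 of balance
```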
Let’s create a dataset with five rows, using the 25th, 50th, 75th, 90th and 99th percentile values of balance.
# use quantile( ) to get percentiles
balance.quantiles <- df1.default.dataset %>%
pull(balance) %>%
quantile(probs = c(.25, .50, .75, .90, .99))
df4.fake.data <- data.frame(balance = balance.quantiles)
# view result:
df4.fake.data %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| | balance |
|---|---|
| 25% | 481.7311 |
| 50% | 823.6370 |
| 75% | 1166.3084 |
| 90% | 1471.6253 |
| 99% | 2008.4709 |
Now let’s use predict to see what probability our logistic model predicts for customers at these five percentiles of balance.
# in order to get output in terms of proportions (instead of log odds), use the type = "response" argument
(predict(m2.logistic,
newdata = df4.fake.data,
type = "response")) * 100 # convert proportions to percentages
## 25% 50% 75% 90% 99%
## 0.03345695 0.21887810 1.42324265 7.18251470 59.70248510
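As a sanity check (a small sketch of my own, not part of the original analysis), we can reproduce the 99th-percentile prediction by hand by applying the inverse link to the linear predictor:

```r
# manual check of the 99th-percentile prediction:
# inverse-logit(intercept + slope * balance), expressed as a percentage
b <- coef(m2.logistic)
plogis(b["(Intercept)"] + b["balance"] * balance.quantiles["99%"]) * 100
# ~59.7, matching the predict() output above
```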
Conclusions:

- For customers at or below the median balance, the predicted probability of default is negligible (well under 1%).
- The predicted probability rises sharply in the upper tail: about 7% at the 90th percentile of balance and roughly 60% at the 99th percentile.
augment(m2.logistic) %>%
# create an ID column, change fitted value to probability, etc:
mutate(id = row_number(),
default.numeric = ifelse(default == "Yes",
1,
0),
predicted.prob = m2.logistic$family$linkinv(.fitted)) %>%
# ^^ linkinv is used to invert the logistic function to get probabilities instead of log odds
# select columns for plotting:
select(id,
default.numeric,
balance,
predicted.prob) %>%
# reshape the data:
gather(key = "variable",
value = "value",
-c(id, balance)) %>%
# now plot it:
ggplot(aes(x = balance,
y = value,
group = variable,
colour = variable)) +
geom_point(alpha = 0.2)
It’s kind of fun to be able to see the predictions for your actual data points, so that’s the approach I showed above. In practice, you’ll usually just let ggplot plot a smooth curve, like this:
df1.default.dataset %>%
mutate(is.default.numeric = ifelse(default == "Yes",
1,
0)) %>%
ggplot(aes(x = balance,
y = is.default.numeric)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"))
## student as a predictor

First we'll fit a new model. Then we'll check whether the new model really is better. Finally we'll plot the new model's predictions.
# fit logistic model with 2 predictors:
m3.logistic.with.student <-
glm(default ~ balance + student,
data = df1.default.dataset,
family = "binomial")
summary(m3.logistic.with.student)
##
## Call:
## glm(formula = default ~ balance + student, family = "binomial",
## data = df1.default.dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4578 -0.1422 -0.0559 -0.0203 3.7435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
## balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
## studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
# examine coefficients:
tidy(m3.logistic.with.student) %>%
mutate(estimate.odds.factor = exp(estimate)) %>%
# rearrange column order:
select(term,
estimate,
estimate.odds.factor,
everything()) %>%
# don't use these lines; only for Rmarkdown
kable() %>%
kable_styling(bootstrap_options = c("striped",
"condensed",
"responsive"),
full_width = FALSE,
position = "left")
| term | estimate | estimate.odds.factor | std.error | statistic | p.value |
|---|---|---|---|---|---|
| (Intercept) | -10.7494959 | 0.0000215 | 0.3691914 | -29.116326 | 0.0e+00 |
| balance | 0.0057381 | 1.0057546 | 0.0002318 | 24.749526 | 0.0e+00 |
| studentYes | -0.7148776 | 0.4892520 | 0.1475190 | -4.846003 | 1.3e-06 |
Huh. The odds factor for student after controlling for balance is less than one: for the same balance, a student is less likely to default than a non-student. This result is the opposite of what we concluded from the naive analysis above.
Explanation for this change in conclusion:2
The variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with higher probability of default. In other words, students are more likely to have large credit card balances, which, as we know from the left-hand panel of the below figure, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students. This is an important distinction for a credit card company that is trying to determine to whom they should offer credit.
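One way to see this confounding in the data itself is to compare default rates for students and non-students within similar balance ranges. Here's a rough sketch (the bin cut-points are arbitrary choices of mine):

```r
# default rates by student status within balance bins:
# within comparable balance ranges, students tend to default less often,
# even though they default more often overall
df1.default.dataset %>%
  mutate(balance.bin = cut(balance,
                           breaks = c(0, 500, 1000, 1500, 2000, Inf),
                           include.lowest = TRUE)) %>%
  group_by(balance.bin, student) %>%
  summarize(num = n(),
            perc.default = round(100 * sum(default == "Yes") / n(), 2))
```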
Note that deviance is lower for this model. But is it significantly lower?
anova(m2.logistic,
m3.logistic.with.student,
test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: default ~ balance
## Model 2: default ~ balance + student
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9998 1596.5
## 2 9997 1571.7 1 24.77 6.459e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, it is significantly lower.
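The AIC values reported in the two summaries point the same way; if you prefer, you can pull them out side by side (lower is better):

```r
# compare AIC across the two models (lower is better)
AIC(m2.logistic, m3.logistic.with.student)
```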
## balance and student as predictors

df1.default.dataset %>%
mutate(is.default.numeric = ifelse(default == "Yes",
1,
0)) %>%
ggplot(aes(x = balance,
y = is.default.numeric,
group = student,
colour = student)) +
geom_jitter(alpha = 0.1,
height = .1) +
geom_smooth(method = "glm",
method.args = list(family = "binomial"))
Add income to the model as well and check whether the model improves.
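Here's one possible starting point for that exercise (the model name m4.logistic.full is just a placeholder of mine, and the conclusion is left to you):

```r
# add income as a third predictor, then compare against the two-predictor model
m4.logistic.full <- glm(default ~ balance + student + income,
                        family = "binomial",
                        data = df1.default.dataset)

anova(m3.logistic.with.student,
      m4.logistic.full,
      test = "Chisq")
```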
Actually, ordinary linear regression is also a specific case of a GLM, where the link function is just the identity function, \(f(x) = x\)↩