Learning Outcomes

  • Understand the difference between a linear and a generalised linear model
  • Be able to run and interpret a binary logistic regression model in \(R\)
  • Transform values between different scales (probabilities, odds, log odds)
  • Make predictions using model coefficients and \(R\) functions such as predict()

You will need to watch the lecture and read Chapter 10 in Andrews (2021) (pages 346 – 363) to fully grasp the concepts.

Recap: Linear regression

  • regression models are models of the probability distribution of one variable, known as the outcome variable, as it varies as a function of other variables, known as predictor variables.
  • the most basic type of regression model is the normal linear model.
  • in normal linear models, we assume that the outcome variable is normally distributed around a mean \(\mu\); the mean varies linearly with changes in a set of predictor variables.

Recap: Linear regression

\[ y_i \sim \mathcal{N}(\mu, \sigma^2) \]

  • \(y\): observed outcome variable.
  • Each observation \(y_i\) is a sample from such a normal distribution
  • indexed by \(i\) where \(i \in \{1, \dots, N\}\)
  • “\(\sim\)”: “is distributed as” / “is sampled from”.
  • \(\mathcal{N}\): normal distributed probability model with mean \(\mu\) and variance \(\sigma^2\) (or standard deviation \(\sigma\)).

Recap: Linear regression

where \(\mu\) is a linear function of the predictor \(x\): the intercept \(\beta_0\) plus the slope \(\beta_1\) multiplied by the value that \(x\) takes on.

A simple linear regression – with a single predictor \(x\) – has the form

\[ \mu_i = \beta_0 + \beta_1 \times x_i \]

Recap: Linear regression

For multiple linear regression with \(K\) predictors, the form is

\[ \mu_i = \beta_0 + \beta_1 \times x_{1i} + \beta_2 \times x_{2i} + \cdots + \beta_K \times x_{Ki} \]

or for short

\[ \mu_i = \beta_0 + \sum_{k=1}^K \beta_k \times x_{ki} \]

Essentially, \(\mu\) is the model’s predicted average of \(y\) for a given (set of) \(x\).

\[ y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad i \in \{1, \dots, N\} \] \(\in\) means “is an element of”
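As a quick illustration (not from the lecture; the coefficient values here are made up), we can simulate data from this model in R and recover the coefficients with lm():

# Simulate N = 100 observations from y ~ N(beta_0 + beta_1 * x, sigma^2)
set.seed(123)
N <- 100
x <- rnorm(N)
y <- 2 + 0.5 * x + rnorm(N, mean = 0, sd = 1) # beta_0 = 2, beta_1 = 0.5, sigma = 1
coef(lm(y ~ x)) # estimates should be close to 2 and 0.5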

Recap: Linear regression

  • As useful as this model is, it is clearly limited
  • What (sort of data) is the normal linear model limited to?

Recap: Linear regression

  • Data are continuous (rather than discrete, categorical).
  • Data are assumed to come from a normal distribution.
  • Cannot be applied to categorical outcomes or counts.
  • Generalised linear models allow us to use other distributions, with link functions that map the distribution’s parameter (e.g. a probability) onto a continuous numeric scale.
  • For example, binary logistic regression – a specific type of logistic regression – uses a Binomial / Bernoulli distribution with a logit link to model binary outcomes (yes / no, correct / incorrect).

Binary logistic model

# Show some example data
example_data
            x y
1 -0.56047565 0
2 -0.23017749 0
3  1.55870831 1
4  0.07050839 1
5  0.12928774 1
6  1.71506499 1
7  0.46091621 1
8 -1.26506123 0


Let’s say we have a set of \(n\) bivariate observations, pairing binary data \(y\) with a continuous variable \(x\):

\[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\]

where

\[y_i \in \{0, 1\}, \quad \text{for all } i \in \{1, \dots, n\}\]

(\(y\) could stand for true / false, correct / incorrect, etc.)
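As a sketch of where such data might come from (the generating coefficient 2 is made up for illustration; plogis() converts log-odds to probabilities, as we will see later), we could simulate them in R:

# Simulate binary outcomes whose probability depends on x
set.seed(123)
x <- rnorm(8)                                  # continuous predictor
y <- rbinom(8, size = 1, prob = plogis(2 * x)) # P(y = 1) increases with x
example_data <- data.frame(x, y)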

Binary logistic model

Modelling these observed outcomes as samples from a normal distribution is an extreme example of model misspecification.

Bernoulli to the rescue

The normal distribution isn’t appropriate because the data are not continuous.

\[ y_i \sim \mathcal{N}(\mu, \sigma^2) \] We can use another distribution that can handle binary outcomes.

\[ y_i \sim \text{Bernoulli}(\theta) \]

Binary logistic model: in R

Normal linear model

# lm =  linear model
lm(outcome ~ predictor_1 + predictor_2, data = data)

Binary logistic regression

# glm = generalised linear model
glm(outcome ~ predictor_1 + predictor_2, data = data, family = binomial(link = "logit"))

binomial(link = "logit") models data as coming from a Binomial / Bernoulli distribution where the logit links maps probabilities to a continuous numeric space.

Binary logistic model

The Bernoulli distribution has a single parameter, \(\theta\) (theta).

\(P(y = 1) = \theta\):

  • \(\theta\) is the probability that the outcome variable takes a value of 1.

\(P(y = 0) = 1 - \theta\):

  • Example: if \(\theta = .7\), what’s \(P(y = 0)\)?
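A quick sketch to see \(\theta\) in action: rbinom() with size = 1 draws Bernoulli samples (seed and sample size are arbitrary).

set.seed(1)
y <- rbinom(n = 1000, size = 1, prob = 0.7) # 1000 Bernoulli draws with theta = 0.7
mean(y)      # proportion of 1s; should be close to theta
mean(y == 0) # proportion of 0s; should be close to 1 - theta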

Binary logistic model

\(\theta\) controls the relative heights of the two bars in a bar plot of the outcome: the proportions of 0s and 1s.

Real data analysis

# load packages
library(tidyverse)

# load smoking data (link can be found on NOW)
smoking_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv") 

# select variables of interest
smoking_df <- select( smoking_df, cigs, age )

# look at the data
glimpse( smoking_df )
Rows: 807
Columns: 2
$ cigs <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 10, 20, 30, 0, 0, 20, 0, 30, 0, 20,…
$ age  <dbl> 46, 40, 58, 30, 17, 86, 35, 48, 48, 31, 27, 24, 71, 44, 22, 29, 3…

Real data analysis

What’s the probability that a person smokes? How does this probability vary as a function of other factors?

Lots of cigs = 0 (non-smokers); smokers vary in terms of how much they smoke.

cigs is NOT binary (smokers, non-smokers) but a count.
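A quick check confirms this (any similar summary works):

# cigs is a count with many zeros
summary( smoking_df$cigs )
# proportion of zeros (497 / 807, as we'll see below)
mean( smoking_df$cigs == 0 )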

Real data analysis

# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)
# How many smokers and nonsmokers are in the data?
count( smoking_df, is_smoker )
# A tibble: 2 × 2
  is_smoker     n
  <lgl>     <int>
1 FALSE       497
2 TRUE        310

Real data analysis

# Probability of smokers (same as `mean(smoking_df$is_smoker)`)
310 / ( 310 + 497 )
[1] 0.3841388
# Overall probability of smokers in odds
0.3841388 / ( 1 - 0.3841388 )
[1] 0.6237425
# Overall probability of smokers in log-odds (logits)
log( 0.6237425 )
[1] -0.4720177
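And to go back the other way (the odds-to-probability step uses a formula we’ll meet again shortly; the values match the output above):

# Log-odds back to odds
exp( -0.4720177 )
[1] 0.6237425
# Odds back to probability
0.6237425 / ( 1 + 0.6237425 )
[1] 0.3841388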

Binary logistic regression

lm(outcome ~ predictor, data = data) wouldn’t return an error, but it wouldn’t do what you want it to do.

m_0 <- glm(is_smoker ~ 1, data = smoking_df, family = binomial(link = 'logit'))
  • family defines the distribution family.
  • link = 'logit' maps probability to logits (log-odds) to allow for linear modelling.

The intercept is the average log-odds of smoking (in the population that these data come from).

summary( m_0 )$coef # compare Intercept estimate to previous slide!
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.4720177 0.07237319 -6.521997 6.937742e-11

Binary logistic regression

Research Question: To what extent does age affect the probability of smoking?

m_smoke <- glm(is_smoker ~ age, data = smoking_df, family = binomial(link = 'logit'))
summary( m_smoke )$coef
               Estimate  Std. Error    z value    Pr(>|z|)
(Intercept)  0.02396177 0.190206271  0.1259778 0.899749472
age         -0.01215431 0.004355148 -2.7907922 0.005257921
  • The coefficient table is just like a coefficient table for a linear model.
  • It displays the linear relationship between the predictor (age) and the log-odds of being a smoker.
  • What’s the answer to our research question?

Binary logistic regression

summary(m_smoke)$coef
               Estimate  Std. Error    z value    Pr(>|z|)
(Intercept)  0.02396177 0.190206271  0.1259778 0.899749472
age         -0.01215431 0.004355148 -2.7907922 0.005257921
  • The sign of the age estimate (-0.01215431; shown in log-odds) is negative; what does this mean?
  • The slope estimate shows the (linear) change in the log-odds (logits) of smoking for each additional year of age (see the sketch below).
  • Std. Error: uncertainty in the estimate.
  • z is the test statistic used to obtain the p-value; it is similar to the t-value in linear regression, but we don’t require degrees of freedom.
  • Pr(>|z|): the probability of obtaining a z of the same absolute value, or something larger, if age really has no effect on smoking.
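To make the slope more tangible, exponentiating it converts the additive change in log-odds into a multiplicative change in odds (a sketch; exp() is the inverse of log()):

exp( coef(m_smoke)['age'] ) # approx. 0.988: the odds of smoking shrink slightly with each year of age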

Logit-link function

There and back again

Odds tell you how much more likely an event is to happen than not:

\[\text{odds}=\frac{\theta}{1-\theta}\]

Taking the natural log of the odds gives the log-odds (logit):

\[\text{logit}(\theta)=\log\!\left(\text{odds}\right)\]

Exponentiation turns log-odds into odds:

\[\text{odds}=e^{\text{logit}(\theta)}\] Converting log-odds back to probability:

\[\theta = \frac{1}{1+e^{-\text{logit}(\theta)}} \]

Function for probability to log-odds

\[ \log\!\left(\frac{\theta}{1-\theta}\right) \]

# probability to logit
logit <- function( x ) log( x / ( 1 - x ) )
logit( .6 ) # prob to logit
[1] 0.4054651
# or R function
qlogis( .6 )
[1] 0.4054651

Implement function for log-odds to probability

\[ \frac{1}{1+e^{-x}} \] where \(x\) is a value in log-odds.

# logit to probability
ilogit <- function( x ) 1 / ( 1 + exp( -x ) )
ilogit( 0.4054651 )
[1] 0.6
# or R function
plogis( 0.4054651 )
[1] 0.6

Converting probabilities into log-odds

# Here are some low, medium and high probability values:
theta_low <- 0.01  # low prob of success 
theta_mid <- 0.5   # equal prob of success 
theta_high <- 0.99 # high prob of success 
# Logits of low probability
log( theta_low / ( 1 - theta_low ) )
[1] -4.59512
logit( theta_low ) # or use qlogis()
[1] -4.59512

Work out the logit values for theta_mid and theta_high using the formula and one of the two functions.

Converting between probabilities and log-odds

If the probability of bad weather is 0.75, what are the odds that we’ll have bad weather?

\[ \text{odds} = \frac{\theta}{1 - \theta} \] Use \(R\) as a calculator or solve it by hand!

Converting between probabilities and log-odds

If the probability of bad weather is \(\theta = 0.75\), what are the odds that we’ll have bad weather?

p <- 0.75
p / ( 1 - p ) 
[1] 3

Probability of \(\theta = 0.75\) means that we are three times more likely to have bad weather than good weather.

Converting between probabilities and log-odds

If the odds of bad weather are 4 (i.e. 4:1), what is the probability that we will have bad weather?

\[ \theta = \frac{\text{odds}}{1 + \text{odds}} \] Use \(R\) as a calculator!

Converting between probabilities and log-odds

If the odds of bad weather are 4 (i.e. 4:1), what is the probability that we will have bad weather?

odds <- 4
odds / ( 1 + odds )
[1] 0.8

Converting between probabilities and log-odds

If there is a probability of \(\theta = 0.9\) that an intervention improves children’s reading, what are its log-odds?

Compute the odds:

\[ \text{odds} = \frac{\theta}{1 - \theta} \]

Take the logarithm of the odds to get the log-odds (also called logit):

\[ \text{log-odds} = \log(\text{odds}) \]

Use \(R\) as a calculator!

Converting between probabilities and log-odds

If there is a probability of \(\theta = 0.9\) that an intervention improves children’s reading, what are its log-odds?

\[ \text{odds} = \frac{\theta}{1 - \theta} = \frac{0.9}{1 - 0.9} = \frac{0.9}{0.1} = 9 \]

p <- 0.9
p / (1 - p)
[1] 9

Take the logarithm of the odds to get the log-odds (also called logit):

\[ \text{log-odds} = \log(\text{odds}) = \log(9) \approx 2.197225 \]

log(9)
[1] 2.197225

Converting between probabilities and log-odds

The log-odds on the previous slide were 2.1972246. Convert this value into

  1. Odds (the inverse of log(x) is exp(x)).
  2. Probabilities (using plogis(x)).

(replace x correctly)

Converting between probabilities and log-odds

The log-odds on the previous slide were 2.1972246. Convert this value into

Odds (the inverse of log(x) is exp(x)):

exp(2.197225)
[1] 9.000004

Probabilities (using plogis(x)):

plogis(2.197225)
[1] 0.9

Probability and log-odds

  • When probability = 0.5, log-odds are 0 (equal).
  • When probabilities are small, log-odds are negative.
  • When probabilities are large, log-odds are positive.
  • Probabilities are bound between 0 and 1; log-odds aren’t.
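A quick check of all four points with qlogis() (probability to log-odds):

qlogis( c(0.1, 0.5, 0.9) ) # small, equal, large probabilities
[1] -2.197225  0.000000  2.197225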

Predicting outcomes

What are the log-odds of being a smoker at any given age?

What are the odds of being a smoker at any given age?

What is the probability of being a smoker at any given age?

Binary logistic regression

To answer these questions, we use the linear regression equation.

\[ \log\left(\frac{\theta}{1 - \theta}\right) = \beta_0 + \beta_1 \times x_1 + \beta_2 \times x_2 + \cdots + \beta_K \times x_K \]

  • In normal linear regression, the linear function returns an average in the same unit as the outcome.
  • In binary logistic regression, the linear function returns the average of the outcome in log-odds.

For example, for the log-odds of being a smoker at 25 years of age, this simplifies to:

\[ \log\left(\frac{\theta}{1 - \theta}\right) = \beta_0 + \beta_1 \times 25 \] where \(\theta\) is the probability of being a smoker at age 25.

Predicting outcomes: model coefficients

# Extract model coefficients
b <- coef(m_smoke)
b
(Intercept)         age 
 0.02396177 -0.01215431 

Predicting outcomes: log-odds

What are the log-odds of being a smoker at 25 years of age?

b[1] + b[2] * 25 # linear function of age = 25
[1] -0.279896

Indexing the first value [1] for intercept and second value [2] for age slope.

# or without variable indexing
0.02396177 + -0.01215431 * 25
[1] -0.279896

Predicting outcomes: log-odds

What are the odds of being a smoker at 25 years of age?

exp(b[1] + b[2] * 25) 
[1] 0.7558623

Indexing the first value [1] for intercept and second value [2] for age slope.

# or without variable indexing
exp(0.02396177 + -0.01215431 * 25)
[1] 0.7558624

Predicting outcomes: probabilities

But what does a log-odds of -0.28 actually mean?

What is the probability of being a smoker at 25 years of age?

The inverse logit function allows us to convert from log-odds to probability.

\[ \theta = \frac{1}{1 + e^{-\text{logit}(\theta)}} \] We implemented ilogit(x) before, or use the base-R function plogis(x).

Predicting outcomes: probabilities

\[ \log\left(\frac{\theta}{1 - \theta}\right) = \beta_0 + \beta_1 \times 25 \]

which is

logit_smoke_age25 <- b[1] + b[2] * 25 # linear function of age = 25
plogis( logit_smoke_age25 ) # using plogis to convert log odds to prob
[1] 0.4304793
1 / ( 1 + exp( -logit_smoke_age25 ) ) # using the formula we implemented in ilogit(x)
[1] 0.4304793

Predicting outcomes: exercise

What are the log-odds of being a smoker at 50 years of age?

What are the odds of being a smoker at 50 years of age?

What is the probability of being a smoker at 50 years of age?

Predicting outcomes: exercise

What are the log-odds of being a smoker at 50 years of age?

b[1] + b[2] * 50 
[1] -0.5837539

What are the odds of being a smoker at 50 years of age?

exp(b[1] + b[2] * 50)
[1] 0.5578005

What is the probability of being a smoker at 50 years of age?

plogis(b[1] + b[2] * 50)
[1] 0.3580693

Predicting multiple outcomes: manual approach

We can make predictions across larger ranges of predictor values, for hypothetical participants (of ages that we did and did not observe!).

# Create a vector of the ages that you're interested in.
age <- seq( from = 15, to = 80, by = 10 )
age
[1] 15 25 35 45 55 65 75
# Use this age vector in the regression formula.
b[1] + b[2] * age # in log-odds
[1] -0.1583529 -0.2798960 -0.4014392 -0.5229823 -0.6445254 -0.7660686 -0.8876117

Try this for ages 20, 25, 30, 35, 40 using seq!

Predicting multiple outcomes: predict

# Step 1. Make a new data frame with range of values of age.
smoking_df2 <- tibble(age = seq(from = 15, to = 80, by = 10))
# Step 2. Use model from earlier to generate predictions.
predict(m_smoke, newdata = smoking_df2) # in log-odds
         1          2          3          4          5          6          7 
-0.1583529 -0.2798960 -0.4014392 -0.5229823 -0.6445254 -0.7660686 -0.8876117 

Predicting multiple outcomes: add_predictions

# Alternative to Step 2 (also log-odds)
library(modelr)
add_predictions(data = smoking_df2, # name of data frame 
                model = m_smoke)    # name of model

# A tibble: 7 × 2
    age   pred
  <dbl>  <dbl>
1    15 -0.158
2    25 -0.280
3    35 -0.401
4    45 -0.523
5    55 -0.645
6    65 -0.766
7    75 -0.888

Try one of these approaches – predict or add_predictions – for ages 20, 25, 30, 35, 40!

Predicting multiple outcomes: probabilities

Specify type = 'response' to get predictions as probabilities.

# Base R function for vector of predictions
predict(m_smoke, newdata = smoking_df2, type = 'response')

# Alternative using `modelr` for output as data frame
library(modelr)
add_predictions(data = smoking_df2, model = m_smoke, type = 'response')

Predicting multiple outcomes: probabilities

add_predictions(data = smoking_df2, # name of data frame 
                model = m_smoke,    # name of model
                type = 'response')  # return probabilities
# A tibble: 7 × 2
    age  pred
  <dbl> <dbl>
1    15 0.460
2    25 0.430
3    35 0.401
4    45 0.372
5    55 0.344
6    65 0.317
7    75 0.292
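As a final sketch (not part of the exercise), we could plot the predicted probabilities over the whole age range with ggplot2, which is loaded with the tidyverse:

# Predict over a fine grid of ages, then draw the curve
smoking_grid <- tibble(age = 15:80)
smoking_grid <- add_predictions(smoking_grid, model = m_smoke, type = 'response')
ggplot(smoking_grid, aes(x = age, y = pred)) +
  geom_line() +
  labs(x = 'Age', y = 'P(is_smoker)')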

Predicting multiple outcomes: probabilities

Try one or both of these approaches – predict or add_predictions – to predict the probability of smoking for ages 20, 25, 30, 35, 40. Replace "..."!

# Step 1. Make a new data frame with range of values of age.
smoking_df2 <- tibble(age = seq(from = ..., to = ..., by = ...))
# Step 2. Use model from earlier to generate predictions.
predict(object = ...,  # name of model
        newdata = ..., #  name of data frame to make predictions for
        type = '...')  # return probabilities
# Alternative to Step 2 
library(modelr)
add_predictions(data = ...,     # name of data frame to use for predictions
                model = ...,    # name of model
                type = '...')   # return probabilities

References

Andrews, M. (2021). Doing Data Science in R: An Introduction for Social Scientists. SAGE Publications Ltd.