What have we done; what do we need to do still?

  • model definition ✅
  • relating log-odds to probabilities ✅
  • prediction ✅

What have we done; what do we need to do still?

  • model definition ✅
  • relating log-odds to probabilities ✅
  • prediction ✅
  • meaning of coefficients
  • model comparison

What have we done; what do we need to do still?

This is the format we’ll be applying for the upcoming weeks:

  • model definition
  • interpretation of coefficients
  • generating model predictions
  • comparing (nested) models

Binary logistic regression

When \[y_i \in \{0, 1\}, \quad \text{for all } i \in 1 \ldots n.\]

(i.e., each \(y_i\) is a binary variable that takes a value of either 0 or 1, for each of the \(n\) observations)

We can model binary data by assuming that for all \(i \in 1 \ldots n\) (i.e., for every observation \(i\) of \(n\) samples),

\[ y_i \sim \text{Bernoulli}(\theta_i), \\ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \beta_0 + \beta_1 \times x_i. \]

\(\dots\) each observed outcome value \(y_1, y_2, \ldots, y_n\) is a sample from a Bernoulli distribution with parameter \(\theta_i\), and the log odds of each \(\theta_i\) is a linear function of the corresponding predictor value \(x_i\).
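To make these two lines concrete, here is a minimal sketch in R that generates data from such a model (the coefficient values beta_0 and beta_1 below are made-up illustrations, not estimates from any data set):

# A sketch of the model with made-up coefficient values
set.seed(123)
n <- 10
x <- rnorm(n)                           # a predictor variable
beta_0 <- -1; beta_1 <- 2               # hypothetical coefficients
log_odds <- beta_0 + beta_1 * x         # linear function on the log-odds scale
theta <- plogis(log_odds)               # convert log-odds to probabilities
y <- rbinom(n, size = 1, prob = theta)  # Bernoulli samples of 0s and 1s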

Implementation in R

# load packages
library(tidyverse)

# load smoking data (link can be found on NOW)
smoking_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv") 

# select relevant variables
smoking_df <- select(smoking_df, cigs, age, educ)

# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)

# Show data
smoking_df
# A tibble: 807 × 4
    cigs   age  educ is_smoker
   <dbl> <dbl> <dbl> <lgl>    
 1     0    46  16   FALSE    
 2     0    40  16   FALSE    
 3     3    58  12   TRUE     
 4     0    30  13.5 FALSE    
 5     0    17  10   FALSE    
 6     0    86   6   FALSE    
 7     0    35  12   FALSE    
 8     0    48  15   FALSE    
 9     0    48  12   FALSE    
10     0    31  12   FALSE    
# ℹ 797 more rows

Implementation in R

# Whether or not someone is a smoker modelled as a function of age.
m_age <- glm(is_smoker ~ age, data = smoking_df, family = binomial(link = 'logit'))

# Show model coefficients
summary( m_age )$coef
               Estimate  Std. Error    z value    Pr(>|z|)
(Intercept)  0.02396177 0.190206271  0.1259778 0.899749472
age         -0.01215431 0.004355148 -2.7907922 0.005257921

Converting between probability and log-odds

Probabilities \(\rightarrow\) odds \(\rightarrow\) log odds \(\rightarrow\) odds \(\rightarrow\) probabilities

# A probability of 
p <- 0.75 

# corresponds to odds of
odds <- p / (1 - p)  
odds 
[1] 3
# which correspond to log odds / logits of
logodds <- log(odds) # the natural logarithm of the odds 
logodds
[1] 1.098612

Converting between probability and log-odds

# which correspond to odds of
odds <- exp(logodds)
# exponentiation of the logit (log odds) to get the odds
odds
[1] 3
# which correspond to a probability
odds / (1 + odds) 
[1] 0.75

Odds are always ratios: odds of 3 mean that there are 3 chances of something occurring for every 1 chance of it not occurring.

Converting between probability and log-odds

# Use plogis to get the probability of the log odds
plogis(logodds) 
[1] 0.75
# Use qlogis to get the log odds of a probability
qlogis(p)  
[1] 1.098612

Converting between probability and log-odds

  • When probability = 0.5, log-odds are 0 (equal).
  • When probabilities are small, log-odds are negative.
  • When probabilities are large, log-odds are positive.
  • Probabilities are bound between 0 and 1; log-odds aren’t.

Converting between probability and log-odds

# Here are some low, medium and high probability values:
theta_low <- 0.01  # low prob of success 
theta_mid <- 0.5   # equal prob of success 
theta_high <- 0.99 # high prob of success 
# Logit of the low probability via the formula
log( theta_low / ( 1 - theta_low ) )
[1] -4.59512
# or via qlogis
qlogis( theta_low )
[1] -4.59512

Work out the logit values for theta_mid and theta_high using the formula and one of the two functions.

Converting between probability and log-odds

If the probability of bad weather is 0.75, what are the odds that we’ll have bad weather?

\[ \text{odds} = \frac{\theta}{1 - \theta} \] Use \(R\) as a calculator or solve it by hand!

Converting between probability and log-odds

If the probability of bad weather is \(\theta = 0.75\), what are the odds that we’ll have bad weather?

p <- 0.75
p / ( 1 - p ) 
[1] 3

A probability of \(\theta = 0.75\) means that we are three times more likely to have bad weather than good weather.

Converting between probability and log-odds

If the odds of having bad weather are 4 (i.e. 4:1), what is the probability that we will have bad weather?

\[ \theta = \frac{\text{odds}}{1 + \text{odds}} \] Use \(R\) as a calculator!

Converting between probability and log-odds

If the odds of having bad weather are 4 (i.e. 4:1), what is the probability that we will have bad weather?

odds <- 4
odds / ( 1 + odds )
[1] 0.8

Converting between probability and log-odds

If there is a probability of \(\theta = 0.9\) that an intervention improves children's reading, what is its log odds?

Compute the odds:

\[ \text{odds} = \frac{\theta}{1 - \theta} \]

Take the logarithm of the odds to get the log-odds (also called logit):

\[ \text{log-odds} = \log(\text{odds}) \]

Use \(R\) as a calculator!

Converting between probability and log-odds

If there is a probability of \(\theta = 0.9\) that an intervention improves children's reading, what is its log odds?

\[ \text{odds} = \frac{\theta}{1 - \theta} = \frac{0.9}{1 - 0.9} = \frac{0.9}{0.1} = 9 \]

p <- 0.9
p / (1 - p)
[1] 9

Take the logarithm of the odds to get the log-odds (also called logit):

\[ \text{log-odds} = \log(\text{odds}) = \log(9) \approx 2.197225 \]

log(9)
[1] 2.197225

Converting between probability and log-odds

The log-odds on the previous slide were 2.1972246. Convert this value into

  1. Odds (the opposite of log(x) is exp(x)).
  2. Probabilities (using plogis(x)).

(replace x correctly)

Converting between probability and log-odds

The log-odds on the previous slide were 2.1972246. Convert this value into

Odds (the opposite of log(x) is exp(x)):

exp(2.197225)
[1] 9.000004

Probabilities (using plogis(x)):

plogis(2.197225)
[1] 0.9

Predicting outcomes with add_predictions

# Step 1. Make a new data frame with range of values of age.
smoking_df2 <- tibble(age = seq(from = 15, to = 80, by = 10))
# Step 2. Add predictions (by default on the log-odds scale)
library(modelr)
add_predictions(data = smoking_df2, # name of data frame 
                model = m_age)      # name of model

Use add_predictions for ages 20, 21, 22, 23, 24, 25, 26, 27, \(\dots\), 75, 76, 77, 78, 79, 80, 81
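A minimal sketch of one way to set this up (the data frame name smoking_df_ages is just an assumption):

# Ages 20 to 81 in steps of 1
smoking_df_ages <- tibble(age = 20:81)
add_predictions(data = smoking_df_ages, model = m_age)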

Predicting outcomes as probabilities

add_predictions(data = smoking_df2, # name of data frame 
                model = m_age,      # name of model
                type = 'response')  # return probabilities
# A tibble: 7 × 2
    age  pred
  <dbl> <dbl>
1    15 0.460
2    25 0.430
3    35 0.401
4    45 0.372
5    55 0.344
6    65 0.317
7    75 0.292

Plotting predicted outcomes

# Generate predictions
d_predict <- add_predictions(data = smoking_df2, 
                             model = m_age, 
                             type = 'response')

# Plot predicted values
library(psyntur)
scatterplot(data = d_predict, x = age, y = pred)

Compare the predicted log-odds and probabilities.
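A minimal sketch of such a comparison, adding both scales to the same data frame (the variable names pred_logodds and pred_prob are just assumptions):

# Predictions on the log-odds (link) scale
d_compare <- add_predictions(data = smoking_df2, model = m_age, var = 'pred_logodds')
# Add predictions on the probability scale
d_compare <- add_predictions(data = d_compare, model = m_age, var = 'pred_prob', type = 'response')
# Note that plogis(pred_logodds) equals pred_prob
d_compare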

Multiple binary logistic regression

For ease of illustration consider a binary logistic regression with educ as binary categorical predictor:

# Create a categorical variable for `educ` 
smoking_df <- mutate(smoking_df, educ = case_when(educ > 12 ~ "o12y", TRUE ~ "u12y"))
# Binary logistic regression
m_educ_age <- glm(is_smoker ~ educ + age, data = smoking_df, family = binomial(link = "logit"))
# Inferential statistics for model coefficients
summary(m_educ_age)$coef
               Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.24790348 0.20534132 -1.207275 0.2273261975
educu12y     0.56029969 0.15267484  3.669889 0.0002426559
age         -0.01387864 0.00439195 -3.160019 0.0015775895

Interpretation of coefficients

coef(m_educ_age)
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 
  • Log-odds of being a smoker is a linear function of the predictors
  • For any change in the predictors, there is a constant change in the log-odds of being a smoker.
    • the coefficient for educ (0.5602997) suggests that for people with less than 12 years of education (u12y), the log-odds of being a smoker increase by 0.5602997, relative to people with more than 12 years of education.
    • likewise, for age the change in log-odds is the same whether age goes from 18 to 19 years, from 25 to 26 years, etc. (see the sketch below).
  • What does it mean for the log-odds to increase by 0.5602997?
  • One way of getting around the lack of intuition around log-odds is to talk about the meaning of the coefficients in terms of odds ratios.
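A minimal sketch of this constancy on the log-odds scale (the data frame ages_df and the choice to hold education at "u12y" are assumptions for illustration):

# Predicted log-odds for selected ages, holding education fixed at "u12y"
ages_df <- tibble(educ = 'u12y', age = c(18, 19, 25, 26))
pred_df_ages <- add_predictions(data = ages_df, model = m_educ_age)
# The change from 18 to 19 equals the change from 25 to 26: the age coefficient
pred_df_ages$pred[2] - pred_df_ages$pred[1]
pred_df_ages$pred[4] - pred_df_ages$pred[3]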

Risk ratios and odds ratios

Differences can be expressed in terms of probabilities: ratios of probabilities are referred to as risk ratios.

\[ \text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})} \] Create a grid for all combinations of educ and age.

smoking_df3 <- expand_grid(educ = c('o12y', 'u12y'), age = seq(18, 80, 20))

Obtain predictions as probabilities.

pred_df <- add_predictions(data = smoking_df3, model = m_educ_age, type='response')

Convert data frame to wide format (one column per level of educ).

pred_df_wide <- pivot_wider(pred_df, names_from = educ, values_from = pred)

Risk ratios and odds ratios

Calculate the risk ratios

\[ \text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})} \]

Note, these values aren’t constant across age!

mutate(pred_df_wide, prob_ratio = u12y / o12y)
# A tibble: 4 × 4
    age  o12y  u12y prob_ratio
  <dbl> <dbl> <dbl>      <dbl>
1    18 0.378 0.516       1.36
2    38 0.315 0.446       1.42
3    58 0.259 0.379       1.47
4    78 0.209 0.316       1.51

Risk ratios and odds ratios

mutate(pred_df_wide,
       odds_o12y = o12y / (1 - o12y),
       odds_u12y = u12y / (1 - u12y),
       odds_ratio = odds_u12y / odds_o12y)
# A tibble: 4 × 6
    age  o12y  u12y odds_o12y odds_u12y odds_ratio
  <dbl> <dbl> <dbl>     <dbl>     <dbl>      <dbl>
1    18 0.378 0.516     0.608     1.06        1.75
2    38 0.315 0.446     0.461     0.807       1.75
3    58 0.259 0.379     0.349     0.611       1.75
4    78 0.209 0.316     0.264     0.463       1.75

Unlike ratios of probabilities, ratios of odds are independent of age (or any other predictor). The odds are the ratio of a probability to its complement, i.e. \(\frac{p}{1-p}\). Expressed in maths, the odds ratio can be represented as

\[ \text{odds ratio} = \frac{\frac{p}{1 - p}}{\frac{q}{1 - q}} \]

Risk ratios and odds ratios

This odds ratio can be obtained by taking the exponential of the education coefficient, \(e^{\beta_{\text{educ}}}\), where \(e\) is a constant (Euler's number, approximately 2.718282), like \(\pi\).

# Extract the model coefficients
betas <- coef(m_educ_age)
betas
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 
# Get odds ratio via Euler’s number
e <- exp(1)
e^betas[2]
educu12y 
1.751197 
# or simply using exponential function
exp(betas[2])
educu12y 
1.751197 

Risk ratios and odds ratios

# or simply using exponential function
exp(betas[2])
educu12y 
1.751197 

The interpretation is that the odds of being a smoker for individuals with less than 12 years of education are 1.7511972 times greater than the odds of being a smoker for individuals with more than 12 years of education.

So while betas[2] shows you the coefficient of education in log odds, exp(betas[2]) returns the coefficient as an odds ratio.

What are the odds of an 18 year old being a smoker compared to a 19 year old, regardless of education?

Risk ratios and odds ratios

How do we get the confidence interval for the same coefficient as odds ratio?

(ci <- confint.default(m_educ_age, parm = "educu12y")) # log odds
             2.5 %    97.5 %
educu12y 0.2610625 0.8595369

This is the range of values that would not be rejected in a null hypothesis test. These are all plausible values for the parameter of interest (the difference in the log-odds of being a smoker for less than vs more than 12 years of education).

exp(ci) # odds ratio
            2.5 %   97.5 %
educu12y 1.298309 2.362067

If the confidence interval of the odds ratio contains 1, we cannot reject the null hypothesis. Yes 1, not 0! The odds are equal if the odds ratio is 1. In contrast, on the log-odds scale the null hypothesis is a zero difference, because log(1) = 0.

Summarising coefficient interpretation

Changes in the outcome variable can be presented as probabilities, log odds and odds.

  • logodds / logits show constant changes but are unintuitive.
  • probability is intuitive but not constant.
  • odds are constant and reasonably intuitive.

The odds ratio is the factor by which the odds of being a smoker change with a unit change in the predictor (see the sketch below)!
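As a quick sketch, exponentiating all of the model coefficients returns these factors (odds ratios) in one step:

# Coefficients on the log-odds scale ...
coef(m_educ_age)
# ... and as odds ratios: the factor by which the odds change per unit change
exp(coef(m_educ_age))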

Summarising coefficient interpretation

What do we mean by a factor? We mean the odds are MULTIPLIED by it.

  • a factor of 2 means something is twice as likely.
  • a factor of 1 means something is equally likely.
  • a factor of 0.5 means something is half as likely.

As an aside, if you get a question asking "what is the factor by which the odds of being a smoker increase for a unit change in education", I'm really asking you about the odds ratio.

Model fit statistics and model comparisons

  • log-likelihood / deviance
  • log-likelihood ratio tests
    • Normal linear models: F test
    • Binary logistic regression: \(\chi^2\) test (Chi^2)

Model fit statistics and model comparisons

summary(m_educ_age) # see deviances on bottom
Call:
glm(formula = is_smoker ~ educ + age, family = binomial(link = "logit"), 
    data = smoking_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.247903   0.205341  -1.207 0.227326    
educu12y     0.560300   0.152675   3.670 0.000243 ***
age         -0.013879   0.004392  -3.160 0.001578 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1075.0  on 806  degrees of freedom
Residual deviance: 1053.3  on 804  degrees of freedom
AIC: 1059.3

Number of Fisher Scoring iterations: 4

Model fit statistics and model comparisons

  • Residual deviance: deviance of the model we’re looking at
  • Null deviance: deviance of a reference model with no predictors
  • A lower deviance indicates a better model fit
deviance(m_educ_age)
[1] 1053.306

This deviance is based on the log-likelihood

\[ \text{deviance} = -2 \times \text{log likelihood} \]

Log-likelihood

The calculation is based on the observed data and the predicted probabilities

# Extract the outcome variable
y <- smoking_df$is_smoker
# Generate predicted values on the prob scale
p_hat <- predict(m_educ_age, type = "response")
# Formula for log-likelihood for binomial models
sum( y * log(p_hat) + (1 - y) * log(1 - p_hat)) # don't worry about this formula
[1] -526.6529

and thus captures, overall, how unsurprised your model is by the data it actually saw (higher values mean the model is less surprised by the data, i.e. a better fit). In short,

logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)

Log-likelihood to deviance

Minus two times the log-likelihood is the deviance, so

-2 * logLik(m_educ_age)
'log Lik.' 1053.306 (df=3)

is the same as

deviance(m_educ_age)
[1] 1053.306

Log-likelihood to deviance

and

deviance(m_educ_age) / -2
[1] -526.6529

is the same as

logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)

The deviance doesn’t tell you more than what you already know from the log-likelihood.

Comparing nested models using \(\chi^2\) test

  • We used \(R^2\) for normal linear models, which has an intuitive and absolute interpretation.
  • Log-likelihood and deviance can only be understood in relative terms, relative to other, specifically nested, models.

For example, the two binary logistic models \(\text{is_smoker} \sim \text{Bernoulli}(\theta_i)\) that we have considered so far have the predictor structures

\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i \] and

\[ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i + \beta_2 \times \text{educ}_i. \] Model \(\mathcal{M_1}\) is nested in model \(\mathcal{M_2}\) because \(\mathcal{M_1}\) contains a subset of the predictors of \(\mathcal{M_2}\).

Comparing nested models using \(\chi^2\) test

The deviance of model \(\mathcal{M_2}\) is

deviance(m_educ_age)
[1] 1053.306

What’s the deviance of \(\mathcal{M_1}\)?

Which one of these two models is better at predicting the data?

What does it mean that one model is “better”?

The model is better at identifying a smoker on the basis of their age and education compared to just knowing their age. Education can tell us something that age alone can’t.

Comparing nested models using \(\chi^2\) test

\(\dots\) using the difference in deviance:

deviance(m_educ_age) - deviance(m_age)
[1] -13.762

The deviance of \(\mathcal{M_2}\) is around 13.762 units lower.

We can roughly interpret these differences in deviances as follows:

  • if less than 4, inconclusive
  • if between 4 and 10, conclusive
  • if above 10, definitive.

Comparing nested models using \(\chi^2\) test

A meaningful difference in deviance should be at least 2 times the difference in the number of parameters between the models.

How many parameters are in \(\mathcal{M_2}\)?

coef(m_educ_age)
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 

It’s 3: in a binary logistic regression the number of coefficients is the number of parameters (in normal linear models there is also the standard deviation). And it’s 2 for \(\mathcal{M_1}\)

coef(m_age)
(Intercept)         age 
 0.02396177 -0.01215431 

So the difference in deviances should be at least 2 (our difference in deviances was 13.762).

Comparing nested models using \(\chi^2\) test

  • A statistical test comparing these deviances can be implemented using anova.
  • We are using a \(\chi^2\) statistic for binary logistic regression, not an \(F\)-test as for normal linear models.
  • This is because the sampling distribution of the difference in deviances is close to a \(\chi^2\) distribution, not an \(F\)-distribution.
anova(m_age, m_educ_age, test = 'Chisq')
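To see why a \(\chi^2\) distribution is the right reference, here is a minimal simulation sketch (entirely made-up data, not part of the smoking example): when the smaller model is the true one, the difference in deviances between the two fitted models behaves approximately like a \(\chi^2\) variable with degrees of freedom equal to the difference in the number of parameters.

# Simulation sketch with made-up data: the smaller model (no effect of x2) is true
set.seed(101)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
diff_devs <- replicate(1000, {
  y <- rbinom(n, size = 1, prob = plogis(0.5 * x1))  # x2 has no effect on y
  m_small <- glm(y ~ x1, family = binomial())
  m_large <- glm(y ~ x1 + x2, family = binomial())
  deviance(m_small) - deviance(m_large)
})
# Proportion of simulated differences above the 95th percentile of chi-squared(1)
mean(diff_devs > qchisq(0.95, df = 1))  # should be roughly 0.05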

Comparing nested models using \(\chi^2\) test

  • The p-value comes from a \(\chi^2\) distribution with Df degrees of freedom (the difference in the number of parameters).
  • The difference in the deviances can be found in the Deviance column.
anova(m_age, m_educ_age, test = 'Chisq')
Analysis of Deviance Table

Model 1: is_smoker ~ age
Model 2: is_smoker ~ educ + age
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       805     1067.1                      
2       804     1053.3  1   13.762 0.0002075

The p-value is the probability of obtaining a difference in deviances of 13.76 in a \(\chi^2\) distribution with 1 degree of freedom if the model with both educ and age is identical to the model with only age.

diff_deviance <- deviance(m_age) - deviance(m_educ_age)
pchisq(diff_deviance, df = 1, lower.tail = FALSE)
[1] 0.0002074912

Difference in deviances and \(\chi^2\) distribution

So let's say you have two models \(\mathcal{M_a}\) and \(\mathcal{M_b}\). \(\mathcal{M_a}\) is fully nested in \(\mathcal{M_b}\). \(\mathcal{M_a}\) has 2 parameters, and \(\mathcal{M_b}\) has 5. \(\mathcal{M_a}\) has a deviance of 500, and \(\mathcal{M_b}\) has a deviance of 490. What is the corresponding p-value of their difference?

pchisq(---, df = ---, lower.tail = FALSE)

Exercise

Reload the data to select new variables:

# load smoking data (link can be found on NOW)
smoking_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv") 

# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)

# select relevant variables
smoking_df <- select(smoking_df, is_smoker, white, educ, lincome)

We will use white (binary), educ (as a continuous variable), and the log of income (lincome).

Implement the following two models with \(\text{is_smoker}_i \sim \text{Bernoulli}(\theta_i)\):

\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income})\\ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income}) + \beta_2 \times \text{education} + \beta_3 \times \text{white} \]

and, then, perform a statistical test to evaluate whether the difference in their deviances is statistically meaningful. This test evaluates whether education and being white have a (combined) effect on being a smoker after controlling for income.
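A minimal sketch of one way to set this up (the object names m_1 and m_2 are just assumptions; try it yourself before looking):

# M1: log income only
m_1 <- glm(is_smoker ~ lincome, data = smoking_df, family = binomial(link = 'logit'))
# M2: log income, education, and white
m_2 <- glm(is_smoker ~ lincome + educ + white, data = smoking_df, family = binomial(link = 'logit'))
# Chi-squared test on the difference in deviances of the nested models
anova(m_1, m_2, test = 'Chisq')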