What have we done; what do we need to do still?

  • Model definition
  • Relating log-odds to probabilities
  • Predicting modelled outcomes
  • Interpretation of coefficients
  • Model comparison

Binary logistic regression

When \[y_i \in \{0, 1\}, \quad \text{for all } i \in \{1, \ldots, n\}.\]

(i.e., each of the \(n\) observations \(y_i\) takes either the value 0 or 1)

We can model binary data by assuming that each outcome value \(y_1, y_2, \ldots, y_n\) is a sample from a Bernoulli distribution with parameter \(\theta_i\),

\[ y_i \sim \text{Bernoulli}(\theta_i), \\ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \beta_0 + \beta_1 \times x_i, \]

and the log odds of \(\theta_i\) is a linear function of the predictor variable’s values \(x_1, x_2, \ldots, x_n\).

(This model definition shows a simple binary logistic regression, i.e. one with a single predictor.)

Implementation in R: prepare data

# load packages
library(tidyverse)

# load smoking data (link can be found on NOW)
smoking_df <- read_csv(
  "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv") 

# select relevant variables
smoking_df <- select(smoking_df, cigs, age, educ)

# Make the cigs a binary variable. 
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)

# Show data
glimpse(smoking_df)
Rows: 807
Columns: 4
$ cigs      <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 10, 20, 30, 0, 0, 20, 0, 30, 0…
$ age       <dbl> 46, 40, 58, 30, 17, 86, 35, 48, 48, 31, 27, 24, 71, 44, 22, …
$ educ      <dbl> 16.0, 16.0, 12.0, 13.5, 10.0, 6.0, 12.0, 15.0, 12.0, 12.0, 1…
$ is_smoker <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…

Implementation in R: fit binary logistic regression

# Whether or not someone is a smoker modelled as a function of age.
m_age <- glm(is_smoker ~ age, data = smoking_df, family = binomial(link = 'logit'))

# Show model coefficients
summary( m_age )$coef # in logits (log-odds)
               Estimate  Std. Error    z value    Pr(>|z|)
(Intercept)  0.02396177 0.190206271  0.1259778 0.899749472
age         -0.01215431 0.004355148 -2.7907922 0.005257921

Converting between probability and log-odds

Odds are the ratio of the chance of something happening to the chance of it not happening.

Say \(\theta\) is the probability of an event (e.g. bad weather) happening.

The probability of the same event not happening (e.g. no bad weather) is \(1-\theta\).

Their ratio is \(\frac{\theta}{1-\theta}\). That’s your odds!

So what’s the odds corresponding to a \(\theta\) of .75?

Converting between probability and log-odds

Probabilities \(\rightarrow\) odds \(\rightarrow\) log odds \(\rightarrow\) odds \(\rightarrow\) probabilities

# A probability of 
p <- 0.75 

# corresponds to odds of
odds <- p / (1 - p)  
odds 
[1] 3

A probability of \(\theta = 0.75\) corresponds to odds of 3, meaning the event in question is three times more likely to happen than not (e.g. bad weather is three times more likely than good weather).

# which correspond to log odds / logits of
logodds <- log(odds) # the natural logarithm of the odds 
logodds
[1] 1.098612

Converting between probability and log-odds

# which correspond to odds of
odds <- exp(logodds) # exponentiation of the logit (log odds) to get the odds
odds
[1] 3
# which correspond to a probability of
odds / (1 + odds) 
[1] 0.75

Odds are always ratios:

  • Odds of 3 are a ratio of 3-to-1, which means there are 3 chances of something occurring for every 1 chance of it not occurring.
  • So odds of 3 that team A wins against team B mean that it is 3 times more likely that team A wins than that team B wins.

Converting between probability and log-odds

If there is a probability of \(\theta = 0.9\) that an intervention improves children’s reading, what is this result in log odds?

Compute the odds:

\[ \text{odds} = \frac{\theta}{1 - \theta} \]

Take the logarithm of the odds to get the log-odds (also called logit):

\[ \text{log-odds} = \log(\text{odds}) \]

Use \(R\) as a calculator!

Converting between probability and log-odds

If there is a probability of \(\theta = 0.9\) that an intervention improves children’s reading, what is this result in log odds?

\[ \text{odds} = \frac{\theta}{1 - \theta} = \frac{0.9}{1 - 0.9} = \frac{0.9}{0.1} = 9 \]

p <- 0.9
p / (1 - p)
[1] 9

Take the logarithm of the odds to get the log-odds (also called logit):

\[ \text{log-odds} = \log(\text{odds}) = \log(9) \approx 2.197225 \]

log(9)
[1] 2.197225

Converting between probability and log-odds

The log-odds on the previous slide were 2.1972246. Convert this value into

  1. Odds (the inverse of log(x) is exp(x)).
  2. Probabilities (using plogis(x)).

(replace x correctly)

Converting between probability and log-odds

The log-odds on the previous slide were 2.1972246. Convert this value into

Odds (the inverse of log(x) is exp(x)):

exp(2.197225)
[1] 9.000004

Probabilities (using plogis(x)):

plogis(2.197225)
[1] 0.9

Converting between probability and log-odds

If the odds of bad weather are 4 (i.e. 4:1), what is the probability that we will have bad weather?

\[ \theta = \frac{\text{odds}}{1 + \text{odds}} \]

Use \(R\) as a calculator!

Converting between probability and log-odds

If the odds of bad weather are 4 (i.e. 4:1), what is the probability that we will have bad weather?

odds <- 4
odds / ( 1 + odds )
[1] 0.8

Converting between probability and log-odds

# Use plogis to get the probability of the log odds
plogis(logodds) 
[1] 0.75
# Use qlogis to get the log odds of a probability
qlogis(p)  
[1] 2.197225

Converting between probability and log-odds

  • When prob = .5, odds = 1, log-odds = 0.
  • When prob < .5, log-odds < 0.
  • When prob > .5, log-odds > 0.
  • Probabilities are bound between 0 and 1; log-odds aren’t.
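
As a quick check, these relationships can be verified in R with qlogis (probability to log-odds):

qlogis(0.5)  # probability of .5 gives log-odds of 0
[1] 0
qlogis(0.25) # probability below .5 gives negative log-odds
[1] -1.098612
qlogis(0.75) # probability above .5 gives positive log-odds
[1] 1.098612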

Predicting outcomes with add_predictions

# Step 1. Make a new data frame with range of values of age.
smoking_df2 <- tibble(age = seq(from = 15, to = 80, by = 10))
# Step 2 (alternative). Add predictions to the data frame (these are log-odds by default)
library(modelr)
add_predictions(data = smoking_df2, # name of data frame 
                model = m_age)      # name of model

Use add_predictions for ages 20, 21, 22, 23, 24, 25, 26, 27, \(\dots\), 75, 76, 77, 78, 79, 80, 81
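
One way to set this up (the data frame name smoking_df_ages is just illustrative):

# Make a data frame with every age from 20 to 81
smoking_df_ages <- tibble(age = seq(from = 20, to = 81, by = 1))
# Add predictions on the log-odds scale
add_predictions(data = smoking_df_ages, model = m_age)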

Predicting outcomes as probabilities

add_predictions(data = smoking_df2, # name of data frame 
                model = m_age,      # name of model
                type = 'response')  # return probabilities
# A tibble: 7 × 2
    age  pred
  <dbl> <dbl>
1    15 0.460
2    25 0.430
3    35 0.401
4    45 0.372
5    55 0.344
6    65 0.317
7    75 0.292

Plotting predicted outcomes

# Generate predictions
d_predict <- add_predictions(data = smoking_df2, 
                             model = m_age, 
                             type = 'response')

# Plot predicted values
library(psyntur)
scatterplot(data = d_predict, x = age, y = pred)

Watch out! The relationship between the predictor and the predicted probability isn’t linear, but the relationship between the predictor and the log-odds (logits) is!
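
To see this, we can also plot the predictions on the log-odds scale (the default scale returned by add_predictions); a minimal sketch:

# Generate predictions on the log-odds (logit) scale
d_predict_logit <- add_predictions(data = smoking_df2, model = m_age)
# These predicted log-odds are a linear function of age
scatterplot(data = d_predict_logit, x = age, y = pred)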

Shapes of relationships

Multiple binary logistic regression

For ease of illustration, consider a binary logistic regression with educ as a binary categorical predictor:

# Create a categorical variable for `educ` 
smoking_df <- mutate(smoking_df, educ = case_when(educ > 12 ~ "o12y", TRUE ~ "u12y"))
# Binary logistic regression
m_educ_age <- glm(is_smoker ~ educ + age, data = smoking_df, family = binomial(link = "logit"))
# Inferential statistics for model coefficients
summary(m_educ_age)$coef
               Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.24790348 0.20534132 -1.207275 0.2273261975
educu12y     0.56029969 0.15267484  3.669889 0.0002426559
age         -0.01387864 0.00439195 -3.160019 0.0015775895

Predicting outcomes on different scales

educ has levels o12y (over 12 years of education) and u12y (under 12 years of education) with the following dummy coding:

contrasts(factor(smoking_df$educ))
     u12y
o12y    0
u12y    1
betas <- coef(m_educ_age)
betas
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 

What are the predicted (1) log odds, (2) odds, and (3) probability of being a smoker for a hypothetical person of age 63 with less than 12 years of education?
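
A sketch of how this can be computed from the coefficients (recall that u12y is dummy coded as 1):

# Predicted log-odds for age 63 with less than 12 years of education
log_odds <- betas[1] + betas[2] * 1 + betas[3] * 63
# Convert the log-odds to odds
odds <- exp(log_odds)
# Convert the log-odds to a probability
prob <- plogis(log_odds)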

Interpretation of coefficients

coef(m_educ_age)
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 
  • Log-odds of being a smoker is a linear function of the predictors
  • For any change in the predictors, there is a constant change in the log-odds of being a smoker.
    • the coefficient for educ (0.5602997) suggests that for people with less than 12 years of education (u12y), the log-odds of being a smoker increase by 0.5602997.
    • similarly for age: the change in log-odds is the same whether age changes from 18 to 19 years or from 25 to 26 years, etc.
  • What does it mean for the log-odds to increase by 0.5602997?
  • One way of getting around the lack of intuition around log-odds is to talk about the meaning of the coefficients in terms of odds ratios.

Risk ratios and odds ratios

Differences can be expressed in terms of probabilities: ratios of probabilities are referred to as risk ratios.

\[ \text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})} \]

Create a grid for all combinations of educ and age.

smoking_df3 <- expand_grid(educ = c('o12y', 'u12y'), age = seq(from = 18, to = 80, by = 20))

Obtain predictions as probabilities.

pred_df <- add_predictions(data = smoking_df3, model = m_educ_age, type = 'response')

Convert data frame to wide format (one column per level of educ).

pred_df_wide <- pivot_wider(data = pred_df, names_from = educ, values_from = pred)
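
For reference, pred_df_wide now has one row per age and one column per level of educ (the values reappear in the risk-ratio table on the next slide):

pred_df_wide
# A tibble: 4 × 3
    age  o12y  u12y
  <dbl> <dbl> <dbl>
1    18 0.378 0.516
2    38 0.315 0.446
3    58 0.259 0.379
4    78 0.209 0.316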

Risk ratios and odds ratios

Calculate the risk ratios

\[ \text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})} \]

Note, these values aren’t constant across age!

mutate(pred_df_wide, prob_ratio = u12y / o12y)
# A tibble: 4 × 4
    age  o12y  u12y prob_ratio
  <dbl> <dbl> <dbl>      <dbl>
1    18 0.378 0.516       1.36
2    38 0.315 0.446       1.42
3    58 0.259 0.379       1.47
4    78 0.209 0.316       1.51

Risk ratios and odds ratios

mutate(pred_df_wide,
       odds_o12y = o12y / (1 - o12y),
       odds_u12y = u12y / (1 - u12y),
       odds_ratio = odds_u12y / odds_o12y)
# A tibble: 4 × 6
    age  o12y  u12y odds_o12y odds_u12y odds_ratio
  <dbl> <dbl> <dbl>     <dbl>     <dbl>      <dbl>
1    18 0.378 0.516     0.608     1.06        1.75
2    38 0.315 0.446     0.461     0.807       1.75
3    58 0.259 0.379     0.349     0.611       1.75
4    78 0.209 0.316     0.264     0.463       1.75

Unlike ratios of probabilities, ratios of odds are independent of age (or any other predictor), where the odds are themselves a ratio of probabilities, i.e. \(\frac{p}{1-p}\). Expressed mathematically, the odds ratio can be represented as

\[ \text{odds ratio} = \frac{\frac{p}{1 - p}}{\frac{q}{1 - q}} \]

Risk ratios and odds ratios

# Extract the model coefficients
betas <- coef(m_educ_age)
betas
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 

The odds ratio can be obtained by exponentiating the education coefficient, \(e^{\beta_{\text{educ}}}\), where \(e\) is a mathematical constant (Euler’s number, approximately 2.718282), like \(\pi\).

# Get odds ratio via Euler’s number
e <- exp(1)
e^betas[2]
educu12y 
1.751197 

Risk ratios and odds ratios

# or simply using exponential function
exp(betas[2])
educu12y 
1.751197 

The interpretation is that the odds of being a smoker for individuals with less than 12 years of education are 1.75 times greater than the odds of being a smoker for individuals with more than 12 years of education.

So while betas[2] shows you the education coefficient on the log-odds scale, exp(betas[2]) returns it as an odds ratio.

What are the odds of an 18 year old being a smoker compared to a 19 year old, regardless of education?
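
A sketch of one way to answer this using the age coefficient:

# Odds ratio comparing an 18 year old to a 19 year old (a change of -1 in age);
# roughly 1.01, i.e. the odds are slightly higher for the 18 year old
exp(betas[3] * (18 - 19))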

Risk ratios and odds ratios: Confidence interval

(ci <- confint.default(m_educ_age, parm = "educu12y")) # `.default` gives Wald CI rather than profile
             2.5 %    97.5 %
educu12y 0.2610625 0.8595369

The confidence interval is the range of values that would not be rejected in a null hypothesis test. These are all plausible values for the parameter of interest (the difference in the log-odds of being a smoker for less than vs more than 12 years of education).

exp(ci) # odds ratio
            2.5 %   97.5 %
educu12y 1.298309 2.362067

If the confidence interval for the odds ratio contains 1, we cannot reject the null hypothesis. Yes 1, not 0! An odds ratio of 1 means the odds are equal. In contrast, on the log-odds scale the null hypothesis is a zero difference, because log(1) = 0.

What is the 99% confidence interval for the odds ratio of the age predictor?
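
A sketch of one way to compute it, again using the Wald interval and exponentiating to get the odds-ratio scale:

# 99% confidence interval for the age coefficient, converted to odds ratios
exp(confint.default(m_educ_age, parm = "age", level = 0.99))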

Summarising coefficient interpretation

Changes in the outcome variable can be presented as probabilities, log odds and odds.

  • logodds / logits show constant changes but are unintuitive.
  • probability is intuitive but not constant.
  • odds are constant and reasonably intuitive.

The odds ratio is the factor by which the odds of being a smoker increase with a unit change in the predictor!

Summarising coefficient interpretation

What do we mean by a factor? We mean it is MULTIPLIED.

  • a factor of 2 means something is twice as likely.
  • a factor of 1 means something is equally likely.
  • a factor of 0.5 means something is half as likely.

As an aside, if you get a question asking “what is the factor by which the odds of being a smoker increase for a unit change in education”, I’m really asking you about the odds ratio.

Model fit statistics and model comparisons

  • log-likelihood / deviance
  • log-likelihood ratio tests
    • Normal linear models: F test
    • Binary logistic regression: \(\chi^2\) (chi-squared) test

Model fit statistics and model comparisons

summary(m_educ_age) # see deviance on bottom
Call:
glm(formula = is_smoker ~ educ + age, family = binomial(link = "logit"), 
    data = smoking_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.247903   0.205341  -1.207 0.227326    
educu12y     0.560300   0.152675   3.670 0.000243 ***
age         -0.013879   0.004392  -3.160 0.001578 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1075.0  on 806  degrees of freedom
Residual deviance: 1053.3  on 804  degrees of freedom
AIC: 1059.3

Number of Fisher Scoring iterations: 4

Model fit statistics and model comparisons

Residual deviance: deviance of the model we’re looking at

deviance(m_educ_age)
[1] 1053.306

Null deviance: deviance of a reference model with no predictors

# Fit a model without predictors
m_null <- glm(is_smoker ~ 1, data = smoking_df, family = binomial(link = "logit"))
# Extract the deviance of the null model
deviance(m_null)
[1] 1075.011

A lower deviance indicates a better model fit.

Log-likelihood

The deviance is based on the log-likelihood, which is computed from the observed data and the model’s predicted values:

# Extract the outcome variable
y <- smoking_df$is_smoker
# Generate predicted values on the prob scale
p_hat <- predict(m_educ_age, type = "response")
# Formula for log-likelihood for binomial models
sum( y * log(p_hat) + (1 - y) * log(1 - p_hat)) # don't worry about this formula
[1] -526.6529

and thus captures, overall, how unsurprised your model is by the data it actually saw (higher values mean the model is less surprised by the data, i.e. a better fit). As a shorthand,

logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)

Log-likelihood to deviance

The deviance is \(-2\) times the log-likelihood, so

\[ \text{deviance} = -2 \times \text{log likelihood} \]

-2 * logLik(m_educ_age)
'log Lik.' 1053.306 (df=3)

is the same as

deviance(m_educ_age)
[1] 1053.306

Log-likelihood to deviance

and

deviance(m_educ_age) / -2
[1] -526.6529

is the same as

logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)

The deviance doesn’t tell you more than what you already know from the log-likelihood.

Comparing nested models using \(\chi^2\) test

  • We used \(R^2\) for normal linear models, which has an intuitive and absolute interpretation.
  • Log-likelihood and deviance can only be understood in relative terms, i.e. relative to other (specifically, nested) models.

For example, the two binary logistic models \(\text{is_smoker} \sim \text{Bernoulli}(\theta_i)\) that we have considered so far have the predictor structures

\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i \]

and

\[ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i + \beta_2 \times \text{educ}_i. \]

Model \(\mathcal{M_1}\) is nested in model \(\mathcal{M_2}\) because \(\mathcal{M_1}\) contains a subset of the predictors of \(\mathcal{M_2}\).

Comparing nested models using \(\chi^2\) test

The deviance of model \(\mathcal{M_2}\) is

deviance(m_educ_age)
[1] 1053.306

What’s the deviance of \(\mathcal{M_1}\)?
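
We can obtain it directly; the resulting value (about 1067.1) can also be read off the anova table shown later.

deviance(m_age)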

Which one of these two models is better at predicting the data?

What does it mean that one model is “better”?

The model is better at identifying a smoker on the basis of their age and education compared to just knowing their age. Education can tell us something that age alone can’t.

Comparing nested models using \(\chi^2\) test

\(\dots\) using the difference in deviance:

deviance(m_age) - deviance(m_educ_age)
[1] 13.762

The deviance of \(\mathcal{M_2}\) is around 13.76 units lower.

We can roughly interpret these differences in deviance as follows:

  • if less than 4, inconclusive
  • if between 4 and 10, conclusive
  • if above 10, definitive.

Comparing nested models using \(\chi^2\) test

A meaningful difference in deviance should be at least twice the difference in the number of parameters.

How many parameters are in \(\mathcal{M_2}\)?

coef(m_educ_age)
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 

It’s 3: in a binary logistic regression, the number of coefficients is the number of parameters (in normal linear models, the residual standard deviation is an additional parameter). And it’s 2 for \(\mathcal{M_1}\):

coef(m_age)
(Intercept)         age 
 0.02396177 -0.01215431 

So the difference in deviance should be at least 2 (our difference in deviance was 13.76).

Comparing nested models using \(\chi^2\) test

  • A statistical test comparing these deviances can be implemented using anova.
  • We are using a \(\chi^2\) statistic for binary logistic regression, not an \(F\)-test as for normal linear models.
  • This is because the sampling distribution of the difference in deviances is close to a \(\chi^2\) distribution, not an \(F\)-distribution.
anova(m_age, m_educ_age, test = 'Chisq')

Comparing nested models using \(\chi^2\) test

  • The p-value comes from a \(\chi^2\) distribution with Df degrees of freedom (the difference in the number of parameters).
  • The difference in the deviance can be found in the column Deviance.
anova(m_age, m_educ_age, test = 'Chisq')
Analysis of Deviance Table

Model 1: is_smoker ~ age
Model 2: is_smoker ~ educ + age
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       805     1067.1                      
2       804     1053.3  1   13.762 0.0002075

The p-value is the probability of obtaining a difference in deviance of 13.76 or greater from a \(\chi^2\) distribution with 1 degree of freedom, if the model with both educ and age is identical to the model with only age.

pchisq(deviance(m_age) - deviance(m_educ_age), df = 1, lower.tail = FALSE)
[1] 0.0002074912

Difference in deviances and \(\chi^2\) distribution

Let’s say you have two models \(\mathcal{M_a}\) and \(\mathcal{M_b}\), where \(\mathcal{M_a}\) is fully nested in \(\mathcal{M_b}\). \(\mathcal{M_a}\) has 2 parameters, and \(\mathcal{M_b}\) has 5. \(\mathcal{M_a}\) has a deviance of 500, and \(\mathcal{M_b}\) has a deviance of 490. What is the corresponding p-value for their difference?

pchisq(---, df = ---, lower.tail = FALSE)

Exercise

# Reload smoking data (link can be found on NOW)
smoking_df <- read_csv(
  "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv") 

# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)

# Select new set of variables
smoking_df <- select(smoking_df, is_smoker, white, educ, lincome)

We will use white (binary), educ (as a continuous variable), and the log of income (lincome).

Implement the following two models with \(\text{is_smoker}_i \sim \text{Bernoulli}(\theta_i)\):

\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income})\\ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income}) + \beta_2 \times \text{education} + \beta_3 \times \text{white} \]

Evaluate whether education and being white have a (combined) association with smoking (after controlling for income).
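
A sketch of one way to set this up (the model names m_1 and m_2 are just illustrative):

# M1: is_smoker modelled by log income only
m_1 <- glm(is_smoker ~ lincome, data = smoking_df, family = binomial(link = 'logit'))
# M2: add education and white as predictors
m_2 <- glm(is_smoker ~ lincome + educ + white, data = smoking_df, family = binomial(link = 'logit'))
# Compare the nested models with a chi-squared test of the difference in deviance
anova(m_1, m_2, test = 'Chisq')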