- model definition ✅
- relating log-odds to probabilities ✅
- prediction ✅
This is the format we’ll be applying for the upcoming weeks:
When \[y_i \in \{0, 1\}, \quad \text{for all } i \in 1, \ldots, n.\]
(i.e., each \(y_i\) is a binary variable that can take a value of 0 or 1, for every observation \(i\) of the \(n\) observations)
We can model binary data by assuming that for all \(i \in 1 \ldots n\) (i.e., for every observation \(i\) of \(n\) samples),
\[ y_i \sim \text{Bernoulli}(\theta_i), \\ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \beta_0 + \beta_1 \times x_i. \]
\(\dots\) each observed outcome variable value \(y_1, y_2, \ldots, y_n\) is a sample from a Bernoulli distribution with parameter \(\theta_i\), and the log odds of \(\theta_i\) is a linear function of the predictor variable’s values \(x_1, x_2, \ldots, x_n\).
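To see what this model means generatively, here is a minimal sketch (with made-up values for \(\beta_0\), \(\beta_1\), and the predictor) that simulates binary data from it:
set.seed(101)
n <- 10
x <- rnorm(n)                          # made-up predictor values
beta_0 <- 0.5                          # made-up intercept
beta_1 <- 1.2                          # made-up slope
theta <- plogis(beta_0 + beta_1 * x)   # inverse logit: from log odds to probability
y <- rbinom(n, size = 1, prob = theta) # each y_i is a Bernoulli(theta_i) sample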
# load packages
library(tidyverse)
# load smoking data (link can be found on NOW)
smoking_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv")
# select relevant variables
smoking_df <- select(smoking_df, cigs, age, educ)
# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)
# Show data
smoking_df
# A tibble: 807 × 4
cigs age educ is_smoker
<dbl> <dbl> <dbl> <lgl>
1 0 46 16 FALSE
2 0 40 16 FALSE
3 3 58 12 TRUE
4 0 30 13.5 FALSE
5 0 17 10 FALSE
6 0 86 6 FALSE
7 0 35 12 FALSE
8 0 48 15 FALSE
9 0 48 12 FALSE
10 0 31 12 FALSE
# ℹ 797 more rows
# Whether or not someone is a smoker, modelled as a function of age
m_age <- glm(is_smoker ~ age,
             data = smoking_df,
             family = binomial(link = 'logit'))
# Show model coefficients
summary(m_age)$coef
               Estimate  Std. Error    z value    Pr(>|z|)
(Intercept)  0.02396177 0.190206271  0.1259778 0.899749472
age         -0.01215431 0.004355148 -2.7907922 0.005257921
Probabilities \(\rightarrow\) odds \(\rightarrow\) log odds \(\rightarrow\) odds \(\rightarrow\) probabilities
# A probability of
p <- 0.75
# corresponds to odds of
odds <- p / (1 - p)
odds
[1] 3
# which correspond to log odds / logits of
logodds <- log(odds) # the natural logarithm of the odds
logodds
[1] 1.098612
# which correspond to odds of
odds <- exp(logodds) # exponentiation of the logit (log odds) to get the odds
odds
[1] 3
# which correspond to a probability of
odds / (1 + odds)
[1] 0.75
Odds are always ratios: an odds of 3 is a ratio which means there are 3 chances of something occurring for every 1 chance of it not occurring.
# Use plogis to get the probability of the log odds
plogis(logodds)
[1] 0.75
# Use qlogis to get the log odds of a probability
qlogis(p)
[1] 1.098612
# Here are some low, medium and high probability values:
theta_low <- 0.01   # low prob of success
theta_mid <- 0.5    # equal prob of success
theta_high <- 0.99  # high prob of success
# Logits of low probability
log( theta_low / ( 1 - theta_low ) )
[1] -4.59512
# Note: plogis() converts log odds to probabilities, so applying it to a
# probability does not give the logit; qlogis() does
plogis( theta_low )
[1] 0.5025
Work out the logit values for theta_mid and theta_high using the formula and the appropriate one of the two functions (plogis() or qlogis()).
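One possible solution sketch, using the formula for theta_mid and qlogis() for theta_high:
# Logit via the formula
log( theta_mid / ( 1 - theta_mid ) ) # 0: a probability of 0.5 is a log odds of 0
# Logit via qlogis()
qlogis( theta_high ) # ~4.59512, mirroring the low-probability logit above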
If the probability of bad weather is 0.75, what are the odds that we’ll have bad weather?
\[ \text{odds} = \frac{\theta}{1 - \theta} \]
Use \(R\) as a calculator or solve it by hand!
If the probability of bad weather is \(\theta = 0.75\), what are the odds that we’ll have bad weather?
p <- 0.75
p / ( 1 - p )
[1] 3
A probability of \(\theta = 0.75\) means that we are three times more likely to have bad weather than good weather.
If the odds are 4 (i.e. 4:1) to have bad weather, what is the probability that we will have bad weather?
\[ \theta = \frac{\text{odds}}{1 + \text{odds}} \]
Use \(R\) as a calculator!
If the odds are 4 (i.e. 4:1) to have bad weather, what is the probability that we will have bad weather?
odds <- 4
odds / ( 1 + odds )
[1] 0.8
If there is a probability of \(\theta = 0.9\) that an intervention improves children’s reading, what is its log odds?
Compute the odds:
\[ \text{odds} = \frac{\theta}{1 - \theta} \]
Take the logarithm of the odds to get the log-odds (also called logit):
\[ \text{log-odds} = \log(\text{odds}) \]
Use \(R\) as a calculator!
If there is a probability of \(\theta = 0.9\) that an intervention improves children’s reading, what is its log odds?
\[ \text{odds} = \frac{\theta}{1 - \theta} = \frac{0.9}{1 - 0.9} = \frac{0.9}{0.1} = 9 \]
p <- 0.9
p / (1 - p)
[1] 9
Take the logarithm of the odds to get the log-odds (also called logit):
\[ \text{log-odds} = \log(\text{odds}) = \log(9) \approx 2.197225 \]
log(9)
[1] 2.197225
The log-odds on the previous slide were 2.1972246. Convert this value into
- Odds (the opposite of log(x) is exp(x))
- Probabilities (using plogis(x))
(replace x correctly)
The log-odds on the previous slide were 2.1972246. Convert this value into
Odds (the opposite of log(x) is exp(x)):
exp(2.197225)
[1] 9.000004
Probabilities (using plogis(x)):
plogis(2.197225)
[1] 0.9
# Step 1. Make a new data frame with a range of values of age.
smoking_df2 <- tibble(age = seq(from = 15, to = 80, by = 10))
# Alternative to Step 2 (also log-odds)
library(modelr)
add_predictions(data = smoking_df2, # name of data frame
model = m_age) # name of model
Use add_predictions for ages 20, 21, 22, 23, 24, 25, 26, 27, \(\dots\), 75, 76, 77, 78, 79, 80, 81
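A possible solution sketch; the data frame name smoking_df2b is just an example:
# One row per year of age from 20 to 81
smoking_df2b <- tibble(age = seq(from = 20, to = 81, by = 1))
add_predictions(data = smoking_df2b, model = m_age)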
add_predictions(data = smoking_df2, # name of data frame
model = m_age, # name of model
type = 'response') # return probabilities
# A tibble: 7 × 2
age pred
<dbl> <dbl>
1 15 0.460
2 25 0.430
3 35 0.401
4 45 0.372
5 55 0.344
6 65 0.317
7 75 0.292
# Generate predictions
d_predict <- add_predictions(data = smoking_df2,
model = m_age,
type = 'response')
# Plot predicted values
library(psyntur)
scatterplot(data = d_predict, x = age, y = pred)
For ease of illustration, consider a binary logistic regression with educ as a binary categorical predictor:
# Create a categorical variable for `educ`
smoking_df <- mutate(smoking_df,
                     educ = case_when(educ > 12 ~ "o12y",
                                      TRUE ~ "u12y"))
# Binary logistic regression
m_educ_age <- glm(is_smoker ~ educ + age,
                  data = smoking_df,
                  family = binomial(link = "logit"))
# Inferential statistics for model coefficients
summary(m_educ_age)$coef
               Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.24790348 0.20534132 -1.207275 0.2273261975
educu12y     0.56029969 0.15267484  3.669889 0.0002426559
age         -0.01387864 0.00439195 -3.160019 0.0015775895
coef(m_educ_age)
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 
The educu12y coefficient of 0.5602997 suggests that for people with less than 12 years of education (u12y), the log odds of being a smoker are 0.5602997 higher than for people with more than 12 years (o12y). The difference can also be expressed in terms of probabilities: ratios of probabilities are referred to as risk ratios.
\[
\text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})}
\]
Create a grid for all combinations of educ and age.
smoking_df3 <- expand_grid(educ = c('o12y', 'u12y'), age = seq(18, 80, 20))
Obtain predictions as probabilities.
pred_df <- add_predictions(data = smoking_df3, model = m_educ_age, type='response')
Convert data frame to wide format (one column per level of educ).
pred_df_wide <- pivot_wider(pred_df, names_from = educ, values_from = pred)
Calculate the risk ratios
\[ \text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})} \]
Note, these values aren’t constant across age!
mutate(pred_df_wide, prob_ratio = u12y / o12y)
# A tibble: 4 × 4
age o12y u12y prob_ratio
<dbl> <dbl> <dbl> <dbl>
1 18 0.378 0.516 1.36
2 38 0.315 0.446 1.42
3 58 0.259 0.379 1.47
4 78 0.209 0.316 1.51
mutate(pred_df_wide,
odds_o12y = o12y / (1 - o12y),
odds_u12y = u12y / (1 - u12y),
odds_ratio = odds_u12y / odds_o12y)
# A tibble: 4 × 6
age o12y u12y odds_o12y odds_u12y odds_ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18 0.378 0.516 0.608 1.06 1.75
2 38 0.315 0.446 0.461 0.807 1.75
3 58 0.259 0.379 0.349 0.611 1.75
4 78 0.209 0.316 0.264 0.463 1.75
Unlike ratios of probabilities, ratios of odds are independent of age (or any other predictor). Recall that odds are themselves a ratio of probabilities, i.e. \(\frac{p}{1-p}\). Expressed in maths, odds ratios can be represented as
\[ \text{odds ratio} = \frac{\frac{p}{1 - p}}{\frac{q}{1 - q}} \]
This odds ratio can be obtained by the exponential of the education coefficient \(e^{\beta_{\text{educ}}}\), where \(e\) is a constant (Euler’s number, approximately 2.718282), like \(\pi\).
# Extract the model coefficients
betas <- coef(m_educ_age)
betas
(Intercept)    educu12y         age 
-0.24790348  0.56029969 -0.01387864 
# Get odds ratio via Euler’s number
e <- exp(1)
e^betas[2]
educu12y 1.751197
# or simply using the exponential function
exp(betas[2])
educu12y 1.751197
The interpretation is that the odds of being a smoker for individuals with less than 12 years of education are 1.7511972 times greater than the odds of being a smoker for individuals with more than 12 years of education.
So while betas[2] shows you the coefficient of education in log odds, exp(betas[2]) returns it as an odds ratio.
What are the odds of an 18 year old being a smoker compared to a 19 year old, regardless of education?
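One way to work it out (a sketch): the two ages differ by one year, so the odds ratio is \(e^{-\beta_{\text{age}}}\):
# odds(age 18) / odds(age 19) = exp(beta_age * 18) / exp(beta_age * 19) = exp(-beta_age)
exp(-betas[3]) # ~1.014: slightly higher odds of smoking for the 18 year old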
How do we get the confidence interval for the same coefficient as odds ratio?
(ci <- confint.default(m_educ_age, parm = "educu12y")) # log odds
             2.5 %    97.5 %
educu12y 0.2610625 0.8595369
A range of values that would not be rejected in a null hypothesis test. These are all plausible values for the parameter of interest (the difference in the log odds of being a smoker for less than vs more than 12 years of education).
exp(ci) # odds ratio
            2.5 %   97.5 %
educu12y 1.298309 2.362067
If the confidence interval for the odds ratio contains 1, we cannot reject the null hypothesis. Yes, 1, not 0! The odds are equal if the odds ratio is 1. In contrast, on the log-odds scale the null hypothesis is a zero difference, because log(1) = 0.
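A quick check of that last point:
log(1) # 0: an odds ratio of 1 is a log-odds difference of 0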
Changes in the outcome variable can be presented as probabilities, log odds and odds.
The factor by which the odds of being a smoker increase with a unit change in the predictor!
What do we mean by a factor? We mean it is MULTIPLIED.
As an aside, if you get a question asking “what is the factor by which the odds of being a smoker increase for a unit change in education”, I’m really asking you about the odds ratio.
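To make “multiplied” concrete, here is a small check using the (rounded) odds at age 18 from the table above:
# odds for o12y at age 18, multiplied by the odds ratio, gives the odds for u12y
0.608 * 1.751197 # ~1.06, matching odds_u12y in the table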
summary(m_educ_age) # see deviances at the bottom
Call:
glm(formula = is_smoker ~ educ + age, family = binomial(link = "logit"),
data = smoking_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.247903 0.205341 -1.207 0.227326
educu12y 0.560300 0.152675 3.670 0.000243 ***
age -0.013879 0.004392 -3.160 0.001578 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1075.0 on 806 degrees of freedom
Residual deviance: 1053.3 on 804 degrees of freedom
AIC: 1059.3
Number of Fisher Scoring iterations: 4
deviance(m_educ_age)
[1] 1053.306
This deviance is based on the log-likelihood
\[ \text{deviance} = -2 \times \text{log likelihood} \]
The calculation is based on the observed data and the predicted probabilities:
# Extract the outcome variable
y <- smoking_df$is_smoker
# Generate predicted values on the prob scale
p_hat <- predict(m_educ_age, type = "response")
# Formula for log-likelihood for binomial models
sum( y * log(p_hat) + (1 - y) * log(1 - p_hat) ) # don't worry about this formula
[1] -526.6529
and thus captures, overall, how unsurprised your model is by the data it actually saw (higher values mean the model is less surprised by the data, i.e. a better fit). In short,
logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)
The negative double of the log-likelihood is the deviance, so
-2 * logLik(m_educ_age)
'log Lik.' 1053.306 (df=3)
is the same as
deviance(m_educ_age)
[1] 1053.306
and
deviance(m_educ_age) / -2
[1] -526.6529
is the same as
logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)
The deviance doesn’t tell you more than what you already know from the log-likelihood.
For example, the two binary logistic models \(\text{is_smoker} \sim \text{Bernoulli}(\theta_i)\) that we have considered so far have the predictor structures
\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i \]
and
\[ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i + \beta_2 \times \text{educ}_i. \]
Model \(\mathcal{M_1}\) is nested in model \(\mathcal{M_2}\) because \(\mathcal{M_1}\) contains a subset of the predictors of \(\mathcal{M_2}\).
The deviance of model \(\mathcal{M_2}\) is
deviance(m_educ_age)
[1] 1053.306
What’s the deviance of \(\mathcal{M_1}\)?
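As a quick check, a one-line sketch; the same value appears as the residual deviance of Model 1 in the anova table further down:
deviance(m_age) # ~1067.1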
Which one of these two models is better at predicting the data?
What does it mean that one model is “better”?
The model is better at identifying a smoker on the basis of their age and education compared to just knowing their age. Education can tell us something that age alone can’t.
\(\dots\) using the difference in deviance:
deviance(m_educ_age) - deviance(m_age)
[1] -13.762
The deviance of \(\mathcal{M_2}\) is around 13.762 units lower.
We can roughly interpret these differences in deviance: a meaningful difference in deviance should be at least 2 times the difference in the number of parameters of the two models.
How many parameters are in \(\mathcal{M_2}\)?
coef(m_educ_age)
(Intercept) educu12y age -0.24790348 0.56029969 -0.01387864
It’s 3: in a binary logistic regression, the number of coefficients is the number of parameters (in normal linear models there is also the standard deviation). And it’s 2 for \(\mathcal{M_1}\):
coef(m_age)
(Intercept)         age 
 0.02396177 -0.01215431 
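A minimal sketch for counting the parameters directly:
length(coef(m_educ_age)) # 3 parameters in M2
length(coef(m_age))      # 2 parameters in M1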
So the difference in deviances should be at least 2 (our difference in deviances was 13.762).
We can test this difference formally with anova():
- Df: degrees of freedom (the difference in the number of parameters)
- Deviance: the difference in deviances
anova(m_age, m_educ_age, test = 'Chisq')
Analysis of Deviance Table

Model 1: is_smoker ~ age
Model 2: is_smoker ~ educ + age
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       805     1067.1                      
2       804     1053.3  1   13.762 0.0002075
The p-value is the probability of obtaining a difference in deviances of 13.76 or more in a \(\chi^2\) distribution with 1 degree of freedom if the model with both educ and age is in fact no better than the model with only age.
diff_deviance <- deviance(m_age) - deviance(m_educ_age)
pchisq(diff_deviance, df = 1, lower.tail = FALSE)
[1] 0.0002074912
So let’s say you have two models \(\mathcal{M_a}\) and \(\mathcal{M_b}\). \(\mathcal{M_a}\) is fully nested in \(\mathcal{M_b}\). \(\mathcal{M_a}\) has 2 parameters, and \(\mathcal{M_b}\) has 5. \(\mathcal{M_a}\) has a deviance of 500, and \(\mathcal{M_b}\) has a deviance of 490. What is the corresponding p-value of their difference?
pchisq(---, df = ---, lower.tail = FALSE)
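One possible worked answer: the difference in deviances is 500 − 490 = 10 and the difference in the number of parameters is 5 − 2 = 3, so:
pchisq(500 - 490, df = 5 - 2, lower.tail = FALSE) # ~0.0186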
Reload the data to select new variables:
# load smoking data (link can be found on NOW)
smoking_df <- read_csv("https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv")
# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)
# select relevant variables
smoking_df <- select(smoking_df, is_smoker, white, educ, lincome)
We will use white (binary), educ (as continuous variable), and the log of income (lincome).
Implement the following two models with \(\text{is_smoker}_i \sim \text{Bernoulli}(\theta_i)\):
\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income})\\ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income}) + \beta_2 \times \text{education} + \beta_3 \times \text{white} \]
and then perform a statistical test to evaluate whether the difference in their deviances is statistically meaningful. This test evaluates whether education and being white have a (combined) effect on being a smoker after controlling for income.
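A possible solution sketch, using hypothetical model names m1 and m2:
# M1: log income as the only predictor
m1 <- glm(is_smoker ~ lincome, data = smoking_df,
          family = binomial(link = 'logit'))
# M2: adds education and white
m2 <- glm(is_smoker ~ lincome + educ + white, data = smoking_df,
          family = binomial(link = 'logit'))
# Chi-squared test of the difference in deviances (Df = 2)
anova(m1, m2, test = 'Chisq')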