- Model definition
- Relating log-odds to probabilities
- Predicting modelled outcomes
- Interpretation of coefficients
- Model comparison
When \[y_i \in \{0, 1\}, \quad \text{for all } i \in \{1, \ldots, n\}\]
(i.e., each of the \(n\) observations \(y_i\) takes either the value 0 or 1),
we can model the binary data by assuming that each outcome value \(y_1, y_2, \ldots, y_n\) is a sample from a Bernoulli distribution with parameter \(\theta_i\),
\[ y_i \sim \text{Bernoulli}(\theta_i), \\ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \beta_0 + \beta_1 \times x_i, \]
where the log odds of each \(\theta_i\) is a linear function of the predictor variable’s values \(x_1, x_2, \ldots, x_n\).
(This model definition describes a simple binary logistic regression.)
# Load packages
library(tidyverse)

# Load smoking data (link can be found on NOW)
smoking_df <- read_csv(
  "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv")

# Select relevant variables
smoking_df <- select(smoking_df, cigs, age, educ)

# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)

# Show data
glimpse(smoking_df)
Rows: 807
Columns: 4
$ cigs      <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 10, 20, 30, 0, 0, 20, 0, 30, 0…
$ age       <dbl> 46, 40, 58, 30, 17, 86, 35, 48, 48, 31, 27, 24, 71, 44, 22, …
$ educ      <dbl> 16.0, 16.0, 12.0, 13.5, 10.0, 6.0, 12.0, 15.0, 12.0, 12.0, 1…
$ is_smoker <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
# Whether or not someone is a smoker, modelled as a function of age
m_age <- glm(is_smoker ~ age,
             data = smoking_df,
             family = binomial(link = 'logit'))

# Show model coefficients in logits (log-odds)
summary(m_age)$coef
               Estimate  Std. Error    z value    Pr(>|z|)
(Intercept)  0.02396177 0.190206271  0.1259778 0.899749472
age         -0.01215431 0.004355148 -2.7907922 0.005257921
Odds are the ratio of the probability of something happening to the probability of it not happening.
Say \(\theta\) is the probability of an event (e.g. bad weather) happening.
The probability of the same event not happening (e.g. no bad weather) is \(1-\theta\).
Their ratio is \(\frac{\theta}{1-\theta}\). That’s your odds!
So what are the odds corresponding to a \(\theta\) of .75?
Probabilities \(\rightarrow\) odds \(\rightarrow\) log odds \(\rightarrow\) odds \(\rightarrow\) probabilities
# A probability of
p <- 0.75
# corresponds to odds of
odds <- p / (1 - p)
odds
[1] 3
A probability of \(\theta = 0.75\) thus corresponds to odds of 3: the event in question is three times more likely to happen than not (e.g. bad weather is three times more likely than good weather).
# which correspond to log odds / logits of
logodds <- log(odds) # the natural logarithm of the odds
logodds
[1] 1.098612
# which correspond to odds of
odds <- exp(logodds) # exponentiate the logit (log odds) to get the odds
odds
[1] 3
# which correspond to a probability of
odds / (1 + odds)
[1] 0.75
Odds are always ratios:
If there is a probability of \(\theta = 0.9\) that an intervention improves children’s reading, what is this result in log odds?
Compute the odds:
\[ \text{odds} = \frac{\theta}{1 - \theta} \]
Take the logarithm of the odds to get the log-odds (also called logit):
\[ \text{log-odds} = \log(\text{odds}) \]
Use \(R\) as a calculator!
\[ \text{odds} = \frac{\theta}{1 - \theta} = \frac{0.9}{1 - 0.9} = \frac{0.9}{0.1} = 9 \]
p <- 0.9
p / (1 - p)
[1] 9
Take the logarithm of the odds to get the log-odds (also called logit):
\[ \text{log-odds} = \log(\text{odds}) = \log(9) \approx 2.197225 \]
log(9)
[1] 2.197225
The log-odds on the previous slide were 2.1972246. Convert this value into odds and into probabilities (using exp(x) and plogis(x), replacing x correctly).
Odds (the inverse of log(x) is exp(x)):
exp(2.197225)
[1] 9.000004
Probabilities (using plogis(x)):
plogis(2.197225)
[1] 0.9
If the odds are 4 (i.e. 4:1) to have bad weather, what is the probability that we will have bad weather?
\[ \theta = \frac{\text{odds}}{1 + \text{odds}} \]
Use \(R\) as a calculator!
odds <- 4
odds / (1 + odds)
[1] 0.8
# Use plogis to get the probability from the log odds
plogis(logodds)
[1] 0.75
# Use qlogis to get the log odds of a probability
qlogis(p)
[1] 2.197225
# Step 1: Make a new data frame with a range of values of age
smoking_df2 <- tibble(age = seq(from = 15, to = 80, by = 10))
# Alternative to Step 2 (also log-odds)
library(modelr)
add_predictions(data = smoking_df2, # name of data frame
model = m_age) # name of model
Use add_predictions for ages 20, 21, 22, 23, 24, 25, 26, 27, \(\dots\), 75, 76, 77, 78, 79, 80, 81
add_predictions(data = smoking_df2, # name of data frame
model = m_age, # name of model
type = 'response') # return probabilities
# A tibble: 7 × 2
age pred
<dbl> <dbl>
1 15 0.460
2 25 0.430
3 35 0.401
4 45 0.372
5 55 0.344
6 65 0.317
7 75 0.292
# Generate predictions
d_predict <- add_predictions(data = smoking_df2,
model = m_age,
type = 'response')
# Plot predicted values
library(psyntur)
scatterplot(data = d_predict, x = age, y = pred)
Watch out! The relationship between the predictor and the predicted probability isn’t linear, but the relationship between the predictor and the log-odds (logits) is!
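To see this, one can plot the model’s default predictions, which add_predictions returns on the log-odds scale (a sketch reusing smoking_df2 and m_age from above):

# Predictions on the log-odds (logit) scale fall on a straight line
d_logit <- add_predictions(data = smoking_df2, model = m_age) # default: log-odds
scatterplot(data = d_logit, x = age, y = pred)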
For ease of illustration, consider a binary logistic regression with educ as a binary categorical predictor:
# Create a categorical variable for `educ`
smoking_df <- mutate(smoking_df,
                     educ = case_when(educ > 12 ~ "o12y",
                                      TRUE ~ "u12y"))
# Binary logistic regression
m_educ_age <- glm(is_smoker ~ educ + age,
                  data = smoking_df,
                  family = binomial(link = "logit"))
# Inferential statistics for model coefficients
summary(m_educ_age)$coef
               Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.24790348 0.20534132 -1.207275 0.2273261975
educu12y     0.56029969 0.15267484  3.669889 0.0002426559
age         -0.01387864 0.00439195 -3.160019 0.0015775895
educ has levels o12y (over 12 years of education) and u12y (under 12 years of education) with the following dummy coding:
contrasts(factor(smoking_df$educ))
     u12y
o12y    0
u12y    1
betas <- coef(m_educ_age)
betas
(Intercept)    educu12y         age
-0.24790348  0.56029969 -0.01387864
What are the predicted (1) log odds, (2) odds, and (3) probability of being a smoker for a hypothetical person of age 63 with less than 12 years of education?
coef(m_educ_age)
(Intercept)    educu12y         age
-0.24790348  0.56029969 -0.01387864
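A sketch of one way to work this out in R, using the betas extracted above (the dummy code educu12y is 1 for a person with under 12 years of education):

# (1) Predicted log odds for a 63-year-old with under 12 years of education
logodds <- betas["(Intercept)"] + betas["educu12y"] * 1 + betas["age"] * 63
logodds

# (2) Convert the log odds to odds
exp(logodds)

# (3) Convert the log odds to a probability
plogis(logodds)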
The educ coefficient of 0.5602997 suggests that, for people with less than 12 years of education (u12y), the log-odds of being a smoker increase by 0.5602997 relative to people with more than 12 years of education. What does this difference look like in terms of probabilities? Ratios of probabilities are referred to as risk ratios.
\[
\text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})}
\]

Create a grid for all combinations of educ and age.
smoking_df3 <- expand_grid(educ = c('o12y', 'u12y'), age = seq(from = 18, to = 80, by = 20))
Obtain predictions as probabilities.
pred_df <- add_predictions(data = smoking_df3, model = m_educ_age, type = 'response')
Convert data frame to wide format (one column per level of educ).
pred_df_wide <- pivot_wider(data = pred_df, names_from = educ, values_from = pred)
Calculate the risk ratios
\[ \text{risk ratio} = \frac{p(\text{is smoker if less than 12 years of education})}{p(\text{is smoker if more than 12 years of education})} \]
Note, these values aren’t constant across age!
mutate(pred_df_wide, prob_ratio = u12y / o12y)
# A tibble: 4 × 4
age o12y u12y prob_ratio
<dbl> <dbl> <dbl> <dbl>
1 18 0.378 0.516 1.36
2 38 0.315 0.446 1.42
3 58 0.259 0.379 1.47
4 78 0.209 0.316 1.51
mutate(pred_df_wide,
odds_o12y = o12y / (1 - o12y),
odds_u12y = u12y / (1 - u12y),
odds_ratio = odds_u12y / odds_o12y)
# A tibble: 4 × 6
age o12y u12y odds_o12y odds_u12y odds_ratio
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18 0.378 0.516 0.608 1.06 1.75
2 38 0.315 0.446 0.461 0.807 1.75
3 58 0.259 0.379 0.349 0.611 1.75
4 78 0.209 0.316 0.264 0.463 1.75
Unlike ratios of probabilities, ratios of odds are independent of age (or any other predictor). Recall that odds are themselves ratios of probabilities, i.e. \(\frac{p}{1-p}\). Expressed mathematically, the odds ratio is
\[ \text{odds ratio} = \frac{\frac{p}{1 - p}}{\frac{q}{1 - q}} \]
where \(p\) and \(q\) are the two probabilities being compared (here, the probability of being a smoker at each level of educ).
# Extract the model coefficients
betas <- coef(m_educ_age)
betas
(Intercept)    educu12y         age
-0.24790348  0.56029969 -0.01387864
The odds ratio can be obtained by exponentiating the education coefficient, \(e^{\beta_{\text{educ}}}\), where \(e\) is a mathematical constant (Euler’s number, i.e. 2.718282), just like \(\pi\).
# Get the odds ratio via Euler’s number
e <- exp(1)
e^betas[2]
educu12y 1.751197
# Or simply use the exponential function
exp(betas[2])
educu12y 1.751197
The interpretation is that the odds of being a smoker for individuals with less than 12 years of education are 1.75 times greater than the odds of being a smoker for individuals with more than 12 years of education.
So while betas[2] shows you the coefficient of education in log odds, exp(betas[2]) returns the coefficient in odds.
What are the odds of being a smoker for an 18-year-old compared to a 19-year-old, regardless of education?
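A sketch of one way to answer this, using the betas extracted above: since age enters the model linearly on the log-odds scale, the odds ratio for 18 vs 19 years of age is \(e^{\beta_{\text{age}} \times (18 - 19)}\).

# Odds ratio comparing an 18-year-old to a 19-year-old
exp(betas["age"] * (18 - 19))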
(ci <- confint.default(m_educ_age, parm = "educu12y")) # `.default` gives Wald CI rather than profile
             2.5 %    97.5 %
educu12y 0.2610625 0.8595369
This is the range of values that would not be rejected in a null hypothesis test; these are all plausible values for the parameter of interest (the difference in the log-odds of being a smoker for less than vs more than 12 years of education).
exp(ci) # odds ratio
            2.5 %   97.5 %
educu12y 1.298309 2.362067
If the confidence interval of the odds ratio contains 1, we cannot reject the null hypothesis. Yes 1, not 0! Odds are equal if their ratio is 1. In contrast, on the log-odds scale the null hypothesis is a difference of zero, because log(1) = 0.
What is the 99% confidence interval for the odds ratio of the age predictor?
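A sketch of one way to obtain it, combining confint.default (with level = 0.99) and exp as above:

# 99% Wald confidence interval for age, converted to the odds-ratio scale
exp(confint.default(m_educ_age, parm = "age", level = 0.99))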
Changes in the outcome variable can be presented as probabilities, log odds and odds.
The odds ratio is the factor by which the odds of being a smoker increase with a unit change in the predictor!
What do we mean by a factor? We mean the odds are MULTIPLIED by it.
As an aside, if you get a question asking “what is the factor by which the odds of being a smoker increase for a unit change in education”, I’m really asking you about the odds ratio.
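To make “multiplied” concrete, here is a sketch reusing pred_df_wide and betas from above: multiplying the o12y odds by exp(betas[2]) recovers the u12y odds at every age.

# The u12y odds equal the o12y odds multiplied by the odds ratio
mutate(pred_df_wide,
       odds_o12y = o12y / (1 - o12y),
       odds_u12y = u12y / (1 - u12y),
       odds_u12y_check = odds_o12y * exp(betas["educu12y"]))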
summary(m_educ_age) # see deviance on bottom
Call:
glm(formula = is_smoker ~ educ + age, family = binomial(link = "logit"),
data = smoking_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.247903 0.205341 -1.207 0.227326
educu12y 0.560300 0.152675 3.670 0.000243 ***
age -0.013879 0.004392 -3.160 0.001578 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1075.0 on 806 degrees of freedom
Residual deviance: 1053.3 on 804 degrees of freedom
AIC: 1059.3
Number of Fisher Scoring iterations: 4
Residual deviance: deviance of the model we’re looking at
deviance(m_educ_age)
[1] 1053.306
Null deviance: deviance of a reference model with no predictors
# Fit a model without predictors
m_null <- glm(is_smoker ~ 1,
              data = smoking_df,
              family = binomial(link = "logit"))

# Extract the deviance of the null model
deviance(m_null)
[1] 1075.011
A lower deviance indicates a better model fit.
This deviance is based on the log-likelihood, which in turn is based on the observed and the predicted data:
# Extract the outcome variable
y <- smoking_df$is_smoker

# Generate predicted values on the probability scale
p_hat <- predict(m_educ_age, type = "response")

# Log-likelihood formula for binomial models (don't worry about this formula)
sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))
[1] -526.6529
and thus captures, overall, how unsurprised your model is by the data it actually saw (higher values mean the model is less surprised by the data, i.e. a better fit). For short,
logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)
The negative double of the log-likelihood is the deviance, so
\[ \text{deviance} = -2 \times \text{log likelihood} \]
-2 * logLik(m_educ_age)
'log Lik.' 1053.306 (df=3)
is the same as
deviance(m_educ_age)
[1] 1053.306
and
deviance(m_educ_age) / -2
[1] -526.6529
is the same as
logLik(m_educ_age)
'log Lik.' -526.6529 (df=3)
The deviance doesn’t tell you more than what you already know from the log-likelihood.
For example, the two binary logistic models \(\text{is_smoker} \sim \text{Bernoulli}(\theta_i)\) that we have considered so far have the predictor structures
\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i \] and
\[ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \text{age}_i + \beta_2 \times \text{educ}_i. \]

Model \(\mathcal{M_1}\) is nested in model \(\mathcal{M_2}\) because \(\mathcal{M_1}\) contains a subset of the predictors of \(\mathcal{M_2}\).
The deviance of model \(\mathcal{M_2}\) is
deviance(m_educ_age)
[1] 1053.306
What’s the deviance of \(\mathcal{M_1}\)?
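One line of R answers this (a sketch):

# Deviance of M1 (the age-only model)
deviance(m_age)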
Which one of these two models is better at predicting the data?
What does it mean that one model is “better”?
The model is better at identifying a smoker on the basis of their age and education compared to just knowing their age. Education can tell us something that age alone can’t.
\(\dots\) using the difference in deviance:
deviance(m_age) - deviance(m_educ_age)
[1] 13.762
The deviance of \(\mathcal{M_2}\) is around 13.76 units lower.
We can roughly interpret these differences in deviance: a meaningful difference should be at least 2 times greater than the difference in the number of parameters.
How many parameters are in \(\mathcal{M_2}\)?
coef(m_educ_age)
(Intercept) educu12y age -0.24790348 0.56029969 -0.01387864
It’s 3: in a binary logistic regression the number of coefficients is the number of parameters (in normal linear models there is also the standard deviation). And it’s 2 for \(\mathcal{M_1}\):
coef(m_age)
(Intercept) age 0.02396177 -0.01215431
So, since the difference in the number of parameters is 3 − 2 = 1, the difference in deviance should be at least 2 × 1 = 2 (our difference in deviance was 13.76).
We can test this difference formally with anova(), where Df is the degrees of freedom (the difference in the number of parameters) and Deviance is the difference in deviance:

anova(m_age, m_educ_age, test = 'Chisq')
Analysis of Deviance Table

Model 1: is_smoker ~ age
Model 2: is_smoker ~ educ + age
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1       805     1067.1
2       804     1053.3  1   13.762 0.0002075
The p-value is the probability of obtaining a difference in deviance of 13.76 in a \(\chi^2\) distribution with 1 degree of freedom if the model with both educ and age is identical to the model with only age.
pchisq(deviance(m_age) - deviance(m_educ_age), df = 1, lower.tail = FALSE)
[1] 0.0002074912
Let’s say you have two models \(\mathcal{M_a}\) and \(\mathcal{M_b}\), where \(\mathcal{M_a}\) is fully nested in \(\mathcal{M_b}\). \(\mathcal{M_a}\) has 2 parameters, and \(\mathcal{M_b}\) has 5. \(\mathcal{M_a}\) has a deviance of 500, and \(\mathcal{M_b}\) has a deviance of 490. What is the corresponding p-value of their difference?
pchisq(---, df = ---, lower.tail = FALSE)
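One possible completion, given the values above: the difference in deviance is \(500 - 490 = 10\) and the difference in the number of parameters is \(5 - 2 = 3\).

# Chi-squared test of the deviance difference with df = difference in parameters
pchisq(500 - 490, df = 5 - 2, lower.tail = FALSE)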
# Reload smoking data (link can be found on NOW)
smoking_df <- read_csv(
  "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/smoking.csv")

# Make the cigs a binary variable
smoking_df <- mutate(smoking_df, is_smoker = cigs > 0)

# Select new set of variables
smoking_df <- select(smoking_df, is_smoker, white, educ, lincome)
We will use white (binary), educ (as a continuous variable), and the log of income (lincome).
Implement the following two models with \(\text{is_smoker}_i \sim \text{Bernoulli}(\theta_i)\):
\[ \mathcal{M_1}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income}_i) \]
\[ \mathcal{M_2}: \text{logit}(\theta_i) = \beta_0 + \beta_1 \times \log(\text{income}_i) + \beta_2 \times \text{education}_i + \beta_3 \times \text{white}_i \]
Evaluate whether education and being white have a (combined) association with smoking (after controlling for income).
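A sketch of one possible implementation (note that lincome is already the log of income, so it enters the formula directly; the model comparison uses anova() as above):

# M1: log income as the only predictor
m1 <- glm(is_smoker ~ lincome,
          data = smoking_df,
          family = binomial(link = "logit"))

# M2: add education and white
m2 <- glm(is_smoker ~ lincome + educ + white,
          data = smoking_df,
          family = binomial(link = "logit"))

# Chi-squared test on the difference in deviance
anova(m1, m2, test = 'Chisq')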