Binary logistic regression handles two-category outcomes. But many epidemiologic outcomes have more than two categories:
When the response variable has \(K > 2\) categories, we use polytomous logistic regression (also called multinomial logistic regression). When the categories have a natural order, we can take advantage of that ordering with ordinal logistic regression.
Today’s roadmap (Kleinbaum Ch. 23): 1. Polytomous (multinomial) logistic regression for nominal outcomes 2. Ordinal logistic regression for ordered outcomes 3. The proportional odds assumption 4. Choosing between models
Suppose general health has 3 categories: Excellent, Good, Poor. A naive approach would be to dichotomize and run a binary logistic regression repeatedly:
Problems with this approach:
A polytomous model fits all categories simultaneously in a single likelihood, sharing information across comparisons.
library(tidyverse)
library(haven)
library(janitor)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet) # multinom() for polytomous regression
library(MASS) # polr() for ordinal regression
library(brant) # Brant test for proportional odds
library(ggeffects)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))The BRFSS variable genhlth asks: “Would you say that
in general your health is…”
1 = Excellent, 2 = Very Good, 3 = Good, 4 = Fair, 5 = Poor
We collapse this into a 3-level ordinal outcome to keep models tractable for class:
brfss_poly <- brfss_full |>
mutate(
# 3-level ordinal outcome
genhlth_3 = case_when(
genhlth %in% c(1, 2) ~ "Excellent/VG",
genhlth == 3 ~ "Good",
genhlth %in% c(4, 5) ~ "Fair/Poor",
TRUE ~ NA_character_
),
# Ordered factor for ordinal regression
genhlth_ord = factor(genhlth_3,
levels = c("Excellent/VG", "Good", "Fair/Poor"),
ordered = TRUE),
# Unordered factor for polytomous regression
genhlth_nom = factor(genhlth_3,
levels = c("Excellent/VG", "Good", "Fair/Poor")),
# Predictors
age = age80,
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
smoker = factor(case_when(
smokday2 %in% c(1, 2) ~ "Current",
smokday2 == 3 ~ "Former/Never",
TRUE ~ NA_character_
), levels = c("Former/Never", "Current"))
) |>
filter(
!is.na(genhlth_ord), !is.na(age), age >= 18, !is.na(sex),
!is.na(bmi), !is.na(exercise), !is.na(income_cat), !is.na(smoker)
)
set.seed(1220)
brfss_poly <- brfss_poly |>
dplyr::select(genhlth_ord, genhlth_nom, age, sex, bmi,
exercise, income_cat, smoker) |>
slice_sample(n = 5000)
saveRDS(brfss_poly,
"/Users/samriddhi/Downloads/brfss_polytomous_2020.rds")
brfss_poly |>
count(genhlth_ord) |>
mutate(pct = round(100 * n / sum(n), 1)) |>
kable(col.names = c("General Health", "N", "%"),
caption = "Outcome Distribution") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| General Health | N | % |
|---|---|---|
| Excellent/VG | 2380 | 47.6 |
| Good | 1624 | 32.5 |
| Fair/Poor | 996 | 19.9 |
Since genhlth_ord and genhlth_nom contain the same three levels (just coded as ordered vs. unordered factors), the distributions will be identical — but showing both makes the distinction between the two factor types explicit for the lecture, which is useful context for introducing polytomous vs. ordinal regression.
For an outcome with \(K\) categories, we choose one as the reference (say, \(Y = 1\)) and fit \(K - 1\) logit equations:
\[ \log\left[\frac{\Pr(Y = k)}{\Pr(Y = 1)}\right] = \alpha_k + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p, \quad k = 2, \ldots, K \]
Each comparison gets its own intercept and its own slopes. With 3 outcome categories and 5 predictors, that’s \(2 \times (1 + 5) = 12\) parameters.
The probabilities are recovered as:
\[ \Pr(Y = k \mid X) = \frac{\exp(\alpha_k + X^T \beta_k)}{1 + \sum_{j=2}^{K} \exp(\alpha_j + X^T \beta_j)} \]
By construction, the probabilities sum to 1.
nnet::multinom()mod_multi <- multinom(
genhlth_nom ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly,
trace = FALSE
)
summary(mod_multi)## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise +
## income_cat + smoker, data = brfss_poly, trace = FALSE)
##
## Coefficients:
## (Intercept) age sexFemale bmi exerciseYes income_cat
## Good -1.317306 0.01497067 -0.1417121 0.05087197 -0.500636 -0.1693992
## Fair/Poor -1.283880 0.02579060 -0.1257209 0.06741312 -1.338926 -0.3932631
## smokerCurrent
## Good 0.4117899
## Fair/Poor 0.3436415
##
## Std. Errors:
## (Intercept) age sexFemale bmi exerciseYes income_cat
## Good 0.2689762 0.002189706 0.06782701 0.005768280 0.08178912 0.01748213
## Fair/Poor 0.3296881 0.002916356 0.08583048 0.006778269 0.09155273 0.02090747
## smokerCurrent
## Good 0.07723392
## Fair/Poor 0.09720041
##
## Residual Deviance: 9287.557
## AIC: 9315.557
The output gives two sets of coefficients, one for each non-reference category:
Each \(\beta\) is interpreted as a log relative risk ratio (RRR). Exponentiating gives the relative risk ratio (sometimes called a relative odds ratio) for being in category \(k\) vs. the reference, per one-unit change in \(X\).
tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Outcome | Predictor | RRR | Lower | Upper | p |
|---|---|---|---|---|---|
| Good | (Intercept) | 0.268 | 0.158 | 0.454 | 9.71e-07 |
| Good | age | 1.015 | 1.011 | 1.019 | 8.10e-12 |
| Good | sexFemale | 0.868 | 0.760 | 0.991 | 0.036679 |
| Good | bmi | 1.052 | 1.040 | 1.064 | < 2e-16 |
| Good | exerciseYes | 0.606 | 0.516 | 0.712 | 9.30e-10 |
| Good | income_cat | 0.844 | 0.816 | 0.874 | < 2e-16 |
| Good | smokerCurrent | 1.510 | 1.297 | 1.756 | 9.73e-08 |
| Fair/Poor | (Intercept) | 0.277 | 0.145 | 0.529 | 9.85e-05 |
| Fair/Poor | age | 1.026 | 1.020 | 1.032 | < 2e-16 |
| Fair/Poor | sexFemale | 0.882 | 0.745 | 1.043 | 0.142987 |
| Fair/Poor | bmi | 1.070 | 1.056 | 1.084 | < 2e-16 |
| Fair/Poor | exerciseYes | 0.262 | 0.219 | 0.314 | < 2e-16 |
| Fair/Poor | income_cat | 0.675 | 0.648 | 0.703 | < 2e-16 |
| Fair/Poor | smokerCurrent | 1.410 | 1.165 | 1.706 | 0.000407 |
Example interpretation. Holding all else constant, current smokers have a relative risk ratio of 1.41 (95% CI: 1.17–1.71) for reporting “Fair/Poor” health vs. “Excellent/VG” compared to former/never smokers. In other words, the relative risk of being in the “Fair/Poor” category (rather than “Excellent/VG”) is about 41% higher among current smokers.
ggpredict(mod_multi, terms = "income_cat [1:8]") |>
plot() +
labs(title = "Predicted Probability of Each Health Category by Income",
x = "Income Category (1 = lowest, 8 = highest)",
y = "Predicted Probability") +
theme_minimal()Interpretation. Holding all other predictors at their means, the predicted probability of reporting “Fair/Poor” health drops sharply across income levels — from 63% at the lowest income category to 17% at the highest. The predicted probability of “Excellent/VG” health shows the opposite pattern, rising from 11% to 49% across the same range. The probability of “Good” health remains comparatively stable (roughly 26%–35%), reflecting that income primarily shifts respondents between the two extremes rather than into or out of the “Good” category.
When categories are ordered, the polytomous model wastes information. It treats “Good” as just as different from “Excellent” as it is from “Fair/Poor”. An ordinal model exploits the ordering and produces fewer parameters and easier interpretation.
The most common ordinal model is the proportional odds model, also called the cumulative logit model. Rather than modeling the probability of each category separately (as in the multinomial model), it models the probability of being at or below a given category. This respects the ordering of the outcome — “Excellent/VG” < “Good” < “Fair/Poor” in terms of health decline — and yields a single, consistent estimate of each predictor’s effect across the entire outcome scale.
The tradeoff for this parsimony is an additional assumption: the proportional odds assumption, which requires that the effect of each predictor is the same regardless of which cut-point is being modeled. This should be checked after fitting.
Define cumulative probabilities:
\[ \gamma_k = \Pr(Y \leq k \mid X) \]
The model is:
\[ \log\left[\frac{\Pr(Y \leq k)}{\Pr(Y > k)}\right] = \alpha_k - \beta_1 X_1 - \cdots - \beta_p X_p \]
Key features:
With \(K = 3\) and 5 predictors, that’s \(2 + 5 = 7\) parameters (vs. 12 for multinomial).
In plain terms: imagine income predicts health. The proportional odds assumption says that the boost in odds of being in “Excellent/VG or better” (vs. “Good or worse”) from moving up one income level is the same as the boost in odds of being in “Excellent/VG or Good” (vs. “Fair/Poor”). The effect of income doesn’t change depending on which boundary you’re crossing — it’s a single, consistent shift up or down the health scale.
MASS::polr()mod_ord <- polr(
genhlth_ord ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly,
Hess = TRUE
)
summary(mod_ord)## Call:
## polr(formula = genhlth_ord ~ age + sex + bmi + exercise + income_cat +
## smoker, data = brfss_poly, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## age 0.01797 0.001867 9.627
## sexFemale -0.12885 0.056835 -2.267
## bmi 0.04973 0.004534 10.967
## exerciseYes -0.92167 0.064426 -14.306
## income_cat -0.26967 0.014113 -19.107
## smokerCurrent 0.29313 0.064304 4.558
##
## Intercepts:
## Value Std. Error t value
## Excellent/VG|Good 0.0985 0.2200 0.4478
## Good|Fair/Poor 1.8728 0.2212 8.4650
##
## Residual Deviance: 9318.274
## AIC: 9334.274
tidy(mod_ord, conf.int = TRUE, exponentiate = TRUE) |>
filter(coef.type == "coefficient") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
kable(col.names = c("Predictor", "OR", "Lower", "Upper"),
caption = "Ordinal Logistic Regression: Cumulative ORs") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | Lower | Upper |
|---|---|---|---|
| age | 1.018 | 1.014 | 1.022 |
| sexFemale | 0.879 | 0.786 | 0.983 |
| bmi | 1.051 | 1.042 | 1.060 |
| exerciseYes | 0.398 | 0.351 | 0.451 |
| income_cat | 0.764 | 0.743 | 0.785 |
| smokerCurrent | 1.341 | 1.182 | 1.521 |
Interpretation. An OR of \(e^{\beta}\) for a predictor means: a one-unit increase in \(X\) multiplies the odds of being in a worse health category (i.e., \(Y > k\) vs. \(Y \leq k\)) by \(e^{\beta}\), and this is true at every cut-point. Here, \(\beta\) is the estimated coefficient from the model (the log-odds), and \(e^{\beta}\) is simply that coefficient exponentiated — converting it from the log-odds scale to an odds ratio that is easier to interpret.
For example, if smoking has OR = 1.6, then current smokers have 1.6 times the odds of reporting Good or worse health (vs. Excellent/VG), AND 1.6 times the odds of reporting Fair/Poor (vs. Excellent/VG or Good).
The proportional odds assumption is strong. It states that the effect of each predictor on the log-odds of being at or below any cut-point is constant across all cut-points — in other words, one \(\beta\) summarizes the predictor’s effect across the entire outcome scale. If this holds, the ordinal model is both valid and parsimonious. If it fails, that single \(\beta\) is a misleading average of effects that actually differ across thresholds, and a more flexible model is needed.
There are two complementary ways to evaluate this assumption: a visual check and a formal test.
The idea is to fit separate binary logistic regressions at each cumulative cut-point and compare the coefficients. If the proportional odds assumption holds, the log-odds coefficients should be roughly the same at both cut-points — the two sets of points should overlap or sit close together. Large separation between cut-points for a given predictor is a warning sign.
cut1 <- glm(I(as.numeric(genhlth_ord) <= 1) ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly, family = binomial)
cut2 <- glm(I(as.numeric(genhlth_ord) <= 2) ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly, family = binomial)
bind_rows(
broom::tidy(cut1, conf.int = TRUE) |> mutate(cutpoint = "\u2264 Excellent/VG"),
broom::tidy(cut2, conf.int = TRUE) |> mutate(cutpoint = "\u2264 Good")
) |>
filter(term != "(Intercept)") |>
ggplot(aes(x = estimate, y = term, color = cutpoint, shape = cutpoint)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
position = position_dodge(width = 0.5)) +
labs(
title = "Visual Check: Proportional Odds Assumption",
subtitle = "Coefficients should overlap across cut-points if assumption holds",
x = "Log-Odds Coefficient",
y = "Predictor",
color = "Cut-point",
shape = "Cut-point"
) +
theme_minimal()Predictors whose point estimates are far apart between cut-points (e.g.,
smokerCurrent,sexFemale) are candidates for violating the assumption — confirm with the Brant test below.
## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 35.87 6 0
## age 0.06 1 0.81
## sexFemale 0.89 1 0.34
## bmi 7.1 1 0.01
## exerciseYes 10.58 1 0
## income_cat 10.54 1 0
## smokerCurrent 7.76 1 0.01
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
as.data.frame(brant_out) |>
tibble::rownames_to_column("Predictor") |>
mutate(
X2 = round(X2, 2),
probability = ifelse(probability < 0.001, "< 0.001", sprintf("%.3f", probability)),
Fails = ifelse(Predictor == "Omnibus",
ifelse(as.numeric(gsub("< ", "", probability)) < 0.05 | probability == "< 0.001", "Yes (overall)", "No (overall)"),
ifelse(probability == "< 0.001" | as.numeric(gsub("< ", "", probability)) < 0.05, "⚠ Yes", ""))
) |>
kable(
col.names = c("Predictor", "X²", "df", "p-value", "Assumption Violated?"),
caption = "Brant Test for Proportional Odds Assumption",
align = c("l", "r", "r", "r", "c")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE) |>
row_spec(1, bold = TRUE, background = "#f0f0f0")| Predictor | X² | df | p-value | Assumption Violated? |
|---|---|---|---|---|
| Omnibus | 35.87 | 6 | < 0.001 | Yes (overall) |
| age | 0.06 | 1 | 0.814 | |
| sexFemale | 0.89 | 1 | 0.345 | |
| bmi | 7.10 | 1 | 0.008 | ⚠ Yes |
| exerciseYes | 10.58 | 1 | 0.001 | ⚠ Yes |
| income_cat | 10.54 | 1 | 0.001 | ⚠ Yes |
| smokerCurrent | 7.76 | 1 | 0.005 | ⚠ Yes |
tibble(
Feature = c("Outcome type", "Number of equations", "Parameters (3 cats, 5 preds)",
"Key assumption", "Best when"),
`Multinomial (multinom)` = c("Nominal or ordinal",
"K - 1 = 2",
"12",
"None beyond independence",
"Categories unordered or PO fails"),
`Ordinal (polr)` = c("Ordinal only",
"1 (with K - 1 cut-points)",
"7",
"Proportional odds",
"Categories ordered AND PO holds")
) |>
kable(caption = "Multinomial vs. Ordinal Logistic Regression") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Feature | Multinomial (multinom) | Ordinal (polr) |
|---|---|---|
| Outcome type | Nominal or ordinal | Ordinal only |
| Number of equations | K - 1 = 2 | 1 (with K - 1 cut-points) |
| Parameters (3 cats, 5 preds) | 12 | 7 |
| Key assumption | None beyond independence | Proportional odds |
| Best when | Categories unordered or PO fails | Categories ordered AND PO holds |
Both models are fit by maximum likelihood. We can compare their fits with AIC (lower = better):
tibble(
Model = c("Multinomial", "Ordinal (proportional odds)"),
AIC = c(AIC(mod_multi), AIC(mod_ord)),
df = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
mutate(AIC = round(AIC, 1)) |>
kable(caption = "Model Comparison") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | AIC | df |
|---|---|---|
| Multinomial | 9315.6 | 14 |
| Ordinal (proportional odds) | 9334.3 | 8 |
Tradeoff: The ordinal model is more parsimonious. If proportional odds holds, it has lower AIC and tighter confidence intervals. If PO fails, the multinomial model is more honest.
| Concept | Key Point |
|---|---|
| Polytomous outcome | Use nnet::multinom() for \(K
> 2\) categories |
| Coefficients | \(K - 1\) sets, each interpreted as log relative risk ratios vs. reference |
| Ordinal outcome | Use MASS::polr() to exploit ordering |
| Cumulative logit | Single set of \(\beta\)s describes all cut-points |
| Proportional odds | Strong assumption; check with brant() |
| Model choice | Ordinal if ordered AND PO holds; otherwise multinomial |
EPI 553 — Polytomous and Ordinal Regression Lab Due: End of class, April 16, 2026
Complete the four tasks below using
brfss_polytomous_2020.rds. Submit a knitted HTML file via
Brightspace.
| Variable | Description | Type |
|---|---|---|
genhlth_ord |
General health (Excellent/VG < Good < Fair/Poor) | Ordered factor (outcome) |
genhlth_nom |
General health, unordered | Factor (outcome) |
age |
Age in years | Numeric |
sex |
Male / Female | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income (1-8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet)
library(MASS)
library(brant)
library(ggeffects)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
brfss_poly <- readRDS("brfss_polytomous_2020.rds")1a. (5 pts) Create a frequency table of
genhlth_ord showing N and percentage for each category.
brfss_poly |>
count(genhlth_ord) |>
mutate(pct = round(100 * n / sum(n), 1)) |>
kable(col.names = c("General Health", "N", "%"),
caption = "Outcome Distribution") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| General Health | N | % |
|---|---|---|
| Excellent/VG | 2380 | 47.6 |
| Good | 1624 | 32.5 |
| Fair/Poor | 996 | 19.9 |
1b. (5 pts) Create a stacked bar chart showing the
distribution of general health by smoker status.
library(ggplot2)
brfss_poly |>
ggplot(aes(x = smoker, fill = genhlth_ord)) +
geom_bar(position = "stack") +
labs(
title = "Distribution of General Health by Smoking Status",
x = "Smoking Status",
y = "Count",
fill = "General Health"
) +
theme_minimal()1c. (5 pts) Use tbl_summary() to create
a descriptive table stratified by genhlth_ord with at least
4 predictors.
brfss_poly |>
dplyr::select(genhlth_ord, age, sex, bmi, exercise, income_cat, smoker) |>
tbl_summary(
by = genhlth_ord,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
label = list(
age ~ "Age (years)",
sex ~ "Sex",
bmi ~ "BMI (kg/m²)",
exercise ~ "Exercised past 30 days",
income_cat ~ "Income category (1–8)",
smoker ~ "Smoking status"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_overall() |>
add_p(test = list(
all_continuous() ~ "aov",
all_categorical() ~ "chisq.test"
)) |>
bold_labels() |>
modify_caption("**Table 2. Participant Characteristics by General Health Category (BRFSS 2020, n = 5,000)**") |>
modify_header(label ~ "**Characteristic**")| Characteristic | Overall N = 5,0001 |
Excellent/VG N = 2,3801 |
Good N = 1,6241 |
Fair/Poor N = 9961 |
p-value2 |
|---|---|---|---|---|---|
| Age (years) | 57.0 (16.2) | 54.7 (16.5) | 57.9 (16.3) | 60.8 (14.6) | <0.001 |
| Sex | 0.005 | ||||
| Male | 2,717 (54%) | 1,315 (55%) | 906 (56%) | 496 (50%) | |
| Female | 2,283 (46%) | 1,065 (45%) | 718 (44%) | 500 (50%) | |
| BMI (kg/m²) | 28.5 (6.3) | 27.4 (5.2) | 29.1 (6.4) | 30.0 (8.0) | <0.001 |
| Exercised past 30 days | 3,630 (73%) | 1,999 (84%) | 1,161 (71%) | 470 (47%) | <0.001 |
| Income category (1–8) | <0.001 | ||||
| 1 | 225 (4.5%) | 39 (1.6%) | 72 (4.4%) | 114 (11%) | |
| 2 | 271 (5.4%) | 69 (2.9%) | 92 (5.7%) | 110 (11%) | |
| 3 | 403 (8.1%) | 124 (5.2%) | 128 (7.9%) | 151 (15%) | |
| 4 | 512 (10%) | 174 (7.3%) | 188 (12%) | 150 (15%) | |
| 5 | 519 (10%) | 198 (8.3%) | 197 (12%) | 124 (12%) | |
| 6 | 714 (14%) | 354 (15%) | 249 (15%) | 111 (11%) | |
| 7 | 844 (17%) | 443 (19%) | 289 (18%) | 112 (11%) | |
| 8 | 1,512 (30%) | 979 (41%) | 409 (25%) | 124 (12%) | |
| Smoking status | <0.001 | ||||
| Former/Never | 3,304 (66%) | 1,692 (71%) | 1,015 (63%) | 597 (60%) | |
| Current | 1,696 (34%) | 688 (29%) | 609 (38%) | 399 (40%) | |
| 1 Mean (SD); n (%) | |||||
| 2 One-way analysis of means; Pearson’s Chi-squared test | |||||
Intepretation Across all variables, the table demonstrates a consistent and meaningful pattern: individuals with better self-rated health tend to be younger, have lower BMI, be more physically active, have higher income, and are less likely to smoke. In contrast, poorer health is concentrated among individuals with older age, higher BMI, lower socioeconomic status, reduced physical activity, and higher smoking prevalence.
These findings collectively reinforce a strong social and behavioral gradient in general health, highlighting smoking, physical inactivity, and socioeconomic disadvantage as key correlates of poorer self-rated health.
2a. (5 pts) Fit a multinomial logistic regression
model predicting genhlth_nom from at least 4 predictors
using multinom().
mod_multi <- multinom(
genhlth_nom ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly,
trace = FALSE
)
summary(mod_multi)## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise +
## income_cat + smoker, data = brfss_poly, trace = FALSE)
##
## Coefficients:
## (Intercept) age sexFemale bmi exerciseYes income_cat
## Good -1.317306 0.01497067 -0.1417121 0.05087197 -0.500636 -0.1693992
## Fair/Poor -1.283880 0.02579060 -0.1257209 0.06741312 -1.338926 -0.3932631
## smokerCurrent
## Good 0.4117899
## Fair/Poor 0.3436415
##
## Std. Errors:
## (Intercept) age sexFemale bmi exerciseYes income_cat
## Good 0.2689762 0.002189706 0.06782701 0.005768280 0.08178912 0.01748213
## Fair/Poor 0.3296881 0.002916356 0.08583048 0.006778269 0.09155273 0.02090747
## smokerCurrent
## Good 0.07723392
## Fair/Poor 0.09720041
##
## Residual Deviance: 9287.557
## AIC: 9315.557
2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.
tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Outcome | Predictor | RRR | Lower | Upper | p |
|---|---|---|---|---|---|
| Good | (Intercept) | 0.268 | 0.158 | 0.454 | 9.71e-07 |
| Good | age | 1.015 | 1.011 | 1.019 | 8.10e-12 |
| Good | sexFemale | 0.868 | 0.760 | 0.991 | 0.036679 |
| Good | bmi | 1.052 | 1.040 | 1.064 | < 2e-16 |
| Good | exerciseYes | 0.606 | 0.516 | 0.712 | 9.30e-10 |
| Good | income_cat | 0.844 | 0.816 | 0.874 | < 2e-16 |
| Good | smokerCurrent | 1.510 | 1.297 | 1.756 | 9.73e-08 |
| Fair/Poor | (Intercept) | 0.277 | 0.145 | 0.529 | 9.85e-05 |
| Fair/Poor | age | 1.026 | 1.020 | 1.032 | < 2e-16 |
| Fair/Poor | sexFemale | 0.882 | 0.745 | 1.043 | 0.142987 |
| Fair/Poor | bmi | 1.070 | 1.056 | 1.084 | < 2e-16 |
| Fair/Poor | exerciseYes | 0.262 | 0.219 | 0.314 | < 2e-16 |
| Fair/Poor | income_cat | 0.675 | 0.648 | 0.703 | < 2e-16 |
| Fair/Poor | smokerCurrent | 1.410 | 1.165 | 1.706 | 0.000407 |
2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison. Intepretation The relative risk ratio (RRR) for current smokers in the Fair/Poor health category compared to Excellent/Very Good health is:RRR = 1.41 (95% CI: 1.17–1.71, p < 0.001)
This means that, holding age, sex, BMI, physical activity, and income constant, individuals who are current smokers have 41% higher relative risk of reporting Fair/Poor general health (vs. Excellent/Very Good health) compared to those who are former/never smokers.Because the confidence interval does not include 1 and the p-value is highly significant, this association is statistically robust.
3a. (5 pts) Fit an ordinal logistic regression model
with the same predictors using polr().
mod_ord <- polr(
genhlth_ord ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly,
Hess = TRUE
)
summary(mod_ord)## Call:
## polr(formula = genhlth_ord ~ age + sex + bmi + exercise + income_cat +
## smoker, data = brfss_poly, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## age 0.01797 0.001867 9.627
## sexFemale -0.12885 0.056835 -2.267
## bmi 0.04973 0.004534 10.967
## exerciseYes -0.92167 0.064426 -14.306
## income_cat -0.26967 0.014113 -19.107
## smokerCurrent 0.29313 0.064304 4.558
##
## Intercepts:
## Value Std. Error t value
## Excellent/VG|Good 0.0985 0.2200 0.4478
## Good|Fair/Poor 1.8728 0.2212 8.4650
##
## Residual Deviance: 9318.274
## AIC: 9334.274
3b. (5 pts) Report the cumulative ORs with 95% CIs.
ctable <- coef(summary(mod_ord))
# Keep only predictor rows (exclude intercepts)
ctable_pred <- ctable[rownames(ctable) %in% names(mod_ord$coefficients), ]
beta <- mod_ord$coefficients
se <- ctable_pred[, "Std. Error"]
OR <- exp(beta)
lower_CI <- exp(beta - 1.96 * se)
upper_CI <- exp(beta + 1.96 * se)
pvals <- 2 * (1 - pnorm(abs(ctable_pred[, "t value"])))
results_table <- data.frame(
Predictor = names(beta),
OR = round(OR, 3),
Lower_95CI = round(lower_CI, 3),
Upper_95CI = round(upper_CI, 3),
p_value = round(pvals, 4)
)
results_table## Predictor OR Lower_95CI Upper_95CI p_value
## age age 1.018 1.014 1.022 0.0000
## sexFemale sexFemale 0.879 0.786 0.983 0.0234
## bmi bmi 1.051 1.042 1.060 0.0000
## exerciseYes exerciseYes 0.398 0.351 0.451 0.0000
## income_cat income_cat 0.764 0.743 0.785 0.0000
## smokerCurrent smokerCurrent 1.341 1.182 1.521 0.0000
3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property. Holding age, sex, BMI, physical activity, and income constant, current smokers have 1.34 times higher cumulative odds of reporting worse general health compared to former/never smokers. Current smoking is associated with worse self-rated health. Smokers are consistently more likely to fall into poorer health categories rather than better ones, and this relationship holds across all levels of the health outcome.
3d. (10 pts) Use ggpredict() to plot
predicted probabilities of each health category across a continuous
predictor of your choice.
ggpredict(mod_ord, terms = "bmi [15:55 by=0.5]") |>
plot() +
labs(title = "Ordinal Model: Predicted Probability of Each Health Category",
x = "Income Category", y = "Predicted Probability") +
theme_minimal()Interpretation As BMI increases from healthy to obese ranges, the predicted probability of reporting Excellent/VG health declines steadily. In parallel, the predicted probability of Fair/Poor health rises from roughly 18% at BMI = 20 to approximately 32% at BMI = 40. The probability of reporting Good health remains relatively stable across the BMI range (approximately 26–30%), consistent with the pattern seen in the multinomial model: BMI primarily shifts respondents between the extreme health categories rather than into or out of the middle “Good” category. These trends are consistent with the clinical literature linking obesity to chronic conditions such as diabetes, hypertension, and musculoskeletal disease, all of which negatively affect self-rated health. ## Task 4: Check Assumptions and Compare (15 points)
4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?
## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 35.87 6 0
## age 0.06 1 0.81
## sexFemale 0.89 1 0.34
## bmi 7.1 1 0.01
## exerciseYes 10.58 1 0
## income_cat 10.54 1 0
## smokerCurrent 7.76 1 0.01
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
Interpretation The proportional odds assumption was violated (Brant test p < 0.001), particularly for BMI, physical activity, income, and smoking, suggesting that the effects of these predictors vary across thresholds of general health.
4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?
tibble(
Model = c("Multinomial", "Ordinal (proportional odds)"),
AIC = c(AIC(mod_multi), AIC(mod_ord)),
df = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
mutate(AIC = round(AIC, 1)) |>
kable(caption = "Model Comparison") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | AIC | df |
|---|---|---|
| Multinomial | 9315.6 | 14 |
| Ordinal (proportional odds) | 9334.3 | 8 |
Interpretation Multinomial model AIC = 9315.6 (df = 14) Ordinal (proportional odds) model AIC = 9334.3 (df = 8).The multinomial logistic regression model fits better, because it has a lower AIC. The multinomial model provides a better statistical fit to the data than the ordinal proportional odds model. This is consistent with earlier results showing violation of the proportional odds assumption, suggesting that the relationship between predictors and general health is not constant across outcome thresholds.
4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences. I would recommend reporting the multinomial logistic regression model because it has a substantially lower AIC (9315.6 vs 9334.3), indicating better model fit. This is also supported by the violation of the proportional odds assumption in the ordinal model, suggesting that key predictors do not have constant effects across outcome thresholds. Although the ordinal model is more parsimonious and easier to interpret, the multinomial model more accurately reflects the underlying data structure.
Completion credit (25 points): Awarded for a complete, good-faith attempt. Total: 75 + 25 = 100 points.
End of Lab Activity