Introduction

Binary logistic regression handles two-category outcomes. But many epidemiologic outcomes have more than two categories:

  • General health: Excellent / Very Good / Good / Fair / Poor
  • Disease stage: I / II / III / IV
  • Treatment choice: Drug A / Drug B / Drug C / No treatment
  • BMI category: Underweight / Normal / Overweight / Obese

When the response variable has \(K > 2\) categories, we use polytomous logistic regression (also called multinomial logistic regression). When the categories have a natural order, we can take advantage of that ordering with ordinal logistic regression.

Today’s roadmap (Kleinbaum Ch. 23): 1. Polytomous (multinomial) logistic regression for nominal outcomes 2. Ordinal logistic regression for ordered outcomes 3. The proportional odds assumption 4. Choosing between models


Why Not Just Run Multiple Binary Models?

Suppose general health has 3 categories: Excellent, Good, Poor. A naive approach would be to dichotomize and run a binary logistic regression repeatedly:

  • Excellent vs. not
  • Good vs. not
  • Poor vs. not

Problems with this approach:

  1. Information loss. Dichotomizing throws away the distinction between the other categories.
  2. Inconsistent samples. Each model uses a different comparison group.
  3. No coherent probability model. Predicted probabilities won’t sum to 1.
  4. Inefficiency. Standard errors are larger than necessary.

A polytomous model fits all categories simultaneously in a single likelihood, sharing information across comparisons.


Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet)        # multinom() for polytomous regression
library(MASS)        # polr() for ordinal regression
library(brant)       # Brant test for proportional odds
library(ggeffects)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
brfss_full <- read_xpt(
  "LLCP2020.XPT"
) |>
  clean_names()

Building the Outcome: General Health

The BRFSS variable genhlth asks: “Would you say that in general your health is…”

1 = Excellent, 2 = Very Good, 3 = Good, 4 = Fair, 5 = Poor

We collapse this into a 3-level ordinal outcome to keep models tractable for class:

brfss_poly <- brfss_full |>
  mutate(
    # 3-level ordinal outcome
    genhlth_3 = case_when(
      genhlth %in% c(1, 2) ~ "Excellent/VG",
      genhlth == 3         ~ "Good",
      genhlth %in% c(4, 5) ~ "Fair/Poor",
      TRUE                 ~ NA_character_
    ),
    # Ordered factor for ordinal regression
    genhlth_ord = factor(genhlth_3,
      levels = c("Excellent/VG", "Good", "Fair/Poor"),
      ordered = TRUE),
    # Unordered factor for polytomous regression
    genhlth_nom = factor(genhlth_3,
      levels = c("Excellent/VG", "Good", "Fair/Poor")),
    # Predictors
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    smoker = factor(case_when(
      smokday2 %in% c(1, 2) ~ "Current",
      smokday2 == 3          ~ "Former/Never",
      TRUE                   ~ NA_character_
    ), levels = c("Former/Never", "Current"))
  ) |>
  filter(
    !is.na(genhlth_ord), !is.na(age), age >= 18, !is.na(sex),
    !is.na(bmi), !is.na(exercise), !is.na(income_cat), !is.na(smoker)
  )

set.seed(1220)
brfss_poly <- brfss_poly |>
  dplyr::select(genhlth_ord, genhlth_nom, age, sex, bmi,
                exercise, income_cat, smoker) |>
  slice_sample(n = 5000)

saveRDS(brfss_poly,
  "brfss_polytomous_2020.rds")

brfss_poly |>
  count(genhlth_ord) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  kable(col.names = c("General Health", "N", "%"),
        caption = "Outcome Distribution") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Outcome Distribution
General Health N %
Excellent/VG 2380 47.6
Good 1624 32.5
Fair/Poor 996 19.9

Since genhlth_ord and genhlth_nom contain the same three levels (just coded as ordered vs. unordered factors), the distributions will be identical — but showing both makes the distinction between the two factor types explicit for the lecture, which is useful context for introducing polytomous vs. ordinal regression.


Part 1: Polytomous (Multinomial) Logistic Regression

The Model

For an outcome with \(K\) categories, we choose one as the reference (say, \(Y = 1\)) and fit \(K - 1\) logit equations:

\[ \log\left[\frac{\Pr(Y = k)}{\Pr(Y = 1)}\right] = \alpha_k + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p, \quad k = 2, \ldots, K \]

Each comparison gets its own intercept and its own slopes. With 3 outcome categories and 5 predictors, that’s \(2 \times (1 + 5) = 12\) parameters.

The probabilities are recovered as:

\[ \Pr(Y = k \mid X) = \frac{\exp(\alpha_k + X^T \beta_k)}{1 + \sum_{j=2}^{K} \exp(\alpha_j + X^T \beta_j)} \]

By construction, the probabilities sum to 1.

Fitting in R: nnet::multinom()

mod_multi <- multinom(
  genhlth_nom ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  trace = FALSE
)

summary(mod_multi)
## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise + 
##     income_cat + smoker, data = brfss_poly, trace = FALSE)
## 
## Coefficients:
##           (Intercept)        age  sexFemale        bmi exerciseYes income_cat
## Good        -1.317306 0.01497067 -0.1417121 0.05087197   -0.500636 -0.1693992
## Fair/Poor   -1.283880 0.02579060 -0.1257209 0.06741312   -1.338926 -0.3932631
##           smokerCurrent
## Good          0.4117899
## Fair/Poor     0.3436415
## 
## Std. Errors:
##           (Intercept)         age  sexFemale         bmi exerciseYes income_cat
## Good        0.2689762 0.002189706 0.06782701 0.005768280  0.08178912 0.01748213
## Fair/Poor   0.3296881 0.002916356 0.08583048 0.006778269  0.09155273 0.02090747
##           smokerCurrent
## Good         0.07723392
## Fair/Poor    0.09720041
## 
## Residual Deviance: 9287.557 
## AIC: 9315.557

Interpreting the Coefficients

The output gives two sets of coefficients, one for each non-reference category:

  • “Good” vs. “Excellent/VG”
  • “Fair/Poor” vs. “Excellent/VG”

Each \(\beta\) is interpreted as a log relative risk ratio (RRR). Exponentiating gives the relative risk ratio (sometimes called a relative odds ratio) for being in category \(k\) vs. the reference, per one-unit change in \(X\).

tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
  dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
        caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Multinomial Logistic Regression: RRRs vs. Excellent/VG
Outcome Predictor RRR Lower Upper p
Good (Intercept) 0.268 0.158 0.454 9.71e-07
Good age 1.015 1.011 1.019 8.10e-12
Good sexFemale 0.868 0.760 0.991 0.036679
Good bmi 1.052 1.040 1.064 < 2e-16
Good exerciseYes 0.606 0.516 0.712 9.30e-10
Good income_cat 0.844 0.816 0.874 < 2e-16
Good smokerCurrent 1.510 1.297 1.756 9.73e-08
Fair/Poor (Intercept) 0.277 0.145 0.529 9.85e-05
Fair/Poor age 1.026 1.020 1.032 < 2e-16
Fair/Poor sexFemale 0.882 0.745 1.043 0.142987
Fair/Poor bmi 1.070 1.056 1.084 < 2e-16
Fair/Poor exerciseYes 0.262 0.219 0.314 < 2e-16
Fair/Poor income_cat 0.675 0.648 0.703 < 2e-16
Fair/Poor smokerCurrent 1.410 1.165 1.706 0.000407

Example interpretation. Holding all else constant, current smokers have a relative risk ratio of 1.41 (95% CI: 1.17–1.71) for reporting “Fair/Poor” health vs. “Excellent/VG” compared to former/never smokers. In other words, the relative risk of being in the “Fair/Poor” category (rather than “Excellent/VG”) is about 41% higher among current smokers.

Predicted Probabilities

ggpredict(mod_multi, terms = "income_cat [1:8]") |>
  plot() +
  labs(title = "Predicted Probability of Each Health Category by Income",
       x = "Income Category (1 = lowest, 8 = highest)",
       y = "Predicted Probability") +
  theme_minimal()

Interpretation. Holding all other predictors at their means, the predicted probability of reporting “Fair/Poor” health drops sharply across income levels — from 63% at the lowest income category to 17% at the highest. The predicted probability of “Excellent/VG” health shows the opposite pattern, rising from 11% to 49% across the same range. The probability of “Good” health remains comparatively stable (roughly 26%–35%), reflecting that income primarily shifts respondents between the two extremes rather than into or out of the “Good” category.


Part 2: Ordinal Logistic Regression

When categories are ordered, the polytomous model wastes information. It treats “Good” as just as different from “Excellent” as it is from “Fair/Poor”. An ordinal model exploits the ordering and produces fewer parameters and easier interpretation.

The Proportional Odds (Cumulative Logit) Model

The most common ordinal model is the proportional odds model, also called the cumulative logit model. Rather than modeling the probability of each category separately (as in the multinomial model), it models the probability of being at or below a given category. This respects the ordering of the outcome — “Excellent/VG” < “Good” < “Fair/Poor” in terms of health decline — and yields a single, consistent estimate of each predictor’s effect across the entire outcome scale.

The tradeoff for this parsimony is an additional assumption: the proportional odds assumption, which requires that the effect of each predictor is the same regardless of which cut-point is being modeled. This should be checked after fitting.

Define cumulative probabilities:

\[ \gamma_k = \Pr(Y \leq k \mid X) \]

The model is:

\[ \log\left[\frac{\Pr(Y \leq k)}{\Pr(Y > k)}\right] = \alpha_k - \beta_1 X_1 - \cdots - \beta_p X_p \]

Key features:

  • One intercept per cut-point (\(k = 1, \ldots, K - 1\))
  • A single set of slopes \(\beta\), shared across all cut-points
  • The “proportional odds” assumption: the effect of \(X\) on the log-odds of being at or below category \(k\) is the same for every \(k\)

With \(K = 3\) and 5 predictors, that’s \(2 + 5 = 7\) parameters (vs. 12 for multinomial).

In plain terms: imagine income predicts health. The proportional odds assumption says that the boost in odds of being in “Excellent/VG or better” (vs. “Good or worse”) from moving up one income level is the same as the boost in odds of being in “Excellent/VG or Good” (vs. “Fair/Poor”). The effect of income doesn’t change depending on which boundary you’re crossing — it’s a single, consistent shift up or down the health scale.

Fitting in R: MASS::polr()

mod_ord <- polr(
  genhlth_ord ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  Hess = TRUE
)

summary(mod_ord)
## Call:
## polr(formula = genhlth_ord ~ age + sex + bmi + exercise + income_cat + 
##     smoker, data = brfss_poly, Hess = TRUE)
## 
## Coefficients:
##                  Value Std. Error t value
## age            0.01797   0.001867   9.627
## sexFemale     -0.12885   0.056835  -2.267
## bmi            0.04973   0.004534  10.967
## exerciseYes   -0.92167   0.064426 -14.306
## income_cat    -0.26967   0.014113 -19.107
## smokerCurrent  0.29313   0.064304   4.558
## 
## Intercepts:
##                   Value    Std. Error t value 
## Excellent/VG|Good   0.0985   0.2200     0.4478
## Good|Fair/Poor      1.8728   0.2212     8.4650
## 
## Residual Deviance: 9318.274 
## AIC: 9334.274

Odds Ratios and Interpretation

tidy(mod_ord, conf.int = TRUE, exponentiate = TRUE) |>
  filter(coef.type == "coefficient") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "Lower", "Upper"),
        caption = "Ordinal Logistic Regression: Cumulative ORs") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Ordinal Logistic Regression: Cumulative ORs
Predictor OR Lower Upper
age 1.018 1.014 1.022
sexFemale 0.879 0.786 0.983
bmi 1.051 1.042 1.060
exerciseYes 0.398 0.351 0.451
income_cat 0.764 0.743 0.785
smokerCurrent 1.341 1.182 1.521

Interpretation. An OR of \(e^{\beta}\) for a predictor means: a one-unit increase in \(X\) multiplies the odds of being in a worse health category (i.e., \(Y > k\) vs. \(Y \leq k\)) by \(e^{\beta}\), and this is true at every cut-point. Here, \(\beta\) is the estimated coefficient from the model (the log-odds), and \(e^{\beta}\) is simply that coefficient exponentiated — converting it from the log-odds scale to an odds ratio that is easier to interpret.

For example, if smoking has OR = 1.6, then current smokers have 1.6 times the odds of reporting Good or worse health (vs. Excellent/VG), AND 1.6 times the odds of reporting Fair/Poor (vs. Excellent/VG or Good).

Predicted Probabilities

ggpredict(mod_ord, terms = "income_cat [1:8]") |>
  plot() +
  labs(title = "Ordinal Model: Predicted Probability of Each Health Category",
       x = "Income Category", y = "Predicted Probability") +
  theme_minimal()


Part 3: Checking the Proportional Odds Assumption

The proportional odds assumption is strong. It states that the effect of each predictor on the log-odds of being at or below any cut-point is constant across all cut-points — in other words, one \(\beta\) summarizes the predictor’s effect across the entire outcome scale. If this holds, the ordinal model is both valid and parsimonious. If it fails, that single \(\beta\) is a misleading average of effects that actually differ across thresholds, and a more flexible model is needed.

There are two complementary ways to evaluate this assumption: a visual check and a formal test.

Visual Check: Empirical Logits

The idea is to fit separate binary logistic regressions at each cumulative cut-point and compare the coefficients. If the proportional odds assumption holds, the log-odds coefficients should be roughly the same at both cut-points — the two sets of points should overlap or sit close together. Large separation between cut-points for a given predictor is a warning sign.

cut1 <- glm(I(as.numeric(genhlth_ord) <= 1) ~ age + sex + bmi + exercise + income_cat + smoker,
            data = brfss_poly, family = binomial)
cut2 <- glm(I(as.numeric(genhlth_ord) <= 2) ~ age + sex + bmi + exercise + income_cat + smoker,
            data = brfss_poly, family = binomial)

bind_rows(
  broom::tidy(cut1, conf.int = TRUE) |> mutate(cutpoint = "\u2264 Excellent/VG"),
  broom::tidy(cut2, conf.int = TRUE) |> mutate(cutpoint = "\u2264 Good")
) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate, y = term, color = cutpoint, shape = cutpoint)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  position = position_dodge(width = 0.5)) +
  labs(
    title = "Visual Check: Proportional Odds Assumption",
    subtitle = "Coefficients should overlap across cut-points if assumption holds",
    x = "Log-Odds Coefficient",
    y = "Predictor",
    color = "Cut-point",
    shape = "Cut-point"
  ) +
  theme_minimal()

Predictors whose point estimates are far apart between cut-points (e.g., smokerCurrent, sexFemale) are candidates for violating the assumption — confirm with the Brant test below.

Formal Test: Brant Test

  • Omnibus p > 0.05: assumption holds for the model overall
  • Predictor-level p < 0.05: assumption fails for that variable
brant_out <- brant(mod_ord)
## -------------------------------------------- 
## Test for X2  df  probability 
## -------------------------------------------- 
## Omnibus      35.87   6   0
## age      0.06    1   0.81
## sexFemale    0.89    1   0.34
## bmi      7.1 1   0.01
## exerciseYes  10.58   1   0
## income_cat   10.54   1   0
## smokerCurrent    7.76    1   0.01
## -------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds
as.data.frame(brant_out) |>
  tibble::rownames_to_column("Predictor") |>
  mutate(
    X2          = round(X2, 2),
    probability = ifelse(probability < 0.001, "< 0.001", sprintf("%.3f", probability)),
    Fails       = ifelse(Predictor == "Omnibus",
                         ifelse(as.numeric(gsub("< ", "", probability)) < 0.05 | probability == "< 0.001", "Yes (overall)", "No (overall)"),
                         ifelse(probability == "< 0.001" | as.numeric(gsub("< ", "", probability)) < 0.05, "⚠ Yes", ""))
  ) |>
  kable(
    col.names = c("Predictor", "X²", "df", "p-value", "Assumption Violated?"),
    caption   = "Brant Test for Proportional Odds Assumption",
    align     = c("l", "r", "r", "r", "c")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(0, bold = TRUE) |>
  row_spec(1, bold = TRUE, background = "#f0f0f0")
Brant Test for Proportional Odds Assumption
Predictor df p-value Assumption Violated?
Omnibus 35.87 6 < 0.001 Yes (overall)
age 0.06 1 0.814
sexFemale 0.89 1 0.345
bmi 7.10 1 0.008 ⚠ Yes
exerciseYes 10.58 1 0.001 ⚠ Yes
income_cat 10.54 1 0.001 ⚠ Yes
smokerCurrent 7.76 1 0.005 ⚠ Yes

What If It Fails?

  1. Use the multinomial model instead (loses parsimony, gains flexibility)
  2. Use a partial proportional odds model (some \(\beta\)s vary across cut-points)
  3. Collapse outcome categories
  4. Use a different link (continuation-ratio, adjacent-category)

Part 4: Comparing the Two Models

tibble(
  Feature = c("Outcome type", "Number of equations", "Parameters (3 cats, 5 preds)",
              "Key assumption", "Best when"),
  `Multinomial (multinom)` = c("Nominal or ordinal",
                                "K - 1 = 2",
                                "12",
                                "None beyond independence",
                                "Categories unordered or PO fails"),
  `Ordinal (polr)`         = c("Ordinal only",
                                "1 (with K - 1 cut-points)",
                                "7",
                                "Proportional odds",
                                "Categories ordered AND PO holds")
) |>
  kable(caption = "Multinomial vs. Ordinal Logistic Regression") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Multinomial vs. Ordinal Logistic Regression
Feature Multinomial (multinom) Ordinal (polr)
Outcome type Nominal or ordinal Ordinal only
Number of equations K - 1 = 2 1 (with K - 1 cut-points)
Parameters (3 cats, 5 preds) 12 7
Key assumption None beyond independence Proportional odds
Best when Categories unordered or PO fails Categories ordered AND PO holds

Likelihood Comparison

Both models are fit by maximum likelihood. We can compare their fits with AIC (lower = better):

tibble(
  Model = c("Multinomial", "Ordinal (proportional odds)"),
  AIC   = c(AIC(mod_multi), AIC(mod_ord)),
  df    = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
  mutate(AIC = round(AIC, 1)) |>
  kable(caption = "Model Comparison") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Comparison
Model AIC df
Multinomial 9315.6 14
Ordinal (proportional odds) 9334.3 8

Tradeoff: The ordinal model is more parsimonious. If proportional odds holds, it has lower AIC and tighter confidence intervals. If PO fails, the multinomial model is more honest.


Summary

Concept Key Point
Polytomous outcome Use nnet::multinom() for \(K > 2\) categories
Coefficients \(K - 1\) sets, each interpreted as log relative risk ratios vs. reference
Ordinal outcome Use MASS::polr() to exploit ordering
Cumulative logit Single set of \(\beta\)s describes all cut-points
Proportional odds Strong assumption; check with brant()
Model choice Ordinal if ordered AND PO holds; otherwise multinomial

Next Lecture (April 21)

  • Logistic regression applications and case studies
  • Putting it all together: model building strategy
  • Reporting results for publication

References

  • Kleinbaum, D. G., Kupper, L. L., Nizam, A., & Rosenberg, E. S. (2013). Applied Regression Analysis and Other Multivariable Methods (5th ed.), Chapter 23.
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.), Chapter 8.
  • Agresti, A. (2010). Analysis of Ordinal Categorical Data (2nd ed.). Wiley.

Part 5: In-Class Lab Activity

EPI 553 — Polytomous and Ordinal Regression Lab Due: End of class, April 16, 2026

Instructions

Complete the four tasks below using brfss_polytomous_2020.rds. Submit a knitted HTML file via Brightspace.

Data

Variable Description Type
genhlth_ord General health (Excellent/VG < Good < Fair/Poor) Ordered factor (outcome)
genhlth_nom General health, unordered Factor (outcome)
age Age in years Numeric
sex Male / Female Factor
bmi Body mass index Numeric
exercise Exercised in past 30 days (No/Yes) Factor
income_cat Household income (1-8) Numeric
smoker Former/Never vs. Current Factor
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet)
library(MASS)
library(brant)
library(ggeffects)
library(ggplot2)
library(dplyr)
library(MASS)
library(brant)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_poly <- readRDS("brfss_polytomous_2020.rds")

# Task 1a: Frequency table of genhlth_ord (N and %)

# Step 1: Recode numeric values to labeled ordered factor
brfss_full$genhlth_ord <- factor(
  brfss_full$genhlth,
  levels = c(1, 2, 3, 4, 5),
  labels = c("Excellent", "Very good", "Good", "Fair", "Poor"),
  ordered = TRUE
)

# Step 2: Frequency counts (exclude weird codes like 7, 9 automatically)
freq_table <- table(brfss_full$genhlth_ord)

# Step 3: Percentages
percent_table <- prop.table(freq_table) * 100

# Step 4: Combine into one table
result <- data.frame(
  Category = names(freq_table),
  N = as.vector(freq_table),
  Percent = round(as.vector(percent_table), 2)
)

# Step 5: View result
result
##    Category      N Percent
## 1 Excellent  81660   20.36
## 2 Very good 138139   34.45
## 3      Good 119502   29.80
## 4      Fair  46239   11.53
## 5      Poor  15457    3.85
# Task 1b: Stacked bar chart of general health by smoker status
# Step 1: Recode general health (same as Task 1a)
brfss_full$genhlth_ord <- factor(
  brfss_full$genhlth,
  levels = c(1, 2, 3, 4, 5),
  labels = c("Excellent", "Very good", "Good", "Fair", "Poor"),
  ordered = TRUE
)

# Step 2: Recode smoker status (based on BRFSS coding)
# 1 = Yes (smoked 100+ cigarettes) → Smoker
# 2 = No → Non-smoker
brfss_full$smoker_status <- factor(
  brfss_full$smoke100,
  levels = c(1, 2),
  labels = c("Smoker", "Non-smoker")
)

# Step 3: Remove missing/invalid values
plot_data <- brfss_full %>%
  filter(!is.na(genhlth_ord), !is.na(smoker_status))

# Step 4: Create stacked bar chart (proportions)
ggplot(plot_data, aes(x = smoker_status, fill = genhlth_ord)) +
  geom_bar(position = "fill") +
  labs(
    title = "Task 1b: Distribution of General Health by Smoker Status",
    x = "Smoker Status",
    y = "Proportion",
    fill = "General Health"
  ) +
  theme_minimal()

# Task 1c: Descriptive table stratified by genhlth_ord

# Step 1: Recode general health
brfss_full$genhlth_ord <- factor(
  brfss_full$genhlth,
  levels = c(1, 2, 3, 4, 5),
  labels = c("Excellent", "Very good", "Good", "Fair", "Poor"),
  ordered = TRUE
)

# Step 2: Recode smoker status (if not already created)
brfss_full$smoker_status <- factor(
  brfss_full$smoke100,
  levels = c(1, 2),
  labels = c("Smoker", "Non-smoker")
)

# Step 3: Create table (use dplyr::select to avoid conflict)
table1c <- brfss_full %>%
  dplyr::select(
    genhlth_ord,
    age80,        # age variable in BRFSS
    sex,          # may need adjustment
    income2,      # income variable in BRFSS
    smoker_status
  ) %>%
  dplyr::filter(!is.na(genhlth_ord)) %>%
  tbl_summary(
    by = genhlth_ord,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    missing = "no"
  ) %>%
  add_overall() %>%
  bold_labels()

# Step 4: View
table1c
Characteristic Overall
N = 400,997
1
Excellent
N = 81,660
1
Very good
N = 138,139
1
Good
N = 119,502
1
Fair
N = 46,239
1
Poor
N = 15,457
1
IMPUTED AGE VALUE COLLAPSED ABOVE 80 54 (18) 49 (18) 54 (18) 56 (18) 60 (16) 62 (14)
CALCULATED SEX VARIABLE





    1 183,451 (46%) 39,350 (48%) 62,827 (45%) 54,711 (46%) 19,862 (43%) 6,701 (43%)
    2 217,546 (54%) 42,310 (52%) 75,312 (55%) 64,791 (54%) 26,377 (57%) 8,756 (57%)
INCOME LEVEL 21 (33) 21 (32) 21 (32) 22 (33) 21 (33) 21 (34)
smoker_status





    Smoker 156,328 (41%) 22,958 (30%) 49,430 (38%) 51,045 (45%) 23,676 (54%) 9,219 (63%)
    Non-smoker 224,121 (59%) 53,981 (70%) 82,226 (62%) 62,137 (55%) 20,301 (46%) 5,476 (37%)
1 Mean (SD); n (%)
# Task 2a: Multinomial logistic regression predicting genhlth_nom

# Step 1: Create nominal outcome (NOT ordered)
brfss_full$genhlth_nom <- factor(
  brfss_full$genhlth,
  levels = c(1, 2, 3, 4, 5),
  labels = c("Excellent", "Very good", "Good", "Fair", "Poor")
)

# Step 2: Ensure predictors exist
# (reuse smoker_status from previous tasks)
brfss_full$smoker_status <- factor(
  brfss_full$smoke100,
  levels = c(1, 2),
  labels = c("Smoker", "Non-smoker")
)

# Step 3: Fit multinomial logistic regression
mod_multi <- multinom(
  genhlth_nom ~ age80 + sex + income2 + smoker_status,
  data = brfss_full
)
## # weights:  30 (20 variable)
## initial  value 612302.606595 
## iter  10 value 548049.155080
## iter  20 value 536079.862599
## final  value 530630.592425 
## converged
# Step 4: View model results
summary(mod_multi)
## Call:
## multinom(formula = genhlth_nom ~ age80 + sex + income2 + smoker_status, 
##     data = brfss_full)
## 
## Coefficients:
##           (Intercept)      age80       sex       income2
## Very good -0.04209961 0.01223817 0.1058144 -0.0005753473
## Good      -0.42159310 0.01919587 0.1066246  0.0001878391
## Fair      -2.08804789 0.03133080 0.2367775 -0.0007105785
## Poor      -3.69902162 0.04215458 0.2391839 -0.0010496320
##           smoker_statusNon-smoker
## Very good              -0.3029916
## Good                   -0.5930637
## Fair                   -0.9192344
## Poor                   -1.2620692
## 
## Std. Errors:
##           (Intercept)        age80         sex      income2
## Very good  0.02021165 0.0002573202 0.009213694 0.0001433550
## Good       0.02113140 0.0002702118 0.009570070 0.0001473836
## Fair       0.02876695 0.0003669913 0.012387311 0.0001900702
## Poor       0.04667444 0.0005984977 0.018592456 0.0002842054
##           smoker_statusNon-smoker
## Very good             0.009868125
## Good                  0.010069226
## Fair                  0.012656024
## Poor                  0.019113925
## 
## Residual Deviance: 1061261 
## AIC: 1061301
# Task 2b: Relative Risk Ratios (RRRs) with 95% CIs

# Extract model output
coef_mat <- summary(mod_multi)$coefficients
se_mat   <- summary(mod_multi)$standard.errors

# Convert to long format and merge correctly
results <- as.data.frame(coef_mat) %>%
  mutate(Outcome = rownames(coef_mat)) %>%
  pivot_longer(-Outcome, names_to = "Predictor", values_to = "Estimate") %>%
  left_join(
    as.data.frame(se_mat) %>%
      mutate(Outcome = rownames(se_mat)) %>%
      pivot_longer(-Outcome, names_to = "Predictor", values_to = "SE"),
    by = c("Outcome", "Predictor")
  ) %>%
  mutate(
    RRR = exp(Estimate),
    Lower_CI = exp(Estimate - 1.96 * SE),
    Upper_CI = exp(Estimate + 1.96 * SE)
  ) %>%
  filter(Predictor != "(Intercept)") %>%   # remove intercept
  dplyr::select(Outcome, Predictor, RRR, Lower_CI, Upper_CI) %>%
  mutate(
    RRR = round(RRR, 2),
    Lower_CI = round(Lower_CI, 2),
    Upper_CI = round(Upper_CI, 2)
  )

# View
results
## # A tibble: 16 × 5
##    Outcome   Predictor                 RRR Lower_CI Upper_CI
##    <chr>     <chr>                   <dbl>    <dbl>    <dbl>
##  1 Very good age80                    1.01     1.01     1.01
##  2 Very good sex                      1.11     1.09     1.13
##  3 Very good income2                  1        1        1   
##  4 Very good smoker_statusNon-smoker  0.74     0.72     0.75
##  5 Good      age80                    1.02     1.02     1.02
##  6 Good      sex                      1.11     1.09     1.13
##  7 Good      income2                  1        1        1   
##  8 Good      smoker_statusNon-smoker  0.55     0.54     0.56
##  9 Fair      age80                    1.03     1.03     1.03
## 10 Fair      sex                      1.27     1.24     1.3 
## 11 Fair      income2                  1        1        1   
## 12 Fair      smoker_statusNon-smoker  0.4      0.39     0.41
## 13 Poor      age80                    1.04     1.04     1.04
## 14 Poor      sex                      1.27     1.22     1.32
## 15 Poor      income2                  1        1        1   
## 16 Poor      smoker_statusNon-smoker  0.28     0.27     0.29
# Task 3d: Plot predicted probabilities of each health category across age

# Step 1: Recode outcome as an ordered factor
brfss_full$genhlth_ord <- factor(
  brfss_full$genhlth,
  levels = c(1, 2, 3, 4, 5),
  labels = c("Excellent", "Very good", "Good", "Fair", "Poor"),
  ordered = TRUE
)

# Step 2: Recode smoker status
brfss_full$smoker_status <- factor(
  brfss_full$smoke100,
  levels = c(1, 2),
  labels = c("Smoker", "Non-smoker")
)

# Step 3: Keep complete cases for model variables
plot_data <- brfss_full %>%
  dplyr::select(genhlth_ord, age80, sex, income2, smoker_status) %>%
  dplyr::filter(
    !is.na(genhlth_ord),
    !is.na(age80),
    !is.na(sex),
    !is.na(income2),
    !is.na(smoker_status)
  )

# Step 4: Fit ordinal logistic regression model
mod_ord <- polr(
  genhlth_ord ~ age80 + sex + income2 + smoker_status,
  data = plot_data,
  Hess = TRUE
)

# Step 5: Generate predicted probabilities across age
pred_age <- ggpredict(mod_ord, terms = "age80 [all]")

# Step 6: Inspect prediction object
head(pred_age)
## # Predicted probabilities of genhlth_ord
## 
## genhlth_ord: Excellent
## 
## age80 | Predicted |     95% CI
## ------------------------------
##    18 |      0.25 | 0.24, 0.25
##    19 |      0.24 | 0.24, 0.25
## 
## genhlth_ord: Very good
## 
## age80 | Predicted |     95% CI
## ------------------------------
##    18 |      0.38 | 0.37, 0.38
## 
## genhlth_ord: Good
## 
## age80 | Predicted |     95% CI
## ------------------------------
##    18 |      0.26 | 0.26, 0.27
## 
## genhlth_ord: Fair
## 
## age80 | Predicted |     95% CI
## ------------------------------
##    18 |      0.09 | 0.08, 0.09
## 
## genhlth_ord: Poor
## 
## age80 | Predicted |     95% CI
## ------------------------------
##    18 |      0.03 | 0.03, 0.03
## 
## Adjusted for:
## *           sex =   1.54
## *       income2 =  20.86
## * smoker_status = Smoker
# Step 7: Plot predicted probabilities
ggplot(pred_age, aes(x = x, y = predicted, color = response.level)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Task 3d: Predicted Probabilities of General Health by Age",
    x = "Age",
    y = "Predicted Probability",
    color = "General Health"
  ) +
  theme_minimal()

# Task 4a: Brant test

brant_test <- brant(mod_ord)
## ------------------------------------------------------------ 
## Test for         X2  df  probability 
## ------------------------------------------------------------ 
## Omnibus              1265.59 12  0
## age80                700.54  3   0
## sex              96.58   3   0
## income2              59.46   3   0
## smoker_statusNon-smoker  349.85  3   0
## ------------------------------------------------------------ 
## 
## H0: Parallel Regression Assumption holds
brant_test
##                                 X2 df   probability
## Omnibus                 1265.58748 12 1.293418e-263
## age80                    700.53809  3 1.604577e-151
## sex                       96.58034  3  8.446139e-21
## income2                   59.45887  3  7.670916e-13
## smoker_statusNon-smoker  349.84935  3  1.608033e-75

Task 1: Explore the Outcome (15 points)

1a. (5 pts) Create a frequency table of genhlth_ord showing N and percentage for each category.

The majority of respondents reported “Very good” (34.45%) or “Good” (29.80%) health, while a smaller proportion reported “Fair” (11.53%) or “Poor” (3.85%) health.

1b. (5 pts) Create a stacked bar chart showing the distribution of general health by smoker status.

The distribution of general health differs by smoker status. Non-smokers have a higher proportion of individuals reporting “Excellent” and “Very good” health compared to smokers. In contrast, smokers have a higher proportion of individuals reporting “Fair” and “Poor” health. This suggests that smoking status is associated with worse self-reported general health.

1c. (5 pts) Use tbl_summary() to create a descriptive table stratified by genhlth_ord with at least 4 predictors.

Individuals reporting poorer general health tend to be older on average. Additionally, the proportion of smokers increases as general health declines, with the highest percentage of smokers observed among those reporting “Poor” health. Income levels and sex distribution show more modest variation across health categories.

Task 2: Multinomial Logistic Regression (20 points)

2a. (5 pts) Fit a multinomial logistic regression model predicting genhlth_nom from at least 4 predictors using multinom().

A multinomial logistic regression model was fit to examine the association between age, sex, income, and smoking status with general health status (nominal outcome). The model converged successfully and included four predictors: age (age80), sex, income (income2), and smoker status. “Excellent” health was used as the reference category.

2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.

Very Good vs Excellent

age80: exp(0.0122) ≈ 1.01

sex: exp(0.1058) ≈ 1.11

income2: exp(-0.0006) ≈ 1.00

Non-smoker: exp(-0.3030) ≈ 0.74

Good vs Excellent

age80: exp(0.0192) ≈ 1.02

sex: exp(0.1066) ≈ 1.11

income2: exp(0.0002) ≈ 1.00

Non-smoker: exp(-0.5931) ≈ 0.55

Fair vs Excellent

age80: exp(0.0313) ≈ 1.03

sex: exp(0.2368) ≈ 1.27

income2: exp(-0.0007) ≈ 1.00

Non-smoker: exp(-0.9192) ≈ 0.40

Poor vs Excellent

age80: exp(0.0422) ≈ 1.04

sex: exp(0.2392) ≈ 1.27

income2: exp(-0.0010) ≈ 1.00

Non-smoker: exp(-1.2621) ≈ 0.28

Relative risk ratios (RRRs) indicate that increasing age is associated with higher relative risk of reporting worse health (Good, Fair, and Poor vs Excellent), with effects increasing across categories. Non-smokers have substantially lower relative risk of poorer health compared to smokers, with the strongest effect observed for “Poor” health (RRR ≈ 0.28). Sex is also associated with worse health outcomes, while income shows minimal effects.

2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison.

For smoking status, the RRR for poorer health categories (e.g., “Poor”) relative to better health categories (e.g., “Excellent”) is approximately 0.28 for non-smokers compared to smokers. This indicates that non-smokers have substantially lower relative risk of reporting poor health compared to smokers, controlling for age, sex, and income.

Task 3: Ordinal Logistic Regression (25 points)

3a. (5 pts) Fit an ordinal logistic regression model with the same predictors using polr().

An ordinal logistic regression model was fit using the polr() function to examine the association between age, sex, income, and smoking status with ordered general health status (genhlth_ord). The outcome variable was treated as an ordered factor ranging from Excellent to Poor, and the model included age (age80), sex, income (income2), and smoker status as predictors.

3b. (5 pts) Report the cumulative ORs with 95% CIs.

The ordinal logistic regression showed that older age was associated with higher odds of being in a worse general health category (OR = 1.02, 95% CI: 1.02–1.02). Non-smokers had lower odds of worse general health compared with smokers (OR = 0.57, 95% CI: 0.56–0.58). Sex was also associated with higher odds of worse health (OR = 1.11, 95% CI: 1.10–1.12), while income showed little effect (OR = 1.00, 95% CI: 1.00–1.00).

3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property.

For smoking status, the odds ratio (OR) for non-smokers compared to smokers is approximately 0.57. This means that non-smokers have about 43% lower odds of being in a worse general health category compared to smokers. Because this is an ordinal logistic regression model, this interpretation applies at every cut-point of the outcome, that is, the reduced odds apply when comparing Excellent vs. Very good/Good/Fair/Poor, Excellent/Very good vs. Good/Fair/Poor, and Excellent/Very good/Good vs. Fair/Poor, holding all other variables constant.

3d. (10 pts) Use ggpredict() to plot predicted probabilities of each health category across a continuous predictor of your choice.

The predicted probability plot shows that as age increases, the probability of reporting better health categories such as “Excellent” and “Very good” decreases, while the probability of worse health categories such as “Fair” and “Poor” increases. The probability of reporting “Good” health increases slightly with age. Overall, the results indicate a shift toward poorer health status as age increases.

Task 4: Check Assumptions and Compare (15 points)

4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?

The Brant test was used to assess the proportional odds (parallel regression) assumption. The global (omnibus) test was highly statistically significant (χ² = 1265.59, p < 0.001), indicating that the proportional odds assumption is violated. Additionally, all individual predictors (age, sex, income, and smoking status) had significant p-values (p < 0.001), further suggesting that the assumption does not hold for this model.

4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?

The multinomial logistic regression model had a lower AIC (1,061,301) than the ordinal logistic regression model (1,062,838), suggesting that the multinomial model fits the data better. This is consistent with the violation of the proportional odds assumption observed in the Brant test, which supports the use of a multinomial model.

4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences.

I would recommend reporting the multinomial logistic regression model. The Brant test indicated that the proportional odds assumption was violated (p < 0.001), meaning the ordinal model is not appropriate for these data. Additionally, the multinomial model had a lower AIC compared to the ordinal model, indicating a better overall fit.

Completion credit (25 points): Awarded for a complete, good-faith attempt. Total: 75 + 25 = 100 points.

End of Lab Activity