Part 5: In-Class Lab Activity

EPI 553 — Polytomous and Ordinal Regression Lab Due: End of class, April 16, 2026

Instructions

Complete the four tasks below using brfss_polytomous_2020.rds. Submit a knitted HTML file via Brightspace.

Data

Variable Description Type
genhlth_ord General health (Excellent/VG < Good < Fair/Poor) Ordered factor (outcome)
genhlth_nom General health, unordered Factor (outcome)
age Age in years Numeric
sex Male / Female Factor
bmi Body mass index Numeric
exercise Exercised in past 30 days (No/Yes) Factor
income_cat Household income (1-8) Numeric
smoker Former/Never vs. Current Factor
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet)
library(MASS)
library(brant)
library(ggeffects)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_poly <- readRDS("~/desktop/EPI553/data/brfss_polytomous_2020.rds")

Task 1: Explore the Outcome (15 points)

1a. (5 pts) Create a frequency table of genhlth_ord showing N and percentage for each category.

brfss_poly |>
  count(genhlth_ord) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  kable(col.names = c("General Health", "N", "%"),
        caption = "Outcome Distribution") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Outcome Distribution
General Health N %
Excellent/VG 2380 47.6
Good 1624 32.5
Fair/Poor 996 19.9

1b. (5 pts) Create a stacked bar chart showing the distribution of general health by smoker status.

brfss_poly |>
ggplot(aes(x= smoker, fill=genhlth_ord)) + 
  geom_bar(position= "fill") +
  labs(
    x= "Smoker Status", 
    y= "Count",
    fill= "General Health",
    title= "Distribution of General Health by Smoker Status"
  ) +
  scale_y_continuous(labels= scales::percent) +
  scale_fill_manual(values= c(
    "Excellent/VG"= "Darkgreen",
    "Good"= "Orange",
    "Fair/Poor"= "Blue"
  ))

1c. (5 pts) Use tbl_summary() to create a descriptive table stratified by genhlth_ord with at least 4 predictors.

brfss_poly |>
  tbl_summary(
    by = genhlth_ord,
    include = c(age, sex, bmi, exercise, income_cat, smoker),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    label = list(
      age           ~ "Age (years)",
      sex           ~ "Sex",
      bmi           ~ "BMI",
      exercise      ~ "Exercise in past 30 days",
      income_cat    ~ "Income category (1-8)",
      smoker        ~ "Smoking status"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels()
Characteristic Overall
N = 5,000
1
Excellent/VG
N = 2,380
1
Good
N = 1,624
1
Fair/Poor
N = 996
1
p-value2
Age (years) 57 (16) 55 (16) 58 (16) 61 (15) <0.001
Sex



0.005
    Male 2,717 (54%) 1,315 (55%) 906 (56%) 496 (50%)
    Female 2,283 (46%) 1,065 (45%) 718 (44%) 500 (50%)
BMI 28 (6) 27 (5) 29 (6) 30 (8) <0.001
Exercise in past 30 days 3,630 (73%) 1,999 (84%) 1,161 (71%) 470 (47%) <0.001
Income category (1-8)



<0.001
    1 225 (4.5%) 39 (1.6%) 72 (4.4%) 114 (11%)
    2 271 (5.4%) 69 (2.9%) 92 (5.7%) 110 (11%)
    3 403 (8.1%) 124 (5.2%) 128 (7.9%) 151 (15%)
    4 512 (10%) 174 (7.3%) 188 (12%) 150 (15%)
    5 519 (10%) 198 (8.3%) 197 (12%) 124 (12%)
    6 714 (14%) 354 (15%) 249 (15%) 111 (11%)
    7 844 (17%) 443 (19%) 289 (18%) 112 (11%)
    8 1,512 (30%) 979 (41%) 409 (25%) 124 (12%)
Smoking status



<0.001
    Former/Never 3,304 (66%) 1,692 (71%) 1,015 (63%) 597 (60%)
    Current 1,696 (34%) 688 (29%) 609 (38%) 399 (40%)
1 Mean (SD); n (%)
2 Kruskal-Wallis rank sum test; Pearson’s Chi-squared test

Task 2: Multinomial Logistic Regression (20 points)

2a. (5 pts) Fit a multinomial logistic regression model predicting genhlth_nom from at least 4 predictors using multinom().

mod_multi <- multinom(
  genhlth_nom ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  trace = FALSE
)

summary(mod_multi)
## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise + 
##     income_cat + smoker, data = brfss_poly, trace = FALSE)
## 
## Coefficients:
##           (Intercept)        age  sexFemale        bmi exerciseYes income_cat
## Good        -1.317306 0.01497067 -0.1417121 0.05087197   -0.500636 -0.1693992
## Fair/Poor   -1.283880 0.02579060 -0.1257209 0.06741312   -1.338926 -0.3932631
##           smokerCurrent
## Good          0.4117899
## Fair/Poor     0.3436415
## 
## Std. Errors:
##           (Intercept)         age  sexFemale         bmi exerciseYes income_cat
## Good        0.2689762 0.002189706 0.06782701 0.005768280  0.08178912 0.01748213
## Fair/Poor   0.3296881 0.002916356 0.08583048 0.006778269  0.09155273 0.02090747
##           smokerCurrent
## Good         0.07723392
## Fair/Poor    0.09720041
## 
## Residual Deviance: 9287.557 
## AIC: 9315.557

2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.

tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
  dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
        caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Multinomial Logistic Regression: RRRs vs. Excellent/VG
Outcome Predictor RRR Lower Upper p
Good (Intercept) 0.268 0.158 0.454 9.71e-07
Good age 1.015 1.011 1.019 8.10e-12
Good sexFemale 0.868 0.760 0.991 0.036679
Good bmi 1.052 1.040 1.064 < 2e-16
Good exerciseYes 0.606 0.516 0.712 9.30e-10
Good income_cat 0.844 0.816 0.874 < 2e-16
Good smokerCurrent 1.510 1.297 1.756 9.73e-08
Fair/Poor (Intercept) 0.277 0.145 0.529 9.85e-05
Fair/Poor age 1.026 1.020 1.032 < 2e-16
Fair/Poor sexFemale 0.882 0.745 1.043 0.142987
Fair/Poor bmi 1.070 1.056 1.084 < 2e-16
Fair/Poor exerciseYes 0.262 0.219 0.314 < 2e-16
Fair/Poor income_cat 0.675 0.648 0.703 < 2e-16
Fair/Poor smokerCurrent 1.410 1.165 1.706 0.000407

2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison.

Holding all else constant, those who have exercised in the past 30 days have a relative risk ratio of 0.262 (95% CI: 0.219-0.314) for reporting “Fair/Poor” health vs. “Excellent/VG” compared to those who did not exercise in the past 30 days. In other words, the relative risk of being in the “Fair/Poor” category rather than the “Excellent/VG” is about 74% lower among those who exercise.

Task 3: Ordinal Logistic Regression (25 points)

3a. (5 pts) Fit an ordinal logistic regression model with the same predictors using polr().

mod_ord <- polr(
  genhlth_ord ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  Hess = TRUE
)

summary(mod_ord)
## Call:
## polr(formula = genhlth_ord ~ age + sex + bmi + exercise + income_cat + 
##     smoker, data = brfss_poly, Hess = TRUE)
## 
## Coefficients:
##                  Value Std. Error t value
## age            0.01797   0.001867   9.627
## sexFemale     -0.12885   0.056835  -2.267
## bmi            0.04973   0.004534  10.967
## exerciseYes   -0.92167   0.064426 -14.306
## income_cat    -0.26967   0.014113 -19.107
## smokerCurrent  0.29313   0.064304   4.558
## 
## Intercepts:
##                   Value    Std. Error t value 
## Excellent/VG|Good   0.0985   0.2200     0.4478
## Good|Fair/Poor      1.8728   0.2212     8.4650
## 
## Residual Deviance: 9318.274 
## AIC: 9334.274

3b. (5 pts) Report the cumulative ORs with 95% CIs.

tidy(mod_ord, conf.int = TRUE, exponentiate = TRUE) |>
  filter(coef.type == "coefficient") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "Lower", "Upper"),
        caption = "Ordinal Logistic Regression: Cumulative ORs") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Ordinal Logistic Regression: Cumulative ORs
Predictor OR Lower Upper
age 1.018 1.014 1.022
sexFemale 0.879 0.786 0.983
bmi 1.051 1.042 1.060
exerciseYes 0.398 0.351 0.451
income_cat 0.764 0.743 0.785
smokerCurrent 1.341 1.182 1.521

3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property.

The OR for smoking is 1.341. Therefore, current smokers have 1.3 times the odds of reporting worse general health compared to non-smokers. In other words, smokers have 1.3 times the odds of reporting good or worse health (vs excellent/VG) and 1.3 times the odds of reporting fair/poor health (vs excellent/VG or good). This applies at every cut point of the outcome, meaning that a one unit increase in smoking status multiples the odds of being in a worse health category by 1.3.

3d. (10 pts) Use ggpredict() to plot predicted probabilities of each health category across a continuous predictor of your choice.

ggpredict(mod_ord, terms = "age") |>
  plot() +
  labs(title = "Ordinal Model: Predicted Probability of Each Health Category",
       x = "Age", y = "Predicted Probability") +
  theme_minimal()

Task 4: Check Assumptions and Compare (15 points)

4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?

brant_out <- brant(mod_ord)
## -------------------------------------------- 
## Test for X2  df  probability 
## -------------------------------------------- 
## Omnibus      35.87   6   0
## age      0.06    1   0.81
## sexFemale    0.89    1   0.34
## bmi      7.1 1   0.01
## exerciseYes  10.58   1   0
## income_cat   10.54   1   0
## smokerCurrent    7.76    1   0.01
## -------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds
as.data.frame(brant_out) |>
  tibble::rownames_to_column("Predictor") |>
  mutate(
    X2          = round(X2, 2),
    probability = ifelse(probability < 0.001, "< 0.001", sprintf("%.3f", probability)),
    Fails       = ifelse(Predictor == "Omnibus",
                         ifelse(as.numeric(gsub("< ", "", probability)) < 0.05 | probability == "< 0.001", "Yes (overall)", "No (overall)"),
                         ifelse(probability == "< 0.001" | as.numeric(gsub("< ", "", probability)) < 0.05, "⚠ Yes", ""))
  ) |>
  kable(
    col.names = c("Predictor", "X²", "df", "p-value", "Assumption Violated?"),
    caption   = "Brant Test for Proportional Odds Assumption",
    align     = c("l", "r", "r", "r", "c")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(0, bold = TRUE) |>
  row_spec(1, bold = TRUE, background = "#f0f0f0")
Brant Test for Proportional Odds Assumption
Predictor df p-value Assumption Violated?
Omnibus 35.87 6 < 0.001 Yes (overall)
age 0.06 1 0.814
sexFemale 0.89 1 0.345
bmi 7.10 1 0.008 ⚠ Yes
exerciseYes 10.58 1 0.001 ⚠ Yes
income_cat 10.54 1 0.001 ⚠ Yes
smokerCurrent 7.76 1 0.005 ⚠ Yes

The omnibus p-value is 0 so the assumption hold for the model overall. For the predictor-level p-values, the assumption holds for age and sex variables because their p-values are greater than 0.05 (0.814 and 0.345 respectively). The assumptions do not hold for bmi (p-value= 0.008), exerciseYes (p-value= 0.001), income_cat (p-value= 0.001), and smokerCurrent (p-value= 0.005) because the p-values are less than 0.05.

4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?

tibble(
  Model = c("Multinomial", "Ordinal (proportional odds)"),
  AIC   = c(AIC(mod_multi), AIC(mod_ord)),
  df    = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
  mutate(AIC = round(AIC, 1)) |>
  kable(caption = "Model Comparison") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Comparison
Model AIC df
Multinomial 9315.6 14
Ordinal (proportional odds) 9334.3 8

Multinomial= 9315.6 Ordinal= 9334.3 The multinomial model fits best because it has a lower AIC.

4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences.

Based on my results, the multinomial logisitic regression model (AIC= 9315.6) fits the data better than the ordinal logistic regression model (AIC= 9334.3), since the lower AIC indicates a better model fit. Also, several variables failed the Brant test. Therefore I would recommend reporting the multinomial model.

Completion credit (25 points): Awarded for a complete, good-faith attempt. Total: 75 + 25 = 100 points.

End of Lab Activity