Part 5: In-Class Lab Activity

EPI 553 — Polytomous and Ordinal Regression Lab Due: End of class, April 16, 2026

Instructions

Complete the four tasks below using brfss_polytomous_2020.rds. Submit a knitted HTML file via Brightspace.

Data

Variable Description Type
genhlth_ord General health (Excellent/VG < Good < Fair/Poor) Ordered factor (outcome)
genhlth_nom General health, unordered Factor (outcome)
age Age in years Numeric
sex Male / Female Factor
bmi Body mass index Numeric
exercise Exercised in past 30 days (No/Yes) Factor
income_cat Household income (1-8) Numeric
smoker Former/Never vs. Current Factor
library(tidyverse)
library(haven)
library(janitor)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet)        # multinom() for polytomous regression
library(MASS)        # polr() for ordinal regression
library(brant)       # Brant test for proportional odds
library(ggeffects)
library(gt) 

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_poly <- readRDS("E:/College WORK MS in EPI/SEM 2 MS EPI/Epi553 STATS 2/brfss_polytomous_2020.rds")

# Quick peek at the data structure
head(brfss_poly)
## # A tibble: 6 × 8
##   genhlth_ord  genhlth_nom    age sex      bmi exercise income_cat smoker      
##   <ord>        <fct>        <dbl> <fct>  <dbl> <fct>         <dbl> <fct>       
## 1 Excellent/VG Excellent/VG    53 Male    34.7 Yes               6 Current     
## 2 Fair/Poor    Fair/Poor       58 Male    27.4 No                3 Former/Never
## 3 Good         Good            70 Female  45.7 No                4 Former/Never
## 4 Good         Good            67 Male    36.2 Yes               5 Former/Never
## 5 Excellent/VG Excellent/VG    50 Male    23.7 No                6 Former/Never
## 6 Good         Good            26 Male    27.8 Yes               8 Current
str(brfss_poly)
## tibble [5,000 × 8] (S3: tbl_df/tbl/data.frame)
##  $ genhlth_ord: Ord.factor w/ 3 levels "Excellent/VG"<..: 1 3 2 2 1 2 2 3 1 1 ...
##  $ genhlth_nom: Factor w/ 3 levels "Excellent/VG",..: 1 3 2 2 1 2 2 3 1 1 ...
##  $ age        : num [1:5000] 53 58 70 67 50 26 34 63 73 36 ...
##   ..- attr(*, "label")= chr "IMPUTED AGE VALUE COLLAPSED ABOVE 80"
##  $ sex        : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 2 1 1 1 ...
##  $ bmi        : num [1:5000] 34.7 27.4 45.7 36.2 23.7 ...
##  $ exercise   : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 2 2 1 2 ...
##  $ income_cat : num [1:5000] 6 3 4 5 6 8 4 3 6 8 ...
##  $ smoker     : Factor w/ 2 levels "Former/Never",..: 2 1 1 1 1 2 2 2 1 1 ...

Task 1: Explore the Outcome (15 points)

1a. (5 pts) Create a frequency table of genhlth_ord showing N and percentage for each category.

brfss_poly %>%
  dplyr::select(genhlth_ord) %>%
  tbl_summary(
    label = list(genhlth_ord = "General Health"),
    type = list(genhlth_ord = "categorical")
  ) %>%
  bold_labels() %>% 
  modify_caption("**Table 1. Frequency Distribution of General Health Status**") %>% 
  as_gt()
Table 1. Frequency Distribution of General Health Status
Characteristic N = 5,0001
General Health
    Excellent/VG 2,380 (48%)
    Good 1,624 (32%)
    Fair/Poor 996 (20%)
1 n (%)

1b. (5 pts) Create a stacked bar chart showing the distribution of general health by smoker status.

# Create data for plotting
plot_data <- brfss_poly %>%
  count(smoker, genhlth_ord) %>%
  group_by(smoker) %>%
  mutate(
    percentage = round(100 * n / sum(n), 1),
    label = paste0(n, "\n(", percentage, "%)")
  ) |>
  ungroup()
 
# Create the stacked bar chart
plot_1b <- ggplot(plot_data, 
                   aes(x = smoker, y = n, fill = genhlth_ord)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = label), 
            position = position_stack(vjust = 0.5),
            color = "white", 
            fontface = "bold",
            size = 3.5) +
  labs(
    title = "Task 1b: Distribution of General Health by Smoking Status",
    subtitle = "Stacked bar chart showing counts and percentages",
    x = "Smoking Status",
    y = "Count",
    fill = "General Health"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, face = "bold"),
    plot.subtitle = element_text(size = 10),
    legend.position = "right",
    axis.text = element_text(size = 10)
  )
 
print(plot_1b)

1c. (5 pts) Use tbl_summary() to create a descriptive table stratified by genhlth_ord with at least 4 predictors.

# Stratified by the outcome (genhlth_ord)

table_1c <- brfss_poly  %>%
  dplyr::select(
    genhlth_ord, age, sex, bmi, exercise, income_cat,
    smoker
  )  %>%
  
  tbl_summary(
    # Stratify by health outcome
    by = genhlth_ord,
    label = list(
      age = "Age (years)",
      sex = "Sex",
      bmi = "Body Mass Index",
      exercise = "Exercise (past 30 days)",
      income_cat = "Household Income Category (1-8)",
      smoker = "Smoking Status"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    type = list(
      age ~ "continuous",
      bmi ~ "continuous",
      income_cat ~ "continuous",
      sex ~ "categorical",
      exercise ~ "categorical",
      smoker ~ "categorical"
    ),
    missing = "no"  
  )  %>%
  
    # Add p-values comparing across health categories
  add_p()  %>%
  # Add overall column
  add_overall()  %>%
  # Format for display
  bold_labels() %>%
modify_caption("**Table 1c. Descriptive table by General Health Status**")

# Display the table
table_1c
Table 1c. Descriptive table by General Health Status
Characteristic Overall
N = 5,000
1
Excellent/VG
N = 2,380
1
Good
N = 1,624
1
Fair/Poor
N = 996
1
p-value2
Age (years) 57 (16) 55 (16) 58 (16) 61 (15) <0.001
Sex



0.005
    Male 2,717 (54%) 1,315 (55%) 906 (56%) 496 (50%)
    Female 2,283 (46%) 1,065 (45%) 718 (44%) 500 (50%)
Body Mass Index 28 (6) 27 (5) 29 (6) 30 (8) <0.001
Exercise (past 30 days)



<0.001
    No 1,370 (27%) 381 (16%) 463 (29%) 526 (53%)
    Yes 3,630 (73%) 1,999 (84%) 1,161 (71%) 470 (47%)
Household Income Category (1-8) 5.78 (2.14) 6.43 (1.84) 5.64 (2.09) 4.47 (2.23) <0.001
Smoking Status



<0.001
    Former/Never 3,304 (66%) 1,692 (71%) 1,015 (63%) 597 (60%)
    Current 1,696 (34%) 688 (29%) 609 (38%) 399 (40%)
1 Mean (SD); n (%)
2 Kruskal-Wallis rank sum test; Pearson’s Chi-squared test

Task 2: Multinomial Logistic Regression (20 points)

2a. (5 pts) Fit a multinomial logistic regression model predicting genhlth_nom from at least 4 predictors using multinom().

Fitting in R: nnet::multinom()

mod_multi <- multinom(
  genhlth_nom ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  trace = FALSE
)

summary(mod_multi)
## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise + 
##     income_cat + smoker, data = brfss_poly, trace = FALSE)
## 
## Coefficients:
##           (Intercept)        age  sexFemale        bmi exerciseYes income_cat
## Good        -1.317306 0.01497067 -0.1417121 0.05087197   -0.500636 -0.1693992
## Fair/Poor   -1.283880 0.02579060 -0.1257209 0.06741312   -1.338926 -0.3932631
##           smokerCurrent
## Good          0.4117899
## Fair/Poor     0.3436415
## 
## Std. Errors:
##           (Intercept)         age  sexFemale         bmi exerciseYes income_cat
## Good        0.2689762 0.002189706 0.06782701 0.005768280  0.08178912 0.01748213
## Fair/Poor   0.3296881 0.002916356 0.08583048 0.006778269  0.09155273 0.02090747
##           smokerCurrent
## Good         0.07723392
## Fair/Poor    0.09720041
## 
## Residual Deviance: 9287.557 
## AIC: 9315.557

Interpretation: Age, BMI, and current smoking consistently increase the likelihood of reporting worse health, while exercise and higher income strongly protect against poorer self‑rated health. The effect of smoking is meaningful in both comparisons: current smokers are 51% more likely to report Good and 41% more likely to report Fair/Poor health rather than Excellent/VG. Sex differences are small and statistically significant only for the Good category.

2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.

tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
  dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
        caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Multinomial Logistic Regression: RRRs vs. Excellent/VG
Outcome Predictor RRR Lower Upper p
Good (Intercept) 0.268 0.158 0.454 9.71e-07
Good age 1.015 1.011 1.019 8.10e-12
Good sexFemale 0.868 0.760 0.991 0.036679
Good bmi 1.052 1.040 1.064 < 2e-16
Good exerciseYes 0.606 0.516 0.712 9.30e-10
Good income_cat 0.844 0.816 0.874 < 2e-16
Good smokerCurrent 1.510 1.297 1.756 9.73e-08
Fair/Poor (Intercept) 0.277 0.145 0.529 9.85e-05
Fair/Poor age 1.026 1.020 1.032 < 2e-16
Fair/Poor sexFemale 0.882 0.745 1.043 0.142987
Fair/Poor bmi 1.070 1.056 1.084 < 2e-16
Fair/Poor exerciseYes 0.262 0.219 0.314 < 2e-16
Fair/Poor income_cat 0.675 0.648 0.703 < 2e-16
Fair/Poor smokerCurrent 1.410 1.165 1.706 0.000407

2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison.

Age (RRR = 1.026, 95% CI: 1.020–1.032) Older adults are more likely to report Fair/Poor rather than Excellent/VG health.Each additional year increases the relative risk of Fair/Poor health by about 2.6%, holding all else constant.

Task 3: Ordinal Logistic Regression (25 points)

3a. (5 pts) Fit an ordinal logistic regression model with the same predictors using polr().

# Using genhlth_ord (ORDERED factor) as outcome

mod_ord <- polr(
  genhlth_ord ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  Hess = TRUE  
)
# Look at the summary
summary(mod_ord)
## Call:
## polr(formula = genhlth_ord ~ age + sex + bmi + exercise + income_cat + 
##     smoker, data = brfss_poly, Hess = TRUE)
## 
## Coefficients:
##                  Value Std. Error t value
## age            0.01797   0.001867   9.627
## sexFemale     -0.12885   0.056835  -2.267
## bmi            0.04973   0.004534  10.967
## exerciseYes   -0.92167   0.064426 -14.306
## income_cat    -0.26967   0.014113 -19.107
## smokerCurrent  0.29313   0.064304   4.558
## 
## Intercepts:
##                   Value    Std. Error t value 
## Excellent/VG|Good   0.0985   0.2200     0.4478
## Good|Fair/Poor      1.8728   0.2212     8.4650
## 
## Residual Deviance: 9318.274 
## AIC: 9334.274

3b. (5 pts) Report the cumulative ORs with 95% CIs.

tidy(mod_ord, conf.int = TRUE, exponentiate = TRUE) |>
  filter(coef.type == "coefficient") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "Lower", "Upper"),
        caption = "Ordinal Logistic Regression: Cumulative ORs") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Ordinal Logistic Regression: Cumulative ORs
Predictor OR Lower Upper
age 1.018 1.014 1.022
sexFemale 0.879 0.786 0.983
bmi 1.051 1.042 1.060
exerciseYes 0.398 0.351 0.451
income_cat 0.764 0.743 0.785
smokerCurrent 1.341 1.182 1.521

3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property.

Example: BMI (OR = 1.051)
A one‑unit increase in BMI multiplies the odds of being in a worse health category by 1.051. For every additional BMI point, people have about 5% higher odds of reporting Good or worse (vs. Excellent/VG), and 5% higher odds of reporting Fair/Poor (vs. Excellent/VG or Good), the same increase in odds applies at every cut‑point of the health scale.

3d. (10 pts) Use ggpredict() to plot predicted probabilities of each health category across a continuous predictor of your choice.

Predicted Probabilities

ggpredict(mod_ord, terms = "bmi") |>
  plot() +
  labs(title = "Ordinal Model: Predicted Probability of Each Health Category",
       x = "BMI Category", y = "Predicted Probability") +
  theme_minimal()

Task 4: Check Assumptions and Compare (15 points)

4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?

Formal Test: Brant Test

  • Omnibus p > 0.05: assumption holds for the model overall
  • Predictor-level p < 0.05: assumption fails for that variable
brant_out <- brant(mod_ord)
## -------------------------------------------- 
## Test for X2  df  probability 
## -------------------------------------------- 
## Omnibus      35.87   6   0
## age      0.06    1   0.81
## sexFemale    0.89    1   0.34
## bmi      7.1 1   0.01
## exerciseYes  10.58   1   0
## income_cat   10.54   1   0
## smokerCurrent    7.76    1   0.01
## -------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds
as.data.frame(brant_out) |>
  tibble::rownames_to_column("Predictor") |>
  mutate(
    X2          = round(X2, 2),
    probability = ifelse(probability < 0.001, "< 0.001", sprintf("%.3f", probability)),
    Fails       = ifelse(Predictor == "Omnibus",
                         ifelse(as.numeric(gsub("< ", "", probability)) < 0.05 | probability == "< 0.001", "Yes (overall)", "No (overall)"),
                         ifelse(probability == "< 0.001" | as.numeric(gsub("< ", "", probability)) < 0.05, "⚠ Yes", ""))
  ) |>
  kable(
    col.names = c("Predictor", "X²", "df", "p-value", "Assumption Violated?"),
    caption   = "Brant Test for Proportional Odds Assumption",
    align     = c("l", "r", "r", "r", "c")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(0, bold = TRUE) |>
  row_spec(1, bold = TRUE, background = "#f0f0f0")
Brant Test for Proportional Odds Assumption
Predictor df p-value Assumption Violated?
Omnibus 35.87 6 < 0.001 Yes (overall)
age 0.06 1 0.814
sexFemale 0.89 1 0.345
bmi 7.10 1 0.008 ⚠ Yes
exerciseYes 10.58 1 0.001 ⚠ Yes
income_cat 10.54 1 0.001 ⚠ Yes
smokerCurrent 7.76 1 0.005 ⚠ Yes

Interpretation: Brant test indicates that the proportional‑odds assumption is violated for several key predictors (BMI, exercise, income, smoking). This means that the ordinal logistic regression model is not appropriate for those predictors if we want a single OR that applies “at every cut‑point.”

4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?

tibble(
  Feature = c("Outcome type", "Number of equations", "Parameters (3 cats, 5 preds)",
              "Key assumption", "Best when"),
  `Multinomial (multinom)` = c("Nominal or ordinal",
                                "K - 1 = 2",
                                "12",
                                "None beyond independence",
                                "Categories unordered or PO fails"),
  `Ordinal (polr)`         = c("Ordinal only",
                                "1 (with K - 1 cut-points)",
                                "7",
                                "Proportional odds",
                                "Categories ordered AND PO holds")
) |>
  kable(caption = "Multinomial vs. Ordinal Logistic Regression") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Multinomial vs. Ordinal Logistic Regression
Feature Multinomial (multinom) Ordinal (polr)
Outcome type Nominal or ordinal Ordinal only
Number of equations K - 1 = 2 1 (with K - 1 cut-points)
Parameters (3 cats, 5 preds) 12 7
Key assumption None beyond independence Proportional odds
Best when Categories unordered or PO fails Categories ordered AND PO holds
tibble(
  Model = c("Multinomial", "Ordinal (proportional odds)"),
  AIC   = c(AIC(mod_multi), AIC(mod_ord)),
  df    = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
  mutate(AIC = round(AIC, 1)) |>
  kable(caption = "Model Comparison") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Comparison
Model AIC df
Multinomial 9315.6 14
Ordinal (proportional odds) 9334.3 8

Interpretation: The multinomial model has the lower AIC (9315.6 vs. 9334.3), meaning it provides the better statistical fit to the data. Lower AIC indicates a better balance of fit and complexity.

4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences.

Although the ordinal model is more parsimonious, the Brant test shows that the proportional‑odds assumption is clearly violated for several predictors. Because this assumption does not hold, the ordinal model is not appropriate, even though it uses fewer parameters. Given both the AIC results and the failed PO assumption, the multinomial logistic regression is the model to report.

Completion credit (25 points): Awarded for a complete, good-faith attempt. Total: 75 + 25 = 100 points.

End of Lab Activity