Part 5: In-Class Lab Activity

EPI 553 — Polytomous and Ordinal Regression Lab Due: End of class, April 16, 2026

Instructions

Complete the four tasks below using brfss_polytomous_2020.rds. Submit a knitted HTML file via Brightspace.

Data

Variable	Description	Type
`genhlth_ord`	General health (Excellent/VG < Good < Fair/Poor)	Ordered factor (outcome)
`genhlth_nom`	General health, unordered	Factor (outcome)
`age`	Age in years	Numeric
`sex`	Male / Female	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income (1-8)	Numeric
`smoker`	Former/Never vs. Current	Factor

library(tidyverse)
library(haven)
library(janitor)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet)        # multinom() for polytomous regression
library(MASS)        # polr() for ordinal regression
library(brant)       # Brant test for proportional odds
library(ggeffects)
library(gt) 

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

brfss_poly <- readRDS("E:/College WORK MS in EPI/SEM 2 MS EPI/Epi553 STATS 2/brfss_polytomous_2020.rds")

# Quick peek at the data structure
head(brfss_poly)

## # A tibble: 6 × 8
##   genhlth_ord  genhlth_nom    age sex      bmi exercise income_cat smoker      
##   <ord>        <fct>        <dbl> <fct>  <dbl> <fct>         <dbl> <fct>       
## 1 Excellent/VG Excellent/VG    53 Male    34.7 Yes               6 Current     
## 2 Fair/Poor    Fair/Poor       58 Male    27.4 No                3 Former/Never
## 3 Good         Good            70 Female  45.7 No                4 Former/Never
## 4 Good         Good            67 Male    36.2 Yes               5 Former/Never
## 5 Excellent/VG Excellent/VG    50 Male    23.7 No                6 Former/Never
## 6 Good         Good            26 Male    27.8 Yes               8 Current

str(brfss_poly)

## tibble [5,000 × 8] (S3: tbl_df/tbl/data.frame)
##  $ genhlth_ord: Ord.factor w/ 3 levels "Excellent/VG"<..: 1 3 2 2 1 2 2 3 1 1 ...
##  $ genhlth_nom: Factor w/ 3 levels "Excellent/VG",..: 1 3 2 2 1 2 2 3 1 1 ...
##  $ age        : num [1:5000] 53 58 70 67 50 26 34 63 73 36 ...
##   ..- attr(*, "label")= chr "IMPUTED AGE VALUE COLLAPSED ABOVE 80"
##  $ sex        : Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 2 1 1 1 ...
##  $ bmi        : num [1:5000] 34.7 27.4 45.7 36.2 23.7 ...
##  $ exercise   : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 2 2 1 2 ...
##  $ income_cat : num [1:5000] 6 3 4 5 6 8 4 3 6 8 ...
##  $ smoker     : Factor w/ 2 levels "Former/Never",..: 2 1 1 1 1 2 2 2 1 1 ...

Task 1: Explore the Outcome (15 points)

1a. (5 pts) Create a frequency table of genhlth_ord showing N and percentage for each category.

brfss_poly %>%
  dplyr::select(genhlth_ord) %>%
  tbl_summary(
    label = list(genhlth_ord = "General Health"),
    type = list(genhlth_ord = "categorical")
  ) %>%
  bold_labels() %>% 
  modify_caption("**Table 1. Frequency Distribution of General Health Status**") %>% 
  as_gt()

**Table 1. Frequency Distribution of General Health Status**
Characteristic	N = 5,000¹
General Health
Excellent/VG	2,380 (48%)
Good	1,624 (32%)
Fair/Poor	996 (20%)
¹ n (%)

1b. (5 pts) Create a stacked bar chart showing the distribution of general health by smoker status.

# Create data for plotting
plot_data <- brfss_poly %>%
  count(smoker, genhlth_ord) %>%
  group_by(smoker) %>%
  mutate(
    percentage = round(100 * n / sum(n), 1),
    label = paste0(n, "\n(", percentage, "%)")
  ) |>
  ungroup()
 
# Create the stacked bar chart
plot_1b <- ggplot(plot_data, 
                   aes(x = smoker, y = n, fill = genhlth_ord)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = label), 
            position = position_stack(vjust = 0.5),
            color = "white", 
            fontface = "bold",
            size = 3.5) +
  labs(
    title = "Task 1b: Distribution of General Health by Smoking Status",
    subtitle = "Stacked bar chart showing counts and percentages",
    x = "Smoking Status",
    y = "Count",
    fill = "General Health"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, face = "bold"),
    plot.subtitle = element_text(size = 10),
    legend.position = "right",
    axis.text = element_text(size = 10)
  )
 
print(plot_1b)

1c. (5 pts) Use tbl_summary() to create a descriptive table stratified by genhlth_ord with at least 4 predictors.

# Stratified by the outcome (genhlth_ord)

table_1c <- brfss_poly  %>%
  dplyr::select(
    genhlth_ord, age, sex, bmi, exercise, income_cat,
    smoker
  )  %>%
  
  tbl_summary(
    # Stratify by health outcome
    by = genhlth_ord,
    label = list(
      age = "Age (years)",
      sex = "Sex",
      bmi = "Body Mass Index",
      exercise = "Exercise (past 30 days)",
      income_cat = "Household Income Category (1-8)",
      smoker = "Smoking Status"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    type = list(
      age ~ "continuous",
      bmi ~ "continuous",
      income_cat ~ "continuous",
      sex ~ "categorical",
      exercise ~ "categorical",
      smoker ~ "categorical"
    ),
    missing = "no"  
  )  %>%
  
    # Add p-values comparing across health categories
  add_p()  %>%
  # Add overall column
  add_overall()  %>%
  # Format for display
  bold_labels() %>%
modify_caption("**Table 1c. Descriptive table by General Health Status**")

# Display the table
table_1c

**Table 1c. Descriptive table by General Health Status**
Characteristic	Overall N = 5,000¹	Excellent/VG N = 2,380¹	Good N = 1,624¹	Fair/Poor N = 996¹	p-value²
Age (years)	57 (16)	55 (16)	58 (16)	61 (15)	<0.001
Sex					0.005
Male	2,717 (54%)	1,315 (55%)	906 (56%)	496 (50%)
Female	2,283 (46%)	1,065 (45%)	718 (44%)	500 (50%)
Body Mass Index	28 (6)	27 (5)	29 (6)	30 (8)	<0.001
Exercise (past 30 days)					<0.001
No	1,370 (27%)	381 (16%)	463 (29%)	526 (53%)
Yes	3,630 (73%)	1,999 (84%)	1,161 (71%)	470 (47%)
Household Income Category (1-8)	5.78 (2.14)	6.43 (1.84)	5.64 (2.09)	4.47 (2.23)	<0.001
Smoking Status					<0.001
Former/Never	3,304 (66%)	1,692 (71%)	1,015 (63%)	597 (60%)
Current	1,696 (34%)	688 (29%)	609 (38%)	399 (40%)
¹ Mean (SD); n (%)
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test

Task 2: Multinomial Logistic Regression (20 points)

2a. (5 pts) Fit a multinomial logistic regression model predicting genhlth_nom from at least 4 predictors using multinom().

Fitting in R: `nnet::multinom()`

mod_multi <- multinom(
  genhlth_nom ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  trace = FALSE
)

summary(mod_multi)

## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise + 
##     income_cat + smoker, data = brfss_poly, trace = FALSE)
## 
## Coefficients:
##           (Intercept)        age  sexFemale        bmi exerciseYes income_cat
## Good        -1.317306 0.01497067 -0.1417121 0.05087197   -0.500636 -0.1693992
## Fair/Poor   -1.283880 0.02579060 -0.1257209 0.06741312   -1.338926 -0.3932631
##           smokerCurrent
## Good          0.4117899
## Fair/Poor     0.3436415
## 
## Std. Errors:
##           (Intercept)         age  sexFemale         bmi exerciseYes income_cat
## Good        0.2689762 0.002189706 0.06782701 0.005768280  0.08178912 0.01748213
## Fair/Poor   0.3296881 0.002916356 0.08583048 0.006778269  0.09155273 0.02090747
##           smokerCurrent
## Good         0.07723392
## Fair/Poor    0.09720041
## 
## Residual Deviance: 9287.557 
## AIC: 9315.557

Interpretation: Age, BMI, and current smoking consistently increase the likelihood of reporting worse health, while exercise and higher income strongly protect against poorer self‑rated health. The effect of smoking is meaningful in both comparisons: current smokers are 51% more likely to report Good and 41% more likely to report Fair/Poor health rather than Excellent/VG. Sex differences are small and statistically significant only for the Good category.

2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.

tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
  dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
        caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Multinomial Logistic Regression: RRRs vs. Excellent/VG
Outcome	Predictor	RRR	Lower	Upper	p
Good	(Intercept)	0.268	0.158	0.454	9.71e-07
Good	age	1.015	1.011	1.019	8.10e-12
Good	sexFemale	0.868	0.760	0.991	0.036679
Good	bmi	1.052	1.040	1.064	< 2e-16
Good	exerciseYes	0.606	0.516	0.712	9.30e-10
Good	income_cat	0.844	0.816	0.874	< 2e-16
Good	smokerCurrent	1.510	1.297	1.756	9.73e-08
Fair/Poor	(Intercept)	0.277	0.145	0.529	9.85e-05
Fair/Poor	age	1.026	1.020	1.032	< 2e-16
Fair/Poor	sexFemale	0.882	0.745	1.043	0.142987
Fair/Poor	bmi	1.070	1.056	1.084	< 2e-16
Fair/Poor	exerciseYes	0.262	0.219	0.314	< 2e-16
Fair/Poor	income_cat	0.675	0.648	0.703	< 2e-16
Fair/Poor	smokerCurrent	1.410	1.165	1.706	0.000407

2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison.

Age (RRR = 1.026, 95% CI: 1.020–1.032) Older adults are more likely to report Fair/Poor rather than Excellent/VG health.Each additional year increases the relative risk of Fair/Poor health by about 2.6%, holding all else constant.

Task 3: Ordinal Logistic Regression (25 points)

3a. (5 pts) Fit an ordinal logistic regression model with the same predictors using polr().

# Using genhlth_ord (ORDERED factor) as outcome

mod_ord <- polr(
  genhlth_ord ~ age + sex + bmi + exercise + income_cat + smoker,
  data = brfss_poly,
  Hess = TRUE  
)
# Look at the summary
summary(mod_ord)

## Call:
## polr(formula = genhlth_ord ~ age + sex + bmi + exercise + income_cat + 
##     smoker, data = brfss_poly, Hess = TRUE)
## 
## Coefficients:
##                  Value Std. Error t value
## age            0.01797   0.001867   9.627
## sexFemale     -0.12885   0.056835  -2.267
## bmi            0.04973   0.004534  10.967
## exerciseYes   -0.92167   0.064426 -14.306
## income_cat    -0.26967   0.014113 -19.107
## smokerCurrent  0.29313   0.064304   4.558
## 
## Intercepts:
##                   Value    Std. Error t value 
## Excellent/VG|Good   0.0985   0.2200     0.4478
## Good|Fair/Poor      1.8728   0.2212     8.4650
## 
## Residual Deviance: 9318.274 
## AIC: 9334.274

3b. (5 pts) Report the cumulative ORs with 95% CIs.

tidy(mod_ord, conf.int = TRUE, exponentiate = TRUE) |>
  filter(coef.type == "coefficient") |>
  dplyr::select(term, estimate, conf.low, conf.high) |>
  mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
  kable(col.names = c("Predictor", "OR", "Lower", "Upper"),
        caption = "Ordinal Logistic Regression: Cumulative ORs") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Ordinal Logistic Regression: Cumulative ORs
Predictor	OR	Lower	Upper
age	1.018	1.014	1.022
sexFemale	0.879	0.786	0.983
bmi	1.051	1.042	1.060
exerciseYes	0.398	0.351	0.451
income_cat	0.764	0.743	0.785
smokerCurrent	1.341	1.182	1.521

3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property.

Example: BMI (OR = 1.051)
A one‑unit increase in BMI multiplies the odds of being in a worse health category by 1.051. For every additional BMI point, people have about 5% higher odds of reporting Good or worse (vs. Excellent/VG), and 5% higher odds of reporting Fair/Poor (vs. Excellent/VG or Good), the same increase in odds applies at every cut‑point of the health scale.

3d. (10 pts) Use ggpredict() to plot predicted probabilities of each health category across a continuous predictor of your choice.

Predicted Probabilities

ggpredict(mod_ord, terms = "bmi") |>
  plot() +
  labs(title = "Ordinal Model: Predicted Probability of Each Health Category",
       x = "BMI Category", y = "Predicted Probability") +
  theme_minimal()

Task 4: Check Assumptions and Compare (15 points)

4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?

Formal Test: Brant Test

Omnibus p > 0.05: assumption holds for the model overall
Predictor-level p < 0.05: assumption fails for that variable

brant_out <- brant(mod_ord)

## -------------------------------------------- 
## Test for X2  df  probability 
## -------------------------------------------- 
## Omnibus      35.87   6   0
## age      0.06    1   0.81
## sexFemale    0.89    1   0.34
## bmi      7.1 1   0.01
## exerciseYes  10.58   1   0
## income_cat   10.54   1   0
## smokerCurrent    7.76    1   0.01
## -------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds

as.data.frame(brant_out) |>
  tibble::rownames_to_column("Predictor") |>
  mutate(
    X2          = round(X2, 2),
    probability = ifelse(probability < 0.001, "< 0.001", sprintf("%.3f", probability)),
    Fails       = ifelse(Predictor == "Omnibus",
                         ifelse(as.numeric(gsub("< ", "", probability)) < 0.05 | probability == "< 0.001", "Yes (overall)", "No (overall)"),
                         ifelse(probability == "< 0.001" | as.numeric(gsub("< ", "", probability)) < 0.05, "⚠ Yes", ""))
  ) |>
  kable(
    col.names = c("Predictor", "X²", "df", "p-value", "Assumption Violated?"),
    caption   = "Brant Test for Proportional Odds Assumption",
    align     = c("l", "r", "r", "r", "c")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(0, bold = TRUE) |>
  row_spec(1, bold = TRUE, background = "#f0f0f0")

Brant Test for Proportional Odds Assumption
Predictor	X²	df	p-value	Assumption Violated?
Omnibus	35.87	6	< 0.001	Yes (overall)
age	0.06	1	0.814
sexFemale	0.89	1	0.345
bmi	7.10	1	0.008	⚠ Yes
exerciseYes	10.58	1	0.001	⚠ Yes
income_cat	10.54	1	0.001	⚠ Yes
smokerCurrent	7.76	1	0.005	⚠ Yes

Interpretation: Brant test indicates that the proportional‑odds assumption is violated for several key predictors (BMI, exercise, income, smoking). This means that the ordinal logistic regression model is not appropriate for those predictors if we want a single OR that applies “at every cut‑point.”

4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?

tibble(
  Feature = c("Outcome type", "Number of equations", "Parameters (3 cats, 5 preds)",
              "Key assumption", "Best when"),
  `Multinomial (multinom)` = c("Nominal or ordinal",
                                "K - 1 = 2",
                                "12",
                                "None beyond independence",
                                "Categories unordered or PO fails"),
  `Ordinal (polr)`         = c("Ordinal only",
                                "1 (with K - 1 cut-points)",
                                "7",
                                "Proportional odds",
                                "Categories ordered AND PO holds")
) |>
  kable(caption = "Multinomial vs. Ordinal Logistic Regression") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Multinomial vs. Ordinal Logistic Regression
Feature	Multinomial (multinom)	Ordinal (polr)
Outcome type	Nominal or ordinal	Ordinal only
Number of equations	K - 1 = 2	1 (with K - 1 cut-points)
Parameters (3 cats, 5 preds)	12	7
Key assumption	None beyond independence	Proportional odds
Best when	Categories unordered or PO fails	Categories ordered AND PO holds

tibble(
  Model = c("Multinomial", "Ordinal (proportional odds)"),
  AIC   = c(AIC(mod_multi), AIC(mod_ord)),
  df    = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
  mutate(AIC = round(AIC, 1)) |>
  kable(caption = "Model Comparison") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Comparison
Model	AIC	df
Multinomial	9315.6	14
Ordinal (proportional odds)	9334.3	8

Interpretation: The multinomial model has the lower AIC (9315.6 vs. 9334.3), meaning it provides the better statistical fit to the data. Lower AIC indicates a better balance of fit and complexity.

4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences.

Although the ordinal model is more parsimonious, the Brant test shows that the proportional‑odds assumption is clearly violated for several predictors. Because this assumption does not hold, the ordinal model is not appropriate, even though it uses fewer parameters. Given both the AIC results and the failed PO assumption, the multinomial logistic regression is the model to report.

Completion credit (25 points): Awarded for a complete, good-faith attempt. Total: 75 + 25 = 100 points.

End of Lab Activity

Polytomous and Ordinal Logistic Regression

EPI 553 — Principles of Statistical Inference II

Suruchi Shahi

April 20, 2026

Part 5: In-Class Lab Activity

Instructions

Data

Task 1: Explore the Outcome (15 points)

Task 2: Multinomial Logistic Regression (20 points)

Fitting in R: `nnet::multinom()`

Task 3: Ordinal Logistic Regression (25 points)

Predicted Probabilities

Task 4: Check Assumptions and Compare (15 points)

Formal Test: Brant Test

Polytomous and Ordinal Logistic Regression

EPI 553 — Principles of Statistical Inference II

Suruchi Shahi

April 20, 2026

Part 5: In-Class Lab Activity

Instructions

Data

Task 1: Explore the Outcome (15 points)

Task 2: Multinomial Logistic Regression (20 points)

Fitting in R: nnet::multinom()

Task 3: Ordinal Logistic Regression (25 points)

Predicted Probabilities

Task 4: Check Assumptions and Compare (15 points)

Formal Test: Brant Test

Fitting in R: `nnet::multinom()`