Binary logistic regression handles two-category outcomes. But many epidemiologic outcomes have more than two categories:
When the response variable has \(K > 2\) categories, we use polytomous logistic regression (also called multinomial logistic regression). When the categories have a natural order, we can take advantage of that ordering with ordinal logistic regression.
Today’s roadmap (Kleinbaum Ch. 23): 1. Polytomous (multinomial) logistic regression for nominal outcomes 2. Ordinal logistic regression for ordered outcomes 3. The proportional odds assumption 4. Choosing between models
Suppose general health has 3 categories: Excellent, Good, Poor. A naive approach would be to dichotomize and run a binary logistic regression repeatedly:
Problems with this approach:
A polytomous model fits all categories simultaneously in a single likelihood, sharing information across comparisons.
library(tidyverse)
library(haven)
library(janitor)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet) # multinom() for polytomous regression
library(MASS) # polr() for ordinal regression
library(brant) # Brant test for proportional odds
library(ggeffects)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))brfss_full <- read_xpt(
"/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_2020"
) |>
clean_names()The BRFSS variable genhlth asks: “Would you say that
in general your health is…”
1 = Excellent, 2 = Very Good, 3 = Good, 4 = Fair, 5 = Poor
We collapse this into a 3-level ordinal outcome to keep models tractable for class:
brfss_poly <- brfss_full |>
mutate(
# 3-level ordinal outcome
genhlth_3 = case_when(
genhlth %in% c(1, 2) ~ "Excellent/VG",
genhlth == 3 ~ "Good",
genhlth %in% c(4, 5) ~ "Fair/Poor",
TRUE ~ NA_character_
),
# Ordered factor for ordinal regression
genhlth_ord = factor(genhlth_3,
levels = c("Excellent/VG", "Good", "Fair/Poor"),
ordered = TRUE),
# Unordered factor for polytomous regression
genhlth_nom = factor(genhlth_3,
levels = c("Excellent/VG", "Good", "Fair/Poor")),
# Predictors
age = age80,
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
smoker = factor(case_when(
smokday2 %in% c(1, 2) ~ "Current",
smokday2 == 3 ~ "Former/Never",
TRUE ~ NA_character_
), levels = c("Former/Never", "Current"))
) |>
filter(
!is.na(genhlth_ord), !is.na(age), age >= 18, !is.na(sex),
!is.na(bmi), !is.na(exercise), !is.na(income_cat), !is.na(smoker)
)
set.seed(1220)
brfss_poly <- brfss_poly |>
dplyr::select(genhlth_ord, genhlth_nom, age, sex, bmi,
exercise, income_cat, smoker) |>
slice_sample(n = 5000)
saveRDS(brfss_poly,
"/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_polytomous_2020.rds")
brfss_poly |>
count(genhlth_ord) |>
mutate(pct = round(100 * n / sum(n), 1)) |>
kable(col.names = c("General Health", "N", "%"),
caption = "Outcome Distribution") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| General Health | N | % |
|---|---|---|
| Excellent/VG | 2380 | 47.6 |
| Good | 1624 | 32.5 |
| Fair/Poor | 996 | 19.9 |
For an outcome with \(K\) categories, we choose one as the reference (say, \(Y = 1\)) and fit \(K - 1\) logit equations:
\[ \log\left[\frac{\Pr(Y = k)}{\Pr(Y = 1)}\right] = \alpha_k + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p, \quad k = 2, \ldots, K \]
Each comparison gets its own intercept and its own slopes. With 3 outcome categories and 5 predictors, that’s \(2 \times (1 + 5) = 12\) parameters.
The probabilities are recovered as:
\[ \Pr(Y = k \mid X) = \frac{\exp(\alpha_k + X^T \beta_k)}{1 + \sum_{j=2}^{K} \exp(\alpha_j + X^T \beta_j)} \]
By construction, the probabilities sum to 1.
nnet::multinom()mod_multi <- multinom(
genhlth_nom ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly,
trace = FALSE
)
summary(mod_multi)## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise +
## income_cat + smoker, data = brfss_poly, trace = FALSE)
##
## Coefficients:
## (Intercept) age sexFemale bmi exerciseYes income_cat
## Good -1.317306 0.01497067 -0.1417121 0.05087197 -0.500636 -0.1693992
## Fair/Poor -1.283880 0.02579060 -0.1257209 0.06741312 -1.338926 -0.3932631
## smokerCurrent
## Good 0.4117899
## Fair/Poor 0.3436415
##
## Std. Errors:
## (Intercept) age sexFemale bmi exerciseYes income_cat
## Good 0.2689762 0.002189706 0.06782701 0.005768280 0.08178912 0.01748213
## Fair/Poor 0.3296881 0.002916356 0.08583048 0.006778269 0.09155273 0.02090747
## smokerCurrent
## Good 0.07723392
## Fair/Poor 0.09720041
##
## Residual Deviance: 9287.557
## AIC: 9315.557
The output gives two sets of coefficients, one for each non-reference category:
Each \(\beta\) is interpreted as a log relative risk ratio (RRR). Exponentiating gives the relative risk ratio (sometimes called a relative odds ratio) for being in category \(k\) vs. the reference, per one-unit change in \(X\).
tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Outcome | Predictor | RRR | Lower | Upper | p |
|---|---|---|---|---|---|
| Good | (Intercept) | 0.268 | 0.158 | 0.454 | 9.71e-07 |
| Good | age | 1.015 | 1.011 | 1.019 | 8.10e-12 |
| Good | sexFemale | 0.868 | 0.760 | 0.991 | 0.036679 |
| Good | bmi | 1.052 | 1.040 | 1.064 | < 2e-16 |
| Good | exerciseYes | 0.606 | 0.516 | 0.712 | 9.30e-10 |
| Good | income_cat | 0.844 | 0.816 | 0.874 | < 2e-16 |
| Good | smokerCurrent | 1.510 | 1.297 | 1.756 | 9.73e-08 |
| Fair/Poor | (Intercept) | 0.277 | 0.145 | 0.529 | 9.85e-05 |
| Fair/Poor | age | 1.026 | 1.020 | 1.032 | < 2e-16 |
| Fair/Poor | sexFemale | 0.882 | 0.745 | 1.043 | 0.142987 |
| Fair/Poor | bmi | 1.070 | 1.056 | 1.084 | < 2e-16 |
| Fair/Poor | exerciseYes | 0.262 | 0.219 | 0.314 | < 2e-16 |
| Fair/Poor | income_cat | 0.675 | 0.648 | 0.703 | < 2e-16 |
| Fair/Poor | smokerCurrent | 1.410 | 1.165 | 1.706 | 0.000407 |
Example interpretation. Holding all else constant, the relative risk of reporting “Fair/Poor” health (vs. “Excellent/VG”) for current smokers is approximately X times that of former/never smokers.
When categories are ordered, the polytomous model wastes information. It treats “Good” as just as different from “Excellent” as it is from “Fair/Poor”. An ordinal model exploits the ordering and produces fewer parameters and easier interpretation.
The most common ordinal model is the proportional odds model. Define cumulative probabilities:
\[ \gamma_k = \Pr(Y \leq k \mid X) \]
The model is:
\[ \log\left[\frac{\Pr(Y \leq k)}{\Pr(Y > k)}\right] = \alpha_k - \beta_1 X_1 - \cdots - \beta_p X_p \]
Key features:
With \(K = 3\) and 5 predictors, that’s \(2 + 5 = 7\) parameters (vs. 12 for multinomial).
MASS::polr()mod_ord <- polr(
genhlth_ord ~ age + sex + bmi + exercise + income_cat + smoker,
data = brfss_poly,
Hess = TRUE
)
summary(mod_ord)## Call:
## polr(formula = genhlth_ord ~ age + sex + bmi + exercise + income_cat +
## smoker, data = brfss_poly, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## age 0.01797 0.001867 9.627
## sexFemale -0.12885 0.056835 -2.267
## bmi 0.04973 0.004534 10.967
## exerciseYes -0.92167 0.064426 -14.306
## income_cat -0.26967 0.014113 -19.107
## smokerCurrent 0.29313 0.064304 4.558
##
## Intercepts:
## Value Std. Error t value
## Excellent/VG|Good 0.0985 0.2200 0.4478
## Good|Fair/Poor 1.8728 0.2212 8.4650
##
## Residual Deviance: 9318.274
## AIC: 9334.274
tidy(mod_ord, conf.int = TRUE, exponentiate = TRUE) |>
filter(coef.type == "coefficient") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
kable(col.names = c("Predictor", "OR", "Lower", "Upper"),
caption = "Ordinal Logistic Regression: Cumulative ORs") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | Lower | Upper |
|---|---|---|---|
| age | 1.018 | 1.014 | 1.022 |
| sexFemale | 0.879 | 0.786 | 0.983 |
| bmi | 1.051 | 1.042 | 1.060 |
| exerciseYes | 0.398 | 0.351 | 0.451 |
| income_cat | 0.764 | 0.743 | 0.785 |
| smokerCurrent | 1.341 | 1.182 | 1.521 |
Interpretation. An OR of \(e^{\beta}\) for a predictor means: a one-unit increase in \(X\) multiplies the odds of being in a worse health category (i.e., \(Y > k\) vs. \(Y \leq k\)) by \(e^{\beta}\), and this is true at every cut-point.
For example, if smoking has OR = 1.6, then current smokers have 1.6 times the odds of reporting Good or worse health (vs. Excellent/VG), AND 1.6 times the odds of reporting Fair/Poor (vs. Excellent/VG or Good).
The proportional odds assumption is strong. If it fails, the single \(\beta\) per predictor is misleading.
For each predictor, plot the cumulative log-odds at each cut-point. Roughly parallel lines support proportional odds.
## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 35.87 6 0
## age 0.06 1 0.81
## sexFemale 0.89 1 0.34
## bmi 7.1 1 0.01
## exerciseYes 10.58 1 0
## income_cat 10.54 1 0
## smokerCurrent 7.76 1 0.01
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
tibble(
Feature = c("Outcome type", "Number of equations", "Parameters (3 cats, 5 preds)",
"Key assumption", "Best when"),
`Multinomial (multinom)` = c("Nominal or ordinal",
"K - 1 = 2",
"12",
"None beyond independence",
"Categories unordered or PO fails"),
`Ordinal (polr)` = c("Ordinal only",
"1 (with K - 1 cut-points)",
"7",
"Proportional odds",
"Categories ordered AND PO holds")
) |>
kable(caption = "Multinomial vs. Ordinal Logistic Regression") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Feature | Multinomial (multinom) | Ordinal (polr) |
|---|---|---|
| Outcome type | Nominal or ordinal | Ordinal only |
| Number of equations | K - 1 = 2 | 1 (with K - 1 cut-points) |
| Parameters (3 cats, 5 preds) | 12 | 7 |
| Key assumption | None beyond independence | Proportional odds |
| Best when | Categories unordered or PO fails | Categories ordered AND PO holds |
Both models are fit by maximum likelihood. We can compare their fits with AIC (lower = better):
tibble(
Model = c("Multinomial", "Ordinal (proportional odds)"),
AIC = c(AIC(mod_multi), AIC(mod_ord)),
df = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
mutate(AIC = round(AIC, 1)) |>
kable(caption = "Model Comparison") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | AIC | df |
|---|---|---|
| Multinomial | 9315.6 | 14 |
| Ordinal (proportional odds) | 9334.3 | 8 |
Tradeoff: The ordinal model is more parsimonious. If proportional odds holds, it has lower AIC and tighter confidence intervals. If PO fails, the multinomial model is more honest.
| Concept | Key Point |
|---|---|
| Polytomous outcome | Use nnet::multinom() for \(K
> 2\) categories |
| Coefficients | \(K - 1\) sets, each interpreted as log relative risk ratios vs. reference |
| Ordinal outcome | Use MASS::polr() to exploit ordering |
| Cumulative logit | Single set of \(\beta\)s describes all cut-points |
| Proportional odds | Strong assumption; check with brant() |
| Model choice | Ordinal if ordered AND PO holds; otherwise multinomial |
EPI 553 — Polytomous and Ordinal Regression Lab Due: End of class, April 16, 2026
Complete the four tasks below using
brfss_polytomous_2020.rds. Submit a knitted HTML file via
Brightspace.
| Variable | Description | Type |
|---|---|---|
genhlth_ord |
General health (Excellent/VG < Good < Fair/Poor) | Ordered factor (outcome) |
genhlth_nom |
General health, unordered | Factor (outcome) |
age |
Age in years | Numeric |
sex |
Male / Female | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income (1-8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(nnet)
library(MASS)
library(brant)
library(ggeffects)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
brfss_poly <- readRDS("brfss_polytomous_2020.rds")1a. (5 pts) Create a frequency table of
genhlth_ord showing N and percentage for each category.
# Table of General Health
brfss_poly |>
dplyr::select(genhlth_ord) |>
tbl_summary(
label = list(genhlth_ord ~ "General Health Status"),
statistic = list(all_categorical() ~ "{n} ({p}%)"),
missing = "no"
) |>
add_n() |>
bold_labels() |>
modify_caption("**Table 1. General Health Status Distribution and Percentages**")| Characteristic | N | N = 5,0001 |
|---|---|---|
| General Health Status | 5,000 | |
| Excellent/VG | 2,380 (48%) | |
| Good | 1,624 (32%) | |
| Fair/Poor | 996 (20%) | |
| 1 n (%) | ||
1b. (5 pts) Create a stacked bar chart showing the
distribution of general health by smoker status.
# Create split bar plot for gen health by smoker
ggplot(brfss_poly, aes(x = smoker, fill = genhlth_ord)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(
x = "Smoker Status",
y = "Percent",
fill = "General Health Status",
title = "Figure 1. Stacked Bar Chart for Distribution of General Health by Smoker Status"
) +
theme_minimal()1c. (5 pts) Use tbl_summary() to create
a descriptive table stratified by genhlth_ord with at least
4 predictors.
# Create Unweighted Table 1 Descriptive Statistics
brfss_poly |>
dplyr::select(genhlth_ord, age, sex, bmi, exercise, income_cat, smoker) |>
tbl_summary(
by = genhlth_ord, # Stratify Table by gen health
type = list(exercise ~ "categorical"),
label = list(
age ~ "Age Category",
sex ~ "Sex",
bmi ~ "Body Mass Index (kg/m2)",
exercise ~ "Exercise Status",
income_cat ~ "Income Category",
smoker ~ "Smoking Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
modify_caption("**Table 2. Descriptive Statistics — BRFSS 2020 Analytic Sample (n = 5,000)**") | Characteristic | N | Excellent/VG N = 2,3801 |
Good N = 1,6241 |
Fair/Poor N = 9961 |
|---|---|---|---|---|
| Age Category | 5,000 | 54.7 (16.5) | 57.9 (16.3) | 60.8 (14.6) |
| Sex | 5,000 | |||
| Male | 1,315 (55%) | 906 (56%) | 496 (50%) | |
| Female | 1,065 (45%) | 718 (44%) | 500 (50%) | |
| Body Mass Index (kg/m2) | 5,000 | 27.4 (5.2) | 29.1 (6.4) | 30.0 (8.0) |
| Exercise Status | 5,000 | |||
| No | 381 (16%) | 463 (29%) | 526 (53%) | |
| Yes | 1,999 (84%) | 1,161 (71%) | 470 (47%) | |
| Income Category | 5,000 | |||
| 1 | 39 (1.6%) | 72 (4.4%) | 114 (11%) | |
| 2 | 69 (2.9%) | 92 (5.7%) | 110 (11%) | |
| 3 | 124 (5.2%) | 128 (7.9%) | 151 (15%) | |
| 4 | 174 (7.3%) | 188 (12%) | 150 (15%) | |
| 5 | 198 (8.3%) | 197 (12%) | 124 (12%) | |
| 6 | 354 (15%) | 249 (15%) | 111 (11%) | |
| 7 | 443 (19%) | 289 (18%) | 112 (11%) | |
| 8 | 979 (41%) | 409 (25%) | 124 (12%) | |
| Smoking Status | 5,000 | |||
| Former/Never | 1,692 (71%) | 1,015 (63%) | 597 (60%) | |
| Current | 688 (29%) | 609 (38%) | 399 (40%) | |
| 1 Mean (SD); n (%) | ||||
2a. (5 pts) Fit a multinomial logistic regression
model predicting genhlth_nom from at least 4 predictors
using multinom().
# FIt a multinomial logistic regression using four predictors
mod_multi <- multinom(genhlth_nom ~ age + sex + bmi + exercise, data = brfss_poly, trace = FALSE)
summary(mod_multi)## Call:
## multinom(formula = genhlth_nom ~ age + sex + bmi + exercise,
## data = brfss_poly, trace = FALSE)
##
## Coefficients:
## (Intercept) age sexFemale bmi exerciseYes
## Good -1.712197 0.01120923 -0.06112428 0.04413147 -0.6638872
## Fair/Poor -2.729987 0.02124402 0.07853406 0.05915033 -1.6278050
##
## Std. Errors:
## (Intercept) age sexFemale bmi exerciseYes
## Good 0.2210452 0.002040544 0.06620135 0.005529328 0.07982258
## Fair/Poor 0.2700071 0.002631480 0.08099654 0.006368339 0.08666907
##
## Residual Deviance: 9764.619
## AIC: 9784.619
2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.
tidy(mod_multi, conf.int = TRUE, exponentiate = TRUE) |>
dplyr::select(y.level, term, estimate, conf.low, conf.high, p.value) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(col.names = c("Outcome", "Predictor", "RRR", "Lower", "Upper", "p"),
caption = "Multinomial Logistic Regression: RRRs vs. Excellent/VG") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Outcome | Predictor | RRR | Lower | Upper | p |
|---|---|---|---|---|---|
| Good | (Intercept) | 0.180 | 0.117 | 0.278 | 9.49e-15 |
| Good | age | 1.011 | 1.007 | 1.015 | 3.95e-08 |
| Good | sexFemale | 0.941 | 0.826 | 1.071 | 0.356 |
| Good | bmi | 1.045 | 1.034 | 1.057 | 1.45e-15 |
| Good | exerciseYes | 0.515 | 0.440 | 0.602 | < 2e-16 |
| Fair/Poor | (Intercept) | 0.065 | 0.038 | 0.111 | < 2e-16 |
| Fair/Poor | age | 1.021 | 1.016 | 1.027 | 6.86e-16 |
| Fair/Poor | sexFemale | 1.082 | 0.923 | 1.268 | 0.332 |
| Fair/Poor | bmi | 1.061 | 1.048 | 1.074 | < 2e-16 |
| Fair/Poor | exerciseYes | 0.196 | 0.166 | 0.233 | < 2e-16 |
2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison.
Holding all else constant, females have a relative risk ratio of 1.082 (95% CI: 0.923, 1.268) for reporting “Fair/Poor” health vs. “excellent/VG” compared to males. In other words, the relative risk of being in the “Fair/Poor” category (rather than “Excellent/VG”) is about 8% higher among females.
3a. (5 pts) Fit an ordinal logistic regression model
with the same predictors using polr().
# Fit ordinal logisitc model with same predictors
mod_ord <- polr(genhlth_nom ~ age + sex + bmi + exercise, data = brfss_poly, Hess = TRUE)3b. (5 pts) Report the cumulative ORs with 95% CIs.
# Create a pretty table to display
tidy(mod_ord, conf.int = TRUE, exponentiate = TRUE) |>
filter(coef.type == "coefficient") |>
dplyr::select(term, estimate, conf.low, conf.high) |>
mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) |>
kable(col.names = c("Predictor", "OR", "Lower", "Upper"),
caption = "Ordinal Logistic Regression: Cumulative ORs") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Predictor | OR | Lower | Upper |
|---|---|---|---|
| age | 1.015 | 1.012 | 1.018 |
| sexFemale | 1.015 | 0.911 | 1.131 |
| bmi | 1.046 | 1.037 | 1.055 |
| exerciseYes | 0.312 | 0.276 | 0.353 |
3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property.
BMI produces an OR of 1.046. This suggests that for every one-unit increase in BMI, the odds of being in a worse health category increases by 4.6% and this is true at every cut-point.
3d. (10 pts) Use ggpredict() to plot
predicted probabilities of each health category across a continuous
predictor of your choice.
# Create figure for health categories across body mass index
ggpredict(mod_ord, terms = "bmi[all]") |>
plot() +
labs(title = "Ordinal Model: Predicted Probability of Each Health Category",
x = "Body Mass Index (kg/m2)", y = "Predicted Probability") +
theme_minimal()It appears that in the excellent/very good category, the predicted probability of having a lower body mass index is higher compared to the fair/poor group where having a higher BMI has a higher predicted probability.
4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?
## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 20.67 4 0
## age 0.6 1 0.44
## sexFemale 2.5 1 0.11
## bmi 4.9 1 0.03
## exerciseYes 11.33 1 0
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
# Display in a pretty table
as.data.frame(brant_out) |>
tibble::rownames_to_column("Predictor") |>
mutate(
X2 = round(X2, 2),
probability = ifelse(probability < 0.001, "< 0.001", sprintf("%.3f", probability)),
Fails = ifelse(Predictor == "Omnibus",
ifelse(as.numeric(gsub("< ", "", probability)) < 0.05 | probability == "< 0.001", "Yes (overall)", "No (overall)"),
ifelse(probability == "< 0.001" | as.numeric(gsub("< ", "", probability)) < 0.05, "⚠ Yes", ""))
) |>
kable(
col.names = c("Predictor", "X²", "df", "p-value", "Assumption Violated?"),
caption = "Brant Test for Proportional Odds Assumption",
align = c("l", "r", "r", "r", "c")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE) |>
row_spec(1, bold = TRUE, background = "#f0f0f0")| Predictor | X² | df | p-value | Assumption Violated? |
|---|---|---|---|---|
| Omnibus | 20.67 | 4 | < 0.001 | Yes (overall) |
| age | 0.60 | 1 | 0.438 | |
| sexFemale | 2.50 | 1 | 0.114 | |
| bmi | 4.90 | 1 | 0.027 | ⚠ Yes |
| exerciseYes | 11.33 | 1 | < 0.001 | ⚠ Yes |
No the assumption does not hold because the Brant Test for Proportional Odds Assumption produces a p-value lower than 0.001 in the omnibus category suggesting that the overall assumption is violated. Since this failed, it is suggested to consider using the multinomial model instead which does lose parsimony but it gains flexibility.
4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?
# Compare AIC of multinomial and ordinal models
tibble(
Model = c("Multinomial", "Ordinal (proportional odds)"),
AIC = c(AIC(mod_multi), AIC(mod_ord)),
df = c(mod_multi$edf, length(coef(mod_ord)) + length(mod_ord$zeta))
) |>
mutate(AIC = round(AIC, 1)) |>
kable(caption = "Model Comparison") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model | AIC | df |
|---|---|---|
| Multinomial | 9784.6 | 10 |
| Ordinal (proportional odds) | 9796.4 | 6 |
4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences.
I would recommend using the multinomial regression instead of the ordinal(proportional odds). I would choose this because when comparing the two models, the multinomial model has a lower AIC than the ordinal model. In addition, after running the Brant test for proportional odds assumption, we discover that the assumptions are violated suggesting that the multinomial model is more honest.
Completion credit (25 points): Awarded for a complete, good-faith attempt. Total: 75 + 25 = 100 points.
End of Lab Activity