Polytomous and Ordinal Logistic Regression

R Markdown

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

plot(pressure)

Instructions

Complete the four tasks below using brfss_polytomous_2020.rds. Submit a knitted HTML file via Brightspace.

Data

Variable	Description	Type
`genhlth_ord`	General health (Excellent/VG < Good < Fair/Poor)	Ordered factor (outcome)
`genhlth_nom`	General health, unordered	Factor (outcome)
`age`	Age in years	Numeric
`sex`	Male / Female	Factor
`bmi`	Body mass index	Numeric
`exercise`	Exercised in past 30 days (No/Yes)	Factor
`income_cat`	Household income (1-8)	Numeric
`smoker`	Former/Never vs. Current	Factor

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## Warning: package 'stringr' was built under R version 4.5.2

## Warning: package 'forcats' was built under R version 4.5.2

## Warning: package 'lubridate' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(knitr)

## Warning: package 'knitr' was built under R version 4.5.2

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

library(nnet)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:gtsummary':
## 
##     select
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(brant)

## Warning: package 'brant' was built under R version 4.5.3

library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.5.2

library(dplyr)
library(nnet)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

## Setting theme "Compact"
## Setting theme "Compact"

brfss_poly <- readRDS("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\Data Sets\\brfss_polytomous_2020.rds")

#To see variabls name

names(brfss_poly)

## [1] "genhlth_ord" "genhlth_nom" "age"         "sex"         "bmi"        
## [6] "exercise"    "income_cat"  "smoker"

Task 1: Explore the Outcome (15 points)

1a. (5 pts) Create a frequency table of genhlth_ord showing N and percentage for each category.

freq_table <- brfss_poly %>%
  count(genhlth_ord) %>%
  mutate(percent = (n / sum(n)) * 100)

freq_table

## # A tibble: 3 × 3
##   genhlth_ord      n percent
##   <ord>        <int>   <dbl>
## 1 Excellent/VG  2380    47.6
## 2 Good          1624    32.5
## 3 Fair/Poor      996    19.9

1b. (5 pts) Create a stacked bar chart showing the distribution of general health by smoker status.

ggplot(brfss_poly, 
       aes(x = smoker, fill = genhlth_ord)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion",
       x = "Smoker Status",
       fill = "General Health") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()

1c. (5 pts) Use tbl_summary() to create a descriptive table stratified by genhlth_ord with at least 4 predictors.

summary_table <- brfss_poly %>%
  dplyr::select(genhlth_ord, age, smoker, sex, income_cat) %>%
  tbl_summary(by = genhlth_ord,
              statistic = list(all_continuous() ~ "{mean} ({sd})",
                               all_categorical() ~ "{n} ({p}%)"),
              missing = "no") %>%
  add_overall()

summary_table

Characteristic	Overall N = 5,000¹	Excellent/VG N = 2,380¹	Good N = 1,624¹	Fair/Poor N = 996¹
IMPUTED AGE VALUE COLLAPSED ABOVE 80	57 (16)	55 (16)	58 (16)	61 (15)
smoker
Former/Never	3,304 (66%)	1,692 (71%)	1,015 (63%)	597 (60%)
Current	1,696 (34%)	688 (29%)	609 (38%)	399 (40%)
sex
Male	2,717 (54%)	1,315 (55%)	906 (56%)	496 (50%)
Female	2,283 (46%)	1,065 (45%)	718 (44%)	500 (50%)
income_cat
1	225 (4.5%)	39 (1.6%)	72 (4.4%)	114 (11%)
2	271 (5.4%)	69 (2.9%)	92 (5.7%)	110 (11%)
3	403 (8.1%)	124 (5.2%)	128 (7.9%)	151 (15%)
4	512 (10%)	174 (7.3%)	188 (12%)	150 (15%)
5	519 (10%)	198 (8.3%)	197 (12%)	124 (12%)
6	714 (14%)	354 (15%)	249 (15%)	111 (11%)
7	844 (17%)	443 (19%)	289 (18%)	112 (11%)
8	1,512 (30%)	979 (41%)	409 (25%)	124 (12%)
¹ Mean (SD); n (%)

Task 2: Multinomial Logistic Regression (20 points)

2a. (5 pts) Fit a multinomial logistic regression model predicting genhlth_nom from at least 4 predictors using multinom().

brfss_poly <-brfss_poly %>%
mutate(genhlth_nom = factor(genhlth_nom))
multi_model <- multinom(genhlth_nom ~ age + bmi + smoker + sex,
                        data = brfss_poly)

## # weights:  18 (10 variable)
## initial  value 5493.061443 
## iter  10 value 5160.619392
## final  value 4992.064395 
## converged

2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.

multi_results <- tidy(multi_model, exponentiate = TRUE, conf.int = TRUE)
multi_results

## # A tibble: 10 × 8
##    y.level   term       estimate std.error statistic  p.value conf.low conf.high
##    <chr>     <chr>         <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 Good      (Intercep…  0.0451    0.221     -14.0   1.36e-44  0.0293    0.0696 
##  2 Good      age         1.02      0.00214     8.47  2.56e-17  1.01      1.02   
##  3 Good      bmi         1.05      0.00556     9.58  1.01e-21  1.04      1.07   
##  4 Good      smokerCur…  1.91      0.0738      8.79  1.52e-18  1.66      2.21   
##  5 Good      sexFemale   0.948     0.0662     -0.810 4.18e- 1  0.833     1.08   
##  6 Fair/Poor (Intercep…  0.00378   0.280     -19.9   2.34e-88  0.00218   0.00654
##  7 Fair/Poor age         1.04      0.00276    12.7   3.65e-37  1.03      1.04   
##  8 Fair/Poor bmi         1.08      0.00627    12.6   1.79e-36  1.07      1.10   
##  9 Fair/Poor smokerCur…  2.58      0.0877     10.8   3.38e-27  2.17      3.06   
## 10 Fair/Poor sexFemale   1.18      0.0784      2.07  3.88e- 2  1.01      1.37

2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison.

Ans:Individuals who smoke have 1.5 times higher relative risk of reporting fair/poor health versus excellent/very good health, compared to non-smokers, holding other variables constant.

Task 3: Ordinal Logistic Regression (25 points)

3a. (5 pts) Fit an ordinal logistic regression model with the same predictors using polr().

brfss_poly <-brfss_poly %>%
mutate(genhlth_ord = factor(genhlth_ord, ordered = TRUE))

ord_model <- polr(genhlth_ord ~ age + bmi + smoker + sex,
                  data = brfss_poly,
                  Hess = TRUE)

3b. (5 pts) Report the cumulative ORs with 95% CIs.

ord_results <- tidy(ord_model, exponentiate = TRUE, conf.int = TRUE)

ord_results

## # A tibble: 6 × 7
##   term              estimate std.error statistic conf.low conf.high coef.type  
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>      
## 1 age                   1.02   0.00181     13.6     1.02       1.03 coefficient
## 2 bmi                   1.06   0.00438     13.5     1.05       1.07 coefficient
## 3 smokerCurrent         2.05   0.0605      11.9     1.82       2.31 coefficient
## 4 sexFemale             1.08   0.0545       1.38    0.969      1.20 coefficient
## 5 Excellent/VG|Good    26.2    0.182       17.9    NA         NA    scale      
## 6 Good|Fair/Poor      128.     0.189       25.6    NA         NA    scale

3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property. Ans: Individuals who are current smokers have 2.05 times higher odds of being in a worse general health category compared to non-smokers, holding age, BMI, and sex constant. This effect applies at every cut-point of the outcome (e.g., Excellent/Very Good vs. Good/Fair/Poor, and Excellent/Very Good/Good vs. Fair/Poor).

3d. (10 pts) Use ggpredict() to plot predicted probabilities of each health category across a continuous predictor of your choice.

pred <- ggpredict(ord_model, terms = "age")

## Data were 'prettified'. Consider using `terms="age [all]"` to get smooth
##   plots.

plot(pred) +
  labs(x = "Age",
       y = "Predicted Probability",
       color = "Health Category") +
  theme_minimal()

Task 4: Check Assumptions and Compare (15 points)

4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?

library(brant)

brant(ord_model)

## -------------------------------------------- 
## Test for X2  df  probability 
## -------------------------------------------- 
## Omnibus      13.29   4   0.01
## age      2.02    1   0.16
## bmi      2.74    1   0.1
## smokerCurrent    1.67    1   0.2
## sexFemale    5.02    1   0.03
## -------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds

4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?

AIC(multi_model, ord_model)

##             df      AIC
## multi_model 10 10004.13
## ord_model    6 10007.16

4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences. Ans :The multinomial logistic regression model would be preferred because it has a slightly lower AIC (10004.13 vs. 10007.16), indicating a better fit to the data. Although the difference is small, the multinomial model does not rely on the proportional odds assumption and allows effects to vary across outcome categories. Therefore, it provides more flexible and potentially more accurate estimates for this dataset.