summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
plot(pressure)
Complete the four tasks below using
brfss_polytomous_2020.rds. Submit a knitted HTML file via
Brightspace.
| Variable | Description | Type |
|---|---|---|
genhlth_ord |
General health (Excellent/VG < Good < Fair/Poor) | Ordered factor (outcome) |
genhlth_nom |
General health, unordered | Factor (outcome) |
age |
Age in years | Numeric |
sex |
Male / Female | Factor |
bmi |
Body mass index | Numeric |
exercise |
Exercised in past 30 days (No/Yes) | Factor |
income_cat |
Household income (1-8) | Numeric |
smoker |
Former/Never vs. Current | Factor |
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'stringr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(nnet)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:gtsummary':
##
## select
##
## The following object is masked from 'package:dplyr':
##
## select
library(brant)
## Warning: package 'brant' was built under R version 4.5.3
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(dplyr)
library(nnet)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
## Setting theme "Compact"
## Setting theme "Compact"
brfss_poly <- readRDS("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\Data Sets\\brfss_polytomous_2020.rds")
#To see variabls name
names(brfss_poly)
## [1] "genhlth_ord" "genhlth_nom" "age" "sex" "bmi"
## [6] "exercise" "income_cat" "smoker"
1a. (5 pts) Create a frequency table of
genhlth_ord showing N and percentage for each category.
freq_table <- brfss_poly %>%
count(genhlth_ord) %>%
mutate(percent = (n / sum(n)) * 100)
freq_table
## # A tibble: 3 × 3
## genhlth_ord n percent
## <ord> <int> <dbl>
## 1 Excellent/VG 2380 47.6
## 2 Good 1624 32.5
## 3 Fair/Poor 996 19.9
1b. (5 pts) Create a stacked bar chart showing the
distribution of general health by smoker status.
ggplot(brfss_poly,
aes(x = smoker, fill = genhlth_ord)) +
geom_bar(position = "fill") +
labs(y = "Proportion",
x = "Smoker Status",
fill = "General Health") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
1c. (5 pts) Use tbl_summary() to create
a descriptive table stratified by genhlth_ord with at least
4 predictors.
summary_table <- brfss_poly %>%
dplyr::select(genhlth_ord, age, smoker, sex, income_cat) %>%
tbl_summary(by = genhlth_ord,
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"),
missing = "no") %>%
add_overall()
summary_table
| Characteristic | Overall N = 5,0001 |
Excellent/VG N = 2,3801 |
Good N = 1,6241 |
Fair/Poor N = 9961 |
|---|---|---|---|---|
| IMPUTED AGE VALUE COLLAPSED ABOVE 80 | 57 (16) | 55 (16) | 58 (16) | 61 (15) |
| smoker | ||||
| Former/Never | 3,304 (66%) | 1,692 (71%) | 1,015 (63%) | 597 (60%) |
| Current | 1,696 (34%) | 688 (29%) | 609 (38%) | 399 (40%) |
| sex | ||||
| Male | 2,717 (54%) | 1,315 (55%) | 906 (56%) | 496 (50%) |
| Female | 2,283 (46%) | 1,065 (45%) | 718 (44%) | 500 (50%) |
| income_cat | ||||
| 1 | 225 (4.5%) | 39 (1.6%) | 72 (4.4%) | 114 (11%) |
| 2 | 271 (5.4%) | 69 (2.9%) | 92 (5.7%) | 110 (11%) |
| 3 | 403 (8.1%) | 124 (5.2%) | 128 (7.9%) | 151 (15%) |
| 4 | 512 (10%) | 174 (7.3%) | 188 (12%) | 150 (15%) |
| 5 | 519 (10%) | 198 (8.3%) | 197 (12%) | 124 (12%) |
| 6 | 714 (14%) | 354 (15%) | 249 (15%) | 111 (11%) |
| 7 | 844 (17%) | 443 (19%) | 289 (18%) | 112 (11%) |
| 8 | 1,512 (30%) | 979 (41%) | 409 (25%) | 124 (12%) |
| 1 Mean (SD); n (%) | ||||
2a. (5 pts) Fit a multinomial logistic regression
model predicting genhlth_nom from at least 4 predictors
using multinom().
brfss_poly <-brfss_poly %>%
mutate(genhlth_nom = factor(genhlth_nom))
multi_model <- multinom(genhlth_nom ~ age + bmi + smoker + sex,
data = brfss_poly)
## # weights: 18 (10 variable)
## initial value 5493.061443
## iter 10 value 5160.619392
## final value 4992.064395
## converged
2b. (10 pts) Report the relative risk ratios (exponentiated coefficients) with 95% CIs in a clean table.
multi_results <- tidy(multi_model, exponentiate = TRUE, conf.int = TRUE)
multi_results
## # A tibble: 10 × 8
## y.level term estimate std.error statistic p.value conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Good (Intercep… 0.0451 0.221 -14.0 1.36e-44 0.0293 0.0696
## 2 Good age 1.02 0.00214 8.47 2.56e-17 1.01 1.02
## 3 Good bmi 1.05 0.00556 9.58 1.01e-21 1.04 1.07
## 4 Good smokerCur… 1.91 0.0738 8.79 1.52e-18 1.66 2.21
## 5 Good sexFemale 0.948 0.0662 -0.810 4.18e- 1 0.833 1.08
## 6 Fair/Poor (Intercep… 0.00378 0.280 -19.9 2.34e-88 0.00218 0.00654
## 7 Fair/Poor age 1.04 0.00276 12.7 3.65e-37 1.03 1.04
## 8 Fair/Poor bmi 1.08 0.00627 12.6 1.79e-36 1.07 1.10
## 9 Fair/Poor smokerCur… 2.58 0.0877 10.8 3.38e-27 2.17 3.06
## 10 Fair/Poor sexFemale 1.18 0.0784 2.07 3.88e- 2 1.01 1.37
2c. (5 pts) Interpret the RRR for one predictor in the “Fair/Poor vs. Excellent/VG” comparison.
Ans:Individuals who smoke have 1.5 times higher relative risk of reporting fair/poor health versus excellent/very good health, compared to non-smokers, holding other variables constant.
3a. (5 pts) Fit an ordinal logistic regression model
with the same predictors using polr().
brfss_poly <-brfss_poly %>%
mutate(genhlth_ord = factor(genhlth_ord, ordered = TRUE))
ord_model <- polr(genhlth_ord ~ age + bmi + smoker + sex,
data = brfss_poly,
Hess = TRUE)
3b. (5 pts) Report the cumulative ORs with 95% CIs.
ord_results <- tidy(ord_model, exponentiate = TRUE, conf.int = TRUE)
ord_results
## # A tibble: 6 × 7
## term estimate std.error statistic conf.low conf.high coef.type
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 age 1.02 0.00181 13.6 1.02 1.03 coefficient
## 2 bmi 1.06 0.00438 13.5 1.05 1.07 coefficient
## 3 smokerCurrent 2.05 0.0605 11.9 1.82 2.31 coefficient
## 4 sexFemale 1.08 0.0545 1.38 0.969 1.20 coefficient
## 5 Excellent/VG|Good 26.2 0.182 17.9 NA NA scale
## 6 Good|Fair/Poor 128. 0.189 25.6 NA NA scale
3c. (5 pts) Interpret one OR in plain language, making sure to mention the “at every cut-point” property. Ans: Individuals who are current smokers have 2.05 times higher odds of being in a worse general health category compared to non-smokers, holding age, BMI, and sex constant. This effect applies at every cut-point of the outcome (e.g., Excellent/Very Good vs. Good/Fair/Poor, and Excellent/Very Good/Good vs. Fair/Poor).
3d. (10 pts) Use ggpredict() to plot
predicted probabilities of each health category across a continuous
predictor of your choice.
pred <- ggpredict(ord_model, terms = "age")
## Data were 'prettified'. Consider using `terms="age [all]"` to get smooth
## plots.
plot(pred) +
labs(x = "Age",
y = "Predicted Probability",
color = "Health Category") +
theme_minimal()
4a. (5 pts) Run the Brant test for proportional odds. Does the assumption hold?
library(brant)
brant(ord_model)
## --------------------------------------------
## Test for X2 df probability
## --------------------------------------------
## Omnibus 13.29 4 0.01
## age 2.02 1 0.16
## bmi 2.74 1 0.1
## smokerCurrent 1.67 1 0.2
## sexFemale 5.02 1 0.03
## --------------------------------------------
##
## H0: Parallel Regression Assumption holds
4b. (5 pts) Compare the AIC of the multinomial and ordinal models. Which fits better?
AIC(multi_model, ord_model)
## df AIC
## multi_model 10 10004.13
## ord_model 6 10007.16
4c. (5 pts) Based on your results, which model would you recommend reporting? Justify in 2-3 sentences. Ans :The multinomial logistic regression model would be preferred because it has a slightly lower AIC (10004.13 vs. 10007.16), indicating a better fit to the data. Although the difference is small, the multinomial model does not rely on the proportional odds assumption and allows effects to vary across outcome categories. Therefore, it provides more flexible and potentially more accurate estimates for this dataset.