This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.
# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)
In this lab, you will:
# Load the full BRFSS 2023 dataset
brfss_clean<- read_rds("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss_subset_2023.rds")
# Display dataset dimensions
names(brfss_clean)
## [1] "diabetes" "age_group" "age_cont" "sex"
## [5] "race" "education" "income" "bmi_cat"
## [9] "phys_active" "current_smoker" "gen_health" "hypertension"
## [13] "high_chol"
### Task 1: Explore the Outcome Variable
# YOUR CODE HERE: Create a frequency table of hypertension status
freq_table <- table(brfss_clean$hypertension)
prev_ht <- mean(brfss_clean$hypertension == 1, na.rm = TRUE) * 100
prev_ht
## [1] 52.69321
# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
prev_by_age <- brfss_clean %>%
group_by(age_group) %>%
summarise(
n = n(),
ht_cases = sum(hypertension == 1, na.rm = TRUE),
prevalence = mean(hypertension == 1, na.rm = TRUE) * 100,
.groups = "drop"
)
prev_by_age
## # A tibble: 6 × 4
## age_group n ht_cases prevalence
## <fct> <int> <int> <dbl>
## 1 18-24 12 1 8.33
## 2 25-34 77 15 19.5
## 3 35-44 138 42 30.4
## 4 45-54 161 61 37.9
## 5 55-64 266 137 51.5
## 6 65+ 627 419 66.8
Questions:
What is the overall prevalence of hypertension in the data set? Ans: The overall prevalence of hypertension in the data set is 52.7%
How does hypertension prevalence vary by age group? Ans : The data shows a strong positive correlation between age and hypertension. As the age increase the prevalence of hypertension rises significantly and consistently. ***
# YOUR CODE HERE: Fit a simple logistic regression model
# Outcome: hypertension
# Predictor: age_cont
model_logistic_simple <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial(link = "logit"))
summary(model_logistic_simple)
##
## Call:
## glm(formula = hypertension ~ age_cont, family = binomial(link = "logit"),
## data = brfss_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.042577 0.295584 -10.29 <2e-16 ***
## age_cont 0.053119 0.004831 11.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1772.1 on 1280 degrees of freedom
## Residual deviance: 1632.6 on 1279 degrees of freedom
## AIC: 1636.6
##
## Number of Fisher Scoring iterations: 4
# YOUR CODE HERE: Display the results with odds ratios
# Display results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.048 | 0.296 | -10.293 | 0 | 0.026 | 0.084 |
| age_cont | 1.055 | 0.005 | 10.996 | 0 | 1.045 | 1.065 |
Questions:
What is the odds ratio for age? Interpret this value. Ans : the odds ratio for age is 1.055. For each one-year increase in age, the odds of having hypertension increase by 5.5%, holding everything else constant.
Is the association statistically significant? Ans: Since p value is 0 which is less than 0.05 and CI does not cross 1 , Age is significantly associated with hypertension.
What is the 95% confidence interval for the odds ratio? Ans: The 95% confidence interval for the odds ratio falls between 4.5% and 6.5%.
# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
model_logistic_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
phys_active + current_smoker,
data = brfss_clean,
family = binomial(link = "logit"))
# YOUR CODE HERE:# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression: hypertension ~ Age + Covariates (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
scroll_box(height = "400px")
| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.008 | 0.653 | -7.355 | 0.000 | 0.002 | 0.028 |
| age_cont | 1.061 | 0.005 | 11.234 | 0.000 | 1.050 | 1.073 |
| sexMale | 1.270 | 0.123 | 1.950 | 0.051 | 0.999 | 1.616 |
| bmi_catNormal | 2.097 | 0.546 | 1.356 | 0.175 | 0.759 | 6.756 |
| bmi_catOverweight | 3.241 | 0.543 | 2.166 | 0.030 | 1.183 | 10.385 |
| bmi_catObese | 6.585 | 0.545 | 3.459 | 0.001 | 2.394 | 21.176 |
| phys_active | 0.900 | 0.130 | -0.808 | 0.419 | 0.697 | 1.162 |
| current_smoker | 1.071 | 0.139 | 0.495 | 0.621 | 0.817 | 1.407 |
Questions:
How did the odds ratio for age change after adjusting for other variables? Ans: After adjusting for sex, BMI category, physical activity, and smoking status, the odds ratio for age is 1.061.This means that for each one-year increase in age, the odds of hypertension increase by approximately 6.1%, holding all other variables constant.Compared to the crude (unadjusted) model (which was likely very similar), the OR appears to have changed only slightly, suggesting the association between age and hypertension remains strong even after adjustment.
What does this suggest about confounding? Ans:Because the adjusted odds ratio for age remains strong and statistically significant (p < 0.001), and does not appear to change substantially from the crude estimate, this suggests:There is little evidence of strong confounding by sex, BMI, physical activity, or smoking.Age appears to be an independent predictor of hypertension.
Which variables are the strongest predictors of hypertension? Ans: Obesity, Overweight and Age are the strongest predictors of hypertension.
# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat
# Extract dummy variable coding
dummy_table <- data.frame(
bmi_cat= c("bmi_catNormal (Reference)", "bmi_catOverweight ", "bmi_catObese"),
`Dummy 1 (bmi_catOverweight )` = c(0, 1, 0),
`Dummy 2 (bmi_catObese)` = c(0, 0, 1),
check.names = FALSE
)
dummy_table %>%
kable(caption = "Dummy Variable Coding for BMI",
align = "lccc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#ffe6e6") # Highlight reference category
| bmi_cat | Dummy 1 (bmi_catOverweight ) |
Dummy 2 (bmi_catObese)
|
|---|---|---|
| bmi_catNormal (Reference) | 0 | 0 |
| bmi_catOverweight | 1 | 0 |
| bmi_catObese | 0 | 1 |
# YOUR CODE HERE: Extract and display the odds ratios for BMI categories
bmi_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "^bmi_cat")) %>%
mutate(
bmi_cat = str_remove(term, "^bmi_cat_?"),
bmi_cat = factor(bmi_cat, levels = c("bmi_catOverweight ", "bmi_catObese"))
)
bmi_or <- tidy(model_logistic_multiple,
exponentiate = TRUE,
conf.int = TRUE)
bmi_or <- bmi_or[bmi_or$term %in% c("bmi_catOverweight",
"bmi_catObese"), ]
bmi_or
## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bmi_catOverweight 3.24 0.543 2.17 0.0303 1.18 10.4
## 2 bmi_catObese 6.59 0.545 3.46 0.000542 2.39 21.2
# Add reference category
ref_row1 <- data.frame(
term = "bmi_catNormal",
estimate = 1.0,
std.error = NA,
statistic = NA,
p.value = NA,
conf.low = 1.0,
conf.high = 1.0,
bmi_cat = factor("bmi_catNormal (Ref)", levels = c("bmi_catNormal (Ref)", "bmi_catOverweight", "bmi_catObese"))
)
bmi_coefs_full <- bind_rows(ref_row1, bmi_coefs) %>%
mutate(bmi_cat = factor(bmi_cat, levels = c("bmi_catNormal(Ref)", "bmi_catOverweightt", "bmi_catObese")))
# Plot
p3 <- ggplot(bmi_coefs_full, aes(x = bmi_cat, y = estimate)) +
geom_hline(yintercept = 1, linetype = "dashed") +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
coord_flip() +
labs(
title = "Association Between BMI and Hypertension",
subtitle = "Adjusted Odds Ratios (reference: bmi_catNormal )",
x = "BMI",
y = "Odds Ratio (95% CI)"
) +
theme_minimal(base_size = 12)
ggplotly(p3)
Questions: a) What is the reference category for BMI? Ans: The reference category for BMI is bmi_catNormal.
Interpret the odds ratio for “Obese” compared to the reference category. Ans:Individuals classified as obese have 6.59 times the odds of hypertension compared to individuals with normal BMI (the reference category), adjusting for age, sex, physical activity, and smoking status.
How would you explain this to a non-statistician? Ans: People who are obese are much more likely to have high blood pressure than people with a normal weight. In this study, obese individuals were about 6 times more likely to have hypertension compared to those with normal BMI, even after accounting for age, sex, exercise, and smoking.This means obesity is strongly linked to high blood pressure in this population. ***
# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
# Model with interaction term
model_interaction <- glm(hypertension ~ age_cont *bmi_cat + sex + phys_active + current_smoker,
data = brfss_clean,
family = binomial(link = "logit"))
# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "age_cont")) %>%
kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| age_cont | 1.005 | 0.042 | 0.117 | 0.907 | 0.930 | 1.110 |
| age_cont:bmi_catNormal | 1.058 | 0.043 | 1.287 | 0.198 | 0.956 | 1.147 |
| age_cont:bmi_catOverweight | 1.064 | 0.043 | 1.431 | 0.152 | 0.962 | 1.152 |
| age_cont:bmi_catObese | 1.052 | 0.043 | 1.186 | 0.236 | 0.952 | 1.139 |
# Test if the effect of age on hypertension differs by BMI category
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "bmi_cat"))
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Probability of Hypertension by Age and BMI Category",
subtitle = "Testing for Age × BMI Interaction",
x = "Age (years)",
y = "Predicted Probability of Hypertension",
color = "BMI Category",
fill = "BMI Category"
) +
scale_y_continuous(labels = scales::percent_format()) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
ggplotly(p4)
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
model_main <- glm(hypertension ~ age_cont + bmi_cat + sex + phys_active + current_smoker,
data = brfss_clean,
family = binomial(link = "logit"))
lrt_result <- anova(model_main, model_interaction, test = "LRT")
# Display clean table
broom::tidy(lrt_result) %>%
knitr::kable(
caption = "Likelihood Ratio Test Comparing Models With and Without Age × BMI Interaction",
digits = 4
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = FALSE
)
| term | df.residual | residual.deviance | df | deviance | p.value |
|---|---|---|---|---|---|
| hypertension ~ age_cont + bmi_cat + sex + phys_active + current_smoker | 1273 | 1563.496 | NA | NA | NA |
| hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker | 1270 | 1561.260 | 3 | 2.2363 | 0.5248 |
Questions:
# YOUR CODE HERE: Calculate VIF for your multiple regression model
vif_values <- car::vif(model_logistic_multiple)
# Handle categorical variables (GVIF adjustment)
if (is.matrix(vif_values)) {
vif_df <- data.frame(
Variable = rownames(vif_values),
GVIF = vif_values[, "GVIF"],
Df = vif_values[, "Df"],
Adjusted_VIF = vif_values[, "GVIF^(1/(2*Df))"]
)
} else {
vif_df <- data.frame(
Variable = names(vif_values),
Adjusted_VIF = as.numeric(vif_values)
)
}
# Display table
knitr::kable(
vif_df,
caption = "Variance Inflation Factors (VIF) for Hypertension Logistic Model",
digits = 3
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = FALSE
)
| Variable | GVIF | Df | Adjusted_VIF | |
|---|---|---|---|---|
| age_cont | age_cont | 1.127 | 1 | 1.061 |
| sex | sex | 1.017 | 1 | 1.008 |
| bmi_cat | bmi_cat | 1.103 | 3 | 1.016 |
| phys_active | phys_active | 1.025 | 1 | 1.012 |
| current_smoker | current_smoker | 1.074 | 1 | 1.036 |
# YOUR CODE HERE: Create a Cook's distance plot to identify influential observations
# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)
# Create data frame
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
)
# Common threshold
threshold <- 4 / nrow(brfss_clean)
influence_df <- influence_df %>%
dplyr::mutate(
influential = ifelse(cooks_d > threshold, "Yes", "No")
)
# Plot
p_cook <- ggplot(influence_df, aes(x = observation, y = cooks_d)) +
geom_point(aes(color = influential), alpha = 0.6) +
geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance Plot",
subtitle = paste("Threshold =", round(threshold, 4)),
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
p_cook
# Count influential observations
sum(influence_df$influential == "Yes")
## [1] 22
Questions:
Are there any concerns about multicollinearity? Ans: Since the model standard errors are small and stable (as seen earlier in your interaction output), there is no indication of problematic multicollinearity.
Are there any influential observations that might affect your results? Ans: Yes, there are some potentially influential observations.From the Cook’s Distance plot: The threshold is approximately 0.0031 Several observations exceed this threshold (marked in red) A few points are substantially above the cutoff
What would you do if you found serious violations? Ans: I would investigate influential observations,run sensitivity analyses
# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
model_A <- glm(
hypertension ~ age_cont,
data = brfss_clean,
family = binomial(link = "logit")
)
# Model B: Age + sex + bmi_cat
model_B <- glm(
hypertension ~ age_cont + sex + bmi_cat,
data = brfss_clean,
family = binomial(link = "logit")
)
# Model C: Age + sex + bmi_cat + phys_active + current_smoker
model_C <- glm(
hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
data = brfss_clean,
family = binomial(link = "logit")
)
# YOUR CODE HERE:# Create comparison table
model_comp <- tibble::tibble(
Model = c("Model A: Age only",
"Model B: Age + sex + BMI",
"Model C: Age + sex + BMI + activity + smoking"),
AIC = c(AIC(model_A), AIC(model_B), AIC(model_C)),
BIC = c(BIC(model_A), BIC(model_B), BIC(model_C)),
Deviance = c(deviance(model_A), deviance(model_B), deviance(model_C))
) %>%
dplyr::arrange(AIC)
# Display table
model_comp %>%
knitr::kable(
caption = "Model Comparison for Hypertension (Lower AIC/BIC = Better Fit)",
digits = 2
) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = FALSE
)
| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model B: Age + sex + BMI | 1576.49 | 1607.42 | 1564.49 |
| Model C: Age + sex + BMI + activity + smoking | 1579.50 | 1620.74 | 1563.50 |
| Model A: Age only | 1636.61 | 1646.92 | 1632.61 |
Questions:
Write a brief report (1-2 pages) summarizing your findings:
Submission: Submit your completed R Markdown file and knitted HTML report.
Logistic Regression:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
Odds Ratio:
\[\text{OR} = e^{\beta_i}\]
Predicted Probability:
\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]
Session Info
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggstats_0.12.0 gtsummary_2.5.0 ggeffects_2.3.2 car_3.1-5
## [5] carData_3.0-6 broom_1.0.12 plotly_4.12.0 kableExtra_1.4.0
## [9] knitr_1.51 haven_2.5.5 lubridate_1.9.4 forcats_1.0.1
## [13] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.1 readr_2.1.6
## [17] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0 htmlwidgets_1.6.4
## [5] insight_1.4.6 tzdb_0.5.0 crosstalk_1.2.2 vctrs_0.7.1
## [9] tools_4.5.1 generics_0.1.4 datawizard_1.3.0 pkgconfig_2.0.3
## [13] data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.5
## [17] compiler_4.5.1 farver_2.1.2 textshaping_1.0.4 codetools_0.2-20
## [21] htmltools_0.5.9 sass_0.4.10 yaml_2.3.12 lazyeval_0.2.2
## [25] Formula_1.2-5 pillar_1.11.1 jquerylib_0.1.4 cachem_1.1.0
## [29] abind_1.4-8 tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
## [33] labeling_0.4.3 fastmap_1.2.0 grid_4.5.1 cli_3.6.5
## [37] magrittr_2.0.4 utf8_1.2.6 withr_3.0.2 scales_1.4.0
## [41] backports_1.5.0 timechange_0.3.0 rmarkdown_2.30 httr_1.4.7
## [45] otel_0.2.0 hms_1.1.4 evaluate_1.0.5 viridisLite_0.4.2
## [49] rlang_1.1.7 glue_1.8.0 xml2_1.5.2 svglite_2.2.2
## [53] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1