Statistical modeling is a fundamental tool in epidemiology that allows us to:
This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.
In this lab, you will:
# YOUR CODE HERE: Create a frequency table of hypertension status
desc_table <- brfss_clean %>%
group_by(hypertension) %>%
summarise(
N = n(),
) %>%
mutate(hypertension = ifelse(hypertension == 1, "Hypertension", "No Hypertension"))
desc_table %>%
kable(caption = "Frequency Table of Hypertension",
align = "lrrrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| hypertension | N |
|---|---|
| No Hypertension | 606 |
| Hypertension | 675 |
# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
desc_table <- brfss_clean %>%
group_by(age_group) %>%
summarise(
N = n(),
`% Hypertension` = round(100 * mean(hypertension), 1),
)
desc_table %>%
kable(caption = "Hypertension Status by Age Group",
align = "lrrrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| age_group | N | % Hypertension |
|---|---|---|
| 18-24 | 12 | 8.3 |
| 25-34 | 77 | 19.5 |
| 35-44 | 138 | 30.4 |
| 45-54 | 161 | 37.9 |
| 55-64 | 266 | 51.5 |
| 65+ | 627 | 66.8 |
Questions:
The overall prevalence of hypertension is 675 / 1281 = 52.7%
Hypertension prevalence tends to increase with increasing age.
# YOUR CODE HERE: Fit a simple logistic regression model
# Outcome: hypertension
# Predictor: age_cont
model_logistic_simple <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial(link = "logit"))
# Display results
tidy(model_logistic_simple, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Hypertension ~ Age",
digits = 4,
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -3.0426 | 0.2956 | -10.2934 | 0 | -3.6328 | -2.4732 |
| age_cont | 0.0531 | 0.0048 | 10.9955 | 0 | 0.0438 | 0.0627 |
# YOUR CODE HERE: Display the results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.048 | 0.296 | -10.293 | 0 | 0.026 | 0.084 |
| age_cont | 1.055 | 0.005 | 10.996 | 0 | 1.045 | 1.065 |
Questions:
The odds ratio for age is 1.055. This means that for every 1 year increase in age, there is a 5.5% increase in hypertension risk.
Yes p < 0.05.
95% CI = [1.045, 1.065]
# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
model_logistic_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
phys_active + current_smoker,
data = brfss_clean,
family = binomial(link = "logit"))
# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
scroll_box(height = "400px")| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.008 | 0.653 | -7.355 | 0.000 | 0.002 | 0.028 |
| age_cont | 1.061 | 0.005 | 11.234 | 0.000 | 1.050 | 1.073 |
| sexMale | 1.270 | 0.123 | 1.950 | 0.051 | 0.999 | 1.616 |
| bmi_catNormal | 2.097 | 0.546 | 1.356 | 0.175 | 0.759 | 6.756 |
| bmi_catOverweight | 3.241 | 0.543 | 2.166 | 0.030 | 1.183 | 10.385 |
| bmi_catObese | 6.585 | 0.545 | 3.459 | 0.001 | 2.394 | 21.176 |
| phys_active | 0.900 | 0.130 | -0.808 | 0.419 | 0.697 | 1.162 |
| current_smoker | 1.071 | 0.139 | 0.495 | 0.621 | 0.817 | 1.407 |
Questions:
After controlling for other variables, the odds ratio for age increased to 1.061.
This suggests that other variables distort the true relationship between age and hypertension and bring the estimate closer to the null.
The BMI variables are the strongest predictors of hypertension as they have the largest odds ratios.
# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat
dummy_table <- data.frame(
bmi_cat = c("Underweight", "Normal", "Overweight", "Obese"),
`Dummy 1 (Normal)` = c(0, 1, 0, 0),
`Dummy 2 (Overweight)` = c(0, 0, 1, 0),
`Dummy 3 (Obese)` = c(0, 0, 0, 1),
check.names = FALSE
)
dummy_table %>%
kable(caption = "Dummy Variable Coding for BMI_cat",
align = "lccc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#ffe6e6") # Highlight reference category| bmi_cat | Dummy 1 (Normal) | Dummy 2 (Overweight) | Dummy 3 (Obese) |
|---|---|---|---|
| Underweight | 0 | 0 | 0 |
| Normal | 1 | 0 | 0 |
| Overweight | 0 | 1 | 0 |
| Obese | 0 | 0 | 1 |
# YOUR CODE HERE: Extract and display the odds ratios for BMI categories
# Extract coefficients
bmi_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "bmi_cat")) %>%
mutate(
bmi = str_remove(term, "bmi_cat"),
bmi = factor(bmi, levels = c("Underweight", "Normal", "Overweight","Obese"))
)
# Add reference category
ref_row <- data.frame(
term = "bmi_catUnderweight",
estimate = 1.0,
std.error = 0,
statistic = NA,
p.value = NA,
conf.low = 1.0,
conf.high = 1.0,
bmi = factor("Underweight (Ref)",
levels = c("Underweight (Ref)",
"Normal",
"Overweight",
"Obese"
))
)
bmi_coefs_full <- bind_rows(ref_row, bmi_coefs) %>%
mutate(bmi = factor(bmi,
levels = c("Underweight (Ref)",
"Normal",
"Overweight",
"Obese"
)))
# Plot
p3 <- ggplot(bmi_coefs_full, aes(x = bmi, y = estimate)) +
geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
size = 0.8, color = "darkblue") +
coord_flip() +
labs(
title = "Association Between BMI and Hypertension",
subtitle = "Adjusted Odds Ratios (reference: Normal)",
x = "BMI",
y = "Odds Ratio (95% CI)"
) +
theme_minimal(base_size = 12)
ggplotly(p3)Questions:
Underweight
The odds ratio for Obese is 6.58 times that of the reference category.
Obese BMI is associated with 6.58 times the risk of hypertension compared with Underweight BMI.
# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
# Model with interaction term
model_interaction <- glm(hypertension ~ age_cont * bmi_cat,
data = brfss_clean,
family = binomial(link = "logit"))
# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "age_cont")) %>%
kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| age_cont | 1.004 | 0.042 | 0.102 | 0.918 | 0.929 | 1.108 |
| age_cont:bmi_catNormal | 1.058 | 0.043 | 1.306 | 0.192 | 0.957 | 1.147 |
| age_cont:bmi_catOverweight | 1.063 | 0.043 | 1.423 | 0.155 | 0.962 | 1.151 |
| age_cont:bmi_catObese | 1.054 | 0.042 | 1.232 | 0.218 | 0.954 | 1.140 |
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
model1 <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial)
# Model 2: Age + BMI
model2 <- glm(hypertension ~ age_cont + bmi_cat,
data = brfss_clean,
family = binomial)
# Model 3: Full model
model3 <- model_logistic_multiple
# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")
# Create comparison table
model_comp <- data.frame(
Model = c("Model 1: Age only",
"Model 2: Age + BMI",
"Model 3: Full model"),
AIC = c(AIC(model1), AIC(model2), AIC(model3)),
BIC = c(BIC(model1), BIC(model2), BIC(model3)),
`Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
check.names = FALSE
)
model_comp %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model 1: Age only | 1636.61 | 1646.92 | 1632.61 |
| Model 2: Age + BMI | 1578.06 | 1603.84 | 1568.06 |
| Model 3: Full model | 1579.50 | 1620.74 | 1563.50 |
#visualization
pred_interact <- ggpredict(model_interaction, terms = c("age_cont", "bmi_cat"))
# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Probability of Hypertension by Age and BMI",
subtitle = "Testing for Age × BMI Interaction",
x = "Age (years)",
y = "Predicted Probability of Hypertension",
color = "bmi_cat",
fill = "bmi_cat"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("Underweight" = "#E64B35", "Normal" = "#4DBBD5", "Overweight" = "#228B22", "Obese" = "#FFC000")) +
scale_fill_manual(values = c("Underweight" = "#E64B35", "Normal" = "#4DBBD5", "Overweight" = "#228B22", "Obese" = "#FFC000")) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
ggplotly(p4)Questions:
No, p>0.05
The relationship between age and hypertension is not affected by BMI. That is, BMI is not an effect modifier of the Age-Hypertension relationship.
Done.
# YOUR CODE HERE: Calculate VIF for your multiple regression model
vif_values <- vif(model_logistic_multiple)
# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
# If matrix (categorical variables present), extract GVIF^(1/(2*Df))
vif_df <- data.frame(
Variable = rownames(vif_values),
VIF = vif_values[, "GVIF^(1/(2*Df))"]
)
} else {
# If vector (only continuous variables)
vif_df <- data.frame(
Variable = names(vif_values),
VIF = as.numeric(vif_values)
)
}
# Add interpretation
vif_df <- vif_df %>%
arrange(desc(VIF)) %>%
mutate(
Interpretation = case_when(
VIF < 5 ~ "Low (No concern)",
VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
VIF >= 10 ~ "High (Problem)"
)
)
vif_df %>%
kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
digits = 2,
align = "lrc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
row_spec(which(vif_df$VIF < 5), background = "#90EE90")| Variable | VIF | Interpretation | |
|---|---|---|---|
| age_cont | age_cont | 1.06 | Low (No concern) |
| current_smoker | current_smoker | 1.04 | Low (No concern) |
| bmi_cat | bmi_cat | 1.02 | Low (No concern) |
| phys_active | phys_active | 1.01 | Low (No concern) |
| sex | sex | 1.01 | Low (No concern) |
# YOUR CODE HERE: Create a Cook's distance plot to identify influential observations
# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)
# Create data frame
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
) %>%
mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))
# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance: Identifying Influential Observations",
subtitle = "Values > 1 indicate potentially influential observations",
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
ggplotly(p5)n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")## Number of potentially influential observations: 0
Questions:
No.
No.
If I found serious violations in the VIF test I would try to deal with multicollinearity by removing one of the highly correlated predictors. If there were serious violations in Cook’s test I would consider removing influential outliers.
# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
# Model B: Age + sex + bmi_cat
# Model C: Age + sex + bmi_cat + phys_active + current_smoker
model1 <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial)
# Model 2: Age + BMI
model2 <- glm(hypertension ~ age_cont + sex + bmi_cat,
data = brfss_clean,
family = binomial)
# Model 3: Full model
model3 <- glm(hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
data = brfss_clean,
family = binomial)
# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")
# Create comparison table
model_comp <- data.frame(
Model = c("Model 1: Age only",
"Model 2: Age + sex + bmi_cat",
"Model 3: Age + sex + bmi_cat + phys_active + current_smoker"),
AIC = c(AIC(model1), AIC(model2), AIC(model3)),
BIC = c(BIC(model1), BIC(model2), BIC(model3)),
`Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
check.names = FALSE
)
model_comp %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model 1: Age only | 1636.61 | 1646.92 | 1632.61 |
| Model 2: Age + sex + bmi_cat | 1576.49 | 1607.42 | 1564.49 |
| Model 3: Age + sex + bmi_cat + phys_active + current_smoker | 1579.50 | 1620.74 | 1563.50 |
Questions:
Age + sex + bmi_cat is the best fit as it has the lowest AIC.
No because Model 2 has the lowest AIC, suggesting it has the best fit to the data, so adding the other predictors is unnecessary.
I would use Model 2 because it has the lowest AIC and so provides the best fit for the data.
Write a brief report (1-2 pages) summarizing your findings:
In this assignment, I sought to understand the factors that are associated with hypertension in the 2023 BRFSS data. In this analysis, hypertension was treated as a binary variable, that is, an individual either did or did not have hypertension.
First, I created a frequency table of hypertension status, and presented the age-specific prevalence of hypertension in the sample. For age-specific prevalence, age was categorized into groups 18-24, 25-34, 35-44, 45-54, 55-64, and 65+.
Then, I ran a simple logistic regression model to quantify the relationship between age and hypertension, with continuous age as the predictor and hypertension status as the outcome. I found that OR=1.055, p<0.001, 95% CI [1.045,1.065], indicating that every 1 year age increase resulted in a 5.5% increased risk of hypertension.
I also ran a multiple logistic regression model with hypertension status as the outcome and age, sex, BMI, physical activity, and smoking status as the predictors. For this analysis, sex was a binary variable (M/F), BMI was categorized into four groups (Underweight, Normal, Overweight, Underweight), Physical Activity was a binary variable (No Physical Activity / Physical Activity), and Smoking Status was a binary variable (Current Smoker / Not a current smoker). After adjusting for these other variables, I found that the odds ratio for age increased slightly, to OR=1.061, p<0.001, 95% CI[1.050, 1.073]. This indicates that the aforementioned variables were confounders that distorted the true relationship between age and hypertension, bringing the estimate closer to the null value.
I also tested for a potential interaction effect between age and BMI. No statistically significant interaction term was found, that is, the effect of age on hypertension was not affected by BMI. However, a likelihood ratio test indicated that the Age+BMI model was a better fit for the data than the age only model, as it had a lower AIC and BIC (AIC: 1578.06, BIC: 1603.84, Deviance: 1568.06)
Model diagnostics for multicollinearity (VIF) and influential outliers (Cook’s Distance) did not indicate any serious concerns with the data. For VIF, none of the variables in the multiple regression were significantly correlated with one another. For the Cook’s Distance, there were no influential outliers present.
A final model comparison indicated that the model with Age, Sex, and BMI was the best fit for the data based on AIC and BIC (AIC: 1576.49, BIC: 1607.42, Deviance: 1564.49). This model was the better fit compared to the Age Only model and a model with Age, sex, BMI, physical activity, and smoking status.
There are some potential limitations with this analysis. It is not possible to determine causation with this analysis. In this analysis, I used data from a single point in time, and determined associations between variables. Thus, it is not possible to make inferences about the direction of any particular association. Also, linear regression assumes a linear relationship between variables, but this may not always be the case. Thus, when variables display a non-linear relationship, the results of linear regression may not paint an accurate picture of the true relationship in the population.
Submission: Submit your completed R Markdown file and knitted HTML report.
Logistic Regression:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
Odds Ratio:
\[\text{OR} = e^{\beta_i}\]
Predicted Probability:
\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]
Session Info
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggstats_0.12.0 gtsummary_2.5.0 ggeffects_2.3.2 car_3.1-3
## [5] carData_3.0-5 broom_1.0.11 plotly_4.12.0 kableExtra_1.4.0
## [9] knitr_1.51 haven_2.5.5 lubridate_1.9.4 forcats_1.0.1
## [13] stringr_1.6.0 dplyr_1.1.4 purrr_1.2.1 readr_2.1.6
## [17] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.9.0 htmlwidgets_1.6.4
## [5] insight_1.4.4 tzdb_0.5.0 vctrs_0.7.0 tools_4.5.1
## [9] crosstalk_1.2.2 generics_0.1.4 datawizard_1.3.0 pkgconfig_2.0.3
## [13] data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.5
## [17] compiler_4.5.1 farver_2.1.2 textshaping_1.0.4 htmltools_0.5.9
## [21] sass_0.4.10 yaml_2.3.12 lazyeval_0.2.2 Formula_1.2-5
## [25] pillar_1.11.1 jquerylib_0.1.4 cachem_1.1.0 abind_1.4-8
## [29] tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7 labeling_0.4.3
## [33] fastmap_1.2.0 grid_4.5.1 cli_3.6.5 magrittr_2.0.4
## [37] withr_3.0.2 scales_1.4.0 backports_1.5.0 timechange_0.3.0
## [41] rmarkdown_2.30 httr_1.4.7 otel_0.2.0 hms_1.1.4
## [45] evaluate_1.0.5 viridisLite_0.4.2 rlang_1.1.7 glue_1.8.0
## [49] xml2_1.5.2 svglite_2.2.2 rstudioapi_0.18.0 jsonlite_2.0.0
## [53] R6_2.6.1 systemfonts_1.3.1