In this lab, you will:
This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.
The BRFSS is a large-scale telephone survey that collects data on health-related risk behaviors, chronic health conditions, and use of preventive services from U.S. residents.
# Load the full BRFSS 2023 dataset
brfss_subset <- read_rds("/Users/nataliasmall/Downloads/EPI 553/brfss_subset_2023.rds") %>%
janitor::clean_names()
# Display dataset dimensions
names(brfss_subset)## [1] "diabetes" "age_group" "age_cont" "sex"
## [5] "race" "education" "income" "bmi_cat"
## [9] "phys_active" "current_smoker" "gen_health" "hypertension"
## [13] "high_chol"
# YOUR CODE HERE: Create a frequency table of hypertension status
# Disclaimer: From 552
Frequency_hypstat = cbind(freq = table(brfss_subset$hypertension), percent = prop.table(table(brfss_subset$hypertension)))
# Print out the table
Frequency_hypstat## freq percent
## 0 606 0.4730679
## 1 675 0.5269321
# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
hyp_by_age <- brfss_subset %>%
group_by(age_group) %>%
summarise(
N = n(),
Mean_hyp = round(mean(hypertension, na.rm = TRUE), 2),
Pct_Hypertension = round(
sum(hypertension == 1, na.rm = TRUE) / sum(!is.na(hypertension)) * 100, 2)
)
print(hyp_by_age)## # A tibble: 6 × 4
## age_group N Mean_hyp Pct_Hypertension
## <fct> <int> <dbl> <dbl>
## 1 18-24 12 0.08 8.33
## 2 25-34 77 0.19 19.5
## 3 35-44 138 0.3 30.4
## 4 45-54 161 0.38 37.9
## 5 55-64 266 0.52 51.5
## 6 65+ 627 0.67 66.8
Questions:
# YOUR CODE HERE: Fit a simple logistic regression model
# Outcome: hypertension
# Predictor: age_cont
# YOUR CODE HERE: Display the results with odds ratios
model_logistic_simple <- glm(hypertension ~ age_cont,
data = brfss_subset,
family = binomial(link = "logit"))
# Display results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.048 | 0.296 | -10.293 | 0 | 0.026 | 0.084 |
| age_cont | 1.055 | 0.005 | 10.996 | 0 | 1.045 | 1.065 |
Questions:
# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
# YOUR CODE HERE: Display the results
model_logistic_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
phys_active + current_smoker,
data = brfss_subset,
family = binomial(link = "logit"))
# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
scroll_box(height = "400px")| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.008 | 0.653 | -7.355 | 0.000 | 0.002 | 0.028 |
| age_cont | 1.061 | 0.005 | 11.234 | 0.000 | 1.050 | 1.073 |
| sexMale | 1.270 | 0.123 | 1.950 | 0.051 | 0.999 | 1.616 |
| bmi_catNormal | 2.097 | 0.546 | 1.356 | 0.175 | 0.759 | 6.756 |
| bmi_catOverweight | 3.241 | 0.543 | 2.166 | 0.030 | 1.183 | 10.385 |
| bmi_catObese | 6.585 | 0.545 | 3.459 | 0.001 | 2.394 | 21.176 |
| phys_active | 0.900 | 0.130 | -0.808 | 0.419 | 0.697 | 1.162 |
| current_smoker | 1.071 | 0.139 | 0.495 | 0.621 | 0.817 | 1.407 |
Questions:
# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat
dummy_table <- data.frame(
bmi_cat = c("Underweight", "Normal", "Overweight", "Obese"),
`Dummy 1 (Normal)` = c(0, 1, 0, 0),
`Dummy 2 (Overweight)` = c(0, 0, 1, 0),
`Dummy 3 (Obese)` = c(0, 0, 0, 1),
check.names = FALSE
)
dummy_table %>%
kable(caption = "Dummy Variable Coding for bmi_cat",
align = "lccc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#ffe6e6") # Highlight reference category| bmi_cat | Dummy 1 (Normal) | Dummy 2 (Overweight) | Dummy 3 (Obese) |
|---|---|---|---|
| Underweight | 0 | 0 | 0 |
| Normal | 1 | 0 | 0 |
| Overweight | 0 | 1 | 0 |
| Obese | 0 | 0 | 1 |
# YOUR CODE HERE: Extract and display the odds ratios for BMI categories
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "bmi_cat"))## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bmi_catNormal 2.10 0.546 1.36 0.175 0.759 6.76
## 2 bmi_catOverweight 3.24 0.543 2.17 0.0303 1.18 10.4
## 3 bmi_catObese 6.59 0.545 3.46 0.000542 2.39 21.2
Questions:
# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
model_interaction <- glm(hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker,
data = brfss_subset,
family = binomial(link = "logit"))
# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "age_cont")) %>%
kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| age_cont | 1.005 | 0.042 | 0.117 | 0.907 | 0.930 | 1.110 |
| age_cont:bmi_catNormal | 1.058 | 0.043 | 1.287 | 0.198 | 0.956 | 1.147 |
| age_cont:bmi_catOverweight | 1.064 | 0.043 | 1.431 | 0.152 | 0.962 | 1.152 |
| age_cont:bmi_catObese | 1.052 | 0.043 | 1.186 | 0.236 | 0.952 | 1.139 |
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
lrt <- anova(model_logistic_multiple, model_interaction, test = "LRT")
lrt## Analysis of Deviance Table
##
## Model 1: hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker
## Model 2: hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1273 1563.5
## 2 1270 1561.3 3 2.2363 0.5248
#YOUR CODE HERE: Visualization showing predicted probabilities by age and BMI category
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "bmi_cat"))
# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Probability of Hypertension by Age and BMI",
subtitle = "Testing for Age × BMI Interaction",
x = "Age (years)",
y = "Predicted Probability of Hypertension",
color = "BMI",
fill = "BMI"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("Underweight" = "red", "Normal" = "blue","Overweight" = "yellow", "Obese" = "#4DBBD5")) +
scale_fill_manual(values = c("Underweight" = "red", "Normal" = "blue", "Overweight" = "yellow", "Obese" = "#4DBBD5")) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
ggplotly(p4)Questions:
# YOUR CODE HERE: Calculate VIF for your multiple regression model
vif_values <- vif(model_logistic_multiple)
print (vif_values)## GVIF Df GVIF^(1/(2*Df))
## age_cont 1.126628 1 1.061428
## sex 1.016509 1 1.008221
## bmi_cat 1.103045 3 1.016480
## phys_active 1.024820 1 1.012334
## current_smoker 1.073574 1 1.036134
# YOUR CODE HERE: Create a Cook's distance plot to identify influential observations
cooks_d <- cooks.distance(model_logistic_multiple)
# Create data frame
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
) %>%
mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))
# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance: Identifying Influential Observations",
subtitle = "Values > 1 indicate potentially influential observations",
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
ggplotly(p5)Questions:
# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
model1 <- glm(hypertension ~ age_cont,
data = brfss_subset,
family = binomial)
# Model B: Age + sex + bmi_cat
model2 <- glm(hypertension ~ age_cont + sex + bmi_cat,
data = brfss_subset,
family = binomial)
# Model C: Age + sex + bmi_cat + phys_active + current_smoker
model3 <- glm(hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
data = brfss_subset,
family = binomial)
# YOUR CODE HERE: Create a comparison table
model_comp <- data.frame(
Model = c("Model A: Age only",
"Model B: Age + sex + bmi_cat",
"Model 3: Full model"),
AIC = c(AIC(model1), AIC(model2), AIC(model3)),
BIC = c(BIC(model1), BIC(model2), BIC(model3)),
`Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
check.names = FALSE
)
model_comp %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model A: Age only | 1636.61 | 1646.92 | 1632.61 |
| Model B: Age + sex + bmi_cat | 1576.49 | 1607.42 | 1564.49 |
| Model 3: Full model | 1579.50 | 1620.74 | 1563.50 |
Questions:
Write a brief report (1-2 pages) summarizing your findings: 1. Introduction: State your research question 2. Methods: Describe your analytic approach 3. Results: Present key findings with tables and figures 4. Interpretation: Explain what your results mean 5. Limitations: Discuss potential issues with your analysis
Submission: Submit your completed R Markdown file and knitted HTML report.
The research questions covered in this lab were (1) What is the overall prevalence of hypertension in the data set? and (2) How does hypertension prevalence vary by age group?
In this analysis, we used data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023. Particularly, a subset of the data containing demographic, health behavior, and health status data. At the beginning of the analysis, we explore the binary outcome variable, hypertension. Creating a frequency table of hypertension status to dictate the overall prevalence of hypertension in the dataset, and then further calculating the prevalence of hypertension by age group to determine how hypertension prevalence varies by age group. First, we fit a simple logistic regression model, trying to determine whether there is a statistically significant association between hypertension ~ age_cont. Then, we fit a multiple logistic regression model, controlling for confounding factors, particularly adjusting for sex, BMI, physical activity, and smoking. From the multiple logistic regression model, we would have determined the strongest predictors of hypertension, which prompts us to test for interaction, regarding effect modification, and then perform a likelihood ratio test to further compare models with and without interaction. After this, we worked on model diagnostics, calculating VIF for the multiple regression model and creating a Cook’s distance plot to identify influential observations. Finally, we decided on a definitive model for the final analysis after comparing three models using AIC and BIC, seeking the best fit.
In the dataset, about 53% of individuals have hypertension, indicating an overall strong association between age group and percentage with hypertension in the dataset, with higher percentages among higher age groups. After fitting the simple logistic regression model, it was determined that there is a highly statistically significant association (p < 0.001) between hypertension and age, with an Odds Ratio (OR) of 1.055 and a 95% Confidence Interval (CI) of [1.045, 1.065]. From the multiple regression model, after adjusting for sex, BMI, physical activity, and smoking, determined that each 1-year increase in age is associated with a 6.1% increase in the odds of hypertension. The slight increase in odds ratio from 1.055 to 1.061 suggests that confounding could have been present. But, with only a 0.57% change, this could indicate that the other variables did not significantly confound relationship between age and hypertension. The BMI category was determined to be the strongest predictor, having the largest odds ratio, 6.585. For BMI, compared to the reference category “underweight”, obese individuals have a 6.6 times higher odds of developing hypertension after adjusting for age, BMI, sex, physical activity, and smoking. Age_cont:bmi_cat interaction terms did not prove to be statistically significant. All of the p-values for the age_cont:bmi_cat terms in the table are > 0.05. When calculating VIF, all values were < 5, affirming no concerns about multicollinearity. All in all, after comparing three models using AIC and BIC, we choose Model B: Age + sex + bmi_cat, because it was best fit to the data, based on AIC.
As stated before, as age groups increase, mean hypertension increases. The BMI categories were the strongest predictors of hypertension, with the obese BMI category being the strongest predictor. Compared to the reference category “underweight”, obese individuals have a 6.6 times higher odds of developing hypertension after adjusting for age, BMI, sex, physical activity, and smoking. People who are obese are more likely to develop hypertension, high blood pressure, compared to underweight people. It can be said these individuals have six times the chance of getting hypertension. This is taking into consideration social and health factors like age, BMI, smoking etc. Interaction terms were not statistically significant. All of the p-values for the age_cont:bmi_cat terms in the table are > 0.05, meaning that the age-hypertension relationship does not differ by BMI, and there is no evidence of effect modification. No there are no concerns about multicollinearity, all VIF values are < 5 which is generally acceptable.
Potential issues with your analysis is that causal relationships cannot be determined. Correlation doesn’t equal causality. There are also variations in frequency between variables with multiple categories such as BMI, which could act as a catalyst for potentially serious violations.