Statistical modeling is a fundamental tool in epidemiology that allows us to:
This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.
# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)The BRFSS is a large-scale telephone survey that collects data on health-related risk behaviors, chronic health conditions, and use of preventive services from U.S. residents.
# Load the full BRFSS 2023 dataset
brfss_clean <- read_rds("/Users/vikya/OneDrive - University at Albany - SUNY/Documents/553 Principles of Statistical Inference II/data/Statistical Modeling in Epidemiology Lab/brfss_subset_2023.rds")
# Display dataset dimensions
names(brfss_clean)## [1] "diabetes" "age_group" "age_cont" "sex"
## [5] "race" "education" "income" "bmi_cat"
## [9] "phys_active" "current_smoker" "gen_health" "hypertension"
## [13] "high_chol"
# Summary table by diabetes status
desc_table <- brfss_clean %>%
group_by(diabetes) %>%
summarise(
N = n(),
`Mean Age` = round(mean(age_cont), 1),
`% Male` = round(100 * mean(sex == "Male"), 1),
`% Obese` = round(100 * mean(bmi_cat == "Obese", na.rm = TRUE), 1),
`% Physically Active` = round(100 * mean(phys_active), 1),
`% Current Smoker` = round(100 * mean(current_smoker), 1),
`% Hypertension` = round(100 * mean(hypertension), 1),
`% High Cholesterol` = round(100 * mean(high_chol), 1)
) %>%
mutate(diabetes = ifelse(diabetes == 1, "Diabetes", "No Diabetes"))
desc_table %>%
kable(caption = "Descriptive Statistics by Diabetes Status",
align = "lrrrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| diabetes | N | Mean Age | % Male | % Obese | % Physically Active | % Current Smoker | % Hypertension | % High Cholesterol |
|---|---|---|---|---|---|---|---|---|
| No Diabetes | 1053 | 58.2 | 49.0 | 34.8 | 69.4 | 29.3 | 47.5 | 42.5 |
| Diabetes | 228 | 63.1 | 53.9 | 56.1 | 53.5 | 27.6 | 76.8 | 67.1 |
A statistical model is a mathematical representation of the relationship between:
\[f(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon\]
Where:
The choice of regression model depends on the type of outcome variable:
| Outcome Type | Regression Type | Link Function | Example |
|---|---|---|---|
| Continuous | Linear | Identity: Y | Blood pressure, BMI |
| Binary | Logistic | Logit: log(p/(1-p)) | Disease status, mortality |
| Count | Poisson/Negative Binomial | Log: log(Y) | Number of infections |
| Time-to-event | Cox Proportional Hazards | Log: log(h(t)) | Survival time |
Let’s model the relationship between age and diabetes prevalence.
# Simple linear regression: diabetes ~ age
model_linear_simple <- lm(diabetes ~ age_cont, data = brfss_clean)
# Display results
tidy(model_linear_simple, conf.int = TRUE) %>%
kable(caption = "Simple Linear Regression: Diabetes ~ Age",
digits = 4,
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.0632 | 0.0481 | -1.3125 | 0.1896 | -0.1576 | 0.0312 |
| age_cont | 0.0041 | 0.0008 | 5.1368 | 0.0000 | 0.0025 | 0.0056 |
Interpretation:
# Create scatter plot with regression line
p1 <- ggplot(brfss_clean, aes(x = age_cont, y = diabetes)) +
geom_jitter(alpha = 0.2, width = 0.5, height = 0.02, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
labs(
title = "Relationship Between Age and Diabetes",
subtitle = "Simple Linear Regression",
x = "Age (years)",
y = "Probability of Diabetes"
) +
theme_minimal(base_size = 12)
ggplotly(p1) %>%
layout(hovermode = "closest")Diabetes Prevalence by Age
Problem with linear regression for binary outcomes:
Solution: Logistic Regression
Uses the logit link function to ensure predicted probabilities stay between 0 and 1:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
# Simple logistic regression: diabetes ~ age
model_logistic_simple <- glm(diabetes ~ age_cont,
data = brfss_clean,
family = binomial(link = "logit"))
# Display results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.029 | 0.423 | -8.390 | 0 | 0.012 | 0.064 |
| age_cont | 1.034 | 0.007 | 4.978 | 0 | 1.021 | 1.048 |
Interpretation:
Predicted Diabetes Probability by Age
# Generate predicted probabilities
pred_data <- data.frame(age_cont = seq(18, 80, by = 1))
pred_data$predicted_prob <- predict(model_logistic_simple,
newdata = pred_data,
type = "response")
# Plot
p2 <- ggplot(pred_data, aes(x = age_cont, y = predicted_prob)) +
geom_line(color = "darkred", linewidth = 1.5) +
geom_ribbon(aes(ymin = predicted_prob - 0.02,
ymax = predicted_prob + 0.02),
alpha = 0.2, fill = "darkred") +
labs(
title = "Predicted Probability of Diabetes by Age",
subtitle = "Simple Logistic Regression",
x = "Age (years)",
y = "Predicted Probability of Diabetes"
) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.6)) +
theme_minimal(base_size = 12)
ggplotly(p2)Predicted Diabetes Probability by Age
A confounder is a variable that:
Example: The relationship between age and diabetes may be confounded by BMI, physical activity, and other factors.
# Multiple logistic regression with potential confounders
model_logistic_multiple <- glm(diabetes ~ age_cont + sex + bmi_cat +
phys_active + current_smoker + education,
data = brfss_clean,
family = binomial(link = "logit"))
# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression: Diabetes ~ Age + Covariates (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
scroll_box(height = "400px")| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.009 | 1.177 | -4.001 | 0.000 | 0.000 | 0.065 |
| age_cont | 1.041 | 0.007 | 5.515 | 0.000 | 1.027 | 1.057 |
| sexMale | 1.191 | 0.154 | 1.133 | 0.257 | 0.880 | 1.613 |
| bmi_catNormal | 1.971 | 1.052 | 0.645 | 0.519 | 0.378 | 36.309 |
| bmi_catOverweight | 3.155 | 1.044 | 1.101 | 0.271 | 0.621 | 57.679 |
| bmi_catObese | 6.834 | 1.041 | 1.845 | 0.065 | 1.354 | 124.675 |
| phys_active | 0.589 | 0.157 | -3.373 | 0.001 | 0.433 | 0.802 |
| current_smoker | 1.213 | 0.178 | 1.085 | 0.278 | 0.852 | 1.716 |
| educationHigh school graduate | 0.634 | 0.288 | -1.579 | 0.114 | 0.364 | 1.131 |
| educationSome college | 0.542 | 0.294 | -2.081 | 0.037 | 0.307 | 0.977 |
| educationCollege graduate | 0.584 | 0.305 | -1.763 | 0.078 | 0.324 | 1.074 |
Interpretation:
Categorical variables with \(k\) levels are represented using \(k-1\) dummy variables (indicator variables).
Education has 4 levels: 1. < High school (reference category) 2. High school graduate 3. Some college 4. College graduate
R automatically creates 3 dummy variables:
# Extract dummy variable coding
dummy_table <- data.frame(
Education = c("< High school", "High school graduate", "Some college", "College graduate"),
`Dummy 1 (HS grad)` = c(0, 1, 0, 0),
`Dummy 2 (Some college)` = c(0, 0, 1, 0),
`Dummy 3 (College grad)` = c(0, 0, 0, 1),
check.names = FALSE
)
dummy_table %>%
kable(caption = "Dummy Variable Coding for Education",
align = "lccc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#ffe6e6") # Highlight reference category| Education | Dummy 1 (HS grad) | Dummy 2 (Some college) | Dummy 3 (College grad) |
|---|---|---|---|
| < High school | 0 | 0 | 0 |
| High school graduate | 1 | 0 | 0 |
| Some college | 0 | 1 | 0 |
| College graduate | 0 | 0 | 1 |
Reference Category: The category with all zeros (< High school) is the reference group. All other categories are compared to this reference.
# Extract education coefficients
educ_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "education")) %>%
mutate(
education_level = str_remove(term, "education"),
education_level = factor(education_level,
levels = c("High school graduate",
"Some college",
"College graduate"))
)
# Add reference category
ref_row <- data.frame(
term = "education< High school",
estimate = 1.0,
std.error = 0,
statistic = NA,
p.value = NA,
conf.low = 1.0,
conf.high = 1.0,
education_level = factor("< High school (Ref)",
levels = c("< High school (Ref)",
"High school graduate",
"Some college",
"College graduate"))
)
educ_coefs_full <- bind_rows(ref_row, educ_coefs) %>%
mutate(education_level = factor(education_level,
levels = c("< High school (Ref)",
"High school graduate",
"Some college",
"College graduate")))
# Plot
p3 <- ggplot(educ_coefs_full, aes(x = education_level, y = estimate)) +
geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
size = 0.8, color = "darkblue") +
coord_flip() +
labs(
title = "Association Between Education and Diabetes",
subtitle = "Adjusted Odds Ratios (reference: < High school)",
x = "Education Level",
y = "Odds Ratio (95% CI)"
) +
theme_minimal(base_size = 12)
ggplotly(p3)Odds Ratios for Education Levels
# Plot model coefficients with `ggcoef_model()`
ggcoef_model(model_logistic_multiple, exponentiate = TRUE,
include = c("education"),
variable_labels = c(
education = "Education"),
facet_labeller = ggplot2::label_wrap_gen(10)
)An interaction exists when the effect of one variable on the outcome differs across levels of another variable.
Epidemiologic term: Effect modification
Does the effect of age on diabetes differ between males and females?
# Model with interaction term
model_interaction <- glm(diabetes ~ age_cont * sex + bmi_cat + phys_active,
data = brfss_clean,
family = binomial(link = "logit"))
# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "age_cont")) %>%
kable(caption = "Age × Sex Interaction Model (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| age_cont | 1.031 | 0.009 | 3.178 | 0.001 | 1.012 | 1.051 |
| age_cont:sexMale | 1.015 | 0.014 | 1.084 | 0.278 | 0.988 | 1.044 |
Interpretation:
# Generate predicted probabilities by sex
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "sex"))
# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Probability of Diabetes by Age and Sex",
subtitle = "Testing for Age × Sex Interaction",
x = "Age (years)",
y = "Predicted Probability of Diabetes",
color = "Sex",
fill = "Sex"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
scale_fill_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
ggplotly(p4)Age-Diabetes Relationship by Sex
Every regression model makes assumptions about the data. If assumptions are violated, results may be invalid.
Variance Inflation Factor (VIF): Measures how much the variance of a coefficient is inflated due to correlation with other predictors.
# Calculate VIF
vif_values <- vif(model_logistic_multiple)
# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
# If matrix (categorical variables present), extract GVIF^(1/(2*Df))
vif_df <- data.frame(
Variable = rownames(vif_values),
VIF = vif_values[, "GVIF^(1/(2*Df))"]
)
} else {
# If vector (only continuous variables)
vif_df <- data.frame(
Variable = names(vif_values),
VIF = as.numeric(vif_values)
)
}
# Add interpretation
vif_df <- vif_df %>%
arrange(desc(VIF)) %>%
mutate(
Interpretation = case_when(
VIF < 5 ~ "Low (No concern)",
VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
VIF >= 10 ~ "High (Problem)"
)
)
vif_df %>%
kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
digits = 2,
align = "lrc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
row_spec(which(vif_df$VIF < 5), background = "#90EE90")| Variable | VIF | Interpretation | |
|---|---|---|---|
| age_cont | age_cont | 1.05 | Low (No concern) |
| current_smoker | current_smoker | 1.05 | Low (No concern) |
| phys_active | phys_active | 1.02 | Low (No concern) |
| sex | sex | 1.01 | Low (No concern) |
| education | education | 1.01 | Low (No concern) |
| bmi_cat | bmi_cat | 1.01 | Low (No concern) |
Cook’s Distance: Measures how much the model would change if an observation were removed.
# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)
# Create data frame
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
) %>%
mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))
# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance: Identifying Influential Observations",
subtitle = "Values > 1 indicate potentially influential observations",
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
ggplotly(p5)Cook’s Distance for Influential Observations
# Count influential observations
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")## Number of potentially influential observations: 0
Use Likelihood Ratio Test to compare nested models:
# Model 1: Age only
model1 <- glm(diabetes ~ age_cont,
data = brfss_clean,
family = binomial)
# Model 2: Age + Sex
model2 <- glm(diabetes ~ age_cont + sex,
data = brfss_clean,
family = binomial)
# Model 3: Full model
model3 <- model_logistic_multiple
# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")
# Create comparison table
model_comp <- data.frame(
Model = c("Model 1: Age only",
"Model 2: Age + Sex",
"Model 3: Full model"),
AIC = c(AIC(model1), AIC(model2), AIC(model3)),
BIC = c(BIC(model1), BIC(model2), BIC(model3)),
`Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
check.names = FALSE
)
model_comp %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model 1: Age only | 1175.08 | 1185.39 | 1171.08 |
| Model 2: Age + Sex | 1175.85 | 1191.32 | 1169.85 |
| Model 3: Full model | 1122.65 | 1179.36 | 1100.65 |
Interpretation:
All statistical models include an error term (\(\epsilon\)) to account for:
\[Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon\]
Key points:
In this lab, you will:
hyp_freq <- brfss_clean %>%
count(hypertension) %>%
mutate(
Status = ifelse(hypertension == 1, "Hypertension", "No Hypertension"),
Percent = round(100 * n / sum(n), 1)
) %>%
select(Status, n, Percent)
hyp_freq %>%
kable(caption = "Frequency Table: Hypertension Status",
col.names = c("Status", "N", "%"),
align = "lrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Status | N | % |
|---|---|---|
| No Hypertension | 606 | 47.3 |
| Hypertension | 675 | 52.7 |
# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
hyp_by_age <- brfss_clean %>%
group_by(age_group) %>%
summarise(
N = n(),
Cases = sum(hypertension),
Prevalence = round(100 * mean(hypertension), 1)
)
hyp_by_age %>%
kable(caption = "Hypertension Prevalence by Age Group",
col.names = c("Age Group", "N", "Cases", "Prevalence (%)"),
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Age Group | N | Cases | Prevalence (%) |
|---|---|---|---|
| 18-24 | 12 | 1 | 8.3 |
| 25-34 | 77 | 15 | 19.5 |
| 35-44 | 138 | 42 | 30.4 |
| 45-54 | 161 | 61 | 37.9 |
| 55-64 | 266 | 137 | 51.5 |
| 65+ | 627 | 419 | 66.8 |
Questions:
# Outcome: hypertension
# Predictor: age_cont
model_hyp_simple <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial(link = "logit"))
# YOUR CODE HERE: Display the results with odds ratios
tidy(model_hyp_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic",
"p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.048 | 0.296 | -10.293 | 0 | 0.026 | 0.084 |
| age_cont | 1.055 | 0.005 | 10.996 | 0 | 1.045 | 1.065 |
Questions:
What is the odds ratio for age? Interpret this value. The OR for age is 1.055. Each additional year of age is associated with a 5.5% increase in the odds of hypertension.
Is the association statistically significant? Yes, the association is highly statistically significant (z = 10.996, p < 0.001).
What is the 95% confidence interval for the odds ratio? The 95% CI is (1.045, 1.065) narrow and above 1.0. This indicates a positive estimate. —
# Multiple logistic regression with potential confounders
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
model_hyp_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
phys_active + current_smoker,
data = brfss_clean,
family = binomial(link = "logit"))
# Display the results
tidy(model_hyp_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic",
"p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
scroll_box(height = "400px")| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.008 | 0.653 | -7.355 | 0.000 | 0.002 | 0.028 |
| age_cont | 1.061 | 0.005 | 11.234 | 0.000 | 1.050 | 1.073 |
| sexMale | 1.270 | 0.123 | 1.950 | 0.051 | 0.999 | 1.616 |
| bmi_catNormal | 2.097 | 0.546 | 1.356 | 0.175 | 0.759 | 6.756 |
| bmi_catOverweight | 3.241 | 0.543 | 2.166 | 0.030 | 1.183 | 10.385 |
| bmi_catObese | 6.585 | 0.545 | 3.459 | 0.001 | 2.394 | 21.176 |
| phys_active | 0.900 | 0.130 | -0.808 | 0.419 | 0.697 | 1.162 |
| current_smoker | 1.071 | 0.139 | 0.495 | 0.621 | 0.817 | 1.407 |
Questions:
# Create a table showing the dummy variable coding for bmi_cat
dummy_bmi <- data.frame(
`BMI Category` = c("Underweight/Normal", "Overweight", "Obese"),
`Dummy 1 (Overweight)` = c(0, 1, 0),
`Dummy 2 (Obese)` = c(0, 0, 1),
check.names = FALSE
)
dummy_bmi %>%
kable(caption = "Dummy Variable Coding for BMI Category",
align = "lcc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#ffe6e6") # Highlight reference category| BMI Category | Dummy 1 (Overweight) | Dummy 2 (Obese) |
|---|---|---|
| Underweight/Normal | 0 | 0 |
| Overweight | 1 | 0 |
| Obese | 0 | 1 |
# Extract and display the odds ratios for BMI categories
tidy(model_hyp_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "bmi_cat")) %>%
mutate(term = str_remove(term, "bmi_cat")) %>%
kable(caption = "Odds Ratios for BMI Categories (vs. Underweight/Normal)",
digits = 3,
col.names = c("BMI Category", "Odds Ratio", "Std. Error", "z-statistic",
"p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| BMI Category | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Normal | 2.097 | 0.546 | 1.356 | 0.175 | 0.759 | 6.756 |
| Overweight | 3.241 | 0.543 | 2.166 | 0.030 | 1.183 | 10.385 |
| Obese | 6.585 | 0.545 | 3.459 | 0.001 | 2.394 | 21.176 |
Questions:
What is the reference category for BMI? The reference category is Underweight/Normal BMI
Interpret the odds ratio for “Obese” compared to the reference category.The OR for Obese is 6.585 (95% CI: 2.394–21.176, p = 0.001).This means, after adjusting for age, sex, physical activity, and smoking, obese individuals have 6.6 times the odds of hypertension compared to normal/underweight individuals.
How would you explain this to a non-statistician? People who are obese are about 6 to 7 times more likely to have high blood pressure than people at a healthy weight —
# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
model_hyp_noint <- glm(hypertension ~ age_cont + bmi_cat + sex + phys_active,
data = brfss_clean,
family = binomial(link = "logit"))
model_hyp_int <- glm(hypertension ~ age_cont * bmi_cat + sex + phys_active,
data = brfss_clean,
family = binomial(link = "logit"))
tidy(model_hyp_int, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "age_cont")) %>%
kable(caption = "Age x BMI Interaction Model (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic",
"p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| age_cont | 1.005 | 0.042 | 0.129 | 0.897 | 0.930 | 1.110 |
| age_cont:bmi_catNormal | 1.056 | 0.043 | 1.261 | 0.207 | 0.955 | 1.145 |
| age_cont:bmi_catOverweight | 1.062 | 0.043 | 1.407 | 0.159 | 0.961 | 1.151 |
| age_cont:bmi_catObese | 1.051 | 0.043 | 1.162 | 0.245 | 0.951 | 1.137 |
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
anova(model_hyp_noint, model_hyp_int, test = "LRT")## Analysis of Deviance Table
##
## Model 1: hypertension ~ age_cont + bmi_cat + sex + phys_active
## Model 2: hypertension ~ age_cont * bmi_cat + sex + phys_active
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1274 1563.7
## 2 1271 1561.6 3 2.1826 0.5354
pred_interact_hyp <- ggpredict(model_hyp_int, terms = c("age_cont [18:80]", "bmi_cat"))
p4 <- ggplot(pred_interact_hyp, aes(x = x, y = predicted,
color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Probability of Hypertension by Age and BMI Category",
subtitle = "Testing for Age × BMI Interaction",
x = "Age (years)",
y = "Predicted Probability of Hypertension",
color = "BMI Category",
fill = "BMI Category"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("Underweight/Normal" = "#4DBBD5",
"Overweight" = "#E64B35",
"Obese" = "#00A087")) +
scale_fill_manual(values = c("Underweight/Normal" = "#4DBBD5",
"Overweight" = "#E64B35",
"Obese" = "#00A087")) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
ggplotly(p4)Questions:
# Calculate VIF for your multiple regression model
vif_values <- vif(model_hyp_multiple)
if (is.matrix(vif_values)) {
vif_df <- data.frame(
Variable = rownames(vif_values),
VIF = vif_values[, "GVIF^(1/(2*Df))"]
)
} else {
vif_df <- data.frame(
Variable = names(vif_values),
VIF = as.numeric(vif_values)
)
}
vif_df <- vif_df %>%
arrange(desc(VIF)) %>%
mutate(
Interpretation = case_when(
VIF < 5 ~ "Low (No concern)",
VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
VIF >= 10 ~ "High (Problem)"
)
)
vif_df %>%
kable(caption = "Variance Inflation Factors (VIF) for Hypertension Multiple Regression Model",
digits = 2,
align = "lrc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
row_spec(which(vif_df$VIF < 5), background = "#90EE90")| Variable | VIF | Interpretation | |
|---|---|---|---|
| age_cont | age_cont | 1.06 | Low (No concern) |
| current_smoker | current_smoker | 1.04 | Low (No concern) |
| bmi_cat | bmi_cat | 1.02 | Low (No concern) |
| phys_active | phys_active | 1.01 | Low (No concern) |
| sex | sex | 1.01 | Low (No concern) |
cooks_d <- cooks.distance(model_hyp_multiple)
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
) %>%
mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance: Identifying Influential Observations",
subtitle = "Values > 1 indicate potentially influential observations",
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
ggplotly(p5)n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")## Number of potentially influential observations: 0
# Create a Cook's distance plot to identify influential observations
model_A <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial)Questions:
# Compare three models using AIC and BIC
# Model A: Age only
model_A <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial)
# Model B: Age + sex + bmi_cat
model_B <- glm(hypertension ~ age_cont + sex + bmi_cat,
data = brfss_clean,
family = binomial)
# Model C: Age + sex + bmi_cat + phys_active + current_smoker
model_C <- glm(hypertension ~ age_cont + sex + bmi_cat +
phys_active + current_smoker,
data = brfss_clean,
family = binomial)
# Create a comparison table
model_comp <- data.frame(
Model = c("Model A: Age only",
"Model B: Age + Sex + BMI",
"Model C: Full model"),
AIC = c(AIC(model_A), AIC(model_B), AIC(model_C)),
BIC = c(BIC(model_A), BIC(model_B), BIC(model_C)),
Deviance = c(deviance(model_A), deviance(model_B), deviance(model_C)),
check.names = FALSE
)
model_comp %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model A: Age only | 1636.61 | 1646.92 | 1632.61 |
| Model B: Age + Sex + BMI | 1576.49 | 1607.42 | 1564.49 |
| Model C: Full model | 1579.50 | 1620.74 | 1563.50 |
Questions:
Write a brief report (1-2 pages) summarizing your findings:
Risk Factors for Hypertension Among U.S. Adults: A Logistic Regression Analysis of BRFSS 2023
Introduction
This analysis investigates the association between age and hypertension among U.S. adults, adjusted for sex, BMI category, physical activity, and smoking status. Hypertension is one of the most prevalent and modifiable risk factors for cardiovascular disease and stroke in the United States. This makes identification of its strongest predictors essential for targeting prevention and surveillance resources.
Methods
Hypertension was indicated as 1 = yes and 0 = no. I started by reporting the overall prevalence of hypertension and the prevalence by age group. I estimated three nested logistic regression models. The first included age as the only predictor (Model A). The second was a slightly more complex model that added sex and BMI category to the previous model (Model B). The third was a full model that added physical activity and smoking status to the previous model (Model C). The BMI was treated as a categorical variable. The “underweight/normal” category was the reference group. There were two dummy variables used to indicate “overweight” and “obese” status. I tested for an age vs. BMI interaction using a likelihood ratio test (LRT) comparing models with and without interaction terms. Model diagnostics included the Generalized Variance Inflation Factor (GVIF) for multicollinearity and Cook’s distance for influential observations. Model fit was compared across all three models using AIC and BIC. Loower values indicated better fit. All analyses were conducted in R with statistical significance set at α = 0.05.
Results
More than half of the overall sample (n = 675 out of 1,281) was hypertensive (prevalence = 52.7%). Prevalence increased with age, from 8.3% of adults aged 18-24 to 66.8% of those 65+. In the simple logistic regression model, the odds of hypertension were 5.5% higher for each year of age (OR = 1.055, 95% CI: 1.045-1.065, p < 0.001). The age OR changed by little after accounting for sex, BMI, PA, and smoking (OR = 1.061, 95% CI: 1.050-1.073); i.e., there was essentially no confounding of age by these variables. In the adjusted model, the odds of hypertension were over 6 times higher for obese individuals in comparison to those with a normal weight (OR = 6.585, 95% CI: 2.394–21.176, p = 0.001) and over 3 times higher for overweight individuals (OR = 3.241, 95% CI: 1.183–10.385, p = 0.030). In the case of sex, physical activity, and smoking, no statistically significant differences were found in the adjusted model. The age x BMI interaction was insignificantly different (LRT: χ²(3) = 2.18, p = 0.535), which means that the impact of age on hypertension is not significantly different within individual BMI levels. All GVIF values fell below 1.1 and no influential observations were identified (Cook’s D > 1: n = 0). Model comparison showed that Model B (age + sex + BMI) achieved the lowest AIC (1576.49) and BIC (1607.42), outperforming both the age-only model and the full model.All Generalized Variance Inflation Factor (GVIF) values were below 1.1, and no influential observations (Cook’s D > 1: n=0) were detected. Model comparison revealed that model B (age+sex+BMI) had the lowest AIC (1576.49) and BIC (1607.42) and was better compared to the age-only model and the full model.
Interpretation
Age and obesity stand out as the most important, clinically recognizable hypertension risk factors. Without the direct and mediated effects of obesity and the direct effect of age, one could potentially avert 50% of new cases of hypertension. This is a major public health challenge since many populations worldwide are experiencing both population aging and increasing prevalence of obesity. The non-significant interaction also confirms that the increase in the probability of hypertension with a one-year increase in age is similar by BMI group. However, evidence suggests that physical activity, smoking, and other factors might have different effects on hypertension by BMI,25 such that these factors could be effect modifiers of the age–hypertension relationship.
Limitations
This analysis is subject to several important limitations. The cross-sectional design of BRFSS prevents causal inference, therefore we cannot determine whether age or obesity causes hypertension from these data. Hypertension is self-reported, and misclassification bias may be introduced if participants are unaware of their diagnosis. BRFSS uses telephone-based random-digit dialing sampling, which may underrepresent low-income or non-English-speaking households. This potentially limits the generalizability of study findings. Also, the residual confounding from unmeasured variables, including dietary sodium intake, family history of hypertension, medication use, and psychological stress, cannot be excluded. Finally, small sample sizes in younger age groups (n = 12 for ages 18–24) limit the precision of prevalence estimates.
Submission: Submit your completed R Markdown file and knitted HTML report.
Logistic Regression:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
Odds Ratio:
\[\text{OR} = e^{\beta_i}\]
Predicted Probability:
\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]
Session Info
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggstats_0.12.0 gtsummary_2.5.0 ggeffects_2.3.2 car_3.1-3
## [5] carData_3.0-5 broom_1.0.11 plotly_4.12.0 kableExtra_1.4.0
## [9] knitr_1.51 haven_2.5.5 lubridate_1.9.4 forcats_1.0.1
## [13] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.1.6
## [17] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0
## [4] htmlwidgets_1.6.4 insight_1.4.6 lattice_0.22-7
## [7] tzdb_0.5.0 crosstalk_1.2.2 vctrs_0.7.1
## [10] tools_4.5.1 generics_0.1.4 datawizard_1.3.0
## [13] pkgconfig_2.0.3 Matrix_1.7-3 data.table_1.18.0
## [16] RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.5
## [19] compiler_4.5.1 farver_2.1.2 textshaping_1.0.4
## [22] htmltools_0.5.9 sass_0.4.10 yaml_2.3.12
## [25] lazyeval_0.2.2 Formula_1.2-5 pillar_1.11.1
## [28] jquerylib_0.1.4 broom.helpers_1.22.0 cachem_1.1.0
## [31] abind_1.4-8 nlme_3.1-168 tidyselect_1.2.1
## [34] digest_0.6.39 stringi_1.8.7 labeling_0.4.3
## [37] splines_4.5.1 labelled_2.16.0 fastmap_1.2.0
## [40] grid_4.5.1 cli_3.6.5 magrittr_2.0.4
## [43] cards_0.7.1 withr_3.0.2 scales_1.4.0
## [46] backports_1.5.0 timechange_0.3.0 rmarkdown_2.30
## [49] httr_1.4.7 otel_0.2.0 hms_1.1.4
## [52] evaluate_1.0.5 viridisLite_0.4.2 mgcv_1.9-3
## [55] rlang_1.1.7 glue_1.8.0 xml2_1.5.2
## [58] svglite_2.2.2 rstudioapi_0.18.0 jsonlite_2.0.0
## [61] R6_2.6.1 systemfonts_1.3.1