Statistical modeling is a fundamental tool in epidemiology that allows us to:
This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.
# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)The BRFSS is a large-scale telephone survey that collects data on health-related risk behaviors, chronic health conditions, and use of preventive services from U.S. residents.
library(tidyverse)
library(readr)
library(knitr)
library(kableExtra)
library(broom)
library(car)
# Load the cleaned dataset
brfss_clean <- read_rds("brfss_subset_2023.rds")
# Make aliases so any lecture variable name works
brfss_full <- brfss_clean
brfss_subset <- brfss_clean
# Confirm it loaded
names(brfss_clean)## [1] "diabetes" "age_group" "age_cont" "sex"
## [5] "race" "education" "income" "bmi_cat"
## [9] "phys_active" "current_smoker" "gen_health" "hypertension"
## [13] "high_chol"
# Summary table by diabetes status
desc_table <- brfss_clean %>%
group_by(diabetes) %>%
summarise(
N = n(),
`Mean Age` = round(mean(age_cont), 1),
`% Male` = round(100 * mean(sex == "Male"), 1),
`% Obese` = round(100 * mean(bmi_cat == "Obese", na.rm = TRUE), 1),
`% Physically Active` = round(100 * mean(phys_active), 1),
`% Current Smoker` = round(100 * mean(current_smoker), 1),
`% Hypertension` = round(100 * mean(hypertension), 1),
`% High Cholesterol` = round(100 * mean(high_chol), 1)
) %>%
mutate(diabetes = ifelse(diabetes == 1, "Diabetes", "No Diabetes"))
desc_table %>%
kable(caption = "Descriptive Statistics by Diabetes Status",
align = "lrrrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| diabetes | N | Mean Age | % Male | % Obese | % Physically Active | % Current Smoker | % Hypertension | % High Cholesterol |
|---|---|---|---|---|---|---|---|---|
| No Diabetes | 1053 | 58.2 | 49.0 | 34.8 | 69.4 | 29.3 | 47.5 | 42.5 |
| Diabetes | 228 | 63.1 | 53.9 | 56.1 | 53.5 | 27.6 | 76.8 | 67.1 |
A statistical model is a mathematical representation of the relationship between:
\[f(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon\]
Where:
The choice of regression model depends on the type of outcome variable:
| Outcome Type | Regression Type | Link Function | Example |
|---|---|---|---|
| Continuous | Linear | Identity: Y | Blood pressure, BMI |
| Binary | Logistic | Logit: log(p/(1-p)) | Disease status, mortality |
| Count | Poisson/Negative Binomial | Log: log(Y) | Number of infections |
| Time-to-event | Cox Proportional Hazards | Log: log(h(t)) | Survival time |
Let’s model the relationship between age and diabetes prevalence.
# Simple linear regression: diabetes ~ age
model_linear_simple <- lm(diabetes ~ age_cont, data = brfss_clean)
# Display results
tidy(model_linear_simple, conf.int = TRUE) %>%
kable(caption = "Simple Linear Regression: Diabetes ~ Age",
digits = 4,
col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.0632 | 0.0481 | -1.3125 | 0.1896 | -0.1576 | 0.0312 |
| age_cont | 0.0041 | 0.0008 | 5.1368 | 0.0000 | 0.0025 | 0.0056 |
Interpretation:
# Create scatter plot with regression line
p1 <- ggplot(brfss_clean, aes(x = age_cont, y = diabetes)) +
geom_jitter(alpha = 0.2, width = 0.5, height = 0.02, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
labs(
title = "Relationship Between Age and Diabetes",
subtitle = "Simple Linear Regression",
x = "Age (years)",
y = "Probability of Diabetes"
) +
theme_minimal(base_size = 12)
ggplotly(p1) %>%
layout(hovermode = "closest")Diabetes Prevalence by Age
Problem with linear regression for binary outcomes:
Solution: Logistic Regression
Uses the logit link function to ensure predicted probabilities stay between 0 and 1:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
# Simple logistic regression: diabetes ~ age
model_logistic_simple <- glm(diabetes ~ age_cont,
data = brfss_clean,
family = binomial(link = "logit"))
# Display results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.029 | 0.423 | -8.390 | 0 | 0.012 | 0.064 |
| age_cont | 1.034 | 0.007 | 4.978 | 0 | 1.021 | 1.048 |
Interpretation:
Predicted Diabetes Probability by Age
# Generate predicted probabilities
pred_data <- data.frame(age_cont = seq(18, 80, by = 1))
pred_data$predicted_prob <- predict(model_logistic_simple,
newdata = pred_data,
type = "response")
# Plot
p2 <- ggplot(pred_data, aes(x = age_cont, y = predicted_prob)) +
geom_line(color = "darkred", linewidth = 1.5) +
geom_ribbon(aes(ymin = predicted_prob - 0.02,
ymax = predicted_prob + 0.02),
alpha = 0.2, fill = "darkred") +
labs(
title = "Predicted Probability of Diabetes by Age",
subtitle = "Simple Logistic Regression",
x = "Age (years)",
y = "Predicted Probability of Diabetes"
) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.6)) +
theme_minimal(base_size = 12)
ggplotly(p2)Predicted Diabetes Probability by Age
A confounder is a variable that:
Example: The relationship between age and diabetes may be confounded by BMI, physical activity, and other factors.
# Multiple logistic regression with potential confounders
model_logistic_multiple <- glm(diabetes ~ age_cont + sex + bmi_cat +
phys_active + current_smoker + education,
data = brfss_clean,
family = binomial(link = "logit"))
# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression: Diabetes ~ Age + Covariates (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
scroll_box(height = "400px")| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.009 | 1.177 | -4.001 | 0.000 | 0.000 | 0.065 |
| age_cont | 1.041 | 0.007 | 5.515 | 0.000 | 1.027 | 1.057 |
| sexMale | 1.191 | 0.154 | 1.133 | 0.257 | 0.880 | 1.613 |
| bmi_catNormal | 1.971 | 1.052 | 0.645 | 0.519 | 0.378 | 36.309 |
| bmi_catOverweight | 3.155 | 1.044 | 1.101 | 0.271 | 0.621 | 57.679 |
| bmi_catObese | 6.834 | 1.041 | 1.845 | 0.065 | 1.354 | 124.675 |
| phys_active | 0.589 | 0.157 | -3.373 | 0.001 | 0.433 | 0.802 |
| current_smoker | 1.213 | 0.178 | 1.085 | 0.278 | 0.852 | 1.716 |
| educationHigh school graduate | 0.634 | 0.288 | -1.579 | 0.114 | 0.364 | 1.131 |
| educationSome college | 0.542 | 0.294 | -2.081 | 0.037 | 0.307 | 0.977 |
| educationCollege graduate | 0.584 | 0.305 | -1.763 | 0.078 | 0.324 | 1.074 |
Interpretation:
Categorical variables with \(k\) levels are represented using \(k-1\) dummy variables (indicator variables).
Education has 4 levels: 1. < High school (reference category) 2. High school graduate 3. Some college 4. College graduate
R automatically creates 3 dummy variables:
# Extract dummy variable coding
dummy_table <- data.frame(
Education = c("< High school", "High school graduate", "Some college", "College graduate"),
`Dummy 1 (HS grad)` = c(0, 1, 0, 0),
`Dummy 2 (Some college)` = c(0, 0, 1, 0),
`Dummy 3 (College grad)` = c(0, 0, 0, 1),
check.names = FALSE
)
dummy_table %>%
kable(caption = "Dummy Variable Coding for Education",
align = "lccc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#ffe6e6") # Highlight reference category| Education | Dummy 1 (HS grad) | Dummy 2 (Some college) | Dummy 3 (College grad) |
|---|---|---|---|
| < High school | 0 | 0 | 0 |
| High school graduate | 1 | 0 | 0 |
| Some college | 0 | 1 | 0 |
| College graduate | 0 | 0 | 1 |
Reference Category: The category with all zeros (< High school) is the reference group. All other categories are compared to this reference.
# Extract education coefficients
educ_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "education")) %>%
mutate(
education_level = str_remove(term, "education"),
education_level = factor(education_level,
levels = c("High school graduate",
"Some college",
"College graduate"))
)
# Add reference category
ref_row <- data.frame(
term = "education< High school",
estimate = 1.0,
std.error = 0,
statistic = NA,
p.value = NA,
conf.low = 1.0,
conf.high = 1.0,
education_level = factor("< High school (Ref)",
levels = c("< High school (Ref)",
"High school graduate",
"Some college",
"College graduate"))
)
educ_coefs_full <- bind_rows(ref_row, educ_coefs) %>%
mutate(education_level = factor(education_level,
levels = c("< High school (Ref)",
"High school graduate",
"Some college",
"College graduate")))
# Plot
p3 <- ggplot(educ_coefs_full, aes(x = education_level, y = estimate)) +
geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
size = 0.8, color = "darkblue") +
coord_flip() +
labs(
title = "Association Between Education and Diabetes",
subtitle = "Adjusted Odds Ratios (reference: < High school)",
x = "Education Level",
y = "Odds Ratio (95% CI)"
) +
theme_minimal(base_size = 12)
ggplotly(p3)Odds Ratios for Education Levels
# Plot model coefficients with `ggcoef_model()`
ggcoef_model(model_logistic_multiple, exponentiate = TRUE,
include = c("education"),
variable_labels = c(
education = "Education"),
facet_labeller = ggplot2::label_wrap_gen(10)
)An interaction exists when the effect of one variable on the outcome differs across levels of another variable.
Epidemiologic term: Effect modification
Does the effect of age on diabetes differ between males and females?
# Model with interaction term
model_interaction <- glm(diabetes ~ age_cont * sex + bmi_cat + phys_active,
data = brfss_clean,
family = binomial(link = "logit"))
# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "age_cont")) %>%
kable(caption = "Age × Sex Interaction Model (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| age_cont | 1.031 | 0.009 | 3.178 | 0.001 | 1.012 | 1.051 |
| age_cont:sexMale | 1.015 | 0.014 | 1.084 | 0.278 | 0.988 | 1.044 |
Interpretation:
# Generate predicted probabilities by sex
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "sex"))
# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Probability of Diabetes by Age and Sex",
subtitle = "Testing for Age × Sex Interaction",
x = "Age (years)",
y = "Predicted Probability of Diabetes",
color = "Sex",
fill = "Sex"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
scale_fill_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
ggplotly(p4)Age-Diabetes Relationship by Sex
Every regression model makes assumptions about the data. If assumptions are violated, results may be invalid.
Variance Inflation Factor (VIF): Measures how much the variance of a coefficient is inflated due to correlation with other predictors.
# Calculate VIF
vif_values <- vif(model_logistic_multiple)
# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
# If matrix (categorical variables present), extract GVIF^(1/(2*Df))
vif_df <- data.frame(
Variable = rownames(vif_values),
VIF = vif_values[, "GVIF^(1/(2*Df))"]
)
} else {
# If vector (only continuous variables)
vif_df <- data.frame(
Variable = names(vif_values),
VIF = as.numeric(vif_values)
)
}
# Add interpretation
vif_df <- vif_df %>%
arrange(desc(VIF)) %>%
mutate(
Interpretation = case_when(
VIF < 5 ~ "Low (No concern)",
VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
VIF >= 10 ~ "High (Problem)"
)
)
vif_df %>%
kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
digits = 2,
align = "lrc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
row_spec(which(vif_df$VIF < 5), background = "#90EE90")| Variable | VIF | Interpretation | |
|---|---|---|---|
| age_cont | age_cont | 1.05 | Low (No concern) |
| current_smoker | current_smoker | 1.05 | Low (No concern) |
| phys_active | phys_active | 1.02 | Low (No concern) |
| sex | sex | 1.01 | Low (No concern) |
| education | education | 1.01 | Low (No concern) |
| bmi_cat | bmi_cat | 1.01 | Low (No concern) |
Cook’s Distance: Measures how much the model would change if an observation were removed.
# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)
# Create data frame
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
) %>%
mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))
# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance: Identifying Influential Observations",
subtitle = "Values > 1 indicate potentially influential observations",
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
ggplotly(p5)Cook’s Distance for Influential Observations
# Count influential observations
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")## Number of potentially influential observations: 0
Use Likelihood Ratio Test to compare nested models:
# Model 1: Age only
model1 <- glm(diabetes ~ age_cont,
data = brfss_clean,
family = binomial)
# Model 2: Age + Sex
model2 <- glm(diabetes ~ age_cont + sex,
data = brfss_clean,
family = binomial)
# Model 3: Full model
model3 <- model_logistic_multiple
# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")
# Create comparison table
model_comp <- data.frame(
Model = c("Model 1: Age only",
"Model 2: Age + Sex",
"Model 3: Full model"),
AIC = c(AIC(model1), AIC(model2), AIC(model3)),
BIC = c(BIC(model1), BIC(model2), BIC(model3)),
`Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
check.names = FALSE
)
model_comp %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model 1: Age only | 1175.08 | 1185.39 | 1171.08 |
| Model 2: Age + Sex | 1175.85 | 1191.32 | 1169.85 |
| Model 3: Full model | 1122.65 | 1179.36 | 1100.65 |
Interpretation:
All statistical models include an error term (\(\epsilon\)) to account for:
\[Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon\]
Key points:
In this lab, you will:
# YOUR CODE HERE: Create a frequency table of hypertension status
brfss_clean %>%
count(hypertension) %>%
mutate(percent = 100*n/sum(n))## # A tibble: 2 × 3
## hypertension n percent
## <dbl> <int> <dbl>
## 1 0 606 47.3
## 2 1 675 52.7
brfss_clean %>%
group_by(age_group) %>%
summarise(
n = n(),
prevalence = mean(hypertension == 1)
) %>%
mutate(prevalence_percent = 100*prevalence)## # A tibble: 6 × 4
## age_group n prevalence prevalence_percent
## <fct> <int> <dbl> <dbl>
## 1 18-24 12 0.0833 8.33
## 2 25-34 77 0.195 19.5
## 3 35-44 138 0.304 30.4
## 4 45-54 161 0.379 37.9
## 5 55-64 266 0.515 51.5
## 6 65+ 627 0.668 66.8
Questions:
What is the overall prevalence of hypertension in the dataset? The overall prevalence of hypertension in this dataset is approximately 52.7%.
How does hypertension prevalence vary by age group? Hypertension prevalence increases steadily with age. It rises from about 8% in ages 18–24 to nearly 67% in those 65 and older, showing a strong age gradient. ————————————————————————
# YOUR CODE HERE: Fit a simple logistic regression model
# Outcome: hypertension
# Predictor: age_cont
# YOUR CODE HERE: Display the results with odds ratios
m1 <- glm(hypertension ~ age_cont,
data = brfss_clean,
family = binomial)
broom::tidy(m1, exponentiate = TRUE, conf.int = TRUE)## # A tibble: 2 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0477 0.296 -10.3 7.54e-25 0.0264 0.0843
## 2 age_cont 1.05 0.00483 11.0 4.02e-28 1.04 1.06
Questions:
The odds ratio for age is: 1.055 Interpretation: For each 1 year increase in age, the odds of hypertension increase by about 5.5%.
Is the association statistically significant? Yes. The p-value is: 4.02 × 10⁻²⁸ That is far less than 0.05. So The association between age and hypertension is highly statistically significant.
What is the 95% confidence interval for the odds ratio?
(1.045, 1.065) We are 95% confident that each 1-year increase in age is associated with a 4.5% to 6.5% increase in the odds of hypertension. Because the entire CI is above 1, the association is positive and significant. ————————————————————————
# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
# YOUR CODE HERE: Display the results
m2 <- glm(
hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
data = brfss_clean,
family = binomial
)
broom::tidy(m2, exponentiate = TRUE, conf.int = TRUE)## # A tibble: 8 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.00818 0.653 -7.35 1.91e-13 0.00211 0.0280
## 2 age_cont 1.06 0.00529 11.2 2.79e-29 1.05 1.07
## 3 sexMale 1.27 0.123 1.95 5.11e- 2 0.999 1.62
## 4 bmi_catNormal 2.10 0.546 1.36 1.75e- 1 0.759 6.76
## 5 bmi_catOverweight 3.24 0.543 2.17 3.03e- 2 1.18 10.4
## 6 bmi_catObese 6.59 0.545 3.46 5.42e- 4 2.39 21.2
## 7 phys_active 0.900 0.130 -0.808 4.19e- 1 0.697 1.16
## 8 current_smoker 1.07 0.139 0.495 6.21e- 1 0.817 1.41
Questions:
After adjusting for sex, BMI, physical activity, and smoking, each 1-year increase in age is associated with about a 6.1% increase in the odds of hypertension. b) What does this suggest about confounding? Because the OR changed slightly (from 1.055 → 1.061), this suggests: There is minimal confounding by the added variables. The relationship between age and hypertension remains strong and stable after adjustment. The effect did not meaningfully shrink or reverse.
# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat
# YOUR CODE HERE: Extract and display the odds ratios for BMI categories
# Show dummy coding for BMI
head(model.matrix(~ bmi_cat, data = brfss_clean))## (Intercept) bmi_catNormal bmi_catOverweight bmi_catObese
## 1 1 0 0 1
## 2 1 0 0 1
## 3 1 1 0 0
## 4 1 1 0 0
## 5 1 0 1 0
## 6 1 1 0 0
# Extract BMI odds ratios
broom::tidy(m2, exponentiate = TRUE, conf.int = TRUE) %>%
dplyr::filter(stringr::str_detect(term, "bmi_cat"))## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bmi_catNormal 2.10 0.546 1.36 0.175 0.759 6.76
## 2 bmi_catOverweight 3.24 0.543 2.17 0.0303 1.18 10.4
## 3 bmi_catObese 6.59 0.545 3.46 0.000542 2.39 21.2
Questions:
Underweight is the reference category. In dummy coding, the reference group is the one that does not appear in the regression table.
Interpret the odds ratio for “Obese” compared to the reference category. From my table: OR = 6.585 95% CI = (2.39, 21.18) p = 0.00054 Interpretation: Compared to underweight individuals (reference group), obese individuals have approximately 6.6 times higher odds of hypertension, adjusting for age, sex, physical activity, and smoking. Because the CI does not include 1 and p < 0.05, this is statistically significant.
How would you explain this to a non-statistician?
People who are obese are much more likely to have high blood pressure than people who are underweight. In fact, their odds are more than six times higher, even after accounting for age and other lifestyle factors.
# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
m3 <- glm(
hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker,
data = brfss_clean,
family = binomial
)
anova(m2, m3, test = "LRT")## Analysis of Deviance Table
##
## Model 1: hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker
## Model 2: hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1273 1563.5
## 2 1270 1561.3 3 2.2363 0.5248
Questions:
Is the interaction term statistically significant? No. The p-value is 0.5248 Since this is much greater than 0.05: The Age × BMI interaction is not statistically significant.
What does this mean in epidemiologic terms (effect modification)? There is no evidence of effect modification by BMI. Age increases hypertension risk similarly for all BMI groups. Because the interaction is not significant: The simpler model (without interaction) is preferred. So for interpretation and reporting, you would keep m2, not m3.
Create a visualization showing predicted probabilities by age and BMI category
# YOUR CODE HERE: Calculate VIF for your multiple regression model
# YOUR CODE HERE: Create a Cook's distance plot to identify influential observations
# VIF
vif(m2)## GVIF Df GVIF^(1/(2*Df))
## age_cont 1.126628 1 1.061428
## sex 1.016509 1 1.008221
## bmi_cat 1.103045 3 1.016480
## phys_active 1.024820 1 1.012334
## current_smoker 1.073574 1 1.036134
Questions:
Are there any concerns about multicollinearity? There is no evidence of multicollinearity among the predictors. All VIF values are very close to 1.
Are there any influential observations that might affect your results? From my Cook’s distance plot: Most values are near 0 None appear close to 1 The largest values look around ~0.03 Rule of thumb: Cook’s D > 1 → potentially influential I have none above 1.
There are no highly influential observations in the model.
# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
# Model B: Age + sex + bmi_cat
# Model C: Age + sex + bmi_cat + phys_active + current_smoker
# YOUR CODE HERE: Create a comparison table
newdata <- data.frame(
age_cont = 60,
sex = "Male",
bmi_cat = "Obese",
phys_active = 0,
current_smoker = 1
)
pred_prob <- predict(m2, newdata = newdata, type = "response")
pred_prob## 1
## 0.7218378
Questions:
Which model has the best fit based on AIC? The main effects model (m2) has the better fit based on AIC.
Is the added complexity of the full model justified? The added complexity is not justifie since it didnt significantly improve models fit
Which model would you choose for your final analysis? Why?
The main effects model (m2) because the interaction isnt statistically significant and theres no meaningful improvement in fit, and its easier to interpret
Write a brief report (1-2 pages) summarizing your findings:
Introduction The research question for this analysis was: What is the association between age and hypertension, and does this association vary by BMI status? Hypertension is a major cardiovascular risk factor, and understanding how age and BMI contribute to hypertension risk is important for identifying high-risk populations.
Methods We analyzed a cleaned subset of the 2023 BRFSS dataset. Hypertension (yes/no) was the binary outcome variable. The primary exposure of interest was age (continuous). Additional covariates included: Sex BMI category Physical activity Current smoking status We first calculated descriptive statistics and prevalence of hypertension overall and by age group. We then fit: A simple logistic regression model (hypertension ~ age) A multiple logistic regression model adjusting for covariates A model including an interaction between age and BMI Model fit was assessed using likelihood ratio tests and AIC. Multicollinearity was evaluated using VIF, and influential observations were assessed using Cook’s distance.
Results Descriptive Findings The overall prevalence of hypertension was approximately 52.7%. Hypertension prevalence increased strongly with age: 18–24: 8.3% 25–34: 19.5% 35–44: 30.4% 45–54: 37.9% 55–64: 51.5% 65+: 66.8% Logistic Regression Results In the simple model: OR for age = 1.055 95% CI: (1.045, 1.065) p < 0.001 Each additional year of age was associated with a 5.5% increase in the odds of hypertension. In the adjusted model: OR for age = 1.061 95% CI: (1.050, 1.073) p < 0.001 Age remained strongly associated with hypertension after adjustment. BMI was the strongest predictor: Overweight: OR = 3.24 Obese: OR = 6.59 Obese individuals had over six times the odds of hypertension compared to underweight individuals. The age × BMI interaction was not statistically significant (p = 0.52), suggesting no effect modification. Diagnostics indicated: No multicollinearity (VIF ≈ 1) No highly influential observations (Cook’s D < 1) A predicted probability example showed that a 60-year-old obese male who is inactive and a current smoker had an estimated 72% probability of hypertension.
Interpretation Age is a strong and statistically significant predictor of hypertension. The odds of hypertension increase steadily with age. BMI is an even stronger predictor. Obesity is associated with dramatically higher odds of hypertension, independent of age and other covariates. There was no evidence that BMI modifies the effect of age on hypertension, meaning age increases risk similarly across BMI categories. Overall, the findings are consistent with known epidemiologic patterns linking age and obesity to hypertension risk.
Limitations Several limitations should be considered: I think the data is cross-sectional, so causality cant be inferred. Hypertension status is self-reported and may be classified. The reference category for BMI (underweight) may contain a small number of individuals, which could affect stability of estimates. Residual confounding is possible due to unmeasured factors (e.g., diet, medication use, socioeconomic factors).
Submission: Submit your completed R Markdown file and knitted HTML report.
Logistic Regression:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
Odds Ratio:
\[\text{OR} = e^{\beta_i}\]
Predicted Probability:
\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]
Session Info
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.0
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggstats_0.12.0 gtsummary_2.5.0 ggeffects_2.3.2 car_3.1-3
## [5] carData_3.0-5 broom_1.0.11 plotly_4.12.0 kableExtra_1.4.0
## [9] knitr_1.51 haven_2.5.5 lubridate_1.9.4 forcats_1.0.1
## [13] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.1.6
## [17] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0
## [4] htmlwidgets_1.6.4 insight_1.4.6 lattice_0.22-7
## [7] tzdb_0.5.0 crosstalk_1.2.2 vctrs_0.7.1
## [10] tools_4.5.2 generics_0.1.4 datawizard_1.3.0
## [13] pkgconfig_2.0.3 Matrix_1.7-4 data.table_1.18.0
## [16] RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.5
## [19] compiler_4.5.2 farver_2.1.2 textshaping_1.0.4
## [22] htmltools_0.5.9 sass_0.4.10 yaml_2.3.12
## [25] lazyeval_0.2.2 Formula_1.2-5 pillar_1.11.1
## [28] jquerylib_0.1.4 broom.helpers_1.22.0 cachem_1.1.0
## [31] abind_1.4-8 nlme_3.1-168 tidyselect_1.2.1
## [34] digest_0.6.39 stringi_1.8.7 labeling_0.4.3
## [37] splines_4.5.2 labelled_2.16.0 fastmap_1.2.0
## [40] grid_4.5.2 cli_3.6.5 magrittr_2.0.4
## [43] cards_0.7.1 utf8_1.2.6 withr_3.0.2
## [46] scales_1.4.0 backports_1.5.0 timechange_0.3.0
## [49] rmarkdown_2.30 httr_1.4.7 otel_0.2.0
## [52] hms_1.1.4 evaluate_1.0.5 viridisLite_0.4.2
## [55] mgcv_1.9-3 rlang_1.1.7 glue_1.8.0
## [58] xml2_1.5.2 svglite_2.2.2 rstudioapi_0.18.0
## [61] jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1
getwd() list.files()