Introduction

Statistical modeling is a fundamental tool in epidemiology that allows us to:

  • Describe relationships between variables
  • Predict outcomes based on risk factors
  • Estimate associations while controlling for confounding

This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.


Setup and Data Preparation

# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)

Loading BRFSS 2023 Data

The BRFSS is a large-scale telephone survey that collects data on health-related risk behaviors, chronic health conditions, and use of preventive services from U.S. residents.

# Load the full BRFSS 2023 dataset
brfss_clean <- read_rds("/Users/vikya/OneDrive - University at Albany - SUNY/Documents/553 Principles of Statistical Inference II/data/Statistical Modeling in Epidemiology Lab/brfss_subset_2023.rds")

# Display dataset dimensions
names(brfss_clean)
##  [1] "diabetes"       "age_group"      "age_cont"       "sex"           
##  [5] "race"           "education"      "income"         "bmi_cat"       
##  [9] "phys_active"    "current_smoker" "gen_health"     "hypertension"  
## [13] "high_chol"

Descriptive Statistics

# Summary table by diabetes status
desc_table <- brfss_clean %>%
  group_by(diabetes) %>%
  summarise(
    N = n(),
    `Mean Age` = round(mean(age_cont), 1),
    `% Male` = round(100 * mean(sex == "Male"), 1),
    `% Obese` = round(100 * mean(bmi_cat == "Obese", na.rm = TRUE), 1),
    `% Physically Active` = round(100 * mean(phys_active), 1),
    `% Current Smoker` = round(100 * mean(current_smoker), 1),
    `% Hypertension` = round(100 * mean(hypertension), 1),
    `% High Cholesterol` = round(100 * mean(high_chol), 1)
  ) %>%
  mutate(diabetes = ifelse(diabetes == 1, "Diabetes", "No Diabetes"))

desc_table %>%
  kable(caption = "Descriptive Statistics by Diabetes Status",
        align = "lrrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Descriptive Statistics by Diabetes Status
diabetes N Mean Age % Male % Obese % Physically Active % Current Smoker % Hypertension % High Cholesterol
No Diabetes 1053 58.2 49.0 34.8 69.4 29.3 47.5 42.5
Diabetes 228 63.1 53.9 56.1 53.5 27.6 76.8 67.1

Part 1: Statistical Modeling Concepts

1. What is Statistical Modeling?

A statistical model is a mathematical representation of the relationship between:

  • An outcome variable (dependent variable, response)
  • One or more predictor variables (independent variables, exposures, covariates)

General Form of a Statistical Model

\[f(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon\]

Where:

  • \(f(Y)\) is a function of the outcome (identity, log, logit, etc.)
  • \(\beta_0\) is the intercept (baseline value)
  • \(\beta_1, \beta_2, \ldots, \beta_p\) are coefficients (effect sizes)
  • \(X_1, X_2, \ldots, X_p\) are predictor variables
  • \(\epsilon\) is the error term (random variation)

2. Types of Regression Models

The choice of regression model depends on the type of outcome variable:

Common Regression Models in Epidemiology
Outcome Type Regression Type Link Function Example
Continuous Linear Identity: Y Blood pressure, BMI
Binary Logistic Logit: log(p/(1-p)) Disease status, mortality
Count Poisson/Negative Binomial Log: log(Y) Number of infections
Time-to-event Cox Proportional Hazards Log: log(h(t)) Survival time

Simple vs. Multiple Regression

  • Simple regression: One predictor variable
  • Multiple regression: Two or more predictor variables (controls for confounding)

3. Linear Regression Example

Let’s model the relationship between age and diabetes prevalence.

Simple Linear Regression

# Simple linear regression: diabetes ~ age
model_linear_simple <- lm(diabetes ~ age_cont, data = brfss_clean)

# Display results
tidy(model_linear_simple, conf.int = TRUE) %>%
  kable(caption = "Simple Linear Regression: Diabetes ~ Age",
        digits = 4,
        col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Simple Linear Regression: Diabetes ~ Age
Term Estimate Std. Error t-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) -0.0632 0.0481 -1.3125 0.1896 -0.1576 0.0312
age_cont 0.0041 0.0008 5.1368 0.0000 0.0025 0.0056

Interpretation:

  • Intercept (\(\beta_0\)): -0.0632 - Expected probability of diabetes at age 0 (not meaningful in this context)
  • Slope (\(\beta_1\)): 0.0041 - For each 1-year increase in age, the probability of diabetes increases by 0.41%

Visualization

With continuous age

# Create scatter plot with regression line
p1 <- ggplot(brfss_clean, aes(x = age_cont, y = diabetes)) +
  geom_jitter(alpha = 0.2, width = 0.5, height = 0.02, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
  labs(
    title = "Relationship Between Age and Diabetes",
    subtitle = "Simple Linear Regression",
    x = "Age (years)",
    y = "Probability of Diabetes"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p1) %>%
  layout(hovermode = "closest")

Diabetes Prevalence by Age


4. Logistic Regression: The Preferred Model for Binary Outcomes

Problem with linear regression for binary outcomes:

  • Predicted probabilities can fall outside [0, 1]
  • Assumes constant variance (violated for binary data)

Solution: Logistic Regression

Uses the logit link function to ensure predicted probabilities stay between 0 and 1:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Simple Logistic Regression

# Simple logistic regression: diabetes ~ age
model_logistic_simple <- glm(diabetes ~ age_cont,
                              data = brfss_clean,
                              family = binomial(link = "logit"))

# Display results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.029 0.423 -8.390 0 0.012 0.064
age_cont 1.034 0.007 4.978 0 1.021 1.048

Interpretation:

  • Odds Ratio (OR): 1.034
  • For each 1-year increase in age, the odds of diabetes increase by 3.4%
  • The relationship is highly statistically significant (p < 0.001)

Predicted Probabilities

# From ggeffects package
pp <- predict_response(model_logistic_simple, terms = "age_cont")
plot(pp)
Predicted Diabetes Probability by Age

Predicted Diabetes Probability by Age

# Generate predicted probabilities
pred_data <- data.frame(age_cont = seq(18, 80, by = 1))
pred_data$predicted_prob <- predict(model_logistic_simple,
                                    newdata = pred_data,
                                    type = "response")

# Plot
p2 <- ggplot(pred_data, aes(x = age_cont, y = predicted_prob)) +
  geom_line(color = "darkred", linewidth = 1.5) +
  geom_ribbon(aes(ymin = predicted_prob - 0.02,
                  ymax = predicted_prob + 0.02),
              alpha = 0.2, fill = "darkred") +
  labs(
    title = "Predicted Probability of Diabetes by Age",
    subtitle = "Simple Logistic Regression",
    x = "Age (years)",
    y = "Predicted Probability of Diabetes"
  ) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.6)) +
  theme_minimal(base_size = 12)

ggplotly(p2)

Predicted Diabetes Probability by Age


5. Multiple Regression: Controlling for Confounding

What is Confounding?

A confounder is a variable that:

  1. Is associated with both the exposure and the outcome
  2. Is not on the causal pathway between exposure and outcome
  3. Distorts the true relationship between exposure and outcome

Example: The relationship between age and diabetes may be confounded by BMI, physical activity, and other factors.

Multiple Logistic Regression

# Multiple logistic regression with potential confounders
model_logistic_multiple <- glm(diabetes ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker + education,
                               data = brfss_clean,
                               family = binomial(link = "logit"))

# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Diabetes ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")
Multiple Logistic Regression: Diabetes ~ Age + Covariates (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.009 1.177 -4.001 0.000 0.000 0.065
age_cont 1.041 0.007 5.515 0.000 1.027 1.057
sexMale 1.191 0.154 1.133 0.257 0.880 1.613
bmi_catNormal 1.971 1.052 0.645 0.519 0.378 36.309
bmi_catOverweight 3.155 1.044 1.101 0.271 0.621 57.679
bmi_catObese 6.834 1.041 1.845 0.065 1.354 124.675
phys_active 0.589 0.157 -3.373 0.001 0.433 0.802
current_smoker 1.213 0.178 1.085 0.278 0.852 1.716
educationHigh school graduate 0.634 0.288 -1.579 0.114 0.364 1.131
educationSome college 0.542 0.294 -2.081 0.037 0.307 0.977
educationCollege graduate 0.584 0.305 -1.763 0.078 0.324 1.074

Interpretation:

  • Age (adjusted OR): 1.041
    • After adjusting for sex, BMI, physical activity, smoking, and education, each 1-year increase in age is associated with a 4.1% increase in the odds of diabetes
  • Sex (Male vs Female): OR = 1.191
    • Males have 19.1% higher odds of diabetes compared to females, adjusting for other variables
  • BMI (Obese vs Normal): OR = 6.834
    • Obese individuals had 6.83 times higher odds of diabetes compared to normal-weight individuals.

6. Dummy Variables: Coding Categorical Predictors

Categorical variables with \(k\) levels are represented using \(k-1\) dummy variables (indicator variables).

Example: Education Level

Education has 4 levels: 1. < High school (reference category) 2. High school graduate 3. Some college 4. College graduate

R automatically creates 3 dummy variables:

# Extract dummy variable coding
dummy_table <- data.frame(
  Education = c("< High school", "High school graduate", "Some college", "College graduate"),
  `Dummy 1 (HS grad)` = c(0, 1, 0, 0),
  `Dummy 2 (Some college)` = c(0, 0, 1, 0),
  `Dummy 3 (College grad)` = c(0, 0, 0, 1),
  check.names = FALSE
)

dummy_table %>%
  kable(caption = "Dummy Variable Coding for Education",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category
Dummy Variable Coding for Education
Education Dummy 1 (HS grad) Dummy 2 (Some college) Dummy 3 (College grad)
< High school 0 0 0
High school graduate 1 0 0
Some college 0 1 0
College graduate 0 0 1

Reference Category: The category with all zeros (< High school) is the reference group. All other categories are compared to this reference.

Visualizing Education Effects

# Extract education coefficients
educ_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "education")) %>%
  mutate(
    education_level = str_remove(term, "education"),
    education_level = factor(education_level,
                             levels = c("High school graduate",
                                       "Some college",
                                       "College graduate"))
  )

# Add reference category
ref_row <- data.frame(
  term = "education< High school",
  estimate = 1.0,
  std.error = 0,
  statistic = NA,
  p.value = NA,
  conf.low = 1.0,
  conf.high = 1.0,
  education_level = factor("< High school (Ref)",
                          levels = c("< High school (Ref)",
                                    "High school graduate",
                                    "Some college",
                                    "College graduate"))
)

educ_coefs_full <- bind_rows(ref_row, educ_coefs) %>%
  mutate(education_level = factor(education_level,
                                 levels = c("< High school (Ref)",
                                           "High school graduate",
                                           "Some college",
                                           "College graduate")))

# Plot
p3 <- ggplot(educ_coefs_full, aes(x = education_level, y = estimate)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
                  size = 0.8, color = "darkblue") +
  coord_flip() +
  labs(
    title = "Association Between Education and Diabetes",
    subtitle = "Adjusted Odds Ratios (reference: < High school)",
    x = "Education Level",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p3)

Odds Ratios for Education Levels

# Plot model coefficients with `ggcoef_model()`
ggcoef_model(model_logistic_multiple, exponentiate = TRUE,
  include = c("education"),
  variable_labels = c(
    education = "Education"),
  facet_labeller = ggplot2::label_wrap_gen(10)
)


7. Interactions (Effect Modification)

An interaction exists when the effect of one variable on the outcome differs across levels of another variable.

Epidemiologic term: Effect modification

Example: Age × Sex Interaction

Does the effect of age on diabetes differ between males and females?

# Model with interaction term
model_interaction <- glm(diabetes ~ age_cont * sex + bmi_cat + phys_active,
                         data = brfss_clean,
                         family = binomial(link = "logit"))

# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "age_cont")) %>%
  kable(caption = "Age × Sex Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Age × Sex Interaction Model (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
age_cont 1.031 0.009 3.178 0.001 1.012 1.051
age_cont:sexMale 1.015 0.014 1.084 0.278 0.988 1.044

Interpretation:

  • Main effect of age: OR among females (reference)
  • Interaction term (age:sexMale): Additional effect of age among males
  • If the interaction term is significant, the age-diabetes relationship differs by sex

Visualizing Interaction

# Generate predicted probabilities by sex
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "sex"))

# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Diabetes by Age and Sex",
    subtitle = "Testing for Age × Sex Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Diabetes",
    color = "Sex",
    fill = "Sex"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
  scale_fill_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)

Age-Diabetes Relationship by Sex


8. Model Diagnostics

Every regression model makes assumptions about the data. If assumptions are violated, results may be invalid.

Key Assumptions for Logistic Regression

  1. Linearity of log odds: Continuous predictors have a linear relationship with the log odds of the outcome
  2. Independence of observations: Each observation is independent
  3. No perfect multicollinearity: Predictors are not perfectly correlated
  4. No influential outliers: Individual observations don’t overly influence the model

Checking for Multicollinearity

Variance Inflation Factor (VIF): Measures how much the variance of a coefficient is inflated due to correlation with other predictors.

  • VIF < 5: Generally acceptable
  • VIF > 10: Serious multicollinearity problem
# Calculate VIF
vif_values <- vif(model_logistic_multiple)

# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
  # If matrix (categorical variables present), extract GVIF^(1/(2*Df))
  vif_df <- data.frame(
    Variable = rownames(vif_values),
    VIF = vif_values[, "GVIF^(1/(2*Df))"]
  )
} else {
  # If vector (only continuous variables)
  vif_df <- data.frame(
    Variable = names(vif_values),
    VIF = as.numeric(vif_values)
  )
}

# Add interpretation
vif_df <- vif_df %>%
  arrange(desc(VIF)) %>%
  mutate(
    Interpretation = case_when(
      VIF < 5 ~ "Low (No concern)",
      VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
      VIF >= 10 ~ "High (Problem)"
    )
  )

vif_df %>%
  kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
        digits = 2,
        align = "lrc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
  row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
  row_spec(which(vif_df$VIF < 5), background = "#90EE90")
Variance Inflation Factors (VIF) for Multiple Regression Model
Variable VIF Interpretation
age_cont age_cont 1.05 Low (No concern)
current_smoker current_smoker 1.05 Low (No concern)
phys_active phys_active 1.02 Low (No concern)
sex sex 1.01 Low (No concern)
education education 1.01 Low (No concern)
bmi_cat bmi_cat 1.01 Low (No concern)

Influential Observations

Cook’s Distance: Measures how much the model would change if an observation were removed.

  • Cook’s D > 1: Potentially influential observation
# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)

Cook’s Distance for Influential Observations

# Count influential observations
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")
## Number of potentially influential observations: 0

9. Model Comparison and Selection

Comparing Nested Models

Use Likelihood Ratio Test to compare nested models:

# Model 1: Age only
model1 <- glm(diabetes ~ age_cont,
              data = brfss_clean,
              family = binomial)

# Model 2: Age + Sex
model2 <- glm(diabetes ~ age_cont + sex,
              data = brfss_clean,
              family = binomial)

# Model 3: Full model
model3 <- model_logistic_multiple

# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")

# Create comparison table
model_comp <- data.frame(
  Model = c("Model 1: Age only",
            "Model 2: Age + Sex",
            "Model 3: Full model"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  `Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")
Model Comparison: AIC, BIC, and Deviance
Model AIC BIC Deviance
Model 1: Age only 1175.08 1185.39 1171.08
Model 2: Age + Sex 1175.85 1191.32 1169.85
Model 3: Full model 1122.65 1179.36 1100.65

Interpretation:

  • Lower AIC/BIC indicates better model fit
  • Model 3 (full model) has the lowest AIC, suggesting it provides the best fit to the data

10. Error Term in Statistical Models

All statistical models include an error term (\(\epsilon\)) to account for:

  • Random variation in the outcome
  • Unmeasured variables not included in the model
  • Measurement error in variables

\[Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon\]

Key points:

  • The model cannot perfectly predict every outcome
  • The difference between observed and predicted values is the error (residual)
  • We assume errors are normally distributed with mean 0 (for linear regression)

Part 2: Student Lab Activity

Lab Overview

In this lab, you will:

  1. Build your own logistic regression model predicting hypertension (high blood pressure)
  2. Create dummy variables for categorical predictors
  3. Interpret regression coefficients
  4. Test for confounding and interaction
  5. Perform model diagnostics

Lab Instructions

Task 1: Explore the Outcome Variable

hyp_freq <- brfss_clean %>%
  count(hypertension) %>%
  mutate(
    Status = ifelse(hypertension == 1, "Hypertension", "No Hypertension"),
    Percent = round(100 * n / sum(n), 1)
  ) %>%
  select(Status, n, Percent)

hyp_freq %>%
  kable(caption = "Frequency Table: Hypertension Status",
        col.names = c("Status", "N", "%"),
        align = "lrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Frequency Table: Hypertension Status
Status N %
No Hypertension 606 47.3
Hypertension 675 52.7
# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
hyp_by_age <- brfss_clean %>%
  group_by(age_group) %>%
  summarise(
    N = n(),
    Cases = sum(hypertension),
    Prevalence = round(100 * mean(hypertension), 1)
  )

hyp_by_age %>%
  kable(caption = "Hypertension Prevalence by Age Group",
        col.names = c("Age Group", "N", "Cases", "Prevalence (%)"),
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Hypertension Prevalence by Age Group
Age Group N Cases Prevalence (%)
18-24 12 1 8.3
25-34 77 15 19.5
35-44 138 42 30.4
45-54 161 61 37.9
55-64 266 137 51.5
65+ 627 419 66.8

Questions:

  1. What is the overall prevalence of hypertension in the dataset? The overall prevalence of hypertension is 52.7%
  2. How does hypertension prevalence vary by age group? Prevalence rises sharply and consistently with age. From just 8.3% in the 18–24 group up to 66.8% among those 65+.

Task 2: Build a Simple Logistic Regression Model

# Outcome: hypertension
# Predictor: age_cont
model_hyp_simple <- glm(hypertension ~ age_cont,
                        data = brfss_clean,
                        family = binomial(link = "logit"))


# YOUR CODE HERE: Display the results with odds ratios
tidy(model_hyp_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic",
                      "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.048 0.296 -10.293 0 0.026 0.084
age_cont 1.055 0.005 10.996 0 1.045 1.065
pp_hyp <- predict_response(model_hyp_simple, terms = "age_cont")
plot(pp_hyp)

Questions:

  1. What is the odds ratio for age? Interpret this value. The OR for age is 1.055. Each additional year of age is associated with a 5.5% increase in the odds of hypertension.

  2. Is the association statistically significant? Yes, the association is highly statistically significant (z = 10.996, p < 0.001).

  3. What is the 95% confidence interval for the odds ratio? The 95% CI is (1.045, 1.065) narrow and above 1.0. This indicates a positive estimate. —

Task 3: Create a Multiple Regression Model

# Multiple logistic regression with potential confounders
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
model_hyp_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
                            phys_active + current_smoker,
                          data = brfss_clean,
                          family = binomial(link = "logit"))


# Display the results
tidy(model_hyp_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic",
                      "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")
Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.008 0.653 -7.355 0.000 0.002 0.028
age_cont 1.061 0.005 11.234 0.000 1.050 1.073
sexMale 1.270 0.123 1.950 0.051 0.999 1.616
bmi_catNormal 2.097 0.546 1.356 0.175 0.759 6.756
bmi_catOverweight 3.241 0.543 2.166 0.030 1.183 10.385
bmi_catObese 6.585 0.545 3.459 0.001 2.394 21.176
phys_active 0.900 0.130 -0.808 0.419 0.697 1.162
current_smoker 1.071 0.139 0.495 0.621 0.817 1.407

Questions:

  1. How did the odds ratio for age change after adjusting for other variables? The age OR increased slightly from 1.055 (crude) to 1.061 (adjusted).
  2. What does this suggest about confounding? This minimal change suggests no confounding by sex, BMI, physical activity, or smoking. Age has a strong and independent direct effect on hypertension
  3. Which variables are the strongest predictors of hypertension? BMI is the strongest predictor. The Obese OR is 6.585 (p = 0.001) and Overweight OR is 3.241 (p = 0.030).Age is also highly significant (p < 0.001). Physical activity, smoking, and sex are not statistically significant in the adjusted version —

Task 4: Interpret Dummy Variables

# Create a table showing the dummy variable coding for bmi_cat
dummy_bmi <- data.frame(
  `BMI Category`         = c("Underweight/Normal", "Overweight", "Obese"),
  `Dummy 1 (Overweight)` = c(0, 1, 0),
  `Dummy 2 (Obese)`      = c(0, 0, 1),
  check.names = FALSE
)

dummy_bmi %>%
  kable(caption = "Dummy Variable Coding for BMI Category",
        align = "lcc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category
Dummy Variable Coding for BMI Category
BMI Category Dummy 1 (Overweight) Dummy 2 (Obese)
Underweight/Normal 0 0
Overweight 1 0
Obese 0 1
# Extract and display the odds ratios for BMI categories
tidy(model_hyp_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "bmi_cat")) %>%
  mutate(term = str_remove(term, "bmi_cat")) %>%
  kable(caption = "Odds Ratios for BMI Categories (vs. Underweight/Normal)",
        digits = 3,
        col.names = c("BMI Category", "Odds Ratio", "Std. Error", "z-statistic",
                      "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Odds Ratios for BMI Categories (vs. Underweight/Normal)
BMI Category Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
Normal 2.097 0.546 1.356 0.175 0.759 6.756
Overweight 3.241 0.543 2.166 0.030 1.183 10.385
Obese 6.585 0.545 3.459 0.001 2.394 21.176

Questions:

  1. What is the reference category for BMI? The reference category is Underweight/Normal BMI

  2. Interpret the odds ratio for “Obese” compared to the reference category.The OR for Obese is 6.585 (95% CI: 2.394–21.176, p = 0.001).This means, after adjusting for age, sex, physical activity, and smoking, obese individuals have 6.6 times the odds of hypertension compared to normal/underweight individuals.

  3. How would you explain this to a non-statistician? People who are obese are about 6 to 7 times more likely to have high blood pressure than people at a healthy weight —

Task 5: Test for Interaction

# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category

model_hyp_noint <- glm(hypertension ~ age_cont + bmi_cat + sex + phys_active,
                       data = brfss_clean,
                       family = binomial(link = "logit"))
model_hyp_int <- glm(hypertension ~ age_cont * bmi_cat + sex + phys_active,
                     data = brfss_clean,
                     family = binomial(link = "logit"))
tidy(model_hyp_int, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "age_cont")) %>%
  kable(caption = "Age x BMI Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic",
                      "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Age x BMI Interaction Model (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
age_cont 1.005 0.042 0.129 0.897 0.930 1.110
age_cont:bmi_catNormal 1.056 0.043 1.261 0.207 0.955 1.145
age_cont:bmi_catOverweight 1.062 0.043 1.407 0.159 0.961 1.151
age_cont:bmi_catObese 1.051 0.043 1.162 0.245 0.951 1.137
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
anova(model_hyp_noint, model_hyp_int, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: hypertension ~ age_cont + bmi_cat + sex + phys_active
## Model 2: hypertension ~ age_cont * bmi_cat + sex + phys_active
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      1274     1563.7                     
## 2      1271     1561.6  3   2.1826   0.5354
pred_interact_hyp <- ggpredict(model_hyp_int, terms = c("age_cont [18:80]", "bmi_cat"))

p4 <- ggplot(pred_interact_hyp, aes(x = x, y = predicted,
                                    color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Hypertension by Age and BMI Category",
    subtitle = "Testing for Age × BMI Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Hypertension",
    color = "BMI Category",
    fill = "BMI Category"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Underweight/Normal" = "#4DBBD5",
                                "Overweight"         = "#E64B35",
                                "Obese"              = "#00A087")) +
  scale_fill_manual(values  = c("Underweight/Normal" = "#4DBBD5",
                                "Overweight"         = "#E64B35",
                                "Obese"              = "#00A087")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)

Questions:

  1. Is the interaction term statistically significant? No, none of the interaction terms are statistically significant (all p > 0.05: Normal p = 0.207, Overweight p = 0.159, Obese p = 0.245); The LRT= χ²(3) = 2.18, p = 0.535.
  2. What does this mean in epidemiologic terms (effect modification)? This means BMI is not an effect modifier of the age. The effect of age on hypertension risk is approximately the same regardless of someone’s weight.
  3. Create a visualization showing predicted probabilities by age and BMI category

Task 6: Model Diagnostics

# Calculate VIF for your multiple regression model
vif_values <- vif(model_hyp_multiple)

if (is.matrix(vif_values)) {
  vif_df <- data.frame(
    Variable = rownames(vif_values),
    VIF = vif_values[, "GVIF^(1/(2*Df))"]
  )
} else {
  vif_df <- data.frame(
    Variable = names(vif_values),
    VIF = as.numeric(vif_values)
  )
}

vif_df <- vif_df %>%
  arrange(desc(VIF)) %>%
  mutate(
    Interpretation = case_when(
      VIF < 5              ~ "Low (No concern)",
      VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
      VIF >= 10            ~ "High (Problem)"
    )
  )

vif_df %>%
  kable(caption = "Variance Inflation Factors (VIF) for Hypertension Multiple Regression Model",
        digits = 2,
        align = "lrc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
  row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
  row_spec(which(vif_df$VIF < 5), background = "#90EE90")
Variance Inflation Factors (VIF) for Hypertension Multiple Regression Model
Variable VIF Interpretation
age_cont age_cont 1.06 Low (No concern)
current_smoker current_smoker 1.04 Low (No concern)
bmi_cat bmi_cat 1.02 Low (No concern)
phys_active phys_active 1.01 Low (No concern)
sex sex 1.01 Low (No concern)
cooks_d <- cooks.distance(model_hyp_multiple)

influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")
## Number of potentially influential observations: 0
# Create a Cook's distance plot to identify influential observations
model_A <- glm(hypertension ~ age_cont,
               data = brfss_clean,
               family = binomial)

Questions:

  1. Are there any concerns about multicollinearity? No concerns. All VIF values are below 1.1, under the threshold of 5.
  2. Are there any influential observations that might affect your results? No influential observations
  3. What would you do if you found serious violations? If serious violations were found: - remove or combine correlated predictors: remove or combine correlated predictors - for influential observations: run sensitivity analyses with and without them and report both sets of results.

Task 7: Model Comparison

# Compare three models using AIC and BIC
# Model A: Age only
model_A <- glm(hypertension ~ age_cont,
               data = brfss_clean,
               family = binomial)
# Model B: Age + sex + bmi_cat
model_B <- glm(hypertension ~ age_cont + sex + bmi_cat,
               data = brfss_clean,
               family = binomial)
# Model C: Age + sex + bmi_cat + phys_active + current_smoker

model_C <- glm(hypertension ~ age_cont + sex + bmi_cat +
                 phys_active + current_smoker,
               data = brfss_clean,
               family = binomial)

# Create a comparison table
model_comp <- data.frame(
  Model = c("Model A: Age only",
            "Model B: Age + Sex + BMI",
            "Model C: Full model"),
  AIC      = c(AIC(model_A), AIC(model_B), AIC(model_C)),
  BIC      = c(BIC(model_A), BIC(model_B), BIC(model_C)),
  Deviance = c(deviance(model_A), deviance(model_B), deviance(model_C)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")
Model Comparison: AIC, BIC, and Deviance
Model AIC BIC Deviance
Model A: Age only 1636.61 1646.92 1632.61
Model B: Age + Sex + BMI 1576.49 1607.42 1564.49
Model C: Full model 1579.50 1620.74 1563.50

Questions:

  1. Which model has the best fit based on AIC? Model B has the best fit based on AIC (AIC = 1576.49), which is lower than both Model A (1636.61) and Model C (1579.50).
  2. Is the added complexity of the full model justified? Not justified. Model C has a higher AIC (1579.50 vs 1576.49) and higher BIC (1620.74 vs 1607.42) despite two extra predictors
  3. Which model would you choose for your final analysis? Why? Model B is the best choice. Because it achieves the lowest AIC and BIC, includes the two strongest predictors and avoids over fitting by excluding variables that do not significantly improve fit. —

Lab Report Guidelines

Write a brief report (1-2 pages) summarizing your findings:

  1. Introduction: State your research question
  2. Methods: Describe your analytic approach
  3. Results: Present key findings with tables and figures
  4. Interpretation: Explain what your results mean
  5. Limitations: Discuss potential issues with your analysis

Risk Factors for Hypertension Among U.S. Adults: A Logistic Regression Analysis of BRFSS 2023

Introduction

This analysis investigates the association between age and hypertension among U.S. adults, adjusted for sex, BMI category, physical activity, and smoking status. Hypertension is one of the most prevalent and modifiable risk factors for cardiovascular disease and stroke in the United States. This makes identification of its strongest predictors essential for targeting prevention and surveillance resources.

Methods

Hypertension was indicated as 1 = yes and 0 = no. I started by reporting the overall prevalence of hypertension and the prevalence by age group. I estimated three nested logistic regression models. The first included age as the only predictor (Model A). The second was a slightly more complex model that added sex and BMI category to the previous model (Model B). The third was a full model that added physical activity and smoking status to the previous model (Model C). The BMI was treated as a categorical variable. The “underweight/normal” category was the reference group. There were two dummy variables used to indicate “overweight” and “obese” status. I tested for an age vs. BMI interaction using a likelihood ratio test (LRT) comparing models with and without interaction terms. Model diagnostics included the Generalized Variance Inflation Factor (GVIF) for multicollinearity and Cook’s distance for influential observations. Model fit was compared across all three models using AIC and BIC. Loower values indicated better fit. All analyses were conducted in R with statistical significance set at α = 0.05.

Results

More than half of the overall sample (n = 675 out of 1,281) was hypertensive (prevalence = 52.7%). Prevalence increased with age, from 8.3% of adults aged 18-24 to 66.8% of those 65+. In the simple logistic regression model, the odds of hypertension were 5.5% higher for each year of age (OR = 1.055, 95% CI: 1.045-1.065, p < 0.001). The age OR changed by little after accounting for sex, BMI, PA, and smoking (OR = 1.061, 95% CI: 1.050-1.073); i.e., there was essentially no confounding of age by these variables. In the adjusted model, the odds of hypertension were over 6 times higher for obese individuals in comparison to those with a normal weight (OR = 6.585, 95% CI: 2.394–21.176, p = 0.001) and over 3 times higher for overweight individuals (OR = 3.241, 95% CI: 1.183–10.385, p = 0.030). In the case of sex, physical activity, and smoking, no statistically significant differences were found in the adjusted model. The age x BMI interaction was insignificantly different (LRT: χ²(3) = 2.18, p = 0.535), which means that the impact of age on hypertension is not significantly different within individual BMI levels. All GVIF values fell below 1.1 and no influential observations were identified (Cook’s D > 1: n = 0). Model comparison showed that Model B (age + sex + BMI) achieved the lowest AIC (1576.49) and BIC (1607.42), outperforming both the age-only model and the full model.All Generalized Variance Inflation Factor (GVIF) values were below 1.1, and no influential observations (Cook’s D > 1: n=0) were detected. Model comparison revealed that model B (age+sex+BMI) had the lowest AIC (1576.49) and BIC (1607.42) and was better compared to the age-only model and the full model.

Interpretation

Age and obesity stand out as the most important, clinically recognizable hypertension risk factors. Without the direct and mediated effects of obesity and the direct effect of age, one could potentially avert 50% of new cases of hypertension. This is a major public health challenge since many populations worldwide are experiencing both population aging and increasing prevalence of obesity. The non-significant interaction also confirms that the increase in the probability of hypertension with a one-year increase in age is similar by BMI group. However, evidence suggests that physical activity, smoking, and other factors might have different effects on hypertension by BMI,25 such that these factors could be effect modifiers of the age–hypertension relationship.

Limitations

This analysis is subject to several important limitations. The cross-sectional design of BRFSS prevents causal inference, therefore we cannot determine whether age or obesity causes hypertension from these data. Hypertension is self-reported, and misclassification bias may be introduced if participants are unaware of their diagnosis. BRFSS uses telephone-based random-digit dialing sampling, which may underrepresent low-income or non-English-speaking households. This potentially limits the generalizability of study findings. Also, the residual confounding from unmeasured variables, including dietary sodium intake, family history of hypertension, medication use, and psychological stress, cannot be excluded. Finally, small sample sizes in younger age groups (n = 12 for ages 18–24) limit the precision of prevalence estimates.

Submission: Submit your completed R Markdown file and knitted HTML report.


Summary

Key Concepts Covered

  1. Statistical modeling describes relationships between variables
  2. Regression types depend on the outcome variable type
  3. Logistic regression is appropriate for binary outcomes
  4. Multiple regression controls for confounding
  5. Dummy variables represent categorical predictors
  6. Interactions test for effect modification
  7. Model diagnostics check assumptions and identify problems
  8. Model comparison helps select the best model

Important Formulas

Logistic Regression:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Odds Ratio:

\[\text{OR} = e^{\beta_i}\]

Predicted Probability:

\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]


References

  • Agresti, A. (2018). An Introduction to Categorical Data Analysis (3rd ed.). Wiley.
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  • Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2012). Regression Methods in Biostatistics (2nd ed.). Springer.
  • Centers for Disease Control and Prevention. (2023). Behavioral Risk Factor Surveillance System.

Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggstats_0.12.0   gtsummary_2.5.0  ggeffects_2.3.2  car_3.1-3       
##  [5] carData_3.0-5    broom_1.0.11     plotly_4.12.0    kableExtra_1.4.0
##  [9] knitr_1.51       haven_2.5.5      lubridate_1.9.4  forcats_1.0.1   
## [13] stringr_1.6.0    dplyr_1.2.0      purrr_1.2.1      readr_2.1.6     
## [17] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6         xfun_0.56            bslib_0.10.0        
##  [4] htmlwidgets_1.6.4    insight_1.4.6        lattice_0.22-7      
##  [7] tzdb_0.5.0           crosstalk_1.2.2      vctrs_0.7.1         
## [10] tools_4.5.1          generics_0.1.4       datawizard_1.3.0    
## [13] pkgconfig_2.0.3      Matrix_1.7-3         data.table_1.18.0   
## [16] RColorBrewer_1.1-3   S7_0.2.1             lifecycle_1.0.5     
## [19] compiler_4.5.1       farver_2.1.2         textshaping_1.0.4   
## [22] htmltools_0.5.9      sass_0.4.10          yaml_2.3.12         
## [25] lazyeval_0.2.2       Formula_1.2-5        pillar_1.11.1       
## [28] jquerylib_0.1.4      broom.helpers_1.22.0 cachem_1.1.0        
## [31] abind_1.4-8          nlme_3.1-168         tidyselect_1.2.1    
## [34] digest_0.6.39        stringi_1.8.7        labeling_0.4.3      
## [37] splines_4.5.1        labelled_2.16.0      fastmap_1.2.0       
## [40] grid_4.5.1           cli_3.6.5            magrittr_2.0.4      
## [43] cards_0.7.1          withr_3.0.2          scales_1.4.0        
## [46] backports_1.5.0      timechange_0.3.0     rmarkdown_2.30      
## [49] httr_1.4.7           otel_0.2.0           hms_1.1.4           
## [52] evaluate_1.0.5       viridisLite_0.4.2    mgcv_1.9-3          
## [55] rlang_1.1.7          glue_1.8.0           xml2_1.5.2          
## [58] svglite_2.2.2        rstudioapi_0.18.0    jsonlite_2.0.0      
## [61] R6_2.6.1             systemfonts_1.3.1