Introduction

Statistical modeling is a fundamental tool in epidemiology that allows us to:

Describe relationships between variables
Predict outcomes based on risk factors
Estimate associations while controlling for confounding

This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.

Setup and Data Preparation

# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)

Descriptive Statistics

brfss_clean <- read_rds("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Labs/brfss_subset_2023.rds")

Part 2: Student Lab Activity

Lab Overview

In this lab, you will:

Build your own logistic regression model predicting hypertension (high blood pressure)
Create dummy variables for categorical predictors
Interpret regression coefficients
Test for confounding and interaction
Perform model diagnostics

Lab Instructions

Task 1: Explore the Outcome Variable

# YOUR CODE HERE: Create a frequency table of hypertension status
desc_table <- brfss_clean %>%
  group_by(hypertension) %>%
  summarise(
    N = n(),
    
  ) %>%
  mutate(hypertension = ifelse(hypertension == 1, "Hypertension", "No Hypertension"))

desc_table %>%
  kable(caption = "Frequency Table of Hypertension",
        align = "lrrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Frequency Table of Hypertension
hypertension	N
No Hypertension	606
Hypertension	675

# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
desc_table <- brfss_clean %>%
  group_by(age_group) %>%
  summarise(
    N = n(),

    `% Hypertension` = round(100 * mean(hypertension), 1),
)

desc_table %>%
  kable(caption = "Hypertension Status by Age Group",
        align = "lrrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Hypertension Status by Age Group
age_group	N	% Hypertension
18-24	12	8.3
25-34	77	19.5
35-44	138	30.4
45-54	161	37.9
55-64	266	51.5
65+	627	66.8

Questions:

What is the overall prevalence of hypertension in the dataset?

The overall prevalence of hypertension is 675 / 1281 = 52.7%

How does hypertension prevalence vary by age group?

Hypertension prevalence tends to increase with increasing age.

Task 2: Build a Simple Logistic Regression Model

# YOUR CODE HERE: Fit a simple logistic regression model
# Outcome: hypertension
# Predictor: age_cont

model_logistic_simple <- glm(hypertension ~ age_cont,
                              data = brfss_clean,
                              family = binomial(link = "logit"))

# Display results
tidy(model_logistic_simple, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Hypertension ~ Age",
        digits = 4,
        col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Simple Logistic Regression: Hypertension ~ Age
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	-3.0426	0.2956	-10.2934	0	-3.6328	-2.4732
age_cont	0.0531	0.0048	10.9955	0	0.0438	0.0627

# YOUR CODE HERE: Display the results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.048	0.296	-10.293	0	0.026	0.084
age_cont	1.055	0.005	10.996	0	1.045	1.065

Questions:

What is the odds ratio for age? Interpret this value.

The odds ratio for age is 1.055. This means that for every 1 year increase in age, there is a 5.5% increase in hypertension risk.

Is the association statistically significant?

Yes p < 0.05.

What is the 95% confidence interval for the odds ratio?

95% CI = [1.045, 1.065]

Task 3: Create a Multiple Regression Model

# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker

model_logistic_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker,
                               data = brfss_clean,
                               family = binomial(link = "logit"))

# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")

Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.008	0.653	-7.355	0.000	0.002	0.028
age_cont	1.061	0.005	11.234	0.000	1.050	1.073
sexMale	1.270	0.123	1.950	0.051	0.999	1.616
bmi_catNormal	2.097	0.546	1.356	0.175	0.759	6.756
bmi_catOverweight	3.241	0.543	2.166	0.030	1.183	10.385
bmi_catObese	6.585	0.545	3.459	0.001	2.394	21.176
phys_active	0.900	0.130	-0.808	0.419	0.697	1.162
current_smoker	1.071	0.139	0.495	0.621	0.817	1.407

Questions:

How did the odds ratio for age change after adjusting for other variables?

After controlling for other variables, the odds ratio for age increased to 1.061.

What does this suggest about confounding?

This suggests that other variables distort the true relationship between age and hypertension and bring the estimate closer to the null.

Which variables are the strongest predictors of hypertension?

The BMI variables are the strongest predictors of hypertension as they have the largest odds ratios.

Task 4: Interpret Dummy Variables

# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat



dummy_table <- data.frame(
  bmi_cat = c("Underweight", "Normal", "Overweight", "Obese"),
  `Dummy 1 (Normal)` = c(0, 1, 0, 0),
  `Dummy 2 (Overweight)` = c(0, 0, 1, 0),
  `Dummy 3 (Obese)` = c(0, 0, 0, 1),
  check.names = FALSE
)

dummy_table %>%
  kable(caption = "Dummy Variable Coding for BMI_cat",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category

Dummy Variable Coding for BMI_cat
bmi_cat	Dummy 1 (Normal)	Dummy 2 (Overweight)	Dummy 3 (Obese)
Underweight	0	0	0
Normal	1	0	0
Overweight	0	1	0
Obese	0	0	1

# YOUR CODE HERE: Extract and display the odds ratios for BMI categories

# Extract coefficients
bmi_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "bmi_cat")) %>%
  mutate(
    bmi = str_remove(term, "bmi_cat"),
    bmi = factor(bmi, levels = c("Underweight", "Normal", "Overweight","Obese"))
  )

# Add reference category
ref_row <- data.frame(
  term = "bmi_catUnderweight",
  estimate = 1.0,
  std.error = 0,
  statistic = NA,
  p.value = NA,
  conf.low = 1.0,
  conf.high = 1.0,
  bmi = factor("Underweight (Ref)",
                          levels = c("Underweight (Ref)",
                                      "Normal",
                                    "Overweight",
                                    "Obese"
                                 
                                    ))
)

bmi_coefs_full <- bind_rows(ref_row, bmi_coefs) %>%
  mutate(bmi = factor(bmi,
                                 levels = c("Underweight (Ref)",
                                            "Normal",
                                           "Overweight",
                                           "Obese"
                                          
                                       )))

# Plot
p3 <- ggplot(bmi_coefs_full, aes(x = bmi, y = estimate)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
                  size = 0.8, color = "darkblue") +
  coord_flip() +
  labs(
    title = "Association Between BMI and Hypertension",
    subtitle = "Adjusted Odds Ratios (reference: Normal)",
    x = "BMI",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p3)

Questions:

What is the reference category for BMI?

Underweight

Interpret the odds ratio for “Obese” compared to the reference category.

The odds ratio for Obese is 6.58 times that of the reference category.

How would you explain this to a non-statistician?

Obese BMI is associated with 6.58 times the risk of hypertension compared with Underweight BMI.

Task 5: Test for Interaction

# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category

# Model with interaction term
model_interaction <- glm(hypertension ~ age_cont * bmi_cat,
                         data = brfss_clean,
                         family = binomial(link = "logit"))

# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "age_cont")) %>%
  kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Age × BMI Interaction Model (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
age_cont	1.004	0.042	0.102	0.918	0.929	1.108
age_cont:bmi_catNormal	1.058	0.043	1.306	0.192	0.957	1.147
age_cont:bmi_catOverweight	1.063	0.043	1.423	0.155	0.962	1.151
age_cont:bmi_catObese	1.054	0.042	1.232	0.218	0.954	1.140

# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction

model1 <- glm(hypertension ~ age_cont,
              data = brfss_clean,
              family = binomial)

# Model 2: Age + BMI
model2 <- glm(hypertension ~ age_cont + bmi_cat,
              data = brfss_clean,
              family = binomial)

# Model 3: Full model
model3 <- model_logistic_multiple

# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")

# Create comparison table
model_comp <- data.frame(
  Model = c("Model 1: Age only",
            "Model 2: Age + BMI",
            "Model 3: Full model"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  `Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")

Model Comparison: AIC, BIC, and Deviance
Model	AIC	BIC	Deviance
Model 1: Age only	1636.61	1646.92	1632.61
Model 2: Age + BMI	1578.06	1603.84	1568.06
Model 3: Full model	1579.50	1620.74	1563.50

#visualization
pred_interact <- ggpredict(model_interaction, terms = c("age_cont", "bmi_cat"))

# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Hypertension by Age and BMI",
    subtitle = "Testing for Age × BMI Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Hypertension",
    color = "bmi_cat",
    fill = "bmi_cat"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Underweight" = "#E64B35", "Normal" = "#4DBBD5", "Overweight" = "#228B22", "Obese" = "#FFC000")) +
  scale_fill_manual(values = c("Underweight" = "#E64B35", "Normal" = "#4DBBD5", "Overweight" = "#228B22", "Obese" = "#FFC000")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)

Questions:

Is the interaction term statistically significant?

No, p>0.05

What does this mean in epidemiologic terms (effect modification)?

The relationship between age and hypertension is not affected by BMI. That is, BMI is not an effect modifier of the Age-Hypertension relationship.

Create a visualization showing predicted probabilities by age and BMI category

Done.

Task 6: Model Diagnostics

# YOUR CODE HERE: Calculate VIF for your multiple regression model
vif_values <- vif(model_logistic_multiple)

# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
  # If matrix (categorical variables present), extract GVIF^(1/(2*Df))
  vif_df <- data.frame(
    Variable = rownames(vif_values),
    VIF = vif_values[, "GVIF^(1/(2*Df))"]
  )
} else {
  # If vector (only continuous variables)
  vif_df <- data.frame(
    Variable = names(vif_values),
    VIF = as.numeric(vif_values)
  )
}

# Add interpretation
vif_df <- vif_df %>%
  arrange(desc(VIF)) %>%
  mutate(
    Interpretation = case_when(
      VIF < 5 ~ "Low (No concern)",
      VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
      VIF >= 10 ~ "High (Problem)"
    )
  )

vif_df %>%
  kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
        digits = 2,
        align = "lrc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
  row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
  row_spec(which(vif_df$VIF < 5), background = "#90EE90")

Variance Inflation Factors (VIF) for Multiple Regression Model
	Variable	VIF	Interpretation
age_cont	age_cont	1.06	Low (No concern)
current_smoker	current_smoker	1.04	Low (No concern)
bmi_cat	bmi_cat	1.02	Low (No concern)
phys_active	phys_active	1.01	Low (No concern)
sex	sex	1.01	Low (No concern)

# YOUR CODE HERE: Create a Cook's distance plot to identify influential observations
# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)

n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")

## Number of potentially influential observations: 0

Questions:

Are there any concerns about multicollinearity?

No.

Are there any influential observations that might affect your results?

No.

What would you do if you found serious violations?

If I found serious violations in the VIF test I would try to deal with multicollinearity by removing one of the highly correlated predictors. If there were serious violations in Cook’s test I would consider removing influential outliers.

Task 7: Model Comparison

# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
# Model B: Age + sex + bmi_cat
# Model C: Age + sex + bmi_cat + phys_active + current_smoker

model1 <- glm(hypertension ~ age_cont,
              data = brfss_clean,
              family = binomial)

# Model 2: Age + BMI
model2 <- glm(hypertension ~ age_cont + sex + bmi_cat,
              data = brfss_clean,
              family = binomial)

# Model 3: Full model
model3 <- glm(hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
              data = brfss_clean,
              family = binomial)

# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")

# Create comparison table
model_comp <- data.frame(
  Model = c("Model 1: Age only",
            "Model 2: Age + sex + bmi_cat",
            "Model 3: Age + sex + bmi_cat + phys_active + current_smoker"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  `Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")

Model Comparison: AIC, BIC, and Deviance
Model	AIC	BIC	Deviance
Model 1: Age only	1636.61	1646.92	1632.61
Model 2: Age + sex + bmi_cat	1576.49	1607.42	1564.49
Model 3: Age + sex + bmi_cat + phys_active + current_smoker	1579.50	1620.74	1563.50

Questions:

Which model has the best fit based on AIC?

Age + sex + bmi_cat is the best fit as it has the lowest AIC.

Is the added complexity of the full model justified?

No because Model 2 has the lowest AIC, suggesting it has the best fit to the data, so adding the other predictors is unnecessary.

Which model would you choose for your final analysis? Why?

I would use Model 2 because it has the lowest AIC and so provides the best fit for the data.

Lab Report Guidelines

Write a brief report (1-2 pages) summarizing your findings:

In this assignment, I sought to understand the factors that are associated with hypertension in the 2023 BRFSS data. In this analysis, hypertension was treated as a binary variable, that is, an individual either did or did not have hypertension.

First, I created a frequency table of hypertension status, and presented the age-specific prevalence of hypertension in the sample. For age-specific prevalence, age was categorized into groups 18-24, 25-34, 35-44, 45-54, 55-64, and 65+.

Then, I ran a simple logistic regression model to quantify the relationship between age and hypertension, with continuous age as the predictor and hypertension status as the outcome. I found that OR=1.055, p<0.001, 95% CI [1.045,1.065], indicating that every 1 year age increase resulted in a 5.5% increased risk of hypertension.

I also ran a multiple logistic regression model with hypertension status as the outcome and age, sex, BMI, physical activity, and smoking status as the predictors. For this analysis, sex was a binary variable (M/F), BMI was categorized into four groups (Underweight, Normal, Overweight, Underweight), Physical Activity was a binary variable (No Physical Activity / Physical Activity), and Smoking Status was a binary variable (Current Smoker / Not a current smoker). After adjusting for these other variables, I found that the odds ratio for age increased slightly, to OR=1.061, p<0.001, 95% CI[1.050, 1.073]. This indicates that the aforementioned variables were confounders that distorted the true relationship between age and hypertension, bringing the estimate closer to the null value.

I also tested for a potential interaction effect between age and BMI. No statistically significant interaction term was found, that is, the effect of age on hypertension was not affected by BMI. However, a likelihood ratio test indicated that the Age+BMI model was a better fit for the data than the age only model, as it had a lower AIC and BIC (AIC: 1578.06, BIC: 1603.84, Deviance: 1568.06)

Model diagnostics for multicollinearity (VIF) and influential outliers (Cook’s Distance) did not indicate any serious concerns with the data. For VIF, none of the variables in the multiple regression were significantly correlated with one another. For the Cook’s Distance, there were no influential outliers present.

A final model comparison indicated that the model with Age, Sex, and BMI was the best fit for the data based on AIC and BIC (AIC: 1576.49, BIC: 1607.42, Deviance: 1564.49). This model was the better fit compared to the Age Only model and a model with Age, sex, BMI, physical activity, and smoking status.

There are some potential limitations with this analysis. It is not possible to determine causation with this analysis. In this analysis, I used data from a single point in time, and determined associations between variables. Thus, it is not possible to make inferences about the direction of any particular association. Also, linear regression assumes a linear relationship between variables, but this may not always be the case. Thus, when variables display a non-linear relationship, the results of linear regression may not paint an accurate picture of the true relationship in the population.

Introduction: State your research question
Methods: Describe your analytic approach
Results: Present key findings with tables and figures
Interpretation: Explain what your results mean
Limitations: Discuss potential issues with your analysis

Submission: Submit your completed R Markdown file and knitted HTML report.

Summary

Key Concepts Covered

Statistical modeling describes relationships between variables
Regression types depend on the outcome variable type
Logistic regression is appropriate for binary outcomes
Multiple regression controls for confounding
Dummy variables represent categorical predictors
Interactions test for effect modification
Model diagnostics check assumptions and identify problems
Model comparison helps select the best model

Important Formulas

Logistic Regression:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Odds Ratio:

\[\text{OR} = e^{\beta_i}\]

Predicted Probability:

\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]

References

Agresti, A. (2018). An Introduction to Categorical Data Analysis (3rd ed.). Wiley.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2012). Regression Methods in Biostatistics (2nd ed.). Springer.
Centers for Disease Control and Prevention. (2023). Behavioral Risk Factor Surveillance System.

Session Info

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggstats_0.12.0   gtsummary_2.5.0  ggeffects_2.3.2  car_3.1-3       
##  [5] carData_3.0-5    broom_1.0.11     plotly_4.12.0    kableExtra_1.4.0
##  [9] knitr_1.51       haven_2.5.5      lubridate_1.9.4  forcats_1.0.1   
## [13] stringr_1.6.0    dplyr_1.1.4      purrr_1.2.1      readr_2.1.6     
## [17] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] insight_1.4.4      tzdb_0.5.0         vctrs_0.7.0        tools_4.5.1       
##  [9] crosstalk_1.2.2    generics_0.1.4     datawizard_1.3.0   pkgconfig_2.0.3   
## [13] data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.1           lifecycle_1.0.5   
## [17] compiler_4.5.1     farver_2.1.2       textshaping_1.0.4  htmltools_0.5.9   
## [21] sass_0.4.10        yaml_2.3.12        lazyeval_0.2.2     Formula_1.2-5     
## [25] pillar_1.11.1      jquerylib_0.1.4    cachem_1.1.0       abind_1.4-8       
## [29] tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7      labeling_0.4.3    
## [33] fastmap_1.2.0      grid_4.5.1         cli_3.6.5          magrittr_2.0.4    
## [37] withr_3.0.2        scales_1.4.0       backports_1.5.0    timechange_0.3.0  
## [41] rmarkdown_2.30     httr_1.4.7         otel_0.2.0         hms_1.1.4         
## [45] evaluate_1.0.5     viridisLite_0.4.2  rlang_1.1.7        glue_1.8.0        
## [49] xml2_1.5.2         svglite_2.2.2      rstudioapi_0.18.0  jsonlite_2.0.0    
## [53] R6_2.6.1           systemfonts_1.3.1

Statistical Modeling in Epidemiology

EPI 553 - Advanced Epidemiologic Methods

Josh Macera

2026-02-24

Introduction

Setup and Data Preparation

Descriptive Statistics

Part 2: Student Lab Activity

Lab Overview

Lab Instructions

Task 1: Explore the Outcome Variable

Task 2: Build a Simple Logistic Regression Model

Task 3: Create a Multiple Regression Model

Task 4: Interpret Dummy Variables

Task 5: Test for Interaction

Task 6: Model Diagnostics

Task 7: Model Comparison

Lab Report Guidelines

Summary

Key Concepts Covered

Important Formulas

References