Introduction

Statistical modeling is a fundamental tool in epidemiology that allows us to:

Describe relationships between variables
Predict outcomes based on risk factors
Estimate associations while controlling for confounding

This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.

Setup and Data Preparation

# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)
library(janitor)
library(broom.helpers)

Loading BRFSS 2023 Data

The BRFSS is a large-scale telephone survey that collects data on health-related risk behaviors, chronic health conditions, and use of preventive services from U.S. residents.

# Load the full BRFSS 2023 dataset
brfss_full <- read_xpt("/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/BRFSS_2013") %>%
  janitor::clean_names()

# Display dataset dimensions
names(brfss_full)

##   [1] "state"    "fmonth"   "idate"    "imonth"   "iday"     "iyear"   
##   [7] "dispcode" "seqno"    "psu"      "ctelenm1" "pvtresd1" "colghous"
##  [13] "statere1" "celphon1" "ladult1"  "numadult" "respslc1" "landsex2"
##  [19] "lndsxbrt" "safetime" "ctelnum1" "cellfon5" "cadult1"  "cellsex2"
##  [25] "celsxbrt" "pvtresd3" "cclghous" "cstate1"  "landline" "hhadult" 
##  [31] "sexvar"   "genhlth"  "physhlth" "menthlth" "poorhlth" "primins1"
##  [37] "persdoc3" "medcost1" "checkup1" "exerany2" "exract12" "exeroft1"
##  [43] "exerhmm1" "exract22" "exeroft2" "exerhmm2" "strength" "bphigh6" 
##  [49] "bpmeds1"  "cholchk3" "toldhi3"  "cholmed3" "cvdinfr4" "cvdcrhd4"
##  [55] "cvdstrk3" "asthma3"  "asthnow"  "chcscnc1" "chcocnc1" "chccopd3"
##  [61] "addepev3" "chckdny2" "havarth4" "diabete4" "diabage4" "marital" 
##  [67] "educa"    "renthom1" "numhhol4" "numphon4" "cpdemo1c" "veteran3"
##  [73] "employ1"  "children" "income3"  "pregnant" "weight2"  "height3" 
##  [79] "deaf"     "blind"    "decide"   "diffwalk" "diffdres" "diffalon"
##  [85] "fall12mn" "fallinj5" "smoke100" "smokday2" "usenow3"  "ecignow2"
##  [91] "alcday4"  "avedrnk3" "drnk3ge5" "maxdrnks" "flushot7" "flshtmy3"
##  [97] "pneuvac4" "shingle2" "hivtst7"  "hivtstd3" "seatbelt" "drnkdri2"
## [103] "covidpo1" "covidsm1" "covidact" "pdiabts1" "prediab2" "diabtype"
## [109] "insulin1" "chkhemo3" "eyeexam1" "diabeye1" "diabedu1" "feetsore"
## [115] "arthexer" "arthedu"  "lmtjoin3" "arthdis2" "joinpai2" "lcsfirst"
## [121] "lcslast"  "lcsnumcg" "lcsctsc1" "lcsscncr" "lcsctwhn" "hadmam"  
## [127] "howlong"  "cervscrn" "crvclcnc" "crvclpap" "crvclhpv" "hadhyst2"
## [133] "psatest1" "psatime1" "pcpsars2" "psasugs1" "pcstalk2" "hadsigm4"
## [139] "colnsigm" "colntes1" "sigmtes1" "lastsig4" "colncncr" "vircolo1"
## [145] "vclntes2" "smalstol" "stoltest" "stooldn2" "bldstfit" "sdnates1"
## [151] "cncrdiff" "cncrage"  "cncrtyp2" "csrvtrt3" "csrvdoc1" "csrvsum" 
## [157] "csrvrtrn" "csrvinst" "csrvinsr" "csrvdein" "csrvclin" "csrvpain"
## [163] "csrvctl2" "indortan" "numburn3" "sunprtct" "wkdayout" "wkendout"
## [169] "cimemlo1" "cdworry"  "cddiscu1" "cdhous1"  "cdsocia1" "caregiv1"
## [175] "crgvrel4" "crgvlng1" "crgvhrs1" "crgvprb3" "crgvalzd" "crgvper1"
## [181] "crgvhou1" "crgvexpt" "lastsmk2" "stopsmk2" "mentcigs" "mentecig"
## [187] "heattbco" "firearm5" "gunload"  "loadulk2" "hasymp1"  "hasymp2" 
## [193] "hasymp3"  "hasymp4"  "hasymp5"  "hasymp6"  "strsymp1" "strsymp2"
## [199] "strsymp3" "strsymp4" "strsymp5" "strsymp6" "firstaid" "aspirin" 
## [205] "birthsex" "somale"   "sofemale" "trnsgndr" "marijan1" "marjsmok"
## [211] "marjeat"  "marjvape" "marjdab"  "marjothr" "usemrjn4" "acedeprs"
## [217] "acedrink" "acedrugs" "aceprisn" "acedivrc" "acepunch" "acehurt1"
## [223] "aceswear" "acetouch" "acetthem" "acehvsex" "aceadsaf" "aceadned"
## [229] "imfvpla4" "hpvadvc4" "hpvadsht" "tetanus1" "covidva1" "covacge1"
## [235] "covidnu2" "lsatisfy" "emtsuprt" "sdlonely" "sdhemply" "foodstmp"
## [241] "sdhfood1" "sdhbills" "sdhutils" "sdhtrnsp" "sdhstre1" "rrclass3"
## [247] "rrcognt2" "rrtreat"  "rratwrk2" "rrhcare4" "rrphysm2" "rcsgend1"
## [253] "rcsxbrth" "rcsrltn2" "casthdx2" "casthno2" "qstver"   "qstlang" 
## [259] "metstat"  "urbstat"  "mscode"   "ststr"    "strwt"    "rawrake" 
## [265] "wt2rake"  "imprace"  "chispnc"  "crace1"   "cageg"    "cllcpwt" 
## [271] "dualuse"  "dualcor"  "llcpwt2"  "llcpwt"   "rfhlth"   "phys14d" 
## [277] "ment14d"  "hlthpl1"  "hcvu653"  "totinda"  "metvl12"  "metvl22" 
## [283] "maxvo21"  "fc601"    "actin13"  "actin23"  "padur1"   "padur2"  
## [289] "pafreq1"  "pafreq2"  "minac12"  "minac22"  "strfreq"  "pamiss3" 
## [295] "pamin13"  "pamin23"  "pa3min"   "pavig13"  "pavig23"  "pa3vigm" 
## [301] "pacat3"   "paindx3"  "pa150r4"  "pa300r4"  "pa30023"  "pastrng" 
## [307] "parec3"   "pastae3"  "rfhype6"  "cholch3"  "rfchol3"  "michd"   
## [313] "ltasth1"  "casthm1"  "asthms1"  "drdxar2"  "mrace1"   "hispanc" 
## [319] "race"     "raceg21"  "racegr3"  "raceprv"  "sex"      "ageg5yr" 
## [325] "age65yr"  "age80"    "age_g"    "htin4"    "htm4"     "wtkg3"   
## [331] "bmi5"     "bmi5cat"  "rfbmi5"   "chldcnt"  "educag"   "incomg1" 
## [337] "smoker3"  "rfsmok3"  "cureci2"  "drnkany6" "drocdy4"  "rfbing6" 
## [343] "drnkwk2"  "rfdrhv8"  "flshot7"  "pneumo3"  "aidtst4"  "rfseat2" 
## [349] "rfseat3"  "drnkdrv"

Creating a Working Subset

For computational efficiency and teaching purposes, we’ll create a subset with relevant variables and complete cases.

# Select variables of interest and create analytic dataset
set.seed(553)  # For reproducibility

brfss_subset <- brfss_full %>%
  select(
    # Outcome: Diabetes status
    diabete4,
    # Demographics
    age_g,      # Age category
    sex,         # Sex
    race,       # Race/ethnicity
    educag,     # Education level
    incomg1,    # Income category
    # Health behaviors
    bmi5cat,    # BMI category
    exerany2,     # Physical activity
    smokday2,     # Smoking frequency
    # Health status
    genhlth,      # General health
    rfhype6,    # High blood pressure
    rfchol3     # High cholesterol
  ) %>%
  # Filter to complete cases only
  drop_na() %>%
  # Sample 2000 observations for manageable analysis
  slice_sample(n = 2000)

# Display subset dimensions
cat("Working subset dimensions:",
    nrow(brfss_subset), "observations,",
    ncol(brfss_subset), "variables\n")

## Working subset dimensions: 2000 observations, 12 variables

Data Recoding and Cleaning

# Create clean dataset with recoded variables
brfss_clean <- brfss_subset %>%
  mutate(
    # Outcome: Diabetes (binary)
    diabetes = case_when(
      diabete4 == 1 ~ 1,  # Yes
      diabete4 %in% c(2, 3, 4) ~ 0,  # No, pre-diabetes, or gestational only
      TRUE ~ NA_real_
    ),

    # Age groups
    age_group = factor(case_when(
      age_g == 1 ~ "18-24",
      age_g == 2 ~ "25-34",
      age_g == 3 ~ "35-44",
      age_g == 4 ~ "45-54",
      age_g == 5 ~ "55-64",
      age_g == 6 ~ "65+"
    ), levels = c("18-24", "25-34", "35-44", "45-54", "55-64", "65+")),

    # Age continuous (midpoint of category)
    age_cont = case_when(
      age_g == 1 ~ 21,
      age_g == 2 ~ 29.5,
      age_g == 3 ~ 39.5,
      age_g == 4 ~ 49.5,
      age_g == 5 ~ 59.5,
      age_g == 6 ~ 70
    ),

    # Sex
    sex = factor(ifelse(sex == 1, "Male", "Female")),

    # Race/ethnicity
    race = factor(case_when(
      race == 1 ~ "White",
      race == 2 ~ "Black",
      race == 3 ~ "Native American",
      race == 4 ~ "Asian",
      race == 5 ~ "Native Hawaiian/PI",
      race == 6 ~ "Other",
      race == 7 ~ "Multiracial",
      race == 8 ~ "Hispanic"
    )),

    # Education (simplified)
    education = factor(case_when(
      educag == 1 ~ "< High school",
      educag == 2 ~ "High school graduate",
      educag == 3 ~ "Some college",
      educag == 4 ~ "College graduate"
    ), levels = c("< High school", "High school graduate", "Some college", "College graduate")),

    # Income (simplified)
    income = factor(case_when(
      incomg1 == 1 ~ "< $25,000",
      incomg1 == 2 ~ "$25,000-$49,999",
      incomg1 == 3 ~ "$50,000-$74,999",
      incomg1 == 4 ~ "$75,000+",
      incomg1 == 5 ~ "Unknown"
    ), levels = c("< $25,000", "$25,000-$49,999", "$50,000-$74,999", "$75,000+", "Unknown")),

    # BMI category
    bmi_cat = factor(case_when(
      bmi5cat == 1 ~ "Underweight",
      bmi5cat == 2 ~ "Normal",
      bmi5cat == 3 ~ "Overweight",
      bmi5cat == 4 ~ "Obese"
    ), levels = c("Underweight", "Normal", "Overweight", "Obese")),

    # Physical activity (binary)
    phys_active = ifelse(exerany2 == 1, 1, 0),

    # Current smoking
    current_smoker = case_when(
      smokday2 == 1 ~ 1,  # Every day
      smokday2 == 2 ~ 1,  # Some days
      smokday2 == 3 ~ 0,  # Not at all
      TRUE ~ 0
    ),

    # General health (simplified)
    gen_health = factor(case_when(
      genhlth %in% c(1, 2) ~ "Excellent/Very good",
      genhlth == 3 ~ "Good",
      genhlth %in% c(4, 5) ~ "Fair/Poor"
    ), levels = c("Excellent/Very good", "Good", "Fair/Poor")),

    # Hypertension
    hypertension = ifelse(rfhype6 == 2, 1, 0),

    # High cholesterol
    high_chol = ifelse(rfchol3 == 2, 1, 0)
  ) %>%
  # Select only the clean variables
  select(diabetes, age_group, age_cont, sex, race, education, income,
         bmi_cat, phys_active, current_smoker, gen_health,
         hypertension, high_chol) %>%
  # Remove any remaining missing values
  drop_na()

# Save the cleaned subset for future use
write_rds(brfss_clean,
          "/Users/zoya_hayes/Desktop/EPI553/EPI553_Coding/data/brfss_subset_2023.rds")

cat("Clean dataset saved with", nrow(brfss_clean), "complete observations\n")

## Clean dataset saved with 1281 complete observations

Descriptive Statistics

# Summary table by diabetes status
desc_table <- brfss_clean %>%
  group_by(diabetes) %>%
  summarise(
    N = n(),
    `Mean Age` = round(mean(age_cont), 1),
    `% Male` = round(100 * mean(sex == "Male"), 1),
    `% Obese` = round(100 * mean(bmi_cat == "Obese", na.rm = TRUE), 1),
    `% Physically Active` = round(100 * mean(phys_active), 1),
    `% Current Smoker` = round(100 * mean(current_smoker), 1),
    `% Hypertension` = round(100 * mean(hypertension), 1),
    `% High Cholesterol` = round(100 * mean(high_chol), 1)
  ) %>%
  mutate(diabetes = ifelse(diabetes == 1, "Diabetes", "No Diabetes"))

desc_table %>%
  kable(caption = "Descriptive Statistics by Diabetes Status",
        align = "lrrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Descriptive Statistics by Diabetes Status
diabetes	N	Mean Age	% Male	% Obese	% Physically Active	% Current Smoker	% Hypertension	% High Cholesterol
No Diabetes	1053	58.2	49.0	34.8	69.4	29.3	47.5	42.5
Diabetes	228	63.1	53.9	56.1	53.5	27.6	76.8	67.1

Part 1: Statistical Modeling Concepts

1. What is Statistical Modeling?

A statistical model is a mathematical representation of the relationship between:

An outcome variable (dependent variable, response)
One or more predictor variables (independent variables, exposures, covariates)

General Form of a Statistical Model

\[f(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon\]

Where:

\(f(Y)\) is a function of the outcome (identity, log, logit, etc.)
\(\beta_0\) is the intercept (baseline value)
\(\beta_1, \beta_2, \ldots, \beta_p\) are coefficients (effect sizes)
\(X_1, X_2, \ldots, X_p\) are predictor variables
\(\epsilon\) is the error term (random variation)

2. Types of Regression Models

The choice of regression model depends on the type of outcome variable:

Common Regression Models in Epidemiology
Outcome Type	Regression Type	Link Function	Example
Continuous	Linear	Identity: Y	Blood pressure, BMI
Binary	Logistic	Logit: log(p/(1-p))	Disease status, mortality
Count	Poisson/Negative Binomial	Log: log(Y)	Number of infections
Time-to-event	Cox Proportional Hazards	Log: log(h(t))	Survival time

Simple vs. Multiple Regression

Simple regression: One predictor variable
Multiple regression: Two or more predictor variables (controls for confounding)

3. Linear Regression Example

Let’s model the relationship between age and diabetes prevalence.

Simple Linear Regression

# Simple linear regression: diabetes ~ age
model_linear_simple <- lm(diabetes ~ age_cont, data = brfss_clean)

# Display results
tidy(model_linear_simple, conf.int = TRUE) %>%
  kable(caption = "Simple Linear Regression: Diabetes ~ Age",
        digits = 4,
        col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Simple Linear Regression: Diabetes ~ Age
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	-0.0632	0.0481	-1.3125	0.1896	-0.1576	0.0312
age_cont	0.0041	0.0008	5.1368	0.0000	0.0025	0.0056

Interpretation:

Intercept (\(\beta_0\)): -0.0632 - Expected probability of diabetes at age 0 (not meaningful in this context)
Slope (\(\beta_1\)): 0.0041 - For each 1-year increase in age, the probability of diabetes increases by 0.41%

Visualization

With continuous age

# Create scatter plot with regression line
p1 <- ggplot(brfss_clean, aes(x = age_cont, y = diabetes)) +
  geom_jitter(alpha = 0.2, width = 0.5, height = 0.02, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
  labs(
    title = "Relationship Between Age and Diabetes",
    subtitle = "Simple Linear Regression",
    x = "Age (years)",
    y = "Probability of Diabetes"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p1) %>%
  layout(hovermode = "closest")

Diabetes Prevalence by Age

4. Logistic Regression: The Preferred Model for Binary Outcomes

Problem with linear regression for binary outcomes:

Predicted probabilities can fall outside [0, 1]
Assumes constant variance (violated for binary data)

Solution: Logistic Regression

Uses the logit link function to ensure predicted probabilities stay between 0 and 1:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Simple Logistic Regression

# Simple logistic regression: diabetes ~ age
model_logistic_simple <- glm(diabetes ~ age_cont,
                              data = brfss_clean,
                              family = binomial(link = "logit"))

# Display results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.029	0.423	-8.390	0	0.012	0.064
age_cont	1.034	0.007	4.978	0	1.021	1.048

Interpretation:

Odds Ratio (OR): 1.034
For each 1-year increase in age, the odds of diabetes increase by 3.4%
The relationship is highly statistically significant (p < 0.001)

Predicted Probabilities

# From ggeffects package
pp <- predict_response(model_logistic_simple, terms = "age_cont")
plot(pp)

Predicted Diabetes Probability by Age

# Generate predicted probabilities
pred_data <- data.frame(age_cont = seq(18, 80, by = 1))
pred_data$predicted_prob <- predict(model_logistic_simple,
                                    newdata = pred_data,
                                    type = "response")

# Plot
p2 <- ggplot(pred_data, aes(x = age_cont, y = predicted_prob)) +
  geom_line(color = "darkred", linewidth = 1.5) +
  geom_ribbon(aes(ymin = predicted_prob - 0.02,
                  ymax = predicted_prob + 0.02),
              alpha = 0.2, fill = "darkred") +
  labs(
    title = "Predicted Probability of Diabetes by Age",
    subtitle = "Simple Logistic Regression",
    x = "Age (years)",
    y = "Predicted Probability of Diabetes"
  ) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.6)) +
  theme_minimal(base_size = 12)

ggplotly(p2)

Predicted Diabetes Probability by Age

5. Multiple Regression: Controlling for Confounding

What is Confounding?

A confounder is a variable that:

Is associated with both the exposure and the outcome
Is not on the causal pathway between exposure and outcome
Distorts the true relationship between exposure and outcome

Example: The relationship between age and diabetes may be confounded by BMI, physical activity, and other factors.

Multiple Logistic Regression

# Multiple logistic regression with potential confounders
model_logistic_multiple <- glm(diabetes ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker + education,
                               data = brfss_clean,
                               family = binomial(link = "logit"))

# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Diabetes ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")

Multiple Logistic Regression: Diabetes ~ Age + Covariates (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.009	1.177	-4.001	0.000	0.000	0.065
age_cont	1.041	0.007	5.515	0.000	1.027	1.057
sexMale	1.191	0.154	1.133	0.257	0.880	1.613
bmi_catNormal	1.971	1.052	0.645	0.519	0.378	36.309
bmi_catOverweight	3.155	1.044	1.101	0.271	0.621	57.679
bmi_catObese	6.834	1.041	1.845	0.065	1.354	124.675
phys_active	0.589	0.157	-3.373	0.001	0.433	0.802
current_smoker	1.213	0.178	1.085	0.278	0.852	1.716
educationHigh school graduate	0.634	0.288	-1.579	0.114	0.364	1.131
educationSome college	0.542	0.294	-2.081	0.037	0.307	0.977
educationCollege graduate	0.584	0.305	-1.763	0.078	0.324	1.074

Interpretation:

Age (adjusted OR): 1.041
- After adjusting for sex, BMI, physical activity, smoking, and education, each 1-year increase in age is associated with a 4.1% increase in the odds of diabetes
Sex (Male vs Female): OR = 1.191
- Males have 19.1% higher odds of diabetes compared to females, adjusting for other variables

BMI (Obese vs Normal): OR = 6.834
- Obese individuals had 6.83 times higher odds of diabetes compared to normal-weight individuals.

6. Dummy Variables: Coding Categorical Predictors

Categorical variables with \(k\) levels are represented using \(k-1\) dummy variables (indicator variables).

Example: Education Level

Education has 4 levels: 1. < High school (reference category) 2. High school graduate 3. Some college 4. College graduate

R automatically creates 3 dummy variables:

# Extract dummy variable coding
dummy_table <- data.frame(
  Education = c("< High school", "High school graduate", "Some college", "College graduate"),
  `Dummy 1 (HS grad)` = c(0, 1, 0, 0),
  `Dummy 2 (Some college)` = c(0, 0, 1, 0),
  `Dummy 3 (College grad)` = c(0, 0, 0, 1),
  check.names = FALSE
)

dummy_table %>%
  kable(caption = "Dummy Variable Coding for Education",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category

Dummy Variable Coding for Education
Education	Dummy 1 (HS grad)	Dummy 2 (Some college)	Dummy 3 (College grad)
< High school	0	0	0
High school graduate	1	0	0
Some college	0	1	0
College graduate	0	0	1

Reference Category: The category with all zeros (< High school) is the reference group. All other categories are compared to this reference.

Visualizing Education Effects

# Extract education coefficients
educ_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "education")) %>%
  mutate(
    education_level = str_remove(term, "education"),
    education_level = factor(education_level,
                             levels = c("High school graduate",
                                       "Some college",
                                       "College graduate"))
  )

# Add reference category
ref_row <- data.frame(
  term = "education< High school",
  estimate = 1.0,
  std.error = 0,
  statistic = NA,
  p.value = NA,
  conf.low = 1.0,
  conf.high = 1.0,
  education_level = factor("< High school (Ref)",
                          levels = c("< High school (Ref)",
                                    "High school graduate",
                                    "Some college",
                                    "College graduate"))
)

educ_coefs_full <- bind_rows(ref_row, educ_coefs) %>%
  mutate(education_level = factor(education_level,
                                 levels = c("< High school (Ref)",
                                           "High school graduate",
                                           "Some college",
                                           "College graduate")))

# Plot
p3 <- ggplot(educ_coefs_full, aes(x = education_level, y = estimate)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
                  size = 0.8, color = "darkblue") +
  coord_flip() +
  labs(
    title = "Association Between Education and Diabetes",
    subtitle = "Adjusted Odds Ratios (reference: < High school)",
    x = "Education Level",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p3)

Odds Ratios for Education Levels

# Plot model coefficients with `ggcoef_model()`
ggcoef_model(model_logistic_multiple, exponentiate = TRUE,
  include = c("education"),
  variable_labels = c(
    education = "Education"),
  facet_labeller = ggplot2::label_wrap_gen(10)
)

7. Interactions (Effect Modification)

An interaction exists when the effect of one variable on the outcome differs across levels of another variable.

Epidemiologic term: Effect modification

Example: Age × Sex Interaction

Does the effect of age on diabetes differ between males and females?

# Model with interaction term
model_interaction <- glm(diabetes ~ age_cont * sex + bmi_cat + phys_active,
                         data = brfss_clean,
                         family = binomial(link = "logit"))

# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "age_cont")) %>%
  kable(caption = "Age × Sex Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Age × Sex Interaction Model (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
age_cont	1.031	0.009	3.178	0.001	1.012	1.051
age_cont:sexMale	1.015	0.014	1.084	0.278	0.988	1.044

Interpretation:

Main effect of age: OR among females (reference)
Interaction term (age:sexMale): Additional effect of age among males
If the interaction term is significant, the age-diabetes relationship differs by sex

Visualizing Interaction

# Generate predicted probabilities by sex
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "sex"))

# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Diabetes by Age and Sex",
    subtitle = "Testing for Age × Sex Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Diabetes",
    color = "Sex",
    fill = "Sex"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
  scale_fill_manual(values = c("Female" = "#E64B35", "Male" = "#4DBBD5")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)

Age-Diabetes Relationship by Sex

8. Model Diagnostics

Every regression model makes assumptions about the data. If assumptions are violated, results may be invalid.

Key Assumptions for Logistic Regression

Linearity of log odds: Continuous predictors have a linear relationship with the log odds of the outcome
Independence of observations: Each observation is independent
No perfect multicollinearity: Predictors are not perfectly correlated
No influential outliers: Individual observations don’t overly influence the model

Checking for Multicollinearity

Variance Inflation Factor (VIF): Measures how much the variance of a coefficient is inflated due to correlation with other predictors.

VIF < 5: Generally acceptable
VIF > 10: Serious multicollinearity problem

# Calculate VIF
vif_values <- vif(model_logistic_multiple)

# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
  # If matrix (categorical variables present), extract GVIF^(1/(2*Df))
  vif_df <- data.frame(
    Variable = rownames(vif_values),
    VIF = vif_values[, "GVIF^(1/(2*Df))"]
  )
} else {
  # If vector (only continuous variables)
  vif_df <- data.frame(
    Variable = names(vif_values),
    VIF = as.numeric(vif_values)
  )
}

# Add interpretation
vif_df <- vif_df %>%
  arrange(desc(VIF)) %>%
  mutate(
    Interpretation = case_when(
      VIF < 5 ~ "Low (No concern)",
      VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
      VIF >= 10 ~ "High (Problem)"
    )
  )

vif_df %>%
  kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
        digits = 2,
        align = "lrc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
  row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
  row_spec(which(vif_df$VIF < 5), background = "#90EE90")

Variance Inflation Factors (VIF) for Multiple Regression Model
	Variable	VIF	Interpretation
age_cont	age_cont	1.05	Low (No concern)
current_smoker	current_smoker	1.05	Low (No concern)
phys_active	phys_active	1.02	Low (No concern)
sex	sex	1.01	Low (No concern)
education	education	1.01	Low (No concern)
bmi_cat	bmi_cat	1.01	Low (No concern)

Influential Observations

Cook’s Distance: Measures how much the model would change if an observation were removed.

Cook’s D > 1: Potentially influential observation

# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)

Cook’s Distance for Influential Observations

# Count influential observations
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")

## Number of potentially influential observations: 0

9. Model Comparison and Selection

Comparing Nested Models

Use Likelihood Ratio Test to compare nested models:

# Model 1: Age only
model1 <- glm(diabetes ~ age_cont,
              data = brfss_clean,
              family = binomial)

# Model 2: Age + Sex
model2 <- glm(diabetes ~ age_cont + sex,
              data = brfss_clean,
              family = binomial)

# Model 3: Full model
model3 <- model_logistic_multiple

# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")

# Create comparison table
model_comp <- data.frame(
  Model = c("Model 1: Age only",
            "Model 2: Age + Sex",
            "Model 3: Full model"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  `Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")

Model Comparison: AIC, BIC, and Deviance
Model	AIC	BIC	Deviance
Model 1: Age only	1175.08	1185.39	1171.08
Model 2: Age + Sex	1175.85	1191.32	1169.85
Model 3: Full model	1122.65	1179.36	1100.65

Interpretation:

Lower AIC/BIC indicates better model fit
Model 3 (full model) has the lowest AIC, suggesting it provides the best fit to the data

10. Error Term in Statistical Models

All statistical models include an error term (\(\epsilon\)) to account for:

Random variation in the outcome
Unmeasured variables not included in the model
Measurement error in variables

\[Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon\]

Key points:

The model cannot perfectly predict every outcome
The difference between observed and predicted values is the error (residual)
We assume errors are normally distributed with mean 0 (for linear regression)

Part 2: Student Lab Activity

Lab Overview

In this lab, you will:

Build your own logistic regression model predicting hypertension (high blood pressure)
Create dummy variables for categorical predictors
Interpret regression coefficients
Test for confounding and interaction
Perform model diagnostics

Lab Instructions

Task 1: Explore the Outcome Variable

# Summary table by hypertension status
hyper_table <- brfss_clean %>%
  group_by(hypertension) %>%
  summarise(
    N = n(),
    `Mean Age` = round(mean(age_cont), 1),
    `% Male` = round(100 * mean(sex == "Male"), 1),
    `% Obese` = round(100 * mean(bmi_cat == "Obese", na.rm = TRUE), 1),
    `% Physically Active` = round(100 * mean(phys_active), 1),
    `% Current Smoker` = round(100 * mean(current_smoker), 1),
    `% Diabetes` = round(100 * mean(diabetes), 1),
    `% High Cholesterol` = round(100 * mean(high_chol), 1)
  ) %>%
  mutate(hypertension = ifelse(hypertension == 1, "Hypertension", "No Hypertension"))

hyper_table %>%
  kable(caption = "Descriptive Statistics by Hypertension Status",
        align = "lrrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Descriptive Statistics by Hypertension Status
hypertension	N	Mean Age	% Male	% Obese	% Physically Active	% Current Smoker	% Diabetes	% High Cholesterol
No Hypertension	606	54.5	46.2	30.7	69.3	32.8	8.7	30.9
Hypertension	675	63.1	53.2	45.6	64.1	25.6	25.9	61.3

# Calculate overall prevalence 
hyp_prevalence <- sum(brfss_clean$hypertension) / nrow(brfss_clean) * 100

hyp_prevalence %>% 
kable(caption = "Total Prevalence (Percent)",
      align = "l")

Total Prevalence (Percent)
x
52.69321

# Calculate hypertension by age group
hyp_age_prevalence <- brfss_clean %>% 
  group_by(age_group) %>% 
  summarise(Prevalence = sum(hypertension) / n() * 100)

hyp_age_prevalence %>% 
kable(caption = "Hypertension Prevalence by Age Group (Percent)")

Hypertension Prevalence by Age Group (Percent)
age_group	Prevalence
18-24	8.333333
25-34	19.480519
35-44	30.434783
45-54	37.888199
55-64	51.503759
65+	66.826156

Questions:

What is the overall prevalence of hypertension in the dataset? The overall prevalence of hypertension in the dataset is appromiately 52.69%.
How does hypertension prevalence vary by age group? It appears that as age group increases, prevalence of hypertension increases as well. For instance, the 18-24 year old group has an 8.3% prevalence of hypertension, while the 35-44 year old group has 30.4% prevalence of hypertension, and 65+ year old group has the highest prevalence of hypertension with 66.8%. —

Task 2: Build a Simple Logistic Regression Model

# Simple logistic regression: hypertension ~ age
model_logistic_hyp <- glm(hypertension ~ age_cont,
                              data = brfss_clean,
                              family = binomial(link = "logit"))

# Display results with odds ratios
tidy(model_logistic_hyp, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.048	0.296	-10.293	0	0.026	0.084
age_cont	1.055	0.005	10.996	0	1.045	1.065

Questions:

What is the odds ratio for age? Interpret this value. The odds ratio for age is 1.055. For each 1 year increase in age, the odds of hypertension increase by 5.5%.
Is the association statistically significant? This association is highly statistically significant (p < 0.001) and the confidence interval does not contain 1 [1.045,1.065].
What is the 95% confidence interval for the odds ratio? The 95% confidence interval for the odds ratio is [1.045, 1.065]. —

Task 3: Create a Multiple Regression Model

# Multiple logistic regression with potential confounders
mult_model_logistic <- glm(hypertension ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker,
                               data = brfss_clean,
                               family = binomial(link = "logit"))

# Display results
tidy(mult_model_logistic, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")

Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.008	0.653	-7.355	0.000	0.002	0.028
age_cont	1.061	0.005	11.234	0.000	1.050	1.073
sexMale	1.270	0.123	1.950	0.051	0.999	1.616
bmi_catNormal	2.097	0.546	1.356	0.175	0.759	6.756
bmi_catOverweight	3.241	0.543	2.166	0.030	1.183	10.385
bmi_catObese	6.585	0.545	3.459	0.001	2.394	21.176
phys_active	0.900	0.130	-0.808	0.419	0.697	1.162
current_smoker	1.071	0.139	0.495	0.621	0.817	1.407

Questions:

How did the odds ratio for age change after adjusting for other variables? After adjusting for sex, BMI, physical activity, and smoking, each 1-year increase in age is associated with a 6.1% increase in the odds of hypertension.
What does this suggest about confounding? This suggests that sex, BMI, physical activity, and smoking may be slight confounders because the odds ratio increased after adjusting and these variables possibly masked the true association.
Which variables are the strongest predictors of hypertension? Obese is a very strong predictor of hypertension because obese individuals had 6.59 times higher odds of hypertension compared to underweight individuals. Overweight is also a strong predictor of hypertension because they had 3.24 times higher odds of diabetes compared to underweight individuals. Males have 12.7% higher odds of hypertension compared to females, adjusting for other variables. —

Task 4: Interpret Dummy Variables

# Extract dummy variable coding
dummy_table2 <- data.frame(
  bmi_cat = c("Underweight", "Normal", "Overweight", "Obese"),
  `Dummy 1 (Normal)` = c(0, 1, 0, 0),
  `Dummy 2 (Overweight)` = c(0, 0, 1, 0),
  `Dummy 3 (Obese)` = c(0, 0, 0, 1),
  check.names = FALSE
)

dummy_table2 %>%
  kable(caption = "Dummy Variable Coding for bmi_cat",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category

Dummy Variable Coding for bmi_cat
bmi_cat	Dummy 1 (Normal)	Dummy 2 (Overweight)	Dummy 3 (Obese)
Underweight	0	0	0
Normal	1	0	0
Overweight	0	1	0
Obese	0	0	1

# Display results
tidy_plus_plus(mult_model_logistic, exponentiate = TRUE) %>%
  filter(str_detect(variable, "bmi_cat")) %>% 
  select(var_label, label, OR = estimate, conf.low, conf.high, p.value) %>% 
  kable(caption = "Odds Ratios for Hypertension by BMI Category",
        digits = 3,
        col.names = c("Variable", "Category", "Odds Ratio", "95% CI Lower", "95% CI Upper", "p-value")
        ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Odds Ratios for Hypertension by BMI Category
Variable	Category	Odds Ratio	95% CI Lower	95% CI Upper	p-value
bmi_cat	Underweight	1.000	NA	NA	NA
bmi_cat	Normal	2.097	0.759	6.756	0.175
bmi_cat	Overweight	3.241	1.183	10.385	0.030
bmi_cat	Obese	6.585	2.394	21.176	0.001

# Extract education coefficients
bmi_coefs <- tidy(mult_model_logistic, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "bmi_cat")) %>%
  mutate(
    bmi_level = str_remove(term, "bmi_cat"),
    bmi_level = factor(bmi_level,
                             levels = c("Normal",
                                       "Overweight",
                                       "Obese"))
  )

# Add reference category
ref_row2 <- data.frame(
  term = "bmi_cat Underweight",
  estimate = 1.0,
  std.error = 0,
  statistic = NA,
  p.value = NA,
  conf.low = 1.0,
  conf.high = 1.0,
  bmi_level = factor("Underweight (Ref)",
                          levels = c("Underweight (Ref)",
                                    "Normal",
                                    "Overweight",
                                    "Obese"))
)

bmi_coefs_full <- bind_rows(ref_row2, bmi_coefs) %>%
  mutate(bmi_level = factor(bmi_level,
                                 levels = c("Underweight (Ref)",
                                           "Normal",
                                           "Overweight",
                                           "Obese")))

# Plot
plot3 <- ggplot(bmi_coefs_full, aes(x = bmi_level, y = estimate)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
                  size = 0.8, color = "darkblue") +
  coord_flip() +
  labs(
    title = "Association Between BMI and Hypertension",
    subtitle = "Adjusted Odds Ratios (reference: Underweight)",
    x = "BMI",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal(base_size = 12)

ggplotly(plot3)

Questions:

What is the reference category for BMI? The reference category for BMI is underweight.
Interpret the odds ratio for “Obese” compared to the reference category. Obese individuals had 6.59 times higher odds of hypertension compared to underweight individuals.
How would you explain this to a non-statistician? I would explain this to a non-statistician by explaining that an increased odds ratio means that one group has higher odds of an outcome occurring to them when compared to another group. For our current example, we would say that a group of individuals who are considered obese have six and a half times the odds of demonstrating hypertension than underweight individuals. —

Task 5: Test for Interaction

# Model with interaction term
model_interaction2 <- glm(hypertension ~ age_cont * bmi_cat,
                         data = brfss_clean,
                         family = binomial(link = "logit"))

# Display interaction results
tidy(model_interaction2, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "bmi_cat")) %>%
  kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Age × BMI Interaction Model (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
bmi_catNormal	0.065	2.632	-1.039	0.299	0.001	38.322
bmi_catOverweight	0.078	2.606	-0.978	0.328	0.001	44.673
bmi_catObese	0.263	2.573	-0.519	0.604	0.003	144.533
age_cont:bmi_catNormal	1.058	0.043	1.306	0.192	0.957	1.147
age_cont:bmi_catOverweight	1.063	0.043	1.423	0.155	0.962	1.151
age_cont:bmi_catObese	1.054	0.042	1.232	0.218	0.954	1.140

# Generate predicted probabilities by BMI
pred_interact2 <- ggpredict(model_interaction2, terms = c("age_cont [18:80]", "bmi_cat"))

# Plot
plot4 <- ggplot(pred_interact2, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Hypertension by Age and BMI",
    subtitle = "Testing for Age × BMI Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Hypertension",
    color = "BMI",
    fill = "BMI"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "#FFA500")) +
  scale_fill_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "#FFA500")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(plot4)

Questions:

Is the interaction term statistically significant? No the interaction term is not statistically significant. None of the p-values are below the alpha value of 0.05 and all of the confidence intervals contain 1.
What does this mean in epidemiologic terms (effect modification)? This means that the age-hypertension relationship does not appear to differ by BMI, and BMI has failed to provide enough relationship to find statistical evidence of effect modification.
Create a visualization showing predicted probabilities by age and BMI category

Task 6: Model Diagnostics

# Calculate VIF
vif_values2<- vif(model_logistic_multiple)

# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values2)) {
  # If matrix (categorical variables present), extract GVIF^(1/(2*Df))
  vif_df2 <- data.frame(
    Variable = rownames(vif_values2),
    VIF = vif_values2[, "GVIF^(1/(2*Df))"]
  )
} else {
  # If vector (only continuous variables)
  vif_df2 <- data.frame(
    Variable = names(vif_values2),
    VIF = as.numeric(vif_values2)
  )
}

# Add interpretation
vif_df2 <- vif_df2 %>%
  arrange(desc(VIF)) %>%
  mutate(
    Interpretation = case_when(
      VIF < 5 ~ "Low (No concern)",
      VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
      VIF >= 10 ~ "High (Problem)"
    )
  )

vif_df2 %>%
  kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
        digits = 2,
        align = "lrc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(vif_df2$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
  row_spec(which(vif_df2$VIF >= 5 & vif_df2$VIF < 10), background = "#FFA500") %>%
  row_spec(which(vif_df2$VIF < 5), background = "#90EE90")

Variance Inflation Factors (VIF) for Multiple Regression Model
	Variable	VIF	Interpretation
age_cont	age_cont	1.05	Low (No concern)
current_smoker	current_smoker	1.05	Low (No concern)
phys_active	phys_active	1.02	Low (No concern)
sex	sex	1.01	Low (No concern)
education	education	1.01	Low (No concern)
bmi_cat	bmi_cat	1.01	Low (No concern)

# Calculate Cook's distance
cooks_d2 <- cooks.distance(mult_model_logistic)

# Create data frame
influence_df2 <- data.frame(
  observation = 1:length(cooks_d2),
  cooks_d2 = cooks_d2
) %>%
  mutate(influential = ifelse(cooks_d2 > 1, "Yes", "No"))

# Plot
plot5 <- ggplot(influence_df2, aes(x = observation, y = cooks_d2, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(plot5)

# Count influential observations
n_influential2 <- sum(influence_df2$influential == "Yes")
cat("Number of potentially influential observations:", n_influential2, "\n")

## Number of potentially influential observations: 0

Questions:

Are there any concerns about multicollinearity? There are no concerns regarding multicollinearity, each variance inflation factor suggests low or no concern.
Are there any influential observations that might affect your results? According to Cook’s Distance, there are not any influential observation that might affect our results.
What would you do if you found serious violations? I would go back through the data and rerun it to ensure that I did not perform any aspects incorrectly. To solve issues related to multicollinearity, we could center the variables, remove variables that are highly correlated, or add linear variables (Frost, 2023). For Cook’s, we can either replace the outliers with missing values, utilize imputation, or or cap the outliers at the 1.5 IQR limits (Prabhakaran, 2016).

References

Frost, J. (2023, January 29). Multicollinearity in Regression Analysis: Problems, detection, and solutions. Statistics by Jim. https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/#google_vignette

Prabhakaran, S. (2016, December 9). Outlier detection and treatment with R | DataScience+. https://datascienceplus.com/outlier-detection-and-treatment-with-r/ —

Task 7: Model Comparison

# Model 1: Age only
m1 <- glm(hypertension ~ age_cont,
              data = brfss_clean,
              family = binomial)

# Model 2: Age + Sex + BMI
m2 <- glm(hypertension ~ age_cont + sex + bmi_cat,
              data = brfss_clean,
              family = binomial)

# Model 3: Full model
m3 <- mult_model_logistic

# Likelihood ratio test
lrt_m1_m2 <- anova(m1, m2, test = "LRT")
lrt_m2_m3 <- anova(m2, m3, test = "LRT")

# Create comparison table
model_comp_table <- data.frame(
  Model = c("Model 1: Age only",
            "Model 2: Age + Sex + BMI",
            "Model 3: Full model"),
  AIC = c(AIC(m1), AIC(m2), AIC(m3)),
  BIC = c(BIC(m1), BIC(m2), BIC(m3)),
  `Deviance` = c(deviance(model1), deviance(m2), deviance(m3)),
  check.names = FALSE
)

model_comp_table %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp_table$AIC), bold = TRUE, background = "#d4edda")

Model Comparison: AIC, BIC, and Deviance
Model	AIC	BIC	Deviance
Model 1: Age only	1636.61	1646.92	1171.08
Model 2: Age + Sex + BMI	1576.49	1607.42	1564.49
Model 3: Full model	1579.50	1620.74	1563.50

Questions:

Which model has the best fit based on AIC? Model 2: Age + Sex + BMI has the best fit based on AIC because it has the lowest AIC (1576.49).
Is the added complexity of the full model justified? The added complexity of the full model could be justified because it only has an AIC that is 3 points higher than Model 2 which is considered the gray area.
Which model would you choose for your final analysis? Why? I would personally choose Model 2 because it demonstrates the complexities of the data while also maintaining the fit of the data. —

Lab Report Guidelines

Write a brief report (1-2 pages) summarizing your findings:

Introduction: State your research question
Methods: Describe your analytic approach
Results: Present key findings with tables and figures
Interpretation: Explain what your results mean
Limitations: Discuss potential issues with your analysis

Countless research has already established that age and hypertension share a relationship. The present research was conducted to discover if there is an association between hypertension and age within the 2023 cycle of BRFSS data. BRFSS, otherwise known as, the Behavioral Risk Factor Surveillance System, is a nationally conducted survey by the Centers for Disease Control and Prevention and local state health departments. This survey employs random-digit dialing in order to collect state-specific public health information including health-risk behaviors, chronic conditions, and preventative service that can aid in influencing future recommendations for the population. Every year, the BRFSS collects data from over 400,000 individuals in the United States.

For the current report, the sample was originally 433,323 observations with 350 variables. To focus the data on the research question, a working subset was created that focused on demographics, health behaviors, and health status, including diabetes status, age category, sex, race, education level, BMI category, physical activity, smoking frequency, general health, high blood pressure, and high cholesterol. Then the data was filters to have complete cases only, and a sample 2000 observations were randomly selected for a manageable analysis. The subset with 2000 observations was then cleaned to include recoded variables such as a binary diabetes outcome and adding meaningful titles to age groups, sex, race, education, income, BMI, and general health. This step also created hypertension and high cholesterol variables. The data was cleaned one more time to include only these cleaned variables and remove any remaining missing values fro a final analytical dataset of 1281 complete observations.

Descriptive statistics were computed for hypertension status. 606 individuals did not have hypertension and 675 did. Of the 675 who demonstrated, their mean age was 63.1, 53.2% were male, 45.6% were obese, 64.1% were physically active, 25.6% were current smokers, 25.9% were diabetic, and 61.3% had high cholesterol. Total prevalence of hypertension was 52.69%. Hypertension prevalence was calculated for each age group. For the 18-24 age group, prevalence was 8.33%, for the 25-34 age group, prevalence was 19.48%, for the 35-44 age group, prevalence was 30.43%, for the 45-54 age group , prevalence was 37.89%, for the 55-64 age group, prevalence was 51.50%, and for the 65+ age group, prevalence was 66.83%. This demonstrates a trend that as age increases, hypertension prevalence also increases.

First a simple logistic regression model was created. It computed an odds ratio that suggests for every 1 year increase in age, the odds of hypertension increase by 5.5%. This association is highly significant because the p-value < 0.001. This odds ratio provides a 95% confidence interval of 1.045 to 1.065. Then a multiple regression model with potential confounders of sex, BMI, physically active, and smoking was created. Some key findings were that Males have 12.7% higher odds of hypertension compared to females after adjusting for other variables. Overweight individuals had a 3.24 times higher odds of hypertension compared to underweight individuals and obese individuals had a 6.59 times higher odds of hypertension when compared to underweight individuals.

A test for interaction was conducted to see if the effect of age on hypertension differs by BMI category. BMI was discovered to not be statistically significant because none of the p-values were below the alpha level of 0.05 and all of the suggested confidence intervals contained 1. This means that the age-hypertension relationship does not appear to differ by BMI because it has failed to provide enough evidence to suggest effect modification.

For the multiple regression model, variance inflation factors were conducted to assess for multicollinearity; however, each factor suggested low or no concern. Furthermore, a plot to demonstrate Cook’s distance was assembled, and there were not any influential observations detected that may affect the results. Finally, a model comparison was created. Model 1 was age only, Model 2 was age, sex, and BMI, and Model 3 was a full model. Model 1 demonstrated the highest AIC (1636.61) while Model 3 showed a lowered AIC (1579.50). Model 2 suggested the lowest AIC (1576.49) and is the choice for the final analysis.

The results demonstrate that age and BMI influence the outcome of hypertension as well as its effects. None of the variables, age, smoker, physical activity, sex, education, or BMI were considered variance inflation factors suggesting no or low multicollinearity. It appears that as the category for BMI increases, as does the odds for hypertension. Furthermore, as age increases, the prevalence of hypertension increases. These findings suggest that a relationship between age, BMI, and hypertension do exist; however, further investigation is required to understand confounders and effect modification to a fuller extent and to establish temporality.

Potential issues with this analysis come from the collection of data. Since BRFSS is a survey, there is no explicit method to determine temporality; for this reason, it can not be said that age causes hypertension. A relationship can be inferred, but it will not be causal. Furthermore, these interviews are conducted and the data is used as a secondary source meaning the researchers have no influence over the data collection process suggesting potential bias and classification errors.

Submission: Submit your completed R Markdown file and knitted HTML report.

Summary

Key Concepts Covered

Statistical modeling describes relationships between variables
Regression types depend on the outcome variable type
Logistic regression is appropriate for binary outcomes
Multiple regression controls for confounding
Dummy variables represent categorical predictors
Interactions test for effect modification
Model diagnostics check assumptions and identify problems
Model comparison helps select the best model

Important Formulas

Logistic Regression:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Odds Ratio:

\[\text{OR} = e^{\beta_i}\]

Predicted Probability:

\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]

References

Agresti, A. (2018). An Introduction to Categorical Data Analysis (3rd ed.). Wiley.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2012). Regression Methods in Biostatistics (2nd ed.). Springer.
Centers for Disease Control and Prevention. (2023). Behavioral Risk Factor Surveillance System.

Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] broom.helpers_1.22.0 janitor_2.2.1        ggstats_0.12.0      
##  [4] gtsummary_2.5.0      ggeffects_2.3.2      car_3.1-3           
##  [7] carData_3.0-5        broom_1.0.11         plotly_4.12.0       
## [10] kableExtra_1.4.0     knitr_1.51           haven_2.5.5         
## [13] lubridate_1.9.4      forcats_1.0.1        stringr_1.6.0       
## [16] dplyr_1.2.0          purrr_1.2.1          readr_2.1.6         
## [19] tidyr_1.3.2          tibble_3.3.1         ggplot2_4.0.1       
## [22] tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] insight_1.4.6      lattice_0.22-7     tzdb_0.5.0         crosstalk_1.2.2   
##  [9] vctrs_0.7.1        tools_4.5.2        generics_0.1.4     datawizard_1.3.0  
## [13] pkgconfig_2.0.3    Matrix_1.7-4       data.table_1.18.0  RColorBrewer_1.1-3
## [17] S7_0.2.1           lifecycle_1.0.5    compiler_4.5.2     farver_2.1.2      
## [21] textshaping_1.0.4  codetools_0.2-20   snakecase_0.11.1   htmltools_0.5.9   
## [25] sass_0.4.10        yaml_2.3.12        lazyeval_0.2.2     Formula_1.2-5     
## [29] pillar_1.11.1      jquerylib_0.1.4    cachem_1.1.0       abind_1.4-8       
## [33] nlme_3.1-168       tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7     
## [37] labeling_0.4.3     splines_4.5.2      labelled_2.16.0    fastmap_1.2.0     
## [41] grid_4.5.2         cli_3.6.5          magrittr_2.0.4     cards_0.7.1       
## [45] withr_3.0.2        scales_1.4.0       backports_1.5.0    timechange_0.3.0  
## [49] rmarkdown_2.30     httr_1.4.7         otel_0.2.0         hms_1.1.4         
## [53] evaluate_1.0.5     viridisLite_0.4.2  mgcv_1.9-3         rlang_1.1.7       
## [57] glue_1.8.0         xml2_1.5.2         svglite_2.2.2      rstudioapi_0.18.0 
## [61] jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1

Statistical Modeling in Epidemiology

EPI 553 - Advanced Epidemiologic Methods

Zoya Hayes

2026-02-24

Introduction

Setup and Data Preparation

Loading BRFSS 2023 Data

Creating a Working Subset

Data Recoding and Cleaning

Descriptive Statistics

Part 1: Statistical Modeling Concepts

1. What is Statistical Modeling?

General Form of a Statistical Model

2. Types of Regression Models

Simple vs. Multiple Regression

3. Linear Regression Example

Simple Linear Regression

Visualization

With continuous age

4. Logistic Regression: The Preferred Model for Binary Outcomes

Simple Logistic Regression

Predicted Probabilities

5. Multiple Regression: Controlling for Confounding

What is Confounding?

Multiple Logistic Regression

6. Dummy Variables: Coding Categorical Predictors

Example: Education Level

Visualizing Education Effects

7. Interactions (Effect Modification)

Example: Age × Sex Interaction

Visualizing Interaction

8. Model Diagnostics

Key Assumptions for Logistic Regression

Checking for Multicollinearity

Influential Observations

9. Model Comparison and Selection

Comparing Nested Models

10. Error Term in Statistical Models

Part 2: Student Lab Activity

Lab Overview

Lab Instructions

Task 1: Explore the Outcome Variable

Task 2: Build a Simple Logistic Regression Model

Task 3: Create a Multiple Regression Model

Task 4: Interpret Dummy Variables

Task 5: Test for Interaction

Task 6: Model Diagnostics

Task 7: Model Comparison

Lab Report Guidelines

Summary

Key Concepts Covered

Important Formulas

References