Lab Overview

In this lab, you will:

Build your own logistic regression model predicting hypertension (high blood pressure)
Create dummy variables for categorical predictors
Interpret regression coefficients
Test for confounding and interaction
Perform model diagnostics

Setup and Data Preparation

# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)

Lab Instructions

Task 1: Explore the Outcome Variable

###load brfss_subset_2023.rds

brfss_data <- read_rds("C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/data/brfss_subset_2023.rds")

```{r- dataview} # Inspect the data glimpse(brfss_data) # Examine the first few rows head(brfss_data, n = 10)

View data structure

str(brfss_data)

Dimensions: rows (observations) and columns (variables)

dim(brfss_data)

What variables do we have?

names(brfss_data)


## Descriptive Statistics


``` r
# YOUR CODE HERE: Create a frequency table of hypertension status
# Create a frequency table of hypertension status
hyprtnsn_freq <- brfss_data %>%
  group_by(hypertension) %>%
  summarise(
   Count = n(),
    Percentage = round(100 * n() / nrow(brfss_data), 1)
  ) %>%
  mutate(
    Status = case_when(
      hypertension == 1 ~ "Has Hypertension",
      hypertension == 0 ~ "No Hypertension"
    )
  ) %>%
  select(Status, Count, Percentage)

hyprtnsn_freq %>%
  kable(caption = "Overall Prevalence of Hypertension in Study Sample",
        align = "lrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Overall Prevalence of Hypertension in Study Sample
Status	Count	Percentage
No Hypertension	606	47.3
Has Hypertension	675	52.7

# Calculate the prevalence of hypertension by age group

hyprtnsn_by_age <- brfss_data %>%
  group_by(age_group) %>%
  summarise(
    Total = n(),
    Cases = sum(hypertension),
    Prevalence_Percent = round(100 * mean(hypertension), 1),
    .groups = 'drop'
  )

hyprtnsn_by_age %>%
  kable(caption = "Hypertension Prevalence by Age Group",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Hypertension Prevalence by Age Group
age_group	Total	Cases	Prevalence_Percent
18-24	12	1	8.3
25-34	77	15	19.5
35-44	138	42	30.4
45-54	161	61	37.9
55-64	266	137	51.5
65+	627	419	66.8

Questions:

What is the overall prevalence of hypertension in the dataset? The overall prevalence of hypertension in the dataset is 675, which is 52.7%.
How does hypertension prevalence vary by age group? Hypertension increases steadily and sharply with age with highest prevalence among the 65+ age group i.e. 419 cases out of 627 in the age group (66.8%). Only 8.3% of the age 18–24 have hypertension and it nearly doubles by early adulthood with prevalence jumping to 19.5% in the 25–34 group. By ages 35–54, about one‑third to nearly 40% are hypertensive. Major increase is seen after age 55 with over 50%.

Task 2: Build a Simple Logistic Regression Model

# Outcome: hypertension
# Predictor: age_cont (continous age in years)

# Simple logistic regression: diabetes ~ age
model_hyprtnsn_simple <- glm(hypertension ~ age_cont,
                             data = brfss_data,                
                             family = binomial(link = "logit"))

# Display results with odds ratios
tidy(model_hyprtnsn_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.048	0.296	-10.293	0	0.026	0.084
age_cont	1.055	0.005	10.996	0	1.045	1.065

Interpretation:

YOUR CODE HERE: Predicted Probabilities

# From ggeffects package
pp <- predict_response(model_hyprtnsn_simple, terms = "age_cont")
plot(pp)

# Generate predicted probabilities
pred_data <- data.frame(age_cont = seq(18, 80, by = 1))
pred_data$predicted_prob <- predict(model_hyprtnsn_simple,
                                    newdata = pred_data,
                                    type = "response")

# Plot
p2 <- ggplot(pred_data, aes(x = age_cont, y = predicted_prob)) +
  geom_line(color = "darkred", linewidth = 1.5) +
  geom_ribbon(aes(ymin = predicted_prob - 0.03,
                  ymax = predicted_prob + 0.03),
              alpha = 0.2, fill = "darkred") +
  labs(
    title = "Predicted Probability of Hypertension by Age",
    subtitle = "Simple Logistic Regression Model",
    x = "Age (years)",
    y = "Predicted Probability of hypertension"
  ) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.8)) +
  theme_minimal(base_size = 12)

ggplotly(p2)

Questions:

What is the odds ratio for age? Interpret this value.

The odds ratio (OR) for age_cont is 1.055. For each additional year of age, the odds of having diabetes increase by 5.5%. Because an OR of 1.055 means the odds are multiplied by 1.055 for every one‑year increase. This shows a positive association: older individuals have higher odds of diabetes.

Is the association statistically significant? Yes. This association is statistically significant.The p‑value for age is < 0.001, which is far below the α = 0.05 threshold. The 95% CI (1.045 to 1.065) does not include 1. Both factor indicate a statistically significant association between age and diabetes.
What is the 95% confidence interval for the odds ratio? The 95% CI for the odds ratio is 1.045 to 1.065.We are 95% confident that each additional year of age increases the odds of diabetes by 4.5% to 6.5%.

Task 3: Create a Multiple Regression Model

# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker

model_hyprtnsn_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker,
                               data = brfss_data,
                               family = binomial(link = "logit"))

YOUR CODE HERE: Display the results

# Display results

tidy(model_hyprtnsn_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")

Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	0.008	0.653	-7.355	0.000	0.002	0.028
age_cont	1.061	0.005	11.234	0.000	1.050	1.073
sexMale	1.270	0.123	1.950	0.051	0.999	1.616
bmi_catNormal	2.097	0.546	1.356	0.175	0.759	6.756
bmi_catOverweight	3.241	0.543	2.166	0.030	1.183	10.385
bmi_catObese	6.585	0.545	3.459	0.001	2.394	21.176
phys_active	0.900	0.130	-0.808	0.419	0.697	1.162
current_smoker	1.071	0.139	0.495	0.621	0.817	1.407

## BMI Category Odds Ratios (vs. Underweight):

## Normal weight: OR = 2.1

## Overweight: OR = 3.24

## Obese: OR = 6.59

Questions:

How did the odds ratio for age change after adjusting for other variables?

Crude model (age only): OR = 1.055
Adjusted model: OR = 1.061 The odds ratio increased slightly—from a 5.5% increase in odds of hypertension per year of age to a 6.1% increase per year after adjustment. This is a small upward shift, but the direction matters.

What does this suggest about confounding?

The OR for age increased after adjusting for sex, BMI category, physical activity, and smoking.Some of the covariates (especially BMI or sex) were masking part of the true association between age and hypertension in the crude model.After controlling for them, the age–hypertension association became slightly stronger. When older adults also tend to have characteristics that partially overlap with hypertension risk (e.g., higher BMI), but those characteristics don’t fully explain the age effect.

Which variables are the strongest predictors of hypertension?

Based on magnitude of OR and statistical significance: Strongest predictors are

BMI category (especially Obese):

Obese: OR = 6.585, p = 0.001
Overweight: OR = 3.241, p = 0.030 These are very large increases in odds and statistically significant.

Age:

OR = 1.061, p < 0.001 Small per‑year effect, but highly significant and cumulative over decades.

Task 4: Interpret Dummy Variables

# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat

# Extract dummy variable coding
bmi_dummy_table <- data.frame(
  bmi_cat = c("Normal", "Overweight", "Obese"),
  `Dummy 1 (Normal)` = c(1, 0, 0),
  `Dummy 2 (Overweight)` = c(0, 1, 0),
  `Dummy 3 (Obese)` = c(0, 0, 1),
  check.names = FALSE
)

bmi_dummy_table %>%
  kable(caption = "Dummy Variable Coding for BMI Category",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category

Dummy Variable Coding for BMI Category
bmi_cat	Dummy 1 (Normal)	Dummy 2 (Overweight)	Dummy 3 (Obese)
Normal	1	0	0
Overweight	0	1	0
Obese	0	0	1

YOUR CODE HERE: Extract and display the odds ratios for BMI categories

Visualizing BMI Category Effects

# Extract BMI coefficients
bmi_coefs <- tidy(model_hyprtnsn_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "bmi_cat")) %>%
  mutate(
    bmi_cat = str_remove(term, "bmi_cat"),
    bmi_cat = factor(bmi_cat,
        levels = c("Underweight", "Normal", "Overweight", "Obese"))
  )

# Add reference category
ref_row <- data.frame(
  term = "BMI_catUnderweight",
  estimate = 1.0,
  std.error = 0,
  statistic = NA,
  p.value = NA,
  conf.low = 1.0,
  conf.high = 1.0,
  
  bmi_cat = factor("Underweight (Ref)",
                          levels = c("Underweight (Ref)",
                                    "Normal",
                                    "Overweight",
                                    "Obese"))
)

##combining and reordering

bmi_coefs_full <- bind_rows(ref_row, bmi_coefs) %>%
  mutate(
    bmi_cat = factor(bmi_cat,
                      levels = c("Underweight (Ref)",
                                    "Normal",
                                    "Overweight",
                                    "Obese")))

#Visualize the new underweight reference category

bmi_coefs_full %>%
  select(bmi_cat, estimate, conf.low, conf.high, p.value) %>%
  kable(caption = "Odds Ratios for BMI Categories (Reference: UnderWeight)",
        digits = 3,
        col.names = c("BMI Category", "Odds Ratio", "95% CI Lower", "95% CI Upper", "p-value"),
        align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Odds Ratios for BMI Categories (Reference: UnderWeight)
BMI Category	Odds Ratio	95% CI Lower	95% CI Upper	p-value
Underweight (Ref)	1.000	1.000	1.000	NA
Normal	2.097	0.759	6.756	0.175
Overweight	3.241	1.183	10.385	0.030
Obese	6.585	2.394	21.176	0.001

# Plot the odds ratios
p_bmi <- ggplot(bmi_coefs_full, aes(x = bmi_cat, y = estimate)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
                  size = 0.8, color = "darkblue") +
  coord_flip() +
  labs(
    title = "Odds Ratios for BMI Categories and Hypertension",
    subtitle = "Reference: Underweight",
    x = "BMI Category",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p_bmi)

Odds Ratios for BMI Category Levels

# Interpretation table
bmi_interpret <- data.frame(
  `BMI Category` = c( "Underweight", "Normal", "Overweight", "Obese"),
  `Odds Ratio` = c(1.0, round(filter(bmi_coefs, term == "bmi_catNormal")$estimate, 3), 
                   round(filter(bmi_coefs, term == "bmi_catOverweight")$estimate, 3),
                   round(filter(bmi_coefs, term == "bmi_catObese")$estimate, 3)),
  `Interpretation` = c(
    "Reference group",
    paste0("About ", round(filter(bmi_coefs, term == "bmi_catNormal")$estimate - 1, 1)*100, "% higher/lower odds than underweight"),
    paste0("About ", round((filter(bmi_coefs, term == "bmi_catOverweight")$estimate - 1)*100, 0), "% higher odds than underweight"),
    paste0("About ", round((filter(bmi_coefs, term == "bmi_catObese")$estimate - 1)*100, 0), "% higher odds than underweight")
  ),
  check.names = FALSE
)

cat("\n=BMI INTERPRETATION =\n")

## 
## =BMI INTERPRETATION =

print(bmi_interpret)

##   BMI Category Odds Ratio                                Interpretation
## 1  Underweight      1.000                               Reference group
## 2       Normal      2.097 About 110% higher/lower odds than underweight
## 3   Overweight      3.241       About 224% higher odds than underweight
## 4        Obese      6.585       About 559% higher odds than underweight

# Plot model coefficients with `ggcoef_model()`
ggcoef_model(model_hyprtnsn_multiple, exponentiate = TRUE,
  include = c("bmi_cat"),
  variable_labels = c(
    BMI = "bmi_cat"),
  facet_labeller = ggplot2::label_wrap_gen(10)
)

Questions:

What is the reference category for BMI?

The reference category is Underweight (Odds Ratio = 1.00).

Interpret the odds ratio for “Obese” compared to the reference category.

The odds ratio for the Obese group is 6.585, meaning their odds are 6.6 times higher than the underweight group.

How would you explain this to a non-statistician? People in the obese category are much more likely to have the hypertension than people who are underweight. Simply, obesity is linked to a higher risk, and the difference is large enough that it’s unlikely to be due to random chance.

Task 5: Test for Interaction

# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category

# Model with interaction term
model_interaction <- glm(hypertension ~ age_cont* bmi_cat,data = brfss_data,
        family = binomial(link = "logit"))

# Model WITHOUT interaction (additive model)
model_no_interaction <- glm(hypertension ~ age_cont + bmi_cat,
                            data = brfss_data,
                            family = binomial(link = "logit"))

# Display interaction results

tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, ":")) %>%
  kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Age × BMI Interaction Model (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
age_cont:bmi_catNormal	1.058	0.043	1.306	0.192	0.957	1.147
age_cont:bmi_catOverweight	1.063	0.043	1.423	0.155	0.962	1.151
age_cont:bmi_catObese	1.054	0.042	1.232	0.218	0.954	1.140

#  show full model

tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Full Age × BMI Interaction Model",
        digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Full Age × BMI Interaction Model
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.259	2.536	-0.533	0.594	0.000	25.001
age_cont	1.004	0.042	0.102	0.918	0.929	1.108
bmi_catNormal	0.065	2.632	-1.039	0.299	0.001	38.322
bmi_catOverweight	0.078	2.606	-0.978	0.328	0.001	44.673
bmi_catObese	0.263	2.573	-0.519	0.604	0.003	144.533
age_cont:bmi_catNormal	1.058	0.043	1.306	0.192	0.957	1.147
age_cont:bmi_catOverweight	1.063	0.043	1.423	0.155	0.962	1.151
age_cont:bmi_catObese	1.054	0.042	1.232	0.218	0.954	1.140

Interpretation:

Main effect of age: The effect of age does NOT differ meaningfully across BMI categories. Age does not modify the effect of BMI, and BMI does not modify the effect of age.
Interaction term:Adding the interaction, the relation is not statistically signifcant with any of the BMI groups.

YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction

# Model 1: Age only (model_no_interaction)
model1 <- model_no_interaction

#Model 2: Age + BMI (Model_interaction)
model2 <- model_no_interaction
# Model 3: Full model
model3 <- model_hyprtnsn_multiple

# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")

# Create comparison table
model_comp <- data.frame(
  Model = c("Model 1: Age only",
            "Model 2: Age + BMI",
            "Model 3: Full model"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  `Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")

Model Comparison: AIC, BIC, and Deviance
Model	AIC	BIC	Deviance
Model 1: Age only	1578.06	1603.84	1568.06
Model 2: Age + BMI	1578.06	1603.84	1568.06
Model 3: Full model	1579.50	1620.74	1563.50

Questions:

Is the interaction term statistically significant? The interaction model does not improve model fit as we can see:

AIC: Model 3 = 1579.50 (higher = not good)
BIC: Model 3 = 1620.74 (higher = not good)
Deviance: Model 3 = 1563.50 Since AIC and BIC both increase, the interaction terms do not improve the model. Combined with your earlier p‑values (all > 0.05), the interaction terms are not statistically significant.

What does this mean in epidemiologic terms (effect modification)?

The effect of age does NOT differ across BMI categories.
BMI does not change the relationship between age and the outcome.
Age does not change the relationship between BMI and the outcome.

There is no evidence of effect modification by BMI on the age–hypertension relationship. The association between age and the hypertension appears consistent across BMI groups. Age increases risk at about the same rate whether someone is normal weight, overweight, or obese.

Create a visualization showing predicted probabilities by age and BMI category

Visualizing Interaction

# Generate predicted probabilities by BMI

pred_interact <- ggpredict(model_interaction, terms = c("age_cont", "bmi_cat"))

# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Hypertension by Age and BMI",
    subtitle = "Testing for Age × BMI Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Hypertension",
    color = "BMI",
    fill = "BMI"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "pink")) +
  scale_fill_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "pink")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)

Age-hypertension Relationship by BMI

Task 6: Model Diagnostics

# YOUR CODE HERE: Calculate VIF for your multiple regression model
# Calculate VIF
vif_values <- vif(model_hyprtnsn_multiple)

# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
  # If matrix (categorical variables present), extract GVIF^(1/(2*Df))
  vif_df <- data.frame(
    Variable = rownames(vif_values),
    VIF = vif_values[, "GVIF^(1/(2*Df))"]
  )
} else {
  # If vector (only continuous variables)
  vif_df <- data.frame(
    Variable = names(vif_values),
    VIF = as.numeric(vif_values)
  )
}

# Add interpretation
vif_df <- vif_df %>%
  arrange(desc(VIF)) %>%
  mutate(
    Interpretation = case_when(
      VIF < 5 ~ "Low (No concern)",
      VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
      VIF >= 10 ~ "High (Problem)"
    )
  )

vif_df %>%
  kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
        digits = 2,
        align = "lrc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
  row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
  row_spec(which(vif_df$VIF < 5), background = "#90EE90")

Variance Inflation Factors (VIF) for Multiple Regression Model
	Variable	VIF	Interpretation
age_cont	age_cont	1.06	Low (No concern)
current_smoker	current_smoker	1.04	Low (No concern)
bmi_cat	bmi_cat	1.02	Low (No concern)
phys_active	phys_active	1.01	Low (No concern)
sex	sex	1.01	Low (No concern)

CHECK 2 Influential Observations

Cook’s Distance: Measures how much the model would change if an observation were removed.

Cook’s D > 1: Potentially influential observation

CODE HERE: Create a Cook’s distance plot to identify influential observations

# Calculate Cook's distance
cooks_d <- cooks.distance(model_hyprtnsn_multiple)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)

Cook’s Distance for Influential Observations

# Count influential observations
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")

## Number of potentially influential observations: 0

Questions:

Are there any concerns about multicollinearity?

There is no major multicollinearity concern for the main-effects model.

Are there any influential observations that might affect your results?

Cook’s Distance plot shows: - Number of potentially influential observations: 0 - All observation points fall below the common threshold of 1.0

There are no influential observations that would meaningfully distort or bias your regression results.

What would you do if you found serious violations? If serious violations were detected (e.g., high multicollinearity, influential outliers, poor model fit), I would:

Investigate multicollinearity (Examine VIF values, Remove or combine correlated predictors, Avoid unnecessary interaction terms)
Address influential observations (Inspect those cases individually, Check for data entry errors, Consider robust regression or sensitivity analyses)
Refit a simpler, more stable model (Prefer models with lower AIC/BIC, Remove non‑significant interaction terms, Ensure interpretability and epidemiologic plausibility)

Task 7: Model Comparison

CODE HERE: Create a comparison table

# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
# Model B: Age + sex + bmi_cat
# Model C: Age + sex + bmi_cat + phys_active + current_smoker

# Model A: Age only
modela <- glm(hypertension ~ age_cont,
              data = brfss_data,
              family = binomial)

# Model B: Model with age + demographics + BMI
modelb <- glm(hypertension ~ age_cont + sex + bmi_cat,
              data = brfss_data,
              family = binomial)

# Model C: Full model with all predictors
modelc <- glm(hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
              data = brfss_data,
              family = binomial)

# Create comparison table
model_comparison <- data.frame(
  Model = c("Model A: Age only",
    "Model B: Age + Sex + BMI",
    "Model C: Full model\n(+ Activity + Smoking)"),
  AIC = c(AIC(modela), AIC(modelb), AIC(modelc)),
  BIC = c(BIC(modela), BIC(modelb), BIC(modelc)),
  `Deviance` = c(deviance(modela), deviance(modelb), deviance(modelc)),
  check.names = FALSE
)

model_comparison %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comparison$AIC), bold = TRUE, background = "#d4edda")

Model Comparison: AIC, BIC, and Deviance
Model	AIC	BIC	Deviance
Model A: Age only	1636.61	1646.92	1632.61
Model B: Age + Sex + BMI	1576.49	1607.42	1564.49
Model C: Full model (+ Activity + Smoking)	1579.5

# Likelihood ratio test
lrt_a_b <- anova(modela, modelb, test = "LRT")

cat("Test 1: Model A vs Model B\n")

## Test 1: Model A vs Model B

print(lrt_a_b)

## Analysis of Deviance Table
## 
## Model 1: hypertension ~ age_cont
## Model 2: hypertension ~ age_cont + sex + bmi_cat
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1279     1632.6                          
## 2      1275     1564.5  4   68.126 5.643e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat("Interpretation: Adding sex and BMI  to age significantly improves fit:", lrt_a_b$`Pr(>Chi)`[2] < 0.05, "\n\n")

## Interpretation: Adding sex and BMI  to age significantly improves fit: TRUE

lrt_b_c <- anova(modelb, modelc, test = "LRT")
cat("\nTest 2: Model B vs Model C\n")

## 
## Test 2: Model B vs Model C

print(lrt_b_c)

## Analysis of Deviance Table
## 
## Model 1: hypertension ~ age_cont + sex + bmi_cat
## Model 2: hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      1275     1564.5                     
## 2      1273     1563.5  2  0.99112   0.6092

cat("Interpretation: Adding physical activity and smoking significantly improves fit:", lrt_b_c$`Pr(>Chi)`[2] < 0.05, "\n\n")

## Interpretation: Adding physical activity and smoking significantly improves fit: FALSE

Questions:

Which model has the best fit based on AIC?

According to the AIC values: - Model A (Age only): 1636.61 - Model B (Age + Sex + BMI): 1576.49 - Model C (Full model): 1579.50

Model B has the lowest AIC (1576.49), meaning it provides the best overall fit while balancing model complexity and goodness of fit.

Is the added complexity of the full model justified?

Looking at the AIC/BIC values, Model C has higher AIC and higher BIC than Model B. This means the extra variables (activity + smoking) do not improve the model enough to justify their inclusion.When we look at P-value from deviance table for Model B vs Model C, LRT p-value = 0.6092 > 0.05. Adding physical activity and smoking does not significantly improve model fit. The added complexity is not justified.

Which model would you choose for your final analysis? Why?

I would choose Model B (Age + Sex + BMI) because it has the lowest AIC, indicating the best balance of fit and parsimony.The LRT shows that adding activity and smoking does not improve the model. Model B includes important epidemiologic predictors (age, sex, BMI).

Lab Report Guidelines

Write a brief report (1-2 pages) summarizing your findings:

Introduction: State your research question
Methods: Describe your analytic approach
Results: Present key findings with tables and figures
Interpretation: Explain what your results mean
Limitations: Discuss potential issues with your analysis

Logistic Regression Analysis of Hypertension

Introduction The purpose of this analysis was to identify predictors of hypertension in an adult population using BRFSS subset data and evaluate whether demographic or behavioral factors modify the relationship between age and hypertension.

Specifically, the research question was:

“Which factors are associated with hypertension, and is the effect of age modified by BMI or other covariates?”

Given that the overall prevalence of hypertension in the dataset was 52.7% (675 cases), understanding these predictors is important for identifying high‑risk groups and informing prevention strategies.

Methods A series of logistic regression models were constructed to examine associations between hypertension (binary outcome) and potential predictors. Age was treated as a continuous variable. Sex, BMI category, physical activity, and smoking status were included as categorical predictors using dummy variables. The analytic steps included:

Model building:
Model A: Age only
Model B: Age + Sex + BMI
Model C: Full model (Age + Sex + BMI + Activity + Smoking)

Methods A series of logistic regression models were constructed to examine associations between hypertension (binary outcome) and potential predictors. Age was treated as a continuous variable. Sex, BMI category, physical activity, and smoking status were included as categorical predictors using dummy variables.

The analytic steps included: - Model building: Model A: Age only Model B: Age + Sex + BMI Model C: Full model (Age + Sex + BMI + Activity + Smoking)

All analyses were conducted using logistic regression in R.

Results 3.1 Descriptive Findings

Hypertension prevalence: 52.7%
Prevalence increased sharply with age, from 8.3% (18–24) to 66.8% (65+).

3.2 Main Effects Model (Model B) Key adjusted odds ratios: For age: 1.061 AoR (Each additional year increases odds by ~6%) For BMI, obese group, AoR 6.59 (obese people have 6.6 times higher odds of hypertension compared to underweight reference group.)

3.3 Interaction Model Age × BMI interaction terms were not statistically significant (all p > 0.05). This indicates no evidence of effect modification: the effect of age on hypertension does not differ across BMI categories.

3.4 Model Comparison

Model B had the lowest AIC. Adding sex and BMI significantly improved fit and adding activity and smoking did not improve fit.

3.5 Diagnostics

Influential observations: Cook’s Distance plot showed 0 influential observations. Multicollinearity: No major concerns in the main-effects model.

Interpretation

Age and BMI are strong, independent predictors of hypertension. The odds of hypertension increase steadily with age, and individuals who are overweight or obese have higher odds compared to those who are underweight.

There is no evidence of effect modification between age and BMI, showing the increase in hypertension risk with age is consistent across BMI groups. Model comparison strongly supports Model B (Age + Sex + BMI) as the most appropriate model. Adding physical activity and smoking does not improve model fit and introduces complexity.

Overall, it was found that hypertension risk increases with age and higher BMI, and these effects appear additive rather than interactive.

Limitations Since this is a Cross-sectional data, causality cannot be inferred. The observations are also Self-reported behaviors so physical activity and smoking may be misclassified data.More confounding factors like Diet, medication use, and socioeconomic factors were not included.

Submission: Submit your completed R Markdown file and knitted HTML report.

Summary

Key Concepts Covered

Statistical modeling describes relationships between variables
Regression types depend on the outcome variable type
Logistic regression is appropriate for binary outcomes
Multiple regression controls for confounding
Dummy variables represent categorical predictors
Interactions test for effect modification
Model diagnostics check assumptions and identify problems
Model comparison helps select the best model

Important Formulas

Logistic Regression:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Odds Ratio:

\[\text{OR} = e^{\beta_i}\]

Predicted Probability:

\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]

Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggstats_0.12.0   gtsummary_2.5.0  ggeffects_2.3.2  car_3.1-3       
##  [5] carData_3.0-5    broom_1.0.11     plotly_4.12.0    kableExtra_1.4.0
##  [9] knitr_1.51       haven_2.5.5      lubridate_1.9.4  forcats_1.0.1   
## [13] stringr_1.6.0    dplyr_1.2.0      purrr_1.2.1      readr_2.1.6     
## [17] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6         xfun_0.56            bslib_0.10.0        
##  [4] htmlwidgets_1.6.4    insight_1.4.5        tzdb_0.5.0          
##  [7] vctrs_0.7.1          tools_4.5.2          crosstalk_1.2.2     
## [10] generics_0.1.4       datawizard_1.3.0     pkgconfig_2.0.3     
## [13] data.table_1.18.0    RColorBrewer_1.1-3   S7_0.2.1            
## [16] lifecycle_1.0.5      compiler_4.5.2       farver_2.1.2        
## [19] textshaping_1.0.4    htmltools_0.5.9      sass_0.4.10         
## [22] yaml_2.3.12          lazyeval_0.2.2       Formula_1.2-5       
## [25] pillar_1.11.1        jquerylib_0.1.4      broom.helpers_1.22.0
## [28] cachem_1.1.0         abind_1.4-8          tidyselect_1.2.1    
## [31] digest_0.6.39        stringi_1.8.7        labeling_0.4.3      
## [34] labelled_2.16.0      fastmap_1.2.0        grid_4.5.2          
## [37] cli_3.6.5            magrittr_2.0.4       cards_0.7.1         
## [40] withr_3.0.2          scales_1.4.0         backports_1.5.0     
## [43] timechange_0.3.0     rmarkdown_2.30       httr_1.4.7          
## [46] otel_0.2.0           hms_1.1.4            evaluate_1.0.5      
## [49] viridisLite_0.4.2    rlang_1.1.7          glue_1.8.0          
## [52] xml2_1.5.2           svglite_2.2.2        rstudioapi_0.18.0   
## [55] jsonlite_2.0.0       R6_2.6.1             systemfonts_1.3.1

Statistical Modeling in Epidemiology

EPI 553 - Advanced Epidemiologic Methods- Lab Activity 5 Solutions

Suruchi Shahi

2026-02-24

Lab Overview

Setup and Data Preparation

Lab Instructions

Task 1: Explore the Outcome Variable

View data structure

Dimensions: rows (observations) and columns (variables)

What variables do we have?

Task 2: Build a Simple Logistic Regression Model

YOUR CODE HERE: Predicted Probabilities

Task 3: Create a Multiple Regression Model

YOUR CODE HERE: Display the results

Task 4: Interpret Dummy Variables

YOUR CODE HERE: Extract and display the odds ratios for BMI categories

Visualizing BMI Category Effects

Task 5: Test for Interaction

YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction

Visualizing Interaction

Task 6: Model Diagnostics

CHECK 2 Influential Observations

CODE HERE: Create a Cook’s distance plot to identify influential observations

Task 7: Model Comparison

CODE HERE: Create a comparison table

Lab Report Guidelines

Summary

Key Concepts Covered

Important Formulas