Lab Overview

In this lab, you will:

  1. Build your own logistic regression model predicting hypertension (high blood pressure)
  2. Create dummy variables for categorical predictors
  3. Interpret regression coefficients
  4. Test for confounding and interaction
  5. Perform model diagnostics

Setup and Data Preparation

# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)

Lab Instructions

Task 1: Explore the Outcome Variable

###load brfss_subset_2023.rds

brfss_data <- read_rds("C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/data/brfss_subset_2023.rds")

```{r- dataview} # Inspect the data glimpse(brfss_data) # Examine the first few rows head(brfss_data, n = 10)

View data structure

str(brfss_data)

Dimensions: rows (observations) and columns (variables)

dim(brfss_data)

What variables do we have?

names(brfss_data)


## Descriptive Statistics


``` r
# YOUR CODE HERE: Create a frequency table of hypertension status
# Create a frequency table of hypertension status
hyprtnsn_freq <- brfss_data %>%
  group_by(hypertension) %>%
  summarise(
   Count = n(),
    Percentage = round(100 * n() / nrow(brfss_data), 1)
  ) %>%
  mutate(
    Status = case_when(
      hypertension == 1 ~ "Has Hypertension",
      hypertension == 0 ~ "No Hypertension"
    )
  ) %>%
  select(Status, Count, Percentage)

hyprtnsn_freq %>%
  kable(caption = "Overall Prevalence of Hypertension in Study Sample",
        align = "lrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Overall Prevalence of Hypertension in Study Sample
Status Count Percentage
No Hypertension 606 47.3
Has Hypertension 675 52.7
# Calculate the prevalence of hypertension by age group

hyprtnsn_by_age <- brfss_data %>%
  group_by(age_group) %>%
  summarise(
    Total = n(),
    Cases = sum(hypertension),
    Prevalence_Percent = round(100 * mean(hypertension), 1),
    .groups = 'drop'
  )

hyprtnsn_by_age %>%
  kable(caption = "Hypertension Prevalence by Age Group",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Hypertension Prevalence by Age Group
age_group Total Cases Prevalence_Percent
18-24 12 1 8.3
25-34 77 15 19.5
35-44 138 42 30.4
45-54 161 61 37.9
55-64 266 137 51.5
65+ 627 419 66.8

Questions:

  1. What is the overall prevalence of hypertension in the dataset? The overall prevalence of hypertension in the dataset is 675, which is 52.7%.

  2. How does hypertension prevalence vary by age group? Hypertension increases steadily and sharply with age with highest prevalence among the 65+ age group i.e. 419 cases out of 627 in the age group (66.8%). Only 8.3% of the age 18–24 have hypertension and it nearly doubles by early adulthood with prevalence jumping to 19.5% in the 25–34 group. By ages 35–54, about one‑third to nearly 40% are hypertensive. Major increase is seen after age 55 with over 50%.


Task 2: Build a Simple Logistic Regression Model

# Outcome: hypertension
# Predictor: age_cont (continous age in years)

# Simple logistic regression: diabetes ~ age
model_hyprtnsn_simple <- glm(hypertension ~ age_cont,
                             data = brfss_data,                
                             family = binomial(link = "logit"))

# Display results with odds ratios
tidy(model_hyprtnsn_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.048 0.296 -10.293 0 0.026 0.084
age_cont 1.055 0.005 10.996 0 1.045 1.065

Interpretation:

YOUR CODE HERE: Predicted Probabilities

# From ggeffects package
pp <- predict_response(model_hyprtnsn_simple, terms = "age_cont")
plot(pp)

# Generate predicted probabilities
pred_data <- data.frame(age_cont = seq(18, 80, by = 1))
pred_data$predicted_prob <- predict(model_hyprtnsn_simple,
                                    newdata = pred_data,
                                    type = "response")

# Plot
p2 <- ggplot(pred_data, aes(x = age_cont, y = predicted_prob)) +
  geom_line(color = "darkred", linewidth = 1.5) +
  geom_ribbon(aes(ymin = predicted_prob - 0.03,
                  ymax = predicted_prob + 0.03),
              alpha = 0.2, fill = "darkred") +
  labs(
    title = "Predicted Probability of Hypertension by Age",
    subtitle = "Simple Logistic Regression Model",
    x = "Age (years)",
    y = "Predicted Probability of hypertension"
  ) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.8)) +
  theme_minimal(base_size = 12)

ggplotly(p2)

Questions:

  1. What is the odds ratio for age? Interpret this value.

The odds ratio (OR) for age_cont is 1.055. For each additional year of age, the odds of having diabetes increase by 5.5%. Because an OR of 1.055 means the odds are multiplied by 1.055 for every one‑year increase. This shows a positive association: older individuals have higher odds of diabetes.

  1. Is the association statistically significant? Yes. This association is statistically significant.The p‑value for age is < 0.001, which is far below the α = 0.05 threshold. The 95% CI (1.045 to 1.065) does not include 1. Both factor indicate a statistically significant association between age and diabetes.

  2. What is the 95% confidence interval for the odds ratio? The 95% CI for the odds ratio is 1.045 to 1.065.We are 95% confident that each additional year of age increases the odds of diabetes by 4.5% to 6.5%.


Task 3: Create a Multiple Regression Model

# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker

model_hyprtnsn_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker,
                               data = brfss_data,
                               family = binomial(link = "logit"))

YOUR CODE HERE: Display the results

# Display results

tidy(model_hyprtnsn_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")
Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.008 0.653 -7.355 0.000 0.002 0.028
age_cont 1.061 0.005 11.234 0.000 1.050 1.073
sexMale 1.270 0.123 1.950 0.051 0.999 1.616
bmi_catNormal 2.097 0.546 1.356 0.175 0.759 6.756
bmi_catOverweight 3.241 0.543 2.166 0.030 1.183 10.385
bmi_catObese 6.585 0.545 3.459 0.001 2.394 21.176
phys_active 0.900 0.130 -0.808 0.419 0.697 1.162
current_smoker 1.071 0.139 0.495 0.621 0.817 1.407
## BMI Category Odds Ratios (vs. Underweight):
## Normal weight: OR = 2.1
## Overweight: OR = 3.24
## Obese: OR = 6.59

Questions:

  1. How did the odds ratio for age change after adjusting for other variables?
  • Crude model (age only): OR = 1.055
  • Adjusted model: OR = 1.061 The odds ratio increased slightly—from a 5.5% increase in odds of hypertension per year of age to a 6.1% increase per year after adjustment. This is a small upward shift, but the direction matters.
  1. What does this suggest about confounding?

The OR for age increased after adjusting for sex, BMI category, physical activity, and smoking.Some of the covariates (especially BMI or sex) were masking part of the true association between age and hypertension in the crude model.After controlling for them, the age–hypertension association became slightly stronger. When older adults also tend to have characteristics that partially overlap with hypertension risk (e.g., higher BMI), but those characteristics don’t fully explain the age effect.

  1. Which variables are the strongest predictors of hypertension?

Based on magnitude of OR and statistical significance: Strongest predictors are

  1. BMI category (especially Obese):
  • Obese: OR = 6.585, p = 0.001
  • Overweight: OR = 3.241, p = 0.030 These are very large increases in odds and statistically significant.
  1. Age:
  • OR = 1.061, p < 0.001 Small per‑year effect, but highly significant and cumulative over decades.

Task 4: Interpret Dummy Variables

# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat

# Extract dummy variable coding
bmi_dummy_table <- data.frame(
  bmi_cat = c("Normal", "Overweight", "Obese"),
  `Dummy 1 (Normal)` = c(1, 0, 0),
  `Dummy 2 (Overweight)` = c(0, 1, 0),
  `Dummy 3 (Obese)` = c(0, 0, 1),
  check.names = FALSE
)

bmi_dummy_table %>%
  kable(caption = "Dummy Variable Coding for BMI Category",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category
Dummy Variable Coding for BMI Category
bmi_cat Dummy 1 (Normal) Dummy 2 (Overweight) Dummy 3 (Obese)
Normal 1 0 0
Overweight 0 1 0
Obese 0 0 1

YOUR CODE HERE: Extract and display the odds ratios for BMI categories

Visualizing BMI Category Effects

# Extract BMI coefficients
bmi_coefs <- tidy(model_hyprtnsn_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "bmi_cat")) %>%
  mutate(
    bmi_cat = str_remove(term, "bmi_cat"),
    bmi_cat = factor(bmi_cat,
        levels = c("Underweight", "Normal", "Overweight", "Obese"))
  )

# Add reference category
ref_row <- data.frame(
  term = "BMI_catUnderweight",
  estimate = 1.0,
  std.error = 0,
  statistic = NA,
  p.value = NA,
  conf.low = 1.0,
  conf.high = 1.0,
  
  bmi_cat = factor("Underweight (Ref)",
                          levels = c("Underweight (Ref)",
                                    "Normal",
                                    "Overweight",
                                    "Obese"))
)

##combining and reordering

bmi_coefs_full <- bind_rows(ref_row, bmi_coefs) %>%
  mutate(
    bmi_cat = factor(bmi_cat,
                      levels = c("Underweight (Ref)",
                                    "Normal",
                                    "Overweight",
                                    "Obese")))

#Visualize the new underweight reference category

bmi_coefs_full %>%
  select(bmi_cat, estimate, conf.low, conf.high, p.value) %>%
  kable(caption = "Odds Ratios for BMI Categories (Reference: UnderWeight)",
        digits = 3,
        col.names = c("BMI Category", "Odds Ratio", "95% CI Lower", "95% CI Upper", "p-value"),
        align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Odds Ratios for BMI Categories (Reference: UnderWeight)
BMI Category Odds Ratio 95% CI Lower 95% CI Upper p-value
Underweight (Ref) 1.000 1.000 1.000 NA
Normal 2.097 0.759 6.756 0.175
Overweight 3.241 1.183 10.385 0.030
Obese 6.585 2.394 21.176 0.001
# Plot the odds ratios
p_bmi <- ggplot(bmi_coefs_full, aes(x = bmi_cat, y = estimate)) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
                  size = 0.8, color = "darkblue") +
  coord_flip() +
  labs(
    title = "Odds Ratios for BMI Categories and Hypertension",
    subtitle = "Reference: Underweight",
    x = "BMI Category",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal(base_size = 12)

ggplotly(p_bmi)

Odds Ratios for BMI Category Levels

# Interpretation table
bmi_interpret <- data.frame(
  `BMI Category` = c( "Underweight", "Normal", "Overweight", "Obese"),
  `Odds Ratio` = c(1.0, round(filter(bmi_coefs, term == "bmi_catNormal")$estimate, 3), 
                   round(filter(bmi_coefs, term == "bmi_catOverweight")$estimate, 3),
                   round(filter(bmi_coefs, term == "bmi_catObese")$estimate, 3)),
  `Interpretation` = c(
    "Reference group",
    paste0("About ", round(filter(bmi_coefs, term == "bmi_catNormal")$estimate - 1, 1)*100, "% higher/lower odds than underweight"),
    paste0("About ", round((filter(bmi_coefs, term == "bmi_catOverweight")$estimate - 1)*100, 0), "% higher odds than underweight"),
    paste0("About ", round((filter(bmi_coefs, term == "bmi_catObese")$estimate - 1)*100, 0), "% higher odds than underweight")
  ),
  check.names = FALSE
)

cat("\n=BMI INTERPRETATION =\n")
## 
## =BMI INTERPRETATION =
print(bmi_interpret)
##   BMI Category Odds Ratio                                Interpretation
## 1  Underweight      1.000                               Reference group
## 2       Normal      2.097 About 110% higher/lower odds than underweight
## 3   Overweight      3.241       About 224% higher odds than underweight
## 4        Obese      6.585       About 559% higher odds than underweight
# Plot model coefficients with `ggcoef_model()`
ggcoef_model(model_hyprtnsn_multiple, exponentiate = TRUE,
  include = c("bmi_cat"),
  variable_labels = c(
    BMI = "bmi_cat"),
  facet_labeller = ggplot2::label_wrap_gen(10)
)

Questions:

  1. What is the reference category for BMI?

The reference category is Underweight (Odds Ratio = 1.00).

  1. Interpret the odds ratio for “Obese” compared to the reference category.

The odds ratio for the Obese group is 6.585, meaning their odds are 6.6 times higher than the underweight group.

  1. How would you explain this to a non-statistician? People in the obese category are much more likely to have the hypertension than people who are underweight. Simply, obesity is linked to a higher risk, and the difference is large enough that it’s unlikely to be due to random chance.

Task 5: Test for Interaction

# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category

# Model with interaction term
model_interaction <- glm(hypertension ~ age_cont* bmi_cat,data = brfss_data,
        family = binomial(link = "logit"))

# Model WITHOUT interaction (additive model)
model_no_interaction <- glm(hypertension ~ age_cont + bmi_cat,
                            data = brfss_data,
                            family = binomial(link = "logit"))

# Display interaction results

tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, ":")) %>%
  kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Age × BMI Interaction Model (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
age_cont:bmi_catNormal 1.058 0.043 1.306 0.192 0.957 1.147
age_cont:bmi_catOverweight 1.063 0.043 1.423 0.155 0.962 1.151
age_cont:bmi_catObese 1.054 0.042 1.232 0.218 0.954 1.140
#  show full model

tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Full Age × BMI Interaction Model",
        digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Full Age × BMI Interaction Model
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.259 2.536 -0.533 0.594 0.000 25.001
age_cont 1.004 0.042 0.102 0.918 0.929 1.108
bmi_catNormal 0.065 2.632 -1.039 0.299 0.001 38.322
bmi_catOverweight 0.078 2.606 -0.978 0.328 0.001 44.673
bmi_catObese 0.263 2.573 -0.519 0.604 0.003 144.533
age_cont:bmi_catNormal 1.058 0.043 1.306 0.192 0.957 1.147
age_cont:bmi_catOverweight 1.063 0.043 1.423 0.155 0.962 1.151
age_cont:bmi_catObese 1.054 0.042 1.232 0.218 0.954 1.140

Interpretation:

  • Main effect of age: The effect of age does NOT differ meaningfully across BMI categories. Age does not modify the effect of BMI, and BMI does not modify the effect of age.
  • Interaction term:Adding the interaction, the relation is not statistically signifcant with any of the BMI groups.

YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction

# Model 1: Age only (model_no_interaction)
model1 <- model_no_interaction

#Model 2: Age + BMI (Model_interaction)
model2 <- model_no_interaction
# Model 3: Full model
model3 <- model_hyprtnsn_multiple

# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")

# Create comparison table
model_comp <- data.frame(
  Model = c("Model 1: Age only",
            "Model 2: Age + BMI",
            "Model 3: Full model"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  `Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")
Model Comparison: AIC, BIC, and Deviance
Model AIC BIC Deviance
Model 1: Age only 1578.06 1603.84 1568.06
Model 2: Age + BMI 1578.06 1603.84 1568.06
Model 3: Full model 1579.50 1620.74 1563.50

Questions:

  1. Is the interaction term statistically significant? The interaction model does not improve model fit as we can see:
  • AIC: Model 3 = 1579.50 (higher = not good)
  • BIC: Model 3 = 1620.74 (higher = not good)
  • Deviance: Model 3 = 1563.50 Since AIC and BIC both increase, the interaction terms do not improve the model. Combined with your earlier p‑values (all > 0.05), the interaction terms are not statistically significant.
  1. What does this mean in epidemiologic terms (effect modification)?
  • The effect of age does NOT differ across BMI categories.
  • BMI does not change the relationship between age and the outcome.
  • Age does not change the relationship between BMI and the outcome.

There is no evidence of effect modification by BMI on the age–hypertension relationship. The association between age and the hypertension appears consistent across BMI groups. Age increases risk at about the same rate whether someone is normal weight, overweight, or obese.

  1. Create a visualization showing predicted probabilities by age and BMI category

Visualizing Interaction

# Generate predicted probabilities by BMI

pred_interact <- ggpredict(model_interaction, terms = c("age_cont", "bmi_cat"))

# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Hypertension by Age and BMI",
    subtitle = "Testing for Age × BMI Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Hypertension",
    color = "BMI",
    fill = "BMI"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "pink")) +
  scale_fill_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "pink")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)

Age-hypertension Relationship by BMI


Task 6: Model Diagnostics

# YOUR CODE HERE: Calculate VIF for your multiple regression model
# Calculate VIF
vif_values <- vif(model_hyprtnsn_multiple)

# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
  # If matrix (categorical variables present), extract GVIF^(1/(2*Df))
  vif_df <- data.frame(
    Variable = rownames(vif_values),
    VIF = vif_values[, "GVIF^(1/(2*Df))"]
  )
} else {
  # If vector (only continuous variables)
  vif_df <- data.frame(
    Variable = names(vif_values),
    VIF = as.numeric(vif_values)
  )
}

# Add interpretation
vif_df <- vif_df %>%
  arrange(desc(VIF)) %>%
  mutate(
    Interpretation = case_when(
      VIF < 5 ~ "Low (No concern)",
      VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
      VIF >= 10 ~ "High (Problem)"
    )
  )

vif_df %>%
  kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
        digits = 2,
        align = "lrc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
  row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
  row_spec(which(vif_df$VIF < 5), background = "#90EE90")
Variance Inflation Factors (VIF) for Multiple Regression Model
Variable VIF Interpretation
age_cont age_cont 1.06 Low (No concern)
current_smoker current_smoker 1.04 Low (No concern)
bmi_cat bmi_cat 1.02 Low (No concern)
phys_active phys_active 1.01 Low (No concern)
sex sex 1.01 Low (No concern)

CHECK 2 Influential Observations

Cook’s Distance: Measures how much the model would change if an observation were removed.

  • Cook’s D > 1: Potentially influential observation

CODE HERE: Create a Cook’s distance plot to identify influential observations

# Calculate Cook's distance
cooks_d <- cooks.distance(model_hyprtnsn_multiple)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)

Cook’s Distance for Influential Observations

# Count influential observations
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")
## Number of potentially influential observations: 0

Questions:

  1. Are there any concerns about multicollinearity?

There is no major multicollinearity concern for the main-effects model.

  1. Are there any influential observations that might affect your results?

Cook’s Distance plot shows: - Number of potentially influential observations: 0 - All observation points fall below the common threshold of 1.0

There are no influential observations that would meaningfully distort or bias your regression results.

  1. What would you do if you found serious violations? If serious violations were detected (e.g., high multicollinearity, influential outliers, poor model fit), I would:
  • Investigate multicollinearity (Examine VIF values, Remove or combine correlated predictors, Avoid unnecessary interaction terms)

  • Address influential observations (Inspect those cases individually, Check for data entry errors, Consider robust regression or sensitivity analyses)

  • Refit a simpler, more stable model (Prefer models with lower AIC/BIC, Remove non‑significant interaction terms, Ensure interpretability and epidemiologic plausibility)


Task 7: Model Comparison

CODE HERE: Create a comparison table

# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
# Model B: Age + sex + bmi_cat
# Model C: Age + sex + bmi_cat + phys_active + current_smoker

# Model A: Age only
modela <- glm(hypertension ~ age_cont,
              data = brfss_data,
              family = binomial)

# Model B: Model with age + demographics + BMI
modelb <- glm(hypertension ~ age_cont + sex + bmi_cat,
              data = brfss_data,
              family = binomial)

# Model C: Full model with all predictors
modelc <- glm(hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
              data = brfss_data,
              family = binomial)

# Create comparison table
model_comparison <- data.frame(
  Model = c("Model A: Age only",
    "Model B: Age + Sex + BMI",
    "Model C: Full model\n(+ Activity + Smoking)"),
  AIC = c(AIC(modela), AIC(modelb), AIC(modelc)),
  BIC = c(BIC(modela), BIC(modelb), BIC(modelc)),
  `Deviance` = c(deviance(modela), deviance(modelb), deviance(modelc)),
  check.names = FALSE
)

model_comparison %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comparison$AIC), bold = TRUE, background = "#d4edda")
Model Comparison: AIC, BIC, and Deviance
Model AIC BIC Deviance
Model A: Age only 1636.61 1646.92 1632.61
Model B: Age + Sex + BMI 1576.49 1607.42 1564.49
Model C: Full model (+ Activity + Smoking)
1579.5
# Likelihood ratio test
lrt_a_b <- anova(modela, modelb, test = "LRT")

cat("Test 1: Model A vs Model B\n")
## Test 1: Model A vs Model B
print(lrt_a_b)
## Analysis of Deviance Table
## 
## Model 1: hypertension ~ age_cont
## Model 2: hypertension ~ age_cont + sex + bmi_cat
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1279     1632.6                          
## 2      1275     1564.5  4   68.126 5.643e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("Interpretation: Adding sex and BMI  to age significantly improves fit:", lrt_a_b$`Pr(>Chi)`[2] < 0.05, "\n\n")
## Interpretation: Adding sex and BMI  to age significantly improves fit: TRUE
lrt_b_c <- anova(modelb, modelc, test = "LRT")
cat("\nTest 2: Model B vs Model C\n")
## 
## Test 2: Model B vs Model C
print(lrt_b_c)
## Analysis of Deviance Table
## 
## Model 1: hypertension ~ age_cont + sex + bmi_cat
## Model 2: hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      1275     1564.5                     
## 2      1273     1563.5  2  0.99112   0.6092
cat("Interpretation: Adding physical activity and smoking significantly improves fit:", lrt_b_c$`Pr(>Chi)`[2] < 0.05, "\n\n")
## Interpretation: Adding physical activity and smoking significantly improves fit: FALSE

Questions:

  1. Which model has the best fit based on AIC?

According to the AIC values: - Model A (Age only): 1636.61 - Model B (Age + Sex + BMI): 1576.49 - Model C (Full model): 1579.50

Model B has the lowest AIC (1576.49), meaning it provides the best overall fit while balancing model complexity and goodness of fit.

  1. Is the added complexity of the full model justified?

Looking at the AIC/BIC values, Model C has higher AIC and higher BIC than Model B. This means the extra variables (activity + smoking) do not improve the model enough to justify their inclusion.When we look at P-value from deviance table for Model B vs Model C, LRT p-value = 0.6092 > 0.05. Adding physical activity and smoking does not significantly improve model fit. The added complexity is not justified.

  1. Which model would you choose for your final analysis? Why?

I would choose Model B (Age + Sex + BMI) because it has the lowest AIC, indicating the best balance of fit and parsimony.The LRT shows that adding activity and smoking does not improve the model. Model B includes important epidemiologic predictors (age, sex, BMI).


Lab Report Guidelines

Write a brief report (1-2 pages) summarizing your findings:

  1. Introduction: State your research question
  2. Methods: Describe your analytic approach
  3. Results: Present key findings with tables and figures
  4. Interpretation: Explain what your results mean
  5. Limitations: Discuss potential issues with your analysis

Logistic Regression Analysis of Hypertension

  1. Introduction The purpose of this analysis was to identify predictors of hypertension in an adult population using BRFSS subset data and evaluate whether demographic or behavioral factors modify the relationship between age and hypertension.

Specifically, the research question was:

“Which factors are associated with hypertension, and is the effect of age modified by BMI or other covariates?”

Given that the overall prevalence of hypertension in the dataset was 52.7% (675 cases), understanding these predictors is important for identifying high‑risk groups and informing prevention strategies.

  1. Methods A series of logistic regression models were constructed to examine associations between hypertension (binary outcome) and potential predictors. Age was treated as a continuous variable. Sex, BMI category, physical activity, and smoking status were included as categorical predictors using dummy variables. The analytic steps included:
  • Model building:
  • Model A: Age only
  • Model B: Age + Sex + BMI
  • Model C: Full model (Age + Sex + BMI + Activity + Smoking)
  1. Methods A series of logistic regression models were constructed to examine associations between hypertension (binary outcome) and potential predictors. Age was treated as a continuous variable. Sex, BMI category, physical activity, and smoking status were included as categorical predictors using dummy variables.

The analytic steps included: - Model building: Model A: Age only Model B: Age + Sex + BMI Model C: Full model (Age + Sex + BMI + Activity + Smoking)

All analyses were conducted using logistic regression in R.

  1. Results 3.1 Descriptive Findings
  • Hypertension prevalence: 52.7%
  • Prevalence increased sharply with age, from 8.3% (18–24) to 66.8% (65+).

3.2 Main Effects Model (Model B) Key adjusted odds ratios: For age: 1.061 AoR (Each additional year increases odds by ~6%) For BMI, obese group, AoR 6.59 (obese people have 6.6 times higher odds of hypertension compared to underweight reference group.)

3.3 Interaction Model Age × BMI interaction terms were not statistically significant (all p > 0.05). This indicates no evidence of effect modification: the effect of age on hypertension does not differ across BMI categories.

3.4 Model Comparison

Model B had the lowest AIC. Adding sex and BMI significantly improved fit and adding activity and smoking did not improve fit.

3.5 Diagnostics

Influential observations: Cook’s Distance plot showed 0 influential observations. Multicollinearity: No major concerns in the main-effects model.

  1. Interpretation

Age and BMI are strong, independent predictors of hypertension. The odds of hypertension increase steadily with age, and individuals who are overweight or obese have higher odds compared to those who are underweight.

There is no evidence of effect modification between age and BMI, showing the increase in hypertension risk with age is consistent across BMI groups. Model comparison strongly supports Model B (Age + Sex + BMI) as the most appropriate model. Adding physical activity and smoking does not improve model fit and introduces complexity.

Overall, it was found that hypertension risk increases with age and higher BMI, and these effects appear additive rather than interactive.

  1. Limitations Since this is a Cross-sectional data, causality cannot be inferred. The observations are also Self-reported behaviors so physical activity and smoking may be misclassified data.More confounding factors like Diet, medication use, and socioeconomic factors were not included.

Submission: Submit your completed R Markdown file and knitted HTML report.


Summary

Key Concepts Covered

  1. Statistical modeling describes relationships between variables
  2. Regression types depend on the outcome variable type
  3. Logistic regression is appropriate for binary outcomes
  4. Multiple regression controls for confounding
  5. Dummy variables represent categorical predictors
  6. Interactions test for effect modification
  7. Model diagnostics check assumptions and identify problems
  8. Model comparison helps select the best model

Important Formulas

Logistic Regression:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Odds Ratio:

\[\text{OR} = e^{\beta_i}\]

Predicted Probability:

\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]


Session Info

sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggstats_0.12.0   gtsummary_2.5.0  ggeffects_2.3.2  car_3.1-3       
##  [5] carData_3.0-5    broom_1.0.11     plotly_4.12.0    kableExtra_1.4.0
##  [9] knitr_1.51       haven_2.5.5      lubridate_1.9.4  forcats_1.0.1   
## [13] stringr_1.6.0    dplyr_1.2.0      purrr_1.2.1      readr_2.1.6     
## [17] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.1    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6         xfun_0.56            bslib_0.10.0        
##  [4] htmlwidgets_1.6.4    insight_1.4.5        tzdb_0.5.0          
##  [7] vctrs_0.7.1          tools_4.5.2          crosstalk_1.2.2     
## [10] generics_0.1.4       datawizard_1.3.0     pkgconfig_2.0.3     
## [13] data.table_1.18.0    RColorBrewer_1.1-3   S7_0.2.1            
## [16] lifecycle_1.0.5      compiler_4.5.2       farver_2.1.2        
## [19] textshaping_1.0.4    htmltools_0.5.9      sass_0.4.10         
## [22] yaml_2.3.12          lazyeval_0.2.2       Formula_1.2-5       
## [25] pillar_1.11.1        jquerylib_0.1.4      broom.helpers_1.22.0
## [28] cachem_1.1.0         abind_1.4-8          tidyselect_1.2.1    
## [31] digest_0.6.39        stringi_1.8.7        labeling_0.4.3      
## [34] labelled_2.16.0      fastmap_1.2.0        grid_4.5.2          
## [37] cli_3.6.5            magrittr_2.0.4       cards_0.7.1         
## [40] withr_3.0.2          scales_1.4.0         backports_1.5.0     
## [43] timechange_0.3.0     rmarkdown_2.30       httr_1.4.7          
## [46] otel_0.2.0           hms_1.1.4            evaluate_1.0.5      
## [49] viridisLite_0.4.2    rlang_1.1.7          glue_1.8.0          
## [52] xml2_1.5.2           svglite_2.2.2        rstudioapi_0.18.0   
## [55] jsonlite_2.0.0       R6_2.6.1             systemfonts_1.3.1