Student Lab Activity

Lab Overview

In this lab, you will:

  1. Build your own logistic regression model predicting hypertension (high blood pressure)
  2. Create dummy variables for categorical predictors
  3. Interpret regression coefficients
  4. Test for confounding and interaction
  5. Perform model diagnostics

This lecture introduces key concepts in regression modeling using real-world data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023.


Setup and Data Preparation

# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)

Loading BRFSS 2023 Data

The BRFSS is a large-scale telephone survey that collects data on health-related risk behaviors, chronic health conditions, and use of preventive services from U.S. residents.

# Load the full BRFSS 2023 dataset
brfss_subset <- read_rds("/Users/nataliasmall/Downloads/EPI 553/brfss_subset_2023.rds") %>%
  janitor::clean_names()

# Display dataset dimensions
names(brfss_subset)
##  [1] "diabetes"       "age_group"      "age_cont"       "sex"           
##  [5] "race"           "education"      "income"         "bmi_cat"       
##  [9] "phys_active"    "current_smoker" "gen_health"     "hypertension"  
## [13] "high_chol"

Lab Instructions

Task 1: Explore the Outcome Variable

# YOUR CODE HERE: Create a frequency table of hypertension status
# Disclaimer: From 552
Frequency_hypstat = cbind(freq = table(brfss_subset$hypertension), percent = prop.table(table(brfss_subset$hypertension)))
# Print out the table
Frequency_hypstat
##   freq   percent
## 0  606 0.4730679
## 1  675 0.5269321
# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
hyp_by_age <- brfss_subset %>%
  group_by(age_group) %>%
  summarise(
    N = n(),
    Mean_hyp = round(mean(hypertension, na.rm = TRUE), 2),
    Pct_Hypertension = round(
      sum(hypertension == 1, na.rm = TRUE) / sum(!is.na(hypertension)) * 100, 2)
  )

print(hyp_by_age)
## # A tibble: 6 × 4
##   age_group     N Mean_hyp Pct_Hypertension
##   <fct>     <int>    <dbl>            <dbl>
## 1 18-24        12     0.08             8.33
## 2 25-34        77     0.19            19.5 
## 3 35-44       138     0.3             30.4 
## 4 45-54       161     0.38            37.9 
## 5 55-64       266     0.52            51.5 
## 6 65+         627     0.67            66.8

Questions:

  1. What is the overall prevalence of hypertension in the dataset?
  • About 53% of individuals in the dataset have hypertension
  • There is an overall strong association between age group and percentage with hypertension in the dataset, with higher percentages among higher age groups
  1. How does hypertension prevalence vary by age group?
  • As age groups increase, mean hypertension increases

Task 2: Build a Simple Logistic Regression Model

# YOUR CODE HERE: Fit a simple logistic regression model
# Outcome: hypertension
# Predictor: age_cont

# YOUR CODE HERE: Display the results with odds ratios
model_logistic_simple <- glm(hypertension ~ age_cont,
                              data = brfss_subset,
                              family = binomial(link = "logit"))

# Display results with odds ratios
tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Simple Logistic Regression: Hypertension ~ Age (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.048 0.296 -10.293 0 0.026 0.084
age_cont 1.055 0.005 10.996 0 1.045 1.065

Questions:

  1. What is the odds ratio for age? Interpret this value.
  • Odds Ratio (OR): 1.055
  1. Is the association statistically significant?
  • The relationship is highly statistically significant (p < 0.001)
  1. What is the 95% confidence interval for the odds ratio?
  • 95% Confidence Interval (CI): [1.045, 1.065]

Task 3: Create a Multiple Regression Model

# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker

# YOUR CODE HERE: Display the results
model_logistic_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker,
                               data = brfss_subset,
                               family = binomial(link = "logit"))

# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")
Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.008 0.653 -7.355 0.000 0.002 0.028
age_cont 1.061 0.005 11.234 0.000 1.050 1.073
sexMale 1.270 0.123 1.950 0.051 0.999 1.616
bmi_catNormal 2.097 0.546 1.356 0.175 0.759 6.756
bmi_catOverweight 3.241 0.543 2.166 0.030 1.183 10.385
bmi_catObese 6.585 0.545 3.459 0.001 2.394 21.176
phys_active 0.900 0.130 -0.808 0.419 0.697 1.162
current_smoker 1.071 0.139 0.495 0.621 0.817 1.407

Questions:

  1. How did the odds ratio for age change after adjusting for other variables?
  • After adjusting for sex, BMI, physical activity, and smoking, each 1-year increase in age is associated with a 6.1% increase in the odds of hypertension
  1. What does this suggest about confounding?
  • The slight increase in odds ratio from 1.055 to 1.061 suggests that confounding could be present. But, it should be noted that there was only around a 0.57% change, which is low, indcating that the other variables did not signifiantly conofund relationship between age and hypertension.
  1. Which variables are the strongest predictors of hypertension?
  • The BMI categories are the strongest predictors of hypertension. The obese BMI category is the strongest predictor, having the largest odds ratio, 6.585, and six times the odds of hypertension

Task 4: Interpret Dummy Variables

# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat
dummy_table <- data.frame(
  bmi_cat = c("Underweight", "Normal", "Overweight", "Obese"),
  `Dummy 1 (Normal)` = c(0, 1, 0, 0),
  `Dummy 2 (Overweight)` = c(0, 0, 1, 0),
  `Dummy 3 (Obese)` = c(0, 0, 0, 1),
  check.names = FALSE
)

dummy_table %>%
  kable(caption = "Dummy Variable Coding for bmi_cat",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category
Dummy Variable Coding for bmi_cat
bmi_cat Dummy 1 (Normal) Dummy 2 (Overweight) Dummy 3 (Obese)
Underweight 0 0 0
Normal 1 0 0
Overweight 0 1 0
Obese 0 0 1
# YOUR CODE HERE: Extract and display the odds ratios for BMI categories
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "bmi_cat"))
## # A tibble: 3 × 7
##   term              estimate std.error statistic  p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 bmi_catNormal         2.10     0.546      1.36 0.175       0.759      6.76
## 2 bmi_catOverweight     3.24     0.543      2.17 0.0303      1.18      10.4 
## 3 bmi_catObese          6.59     0.545      3.46 0.000542    2.39      21.2

Questions:

  1. What is the reference category for BMI?
  • The category with all zeros (underweight) is the reference group. All other categories are compared to this reference
  1. Interpret the odds ratio for “Obese” compared to the reference category.
  • Compared to the reference category “underweight”, obese individuals have a 6.6 times higher odds of developing hypertension after adjusting for age, BMI, sex, physical activity, and smoking.
  1. How would you explain this to a non-statistician?
  • People who are obese are more likely to develop hypertension, high blood pressure, compared to underweight people. It can be said these individuals have six times the chance of getting hypertension. This is taking into consideration social and health factors like age, BMI, smoking etc.

Task 5: Test for Interaction

# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
model_interaction <- glm(hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker,
                         data = brfss_subset,
                         family = binomial(link = "logit"))

# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "age_cont")) %>%
  kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Age × BMI Interaction Model (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
age_cont 1.005 0.042 0.117 0.907 0.930 1.110
age_cont:bmi_catNormal 1.058 0.043 1.287 0.198 0.956 1.147
age_cont:bmi_catOverweight 1.064 0.043 1.431 0.152 0.962 1.152
age_cont:bmi_catObese 1.052 0.043 1.186 0.236 0.952 1.139
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
lrt <- anova(model_logistic_multiple, model_interaction, test = "LRT")
lrt
## Analysis of Deviance Table
## 
## Model 1: hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker
## Model 2: hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      1273     1563.5                     
## 2      1270     1561.3  3   2.2363   0.5248
#YOUR CODE HERE: Visualization showing predicted probabilities by age and BMI category
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "bmi_cat"))

# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Hypertension by Age and BMI",
    subtitle = "Testing for Age × BMI Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Hypertension",
    color = "BMI",
    fill = "BMI"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("Underweight" = "red", "Normal" = "blue","Overweight" = "yellow", "Obese" = "#4DBBD5")) +
  scale_fill_manual(values = c("Underweight" = "red", "Normal" = "blue", "Overweight" = "yellow", "Obese" = "#4DBBD5")) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)

Questions:

  1. Is the interaction term statistically significant?
  • No, the interaction term is not statistically significant. All of the p-values for the age_cont:bmi_cat terms in the table are > 0.05.
  1. What does this mean in epidemiologic terms (effect modification)?
  • This means that the age-hypertension relationship does not differ by BMI, and there is no evidence of effect modification.
  1. Create a visualization showing predicted probabilities by age and BMI category

Task 6: Model Diagnostics

# YOUR CODE HERE: Calculate VIF for your multiple regression model
vif_values <- vif(model_logistic_multiple)
print (vif_values)
##                    GVIF Df GVIF^(1/(2*Df))
## age_cont       1.126628  1        1.061428
## sex            1.016509  1        1.008221
## bmi_cat        1.103045  3        1.016480
## phys_active    1.024820  1        1.012334
## current_smoker 1.073574  1        1.036134
# YOUR CODE HERE: Create a Cook's distance plot to identify influential observations
cooks_d <- cooks.distance(model_logistic_multiple)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
) %>%
  mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))

# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance: Identifying Influential Observations",
    subtitle = "Values > 1 indicate potentially influential observations",
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

ggplotly(p5)

Questions:

  1. Are there any concerns about multicollinearity?
  • No there are no concerns about multicollinearity, all VIF values are < 5 which is generally acceptable.
  1. Are there any influential observations that might affect your results?
  • No, there are no influential observations that might affect results.
  • Cook’s D < 1: No potentially influential observations
  1. What would you do if you found serious violations?
  • If there were serious multicollinearity violations, I would try removing predictors that are highly correlated. I would also check for any influential outliers, and if there are you could transform with log.

Task 7: Model Comparison

# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
model1 <- glm(hypertension ~ age_cont,
              data = brfss_subset,
              family = binomial)
# Model B: Age + sex + bmi_cat
model2 <- glm(hypertension ~ age_cont + sex + bmi_cat,
              data = brfss_subset,
              family = binomial)
# Model C: Age + sex + bmi_cat + phys_active + current_smoker
model3 <- glm(hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
              data = brfss_subset,
              family = binomial)

# YOUR CODE HERE: Create a comparison table
model_comp <- data.frame(
  Model = c("Model A: Age only",
            "Model B: Age + sex + bmi_cat",
            "Model 3: Full model"),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  `Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
  check.names = FALSE
)

model_comp %>%
  kable(caption = "Model Comparison: AIC, BIC, and Deviance",
        digits = 2,
        align = "lrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")
Model Comparison: AIC, BIC, and Deviance
Model AIC BIC Deviance
Model A: Age only 1636.61 1646.92 1632.61
Model B: Age + sex + bmi_cat 1576.49 1607.42 1564.49
Model 3: Full model 1579.50 1620.74 1563.50

Questions:

  1. Which model has the best fit based on AIC?
  • Model B: Age + sex + bmi_cat, has the lowest AIC, suggesting it provides the best fit to the data
  1. Is the added complexity of the full model justified?
  • No, the added complexity of the full model was not justified because its AIC was not lower than Model B’s. It did not do anything to improve best fit.
  1. Which model would you choose for your final analysis? Why?
  • I would choose Model B: Age + sex + bmi_cat, because it is best fit to the data, based on AIC

Lab Report Guidelines

Write a brief report (1-2 pages) summarizing your findings: 1. Introduction: State your research question 2. Methods: Describe your analytic approach 3. Results: Present key findings with tables and figures 4. Interpretation: Explain what your results mean 5. Limitations: Discuss potential issues with your analysis

Submission: Submit your completed R Markdown file and knitted HTML report.


Lab Report

The research questions covered in this lab were (1) What is the overall prevalence of hypertension in the data set? and (2) How does hypertension prevalence vary by age group?

In this analysis, we used data from the Behavioral Risk Factor Surveillance System (BRFSS) 2023. Particularly, a subset of the data containing demographic, health behavior, and health status data. At the beginning of the analysis, we explore the binary outcome variable, hypertension. Creating a frequency table of hypertension status to dictate the overall prevalence of hypertension in the dataset, and then further calculating the prevalence of hypertension by age group to determine how hypertension prevalence varies by age group. First, we fit a simple logistic regression model, trying to determine whether there is a statistically significant association between hypertension ~ age_cont. Then, we fit a multiple logistic regression model, controlling for confounding factors, particularly adjusting for sex, BMI, physical activity, and smoking. From the multiple logistic regression model, we would have determined the strongest predictors of hypertension, which prompts us to test for interaction, regarding effect modification, and then perform a likelihood ratio test to further compare models with and without interaction. After this, we worked on model diagnostics, calculating VIF for the multiple regression model and creating a Cook’s distance plot to identify influential observations. Finally, we decided on a definitive model for the final analysis after comparing three models using AIC and BIC, seeking the best fit.

In the dataset, about 53% of individuals have hypertension, indicating an overall strong association between age group and percentage with hypertension in the dataset, with higher percentages among higher age groups. After fitting the simple logistic regression model, it was determined that there is a highly statistically significant association (p < 0.001) between hypertension and age, with an Odds Ratio (OR) of 1.055 and a 95% Confidence Interval (CI) of [1.045, 1.065]. From the multiple regression model, after adjusting for sex, BMI, physical activity, and smoking, determined that each 1-year increase in age is associated with a 6.1% increase in the odds of hypertension. The slight increase in odds ratio from 1.055 to 1.061 suggests that confounding could have been present. But, with only a 0.57% change, this could indicate that the other variables did not significantly confound relationship between age and hypertension. The BMI category was determined to be the strongest predictor, having the largest odds ratio, 6.585. For BMI, compared to the reference category “underweight”, obese individuals have a 6.6 times higher odds of developing hypertension after adjusting for age, BMI, sex, physical activity, and smoking. Age_cont:bmi_cat interaction terms did not prove to be statistically significant. All of the p-values for the age_cont:bmi_cat terms in the table are > 0.05. When calculating VIF, all values were < 5, affirming no concerns about multicollinearity. All in all, after comparing three models using AIC and BIC, we choose Model B: Age + sex + bmi_cat, because it was best fit to the data, based on AIC.

As stated before, as age groups increase, mean hypertension increases. The BMI categories were the strongest predictors of hypertension, with the obese BMI category being the strongest predictor. Compared to the reference category “underweight”, obese individuals have a 6.6 times higher odds of developing hypertension after adjusting for age, BMI, sex, physical activity, and smoking. People who are obese are more likely to develop hypertension, high blood pressure, compared to underweight people. It can be said these individuals have six times the chance of getting hypertension. This is taking into consideration social and health factors like age, BMI, smoking etc. Interaction terms were not statistically significant. All of the p-values for the age_cont:bmi_cat terms in the table are > 0.05, meaning that the age-hypertension relationship does not differ by BMI, and there is no evidence of effect modification. No there are no concerns about multicollinearity, all VIF values are < 5 which is generally acceptable.

Potential issues with your analysis is that causal relationships cannot be determined. Correlation doesn’t equal causality. There are also variations in frequency between variables with multiple categories such as BMI, which could act as a catalyst for potentially serious violations.

Key Concepts Covered

  1. Statistical modeling describes relationships between variables
  2. Regression types depend on the outcome variable type
  3. Logistic regression is appropriate for binary outcomes
  4. Multiple regression controls for confounding
  5. Dummy variables represent categorical predictors
  6. Interactions test for effect modification
  7. Model diagnostics check assumptions and identify problems
  8. Model comparison helps select the best model