R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Setup and Data Preparation

# Load required packages
library(tidyverse)
library(haven)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(car)
library(ggeffects)
library(gtsummary)
library(ggstats)

Lab Overview

In this lab, you will:

  1. Build your own logistic regression model predicting hypertension (high blood pressure)
  2. Create dummy variables for categorical predictors
  3. Interpret regression coefficients
  4. Test for confounding and interaction
  5. Perform model diagnostics

Lab Instructions

# Load the full BRFSS 2023 dataset
brfss_clean<- read_rds("C:\\Users\\safwa\\OneDrive - University at Albany - SUNY\\EPI 553\\Labs\\brfss_subset_2023.rds")
# Display dataset dimensions
names(brfss_clean)
##  [1] "diabetes"       "age_group"      "age_cont"       "sex"           
##  [5] "race"           "education"      "income"         "bmi_cat"       
##  [9] "phys_active"    "current_smoker" "gen_health"     "hypertension"  
## [13] "high_chol"
### Task 1: Explore the Outcome Variable

# YOUR CODE HERE: Create a frequency table of hypertension status
freq_table <- table(brfss_clean$hypertension)
prev_ht <- mean(brfss_clean$hypertension == 1, na.rm = TRUE) * 100
prev_ht
## [1] 52.69321
# YOUR CODE HERE: Calculate the prevalence of hypertension by age group
prev_by_age <- brfss_clean %>%
  group_by(age_group) %>%
  summarise(
    n = n(),
    ht_cases = sum(hypertension == 1, na.rm = TRUE),
    prevalence = mean(hypertension == 1, na.rm = TRUE) * 100,
    .groups = "drop"
  )

prev_by_age
## # A tibble: 6 × 4
##   age_group     n ht_cases prevalence
##   <fct>     <int>    <int>      <dbl>
## 1 18-24        12        1       8.33
## 2 25-34        77       15      19.5 
## 3 35-44       138       42      30.4 
## 4 45-54       161       61      37.9 
## 5 55-64       266      137      51.5 
## 6 65+         627      419      66.8

Questions:

  1. What is the overall prevalence of hypertension in the data set? Ans: The overall prevalence of hypertension in the data set is 52.7%

  2. How does hypertension prevalence vary by age group? Ans : The data shows a strong positive correlation between age and hypertension. As the age increase the prevalence of hypertension rises significantly and consistently. ***

Task 2: Build a Simple Logistic Regression Model

# YOUR CODE HERE: Fit a simple logistic regression model
# Outcome: hypertension
# Predictor: age_cont
model_logistic_simple <- glm(hypertension ~ age_cont, 
                             data = brfss_clean, 
                             family = binomial(link = "logit"))
summary(model_logistic_simple)
## 
## Call:
## glm(formula = hypertension ~ age_cont, family = binomial(link = "logit"), 
##     data = brfss_clean)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.042577   0.295584  -10.29   <2e-16 ***
## age_cont     0.053119   0.004831   11.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1772.1  on 1280  degrees of freedom
## Residual deviance: 1632.6  on 1279  degrees of freedom
## AIC: 1636.6
## 
## Number of Fisher Scoring iterations: 4
# YOUR CODE HERE: Display the results with odds ratios
# Display results with odds ratios

tidy(model_logistic_simple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.048 0.296 -10.293 0 0.026 0.084
age_cont 1.055 0.005 10.996 0 1.045 1.065

Questions:

  1. What is the odds ratio for age? Interpret this value. Ans : the odds ratio for age is 1.055. For each one-year increase in age, the odds of having hypertension increase by 5.5%, holding everything else constant.

  2. Is the association statistically significant? Ans: Since p value is 0 which is less than 0.05 and CI does not cross 1 , Age is significantly associated with hypertension.

  3. What is the 95% confidence interval for the odds ratio? Ans: The 95% confidence interval for the odds ratio falls between 4.5% and 6.5%.


Task 3: Create a Multiple Regression Model

# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
model_logistic_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
                                phys_active + current_smoker,
                               data = brfss_clean,
                               family = binomial(link = "logit"))

# YOUR CODE HERE:# Display results
tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  kable(caption = "Multiple Logistic Regression: hypertension ~ Age + Covariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  scroll_box(height = "400px")
Multiple Logistic Regression: hypertension ~ Age + Covariates (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
(Intercept) 0.008 0.653 -7.355 0.000 0.002 0.028
age_cont 1.061 0.005 11.234 0.000 1.050 1.073
sexMale 1.270 0.123 1.950 0.051 0.999 1.616
bmi_catNormal 2.097 0.546 1.356 0.175 0.759 6.756
bmi_catOverweight 3.241 0.543 2.166 0.030 1.183 10.385
bmi_catObese 6.585 0.545 3.459 0.001 2.394 21.176
phys_active 0.900 0.130 -0.808 0.419 0.697 1.162
current_smoker 1.071 0.139 0.495 0.621 0.817 1.407

Questions:

  1. How did the odds ratio for age change after adjusting for other variables? Ans: After adjusting for sex, BMI category, physical activity, and smoking status, the odds ratio for age is 1.061.This means that for each one-year increase in age, the odds of hypertension increase by approximately 6.1%, holding all other variables constant.Compared to the crude (unadjusted) model (which was likely very similar), the OR appears to have changed only slightly, suggesting the association between age and hypertension remains strong even after adjustment.

  2. What does this suggest about confounding? Ans:Because the adjusted odds ratio for age remains strong and statistically significant (p < 0.001), and does not appear to change substantially from the crude estimate, this suggests:There is little evidence of strong confounding by sex, BMI, physical activity, or smoking.Age appears to be an independent predictor of hypertension.

  3. Which variables are the strongest predictors of hypertension? Ans: Obesity, Overweight and Age are the strongest predictors of hypertension.

Task 4: Interpret Dummy Variables

# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat
# Extract dummy variable coding
dummy_table <- data.frame(
  bmi_cat= c("bmi_catNormal (Reference)", "bmi_catOverweight    ", "bmi_catObese"),
  `Dummy 1 (bmi_catOverweight   )` = c(0, 1, 0),
  `Dummy 2 (bmi_catObese)` = c(0, 0, 1),
  check.names = FALSE
)
dummy_table %>%
  kable(caption = "Dummy Variable Coding for BMI",
        align = "lccc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#ffe6e6")  # Highlight reference category
Dummy Variable Coding for BMI
bmi_cat Dummy 1 (bmi_catOverweight )
Dummy 2 (bmi_catObese)
bmi_catNormal (Reference) 0 0
bmi_catOverweight 1 0
bmi_catObese 0 1
# YOUR CODE HERE: Extract and display the odds ratios for BMI categories
bmi_coefs <- tidy(model_logistic_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "^bmi_cat")) %>%
  mutate(
    bmi_cat = str_remove(term, "^bmi_cat_?"),
    bmi_cat = factor(bmi_cat, levels = c("bmi_catOverweight ", "bmi_catObese"))
  )
bmi_or <- tidy(model_logistic_multiple,
               exponentiate = TRUE,
               conf.int = TRUE)

bmi_or <- bmi_or[bmi_or$term %in% c("bmi_catOverweight",
                                    "bmi_catObese"), ]

bmi_or
## # A tibble: 2 × 7
##   term              estimate std.error statistic  p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 bmi_catOverweight     3.24     0.543      2.17 0.0303       1.18      10.4
## 2 bmi_catObese          6.59     0.545      3.46 0.000542     2.39      21.2
# Add reference category
ref_row1 <- data.frame(
  term = "bmi_catNormal",
  estimate = 1.0,
  std.error = NA,
  statistic = NA,
  p.value = NA,
  conf.low = 1.0,
  conf.high = 1.0,
  bmi_cat = factor("bmi_catNormal (Ref)", levels = c("bmi_catNormal (Ref)", "bmi_catOverweight", "bmi_catObese"))
)

bmi_coefs_full <- bind_rows(ref_row1, bmi_coefs) %>%
  mutate(bmi_cat = factor(bmi_cat, levels = c("bmi_catNormal(Ref)", "bmi_catOverweightt", "bmi_catObese")))

# Plot
p3 <- ggplot(bmi_coefs_full, aes(x = bmi_cat, y = estimate)) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  coord_flip() +
  labs(
    title = "Association Between BMI and Hypertension",
    subtitle = "Adjusted Odds Ratios (reference: bmi_catNormal )",
    x = "BMI",
    y = "Odds Ratio (95% CI)"
  ) +
  theme_minimal(base_size = 12)
ggplotly(p3)

Questions: a) What is the reference category for BMI? Ans: The reference category for BMI is bmi_catNormal.

  1. Interpret the odds ratio for “Obese” compared to the reference category. Ans:Individuals classified as obese have 6.59 times the odds of hypertension compared to individuals with normal BMI (the reference category), adjusting for age, sex, physical activity, and smoking status.

  2. How would you explain this to a non-statistician? Ans: People who are obese are much more likely to have high blood pressure than people with a normal weight. In this study, obese individuals were about 6 times more likely to have hypertension compared to those with normal BMI, even after accounting for age, sex, exercise, and smoking.This means obesity is strongly linked to high blood pressure in this population. ***

Task 5: Test for Interaction

# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
# Model with interaction term
model_interaction <- glm(hypertension ~ age_cont *bmi_cat + sex + phys_active + current_smoker,
                         data = brfss_clean,
                         family = binomial(link = "logit"))

# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(str_detect(term, "age_cont")) %>%
  kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Age × BMI Interaction Model (Odds Ratios)
Term Odds Ratio Std. Error z-statistic p-value 95% CI Lower 95% CI Upper
age_cont 1.005 0.042 0.117 0.907 0.930 1.110
age_cont:bmi_catNormal 1.058 0.043 1.287 0.198 0.956 1.147
age_cont:bmi_catOverweight 1.064 0.043 1.431 0.152 0.962 1.152
age_cont:bmi_catObese 1.052 0.043 1.186 0.236 0.952 1.139
# Test if the effect of age on hypertension differs by BMI category
pred_interact <- ggpredict(model_interaction, terms = c("age_cont [18:80]", "bmi_cat"))
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title = "Predicted Probability of Hypertension by Age and BMI Category",
    subtitle = "Testing for Age × BMI Interaction",
    x = "Age (years)",
    y = "Predicted Probability of Hypertension",
    color = "BMI Category",
    fill = "BMI Category"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

ggplotly(p4)
# YOUR CODE HERE: Perform a likelihood ratio test comparing models with and without interaction
model_main <- glm(hypertension ~ age_cont + bmi_cat + sex + phys_active + current_smoker,
                         data = brfss_clean,
                         family = binomial(link = "logit")) 

lrt_result <- anova(model_main, model_interaction, test = "LRT")

# Display clean table
broom::tidy(lrt_result) %>%
  knitr::kable(
    caption = "Likelihood Ratio Test Comparing Models With and Without Age × BMI Interaction",
    digits = 4
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Likelihood Ratio Test Comparing Models With and Without Age × BMI Interaction
term df.residual residual.deviance df deviance p.value
hypertension ~ age_cont + bmi_cat + sex + phys_active + current_smoker 1273 1563.496 NA NA NA
hypertension ~ age_cont * bmi_cat + sex + phys_active + current_smoker 1270 1561.260 3 2.2363 0.5248

Questions:

  1. Is the interaction term statistically significant? Ans: No, the interaction terms are not statistically significant.
  2. What does this mean in epidemiologic terms (effect modification)? Ans: There is no evidence of effect modification by BMI category.
  3. Create a visualization showing predicted probabilities by age and BMI category Ans:

Task 6: Model Diagnostics

# YOUR CODE HERE: Calculate VIF for your multiple regression model
vif_values <- car::vif(model_logistic_multiple)

# Handle categorical variables (GVIF adjustment)
if (is.matrix(vif_values)) {
  
  vif_df <- data.frame(
    Variable = rownames(vif_values),
    GVIF = vif_values[, "GVIF"],
    Df = vif_values[, "Df"],
    Adjusted_VIF = vif_values[, "GVIF^(1/(2*Df))"]
  )
  
} else {
  
  vif_df <- data.frame(
    Variable = names(vif_values),
    Adjusted_VIF = as.numeric(vif_values)
  )
}

# Display table
knitr::kable(
  vif_df,
  caption = "Variance Inflation Factors (VIF) for Hypertension Logistic Model",
  digits = 3
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Variance Inflation Factors (VIF) for Hypertension Logistic Model
Variable GVIF Df Adjusted_VIF
age_cont age_cont 1.127 1 1.061
sex sex 1.017 1 1.008
bmi_cat bmi_cat 1.103 3 1.016
phys_active phys_active 1.025 1 1.012
current_smoker current_smoker 1.074 1 1.036
# YOUR CODE HERE: Create a Cook's distance plot to identify influential observations


# Calculate Cook's distance
cooks_d <- cooks.distance(model_logistic_multiple)

# Create data frame
influence_df <- data.frame(
  observation = 1:length(cooks_d),
  cooks_d = cooks_d
)

# Common threshold
threshold <- 4 / nrow(brfss_clean)

influence_df <- influence_df %>%
  dplyr::mutate(
    influential = ifelse(cooks_d > threshold, "Yes", "No")
  )

# Plot
p_cook <- ggplot(influence_df, aes(x = observation, y = cooks_d)) +
  geom_point(aes(color = influential), alpha = 0.6) +
  geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
  labs(
    title = "Cook's Distance Plot",
    subtitle = paste("Threshold =", round(threshold, 4)),
    x = "Observation Number",
    y = "Cook's Distance",
    color = "Influential?"
  ) +
  scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
  theme_minimal(base_size = 12)

p_cook

# Count influential observations
sum(influence_df$influential == "Yes")
## [1] 22

Questions:

  1. Are there any concerns about multicollinearity? Ans: Since the model standard errors are small and stable (as seen earlier in your interaction output), there is no indication of problematic multicollinearity.

  2. Are there any influential observations that might affect your results? Ans: Yes, there are some potentially influential observations.From the Cook’s Distance plot: The threshold is approximately 0.0031 Several observations exceed this threshold (marked in red) A few points are substantially above the cutoff

  3. What would you do if you found serious violations? Ans: I would investigate influential observations,run sensitivity analyses


Task 7: Model Comparison

# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
model_A <- glm(
  hypertension ~ age_cont,
  data = brfss_clean,
  family = binomial(link = "logit")
)

# Model B: Age + sex + bmi_cat
model_B <- glm(
  hypertension ~ age_cont + sex + bmi_cat,
  data = brfss_clean,
  family = binomial(link = "logit")
)

# Model C: Age + sex + bmi_cat + phys_active + current_smoker
model_C <- glm(
  hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
  data = brfss_clean,
  family = binomial(link = "logit")
)

# YOUR CODE HERE:# Create comparison table
model_comp <- tibble::tibble(
  Model = c("Model A: Age only",
            "Model B: Age + sex + BMI",
            "Model C: Age + sex + BMI + activity + smoking"),
  AIC = c(AIC(model_A), AIC(model_B), AIC(model_C)),
  BIC = c(BIC(model_A), BIC(model_B), BIC(model_C)),
  Deviance = c(deviance(model_A), deviance(model_B), deviance(model_C))
) %>%
  dplyr::arrange(AIC)

# Display table
model_comp %>%
  knitr::kable(
    caption = "Model Comparison for Hypertension (Lower AIC/BIC = Better Fit)",
    digits = 2
  ) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )
Model Comparison for Hypertension (Lower AIC/BIC = Better Fit)
Model AIC BIC Deviance
Model B: Age + sex + BMI 1576.49 1607.42 1564.49
Model C: Age + sex + BMI + activity + smoking 1579.50 1620.74 1563.50
Model A: Age only 1636.61 1646.92 1632.61

Questions:

  1. Which model has the best fit based on AIC? Ans: Model B
  2. Is the added complexity of the full model justified? Ans: No, the added complexity of the full model (Model C) is likely not justified.
  3. Which model would you choose for your final analysis? Why? Ans:I would choose Model B for the final analysis.

Lab Report Guidelines

Write a brief report (1-2 pages) summarizing your findings:

  1. Introduction: State your research question
  2. Methods: Describe your analytic approach
  3. Results: Present key findings with tables and figures
  4. Interpretation: Explain what your results mean
  5. Limitations: Discuss potential issues with your analysis

Submission: Submit your completed R Markdown file and knitted HTML report.


Summary

Key Concepts Covered

  1. Statistical modeling describes relationships between variables
  2. Regression types depend on the outcome variable type
  3. Logistic regression is appropriate for binary outcomes
  4. Multiple regression controls for confounding
  5. Dummy variables represent categorical predictors
  6. Interactions test for effect modification
  7. Model diagnostics check assumptions and identify problems
  8. Model comparison helps select the best model

Important Formulas

Logistic Regression:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]

Odds Ratio:

\[\text{OR} = e^{\beta_i}\]

Predicted Probability:

\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]


References


Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggstats_0.12.0   gtsummary_2.5.0  ggeffects_2.3.2  car_3.1-5       
##  [5] carData_3.0-6    broom_1.0.12     plotly_4.12.0    kableExtra_1.4.0
##  [9] knitr_1.51       haven_2.5.5      lubridate_1.9.4  forcats_1.0.1   
## [13] stringr_1.6.0    dplyr_1.1.4      purrr_1.2.1      readr_2.1.6     
## [17] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.10.0       htmlwidgets_1.6.4 
##  [5] insight_1.4.6      tzdb_0.5.0         crosstalk_1.2.2    vctrs_0.7.1       
##  [9] tools_4.5.1        generics_0.1.4     datawizard_1.3.0   pkgconfig_2.0.3   
## [13] data.table_1.18.0  RColorBrewer_1.1-3 S7_0.2.1           lifecycle_1.0.5   
## [17] compiler_4.5.1     farver_2.1.2       textshaping_1.0.4  codetools_0.2-20  
## [21] htmltools_0.5.9    sass_0.4.10        yaml_2.3.12        lazyeval_0.2.2    
## [25] Formula_1.2-5      pillar_1.11.1      jquerylib_0.1.4    cachem_1.1.0      
## [29] abind_1.4-8        tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7     
## [33] labeling_0.4.3     fastmap_1.2.0      grid_4.5.1         cli_3.6.5         
## [37] magrittr_2.0.4     utf8_1.2.6         withr_3.0.2        scales_1.4.0      
## [41] backports_1.5.0    timechange_0.3.0   rmarkdown_2.30     httr_1.4.7        
## [45] otel_0.2.0         hms_1.1.4          evaluate_1.0.5     viridisLite_0.4.2 
## [49] rlang_1.1.7        glue_1.8.0         xml2_1.5.2         svglite_2.2.2     
## [53] rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1