In this lab, you will:
str(brfss_data)
dim(brfss_data)
names(brfss_data)
## Descriptive Statistics
``` r
# YOUR CODE HERE: Create a frequency table of hypertension status
# Create a frequency table of hypertension status
hyprtnsn_freq <- brfss_data %>%
group_by(hypertension) %>%
summarise(
Count = n(),
Percentage = round(100 * n() / nrow(brfss_data), 1)
) %>%
mutate(
Status = case_when(
hypertension == 1 ~ "Has Hypertension",
hypertension == 0 ~ "No Hypertension"
)
) %>%
select(Status, Count, Percentage)
hyprtnsn_freq %>%
kable(caption = "Overall Prevalence of Hypertension in Study Sample",
align = "lrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)
| Status | Count | Percentage |
|---|---|---|
| No Hypertension | 606 | 47.3 |
| Has Hypertension | 675 | 52.7 |
# Calculate the prevalence of hypertension by age group
hyprtnsn_by_age <- brfss_data %>%
group_by(age_group) %>%
summarise(
Total = n(),
Cases = sum(hypertension),
Prevalence_Percent = round(100 * mean(hypertension), 1),
.groups = 'drop'
)
hyprtnsn_by_age %>%
kable(caption = "Hypertension Prevalence by Age Group",
align = "lccc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| age_group | Total | Cases | Prevalence_Percent |
|---|---|---|---|
| 18-24 | 12 | 1 | 8.3 |
| 25-34 | 77 | 15 | 19.5 |
| 35-44 | 138 | 42 | 30.4 |
| 45-54 | 161 | 61 | 37.9 |
| 55-64 | 266 | 137 | 51.5 |
| 65+ | 627 | 419 | 66.8 |
Questions:
What is the overall prevalence of hypertension in the dataset? The overall prevalence of hypertension in the dataset is 675, which is 52.7%.
How does hypertension prevalence vary by age group? Hypertension increases steadily and sharply with age with highest prevalence among the 65+ age group i.e. 419 cases out of 627 in the age group (66.8%). Only 8.3% of the age 18–24 have hypertension and it nearly doubles by early adulthood with prevalence jumping to 19.5% in the 25–34 group. By ages 35–54, about one‑third to nearly 40% are hypertensive. Major increase is seen after age 55 with over 50%.
# Outcome: hypertension
# Predictor: age_cont (continous age in years)
# Simple logistic regression: diabetes ~ age
model_hyprtnsn_simple <- glm(hypertension ~ age_cont,
data = brfss_data,
family = binomial(link = "logit"))
# Display results with odds ratios
tidy(model_hyprtnsn_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Diabetes ~ Age (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.048 | 0.296 | -10.293 | 0 | 0.026 | 0.084 |
| age_cont | 1.055 | 0.005 | 10.996 | 0 | 1.045 | 1.065 |
Interpretation:
# Generate predicted probabilities
pred_data <- data.frame(age_cont = seq(18, 80, by = 1))
pred_data$predicted_prob <- predict(model_hyprtnsn_simple,
newdata = pred_data,
type = "response")
# Plot
p2 <- ggplot(pred_data, aes(x = age_cont, y = predicted_prob)) +
geom_line(color = "darkred", linewidth = 1.5) +
geom_ribbon(aes(ymin = predicted_prob - 0.03,
ymax = predicted_prob + 0.03),
alpha = 0.2, fill = "darkred") +
labs(
title = "Predicted Probability of Hypertension by Age",
subtitle = "Simple Logistic Regression Model",
x = "Age (years)",
y = "Predicted Probability of hypertension"
) +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 0.8)) +
theme_minimal(base_size = 12)
ggplotly(p2)Questions:
The odds ratio (OR) for age_cont is 1.055. For each additional year of age, the odds of having diabetes increase by 5.5%. Because an OR of 1.055 means the odds are multiplied by 1.055 for every one‑year increase. This shows a positive association: older individuals have higher odds of diabetes.
Is the association statistically significant? Yes. This association is statistically significant.The p‑value for age is < 0.001, which is far below the α = 0.05 threshold. The 95% CI (1.045 to 1.065) does not include 1. Both factor indicate a statistically significant association between age and diabetes.
What is the 95% confidence interval for the odds ratio? The 95% CI for the odds ratio is 1.045 to 1.065.We are 95% confident that each additional year of age increases the odds of diabetes by 4.5% to 6.5%.
# YOUR CODE HERE: Fit a multiple logistic regression model
# Outcome: hypertension
# Predictors: age_cont, sex, bmi_cat, phys_active, current_smoker
model_hyprtnsn_multiple <- glm(hypertension ~ age_cont + sex + bmi_cat +
phys_active + current_smoker,
data = brfss_data,
family = binomial(link = "logit"))# Display results
tidy(model_hyprtnsn_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression: Hypertension ~ Age + Covariates (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
scroll_box(height = "400px")| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.008 | 0.653 | -7.355 | 0.000 | 0.002 | 0.028 |
| age_cont | 1.061 | 0.005 | 11.234 | 0.000 | 1.050 | 1.073 |
| sexMale | 1.270 | 0.123 | 1.950 | 0.051 | 0.999 | 1.616 |
| bmi_catNormal | 2.097 | 0.546 | 1.356 | 0.175 | 0.759 | 6.756 |
| bmi_catOverweight | 3.241 | 0.543 | 2.166 | 0.030 | 1.183 | 10.385 |
| bmi_catObese | 6.585 | 0.545 | 3.459 | 0.001 | 2.394 | 21.176 |
| phys_active | 0.900 | 0.130 | -0.808 | 0.419 | 0.697 | 1.162 |
| current_smoker | 1.071 | 0.139 | 0.495 | 0.621 | 0.817 | 1.407 |
## BMI Category Odds Ratios (vs. Underweight):
## Normal weight: OR = 2.1
## Overweight: OR = 3.24
## Obese: OR = 6.59
Questions:
The OR for age increased after adjusting for sex, BMI category, physical activity, and smoking.Some of the covariates (especially BMI or sex) were masking part of the true association between age and hypertension in the crude model.After controlling for them, the age–hypertension association became slightly stronger. When older adults also tend to have characteristics that partially overlap with hypertension risk (e.g., higher BMI), but those characteristics don’t fully explain the age effect.
Based on magnitude of OR and statistical significance: Strongest predictors are
# YOUR CODE HERE: Create a table showing the dummy variable coding for bmi_cat
# Extract dummy variable coding
bmi_dummy_table <- data.frame(
bmi_cat = c("Normal", "Overweight", "Obese"),
`Dummy 1 (Normal)` = c(1, 0, 0),
`Dummy 2 (Overweight)` = c(0, 1, 0),
`Dummy 3 (Obese)` = c(0, 0, 1),
check.names = FALSE
)
bmi_dummy_table %>%
kable(caption = "Dummy Variable Coding for BMI Category",
align = "lccc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#ffe6e6") # Highlight reference category| bmi_cat | Dummy 1 (Normal) | Dummy 2 (Overweight) | Dummy 3 (Obese) |
|---|---|---|---|
| Normal | 1 | 0 | 0 |
| Overweight | 0 | 1 | 0 |
| Obese | 0 | 0 | 1 |
# Extract BMI coefficients
bmi_coefs <- tidy(model_hyprtnsn_multiple, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, "bmi_cat")) %>%
mutate(
bmi_cat = str_remove(term, "bmi_cat"),
bmi_cat = factor(bmi_cat,
levels = c("Underweight", "Normal", "Overweight", "Obese"))
)
# Add reference category
ref_row <- data.frame(
term = "BMI_catUnderweight",
estimate = 1.0,
std.error = 0,
statistic = NA,
p.value = NA,
conf.low = 1.0,
conf.high = 1.0,
bmi_cat = factor("Underweight (Ref)",
levels = c("Underweight (Ref)",
"Normal",
"Overweight",
"Obese"))
)
##combining and reordering
bmi_coefs_full <- bind_rows(ref_row, bmi_coefs) %>%
mutate(
bmi_cat = factor(bmi_cat,
levels = c("Underweight (Ref)",
"Normal",
"Overweight",
"Obese")))
#Visualize the new underweight reference category
bmi_coefs_full %>%
select(bmi_cat, estimate, conf.low, conf.high, p.value) %>%
kable(caption = "Odds Ratios for BMI Categories (Reference: UnderWeight)",
digits = 3,
col.names = c("BMI Category", "Odds Ratio", "95% CI Lower", "95% CI Upper", "p-value"),
align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| BMI Category | Odds Ratio | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|
| Underweight (Ref) | 1.000 | 1.000 | 1.000 | NA |
| Normal | 2.097 | 0.759 | 6.756 | 0.175 |
| Overweight | 3.241 | 1.183 | 10.385 | 0.030 |
| Obese | 6.585 | 2.394 | 21.176 | 0.001 |
# Plot the odds ratios
p_bmi <- ggplot(bmi_coefs_full, aes(x = bmi_cat, y = estimate)) +
geom_hline(yintercept = 1, linetype = "dashed", color = "gray50") +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high),
size = 0.8, color = "darkblue") +
coord_flip() +
labs(
title = "Odds Ratios for BMI Categories and Hypertension",
subtitle = "Reference: Underweight",
x = "BMI Category",
y = "Odds Ratio (95% CI)"
) +
theme_minimal(base_size = 12)
ggplotly(p_bmi)Odds Ratios for BMI Category Levels
# Interpretation table
bmi_interpret <- data.frame(
`BMI Category` = c( "Underweight", "Normal", "Overweight", "Obese"),
`Odds Ratio` = c(1.0, round(filter(bmi_coefs, term == "bmi_catNormal")$estimate, 3),
round(filter(bmi_coefs, term == "bmi_catOverweight")$estimate, 3),
round(filter(bmi_coefs, term == "bmi_catObese")$estimate, 3)),
`Interpretation` = c(
"Reference group",
paste0("About ", round(filter(bmi_coefs, term == "bmi_catNormal")$estimate - 1, 1)*100, "% higher/lower odds than underweight"),
paste0("About ", round((filter(bmi_coefs, term == "bmi_catOverweight")$estimate - 1)*100, 0), "% higher odds than underweight"),
paste0("About ", round((filter(bmi_coefs, term == "bmi_catObese")$estimate - 1)*100, 0), "% higher odds than underweight")
),
check.names = FALSE
)
cat("\n=BMI INTERPRETATION =\n")##
## =BMI INTERPRETATION =
## BMI Category Odds Ratio Interpretation
## 1 Underweight 1.000 Reference group
## 2 Normal 2.097 About 110% higher/lower odds than underweight
## 3 Overweight 3.241 About 224% higher odds than underweight
## 4 Obese 6.585 About 559% higher odds than underweight
# Plot model coefficients with `ggcoef_model()`
ggcoef_model(model_hyprtnsn_multiple, exponentiate = TRUE,
include = c("bmi_cat"),
variable_labels = c(
BMI = "bmi_cat"),
facet_labeller = ggplot2::label_wrap_gen(10)
)
Questions:
The reference category is Underweight (Odds Ratio = 1.00).
The odds ratio for the Obese group is 6.585, meaning their odds are 6.6 times higher than the underweight group.
# YOUR CODE HERE: Fit a model with Age × BMI interaction
# Test if the effect of age on hypertension differs by BMI category
# Model with interaction term
model_interaction <- glm(hypertension ~ age_cont* bmi_cat,data = brfss_data,
family = binomial(link = "logit"))
# Model WITHOUT interaction (additive model)
model_no_interaction <- glm(hypertension ~ age_cont + bmi_cat,
data = brfss_data,
family = binomial(link = "logit"))
# Display interaction results
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
filter(str_detect(term, ":")) %>%
kable(caption = "Age × BMI Interaction Model (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| age_cont:bmi_catNormal | 1.058 | 0.043 | 1.306 | 0.192 | 0.957 | 1.147 |
| age_cont:bmi_catOverweight | 1.063 | 0.043 | 1.423 | 0.155 | 0.962 | 1.151 |
| age_cont:bmi_catObese | 1.054 | 0.042 | 1.232 | 0.218 | 0.954 | 1.140 |
# show full model
tidy(model_interaction, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Full Age × BMI Interaction Model",
digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.259 | 2.536 | -0.533 | 0.594 | 0.000 | 25.001 |
| age_cont | 1.004 | 0.042 | 0.102 | 0.918 | 0.929 | 1.108 |
| bmi_catNormal | 0.065 | 2.632 | -1.039 | 0.299 | 0.001 | 38.322 |
| bmi_catOverweight | 0.078 | 2.606 | -0.978 | 0.328 | 0.001 | 44.673 |
| bmi_catObese | 0.263 | 2.573 | -0.519 | 0.604 | 0.003 | 144.533 |
| age_cont:bmi_catNormal | 1.058 | 0.043 | 1.306 | 0.192 | 0.957 | 1.147 |
| age_cont:bmi_catOverweight | 1.063 | 0.043 | 1.423 | 0.155 | 0.962 | 1.151 |
| age_cont:bmi_catObese | 1.054 | 0.042 | 1.232 | 0.218 | 0.954 | 1.140 |
Interpretation:
# Model 1: Age only (model_no_interaction)
model1 <- model_no_interaction
#Model 2: Age + BMI (Model_interaction)
model2 <- model_no_interaction
# Model 3: Full model
model3 <- model_hyprtnsn_multiple
# Likelihood ratio test
lrt_1_2 <- anova(model1, model2, test = "LRT")
lrt_2_3 <- anova(model2, model3, test = "LRT")
# Create comparison table
model_comp <- data.frame(
Model = c("Model 1: Age only",
"Model 2: Age + BMI",
"Model 3: Full model"),
AIC = c(AIC(model1), AIC(model2), AIC(model3)),
BIC = c(BIC(model1), BIC(model2), BIC(model3)),
`Deviance` = c(deviance(model1), deviance(model2), deviance(model3)),
check.names = FALSE
)
model_comp %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comp$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model 1: Age only | 1578.06 | 1603.84 | 1568.06 |
| Model 2: Age + BMI | 1578.06 | 1603.84 | 1568.06 |
| Model 3: Full model | 1579.50 | 1620.74 | 1563.50 |
Questions:
There is no evidence of effect modification by BMI on the age–hypertension relationship. The association between age and the hypertension appears consistent across BMI groups. Age increases risk at about the same rate whether someone is normal weight, overweight, or obese.
# Generate predicted probabilities by BMI
pred_interact <- ggpredict(model_interaction, terms = c("age_cont", "bmi_cat"))
# Plot
p4 <- ggplot(pred_interact, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Probability of Hypertension by Age and BMI",
subtitle = "Testing for Age × BMI Interaction",
x = "Age (years)",
y = "Predicted Probability of Hypertension",
color = "BMI",
fill = "BMI"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_color_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "pink")) +
scale_fill_manual(values = c("Normal" = "#E64B35", "Overweight" = "#4DBBD5", "Obese" = "pink")) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")
ggplotly(p4)Age-hypertension Relationship by BMI
# YOUR CODE HERE: Calculate VIF for your multiple regression model
# Calculate VIF
vif_values <- vif(model_hyprtnsn_multiple)
# Create VIF table
# For models with categorical variables, vif() returns GVIF (Generalized VIF)
if (is.matrix(vif_values)) {
# If matrix (categorical variables present), extract GVIF^(1/(2*Df))
vif_df <- data.frame(
Variable = rownames(vif_values),
VIF = vif_values[, "GVIF^(1/(2*Df))"]
)
} else {
# If vector (only continuous variables)
vif_df <- data.frame(
Variable = names(vif_values),
VIF = as.numeric(vif_values)
)
}
# Add interpretation
vif_df <- vif_df %>%
arrange(desc(VIF)) %>%
mutate(
Interpretation = case_when(
VIF < 5 ~ "Low (No concern)",
VIF >= 5 & VIF < 10 ~ "Moderate (Monitor)",
VIF >= 10 ~ "High (Problem)"
)
)
vif_df %>%
kable(caption = "Variance Inflation Factors (VIF) for Multiple Regression Model",
digits = 2,
align = "lrc") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which(vif_df$VIF >= 10), bold = TRUE, color = "white", background = "#DC143C") %>%
row_spec(which(vif_df$VIF >= 5 & vif_df$VIF < 10), background = "#FFA500") %>%
row_spec(which(vif_df$VIF < 5), background = "#90EE90")| Variable | VIF | Interpretation | |
|---|---|---|---|
| age_cont | age_cont | 1.06 | Low (No concern) |
| current_smoker | current_smoker | 1.04 | Low (No concern) |
| bmi_cat | bmi_cat | 1.02 | Low (No concern) |
| phys_active | phys_active | 1.01 | Low (No concern) |
| sex | sex | 1.01 | Low (No concern) |
Cook’s Distance: Measures how much the model would change if an observation were removed.
# Calculate Cook's distance
cooks_d <- cooks.distance(model_hyprtnsn_multiple)
# Create data frame
influence_df <- data.frame(
observation = 1:length(cooks_d),
cooks_d = cooks_d
) %>%
mutate(influential = ifelse(cooks_d > 1, "Yes", "No"))
# Plot
p5 <- ggplot(influence_df, aes(x = observation, y = cooks_d, color = influential)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
labs(
title = "Cook's Distance: Identifying Influential Observations",
subtitle = "Values > 1 indicate potentially influential observations",
x = "Observation Number",
y = "Cook's Distance",
color = "Influential?"
) +
scale_color_manual(values = c("No" = "steelblue", "Yes" = "red")) +
theme_minimal(base_size = 12)
ggplotly(p5)Cook’s Distance for Influential Observations
# Count influential observations
n_influential <- sum(influence_df$influential == "Yes")
cat("Number of potentially influential observations:", n_influential, "\n")## Number of potentially influential observations: 0
Questions:
There is no major multicollinearity concern for the main-effects model.
Cook’s Distance plot shows: - Number of potentially influential observations: 0 - All observation points fall below the common threshold of 1.0
There are no influential observations that would meaningfully distort or bias your regression results.
Investigate multicollinearity (Examine VIF values, Remove or combine correlated predictors, Avoid unnecessary interaction terms)
Address influential observations (Inspect those cases individually, Check for data entry errors, Consider robust regression or sensitivity analyses)
Refit a simpler, more stable model (Prefer models with lower AIC/BIC, Remove non‑significant interaction terms, Ensure interpretability and epidemiologic plausibility)
# YOUR CODE HERE: Compare three models using AIC and BIC
# Model A: Age only
# Model B: Age + sex + bmi_cat
# Model C: Age + sex + bmi_cat + phys_active + current_smoker
# Model A: Age only
modela <- glm(hypertension ~ age_cont,
data = brfss_data,
family = binomial)
# Model B: Model with age + demographics + BMI
modelb <- glm(hypertension ~ age_cont + sex + bmi_cat,
data = brfss_data,
family = binomial)
# Model C: Full model with all predictors
modelc <- glm(hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker,
data = brfss_data,
family = binomial)
# Create comparison table
model_comparison <- data.frame(
Model = c("Model A: Age only",
"Model B: Age + Sex + BMI",
"Model C: Full model\n(+ Activity + Smoking)"),
AIC = c(AIC(modela), AIC(modelb), AIC(modelc)),
BIC = c(BIC(modela), BIC(modelb), BIC(modelc)),
`Deviance` = c(deviance(modela), deviance(modelb), deviance(modelc)),
check.names = FALSE
)
model_comparison %>%
kable(caption = "Model Comparison: AIC, BIC, and Deviance",
digits = 2,
align = "lrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE) %>%
row_spec(which.min(model_comparison$AIC), bold = TRUE, background = "#d4edda")| Model | AIC | BIC | Deviance |
|---|---|---|---|
| Model A: Age only | 1636.61 | 1646.92 | 1632.61 |
| Model B: Age + Sex + BMI | 1576.49 | 1607.42 | 1564.49 |
| Model C: Full model (+ Activity + Smoking) |
1579.5
|
# Likelihood ratio test
lrt_a_b <- anova(modela, modelb, test = "LRT")
cat("Test 1: Model A vs Model B\n")## Test 1: Model A vs Model B
## Analysis of Deviance Table
##
## Model 1: hypertension ~ age_cont
## Model 2: hypertension ~ age_cont + sex + bmi_cat
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1279 1632.6
## 2 1275 1564.5 4 68.126 5.643e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("Interpretation: Adding sex and BMI to age significantly improves fit:", lrt_a_b$`Pr(>Chi)`[2] < 0.05, "\n\n")## Interpretation: Adding sex and BMI to age significantly improves fit: TRUE
##
## Test 2: Model B vs Model C
## Analysis of Deviance Table
##
## Model 1: hypertension ~ age_cont + sex + bmi_cat
## Model 2: hypertension ~ age_cont + sex + bmi_cat + phys_active + current_smoker
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1275 1564.5
## 2 1273 1563.5 2 0.99112 0.6092
cat("Interpretation: Adding physical activity and smoking significantly improves fit:", lrt_b_c$`Pr(>Chi)`[2] < 0.05, "\n\n")## Interpretation: Adding physical activity and smoking significantly improves fit: FALSE
Questions:
According to the AIC values: - Model A (Age only): 1636.61 - Model B (Age + Sex + BMI): 1576.49 - Model C (Full model): 1579.50
Model B has the lowest AIC (1576.49), meaning it provides the best overall fit while balancing model complexity and goodness of fit.
Looking at the AIC/BIC values, Model C has higher AIC and higher BIC than Model B. This means the extra variables (activity + smoking) do not improve the model enough to justify their inclusion.When we look at P-value from deviance table for Model B vs Model C, LRT p-value = 0.6092 > 0.05. Adding physical activity and smoking does not significantly improve model fit. The added complexity is not justified.
I would choose Model B (Age + Sex + BMI) because it has the lowest AIC, indicating the best balance of fit and parsimony.The LRT shows that adding activity and smoking does not improve the model. Model B includes important epidemiologic predictors (age, sex, BMI).
Write a brief report (1-2 pages) summarizing your findings:
Logistic Regression Analysis of Hypertension
Specifically, the research question was:
“Which factors are associated with hypertension, and is the effect of age modified by BMI or other covariates?”
Given that the overall prevalence of hypertension in the dataset was 52.7% (675 cases), understanding these predictors is important for identifying high‑risk groups and informing prevention strategies.
The analytic steps included: - Model building: Model A: Age only Model B: Age + Sex + BMI Model C: Full model (Age + Sex + BMI + Activity + Smoking)
All analyses were conducted using logistic regression in R.
3.2 Main Effects Model (Model B) Key adjusted odds ratios: For age: 1.061 AoR (Each additional year increases odds by ~6%) For BMI, obese group, AoR 6.59 (obese people have 6.6 times higher odds of hypertension compared to underweight reference group.)
3.3 Interaction Model Age × BMI interaction terms were not statistically significant (all p > 0.05). This indicates no evidence of effect modification: the effect of age on hypertension does not differ across BMI categories.
3.4 Model Comparison
Model B had the lowest AIC. Adding sex and BMI significantly improved fit and adding activity and smoking did not improve fit.
3.5 Diagnostics
Influential observations: Cook’s Distance plot showed 0 influential observations. Multicollinearity: No major concerns in the main-effects model.
Age and BMI are strong, independent predictors of hypertension. The odds of hypertension increase steadily with age, and individuals who are overweight or obese have higher odds compared to those who are underweight.
There is no evidence of effect modification between age and BMI, showing the increase in hypertension risk with age is consistent across BMI groups. Model comparison strongly supports Model B (Age + Sex + BMI) as the most appropriate model. Adding physical activity and smoking does not improve model fit and introduces complexity.
Overall, it was found that hypertension risk increases with age and higher BMI, and these effects appear additive rather than interactive.
Submission: Submit your completed R Markdown file and knitted HTML report.
Logistic Regression:
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\]
Odds Ratio:
\[\text{OR} = e^{\beta_i}\]
Predicted Probability:
\[p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}\]
Session Info
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggstats_0.12.0 gtsummary_2.5.0 ggeffects_2.3.2 car_3.1-3
## [5] carData_3.0-5 broom_1.0.11 plotly_4.12.0 kableExtra_1.4.0
## [9] knitr_1.51 haven_2.5.5 lubridate_1.9.4 forcats_1.0.1
## [13] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.1.6
## [17] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0
## [4] htmlwidgets_1.6.4 insight_1.4.5 tzdb_0.5.0
## [7] vctrs_0.7.1 tools_4.5.2 crosstalk_1.2.2
## [10] generics_0.1.4 datawizard_1.3.0 pkgconfig_2.0.3
## [13] data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1
## [16] lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2
## [19] textshaping_1.0.4 htmltools_0.5.9 sass_0.4.10
## [22] yaml_2.3.12 lazyeval_0.2.2 Formula_1.2-5
## [25] pillar_1.11.1 jquerylib_0.1.4 broom.helpers_1.22.0
## [28] cachem_1.1.0 abind_1.4-8 tidyselect_1.2.1
## [31] digest_0.6.39 stringi_1.8.7 labeling_0.4.3
## [34] labelled_2.16.0 fastmap_1.2.0 grid_4.5.2
## [37] cli_3.6.5 magrittr_2.0.4 cards_0.7.1
## [40] withr_3.0.2 scales_1.4.0 backports_1.5.0
## [43] timechange_0.3.0 rmarkdown_2.30 httr_1.4.7
## [46] otel_0.2.0 hms_1.1.4 evaluate_1.0.5
## [49] viridisLite_0.4.2 rlang_1.1.7 glue_1.8.0
## [52] xml2_1.5.2 svglite_2.2.2 rstudioapi_0.18.0
## [55] jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1