Modeling Cardiovascular Risk from Lifestyle and Health Factors

Project Overview

This project explores associations between three modifiable lifestyle and health factors (exercise habits, smoking history, body weight) and angina/coronary heart disease (CHD) among a large sample of U.S. adults, using 2024 data from the Behavioral Risk Factor Surveillance System (BRFSS) by the Centers for Disease Control and Prevention (CDC).

Logistic regression models were fitted in R to estimate the relationships between the predictors (exercise habits, smoking history, body weight) and the outcome (angina/CHD). The best-fitting model was then applied to hypothetical individuals to illustrate how predicted cardiovascular risk varies across health and lifestyle profiles.

Dataset Description

The dataset used for this project is the publicly available 2024 BRFSS dataset from the CDC. The BRFSS is a large, cross-sectional health survey of U.S. adults conducted annually. The full 2024 BRFSS dataset contains 457,670 observations.

Variables Used

For this analysis, the original dataset was subset to four variables of interest: the outcome CVDCRHD4 and the predictors EXERANY2, WEIGHT2, and SMOKE100. The predictors were selected because they represent modifiable, clinically relevant lifestyle and health factors commonly associated with cardiovascular disease risk, with direct implications for both public health interventions and individual prevention strategies.

CVDCRHD4: Presence or absence of angina/CHD (binary outcome)
EXERANY2: Any exercise outside of work in the past 30 days (binary predictor)
WEIGHT2: Body weight without shoes in pounds (numeric predictor)
SMOKE100: Smoked at least 100 cigarettes in lifetime (binary predictor)

Data Preparation

Setup

This analysis was conducted in R using packages including haven for data import, tidyverse for data manipulation and visualization, patchwork for combining plots, caret for statistical modeling, and car and lmtest for model diagnostics.

Data Cleaning

After subsetting the original dataset to the four variables of interest, only valid responses were retained. For the binary variables (CVDCRHD4, EXERANY2, SMOKE100), only “Yes” and “No” responses were kept; missing, refused, and uncertain responses were removed. WEIGHT2 values reported in kilograms were converted to pounds so that all weights share one unit. Binary variables were then converted to factors for categorical handling.

# Convert variables and clean data frame

# Convert WEIGHT2 to lbs
# Remove "Don't know/Not Sure" (7), "Refused" (9), "Not asked or Missing" (BLANK) from binary variables (CVDCRHD4, EXERANY2, SMOKE100) by filtering to show only "Yes" (1) and "No" (2) responses
# Remove "Don't know/Not sure" (7777) and "Refused" (9999) from WEIGHT2 
# Remove NAs from WEIGHT2 
# Convert variables CVDCRHD4, EXERANY2, and SMOKE100 to factors to treat binary variables as categorical instead of numeric

brfss_2024_subset_clean <- brfss_2024_subset |>
  # Recode WEIGHT2 to lbs; every other code becomes NA
  mutate(
    WEIGHT2 = case_when(
      WEIGHT2 >= 50 & WEIGHT2 <= 776 ~ WEIGHT2, # reported in lbs
      WEIGHT2 >= 9023 & WEIGHT2 <= 9352 ~ (WEIGHT2 - 9000) * 2.20462, # reported in kg (coded as 9000 + kg), converted to lbs
      TRUE ~ NA_real_ # "Don't know/Not sure" (7777), "Refused" (9999), BLANK
    )
  ) |>
  # Keep only valid responses
  filter(
    CVDCRHD4 %in% c(1, 2), # Include only "Yes" (1) and "No" (2) responses
    EXERANY2 %in% c(1, 2), # Include only "Yes" (1) and "No" (2) responses
    SMOKE100 %in% c(1, 2), # Include only "Yes" (1) and "No" (2) responses
    !(WEIGHT2 %in% c(7777, 9999)), # Safeguard only: 7777 and 9999 were already set to NA by case_when() above
    !is.na(WEIGHT2) # Remove NAs, including the recoded 7777 and 9999 values
  ) |>
  # Convert binary variables to factors
  mutate(
    CVDCRHD4 = factor(CVDCRHD4),
    EXERANY2 = factor(EXERANY2),
    SMOKE100 = factor(SMOKE100)
  )
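As a quick sanity check on the coding scheme described above, the following minimal sketch applies the same rules to a few hand-picked raw values using base R only. The helper name decode_weight2 and the sample values are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper that mirrors the case_when() rules above (base R only)
decode_weight2 <- function(x) {
  ifelse(
    x >= 50 & x <= 776, x,                    # already in pounds
    ifelse(
      x >= 9023 & x <= 9352,
      (x - 9000) * 2.20462,                   # coded as 9000 + kg, converted to pounds
      NA_real_                                # 7777, 9999, or any other code
    )
  )
}

# Made-up raw values: pounds, kilograms (70 kg), don't know, refused, blank
raw_values <- c(180, 9070, 7777, 9999, NA)
decode_weight2(raw_values)
```

Only the first two values survive the recoding; the remaining three become NA and would be dropped by the subsequent filter step.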

The Impact of Removing Missing Values

Missing values were removed rather than imputed because the proportion of missing data was low: below 6.5% for each variable and approximately 2.3% of all values across the four variables (Table 1). Given the size of the dataset, excluding these observations was expected to have minimal impact on later analyses.

Table 1: Percent of Missing Values by Variable and for Dataset Overall
Variable	Missing Values (%)
CVDCRHD4	0.0007
EXERANY2	0.0007
WEIGHT2	2.6947
SMOKE100	6.3059
Dataset overall	2.2505
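The percentages in Table 1 can be computed along the following lines. This is a minimal sketch on a made-up toy data frame, not the original code; it counts only NA values, and whether Table 1 was produced in exactly this way is an assumption.

```r
# Toy data frame standing in for the four-variable subset (values are made up)
toy <- data.frame(
  CVDCRHD4 = c(1, 2, 2, NA, 2),
  EXERANY2 = c(1, 1, 2, 2, 1),
  WEIGHT2  = c(180, NA, 9070, 150, NA),
  SMOKE100 = c(NA, 2, 1, 2, 2)
)

# Percent of missing values per variable
pct_missing_by_var <- colMeans(is.na(toy)) * 100

# Percent of missing cells across the whole data frame
pct_missing_overall <- mean(is.na(toy)) * 100

round(c(pct_missing_by_var, "Dataset overall" = pct_missing_overall), 4)
```

Because every column has the same number of rows, the overall figure equals the mean of the per-variable percentages, which is consistent with the values reported in Table 1.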

Exploratory Data Analysis

Univariate Exploratory Analysis of Binary Variables: `CVDCRHD4`, `EXERANY2`, `SMOKE100`

Angina/CHD was present in only a small portion of respondents (6.4%), as illustrated in Plot 1. Plot 2 shows that the majority of respondents, 77.3%, reported exercising outside of work in the past 30 days. Plot 3 shows that 39.4% of respondents reported having smoked at least 100 cigarettes in their lifetime, suggesting that a positive smoking history is relatively common in this sample. These distributions provide context for the behavioral predictors included in the subsequent logistic regression models.

# Visualize distribution of binary variables with statistics

# Calculate statistics for each plot

stats_p1 <- brfss_2024_subset_clean |>
  group_by(CVDCRHD4 = factor(CVDCRHD4, levels = c(1, 2), labels = c("Yes", "No"))) |>
  summarise(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 1))

stats_p2 <- brfss_2024_subset_clean |>
  group_by(EXERANY2 = factor(EXERANY2, levels = c(1, 2), labels = c("Yes", "No"))) |>
  summarise(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 1))

stats_p3 <- brfss_2024_subset_clean |>
  group_by(SMOKE100 = factor(SMOKE100, levels = c(1, 2), labels = c("Yes", "No"))) |>
  summarise(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 1))

# Create bar plots

p1 <- ggplot(stats_p1, aes(x = CVDCRHD4, y = count)) +
  geom_bar(stat = "identity", fill = "maroon", alpha = 0.7) +
  geom_text(aes(label = paste0(format(count, big.mark = ","), "\n(", percentage, "%)")),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Plot 1: Distribution of CVDCRHD4",
       x = "Angina/CHD Status",
       y = "Count")

p2 <- ggplot(stats_p2, aes(x = EXERANY2, y = count)) +
  geom_bar(stat = "identity", fill = "aquamarine4", alpha = 0.7) +
  geom_text(aes(label = paste0(format(count, big.mark = ","), "\n(", percentage, "%)")),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Plot 2: Distribution of EXERANY2",
       x = "Exercise in Past 30 Days",
       y = "Count")

p3 <- ggplot(stats_p3, aes(x = SMOKE100, y = count)) +
  geom_bar(stat = "identity", fill = "deepskyblue3", alpha = 0.7) +
  geom_text(aes(label = paste0(format(count, big.mark = ","), "\n(", percentage, "%)")),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Plot 3: Distribution of SMOKE100",
       x = "Smoked >= 100 Cigarettes in Lifetime",
       y = "Count")

# Combine plots

p1 + p2 + p3 + 
  plot_layout(ncol = 3) &
  theme(plot.title = element_text(size = 9), 
        axis.text.x = element_text(size = 9),
        axis.title.x = element_text(size = 9),
        axis.title.y = element_text(size = 9)
        )

Univariate Exploratory Analysis of Numeric Variable: `WEIGHT2`

The WEIGHT2 variable has a right-skewed distribution (skew = 1.12), with the majority of values concentrated between roughly 125-250 lbs, and the middle 50% falling between 150 lbs and 210 lbs. Summary statistics indicate a median body weight of 178.0 lbs and a higher mean of 183.2 lbs, reflecting a subset of respondents reporting substantially higher weights.

# Visualize WEIGHT2 distribution - histogram

p4 <- ggplot(brfss_2024_subset_clean, aes(x = WEIGHT2)) +
  geom_histogram(binwidth = 20, fill = "peru", alpha = 0.7) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  labs(
    title = "Plot 4: Distribution of WEIGHT2",
    x = "Weight (lbs)",
    y = "Count"
  )

p4

# Summary statistics and skewness for WEIGHT2
# Note: skew() is not a base R function; it is assumed here to come from the psych package

summary(brfss_2024_subset_clean$WEIGHT2)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   50.0   150.0   178.0   183.2   210.0   776.0

skew(brfss_2024_subset_clean$WEIGHT2)

[1] 1.121688

To reduce the influence of extreme values, observations more than 1.5×IQR below the first quartile or more than 1.5×IQR above the third quartile were removed prior to modeling.

# Outlier removal in WEIGHT2 based on 1.5×IQR

# Define upper and lower fences (high and low outliers)

upper_fence <- quantile(brfss_2024_subset_clean$WEIGHT2, 0.75) + 1.5 * IQR(brfss_2024_subset_clean$WEIGHT2)
lower_fence <- quantile(brfss_2024_subset_clean$WEIGHT2, 0.25) - 1.5 * IQR(brfss_2024_subset_clean$WEIGHT2)

# Create new data frame without outliers

brfss_2024_subset_clean_no_outl <- brfss_2024_subset_clean |>
  filter(
    WEIGHT2 >= lower_fence,
    WEIGHT2 <= upper_fence
         )

This approach yielded a more stable representation of typical body weight values while retaining the majority of observations (98.3%).

# Assess outlier impact in percent

percent_retained_after_outl_remove <- (nrow(brfss_2024_subset_clean_no_outl) / nrow(brfss_2024_subset_clean)) * 100

percent_retained_after_outl_remove

[1] 98.27223
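For orientation, the fences can be approximated by hand from the quartiles printed by summary() above. The sketch below hard-codes those printed quartiles (150 and 210 lbs), so it assumes they are not affected by display rounding.

```r
# Quartiles as printed by summary() above (assumed exact)
q1 <- 150
q3 <- 210

iqr_weight <- q3 - q1                       # interquartile range
lower_fence_approx <- q1 - 1.5 * iqr_weight # lower fence
upper_fence_approx <- q3 + 1.5 * iqr_weight # upper fence

c(lower = lower_fence_approx, upper = upper_fence_approx)
```

Under this assumption, weights below about 60 lbs or above about 300 lbs were treated as outliers.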

Bivariate Exploratory Analysis

A higher proportion of respondents who reported no exercise outside of work in the past 30 days had angina/CHD compared to those who reported exercising (10.0% vs. 5.4%). Respondents who reported having smoked at least 100 cigarettes in their lifetime also showed a higher proportion of angina/CHD than those who had not (9.3% vs. 4.6%). Although these visualizations (Plots 5 and 6) do not account for potential confounding factors, they suggest possible associations between physical activity and angina/CHD status, and between smoking history and angina/CHD status.

# Calculate stats for p5

stats_p5 <- brfss_2024_subset_clean |>
  group_by(
    EXERANY2 = factor(EXERANY2, levels = c(1, 2), labels = c("Yes", "No")),
    CVDCRHD4 = factor(CVDCRHD4, levels = c(1, 2), labels = c("Angina/CHD", "No Angina/CHD"))
  ) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(EXERANY2) |>
  mutate(
    total = sum(count),
    percentage = round(count / total * 100, 1),
    # Position for label (middle of each segment in proportional chart)
    prop = count / total,
    label_y = 1 - cumsum(prop) + prop/2
  )

# Create p5 with percentage labels

p5 <- ggplot(stats_p5, aes(x = EXERANY2, y = prop, fill = CVDCRHD4)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  geom_text(aes(y = label_y, label = paste0(percentage, "%")),
            fontface = "bold", size = 3) +
  labs(
    title = "Plot 5: Angina/CHD by Exercise in Past 30 Days",
    x = "Exercise in Past 30 Days",
    y = "Proportion",
    fill = "Angina/CHD Status"
  )

# Calculate stats for p6

stats_p6 <- brfss_2024_subset_clean |>
  group_by(
    SMOKE100 = factor(SMOKE100, levels = c(1, 2), labels = c("Yes", "No")),
    CVDCRHD4 = factor(CVDCRHD4, levels = c(1, 2), labels = c("Angina/CHD", "No Angina/CHD"))
  ) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(SMOKE100) |>
  mutate(
    total = sum(count),
    percentage = round(count / total * 100, 1),
    prop = count / total,
    label_y = 1 - cumsum(prop) + prop/2
  )

# Create p6 with percentage labels

p6 <- ggplot(stats_p6, aes(x = SMOKE100, y = prop, fill = CVDCRHD4)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  geom_text(aes(y = label_y, label = paste0(percentage, "%")),
            fontface = "bold", size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Plot 6: Angina/CHD by Lifetime Cigarette Smoking",
    x = "Smoked >= 100 Cigarettes in Lifetime",
    y = "Proportion",
    fill = "Angina/CHD Status"
  )

# Combine plots

p5 + p6 +
  plot_layout(ncol = 2) &
  theme(
    plot.title = element_text(size = 9), 
    axis.text.x = element_text(size = 9),
    axis.title.x = element_text(size = 9),
    axis.title.y = element_text(size = 9),
    legend.title = element_text(size = 9)
  )

Plot 7 indicates that respondents with angina/CHD had higher median body weight than those without angina/CHD (185 lbs vs. 177 lbs), with a statistically significant difference in distributions (Wilcoxon rank-sum test, p < 0.001).

ggplot(brfss_2024_subset_clean,
       aes(x = factor(CVDCRHD4),
           y = WEIGHT2,
           fill = factor(CVDCRHD4))) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  coord_cartesian(ylim = c(50, 320)) +
  labs(
    x = "Angina/CHD Status",
    y = "Weight (lbs)",
    title = "Plot 7: Weight Distribution by Angina/CHD Status",
    fill = "Angina/CHD Status"
  ) +
  scale_x_discrete(
    labels = c("Angina/CHD", "No Angina/CHD")
  ) +
  scale_fill_manual(
    values = c("coral2", "cornflowerblue"),
    labels = c("Angina/CHD", "No Angina/CHD")
  ) +
  stat_summary(
    fun = median,
    geom = "text",
    aes(label = after_stat(round(y, 1))),
    vjust = -0.7,
    size = 3.5
  )

Consistent with the visual patterns, bivariate analyses indicated statistically significant associations between angina/CHD status and exercise habits, smoking history, and body weight (chi-squared tests for categorical variables and Wilcoxon rank-sum test for weight, all p < 0.001), supporting their consideration for inclusion in the logistic regression models. Final variable inclusion was determined through multivariable model comparison rather than bivariate significance alone.

Table 2: Statistical Tests for Association With CVDCRHD4
Variable	Test	Statistic	P-value
EXERANY2	Chi-squared	2479.45	< 0.001
SMOKE100	Chi-squared	3471.91	< 0.001
WEIGHT2	Wilcoxon	5316279573.50	< 0.001
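The code behind Table 2 is not shown, so the following is only a sketch of how such tests could be run with base R on simulated data. The data frame toy and its column names are hypothetical, and because the simulated variables are independent of each other, the resulting p-values will not resemble those in Table 2.

```r
set.seed(1)
n <- 2000

# Simulated stand-in data (not BRFSS): binary outcome, binary predictor, weight
toy <- data.frame(
  chd    = factor(rbinom(n, 1, 0.06), levels = c(1, 0), labels = c("Yes", "No")),
  exer   = factor(rbinom(n, 1, 0.77), levels = c(1, 0), labels = c("Yes", "No")),
  weight = round(rnorm(n, mean = 183, sd = 40))
)

# Chi-squared test of independence for two categorical variables
chi_res <- chisq.test(table(toy$exer, toy$chd))

# Wilcoxon rank-sum test comparing weight between outcome groups
wil_res <- wilcox.test(weight ~ chd, data = toy)

data.frame(
  Variable  = c("exer", "weight"),
  Test      = c("Chi-squared", "Wilcoxon"),
  Statistic = c(unname(chi_res$statistic), unname(wil_res$statistic)),
  P_value   = c(chi_res$p.value, wil_res$p.value)
)
```

The Wilcoxon statistic grows with the product of the two group sizes, which is why the value reported in Table 2 is so large.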

Logistic Regression Modeling

Four logistic regression models were fitted to explore associations between angina/CHD status and the selected predictors. Model 1 included physical activity (EXERANY2) alone. Model 2 added smoking history (SMOKE100) to examine combined effects of behavioral factors. Model 3 included exercise, smoking history, and body weight (WEIGHT2). Model 4 included smoking history, body weight, and their interaction. The interaction term was included as an exploratory test of potential effect modification. The models were assessed individually and then compared to identify the most appropriate set of predictors for the final model.

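Prior to modeling, binary variables were recoded from 1/2 to 1/0 and converted from factor to numeric, so that 1 denotes “Yes” and 0 denotes “No”. With this coding, the models estimate the odds of angina/CHD being present, and “No” serves as the reference category for each binary predictor.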

# Use the clean dataset excluding outliers for regression (brfss_2024_subset_clean_no_outl)

# Recode binary variables from 1/2 to 1/0, simultaneously convert from factor to numeric for logistic regression

brfss_2024_subset_clean_no_outl_bin <- brfss_2024_subset_clean_no_outl |>
  mutate(across(c(CVDCRHD4, EXERANY2, SMOKE100),
                ~ ifelse(. == 1, 1, 0)
                )
         )

Model 1, Baseline Association: Physical Activity Only

The first model assessed the association between exercise habits and angina/CHD status, serving as a baseline estimate of the relationship between physical activity and cardiovascular outcomes. The model specification is shown below.

# Model 1: Predicting CVDCRHD4 from EXERANY2

mod_1 <- glm(
  CVDCRHD4 ~ EXERANY2,
  family = binomial(),
  data = brfss_2024_subset_clean_no_outl_bin
)

Respondents who reported exercising outside of work in the past 30 days had 49% lower odds of angina/CHD compared to non-exercisers (OR = 0.51, 95% CI: 0.50-0.53, p < 0.001). Because the data are cross-sectional, this association may reflect a protective effect of physical activity, reduced activity among individuals with angina/CHD, or both.

Table 3: Odds Ratios, 95% Confidence Intervals, and P-values for Model 1 Predicting CVDCRHD4
Variable	Odds Ratio	95% CI Lower	95% CI Upper	P-value
(Intercept)	0.11	0.11	0.11	< 0.001
EXERANY2	0.51	0.50	0.53	< 0.001

Model 2, Additive Behavioral Effect: Physical Activity and Smoking History

The second model incorporated smoking history in addition to exercise habits to evaluate whether the association observed in the baseline model persisted after accounting for a potential cardiovascular risk factor. The model specification is shown below.

mod_2 <- glm(
  CVDCRHD4 ~ EXERANY2 + SMOKE100,
  family = binomial(),
  data = brfss_2024_subset_clean_no_outl_bin
  )

Adjusting for smoking history, physical activity remained inversely associated with angina/CHD (OR = 0.55, 95% CI: 0.54-0.57, p < 0.001). Smoking history was also significantly associated with angina/CHD. Respondents who had smoked at least 100 cigarettes had roughly twice the odds of angina/CHD compared to those who had not (OR = 2.02, 95% CI: 1.97-2.07, p < 0.001). The attenuation of the exercise odds ratio relative to the baseline model suggests partial confounding by smoking history.

Table 4: Odds Ratios, 95% Confidence Intervals, and P-values for Model 2 Predicting CVDCRHD4
Variable	Odds Ratio	95% CI Lower	95% CI Upper	P-value
(Intercept)	0.08	0.07	0.08	< 0.001
EXERANY2	0.55	0.54	0.57	< 0.001
SMOKE100	2.02	1.97	2.07	< 0.001

Model 3, Additional Health Effect: Physical Activity, Smoking History, and Weight

Model 3 built on the previous model by adding body weight as an additional health factor, allowing the exploration of the combined associations of exercise habits, smoking history, and body weight with angina/CHD. The model specification is shown below.

mod_3 <- glm(
  CVDCRHD4 ~ EXERANY2 + SMOKE100 + WEIGHT2, 
  family = binomial(), 
  data = brfss_2024_subset_clean_no_outl_bin
  )

Controlling for smoking history and body weight, respondents reporting exercise outside of work in the past 30 days had 44% lower odds of angina/CHD compared to non-exercisers (OR = 0.56, 95% CI: 0.54–0.58, p < 0.001). Controlling for exercise and body weight, respondents who had smoked at least 100 cigarettes had roughly twice the odds of angina/CHD compared to those who had not (OR = 2.00, 95% CI: 1.95–2.06, p < 0.001). Controlling for exercise and smoking history, higher body weight was significantly associated with increased odds of angina/CHD (0.3% higher odds per pound, OR = 1.003, 95% CI: 1.003–1.003, p < 0.001). Although the per-pound effect is small, larger differences in body weight (overweight or obesity) may meaningfully increase the odds of angina/CHD. The slight attenuation of the exercise and smoking history odds ratios compared to earlier models suggests partial confounding by body weight.

Table 5: Odds Ratios, 95% Confidence Intervals, and P-values for Model 3 Predicting CVDCRHD4
Variable	Odds Ratio	95% CI Lower	95% CI Upper	P-value
(Intercept)	0.044	0.041	0.047	< 0.001
EXERANY2	0.560	0.545	0.575	< 0.001
SMOKE100	2.004	1.953	2.057	< 0.001
WEIGHT2	1.003	1.003	1.003	< 0.001
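To make the per-pound odds ratio easier to interpret, it can be raised to the power of a weight difference. The sketch below uses the rounded WEIGHT2 odds ratio from Table 5; the weight differences are illustrative choices, and because the odds ratio is rounded to three decimals, the results are approximate.

```r
or_per_lb   <- 1.003          # WEIGHT2 odds ratio from Table 5 (rounded)
weight_diff <- c(10, 25, 50)  # illustrative weight differences in lbs (assumed)

# Odds ratios multiply, so a difference of d lbs corresponds to or_per_lb^d
or_scaled <- or_per_lb ^ weight_diff

round(data.frame(weight_diff, or_scaled), 2)
```

With these rounded inputs, a 50 lb difference corresponds to roughly 16% higher odds of angina/CHD, holding exercise and smoking history constant.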

Model 4, Effect Modification: Smoking History, Weight, and Interaction

Model 4 examined a potential effect modification by including an interaction between smoking history and body weight, assessing whether the association between weight and angina/CHD differed by smoking status. The model specification is shown below.

mod_4 <- glm(
  CVDCRHD4 ~ SMOKE100 + WEIGHT2 + SMOKE100:WEIGHT2, 
  family = binomial(), 
  data = brfss_2024_subset_clean_no_outl_bin
  )

Among respondents without a smoking history, higher body weight was associated with slightly higher odds of angina/CHD (0.4% higher odds per pound, OR = 1.004, 95% CI: 1.004–1.004, p < 0.001). The interaction term was below 1 (OR = 0.999, 95% CI: 0.998–0.999, p < 0.001), indicating that the per-pound association between weight and angina/CHD was slightly weaker among respondents with a smoking history. Because the model includes this interaction, the SMOKE100 main effect (OR = 2.72, 95% CI: 2.43–3.05, p < 0.001) is the smoking odds ratio at a body weight of 0 lbs, which lies outside the observed data; at realistic body weights the smoking odds ratio is smaller. These results suggest that the association between body weight and angina/CHD may differ by smoking history, and although the per-pound effect remains small, larger weight differences (overweight or obesity) may still be meaningful.

Table 6: Odds Ratios, 95% Confidence Intervals, and P-values for Model 4 Predicting CVDCRHD4
Variable	Odds Ratio	95% CI Lower	95% CI Upper	P-value
(Intercept)	0.023	0.021	0.025	< 0.001
SMOKE100	2.721	2.427	3.049	< 0.001
WEIGHT2	1.004	1.004	1.004	< 0.001
SMOKE100:WEIGHT2	0.999	0.998	0.999	< 0.001
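In a model with an interaction, the terms in Table 6 have to be combined before they can be interpreted. The sketch below does this with the rounded odds ratios from the table; the example weights are illustrative assumptions, and the interaction odds ratio is rounded to three decimals, so the results are only approximate.

```r
or_smoke_at_0 <- 2.721  # SMOKE100: smoking odds ratio at WEIGHT2 = 0
or_weight_ns  <- 1.004  # WEIGHT2: per-pound odds ratio when SMOKE100 = 0
or_inter      <- 0.999  # SMOKE100:WEIGHT2 interaction

# Per-pound weight odds ratio when SMOKE100 = 1
or_weight_smoker <- or_weight_ns * or_inter

# Smoking odds ratio at selected body weights (illustrative values, in lbs)
weights <- c(120, 180, 240)
or_smoke_at_w <- or_smoke_at_0 * or_inter ^ weights

round(or_weight_smoker, 3)
round(data.frame(weights, or_smoke_at_w), 2)
```

With these rounded inputs, the smoking odds ratio declines as weight increases and is noticeably lower than 2.72 across the realistic weight range.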

Model Diagnostics and Comparisons

Deviance and the Akaike Information Criterion (AIC) were used to compare fit across the four logistic regression models. Fit improved as predictors were added: adding smoking history to exercise habits (Model 2) reduced the AIC from 185,620 to 182,763, and adding body weight (Model 3) reduced it further to 182,402. Model 4, which includes the smoking × weight interaction but omits exercise, had an AIC of 184,016, lower than Model 1 but higher than Models 2 and 3. Because Model 4 omits exercise, a strong predictor in Models 1 through 3, it is not nested within those models, and its higher AIC most likely reflects the loss of that predictor rather than the contribution of the interaction term itself.

# Model comparisons

# Make a table to compare AIC and deviance for Models 1, 2, 3, 4

table_deviance_AIC <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4"),
  Deviance = c(deviance(mod_1), deviance(mod_2), deviance(mod_3), deviance(mod_4)),
  AIC = c(AIC(mod_1), AIC(mod_2), AIC(mod_3), AIC(mod_4))
  )

kable(
  table_deviance_AIC,
  caption = "Table 7: Deviance and Akaike Information Criterion (AIC) Values for Models 1-4",
  digits = 1,
  align = "lcc"
  )

Table 7: Deviance and Akaike Information Criterion (AIC) Values for Models 1-4
Model	Deviance	AIC
Model 1	185616.2	185620.2
Model 2	182756.7	182762.7
Model 3	182393.5	182401.5
Model 4	184008.0	184016.0
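For logistic regression on individual 0/1 outcomes, the residual deviance equals -2 times the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The sketch below checks this relationship against the values in Table 7; the parameter counts are read off the model formulas above.

```r
# Values copied from Table 7
deviance_reported <- c(185616.2, 182756.7, 182393.5, 184008.0)
aic_reported      <- c(185620.2, 182762.7, 182401.5, 184016.0)

# Number of estimated coefficients per model (intercept plus slopes)
n_params <- c(2, 3, 4, 4)

# AIC = deviance + 2 * (number of coefficients) for 0/1 outcome data
aic_from_deviance <- deviance_reported + 2 * n_params

all.equal(aic_from_deviance, aic_reported)
```

Models 3 and 4 carry the same penalty (four coefficients each), so the AIC gap between them comes entirely from the difference in deviance.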

Likelihood ratio tests indicated that Model 2 provided a significantly better fit than Model 1, and Model 3 provided a significantly better fit than Model 2 (both p < 0.001).

Table 8: Likelihood Ratio Tests Comparing Nested Logistic Regression Models
Comparison	df	Chi-squared	P-value
Model 1 → Model 2	1	2859.5	< 0.001
Model 2 → Model 3	1	363.2	< 0.001
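The likelihood ratio statistics in Table 8 are the differences in deviance between nested models, compared against a chi-squared distribution. The sketch below reproduces them from the deviances in Table 7. With the fitted model objects available, calls such as anova(mod_1, mod_2, test = "Chisq") or lmtest::lrtest(mod_1, mod_2) would be the usual route, though which one was used originally is not stated.

```r
# Deviances copied from Table 7
dev_mod_1 <- 185616.2
dev_mod_2 <- 182756.7
dev_mod_3 <- 182393.5

# Likelihood ratio statistic = drop in deviance between nested models
lr_1_vs_2 <- dev_mod_1 - dev_mod_2
lr_2_vs_3 <- dev_mod_2 - dev_mod_3

# Each comparison adds one coefficient, so df = 1
p_1_vs_2 <- pchisq(lr_1_vs_2, df = 1, lower.tail = FALSE)
p_2_vs_3 <- pchisq(lr_2_vs_3, df = 1, lower.tail = FALSE)

round(c(lr_1_vs_2 = lr_1_vs_2, lr_2_vs_3 = lr_2_vs_3), 1)
c(p_1_vs_2 = p_1_vs_2, p_2_vs_3 = p_2_vs_3)
```

Model 4 is not included because it is not nested within Models 1 through 3, so a likelihood ratio test does not apply to it.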

Multicollinearity was assessed using variance inflation factors (VIFs). All main effects in Models 2 and 3 had VIFs below 2, indicating no multicollinearity concerns. In Model 4, the VIFs for the interaction term and smoking history exceeded 19, reflecting the expected correlation between the interaction term and the main effect it involves.

Table 9: Variance Inflation Factors (VIFs) by Model
Variable	Model 2	Model 3	Model 4
EXERANY2	1.01	1.01	NA
SMOKE100	1.01	1.01	19.61
WEIGHT2	NA	1.00	2.22
SMOKE100:WEIGHT2	NA	NA	20.94
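The high VIFs in Model 4 arise because the product term is nearly a rescaled copy of the smoking indicator when weight is far from zero. The sketch below illustrates this on simulated data; the variable names and distributions are assumptions, and mean-centering is shown only for contrast, since it was not part of the original analysis.

```r
set.seed(7)
n <- 10000

# Simulated predictors (not BRFSS data)
smoke  <- rbinom(n, 1, 0.4)
weight <- rnorm(n, mean = 180, sd = 35)

# Correlation between the main effect and the raw product term
cor_raw <- cor(smoke, smoke * weight)

# Same correlation after centering weight at its mean
cor_centered <- cor(smoke, smoke * (weight - mean(weight)))

round(c(raw = cor_raw, centered = cor_centered), 3)
```

The raw correlation is close to 1, while the centered version is close to 0, which suggests the inflated VIFs reflect how the interaction is parameterized rather than redundant information in the predictors.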

Overall, Model 3 provided the best balance of fit, interpretability, and parsimony, and was selected as the final model for interpretation.

Prediction and Interpretation

To contextualize the final model’s findings, a new dataset of 12 hypothetical individuals was created to represent different combinations of exercise habits, smoking history, and body weight. The individuals were assigned to four categories: active non-smoker, active smoker, sedentary non-smoker, and sedentary smoker. Body weight for each individual was randomly assigned within a plausible range (100-250 lbs). Model 3 was then used to generate predicted probabilities of angina/CHD for each individual.

As shown in Table 10 and Plot 8, sedentary smokers consistently had the highest predicted risk (10.6%-13.8%), while active non-smokers had the lowest (3.4%-4.3%). Predicted risk increased with body weight in every profile. Because Model 3 contains no interaction terms, the relative (odds ratio) effect of weight is the same across profiles; the increase is larger in absolute terms for sedentary smokers because their baseline risk is highest. Active smokers had lower predicted risk than sedentary smokers (6.3%-7.8% vs. 10.6%-13.8%), but their risk remained higher than that of active non-smokers. These results illustrate how smoking history, physical inactivity, and higher body weight are jointly associated with higher predicted cardiovascular risk.

set.seed(2026)

# All profiles

profiles <- c("Active non-smoker", "Active smoker", "Sedentary non-smoker", "Sedentary smoker")
EXERANY2 <- c(1, 1, 0, 0)
SMOKE100  <- c(0, 1, 0, 1)

# Function to generate individuals with independent weight

generate_profiles_independent <- function(profile, exer, smoke, n_individuals) {
  data.frame(
    Profile = rep(profile, n_individuals),
    EXERANY2 = rep(exer, n_individuals),
    SMOKE100 = rep(smoke, n_individuals),
    WEIGHT2 = round(runif(n_individuals, min = 100, max = 250))
  )
}

# Generate three individuals per profile

hypothetical_profiles <- rbind(
  generate_profiles_independent("Active non-smoker", 1, 0, 3),
  generate_profiles_independent("Active smoker", 1, 1, 3),
  generate_profiles_independent("Sedentary non-smoker", 0, 0, 3),
  generate_profiles_independent("Sedentary smoker", 0, 1, 3)
)

# Predict probabilities

pred_probs <- predict(mod_3, newdata = hypothetical_profiles, type = "response")

# Convert to percent

risk_percent <- round(pred_probs * 100, 1)

# Construct final table

table_new <- hypothetical_profiles |>
  mutate(
    Risk_Percent = risk_percent
  ) |>
  select(
    Profile,
    EXERANY2,
    SMOKE100,
    WEIGHT2,
    Risk_Percent
  )

kable(
  table_new,
  caption = "Table 10: Characteristics of Hypothetical Individuals and Their Predicted Angina/CHD Risk (%)",
  col.names = c("Profile", "EXERANY2", "SMOKE100", "WEIGHT2", "Risk (%)"),
  digits = 1,
  align = "lcccc"
)

Table 10: Characteristics of Hypothetical Individuals and Their Predicted Angina/CHD Risk (%)
Profile	EXERANY2	SMOKE100	WEIGHT2	Risk (%)
Active non-smoker	1	0	205	4.3
Active non-smoker	1	0	183	4.0
Active non-smoker	1	0	121	3.4
Active smoker	1	1	143	7.0
Active smoker	1	1	183	7.8
Active smoker	1	1	104	6.3
Sedentary non-smoker	0	0	170	6.7
Sedentary non-smoker	0	0	229	7.9
Sedentary non-smoker	0	0	138	6.2
Sedentary smoker	0	1	187	13.2
Sedentary smoker	0	1	101	10.6
Sedentary smoker	0	1	204	13.8
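The predictions in Table 10 can be approximated by hand from the Model 3 odds ratios in Table 5. The sketch below is a rough check rather than a replacement for predict(); the function name predict_risk is hypothetical, and because the odds ratios are rounded to three decimals, the results differ slightly from the table.

```r
# Odds ratios copied from Table 5 (rounded to three decimals)
or <- c(intercept = 0.044, exer = 0.560, smoke = 2.004, weight = 1.003)

# Hypothetical helper: predicted risk (%) from the logistic model equation
predict_risk <- function(exer, smoke, weight) {
  log_odds <- log(or[["intercept"]]) +
    log(or[["exer"]])   * exer +
    log(or[["smoke"]])  * smoke +
    log(or[["weight"]]) * weight
  plogis(log_odds) * 100  # inverse logit, expressed as a percentage
}

risk_sedentary_smoker  <- predict_risk(exer = 0, smoke = 1, weight = 187)  # Table 10: 13.2
risk_active_non_smoker <- predict_risk(exer = 1, smoke = 0, weight = 205)  # Table 10: 4.3

round(c(risk_sedentary_smoker, risk_active_non_smoker), 1)
```

Both hand-calculated values land within a few tenths of a percentage point of the corresponding rows in Table 10, with the gap attributable to rounding of the coefficients.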

# Plot 8: Predicted Angina/CHD Risk by Lifestyle Profile and Weight

ggplot(table_new,
       aes(x = WEIGHT2,
           y = Risk_Percent,
           color = Profile,
           shape = Profile,
           group = Profile)) + 
  geom_line(linetype = "dotted", linewidth = 0.75) +
  geom_point(size = 4) +
  scale_shape_manual(values = c(16, 17, 15, 18)) +
  labs(
    title = "Plot 8: Predicted Angina/CHD Risk by Lifestyle Profile and Weight",
    x = "Weight (lbs)",
    y = "Predicted Risk (%)",
    color = "Lifestyle Profile",
    shape = "Lifestyle Profile"
  )

Limitations and Next Steps

The findings are supported by statistically significant associations, narrow confidence intervals, and improved model fit when key behavioral predictors were included, although in a sample of this size even small effects readily reach statistical significance. Model 3 offered the best balance of fit and interpretability, suggesting a stable association between smoking history, physical activity, body weight, and angina/CHD status in this dataset.

Several limitations should be considered when interpreting these results. First, the cross-sectional design precludes causal inference. Associations between angina/CHD and factors such as physical activity and body weight may be bidirectional, where angina/CHD may influence these health-related factors rather than the reverse.

Additional limitations include reliance on self-reported survey data, the use of binary indicators for smoking history and exercise habits, potential unmeasured confounding, and assumptions underlying predictions for hypothetical individuals.

Despite these limitations, the analysis demonstrates how behavioral and health factors can be incorporated into predictive modeling and highlights differences in estimated cardiovascular risk associated with smoking history, physical activity, and body weight. These results illustrate both the strengths and constraints of using large cross-sectional survey data for risk modeling in public health contexts.

Reference

Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System Survey Data, 2024. Accessed January 2026.