Hypertension is a primary risk factor for cardiovascular disease. In health data science, we analyze blood pressure (BP) as both a continuous outcome (measured in mmHg) and a categorical outcome (Hypertensive vs. Normotensive).
While clinicians use thresholds (e.g., 140/90 mmHg) to make treatment decisions, statisticians prefer continuous analysis. Dichotomizing continuous data leads to a loss of statistical power and masks individual variation within groups.
We will simulate a study of 500 participants. To ensure the statistical models are stable, we add “noise” to the relationship between predictors and outcomes.
set.seed(123)
n <- 500
# 1. Demographic variables
data_bp <- data.frame(
id = 1:n,
age = round(rnorm(n, mean = 55, sd = 12)),
sex = sample(c("Male", "Female"), n, replace = TRUE),
bmi = round(rnorm(n, mean = 28, sd = 5), 1)
)
# 2. Simulate SBP (influenced by Age and BMI + significant random error)
# Adding sd=15 to error_term prevents the model from being "too perfect"
data_bp$sbp <- 100 + (0.45 * data_bp$age) + (0.75 * data_bp$bmi) + rnorm(n, 0, 15)
# 3. Simulate DBP (correlated with SBP)
data_bp$dbp <- 50 + (0.4 * data_bp$sbp) + rnorm(n, 0, 8)
# 4. Create binary hypertension variable
# We add a random component to status to simulate biological variability
# and prevent "Perfect Separation" errors in Logistic Regression
prob_htn <- plogis((data_bp$sbp - 140) / 10)
data_bp$hypertension_status <- rbinom(n, 1, prob_htn)
data_bp$hypertension_label <- factor(data_bp$hypertension_status,
levels = c(0,1),
labels = c("Normotensive", "Hypertensive"))
kable(head(data_bp), caption = "Table 1: Simulated Patient Dataset")| id | age | sex | bmi | sbp | dbp | hypertension_status | hypertension_label |
|---|---|---|---|---|---|---|---|
| 1 | 48 | Male | 35.7 | 164.1505 | 110.0482 | 1 | Hypertensive |
| 2 | 52 | Female | 27.5 | 153.3686 | 118.4053 | 1 | Hypertensive |
| 3 | 74 | Female | 30.6 | 162.7543 | 114.0348 | 1 | Hypertensive |
| 4 | 56 | Male | 29.1 | 152.8163 | 102.1611 | 1 | Hypertensive |
| 5 | 57 | Female | 27.1 | 165.3448 | 119.8275 | 1 | Hypertensive |
| 6 | 76 | Male | 27.4 | 139.7161 | 118.0796 | 0 | Normotensive |
ggplot(data_bp, aes(x = sbp)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
geom_vline(aes(xintercept = mean(sbp)), color = "red", linetype = "dashed") +
labs(title = "Distribution of Systolic Blood Pressure", x = "SBP (mmHg)", y = "Count") +
theme_minimal()Figure 1: Distribution of Systolic Blood Pressure
We model SBP as a function of age, sex, and BMI.
lm_model <- lm(sbp ~ age + sex + bmi, data = data_bp)
# Tidy output with 95% Confidence Intervals
tidy_lm <- tidy(lm_model, conf.int = TRUE)
kable(tidy_lm, digits = 3, caption = "Table 2: Linear Regression results for SBP")| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 112.894 | 4.948 | 22.814 | 0.000 | 103.171 | 122.616 |
| age | 0.361 | 0.057 | 6.347 | 0.000 | 0.249 | 0.473 |
| sexMale | -1.391 | 1.333 | -1.043 | 0.297 | -4.011 | 1.229 |
| bmi | 0.514 | 0.131 | 3.919 | 0.000 | 0.256 | 0.772 |
Important Note: To avoid the approx()
error in R, we use confint.default(). This calculates
Wald Confidence Intervals, which are computationally
stable even if the data shows high separation.
# Fit the model
logit_model <- glm(hypertension_status ~ age + bmi,
data = data_bp,
family = binomial(link = "logit"))
# 1. Get Log-Odds and Wald CIs
results_log_odds <- tidy(logit_model, conf.int = TRUE, conf.method = "wald")
# 2. Convert to Odds Ratios (OR) by exponentiating
results_or <- results_log_odds %>%
mutate(
OR = exp(estimate),
conf.low = exp(conf.low),
conf.high = exp(conf.high)
) %>%
select(term, OR, conf.low, conf.high, p.value)
kable(results_or, digits = 3, caption = "Table 3: Logistic Regression Odds Ratios (Wald CIs)")| term | OR | conf.low | conf.high | p.value |
|---|---|---|---|---|
| (Intercept) | 0.308 | 0.079 | 1.178 | 0.087 |
| age | 1.023 | 1.007 | 1.039 | 0.006 |
| bmi | 1.017 | 0.981 | 1.055 | 0.354 |
In clinical settings, we often compare a new measurement device (Auto) against a gold standard (Manual).
# Simulate two measurements with a 3mmHg bias
data_bp$sbp_manual <- data_bp$sbp
data_bp$sbp_auto <- data_bp$sbp + 3 + rnorm(n, 0, 4)
data_bp <- data_bp %>%
mutate(
diff = sbp_auto - sbp_manual,
avg = (sbp_auto + sbp_manual) / 2
)
mean_diff <- mean(data_bp$diff)
sd_diff <- sd(data_bp$diff)
loa_upper <- mean_diff + (1.96 * sd_diff)
loa_lower <- mean_diff - (1.96 * sd_diff)ggplot(data_bp, aes(x = avg, y = diff)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = mean_diff, color = "blue") +
geom_hline(yintercept = c(loa_lower, loa_upper), color = "red", linetype = "dashed") +
labs(title = "Bland-Altman: Auto vs Manual SBP",
x = "Average SBP (mmHg)", y = "Difference (Auto - Manual)") +
theme_bw()Figure 2: Bland-Altman Plot of BP Measurement Agreement
To design a trial detecting a 5 mmHg reduction in SBP, assuming a standard deviation of 15 mmHg:
##
## Two-sample t test power calculation
##
## n = 142.2466
## delta = 5
## sd = 15
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
We require 143 participants per arm.
# Test if the effect of BMI on SBP changes with Age
ex1_model <- lm(sbp ~ age * bmi, data = data_bp)
kable(tidy(ex1_model), digits = 3)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 133.950 | 17.794 | 7.528 | 0.000 |
| age | -0.034 | 0.313 | -0.108 | 0.914 |
| bmi | -0.278 | 0.638 | -0.436 | 0.663 |
| age:bmi | 0.014 | 0.011 | 1.279 | 0.201 |