1. Introduction to Hypertension

Hypertension is a primary risk factor for cardiovascular disease. In health data science, we analyze blood pressure (BP) as both a continuous outcome (measured in mmHg) and a categorical outcome (Hypertensive vs. Normotensive).

1.1 Statistical Note on Dichotomization

While clinicians use thresholds (e.g., 140/90 mmHg) to make treatment decisions, statisticians prefer continuous analysis. Dichotomizing continuous data leads to a loss of statistical power and masks individual variation within groups.


2. Dataset Generation

We will simulate a study of 500 participants. To ensure the statistical models are stable, we add “noise” to the relationship between predictors and outcomes.

set.seed(123) 

n <- 500

# 1. Demographic variables
data_bp <- data.frame(
  id = 1:n,
  age = round(rnorm(n, mean = 55, sd = 12)),
  sex = sample(c("Male", "Female"), n, replace = TRUE),
  bmi = round(rnorm(n, mean = 28, sd = 5), 1)
)

# 2. Simulate SBP (influenced by Age and BMI + significant random error)
# Adding sd=15 to error_term prevents the model from being "too perfect"
data_bp$sbp <- 100 + (0.45 * data_bp$age) + (0.75 * data_bp$bmi) + rnorm(n, 0, 15)

# 3. Simulate DBP (correlated with SBP)
data_bp$dbp <- 50 + (0.4 * data_bp$sbp) + rnorm(n, 0, 8)

# 4. Create binary hypertension variable
# We add a random component to status to simulate biological variability 
# and prevent "Perfect Separation" errors in Logistic Regression
prob_htn <- plogis((data_bp$sbp - 140) / 10) 
data_bp$hypertension_status <- rbinom(n, 1, prob_htn)

data_bp$hypertension_label <- factor(data_bp$hypertension_status, 
                                     levels = c(0,1), 
                                     labels = c("Normotensive", "Hypertensive"))

kable(head(data_bp), caption = "Table 1: Simulated Patient Dataset")
Table 1: Simulated Patient Dataset
id age sex bmi sbp dbp hypertension_status hypertension_label
1 48 Male 35.7 164.1505 110.0482 1 Hypertensive
2 52 Female 27.5 153.3686 118.4053 1 Hypertensive
3 74 Female 30.6 162.7543 114.0348 1 Hypertensive
4 56 Male 29.1 152.8163 102.1611 1 Hypertensive
5 57 Female 27.1 165.3448 119.8275 1 Hypertensive
6 76 Male 27.4 139.7161 118.0796 0 Normotensive

3. Descriptive Statistics and Visualization

3.1 Distribution of SBP

ggplot(data_bp, aes(x = sbp)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  geom_vline(aes(xintercept = mean(sbp)), color = "red", linetype = "dashed") +
  labs(title = "Distribution of Systolic Blood Pressure", x = "SBP (mmHg)", y = "Count") +
  theme_minimal()
Figure 1: Distribution of Systolic Blood Pressure

Figure 1: Distribution of Systolic Blood Pressure


4. Statistical Modeling

4.1 Linear Regression (Continuous)

We model SBP as a function of age, sex, and BMI.

lm_model <- lm(sbp ~ age + sex + bmi, data = data_bp)

# Tidy output with 95% Confidence Intervals
tidy_lm <- tidy(lm_model, conf.int = TRUE)
kable(tidy_lm, digits = 3, caption = "Table 2: Linear Regression results for SBP")
Table 2: Linear Regression results for SBP
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 112.894 4.948 22.814 0.000 103.171 122.616
age 0.361 0.057 6.347 0.000 0.249 0.473
sexMale -1.391 1.333 -1.043 0.297 -4.011 1.229
bmi 0.514 0.131 3.919 0.000 0.256 0.772

4.2 Logistic Regression (Binary)

Important Note: To avoid the approx() error in R, we use confint.default(). This calculates Wald Confidence Intervals, which are computationally stable even if the data shows high separation.

# Fit the model
logit_model <- glm(hypertension_status ~ age + bmi, 
                   data = data_bp, 
                   family = binomial(link = "logit"))

# 1. Get Log-Odds and Wald CIs
results_log_odds <- tidy(logit_model, conf.int = TRUE, conf.method = "wald")

# 2. Convert to Odds Ratios (OR) by exponentiating
results_or <- results_log_odds %>%
  mutate(
    OR = exp(estimate),
    conf.low = exp(conf.low),
    conf.high = exp(conf.high)
  ) %>%
  select(term, OR, conf.low, conf.high, p.value)

kable(results_or, digits = 3, caption = "Table 3: Logistic Regression Odds Ratios (Wald CIs)")
Table 3: Logistic Regression Odds Ratios (Wald CIs)
term OR conf.low conf.high p.value
(Intercept) 0.308 0.079 1.178 0.087
age 1.023 1.007 1.039 0.006
bmi 1.017 0.981 1.055 0.354

5. Method Comparison (Bland-Altman)

In clinical settings, we often compare a new measurement device (Auto) against a gold standard (Manual).

# Simulate two measurements with a 3mmHg bias
data_bp$sbp_manual <- data_bp$sbp
data_bp$sbp_auto <- data_bp$sbp + 3 + rnorm(n, 0, 4)

data_bp <- data_bp %>%
  mutate(
    diff = sbp_auto - sbp_manual,
    avg = (sbp_auto + sbp_manual) / 2
  )

mean_diff <- mean(data_bp$diff)
sd_diff <- sd(data_bp$diff)
loa_upper <- mean_diff + (1.96 * sd_diff)
loa_lower <- mean_diff - (1.96 * sd_diff)
ggplot(data_bp, aes(x = avg, y = diff)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = mean_diff, color = "blue") +
  geom_hline(yintercept = c(loa_lower, loa_upper), color = "red", linetype = "dashed") +
  labs(title = "Bland-Altman: Auto vs Manual SBP",
       x = "Average SBP (mmHg)", y = "Difference (Auto - Manual)") +
  theme_bw()
Figure 2: Bland-Altman Plot of BP Measurement Agreement

Figure 2: Bland-Altman Plot of BP Measurement Agreement


6. Sample Size Calculation

To design a trial detecting a 5 mmHg reduction in SBP, assuming a standard deviation of 15 mmHg:

p_test <- power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.8)
print(p_test)
## 
##      Two-sample t test power calculation 
## 
##               n = 142.2466
##           delta = 5
##              sd = 15
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

We require 143 participants per arm.


7. Student Exercises

Exercise 1: Interaction Terms

# Test if the effect of BMI on SBP changes with Age
ex1_model <- lm(sbp ~ age * bmi, data = data_bp)
kable(tidy(ex1_model), digits = 3)
term estimate std.error statistic p.value
(Intercept) 133.950 17.794 7.528 0.000
age -0.034 0.313 -0.108 0.914
bmi -0.278 0.638 -0.436 0.663
age:bmi 0.014 0.011 1.279 0.201

Exercise 2: Visualizing Regression

ggplot(data_bp, aes(x = bmi, y = sbp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Linear Relationship between BMI and SBP") +
  theme_minimal()

```