1. Introduction to Hypertension
- 1.1 Statistical Note on Dichotomization
2. Dataset Generation
3. Descriptive Statistics and Visualization
- 3.1 Distribution of SBP
4. Statistical Modeling
- 4.1 Linear Regression (Continuous)
- 4.2 Logistic Regression (Binary)
5. Method Comparison (Bland-Altman)
6. Sample Size Calculation
7. Student Exercises
- Exercise 1: Interaction Terms
- Exercise 2: Visualizing Regression

1. Introduction to Hypertension

Hypertension is a major risk factor for cardiovascular disease. In health data science, we analyze blood pressure (BP) both as a continuous outcome (measured in mmHg) and as a binary outcome (hypertensive vs. normotensive).

1.1 Statistical Note on Dichotomization

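While clinicians use thresholds (e.g., 140/90 mmHg) to make treatment decisions, analyzing BP on its continuous scale is usually preferable statistically. Dichotomizing a continuous variable discards information: it reduces statistical power and treats everyone on the same side of the cut-off as identical, so a patient at 141 mmHg is grouped with one at 190 mmHg.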
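The loss of power can be illustrated with a small simulation. The sketch below is illustrative only: the trial size, the 5 mmHg treatment effect, the control-arm mean of 145 mmHg, and the 140 mmHg cut-off are assumed values chosen for the demonstration, and none of the object names belong to the dataset generated in Section 2.

```r
# Illustrative simulation: power of a continuous vs. a dichotomized analysis.
# All settings below are assumptions chosen for the demonstration.
set.seed(42)
n_sims <- 500      # number of simulated trials
n_per_arm <- 100   # participants per arm
true_diff <- 5     # true SBP reduction in the treated arm (mmHg)
sd_sbp <- 15       # between-person SD of SBP (mmHg)
threshold <- 140   # cut-off used to define hypertension (mmHg)

sim_once <- function() {
  control <- rnorm(n_per_arm, mean = 145, sd = sd_sbp)
  treated <- rnorm(n_per_arm, mean = 145 - true_diff, sd = sd_sbp)

  # Continuous analysis: two-sample t-test on SBP
  p_cont <- t.test(control, treated)$p.value

  # Dichotomized analysis: chi-squared test on hypertensive yes/no
  arm <- rep(c("control", "treated"), each = n_per_arm)
  htn <- factor(c(control, treated) >= threshold, levels = c(FALSE, TRUE))
  p_dich <- chisq.test(table(arm, htn))$p.value

  c(continuous = p_cont, dichotomized = p_dich)
}

# 2 x n_sims matrix of p-values (one row per analysis)
p_values <- replicate(n_sims, sim_once())

# Estimated power = share of simulated trials with p < 0.05
power_est <- rowMeans(p_values < 0.05)
print(round(power_est, 2))
```

Under these assumptions the continuous analysis should reject the null hypothesis noticeably more often than the dichotomized one, even though both analyses use exactly the same participants.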

2. Dataset Generation

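We will simulate a study of 500 participants. To make the data realistic, we add random “noise” to the relationship between predictors and outcomes: without it the regression fits would be unrealistically perfect, and the logistic regression could suffer from perfect separation.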

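library(knitr)    # kable() for tables
library(ggplot2)  # plots
library(dplyr)    # mutate(), select(), %>%
library(broom)    # tidy() for model output

set.seed(123)  # make the simulation reproducible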

n <- 500

# 1. Demographic variables
data_bp <- data.frame(
  id = 1:n,
  age = round(rnorm(n, mean = 55, sd = 12)),
  sex = sample(c("Male", "Female"), n, replace = TRUE),
  bmi = round(rnorm(n, mean = 28, sd = 5), 1)
)

# 2. Simulate SBP (influenced by age and BMI, plus substantial random error)
# The rnorm(n, 0, 15) term adds residual noise (SD = 15 mmHg) so the model is not "too perfect"
data_bp$sbp <- 100 + (0.45 * data_bp$age) + (0.75 * data_bp$bmi) + rnorm(n, 0, 15)

# 3. Simulate DBP (correlated with SBP)
data_bp$dbp <- 50 + (0.4 * data_bp$sbp) + rnorm(n, 0, 8)

# 4. Create binary hypertension variable
# We add a random component to status to simulate biological variability 
# and prevent "Perfect Separation" errors in Logistic Regression
prob_htn <- plogis((data_bp$sbp - 140) / 10) 
data_bp$hypertension_status <- rbinom(n, 1, prob_htn)

data_bp$hypertension_label <- factor(data_bp$hypertension_status, 
                                     levels = c(0,1), 
                                     labels = c("Normotensive", "Hypertensive"))

kable(head(data_bp), caption = "Table 1: Simulated Patient Dataset")

Table 1: Simulated Patient Dataset
id	age	sex	bmi	sbp	dbp	hypertension_status	hypertension_label
1	48	Male	35.7	164.1505	110.0482	1	Hypertensive
2	52	Female	27.5	153.3686	118.4053	1	Hypertensive
3	74	Female	30.6	162.7543	114.0348	1	Hypertensive
4	56	Male	29.1	152.8163	102.1611	1	Hypertensive
5	57	Female	27.1	165.3448	119.8275	1	Hypertensive
6	76	Male	27.4	139.7161	118.0796	0	Normotensive
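Step 4 of the simulation turns SBP into a probability of being labelled hypertensive using the logistic function, then draws the status at random. The short sketch below shows what that mapping implies at a few SBP values; `sbp_grid` and `prob_grid` are illustrative names that do not appear in the original code.

```r
# Probability of a "Hypertensive" label implied by plogis((sbp - 140) / 10)
sbp_grid <- c(120, 130, 140, 150, 160)
prob_grid <- plogis((sbp_grid - 140) / 10)

# At 140 mmHg the probability is exactly 0.5; each 10 mmHg step moves
# the log-odds by 1
print(data.frame(sbp = sbp_grid, prob_htn = round(prob_grid, 3)))
```

Because the label is a random draw from these probabilities, a participant just below 140 mmHg can still be labelled hypertensive and one just above can be labelled normotensive, which is what keeps the outcome from being perfectly separated by SBP.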

3. Descriptive Statistics and Visualization

3.1 Distribution of SBP

ggplot(data_bp, aes(x = sbp)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  geom_vline(aes(xintercept = mean(sbp)), color = "red", linetype = "dashed") +
  labs(title = "Distribution of Systolic Blood Pressure", x = "SBP (mmHg)", y = "Count") +
  theme_minimal()

Figure 1: Distribution of Systolic Blood Pressure

4. Statistical Modeling

4.1 Linear Regression (Continuous)

We model SBP as a function of age, sex, and BMI.

lm_model <- lm(sbp ~ age + sex + bmi, data = data_bp)

# Tidy output with 95% Confidence Intervals
tidy_lm <- tidy(lm_model, conf.int = TRUE)
kable(tidy_lm, digits = 3, caption = "Table 2: Linear Regression results for SBP")

Table 2: Linear Regression results for SBP
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	112.894	4.948	22.814	0.000	103.171	122.616
age	0.361	0.057	6.347	0.000	0.249	0.473
sexMale	-1.391	1.333	-1.043	0.297	-4.011	1.229
bmi	0.514	0.131	3.919	0.000	0.256	0.772
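To read Table 2 in practical terms, the coefficients can be plugged into the regression equation. The sketch below copies the rounded estimates from Table 2 by hand; the helper `predict_sbp()` and the example patient (age 60, BMI 30) are hypothetical and only serve the illustration. With the fitted model available, `predict(lm_model, newdata = ...)` would do the same job.

```r
# Coefficients copied from Table 2 (rounded to 3 decimals)
coefs <- c(intercept = 112.894, age = 0.361, sexMale = -1.391, bmi = 0.514)

# Hypothetical helper: predicted mean SBP for a given covariate profile
predict_sbp <- function(age, bmi, male = FALSE) {
  unname(
    coefs["intercept"] +
      coefs["age"] * age +
      coefs["sexMale"] * as.numeric(male) +
      coefs["bmi"] * bmi
  )
}

pred_female <- predict_sbp(age = 60, bmi = 30)              # 112.894 + 21.66 + 15.42
pred_male <- predict_sbp(age = 60, bmi = 30, male = TRUE)   # same, minus 1.391

# Effect of age rescaled to a more interpretable unit (per 10 years)
age_effect_per_decade <- unname(coefs["age"] * 10)

print(c(female = pred_female, male = pred_male, per_decade = age_effect_per_decade))
```

Read this way, each additional decade of age is associated with about 3.6 mmHg higher mean SBP, holding sex and BMI fixed.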

4.2 Logistic Regression (Binary)

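Important Note: Profile-likelihood confidence intervals (what confint() returns for a glm) rely on interpolation and can fail with an approx() error when the data are close to separation. To avoid this, we use confint.default(), which calculates Wald confidence intervals (estimate ± 1.96 × standard error). Wald intervals can always be computed, but under strong separation they become very wide and should be interpreted with caution.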

# Fit the model
logit_model <- glm(hypertension_status ~ age + bmi, 
                   data = data_bp, 
                   family = binomial(link = "logit"))

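# 1. Get log-odds estimates, then attach Wald CIs from confint.default()
# (tidy(conf.int = TRUE) on a glm relies on confint(), i.e. profile likelihood)
results_log_odds <- tidy(logit_model)
wald_ci <- confint.default(logit_model)  # rows are in the same term order as tidy()
results_log_odds$conf.low <- unname(wald_ci[, 1])
results_log_odds$conf.high <- unname(wald_ci[, 2])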

# 2. Convert to Odds Ratios (OR) by exponentiating
results_or <- results_log_odds %>%
  mutate(
    OR = exp(estimate),
    conf.low = exp(conf.low),
    conf.high = exp(conf.high)
  ) %>%
  select(term, OR, conf.low, conf.high, p.value)

kable(results_or, digits = 3, caption = "Table 3: Logistic Regression Odds Ratios (Wald CIs)")

Table 3: Logistic Regression Odds Ratios (Wald CIs)
term	OR	conf.low	conf.high	p.value
(Intercept)	0.308	0.079	1.178	0.087
age	1.023	1.007	1.039	0.006
bmi	1.017	0.981	1.055	0.354
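The Wald interval used for Table 3 is simple enough to compute by hand, which makes clear what confint.default() is doing. The sketch below is a minimal illustration on a small, separately simulated dataset; `demo`, `fit`, and the simulation coefficients are assumptions for the example and are unrelated to `data_bp`.

```r
# Small hypothetical dataset so the example runs on its own
set.seed(1)
n_demo <- 300
demo <- data.frame(age = rnorm(n_demo, mean = 55, sd = 12))
demo$htn <- rbinom(n_demo, 1, plogis(-3 + 0.05 * demo$age))

fit <- glm(htn ~ age, data = demo, family = binomial(link = "logit"))

# Wald CI by hand: estimate +/- z * standard error, on the log-odds scale
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))
z <- qnorm(0.975)
wald_manual <- cbind(lower = est - z * se, upper = est + z * se)

# The same interval from base R
wald_builtin <- confint.default(fit)

# Exponentiate the estimate and both limits to obtain odds ratios
or_table <- exp(cbind(OR = est, wald_builtin))
print(round(or_table, 3))
```

The interval is symmetric on the log-odds scale and becomes asymmetric around the odds ratio after exponentiation, which is why the limits are exponentiated rather than computed directly on the OR scale.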

5. Method Comparison (Bland-Altman)

In clinical settings, we often compare a new measurement device (Auto) against a gold standard (Manual).

# Simulate two measurements: the automated device reads 3 mmHg high on average
data_bp$sbp_manual <- data_bp$sbp
data_bp$sbp_auto <- data_bp$sbp + 3 + rnorm(n, 0, 4)

data_bp <- data_bp %>%
  mutate(
    diff = sbp_auto - sbp_manual,
    avg = (sbp_auto + sbp_manual) / 2
  )

mean_diff <- mean(data_bp$diff)
sd_diff <- sd(data_bp$diff)
loa_upper <- mean_diff + (1.96 * sd_diff)
loa_lower <- mean_diff - (1.96 * sd_diff)

ggplot(data_bp, aes(x = avg, y = diff)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = mean_diff, color = "blue") +
  geom_hline(yintercept = c(loa_lower, loa_upper), color = "red", linetype = "dashed") +
  labs(title = "Bland-Altman: Auto vs Manual SBP",
       x = "Average SBP (mmHg)", y = "Difference (Auto - Manual)") +
  theme_bw()

Figure 2: Bland-Altman Plot of BP Measurement Agreement
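The bias and 95% limits of agreement computed above can be wrapped in a small reusable function. This is a sketch, not part of the original analysis: `bland_altman_stats()` and the demo vectors are hypothetical names, and the demo data are simulated separately so that the block runs on its own. With a true bias of 3 mmHg and a measurement error SD of 4 mmHg, the limits should fall near 3 ± 1.96 × 4, roughly -4.8 to 10.8 mmHg.

```r
# Bias and 95% limits of agreement for two measurement methods
bland_altman_stats <- function(new, reference) {
  d <- new - reference
  bias <- mean(d)
  sd_d <- sd(d)
  lower <- bias - 1.96 * sd_d
  upper <- bias + 1.96 * sd_d
  c(
    bias = bias,
    loa_lower = lower,
    loa_upper = upper,
    # Share of differences inside the limits (about 95% if roughly normal)
    prop_within = mean(d >= lower & d <= upper)
  )
}

# Hypothetical demo data: automated device reads 3 mmHg high, error SD 4 mmHg
set.seed(7)
manual_demo <- rnorm(200, mean = 145, sd = 17)
auto_demo <- manual_demo + 3 + rnorm(200, mean = 0, sd = 4)

ba <- bland_altman_stats(auto_demo, manual_demo)
print(round(ba, 2))
```

Whether limits of this width are acceptable is a clinical judgment rather than a statistical one: the plot and these numbers describe agreement, they do not test it.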

6. Sample Size Calculation

To design a two-arm trial able to detect a 5 mmHg difference in mean SBP with 80% power at a two-sided 5% significance level, assuming a standard deviation of 15 mmHg:

p_test <- power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.8)
print(p_test)

## 
##      Two-sample t test power calculation 
## 
##               n = 142.2466
##           delta = 5
##              sd = 15
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Rounding n = 142.25 up to the next whole person, we require 143 participants per arm (286 in total).
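As a cross-check, the same calculation can be approximated by hand with the standard normal-approximation formula n = 2 (z_{1-α/2} + z_{1-β})² σ² / δ² per group. The sketch below compares it with power.t.test(); the variable names are illustrative.

```r
delta <- 5      # difference to detect (mmHg)
sigma <- 15     # assumed SD (mmHg)
alpha <- 0.05   # two-sided significance level
power <- 0.80   # target power

# Normal-approximation formula, per group
z_alpha <- qnorm(1 - alpha / 2)
z_beta <- qnorm(power)
n_normal <- 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2

# Exact calculation based on the t distribution
n_t <- power.t.test(delta = delta, sd = sigma, sig.level = alpha, power = power)$n

# Always round up: a fraction of a participant cannot be recruited
n_per_arm <- ceiling(n_t)

print(c(normal_approx = n_normal, t_based = n_t, per_arm = n_per_arm))
```

The normal approximation gives a slightly smaller n because it ignores the extra uncertainty from estimating the standard deviation; the t-based value from power.t.test() is the one to report.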

7. Student Exercises

Exercise 1: Interaction Terms

# Test if the effect of BMI on SBP changes with Age
ex1_model <- lm(sbp ~ age * bmi, data = data_bp)
kable(tidy(ex1_model), digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	133.950	17.794	7.528	0.000
age	-0.034	0.313	-0.108	0.914
bmi	-0.278	0.638	-0.436	0.663
age:bmi	0.014	0.011	1.279	0.201
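In the output above, the age and bmi rows describe the effect of each predictor when the other equals zero, which lies far outside the data and inflates their standard errors. One common remedy, shown here as an optional sketch, is to mean-center both predictors before forming the interaction. The block re-simulates a simplified dataset (`demo`, with an assumed seed) so that it runs on its own; it will not reproduce the exact numbers in the table above.

```r
# Simplified re-simulation (assumption: same structure as Section 2, without sex)
set.seed(2024)
n_demo <- 500
demo <- data.frame(
  age = round(rnorm(n_demo, mean = 55, sd = 12)),
  bmi = round(rnorm(n_demo, mean = 28, sd = 5), 1)
)
demo$sbp <- 100 + 0.45 * demo$age + 0.75 * demo$bmi + rnorm(n_demo, 0, 15)

# Mean-center the predictors
demo$age_c <- demo$age - mean(demo$age)
demo$bmi_c <- demo$bmi - mean(demo$bmi)

raw_fit <- lm(sbp ~ age * bmi, data = demo)
centered_fit <- lm(sbp ~ age_c * bmi_c, data = demo)

# Standard error of the age effect in each parameterization
se_raw <- summary(raw_fit)$coefficients["age", "Std. Error"]
se_centered <- summary(centered_fit)$coefficients["age_c", "Std. Error"]

print(round(summary(centered_fit)$coefficients, 3))
print(c(se_age_raw = se_raw, se_age_centered = se_centered))
```

Centering does not change the interaction estimate or its p-value; it only changes what the main-effect rows mean. After centering, they describe the effect of age at average BMI and of BMI at average age.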

Exercise 2: Visualizing Regression

ggplot(data_bp, aes(x = bmi, y = sbp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Linear Relationship between BMI and SBP") +
  theme_minimal()


Chapter 4: Hypertension - Epidemiology, Measurement, and Statistical Analysis

Hibaq Ahmed Abdulle

2026-01-21