This course builds directly on EPI 552, extending the statistical toolkit you developed there.
Key Philosophy: We learn to code first, then use AI as a tool when needed—not as a replacement for understanding.
Choosing the correct statistical method depends entirely on identifying variable types correctly. Using the wrong test (e.g., treating ordinal data as continuous) can lead to invalid conclusions.
Categorical Variables (Qualitative): - Nominal: No inherent order (e.g., race/ethnicity, marital status, blood type) - Ordinal: Natural ordering (e.g., education level, disease severity, Likert scales)
Numerical Variables (Quantitative): - Discrete: Countable values (e.g., number of hospitalizations, parity) - Continuous: Measurable on a continuum (e.g., blood pressure, age, BMI)
library(tidyverse)
# Creating example data
example_data <- tibble(
patient_id = 1:100, # Nominal (despite being numeric!)
age_continuous = rnorm(100, mean = 45, sd = 15), # Continuous
education = sample(c("High School", "Bachelor's", "Graduate"),
100, replace = TRUE), # Ordinal
blood_type = sample(c("A", "B", "AB", "O"),
100, replace = TRUE), # Nominal
num_visits = rpois(100, lambda = 3) # Discrete
)
# View structure
glimpse(example_data)
## Rows: 100
## Columns: 5
## $ patient_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ age_continuous <dbl> 11.159833, 28.877918, 44.990352, 53.849774, 48.287092, …
## $ education <chr> "Bachelor's", "High School", "High School", "High Schoo…
## $ blood_type <chr> "A", "O", "O", "B", "A", "O", "A", "B", "A", "A", "A", …
## $ num_visits <int> 3, 3, 2, 5, 4, 3, 2, 1, 0, 3, 5, 0, 3, 3, 3, 1, 5, 1, 1…
Why: Central tendency summarizes “typical” values in your data—essential for describing populations and detecting outliers.
When to Use Each: - Mean: Symmetric distributions, no extreme outliers (e.g., height, blood pressure in healthy populations) - Median: Skewed distributions, outliers present (e.g., income, length of hospital stay) - Mode: Categorical data or identifying most common values (e.g., most common diagnosis)
Watch Out: - Mean is sensitive to outliers—one extreme value can distort it - Median is robust but doesn’t use all information - Mode can have multiple values (bimodal/multimodal distributions)
Mean (Arithmetic Average): \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
Median: Middle value when data are ordered - If \(n\) is odd: middle value - If \(n\) is even: average of two middle values
Mode: Most frequently occurring value
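R has no built-in function for the statistical mode. A minimal helper sketch (the name get_mode is ours; ties resolve to whichever value appears first in the data):
# Most frequent value (assumption: first value wins on ties)
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
get_mode(c(2, 3, 3, 5, 5, 5, 7))  # returns 5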
# Simulating skewed data (e.g., hospital length of stay)
set.seed(553)
los_data <- rexp(100, rate = 0.2) # Exponential distribution (right-skewed)
# Calculate measures
mean_los <- mean(los_data)
median_los <- median(los_data)
# Visualize
tibble(length_of_stay = los_data) %>%
ggplot(aes(x = length_of_stay)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
geom_vline(aes(xintercept = mean_los, color = "Mean"),
linewidth = 1.2, linetype = "dashed") +
geom_vline(aes(xintercept = median_los, color = "Median"),
linewidth = 1.2, linetype = "dashed") +
scale_color_manual(values = c("Mean" = "red", "Median" = "blue")) +
labs(
title = "Length of Stay Distribution (Right-Skewed)",
subtitle = "Mean is pulled toward the tail; Median is more representative",
x = "Days", y = "Frequency",
color = "Measure"
) +
theme_minimal()
cat("Mean:", round(mean_los, 2), "days\n")
cat("Median:", round(median_los, 2), "days\n")
## Mean: 4.76 days
## Median: 3.3 days
Why: Variability tells us about data spread—critical for understanding population heterogeneity and precision of estimates.
When: - Range: Quick, rough sense of spread (affected by outliers) - IQR: Robust measure for skewed data - Variance: Foundation for many statistical tests - Standard Deviation: Most interpretable (same units as data)
Watch Out: - Interpreting the SD through rules like 68-95-99.7 assumes a roughly normal distribution - Coefficient of variation allows comparison across different scales - Always report variability alongside central tendency
Variance (Population): \[\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}\]
Variance (Sample): \[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
Standard Deviation: \[s = \sqrt{s^2}\]
Coefficient of Variation: \[CV = \frac{s}{\bar{x}} \times 100\%\]
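Each spread measure above is a one-liner in base R. A quick illustration (the data vector is our own example, chosen to include one outlier):
# Spread measures in base R (illustrative data; 28 is an outlier)
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 28)
diff(range(x))          # range as a single number: 26
IQR(x)                  # interquartile range: 2.5 (robust to the outlier)
var(x)                  # sample variance (n - 1 denominator): ≈ 57.7
sd(x)                   # sample standard deviation: ≈ 7.6
sd(x) / mean(x) * 100   # coefficient of variation: ≈ 112%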
# Two groups with same mean, different variability
set.seed(553)
group_low_var <- rnorm(100, mean = 120, sd = 5) # Tight control
group_high_var <- rnorm(100, mean = 120, sd = 20) # High variability
# Calculate measures
var_comparison <- tibble(
Group = c("Low Variability", "High Variability"),
Mean = c(mean(group_low_var), mean(group_high_var)),
SD = c(sd(group_low_var), sd(group_high_var)),
CV = c(sd(group_low_var)/mean(group_low_var) * 100,
sd(group_high_var)/mean(group_high_var) * 100)
)
print(var_comparison)
## # A tibble: 2 × 4
## Group Mean SD CV
## <chr> <dbl> <dbl> <dbl>
## 1 Low Variability 120. 4.92 4.09
## 2 High Variability 120. 22.4 18.7
# Visualize
tibble(
BP = c(group_low_var, group_high_var),
Group = rep(c("Low Variability", "High Variability"), each = 100)
) %>%
ggplot(aes(x = BP, fill = Group)) +
geom_density(alpha = 0.6) +
labs(
title = "Same Mean, Different Variability",
subtitle = "Systolic Blood Pressure in Two Populations",
x = "Systolic BP (mmHg)", y = "Density"
) +
theme_minimal()
Why: Many statistical tests assume normality, and many biological variables follow approximately normal distributions.
When: - Use for parametric tests (t-tests, ANOVA, linear regression) - With large samples (n > 30), the sampling distribution of the mean is approximately normal (Central Limit Theorem), even when individual observations are not - Check normality before applying parametric methods
Watch Out: - Real data are never perfectly normal - Skewed data may need transformation (log, square root) - Outliers violate normality assumption - Small samples: normality harder to assess and more important
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
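The shaded regions in the plot below follow the 68-95-99.7 rule, which you can verify directly with pnorm():
# Areas under the standard normal curve within 1, 2, and 3 SDs
pnorm(1) - pnorm(-1)  # ≈ 0.6827 (68%)
pnorm(2) - pnorm(-2)  # ≈ 0.9545 (95%)
pnorm(3) - pnorm(-3)  # ≈ 0.9973 (99.7%)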
# Standard Normal Distribution
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x, mean = 0, sd = 1)
tibble(x = x, y = y) %>%
ggplot(aes(x = x, y = y)) +
geom_line(linewidth = 1.2, color = "steelblue") +
geom_area(data = . %>% filter(x >= -1 & x <= 1),
fill = "steelblue", alpha = 0.3) +
geom_area(data = . %>% filter(x >= -2 & x <= 2),
fill = "steelblue", alpha = 0.2) +
geom_area(data = . %>% filter(x >= -3 & x <= 3),
fill = "steelblue", alpha = 0.1) +
annotate("text", x = 0, y = 0.2, label = "68%", size = 5) +
annotate("text", x = 0, y = 0.1, label = "95%", size = 5) +
annotate("text", x = 0, y = 0.05, label = "99.7%", size = 5) +
labs(
title = "Standard Normal Distribution",
subtitle = "68-95-99.7 Rule",
x = "Standard Deviations from Mean", y = "Density"
) +
theme_minimal()
# Generate example data
set.seed(553)
normal_data <- rnorm(100, mean = 100, sd = 15)
skewed_data <- rexp(100, rate = 0.05)
# Visual checks
par(mfrow = c(2, 2))
# Normal data
hist(normal_data, main = "Histogram: Normal Data",
xlab = "Value", col = "lightblue")
qqnorm(normal_data, main = "Q-Q Plot: Normal Data")
qqline(normal_data, col = "red", lwd = 2)
# Skewed data
hist(skewed_data, main = "Histogram: Skewed Data",
xlab = "Value", col = "salmon")
qqnorm(skewed_data, main = "Q-Q Plot: Skewed Data")
qqline(skewed_data, col = "red", lwd = 2)
# Formal normality test: Shapiro-Wilk
shapiro.test(normal_data)
##
## Shapiro-Wilk normality test
##
## data: normal_data
## W = 0.98413, p-value = 0.2745
shapiro.test(skewed_data)
##
## Shapiro-Wilk normality test
##
## data: skewed_data
## W = 0.76332, p-value = 2.138e-11
Why: The binomial distribution models the number of “successes” in a fixed number of independent trials—fundamental for epidemiologic studies with binary outcomes.
When: - Fixed number of trials (n): e.g., 100 patients screened - Two possible outcomes: Success/failure, disease/no disease, positive/negative test - Independent trials: One patient’s outcome doesn’t affect another’s - Constant probability: Same probability (p) for each trial
Watch Out: - Not for continuous outcomes - Assumes independence (violated in clustered data, family studies) - Large n and moderate p → approximates normal distribution - Small p and large n → consider Poisson distribution instead
\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]
where: - \(n\) = number of trials - \(k\) = number of successes - \(p\) = probability of success on each trial - \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) = “n choose k”
Mean: \(\mu = np\)
Variance: \(\sigma^2 = np(1-p)\)
Standard Deviation: \(\sigma = \sqrt{np(1-p)}\)
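As a spot check, the PMF formula matches R's dbinom() exactly (the values here are illustrative: 4 successes in 10 trials with p = 0.3):
# Manual PMF formula vs built-in
choose(10, 4) * 0.3^4 * 0.7^6     # ≈ 0.2001
dbinom(4, size = 10, prob = 0.3)  # identical result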
# Example: Vaccination effectiveness study
# 100 vaccinated individuals exposed to disease
# Vaccine is 80% effective (p = 0.20 probability of infection)
n <- 100
p <- 0.20
# Calculate probabilities for different numbers of infections
x <- 0:40
probs <- dbinom(x, size = n, prob = p)
tibble(infections = x, probability = probs) %>%
ggplot(aes(x = infections, y = probability)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_vline(xintercept = n*p, color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = n*p + 5, y = max(probs) * 0.9,
label = paste("Mean =", n*p), color = "red", size = 4) +
labs(
title = "Binomial Distribution: Vaccine Study",
subtitle = "n = 100 vaccinated individuals, p = 0.20 (infection probability)",
x = "Number of Infections", y = "Probability"
) +
theme_minimal()
cat("Expected number of infections:", n * p, "\n")
cat("SD:", sqrt(n * p * (1 - p)), "\n")
## Expected number of infections: 20
## SD: 4
# What's the probability of exactly 15 infections?
prob_15 <- dbinom(15, size = n, prob = p)
cat("P(X = 15):", round(prob_15, 4), "\n")## P(X = 15): 0.0481
# What's the probability of 25 or more infections (worse than expected)?
prob_25_or_more <- 1 - pbinom(24, size = n, prob = p)
cat("P(X >= 25):", round(prob_25_or_more, 4), "\n")## P(X >= 25): 0.1314
# What's the probability of 15 or fewer infections (better than expected)?
prob_15_or_less <- pbinom(15, size = n, prob = p)
cat("P(X <= 15):", round(prob_15_or_less, 4), "\n")## P(X <= 15): 0.1285
# How many people need to be tested to be 95% confident
# we'll see at least 1 case if prevalence is 5%?
p_disease <- 0.05
n_test <- 1:100
# Probability of seeing at least 1 case
prob_at_least_1 <- 1 - dbinom(0, size = n_test, prob = p_disease)
tibble(n = n_test, probability = prob_at_least_1) %>%
ggplot(aes(x = n, y = probability)) +
geom_line(linewidth = 1.2, color = "steelblue") +
geom_hline(yintercept = 0.95, linetype = "dashed", color = "red") +
annotate("text", x = 75, y = 0.93,
label = "95% threshold", color = "red", size = 4) +
labs(
title = "Sample Size for Detecting Disease",
subtitle = "Probability of finding ≥1 case when prevalence = 5%",
x = "Sample Size (n)", y = "Probability of ≥1 Case"
) +
theme_minimal()
# Find minimum n needed
n_needed <- min(which(prob_at_least_1 >= 0.95))
cat("Need to test at least", n_needed, "people to be 95% confident of finding ≥1 case\n")## Need to test at least 59 people to be 95% confident of finding ≥1 case
Why: The Poisson distribution models the number of rare events occurring in a fixed interval of time or space—essential for disease surveillance and outbreak detection.
When: - Rare events: Low probability, but many opportunities - Count data: Number of events (0, 1, 2, 3, …) - Fixed interval: Time period, geographic area, population size - Independent events: One event doesn’t affect probability of another - Constant rate: Average rate (λ) stays the same
Watch Out: - Only one parameter (λ = mean = variance) - If variance >> mean → overdispersion (negative binomial may be better) - As λ increases, distribution becomes more symmetric (normal approximation) - Can’t use for continuous outcomes
\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
where: - \(\lambda\) = average rate (events per interval) - \(k\) = number of events observed - \(e\) ≈ 2.71828
Mean: \(\mu = \lambda\)
Variance: \(\sigma^2 = \lambda\)
Standard Deviation: \(\sigma = \sqrt{\lambda}\)
Key Property: Mean equals variance!
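A quick simulation confirms this property:
# Mean and variance of simulated Poisson counts are both ≈ λ
set.seed(553)
sim_counts <- rpois(10000, lambda = 3)
mean(sim_counts)  # ≈ 3
var(sim_counts)   # ≈ 3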
# Example: Hospital-acquired infections
# Average rate: 3 infections per month
lambda <- 3
# Calculate probabilities
x <- 0:12
probs <- dpois(x, lambda = lambda)
tibble(infections = x, probability = probs) %>%
ggplot(aes(x = infections, y = probability)) +
geom_col(fill = "darkorange", alpha = 0.7) +
geom_vline(xintercept = lambda, color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = lambda + 1.5, y = max(probs) * 0.9,
label = paste("λ =", lambda), color = "red", size = 4) +
labs(
title = "Poisson Distribution: Hospital Infections",
subtitle = "Average rate: λ = 3 infections per month",
x = "Number of Infections", y = "Probability"
) +
theme_minimal()
cat("Expected infections per month:", lambda, "\n")
cat("SD:", sqrt(lambda), "\n")
## Expected infections per month: 3
## SD: 1.732051
# What's the probability of exactly 5 infections?
prob_5 <- dpois(5, lambda = lambda)
cat("P(X = 5):", round(prob_5, 4), "\n")## P(X = 5): 0.1008
# What's the probability of 6 or more (concerning outbreak)?
prob_6_or_more <- 1 - ppois(5, lambda = lambda)
cat("P(X >= 6):", round(prob_6_or_more, 4), "\n")## P(X >= 6): 0.0839
# What's the probability of zero infections (very good month)?
prob_0 <- dpois(0, lambda = lambda)
cat("P(X = 0):", round(prob_0, 4), "\n")## P(X = 0): 0.0498
# Compare disease incidence in three counties
lambda1 <- 2 # Rural county
lambda2 <- 5 # Suburban county
lambda3 <- 10 # Urban county
x <- 0:20
tibble(
cases = rep(x, 3),
probability = c(dpois(x, lambda1), dpois(x, lambda2), dpois(x, lambda3)),
county = rep(c("Rural (λ=2)", "Suburban (λ=5)", "Urban (λ=10)"), each = length(x))
) %>%
ggplot(aes(x = cases, y = probability, fill = county)) +
geom_col(position = "dodge", alpha = 0.7) +
labs(
title = "Poisson Distribution: Disease Incidence Across Counties",
subtitle = "Higher λ = higher average case count and more spread",
x = "Number of Cases per Month", y = "Probability",
fill = "County Type"
) +
theme_minimal()
# Historical average: 4 cases per month
# This month: 9 cases observed
# Is this an outbreak?
lambda_historical <- 4
observed <- 9
# Calculate p-value (probability of observing ≥9 cases by chance)
p_value <- 1 - ppois(observed - 1, lambda = lambda_historical)
cat("Historical average:", lambda_historical, "cases/month\n")## Historical average: 4 cases/month
## Observed this month: 9 cases
## P(X >= 9 | λ = 4): 0.0214
if (p_value < 0.05) {
cat("ALERT: Statistically significant increase (p < 0.05)\n")
cat("This constitutes evidence of a potential outbreak.\n")
} else {
cat("Within expected variation (p >= 0.05)\n")
}
## ALERT: Statistically significant increase (p < 0.05)
## This constitutes evidence of a potential outbreak.
# Visualize
x <- 0:15
probs <- dpois(x, lambda = lambda_historical)
tibble(cases = x, probability = probs) %>%
ggplot(aes(x = cases, y = probability)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_col(data = . %>% filter(cases == observed),
fill = "red", alpha = 0.9) +
geom_vline(xintercept = lambda_historical, linetype = "dashed", color = "blue") +
annotate("text", x = observed, y = max(probs) * 0.5,
label = "Observed", color = "red", size = 4, angle = 90) +
labs(
title = "Outbreak Detection Using Poisson Distribution",
subtitle = paste0("Observed = ", observed, " cases vs Expected λ = ", lambda_historical),
x = "Number of Cases", y = "Probability"
) +
theme_minimal()
This is one of the most commonly confused concepts in statistics! Understanding the difference is critical for interpreting research findings correctly.
Standard Deviation (SD): Describes variability in the population/sample - “How spread out are individual observations?” - Remains roughly constant as sample size increases
Standard Error (SE): Describes precision of the sample mean - “How close is our sample mean to the true population mean?” - Decreases as sample size increases
Standard Error of the Mean: \[SE = \frac{s}{\sqrt{n}}\]
where \(s\) = sample standard deviation, \(n\) = sample size
# Demonstrate SE decreases with sample size
set.seed(553)
population <- rnorm(10000, mean = 120, sd = 20) # True population
sample_sizes <- c(10, 30, 50, 100, 500)
results <- tibble(
n = sample_sizes,
SD = numeric(length(sample_sizes)),
SE = numeric(length(sample_sizes))
)
for (i in seq_along(sample_sizes)) {
sample_data <- sample(population, sample_sizes[i])
results$SD[i] <- sd(sample_data)
results$SE[i] <- sd(sample_data) / sqrt(sample_sizes[i])
}
print(results)
## # A tibble: 5 × 3
## n SD SE
## <dbl> <dbl> <dbl>
## 1 10 16.1 5.09
## 2 30 22.3 4.06
## 3 50 20.2 2.85
## 4 100 19.8 1.98
## 5 500 20.0 0.897
# Visualize
results %>%
pivot_longer(cols = c(SD, SE), names_to = "Measure", values_to = "Value") %>%
ggplot(aes(x = n, y = Value, color = Measure, group = Measure)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(
title = "SD vs SE: Effect of Sample Size",
subtitle = "SD remains stable; SE decreases with √n",
x = "Sample Size (n)", y = "Value"
) +
theme_minimal()
Why: Point estimates alone don’t convey uncertainty. CIs provide a range of plausible values for the true population parameter.
When: Report CIs for all estimates—means, proportions, regression coefficients, odds ratios, etc.
Watch Out: - 95% CI interpretation: “If we repeated this study many times, 95% of the CIs would contain the true parameter” - NOT: “There’s a 95% probability the true value is in this interval” - Wider CIs = more uncertainty (smaller sample size or more variability) - CI excluding null value → statistically significant result
95% Confidence Interval: \[\bar{x} \pm t_{\alpha/2, n-1} \times SE\]
where: - \(\bar{x}\) = sample mean - \(t_{\alpha/2, n-1}\) = critical t-value (1.96 for large samples) - \(SE\) = standard error
# Example: Blood pressure in hypertensive patients
set.seed(553)
bp_sample <- rnorm(50, mean = 145, sd = 18)
# Calculate 95% CI
mean_bp <- mean(bp_sample)
se_bp <- sd(bp_sample) / sqrt(length(bp_sample))
t_crit <- qt(0.975, df = length(bp_sample) - 1) # 95% CI
ci_lower <- mean_bp - t_crit * se_bp
ci_upper <- mean_bp + t_crit * se_bp
cat("Mean BP:", round(mean_bp, 2), "mmHg\n")## Mean BP: 147.03 mmHg
## 95% CI: [ 142.26 , 151.81 ]
## [1] 142.2557 151.8071
## attr(,"conf.level")
## [1] 0.95
# Visualize multiple samples
set.seed(553)
n_samples <- 20
sample_means <- numeric(n_samples)
ci_lowers <- numeric(n_samples)
ci_uppers <- numeric(n_samples)
true_mean <- 145
for (i in 1:n_samples) {
sample_i <- rnorm(50, mean = true_mean, sd = 18)
sample_means[i] <- mean(sample_i)
ci_i <- t.test(sample_i)$conf.int
ci_lowers[i] <- ci_i[1]
ci_uppers[i] <- ci_i[2]
}
tibble(
sample = 1:n_samples,
mean = sample_means,
lower = ci_lowers,
upper = ci_uppers,
contains_true = (ci_lowers <= true_mean) & (ci_uppers >= true_mean)
) %>%
ggplot(aes(x = sample, y = mean, color = contains_true)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.3) +
geom_hline(yintercept = true_mean, linetype = "dashed", color = "red", linewidth = 1) +
scale_color_manual(values = c("TRUE" = "steelblue", "FALSE" = "orange"),
labels = c("TRUE" = "Contains True Mean", "FALSE" = "Misses True Mean")) +
labs(
title = "20 Samples: 95% Confidence Intervals",
subtitle = "Red line = True population mean (145 mmHg)",
x = "Sample Number", y = "Systolic BP (mmHg)",
color = NULL
) +
theme_minimal() +
theme(legend.position = "bottom")
Why: Hypothesis testing provides a structured framework for making decisions under uncertainty—essential for evidence-based public health.
When: Use when you want to test whether an observed effect is likely due to chance or represents a real phenomenon.
Watch Out: - p-value ≠ probability that null hypothesis is true - p < 0.05 doesn’t mean effect is important (statistical vs. clinical significance) - Absence of evidence ≠ evidence of absence (non-significant doesn’t prove no effect) - Multiple testing inflates Type I error rate
State hypotheses: null (\(H_0\)) and alternative (\(H_A\))
Choose significance level (\(\alpha\), typically 0.05)
Calculate test statistic (t, F, χ², etc.)
Determine p-value
Make decision: reject \(H_0\) if p-value < \(\alpha\); otherwise fail to reject \(H_0\)
| | \(H_0\) True | \(H_0\) False |
|---|---|---|
| Reject \(H_0\) | Type I Error (α) | Correct Decision |
| Fail to reject \(H_0\) | Correct Decision | Type II Error (β) |
Power = \(1 - \beta\) (probability of detecting true effect)
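Base R's power.t.test() makes this concrete; the effect size, SD, and sample size below are illustrative assumptions, not values from this lecture:
# Power for a two-sample t-test with n = 50 per group (illustrative numbers)
power.t.test(n = 50, delta = 5, sd = 15, sig.level = 0.05)        # power ≈ 0.38
# Per-group n needed for 80% power at the same effect size
power.t.test(power = 0.80, delta = 5, sd = 15, sig.level = 0.05)  # n ≈ 142.2, i.e., 143 per group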
# Example: Testing if mean BP differs from 140 mmHg
set.seed(553)
bp_sample <- rnorm(40, mean = 145, sd = 18)
# One-sample t-test
t_test_result <- t.test(bp_sample, mu = 140)
print(t_test_result)
##
## One Sample t-test
##
## data: bp_sample
## t = 2.5127, df = 39, p-value = 0.01623
## alternative hypothesis: true mean is not equal to 140
## 95 percent confidence interval:
## 141.2370 151.4495
## sample estimates:
## mean of x
## 146.3432
cat("\nInterpretation:\n")
cat("H0: μ = 140 mmHg\n")
cat("HA: μ ≠ 140 mmHg\n")
cat("p-value =", round(t_test_result$p.value, 4), "\n")
##
## Interpretation:
## H0: μ = 140 mmHg
## HA: μ ≠ 140 mmHg
## p-value = 0.0162
if (t_test_result$p.value < 0.05) {
cat("Decision: Reject H0 (p < 0.05)\n")
cat("Conclusion: Mean BP is significantly different from 140 mmHg\n")
} else {
cat("Decision: Fail to reject H0 (p ≥ 0.05)\n")
cat("Conclusion: Insufficient evidence that mean BP differs from 140 mmHg\n")
}
## Decision: Reject H0 (p < 0.05)
## Conclusion: Mean BP is significantly different from 140 mmHg
Why: t-tests are the foundation for comparing means—the most common research question in public health.
When: - One-sample t-test: Compare sample mean to known value - Independent t-test: Compare means between two groups - Paired t-test: Compare means for same subjects (before/after)
Watch Out: - Assumes normality (especially for small samples) - The classic independent t-test assumes equal variances; use Welch’s correction if violated (R’s t.test() applies Welch by default) - Paired t-test requires genuinely paired observations (same subject measured twice, or matched pairs) - Large samples can detect tiny, clinically meaningless differences
Test statistic: \[t = \frac{\bar{x} - \mu_0}{SE} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\]
# Example: Is mean BMI in our sample different from population mean (26.5)?
set.seed(553)
bmi_sample <- rnorm(35, mean = 28, sd = 4.5)
t.test(bmi_sample, mu = 26.5)
##
## One Sample t-test
##
## data: bmi_sample
## t = 2.5131, df = 34, p-value = 0.01687
## alternative hypothesis: true mean is not equal to 26.5
## 95 percent confidence interval:
## 26.82870 29.60729
## sample estimates:
## mean of x
## 28.218
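As a check on the formula above, computing the t statistic by hand from the same bmi_sample reproduces the t.test() result:
# Manual one-sample t statistic (reuses bmi_sample from the chunk above)
t_manual <- (mean(bmi_sample) - 26.5) / (sd(bmi_sample) / sqrt(length(bmi_sample)))
round(t_manual, 4)  # 2.5131, matching t.test()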
Test statistic: \[t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{diff}}\]
where, assuming equal variances, \(SE_{diff} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\) and \(s_p\) is the pooled standard deviation
# Example: Compare cholesterol levels between treatment and control
set.seed(553)
treatment <- rnorm(40, mean = 185, sd = 25)
control <- rnorm(40, mean = 200, sd = 25)
# Independent t-test
t.test(treatment, control, var.equal = TRUE)
##
## Two Sample t-test
##
## data: treatment and control
## t = -2.8209, df = 78, p-value = 0.00607
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.594033 -4.242742
## sample estimates:
## mean of x mean of y
## 186.8656 201.2840
# Visualize
tibble(
cholesterol = c(treatment, control),
group = rep(c("Treatment", "Control"), each = 40)
) %>%
ggplot(aes(x = group, y = cholesterol, fill = group)) +
geom_boxplot(alpha = 0.7, width = 0.5) +
geom_jitter(width = 0.2, alpha = 0.3) +
labs(
title = "Cholesterol Levels: Treatment vs Control",
x = NULL, y = "Total Cholesterol (mg/dL)"
) +
theme_minimal() +
theme(legend.position = "none")
Test statistic: \[t = \frac{\bar{d}}{SE_d} = \frac{\bar{d}}{s_d/\sqrt{n}}\]
where \(d\) = difference scores (after - before) and \(s_d\) = their standard deviation
# Example: Blood pressure before and after intervention
set.seed(553)
n <- 30
before <- rnorm(n, mean = 145, sd = 15)
after <- before - rnorm(n, mean = 8, sd = 5) # Average reduction of 8 mmHg
# Paired t-test
t.test(after, before, paired = TRUE)
##
## Paired t-test
##
## data: after and before
## t = -10.551, df = 29, p-value = 1.928e-11
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -10.77337 -7.27483
## sample estimates:
## mean difference
## -9.024099
# Visualize
tibble(
patient = rep(1:n, 2),
bp = c(before, after),
time = rep(c("Before", "After"), each = n)
) %>%
ggplot(aes(x = time, y = bp, group = patient)) +
geom_line(alpha = 0.3) +
geom_point(aes(color = time), size = 3, alpha = 0.7) +
labs(
title = "Paired Data: Blood Pressure Before/After Intervention",
x = NULL, y = "Systolic BP (mmHg)"
) +
theme_minimal() +
theme(legend.position = "none")
In this course, we’ll extend these foundational concepts to more advanced methods.
Each method builds on what you learned in EPI 552!
Instructions: Try these questions on your own before looking at the answer keys below. You can work through the calculations by hand or use R to check your work.
Classify each variable as nominal, ordinal, discrete, or continuous:
(a) Number of cigarettes smoked per day
(b) Self-rated health (Excellent, Good, Fair, Poor)
(c) Systolic blood pressure (mmHg)
(d) Type of health insurance (e.g., private, Medicare)
Answers:
Discrete - Countable values (0, 1, 2, 3, …). You can’t smoke 2.5 cigarettes.
Ordinal - Has natural ordering (Excellent > Good > Fair > Poor), but the intervals between categories aren’t necessarily equal.
Continuous - Can take any value within a range (e.g., 120.5 mmHg, 120.53 mmHg). Measured on a continuous scale.
Nominal - Categories with no inherent order. Private insurance isn’t “more” or “less” than Medicare—they’re just different types.
Key Insight: Getting the variable type right determines which statistical tests you can use. For example, you’d use ANOVA for (a) as an outcome, ordinal logistic regression for (b), linear regression for (c), and chi-square or multinomial logistic regression for (d).
Given the following hospital length of stay data (days):
2, 3, 3, 4, 5, 5, 5, 6, 7, 28
Answers:
Mean: \(\bar{x} = \frac{2+3+3+4+5+5+5+6+7+28}{10} = \frac{68}{10} = 6.8\) days
Median: Order the data (already ordered). With n = 10 (even), take the average of the 5th and 6th values: \(\text{Median} = \frac{5+5}{2} = 5\) days
Mode: 5 days (appears 3 times, most frequent)
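As the instructions suggest, you can confirm these in R:
# Check the hand calculations
los <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 28)
mean(los)    # 6.8
median(los)  # 5
table(los)   # 5 occurs 3 times: the mode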
The median (5 days) best represents typical length of stay.
Reasoning: - The mean (6.8 days) is pulled up by the outlier (28 days—possibly a complicated case) - Most patients (9 out of 10) stayed 7 days or less - The median is robust to outliers and gives a better sense of what a “typical” patient experiences - For skewed distributions like this (common in healthcare data), median is usually preferred
Public Health Implication: If you reported the mean (6.8 days) to hospital administrators, they might overestimate typical resource needs. The median (5 days) better reflects the typical patient experience.
A study reports: “Mean systolic BP = 135 mmHg (95% CI: 130-140)”
Answers:
Correct interpretation: “If we repeated this study many times with different samples from the same population, approximately 95% of the confidence intervals we calculated would contain the true population mean systolic blood pressure.”
Practical interpretation: “We are 95% confident that the true average systolic BP in the population lies between 130 and 140 mmHg.”
Common WRONG interpretation (avoid this!): “There’s a 95% probability that the true mean is between 130 and 140.” ❌ (The true mean is fixed—it either is or isn’t in this interval. The probability statement is about the method, not this particular interval.)
Yes, this is significantly different from 128 mmHg (at α = 0.05).
Reasoning: - The 95% CI is (130-140) - 128 mmHg falls outside this interval - If we tested \(H_0: \mu = 128\) vs \(H_A: \mu \neq 128\), we would get p < 0.05 - We can reject the null hypothesis that the true mean equals 128
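A back-of-envelope check of this reasoning in R, assuming a large-sample (normal-based) CI so that the half-width is about 1.96 × SE:
# Recover the SE from the CI half-width, then test H0: mu = 128
se <- (140 - 130) / (2 * 1.96)  # ≈ 2.55 mmHg
z <- (135 - 128) / se           # ≈ 2.74
2 * pnorm(-abs(z))              # two-sided p ≈ 0.006 < 0.05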
Clinical note: While statistically significant, a 7 mmHg difference (135 vs 128) may or may not be clinically meaningful depending on the context. Always consider both statistical and practical significance!
You conduct a t-test comparing mean BMI between smokers and non-smokers. The p-value is 0.08.
Answers:
Fail to reject the null hypothesis (p = 0.08 > α = 0.05)
Important: We do NOT say “accept the null” or “prove there’s no difference”—we simply lack sufficient evidence to reject it.
Interpretation requires caution:
Not “proof of no effect”: p = 0.08 is close to the significance threshold. There may be a real difference that we failed to detect (Type II error), especially if sample size was small.
Effect size matters: Even if not statistically significant, if the mean difference is large (e.g., 3-4 BMI points), this could still be clinically/practically important.
Consider the mechanism: Smoking is often associated with lower BMI due to metabolic effects and appetite suppression, but this doesn’t mean smoking is “healthy”—it’s associated with numerous other serious health risks.
Don’t over-interpret: This result alone shouldn’t change public health messaging about smoking cessation.
Sample size: Small n → low statistical power → might miss real effects
Effect size: What was the actual difference in mean BMI?
Confidence interval: e.g., 95% CI for difference: (-0.2, 2.8)
Standard deviations: How variable is BMI in each group?
Confounders: Are groups comparable in age, sex, socioeconomic status?
Power analysis: Did the study have adequate power (typically 80%) to detect a meaningful difference?
Bottom line: A p-value of 0.08 is borderline. Don’t dismiss the finding entirely, but also don’t over-interpret. Consider the broader context, clinical significance, and study limitations.
For each scenario, identify whether a binomial or Poisson distribution is more appropriate:
Answers:
Key Distinction: - Binomial: Fixed n, counting “successes” out of n trials - Poisson: Open-ended, counting events per unit time/space
Bridge: When n is large and p is small, binomial ≈ Poisson (with λ = np)
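You can see the bridge numerically (n and p below are illustrative):
# Binomial probabilities approach Poisson when n is large and p is small (λ = np = 3)
dbinom(0:5, size = 1000, prob = 0.003)  # exact binomial
dpois(0:5, lambda = 3)                  # Poisson approximation; nearly identical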
A county typically reports an average of λ = 2 cases of meningitis per year.
Answers:
\[P(X = 0) = \frac{2^0 \cdot e^{-2}}{0!} = \frac{1 \cdot e^{-2}}{1} = e^{-2} \approx 0.135\]
Interpretation: There’s about a 13.5% chance of seeing zero cases in a given year—uncommon but not extremely rare.
Calculate using complement: \(P(X \geq 5) = 1 - P(X \leq 4)\)
\[P(X \leq 4) = \sum_{k=0}^{4} \frac{2^k e^{-2}}{k!}\]
Using R or tables: \(P(X \leq 4) \approx 0.947\)
Therefore: \(P(X \geq 5) = 1 - 0.947 = 0.053\)
Interpretation: There’s about a 5.3% chance of seeing 5+ cases by random chance alone.
Observed: X = 6 cases
Calculate: \(P(X \geq 6 | \lambda = 2)\)
\[P(X \geq 6) = 1 - P(X \leq 5) \approx 1 - 0.983 = 0.017\]
Decision: p = 0.017 < 0.05
Conclusion: YES, this constitutes evidence of an outbreak.
The probability of seeing 6 or more cases by chance alone is only 1.7%, which is below our significance threshold of 5%. This warrants further investigation—possible sources could include: - Environmental contamination - College dormitory outbreak - Imported case with secondary transmission - Reporting artifact (delayed cases from previous year)
Public health action: Investigate immediately, consider prophylaxis for close contacts, review vaccination coverage.
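All three hand calculations in this exercise can be verified in R:
# Verify the exercise answers (λ = 2)
dpois(0, lambda = 2)      # P(X = 0) ≈ 0.135
1 - ppois(4, lambda = 2)  # P(X >= 5) ≈ 0.053
1 - ppois(5, lambda = 2)  # P(X >= 6) ≈ 0.017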
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.1 stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.1.0 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
## [9] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 sass_0.4.10 generics_0.1.4 stringi_1.8.7
## [5] hms_1.1.4 digest_0.6.39 magrittr_2.0.3 timechange_0.3.0
## [9] evaluate_1.0.5 grid_4.5.2 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] jsonlite_2.0.0 promises_1.3.3 httr_1.4.7 scales_1.4.0
## [17] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6 withr_3.0.2
## [21] cachem_1.1.0 yaml_2.3.10 otel_0.2.0 tools_4.5.2
## [25] telegram.bot_3.0.2 tzdb_0.5.0 httpuv_1.6.16 curl_7.0.0
## [29] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.5 pkgconfig_2.0.3
## [33] bslib_0.9.0 pillar_1.11.0 later_1.4.5 gtable_0.3.6
## [37] glue_1.8.0 Rcpp_1.1.1 xfun_0.53 tidyselect_1.2.1
## [41] rstudioapi_0.17.1 knitr_1.51 farver_2.1.2 htmltools_0.5.9
## [45] labeling_0.4.3 rmarkdown_2.30 compiler_4.5.2 S7_0.2.0
## [49] askpass_1.2.1 openssl_2.3.3
Next Lecture: ANOVA - Comparing Means Across Multiple Groups
Reminder: Complete Lab 01 (Review Exercises) by end of class Thursday