This course builds directly on EPI 552, extending your statistical toolkit to include:
Key Philosophy: We learn to code first, then use AI as a tool when needed—not as a replacement for understanding.
Choosing the correct statistical method depends entirely on identifying variable types correctly. Using the wrong test (e.g., treating ordinal data as continuous) can lead to invalid conclusions.
Categorical Variables (Qualitative):
Numerical Variables (Quantitative):
library(tidyverse)
library(kableExtra)
# Creating example data
example_data <- tibble(
patient_id = 1:100, # Nominal (despite being numeric!)
age_continuous = rnorm(100, mean = 45, sd = 15), # Continuous
education = sample(c("High School", "Bachelor's", "Graduate"),
100, replace = TRUE), # Ordinal
blood_type = sample(c("A", "B", "AB", "O"),
100, replace = TRUE), # Nominal
num_visits = rpois(100, lambda = 3) # Discrete
)
# View structure
glimpse(example_data)## Rows: 100
## Columns: 5
## $ patient_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ age_continuous <dbl> 56.50325, 44.56357, 47.79624, 37.46888, 50.82221, 71.49…
## $ education <chr> "Bachelor's", "Bachelor's", "Graduate", "High School", …
## $ blood_type <chr> "A", "A", "AB", "AB", "O", "A", "A", "O", "B", "AB", "A…
## $ num_visits <int> 1, 0, 2, 1, 4, 1, 3, 4, 1, 3, 4, 2, 5, 5, 3, 1, 8, 1, 4…
# Display first 10 rows in a nice table
example_data %>%
head(10) %>%
kbl(caption = "Example Data: Variable Types (First 10 Patients)",
col.names = c("Patient ID", "Age (years)", "Education Level",
"Blood Type", "Number of Visits"),
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left") %>%
add_header_above(c("Nominal" = 1, "Continuous" = 1, "Ordinal" = 1,
"Nominal" = 1, "Discrete" = 1))| Patient ID | Age (years) | Education Level | Blood Type | Number of Visits |
|---|---|---|---|---|
| 1 | 56.5 | Bachelor’s | A | 1 |
| 2 | 44.6 | Bachelor’s | A | 0 |
| 3 | 47.8 | Graduate | AB | 2 |
| 4 | 37.5 | High School | AB | 1 |
| 5 | 50.8 | Bachelor’s | O | 4 |
| 6 | 71.5 | Bachelor’s | A | 1 |
| 7 | 47.5 | Bachelor’s | A | 3 |
| 8 | 43.0 | Graduate | O | 4 |
| 9 | 34.3 | Bachelor’s | B | 1 |
| 10 | 61.7 | High School | AB | 3 |
Why: Central tendency summarizes “typical” values in your data—essential for describing populations and detecting outliers.
When to Use Each:
Watch Out: - Mean is sensitive to outliers—one extreme value can distort it - Median is robust but doesn’t use all information - Mode can have multiple values (bimodal/multimodal distributions)
Mean (Arithmetic Average): \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
Median: Middle value when data are ordered - If \(n\) is odd: middle value - If \(n\) is even: average of two middle values
Mode: Most frequently occurring value
# Simulating skewed data (e.g., hospital length of stay)
set.seed(553)
los_data <- rexp(100, rate = 0.2) # Exponential distribution (right-skewed)
# Calculate measures
mean_los <- mean(los_data)
median_los <- median(los_data)
# Visualize
tibble(length_of_stay = los_data) %>%
ggplot(aes(x = length_of_stay)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
geom_vline(aes(xintercept = mean_los, color = "Mean"),
linewidth = 1.2, linetype = "dashed") +
geom_vline(aes(xintercept = median_los, color = "Median"),
linewidth = 1.2, linetype = "dashed") +
scale_color_manual(values = c("Mean" = "red", "Median" = "blue")) +
labs(
title = "Length of Stay Distribution (Right-Skewed)",
subtitle = "Mean is pulled toward the tail; Median is more representative",
x = "Days", y = "Frequency",
color = "Measure"
) +
theme_minimal()## Mean: 4.76 days
## Median: 3.3 days
Why: Variability tells us about data spread—critical for understanding population heterogeneity and precision of estimates.
When:
Watch Out: - Standard deviation assumes roughly normal distribution - Coefficient of variation allows comparison across different scales - Always report variability alongside central tendency
Variance (Population): \[\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}\]
Variance (Sample): \[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]
Standard Deviation: \[s = \sqrt{s^2}\]
Coefficient of Variation: \[CV = \frac{s}{\bar{x}} \times 100\%\]
# Two groups with same mean, different variability
set.seed(553)
group_low_var <- rnorm(100, mean = 120, sd = 5) # Tight control
group_high_var <- rnorm(100, mean = 120, sd = 20) # High variability
# Calculate measures
var_comparison <- tibble(
Group = c("Low Variability", "High Variability"),
Mean = c(mean(group_low_var), mean(group_high_var)),
SD = c(sd(group_low_var), sd(group_high_var)),
CV = c(sd(group_low_var)/mean(group_low_var) * 100,
sd(group_high_var)/mean(group_high_var) * 100)
)
print(var_comparison)## # A tibble: 2 × 4
## Group Mean SD CV
## <chr> <dbl> <dbl> <dbl>
## 1 Low Variability 120. 4.92 4.09
## 2 High Variability 120. 22.4 18.7
# Visualize
tibble(
BP = c(group_low_var, group_high_var),
Group = rep(c("Low Variability", "High Variability"), each = 100)
) %>%
ggplot(aes(x = BP, fill = Group)) +
geom_density(alpha = 0.6) +
labs(
title = "Same Mean, Different Variability",
subtitle = "Systolic Blood Pressure in Two Populations",
x = "Systolic BP (mmHg)", y = "Density"
) +
theme_minimal()Why: Many statistical tests assume normality, and many biological variables follow approximately normal distributions.
When:
Watch Out:
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
# Standard Normal Distribution
x <- seq(-4, 4, length.out = 1000)
y <- dnorm(x, mean = 0, sd = 1)
tibble(x = x, y = y) %>%
ggplot(aes(x = x, y = y)) +
geom_line(linewidth = 1.2, color = "steelblue") +
geom_area(data = . %>% filter(x >= -1 & x <= 1),
fill = "steelblue", alpha = 0.3) +
geom_area(data = . %>% filter(x >= -2 & x <= 2),
fill = "steelblue", alpha = 0.2) +
geom_area(data = . %>% filter(x >= -3 & x <= 3),
fill = "steelblue", alpha = 0.1) +
annotate("text", x = 0, y = 0.2, label = "68%", size = 5) +
annotate("text", x = 0, y = 0.1, label = "95%", size = 5) +
annotate("text", x = 0, y = 0.05, label = "99.7%", size = 5) +
labs(
title = "Standard Normal Distribution",
subtitle = "68-95-99.7 Rule",
x = "Standard Deviations from Mean (z-scores)", y = "Density"
) +
theme_minimal()This visualization shows the 68-95-99.7 Rule (Empirical Rule) for the standard normal distribution:
Why This Matters:
Example: If systolic BP has mean = 120 mmHg and SD = 15: - 68% of people: 105-135 mmHg - 95% of people: 90-150 mmHg - 99.7% of people: 75-165 mmHg - Someone with BP = 155 mmHg is >2 SD above average (top 2.5%)
# Generate example data
set.seed(553)
normal_data <- rnorm(100, mean = 100, sd = 15)
skewed_data <- rexp(100, rate = 0.05)
# Visual checks
par(mfrow = c(2, 2))
# Normal data
hist(normal_data, main = "Histogram: Normal Data",
xlab = "Value", col = "lightblue")
qqnorm(normal_data, main = "Q-Q Plot: Normal Data")
qqline(normal_data, col = "red", lwd = 2)
# Skewed data
hist(skewed_data, main = "Histogram: Skewed Data",
xlab = "Value", col = "salmon")
qqnorm(skewed_data, main = "Q-Q Plot: Skewed Data")
qqline(skewed_data, col = "red", lwd = 2)##
## Shapiro-Wilk normality test
##
## data: normal_data
## W = 0.98413, p-value = 0.2745
##
## Shapiro-Wilk normality test
##
## data: skewed_data
## W = 0.76332, p-value = 2.138e-11
These plots demonstrate two methods for checking whether data follow a normal distribution—a critical assumption for parametric tests.
Top Row (Normal Data):
Bottom Row (Skewed Data):
How to Read Q-Q Plots:
Decision Rules:
What to Do with Non-Normal Data:
Why: The binomial distribution models the number of “successes” in a fixed number of independent trials—fundamental for epidemiologic studies with binary outcomes.
When:
Watch Out:
\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]
where:
Mean: \(\mu =
np\)
Variance: \(\sigma^2 =
np(1-p)\)
Standard Deviation: \(\sigma
= \sqrt{np(1-p)}\)
# Example: Vaccination effectiveness study
# 100 vaccinated individuals exposed to disease
# Vaccine is 80% effective (p = 0.20 probability of infection)
n <- 100
p <- 0.20
# Calculate probabilities for different numbers of infections
x <- 0:40
probs <- dbinom(x, size = n, prob = p)
tibble(infections = x, probability = probs) %>%
ggplot(aes(x = infections, y = probability)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_vline(xintercept = n*p, color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = n*p + 5, y = max(probs) * 0.9,
label = paste("Mean =", n*p), color = "red", size = 4) +
labs(
title = "Binomial Distribution: Vaccine Study",
subtitle = "n = 100 vaccinated individuals, p = 0.20 (infection probability)",
x = "Number of Infections", y = "Probability"
) +
theme_minimal()## Expected number of infections: 20
## SD: 4
# What's the probability of exactly 15 infections?
prob_15 <- dbinom(15, size = n, prob = p)
cat("P(X = 15):", round(prob_15, 4), "\n")## P(X = 15): 0.0481
# What's the probability of 25 or more infections (worse than expected)?
prob_25_or_more <- 1 - pbinom(24, size = n, prob = p)
cat("P(X >= 25):", round(prob_25_or_more, 4), "\n")## P(X >= 25): 0.1314
# What's the probability of 15 or fewer infections (better than expected)?
prob_15_or_less <- pbinom(15, size = n, prob = p)
cat("P(X <= 15):", round(prob_15_or_less, 4), "\n")## P(X <= 15): 0.1285
Scenario: 100 vaccinated individuals exposed to disease, vaccine is 80% effective (so p = 0.20 probability of infection).
Four Key Calculations:
dbinom(15, size = 100, prob = 0.20)1 - pbinom(24, ...) (complement rule)pbinom(15, ...) (cumulative probability)Public Health Application: These calculations help determine if observed outcomes differ significantly from expected vaccine performance.
# How many people need to be tested to be 95% confident
# we'll see at least 1 case if prevalence is 5%?
p_disease <- 0.05 #Disease prevalence is 5%
n_test <- 1:100 #Test sample sizes from 1 to 100
# Probability of seeing at least 1 case
prob_at_least_1 <- 1 - dbinom(0, size = n_test, prob = p_disease)
# `dbinom(0, size = n, prob = 0.05)` = Probability of finding **zero cases** (missing the disease entirely)
# `1 - dbinom(0, ...)` = **Complement rule**: Probability of finding **at least 1 case**
# This is calculated for each sample size (n = 1, 2, 3, ..., 100)
tibble(n = n_test, probability = prob_at_least_1) %>%
ggplot(aes(x = n, y = probability)) +
geom_line(linewidth = 1.2, color = "steelblue") +
geom_hline(yintercept = 0.95, linetype = "dashed", color = "red") +
annotate("text", x = 75, y = 0.93,
label = "95% threshold", color = "red", size = 4) +
labs(
title = "Sample Size for Detecting Disease",
subtitle = "Probability of finding ≥1 case when prevalence = 5%",
x = "Sample Size (n)", y = "Probability of ≥1 Case"
) +
theme_minimal()# Find minimum n needed
n_needed <- min(which(prob_at_least_1 >= 0.95))
cat("Need to test at least", n_needed, "people to be 95% confident of finding ≥1 case\n")## Need to test at least 59 people to be 95% confident of finding ≥1 case
Why: The Poisson distribution models the number of rare events occurring in a fixed interval of time or space—essential for disease surveillance and outbreak detection.
When:
Watch Out:
The Poisson distribution only models COUNT data (0, 1, 2, 3, …) — whole numbers representing “how many times” an event occurred.
Examples of COUNT data (Poisson is appropriate):
Examples of CONTINUOUS data (Poisson is NOT appropriate):
Why the distinction matters:
Poisson probability formula \(P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\) only works for non-negative integers (k = 0, 1, 2, …)
You can’t calculate \(P(X = 2.5)\) using Poisson — there’s no such thing as “2.5 cases” of a disease
For continuous outcomes, use:
Common Mistake to Avoid:
Students sometimes confuse: - ❌ “Average length of hospital stay = 3.2 days” → This is CONTINUOUS, don’t use Poisson - ✓ “Number of hospital admissions per day = 3.2” → This is the RATE (λ) for a Poisson process, which is fine
Bottom Line: If your outcome can have decimal values, Poisson is the wrong choice. Use distributions designed for continuous data instead.
\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
where: - \(\lambda\) = average rate (events per interval) - \(k\) = number of events observed - \(e\) ≈ 2.71828
Mean: \(\mu =
\lambda\)
Variance: \(\sigma^2 =
\lambda\)
Standard Deviation: \(\sigma
= \sqrt{\lambda}\)
Key Property: Mean equals variance!
# Example: Hospital-acquired infections
# Average rate: 3 infections per month
lambda <- 3
# Calculate probabilities
x <- 0:12
probs <- dpois(x, lambda = lambda)
tibble(infections = x, probability = probs) %>%
ggplot(aes(x = infections, y = probability)) +
geom_col(fill = "darkorange", alpha = 0.7) +
geom_vline(xintercept = lambda, color = "red", linetype = "dashed", linewidth = 1) +
annotate("text", x = lambda + 1.5, y = max(probs) * 0.9,
label = paste("λ =", lambda), color = "red", size = 4) +
labs(
title = "Poisson Distribution: Hospital Infections",
subtitle = "Average rate: λ = 3 infections per month",
x = "Number of Infections", y = "Probability"
) +
theme_minimal()## Expected infections per month: 3
## SD: 1.732051
# What's the probability of exactly 5 infections?
prob_5 <- dpois(5, lambda = lambda)
cat("P(X = 5):", round(prob_5, 4), "\n")## P(X = 5): 0.1008
# What's the probability of 6 or more (concerning outbreak)?
prob_6_or_more <- 1 - ppois(5, lambda = lambda)
cat("P(X >= 6):", round(prob_6_or_more, 4), "\n")## P(X >= 6): 0.0839
# What's the probability of zero infections (very good month)?
prob_0 <- dpois(0, lambda = lambda)
cat("P(X = 0):", round(prob_0, 4), "\n")## P(X = 0): 0.0498
# Compare disease incidence in three counties
lambda1 <- 2 # Rural county
lambda2 <- 5 # Suburban county
lambda3 <- 10 # Urban county
x <- 0:20
tibble(
cases = rep(x, 3),
probability = c(dpois(x, lambda1), dpois(x, lambda2), dpois(x, lambda3)),
county = rep(c("Rural (λ=2)", "Suburban (λ=5)", "Urban (λ=10)"), each = length(x))
) %>%
ggplot(aes(x = cases, y = probability, fill = county)) +
geom_col(position = "dodge", alpha = 0.7) +
labs(
title = "Poisson Distribution: Disease Incidence Across Counties",
subtitle = "Higher λ = higher average case count and more spread",
x = "Number of Cases per Month", y = "Probability",
fill = "County Type"
) +
theme_minimal()Key Insight: As λ increases:
Public Health Application:
Understanding these patterns helps epidemiologists:
# Historical average: 4 cases per month
# This month: 9 cases observed
# Is this an outbreak?
lambda_historical <- 4
observed <- 9
# Calculate p-value (probability of observing ≥9 cases by chance)
p_value <- 1 - ppois(observed - 1, lambda = lambda_historical)
cat("Historical average:", lambda_historical, "cases/month\n")## Historical average: 4 cases/month
## Observed this month: 9 cases
## P(X >= 9 | λ = 4): 0.0214
if (p_value < 0.05) {
cat("ALERT: Statistically significant increase (p < 0.05)\n")
cat("This constitutes evidence of a potential outbreak.\n")
} else {
cat("Within expected variation (p >= 0.05)\n")
}## ALERT: Statistically significant increase (p < 0.05)
## This constitutes evidence of a potential outbreak.
# Visualize
x <- 0:15
probs <- dpois(x, lambda = lambda_historical)
tibble(cases = x, probability = probs) %>%
ggplot(aes(x = cases, y = probability)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_col(data = . %>% filter(cases == observed),
fill = "red", alpha = 0.9) +
geom_vline(xintercept = lambda_historical, linetype = "dashed", color = "blue") +
annotate("text", x = observed, y = max(probs) * 0.5,
label = "Observed", color = "red", size = 4, angle = 90) +
labs(
title = "Outbreak Detection Using Poisson Distribution",
subtitle = paste0("Observed = ", observed, " cases vs Expected λ = ", lambda_historical),
x = "Number of Cases", y = "Probability"
) +
theme_minimal()Statistical Test:
Public Health Interpretation:
The observed 9 cases is so far from the expected 4 cases that it’s unlikely to be random chance alone. Public health officials should investigate potential sources (contaminated water, foodborne illness, imported case, etc.).
This is one of the most commonly confused concepts in statistics! Understanding the difference is critical for interpreting research findings correctly.
Standard Deviation (SD): Describes variability in the population/sample
Standard Error (SE): Describes precision of the sample mean
Standard Error of the Mean: \[SE = \frac{s}{\sqrt{n}}\]
where \(s\) = sample standard deviation, \(n\) = sample size
# Demonstrate SE decreases with sample size
set.seed(553)
population <- rnorm(10000, mean = 120, sd = 20) # True population
sample_sizes <- c(10, 30, 50, 100, 500)
results <- tibble(
n = sample_sizes,
SD = numeric(length(sample_sizes)),
SE = numeric(length(sample_sizes))
)
for (i in seq_along(sample_sizes)) {
sample_data <- sample(population, sample_sizes[i])
results$SD[i] <- sd(sample_data)
results$SE[i] <- sd(sample_data) / sqrt(sample_sizes[i])
}
print(results)## # A tibble: 5 × 3
## n SD SE
## <dbl> <dbl> <dbl>
## 1 10 16.1 5.09
## 2 30 22.3 4.06
## 3 50 20.2 2.85
## 4 100 19.8 1.98
## 5 500 20.0 0.897
# Visualize
results %>%
pivot_longer(cols = c(SD, SE), names_to = "Measure", values_to = "Value") %>%
ggplot(aes(x = n, y = Value, color = Measure, group = Measure)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(
title = "SD vs SE: Effect of Sample Size",
subtitle = "SD remains stable; SE decreases with √n",
x = "Sample Size (n)", y = "Value"
) +
theme_minimal()Why: Point estimates alone don’t convey uncertainty. CIs provide a range of plausible values for the true population parameter.
When: Report CIs for all estimates—means, proportions, regression coefficients, odds ratios, etc.
Watch Out:
95% Confidence Interval: \[\bar{x} \pm t_{\alpha/2, n-1} \times SE\]
where:
# Example: Blood pressure in hypertensive patients
set.seed(553)
bp_sample <- rnorm(50, mean = 145, sd = 18)
# Calculate 95% CI
mean_bp <- mean(bp_sample)
se_bp <- sd(bp_sample) / sqrt(length(bp_sample))
t_crit <- qt(0.975, df = length(bp_sample) - 1) # 95% CI
ci_lower <- mean_bp - t_crit * se_bp
ci_upper <- mean_bp + t_crit * se_bp
cat("Mean BP:", round(mean_bp, 2), "mmHg\n")## Mean BP: 147.03 mmHg
## 95% CI: [ 142.26 , 151.81 ]
## [1] 142.2557 151.8071
## attr(,"conf.level")
## [1] 0.95
# Visualize multiple samples
set.seed(553)
n_samples <- 20
sample_means <- numeric(n_samples)
ci_lowers <- numeric(n_samples)
ci_uppers <- numeric(n_samples)
true_mean <- 145
for (i in 1:n_samples) {
sample_i <- rnorm(50, mean = true_mean, sd = 18) # Draw sample
sample_means[i] <- mean(sample_i) # Calculate mean
ci_i <- t.test(sample_i)$conf.int # Calculate 95% CI
ci_lowers[i] <- ci_i[1] # Lower bound
ci_uppers[i] <- ci_i[2] # Upper bound
}
tibble(
sample = 1:n_samples,
mean = sample_means,
lower = ci_lowers,
upper = ci_uppers,
contains_true = (ci_lowers <= true_mean) & (ci_uppers >= true_mean)
) %>%
ggplot(aes(x = sample, y = mean, color = contains_true)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.3) +
geom_hline(yintercept = true_mean, linetype = "dashed", color = "red", linewidth = 1) +
scale_color_manual(values = c("TRUE" = "steelblue", "FALSE" = "orange"),
labels = c("TRUE" = "Contains True Mean", "FALSE" = "Misses True Mean")) +
labs(
title = "20 Samples: 95% Confidence Intervals",
subtitle = "Red line = True population mean (145 mmHg)",
x = "Sample Number", y = "Systolic BP (mmHg)",
color = NULL
) +
theme_minimal() +
theme(legend.position = "bottom")What the Plot Shows:
The Correct Interpretation:
“If we repeated this study many times, approximately 95% of the confidence intervals would contain the true population mean.”
In this simulation, you should see roughly 19 out of 20 CIs (95%) capturing the red line, with about 1 CI (5%) missing it.
Common Misconceptions to Avoid:
Why: Hypothesis testing provides a structured framework for making decisions under uncertainty—essential for evidence-based public health.
When: Use when you want to test whether an observed effect is likely due to chance or represents a real phenomenon.
Watch Out:
State hypotheses:
Choose significance level (\(\alpha\), typically 0.05)
Calculate test statistic (t, F, χ², etc.)
Determine p-value
Make decision:
| \(H_0\) True | \(H_0\) False | |
|---|---|---|
| Reject \(H_0\) | Type I Error (α) | Correct Decision |
| Fail to reject \(H_0\) | Correct Decision | Type II Error (β) |
# Example: Testing if mean BP differs from 140 mmHg
set.seed(553)
bp_sample <- rnorm(40, mean = 145, sd = 18)
# One-sample t-test
t_test_result <- t.test(bp_sample, mu = 140)
print(t_test_result)##
## One Sample t-test
##
## data: bp_sample
## t = 2.5127, df = 39, p-value = 0.01623
## alternative hypothesis: true mean is not equal to 140
## 95 percent confidence interval:
## 141.2370 151.4495
## sample estimates:
## mean of x
## 146.3432
##
## Interpretation:
## H0: μ = 140 mmHg
## HA: μ ≠ 140 mmHg
## p-value = 0.0162
if (t_test_result$p.value < 0.05) {
cat("Decision: Reject H0 (p < 0.05)\n")
cat("Conclusion: Mean BP is significantly different from 140 mmHg\n")
} else {
cat("Decision: Fail to reject H0 (p ≥ 0.05)\n")
cat("Conclusion: Insufficient evidence that mean BP differs from 140 mmHg\n")
}## Decision: Reject H0 (p < 0.05)
## Conclusion: Mean BP is significantly different from 140 mmHg
Why: t-tests are the foundation for comparing means—the most common research question in public health.
When:
Watch Out:
Test statistic: \[t = \frac{\bar{x} - \mu_0}{SE} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}\]
# Example: Is mean BMI in our sample different from population mean (26.5)?
set.seed(553)
bmi_sample <- rnorm(35, mean = 28, sd = 4.5)
t.test(bmi_sample, mu = 26.5)##
## One Sample t-test
##
## data: bmi_sample
## t = 2.5131, df = 34, p-value = 0.01687
## alternative hypothesis: true mean is not equal to 26.5
## 95 percent confidence interval:
## 26.82870 29.60729
## sample estimates:
## mean of x
## 28.218
Test statistic: \[t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{diff}}\]
# Example: Compare cholesterol levels between treatment and control
set.seed(553)
treatment <- rnorm(40, mean = 185, sd = 25)
control <- rnorm(40, mean = 200, sd = 25)
# Independent t-test
t.test(treatment, control, var.equal = TRUE)##
## Two Sample t-test
##
## data: treatment and control
## t = -2.8209, df = 78, p-value = 0.00607
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.594033 -4.242742
## sample estimates:
## mean of x mean of y
## 186.8656 201.2840
# Visualize
tibble(
cholesterol = c(treatment, control),
group = rep(c("Treatment", "Control"), each = 40)
) %>%
ggplot(aes(x = group, y = cholesterol, fill = group)) +
geom_boxplot(alpha = 0.7, width = 0.5) +
geom_jitter(width = 0.2, alpha = 0.3) +
labs(
title = "Cholesterol Levels: Treatment vs Control",
x = NULL, y = "Total Cholesterol (mg/dL)"
) +
theme_minimal() +
theme(legend.position = "none")Test statistic: \[t = \frac{\bar{d}}{SE_d} = \frac{\bar{d}}{s_d/\sqrt{n}}\]
where \(d\) = difference scores (after - before)
# Example: Blood pressure before and after intervention
set.seed(553)
n <- 30
before <- rnorm(n, mean = 145, sd = 15)
after <- before - rnorm(n, mean = 8, sd = 5) # Average reduction of 8 mmHg
# Paired t-test
t.test(after, before, paired = TRUE)##
## Paired t-test
##
## data: after and before
## t = -10.551, df = 29, p-value = 1.928e-11
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -10.77337 -7.27483
## sample estimates:
## mean difference
## -9.024099
# Visualize
tibble(
patient = rep(1:n, 2),
bp = c(before, after),
time = rep(c("Before", "After"), each = n)
) %>%
ggplot(aes(x = time, y = bp, group = patient)) +
geom_line(alpha = 0.3) +
geom_point(aes(color = time), size = 3, alpha = 0.7) +
labs(
title = "Paired Data: Blood Pressure Before/After Intervention",
x = NULL, y = "Systolic BP (mmHg)"
) +
theme_minimal() +
theme(legend.position = "none")In this course, we’ll extend these foundational concepts to:
Each method builds on what you learned in EPI 552!
Instructions: Try these questions on your own before looking at the answer keys below. You can work through the calculations by hand or use R to check your work.
Classify each variable as nominal, ordinal, discrete, or continuous:
Answers:
Discrete - Countable values (0, 1, 2, 3, …). You can’t smoke 2.5 cigarettes.
Ordinal - Has natural ordering (Excellent > Good > Fair > Poor), but the intervals between categories aren’t necessarily equal.
Continuous - Can take any value within a range (e.g., 120.5 mmHg, 120.53 mmHg). Measured on a continuous scale.
Nominal - Categories with no inherent order. Private insurance isn’t “more” or “less” than Medicare—they’re just different types.
Key Insight: Getting the variable type right determines which statistical tests you can use. Correct classification guides your entire analysis strategy:
The wrong choice leads to invalid results: Using linear regression for ordinal outcomes treats categories as equally spaced (Excellent→Good may not equal Good→Fair). Using ANOVA for count outcomes ignores the discrete, non-negative nature of the data.
Given the following hospital length of stay data (days):
2, 3, 3, 4, 5, 5, 5, 6, 7, 28
Answers:
Mean: \(\bar{x} = \frac{2+3+3+4+5+5+5+6+7+28}{10} = \frac{68}{10} = 6.8\) days
Median: Order the data (already ordered). With
n=10 (even), take average of 5th and 6th values:
\(\text{Median} = \frac{5+5}{2} = 5\)
days
Mode: 5 days (appears 3 times, most frequent)
The median (5 days) best represents typical length of stay.
Reasoning:
Public Health Implication: If you reported the mean (6.8 days) to hospital administrators, they might overestimate typical resource needs. The median (5 days) better reflects the typical patient experience.
A study reports: “Mean systolic BP = 135 mmHg (95% CI: 130-140)”
Answers:
Correct interpretation:
“If we repeated this study many times with different samples from the same population, approximately 95% of the confidence intervals we calculated would contain the true population mean systolic blood pressure.”
Practical interpretation: “We are 95% confident that the true average systolic BP in the population lies between 130 and 140 mmHg.”
Common WRONG interpretation (avoid this!): “There’s
a 95% probability that the true mean is between 130 and 140.” ❌
(The true mean is fixed—it either is or isn’t in this interval. The
probability statement is about the method, not this particular
interval.)
Yes, this is significantly different from 128 mmHg (at α = 0.05).
Reasoning:
Clinical note: While statistically significant, a 7 mmHg difference (135 vs 128) may or may not be clinically meaningful depending on the context. Always consider both statistical and practical significance!
You conduct a t-test comparing mean BMI between smokers and non-smokers. The p-value is 0.08.
Answers:
Fail to reject the null hypothesis (p = 0.08 > α = 0.05)
Important: We do NOT say “accept the null” or “prove there’s no difference”—we simply lack sufficient evidence to reject it.
Interpretation requires caution:
Not “proof of no effect”: p = 0.08 is close to the significance threshold. There may be a real difference that we failed to detect (Type II error), especially if sample size was small.
Effect size matters: Even if not statistically significant, if the mean difference is large (e.g., 3-4 BMI points), this could still be clinically/practically important.
Consider the mechanism: Smoking is often associated with lower BMI due to metabolic effects and appetite suppression, but this doesn’t mean smoking is “healthy”—it’s associated with numerous other serious health risks.
Don’t over-interpret: This result alone shouldn’t change public health messaging about smoking cessation.
Sample size: Small n → low statistical power → might miss real effects
Effect size: What was the actual difference in mean BMI?
Confidence interval: e.g., 95% CI for difference: (-0.2, 2.8)
Standard deviations: How variable is BMI in each group?
Confounders: Are groups comparable in age, sex, socioeconomic status?
Power analysis: Did the study have adequate power (typically 80%) to detect a meaningful difference?
Bottom line: A p-value of 0.08 is borderline. Don’t dismiss the finding entirely, but also don’t over-interpret. Consider the broader context, clinical significance, and study limitations.
For each scenario, identify whether a binomial or Poisson distribution is more appropriate:
Answers:
Key Distinction: - Binomial: Fixed n, counting “successes” out of n trials - Poisson: Open-ended, counting events per unit time/space
Bridge: When n is large and p is small, binomial ≈ Poisson (with λ = np)
A county typically reports an average of λ = 2 cases of meningitis per year.
Answers:
\[P(X = 0) = \frac{2^0 \cdot e^{-2}}{0!} = \frac{1 \cdot e^{-2}}{1} = e^{-2} \approx 0.135\]
Interpretation: There’s about a 13.5% chance of seeing zero cases in a given year—uncommon but not extremely rare.
Calculate using complement: \(P(X \geq 5) = 1 - P(X \leq 4)\)
\[P(X \leq 4) = \sum_{k=0}^{4} \frac{2^k e^{-2}}{k!}\]
Using R or tables: \(P(X \leq 4) \approx 0.947\)
Therefore: \(P(X \geq 5) = 1 - 0.947 = 0.053\)
Interpretation: There’s about a 5.3% chance of seeing 5+ cases by random chance alone.
Observed: X = 6 cases
Calculate: \(P(X \geq 6 | \lambda = 2)\)
\[P(X \geq 6) = 1 - P(X \leq 5) \approx 1 - 0.983 = 0.017\]
Decision: p = 0.017 < 0.05
Conclusion: YES, this constitutes evidence of an outbreak.
The probability of seeing 6 or more cases by chance alone is only 1.7%, which is below our significance threshold of 5%. This warrants further investigation—possible sources could include: - Environmental contamination - College dormitory outbreak - Imported case with secondary transmission - Reporting artifact (delayed cases from previous year)
Public health action: Investigate immediately, consider prophylaxis for close contacts, review vaccination coverage.
Note: This is the same statistical framework used in the outbreak detection example earlier in this lecture (Historical average = 4, Observed = 9).
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 lubridate_1.9.4 forcats_1.0.1 stringr_1.5.1
## [5] dplyr_1.1.4 purrr_1.1.0 readr_2.1.5 tidyr_1.3.1
## [9] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 sass_0.4.10 generics_0.1.4
## [4] xml2_1.4.0 stringi_1.8.7 hms_1.1.4
## [7] digest_0.6.39 magrittr_2.0.3 timechange_0.3.0
## [10] evaluate_1.0.5 grid_4.5.2 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 jsonlite_2.0.0 promises_1.3.3
## [16] httr_1.4.7 viridisLite_0.4.2 scales_1.4.0
## [19] textshaping_1.0.1 jquerylib_0.1.4 cli_3.6.5
## [22] rlang_1.1.7 withr_3.0.2 cachem_1.1.0
## [25] yaml_2.3.10 otel_0.2.0 tools_4.5.2
## [28] telegram.bot_3.0.2 tzdb_0.5.0 httpuv_1.6.16
## [31] curl_7.0.0 vctrs_0.6.5 R6_2.6.1
## [34] lifecycle_1.0.5 pkgconfig_2.0.3 bslib_0.9.0
## [37] pillar_1.11.0 later_1.4.5 gtable_0.3.6
## [40] glue_1.8.0 Rcpp_1.1.1 systemfonts_1.3.1.9000
## [43] xfun_0.53 tidyselect_1.2.1 rstudioapi_0.17.1
## [46] knitr_1.51 farver_2.1.2 htmltools_0.5.9
## [49] labeling_0.4.3 svglite_2.2.1 rmarkdown_2.30
## [52] compiler_4.5.2 S7_0.2.0 askpass_1.2.1
## [55] openssl_2.3.3
Next Lecture: ANOVA - Comparing Means Across Multiple Groups
Reminder: Complete Lab 01 (Review Exercises) by end of class Thursday