By the end of this lecture, you will be able to explain why multiple t-tests inflate the Type I error rate, state the ANOVA hypotheses, fit and interpret a one-way ANOVA in R, follow up a significant F-test with Tukey post-hoc comparisons, and report effect sizes (η²) alongside p-values.
ANOVA addresses the multiple comparisons problem that arises when comparing 3+ groups: testing every pair separately inflates the overall chance of a false positive.
Real-world value: You can confidently compare multiple treatments, exposure levels, or populations without inflating your chance of false discoveries.
✓ Comparing 3 or more group means (e.g., 3 medications, 5 counties, 4 pollution levels)
✓ One categorical predictor (treatment type, exposure category, region)
✓ Continuous outcome (blood pressure, BMD, vaccination rate)
✓ Independent observations within each group
If you only have 2 groups: Use an independent samples t-test instead
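For the two-group case, the t-test is a one-liner in R. A minimal sketch with simulated data (the data frame `d` and its variable names are illustrative, not from the lecture's example):

```r
# Hypothetical two-group comparison: independent-samples t-test
set.seed(1)
d <- data.frame(
  group = rep(c("Drug A", "Drug B"), each = 20),
  bp_reduction = c(rnorm(20, mean = 12, sd = 4), rnorm(20, mean = 9, sd = 4))
)
t.test(bp_reduction ~ group, data = d)  # Welch two-sample t-test by default
```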
⚠️ Assumption violations matter with small, unbalanced samples - Check normality, equal variances, independence - ANOVA is robust with n > 30 per group and similar group sizes
⚠️ Don’t skip post-hoc tests - A significant F-test only tells you “at least one mean differs” - Use Tukey HSD to identify which specific pairs differ
⚠️ Effect size matters - p < 0.05 doesn’t mean practically important - Always report η² (eta-squared): How much variance does the factor explain? - Guidelines: η² = 0.01 (small), 0.06 (medium), 0.14 (large)
⚠️ ANOVA only tests for differences, not relationships - If your predictor is ordinal (low/medium/high), ANOVA doesn’t test for a linear trend - Consider regression with dummy variables for more nuanced analysis
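For a sense of what that regression formulation looks like, here is a minimal sketch on simulated data (the names `exposure` and `fev` are illustrative). The F-test from `lm()` matches `aov()` exactly; the dummy coding just becomes explicit:

```r
# One-way ANOVA expressed as regression with dummy variables (simulated data)
set.seed(2)
d <- data.frame(
  exposure = factor(rep(c("Low", "Medium", "High"), each = 15),
                    levels = c("Low", "Medium", "High")),
  fev = rnorm(45, mean = rep(c(3.8, 3.5, 3.1), each = 15), sd = 0.4)
)
fit <- lm(fev ~ exposure, data = d)
anova(fit)    # same F-test as aov(fev ~ exposure, data = d)
summary(fit)  # dummy-coded effects relative to the reference level ("Low")
```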
ANOVA is one of the most powerful tools in your biostatistics toolkit. Here’s why you’ll use it throughout your public health career:
Example 1: Medication Effectiveness - Question: Does blood pressure reduction differ across three different medications? - ANOVA Solution: Compare mean BP reduction across treatment groups
Example 2: Environmental Health - Question: How does respiratory hospital admission rate vary by air pollution level (low/moderate/high)? - ANOVA Solution: Test for differences in mean admission rates across pollution categories
Example 3: Population Health - Question: Does COVID-19 vaccination rate differ across multiple counties? - ANOVA Solution: Compare vaccination rates across geographic units
Example 4: Occupational Health - Question: Do lung function outcomes differ across different occupational exposure groups? - ANOVA Solution: Test for differences in forced expiratory volume across exposure categories
Suppose you want to compare three or more group means. Your first instinct might be: “Why not just do multiple t-tests?”
The danger of multiple testing:
If you conduct independent t-tests for each pair of groups:
Example with 3 groups (3 comparisons):
\[\text{Experiment-wise error rate} = 1 - (1-0.05)^3 = 1 - 0.857 = 0.143\]
Your true probability of making at least one false positive is 14.3%, not 5%!
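The same arithmetic in R, extended to more groups (a quick illustration; this snippet is not part of the worked example later in the lecture):

```r
# Family-wise (experiment-wise) Type I error rate for all pairwise t-tests,
# assuming each test is independent and uses alpha = 0.05
alpha  <- 0.05
groups <- 3:6
m      <- choose(groups, 2)   # number of pairwise comparisons
fwer   <- 1 - (1 - alpha)^m   # probability of at least one false positive
data.frame(groups, comparisons = m, fwer = round(fwer, 3))
```

With 3 groups the rate is 0.143, as computed above; with 6 groups (15 comparisons) it exceeds 50%.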
Null hypothesis (H₀): All group means are equal (μ₁ = μ₂ = … = μₖ)
\[H_0: \mu_1 = \mu_2 = \cdots = \mu_k\]
Alternative hypothesis (H₁): At least one group mean is different from the others
\[H_1: \text{At least one mean is different}\]
| Term | Definition |
|---|---|
| Factor | The categorical independent variable (e.g., treatment type) |
| Levels | The distinct categories of the factor (e.g., 3 medications = 3 levels) |
| One-way ANOVA | One factor with multiple levels (our focus) |
| Two-way ANOVA | Two factors (beyond this course scope) |
Table 1: Key terminology used in ANOVA analysis
ANOVA works by partitioning total variation into two components:
The total deviation of all observations from the grand mean. This represents the overall variability in the outcome across all groups.
\[SS_Y = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \overline{y})^2\]
Where: - \(y_{ij}\) = observation j in group i - \(\overline{y}\) = grand mean (the average across all observations and groups)
Total variation equals the sum of between-group and within-group variation:
\[SS_Y = SS_T + SS_E\]
Treatment Sum of Squares (SST) - Variation between groups: \[SS_T = \sum_{i=1}^{k} n_i(\overline{y}_{i} - \overline{y})^2\]
This represents how much each group’s mean differs from the overall grand mean, weighted by group size.
Error Sum of Squares (SSE) - Variation within groups: \[SS_E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \overline{y}_{i})^2\]
This represents the variability of individual observations around their own group mean. This is the “leftover” or unexplained variation.
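A quick way to convince yourself of this partition is to verify it numerically. A sketch on simulated data (all names here are illustrative):

```r
# Verify the partition SS_Y = SS_T + SS_E on simulated data
set.seed(3)
g  <- factor(rep(c("A", "B", "C"), times = c(10, 12, 8)))
mu <- c(A = 5, B = 6, C = 7)  # hypothetical group means
y  <- rnorm(length(g), mean = mu[as.character(g)], sd = 1)

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
n_i         <- tapply(y, g, length)

ss_y <- sum((y - grand_mean)^2)                       # total variation
ss_t <- sum(n_i * (group_means - grand_mean)^2)       # between-group variation
ss_e <- sum((y - group_means[as.character(g)])^2)     # within-group variation
all.equal(ss_y, ss_t + ss_e)                          # TRUE
```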
Divide by degrees of freedom to get mean squares, which account for the number of parameters estimated:
\[MS_T = \frac{SS_T}{k-1}\]
\[MS_E = \frac{SS_E}{n-k}\]
Note: Degrees of freedom for treatment equals the number of groups minus 1 (k-1). Degrees of freedom for error equals the total sample size minus the number of groups (n-k).
The test statistic for ANOVA is the F-ratio:
\[F = \frac{MS_T}{MS_E}\]
Reject the null hypothesis if F exceeds the critical value \(F_{k-1, n-k, 1-\alpha}\) at significance level α (typically 0.05).
In practice, we compare the p-value to α: reject \(H_0\) if p-value < α.
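Putting the pieces together, a minimal sketch of the path from sums of squares to a p-value, using made-up values for the sums of squares:

```r
# Mean squares, F-ratio, and p-value from sums of squares (illustrative values)
k    <- 3      # number of groups
n    <- 90     # total sample size
ss_t <- 12.4   # treatment sum of squares (assumed)
ss_e <- 130.5  # error sum of squares (assumed)

ms_t <- ss_t / (k - 1)   # mean square, treatment
ms_e <- ss_e / (n - k)   # mean square, error
f    <- ms_t / ms_e
p    <- pf(f, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
round(c(F = f, p = p), 4)
```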
ANOVA requires three key assumptions. Meeting these assumptions ensures the validity of your p-values and inference.
Observations are independent. Each observation should not influence another. This assumption is violated in repeated measures designs, clustered data, or time series data.
Watch out: Correlated data (family members, repeated measurements, geographic clustering) violates independence and requires specialized methods.
Within each group, the outcome is approximately normally distributed. The distribution should resemble a bell curve.
How to check: - Visual: Q-Q plots for each group - Statistical: Shapiro-Wilk test (but sensitive with large n)
Watch out: Heavy skew or extreme outliers can distort results
All groups have equal population variances. Mathematically, this means: \(\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2\)
In other words, the spread of values around each group’s mean should be roughly the same across all groups.
How to check: - Visual: Boxplots or residual plots - Statistical: Levene’s test
Watch out: If one group’s variance is >3x another group’s variance, consider transformations or use Welch’s ANOVA
Good news: ANOVA is robust to moderate violations of these assumptions, particularly: - With larger sample sizes (typically n > 30 per group) - With balanced designs (similar group sizes) - Against non-normality if distributions are similar across groups
Bad news: - With very small, unbalanced samples, violations matter more - Outliers can have large effects on F-test validity
Does bone mineral density (BMD) differ across physical activity levels in older adults?
This example builds on data from the National Health and Nutrition Examination Survey (NHANES), accessed through the NHANES R package, focusing on adults aged 50+.
Why this matters: Understanding the relationship between physical activity and bone health can inform public health recommendations for preventing osteoporosis and fractures in aging populations.
# Load required packages and NHANES data
library(NHANES)   # NHANES survey data
library(dplyr)    # %>% pipe, filter(), mutate()
library(ggplot2)  # plotting
library(knitr)    # kable() for formatted tables
data(NHANES)
# Check if BMD exists, if not create simulated realistic BMD data
# Note: NHANES package may not include BMD in all versions
# This approach ensures the lecture works regardless
# Filter for adults 50+ and prepare data
set.seed(553) # For reproducibility
bmd_data <- NHANES %>%
filter(Age >= 50) %>%
filter(!is.na(PhysActive)) %>%
# Create activity groups based on physical activity and age
mutate(
activity_group = case_when(
PhysActive == "No" ~ "Low",
PhysActive == "Yes" & Age < 65 ~ "Moderate",
PhysActive == "Yes" & Age >= 65 ~ "High",
TRUE ~ NA_character_
),
activity_group = factor(activity_group, levels = c("Low", "Moderate", "High"))
) %>%
filter(!is.na(activity_group)) %>%
# Simulate realistic BMD data based on activity level and age
# Mean BMD decreases with age, increases with activity
mutate(
BMD = case_when(
activity_group == "Low" ~ rnorm(n(), mean = 0.90 - (Age - 50) * 0.002, sd = 0.12),
activity_group == "Moderate" ~ rnorm(n(), mean = 0.93 - (Age - 50) * 0.002, sd = 0.12),
activity_group == "High" ~ rnorm(n(), mean = 0.95 - (Age - 50) * 0.002, sd = 0.12)
),
# Ensure realistic bounds (BMD typically 0.6-1.2 g/cm²)
BMD = pmax(0.6, pmin(1.2, BMD))
) %>%
select(ID, Age, Gender, BMD, PhysActive, activity_group)
# Display first few rows
head(bmd_data) %>% kable(caption = "First 6 rows of BMD dataset")

| ID | Age | Gender | BMD | PhysActive | activity_group |
|---|---|---|---|---|---|
| 51654 | 66 | male | 0.9357377 | Yes | High |
| 51656 | 58 | male | 0.9248468 | Yes | Moderate |
| 51657 | 54 | male | 1.1872047 | Yes | Moderate |
| 51666 | 58 | female | 0.9373250 | Yes | Moderate |
| 51667 | 50 | male | 0.8940708 | Yes | Moderate |
| 51678 | 60 | male | 0.6881224 | No | Low |
##
## Note: This example uses simulated bone mineral density (BMD) data
## based on realistic relationships with age and physical activity.
## Sample size: 3143 adults aged 50+
# Summary statistics by group
summary_stats <- bmd_data %>%
group_by(activity_group) %>%
summarise(
n = n(),
Mean = mean(BMD, na.rm = TRUE),
SD = sd(BMD, na.rm = TRUE),
Median = median(BMD, na.rm = TRUE),
Min = min(BMD, na.rm = TRUE),
Max = max(BMD, na.rm = TRUE)
)
summary_stats %>%
kable(digits = 3,
        caption = "Descriptive statistics for BMD by physical activity level")

| activity_group | n | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Low | 1748 | 0.877 | 0.124 | 0.875 | 0.6 | 1.2 |
| Moderate | 882 | 0.914 | 0.115 | 0.914 | 0.6 | 1.2 |
| High | 513 | 0.909 | 0.120 | 0.909 | 0.6 | 1.2 |
ggplot(bmd_data, aes(x = activity_group, y = BMD, fill = activity_group)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3, size = 0.5) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Bone Mineral Density by Physical Activity Level",
    subtitle = "NHANES, Adults aged 50+",
x = "Physical Activity Level",
y = "Bone Mineral Density (g/cm²)",
fill = "Activity Level"
) +
theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Boxplots of bone mineral density by physical activity level. The boxes show the interquartile range (IQR), horizontal lines show medians, and points show potential outliers beyond 1.5×IQR.
Initial observations: - Median BMD appears higher in the Moderate and High activity groups than in the Low group - Variability (IQR width) looks similar across groups - A few potential outliers in each group
Null hypothesis (H₀): Mean BMD is equal across all three activity groups
\[H_0: \mu_{Low} = \mu_{Moderate} = \mu_{High}\]
Alternative hypothesis (H₁): At least one group’s mean BMD differs from the others
# Fit one-way ANOVA
anova_model <- aov(BMD ~ activity_group, data = bmd_data)
# Display ANOVA table
anova_table <- summary(anova_model)
print(anova_table)

## Df Sum Sq Mean Sq F value Pr(>F)
## activity_group 2 0.99 0.4939 33.89 2.74e-15 ***
## Residuals 3140 45.76 0.0146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create a formatted ANOVA table
anova_df <- data.frame(
Source = c("Activity Group", "Residuals"),
DF = c(anova_table[[1]]$Df[1], anova_table[[1]]$Df[2]),
`Sum Sq` = c(anova_table[[1]]$`Sum Sq`[1], anova_table[[1]]$`Sum Sq`[2]),
`Mean Sq` = c(anova_table[[1]]$`Mean Sq`[1], anova_table[[1]]$`Mean Sq`[2]),
`F value` = c(anova_table[[1]]$`F value`[1], NA),
`Pr(>F)` = c(anova_table[[1]]$`Pr(>F)`[1], NA)
)
kable(anova_df, digits = 4,
caption = "ANOVA table for BMD by physical activity level",
      col.names = c("Source", "DF", "Sum Sq", "Mean Sq", "F value", "p-value"))

| Source | DF | Sum Sq | Mean Sq | F value | p-value |
|---|---|---|---|---|---|
| Activity Group | 2 | 0.9878 | 0.4939 | 33.8923 | 0 |
| Residuals | 3140 | 45.7584 | 0.0146 | NA | NA |
F-statistic: 33.89
p-value: 2.74e-15
Conclusion: Since p < 0.05, we reject the null hypothesis. There is statistically significant evidence that mean BMD differs across at least one pair of physical activity groups.
But which groups differ? The F-test doesn’t tell us which specific pairs are different. For that, we need post-hoc tests.
When ANOVA is significant (p < 0.05), we conduct post-hoc tests to identify which specific pairs of groups differ.
Why Tukey-Kramer? - Controls family-wise error rate at α = 0.05 across ALL pairwise comparisons - More powerful than Bonferroni correction - Provides confidence intervals for mean differences
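The output below comes from applying `TukeyHSD()` to the fitted ANOVA model; the call (omitted above, with `tukey_results` reused in the code that follows) looks like this:

```r
# Tukey HSD post-hoc comparisons on the fitted ANOVA model
tukey_results <- TukeyHSD(anova_model, conf.level = 0.95)
print(tukey_results)
```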
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = BMD ~ activity_group, data = bmd_data)
##
## $activity_group
## diff lwr upr p adj
## Moderate-Low 0.037533179 0.02584217 0.04922419 0.0000000
## High-Low 0.032013011 0.01779954 0.04622648 0.0000004
## High-Moderate -0.005520168 -0.02123730 0.01019697 0.6884243
# Extract and format results
tukey_summary <- as.data.frame(tukey_results$activity_group)
tukey_summary$Comparison <- rownames(tukey_summary)
tukey_summary <- tukey_summary[, c("Comparison", "diff", "lwr", "upr", "p adj")]
rownames(tukey_summary) <- NULL  # drop row names so kable doesn't repeat the labels
# Add interpretation columns
tukey_summary$Significant <- ifelse(tukey_summary$`p adj` < 0.05, "Yes", "No")
tukey_summary$Direction <- ifelse(
tukey_summary$Significant == "Yes",
ifelse(tukey_summary$diff > 0, "First group higher", "Second group higher"),
"No difference"
)
kable(tukey_summary, digits = 4,
      caption = "Tukey HSD post-hoc comparisons with 95% confidence intervals")

| Comparison | diff | lwr | upr | p adj | Significant | Direction |
|---|---|---|---|---|---|---|
| Moderate-Low | 0.0375 | 0.0258 | 0.0492 | 0.0000 | Yes | First group higher |
| High-Low | 0.0320 | 0.0178 | 0.0462 | 0.0000 | Yes | First group higher |
| High-Moderate | -0.0055 | -0.0212 | 0.0102 | 0.6884 | No | No difference |
## POST-HOC TEST INTERPRETATION
## ============================
for (i in 1:nrow(tukey_summary)) {
row <- tukey_summary[i, ]
cat("Comparison:", row$Comparison, "\n")
cat(" Estimated mean difference:", round(row$diff, 4), "g/cm²\n")
cat(" 95% Confidence Interval: [",
round(row$lwr, 4), ", ", round(row$upr, 4), "]\n")
cat(" Adjusted p-value:", format(row$`p adj`, digits = 4), "\n")
cat(" Statistical Significance:", row$Significant, "\n")
cat(" Direction:", row$Direction, "\n\n")
}

## Comparison: Moderate-Low
## Estimated mean difference: 0.0375 g/cm²
## 95% Confidence Interval: [ 0.0258 , 0.0492 ]
## Adjusted p-value: 0
## Statistical Significance: Yes
## Direction: First group higher
##
## Comparison: High-Low
## Estimated mean difference: 0.032 g/cm²
## 95% Confidence Interval: [ 0.0178 , 0.0462 ]
## Adjusted p-value: 4.085e-07
## Statistical Significance: Yes
## Direction: First group higher
##
## Comparison: High-Moderate
## Estimated mean difference: -0.0055 g/cm²
## 95% Confidence Interval: [ -0.0212 , 0.0102 ]
## Adjusted p-value: 0.6884
## Statistical Significance: No
## Direction: No difference
Moderate vs Low activity: People with moderate activity have significantly higher BMD (+0.038 g/cm²), p < 0.001
High vs Low activity: People with high activity have significantly higher BMD (+0.032 g/cm²), p < 0.001
High vs Moderate activity: No significant difference (p = 0.688), suggesting the benefit plateaus at moderate activity levels
Statistical significance ≠ Practical importance
How much of the variance in BMD is explained by physical activity? We calculate eta-squared (η²), a measure of effect size.
# Calculate eta-squared
ss_treatment <- anova_table[[1]]$`Sum Sq`[1]
ss_total <- sum(anova_table[[1]]$`Sum Sq`)
eta_squared <- ss_treatment / ss_total

We calculate eta-squared (η²) using the formula η² = SST / SSY, which represents the proportion of total variance explained by the treatment factor.
Our calculated effect size is η² = 0.0211. This means physical activity explains approximately 2.11% of the variance in BMD.
To interpret this effect size, we use conventional guidelines: η² = 0.01 (small), η² = 0.06 (medium), and η² = 0.14 (large). In our case, the effect is small.
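If the effectsize package is available (an assumption; it is not used elsewhere in this lecture), the same quantity can be computed directly, along with a confidence interval:

```r
# Alternative: eta-squared via the effectsize package (assumed installed)
library(effectsize)
eta_squared(anova_model, partial = FALSE)
```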
While the F-test indicates a statistically significant difference across activity groups, the practical magnitude of this effect is modest. Other factors not included in our model—such as genetics, age, gender, calcium intake, and hormonal status—likely play substantial roles in determining bone mineral density.
This example shows η² = 0.021 (about 2%). This is a small effect size: physical activity is statistically associated with BMD, but it explains only a small fraction of the variation in the outcome.
Lesson: Always report effect sizes alongside p-values to give readers context about practical importance.
Always check that the fitted ANOVA model meets its assumptions. We examine four diagnostic plots:
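As noted in the quick-reference table at the end of this lecture, these plots come from base R's `plot()` method applied to the fitted model:

```r
# Four standard diagnostic plots for the fitted ANOVA model
par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid
plot(anova_model)
par(mfrow = c(1, 1))  # reset the plotting layout
```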
Four diagnostic plots for ANOVA model assumptions. (1) Residuals vs Fitted: should show random scatter with no pattern. (2) Q-Q Plot: points should fall close to the diagonal line, indicating normality. (3) Scale-Location: should be roughly horizontal, indicating constant variance. (4) Residuals vs Leverage: look for points outside Cook’s distance lines, indicating influential outliers.
Interpreting the diagnostic plots:
The Residuals vs Fitted plot (top-left) should show random scatter with no visible pattern. Points should be roughly equally spread above and below zero, indicating that the independence assumption holds and that each group's mean adequately summarizes its observations.
The Q-Q plot (top-right) assesses normality by comparing quantiles of the residuals to quantiles of a standard normal distribution. Points should fall close to the diagonal line. Deviations at the tails, particularly pronounced departures, suggest departures from normality.
The Scale-Location plot (bottom-left) tests the homogeneity of variance assumption. The red line should be roughly horizontal, and points should not form a clear upward or downward trend. A consistently rising or falling trend would suggest that variance increases or decreases across fitted values.
The Residuals vs Leverage plot (bottom-right) identifies outliers and influential observations. Points outside the dashed Cook’s distance lines are highly influential and warrant investigation—they may disproportionately affect parameter estimates.
ANOVA is robust to moderate violations of assumptions, especially when sample sizes are large and group sizes are balanced. With our substantial sample size and comparable group sizes, minor departures from these assumptions are unlikely to invalidate our conclusions.
# Test normality within each group
for (group in levels(bmd_data$activity_group)) {
group_data <- bmd_data %>% filter(activity_group == group)
shapiro_test <- shapiro.test(group_data$BMD)
cat(group, "activity:\n")
cat(" W =", round(shapiro_test$statistic, 4),
", p-value =", format.pval(shapiro_test$p.value, digits = 3), "\n")
cat(" Interpretation:",
ifelse(shapiro_test$p.value < 0.05,
"Reject normality (p < 0.05)",
"Do not reject normality (p >= 0.05)"), "\n\n")
}

## Low activity:
## W = 0.9957 , p-value = 7e-05
## Interpretation: Reject normality (p < 0.05)
##
## Moderate activity:
## W = 0.9968 , p-value = 0.0791
## Interpretation: Do not reject normality (p >= 0.05)
##
## High activity:
## W = 0.9946 , p-value = 0.0694
## Interpretation: Do not reject normality (p >= 0.05)
# Test homogeneity of variance
library(car)
levene_test <- leveneTest(BMD ~ activity_group, data = bmd_data)
print(levene_test)

## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 3.0766 0.04625 *
## 3140
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\nInterpretation:",
    ifelse(levene_test$`Pr(>F)`[1] < 0.05,
           "Reject equal variances (p < 0.05)",
           "Do not reject equal variances (p >= 0.05)"), "\n")

##
## Interpretation: Reject equal variances (p < 0.05)
To formally test our assumptions, we conduct a Shapiro-Wilk test for normality within each activity group and Levene’s test for homogeneity of variance across groups. The Shapiro-Wilk test examines whether the distribution of BMD within each activity level departs significantly from normality. The null hypothesis (H₀) is that the data are normally distributed; a p-value greater than 0.05 suggests we do not have evidence against normality. Levene’s test evaluates whether variance is approximately equal across the three activity groups. A non-significant result (p ≥ 0.05) indicates homogeneous variances across groups, supporting the equal variance assumption underlying ANOVA.
Even if these formal tests suggest minor violations, ANOVA remains trustworthy with large, balanced samples. The robustness of ANOVA to assumption violations improves as sample size and balance increase.
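Given the borderline Levene result above, a reasonable sensitivity check (not run in the original analysis) is Welch's ANOVA, which does not assume equal variances; a sketch:

```r
# Welch's ANOVA as a sensitivity check: robust to unequal variances
oneway.test(BMD ~ activity_group, data = bmd_data, var.equal = FALSE)
```

If Welch's ANOVA reaches the same conclusion as the standard F-test, the mild variance difference is unlikely to matter.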
We conducted a one-way ANOVA to test whether bone mineral density (BMD) differed across three levels of physical activity (low, moderate, high). There were significant differences in mean BMD across activity groups, F(2, 3140) = 33.89, p < 0.001, with an effect size of η² = 0.021.
Interpretation: The F-statistic of 33.89 means the between-group variation is about 34 times larger than the within-group variation. The p-value (< 0.001) indicates this difference is extremely unlikely to have occurred by chance if all groups truly had the same mean.
Post-hoc Tukey HSD tests revealed that individuals with moderate physical activity had significantly higher mean BMD compared to those with low activity (mean difference = 0.038, 95% CI [0.026, 0.049], p < 0.001). Similarly, high activity was associated with higher BMD compared to low activity (mean difference = 0.032, 95% CI [0.018, 0.046], p < 0.001). The difference between moderate and high activity groups was not statistically significant (p = 0.688), suggesting that the benefit of moderate activity approaches that of high activity.
While statistically significant, the effect size was small (η² = 0.021), indicating that physical activity explains only 2.1% of variance in BMD. Other unmeasured factors (e.g., genetics, nutrition, hormonal status) likely play larger roles in determining bone mineral density.
✓ Comparing 3 or more group means
✓ Single categorical predictor (one-way ANOVA)
✓ Want to control Type I error rate across multiple comparisons
✓ Continuous outcome variable with approximately normal distribution
| Task | R Function |
|---|---|
| Fit ANOVA | aov() or lm() |
| View ANOVA table | summary(aov_model) or anova(lm_model) |
| Post-hoc test | TukeyHSD() |
| Check normality | shapiro.test() |
| Check equal variances | leveneTest() (requires car package) |
| Model diagnostics | plot(aov_model) creates four diagnostic plots |
| Effect size | Calculate η² = SST / SSY (treatment variance / total variance) |
Table 2: R functions for ANOVA analysis
Last updated: February 05, 2026