The Analysis of Variance (ANOVA) is a statistical framework used to compare the means of three or more groups to determine whether at least one group mean is significantly different from the others. While a t-test is limited to comparing two groups, ANOVA analyzes multiple groups in a single test, avoiding the inflation of the Type I error (false positive) rate that comes from running many pairwise t-tests.
The fundamental logic of ANOVA is to partition the total variability found in a dataset into two components:

1. Between-group variability: variation due to differences between the group means (the effect of the treatment).
2. Within-group variability: variation due to individual differences or measurement error within each group.
To understand ANOVA, we must break down the sum of squares. Let \(n\) be the total number of observations, \(k\) be the number of groups, and \(n_j\) be the number of observations in group \(j\).
The null hypothesis (\(H_0\)) assumes all group means are equal, while the alternative hypothesis (\(H_a\)) assumes at least one mean is different.
\[H_0: \mu_1 = \mu_2 = \dots = \mu_k\] \[H_a: \text{At least one } \mu_j \text{ is different.}\]
The total variation in the data is called the Total Sum of Squares (SST):
\[SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{..})^2\]
Where:

* \(x_{ij}\) is the \(i\)-th observation in the \(j\)-th group.
* \(\bar{x}_{..}\) is the grand mean of all observations.
This is decomposed into: \[SST = SSB + SSW\]
1. Sum of Squares Between (SSB): Measures the variation between groups. \[SSB = \sum_{j=1}^{k} n_j (\bar{x}_{.j} - \bar{x}_{..})^2\]
2. Sum of Squares Within (SSW): Measures the variation within groups (Error). \[SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{.j})^2\]
We divide each sum of squares by its respective degrees of freedom (\(df\)) to obtain the mean squares, our two variance estimates:
\[MSB = \frac{SSB}{k - 1}\] \[MSW = \frac{SSW}{n - k}\]
The test statistic is the ratio of the variance between groups to the variance within groups:
\[F = \frac{MSB}{MSW}\]
Under \(H_0\), both mean squares estimate the same error variance, so \(F\) should be close to 1. If the observed \(F\) is large enough that its p-value falls below the chosen significance level, we reject the null hypothesis.
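To make these formulas concrete, here is a small, self-contained sketch in R that computes the sums of squares, mean squares, and the \(F\) statistic by hand. The vectors `x` and `g` below are a made-up toy dataset, purely for illustration:

```r
# Toy data: three groups of three observations each (values are arbitrary)
x <- c(4, 5, 6,  7, 8, 9,  5, 6, 7)
g <- factor(rep(c("G1", "G2", "G3"), each = 3))

n <- length(x)                      # total number of observations
k <- nlevels(g)                     # number of groups
grand_mean  <- mean(x)              # grand mean (x-bar..)
group_means <- tapply(x, g, mean)   # group means (x-bar.j)
n_j         <- tapply(x, g, length) # group sizes

SST <- sum((x - grand_mean)^2)                  # total sum of squares
SSB <- sum(n_j * (group_means - grand_mean)^2)  # between-group sum of squares
SSW <- sum((x - group_means[g])^2)              # within-group sum of squares

MSB <- SSB / (k - 1)   # mean square between
MSW <- SSW / (n - k)   # mean square within
F_stat <- MSB / MSW

all.equal(SST, SSB + SSW)  # the decomposition SST = SSB + SSW holds
F_stat                     # same F value as reported by summary(aov(x ~ g))
```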
A farmer wants to test three different types of fertilizers (A, B, and C) to see if they result in different average crop yields (in bushels per acre).
Let’s generate some sample data for this experiment.
```r
# Creating dummy data: 10 observations per fertilizer type
set.seed(42)
fertilizer_data <- data.frame(
  yield = c(rnorm(10, mean = 20, sd = 2),  # Fertilizer A
            rnorm(10, mean = 25, sd = 2),  # Fertilizer B
            rnorm(10, mean = 22, sd = 2)), # Fertilizer C
  type = factor(rep(c("A", "B", "C"), each = 10))
)
```
```r
# Preview data
head(fertilizer_data)
##      yield type
## 1 22.74192    A
## 2 18.87060    A
## 3 20.72626    A
## 4 21.26573    A
## 5 20.80854    A
## 6 19.78775    A
```
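Before fitting the model, it can help to glance at the observed group means and standard deviations. A quick sketch using base R's aggregate():

```r
# Observed mean and standard deviation of yield for each fertilizer type
aggregate(yield ~ type, data = fertilizer_data,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```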
We use the aov() function to perform the analysis.
```r
# Perform One-Way ANOVA
anova_results <- aov(yield ~ type, data = fertilizer_data)

# Display Summary Table
summary(anova_results)
##             Df Sum Sq Mean Sq F value Pr(>F)   
## type         2  74.28   37.14   5.935 0.0073 **
## Residuals   27 168.96    6.26                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
In the summary table above:

* Pr(>F): the p-value. If \(p < 0.05\), we reject \(H_0\). Here \(p = 0.0073\), so the fertilizer type has a statistically significant effect on yield.
* F value: the ratio of the mean squares (MSB / MSW).
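If you want to use the p-value programmatically rather than reading it off the printed table, it can be extracted from the summary object. A minimal sketch:

```r
# Pull the p-value for the 'type' effect out of the ANOVA table
p_value <- summary(anova_results)[[1]][["Pr(>F)"]][1]

if (p_value < 0.05) {
  cat("Reject H0: at least one fertilizer mean differs (p =", signif(p_value, 3), ")\n")
} else {
  cat("Fail to reject H0 (p =", signif(p_value, 3), ")\n")
}
```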
If the ANOVA is significant, we use Tukey’s Honestly Significant Difference (HSD) test to see which specific groups differ.
```r
TukeyHSD(anova_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##
## Fit: aov(formula = yield ~ type, data = fertilizer_data)
##
## $type
##           diff       lwr        upr     p adj
## B-A  3.5784930  0.804720  6.3522660 0.0095151
## C-A  0.5492474 -2.224526  3.3230204 0.8761896
## C-B -3.0292456 -5.803019 -0.2554726 0.0302172
```
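In this output, B differs significantly from both A and C (p adj < 0.05), while A and C do not differ significantly from each other. The same pairwise confidence intervals can be inspected visually, since base R provides a plot method for TukeyHSD objects; a quick sketch:

```r
# Plot the 95% family-wise confidence intervals for each pairwise difference
tukey_results <- TukeyHSD(anova_results)
plot(tukey_results)
```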
For the ANOVA results to be valid, the following assumptions must be met:

1. Independence: the observations are independent of one another.
2. Normality: the residuals are approximately normally distributed.
3. Homogeneity of variance: the groups have roughly equal variances.
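A quick informal check is to look at the standard residual diagnostic plots (aov objects inherit from lm, so base R's plot method applies); a minimal sketch:

```r
# Residuals vs. fitted values: roughly constant spread suggests equal variances
plot(anova_results, which = 1)

# Normal Q-Q plot of the residuals: points near the line suggest normality
plot(anova_results, which = 2)
```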
Normality of the residuals can be tested formally with the Shapiro-Wilk test, and homogeneity of variance with Bartlett's test:

```r
# Shapiro-Wilk test for normality of the residuals
shapiro.test(residuals(anova_results))

# Bartlett's test for homogeneity of variance across fertilizer types
bartlett.test(yield ~ type, data = fertilizer_data)
```

ANOVA is a powerful tool for experimental design. In our agricultural example, it gave the farmer statistical evidence that the choice of fertilizer affects yield, rather than relying on observation alone. By partitioning variance, we can isolate the "signal" (treatment effect) from the "noise" (random error).
As a next step, you can use ggplot2 (or a base R boxplot) to visualize the group differences.
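A minimal sketch of such a plot, assuming the ggplot2 package is installed:

```r
library(ggplot2)

# Boxplot of yield by fertilizer type
ggplot(fertilizer_data, aes(x = type, y = yield, fill = type)) +
  geom_boxplot() +
  labs(title = "Crop Yield by Fertilizer Type",
       x = "Fertilizer", y = "Yield (bushels per acre)") +
  theme_minimal()
```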