Analysis of Variance (ANOVA) is a statistical method used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.
While a t-test compares two groups, ANOVA is used when we have \(k \geq 3\) groups. It tests whether at least one group mean differs from the others in a single test, which avoids the inflated risk of a “Type I error” (false positive) that comes from running multiple pairwise t-tests.
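To see why multiple t-tests are a problem, the sketch below computes the chance of at least one false positive across all pairwise comparisons. It assumes a per-test level of 0.05 and, as a simplification, that the tests are independent (pairwise t-tests that share groups are not exactly independent, so the figure is approximate). The variable names are illustrative.

```r
alpha <- 0.05              # per-test significance level (assumed)
k <- 3                     # number of groups
m <- choose(k, 2)          # number of pairwise comparisons: 3
fwer <- 1 - (1 - alpha)^m  # chance of at least one false positive

print(m)
print(round(fwer, 4))      # 0.1426
```

With only three groups, the false-positive risk already rises from 5% to about 14%, and it grows quickly as \(k\) increases.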
The core logic of ANOVA is to partition the total variance in the data into two components:
1. Between-group variance: Variation due to differences between the group means (the effect of the grouping factor).
2. Within-group variance (Error): Variation due to individual differences within each group.
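This partition can be written as an identity between sums of squares. In the notation below (introduced here for illustration), \(x_{ij}\) is observation \(j\) in group \(i\), \(n_i\) and \(\bar{x}_i\) are the size and mean of group \(i\), and \(\bar{x}_{grand}\) is the mean of all observations:

\[\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{grand})^2}_{\text{total}} = \underbrace{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x}_{grand})^2}_{\text{between groups}} + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}_{\text{within groups}}\]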
The null hypothesis (\(H_0\)) assumes that all group means are equal, while the alternative hypothesis (\(H_a\)) assumes at least one mean is different.
\[H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k\]
\[H_a: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j\]
The test statistic for ANOVA is the F-ratio:
\[F = \frac{\text{Mean Square Between (MSB)}}{\text{Mean Square Within (MSW)}}\]
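Where:
- Sum of Squares Between (SSB): \(\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x}_{grand})^2\)
- Sum of Squares Within (SSW): \(\sum_{i=1}^{k} (n_i - 1)s_i^2\)
- MSB: \(\frac{SSB}{k-1}\)
- MSW: \(\frac{SSW}{N-k}\)
Here \(n_i\), \(\bar{x}_i\), and \(s_i^2\) are the size, mean, and sample variance of group \(i\), \(\bar{x}_{grand}\) is the mean of all observations, \(k\) is the number of groups, and \(N\) is the total number of observations.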
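As a minimal sketch of these formulas, the block below computes the F-ratio by hand in base R. The data are a small hypothetical toy set chosen so the arithmetic is easy to follow; all variable names are illustrative.

```r
# Toy data (hypothetical): three groups of three observations each
groups <- list(g1 = c(1, 2, 3), g2 = c(2, 3, 4), g3 = c(5, 6, 7))

k <- length(groups)                  # number of groups
n_i <- sapply(groups, length)        # group sizes
N <- sum(n_i)                        # total number of observations
group_means <- sapply(groups, mean)  # 2, 3, 6
grand_mean <- mean(unlist(groups))   # mean of all nine values

ssb <- sum(n_i * (group_means - grand_mean)^2)  # between-group sum of squares
ssw <- sum((n_i - 1) * sapply(groups, var))     # within-group sum of squares
msb <- ssb / (k - 1)
msw <- ssw / (N - k)
f_ratio <- msb / msw

print(c(SSB = ssb, SSW = ssw, MSB = msb, MSW = msw, F = f_ratio))
```

For this toy data, SSB = 26 and SSW = 6, so MSB = 13, MSW = 1, and F = 13.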
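Under \(H_0\), the F-ratio is expected to be close to 1. If \(F\) exceeds the critical value of the F-distribution with \(k-1\) and \(N-k\) degrees of freedom at the chosen significance level (equivalently, if the p-value is below that level), we reject the null hypothesis.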
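The decision rule can be sketched with the base R functions `qf()` and `pf()`. The inputs below are hypothetical (an observed F of 13 from 3 groups and 9 observations, tested at the 0.05 level) and serve only to illustrate the comparison.

```r
k <- 3          # number of groups (hypothetical)
N <- 9          # total number of observations (hypothetical)
alpha <- 0.05   # significance level (assumed)
f_obs <- 13     # observed F-ratio (hypothetical)

# Critical value: the point that cuts off the upper alpha tail of F(k-1, N-k)
f_crit <- qf(1 - alpha, df1 = k - 1, df2 = N - k)

# p-value: probability of an F at least this large if H0 is true
p_value <- pf(f_obs, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

print(f_crit)
print(p_value)
print(f_obs > f_crit)  # TRUE means H0 is rejected at level alpha
```

Comparing `f_obs` with `f_crit` and comparing `p_value` with `alpha` always lead to the same decision.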
Scenario: A farmer wants to test three different types of fertilizers (Type A, Type B, and Type C) to see if they produce different crop yields (measured in kg per plot).
Let’s simulate the data for 30 plots of land (10 plots per fertilizer).
set.seed(123)
data <- data.frame(
  Fertilizer = rep(c("Type_A", "Type_B", "Type_C"), each = 10),
  Yield = c(rnorm(10, mean = 20, sd = 2),
            rnorm(10, mean = 25, sd = 2),
            rnorm(10, mean = 22, sd = 2))
)
# Preview the data
head(data)
##   Fertilizer    Yield
## 1     Type_A 18.87905
## 2     Type_A 19.53965
## 3     Type_A 23.11742
## 4     Type_A 20.14102
## 5     Type_A 20.25858
## 6     Type_A 23.43013
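Alongside the preview, it can help to look at per-group summaries. The sketch below re-creates the simulated data so that it runs on its own (storing `Fertilizer` as a factor, which is a choice made here) and uses base R `aggregate()`; `group_stats` is an illustrative name.

```r
set.seed(123)  # same seed as the simulation above
data <- data.frame(
  Fertilizer = factor(rep(c("Type_A", "Type_B", "Type_C"), each = 10)),
  Yield = c(rnorm(10, mean = 20, sd = 2),
            rnorm(10, mean = 25, sd = 2),
            rnorm(10, mean = 22, sd = 2))
)

# Sample size, mean, and standard deviation of Yield for each fertilizer
group_stats <- aggregate(
  Yield ~ Fertilizer, data = data,
  FUN = function(y) c(n = length(y), mean = mean(y), sd = sd(y))
)
print(group_stats)
```

The sample means will not equal the simulation means of 20, 25, and 22 exactly, because each group is a random sample of only 10 plots.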
Before running the test, we should visualize the distribution using a boxplot.
library(ggplot2)  # provides ggplot() and the geoms used below
ggplot(data, aes(x = Fertilizer, y = Yield, fill = Fertilizer)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1) +
  theme_minimal() +
  labs(title = "Crop Yield by Fertilizer Type",
       subtitle = "Visual inspection of mean differences",
       x = "Fertilizer Type",
       y = "Yield (kg)")
Now we apply the aov() function in R to fit the model and calculate the variance components.
# Compute the analysis of variance
res.aov <- aov(Yield ~ Fertilizer, data = data)
# Summary of the analysis
summary(res.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Fertilizer   2  156.5   78.26   20.57 3.74e-06 ***
## Residuals   27  102.7    3.80
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
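The columns of this table are linked by simple arithmetic, which the sketch below reproduces from the printed values. Because the printed sums of squares are rounded, the results match the table only approximately; the variable names are illustrative.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 156.5; df_between <- 2
ss_within  <- 102.7; df_within  <- 27

ms_between <- ss_between / df_between  # Mean Sq for Fertilizer
ms_within  <- ss_within / df_within    # Mean Sq for Residuals
f_value    <- ms_between / ms_within   # F value

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df1 = df_between, df2 = df_within, lower.tail = FALSE)

print(round(c(MSB = ms_between, MSW = ms_within, F = f_value), 2))
print(p_value)
```

Since the p-value is far below 0.05, we reject \(H_0\) and conclude that at least one fertilizer has a different mean yield.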
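A significant ANOVA tells us that at least one group mean differs, but it doesn’t tell us which groups are different. For that, we run a post-hoc test, here the Tukey Honestly Significant Difference (HSD) test, which compares every pair of groups while controlling the family-wise error rate.
# Pairwise comparisons between fertilizer types
TukeyHSD(res.aov)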
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Yield ~ Fertilizer, data = data)
##
## $Fertilizer
##                    diff       lwr       upr     p adj
## Type_B-Type_A  5.267993  3.105081  7.430904 0.0000056
## Type_C-Type_A  1.001631 -1.161280  3.164542 0.4935951
## Type_C-Type_B -4.266362 -6.429273 -2.103450 0.0001178
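Observation: If the 95% confidence interval for a pair (lwr to upr, drawn as a bar when the result is plotted) does not cross the zero line, the difference between those two specific fertilizers is statistically significant. Here that holds for Type_B-Type_A and Type_C-Type_B, while the Type_C-Type_A interval contains zero (adjusted p = 0.49), so those two fertilizers do not differ significantly.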
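The confidence-interval bars can be drawn with the `plot()` method for `TukeyHSD` objects. The sketch below re-creates the data and model so that it runs on its own (with `Fertilizer` stored as a factor, a choice made here), then flags the pairs whose interval excludes zero; `tukey`, `ci`, and `excludes_zero` are illustrative names.

```r
set.seed(123)  # same seed as the simulation above
data <- data.frame(
  Fertilizer = factor(rep(c("Type_A", "Type_B", "Type_C"), each = 10)),
  Yield = c(rnorm(10, mean = 20, sd = 2),
            rnorm(10, mean = 25, sd = 2),
            rnorm(10, mean = 22, sd = 2))
)
res.aov <- aov(Yield ~ Fertilizer, data = data)
tukey <- TukeyHSD(res.aov)

# One horizontal interval per pair, with a reference line at zero
plot(tukey)

# Flag pairs whose interval lies entirely above or entirely below zero
ci <- tukey$Fertilizer
excludes_zero <- ci[, "lwr"] > 0 | ci[, "upr"] < 0
print(excludes_zero)
```

A pair flagged `TRUE` corresponds to a bar that does not touch the zero line in the plot.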
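For the results to be valid, the data must satisfy three main assumptions: independence of observations, approximately normal residuals within each group, and homogeneity of variance (similar variance in every group).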
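Two of the standard ANOVA assumptions, normal residuals and equal group variances, can be checked from the fitted model; independence of observations has to come from the study design. The sketch below is one common approach using base R, not the only one: a Shapiro-Wilk test on the residuals and a Bartlett test across groups. It re-creates the data and model so that it runs on its own.

```r
set.seed(123)  # same seed as the simulation above
data <- data.frame(
  Fertilizer = factor(rep(c("Type_A", "Type_B", "Type_C"), each = 10)),
  Yield = c(rnorm(10, mean = 20, sd = 2),
            rnorm(10, mean = 25, sd = 2),
            rnorm(10, mean = 22, sd = 2))
)
res.aov <- aov(Yield ~ Fertilizer, data = data)

# Normality of residuals (H0: the residuals are normally distributed)
shapiro_res <- shapiro.test(residuals(res.aov))

# Homogeneity of variance (H0: all groups have the same variance)
bartlett_res <- bartlett.test(Yield ~ Fertilizer, data = data)

print(shapiro_res$p.value)
print(bartlett_res$p.value)

# Visual checks: residuals vs fitted values, and a normal Q-Q plot
plot(res.aov, which = 1:2)
```

A p-value above the chosen significance level means the test found no evidence against that assumption. With small samples these tests have limited power, so the plots are worth inspecting as well.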
ANOVA is widely used across various industries.
In our agricultural example, the ANOVA test followed by the Tukey HSD test allowed us to conclude that Type B produces a significantly higher yield than Type A and Type C, while Type A and Type C do not differ significantly from each other. This data-driven approach can help save resources and optimize production in real-world settings.