Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more independent groups to determine if at least one group mean is significantly different from the others.
While a t-test is limited to comparing two groups, ANOVA lets us analyze multiple groups simultaneously without inflating the risk of a Type I error (false positive) that comes from running many pairwise tests.
For example, with 4 groups, running a separate t-test for every pair requires \(\binom{4}{2} = 6\) tests. If each test uses a significance level of \(\alpha = 0.05\) and the tests are treated as independent, the probability of making at least one Type I error across all tests rises to: \[1 - (0.95)^6 \approx 0.26\] ANOVA avoids this by providing a single omnibus test.
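As a quick check of this arithmetic, the sketch below computes the number of pairwise tests and the resulting familywise error rate for several group counts. The function name `fwer` is illustrative, and the formula assumes independent tests, which is only an approximation for pairwise t-tests that share groups.

```r
# Familywise error rate (FWER) when every pair of k groups gets its own t-test.
# ASSUMPTION: the tests are independent, as in the formula above.
fwer <- function(k, alpha = 0.05) {
  m <- choose(k, 2)      # number of pairwise comparisons
  1 - (1 - alpha)^m      # probability of at least one false positive
}

k_values <- 3:6
fwer_table <- data.frame(
  groups = k_values,
  tests  = choose(k_values, 2),
  fwer   = round(fwer(k_values), 3)
)
print(fwer_table)
```

With `k = 4` this gives 6 tests and an error rate of about 0.26, matching the figure in the text. The rate climbs quickly as groups are added.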
The core logic of ANOVA is to partition the total variance in the data into two components:
1. Between-Group Variance: Variation due to differences between the group means (the effect of the factor).
2. Within-Group Variance: Variation due to individual differences within each group (the error).
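The Total Sum of Squares (\(SS_{Total}\)) measures the variation of every observation around the grand mean, and it decomposes into the two components: \[SS_{Total} = SS_{B} + SS_{W}\] where \(SS_{B}\) is the between-group and \(SS_{W}\) the within-group sum of squares, computed as follows.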
1. Sum of Squares Between (\(SS_{B}\)): \[SS_{B} = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x}_{grand})^2\] where \(n_i\) is the size of group \(i\), \(\bar{x}_i\) is its mean, and \(\bar{x}_{grand}\) is the overall mean.
2. Sum of Squares Within (\(SS_{W}\)): \[SS_{W} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2\]
3. Mean Squares (MS): Divide each sum of squares by its degrees of freedom (\(df\)), where \(k\) is the number of groups and \(N\) is the total number of observations: \[MS_{B} = \frac{SS_{B}}{k-1}\] \[MS_{W} = \frac{SS_{W}}{N-k}\]
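4. The F-Statistic: The ratio of the two mean squares is the test statistic. Under the null hypothesis of equal group means, it follows an F-distribution with \(k-1\) and \(N-k\) degrees of freedom: \[F = \frac{MS_{B}}{MS_{W}}\] A large \(F\) means the group means vary more than within-group noise alone would explain.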
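The four steps above can be traced by hand on a small dataset. The sketch below uses hypothetical toy values (three groups of four observations, chosen only to keep the arithmetic simple) and cross-checks the result against R's built-in `aov()`.

```r
# HYPOTHETICAL toy data: three groups of four observations each.
groups <- list(
  g1 = c(4, 5, 6, 5),
  g2 = c(7, 8, 9, 8),
  g3 = c(5, 6, 7, 6)
)

k           <- length(groups)            # number of groups
n_i         <- sapply(groups, length)    # group sizes
N           <- sum(n_i)                  # total number of observations
group_means <- sapply(groups, mean)
grand_mean  <- mean(unlist(groups))

# Steps 1 and 2: sums of squares
ss_b     <- sum(n_i * (group_means - grand_mean)^2)
ss_w     <- sum(sapply(groups, function(x) sum((x - mean(x))^2)))
ss_total <- sum((unlist(groups) - grand_mean)^2)

# Step 3: mean squares
ms_b <- ss_b / (k - 1)
ms_w <- ss_w / (N - k)

# Step 4: F statistic and its upper-tail p-value
f_stat  <- ms_b / ms_w                   # equals 14 for this toy data
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# Cross-check against aov()
toy <- data.frame(
  value = unlist(groups),
  group = factor(rep(names(groups), times = n_i))
)
fit <- summary(aov(value ~ group, data = toy))[[1]]

c(SS_B = ss_b, SS_W = ss_w, SS_Total = ss_total, F = f_stat, p = p_value)
```

Here `ss_b + ss_w` reproduces `ss_total`, and the hand-computed `f_stat` agrees with the `F value` column returned by `aov()`.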
Scenario: A farmer wants to test three different fertilizers (A, B, and C) to see if they produce different crop yields (measured in kg per plot).
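library(ggplot2)

# Boxplot of yield per fertilizer, with the individual plots overlaid as points.
# fertilizer_data is expected to have the columns type and yield.
ggplot(fertilizer_data, aes(x = type, y = yield, fill = type)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1) +
  theme_minimal() +
  labs(title = "Crop Yield Distribution", y = "Yield (kg)", x = "Fertilizer Type")

Figure 1: Comparison of Crop Yields by Fertilizer Type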
##             Df Sum Sq Mean Sq F value   Pr(>F)
## type         2  163.2   81.61   17.69 1.23e-05 ***
## Residuals   27  124.5    4.61
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: If the p-value is below 0.05, we reject the null hypothesis (\(H_0: \mu_A = \mu_B = \mu_C\)) and conclude that at least one fertilizer produces a significantly different mean yield. Here \(p = 1.23 \times 10^{-5}\), so \(H_0\) is rejected.
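The p-value in the ANOVA table can be recovered from the F value and the two degrees of freedom alone. The sketch below does this with the rounded values printed above, so the result is expected to agree only to the displayed precision. The variable names are illustrative.

```r
# Values read from the ANOVA table above (rounded as printed).
f_value    <- 17.69
df_between <- 2
df_within  <- 27

# Upper-tail probability of the F distribution = Pr(>F)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Critical value: the smallest F that would be significant at alpha = 0.05
f_crit <- qf(0.95, df_between, df_within)

p_value            # about 1.23e-05, matching the Pr(>F) column
f_value > f_crit   # TRUE, so H0 is rejected at the 5% level
```

Comparing `f_value` with `f_crit` and comparing `p_value` with 0.05 are two views of the same decision rule.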
ANOVA tells us that there is a difference, but not where it is. To find which specific groups differ, we use Tukey’s Honestly Significant Difference (HSD) test.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = yield ~ type, data = fertilizer_data)
##
## $type
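##                                diff       lwr       upr     p adj
## Fertilizer B-Fertilizer A  5.372304  2.990736  7.753871 0.0000181
## Fertilizer C-Fertilizer A  1.001631 -1.379937  3.383199 0.5569679
## Fertilizer C-Fertilizer B -4.370673 -6.752240 -1.989105 0.0002920

Interpretation: Fertilizer B yields significantly more than both A (about 5.4 kg more) and C (about 4.4 kg more), whereas the difference between C and A is not significant (adjusted p = 0.56, and its interval includes zero).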
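The interval widths in the Tukey output can be sanity-checked from the ANOVA table. The sketch below assumes a balanced design with 10 plots per fertilizer. This is inferred from the 27 residual degrees of freedom and the identical interval widths, and is not stated in the text. It uses the rounded sums of squares as printed, so small discrepancies are expected.

```r
# Residual mean square from the one-way ANOVA table (rounded as printed).
ms_within   <- 124.5 / 27
df_within   <- 27
k           <- 3     # number of fertilizers
n_per_group <- 10    # ASSUMPTION: balanced design, 30 plots / 3 fertilizers

# Studentized range quantile for a 95% family-wise confidence level
q_crit <- qtukey(0.95, nmeans = k, df = df_within)

# For equal group sizes the Tukey half-width is q * sqrt(MS_W / n)
half_width <- q_crit * sqrt(ms_within / n_per_group)

# Rebuild the interval for Fertilizer B - Fertilizer A
diff_ba <- 5.372304
c(lwr = diff_ba - half_width, upr = diff_ba + half_width)
```

Under these assumptions the rebuilt interval should land close to the printed `lwr` and `upr` for the B-A comparison. Any pair whose absolute difference is smaller than `half_width` has an interval that includes zero.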
Scenario: A global company wants to know if Advertising Channel (Social Media vs. TV) and Region (USA vs. Europe) affect total Sales.
Two-Way ANOVA allows us to check for Interactions: Does the effect of Social Media ads depend on the Region?
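library(dplyr)

# Build the 2 x 2 design (Channel x Region) with 15 observations per cell
marketing_data <- expand.grid(
  Channel = c("Social Media", "TV"),
  Region = c("USA", "Europe")
) %>%
  slice(rep(1:n(), each = 15)) %>%
  # Cell means follow the row order of expand.grid():
  # Social Media/USA, TV/USA, Social Media/Europe, TV/Europe
  mutate(Sales = c(rnorm(15, 50, 5), rnorm(15, 40, 5),
                   rnorm(15, 55, 5), rnorm(15, 30, 5)))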
# Running Two-Way ANOVA
two_way_aov <- aov(Sales ~ Channel * Region, data = marketing_data)
summary(two_way_aov)

##                Df Sum Sq Mean Sq F value   Pr(>F)
## Channel         1   4291    4291 226.632  < 2e-16 ***
## Region          1    150     150   7.908  0.00677 **
## Channel:Region  1    686     686  36.217 1.42e-07 ***
## Residuals      56   1060      19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
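The significant Channel:Region term can be traced back to the population means used in the simulation. The sketch below takes the four `rnorm()` means from the code above and computes the channel effect within each region. An interaction exists when those effects differ. The object names are illustrative, and sample means from a real run will deviate from these values because of noise.

```r
# Population cell means used in the simulation above.
cell_means <- matrix(
  c(50, 40,     # USA:    Social Media, TV
    55, 30),    # Europe: Social Media, TV
  nrow = 2, byrow = TRUE,
  dimnames = list(Region  = c("USA", "Europe"),
                  Channel = c("Social Media", "TV"))
)

# Simple effect of channel (Social Media minus TV) within each region
channel_effect <- cell_means[, "Social Media"] - cell_means[, "TV"]
channel_effect

# Difference of differences: zero would mean no interaction
interaction_contrast <- channel_effect[["Europe"]] - channel_effect[["USA"]]
interaction_contrast
```

The Social Media advantage is 10 in the USA but 25 in Europe, a difference of 15, so the effect of the channel depends on the region. That dependence is what the Channel:Region row tests.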
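For ANOVA results to be valid, the data must satisfy:
1. Independence: Observations are independent of one another, within and across groups.
2. Normality: The residuals are approximately normally distributed. Check using shapiro.test(residuals(res_aov)) or a Q-Q plot.
3. Homogeneity of variance: The groups have similar variances. Check using leveneTest() from the car package.

| Feature | One-Way ANOVA | Two-Way ANOVA |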
|---|---|---|
| Independent Variables | 1 Factor (e.g., Fertilizer) | 2 Factors (e.g., Channel & Region) |
| Dependent Variable | Continuous (Yield) | Continuous (Sales) |
| Main Goal | Compare means across levels | Compare means and find interactions |
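The assumption checks listed earlier can be tried on any fitted one-way or two-way model. The sketch below uses R's built-in `PlantGrowth` dataset as a stand-in, since the tutorial's own data objects are not constructed in the text shown. It uses `car::leveneTest()` when the car package is installed, and otherwise falls back to Bartlett's test from base R. Bartlett's test is a different test of equal variances and is more sensitive to non-normality.

```r
# PlantGrowth ships with base R: plant weight under three conditions.
# It is a STAND-IN for the tutorial's data, not the data used above.
fit <- aov(weight ~ group, data = PlantGrowth)

# Normality of residuals: Shapiro-Wilk test plus a Q-Q plot
shapiro_res <- shapiro.test(residuals(fit))
qqnorm(residuals(fit))
qqline(residuals(fit))

# Homogeneity of variance: Levene's test if car is available,
# otherwise Bartlett's test as a base-R fallback
if (requireNamespace("car", quietly = TRUE)) {
  variance_res <- car::leveneTest(weight ~ group, data = PlantGrowth)
  variance_p   <- variance_res[["Pr(>F)"]][1]
} else {
  variance_res <- bartlett.test(weight ~ group, data = PlantGrowth)
  variance_p   <- variance_res$p.value
}

c(shapiro_p = shapiro_res$p.value, variance_p = variance_p)
```

For both checks, a p-value above 0.05 means the test found no evidence against the assumption. It does not prove the assumption holds, so the Q-Q plot is worth inspecting as well.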
ANOVA is a powerful statistical tool used in medicine (testing drug dosages), manufacturing (comparing machine outputs), and the social sciences. By partitioning variance, it provides a robust framework for decision-making based on experimental data.

To run the examples, install the required packages by running install.packages(c("ggplot2", "dplyr", "ggpubr", "car")) in your R console, then create a new document via File > New File > R Markdown.