Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more independent groups to determine whether at least one group mean differs significantly from the others. While a t-test compares two groups, ANOVA is the tool of choice for multi-group scenarios common in business, medicine, and social sciences.
The core logic of ANOVA is to partition the total variability in a dataset into two components:
1. Between-group variability: variation attributable to the treatment or grouping.
2. Within-group variability: variation attributable to random chance (error).
For \(k\) groups, the null and alternative hypotheses are:
\[H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k\]
\[H_a: \text{At least one } \mu_i \text{ differs from the others.}\]
The F-ratio is the test statistic used in ANOVA:
\[F = \frac{\text{Mean Square Between (MSB)}}{\text{Mean Square Within (MSW)}}\]
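Where \(k\) is the number of groups, \(n_i\) is the size of group \(i\), \(s_i^2\) is its sample variance, and \(N\) is the total number of observations:
* Sum of Squares Total (SST): \(\sum_{i}\sum_{j} (x_{ij} - \bar{x}_{grand})^2\)
* Sum of Squares Between (SSB): \(\sum_{i} n_i(\bar{x}_i - \bar{x}_{grand})^2\)
* Sum of Squares Within (SSW): \(\sum_{i} (n_i - 1)s_i^2\)
* Mean squares: \(MSB = SSB / (k - 1)\) and \(MSW = SSW / (N - k)\), with \(SST = SSB + SSW\).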
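To make the formulas concrete, here is a minimal sketch that computes the sums of squares and the F-ratio by hand and cross-checks the result against R's built-in `aov()`. The `toy` data frame and its values are invented purely for illustration and are not part of the fertilizer example.

```r
# Hypothetical toy data: three groups of four observations each
toy <- data.frame(
  value = c(4, 5, 6, 5,  7, 8, 9, 8,  5, 6, 7, 6),
  group = factor(rep(c("g1", "g2", "g3"), each = 4))
)

grand_mean  <- mean(toy$value)
group_means <- tapply(toy$value, toy$group, mean)    # x-bar_i
group_sizes <- tapply(toy$value, toy$group, length)  # n_i
group_vars  <- tapply(toy$value, toy$group, var)     # s_i^2

k <- nlevels(toy$group)  # number of groups
N <- nrow(toy)           # total number of observations

sst <- sum((toy$value - grand_mean)^2)
ssb <- sum(group_sizes * (group_means - grand_mean)^2)
ssw <- sum((group_sizes - 1) * group_vars)

msb <- ssb / (k - 1)
msw <- ssw / (N - k)
f_ratio <- msb / msw

# Cross-check against the F value reported by aov()
f_builtin <- summary(aov(value ~ group, data = toy))[[1]][["F value"]][1]

c(SST = sst, SSB = ssb, SSW = ssw, F = f_ratio)
```

For this toy data the group means are 5, 8, and 6, the within-group sum of squares is 6, and the F-ratio works out to 14. The partition SST = SSB + SSW holds exactly.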
Scenario: A farming collective wants to test three different fertilizers (A, B, and C) to see if they result in different average crop yields (in bushels per acre).
# Simulate 20 plots per fertilizer; the seed makes the random draws reproducible
set.seed(123)
fertilizer_data <- data.frame(
  # True mean yields of 50, 55, and 48 bushels per acre, each with SD 5
  yield = c(rnorm(20, 50, 5), rnorm(20, 55, 5), rnorm(20, 48, 5)),
  type  = factor(rep(c("Fertilizer A", "Fertilizer B", "Fertilizer C"), each = 20))
)
head(fertilizer_data)
##      yield         type
## 1 47.19762 Fertilizer A
## 2 48.84911 Fertilizer A
## 3 57.79354 Fertilizer A
## 4 50.35254 Fertilizer A
## 5 50.64644 Fertilizer A
## 6 58.57532 Fertilizer A
Before running the test, we visualize the data using a boxplot to see the spread.
library(ggplot2)  # provides ggplot() and the geoms used below

ggplot(fertilizer_data, aes(x = type, y = yield, fill = type)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.1) +
  theme_minimal() +
  labs(title = "Crop Yield by Fertilizer Type",
       x = "Fertilizer Brand",
       y = "Yield (Bushels/Acre)") +
  scale_fill_brewer(palette = "Set1")

# One-way ANOVA of yield on fertilizer type
summary(aov(yield ~ type, data = fertilizer_data))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## type         2  397.3  198.67   9.344 0.000309 ***
## Residuals   57 1211.9   21.26                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The p-value (\(Pr(>F)\)) is 0.000309, well below the conventional 0.05 significance level, so we reject the null hypothesis and conclude that at least one fertilizer's mean yield differs from the others.
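The p-value in the table is the upper-tail area of the F distribution with the two degrees of freedom shown in the `Df` column. As a rough check, the sketch below recomputes it from the rounded F value in the table; the variable names are illustrative, and the 0.05 threshold is the conventional choice used in the interpretation above.

```r
# Values read off the one-way ANOVA table above
f_value    <- 9.344
df_between <- 2    # k - 1 = 3 - 1
df_within  <- 57   # N - k = 60 - 3

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Equivalent decision rule: compare F to the 5% critical value
f_crit      <- qf(0.95, df_between, df_within)
reject_null <- f_value > f_crit

p_value
f_crit
reject_null
```

Both routes lead to the same decision: the observed F is far above the critical value of roughly 3.16, and the recomputed p-value agrees with the reported 0.000309 up to rounding.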
Scenario: An e-commerce company wants to test the effect of Ad Platform (Facebook vs. Google) and Ad Format (Video vs. Static Image) on the Number of Clicks.
library(dplyr)  # provides %>%, slice(), mutate(), and n()

# 2 x 2 design with 15 observations per platform/format combination
marketing_data <- expand.grid(
  platform = c("Facebook", "Google"),
  format   = c("Video", "Static")
) %>%
  slice(rep(1:n(), each = 15)) %>%
  mutate(clicks = c(rnorm(15, 120, 10), rnorm(15, 110, 10),
                    rnorm(15, 150, 12), rnorm(15, 105, 10)))
head(marketing_data)
##   platform format   clicks
## 1 Facebook  Video 114.9768
## 2 Facebook  Video 116.6679
## 3 Facebook  Video 109.8142
## 4 Facebook  Video 109.2821
## 5 Facebook  Video 123.0353
## 6 Facebook  Video 124.4821
In two-way ANOVA, we also look for interaction effects, where the effect of the platform depends on the format used. In an interaction plot, non-parallel lines suggest that such an interaction is present.
library(ggpubr)  # provides ggline()

ggline(marketing_data, x = "platform", y = "clicks", color = "format",
       add = c("mean_se"),
       palette = "jco") +
  labs(title = "Interaction Effect: Platform vs. Ad Format")

# Two-way ANOVA with both main effects and their interaction
summary(aov(clicks ~ platform * format, data = marketing_data))
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## platform         1  12084   12084  137.65  < 2e-16 ***
## format           1   1785    1785   20.34 3.37e-05 ***
## platform:format  1   6636    6636   75.59 5.64e-12 ***
## Residuals       56   4916      88                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
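One way to see what the `platform:format` row measures is to compare the platform gap within each format. The sketch below rebuilds a dataset with the same design in base R and computes that difference of differences. The seed is an assumption (the seed used for the marketing data above is not shown), so the numbers will not match the table exactly, and the names `sim`, `platform_gap`, and `interaction_contrast` are illustrative.

```r
set.seed(42)  # assumed seed, chosen only to make this sketch reproducible

# Same 2 x 2 layout as above: 15 observations per platform/format cell
cells <- expand.grid(platform = c("Facebook", "Google"),
                     format   = c("Video", "Static"))
sim <- cells[rep(seq_len(nrow(cells)), each = 15), ]
sim$clicks <- c(rnorm(15, 120, 10), rnorm(15, 110, 10),
                rnorm(15, 150, 12), rnorm(15, 105, 10))

# Mean clicks in each cell (rows: platform, columns: format)
cell_means <- tapply(sim$clicks, list(sim$platform, sim$format), mean)

# Facebook-minus-Google gap, computed separately for each format
platform_gap <- cell_means["Facebook", ] - cell_means["Google", ]

# With no interaction the two gaps would be about equal (contrast near 0)
interaction_contrast <- platform_gap["Static"] - platform_gap["Video"]

# The interaction row of the two-way table tests whether that contrast is zero
two_way <- summary(aov(clicks ~ platform * format, data = sim))[[1]]

cell_means
interaction_contrast
two_way
```

In the simulated design, the true platform gap is 45 clicks for static ads but only 10 for video, so the contrast should come out clearly positive. That is the pattern a significant interaction term reflects.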
For ANOVA results to be valid, three main assumptions must hold:
1. Normality: The residuals should be normally distributed.
2. Homogeneity of Variance: The variance across groups should be roughly equal (commonly checked with Levene’s test).
3. Independence: Observations are independent of each other.
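The first two assumptions can be checked numerically. The sketch below does so for the fertilizer model using base R only. Two choices here are assumptions rather than anything the text prescribes: the Shapiro-Wilk test for normality, and a hand-rolled, median-centered version of Levene's test (a one-way ANOVA on absolute deviations from each group's median). The `car` package's `leveneTest()` is a packaged alternative. Independence depends on how the data were collected and cannot be verified from the data alone.

```r
# Recreate the fertilizer data and fit the one-way model
set.seed(123)
fertilizer_data <- data.frame(
  yield = c(rnorm(20, 50, 5), rnorm(20, 55, 5), rnorm(20, 48, 5)),
  type  = factor(rep(c("Fertilizer A", "Fertilizer B", "Fertilizer C"), each = 20))
)
fit <- aov(yield ~ type, data = fertilizer_data)

# 1. Normality: Shapiro-Wilk test on the model residuals (assumed choice of test)
normality <- shapiro.test(residuals(fit))

# 2. Homogeneity of variance: median-centered Levene-type test,
#    i.e. a one-way ANOVA on absolute deviations from each group's median
group_medians <- ave(fertilizer_data$yield, fertilizer_data$type, FUN = median)
fertilizer_data$abs_dev <- abs(fertilizer_data$yield - group_medians)
levene <- anova(lm(abs_dev ~ type, data = fertilizer_data))

normality$p.value        # a small p-value would suggest non-normal residuals
levene[["Pr(>F)"]][1]    # a small p-value would suggest unequal variances
```

A quick visual complement is `plot(fit, which = 2)`, which draws a normal Q-Q plot of the residuals.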
An ANOVA tells us that a difference exists, but not where it is. If we find a significant result, we use Tukey’s Honestly Significant Difference (HSD) to compare pairs of groups.
# Pairwise comparisons for the one-way fertilizer model
TukeyHSD(aov(yield ~ type, data = fertilizer_data))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = yield ~ type, data = fertilizer_data)
## 
## $type
##                                diff        lwr       upr     p adj
## Fertilizer B-Fertilizer A  4.035595  0.5267211  7.544469 0.0204624
## Fertilizer C-Fertilizer A -2.175693 -5.6845670  1.333181 0.3022689
## Fertilizer C-Fertilizer B -6.211288 -9.7201621 -2.702414 0.0002260
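In the table above, Fertilizer B differs significantly from both A and C (adjusted p-values below 0.05, confidence intervals excluding zero), while A and C cannot be distinguished. The sketch below shows one way to pull the significant pairs out of a `TukeyHSD()` result programmatically. It regenerates the fertilizer data with the same seed, and the names `pairs` and `significant` are illustrative.

```r
# Recreate the fertilizer data and fit the one-way model
set.seed(123)
fertilizer_data <- data.frame(
  yield = c(rnorm(20, 50, 5), rnorm(20, 55, 5), rnorm(20, 48, 5)),
  type  = factor(rep(c("Fertilizer A", "Fertilizer B", "Fertilizer C"), each = 20))
)
fit <- aov(yield ~ type, data = fertilizer_data)

# TukeyHSD() returns a list with one matrix per factor in the model
tukey <- TukeyHSD(fit, "type", conf.level = 0.95)
pairs <- tukey$type  # columns: diff, lwr, upr, p adj

# Keep the comparisons whose adjusted p-value is below 0.05
significant <- rownames(pairs)[pairs[, "p adj"] < 0.05]
significant

# Optional: plot(tukey) draws the family-wise confidence intervals
```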
ANOVA is a powerful decision-making tool in the real world:
- Medicine: Comparing the efficacy of multiple drug dosages.
- Manufacturing: Testing if different machines produce parts with different defect rates.
- Education: Analyzing if different teaching methods lead to different test scores.
By partitioning variance and testing the F-ratio, researchers can move beyond simple guesses to statistically backed conclusions about group differences.
To run the examples, install the required packages:
install.packages(c("tidyverse", "ggpubr", "car", "broom"))
The plots are made with ggplot2 and ggpubr, which create publication-quality plots.