Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more independent groups. While a t-test is limited to two groups, ANOVA tests in a single step whether at least one group mean differs from the others. This avoids the inflated risk of a “Type I Error” (false positive) that comes from running many pairwise t-tests.
It might seem strange to use “Variance” to test “Means.” However, ANOVA works by partitioning the total variance in the data into two components:

1. Between-group variance: How much do the group means differ?
2. Within-group variance: How much spread is there inside each group?
If the variance between groups is significantly larger than the variance within groups, we conclude that at least one group mean likely differs from the others.
To perform a One-Way ANOVA, we follow these mathematical steps.
Total variation is broken down as: \[SS_{Total} = SS_{Between} + SS_{Within}\]
The final test statistic is the ratio of Mean Squares: \[F = \frac{MS_{Between}}{MS_{Within}} = \frac{SS_{Between} / (k-1)}{SS_{Within} / (N-k)}\]

Where:

* \(k\) = number of groups.
* \(N\) = total number of observations.
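To make the formulas concrete, here is a minimal sketch that computes the sums of squares and the F statistic by hand. The nine values and three groups are made-up toy numbers chosen so the arithmetic is easy to follow; they are not part of the fertilizer example, and the variable names are illustrative only.

```r
# Toy data (hypothetical): three groups of three observations each
values <- c(4, 5, 6,  7, 8, 9,  1, 2, 3)
group  <- factor(rep(c("A", "B", "C"), each = 3))

k <- nlevels(group)   # number of groups
N <- length(values)   # total number of observations

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

# Between-group sum of squares: group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group sum of squares: observations around their own group mean
ss_within <- sum((values - ave(values, group))^2)
# Total sum of squares: observations around the grand mean
ss_total <- sum((values - grand_mean)^2)

ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (N - k)
f_stat     <- ms_between / ms_within

ss_between   # 54
ss_within    # 6
ss_total     # 60, which equals 54 + 6
f_stat       # 27
```

Fitting `aov(values ~ group)` on the same toy data should report the same F value, which is a quick way to check the hand calculation.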
Imagine an agricultural scientist testing four different types of fertilizers (A, B, C, and D) on crop yields. They want to know: Does the type of fertilizer used significantly affect the average crop yield?
set.seed(123)
# Simulating data for 4 fertilizers
fertilizer_data <- data.frame(
  Fertilizer = rep(c("A", "B", "C", "D"), each = 20),
  Yield = c(rnorm(20, mean = 20, sd = 2),  # Fertilizer A
            rnorm(20, mean = 22, sd = 2),  # Fertilizer B
            rnorm(20, mean = 19, sd = 2),  # Fertilizer C
            rnorm(20, mean = 25, sd = 2))  # Fertilizer D
)
head(fertilizer_data)
## Fertilizer Yield
## 1 A 18.87905
## 2 A 19.53965
## 3 A 23.11742
## 4 A 20.14102
## 5 A 20.25858
## 6 A 23.43013
Before running the math, we visualize the distributions.
library(ggplot2)

ggplot(fertilizer_data, aes(x = Fertilizer, y = Yield, fill = Fertilizer)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.15, alpha = 0.5) +
  theme_minimal() +
  labs(title = "Crop Yield by Fertilizer Type",
       subtitle = "Visual inspection suggests differences between group D and others",
       y = "Yield (Bushels/Acre)")

We use the aov() function to calculate the results.
# Fit the ANOVA model
anova_model <- aov(Yield ~ Fertilizer, data = fertilizer_data)
# Display the summary table
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 3 349.8 116.6 33.33 7.4e-14 ***
## Residuals 76 265.9 3.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
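The Pr(>F) column is the upper-tail probability of the F distribution at the observed F value. As a rough sketch, it can be recomputed from the rounded numbers printed in the table above; because the inputs are rounded, the result should match the table's order of magnitude rather than every digit. The variable names are illustrative.

```r
# Values read from the ANOVA summary table above (rounded as printed)
f_value    <- 33.33
df_between <- 3    # k - 1, with k = 4 fertilizers
df_within  <- 76   # N - k, with N = 80 observations

# P(F >= f_value) under the null hypothesis of equal group means
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Critical F value at the 5% significance level
f_crit <- qf(0.95, df_between, df_within)

p_value
f_crit
f_value > f_crit   # TRUE means we reject the null at the 5% level
```

Because the observed F is far above the 5% critical value, the table's three-star significance code follows directly.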
ANOVA is valid only if certain conditions are met:

1. Independence: Observations are independent (assumed by design).
2. Normality: The residuals should follow a normal distribution.
3. Homogeneity of Variance: Groups should have similar spread.
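The normality and equal-variance conditions can be checked with formal tests. The sketch below is one possible approach using only base R: a Shapiro-Wilk test on the model residuals and Bartlett's test for equal variances. The choice of these two tests is an assumption on my part, not something the original analysis specifies; `leveneTest()` from the car package is a common alternative to Bartlett's test. The data is re-simulated here so the block runs on its own.

```r
# Recreate the simulated fertilizer data and the fitted model
set.seed(123)
fertilizer_data <- data.frame(
  Fertilizer = factor(rep(c("A", "B", "C", "D"), each = 20)),
  Yield = c(rnorm(20, mean = 20, sd = 2),
            rnorm(20, mean = 22, sd = 2),
            rnorm(20, mean = 19, sd = 2),
            rnorm(20, mean = 25, sd = 2))
)
anova_model <- aov(Yield ~ Fertilizer, data = fertilizer_data)

# Normality: Shapiro-Wilk test on the model residuals
normality_check <- shapiro.test(residuals(anova_model))

# Homogeneity of variance: Bartlett's test across the four groups
variance_check <- bartlett.test(Yield ~ Fertilizer, data = fertilizer_data)

normality_check$p.value
variance_check$p.value
```

In both tests, a p-value above 0.05 means there is no evidence against the assumption. A Q-Q plot of the residuals is a useful visual complement to the Shapiro-Wilk test.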
ANOVA tells us “something is different,” but it doesn’t say which fertilizers differ from each other. We use Tukey’s Honest Significant Difference (HSD) test, available in R as TukeyHSD(), to find the specific pairwise differences.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Yield ~ Fertilizer, data = fertilizer_data)
##
## $Fertilizer
## diff lwr upr p adj
## B-A 1.614238 0.06057883 3.1678973 0.0386301
## C-A -1.070277 -2.62393639 0.4833821 0.2768807
## D-A 4.476918 2.92325902 6.0305775 0.0000000
## C-B -2.684515 -4.23817446 -1.1308560 0.0001208
## D-B 2.862680 1.30902095 4.4163394 0.0000390
## D-C 5.547195 3.99353617 7.1008547 0.0000000
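If the confidence interval for a pair (e.g., D-A) does not contain zero, the difference is significant at the 5% level. In our results, Fertilizer D yields significantly more than A, B, and C. Fertilizer B also differs significantly from both A and C, while the difference between C and A is not significant (adjusted p = 0.28).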
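A comparison table like the one above can be produced with `TukeyHSD()` on the fitted model. The following is a minimal sketch that re-simulates the data so it runs on its own; the `significant` column and the `tukey_table` name are additions for illustration, flagging pairs whose interval excludes zero.

```r
# Recreate the simulated fertilizer data and the fitted model
set.seed(123)
fertilizer_data <- data.frame(
  Fertilizer = factor(rep(c("A", "B", "C", "D"), each = 20)),
  Yield = c(rnorm(20, mean = 20, sd = 2),
            rnorm(20, mean = 22, sd = 2),
            rnorm(20, mean = 19, sd = 2),
            rnorm(20, mean = 25, sd = 2))
)
anova_model <- aov(Yield ~ Fertilizer, data = fertilizer_data)

# Pairwise comparisons with a 95% family-wise confidence level
tukey_result <- TukeyHSD(anova_model, conf.level = 0.95)

# Convert the comparison matrix to a data frame and flag intervals
# that lie entirely above or entirely below zero
tukey_table <- as.data.frame(tukey_result$Fertilizer)
tukey_table$significant <- tukey_table$lwr > 0 | tukey_table$upr < 0
tukey_table

# In an interactive session, uncomment to draw the intervals
# with a vertical reference line at zero:
# plot(tukey_result, las = 1)
```

Flagging on the interval bounds is equivalent to checking whether the adjusted p-value is below 0.05, since both come from the same family-wise procedure.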
ANOVA is used across virtually every industry:
| Industry | Application |
|---|---|
| Marketing | Comparing the click-through rates (CTR) of 5 different website layouts. |
| Medicine | Testing the effectiveness of three different dosages of a new drug. |
| Manufacturing | Checking if three different machines produce parts with the same mean diameter. |
| Education | Comparing the exam scores of students using three different teaching methods. |
ANOVA is a powerful tool for experimental design. By looking at the variance, we gain clarity on whether observed differences are due to chance or a genuine effect of our treatments.
install.packages(c("ggplot2", "dplyr", "ggpubr", "car"))