This chapter is a guide to Analysis of Variance. It covers the theoretical background, the mathematical framework, and hands-on R code using synthetic datasets modeled on real-life scenarios.
Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups and determine whether at least one group mean differs significantly from the others. While a t-test compares two groups, ANOVA handles multiple groups in a single test, which keeps the Type I error rate (the risk of a “false positive”) at the chosen level instead of letting it grow with every additional pairwise comparison.
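To see why running several t-tests inflates the false-positive risk, the sketch below computes the family-wise error rate for all pairwise comparisons among k groups. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so the numbers are a rough illustration only. The helper name `familywise_error` is hypothetical.

```r
# Family-wise Type I error rate for all pairwise tests among k groups.
# Assumption: the tests are independent (only approximately true in practice).
familywise_error <- function(k, alpha = 0.05) {
  m <- choose(k, 2)      # number of pairwise comparisons among k groups
  1 - (1 - alpha)^m      # probability of at least one false positive
}

rates <- sapply(3:5, familywise_error)
names(rates) <- paste0("k=", 3:5)
round(rates, 3)
```

Under this simplification, three groups already push the chance of at least one false positive to roughly 0.14, nearly three times the nominal 0.05.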
Imagine a pharmaceutical company testing three different dosages of a new blood pressure medication (5 mg, 10 mg, and 20 mg). They need to know whether the dosage level affects the reduction in blood pressure, or whether any observed differences are just due to random chance.
The core logic of ANOVA is to partition the total variability in the data into two parts:
1. Variation between groups: Differences caused by the treatment/category.
2. Variation within groups: Natural variation or “noise” within individuals of the same group.
The test statistic for ANOVA is the F-ratio:
\[F = \frac{\text{Mean Square Between (MSB)}}{\text{Mean Square Within (MSW)}}\]
Where:
* Sum of Squares Total (SST): \(\sum (X_{ij} - \bar{X}_{\text{grand}})^2\)
* Sum of Squares Between (SSB): \(\sum n_i (\bar{X}_i - \bar{X}_{\text{grand}})^2\)
* Sum of Squares Within (SSW): \(\sum (X_{ij} - \bar{X}_i)^2\)
* MSB: \(\frac{SSB}{k-1}\) (where \(k\) is the number of groups)
* MSW: \(\frac{SSW}{N-k}\) (where \(N\) is the total sample size)
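The formulas above can be traced step by step on a small example. The sketch below uses made-up toy data (three groups of four observations, not taken from this chapter) to compute each sum of squares by hand and compare the resulting F-ratio with the one reported by `aov()`.

```r
# Toy data, made up for illustration: 3 groups, 4 observations each
toy <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 4)),
  x     = c(4, 5, 6, 5,   7, 8, 9, 8,   5, 6, 7, 6)
)

grand_mean  <- mean(toy$x)
group_means <- tapply(toy$x, toy$group, mean)    # one mean per group
group_n     <- tapply(toy$x, toy$group, length)  # one sample size per group

k <- nlevels(toy$group)   # number of groups
N <- nrow(toy)            # total sample size

sst <- sum((toy$x - grand_mean)^2)
ssb <- sum(group_n * (group_means - grand_mean)^2)
ssw <- sum((toy$x - group_means[as.character(toy$group)])^2)

msb <- ssb / (k - 1)
msw <- ssw / (N - k)
f_manual <- msb / msw

# Cross-check against R's built-in ANOVA table
f_aov <- summary(aov(x ~ group, data = toy))[[1]][["F value"]][1]
c(SST = sst, SSB = ssb, SSW = ssw, F_manual = f_manual, F_aov = f_aov)
```

The between and within sums of squares add up to the total, and the hand-computed F-ratio agrees with the value from `aov()`.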
A farmer wants to test four different fertilizers (A, B, C, D) to see if they produce different mean yields of corn (in bushels per acre).
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Create synthetic dataset: 20 plots per fertilizer, yields drawn from uniform ranges
set.seed(123)
crop_data <- data.frame(
  # Store the grouping variable as a factor; TukeyHSD() needs a factor term
  fertilizer = factor(rep(c("A", "B", "C", "D"), each = 20)),
  yield = c(runif(20, 180, 200),   # Fertilizer A
            runif(20, 185, 205),   # Fertilizer B
            runif(20, 200, 220),   # Fertilizer C (The winner)
            runif(20, 182, 202))   # Fertilizer D
)
# Preview data
head(crop_data)
Before running the test, it is vital to visualize the distributions.
ggplot(crop_data, aes(x = fertilizer, y = yield, fill = fertilizer)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Figure 1: Corn Yield by Fertilizer Type",
       x = "Fertilizer Type",
       y = "Yield (Bushels per Acre)") +
  scale_fill_brewer(palette = "Set2")
# Compute the ANOVA
anova_model <- aov(yield ~ fertilizer, data = crop_data)
# Summarize the results
summary(anova_model)
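Interpretation: The null hypothesis states that all four fertilizers have the same mean yield. If Pr(>F) (the p-value) is less than the chosen significance level (conventionally 0.05), we reject the null hypothesis and conclude that at least one fertilizer produces a different mean yield.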
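When the decision needs to be made in code rather than read off the printed table, the F-value and p-value can be pulled out of the summary object. The sketch below uses R's built-in `PlantGrowth` dataset as a stand-in for the fertilizer data, and recomputes the p-value from the F distribution with `pf()` to show where Pr(>F) comes from.

```r
# Built-in PlantGrowth data stands in for the fertilizer example
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame

f_value <- tab[["F value"]][1]    # first row is the group term
p_value <- tab[["Pr(>F)"]][1]

# Same p-value from the F distribution: upper tail with (k - 1, N - k) df
p_from_f <- pf(f_value,
               df1 = tab[["Df"]][1],
               df2 = tab[["Df"]][2],
               lower.tail = FALSE)

alpha <- 0.05
reject_null <- p_value < alpha
reject_null
```

Indexing by position avoids relying on the row names of the summary table, which R pads with trailing spaces.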
ANOVA tells us if there is a difference, but not where it is. To find which specific fertilizers differ, we use Tukey’s Honestly Significant Difference (HSD) test.
# Tukey HSD test
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)
# Plotting the Tukey results (Figure 2)
plot(tukey_result, las = 1)
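With many groups, scanning the printed Tukey table becomes tedious. The sketch below shows one way to list only the pairs whose adjusted p-value falls below 0.05. It uses the built-in `PlantGrowth` dataset as a stand-in, and the 0.05 cutoff is an assumption. Note that `TukeyHSD()` expects the grouping variable to be a factor.

```r
# PlantGrowth$group is already a factor, as TukeyHSD() requires
fit   <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# Each model term is stored as a matrix with columns diff, lwr, upr, "p adj"
tukey_mat <- tukey$group

# Names of the pairwise comparisons with adjusted p-value below 0.05
sig_pairs <- rownames(tukey_mat)[tukey_mat[, "p adj"] < 0.05]
sig_pairs

# Full rows for those comparisons (drop = FALSE keeps the matrix shape)
tukey_mat[sig_pairs, , drop = FALSE]
```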
A company wants to know if Sales are affected by two factors:
1. Ad Type (Social Media vs. TV).
2. Region (East vs. West).
They also want to know if there is an interaction (e.g., does Social Media work better in the West than the East?).
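# Generate data: 2 x 2 design with 25 observations per cell
marketing_data <- expand.grid(
  ad_type = c("Social Media", "TV"),
  region = c("East", "West")
) %>%
  slice(rep(1:n(), each = 25)) %>%       # repeat each cell 25 times
  mutate(sales = c(rnorm(25, 50, 5),     # Social Media, East
                   rnorm(25, 40, 5),     # TV, East
                   rnorm(25, 60, 5),     # Social Media, West
                   rnorm(25, 45, 5)))    # TV, West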
# Two-Way ANOVA with Interaction
two_way_anova <- aov(sales ~ ad_type * region, data = marketing_data)
summary(two_way_anova)
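Interaction plots show whether the effect of one factor depends on the level of another: roughly parallel lines suggest no interaction, while lines that diverge or cross suggest one.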
with(marketing_data, interaction.plot(ad_type, region, sales,
                                      fixed = TRUE, col = c("red", "blue"),
                                      lwd = 2, main = "Figure 3: Interaction Plot of Sales"))
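An interaction in a 2x2 design can be read as a "difference of differences". The sketch below illustrates this with the population means used to simulate the sales data (50, 40, 60, 45). These are simulation parameters, not estimates from a fitted model, so the result shows the idea rather than an analysis outcome.

```r
# Cell means taken from the rnorm() calls in the simulation (illustrative only)
cell_means <- matrix(c(50, 40,    # East: Social Media, TV
                       60, 45),   # West: Social Media, TV
                     nrow = 2,
                     dimnames = list(ad_type = c("Social Media", "TV"),
                                     region  = c("East", "West")))

# Simple effect of ad type within each region (Social Media minus TV)
ad_effect_by_region <- cell_means["Social Media", ] - cell_means["TV", ]

# Interaction contrast: how much the ad-type effect changes across regions
interaction_contrast <- ad_effect_by_region[["West"]] - ad_effect_by_region[["East"]]

ad_effect_by_region
interaction_contrast
```

A contrast of zero would mean the Social Media advantage is the same in both regions. Here the advantage is 10 in the East and 15 in the West, so the lines in an interaction plot would not be parallel.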
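For ANOVA results to be valid, three assumptions must hold:
1. Independence: observations are independent of one another.
2. Normality: the residuals are approximately normally distributed.
3. Homogeneity of variance: the groups have approximately equal variances.
The diagnostic plots and tests below address the last two; independence is a matter of study design.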
# Set up a 2x2 plotting area
par(mfrow = c(2, 2))
plot(anova_model)
par(mfrow = c(1, 1)) # Reset plotting area
# Normality test (Shapiro-Wilk)
shapiro.test(residuals(anova_model))
# Homogeneity of Variance (Levene's Test)
library(car)
leveneTest(yield ~ fertilizer, data = crop_data)
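To make the logic of Levene's test less of a black box, the sketch below reproduces its idea in base R: run a one-way ANOVA on each observation's absolute deviation from its group median. This median-centered version (also called Brown-Forsythe) matches the default `center = median` in `car::leveneTest()`. The built-in `PlantGrowth` dataset is used as a stand-in.

```r
# Levene's test (median-centered) by hand on built-in PlantGrowth data
dat <- PlantGrowth

# Absolute deviation of each observation from its own group's median
group_median <- ave(dat$weight, dat$group, FUN = median)
dat$abs_dev  <- abs(dat$weight - group_median)

# One-way ANOVA on the deviations; a small p-value suggests unequal spread
levene_tab <- summary(aov(abs_dev ~ group, data = dat))[[1]]
levene_f   <- levene_tab[["F value"]][1]
levene_p   <- levene_tab[["Pr(>F)"]][1]

c(F = levene_f, p = levene_p)
```

If the groups have similar spread, their average absolute deviations are similar and the F-value stays small.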
ANOVA is a practical tool for decision-making in settings ranging from agriculture to marketing. With R, we can compute the F-statistic, visualize the group differences, and check whether the data are consistent with the assumptions the test relies on.
Summary Checklist for ANOVA in R:
1. Exploratory Data Analysis: ggplot2 boxplots.
2. Fit Model: aov().
3. Check Assumptions: plot() and leveneTest().
4. Post-hoc: TukeyHSD() to find specific differences.