This chapter is a comprehensive guide to Analysis of Variance (ANOVA): it covers the theoretical background, the mathematical framework, and hands-on R code built around realistic example datasets.


Chapter: ANOVA and its Application in R

1. Introduction

Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine whether at least one group mean differs significantly from the others. While a t-test compares two groups, ANOVA handles multiple groups in a single test, keeping the overall (family-wise) Type I error rate (the risk of a “false positive”) under control, which a series of pairwise t-tests would not do.

Real-Life Application

Imagine a pharmaceutical company testing three different dosages of a new blood pressure medication (5 mg, 10 mg, and 20 mg). They need to know if the dosage level significantly impacts the reduction in blood pressure, or if any observed differences are just due to random chance.


2. Mathematical Foundation

The core logic of ANOVA is to partition the total variability in the data into two parts:

  1. Variation between groups: Differences caused by the treatment/category.
  2. Variation within groups: Natural variation or “noise” within individuals of the same group.

2.1 The Hypotheses

  • Null Hypothesis (\(H_0\)): \(\mu_1 = \mu_2 = \dots = \mu_k\) (All group means are equal).
  • Alternative Hypothesis (\(H_a\)): At least one \(\mu_i\) is different.

2.2 The F-Statistic

The test statistic for ANOVA is the F-ratio:

\[F = \frac{\text{Mean Square Between (MSB)}}{\text{Mean Square Within (MSW)}}\]

Where:

  • Sum of Squares Total (SST): \(\sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{grand})^2\)
  • Sum of Squares Between (SSB): \(\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X}_{grand})^2\)
  • Sum of Squares Within (SSW): \(\sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2\)
  • Mean Square Between (MSB): \(\frac{SSB}{k-1}\) (where \(k\) is the number of groups)
  • Mean Square Within (MSW): \(\frac{SSW}{N-k}\) (where \(N\) is the total sample size)

These quantities satisfy the decomposition \(SST = SSB + SSW\): the total variability splits exactly into a between-group part and a within-group part.
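
To make these formulas concrete, the short base R sketch below computes SSB, SSW, MSB, MSW, and the F-ratio by hand for a small made-up dataset with \(k = 3\) groups (the toy data frame and its values are purely illustrative); for a one-way design the resulting F-ratio matches what aov(score ~ group, data = toy) reports.

# Toy data: three groups of five observations each (values are illustrative only)
toy <- data.frame(
  group = rep(c("G1", "G2", "G3"), each = 5),
  score = c(10, 12, 11, 13, 9,    # Group G1
            14, 15, 13, 16, 14,   # Group G2
            20, 18, 19, 21, 22)   # Group G3
)

grand_mean  <- mean(toy$score)
group_means <- tapply(toy$score, toy$group, mean)
group_sizes <- tapply(toy$score, toy$group, length)

k <- length(group_means)  # number of groups
N <- nrow(toy)            # total sample size

# Sums of squares
SSB <- sum(group_sizes * (group_means - grand_mean)^2)
SSW <- sum((toy$score - group_means[toy$group])^2)
SST <- sum((toy$score - grand_mean)^2)  # equals SSB + SSW

# Mean squares and the F-ratio
MSB <- SSB / (k - 1)
MSW <- SSW / (N - k)
F_ratio <- MSB / MSW

c(SSB = SSB, SSW = SSW, SST = SST, F = F_ratio)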


3. One-Way ANOVA in R (Real-Life Example: Agriculture)

Scenario:

A farmer wants to test four different fertilizers (A, B, C, D) to see if they produce different mean yields of corn (in bushels per acre).

3.1 Data Preparation

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Create synthetic dataset
set.seed(123)
crop_data <- data.frame(
  fertilizer = rep(c("A", "B", "C", "D"), each = 20),
  yield = c(runif(20, 180, 200), # Fertilizer A
            runif(20, 185, 205), # Fertilizer B
            runif(20, 200, 220), # Fertilizer C (The winner)
            runif(20, 182, 202)) # Fertilizer D
)

# Preview data
head(crop_data)
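
Before plotting, it also helps to glance at the group-level summary statistics; the short dplyr sketch below computes the sample size, mean, and standard deviation of yield for each fertilizer.

# Group-wise summary statistics for yield
crop_data %>%
  group_by(fertilizer) %>%
  summarise(
    n          = n(),
    mean_yield = mean(yield),
    sd_yield   = sd(yield)
  )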

3.2 Visualizing the Data (Figure 1)

Before running the test, it is vital to visualize the distributions.

ggplot(crop_data, aes(x = fertilizer, y = yield, fill = fertilizer)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Figure 1: Corn Yield by Fertilizer Type",
       x = "Fertilizer Type",
       y = "Yield (Bushels per Acre)") +
  scale_fill_brewer(palette = "Set2")

3.3 Running the ANOVA

# Compute the ANOVA
anova_model <- aov(yield ~ fertilizer, data = crop_data)

# Summarize the results
summary(anova_model)

Interpretation: If the Pr(>F) value (the p-value) is less than 0.05, we reject the null hypothesis and conclude that at least one fertilizer produces a different mean yield.
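
If the p-value is needed programmatically (for example, in a reporting script), it can be extracted from the summary object; the sketch below assumes the standard structure returned by summary() on an aov fit, where the first row of the table corresponds to the fertilizer term.

# Extract the p-value for the fertilizer effect from the ANOVA table
anova_table <- summary(anova_model)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]
p_value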


4. Post-hoc Analysis (Tukey’s HSD)

A significant ANOVA tells us that a difference exists, but not where it lies. To find which specific fertilizers differ, we use Tukey’s Honestly Significant Difference (HSD) test.

# Tukey HSD test
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)

# Plotting the Tukey results (Figure 2)
plot(tukey_result, las = 1)
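
To report only the significant pairs, the Tukey output can be converted to a data frame and filtered; this is a small sketch assuming the tukey_result object created above and a 0.05 cutoff.

# Convert the Tukey results to a data frame and keep the significant pairs
tukey_df <- as.data.frame(tukey_result$fertilizer)
tukey_df$comparison <- rownames(tukey_df)
subset(tukey_df, `p adj` < 0.05)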

5. Two-Way ANOVA (Real-Life Example: Marketing)

Scenario:

A company wants to know whether sales are affected by two factors:

  1. Ad Type (Social Media vs. TV).
  2. Region (East vs. West).

They also want to know whether there is an interaction (e.g., does Social Media work better in the West than in the East?).

5.1 Data Generation and Modeling

# Generate synthetic data (seed added so the results are reproducible)
set.seed(123)
marketing_data <- expand.grid(
  ad_type = c("Social Media", "TV"),
  region = c("East", "West")
) %>%
  slice(rep(1:n(), each = 25)) %>%   # 25 observations per ad_type x region cell
  mutate(sales = c(rnorm(25, 50, 5),   # Social Media / East
                   rnorm(25, 40, 5),   # TV / East
                   rnorm(25, 60, 5),   # Social Media / West (highest mean)
                   rnorm(25, 45, 5)))  # TV / West

# Two-Way ANOVA with Interaction
two_way_anova <- aov(sales ~ ad_type * region, data = marketing_data)
summary(two_way_anova)
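
To help read the main effects and the interaction term, it is useful to tabulate the mean sales in each ad_type by region cell; the sketch below uses dplyr (already loaded) on the marketing_data frame built above.

# Mean sales per ad_type x region cell
marketing_data %>%
  group_by(ad_type, region) %>%
  summarise(mean_sales = mean(sales), .groups = "drop")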

5.2 Interaction Plot (Figure 3)

Interaction plots show if the effect of one factor depends on another.

with(marketing_data, interaction.plot(ad_type, region, sales, 
                                      fixed = TRUE, col = c("red", "blue"),
                                      lwd = 2, main = "Figure 3: Interaction Plot of Sales"))
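
As an alternative to the base R interaction plot, a similar picture can be drawn with ggplot2 (already loaded); the sketch below uses stat_summary() to compute and connect the cell means.

# ggplot2 version of the interaction plot
ggplot(marketing_data, aes(x = ad_type, y = sales, colour = region, group = region)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  stat_summary(fun = mean, geom = "line") +
  theme_minimal() +
  labs(title = "Interaction of Ad Type and Region on Sales",
       x = "Ad Type", y = "Mean Sales")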

6. Testing ANOVA Assumptions

For ANOVA results to be valid, three assumptions must hold:

  1. Independence: Each observation is independent (ensured by study design).
  2. Normality: The residuals (errors) should follow a normal distribution.
  3. Homogeneity of Variance: The variance within each group should be similar (Homoscedasticity).

6.1 Diagnostic Plots (Figure 4)

# Set up a 2x2 plotting area
par(mfrow = c(2, 2))
plot(anova_model)
par(mfrow = c(1, 1)) # Reset plotting area

  • Q-Q Plot: Points should follow the diagonal line for normality.
  • Residuals vs Fitted: Points should be randomly scattered for homogeneity.

6.2 Statistical Tests for Assumptions

# Normality test (Shapiro-Wilk)
shapiro.test(residuals(anova_model))

# Homogeneity of Variance (Levene's Test)
library(car)
leveneTest(yield ~ fertilizer, data = crop_data)

7. Conclusion

ANOVA is a powerful tool for decision-making in real-life scenarios ranging from agriculture to marketing. Using R, we can not only compute the F-statistic but also visualize the group differences and verify that the data meet the assumptions on which the test relies.

Summary Checklist for ANOVA in R:

  1. Exploratory Data Analysis: ggplot2 boxplots.
  2. Fit Model: aov().
  3. Check Assumptions: plot() and leveneTest().
  4. Post-hoc: TukeyHSD() to find specific differences.