Imagine we’re working for a pharmaceutical company trying to see
which Vitamin C dose makes guinea pig teeth grow the longest.
We use the built-in R dataset ToothGrowth,
which records the effect of Vitamin C on tooth growth (measured as the
length of odontoblasts, the cells responsible for tooth growth) in
60 guinea pigs. We initially focus on the different dosage levels (dose).
Data Preparation and Exploration in R (One-Way Setup)
# Load the dataset (it's already in R)
data(ToothGrowth)
# Convert 'dose' to a factor/categorical variable for all ANOVA modeling
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
# Inspect the first few rows and structure
head(ToothGrowth)
str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Visualize the data using a Boxplot (Essential first step!)
library(ggplot2)
ggplot(ToothGrowth, aes(x = dose, y = len, fill = dose)) +
geom_boxplot() +
labs(title = "Tooth Length vs. Vitamin C Dose",
x = "Dose (mg/day)",
y = "Tooth Length (mm)") +
theme_minimal()
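The boxplot can be backed up with a numeric summary per dose group. The following is a minimal sketch in base R; the object names `tg` and `group_summary` are illustrative and not part of the original analysis.

```r
# Work on a local copy so the built-in dataset is left untouched
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# Sample size, mean, and standard deviation of tooth length per dose group
group_summary <- data.frame(
  dose = levels(tg$dose),
  n    = as.vector(tapply(tg$len, tg$dose, length)),
  mean = as.vector(tapply(tg$len, tg$dose, mean)),
  sd   = as.vector(tapply(tg$len, tg$dose, sd))
)
print(group_summary)
```

The group means are the quantities the ANOVA compares, so it helps to have them in view before fitting the model.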

Running the One-Way ANOVA Model
The Setup (The Model)
- Tooth Length (\(\text{len}\)): This is what we’re
measuring—our Dependent Variable.
- Vitamin C Dose (\(\text{dose}\)): This is what we’re
changing—our Factor. We have 3 groups: Low (0.5),
Medium (1.0), and High (2.0).
- The Big Question (\(\mathbf{H_0}\)): Does the dose
really matter? Or is any difference in tooth length just due to
random luck (some guinea pigs are naturally bigger than others)? Our
Null Hypothesis (\(H_0\)) is:
“All three dose groups have the same population mean tooth
length,” i.e. \(\mu_{0.5} = \mu_{1.0} = \mu_{2.0}\).
# 1. Fit the ANOVA model
# The formula is 'len' (outcome) explained only by 'dose' (factor)
anova_model_one_way <- aov(len ~ dose, data = ToothGrowth)
# 2. View the ANOVA Summary Table
summary(anova_model_one_way)
Df Sum Sq Mean Sq F value Pr(>F)
dose 2 2426 1213 67.42 9.53e-16 ***
Residuals 57 1026 18
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Variability Breakdown (Sum of Squares)
ANOVA’s job is to figure out why the tooth lengths
vary. We split the total variation into two piles of “credit”:
| Source of variation | ANOVA table entry | What it represents | Sum Sq |
|---|---|---|---|
| Differences BETWEEN Groups | \(\text{Sum Sq}\) for \(\text{dose}\) | The “Dose Effect” Credit: The variation in tooth length that is attributable to the fact that we changed the dose. It reflects the differences between the average tooth lengths of the Low, Medium, and High groups. | 2426 |
| Differences WITHIN Groups | \(\text{Sum Sq}\) for \(\text{Residuals}\) | The “Random Luck” Credit (Error): The variation within a single group (e.g., why one guinea pig on the Low dose has a longer tooth than another guinea pig on the same Low dose). This is the unavoidable natural variation or “noise.” | 1026 |
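The two piles can be recomputed directly from the data, which makes the split concrete. This is a sketch under the same setup as above; the names `ss_between`, `ss_within`, and `ss_total` are illustrative.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

grand_mean  <- mean(tg$len)
group_means <- ave(tg$len, tg$dose)  # each row receives its own group's mean

# Between groups: how far each group mean sits from the grand mean
ss_between <- sum((group_means - grand_mean)^2)
# Within groups: how far each animal sits from its own group mean
ss_within  <- sum((tg$len - group_means)^2)
# Total variation around the grand mean
ss_total   <- sum((tg$len - grand_mean)^2)

round(c(between = ss_between, within = ss_within, total = ss_total), 1)
```

The between and within pieces add up to the total, which is why ANOVA can speak of dividing the variation into two parts.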
The Comparison (The F-Test)
We use the Mean Squares (\(\text{Mean Sq}\)) to turn the “piles of
credit” into variance estimates: each Sum of Squares is divided by its
degrees of freedom (2 for dose, 57 for residuals), which puts the two
piles on a comparable scale.
The F-statistic (67.42) is the ratio of these two
variances:
\[\mathbf{F} = \frac{\text{Mean
Sq}_{\text{Dose}}}{\text{Mean Sq}_{\text{Residuals}}} = \frac{\text{The
Variance explained by our factor}}{\text{The Unexplained Variance
(Error)}}\]
- If the Null Hypothesis is TRUE (Dose doesn’t
matter): The numerator (Dose Effect) should be about the same
size as the denominator (Random Luck), so \(F
\approx 1\).
- If the Null Hypothesis is FALSE (Dose DOES matter):
The Dose Effect is much bigger than Random Luck, so \(\mathbf{F \gg 1}\).
Our F-value is 67.42. This is a big
number, telling us that the mean square between the dose
groups (1213) is about 67 times larger than the mean square
within the groups (18). This strongly suggests the dose is doing
something!
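The ratio can be rebuilt from the ANOVA table itself. A minimal sketch, assuming the same model as above; `fit`, `tab`, and `f_by_hand` are illustrative names.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)
fit <- aov(len ~ dose, data = tg)

# summary() of an aov fit returns a list; its first element is the table
tab <- summary(fit)[[1]]

# Mean Sq = Sum Sq / Df, row 1 is dose and row 2 is Residuals
mean_sq   <- tab[["Sum Sq"]] / tab[["Df"]]
f_by_hand <- mean_sq[1] / mean_sq[2]

f_by_hand
```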
The Final Verdict (The P-Value)
- P-Value (\(\text{Pr(>F)}\)): \(\mathbf{9.53\text{e-16}}\) (or \(0.000...0953\)).
- What this P-Value means: This is the probability of
getting an F-ratio at least as large as 67.42 if the null hypothesis were
actually true (i.e., if all three population means were exactly the same).
- Since this probability is extremely tiny (much less than
0.05), results like ours would be very unlikely to arise from
random chance alone.
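The p-value is the upper-tail area of the F distribution with 2 and 57 degrees of freedom. As a sketch, it can be recovered with `pf()` from base R; the variable names are illustrative.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)
fit <- aov(len ~ dose, data = tg)
tab <- summary(fit)[[1]]

f_value <- tab[["F value"]][1]

# Probability of an F at least this large when H0 is true
p_value <- pf(f_value,
              df1 = tab[["Df"]][1],   # numerator df: groups - 1
              df2 = tab[["Df"]][2],   # denominator df: observations - groups
              lower.tail = FALSE)
p_value
```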
Conclusion:
We reject the Null Hypothesis. There is compelling
evidence that the average tooth length is significantly affected
by the dose of Vitamin C. The dose level definitely
matters!
Assumptions and Diagnostics
ANOVA requires three key assumptions:
Independence,
Normality of Residuals, and
Homogeneity of Variances.
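# --- Checking Normality of Residuals ---
# A visual check using the Q-Q plot of the residuals
par(mfrow = c(1, 2)) # Set up a 1 x 2 plotting area
plot(anova_model_one_way, which = 1) # Residuals vs Fitted: spread should look similar across groups.
plot(anova_model_one_way, which = 2) # Q-Q Plot: Points should fall along the line.
par(mfrow = c(1, 1)) # Reset the plotting area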
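A formal test can accompany the visual check. The sketch below uses the Shapiro-Wilk test from base R, which is a swapped-in technique and not part of the original walkthrough; its null hypothesis is that the residuals come from a normal distribution.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)
fit <- aov(len ~ dose, data = tg)

# H0: residuals are normally distributed.
# A non-significant p-value (p > 0.05) gives no evidence against normality.
sw <- shapiro.test(residuals(fit))
print(sw)
```

With 60 observations the Q-Q plot and the test should be read together, since a formal test alone says little about how large any departure from normality is.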
Levene’s Test for Homogeneity of Variance
Before trusting the p-value from the ANOVA, we must check its
assumptions. The most critical assumptions are
normality (data within each group follows a normal
distribution) and homogeneity of variance (the
variance, or spread, of the data is roughly the same across all
groups).
Levene’s Test is used specifically to check the homogeneity
of variance assumption.
The Hypotheses for Levene’s Test
Levene’s Test uses an \(F\)-statistic, just like ANOVA, but its
hypotheses are about the variances (\(\sigma^2\)), not the means (\(\mu\)):
- Null Hypothesis (\(H_0\)): All population variances
are equal. \[\mathbf{H_0}: \sigma^2_{0.5} =
\sigma^2_{1.0} = \sigma^2_{2.0}\] (In plain language: The spread
of tooth length data is the same for all three dose
groups.)
- Alternative Hypothesis (\(H_a\)): At least one population
variance is different. \[\mathbf{H_a}:
\text{Not all } \sigma^2 \text{'s are equal.}\] (In plain
language: The spread of tooth length data is different
for at least one dose group.)
In contrast to the ANOVA, we actually want to FAIL to reject
\(H_0\) for Levene’s test.
Rejecting \(H_0\) means the assumption
is violated, and our ANOVA results might be unreliable.
# --- Checking Homogeneity of Variance ---
# Formal Test for Homogeneity: Levene's Test (use 'car' package)
library(car)
leveneTest(len ~ dose, data = ToothGrowth)
# H0 for Levene's test: Variances are equal.
# A non-significant p-value (p > 0.05) means we DO NOT reject H0,
# supporting the assumption of homogeneity.
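To see what the test is doing, it can be reproduced without the `car` package. This sketch assumes the default median-centred version of Levene's test (the Brown-Forsythe variant): take each animal's absolute deviation from its group median, then run a one-way ANOVA on those deviations. The column name `abs_dev` is illustrative.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# Absolute deviation of each observation from its own group's median
tg$abs_dev <- abs(tg$len - ave(tg$len, tg$dose, FUN = median))

# One-way ANOVA on the deviations: groups with a larger spread
# have larger average deviations
levene_by_hand <- anova(lm(abs_dev ~ dose, data = tg))
print(levene_by_hand)
```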
Interpreting the Output
| | Df | F value | Pr(>F) |
|---|---|---|---|
| \(\text{group}\) | 2 | 0.6457 | 0.5281 |
| | 57 | | |
Degrees of Freedom (\(\text{Df}\)):
- \(\text{group}\)
(2): Degrees of freedom for the numerator (number of groups
minus 1).
- 57: Degrees of freedom for the denominator (total
observations minus number of groups).
F-Value (\(\mathbf{0.6457}\)): This is the
test statistic for Levene’s test.
P-Value (\(\text{Pr(>F)}\)): \(\mathbf{0.5281}\). This is the probability
of observing an \(F\)-statistic this
large (or larger) if the variances were truly equal.
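The degrees of freedom and the p-value follow from the group count, the sample size, and the reported statistic. A short sketch; the variable names are illustrative, and the F-value is typed in from the output above.

```r
k       <- 3    # number of dose groups
n_total <- 60   # total number of guinea pigs

df_num <- k - 1          # numerator degrees of freedom
df_den <- n_total - k    # denominator degrees of freedom

# Upper-tail probability of the reported Levene statistic under H0
p_levene <- pf(0.6457, df1 = df_num, df2 = df_den, lower.tail = FALSE)
round(p_levene, 4)
```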
Conclusion on Homogeneity of Variance
To make a decision, we compare the p-value (\(\mathbf{0.5281}\)) to our significance
level, \(\alpha = 0.05\).
- Since the p-value (\(0.5281\)) is
greater than \(\alpha =
0.05\), we FAIL to reject the Null Hypothesis (\(\mathbf{H_0}\)) of equal
variances.
What this means: We do not have sufficient
statistical evidence to conclude that the variance (spread) in tooth
length is different across the three dose groups.
Final Assessment of the ANOVA Model:
Because we failed to reject \(H_0\) in Levene’s test, we have
no evidence that the homogeneity of variance assumption is
violated. As far as this assumption is concerned, the \(F\)-test
and p-value obtained from our initial one-way ANOVA (which found a
significant effect of dose) can be trusted.
Next step: If this test
had been significant (\(p <
0.05\)), we would need a more robust alternative, such as
Welch’s
ANOVA, which does not assume equal
variances, or the Kruskal-Wallis
test, a rank-based method that does not assume normality.
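Both alternatives are available in base R. The sketch below shows how they would be called on the same data; it is included for illustration only, since Levene's test did not indicate a problem here.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# Welch's ANOVA: one-way test without the equal-variance assumption
welch_result <- oneway.test(len ~ dose, data = tg, var.equal = FALSE)
print(welch_result)

# Kruskal-Wallis: rank-based test that does not assume normality
kw_result <- kruskal.test(len ~ dose, data = tg)
print(kw_result)
```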
Post-Hoc Analysis (The Follow-up)
If the overall ANOVA \(F\)-test is
significant, we use post-hoc tests to find which specific pairs differ,
controlling the Family-Wise Error Rate (FWER).
Our F-test told us “Something is different.” It
didn’t tell us “Which specific dose levels are different from
each other.”
To answer that, you need to run Post-Hoc Tests (like
Tukey’s HSD) to perform multiple head-to-head comparisons:
- Does \(0.5\text{ mg}\) differ from \(1.0\text{ mg}\)?
- Does \(1.0\text{ mg}\) differ from \(2.0\text{ mg}\)?
- Does \(0.5\text{ mg}\) differ from \(2.0\text{ mg}\)?
# If the F-test was significant (e.g., p < 0.05), run Tukey's HSD.
tukey_results_one_way <- TukeyHSD(anova_model_one_way)
# Print the results
print(tukey_results_one_way)
Tukey’s HSD performs all possible pairwise comparisons
(0.5 vs. 1, 0.5 vs. 2, and 1 vs. 2)
while controlling the Family-Wise Error Rate (FWER): the probability of
making at least one false-positive claim across the whole set.
This means the overall 95% confidence level applies to the entire set of
three comparisons, not just to each individual one.
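The family-wise adjustment comes from the studentized range distribution. As a sketch, one comparison (dose 1 vs. dose 0.5) can be rebuilt by hand with `qtukey()` and `ptukey()` from base R; the variable names are illustrative.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)
fit <- aov(len ~ dose, data = tg)

means <- tapply(tg$len, tg$dose, mean)
n     <- tapply(tg$len, tg$dose, length)
mse   <- deviance(fit) / df.residual(fit)   # residual mean square

# Difference in means and its standard error on the Tukey scale
diff_est <- unname(means["1"] - means["0.5"])
se       <- sqrt(mse / 2 * (1 / n[["1"]] + 1 / n[["0.5"]]))

# Critical value for 3 group means at a 95% family-wise confidence level
q_crit <- qtukey(0.95, nmeans = length(means), df = df.residual(fit))
ci     <- diff_est + c(lwr = -1, upr = 1) * q_crit * se

# Adjusted p-value from the studentized range distribution
p_adj <- ptukey(abs(diff_est) / se, nmeans = length(means),
                df = df.residual(fit), lower.tail = FALSE)

round(c(diff = diff_est, ci), 3)
```

The critical value depends on the number of group means being compared, which is what widens the intervals relative to three separate t-tests.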
Understanding the Output Structure
| Comparison | diff | lwr | upr | p adj |
|---|---|---|---|---|
| \(\mathbf{1 - 0.5}\) | \(9.130\) | \(5.901805\) | \(12.358195\) | \(0.00\text{e}+00\) |
| \(\mathbf{2 - 0.5}\) | \(15.495\) | \(12.266805\) | \(18.723195\) | \(0.00\text{e}+00\) |
| \(\mathbf{2 - 1}\) | \(6.365\) | \(3.136805\) | \(9.593195\) | \(4.25\text{e}-05\) |
Each row shows the result for one of the three comparisons between the
dose levels (0.5, 1, and 2 mg/day).
Analysis of Each Comparison
We look at two things for each row:
1. Does the 95% Confidence Interval (CI) contain zero? If the interval (\(\text{lwr}\) to \(\text{upr}\)) does not
contain zero, the difference is statistically significant.
2. Is the Adjusted P-value (\(\text{p adj}\))
less than \(\alpha = 0.05\)?
If yes, the difference is statistically significant.
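Both checks can be applied to every row at once. A minimal sketch that pulls the result matrix out of the `TukeyHSD()` object; the names `tk`, `excludes_zero`, and `p_below_alpha` are illustrative.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)
fit <- aov(len ~ dose, data = tg)

# Matrix with columns diff, lwr, upr, p adj and one row per comparison
tk <- TukeyHSD(fit)$dose

# Check 1: does the 95% interval exclude zero?
excludes_zero <- tk[, "lwr"] > 0 | tk[, "upr"] < 0
# Check 2: is the adjusted p-value below alpha = 0.05?
p_below_alpha <- tk[, "p adj"] < 0.05

data.frame(excludes_zero, p_below_alpha)
```

The two checks always agree for a given row, because the interval and the adjusted p-value are derived from the same distribution.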
Comparison 1: Dose 1.0 vs. Dose 0.5 (1 - 0.5)
- Difference (\(\text{diff}\)): \(9.130\). The average tooth length at \(1.0\text{ mg}\) dose is \(9.13\text{ units}\) longer than at
the \(0.5\text{ mg}\) dose.
- 95% CI: \([5.90,
12.36]\). This entire interval is positive (it
does not contain zero).
- P-Value (\(\text{p adj}\)): \(0.00\text{e}+00\), i.e. too small to display at the printed precision rather than exactly zero.
- Conclusion: The difference between \(1.0\text{ mg}\) and \(0.5\text{ mg}\) is highly
statistically significant.
Comparison 2: Dose 2.0 vs. Dose 0.5 (2 - 0.5)
- Difference (\(\text{diff}\)): \(15.495\). The average tooth length at \(2.0\text{ mg}\) dose is \(15.5\text{ units}\) longer than at
the \(0.5\text{ mg}\) dose.
- 95% CI: \([12.27,
18.72]\). This interval is also fully positive
(does not contain zero).
- P-Value (\(\text{p adj}\)): \(0.00\text{e}+00\), i.e. too small to display at the printed precision rather than exactly zero.
- Conclusion: The difference between \(2.0\text{ mg}\) and \(0.5\text{ mg}\) is highly
statistically significant.
Comparison 3: Dose 2.0 vs. Dose 1.0 (2 - 1)
- Difference (\(\text{diff}\)): \(6.365\). The average tooth length at \(2.0\text{ mg}\) dose is \(6.365\text{ units}\) longer than
at the \(1.0\text{ mg}\) dose.
- 95% CI: \([3.14,
9.59]\). This interval is also fully positive
(does not contain zero).
- P-Value (\(\text{p
adj}\)): \(4.25\text{e}-05\), which is \(0.0000425\). This is much smaller than
\(0.05\).
- Conclusion: The difference between \(2.0\text{ mg}\) and \(1.0\text{ mg}\) is statistically
significant.
Summary of Findings
Based on the Tukey HSD test, we find that all three pairwise
comparisons are statistically significant. This means that
increasing the dose from one level to the next (low to medium, medium to
high, and low to high) resulted in a measurable, non-random increase in
average tooth length.
- \(\mathbf{0.5\text{ mg}}\) is
significantly different from \(\mathbf{1.0\text{ mg}}\).
- \(\mathbf{0.5\text{ mg}}\) is
significantly different from \(\mathbf{2.0\text{ mg}}\).
- \(\mathbf{1.0\text{ mg}}\) is
significantly different from \(\mathbf{2.0\text{ mg}}\).
In a research context, you would report this by stating that
increasing the Vitamin C dose leads to a dose-dependent
increase in tooth growth, and that the effect observed at each
dose level is distinct from the others.
# Visualize the post-hoc results
plot(tukey_results_one_way)