ANOVA Tests

Author

D.McCabe

Published

February 9, 2026

PCA connection: where PCA finds principal components of variance, ANOVA analyzes variance in the directions of the independent variables.

One Way ANOVA

One-way ANOVA tests whether there are statistically significant differences between the means of three or more groups classified by a single independent factor (Student’s t-test is used to compare two groups).

hypotheses:

Null \(H_0\): all groups have the same mean
Alternative \(H_A\): not all groups have the same mean

Requires/Expects:

single continuous/quantitative dependent variable \(y\) (aka metric variable)
single categorical/qualitative independent variable (factor) \(p\) (aka nominal variable)

\[y\sim f(p)\]
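One common way to write this relationship explicitly is the effects model. The notation here is an assumption and does not appear elsewhere in these notes: \(\mu\) is the grand mean, \(\alpha_j\) is the effect of group \(j\), and \(\varepsilon_{ij}\) is the error for observation \(i\) in group \(j\).

\[y_{ij} = \mu + \alpha_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)\]

Under this parameterisation the null hypothesis reads \(\alpha_1 = \dots = \alpha_k = 0\), and the normality and equal-variance assumptions are both statements about \(\varepsilon_{ij}\).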

Assumptions:

independence: observations should be independent of one another, and the independent variables should be mutually independent (strictly no confounding, e.g. intelligence ~ age, shoe_size, where age and shoe size are correlated)
normality: data should be normally distributed within groups (testable using the Shapiro-Wilk test), or equivalently the residuals of the fitted model should be normally distributed (testable via a qqnorm + qqline plot). The mean squared error (MSE) of the residuals estimates the within-group variance, and the F-test relies on normality for valid inference.
homogeneity / homoscedasticity: each group should have the same variance (testable with the Levene test; if violated, a Welch ANOVA can be used, where the degrees of freedom are adjusted as in Welch’s t-test)

Groups must be chosen so that the data are normally distributed within each group.
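The assumption checks can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not a prescribed workflow: the Levene test is not in base R, so it is computed here by hand in its median-centred (Brown-Forsythe) form as a one-way ANOVA on absolute deviations from each group's median; the variable names (`tg`, `fit`, `abs_dev`) are illustrative.

```r
# Assumption checks for a one-way ANOVA, using base R only.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

fit <- aov(len ~ dose, data = tg)

# Normality: Shapiro-Wilk test on the model residuals.
sw <- shapiro.test(residuals(fit))

# Homogeneity of variance: Levene-type test (median-centred variant),
# i.e. a one-way ANOVA on absolute deviations from each group's median.
tg$abs_dev <- abs(tg$len - ave(tg$len, tg$dose, FUN = median))
levene <- anova(lm(abs_dev ~ dose, data = tg))

# Welch ANOVA: a fallback that does not assume equal variances.
welch <- oneway.test(len ~ dose, data = tg, var.equal = FALSE)

print(sw$p.value)               # small p-value: evidence against normality
print(levene[["Pr(>F)"]][1])    # small p-value: evidence against equal variances
print(welch$p.value)            # Welch-adjusted test of equal means
```

A small p-value in the first two tests signals a violated assumption, whereas a small p-value in the Welch test is evidence against the null hypothesis of equal means.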

test statistic:

Uses the F distribution/F-value: \[\frac{\text{Variance Between Groups}}{\text{Variance within Groups}}\sim F\]

\[ \underbrace{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2}_{\text{Total Sum of Squares}} = \underbrace{\sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2}_{\text{SSB (Between-group)}} + \underbrace{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2}_{\text{SSW (Within-group)}} \]
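The decomposition can be checked by hand. The sketch below computes the three sums of squares directly from the formula for the built-in `ToothGrowth` data (grouped by `dose`), then forms the F-value as the ratio of mean squares, \(\frac{SSB/(k-1)}{SSW/(N-k)}\), where \(N\) is the total number of observations. The variable names are illustrative.

```r
# One-way ANOVA by hand: sums of squares, mean squares, F and p-value.
tg <- ToothGrowth
y <- tg$len
g <- factor(tg$dose)

k <- nlevels(g)          # number of groups
n <- length(y)           # total number of observations
grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

ss_total <- sum((y - grand_mean)^2)                      # total sum of squares
ssb <- sum(group_sizes * (group_means - grand_mean)^2)   # between-group
ssw <- sum((y - ave(y, g))^2)                            # within-group

msb <- ssb / (k - 1)     # mean square between, df = k - 1
msw <- ssw / (n - k)     # mean square within,  df = n - k
f_value <- msb / msw
p_value <- pf(f_value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(c(SSB = ssb, SSW = ssw, Total = ss_total))
print(c(F = f_value, p = p_value))
```

The resulting sums of squares and F-value should agree with the `Sum Sq` and `F value` columns that `summary(aov(...))` reports for the same model.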

example:

# Built-in data (as.data.table() and := come from the data.table package)
library(data.table)
dt <- as.data.table(ToothGrowth)
dt[, dose := as.factor(dose)]

# One-way ANOVA
model <- aov(len ~ dose, data = dt)
summary(model)

            Df Sum Sq Mean Sq F value   Pr(>F)    
dose         2   2426    1213   67.42 9.53e-16 ***
Residuals   57   1026      18                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)

Two Way ANOVA

Two-way ANOVA tests whether there are statistically significant differences between group means when the groups are classified by two (or more) independent factors, and whether those factors interact.

Compound hypotheses:

There’s a symmetric matrix of things to check here:

- Main effect of first factor:
  - Null \(H_{0_{11}}\): all groups of the first factor have the same mean
  - Alternative \(H_{A_{11}}\): not all groups of the first factor have the same mean
- Main effect of second factor:
  - Null \(H_{0_{22}}\): all groups of the second factor have the same mean
  - Alternative \(H_{A_{22}}\): not all groups of the second factor have the same mean
- Interaction of first and second factors:
  - Null \(H_{0_{12}}\): no interaction between the two factors
  - Alternative \(H_{A_{12}}\): there is an interaction between the two factors

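These effects can be seen in an interaction plot: roughly parallel lines suggest no interaction, while lines that converge or cross suggest one.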
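An interaction can also be inspected numerically by tabulating the cell means. The sketch below does this for `ToothGrowth` with `dose` and `supp` as the two factors; if there were no interaction, the gap between the two supplement types would be roughly the same at every dose. The names `cell_means` and `supp_gap` are illustrative.

```r
# Cell means for a two-factor design: rows are doses, columns are supplements.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

cell_means <- with(tg, tapply(len, list(dose = dose, supp = supp), mean))
print(round(cell_means, 2))

# Difference between the supplement types (OJ minus VC) at each dose.
# A gap that changes across doses is what an interaction looks like.
supp_gap <- cell_means[, "OJ"] - cell_means[, "VC"]
print(round(supp_gap, 2))
```

This is only a descriptive check; whether a varying gap is statistically significant is what the interaction F-test decides.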

Requires/Expects:

single continuous/quantitative dependent variable \(y\) (aka metric variable)
multiple categorical/qualitative independent variables (factors) \(a,b,c\) (aka nominal variables)

\[y\sim f(a,b,c)\]

Assumptions:

independence: observations should be independent of one another, and the independent variables should be mutually independent (strictly no confounding, e.g. intelligence ~ age, shoe_size, where age and shoe size are correlated)
normality: data should be normally distributed within groups (testable using the Shapiro-Wilk test), or equivalently the residuals of the fitted model should be normally distributed (testable via a qqnorm + qqline plot). The mean squared error (MSE) of the residuals estimates the within-group variance, and the F-test relies on normality for valid inference.
homogeneity / homoscedasticity: each group should have the same variance (testable through the Levene test)

test statistic:

The total sum of squares of \(y\) decomposes into the sums of squares attributable to the independent variables \(A\) and \(B\), their interaction \(AB\), and the residual error. \[SS_{tot}=SS_A+SS_B+SS_{AB}+SS_{err}\]

Uses the F distribution/F-value: \[\frac{\text{Variance Between Groups}}{\text{Variance within Groups}}\sim F\]
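Each effect gets its own F-value: its mean square (sum of squares divided by degrees of freedom) over the residual mean square. The sketch below recomputes these ratios from the ANOVA table of a two-way model on `ToothGrowth` and checks that the sums of squares add up to the total. The variable names are illustrative; note that `aov` uses sequential sums of squares, which always add up to the total but depend on the order of terms when the design is unbalanced.

```r
# Recompute the two-way F-values from sums of squares and degrees of freedom.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

fit <- aov(len ~ dose * supp, data = tg)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame

ss <- tab[["Sum Sq"]]
df <- tab[["Df"]]
ms <- ss / df                     # mean square per row
ms_err <- ms[length(ms)]          # last row holds the residuals

# F-value per effect = effect mean square / residual mean square.
f_by_hand <- ms[-length(ms)] / ms_err
names(f_by_hand) <- trimws(rownames(tab))[-length(ms)]
print(round(f_by_hand, 3))

# SS_tot = SS_A + SS_B + SS_AB + SS_err
ss_total <- sum((tg$len - mean(tg$len))^2)
print(c(sum_of_parts = sum(ss), total = ss_total))
```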

Figure: two-way ANOVA feature space with principal components shown. The feature space has not been standardised, so the PCs don’t look mutually orthogonal because of the difference in scale.

example:

dt <- as.data.table(ToothGrowth)
dt[,c("supp","dose"):=lapply(.SD,as.factor), .SDcols=c("supp","dose")]

model_twoway <- aov(len ~ dose * supp, data = dt)
summary(model_twoway)

            Df Sum Sq Mean Sq F value   Pr(>F)    
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp         1  205.4   205.4  15.572 0.000231 ***
dose:supp    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# One line per supplement type (the trace factor has two levels)
interaction.plot(dt$dose, dt$supp, dt$len,
                 col = 1:2, lty = 1, lwd = 2,
                 ylab = "Odontoblast length", xlab = "Dose")
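The interaction hypothesis can also be framed as a comparison of two nested models: an additive model with main effects only, and the full model with the interaction term. The sketch below is one way to do this in base R, using illustrative names (`additive`, `full`, `cmp`); the F-value of the comparison should match the `dose:supp` row of the two-way table.

```r
# Test the interaction by comparing nested models.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

additive <- aov(len ~ dose + supp, data = tg)   # main effects only
full     <- aov(len ~ dose * supp, data = tg)   # main effects + interaction

# F-test for the extra terms in the full model (here: dose:supp).
cmp <- anova(additive, full)
print(cmp)
```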

ANOVA with Repeated Measures

TODO

Mixed Model ANOVA

TODO