ANOVA (Analysis of variance)

Eamonn Mallon

2025-01-30

What are you going to learn today

What is an ANOVA for
The logic of an ANOVA
How to do an ANOVA by hand
Getting R to do it
The assumptions of an ANOVA
Its non-parametric equivalents

What is ANOVA?

One-way ANOVA

Analysis of variance
explanatory variables are called factors which have levels
- sex could be a factor and has two levels (male and female)
usually more than two levels (with exactly two levels you could use a t-test instead)
Why not just use multiple t-tests
- multiple testing
- ANOVA does other cool things which we get into at the end of the lecture

The horror of multiple testing

We say p < 0.05 is significant, but what does that mean?
If there is less than a 1 in 20 probability of getting this result, or an even more extreme one, by random chance, we accept the effect as real
Now imagine we were comparing A to B to C. That is three t-tests (AB, BC, AC), so the chance of at least one false positive is now up to 3 in 20
A:B:C:D is six tests, so up to 6 in 20
A:B:C:D:E is ten tests, so up to 10 in 20; "significant at 0.05" is now close to 50:50
R.A. Fisher saves us from this nightmare
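
The "3 in 20" style figures above simply add up the error rates, which gives an upper bound. A minimal sketch of a slightly more careful calculation follows, assuming (unrealistically) that the pairwise tests are independent; the variable names are made up for illustration.

```r
# Chance of at least one false positive when every pairwise t-test
# is run at alpha = 0.05.
# Assumption: the tests are treated as independent. Pairwise tests that
# share groups are not independent, so these are rough figures only.
alpha <- 0.05
k <- 3:5                                # number of groups (A:B:C up to A:B:C:D:E)
n_tests <- choose(k, 2)                 # number of pairwise comparisons
familywise <- 1 - (1 - alpha)^n_tests   # P(at least one false positive)

data.frame(groups = k, tests = n_tests, familywise = round(familywise, 3))
```

Under this assumption the error rate is a little lower than the simple sum, but it still climbs quickly as groups are added.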

How ANOVA works

ANOVA is used to compare means by comparing variance (how?)

A graphical example

oneway <- read.csv("~/Dropbox/Teaching/old_teaching/zipped/oneway.csv")
names(oneway)

[1] "ozone"  "garden"

A graphical example

Plot the ozone data in the order it was measured

plot(1:20, oneway$ozone, ylim = c(0, 8),
     ylab = "y", xlab = "order")

A graphical example

Calculate the residual of each point from the mean of y

Total sum of squares

This overall variation is the total sum of squares SSY \[ SSY = \Sigma(y-\bar{y})^2 \]

A graphical example

Let's look at the data separated into its levels (garden)

A graphical example

The means are different but are they significantly different?

A graphical example

Calculate the residuals from the individual means

The error sum of squares

This variation from the treatment means is the error sum of squares SSE \[ SSE = \Sigma_{j=1}^k\Sigma(y-\bar{y_j})^2 \]
SSY = SSE + SSA
SSA is the variation due to treatment

The logic of an Anova

imagine the means are not different
- then the residuals would be the same as the previous graph (because the horizontal lines would not have moved) (SSE = SSY)
imagine now that the means are different (the amount of ozone in the two gardens is different)
- We would predict that the residuals should be smaller when computed from the individual means (SSE) compared to the residuals computed from the overall mean (SSY)
We are back to signal versus noise (SSA vs SSE)
How do we do that in a test?

Doing an ANOVA “by hand”

An Anova sort of by hand

Calculate SSY \[ SSY = \Sigma(y-\bar{y})^2 \]

oneway <- read.csv("~/Dropbox/Teaching/old_teaching/zipped/oneway.csv")
sum((oneway$ozone-mean(oneway$ozone))^2)

[1] 44

An Anova sort of by hand

Calculate SSE \[ SSE = \Sigma_{j=1}^k\Sigma(y-\bar{y_j})^2 \]

sum((oneway$ozone[oneway$garden=="A"]-mean(oneway$ozone[oneway$garden=="A"]))^2)

[1] 12

sum((oneway$ozone[oneway$garden=="B"]-mean(oneway$ozone[oneway$garden=="B"]))^2)

[1] 12

An Anova sort of by hand

So SSA = 44 - 24 = 20 (SSY = SSE + SSA)
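
SSA can also be computed directly, as the squared deviation of each point's garden mean from the overall mean, rather than by subtraction. The sketch below checks that both routes agree. The data are illustrative values chosen to give the same sums of squares as above; they are not read from oneway.csv, and the variable names are assumptions.

```r
# Illustrative data only: chosen so that SSY = 44 and SSE = 24, as on the slides.
ozone  <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2,
            5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
garden <- factor(rep(c("A", "B"), each = 10))

grand_mean  <- mean(ozone)
group_means <- ave(ozone, garden)          # each point's own garden mean

SSY <- sum((ozone - grand_mean)^2)         # total variation
SSE <- sum((ozone - group_means)^2)        # variation within gardens
SSA <- sum((group_means - grand_mean)^2)   # variation between gardens

c(SSY = SSY, SSE = SSE, SSA = SSA)
all.equal(SSY, SSE + SSA)                  # the partition holds
```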

An Anova table

Source	Sum of squares	Degrees of freedom	Mean squares	F
Garden	20	1	20	15
Error	24	18	s^2 = 1.3333
Total	44	19

Degrees of freedom (n-p)
- Garden: 2 levels, 1 parameter, therefore 2-1
- Error: 20 samples, 2 parameters (look at the equation). 20-2
- Total: Add up the other two
Mean squares (Mean squared deviation - lecture 2) = SS/df
F = Mean squares (treatment) / Mean squares (error) = 20/1.333 [Think signal over noise]
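
The arithmetic in the table can be reproduced in a few lines. This is only a sketch of the calculation; it starts from the sums of squares already worked out above, and the variable names are made up for illustration.

```r
# Sums of squares and sample sizes taken from the ANOVA table above
SSA <- 20          # treatment (garden) sum of squares
SSE <- 24          # error sum of squares
n   <- 20          # total number of samples
k   <- 2           # number of levels of garden

df_A <- k - 1      # treatment degrees of freedom
df_E <- n - k      # error degrees of freedom

MSA <- SSA / df_A  # mean square for treatment (signal)
MSE <- SSE / df_E  # mean square for error (noise), s^2
F_ratio <- MSA / MSE

c(df_A = df_A, df_E = df_E, MSA = MSA, MSE = MSE, F = F_ratio)
```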

F = 15, what does that mean?!!!!!

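The p prefix, as in pf(), gives the cumulative probability up to a value for a probability distribution (here the F distribution). Subtracting it from 1 gives the upper tail, which is the p-value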

1-pf(15,1,18)

[1] 0.001114539

So the probability of obtaining data as extreme as ours (or more extreme) if the two means were really the same is about 0.1%
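
As a small complement, the same p-value can be requested directly as an upper tail, and the q prefix gives the critical value of F that a result must exceed to be significant at 0.05. This is a sketch using only base R functions; the variable names are illustrative.

```r
F_ratio <- 15   # F from the ANOVA table, on 1 and 18 degrees of freedom

# Upper-tail probability requested directly, equivalent to 1 - pf(15, 1, 18)
p_upper <- pf(F_ratio, df1 = 1, df2 = 18, lower.tail = FALSE)
all.equal(p_upper, 1 - pf(F_ratio, 1, 18))

# Critical value: the F that cuts off the top 5% of the distribution
crit <- qf(0.95, df1 = 1, df2 = 18)
F_ratio > crit   # our F lies beyond the critical value
```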

Doing an ANOVA in R

An R command to do all this automatically

oneway <- read.csv("~/Dropbox/Teaching/old_teaching/zipped/oneway.csv") # Just getting the data in
model_ozone <- lm(oneway$ozone ~ oneway$garden) # Creates the linear model
ozone_anova <- aov(model_ozone) # Creates the ANOVA from the linear model
summary(ozone_anova) # Outputs an ANOVA table

              Df Sum Sq Mean Sq F value  Pr(>F)   
oneway$garden  1     20  20.000      15 0.00111 **
Residuals     18     24   1.333                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The gardens differ in their ozone level (One-way ANOVA: \(F_{1,18}\) = 15.0, p = 0.0011). This is the correct way to report it

Does feed type affect weight in chickens (4 treatments)

model_weight <- lm(weight ~ Diet, data = ChickWeight)
chick_anova <- aov(model_weight)
summary(chick_anova)

             Df  Sum Sq Mean Sq F value   Pr(>F)    
Diet          3  155863   51954   10.81 6.43e-07 ***
Residuals   574 2758693    4806                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Does feed type affect weight in chickens (4 treatments)

Diet type affects the weight of chickens (One-way ANOVA: \(F_{3,574}\) = 10.81, p = \(6.433 \times 10^{-7}\))

Great, which diet is best? Eh ANOVA doesn’t tell you, it just says diet has an effect

Tukey’s post hoc test

TukeyHSD(chick_anova)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = model_weight)

$Diet
         diff         lwr      upr     p adj
2-1 19.971212  -0.2998092 40.24223 0.0552271
3-1 40.304545  20.0335241 60.57557 0.0000025
4-1 32.617257  12.2353820 52.99913 0.0002501
3-2 20.333333  -2.7268370 43.39350 0.1058474
4-2 12.646045 -10.5116315 35.80372 0.4954239
4-3 -7.687288 -30.8449649 15.47039 0.8277810

Think of them like legit t-tests: each row is one pairwise comparison, with its p-value adjusted for the number of comparisons made.
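
With many groups the Tukey table gets long, so it can help to pull out only the pairs that differ. A minimal sketch, using the built-in ChickWeight data and fitting the model directly with a formula; the object names are illustrative.

```r
# Fit the one-way ANOVA and run Tukey's test
chick_anova <- aov(weight ~ Diet, data = ChickWeight)
tukey <- TukeyHSD(chick_anova)$Diet   # matrix: diff, lwr, upr, p adj

# Keep only the pairwise comparisons with an adjusted p-value below 0.05
sig <- rownames(tukey)[tukey[, "p adj"] < 0.05]
sig
```

Here only diets 3 and 4 differ from diet 1; no other pair is significantly different.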

Assumptions of an ANOVA

Independence of observations.
Normality: the residuals are normally distributed. (ANOVA is fairly robust to departures from this.)
Homoscedasticity: the variance of the data should be the same in each group.

Assumptions of an ANOVA

library("ggfortify")
autoplot(model_weight)
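
The diagnostic plots are the main check. If a formal test is also wanted, base R offers the Shapiro-Wilk test for normality and Bartlett's test for equal variances. These are not part of the lecture's workflow; the sketch below is one possible complement, with illustrative object names.

```r
model_weight <- lm(weight ~ Diet, data = ChickWeight)

# Normality of the residuals (null hypothesis: residuals are normal)
normality_check <- shapiro.test(residuals(model_weight))
normality_check

# Homoscedasticity (null hypothesis: all groups have the same variance)
variance_check <- bartlett.test(weight ~ Diet, data = ChickWeight)
variance_check
```

With large samples these tests flag even small departures, so they are best read alongside the plots rather than instead of them.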

Non-parametric options

Kruskal–Wallis one-way ANOVA on ranks
Dunn’s test (post-hoc test)
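
A minimal sketch of the Kruskal–Wallis test on the chicken data, using kruskal.test() from base R. Dunn's test is not in base R and needs an add-on package, so it is not shown here.

```r
# Kruskal-Wallis test: does weight differ among the four diets?
# Works on ranks, so it does not assume normal residuals.
kw <- kruskal.test(weight ~ Diet, data = ChickWeight)
kw

kw$parameter   # degrees of freedom: number of diets minus one
```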

ANOVA is much more

What if you are interested in two (or more) factors?
It would be cool to know if these factors interact
ANOVA (repeated, nested, multiway) can do this and more by partitioning out the variance just like in the one-way example.
Imagine you are looking at the effect of two drugs. You measure men and women.
- ANOVA can remove the variation due to sex (if it's uninteresting), statistically allowing you to act like you controlled for sex experimentally
- And/or it can check the interaction between drug and sex, letting you say which drug is better for men and which is better for women.
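
A two-factor ANOVA with an interaction looks like this in R. There is no drug and sex dataset in this lecture, so the sketch uses the built-in warpbreaks data as a stand-in: wool type and tension play the roles of the two factors.

```r
# Two-way ANOVA: wool * tension expands to wool + tension + wool:tension
two_way <- aov(breaks ~ wool * tension, data = warpbreaks)
summary(two_way)

# One row per source of variation: each factor, their interaction, residuals
two_way_df <- summary(two_way)[[1]]$Df
two_way_df
```

The wool:tension row is the interaction: it asks whether the effect of one factor depends on the level of the other.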

What have you learned today

What is an ANOVA for
The logic of an ANOVA
How to do an ANOVA by hand
Getting R to do it
The assumptions of an ANOVA
Its non-parametric equivalents