Lecture 9 The logic of an ANOVA

Eamonn Mallon
9/11/2020

Analysis of variance
explanatory variables are called factors which have levels
- sex could be a factor and has two levels (male and female)
usually more than two levels (with exactly two levels you could use a t-test instead)
Why not just use multiple t-tests?
- multiple testing
- ANOVA does other cool things which we get into at the end of the lecture

We say P < 0.05 is significant, but what does that mean?
If, when there is really no difference, there is less than a 1 in 20 probability of getting this result or an even more extreme one by random chance, we accept the difference as real
Now imagine we were comparing A to B to C. That's three t-tests (AB, BC, AC), so the chance of at least one false positive is now up to 3 in 20
A:B:C:D is six tests, up to 6 in 20
A:B:C:D:E is ten tests, up to 10 in 20; 0.05 is now approaching 50:50
R.A. Fisher saved us from this nightmare
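
The "3 in 20, 6 in 20, 10 in 20" figures above are upper bounds (number of tests times 0.05). As a rough sketch, and assuming the tests are independent (pairwise t-tests that share groups are not strictly independent), the chance of at least one false positive can be computed as follows. The variable names are illustrative only.

```r
# Chance of at least one false positive across all pairwise t-tests,
# ASSUMING independent tests, each run at alpha = 0.05.
alpha   <- 0.05
groups  <- 3:5                        # A:B:C, A:B:C:D, A:B:C:D:E
n_tests <- choose(groups, 2)          # number of pairwise comparisons

upper_bound <- n_tests * alpha        # the "n in 20" figures
familywise  <- 1 - (1 - alpha)^n_tests

print(data.frame(groups, n_tests, upper_bound,
                 familywise = round(familywise, 3)))
```

Under that independence assumption the error rate grows a little more slowly than the simple "n in 20" count, but the conclusion is the same: with five groups, a "significant" result somewhere among the comparisons is no longer rare.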

oneway <- read.csv("~/Dropbox/Teaching/old_teaching/zipped/oneway.csv")
names(oneway)

[1] "ozone"  "garden"

Plot the ozone data in the order it was measured

plot(1:20, oneway$ozone, ylim = c(0, 8),
     ylab = "y", xlab = "order")

[Plot: ozone (y) against measurement order]

Calculate the residual of each point from the mean of y

[Plot: residuals of each point from the overall mean]

This overall variation is the total sum of squares, SSY: \[ SSY = \sum (y - \bar{y})^2 \]
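
A minimal sketch of the SSY calculation in R. The values of `y` are invented for illustration and are not the `oneway.csv` ozone data.

```r
# Invented ozone-like values, NOT the lecture's oneway.csv data
y <- c(3, 4, 4, 3, 2, 5, 6, 7, 5, 6)

residuals_total <- y - mean(y)    # distance of each point from the overall mean
SSY <- sum(residuals_total^2)     # total sum of squares
SSY

# Cross-check: the sample variance is SSY divided by (n - 1)
var(y) * (length(y) - 1)
```

With the real data, `y` would simply be replaced by `oneway$ozone`.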

Let's look at the data separated into its levels (garden)

[Plot: ozone separated by garden]

The means are different, but are they significantly different?

[Plot: the garden means]

Calculate the residuals from the individual means

[Plot: residuals of each point from its own garden mean]

This variation from the treatment means is the error sum of squares, SSE: \[ SSE = \sum_{j=1}^{k} \sum (y - \bar{y}_j)^2 \]
SSY = SSE + SSA
SSA is the variation due to treatment
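
The decomposition SSY = SSE + SSA can be checked numerically. This is a sketch on invented values for two gardens, not the `oneway.csv` data; the object names are illustrative.

```r
# Invented values for two gardens, NOT the lecture's oneway.csv data
y      <- c(3, 4, 4, 3, 2, 5, 6, 7, 5, 6)
garden <- factor(rep(c("A", "B"), each = 5))

SSY <- sum((y - mean(y))^2)        # residuals from the overall mean

group_means <- ave(y, garden)      # each point's own garden mean
SSE <- sum((y - group_means)^2)    # residuals from the garden means

SSA <- SSY - SSE                   # variation due to treatment

# SSA can also be computed directly from the garden means
n_j        <- tapply(y, garden, length)
SSA_direct <- sum(n_j * (tapply(y, garden, mean) - mean(y))^2)

c(SSY = SSY, SSE = SSE, SSA = SSA, SSA_direct = SSA_direct)
```

The two routes to SSA agree, which is the sense in which the total variation splits into a treatment part and an error part.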

Imagine the means are not different
- then the residuals from the individual means would be the same as the residuals from the overall mean in the earlier graph, because the horizontal lines would not have moved (SSE = SSY)
Now imagine that the means are different (the amount of ozone in the two gardens is different)
- we would predict that the residuals should be smaller when computed from the individual means (SSE) than when computed from the overall mean (SSY)
We are back to signal versus noise (SSA vs SSE)
How do we do that in a test?
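
Before turning to the test itself, the two scenarios above can be checked numerically. This is a sketch on invented data: the same noise is added to two gardens, once with identical means and once with shifted means. The helper `ss()` is hypothetical and written only for this illustration.

```r
# Invented data: identical noise, with and without a shift between gardens
garden <- factor(rep(c("A", "B"), each = 5))
noise  <- c(-1, 0, 1, 0, 0, 1, -1, 0, 0, 0)   # sums to zero within each garden

# Hypothetical helper: total and error sums of squares for response y, factor g
ss <- function(y, g) {
  c(SSY = sum((y - mean(y))^2),
    SSE = sum((y - ave(y, g))^2))
}

same_means <- 4 + noise                             # both gardens centred on 4
diff_means <- ifelse(garden == "A", 3, 5) + noise   # gardens centred on 3 and 5

ss(same_means, garden)   # the garden lines sit on the overall line
ss(diff_means, garden)   # the garden lines have moved apart
```

In the first case the two sums of squares coincide; in the second, SSE stays the same while SSY grows, and the gap between them is SSA.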