Lecture 9 The logic of an ANOVA

Eamonn Mallon
9/11/2020

One-way ANOVA

  • Analysis of variance
  • Explanatory variables are called factors, and each factor has levels
    • Sex could be a factor with two levels (male and female)
  • Typically used when a factor has more than two levels (with exactly two you could use a t-test)
  • Why not just use multiple t-tests?
    • Multiple testing inflates the chance of a false positive
    • ANOVA does other cool things, which we get into at the end of the lecture

The horror of multiple testing

  • We say P < 0.05 is significant, but what does that mean?
  • If there is no real effect, there is less than a 1 in 20 probability of getting this result, or an even more extreme one, by random chance, so we treat the effect as real
  • Now imagine we were comparing A to B to C: that's three t-tests (AB, BC, AC), so the chance of at least one false positive is now up to 3 in 20
  • A:B:C:D needs six tests, so up to 6 in 20
  • A:B:C:D:E needs ten tests, so up to 10 in 20; 0.05 is now close to 50:50
  • R.A. Fisher saves us from this nightmare
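
The arithmetic on this slide can be sketched in a few lines of R. This is an illustrative calculation, not part of the original lecture code: the variable names are made up here, and the second figure assumes the tests are independent, which pairwise t-tests sharing groups are not exactly.

```r
# Family-wise error rate when every pair of k groups gets its own t-test.
alpha <- 0.05
k <- 3:5                      # number of groups: A:B:C, A:B:C:D, A:B:C:D:E
m <- choose(k, 2)             # number of pairwise t-tests: 3, 6, 10

# Simple upper bound: count alpha once per test (the "m in 20" figures).
fwer_bound <- pmin(1, m * alpha)

# Assumption: if the m tests were independent, the chance of at least one
# false positive would be 1 - (1 - alpha)^m.
fwer_indep <- 1 - (1 - alpha)^m

print(data.frame(groups = k, tests = m,
                 bound = fwer_bound, independent = round(fwer_indep, 3)))
```

Under the independence assumption, ten tests give a false-positive risk of about 0.40 rather than 0.50, so the "m in 20" figures are best read as upper bounds.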

How ANOVA works

  • ANOVA is used to compare means by comparing variance (how?)

A graphical example

oneway <- read.csv("~/Dropbox/Teaching/old_teaching/zipped/oneway.csv")
names(oneway)
[1] "ozone"  "garden"

A graphical example

Plot the ozone data in the order it was measured

plot(1:20, oneway$ozone, ylim = c(0, 8),
     ylab = "y", xlab = "order")

A graphical example

Calculate the residual of each point from the overall mean of y.

Total sum of squares

  • This overall variation is the total sum of squares SSY \[ SSY = \sum (y - \bar{y})^2 \]
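
As a minimal sketch of the SSY formula, the calculation can be done by hand in R. The readings below are made up for illustration and are not the values in oneway.csv.

```r
# Hypothetical ozone readings (made up for illustration, not oneway.csv).
y <- c(2, 3, 4, 3, 3, 5, 6, 4, 5, 5)

y_bar <- mean(y)              # overall mean
res_total <- y - y_bar        # residual of each point from the overall mean
ssy <- sum(res_total^2)       # total sum of squares, SSY
print(ssy)

# SSY is the sample variance multiplied by its degrees of freedom (n - 1).
print(all.equal(ssy, var(y) * (length(y) - 1)))
```

For these made-up values the overall mean is 4 and SSY is 14.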

A graphical example

Let's look at the data separated by the levels of the factor (garden).

A graphical example

The means are different, but are they significantly different?

A graphical example

Calculate the residuals from the individual garden means.

The error sum of squares

  • This variation around the treatment means is the error sum of squares SSE \[ SSE = \sum_{j=1}^{k} \sum (y - \bar{y}_j)^2 \]
  • SSY = SSE + SSA
  • SSA is the variation due to treatment
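
The decomposition SSY = SSE + SSA can be checked numerically. The sketch below uses a made-up data frame called `toy` (not the lecture's oneway.csv) with the same column names, and cross-checks the hand calculation against the sums of squares that R's own `anova()` reports for a linear model.

```r
# Hypothetical data (made up for illustration, not oneway.csv).
toy <- data.frame(
  ozone  = c(2, 3, 4, 3, 3, 5, 6, 4, 5, 5),
  garden = factor(rep(c("A", "B"), each = 5))
)

# Total sum of squares: residuals from the overall mean.
ssy <- sum((toy$ozone - mean(toy$ozone))^2)

# Error sum of squares: residuals from each garden's own mean.
garden_mean <- ave(toy$ozone, toy$garden)   # ave() uses the group mean by default
sse <- sum((toy$ozone - garden_mean)^2)

# Treatment sum of squares, by subtraction.
ssa <- ssy - sse
print(c(SSY = ssy, SSE = sse, SSA = ssa))

# Cross-check: the "Sum Sq" column of the ANOVA table holds SSA, then SSE.
fit_ss <- anova(lm(ozone ~ garden, data = toy))[["Sum Sq"]]
print(fit_ss)
```

With these made-up values SSY is 14, SSE is 4 and SSA is 10, and the two entries of `fit_ss` match SSA and SSE.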

The logic of an ANOVA

  • Imagine the means are not different
    • Then the residuals would be the same as in the earlier graph of residuals from the overall mean, because the horizontal lines would not have moved (SSE = SSY)
  • Now imagine that the means are different (the amount of ozone in the two gardens is different)
    • We would predict that the residuals computed from the individual means (SSE) should be smaller than the residuals computed from the overall mean (SSY)
  • We are back to signal versus noise (SSA vs SSE)
  • How do we do that in a test?
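
The two scenarios above can be illustrated with a small sketch. The helper name `sums_of_squares` and the readings are hypothetical, chosen so that the second scenario is simply the first with one garden shifted upwards.

```r
# Helper (hypothetical name) returning the three sums of squares.
sums_of_squares <- function(y, g) {
  ssy <- sum((y - mean(y))^2)        # residuals from the overall mean
  sse <- sum((y - ave(y, g))^2)      # residuals from each group's own mean
  c(SSY = ssy, SSE = sse, SSA = ssy - sse)
}

garden <- factor(rep(c("A", "B"), each = 5))

# Scenario 1: made-up readings where both gardens have the same mean (3).
same_means <- c(2, 3, 4, 3, 3,  3, 4, 2, 3, 3)

# Scenario 2: the same readings, but garden B shifted up by 2 units.
diff_means <- same_means + ifelse(garden == "B", 2, 0)

no_effect <- sums_of_squares(same_means, garden)
effect    <- sums_of_squares(diff_means, garden)
print(rbind(no_effect, effect))
```

In the first scenario SSE equals SSY and SSA is zero. Shifting garden B leaves SSE unchanged (the scatter around each garden's mean is the same) while SSY grows, and the whole of that increase appears as SSA.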