Lecture 5 ANOVA

Eamonn Mallon
10/12/2019

One-way ANOVA

  • Analysis of variance
  • explanatory variables are called factors which have levels
    • sex could be a factor and has two levels (male and female)
  • usually more than two levels (with exactly two levels a t-test gives the same answer)
  • Why not just use multiple t-tests?
    • multiple testing
    • ANOVA does other cool things which we get into at the end of the lecture

The horror of multiple testing

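  • We say P < 0.05 is significant, but what does that mean?
  • If the null hypothesis were true, there would be less than a 1 in 20 probability of getting this result (or an even more extreme one) by random chance, so we reject the null hypothesis
  • Now imagine we were comparing A to B to C. That's three t-tests (AB, BC, AC), so the chance of at least one false positive is now roughly 3 in 20
  • A:B:C:D is six tests, roughly 6 in 20
  • A:B:C:D:E is ten tests, roughly 10 in 20: 0.05 is now close to 50:50
  • These figures are upper bounds from adding up the error rates; for m independent tests the exact figure is 1 - 0.95^m, which gives about 0.14, 0.26 and 0.40
  • R.A. Fisher saved us from this nightmare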
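
The "3 in 20" style figures can be checked with a few lines of R. This is a minimal sketch, assuming the pairwise tests are independent (a simplification, since pairwise t-tests share groups); the variable names are illustrative only.

```r
# Family-wise error rate for all pairwise t-tests among k groups,
# assuming (as a simplification) that the tests are independent.
alpha    <- 0.05
k        <- 3:5                       # A:B:C, A:B:C:D, A:B:C:D:E
n_tests  <- choose(k, 2)              # number of pairwise comparisons
additive <- n_tests * alpha           # the "3 in 20" style upper bound
fwer     <- 1 - (1 - alpha)^n_tests   # chance of at least one false positive

round(data.frame(k, n_tests, additive, fwer), 3)
```

The additive column is the rough figure used on the slide; the fwer column is the value under independence.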

How ANOVA works

  • ANOVA is used to compare means by comparing variance (how?)

A graphical example

oneway <- read.csv("~/Dropbox/Teaching/old_teaching/zipped/oneway.csv")
names(oneway)
[1] "ozone"  "garden"

A graphical example

Plot the ozone data in the order it was measured

plot (1:20, oneway$ozone, ylim =c(0,8), 
      ylab="y", xlab="order")

[Figure: ozone plotted in the order it was measured]

A graphical example

Calculate the residual of each point from the mean of y

[Figure: residuals of each point from the overall mean]

Total sum of squares

  • This overall variation is the total sum of squares SSY \[ SSY = \Sigma(y-\bar{y})^2 \]

A graphical example

Let's look at the data separated into its levels (garden)

[Figure: ozone data separated by garden]

A graphical example

The means are different, but are they significantly different?

[Figure: ozone data with the mean of each garden]

A graphical example

Calculate the residuals from the individual means

[Figure: residuals of each point from its garden mean]

The error sum of squares

  • This variation from the treatment means is the error sum of squares SSE \[ SSE = \Sigma_{j=1}^k\Sigma(y-\bar{y_j})^2 \]
  • SSY = SSE + SSA
  • SSA is the variation due to treatment
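
The partition SSY = SSE + SSA can be verified by computing all three terms separately. The sketch below uses illustrative values that are an assumption (they are not read from oneway.csv); they were chosen so that the sums of squares match those quoted in this lecture. SSA is computed directly as the variation of the garden means around the overall mean.

```r
# Illustrative values (an assumption, not the contents of oneway.csv),
# chosen to reproduce the sums of squares used in this lecture.
ozone  <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2,    # garden A
            5, 5, 6, 7, 4, 4, 3, 5, 6, 5)    # garden B
garden <- factor(rep(c("A", "B"), each = 10))

grand_mean  <- mean(ozone)
group_means <- tapply(ozone, garden, mean)    # one mean per garden
n_per_group <- tapply(ozone, garden, length)  # sample size per garden

ssy <- sum((ozone - grand_mean)^2)                      # total variation
sse <- sum((ozone - ave(ozone, garden))^2)              # around each garden's own mean
ssa <- sum(n_per_group * (group_means - grand_mean)^2)  # between garden means

c(SSY = ssy, SSE = sse, SSA = ssa, SSE_plus_SSA = sse + ssa)
```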

The logic of an ANOVA

  • imagine the means are not different
    • then the residuals would be the same as the previous graph (because the horizontal lines would not have moved) (SSE = SSY)
  • imagine now that the means are different (the amount of ozone in the two gardens is different)
    • We would predict that the residuals should be smaller when computed from the individual means (SSE) compared to the residuals computed from the overall mean (SSY)
  • We are back to signal versus noise
  • How do we turn that into a test?
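
The two cases in the list above can be compared numerically. This is a sketch on illustrative values (an assumption, not the contents of oneway.csv); the helper functions `ssy_of()` and `sse_of()` are hypothetical names introduced only for this example. Shifting garden B down by 2 makes the two garden means identical, and SSE then equals SSY.

```r
# Illustrative values (an assumption, not the contents of oneway.csv).
ozone  <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2,    # garden A, mean 3
            5, 5, 6, 7, 4, 4, 3, 5, 6, 5)    # garden B, mean 5
garden <- factor(rep(c("A", "B"), each = 10))

# Hypothetical helpers: residual sums of squares from the overall mean
# and from each group's own mean.
ssy_of <- function(y)    sum((y - mean(y))^2)
sse_of <- function(y, g) sum((y - ave(y, g))^2)

# Case 1: the garden means differ, so SSE is smaller than SSY.
different <- c(SSY = ssy_of(ozone), SSE = sse_of(ozone, garden))

# Case 2: shift garden B down by 2 so both means are 3; now SSE equals SSY.
ozone_same <- ozone - ifelse(garden == "B", 2, 0)
same <- c(SSY = ssy_of(ozone_same), SSE = sse_of(ozone_same, garden))

rbind(different, same)
```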

An ANOVA, sort of by hand

Calculate SSY \[ SSY = \Sigma(y-\bar{y})^2 \]

sum((oneway$ozone-mean(oneway$ozone))^2)
[1] 44

An ANOVA, sort of by hand

Calculate SSE \[ SSE = \Sigma_{j=1}^k\Sigma(y-\bar{y_j})^2 \]

sum((oneway$ozone[oneway$garden=="A"]-mean(oneway$ozone[oneway$garden=="A"]))^2)
[1] 12
sum((oneway$ozone[oneway$garden=="B"]-mean(oneway$ozone[oneway$garden=="B"]))^2)
[1] 12

An ANOVA, sort of by hand

SSE = 12 + 12 = 24, so SSA = SSY - SSE = 44 - 24 = 20 (because SSY = SSE + SSA)

An ANOVA table

Source   Sum of squares   Degrees of freedom   Mean squares    F
Garden   20               1                    20              15
Error    24               18                   s^2 = 1.3333
Total    44               19

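  • Degrees of freedom (n-p): the number of values minus the number of parameters estimated
    • Garden: 2 levels, 1 parameter (the overall mean), therefore 2-1 = 1
    • Error: 20 samples, 2 parameters (the two garden means; look at the equation), therefore 20-2 = 18
    • Total: add up the other two (1 + 18 = 19, which is also n-1)
  • Mean squares (the mean squared deviation from lecture 2) = SS/df
  • F = Mean squares (treatment) / Mean squares (error) = 20/1.333 = 15 [Think signal over noise]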
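
The table arithmetic can be reproduced step by step. A minimal sketch starting from the sums of squares computed earlier in the lecture (SSY = 44, SSE = 24); the variable names are illustrative.

```r
# Rebuild the ANOVA table from the sums of squares in the lecture.
ssy <- 44      # total sum of squares
sse <- 24      # error sum of squares
n   <- 20      # number of observations
k   <- 2       # number of garden levels

ssa       <- ssy - sse        # treatment sum of squares
df_garden <- k - 1            # 1
df_error  <- n - k            # 18
ms_garden <- ssa / df_garden  # treatment mean square
ms_error  <- sse / df_error   # error mean square (s^2)
f_ratio   <- ms_garden / ms_error

data.frame(source = c("Garden", "Error", "Total"),
           SS     = c(ssa, sse, ssy),
           df     = c(df_garden, df_error, df_garden + df_error),
           MS     = c(ms_garden, ms_error, NA),
           F      = c(f_ratio, NA, NA))
```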

F = 15, what does that mean?!!!!!

The p prefix, as in pf(), gives the cumulative probability of a distribution (here the F distribution). Subtracting it from 1 gives the upper tail, which is the p-value

1-pf(15,1,18)
[1] 0.001114539

So the probability of obtaining data as extreme as ours (or more extreme) if the two means were really the same is about 0.1%
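
An equivalent way to get the same tail probability, plus the critical value F has to exceed at the 5% level, is sketched below. Both `pf()` and `qf()` are base R; the variable names are illustrative.

```r
# Upper-tail probability of F = 15 on 1 and 18 degrees of freedom.
f_obs   <- 15
p_upper <- pf(f_obs, df1 = 1, df2 = 18, lower.tail = FALSE)  # same as 1 - pf(15, 1, 18)

# The q prefix goes the other way: the F value that cuts off the top 5%.
f_crit <- qf(0.95, df1 = 1, df2 = 18)

c(p_value = p_upper, critical_F = f_crit, observed_F = f_obs)
```

The observed F is well beyond the critical value, which is why the p-value is so small.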

An R command to do all this automatically

model_ozone<-lm(oneway$ozone~oneway$garden) #Creates the linear model
ozone_anova<-aov(model_ozone) #Creates the anova from the linear model
summary(ozone_anova) #Outputs an ANOVA table
              Df Sum Sq Mean Sq F value  Pr(>F)   
oneway$garden  1     20  20.000      15 0.00111 **
Residuals     18     24   1.333                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The gardens differ in their ozone level (One-way ANOVA: \( F_{1,18} \) = 15.0, p = 0.0011). This is the correct way to report it
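
Because garden has only two levels, this ANOVA is equivalent to a two-sample t-test with equal variances: F is the square of t and the p-values agree. The sketch below shows this on illustrative values (an assumption, not the contents of oneway.csv) chosen to give the same sums of squares as the lecture; `oneway_demo` is a hypothetical name.

```r
# Illustrative values (an assumption, not the contents of oneway.csv).
oneway_demo <- data.frame(
  ozone  = c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2,    # garden A
             5, 5, 6, 7, 4, 4, 3, 5, 6, 5),   # garden B
  garden = factor(rep(c("A", "B"), each = 10))
)

# Classical two-sample t-test (pooled variance).
tt <- t.test(ozone ~ garden, data = oneway_demo, var.equal = TRUE)

# One-way ANOVA on the same data.
tab <- anova(lm(ozone ~ garden, data = oneway_demo))

c(t_squared = unname(tt$statistic)^2,
  F_value   = tab[["F value"]][1],
  p_t_test  = tt$p.value,
  p_anova   = tab[["Pr(>F)"]][1])
```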

Does feed type affect weight in chickens (4 treatments)

[Figure: chick weight for each of the four diets]

Does feed type affect weight in chickens (4 treatments)

model_weight<-lm(weight~Diet, data = ChickWeight)
chick_anova<-aov(model_weight)
summary(chick_anova)
             Df  Sum Sq Mean Sq F value   Pr(>F)    
Diet          3  155863   51954   10.81 6.43e-07 ***
Residuals   574 2758693    4806                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Does feed type affect weight in chickens (4 treatments)

Diet type affects the weight of chickens (One-way ANOVA: \( F_{3,574} \) = 10.81, p = \( 6.433 \times 10^{-7} \))

Great, which diet is best? Eh, ANOVA doesn't tell you; it just says diet has an effect

Tukey's post hoc test

TukeyHSD(chick_anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = model_weight)

$Diet
         diff         lwr      upr     p adj
2-1 19.971212  -0.2998092 40.24223 0.0552271
3-1 40.304545  20.0335241 60.57557 0.0000025
4-1 32.617257  12.2353820 52.99913 0.0002501
3-2 20.333333  -2.7268370 43.39350 0.1058474
4-2 12.646045 -10.5116315 35.80372 0.4954239
4-3 -7.687288 -30.8449649 15.47039 0.8277810

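Think of them as legitimate t-tests: each row is a pairwise comparison with a p-value adjusted for multiple testing. Here diets 3 and 4 differ significantly from diet 1; no other pair differs at the 0.05 level.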
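
The diff column in the Tukey output is simply the difference between two diet means. A quick sketch of how to look at the means themselves, using the built-in ChickWeight data; `diet_means` is an illustrative name.

```r
# Mean weight for each diet in the built-in ChickWeight data.
diet_means <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)
diet_means

# The "2-1" row of the Tukey table: mean of diet 2 minus mean of diet 1.
diet_means[["2"]] - diet_means[["1"]]

# The diet with the highest mean weight.
names(which.max(diet_means))
```

The means say which diet is highest; the adjusted p-values say which of those differences are convincing.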

Assumptions of an ANOVA

  • Independence of observations.
  • Normality: the residuals are normally distributed. (ANOVA is fairly robust to departures from this.)
  • Homoscedasticity: the variance of the data should be the same in every group.

Assumptions of an ANOVA

library("ggfortify")
autoplot(model_weight)

[Figure: diagnostic plots for model_weight produced by autoplot()]
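
Diagnostic plots are the main tool here. As a numeric complement (not part of the lecture's own workflow), base R offers formal tests of normality and equal variance. This is a sketch only, and with large samples these tests flag even small departures, so read them alongside the plots.

```r
# Refit the model so this block runs on its own.
model_weight <- lm(weight ~ Diet, data = ChickWeight)

# Normality of the residuals (Shapiro-Wilk test).
normality_check <- shapiro.test(residuals(model_weight))

# Equal variances across the four diets (Bartlett's test).
variance_check <- bartlett.test(weight ~ Diet, data = ChickWeight)

# Group variances, for an informal look at homoscedasticity.
diet_vars <- tapply(ChickWeight$weight, ChickWeight$Diet, var)

normality_check
variance_check
diet_vars
```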

Non-parametric options

  • Kruskal–Wallis one-way ANOVA on ranks
  • Dunn's test (post-hoc test)
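
A minimal sketch of the Kruskal–Wallis test on the chicken data, using `kruskal.test()` from base R. Dunn's test is not in base R; add-on packages provide it, and it is not shown here.

```r
# Kruskal-Wallis rank-based test of weight across the four diets.
kw <- kruskal.test(weight ~ Diet, data = ChickWeight)

kw$statistic   # Kruskal-Wallis chi-squared
kw$parameter   # degrees of freedom: number of diets minus 1
kw$p.value     # p-value from the chi-squared approximation
```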

ANOVA is much more

  • What if you are interested in two (or more) factors?
  • It would be cool to know if these factors interact
  • ANOVA (repeated, nested, multiway) can do this and more by partitioning out the variance just like in the one-way example.
  • Imagine you are looking at the effect of two drugs. You measure men and women.
    • ANOVA can remove the variation due to sex (if it's uninteresting), statistically allowing you to act like you controlled for sex experimentally
    • And/or it can check the interaction between drug and sex, letting you say which drug is better for men and which is better for women.
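
A two-factor ANOVA with an interaction looks like this in R. The sketch uses the built-in ToothGrowth data as a stand-in (an assumption for illustration: supplement type and dose play the roles of drug and sex; this is not the drug example itself). In the formula, `supp * dose` expands to both main effects plus their interaction.

```r
# Built-in data: tooth length by supplement type (supp) and dose.
# Treat dose as a factor so it has levels, like the factors above.
tooth <- transform(ToothGrowth, dose = factor(dose))

# supp * dose = supp + dose + supp:dose (the interaction term).
tooth_anova <- aov(len ~ supp * dose, data = tooth)
summary(tooth_anova)
```

Each row of the table partitions out the variance for one term, just as the Garden row did in the one-way example; the `supp:dose` row tests whether the effect of one factor depends on the level of the other.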