The analysis of variance Part II

M. Drew LaMar
April 8, 2016

Class announcements

Homework due today
Project milestone is today, but remember - you don't need to turn anything in!
Let's talk about the exam…

Analysis of variance (for real)

Data: Suppose I have one categorical explanatory variable X with \( k > 2 \) levels, and a response variable Y.

Hypothesis test:

\[ \begin{eqnarray*} H_{0} & : & \mu_{1} = \mu_{2} = \cdots = \mu_{k}\\ H_{A} & : & \mathrm{At \ least \ one} \ \mu_{i} \ \mathrm{is \ different \ from \ the \ others} \end{eqnarray*} \]

Test statistic:

\[ F = \frac{\mathrm{group \ mean \ square}}{\mathrm{error \ mean \ square}} = \frac{\mathrm{MS}_{\mathrm{groups}}}{\mathrm{MS}_{\mathrm{error}}} \]

Analysis of variance (for real)

Definition: The group mean square (\( \mathrm{MS}_{\mathrm{groups}} \)) is proportional to the observed amount of variation among the group sample means [between-group variability].

Definition: The error mean square (\( \mathrm{MS}_{\mathrm{error}} \)) estimates the variance among subjects that belong to the same group [within-group variability].

Analysis of variance (for real)

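If \( H_{0} \) is true, then \( \mathrm{MS}_{\mathrm{groups}} \) and \( \mathrm{MS}_{\mathrm{error}} \) both estimate the same within-group variance, so \( F \) should be close to 1 (sampling error aside).

If \( H_{0} \) is false, then \( \mathrm{MS}_{\mathrm{groups}} \) tends to exceed \( \mathrm{MS}_{\mathrm{error}} \), and we expect \( F > 1 \).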
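A quick simulation can make the behavior of \( F \) under \( H_{0} \) concrete. The sketch below is not part of the original analysis: the group count, group size, seed, and number of replicates are all assumptions chosen for illustration (4 groups of 5, mirroring the layout of the bee example).

```r
set.seed(42)  # arbitrary seed, for reproducibility only
k <- 4        # number of groups (assumed)
n <- 5        # observations per group (assumed)
group <- factor(rep(seq_len(k), each = n))

# Simulate data for which H0 is true: every group has the same mean
sim_F <- replicate(2000, {
  y <- rnorm(k * n)
  anova(lm(y ~ group))[["F value"]][1]
})

mean(sim_F)      # should land near df_error / (df_error - 2) = 16 / 14
mean(sim_F > 1)  # fraction of simulated F-ratios above 1
```

Individual F-ratios scatter widely around 1 even when \( H_{0} \) holds, which is why the test compares the observed \( F \) against the \( F \)-distribution rather than against the single value 1.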

Analysis of variance (example)

Practice Problem #1

Singaravelan et al. (2005) set up feeding stations where bees were offered a choice between a control solution with 20% sucrose or a caffeinated solution with 20% sucrose plus some quantity of caffeine. Over the course of the experiment, four different concentrations of caffeine were provided: 50, 100, 150, and 200 ppm. The response variable was the difference between the amount of nectar consumed from the caffeine feeders and that removed from the control feeders at the same station (grams).

Analysis of variance (derivation)

Data: With \( i \) indexing the groups (\( i = 1, \ldots, k \)), the total variation splits into a between-group part and a within-group part:

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \]
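The individual terms are not spelled out here. Under the standard one-way ANOVA definitions (an assumption, though one that matches the R computations of `SS_groups` and `SS_error` later in these slides), with \( Y_{ij} \) the \( j \)-th observation in group \( i \), \( \bar{Y}_{i} \) and \( s_{i} \) the mean and standard deviation of group \( i \), \( n_{i} \) its sample size, and \( \bar{Y} \) the grand mean, they can be written as

\[ \begin{eqnarray*} \mathrm{SS}_{\mathrm{total}} & = & \sum_{i=1}^{k} \sum_{j=1}^{n_{i}} (Y_{ij} - \bar{Y})^{2} \\ \mathrm{SS}_{\mathrm{groups}} & = & \sum_{i=1}^{k} n_{i} (\bar{Y}_{i} - \bar{Y})^{2} \\ \mathrm{SS}_{\mathrm{error}} & = & \sum_{i=1}^{k} \sum_{j=1}^{n_{i}} (Y_{ij} - \bar{Y}_{i})^{2} = \sum_{i=1}^{k} (n_{i} - 1) s_{i}^{2} \end{eqnarray*} \]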

From sum-of-squares to mean squares

Definition: The group mean square is given by

\[ \mathrm{MS}_{\mathrm{groups}} = \frac{\mathrm{SS}_{\mathrm{groups}}}{df_{\mathrm{groups}}}, \] with \( df_{\mathrm{groups}} = k-1 \).

Definition: The error mean square is given by

\[ \mathrm{MS}_{\mathrm{error}} = \frac{\mathrm{SS}_{\mathrm{error}}}{df_{\mathrm{error}}}, \] with \( df_{\mathrm{error}} = \sum (n_{i}-1) = N-k \), where \( N \) is the total number of observations.

Analysis of variance (example)

strungOutBees <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter15/chap15q01HoneybeeCaffeine.csv")
strungOutBees$ppmCaffeine <- factor(strungOutBees$ppmCaffeine)
str(strungOutBees)

'data.frame':   20 obs. of  2 variables:
 $ ppmCaffeine                     : Factor w/ 4 levels "50","100","150",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ consumptionDifferenceFromControl: num  -0.4 0.01 0.65 0.24 0.34 -0.39 0.53 0.44 0.19 -0.08 ...

Discuss: Is this data tidy or messy?

Answer: Tidy! Each row is one observation and each column is one variable.

Analysis of variance (example)

stripchart(consumptionDifferenceFromControl ~ ppmCaffeine, data = strungOutBees, vertical = TRUE, method = "jitter", xlab="Caffeine (ppm)")

[Figure: strip chart of consumptionDifferenceFromControl by caffeine concentration (ppm)]

Analysis of variance (example)

Discuss: State the null and alternative hypotheses appropriate for this question.

Answer: \[ \begin{eqnarray*} H_{0} & : & \mu_{50} = \mu_{100} = \mu_{150} = \mu_{200} \\ H_{A} & : & \mathrm{At \ least \ one \ of \ the \ means \ is \ different} \end{eqnarray*} \]

Analysis of variance (example)

Shortcut using R

caffResults <- lm(consumptionDifferenceFromControl ~ ppmCaffeine, data=strungOutBees)
anova(caffResults)

Analysis of Variance Table

Response: consumptionDifferenceFromControl
            Df Sum Sq Mean Sq F value  Pr(>F)  
ppmCaffeine  3 1.1344 0.37814  4.1779 0.02308 *
Residuals   16 1.4482 0.09051                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
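The same table can also be obtained with `aov()`, which fits the same linear model. The sketch below uses a small made-up data frame (`toy`, with hypothetical values, not the bee data) so that it runs without downloading anything, and checks that both routes give the same F-ratio.

```r
# Hypothetical toy data: 3 groups of 4 observations (values invented for illustration)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  y = c(1.2, 0.8, 1.1, 0.9,
        1.9, 2.2, 2.0, 1.7,
        1.0, 1.4, 1.3, 0.9)
)

# Route 1: lm() followed by anova(), as in the slides
tab_lm <- anova(lm(y ~ group, data = toy))

# Route 2: aov() followed by summary(); the table is the first list element
tab_aov <- summary(aov(y ~ group, data = toy))[[1]]

# The two F-ratios should agree
all.equal(tab_lm[["F value"]][1], tab_aov[["F value"]][1])
```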

Analysis of variance (example)

Definition: The \( R^{2} \) value used in ANOVA is the “fraction of the variation explained by groups” and is given by

\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{groups}}}{\mathrm{SS}_{\mathrm{total}}}. \] Note: \( 0 \leq R^2 \leq 1 \).

beeAnovaSummary <- summary(caffResults)
beeAnovaSummary$r.squared

[1] 0.4392573
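As a cross-check on the definition, the same value can be recovered directly from the sums of squares. This sketch copies the two `Sum Sq` values from the slides (as printed, so rounded) and uses \( \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \); the variable names are illustrative only.

```r
# Sums of squares as printed in the slides (rounded)
SS_groups_tab <- 1.134415
SS_error_tab  <- 1.44816

# R^2 = SS_groups / SS_total, with SS_total = SS_groups + SS_error
R2 <- SS_groups_tab / (SS_groups_tab + SS_error_tab)
R2  # should agree with summary(caffResults)$r.squared up to rounding
```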

Analysis of variance (example)

Long way

Question: Calculate the following summary statistics for each group: \( n_{i} \), \( \bar{Y}_{i} \), and \( s_{i} \).

Analysis of variance (example)

library(dplyr)
(beeStats <- strungOutBees %>% group_by(ppmCaffeine) %>% summarise(n = n(), 
  mean = mean(consumptionDifferenceFromControl), 
  sd = sd(consumptionDifferenceFromControl)))

Source: local data frame [4 x 4]

  ppmCaffeine     n   mean        sd
       (fctr) (int)  (dbl)     (dbl)
1          50     5  0.008 0.2887386
2         100     5 -0.172 0.1694698
3         150     5  0.376 0.3093218
4         200     5  0.378 0.3927722

Analysis of variance (example)

Compute sum-of-squares

grandMean <- mean(strungOutBees$consumptionDifferenceFromControl)
(SS_groups <- sum(beeStats$n*(beeStats$mean - grandMean)^2))

[1] 1.134415

(SS_error <- sum((beeStats$n-1)*beeStats$sd^2))

[1] 1.44816
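One way to sanity-check these two formulas is to confirm that they add up to the total sum of squares. The sketch below does this on a small made-up data set (`toy`, hypothetical values) so that it is self-contained; the same check should carry over to `strungOutBees`.

```r
# Hypothetical toy data: 3 groups of 4 observations (values invented for illustration)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  y = c(1.2, 0.8, 1.1, 0.9,
        1.9, 2.2, 2.0, 1.7,
        1.0, 1.4, 1.3, 0.9)
)

toy_grand_mean <- mean(toy$y)
toy_n    <- tapply(toy$y, toy$group, length)  # n_i
toy_mean <- tapply(toy$y, toy$group, mean)    # group means
toy_var  <- tapply(toy$y, toy$group, var)     # s_i^2

toy_SS_groups <- sum(toy_n * (toy_mean - toy_grand_mean)^2)
toy_SS_error  <- sum((toy_n - 1) * toy_var)
toy_SS_total  <- sum((toy$y - toy_grand_mean)^2)

# The partition SS_total = SS_groups + SS_error should hold
all.equal(toy_SS_total, toy_SS_groups + toy_SS_error)
```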

Analysis of variance (example)

Compute degrees of freedom

(df_groups <- 4-1)

[1] 3

(df_error <- nrow(strungOutBees)-4)

[1] 16

Analysis of variance (example)

Compute mean squares

(MS_groups <- SS_groups/df_groups)

[1] 0.3781383

(MS_error <- SS_error/df_error)

[1] 0.09051

Analysis of variance (example)

Compute \( F \)-statistic and \( P \)-value

(F_ratio <- MS_groups/MS_error)

[1] 4.177862

(pval <- pf(F_ratio, df_groups, df_error, lower.tail=FALSE))

[1] 0.02307757
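An equivalent way to read this result is to compare the observed \( F \) with a critical value from the \( F \)-distribution. The sketch below assumes a significance level of 0.05 (the slides do not state one) and copies the observed \( F \) and degrees of freedom from the output above.

```r
# Values copied from the slides
F_obs  <- 4.177862
df_num <- 3    # df_groups
df_den <- 16   # df_error
alpha  <- 0.05 # assumed significance level

# Critical value: the F beyond which the upper-tail area is alpha
F_crit <- qf(1 - alpha, df_num, df_den)

# Upper-tail P-value, computed as in the slides
p_val <- pf(F_obs, df_num, df_den, lower.tail = FALSE)

F_obs > F_crit  # TRUE exactly when p_val < alpha
p_val < alpha
```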

Analysis of variance (example)

Create manual table and compare

mytable <- data.frame(Df = c(df_groups, df_error), SumSq = c(SS_groups, SS_error), MeanSq = c(MS_groups, MS_error), Fval = c(F_ratio, NA), Pval = c(pval, NA))
rownames(mytable) <- c("ppmCaffeine", "Residuals")
(mytable)

            Df    SumSq    MeanSq     Fval       Pval
ppmCaffeine  3 1.134415 0.3781383 4.177862 0.02307757
Residuals   16 1.448160 0.0905100       NA         NA

ANOVA assumptions and robustness

Assumptions (same as 2-sample \( t \)-test)

Measurements in each group represent a random sample from the corresponding population.
The variable is normally distributed in each of the \( k \) populations.
The variance is the same in all \( k \) populations.

ANOVA assumptions and robustness

Robustness (same as 2-sample \( t \)-test)

Robust to deviations from normality, particularly when sample sizes are large.
Somewhat robust to deviations from equal variances when:
- Sample sizes are “large”
- Sample sizes are about the same in each group (balanced)
- Standard deviations are within about a 3-fold difference
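The 3-fold guideline is easy to check against the group standard deviations reported for the bee data earlier in these slides. The sketch below copies those four values by hand; the vector name is illustrative only.

```r
# Group standard deviations as printed in the beeStats output
group_sd <- c(`50` = 0.2887386, `100` = 0.1694698,
              `150` = 0.3093218, `200` = 0.3927722)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio

# Within the rough 3-fold guideline?
sd_ratio < 3
```

The design is also balanced (5 observations per group), although with groups this small the "large sample" condition is harder to argue.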

Nonparametric alternative to ANOVA

Definition: The Kruskal-Wallis test is a nonparametric method for multiple groups based on ranks.

The Kruskal-Wallis test is similar to the Mann-Whitney \( U \)-test and has the same assumptions:

Group samples are random samples.
To use as a test of difference between means or medians, the distributions must have the same shape in every population.

The Kruskal-Wallis test is nearly as powerful as ANOVA when sample sizes are large, but has less power than ANOVA when sample sizes are small.
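In R, the Kruskal-Wallis test is available as `kruskal.test()` in the base `stats` package, using the same formula interface as `lm()`. The sketch below runs it on a small made-up data set (`toy`, hypothetical values, not the bee data) purely to show the call and the pieces of the result.

```r
# Hypothetical toy data: 3 groups of 4 observations (values invented for illustration)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  y = c(1.2, 0.8, 1.1, 0.9,
        1.9, 2.2, 2.0, 1.7,
        1.0, 1.4, 1.3, 0.9)
)

kw <- kruskal.test(y ~ group, data = toy)

kw$statistic  # Kruskal-Wallis rank-based test statistic
kw$parameter  # degrees of freedom, k - 1
kw$p.value    # P-value from the chi-squared approximation
```

For the bee example, the analogous call would presumably be `kruskal.test(consumptionDifferenceFromControl ~ ppmCaffeine, data = strungOutBees)`.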