The analysis of variance

M. Drew LaMar
April 6, 2016

Class announcements

  • Exam #2 graded!
  • Exam #3 is on Friday, April 22.

Exam #2

summary(grades$Exam2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  48.61   69.44   79.17   78.12   87.50   97.22       1 

Exam #2

boxplot(grades$Exam2, datax=TRUE)

plot of chunk unnamed-chunk-4

Permutation tests

Definition: A permutation test generates a null distribution for the association between two variables by repeatedly and randomly rearranging the values of one of the two variables in the data.

This is a form of bootstrapping.

Permutation tests for association

In this chapter, we explore a permutation test replacement for the two-sample \( t \)-test.

Variations on this theme can be done for many other tests.

Permutation tests for association

Parametric

Permutation

Permutation tests

Algorithm

  • Choose a test statistic that measures association between the two variables in the data (e.g. \( \bar{Y}_{1}-\bar{Y}_{2} \); Why not \( t \)?)
  • Create a permuted sample of data in which the values of the response variables are randomly reordered.
  • Calculate the chosen test statistic for the permuted sample.
  • Repeat the permutation process many times – at least 1000 or more.
  • Compute the \( P \)-value by comparing the observed test statistic (from original data) to this bootstrapped null distribution.

Example: Autoimmune and gut microbes

\( H_{0} \): Mean percent interleukin-17 is the same in both groups.
\( H_{A} \): Mean percent interleukin-17 is NOT the same in both groups.

Example: Autoimmune and gut microbes

nPerm <- 10000
permResult <- vector() # initialize vector
for(i in 1:nPerm){
    # step 1: permute the percent interleukin-17
    permSample <- sample(mydata$percentInterleukin17, replace = FALSE)
    # step 2: calculate difference betweeen means
    permMeans <- tapply(permSample, mydata$treatment, mean)
    permResult[i] <- permMeans[2] - permMeans[1]
}

Example: Autoimmune and gut microbes

M <- tapply(mydata$percentInterleukin17, mydata$treatment, mean)
(tstat <- M[2]-M[1])
    SPF 
8.04875 

plot of chunk unnamed-chunk-8

Example: Autoimmune and gut microbes

\( P \)-value

(pval <- 2*sum(permResult >= tstat)/nPerm)
[1] 0.002

Reject hypothesis of equal means.

Analysis of variance (intro)

Definition: The analysis of variance (ANOVA) compares the means of multiple groups simultaneously in a single analysis.

ANOVA generalizes two-sample \( t \)-test to more than two groups.

In two-sample \( t \)-test, the test statistic is a ratio of the difference between means and the standard error of the mean:

\[ t = \frac{\bar{Y}_{1}-\bar{Y}_{2}}{\mathrm{SE}_{\bar{Y}_{1}-\bar{Y}_{2}}} \]

Analysis of variance (intro)

\[ t = \frac{\bar{Y}_{1}-\bar{Y}_{2}}{\mathrm{SE}_{\bar{Y}_{1}-\bar{Y}_{2}}} \]

Analysis of variance (intro)

By squaring the numerator and denominator of the \( t \)-statistic, we get a ratio of variance components.

\[ \frac{\left(\bar{Y}_{1}-\bar{Y}_{2}\right)^2}{\mathrm{SE}^{2}_{\bar{Y}_{1}-\bar{Y}_{2}}} \]

\[ \frac{"\mathrm{Variance \ between \ groups}"}{"\mathrm{Variance \ within \ groups}"} \]

Analysis of variance (for real)

Data: Suppose I have one categorical explanatory variable X with \( k > 2 \) levels, and a response variable Y.

Hypothesis test:

\[ \begin{eqnarray*} H_{0} & : & \mu_{1} = \mu_{2} = \cdots = \mu_{n}\\ H_{A} & : & \mathrm{At \ least \ one} \ \mu_{i} \ \mathrm{is \ different \ from \ the \ others} \end{eqnarray*} \]

Test statistic:

\[ F = \frac{\mathrm{group \ mean \ square}}{\mathrm{error \ mean \ square}} = \frac{\mathrm{MS}_{\mathrm{groups}}}{\mathrm{MS}_{\mathrm{error}}} \]

Analysis of variance (for real)

Test statistic:

\[ F = \frac{\mathrm{group \ mean \ square}}{\mathrm{error \ mean \ square}} = \frac{\mathrm{MS}_{\mathrm{groups}}}{\mathrm{MS}_{\mathrm{error}}} \]

Definition: The group mean square (\( \mathrm{MS}_{\mathrm{groups}} \)) is proportional to the observed amount of variation among the group sample means [between-group variability].

Definition: The error mean square (\( \mathrm{MS}_{\mathrm{error}} \)) estimates the variance among subjects that belong to the same group [within-group variability].

Analysis of variance (for real)

Test statistic:

\[ F = \frac{\mathrm{group \ mean \ square}}{\mathrm{error \ mean \ square}} = \frac{\mathrm{MS}_{\mathrm{groups}}}{\mathrm{MS}_{\mathrm{error}}} \]

If \( H_{0} \) is true, then \( \mathrm{MS}_{\mathrm{groups}} = \mathrm{MS}_{\mathrm{error}} \) and \( F = 1 \).

If \( H_{0} \) is false, then \( \mathrm{MS}_{\mathrm{groups}} > \mathrm{MS}_{\mathrm{error}} \) and \( F > 1 \).

Analysis of variance (example)

Analysis of variance (example)

Practice Problem #1

Many humans like the effect of caffeine, but it occurs in plants as a deterrent to herbivory by animals. Caffeine is also found in flower nectar, and nectar is meant as a reward for pollinators, not a deterrent. How does caffeine in nectar affect visitation by pollinators?

Analysis of variance (example)

Practice Problem #1

Singaravelan et al. (2005) set up feeding stations where bees were offered a choice between a control solution with 20% surcrose or a caffeinated solution with 20% sucrose plus some quantity of caffeine. Over the course of the experiment, four different concentrations of caffeine were provided: 50, 100, 150, and 200 ppm. The response variable was the difference between the amount of nectar consumed from the caffeine feeders and that removed from the control feeders at the same station (grams).

Analysis of variance (example)

Analysis of variance (example)

Singaravelan et al. (2005) set up feeding stations where bees were offered a choice between a control solution with 20% surcrose or a caffeinated solution with 20% sucrose plus some quantity of caffeine. Over the course of the experiment, four different concentrations of caffeine were provided: 50, 100, 150, and 200 ppm. The response variable was the difference between the amount of nectar consumed from the caffeine feeders and that removed from the control feeders at the same station (grams).

Discuss: Describe the experimental design.

Analysis of variance (derivation)

Data: With \( i \) representing group \( i \), we have

\[ Y_{ij} - \bar{Y} = (\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i}) \]

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \]

Analysis of variance (derivation)

Data: With \( i \) representing group \( i \), we have

\[ Y_{ij} - \bar{Y} = (\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i}) \]

\[ \mathrm{SS}_{\mathrm{total}} = \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 \]

Analysis of variance (derivation)

Data: With \( i \) representing group \( i \), we have

\[ Y_{ij} - \bar{Y} = (\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i}) \]

\[ \scriptsize{\mathrm{SS}_{\mathrm{total}} = \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 = \sum_{i}n_{i}(Y_{i}-\bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij}-\bar{Y}_{i})^2} \]

Analysis of variance (derivation)

Data: With \( i \) representing group \( i \), we have

\[ Y_{ij} - \bar{Y} = (\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i}) \]

\[ \scriptsize{ \begin{eqnarray*} \mathrm{SS}_{\mathrm{total}} = \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 & = & \sum_{i}n_{i}(Y_{i}-\bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij}-\bar{Y}_{i})^2 \\ & = & \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \end{eqnarray*} } \]

Analysis of variance (derivation)

\[ \scriptsize{ \begin{eqnarray*} \mathrm{SS}_{\mathrm{total}} & = & \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 \\ & = & \sum_{i}\sum_{j}\left[(\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i})\right]^2 \\ & = & \sum_{i}\sum_{j}\left[(\bar{Y}_{i} - \bar{Y})^2 + (Y_{ij} - \bar{Y}_{i})^2 + 2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i})\right] \\ & = & \sum_{i}\sum_{j}(\bar{Y}_{i} - \bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij} - \bar{Y}_{i})^2 + \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) \\ & = & \sum_{i}n_{i}(\bar{Y}_{i} - \bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij} - \bar{Y}_{i})^2 + \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) \\ & = & \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} + \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) \end{eqnarray*} } \]

Analysis of variance (derivation)

Can show:

\[ \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) = 0, \]

and thus

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}}. \]