The Analysis of Variance (ANOVA) - Part 2

M. Drew LaMar
April 3, 2017

Class Announcements

  • Exam #2 is graded!!! Go over in lab this week.
  • Homework #10 is posted (last before 3rd exam)
    • Chapters 13 and 15
    • Due Monday, April 10, 5:00 pm
  • Exam #3 is on Friday, April 21
    • Whitlock & Schluter, Chapters 10-13, 15
    • Ruxton & Colegrave, Chapter 3
  • Homework #11 will be the last
    • Chapter 16-18
    • Due Friday, April 28, 5:00 pm

Analysis of variance (intro)

\[ t = \frac{\bar{Y}_{1}-\bar{Y}_{2}}{\mathrm{SE}_{\bar{Y}_{1}-\bar{Y}_{2}}} \]

Analysis of variance (derivation)

Data: With \( i \) representing group \( i \), we have

\[ Y_{ij} - \bar{Y} = (\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i}) \]

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \]

Analysis of variance (derivation)

Data: With \( i \) representing group \( i \), we have

\[ Y_{ij} - \bar{Y} = (\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i}) \]

\[ \scriptsize{\mathrm{SS}_{\mathrm{total}} = \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 = \sum_{i}n_{i}(\bar{Y}_{i}-\bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij}-\bar{Y}_{i})^2} \]

From sum-of-squares to mean squares

Definition: The group mean square is given by

\[ \mathrm{MS}_{\mathrm{groups}} = \frac{\mathrm{SS}_{\mathrm{groups}}}{df_{\mathrm{groups}}}, \] with \( df_{\mathrm{groups}} = k-1 \).

Definition: The error mean square is given by

\[ \mathrm{MS}_{\mathrm{error}} = \frac{\mathrm{SS}_{\mathrm{error}}}{df_{\mathrm{error}}}, \] with \( df_{\mathrm{error}} = \sum (n_{i}-1) = N-k \).

Analysis of variance (example)

str(strungOutBees)
'data.frame':   20 obs. of  2 variables:
 $ ppmCaffeine                     : Factor w/ 4 levels "ppm50","ppm100",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ consumptionDifferenceFromControl: num  -0.4 0.01 0.65 0.24 0.34 -0.39 0.53 0.44 0.19 -0.08 ...

Discuss: Is this data tidy or messy?

Definition: Tidy!

Analysis of variance (example)

stripchart(consumptionDifferenceFromControl ~ ppmCaffeine, 
           data = strungOutBees, 
           vertical = TRUE, 
           method = "jitter", 
           xlab="Caffeine (ppm)",
           col="red")

Analysis of variance (example)

plot of chunk unnamed-chunk-3

Analysis of variance (example)

Discuss: State the null and alternative hypotheses appropriate for this question.

\[ \begin{eqnarray*} H_{0} & : & \mu_{50} = \mu_{100} = \mu_{150} = \mu_{200} \\ H_{A} & : & \mathrm{At \ least \ one \ of \ the \ means \ is \ different} \end{eqnarray*} \]

Analysis of variance (example)

Short cut using R

caffResults <- lm(consumptionDifferenceFromControl ~ ppmCaffeine, data=strungOutBees)
anova(caffResults)
Analysis of Variance Table

Response: consumptionDifferenceFromControl
            Df Sum Sq Mean Sq F value  Pr(>F)  
ppmCaffeine  3 1.1344 0.37814  4.1779 0.02308 *
Residuals   16 1.4482 0.09051                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of variance (example)

Definition: The \( R^{2} \) value is used in ANOVA is the “fraction of the variation explained by groups” and is given by

\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{groups}}}{\mathrm{SS}_{\mathrm{total}}}. \] Note: \( 0 \leq R^2 \leq 1 \).

beeAnovaSummary <- summary(caffResults)
beeAnovaSummary$r.squared
[1] 0.4392573

Analysis of variance (example)

Long way

Question: Calculate the following summary statistics for each group: \( n_{i} \), \( \bar{Y}_{i} \), and \( s_{i} \).

Analysis of variance (example)

library(dplyr)
beeStats <- strungOutBees %>% 
  group_by(ppmCaffeine) %>% 
  summarise(n = n(), 
            mean = mean(consumptionDifferenceFromControl),
            sd = sd(consumptionDifferenceFromControl))
knitr::kable(beeStats)
ppmCaffeine n mean sd
ppm50 5 0.008 0.2887386
ppm100 5 -0.172 0.1694698
ppm150 5 0.376 0.3093218
ppm200 5 0.378 0.3927722

Analysis of variance (example)

Compute sum-of-squares

grandMean <- mean(strungOutBees$consumptionDifferenceFromControl)
(SS_groups <- sum(beeStats$n*(beeStats$mean - grandMean)^2))
[1] 1.134415
(SS_error <- sum((beeStats$n-1)*beeStats$sd^2))
[1] 1.44816

Analysis of variance (example)

Compute degree of freedom

(df_groups <- 4-1)
[1] 3
(df_error <- nrow(strungOutBees)-4)
[1] 16

Analysis of variance (example)

Compute mean squares

(MS_groups <- SS_groups/df_groups)
[1] 0.3781383
(MS_error <- SS_error/df_error)
[1] 0.09051

Analysis of variance (example)

Compute \( F \)-statistic and \( P \)-value

(F_ratio <- MS_groups/MS_error)
[1] 4.177862
(pval <- pf(F_ratio, df_groups, df_error, lower.tail=FALSE))
[1] 0.02307757

Analysis of variance (example)

Create manual table and compare

mytable <- data.frame(Df = c(df_groups, df_error), 
                      SumSq = c(SS_groups, SS_error), 
                      MeanSq = c(MS_groups, MS_error), 
                      Fval = c(F_ratio, NA), 
                      Pval = c(pval, NA))
rownames(mytable) <- c("ppmCaffeine", "Residuals")
knitr::kable(mytable)
Df SumSq MeanSq Fval Pval
ppmCaffeine 3 1.134415 0.3781383 4.177862 0.0230776
Residuals 16 1.448160 0.0905100 NA NA
Analysis of Variance Table

Response: consumptionDifferenceFromControl
            Df Sum Sq Mean Sq F value  Pr(>F)  
ppmCaffeine  3 1.1344 0.37814  4.1779 0.02308 *
Residuals   16 1.4482 0.09051                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA assumptions and robustness

Assumptions (same as 2-sample \( t \)-test)

  • Measurements in each group represent a random sample from corresponding population.
  • Variable is normally distributed in each of the \( k \) populations.
  • Variance is the same in all \( k \) populations.

ANOVA assumptions and robustness

Robustness (same as 2-sample \( t \)-test)

  • Robust to deviations in normality.
  • Somewhat robust to deviations in equal variances when:
    • Sample sizes are “large”, about the same size in each group
    • Sample sizes are about the same (balanced)
    • Standard deviations are within about a 3-fold difference

Nonparametric alternative to ANOVA

Definition: The Kruskal-Wallis test is a nonparametric method for mulutiple groups based on ranks.

The Kruskal-Wallis test is similar to the Mann-Whitney \( U \)-test and has the same assumptions:

  • Group samples are random samples.
  • To use as a test of difference between means or medians, the distributions must have the same shape in every population.

Power of Kruskal-Wallis test is nearly as powerful as ANOVA when sample sizes are large, but has smaller power than ANOVA for small sample sizes.

Planned and unplanned comparisons

So there is a difference between means amongst all groups. Now what? I want to know which means are different from one another!!!

Discuss: Which groups do you think have significantly different means?

Planned comparisons

Definition: A planned comparison is a comparison between means planned during the design of the study, identified before the data are examined.

A planned comparison must have a strong a priori justification, such as an expectation from theory or a prior study.

Only one or a small number of planned comparisons is allowed, to minimize inflating the Type I error rate.

Unplanned comparisons

Definition: An unplanned comparison is one of multiple comparisons, such as between all pairs of means, carried out to help determine where differences between means lie.

Unplanned comparisons are a form of data dredging, so we need to minimize the rising Type I errors that we get from performing many tests.

Planned comparison (details)

A planned comparison is essentially a 2-sample \( t \)-test, except you use the \( \mathrm{MS}_{\mathrm{error}} \) in place of the pooled sample variance when computing the standard error, i.e.

\[ \mathrm{SE} = \sqrt{\mathrm{MS}_{\mathrm{error}}\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)} \]

In R, we use the multcomp package to do this (could do it by hand, of course, but ain't nobody got time for that!)

Planned comparison (example)

caffPlanned <- glht(caffResults, linfct = mcp(ppmCaffeine = c("ppm200 - ppm50 = 0")))
confint(caffPlanned)

ppm200 - ppm50 = 0 is called a contrast. In this case, it simply means “Test if the means between the two groups ppm50 and ppm200 are the same.”

Planned comparison (example)


     Simultaneous Confidence Intervals

Multiple Comparisons of Means: User-defined Contrasts


Fit: lm(formula = consumptionDifferenceFromControl ~ ppmCaffeine, 
    data = strungOutBees)

Quantile = 2.1199
95% family-wise confidence level


Linear Hypotheses:
                    Estimate lwr      upr     
ppm200 - ppm50 == 0  0.37000 -0.03336  0.77336