The Analysis of Variance (ANOVA) - Part 2

M. Drew LaMar
November 18/20, 2019

Analysis of variance (intro)

\[ t = \frac{\bar{Y}_{1}-\bar{Y}_{2}}{\mathrm{SE}_{\bar{Y}_{1}-\bar{Y}_{2}}} \]

Analysis of variance (derivation)

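Data: With \( Y_{ij} \) denoting observation \( j \) in group \( i \), \( \bar{Y}_{i} \) the mean of group \( i \), and \( \bar{Y} \) the grand mean, we have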

\[ Y_{ij} - \bar{Y} = (\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i}) \]

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \]

\[ \mathrm{SS}_{\mathrm{total}} = \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 \]

\[ \scriptsize{ \begin{eqnarray*} \mathrm{SS}_{\mathrm{total}} = \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 & = & \sum_{i}n_{i}(\bar{Y}_{i}-\bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij}-\bar{Y}_{i})^2 \\ & = & \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \end{eqnarray*} } \]
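The partition can be checked numerically. The following is a minimal sketch on a small made-up data set (the data frame `toy` and its values are hypothetical, chosen only for illustration and deliberately unbalanced); it computes each sum of squares directly from its definition.

```r
# Hypothetical toy data: three groups of unequal size (not from the bee study)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), times = c(3, 4, 5))),
  y     = c(2.1, 1.7, 2.6, 3.0, 3.4, 2.8, 3.9, 1.2, 0.8, 1.5, 1.1, 0.9)
)

grandMean <- mean(toy$y)            # grand mean
groupMean <- ave(toy$y, toy$group)  # group mean, repeated for every observation in its group

SS_total  <- sum((toy$y - grandMean)^2)
SS_groups <- sum((groupMean - grandMean)^2)  # equals sum over groups of n_i * (group mean - grand mean)^2
SS_error  <- sum((toy$y - groupMean)^2)

# The two sides should agree up to floating-point error
all.equal(SS_total, SS_groups + SS_error)
```

Summing `(groupMean - grandMean)^2` over all observations counts each group's squared deviation \( n_{i} \) times, which is why no explicit \( n_{i} \) appears in the code.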

Analysis of variance (derivation)

\[ \scriptsize{ \begin{eqnarray*} \mathrm{SS}_{\mathrm{total}} & = & \sum_{i}\sum_{j}(Y_{ij}-\bar{Y})^2 \\ & = & \sum_{i}\sum_{j}\left[(\bar{Y}_{i} - \bar{Y}) + (Y_{ij} - \bar{Y}_{i})\right]^2 \\ & = & \sum_{i}\sum_{j}\left[(\bar{Y}_{i} - \bar{Y})^2 + (Y_{ij} - \bar{Y}_{i})^2 + 2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i})\right] \\ & = & \sum_{i}\sum_{j}(\bar{Y}_{i} - \bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij} - \bar{Y}_{i})^2 + \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) \\ & = & \sum_{i}n_{i}(\bar{Y}_{i} - \bar{Y})^2 + \sum_{i}\sum_{j}(Y_{ij} - \bar{Y}_{i})^2 + \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) \\ & = & \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} + \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) \end{eqnarray*} } \]

Analysis of variance (derivation)

One can show that the cross term vanishes:

\[ \sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i}) = 0, \]

and thus

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}}. \]
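A short sketch of why the cross term is zero: the factor \( \bar{Y}_{i} - \bar{Y} \) does not depend on \( j \), so it can be pulled out of the inner sum, and within each group \( \sum_{j} Y_{ij} = n_{i}\bar{Y}_{i} \) by the definition of the group mean.

```latex
\begin{eqnarray*}
\sum_{i}\sum_{j}2(\bar{Y}_{i} - \bar{Y})(Y_{ij} - \bar{Y}_{i})
  & = & 2\sum_{i}(\bar{Y}_{i} - \bar{Y})\sum_{j}(Y_{ij} - \bar{Y}_{i}) \\
  & = & 2\sum_{i}(\bar{Y}_{i} - \bar{Y})\left(n_{i}\bar{Y}_{i} - n_{i}\bar{Y}_{i}\right) \\
  & = & 0
\end{eqnarray*}
```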

From sum-of-squares to mean squares

Definition: The group mean square is given by

\[ \mathrm{MS}_{\mathrm{groups}} = \frac{\mathrm{SS}_{\mathrm{groups}}}{df_{\mathrm{groups}}}, \] with \( df_{\mathrm{groups}} = k-1 \).

Definition: The error mean square is given by

\[ \mathrm{MS}_{\mathrm{error}} = \frac{\mathrm{SS}_{\mathrm{error}}}{df_{\mathrm{error}}}, \] with \( df_{\mathrm{error}} = \sum (n_{i}-1) = N-k \).
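These two mean squares are the ingredients of the test statistic used in the example that follows. As a reminder (this is the standard one-way ANOVA result, stated here rather than derived on these slides), the \( F \)-ratio compares them, and when \( H_{0} \) is true and the ANOVA assumptions hold it follows an \( F \)-distribution with the two degrees of freedom defined above:

```latex
F = \frac{\mathrm{MS}_{\mathrm{groups}}}{\mathrm{MS}_{\mathrm{error}}},
\qquad
F \sim F_{k-1,\,N-k} \quad \mathrm{under} \ H_{0}
```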

Analysis of variance (example)

str(strungOutBees)

'data.frame':   20 obs. of  2 variables:
 $ ppmCaffeine                     : Factor w/ 4 levels "ppm50","ppm100",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ consumptionDifferenceFromControl: num  -0.4 0.01 0.65 0.24 0.34 -0.39 0.53 0.44 0.19 -0.08 ...

Discuss: Is this data tidy or messy?

Answer: Tidy! Each variable has its own column and each observation has its own row.

Analysis of variance (example)

stripchart(consumptionDifferenceFromControl ~ ppmCaffeine, 
           data = strungOutBees, 
           vertical = TRUE, 
           method = "jitter", 
           xlab="Caffeine (ppm)",
           col="red")

Analysis of variance (example)

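[Figure: strip chart of consumptionDifferenceFromControl (vertical axis) by caffeine concentration in ppm (horizontal axis), jittered points in red]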

Analysis of variance (example)

Discuss: State the null and alternative hypotheses appropriate for this question.

\[ \begin{eqnarray*} H_{0} & : & \mu_{50} = \mu_{100} = \mu_{150} = \mu_{200} \\ H_{A} & : & \mathrm{At \ least \ one \ of \ the \ means \ is \ different} \end{eqnarray*} \]

Analysis of variance (example)

Shortcut using R

caffResults <- lm(consumptionDifferenceFromControl ~ ppmCaffeine, data=strungOutBees)
anova(caffResults)

Analysis of Variance Table

Response: consumptionDifferenceFromControl
            Df Sum Sq Mean Sq F value  Pr(>F)  
ppmCaffeine  3 1.1344 0.37814  4.1779 0.02308 *
Residuals   16 1.4482 0.09051                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysis of variance (example)

Definition: The \( R^{2} \) value in ANOVA is the “fraction of the variation explained by groups” and is given by

\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{groups}}}{\mathrm{SS}_{\mathrm{total}}}. \] Note: \( 0 \leq R^2 \leq 1 \).

beeAnovaSummary <- summary(caffResults)
beeAnovaSummary$r.squared

[1] 0.4392573
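The same value can be recovered by hand from the ANOVA table, since \( \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \). A small sketch, with the two sums of squares typed in from the rounded `Sum Sq` column of the table (so the result should only agree with `r.squared` to about four decimal places); the variable names are chosen here for illustration:

```r
# Sums of squares copied from the rounded "Sum Sq" column of the ANOVA table
SS_groups_tab <- 1.1344   # ppmCaffeine row
SS_error_tab  <- 1.4482   # Residuals row

# Total variation is the sum of the two components
SS_total_tab <- SS_groups_tab + SS_error_tab

# Fraction of the variation explained by groups
(R2_byHand <- SS_groups_tab / SS_total_tab)
```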

Analysis of variance (example)

Long way

Question: Calculate the following summary statistics for each group: \( n_{i} \), \( \bar{Y}_{i} \), and \( s_{i} \).

Analysis of variance (example)

library(dplyr)
beeStats <- strungOutBees %>% 
  group_by(ppmCaffeine) %>% 
  summarise(n = n(), 
            mean = mean(consumptionDifferenceFromControl),
            sd = sd(consumptionDifferenceFromControl))
knitr::kable(beeStats)

ppmCaffeine	n	mean	sd
ppm50	5	0.008	0.2887386
ppm100	5	-0.172	0.1694698
ppm150	5	0.376	0.3093218
ppm200	5	0.378	0.3927722

Analysis of variance (example)

Compute sum-of-squares

grandMean <- mean(strungOutBees$consumptionDifferenceFromControl)
(SS_groups <- sum(beeStats$n*(beeStats$mean - grandMean)^2))

[1] 1.134415

(SS_error <- sum((beeStats$n-1)*beeStats$sd^2))

[1] 1.44816
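Both sums of squares depend on the data only through the group summaries: the grand mean is the \( n_{i} \)-weighted mean of the group means, and \( (n_{i}-1)s_{i}^{2} = \sum_{j}(Y_{ij}-\bar{Y}_{i})^{2} \). As a hedged, self-contained sketch, the values can therefore be reproduced from the summary table alone. The data frame `beeStatsTable` and the `_check` names are introduced here for illustration, with numbers typed in from the printed table:

```r
# Group summaries typed in from the printed summary table
beeStatsTable <- data.frame(
  ppmCaffeine = c("ppm50", "ppm100", "ppm150", "ppm200"),
  n    = c(5, 5, 5, 5),
  mean = c(0.008, -0.172, 0.376, 0.378),
  sd   = c(0.2887386, 0.1694698, 0.3093218, 0.3927722)
)

# Grand mean as the sample-size-weighted mean of the group means
grandMeanCheck <- weighted.mean(beeStatsTable$mean, w = beeStatsTable$n)

# Between-group and within-group sums of squares
SS_groups_check <- sum(beeStatsTable$n * (beeStatsTable$mean - grandMeanCheck)^2)
SS_error_check  <- sum((beeStatsTable$n - 1) * beeStatsTable$sd^2)

c(SS_groups = SS_groups_check, SS_error = SS_error_check)
```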

Analysis of variance (example)

Compute degrees of freedom

(df_groups <- nlevels(strungOutBees$ppmCaffeine) - 1)  # k - 1, with k = 4 groups

[1] 3

(df_error <- nrow(strungOutBees) - nlevels(strungOutBees$ppmCaffeine))  # N - k

[1] 16

Analysis of variance (example)

Compute mean squares

(MS_groups <- SS_groups/df_groups)

[1] 0.3781383

(MS_error <- SS_error/df_error)

[1] 0.09051

Analysis of variance (example)

Compute \( F \)-statistic and \( P \)-value

(F_ratio <- MS_groups/MS_error)

[1] 4.177862

(pval <- pf(F_ratio, df_groups, df_error, lower.tail=FALSE))

[1] 0.02307757
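An equivalent way to reach the decision is to compare the observed \( F \)-ratio with a critical value instead of computing a \( P \)-value. The sketch below assumes a significance level of \( \alpha = 0.05 \) (an assumption made here, not stated on the slides) and types in the \( F \)-ratio and degrees of freedom from the output above; the names `F_obs`, `alpha`, and `F_crit` are illustrative.

```r
# Values typed in from the output above
F_obs <- 4.177862
df1   <- 3     # df_groups
df2   <- 16    # df_error

alpha <- 0.05  # assumed significance level

# Critical value: the point with upper-tail area alpha under F(df1, df2)
F_crit <- qf(1 - alpha, df1, df2)

# Reject H0 when the observed ratio exceeds the critical value
F_obs > F_crit
```

Rejecting when `F_obs > F_crit` is the same decision rule as rejecting when the upper-tail `pf()` probability is below `alpha`.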

Analysis of variance (example)

Create manual table and compare

mytable <- data.frame(Df = c(df_groups, df_error), 
                      SumSq = c(SS_groups, SS_error), 
                      MeanSq = c(MS_groups, MS_error), 
                      Fval = c(F_ratio, NA), 
                      Pval = c(pval, NA))
rownames(mytable) <- c("ppmCaffeine", "Residuals")
knitr::kable(mytable)

	Df	SumSq	MeanSq	Fval	Pval
ppmCaffeine	3	1.134415	0.3781383	4.177862	0.0230776
Residuals	16	1.448160	0.0905100	NA	NA

Analysis of Variance Table

Response: consumptionDifferenceFromControl
            Df Sum Sq Mean Sq F value  Pr(>F)  
ppmCaffeine  3 1.1344 0.37814  4.1779 0.02308 *
Residuals   16 1.4482 0.09051                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA assumptions and robustness

Assumptions (same as 2-sample \( t \)-test)

Measurements in each group represent a random sample from corresponding population.
Variable is normally distributed in each of the \( k \) populations.
Variance is the same in all \( k \) populations.

ANOVA assumptions and robustness

Robustness (same as 2-sample \( t \)-test)

Robust to deviations from normality, particularly when sample sizes are large.
Somewhat robust to deviations from equal variances when:
- Sample sizes are “large”
- Sample sizes are about the same in each group (balanced)
- Standard deviations are within about a 3-fold difference
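As a quick illustration of these rules of thumb, the balance and standard-deviation conditions can be checked for the caffeine example from its group summaries. This is only a sketch: the vectors `groupN` and `groupSD` are typed in here from the summary table in the example, and the 3-fold threshold is the rough guideline above, not a formal test.

```r
# Group sizes and standard deviations typed in from the bee summary table
groupN  <- c(ppm50 = 5, ppm100 = 5, ppm150 = 5, ppm200 = 5)
groupSD <- c(ppm50 = 0.2887386, ppm100 = 0.1694698,
             ppm150 = 0.3093218, ppm200 = 0.3927722)

# Balanced design: every group has the same sample size
isBalanced <- length(unique(groupN)) == 1

# Ratio of the largest to the smallest standard deviation
sdRatio <- max(groupSD) / min(groupSD)

c(balanced = isBalanced, sdRatioBelow3 = sdRatio < 3)
```

With only five observations per group, the "large sample" condition is the one this data set does not obviously meet.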

Nonparametric alternative to ANOVA

Definition: The Kruskal-Wallis test is a nonparametric method for multiple groups based on ranks.

The Kruskal-Wallis test is similar to the Mann-Whitney \( U \)-test and has the same assumptions:

Group samples are random samples.
To use as a test of difference between means or medians, the distributions must have the same shape in every population.

The Kruskal-Wallis test has nearly as much power as ANOVA when sample sizes are large, but less power than ANOVA when sample sizes are small.
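In R, the test is available as `kruskal.test()` in the base `stats` package and accepts the same formula interface as `lm()`. The sketch below runs it on simulated, skewed data; the data frame `sim`, its group labels, and the exponential rates are all made up for illustration and have nothing to do with the bee study.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical skewed data: three groups of 8, the third with a larger mean
sim <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 8)),
  y     = c(rexp(8, rate = 1), rexp(8, rate = 1), rexp(8, rate = 0.5))
)

# Rank-based test of whether the groups share a common distribution
kwResult <- kruskal.test(y ~ group, data = sim)

kwResult$statistic  # Kruskal-Wallis chi-squared statistic
kwResult$parameter  # degrees of freedom, k - 1
kwResult$p.value
```

For the caffeine example, the analogous call would be `kruskal.test(consumptionDifferenceFromControl ~ ppmCaffeine, data = strungOutBees)`.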