M. Drew LaMar
April 8, 2016
Data: Suppose I have one categorical explanatory variable X with \( k > 2 \) levels, and a response variable Y.
Hypothesis test:
\[ \begin{eqnarray*} H_{0} & : & \mu_{1} = \mu_{2} = \cdots = \mu_{k}\\ H_{A} & : & \mathrm{At \ least \ one} \ \mu_{i} \ \mathrm{is \ different \ from \ the \ others} \end{eqnarray*} \]
Test statistic:
\[ F = \frac{\mathrm{group \ mean \ square}}{\mathrm{error \ mean \ square}} = \frac{\mathrm{MS}_{\mathrm{groups}}}{\mathrm{MS}_{\mathrm{error}}} \]
Definition: The group mean square (\( \mathrm{MS}_{\mathrm{groups}} \)) is proportional to the observed amount of variation among the group sample means [between-group variability].
Definition: The error mean square (\( \mathrm{MS}_{\mathrm{error}} \)) estimates the variance among subjects that belong to the same group [within-group variability].
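If \( H_{0} \) is true, then \( \mathrm{MS}_{\mathrm{groups}} \) and \( \mathrm{MS}_{\mathrm{error}} \) estimate the same variance, so \( \mathrm{MS}_{\mathrm{groups}} \approx \mathrm{MS}_{\mathrm{error}} \) and \( F \) should be close to 1 (up to sampling error).
If \( H_{0} \) is false, then \( \mathrm{MS}_{\mathrm{groups}} \) tends to exceed \( \mathrm{MS}_{\mathrm{error}} \), and we expect \( F > 1 \).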
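To see what "close to 1" means in practice, the following is a minimal simulation sketch, not part of the original slides. It assumes a hypothetical design of 4 groups with 5 subjects each, all drawn from the same normal population (so \( H_{0} \) is true), and looks at how the resulting \( F \)-ratios are distributed. The seed, the design, and the helper name `simulate_F` are illustrative assumptions.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
k <- 4                            # hypothetical number of groups
n <- 5                            # hypothetical sample size per group
group <- factor(rep(seq_len(k), each = n))

# Simulate one F-ratio when every group shares the same mean (H0 true)
simulate_F <- function() {
  y <- rnorm(k * n)               # same mean and variance in all groups
  anova(lm(y ~ group))[["F value"]][1]
}

F_null <- replicate(2000, simulate_F())

# Typical size of F under H0
median(F_null)

# Fraction of simulated F-ratios beyond the 5% critical value
mean(F_null > qf(0.95, df1 = k - 1, df2 = k * n - k))
```

Under these assumptions the \( F \)-ratios scatter around 1 rather than equaling it exactly, and roughly 5% of them should exceed the critical value, which is the Type I error rate of the test.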
Practice Problem #1
Singaravelan et al. (2005) set up feeding stations where bees were offered a choice between a control solution with 20% sucrose or a caffeinated solution with 20% sucrose plus some quantity of caffeine. Over the course of the experiment, four different concentrations of caffeine were provided: 50, 100, 150, and 200 ppm. The response variable was the difference between the amount of nectar consumed from the caffeine feeders and that removed from the control feeders at the same station (grams).
Data: With \( i \) indexing the groups, the total sum of squares partitions into a between-group part and a within-group part:
\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{groups}} + \mathrm{SS}_{\mathrm{error}} \]
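For reference, the standard one-way ANOVA definitions behind this partition can be written as follows. Here \( Y_{ij} \) denotes observation \( j \) in group \( i \), \( \bar{Y}_{i} \) and \( s_{i} \) are the sample mean and standard deviation of group \( i \), \( n_{i} \) is its sample size, and \( \bar{Y} \) is the grand mean; the symbols \( Y_{ij} \) and \( \bar{Y} \) are notation assumed here rather than taken from the slides.
\[ \mathrm{SS}_{\mathrm{groups}} = \sum_{i} n_{i} (\bar{Y}_{i} - \bar{Y})^{2}, \qquad \mathrm{SS}_{\mathrm{error}} = \sum_{i} (n_{i} - 1) s_{i}^{2}, \qquad \mathrm{SS}_{\mathrm{total}} = \sum_{i} \sum_{j} (Y_{ij} - \bar{Y})^{2} \]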
Definition: The group mean square is given by
\[ \mathrm{MS}_{\mathrm{groups}} = \frac{\mathrm{SS}_{\mathrm{groups}}}{df_{\mathrm{groups}}}, \] with \( df_{\mathrm{groups}} = k-1 \).
Definition: The error mean square is given by
\[ \mathrm{MS}_{\mathrm{error}} = \frac{\mathrm{SS}_{\mathrm{error}}}{df_{\mathrm{error}}}, \] with \( df_{\mathrm{error}} = \sum (n_{i}-1) = N-k \).
strungOutBees <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter15/chap15q01HoneybeeCaffeine.csv")
strungOutBees$ppmCaffeine <- factor(strungOutBees$ppmCaffeine)
str(strungOutBees)
'data.frame': 20 obs. of 2 variables:
$ ppmCaffeine : Factor w/ 4 levels "50","100","150",..: 1 2 3 4 1 2 3 4 1 2 ...
$ consumptionDifferenceFromControl: num -0.4 0.01 0.65 0.24 0.34 -0.39 0.53 0.44 0.19 -0.08 ...
Discuss: Is this data tidy or messy?
Answer: Tidy! Each variable is a column and each observation is a row.
stripchart(consumptionDifferenceFromControl ~ ppmCaffeine, data = strungOutBees, vertical = TRUE, method = "jitter", xlab="Caffeine (ppm)")
Discuss: State the null and alternative hypotheses appropriate for this question.
Answer: \[ \begin{eqnarray*} H_{0} & : & \mu_{50} = \mu_{100} = \mu_{150} = \mu_{200} \\ H_{A} & : & \mathrm{At \ least \ one \ of \ the \ means \ is \ different} \end{eqnarray*} \]
Shortcut using R
caffResults <- lm(consumptionDifferenceFromControl ~ ppmCaffeine, data=strungOutBees)
anova(caffResults)
Analysis of Variance Table
Response: consumptionDifferenceFromControl
            Df Sum Sq Mean Sq F value  Pr(>F)
ppmCaffeine  3 1.1344 0.37814  4.1779 0.02308 *
Residuals   16 1.4482 0.09051
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Definition: The \( R^{2} \) value used in ANOVA is the “fraction of the variation explained by groups” and is given by
\[ R^{2} = \frac{\mathrm{SS}_{\mathrm{groups}}}{\mathrm{SS}_{\mathrm{total}}}. \] Note: \( 0 \leq R^2 \leq 1 \).
beeAnovaSummary <- summary(caffResults)
beeAnovaSummary$r.squared
[1] 0.4392573
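As a quick cross-check, \( R^{2} \) can be recomputed by hand from the sums of squares printed in the ANOVA table above. This is a small sketch: the two numbers are copied from that table, so the result agrees with `beeAnovaSummary$r.squared` only up to the rounding of the printed values.

```r
# Sums of squares copied (rounded) from the ANOVA table above
SS_groups_printed <- 1.1344
SS_error_printed  <- 1.4482

# Total variation is the sum of the two parts
SS_total_printed <- SS_groups_printed + SS_error_printed

# Fraction of the variation explained by groups
(R2_byHand <- SS_groups_printed / SS_total_printed)
```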
Long way
Question: Calculate the following summary statistics for each group: \( n_{i} \), \( \bar{Y}_{i} \), and \( s_{i} \).
library(dplyr)
(beeStats <- strungOutBees %>% group_by(ppmCaffeine) %>% summarise(n = n(),
mean = mean(consumptionDifferenceFromControl),
sd = sd(consumptionDifferenceFromControl)))
Source: local data frame [4 x 4]
  ppmCaffeine     n   mean        sd
       (fctr) (int)  (dbl)     (dbl)
1          50     5  0.008 0.2887386
2         100     5 -0.172 0.1694698
3         150     5  0.376 0.3093218
4         200     5  0.378 0.3927722
Compute sum-of-squares
grandMean <- mean(strungOutBees$consumptionDifferenceFromControl)
(SS_groups <- sum(beeStats$n*(beeStats$mean - grandMean)^2))
[1] 1.134415
(SS_error <- sum((beeStats$n-1)*beeStats$sd^2))
[1] 1.44816
Compute degrees of freedom
(df_groups <- 4-1)
[1] 3
(df_error <- nrow(strungOutBees)-4)
[1] 16
Compute mean squares
(MS_groups <- SS_groups/df_groups)
[1] 0.3781383
(MS_error <- SS_error/df_error)
[1] 0.09051
Compute \( F \)-statistic and \( P \)-value
(F_ratio <- MS_groups/MS_error)
[1] 4.177862
(pval <- pf(F_ratio, df_groups, df_error, lower.tail=FALSE))
[1] 0.02307757
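The same decision can be reached by comparing the observed \( F \)-ratio with a critical value instead of computing a \( P \)-value. The sketch below assumes a significance level of \( \alpha = 0.05 \) (not stated on the slides) and copies the \( F \)-ratio and degrees of freedom from the output above so that it runs on its own.

```r
# Values copied from the output above
F_observed <- 4.177862
df_num <- 3          # df_groups
df_den <- 16         # df_error
alpha <- 0.05        # assumed significance level

# Critical value: the F value with upper-tail area alpha
(F_crit <- qf(1 - alpha, df_num, df_den))

# Reject H0 when the observed F exceeds the critical value
(rejectH0 <- F_observed > F_crit)

# Equivalent check through the P-value
(pval_check <- pf(F_observed, df_num, df_den, lower.tail = FALSE))
```

Both routes are equivalent: \( F \) exceeds the critical value exactly when the \( P \)-value falls below \( \alpha \).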
Create manual table and compare
mytable <- data.frame(Df = c(df_groups, df_error), SumSq = c(SS_groups, SS_error), MeanSq = c(MS_groups, MS_error), Fval = c(F_ratio, NA), Pval = c(pval, NA))
rownames(mytable) <- c("ppmCaffeine", "Residuals")
(mytable)
            Df    SumSq    MeanSq     Fval       Pval
ppmCaffeine  3 1.134415 0.3781383 4.177862 0.02307757
Residuals   16 1.448160 0.0905100       NA         NA
Assumptions (same as 2-sample \( t \)-test)
Robustness (same as 2-sample \( t \)-test)
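One possible way to screen these assumptions in R is sketched below. It assumes the usual two-sample \( t \)-test conditions carried over to several groups (approximately normal data within groups and similar variances across groups). It uses R's built-in `PlantGrowth` dataset purely as a stand-in so that the example runs without a download; it is not the bee data, and the choice of tests is illustrative rather than prescribed by the slides.

```r
# PlantGrowth ships with R; it is a stand-in here, not the bee data
fit <- lm(weight ~ group, data = PlantGrowth)

# Normality: Shapiro-Wilk test on the residuals of the one-way model
normCheck <- shapiro.test(residuals(fit))
normCheck$p.value

# Equal variances: Bartlett's test across the groups
varCheck <- bartlett.test(weight ~ group, data = PlantGrowth)
varCheck$p.value

# Informal check: ratio of largest to smallest group standard deviation
groupSD <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)
max(groupSD) / min(groupSD)
```

Formal tests like these have little power in small samples, so plots of the data (such as the strip chart used earlier) remain a useful complement.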
Definition: The Kruskal-Wallis test is a nonparametric method for multiple groups based on ranks.
The Kruskal-Wallis test is similar to the Mann-Whitney \( U \)-test and has the same assumptions:
The Kruskal-Wallis test is nearly as powerful as ANOVA when sample sizes are large, but has less power than ANOVA when sample sizes are small.
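In R, the Kruskal-Wallis test is available as `kruskal.test()`, which accepts the same formula interface as `lm()`. The sketch below runs it next to a one-way ANOVA on R's built-in `PlantGrowth` dataset, used here only as a self-contained stand-in (an assumption of this example, not the bee data from the practice problem).

```r
# PlantGrowth ships with R: plant weight under 3 conditions (a stand-in dataset)
kwResults <- kruskal.test(weight ~ group, data = PlantGrowth)
kwResults

# One-way ANOVA on the same data, for comparison
anovaResults <- anova(lm(weight ~ group, data = PlantGrowth))
anovaResults

# P-values from the two approaches, side by side
c(KruskalWallis = kwResults$p.value,
  ANOVA = anovaResults[["Pr(>F)"]][1])
```

For the bee data, the analogous call would be `kruskal.test(consumptionDifferenceFromControl ~ ppmCaffeine, data = strungOutBees)`.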