ANOVA can be broken into two ideas. One-way and two-way.
One-way ANOVA will examine the means of a quantitative variable by examining the variance among the groups. We should be comparing more than two groups (for two groups do the hypothesis test of module 0). We could do this by hand but nobody does so I am not going to bother either. The null hypothesis for this will always be that all means are equal while the alternative is that there are at least two that are not equal. Be careful with this as some treatments may still have equal means but not all are equal. Assumptions for the test include normality, sample independence, and the tricky one equal variances between different groups. We have seen normality (qq-plots) The equality of variances will be checked by using the residual vs plotted graph and again asking do we fall on the line.
Two-way ANOVA is going to look at two categorical variables and ask if there is an effect with these categories on the dependent variable. You will have to look not only at the contribution but the interaction between those categorical variables. Conditions in two-way are similar to the one-way except that we also ask that the number in each category is equal.
| A | B | C | D | E | |
|---|---|---|---|---|---|
| Plant 1 | 19 | 28 | 25 | 13 | 6 |
| Plant 2 | 8 | 27 | 22 | 12 | 7 |
| Plant 3 | 14 | 20 | 24 | 13 | 15 |
| Plant 4 | 11 | 17 | 22 | 16 | 8 |
Let’s add this into R.
data = data.frame("A" = c(19,8,14,11), "B" = c(28,27,20,17), "C" = c(25,22,24,22), "D" = c(13,12,13,16),"E" = c(6,7,15,8),"Plant"=1:4)
data
## A B C D E Plant
## 1 19 28 25 13 6 1
## 2 8 27 22 12 7 2
## 3 14 20 24 13 15 3
## 4 11 17 22 16 8 4
I need to make this into tidy data. It took me a while to find the correct code but I have done it here. By the way I don’t think I needed to include the plant number but I have included it anyway.
data <-
data %>%
pivot_longer(c('A','B','C','D','E'), names_to = "Strain", values_to = "Weight")
data
## # A tibble: 20 x 3
## Plant Strain Weight
## <int> <chr> <dbl>
## 1 1 A 19
## 2 1 B 28
## 3 1 C 25
## 4 1 D 13
## 5 1 E 6
## 6 2 A 8
## 7 2 B 27
## 8 2 C 22
## 9 2 D 12
## 10 2 E 7
## 11 3 A 14
## 12 3 B 20
## 13 3 C 24
## 14 3 D 13
## 15 3 E 15
## 16 4 A 11
## 17 4 B 17
## 18 4 C 22
## 19 4 D 16
## 20 4 E 8
To apply the ANOVA test in R you actually must do two calls. The first call aov asks you to create a function. Essentially you are asking is there a difference in the weight because of the strain you picked. This was why I had to get the data tidy! Once you run the fit, you run a test on that fit with anova of the fit.
fm = aov(Weight ~ Strain, data)
summary(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## Strain 4 660.8 165.20 11.38 0.00019 ***
## Residuals 15 217.7 14.52
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Strain here could also be called the Treatment and the Residuals could be called the Error.
Let’s include a plot to see what is going on here.
ggplot(data , aes(x = Strain, y = Weight)) +
geom_boxplot()
We can see it is very clear that these will have a different means. B and C are clearly have heavier averages while A, D, and E are lighter.
Let’s check the assumptions of ANOVA are fulfilled.
plot(fm, 1:2)
In the first graph you are looking to see if there is any relationship. If there is you will not meet the criteria of homogeneity of variance. Here we do not observe any relationship.
In the second graph you in fact want a relationship! You would like for all the points to fall on that line. Here they do not clearly fall on the line but it is close enough to continue with our results. We will discuss ways to check for this in the next module!
If we needed to find a critical F value, we would need to know two degrees of freedom and the p value we are using. Looking at the 95% critical value, we use the degrees of freedom from the numerator (or Treatment) and the degrees of freedom from the denominator (or Error). You stick a q in front of f to make it give you critical values just like with t or z.
qf(.95,4,15)
## [1] 3.055568
I want to recreate one more visualization to really express the differences in these means.
y1 <- mean(data$Weight)
ggplot(data = data, aes(x = Strain, y = Weight)) +
geom_point() +
stat_summary(fun.data = 'mean_se',color = "red") +
geom_hline(yintercept = y1, color ="blue", linetype = "dashed")
I had seen this graph some where and decided to recreate it here. Totally unnecessary but creates another visualization of really what the ANOVA test is looking at.
The homework only has the output from a two way ANOVA only after it has been run through software. So I don’t have an exact homework question to cover but I’ll still preform one on a built in dataset. Everybody loves diamonds right?
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
We want to ask is the mean price different for diamonds of different cut and color? .
table(diamonds$cut,diamonds$color)
##
## D E F G H I J
## Fair 163 224 312 314 303 175 119
## Good 662 933 909 871 702 522 307
## Very Good 1513 2400 2164 2299 1824 1204 678
## Premium 1603 2337 2331 2924 2360 1428 808
## Ideal 2834 3903 3826 4884 3115 2093 896
This also violates one of the requirements of the test. We are supposed to have the same number of entries in each category and I do not. I continue anyway. Let’s visualize!
ggplot(diamonds, aes(x = cut, y =price, color = color))+
geom_boxplot()
Time to run the ANOVA
twoWayAnova <- aov(price ~ cut*color, data = diamonds)
summary(twoWayAnova)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 1.104e+10 2.760e+09 181.405 <2e-16 ***
## color 6 2.551e+10 4.251e+09 279.371 <2e-16 ***
## cut:color 24 1.653e+09 6.889e+07 4.527 1e-12 ***
## Residuals 53905 8.203e+11 1.522e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So here we examined if cut, color and the interaction between the two will have an effect on the price. Since all are significant they do.
Let’s check the requirements are met
plot(twoWayAnova, 1:2)
Both of these graphs are actually problematic. The QQ-Plot is showing that the data is clearly not normal. What is normal about diamonds anyway? The first plot is clearly showing a trend as well. We expect the red line to stay near the dotted line and that is not the case. These give us a hint that I should not have used this test! We’ll explore what to do in this case during the 5th and final module.