source: https://www.scribbr.com/statistics/anova-in-r/
ANOVA is a statistical test for estimating how a quantitative dependent variable changes according to the levels of one or more categorical independent variables. ANOVA tests whether there is a difference in means of the groups at each level of the independent variable.
The null hypothesis (H0) of the ANOVA is that there is no difference in group means, and the alternative hypothesis (Ha) is that at least one group mean differs from the others.
The first few rows of the data look like this:
##   density block fertilizer    yield
## 1       1     1          1 177.2287
## 2       2     2          1 177.5500
## 3       1     3          1 176.4085
## 4       2     4          1 177.7036
## 5       1     1          1 177.1255
## 6       2     2          1 176.7783
These are the data types and a summary of each variable:
## 'data.frame': 96 obs. of 4 variables:
## $ density : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
## $ block : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
## $ fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ yield : num 177 178 176 178 177 ...
##  density block  fertilizer     yield
##  1:48    1:24   1:32       Min.   :175.4
##  2:48    2:24   2:32       1st Qu.:176.5
##          3:24   3:32       Median :177.1
##          4:24              Mean   :177.0
##                            3rd Qu.:177.4
##                            Max.   :179.1
Standard deviations of yield for density 1 and density 2:
## [1] 0.6072479
## [1] 0.6441439
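Output of this kind typically comes from `head()`, `str()`, `summary()` and `sd()`. The following is a minimal sketch on a synthetic stand-in with the same layout (96 rows, 2 densities, 4 blocks, 3 fertilizers). The simulated yields and the effect sizes used to generate them are assumptions for illustration only, so the numbers will not match the output above.

```r
# Synthetic stand-in for the crop data (assumption: simulated yields,
# same factor layout as shown in the str() output above).
set.seed(1)
crop.data <- data.frame(
  density    = factor(rep(1:2, times = 48)),
  block      = factor(rep(1:4, times = 24)),
  fertilizer = factor(rep(1:3, each = 32))
)
crop.data$yield <- 176.5 + 0.3 * as.integer(crop.data$fertilizer) +
  0.4 * as.integer(crop.data$density) + rnorm(96, sd = 0.6)

head(crop.data)      # first six rows
str(crop.data)       # variable types and factor levels
summary(crop.data)   # counts per factor level, quartiles of yield

# Standard deviation of yield within each planting density
sd_by_density <- tapply(crop.data$yield, crop.data$density, sd)
print(sd_by_density)
```

Similar standard deviations across groups are worth checking because ANOVA assumes roughly equal variance within each group.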
In the one-way ANOVA example, we are modeling crop yield as a function of the type of fertilizer used.
##             Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer   2   6.07  3.0340   7.863  7e-04 ***
## Residuals   93  35.89  0.3859
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model summary first lists the independent variables being tested in the model (in this case we have only one, ‘fertilizer’) and the model residuals (‘Residuals’). All of the variation that is not explained by the independent variables is called residual variance.
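The rest of the values in the output table describe the independent variable and the residuals. Df is the degrees of freedom: the number of levels minus 1 for the independent variable, and the number of observations minus 1 minus the model's degrees of freedom for the residuals. Sum Sq is the sum of squares, the amount of variation attributed to each row. Mean Sq is the sum of squares divided by the degrees of freedom. F value is the mean square of the independent variable divided by the mean square of the residuals. Pr(>F) is the p value of the F statistic: the probability of seeing an F value at least this large if the null hypothesis were true.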
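The columns of the one-way table are linked by simple arithmetic, which can be checked from the printed values alone. This short sketch copies the numbers from the table above (they are rounded, so the results agree only up to rounding); the variable names are illustrative.

```r
# Values copied from the one-way ANOVA table above (rounded as printed)
df_fert  <- 2
df_resid <- 93
ms_fert  <- 3.0340
ms_resid <- 0.3859

# Mean Sq = Sum Sq / Df, so Sum Sq can be recovered from Mean Sq
ss_fert  <- ms_fert * df_fert     # close to the printed 6.07
ss_resid <- ms_resid * df_resid   # close to the printed 35.89

# F value = Mean Sq of the term / Mean Sq of the residuals
f_value <- ms_fert / ms_resid

# Pr(>F) = upper tail of the F distribution with (2, 93) degrees of freedom
p_value <- pf(f_value, df_fert, df_resid, lower.tail = FALSE)

print(round(f_value, 2))
print(signif(p_value, 1))
```

The recomputed F value and p value should land close to the 7.863 and 7e-04 shown in the table.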
The p value of the fertilizer variable is low (p < 0.001), which R marks with three asterisks, so it appears that the type of fertilizer used has a statistically significant effect on the final crop yield.
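A one-way table of this form is what `summary()` prints for a model fitted with `aov()`. A minimal sketch follows, assuming a synthetic stand-in for the crop data (simulated yields with made-up effect sizes), so its numbers will differ from those above; `one.way` is an illustrative name.

```r
# Synthetic stand-in for the crop data (assumption: simulated yields)
set.seed(1)
crop.data <- data.frame(
  density    = factor(rep(1:2, times = 48)),
  block      = factor(rep(1:4, times = 24)),
  fertilizer = factor(rep(1:3, each = 32))
)
crop.data$yield <- 176.5 + 0.3 * as.integer(crop.data$fertilizer) +
  0.4 * as.integer(crop.data$density) + rnorm(96, sd = 0.6)

# One-way ANOVA: yield as a function of fertilizer type only
one.way <- aov(yield ~ fertilizer, data = crop.data)
summary(one.way)

# The ANOVA table is also available as a data frame
one.way.table <- summary(one.way)[[1]]
print(one.way.table$Df)   # 2 for fertilizer, 93 for the residuals
```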
In the two-way ANOVA example, we are modeling crop yield as a function of type of fertilizer and planting density.
##             Df Sum Sq Mean Sq F value   Pr(>F)
## fertilizer   2  6.068   3.034   9.073 0.000253 ***
## density      1  5.122   5.122  15.316 0.000174 ***
## Residuals   92 30.765   0.334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
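The additive two-way model joins the two independent variables with a plus sign in the model formula. This is a minimal sketch on a synthetic stand-in for the crop data (simulated yields, so the numbers will differ from the table above); `two.way` is an illustrative name.

```r
# Synthetic stand-in for the crop data (assumption: simulated yields)
set.seed(1)
crop.data <- data.frame(
  density    = factor(rep(1:2, times = 48)),
  block      = factor(rep(1:4, times = 24)),
  fertilizer = factor(rep(1:3, each = 32))
)
crop.data$yield <- 176.5 + 0.3 * as.integer(crop.data$fertilizer) +
  0.4 * as.integer(crop.data$density) + rnorm(96, sd = 0.6)

# Two-way ANOVA with additive effects of fertilizer and density
two.way <- aov(yield ~ fertilizer + density, data = crop.data)
summary(two.way)

two.way.table <- summary(two.way)[[1]]
print(two.way.table$Df)   # fertilizer, density, residuals
```

Adding density moves one degree of freedom, and the variation it explains, out of the residuals, which is why the F value for fertilizer changes between the one-way and two-way tables.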
Sometimes you have reason to think that two of your independent variables have an interaction effect rather than an additive effect.
For example, in our crop yield experiment, it is possible that planting density affects the plants’ ability to take up fertilizer. This might influence the effect of fertilizer type in a way that isn’t accounted for in the two-way model.
To test whether two variables have an interaction effect in ANOVA, use an asterisk instead of a plus sign between them in the model formula:
##                    Df Sum Sq Mean Sq F value   Pr(>F)
## fertilizer          2  6.068   3.034   9.001 0.000273 ***
## density             1  5.122   5.122  15.195 0.000186 ***
## fertilizer:density  2  0.428   0.214   0.635 0.532500
## Residuals          90 30.337   0.337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
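In an R formula, `fertilizer * density` expands to both main effects plus the `fertilizer:density` interaction term, which is why the table has three model rows. A minimal sketch on a synthetic stand-in for the crop data follows (simulated yields without a built-in interaction, so the numbers will differ from the table above); `interaction.model` is an illustrative name.

```r
# Synthetic stand-in for the crop data (assumption: simulated yields)
set.seed(1)
crop.data <- data.frame(
  density    = factor(rep(1:2, times = 48)),
  block      = factor(rep(1:4, times = 24)),
  fertilizer = factor(rep(1:3, each = 32))
)
crop.data$yield <- 176.5 + 0.3 * as.integer(crop.data$fertilizer) +
  0.4 * as.integer(crop.data$density) + rnorm(96, sd = 0.6)

# The asterisk adds the interaction on top of the two main effects
interaction.model <- aov(yield ~ fertilizer * density, data = crop.data)
summary(interaction.model)

# Terms the formula expands to
model.terms <- attr(terms(yield ~ fertilizer * density), "term.labels")
print(model.terms)

interaction.table <- summary(interaction.model)[[1]]
print(interaction.table$Df)
```

In the table above, the interaction row has a small sum of squares and a high p value (0.5325), so the interaction explains little additional variation in that data.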
If you have grouped your experimental treatments in some way, or if you have a confounding variable that might affect the relationship you are interested in testing, you should include that element in the model as a blocking variable.
##             Df Sum Sq Mean Sq F value   Pr(>F)
## fertilizer   2  6.068   3.034   9.018 0.000269 ***
## density      1  5.122   5.122  15.224 0.000184 ***
## block        2  0.486   0.243   0.723 0.488329
## Residuals   90 30.278   0.336
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
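A blocking variable is added to the formula like any other term. The sketch below uses a synthetic stand-in for the crop data (simulated yields with no block effect built in, so the numbers will differ from the table above); `blocking` is an illustrative name. In this stand-in each block contains a single planting density, matching the pattern in the first rows of the `str()` output, so one of the three block contrasts is already captured by density. That appears to be why block has 2 degrees of freedom in the table rather than 3.

```r
# Synthetic stand-in for the crop data (assumption: simulated yields)
set.seed(1)
crop.data <- data.frame(
  density    = factor(rep(1:2, times = 48)),
  block      = factor(rep(1:4, times = 24)),
  fertilizer = factor(rep(1:3, each = 32))
)
crop.data$yield <- 176.5 + 0.3 * as.integer(crop.data$fertilizer) +
  0.4 * as.integer(crop.data$density) + rnorm(96, sd = 0.6)

# Each block holds only one density in this layout
print(table(crop.data$block, crop.data$density))

# Two-way ANOVA with block as an additional term
blocking <- aov(yield ~ fertilizer + density + block, data = crop.data)
summary(blocking)

blocking.table <- summary(blocking)[[1]]
print(blocking.table$Df)
```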
ANOVA tells us if there are differences among group means, but not what the differences are. To find out which groups are statistically different from one another, you can perform a Tukey’s Honestly Significant Difference (Tukey’s HSD) post-hoc test for pairwise comparisons:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = yield ~ fertilizer + density, data = crop.data)
##
## $fertilizer
##          diff         lwr       upr     p adj
## 2-1 0.1761687 -0.16822506 0.5205625 0.4452958
## 3-1 0.5991256  0.25473179 0.9435194 0.0002219
## 3-2 0.4229569  0.07856306 0.7673506 0.0119381
##
## $density
##         diff       lwr       upr     p adj
## 2-1 0.461956 0.2275204 0.6963916 0.0001741
From the post-hoc test results, we see that there are statistically significant differences (p < 0.05) between fertilizer types 3 and 1 and between fertilizer types 3 and 2, but the difference between fertilizer types 2 and 1 is not statistically significant (p = 0.45). There is also a significant difference between the two levels of planting density.
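The `Fit:` line in the output shows that the post-hoc test was run on the additive model `yield ~ fertilizer + density`. A minimal sketch of the same call with base R's `TukeyHSD()` follows, on a synthetic stand-in for the crop data (simulated yields, so the differences and p values will not match those above); `two.way` and `tukey` are illustrative names.

```r
# Synthetic stand-in for the crop data (assumption: simulated yields)
set.seed(1)
crop.data <- data.frame(
  density    = factor(rep(1:2, times = 48)),
  block      = factor(rep(1:4, times = 24)),
  fertilizer = factor(rep(1:3, each = 32))
)
crop.data$yield <- 176.5 + 0.3 * as.integer(crop.data$fertilizer) +
  0.4 * as.integer(crop.data$density) + rnorm(96, sd = 0.6)

two.way <- aov(yield ~ fertilizer + density, data = crop.data)

# Pairwise comparisons with a 95% family-wise confidence level
tukey <- TukeyHSD(two.way, conf.level = 0.95)
print(tukey)

# Keep only the fertilizer pairs whose adjusted p value is below 0.05
fert <- tukey$fertilizer
print(fert[fert[, "p adj"] < 0.05, , drop = FALSE])
```

Each row is one pairwise comparison: `diff` is the difference in group means, `lwr` and `upr` bound its confidence interval, and `p adj` is the p value adjusted for multiple comparisons. A pair is significant at the 5% level when its interval excludes zero.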