Example ANOVA

source: https://www.scribbr.com/statistics/anova-in-r/

ANOVA (analysis of variance) is a statistical test for estimating how a quantitative dependent variable changes according to the levels of one or more categorical independent variables. It tests whether the group means differ across the levels of the independent variable.

The null hypothesis (H0) of the ANOVA is that there is no difference among the group means, and the alternative hypothesis (Ha) is that at least one group mean differs from the others.
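For the three fertilizer types used in the example below, the two hypotheses can be written out as follows. The symbol \mu_k, standing for the mean yield under fertilizer k, is notation introduced here for illustration and does not appear in the source.

```latex
H_0:\ \mu_1 = \mu_2 = \mu_3
\qquad \text{versus} \qquad
H_a:\ \mu_i \neq \mu_j \ \text{ for at least one pair } i \neq j
```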

The first six rows of the data look like this:

##   density block fertilizer    yield
## 1       1     1          1 177.2287
## 2       2     2          1 177.5500
## 3       1     3          1 176.4085
## 4       2     4          1 177.7036
## 5       1     1          1 177.1255
## 6       2     2          1 176.7783

These are the data types and a summary of each variable:

## 'data.frame':    96 obs. of  4 variables:
##  $ density   : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2 ...
##  $ block     : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
##  $ fertilizer: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ yield     : num  177 178 176 178 177 ...

##  density block  fertilizer     yield      
##  1:48    1:24   1:32       Min.   :175.4  
##  2:48    2:24   2:32       1st Qu.:176.5  
##          3:24   3:32       Median :177.1  
##          4:24              Mean   :177.0  
##                            3rd Qu.:177.4  
##                            Max.   :179.1

Basic plots

Standard deviations of yield for density 1 and density 2:

## [1] 0.6072479

## [1] 0.6441439
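
The two values above are sample standard deviations of yield within each planting density. As a minimal sketch of that calculation, the Python snippet below applies `statistics.stdev` to only the six rows printed earlier, so its results are illustrative and will not match the full-data values of 0.607 and 0.644. The function name `sd_by_density` is hypothetical.

```python
import statistics

# (density, yield) pairs copied from the six rows shown earlier.
# The full data set has 96 rows, so these results are only illustrative.
rows = [
    (1, 177.2287),
    (2, 177.5500),
    (1, 176.4085),
    (2, 177.7036),
    (1, 177.1255),
    (2, 176.7783),
]


def sd_by_density(rows):
    """Return {density level: sample standard deviation of yield}."""
    groups = {}
    for density, crop_yield in rows:
        groups.setdefault(density, []).append(crop_yield)
    # statistics.stdev divides by n - 1, the same convention as R's sd()
    return {level: statistics.stdev(values) for level, values in groups.items()}


for level, sd in sorted(sd_by_density(rows).items()):
    print(f"density {level}: sd = {sd:.4f}")
```

With three observations per group the estimates are very noisy; the point is only to show which quantity the output reports.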

Figure: Yield of crops by plantation block

Figure: Yield of crops by fertilizer used

One-way ANOVA

In the one-way ANOVA example, we are modeling crop yield as a function of the type of fertilizer used.

##             Df Sum Sq Mean Sq F value Pr(>F)    
## fertilizer   2   6.07  3.0340   7.863  7e-04 ***
## Residuals   93  35.89  0.3859                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model summary first lists the independent variables being tested in the model (in this case only one, ‘fertilizer’) and the model residuals (‘Residuals’). All of the variation that is not explained by the independent variables is called residual variance.

The rest of the values in the output table describe the independent variable and the residuals:

The Df column displays the degrees of freedom for the independent variable (the number of levels in the variable minus 1, here 3 - 1 = 2) and the degrees of freedom for the residuals (the total number of observations minus 1 and minus the degrees of freedom of the independent variables, here 96 - 1 - 2 = 93).
The Sum Sq column displays the sum of squares: for the independent variable, the variation between the group means and the overall mean; for the residuals, the variation of the observations around their group means.
The Mean Sq column is the mean square, calculated by dividing each sum of squares by its degrees of freedom.
The F value column is the test statistic from the F test: the mean square of each independent variable divided by the mean square of the residuals. The larger the F value, the less plausible it is that the differences among group means arose by chance alone.
The Pr(>F) column is the p value of the F statistic. It shows how likely an F value at least as large as the one calculated would be if the null hypothesis of no difference among group means were true.
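
The column definitions above can be checked by hand against the one-way table. The sketch below recomputes Df, Mean Sq, F value and Pr(>F) from the printed sums of squares. Two caveats: the inputs are rounded as printed, so the results agree with the table only approximately, and the helper `f_tail_df1_2` (a hypothetical name) uses a closed-form tail probability that holds only when the numerator has 2 degrees of freedom, which happens to be the case for fertilizer. A general F distribution would need a statistics library.

```python
# Values read from the one-way ANOVA table (rounded as printed there)
n_obs = 96            # total number of observations
n_levels = 3          # number of fertilizer levels
ss_fertilizer = 6.07  # Sum Sq, fertilizer row
ss_residual = 35.89   # Sum Sq, Residuals row

# Degrees of freedom
df_fertilizer = n_levels - 1            # 3 - 1 = 2
df_residual = n_obs - 1 - df_fertilizer  # 96 - 1 - 2 = 93

# Mean squares: sum of squares divided by its degrees of freedom
ms_fertilizer = ss_fertilizer / df_fertilizer
ms_residual = ss_residual / df_residual

# F statistic: between-group mean square over residual mean square
f_value = ms_fertilizer / ms_residual


def f_tail_df1_2(f, df2):
    """P(F > f) for an F distribution with 2 and df2 degrees of freedom.

    This closed form is valid only for 2 numerator degrees of freedom.
    """
    return (df2 / (df2 + 2.0 * f)) ** (df2 / 2.0)


p_value = f_tail_df1_2(f_value, df_residual)

print(f"Df: fertilizer = {df_fertilizer}, residuals = {df_residual}")
print(f"Mean Sq: fertilizer = {ms_fertilizer:.4f}, residuals = {ms_residual:.4f}")
print(f"F value = {f_value:.3f}")
print(f"Pr(>F) = {p_value:.1e}")
```

The recomputed F value and p value land close to the 7.863 and 7e-04 reported in the table; the small gaps come from rounding in the printed sums of squares.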

The p value of the fertilizer variable is low (p < 0.001, marked with three asterisks), so it appears that the type of fertilizer used has a real effect on the final crop yield.

Two-way ANOVA

In the two-way ANOVA example, we are modeling crop yield as a function of type of fertilizer and planting density.

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer   2  6.068   3.034   9.073 0.000253 ***
## density      1  5.122   5.122  15.316 0.000174 ***
## Residuals   92 30.765   0.334                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
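
Comparing this table with the one-way table shows why the F value for fertilizer rises from 7.863 to 9.073 even though its sum of squares is unchanged: the variation explained by density is moved out of the residuals, so the residual mean square in the denominator shrinks. The sketch below checks this arithmetic using the printed (rounded) values only.

```python
# Values read from the one-way and two-way tables (rounded as printed)
ss_resid_one_way = 35.89   # Residuals, model with fertilizer only
ss_fertilizer = 6.068      # fertilizer row, two-way table
ss_density = 5.122         # density row, two-way table
ss_resid_two_way = 30.765  # Residuals, two-way table
df_resid_two_way = 92      # 96 - 1 - 2 (fertilizer) - 1 (density)

# Adding density moves its sum of squares out of the residuals
ss_resid_expected = ss_resid_one_way - ss_density

# Both F values share the same, smaller residual mean square
ms_resid = ss_resid_two_way / df_resid_two_way
f_fertilizer = (ss_fertilizer / 2) / ms_resid
f_density = (ss_density / 1) / ms_resid

print(f"residual Sum Sq expected {ss_resid_expected:.3f}, printed {ss_resid_two_way:.3f}")
print(f"residual Mean Sq = {ms_resid:.3f}")
print(f"F fertilizer = {f_fertilizer:.3f}, F density = {f_density:.3f}")
```

The expected and printed residual sums of squares differ only in the third decimal place, which is consistent with rounding in the one-way table.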

Adding interactions between variables

Sometimes you have reason to think that two of your independent variables have an interaction effect rather than an additive effect.

For example, in our crop yield experiment, it is possible that planting density affects the plants’ ability to take up fertilizer. This might influence the effect of fertilizer type in a way that isn’t accounted for in the two-way model.
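
One way to make the distinction concrete is to write both models out. In the notation assumed here (not taken from the source), y_{ijk} is the yield of plot k under fertilizer i and density j, \alpha_i and \beta_j are the fertilizer and density effects, and (\alpha\beta)_{ij} is the extra term that lets the fertilizer effect depend on density.

```latex
\text{Additive model:}\quad
y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}

\text{Interaction model:}\quad
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
```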

To test whether two variables have an interaction effect in ANOVA, use an asterisk instead of a plus sign between them in the model formula:

##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer          2  6.068   3.034   9.001 0.000273 ***
## density             1  5.122   5.122  15.195 0.000186 ***
## fertilizer:density  2  0.428   0.214   0.635 0.532500    
## Residuals          90 30.337   0.337                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
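
The fertilizer:density row can be read as a comparison between two models: its sum of squares is the drop in residual variation when the interaction term is added to the additive model. The sketch below reproduces that row from the residual lines of the two tables. The formula in the second comment is presumed from the asterisk rule described above; the source does not print it.

```python
# Residual lines read from the two tables (rounded as printed)
# Additive model: yield ~ fertilizer + density
ss_resid_additive, df_resid_additive = 30.765, 92
# Interaction model, presumably yield ~ fertilizer * density
ss_resid_interaction, df_resid_interaction = 30.337, 90

# The interaction row is the difference between the two residual lines
ss_interaction = ss_resid_additive - ss_resid_interaction
df_interaction = df_resid_additive - df_resid_interaction

# F statistic for the interaction term
ms_interaction = ss_interaction / df_interaction
ms_resid = ss_resid_interaction / df_resid_interaction
f_interaction = ms_interaction / ms_resid

print(f"interaction: Df = {df_interaction}, Sum Sq = {ss_interaction:.3f}")
print(f"F value = {f_interaction:.3f}")
```

An F value below 1 means the interaction term explains less variation per degree of freedom than the residual noise, which agrees with the large p value (0.53) in the table: there is no evidence here that the effect of fertilizer depends on planting density.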

Adding a blocking variable

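If you have grouped your experimental treatments in some way, or if you have a confounding variable that might affect the relationship you are interested in testing, you should include that element in the model as a blocking variable. In the output below, block takes only 2 degrees of freedom although it has 4 levels. In the rows printed earlier, blocks 1 and 3 appear only with density 1 and blocks 2 and 4 only with density 2, which suggests that density already accounts for one of the three block contrasts.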

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer   2  6.068   3.034   9.018 0.000269 ***
## density      1  5.122   5.122  15.224 0.000184 ***
## block        2  0.486   0.243   0.723 0.488329    
## Residuals   90 30.278   0.336                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc test

ANOVA tells us whether there are differences among group means, but not which groups differ. To find out which groups are statistically different from one another, you can perform Tukey’s Honestly Significant Difference (Tukey’s HSD) post-hoc test for pairwise comparisons:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = yield ~ fertilizer + density, data = crop.data)
## 
## $fertilizer
##          diff         lwr       upr     p adj
## 2-1 0.1761687 -0.16822506 0.5205625 0.4452958
## 3-1 0.5991256  0.25473179 0.9435194 0.0002219
## 3-2 0.4229569  0.07856306 0.7673506 0.0119381
## 
## $density
##         diff       lwr       upr     p adj
## 2-1 0.461956 0.2275204 0.6963916 0.0001741

From the post-hoc test results, we see that there are statistically significant differences (p < 0.05) between fertilizer groups 3 and 1 and between fertilizer groups 3 and 2, but the difference between fertilizer groups 2 and 1 is not statistically significant. There is also a significant difference between the two levels of planting density.
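
The same conclusions can be read off the confidence intervals: a pairwise difference is significant at the 5% level exactly when its 95% family-wise interval (lwr to upr) excludes zero. The sketch below applies both criteria to the rows of the Tukey output and shows that they agree. The labels and the function name `summarize` are hypothetical; the numbers are copied from the output above.

```python
# (comparison, diff, lwr, upr, p adj) copied from the Tukey HSD output
tukey_rows = [
    ("fertilizer 2-1", 0.1761687, -0.16822506, 0.5205625, 0.4452958),
    ("fertilizer 3-1", 0.5991256, 0.25473179, 0.9435194, 0.0002219),
    ("fertilizer 3-2", 0.4229569, 0.07856306, 0.7673506, 0.0119381),
    ("density 2-1", 0.461956, 0.2275204, 0.6963916, 0.0001741),
]

ALPHA = 0.05  # matches the 95% family-wise confidence level


def summarize(rows, alpha=ALPHA):
    """Return {comparison: (p adj below alpha, interval excludes zero)}."""
    summary = {}
    for label, diff, lwr, upr, p_adj in rows:
        excludes_zero = lwr > 0 or upr < 0
        summary[label] = (p_adj < alpha, excludes_zero)
    return summary


for label, (by_p, by_ci) in summarize(tukey_rows).items():
    verdict = "significant" if by_p else "not significant"
    print(f"{label}: {verdict} (interval excludes zero: {by_ci})")
```

Only the fertilizer 2-1 interval spans zero, which matches the one adjusted p value above 0.05.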