Two-way ANOVA

In this exercise, we are going to fit a two-way ANOVA using the data set ‘soil’ in the provided soil.csv file. This data set contains soil phosphorus concentrations (\(mg\ kg^{-1}\)) derived from two different soil types: an alluvial soil (soil in proximity to a river experiencing regular flooding) and soil from an upland terrace (no soil nutrient replenishment through flooding). Soil samples were taken from the A-, B- and C-horizon. In statistical terms we have a continuous response variable and two categorical predictors (soil type ‘loc’ and soil horizon ‘horizon’). Because we have two predictor variables, the resulting model is called a two-way ANOVA.

  1. Fit a two-way ANOVA model using the explanatory variables ‘loc’ and ‘horizon’ including their interaction.
  2. Inspect the model diagnostic plots and comment on whether the variance homogeneity and the normality assumption are met.
  3. If there is a significant interaction term, follow up with a post-ho comparison, i.e. slice the interaction using the functions in the emmeans package.
  4. Report the overall significance of the interaction term as you would in a paper and produce a publication-quality figure with lower case letters indicating the results of the post-hoc test.
## First: Set your working directory (in the menu bar click on 'Session' => 'Set Working Directory' => 'Choose Directory...'. Copy working directory from the console and paste it into the RMarkdown script.)

## Read in the data set
soilp <- read.csv("soil.csv")

## Check structure and get data summary
head(soilp)
##        loc horizon      p depth.cm
## 1 alluvial       A 345.45     0-10
## 2 alluvial       A 220.64     0-10
## 3 alluvial       A 224.98     0-10
## 4 alluvial       A 246.72     0-10
## 5 alluvial       A 286.60     0-10
## 6 alluvial       A 283.31     0-10
str(soilp)
## 'data.frame':    54 obs. of  4 variables:
##  $ loc     : Factor w/ 2 levels "alluvial","terrace": 1 1 1 1 1 1 1 1 1 1 ...
##  $ horizon : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 2 ...
##  $ p       : num  345 221 225 247 287 ...
##  $ depth.cm: Factor w/ 3 levels "0-10","20-30",..: 1 1 1 1 1 1 1 1 1 3 ...
summary(soilp)
##        loc     horizon       p            depth.cm 
##  alluvial:27   A:18    Min.   : 36.72   0-10  :18  
##  terrace :27   B:18    1st Qu.:117.07   20-30 :18  
##                C:18    Median :156.34   Oct-20:18  
##                        Mean   :164.05              
##                        3rd Qu.:201.25              
##                        Max.   :345.45
## Graphical data exploration
library(ggplot2)
ggplot(data = soilp, mapping = aes(x = horizon, y = p)) + geom_boxplot() + facet_wrap(facets = ~ loc)

## Fit a linear model
m1 <- lm(p ~ loc * horizon, data = soilp)

par(mfrow = c(2, 2))
plot(m1)

summary(m1)
## 
## Call:
## lm(formula = p ~ loc * horizon, data = soilp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.208 -34.988  -0.961  27.502  91.298 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           260.49      14.56  17.892  < 2e-16 ***
## locterrace           -105.55      20.59  -5.126 5.23e-06 ***
## horizonB              -90.38      20.59  -4.390 6.21e-05 ***
## horizonC             -115.27      20.59  -5.599 1.02e-06 ***
## locterrace:horizonB    85.06      29.12   2.921   0.0053 ** 
## locterrace:horizonC    64.26      29.12   2.207   0.0321 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.68 on 48 degrees of freedom
## Multiple R-squared:  0.572,  Adjusted R-squared:  0.5274 
## F-statistic: 12.83 on 5 and 48 DF,  p-value: 6.159e-08
summary.aov(m1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## loc          1  41993   41993  22.015 2.28e-05 ***
## horizon      2  62683   31342  16.431 3.66e-06 ***
## loc:horizon  2  17697    8848   4.639   0.0144 *  
## Residuals   48  91561    1908                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pf(q = 4.639, df1 = 2, df2 = 48, lower.tail = F)
## [1] 0.01439068
library(emmeans)
slice <- emmeans(object = m1, specs = "horizon", by = "loc")
slice
## loc = alluvial:
##  horizon   emmean       SE df  lower.CL upper.CL
##  A       260.4856 14.55842 48 231.21389 289.7572
##  B       170.1056 14.55842 48 140.83389 199.3772
##  C       145.2167 14.55842 48 115.94501 174.4883
## 
## loc = terrace:
##  horizon   emmean       SE df  lower.CL upper.CL
##  A       154.9389 14.55842 48 125.66723 184.2105
##  B       149.6222 14.55842 48 120.35056 178.8939
##  C       103.9278 14.55842 48  74.65612 133.1994
## 
## Confidence level used: 0.95
pairs(slice)
## loc = alluvial:
##  contrast   estimate       SE df t.ratio p.value
##  A - B     90.380000 20.58871 48   4.390  0.0002
##  A - C    115.268889 20.58871 48   5.599  <.0001
##  B - C     24.888889 20.58871 48   1.209  0.4538
## 
## loc = terrace:
##  contrast   estimate       SE df t.ratio p.value
##  A - B      5.316667 20.58871 48   0.258  0.9639
##  A - C     51.011111 20.58871 48   2.478  0.0436
##  B - C     45.694444 20.58871 48   2.219  0.0781
## 
## P value adjustment: tukey method for comparing a family of 3 estimates