Grafen and Hales Chapter 1: ANOVA

library(tidyverse)
library(here)
library(cowplot)

fertilizer <- fertilizer %>%
  mutate(FERTIL=as.factor(FERTIL))

The basic principles of ANOVA

In a simple case we consider the comparison of three means. This is done by the analysis of variance (ANOVA). In this case we will go through an example in detail and work out all the mechanics, but once we have done that and seen how the output is derived from the input we will not need to do it again. We will use R to do the heavy lifting. We will just need to know when it is appropriate to use ANOVA, how to get R to do it and how to interpret the output that R produces.

g<-ggplot(fertilizer,aes(x=FERTIL,y=YIELD, fill= FERTIL,alpha=0.1))+
  geom_boxplot()+
  geom_point(aes(colour=FERTIL))+
  scale_y_continuous(limits=c(0,10))+
  labs(x='Fertilizer', y='Yield')+
  theme_cowplot()+
  theme(legend.position = "none")
  
g

grand_mean=mean(fertilizer$YIELD)
grand_mean

## [1] 4.643667

SST - Total sum of squares

SST is the sum of squares of the deviations of the data around the grand mean. This is a measure of the total variability of the dataset.

SST=sum((fertilizer$YIELD-grand_mean)^2)
SST

## [1] 36.4449

f_means<-fertilizer %>%
  group_by(FERTIL) %>%
  summarise(fmean=mean(YIELD))
f_means

fertilizer<-mutate(fertilizer,fmean=c(rep(f_means$fmean[1],10),rep(f_means$fmean[2],10),rep(f_means$fmean[3],10)))

f1<-filter(fertilizer,FERTIL==1)
f2<-filter(fertilizer,FERTIL==2)
f3<-filter(fertilizer,FERTIL==3)

g1<-ggplot(fertilizer,aes(x=plot,y=YIELD))+
  geom_point()+
  scale_y_continuous(limits=c(0,10))+
  labs(x='Plot number',y='Yield per plot (tonnes)')+
  theme_cowplot()
g1

Deviations from the grand mean

SST.plot<-g1+geom_hline(yintercept=grand_mean,linetype='dashed')+
  geom_segment(aes(x = fertilizer$plot, y = fertilizer$YIELD, xend = fertilizer$plot, yend = grand_mean),linetype='dotted')
SST.plot

Deviations from the means for each fertilizer

g2<-ggplot()+
  
  geom_point(data=f1,aes(x=plot,y=YIELD))+
  geom_segment(aes(x=min(f1$plot),y=f1$fmean[1],xend=max(f1$plot),yend=f1$fmean[1]))+
  geom_segment(aes(x = f1$plot, y = f1$YIELD, xend = f1$plot, yend = f1$fmean[1]),linetype='dotted')+
  
  geom_point(data=f2,aes(x=plot,y=YIELD))+
  geom_segment(aes(x=min(f2$plot),y=f2$fmean[1],xend=max(f2$plot),yend=f2$fmean[1]))+
  geom_segment(aes(x = f2$plot, y = f2$YIELD, xend = f2$plot, yend = f2$fmean[1]),linetype='dotted')+
  
  geom_point(data=f3,aes(x=plot,y=YIELD))+
  geom_segment(aes(x=min(f3$plot),y=f3$fmean[1],xend=max(f3$plot),yend=f3$fmean[1]))+
  geom_segment(aes(x = f3$plot, y = f3$YIELD, xend = f3$plot, yend = f3$fmean[1]),linetype='dotted')+
  
  scale_y_continuous(limits=c(0,10))+
  labs(x='Plot number',y='Yield per plot (tonnes)')+
  
  theme_cowplot()


g2

When the three group means are fitted, there is an obvious reduction in variability around the three means compared to that around the grand mean, but it is not obvious if the fertilizers hgave had an effect on yield.

At what point do we decide if the amount of variation explained by fitting the means is significant? By this, we mean, "When is the variability between the group means greater than we would expect by chance alone?

Measaures of variability

SST - Total sum of squares

SST=sum((fertilizer$YIELD-grand_mean)^2)
SST

## [1] 36.4449

SST is the total sum of squares. It is the sum of squares of the deviations of the data around the grand mean. This is a measure of the total variability of the dataset.

SSE - Error sum of squares

SSE<-fertilizer %>%
  group_by(FERTIL) %>%
  mutate(fmean=mean(YIELD)) %>%
  mutate(se=(YIELD-fmean)^2) %>%
  summarise(sse=sum(se),.groups = 'drop') %>%
  summarise(SSE=sum(sse),.groups = 'drop') %>%
  pull(SSE)

SSE is the error sum of squares. It is the sum of the squares of the deviations of the data around the three separate group means. This is a measure of the variation betweeen plots that have been given the same fertilizer.

SSF - Fertilizer sum of squares

SSF<-fertilizer %>%
  group_by(FERTIL) %>%
  mutate(fmean=mean(YIELD)) %>%
  mutate(se=(fmean-grand_mean)^2) %>%
  summarise(sse=sum(se),.groups = 'drop') %>%
  summarise(SSF=sum(sse),.groups = 'drop') %>%
  pull(SSF)

SSF is the fertilizer sum of squares. This is the sum of the squares of the deviations of the group means from the grand mean. This is a measure of the variation between plots given different fertilizers.

Deviations of the means from the grand mean

g3<-ggplot()+
  
  geom_hline(yintercept=grand_mean,linetype='dashed')+
  
  geom_segment(aes(x=min(f1$plot),y=f1$fmean[1],xend=max(f1$plot),yend=f1$fmean[1]))+
  geom_segment(aes(x = mean(f1$plot), y = grand_mean, xend = mean(f1$plot), yend = f1$fmean[1]),linetype='dotted')+
  
  geom_segment(aes(x=min(f2$plot),y=f2$fmean[1],xend=max(f2$plot),yend=f2$fmean[1]))+
  geom_segment(aes(x = mean(f2$plot), y = grand_mean, xend = mean(f2$plot), yend = f2$fmean[1]),linetype='dotted')+
  
  geom_segment(aes(x=min(f3$plot),y=f3$fmean[1],xend=max(f3$plot),yend=f3$fmean[1]))+
  geom_segment(aes(x = mean(f3$plot), y = grand_mean, xend = mean(f3$plot), yend = f3$fmean[1]),linetype='dotted')+
  
  scale_y_continuous(limits=c(2.5,7.5))+
  labs(x='Plot number',y='Yield per plot (tonnes)')+
  
  theme_cowplot()

g3

SST = SSF + SSE

SST

## [1] 36.4449

SSF

## [1] 10.82275

SSE

## [1] 25.62215

SSF+SSE

## [1] 36.4449

So the total variability has been divided into two components. That due to differences between plot given different treatments and that due to differences between plots given the same treatment. Variability must be due to one or other of these components. Separating the total SS into its component SS is known as partitioning the sums of squares.

A comparison of SSF and SSE is going to indicate whether fitting the three fertilizer means accounts for a significant amount of variability.

However, to make a proper comparison, we really need to compare the variability per degree of freedom ie the variance.

Partitioning the degrees of freedom

Every sum of squares (SS) has been calculated using a number of independent pieces of information. In each, case, we call this number the number of degrees of freedom for the SS.

For SST this number is one less than the number of data points n. This is because when we calculate the deviations of each data point around a grand mean there are only n-1 of them that are independent, since by definition the sum of these deviations is zero, and so when n-1 of them have been calcluated, the final one is pre-determined.

Similarly, when we calculate SSF, which measures the deviation of the group means from the grand mean, we have \(k\)-1 degrees of freedom, (where in the present example \(k\), the number of treatments, is equal to three) since the deviations must sum to zero, so when \(k\)-1 of them have been calculated, the last one is pre-determined.

Finally, SSE, which measure deviation around the group means will have n-k degrees of freedom, since the sum of each of the deviations around one of the group means must sum to zero, and so when all but one of them have been calculated, the final one is pre-determined. There are \(k\) group means, so the total degrees of freedom for SSE is n-k.

The degrees of freedom are additive: \[ df(\text{SST}) = df(\text{SSE}) + df(\text{SSF}) \] Check:

\[\begin{align*} df(\text{SST}) &= n-1\\ df(\text{SSE}) &= k-1\\ df(\text{SSF}) &= n-k\\ \therefore df(\text{SSE}) + df(\text{SSF}) &= k-1 + n-k\\ &=n-1\\ &=df(\text{SST}) \end{align*}\]

Mean Squares

Now we can calculate the variances which are a measure f the amount of variability per degree of freedom.

In this context, we call them mean squares. To find each one we divided each of the sums of squares (SS) by their corresponding degrees of freedom.

Fertiliser Mean Square (FMS) = SSF / k - 1. This is the variation per df between plots given different fertilisers.

Error Mean Square (EMS) = SSE / n - k. This is the variation per df between plots given the same fertiliser.

Total Mean Square (TMS) = SST / n - 1. This is the total variance per df of the dataset.

Unlike the SS, the MS are not additive. That is, FMS + EMS \(\neq\) TMS.

F-ratios

If none of the fertilizers infliuenced yield, we would expect as much variation between the plots treated with the same fertilizer as between the plots treated with different fertilizers.

We can express this in terms of the mean squares: the mean square for fertilizer would be the same as the mean square for error:

\[ \frac{\text{FMS}}{\text{EMS}}=1 \] We call this ratio the F-ratio. It is the end result of ANOVA.

Even if the fertilizers were identical, the F-ratio is unlikely to be exactly 1 - it could by chance take a whole range of values. The F-distribution represents the range and likelihood of all possible F ratios under the null hypothesis. ie when the fertilizers were identical.

If the fertilizers were very different, then the FMS would be much greater than the EMS and the F-ratio would be greater than one. However it can be quite large even when there are no treatment differences. So how do we decide when the size of the F-ratio is due to treatment rather than to chance?

Traditionally, we decide that it sufficiently larger than one to be due to treatment differences if it would be this large or larger under the null hypothesis only 5% or less of the time. If we had inside knowledge that the null hypothesis was in fact true then we would still get an F-ratio that large or larger 5% of the time.

Our p-value ie the probability that the F-ratio would have been as large as it is or larger, under the null hypothesis, represents the strength of evidence against the null hypothesis. The smaller it is, the stronger the evidence, and only when it is less than 0.05 do we regard the evidence as strong enough to reject the null.

dfs<-c(rep("df = 2, 27",601),rep("df = 10, 27",601))
xs<-rep(seq(0,6,0.01),2)
ys<-c(df(xs[1:601],2,27),df(xs[602:1202],10,27))
fdata<-tibble(x=xs,y=ys,dfs=dfs)

fdata %>%
  mutate(dfs = fct_relevel(dfs,"df = 2, 27","df = 10, 27")) %>%
  ggplot(aes(x=xs,y=ys)) +
  xlim(0,6) +
  ylim(0,1) +
  labs(x="F-ratio",
       y="Probability density",
       caption="The F-distributions for (left) 2 and 27 degrees of freedom and (right) 10 and 27 degrees of freedom") +
  geom_line(colour="darkblue") +
  facet_wrap(~dfs) +
  theme_cowplot()

An example of ANOVA

Is the data tidy?

Yes, our fertilizer data is tidy.

The question

Does yield depend on fertilizer?

fertil.model<-lm(YIELD~FERTIL,data=fertilizer)

anova(fertil.model)

```