Hi, Team!
Many of you are struggling with the concept of the One-Way Analysis of Variance (ANOVA), so I am going to break it down for you.
The ANOVA test is actually a set of methods that compare the variance explained by the model against the variance of the residual. ANOVA is used in experimental design, in Design of Experiments for simulation building, and in many other places.
The distribution of the ratio of two variances, \(\sigma^2_1 / \sigma^2_2\), is an \(F\) distribution. Since variances are nonnegative, the \(F\) is strictly positive. It can also be formulated as the ratio of two scaled \(\chi^2\) random variables, where \(F=\frac{\chi^2_m / m}{\chi^2_n / n}\).
(If you recall, the \(\chi^2\) is the distribution of a scaled sample variance, so this relationship makes perfect sense.)
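You can convince yourself of this with a quick simulation. This is a minimal sketch of my own; the seed, the sample size, and the degrees of freedom are arbitrary choices.
# Simulate the ratio of two scaled chi-squared variables and overlay
# the theoretical F(3, 12) density; the two should agree closely
set.seed(42)
m=3; n=12
sim=(rchisq(1e5, m)/m)/(rchisq(1e5, n)/n)
hist(sim[sim < 10], breaks=100, freq=FALSE,
     main='Simulated Chi-Squared Ratio vs F(3,12)', xlab='x')
curve(df(x, m, n), from=0, to=10, col='red', add=TRUE)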
The \(F\) is a distribution completely determined by the degrees of freedom of the numerator (typically the model) and the degrees of freedom of the denominator (typically, the residual). This ratio forms the test statistic.
A final interesting note about this distribution is that an \(F(1,m)\) is the distribution of the square of a \(t(m)\) random variable. This is sometimes useful in interpreting regression models.
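A quick numerical check of that relationship (my own illustration; any denominator degrees of freedom will do):
qt(0.975, 12)^2                      # squared two-sided t critical value, ~4.747
qf(0.95, 1, 12)                      # matching upper-tail F(1,12) critical value
2*pt(2.5, 12, lower.tail=FALSE)      # two-sided t probability for t = 2.5...
pf(2.5^2, 1, 12, lower.tail=FALSE)   # ...equals the upper-tail F probability of t^2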
The F distribution is the workhorse for Analysis of Variance.
The F-distribution, as the ratio of two chi-squared variables (or variances), is obviously nonnegative, existing from zero to positive infinity. Graphs of the distribution, with the numerator and denominator degrees of freedom varying from [1,10] to [10,1], follow.
mycol=rainbow(10)                      # one color per curve
for (i in seq(1,10)){
  myadd=ifelse(i==1, FALSE, TRUE)      # first iteration draws the plot; later ones overlay
  curve(df(x, i, 11-i), from = 0, to = 10, n=501,
        main = 'Different F Distributions', #add title
        ylab = 'Density', #change y-axis label
        ylim=c(0,1),
        lwd = 1,
        col = mycol[i], add=myadd)
}
tdf=c('[1,10]','[2,9]','[3,8]','[4,7]','[5,6]','[6,5]','[7,4]','[8,3]','[9,2]','[10,1]')  # [numerator df, denominator df]
legend("topright",            # add legend to plot
       title='DF and Color',
       legend = tdf,
       col = mycol,
       pch = 16)
Let’s start with a simple example known as a one-way ANOVA (meaning that we only care about one experimental effect). The one-way ANOVA is considered a global test, as it tests
\(H_0: \mu_1=\mu_2=\cdots=\mu_k\)
versus
\(H_a:\) not all \(\mu_j\) are equal.
This relationship to the F distribution makes sense, as I will explain. Assume that we selected an \(\alpha = .05\).
Why shouldn’t we just test each possible comparison of means? We would induce familywise error, meaning that we are likely to find spurious results based on the sheer number of tests we are running. (If we have 6 different factor levels, we would need to run 6 choose 2 = 15 hypothesis tests. If these are independent tests and we set \(\alpha=.05\), then the familywise \(\alpha=1-(1-.05)^{15}\approx 0.54\), which means we are likely to have at least one spurious result.) But if we can first determine globally that some of the means differ, we reduce the probability of a Type I error.
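In R, that familywise calculation is one line (a quick check of the arithmetic above):
ntests=choose(6,2)       # 15 pairwise comparisons among 6 groups
1-(1-0.05)^ntests        # ~0.54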
I am going to start with an example of three different cancer treatments (placebo, chemo, radiation) applied to 15 rats, where we are measuring the reduction or increase in the size of the tumor. We will assume that tumor growth is normally distributed in this case. The measurements and descriptive statistics (ungrouped and by factor level) follow.
# Descriptive statistics helper: mean, variance, and count
des=function(x){
  temp=c(mean(x), var(x), length(x))
  names(temp)=c('Mean', 'Var', 'n')
  return(noquote(temp))
}
# Tumor measurements: 5 rats per treatment, one treatment per column
mymat=matrix(c(4,5,2,4,4, 2,2,3,4,4, 3,1,1,3,1), ncol=3)
colnames(mymat)=c('Placebo', 'Chemo', 'Radiation')
rownames(mymat)=c('R1-3','R4-6','R7-9','R10-12', 'R13-15')
mymat
## Placebo Chemo Radiation
## R1-3 4 2 3
## R4-6 5 2 1
## R7-9 2 3 1
## R10-12 4 4 3
## R13-15 4 4 1
tot=rbind(des(as.vector(mymat)), des(mymat[,1]), des(mymat[,2]), des(mymat[,3]))
rownames(tot)=c('Total', colnames(mymat))
tot
## Mean Var n
## Total 2.866667 1.695238 15
## Placebo 3.800000 1.200000 5
## Chemo 3.000000 1.000000 5
## Radiation 1.800000 1.200000 5
In the one-way ANOVA, we are breaking down a single random variable (X = tumor size) into subsets based on a factor (\(x_1, x_2, x_3\)) with sizes \(n_1=n_2=n_3=5\). For now, we will assume that tumor size is normally distributed and that the variances of the treatment groups are reasonably equal. Further, we assume that group membership is independent (likely through blinding). If we have large numbers of observations in each group, we have some robustness to violations of assumptions and gain power.
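If you would rather check the equal-variance assumption than simply assume it, one option (my addition, not a required step) is Bartlett's test on the three treatment columns; a large p-value gives no evidence against homogeneity.
# Bartlett's test of homogeneity of variances across the three treatments
bartlett.test(list(mymat[,1], mymat[,2], mymat[,3]))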
So why do we use the F distribution to evaluate whether means are different? Well, this goes back to the idea of the global test. We should generally not test each pair of means due to familywise error. (If we do, we would need to correct the \(\alpha\) in one way or another.) We’ll come back to why testing variances works in a minute.
Let’s assume that tumor size is a normal random variable with a mean of \(\mu\) and a variance of \(\sigma^2\). The numerator of the variance of X can be broken down based on the grouping variable (treatment) and the remaining (residual) error by using the formula Sum of Squares Total = Sum of Squares Between Groups + Sum of Squares Within Groups.
\((n-1)Var(X)=SS_{Total}=SS_{Between}+SS_{Within}\)
\(Var(X)=MS_{Total}=\frac{SS_{Between}+SS_{Within}}{n-1}\)
In fact, the ratio of the mean squares formed from the two right-hand terms is an F statistic.
\(F_{df_1,df_2}=MS_{Between}/MS_{Within}\)
This formula provides a statistic in the F-distribution with degrees of freedom in the numerator equal to the degrees of freedom in \(MS_{Between}\) and the degrees of freedom in the denominator equal to the degrees of freedom in \(MS_{Within}\). How convenient!
Since the F distribution exists between zero and infinity, the test rejects the null (all means are equal) in favor of the alternative (not all means are equal) when the numerator is sufficiently larger than the denominator to attain statistical significance. Again, we will come back to this graphically.
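For our example below (3 treatment groups, 15 rats), the critical value is easy to preview:
qf(0.95, df1=2, df2=12)   # ~3.885; F statistics beyond this reject the null at alpha = .05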
\(SS_{Total}\) is the Sum of Squares Total, or the variance of tumor size multiplied by its denominator (the degrees of freedom, n-1), while \(MS_{Total}\) is the variance. The formulae for both follow.
\(SS_{Total}=\sum_{i=1}^{i=n}(x_i-\bar{\bar x})^2\)
\(MS_{Total}=Var(X)=\frac{SS_{Total}}{n-1}=\frac{SS_{Between}+SS_{Within}}{n-1}\)
\(\bar{\bar x}\) is referred to as the “grand mean,” or just the mean of tumor size. You can easily see that \(SS_{Total}\) is just the numerator of the variance (in our case, the overall tumor-growth variance). The following R code illustrates.
grandmean=round(mean(mymat),3)       # grand mean (rounded to 3 places)
variance=var(as.vector(mymat))       # MST = overall variance
dftot=length(as.vector(mymat))-1     # total degrees of freedom, n - 1
SST=round(variance*dftot,3)          # SST = variance * (n - 1)
test=ifelse(round(variance*dftot,3)==SST, "TRUE", "FALSE")
ret=noquote(c(grandmean, SST, round(variance,3), test))
names(ret)=c("Grand Mean", "SST", "MST=Variance", "SST=Var*df?")
ret
## Grand Mean SST MST=Variance SST=Var*df?
## 2.867 23.733 1.695 TRUE
The Sum of Squares Between, \(SS_{Between}\), is the part of the variance which captures the spread between each factor-level mean and the grand mean. The Mean Sum of Squares Between, \(MS_{Between}\), is the between-groups variance. Both are shown below.
\(SS_{Between}=\sum_{j=1}^{j=k}n_j(\bar x_j-\bar{\bar x})^2\)
\(MS_{Between}=\frac{SS_{Between}}{k-1}=\sum_{j=1}^{j=k}\frac{n_j(\bar x_j-\bar{\bar x})^2}{k-1}\)
In the first formula above, \(\bar x_j\) is the mean of the dependent variable for group \(j\), and \(n_j\) accounts for the number of observations in each group. In our example, there are three factor levels. If we divide \(SS_{Between}\) by its associated degrees of freedom (k-1), then we have the Mean Sum of Squares Between, or \(MS_{Between}\), which is the variance of the effect. In R, this is readily calculated using a simple function and apply.
myf=function(x){return(length(x)*(mean(x)-grandmean)^2)}  # n_j * (group mean - grand mean)^2; uses the rounded grand mean
(means=apply(mymat,2,myf))
## Placebo Chemo Radiation
## 4.352445 0.088445 5.692445
SSB=sum(apply(mymat,2,myf))  # sum over the k = 3 groups
MSB=SSB/2                    # divide by k - 1 = 2 degrees of freedom
names(SSB)='SSB'
names(MSB)='MSB'
c(SSB,MSB)
## SSB MSB
## 10.133335 5.066667
The Sum of Squares Within Groups can be calculated as \(SS_{Within}=SS_{Total}-SS_{Between}\), and the Mean Sum of Squares within Groups is \(MS_{Within}=\frac{SS_{Within}}{n-k}\). Both can be directly calculated as follows.
\(SS_{Within}=\sum_{j=1}^{j=k}\sum_{i=1}^{i=n_j}( x_{ij}-\bar x_j)^2\)
\(MS_{Within}=\frac{\sum_{j=1}^{j=k}\sum_{i=1}^{i=n_j}( x_{ij}-\bar x_j)^2}{n-k}\)
Similar to \(MS_{Between}\), \(MS_{Within}\) is calculated by dividing \(SS_{Within}\) by the remaining degrees of freedom, n-k.
In R, both of these formulae are readily calculated as follows.
myf2=function(x){sum((x-mean(x))^2)}  # within-group sum of squares
(myres=apply(mymat,2,myf2))
## Placebo Chemo Radiation
## 4.8 4.0 4.8
SSW=sum(myres)     # add the three within-group sums of squares
MSW=SSW/(15-3)     # divide by n - k = 12 degrees of freedom
names(SSW)='SSW'
names(MSW)='MSW'
c(SSW, MSW)
## SSW MSW
## 13.600000 1.133333
So what we have in toto follows:
\(SS_{Total}=\sum_{i=1}^{i=n}(x_i-\bar{\bar x})^2 = \sum_{j=1}^{j=k}n_j(\bar x_j-\bar{\bar x})^2+\sum_{j=1}^{j=k}\sum_{i=1}^{i=n_j}(x_{ij}-\bar x_j)^2\)
\(MS_{Total}=Var(X)=\frac{SS_{Total}}{n-1}\)
We are just breaking apart the variance into that accounted for by the (between group) factor levels and that accounted for by the within group variation.
NOTE: We cannot simply add the variances of the components, as that would not account for the weighting associated with the degrees of freedom.
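A quick check using the objects computed above makes the point: the sums of squares add up, but the mean squares do not.
c(SST=SST, 'SSB+SSW'=unname(SSB+SSW))                 # both ~23.733
c(MST=round(variance,3), 'MSB+MSW'=unname(MSB+MSW))   # 1.695 vs 6.2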
The F distribution looks at the ratio of the variances between versus within:
\(F_{df_1,df_2}=MS_{Between}/MS_{Within}\)
If \(MS_{Between}\) is large enough relative to \(MS_{Within}\), then the statistic will exceed the critical value of the F distribution (the p-value will be less than \(\alpha\)). Let’s look at our example.
mycol=rainbow(3)
mymeans=colMeans(mymat)
myf=function(x) sd(x)/sqrt(5)          # standard error of a group mean (n_j = 5)
mysd=apply(mymat,2,myf)
for (i in seq(1,3)){
  myadd=ifelse(i==1, FALSE, TRUE)      # first iteration draws the plot; later ones overlay
  curve(dnorm(x, mymeans[i], mysd[i]), from = -2, to = 8, n=501,
        main = 'Different Normal Distributions', #add title
        ylab = 'Density', #change y-axis label
        ylim=c(0,.9),
        lwd = 1,
        col = mycol[i], add=myadd)
  lines(x=rep(mymeans[i], 10), y=seq(0,1.5,length.out=10), col=mycol[i])  # vertical line at each mean
  text(mymeans[i]-.2, .1, paste("Xbar",i,":", mymeans[i]), srt=90)        # label each mean
}
text(grandmean-.2, .5, paste("Grand Mean: ", grandmean), srt=90, col='red')
points(grandmean,.3, col='red', pch=16)
mytext="SSB=5*(3.8-2.867)^2+"
mytext2="5*(3.0-2.867)^2+"
mytext3="5*(1.8-2.867)^2="
mytext4="10.1333"
mytext5="MSB=10.133/2=5.067"
mytext6="SSW=13.600"
mytext7="MSW=13.600/12=1.133"
mytext8="F(2,12)=5.067/1.133=4.471"
text(-.5, .7, mytext)
text(-.5, .6, mytext2)
text(-.5, .5, mytext3)
text(-.5, .4, mytext4)
text(-.5, .3, mytext5)
text(6, .7, mytext6)
text(6, .6, mytext7)
text(6,.5, mytext8)
Here, we can see that the means for Placebo, Chemo, and Radiation are not identical (3.8, 3.0, and 1.8, respectively). Further, we note that these means differ from the grand mean.
When we evaluate the squared differences between each of the group means and the grand mean (multiplied by the number of group observations), the \(MS_{Between}\) equals 5.067. When we form the F statistic with \(MS_{Within}\), which equals 1.133, we note that the F-statistic is 5.067/1.133=4.471. If we were to evaluate this statistic in the upper tail of an \(F_{2,12}\) distribution, we would note that the probability is 0.035. If \(\alpha\) were set at 0.05, we would reject the null hypothesis and conclude that not all means are equal.
pf(4.471,2,12, lower.tail=F)
## [1] 0.03539799
The point here is that the squared differences between the group means and the grand mean (multiplied by the number of observations) are much larger than the within differences. This means that the F-ratio is much larger than one, which we then check for statistical significance (and practical relevance).
But what would happen if the sample means were all equal? If the means were all equal, the numerator of the F statistic would be zero, which would generate an F-statistic of zero. No matter the degrees of freedom, a statistic of zero fails to reject in the right tail. So small values of the F-statistic (closer to zero than infinity) fail to reject the null that all means are equal.
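To see this in action, here is a small simulation of my own: when the three group means truly are equal, the F statistic is usually small and rejects only about 5% of the time, exactly the Type I error rate we chose.
# Simulate 1000 experiments with 15 rats, 3 groups, and identical true means
set.seed(1)
fstats=replicate(1000, {
  y=rnorm(15, mean=3, sd=1)              # every group has the same true mean
  g=factor(rep(1:3, each=5))
  summary(aov(y~g))[[1]][['F value']][1] # extract the F statistic
})
mean(fstats > qf(0.95, 2, 12))           # rejection rate; should be near 0.05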
We use an ANOVA table to report the results. While we have already calculated all values, I provide the R code for your consideration.
p1=as.vector(mymat)      # stack the 15 measurements into one vector
p2=rep(1:3, each=5)      # treatment: 1 = Placebo, 2 = Chemo, 3 = Radiation
newdf=data.frame(Tumor=p1, Tx=p2)
myaov=aov(newdf$Tumor~as.factor(newdf$Tx))
summary(myaov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(newdf$Tx) 2 10.13 5.067 4.471 0.0354 *
## Residuals 12 13.60 1.133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since \(p<\alpha\), we reject the null in favor of the alternative. Some means do differ, but we don’t know which ones. How do we find out? There are a bunch of tests to do so, some assuming homogeneity of variance of the groups and some that do not. Since we assumed homogeneity, we will go ahead and use Tukey’s Honestly Significant Difference. An easy way to look at Tukey’s HSD is to plot the confidence intervals, which are calculated as follows.
\(\bar y_{i.}-\bar y_{j.} \pm \frac{q_{\alpha, k, N-k}}{\sqrt 2}\,\hat \sigma \sqrt{2/n}\)
Here \(q\) is the critical value of the studentized range distribution used by Tukey’s HSD, \(k\) is the number of populations, and \(N-k\) is the degrees of freedom, where \(N\) is the total number of observations. \(q=\frac{\bar y_{max}-\bar y_{min}}{S \sqrt{2/n}}\), where \(\bar y_{max}-\bar y_{min}\) is the difference between the largest and smallest group means, and \(S\) is the pooled standard deviation of the groups.
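Before letting R do the work, we can reconstruct the half-width of these intervals by hand. For equal group sizes the formula above simplifies to \(q\hat\sigma/\sqrt n\); this sketch uses base R's qtukey for the studentized range quantile.
q=qtukey(0.95, nmeans=3, df=12)   # k = 3 groups, N - k = 12 df
q*sqrt(MSW/5)                     # half-width ~1.796; compare the intervals below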
Fortunately, we really don’t have to worry about doing these calculations, as R does them for us.
plot(TukeyHSD(myaov, conf.level=.95))   # TukeyHSD takes conf.level, not alpha
Looking at the graph, we can see that group 3 minus group 1 has a confidence interval that does NOT include zero. This means the difference is statistically significant. No other group differences are, though.
TukeyHSD(myaov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = newdf$Tumor ~ as.factor(newdf$Tx))
##
## $`as.factor(newdf$Tx)`
## diff lwr upr p adj
## 2-1 -0.8 -2.596273 0.9962735 0.4820234
## 3-1 -2.0 -3.796273 -0.2037265 0.0292267
## 3-2 -1.2 -2.996273 0.5962735 0.2168186
And we can see that the p-value is 0.029. This makes sense when we look at the plot of distribution 3 and distribution 1. There is very little overlap of the two densities.
mycol=rainbow(3)
mymeans=colMeans(mymat)
myf=function(x) sd(x)/sqrt(5)          # standard error of a group mean
mysd=apply(mymat,2,myf)
for (i in seq(1,3, by=2)){             # groups 1 and 3 only
  myadd=ifelse(i==1, FALSE, TRUE)
  curve(dnorm(x, mymeans[i], mysd[i]), from = -2, to = 8, n=501,
        main = 'Group 1 and 3 Distributions', #add title
        ylab = 'Density', #change y-axis label
        ylim=c(0,.9),
        lwd = 1,
        col = mycol[i], add=myadd)
  lines(x=rep(mymeans[i], 10), y=seq(0,1.5,length.out=10), col=mycol[i])
  text(mymeans[i]-.2, .1, paste("Xbar",i,":", mymeans[i]), srt=90)
}
Let’s build four regression models with the iris dataset and evaluate which one fits best.
lm0=lm(Sepal.Length~1, data=iris)   # intercept-only baseline (avoid naming an object `lm`)
lm1=lm(Sepal.Length~Sepal.Width, data=iris)
lm2=lm(Sepal.Length~Sepal.Width+Petal.Width, data=iris)
lm3=lm(Sepal.Length~Sepal.Width+Petal.Width+Petal.Length, data=iris)
anova(lm0,lm1,lm2,lm3)              # sequential comparison of nested models
## Analysis of Variance Table
##
## Model 1: Sepal.Length ~ 1
## Model 2: Sepal.Length ~ Sepal.Width
## Model 3: Sepal.Length ~ Sepal.Width + Petal.Width
## Model 4: Sepal.Length ~ Sepal.Width + Petal.Width + Petal.Length
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 149 102.168
## 2 148 100.756 1 1.412 14.274 0.0002296 ***
## 3 147 29.911 1 70.845 716.032 < 2.2e-16 ***
## 4 146 14.445 1 15.466 156.312 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This table is a comparison of the ANOVA tables from the regressions.
NOTE: Res.Df = how many residual degrees of freedom remain. RSS = Residual Sum of Squares, which is the same as the Sum of Squares Within, and Sum of Sq corresponds to the Sum of Squares Regression (the Sum of Squares Between Groups). The RSS column is each model’s own (non-cumulative) sum of squared residuals. The Df column reflects the additional degrees of freedom used by each subsequent model (1 in this case, since each model adds one variable). The Sum of Sq column is the additional Sum of Squares Regression achieved by adding the new variable(s).
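As a sanity check (my addition), you can recover the RSS and Sum of Sq columns directly from the fitted models; deviance() returns the residual sum of squares for an lm fit.
rss=sapply(list(lm0, lm1, lm2, lm3), deviance)
round(rss, 3)          # the RSS column: 102.168 100.756 29.911 14.445
round(-diff(rss), 3)   # the Sum of Sq column: 1.412 70.845 15.466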
The first regression (baseline) is an intercept fit. It provides the baseline for comparing the other models. From this model, we see that there are 149 degrees of freedom (n=150).
Model 2 has a reduced RSS (SSW), at the cost of one additional degree of freedom. The proportion of total variability explained, \(R^2\), can be calculated should you choose: \(1.412/(100.756+1.412)=0.0138\). We can even penalize that for the number of variables in the model (adjusted \(R^2\)): \(1-(1-R^2)\frac{n-1}{n-k-1} = 1-(1-0.0138)(149/148)=0.007\). Not very good. But the variable is indeed statistically significant.
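You can pull the same numbers straight from the fitted model rather than computing them by hand:
summary(lm1)$r.squared        # ~0.0138
summary(lm1)$adj.r.squared    # ~0.0072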
Model 3 adds another variable, and the RSS drops to 29.911 after consuming one additional degree of freedom. The cumulative SS increases to 1.412+70.845=72.257. The variable addition is statistically significant, and the model is the best so far.
What about Model 4…?