The two-way analysis of variance (ANOVA) is a hallmark of statistical analysis in the behavioral sciences. Not only is it a powerful method of comparing multiple conditions and groups (in most cases…), but it also allows you to test the Holy Grail of relationships in Psychology: the Interaction. An interaction between variables indicates that the effect of one variable on the outcome depends on the level of another variable. The two-way ANOVA is somewhat more complex than the one-way ANOVA, but follows the same rules, including the three main assumptions:

1. Homogeneity of variance (homoscedasticity)
2. Normally distributed residuals
3. Independence of observations
As I mention in my one-way ANOVA primer, ANOVAs are fairly robust to violations of these assumptions – particularly the normality assumption – but take great caution if and when these assumptions are violated.
Before we get started, it is important to note that most common statistical tests, with the exception of some Bayesian methods, are special cases of the general linear model, or GLM. The GLM is a statistical model that expresses the same general formula:
\(Y = XB + U\)
Where \(Y\) generally represents a series of outcome measurements (the DV), \(X\) generally represents observations on independent variables (IV; although it is also referred to as a design matrix), \(B\) is a set of parameters to be estimated, and \(U\) is generally noise. I use the word “generally” here because the GLM is a broad concept that is difficult to cover in a primer on ANOVAs. If you have never heard of the GLM, or are unfamiliar with it, then please close out of this tutorial and read up on it. If you can understand the basics of the GLM, then you can understand how t-tests, chi-squared tests, ANOVAs, regressions, and many, many other analyses are related to each other.
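To make that connection concrete, here is a minimal sketch (using simulated data of my own invention, not the tutorial’s dataset) showing that a two-sample t-test and a one-predictor GLM fit with lm produce the same test statistic:

# Simulate a two-group dataset: Y = XB + U with one categorical predictor
set.seed(42)
group <- factor(rep(c("A", "B"), each = 20))          # X: the design
y <- rnorm(40, mean = ifelse(group == "A", 10, 12))   # Y: outcome plus noise U

t.test(y ~ group, var.equal = TRUE)  # classic two-sample t-test
summary(lm(y ~ group))               # GLM: same t statistic (up to sign) and p-value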
The dataset that we will be using to compute two-way ANOVAs is one provided by R that examines the effects of Vitamin C on tooth growth in guinea pigs (ToothGrowth). Each animal was assigned to one of six groups (\(K = 6\)) of 10 subjects each (\(n = 10\)) for a total of 60 guinea pigs in all (\(N = 60\)). The two variables that were manipulated in this study were the dosage level of Vitamin C (0.5, 1.0, or 2.0 mg/day; \(k_a = 3\)) and the delivery method of the dosage (orange juice [coded as “OJ”] or ascorbic acid [coded as “VC”]; \(k_b = 2\)). We will be contrasting the different dosage levels by their delivery method to determine the best way to increase tooth growth in guinea pigs.
In different terms, we would say that the following analysis is a 3 (Dosage: 0.5, 1, & 2) x 2 (Delivery Method: OJ & VC) between-subjects two-way ANOVA. (We say that it is a two-way ANOVA because there are two independent variables [IVs].) If the guinea pigs each experienced every level of the two IVs, we would say that it is a within-subjects two-way ANOVA. If the guinea pigs experienced each level of one IV, but not the other, then we would say that it is a mixed-effects two-way ANOVA.
As a quick plot of the data shows (see the sketch below), there is a reliable positive relationship between tooth length and Vitamin C dosage (i.e., the larger the dosage, the longer the tooth measurement), but it is unclear whether the vehicle for the delivery of the Vitamin C has any significant impact on tooth growth. It is also unclear whether certain combinations of dosages and delivery vehicles are significantly better than others.
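If you want to produce such a plot yourself, a minimal base-R sketch (my own construction, not part of the original analysis) is:

# Boxplots of tooth length for every supplement-by-dose combination
boxplot(len ~ supp * dose, data = ToothGrowth,
        xlab = "Supplement and dose (mg/day)", ylab = "Tooth length")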
Because we are testing the effects of two independent variables as well as their interaction on an outcome measure, we now have three different groups of hypotheses.
For main effects (i.e., the effects of the different levels of a single IV), our null hypothesis is that the means of the different levels of a given IV are not different from each other, while our alternative hypothesis is that these groups are different from each other:
\(H_0: \mu_1 = \mu_2 = ... = \mu_K\)
\(H_A:\) Not \(H_0\)
Note that a single difference between two means is sufficient to reject \(H_0\).
For interactions, however, the null and alternative hypotheses are specified differently. We simply say that the null hypothesis is that there is no interaction effect between the IVs on the DV, and that the alternative hypothesis is that there is an interaction:
\(H_0\): No interaction
\(H_A\): Interaction
The structure of the degrees of freedom in the two-way ANOVA is similar to the one-way ANOVA, but with the addition of a second IV:
\[\begin{equation} df_{A} = k_a - 1 \\ df_{B} = k_b - 1 \\ df_{Interaction} = (k_a - 1)(k_b - 1) \\ df_{Within} = k_a k_b (n - 1) \\ df_{Total} = N - 1 \end{equation}\]

Using the information provided about the tooth length dataset found in the previous section, we can easily compute the degrees of freedom for the model:
| Source | Formula | df |
|---|---|---|
| Main Effect of Dose | 3 - 1 | 2 |
| Main Effect of Supplement | 2 - 1 | 1 |
| Interaction Effect | (3 - 1)(2 - 1) | 2 |
| Within | 3 * 2 * (10 - 1) | 54 |
| Total | 60 - 1 | 59 |
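If you want to double-check these values, the formulas above can be computed directly in R (a quick sanity check, not part of the original analysis):

# Degrees of freedom computed from the design constants
k_a <- 3; k_b <- 2; n <- 10; N <- 60
c(df_A = k_a - 1,
  df_B = k_b - 1,
  df_Int = (k_a - 1) * (k_b - 1),
  df_Within = k_a * k_b * (n - 1),
  df_Total = N - 1)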
Although the program provides you with this information as you are computing the test, it is important to make sure that you can match the dfs to their respective levels in order to make sure that your data are structured correctly.
If you have a dataset in which the levels of a factor are specified by numbers and decimals, you will need to rename the levels to include a string! R will treat any variable with numeric values as continuous – even if the numbers represent categorical data. This is annoying to do, but ultimately necessary.
In order to do this in our own example, we are going to call the base R function factor and specify that we want each of the three dosage conditions to be their own levels. Because R does not recognize floats as levels, we will re-code them as strings (e.g., “D0.5”):
# Convert the numeric dose values into a factor with string labels
ToothGrowth$dose <- factor(ToothGrowth$dose,
                           levels = c(0.5, 1, 2),
                           labels = c("D0.5", "D1", "D2"))
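It is worth verifying that the recoding worked as intended; for example:

# dose should now be a factor with three string-labeled levels
levels(ToothGrowth$dose)  # "D0.5" "D1" "D2"
class(ToothGrowth$dose)   # "factor"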
As we mentioned earlier, the two-way ANOVA, like many analyses, is based on the GLM, and even though we are not directly constructing the model that R will run, we will still need to specify how it needs to be constructed.
How do we know what the structure of the GLM will be? Think back to the degrees of freedom chart. Apart from the “Within” and “Total” rows, we had 3 distinct parts:

- The Main Effect of Dose
- The Main Effect of Supplement
- The Interaction Effect (Dose x Supplement)

These are the three parts of our model that we will specify in our code in order to run the GLM.
The base code that we will use to run the analysis is aov, which is included in every R distribution:
# Dummy code
aov(response ~ factor, data=data_name)
By substituting response with our dependent variable (tooth length), and by substituting factor with the three main components of our model (ME Supplement, ME Dose, & Interaction), we can specify our GLM as: len ~ supp + dose + supp:dose, where supp:dose is our interaction. (Equivalently, supp * dose is R shorthand that expands to the same three terms.) Putting it all together in aov, the result comes out as:
len.aov <- aov(len ~ supp + dose + supp:dose, data = ToothGrowth)
summary(len.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.4 205.4 15.572 0.000231 ***
## dose 2 2426.4 1213.2 92.000 < 2e-16 ***
## supp:dose 2 108.3 54.2 4.107 0.021860 *
## Residuals 54 712.1 13.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we specified an alpha of .05 (\(\alpha = .05\)), then we can see that our Main Effect of Supplement, Main Effect of Dose, and Supplement x Dose Interaction are all statistically significant. We can reject the null hypothesis that there is no interaction between delivery method (supplement) and Vitamin C dosage (dose) on tooth growth in guinea pigs, F(2,54) = 4.107, p = .022. We can also say that there are significant differences in the impact of delivery methods on tooth growth, F(1,54) = 15.572, p < .001, as well as significant differences in the impact of dosage on tooth growth, F(2,54) = 92.000, p < .001.
Now that we know that the overall F-test is significant for both of the main effects (Supplement & Dose) as well as for the interaction (Supplement x Dose), we will need to figure out which levels of our IV and which combinations of those levels are the most effective in promoting tooth growth in guinea pigs.
When you are running ANOVAs with more than one IV, you will need to run post-hoc comparisons on each IV as well as the interaction separately.
NOTE: Because the “Supplement” IV has only 2 levels, we do not need to run any post-hoc comparisons to figure out which delivery vehicle is better. Because the Main Effect of Supplement was significant, and because the mean of the “Orange Juice” (or “OJ”) level is greater than that of the “Ascorbic Acid” (or “VC”) level, we can simply say that orange juice is the superior supplement delivery method for tooth growth.
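If you want to confirm this ordering of the means yourself, one quick (optional) check is:

# Mean tooth length for each supplement level; OJ should come out higher
# (~20.66 vs. ~16.96, consistent with the VC-OJ difference of -3.7 shown
# in the Scheffe output further below)
tapply(ToothGrowth$len, ToothGrowth$supp, mean)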
Because we now have more than one group of comparisons to run (i.e. ME of Dose and Interaction), we are going to slightly modify the formula that we used with the one-way ANOVA to figure out the number of contrasts that the program will give us.
For main effects, that formula is:
\(\# Comparisons = \frac{k_i (k_i - 1)}{2}\)
Where \(k_i\) is the number of levels in a given IV. In our example, the number of contrasts for the Main Effect of Dosage will be \(3\), or \(\frac{3 (3 - 1)}{2}\).
For interactions, that formula is:
\(\# Comparisons = \frac{K_{int} (K_{int} - 1)}{2}\)
Where \(K_{int}\) is the product of the numbers of levels across all IVs. For this example, the number of contrasts for the Interaction will be \(15\), or \(\frac{[3 * 2][(3 * 2) - 1]}{2}\).
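Both counts are just “n choose 2”, so you can verify them directly in R:

# Number of pairwise comparisons for the dose main effect and the interaction
choose(3, 2)      # 3
choose(3 * 2, 2)  # 15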
A classic post-hoc comparison is Tukey’s Honest Significant Differences (HSD) test. For this, we will pass two parameters to our function: the ANOVA structure that we computed earlier (len.aov) and the particular comparisons we want to run. First, we will compare the different levels of the Dose IV:
TukeyHSD(len.aov, which = "dose")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = len ~ supp + dose + supp:dose, data = ToothGrowth)
##
## $dose
## diff lwr upr p adj
## D1-D0.5 9.130 6.362488 11.897512 0.0e+00
## D2-D0.5 15.495 12.727488 18.262512 0.0e+00
## D2-D1 6.365 3.597488 9.132512 2.7e-06
All three dosages are significantly different from each other. Overall, we can infer that each increase in dosage (from 0.5 to 1 mg/day, and from 1 to 2 mg/day) produces significantly more tooth growth.
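As an optional extra, base R can also plot the family-wise confidence intervals for these contrasts:

# Visualize the Tukey HSD confidence intervals for the dose contrasts
plot(TukeyHSD(len.aov, which = "dose"))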
Next, let’s run comparisons on the interaction between Supplement and Dosage. Here, we will just change the which parameter:
TukeyHSD(len.aov, which = "supp:dose")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = len ~ supp + dose + supp:dose, data = ToothGrowth)
##
## $`supp:dose`
## diff lwr upr p adj
## VC:D0.5-OJ:D0.5 -5.25 -10.048124 -0.4518762 0.0242521
## OJ:D1-OJ:D0.5 9.47 4.671876 14.2681238 0.0000046
## VC:D1-OJ:D0.5 3.54 -1.258124 8.3381238 0.2640208
## OJ:D2-OJ:D0.5 12.83 8.031876 17.6281238 0.0000000
## VC:D2-OJ:D0.5 12.91 8.111876 17.7081238 0.0000000
## OJ:D1-VC:D0.5 14.72 9.921876 19.5181238 0.0000000
## VC:D1-VC:D0.5 8.79 3.991876 13.5881238 0.0000210
## OJ:D2-VC:D0.5 18.08 13.281876 22.8781238 0.0000000
## VC:D2-VC:D0.5 18.16 13.361876 22.9581238 0.0000000
## VC:D1-OJ:D1 -5.93 -10.728124 -1.1318762 0.0073930
## OJ:D2-OJ:D1 3.36 -1.438124 8.1581238 0.3187361
## VC:D2-OJ:D1 3.44 -1.358124 8.2381238 0.2936430
## OJ:D2-VC:D1 9.29 4.491876 14.0881238 0.0000069
## VC:D2-VC:D1 9.37 4.571876 14.1681238 0.0000058
## VC:D2-OJ:D2 0.08 -4.718124 4.8781238 1.0000000
11 out of the 15 computed comparisons were statistically significant. I will not bore you with individual interpretations of these 11 significant outcomes, but I will conclude that higher dosages seem to be the most effective in growing teeth in guinea pigs, with orange juice outperforming ascorbic acid at the lower dosages (at the 2-mg/day dosage, the two delivery vehicles are statistically indistinguishable).
Another way to run the same post-hoc test is the multiple comparisons approach using the general linear hypothesis test, which is supported by the multcomp package. Take a moment to go ahead and load the package in your R environment (assuming you have already installed it):
require(multcomp)
The general linear hypothesis test function is built into the multcomp package and can be called using glht. The function is simple - glht(model, linfct = mcp(group = "Method")) - and only requires specification of the ANOVA model (model in the dummy code) as well as the linear hypotheses to be tested (linfct). For this particular example, we will indicate that the groups we want to compare are the different dosages (dose) and that we want to correct the critical value using the Tukey Studentized Range statistic, which multiplies the tabled critical value by \(\sqrt{2}\).
summary(glht(len.aov, linfct=mcp(dose = "Tukey")))
## Warning in mcp2matrix(model, linfct = linfct): covariate interactions found
## -- default contrast might be inappropriate
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = len ~ supp + dose + supp:dose, data = ToothGrowth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## D1 - D0.5 == 0 9.470 1.624 5.831 <1e-04 ***
## D2 - D0.5 == 0 12.830 1.624 7.900 <1e-04 ***
## D2 - D1 == 0 3.360 1.624 2.069 0.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Note the warning at the top of the output: because our model contains an interaction term, these dose contrasts are computed at the reference level of the supplement factor (OJ, under R’s default treatment contrasts). That is why the estimates here (e.g., D1 - D0.5 = 9.470) match the OJ-only rows of the TukeyHSD interaction output above rather than the overall dose contrasts.
Although it is only beginning to be taught in behavioral statistics classes, Scheffé’s method is an extremely powerful post-hoc comparison test. Many professors will say that Scheffé’s Test is the preferred stepdown procedure, although the reality of this claim is more nuanced than I will allow myself to get into in this document.
In general, Scheffé’s method accounts for the family-wise error rate by weighting the test statistic (\(C\)) by the mean squared error (\(MSE\)), the between-samples df (\(k - 1\)), and the group sizes (if unequal):
\(C = \sqrt{(k - 1)~F~MSE~(\frac{1}{n_i} + \frac{1}{n_j})}\)
This statistic is used to build a confidence interval around each mean difference using the pre-specified alpha (\(\alpha\)) value.
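To see how the formula maps onto the output below, here is a hand-rolled sketch of the critical difference for the dose contrasts. Note the assumption: reproducing DescTools’ intervals appears to require treating all \(K = 6\) cells as the family (so \(K - 1 = 5\)) rather than just the three dose levels:

# Scheffe critical difference for dose contrasts (n = 20 animals per dose level)
K     <- 6                               # total cells: 3 doses x 2 supplements
MSE   <- 712.1 / 54                      # mean squared error from the ANOVA table
Fcrit <- qf(0.95, df1 = K - 1, df2 = 54) # critical F at alpha = .05
sqrt((K - 1) * Fcrit * MSE * (1/20 + 1/20))  # ~3.97: half-width of the dose CIs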
To run this statistic in R, you must first load the DescTools package:
require(DescTools)
From there, the syntax is quite simple: ScheffeTest(x, which = NULL, contrasts = NULL, conf.level = 0.95), where x is the fitted ANOVA model object (len.aov), which specifies the terms for which confidence intervals are calculated (defaults to all contrasts), contrasts is a matrix specifying the contrasts to be computed (factor levels x number of contrasts), and conf.level is the pre-determined confidence level (defaults to 95%).
For our test, Scheffe’s method is quite easy:
ScheffeTest(len.aov)
##
## Posthoc multiple comparisons of means : Scheffe Test
## 95% family-wise confidence level
##
## $supp
## diff lwr.ci upr.ci pval
## VC-OJ -3.7 -6.938593 -0.4614069 0.0153 *
##
## $dose
## diff lwr.ci upr.ci pval
## D1-D0.5 9.130 5.16355 13.09645 3.8e-08 ***
## D2-D0.5 15.495 11.52855 19.46145 3.9e-16 ***
## D2-D1 6.365 2.39855 10.33145 0.00014 ***
##
## $`supp:dose`
## diff lwr.ci upr.ci pval
## VC:D0.5-OJ:D0.5 -5.25 -10.859408 0.3594079 0.08074 .
## OJ:D1-OJ:D0.5 9.47 3.860592 15.0794079 5.5e-05 ***
## VC:D1-OJ:D0.5 3.54 -2.069408 9.1494079 0.45640
## OJ:D2-OJ:D0.5 12.83 7.220592 18.4394079 4.6e-08 ***
## VC:D2-OJ:D0.5 12.91 7.300592 18.5194079 3.8e-08 ***
## OJ:D1-VC:D0.5 14.72 9.110592 20.3294079 7.8e-10 ***
## VC:D1-VC:D0.5 8.79 3.180592 14.3994079 0.00021 ***
## OJ:D2-VC:D0.5 18.08 12.470592 23.6894079 7.0e-13 ***
## VC:D2-VC:D0.5 18.16 12.550592 23.7694079 5.9e-13 ***
## VC:D1-OJ:D1 -5.93 -11.539408 -0.3205921 0.03168 *
## OJ:D2-OJ:D1 3.36 -2.249408 8.9694079 0.51654
## VC:D2-OJ:D1 3.44 -2.169408 9.0494079 0.48960
## OJ:D2-VC:D1 9.29 3.680592 14.8994079 8.0e-05 ***
## VC:D2-VC:D1 9.37 3.760592 14.9794079 6.8e-05 ***
## VC:D2-OJ:D2 0.08 -5.529408 5.6894079 1.00000
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see, ScheffeTest gives us everything that we need in one output.
Remember way back at the beginning of this document when I said that ANOVAs are based on three main assumptions? They are actually quite important for the ANOVA and should be tested with each model.
Here they are again, in case you forgot:

1. Homogeneity of variance (homoscedasticity)
2. Normally distributed residuals
3. Independence of observations

With the correct sampling procedures, we can usually count on the assumption of independently sampled data (#3) to be true. What is not quite within our hands, however, are the first two assumptions. Here, we will use both formal statistical tests and “eyeball” tests to determine whether our ANOVA violates a critical assumption.
In order for us to assume that we are comparing sets of data sampled from the same general population, it is necessary to determine whether the samples have similar variances. Otherwise, we cannot make any strong inferences from the test, because we cannot rule out the possibility that the data were sampled from separate, and potentially very different, populations.
The first order of business in testing this assumption is running Levene’s Test, which is a mathematical way of determining homogeneity of variance. This step is so important that many statistical computing packages (such as SPSS) just go ahead and throw in Levene’s Test any time that you run an ANOVA.
To run the test in R, we simply load the car package into our computing environment:
library(car)
And then run Levene’s Test using this simple syntax - leveneTest(DV ~ IV, data = dataset) - where DV specifies the outcome (dependent) measure, IV specifies the independent variable(s), and data specifies the dataset that we are testing. With a few simple substitutions, we can run it on our tooth growth data:
leveneTest(len ~ dose, data = ToothGrowth)
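Since our design has two factors, it is arguably more thorough to test homogeneity across all six cells at once; a reasonable extension (not shown in the original output) is:

# Levene's Test across every supplement-by-dose cell
leveneTest(len ~ supp * dose, data = ToothGrowth)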
In all cases, you want Levene’s Test statistic to be non-significant (as it is here). In the case that it is significant, you can either a) ignore this violation based on your own a priori knowledge of the distributional characteristics of the population being sampled, or b) relax the assumption of homoscedasticity and run the Welch one-way test (sketched below), which does not require that assumption.
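For reference, here is what that Welch alternative would look like; note that oneway.test handles only a single factor, so this sketch applies it to dose alone:

# Welch's ANOVA (does not assume equal variances; var.equal = FALSE is the default)
oneway.test(len ~ dose, data = ToothGrowth)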
Another way to examine this assumption is to plot the residuals vs. the fitted values for all groups. In general, you do not want the average of the values (i.e., the red line) to deviate much, if at all, from zero. Here, we can use both Levene’s Test and the plot of the residuals to determine that the variances are acceptably homogeneous in the current sample.
plot(len.aov,1)
Finally, we can test the assumption that the residuals in our data are normally distributed by graphing the standardized residuals against their theoretical quantiles. In a perfect dataset, these values would create a perfect diagonal line from quadrant III to quadrant I:
plot(len.aov,2)
If there is deviation from the “ideal” diagonal, you can run the Shapiro-Wilk test on the residuals to mathematically determine whether the assumption has been violated:
# Extract the model residuals and test them for normality
len.residuals <- residuals(object = len.aov)
shapiro.test(x = len.residuals)
##
## Shapiro-Wilk normality test
##
## data: len.residuals
## W = 0.98499, p-value = 0.6694
Ideally, the test statistic will be non-significant.
In the general linear model (the basis of most statistical tests), the normality assumption applies only to the residuals (i.e., the noise term in the model) and not to the distributions of the independent variables themselves, so moderate violations of this assumption tend to have minimal consequences. Luckily, our tooth growth residuals are normally distributed.