2-way ANOVA

A 2-way Analysis of variance (2-way ANOVA) is a parametric test that examines whether there are significant differences among groups of two categorical explanatory variables. Like an ANOVA it tests for significant differences among means within a group but also a significant interaction between groups. 2-way ANOVA has two sampling variability assumptions that have to be met:

  1.    The residuals have to be normally distributed;
  2.    There is equal variance among groups (homogeneity of variance).

The residuals are the leftover variation that is not accounted for by your explanatory variable. You can use analyses of the residuals to check both of these assumptions. Before we can interpret our statistical output, we need to make sure the assumptions above are met.

Background Theory

2-way ANOVA depends on the F distribution (Fig. 1), which measures the variance in the response variable that is explained by the categorical explanatory variable divided by the variance in the response variable that remains unexplained (the error term or residuals). Unexplained variance includes the error as well as what is not explained by other factors.

If the F statistic generated by your test is larger than the F_critical value, the area under the curve is smaller than 5% and, therefore, the p-value is < 0.05, and the means are significantly different. You have explained significantly more variance than random chance. And if you say that there is a significant effect and there really isn’t, you are wrong less than 5% of the time.

Here are some statistical terms you should understand to interpret your 2-way ANOVA.

The F statistic is the ratio of the variance explained and the variance unexplained.
The p-value or alpha is the area under the curve of the F distribution at the F_critical value (Figure 1).The p-value, which is interpreted as the chance that you are wrong if you accept that there is a significant effect (also known as a Type I Error rate) has been chosen by scientists to be 5%.
The F_critical value is the value at which the area under the curve (i.e., p-value) is 5%.
The df is equal to the number of observations minus the number of parameters estimated.

Figure 1. F distribution. Image: http://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/

2-way ANOVA tests the effect of each of your explanatory variables separately and their interaction. In the figure below, growth rate was measured for two plant species (species 1 and species 2) under two light conditions (high light and low light). Growth rate is the continuous response variable, and species and light condition are the two categorical explanatory variables. The power of the 2-way ANOVA is that it can test for whether the way one categorical explanatory variable (light) influences the continuous response variable (growth rate) depends on the other categorical explanatory variable (species).

Figure 2. Growth rate was measured in two species of plants under two light conditions. Data are for illustrative purposes.

A 2-way ANOVA test will provide you with a data analysis table of output that looks like the table below. The first two rows are for each of the categorical explanatory variables (Species and Light). The third row is for the interaction between the two categorical explanatory variables. The fourth row is for the residuals, which is the variance not explained by the explanatory variables (error). This output is for the figure above. It shows that there was a significant interaction (as evidenced by the direction of growth from high to low light being different for each species). This will be discussed below but just so that you are aware, because the interaction of the two variables was significant, you can’t interpet the main effects of the model because they are not independent. So for now, from this analysis below, you can only say that the way that light influences growth rate depends on the plant species.

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Species	1	1059.1	1059.1	12.425	0.00134 **
Light	1	83.1	83.1	0.975	0.33113
Species:Light	1	2726.0	2726.0	31.987	3.28e-06 ***
Residuals	31	2642.4	85.2

The ‘Df’ column is the degrees of freedom for each explanatory variable and the residuals. The entire df is the number of samples - 1, which are then split between the explanatory variables and residuals. The df for each explanatory variable is equivalent to the number of groups (n) minus one for the treatment (so above there are 2 groups for each categorical explanatory variable [Species or Light] because df = 1). The df for the interaction is found by multiplying the df for each explanatory variable (in this case 1*1 = 1). The df for Residuals is the remainder of the degrees of freedom. So the total number of samples from the data above is (31+1)+1+1+1 = 35 because the total df is 31+1+1+1 = 34.

The Mean Sq is the Sum Sq divided by Df (note that in this example the Mean Sq is equal to the Sum Sq for each explanatory variable because they all have 1 df). The Mean Sq Treatment is divided by the Mean Sq Residuals to get the F value for each explanatory variable and the interaction. Pr(>F) is the p-value.

Information about the Example Dataset

For this test we will be using a dataset of mite loads on male and female striped plateau lizards (Sceloporus virgatus) from burned and unburned sites in Cochise county in southeastern Arizona. The number of mites on male and female lizards were measured one year after the fire, and the burned and unburned sites were ~1 mile from each other separated by a firebreak. Burning can reduce mite loads, which might result in lower mite loads on all lizards. Alternatively, burning could result in lizards being in poor condition, which would make them more susceptible to mites. Males typically have more mites than females, at least in the breeding season when these things are typically measured. There are two non-mutually-exclusive explanations for this pattern: 1) Males have elevated testosterone during this time which acts as an immunosuppressant making them more susceptible to infection and 2) Males are moving around a lot more during this time which exposes them to more mites. The goal of this analysis was to examine if the way that burns influence mite loads differs between male and female lizards.

Figure 3. Striped plateau lizard (Sceloporus virgatus) in southeastern Arizona.

Loading the Data into RStudio

In order to make your RStudio script file organized, you will want to include some information at the top of the file. You can use the hashtag (#) to include things in your RStudio file that R can’t read.

For each test, you should include the following lines at the top of each RStudio file:

#question:
#response variable:
#explanatory variable:
#test name:

Below is what it would look like for the step moss example:

#question: does the way that burn history influence mite load in lizards depend on their sex?
#response variable: mite load (continuous)
#explanatory variable 1: site (categorical)
#explanatory variable 2: lizard sex (categorical)
#test name: 2-way ANOVA

In order to run a 2-way ANOVA, you need to have your response variable in one column and each of your categorical explanatory variables in separate columns. For this dataset, each individual lizard gets its own row and includes the sex of the lizard (male or female), the site it was from (burned or unburned) and the number of mites found on it.

To bring data into RStudio, you can use the line of code below that will allow you to select any file directly from your computer.

#code to load a datafile into R from your computer
DATA <- read.csv(file.choose(), na.strings=".")

An alternative way to open your file is to use the path directly to the folder on your computer where you set your working directory. This will allow you to choose a named file that you saved in your working directory (to learn more about what your working directory is, go the the main webpage and click on Introduction to RStudio). In the DATA code line, TO.BE.EDITED is the name of your file.

#code to set the working directory folder on your computer
setwd <- ('C:/Users/YourUserName/Documents/Rfiles')
#code to load a datafile into R from your working directory
DATA <- read.csv(file="c:/TO.BE.EDITED.csv", header=TRUE, sep=",", na.strings=".")

You can download the lizard_mite.xlsx file from the home page, save it as a .csv file in your working directory and then you can use the code below to upload the lizard_mite.csv file into RStudio.

#code to load the datafile "lizard_mite.csv" into R from your working directory
DATA <- data.frame(read.table(file='lizard_mite.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

To make sure it loaded correctly, you can copy and paste the following code to see the names of all of the columns in your file.

#code to see the names of the column titles
names(DATA)

## [1] "Site"    "indivID" "Sex"     "Mites"

If the file was successfully loaded into RStudio, you should have the names of the columns above. Site is the categorical explanatory variable of burned or unburned. indivID is the individual ID code for each lizard. Sex is the other categorical explanatory variable of male or female. Mites is the number of mites found on each individual lizard. If you got an error, go back to the main stats webpage and click on Troubleshooting.

You can also use the following lines of code to make sure your data was read in correctly. The first line of code results in the top 6 rows of your data being displayed instead of just the column names, and the second line of code gives you the dimensions of your datafile.

#code to see the top 6 rows of your data file
head(DATA)

##     Site indivID Sex Mites
## 1 Burned    2144   m    53
## 2 Burned    2145   f    30
## 3 Burned    5433   m    82
## 4 Burned    2150   m   200
## 5 Burned    2151   m   125
## 6 Burned    2152   f    74

#code to see the dimensions (number of rows and columns) of your datafile
dim(DATA)

## [1] 359   4

Once you are sure the data are loaded into R, you can proceed.

Exploring the Data

Before we run the test, it is a good idea to explore what the data look like. This gives you a chance to see if there are any outliers and to determine if your prediction based on your hypothesis seems supported by the data. We will examine the effect of site condition (Site) and sex (Sex) on mite abundance (Mites). In this test, mite abundance is our continuous response variable, site condition and sex are our categorical explanatory variables.

For a 2-way ANOVA, the best way to visualize data is through a a box plot. This will show you any outliers as it shows the maximum and minimum values within each group. See this page for more information on what a box plot is: http://www.physics.csbsju.edu/stats/box2.html. Use the following code to create a box plot of your data.

#code to create a boxplot of the lizard data
boxplot(Mites~Sex*Site, data=DATA, xlab = "Treatment", ylab = "Mite abundance")

In this plot, f is female and m is male. Looking at these data it appears that there are differences between male and female lizards as males tend to have higher mite loads than females regardless of burning (m.Burned and m.Unburned have higher averages than f.Burned and f.Unburned). It is not clear from this figure that there is a site effect or an interaction (we will have to run the 2-way ANOVA to be sure) but there doesn’t seem to be any outliers. OUtliers would be points that are either much higher or lower than all of the other points. If you find an outlier, you can conduct outlier tests or make an argument as to why you need to remove that data point. Note that you have to be certain of such confusions before you alter data; just a hunch is not necessarily enough. Any change you do make has to be justified either by an error that you know you made (such as an incorrect measurement or incorrect data entry) or an outlier test (see this website for information about outlier tests: http://r-statistics.co/Outlier-Treatment-With-R.html).

Running the 2-way ANOVA

Now you can run the 2-way ANOVA. Below is the code you will use to run the model. Remember that to actually run the test, you need to change the Response variable and Explanatory variables to your own variable names. Use the code for names(DATA) above to make sure you write in the variables exactly as they are in the data file. If you write in mite instead of Mites, for example, the test will not work because R won’t be able to find mite in your datafile.

#code to run the 2-way anova to test model assumptions
anova2<-lm(Response variable~Explanatory variable 1*Explanatory variable 2, data=DATA)

Use the code below to run the test for the analysis of lizard data.

#code to run the 2-way anova for the lizard data to test model assumptions
anova2<-lm(Mites~Site*Sex, data=DATA)

You will notice that not much happened. If you didn’t get any error messages, great! That means that the test was successful. R runs the code and stores the information. You have to do more work to get at that the results of the test. But, first we need to make sure we can even look at the results.

You never look at the results of the test until AFTER you check that the assumptions of the model were met. So next we will examine if our data meet the assumptions of 1) residuals following a normal distribution and 2) equal variance among groups (the residuals from each group should be relatively similar).

Evaluating Model Assumptions

The difference between the observed and predicted values, called the residuals, is a measure of the error associated with each observation. It is the variation that is not explained by the explanatory variable. We plot the residuals in various ways to examine normality and homogeneity of variances.

To test whether the residuals are normally distributed (model assumption 1 above), you will use 2 pieces of information:

1. a density plot of residuals
2. a Q-Q plot of residuals

To create a density plot of residuals, use the following code:

#code to create the density plot of residuals
plot(density(anova2$residuals))

Examine the density plot. It should follow more or less a bell-shaped curve. It can be hard to determine whether or not your density plot is “normal” as it takes practice. If there are any weird bumps or the curve is clearly skewed to one side or bimodal, chances are your residuals are not normal. See the ANOVA webpage for examples of non-normal and normal curves. This density plot looks pretty normal even with a few minor bumps. 2-way ANOVA is quite robust to this assumption so even if your data don’t look like a perfect bell curve, the assumption can still be met. Let’s use more information before we decide if this meets the normality assumption.

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. Here’s an example of a Normal Q-Q plot when both sets of quantiles truly come from Normal distributions.

Figure 4. A quantile-quantile plot of residuals showing that they exhibit a close relationship with the theoretical quantiles predicted under a normal distribution.

Q-Q plots take your sample data, sort it in ascending order, and then plot them versus quantiles calculated from a theoretical distribution, in this case the normal distribution. If the points fall pretty closely along the line, the data are normal.

To be able to run the code that creates a Q-Q plot, you need to install and load the package stats. You can do this by running the code below. The first line of code (install.packages (“PackageName”)) will install the package. The second line of code (library(“PackageName”)) loads it into R.

#code to install package "stats"
install.packages("stats")

#code to load package "stats" into R
library(stats)

#code to create a Q-Q plot
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

This plot does not look particularly good. The points should fall pretty much along a straight line and the points go above the line on the right side of the graph. So while the density plot looked pretty good, the Q-Q plot suggests that the data do not meet the first assumption of normality. Let’s keep checking our assumptions before deciding on whether we can look at the model output.

To test whether the error terms for each group have similar variance (homogeneity of variance, assmuption 2 above), we will use a plot of the fitted.values vs. the predicted values. Use the code below to create your plot:

#code to create a fitted values vs residuals plot
plot(anova2$residuals~anova2$fitted.values)

For this plot, you are looking for no patterns. If you see a cone where the vertical variation on one side of the figure is smaller than the vertical variation on the other side, the variation is not homogeneous and you need to transform your data. The plot in this example suggests such a problem. The points on the left side cover a much smaller range of values and are more clustered than the points on the right side. If your plot looks like this, your data violate the assumption of homogeneity of variance. In this case, you need to transform your data.

If your data satisfies the assumptions of the model, go directly to Interpreting 2-way ANOVA Output below.

Transform and Repeat

For this dataset, you will try a square root transformation of your response variable (note that when doing transformations, you transform the response variable, not the explanatory variable). Use the code below to create a new variable called sqrtMites, which is the square root of mite abundance. You can also plot the transformed response variable to check that your code worked. Note that you never present transformed data in a results section. This figure is just so you confirm that your transformation worked.

#code to square-root transform your response variable Mites
sqrtMites = sqrt(DATA$Mites)
#code to plot the transformed data
boxplot(sqrtMites~Sex*Site, data=DATA, xlab = "Treatment", ylab = "Mite abundance")

Now that the transformation worked (as evidenced in your figure), you can use the code below to re-run the 2-way anova and model assumption tests:

#code to run the 2-way anova test on the transformed data
anova2<-lm(sqrtMites~Sex*Site, data=DATA)
#code to create the density plot of residuals
plot(density(anova2$residuals))

#code to create a Q-Q plot
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

#code to create a fitted values vs. residuals plot
plot(anova2$residuals~anova2$fitted.values)

After transforming the data, they look better. The density plot of residuals looks more normal and smoother. The points in the Q-Q plot fall along the line much better than before even though there are some dots that don’t fall perfectly on the line (having most dots close to the line and some dots not directly on it is ok). In the fitted vs. residuals plot, the variation on the left side has increased (note the change in scale on the y-axis in the fitted vs. residuals plot - it is much smaller, which means even though the points don’t extend equally on the left and right, the difference between the spreads is much smaller).

There are no influential data points in this data set so you are good to continue because you meet the assumptions of the model. You can finally look at the model output.

Interpreting 2-way ANOVA Output

Follow this flowchart for the steps involved in interpreting the output of your 2-way ANOVA.

Figure 5. A flowchart for interpreting 2-way ANOVA output.

The code below provides you with the ANOVA table that shows the F-stats and p-values for each effect.

#code to get the anova table
anova(anova2)

## Analysis of Variance Table
## 
## Response: sqrtMites
##            Df  Sum Sq Mean Sq  F value  Pr(>F)    
## Sex         1 1305.97 1305.97 222.8821 < 2e-16 ***
## Site        1   58.53   58.53   9.9894 0.00171 ** 
## Sex:Site    1    1.09    1.09   0.1861 0.66646    
## Residuals 355 2080.11    5.86                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output shows that the interaction term is not significant. That means that the way that site condition influences mite abundance does not depend on the sex of the lizard.

If the Interaction Term is Not Significant

If the F-test of the interaction term is not significant (as it is not here), you can interpret the main effects of the model. So in this example, there was a significant effect of site condition on mite density and this was true regardless of the sex of the lizard (sites that were burned had significantly more mites than sites that were not burned). There was also a significant effect of sex as male lizards had more mites than female lizards regardless of the site condition.

When you have only two groups for your explanatory variables you can determine which is greater than the other by comparing their means (as was done in the previous paragraph). However, if you have more than two groups in your explanatory variable, you need to run some further tests to determine which groups are significantly different from each other. You need to run a Tukey’s Honestly Significant Difference (HSD) test that will do pairwise comparisons of all of your means.

Below is the code you will use to run the Tukey test. You need to change the Response variable and Explanatory variable to your own variable names.

#code to run the Tukey test 
Tukey<-aov(Response variable~Explanatory variable, data=DATA)
TukeyHSD(Tukey, "Explanatory variable", ordered=TRUE)

If the interaction term is not significant, then you can report the output from the model (each main effect and the interaction) from the 2-way ANOVA as well as the p-values from each comparison (if applicable). See the Presenting your Results section below for how to present the results of your 2-way ANOVA.

If the Interaction Term is Significant

If the F-test of the interaction term is significant, you can only report the result of the interaction term from this output in your results section. You can’t interpret or report the main effects from this output because they are not independent. To report the results of each of your explanatory variables (Sex and Site Condition separately), you need to conduct and report the results of separate t-tests for each group individually. To do that, you will need to subset your data so that you can run a t-test for each category of one of your categorical explanatory variables separately depending on your question of interest. For example, if you are primarily interested in the effect of burning on mite loads but thought that it might differ between sexes, you would run a separate t-test for burned and one for unburned because that was your focus. If, however, you were more interested in whether mite loads differed between male and female lizards and decided to include both burned and unburned sites, then you would run a separate t-test for male and one for female lizards.

You should head to the t-test or 1-way ANOVA pages to complete analyzing the subsetted data.

Use the code below to subset your data. Below is the code needed to run the t-test for female data. Once you run this, you can replace every female with male to run the t-test for the male data (or change these to burned and unburned). For more information and background about what a t-test is, go to the t-test page.

#to get the data for the males only
male<-DATA[which(DATA$Sex=='m'),]
#to get the data for the south only
female<-DATA[which(DATA$Sex=='f'),]

When running the t-tests, your datasets are now called either “male” or “female” instead of DATA so make sure you change the code to match these names.

Presenting your Results

Results Statements

The results section of your paper should begin with a narrative of your results statements. These statements should be quantitative in nature and include 1. the statistical significance and 2. the biological significance.

Statistical Significance: Your first sentences should list the results of the statistical test. In a 2-way ANOVA, you will have several sentences because you need to report:

the results of the interaction term (in this case whether the way that site condition influences mite abundance depends on sex)
the results of the main effects
- if the interaction term was significant, then you report the results of each t-test for each category of your categorical variable of your choosin (in this case whether the relationship between mite abundance and sex was significant)
- if the interaction term was not significant, then you report the results of the main effects of the model from the 2-way ANOVA output (in this case whether Sex influenced mite abundance and whether Site condition influenced mite abundance).

You include statistical data in parentheses at the end of the sentence only. You should never write “The p-value was…” or “The F-stat was, which means…”. You don’t write about your statistics, you just write the biological results and include your statistics in parentheses. This satisfies whether your results were statistically significant.

Biological Significance: For the quantitative statements, you should include means and standard errors for interesting comparisons and their effect sizes and include whether the p-value from the Tukey test for each comparison was <0.05. You can compare every mean with every other mean or focus on the comparisons that make sense in your paper (sex differences or site differences)

To help you get means and standard errors, you can install the psych package, and then load it using the following code:

#code to install the "psych" package
install.packages("psych")

#code to load the "psych" package into R
library(psych)

## Warning: package 'psych' was built under R version 3.5.3

The following code shows you how you can get your response data summarized by your explanatory variable.

#code to get summary statistics of your response variable for each group of your explanatory variable
describeBy(DATA, list(DATA$explanatory variable1, DATA$explanatory variable 2)

For the lizard example, use the following code to get a summary of the data including means and standard errors.

#code to get the summary statistics of mite load for male and female lizards in both site types
attach(DATA)
describeBy(DATA,list(DATA$Sex,DATA$Site))

## 
##  Descriptive statistics by group 
## : f
## : Burned
##          vars   n   mean     sd median trimmed    mad min max range skew
## Site*       1 106   1.00   0.00    1.0    1.00   0.00   1   1     0  NaN
## indivID*    2 106 166.19 127.53  114.0  162.50 122.31   2 358   356 0.37
## Sex*        3 106   1.00   0.00    1.0    1.00   0.00   1   1     0  NaN
## Mites       4 106  32.24  23.47   25.5   29.38  19.27   1 107   106 1.08
##          kurtosis    se
## Site*         NaN  0.00
## indivID*    -1.61 12.39
## Sex*          NaN  0.00
## Mites        0.53  2.28
## -------------------------------------------------------- 
## : m
## : Burned
##          vars  n   mean     sd median trimmed   mad min max range skew
## Site*       1 98   1.00   0.00    1.0    1.00  0.00   1   1     0  NaN
## indivID*    2 98 135.98 119.86   95.0  126.67 80.80   1 354   353 0.80
## Sex*        3 98   2.00   0.00    2.0    2.00  0.00   2   2     0  NaN
## Mites       4 98  93.18  55.28   86.5   88.16 49.67   6 238   232 0.83
##          kurtosis    se
## Site*         NaN  0.00
## indivID*    -0.99 12.11
## Sex*          NaN  0.00
## Mites        0.16  5.58
## -------------------------------------------------------- 
## : f
## : Unburned
##          vars  n   mean    sd median trimmed   mad min max range skew
## Site*       1 84   2.00  0.00    2.0    2.00  0.00   2   2     0  NaN
## indivID*    2 84 215.63 45.68  217.5  215.18 57.08 139 298   159 0.08
## Sex*        3 84   1.00  0.00    1.0    1.00  0.00   1   1     0  NaN
## Mites       4 84  25.68 23.78   20.5   21.65 17.05   0 125   125 2.08
##          kurtosis   se
## Site*         NaN 0.00
## indivID*    -1.25 4.98
## Sex*          NaN 0.00
## Mites        5.06 2.60
## -------------------------------------------------------- 
## : m
## : Unburned
##          vars  n   mean    sd median trimmed   mad min max range skew
## Site*       1 71   2.00  0.00      2    2.00  0.00   2   2     0  NaN
## indivID*    2 71 219.23 47.07    214  219.00 60.79 140 359   219 0.16
## Sex*        3 71   2.00  0.00      2    2.00  0.00   2   2     0  NaN
## Mites       4 71  74.70 44.85     64   69.61 40.03  15 216   201 1.05
##          kurtosis   se
## Site*         NaN 0.00
## indivID*    -0.48 5.59
## Sex*          NaN 0.00
## Mites        0.87 5.32

In this output, n = the number of replicates (i.e., sample size) for that species; mean = average; sd = standard deviation; se = standard error. You can use these data to create your results statements and figure (or you can calculate them in Excel). To learn how to calculate the mean, standard deviation and standard error in Excel use the link below to watch a video.

Video on how to calculate a mean, standard deviation and standard error in Excel (also found in the Helpful Resources page on this website) : https://www.screencast.com/t/wFu7UJZbU3cg

The first results statement that includes the statistical results in parentheses should also reference the figure you are referring to. Remember to always 1. refer to figures in the order in which they appear, and 2. include your results narrative before the figure.

Results Figure

A bar chart with means and standard error bars or a box plot are commonly used to report the results of a 2-way ANOVA. Standard convention varies by field but in ecology, we use a bar chart. A bar chart or a box plot are used when your response variable is continuous and your explanatory variable(s) is(are) categorical; each group in your explanatory variable is represented by each bar in two pairs. You can use Excel to create a bar chart. Use the link below to watch a video on how to make a bar chart in Excel: https://www.screencast.com/t/Dv6xg1yXJT

Figure legend

A caption must be included below each figure.

The caption for a 2-way ANOVA must include:

1.  A short descriptive title following the figure number  
2.  A description of what you plotted (including the type of error bars, if appropriate)  
3.  Your sample size (e.g., # transects/site)  
4.  All 3 relevant p-values, each following the statement it supports. For p-values from paired comparison, use letters over bars and   
    state something like "means that share a letter are not significantly different from each other at p < 0.05."

Example Results Statement and Figure (Excel was used to generate this figure)

The way that sex influenced the number of mites on lizards did not depend on whether the site was burned or note (2-way ANOVA, F = 0.19, df = 1,355, p = 0.666). Sex of lizard significantly influenced the number of mites (2-way ANOVA, F = 222.88, df = 1, 355, p < 0.001). The average (± SE) number of mites on male lizards (86.1 ± 3.9) was 191% more than on female lizards (29.5 ± 1.7, Figure 1). Site type significantly influenced the number of mites on lizards (2-way ANOVA, F = 9.99, df = 1,355, p = 0.002). Average (± SE) number of mites on lizards in burned sites (62.1 ± 3.6) was 29% higher than in unburned sites (48.1 ± 3.4, Figure 1).

Figure 6. Average (± SE) number of mites for males and female lizards in sites that were burned or unburned. The way that sex influenced the number of mites on lizards did not depend on whether the site was burned or note (p = 0.666, n = 364). Sex of lizard significantly influenced the number of mites (p < 0.001). Site type significantly influenced the number of mites on lizards (p = 0.002).

My Data Are Not Normal - What Do I Do?

When you can’t meet the assumptions of the model, you can either:
1. Transform the data
2. Run a non-parametric test

Transforming your data means modifying the response variable so that the assumptions are met. This often means taking the square root or the log of the response variable and running the test again. The reason transformations work is that the relative distance between each replicate stays the same (sample 1 is lower than sample 2, for example) but the absolute distance between them is reduced. If your density plot is bimodal (two large humps in your curve), often a transformation will make those humps smaller. You still present your data using the raw numbers, not the transformed numbers.

To transform your data for the example above by taking the log (first line of code below) or the square root (second line of code below) of your response variable, you can use the following code (Note: you did the square root transformation above):

#code to take the log of mite load
logMites = log10(DATA$Mites)
#code to take the square root of mite load
sqrtMites = sqrt(DATA$Mites)

Then you can run the analyses again using your new variables (logMites or sqrtMites) in place of your original variable (Mites). You can’t take the log of 0 so if you have many zeroes in your dataset, you can either use the square root transformation or do a log(x+1) transformation. You can’t take the square root of negative numbers so if you have negative numbers, try the log transformation.

If you can’t meet the assumptions even after transforming your data, you can use a non-parametric test. A non-parametric test does not assume an underlying distribution (such as the F-distribution) and, therefore, does not need to meet the assumption of normality as the parametric 2-way ANOVA does.

There are non-parametric alternatives to 2-way ANOVAs, but they are beyond the scope of this course. Talk to your instructor if you can’t meet the model assumptions of the 2-way ANOVA.

Quick 2-way ANOVA

Here is all of the code you need to quickly run the 2-way ANOVA. Note that below the transformed data are used.

#bring data into RStudio
DATA <- data.frame(read.table(file='lizard_mite.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

#check that data was loaded properly
names(DATA)

#explore data (look for outliers)
boxplot(Mites~Sex*Site, data=DATA, xlab = "Treatment", ylab = "Mite abundance")

#run the 2-way ANOVA
anova2<-lm(Mites~Site*Sex, data=DATA)

#check model assumptions
plot(density(anova2$residuals))
install.packages("stats")
library(stats)
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(anova2$residuals~anova2$fitted.values)
plot(anova2)

#ONLY DO THIS IF A TRANSFORMATION IS NECESSARY
#transform response variable
sqrtMites = sqrt(DATA$Mites)

#plot data to make sure the transformation worked
boxplot(sqrtMites~Sex*Site, data=DATA, xlab = "Treatment", ylab = "Mite abundance")

#run the 2-way ANOVA with the transformed response variable
anova2<-lm(sqrtMites~Sex*Site, data=DATA)

#check model assumptions again
plot(density(anova2$residuals))
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(anova2$residuals~anova2$fitted.values)
plot(anova2)

#examine 2-way ANOVA output
summary(anova2)
anova(anova2)

#for summary statistics
install.packages("psych")
library(psych)
attach(DATA)
describeBy(DATA,list(DATA$Sex,DATA$Site))

Code to Create a Boxplot

Below is the code needed to generate a box-plot but remember that ecology generally doesn’t use box plots. To generate a box plot, you need to install and load the package “ggplot2” using the following code.

#code to install package "ggplot2"
install.packages("ggplot2")

#code to load package "ggplot2" into R
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

To create the box plot, use the code below.

#code to create a boxplot of mite load for male and female lizards across site types
ggplot(DATA, aes(Site, Mites, colour = Sex)) + geom_boxplot()+ scale_color_manual(values = c("black","grey")) + theme_bw() + ylab("Number of mites")

*Written July 2018. Modified from http://stats.pugetsound.edu/ecology/