ANCOVA

Analysis of Covariance (ANCOVA) is a parametric test that examines whether there are significant differences among groups of a categorical variable while controlling for the variation in a continuous covariate (continuous explanatory variable). It combines linear regression and ANOVA and tests whether the slopes of each line for each categorical explanatory variable are significantly different from each other. ANCOVA has two sampling variability assumptions that have to be met:

  1.    The residuals have to be normally distributed;
  2.    There is equal variance among groups (homogeneity of variance).

The residuals are the leftover variation that is not accounted for by your explanatory variable. You can use analyses of the residuals to check both of these assumptions. Before we can interpret our statistical output, we need to make sure the assumptions above are met.

Background Theory

ANCOVA depends on the F distribution (Fig. 1), which measures the variance in the response variable that is explained by the categorical explanatory variable divided by the variance in the response variable that remains unexplained (the error term or residuals). Unexplained variance includes the error as well as what is not explained by other factors. By moving the variation explained by the covariate into the numerator (the variation you can explain) from the denominator (the variation not explained), it increases the power of detecting a significant effect.

If the F statistic generated by your test is larger than the F_critical value, the area under the curve is smaller than 5% and, therefore, the p-value is < 0.05, and the means are significantly different. You have explained significantly more variance than random chance. And if you say that there is a significant effect and there really isn’t, you are wrong less than 5% of the time.

Figure 1. F distribution. Image: http://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/

ANCOVA tests the effect of each of your explanatory variables separately and their interaction. In the figure below, heart rate was measured for male and female frogs of varying age. Heart rate is the continuous response variable, age is the continuous explanatory variable and frog sex is the categorical explanatory variable. To test for whether the continuous explanatory variable (age) significantly influences the response variable (heart rate), ANCOVA tests whether the slope of the lines are significantly different from zero (it goes up or down). To test for whether the categorical explanatory variable (sex of frog) significantly influences the response variable, ANCOVA tests whether the line of one category is higher or lower than the line for the other category.

Figure 2. Heart rate was measured in male and female frogs of varying ages. Data are for illustrative purposes.

An ANCOVA test will provide you with a data analysis table of output that looks like the table below. The first two rows are for each of the explanatory variables (one continuous and one categorical). The third row is for the interaction between the categorical explanatory variable and the continuous explanatory variable. The fourth row is for the residuals, which is the variance not explained by the explanatory variables (error). This output is for the figure above. It shows that Age had a significant effect (as evidenced by the lines both increasing), Sex had a significant effect (as evidenced by the female frog line being higher than the male frog line), and there was a significant interaction (as evidenced by the lines increasing with age at a different rate). This will be discussed below but just so that you are aware, because the interaction of the two variables was significant, you can’t interpet the main effects of the model because they are not independent. So for now, from this analysis below, you can only say that the way that age influences heart rate in frogs depends on the sex of the frog.

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
Age	1	1.0530	1.0530	161.49	1.70e-07 ***
Sex	1	0.4547	0.4547	69.73	8.07e-06 ***
Age:Sex	1	0.0931	0.0931	14.28	0.00361 **
Residuals	10	0.0652	0.0065

The ‘Df’ column is the degrees of freedom for each explanatory variable and the residuals. The entire df is the number of samples - 1, which are then split between the explanatory variables and residuals. The df for the explanatory variable is equivalent to the number of groups (n) minus one for the treatment (so above there are 2 groups for the categorical explanatory variable [Sex] because df = 1). The df for the continuous explanatory variable is always 1. The df for the interaction is the found by multiplying the df for each explanatory variable (in this case 1*1 = 1). The df for Residuals is the remainder of the degrees of freedom. So the total number of samples from the data above is 10+1+1+1+1 = 14 because the total df is 10+1+1+1 = 13.

The Mean Sq is the Sum Sq divided by Df (note that in this example the Mean Sq is equal to the Sum Sq for each explanatory variable because they all have 1 df). The Mean Sq Treatment is divided by the Mean Sq Residuals to get the F value for each explanatory variable and the interaction. Pr(>F) is the p-value.

Information about the Example Dataset

For this test we will be using a dataset of vascular epiphyte abundance on varying tree sizes on Dominica, a volcanic island in the Lesser Antilles of the Caribbean. Vascular epiphytes were counted on over 300 individual trees from four plots that span the island. Two plots are from the north (plots SBNT and MR, Figure 3) and two plots are from the south (plots M1 and M2, Figure 3). Plots in the south have a higher tree diversity and more smaller trees than plots in the north, which is likely due to the greater effect that hurricanes have on the south side of the island than the north side (DeWalt et al. 2016). These differences in tree communities may influence epiphyte diversity. The goal of this analysis is to examine if the relationship between tree size and epiphyte abundance differs between the south and north areas of the island.

Figure 3. Map of the plots in which epiphytes were surveyed on trees of varying sizes on Dominica (a: Source: DeWalt et al. 2016) and epiphytes on a tree in Dominica (b: Source: https://commons.wikimedia.org/wiki/File:Epiphytes_(Dominica).jpg)

Loading the Data into RStudio

In order to make your RStudio script file organized, you will want to include some information at the top of the file. You can use the hashtag (#) to include things in your RStudio file that R can’t read.

For each test, you should include the following lines at the top of each RStudio file:

#question:
#response variable:
#explanatory variable:
#test name:

Below is what it would look like for the step moss example:

#question: does the relationship of epiphyte abundance and tree size vary with location on the island?
#response variable: epiphyte abundance (continuous)
#explanatory variable 1: tree size (continuous)
#explanatory variable 2: location (categorical)
#test name: ANCOVA

In order to run an ANCOVA, you need to have your response variable in one column and your continuous explanatory variable in another column so that there are replicates for each of your categories of your categorical explanatory variable. For this dataset, each individual tree gets its own row and includes the location of the tree (north or south plots) and the abundance of epiphytes on the tree.

To bring data into RStudio, you can use the line of code below that will allow you to select any file directly from your computer.

#code to load a datafile into R from your computer
DATA <- read.csv(file.choose(), na.strings=".")

An alternative way to open your file is to use the path directly to the folder on your computer where you set your working directory. This will allow you to choose a named file that you saved in your working directory (to learn more about what your working directory is, go the the main webpage and click on Introduction to RStudio). In the DATA code line, TO.BE.EDITED is the name of your file.

#code to set the working directory folder on your computer
setwd <- ('C:/Users/YourUserName/Documents/Rfiles')
#code to load a datafile into R from your working directory
DATA <- read.csv(file="c:/TO.BE.EDITED.csv", header=TRUE, sep=",", na.strings=".")

You can download the Dominica_ancova.xlsx file from the home page, save it as a .csv file in your working directory and then you can use the code below to upload the Dominica_ancova.csv file into RStudio.

#code to load the datafile "Dominica_ancova.csv" into R from your working directory
DATA <- data.frame(read.table(file='Dominica_ancova.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

To make sure it loaded correctly, you can copy and paste the following code to see the names of all of the columns in your file.

#code to see the names of the column titles
names(DATA)

## [1] "SITE"     "Location" "DBH"      "Epiphyte"

If the file was successfully loaded into RStudio, you should have the names of the columns above. Note that DBH is the diameter at breast height (diameter of the trunk at 1.37 m), which gives you an idea of the size of the tree. Epiphyte is the abundance of epiphytes and Location is whether or not the plots are in the south or the north. If you got an error, go back to the main stats webpage and click on Troubleshooting.

You can also use the following lines of code to make sure your data was read in correctly. The first line of code results in the top 6 rows of your data being displayed instead of just the column names, and the second line of code gives you the dimensions of your datafile.

#code to see the top 6 rows of your data file
head(DATA)

##   SITE Location  DBH Epiphyte
## 1   M1    South 96.0       28
## 2   M1    South 75.8       20
## 3   M1    South 35.3       17
## 4   M1    South 24.4       16
## 5   M1    South 64.9       18
## 6   M1    South 81.4       16

#code to see the dimensions (number of rows and columns) of your datafile
dim(DATA)

## [1] 320   4

Once you are sure the data are loaded into R, you can proceed.

Exploring the Data

Before we run the test, it is a good idea to explore what the data look like. This gives you a chance to see if there are any outliers and to determine if your prediction based on your hypothesis seems supported by the data. We will examine the relationship between epiphyte abundance (Epiphyte) and tree size (DBH) for the north and south areas separately. In this test, epiphyte abundance is our continuous response variable, tree size is our continuous explanatory variable, and location (north or south) is our categorical explanatory variable.

For an ANCOVA, the best way to visualize data is through a scatterplot with multiple lines (as shown above for the frogs in Figure 2). To create a scatterplot with multiple lines, you will need to install and load the package ggplot2.

#code to install package "ggplot2"
install.packages("ggplot2")

#code to load package "ggplot2" into R
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

Use the following code to create the scatterplot with multiple lines.

#code to create a scatterplot with a line for each group of the categorical variable
p <- ggplot(DATA, (aes(x=DBH, y=Epiphyte, color=Location, shape=Location))) + theme_classic() +ylab("Epiphyte Abundance") + xlab("DBH (cm)")
p + geom_point() + geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

Looking at these data, it looks like tree size (DBH) influences epiphyte abundance as both lines increase. It also looks like the relationship between epiphyte abundance and tree size is more steep in the south than in the north, which suggests that the location may significantly influence this relationship. There is a lot of spread in the data and we can’t tell if any of these points are outliers. We need to run the model and check to see if it meets the assumptions before looking at the results.

Running the ANCOVA

Now you can run the ANCOVA. Below is the code you will use to run the model. Remember that to actually run the test, you need to change the Response variable and Explanatory variables to your own variable names. Use the code for names(DATA) above to make sure you write in the variables exactly as they are in the data file. If you write in dbh instead of DBH, for example, the test will not work because R won’t be able to find dbh in your datafile.

#code to run the 2-way anova to test model assumptions
ancova<-lm(Response variable~Explanatory variable 1*Explanatory variable 2, data=DATA)

Use the code below to run the test for the analysis of epiphyte data.

#code to run the 2-way anova for the epiphyte data to test model assumptions
ancova<-lm(Epiphyte~DBH*Location, data=DATA)

You will notice that not much happened. If you didn’t get any error messages, great! That means that the test was successful. R runs the code and stores the information. You have to do more work to get at that the results of the test. But, first we need to make sure we can even look at the results.

You never look at the results of the test until AFTER you check that the assumptions of the model were met. So next we will examine if our data meet the assumptions of 1) residuals following a normal distribution and 2) equal variance among groups (the residuals from each group should be relatively similar).

Evaluating Model Assumptions

The difference between the observed and predicted values, called the residuals, is a measure of the error associated with each observation. It is the variation that is not explained by the explanatory variable. We plot the residuals in various ways to examine normality and homogeneity of variances.

To test whether the residuals are normally distributed (model assumption 1 above), you will use 2 pieces of information:

1. a density plot of residuals
2. a Q-Q plot of residuals

To create a density plot of residuals, use the following code:

#code to create the density plot of residuals
plot(density(ancova$residuals))

Examine the density plot. It should follow more or less a bell-shaped curve. It can be hard to determine whether or not your density plot is “normal” as it takes practice. If there are any weird bumps or the curve is clearly skewed to one side or bimodal, chances are your residuals are not normal. See the ANOVA webpage for examples of non-normal and normal curves. This density plot looks somewhat normal but there are long tails on each side. ANCOVA is quite robust to this assumption so even if your data don’t look like a perfect bell curve, the assumption can still be met. Let’s use more information before we decide if this meets the normality assumption.

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. Here’s an example of a Normal Q-Q plot when both sets of quantiles truly come from Normal distributions.

Figure 4. A quantile-quantile plot of residuals showing that they exhibit a close relationship with the theoretical quantiles predicted under a normal distribution.

Q-Q plots take your sample data, sort it in ascending order, and then plot them versus quantiles calculated from a theoretical distribution, in this case the normal distribution. If the points fall pretty closely along the line, the data are normal.

To be able to run the code that creates a Q-Q plot, you need to load the package stats. You can do this by running the code below.

#code to load package "stats" into R
library(stats)

#code to create a Q-Q plot
qqnorm(ancova$residuals)
qqline(ancova$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

This plot does not look good. The points should fall pretty much along a straight line. With the density plot and the Q-Q plot, it doesn’t seem like the data are normal. Let’s keep checking our assumptions before deciding on whether our data meet the assumptions.

To test whether the error terms for each group have similar variance (homogeneity of variance, assmuption 2 above), we will use a plot of the fitted.values vs. the predicted values. Use the code below to create your plot:

#code to create a fitted values vs residuals plot
plot(ancova$residuals~ancova$fitted.values)
lines(lowess(ancova$fitted.values,ancova$residuals), col="blue")
text(ancova$fitted.values, ancova$residuals, row.names(DATA), cex=0.6, pos=4, col="red")

For this plot, you are looking for no patterns. If you see a cone where the vertical variation on one side of the figure is smaller than the vertical variation on the other side, the variation is not homogeneous and you need to transform your data. The plot in this example suggests such a problem. The points on the left side cover a much smaller range of values and are more clustered than the points on the right side. If your plot looks like this, your data violate the assumption of homogeneity of variance. In this case, you need to transform your data.

If your data satisfies the assumptions of the model, go directly to Interpreting ANCOVA Output below.

Transform and Repeat

For this dataset, you will try a square root transformation of your response variable (note that when doing transformations, you transform the response variable, not the explanatory variable). Use the code below to create a new variable called sqrtEpiphyte, which is the square root of epiphyte abundance. Note that you never present transformed data in a results section.

#code to square-root transform your response variable Epiphyte
sqrtEpiphyte = sqrt(DATA$Epiphyte)

You can use the code below to re-run the ancova and model assumption tests using the transformed data:

#code to run the ancova test on the transformed data
ancova<-lm(sqrtEpiphyte~DBH*Location, data=DATA)
#code to create the density plot of residuals
plot(density(ancova$residuals))

#code to create a Q-Q plot
qqnorm(ancova$residuals)
qqline(ancova$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

#code to create a fitted values vs. residuals plot
plot(ancova$residuals~ancova$fitted.values)
lines(lowess(ancova$fitted.values,ancova$residuals), col="blue")
text(ancova$fitted.values, ancova$residuals, row.names(DATA), cex=0.6, pos=4, col="red")

After transforming the data, they look better. The density plot of residuals looks more normal despite the small shoulder. The points in the Q-Q plot fall along the line much better than before even though there are some dots that don’t fall perfectly on the line (having most dots close to the line and some dots not directly on it is ok). In the fitted vs. residuals plot, the variation on the left side has increased (note the change in scale on the y-axis in the fitted vs. residuals plot - it is much smaller, which means even though the points don’t extend equally on the left and right, the difference between the spreads is much smaller) but there is a lone point in the bottom-right corner.

Point 76 is clearly an outlier (note that because there is a header in the Excel file, this point is number 77 in the Excel file). It doesn’t fit with the other points in the fitted vs. residuals plot or the spread-location plot, it doesn’t fall on the line in the Q-Q plot, and it falls outside of the dashed red line in the residuals vs. leverage plot, which tells you that it is an influential point according to the Cook’s D statistic. Cook’s D (d is for distance) is a measure of the influence of a datapoint in an analysis; points that fall above a Cook’s distance value of 0.5 are considered “influential” points, which may be outliers and are worth checking. If you look through the Dominica datafile, you will see that this datapoint is from the largest tree in the South plots; it has a DBH of 135.9 cm, which is ~40 cm larger than the next largest tree at 96 cm (the first tree in the file). Knowing this system and these tree species, this tree was probably not felled by the Hurricane even though the epiphyte community on it was likely affected (wind is a major disturbance for epiphytes). As such, it is safe to remove this datapoint and rerun the model and test assumptions.

You have to be certain of such confusions before you alter data; just a hunch is not necessarily enough. Any change you do make has to be justified either by an error that you know you made or an outlier test (as in this case). You need to document the removal of this point in the statistical analysis section of your methods. You can write “A data point was removed as an outlier according to a Cook D’s test.”

If you make any changes to your data, save the corrected data with a different name and keep a copy of the original. Save the file as Dominica2.csv and use the following code to re-upload your data. Because this file is labeled the same as it was originally (DATA), the rest of the code remains the same and the dataset is replaced with the new one.

#code to load the updated file "Dominica2.csv" into R from your working directory
DATA <- data.frame(read.table(file='Dominica2.csv', sep=',', header=TRUE, fill=TRUE))

Given that the data did not meet the normality assumption prior to the square root transformation, you should continue to use the transformed response variable even after removing the data point. Because you did this transformation in RStudio (and not in your excel file), you will need to recreate that variable again after uploading the new data file. Use the code below to recreate the square-root transformed response variable, run the model, and assess model assumptions.

#code to square-root transform your response variable Epiphyte
sqrtEpiphyte = sqrt(DATA$Epiphyte)
#code to run the ancova test on the transformed data
ancova<-lm(sqrtEpiphyte~DBH*Location, data=DATA)
#code to create the density plot of residuals
plot(density(ancova$residuals))

#code to create a Q-Q plot
qqnorm(ancova$residuals)
qqline(ancova$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

#code to create a fitted values vs. residuals plot
plot(ancova$residuals~ancova$fitted.values)
lines(lowess(ancova$fitted.values,ancova$residuals), col="blue")
text(ancova$fitted.values, ancova$residuals, row.names(DATA), cex=0.6, pos=4, col="red")

From these plots it is clear that you have solved the problem of the outlier and you meet the assumptions of the model. You can finally look at the model output.

Interpreting ANCOVA Output

Follow this flowchart for the steps involved in interpreting the output of your ANCOVA.

Figure 5. A flowchart for interpreting ANCOVA output.

The code below provides you with the ANOVA table that shows the F-stats and p-values for each effect.

#code to get the anova table
anova(ancova)

## Analysis of Variance Table
## 
## Response: sqrtEpiphyte
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## DBH            1 196.302 196.302 218.952 < 2.2e-16 ***
## Location       1  55.610  55.610  62.026 5.534e-14 ***
## DBH:Location   1  24.545  24.545  27.377 3.061e-07 ***
## Residuals    315 282.414   0.897                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output shows that the interaction term is significant. That means that the way that tree size influences epiphyte abundance depends on the location on the island. Ignore the fact that the output says LocationSouth instead of just Location; it is a glitch in R and LocationSouth is Location.

If the Interaction term is Not Significant

If the interaction term is not significant, then you can report the output from the model (each main effect and the interaction) from the ANCOVA. See the Presenting your Results section below for how to present the results of your ANCOVA.

If the Interaction term is Significant

If the F-test of the interaction term is significant (as it is here), you can only report the result of the interaction term from this output in your results section. You can’t interpret or report the main effects from this output because they are not independent. To report the results of each of your explanatory variables (DBH and Location separately), you need to conduct and report the results of separate simple linear regressions for each group individually. To do that, you will need to subset your data so that you can run a linear regression for each category of your categorical explanatory variable separately (one for the north and one for the south).

Use the code below to subset your data. Below is the code needed to run the linear regression for the north data. Once you run this, you can replace every north with south to run the linear regression for the south data. For more information and background about what a linear regression is, go to the linear regression page.

#to get the data for the north only
north<-DATA[which(DATA$Location=='North'),]
#to get the data for the south only
south<-DATA[which(DATA$Location=='South'),]

When running the linear regressions, your datasets are now called either “north” or “south” instead of DATA so make sure you change the code to match these names.

#explore data (look for outliers)
attach(north)
plot(north$DBH, north$Epiphyte, xlab="Tree size (cm)", ylab="Epiphyte abundance", pch=19)
abline(lm(Epiphyte~DBH), col="red")

#code to run the linear regression
fit<-lm(Epiphyte~DBH, data=north)

In order to check model assumptions, you need to install and load the package “stats” into R.

#code to install package "stats" into R
install.packages("stats")
#code to load package "stats" into R
library(stats)

#code to create a density plot of residuals
plot(density(fit$residuals))

#code to create a Q-Q plot
qqnorm(fit$residuals)
qqline(fit$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

#code to create a fitted vs. residuals plot
plot(fit$residuals~fit$fitted.values)
lines(lowess(fit$fitted.values,fit$residuals), col="blue")
text(fit$fitted.values, fit$residuals, row.names(DATA), cex=0.6, pos=4, col="red")

From this output, these data do not appear to meet the normality assumption so a transformation is necessary. Use the code below to transform your response variable by taking the square root of it and running the model and testing assumptions again.

#code to square-root transform the response variable Epiphyte
sqrtEpiphyte<-sqrt(north$Epiphyte)
#run model
fit<-lm(sqrtEpiphyte~DBH, data=north)

#code to create a density plot of residuals
plot(density(fit$residuals))

#code to create a Q-Q plot
qqnorm(fit$residuals)
qqline(fit$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

#code to create a fitted vs. residuals plot
plot(fit$residuals~fit$fitted.values)
lines(lowess(fit$fitted.values,fit$residuals), col="blue")
text(fit$fitted.values, fit$residuals, row.names(DATA), cex=0.6, pos=4, col="red")

If you go through each plot using the information from above about what you are looking for in each plot, you will see that this transformation worked; you have met the assumption of normality and can now look at the summary of the model.

#examine linear regression output
summary(fit)

## 
## Call:
## lm(formula = sqrtEpiphyte ~ DBH, data = north)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6086 -0.5394 -0.3516  0.6439  2.0911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.154997   0.081544   1.901   0.0589 .  
## DBH         0.027119   0.002528  10.729   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8116 on 186 degrees of freedom
## Multiple R-squared:  0.3823, Adjusted R-squared:  0.379 
## F-statistic: 115.1 on 1 and 186 DF,  p-value: < 2.2e-16

There is a significant relationship between tree size (DBH) and epiphyte abundance on the north side of the island. For your results statement, you need to present the results of the interaction term from the ANCOVA and then the results of each separate linear regression. However, you don’t need separate figures for each linear regression you run; you are only running these linear regressions to get the statistics.

Presenting your Results

Results Statements

The results section of your paper should begin with a narrative of your results statements. These statements should be quantitative in nature and include 1. the statistical significance and 2. the biological significance.

Statistical Significance: Your first sentences should list the results of the statistical test. In an ANCOVA, you will have several sentences because you need to report:

the results of the interaction term (in this case whether the way that DBH influences epiphyte abundance depends on location)
the results of the main effects
- if the interaction term was significant, then you report the results of each linear regression for each category of your categorical variable (in this case whether the relationship between epiphyte abundance and DBH was significant in the north side and whether it was significant on the south side)
- if the interaction term was not significant, then you report the results of the main effects of the model from the ANCOVA output (in this case whether DBH influenced epiphyte abundance and whether Location influenced epiphyte abundance)

You include statistical data in parentheses at the end of the sentence only. You should never write “The p-value was…” or “The F-stat was, which means…”. You don’t write about your statistics, you just write the biological results and include your statistics in parentheses. This satisfies whether your results were statistically significant.

Biological Significance: For the quantitative statements, you should include the direction of the effect of the continuous explanatory variable on the response variable (whether it was positive or negative) if it was significant. You could also compare the difference between the slopes of the two lines if the categorical explanatory variable was significant. Remember that to compare the slopes, you need to get the untransformed slopes. Go to How to Get the Slope of each Line below if you had to transform your data to meet model assumptions.

The first results statement that includes the statistical results in parentheses should also reference the figure you are referring to. Remember to always -refer to figures in the order in which they appear, and -include your results narrative before the figure.

Results Figure

A scatterplot with two lines is typically used to present the results of an ANCOVA. You already generated a scatterplot with a trendline for each category of your categorical explanatory variable but below is the code to do it again. Note that even though we transformed the data to meet the assumptions of the model, untransformed data are presented.

#code to create a scatterplot with a line for each group of the categorical explanatory variable
p <- ggplot(DATA, (aes(x=DBH, y=Epiphyte, color=Location, shape=Location))) + theme_classic() +ylab("Epiphyte Abundance") + xlab("DBH (cm)")
p + geom_point() + geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

Alternatively, you can use Excel to create a scatterplot. Use the link below to watch a video on how to make a scatterplot in Excel: https://www.screencast.com/t/MwVeP49jLmPe

Figure legend

A caption must be included below each figure.

The caption for a linear regression must include:

1.  A short descriptive title following the figure number  
2.  A description of what you plotted (including the type of error bars, if appropriate)  
3.  Your sample size (e.g., # transects/site)  
4.  The p-value for each effect (three in total) and the R^2^ if you ran linear regressions.

Example Results Statement and Figure (Excel was used to generate this figure)

The relationship between tree size (DBH) and epiphyte abundance significantly depended on location on the island (ANCOVA, F = 27.38, df = 1, 315, p < 0.001, Figure 5). On the north side of the island, epiphyte abundance increased significantly with tree size (linear regression, F = 115.1, df = 1,186, p < 0.001, R² = 0.38). On the south side of the island, epiphyte abundance also increased significantly with tree size (linear regression, F = 114.3, df = 1,129, p < 0.001, R² = 0.47). While epiphyte abundance increased with tree size at both locations, the slope of the relationship on the south side (0.25 individuals/cm) was 178% greater than the slope of the relationship on the north side (0.09 individuals/cm).

Figure 6. Epiphyte abundance on trees of varying sizes (diameter at breast height: DBH) from north and south areas of the island of Dominica in the Lesser Antilles. The influence of tree size on epiphyte abundance depended on location (p < 0.001). Epiphyte abundance increased significantly with tree size on both the south (p < 0.001, R² = 0.47) and the north side (p < 0.001, R² = 0.38).

[NOTE: If your interaction term is significant, it is not appropriate to present “main effect” results of the ANCOVA. Instead, run separate regression tests for each level of your categorical explanatory variable.]

Example Results Statement and Figure for a Non-significant Interaction Term

Below is an example with a non-significant interaction term (modified from Tullis and Straube. 2017. The metabolic cost of carrying a sexually selected trait in the male fiddler crab Uca pugilator. J. of Exp. Biol 220:3641-3648)

“Mass-specific oxygen consumption decreased significantly with an increase in body mass (ANCOVA: F = 9.36, df = 1,24, p < 0.05) and the influence of body mass on oxygen consumption was the same for all three treatment groups (F = 0.18, df = 2,24, p = 0.84, Figure 1). There was no significant difference in oxygen consumption among clawed, declawed and loaded crabs at any given body mass (F = 0.48, df = 2,24, p = 0.62).

Figure 1. Oxygen consumption as a function of body mass for clawed, declawed and loaded male U. pugilator during sustainable exercise (n = 10 for each treatment group). Log oxygen consumption per gram decreased significantly with log body mass for all treatment groups (p < 0.05). The influence of body mass on oxygen consumption did not depend on treatment groups (p = 0.84), and there was no effect of treatment group on oxygen consumption at any body mass (p = 0.62).

My Data Are Not Normal - What Do I Do?

When you can’t meet the assumptions of the model, you can either:
1. Transform the data
2. Run a non-parametric test

Transforming your data means modifying the response variable so that the assumptions are met. This often means taking the square root or the log of the response variable and running the test again. The reason transformations work is that the relative distance between each replicate stays the same (sample 1 is lower than sample 2, for example) but the absolute distance between them is reduced. If your density plot is bimodal (two large humps in your curve), often a transformation will make those humps smaller. You still present your data using the raw numbers, not the transformed numbers.

To transform your data for the example above by taking the log (first line of code below) or the square root (second line of code below) of your response variable, you can use the following code (Note: you did the square root transformation above):

#code to log transform the response variable Epiphyte
logEpiphyte = log10(DATA$Epiphyte)
#code to square-root transform the response variable Epiphyte
sqrtEpiphyte = sqrt(DATA$Epiphyte)

Then you can run the analyses again using your new variables (logEpiphyte or sqrtEpiphyte) in place of your original variable (Epiphyte). You can’t take the log of 0 so if you have many zeroes in your dataset, you can either use the square root transformation or do a log(x+1) transformation. You can’t take the square root of negative numbers so if you have negative numbers, try the log transformation.

If you can’t meet the assumptions even after transforming your data, you can use a non-parametric test. A non-parametric test does not assume an underlying distribution (such as the F-distribution) and, therefore, does not need to meet the assumption of normality as the parametric ANOVA does. We will not be discussing nonparametric tests for an ANCOVA so if you can’t meet these assumptions even after transforming the data, see this webpage for help with using a nonparametric ANCOVA: https://stats.stackexchange.com/questions/41270/nonparametric-equivalent-of-ancova-for-continuous-dependent-variables

Quick ANCOVA

Here is all of the code you need to quickly run the ANCOVA. Note that below the transformed data with the outlier removed are used.

#bring data into RStudio
DATA <- data.frame(read.table(file='Dominica2.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))

#check that data was loaded properly
names(DATA)

#explore data (look for outliers)
install.packages("ggplot2")
library(ggplot2)
p <- ggplot(DATA, (aes(x=DBH, y=Epiphyte, color=Location, shape=Location))) + theme_classic() +ylab("Epiphyte Abundance") + xlab("DBH (cm)")
p + geom_point() + geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

#run the ANCOVA
ancova<-lm(Epiphyte~DBH*Location, data=DATA)

#check model assumptions
plot(density(ancova$residuals))
install.packages("stats")
library(stats)
qqnorm(ancova$residuals)
qqline(ancova$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(ancova$residuals~ancova$fitted.values)
lines(lowess(ancova$fitted.values,ancova$residuals), col="blue")
text(ancova$fitted.values, ancova$residuals, row.names(DATA), cex=0.6, pos=4, col="red")

#ONLY DO THIS IF A TRANSFORMATION IS NECESSARY
#transform response variable
sqrtEpiphyte = sqrt(DATA$Epiphyte)

#run the ANOVA with the transformed response variable
ancova<-lm(sqrtEpiphyte~DBH*Location, data=DATA)

#check model assumptions again
plot(density(ancova$residuals))
qqnorm(ancova$residuals)
qqline(ancova$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(ancova$residuals~ancova$fitted.values)
lines(lowess(ancova$fitted.values,ancova$residuals), col="blue")
text(ancova$fitted.values, ancova$residuals, row.names(DATA), cex=0.6, pos=4, col="red")

#examine ANOVA output
summary(ancova)
anova(ancova)

#creating a scatter plot
install.packages("ggplot2")
library(ggplot2)
p <- ggplot(DATA, (aes(x=DBH, y=Epiphyte, color=Location, shape=Location))) + theme_classic() +ylab("Epiphyte Abundance") + xlab("DBH (cm)")
p + geom_point() + geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

How to Get the Slope of each Line

To get the slopes, you simply need to run a linear regression for each category of the categorical explanatory variable without any transformations. Then you can get a summary of the model to get the slope. Remember that you are only using these data to get the slopes of each line; you are NOT using the model output here unless your response variable did not need to be transformed.

#run model
fit<-lm(Epiphyte~DBH, data=north)
summary(fit)

## 
## Call:
## lm(formula = Epiphyte ~ DBH, data = north)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0051 -0.9653 -0.3601  0.5973 12.4390 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.402413   0.230521  -1.746   0.0825 .  
## DBH          0.091485   0.007145  12.803   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.294 on 186 degrees of freedom
## Multiple R-squared:  0.4685, Adjusted R-squared:  0.4656 
## F-statistic: 163.9 on 1 and 186 DF,  p-value: < 2.2e-16

#run model
fit<-lm(Epiphyte~DBH, data=south)
summary(fit)

## 
## Call:
## lm(formula = Epiphyte ~ DBH, data = south)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4909  -2.6365  -0.6870   0.7011  28.2863 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.76459    0.72573  -1.054    0.294    
## DBH          0.25007    0.02436  10.264   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.312 on 129 degrees of freedom
## Multiple R-squared:  0.4495, Adjusted R-squared:  0.4453 
## F-statistic: 105.3 on 1 and 129 DF,  p-value: < 2.2e-16

The slope for the north side is 0.09 (it is the Estimate beside the continuous explanatory variable under “Coefficients”). The slope for the south side is 0.25. You can now use these numbers to calculate the effect size, or percent difference between them and include that information in your results statement.

*Written July 2018. Modified from http://stats.pugetsound.edu/ecology/

References DeWalt, S.J., K. Ickes, and A. James. (2016) Forest and community structure of tropical sub-montane rain forests on the island of Dominica, Lesser Antilles. Caribbean Naturalist 1:116-137.