An analysis of variance (ANOVA) is a parametric test that examines whether the mean values of a continuous response differ among groups of a categorical explanatory variable. ANOVA has two sampling variability assumptions that have to be met:
1. The residuals have to be normally distributed;
2. There is equal variance among groups (homogeneity of variance).
The residuals are the leftover variation that is not accounted for by your explanatory variable. You can use analyses of the residuals to check both of these assumptions. Before we can interpret our statistical output, we need to make sure the assumptions above are met. For more information about why you are not looking for the normality of your dependent variable but the residuals, see this great post by Karen Grace-Martin https://www.theanalysisfactor.com/the-distribution-of-dependent-variables/.
ANOVA relies on the F distribution (Figure 1). The F statistic is the variance in the response variable that is explained by the categorical explanatory variable divided by the variance in the response variable that remains unexplained (the error term or residuals). If your treatment or explanatory variable explains more variance than the error within each group (differences among means are large and variance within each mean is small), then the groups will be significantly different from each other.
Here are some statistical terms you should understand to interpret your ANOVA.
If the F statistic generated by your test is larger than the F critical value, the area under the curve to the right of your F statistic is smaller than 5% and, therefore, the p-value is < 0.05 and the means are significantly different. You have explained significantly more variance than would be expected by random chance. And if you say that there is a significant effect when there really isn't one, you will be wrong less than 5% of the time.
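If you want to see where these numbers come from, you can ask R for the critical value and the p-value directly. The sketch below uses the degrees of freedom from the example table further down (2 and 57); treat it as an optional illustration rather than part of the main workflow.
#code to find the critical F value and the p-value for an F statistic (df values from the example table below)
qf(0.95, df1 = 2, df2 = 57)
pf(142.5, df1 = 2, df2 = 57, lower.tail = FALSE)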
In the figure below, the F statistic can be conceptualized as the ratio of the variance you can explain (blue arrows) over the variance you can’t explain (red braces). If you sum up the blue arrows in each figure and divide them by the red braces, the F statistic is much smaller in the left figure than the right figure. That means you are explaining more variation than you aren’t explaining in the right figure (large differences among groups and small variation within groups) and the data in the right figure are statistically significant. The data in the left figure are not statistically significant because there are small differences among groups (small blue arrows) and large variation within groups (large red braces).
An ANOVA test will give you an output table that looks like the one below. The first column includes the explanatory variable you are testing (Treatment) as well as the Residuals, which represent the variance not explained by the explanatory variable (the error).
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Treatment | 2 | 3745 | 1248.4 | 142.5 | <2e-16 |
| Residuals | 57 | 499 | 8.8 | | |
The ‘Df’ column is the degrees of freedom for the explanatory variable and the residuals. The total df is the number of samples minus 1, which is then split between the explanatory variable and the residuals. The df for the explanatory variable is the number of groups minus one (so above there are 3 groups because df = 2). The df for the Residuals is the remainder of the degrees of freedom. So the total number of samples in the data above is 57 + 2 + 1 = 60 because the total df is 57 + 2 = 59.
The Sum Sq for Treatment is the sum of squared differences between the mean of each of your groups and the overall mean. The Sum Sq for Residuals is the sum of squared differences between each individual value and the mean of its group (the within-group variation that error bars represent).
The Mean Sq is the Sum Sq divided by Df. The Mean Sq Treatment is divided by the Mean Sq Residuals to get the F value. Pr(>F) is the p-value.
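To convince yourself of how the columns relate, you can redo the arithmetic from the example table in R. This sketch uses the Residuals row and the Mean Sq for Treatment; the relationships (Mean Sq = Sum Sq / Df and F = Mean Sq Treatment / Mean Sq Residuals) are what matter.
#code to reproduce the Mean Sq and F value from the example table
ms.residuals <- 499 / 57
f.value <- 1248.4 / ms.residuals
f.value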
For this test, we will be using a dataset of leaf thickness (LT; cm) of three different tank bromeliad species (Figure 2; C. Woods, unpublished data).
Below is a picture of the three tank bromeliads used in this dataset. They are all epiphytes, which means they live in the canopy of trees non-parasitically (they don’t derive nutrients from their host tree). They are all within the same genus but have varying distributions within tree crowns, which makes examining differences in their leaf characteristics interesting.
To keep your RStudio script file organized, you will want to include some information at the top of the file. You can use the hashtag (#) to add comments to your script, lines that R will not try to run.
For each test, you should include the following lines at the top of each RStudio file:
#question:
#response variable:
#explanatory variable:
#test name:
Below is what it would look like for the bromeliad example:
#question: does leaf thickness vary across bromeliad species?
#response variable: leaf thickness (continuous)
#explanatory variable: bromeliad species (categorical)
#test name: 1-way ANOVA
In order to run an ANOVA, you need to have your response variable in one column and your explanatory variable in another column so that there are replicates for each of your groups. In the image below, each individual of each species gets its own row and there are multiple individuals for each species. The ANOVA will take the mean and variance of each group (in this case for each species of bromeliad) and compare them. The analysis will not work if you just put in the mean values of each group because then it can’t calculate the variance around each mean.
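If you want to see what this “long” layout looks like in R, here is a small made-up example (the LT values are invented purely for illustration): each individual gets its own row, with the response in one column and the group in another.
#code to build a small example data frame in the long format an ANOVA expects (values are made up)
example <- data.frame(Species = c("TilAnc", "TilAnc", "TilBul", "TilBul", "TilMon", "TilMon"),
                      LT = c(0.36, 0.30, 1.35, 1.42, 0.33, 0.35))
example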
To bring data into RStudio, you can use the line of code below that will allow you to select any file directly from your computer.
#code to load a datafile into R from your computer
DATA <- read.csv(file.choose(), na.strings=".")
An alternative way to open your file is to use the path to the folder on your computer where you set your working directory. This will allow you to choose a named file that you saved in your working directory (to learn more about what your working directory is, go to the main webpage and click on Introduction to RStudio). In the DATA code line, TO.BE.EDITED is the name of your file. (Mac users: remember to omit the “C:” in your code.)
#code to set the working directory folder on your computer
setwd('C:/Users/YourUserName/Documents/Rfiles')
#code to load a datafile into R from your working directory
DATA <- read.csv(file="C:/TO.BE.EDITED.csv", header=TRUE, sep=",", na.strings=".")
You can download the Tillandsia_traits_anova.xls file from the home page, save it as a .csv file (MS-DOS Comma Separated in Mac) in your working directory and then you can use the code below to upload the Tillandsia_traits_anova.csv file into RStudio.
#code to load the datafile "Tillandsia_traits_anova.csv" into R from your working directory
DATA <- data.frame(read.table(file='Tillandsia_traits_anova.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))
To make sure it loaded correctly, you can copy and paste the following code to see the names of all of the columns in your file.
#code to see the names of the column titles
names(DATA)
## [1] "Species" "LT"
If the file was successfully loaded into RStudio, you should have the names of the columns above. The ‘Species’ column lists one of the three species (TilAnc, TilBul, or TilMon). If you got an error, go back to the main stats webpage and click on Troubleshooting.
You can also use the following lines of code to make sure your data was read in correctly. The first line of code results in the top 6 rows of your data being displayed instead of just the column names, and the second line of code gives you the dimensions of your datafile.
#code to see the top 6 rows of your data file
head(DATA)
## Species LT
## 1 TilAnc 0.3590000
## 2 TilAnc 0.3000000
## 3 TilAnc 0.2903333
## 4 TilAnc 0.3336667
## 5 TilAnc 0.3540000
## 6 TilAnc 0.4096667
#code to see the dimensions (number of rows and columns) of your datafile
dim(DATA)
## [1] 26 2
Once you are sure the data are loaded into R, you can proceed.
Before we run the test, it is a good idea to explore what the data look like. This gives you a chance to see if there are any outliers and to determine whether the data seem to support the prediction based on your hypothesis (presumably you would have some information about these bromeliad species before measuring their leaf traits, and a prediction about what you expect to find). We will use an example from the Tillandsia dataset and examine the differences in leaf thickness among species. For this test, leaf thickness (LT) is the continuous response variable and species (Species) is the categorical explanatory variable.
Use the following code to create a box plot of your data. This will show you any outliers as it shows the maximum and minimum values within each group. See this page for more information on what a box plot is: http://www.physics.csbsju.edu/stats/box2.html
#code to create a boxplot of the bromeliad data
boxplot(LT~Species, data=DATA, xlab = "Species", ylab = "Leaf thickness")
Looking at these data it is clear that there are differences among the species but it is also clear that there is an outlier in TilMon (indicated by the circle in the boxplot at ~ 1.2). Knowing this plant, there is no way that it could have a leaf thickness of 1.2 cm. So, what do you do? Check if the data somehow have been incorrectly recorded. Here it turns out that TilMon and TilBul were confused. Note that you have to be certain of such confusions before you alter data; just a hunch is not necessarily enough. Any change you do make has to be justified either by an error that you know you made (as in this case) or an outlier test (see this website for information about outlier tests: http://r-statistics.co/Outlier-Treatment-With-R.html).
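If you want to confirm which row is the culprit before editing the file, you can pull it up in R. The sketch below assumes the outlier is the TilMon value above 1 that you can see in the boxplot.
#code to display the suspicious TilMon row (assumes the outlier is the value above 1)
DATA[DATA$Species == "TilMon" & DATA$LT > 1, ]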
If you make any changes to your data, save the corrected data with a different name and keep a copy of the original. Save the file as Tillandsia_traits2.csv and use the following code to re-upload your data. Because the new file is read into the same object name as before (DATA), the rest of the code remains the same and the dataset is simply replaced with the corrected one.
#code to load the updated datafile "Tillandsia_traits2.csv" into R from your working directory
DATA <- data.frame(read.table(file='Tillandsia_traits2.csv', sep=',', header=TRUE, fill=TRUE))
If you run the box plot code again, it should get rid of the outlier.
#code to create a boxplot of the bromeliad data
boxplot(LT~Species, data=DATA, xlab = "Species", ylab = "Leaf thickness")
The new figure shows that the outlier is gone.
Now you can run the ANOVA. Below is the code you will use to run the ANOVA. Remember that to actually run the test, you need to change the Response variable and Explanatory variable to your own variable names. Use the code for names(DATA) above to make sure you write in the variables exactly as they are in the data file. If you write in species instead of Species, for example, the test will not work because R won’t be able to find species in your datafile.
#code to run the anova to test model assumptions
anova2<-aov(Response variable~Explanatory variable, data=DATA)
Use the code below to run this test.
#code to run the anova for the bromeliad data to test model assumptions
anova2<-aov(LT~Species, data=DATA)
You will notice that not much happened. If you didn’t get any error messages, great! That means the test ran successfully. R runs the code and stores the information; you have to do more work to get at the results of the test. But first we need to make sure we can even look at the results.
You never look at the results of the test until AFTER you check that the assumptions of the model were met. So next we will examine if our data meet the assumptions of 1) residuals following a normal distribution and 2) equal variance among groups (the residuals from each group should be relatively similar).
The difference between the observed and predicted values, called the residuals, is a measure of the error associated with each observation. It is the variation that is not explained by the explanatory variable. We plot the residuals in various ways to examine normality and homogeneity of variances.
To test model assumption 1 (whether the residuals are normally distributed), you will use 2 pieces of information:
1. a density plot of residuals
2. a Q-Q plot of residuals
To create a density plot of residuals, use the following code:
#code to create the density plot of residuals
plot(density(anova2$residuals))
Examine the density plot. It should follow a more or less bell-shaped curve. It can be hard to determine whether or not your density plot is “normal,” as it takes practice. If there are any weird bumps, or the curve is clearly skewed to one side or bimodal, chances are your residuals are not normal. This density plot looks relatively normal except for the strange bump on the right. ANOVA is quite robust to violations of this assumption, so even if your data don’t look like a perfect bell curve, the assumption can still be met. Below are six density plots of residuals; the top three do not meet the assumption of normality but the bottom three do (Figure 4). If your density plot of residuals looks like one of the top three, you need to either transform your response variable or use a non-parametric test. See My Data Are Not Normal - What Do I Do? below for help. Since your plot looks pretty normal, we can move on to the other model assumption checks below.
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. Here’s an example of a Normal Q-Q plot when both sets of quantiles truly come from Normal distributions.
Q-Q plots take your sample data, sort it in ascending order, and then plot them versus quantiles calculated from a theoretical distribution, in this case the normal distribution. If the points fall pretty closely along the line, the data are normal.
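If you are curious what the Q-Q plot is doing behind the scenes, the sketch below builds one by hand: it sorts the residuals and plots them against quantiles from a normal distribution. This is only to illustrate the idea; the qqnorm() code in the next step is what you will actually use.
#code to build a Q-Q plot by hand to see how it works
sorted.res <- sort(anova2$residuals)
theoretical <- qnorm(ppoints(length(sorted.res)))
plot(theoretical, sorted.res, xlab = "Theoretical quantiles", ylab = "Sample quantiles")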
To be able to run the code that creates a Q-Q plot, you need to load the package stats. You can do this by running the code below.
#code to load package "stats" into R
library(stats)
#code to create a Q-Q plot
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
This plot looks pretty good. There’s one point on the right side that is pretty high but, in general, the points align with each other.
To test whether the error terms for each group have similar variance (homogeneity of variance, assumption 2 above), we will use a plot of the residuals vs. the fitted values. Use the code below to create your plot:
#code to create a fitted values vs residuals plot
plot(anova2$residuals~anova2$fitted.values)
lines(lowess(anova2$fitted.values,anova2$residuals), col="blue")
For this plot, you are looking for no patterns. If you see a cone where the vertical variation on one side of the figure is smaller than the vertical variation on the other side, the variation is not homogeneous and you need to transform your data. The plot in this example suggests such a problem. The points on the left side cover a much smaller range of values than the points on the right side. If your plot looks like this, your data violate the assumption of homogeneity of variance. In this case, you need to transform your data.
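If you would like a numeric check to go along with this plot, base R’s bartlett.test() tests for homogeneity of variance among groups. This is an optional supplement, not part of the main workflow here; a small p-value (< 0.05) suggests the variances are not equal.
#code to run an optional Bartlett test of homogeneity of variance
bartlett.test(LT ~ Species, data = DATA)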
If your data satisfy the assumptions of the model, go directly to Interpreting ANOVA Output below.
For this dataset, you will try a square root transformation of your response variable (note that when doing transformations, you transform the response variable, not the explanatory variable). Use the code below to create a new variable called sqrtLT, which is the square root of leaf thickness. You can also plot the transformed response variable to check that your code worked.
#code to square-root transform your response variable LT
sqrtLT = sqrt(DATA$LT)
#code to create a boxplot of the transformed bromeliad data
boxplot(sqrtLT~Species, data=DATA, xlab = "Species", ylab = "Leaf thickness")
Now that the transformation worked (as evidenced in your figure), you can use the code below to re-run the anova and model assumption tests:
#code to run the anova test on the transformed data
anova2<-aov(sqrtLT~Species, data=DATA)
#code to create the density plot of residuals
plot(density(anova2$residuals))
#code to create a Q-Q plot
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
#code to create a fitted values vs. residuals plot
plot(anova2$residuals~anova2$fitted.values)
After transforming the data, they look better! The density plot of residuals looks more normal, the point on the right side of the Q-Q plot looks less like an outlier, and the variation on the left side has increased (note the change in scale on the y-axis in the fitted vs. residuals plot - it is much smaller, which means even though the points don’t extend equally on the left and right, the difference between the spreads is much smaller).
Another quick way to examine these diagnostic plots is to use the code below. It produces four plots (press Enter in the R console to step through them). Note that you don’t have a choice about which plots you get with this code - you get all of them and have to go through them:
1. A fitted vs. residuals plot. In this plot look for a non-linear pattern in the red line or uneven spread around the line (narrow on one side and wide on the other) just like you did earlier for the fitted vs. residuals plot.
2. A Q-Q plot. This is a repeat of the Q-Q plot from above.
3. A spread-location plot. In this plot, look for even distribution of dots above and below the line (even spread across the line).
4. A residuals vs. leverage plot. In this plot, you are looking for influential points - those that fall **past the dashed red line**.
For more detail on interpreting these plots for model assumptions, check out this great post by Bommae Kim at the University of Virginia library: https://data.library.virginia.edu/diagnostic-plots/
#code for four diagnostic plots
plot(anova2)
While Sample 22 looks like it is having an influence as it isn’t exactly on the line in the Q-Q plot, it doesn’t cross the dashed line in the residuals vs. leverage plot and so it should not be removed (unless you had a better justification). You now can finally look at the ANOVA output.
### Interpreting ANOVA Output
Use the code below to get the model summary.
#code to get the anova model summary
summary(anova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 1.7406 0.8703 338.3 <2e-16 ***
## Residuals 23 0.0592 0.0026
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the ANOVA output, there is a significant effect of species on leaf thickness (but you should have already guessed that by the large differences that you saw in your boxplot). You should copy and paste the summary into a Word doc or Excel file so you record it somewhere. In particular, you want to go through the Background Theory above to really understand all aspects of this output table. For the results section of your paper, you should note the F-value, the df for your explanatory variable (in this case Species) and your Residuals, and the p-value (Pr(>F)).
While the ANOVA was significant, this only tells you whether the explanatory variable (Species) significantly influenced the response variable (leaf thickness). It doesn’t tell you which species are significantly different from each other. To determine that, you can use a Tukey’s Honestly Significant Difference (HSD) test that will do pairwise comparisons of all of your means. The following code is used to run the Tukey HSD test. Note that “TO.BE.EDITED” is where you put in your Explanatory variable.
#code to run a Tukey test
TukeyHSD(anova2, "TO.BE.EDITED", ordered=TRUE)
For our test, we are examining the differences in leaf thickness (LT) by species, so we use “Species” as our “group”. Below is the code for our test. This output will provide the p-values for each comparison.
#code to run a Tukey test across bromeliad species
TukeyHSD(anova2, "Species", ordered = TRUE)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = sqrtLT ~ Species, data = DATA)
##
## $Species
## diff lwr upr p adj
## TilMon-TilAnc 0.002714374 -0.05565111 0.06107986 0.9925519
## TilBul-TilAnc 0.584603228 0.52200299 0.64720346 0.0000000
## TilBul-TilMon 0.581888853 0.51787260 0.64590511 0.0000000
This output gives you the p-values associated with each comparison of your groups (in this case of your species). This tells you which means are significantly different from each other. So in our example, the leaf thickness of TilBul is significantly different from TilMon (p < 0.001) and TilAnc (p < 0.001) but TilMon and TilAnc are not significantly different from each other (p = 0.993). Look back at the photos of these species above, and you can see that this correlates well with our initial visual observations.
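If you want to pull the adjusted p-values out of the Tukey output rather than reading them off the printed table, you can index into the result. This is an optional sketch; tukey.out is just an example object name.
#code to extract the adjusted p-values from the Tukey output
tukey.out <- TukeyHSD(anova2, "Species", ordered = TRUE)
tukey.out$Species[, "p adj"]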
### Presenting your Results
#### Results Statements
The results section of your paper should begin with a narrative of your results statements. These statements should be quantitative in nature and include 1. the statistical significance and 2. the biological significance.
Statistical Significance: Your first sentence should state the result of the statistical test (in this case, whether the three species differ significantly in leaf thickness). Include the statistical details in parentheses at the end of the sentence only. You should never write “The p-value was…” or “The F-stat was…, which means…”. You don’t write about your statistics; you write the biological result and include the statistics in parentheses. This addresses whether your results were statistically significant.
Biological Significance: For the quantitative statements, include means and standard errors for the interesting comparisons, along with their effect sizes, and state whether the p-value from the Tukey test for each comparison was < 0.05. You can compare every mean with every other mean or focus on the comparisons that make sense in your paper.
To help you get means and standard errors, you can install the psych package, and then load it using the following code:
#code to install the "psych" package
install.packages("psych")
#code to load the "psych" package into R
library(psych)
The following code shows you how you can get your response data summarized by your explanatory variable.
#code to get summary statistics of your response variable for each group of your explanatory variable
describeBy(DATA$response variable, group=DATA$explanatory variable)
For the bromeliad example, use the following code to get a summary of the data including means and standard errors.
#code to get the summary statistics of leaf thickness across bromeliad species
describeBy(DATA$LT, group=DATA$Species)
##
## Descriptive statistics by group
## group: TilAnc
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10 0.34 0.04 0.33 0.33 0.04 0.29 0.41 0.12 0.44 -1.38 0.01
## ------------------------------------------------------------
## group: TilBul
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 7 1.36 0.18 1.35 1.36 0.18 1.15 1.68 0.53 0.41 -1.38 0.07
## ------------------------------------------------------------
## group: TilMon
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 9 0.34 0.04 0.35 0.34 0.02 0.26 0.38 0.12 -1.04 -0.61 0.01
In this output, n = the number of replicates (i.e., sample size) for that species; mean = average; sd = standard deviation; se = standard error. You can use these data to create your results statements and figure (or you can calculate them in Excel). To learn how to calculate the mean, standard deviation and standard error in Excel use the link below to watch a video.
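If you prefer not to use the psych package, you can get the same means and standard errors with base R. This is just one way to do it, shown as a sketch.
#code to calculate group means and standard errors in base R
tapply(DATA$LT, DATA$Species, mean)
tapply(DATA$LT, DATA$Species, function(x) sd(x) / sqrt(length(x)))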
Video on how to calculate a mean, standard deviation and standard error in Excel: https://www.screencast.com/t/wFu7UJZbU3cg. This can also be found on the Helpful Resources page of this website.
The first results statement, which includes the statistical results in parentheses, should also reference the figure you are referring to. Remember to always refer to figures in the order in which they appear, and to include your results narrative before the figure.
#### Results Figure
A bar chart with means and standard error bars or a box plot are commonly used to report the results of an ANOVA. Standard convention varies by field but in ecology, we use a bar chart. A bar chart or a box plot are used when your response variable is continuous and your explanatory variable is categorical; each group in your explanatory variable is represented by each bar. You can use Excel to create a bar chart. Use the link below to watch a video on how to make a bar chart in Excel: https://www.screencast.com/t/Dv6xg1yXJT
#### Figure Legend
A caption must be included below each figure.
The caption for an ANOVA must include:
1. A short descriptive title following the figure number
2. A description of what you plotted (including the type of error bars, if appropriate)
3. Your sample size (e.g., # transects/site)
4. The p-value for the overall ANOVA. For p-values from pairwise comparisons, use letters over the bars and state something like “means that share a letter are not significantly different from each other at p < 0.05.”
#### Example Results Statement and Figure (Excel was used to generate this figure)
Leaf thickness differed significantly among the three bromeliad species (1-way ANOVA, F = 338.3, df = 2,23, p < 0.001, Figure 4). Tillandsia bulbosa (TilBul) had a leaf thickness that was 303% greater than that of Tillandsia anceps (TilAnc; p < 0.001) and 300% greater than that of Tillandsia monadelpha (TilMon; p < 0.001). The leaf thicknesses of Tillandsia anceps and Tillandsia monadelpha were not significantly different from each other (p = 0.993).
Figure 4. Average (± SE) leaf thickness for three tank bromeliad species. Leaf thickness varied significantly among species (p < 0.05). TilAnc = Tillandsia anceps (n = 10), TilBul = Tillandsia bulbosa (n = 7), and TilMon = Tillandsia monadelpha (n = 9). Bars with different letters are significantly different from each other according to a Tukey’s HSD test (p < 0.05).
### My Data Are Not Normal - What Do I Do?
When you can’t meet the assumptions of the model, you can either:
1. Transform the data
2. Run a non-parametric test
Transforming your data means modifying the response variable so that the assumptions are met. This often means taking the square root or the log of the response variable and running the test again. Transformations work because the rank order of the replicates stays the same (sample 1 is still lower than sample 2, for example) but the absolute distances between them are reduced. If your density plot is bimodal (two large humps in your curve), a transformation will often make those humps smaller. You still present your data using the raw numbers, not the transformed numbers.
To transform your data for the example above by taking the log (first line of code below) or the square root (second line of code below) of your response variable, you can use the following code (Note: you did the square root transformation above):
#code to take the log of leaf thickness
logLT = log10(DATA$LT)
#code to take the square root of leaf thickness
sqrtLT = sqrt(DATA$LT)
Then you can run the analyses again using your new variable (logLT or sqrtLT) in place of your original variable (LT). You can’t take the log of 0, so if your dataset contains zeroes, either use the square root transformation or do a log(x+1) transformation. Neither the log nor the square root of a negative number is defined, so if you have negative values, you will need to add a constant that makes all of them positive before transforming.
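If you need the log(x+1) version mentioned above, the sketch below shows what it would look like for this dataset (the variable name logLT1 is just an example).
#code to take the log of leaf thickness plus 1 (useful when the data contain zeroes)
logLT1 = log10(DATA$LT + 1)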
If you can’t meet the assumptions even after transforming your data, you can use a non-parametric test. A non-parametric test does not assume an underlying distribution (such as the F-distribution) and, therefore, does not need to meet the assumption of normality as the parametric ANOVA does.
The non-parametric test for an ANOVA is called the Kruskal-Wallis test. Use the code below to run the Kruskal-Wallis test:
#code to run a Kruskal-Wallis test of leaf thickness across bromeliad species
kruskal.test(LT~Species, data=DATA)
##
## Kruskal-Wallis rank sum test
##
## data: LT by Species
## Kruskal-Wallis chi-squared = 14.892, df = 2, p-value = 0.0005838
In your results statement on whether the analysis was significant or not, you should include the chi-square value, df, and p-value in parentheses.
## Quick ANOVA
Here is all of the code you need to quickly run the ANOVA. Note that the second half of the code re-runs everything with the transformed data; only use that part if a transformation is necessary.
#bring data into RStudio
DATA <- data.frame(read.table(file='Tillandsia_traits2.csv', sep=',', header=TRUE, fill=TRUE, na.strings="."))
#check that data was loaded properly
names(DATA)
#explore data (look for outliers)
boxplot(LT~Species, data=DATA, xlab = "Species", ylab = "Leaf thickness")
#run the ANOVA
anova2<-aov(LT~Species, data=DATA)
#check model assumptions
plot(density(anova2$residuals))
install.packages("stats")
library(stats)
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(anova2$residuals~anova2$fitted.values)
plot(anova2)
#ONLY DO THIS IF A TRANSFORMATION IS NECESSARY
#transform response variable
sqrtLT = sqrt(DATA$LT)
#plot data to make sure the transformation worked
plot(sqrtLT~DATA$Species)
#run the ANOVA with the transformed response variable
anova2<-aov(sqrtLT~Species, data=DATA)
#check model assumptions again
plot(density(anova2$residuals))
install.packages("stats")
library(stats)
qqnorm(anova2$residuals)
qqline(anova2$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(anova2$residuals~anova2$fitted.values)
plot(anova2)
#examine ANOVA output
summary(anova2)
#run a post-hoc Tukey HSD test
install.packages("agricolae")
library(agricolae)
Tukey<-HSD.test(anova2, "Species", group=TRUE)
Tukey
#for summary statistics
install.packages("psych")
library(psych)
describeBy(DATA$LT, group=DATA$Species)
### Code to Create a Boxplot
Below is the code needed to generate a box plot, but remember that ecology generally doesn’t use box plots. To generate a box plot, you need to install and load the package “ggplot2” using the following code.
#code to install package "ggplot2"
install.packages("ggplot2")
#code to load package "ggplot2" into R
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
To create the box plot, use the code below.
#code to create a boxplot of leaf thickness across bromeliad species
ggplot(DATA, aes(Species, LT, color = Species ))+ geom_boxplot() + theme_classic() +ylab ("Leaf thickness (cm)")
You can do a lot with the ggplot function, including adding letters above each of the bars that you got from your Tukey test above. The code below first calls your plot “p” and then adds in letters as “text” wherever you set the x and y coordinates (e.g., y = level on the y-axis where you want to place the letter).
#code to create a boxplot of leaf thickness across bromeliad species and add letters above the bars
p <- ggplot(DATA, aes(Species, LT, color = Species ))+ geom_boxplot() + theme_classic() +ylab("Leaf thickness (cm)")
p + annotate("text", x = 1, y = 0.5, label = "a") +
annotate("text", x = 2, y = 1.8, label = "b")+
annotate("text", x = 3, y = 0.5, label = "a")
*Written July 2018. Modified from http://stats.pugetsound.edu/ecology/