Today we will be conducting an ANOVA in R. R is statistical software used frequently by biologists. This tutorial assumes no prior knowledge of R.
ANOVA Background: ANOVA stands for Analysis of Variance. This is a statistical test that compares the means of different groups. We will be conducting a one-way ANOVA, which means we have one factor. A factor is a categorical variable we’ll be evaluating. Essentially, a one-way ANOVA extends the t-test to more than 2 groups.
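Once R is installed (the steps that follow), the link between the t-test and a one-way ANOVA can be checked directly. The sketch below uses only the fertilizer A and B heights from the practice dataset. The name `two_groups` is made up for this illustration. It assumes a t-test with equal variances (`var.equal = TRUE`), because R's default t-test is the Welch version, which does not assume equal variances.

```r
# Heights for fertilizers A and B, copied from the practice dataset
two_groups <- data.frame(
  Fertilizer = rep(c("A", "B"), each = 10),
  Plant_Height_cm = c(
    15.2, 16.8, 14.5, 17.0, 15.5, 16.2, 15.9, 14.8, 16.1, 15.7,  # A
    18.1, 19.3, 17.8, 18.5, 19.0, 18.6, 17.9, 18.2, 19.4, 18.7   # B
  )
)

# Equal-variance t-test comparing the two groups
t_res <- t.test(Plant_Height_cm ~ Fertilizer, data = two_groups, var.equal = TRUE)

# One-way ANOVA on the same two groups
aov_res <- summary(aov(Plant_Height_cm ~ Fertilizer, data = two_groups))

# Pull out the p values and test statistics
t_p   <- t_res$p.value
aov_p <- aov_res[[1]][["Pr(>F)"]][1]
t_sq  <- as.numeric(t_res$statistic)^2   # t statistic, squared
aov_F <- aov_res[[1]][["F value"]][1]

# With two groups, the p values match and F equals t squared
all.equal(t_p, aov_p, tolerance = 1e-6)
all.equal(t_sq, aov_F, tolerance = 1e-6)
```

With three or more groups there is no single t-test to compare against, which is where the ANOVA comes in.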
Tutorial dataset: This practice dataset is evaluating the effect of fertilizer type on plant growth.
You should run through the tutorial with the practice dataset first. Make sure all outputs are the same. Once you’re successful, modify the code for your own dataset!
Find the practice dataset on D2L.
Step 1: The first thing we need to do is download the R software.
Windows: On your personal computer, visit this link https://cran.r-project.org/bin/windows/base/ to download the latest version of R.
Mac: On your personal computer, visit this link https://cran.r-project.org/bin/macosx/ to download the latest version of R.
If you open R, it will look like this. But this is only a console.
R Console
Step 2: Unless you’re familiar with command lines, R is going to be difficult to work in, which is why you will download RStudio. RStudio is an integrated development environment (IDE) for R. Follow this link to locate the download links for Windows and Mac. https://posit.co/download/rstudio-desktop/
When you open RStudio, it should look like this (but in a different color).
RStudio
You’ll notice that the program opens with 4 boxes.
Now that you have RStudio downloaded, you need to input your data from Excel into R.
Step 1: Create a new R script. File –> New File –> R Script
R Script
Step 2: Set your working directory. This tells R what folder you’re working out of in your computer. Navigate to Session –> Set Working Directory –> Choose Directory, then navigate to the folder your csv file is in.
Now, look down at the console. setwd means set working directory. The text in quotes is the path to the folder. Yours will look different!
setwd
To check your current working directory, type getwd() into your console! This is helpful if you move between directories and get lost.
setwd("C:/Users/rache/OneDrive - University of North Georgia/Course/BotanyAnova")
Alternatively, if you are comfortable with paths, you can set your working directory manually by typing setwd followed by the absolute path to the folder you will work out of.
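As a small sketch of how getwd() and setwd() work together, the lines below move to another folder and then come back. The temporary folder from tempdir() is used only so the example runs on any computer. In practice you would use the folder that holds your csv file. The name `start_dir` is made up for this illustration.

```r
# Remember where we started so we can come back later
start_dir <- getwd()

# Move to a different folder (a temporary one, just for this demonstration)
setwd(tempdir())
getwd()   # prints the temporary folder

# Return to the folder we started in
setwd(start_dir)
getwd()   # prints the starting folder again
```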
Step 3: Import the csv containing your data. There are 3 things you need to import the csv file.
read.csv
File <- read.csv("practicedata.csv")
This piece of code is telling R to read the csv named practicedata and store it as a dataframe named File. Note that the name you store your data under can be anything, but it’s best practice to make the name something meaningful.
Hit Run
Run
Once we hit run, that code goes to the console and R does its thing; it does the action we specified. How do we know it worked? Well, take a look at the environment box, there’s something new there.
Under Data there is a row that says “File 40 obs. of 2 variables”. What does that mean?
Well, File is what we named our dataframe. Then, if you look at our csv file, we have 40 datapoints (observations). There are 2 columns in the data (variables): the fertilizer type (A, B, C, or Control) and the plant height.
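The same "40 obs. of 2 variables" information can also be checked from code. The sketch below is self-contained, so it first writes the practice values to a temporary csv file that stands in for the one on D2L. The names `practice` and `csv_path` are made up for this illustration. With the real file in your working directory, only the read.csv(), nrow(), ncol() and str() lines would be needed.

```r
# Rebuild the practice data and save it as a csv in a temporary folder
practice <- data.frame(
  Fertilizer = rep(c("A", "B", "C", "Control"), each = 10),
  Plant_Height_cm = c(
    15.2, 16.8, 14.5, 17.0, 15.5, 16.2, 15.9, 14.8, 16.1, 15.7,  # A
    18.1, 19.3, 17.8, 18.5, 19.0, 18.6, 17.9, 18.2, 19.4, 18.7,  # B
    10.5, 11.0, 10.0, 11.3, 10.8, 10.7, 11.1, 10.4, 11.0, 10.6,  # C
    10.5, 11.0, 10.2, 11.3, 10.8, 10.9, 11.1, 10.4, 11.2, 10.7   # Control
  )
)
csv_path <- file.path(tempdir(), "practicedata.csv")
write.csv(practice, csv_path, row.names = FALSE)

# Import the csv, exactly as in Step 3
File <- read.csv(csv_path)

nrow(File)   # number of observations (rows)
ncol(File)   # number of variables (columns)
str(File)    # column names, types, and the first few values
```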
Let’s make sure our dataframe imported correctly. There are a few ways we can do this!
- Option 2: Type the name of your dataframe in your console and hit enter.
File
## Fertilizer Plant_Height_cm
## 1 A 15.2
## 2 A 16.8
## 3 A 14.5
## 4 A 17.0
## 5 A 15.5
## 6 A 16.2
## 7 A 15.9
## 8 A 14.8
## 9 A 16.1
## 10 A 15.7
## 11 B 18.1
## 12 B 19.3
## 13 B 17.8
## 14 B 18.5
## 15 B 19.0
## 16 B 18.6
## 17 B 17.9
## 18 B 18.2
## 19 B 19.4
## 20 B 18.7
## 21 C 10.5
## 22 C 11.0
## 23 C 10.0
## 24 C 11.3
## 25 C 10.8
## 26 C 10.7
## 27 C 11.1
## 28 C 10.4
## 29 C 11.0
## 30 C 10.6
## 31 Control 10.5
## 32 Control 11.0
## 33 Control 10.2
## 34 Control 11.3
## 35 Control 10.8
## 36 Control 10.9
## 37 Control 11.1
## 38 Control 10.4
## 39 Control 11.2
## 40 Control 10.7
Both Option 1 and Option 2 give you all the contents of your dataframe. This works for small datasets, but if you have a large dataset, you likely don’t want to use these options.
Instead, you can look at the first few lines of data with Option 3.
- Option 3: Use the head() function. I can write head(File). This returns the first six rows of your dataset. Remember, this needs to be the name of YOUR dataframe within the ().
head(File)
## Fertilizer Plant_Height_cm
## 1 A 15.2
## 2 A 16.8
## 3 A 14.5
## 4 A 17.0
## 5 A 15.5
## 6 A 16.2
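A few related checks can be sketched in the same way. head() takes an optional `n` argument for the number of rows, tail() shows the last rows instead, and table() counts how many rows belong to each group. The dataframe is rebuilt from the practice values here only so the example runs on its own.

```r
# Practice data, typed in so this example is self-contained
File <- data.frame(
  Fertilizer = rep(c("A", "B", "C", "Control"), each = 10),
  Plant_Height_cm = c(
    15.2, 16.8, 14.5, 17.0, 15.5, 16.2, 15.9, 14.8, 16.1, 15.7,  # A
    18.1, 19.3, 17.8, 18.5, 19.0, 18.6, 17.9, 18.2, 19.4, 18.7,  # B
    10.5, 11.0, 10.0, 11.3, 10.8, 10.7, 11.1, 10.4, 11.0, 10.6,  # C
    10.5, 11.0, 10.2, 11.3, 10.8, 10.9, 11.1, 10.4, 11.2, 10.7   # Control
  )
)

head(File, n = 3)        # first 3 rows only
tail(File, n = 3)        # last 3 rows
table(File$Fertilizer)   # number of plants in each fertilizer group
```

Counting the rows per group is a quick way to catch typos in group names, since a misspelled group shows up as its own category.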
Step 1. The below bit of code runs an ANOVA in R. Let’s break it down.
aov.output <- aov(Plant_Height_cm ~ Fertilizer, data = File)
- aov() This is the function for running an ANOVA in R.
- Plant_Height_cm ~ Fertilizer This is telling R to run the ANOVA using the columns named “Plant_Height_cm” and “Fertilizer”. The measurement goes to the left of the ~ and the grouping variable goes to the right. Remember, we’re looking at the effect of fertilizer type on plant height! How would you set this up for your dataset? Note that you’ll have to type the names EXACTLY the way you have them in your csv file. Don’t use spaces in column names, use _ instead.
- data = File This is telling the ANOVA function where to pull the data from!
- aov.output This is where we are putting the results of the ANOVA.
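Before running the test, it can help to look at the group means that the ANOVA compares. The sketch below rebuilds the practice data so it runs on its own, then uses tapply() to average Plant_Height_cm within each Fertilizer group. The name `group_means` is made up for this illustration.

```r
# Practice data, typed in so this example is self-contained
File <- data.frame(
  Fertilizer = rep(c("A", "B", "C", "Control"), each = 10),
  Plant_Height_cm = c(
    15.2, 16.8, 14.5, 17.0, 15.5, 16.2, 15.9, 14.8, 16.1, 15.7,  # A
    18.1, 19.3, 17.8, 18.5, 19.0, 18.6, 17.9, 18.2, 19.4, 18.7,  # B
    10.5, 11.0, 10.0, 11.3, 10.8, 10.7, 11.1, 10.4, 11.0, 10.6,  # C
    10.5, 11.0, 10.2, 11.3, 10.8, 10.9, 11.1, 10.4, 11.2, 10.7   # Control
  )
)

# Mean plant height for each fertilizer group
group_means <- tapply(File$Plant_Height_cm, File$Fertilizer, mean)
group_means
# A = 15.77, B = 18.55, C = 10.74, Control = 10.81
```

The ANOVA asks whether differences this large between group means are bigger than we would expect from the scatter within each group.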
aov.output <- aov(Plant_Height_cm ~ Fertilizer, data = File)
Okay…..Where are the results of our ANOVA?!
Step 2: To view the results of our ANOVA, we need to call our ANOVA table up. We can do this with this line of code.
summary(aov.output)
This table pops up in our console.
## Df Sum Sq Mean Sq F value Pr(>F)
## Fertilizer 3 446.3 148.78 480 <2e-16 ***
## Residuals 36 11.2 0.31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What information do we need? Do you see the column labeled Pr(>F)? That’s our p value! The asterisks beside it are significance codes, explained by the key at the bottom of the table.
For Fertilizer, our p value is less than 2e-16. In other words, our p value is very close to 0. Do you remember what a small p value means?
That’s right! A small p value means we reject our null hypothesis that all group means are equal. Fertilizer does have an effect on plant height!
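If you would rather pull the p value out of the table than read it by eye, one possible sketch is below. It rebuilds the practice data so it runs on its own. The names `aov.table`, `p.value` and `f.value` are made up for this illustration, and 0.05 is assumed as the significance cutoff.

```r
# Practice data, typed in so this example is self-contained
File <- data.frame(
  Fertilizer = rep(c("A", "B", "C", "Control"), each = 10),
  Plant_Height_cm = c(
    15.2, 16.8, 14.5, 17.0, 15.5, 16.2, 15.9, 14.8, 16.1, 15.7,  # A
    18.1, 19.3, 17.8, 18.5, 19.0, 18.6, 17.9, 18.2, 19.4, 18.7,  # B
    10.5, 11.0, 10.0, 11.3, 10.8, 10.7, 11.1, 10.4, 11.0, 10.6,  # C
    10.5, 11.0, 10.2, 11.3, 10.8, 10.9, 11.1, 10.4, 11.2, 10.7   # Control
  )
)

aov.output <- aov(Plant_Height_cm ~ Fertilizer, data = File)

# summary() returns a list; its first element is the ANOVA table
aov.table <- summary(aov.output)[[1]]

# The first row of the table is Fertilizer, the second is Residuals
p.value <- aov.table[["Pr(>F)"]][1]
f.value <- aov.table[["F value"]][1]

p.value          # the exact p value, without the "<2e-16" rounding
p.value < 0.05   # TRUE means we reject the null hypothesis at the 0.05 level
```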
But wait…we had a control and 3 fertilizer treatments….Do…all the treatments have an effect?
We don’t know. The ANOVA can only tell us if there is an effect of fertilizer, not where that effect is.
We need another test!
If our p value is significant, we can run a Tukey test to determine where the differences are. A Tukey test is a type of Post Hoc analysis (this means you run it after your ANOVA) that compares pairwise differences among sample means.
TukeyHSD(aov.output)
TukeyHSD This is the function for a Tukey test.
(aov.output) This is what you stored your ANOVA results as in the previous steps.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Plant_Height_cm ~ Fertilizer, data = File)
##
## $Fertilizer
## diff lwr upr p adj
## B-A 2.78 2.1094219 3.4505781 0.0000000
## C-A -5.03 -5.7005781 -4.3594219 0.0000000
## Control-A -4.96 -5.6305781 -4.2894219 0.0000000
## C-B -7.81 -8.4805781 -7.1394219 0.0000000
## Control-B -7.74 -8.4105781 -7.0694219 0.0000000
## Control-C 0.07 -0.6005781 0.7405781 0.9921101
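To read this table, look at the p adj column: pairs with a small adjusted p value differ from each other. The sketch below picks out those pairs with code, assuming 0.05 as the cutoff. It rebuilds the practice data so it runs on its own, and it stores Fertilizer as a factor with factor(), since TukeyHSD() works on factors. The names `tukey.matrix` and `significant` are made up for this illustration.

```r
# Practice data, typed in so this example is self-contained;
# factor() marks Fertilizer as a grouping variable
File <- data.frame(
  Fertilizer = factor(rep(c("A", "B", "C", "Control"), each = 10)),
  Plant_Height_cm = c(
    15.2, 16.8, 14.5, 17.0, 15.5, 16.2, 15.9, 14.8, 16.1, 15.7,  # A
    18.1, 19.3, 17.8, 18.5, 19.0, 18.6, 17.9, 18.2, 19.4, 18.7,  # B
    10.5, 11.0, 10.0, 11.3, 10.8, 10.7, 11.1, 10.4, 11.0, 10.6,  # C
    10.5, 11.0, 10.2, 11.3, 10.8, 10.9, 11.1, 10.4, 11.2, 10.7   # Control
  )
)

aov.output <- aov(Plant_Height_cm ~ Fertilizer, data = File)

# The Tukey results for Fertilizer are stored as a matrix
tukey.matrix <- TukeyHSD(aov.output)$Fertilizer

# Keep only the rows where the adjusted p value is below 0.05
significant <- tukey.matrix[tukey.matrix[, "p adj"] < 0.05, , drop = FALSE]
rownames(significant)
```

Five of the six pairs differ. The one pair left out is Control-C (p adj of about 0.99), so fertilizer C plants are not distinguishable from the control plants in this dataset. If TukeyHSD() reports "no factors in the fitted model" on your own data, converting the grouping column with as.factor() may help.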
And like that, we’ve completed an ANOVA with only a few lines of code!
File <- read.csv("practicedata.csv")
aov.output <- aov(Plant_Height_cm ~ Fertilizer, data=File)
summary(aov.output)
TukeyHSD(aov.output)
Now, use your Photosynthesis data to conduct an ANOVA and, if needed, a Tukey test in R. Use the practice csv sheet to help you set up your data before importing it into R. Remember to save the Excel spreadsheet as a .csv, not a .xlsx (the standard Excel format). In addition, avoid typing in “10”, “20”, “30” for temperature. R will read these as integers rather than factors, so aov() will treat temperature as a continuous number instead of a set of groups, and the Tukey test will not run. We can code the numbers as factors (categories), but that’s an additional level of complication. Instead, let’s write “Ten”, “Twenty” and “Thirty” as a workaround.
For a challenge, you can try using as.factor() after importing your data (if you use ‘10’, ‘20’, ‘30’) to convert your temperature variable to a factor from an integer.
It would look something like this:
File$Temperature <- as.factor(File$Temperature)
This line of code selects the column called “Temperature” within “File” and converts it to a factor.
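To see what the conversion changes, here is a small sketch. The dataframe `Photo`, its column `Rate`, and every number in it are made up purely for illustration; they are not real photosynthesis measurements. The thing to watch is the degrees of freedom (Df) for Temperature, which shows whether R treated it as one continuous number or as three groups.

```r
# Made-up example data (NOT real measurements), 4 readings per temperature
Photo <- data.frame(
  Temperature = rep(c(10, 20, 30), each = 4),
  Rate = c(2.1, 2.4, 2.0, 2.3, 3.5, 3.8, 3.6, 3.9, 2.9, 3.1, 2.8, 3.0)
)

class(Photo$Temperature)   # "numeric": R sees a number, not a category

# With a numeric Temperature, aov() fits a single slope (Df = 1)
df.numeric <- summary(aov(Rate ~ Temperature, data = Photo))[[1]][["Df"]][1]

# Convert Temperature to a factor, as in the line above
Photo$Temperature <- as.factor(Photo$Temperature)

class(Photo$Temperature)    # "factor"
levels(Photo$Temperature)   # "10" "20" "30"

# With a factor, aov() compares 3 groups (Df = number of groups - 1 = 2)
df.factor <- summary(aov(Rate ~ Temperature, data = Photo))[[1]][["Df"]][1]

df.numeric   # 1
df.factor    # 2
```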