FOR3012 Statistics Module-Lab-3

Author

Patrick James & Jack Goldman

Published

October 6, 2023

ANOVA

Objectives

Last week we explored how to compare the means of two random variables using the t-test. We also explored the assumptions of this test, namely normality and homogeneity of variances, as well as test alternatives when and if these assumptions are violated. 

In this lab, we will extend our approach of comparing mean to include multiple groups (i.e., >2). While we could conduct multiple t-test, we know that the probability of finding a significant result increases with repeated tests. That is, out Type 1 error rate will increase. Therefore, we want a test that will be reliable and provide us with he correct and desired Type 1 error rate (e.g., α=0.05). 

The test we use to compare the means of multiple groups is called ANalysis Of VAriance, or ANOVA.

In this exercise we will practice running an ANOVA including importing and plotting the data, testing assumptions, and interpreting results. 

Starting R

Once again, start your R session. This may include starting RStudio, or R and NP++. If using RStudio, note that you can write your command in the script editor (upper left) or directly in the console window. 

If writing in the RStudio’s script editor, you can run code on a specific line by placing your cursor on that line and hitting Ctrl + Enter. If you wish to run all code from the very beginning of your script to a specific line, place your cursor on the last line you want to run, and hit Ctrl+Shift+Enter. If you only want to run a subset of code within your editor, highlight the desired lines of code and hit Ctrl+Enter. 

If using NP++ as an editor, one must write their code in the editor, and then copy/paste the code (Ctrl+c / Ctrl+v) directly into the R console. 

Importing data

As with last week, your first task will be to download the files posted on Quercus for this lab session. Save them to a folder on your laptop or on the lab computer. You will recall that R does not know automatically where your data are stored. We must tell R where to look for those data using the setwd(), command. If you are uncertain about what your current directory is, you can type in the function getwd(). Set your working directory using the following code, in which you replace the filepath object with your own filepath.

filepath<- "/Users/jgoldman/courses/FOR3012H/Tutorials/tutorial-3"

Remember we need forward slashes “/” not backslashes “” here… or… you can do a double backslash “\”

setwd(filepath)

You can also specify your directory directly in R Studio! Click on the session tab in the toolbar, and select set working directory and from the dropdown menu select choose directory. Session -> set working directory -> choose directory.

Previously, we used .txt files for our analysis. However, comma-separated value or CSV files are commonly used in R because it is a simple universal format that can be easily written and red in R. We can save Excel files as a comma-separated value (CSV) file and then import it into R. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. One of the benefits of using CSV files in R is that you do not have to set call the argument header = TRUE as .csv files automatically recognize column names.

First, open the file “Lab3_Data.xlsx” in Excel. How are the data structured? Can you understand how the data are organized? You will note two tabs in the data file. One of them contains the actual data while the other serves to describe it – descriptions of data are referred to as “meta-data.” Meta-data is essential to effectively sharing your work, your data, and your results and ensuring that your data will have a life beyond your specific project.

Last week we worked with similar data that shows tree growth of 20 similarly sized trees in each of two treatments (L = limed and UL = unlimed). This week we are going to slightly extend this format and work with similar data representing three treatments: (L1 = low limed, L2 = heavy limed, and UL = unlimed). We are also going to increase the number of observations in each category to 40 trees. In the context of this entirely fictional experiment, the unlimed treatment serves as a “control.” That is, a set of observations not affected by your experimental treatment. Controls are essential to clearly identify an effect of your treatment and to reliable and reproducible science.

Today, you will save your data files as a comma-separated values file (.csv) within Excel using the Save as function to save a “Text comma-separate value (*.csv)” file. Next, let’s read in our data using the read.csv command.

Next, let’s read in our data using the read.csv command

lab3data <- read.csv('Lab3_Data.csv')

Once imported, we can look at the data frame at any point by typing its name, by typing or by double clicking on it in the upper right data window in RStudio.

View(lab3data)

We can also modify our data. Again, we have several options: 1) we can modify the data (e.g., remove erroneous values) directly in Excel before importing to R; 2) we can make changes by access specific cells (row x column locations) using indexing and square brackets [row(s), colum(s)]; or 3) we can use the edit function.

From the console (not your script editor), try the following:

edit(lab3data)

Transforming data from long to wide and reading in R packages

Last week we briefly covered transforming data from long to wide data structures and vice versa using base R functions. However, people have developed more efficient ways of performing this operation (and others). Because R is open source, people publish these function as R packages. R packages are a sets of R functions that must be downloaded and installed on your computer. The R package we will be working with today is tidyr, which is part of a bigger package called tidyverse. The tidyerse is a set of functions that are used to manipulate, transform and clean data and contain some of the most common functions.

First, we start with installing a package:

install.packages('tidyverse')

After installing the package, we must load it into our R session. Everytime we start a new R session, we start with a clean R environment. This means that there are no packages loaded, and if we were to call a function to that package, it would return an error.

To load an R package we call the function library() with the package we wish to load as the function. This tells R to search through its library of installed packages and load the specified package.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

To transform data using the tidyr package we will use the function pivot_longer to change a data structure from wide to long and pivot_wider to change from long to wide.

Lets use the data from this lab which is in wide format and change it to long format.

lab3data_long <- pivot_longer(lab3data, cols = c(1:3) , names_to = "Treatment" , values_to = "Tree Growth")

This function has 4 primary arguments. The first argument is the data frame that we want to change. The second argument is the cols or columns that we want to change. Here we wanted to change all columns, therefore we selected these columns using c(1:3). This tells R to select columns 1-3 in the data frame. The next argument, names_to is asking us what we want to call the column that we are moving the existing column names to, here we will call it “Treatment”. The final argument, values_to is asking what we want to call the column that we are moving all the values to. In this case we want to call it “Tree Growth”.

Now if we want to transform our data back to wide format we use the function pivot_wider.

lab3data_wide <- pivot_wider(lab3data_long, names_from = "Treatment", values_from = "Tree Growth")
Warning: Values from `Tree Growth` are not uniquely identified; output will contain
list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(Treatment) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

When we try to pivot to wider we get a warning message. This is because when pivoting to wider we need to specify a column to uniquely identify an observation. For example, if we had a column called species, we could use this column to uniquely identify each observation.In this case we have no such column, and the data frame is returned as a list within each column. To fix this we call the argument unnest().

lab3data_wide <- unnest(lab3data_wide)
Warning: `cols` is now required when using `unnest()`.
ℹ Please use `cols = c(L1, L2, UL)`.

For more information of tidyr:tidyr

For a tidyr cheatsheat: cheatsheat

Descriptive stats and plots

Let’s start with some simple descriptive statistics of our data.

The summary function summarizes entire table - mean, median and quartiles

summary(lab3data)
       L1              L2              UL       
 Min.   :179.0   Min.   :163.0   Min.   :187.0  
 1st Qu.:195.8   1st Qu.:202.2   1st Qu.:213.5  
 Median :205.0   Median :207.5   Median :227.5  
 Mean   :203.7   Mean   :208.6   Mean   :225.9  
 3rd Qu.:210.5   3rd Qu.:216.2   3rd Qu.:237.2  
 Max.   :228.0   Max.   :229.0   Max.   :258.0  

To summarize a specific column we can call the function:

summary(lab3data$L1) #summary of low limed treatment
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  179.0   195.8   205.0   203.7   210.5   228.0 
summary(lab3data$L2) #summary of high limed treatment only
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  163.0   202.2   207.5   208.6   216.2   229.0 
summary(lab3data$UL) #summary of Un-limed treatment only
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  187.0   213.5   227.5   225.9   237.2   258.0 

Next, let’s generate histograms of each column and add a vertical line that indicates the mean.

par(mfrow=c(1,3)) # makes a plot with 1 row and 3 columns

hist(lab3data$L1, col='firebrick', main='L1', ylab='Freq',xlab='Growth') 
abline(v=mean(lab3data$L1, col='black', lwd=2))


hist(lab3data$L2, col='green', main='L2', ylab = 'Freq', xlab='Growth')

abline(v=mean(lab3data$L1, col='black', lwd=2)) 
hist(lab3data$UL, col='royalblue', main='UL', ylab='Freq', xlab='Growth')

abline(v=mean(lab3data$L1, col='black', lwd=2))

We can also summarize these data using a box plot. Instead of passing an argument for each column in the data (e.g., boxplot(data1$L, data1$UL)), we can just pass the entire object as an argument to the boxplot function

boxplot(lab3data, col=c('darkgrey', 'lightgrey', 'pink'))

Try this as well hat’s going on here? can you increase or decrease the number of boxes by changing the size of k?

k<- matrix(c(1:27), nrow=3) 

boxplot(k, col=rainbow(ncol(k)))

ANOVA

Our question now is to assess if there is a significant difference among the means of our three groups (L1, L2, and UL). Our first step should be to declare our null and alternative hypotheses:

H0=there is no difference among the means of L1, L2, and UL; μ1= μ2 =μ3 H1=there is a difference between the means of at least one pair of groups; μi ≠ μk ; for at least one pair of groups i, k.

ANOVA, like the t-test, has several assumptions. 1) The measurements within each group are independent and random samples 2) The data within each group are (mostly) normally distributed 3) The variance is the same in all groups

The first assumption pertains to the experimental design and rigour of the sampling strategy. The second assumption can again be tested using a Shapiro-Wilk test. The third test requires a slight modification to the variance test we used last week as we are now dealing with three groups instead of two. Here, we can use the Bartlett test (bartlett.test). This test can be used to test multiple groups simultaneously as well as the variance of more complex data tables that may include the effects of more than one factor (i.e., two-way ANOVA).

It is worth noting here that ANOVA is known to be quite robust to violations of both the normality and homogeneity of variance assumptions provided that the sample size is “big enough.”

For that reason, let’s fit our ANOVA model first, and then check our assumptions. Before doing so though, we need to be sure that our data is in the right format. The data were shared with you as a series of columns where each column represented a different treatment. For use with ANOVA we must transform these data into a pair of columns where the first column contains the treatment (one of three levels of our liming factor: “L1”, “L2”, or “UL”), and the second column contains all observations associated with each treatment. You can use either Excel to do this or try the following solution in R.

We will first make a list of all three levels of our experimental treatment using the “rep” function. We know that there are 40 observations of each treatment L1, L2, and UL. The arguments to the rep function are the string we want to repeat and the number of times we want to repeat it

treatment<-c(rep('L1', 40), rep('L2', 40), rep('UL',40))

Next, we need to convert our “wide” format liming data to a single column

response<-c(lab3data$L1, lab3data$L2, lab3data$UL)

Finally, we can combine these two vectors into a dataframe using the cbind function (cbind =column bind).

limingData<-data.frame(cbind(treatment, response))

In looking at these data we see that there is a problem. R thinks that both columns of our dataframe are “characters”, as indicated by the quotes around them. You can check the structure of your data using the str command

Run the following code to correctly specify the data types we are using here

str(limingData) # check problem 

limingData$treatment <- as.factor(limingData$treatment) #fixing problem
limingData$response <- as.numeric(limingData$response) #fixing problem

str(limingData) # check that problem has been resolved – looks good!

Now that the data are in the appropriate format, we can run our ANOVA. Try the command below:

?aov # anova function
model1 <- aov(response~treatment, dat= limingData) # fit model
summary(model1) # see model output

If you find a significant result, try your post hoc test using TukeyHSD. Which pairs, if any, are significant?

TukeyHSD(model1) # using model object as argument

Finally, let’s try a toy example of a second factor that requires a two-way ANOVA. Let’s add some random numbers to represent the addition of phosphorous.

First lets simulate some data

ash<-floor(runif((40*3), 190, 230))

Now lets join the data to our dataset

limingData<-cbind(limingData, ash)

Run the models

 model2 <- aov(response ~treatment+ash, dat= limingData) #fit model summary(model2) # see model output 
model3 <- aov(response ~treatment*ash, dat= limingData) #fit model
summary(model3) # see model output

Lab 3 Assignment (50 points total)

To submit as a PDF to Quercus by 11:59pm, October 11, 2023

The same two researchers we considered last week continue to be interested in the growth of dioecious understory plant species found in temperate forest canopy gaps. Now, after another field season and collecting more data, they are interested to see if growth was different among three understory species (A, B, and C) in the last season and if it was influenced by plant sex. Their updated data can be found in the “FOR3012_Lab3_Assignment.xlsx” file uploaded to Quercus.

Your assignment is to use ANOVA to determine if there is a significant difference between the three species. Follow these steps for the assignment:

  1. Prepare a well labelled and clear illustration (e.g., box plot) of the differences among groups in your data (5 points)

  2. Demonstrate that the data satisfy the assumption of normality (we will assume equality of variance (homoscedasticity)) (10 points)

  3. Clearly state your null and alternate hypotheses (5 points)

  4. Use ANOVA to analyse the differences among groups and prepare an ANOVA table summarizing the results. (10 points)

  5. Use a post-hoc test to determine which pairs of groups differ. Report your results in a table. (5 points)

  6. What is the R2 of your model? How do you interpret this? (5 points)

  7. Prepare a short paragraph summarizing your findings as you would include in a report or publication. (10 points)