At the start of each R session it is good practice to check (and set) your working directory.
Remember, to check your working directory you use the getwd function. What gets returned is called your PATH. The working directory tells R where to look for files. In your console, type the following:
getwd()
[1] "/Users/jgoldman/courses/FOR3012H"
To set your working directory, if it is not already pointing to the right place, use the setwd command with the path to the proper directory. (In this example my working directory is already set to the right place, but I will run it anyway.) Again, run this command in your console.
setwd("/Users/jgoldman/courses/FOR3012H/")After setting the working directory, the next step is to open an R script. There are two ways of doing this. First, we can open an R script from the dropdown menu in R Studio File -> New File -> R Script. If you have a mac you can use command + Shift + N for windows the operation is control + alt + N.
Once we have our R script open, we can save it to our working directory.
When writing code in an R script, it is good practice to comment each line of code to document what that line is doing. You can add a comment to a script using the # symbol. For example:
# this line creates a simulated data set drawn from a random normal distribution
ex.dat <- rnorm(500, 5, 2)
## this line visualizes the distribution of our data using a histogram
hist(ex.dat)
Commenting your code is good practice and is important for reproducibility. Reproducible code is portable and makes it easy for others to navigate your analysis and get the same results. For more information on reproducibility, see:
Why should code be reproducible and what does it mean?
An example of reproducible R code
To run a line of code in an R script you can use either Command + Enter on a Mac or Ctrl + Enter on Windows.
There are many great online resources for learning more tips and tricks and for getting more comfortable with R, R Studio, and the scripting environment. Check this one out: Hands-On Programming with R.
First, using the techniques you learned last week, load the liming data (Lab1_Liming.txt). After setting your working directory, either using the command line or the drop-down menu in R Studio, execute the following command:
data1 <- read.table("Tutorials/Lab1_Liming.txt", header = TRUE)
If you are still uncertain or experiencing difficulties importing your data, take a bit of time to familiarize yourself with the concepts underlying data import in R, and speak to the prof or the TA for guidance. This is an essential skill that you will need for this course, for other, later courses, and for your Capstone project.
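If you want to confirm that the import worked before moving on, one quick check is a sketch like the following (it assumes the data loaded into data1 with the UL and L columns used later in this lab):
# inspect the imported data: column names, types, and first rows
str(data1)
head(data1)
# summary statistics for each column
summary(data1)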
To visualize the distribution of each category of data (Limed and Unlimed), we should make a histogram for each category. Recall that a frequency histogram illustrates how often a particular value is found in a data set. We can easily make histograms using the hist function. We can also use some funky graphic parameters to make histograms appear beside each other in the same figure. Before getting to the liming data, let's generate some fake data to see how to make nice, comparable plots.
We will create two data sets using a random normal distribution. To do so, we call the rnorm function, which is a base function provided by R.
Fake data 1: 1000 samples from a normal distribution with mean=10 and SD=3
Fake data 2: 1000 samples from a normal distribution with mean=10 and SD=12
#fake data 1
test.dat1 <- rnorm(1000, 10, 3)
#fake data 2
test.dat2 <- rnorm(1000, 10, 12)
# set plotting options
# number of panels (#rows, #cols)
par(mfrow=c(1,2))
#create histograms
hist(test.dat1, col="firebrick")
hist(test.dat2, col="royalblue")
While the two histograms look a bit different, it is difficult to evaluate the differences in variation (SD) between the two of them because the scales of the x and y axes differ. To make a more direct comparison, we want to make these axes the same. To do so, we can use the xlim and ylim arguments within our graphic (i.e., hist) function. These arguments set the minimum and maximum values of our axes. I will now set these values to be the same for both histograms, based on visual inspection of the two plots (i.e., the maximum x and y values).
par(mfrow=c(1,2))
hist(test.dat1, col="firebrick", ylim=c(0, 300), xlim=c(-30, 45))
hist(test.dat2, col="royalblue", ylim=c(0, 300), xlim=c(-30, 45))
Take some time to examine the consequences of changing the parameters used to generate the simulated data (test.dat1 and test.dat2) in these histograms, as well as the sensitivity of the plots to the other arguments (e.g., xlim, ylim).
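As one possible starting point for this exercise, here is a sketch that generates an additional fake data set with a smaller SD and plots it alongside the first one on the same axes (the name test.dat3 and the SD of 6 are arbitrary choices introduced here for illustration):
# additional fake data set with a smaller SD (arbitrary value, for illustration)
test.dat3 <- rnorm(1000, 10, 6)
# plot both histograms side by side with matching axes
par(mfrow=c(1,2))
hist(test.dat1, col="firebrick", ylim=c(0, 300), xlim=c(-30, 45))
hist(test.dat3, col="royalblue", ylim=c(0, 300), xlim=c(-30, 45))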
Effective labels are also an essential component of a useful figure. Labels can be added to your plots using a range of different arguments to your plotting function. For example, the argument main adds a main title to the plot. Likewise, the arguments xlab and ylab add axis labels. For example:
dev.off() # this command closes the current graphics device, clearing the previous plot and resetting the panel layout
hist(test.dat1, col="pink", main="My Title", xlab="Value", ylab="Frequency")
We can also assess differences between these data series using a box plot. Try this out:
dev.off()
boxplot(test.dat1, test.dat2, col=c("tan1", "tan4"), main="MyBoxplot", xlab=c("Boxes"))
Nice. Now that we have some experience with plotting histograms, comparing data distributions, and making labelled boxplots, let's go back to our liming data and compare the two treatments.
# HISTOGRAMS for each treatment, side by side
par(mfrow=c(1,2))
hist(data1$UL, col="green") # unlimed data
hist(data1$L, col="grey") # limed data
# BOX PLOT
dev.off()
boxplot(data1$UL, data1$L, col=c("green", "grey"), main="Effect of Liming Treatment on Growth", xlab=c("Treatment"))
Take a moment to add meaningful labels to these histograms.
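For example, one possible set of labels (the wording here is only a suggestion) might look like this:
# labelled histograms for each treatment (label text is a suggestion only)
par(mfrow=c(1,2))
hist(data1$UL, col="green", main="Unlimed", xlab="Growth", ylab="Frequency")
hist(data1$L, col="grey", main="Limed", xlab="Growth", ylab="Frequency")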
Let's use a t-test to see whether the differences in means and distributions illustrated in the boxplot and histograms are statistically significant. If the variances in the two treatments are about equal, then we can pool the two samples to estimate the variance (remember, the variance is critical for calculating the test statistic). When the variances are unequal, Welch's correction allows us to still use the parametric t-test without pooling. In fact, this correction is the default in R.
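To see what pooling means in practice, here is a minimal sketch (for illustration only; var.test and t.test handle this for us, and it assumes no missing values in either column) that computes the two sample variances and the pooled estimate by hand:
# sample variance of each treatment
v.ul <- var(data1$UL)
v.l <- var(data1$L)
# pooled variance: a weighted average of the two sample variances
n.ul <- length(data1$UL)
n.l <- length(data1$L)
pooled.var <- ((n.ul - 1) * v.ul + (n.l - 1) * v.l) / (n.ul + n.l - 2)
pooled.var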
Even though we don’t have to do this directly for the test, we’ll test the variances using a simple F-test.
var.test(data1$UL, data1$L)
It turns out that the variances of the two treatments are not significantly different (p=0.1056). Recall that, using the standard α=0.05, our calculated p-value must be smaller than this threshold to indicate a significant difference. Therefore, we are unable to reject the null hypothesis of no difference between the variances of the two treatments.
Within our testing framework, we can specify that the variances are equal by setting the argument var.equal to T (TRUE). This means that the t-test will pool the variances rather than applying Welch's correction.
The final assumption to test is that of normality within groups. Here, we simply want to verify that our sample data are not significantly different from normal. We can test for normality using a Shapiro-Wilk test (among other methods, such as Q-Q plots, not covered here). This test is also easy to implement in R. Here, our H0 is that there is no difference between our data and a sample of equal size drawn from a normal distribution. The H1 is that the data are not from a normal distribution.
Before running the test, though, let's acquaint ourselves with the function itself using the ? help operator.
## see function details
?shapiro.test
The ? operator is a base R utility that lets us see the details of the function we wish to call: how to specify its arguments and what type of data is required.
Now that we see how the function operates, let's run it on the liming data:
shapiro.test(data1$UL) # unlimed data
shapiro.test(data1$L) # limed data
Have our assumptions of normality been met?
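If you prefer to check this programmatically rather than reading the printed output, one option (a sketch assuming the usual α=0.05 threshold) is to compare each test's p-value to the threshold directly:
# TRUE means the p-value exceeds 0.05, so we fail to reject normality for that group
shapiro.test(data1$UL)$p.value > 0.05
shapiro.test(data1$L)$p.value > 0.05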
Finally, let's run the actual t-test, after inspecting the function as we have done previously. Here our H0 is that there is no difference in the means of the two groups. Our H1 is that there is some difference. As the direction of the hypothesized comparison has not been specified, we will assume a two-tailed test and will specify this in the function call using the alternative argument.
Let’s check the details on the t.test function.
?(t.test)
Are there other arguments of interest here? What is the default assumption regarding equal variances?
Now we can run the test and interpret the results:
t.test(data1$UL, data1$L, var.equal=T, alternative="two.sided")
The corresponding p-value is 0.0121. This value is below our significance threshold of α=0.05, which means that we can reject our null hypothesis of no difference and conclude that the two means are "significantly" different.
How does this result change (if at all) if you use Welch's correction (the default) instead of specifying var.equal=T?
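To explore this yourself, you can rerun the test without the var.equal argument, which leaves R's default Welch's correction in place, and compare the output to the pooled-variance result above:
# Welch's t-test (R's default when var.equal is not set)
t.test(data1$UL, data1$L, alternative="two.sided")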
All assignments are individual. Submit as a PDF to Quercus by 12:00am, October 3, 2023
What is the difference between a one-sided and two-sided statistical test? (5 points)
Two UofT researchers are interested in the growth of a pair of dioecious understory plant species found in temperate forest canopy gaps. The researchers are specifically interested to see if growth was different between the two species (A and B) in the last season. The data they collected can be found in the “FOR3012_Lab2_Assignment_2023.xlsx” file uploaded to Quercus. For this assignment, you can ignore the “sex” column. The researchers wish to use a two-sample t-test to address their research question. To help them accomplish this goal, please carry out the following steps:
Test whether the variances of the two species' growth data are equal using the var.test function in R. (5 points)
Test whether each species' growth data are normally distributed using the shapiro.test function in R. (5 points)