In this exercise we find out how to use R to run a t-test, to determine whether there is evidence of a difference between two populations.
The exercise is taken from Chapter 5: Beckerman, Childs and Petchey: Getting Started with R.
Open your RStuff project using File/Open Project, navigating to the RStuff folder, then clicking on the RStuff.Rproj file you will find there.
Create a new R Notebook file called t_test. Save this in your scripts folder. You should see that it has been saved as a .Rmd file. This stands for R markdown. In this type of file you can combine human readable text with R code. In this way, you can combine all of your analysis of a data set in one document - the text, the code to do any analysis and the output of that code, including any figures that you draw. Except for the very tiniest scripts, these files are much easier to read than a normal script. You can include extensive commentary without it looking as messy as it would be in a standard script. These instructions, what you are reading now, were written as a markdown document then knitted which formats everything and, finally, published to the web. Once you realise you can do all this easily, you rarely look back.
There are very simple rules for formatting text in R markdown. A quick guide is given in
Help/Markdown Quick Reference More extensive help can be found in this RStudio cheatsheet
For example, to make a really big header you start the line one #, like this
# A Really Big Header
appears, after formatting, as
for a smaller one, use two hashes
## Not quite such a big header
which gives, after formatting
and so on.
Writing in markdown is very easy and you will very soon get the hang of it.
If you wish to include R code in your document, you put it in ‘chunks’ which begin with three back ticks (top left of your keyboard) followed by {r}, then end with three more back ticks. The code goes in the middle. Like this:
```{r}
rm(list=ls())
```
In your document, the code chunks will appear greyed out.
To run the code you press the green arrow in the top right of the chunk.
You can set various options after the little ‘r’ between the curly braces, which affect what you see when you run the code, but no need to worry about them just yet. Read about them on the Rmarkdown cheatsheet if you want. For now, though, it is a good idea to include a label for each chunk, like this (notice the comma):
```{r, a succinct label that tells me what this chunk does}
rm(list=ls())
```
Now, to work on the t-test.
This is useful for comparing the means of two data sets where we have numerical data and we have replicates of those data. Here we will investigate data of ozone levels for gardens around a city. The data you will use gives ozone concentration in ppb. Ozone levels can affect how well crops grow, and can impact on human health. The gardens are from two regions - east or west of the city centre.
Is there a difference between ozone concentrations to the east or west?
We will use a t-test to decide this.
Write down a null and alternate hypothesis suitable for this investigation. Remember that the null is the ‘nothing going on’ scenario.
Should the alternate hypothesis be one-sided or two-sided? (Hint: two-sided is usually the answer here. But why?)
library(tidyverse)
library(here)
ozone<-read_csv(here("data","ozone.csv"))
## Parsed with column specification:
## cols(
## Ozone = col_double(),
## Garden.location = col_character(),
## Garden.ID = col_character()
## )
glimpse(ozone) #what command do you need here?
## Rows: 20
## Columns: 3
## $ Ozone <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8,…
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West…
## $ Garden.ID <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",…
What kind of data have we got?
You might also wish to inspect the data using summary(). Include code chunk to do this.
Remember, before we do any statistical analysis, it is almost always a good idea to plot the data in some way.
ggplot() to plot two histograms of ozone levels, one for east and one for west.facet feature of ggplot()to stack the histograms one above the other.g<-ggplot(ozone,(aes(x=Ozone)))+
geom_histogram(binwidth=10)+
facet_wrap(~Garden.location,ncol=1)
g
Do the data look as though they support the null hypothesis or not? Let’s now do some stats.
We can use group_by() and summarise() from dplyr() to do this.
stats<- ozone %>%
group_by(Garden.location) %>%
summarise(avg = mean(Ozone),st_dev=sd(Ozone))
## `summarise()` ungrouping output (override with `.groups` argument)
stats
In the summarise line above we have asked for a table with two columns titled avg and st_dev. For the first we use the mean() function, and for the second we use the sd() function. Since we have grouped by Garden.location, each row of the table will be for a different location.
We use the t.test() command for this. It needs to be given a formula and a data set as arguments. Look up t.test() in R’s help documentation, and see if you can get the t-test to tell you whether there is a significant difference between ozone levels in the east and in the west of the city.
t.test(Ozone~Garden.location,data=ozone)
Study the output of the t-test.
t.test().