In this exercise we find out how to use R to run a t-test, to determine whether there is evidence of a difference between two populations.
We will also learn how to write and knit our R file as an R Notebook.
The exercise is based on Chapter 5 of Beckerman, Childs and Petchey, Getting Started with R.
Open your RStuff project using File/Open Project, navigating to the RStuff folder, then clicking on the RStuff.Rproj file you will find there.
If your RStuff folder is not already a project, make it one using File/New Project/Existing Directory, then navigate to your RStuff folder.
Create a new R Notebook file called t_test and save it in your scripts folder. You should see that it has been saved as a .Rmd file, which stands for R Markdown. In this type of file you can combine human-readable text with R code, so you can keep all of your analysis of a data set in one document: the text, the code that does the analysis, and the output of that code, including any figures you draw.

Except for the very tiniest scripts, these files are much easier to read than a normal script. You can include extensive commentary without it looking as messy as it would in a standard script. These instructions, what you are reading now, were written as a markdown document, then knitted (which formats everything) and, finally, published to the web. Once you realise you can do all this easily, you rarely look back.
There are very simple rules for formatting text in R Markdown. A quick guide is given in Help/Markdown Quick Reference, and more extensive help can be found in this RStudio cheatsheet.
For example, to make a really big header you start the line with one #, like this:

# A Really Big Header

which appears, after formatting, as a large heading. For a smaller one, use two hashes:

## Not quite such a big header

which gives a smaller heading, and so on.
Writing in markdown is very easy and you will very soon get the hang of it.
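Here are a few more of the most common bits of markdown, as a small sketch (knit the file to see how each one renders):

```
*italics*, **bold**, `code font`

- a bulleted list item
- another item

1. a numbered item
2. another numbered item
```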
If you wish to include R code in your document, you put it in ‘chunks’ which begin with three back ticks (top left of your keyboard) followed by {r}, then end with three more back ticks. The code goes in the middle. Like this:
```{r}
rnorm(10)
```
In your document, the code chunks will appear greyed out.
To run the code in each chunk as you write it, press the green arrow in the top right of the chunk.
You can set various options after the little ‘r’ between the curly braces, which affect what you see when you run the code, but there is no need to worry about them just yet. Read about them on the R Markdown cheatsheet if you want. For now, though, it is a good idea to include a label for each chunk, like this (notice the comma; labels are best kept to letters, numbers and hyphens):
```{r, a-succinct-label-that-tells-me-what-this-chunk-does}
rnorm(10)
```
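As a taste of what those options can do (echo, message and warning are standard knitr options; the label here is just an example), a chunk like this one would run its code while hiding package start-up messages and warnings:

```{r, load-packages, message=FALSE, warning=FALSE}
library(tidyverse)
```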
To render your .Rmd document, you press the Preview/Knit button at the top of the scripts pane and choose (I recommend) Knit to HTML. This will create a rendered version of your .Rmd file where everything is nice and readable. It will be in the same folder as your .Rmd file.
You can knit as often as you want; you will just overwrite the existing HTML file. Along the way, though, you may wish to run the code chunk by chunk by pressing the little green chunk arrows.
Now, to work on the t-test.
A t-test is useful for comparing the means of two groups of numerical data where we have replicate measurements. Here we will investigate data on ozone levels for gardens around a city. The data give ozone concentration in ppb. Ozone levels can affect how well crops grow and can affect human health. The gardens are from two regions: east and west of the city centre.
Is there a difference between ozone concentrations in the east and in the west?
We will use a t-test to decide this.
Write down a null and an alternative hypothesis suitable for this investigation. Remember that the null is the ‘nothing going on’ scenario.
Should the alternative hypothesis be one-sided or two-sided? (Hint: two-sided is usually the answer here. But why?)
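For example, writing \(\mu_E\) and \(\mu_W\) for the mean ozone concentrations of the east and west garden populations, one conventional statement is:

\[ H_0: \mu_E = \mu_W \qquad H_1: \mu_E \neq \mu_W \]

where the \(\neq\) in \(H_1\) is what makes the alternative two-sided.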
When you create a new R Notebook, you get some ‘yaml’ at the top, between the two --- lines. Leave this, but delete everything below it; that is just exemplar stuff that you don’t need. What you put in the yaml affects how your file is rendered and all sorts of other things, but for now you can ignore it.
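For reference, the yaml in a brand new R Notebook looks something like this (yours may have a different title):

```
---
title: "R Notebook"
output: html_notebook
---
```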
Now copy in the code below, chunk by chunk, remembering to put each new block of code within its own self contained chunk. Run each chunk as you complete it to make sure there are no errors. The output will appear in the script pane and also in the console window.
Also add explanatory text between the chunks, as much as you think useful, but at least put in headers. Experiment with hashes to see how many are needed to give headers of the size you like.
Whenever you feel like seeing fancy output, Knit your document. I love it, so I do this frequently.
```r
library(tidyverse)
library(here)

# there should be an 'ozone.csv' file in your data folder
# if not, you can get it from the data folder on Teams
ozone <- read_csv(here("data", "ozone.csv"))
```

```
## Parsed with column specification:
## cols(
##   Ozone = col_double(),
##   Garden.location = col_character(),
##   Garden.ID = col_character()
## )
```

```r
glimpse(ozone)
```

```
## Rows: 20
## Columns: 3
## $ Ozone           <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8,…
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West…
## $ Garden.ID       <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",…
```
What kind of data have we got?
You might also wish to inspect the data using summary(). If so, include a code chunk to do this.
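A minimal chunk for that might look like this sketch (summary() is base R, so nothing extra needs loading):

```r
# quick overview of each column: min, max, quartiles, mean
summary(ozone)
```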
Remember, before we do any statistical analysis, it is almost always a good idea to plot the data in some way. We can often get a very good idea as to the answer to our research question just from the plots we do.
Use ggplot() to plot two histograms of ozone levels, one for east and one for west. Use the facet feature of ggplot() to stack the histograms one above the other.

```r
g <- ggplot(ozone, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "darkred") +
  facet_wrap(~Garden.location, ncol = 1) +
  theme_bw()
g
```
Do the data look as though they support the null hypothesis or not?
In addition, do the data look as though each group is drawn from a normally distributed population? When there are so few data, it’s kind of hard to tell from these, no?
Let’s now do some stats.
We can use group_by() and summarise() from dplyr to do this, together with the pipe operator %>%.
```r
stats <- ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone), st_dev = sd(Ozone))
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```r
stats
```
Study this code snippet. You will use something like it again and again.
In the summarise() line above we have asked for a table with two columns titled avg and st_dev. For the first we use the mean() function, and for the second we use the sd() function. Since we have grouped by Garden.location, each row of the table will be for a different location.
Bear in mind that if the data are normally distributed, then under the null hypothesis there is a less than 5% chance that the two sample means would be more than about two standard errors apart. Does it look as though we are going to reject the null, or not?
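If you would like to make that eyeball check concrete, here is a sketch using the stats table from above (avg and st_dev are the column names we created; count() just tallies the gardens in each region):

```r
# rough check (sketch): how many standard errors apart are the sample means?
n_per_group <- ozone %>% count(Garden.location)       # gardens per region
se_diff <- sqrt(sum(stats$st_dev^2 / n_per_group$n))  # standard error of the difference
abs(diff(stats$avg)) / se_diff                        # much more than ~2 hints at a real difference
```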
There are several analytical tests one can run on a set of data to determine if it is normally distributed. One is the Shapiro-Wilk test.
For more information on the Shapiro-Wilk test, type ?shapiro.test into the console window. For kicks, try it out on the examples that appear in the help window (the bottom right pane, Help tab). One example tests data that explicitly are drawn from a normal distribution; the other tests data that definitely are not. What \(p\)-value do you get in each case? What do the histograms of each distribution look like?
```r
example1 <- rnorm(100, mean = 5, sd = 3) # first example from the help pane
example2 <- runif(100, min = 2, max = 4) # second example from the help pane
df <- tibble(data = c(example1, example2),
             distribution = c(rep("normal", 100), rep("very not normal", 100)))

# histograms
ggplot(df, aes(x = data)) +
  geom_histogram(bins = 10, fill = "cornflowerblue") +
  facet_wrap(~distribution) +
  theme_bw()

# normality tests
shapiro.test(rnorm(100, mean = 5, sd = 3)) # 100 samples drawn from a normally distributed population
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100, mean = 5, sd = 3)
## W = 0.99556, p-value = 0.987
```

```r
shapiro.test(runif(100, min = 2, max = 4)) # 100 samples drawn from a uniformly (ie NOT normally) distributed population
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  runif(100, min = 2, max = 4)
## W = 0.95342, p-value = 0.001403
```
This tests your data against the null hypothesis that they are drawn from a normally distributed population. It gives a \(p\)-value. If the \(p\)-value is less than 0.05 then we reject the null hypothesis and cannot suppose our data are normally distributed. In that case we would have to ditch the t-test and choose another difference test in its place, one that can cope with data that are not normally distributed.
Why don’t we do that in the first place, I hear you ask? Why bother with this finicky t-test that requires that we go through the faff of testing the data for normality before we can use it? The answer is that it is more powerful than other tests that can cope with non-normal data. It is more likely than they are to spot a difference if there really is a difference.
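For reference, the usual fallback in that situation is a rank-based test such as the Wilcoxon rank-sum (Mann-Whitney) test, which makes no normality assumption and takes the same formula interface as t.test(). We don't need it for these data; this is just a sketch:

```r
# non-parametric alternative to the t-test (not needed here)
wilcox.test(Ozone ~ Garden.location, data = ozone)
```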
So, onwards….
We want to test each garden group for normality, so we group the data by location as before and then summarise, this time asking for the \(p\)-value returned by the Shapiro-Wilk test.
```r
normality_test <- ozone %>%
  group_by(Garden.location) %>%
  summarise(sw = shapiro.test(Ozone)$p.value)
```

```
## `summarise()` ungrouping output (override with `.groups` argument)
```

```r
normality_test
```
For both groups the \(p\)-value is more than 0.05, so at the 5% significance level we cannot reject the null hypothesis that the data are normally distributed, so we can go on and use the t-test.
We could also do a quantile-quantile plot for each group of Ozone data:
```r
ozone %>%
  ggplot(aes(sample = Ozone)) +
  stat_qq(colour = "blue") +
  stat_qq_line() +
  facet_wrap(~Garden.location) +
  theme_bw()
```
Nothing outrageously non-linear there, so that also suggests we can safely use the t-test.
We use the t.test() command for this. It needs to be given a formula and a data set as arguments. Look up t.test() in R’s help documentation, and see if you can get the t-test to tell you whether there is a significant difference between ozone levels in the east and in the west of the city.
```r
t.test(Ozone ~ Garden.location, data = ozone)
```
Note the ~ tilde symbol. It means ‘is a function of’, so this line says: do a t-test to see if there is a significant difference between the ozone levels in the two garden locations.
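While you have the help page open, you could experiment with some of t.test()'s other arguments. var.equal and conf.level below are standard arguments of t.test(); this is just a sketch of how you might try them:

```r
# R runs Welch's t-test by default, which does not assume equal variances
t.test(Ozone ~ Garden.location, data = ozone, var.equal = TRUE)  # classic Student's t-test instead
t.test(Ozone ~ Garden.location, data = ozone, conf.level = 0.99) # report a 99% confidence interval
```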
Study the output of t.test(). Here is the minimal script you need to run your t-test on these data:
```r
# load the packages we need
library(tidyverse)
library(here)

# load the data
ozone <- read_csv(here("data", "ozone.csv"))
glimpse(ozone)

# plot histograms for each location
ggplot(ozone, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "darkred") +
  facet_wrap(~Garden.location, ncol = 1) +
  theme_bw()

# what are the means and standard deviations of ozone levels for each location?
stats <- ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone), st_dev = sd(Ozone))
stats

# are the ozone levels in each location normally distributed?
normality_test <- ozone %>%
  group_by(Garden.location) %>%
  summarise(sw = shapiro.test(Ozone)$p.value)
normality_test

# quantile-quantile plots for ozone levels in each location
ozone %>%
  ggplot(aes(sample = Ozone)) +
  stat_qq(colour = "blue") +
  stat_qq_line() +
  facet_wrap(~Garden.location) +
  theme_bw()

# and finally, the actual t-test
t.test(Ozone ~ Garden.location, data = ozone)
```