Preliminaries

In this exercise we will find out how to use R to run a t-test, to determine whether there is evidence of a difference between the means of two populations.

We will also learn how to write and knit our R file as an R Notebook.

The exercise is based on Chapter 5 of Beckerman, Childs and Petchey, Getting Started with R.

Open your project

Open your RStuff project using File/Open Project, navigating to the RStuff folder, then clicking on the RStuff.Rproj file you will find there.

If your RStuff folder is not already a Project, then make it one using File/New Project/Existing Directory - then navigate to your RStuff folder.

R markdown

Create a new R Notebook file called t_test and save it in your scripts folder. You should see that it has been saved as a .Rmd file. This stands for R markdown. In this type of file you can combine human-readable text with R code, so you can keep all of your analysis of a data set in one document: the text, the code to do the analysis, and the output of that code, including any figures that you draw.

Except for the very tiniest scripts, these files are much easier to read than a normal script, and you can include extensive commentary without it looking as messy as it would in a standard script. These instructions, what you are reading now, were written as a markdown document, then knitted (which formats everything) and, finally, published to the web. Once you realise you can do all this easily, you rarely look back.

Formatting text in R markdown

There are very simple rules for formatting text in R markdown. A quick guide is given in
Help/Markdown Quick Reference. More extensive help can be found in the RStudio cheatsheet.

For example, to make a really big header you start the line with one #, like this

# A Really Big Header

appears, after formatting, as

A Really Big Header

for a smaller one, use two hashes

## Not quite such a big header

which gives, after formatting

Not quite such a big header

and so on.

Writing in markdown is very easy and you will very soon get the hang of it.
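
For instance, asterisks give you italics and bold, and hyphens give you bullet points (all straight from the Quick Reference):

```
*italic text* and **bold text**

- a bullet point
- another bullet point
```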

Code chunks

If you wish to include R code in your document, you put it in ‘chunks’ which begin with three back ticks (top left of your keyboard) followed by {r}, then end with three more back ticks. The code goes in the middle. Like this:

```{r}
rnorm(10)
```

In your document, the code chunks will appear greyed out.

To run the code in each chunk as you write it, press the green arrow in the top right of the chunk.

You can set various options after the little ‘r’ between the curly braces, which affect what you see when you run the code, but no need to worry about them just yet. Read about them on the Rmarkdown cheatsheet if you want. For now, though, it is a good idea to include a label for each chunk, like this (notice the comma; labels are best kept to letters, numbers and hyphens, with no spaces):

```{r, a-succinct-label-that-tells-me-what-this-chunk-does}
rnorm(10)
```
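
If you are curious about the options, two commonly used ones are echo=FALSE, which hides the code (but not its output) in the knitted file, and message=FALSE, which suppresses messages such as the ones packages print when they load. For example:

```{r, another-labelled-chunk, echo=FALSE, message=FALSE}
rnorm(10)
```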

Knit your markdown file

To render your .Rmd document, you press the Preview/Knit button at the top of the scripts pane and choose (I recommend) Knit to HTML. This will create a rendered version of your .Rmd file where everything is nice and readable. It will be in the same folder as your .Rmd file.

You can knit as often as you want. You will just overwrite the existing HTML file. Along the way, though, you may wish to run the code chunk by chunk by pressing the little green chunk arrows.

Now, to work on the t-test.

The two-sample t-test

This is useful for comparing the means of two data sets where we have numerical data and we have replicates of those data. Here we will investigate data of ozone levels for gardens around a city. The data you will use gives ozone concentration in ppb. Ozone levels can affect how well crops grow, and can impact on human health. The gardens are from two regions - east or west of the city centre.

Is there a difference between ozone concentrations to the east or west?

We will use a t-test to decide this.

Pros of the t-test

  • It can be used when the data set is small.
  • It can still be used when the data set is large. So, if in doubt, the t-test is a reasonable default - with the caveats below.

Cons of the t-test

  • It assumes that the data are drawn from a normally distributed population. There are various ways to test if this is true, and you should try at least one of them, but with small samples, just where the t-test is most useful, it can be difficult to tell. In the end we can also appeal to reason: is there good reason to suppose that the data would or would not be normally distributed?
  • When comparing the means of two samples both samples should normally have the same variance, which is a measure of the spread of the data. You need to check that this is the case, or at least have reason to suppose that it should be. (Note: the Welch version of the t-test, which R runs by default, relaxes this requirement - see below.)
  • When we have more than two samples and we use t-tests to look for a difference between any two of them, it becomes increasingly likely, the more samples we have, that we will get a \(p\)-value below some pre-determined threshold (which could be anything, but is most often chosen to be 0.05) purely by chance, and so decide we have found a difference even if in reality there is none - see the quick simulation below. That is where ANOVA comes in. t-tests are only used to detect evidence for a difference between two groups, not more. ANOVAs (or their non-parametric equivalents) are used when we are looking for differences between more than two groups.
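
To see that threshold effect in action, here is a quick simulation sketch (my own aside, not part of the exercise): we run 1000 t-tests in which the null hypothesis is true by construction, because both samples are drawn from the same normal population, and count how often we would wrongly declare a ‘significant’ difference.

```{r, false-positives-by-chance}
set.seed(1)
# 1000 t-tests comparing two samples from the SAME population
p_vals <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)
# the proportion below 0.05 - expect roughly 5%, all of them false positives
mean(p_vals < 0.05)
```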

Hypotheses

Write down a null and alternate hypothesis suitable for this investigation. Remember that the null is the ‘nothing going on’ scenario.

Should the alternate hypothesis be one-sided or two-sided? (Hint: two-sided is usually the answer here. But why?)

Back to work on the script

When you create a new R Notebook, you get some ‘yaml’ at the top between the two --- lines. Leave this, but delete everything below. That is just exemplar stuff that you don’t need. What you put in the ‘yaml’ can affect how your file is rendered and all sorts of things, but for now you can ignore it.
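
For a new R Notebook the yaml typically looks something like this (the title will be whatever you chose):

```
---
title: "t_test"
output: html_notebook
---
```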

Now copy in the code below, chunk by chunk, remembering to put each new block of code within its own self contained chunk. Run each chunk as you complete it to make sure there are no errors. The output will appear in the script pane and also in the console window.

Also add explanatory text between the chunks, as much as you think useful, but at least put headers. Experiment with hashes to see how many are needed to give headers of the size you like.

Whenever you feel like seeing fancy output, Knit your document. I love it, so I do this frequently.

Load packages

library(tidyverse)
library(here)

Read in the data

# there should be an 'ozone.csv' file in your data folder
# if not, you can get it from the data folder on Teams
ozone<-read_csv(here("data","ozone.csv"))
## Parsed with column specification:
## cols(
##   Ozone = col_double(),
##   Garden.location = col_character(),
##   Garden.ID = col_character()
## )

Step 1: Inspect the data

glimpse(ozone)
## Rows: 20
## Columns: 3
## $ Ozone           <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8,…
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West…
## $ Garden.ID       <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",…

What kind of data have we got?

You might also wish to inspect the data using summary(). If so, include a code chunk to do this.
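
If you do, a chunk as simple as this is enough:

```{r, summary-of-ozone}
summary(ozone)
```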

Step 2: Plot the data

Remember, before we do any statistical analysis, it is almost always a good idea to plot the data in some way. We can often get a very good idea as to the answer to our research question just from the plots we do.

  • Use ggplot() to plot two histograms of ozone levels, one for east and one for west.
  • Use the facet feature of ggplot() to stack the histograms one above the other.
  • Make the bins 10 ppb wide.
g<-ggplot(ozone,aes(x=Ozone))+
  geom_histogram(binwidth=10,fill="darkred")+
  facet_wrap(~Garden.location,ncol=1) +
  theme_bw()
g

Do the data look as though they support the null hypothesis or not?

In addition, do the data look as though each group is drawn from a normally distributed population? When there are so few data, it’s kind of hard to tell from these, no?
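
If you would like a second view of the data (entirely optional), side-by-side boxplots put both groups on the same axis, which can make the comparison easier on the eye:

```{r, ozone-boxplots}
ggplot(ozone, aes(x = Garden.location, y = Ozone)) +
  geom_boxplot() +
  theme_bw()
```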

Let’s now do some stats.

Step 3: Carry out statistical analysis

Calculate means and standard deviations of the ozone concentrations in east and west

We can use group_by() and summarise() from dplyr to do this, together with the pipe operator %>%.

stats<- ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone),st_dev=sd(Ozone))
## `summarise()` ungrouping output (override with `.groups` argument)
stats

Study this code snippet. You will use something like it again and again.

In the summarise() line above we have asked for a table with two columns titled avg and st_dev. For the first we use the mean() function, and for the second we use the sd() function. Since we have grouped by Garden.location, each row of the table will be for a different location.
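
You can ask summarise() for as many columns as you like. For example (the column names are just ones I have chosen), to add the variance and the number of gardens in each group:

```{r, group-summaries-extended}
ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone),
            st_dev = sd(Ozone),
            variance = var(Ozone),
            n = n())
```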

Bear in mind that if the data are normally distributed, then under the null hypothesis there is a less than 5% chance that the two sample means would be more than about two standard errors apart (the standard error of the difference between the means, which shrinks as the sample size grows - not the standard deviation of the data). Does it look as though we are going to reject the null, or not?
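
To make that concrete, here is a rough back-of-envelope sketch (my own addition, not from the book): it computes the difference between the group means in units of the standard error of that difference. A value well beyond about 2 suggests the null hypothesis is in trouble.

```{r, rough-check}
check <- ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone), se2 = var(Ozone) / n())
# difference between the means, divided by the standard error of that difference
diff(check$avg) / sqrt(sum(check$se2))
```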

Are the data normally distributed?

Normality test

There are several analytical tests one can run on a set of data to determine if it is normally distributed. One is the Shapiro-Wilk test.

For more information on the Shapiro-Wilk test, type ?shapiro.test into the console window. For kicks, try it out on the examples that appear in the help window (which is the bottom right pane, Help tab). One example tests data that explicitly are drawn from a normal distribution, the other tests data that definitely are not. What \(p\)-value do you get in each case? What do the histograms of each distribution look like?

example1<-rnorm(100, mean = 5, sd = 3) # first example from the help pane
example2<-runif(100, min = 2, max = 4) # second example from the help pane
df<-tibble(data=c(example1,example2), distribution=c(rep("normal",100),rep("very not normal",100)))

# histograms
ggplot(df,aes(x=data)) +
  geom_histogram(bins=10,fill="cornflowerblue") +
  facet_wrap(~distribution) +
  theme_bw()


# normality tests
shapiro.test(rnorm(100, mean = 5, sd = 3)) # 100 samples drawn from a normally distributed population
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100, mean = 5, sd = 3)
## W = 0.99556, p-value = 0.987
shapiro.test(runif(100, min = 2, max = 4)) # 100 samples drawn from a uniformly (ie NOT normally) distributed population
## 
##  Shapiro-Wilk normality test
## 
## data:  runif(100, min = 2, max = 4)
## W = 0.95342, p-value = 0.001403

This tests your data against the null hypothesis that they are drawn from a normally distributed population. It gives a \(p\)-value. If the \(p\)-value is less than 0.05 then we reject the null hypothesis and cannot suppose our data are normally distributed. In that case we would have to ditch the t-test and choose another difference test in its place, one that can cope with data that are not normally distributed.

Why don’t we do that in the first place, I hear you ask? Why bother with this finicky t-test that requires that we go through the faff of testing the data for normality before we can use it? The answer is that it is more powerful than other tests that can cope with non-normal data. It is more likely than they are to spot a difference if there really is a difference.
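
For reference, the usual fallback when normality fails is the Wilcoxon rank-sum test (also known as the Mann-Whitney test), which takes the same formula interface as t.test(). We do not need it here, but it would look like this:

```{r, wilcoxon-fallback}
# only needed if the Shapiro-Wilk test had rejected normality
wilcox.test(Ozone ~ Garden.location, data = ozone)
```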

So, onwards….

We want to test each garden group for normality, so we group the data by location as before and then summarise, this time asking for the \(p\)-value returned by the Shapiro-Wilk test.

normality_test<- ozone %>%
  group_by(Garden.location) %>%
  summarise(sw=shapiro.test(Ozone)$p.value)
## `summarise()` ungrouping output (override with `.groups` argument)
normality_test

For both groups the \(p\)-value is more than 0.05, so at the 5% significance level we cannot reject the null hypothesis that the data are normally distributed, so we can go on and use the t-test.

QQ plot

We could also do a quantile-quantile plot for each group of Ozone data:

ozone %>%
  ggplot(aes(sample=Ozone)) +
  stat_qq(colour="blue") +
  stat_qq_line() +
  facet_wrap(~Garden.location) +
  theme_bw()

Nothing outrageously non-linear there, so that also suggests we can safely use the t-test.

Now for the actual two-sample t-test

We use the t.test() command for this. It needs to be given a formula and a data set as arguments. Look up t.test() in R’s help documentation, and see if you can get the t-test to tell you whether there is a significant difference between ozone levels in the east and in the west of the city.

t.test(Ozone~Garden.location,data=ozone)

Note the ~ tilde symbol. This means ‘is a function of’. So this line means: do a t-test to see if there is a significant difference between the ozone levels in the two garden locations.

Interpret the output of the t-test.

Study the output of the t-test.

  • What kind of test was carried out?
  • What data was used for the test?
  • What is the test statistic of the data?
  • How many degrees of freedom were there? Does the number make sense? For a classical two-sample t-test the degrees of freedom are two less than the total number of data points; the Welch version adjusts this, so it need not even be a whole number.
  • What is the p-value?
  • What does the p-value mean?
  • What is the confidence interval for the difference between ozone levels in east and west? Does it encompass zero?
  • Is there sufficient evidence to reject the null hypothesis?
  • What does the word ‘Welch’ tell you? Look it up in the help for t.test(), and see the note below.
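
On that last point: t.test() runs the Welch version by default, which does not assume equal variances. If you want the classical equal-variance test for comparison, set var.equal = TRUE. You can also save the result and pull out the pieces the questions above ask about:

```{r, welch-vs-pooled}
# the classical (pooled-variance) two-sample t-test, for comparison
t.test(Ozone ~ Garden.location, data = ozone, var.equal = TRUE)

# save the default Welch result and extract its components
result <- t.test(Ozone ~ Garden.location, data = ozone)
result$p.value    # the p-value
result$conf.int   # confidence interval for the difference in means
result$parameter  # the degrees of freedom
```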

The complete script

Here is the minimal script you need to run your t-test on this data:

# load the packages we need
library(tidyverse)
library(here)

# load the data
ozone<-read_csv(here("data","ozone.csv"))
glimpse(ozone)

# plot histograms for each location
ggplot(ozone,aes(x=Ozone))+
  geom_histogram(binwidth=10,fill="darkred")+
  facet_wrap(~Garden.location,ncol=1) +
  theme_bw()

# what are the means and standard deviations of ozone levels for each location?
stats<- ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone),st_dev=sd(Ozone))
stats

# are the ozone levels in each location normally distributed?
normality_test<- ozone %>%
  group_by(Garden.location) %>%
  summarise(sw=shapiro.test(Ozone)$p.value)
normality_test

# quantile-quantile plots for ozone levels in each location
ozone %>%
  ggplot(aes(sample=Ozone)) +
  stat_qq(colour="blue") +
  stat_qq_line() +
  facet_wrap(~Garden.location) +
  theme_bw()

# and finally, the actual _t_-test
t.test(Ozone~Garden.location,data=ozone)