Preliminaries

In this exercise we find out how to use R to run a t-test, to determine whether there is evidence of a difference between two populations.

The exercise is taken from Chapter 5: Beckerman, Childs and Petchey: Getting Started with R.

Open your RStuff project using File/Open Project, navigating to the RStuff folder, then clicking on the RStuff.Rproj file you will find there.

R markdown

Create a new R Notebook file called t_test. Save this in your scripts folder. You should see that it has been saved as a .Rmd file. This stands for R markdown. In this type of file you can combine human readable text with R code. In this way, you can combine all of your analysis of a data set in one document - the text, the code to do any analysis and the output of that code, including any figures that you draw. Except for the very tiniest scripts, these files are much easier to read than a normal script. You can include extensive commentary without it looking as messy as it would be in a standard script. These instructions, what you are reading now, were written as a markdown document then knitted which formats everything and, finally, published to the web. Once you realise you can do all this easily, you rarely look back.

Formatting text in R markdown

There are very simple rules for formatting text in R markdown. A quick guide is given in
Help/Markdown Quick Reference More extensive help can be found in this RStudio cheatsheet

For example, to make a really big header you start the line one #, like this

# A Really Big Header

appears, after formatting, as

A Really Big Header

for a smaller one, use two hashes

## Not quite such a big header

which gives, after formatting

Not quite such a big header

and so on.

Writing in markdown is very easy and you will very soon get the hang of it.

Code chunks

If you wish to include R code in your document, you put it in ‘chunks’ which begin with three back ticks (top left of your keyboard) followed by {r}, then end with three more back ticks. The code goes in the middle. Like this:

```{r}
rm(list=ls())
```

In your document, the code chunks will appear greyed out.

To run the code you press the green arrow in the top right of the chunk.

You can set various options after the little ‘r’ between the curly braces, which affect what you see when you run the code, but no need to worry about them just yet. Read about them on the Rmarkdown cheatsheet if you want. For now, though, it is a good idea to include a label for each chunk, like this (notice the comma):

```{r, a succinct label that tells me what this chunk does}
rm(list=ls())
```

Now, to work on the t-test.

The two-sample t-test

This is useful for comparing the means of two data sets where we have numerical data and we have replicates of those data. Here we will investigate data of ozone levels for gardens around a city. The data you will use gives ozone concentration in ppb. Ozone levels can affect how well crops grow, and can impact on human health. The gardens are from two regions - east or west of the city centre.

Is there a difference between ozone concentrations to the east or west?

We will use a t-test to decide this.

Pros of the t-test

  • It can be used when the data set is small.
  • It can still be used when the data set is large. So…if in doubt, just use the t-test, kind of. See below.

Cons of the t-test

  • It assumes that the data are drawn from a normally distributed population. There are various ways to test if this is true, and you should try at least one of them, but with small samples, just where the t-test is most useful, it can be difficult to tell. In the end we can also appeal to reason: is there good reason to suppose that the data would or would not be normally distributed?
  • When comparing the means of two samples both samples should normally have the same variance, which is a measure of the spread of the data. You need to check that this is the case, or at least have reason to suppose that it should be. (Note: in an actual t-test, it is possible to ignore this requirement - see below).
  • When we have more than two samples and we use the t-test to look for a difference between any two of them, it becomes increasingly likely, the more samples we have, that we will decide that we have found difference becuase we got a p-value that was less than some pre-determined threshold (which could be anything, but is most often chosen to be 0.05) even if in reality there is none. That is where ANOVA comes in.

Hypotheses

Write down a null and alternate hypothesis suitable for this investigation. Remember that the null is the ‘nothing going on’ scenario.

Should the alternate hypothesis be one-sided or two-sided? (Hint: two-sided is usually the answer here. But why?)

Back to work on the script

Load packages

library(tidyverse)
library(here)

Read in the data

ozone<-read_csv(here("data","ozone.csv"))
## Parsed with column specification:
## cols(
##   Ozone = col_double(),
##   Garden.location = col_character(),
##   Garden.ID = col_character()
## )

Step 1: Inspect the data

glimpse(ozone)    #what command do you need here?
## Rows: 20
## Columns: 3
## $ Ozone           <dbl> 61.7, 64.0, 72.4, 56.8, 52.4, 44.8, 70.4, 67.6, 68.8,…
## $ Garden.location <chr> "West", "West", "West", "West", "West", "West", "West…
## $ Garden.ID       <chr> "G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9",…

What kind of data have we got?

You might also wish to inspect the data using summary(). Include code chunk to do this.

Step 2: Plot the data

Remember, before we do any statistical analysis, it is almost always a good idea to plot the data in some way.

  • Use ggplot() to plot two histograms of ozone levels, one for east and one for west.
  • Use the facet feature of ggplot()to stack the histograms one above the other.
  • Make the bins 10 ppm wide.
g<-ggplot(ozone,(aes(x=Ozone)))+
  geom_histogram(binwidth=10)+
  facet_wrap(~Garden.location,ncol=1)
g

Do the data look as though they support the null hypothesis or not? Let’s now do some stats.

Step 3: Carry out statistical analysis

Calculate means and variances of the ozone concentrations in east and west

We can use group_by() and summarise() from dplyr() to do this.

stats<- ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone),st_dev=sd(Ozone))
## `summarise()` ungrouping output (override with `.groups` argument)
stats

In the summarise line above we have asked for a table with two columns titled avg and st_dev. For the first we use the mean() function, and for the second we use the sd() function. Since we have grouped by Garden.location, each row of the table will be for a different location.

Now for the actual two-sample t-test

We use the t.test() command for this. It needs to be given a formula and a data set as arguments. Look up t.test() in R’s help documentation, and see if you can get the t-test to tell you whether there is a significant difference between ozone levels in the east and in the west of the city.

t.test(Ozone~Garden.location,data=ozone)

Interpret the output of the t-test.

Study the output of the t-test.

  • What kind of test was carried out?
  • What data was used for the test?
  • What is the test statistic of the data?
  • How many degrees of freedom were there? Does the number make sense?
  • What is the p-value?
  • What does the p value mean?
  • What is the confidence interval for the difference between ozone levels in east and west?
  • Is there sufficient evidence to reject the null hypothesis?
  • What does the word ‘Welch’ tell you - look it up in the help for t.test().