Preliminaries

In this exercise we find out how to use R to run a t-test, to determine whether there is evidence of a difference between two populations.

The exercise is taken from Chapter 5 of Beckerman, Childs and Petchey, Getting Started with R.

Working directory

Set your Rstuff folder as the working directory.
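
One way to do this from the console is sketched below. The location of the Rstuff folder (directly inside your home directory) is an assumption made for illustration, so adjust the path to wherever your folder actually lives. In RStudio you can do the same thing through the Session menu.

```r
# Hypothetical location: an Rstuff folder directly inside your home directory.
# Change this path to match your own machine.
rstuff <- path.expand(file.path("~", "Rstuff"))

# Only change directory if the folder really exists, to avoid an error
if (dir.exists(rstuff)) {
  setwd(rstuff)
}

# Confirm which folder R is now working in
getwd()
```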

R markdown

Create a new R Notebook file called t_test. Save this in your scripts folder. You should see that it has been saved as a .Rmd file. This stands for R markdown. In this type of file you can combine human-readable text with R code. In this way, you can keep all of your analysis of a data set in one document: the text, the code to do any analysis and the output of that code, including any figures that you draw. Except for the very tiniest scripts, these files are much easier to read than a normal script. You can include extensive commentary without it looking as messy as it would in a standard script. These instructions were themselves written as an R markdown document, then knitted (which formats everything) and, finally, published to the web.

Formatting text in R markdown

There are very simple rules for formatting text in R markdown. A quick guide is given in
Help/Markdown Quick Reference. More extensive help can be found in the RStudio R markdown cheatsheet.

For example, to make a really big header you start the line with one #, like this

# A Really Big Header

appears, after formatting, as

A Really Big Header

For a smaller one, use two hashes

## Not quite such a big header

which gives, after formatting

Not quite such a big header

and so on.

Writing in markdown is very easy and you will very soon get the hang of it.

Code chunks

If you wish to include R code in your document, you put it in ‘chunks’ which begin with three back ticks (top left of your keyboard) followed by {r}, then end with three more back ticks. The code goes in the middle. Like this:

```{r}
rm(list=ls())
```

In your document, the code chunks will appear greyed out.

To run the code you press the green arrow in the top right of the chunk.

You can set various options after the little ‘r’ between the curly braces, which affect what you see when you run the code, but there is no need to worry about them just yet. Read about them on the R markdown cheatsheet if you want. For now, though, it is a good idea to include a label for each chunk, like this (notice the comma):

```{r, a label that tells me what this chunk does}
rm(list=ls())
```

Now, to work on the t-test.

The Two-sample t-test

This is useful for comparing the means of two data sets. Here we will investigate data on ozone levels in gardens around a city. The data give ozone concentration in parts per billion (ppb). Ozone levels can affect how well crops grow, and can affect human health. The gardens are from two regions: east or west of the city centre.

Is there a difference between ozone concentrations to the east or west?

We will use a t-test to decide this.

PROS of the t-test

  • It can be used when the data set is small
  • It can still be used when the data set is large

CONS of the t-test

  • It assumes that the data are drawn from a normally distributed population
  • The classic two-sample t-test assumes that both samples are drawn from populations with the same variance
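
The two assumptions listed above can be checked roughly in R. The sketch below uses simulated numbers, not the exercise data, and the object names are made up for illustration. It uses shapiro.test() and var.test() from base R. With small samples these tests have little power, so treat them as a supplement to plotting the data rather than a replacement for it.

```r
# Simulated ozone readings (made-up numbers, NOT the exercise data)
set.seed(42)
east <- rnorm(10, mean = 75, sd = 8)
west <- rnorm(10, mean = 60, sd = 8)

# Shapiro-Wilk test for each group:
# a small p-value suggests a departure from normality
shapiro_east <- shapiro.test(east)
shapiro_west <- shapiro.test(west)
c(east = shapiro_east$p.value, west = shapiro_west$p.value)

# F test comparing the two variances:
# a small p-value suggests the variances differ
variance_check <- var.test(east, west)
variance_check$p.value
```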

Hypotheses

Write down a null and an alternative hypothesis suitable for this investigation.

Should the alternative hypothesis be one-sided or two-sided?
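
Your choice here is passed to t.test() through its alternative argument. The sketch below uses simulated numbers, not the exercise data, purely to show how the two choices are written. The vectors east and west are made up for illustration.

```r
# Simulated ozone readings (made-up numbers, NOT the exercise data)
set.seed(1)
east <- rnorm(10, mean = 75, sd = 8)
west <- rnorm(10, mean = 65, sd = 8)

# Two-sided: the alternative is "the means differ", in either direction
two_sided <- t.test(east, west, alternative = "two.sided")

# One-sided: the alternative is "the east mean is greater than the west mean"
one_sided <- t.test(east, west, alternative = "greater")

# Same test statistic, different p-values
two_sided$statistic
one_sided$statistic
two_sided$p.value
one_sided$p.value
```

The test statistic is the same in both cases. Only the p-value changes, because a one-sided test counts evidence in one direction only. Decide which version you want before you look at the data.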

Back to work on the script

Clear R’s brain

rm(list=ls())

Load packages

library(dplyr)
library(ggplot2)
library(readr)

Read in the data

ozone<-read_csv('../data/ozone.csv')

Step 1: Inspect the data

glimpse(ozone)    #what command do you need here?

What kind of data have we got?

You might also wish to inspect the data using summary(). Include a code chunk to do this.
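
As a rough guide to what summary() reports, here is a sketch on a small simulated data frame. The column names match those used in this exercise, but the numbers and the region labels "East" and "West" are assumptions made for illustration, not the contents of ozone.csv.

```r
# Simulated stand-in for the ozone data (made-up numbers and labels)
set.seed(42)
ozone_demo <- data.frame(
  Ozone = round(c(rnorm(10, mean = 60, sd = 8),
                  rnorm(10, mean = 75, sd = 8)), 1),
  Garden.location = rep(c("West", "East"), each = 10)
)

# Column types and the first few values
str(ozone_demo)

# Five-number summary and mean for numeric columns
summary(ozone_demo)

# Treating the location as a factor makes summary() count gardens per region
summary(factor(ozone_demo$Garden.location))
```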

Step 2: Plot the data

Remember, before we do any statistical analysis, it is almost always a good idea to plot the data in some way.

  • Use ggplot() to plot two histograms of ozone levels, one for east and one for west.
  • Use the facet feature of ggplot() to stack the histograms one above the other.
  • Make the bins 10 ppb wide.

g<-ggplot(ozone,aes(x=Ozone))+
  geom_histogram(binwidth=10)+
  facet_wrap(~Garden.location,ncol=1)
g

Do the data look as though they support the null hypothesis or not?

Let’s now do some stats.

Step 3: Carry out statistical analysis

Calculate means and standard deviations of the ozone concentrations in east and west

We can use group_by() and summarise() from the dplyr package to do this.

stats<- ozone %>%
  group_by(Garden.location) %>%
  summarise(avg = mean(Ozone),st_dev=sd(Ozone))
stats

In the summarise line above we have asked for a table with two columns titled avg and st_dev. For the first we use the mean() function, and for the second we use the sd() function. Since we have grouped by Garden.location, each row of the table will be for a different location.
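
The same pattern extends to other summaries. The sketch below adds the sample size and the variance for each group. It runs on a simulated data frame whose numbers and region labels are made up for illustration, so the values it prints are not the answers for ozone.csv.

```r
library(dplyr)

# Simulated stand-in for the ozone data (made-up numbers and labels)
set.seed(42)
ozone_demo <- data.frame(
  Ozone = c(rnorm(10, mean = 60, sd = 8), rnorm(10, mean = 75, sd = 8)),
  Garden.location = rep(c("West", "East"), each = 10)
)

# One row per location, one column per summary
stats_demo <- ozone_demo %>%
  group_by(Garden.location) %>%
  summarise(
    n = n(),                 # number of gardens in the group
    avg = mean(Ozone),       # group mean
    st_dev = sd(Ozone),      # group standard deviation
    variance = var(Ozone)    # group variance (the square of st_dev)
  )
stats_demo
```

Comparing the two variances gives a first impression of whether an equal-variance assumption looks reasonable.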

Now for the actual two-sample t-test

We use the t.test() command for this. It needs to be given a formula and a data set as arguments. Look up t.test() in R’s help documentation, and see if you can get the t-test to tell you whether there is a significant difference between ozone levels in the east and in the west of the city.

t.test(Ozone~Garden.location,data=ozone)

Interpret the output of the t-test.

Study the output of the t-test.

  • What kind of test was carried out?
  • What data was used for the test?
  • What is the test statistic of the data?
  • How many degrees of freedom were there? Does the number make sense?
  • What is the p-value?
  • What does the p-value mean?
  • What is the confidence interval for the difference between ozone levels in east and west?
  • Is there sufficient evidence to reject the null hypothesis?
  • What does the word ‘Welch’ tell you? Look it up in the help for t.test().
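
Most of the quantities asked about above can also be pulled out of the result by name, because t.test() returns a list. The sketch below shows this on a simulated data frame. The numbers and region labels are made up for illustration, so the output is not the answer for ozone.csv. It also runs the test a second time with var.equal = TRUE so that you can compare the degrees of freedom.

```r
# Simulated stand-in for the ozone data (made-up numbers and labels)
set.seed(42)
ozone_demo <- data.frame(
  Ozone = c(rnorm(10, mean = 60, sd = 8), rnorm(10, mean = 75, sd = 8)),
  Garden.location = rep(c("West", "East"), each = 10)
)

# Default call: equal variances are NOT assumed
welch <- t.test(Ozone ~ Garden.location, data = ozone_demo)

welch$method     # which kind of test was carried out
welch$statistic  # the test statistic, t
welch$parameter  # degrees of freedom (often not a whole number here)
welch$p.value    # the p-value
welch$conf.int   # confidence interval for the difference in means
welch$estimate   # the two group means

# Classic version: equal variances assumed
pooled <- t.test(Ozone ~ Garden.location, data = ozone_demo, var.equal = TRUE)
pooled$parameter # degrees of freedom = total sample size minus 2
```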