quick introduction

getting help

There is some really basic stuff you should know before you delve into R. The most critical thing to know is how to get help. In R, if you’re ever confused about what a function does, or what inputs (arguments) it takes, simply type the name of the function preceded by a question mark. For example, ?plot will tell you everything you wanted to know about the plot function. If that fails or your question is more complicated, google it. Even people who are comfortable with R are constantly googling and looking at help files. They’re there for a reason.
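If you don’t know the exact name of the function you need, you can also search the help system. A quick sketch:

?plot                       # open the help page for the plot function
help.search("histogram")    # search all installed help pages; ??histogram is shorthand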

RStudio

Just about everyone these days uses RStudio to work with R. RStudio is one place where you can work on a script, view graphics, and test parts of your code in the main console, all at the same time. I recommend you download R from here and RStudio from here.

installing packages

One of the reasons for R’s popularity is the strength of the R community and the ability of people to share their R code through packages. Packages are collections of functions, and they extend R well beyond the capabilities of the base R language you downloaded from R’s website. Packages can be installed with install.packages("viridis"), where viridis is the name of the package.
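Note that installing a package only downloads it to your computer; to use its functions you also have to load it in each R session. A minimal sketch, using viridis as the example:

install.packages("viridis")   # download and install from CRAN (only needed once)
library(viridis)              # load the package for the current session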

how to use this document

This document is meant to introduce you to ~90% of what you’ll use R for. Some of the text looks like this: these are either objects or functions in R, and I use this formatting to distinguish what’s happening in R from the ordinary text of this document. Other things look like this:

?c

This is text you should enter in R to follow along with the document. Usually the output of the code will follow below the code on lines that start with ##; in this case, I’ve queried a help file, which doesn’t output anything. Note that the document builds on itself, so you need to run all the preceding code to ensure that code later in the document will work.

this is only an introduction

Please remember that this is not intended to be a comprehensive document. All scientists should guard against the unknown unknowns–things you don’t even realize you don’t know–in their research. The only way to guard against these is to gather as much knowledge as possible and always be double-checking. For example, most people assume that the R function log takes the base 10 logarithm of some number. Never make this sort of assumption, especially when dealing with a computer program. log actually takes the base e logarithm, or natural logarithm, of a number. log10 or log(x, base=10) take the logarithm with a base of 10. The only way to guard against these sorts of mistakes is to look at help files and never assume anything.
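You can see the difference for yourself by comparing the three calls:

log(100)             # natural (base e) logarithm
## [1] 4.60517
log10(100)           # base 10 logarithm
## [1] 2
log(100, base=10)    # the same thing, with the base given explicitly
## [1] 2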

reading in data from Excel

In our line of work, most data are entered into a spreadsheet in Excel or a similar program. Excel is not great for actually doing statistics and visualizing data, though, and that’s where R comes in.

Before attempting to save your spreadsheet and load it into R, check a few things:

  - make sure the first row contains the names of your columns.
  - make sure that your data (in particular, your first row) doesn’t have any spaces within a cell. R likes things without spaces. If you do have spaces in a variable name, for example, testes size, you could change it to testes_size or testes.size and R would be totally cool with that.
  - make sure you are only working on one sheet.

Save your spreadsheet as a .csv file. This is a simple type of text file where items are separated by commas. Your normal Excel file contains a lot of extra formatting information that isn’t the data, and R doesn’t want to have anything to do with that.

Once the data are saved as a .csv file, you can import them into R using nigrensis_scototaxis_results <- read.csv("/path/to/file/"), where /path/to/file/ is the location of the file on your computer. You can also use nigrensis_scototaxis_results <- read.csv(file.choose()), which will bring up a dialog box where you can navigate to the file. I don’t recommend the latter approach because in 6 months you will forget where the file you’re reading in is. R reads your csv file in as a type of data structure called a data frame. Be sure to name your dataframe something useful and descriptive. data is a bad name for a dataframe because it tells you nothing about what the dataframe is or what it contains.
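Putting that together, a hypothetical import might look like the following (the path is made up; point it at wherever your .csv actually lives):

nigrensis_scototaxis_results <- read.csv("~/projects/scototaxis/nigrensis_results.csv")
str(nigrensis_scototaxis_results)   # check the structure right after importing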

looking at data in R

The simplest way to look at your dataframe in R is to just type the name of the dataframe and hit Enter. This is not practical for large datasets, however, and it doesn’t tell you everything you might want to know.

To get a sense of what the data look like, use head. We’ll use Fisher’s famous iris dataset for our purposes. (iris is a special dataframe that comes with R.)

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Doing this tells us some useful stuff: there are five columns, and we know what each column contains because each column has a name. To access the column names by themselves, you can use the function colnames:

colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Perhaps the most useful function for looking at your data is str which displays the underlying structure of any object in R. What happens when we call str on the iris dataset?

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We are told that there are 150 observations of 5 variables, that the first four columns contain numbers, and that the last is a factor–what R calls a grouping variable. If we had wanted to see the number of columns and rows without using str, we could have used ncol(iris) and nrow(iris).
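These give the same counts str reported:

ncol(iris)
## [1] 5
nrow(iris)
## [1] 150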

subsetting data in R

In general, dataframes in R are subset based on their row and column numbers. For example,

iris[1:5,1:2] 
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6

returns rows 1 to 5 and columns 1 to 2. Always give R the rows you want first, followed by the columns you want. If you want all the columns, leave that part blank but keep the comma:

iris[1:5,]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa

Alternatively, to show all the rows of the first two columns:

iris[,1:2] # or iris[,c(1,2)]
##     Sepal.Length Sepal.Width
## 1            5.1         3.5
## 2            4.9         3.0
## 3            4.7         3.2
## 4            4.6         3.1
## 5            5.0         3.6
## 6            5.4         3.9
## 7            4.6         3.4
## 8            5.0         3.4
## 9            4.4         2.9
## 10           4.9         3.1
## 11           5.4         3.7
## 12           4.8         3.4
## 13           4.8         3.0
## 14           4.3         3.0
## 15           5.8         4.0
## 16           5.7         4.4
## 17           5.4         3.9
## 18           5.1         3.5
## 19           5.7         3.8
## 20           5.1         3.8
## 21           5.4         3.4
## 22           5.1         3.7
## 23           4.6         3.6
## 24           5.1         3.3
## 25           4.8         3.4
## 26           5.0         3.0
## 27           5.0         3.4
## 28           5.2         3.5
## 29           5.2         3.4
## 30           4.7         3.2
## 31           4.8         3.1
## 32           5.4         3.4
## 33           5.2         4.1
## 34           5.5         4.2
## 35           4.9         3.1
## 36           5.0         3.2
## 37           5.5         3.5
## 38           4.9         3.6
## 39           4.4         3.0
## 40           5.1         3.4
## 41           5.0         3.5
## 42           4.5         2.3
## 43           4.4         3.2
## 44           5.0         3.5
## 45           5.1         3.8
## 46           4.8         3.0
## 47           5.1         3.8
## 48           4.6         3.2
## 49           5.3         3.7
## 50           5.0         3.3
## 51           7.0         3.2
## 52           6.4         3.2
## 53           6.9         3.1
## 54           5.5         2.3
## 55           6.5         2.8
## 56           5.7         2.8
## 57           6.3         3.3
## 58           4.9         2.4
## 59           6.6         2.9
## 60           5.2         2.7
## 61           5.0         2.0
## 62           5.9         3.0
## 63           6.0         2.2
## 64           6.1         2.9
## 65           5.6         2.9
## 66           6.7         3.1
## 67           5.6         3.0
## 68           5.8         2.7
## 69           6.2         2.2
## 70           5.6         2.5
## 71           5.9         3.2
## 72           6.1         2.8
## 73           6.3         2.5
## 74           6.1         2.8
## 75           6.4         2.9
## 76           6.6         3.0
## 77           6.8         2.8
## 78           6.7         3.0
## 79           6.0         2.9
## 80           5.7         2.6
## 81           5.5         2.4
## 82           5.5         2.4
## 83           5.8         2.7
## 84           6.0         2.7
## 85           5.4         3.0
## 86           6.0         3.4
## 87           6.7         3.1
## 88           6.3         2.3
## 89           5.6         3.0
## 90           5.5         2.5
## 91           5.5         2.6
## 92           6.1         3.0
## 93           5.8         2.6
## 94           5.0         2.3
## 95           5.6         2.7
## 96           5.7         3.0
## 97           5.7         2.9
## 98           6.2         2.9
## 99           5.1         2.5
## 100          5.7         2.8
## 101          6.3         3.3
## 102          5.8         2.7
## 103          7.1         3.0
## 104          6.3         2.9
## 105          6.5         3.0
## 106          7.6         3.0
## 107          4.9         2.5
## 108          7.3         2.9
## 109          6.7         2.5
## 110          7.2         3.6
## 111          6.5         3.2
## 112          6.4         2.7
## 113          6.8         3.0
## 114          5.7         2.5
## 115          5.8         2.8
## 116          6.4         3.2
## 117          6.5         3.0
## 118          7.7         3.8
## 119          7.7         2.6
## 120          6.0         2.2
## 121          6.9         3.2
## 122          5.6         2.8
## 123          7.7         2.8
## 124          6.3         2.7
## 125          6.7         3.3
## 126          7.2         3.2
## 127          6.2         2.8
## 128          6.1         3.0
## 129          6.4         2.8
## 130          7.2         3.0
## 131          7.4         2.8
## 132          7.9         3.8
## 133          6.4         2.8
## 134          6.3         2.8
## 135          6.1         2.6
## 136          7.7         3.0
## 137          6.3         3.4
## 138          6.4         3.1
## 139          6.0         3.0
## 140          6.9         3.1
## 141          6.7         3.1
## 142          6.9         3.1
## 143          5.8         2.7
## 144          6.8         3.2
## 145          6.7         3.3
## 146          6.7         3.0
## 147          6.3         2.5
## 148          6.5         3.0
## 149          6.2         3.4
## 150          5.9         3.0

grabbing a single column of a dataframe

R has two ways you can specify a column. The first we’ve already seen: with a number. For example:

iris[,2]
##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
##  [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
##  [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2
##  [52] 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7
##  [69] 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0
##  [86] 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7
## [103] 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6
## [120] 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0
## [137] 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0

this grabs the second column of iris. There is a better way to do the same thing: refer to the column by the name we’ve given it, using a $ after the dataframe:

iris$Sepal.Width
##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
##  [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
##  [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2
##  [52] 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7
##  [69] 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0
##  [86] 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7
## [103] 3.0 2.9 3.0 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6
## [120] 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0
## [137] 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0

This is a better way to refer to columns, as the name of a column won’t change even as columns are added to or deleted from the dataframe.

only grabbing parts of a dataframe that meet some condition

We can use subset to create a new dataframe containing only those rows of an existing dataframe that meet some condition. For example, we might want a dataframe that only contains the data from the setosa species. We could do that like this:

setosa <- subset(iris, iris$Species=="setosa")

This tells R to subset the iris dataframe and only keep those rows (observations) in which Species is equal to "setosa". The result looks like this:

setosa
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa

So setosa is now its own dataframe that only contains observations from setosa individuals.
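The same rows can also be pulled out with the bracket notation from earlier by putting a logical condition where the row numbers would go; this builds an equivalent dataframe:

setosa_rows <- iris[iris$Species == "setosa", ]   # rows where Species is "setosa", all columns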

basic plotting

Once you’ve verified that the data have been entered correctly and look good, you probably want to visualize them. We’ll look at a couple of ways of visualizing different parts of this dataset.

visualizing the distribution of a continuous variable

For this next example, I’m going to use the trees dataframe that comes with R. The data are the girth (in inches), height (in feet), and volume (in cubic feet) of 31 felled black cherry trees. To see what the dataframe looks like, try using str(trees) or head(trees).

We often want to get a graphical sense of what the distribution of some continuous variable looks like. For instance, in the trees dataset, we see that the height of the trees is variable, but by looking at the data we don’t get a good sense of what the most common height is or what the range of the heights is.

histogram

To plot a histogram of the tree heights, use the R function hist:

hist(trees$Height)

To get a more fine-scale idea of the distribution, we can change the number of breaks:

hist(trees$Height,breaks=10)

Remember that all the options for any R function can be found by typing ? in front of the function name, e.g., ?hist.

boxplots

Boxplots are a good graphical way to show the five number summary–the median, upper and lower extremes, and the 1st and 3rd quartiles–of a distribution. Use the boxplot function in R to plot a boxplot:

boxplot(trees$Height)
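If you want those same numbers as text rather than a picture, fivenum will print them (summary gives a similar overview plus the mean):

fivenum(trees$Height)   # minimum, lower hinge, median, upper hinge, maximum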

stem and leaf plots

Stem and leaf plots are often a nice way of showing the distribution of some variable because they form a kind of histogram out of the data values themselves:

stem(trees$Height)
## 
##   The decimal point is at the |
## 
##   62 | 0
##   64 | 00
##   66 | 0
##   68 | 0
##   70 | 00
##   72 | 00
##   74 | 00000
##   76 | 000
##   78 | 00
##   80 | 0000000
##   82 | 00
##   84 | 0
##   86 | 00

visualizing the relationship between two continuous variables

We might want to plot the relationship between girth and volume. In R, plot is the easiest way to accomplish this. At its most basic, plot takes two arguments: a variable for the x axis and a variable for the y axis:

plot(trees$Girth, trees$Volume)

Later on we’ll see how to test whether the relationship between girth and volume is statistically significant, even though it’s clear that there is a positive association between these two variables.

An aside: Challenge yourself not to use defaults. If you’re only using defaults, you’re probably doing something wrong. Read the help file on plot to find out about ways you can change the plot to make it better. Here’s an example of what that might look like:

plot(trees$Girth, trees$Volume, pch=16, xlab="girth (in.)", ylab= expression(feet^3), cex.lab=1.2, main="relationship between girth and volume",bty="l")
abline(a=-36.94, b=5.0659, col="red")   # intercept and slope fitted by lm(Volume~Girth, data=trees)

You can even write functions that implement these changes. I’ve done this and posted the resulting function to github as an R package:

library(redingPlot)
## Loading required package: beeswarm
## Loading required package: magrittr
## Loading required package: viridis
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
scatter(trees$Girth, trees$Volume,xlab="tree girth (in.)",ylab="tree volume (cu. ft.)",main="relationship between girth and volume")

visualizing the relationship between a continuous and a categorical variable

Scientists often do some experiment or manipulation and measure some variable of interest in a control group and two or more treatment groups. How do we visualize this type of data? For these examples, I’ll jump back to using the iris dataset.

boxplots

We can use a boxplot to compare petal width by species in the iris dataset like this:

boxplot(Petal.Width~Species, data=iris)

This is different syntax than we’ve used in the past. We’re telling R that we want to plot Petal.Width as a function of Species. You can think of the ~ as meaning as a function of. This sort of formula syntax comes up a lot in R.
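The same formula syntax works in many other functions. For example, aggregate uses it to compute a summary statistic within each group; a quick sketch:

aggregate(Petal.Width ~ Species, data=iris, FUN=mean)   # mean petal width for each species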

Personally, I think it’s important to show the actual data as you would in a scatterplot, so I’ve written some functions that combine boxplots (which show a summary of the data) with the actual data stacked to look like a histogram:

beeStripBox(iris$Petal.Width,iris$Species,stats=F,ylab="petal width", xlab="species")

other types of plots

You can also use bar plots, strip plots, or violin plots to show this type of data. One cool plot I’d like to point out is called beeswarm and can be implemented like this:

require(beeswarm); require(viridis)
beeswarm(Petal.Width~Species, data=iris, col=magma(5)[1:3], pch=16)
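If you’d rather stay in base R, stripchart draws a quick jittered strip plot of the same comparison without any extra packages; a minimal sketch:

stripchart(Petal.Width ~ Species, data=iris, vertical=TRUE, method="jitter", pch=16)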


basic statistics

A quick review of some really important concepts:

sample vs population

When we collect data, we are sampling from some population. If we want to understand the average height of men at UT, our population is human men at UT and our sample would be the random men we pull aside to measure. From our data, we get some estimate of the average height. It’s important to remember that because we cannot census every UT male, the true average height of a UT male is unknowable. For all interesting questions, whatever we want to estimate (e.g. a mutation rate or a preference score) cannot be known–we use a sample of our population of interest (e.g. E. coli genomes or sailfin molly females) to make an inference about what that interesting number might be. The numbers we get from the data we collect from samples are statistics, and those numbers that we will never technically be able to know are parameters.

hypothesis testing

Falsification is arguably the cornerstone of scientific progress: we make claims and proceed by showing they are likely false. Because we are mere mortals and are therefore limited to sampling populations instead of censusing them, scientists can never technically prove anything in the mathematical sense of the word. We use hypothesis testing to show that a null hypothesis–a boring hypothesis about how the world works–is unlikely, and that therefore some other hypothesis, often called an alternative hypothesis–a more interesting theory about how the world works–is more likely.

For example, we might wish to test whether there exists some relationship between height and weight. Our null hypothesis–our boring vision of the world–says no, there is no relationship there. We might say \(H_0: r = 0\). The \(H_0\) means ‘the null hypothesis’ and r is the correlation coefficient, the thing we are interested in measuring. We might think that there’s something more interesting happening, though. Our alternative view of the world might be that there is a non-random association between height and weight. We might write our alternative hypothesis as \(H_A: r \neq 0\). This is called a two-sided test because there are two ways our null hypothesis could be rejected: if r is greater than 0 or if it is less than 0. We could instead have guessed the direction of the relationship in advance and used a one-sided alternative, such as \(H_A: r > 0\).

p-values

A p-value is a probability, so it ranges from 0 to 1. It is the probability that, assuming our null hypothesis is true, we get a statistic as extreme or more extreme than the statistic we actually observed. For example, if we compute a mean of 10 for some data we collected and our null hypothesis is that the mean should be 5, the p-value quantifies the probability of seeing a number as or more extreme than 10 if the true mean is actually 5.

testing for a relationship between two quantitative variables

A common question to ask with the tree data we examined above is whether two things covary more than expected by chance. Recall that we plotted the girth and volume of trees like this

scatter(trees$Girth, trees$Volume,xlab="tree girth (in.)",ylab="tree volume (cu. ft.)",main="relationship between girth and volume", stats=F)

and saw that there was clearly a relationship between tree girth and volume. How do we quantify our belief that there really is a non-random association between these two quantities? We could do two different types of tests, each of which makes slightly different assumptions.

correlation

A correlation test asks whether the correlation between two variables, X and Y, is non-random. A correlation is calculated like

\[ correlation = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} \]

where \(Cov(X,Y)\) is the covariance between X and Y, a measure of how much a change in X can predict a change in Y and vice versa, and \(Var(X)\) is the variance of the variable X, an important statistical property of a distribution that describes the spread of the data.


Bonus! For the curious: \[ Cov(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}) (Y_i - \bar{Y}) \]

where \(\bar{X}\) is the mean of X and \(X_i\) is the ith observation of X, and \[ Var(X) = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \]
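You can check this relationship directly in R, since cov and var compute exactly these sample versions (the 1/(n-1) factors cancel in the ratio):

cov(trees$Girth, trees$Volume) / sqrt(var(trees$Girth) * var(trees$Volume))
## [1] 0.9671194
cor(trees$Girth, trees$Volume)   # the same number, computed for you
## [1] 0.9671194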


This may seem too math heavy, but it’ll become clear why this is important when we talk about regression.

Importantly, a correlation makes no presumption of which variable is causing a change in the other variable. This comes through in the \(\sqrt{Var(X)Var(Y)}\) part of the definition for correlation: the X and Y variables are being given the same weight. We’ll see that this is not true for a regression analysis.

assumptions

see Assumptions for regression.

in R

Let’s test whether the correlation coefficient between tree girth and volume is significantly different from zero. We are asking, “If there were no relationship between tree girth and volume, how likely is it that we would get a correlation coefficient as extreme or more extreme than the one we observe in our sample?”


a quick aside: How could we ever get a relationship between two things if there really is none? Sampling. Just by chance, it’s possible that we happen to grab a sample that shows a relationship where none exists. To convince you of this, let’s sample 20 X,Y values from two variables I know to be totally independent. These X,Y values might be 20 students’ heights and their scores on their last exam. I use R to simulate these 20 X,Y pairs and plot them:

set.seed(19)
scatter(rnorm(20), rnorm(20), xlab="X",ylab="Y")

Here we have a relationship that is statistically significant (p < 0.05) even though we know there is no relationship there, because I’ve sampled two completely independent distributions. (Another way of thinking about it is that \(Cov(X,Y) = 0\) in the equation above.)

While this particular sample showed a significant relationship, others won’t. If you run set.seed(20);scatter(rnorm(20), rnorm(20), xlab="X",ylab="Y") you’ll see that there is–correctly–no significant relationship. Using a p-value cutoff (also called an alpha value) of 0.05, we expect to see a statistically significant relationship about 5% of the time even when our X and Y values are completely uncorrelated.


In R, we use cor.test to (1) get our estimate of the correlation coefficient given our sample, (2) place confidence limits on our estimate, and (3) do hypothesis testing.

cor.test(trees$Girth, trees$Volume)
## 
##  Pearson's product-moment correlation
## 
## data:  trees$Girth and trees$Volume
## t = 20.478, df = 29, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9322519 0.9841887
## sample estimates:
##       cor 
## 0.9671194

Interpreting the results: I’m not going to get into the math, but we use a t-distribution to test for the significance of a correlation. The second line of the output shows us this t statistic along with the degrees of freedom. Here we have n - 2 = 29 degrees of freedom. Degrees of freedom are the number of independent pieces of information we have. Why the -2? We lost one each when we estimated the mean of x and the mean of y. (Once we have the mean of x, we could lose one data point but we’d always be able to fill it back in.) The same line also shows the p-value under the null hypothesis that there is no correlation.

The next line tells us the alternative hypothesis that R is assuming. R almost always assumes you want a two-tailed test, but you can tell it you want a one-tailed test instead. Look at ?cor.test to see how.

Next it shows the 95% confidence interval around our estimate of the correlation coefficient. This is like saying, ‘we’re 95% sure that the true correlation is between 0.9322519 and 0.9841887.’ It’s a measure of uncertainty about the parameter we are attempting to estimate.

Finally, R shows you the calculated correlation coefficient, here 0.967: really high, but that’s sort of what we expected after looking at the data.


BONUS: Extracting statistics from tests in R.

Recall that we can view the structure of any R object using str. When we use that on cor.test, we see:

str(cor.test(trees$Girth, trees$Volume))
## List of 9
##  $ statistic  : Named num 20.5
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named int 29
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0
##  $ estimate   : Named num 0.967
##   ..- attr(*, "names")= chr "cor"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "correlation"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Pearson's product-moment correlation"
##  $ data.name  : chr "trees$Girth and trees$Volume"
##  $ conf.int   : atomic [1:2] 0.932 0.984
##   ..- attr(*, "conf.level")= num 0.95
##  - attr(*, "class")= chr "htest"

Those $ should look familiar. They’re a clue that you can use the $, just as you could for dataframes, to pull out something specific from the results. For instance, to get just the p-value from this test, you could use

cor.test(trees$Girth, trees$Volume)$p.value
## [1] 0

To get the parameter being estimated here (the correlation coefficient), you could use

cor.test(trees$Girth, trees$Volume)$estimate
##       cor 
## 0.9671194
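The confidence interval can be pulled out the same way:

cor.test(trees$Girth, trees$Volume)$conf.int
## [1] 0.9322519 0.9841887
## attr(,"conf.level")
## [1] 0.95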

Too fun!


regression

Regression is the most commonly used method to assess the relationship between quantitative variables.

maths …

assumptions

  1. The relationship between X and Y is actually linear and not, say, hump-shaped or something more complicated.
  2. The X variable is measured without error.
  3. For any given value of X, the sampled Y values are independent and have normally distributed errors (i.e. normally distributed residuals)
  4. Variance is constant along the x-axis.

in R

We’ll test for the relationship between tree girth (as the explanatory variable) and tree height (as the response variable) using regression. In R, we use lm to build a linear model. We use the syntax y~x, read “y as a function of x”, to specify our response variable y and our explanatory variable x.

tree_model <- lm(Height~Girth, data=trees) # equivalent to tree_model <- lm(trees$Height~trees$Girth)
summary(tree_model)
## 
## Call:
## lm(formula = Height ~ Girth, data = trees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5816  -2.7686   0.3163   2.4728   9.9456 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  62.0313     4.3833  14.152 1.49e-14 ***
## Girth         1.0544     0.3222   3.272  0.00276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.538 on 29 degrees of freedom
## Multiple R-squared:  0.2697, Adjusted R-squared:  0.2445 
## F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

In the example above, I’ve created an object called tree_model that holds the results of the regression. To see what things comprise this object, try using str(tree_model). To actually view the results of the regression, call summary on this object.

The results show us that there is indeed a positive relationship between girth and height. Our estimate of the y-intercept is 62.0313, as shown in the table, and our estimate of the slope is 1.0544, meaning that for every one-inch increase in girth, height increases by about 1.05 feet.

Also notice that there are standard errors, t-values, and p-values associated with our estimates of the slope and intercept. The standard error quantifies our uncertainty in the estimate of each of these parameters; the t value combines this uncertainty with our point estimate of each parameter and our null hypothesis to give us a p-value associated with each estimate. In this case, the null hypothesis for the y-intercept, often called \(\beta_0\), is \(H_0: \beta_0 = 0\), and our null hypothesis for the slope, often referred to as \(\beta_1\), is \(H_0: \beta_1 = 0\).
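If you’d rather have confidence intervals for the intercept and slope than just p-values, confint computes them from the fitted model:

confint(tree_model)   # 95% confidence intervals for the intercept and slope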

This is great, but we haven’t tested whether these data meet the assumptions of regression. All assumptions are things we assume for the sake of convenience, and some of them we can attempt to verify. For example, because we are mere mortals we cannot know for certain that the residuals are normally distributed, but we can submit the residuals to a test of normality and see if they pass. Of course, they may pass the normality test but in actuality be distributed in some other way; or they could, by chance, fail the normality test but actually be normally distributed.

All that being said, it’s important to see whether the assumptions of any statistical test are likely violated and, if so, how badly.

In R, we can check the normality of the residuals using a Shapiro-Wilk test:

# normality of residuals
shapiro.test(tree_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  tree_model$residuals
## W = 0.97426, p-value = 0.6425

The null hypothesis here is that the values are normally distributed. Thus a non-significant result is evidence in favor of normality of our data.
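A visual check is also useful here: if the residuals are roughly normal, a quantile-quantile plot of them should fall close to a straight line. A quick sketch:

# Q-Q plot of the residuals; points should hug the reference line
qqnorm(tree_model$residuals)
qqline(tree_model$residuals)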

We can also eyeball whether the relationship between our explanatory and response variables is likely to be linear, and whether the variance of the residuals stays roughly constant across the range of the explanatory variable. We can do this by plotting the residuals against their fitted values (the fitted values are the values of the response variable that fall along the regression line):

# residuals vs fitted
plot(tree_model$residuals, predict(tree_model),pch=16,xlab="residuals",ylab="fitted", bty="l")

This should look completely random, and it does. What would this have looked like if the relationship between X and Y had been something non-linear?

vals <- runif(40, min=-50,max=50)
vals_y <- vals^2 + vals + runif(length(vals),-250,250)
plot(vals,vals_y,pch=16,main="example of non-linear relationship")

a_bad_model <- lm(vals_y~vals)
plot(a_bad_model$residuals, predict(a_bad_model),pch=16,xlab="residuals",ylab="fitted",main="reidual plot shows pattern",bty="l")

There is a clear pattern here, indicating that the relationship between the predictor and response variables was not linear. A linear model of the type we ran would be inappropriate for this dataset.

Another good practice is to plot what’s called an influence plot. This plot is the result of deleting one datum from your dataset, re-running your model, and recording how the parameter estimates would have changed had you not collected that data point. (In general, this idea is called jackknifing.) It’s a useful plot for spotting outliers in your dataset and being aware of the effect of each individual datum. We can make an influence plot of our tree data like this:

# influence plot
infl<-lm.influence(tree_model)
plot(infl$coefficients[,1], infl$coefficients[,2],pch=16, xlab="change to intercept", ylab="change to slope", main="influence plot",bty="l")
points(0,0,pch=16,col="red",cex=1.2)

The red point at (0,0) marks no change relative to the estimates from the entire dataset. The individual points show how the estimates of the slope and the intercept would change had you deleted that point from the dataset. In general, deleting individual data points from the dataset has a pretty small effect on our estimate of the slope. That’s good. Let’s add an outlier and see what the influence plot looks like:

x <- c(trees$Girth,40)
y <- c(trees$Height,150)
plot(x,y,xlab="Girth with outlier", ylab="Height with outlier")

tree_model_outlier <- lm(y~x)
infl<-lm.influence(tree_model_outlier)
plot(infl$coefficients[,1], infl$coefficients[,2],pch=16, xlab="change to intercept", ylab="change to slope", main="influence plot with outlier")

We see the effect of adding an outlier to the dataset: all the data points appear as a cluster except the outlier, which has a relatively huge effect on the estimates of the slope and the intercept. If you have a dataset like this, run your statistical tests with and without the outlier and hope they converge on a similar answer. You don’t want your conclusions to hinge on the inclusion of a single data point.

testing whether a mean is different from some expectation

Sometimes we want to test whether the mean of some data we’ve collected is different from some expectation we have. For example, we might be interested in the proportion of time that a fish spends on the white side of a tank that is half black, half white. We’d be interested in testing whether the proportion of time fish spend in the white part of the tank is different from 0.5, which is our expectation if the fish are swimming around randomly in the tank. For this type of statistical question we can use a one-sample t-test.

assumptions

  1. the data are independent and identically distributed
  2. the data are drawn from a normal distribution

in R:

For this example I’ll use the sleep dataset, which shows the increase in number of hours of sleep people had when taking a soporific drug. We might be interested to know if the drug had any effect at all; if it didn’t, then the mean difference in hours slept should be 0.

boxplot(sleep$extra)
abline(a=0,b=0,lwd=2,lty=2)

t.test(sleep$extra, mu=0)
## 
##  One Sample t-test
## 
## data:  sleep$extra
## t = 3.413, df = 19, p-value = 0.002918
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.5955845 2.4844155
## sample estimates:
## mean of x 
##      1.54

The first argument to t.test is the data you’ve collected: the extra sleep in number of hours. The mu=0 part tells R that the null hypothesis is that the mean of the distribution from which these data were drawn (often called \(\mu\)) is 0. This is the default in R anyway, but it’s a good thing to know.

Finally, we should check and make sure that the data are normally distributed:

shapiro.test(sleep$extra)
## 
##  Shapiro-Wilk normality test
## 
## data:  sleep$extra
## W = 0.94607, p-value = 0.3114

The p-value is not significant, so we have no evidence that these data are drawn from a non-normal distribution.

testing for differences of some quantitative variable between two groups

Often we want to compare the mean of some quantitative variable we’ve measured in two groups. For example, maybe we’re measuring the amount of time a vole spends with his partner, where we’ve injected some males with isotocin and other males with saline. We might want to know whether the time spent with the partner differs between these two groups. We can use a two-sample t-test to test whether the mean amount of time spent differs between the two groups.

t-test

assumptions

  1. the data collected are independent and identically distributed (iid)
  2. the data for each group are drawn from a normal distribution
  3. the data in the two groups are independent from one another (e.g. the same individuals weren’t given both treatments)

It is often the case, depending on the software, that the variances of the variable in each group are assumed to be equal. Welch’s t-test, the default in R, relaxes this assumption, so we don’t need to worry about it, but be wary if you carry out a t-test using different software.

in R

For this example I’ll use a dataset on tooth growth in guinea pigs after ingesting vitamin C either through orange juice (coded as OJ) or ascorbic acid alone (VC for ‘vitamin C’). Run head and str on the ToothGrowth dataset to see what the data look like. The first column is the length of the tooth. For now, we’ll ignore the last column.

A good check is to always plot things out:

cats_meow(ToothGrowth$len, ToothGrowth$supp, ylab= "tooth length")

Running a t-test is pretty simple:

t.test(ToothGrowth$len[ToothGrowth$sup=="OJ"], ToothGrowth$len[ToothGrowth$sup=="VC"]) # or t.test(len~supp, data = ToothGrowth)
## 
##  Welch Two Sample t-test
## 
## data:  ToothGrowth$len[ToothGrowth$sup == "OJ"] and ToothGrowth$len[ToothGrowth$sup == "VC"]
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

We’re basically just giving t.test the two datasets we want to compare: the lengths of teeth from pigs given orange juice and the lengths of teeth from pigs given ascorbic acid.

The output shows us our t-statistic, degrees of freedom, and resulting p-value. We’re also given a confidence interval around the difference in tooth lengths between the two groups; note that this confidence interval includes 0 (no difference), so we know this is not a statistically significant result. Remember that we can use str to query specific parts of this result–try running results <- t.test(len~supp, data = ToothGrowth); results$p.value.

We can make sure the tooth lengths in each group are normally distributed using a Shapiro-Wilk test:

shapiro.test(ToothGrowth$len[ToothGrowth$sup=="OJ"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ToothGrowth$len[ToothGrowth$sup == "OJ"]
## W = 0.91784, p-value = 0.02359
shapiro.test(ToothGrowth$len[ToothGrowth$sup=="VC"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ToothGrowth$len[ToothGrowth$sup == "VC"]
## W = 0.96567, p-value = 0.4284

Oh no! We have a significant result here in the OJ group, meaning that these data are likely not drawn from a normal distribution. This violates one of the assumptions of a t-test. What can we do now?

wilcoxon rank-sum test

A Wilcoxon rank-sum test is a non-parametric test, meaning that it relaxes the assumption that the data are drawn from a specific distribution. In exchange, non-parametric tests typically have less power–the ability to reject a null hypothesis when it is false–than parametric tests.

A Wilcoxon test is run in much the same way:

wilcox.test(ToothGrowth$len[ToothGrowth$sup=="OJ"], ToothGrowth$len[ToothGrowth$sup=="VC"])
## Warning in wilcox.test.default(ToothGrowth$len[ToothGrowth$sup == "OJ"], :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  ToothGrowth$len[ToothGrowth$sup == "OJ"] and ToothGrowth$len[ToothGrowth$sup == "VC"]
## W = 575.5, p-value = 0.06449
## alternative hypothesis: true location shift is not equal to 0

The results are basically the same as with the t-test, but the p-value is slightly higher. (This is typically what we find when we compare parametric tests like t-tests to non-parametric tests like Wilcoxon tests.) This is a valid test for this type of data because non-parametric tests do not make assumptions about the nature of the underlying distributions from which our data were drawn.

testing for differences of some quantitative variable between three or more groups