Getting Started

You should have already installed R Studio and completed the online tutorials

Learn R Basics

You should go through the R basics tutorial, which teaches R environment and some of the most basic funcionality aspects of R. Then you should learn about R dataframes. You should be able to manipulate data with dplyr and create effective data visualization. You can also go through the swirl tutorial, which teaches you R programming and data science interactively, at your own pace and in the R console. Once you have R installed, you can install swirl and run it the following way:

install.packages("swirl")
library(swirl)
swirl()

Alternatively you can take the try R interactive class from Code School.

There are also many open and free resources and reference guides for R. Two examples are:

Quick-R: a quick online reference for data input, basic statistics and plots
R reference card (PDF) by Tom Short

Two key things you need to know about R is that you can get help for a function using help or ?, like this:

?install.packages
help("install.packages")

and the hash character represents comments, so text following these characters is not interpreted:

##This is just a comment

Installing Packages

The first R command we will run is install.packages. If you took the swirl tutorial you should have already done this. R only includes a basic set of functions. It can do much more than this, but not everybody needs everything so we instead make some functions available via packages. Many of these functions are stored in CRAN. Note that these packages are vetted: they are checked for common errors and they must have a dedicated maintainer. You can easily install packages from within R if you know the name of the packages. As an example, we are going to install the package rafalib which we use in our first data analysis examples:

install.packages("rafalib")

We can then load the package into our R sessions using the library function:

library(rafalib)

From now on you will see that we sometimes load packages without installing them. This is because once you install the package, it remains in place and only needs to be loaded with library. If you try to load a package and get an error, it probably means you need to install it first.

Importing Data into R

The first step when preparing to analyze data is to read in the data into R. There are several ways to do this and we will discuss three of them. But you only need to learn one to follow along.

In the life sciences, small datasets such as the one used as an example in the next sections are typically stored as Excel files. Although there are R packages designed to read Excel (xls) format, you generally want to avoid this and save files as comma delimited (Comma-Separated Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files. These plain-text formats are often easier for sharing data with collaborators, as commercial software is not required for viewing or working with the data. We will start with a simple example dataset containing female mouse weights.

The first step is to find the file containing your data and know its path.

Paths and the Working Directory

When you are working in R it is useful to know your working directory. This is the directory or folder in which R will save or look for files by default. You can see your working directory by typing:

getwd()

You can also change your working directory using the function setwd. Or you can change it through RStudio by clicking on “Session”.

The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for beginners will have you reading and writing to the working directory. However, you can also type the full path, which will work independently of the working directory.

Projects in RStudio

We find that the simplest way to organize yourself is to start a Project in RStudio (Click on “File” and “New Project”). When creating the project, you will select a folder to be associated with it. You can then download all your data into this folder. Your working directory will be this folder.

Option 1: Read file over the Internet

You can navigate to the femaleMiceWeights.csv file by visiting the data directory of dagdata on GitHub. If you navigate to the file, you need to click on Raw on the upper right hand corner of the page.

Now you can copy and paste the URL and use this as the argument to read.csv. Here we break the URL into a base directory and a filename and then combine with paste0 because the URL would otherwise be too long for the page. We use paste0 because we want to put the strings together as is, if you were specifying a file on your machine you should use the smarter function, file.path, which knows the difference between Windows and Mac file path connectors. You can specify the URL using a single string to avoid this extra step.

dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
url <- paste0(dir, "femaleMiceWeights.csv")
dat <- read.csv(url)

Option 2: Download file with your browser to your working directory

There are reasons for wanting to keep a local copy of the file. For example, you may want to run the analysis while not connected to the Internet or you may want to ensure reproducibility regardless of the file being available on the original site. To download the file, as in option 1, you can navigate to the femaleMiceWeights.csv. In this option we use your browser’s “Save As” function to ensure that the downloaded file is in a CSV format. Some browsers add an extra suffix to your filename by default. You do not want this. You want your file to be named femaleMiceWeights.csv. Once you have this file in your working directory, then you can simply read it in like this:

dat <- read.csv("femaleMiceWeights.csv")

If you did not receive any message, then you probably read in the file successfully.

Option 3: Download the file from within R

We store many of the datasets used here on GitHub. You can save these files directly from the Internet to your computer using R. In this example, we are using the download.file function in the downloader package to download the file to a specific location and then read it in. We can assign it a random name and a random directory using the function tempfile, but you can also save it in directory with the name of your choosing.

library(downloader) ##use install.packages to install
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv" 
url <- paste0(dir, filename)
if (!file.exists(filename)) download(url, destfile=filename)

We can then proceed as in option 2:

dat <- read.csv(filename)

Option 4: Download the data package (Advanced)

Many of the datasets we include in this book are available in custom-built packages from GitHub. The reason we use GitHub, rather than CRAN, is that on GitHub we do not have to vet packages, which gives us much more flexibility.

To install packages from GitHub you will need to install the devtools package:

install.packages("devtools")

Note to Windows users: to use devtools you will have to also install Rtools. In general you will need to install packages as administrator. One way to do this is to start R as administrator. If you do not have permission to do this, then it is a bit more complicated.

Now you are ready to install a package from GitHub. For this we use a different function:

library(devtools)
install_github("genomicsclass/dagdata")

The file we are working with is actually included in this package. Once you install the package, the file is on your computer. However, finding it requires advanced knowledge. Here are the lines of code:

dir <- system.file(package="dagdata") #extracts the location of package
list.files(dir)

## [1] "data"        "DESCRIPTION" "extdata"     "help"        "html"       
## [6] "Meta"        "NAMESPACE"   "script"

list.files(file.path(dir,"extdata")) #external data is in this directory

## [1] "admissions.csv"               "astronomicalunit.csv"        
## [3] "babies.txt"                   "femaleControlsPopulation.csv"
## [5] "femaleMiceWeights.csv"        "mice_pheno.csv"              
## [7] "msleep_ggplot2.csv"           "README"                      
## [9] "spider_wolff_gorb_2013.csv"

And now we are ready to read in the file:

filename <- file.path(dir,"extdata/femaleMiceWeights.csv")
dat <- read.csv(filename)

Inference

Introduction

Let’s use this paper as an example.

Note that the abstract has this statement:

“Body weight was higher in mice fed the high-fat diet already after the first week, due to higher dietary intake in combination with lower metabolic efficiency.”

To support this claim they provide the following in the results section:

“Already during the first week after introduction of high-fat diet, body weight increased significantly more in the high-fat diet-fed mice ($+$ 1.6 $\pm$ 0.1 g) than in the normal diet-fed mice ($+$ 0.2 $\pm$ 0.1 g; P < 0.001).”

What does P < 0.001 mean? What are the $\pm$ included? We will learn what this means and learn to compute these values in R. The first step is to understand random variables. To do this, we will use data from a mouse database (provided by Karen Svenson via Gary Churchill and Dan Gatti and partially funded by P50 GM070683). We will import the data into R and explain random variables and null distributions using R programming.

If you already downloaded the femaleMiceWeights file into your working directory, you can read it into R with just one line:

dat <- read.csv("femaleMiceWeights.csv")

Remember that a quick way to read the data, without downloading it is by using the url:

dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv"
url <- paste0(dir, filename)
dat <- read.csv(url)

Our first look at data

We are interested in determining if following a given diet makes mice heavier after several weeks. This data was produced by ordering 24 mice from The Jackson Lab and randomly assigning either chow or high fat (hf) diet. After several weeks, the scientists weighed each mouse and obtained this data (head just shows us the first 6 rows):

head(dat)

##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79

In RStudio, you can view the entire dataset with:

View(dat)

So are the hf mice heavier? Mouse 24 at 20.73 grams is one of the lightest mice, while Mouse 21 at 34.02 grams is one of the heaviest. Both are on the hf diet. Just from looking at the data, we see there is variability. Claims such as the one above usually refer to the averages. So let’s look at the average of each group:

library(dplyr)
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
print( mean(treatment) )

## [1] 26.83417

print( mean(control) )

## [1] 23.81333

obsdiff <- mean(treatment) - mean(control)
print(obsdiff)

## [1] 3.020833

So the hf diet mice are about 10% heavier. Are we done? Why do we need p-values and confidence intervals? The reason is that these averages are random variables. They can take many values.

If we repeat the experiment, we obtain 24 new mice from The Jackson Laboratory and, after randomly assigning them to each diet, we get a different mean. Every time we repeat this experiment, we get a different value. We call this type of quantity a random variable.

Random Variables

Let’s explore random variables further. Imagine that we actually have the weight of all control female mice and can upload them to R. In Statistics, we refer to this as the population. These are all the control mice available from which we sampled 24. Note that in practice we do not have access to the population. We have a special dataset that we are using here to illustrate concepts.

The first step is to download the data from here into your working directory and then read it into R:

population <- read.csv("femaleControlsPopulation.csv")
##use unlist to turn it into a numeric vector
population <- unlist(population)

Now let’s sample 12 mice three times and see how the average changes.

control <- sample(population,12)
mean(control)

## [1] 23.42083

control <- sample(population,12)
mean(control)

## [1] 23.65167

control <- sample(population,12)
mean(control)

## [1] 23.79667

Note how the average varies. We can continue to do this repeatedly and start learning something about the distribution of this random variable.

The Null Hypothesis

Now let’s go back to our average difference of obsdiff. As scientists we need to be skeptics. How do we know that this obsdiff is due to the diet? What happens if we give all 24 mice the same diet? Will we see a difference this big? Statisticians refer to this scenario as the null hypothesis. The name “null” is used to remind us that we are acting as skeptics: we give credence to the possibility that there is no difference.

Because we have access to the population, we can actually observe as many values as we want of the difference of the averages when the diet has no effect. We can do this by randomly sampling 24 control mice, giving them the same diet, and then recording the difference in mean between two randomly split groups of 12 and 12. Here is this process written in R code:

##12 control mice
control <- sample(population,12)
##another 12 control mice that we act as if they were not
treatment <- sample(population,12)
print(mean(treatment) - mean(control))

## [1] -1.579167

Now let’s do it 10,000 times. We will use a “for-loop”, an operation that lets us automate this (a simpler approach that, we will learn later, is to use replicate).

n <- 10000
null <- vector("numeric",n)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  null[i] <- mean(treatment) - mean(control)
}

The values in null form what we call the null distribution. We will define this more formally below.

So what percent of the 10,000 are bigger than obsdiff?

mean(null >= obsdiff)

## [1] 0.0131

Only a small percent of the 10,000 simulations. As skeptics what do we conclude? When there is no diet effect, we see a difference as big as the one we observed only 1.5% of the time. This is what is known as a p-value, which we will define more formally later.

Distributions

We have explained what we mean by null in the context of null hypothesis, but what exactly is a distribution? The simplest way to think of a distribution is as a compact description of many numbers. For example, suppose you have measured the heights of all men in a population. Imagine you need to describe these numbers to someone that has no idea what these heights are, such as an alien that has never visited Earth. Suppose all these heights are contained in the following dataset:

data(father.son,package="UsingR")
x <- father.son$fheight

One approach to summarizing these numbers is to simply list them all out for the alien to see. Here are 10 randomly selected heights of 1,078:

round(sample(x,10),1)

##  [1] 68.9 70.7 65.9 70.2 69.0 64.4 67.5 71.0 71.3 60.4

Cumulative Distribution Function

Scanning through these numbers, we start to get a rough idea of what the entire list looks like, but it is certainly inefficient. We can quickly improve on this approach by defining and visualizing a distribution. To define a distribution we compute, for all possible values of $a$, the proportion of numbers in our list that are below $a$. We use the following notation:

\[ F(a) \equiv \mbox{Pr}(x \leq a) \]

This is called the cumulative distribution function (CDF). When the CDF is derived from data, as opposed to theoretically, we also call it the empirical CDF (ECDF). The ECDF for the height data looks like this:

Empirical cummulative distribution function for height.

Histograms

Although the empirical CDF concept is widely discussed in statistics textbooks, the plot is actually not very popular in practice. The reason is that histograms give us the same information and are easier to interpret. Histograms show us the proportion of values in intervals:

\[ \mbox{Pr}(a \leq x \leq b) = F(b) - F(a) \]

Plotting these heights as bars is what we call a histogram. It is a more useful plot because we are usually more interested in intervals, such and such percent are between 70 inches and 71 inches, etc., rather than the percent less than a particular height. It is also easier to distinguish different types (families) of distributions by looking at histograms. Here is a histogram of heights:

hist(x)

We can specify the bins and add better labels in the following way:

bins <- seq(smallest, largest)
hist(x,breaks=bins,xlab="Height (in inches)",main="Adult men heights")

Histogram for heights.

Showing this plot to the alien is much more informative than showing numbers. With this simple plot, we can approximate the number of individuals in any given interval. For example, there are about 70 individuals over six feet (72 inches) tall.

Probability Distribution

Summarizing lists of numbers is one powerful use of distribution. An even more important use is describing the possible outcomes of a random variable. Unlike a fixed list of numbers, we don’t actually observe all possible outcomes of random variables, so instead of describing proportions, we describe probabilities. For instance, if we pick a random height from our list, then the probability of it falling between $a$ and $b$ is denoted with:

\[ \mbox{Pr}(a \leq X \leq b) = F(b) - F(a) \]

Note that the $X$ is now capitalized to distinguish it as a random variable and that the equation above defines the probability distribution of the random variable. Knowing this distribution is incredibly useful in science. For example, in the case above, if we know the distribution of the difference in mean of mouse weights when the null hypothesis is true, referred to as the null distribution, we can compute the probability of observing a value as large as we did, referred to as a p-value. In a previous section we ran what is called a Monte Carlo simulation and we obtained 10,000 outcomes of the random variable under the null hypothesis. Let’s repeat the loop above, but this time let’s add a point to the figure every time we re-run the experiment. If you run this code, you can see the null distribution forming as the observed values stack on top of each other.

n <- 100
library(rafalib)
nullplot(-5,5,1,30, xlab="Observed differences (grams)", ylab="Frequency")
totals <- vector("numeric",11)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  nulldiff <- mean(treatment) - mean(control)
  j <- pmax(pmin(round(nulldiff)+6,11),1)
  totals[j] <- totals[j]+1
  text(j-6,totals[j],pch=15,round(nulldiff,1))
  ##if(i < 15) Sys.sleep(1) ##You can add this line to see values appear slowly
  }

Illustration of the null distribution.

The figure above amounts to a histogram. From a histogram of the null vector we calculated earlier, we can see that values as large as obsdiff are relatively rare:

hist(null, freq=TRUE)
abline(v=obsdiff, col="red", lwd=2)

Null distribution with observed difference marked with vertical red line.

An important point to keep in mind here is that while we defined $\mbox{Pr}(a)$ by counting cases, we will learn that, in some circumstances, mathematics gives us formulas for $\mbox{Pr}(a)$ that save us the trouble of computing them as we did here. One example of this powerful approach uses the normal distribution approximation.

Normal Distribution

The probability distribution we see above approximates one that is very common in nature: the bell curve, also known as the normal distribution or Gaussian distribution. When the histogram of a list of numbers approximates the normal distribution, we can use a convenient mathematical formula to approximate the proportion of values or outcomes in any given interval:

\[ \mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx \]

While the formula may look intimidating, don’t worry, you will never actually have to type it out, as it is stored in a more convenient form (as pnorm in R which sets a to $-\infty$, and takes b as an argument).

Here $\mu$ and $\sigma$ are referred to as the mean and the standard deviation of the population (we explain these in more detail in another section). If this normal approximation holds for our list, then the population mean and variance of our list can be used in the formula above. An example of this would be when we noted above that only 1.5% of values on the null distribution were above obsdiff. We can compute the proportion of values below a value x with pnorm(x,mu,sigma) without knowing all the values. The normal approximation works very well here:

1 - pnorm(obsdiff,mean(null),sd(null))

## [1] 0.01370292

Later, we will learn that there is a mathematical explanation for this. A very useful characteristic of this approximation is that one only needs to know $\mu$ and $\sigma$ to describe the entire distribution. From this, we can compute the proportion of values in any interval.

Summary

So computing a p-value for the difference in diet for the mice was pretty easy, right? But why are we not done? To make the calculation, we did the equivalent of buying all the mice available from The Jackson Laboratory and performing our experiment repeatedly to define the null distribution. Yet this is not something we can do in practice. Statistical Inference is the mathematical theory that permits you to approximate this with only the data from your sample, i.e. the original 24 mice. We will focus on this in the following sections.

Setting the random seed

Before we continue, we briefly explain the following important line of code:

set.seed(1)

Throughout this book, we use random number generators. This implies that many of the results presented can actually change by chance, including the correct answer to problems. One way to ensure that results do not change is by setting R’s random number generation seed. For more on the topic please read the help file:

?set.seed

Homework 1

Prepare an R markdown document that outputss your answers as an html file. Submit both the Rmd file and the html file on Collab.

Read in the file femaleMiceWeights.csv
The [ and ] symbols can be used to extract specific rows and specific columns of the table. What is the entry in the 12th row and second column?
You should have learned how to use the $ character to extract a column from a table and return it as a vector. Use $ to extract the weight column and report the weight of the mouse in the 11th row.
The length function returns the number of elements in a vector. How many mice are included in the dataset?
View the data and determine what rows are associated with the high fat or hf diet. Then use the mean function to compute the average weight of these mice.
One of the functions we will be using often is sample. Read the help file for sample using ?sample. Now take a random sample of size 1 from the numbers 13 to 24 and report back the weight of the mouse represented by that row. Make sure to type set.seed(1) to ensure that everybody gets the same answer.
Read in the file femaleControlsPopulation.csv
What is the average of these weights?
After setting the seed at 1, set.seed(1) take a random sample of size 5. What is the absolute value (use abs) of the difference between the average of the sample and the average of all the values? 10.After setting the seed at 5, set.seed(5) take a random sample of size 5. What is the absolute value of the difference between the average of the sample and the average of all the values?’
Why are the answers from 9 and 10 different?
Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. What percent of these 1,000 averages are more than 1 ounce away from the average of the whole population?
Increase the number of times you redo the sample from 1,000 to 10,000. Set the seed at 1, then using a for-loop take a random sample of 5 mice 10,000 times. Save these averages. What percent of these 10,000 averages are more than 1 ounce away from the average of the whole population?
Set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages. What percent of these 1,000 averages are more than 1 ounce away from the average of the whole population?
Use ggplot to plot the histograms of the distribution of averages you get with a sample size of 5 and a sample size of 50. Plot both histograms in one panel. How do they differ?
For the last set of averages, the ones obtained from a sample size of 50, what percent are between 23 and 25?
Now ask the same question of a normal distribution with average 23.9 and standard deviation 0.43.

Inference in Statistics 1

Rafael Irizarry and Michael Love modified by Mete Civelek

August 24, 2017