Week 1: Getting Started

In this book we will be using the R programming language for all our analysis. You will learn R and statistics simultaneously. However, we assume you have some basic programming skills and knowledge of R syntax. If you don’t, your first homework, listed below, is to complete a tutorial. Here we give step-by-step instructions on how to get set up to follow along.

The R code that Rafa runs in this video is available here

Installing R

The first step is to install R. You can download and install R from the Comprehensive R Archive Network (CRAN). It is relatively straightforward, but if you need further help you can try the following resources:

Setting up R and RStudio in Windows

Running as Administrator

  • When you run the installers, make sure to run them as administrator.
  • Select the file, right-click, and choose Run as administrator

Installing R

  • Open your browser and go to http://cran.r-project.org/
  • Click on Download R for Windows
  • Click on base
  • Click on Download R 3.1.2 for Windows
  • Open and run the file you just downloaded R-3.1.2-win.exe
  • There is no need to change the default installation!
  • Please make sure you install the latest version of R (at the time of writing, R 3.1.2).

Installing Rtools

  • Now we need to install Rtools
  • Open your browser and go to http://cran.r-project.org/
  • Click on Download R for Windows
  • Click on Rtools
  • Download Rtools31.exe
  • Open and run the file you just downloaded Rtools31.exe
  • Again, there is no need to change the defaults

Installing RStudio

  • If you have already installed RStudio, before you skip this section check that you are using the right R version. Go to Tools/Global Options. Make sure you have R-3.1.2 under R version.
  • Go to http://www.rstudio.com/ and click on Desktop.
  • Select DOWNLOAD RSTUDIO DESKTOP
  • Download the installer for Windows.
  • Open and run the file you just downloaded RStudio-0.98.1091.exe
  • You don’t have to change any of the defaults for the installation

Installing devtools

  • Open RStudio; in the Files/Plots/Packages/Help panel, select Packages and then click on Install
  • Type devtools under Packages (make sure you spell the name of the package correctly), then click on Install

  • Alternatively, you can run the following command in the console.

    install.packages("devtools")

Installing packages from GitHub

  • We are ready to install R packages from GitHub
  • First we need to load the devtools package

#install.packages("devtools")
library(devtools)
  • Now we can install the gapminder package.
#install_github("jennybc/gapminder")


Installing RStudio

The next step is to install RStudio, a program for viewing and running R scripts. Technically you can run all the code shown here without installing RStudio, but we highly recommend this integrated development environment (IDE). Instructions are here and for Windows we have special instructions.

RStudio has four panels: the source, the console, the environment/history panel, and a fourth panel for help, plots, and other views, depending on which tab you click.

RStudio screenshot

Learn R Basics

The first homework assignment is to complete an R tutorial to familiarize yourself with the basics of programming and R syntax. To follow this book you should be familiar with the difference between lists (including data frames) and numeric vectors, for-loops, how to create functions, and how to use the sapply and replicate functions.

If you are already familiar with R, you can skip to the next section. Otherwise, you should go through the swirl tutorial, which teaches you R programming and data science interactively, at your own pace and in the R console. Once you have R installed, you can install swirl and run it the following way:

#install.packages("swirl")
#library(swirl)
#swirl()

swirl exercise

Install and run a course automatically from swirl

This is the preferred method of installing courses. It automates the process by allowing you to do everything right from the R console.

  1. Make sure you have a recent version of swirl:
#install.packages("swirl")
  2. Enter the following from the R console, substituting the name of the course that you wish to install:
#library(swirl)
#install_from_swirl("Course Name Here")
#swirl()

For example, install_from_swirl("R Programming") will install the R Programming course. Please note that course names are case sensitive!

If that doesn’t work for you…

Install and run a course manually

If the automatic course installation method outlined above does not work for you, then there’s a simple alternative.

  1. Click on the Download ZIP button on the right-hand side of this GitHub page.

  2. Enter the following from the R console, substituting the correct file path to your downloaded file and the name of your desired course:

#library(swirl)
#install_course_zip("path/to/file/here/swirl_courses-master.zip", multi=TRUE, which_course="Course Name Here")
#swirl()

For example, if you download the zip file to ~/Downloads/swirl_courses-master.zip, then the following command will install the R Programming course.

#install_course_zip("~/Downloads/swirl_courses-master.zip", multi=TRUE, which_course="R Programming")

Please note that course names are case sensitive!

Run swirl Exercises #1

Q:What version of R are you using (hint: make sure you download the latest version and then type version)? Please note that this question does not count toward your grade, but it is important to make sure that you are using the latest version of R.

version

A:3.2.3

Run swirl Exercises #2

Q:Create a numeric vector containing the numbers 2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23. What is the average of these numbers?

x <- c(2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23)
mean(x)
## [1] 7.794286

A:7.794286

Run swirl Exercises #3

Q:Use a for loop to determine the value of \(\sum_{i=1}^{25} i^2\)

s <- 0  ## avoid naming the variable "sum", which shadows the base function
for(i in 1:25)
  s <- s + i^2
s
## [1] 5525

A:5525

Run swirl Exercises #4

Q:The cars dataset is available in base R. You can type cars to see it. Use the class function to determine what type of object cars is.

class(cars)
## [1] "data.frame"

A:data.frame

Run swirl Exercises #5

Q:How many rows does the cars object have?

nrow(cars)
## [1] 50

A:50

Run swirl Exercises #6

Q:What is the name of the second column of cars?

names(cars)[2]
## [1] "dist"

A:dist

Run swirl Exercises #7

Q:The simplest way to extract the columns of a matrix or data.frame is using [. For example you can access the second column with cars[,2]. What is the average distance traveled in this dataset?

mean(cars[,2])
## [1] 42.98

A:42.98

Run swirl Exercises #8

Q:Familiarize yourself with the which function. What row of cars has a distance of 85?

which(cars[,2]==85)
## [1] 50

A:50

Alternatively, you can take the Try R interactive class from Code School.

There are also many open and free resources and reference guides for R.

Two key things you need to know about R are that you can get help for a function using help or ?, like this:

?install.packages
help("install.packages")

and the hash character represents comments, so text following these characters is not interpreted:

##This is just a comment

Installing Packages

The first R command we will run is install.packages. If you took the swirl tutorial you should have already done this. R only includes a basic set of functions. It can do much more than this, but not everybody needs everything, so we instead make additional functions available via packages. Many of these packages are distributed through CRAN. Note that these packages are vetted: they are checked for common errors and they must have a dedicated maintainer. You can easily install packages from within R if you know the name of the package. As an example, we are going to install the package rafalib, which we use in our first data analysis examples:

install.packages("rafalib")

We can then load the package into our R sessions using the library function:

library(rafalib)

From now on you will see that we sometimes load packages without installing them. This is because once you install the package, it remains in place and only needs to be loaded with library. If you try to load a package and get an error, it probably means you need to install it first.
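
For example, a common pattern (a minimal sketch using only base R functions) installs a package only when it is missing and then loads it:

if (!requireNamespace("rafalib", quietly=TRUE)) install.packages("rafalib")
library(rafalib)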

Importing Data into R

The first step when preparing to analyze data is to read the data into R. There are several ways to do this and we will discuss three of them. But you only need to learn one to follow along.

In the life sciences, small datasets such as the one used as an example in the next sections are typically stored as Excel files. Although there are R packages designed to read Excel (xls) format, you generally want to avoid this and save files as comma delimited (Comma-Separated Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files. These plain-text formats are often easier for sharing data with collaborators, as commercial software is not required for viewing or working with the data. We will start with a simple example dataset containing female mouse weights.

The first step is to find the file containing your data and know its path.

Paths and the Working Directory

When you are working in R it is useful to know your working directory. This is the directory or folder in which R will save or look for files by default. You can see your working directory by typing:

getwd()

You can also change your working directory using the function setwd. Or you can change it through RStudio by clicking on “Session”.

The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for beginners will have you reading and writing to the working directory. However, you can also type the full path, which will work independently of the working directory.
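
As a quick illustration (the full path below is hypothetical), both of these commands would read the same file:

## relative path: resolved against the working directory
#dat <- read.csv("femaleMiceWeights.csv")
## full path: works regardless of the working directory
#dat <- read.csv("/home/student/mice/femaleMiceWeights.csv")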

Projects in RStudio

We find that the simplest way to organize yourself is to start a Project in RStudio (Click on “File” and “New Project”). When creating the project, you will select a folder to be associated with it. You can then download all your data into this folder. Your working directory will be this folder.

Option 1: Read file over the internet

You can navigate to the femaleMiceWeights.csv file by visiting the data directory of dagdata on GitHub. If you navigate to the file, you need to click on Raw in the upper right-hand corner of the page.

GitHub page screenshot

Now you can copy and paste the URL and use this as the argument to read.csv:

url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
dat <- read.csv(url)
#dat <- read.csv("femaleMiceWeights.csv")

Option 2: Download file with your browser to your working directory

There are reasons for wanting to keep a local copy of the file. For example, you may want to run the analysis while not connected to the internet, or you may want to ensure reproducibility regardless of whether the file remains available on the original site. To download the file, as in Option 1, you can navigate to the femaleMiceWeights.csv file. In this option we use your browser’s “Save As” function to ensure that the downloaded file is in CSV format. Some browsers add an extra suffix to your filename by default. You do not want this. You want your file to be named femaleMiceWeights.csv. Once you have this file in your working directory, you can simply read it in like this:

dat <- read.csv("femaleMiceWeights.csv")

If you did not receive an error message, you probably read the file in successfully.

Option 3: Download the file from within R

We store many of the datasets used here on GitHub. You can save these files directly from the internet to your computer using R. In this example, we are using the download function from the downloader package (a wrapper around base R’s download.file) to download the file to a specific location and then read it in. We could assign it a random name and a random directory using the function tempfile, but here we save it in the working directory under a name of our choosing.

library(downloader) ##use install.packages to install
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv" 
if (!file.exists(filename)) download(url, destfile=filename)

We can then proceed as in option 2:

dat <- read.csv(filename)

Option 4: Download the data package (Advanced)

Many of the datasets we include in this book are available in custom-built packages from GitHub. The reason we use GitHub, rather than CRAN, is that on GitHub we do not have to vet packages, which gives us much more flexibility.

To install packages from GitHub you will need to install the devtools package:

install.packages("devtools")

Note to Windows users: to use devtools you will have to also install Rtools. In general you will need to install packages as administrator. One way to do this is to start R as administrator. If you do not have permission to do this, then it is a bit more complicated.

Now you are ready to install a package from GitHub. For this we use a different function:

library(devtools)
install_github("genomicsclass/dagdata")

The file we are working with is actually included in this package. Once you install the package, the file is on your computer. However, finding it requires advanced knowledge. Here are the lines of code:

dir <- system.file(package="dagdata") #extracts the location of package
list.files(dir)
## [1] "data"        "DESCRIPTION" "extdata"     "help"        "html"       
## [6] "Meta"        "NAMESPACE"   "script"
list.files(file.path(dir,"extdata")) #external data is in this directory
## [1] "admissions.csv"               "astronomicalunit.csv"        
## [3] "babies.txt"                   "femaleControlsPopulation.csv"
## [5] "femaleMiceWeights.csv"        "mice_pheno.csv"              
## [7] "msleep_ggplot2.csv"           "README"                      
## [9] "spider_wolff_gorb_2013.csv"

And now we are ready to read in the file:

filename <- file.path(dir,"extdata/femaleMiceWeights.csv")
dat <- read.csv(filename)

Getting Started Exercises

Here we will test some of the basics of R data manipulation which you should know or should have learned by following the tutorials above. You will need to have the file femaleMiceWeights.csv in your working directory. As we showed above, one way to do this is by using the downloader package:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv" 
download(url, destfile=filename)

Q:Read in the file femaleMiceWeights.csv and report the exact name of the column containing the weights.

micewt<-read.csv("femaleMiceWeights.csv")
colnames(micewt)
## [1] "Diet"       "Bodyweight"

A:Bodyweight

Q:The [ and ] symbols can be used to extract specific rows and specific columns of the table. What is the entry in the 12th row and second column?

micewt[12,2]
## [1] 26.25

A: 26.25

Q:You should have learned how to use the $ character to extract a column from a table and return it as a vector. Use $ to extract the weight column and report the weight of the mouse in the 11th row.

weights <- micewt$Bodyweight
weights[11]
## [1] 26.91
#or
micewt$Bodyweight[11]
## [1] 26.91

A:26.91

Q:The length function returns the number of elements in a vector. How many mice are included in our dataset?

length(micewt$Bodyweight)
## [1] 24

A:24

Q:To create a vector with the numbers 3 to 7, we can use seq(3,7) or, because they are consecutive, 3:7. View the data and determine what rows are associated with the high fat or hf diet. Then use the mean function to compute the average weight of these mice.

hfdiet<-micewt[micewt$Diet=="hf",]
mean(hfdiet$Bodyweight)
## [1] 26.83417
#or

#View(micewt) 
weights <- micewt$Bodyweight
mean( weights[ 13:24 ])
## [1] 26.83417

A:26.83417

Q:One of the functions we will be using often is sample. Read the help file for sample using ?sample. Now take a random sample of size 1 from the numbers 13 to 24 and report back the weight of the mouse represented by that row. Make sure to type set.seed(1) to ensure that everybody gets the same answer.

#?sample

set.seed(1)
i <- sample( 13:24, 1)
micewt$Bodyweight[i]
## [1] 25.34
#or

set.seed(1)
x<-13:24
y<-sample(x,1)
micewt$Bodyweight[y]
## [1] 25.34

A:25.34

Brief Introduction to dplyr

The learning curve for R syntax is slow. One of the more difficult aspects that requires some getting used to is subsetting data tables. The dplyr package brings these tasks closer to English, and we are therefore going to introduce two simple functions: one used to subset rows and the other to select columns.

Take a look at the dataset we read in:

filename <- "femaleMiceWeights.csv"
dat <- read.csv(filename)
head(dat) #In RStudio use View(dat)
##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79

There are two types of diets, which are denoted in the first column. If we want just the weights, we only need the second column. So if we want the weights for mice on the chow diet, we subset and filter like this:

library(dplyr) 
chow <- filter(dat, Diet=="chow") #keep only the ones with chow diet
head(chow)
##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79

And now we can select only the column with the values:

chowVals <- select(chow,Bodyweight)
head(chowVals)
##   Bodyweight
## 1      21.51
## 2      28.14
## 3      24.04
## 4      23.45
## 5      23.68
## 6      19.79

A nice feature of the dplyr package is that you can perform consecutive tasks by using what is called a “pipe”. In dplyr we use %>% to denote a pipe. This symbol tells the program to first do one thing and then do something else to the result of the first. Hence, we can perform several data manipulations in one line. For example:

chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight)

In the second task, we no longer have to specify the object we are editing since it is whatever comes from the previous call.

Also, note that if dplyr receives a data.frame it will return a data.frame.

class(dat)
## [1] "data.frame"
class(chowVals)
## [1] "data.frame"

For pedagogical reasons, we will often want the final result to be a simple numeric vector. To obtain such a vector with dplyr, we can apply the unlist function which turns lists, such as data.frames, into numeric vectors:

chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight) %>% unlist
class( chowVals )
## [1] "numeric"

To do this in R without dplyr the code is the following:

chowVals <- dat[ dat$Diet=="chow", colnames(dat)=="Bodyweight"]
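
If you want to convince yourself that the two approaches agree, note that the dplyr version returns a named vector, so we compare the values only (a quick sanity check):

dplyrVals <- filter(dat, Diet=="chow") %>% select(Bodyweight) %>% unlist
all.equal( unname(dplyrVals), chowVals )
## [1] TRUE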

dplyr exercises

For these exercises, we will use a new dataset related to mammalian sleep. This data is described here. Download the CSV file from this location.

We are going to read in this data, then test your knowledge of the key dplyr functions select and filter. We are also going to review two different classes: data frames and vectors.

Q:Read in the msleep_ggplot2.csv file with the function read.csv and use the function class to determine what type of object is returned.

#Import the data set

library(downloader)
url="https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
filename <- basename(url)
download(url,filename)

mammalian<-read.csv("msleep_ggplot2.csv")
class(mammalian)
## [1] "data.frame"

Q:Now use the filter function to select only the primates. How many animals in the table are primates? Hint: the nrow function gives you the number of rows of a data frame or matrix.

library(dplyr) 
head(mammalian)
##                         name      genus  vore        order conservation
## 1                    Cheetah   Acinonyx carni    Carnivora           lc
## 2                 Owl monkey      Aotus  omni     Primates         <NA>
## 3            Mountain beaver Aplodontia herbi     Rodentia           nt
## 4 Greater short-tailed shrew    Blarina  omni Soricomorpha           lc
## 5                        Cow        Bos herbi Artiodactyla domesticated
## 6           Three-toed sloth   Bradypus herbi       Pilosa         <NA>
##   sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
## 1        12.1        NA          NA  11.9      NA  50.000
## 2        17.0       1.8          NA   7.0 0.01550   0.480
## 3        14.4       2.4          NA   9.6      NA   1.350
## 4        14.9       2.3   0.1333333   9.1 0.00029   0.019
## 5         4.0       0.7   0.6666667  20.0 0.42300 600.000
## 6        14.4       2.2   0.7666667   9.6      NA   3.850
primates <- filter(mammalian, order=="Primates") #keep only the ones with Primates order
nrow(primates)
## [1] 12

A:12

Q:What is the class of the object you obtain after subsetting the table to only include primates?

class(primates)
## [1] "data.frame"

A:data.frame

Q:Now use the select function to extract the sleep (total) for the primates. What class is this object? Hint: use %>% to pipe the results of the filter function to select.

primateSleep<-filter(mammalian, order=="Primates") %>% select(sleep_total)
class(primateSleep)
## [1] "data.frame"

A:data.frame

Q:Now we want to calculate the average amount of sleep for primates (the average of the numbers computed above). One challenge is that the mean function requires a vector so, if we simply apply it to the output above, we get an error. Look at the help file for unlist and use it to compute the desired average.

#?unlist
y <- filter(mammalian, order=="Primates") %>% select(sleep_total) %>% unlist
mean( y )
## [1] 10.5
#or in one step

filter(mammalian, order=="Primates") %>% select(sleep_total) %>% unlist %>% mean
## [1] 10.5
#or
mean(unlist(primateSleep))
## [1] 10.5

A:10.5

Q:For the last exercise, we could also use the dplyr summarize function. We have not introduced this function, but you can read the help file and repeat exercise 5, this time using just filter and summarize to get the answer.

library(dplyr)
#?summarise
y <- filter(mammalian, order=="Primates") %>% select(sleep_total) %>% summarise(mean(sleep_total))
y
##   mean(sleep_total)
## 1              10.5
#or 

filter(mammalian, order=="Primates") %>% summarise( mean( sleep_total) )
##   mean(sleep_total)
## 1              10.5

Mathematical Notation

This book focuses on teaching statistical concepts and data analysis programming skills. We avoid mathematical notation as much as possible, but we do use it. We do not want readers to be intimidated by the notation though. Mathematics is actually the easier part of learning statistics. Unfortunately, many textbooks use mathematical notation in what we believe to be an over-complicated way. For this reason, we do try to keep the notation as simple as possible. However, we do not want to water down the material, and some mathematical notation facilitates a deeper understanding of the concepts. Here we describe a few specific symbols that we use often. If they appear intimidating to you, please take some time to read this section carefully as they are actually simpler than they seem. Because by now you should be somewhat familiar with R, we will make the connection between mathematical notation and R code.

Indexing

Those of us dealing with data almost always have a series of numbers. To describe the concepts in an abstract way, we use indexing. For example, 5 numbers:

x <- 1:5

can be generally represented like this \(x_1, x_2, x_3, x_4, x_5\). We use dots to simplify this \(x_1,\dots,x_5\) and indexing to simplify even more \(x_i, i=1,\dots,5\). If we want to describe a procedure for a list of any size \(n\), we write \(x_i, i=1,\dots,n\).

We sometimes have two indexes. For example, we may have several measurements (blood pressure, weight, height, age, cholesterol level) for 100 individuals. We can then use double indexes: \(x_{i,j}, i=1,\dots,100, j=1,\dots,5\).
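
In R, double-indexed data like this are naturally stored in a matrix, with x[i,j] playing the role of \(x_{i,j}\). A small sketch with made-up measurements:

x <- matrix(rnorm(100*5), nrow=100, ncol=5) ## 100 individuals, 5 measurements each
x[2,3] ## the third measurement for the second individual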

Summation

A very common operation in data analysis is to sum several numbers. This comes up, for example, when we compute averages and standard deviations. If we have many numbers, there is a mathematical notation that makes it quite easy to express the following:

n <- 1000
x <- 1:n
S <- sum(x)

and it is the \(\sum\) notation (capital S in Greek):

\[ S = \sum_{i=1}^n x_i \]

Note that we make use of indexing as well. We will see that what is included inside the summation can become quite complicated. However, the summation part should not confuse you as it is a simple operation.

Greek letters

We would prefer to avoid Greek letters, but they are ubiquitous in the statistical literature so we want you to become used to them. They are mainly used to distinguish the unknown from the observed. Suppose we want to find out the average height of a population and we take a sample of 1,000 people to estimate this. The unknown average we want to estimate is often denoted with \(\mu\), the Greek letter for m (m is for mean). The standard deviation is often denoted with \(\sigma\), the Greek letter for s. Measurement error or other unexplained random variability is typically denoted with \(\varepsilon\), the Greek letter for e. Effect sizes, for example the effect of a diet on weight, are typically denoted with \(\beta\). We may use other Greek letters but those are the most commonly used.

You should get used to these four Greek letters as you will be seeing them often: \(\mu\), \(\sigma\), \(\beta\) and \(\varepsilon\).

Note that indexing is sometimes used in conjunction with Greek letters to denote different groups. For example, if we have one set of numbers denoted with \(x\) and another with \(y\) we may use \(\mu_x\) and \(\mu_y\) to denote their averages.

Infinity

In the text we often talk about asymptotic results. Typically, this refers to an approximation that gets better and better as the number of data points we consider gets larger and larger, with perfect approximations occurring when the number of data points is \(\infty\). In practice, there is no such thing as \(\infty\), but it is a convenient concept to understand. One way to think about asymptotic results is as results that become better and better as some number increases and we can pick a number so that a computer can’t tell the difference between the approximation and the real number. Here is a very simple example that approximates 1/3 with decimals:

onethird <- function(n) sum( 3/10^c(1:n))
1/3 - onethird(4)
## [1] 3.333333e-05
1/3 - onethird(10)
## [1] 3.333334e-11
1/3 - onethird(16)
## [1] 0

In the example above, 16 is practically \(\infty\).

Integrals

We only use these a couple of times so you can skip this section if you prefer. However, integrals are actually much simpler to understand than perhaps you realize.

For certain statistical operations, we need to figure out areas under the curve. For example, consider a function \(f(x)\) plotted with the region between 2 and 4 shaded grey: we need to know what proportion of the total area under the curve is grey.

The grey area can be thought of as many small grey bars stacked next to each other. The area is then just the sum of the areas of these little bars. The problem is that we can’t do this for every number between 2 and 4 because there are an infinite number. The integral is the mathematical solution to this problem. In this case, the total area is 1 so the answer to what proportion is grey is the following integral:

\[ \int_2^4 f(x) \, dx \]

Because we constructed this example, we know that the grey area is 2.27% of the total. Note that this is very well approximated by an actual sum of little bars (here we take \(f\) to be the standard normal density dnorm, which reproduces the 2.27% figure):

f <- dnorm ## assumed: the standard normal density, consistent with the 2.27% above
width <- 0.01
x <- seq(2,4,width)
areaofbars <-  f(x)*width
sum( areaofbars )
## [1] 0.02298998

The smaller we make width, the closer the sum gets to the integral, which is equal to the area.
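
R can also compute this integral numerically with the built-in integrate function; using f as defined above, it gives essentially the exact grey area:

integrate(f, 2, 4)$value ## about 0.0227, which the sum of bars approximates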

Week 2: Random Variables and Probability Distributions

Introduction to Random Variables

Inference

Introduction

This chapter introduces the statistical concepts necessary to understand p-values and confidence intervals. These terms are ubiquitous in the life science literature. Let’s use this paper as an example.

Note that the abstract has this statement:

“Body weight was higher in mice fed the high-fat diet already after the first week, due to higher dietary intake in combination with lower metabolic efficiency.”

To support this claim they provide the following in the results section:

“Already during the first week after introduction of high-fat diet, body weight increased significantly more in the high-fat diet-fed mice (\(+\) 1.6 \(\pm\) 0.1 g) than in the normal diet-fed mice (\(+\) 0.2 \(\pm\) 0.1 g; P < 0.001).”

What does P < 0.001 mean? What are the \(\pm\) included? We will learn what this means and learn to compute these values in R. The first step is to understand random variables. To do this, we will use data from a mouse database (provided by Karen Svenson via Gary Churchill and Dan Gatti and partially funded by P50 GM070683). We will import the data into R and explain random variables and null distributions using R programming.

If you already downloaded the femaleMiceWeights file into your working directory, you can read it into R with just one line:

dat <- read.csv("femaleMiceWeights.csv")

Remember that a quick way to read the data, without downloading it, is by using the URL:

url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
dat <- read.csv(url)

Our first look at data

We are interested in determining if following a given diet makes mice heavier after several weeks. This data was produced by ordering 24 mice from The Jackson Laboratory and randomly assigning either a chow or high-fat (hf) diet. After several weeks, the scientists weighed each mouse and obtained this data (head just shows us the first 6 rows):

head(dat) 
##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79

In RStudio, you can view the entire dataset with:

View(dat)

So are the hf mice heavier? Mouse 24 at 20.73 grams is one of the lightest mice, while Mouse 21 at 34.02 grams is one of the heaviest. Both are on the hf diet. Just from looking at the data, we see there is variability. Claims such as the one above usually refer to the averages. So let’s look at the average of each group:

library(dplyr)
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
print( mean(treatment) )
## [1] 26.83417
print( mean(control) )
## [1] 23.81333
obsdiff <- mean(treatment) - mean(control)
print(obsdiff)
## [1] 3.020833
So the hf diet mice are about 13% heavier on average. Are we done? Why do we need p-values and confidence intervals? The reason is that these averages are random variables. They can take many values.

If we repeat the experiment, we obtain 24 new mice from The Jackson Laboratory and, after randomly assigning them to each diet, we get a different mean. Every time we repeat this experiment, we get a different value. We call this type of quantity a random variable.

Random Variables

Let’s explore random variables further. Imagine that we actually have the weight of all control female mice and can load them into R. In statistics, we refer to this as the population. These are all the control mice available from which we sampled 24. Note that in practice we do not have access to the population. We have a special data set that we are using here to illustrate concepts.

The first step is to download the data from here into your working directory and then read it into R:

population <- read.csv("femaleControlsPopulation.csv")
##use unlist to turn it into a numeric vector
population <- unlist(population) 

Now let’s sample 12 mice three times and see how the average changes.

control <- sample(population,12)
mean(control)
## [1] 23.81333
control <- sample(population,12)
mean(control)
## [1] 23.77083
control <- sample(population,12)
mean(control)
## [1] 24.18667

Note how the average varies. We can continue to do this repeatedly and start learning something about the distribution of this random variable.
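
As a sketch of doing this repeatedly, the replicate function (covered in the tutorials above) generates many sample averages in one line:

averages <- replicate(1000, mean( sample(population,12) ))
hist(averages) ## the distribution of this random variable starts to take shape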

Random Variables Exercises

For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )
head(x)
## Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6 
##       27.03       24.80       27.02       28.07       23.55       22.72

Here x represents the weights for the entire population.

Q:What is the average of these weights?

mean(x)
## [1] 23.89338

A:23.89338

Q:After setting the seed at 1 with set.seed(1), take a random sample of size 5. What is the absolute value (use abs) of the difference between the average of the sample and the average of all the values?

set.seed(1)
X <- sample(x,5) ## avoid naming the result "sample", which shadows the function
abs(mean(X)-mean(x))
## [1] 0.2706222

Q:After setting the seed at 5 with set.seed(5), take a random sample of size 5. What is the absolute value of the difference between the average of the sample and the average of all the values?

set.seed(5)
X <- sample(x,5)
abs(mean(X)-mean(x))
## [1] 1.433378

Q:Why are the answers from 2 and 3 different? A) Because we made a coding mistake. B) Because the average of the x is random. C) Because the average of the samples is a random variable. D) All of the above.

A:Because the average of the samples is a random variable.

The Null Hypothesis

Now let’s go back to our average difference of obsdiff. As scientists we need to be skeptics. How do we know that this obsdiff is due to the diet? What happens if we give all 24 mice the same diet? Will we see a difference this big? Statisticians refer to this scenario as the null hypothesis. The name “null” is used to remind us that we are acting as skeptics: we give credence to the possibility that there is no difference.

Because we have access to the population, we can actually observe as many values as we want of the difference of the averages when the diet has no effect. We can do this by randomly sampling 24 control mice, giving them the same diet, and then recording the difference in mean between two randomly split groups of 12 and 12. Here is this process written in R code:

population <- read.csv("femaleControlsPopulation.csv")
##use unlist to turn it into a numeric vector
population <- unlist(population)

##12 control mice
control <- sample(population,12)
##another 12 control mice that we act as if they were not
treatment <- sample(population,12)
print(mean(treatment) - mean(control))
## [1] 1.490833

Now let’s do it 10,000 times. We will use a “for-loop”, an operation that lets us automate this (a simpler approach that, we will learn later, is to use replicate).

n <- 10000
null <- vector("numeric",n)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  null[i] <- mean(treatment) - mean(control)
}

The values in null form what we call the null distribution. We will define this more formally below.
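
As an aside, here is the same computation written with replicate, the simpler approach mentioned above:

null <- replicate(n, {
  control <- sample(population,12)
  treatment <- sample(population,12)
  mean(treatment) - mean(control)
})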

So what percent of the 10,000 are bigger than obsdiff?

library(dplyr)
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
print( mean(treatment) )
## [1] 26.83417
print( mean(control) )
## [1] 23.81333
obsdiff <- mean(treatment) - mean(control)
print(obsdiff)
## [1] 3.020833
mean(null >= obsdiff) #one tailed p value
## [1] 0.0129
mean(abs(null)>= obsdiff) # two tailed p value
## [1] 0.026

Only a small percent of the 10,000 simulations produce a difference as big as obsdiff. As skeptics, what do we conclude? When there is no diet effect, we see a difference as big as the one we observed only about 1.3% of the time (the one-tailed proportion computed above). This is what is known as a p-value, which we will define more formally later in the book.

Null distributions Exercises

Q:Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of x?

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )
head(x)
## Bodyweight1 Bodyweight2 Bodyweight3 Bodyweight4 Bodyweight5 Bodyweight6 
##       27.03       24.80       27.02       28.07       23.55       22.72
set.seed(1) #so that we get same results
n <- 1000
averages5 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,5)
  averages5[i] <- mean(X)
}
hist(averages5) ##take a look

mean( abs( averages5 - mean(x) ) > 1)
## [1] 0.498

A:0.498

Q:We are now going to increase the number of times we redo the sample from 1,000 to 10,000. Set the seed at 1, then using a for-loop take a random sample of 5 mice 10,000 times. Save these averages. What proportion of these 10,000 averages are more than 1 gram away from the average of x?

set.seed(1) #so that we get same results
n <- 10000
averages5 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,5)
  averages5[i] <- mean(X)
}
hist(averages5) ##take a look

mean( abs( averages5 - mean(x) ) > 1)
## [1] 0.4976

A:0.4976

Q:Note that the answers to 4 and 5 barely changed. This is expected. The way we think about a random variable’s distribution is as the distribution of the list of values we would obtain if we repeated the experiment an infinite number of times. On a computer, we can’t perform an infinite number of iterations so instead, for our examples, we consider 1,000 to be large enough, and thus 10,000 is as well. Now if instead we change the sample size, then we change the random variable and thus its distribution.

Set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of x?

set.seed(1) #so that we get same results
n <- 1000
averages50 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,50)
  averages50[i] <- mean(X)
}
hist(averages50) ##take a look

mean( abs( averages50 - mean(x) ) > 1)
## [1] 0.019

Use a histogram to “look” at the distribution of averages we get with a sample size of 5 and a sample size of 50. How would you say they differ? A) They are actually the same. B) They both look roughly normal, but with a sample size of 50 the spread is smaller. C) They both look roughly normal, but with a sample size of 50 the spread is larger. D) The second distribution does not look normal at all.

par(mfrow=c(2,1))
hist(averages5)
hist(averages50)

A:They both look roughly normal, but with a sample size of 50 the spread is smaller.

Distributions

We have explained what we mean by null in the context of null hypothesis, but what exactly is a distribution? The simplest way to think of a distribution is as a compact description of many numbers. For example, suppose you have measured the heights of all men in a population. Imagine you need to describe these numbers to someone that has no idea what these heights are, such as an alien that has never visited Earth. Suppose all these heights are contained in the following dataset:

data(father.son,package="UsingR")
x <- father.son$fheight

One approach to summarizing these numbers is to simply list them all out for the alien to see. Here are 10 randomly selected heights out of the 1,078:

round(sample(x,10),1)
##  [1] 68.9 65.1 69.0 70.0 65.8 65.9 69.9 68.3 67.1 64.9

Cumulative Distribution Function

Scanning through these numbers, we start to get a rough idea of what the entire list looks like, but it is certainly inefficient. We can quickly improve on this approach by defining and visualizing a distribution. To define a distribution we compute, for all possible values of \(a\), the proportion of numbers in our list that are below \(a\). We use the following notation:

\[ F(a) \equiv \mbox{Pr}(x \leq a) \]

This is called the cumulative distribution function (CDF). When the CDF is derived from data, as opposed to theoretically, we also call it the empirical CDF (ECDF). The ECDF for the height data looks like this:

Empirical cumulative distribution function for height.
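
In R, the ECDF can be computed with the built-in ecdf function, which returns not a value but a function that you can evaluate at any \(a\):

F <- ecdf(x)
F(70) ## proportion of heights less than or equal to 70 inches
plot(F) ## reproduces the ECDF plot above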

Histograms

Although the empirical CDF concept is widely discussed in statistics textbooks, the plot is actually not very popular in practice. The reason is that histograms give us the same information and are easier to interpret. Histograms show us the proportion of values in intervals:

\[ \mbox{Pr}(a \leq x \leq b) = F(b) - F(a) \]

Plotting these heights as bars is what we call a histogram. It is a more useful plot because we are usually more interested in intervals, for example what percent of values are between 70 and 71 inches, rather than in the percent less than a particular height. It is also easier to distinguish different types (families) of distributions by looking at histograms. Here is a histogram of heights:

hist(x)

We can specify the bins and add better labels in the following way:

bins <- seq( floor(min(x)), ceiling(max(x)) ) ## one bin per inch, from the smallest to the largest height
hist(x,breaks=bins,xlab="Height (in inches)",main="Adult men heights")
Histogram for heights.

Showing this plot to the alien is much more informative than showing numbers. With this simple plot, we can approximate the number of individuals in any given interval. For example, there are about 70 individuals over six feet (72 inches) tall.

Probability Distribution

Summarizing lists of numbers is one powerful use of distributions. An even more important use is describing the possible outcomes of a random variable. Unlike a fixed list of numbers, we don’t actually observe all possible outcomes of random variables, so instead of describing proportions, we describe probabilities. For instance, if we pick a random height from our list, then the probability of it falling between \(a\) and \(b\) is denoted with:

\[ \mbox{Pr}(a \leq X \leq b) = F(b) - F(a) \]

Note that the \(X\) is now capitalized to distinguish it as a random variable and that the equation above defines the probability distribution of the random variable. Knowing this distribution is incredibly useful in science. For example, in the case above, if we know the distribution of the difference in mean of mouse weights when the null hypothesis is true, referred to as the null distribution, we can compute the probability of observing a value as large as we did, referred to as a p-value. In a previous section we ran what is called a Monte Carlo simulation (we will provide more details on Monte Carlo simulation in a later section) and we obtained 10,000 outcomes of the random variable under the null hypothesis. Let’s repeat the loop above, but this time let’s add a point to the figure every time we re-run the experiment. If you run this code, you can see the null distribution forming as the observed values stack on top of each other.

n <- 100
library(rafalib)
nullplot(-5,5,1,30, xlab="Observed differences (grams)", ylab="Frequency")
totals <- vector("numeric",11)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  nulldiff <- mean(treatment) - mean(control)
  j <- pmax(pmin(round(nulldiff)+6,11),1)
  totals[j] <- totals[j]+1
  text(j-6,totals[j],pch=15,round(nulldiff,1))
  ##if(i < 15) Sys.sleep(1) ##You can add this line to see values appear slowly
  }
Illustration of the null distribution.

The figure above amounts to a histogram. From a histogram of the null vector we calculated earlier, we can see that values as large as obsdiff are relatively rare:

hist(null, freq=TRUE)
abline(v=obsdiff, col="red", lwd=2)
Null distribution with observed difference marked with vertical red line.

An important point to keep in mind here is that while we defined \(\mbox{Pr}(a)\) by counting cases, we will learn that, in some circumstances, mathematics gives us formulas for \(\mbox{Pr}(a)\) that save us the trouble of computing them as we did here. One example of this powerful approach uses the normal distribution approximation.

Probability Distribution Exercises

In the video you just watched, Rafa looked at distributions of heights, and asked what was the probability of someone being shorter than a given height. In this assessment, we are going to ask the same question, but instead of people and heights, we are going to look at whole countries and the average life expectancy in those countries.

We will use the data set called “Gapminder”, which is available as an R package on GitHub. This data set contains the life expectancy, GDP per capita, and population by country, every five years, from 1952 to 2007. It is an excerpt of a larger and more comprehensive set of data available on Gapminder.org, and the R package of this dataset was created by the statistics professor Jennifer Bryan.

First, install the gapminder data using:

install.packages("gapminder")

Next, load the gapminder data set. To find out more information about the data set, use ?gapminder which will bring up a help file. To return the first few lines of the data set, use the function head().

#install.packages("gapminder")library(gapminder)
#?gapminder
library(gapminder)
data(gapminder)
head(gapminder)
## Source: local data frame [6 x 6]
## 
##       country continent  year lifeExp      pop gdpPercap
##        (fctr)    (fctr) (int)   (dbl)    (int)     (dbl)
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134

Create a vector ‘x’ of the life expectancies of each country for the year 1952. Plot a histogram of these life expectancies to see the spread of the different countries.

library(dplyr)
x<-filter(gapminder,year==1952) %>% select(lifeExp) %>% unlist
head(x)
## lifeExp1 lifeExp2 lifeExp3 lifeExp4 lifeExp5 lifeExp6 
##   28.801   55.230   43.077   30.015   62.485   69.120
#or in a conventional way

dat1952 = gapminder[ gapminder$year == 1952, ]

x = dat1952$lifeExp

Probability Distributions Exercises #1

Q:In statistics, the empirical cumulative distribution function (or empirical cdf or empirical distribution function) is the function F(a) for any a, which tells you the proportion of the values which are less than or equal to a.

We can compute F in two ways: the simplest way is to type mean(x <= a). This calculates the number of values in x which are less than or equal to a, divided by the total number of values in x, in other words the proportion of values less than or equal to a.

The second way, which is a bit more complex for beginners, is to use the ecdf() function. This is a bit complicated because ecdf() doesn’t return a value, but rather a function.

Let’s continue, using the simpler mean() approach.

What is the proportion of countries in 1952 that have a life expectancy less than or equal to 40?

mean(x<=40)
## [1] 0.2887324

A:0.2887324

Probability Distributions Exercises #2

Q:What is the proportion of countries in 1952 that have a life expectancy between 40 and 60 years? Hint: this is the proportion that have a life expectancy less than or equal to 60 years, minus the proportion that have a life expectancy less than or equal to 40 years.

mean(x<=60) - mean(x<=40)
## [1] 0.4647887

A:0.4647887

sapply() on a Custom Function

Suppose we want to plot the proportions of countries with life expectancy less than or equal to ‘q’ for a range of different values. R has a built-in function for this, plot(ecdf(x)), but suppose we didn’t know that. The function is quite easy to build by turning the code from question 1.1 into a custom function and then using sapply(). Our custom function will take an input variable ‘q’ and return the proportion of countries in ‘x’ less than or equal to q. The curly brackets, { and }, allow us to write an R function which spans multiple lines:

prop = function(q) {
  mean(x <= q)
}

#Try this out for a value of 'q':  prop(40)
prop(40)
## [1] 0.2887324

Now let’s build a range of q’s that we can apply the function to:

qs = seq(from=min(x), to=max(x), length=20)
qs
##  [1] 28.80100 31.10989 33.41879 35.72768 38.03658 40.34547 42.65437
##  [8] 44.96326 47.27216 49.58105 51.88995 54.19884 56.50774 58.81663
## [15] 61.12553 63.43442 65.74332 68.05221 70.36111 72.67000

Print ‘qs’ to the R console to see what the seq() function gave us. Now we can use sapply() to apply the ‘prop’ function to each element of ‘qs’:

props = sapply(qs, prop)
props
##  [1] 0.007042254 0.028169014 0.063380282 0.105633803 0.204225352
##  [6] 0.288732394 0.408450704 0.492957746 0.535211268 0.577464789
## [11] 0.626760563 0.640845070 0.683098592 0.718309859 0.774647887
## [16] 0.809859155 0.852112676 0.922535211 0.964788732 1.000000000
plot(qs, props)

#Note that we could also have written this in one line, by defining the 'prop' function but without naming it:
props = sapply(qs, function(q) mean(x <= q))
props
##  [1] 0.007042254 0.028169014 0.063380282 0.105633803 0.204225352
##  [6] 0.288732394 0.408450704 0.492957746 0.535211268 0.577464789
## [11] 0.626760563 0.640845070 0.683098592 0.718309859 0.774647887
## [16] 0.809859155 0.852112676 0.922535211 0.964788732 1.000000000

This last style is called using an “inline” function or an “anonymous” function. Let’s compare our homemade plot with the pre-built one in R:

plot(ecdf(x))

Central Limit Theorem

Normal Distribution

The probability distribution we see above approximates one that is very common in nature: the bell curve, also known as the normal distribution or Gaussian distribution. When the histogram of a list of numbers approximates the normal distribution, we can use a convenient mathematical formula to approximate the proportion of values or outcomes in any given interval:

\[ \mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx \]

While the formula may look intimidating, don’t worry, you will never actually have to type it out, as it is stored in a more convenient form (as pnorm in R which sets a to \(-\infty\), and takes b as an argument).

Here \(\mu\) and \(\sigma\) are referred to as the mean and the standard deviation of the population (we explain these in more detail in another section). If this normal approximation holds for our list, then the population mean and variance of our list can be used in the formula above. An example of this would be when we noted above that only about 1.3% of values on the null distribution were above obsdiff. We can compute the proportion of values below a value x with pnorm(x,mu,sigma) without knowing all the values. The normal approximation works very well here:

n <- 10000
null <- vector("numeric",n)
for (i in 1:n) {
  control <- sample(population,12)
  treatment <- sample(population,12)
  null[i] <- mean(treatment) - mean(control)
}
1 - pnorm(obsdiff,mean(null),sd(null)) 
## [1] 0.0144292

Later, we will learn that there is a mathematical explanation for this. A very useful characteristic of this approximation is that one only needs to know \(\mu\) and \(\sigma\) to describe the entire distribution. From this, we can compute the proportion of values in any interval.

Summary

So computing a p-value for the difference in diet for the mice was pretty easy, right? But why are we not done? To make the calculation, we did the equivalent of buying all the mice available from The Jackson Laboratory and performing our experiment repeatedly to define the null distribution. Yet this is not something we can do in practice. Statistical Inference is the mathematical theory that permits you to approximate this with only the data from your sample, i.e. the original 24 mice. We will focus on this in the following sections.

Setting the random seed

Before we continue, we briefly explain the following important line of code:

set.seed(1) 

Throughout this book, we use random number generators. This implies that many of the results presented can actually change by chance, including the correct answer to problems. One way to ensure that results do not change is by setting R’s random number generation seed. For more on the topic please read the help file:

?set.seed
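
For example, setting the same seed before each call guarantees identical draws:

set.seed(1)
s1 <- sample(10,3)
set.seed(1)
s2 <- sample(10,3)
identical(s1, s2)
## [1] TRUE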

Normal Distribution Exercises

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )

Here x represents the weights for the entire population.

Using the same process as before (in Null Distribution Exercises), set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. After that, set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages.

set.seed(1) #so that we get same results
n <- 1000
averages5 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,5)
  averages5[i] <- mean(X)
}
hist(averages5) ##take a look

mean( abs( averages5 - mean(x) ) > 1)
## [1] 0.498
set.seed(1) #so that we get same results
n <- 1000
averages50 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,50)
  averages50[i] <- mean(X)
}
hist(averages50) ##take a look

mean( abs( averages50 - mean(x) ) > 1)
## [1] 0.019

Normal Distribution Exercises #1

Q:Use a histogram to “look” at the distribution of averages we get with a sample size of 5 and a sample size of 50. How would you say they differ?

library(rafalib) 
###mypar(1,2) is optional; it is used to put both plots on one page
mypar(1,2)
hist(averages5, xlim=c(18,30))
hist(averages50, xlim=c(18,30))

A:They both look roughly normal, but with a sample size of 50 the spread is smaller.

Normal Distribution Exercises #2

Q:For the last set of averages, the ones obtained from a sample size of 50, what proportion are between 23 and 25?

mean( averages50 < 25 & averages50 > 23)
## [1] 0.976
#or

pnorm(25,mean(averages50),sd(averages50))-pnorm(23,mean(averages50),sd(averages50))
## [1] 0.9767305

A:0.976

Normal Distribution Exercises #3

Q:Now ask the same question of a normal distribution with average 23.9 and standard deviation 0.43.

pnorm(25,23.9,0.43)-pnorm(23,23.9,0.43)
## [1] 0.9765648
#or
pnorm( (25-23.9) / 0.43)  - pnorm( (23-23.9) / 0.43) 
## [1] 0.9765648

A: 0.9765648

The answers to both questions were very similar. This is because we can approximate the distribution of the sample average with a normal distribution. We will learn more about the reason for this next.

Populations, parameters, and sample estimates

Now that we have introduced the idea of a random variable, a null distribution, and a p-value, we are ready to describe the mathematical theory that permits us to compute p-values in practice. We will also learn about confidence intervals and power calculations.

Population parameters

A first step in statistical inference is to understand what population you are interested in. In the mouse weight example, we have two populations: female mice on control diets and female mice on high fat diets, with weight being the outcome of interest. We consider this population to be fixed, and the randomness comes from the sampling. One reason we have been using this dataset as an example is because we happen to have the weights of all the mice of this type. We download this file to our working directory and read it into R:

dat <- read.csv("mice_pheno.csv")

We can then access the population values and determine, for example, how many we have. Here we compute the size of the control population which are mice on control diet:

library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>% 
  select(Bodyweight) %>% unlist
length(controlPopulation)
## [1] 225

We usually denote these values as \(x_1,\dots,x_m\). In this case, \(m\) is the number computed above. We can do the same for the high fat diet population:

hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist
length(hfPopulation)
## [1] 200

and denote with \(y_1,\dots,y_n\).

We can then define summaries of interest for these populations, such as the mean and variance.

The mean:

\[\mu_X = \frac{1}{m}\sum_{i=1}^m x_i \mbox{ and } \mu_Y = \frac{1}{n} \sum_{i=1}^n y_i\]

The variance:

\[\sigma_X^2 = \frac{1}{m}\sum_{i=1}^m (x_i-\mu_X)^2 \mbox{ and } \sigma_Y^2 = \frac{1}{n} \sum_{i=1}^n (y_i-\mu_Y)^2\]

with the standard deviation being the square root of the variance. We refer to such quantities that can be obtained from the population as population parameters. The question we started out asking can now be written mathematically: is \(\mu_Y - \mu_X = 0\) ?

Although in our illustration we have all the values and can check if this is true, in practice we do not. For example, in practice it would be prohibitively expensive to buy all the mice in a population. Here we learn how taking a sample permits us to answer our questions. This is the essence of statistical inference.
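Since in this illustration we do have all the values, we can check the question directly (a quick sketch using the two population vectors defined above; the printed difference matches the one computed in the CLT in Practice section below):

mu_x <- mean(controlPopulation)
mu_y <- mean(hfPopulation)
mu_y - mu_x
## [1] 2.375517

So for the full populations the difference is not 0; the inference problem is deciding this from samples alone.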

Sample estimates

In the previous chapter, we obtained samples of 12 mice from each population. We represent data from samples with capital letters to indicate that they are random. This is common practice in statistics, although it is not always followed. So the samples are \(X_1,\dots,X_M\) and \(Y_1,\dots,Y_N\) and, in this case, \(N=M=12\). In contrast and as we saw above, when we list out the values of the population, which are set and not random, we use lower-case letters.

Since we want to know if \(\mu_Y - \mu_X\) is 0, we consider the sample version: \(\bar{Y}-\bar{X}\) with:

\[ \bar{X}=\frac{1}{M} \sum_{i=1}^M X_i \mbox{ and }\bar{Y}=\frac{1}{N} \sum_{i=1}^N Y_i. \]

Note that this difference of averages is also a random variable. Previously, we learned about the behavior of random variables with an exercise that involved repeatedly sampling from the original distribution. Of course, this is not an exercise that we can execute in practice. In this particular case it would involve buying 24 mice over and over again. Here we describe the mathematical theory that relates \(\bar{X}\) to \(\mu_X\) and \(\bar{Y}\) to \(\mu_Y\), which will in turn help us understand the relationship between \(\bar{Y}-\bar{X}\) and \(\mu_Y - \mu_X\). Specifically, we will describe how the Central Limit Theorem permits us to use an approximation to answer this question, as well as motivate the widely used t-distribution.

Population, Samples, and Estimates Exercises

For these exercises, we will be using the following dataset:

dat <- read.csv("mice_pheno.csv")
summary(dat)
##  Sex       Diet       Bodyweight   
##  F:425   chow:449   Min.   :15.51  
##  M:421   hf  :397   1st Qu.:24.04  
##                     Median :28.19  
##                     Mean   :28.85  
##                     3rd Qu.:32.95  
##                     Max.   :54.08  
##                     NA's   :5
#We will remove the lines that contain missing values:
dat <- na.omit( dat )

Population, Samples, and Estimates Exercises #1

Q:Use dplyr to create a vector x with the body weight of all males on the control (chow) diet. What is this population’s average?

library(dplyr)
x<- filter(dat,Sex == "M" & Diet == "chow") %>% 
  select(Bodyweight) %>% unlist
length(x)
## [1] 223
mean(x)
## [1] 30.96381

A:30.96381

Population, Samples, and Estimates Exercises #2

Q:Now use the popsd function from the rafalib package to compute the population standard deviation.

library(rafalib)
popsd(x)
## [1] 4.420501

A:4.420501

Population, Samples, and Estimates Exercises #3

Q:Set the seed at 1. Take a random sample X of size 25 from x. What is the sample average?

set.seed(1)
X<- sample(x,25)
mean(X)
## [1] 32.0956

A:32.0956

Population, Samples, and Estimates Exercises #4

Q:Use dplyr to create a vector y with the body weight of all males on the high fat (hf) diet. What is this population’s average?

library(dplyr)
y<- filter(dat,Sex == "M" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist
length(y)
## [1] 193
mean(y)
## [1] 34.84793

A:34.84793

Population, Samples, and Estimates Exercises #5

Q:Now use the popsd function from the rafalib package to compute the population standard deviation.

library(rafalib)
popsd(y)
## [1] 5.574609

A:5.574609

Population, Samples, and Estimates Exercises #6

Q:Set the seed at 1. Take a random sample Y of size 25 from y. What is the sample average?

set.seed(1)
Y<- sample(y,25)
mean(Y)
## [1] 34.768

A: 34.768

Population, Samples, and Estimates Exercises #7

Q:What is the difference in absolute value between \(\bar{y}-\bar{x}\) and \(\bar{X}-\bar{Y}\)?

a <- mean(y) - mean(x)        ##population difference
b <- abs( mean(X) - mean(Y) ) ##absolute difference of sample averages
a - b
## [1] 1.211716
#or
abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )
## [1] 1.211716

A:1.211716

Population, Samples, and Estimates Exercises #8

Q:Repeat the above for females. Make sure to set the seed to 1 before each sample call. What is the difference in absolute value between \(\bar{y}-\bar{x}\) and \(\bar{X}-\bar{Y}\)?

library(dplyr)
x<- filter(dat,Sex == "F" & Diet == "chow") %>% 
  select(Bodyweight) %>% unlist
mean(x)
## [1] 23.89338
set.seed(1)
X<- sample(x,25)
mean(X)
## [1] 23.1692
y<- filter(dat,Sex == "F" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist
mean(y)
## [1] 26.2689
set.seed(1)
Y<- sample(y,25)
mean(Y)
## [1] 26.2812
abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )
## [1] 0.7364828

A:0.7364828

Population, Samples, and Estimates Exercises #9

Q:For the females, our sample estimates were closer to the population difference than with males. What is a possible explanation for this?

The population variance of the females is smaller than that of the males; thus, the sample variable has less variability.

Statistical estimates are more precise for females.

The sample size was larger for females.

The sample size was smaller for females.

popsd(x)
## [1] 3.416438
popsd(y)
## [1] 5.06987

A:The population variance of the females is smaller than that of the males; thus, the sample variable has less variability.

Central Limit Theorem (CLT)

Below we will discuss the Central Limit Theorem (CLT) and the t-distribution, both of which help us make important calculations related to probabilities. Both are frequently used in science to test statistical hypotheses. To use them, we have to make different assumptions: large samples for the CLT, and an approximately normal population for the t-distribution. However, if the assumptions hold, then we are able to calculate the exact probabilities of events through the use of mathematical formulas.

Central Limit Theorem

The CLT is one of the most frequently used mathematical results in science. It tells us that when the sample size is large, the average \(\bar{Y}\) of a random sample follows a normal distribution centered at the population average \(\mu_Y\) and with standard deviation equal to the population standard deviation \(\sigma_Y\), divided by the square root of the sample size \(N\). We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error.

Please note that if we subtract a constant from a random variable, the mean of the new random variable shifts by that constant. Mathematically, if \(X\) is a random variable with mean \(\mu\) and \(a\) is a constant, the mean of \(X - a\) is \(\mu-a\). A similarly intuitive result holds for multiplication and the standard deviation (SD). If \(X\) is a random variable with mean \(\mu\) and SD \(\sigma\), and \(a\) is a constant, then the mean and SD of \(aX\) are \(a \mu\) and \(\mid a \mid \sigma\) respectively. To see how intuitive this is, imagine that we subtract 10 grams from each of the mice weights. The average weight should also drop by that much. Similarly, if we change the units from grams to milligrams by multiplying by 1000, then the spread of the numbers becomes larger.
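These two properties are easy to confirm with a quick simulation (a sketch with arbitrary values; the numbers here are for illustration only, not from our dataset):

set.seed(1)
X <- rnorm(100000, mean=25, sd=4) ##an arbitrary "weight-like" random variable
mean(X - 10)  ##shifts the mean: close to 25 - 10 = 15
sd(1000 * X)  ##scales the SD: close to 1000 * 4 = 4000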

This implies that if we take many samples of size \(N\), then the quantity:

\[ \frac{\bar{Y} - \mu}{\sigma_Y/\sqrt{N}} \]

is approximated with a normal distribution centered at 0 and with standard deviation 1.

Now we are interested in the difference between two sample averages. Here again a mathematical result helps. If we have two random variables \(X\) and \(Y\) with means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma_X^2\) and \(\sigma_Y^2\) respectively, then we have the following result: the mean of the sum \(Y + X\) is the sum of the means \(\mu_Y + \mu_X\). Using one of the facts we mentioned earlier, the mean of \(Y - X\) is the mean of \(Y + aX\) with \(a = -1\), which is \(\mu_Y + a\mu_X = \mu_Y - \mu_X\). This is intuitive. However, the next result is perhaps not as intuitive. If \(X\) and \(Y\) are independent of each other, as they are in our mouse example, then the variance (SD squared) of \(Y + X\) is the sum of the variances \(\sigma_Y^2 + \sigma_X^2\). This implies that the variance of the difference \(Y - X\) is the variance of \(Y + aX\) with \(a = -1\), which is \(\sigma^2_Y + a^2 \sigma_X^2 = \sigma^2_Y + \sigma_X^2\). So the variance of the difference is also the sum of the variances. If this seems like a counterintuitive result, remember that if \(X\) and \(Y\) are independent of each other, the sign does not really matter. It can be considered random: if \(X\) is normal with a certain variance, for example, so is \(-X\). Finally, another useful result is that the sum of normal variables is again normal.
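A short simulation confirms the less intuitive result about the variance of a difference of independent variables (a sketch with arbitrary SDs chosen for illustration):

set.seed(1)
X <- rnorm(100000, 0, 3) ##independent of Y
Y <- rnorm(100000, 0, 4)
var(Y - X) ##close to 3^2 + 4^2 = 25
var(Y + X) ##also close to 25: the sign does not matter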

All this math is very helpful for the purposes of our study because we have two sample averages and are interested in the difference. Because both are normal, the difference is normal as well, and the variance (the standard deviation squared) is the sum of the two variances. Under the null hypothesis that there is no difference between the population averages, the difference between the sample averages \(\bar{Y}-\bar{X}\), with \(\bar{X}\) and \(\bar{Y}\) the sample average for the two diets respectively, is approximated by a normal distribution centered at 0 (there is no difference) and with standard deviation \(\sqrt{\sigma_X^2 +\sigma_Y^2}/\sqrt{N}\).

This suggests that this ratio:

\[ \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{\sigma_X^2}{M} + \frac{\sigma_Y^2}{N}}} \]

is approximated by a normal distribution centered at 0 and standard deviation 1. Using this approximation makes computing p-values simple because we know the proportion of the distribution under any value. For example, only 5% of these values are larger than 2 (in absolute value):

pnorm(-2) + (1 - pnorm(2))
## [1] 0.04550026

We don’t need to buy more mice: 12 and 12 suffice.

However, we can’t claim victory just yet because we don’t know the population standard deviations: \(\sigma_X\) and \(\sigma_Y\). These are unknown population parameters, but we can get around this by using the sample standard deviations, call them \(s_X\) and \(s_Y\). These are defined as:

\[ s_X^2 = \frac{1}{M-1} \sum_{i=1}^M (X_i - \bar{X})^2 \mbox{ and } s_Y^2 = \frac{1}{N-1} \sum_{i=1}^N (Y_i - \bar{Y})^2 \]

Note that we are dividing by \(M-1\) and \(N-1\), instead of by \(M\) and \(N\). There is a theoretical reason for doing this which we do not explain here. But to get an intuition, think of the case when you just have 2 numbers. The average distance to the mean is basically 1/2 the difference between the two numbers, so you really just have information from one number. This is somewhat of a minor point. The main point is that \(s_X\) and \(s_Y\) serve as estimates of \(\sigma_X\) and \(\sigma_Y\).
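The two-number intuition is easy to check in R, since var and sd implement exactly these formulas (dividing by the sample size minus one); a quick sketch:

X <- c(2, 4)                           ##just two numbers
sum((X - mean(X))^2) / (length(X) - 1) ##divide by 2 - 1 = 1
## [1] 2
var(X)                                 ##same formula
## [1] 2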

So we can redefine our ratio as

\[ \sqrt{N} \frac{\bar{Y}-\bar{X}}{\sqrt{s_X^2 +s_Y^2}} \]

if \(M=N\) or in general,

\[ \frac{\bar{Y}-\bar{X}}{\sqrt{\frac{s_X^2}{M} + \frac{s_Y^2}{N}}} \]

The CLT tells us that when \(M\) and \(N\) are large, this random variable is normally distributed with mean 0 and SD 1. Thus we can compute p-values using the function pnorm.

Central Limit Theorem Exercises

For these exercises, we will be using the following dataset:

dat <- read.csv("mice_pheno.csv")

#We will remove the lines that contain missing values:
dat <- na.omit( dat )

Central Limit Theorem Exercises #1

Q:If a list of numbers has a distribution that is well approximated by the normal distribution, what proportion of these numbers are within one standard deviation away from the list’s average?

pnorm(1)-pnorm(-1)
## [1] 0.6826895
#or
1-(pnorm(-1) + (1 - pnorm(1)))
## [1] 0.6826895

A:0.6826895

Central Limit Theorem Exercises #2

Q:What proportion of these numbers are within two standard deviations away from the list’s average?

pnorm(2)-pnorm(-2)
## [1] 0.9544997

A:0.9544997

Central Limit Theorem Exercises #3

Q:What proportion of these numbers are within three standard deviations away from the list’s average?

pnorm(3)-pnorm(-3)
## [1] 0.9973002

A: 0.9973002

Central Limit Theorem Exercises #4

Q:Define y to be the weights of males on the control diet. What proportion of the mice are within one standard deviation away from the average weight (remember to use popsd for the population sd)?

library(dplyr)
y <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
mean( abs(z) <=1 )
## [1] 0.6950673

A:0.6950673

Central Limit Theorem Exercises #5

Q:What proportion of these numbers are within two standard deviations away from the list’s average?

z <- ( y - mean(y) ) / popsd(y)
mean( abs(z) <=2 )
## [1] 0.9461883

A:0.9461883

Central Limit Theorem Exercises #6

Q:What proportion of these numbers are within three standard deviations away from the list’s average?

mean( abs(z) <=3 )
## [1] 0.9910314

A: 0.9910314

Central Limit Theorem Exercises #7

Q:Note that the numbers for the normal distribution and our weights are relatively close. Also, notice that we are indirectly comparing quantiles of the normal distribution to quantiles of the mouse weight distribution. We can actually compare all quantiles using a qqplot. Which of the following best describes the qq-plot comparing mouse weights to the normal distribution?

library(rafalib)
mypar(1,1)
qqnorm(z)
abline(0,1)

A: The mouse weights are well approximated by the normal distribution, although the larger values (right tail) are larger than predicted by the normal. This is consistent with the differences seen between questions 3 and 6.

Central Limit Theorem Exercises #8

Q:Create the above qq-plot for the four populations: males/females on each of the two diets. What is the best explanation for all these being well approximated by the normal distribution?

mypar(2,2)
y <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="F" & Diet=="chow") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="M" & Diet=="hf") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)
y <- filter(dat, Sex=="F" & Diet=="hf") %>% select(Bodyweight) %>% unlist
z <- ( y - mean(y) ) / popsd(y)
qqnorm(z);abline(0,1)

A: This just happens to be how nature behaves in this particular case. Perhaps the result of many biological factors averaging out.

Central Limit Theorem Exercises #9

Here we are going to use the function replicate to learn about the distribution of random variables. All the above exercises relate to the normal distribution as an approximation of the distribution of a fixed list of numbers or a population. We have not yet discussed probability in these exercises. If the distribution of a list of numbers is approximately normal, then if we pick a number at random from this distribution, it will follow a normal distribution. However, it is important to remember that stating that some quantity has a distribution does not necessarily imply this quantity is random. Also, keep in mind that this is not related to the central limit theorem. The central limit theorem applies to averages of random variables. Let’s explore this concept.

We will now take a sample of size 25 from the population of males on the chow diet. The average of this sample is our random variable. We will use replicate to observe 10,000 realizations of this random variable. Set the seed at 1, then generate these 10,000 averages. Make a histogram and qq-plot of these 10,000 numbers against the normal distribution.

We can see that, as predicted by the CLT, the distribution of the random variable is very well approximated by the normal distribution.

set.seed(1) ##set the seed as instructed so results are reproducible
y <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
avgs <- replicate(10000, mean( sample(y, 25)))
mypar(1,2)
hist(avgs)
qqnorm(avgs)
qqline(avgs)

Q:What is the average of the distribution of the sample average?

mean(avgs)
## [1] 30.9556

A:30.9556

Central Limit Theorem Exercises #10

Q:What is the standard deviation of the distribution of sample averages?

library(rafalib)
popsd(avgs)
## [1] 0.8367952

A:0.8367952

CLT in Practice

Let’s use our data to see how well the central limit theorem approximates sample averages from our data. We will leverage our entire population dataset to compare the results we obtain by actually sampling from the distribution to what the CLT predicts.

dat <- read.csv("mice_pheno.csv") #file was previously downloaded
head(dat)
##   Sex Diet Bodyweight
## 1   F   hf      31.94
## 2   F   hf      32.48
## 3   F   hf      22.82
## 4   F   hf      19.92
## 5   F   hf      32.22
## 6   F   hf      27.50

Start by selecting only female mice since males and females have different weights. We will then sample mice from each of these populations.

library(dplyr)
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist

We can compute the population parameters of interest using the mean function.

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
## [1] 2.375517

We can compute the population standard deviation of, say, a vector \(x\) as well. However, we do not use the R function sd because this function does not compute the population standard deviation \(\sigma_x\). Instead, sd assumes the main argument is a random sample, say \(X\), and provides an estimate of \(\sigma_x\), defined by \(s_X\) above. As shown in the equations above, the two answers differ because one divides by the sample size and the other by the sample size minus one. We can see that with R code:

x <- controlPopulation
N <- length(x)
populationvar <- mean((x-mean(x))^2)
identical(var(x), populationvar)
## [1] FALSE
identical(var(x)*(N-1)/N, populationvar)
## [1] TRUE

So to be mathematically correct, we do not use sd or var. Instead, we use the popvar and popsd functions in rafalib:

library(rafalib)
sd_hf <- popsd(hfPopulation)
sd_control <- popsd(controlPopulation)

Remember that in practice we do not get to compute these population parameters. These are values we never see. In general, we want to estimate them from samples.

N <- 12
hf <- sample(hfPopulation, N)
control <- sample(controlPopulation, N)

As we described, the CLT tells us that for large \(N\), each of these is approximately normal with average equal to the population mean and standard error equal to the population standard deviation divided by \(\sqrt{N}\). We mentioned that a rule of thumb is that \(N\) should be 30 or more. However, that is just a rule of thumb since the precision of the approximation depends on the population distribution. Here we can actually check the approximation, and we do that for various values of \(N\).

Now we use sapply and replicate instead of for loops, which makes for cleaner code (we do not have to pre-allocate a vector, R takes care of this for us):

Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
res <-  sapply(Ns,function(n) {
  replicate(B,mean(sample(hfPopulation,n))-mean(sample(controlPopulation,n)))
})

Now we can use qq-plots to see how well the CLT approximation works for these. If in fact the normal distribution is a good approximation, the points should fall on a straight line when compared to normal quantiles. The more they deviate, the worse the approximation. In the title, we also show the average and SD of the observed distribution, which demonstrates how the SD decreases with \(\sqrt{N}\) as predicted.

mypar(2,2)
for (i in seq(along=Ns)) {
  titleavg <- signif(mean(res[,i]),3)
  titlesd <- signif(popsd(res[,i]),3)
  title <- paste0("N=",Ns[i]," Avg=",titleavg," SD=",titlesd)
  qqnorm(res[,i],main=title)
  qqline(res[,i],col=2)
}

Quantile versus quantile plot of simulated differences versus theoretical normal distribution for four different sample sizes.

Here we see a pretty good fit even for \(N=3\). Why is this? Because the population itself is relatively close to normally distributed, the averages are close to normal as well (the sum of normals is also normal). In practice, we actually calculate a ratio: we divide by the estimated standard deviation. Here is where the sample size starts to matter more.

Ns <- c(3,12,25,50)
B <- 10000 #number of simulations
##function to compute a t-stat
computetstat <- function(n) {
  y <- sample(hfPopulation,n)
  x <- sample(controlPopulation,n)
  (mean(y)-mean(x))/sqrt(var(y)/n+var(x)/n)
}
res <-  sapply(Ns,function(n) {
  replicate(B,computetstat(n))
})
mypar(2,2)
for (i in seq(along=Ns)) {
  qqnorm(res[,i],main=Ns[i])
  qqline(res[,i],col=2)
}

Quantile versus quantile plot of simulated ratios versus theoretical normal distribution for four different sample sizes.

So we see that for \(N=3\), the CLT does not provide a usable approximation. For \(N=12\), there is a slight deviation at the higher values, although the approximation appears useful. For 25 and 50, the approximation is spot on.

This simulation only proves that \(N=12\) is large enough in this case, not in general. As mentioned above, we will not be able to perform this simulation in most situations. We only use the simulation to illustrate the concepts behind the CLT and its limitations. In future sections, we will describe the approaches we actually use in practice.

The t-distribution

The CLT relies on large samples, what we refer to as asymptotic results. When the CLT does not apply, there is another option that does not rely on asymptotic results. When the original population from which a random variable, say \(Y\), is sampled is normally distributed with mean 0, then we can calculate the distribution of:

\[ \sqrt{N} \frac{\bar{Y}}{s_Y} \]

This is the ratio of two random variables so it is not necessarily normal. The fact that the denominator can be small by chance increases the probability of observing large values. William Sealy Gosset, an employee of the Guinness brewing company, deciphered the distribution of this random variable and published a paper under the pseudonym “Student”. The distribution is therefore called Student’s t-distribution. Later we will learn more about how this result is used.
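We can see Gosset’s result with a simulation: when the population is normal, the ratio above follows a t-distribution with \(N-1\) degrees of freedom even for small \(N\). A minimal sketch, with an arbitrary sample size of 5:

set.seed(1)
N <- 5
ts <- replicate(10000, {
  Y <- rnorm(N)            ##sample from a normal population with mean 0
  sqrt(N) * mean(Y) / sd(Y)
})
##compare the observed ratios to t-distribution quantiles
qqplot(qt(ppoints(10000), df=N-1), ts,
       xlab="theoretical t quantiles", ylab="observed ratios")
abline(0, 1)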

Here we will use the mice phenotype data as an example. We start by creating two vectors, one for the control population and one for the high-fat diet population:

library(dplyr)
dat <- read.csv("mice_pheno.csv") #We downloaded this file in a previous section
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  select(Bodyweight) %>% unlist
hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist

It is important to keep in mind that what we are assuming to be normal here is the distribution of \(y_1,y_2,\dots,y_n\), not the random variable \(\bar{Y}\). Although we can’t do this in practice, in this illustrative example, we get to see this distribution for both controls and high fat diet mice:

library(rafalib)
mypar(1,2)
hist(hfPopulation)
hist(controlPopulation)

Histograms of all weights for both populations.

We can use qq-plots to confirm that the distributions are relatively close to being normally distributed. We will explore these plots in more depth in a later section, but the important thing to know is that it compares data (on the y-axis) against a theoretical distribution (on the x-axis). If the points fall on the identity line, then the data is close to the theoretical distribution.

mypar(1,2)
qqnorm(hfPopulation)
qqline(hfPopulation)
qqnorm(controlPopulation)
qqline(controlPopulation)

Quantile-quantile plots of all weights for both populations.

The larger the sample, the more forgiving the result is to the weakness of this approximation. In the next section, we will see that for this particular dataset the t-distribution works well even for sample sizes as small as 3.

t-tests in Practice

Introduction

We will now demonstrate how to obtain a p-value in practice. We begin by loading experimental data and walking you through the steps used to form a t-statistic and compute a p-value. We can perform this task with just a few lines of code (go to the end of the section to see them). However, to understand the concepts, we will construct a t-statistic from “scratch”.

Read in and prepare data

We start by reading in the data. A first important step is to identify which rows are associated with treatment and control, and to compute the difference in means.

library(dplyr)
dat <- read.csv("femaleMiceWeights.csv") #previously downloaded

control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist

diff <- mean(treatment) - mean(control)
print(diff)
## [1] 3.020833

We are asked to report a p-value. What do we do? We learned that diff, referred to as the observed effect size, is a random variable. We also learned that under the null hypothesis, the mean of the distribution of diff is 0. What about the standard error? We also learned that the standard error of this random variable is the population standard deviation divided by the square root of the sample size:

\[ SE(\bar{X}) = \sigma / \sqrt{N}\]

We use the sample standard deviation as an estimate of the population standard deviation. In R, we simply use the sd function and the SE is:

sd(control)/sqrt(length(control))
## [1] 0.8725323

This is the SE of the sample average, but we actually want the SE of diff. We saw how statistical theory tells us that the variance of the difference of two random variables is the sum of their variances, so we compute the variance and take the square root:

se <- sqrt( 
  var(treatment)/length(treatment) + 
  var(control)/length(control) 
  )

Statistical theory tells us that if we divide a random variable by its SE, we get a new random variable with an SE of 1.

tstat <- diff/se 

This ratio is what we call the t-statistic. It’s the ratio of two random variables and thus a random variable. Once we know the distribution of this random variable, we can then easily compute a p-value.

As explained in the previous section, the CLT tells us that for large sample sizes, both sample averages mean(treatment) and mean(control) are normal. Statistical theory tells us that the difference of two normally distributed random variables is again normal, so CLT tells us that tstat is approximately normal with mean 0 (the null hypothesis) and SD 1 (we divided by its SE).

So now to calculate a p-value all we need to do is ask: how often does a standard normal random variable exceed tstat? R has a built-in function, pnorm, to answer this specific question. pnorm(a) returns the probability that a random variable following the standard normal distribution falls below a. To obtain the probability that it is larger than a, we simply use 1-pnorm(a). We want to know the probability of seeing something as extreme as tstat: either smaller (more negative) than -abs(tstat) or larger than abs(tstat). We call these two regions “tails” and calculate their size:

righttail <- 1 - pnorm(abs(tstat)) 
lefttail <- pnorm(-abs(tstat))
pval <- lefttail + righttail
print(pval)
## [1] 0.0398622

In this case, the p-value is smaller than 0.05 and using the conventional cutoff of 0.05, we would call the difference statistically significant.

Now there is a problem. CLT works for large samples, but is 12 large enough? A rule of thumb for CLT is that 30 is a large enough sample size (but this is just a rule of thumb). The p-value we computed is only a valid approximation if the assumptions hold, which do not seem to be the case here. However, there is another option other than using CLT.

The t-distribution in Practice

As described earlier, statistical theory offers another useful result. If the distribution of the population is normal, then we can work out the exact distribution of the t-statistic without the need for the CLT. This is a big “if” given that, with small samples, it is hard to check if the population is normal. But for something like weight, we suspect that the population distribution is likely well approximated by normal and that we can use this approximation. Furthermore, we can look at a qq-plot for the samples. This shows that the approximation is at least close:

library(rafalib)
mypar(1,2)

qqnorm(treatment)
qqline(treatment,col=2)

qqnorm(control)
qqline(control,col=2)

Quantile-quantile plots for sample against theoretical normal distribution.

If we use this approximation, then statistical theory tells us that the distribution of the random variable tstat follows a t-distribution. This is a much more complicated distribution than the normal. The t-distribution has a location parameter like the normal and another parameter called degrees of freedom. R has a nice function that actually computes everything for us.

t.test(treatment, control)
## 
##  Welch Two Sample t-test
## 
## data:  treatment and control
## t = 2.0552, df = 20.236, p-value = 0.053
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04296563  6.08463229
## sample estimates:
## mean of x mean of y 
##  26.83417  23.81333

To see just the p-value, we can use the $ extractor:

result <- t.test(treatment,control)
result$p.value
## [1] 0.05299888

The p-value is slightly bigger now. This is to be expected because our CLT approximation considered the denominator of tstat practically fixed (with large samples it practically is), while the t-distribution approximation takes into account that the denominator (the standard error of the difference) is a random variable. The smaller the sample size, the more the denominator varies.

It may be confusing that one approximation gave us one p-value and another gave us another, because we expect there to be just one answer. However, this is not uncommon in data analysis. We used different assumptions, different approximations, and therefore we obtained different results.

Later, in the power calculation section, we will describe type I and type II errors. As a preview, we will point out that the test based on the CLT approximation is more likely to incorrectly reject the null hypothesis (a false positive), while the t-distribution is more likely to incorrectly accept the null hypothesis (false negative).

Running the t-test in practice

Now that we have gone over the concepts, we can show the relatively simple code that one would use to actually compute a t-test:

library(dplyr)
dat <- read.csv("mice_pheno.csv")
control <- filter(dat,Diet=="chow") %>% select(Bodyweight) 
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) 
t.test(treatment,control)
## 
##  Welch Two Sample t-test
## 
## data:  treatment and control
## t = 7.1932, df = 735.02, p-value = 1.563e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.231533 3.906857
## sample estimates:
## mean of x mean of y 
##  30.48201  27.41281

The arguments to t.test can be of type data.frame and thus we do not need to unlist them into numeric objects.

CLT and t-distribution in Practice Exercises

Use the mouse dataset we have previously downloaded:

library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv"
if(!file.exists("femaleMiceWeights.csv")) download(url,destfile=filename)
dat <- read.csv(filename) 

CLT and t-distribution in Practice Exercises #1

The CLT is a result from probability theory. Much of probability theory was originally inspired by gambling. This theory is still used in practice by casinos. For example, they can estimate how many people need to play slots for there to be a 99.9999% probability of earning enough money to cover expenses. Let’s try a simple example related to gambling.

Suppose we are interested in the proportion of times we see a 6 when rolling n=100 dice. This is a random variable which we can simulate with x=sample(1:6, n, replace=TRUE), and the proportion we are interested in can be expressed as an average: mean(x==6). Because the die rolls are independent, the CLT applies.

We want to roll n dice 10,000 times and keep these proportions. This random variable (proportion of 6s) has mean p=1/6 and variance p(1-p)/n. So according to CLT z = (mean(x==6) - p) / sqrt(p(1-p)/n) should be normal with mean 0 and SD 1. Set the seed to 1, then use replicate to perform the simulation, and report what proportion of times z was larger than 2 in absolute value (CLT says it should be about 0.05).

set.seed(1)
n <- 100
sides <- 6
p <- 1/sides
zs <- replicate(10000,{
  x <- sample(1:sides,n,replace=TRUE)
  (mean(x==6) - p) / sqrt(p*(1-p)/n)
}) 
qqnorm(zs)
abline(0,1)#confirm it's well approximated with normal distribution

mean(abs(zs) > 2)
## [1] 0.0424

A:0.0424

CLT and t-distribution in Practice Exercises #2

For the last simulation you can make a qqplot to confirm the normal approximation. Now, the CLT is an asymptotic result, meaning it is closer and closer to being a perfect approximation as the sample size increases. In practice, however, we need to decide if it is appropriate for actual sample sizes. Is 10 enough? 15? 30?

In the example used in exercise 1, the original data is binary (either 6 or not). In this case, the success probability also affects the appropriateness of the CLT. With very low probabilities, we need larger sample sizes for the CLT to “kick in”.

Run the simulation from exercise 1, but for different values of p and n. For which of the following is the normal approximation best?

p=0.5 and n=5
p=0.5 and n=30
p=0.01 and n=30
p=0.01 and n=100

ps <- c(0.5,0.5,0.01,0.01)
ns <- c(5,30,30,100)
library(rafalib)
mypar(4,2)
for(i in 1:4){
  p <- ps[i]
  sides <- 1/p
  n <- ns[i]
  zs <- replicate(10000,{
    x <- sample(1:sides,n,replace=TRUE)
    (mean(x==1) - p) / sqrt(p*(1-p)/n)
  }) 
  hist(zs,nclass=7)
  qqnorm(zs)
  abline(0,1)
}

A:p=0.5 and n=30

CLT and t-distribution in Practice Exercises #3

Q:As we have already seen, the CLT also applies to averages of quantitative data. A major difference with binary data, for which we know the variance is p(1-p), is that with quantitative data we need to estimate the population standard deviation.

In several previous exercises we have illustrated statistical concepts with the unrealistic situation of having access to the entire population. In practice, we do not have access to entire populations. Instead, we obtain one random sample and need to reach conclusions analyzing that data. dat is an example of a typical simple dataset representing just one sample. We have 12 measurements for each of two populations:

library(dplyr)
X <- filter(dat, Diet=="chow") %>% select(Bodyweight) %>% unlist
Y <- filter(dat, Diet=="hf") %>% select(Bodyweight) %>% unlist
str(X)
##  Named num [1:12] 21.5 28.1 24 23.4 23.7 ...
##  - attr(*, "names")= chr [1:12] "Bodyweight1" "Bodyweight2" "Bodyweight3" "Bodyweight4" ...

We think of X as a random sample from the population of all mice in the control diet and Y as a random sample from the population of all mice in the high fat diet.

Define the parameter \(\mu_X\) as the average of the control population. We estimate this parameter with the sample average \(\bar{X}\). What is the sample average?

mean(X)
## [1] 23.81333

A:23.81333

CLT and t-distribution in Practice Exercises #4

Q:We don’t know \(\mu_X\), but want to use \(\bar{X}\) to understand \(\mu_X\). Which of the following uses CLT to understand how well \(\bar{X}\) approximates \(\mu_X\)?

A: \(\bar{X}\) follows a normal distribution with mean \(\mu_X\) and standard deviation \(\frac {\sigma_X}{\sqrt{12}}\) where \(\sigma_X\) is the population standard deviation.

CLT and t-distribution in Practice Exercises #5

Q:The result above tells us the distribution of the following random variable: \(Z=\sqrt{12} \frac {\bar{X}-\mu_X}{\sigma_X}\). What does the CLT tell us is the mean of \(Z\) (you don’t need code)?
A:0

CLT and t-distribution in Practice Exercises #6

Q:The results of 4 and 5 tell us the distribution of the difference between our estimate and the quantity we want to estimate but cannot observe. However, the equation involves the population standard deviation \(\sigma_X\), which we don’t know. Given what we discussed, what is your estimate of \(\sigma_X\)?

sd(X)
## [1] 3.022541

A:3.022541

CLT and t-distribution in Practice Exercises #7

Q:Use the CLT to approximate the probability that our estimate \(\bar{X}\) is off by more than 2 grams from \(\mu_X\).

2 * ( 1-pnorm(2/sd(X) * sqrt(12) ) )
## [1] 0.02189533

A:0.02189533

CLT and t-distribution in Practice Exercises #8

Now we introduce the concept of a null hypothesis. We don’t know \(\mu_X\) or \(\mu_Y\). We want to quantify what the data say about the possibility that the diet has no effect: \(\mu_X = \mu_Y\). If we use the CLT, then we approximate the distribution of \(\bar{X}\) as normal with mean \(\mu_X\) and standard deviation \(\frac {\sigma_X}{\sqrt{M}}\), and the distribution of \(\bar{Y}\) as normal with mean \(\mu_Y\) and standard deviation \(\frac {\sigma_Y}{\sqrt{N}}\), with \(M\) and \(N\) the sample sizes for \(X\) and \(Y\) respectively, in this case 12. This implies that, under the null, the difference \(\bar{Y} - \bar{X}\) has mean 0. We described that the standard deviation of this statistic (the standard error) is \(SE(\bar{Y} - \bar{X}) = \sqrt{\frac{\sigma_Y^2}{12} + \frac{\sigma_X^2}{12}}\) and that we estimate the population standard deviations \(\sigma_X\) and \(\sigma_Y\) with the sample estimates. What is the estimate of \(SE(\bar{Y} - \bar{X})\)?

sqrt( sd(X)^2/12 + sd(Y)^2/12 )
## [1] 1.469867
##or sqrt( var(X)/12 + var(Y)/12)

A:1.469867

CLT and t-distribution in Practice Exercises #9

So now we can compute \(\bar{Y} - \bar{X}\) as well as an estimate of this standard error and construct a t-statistic. What number is this t-statistic?

( mean(Y) - mean(X) ) / sqrt( var(X)/12 + var(Y)/12)
## [1] 2.055174
##or 
t.test(Y,X)$stat
##        t 
## 2.055174

A:2.055174

For some of the following exercises you need to review the t-distribution that was discussed in the lecture. If you have not done so already, you should review the related chapters from our textbook.

In particular, you will need to remember that the t-distribution is centered at 0 and has one parameter: the degrees of freedom, which controls the size of the tails. You will notice that if X follows a t-distribution, the probability that X is smaller than an extreme value such as 3 SDs away from the mean grows with the degrees of freedom. For example, notice the difference between:

1 - pt(3,df=3)
## [1] 0.02883444
1 - pt(3,df=15)
## [1] 0.004486369
1 - pt(3,df=30)
## [1] 0.002694982
1 - pnorm(3)
## [1] 0.001349898

As we explained, under certain assumptions, the t-statistic follows a t-distribution. Determining the degrees of freedom can sometimes be cumbersome, but the t.test function calculates it for you. One important fact to keep in mind is that the degrees of freedom are directly related to the sample size. There are various resources for learning more about degrees of freedom on the internet as well as statistics books.

CLT and t-distribution in Practice Exercises #10

Q:If we apply the CLT, what is the distribution of this t-statistic? A:Normal with mean 0 and standard deviation 1

CLT and t-distribution in Practice Exercises #11

Q:Now we are ready to compute a p-value using the CLT. What is the probability of observing a quantity as extreme as what we computed in 9, when the null hypothesis is true?

Z <- ( mean(Y) - mean(X) ) / sqrt( var(X)/12 + var(Y)/12)
2*( 1-pnorm(Z)) 
## [1] 0.0398622

A:0.0398622

CLT and t-distribution in Practice Exercises #12

Q:CLT provides an approximation for cases in which the sample size is large. In practice, we can’t check the assumption because we only get to see 1 outcome (which you computed above). As a result, if this approximation is off, so is our p-value. As described earlier, there is another approach that does not require a large sample size, but rather that the distribution of the population is approximately normal. We don’t get to see this distribution so it is again an assumption, although we can look at the distribution of the sample with qqnorm(X) and qqnorm(Y). If we are willing to assume this, then it follows that the t-statistic follows a t-distribution. What is the p-value under the t-distribution approximation? Hint: use the t.test function.

t.test(Y,X)
## 
##  Welch Two Sample t-test
## 
## data:  Y and X
## t = 2.0552, df = 20.236, p-value = 0.053
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04296563  6.08463229
## sample estimates:
## mean of x mean of y 
##  26.83417  23.81333
t.test(Y,X)$p.value
## [1] 0.05299888
#?t.test

CLT and t-distribution in Practice Exercises #13

Q:With the CLT distribution, we obtained a p-value smaller than 0.05 and with the t-distribution, one that is larger. They can’t both be right. What best describes the difference? A:These are two different assumptions. The t-distribution accounts for the variability introduced by the estimation of the standard error and thus, under the null, large values are more probable.

Week 3: Inference I: P-values, Confidence Intervals and Power Calculations

T-test Exercises

For these exercises we will load the babies dataset from babies.txt. We will use this data to review the concepts behind the p-values and then test confidence interval concepts.

library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/babies.txt"
filename <- basename(url)
download(url, destfile=filename)
babies <- read.table("babies.txt", header=TRUE)
head(babies)
##   bwt gestation parity age height weight smoke
## 1 120       284      0  27     62    100     0
## 2 113       282      0  33     64    135     0
## 3 128       279      0  28     64    115     1
## 4 123       999      0  36     69    190     0
## 5 108       282      0  23     67    125     1
## 6 136       286      0  25     62     93     0

This is a large dataset (1,236 cases), and we will pretend that it contains the entire population in which we are interested. We will study the differences in birth weight between babies born to smoking and non-smoking mothers.

First, let’s split this into two birth weight datasets: one of birth weights to non-smoking mothers and the other of birth weights to smoking mothers.

library(dplyr)
bwt.nonsmoke <- filter(babies, smoke==0) %>% select(bwt) %>% unlist 
bwt.smoke <- filter(babies, smoke==1) %>% select(bwt) %>% unlist

Now, we can look for the true population difference in means between smoking and non-smoking birth weights.

library(rafalib)
mean(bwt.nonsmoke)-mean(bwt.smoke)
## [1] 8.937666
popsd(bwt.nonsmoke)
## [1] 17.38696
popsd(bwt.smoke)
## [1] 18.08024

The population difference of mean birth weights is about 8.9 ounces. The standard deviations of the nonsmoking and smoking groups are about 17.4 and 18.1 ounces, respectively.

As we did with the mouse weight data, this assessment interactively reviews inference concepts using simulations in R. We will treat the babies dataset as the full population and draw samples from it to simulate individual experiments. We will then ask whether somebody who only received the random samples would be able to draw correct conclusions about the population.

We are interested in testing whether the birth weights of babies born to non-smoking mothers are significantly different from the birth weights of babies born to smoking mothers.

T-test Exercises #1

Q:Set the seed at 1 and obtain a sample of size \(N=25\) from the non-smoking mothers (dat.ns). Then, without resetting the seed, take a sample of the same size from the smoking mothers (dat.s). Compute the t-statistic (call it tval).

set.seed(1)
dat.ns<-sample(bwt.nonsmoke,25)
dat.s<-sample(bwt.smoke,25)
tval<-t.test(dat.ns,dat.s)$statistic
tval
##        t 
## 2.120904
#or
N=25
set.seed(1)
dat.ns <- sample(bwt.nonsmoke , N)
dat.s <- sample(bwt.smoke , N)

X.ns <- mean(dat.ns)
sd.ns <- sd(dat.ns)

X.s <- mean(dat.s)
sd.s <- sd(dat.s)

sd.diff <- sqrt(sd.ns^2/N+sd.s^2/N)
tval <- (X.ns - X.s)/sd.diff
tval
## [1] 2.120904

A:2.120904

T-test Exercises #2

Recall that we summarize our data using a t-statistic because we know that in situations where the null hypothesis is true (what we mean when we say “under the null”) and the sample size is relatively large, this t-value will have an approximate standard normal distribution. Because we know the distribution of the t-value under the null, we can quantitatively determine how unusual the observed t-value would be if the null hypothesis were true.

The standard procedure is to examine the probability a t-statistic that actually does follow the null hypothesis would have larger absolute value than the absolute value of the t-value we just observed – this is called a two-sided test.

We can compute this probability by taking one minus the area under the standard normal curve between -abs(tval) and abs(tval). In R, we can do this by using the pnorm function, which computes the area under a normal curve from negative infinity up to the value given as its first argument:

pval <- 1-(pnorm(abs(tval))-pnorm(-abs(tval)))
pval
## [1] 0.03392985

A:0.0339298

T-test Exercises #3

Because of the symmetry of the standard normal distribution, there is a simpler way to calculate the probability that a t-value under the null could have a larger absolute value than tval. Choose the simplified calculation from the following:

2*pnorm(-abs(tval))
## [1] 0.03392985

T-test Exercises #4

By reporting only p-values, many scientific publications provide an incomplete story of their findings. As we have mentioned, with very large sample sizes, scientifically insignificant differences between two groups can lead to small p-values. Confidence intervals are more informative as they include the estimate itself. Our estimate of the difference between babies of smokers and non-smokers is mean(dat.s) - mean(dat.ns). If we use the CLT, what quantity would we add and subtract to this estimate to obtain a 99% confidence interval?

N <- 25
qnorm(0.995)*sqrt( sd( dat.ns)^2/N + sd( dat.s)^2/N )
## [1] 12.0478

A:12.0478

Confidence Intervals

We have described how to compute p-values, which are ubiquitous in the life sciences. However, we do not recommend reporting p-values as the only statistical summary of your results. The reason is simple: statistical significance does not guarantee scientific significance. With large enough sample sizes, one might detect a statistically significant difference in weight of, say, 1 microgram. But is this an important finding? Would we say a diet results in higher weight if the increase is less than a fraction of a percent? The problem with reporting only p-values is that you will not provide a very important piece of information: the effect size. Recall that the effect size is the observed difference. Sometimes the effect size is divided by the mean of the control group and so expressed as a percent increase.

A much more attractive alternative is to report confidence intervals. A confidence interval includes information about your estimated effect size and the uncertainty associated with this estimate. Here we use the mice data to illustrate the concept behind confidence intervals.

Confidence Interval For Population Mean

Before we show how to construct a confidence interval for the difference between the two groups, we will show how to construct a confidence interval for the population mean of control female mice. Then we will return to the group difference after we’ve learned how to build confidence intervals in the simple case. We start by reading in the data and selecting the appropriate rows:

dat <- read.csv("mice_pheno.csv")
chowPopulation <- dat[dat$Sex=="F" & dat$Diet=="chow",3]

The population average \(\mu_X\) is our parameter of interest here:

mu_chow <- mean(chowPopulation)
print(mu_chow)
## [1] 23.89338

We are interested in estimating this parameter. In practice, we do not get to see the entire population so, as we did for p-values, we demonstrate how we can use samples to do this. Let’s start with a sample of size 30:

N <- 30
chow <- sample(chowPopulation,N)
print(mean(chow))
## [1] 23.45567

We know this is a random variable, so the sample average will not be a perfect estimate. In fact, because in this illustrative example we know the value of the parameter, we can see that they are not exactly the same. A confidence interval is a statistical way of reporting our finding, the sample average, in a way that explicitly summarizes the variability of our random variable.

With a sample size of 30, we will use the CLT. The CLT tells us that \(\bar{X}\) or mean(chow) follows a normal distribution with mean \(\mu_X\) or mean(chowPopulation) and standard error approximately \(s_X/\sqrt{N}\) or:

se <- sd(chow)/sqrt(N)
print(se)
## [1] 0.5500685

Defining The Interval

A 95% confidence interval (we can use percentages other than 95%) is a random interval with a 95% probability of falling on the parameter we are estimating. Keep in mind that saying 95% of random intervals will fall on the true value (our definition above) is not the same as saying there is a 95% chance that the true value falls in our interval. To construct it, we note that the CLT tells us that \(\sqrt{N} (\bar{X}-\mu_X) / s_X\) follows a normal distribution with mean 0 and SD 1. This implies that the probability of this event:

\[ -2 \leq \sqrt{N} (\bar{X}-\mu_X)/s_X \leq 2 \]

which written in R code is:

pnorm(2) - pnorm(-2)
## [1] 0.9544997

…is about 95% (to get closer use qnorm(1-0.05/2) instead of 2). Now do some basic algebra to clear out everything and leave \(\mu_X\) alone in the middle and you get that the following event:

\[ \bar{X}-2 s_X/\sqrt{N} \leq \mu_X \leq \bar{X}+2s_X/\sqrt{N} \]

has a probability of 95%.

Be aware that it is the edges of the interval \(\bar{X} \pm 2 s_X / \sqrt{N}\) , not \(\mu_X\) , that are random. Again, the definition of the confidence interval is that 95% of random intervals will contain the true, fixed value \(\mu_X\). For a specific interval that has been calculated, the probability is either 0 or 1 that it contains the fixed population mean \(\mu_X\).

Let’s demonstrate this logic through simulation. We can construct this interval with R relatively easily:

Q <- qnorm(1- 0.05/2)
interval <- c(mean(chow)-Q*se, mean(chow)+Q*se )
interval
## [1] 22.37755 24.53378
interval[1] < mu_chow & interval[2] > mu_chow
## [1] TRUE

which happens to cover \(\mu_X\) or mean(chowPopulation). However, we can take another sample and we might not be as lucky. In fact, the theory tells us that we will cover \(\mu_X\) 95% of the time. Because we have access to the population data, we can confirm this by taking several new samples:

library(rafalib)
B <- 250
mypar()
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- 
    mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

We show 250 random realizations of 95% confidence intervals. The color denotes if the interval fell on the parameter or not.

You can run this repeatedly to see what happens. You will see that in about 5% of the cases, we fail to cover \(\mu_X\).

Small Sample Size And The CLT

For \(N=30\), the CLT works very well. However, if \(N=5\), do these confidence intervals work as well? We used the CLT to create our intervals, and with \(N=5\) it may not be as useful an approximation. We can confirm this with a simulation:

mypar()
plot(mean(chowPopulation)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,B))
abline(v=mean(chowPopulation))
Q <- qnorm(1- 0.05/2)
N <- 5
for (i in 1:B) {
  chow <- sample(chowPopulation,N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se)
  covered <- mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

We show 250 random realizations of 95% confidence intervals, but now for a smaller sample size. The confidence interval is based on the CLT approximation. The color denotes if the interval fell on the parameter or not.

Despite the intervals being larger (we are dividing by \(\sqrt{5}\) instead of \(\sqrt{30}\)), we see many more intervals not covering \(\mu_X\). This is because the CLT is incorrectly telling us that the distribution of \(\sqrt{N}(\bar{X}-\mu_X)/s_X\) is approximately normal with mean 0 and standard deviation 1 when, in fact, it has a larger standard deviation and fatter tails (the parts of the distribution going to \(\pm \infty\)). This mistake affects us in the calculation of Q, which assumes a normal distribution and uses qnorm. The t-distribution might be more appropriate. All we have to do is re-run the above, but change how we calculate Q to use qt instead of qnorm.

mypar()
plot(mean(chowPopulation) + c(-7,7), c(1,1), type="n",
     xlab="weight", ylab="interval", ylim=c(1,B))
abline(v=mean(chowPopulation))
##Q <- qnorm(1- 0.05/2) ##no longer normal so use:
Q <- qt(1- 0.05/2, df=4)
N <- 5
for (i in 1:B) {
  chow <- sample(chowPopulation, N)
  se <- sd(chow)/sqrt(N)
  interval <- c(mean(chow)-Q*se, mean(chow)+Q*se )
  covered <- mean(chowPopulation) <= interval[2] & mean(chowPopulation) >= interval[1]
  color <- ifelse(covered,1,2)
  lines(interval, c(i,i),col=color)
}

We show 250 random realizations of 95% confidence intervals, but now for a smaller sample size. The confidence is now based on the t-distribution approximation. The color denotes if the interval fell on the parameter or not.

Now the intervals are made bigger. This is because the t-distribution has fatter tails and therefore:

qt(1- 0.05/2, df=4)
## [1] 2.776445

is bigger than…

qnorm(1- 0.05/2)
## [1] 1.959964

…which makes the intervals larger and hence cover \(\mu_X\) more frequently; in fact, about 95% of the time.

Connection Between Confidence Intervals and p-values

We recommend that in practice confidence intervals be reported instead of p-values. If for some reason you are required to provide p-values, or required that your results are significant at the 0.05 or 0.01 levels, confidence intervals do provide this information.

If we are talking about a t-test p-value, we are asking if differences as extreme as the one we observe, \(\bar{Y} - \bar{X}\), are likely when the difference between the population averages is actually equal to zero. So we can form a confidence interval with the observed difference. Instead of writing \(\bar{Y} - \bar{X}\) repeatedly, let’s define this difference as a new variable \(d \equiv \bar{Y} - \bar{X}\) .

Suppose you use the CLT and report \(d \pm 2 s_d/\sqrt{N}\) as a 95% confidence interval for the difference, and this interval does not include 0. Because the interval does not include 0, either \(d - 2 s_d/\sqrt{N} > 0\) or \(d + 2 s_d/\sqrt{N} < 0\). This implies that either \(\sqrt{N}d/s_d > 2\) or \(\sqrt{N}d/s_d < -2\). This then implies that the t-statistic is more extreme than 2, which in turn suggests that the p-value must be smaller than 0.05 (approximately; for a more exact calculation use qnorm(.05/2) instead of 2). The same calculation can be made if we use the t-distribution instead of the CLT (with qt(.05/2, df=2*N-2)). In summary, if a 95% or 99% confidence interval does not include 0, then the p-value must be smaller than 0.05 or 0.01 respectively.

Note that the confidence interval for the difference \(d\) is provided by the t.test function:

t.test(treatment,control)$conf.int
## [1] -0.04296563  6.08463229
## attr(,"conf.level")
## [1] 0.95

In this case, the 95% confidence interval does include 0 and we observe that the p-value is larger than 0.05 as predicted. If we change this to a 90% confidence interval, then:

t.test(treatment,control,conf.level=0.9)$conf.int
## [1] 0.4871597 5.5545070
## attr(,"conf.level")
## [1] 0.9

0 is no longer in the confidence interval (which is expected because the p-value is smaller than 0.10).

Confidence Intervals Exercises

For these exercises we will load the babies dataset from babies.txt. We will use this data to review the concepts behind the p-values and then test confidence interval concepts.

library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/babies.txt"
filename <- basename(url)
download(url, destfile=filename)
babies <- read.table("babies.txt", header=TRUE)
head(babies)

This is a large dataset (1,236 cases), and we will pretend that it contains the entire population in which we are interested. We will study the differences in birth weight between babies born to smoking and non-smoking mothers.

First, let’s split this into two birth weight datasets: one of birth weights to non-smoking mothers and the other of birth weights to smoking mothers.

library(dplyr)
bwt.nonsmoke <- filter(babies, smoke==0) %>% select(bwt) %>% unlist 
bwt.smoke <- filter(babies, smoke==1) %>% select(bwt) %>% unlist

Now, we can look for the true population difference in means between smoking and non-smoking birth weights.

library(rafalib)
mean(bwt.nonsmoke)-mean(bwt.smoke)
## [1] 8.937666
popsd(bwt.nonsmoke)
## [1] 17.38696
popsd(bwt.smoke)
## [1] 18.08024

The population difference of mean birth weights is about 8.9 ounces. The standard deviations of the nonsmoking and smoking groups are about 17.4 and 18.1 ounces, respectively.

As we did with the mouse weight data, this assessment interactively reviews inference concepts using simulations in R. We will treat the babies dataset as the full population and draw samples from it to simulate individual experiments. We will then ask whether somebody who only received the random samples would be able to draw correct conclusions about the population.

We are interested in testing whether the birth weights of babies born to non-smoking mothers are significantly different from the birth weights of babies born to smoking mothers.

Confidence Intervals Exercises #1

Q:Set the seed at 1 and obtain two samples, each of size N = 25, from non-smoking mothers (dat.ns) and smoking mothers (dat.s). If instead of CLT, we use the t-distribution approximation, what do we add and subtract to obtain a 99% confidence interval (use 2*N-2 degrees of freedom)?

N <- 25
set.seed(1)
dat.ns <- sample(bwt.nonsmoke, N) 
dat.s <- sample(bwt.smoke, N) 
qt(0.995,2*N-2)*sqrt( sd( dat.ns)^2/N + sd( dat.s)^2/N )
## [1] 12.54534
##note that if you define dat.s before dat.ns, you get a different answer
##due to sampling randomness
##tolerance is set to accept both answers

A:12.54534

Confidence Intervals Exercises #2

Q:Why are the values from T-test Exercises #3 and Confidence Intervals Exercises #1 so similar? A:N, and thus the degrees of freedom, is large enough to make the normal and t-distributions very similar.
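
We can check this directly by comparing the two quantiles used for a 99% interval with \(2 \times 25 - 2 = 48\) degrees of freedom:

qt(0.995, df=2*25-2) ## about 2.68
qnorm(0.995)         ## about 2.58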

Confidence Intervals Exercises #3

No matter which way you compute it, the p-value pval is the probability that the null hypothesis could have generated a t-statistic more extreme than what we observed: tval. If the p-value is very small, this means that observing a value more extreme than tval would be very rare if the null hypothesis were true, and would give strong evidence that we should reject the null hypothesis. We determine how small the p-value needs to be to reject the null by deciding how often we would be willing to mistakenly reject the null hypothesis.

The standard decision rule is the following: choose some small value \(\alpha\) (in most disciplines the conventional choice is \(\alpha=0.05\)) and reject the null hypothesis if the p-value is less than \(\alpha\). We call \(\alpha\) the significance level of the test.

It turns out that if we follow this decision rule, the probability that we will reject the null hypothesis by mistake is equal to \(\alpha\). (This fact is not immediately obvious and requires some probability theory to show.) We call the event of rejecting the null hypothesis when it is in fact true a Type I error, we call the probability of making a Type I error the Type I error rate, and we say that rejecting the null hypothesis when the p-value is less than \(\alpha\) controls the Type I error rate so that it is equal to \(\alpha\). We will see a number of decision rules that we use in order to control the probabilities of other types of errors. Often, we will guarantee that the probability of an error is less than some level, but, in this case, we can guarantee that the probability of a Type I error is exactly equal to \(\alpha\).
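
As a sanity check, here is a minimal simulation sketch of this fact. We draw both samples from the non-smoking mothers, so the null hypothesis is true by construction, and the proportion of rejections should come out close to \(\alpha\):

set.seed(1)
typeI <- replicate(10000, {
  a <- sample(bwt.nonsmoke, 25)
  b <- sample(bwt.nonsmoke, 25)
  t.test(a, b)$p.value < 0.05
})
mean(typeI) ## should be close to alpha = 0.05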

Q:Which of the following sentences about a Type I error is not true?
The following is another way to describe a Type I error: you decided to reject the null hypothesis on the basis of data that was actually generated by the null hypothesis.(True)

The following is another way to describe a Type I error: due to random fluctuations, even though the data you observed were actually generated by the null hypothesis, the p-value calculated from the observed data was small, so you rejected it.(True)

In scientific practice, a Type I error constitutes reporting a “significant” result when there is actually no result.(True)

From the original data alone, you can tell whether you have made a Type I error (False)

A:From the original data alone, you can tell whether you have made a Type I error

Confidence Intervals Exercises #4

In the simulation we have set up here, we know the null hypothesis is false – the true value of difference in means is actually around \(8.9\). Thus, we are concerned with how often the decision rule outlined in the last section allows us to conclude that the null hypothesis is actually false. In other words, we would like to quantify the Type II error rate of the test, or the probability that we fail to reject the null hypothesis when the alternative hypothesis is true.

Unlike the Type I error rate, which we can characterize by assuming that the null hypothesis of “no difference” is true, the Type II error rate cannot be computed by assuming the alternative hypothesis alone because the alternative hypothesis alone does not specify a particular value for the difference. It thus does not nail down a specific distribution for the t-value under the alternative.

For this reason, when we study the Type II error rate of a hypothesis testing procedure, we need to assume a particular effect size, or hypothetical size of the difference between population means, that we wish to target. We ask questions such as “what is the smallest difference I could reliably distinguish from 0 given my sample size \(N\)?” or, more commonly, “How big does \(N\) have to be in order to detect that the absolute value of the difference is greater than zero?” Type II error control plays a major role in designing data collection procedures before you actually see the data, so that you know the test you will run has enough sensitivity or power. Power is one minus the Type II error rate, or the probability that you will reject the null hypothesis when the alternative hypothesis is true.

There are several aspects of a hypothesis test that affect its power for a particular effect size. Intuitively, setting a lower \(\alpha\) decreases the power of the test for a given effect size because the null hypothesis will be more difficult to reject. This means that for an experiment with fixed parameters (i.e., with a predetermined sample size, recording mechanism, etc), the power of the hypothesis test trades off with its Type I error rate, no matter what effect size you target.

We can explore the trade off of power and Type I error concretely using the babies data. Since we have the full population, we know what the true effect size is (about 8.93) and we can compute the power of the test for true difference between populations.

Set the seed at 1 and take a random sample of \(N=5\) measurements from each of the smoking and nonsmoking datasets. What is the p-value (use the t-test function)?

N=5
set.seed(1)
dat.ns <- sample(bwt.nonsmoke , N) 
dat.s <- sample(bwt.smoke , N) 
t.test(dat.s, dat.ns)$p.value
## [1] 0.1366428
##Note that if you define dat.s before dat.ns, you get a different answer
##due to sampling randomness

A:0.1366428

Power Calculations

Introduction

We have used the example of the effects of two different diets on the weight of mice. Since in this illustrative example we have access to the population, we know that in fact there is a substantial (about 10%) difference between the average weights of the two populations:

library(dplyr)
dat <- read.csv("mice_pheno.csv") #Previously downloaded 

controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  select(Bodyweight) %>% unlist

hfPopulation <- filter(dat,Sex == "F" & Diet == "hf") %>%  
  select(Bodyweight) %>% unlist

mu_hf <- mean(hfPopulation)
mu_control <- mean(controlPopulation)
print(mu_hf - mu_control)
## [1] 2.375517
print((mu_hf - mu_control)/mu_control * 100) #percent increase
## [1] 9.942157

We have also seen that, in some cases, when we take a sample and perform a t-test, we don’t always get a p-value smaller than 0.05. For example, here is a case where we take a sample of 5 mice and don’t achieve statistical significance at the 0.05 level:

set.seed(1)
N <- 5
hf <- sample(hfPopulation,N)
control <- sample(controlPopulation,N)
t.test(hf,control)$p.value
## [1] 0.1410204

Did we make a mistake? By not rejecting the null hypothesis, are we saying the diet has no effect? The answer to this question is no. All we can say is that we did not reject the null hypothesis. But this does not necessarily imply that the null is true. The problem is that, in this particular instance, we don’t have enough power, a term we are now going to define. If you are doing scientific research, it is very likely that you will have to do a power calculation at some point. In many cases, it is an ethical obligation as it can help you avoid sacrificing mice unnecessarily or limiting the number of human subjects exposed to potential risk in a study. Here we explain what statistical power means.

Types Of Error

Whenever we perform a statistical test, we are aware that we may make a mistake. This is why our p-values are not 0. Under the null, there is always a positive, perhaps very small, but still positive chance that we will reject the null when it is true. If the p-value is 0.05, it will happen 1 out of 20 times. This error is called type I error by statisticians.

A type I error is defined as rejecting the null when we should not. This is also referred to as a false positive. So why do we then use 0.05? Shouldn’t we use 0.000001 to be really sure? The reason we don’t use infinitesimal cut-offs to avoid type I errors at all cost is that there is another error we can commit: to not reject the null when we should. This is called a type II error or a false negative. The R code analysis above shows an example of a false negative: we did not reject the null hypothesis (at the 0.05 level) even though, because we peeked at the true population means, we know there is in fact a difference. Had we used a p-value cutoff of 0.25, we would not have made this mistake. However, in general, are we comfortable with a type I error rate of 1 in 4? Usually we are not.
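
To make this concrete, a quick check with the hf and control samples drawn above:

t.test(hf,control)$p.value < 0.25 ## TRUE: the p-value of 0.141 is below 0.25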

The 0.05 and 0.01 Cut-offs Are Arbitrary

Most journals and regulatory agencies frequently insist that results be significant at the 0.01 or 0.05 levels. Of course there is nothing special about these numbers other than the fact that some of the first papers on p-values used these values as examples. Part of the goal of this book is to give readers a good understanding of what p-values and confidence intervals are so that these choices can be judged in an informed way. Unfortunately, in science, these cut-offs are applied somewhat mindlessly, but that topic is part of a complicated debate.

Power Calculation

Power is the probability of rejecting the null when the null is false. Of course “when the null is false” is a complicated statement because it can be false in many ways. \(\Delta \equiv \mu_Y - \mu_X\) could be anything and the power actually depends on this parameter. It also depends on the standard error of your estimates which in turn depends on the sample size and the population standard deviations. In practice, we don’t know these so we usually report power for several plausible values of \(\Delta\), \(\sigma_X\), \(\sigma_Y\) and various sample sizes. Statistical theory gives us formulas to calculate power. The pwr package performs these calculations for you. Here we will illustrate the concepts behind power by coding up simulations in R.
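
For reference, here is a sketch of what such a calculation looks like with the pwr package (assuming it is installed). The argument d is the effect size in standard deviation units (Cohen’s d); the numbers plugged in below are illustrative, in the ballpark of this population (a difference of about 2.4 grams against a within-group SD of about 3.5 grams):

library(pwr)
## two-sample t-test, N=12 per group, illustrative effect size
pwr.t.test(n=12, d=2.4/3.5, sig.level=0.05, type="two.sample")$power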

Suppose our sample size is:

N <- 12

and we will reject the null hypothesis at:

alpha <- 0.05

What is our power with this particular data? We will compute this probability by re-running the exercise many times and calculating the proportion of times the null hypothesis is rejected. Specifically, we will run:

B <- 2000

simulations. The simulation is as follows: we take a sample of size \(N\) from both control and treatment groups, we perform a t-test comparing these two, and report if the p-value is less than alpha or not. We write a function that does this:

reject <- function(N, alpha=0.05){
   hf <- sample(hfPopulation,N) 
   control <- sample(controlPopulation,N)
   pval <- t.test(hf,control)$p.value
   pval < alpha
}

Here is an example of one simulation for a sample size of 12. The call to reject answers the question “Did we reject?”

reject(12)
## [1] FALSE

Now we can use the replicate function to do this B times.

rejections <- replicate(B,reject(N))

Our power is just the proportion of times we correctly reject. So with \(N=12\) our power is only:

mean(rejections)
## [1] 0.2145

This explains why the t-test was not rejecting when we knew the null was false. With a sample size of just 12, our power is only about 21%. To guard against false positives at the 0.05 level, we set the threshold at a level that resulted in many type II errors.

Let’s see how power improves with N. We will use the function sapply, which applies a function to each of the elements of a vector. We want to repeat the above for the following sample size:

Ns <- seq(5, 50, 5)

So we use sapply like this:

power <- sapply(Ns,function(N){
  rejections <- replicate(B, reject(N))
  mean(rejections)
  })

For each of these sample sizes, the above code returns the proportion of times we reject. Not surprisingly, power increases with N:

plot(Ns, power, type="b")
Power plotted against sample size.

Similarly, if we change the level alpha at which we reject, power changes. The smaller I want the chance of type I error to be, the less power I will have. Another way of saying this is that we trade off between the two types of error. We can see this by writing similar code, but keeping \(N\) fixed and considering several values of alpha:

N <- 30
alphas <- c(0.1,0.05,0.01,0.001,0.0001)
power <- sapply(alphas,function(alpha){
  rejections <- replicate(B,reject(N,alpha=alpha))
  mean(rejections)
})
plot(alphas, power, xlab="alpha", type="b", log="x")
Power plotted against cut-off.

Note that the x-axis in this last plot is in the log scale.

There is no “right” power or “right” alpha level, but it is important that you understand what each means.

To see this clearly, you could create a plot with curves of power versus N. Show several curves in the same plot with color representing alpha level.
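
Here is one sketch of such a plot, reusing the reject function defined above; each curve corresponds to one alpha level:

Ns <- seq(5, 50, 5)
alphas <- c(0.1, 0.05, 0.01)
plot(range(Ns), c(0,1), type="n", xlab="N", ylab="power")
for (i in seq_along(alphas)) {
  power <- sapply(Ns, function(N){
    mean(replicate(B, reject(N, alpha=alphas[i])))
  })
  lines(Ns, power, type="b", col=i)
}
legend("bottomright", legend=alphas, col=seq_along(alphas), lty=1, title="alpha")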

p-values are Arbitrary under the Alternative Hypothesis

Another consequence of what we have learned about power is that p-values are somewhat arbitrary when the null hypothesis is not true and therefore the alternative hypothesis is true (the difference between the population means is not zero). When the alternative hypothesis is true, we can make a p-value as small as we want simply by increasing the sample size (supposing that we have an infinite population to sample from). We can show this property of p-values by drawing larger and larger samples from our population and calculating p-values. This works because, in our case, we know that the alternative hypothesis is true, since we have access to the populations and can calculate the difference in their means.

First write a function that returns a p-value for a given sample size \(N\):

calculatePvalue <- function(N) {
   hf <- sample(hfPopulation,N) 
   control <- sample(controlPopulation,N)
   t.test(hf,control)$p.value
}

We have a limit here of 200 for the high-fat diet population, but we can see the effect well before we get to 200. For each sample size, we will calculate a few p-values. We can do this by repeating each value of \(N\) a few times.

Ns <- seq(10,200,by=10)
Ns_rep <- rep(Ns, each=10)

Again we use sapply to run our simulations:

pvalues <- sapply(Ns_rep, calculatePvalue)

Now we can plot the 10 p-values we generated for each sample size:

plot(Ns_rep, pvalues, log="y", xlab="sample size",
     ylab="p-values")
abline(h=c(.01, .05), col="red", lwd=2)
p-values from random samples at varying sample size. The actual value of the p-values decreases as we increase sample size whenever the alternative hypothesis is true.

Note that the y-axis is log scale and that the p-values show a decreasing trend all the way to \(10^{-8}\) as the sample size gets larger. The standard cutoffs of 0.01 and 0.05 are indicated with horizontal red lines.

It is important to remember that p-values are not more interesting as they become very very small. Once we have convinced ourselves to reject the null hypothesis at a threshold we find reasonable, having an even smaller p-value just means that we sampled more mice than was necessary. Having a larger sample size does help to increase the precision of our estimate of the difference \(\Delta\), but the fact that the p-value becomes very very small is just a natural consequence of the mathematics of the test. The p-values get smaller and smaller with increasing sample size because the numerator of the t-statistic has \(\sqrt{N}\) (for equal sized groups, and a similar effect occurs when \(M \neq N\)). Therefore, if \(\Delta\) is non-zero, the t-statistic will increase with \(N\).
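
The \(\sqrt{N}\) effect is easy to see directly: holding the estimated difference and standard deviations fixed at hypothetical values, the t-statistic grows with the square root of the sample size:

## hypothetical fixed estimates: difference of 3, per-group SD of 3.5
Delta <- 3
s <- 3.5
tstat <- function(N) Delta / sqrt(s^2/N + s^2/N)
sapply(c(10, 40, 160), tstat) ## doubles each time N quadruples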

Therefore, a better statistic to report is the effect size with a confidence interval or some statistic which gives the reader a sense of the change in a meaningful scale. We can report the effect size as a percent by dividing the difference and the confidence interval by the control population mean:

N <- 12
hf <- sample(hfPopulation, N)
control <- sample(controlPopulation, N)
diff <- mean(hf) - mean(control)
diff / mean(control) * 100
## [1] 1.868663
t.test(hf, control)$conf.int / mean(control) * 100
## [1] -20.94576  24.68309
## attr(,"conf.level")
## [1] 0.95

In addition, we can report a statistic called Cohen’s d, which is the difference between the groups divided by the pooled standard deviation of the two groups.

sd_pool <- sqrt(((N-1)*var(hf) + (N-1)*var(control))/(2*N - 2))
diff / sd_pool
## [1] 0.07140083

This tells us how many standard deviations of the data the mean of the high-fat diet group is from the control group. Under the alternative hypothesis, unlike the t-statistic, which is guaranteed to increase with the sample size, the estimates of the effect size and Cohen’s d will simply become more precise.

Power Calculations Exercises

For these exercises we will load the babies dataset from babies.txt. We will use this data to review the concepts behind the p-values and then test confidence interval concepts.

library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/babies.txt"
filename <- basename(url)
download(url, destfile=filename)
babies <- read.table("babies.txt", header=TRUE)

This is a large dataset (1,236 cases), and we will pretend that it contains the entire population in which we are interested. We will study the differences in birth weight between babies born to smoking and non-smoking mothers.

First, let’s split this into two birth weight datasets: one of birth weights to non-smoking mothers and the other of birth weights to smoking mothers.

bwt.nonsmoke <- filter(babies, smoke==0) %>% select(bwt) %>% unlist 
bwt.smoke <- filter(babies, smoke==1) %>% select(bwt) %>% unlist

Now, we can look for the true population difference in means between smoking and non-smoking birth weights.

library(rafalib)
mean(bwt.nonsmoke)-mean(bwt.smoke)
popsd(bwt.nonsmoke)
popsd(bwt.smoke)

The population difference of mean birth weights is about 8.9 ounces. The standard deviations of the nonsmoking and smoking groups are about 17.4 and 18.1 ounces, respectively.

As we did with the mouse weight data, this assessment interactively reviews inference concepts using simulations in R. We will treat the babies dataset as the full population and draw samples from it to simulate individual experiments. We will then ask whether somebody who only received the random samples would be able to draw correct conclusions about the population.

We are interested in testing whether the birth weights of babies born to non-smoking mothers are significantly different from the birth weights of babies born to smoking mothers.

Power Calculations Exercises #1

We can explore the trade off of power and Type I error concretely using the babies data. Since we have the full population, we know what the true effect size is (about 8.93) and we can compute the power of the test for true difference between populations.

Set the seed at 1, take a random sample of N=5 measurements from each of the smoking and nonsmoking datasets, and use the t.test function to find the p-value (as in the Confidence Intervals exercises above).

The p-value is larger than 0.05 so using the typical cut-off, we would not reject. This is a type II error. Which of the following is not a way to decrease this type of error?

Increase our chance of a type I error.(reduces type II error)

Take a larger sample size.(reduces type II error)

Use a higher \(\alpha\) level.(reduces type II error)

Find a population for which the null is not true.(does not reduce type II error)

N=5
set.seed(1)
dat.ns <- sample(bwt.nonsmoke , N) 
dat.s <- sample(bwt.smoke , N) 
t.test(dat.s, dat.ns)$p.value>0.05
## [1] TRUE

A: Find a population for which the null is not true.

Power Calculations Exercises #2

Set the seed at 1, then use the replicate function to repeat the code used in the exercise above 10,000 times. What proportion of the time do we reject at the 0.05 level?

N=5
set.seed(1)
rejects <- replicate(10000,{
  dat.ns <- sample(bwt.nonsmoke , N)
  dat.s <- sample(bwt.smoke , N)
  t.test(dat.s, dat.ns)$p.value < 0.05
})
mean(rejects)
## [1] 0.0984
#or 

set.seed(1)
N <- 5
alpha <- 0.05
reject <- function(N, alpha=0.05){
   dat.ns <- sample(bwt.nonsmoke , N) 
   dat.s <- sample(bwt.smoke , N) 
   pval <- t.test(dat.s,dat.ns )$p.value
   pval < alpha
}
reject(N)
## [1] FALSE
B <- 10000
rejections <- replicate(B,reject(N))
mean(rejections) 
## [1] 0.0984

A:0.0984

Power Calculations Exercises #3

Note that, not surprisingly, the power is lower than 10%. Repeat the exercise above for sample sizes of 30, 60, 90 and 120. Which of those four gives you power of about 80%?

Ns=c(30,60,90,120)
res <- sapply(Ns, function(N){
  set.seed(1)
  rejects <- replicate(10000,{
    dat.ns <- sample(bwt.nonsmoke , N)
    dat.s <- sample(bwt.smoke , N)
    t.test(dat.s, dat.ns)$p.value < 0.05
    })
  mean(rejects)
  })
Ns[ which.min( abs( res - .8) ) ] 
## [1] 60
#or

Ns <- c(30, 60, 90,120)
power <- sapply(Ns,function(N){
  rejections <- replicate(B, reject(N))
  mean(rejections)
  })
power 
## [1] 0.4881 0.7951 0.9357 0.9837

A:60

Power Calculations Exercises #4

Repeat the problem above, but now require an \(\alpha\) level of 0.01. Which of those four gives you power of about 80%?

Ns=c(30,60,90,120)
res <- sapply(Ns, function(N){
  set.seed(1)
  rejects <- replicate(10000,{
    dat.ns <- sample(bwt.nonsmoke , N)
    dat.s <- sample(bwt.smoke , N)
    t.test(dat.s, dat.ns)$p.value < 0.01
    })
  mean(rejects)
  })
Ns[ which.min( abs( res - .8) ) ]
## [1] 90

A:90

Week 3: Inference II: Monte Carlo Simulation, Permutation Tests and Association Tests

Monte Carlo Simulation

Computers can be used to generate pseudo-random numbers. For practical purposes these pseudo-random numbers can be used to imitate random variables from the real world. This permits us to examine properties of random variables using a computer instead of theoretical or analytical derivations. One very useful aspect of this concept is that we can create simulated data to test out ideas or competing methods, without actually having to perform laboratory experiments.

Simulations can also be used to check theoretical or analytical results. Also, many of the theoretical results we use in statistics are based on asymptotics: they hold when the sample size goes to infinity. In practice, we never have an infinite number of samples so we may want to know how well the theory works with our actual sample size. Sometimes we can answer this question analytically, but not always. Simulations are extremely useful in these cases.

As an example, let’s use a Monte Carlo simulation to compare the CLT to the t-distribution approximation for different sample sizes.

library(dplyr)
dat <- read.csv("mice_pheno.csv")
controlPopulation <- filter(dat,Sex == "F" & Diet == "chow") %>%  
  select(Bodyweight) %>% unlist

We will build a function that automatically generates a t-statistic under the null hypothesis for any sample size n.

ttestgenerator <- function(n) {
  #note that here we have a false "high fat" group where we actually
  #sample from the chow or control population. 
  #This is because we are modeling the null.
  cases <- sample(controlPopulation,n)
  controls <- sample(controlPopulation,n)
  tstat <- (mean(cases)-mean(controls)) / 
      sqrt( var(cases)/n + var(controls)/n ) 
  return(tstat)
  }
ttests <- replicate(1000, ttestgenerator(10))

With 1,000 Monte Carlo simulated occurrences of this random variable, we can now get a glimpse of its distribution:

hist(ttests)
Histogram of 1000 Monte Carlo simulated t-statistics.

So is the distribution of this t-statistic well approximated by the normal distribution? In the next chapter, we will formally introduce quantile-quantile plots, which provide a useful visual inspection of how well one distribution approximates another. As we will explain later, if points fall on the identity line, it means the approximation is a good one.

qqnorm(ttests)
abline(0,1)
Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics to theoretical normal distribution.

This looks like a very good approximation. For this particular population, a sample size of 10 was large enough to use the CLT approximation. How about 3?

ttests <- replicate(1000, ttestgenerator(3))
qqnorm(ttests)
abline(0,1)
Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical normal distribution.

Now we see that the large quantiles, referred to by statisticians as the tails, are larger than expected (below the line on the left side of the plot and above the line on the right side of the plot). In the previous module, we explained that when the sample size is not large enough and the population values follow a normal distribution, then the t-distribution is a better approximation. Our simulation results seem to confirm this:

ps <- (seq(0,999)+0.5)/1000
qqplot(qt(ps,df=2*3-2),ttests,xlim=c(-6,6),ylim=c(-6,6))
abline(0,1)
Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics with three degrees of freedom to theoretical t-distribution.

The t-distribution is a much better approximation in this case, but it is still not perfect. This is due to the fact that the original data is not that well approximated by the normal distribution.

qqnorm(controlPopulation)
qqline(controlPopulation)
Quantile-quantile of original data compared to theoretical quantile distribution.

Parametric Simulations for the Observations

The technique we used to motivate random variables and the null distribution was a type of Monte Carlo simulation. We had access to population data and generated samples at random. In practice, we do not have access to the entire population. The reason for using the approach here was for educational purposes. However, when we want to use Monte Carlo simulations in practice, it is much more typical to assume a parametric distribution and generate a population from this, which is called a parametric simulation. This means that we take parameters estimated from the real data (here the mean and the standard deviation), and plug these into a model (here the normal distribution). This is actually the most common form of Monte Carlo simulation.

For the case of weights, we could use our knowledge that mice typically weigh 24 grams with a SD of about 3.5 grams, and that the distribution is approximately normal, to generate population data:

controls<- rnorm(5000, mean=24, sd=3.5) 

After we generate the data, we can then repeat the exercise above. We no longer have to use the sample function since we can re-generate random normal numbers. The ttestgenerator function therefore can be written as follows:

ttestgenerator <- function(n, mean=24, sd=3.5) {
  cases <- rnorm(n,mean,sd)
  controls <- rnorm(n,mean,sd)
  tstat <- (mean(cases)-mean(controls)) / 
      sqrt( var(cases)/n + var(controls)/n ) 
  return(tstat)
  }
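
We can then repeat the earlier Monte Carlo exercise with this parametric version, for example:

ttests <- replicate(1000, ttestgenerator(10))
qqnorm(ttests)
abline(0,1)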

Monte Carlo Exercises

We have used Monte Carlo simulation throughout this chapter to demonstrate statistical concepts; namely, sampling from the population. We mostly applied this to demonstrate the statistical properties related to inference on differences in averages. Here, we will consider examples of how Monte Carlo simulations are used in practice.

Monte Carlo Exercises #1

Imagine you are William Sealy Gosset and have just mathematically derived the distribution of the t-statistic when the sample comes from a normal distribution. Unlike Gosset, you have access to computers and can use them to check the results.

Let’s start by creating an outcome.

Set the seed at 1, use rnorm to generate a random sample of size 5, \(X_1, \dots, X_5\), from a standard normal distribution, then compute the t-statistic \(t=\sqrt{5} \frac{\bar{X}}{s}\) with \(s\) the sample standard deviation. What value do you observe?

set.seed(1)
X<-rnorm(5)
t<-(sqrt(5)*mean(X))/sd(X)
t
## [1] 0.3007746

A: 0.3007746

Monte Carlo Exercises #2

You have just performed a Monte Carlo simulation using rnorm , a random number generator for normally distributed data. Gosset’s mathematical calculation tells us that the t-statistic defined in the previous exercises, a random variable, follows a t-distribution with \(N-1\) degrees of freedom. Monte Carlo simulations can be used to check the theory: we generate many outcomes and compare them to the theoretical result. Set the seed to 1, generate \(B=1000\) t-statistics as done in exercise 1. What proportion is larger than 2?

set.seed(1)
ttestgenerator <- function(n) {
  X <- rnorm(n)
  tstat <- (sqrt(n)*mean(X))/sd(X) 
  return(tstat)
}

ttests <- replicate(1000, ttestgenerator(5))
mean(ttests>=2)
## [1] 0.068
#or
set.seed(1)
N <- 5
B<- 1000

tstats <- replicate(B,{
  X <- rnorm(N)
  sqrt(N)*mean(X)/sd(X)
})
mean(tstats>2)
## [1] 0.068
1-pt(2,df=4)
## [1] 0.05805826

A:0.068

Monte Carlo Exercises #3

The answer to exercise 2 is very similar to the theoretical prediction: 1-pt(2,df=4). We can check several such quantiles using the qqplot function.

To obtain quantiles for the t-distribution we can generate percentiles from just above 0 to just below 1: B=100; ps = seq(1/(B+1), 1-1/(B+1),len=B) and compute the quantiles with qt(ps,df=4). Now we can use qqplot to compare these theoretical quantiles to those obtained in the Monte Carlo simulation. Use the Monte Carlo simulation developed for exercise 2 to corroborate that the t-statistic \(t=\sqrt{N} \frac{\bar{X}}{s}\) follows a t-distribution for several values of \(N\).

For which sample sizes does the approximation work best?

library(rafalib)

Ns<-seq(5,30,5)
B <- 1000
mypar(3,2)
LIM <- c(-4.5,4.5)
for(N in Ns){
    ts <- replicate(B, {
    X <- rnorm(N)
    sqrt(N)*mean(X)/sd(X)
    })
    ps <- seq(1/(B+1),1-1/(B+1),len=B)
    qqplot(qt(ps,df=N-1),ts,main=N,
           xlab="Theoretical",ylab="Observed",
           xlim=LIM, ylim=LIM)
    abline(0,1)
} 

A:The approximations are spot on for all sample sizes

Monte Carlo Exercises #4

Use Monte Carlo simulation to corroborate that the t-statistic comparing two means and obtained with normally distributed (mean 0 and sd 1) data follows a t-distribution. In this case we will use the t.test function with var.equal=TRUE. With this argument the degrees of freedom will be df=2*N-2 with N the sample size per group. For which sample sizes does the approximation work best?

Ns<-seq(5,30,5)
B <- 1000
mypar(3,2)
LIM <- c(-4.5,4.5)
for(N in Ns){
    ts <- replicate(B,{
    x <- rnorm(N)
    y <- rnorm(N)
    t.test(x,y, var.equal = TRUE)$stat
    })
  ps <- seq(1/(B+1),1-1/(B+1),len=B)
  qqplot(qt(ps,df=2*N-2),ts,main=N,
         xlab="Theoretical",ylab="Observed",
         xlim=LIM, ylim=LIM)
  abline(0,1)
}  

A:The approximations are spot on for all sample sizes.

Monte Carlo Exercises #5

Is the following statement true or false? If instead of generating the sample with X=rnorm(15) we generate it with binary data (either positive or negative 1 with probability 0.5), X=sample(c(-1,1), 15, replace=TRUE), then the t-statistic sqrt(N)*mean(X)/sd(X) is approximated by a t-distribution with 14 degrees of freedom.

set.seed(1)
N <- 15
B <- 10000
tstats <- replicate(B,{
  X <- sample(c(-1,1), N, replace=TRUE)
  sqrt(N)*mean(X)/sd(X)
})
ps=seq(1/(B+1), 1-1/(B+1), len=B) 
qqplot(qt(ps,N-1), tstats, xlim=range(tstats))
abline(0,1)

#The population data is not normal thus the theory does not apply.
#We check with a Monte Carlo simulation. The qqplot shows a large tail. 
#Note that there is a small but positive chance that all the X are the same.
##In this case the denominator is 0 and the t-statistic is not defined

A:False

Monte Carlo Exercises #6

Is the following statement true or false ? If instead of generating the sample with X=rnorm(N) with N=1000, we generate the data with binary data X= sample(c(-1,1), N, replace=TRUE), then the t-statistic sqrt(N)*mean(X)/sd(X) is approximated by a t-distribution with 999 degrees of freedom.

set.seed(1)
N <- 1000
B <- 10000
tstats <- replicate(B,{
  X <-  sample(c(-1,1), N, replace=TRUE)
  sqrt(N)*mean(X)/sd(X)
})
qqnorm(tstats)
abline(0,1)

#With N=1000, CLT kicks in and the t-statistic is approximated with normal 0,1
##Furthermore, t-distribution with df=999 and normal are practically the same.

A:True

Monte Carlo Exercises #7

We can derive approximation of the distribution of the sample average or the t-statistic theoretically. However, suppose we are interested in the distribution of a statistic for which a theoretical approximation is not immediately obvious.

Consider the sample median as an example. Use a Monte Carlo simulation to determine which of the following best approximates the distribution of the median of a sample taken from a normally distributed population with mean 0 and standard deviation 1.

set.seed(1)
Ns <- seq(5,45,5)
library(rafalib)
mypar(3,3)
for(N in Ns){
  medians <- replicate(10000, median ( rnorm(N) ) )
  title <- paste("N=",N,", avg=",round( mean(medians), 2) , ", sd*sqrt(N)=", round( sd(medians)*sqrt(N),2) )
  qqnorm(medians, main = title )
  qqline(medians)
}

##there is an asymptotic result that says the SD is 1/sqrt(4*N*dnorm(0)^2)
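
We can check that asymptotic result against a simulation; for a standard normal, the approximate SD of the median is 1/(2*dnorm(0)*sqrt(N)), so sd(medians)*sqrt(N) should be close to 1/(2*dnorm(0)), about 1.25 (compared to 1 for the sample mean). A quick sketch:

set.seed(1)
N <- 45
medians <- replicate(10000, median(rnorm(N)))
sd(medians)*sqrt(N) ## simulation
1/(2*dnorm(0))      ## asymptotic value, about 1.25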

A:The sample median is approximately normal with mean 0 and SD larger than \(\frac{1}{\sqrt{N}}\)

Permutation Tests

Suppose we have a situation in which none of the standard mathematical statistical approximations apply. We have computed a summary statistic, such as the difference in mean, but do not have a useful approximation, such as that provided by the CLT. In practice, we do not have access to all values in the population so we can’t perform a simulation as done above. Permutation tests can be useful in these scenarios.

We are back to the scenario where we only have 12 measurements for each group.

dat=read.csv("femaleMiceWeights.csv")

library(dplyr)

control <- filter(dat,Diet=="chow") %>% select(Bodyweight) %>% unlist
treatment <- filter(dat,Diet=="hf") %>% select(Bodyweight) %>% unlist
obsdiff <- mean(treatment)-mean(control)

In previous sections, we showed parametric approaches that helped determine if the observed difference was significant. Permutation tests take advantage of the fact that if we randomly shuffle the cases and control labels, then the null is true. So we shuffle the cases and control labels and assume that the ensuing distribution approximates the null distribution. Here is how we generate a null distribution by shuffling the data 1,000 times:

N <- 12
avgdiff <- replicate(1000, {
    all <- sample(c(control,treatment))
    newcontrols <- all[1:N]
    newtreatments <- all[(N+1):(2*N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v=obsdiff, col="red", lwd=2)
Histogram of difference between averages from permutations. Vertical line shows the observed difference.

How many of the null means are bigger than the observed value? That proportion would be the p-value for the null. We add a 1 to the numerator and denominator to account for misestimation of the p-value (for more details see Phipson and Smyth, Permutation P-values should never be zero).

#the proportion of permutations with larger difference
(sum(abs(avgdiff) > abs(obsdiff)) + 1) / (length(avgdiff) + 1)
## [1] 0.05694306

Now let’s repeat this experiment for a smaller dataset. We create a smaller dataset by sampling:

N <- 5
control <- sample(control,N)
treatment <- sample(treatment,N)
obsdiff <- mean(treatment)- mean(control)

and repeat the exercise:

avgdiff <- replicate(1000, {
    all <- sample(c(control,treatment))
    newcontrols <- all[1:N]
    newtreatments <- all[(N+1):(2*N)]
  return(mean(newtreatments) - mean(newcontrols))
})
hist(avgdiff)
abline(v=obsdiff, col="red", lwd=2)
Histogram of difference between averages from permutations for smaller sample size. Vertical line shows the observed difference.

Now the observed difference is not significant using this approach. Keep in mind that there is no theoretical guarantee that the null distribution estimated from permutations approximates the actual null distribution. For example, if there is a real difference between the populations, some of the permutations will be unbalanced and will contain some samples that explain this difference. This implies that the null distribution created with permutations will have larger tails than the actual null distribution. This is why permutations result in conservative p-values. For this reason, when we have few samples, we can’t do permutations.

Note also that permutation tests still have assumptions: samples are assumed to be independent and “exchangeable”. If there is hidden structure in your data, then permutation tests can result in estimated null distributions that underestimate the size of the tails because the permutations may destroy the existing structure in the original data.

Permutations Exercises

Read through the permutation book chapter.

We will use the following dataset to demonstrate the use of permutations:

library(dplyr)
library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/babies.txt"
filename <- basename(url)
download(url, destfile=filename)
babies <- read.table("babies.txt", header=TRUE)
bwt.nonsmoke <- filter(babies, smoke==0) %>% select(bwt) %>% unlist 
bwt.smoke <- filter(babies, smoke==1) %>% select(bwt) %>% unlist

Permutations Exercises #1

We will generate the following random variable based on a sample size of 10 and observe the following difference:

N=10
set.seed(1)
nonsmokers <- sample(bwt.nonsmoke , N)
smokers <- sample(bwt.smoke , N)
obs <- mean(smokers) - mean(nonsmokers)

The question is whether this observed difference is statistically significant. We do not want to rely on the assumptions needed for the normal or t-distribution approximations to hold, so instead we will use permutations. We will reshuffle the data and recompute the mean. We can create one permuted sample with the following code:

dat <- c(smokers,nonsmokers)
shuffle <- sample( dat )
smokersstar <- shuffle[1:N]
nonsmokersstar <- shuffle[(N+1):(2*N)]
mean(smokersstar)-mean(nonsmokersstar)
## [1] -8.5

The last value is one observation from the null distribution we will construct. Set the seed at 1, and then repeat the permutation 1,000 times to create a null distribution. What is the permutation derived p-value for our observation?

set.seed(1)
null <- replicate(1000, {
  shuffle <- sample( dat )
  smokersstar <- shuffle[1:N]
  nonsmokersstar <- shuffle[(N+1):(2*N)]
  mean(smokersstar)-mean(nonsmokersstar)
})
( sum( abs(null) >= abs(obs)) +1 ) / ( length(null)+1 ) 
## [1] 0.05694306
##we add the 1s to avoid p-values=0 but we also accept:
( sum( abs(null) >= abs(obs)) ) / ( length(null) )
## [1] 0.056

A:0.05694306

Permutations Exercises #2

Repeat the above exercise, but instead of the differences in mean, consider the differences in median obs <- median(smokers) - median(nonsmokers). What is the permutation based p-value?

N=10
set.seed(1)
nonsmokers <- sample(bwt.nonsmoke , N)
smokers <- sample(bwt.smoke , N)
obs <- median(smokers) - median(nonsmokers)
dat <- c(smokers,nonsmokers)
    
set.seed(1)
null <- replicate(1000, {
  shuffle <- sample( dat )
  smokersstar <- shuffle[1:N]
  nonsmokersstar <- shuffle[(N+1):(2*N)]
  median(smokersstar)-median(nonsmokersstar)
})
( sum( abs(null) >= abs(obs)) +1 ) / ( length(null)+1 ) 
## [1] 0.02897103
##we add the 1s to avoid p-values=0 but we also accept:
( sum( abs(null) >= abs(obs)) ) / ( length(null) )
## [1] 0.028

A:0.02897103

Association Tests

The statistical tests we have covered up to now leave out a substantial portion of life science projects. Specifically, we are referring to data that is binary, categorical and ordinal. To give a very specific example, consider genetic data where you have two groups of genotypes (AA/Aa or aa) for cases and controls for a given disease. The statistical question is if genotype and disease are associated. As in the examples we have been studying previously, we have two populations (AA/Aa and aa) and then numeric data for each, where disease status can be coded as 0 or 1. So why can’t we perform a t-test? Note that the data is either 0 (control) or 1 (cases). It is pretty clear that this data is not normally distributed so the t-distribution approximation is certainly out of the question. We could use CLT if the sample size is large enough; otherwise, we can use association tests.

Lady Tasting Tea

One of the most famous examples of hypothesis testing was performed by R.A. Fisher. An acquaintance of Fisher’s claimed that she could tell if milk was added before or after tea was poured. Fisher gave her four pairs of cups of tea: one with milk poured first, the other after. The order was randomized. Say she picked 3 out of 4 correctly; do we believe she has a special ability? Hypothesis testing helps answer this question by quantifying what happens by chance. This example is called the “Lady tasting tea” experiment (and, as it turns out, Fisher’s friend was a scientist herself, Muriel Bristol).

The basic question we ask is: if the tester is actually guessing, what are the chances that she gets 3 or more correct? Just as we have done before, we can compute a probability under the null hypothesis that she is guessing, knowing there are four of each type. If we assume this null hypothesis, we can think of this particular example as picking 4 balls out of an urn with 4 green (correct answer) and 4 red (incorrect answer) balls.

Under the null hypothesis that she is simply guessing, each ball has the same chance of being picked. We can then use combinatorics to figure out each probability. The probability of picking 3 is \({4 \choose 3} {4 \choose 1} / {8 \choose 4} = 16/70\). The probability of picking all 4 correct is \({4 \choose 4} {4 \choose 0}/{8 \choose 4}= 1/70\). Thus, the chance of observing a 3 or something more extreme, under the null hypothesis, is \(\approx 0.24\). This is the p-value. The procedure that produced this p-value is called Fisher’s exact test and it uses the hypergeometric distribution.
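
We can reproduce these numbers in R, either with choose or with the hypergeometric distribution via dhyper:

choose(4,3)*choose(4,1)/choose(8,4) ## probability of exactly 3 correct: 16/70
choose(4,4)*choose(4,0)/choose(8,4) ## probability of exactly 4 correct: 1/70
sum(dhyper(3:4, 4, 4, 4))           ## p-value: 3 or more correct, about 0.243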

Two By Two Tables

The data from the experiment above can be summarized by a 2 by 2 table:

tab <- matrix(c(3,1,1,3),2,2)
rownames(tab)<-c("Poured Before","Poured After")
colnames(tab)<-c("Guessed before","Guessed after")
tab
##               Guessed before Guessed after
## Poured Before              3             1
## Poured After               1             3

The function fisher.test performs the calculations above, and its results can be obtained like this:

fisher.test(tab,alternative="greater")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.3135693       Inf
## sample estimates:
## odds ratio 
##   6.408309

Chi-square Test

Genome-wide association studies (GWAS) have become ubiquitous in biology. One of the main statistical summaries used in these studies is the Manhattan plot. The y-axis of a Manhattan plot typically represents the negative of log (base 10) of the p-values obtained for association tests applied at millions of single nucleotide polymorphisms (SNP). The x-axis is typically organized by chromosome (chromosome 1 to 22, X, Y, etc.). These p-values are obtained in a similar way to the test performed on the tea taster. However, in that example the number of green and red balls is experimentally fixed and the number of answers given for each category is also fixed. Another way to say this is that the sum of the rows and the sum of the columns are fixed. This defines constraints on the possible ways we can fill the 2 by 2 table and also permits us to use the hypergeometric distribution. In general, this is not the case. Nonetheless, there is another approach, the Chi-squared test, which is described below.

Imagine we have 250 individuals, where some of them have a given disease and the rest do not. We observe that 20% of the individuals that are homozygous for the minor allele (aa) have the disease compared to 10% of the rest. Would we see this again if we picked another 250 individuals?

Let’s create a dataset with these percentages:

disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",200),rep("aa",50)),
                levels=c("AA/Aa","aa"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up
head(dat)
##     disease genotype
## 53  control    AA/Aa
## 29  control    AA/Aa
## 37  control    AA/Aa
## 77  control    AA/Aa
## 248   cases       aa
## 130 control    AA/Aa

To create the appropriate two by two table, we will use the function table. This function tabulates the frequency of each level in a factor. For example:

table(genotype)
## genotype
## AA/Aa    aa 
##   200    50
table(disease)
## disease
## control   cases 
##     220      30

If you feed the function two factors, it will tabulate all possible pairs and thus create the two by two table:

tab <- table(genotype,disease)
tab
##         disease
## genotype control cases
##    AA/Aa     180    20
##    aa         40    10

Note that you can feed table \(n\) factors and it will tabulate all \(n\)-way tables.

The typical statistic we use to summarize these results is the odds ratio (OR). We compute the odds of having the disease if you are an “aa”: 10/40, the odds of having the disease if you are an “AA/Aa”: 20/180, and take the ratio: \((10/40) / (20/180)\)

(tab[2,2]/tab[2,1]) / (tab[1,2]/tab[1,1])
## [1] 2.25

To compute a p-value, we don’t use the OR directly. We instead assume that there is no association between genotype and disease, and then compute what we expect to see in each cell of the table (note: this use of the word “cell” refers to elements in a matrix or table and has nothing to do with biological cells). Under the null hypothesis, the group with 200 individuals and the group with 50 individuals were each randomly assigned the disease with the same probability. If this is the case, then the probability of disease is:

p=mean(disease=="cases")
p
## [1] 0.12

The expected table is therefore:

expected <- rbind(c(1-p,p)*sum(genotype=="AA/Aa"),
                  c(1-p,p)*sum(genotype=="aa"))
dimnames(expected)<-dimnames(tab)
expected
##         disease
## genotype control cases
##    AA/Aa     176    24
##    aa         44     6

The Chi-square test uses an asymptotic result (similar to the CLT) related to the sums of independent binary outcomes. Using this approximation, we can compute the probability of seeing a deviation from the expected table as big as the one we saw. The p-value for this table is:

chisq.test(tab)$p.value
## [1] 0.08857435

Large Samples, Small p-values

As mentioned earlier, reporting only p-values is not an appropriate way to report the results of your experiment. Many genetic association studies seem to overemphasize p-values. They have large sample sizes and report impressively small p-values. Yet when one looks closely at the results, the odds ratios are quite modest: barely bigger than 1. In this case, the difference of having genotype AA/Aa or aa might not change an individual’s risk for a disease in an amount which is practically significant, in that one might not change one’s behavior based on the small increase in risk.

There is not a one-to-one relationship between the odds ratio and the p-value. To demonstrate, we recalculate the p-value keeping all the proportions identical, but increasing the sample size by 10, which reduces the p-value substantially (as we saw with the t-test under the alternative hypothesis):

tab<-tab*10
chisq.test(tab)$p.value
## [1] 1.219624e-09

Confidence Intervals For The Odds Ratio

Computing confidence intervals for the OR is not mathematically straightforward. Unlike other statistics, for which we can derive useful approximations of their distributions, the OR is not only a ratio, but a ratio of ratios. Therefore, there is no simple way of using, for example, the CLT.

One approach is to use the theory of generalized linear models which provides estimates of the log odds ratio, rather than the OR itself, that can be shown to be asymptotically normal. Here we provide R code without presenting the theoretical details (for further details please see a reference on generalized linear models such as Wikipedia or McCullagh and Nelder, 1989):

fit <- glm(disease~genotype,family="binomial",data=dat)
coeftab<- summary(fit)$coef
coeftab
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -2.1972246  0.2356828 -9.322803 1.133070e-20
## genotypeaa   0.8109302  0.4249074  1.908487 5.632834e-02

The second row of the table shown above gives you the estimate and SE of the log odds ratio. Mathematical theory tells us that this estimate is approximately normally distributed. We can therefore form a confidence interval and then exponentiate to provide a confidence interval for the OR.

ci <- coeftab[2,1] + c(-2,2)*coeftab[2,2]
exp(ci)
## [1] 0.9618616 5.2632310

The confidence interval includes 1, which is consistent with the p-value being bigger than 0.05. Note that the p-value shown here is based on a different approximation to the one used by the Chi-square test, which is why they differ.

Association Tests Exercises

In the previous video, Rafa showed how to calculate a Chi-square test from a table. Here we will show how to generate the table from data which is in the form of a dataframe, so that you can then perform an association test to see if two columns have an enrichment (or depletion) of shared occurrences.

Download the assoctest.csv file into your R working directory, and then read it into R:

d = read.csv("assoctest.csv")
head(d)
##   allele case
## 1      0    1
## 2      0    0
## 3      1    1
## 4      1    1
## 5      1    1
## 6      0    0

This dataframe reflects the allele status (either AA/Aa or aa) and the case/control status for 72 individuals.

Association Tests Exercises #1

Compute the Chi-square test for the association of genotype with case/control status (using the table() function and the chisq.test() function). Examine the table to see if it looks enriched for association by eye. What is the X-squared statistic?

tab <- table(d$allele, d$case)
tab
##    
##      0  1
##   0 17 17
##   1 10 28
chisq.test(tab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 3.3437, df = 1, p-value = 0.06746
chisq.test(tab)$p.value
## [1] 0.06746466

A:3.3437

Association Tests Exercises #2

Compute the Fisher’s exact test ( fisher.test() ) for the same table. What is the p-value?

tab = table(d$allele, d$case)

fisher.test(tab)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 0.05194
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.940442 8.493001
## sample estimates:
## odds ratio 
##   2.758532

A:0.05194

Week 4

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a key part of what we do when we analyze data. We start out every analysis with EDA to familiarize ourselves with the data. First, we want to check to see if some of the samples or experiments produce unusable data and take them out of the analysis. We also perform EDA at the end to check for nonsensical results. In fact, one should perform EDA all throughout the analysis. For example, before applying any of the techniques we have learned, we want to make sure the data are in agreement with the necessary assumptions. We will introduce some basic EDA tools such as the histogram, the Q-Q plot, scatter plots, the boxplot, stratification, log transforms, and several summary statistics.

Histogram

Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” -John W. Tukey

Biases, systematic errors and unexpected variability are common in genomics data. Failure to discover these problems often leads to flawed analyses and false discoveries. As an example, consider that experiments sometimes fail and not all data processing pipelines are designed to detect these failures. Yet these pipelines still give you an answer, and from the final results it may be hard or impossible to notice that an error was made. In later modules we will cover many other examples.

Graphing data is a powerful approach to detecting these problems. We refer to this as exploratory data analysis (EDA). Many important methodological contributions to genomics data analysis were initiated as discoveries made via EDA. We will show some useful exploratory plots for gene expression data measured with microarrays and NGS. We start with a general introduction to EDA using height data.

Histograms

We can think of any given dataset as a list of numbers. Suppose you have measured the heights of all men in a population. Imagine you need to describe these numbers to someone who has no idea what these heights are, say an alien that has never visited Earth.

library(UsingR)
x=father.son$fheight
head(father.son)
##    fheight  sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221

One approach is to simply list out all the numbers for the alien to see. Here are 20 randomly selected heights out of the 1,078 in the list.

round(sample(x,20),1)
##  [1] 70.5 70.9 69.8 67.2 66.0 69.8 65.9 68.5 68.3 69.2 65.8 68.1 66.6 66.3
## [15] 69.1 64.5 69.2 62.9 68.3 68.4

From scanning through these numbers, we start getting a rough idea of what the entire list looks like, but it is certainly inefficient. We can quickly improve on this approach by creating bins, say by rounding each value to its nearest integer, and reporting the number of individuals in each bin. A plot of these counts is called a histogram.
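
The bin counts themselves are easy to tabulate; a quick sketch:

## counts per one-inch bin: these are the heights of the histogram bars
head(table(round(x)))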

hist(x,breaks=seq(floor(min(x)),ceiling(max(x))),main="",xlab="Height")

Showing this plot to the alien is much more informative than showing the numbers. Note that with this simple plot we can approximate the number of individuals in any given interval. For example, there are about 70 individuals over six feet (72 inches) tall.

The empirical cumulative distribution function reports, for any value, the proportion of the list at or below that value; plotting it is another way to show the entire distribution:

xs <- seq(floor(min(x)),ceiling(max(x)),0.1)
plot(xs,ecdf(x)(xs),type="l",xlab="x=Height",ylab="F(x)")

Histogram Exercises

hist_exercise-1

Histogram Exercises #1

Given the above histogram, how many people are between the ages of 35 and 45?

#sum(age>=35 & age<45)

A:6

Normal approximation

If instead of the total counts we report the proportions, then the histogram is a probability distribution. The probability distribution we see above approximates one that is very common in nature: the bell curve, also known as the normal distribution or Gaussian distribution. When the histogram of a list of numbers approximates the normal distribution, we can use a convenient mathematical formula to approximate the proportion of individuals in any given interval:

\[ \mbox{Pr}(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left( \frac{-(x-\mu)^2}{2 \sigma^2} \right)} \, dx \]

Here \(\mu\) and \(\sigma\) are referred to as the mean and the standard deviation. If this approximation holds for our list, then the population mean and standard deviation of our list can be used in the formula above. To see this with an example, remember that above we noted that about 70 individuals, or 6% of our population, were taller than 6 feet. The normal approximation works well:

1-pnorm(72,mean(x),sd(x)) 
## [1] 0.05806108

A very useful characteristic of this approximation is that one only needs to know \(\mu\) and \(\sigma\) to describe the entire distribution. All we really have to tell our alien friend is that heights follow a normal distribution with an average height of 68 inches and a standard deviation of 3 inches. From this we can compute the proportion of individuals in any interval.
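
For example, a minimal sketch using the reported summary (mean 68, SD 3):

mu <- 68; sigma <- 3
## proportion of men between 67 and 71 inches tall
pnorm(71, mu, sigma) - pnorm(67, mu, sigma)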

Quantile Quantile Plots (Q-Q Plots)

To corroborate that a theoretical distribution, for example the normal distribution, is in fact a good approximation, we can use quantile-quantile plots (qq-plots). Quantiles are best understood by considering the special case of percentiles. The p-th percentile of a distribution is defined as the number q that is bigger than p% of the numbers (so it is the inverse of the cumulative distribution function we defined earlier). For example, the 50th percentile is the median. We can compute the percentiles for our list of heights:

library(rafalib)
data(father.son,package="UsingR") ##available from CRAN
x <- father.son$fheight

and compare them to the percentiles of the normal distribution:

ps <- ( seq(0,99) + 0.5 )/100 
qs <- quantile(x, ps)
normalqs <- qnorm(ps, mean(x), popsd(x))
plot(normalqs,qs,xlab="Normal percentiles",ylab="Height percentiles")
abline(0,1) ##identity line

First example of qqplot. Here we compute the theoretical quantiles ourselves.

Note how close these values are to the identity line. Also note that we can create these qq-plots with less code (the plot below has more points than the one we constructed manually, so tail behavior can be seen more clearly).

qqnorm(x)
qqline(x) 

Second example of qqplot. Here we use the function qqnorm which computes the theoretical normal quantiles automatically.

However, the qqnorm function plots against a standard normal distribution. This is why the line has slope popsd(x) and intercept mean(x).

In the example above, the points match the line very well. In fact, we can run Monte Carlo simulations to see plots like this for data known to be normally distributed.

n <-1000
x <- rnorm(n)
qqnorm(x)
qqline(x)

Example of the qqnorm function. Here we apply it to numbers generated to follow a normal distribution.

We can also get a sense for how non-normally distributed data will look in a qq-plot. Here we generate data from the t-distribution with different degrees of freedom. Notice that the smaller the degrees of freedom, the fatter the tails. We call these “fat tails” because if we plotted an empirical density or histogram, the density at the extremes would be higher than the theoretical curve. In the qq-plot, this can be seen in that the curve is lower than the identity line on the left side and higher on the right side. This means that there are more extreme values than predicted by the theoretical density plotted on the x-axis.

dfs <- c(3,6,12,30)
mypar(2,2)
for(df in dfs){
  x <- rt(1000,df)
  qqnorm(x,xlab="t quantiles",main=paste0("d.f=",df),ylim=c(-6,6))
  qqline(x)
}

We generate t-distributed data for four degrees of freedom and make qqplots against normal theoretical quantiles.

QQ-plot Exercises

Download this RData file to your working directory: link. Then load the data into R with the following command:

load("skew.RData") #You should have a 1000 x 9 dimensional matrix 'dat'
dim(dat)
## [1] 1000    9
str(dat)
##  num [1:1000, 1:9] -0.626 0.184 -0.836 1.595 0.33 ...
names(dat)
## NULL

Using QQ-plots, compare the distribution of each column of the matrix to a normal. That is, use qqnorm() on each column. To accomplish this quickly, you can use the following line of code to set up a grid for 3x3=9 plots. (“mfrow” means we want a multifigure grid filled in row-by-row. Another choice is mfcol.)

par(mfrow = c(3,3))
#Then you can use a for loop, to loop through the columns, and display one qqnorm() plot at a time. 
for (i in 1:9) {
  x <- dat[,i]
  qqnorm(x,xlab="quantiles",main=paste0("Col No=",i))
  qqline(x)
}

#plotting multiple histograms
par(mfrow = c(3,3))
#Loop through the columns, displaying one histogram at a time.
for (i in 1:9) {
  x <- dat[,i]
  hist(x, xlab="X", main=paste0("Col.No=",i))
}

Identify the two columns which are skewed.

Examine each of these two columns using a histogram. Note which column has “positive skew”, in other words the histogram shows a long tail to the right (toward larger values). Note which column has “negative skew”, that is, a long tail to the left (toward smaller values). Note that positive skew looks like an upward-bending curve in a qqnorm() plot, while negative skew looks like a downward-bending curve.

You can use the following line to reset your graph to just show one at a time: par(mfrow=c(1,1))

QQ-plot Exercises #1

Which column has positive skew (a long tail to the right)?

par(mfrow=c(1,1))
hist(dat[,4])

A:4

QQ-plot Exercises #2

Which column has negative skew (a long tail to the left)?

hist(dat[,9])

A:9

Boxplots

Data is not always normally distributed. Income is a widely cited example. In these cases, the average and standard deviation are not necessarily informative since one can’t infer the distribution from just these two numbers. The properties described above are specific to the normal distribution. For example, the normal distribution does not seem to be a good approximation for the direct compensation of 199 United States CEOs in the year 2000.

data(exec.pay,package="UsingR")
mypar(1,2)
hist(exec.pay) 
qqnorm(exec.pay)
qqline(exec.pay)

Histogram and QQ-plot of executive pay.

In addition to qq-plots, a practical summary of data is to compute three percentiles: the 25th, the 50th (the median), and the 75th. A boxplot shows these three values, along with whiskers that cover the points falling within 1.5 times the interquartile range (the 75th percentile minus the 25th percentile) of the box. Values outside this range are shown as individual points and are sometimes referred to as outliers.
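
These quantities are easy to compute directly; a quick sketch with the executive pay data loaded above:

qs <- quantile(exec.pay, c(0.25, 0.5, 0.75)) ## the three percentiles
qs
iqr <- unname(qs[3] - qs[1]) ## interquartile range
c(lower = unname(qs[1]) - 1.5*iqr, upper = unname(qs[3]) + 1.5*iqr) ## whisker limits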

boxplot(exec.pay, ylab="10,000s of dollars", ylim=c(0,400))

Simple boxplot of executive pay.

Here we show just one boxplot. However, one of the great benefits of boxplots is that we could easily show many distributions in one plot, by lining them up, side by side. We will see several examples of this throughout the book.

Boxplot Exercises

The InsectSprays data set measures the counts of insects in agricultural experimental units treated with different insecticides. This dataset is included in R, and you can examine it by typing:

head(InsectSprays)
##   count spray
## 1    10     A
## 2     7     A
## 3    20     A
## 4    14     A
## 5    14     A
## 6    12     A

Try out two equivalent ways of drawing boxplots in R, using the InsectSprays dataset. Below is pseudocode, which you should modify to work with the InsectSprays dataset.

  1. using split:

boxplot(split(values, factor))

  2. using a formula:

boxplot(values ~ factor)

Boxplot Exercises #1

Which spray seems the most effective (has the lowest median)?

#Using split
boxplot(split(InsectSprays$count, InsectSprays$spray))

#using formula
boxplot(InsectSprays$count ~ InsectSprays$spray)

A:C

Boxplot Exercises #2

Let’s consider a random sample of finishers from the New York City Marathon in 2002. This dataset can be found in the UsingR package. Load the library and then load the nym.2002 dataset.

library(dplyr)
data(nym.2002, package="UsingR")

Use boxplots and histograms to compare the finishing times of males and females. Which of the following best describes the difference?

data(nym.2002, package="UsingR")
str(nym.2002)
## 'data.frame':    1000 obs. of  5 variables:
##  $ place : num  3592 13853 12256 10457 9686 ...
##  $ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 2 2 ...
##  $ age   : num  52 40 31 33 33 40 30 27 42 48 ...
##  $ home  : Factor w/ 165 levels "","--","AB","ABE",..: 50 107 47 91 107 103 23 50 100 15 ...
##  $ time  : num  217 273 265 256 252 ...
head(nym.2002)
##       place gender age home     time
## 3475   3592   Male  52  GBR 217.4833
## 13594 13853 Female  40   NY 272.5500
## 12012 12256   Male  31  FRA 265.2833
## 10236 10457 Female  33   MI 256.1500
## 9476   9686   Male  33   NY 252.2500
## 1720   1784   Male  40   NJ 201.9667
library(dplyr)
males<-filter(nym.2002, gender=="Male") 
females<-filter(nym.2002, gender=="Female") 

library(rafalib)
mypar(1,3)

boxplot(females$time, males$time)
hist(females$time,xlim=range(nym.2002$time))
hist(males$time,xlim=range(nym.2002$time))

A:Males and females have similar right-skewed distributions, with the male distribution shifted about 20 minutes to the left.

Scatterplots And Correlation

The methods described above relate to univariate variables. In the biomedical sciences, it is common to be interested in the relationship between two or more variables. A classic example is the father/son height data used by Francis Galton to understand heredity. If we were to summarize these data, we could use the two averages and two standard deviations since both distributions are well approximated by the normal distribution. This summary, however, fails to describe an important characteristic of the data.

data(father.son,package="UsingR")
x=father.son$fheight
y=father.son$sheight
plot(x,y,xlab="Father's height in inches",ylab="Son's height in inches",main=paste("correlation =",signif(cor(x,y),2)))

Heights of father and son pairs plotted against each other.

The scatter plot shows a general trend: the taller the father, the taller the son. A summary of this trend is the correlation coefficient, which in this case is 0.5. We will motivate this statistic by trying to predict the son’s height using the father’s height.

Stratification

Suppose we are asked to guess the height of a randomly selected son. The average height, 68.7 inches, is the value with the highest proportion (see histogram) and would be our prediction. But what if we are told that the father is 72 inches tall, do we still guess 68.7?

The father is taller than average. Specifically, he is 1.75 standard deviations taller than the average father. So should we predict that the son is also 1.75 standard deviations taller? It turns out that this would be an overestimate. To see this, we look at all the sons with fathers who are about 72 inches. We do this by stratifying the father heights.

groups <- split(y,round(x)) 
boxplot(groups)

Boxplot of son heights stratified by father heights.

print(mean(y[ round(x) == 72]))
## [1] 70.67719

Stratification followed by boxplots lets us see the distribution of each group. The average height of sons with fathers that are 72 inches tall is 70.7 inches. We also see that the medians of the strata appear to follow a straight line (remember the middle line in the boxplot shows the median, not the mean). This line is similar to the regression line, with a slope that is related to the correlation, as we will learn below.

Bivariate Normal Distribution

Correlation is a widely used summary statistic in the life sciences. However, it is often misused or misinterpreted. To properly interpret correlation we actually have to understand the bivariate normal distribution.

A pair of random variables \((X,Y)\) is considered to be approximated by bivariate normal when the proportion of values below, for example \(a\) and \(b\), is approximated by this expression:

\[ \mbox{Pr}(X<a,Y<b) = \int_{-\infty}^{a} \int_{-\infty}^{b} \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp{ \left( -\frac{1}{2(1-\rho^2)} \left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)+ \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right) } \, dy \, dx \]

This may seem like a rather complicated equation, but the concept behind it is rather intuitive. An alternative definition is the following: fix a value \(x\) and look at all the pairs \((X,Y)\) for which \(X=x\). Generally, in statistics we call this exercise conditioning. We are conditioning \(Y\) on \(X\). If a pair of random variables is approximated by a bivariate normal distribution, then the distribution of \(Y\) conditioned on \(X=x\) is approximated with a normal distribution, no matter what \(x\) we choose. Let’s see if this holds with our height data. We show 4 different strata:

groups <- split(y,round(x)) 
mypar(2,2)
for(i in c(5,8,11,14)){
  qqnorm(groups[[i]],main=paste0("X=",names(groups)[i]," strata"),
         ylim=range(y),xlim=c(-2.5,2.5))
  qqline(groups[[i]])
}

qqplots of son heights for four strata defined by father heights.

Now we come back to defining correlation. Mathematical statistics tells us that when two variables follow a bivariate normal distribution, then for any given value of \(x\), the average of the \(Y\) in pairs for which \(X=x\) is:

\[ \mu_Y + \rho \frac{X-\mu_X}{\sigma_X}\sigma_Y \]

Note that this is a line with slope \(\rho \frac{\sigma_Y}{\sigma_X}\). This is referred to as the regression line. If the SDs are the same, then the slope of the regression line is the correlation \(\rho\). Therefore, if we standardize \(X\) and \(Y\), the correlation is the slope of the regression line.
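
A quick numerical check with the height data (the least squares slope from lm() should match \(\rho \sigma_Y / \sigma_X\) exactly):

rho <- cor(x,y)
rho * sd(y) / sd(x) ## slope implied by the correlation
lm(y ~ x)$coef[2] ## least squares slope: identical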

Another way to see this is to form a prediction \(\hat{Y}\): for every SD away from the mean in \(x\), we predict \(\rho\) SDs away for \(Y\):

\[ \frac{\hat{Y} - \mu_Y}{\sigma_Y} = \rho \frac{x-\mu_X}{\sigma_X} \]

If there is perfect correlation, we predict the same number of SDs. If there is 0 correlation, then we don’t use \(x\) at all. For values between 0 and 1, the prediction is somewhere in between. For negative values, we simply predict in the opposite direction.

To confirm that the above approximations hold in this case, let’s compare the mean of each strata to the identity line and the regression line:

x=( x-mean(x) )/sd(x)
y=( y-mean(y) )/sd(y)
means=tapply(y, round(x*4)/4, mean)
fatherheights=as.numeric(names(means))
mypar(1,1)
plot(fatherheights, means, ylab="average of strata of son heights", ylim=range(fatherheights))
abline(0, cor(x,y))

Average son height of each strata plotted against father heights defining the strata

Variance explained

The standard deviation of the conditional distribution described above is:

\[ \sqrt{1-\rho^2} \sigma_Y \]

This is where statements like “\(X\) explains \(\rho^2 \times 100\)% of the variation in \(Y\)” come from: the variance of \(Y\) is \(\sigma_Y^2\) and, once we condition on \(X\), it goes down to \((1-\rho^2) \sigma^2_Y\). It is important to remember that the “variance explained” statement only makes sense when the data are approximated by a bivariate normal distribution.
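
As a quick numeric illustration with the height data, a correlation of about 0.5 means fathers’ heights “explain” roughly 25% of the variation in sons’ heights:

rho <- cor(x,y)
rho^2 ## fraction of variance explained
1 - rho^2 ## fraction remaining after conditioning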

Spearman’s correlation

Just like the average and standard deviation are not good summaries when the data is not well approximated by the normal distribution, the correlation is not a good summary when pairs of lists are not approximated by the bivariate normal distribution. Examples include cases in which one variable is related to another by a parabolic function. Another, more common example is data with outliers or extreme values.

a=rnorm(100);a[1]=10
b=rnorm(100);b[1]=11
plot(a,b,main=paste("correlation =",signif(cor(a,b),2)))

In the example above, the data are not associated, but for one pair both values are very large. The correlation here is about 0.5. This is driven by just that one point: taking it out lowers the correlation to about 0. An alternative summary for cases with outliers or extreme values is Spearman’s correlation, which is based on ranks instead of the values themselves.
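
For instance, with the simulated a and b above, a quick sketch:

cor(a, b, method="spearman") ## rank-based: close to 0
cor(rank(a), rank(b)) ## the same computation done by hand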

Scatterplot Exercises

Here we will use the plots we’ve learned about to explore a dataset: some stats on a random sample of runners from the New York City Marathon in 2002. This dataset can be found in the UsingR package (used in the previous assessment). Load it with the following line:

data(nym.2002, package="UsingR")
head(nym.2002)
##       place gender age home     time
## 3475   3592   Male  52  GBR 217.4833
## 13594 13853 Female  40   NY 272.5500
## 12012 12256   Male  31  FRA 265.2833
## 10236 10457 Female  33   MI 256.1500
## 9476   9686   Male  33   NY 252.2500
## 1720   1784   Male  40   NJ 201.9667

Scatterplot Exercises #1

Use dplyr to create two new data frames: males and females, with the data for each gender. For males, what is the Pearson correlation between age and time to finish?

library(dplyr)
males <- filter(nym.2002, gender=="Male") 
females <- filter(nym.2002, gender=="Female") 

cor(males$age,males$time)
## [1] 0.2432273

A:0.2432273

Scatterplot Exercises #2

For females, what is the Pearson correlation between age and time to finish?

cor(females$age,females$time)
## [1] 0.2443156

A:0.2443156

Scatterplot Exercises #3

If we interpret these correlations without visualizing the data, we would conclude that the older we get, the slower we run marathons, regardless of gender. Look at scatterplots and boxplots of times stratified by age groups (20-25, 25-30, etc.). After examining the data, what is a more reasonable conclusion?

library(rafalib)
mypar(2,2)
plot(females$age, females$time)
plot(males$age, males$time)
group <- floor(females$age/5) * 5
boxplot(females$time~group)
group <- floor(males$age/5) * 5
boxplot(males$time~group)

A: Finish times are roughly constant up until sometime between ages 50 and 60, then runners get slower.

Symmetry of Log Ratios

Ratios are not symmetric. To see this, we will start by simulating the ratio of two positive random numbers, which will represent the expression of genes in two samples:

x <- 2^(rnorm(100))
y <- 2^(rnorm(100)) 
ratios <- x / y 

Reporting ratios or fold changes is common in the life sciences. Suppose you are studying ratio data showing, say, gene expression before and after treatment. You are given ratio data, so values larger than 1 imply gene expression was higher after the treatment. If the treatment has no effect, we should see as many values below 1 as above 1. A histogram seems to suggest that the treatment does in fact have an effect:

mypar(1,2)
hist(ratios)

logratios <- log2(ratios)
hist(logratios)

Histogram of original (left) and log (right) ratios.

The problem here is that ratios are not symmetric around 1. For example, 1/32 is much closer to 1 than 32/1. Taking logs takes care of this problem. The logs of ratios are of course symmetric around 0 because:

\[\log(x/y) = \log(x)-\log(y) = -(\log(y)-\log(x)) = -\log(y/x)\]

As demonstrated by these simple plots:
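
The source does not include the code for these plots; a minimal sketch that reproduces them, assuming the ratios are taken to be powers of 2 from 1/32 to 32:

pow2 <- 2^(-5:5) ## the ratios 1/32, 1/16, ..., 16, 32
mypar(1,2)
hist(pow2, main="", xlab="Ratio")
hist(log2(pow2), main="", xlab="Log2 ratio")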


Histogram of original (left) and log (right) powers of 2 seen as ratios.

In the life sciences, the log transformation is also commonly used because (multiplicative) fold changes are the most widely used quantification of interest. Note that a fold change of 100 can be a ratio of 100/1 or 1/100. However, 1/100 is much closer to 1 (no fold change) than 100 is: ratios are not symmetric about 1.

Symmetry of Log Ratios Exercises

In the previous video, we saw that multiplicative changes are symmetric around 0 when we are on the logarithmic scale. In other words, if we use the log scale, 1/2 times a number x, and 2 times a number x, are equally far away from x. We will explore this with the NYC marathon data.

Create a vector time of the sorted times:

time = sort(nym.2002$time)
head(time)
## [1] 147.3333 156.0000 157.5833 163.9333 164.5000 165.5167

Symmetry of Log Ratios Exercises #1

What is the fastest time divided by the median time?

min(time) / median(time)
## [1] 0.5605402
#or
time[1]/median(time)
## [1] 0.5605402

A:0.5605402

Symmetry of Log Ratios Exercises #2

What is the slowest time divided by the median time?

max(time) / median(time)
## [1] 2.156368

A:2.156368

Compare the following two plots.

#1) A plot of the ratio of times to the median time, with horizontal lines at twice as fast as the median time, and twice as slow as the median time.

plot(time/median(time), ylim=c(1/4,4))
abline(h=c(1/2,1,2))

#2) A plot of the log2 ratio of times to the median time. The horizontal lines indicate the same as above: twice as fast and twice as slow.
plot(log2(time/median(time)),ylim=c(-2,2))
abline(h=-1:1)

#Note that the lines are equally spaced in Figure #2.

Plots To Avoid

This section is based on a talk by Karl W. Broman titled “How to Display Data Badly”, in which he described how the default plots offered by Microsoft Excel “obscure your data and annoy your readers” (here is a link to a collection of Karl Broman’s talks). His lecture was inspired by the 1984 paper by H. Wainer: How to display data badly. American Statistician 38(2): 137–147. Dr. Wainer was the first to elucidate the principles of the bad display of data. However, according to Karl, “The now widespread use of Microsoft Excel has resulted in remarkable advances in the field.” Here we show examples of “bad plots” and how to improve them in R.

General principles

The aim of good data graphics is to display data accurately and clearly. According to Karl, some rules for displaying data badly are:

  • Display as little information as possible.
  • Obscure what you do show (with chart junk).
  • Use pseudo-3D and color gratuitously.
  • Make a pie chart (preferably in color and 3D).
  • Use a poorly chosen scale.
  • Ignore significant figures.

Pie charts

Let’s say we want to report the results from a poll asking about browser preference (taken in August 2013). The standard way of displaying these is with a pie chart:

browsers <- c(Opera=1, Safari=9, Firefox=20, IE=26, Chrome=44) ## poll results, defined here so the code runs
pie(browsers,main="Browser Usage (August 2013)")

Pie chart of browser usage.

Nonetheless, as stated by the help file for the pie function:

“Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”

To see this, look at the figure above and try to determine the percentages just from looking at the plot. Unless the percentages are close to 25%, 50% or 75%, this is not so easy. Simply showing the numbers is not only clear, but also saves on printing costs.

browsers
##   Opera  Safari Firefox      IE  Chrome 
##       1       9      20      26      44

If you do want to plot them, then a barplot is appropriate. Here we add horizontal lines at every multiple of 10 and then redraw the bars:

barplot(browsers, main="Browser Usage (August 2013)", ylim=c(0,55))
abline(h=1:5 * 10)
barplot(browsers, add=TRUE)

Barplot of browser usage.

Notice that we can now pretty easily determine the percentages by following a horizontal line to the x-axis. Do avoid a 3D version since it obfuscates the plot, making it more difficult to find the percentages by eye.

3D version

Even worse than pie charts are donut plots.

Donut plot

The reason is that by removing the center, we remove one of the visual cues for determining the different areas: the angles. There is no reason to ever use a donut plot to display data.

Barplots as data summaries

While barplots are useful for showing percentages, they are incorrectly used to display data from two groups being compared. Specifically, barplots are created with height equal to the group means; an antenna is added at the top to represent standard errors. This plot is simply showing two numbers per group and the plot adds nothing:

Bad bar plots

Much more informative is to summarize with a boxplot. If the number of points is small enough, we might as well add them to the plot. When the number of points is too large for us to see them, just showing a boxplot is preferable. We can even set range=0 in boxplot to avoid drawing many outliers when the data is in the range of millions.

Let’s recreate these barplots as boxplots. First let’s download the data:

library(downloader)
filename <- "fig1.RData"
url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig1.RData"
if (!file.exists(filename)) download(url,filename)
load(filename)

Now we can simply show the points and make simple boxplots:

library(rafalib)
mypar()
dat <- list(Treatment=x,Control=y)
boxplot(dat,xlab="Group",ylab="Response",cex=0)
stripchart(dat,vertical=TRUE,method="jitter",pch=16,add=TRUE,col=1)

Treatment data and control data shown with a boxplot.

Notice how much more we see here: the center, spread, range and the points themselves. In the barplot, we only see the mean and the SE, and the SE has more to do with sample size, than with the spread of the data.

This problem is magnified when our data has outliers or very large tails. In the plot below, there appears to be very large and consistent differences between the two groups:

Bar plots with outliers

However, a quick look at the data demonstrates that this difference is mostly driven by just two points. A version showing the data in the log-scale is much more informative.

Start by downloading data:

library(downloader)
url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig3.RData"
filename <- "fig3.RData"
if (!file.exists(filename)) download(url, filename)
load(filename)

Now we can show data and boxplots in original scale and log-scale.

library(rafalib)
mypar(1,2)
dat <- list(Treatment=x,Control=y)

boxplot(dat,xlab="Group",ylab="Response",cex=0)
stripchart(dat,vertical=TRUE,method="jitter",pch=16,add=TRUE,col=1)

boxplot(dat,xlab="Group",ylab="Response",log="y",cex=0)
stripchart(dat,vertical=TRUE,method="jitter",pch=16,add=TRUE,col=1)

Data and boxplots for original data (left) and in log scale (right).

Show the scatter plot

The purpose of many statistical analyses is to determine relationships between two variables. Sample correlations are typically reported and sometimes plots are displayed to show this. However, showing just the regression line is one way to display your data badly since it hides the scatter. Surprisingly, plots such as the following are commonly seen.

Again start by loading data:

url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig4.RData"
filename <- "fig4.RData"
if (!file.exists(filename)) download(url, filename)
load(filename)
mypar(1,2)
plot(x,y,lwd=2,type="n")
fit <- lm(y~x)
abline(fit$coef,lwd=2)
b <- round(fit$coef,4)
text(78, 200, paste("y =", b[1], "+", b[2], "x"), adj=c(0,0.5))
rho <- round(cor(x,y),4) 
text(78, 187,expression(paste(rho," = 0.8567")),adj=c(0,0.5))

plot(x,y,lwd=2)
fit <- lm(y~x)
abline(fit$coef,lwd=2)

The plot on the left shows a regression line that was fitted to the data shown on the right. It is much more informative to show all the data.

When there is a large number of points, the scatter can be shown by binning in two dimensions and coloring the bins by the number of points in each bin. An example of this is the hexbin function in the hexbin package.

High correlation does not imply replication

When new technologies or laboratory techniques are introduced, we are often shown scatter plots and correlations from replicated samples. High correlations are used to demonstrate that the new technique is reproducible. Correlation, however, can be very misleading. Below is a scatter plot showing data from replicated samples run on a high throughput technology. This technology outputs 12,626 simultaneous measurements.

In the plot on the left, we see the original data which shows very high correlation. Yet the data follows a distribution with very fat tails. Furthermore, 95% of the data is below the green line. The plot on the right is in the log scale:


Gene expression data from two replicated samples. Left is in original scale and right is in log scale.

Note that we do not show the code here, as it is rather complex, but we explain how to make MA plots in a later chapter.

Although the correlation is reduced in the log scale, it is very close to 1 in both cases. Does this mean these measurements are well replicated? To examine how well the second vector reproduces the first, we need to study the differences, so we should plot those instead. In this plot, we plot the difference (in the log scale) versus the average:


MA plot of the same data shown above shows that data is not replicated very well despite a high correlation.

These are referred to as Bland-Altman plots, or MA plots in the genomics literature, and we will talk more about them later. “MA” stands for “minus” and “average” because in this plot the y-axis is the difference between two samples on the log scale (the log ratio is the difference of the logs), and the x-axis is the average of the samples on the log scale. In this plot, we see that the typical difference in the log (base 2) scale between two replicated measures is about 1. This means that when measurements should be the same, we will, on average, observe a 2-fold difference. We can now compare this variability to the differences we want to detect and decide if this technology is precise enough for our purposes.
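
For two replicate vectors, the plot itself takes only a few lines; a minimal sketch (r1 and r2 are hypothetical positive replicate measurements):

maplot <- function(r1, r2){
  M <- log2(r1) - log2(r2) ## minus: the log ratio
  A <- (log2(r1) + log2(r2)) / 2 ## average on the log scale
  plot(A, M, xlab="A (average log2)", ylab="M (log2 ratio)")
  abline(h=0)
}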

Barplots for paired data

A common task in data analysis is the comparison of two groups. When the dataset is small and data are paired, such as outcomes before and after a treatment, two-color barplots are unfortunately often used to display the results:

Barplot for two variables

There are better ways of showing these data to illustrate that there is an increase after treatment. One is to simply make a scatter plot, which shows that most points are above the identity line. Another alternative is to plot the differences against the before values.

set.seed(12201970)
before <- runif(6, 5, 8)
after <- rnorm(6, before*1.05, 2)
li <- range(c(before, after))
ymx <- max(abs(after-before))

mypar(1,2)
plot(before, after, xlab="Before", ylab="After",
     ylim=li, xlim=li)
abline(0,1, lty=2, col=1)

plot(before, after-before, xlab="Before", ylim=c(-ymx, ymx),
     ylab="Change (After - Before)", lwd=2)
abline(h=0, lty=2, col=1)

For two variables a scatter plot or a ‘rotated’ plot similar to an MA plot is much more informative.

Line plots are not a bad choice, although I find them harder to follow than the previous two. Boxplots show you the increase, but lose the paired information.

z <- rep(c(0,1), rep(6,2))
mypar(1,2)
plot(z, c(before, after),
     xaxt="n", ylab="Response",
     xlab="", xlim=c(-0.5, 1.5))
axis(side=1, at=c(0,1), c("Before","After"))
segments(rep(0,6), before, rep(1,6), after, col=1)     

boxplot(before,after,names=c("Before","After"),ylab="Response")

Another alternative is a line plot. If we don’t care about pairings, then the boxplot is appropriate.

Gratuitous 3D

The figure below shows three curves. Pseudo 3D is used, but it is not clear why. Maybe to separate the three curves? Notice how difficult it is to determine the values of the curves at any given point:

Gratuitous 3-D

This plot can be made better by simply using color to distinguish the three lines:

##First read data
url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig8dat.csv"
x <- read.csv(url)

##Now make alternative plot
plot(x[,1],x[,2],xlab="log Dose",ylab="Proportion survived",ylim=c(0,1),
     type="l",lwd=2,col=1)
lines(x[,1],x[,3],lwd=2,col=2)
lines(x[,1],x[,4],lwd=2,col=3)
legend(1,0.4,c("Drug A","Drug B","Drug C"),lwd=2, col=1:3)

This plot demonstrates that using color is more than enough to distinguish the three lines.

Ignoring important factors

In this example, we generate data with a simulation. We are studying a dose-response relationship in two groups: treatment and control. We have three series of measurements for both control and treatment. Comparing treatment and control using the common barplot hides the dose-response relationship:

Ignoring important factors

Instead, we should show each curve. We can use color to distinguish treatment and control, and dashed and solid lines to distinguish the original data from the mean of the three groups.
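
The simulation code is not shown in the source; a minimal sketch that creates objects with the shapes the plotting code below expects (all names and values here are assumptions):

set.seed(1)
x <- seq(0, 1, length.out=10) ## dose
y <- sapply(1:3, function(i) 0.2 + 0.6*x + rnorm(10, 0, 0.05)) ## control curves
z <- sapply(1:3, function(i) 0.1 + 0.3*x + rnorm(10, 0, 0.05)) ## treated curves
y1 <- y[,1] ## used by the plot only to set up the axes
ym <- rowMeans(y) ## mean control curve
zm <- rowMeans(z) ## mean treated curve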

plot(x, y1, ylim=c(0,1), type="n", xlab="Dose", ylab="Response") 
for(i in 1:3) lines(x, y[,i], col=1, lwd=1, lty=2)
for(i in 1:3) lines(x, z[,i], col=2, lwd=1, lty=2)
lines(x, ym, col=1, lwd=2)
lines(x, zm, col=2, lwd=2)
legend("bottomleft", lwd=2, col=c(1, 2), c("Control", "Treated"))

Because dose is an important factor, we show it in this plot.

Too many significant digits

By default, statistical software like R returns many significant digits. This does not mean we should report them all. Cutting and pasting directly from R is a bad idea, since you might end up showing a table such as the one below, comparing the heights of basketball players:

heights <- cbind(rnorm(8,73,3),rnorm(8,73,3),rnorm(8,80,3),
                 rnorm(8,78,3),rnorm(8,78,3))
colnames(heights)<-c("SG","PG","C","PF","SF")
rownames(heights)<- paste("team",1:8)
heights
##              SG       PG        C       PF       SF
## team 1 76.39843 76.21026 81.68291 75.32815 77.18792
## team 2 74.14399 71.10380 80.29749 81.58405 73.01144
## team 3 71.51120 69.02173 85.80092 80.08623 72.80317
## team 4 78.71579 72.80641 81.33673 76.30461 82.93404
## team 5 73.42427 73.27942 79.20283 79.71137 80.30497
## team 6 72.93721 71.81364 77.35770 81.69410 80.39703
## team 7 68.37715 73.01345 79.10755 71.24982 77.19851
## team 8 73.77538 75.59278 82.99395 75.57702 87.68162

We are reporting precision up to 0.00001 inches. Do you know of a tape measure with that much precision? This can be easily remedied:

round(heights,1)
##          SG   PG    C   PF   SF
## team 1 76.4 76.2 81.7 75.3 77.2
## team 2 74.1 71.1 80.3 81.6 73.0
## team 3 71.5 69.0 85.8 80.1 72.8
## team 4 78.7 72.8 81.3 76.3 82.9
## team 5 73.4 73.3 79.2 79.7 80.3
## team 6 72.9 71.8 77.4 81.7 80.4
## team 7 68.4 73.0 79.1 71.2 77.2
## team 8 73.8 75.6 83.0 75.6 87.7

Displaying data well

In general, you should follow these principles:

  • Be accurate and clear.
  • Let the data speak.
  • Show as much information as possible, taking care not to obscure the message.
  • Science not sales: avoid unnecessary frills (esp. gratuitous 3D).
  • In tables, every digit should be meaningful. Don’t drop ending 0’s.

Some further reading:

  • ER Tufte (1983) The visual display of quantitative information. Graphics Press.
  • ER Tufte (1990) Envisioning information. Graphics Press.
  • ER Tufte (1997) Visual explanations. Graphics Press.
  • WS Cleveland (1993) Visualizing data. Hobart Press.
  • WS Cleveland (1994) The elements of graphing data. CRC Press.
  • A Gelman, C Pasarica, R Dodhia (2002) Let’s practice what we preach: Turning tables into graphs. The American Statistician 56:121-130
  • NB Robbins (2004) Creating more effective graphs. Wiley.
  • Nature Methods columns

Misunderstanding Correlation (Advanced)

The use of correlation to summarize reproducibility has become widespread in, for example, genomics. Despite its English language definition, mathematically, correlation is not necessarily informative with regard to reproducibility. Here we briefly describe three major problems.

The most egregious related mistake is to compute correlations of data that is not approximated by a bivariate normal distribution. As described above, averages, standard deviations, and correlations are popular summary statistics for two-dimensional data because, for the bivariate normal distribution, these five parameters (two averages, two standard deviations, and the correlation) fully describe the distribution. However, there are many examples of data that are not well approximated by bivariate normal data. Raw gene expression data, for example, tends to have a distribution with a very fat right tail.

The standard way to quantify reproducibility between two sets of replicated measurements, say \(x_1,\dots,x_n\) and \(y_1,\dots,y_n\), is simply to compute the distance between them:

\[ \sqrt{ \sum_{i=1}^n d_i^2} \mbox{ with } d_i=x_i - y_i \]

This metric decreases as reproducibility improves, and it is 0 when the reproducibility is perfect. Another advantage of this metric is that if we divide the sum of squares by \(n\) before taking the square root, we can interpret the resulting quantity as the standard deviation of the \(d_1,\dots,d_n\), if we assume the \(d\) average out to 0. If the \(d\) can be considered residuals, then this quantity is equivalent to the root mean squared error (RMSE), a summary statistic that has been around for over a century. Furthermore, this quantity has the same units as our measurements, resulting in a more interpretable metric.
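
As a sketch, for any two replicate numeric vectors x and y of equal length:

d <- x - y ## replicate differences
sqrt(sum(d^2)) ## the distance above
sqrt(mean(d^2)) ## RMSE: same units as the measurements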

Another limitation of the correlation is that it does not detect cases that are not reproducible due to average changes. The distance metric does detect these differences. We can rewrite:

\[\frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^n [(x_i - \mu_x) - (y_i - \mu_y) + (\mu_x - \mu_y)]^2\]

with \(\mu_x\) and \(\mu_y\) the average of each list. Then we have:

\[\frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2 = \frac{1}{n} \sum_{i=1}^n (x_i-\mu_x)^2 + \frac{1}{n} \sum_{i=1}^n (y_i - \mu_y)^2 + (\mu_x-\mu_y)^2 - \frac{2}{n}\sum_{i=1}^n (x_i-\mu_x)(y_i - \mu_y) \]

For simplicity, if we assume that the variance of both lists is 1, then this reduces to:

\[\frac{1}{n} \sum_{i=1}^n (x_i - y_i)^2 = 2 + (\mu_x-\mu_y)^2 - 2\rho\]

with \(\rho\) the correlation. So we see the direct relationship between distance and correlation. However, an important difference is that the distance contains the term \((\mu_x-\mu_y)^2\) and, therefore, it can detect cases that are not reproducible due to large average changes.

Yet another reason correlation is not an optimal metric for reproducibility is the lack of units. To see this, we use a formula that relates the correlation of a variable \(x\) with that same variable plus what is interpreted here as a deviation: \(y = x + d\). The larger the variance of \(d\), the less \(x+d\) reproduces \(x\). Here the distance metric would depend only on the variance of \(d\) and would summarize reproducibility. However, correlation depends on the variance of \(x\) as well. If \(d\) is independent of \(x\), then

\[ \mbox{cor}(x,y) = \frac{1}{\sqrt{1+\mbox{var}(d)/\mbox{var}(x)}} \]

This suggests that correlations near 1 do not necessarily imply reproducibility. Specifically, irrespective of the variance of \(d\), we can make the correlation arbitrarily close to 1 by increasing the variance of \(x\).
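
A quick simulation sketch of this last point (the sample size and SDs are arbitrary choices):

set.seed(2)
d <- rnorm(10000, 0, 1) ## fixed deviation: var(d) = 1
for (sdx in c(1, 5, 25)) {
  x <- rnorm(10000, 0, sdx) ## increase var(x) while var(d) stays fixed
  cat("sd(x) =", sdx, " cor(x, x+d) =", cor(x, x + d), "\n")
}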

Plots to Avoid Exercises

Plots to Avoid Exercises #1

When is it appropriate to use pie charts or donut charts?

A:Never

Plots to Avoid Exercises #2

The use of pseudo-3D plots in the literature mostly adds:

A:Confusion.

Robust summaries

Median, MAD, and Spearman Correlation

The normal approximation is often useful when analyzing life sciences data. However, due to the complexity of the measurement devices, it is also common to mistakenly observe data points generated by an undesired process. For example, a defect on a scanner can produce a handful of very high intensities or a PCR bias can lead to a fragment appearing much more often than others. We therefore may have situations that are approximated by, for example, 99 data points from a standard normal distribution and one large number.

set.seed(1)
x <- rnorm(100,0,1) ##real distribution
x[23] <- 100 ##mistake made in the 23rd measurement
boxplot(x)

Normally distributed data with one point that is very large due to a mistake.

In statistics we refer to these types of points as outliers. A small number of outliers can throw off an entire analysis. For example, notice how this one point results in the sample mean and sample standard deviation being very far from 0 and 1, respectively.

cat("The average is",mean(x),"and the SD is",sd(x))
## The average is 1.108142 and the SD is 10.02938

The median

The median, defined as the point having half the data larger and half the data smaller, is a summary statistic that is robust to outliers. Note how much closer the median is to 0, the center of our actual distribution:

median(x)
## [1] 0.1684483

The median absolute deviation

The median absolute deviation (MAD) is a robust summary for the standard deviation. It is defined by computing the differences between each point and the median, and then taking the median of their absolute values:

\[ 1.4826 \times \mbox{median} \lvert X_i - \mbox{median}(X_i) \rvert \]

The number \(1.4826\) is a scaling factor such that the MAD is an unbiased estimate of the standard deviation. Notice how much closer we are to 1 with the MAD:

mad(x)
## [1] 0.8857141
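
We can check the formula by computing the MAD by hand:

1.4826 * median(abs(x - median(x))) ## same as mad(x)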

Spearman correlation

Earlier we saw that the correlation is also sensitive to outliers. Here we construct an independent list of numbers, but for which a similar mistake was made for the same entry:

set.seed(1)
x <- rnorm(100,0,1) ##real distribution
x[23] <- 100 ##mistake made in the 23rd measurement
y <- rnorm(100,0,1) ##real distribution
y[23] <- 84 ##similar mistake made in the 23rd measurement
library(rafalib)
mypar()
plot(x,y,main=paste0("correlation=",round(cor(x,y),3)),pch=21,bg=1,xlim=c(-3,100),ylim=c(-3,100))
abline(0,1)

Scatterplot showing bivariate normal data with a single outlier resulting in large values in both dimensions.

The Spearman correlation follows the general idea of median and MAD, that of using quantiles. The idea is simple: we convert each dataset to ranks and then compute correlation:

mypar(1,2)
plot(x,y,main=paste0("correlation=",round(cor(x,y),3)),pch=21,bg=1,xlim=c(-3,100),ylim=c(-3,100))
plot(rank(x),rank(y),main=paste0("correlation=",round(cor(x,y,method="spearman"),3)),pch=21,bg=1,xlim=c(-3,100),ylim=c(-3,100))
abline(0,1)

Scatterplot of original data (left) and ranks (right). Spearman correlation reduces the influence of outliers by considering the ranks instead of the original data.

So if these statistics are robust to outliers, why would we ever use the non-robust versions? In general, if we know there are outliers, then the median and MAD are recommended over the mean and standard deviation. However, there are examples in which robust statistics are less powerful than the non-robust versions.

We also note that there is a large statistical literature on robust statistics that goes far beyond the median and the MAD.

Median, MAD, and Spearman Correlation Exercises

We are going to explore the properties of robust statistics. We will use one of the datasets included in R, which contains the weights of chicks in grams as they grow from day 0 to day 21. This dataset also splits up the chicks by different protein diets, which are coded from 1 to 4. We also use this dataset to show an important operation in R (not related to robust summaries): reshape.

This dataset is built into R and can be loaded with:

data(ChickWeight)

#To begin, take a look at the weights of all observations over time and color the points to represent the Diet:

str(ChickWeight)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula' length 3 weight ~ Time | Chick
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula' length 2 ~Diet
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time"
##   ..$ y: chr "Body weight"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(gm)"
head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
plot( ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet)

First, notice that the rows here represent time points rather than individuals. To facilitate the comparison of weights at different time points and across the different chicks, we will reshape the data so that each row is a chick. In R we can do this with the reshape function:

chick = reshape(ChickWeight, idvar=c("Chick","Diet"), timevar="Time",
                direction="wide")

The meaning of this line is: reshape the data from long to wide, where the columns Chick and Diet are the IDs and the column Time indicates different observations for each ID. Now examine the head of this dataset:

head(chick)
##    Chick Diet weight.0 weight.2 weight.4 weight.6 weight.8 weight.10
## 1      1    1       42       51       59       64       76        93
## 13     2    1       40       49       58       72       84       103
## 25     3    1       43       39       55       67       84        99
## 37     4    1       42       49       56       67       74        87
## 49     5    1       41       42       48       60       79       106
## 61     6    1       41       49       59       74       97       124
##    weight.12 weight.14 weight.16 weight.18 weight.20 weight.21
## 1        106       125       149       171       199       205
## 13       122       138       162       187       209       215
## 25       115       138       163       187       198       202
## 37       102       108       136       154       160       157
## 49       141       164       197       199       220       223
## 61       141       148       155       160       160       157

We also want to remove any chicks that have missing observations at any time points (NA for “not available”). The following line of code identifies these rows and then removes them:

chick = na.omit(chick)

Median, MAD, and Spearman Correlation Exercises #1

Focus on the chick weights on day 4 (check the column names of ‘chick’ and note the numbers). How much does the average of chick weights at day 4 increase if we add an outlier measurement of 3000 grams? Specifically, what is the average weight of the day 4 chicks, including the outlier chick, divided by the average of the weight of the day 4 chicks without the outlier. Hint: use c to add a number to a vector.

chickout<-c(chick$weight.4,3000) # adding the outlier value of 3000
mean(chickout)/mean(chick$weight.4)
## [1] 2.062407
#or in one line
mean(c(chick$weight.4, 3000))/mean(chick$weight.4)
## [1] 2.062407

A: 2.062407

Median, MAD, and Spearman Correlation Exercises #2

In exercise 1, we saw how sensitive the mean is to outliers. Now let’s see what happens when we use the median instead of the mean. Compute the same ratio, but now using median instead of mean. Specifically, what is the median weight of the day 4 chicks, including the outlier chick, divided by the median of the weight of the day 4 chicks without the outlier.

median(c(chick$weight.4, 3000))/median(chick$weight.4)
## [1] 1

A:1

Median, MAD, and Spearman Correlation Exercises #3

Now try the same thing with the sample standard deviation (the sd function in R). Add a chick with weight 3000 grams to the chick weights from day 4. How much does the standard deviation change? What’s the standard deviation with the outlier chick divided by the standard deviation without the outlier chick?

sd(c(chick$weight.4, 3000))/sd(chick$weight.4)
## [1] 101.2859

A:101.2859

Median, MAD, and Spearman Correlation Exercises #4

Compare the result above to the median absolute deviation in R, which is calculated with the mad function. Note that the mad is unaffected by the addition of a single outlier. The mad function in R includes the scaling factor 1.4826, such that mad and sd are very similar for a sample from a normal distribution. What’s the MAD with the outlier chick divided by the MAD without the outlier chick?

mad(c(chick$weight.4, 3000))/mad(chick$weight.4)
## [1] 1

A:1

Median, MAD, and Spearman Correlation Exercises #5

Our last question relates to how the Pearson correlation is affected by an outlier as compared to the Spearman correlation. The Pearson correlation between x and y is given in R by cor(x,y). The Spearman correlation is given by cor(x,y,method="spearman").

Plot the weights of chicks from day 4 and day 21. We can see that there is some general trend, with the lower weight chicks on day 4 having low weight again on day 21, and likewise for the high weight chicks.

Calculate the Pearson correlation of the weights of chicks from day 4 and day 21. Now calculate how much the Pearson correlation changes if we add a chick that weighs 3000 on day 4 and 3000 on day 21. Again, divide the Pearson correlation with the outlier chick by the Pearson correlation computed without the outliers.

plot(chick$weight.4,chick$weight.21)

cor(chick$weight.4,chick$weight.21)
## [1] 0.4159499
plot(c(chick$weight.4,3000),c(chick$weight.21,3000))

cor(c(chick$weight.4,3000),c(chick$weight.21,3000))
## [1] 0.9861002
cor(c(chick$weight.4,3000),c(chick$weight.21,3000))/cor(chick$weight.4,chick$weight.21)
## [1] 2.370719
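
A:2.370719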

Mann-Whitney-Wilcoxon Test

Wilcoxon Rank Sum Test

We learned how the sample mean and SD are susceptible to outliers. The t-test is based on these measures and is susceptible as well. The Wilcoxon rank test (equivalent to the Mann-Whitney test) provides an alternative. In the code below, we perform a t-test on data for which the null hypothesis is true. However, we change one observation by mistake in each sample, and the values incorrectly entered are different. Here we see that the t-test results in a small p-value, while the Wilcoxon test does not:

set.seed(779) ##779 picked for illustration purposes
N=25
x<- rnorm(N,0,1)
y<- rnorm(N,0,1)

Create outliers:

x[1] <- 5
x[2] <- 7
cat("t-test pval:",t.test(x,y)$p.value)
## t-test pval: 0.04439948
cat("Wilcox test pval:",wilcox.test(x,y)$p.value)
## Wilcox test pval: 0.1310212

The basic idea is to 1) combine all the data, 2) turn the values into ranks, 3) separate them back into their groups, and 4) compute the sum or average rank and perform a test.

library(rafalib)
mypar(1,2)

stripchart(list(x,y),vertical=TRUE,ylim=c(-7,7),ylab="Observations",pch=21,bg=1)
abline(h=0)

xrank <- rank(c(x,y))[seq(along=x)]
yrank <- rank(c(x,y))[-seq(along=x)]

stripchart(list(xrank,yrank),vertical=TRUE,ylab="Ranks",pch=21,bg=1,cex=1.25)

ws <- sapply(x,function(z) rank(c(z,y))[1]-1)
text( rep(1.05,length(ws)), xrank, ws, cex=0.8)

Data from two populations with two outliers. The left plot shows the original data and the right plot shows their ranks. The numbers are the w values

W <-sum(ws) 

W is the sum of the ranks for the first group relative to the second group. We can compute an exact p-value for \(W\) based on combinatorics. We can also use the CLT since statistical theory tells us that this W is approximated by the normal distribution. We can construct a z-score as follows:

n1<-length(x);n2<-length(y)
Z <- (mean(ws)-n2/2)/ sqrt(n2*(n1+n2+1)/12/n1)
print(Z)
## [1] 1.523124

Here the Z is not large enough to give us a p-value less than 0.05. These are part of the calculations performed by the R function wilcox.test.
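
As a check, the W computed above matches the statistic reported by wilcox.test (there are no ties in these simulated data):

wilcox.test(x, y)$statistic ## equals W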

Mann-Whitney-Wilcoxon Test Exercises

We are going to explore the properties of robust statistics. We will use one of the datasets included in R, which contains the weights of chicks in grams as they grow from day 0 to day 21. This dataset also splits up the chicks by different protein diets, which are coded from 1 to 4. We also use this dataset to show an important operation in R (not related to robust summaries): reshape.

This dataset is built into R and can be loaded with:

data(ChickWeight)

To begin, take a look at the weights of all observations over time and color the points to represent the Diet:

head(ChickWeight)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
plot( ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet)

First, notice that the rows here represent time points rather than individuals. To facilitate the comparison of weights at different time points and across the different chicks, we will reshape the data so that each row is a chick. In R we can do this with the reshape function:

chick = reshape(ChickWeight, idvar=c("Chick","Diet"), timevar="Time",
                direction="wide")

The meaning of this line is: reshape the data from _long_ to _wide_, where the columns Chick and Diet are the IDs and the column Time indicates different observations for each ID. Now examine the head of this dataset:

head(chick)
##    Chick Diet weight.0 weight.2 weight.4 weight.6 weight.8 weight.10
## 1      1    1       42       51       59       64       76        93
## 13     2    1       40       49       58       72       84       103
## 25     3    1       43       39       55       67       84        99
## 37     4    1       42       49       56       67       74        87
## 49     5    1       41       42       48       60       79       106
## 61     6    1       41       49       59       74       97       124
##    weight.12 weight.14 weight.16 weight.18 weight.20 weight.21
## 1        106       125       149       171       199       205
## 13       122       138       162       187       209       215
## 25       115       138       163       187       198       202
## 37       102       108       136       154       160       157
## 49       141       164       197       199       220       223
## 61       141       148       155       160       160       157
We also want to remove any chicks that have missing observations at any time points (NA for "not available"). The following line of code identifies these rows and then removes them:

chick = na.omit(chick)
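
A couple of quick sanity checks (not part of the exercises; output not shown) confirm that the reshape worked as intended:

dim(chick)    # one row per remaining chick, one weight column per time point
anyNA(chick)  # should be FALSE after na.omit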

Mann-Whitney-Wilcoxon Test Exercises #1

Save the weights of the chicks on day 4 from diet 1 as a vector x. Save the weights of the chicks on day 4 from diet 4 as a vector y. Perform a t-test comparing x and y (in R the function t.test(x,y) will perform the test). Then perform a Wilcoxon test of x and y (in R the function wilcox.test(x,y) will perform the test). A warning will appear that an exact p-value cannot be calculated with ties, so an approximation is used, which is fine for our purposes.

Perform a t-test of x and y, after adding a single chick of weight 200 grams to x (the diet 1 chicks). What is the p-value from this test? The p-value of a test is available with the following code: t.test(x,y)$p.value

x = chick$weight.4[chick$Diet == 1]
y = chick$weight.4[chick$Diet == 4]
t.test(c(x, 200), y)$p.value
## [1] 0.9380347

A:0.9380347
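
For reference, the first part of this exercise, the t-test on the original x and y without the added outlier, is simply (output not shown):

t.test(x,y)$p.value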

Mann-Whitney-Wilcoxon Test Exercises #2

Do the same for the Wilcoxon test. The Wilcoxon test is robust to the outlier. In addition, it makes fewer assumptions than the t-test about the distribution of the underlying data.

wilcox.test(x,y,exact=FALSE)$p.value
## [1] 0.0002011939
wilcox.test(c(x,200),y,exact=FALSE)$p.value
## [1] 0.0009840921
# or, equivalently, starting from fresh definitions of x and y:
x = chick$weight.4[chick$Diet == 1]
y = chick$weight.4[chick$Diet == 4]
wilcox.test(c(x, 200), y, exact=FALSE)$p.value
## [1] 0.0009840921

A:0.0009840921

Mann-Whitney-Wilcoxon Test Exercises #3

We will now investigate a possible downside to the Wilcoxon-Mann-Whitney test statistic. Use the following code to make three boxplots, showing the true Diet 1 vs. Diet 4 weights, and then two altered versions: one with an additional difference of 10 grams and one with an additional difference of 100 grams. Use the x and y as defined above, NOT the ones with the added outlier.

library(rafalib)

x = chick$weight.4[chick$Diet == 1]
y = chick$weight.4[chick$Diet == 4]

mypar(1,3)
boxplot(x,y)
boxplot(x,y+10)
boxplot(x,y+100)

What is the difference in t-test statistic (obtained by t.test(x,y)$statistic) between adding 10 and adding 100 to all the values in the group ‘y’? Take the t-test statistic with x and y+10 and subtract the t-test statistic with x and y+100. The value should be positive.

t.test(x,y+10)$statistic - t.test(x,y+100)$statistic
##        t 
## 67.75097
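
A:67.75097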

Mann-Whitney-Wilcoxon Test Exercises #4

Examine the Wilcoxon test statistic for x and y+10 and for x and y+100. Because the Wilcoxon test works on ranks, once the two groups show complete separation, that is, once all points from group ‘y’ are above all points from group ‘x’, the statistic will not change, regardless of how large the difference grows. Likewise, the p-value has a minimum value, regardless of how far apart the groups are. This means that the Wilcoxon test can be considered less powerful than the t-test in certain contexts. In fact, for small sample sizes, the p-value can’t be very small, even when the difference is very large. What is the p-value if we compare c(1,2,3) to c(4,5,6) using a Wilcoxon test?
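
As suggested above, we can first check that the Wilcoxon statistic is identical for y+10 and y+100 once the groups fully separate (a quick check using the x and y defined in the previous exercise; exact=FALSE suppresses the ties warning; output not shown):

wilcox.test(x,y+10,exact=FALSE)$statistic
wilcox.test(x,y+100,exact=FALSE)$statistic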

wilcox.test(c(1,2,3),c(4,5,6))$p.value
## [1] 0.1

A:0.1

Mann-Whitney-Wilcoxon Test Exercises #5

What is the p-value if we compare c(1,2,3) to c(400,500,600) using a Wilcoxon test?

wilcox.test(c(1,2,3),c(400,500,600))$p.value
## [1] 0.1

A:0.1

Resources

https://github.com/genomicsclass

HTML book

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.3 (2015-12-10)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_India.1252          
##  tz       Asia/Calcutta               
##  date     2016-02-25
## Packages ------------------------------------------------------------------
##  package        * version date       source        
##  acepack          1.3-3.3 2013-05-03 CRAN (R 3.2.0)
##  affy           * 1.46.1  2015-06-10 Bioconductor  
##  affyio           1.36.0  2015-04-17 Bioconductor  
##  assertthat       0.1     2013-12-06 CRAN (R 3.2.1)
##  Biobase        * 2.28.0  2015-04-17 Bioconductor  
##  BiocGenerics   * 0.14.0  2015-04-17 Bioconductor  
##  BiocInstaller    1.18.5  2015-10-14 Bioconductor  
##  cluster          2.0.3   2015-07-21 CRAN (R 3.2.3)
##  colorspace       1.2-6   2015-03-11 CRAN (R 3.2.1)
##  DBI              0.3.1   2014-09-24 CRAN (R 3.2.1)
##  devtools       * 1.10.0  2016-01-23 CRAN (R 3.2.3)
##  digest           0.6.9   2016-01-08 CRAN (R 3.2.3)
##  downloader     * 0.4     2015-07-09 CRAN (R 3.2.2)
##  dplyr          * 0.4.3   2015-09-01 CRAN (R 3.2.2)
##  evaluate         0.8     2015-09-18 CRAN (R 3.2.2)
##  foreign          0.8-66  2015-08-19 CRAN (R 3.2.2)
##  formatR          1.2.1   2015-09-18 CRAN (R 3.2.2)
##  Formula        * 1.2-1   2015-04-07 CRAN (R 3.2.0)
##  gapminder      * 0.2.0   2015-12-31 CRAN (R 3.2.3)
##  ggplot2        * 2.0.0   2015-12-18 CRAN (R 3.2.3)
##  gridExtra        2.0.0   2015-07-14 CRAN (R 3.2.1)
##  gtable           0.1.2   2012-12-05 CRAN (R 3.2.1)
##  HistData       * 0.7-6   2015-10-16 CRAN (R 3.2.2)
##  Hmisc          * 3.17-2  2016-02-21 CRAN (R 3.2.3)
##  htmltools        0.3     2015-12-29 CRAN (R 3.2.3)
##  knitr            1.12.3  2016-01-22 CRAN (R 3.2.3)
##  lattice        * 0.20-33 2015-07-14 CRAN (R 3.2.3)
##  latticeExtra     0.6-28  2016-02-09 CRAN (R 3.2.3)
##  lazyeval         0.1.10  2015-01-02 CRAN (R 3.2.1)
##  magrittr         1.5     2014-11-22 CRAN (R 3.2.1)
##  MASS           * 7.3-45  2015-11-10 CRAN (R 3.2.3)
##  memoise          1.0.0   2016-01-29 CRAN (R 3.2.3)
##  munsell          0.4.3   2016-02-13 CRAN (R 3.2.3)
##  nnet             7.3-12  2016-02-02 CRAN (R 3.2.3)
##  plyr             1.8.3   2015-06-12 CRAN (R 3.2.1)
##  preprocessCore   1.30.0  2015-04-17 Bioconductor  
##  R6               2.1.2   2016-01-26 CRAN (R 3.2.3)
##  rafalib        * 1.0.0   2015-08-09 CRAN (R 3.2.2)
##  RColorBrewer     1.1-2   2014-12-07 CRAN (R 3.2.0)
##  Rcpp             0.12.3  2016-01-10 CRAN (R 3.2.3)
##  rmarkdown        0.9.5   2016-02-22 CRAN (R 3.2.3)
##  rpart            4.1-10  2015-06-29 CRAN (R 3.2.1)
##  scales           0.3.0   2015-08-25 CRAN (R 3.2.2)
##  SpikeInSubset  * 1.8.0   2016-02-23 Bioconductor  
##  stringi          1.0-1   2015-10-22 CRAN (R 3.2.2)
##  stringr          1.0.0   2015-04-30 CRAN (R 3.2.1)
##  survival       * 2.38-3  2015-07-02 CRAN (R 3.2.3)
##  UsingR         * 2.0-5   2015-08-06 CRAN (R 3.2.2)
##  yaml             2.1.13  2014-06-12 CRAN (R 3.2.1)
##  zlibbioc         1.14.0  2015-04-17 Bioconductor