Source file ⇒ Data_Analysis_for_Life_Sciences3.rmd

Week 1: Introduction and Motivation

**Inference For High-Dimensional Data**

Introduction

High-throughput technologies have changed basic biology and the biomedical sciences from data-poor disciplines to data-intensive ones. A specific example comes from research fields interested in understanding gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. In the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper or extracting a few numbers from standard curves. With high-throughput technologies, such as microarrays, this suddenly changed to sifting through tens of thousands of numbers. More recently, RNA-Sequencing has further increased data complexity. Biologists went from using their eyes or simple summaries to categorize results, to having thousands (and now millions) of measurements per sample to analyze. In this chapter, we will focus on statistical inference in the context of high-throughput measurements. Specifically, we focus on the problem of detecting differences in groups using statistical tests and quantifying uncertainty in a meaningful way. We also introduce exploratory data analysis techniques that should be used in conjunction with inference when analyzing high-throughput data. In later chapters, we will study the statistics behind clustering, machine learning, factor analysis, and multi-level modeling.

An Example of High-throughput Data

Since there is a vast number of available public datasets, we use several gene expression examples. Nonetheless, the statistical techniques you will learn have also proven useful in other fields that make use of high-throughput technologies. Technologies such as microarrays, next generation sequencing, fMRI, and mass spectrometry all produce data to answer questions for which what we learn here will be indispensable.

Data packages

Several of the examples we are going to use in the following sections are best obtained through R packages. These are available from GitHub and can be installed using the install_github function from the devtools package. Microsoft Windows users might need to follow additional instructions to properly install devtools.

Once devtools is installed, you can then install the data packages like this:

library(devtools)
install_github("genomicsclass/GSE5859Subset")

The three tables

Most of the data we use as examples in this book are created with high-throughput technologies. These technologies measure thousands of features. Examples of features are genes, single base locations of the genome, genomic regions, or image pixel intensities. Each specific measurement product is defined by a specific set of features. For example, a specific gene expression microarray product is defined by the set of genes that it measures.

A specific study will typically use one product to make measurements on several experimental units, such as individuals. The most common experimental unit will be the individual, but they can also be defined by other entities, for example different parts of a tumor. We often call the experimental units samples following experimental jargon. It is important that these are not confused with samples as referred to in previous chapters, for example “random sample”.

So a high-throughput experiment is usually defined by three tables: one with the high-throughput measurements and two tables with information about the columns and rows of this first table respectively.

Because a dataset is typically defined by a set of experimental units and a product defines a fixed set of features, the high-throughput measurements can be stored in an \(n \times m\) matrix, with \(n\) the number of units and \(m\) the number of features. In R, the convention has been to store the transpose of these matrices.

Here is an example from a gene expression dataset:

library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables
head(geneExpression)
##           GSM136508.CEL.gz GSM136530.CEL.gz GSM136517.CEL.gz
## 1007_s_at         6.543954         6.401470         6.298943
## 1053_at           7.546708         7.263547         7.201699
## 117_at            5.402622         5.050546         5.024917
## 121_at            7.892544         7.707754         7.461886
## 1255_g_at         3.242779         3.222804         3.185605
## 1294_at           7.531754         7.090270         7.466018
##           GSM136576.CEL.gz GSM136566.CEL.gz GSM136574.CEL.gz
## 1007_s_at         6.837899         6.470689         6.450220
## 1053_at           7.052761         6.980207         7.096195
## 117_at            5.304313         5.214149         5.173731
## 121_at            7.558130         7.819013         7.641136
## 1255_g_at         3.195363         3.251915         3.324934
## 1294_at           7.122145         7.058973         6.992396
##           GSM136575.CEL.gz GSM136569.CEL.gz GSM136568.CEL.gz
## 1007_s_at         6.052854         6.387026         6.640583
## 1053_at           6.983827         7.060558         7.010453
## 117_at            5.022882         5.175134         5.281784
## 121_at            7.729267         7.700608         7.615513
## 1255_g_at         3.088541         3.184015         3.076940
## 1294_at           7.112384         7.194791         6.884312
##           GSM136559.CEL.gz GSM136565.CEL.gz GSM136573.CEL.gz
## 1007_s_at         6.948474         6.778464         6.595414
## 1053_at           6.775048         7.063689         6.864693
## 117_at            5.309194         5.071376         5.091403
## 121_at            7.992304         7.706135         7.808486
## 1255_g_at         3.167413         3.492037         3.231536
## 1294_at           7.401553         7.478987         7.065355
##           GSM136523.CEL.gz GSM136509.CEL.gz GSM136727.CEL.gz
## 1007_s_at         6.255549         6.379983         6.133068
## 1053_at           7.174769         7.702533         7.280781
## 117_at            5.237160         5.398616         5.401876
## 121_at            7.574813         7.511478         7.607461
## 1255_g_at         3.208304         3.212051         3.225123
## 1294_at           7.344407         7.631689         7.018479
##           GSM136510.CEL.gz GSM136515.CEL.gz GSM136522.CEL.gz
## 1007_s_at         6.502051         6.331567         6.354293
## 1053_at           7.302209         7.456509         7.282859
## 117_at            5.395087         5.280535         4.986950
## 121_at            7.993732         7.632947         7.706585
## 1255_g_at         3.440186         3.185090         3.192436
## 1294_at           7.478820         7.577597         7.339535
##           GSM136507.CEL.gz GSM136524.CEL.gz GSM136514.CEL.gz
## 1007_s_at         6.517539         6.156754         6.037871
## 1053_at           7.689282         7.491967         7.413133
## 117_at            5.562001         5.039387         5.054133
## 121_at            7.612557         7.543667         7.507113
## 1255_g_at         3.107306         3.128269         3.085953
## 1294_at           7.398595         7.359040         7.377372
##           GSM136563.CEL.gz GSM136564.CEL.gz GSM136572.CEL.gz
## 1007_s_at         6.639091         6.393338         6.469794
## 1053_at           7.028731         6.697240         7.092346
## 117_at            5.361298         5.218937         5.340272
## 121_at            7.798197         7.976375         7.753480
## 1255_g_at         3.174794         3.409032         3.274033
## 1294_at           7.467240         7.355222         6.910401
str(geneExpression)
##  num [1:8793, 1:24] 6.54 7.55 5.4 7.89 3.24 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:8793] "1007_s_at" "1053_at" "117_at" "121_at" ...
##   ..$ : chr [1:24] "GSM136508.CEL.gz" "GSM136530.CEL.gz" "GSM136517.CEL.gz" "GSM136576.CEL.gz" ...
dim(geneExpression)
## [1] 8793   24
#checking for missing values
matrixStats::anyMissing(geneExpression)
## [1] FALSE
matrixStats::anyMissing(sampleInfo)
## [1] FALSE
matrixStats::anyMissing(geneAnnotation)
## [1] TRUE
matrixStats::colAnyMissings(geneAnnotation)
## [1] FALSE  TRUE  TRUE  TRUE
sum(matrixStats::rowAnyMissings(geneAnnotation))
## [1] 113

We have RNA expression measurements for 8793 genes from blood taken from 24 individuals (the experimental units). For most statistical analyses, we will also need information about the individuals. For example, in this case the data was originally collected to compare gene expression across ethnic groups. However, we have created a subset of this dataset for illustration and separated the data into two groups:

dim(sampleInfo)
## [1] 24  4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1
sampleInfo$group
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

One of the columns, filename, permits us to connect the rows of this table to the columns of the measurement table.

match(sampleInfo$filename,colnames(geneExpression))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24
identical(sampleInfo$filename,colnames(geneExpression))
## [1] TRUE
sampleInfo$group
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

Finally, we have a table describing the features:

dim(geneAnnotation)
## [1] 8793    4
head(geneAnnotation)
##      PROBEID  CHR     CHRLOC SYMBOL
## 1  1007_s_at chr6   30852327   DDR1
## 30   1053_at chr7  -73645832   RFC2
## 31    117_at chr1  161494036  HSPA6
## 32    121_at chr2 -113973574   PAX8
## 33 1255_g_at chr6   42123144 GUCA1A
## 34   1294_at chr3  -49842638   UBA7

The table includes an ID that permits us to connect the rows of this table with the rows of the measurement table:

head(match(geneAnnotation$PROBEID,rownames(geneExpression)))
## [1] 1 2 3 4 5 6

The table also includes biological information about the features, namely chromosome location and the gene “name” used by biologists.

R Refresher Exercises

As stated in the pre-requisites, you should be familiar with R. This assessment serves as a refresher while at the same time introduces you to high-throughput data.

You should make sure that the devtools package is installed and working. In most cases this package installs with no problems, but problems are unfortunately common. Please do not hesitate to ask for help on the Discussion board. We have purposely included less material in week 1 to give everybody a chance to install all the necessary tools and catch up in R programming.

Note that in this course we will be using base R rather than the dplyr approach used in course 1. You should become familiar with subsetting. For example, to subset a matrix dat to the rows for which the third column is larger than k, you write:

dat[ dat[,3] > k , ]

You should also be familiar with functions such as apply, match, and which. You should also be able to write your own functions. If you need a review, you can go through the swirl tutorials:

install.packages("swirl")
library(swirl)
swirl()

Here we provide some assessment questions to test yourself.
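As a quick self-contained refresher, here is a minimal toy sketch of these functions (hypothetical data, not the course datasets):

mat <- matrix(1:12, nrow=4)           ##a small 4x3 matrix
apply(mat, 2, max)                    ##apply a function to each column: 4 8 12
which(mat[,1] > 2)                    ##indices where a condition holds: 3 4
match(c("c","a"), c("a","b","c"))     ##positions of first matches: 3 1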

The workload for week 1 is kept relatively small to give students time to install the necessary packages and hone their R skills.

R Refresher Exercises #1

For the remaining parts of this book we will be downloading larger datasets than those we have been using. Most of these datasets are not available as part of the standard R installation or packages such as UsingR. For some of these datasets, we have created packages and offer them via GitHub. To download these you will need to install the devtools package. Once you do this, you can install packages such as GSE5859Subset, which we will be using here:

devtools::install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset)  ##this loads the three tables

This package loads three tables: geneAnnotation, geneExpression, and sampleInfo. Answer the following questions to familiarize yourself with the data set:

Q: How many samples were processed on 2005-06-27?

dim(sampleInfo)
## [1] 24  4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1
tail(sampleInfo)
##     ethnicity       date         filename group
## 106       ASN 2005-06-23 GSM136507.CEL.gz     0
## 119       ASN 2005-06-27 GSM136524.CEL.gz     0
## 110       ASN 2005-06-23 GSM136514.CEL.gz     0
## 150       ASN 2005-10-07 GSM136563.CEL.gz     0
## 151       ASN 2005-10-07 GSM136564.CEL.gz     0
## 159       ASN 2005-10-07 GSM136572.CEL.gz     0
nrow(sampleInfo[sampleInfo$date=="2005-06-27",])
## [1] 5
#or

sum(sampleInfo$date=="2005-06-27")
## [1] 5

A: 5

R Refresher Exercises #2

Q: How many of the genes represented in this particular technology are on chromosome Y?

head(geneAnnotation)
##      PROBEID  CHR     CHRLOC SYMBOL
## 1  1007_s_at chr6   30852327   DDR1
## 30   1053_at chr7  -73645832   RFC2
## 31    117_at chr1  161494036  HSPA6
## 32    121_at chr2 -113973574   PAX8
## 33 1255_g_at chr6   42123144 GUCA1A
## 34   1294_at chr3  -49842638   UBA7
dim(geneAnnotation)
## [1] 8793    4
class(geneAnnotation)
## [1] "data.frame"
sum(geneAnnotation$CHR=="chrY",na.rm=TRUE)
## [1] 21

A: 21. The object geneAnnotation contains the information about the genes.

R Refresher Exercises #3

Q: What is the log expression value of the gene ARPC1A for the one subject that we measured on 2005-06-10?

class(geneExpression)
## [1] "matrix"
sampleInfo[sampleInfo$date=="2005-06-10",]
##     ethnicity       date         filename group
## 207       CEU 2005-06-10 GSM136727.CEL.gz     0
head(geneAnnotation[geneAnnotation$SYMBOL=="ARPC1A",])
##        PROBEID  CHR   CHRLOC SYMBOL
## 489  200950_at chr7 98923496 ARPC1A
## NA        <NA> <NA>       NA   <NA>
## NA.1      <NA> <NA>       NA   <NA>
## NA.2      <NA> <NA>       NA   <NA>
## NA.3      <NA> <NA>       NA   <NA>
## NA.4      <NA> <NA>       NA   <NA>
gene1<-geneExpression[,"GSM136727.CEL.gz"]
head(gene1)
## 1007_s_at   1053_at    117_at    121_at 1255_g_at   1294_at 
##  6.133068  7.280781  5.401876  7.607461  3.225123  7.018479
gene1["200950_at"]
## 200950_at 
##  8.233599
#or
#We get the row using the geneAnnotation table and the column using the sampleInfo table.
(i = which(geneAnnotation$SYMBOL=="ARPC1A"))
## [1] 323
(j = which(sampleInfo$date=="2005-06-10"))
## [1] 15
geneExpression[i,j]
## [1] 8.233599

A: 8.233599

R Refresher Exercises #4

Q: Use the function apply to find the median value of each column. What is the median of these values?

median(apply(geneExpression,2,median))
## [1] 5.421568
#or
meds <- apply(geneExpression,2,median)
median(meds)    
## [1] 5.421568
#install.packages("matrixStats")
median(matrixStats::colMedians(geneExpression))
## [1] 5.421568

A: 5.421568

R Refresher Exercises #5

This problem is more advanced than the previous ones. Note that it might take you some time to solve and you should feel free to seek help in the discussion forum. The exercise is meant to motivate you to learn an important R skill: creating functions to use with apply.

Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.

Now define g to code cases (1) and controls (0) like this: g <- factor(sampleInfo$group)

Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is the smallest p-value among all these t-tests?

myttest <- function(e,group){
    x <- e[group==1]
    y <- e[group==0]
    return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals ) 
## [1] 1.406803e-21

A:1.406803e-21

The challenge of multiple testing

Inference in Practice

Suppose we were given high-throughput gene expression data that was measured for several individuals in two populations. We are asked to report which genes have different average expression levels in the two populations. If instead of thousands of genes, we were handed data from just one gene, we could simply apply the inference techniques that we have learned before. We could, for example, use a t-test or some other test. Here we review what changes when we consider high-throughput data.

p-values are random variables

An important concept to remember in order to understand the concepts presented in this chapter is that p-values are random variables. To see this, consider the example in which we define a p-value from a t-test with a large enough sample size to use the CLT approximation. Then our p-value is defined as the probability that a normally distributed random variable is larger, in absolute value, than the observed t-statistic, which we call \(Z\). So for a two-sided test the p-value is:

\[ p = 2 \{ 1 - \Phi(\mid Z \mid)\} \]

In R, we write:

2*( 1-pnorm( abs(Z) ) )
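As a minimal runnable illustration, we can draw a hypothetical test statistic under the null and compute its two-sided p-value:

set.seed(1)
Z <- rnorm(1)              ##a simulated null test statistic, for illustration only
2*( 1 - pnorm( abs(Z) ) )  ##the corresponding two-sided p-value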

Now because \(Z\) is a random variable and \(\Phi\) is a deterministic function, \(p\) is also a random variable. We will create a Monte Carlo simulation showing how the values of \(p\) change. We use femaleControlsPopulation.csv from earlier chapters.

filename <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"

We read in the data, and use replicate to repeatedly create p-values.

set.seed(1)
N <- 12
B <- 10000
population <-unlist(read.csv(filename))
pvals <- replicate(B,{
  control = sample(population,N)
  treatment = sample(population,N)
  t.test(treatment,control)$p.val
  })

hist(pvals)
P-value histogram for 10,000 tests in which the null hypothesis is true.

As implied by the histogram, in this case the p-values are uniformly distributed. In fact, we can show theoretically that when the null hypothesis is true, this is always the case. For the case in which we use the CLT, we have that the null hypothesis \(H_0\) implies that our test statistic \(Z\) follows a normal distribution with mean 0 and standard deviation 1, thus:

\[ p_a = \mbox{Pr}(Z < a \mid H_0) = \Phi(a) \]

This implies that:

\[ \begin{align*} \mbox{Pr}(p < p_a) &= \mbox{Pr}[ \Phi^{-1}(p) < \Phi^{-1}(p_a) ] \\ & = \mbox{Pr}(Z < a) = p_a \end{align*} \]

which is the definition of a uniform distribution.
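A quick numerical check of this uniformity, using the simulated pvals from the chunk above (assuming it has been run):

mean(pvals < 0.05)  ##should be close to 0.05
mean(pvals < 0.50)  ##should be close to 0.50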

Thousands of tests

In this data we have two groups denoted with 0 and 1:

library(GSE5859Subset)
data(GSE5859Subset)
g <- sampleInfo$group
g
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0

If we were interested in a particular gene, let’s arbitrarily pick the one on the 25th row, we would simply compute a t-test. To compute a p-value, we will use the t-distribution approximation and therefore we need the population data to be approximately normal. We check this assumption with a qq-plot:

e <- geneExpression[25,]

library(rafalib)
mypar(1,2)

qqnorm(e[g==1])
qqline(e[g==1])

qqnorm(e[g==0])
qqline(e[g==0])
Normal qq-plots for one gene. Left plot shows first group and right plot shows second group.

The qq-plots show that the data are well approximated by the normal distribution. The t-test does not find this gene to be statistically significant:

t.test(e[g==1],e[g==0])$p.value
## [1] 0.779303

To answer the question for each gene, we simply repeat the above computation for each gene. Here we will define our own function and use apply:

myttest <- function(x) t.test(x[g==1],x[g==0],var.equal=TRUE)$p.value
myttest(geneExpression[25,])
## [1] 0.7792058
pvals <- apply(geneExpression,1,myttest)
length(pvals)
## [1] 8793

We can now see which genes have p-values less than, say, 0.05. For example, right away we see that…

sum(pvals<0.05)
## [1] 1383
sum(pvals<0.01)
## [1] 417

… genes had p-values less than 0.05.

However, as we will describe in more detail below, we have to be careful in interpreting this result because we have performed over 8,000 tests. If we performed the same procedure on random data, for which the null hypothesis is true for all features, we obtain the following results:

set.seed(1)
m <- nrow(geneExpression)
n <- ncol(geneExpression)
randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- apply(randomData,1,myttest)
sum(nullpvals<0.05)
## [1] 419
sum(nullpvals<0.01)
## [1] 80

As we will explain later in the chapter, this is to be expected: 419 is roughly 0.05*8793, and we will describe the theory that tells us why this prediction works.
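We can check this expectation directly: under the null, the expected number of p-values below a cutoff \(\alpha\) is \(\alpha m\).

m <- nrow(geneExpression)  ##8793 features
0.05*m                     ##about 440 expected false positives at the 0.05 cutoff
0.01*m                     ##about 88 expected at the 0.01 cutoff

These expectations are in line with the 419 and 80 observed on the random data.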

Faster t-test implementation

Before we continue, we should point out that the above implementation is very inefficient. There are several faster implementations that perform t-tests for high-throughput data. We make use of a function that is not available from CRAN, but rather from the Bioconductor project.

To download and install packages from Bioconductor, we can use the install_bioc function from rafalib:

library(rafalib)
install_bioc("genefilter")

# ?BiocUpgrade
# source("http://bioconductor.org/biocLite.R")
# biocLite("BiocUpgrade")

Now we can show that this function is much faster than our code above and produces practically the same answer:

library(genefilter)
results <- rowttests(geneExpression,factor(g))
max(abs(pvals-results$p))
## [1] 6.528111e-14

Inference in Practice Exercises #1

These exercises will help clarify that p-values are random variables and illustrate some of their properties. Note that just like the sample average is a random variable because it is based on a random sample, the p-values are based on random variables (the sample mean and sample standard deviation, for example) and thus they are also random variables.

To see this, let’s see how p-values change when we take different samples.

set.seed(1)
library(downloader)
url = "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename = "femaleControlsPopulation.csv"
if (!file.exists(filename)) download(url,destfile=filename)

population = read.csv(filename)
pvals <- replicate(1000,{
  control = sample(population[,1],12)
  treatment = sample(population[,1],12)
  t.test(treatment,control)$p.val
})
head(pvals)
## [1] 0.9758457 0.4723582 0.2068672 0.7023475 0.9407852 0.0723841
hist(pvals)

Q: What proportion of the p-values is below 0.05?

sum(pvals<0.05)/1000
## [1] 0.041
#or

mean(pvals < 0.05)
## [1] 0.041

A: 0.041

Q: What proportion of the p-values is below 0.01?

mean(pvals < 0.01)
## [1] 0.008

A: 0.008

Q: Assume you are testing the effectiveness of 20 diets on mice weight. For each of the 20 diets you run an experiment with 10 control mice and 10 treated mice. Assume that the null hypothesis, that the diet has no effect, is true for all 20 diets, and that mice weights follow a normal distribution with mean 30 grams and a standard deviation of 2 grams. Run a Monte Carlo simulation for one of these studies:

cases = rnorm(10,30,2)
controls = rnorm(10,30,2)
t.test(cases,controls)
## 
##  Welch Two Sample t-test
## 
## data:  cases and controls
## t = 0.16473, df = 17.934, p-value = 0.871
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.708669  1.999327
## sample estimates:
## mean of x mean of y 
##  30.23172  30.08639

Now run a Monte Carlo simulation imitating the results for the experiment for all 20 diets. If you set the seed at 100, set.seed(100), and use the same code as above inside a call to replicate, how many of the p-values (number, not proportion) are below 0.05?

set.seed(100)
pvals<- replicate(20,{
  cases = rnorm(10,30,2)
  controls = rnorm(10,30,2)
  t.test(cases,controls)$p.value
})

sum(pvals<=0.05)
## [1] 1

A:1

Q: Now create a simulation to learn about the distribution of the number of p-values that are less than 0.05. In question 1.2.3 we ran the 20 diet experiment once. Now we will run these 20 experiments 1,000 times and each time save the number of p-values that are less than 0.05.

Set the seed at 100 again, set.seed(100), run the code from Question 1.2.3 1,000 times, and save the number of times the p-value is less than 0.05 for each of the 1,000 instances.

What is the average of these 1,000 numbers? Note that this is the expected number of tests (out of the 20 we run) that we will reject when the null is true. (Hint: use replicate twice)

set.seed(100)
B = 1000
plessthan = replicate(B,{
  pvals = replicate(20,{
    cases = rnorm(10,30,2)
    controls = rnorm(10,30,2)
    t.test(cases,controls)$p.value
    })
  sum(pvals<=0.05)
})
table(plessthan) ##just for illustration
## plessthan
##   0   1   2   3   4   6 
## 354 374 193  71   7   1
mean(plessthan)
## [1] 1.007

A:1.007

Q: Note that what the answer to question #4 says is that, on average, we expect some p-values to be below 0.05 even when the null is true for all diets.

Using the same simulation data from the question above, for what proportion of the 1,000 replications do we reject the null hypothesis at least once (more than 0 false positives)? (Enter your response as a decimal value – i.e. 0.10 for 10%.)

(1000-354)/1000
## [1] 0.646
#or

#plessthan contains, for each of the 1,000 replications, the number of p-values that were less than 0.05. Now we want to know for what proportion of these, this happened at least once:

mean(plessthan>0)
## [1] 0.646

A:0.646


Week 2: Error rates and procedures

Multiple testing: Error Rates and Procedures

Procedures

In the previous section we learned that p-values are no longer a useful quantity to interpret when dealing with high-dimensional data. This is because we are testing many features at the same time. We refer to this as the multiple comparison, multiple testing, or multiplicity problem. The definition of a p-value does not provide a useful quantification here because when we test many hypotheses simultaneously, a list based simply on a small p-value cut-off of, say 0.01, can result in many false positives with high probability. Here we define terms that are more appropriate in the context of high-throughput data.

The most widely used approach to the multiplicity problem is to define a procedure and then estimate or control an informative error rate for this procedure. What we mean by control here is that we adapt the procedure to guarantee an error rate below a predefined value. The procedures are typically flexible through parameters or cutoffs that let us control specificity and sensitivity. An example of a procedure is:

  • Compute a p-value for each gene.
  • Call significant all genes with p-values smaller than \(\alpha\).

Note that changing the \(\alpha\) permits us to adjust specificity and sensitivity.
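For concreteness, here is a minimal sketch of this procedure written as an R function. The argument names dat (a features-by-samples matrix) and g (a group factor) are hypothetical, and we use the rowttests function from the genefilter package introduced earlier:

library(genefilter)
callSignificant <- function(dat, g, alpha=0.05){
  pvals <- rowttests(dat, g)$p.value  ##one p-value per feature
  which(pvals < alpha)                ##features called significant at level alpha
}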

Next we define the error rates that we will try to estimate and control.

Error Rates

Throughout this section we will be using the type I error and type II error terminology. We will also refer to them as false positives and false negatives respectively. We also use the more general terms specificity, which relates to type I error, and sensitivity, which relates to type II errors.

In the context of high-throughput data we can make several type I errors and several type II errors in one experiment, as opposed to one or the other as seen in Chapter 1. In this table, we summarize the possibilities using the notation from the seminal paper by Benjamini-Hochberg:

                     Called significant   Not called significant   Total
Null true            \(V\)                \(m_0-V\)                \(m_0\)
Alternative true     \(S\)                \(m_1-S\)                \(m_1\)
Total                \(R\)                \(m-R\)                  \(m\)

To describe the entries in the table, let’s use as an example a dataset representing measurements from 10,000 genes, which means that the total number of tests that we are conducting is \(m=10,000\). The number of genes for which the null hypothesis is true, which in most cases represents the “non-interesting” genes, is \(m_0\), while the number of genes for which the null hypothesis is false is \(m_1\). For these we can also say that the alternative hypothesis is true. In general, we are interested in detecting as many of the cases for which the alternative hypothesis is true (true positives) as possible, without incorrectly detecting cases for which the null hypothesis is true (false positives). For most high-throughput experiments, we assume that \(m_0\) is much greater than \(m_1\). For example, we test 10,000 genes expecting 100 or fewer to be interesting. This would imply that \(m_1 \leq 100\) and \(m_0 \geq 9,900\).

Throughout this chapter we refer to features as the units being tested. In genomics, examples of features are genes, transcripts, binding sites, CpG sites, and SNPs. In the table, \(R\) represents the total number of features that we call significant after applying our procedure, while \(m-R\) is the total number of genes we don’t call significant. The rest of the table contains important quantities that are unknown in practice.

  • \(V\) represents the number of type I errors or false positives. Specifically, \(V\) is the number of features for which the null hypothesis is true, that we call significant.
  • \(S\) represents the number of true positives. Specifically, \(S\) is the number of features for which the alternative is true, that we call significant.

This implies that there are \(m_1-S\) type II errors or false negatives and \(m_0-V\) true negatives.

Keep in mind that if we only ran one test, a p-value is simply the probability that \(V=1\) when \(m=m_0=1\). Power is the probability of \(S=1\) when \(m=m_1=1\). In this very simple case, we wouldn’t bother making the table above, but now we show how defining the terms in the table helps in the high-dimensional context.

Error Rates and Procedures Examples

Data example

Let’s compute these quantities with a data example. We will use a Monte Carlo simulation using our mice data to imitate a situation in which we perform tests for 10,000 different fad diets, none of them having an effect on weight. This implies that the null hypothesis is true for all diets and thus \(m=m_0=10,000\) and \(m_1=0\). Let’s run the tests with a sample size of \(N=12\) and compute \(R\). Our procedure will declare any diet achieving a p-value smaller than \(\alpha=0.05\) as significant.

set.seed(1)
population = unlist( read.csv("femaleControlsPopulation.csv") )
alpha <- 0.05
N <- 12
m <- 10000
pvals <- replicate(m,{
  control = sample(population,N)
  treatment = sample(population,N)
  t.test(treatment,control)$p.value
})

Although in practice we do not know the fact that no diet works, in this simulation we do, and therefore we can actually compute \(V\) and \(S\). Because all null hypotheses are true, we know, in this specific simulation, that \(V=R\). Of course, in practice we can compute \(R\) but not \(V\).

sum(pvals < 0.05) ##This is R
## [1] 462

This many false positives is not acceptable in most contexts.

Here is more complicated code showing results where 10% of the diets are effective with an average effect size of \(\Delta= 3\) ounces. Studying this code carefully will help understand the meaning of the table above. First let’s define the truth:

alpha <- 0.05
N <- 12
m <- 10000
p0 <- 0.90 ##10% of diets work, 90% don't
m0 <- m*p0
m1 <- m-m0
nullHypothesis <- c( rep(TRUE,m0), rep(FALSE,m1))
delta <- 3

Now we are ready to simulate 10,000 tests, perform a t-test on each, and record if we rejected the null hypothesis or not:

set.seed(1)
calls <- sapply(1:m, function(i){
  control <- sample(population,N)
  treatment <- sample(population,N)
  if(!nullHypothesis[i]) treatment <- treatment + delta
  ifelse( t.test(treatment,control)$p.value < alpha, 
          "Called Significant",
          "Not Called Significant")
})

Because in this simulation we know the truth (saved in nullHypothesis), we can compute the entries of the table:

null_hypothesis <- factor( nullHypothesis, levels=c("TRUE","FALSE"))
table(null_hypothesis,calls)
##                calls
## null_hypothesis Called Significant Not Called Significant
##           TRUE                 421                   8579
##           FALSE                520                    480

The first column of the table above shows us \(V\) and \(S\). Note that \(V\) and \(S\) are random variables. If we run the simulation repeatedly, these values change. Here is a quick example:

B <- 10 ##number of simulations
VandS <- replicate(B,{
  calls <- sapply(1:m, function(i){
    control <- sample(population,N)
    treatment <- sample(population,N)
    if(!nullHypothesis[i]) treatment <- treatment + delta
    t.test(treatment,control)$p.val < alpha
  })
  cat("V =",sum(nullHypothesis & calls), "S =",sum(!nullHypothesis & calls),"\n")
  c(sum(nullHypothesis & calls),sum(!nullHypothesis & calls))
  })
## V = 410 S = 564 
## V = 400 S = 552 
## V = 366 S = 546 
## V = 382 S = 553 
## V = 372 S = 505 
## V = 382 S = 530 
## V = 381 S = 539 
## V = 396 S = 554 
## V = 380 S = 550 
## V = 405 S = 569

This motivates the definition of error rates. We can, for example, estimate the probability that \(V\) is larger than 0. This is interpreted as the probability of making at least one type I error among the 10,000 tests. In the simulation above, \(V\) was much larger than 1 in every single simulation, so we suspect this probability is practically 1. When \(m=1\), this probability is equivalent to the p-value. When we have a multiple tests situation, we call it the Family Wise Error Rate (FWER), and it relates to a technique that is widely used: the Bonferroni correction.
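Using the VandS matrix from the simulation above (its first row holds the \(V\) values), a Monte Carlo estimate of \(\mbox{Pr}(V>0)\) is:

mean( VandS[1,] > 0 )  ##1 here: every simulation produced at least one false positive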

Error Rates and Procedures Exercises

In this assessment we hope to help you further grasp the concept that p-values are random variables and start laying the ground work for the development of procedures that control error rates. The calculations to compute error rates require us to understand the random behavior of p-values.

We are going to ask you to perform some calculations related to introductory probability theory. One particular concept you need to grasp is statistical independence. You also will need to know that the probability of two statistically independent random events both occurring is P(A and B) = P(A)P(B). Note that this is a consequence of the more general formula P(A and B) = P(A)P(B | A), since independence means P(B | A) = P(B).

Q: Assume the null is true and denote the p-value you would get if you ran a test as \(P\). Define the function \(f(x)=\mbox{Pr}(P\le x)\). What does \(f(x)\) look like?

A: The identity line.
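We can see this empirically with a toy simulation (hypothetical null data: two samples of 12 standard normals per test):

set.seed(1)
p <- replicate(1000, t.test(rnorm(12), rnorm(12))$p.value)
plot(ecdf(p))        ##the empirical version of f(x)
abline(0, 1, col=2)  ##the ECDF hugs the identity line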

Q:In the previous assessment we saw how the probability of incorrectly rejecting the null for at least one of 20 experiments for which the null is true is well over 5%. Now let’s consider a case in which we run thousands of tests as we would do in a high throughput experiment.

We previously learned that under the null, the probability of a p-value < p is p. If we run 8,793 independent tests, what is the probability of incorrectly rejecting at least one of the null hypotheses?

B<-1000
minpval <- replicate(B, min(runif(8793,0,1))<0.05)
mean(minpval>=1)
## [1] 1

A: 1
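The same answer can be obtained analytically: with 8,793 independent tests, the probability of at least one rejection at the 0.05 cutoff is \(1-(1-0.05)^{8793}\):

1 - 0.95^8793  ##equal to 1 up to machine precision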


Error Rates and Procedures Exercises #3 (Sidak’s procedure)

Q: Suppose we need to run 8,793 statistical tests and we want to make the probability of at least one mistake very small, say 5%. Using the answer to exercise #2, how small do we have to make the cutoff, previously 0.05, to lower the probability of at least one mistake to 5%?

##warning this can take several minutes
##and will only give an approximate answer
B=10000
cutoffs = 10^seq(-7,-4,0.1) ##we know it has to be small
prob = sapply(cutoffs,function(cutoff){
    minpval =replicate(B, min(runif(8793,0,1))<=cutoff)
    mean(minpval>=1)
    })
cutoffs[which.min(abs(prob-0.05))]
## [1] 6.309573e-06

A: 0.000005833407
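The exact cutoff can also be computed directly from Sidak’s formula \(k = 1-(1-\alpha)^{1/m}\), which the simulation above only approximates:

1 - (1 - 0.05)^(1/8793)  ##about 5.833407e-06, the exact answer above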


Vectorizing code

library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- "femaleControlsPopulation.csv"
if (!file.exists(filename)) download(url,destfile=filename)

set.seed(1) 
population = unlist( read.csv("femaleControlsPopulation.csv") )

#To give an example of how we can simulate V and S we constructed a simulation with:

alpha <- 0.05
N <- 12
m <- 10000
p0 <- 0.90 ##10% of diets work, 90% don't
m0 <- m*p0
m1 <- m-m0
nullHypothesis <- c( rep(TRUE,m0), rep(FALSE,m1))
delta <- 3

#We then ran a Monte Carlo simulation by repeating a procedure in which 10,000 tests were run one by one using sapply.

B <- 10 ##number of simulations 
system.time(
VandS <- replicate(B,{
  calls <- sapply(1:m, function(i){
    control <- sample(population,N)
    treatment <- sample(population,N)
    if(!nullHypothesis[i]) treatment <- treatment + delta
    t.test(treatment,control)$p.val < alpha
  })
  c(sum(nullHypothesis & calls),sum(!nullHypothesis & calls))
  })
)
##    user  system elapsed 
##   73.27    0.02   74.06
#In each iteration we checked if that iteration was associated with the null or alternative hypothesis. We did this with the line

#if(!nullHypothesis[i]) treatment <- treatment + delta

#In R, operations based on matrices are typically much faster than operations performed within loops or sapply. We can vectorize the code to make it go much faster. This means that instead of using sapply to run m tests, we will create a matrix with all data in one call to sample.

#This code runs several times faster than the code above, which is necessary here due to the fact that we will be generating several simulations. Understanding this chunk of code, and how it is equivalent to the code above using sapply, will take you a long way in helping you code efficiently in R.

library(genefilter) ##rowttests is here
set.seed(1)
##Define groups to be used with rowttests
g <- factor( c(rep(0,N),rep(1,N)) )
B <- 10 ##number of simulations
system.time(
VandS <- replicate(B,{
  ##matrix with control data (rows are tests, columns are mice)
  controls <- matrix(sample(population, N*m, replace=TRUE),nrow=m)
  
  ##matrix with treatment data (rows are tests, columns are mice)
  treatments <-  matrix(sample(population, N*m, replace=TRUE),nrow=m)
  
  ##add effect to 10% of them
  treatments[which(!nullHypothesis),]<-treatments[which(!nullHypothesis),]+delta
  
  ##combine to form one matrix
  dat <- cbind(controls,treatments)
  
 calls <- rowttests(dat,g)$p.value < alpha
 
  c(sum(nullHypothesis & calls),sum(!nullHypothesis & calls))
})
)
##    user  system elapsed 
##    0.26    0.00    0.27
#Note that this code is about 100 times faster!

The Bonferroni Correction

Now that we have learned about the Family Wide Error Rate (FWER), we describe what we can actually do to control it. In practice, we want to choose a procedure that guarantees the FWER is smaller than a predetermined value such as 0.05. We can keep it general and instead of 0.05, use \(\alpha\) in our derivations.

Since we are now describing what we do in practice, we no longer have the advantage of knowing the truth. Instead, we pose a procedure and try to estimate the FWER. Let’s consider the naive procedure: “reject all the hypotheses with p-value < 0.01”. For illustrative purposes we will assume all the tests are independent (in the case of testing diets this is a safe assumption; in the case of genes it is not so safe since some groups of genes act together). Let \(p_1,\dots,p_{10000}\) be the p-values we get from each test. These are independent random variables so:

\[ \begin{align*} \mbox{Pr}(\mbox{at least one rejection}) &= 1 -\mbox{Pr}(\mbox{no rejections}) \\ &= 1 - \prod_{i=1}^{10000} \mbox{Pr}(p_i>0.01) \\ &= 1-0.99^{10000} \approx 1 \end{align*} \]

Or if you want to use simulations:

B<-10000
minpval <- replicate(B, min(runif(10000,0,1))<0.01)
mean(minpval>=1)
## [1] 1

So our FWER is 1! This is not what we were hoping for. If we wanted it to be lower than \(\alpha=0.05\), we failed miserably.

So what do we do to make the probability of a mistake lower than \(\alpha\) ? Using the derivation above we can change the procedure by selecting a more stringent cutoff, previously 0.01, to lower our probability of at least one mistake to be 5%. Namely, by noting that:

\[\mbox{Pr}(\mbox{at least one rejection}) = 1-(1-k)^{10000}\]

and solving for \(k\), we get \(1-(1-k)^{10000}=0.05 \implies k = 1-0.95^{1/10000} \approx 0.000005\)
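As a quick sanity check of this derivation in R:

1 - 0.95^(1/10000)  ##about 5.1e-06, the 0.000005 quoted above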

This now gives a specific example of a procedure. This one is actually called Sidak’s procedure. Specifically, we define a set of instructions, such as “reject all the null hypotheses for which p-values < 0.000005”. Then, knowing the p-values are random variables, we use statistical theory to compute how many mistakes, on average, we are expected to make if we follow this procedure. More precisely, we compute bounds on these rates; that is, we show that they are smaller than some predetermined value. There is a preference in the life sciences to err on the side of being conservative.

A problem with Sidak’s procedure is that it assumes the tests are independent. It therefore only controls FWER when this assumption holds. The Bonferroni correction is more general in that it controls FWER even if the tests are not independent. As with Sidak’s procedure we start by noting that:

\[FWER = \mbox{Pr}(V>0) \leq \mbox{Pr}(V>0 \mid \mbox{all nulls are true})\]

or using the notation from the table above:

\[\mbox{Pr}(V>0) \leq \mbox{Pr}(V>0 \mid m_1=0)\]

The Bonferroni procedure sets \(k=\alpha/m\) since we can show that:

\[ \begin{align*} \mbox{Pr}(V>0 \,\mid \, m_1=0) &= \mbox{Pr}\left( \min_i \{p_i\} \leq \frac{\alpha}{m} \mid m_1=0 \right)\\ &\leq \sum_{i=1}^m \mbox{Pr}\left(p_i \leq \frac{\alpha}{m} \right)\\ &= m \frac{\alpha}{m}=\alpha \end{align*} \]
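The resulting Bonferroni cutoff for this example is simply:

0.05/10000  ##k = 5e-06, slightly smaller (more conservative) than Sidak's cutoff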

Controlling the FWER at 0.05 is a very conservative approach. Using the p-values computed in the previous section…

set.seed(1)
pvals <- sapply(1:m, function(i){
  control <- sample(population,N)
  treatment <- sample(population,N)
  if(!nullHypothesis[i]) treatment <- treatment + delta
  t.test(treatment,control)$p.value
})

…we note that only:

sum(pvals < 0.05/10000)
## [1] 2

are called significant after applying the Bonferroni procedure, despite having 1000 diets that work.

Bonferroni Correction Exercises

This assessment should help you understand the concept of an error-controlling procedure. You can think of it as defining a set of instructions, such as “reject all the null hypotheses for which p-values < 0.0001” or “reject the null hypothesis for the 10 features with smallest p-values”. Then, knowing the p-values are random variables, we use statistical theory to compute how many mistakes, on average, we will make if we follow this procedure. More precisely, we commonly compute bounds on these rates, meaning that we show that they are smaller than some predetermined value.

As described in the video, we can compute different error rates. The FWER tells us the probability of having at least one false positive. The FDR is the expected proportion of false positives among the features called significant.

Note 1: the FWER and FDR are not procedures but error rates. We will review procedures here and use Monte Carlo simulations to estimate their error rates.

Note 2: We sometimes use the colloquial term “pick genes that” meaning “reject the null hypothesis for genes that.”

Bonferroni Correction Exercises #1 (Bonferroni versus Sidak)


EXPLANATION:

alphas <- seq(0,0.25,0.01)
par(mfrow=c(2,2))
for(m in c(2,10,100,1000)){
  plot(alphas,alphas/m - (1-(1-alphas)^(1/m)),type="l")
  abline(h=0,col=2,lty=2)
}

Note that for small values of \(\alpha\) the two cutoffs are very similar. This is because, by a Taylor approximation, \((1-\alpha)^{1/m} \approx 1-\alpha/m\), so Sidak’s cutoff \(1-(1-\alpha)^{1/m}\) is approximately the Bonferroni cutoff \(\alpha/m\).

Bonferroni Correction Exercises #2 (Monte Carlo Simulation)

Monte Carlo simulation. To simulate the p-value results of, say, 8,793 t-tests for which the null is true, we don’t actually have to generate the original data. As we learned in class, null p-values are uniformly distributed, so we can generate them directly like this:
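pvals <- runif(8793, 0, 1)  ##one simulated set of 8,793 null p-values (a minimal sketch; the solution below draws a full matrix)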

Using what we have learned, set the cutoff using the Bonferroni correction that guarantees an FWER lower than 0.05 and report back the FWER. Set the seed at 1, set.seed(1), and run 10,000 simulations. Report the Monte Carlo estimate of the FWER below.

set.seed(1)
B <- 10000
m <- 8793
alpha <- 0.05
pvals <- matrix(runif(B*m,0,1),B,m)
k <- alpha/m
mistakes <- rowSums(pvals<k) 
mean(mistakes>0)
## [1] 0.0467

A:0.0467

Bonferroni Correction Exercises #3

Using the same seed repeat the above for Sidak’s cutoff.

Report the FWER below.

##if pvals already defined no need to rerun this
set.seed(1)
B <- 10000
m <- 8793
alpha <- 0.05
pvals <- matrix(runif(B*m,0,1),B,m)
k <- (1-(1-alpha)^(1/m))
mistakes <- rowSums(pvals<k) 
mean(mistakes>0)
## [1] 0.0475

A:0.0475

EXPLANATION: Note that this error rate should be bigger than the Bonferroni one, as we showed that Bonferroni is more conservative.

False Discovery Rate

There are many situations for which requiring an FWER of 0.05 does not make sense as it is much too strict. For example, consider the very common exercise of running a preliminary small study to determine a handful of candidate genes. This is referred to as a discovery driven project or experiment. We may be in search of an unknown causative gene and more than willing to perform follow-up studies with many more samples on just the candidates. If we develop a procedure that produces, for example, a list of 10 genes of which 1 or 2 pan out as important, the experiment is a resounding success. With a small sample size, the only way to achieve an FWER \(\leq\) 0.05 is with an empty list of genes. We already saw in the previous section that despite 1,000 diets being effective, we ended up with a list of just 2. Change the sample size to 6 and you very likely get 0:

set.seed(1)
pvals <- sapply(1:m, function(i){
  control <- sample(population,6)
  treatment <- sample(population,6)
  if(!nullHypothesis[i]) treatment <- treatment + delta
  t.test(treatment,control)$p.value
  })
sum(pvals < 0.05/10000)
## [1] 0

By requiring an FWER \(\leq\) 0.05, we are practically assuring 0 power (sensitivity). In many applications, this specificity requirement is overkill. A widely used alternative to the FWER is the false discovery rate (FDR). The idea behind FDR is to focus on the random variable \(Q \equiv V/R\) with \(Q=0\) when \(R=0\) and \(V=0\). Note that \(R=0\) (nothing called significant) implies \(V=0\) (no false positives). So \(Q\) is a random variable that can take values between 0 and 1 and we can define a rate by considering the average of \(Q\). To better understand this concept, we compute \(Q\) for the procedure: call everything with p-value < 0.05 significant.
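As a toy numeric illustration of \(Q\) (hypothetical counts, not from the simulation below):

R <- 10; V <- 2        ##say 10 features called significant, 2 of them truly null
ifelse(R > 0, V/R, 0)  ##Q = 0.2; when R is 0, Q is defined to be 0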

Vectorizing code

Before running the simulation, we are going to vectorize the code. This means that instead of using sapply to run m tests, we will create a matrix with all the data in one call to sample. This code runs several times faster than the code above, which is necessary here because we will be generating several simulations. Understanding this chunk of code, and how it is equivalent to the code above using sapply, will take you a long way in helping you code efficiently in R.

library(downloader)
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- "femaleControlsPopulation.csv"
if (!file.exists(filename)) download(url,destfile=filename)

set.seed(1) 
population = unlist( read.csv("femaleControlsPopulation.csv") )

library(genefilter) ##rowttests is here
set.seed(1)

alpha <- 0.05
N <- 12
m <- 10000
p0 <- 0.90 ##10% of diets work, 90% don't
m0 <- m*p0
m1 <- m-m0
nullHypothesis <- c( rep(TRUE,m0), rep(FALSE,m1))
delta <- 3

##Define groups to be used with rowttests
g <- factor( c(rep(0,N),rep(1,N)) )
B <- 1000 ##number of simulations
Qs <- replicate(B,{
  ##matrix with control data (rows are tests, columns are mice)
  controls <- matrix(sample(population, N*m, replace=TRUE),nrow=m)
  
  ##matrix with treatment data (rows are tests, columns are mice)
  treatments <-  matrix(sample(population, N*m, replace=TRUE),nrow=m)
  
  ##add effect to 10% of them
  treatments[which(!nullHypothesis),]<-treatments[which(!nullHypothesis),]+ delta
  
  ##combine to form one matrix
  dat <- cbind(controls,treatments)
  
 calls <- rowttests(dat,g)$p.value < alpha
 R=sum(calls)
 V=sum(nullHypothesis & calls)
 Q=ifelse(R>0,V/R,0)
 return(Q)
})

Controlling FDR

The code above is a Monte Carlo simulation that generates 10,000 tests 1,000 times, each time saving the observed \(Q\). Here is a histogram of these values:

library(rafalib)
mypar(1,1)
hist(Qs) ##Q is a random variable, this is its distribution
Q (false positives divided by number of features called significant) is a random variable. Here we generated its distribution with a Monte Carlo simulation.

The FDR is the average value of \(Q\)

FDR=mean(Qs)
print(FDR)
## [1] 0.4463354

The FDR is relatively high here. This is because the null hypothesis is true for 90% of the tests. This implies that with a 0.05 p-value cut-off, out of 100 tests we incorrectly call between 4 and 5 significant on average. This, combined with the fact that we don’t “catch” all the cases where the alternative is true, gives us a relatively high FDR. So how can we control this? What if we want a lower FDR, say 5%?

To visually see why the FDR is high, we can make a histogram of the p-values. We use a higher value of m to have more data for the histogram. We draw a horizontal line representing the uniform distribution one gets for the m0 cases for which the null is true.

set.seed(1)
controls <- matrix(sample(population, N*m, replace=TRUE),nrow=m)
treatments <-  matrix(sample(population, N*m, replace=TRUE),nrow=m)
treatments[which(!nullHypothesis),]<-treatments[which(!nullHypothesis),]+delta
dat <- cbind(controls,treatments)
pvals <- rowttests(dat,g)$p.value 

h <- hist(pvals,breaks=seq(0,1,0.05))
polygon(c(0,0.05,0.05,0),c(0,0,h$counts[1],h$counts[1]),col="grey")
abline(h=m0/20)
Histogram of p-values. Monte Carlo simulation was used to generate data with m_1 genes having differences between groups.

The first bar (grey) on the left represents cases with p-values smaller than 0.05. From the horizontal line we can infer that about half are false positives. This is in agreement with an FDR of 0.50. If we look at the bar for 0.01, we can see a lower FDR, as expected, but we would call fewer features significant.

h <- hist(pvals,breaks=seq(0,1,0.01))
polygon(c(0,0.01,0.01,0),c(0,0,h$counts[1],h$counts[1]),col="grey")
abline(h=m0/100)
Histogram of p-values with breaks at every 0.01. Monte Carlo simulation was used to generate data with m_1 genes having differences between groups.

As we consider a lower and lower p-value cut-off, the number of features detected decreases (loss of sensitivity), but our FDR also decreases (gain of specificity). So how do we decide on this cut-off? One approach is to set a desired FDR level \(\alpha\), and then develop procedures that control the error rate: FDR \(\leq \alpha\).

Benjamini-Hochberg (Advanced)

We want to construct a procedure that guarantees the FDR to be below a certain level \(\alpha\). For any given \(\alpha\), the Benjamini-Hochberg (1995) procedure is very practical because it simply requires that we are able to compute p-values for each of the individual tests and this permits a procedure to be defined.

For this procedure, sort the p-values in increasing order: \(p_{(1)},\dots,p_{(m)}\). Then define \(k\) to be the largest \(i\) for which

\[p_{(i)} \leq \frac{i}{m}\alpha\]

The procedure is to reject tests with p-values smaller or equal to \(p_{(k)}\). Here is an example of how we would select the \(k\) with code using the p-values computed above:

alpha <- 0.05
i = seq(along=pvals)

mypar(1,2)
plot(i,sort(pvals))
abline(0,alpha/m) ##the Benjamini-Hochberg bound: intercept 0, slope alpha/m
##close-up
plot(i[1:15],sort(pvals)[1:15],main="Close-up")
abline(0,alpha/m)
P-values plotted against their rank illustrate the Benjamini-Hochberg procedure. The plot on the right is a close-up of the plot on the left.

k <- max( which( sort(pvals) <= i/m*alpha) )
cutoff <- sort(pvals)[k]
cat("k =",k,"p-value cutoff=",cutoff)
## k = 11 p-value cutoff= 3.763357e-05

We can show mathematically that this procedure has FDR lower than 5%. Please see Benjamini-Hochberg (1995) for details. An important outcome is that we now have selected 11 tests instead of just 2. If we are willing to set an FDR of 50% (meaning we expect about half the genes on the list to be true positives), then this list grows to 1063. The FWER does not provide this flexibility since any list of substantial size will result in an FWER of 1.

Keep in mind that we don’t have to run the complicated code above as we have functions to do this. For example, using the p-values pvals computed above, we simply type the following:

fdr <- p.adjust(pvals, method="fdr")
mypar(1,1)
plot(pvals,fdr,log="xy")
abline(h=alpha,v=cutoff) ##cutoff was computed above
FDR estimates plotted against p-value.

We can run a Monte-Carlo simulation to confirm that the FDR is in fact lower than .05. We compute all p-values first, and then use these to decide which get called.

alpha <- 0.05
B <- 1000 ##number of simulations. We should increase for more precision
res <- replicate(B,{
  controls <- matrix(sample(population, N*m, replace=TRUE),nrow=m)
  treatments <-  matrix(sample(population, N*m, replace=TRUE),nrow=m)
  treatments[which(!nullHypothesis),]<-treatments[which(!nullHypothesis),]+delta
  dat <- cbind(controls,treatments)
  pvals <- rowttests(dat,g)$p.value 
  ##then the FDR
  calls <- p.adjust(pvals,method="fdr") < alpha
  R=sum(calls)
  Q=ifelse(R>0,sum(nullHypothesis & calls)/R,0)
  return(c(R,Q))
})
Qs <- res[2,]
mypar(1,1)
hist(Qs) ##Q is a random variable, this is its distribution
Histogram of Q (false positives divided by number of features called significant) when the alternative hypothesis is true for some features.

FDR=mean(Qs)
print(FDR)
## [1] 0.03813818

The FDR is lower than 0.05. This is to be expected because we need to be conservative to assure the FDR \(\leq\) 0.05 for any value of \(m_0\), such as for the extreme case where every hypothesis tested is null: \(m=m_0\). If you re-do the simulation above for this case, you will find that the FDR increases.
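Here is a sketch of that re-run, reusing population, N, m, g, alpha, and B from above. With no signal added, the null holds for all m features, so every rejection is a false positive and Q is 1 whenever the list is non-empty:

res0 <- replicate(B,{
  controls <- matrix(sample(population, N*m, replace=TRUE),nrow=m)
  treatments <- matrix(sample(population, N*m, replace=TRUE),nrow=m)
  ##no delta added: the null hypothesis is true for every feature
  pvals <- rowttests(cbind(controls,treatments),g)$p.value
  calls <- p.adjust(pvals,method="fdr") < alpha
  ifelse(sum(calls)>0, 1, 0) ##Q = V/R is 1 whenever R > 0
})
mean(res0) ##estimated FDR when m=m0, now much closer to alpha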

We should also note that in …

Rs <- res[1,]
mean(Rs==0)*100
## [1] 0.7

… percent of the simulations, we did not call any genes significant.

Finally, note that the p.adjust function has several options for error rate controlling procedures:

p.adjust.methods
## [1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
## [6] "BY"         "fdr"        "none"

It is important to remember that these options not only offer different approaches to estimating error rates, but also estimate different error rates: namely the FWER and the FDR. This is an important distinction. More information is available from:

#?p.adjust

In summary, requiring FDR \(\leq\) 0.05 is a much more lenient requirement than FWER \(\leq\) 0.05. Although we will end up with more false positives, FDR gives us much more power. This makes it particularly appropriate for discovery phase experiments where we may accept FDR levels much higher than 0.05.

Direct Approach to FDR and q-values (Advanced)

Here we review the results described by John D. Storey in J. R. Statist. Soc. B (2002). One major distinction between Storey’s approach and Benjamini and Hochberg’s is that we are no longer going to set an \(\alpha\) level a priori. Because in many high-throughput experiments we are interested in obtaining some list for validation, we can instead decide beforehand that we will consider all tests with p-values smaller than 0.01. We then want to attach an estimate of an error rate. Using this approach, we are guaranteed to have \(R>0\). Note that in the FDR definition above we assigned \(Q=0\) in the case that \(R=V=0\). We were therefore computing:

\[ \mbox{FDR} = E\left( \frac{V}{R} \mid R>0\right) \mbox{Pr}(R>0) \]

In the approach proposed by Storey, we condition on having a non-empty list, which implies \(R>0\), and we instead compute the positive FDR

\[ \mbox{pFDR} = E\left( \frac{V}{R} \mid R>0\right) \]

A second distinction is that while Benjamini and Hochberg’s procedure controls the FDR under the worst case scenario, in which all null hypotheses are true ( \(m=m_0\) ), Storey proposes that we actually try to estimate \(m_0\) from the data. Because in high-throughput experiments we have so much data, this is certainly possible. The general idea is to pick a relatively high p-value cut-off, call it \(\lambda\), and assume that tests obtaining p-values > \(\lambda\) are mostly from cases in which the null hypothesis holds. We can then estimate \(\pi_0 = m_0/m\) as:

\[ \hat{\pi}_0 = \frac{\#\left\{p_i > \lambda \right\} }{ (1-\lambda) m } \]

There are more sophisticated procedures than this, but they follow the same general idea. Here is an example setting \(\lambda=0.1\). Using the p-values computed above we have:

hist(pvals,breaks=seq(0,1,0.05),freq=FALSE)
lambda = 0.1
pi0=sum(pvals> lambda) /((1-lambda)*m)
abline(h= pi0)
p-value histogram with pi0 estimate.

print(pi0) ##this is close to the true pi0=0.9
## [1] 0.9311111

With this estimate in place we can, for example, alter the Benjamini and Hochberg procedure to select the \(k\) to be the largest value so that:

\[\hat{\pi}_0 p_{(i)} \leq \frac{i}{m}\alpha\]

However, instead of doing this, we compute a q-value for each test. If a feature resulted in a p-value of \(p\), the q-value is the estimated pFDR for a list of all the features with a p-value at least as small as \(p\).
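Before turning to the package, here is a minimal manual sketch of this definition, reusing the pvals and the pi0 estimate computed above (an illustration only, not the qvalue package’s actual algorithm). For the \(i\)-th smallest p-value, the estimated pFDR of the corresponding list is \(\hat{\pi}_0 m p_{(i)}/i\), and a running minimum from the largest p-value down makes the q-values monotone:

o <- order(pvals)
pfdr <- pi0 * length(pvals) * pvals[o] / seq_along(pvals) ##pFDR for each nested list
qvals_manual <- numeric(length(pvals))
qvals_manual[o] <- rev(cummin(rev(pfdr))) ##enforce monotonicity in p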

In R, this can be computed with the qvalue function in the qvalue package:

#devtools::install_github("jdstorey/qvalue")
library(qvalue)
res <- qvalue(pvals)
qvals <- res$qvalues
plot(pvals,qvals)
Q-values versus p-values.

We also obtain the estimate of \(\hat{\pi}_0\):

res$pi0
## [1] 0.8813727

This function uses a more sophisticated approach to estimating \(\pi_0\) than what is described above.

Note on estimating \(\pi_0\)

In our experience the estimation of \(\pi_0\) can be unstable and adds a step of uncertainty to the data analysis pipeline. Although more conservative, the Benjamini-Hochberg procedure is computationally more stable.

FDR Exercises

In this assessment we will define error controlling procedures for experimental data. We will make lists of genes based on q-values. We will also assess your understanding of false positive rates and false negative rates by asking you to create a Monte Carlo simulation.

You will need to install the following libraries:

library(devtools)
library(rafalib)
install_github("genomicsclass/GSE5859Subset")
install_bioc("genefilter")
install_bioc("qvalue")

FDR Exercises #1

Q:Load the gene expression data

library(GSE5859Subset)
data(GSE5859Subset)

We are interested in comparing gene expression between the two groups defined in the sampleInfo table.

Compute a p-value for each gene using the function rowttests from the genefilter package in Bioconductor.

library(genefilter)
?rowttests

How many genes have p-values smaller than 0.05?

library(GSE5859Subset)
data(GSE5859Subset)
library(genefilter)

g <- factor(sampleInfo$group)
pvals = rowttests(geneExpression,g)$p.value
sum(pvals<0.05)
## [1] 1383

A:1383

FDR Exercises #2

Q:Apply the Bonferroni correction to the p-values obtained in question #1 to achieve an FWER of 0.05. How many genes are called significant under this procedure?

k = 0.05/length(pvals)
sum(pvals<k)
## [1] 10

A:10

Note that we went from over a thousand to just 10. One practical question we should ask is “are we being too conservative?”. Note that we have not really assessed false negatives at all yet.

FDR Exercises #3

Note that the FDR is a property of a list of features, not of each specific feature. The q-value relates the FDR to an individual feature. To define the q-value, we order the features we tested by p-value, then compute the FDRs for the list with the most significant feature, the two most significant, the three most significant, etc… The FDR of the list with, say, the m most significant tests is defined as the q-value of the m-th most significant feature. In other words, the q-value of a feature is the FDR of the biggest list that includes that feature.

In R, we can compute the q-value using the p.adjust function with the FDR option. Read the help file for p.adjust and then, for our gene expression dataset, compute how many genes achieve an FDR < 0.05

g = factor(sampleInfo$group)
pvals = rowttests(geneExpression,g)$p.value
fdr = p.adjust(pvals,method="fdr")
sum(fdr<0.05)
## [1] 13

A:13

Note that controlling the FDR at 0.05 gives us 3 more genes than the Bonferroni correction. Note that we are controlling two very different error rates. Here we are saying that we think this list of 13 genes has about 5% false positives. The Bonferroni procedure gave us a list of 10 genes that we were quite certain had no false positives. Note again that we have not discussed false negatives.

FDR Exercises #4

Now use the qvalue function, in the Bioconductor qvalue package, to estimate q-values using the procedure described by Storey.

Using this estimate how many genes have q-values below 0.05?

library(qvalue)
res = qvalue(pvals)
qvals = res$qvalues
sum(qvals<0.05)
## [1] 22

A:22

Now the list has increased to 22. However, we are also claiming this list has an FDR of 5%. The reason this list is different is because, as explained in the videos, qvalue estimates the FDR differently and is less conservative. Remember that the theory provides bounds for FDR: it guarantees FDR will be less than 0.05. If qvalue does in fact estimate pi0 well, then it will provide a list with FDR closer to 0.05.

FDR Exercises #5

Read the help file for qvalue and report the estimated proportion of genes for which the null hypothesis is true \(\pi_0 = \frac{m_0}{m}\)

qvalue(pvals)$pi0
## [1] 0.6695739

A:0.6695739

FDR Exercises #6

Note that the number of genes passing the q-value < 0.05 threshold is larger with the qvalue function than with p.adjust.

Why is this the case? Make a plot of the ratio of these two estimates to help answer the question.

plot(qvalue(pvals)$qvalue/p.adjust(pvals,method="fdr"))
abline(h=qvalue(pvals)$pi0,col=2)

#To get an idea of how pi0 is estimated, note that if we look at the histogram, pi0 roughly tells us the proportion that looks about uniform:

hist(pvals,breaks=seq(0,1,len=21))
expectedfreq <- length(pvals)/20 #per bin
abline(h=expectedfreq*qvalue(pvals)$pi0,col=2,lty=2)

A:The qvalue function estimates the proportion of genes for which the null hypothesis is true and provides a less conservative estimate

FDR Exercises #7

Note that this is an advanced question and that you can ask questions in the discussion forum.

Create a Monte Carlo Simulation in which you simulate measurements from 8,793 genes for 24 samples: 12 cases and 12 controls.

n <- 24
m <- 8793
mat <- matrix(rnorm(n*m),m,n)

Now for 500 genes, there is a difference of 2 between cases and controls:

delta <- 2
positives <- 500
mat[1:positives,1:(n/2)] <- mat[1:positives,1:(n/2)]+delta

So the null hypothesis is true for 8793-500 genes. Using the notation from the videos: m=8793, m0=8293 and m1=500.

Set the seed at 1, set.seed(1) and run this experiment 1,000 times with a Monte Carlo simulation. For each instance compute p-values using a t-test (using rowttests in the genefilter package) and create three lists of genes using:

  • Bonferroni correction to achieve an FWER of 0.05,
  • p-adjust estimates of FDR to achieve an FDR of 0.05, and
  • qvalue estimates of FDR to achieve an FDR of 0.05.

For each of these three lists compute the number of false positives in the list and the number of false negatives: genes not in the list that should have been because the null hypothesis is not true (we added 2 to the controls to create the cases).

What is the false positive rate (false positives divided by m0) if we use Bonferroni?

#First let's run the simulation:

set.seed(1)
library(qvalue)
library(genefilter)
n <- 24
m <- 8793
B <- 1000
delta <-2
positives <- 500
g <- factor(rep(c(0,1),each=12))
result <- replicate(B,{
  mat <- matrix(rnorm(n*m),m,n)
  mat[1:positives,1:(n/2)] <- mat[1:positives,1:(n/2)]+delta
  pvals = rowttests(mat,g)$p.val
  ##Bonferroni
  FP1 <- sum(pvals[-(1:positives)]<=0.05/m)  
  FP1
  })
  

#Now to get the false positive rate we divide the false positives by the total number of genes for which the null hypothesis is true:

mean(result/(m-positives))
## [1] 5.305679e-06

A:0.000005305679

Note that this value is much smaller than 0.05. This is because Bonferroni controls the FWER at 0.05, not the FDR. In this case, controlling the FWER at 0.05 gives us a very low false positive rate. This makes intuitive sense: we are trying to avoid even one mistake out of the 8,293 tests for which the null is true, so very few calls are made at all.

FDR Exercises #8

Q:From the same Monte Carlo simulation as in the question above, what is the false negative rate if we use Bonferroni?

set.seed(1)
library(qvalue)
library(genefilter)
n <- 24
m <- 8793
B <- 1000
delta <-2
positives <- 500
g <- factor(rep(c(0,1),each=12))
result <- replicate(B,{
  mat <- matrix(rnorm(n*m),m,n)
  mat[1:positives,1:(n/2)] <- mat[1:positives,1:(n/2)]+delta
  pvals = rowttests(mat,g)$p.val
  ##Bonferroni
  FP1 <- sum(pvals[-(1:positives)]<=0.05/m)  
  FN1 <- sum(pvals[1:positives]>0.05/m)
  c(FP1,FN1)
  })

#We divide the number of false negatives by the total number of tests for which the null hypothesis is false.

mean(result[2,]/(positives))
## [1] 0.763252

A:0.763252

Note that having a very low FDR comes at a cost: namely, an increased false negative rate, in this case 76%. This means we miss the great majority of genes for which the null is not true. This trade-off is always present when we have to pick a cutoff. Understanding the trade-off can help us determine which approach is better in the context of our scientific problem.

FDR Exercises #9

Q:From the same Monte Carlo simulation as in question #7, what is the false positive rate if we use q-values from p.adjust?

#We update our simulation to keep track of false positives from p.adjust:

set.seed(1)
library(qvalue)
library(genefilter)
n <- 24
m <- 8793
B <- 1000
delta <-2
positives <- 500
g <- factor(rep(c(0,1),each=12))
result <- replicate(B,{
  mat <- matrix(rnorm(n*m),m,n)
  mat[1:positives,1:(n/2)] <- mat[1:positives,1:(n/2)]+delta
  pvals = rowttests(mat,g)$p.val
  ##Bonferroni
  FP1 <- sum(pvals[-(1:positives)]<=0.05/m)  
  FN1 <- sum(pvals[1:positives]>0.05/m)
  #p.adjust q-value
  qvals1 = p.adjust(pvals,method="fdr")
  FP2 <- sum(qvals1[-(1:positives)]<=0.05)
  c(FP1,FN1,FP2)
  })
  
#Then to get our false positive rate

mean(result[3,]/(m-positives))
## [1] 0.002737851

A:0.002737851

Note that although much higher than the rate for Bonferroni, the FDR is substantially lower than the 0.05 we were shooting for. This is because the Benjamini-Hochberg procedure gives us a bound. The larger m1 is, the more conservative this approximation will be.

FDR Exercises #10

Q:From the same Monte Carlo simulation as in question #7, what is the false negative rate if we use q-values from p.adjust?

#We update our simulation again:

set.seed(1)
library(qvalue)
library(genefilter)
n <- 24
m <- 8793
B <- 1000
delta <-2
positives <- 500
g <- factor(rep(c(0,1),each=12))
result <- replicate(B,{
  mat <- matrix(rnorm(n*m),m,n)
  mat[1:positives,1:(n/2)] <- mat[1:positives,1:(n/2)]+delta
  pvals = rowttests(mat,g)$p.val
  ##Bonferroni
  FP1 <- sum(pvals[-(1:positives)]<=0.05/m)  
  FN1 <- sum(pvals[1:positives]>0.05/m)
   #p.adjust q-value
  qvals1 = p.adjust(pvals,method="fdr")
  FP2 <- sum(qvals1[-(1:positives)]<=0.05)
  FN2 <- sum(qvals1[1:positives]>0.05)
  c(FP1,FN1,FP2,FN2)
  })
  
#Then our false negative rate is

mean(result[4,]/(positives))
## [1] 0.081418

A:0.081418

Here we see the potential advantage of FDR over FWER, in particular if our goal is discovery. The false negative rate is much reduced, from 0.76 to 0.08.

FDR Exercises #11

Q:From the same Monte Carlo simulation as in question #7, what is the false positive rate if we use q-values from qvalue function?

#We update our simulation:

set.seed(1)
library(qvalue)
library(genefilter)
n <- 24
m <- 8793
B <- 1000
delta <-2
positives <- 500
g <- factor(rep(c(0,1),each=12))
result <- replicate(B,{
  mat <- matrix(rnorm(n*m),m,n)
  mat[1:positives,1:(n/2)] <- mat[1:positives,1:(n/2)]+delta
  pvals = rowttests(mat,g)$p.val
  ##Bonferroni
  FP1 <- sum(pvals[-(1:positives)]<=0.05/m)  
  FN1 <- sum(pvals[1:positives]>0.05/m)
   #p.adjust q-value
  qvals1 = p.adjust(pvals,method="fdr")
  FP2 <- sum(qvals1[-(1:positives)]<=0.05)
  FN2 <- sum(qvals1[1:positives]>0.05)
  #qvalue q-value
  qvals2 = qvalue(pvals)$qvalue
  FP3 <- sum(qvals2[-(1:positives)]<=0.05)
  c(FP1,FN1,FP2,FN2,FP3)
  })
  
#Then the false positive rate is

mean(result[5,]/(m-positives))
## [1] 0.002933076

A:0.002933076

Here we see that by estimating pi0 this approach gets closer to the targeted FDR of 0.05.

FDR Exercises #12

Q:From the same Monte Carlo simulation as in question #7, what is the false negative rate if we use q-values from qvalue function?

#We update our simulation:

set.seed(1)
library(qvalue)
library(genefilter)
n <- 24
m <- 8793
B <- 1000
delta <-2
positives <- 500
g <- factor(rep(c(0,1),each=12))
result <- replicate(B,{
  mat <- matrix(rnorm(n*m),m,n)
  mat[1:positives,1:(n/2)] <- mat[1:positives,1:(n/2)]+delta
  pvals = rowttests(mat,g)$p.val
  ##Bonferroni
  FP1 <- sum(pvals[-(1:positives)]<=0.05/m)  
  FN1 <- sum(pvals[1:positives]>0.05/m)
   #p.adjust q-value
  qvals1 = p.adjust(pvals,method="fdr")
  FP2 <- sum(qvals1[-(1:positives)]<=0.05)
  FN2 <- sum(qvals1[1:positives]>0.05)
  #qvalue q-value
  qvals2 = qvalue(pvals)$qvalue
  FP3 <- sum(qvals2[-(1:positives)]<=0.05)
  FN3 <- sum(qvals2[1:positives]>0.05)  
  c(FP1,FN1,FP2,FN2,FP3,FN3)
  })
  
#Then our false negative rate is

mean(result[6,]/(positives))
## [1] 0.077302

A:0.077302

Here we see that by producing a list with an FDR closer to 0.05, we are less conservative and thus decrease the false negative rate further.


Week 3: Statistical Models

Introduction to Statistical Models

“All models are wrong, but some are useful” -George E. P. Box

When we see a p-value in the literature, it means a probability distribution of some sort was used to quantify the null hypothesis. Many times deciding which probability distribution to use is relatively straightforward. For example, in the tea tasting challenge we can use simple probability calculations to determine the null distribution. Most p-values in the scientific literature are based on sample averages or least squares estimates from a linear model and make use of the CLT to approximate the null distribution of their statistic as normal.

The CLT is backed by theoretical results that guarantee that the approximation is accurate. However, we cannot always use this approximation, such as when our sample size is too small. Previously, we described how the sample average can be approximated as t-distributed when the population data is approximately normal. However, there is no theoretical backing for this assumption. We are now modeling. In the case of height, we know from experience that this turns out to be a very good model.

But this does not imply that every dataset we collect will follow a normal distribution. Some examples are coin tosses, the number of people who win the lottery, and US incomes. The normal distribution is not the only parametric distribution that is available for modeling. Here we provide a very brief introduction to some of the most widely used parametric distributions and some of their uses in the life sciences. We focus on the models and concepts needed to understand the techniques currently used to perform statistical inference on high-throughput data. To do this we also need to introduce the basics of Bayesian statistics. For a more in-depth description of probability models and parametric distributions, please consult a statistics textbook such as this one.

The Binomial Distribution

The first distribution we will describe is the binomial distribution. It reports the probability of observing \(S=k\) successes in \(N\) trials as

\[ \mbox{Pr}(S=k) = {N \choose k}p^k (1-p)^{N-k} \]

with \(p\) the probability of success. The best known example is coin tosses with \(S\) the number of heads when tossing \(N\) coins. In this example \(p=0.5\).
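As a quick check, we can compare the formula to R’s built-in dbinom function (the values of N, k, and p below are just an illustration):

N <- 10; k <- 4; p <- 0.5
choose(N,k) * p^k * (1-p)^(N-k) ##the binomial formula by hand
dbinom(k, N, p) ##same value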

Note that \(S/N\) is the average of independent random variables and thus the CLT tells us that \(S\) is approximately normal when \(N\) is large. This distribution has many applications in the life sciences. Recently, it has been used by variant callers and genotypers applied to next generation sequencing data. A special case of this distribution is approximated by the Poisson distribution, which we describe next.

The Poisson Distribution

Since it is the sum of binary outcomes, the number of people that win the lottery follows a binomial distribution (we assume each person buys one ticket). The number of trials \(N\) is the number of people that buy tickets and is usually very large. However, the number of people that win the lottery is typically between 0 and 3, which implies the normal approximation does not hold. So why does the CLT not hold? One can explain this mathematically, but the intuition is that with the sum of successes so close to 0, and also constrained to be non-negative, it is impossible for the distribution to be normal. Here is a quick simulation:

p=10^-7 ##1 in 10,000,000 chance of winning
N=5*10^6 ##5,000,000 tickets bought
winners=rbinom(1000,N,p) ##1000 is the number of different lotto draws
tab=table(winners)
plot(tab)
Number of people that win the lottery obtained from Monte Carlo simulation.

prop.table(tab)
## winners
##     0     1     2     3 
## 0.604 0.300 0.081 0.015

For cases like this, where \(N\) is very large, but \(p\) is small enough to make \(N \times p\) (call it \(\lambda\)) a number between 0 and, for example, 10, then \(S\) can be shown to follow a Poisson distribution, which has a simple parametric form:

\[ \mbox{Pr}(S=k)=\frac{\lambda^k \exp(-\lambda)}{k!} \]
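We can check this approximation numerically for the lottery example above, where \(\lambda = Np = 0.5\):

N=5*10^6; p=10^-7
rbind(binomial=dbinom(0:3,N,p), poisson=dpois(0:3,N*p)) ##nearly identical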

Poisson Example from RNA-seq

The Poisson distribution is commonly used in RNAseq analyses. Because we are sampling thousands of molecules and most genes represent a very small proportion of the totality of molecules, the Poisson distribution seems appropriate.

So how does this help us? One way is that it provides insight about the statistical properties of summaries that are widely used in practice. For example, let’s say we only have one case sample and one control sample from an RNAseq experiment and we want to report the genes with the largest fold-changes. One insight that the Poisson model provides is that under the null hypothesis of no differences, the statistical variability of this quantity depends on the total abundance of the gene. We can show this mathematically, but here is a quick simulation to demonstrate the point:

N=10000 ##number of genes
lambdas=2^seq(1,16,len=N) ##these are the true abundances of genes
y=rpois(N,lambdas) ##note that the null hypothesis is true for all genes
x=rpois(N,lambdas) 
ind=which(y>0 & x>0) ##make sure no 0s due to ratio and log

library(rafalib)
splot(log2(lambdas),log2(y/x),subset=ind)
MA plot of simulated RNAseq data. Replicated measurements follow a Poisson distribution.

For lower values of lambda there is much more variability and, if we were to report anything with a fold change of 2 or more, the number of false positives would be quite high for low abundance genes.

NGS experiments and the Poisson distribution

In this section we will use the data stored in this dataset:

# library(BiocInstaller)
# biocLite("parathyroidSE")
library(parathyroidSE) ##available from Bioconductor
data(parathyroidGenesSE)
se <- parathyroidGenesSE

The data is contained in a SummarizedExperiment object, which we do not describe here. The important thing to know is that it includes a matrix of data, where each row is a genomic feature and each column is a sample. We can extract this data using the assay function. For this dataset, the value of a single cell in the data matrix is the count of reads which align to a given gene for a given sample. Thus, a similar plot to the one we simulated above with technical replicates reveals that the behavior predicted by the model is present in experimental data:

x <- assay(se)[,23]
y <- assay(se)[,24]
ind=which(y>0 & x>0) ##make sure no 0s due to ratio and log
splot((log2(x)+log2(y))/2,log2(x/y),subset=ind)
MA plot of replicated RNAseq data.

If we compute the standard deviations across four individuals, they are quite a bit higher than what is predicted by a Poisson model. If the Poisson model were appropriate for these biological replicates, the variance should equal the mean, and the points in the following plot should fall on the identity line:

library(rafalib)
library(matrixStats)

vars=rowVars(assay(se)[,c(2,8,16,21)]) ##we know these four samples are
means=rowMeans(assay(se)[,c(2,8,16,21)]) ##from four different individuals

splot(means,vars,log="xy",subset=which(means>0&vars>0)) ##plot a subset of data
abline(0,1,col=2,lwd=2)
Variance versus mean plot. Summaries were obtained from the RNAseq data.

The reason for this is that the variability plotted here includes biological variability, which the Poisson model, motivated only by sampling variability, does not account for. The negative binomial distribution, which combines the sampling variability of a Poisson and biological variability, is a more appropriate distribution to model this type of experiment. The negative binomial has two parameters and permits more flexibility for count data. For more on the use of the negative binomial to model RNAseq data you can read this paper. The Poisson is a special case of the negative binomial distribution.
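To see the extra flexibility, here is a small sketch. Under the parametrization used by R’s rnbinom, with mean mu and dispersion parameter size (the values below are arbitrary, chosen only for illustration), the variance is \(\mu + \mu^2/\mbox{size}\), which exceeds the Poisson variance of \(\mu\):

set.seed(1)
mu <- 100; size <- 5 ##illustrative values, not estimated from data
x <- rnbinom(10^5, mu=mu, size=size)
c(mean=mean(x), variance=var(x), theory=mu+mu^2/size) ##variance well above the mean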

Statistical Models Exercises


Q:The probability of conceiving a girl is 0.49. What is the probability that a family with 4 children has 2 girls and 2 boys (you can assume no twins)?

dbinom(2,4,0.49)
## [1] 0.3747001

A:0.3747001

Q:What is the probability that a family with 10 children has 4 girls and 6 boys (you can assume no twins)?

dbinom(4,10,0.49)
## [1] 0.2130221

A:0.2130221

Q:The genome has 3 billion bases. About 20% are C, 20% are G, 30% are T and 30% are A. Suppose you take a random interval of 20 bases, what is the probability that the GC-content (proportion of Gs or Cs) is strictly above 0.5 in this interval (you can assume independence)?

1-pbinom(10,20,0.4)
## [1] 0.1275212

A:0.1275212

Q:The following two questions are motivated by this event.

The probability of winning the lottery is 1 in 175,223,510. If 189,000,000 randomly generated (with replacement) tickets are sold, what is the probability that at least one winning ticket is sold? (give your answer as a proportion, not a percentage)

1- dbinom(0, 189000000, 1/175223510)
## [1] 0.6599363

A:0.6599363

Q:Using the information from the previous question, what is the probability that two or more winning tickets are sold?

1 - pbinom(1, 189000000, 1/175223510)
## [1] 0.293136

A:0.293136

Q:Back to the GC-content example: what is the exact probability that the GC-content is greater than 0.35 and less than or equal to 0.45 in a random interval of 20 bases?

pbinom(9,20,0.4)-pbinom(7,20,0.4)
## [1] 0.3394443

A:0.3394443

Q:For the question above, what is the normal approximation to the probability?

b <- (9 - 20*.4)/sqrt(20*.4*.6)
a <- (7 - 20*.4)/sqrt(20*.4*.6)
pnorm(b)-pnorm(a)
## [1] 0.3519231

A:0.3519231

Q:Repeat Statistical Models Exercises #3, but using an interval of 1000 bases. What is the difference (in absolute value) between the normal approximation and the exact probability (using the binomial) of the GC-content being greater than 0.35 and less than or equal to 0.45?

exact = pbinom(450,1000,0.4)-pbinom(350,1000,0.4)
b <- (450 - 1000*.4)/sqrt(1000*.4*.6)
a <- (350 - 1000*.4)/sqrt(1000*.4*.6)
approx <- pnorm(b)-pnorm(a)
abs(exact-approx)
## [1] 9.728752e-06

A:9.728752e-06

Q:The normal approximation works well when N is large and p is not too close to 0 or 1. Use the code below to compare the exact binomial probabilities to the normal approximation (with continuity correction) for several values of N and p. For which values does the approximation work best?

Ns <- c(5,10,30,100)
ps <- c(0.01,0.10,0.5,0.9,0.99)
library(rafalib)
mypar(4,5)
for(N in Ns){
  ks <- 1:(N-1)
  for(p in ps){
    exact = dbinom(ks,N,p)
    a = (ks+0.5 - N*p)/sqrt(N*p*(1-p))
    b = (ks-0.5 - N*p)/sqrt(N*p*(1-p))
    approx = pnorm(a) - pnorm(b)
    LIM <- range(c(approx,exact))
    plot(exact,approx,main=paste("N =",N," p = ",p),xlim=LIM,ylim=LIM,col=1,pch=16)
    abline(0,1)
  }
}

A:When N is 100 all approximations are spot on.

Q:Since N is very large and p is very small in the lottery example, we can use the Poisson approximation with \(\lambda = Np\). Recompute the probability that two or more winning tickets are sold using this approximation.

N = 189000000
p = 1/175223510
1 - ppois(1, N*p)
## [1] 0.293136

A:0.293136

Maximum Likelihood Estimation

To illustrate the concept of maximum likelihood estimates (MLE), we use a relatively simple dataset containing palindrome locations in the HCMV genome. We read in the locations of the palindromes and then count the number of palindromes in each 4,000 base pair segment.

datadir="http://www.biostat.jhsph.edu/bstcourse/bio751/data"
x=read.csv(file.path(datadir,"hcmv.csv"))[,2]

breaks=seq(0,4000*round(max(x)/4000),4000)
tmp=cut(x,breaks)
counts=table(tmp)

library(rafalib)
mypar(1,1)
hist(counts)
Palindrome count histogram.

The counts do appear to follow a Poisson distribution. But what is the rate \(\lambda\) ? The most common approach to estimating this rate is maximum likelihood estimation. To find the maximum likelihood estimate (MLE), we note that these data are independent and the probability of observing the values we observed is:

\[ \Pr(X_1=k_1,\dots,X_n=k_n;\lambda) = \prod_{i=1}^n \frac{\lambda^{k_i}\exp(-\lambda)}{k_i!} \]

The MLE is the value of \(\lambda\) that maximizes the likelihood:

\[ \mbox{L}(\lambda; X_1=k_1,\dots,X_n=k_n)=\exp\left\{\sum_{i=1}^n \log \Pr(X_i=k_i;\lambda)\right\} \]

In practice, it is more convenient to maximize the log-likelihood, which is the summation that is exponentiated in the expression above. Below we write code that computes the log-likelihood for any \(\lambda\) and use the function optimize to find the value that maximizes this function (the MLE). We show a plot of the likelihood along with a vertical line showing the MLE.

l<-function(lambda) sum(dpois(counts,lambda,log=TRUE)) 

lambdas<-seq(3,7,len=100)
ls <- exp(sapply(lambdas,l))

plot(lambdas,ls,type="l")

mle=optimize(l,c(0,10),maximum=TRUE)
abline(v=mle$maximum)
Likelihood versus lambda.

If you work out the math and do a bit of calculus, you realize that this is a particularly simple example for which the MLE is the average.

print( c(mle$maximum, mean(counts) ) )
## [1] 5.157894 5.157895

Note that a plot of observed counts versus counts predicted by the Poisson shows that the fit is quite good in this case:

theoretical<-qpois((seq(0,99)+0.5)/100,mean(counts))

qqplot(theoretical,counts)
abline(0,1)
Observed counts versus theoretical Poisson counts.

We therefore can model the palindrome count data with a Poisson with \(\lambda=5.16\).

MLE Exercises

Q:Load the locations of palindromes in the HCMV genome from the dagdata package and count the number of palindromes in bins of 4,000 base pairs. Compute the likelihood and log-likelihood of the counts for \(\lambda=4\), then evaluate the log-likelihood over a grid of \(\lambda\) values. What is the maximum likelihood estimate of \(\lambda\)?

# library(devtools)
# install_github("genomicsclass/dagdata")

library(dagdata)
data(hcmv)

library(rafalib)
mypar()
plot(locations,rep(1,length(locations)),ylab="",yaxt="n")

breaks=seq(0,4000*round(max(locations)/4000),4000)
tmp=cut(locations,breaks)
counts=as.numeric(table(tmp))

hist(counts)

probs <- dpois(counts,4)
likelihood <- prod(probs)
likelihood
## [1] 1.177527e-62
logprobs <- dpois(counts,4,log=TRUE)
loglikelihood <- sum(logprobs)
loglikelihood
## [1] -142.5969
loglikelihood = function(lambda,x){
  sum(dpois(x,lambda,log=TRUE))
}
lambdas = seq(1,15,len=300)
l = sapply(lambdas,function(lambda) loglikelihood(lambda,counts))
plot(lambdas,l)
mle=lambdas[which.max(l)]
abline(v=mle)

print(mle)
## [1] 5.167224
#It turns out that, using calculus, we can work out mathematically which lambda maximizes the likelihood: the average of the counts is the MLE. Note that we obtain a similar number with the grid search above:

mean(counts)
## [1] 5.157895

A:5.167224

Q:The point of collecting this dataset was to try to determine if there is a region of the genome that has higher palindrome rate than expected. We can create a plot and see the counts per location:

breaks=seq(0,4000*round(max(locations)/4000),4000)
tmp=cut(locations,breaks)
counts=as.numeric(table(tmp))
binLocation=(breaks[-1]+breaks[-length(breaks)])/2
plot(binLocation,counts,type="l",xlab="Genome location")

What is the center of the bin with the highest count?

binLocation[which.max(counts)]
## [1] 94000

A:94000

Q:For the question above, what is the maximum count?

max(counts)
## [1] 14

A:14

Q:Now that we have identified the location with the largest palindrome count, we want to know if by chance we could see a value this big.

If X is a Poisson random variable with rate

lambda = mean(counts[ - which.max(counts) ])

What is the probability of seeing a count of 14 or more?

pval = 1 - ppois(13,lambda)
print(pval)
## [1] 0.00069799

A:0.00069799

Q:From the question above, we obtain a p-value smaller than 0.001 for a count of 14. Why is it problematic to report this p-value as strong evidence of a location that is different?

A:We selected the highest region out of 57 and need to adjust for multiple testing.

Q:Use the Bonferroni correction to determine the p-value cut-off that guarantees a FWER of 0.05. What is this p-value cutoff?

0.05/57
## [1] 0.000877193

A:0.00087719

Note that our observed p-value satisfies the Bonferroni correction.

Q:Create a qq-plot to see if our Poisson model is a good fit:

ps <- (seq(along=counts) - 0.5)/length(counts)
lambda <- mean( counts[ -which.max(counts)])
poisq <- qpois(ps,lambda)
qqplot(poisq,counts)
abline(0,1)

How would you characterize this qq-plot?

A: Poisson is a very good approximation except for one point that we actually think is associated with a region of interest.

Distributions for Positive Continuous Values

Different genes vary differently across biological replicates. Later, in the hierarchical models chapter, we will describe one of the most influential statistical methods in the analysis of genomics data. This method provides great improvements over naive approaches to detecting differentially expressed genes. This is achieved by modeling the distribution of the gene variances. Here we describe the parametric model used in this method.

We want to model the distribution of the gene-specific standard errors. Are they normal? Keep in mind that we are modeling the population standard errors so CLT does not apply, even though we have thousands of genes.

As an example, we use experimental data that included both technical and biological replicates for gene expression measurements on mice. We can load the data and compute the gene-specific sample standard deviations for both the technical replicates and the biological replicates.

library(Biobase) ##available from Bioconductor

#devtools::install_github("genomicsclass/maPooling")
library(maPooling) ##available from course github repo

data(maPooling)
pd=pData(maPooling)

##determine which samples are bio reps and which are tech reps
strain=factor(as.numeric(grepl("b",rownames(pd))))
pooled=which(rowSums(pd)==12 & strain==1)
techreps=exprs(maPooling[,pooled])
individuals=which(rowSums(pd)==1 & strain==1)

##remove replicates
individuals=individuals[-grep("tr",names(individuals))]
bioreps=exprs(maPooling)[,individuals]

###now compute the gene specific standard deviations
library(matrixStats)
techsds=rowSds(techreps)
biosds=rowSds(bioreps)

We can now explore the sample standard deviation:

###now plot
library(rafalib)
mypar()
shist(biosds,unit=0.1,col=1,xlim=c(0,1.5))
shist(techsds,unit=0.1,col=2,add=TRUE)
legend("topright",c("Biological","Technical"), col=c(1,2),lty=c(1,1))
Histograms of biological variance and technical variance.

An important observation here is that the biological variability is substantially higher than the technical variability. This provides strong evidence that genes do in fact have gene-specific biological variability.

If we want to model this variability, we first notice that the normal distribution is not appropriate here since the right tail is rather large. Also, because SDs are strictly positive, there is a limitation to how symmetric this distribution can be. A qqplot shows this very clearly:

qqnorm(biosds)
qqline(biosds)
Normal qq-plot for sample standard deviations.

There are parametric distributions that possess these properties (strictly positive and heavy right tails). Two examples are the gamma and F distributions. The density of the gamma distribution is defined by:

\[ f(x;\alpha,\beta)=\frac{\beta^\alpha x^{\alpha-1}\exp(-\beta x)}{\Gamma(\alpha)} \]

It is defined by two parameters \(\alpha\) and \(\beta\) that can, indirectly, control location and scale. They also control the shape of the distribution. For more on this distribution please refer to this book.

Two special cases of the gamma distribution are the chi-squared and exponential distributions. We used the chi-squared earlier to analyze 2x2 table data. With the rate parametrization used above, for the chi-squared we have \(\alpha=\nu/2\) and \(\beta=1/2\) with \(\nu\) the degrees of freedom. For the exponential, we have \(\alpha=1\) and \(\beta=\lambda\) the rate.
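A quick sanity check of these special cases using R’s density functions (the values of x, nu, and lambda below are arbitrary):

x <- 1.5; nu <- 4; lambda <- 2
c(dgamma(x, shape=nu/2, rate=1/2), dchisq(x, df=nu)) ##chi-squared as a gamma
c(dgamma(x, shape=1, rate=lambda), dexp(x, rate=lambda)) ##exponential as a gamma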

The F-distribution comes up in analysis of variance (ANOVA). It is also always positive and has large right tails. Two parameters control its shape:

\[ f(x;d_1,d_2)=\frac{1}{B\left( \frac{d_1}{2},\frac{d_2}{2}\right)} \left(\frac{d_1}{d_2}\right)^{\frac{d_1}{2}} x^{\frac{d_1}{2}-1}\left(1+\frac{d_1}{d_2}x\right)^{-\frac{d_1+d_2}{2}} \]

with \(B\) the beta function and \(d_1\) and \(d_2\) are called the degrees of freedom for reasons having to do with how it arises in ANOVA. A third parameter is sometimes used with the F-distribution, which is a scale parameter.

Modeling the variance

In a later section we will learn about a hierarchical model approach to improve estimates of variance. In these cases it is mathematically convenient to model the distribution of the variance \(\sigma^2\). The hierarchical model used here implies that the sample variances of genes follow a scaled F-distribution:

\[ s^2 \sim s_0^2 F_{d,d_0} \]

with \(d\) the degrees of freedom involved in computing \(s^2\). For example, in a case comparing 3 versus 3, the degrees of freedom would be 4. This leaves two free parameters to adjust to the data: \(d_0\) (denoted d in the code below) controls the location and \(s_0\) controls the scale. Below are some examples of \(F\) distributions plotted on top of the histogram of the sample standard deviations:

library(rafalib)
mypar(3,3)
sds=seq(0,2,len=100)
for(d in c(1,5,10)){
  for(s0 in c(0.1, 0.2, 0.3)){
    tmp=hist(biosds,main=paste("s_0 =",s0,"d =",d),xlab="sd",ylab="density",freq=FALSE,nc=100,xlim=c(0,1))
    dd=df(sds^2/s0^2,11,d)
    ##multiply by normalizing constant to assure same range on plot
    k=sum(tmp$density)/sum(dd) 
    lines(sds,dd*k,type="l",col=2,lwd=2)
    }
}
Histograms of sample standard deviations and densities of estimated distributions.

Now which \(s_0\) and \(d_0\) fit our data best? This is a rather advanced topic as the MLE does not perform well for this particular distribution (we refer to Smyth (2004)). The Bioconductor limma package provides a function to estimate these parameters:

# source("https://bioconductor.org/biocLite.R")
# biocLite("limma")
library(limma)
estimates=fitFDist(biosds^2,11)

theoretical<- sqrt(qf((seq(0,999)+0.5)/1000, 11, estimates$df2)*estimates$scale)
observed <- biosds

The fitted models do appear to provide a reasonable approximation, as demonstrated by the qq-plot and histogram:

mypar(1,2)
qqplot(theoretical,observed)
abline(0,1)
tmp=hist(biosds,main=paste("s_0 =", signif(estimates[[1]],2), "d =", signif(estimates[[2]],2)), xlab="sd", ylab="density", freq=FALSE, nc=100, xlim=c(0,1), ylim=c(0,9))
dd=df(sds^2/estimates$scale,11,estimates$df2)
k=sum(tmp$density)/sum(dd) ##a normalizing constant to assure same area in plot
lines(sds, dd*k, type="l", col=2, lwd=2)
qq-plot (left) and density (right) demonstrate that the model fits the data well.

Models for Variance Exercises

Install and load the following data library:

# library(devtools)
# install_github("genomicsclass/tissuesGeneExpression")

library(tissuesGeneExpression)

#Now load this data and select the columns related to endometrium: 

data("tissuesGeneExpression")
library(genefilter)
y = e[,which(tissue=="endometrium")]

This will give you a matrix y with 15 samples.

Q:Compute the across-sample variance for the fifteen samples. Then make a qq-plot to see if the variances follow a normal distribution.

Which statement is true? (pick one)

library(genefilter)
s2 <- rowVars(y)
library(rafalib)
mypar(1,2)
qqnorm(s2)
qqline(s2)
##To see the square root transformation does not help much:
qqnorm(sqrt(s2))
qqline(sqrt(s2))

A: The normal distribution is not a useful approximation here: the left tail is overestimated and the right tail is underestimated.

Q:Now fit an F-distribution with 14 degrees of freedom using the fitFDist function in the limma package:

What is the estimated scale parameter?

library(limma)
estimates=fitFDist(s2,14)
print( estimates$scale )
## [1] 0.03125834

A:0.03125834

Q:Now create a qq-plot of the observed sample standard deviation versus the quantiles predicted by the F-distribution (remember to take square root).

Which of the following best describes the qq-plot?

ps <- (seq(along=s2)-0.5)/length(s2)
theoretical<- qf(ps,14,estimates$df2)*estimates$scale 
LIM <- sqrt( range(c(theoretical,s2)) )
mypar(1,2)
qqplot(sqrt( theoretical ), sqrt( s2 ),ylim=LIM,xlim=LIM)
abline(0,1)
##close up excluding the upper 5%
K <- sqrt( quantile(s2,0.95) )
qqplot( sqrt( theoretical ), sqrt( s2 ),ylim=c(0,K),xlim=c(0,K))
abline(0,1)

A: If we exclude the genes with the highest variances (top 5%), the F-distribution provides a good fit.


Week 4:Hierarchical Modeling

Bayes’ Rule

Bayesian Statistics

One distinguishing characteristic of high-throughput data is that although we want to report on specific features, we observe many related outcomes. For example, we measure the expression of thousands of genes, or the height of thousands of peaks representing protein binding, or the methylation levels across several CpGs. However, most of the statistical inference approaches we have shown here treat each feature independently and pretty much ignore data from other features. We will learn how using statistical models provides power by modeling features jointly. The most successful of these approaches are what we refer to as hierarchical models, which we explain below in the context of Bayesian statistics.

Bayes theorem

We start by reviewing Bayes theorem. We do this using a hypothetical cystic fibrosis test as an example. Suppose a test for cystic fibrosis has an accuracy of 99%. We will use the following notation:

\[ \mbox{Prob}(+ \mid D=1)=0.99, \mbox{Prob}(- \mid D=0)=0.99 \]

with \(+\) meaning a positive test and \(D\) representing if you actually have the disease (1) or not (0).

Suppose we select a random person and they test positive. What is the probability that they have the disease? We write this as \(\mbox{Prob}(D=1 \mid +)\). The cystic fibrosis rate is 1 in 3,900, which implies that \(\mbox{Prob}(D=1)=0.00025\). To answer this question we will use Bayes Theorem, which in general tells us that:

\[ \mbox{Pr}(A \mid B) = \frac{\mbox{Pr}(B \mid A)\mbox{Pr}(A)}{\mbox{Pr}(B)} \]

This equation applied to our problem becomes:

\[ \begin{align*} \mbox{Prob}(D=1 \mid +) & = \frac{ P(+ \mid D=1) \cdot P(D=1)} {\mbox{Prob}(+)} \\ & = \frac{\mbox{Prob}(+ \mid D=1)\cdot P(D=1)} {\mbox{Prob}(+ \mid D=1) \cdot P(D=1) + \mbox{Prob}(+ \mid D=0) \mbox{Prob}( D=0)} \end{align*} \]

Plugging in the numbers we get:

\[ \frac{0.99 \cdot 0.00025}{0.99 \cdot 0.00025 + 0.01 \cdot (.99975)} = 0.02 \]
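The same arithmetic in R:

prev <- 0.00025 ##Prob(D=1)
acc <- 0.99 ##test accuracy
(acc*prev) / (acc*prev + (1-acc)*(1-prev)) ##about 0.024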

This says that despite the test having 0.99 accuracy, the probability of having the disease given a positive test is only 0.02. This may appear counterintuitive to some. The reason this is the case is because we have to factor in the very rare probability that a person, chosen at random, has the disease. To illustrate this we run a Monte Carlo simulation.

Simulation

The following simulation is meant to help you visualize Bayes Theorem. We start by randomly selecting 1500 people from a population in which the disease in question has a 5% prevalence.

set.seed(3)
prev <- 1/20
##later we arrange the 1500 people in 50 rows and 30 columns
M <- 50 ; N <- 30
##do they have the disease?
d<-rbinom(N*M,1,p=prev)

Now each person gets the test which is correct 90% of the time.

accuracy <- 0.9
test <- rep(NA,N*M)
##do cases test positive?
test[d==1]  <- rbinom(sum(d==1), 1, p=accuracy)
##do controls test positive?
test[d==0]  <- rbinom(sum(d==0), 1, p=1-accuracy)

Because there are so many more controls than cases, even with a low false positive rate, we get more controls than cases in the group that tested positive (code not shown):

Simulation demonstrating Bayes theorem. The top plot shows every individual, with red denoting cases. Each one takes a test that gives the correct answer 90% of the time. Those called positive (either correctly or incorrectly) are put in the bottom left panel; those called negative in the bottom right.

The proportion of red in the top plot shows \(\mbox{Pr}(D=1)\). The proportion of red in the bottom left shows \(\mbox{Pr}(D=1 \mid +)\) and the proportion in the bottom right shows \(\mbox{Pr}(D=1 \mid -)\).

Bayes Rule Exercises

Bayes Rule Exercises #1

Q:Use the cystic fibrosis example from above: a test with accuracy \(\mbox{Prob}(+ \mid D=1)=\mbox{Prob}(- \mid D=0)=0.99\) and a disease rate of \(\mbox{Prob}(D=1)=0.00025\). What is \(\mbox{Prob}(D=1 \mid +)\), the probability of having the disease given a positive test?

A:0.02415813

This blog post has a visual explanation of why this is.

Bayes in practice

José Iglesias is a professional baseball player. In April 2013, when he was starting his career, he was performing rather well:

Month  At Bats  H  AVG
April  20       9  .450

The batting average (AVG) statistic is one way of measuring success. Roughly speaking, it tells us the success rate when batting. An AVG of .450 means José has been successful 45% of the times he has batted (At Bats), which is rather high as we will see. Note, for example, that no one has finished a season with an AVG of .400 since Ted Williams did it in 1941! To illustrate how powerful hierarchical models are, we will try to predict José’s batting average at the end of the season, after he has gone to bat over 500 times.

With the techniques we have learned up to now, referred to as frequentist techniques, the best we can do is provide a confidence interval. We can think of outcomes from hitting as a binomial with a success rate of \(p\). So if the success rate is indeed .450, the standard error of just 20 at bats is:

\[ \sqrt{\frac{.450 (1-.450)}{20}}=.111 \]

This means that our confidence interval is .450-.222 to .450+.222 or .228 to .672.
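Here is that calculation in R:

p_hat <- 0.450
se <- sqrt(p_hat*(1-p_hat)/20)
c(p_hat - 2*se, p_hat + 2*se) ##roughly .228 to .672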

This prediction has two problems. First, the interval is very large, so it is not very useful. Second, it is centered at .450, which implies that our best guess is that this new player will break Ted Williams’ record. If you follow baseball, this last statement will seem wrong, and this is because you are implicitly using a hierarchical model that factors in information from years of following baseball. Here we show how we can quantify this intuition.

First, let’s explore the distribution of batting averages for all players with more than 500 at bats during the previous three seasons:

Batting average histograms for 2010, 2011, and 2012.

We note that the average player had an AVG of .275 and the standard deviation of the population of players was 0.027. So we can see already that .450 would be quite an anomaly since it is over six SDs away from the mean. So is José lucky or the best batter seen in the last 50 years? Perhaps it’s a combination of both. But how lucky and how good is he? If we become convinced that he is lucky, we should trade him to a team that trusts the .450 observation and is maybe overestimating his potential.

The hierarchical model

The hierarchical model provides a mathematical description of how we came to see the observation of .450. First, we pick a player at random with an intrinsic ability summarized by, for example, \(\theta\), then we see 20 random outcomes with success probability \(\theta\).

\[ \begin{align*} \theta &\sim N(\mu, \tau^2) \mbox{ describes randomness in picking a player}\\ Y \mid \theta &\sim N(\theta, \sigma^2) \mbox{ describes randomness in the performance of this particular player} \end{align*} \]

Note the two levels (this is why we call them hierarchical): 1) Player to player variability and 2) variability due to luck when batting. In a Bayesian framework, the first level is called a prior distribution and the second the sampling distribution.

Now, let’s use this model for José’s data. Suppose we want to predict his innate ability in the form of his true batting average \(\theta\). This would be the hierarchical model for our data:

\[ \begin{align*} \theta &\sim N(.275, .027^2) \\ Y \mid \theta &\sim N(\theta, .111^2) \end{align*} \]
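A quick simulation sketch of these two levels (drawing many hypothetical players just for illustration) shows that the marginal SD of the observed averages combines both sources of variability:

set.seed(1)
mu <- 0.275; tau <- 0.027; sigma <- 0.111
theta <- rnorm(10^4, mu, tau) ##level 1: pick players' true abilities
Y <- rnorm(10^4, theta, sigma) ##level 2: observed AVG given ability
c(sd(Y), sqrt(tau^2 + sigma^2)) ##approximately equal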

We now are ready to compute a posterior distribution to summarize our prediction of \(\theta\). The continuous version of Bayes rule can be used here to derive the posterior probability, which is the distribution of the parameter \(\theta\) given the observed data:

\[ \begin{align*} f_{ \theta \mid Y} (\theta\mid Y) &= \frac{f_{Y\mid \theta}(Y\mid \theta) f_{\theta}(\theta) }{f_Y(Y)}\\ &= \frac{f_{Y\mid \theta}(Y\mid \theta) f_{\theta}(\theta)} {\int_{\theta}f_{Y\mid \theta}(Y\mid \theta)f_{\theta}(\theta)} \end{align*} \]

We are particularly interested in the \(\theta\) that maximizes the posterior probability \(f_{\theta\mid Y}(\theta\mid Y)\). In our case, we can show that the posterior is normal and we can compute the mean \(\mbox{E}(\theta\mid y)\) and variance \(\mbox{var}(\theta\mid y)\). Specifically, we can show the average of this distribution is the following:

\[ \begin{align*} \mbox{E}(\theta\mid y) &= B \mu + (1-B) Y\\ &= \mu + (1-B)(Y-\mu)\\ B &= \frac{\sigma^2}{\sigma^2+\tau^2} \end{align*} \]

It is a weighted average of the population average \(\mu\) and the observed data \(Y\). The weight depends on the SD of the population \(\tau\) and the SD of our observed data \(\sigma\). This weighted average is sometimes referred to as shrinking because it shrinks estimates towards the prior mean.

\[ \begin{align*} \mbox{E}(\theta \mid Y=.450) &= B \times .275 + (1 - B) \times .450 \\ &= .275 + (1 - B)(.450 - .275) \\ B &=\frac{.111^2}{.111^2 + .027^2} = 0.944\\ \mbox{E}(\theta \mid Y=.450) &\approx .285 \end{align*} \]

The variance can be shown to be:

\[ \mbox{var}(\theta\mid y) = \frac{1}{1/\sigma^2+1/\tau^2} = \frac{1}{1/.111^2 + 1/.027^2} = 0.00069 \] and the standard deviation is therefore \(0.026\). So we started with a frequentist 95% confidence interval that ignored data from other players and summarized just José’s data: .450 \(\pm\) 0.220. Then we used a Bayesian approach that incorporated data from other players and other years to obtain a posterior probability. This is actually referred to as an empirical Bayes approach because we used data to construct the prior. From the posterior we can report what is called a 95% credible interval by reporting a region, centered at the mean, with a 95% chance of containing \(\theta\). In our case, this turns out to be: .285 \(\pm\) 0.052.
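These computations are easy to reproduce in R:

mu <- 0.275; tau <- 0.027 ##prior mean and SD
Y <- 0.450; sigma <- 0.111 ##observed AVG and its SE
B <- sigma^2/(sigma^2 + tau^2)
post_mean <- B*mu + (1-B)*Y
post_sd <- sqrt(1/(1/sigma^2 + 1/tau^2))
c(B, post_mean, post_sd) ##about 0.944, .285, and .026
c(post_mean - 2*post_sd, post_mean + 2*post_sd) ##the 95% credible interval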

The Bayesian credible interval suggests that if another team is impressed by the .450 observation, we should consider trading José as we are predicting he will be just slightly above average. Interestingly, the Red Sox traded José to the Detroit Tigers in July. Here are the José Iglesias batting averages for the next five months.

Month            At Bats  Hits  AVG
April            20       9     .450
May              26       11    .423
June             86       34    .395
July             83       17    .205
August           85       25    .294
September        50       10    .200
Total w/o April  330      97    .293

Although both intervals included the final batting average, the Bayesian credible interval provided a much more precise prediction. In particular, it predicted that he would not be as good the remainder of the season.

Bayes’ Rule in Practice Exercises

First download some baseball statistics.

tmpfile <- tempfile()
tmpdir <- tempdir()
download.file("http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip",tmpfile)
##this shows us files
filenames <- unzip(tmpfile,list=TRUE)
players <- read.csv(unzip(tmpfile,files="Batting.csv",exdir=tmpdir),as.is=TRUE)

head(players)
##    playerID yearID stint teamID lgID  G G_batting AB R H X2B X3B HR RBI SB
## 1 aardsda01   2004     1    SFN   NL 11        11  0 0 0   0   0  0   0  0
## 2 aardsda01   2006     1    CHN   NL 45        43  2 0 0   0   0  0   0  0
## 3 aardsda01   2007     1    CHA   AL 25         2  0 0 0   0   0  0   0  0
## 4 aardsda01   2008     1    BOS   AL 47         5  1 0 0   0   0  0   0  0
## 5 aardsda01   2009     1    SEA   AL 73         3  0 0 0   0   0  0   0  0
## 6 aardsda01   2010     1    SEA   AL 53         4  0 0 0   0   0  0   0  0
##   CS BB SO IBB HBP SH SF GIDP G_old
## 1  0  0  0   0   0  0  0    0    11
## 2  0  0  0   0   0  1  0    0    45
## 3  0  0  0   0   0  0  0    0     2
## 4  0  0  1   0   0  0  0    0     5
## 5  0  0  0   0   0  0  0    0    NA
## 6  0  0  0   0   0  0  0    0    NA
unlink(tmpdir)
file.remove(tmpfile)
## [1] TRUE

In our first course we learned to use the dplyr package. Please review it, for example, with this tutorial.

Here we use dplyr to obtain the necessary information to perform a hierarchical model.

Q:Which of the following dplyr commands gives us the batting averages (AVG) for players with more than 500 at bats (AB) in 2012:

library(dplyr)
filter(players,yearID==2012) %>% mutate(AVG=H/AB) %>% filter(AB>=500) %>% select(AVG) 
##           AVG
## 1   0.2257002
## 2   0.2732240
## 3   0.2899306
## 4   0.2438095
## 5   0.2861685
## 6   0.2884615
## 7   0.2500000
## 8   0.2901354
## 9   0.2536496
## 10  0.2342857
## 11  0.3211921
## 12  0.2687386
## 13  0.2740385
## 14  0.2880435
## 15  0.3193980
## 16  0.2517857
## 17  0.3127036
## 18  0.2702703
## 19  0.3295820
## 20  0.3125997
## 21  0.2832817
## 22  0.2826087
## 23  0.2459893
## 24  0.2598291
## 25  0.2699029
## 26  0.2273603
## 27  0.2805344
## 28  0.2628458
## 29  0.2923977
## 30  0.2040816
## 31  0.2804428
## 32  0.2925620
## 33  0.2526882
## 34  0.2474747
## 35  0.2841727
## 36  0.3132530
## 37  0.2352941
## 38  0.2592593
## 39  0.2934132
## 40  0.2859922
## 41  0.3030888
## 42  0.2943925
## 43  0.2315436
## 44  0.2846975
## 45  0.2383107
## 46  0.2701689
## 47  0.2704626
## 48  0.2864238
## 49  0.2691652
## 50  0.3021346
## 51  0.2954925
## 52  0.2317757
## 53  0.3127341
## 54  0.3001842
## 55  0.2455446
## 56  0.3162518
## 57  0.2248521
## 58  0.2870370
## 59  0.2872727
## 60  0.2564885
## 61  0.2571912
## 62  0.2983114
## 63  0.2529644
## 64  0.2714536
## 65  0.3192661
## 66  0.2426036
## 67  0.3271501
## 68  0.3148515
## 69  0.2601942
## 70  0.2673267
## 71  0.2415631
## 72  0.2907180
## 73  0.2876033
## 74  0.2895204
## 75  0.2391714
## 76  0.2810345
## 77  0.3358491
## 78  0.3014587
## 79  0.2850082
## 80  0.2647555
## 81  0.3000000
## 82  0.2230088
## 83  0.2422259
## 84  0.2935421
## 85  0.2866044
## 86  0.3041322
## 87  0.2500000
## 88  0.2524655
## 89  0.2465483
## 90  0.2592593
## 91  0.2620321
## 92  0.2829457
## 93  0.2718808
## 94  0.3255814
## 95  0.2683824
## 96  0.2198853
## 97  0.2460733
## 98  0.2797834
## 99  0.2554455
## 100 0.2295918
## 101 0.2490494
## 102 0.2601156
## 103 0.3063683
## 104 0.2665505
## 105 0.2765957
## 106 0.2820069
## 107 0.2696429

A:filter(players,yearID==2012) %>% mutate(AVG=H/AB) %>% filter(AB>=500) %>% select(AVG)

Q:Edit the command above to obtain all the batting averages from 2010, 2011, and 2012, removing rows with AB < 500.

What is the average of these batting averages?

dat <- filter(players,yearID>=2010, yearID <=2012) %>% mutate(AVG=H/AB) %>% filter(AB>500) %>% select(AVG)

mean(dat$AVG)  
## [1] 0.2753465

A:0.2753465

Q:What is the standard deviation of these batting averages?

sd(dat$AVG)
## [1] 0.02741713

A:0.02741713

Q:Use exploratory data analysis to decide which of the following distributions best approximates the distribution of the batting averages across players (hint: the averages are contained in the AVG component).

qqnorm(dat$AVG)

qqline(dat$AVG)

A:Normal

Q:It is April and after 20 at bats, Jose Iglesias is batting .450 (this is very good). We can think of this as a binomial distribution with 20 trials and probability of success p. Our sample estimate of p is .450. What is our estimate of the standard deviation? Hint: this AVG is a sum of Bernoulli trials, that is binomial, divided by 20.

sqrt(.45*(1-.45)/20)
## [1] 0.111243

A:0.111243

Q:What is your estimate of Jose Iglesias' batting average going forward, taking into account his current batting average?
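
Below is a minimal sketch of the posterior mean (shrinkage) calculation, assuming the rounded values implied by the answers above: across-player mean \(\mu \approx 0.275\) and standard deviation \(\tau \approx 0.027\) from the 2010-2012 averages, and sampling standard deviation \(\sigma \approx 0.11\) from the previous question.

mu <- 0.275    ## across-player mean, mean(dat$AVG) rounded
tau <- 0.027   ## across-player SD, sd(dat$AVG) rounded
sigma <- 0.11  ## SE of an average based on 20 at bats, rounded
Y <- 0.450     ## observed April average
B <- sigma^2/(sigma^2 + tau^2) ## weight given to the across-player mean
B*mu + (1-B)*Y
## [1] 0.2849443

Because 20 at bats carry much more sampling noise than the spread across players, B is close to 1 and the estimate shrinks the observed .450 almost all the way back to the across-player mean.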

A:0.2849443


Hierarchical Models

In this section, we describe the mathematical theory behind an approach that has become widely applied in the analysis of high-throughput data. The general idea is to build a hierarchical model with two levels. One level describes variability across samples/units, and the other describes variability across features. This is similar to the baseball example, in which the first level described variability across players and the second described the randomness in the success of one player. The first level of variation is accounted for by all the models and approaches we have described here, for example the model that leads to the t-test. The second level provides power by permitting us to “borrow” information from all features to inform the inference performed on each one.
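
In the baseball version, the two levels can be written as follows (a sketch, with \(\theta\) a player's true batting average, \(\mu\) and \(\tau\) the mean and standard deviation across players, and \(\sigma\) the sampling standard deviation of an observed average \(Y\)):

\[ \theta \sim N(\mu, \tau^2) \mbox{ (variability across players)} \]

\[ Y \mid \theta \sim N(\theta, \sigma^2) \mbox{ (sampling variability for one player)} \]

The posterior mean \(\mbox{E}[\theta \mid Y] = B\mu + (1-B)Y\), with \(B = \sigma^2/(\sigma^2+\tau^2)\), combines the two levels; this is exactly the calculation we performed for Jose Iglesias above.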

Here we describe one specific case that is currently the most widely used approach to inference with gene expression data: the model implemented by the limma Bioconductor package. This idea has been adapted to develop methods for other data types, such as RNA-seq, by, for example, edgeR and DESeq2. The limma package provides an alternative to the t-test that greatly improves power by modeling the variance. While in the baseball example we modeled averages, here we model variances. Modeling variances requires more advanced math, but the concepts are practically the same. We motivate and demonstrate the approach with an example.

Here is a volcano plot showing effect sizes and p-values from applying a t-test to data from an experiment with six replicate samples, in which 16 genes were artificially made to differ between two groups of three samples each. These 16 genes are the only genes for which the alternative hypothesis is true. In the plot they are shown in blue.

library(SpikeInSubset) ##Available from Bioconductor
data(rma95)
library(genefilter)
fac <- factor(rep(1:2,each=3))
tt <- rowttests(exprs(rma95),fac)
smallp <- with(tt, p.value < .01)
spike <- rownames(rma95) %in% colnames(pData(rma95))
cols <- ifelse(spike,"dodgerblue",ifelse(smallp,"red","black"))

with(tt, plot(-dm, -log10(p.value), cex=.8, pch=16,
     xlim=c(-1,1), ylim=c(0,4.5),
     xlab="difference in means",
     col=cols))
abline(h=2,v=c(-.2,.2), lty=2)
Volcano plot for t-test comparing two groups. Spiked-in genes are denoted with blue. Among the rest of the genes, those with p-value < 0.01 are denoted with red.

We cut off the range of the y-axis at 4.5, but there is one blue point with a p-value smaller than \(10^{-6}\). Two findings stand out from this plot. The first is that only one of the 16 spiked-in genes would be found to be significant with a standard 5% FDR cutoff:

sum( p.adjust(tt$p.value,method = "BH")[spike] < 0.05)
## [1] 1

This of course has to do with the low power associated with a sample size of three in each group. The second finding is that if we forget about inference and simply rank the genes based on the size of the t-statistic, we obtain many false positives in any ranked list of size larger than 1. For example, six of the top 10 genes ranked by t-statistic are false positives.

table( top10=rank(tt$p.value)<= 10, spike) ## ranking by t-statistic and by p-value is the same
##        spike
## top10   FALSE  TRUE
##   FALSE 12604    12
##   TRUE      6     4

In the plot we notice that these false positives are mostly genes for which the effect size is relatively small, implying that the estimated standard error is small (a large t-statistic with a small difference in means can only arise from a small denominator). We can confirm this with a plot:

tt$s <- apply(exprs(rma95), 1, function(row) 
  sqrt(.5 * (var(row[1:3]) + var(row[4:6]) ) ) )
with(tt, plot(s, -log10(p.value), cex=.8, pch=16,
              log="x",xlab="estimate of standard deviation",
              col=cols))
p-values versus standard deviation estimates.

Here is where a hierarchical model can be useful. If we can make an assumption about the distribution of these variances across genes, then we can improve estimates by “adjusting” estimates that are “too small” according to this distribution. In a previous section we described how the F-distribution approximates the distribution of the observed variances.

\[ s^2 \sim s_0^2 F_{d,d_0} \]

Because we have thousands of data points, we can actually check this assumption and also estimate the parameters \(s_0\) and \(d_0\). This particular approach is referred to as empirical Bayes because it can be described as using data (empirical) to build the prior distribution (Bayesian approach).
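
One way to do this in practice is a method-of-moments fit of the scaled F-distribution. Below is a sketch using limma's fitFDist function (the same fitter limma uses internally); here \(d = 4\) because the pooled variance combines two groups of three samples:

library(limma)
s2 <- tt$s^2              ## per-gene variance estimates computed above
d <- 4                    ## degrees of freedom: (3-1) + (3-1)
fd <- fitFDist(s2, df1=d) ## estimates the scale s0^2 and df2 = d0
fd$scale ## estimate of s0^2
fd$df2   ## estimate of d0
## a qq-plot against the fitted scaled F-distribution checks the assumption:
qqplot(fd$scale * qf(ppoints(length(s2)), df1=d, df2=fd$df2), s2,
       log="xy", xlab="theoretical quantiles", ylab="observed variances")
abline(0, 1)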

Now we apply what we learned with the baseball example to the standard error estimates. As before, we have an observed value for each gene \(s_g\), a sampling distribution, and a prior distribution. We can therefore compute a posterior distribution for the variance \(\sigma^2_g\) and obtain the posterior mean. You can see the details of the derivation in this paper.

\[ \mbox{E}[\sigma^2_g \mid s_g] = \frac{d_0 s_0^2 + d s^2_g}{d_0 + d} \]

As in the baseball example, the posterior mean shrinks the observed variance \(s_g^2\) towards the global variance \(s_0^2\) and the weights depend on the sample size through the degrees of freedom \(d\) and, in this case, the shape of the prior distribution through \(d_0\).

The plot below illustrates how the variance estimates shrink for 40 genes (code not shown):

Illustration of how estimates shrink towards the prior expectation. Forty genes spanning the range of values were selected.

An important aspect of this adjustment is that genes having a sample standard deviation close to 0 are no longer close to 0 (they shrink towards \(s_0\)). We can now create a version of the t-test that, instead of the sample variance, uses this posterior mean or “shrunken” estimate of the variance. We refer to these as moderated t-tests.
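
A sketch of the definition: for gene \(g\), with groups of sizes \(N_1\) and \(N_2\), the moderated t-statistic simply replaces the sample variance with the posterior mean in the usual two-group t-statistic,

\[ \tilde{t}_g = \frac{\bar{Y}_{g,2} - \bar{Y}_{g,1}}{\sqrt{\mbox{E}[\sigma^2_g \mid s_g] \left( \frac{1}{N_1} + \frac{1}{N_2} \right)}} \]

and its null distribution is a t-distribution with \(d + d_0\) rather than \(d\) degrees of freedom, so the added prior degrees of freedom also increase power. Once we do this, the improvements can be seen clearly in the volcano plot: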

library(limma)
fit <- lmFit(rma95, model.matrix(~ fac))
ebfit <- eBayes(fit)
limmares <- data.frame(dm=coef(fit)[,"fac2"], p.value=ebfit$p.value[,"fac2"])
with(limmares, plot(dm, -log10(p.value),cex=.8, pch=16,
     col=cols,xlab="difference in means",
     xlim=c(-1,1), ylim=c(0,5)))
abline(h=2,v=c(-.2,.2), lty=2)
Volcano plot for moderated t-test comparing two groups. Spiked-in genes are denoted with blue. Among the rest of the genes, those with p-value < 0.01 are denoted with red.
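
The “shrunken” variances used by these moderated t-tests are exactly the posterior means defined above. As a quick check, here is a sketch using the components limma stores on the fit and ebfit objects just created (df.prior is \(d_0\), s2.prior is \(s_0^2\), sigma squared is \(s_g^2\), and df.residual is \(d\)):

s2.post.manual <- (ebfit$df.prior * ebfit$s2.prior + fit$df.residual * fit$sigma^2) /
                  (ebfit$df.prior + fit$df.residual)
all.equal(as.numeric(s2.post.manual), as.numeric(ebfit$s2.post)) ## should be TRUE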

The number of false positives in the top 10 is now reduced to 2.

table( top10=rank(limmares$p.value)<= 10, spike) 
##        spike
## top10   FALSE  TRUE
##   FALSE 12608     8
##   TRUE      2     8

Hierarchical Models in Practice Exercises

Load the following data (you can install it from Bioconductor) and extract the data matrix using exprs:

## to install:
library(rafalib)
#install_bioc("SpikeInSubset")
library(Biobase)
library(SpikeInSubset)
data(rma95)
y <- exprs(rma95)

This dataset comes from an experiment in which RNA was obtained from the same background pool to create six replicate samples. Then RNA from 16 genes was artificially added in different quantities to each sample. These quantities (in picoMolar) and the gene IDs are stored here:

pData(rma95)
##                  37777_at 684_at 1597_at 38734_at 39058_at 36311_at
## 1521a99hpp_av06      0.00   0.25     0.5        1        2        4
## 1532a99hpp_av04      0.00   0.25     0.5        1        2        4
## 2353a99hpp_av08      0.00   0.25     0.5        1        2        4
## 1521b99hpp_av06      0.25   0.50     1.0        2        4        8
## 1532b99hpp_av04      0.25   0.50     1.0        2        4        8
## 2353b99hpp_av08r     0.25   0.50     1.0        2        4        8
##                  36889_at 1024_at 36202_at 36085_at 40322_at 407_at
## 1521a99hpp_av06         8      16       32       64      128   0.00
## 1532a99hpp_av04         8      16       32       64      128   0.00
## 2353a99hpp_av08         8      16       32       64      128   0.00
## 1521b99hpp_av06        16      32       64      128      256   0.25
## 1532b99hpp_av04        16      32       64      128      256   0.25
## 2353b99hpp_av08r       16      32       64      128      256   0.25
##                  1091_at 1708_at 33818_at 546_at
## 1521a99hpp_av06      512    1024      256     32
## 1532a99hpp_av04      512    1024      256     32
## 2353a99hpp_av08      512    1024      256     32
## 1521b99hpp_av06     1024       0      512     64
## 1532b99hpp_av04     1024       0      512     64
## 2353b99hpp_av08r    1024       0      512     64

Note that these quantities were the same in the first three arrays and in the last three arrays. So we define two groups like this:

g <- factor(rep(0:1,each=3))

and create an index of which rows are associated with the artificially added genes:

spike <- rownames(y) %in% colnames(pData(rma95))

Q:Note that only these 16 genes are truly differentially expressed, since the six samples otherwise differ only due to random sampling (they all come from the same background pool of RNA).

Perform a t-test on each gene using the rowttests function in the genefilter package.

What proportion of genes with a p-value < 0.01 (no multiple comparison correction) are not part of the artificially added genes (false positives)?

library(genefilter)
rtt = rowttests(y,g)
index = rtt$p.value < 0.01 
print (mean( !spike[index] ))
## [1] 0.7608696
## We can make a volcano plot to visualize this:
mask <- with(rtt, abs(dm) < .2 & p.value < .01)
cols <- ifelse(mask,"red",ifelse(spike,"dodgerblue","black"))
with(rtt,plot(-dm, -log10(p.value), cex=.8, pch=16,
     xlim=c(-1,1), ylim=c(0,5),
     xlab="difference in means",
     col=cols))
abline(h=2,v=c(-.2,.2), lty=2)

A:0.7608696

Q:Now compute the within-group sample standard deviation for each gene (you can use group 1). Based on the p-value < 0.01 cut-off, split the genes into true positives, false positives, true negatives, and false negatives. Create a boxplot comparing the sample SDs for each group. Which of the following best describes the boxplot?

library(genefilter)
sds <- rowSds(y[,g==0])
index <- paste0( as.numeric(spike), as.numeric(rtt$p.value<0.01))
index <- factor(index,levels=c("11","01","00","10"),labels=c("TP","FP","TN","FN"))
boxplot(split(sds,index))

A: The false positives have smaller standard deviation.

Q:In the previous two questions we observed results consistent with the fact that the random variability associated with the sample standard deviation leads to t-statistics that are large by chance.

Note that the sample standard deviation we use in the t-test is an estimate and that with just a pair of triplicate samples, the variability associated with the denominator in the t-test can be large.

The following three steps perform the basic limma analysis. The eBayes step uses a hierarchical model that provides a new estimate of the gene specific standard error.

library(limma)
fit <- lmFit(y, design=model.matrix(~ g))
colnames(coef(fit))
## [1] "(Intercept)" "g1"
fit <- eBayes(fit)

Make a plot of the new hierarchical-model-based estimate of the standard deviation versus the sample-based estimate.

sampleSD = fit$sigma
posteriorSD = sqrt(fit$s2.post)

Which best describes what the hierarchical modelling approach does?

LIM = range( c(posteriorSD,sampleSD))
plot(sampleSD, posteriorSD,ylim=LIM,xlim=LIM)
abline(0,1)
abline(v=sqrt(fit$s2.prior))

A:Moves all the estimates of standard deviation closer to 0.12 (the prior standard deviation, sqrt(fit$s2.prior), drawn as the vertical line in the plot).

Q:Use these new estimates of the standard deviation in the denominator of the t-test and compute p-values. You can do it like this:

library(limma)
fit = lmFit(y, design=model.matrix(~ g))
fit = eBayes(fit)
##second coefficient relates to differences between groups
pvals = fit$p.value[,2] 

What proportion of genes with a p-value < 0.01 (no multiple comparison correction) are not part of the artificially added genes (false positives)?

index = pvals < 0.01 
print (mean( !spike[index] ))
## [1] 0.6486486
## We can make a volcano plot to visualize this:
mask <- abs(fit$coef[,2]) < .2 & fit$p.value[,2] < .01
cols <- ifelse(mask,"red",ifelse(spike,"dodgerblue","black"))
plot(fit$coef[,2], -log10(fit$p.value[,2]), cex=.8, pch=16,
     xlim=c(-1,1), ylim=c(0,5),
     xlab="difference in means",
     col=cols)
abline(h=2,v=c(-.2,.2), lty=2)

A:0.6486486

Compare to the previous volcano plot and notice that we no longer have small p-values for genes with small effect sizes.

Basic Exploratory Data Analysis

Plots: Volcano plots and p-value histograms, boxplots, and MAplots

An under-appreciated advantage of working with high-throughput data is that problems with the data are sometimes more easily exposed than with low-throughput data. The fact that we have thousands of measurements permits us to see problems that are not apparent when only a few measurements are available. A powerful way to detect these problems is with exploratory data analysis (EDA). Here we review some of the plots that allow us to detect quality problems.

Volcano plots

Here we will use the results obtained from applying a t-test to a gene expression dataset:

library(genefilter)
library(GSE5859Subset)
data(GSE5859Subset)
head(geneExpression)
##           GSM136508.CEL.gz GSM136530.CEL.gz GSM136517.CEL.gz
## 1007_s_at         6.543954         6.401470         6.298943
## 1053_at           7.546708         7.263547         7.201699
## 117_at            5.402622         5.050546         5.024917
## 121_at            7.892544         7.707754         7.461886
## 1255_g_at         3.242779         3.222804         3.185605
## 1294_at           7.531754         7.090270         7.466018
##           GSM136576.CEL.gz GSM136566.CEL.gz GSM136574.CEL.gz
## 1007_s_at         6.837899         6.470689         6.450220
## 1053_at           7.052761         6.980207         7.096195
## 117_at            5.304313         5.214149         5.173731
## 121_at            7.558130         7.819013         7.641136
## 1255_g_at         3.195363         3.251915         3.324934
## 1294_at           7.122145         7.058973         6.992396
##           GSM136575.CEL.gz GSM136569.CEL.gz GSM136568.CEL.gz
## 1007_s_at         6.052854         6.387026         6.640583
## 1053_at           6.983827         7.060558         7.010453
## 117_at            5.022882         5.175134         5.281784
## 121_at            7.729267         7.700608         7.615513
## 1255_g_at         3.088541         3.184015         3.076940
## 1294_at           7.112384         7.194791         6.884312
##           GSM136559.CEL.gz GSM136565.CEL.gz GSM136573.CEL.gz
## 1007_s_at         6.948474         6.778464         6.595414
## 1053_at           6.775048         7.063689         6.864693
## 117_at            5.309194         5.071376         5.091403
## 121_at            7.992304         7.706135         7.808486
## 1255_g_at         3.167413         3.492037         3.231536
## 1294_at           7.401553         7.478987         7.065355
##           GSM136523.CEL.gz GSM136509.CEL.gz GSM136727.CEL.gz
## 1007_s_at         6.255549         6.379983         6.133068
## 1053_at           7.174769         7.702533         7.280781
## 117_at            5.237160         5.398616         5.401876
## 121_at            7.574813         7.511478         7.607461
## 1255_g_at         3.208304         3.212051         3.225123
## 1294_at           7.344407         7.631689         7.018479
##           GSM136510.CEL.gz GSM136515.CEL.gz GSM136522.CEL.gz
## 1007_s_at         6.502051         6.331567         6.354293
## 1053_at           7.302209         7.456509         7.282859
## 117_at            5.395087         5.280535         4.986950
## 121_at            7.993732         7.632947         7.706585
## 1255_g_at         3.440186         3.185090         3.192436
## 1294_at           7.478820         7.577597         7.339535
##           GSM136507.CEL.gz GSM136524.CEL.gz GSM136514.CEL.gz
## 1007_s_at         6.517539         6.156754         6.037871
## 1053_at           7.689282         7.491967         7.413133
## 117_at            5.562001         5.039387         5.054133
## 121_at            7.612557         7.543667         7.507113
## 1255_g_at         3.107306         3.128269         3.085953
## 1294_at           7.398595         7.359040         7.377372
##           GSM136563.CEL.gz GSM136564.CEL.gz GSM136572.CEL.gz
## 1007_s_at         6.639091         6.393338         6.469794
## 1053_at           7.028731         6.697240         7.092346
## 117_at            5.361298         5.218937         5.340272
## 121_at            7.798197         7.976375         7.753480
## 1255_g_at         3.174794         3.409032         3.274033
## 1294_at           7.467240         7.355222         6.910401
g <- factor(sampleInfo$group)
results <- rowttests(geneExpression,g)
pvals <- results$p.value

And we also generate p-values from a dataset for which we know the null is true:

m <- nrow(geneExpression)
n <- ncol(geneExpression)
randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value

As we described earlier, reporting only p-values is a mistake when we can also report effect sizes. With high-throughput data, we can visualize the results by making a volcano plot, which shows both quantities for all features. On the y-axis we plot -log (base 10) p-values and on the x-axis we plot the effect size. By using -log (base 10), the “highly significant” features appear at the top of the plot. Using the log also permits us to better distinguish between small and very small p-values, for example 0.01 and \(10^{-6}\). Here is the volcano plot for our results above:

plot(results$dm,-log10(results$p.value),
     xlab="Effect size",ylab="- log (base 10) p-values")

Many features with very small p-values but small effect sizes, as we see here, can be indicative of problematic data.

p-value Histograms

Another plot we can create to get an overall idea of the results is to make histograms of p-values. When we generate completely null data the histogram follows a uniform distribution. With our original data set we see a higher frequency of smaller p-values.

library(rafalib)
mypar(1,2)
hist(nullpvals,ylim=c(0,1400))
hist(pvals,ylim=c(0,1400))
P-value histograms. We show a simulated case in which all null hypotheses are true (left) and p-values from the gene expression data described above (right).

When we expect most hypotheses to be null and don’t see a uniform p-value distribution, it might be indicative of unexpected properties, such as correlated samples.

If we permute the outcomes and calculate p-values then, if the samples are independent, we should see a uniform distribution. With these data we do not:

permg <- sample(g)
permresults <- rowttests(geneExpression,permg)
hist(permresults$p.value)
Histogram obtained after permuting labels.

In a later chapter we will see that the columns in this dataset are not independent and thus the assumptions used to compute the p-values here are incorrect.

Data boxplots and histograms

With high-throughput data, we have thousands of measurements for each experimental unit. As mentioned earlier, this can help us detect quality issues. For example, if one sample has a completely different distribution than the rest, we might suspect there are problems. Although a complete change in distribution could be due to real biological differences, more often than not it is due to a technical problem. Here we load a large gene expression experiment available from Bioconductor. We “accidentally” use log instead of log2 on one of the samples.

library(Biobase)
#devtools::install_github("genomicsclass/GSE5859")
library(GSE5859) 
data(GSE5859) 
ge <- exprs(e) ##ge for gene expression
dim(ge)
## [1] 8793  208
ge[,49] <- ge[,49]/log2(exp(1)) ## imitate the log instead of log2 mistake

A quick look at a summary of the distribution using boxplots immediately highlights the mistake:

library(rafalib)
mypar(1,1)
boxplot(ge,range=0,names=1:ncol(e),col=ifelse(1:ncol(ge)==49,1,2))
Boxplot for log-scale expression for all samples.

Note that the number of samples is a bit too large here, making it hard to see the boxes. One can instead simply show the boxplot summaries without the boxes:

qs <- t(apply(ge,2,
              quantile,prob=c(0.05,0.25,0.5,0.75,0.95)))
matplot(qs,type="l",lty=1)
The 0.05, 0.25, 0.5, 0.75, and 0.95 quantiles are plotted for each sample.

We refer to this figure as a kaboxplot because Karl Broman was the first person we saw use it as an alternative to boxplots.

We can also plot all the histograms. Because we have so much data, we create histograms using small bins, smooth the heights of the bars, and then plot the resulting smooth histograms. We re-calibrate the height of these smooth curves so that if a bar is made with a base of size “unit” and height given by the curve at \(x_0\), its area approximates the number of points in a region of size “unit” centered at \(x_0\):

mypar(1,1)
shist(ge,unit=0.5)
Smooth histograms for each sample.

MA plot

Scatterplots and correlation are not the best tools to detect replication problems. A better measure of replication can be obtained by examining the differences between values that should be the same. Therefore, a better plot is a rotation of the scatterplot, with the differences on the y-axis and the averages on the x-axis. This plot was originally named a Bland-Altman plot, but in genomics it is commonly referred to as an MA-plot. The name comes from plots of the red log intensity minus the green log intensity (M) versus the average of the log intensities (A) used with microarray data.

x <- ge[,1]
y <- ge[,2]
mypar(1,2)
plot(x,y)
plot((x+y)/2,x-y)
Scatter plot (left) and M versus A plot (right) for the same data.

Note that once we rotate the plot, the fact that these data have differences with a standard deviation of about:

sd(y-x)
## [1] 0.2025465

becomes immediately apparent. The scatterplot shows a very strong correlation, which is not necessarily informative here.

We will later introduce dendrograms, heatmaps, and multi-dimensional scaling plots.

Plots Exercises

Download and install the Bioconductor package SpikeInSubset, then load the library and the mas133 data:

# source("http://www.bioconductor.org/biocLite.R")
# biocLite("SpikeInSubset")
library(SpikeInSubset)
data(mas133)

Now make the following plot of the first two samples and compute the correlation:

e <- exprs(mas133)
plot(e[,1],e[,2],main=paste0("corr=",signif(cor(e[,1],e[,2]),3)),cex=0.5)
k <- 3000
b <- 1000 #a buffer
polygon(c(-b,k,k,-b),c(-b,-b,k,k),col="red",density=0,border="red")

Q:What proportion of the points are inside the box?

#We simply want to know which genes are below k for both samples:

mean(e[,1]<k & e[,2]<k)
## [1] 0.9475336

A:0.9475336

Q:Now make the same plot on the log scale:

plot(log2(e[,1]),log2(e[,2]),main=paste0("corr=",signif(cor(log2(e[,1]),log2(e[,2])),2)),cex=0.5)
k <- log2(3000)
b <- log2(0.5)
polygon(c(b,k,k,b),c(b,b,k,k),col="red",density=0,border="red")

What is an advantage of taking the log?

A: The tails do not dominate the plot.

When you take the log, 95% of the data is no longer in a tiny section of the plot.

Q:Make an MA-plot

e <- log2(exprs(mas133))
plot((e[,1]+e[,2])/2,e[,2]-e[,1],cex=0.5)

The two samples we are plotting are replicates (they are random samples from the same batch of RNA). The correlation of the data was 0.997 in the original scale and 0.96 in the log scale. High correlation is sometimes confused with evidence of replication, but replication implies we should see very small differences between the observations, which is better measured with distances or differences.

What is the standard deviation of the log ratios for this comparison?

#Here we simply compute the standard deviation of the differences

sd(e[,2]-e[,1])
## [1] 0.7767887
#However, note that if there is a mean shift the standard deviation will not capture it. We can instead consider the average distance:
sqrt(mean((e[,2]-e[,1])^2))
## [1] 0.7856103
#which turns out to be similar in this case.

A:0.7767887

Q:How many fold changes above 2 do we see? Note that these are measurements of log (base 2) expression, so a fold change of 2 translates into a difference, in absolute value, of 1.

#These are log2 measurements, so a fold change of 2 corresponds to a difference of 1 (in absolute value). We then simply count the occurrences:

sum(abs(e[,2]-e[,1])>1)
## [1] 3057

A:3057

Resources

https://github.com/genomicsclass

HTML book

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.0 (2016-05-03)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_India.1252          
##  tz       Asia/Calcutta               
##  date     2016-05-16
## Packages ------------------------------------------------------------------
##  package               * version  date      
##  affy                  * 1.48.0   2015-10-14
##  affyio                  1.40.0   2015-10-14
##  annotate                1.48.0   2015-10-14
##  AnnotationDbi           1.32.3   2015-12-24
##  assertthat              0.1      2013-12-06
##  base64enc             * 0.1-3    2015-07-28
##  Biobase               * 2.30.0   2015-10-14
##  BiocGenerics          * 0.16.1   2015-11-06
##  BiocInstaller           1.22.2   2016-05-12
##  car                   * 2.1-2    2016-03-25
##  colorspace              1.2-6    2015-03-11
##  curl                  * 0.9.7    2016-04-10
##  dagdata               * 1.1      2016-02-11
##  DataComputing         * 0.8.3    2016-03-19
##  DBI                     0.4      2016-05-02
##  devtools                1.11.1   2016-04-21
##  digest                  0.6.9    2016-01-08
##  downloader            * 0.4      2015-07-09
##  dplyr                 * 0.4.3    2015-09-01
##  evaluate                0.9      2016-04-29
##  formatR                 1.3      2016-03-05
##  genefilter            * 1.52.1   2016-01-28
##  GenomeInfoDb          * 1.6.3    2016-01-26
##  GenomicRanges         * 1.22.4   2016-02-01
##  ggdendro                0.1-20   2016-04-27
##  ggplot2               * 2.1.0    2016-05-06
##  gridExtra               2.2.1    2016-02-29
##  GSE5859               * 1.0      2016-05-16
##  GSE5859Subset         * 1.0      2016-05-09
##  gtable                  0.2.0    2016-02-26
##  htmltools               0.3.5    2016-03-21
##  IRanges               * 2.4.8    2016-02-26
##  knitr                 * 1.12.3   2016-01-22
##  lattice               * 0.20-33  2015-07-14
##  lazyeval                0.1.10   2015-01-02
##  limma                 * 3.26.9   2016-03-23
##  lme4                    1.1-12   2016-04-16
##  lubridate             * 1.5.6    2016-04-06
##  magrittr                1.5      2014-11-22
##  manipulate            * 1.0.1    2014-12-24
##  maPooling             * 1.0      2016-05-13
##  MASS                    7.3-45   2016-04-21
##  Matrix                  1.2-6    2016-05-02
##  MatrixModels            0.4-1    2015-08-22
##  matrixStats           * 0.50.2   2016-04-24
##  memoise                 1.0.0    2016-01-29
##  mgcv                    1.8-12   2016-03-03
##  mime                    0.4      2015-09-03
##  minqa                   1.2.4    2014-10-09
##  mosaic                * 0.13.0   2015-12-18
##  mosaicData            * 0.13.0   2015-12-18
##  munsell                 0.4.3    2016-02-13
##  nlme                    3.1-127  2016-04-16
##  nloptr                  1.0.4    2014-08-04
##  nnet                    7.3-12   2016-02-02
##  parathyroidSE         * 1.8.0    2016-05-13
##  pbkrtest                0.4-6    2016-01-27
##  plyr                    1.8.3    2015-06-12
##  preprocessCore          1.32.0   2015-10-14
##  quantreg                5.21     2016-02-13
##  qvalue                * 2.2.2    2016-01-08
##  R6                      2.1.2    2016-01-26
##  rafalib               * 1.0.0    2015-08-09
##  RColorBrewer            1.1-2    2014-12-07
##  Rcpp                    0.12.4   2016-03-26
##  reshape2                1.4.1    2014-12-06
##  rmarkdown               0.9.6    2016-05-01
##  RSQLite                 1.0.0    2014-10-25
##  S4Vectors             * 0.8.11   2016-01-29
##  scales                  0.4.0    2016-02-26
##  SparseM                 1.7      2015-08-15
##  SpikeInSubset         * 1.10.0   2016-05-09
##  stringi                 1.0-1    2015-10-22
##  stringr               * 1.0.0    2015-04-30
##  SummarizedExperiment  * 1.0.2    2016-01-01
##  survival                2.39-2   2016-04-16
##  tidyr                 * 0.4.1    2016-02-05
##  tissuesGeneExpression * 1.0      2016-05-13
##  withr                   1.0.1    2016-02-04
##  XML                     3.98-1.4 2016-03-01
##  xtable                  1.8-2    2016-02-05
##  XVector                 0.10.0   2015-10-14
##  yaml                    2.1.13   2014-06-12
##  zlibbioc                1.16.0   2015-10-14
##  source                                              
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.2.1)                                      
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  CRAN (R 3.2.4)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.5)                                      
##  Github (genomicsclass/dagdata@b486130)              
##  Github (DataComputing/DataComputing@fb983c7)        
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.2.4)                                      
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  CRAN (R 3.2.5)                                      
##  Github (hadley/ggplot2@59c503b)                     
##  CRAN (R 3.2.3)                                      
##  Github (genomicsclass/GSE5859@3d8cdd7)              
##  Github (genomicsclass/GSE5859Subset@8ada5f4)        
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.4)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.1)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.2.4)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.2.0)                                      
##  Github (genomicsclass/maPooling@489b3c7)            
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.2.3)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.1)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.3)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.2.0)                                      
##  CRAN (R 3.2.4)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.2.1)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.2)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.2.1)                                      
##  Bioconductor                                        
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.3)                                      
##  Github (genomicsclass/tissuesGeneExpression@a43cf4b)
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.3)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.1)                                      
##  Bioconductor
#sessionInfo()$otherPkgs