Basic Exploratory Data Analysis

An under-appreciated advantage of working with high-throughput data is that problems with the data are sometimes more easily exposed than with low-throughput data. The fact that we have thousands of measurements permits us to see problems that are not apparent when only a few measurements are available. A powerful way to detect these problems is with exploratory data analysis (EDA). Here we review some of the plots that allow us to detect quality problems.

Volcano plots

Here we will use the results obtained from applying t-test to data from a gene expression dataset:

library(genefilter)
library(GSE5859Subset)
data(GSE5859Subset)
g <- factor(sampleInfo$group)
results <- rowttests(geneExpression,g)
pvals <- results$p.value

And we also generate p-values from a dataset for which we know the null is true:

m <- nrow(geneExpression)
n <- ncol(geneExpression)
randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value

As we described earlier, reporting only p-values is a mistake when we can also report effect sizes. With high-throughput data, we can visualize the results by making a volcano plot. The idea behind a volcano plot is to show these for all features. In the y-axis we plot -log (base 10) p-values and on the x-axis we plot the effect size. By using -log (base 10), the “highly significant” features appear at the top of the plot. Using log also permits us to better distinguish between small and very small p-values, for example 0.01 and \(10^6\). Here is the volcano plot for our results above:

plot(results$dm,-log10(results$p.value),
     xlab="Effect size",ylab="- log (base 10) p-values")

Many features with very small p-values, but small effect sizes as we see here, are sometimes indicative of problematic data.

p-value Histograms

Another plot we can create to get an overall idea of the results is to make histograms of p-values. When we generate completely null data the histogram follows a uniform distribution. With our original dataset we see a higher frequency of smaller p-values.

library(rafalib)
mypar(1,2)
hist(nullpvals,ylim=c(0,1400))
hist(pvals,ylim=c(0,1400))

P-value histogram. We show a simulated case in which all null hypotheses are true (left) and p-values from the gene expression described above.

When we expect most hypotheses to be null and don’t see a uniform p-value distribution, it might be indicative of unexpected properties, such as correlated samples.

If we permute the outcomes and calculate p-values then, if the samples are independent, we should see a uniform distribution. With these data we do not:

permg <- sample(g)
permresults <- rowttests(geneExpression,permg)
hist(permresults$p.value)

Histogram obtained after permuting labels.

Later we will discuss that the columns in this dataset are not independent and thus the assumptions used to compute the p-values here are incorrect.

Data boxplots and histograms

With high-throughput data, we have thousands of measurements for each experimental unit. This can help us detect quality issues. For example, if one sample has a completely different distribution than the rest, we might suspect there are problems. Although a complete change in distribution could be due to real biological differences, more often than not it is due to a technical problem. Here we load a large gene expression experiment available from Bioconductor. We “accidentally” use log instead of log2 on one of the samples.

library(Biobase)
library(devtools)
install_github("genomicsclass/GSE5859")
library(GSE5859) 
data(GSE5859) 
ge <- exprs(e) ##ge for gene expression
ge[,49] <- ge[,49]/log2(exp(1)) ##imitate error

A quick look at a summary of the distribution using boxplots immediately highlights the mistake:

library(rafalib)
mypar(1,1)
boxplot(ge,range=0,names=1:ncol(e),col=ifelse(1:ncol(ge)==49,1,2))

Boxplot for log-scale expression for all samples.

MA plot

Scatterplots and correlation are not the best tools to detect replication problems. A better measure of replication can be obtained from examining the differences between the values that should be the same. Therefore, a better plot is a rotation of the scatterplot containing the differences on the y-axis and the averages on the x-axis. This plot was originally named a Bland-Altman plot, but in genomics it is commonly referred to as an MA-plot. The name MA comes from plots of red log intensity minus (M) green intensities versus average (A) log intensities used with microarrays (MA) data.

x <- ge[,1]
y <- ge[,2]
mypar(1,2)
plot(x,y)
plot((x+y)/2,x-y)

Scatter plot (left) and M versus A plot (right) for the same data.

Note that once we rotate the plot, the fact that these data have differences of about:

sd(y-x)

## [1] 0.2025465

becomes immediate. The scatterplot shows very strong correlation, which is not necessarily informative here.

Statistical Models

All models are wrong, but some are useful. -George E. P. Box

When we see a p-value in the literature, it means a probability distribution of some sort was used to quantify the null hypothesis. Many times deciding which probability distribution to use is relatively straightforward. Most p-values in the scientific literature are based on sample averages or least squares estimates from a linear model and make use of the CLT to approximate the null distribution of their statistic as normal.

The CLT is backed by theoretical results that guarantee that the approximation is accurate. However, we cannot always use this approximation, such as when our sample size is too small. Previously, we described how the sample average can be approximated as t-distributed when the population data is approximately normal. However, there is no theoretical backing for this assumption. We are now modeling. In the case of human height, we know from experience that this turns out to be a very good model.

But this does not imply that every dataset we collect will follow a normal distribution. Some examples are: coin tosses, the number of people who win the lottery, and US incomes. The normal distribution is not the only parametric distribution that is available for modeling. We will now discuss a brief introduction to some of the most widely used parametric distributions and some of their uses in the life sciences. We focus on the models and concepts needed to understand the techniques currently used to perform statistical inference on high-throughput data. To do this we also need to introduce the basics of Bayesian statistics. For more in-depth description of probability models and parametric distributions please consult a Statistics textbook such as this one.

The Binomial Distribution

The first distribution we will describe is the binomial distribution. It reports the probability of observing \(S=k\) successes in \(N\) trials as

\[ \mbox{Pr}(S=k) = {N \choose k}p^k (1-p)^{N-k} \]

with \(p\) the probability of success. The best known example is coin tosses with \(S\) the number of heads when tossing \(N\) coins. In this example \(p=0.5\).

Note that \(S/N\) is the average of independent random variables and thus the CLT tells us that \(S\) is approximately normal when \(N\) is large. This distribution has many applications in the life sciences. Recently, it has been used by the variant callers and genotypers applied to next generation sequencing. A special case of this distribution is approximated by the Poisson distribution which we describe next.

The Poisson Distribution

Since it is the sum of binary outcomes, the number of people that win the lottery follows a binomial distribution (we assume each person buys one ticket). The number of trials \(N\) is the number of people that buy tickets and is usually very large. However, the number of people that win the lottery oscillates between 0 and 3, which implies the normal approximation does not hold. So why does CLT not hold? One can explain this mathematically, but the intuition is that with the sum of successes so close to and also constrained to be larger than 0, it is impossible for the distribution to be normal. Here is a quick simulation:

p=10^-7 ##1 in 10,000,0000 chances of winning
N=5*10^6 ##5,000,000 tickets bought
winners=rbinom(1000,N,p) ##100 is the number of different lotto draws
tab=table(winners)
plot(tab)

Number of people that win the lottery obtained from Monte Carlo simulation.

prop.table(tab)

## winners
##     0     1     2     3     4 
## 0.590 0.316 0.083 0.009 0.002

For cases like this, where \(N\) is very large, but \(p\) is small enough to make \(N \times p\) (call it \(\lambda\)) a number between 0 and, for example, 10, then \(S\) can be shown to follow a Poisson distribution, which has a simple parametric form:

\[ \mbox{Pr}(S=k)=\frac{\lambda^k \exp{-\lambda}}{k!} \]

The Poisson distribution is commonly used in RNA-seq analyses. Because we are sampling thousands of molecules and most genes represent a very small proportion of the totality of molecules, the Poisson distribution seems appropriate.

So how does this help us? One way is that it provides insight about the statistical properties of summaries that are widely used in practice. For example, let’s say we only have one sample from each of a case and control RNA-seq experiment and we want to report the genes with the largest fold-changes. One insight that the Poisson model provides is that under the null hypothesis of no true differences, the statistical variability of this quantity depends on the total abundance of the gene. We can show this mathematically, but here is a quick simulation to demonstrate the point:

N=10000##number of genes
lambdas=2^seq(1,16,len=N) ##these are the true abundances of genes
y=rpois(N,lambdas)##note that the null hypothesis is true for all genes
x=rpois(N,lambdas) 
ind=which(y>0 & x>0)##make sure no 0s due to ratio and log

library(rafalib)
splot(log2(lambdas),log2(y/x),subset=ind)

MA plot of simulated RNA-seq data. Replicated measurements follow a Poisson distribution.

For lower values of lambda there is much more variability and, if we were to report anything with a fold change of 2 or more, the number of false positives would be quite high for low abundance genes.

Next Generation Sequencing experiments and the Poisson distribution

In this section we will use the data stored in this dataset:

source("https://bioconductor.org/biocLite.R")
biocLite("parathyroidSE")
library(parathyroidSE) ##available from Bioconductor

## Warning: package 'GenomicRanges' was built under R version 3.4.2

## Warning: package 'GenomeInfoDb' was built under R version 3.4.2

## Warning: package 'matrixStats' was built under R version 3.4.2

data(parathyroidGenesSE)
se <- parathyroidGenesSE

This package provides RangedSummarizedExperiment objects of read counts in genes and exonic parts for paired-end RNA-Seq data from experiments on primary cultures of parathyroid tumors. The data were presented in this article The sequencing was performed on tumor cultures from 4 patients at 2 time points over 3 conditions (DPN, OHT and control). One control sample was omitted by the paper authors due to low quality.

The important thing to know is that it includes a matrix of data, where each row is a genomic feature and each column is a sample. We can extract this data using the assay function. For this dataset, the value of a single cell in the data matrix is the count of reads which align to a given gene for a given sample. Thus, a similar plot to the one we simulated above with technical replicates reveals that the behavior predicted by the model is present in experimental data:

x <- assay(se)[,23]
y <- assay(se)[,24]
ind=which(y>0 & x>0)##make sure no 0s due to ratio and log
splot((log2(x)+log2(y))/2,log(x/y),subset=ind)

MA plot of replicated RNA-seq data.

If we compute the standard deviations across four individuals, it is quite a bit higher than what is predicted by a Poisson model. Assuming most genes are differentially expressed across individuals, then, if the Poisson model is appropriate, there should be a linear relationship in this plot:

library(rafalib)
library(matrixStats)

vars=rowVars(assay(se)[,c(2,8,16,21)]) ##we know these are four
means=rowMeans(assay(se)[,c(2,8,16,21)]) ##different individuals

splot(means,vars,log="xy",subset=which(means>0&vars>0)) ##plot a subset of data
abline(0,1,col=2,lwd=2)

Variance versus mean plot. Summaries were obtained from the RNA-seq data.

The reason for this is that the variability plotted here includes biological variability, which the motivation for the Poisson does not include. The negative binomial distribution, which combines the sampling variability of a Poisson and biological variability, is a more appropriate distribution to model this type of experiment. The negative binomial has two parameters and permits more flexibility for count data. For more on the use of the negative binomial to model RNA-seq data you can read this paper. The Poisson is a special case of the negative binomial distribution.

Maximum Likelihood Estimation

To illustrate the concept of maximum likelihood estimates (MLE), we use a relatively simple dataset containing palindrome locations in the Human cytomegalovirus genome. We read in the locations of the palindrome and then count the number of palindromes in each 4,000 basepair segments.

datadir="http://www.biostat.jhsph.edu/bstcourse/bio751/data"
x=read.csv(file.path(datadir,"hcmv.csv"))[,2]

breaks=seq(0,4000*round(max(x)/4000),4000)
tmp=cut(x,breaks)
counts=table(tmp)

library(rafalib)
mypar(1,1)
hist(counts)

Palindrome count histogram.

The counts do appear to follow a Poisson distribution. But what is the rate \(\lambda\) ? The most common approach to estimating this rate is maximum likelihood estimation. To find the maximum likelihood estimate (MLE), we note that these data are independent and the probability of observing the values we observed is:

\[ \Pr(X_1=k_1,\dots,X_n=k_n;\lambda) = \prod_{i=1}^n \lambda^{k_i} / k_i! \exp ( -\lambda) \]

The MLE is the value of \(\lambda\) that maximizes the likelihood:.

\[ \mbox{L}(\lambda; X_1=k_1,\dots,X_n=k_1)=\exp\left\{\sum_{i=1}^n \log \Pr(X_i=k_i;\lambda)\right\} \]

In practice, it is more convenient to maximize the log-likelihood which is the summation that is exponentiated in the expression above. Below we write code that computes the log-likelihood for any \(\lambda\) and use the function optimize to find the value that maximizes this function (the MLE). We show a plot of the log-likelihood along with vertical line showing the MLE.

l<-function(lambda) sum(dpois(counts,lambda,log=TRUE)) 

lambdas<-seq(3,7,len=100)
ls <- exp(sapply(lambdas,l))

plot(lambdas,ls,type="l")

mle=optimize(l,c(0,10),maximum=TRUE)
abline(v=mle$maximum)

Likelihood versus lambda.

If you work out the math and do a bit of calculus, you realize that this is a particularly simple example for which the MLE is the average.

print( c(mle$maximum, mean(counts) ) )

## [1] 5.157894 5.157895

Note that a plot of observed counts versus counts predicted by the Poisson shows that the fit is quite good in this case:

theoretical<-qpois((seq(0,99)+0.5)/100,mean(counts))

qqplot(theoretical,counts)
abline(0,1)

Observed counts versus theoretical Poisson counts.

We therefore can model the palindrome count data with a Poisson with \(\lambda=5.16\).

Distributions for Positive Continuous Values

Different genes vary differently across biological replicates. Later, in the hierarchical models chapter, we will describe one of the most influential statistical methods in the analysis of genomics data. This method provides great improvements over naive approaches to detecting differentially expressed genes. This is achieved by modeling the distribution of the gene variances. Here we describe the parametric model used in this method.

We want to model the distribution of the gene-specific standard errors. Are they normal? Keep in mind that we are modeling the population standard errors so CLT does not apply, even though we have thousands of genes.

As an example, we use an experimental data that included both technical and biological replicates for gene expression measurements on mice. We can load the data and compute the gene specific sample standard error for both the technical replicates and the biological replicates

library(Biobase) ##available from Bioconductor
library(devtools)
install_github("genomicsclass/maPooling")
library(maPooling) 

data(maPooling)
pd=pData(maPooling)

##determine which samples are bio reps and which are tech reps
strain=factor(as.numeric(grepl("b",rownames(pd))))
pooled=which(rowSums(pd)==12 & strain==1)
techreps=exprs(maPooling[,pooled])
individuals=which(rowSums(pd)==1 & strain==1)

##remove replicates
individuals=individuals[-grep("tr",names(individuals))]
bioreps=exprs(maPooling)[,individuals]

###now compute the gene specific standard deviations
library(matrixStats)
techsds=rowSds(techreps)
biosds=rowSds(bioreps)

We can now explore the sample standard deviation:

###now plot
library(rafalib)
mypar()
shist(biosds,unit=0.1,col=1,xlim=c(0,1.5))
shist(techsds,unit=0.1,col=2,add=TRUE)
legend("topright",c("Biological","Technical"), col=c(1,2),lty=c(1,1))

Histograms of biological variance and technical variance.

An important observation here is that the biological variability is substantially higher than the technical variability. This provides strong evidence that genes do in fact have gene-specific biological variability.

If we want to model this variability, we first notice that the normal distribution is not appropriate here since the right tail is rather large. Also, because SDs are strictly positive, there is a limitation to how symmetric this distribution can be. A qqplot shows this very clearly:

qqnorm(biosds)
qqline(biosds)

Normal qq-plot for sample standard deviations.

There are parametric distributions that possess these properties (strictly positive and heavy right tails). Two examples are the gamma and F distributions. The density of the gamma distribution is defined by:

\[ f(x;\alpha,\beta)=\frac{\beta^\alpha x^{\alpha-1}\exp{-\beta x}}{\Gamma(\alpha)} \]

It is defined by two parameters \(\alpha\) and \(\beta\) that can, indirectly, control location and scale. They also control the shape of the distribution. For more on this distribution please refer to this book.

Two special cases of the gamma distribution are the chi-squared and exponential distribution. chi-squared test can be used to analyze a 2x2 table data. For chi-square, we have \(\alpha=\nu/2\) and \(\beta=2\) with \(\nu\) the degrees of freedom. For exponential, we have \(\alpha=1\) and \(\beta=\lambda\) the rate.

The F-distribution comes up in analysis of variance (ANOVA). It is also always positive and has large right tails. Two parameters control its shape:

\[ f(x,d_1,d_2)=\frac{1}{B\left( \frac{d_1}{2},\frac{d_2}{2}\right)} \left(\frac{d_1}{d_2}\right)^{\frac{d_1}{2}} x^{\frac{d_1}{2}-1}\left(1+\frac{d1}{d2}x\right)^{-\frac{d_1+d_2}{2}} \]

with \(B\) the beta function and \(d_1\) and \(d_2\) are called the degrees of freedom for reasons having to do with how it arises in ANOVA. A third parameter is sometimes used with the F-distribution, which is a scale parameter.

Modeling the variance

It is mathematically convenient to model the distribution of the variance \(\sigma^2\). The hierarchical model used here implies that the sample standard deviation of genes follows scaled F-statistics:

\[ s^2 \sim s_0^2 F_{d,d_0} \]

with \(d\) the degrees of freedom involved in computing \(s^2\). For example, in a case comparing 3 versus 3, the degrees of freedom would be 4. This leaves two free parameters to adjust to the data. Here \(d\) will control the location and \(s_0\) will control the scale. Below are some examples of \(F\) distribution plotted on top of the histogram from the sample variances:

library(rafalib)
mypar(3,3)
sds=seq(0,2,len=100)
for(d in c(1,5,10)){
  for(s0 in c(0.1, 0.2, 0.3)){
    tmp=hist(biosds,main=paste("s_0 =",s0,"d =",d),
      xlab="sd",ylab="density",freq=FALSE,nc=100,xlim=c(0,1))
    dd=df(sds^2/s0^2,11,d)
    ##multiply by normalizing constant to assure same range on plot
    k=sum(tmp$density)/sum(dd) 
    lines(sds,dd*k,type="l",col=2,lwd=2)
    }
}

Histograms of sample standard deviations and densities of estimated distributions.

Now which \(s_0\) and \(d\) fit our data best? This is a rather advanced topic as the MLE does not perform well for this particular distribution (we refer to Smyth (2004)). The Bioconductor limma package provides a function to estimate these parameters:

source("https://bioconductor.org/biocLite.R")
biocLite("limma")

## package 'limma' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Mete Civelek\AppData\Local\Temp\RtmpwHnieX\downloaded_packages

library(limma)

## Warning: package 'limma' was built under R version 3.4.2

estimates=fitFDist(biosds^2,11)

theoretical<- sqrt(qf((seq(0,999)+0.5)/1000, 11, estimates$df2)*estimates$scale)
observed <- biosds

The fitted models do appear to provide a reasonable approximation, as demonstrated by the qq-plot and histogram:

mypar(1,2)
qqplot(theoretical,observed)
abline(0,1)
tmp=hist(biosds,main=paste("s_0 =", signif(estimates[[1]],2),
                  "d =", signif(estimates[[2]],2)),
  xlab="sd", ylab="density", freq=FALSE, nc=100, xlim=c(0,1), ylim=c(0,9))
dd=df(sds^2/estimates$scale,11,estimates$df2)
k=sum(tmp$density)/sum(dd) ##a normalizing constant to assure same area in plot
lines(sds, dd*k, type="l", col=2, lwd=2)

qq-plot (left) and density (right) demonstrate that model fits data well.

Basic Explarotary Analysis using High Dimensional Data and Statistical Models

Rafael Irizarry and Michael Love modified by Mete Civelek

October 24, 2017