Applied Statistics for High-throughput Biology: Day 4

Levi Waldron

June 23, 2017

Day 4 outline

Book chapter 7

Distances in high-dimensional data analysis

The importance of distance

animals

The importance of distance (cont’d)

The importance of distance (cont’d)

Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf

Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identification mark \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

Euclidian distance (metric)

Euclidian distance in high dimensions

##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Euclidian distance in high dimensions

Euclidian distance in high dimensions

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

Matrix algebra notation

t(matrix(1:3, ncol=1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol=1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)
##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2)^2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1)^2))
## [1] 122.8919

3 sample example using dist()

dim(e)
## [1] 22215   189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773
class(d)
## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

Note on standardization

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

Distance Exercises: p. 322-323

Dimension reduction and PCA

Motivation for dimension reduction

Simulate the heights of twin pairs:

library(MASS)
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Motivation for dimension reduction

Motivation for dimension reduction

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

SVD

Singular Value Decomposition (SVD)

SVD

SVD of gene expression dataset

Scaling:

system.time(e.standardize.slow <- t(apply(e, 1, function(x) x - mean(x) )))
##    user  system elapsed 
##   1.429   0.115   1.568
system.time(e.standardize.fast <- t(scale(t(e), scale=FALSE)))
##    user  system elapsed 
##   0.116   0.013   0.130
all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
## [1] TRUE

SVD:

s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"

SVD of gene expression dataset

dim(s$u)     # loadings
## [1] 22215   189
length(s$d)  # eigenvalues
## [1] 189
dim(s$v)     # d %*% vT = scores
## [1] 189 189
SVD

PCA of gene expression dataset

rafalib::mypar()
p <- prcomp( t(e.standardize.fast) )
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

PCA interpretation: loadings

SVD

PCA interpretation: loadings

plot(p$rotation[, 1], xlab="Index of genes", ylab="Loadings of PC1", 
     main="Scores of PC1") #or, predict(p)
abline(h=c(-0.03, 0.04), col="red")

PCA interpretation: loadings

Genes with high PC1 loadings:

e.pc1genes <- e.standardize.fast[p$rotation[, 1] < -0.03 | p$rotation[, 1] > 0.03, ]
pheatmap::pheatmap(e.pc1genes, scale="none", show_rownames=TRUE, 
                  show_colnames = FALSE)

PCA interpretation: eigenvalues

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function):

PCA interpretation: scores

SVD

PCA interpretation: scores

Multi-dimensional Scaling (MDS)

d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)

Multi-dimensional Scaling (MDS)

Batch Effects

Batch Effects

Batch Effects - an example

library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

Batch Effects - an example

head(table(sampleInfo$date))
## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2

Batch Effects - an example

year <-  as.integer(format(sampleInfo$date,"%y") )
year <- year - min(year)
month = as.integer( format(sampleInfo$date,"%m") ) + 12*year
table(year, sampleInfo$ethnicity)
##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

Batch Effects: PCA

pc <- prcomp(t(geneExpression), scale.=TRUE)
boxplot(pc$x[, 1] ~ month, varwidth=TRUE, notch=TRUE,
        main="PC1 scores vs. month", xlab="Month", ylab="PC1")

Batch Effects: MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n=3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n=3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space="Lab")(length(unique(month)))

Batch Effects: MDS

d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col=MYcols[as.integer(factor(month))],
     main="MDS shaded by batch")

Batch Effects: clustering

hcclass <- cutree(hclust(d), k=5)
table(hcclass, year)
##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

Batch Effects - summary

Conclusions

Final Project

Your project for the course will be to:

  1. run this RNA-seq analysis workflow provided in the R Markdown file on your own computer. Make sure to set the knitr option {cache=TRUE} is set to speed up subsequent compiling of this R Markdown script.

  2. Remove the text provided in this Rmd file, and replace it with your own basic explanations of what is going on. You do not need to explain things that we didn’t cover in class, although you are welcome to do so. For things that we did cover in class, write a short explanation that would be appropriate to someone at your own level before you started the course. You don’t need to write an essay here, the point is for you to identify familiar methods in this workflow and show that you’ve learned something about how to interpret their use in practice.