Applied Statistics for High-throughput Biology Day 4

Levi Waldron

Day 4 outline

Book chapter 8:

Distances in high dimensions
Principal Components Analysis and Singular Value Decomposition
Multidimensional Scaling
Batch Effects (Chapter 10)

Metrics and distances

A metric satisfies the following five properties:

non-negativity \(d(a, b) \ge 0\)
symmetry \(d(a, b) = d(b, a)\)
identification mark \(d(a, a) = 0\)
definiteness \(d(a, b) = 0\) if and only if \(a=b\)
triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

A distance is only required to satisfy 1-3.
A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

Euclidian distance (metric)

Remember grade school: Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).
Side note: also referred to as \(L_2\) norm

Euclidian distance in high dimensions

## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'tissuesGeneExpression'

## * checking for file ‘/tmp/RtmpP0WEXN/remotes12787a4ea081c0/genomicsclass-tissuesGeneExpression-a43cf4b/DESCRIPTION’ ... OK
## * preparing ‘tissuesGeneExpression’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * creating default NAMESPACE file
## * building ‘tissuesGeneExpression_1.0.tar.gz’

## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'GSE5859'

## * checking for file ‘/tmp/RtmpP0WEXN/remotes12787a282882c/genomicsclass-GSE5859-3d8cdd7/DESCRIPTION’ ... OK
## * preparing ‘GSE5859’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * creating default NAMESPACE file
## * building ‘GSE5859_1.0.tar.gz’

## BiocManager::install("genomicsclass/tissuesGeneExpression") #if needed
## BiocManager::install("genomicsclass/GSE5859") #if needed
library(GSE5859)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data

## [1] 22215   189

table(tissue) ##tissue[i] corresponds to e[,i]

## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Notes about Euclidian distance in high dimensions

Points are no longer on the Cartesian plane
instead they are in higher dimensions. For example:
- sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
- feature \(g\) is defined by a point in 189 dimensions \((Y_{g,189},\dots,Y_{g,189})^\top\)

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Euclidian distance in matrix algebra notation

The Euclidian distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

t(matrix(1:3, ncol = 1))

##      [,1] [,2] [,3]
## [1,]    1    2    3

matrix(1:3, ncol = 1)

##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3

t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)

##      [,1]
## [1,]   14

Note about matrix algebra in R

R is very efficient at matrix algebra
for very large matricies, see the:
- Matrix CRAN package (sparse matrices)
- rhdf5 and DelayedArray Bioconductor package (on-disk arrays)

3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))

## [1] 85.8546

sqrt(sum((kidney1 - colon1) ^ 2))

## [1] 122.8919

3 sample example using dist()

dim(e)

## [1] 22215   189

(d <- dist(t(e[, c(1, 2, 87)])))

##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773

class(d)

## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(
    x,
    method = "euclidean",
    diag = FALSE,
    upper = FALSE,
    p = 2
)

method: the distance measure to be used.
- This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Any unambiguous substring can be given.
dist class output from dist() is used for many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

Note on standardization

In practice, features (e.g. genes) are typically “standardized”, i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
- GC bias, PCR amplification efficiency, …
Euclidian distance and \(1-r\) (Pearson cor) are related:
- \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)
- \(m\) = # dimensions

Dimension reduction and PCA

Motivation for dimension reduction

Simulate the heights of twin pairs:

set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)

## [1]   2 100

cor(t(y))

##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

Visualizing twin pairs data

Not much distance is lost in the second dimension

Not much loss of height differences when just using average heights of twin pairs.
- because twin heights are highly correlated

Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

note: the above formulation is for \(m\) rows \(> n\) columns
\(\mathbf{Y}\): the \(m\) rows x \(n\) cols matrix of measurements
\(\mathbf{U}\): \(m \times n\) matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): \(n \times n\) diagonal matrix (eigenvalues)
\(\mathbf{V}\): \(n \times n\) orthogonal matrix (eigenvectors or PCA scores)
- orthogonal = unit length and “perpendicular” in 3-D

SVD of gene expression dataset

Scaling:

system.time(e.standardize.slow <-
                t(apply(e, 1, function(x)
                    x - mean(x))))

##    user  system elapsed 
##   0.578   0.080   0.657

system.time(e.standardize.fast <- t(scale(t(e), scale = FALSE)))

##    user  system elapsed 
##   0.131   0.041   0.171

all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])

## [1] TRUE

SVD:

s <- svd(e.standardize.fast)
names(s)

## [1] "d" "u" "v"

Components of SVD results

dim(s$u)     # loadings

## [1] 22215   189

length(s$d)  # eigenvalues

## [1] 189

dim(s$v)     # d %*% vT = scores

## [1] 189 189

PCA is a SVD

gene expression dataset

rafalib::mypar()
p <- prcomp(t(e.standardize.fast))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

PCA interpretation: loadings

\(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
- think of principal component axes as a weighted combination of original axes

Visualizing PCA loadings

plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "PC1 loadings of each gene") #or, predict(p)
abline(h = c(-0.03, 0.03), col = "red")

Genes with high PC1 loadings

e.pc1genes <-
    e.standardize.fast[p$rotation[, 1] < -0.03 |
                           p$rotation[, 1] > 0.03,]
pheatmap::pheatmap(
    e.pc1genes,
    scale = "none",
    show_rownames = TRUE,
    show_colnames = FALSE
)

PCA interpretation: eigenvalues

\(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.

rafalib::mypar()
plot(
    p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
    xlim = c(0, 150),
    type = "b",
    ylab = "% variance explained",
    main = "Screeplot"
)

PCA interpretation: eigenvalues

Alternatively as cumulative % variance explained (using cumsum() function)

rafalib::mypar()
plot(
    cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
    ylab = "cumulative % variance explained",
    ylim = c(0, 100),
    type = "b",
    main = "Cumulative screeplot"
)

PCA interpretation: scores

\(\mathbf{V}\) (scores): The “datapoints” in the reduced prinipal component space
In some implementations (like prcomp()), scores are already scaled by eigenvalues: \(\mathbf{D V^T}\)

PCA interpretation: scores

Multi-dimensional Scaling (MDS)

also referred to as Principal Coordinates Analysis (PCoA)
a reduced SVD, performed on a distance matrix
identify two (or more) eigenvalues/vectors that preserve distances

d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)

Summary: distances and dimension reduction

Note: signs of eigenvalues (square to get variances) and eigenvectors (loadings) can be arbitrarily flipped
PCA and MDS are useful for dimension reduction when you have correlated variables
Variables are always centered.
Variables are also scaled unless you know they have the same scale in the population
PCA projection can be applied to new datasets if you know the matrix calculations
PCA is subject to over-fitting, screeplot can be tested by cross-validation

PCA and MDS Exercises

Repeat the PCA after removing half of the features with lowest variance. Does it look any different?
Repeat the MDS, using Spearman correlation instead of Pearson correlation. Does it look any different?

Batch Effects

pervasive in genomics (e.g. Leek et al. Nat Rev Genet. 2010 Oct;11(10):733-9.)
affect DNA and RNA sequencing, proteomics, imaging, microarray…
have caused high-profile problems and retractions
- you can’t get rid of them
- but you can make sure they are not confounded with your experiment

Batch Effects - an example

Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
Title: Common genetic variants account for differences in gene expression among ethnic groups.
- “The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes.”

library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

Note: the ExpressionSet object is obsolete, we use SummarizedExperiment now

Date of processing as a proxy for batch

Sample metadata included date of processing:

head(table(sampleInfo$date))

## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2

year <-  as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)

##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

Visualizing batch effects by PCA

pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
    pc$x[, 1] ~ month,
    varwidth = TRUE,
    notch = TRUE,
    main = "PC1 scores vs. month",
    xlab = "Month",
    ylab = "PC1"
)

Visualizing batch effects by MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))

d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
     main = "MDS shaded by month")

The batch effects impact clustering

hcclass <- cutree(hclust(d), k = 5)
table(hcclass, year)

##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

Batch Effects - summary

tend to affect many or all measurements by a little bit
during experimental design:
- keep track of anything that might cause a batch effect for post-hoc analysis
- include control samples in each batch
batches can be corrected for if randomized in study design
- if confounded with treatment or outcome of interest, nothing can help you
see Bioconductor sva package (Surrogate Variable Analysis)
I don’t recommend batch effect correction in the context of machine learning / prediction modeling
- but only believe independent datasets for estimating accuracy

Links

A built html version of this lecture is available.
The source R Markdown is also available from Github.