Applied Statistics for High-throughput Biology Day 4
Levi Waldron
Day 4 outline
Book chapter 8:
- Distances in high dimensions
- Principal Components Analysis and Singular Value Decomposition
- Multidimensional Scaling
- Batch Effects (Chapter 10)
Metrics and distances
A metric satisfies the following five properties:
1. non-negativity \(d(a, b) \ge 0\)
2. symmetry \(d(a, b) = d(b, a)\)
3. identity \(d(a, a) = 0\)
4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
- A distance is only required to satisfy properties 1-3.
- A similarity function satisfies properties 1-2, and increases as \(a\) and \(b\) become more similar
- A dissimilarity function satisfies properties 1-2, and decreases as \(a\) and \(b\) become more similar
Euclidean distance (metric)
- Remember grade school:
Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).
- Side note: also referred to as \(L_2\) norm
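As a quick sketch (using two made-up points A and B, not from the book), the grade-school formula agrees with R's dist():
A <- c(2, 5) # hypothetical point (x, y)
B <- c(6, 2) # hypothetical point (x, y)
sqrt((A[1] - B[1]) ^ 2 + (A[2] - B[2]) ^ 2) # by hand: 5
dist(rbind(A, B)) # same value using dist()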
Euclidean distance in high dimensions
## BiocManager::install("genomicsclass/tissuesGeneExpression") #if needed
## BiocManager::install("genomicsclass/GSE5859") #if needed
library(GSE5859)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215 189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
Interested in identifying similar samples and similar genes
Notes about Euclidean distance in high dimensions
- Points are no longer on the Cartesian plane
- instead they are in higher dimensions. For example:
- sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
- feature \(g\) is defined by a point in 189-dimensional space: \((Y_{g,1},\dots,Y_{g,189})^\top\)
The Euclidean distance is computed just as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]
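For example, a quick sketch of the feature-wise version, using the first two rows of e (an arbitrary choice, for illustration only):
gene1 <- e[1, ]
gene2 <- e[2, ]
sqrt(sum((gene1 - gene2) ^ 2)) # Euclidean distance between two features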
Euclidean distance in matrix algebra notation
The Euclidean distance between samples \(i\) and \(j\) can be written as:
\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]
with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) denoting columns \(i\) and \(j\) of the data matrix \(\mathbf{Y}\).
t(matrix(1:3, ncol = 1))
## [,1] [,2] [,3]
## [1,] 1 2 3
matrix(1:3, ncol = 1)
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)
## [,1]
## [1,] 14
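A sketch checking that the matrix algebra form gives the same distance as the element-wise formula, using columns 1 and 2 of e:
yi <- e[, 1]
yj <- e[, 2]
sqrt(crossprod(yi - yj)) # matrix form: (Yi - Yj)^T (Yi - Yj)
sqrt(sum((yi - yj) ^ 2)) # element-wise form, same value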
Note about matrix algebra in R
- R is very efficient at matrix algebra
- for very large matrices, see:
3 sample example
kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1) ^ 2))
## [1] 122.8919
3 sample example using dist()
dim(e)
## [1] 22215 189
(d <- dist(t(e[, c(1, 2, 87)])))
## GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz 85.8546
## GSM92240.CEL.gz 122.8919 115.4773
class(d)
## [1] "dist"
The dist() function
Excerpt from ?dist:
dist(
x,
method = "euclidean",
diag = FALSE,
upper = FALSE,
p = 2
)
- method: the distance measure to be used.
- This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Any unambiguous substring can be given.
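For example, a sketch comparing two choices of method on the same three samples used above:
dist(t(e[, c(1, 2, 87)])) # Euclidean (the default)
dist(t(e[, c(1, 2, 87)]), method = "manhattan") # Manhattan (L1) distance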
- The dist class output from dist() is used by many clustering algorithms and heatmap functions
- Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.
Note on standardization
- In practice, features (e.g. genes) are typically “standardized”, i.e. converted to z-score:
\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]
- This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
- GC bias, PCR amplification efficiency, …
- For standardized (z-scored) features, Euclidean distance and \(1-r\) (Pearson correlation) are related:
- \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)
- \(m\) = # dimensions
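A small numeric sketch of this relationship (note that scale() standardizes using the sample standard deviation, so the exact identity divides by m - 1 rather than m):
set.seed(2)
m <- 1000
x <- as.numeric(scale(rnorm(m))) # z-scored vector
y <- as.numeric(scale(rnorm(m))) # z-scored vector
sum((x - y) ^ 2) / (2 * (m - 1)) # rescaled squared Euclidean distance
1 - cor(x, y) # matches the value above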
Dimension reduction and PCA
- Motivation for dimension reduction
Simulate the heights of twin pairs:
set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1] 2 100
cor(t(y))
## [,1] [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000
Visualizing twin pairs data

Not much distance is lost in the second dimension

- Not much information about height differences is lost when using just the average height of each twin pair
- because twin heights are highly correlated
Singular Value Decomposition (SVD)
SVD generalizes the example rotation we looked at:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
note: the above formulation is for \(m\) rows \(> n\) columns
\(\mathbf{Y}\): the \(m\) rows x \(n\) cols matrix of measurements
\(\mathbf{U}\): \(m \times n\) matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): \(n \times n\) diagonal matrix (eigenvalues)
\(\mathbf{V}\): \(n \times n\) orthogonal matrix (eigenvectors or PCA scores)
- orthogonal = columns are unit length and mutually “perpendicular”
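A quick sketch on a small random matrix, checking that the factors returned by svd() reconstruct the original matrix:
set.seed(1)
Y <- matrix(rnorm(20), nrow = 5, ncol = 4) # small example with m > n
s0 <- svd(Y)
all.equal(Y, s0$u %*% diag(s0$d) %*% t(s0$v)) # TRUE: Y = U D V^T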
SVD of gene expression dataset
Scaling:
system.time(e.standardize.slow <-
t(apply(e, 1, function(x)
x - mean(x))))
## user system elapsed
## 0.880 0.092 0.974
system.time(e.standardize.fast <- t(scale(t(e), scale = FALSE)))
## user system elapsed
## 0.164 0.060 0.225
all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
## [1] TRUE
SVD:
s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"
Components of SVD results
dim(s$u) # loadings
## [1] 22215 189
length(s$d) # eigenvalues
## [1] 189
dim(s$v) # d %*% vT = scores
## [1] 189 189
PCA is a SVD
rafalib::mypar()
p <- prcomp(t(e.standardize.fast))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: corresponding columns of u and v can both be multiplied by -1, so the signs of loadings and scores are arbitrary
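The singular values and scores also match the prcomp() output, up to this sign ambiguity; a sketch of checks one could run on the objects created above:
all.equal(p$sdev, s$d / sqrt(ncol(e) - 1)) # sdev = singular values / sqrt(n - 1)
plot(abs(p$x[, 1]), abs(s$v[, 1] * s$d[1])) # sample scores agree up to sign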
PCA interpretation: loadings
- \(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
- think of principal component axes as a weighted combination of original axes
Visualizing PCA loadings
plot(p$rotation[, 1],
xlab = "Index of genes",
ylab = "Loadings of PC1",
main = "PC1 loadings of each gene") #or, predict(p)
abline(h = c(-0.03, 0.03), col = "red")

Genes with high PC1 loadings
e.pc1genes <-
e.standardize.fast[p$rotation[, 1] < -0.03 |
p$rotation[, 1] > 0.03,]
pheatmap::pheatmap(
e.pc1genes,
scale = "none",
show_rownames = TRUE,
show_colnames = FALSE
)

PCA interpretation: eigenvalues
- \(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.
rafalib::mypar()
plot(
p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
xlim = c(0, 150),
type = "b",
ylab = "% variance explained",
main = "Screeplot"
)

PCA interpretation: eigenvalues
Alternatively, plot the cumulative % variance explained (using the cumsum() function):
rafalib::mypar()
plot(
cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
ylab = "cumulative % variance explained",
ylim = c(0, 100),
type = "b",
main = "Cumulative screeplot"
)

PCA interpretation: scores
- \(\mathbf{V}\) (scores): the “datapoints” in the reduced principal component space
- In some implementations (like prcomp()), scores are already scaled by the eigenvalues: \(\mathbf{D V}^\top\)
PCA interpretation: scores

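A sketch of how a scores plot like this can be drawn, coloring samples by the tissue vector loaded earlier:
plot(p$x[, 1], p$x[, 2],
  col = as.integer(factor(tissue)),
  xlab = "PC1", ylab = "PC2",
  main = "PCA scores by tissue")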
Multi-dimensional Scaling (MDS)
- also referred to as Principal Coordinates Analysis (PCoA)
- a reduced SVD, performed on a distance matrix
- identify two (or more) eigenvalues/vectors that preserve distances
d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)
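A sketch of plotting the first two MDS dimensions, again colored by tissue:
plot(mds[, 1], mds[, 2],
  col = as.integer(factor(tissue)),
  xlab = "MDS dimension 1", ylab = "MDS dimension 2",
  main = "MDS of tissue gene expression")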

Summary: distances and dimension reduction
- Note: signs of eigenvalues (square to get variances) and eigenvectors (loadings) can be arbitrarily flipped
- PCA and MDS are useful for dimension reduction when you have correlated variables
- Variables are always centered.
- Variables are also scaled unless you know they have the same scale in the population
- PCA projection can be applied to new datasets if you know the matrix calculations (see the sketch after this list)
- PCA is subject to over-fitting; the screeplot can be assessed by cross-validation
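A minimal sketch of projecting new samples onto the existing PC axes; purely for illustration, two existing samples stand in for "new" data (a real application would use a matrix of new samples with the same genes, in the same order):
newdata <- t(e.standardize.fast[, 1:2]) # hypothetical "new" samples x genes
newscores <- predict(p, newdata = newdata) # centers with p$center, then multiplies by p$rotation
newscores[, 1:3]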
PCA and MDS Exercises
- Repeat the PCA after removing half of the features with lowest variance. Does it look any different?
- Repeat the MDS, using Spearman correlation instead of Pearson correlation. Does it look any different?
Batch Effects
- pervasive in genomics (e.g. Leek et al. Nat Rev Genet. 2010 Oct;11(10):733-9.)
- affect DNA and RNA sequencing, proteomics, imaging, microarray…
- have caused high-profile problems and retractions
- you can’t get rid of them
- but you can make sure they are not confounded with your experiment
Batch Effects - an example
- Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
- Title: Common genetic variants account for differences in gene expression among ethnic groups.
- “The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes.”
library(Biobase)
library(genefilter)
library(GSE5859) ## BiocManager::install("genomicsclass/GSE5859") if needed
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)
- Note: the ExpressionSet class is obsolete; we now use SummarizedExperiment
Date of processing as a proxy for batch
- Sample metadata included date of processing:
head(table(sampleInfo$date))
##
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27
## 2 2 2 5 4 2
year <- as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)
##
## year ASN CEU HAN
## 0 0 32 0
## 1 0 54 0
## 2 0 13 0
## 3 80 3 0
## 4 2 0 24
Visualizing batch effects by PCA
pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
pc$x[, 1] ~ month,
varwidth = TRUE,
notch = TRUE,
main = "PC1 scores vs. month",
xlab = "Month",
ylab = "PC1"
)

Visualizing batch effects by MDS
A starting point for a color palette:
RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)
Interpolate one color per month on a quantitative palette:
col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))
d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
main = "MDS shaded by month")

The batch effects impact clustering
hcclass <- cutree(hclust(d), k = 5)
table(hcclass, year)
## year
## hcclass 0 1 2 3 4
## 1 0 2 8 1 0
## 2 25 43 0 2 0
## 3 6 0 0 0 0
## 4 1 7 0 0 0
## 5 0 2 5 80 26
Batch Effects - summary
- tend to affect many or all measurements by a little bit
- during experimental design:
- keep track of anything that might cause a batch effect for post-hoc analysis
- include control samples in each batch
- batches can be corrected for if randomized in study design
- if confounded with treatment or outcome of interest, nothing can help you
- see the Bioconductor sva package (Surrogate Variable Analysis); a sketch follows this list
- I don’t recommend batch effect correction in the context of machine learning / prediction modeling
- but only believe accuracy estimates from independent datasets
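As a hedged sketch only (no adjustment can rescue a design where batch is confounded with the comparison of interest, as it largely is here), batch adjustment with ComBat from the sva package could look like:
## Illustrative only: in this dataset batch (year) is largely confounded with ethnicity
library(sva)
geneExpression.adj <- ComBat(dat = geneExpression, batch = factor(year))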
Links
- A built html version of this lecture is available.
- The source R Markdown is also available from Github.