- Distances in high dimensions
- Principal Components Analysis
- Multidimensional Scaling
Book chapter 7
March 10, 2016
Book chapter 7
Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf
A metric satisfies the following five properties:
##biocLite("genomicsclass/tissuesGeneExpression") library(tissuesGeneExpression) data(tissuesGeneExpression) dim(e) ##gene expression data
## [1] 22215 189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue ## cerebellum colon endometrium hippocampus kidney liver ## 38 34 15 31 39 26 ## placenta ## 6
Interested in identifying similar samples and similar genes
Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]
The distance between samples \(i\) and \(j\) can be written as:
\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]
with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).
t(matrix(1:3, ncol=1))
## [,1] [,2] [,3] ## [1,] 1 2 3
matrix(1:3, ncol=1)
## [,1] ## [1,] 1 ## [2,] 2 ## [3,] 3
t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)
## [,1] ## [1,] 14
Note: R is very efficient at matrix algebra
kidney1 <- e[, 1] kidney2 <- e[, 2] colon1 <- e[, 87] sqrt(sum((kidney1 - kidney2)^2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1)^2))
## [1] 122.8919
dim(e)
## [1] 22215 189
(d <- dist(t(e[, c(1, 2, 87)])))
## GSM11805.CEL.gz GSM11814.CEL.gz ## GSM11814.CEL.gz 85.8546 ## GSM92240.CEL.gz 122.8919 115.4773
class(d)
## [1] "dist"
Excerpt from ?dist:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
dist
class output from dist()
is used for many clustering algorithms and heatmap functionsCaution: dist(e)
creates a 22215 x 22215 matrix that will probably crash your R session.
\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]
Distance Exercises: p. 322-323
Simulate the heights of twin pairs:
library(MASS) set.seed(1) n <- 100 y <- t(MASS::mvrnorm(n, c(0,0), matrix(c(1, 0.95, 0.95, 1), 2, 2))) dim(y)
## [1] 2 100
cor(t(y))
## [,1] [,2] ## [1,] 1.0000000 0.9433295 ## [2,] 0.9433295 1.0000000
SVD generalizes the example rotation we looked at:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
system.time(e.standardize.slow <- t(apply(e, 1, function(x) x - mean(x) )))
## user system elapsed ## 1.939 0.137 2.118
system.time(e.standardize.fast <- t(scale(t(e), scale=FALSE)))
## user system elapsed ## 0.169 0.015 0.185
## all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1]) s <- svd(e.standardize.fast) names(s)
## [1] "d" "u" "v"
dim(s$u) # loadings
## [1] 22215 189
length(s$d) # eigenvalues
## [1] 189
dim(s$v) # d %*% vT = scores
## [1] 189 189
p <- princomp(e.standardize.fast) #cor=TRUE to scale rowss
plot(p$scores[, 1], xlab="Index of genes", ylab="Scores of PC1", main="Scores of PC1") #or, predict(p) abline(h=c(-35, 25), col="red")
Genes with high PC1 scores:
e.pc1genes <- e[which(p$scores[, 1] < -35 | p$scores[, 1] > 25), ] pheatmap::pheatmap(e.pc1genes, scale="none", show_rownames=FALSE, show_colnames = FALSE)
plot(p$sdev^2 / sum(p$sdev^2)*100, ylab="% variance explained")
Alternatively as cumulative % variance explained (using cumsum()
function):
cor=TRUE
in the princomp()
functiond <- as.dist(1 - cor(e.standardize.fast)) mds <- cmdscale(d)