Distance and Dimension Reduction

Introduction

The concept of distance is quite intuitive. For example, when we cluster animals into subgroups, we are implicitly defining a distance that permits us to say what animals are “close” to each other.

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. Many clustering and machine learning techniques rely on being able to define distance, using features or predictors. For example, to create heatmaps, which are widely used in genomics and other high-throughput fields, a distance is computed explicitly.

In these plots the measurements, which are stored in a matrix, are represented with colors after the columns and rows have been clustered. (A side note: red and green, a common color theme for heatmaps, are two of the most difficult colors for many color-blind people to discern.) Here we will learn the necessary mathematics and computing skills to understand and create heatmaps. We start by reviewing the mathematical definition of distance.

Euclidean Distance

As a review, let’s define the distance between two points, \(A\) and \(B\), on a Cartesian plane.

The euclidean distance between \(A\) and \(B\) is simply:

\[\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\]

Distance in High Dimensions

We introduce a dataset with gene expression measurements for 22,215 genes from 189 samples. The R objects can be downloaded like this:

library(devtools)
install_github("genomicsclass/tissuesGeneExpression")

The data represent RNA expression levels for eight tissues, each with several individuals.

library(devtools)

## 
## Attaching package: 'devtools'

## The following object is masked from 'package:rafalib':
## 
##     install_bioc

install_github("genomicsclass/tissuesGeneExpression")

## Downloading GitHub repo genomicsclass/tissuesGeneExpression@master
## from URL https://api.github.com/repos/genomicsclass/tissuesGeneExpression/zipball/master

## Installing tissuesGeneExpression

## "C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file --no-environ --no-save  \
##   --no-restore --quiet CMD INSTALL "C:/Users/Mete  \
##   Civelek/AppData/Local/Temp/RtmpCYTVK1/devtools3f94526a7f84/genomicsclass-tissuesGeneExpression-a43cf4b"  \
##   --library="C:/Users/Mete Civelek/Documents/R/win-library/3.4"  \
##   --install-tests

##

library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##e contains the expression data

## [1] 22215   189

table(tissue) ##tissue[i] tells us what tissue is represented by e[,i]

## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

We are interested in describing distance between samples in the context of this dataset. We might also be interested in finding genes that behave similarly across samples.

To define distance, we need to know what the points are since mathematical distance is computed between points. With high dimensional data, points are no longer on the Cartesian plane. Instead they are in higher dimensions. For example, sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\). Feature \(g\) is defined by a point in 189 dimensions \((Y_{g,1},\dots,Y_{g,189})^\top\)

Once we define points, the Euclidean distance is defined in a very similar way as it is defined for two dimensions. For instance, the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Note: In practice, distances between features are typically applied after standardizing the data for each feature. This is equivalent to computing one minus the correlation. This is done because the differences in overall levels between features are often not due to biological effects but technical ones. More details on this topic can be found in this presentation.

Distance with matrix algebra

The distance between samples \(i\) and \(j\) can be written as

\[ \mbox{dist}(i,j) = (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j)\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\). This result can be very convenient in practice as computations can be made much faster using matrix multiplication.

Examples

We can now use the formulas above to compute distance. Let’s compute distance between samples 1 and 2, both kidneys, and then to sample 87, a colon.

x <- e[,1]
y <- e[,2]
z <- e[,87]
sqrt(sum((x-y)^2))

## [1] 85.8546

sqrt(sum((x-z)^2))

## [1] 122.8919

As expected, the kidneys are closer to each other. A faster way to compute this is using matrix algebra:

sqrt( crossprod(x-y) )

##         [,1]
## [1,] 85.8546

sqrt( crossprod(x-z) )

##          [,1]
## [1,] 122.8919

Now to compute all the distances at once, we have the function dist. Because it computes the distance between each row, and here we are interested in the distance between samples, we transpose the matrix

d <- dist(t(e))
class(d)

## [1] "dist"

Note that this produces an object of class dist and, to access the entries using row and column indices, we need to coerce it into a matrix:

as.matrix(d)[1,2]

## [1] 85.8546

as.matrix(d)[1,87]

## [1] 122.8919

It is important to remember that if we run dist on e, it will compute all pairwise distances between genes. This will try to create a \(22215 \times 22215\) matrix that may crash your R sessions.

Dimension Reduction Motivation

Visualizing data is one of the most, if not the most, important step in the analysis of high-throughput data. The right visualization method may reveal problems with the experimental data that can render the results from a standard analysis, although typically appropriate, completely useless.

We have shown methods for visualizing global properties of the columns or rows, but plots that reveal relationships between columns or between rows are more complicated due to the high dimensionality of data. For example, to compare each of the 189 samples to each other, we would have to create, for example, 17,766 MA-plots. Creating one single scatterplot of the data is impossible since points are very high dimensional.

We will describe powerful techniques for exploratory data analysis based on dimension reduction. The general idea is to reduce the dataset to have fewer dimensions, yet approximately preserve important properties, such as the distance between samples. If we are able to reduce down to, say, two dimensions, we can then easily make plots. The technique behind it all, the singular value decomposition (SVD), is also useful in other contexts. Before introducing the rather complicated mathematics behind the SVD, we will motivate the ideas behind it with a simple example.

Example: Reducing two dimensions to one

We consider an example with twin heights. Here we simulate 100 two dimensional points that represent the number of standard deviations each individual is from the mean height. Each point is a pair of twins:

library(rafalib)
library(MASS)

set.seed(1)
n <- 100
y=t(mvrnorm(n,c(0,0), matrix(c(1,0.95,0.95,1),2,2)))

mypar()
plot(y[1,], y[2,], xlab="Twin 1 (standardized height)", 
     ylab="Twin 2 (standardized height)", xlim=c(-3,3), ylim=c(-3,3))
points(y[1,1:2], y[2,1:2], col=2, pch=16)

Simulated twin pair heights.

To help with the illustration, think of this as high-throughput gene expression data with the twin pairs representing the \(N\) samples and the two heights representing gene expression from two genes.

We are interested in the distance between any two samples. We can compute this using dist. For example, here is the distance between the two orange points in the figure above:

d=dist(t(y))
as.matrix(d)[1,2]

## [1] 1.140897

What if making two dimensional plots was too complex and we were only able to make 1 dimensional plots. Can we, for example, reduce the data to a one dimensional matrix that preserves distances between points?

If we look back at the plot, and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to go along the direction of the diagonal. We can “rotate” the plot so that the diagonal is in the x-axis:

z1 = (y[1,]+y[2,])/2 #the sum 
z2 = (y[1,]-y[2,])   #the difference

z = rbind( z1, z2) #matrix now same dimensions as y

thelim <- c(-3,3)
mypar(1,2)

plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",
     ylab="Twin 2 (standardized height)",
     xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)

plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)

Twin height scatterplot (left) and MA-plot (right).

Later, we will start using linear algebra to represent transformation of the data such as this. Here we can see that to get z we multiplied y by the matrix:

\[ A = \, \begin{pmatrix} 1/2&1/2\\ 1&-1\\ \end{pmatrix} \implies z = A y \]

Remember that we can transform back by simply multiplying by \(A^{-1}\) as follows:

\[ A^{-1} = \, \begin{pmatrix} 1&1/2\\ 1&-1/2\\ \end{pmatrix} \implies y = A^{-1} z \]

Rotations

In the plot above, the distance between the two orange points remains roughly the same, relative to the distance between other points. This is true for all pairs of points. A simple re-scaling of the transformation we performed above will actually make the distances exactly the same. What we will do is multiply by a scalar so that the standard deviations of each point is preserved. If you think of the columns of y as independent random variables with standard deviation \(\sigma\), then note that the standard deviations of \(M\) and \(A\) in the MA plot are:

\[ \mbox{sd}[ Z_1 ] = \mbox{sd}[ (Y_1 + Y_2) / 2 ] = \frac{1}{\sqrt{2}} \sigma \mbox{ and } \mbox{sd}[ Z_2] = \mbox{sd}[ Y_1 - Y_2 ] = {\sqrt{2}} \sigma \]

This implies that if we change the transformation above to:

\[ A = \frac{1}{\sqrt{2}} \begin{pmatrix} 1&1\\ 1&-1\\ \end{pmatrix} \]

then the SD of the columns of \(Y\) are the same as the variance of the columns \(Z\). Also, notice that \(A^{-1}A=I\). We call matrices with these properties orthogonal and it guarantees the SD-preserving properties described above. The distances are now exactly preserved:

A <- 1/sqrt(2)*matrix(c(1,1,1,-1),2,2)
z <- A%*%y
d <- dist(t(y))
d2 <- dist(t(z))
mypar(1,1)
plot(as.numeric(d),as.numeric(d2)) #as.numeric turns distances into long vector
abline(0,1,col=2)

Distance computed from original data and after rotation is the same.

We call this particular transformation a rotation of y.

mypar(1,2)

thelim <- c(-3,3)
plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",
     ylab="Twin 2 (standardized height)",
     xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)

plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)

Twin height scatterplot (left) and after rotation (right).

The reason we applied this transformation in the first place was because we noticed that to compute the distances between points, we followed a direction along the diagonal in the original plot, which after the rotation falls on the horizontal, or the first dimension of z. So this rotation actually achieves what we originally wanted: we can preserve the distances between points with just one dimension. Let’s remove the second dimension of z and recompute distances:

d3 = dist(z[1,]) ##distance computed using just first dimension
mypar(1,1)
plot(as.numeric(d),as.numeric(d3)) 
abline(0,1)

Distance computed with just one dimension after rotation versus actual distance.

The distance computed with just the one dimension provides a very good approximation to the actual distance and a very useful dimension reduction: from 2 dimensions to 1. This first dimension of the transformed data is actually the first principal component. This idea motivates the use of principal component analysis (PCA) and the singular value decomposition (SVD) to achieve dimension reduction more generally.

Important note on a difference to other explanations

If you search the web for descriptions of PCA, you will notice a difference in notation to how we describe it here. This mainly stems from the fact that it is more common to have rows represent units. Hence, in the example shown here, \(Y\) would be transposed to be an \(N \times 2\) matrix. In statistics this is also the most common way to represent the data: individuals in the rows. However, for practical reasons, in genomics it is more common to represent units in the columns. For example, genes are rows and samples are columns. For this reason, in this book we explain PCA and all the math that goes with it in a slightly different way than it is usually done. As a result, many of the explanations you find for PCA start out with the sample covariance matrix usually denoted with \(\mathbf{X}^\top\mathbf{X}\) and having cells representing covariance between two units. Yet for this to be the case, we need the rows of \(\mathbf{X}\) to represents units. So in our notation above, you would have to compute, after scaling, \(\mathbf{Y}\mathbf{Y}^\top\) instead.

Basically, if you want our explanations to match others you have to transpose the matrices we show here.

Example of a 3D object projected onto 2D surface

Dot Products and Projections

Singular Value Decomposition

In the previous section, we motivated dimension reduction and showed a transformation that permitted us to approximate the distance between two dimensional points with just one dimension. The singular value decomposition (SVD) is a generalization of the algorithm we used in the motivational section. As in the example, the SVD provides a transformation of the original data. This transformation has some very useful properties.

The main result SVD provides is that we can write an \(m \times n\), matrix \(\mathbf{Y}\) as

\[\mathbf{U}^\top\mathbf{Y} = \mathbf{DV}^\top\]

With:

\(\mathbf{U}\) is an \(m \times p\) orthogonal matrix
\(\mathbf{V}\) is an \(p \times p\) orthogonal matrix
\(\mathbf{D}\) is an \(n \times p\) diagonal matrix

with \(p=\mbox{min}(m,n)\). \(\mathbf{U}^\top\) provides a rotation of our data \(\mathbf{Y}\) that turns out to be very useful because the variability (sum of squares to be precise) of the columns of \(\mathbf{U}^\top \mathbf{Y}=\mathbf{VD}\) are decreasing. Because \(\mathbf{U}\) is orthogonal, we can write the SVD like this:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

In fact, this formula is much more commonly used. We can also write the transformation like this:

\[\mathbf{YV} = \mathbf{UD}\]

This transformation of \(Y\) also results in a matrix with column of decreasing sum of squares.

Applying the SVD to the motivating example we have:

library(rafalib)
library(MASS)
n <- 100
y <- t(mvrnorm(n,c(0,0), matrix(c(1,0.95,0.95,1),2,2)))
s <- svd(y)

We can immediately see that applying the SVD results in a transformation very similar to the one we used in the motivating example:

round(sqrt(2) * s$u , 3)

##        [,1]   [,2]
## [1,] -0.982 -1.017
## [2,] -1.017  0.982

The plot we showed after the rotation was showing what we call the principal components: the second plotted against the first. To obtain the principal components from the SVD, we simply need the columns of the rotation \(\mathbf{U}^\top\mathbf{Y}\) :

PC1 = s$d[1]*s$v[,1]
PC2 = s$d[2]*s$v[,2]
plot(PC1,PC2,xlim=c(-3,3),ylim=c(-3,3))

Second PC plotted against first PC for the twins height data.

How is this useful?

It is not immediately obvious how incredibly useful the SVD can be, so let’s consider some examples. In this example, we will greatly reduce the dimension of \(V\) and still be able to reconstruct \(Y\).

Let’s compute the SVD on the gene expression table we have been working with. We will take a subset of 100 genes so that computations are faster.

library(tissuesGeneExpression)
data(tissuesGeneExpression)
set.seed(1)
ind <- sample(nrow(e),500) 
Y <- t(apply(e[ind,],1,scale)) #standardize data for illustration

The svd command returns the three matrices (only the diagonal entries are returned for \(D\))

s <- svd(Y)
U <- s$u
V <- s$v
D <- diag(s$d) ##turn it into a matrix

First note that we can in fact reconstruct y:

Yhat <- U %*% D %*% t(V)
resid <- Y - Yhat
max(abs(resid))

## [1] 3.552714e-14

If we look at the sum of squares of \(\mathbf{UD}\), we see that the last few are quite close to 0 (perhaps we have some replicated columns).

plot(s$d)

Entries of the diagonal of D for gene expression data.

This implies that the last columns of V have a very small effect on the reconstruction of Y. To see this, consider the extreme example in which the last entry of \(V\) is 0. In this case the last column of \(V\) is not needed at all. Because of the way the SVD is created, the columns of \(V\) have less and less influence on the reconstruction of \(Y\). You commonly see this described as “explaining less variance”. This implies that for a large matrix, by the time you get to the last columns, it is possible that there is not much left to “explain” As an example, we will look at what happens if we remove the four last columns:

k <- ncol(U)-4
Yhat <- U[,1:k] %*% D[1:k,1:k] %*% t(V[,1:k])
resid <- Y - Yhat 
max(abs(resid))

## [1] 3.552714e-14

The largest residual is practically 0, meaning that we Yhat is practically the same as Y, yet we need 4 fewer dimensions to transmit the information.

By looking at \(d\), we can see that, in this particular dataset, we can obtain a good approximation keeping only 94 columns. The following plots are useful for seeing how much of the variability is explained by each column:

plot(s$d^2/sum(s$d^2)*100,ylab="Percent variability explained")

Percent variance explained by each principal component of gene expression data.

We can also make a cumulative plot:

plot(cumsum(s$d^2)/sum(s$d^2)*100,ylab="Percent variability explained",ylim=c(0,100),type="l")

Cumulative variance explained by principal components of gene expression data.

Although we start with 189 dimensions, we can approximate \(Y\) with just 95:

k <- 95 ##out a possible 189
Yhat <- U[,1:k] %*% D[1:k,1:k] %*% t(V[,1:k])
resid <- Y - Yhat
boxplot(resid,ylim=quantile(Y,c(0.01,0.99)),range=0)

Residuals from comparing a reconstructed gene expression table using 95 PCs to the original data with 189 dimensions.

Therefore, by using only half as many dimensions, we retain most of the variability in our data:

var(as.vector(resid))/var(as.vector(Y))

## [1] 0.04076899

We say that we explain 96% of the variability.

Note that we can compute this proportion from \(D\):

1-sum(s$d[1:k]^2)/sum(s$d^2)

## [1] 0.04076899

The entries of \(D\) therefore tell us how much each PC contributes in term of variability explained.

Highly correlated data

To help understand how the SVD works, we construct a dataset with two highly correlated columns.

For example:

m <- 100
n <- 2
x <- rnorm(m)
e <- rnorm(n*m,0,0.01)
Y <- cbind(x,x)+e
cor(Y)

##           x         x
## x 1.0000000 0.9998873
## x 0.9998873 1.0000000

In this case, the second column adds very little “information” since all the entries of Y[,1]-Y[,2] are close to 0. Reporting rowMeans(Y) is even more efficient since Y[,1]-rowMeans(Y) and Y[,2]-rowMeans(Y) are even closer to 0. rowMeans(Y) turns out to be the information represented in the first column on \(U\). The SVD helps us notice that we explain almost all the variability with just this first column:

d <- svd(Y)$d
d[1]^2/sum(d^2)

## [1] 0.9999441

In cases with many correlated columns, we can achieve great dimension reduction:

m <- 100
n <- 25
x <- rnorm(m)
e <- rnorm(n*m,0,0.01)
Y <- replicate(n,x)+e
d <- svd(Y)$d
d[1]^2/sum(d^2)

## [1] 0.9999047

Still confused? Here’s a great geometric explanation

Distance lecture adapted from Love & Irizarry (2015)