Chapter 8 section 1

Oct 14, 2016

Outline

The importance of distances
Euclidian distance
some matrix algebra notation
Distance Exercises: p. 322-323

Distances in high-dimensional data analysis

The importance of distance

High-dimensional data are complex and impossible to visualize in raw form
Thousands of dimensions, we can only visualize 2-3
Distances can simplify thousands of dimensions

The importance of distance (cont'd)

Distances can help organize samples and genomic features

The importance of distance (cont'd)

Any clustering or classification of samples and/or genes involves combining or identifying objects that are close or similar.
Distances or similarities are mathematical representations of what we mean by close or similar.
The choice of distance is important and requires thought.
- choice is subject-matter specific

Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf

Metrics and distances

A metric satisfies the following five properties:

non-negativity \(d(a, b) \ge 0\)
symmetry \(d(a, b) = d(b, a)\)
identification mark \(d(a, a) = 0\)
definiteness \(d(a, b) = 0\) if and only if \(a=b\)
triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
- A distance is only required to satisfy 1-3.
- A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
- A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

Euclidian distance (metric)

Remember grade school: Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).
Side note: also referred to as \(L_2\) norm

Euclidian distance in high dimensions

##biocLite("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data

## [1] 22215   189

table(tissue) ##tissue[i] corresponds to e[,i]

## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

Euclidian distance in high dimensions

Points are no longer on the Cartesian plane,
instead they are in higher dimensions. For example:
- sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
- feature \(g\) is defined by a point in 189 dimensions \((Y_{g,189},\dots,Y_{g,189})^\top\)

Euclidian distance in high dimensions

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

Matrix algebra notation

t(matrix(1:3, ncol=1))

##      [,1] [,2] [,3]
## [1,]    1    2    3

matrix(1:3, ncol=1)

##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3

t(matrix(1:3, ncol=1)) %*% matrix(1:3, ncol=1)

##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2)^2))

## [1] 85.8546

sqrt(sum((kidney1 - colon1)^2))

## [1] 122.8919

3 sample example using dist()

dim(e)

## [1] 22215   189

(d <- dist(t(e[, c(1, 2, 87)])))

##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773

class(d)

## [1] "dist"

The dist() function

Excerpt from ?dist:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

method: the distance measure to be used.
- This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given.
dist class output from dist() is used for many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

Note on standardization

In practice, features (e.g. genes) are typically "standardized", i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
- GC bias, PCR amplification efficiency, …
Euclidian distance and \(1-r\) (Pearson cor) are related:
- \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)

Outline

Distances in high-dimensional data analysis

The importance of distance

The importance of distance (cont'd)

The importance of distance (cont'd)

Metrics and distances

Euclidian distance (metric)

Euclidian distance in high dimensions

Euclidian distance in high dimensions

Euclidian distance in high dimensions

Matrix algebra notation

Matrix algebra notation

3 sample example

3 sample example using dist()

The dist() function

Note on standardization

Links