1 Day 4 outline

Book Chapter 2:

Basic data analysis

Book chapter 7:

Distances in high dimensions
Principal Components Analysis
Multidimensional Scaling
Batch Effects

2 Exploratory data analysis

2.1 Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

Discover biases, systematic errors and unexpected variability in data
Graphical approach to detecting these issues
Represents a first step in data analysis and guides hypothesis testing
Opportunities for discovery in the outliers

2.2 Quantile Quantile Plots

Quantiles divide a distribution into equally sized bins
Division into 100 bins gives percentiles
Quantiles of a theoretical distribution are plotted against an experimental distribution
- alternatively, quantiles of two experimental distributions
Given a perfect fit, \(x=y\)
Useful in determining data distribution (normal, t, etc.)

2.3 Quantile Quantile Plots

suppressPackageStartupMessages(library(UsingR))
suppressPackageStartupMessages(library(rafalib))
# height qq plot
x <- father.son$fheight
ps <- (seq(0, 99) + 0.5) / 100
qs <- quantile(x, ps)
normalqs <- qnorm(ps, mean(x), popsd(x))
par(mfrow = c(1, 3))
plot(normalqs,
     qs,
     xlab = "Normal Percentiles",
     ylab = "Height Percentiles",
     main = "Height Q-Q Plot")
abline(0, 1)
# t-distributed for 12 df
x <- rt(1000, 12)
qqnorm(x,
       xlab = "t quantiles",
       main = "T Quantiles (df=12) Q-Q Plot",
       ylim = c(-6, 6))
qqline(x)
# t-distributed for 3 df
x <- rt(1000, 3)
qqnorm(x,
       xlab = "t quantiles",
       main = "T Quantiles (df=3) Q-Q Plot",
       ylim = c(-6, 6))
qqline(x)

2.4 Boxplots

Provide a graph that is easy to interpret where data may not be normally distributed
Qualitative + quantitative
Particularly informative in relation to outliers and range
Possible to compare multiple distributions side by side

2.5 Boxplots

par(mfrow = c(1, 3))
hist(exec.pay, main = "CEO Compensation")
qqnorm(exec.pay, main = "CEO Compensation")
boxplot(exec.pay,
        ylab = "10,000s of dollars",
        ylim = c(0, 400),
        main = "CEO Compensation")

Three different views of a continuous variable

2.6 Scatterplots And Correlation

For two continuous variables, scatter plot and calculation of correlation is useful
Provides a graphical and numeric estimation of relationships
Quick and easy with plot() and cor()

2.7 Scatterplots And Correlation

par(mfrow = c(1, 3))
plot(
    father.son$fheight,
    father.son$sheight,
    xlab = "Father's height in inches",
    ylab = "Son's height in inches",
    main = paste("correlation =", signif(
        cor(father.son$fheight, father.son$sheight), 2
    ))
)
plot(
    cars$speed,
    cars$dist,
    xlab = "Speed",
    ylab = "Stopping Distance",
    main = paste("correlation =", signif(cor(
        cars$speed, cars$dist
    ), 2))
)
plot(
    faithful$eruptions,
    faithful$waiting,
    xlab = "Eruption Duration",
    ylab = "Waiting Time",
    main = paste("correlation =", signif(
        cor(faithful$eruptions, faithful$waiting), 2
    ))
)

2.8 Exploratory data analysis in high dimensions

#install from Github
BiocManager::install("genomicsclass/GSE5859Subset")  
BiocManager::install("genomicsclass/tissuesGeneExpression")

library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))

## [1] "matrix"     "data.frame"

rbind(dim(geneExpression), dim(sampleInfo))

##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4

head(sampleInfo)

##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

2.9 Volcano plots

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))

##    user  system elapsed 
##   0.007   0.001   0.009

pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

2.10 Volcano plots

par(mar = c(4, 4, 0, 0))
plot(results$dm,
     -log10(results$p.value),
     xlab = "Effect size (difference in group means)",
     ylab = "- log (base 10) p-values")
abline(h = -log10(0.05 / nrow(geneExpression)), col = "red")
legend("bottomright",
       lty = 1,
       col = "red",
       legend = "Bonferroni = 0.05")

2.11 Volcano plots (summary)

Many small p-values with small effect size indicate low within-group variability
Inspect for asymmetry
Can color points by significance threshold

2.12 P-value histograms

If all null hypotheses are true, expect a flat histogram of p-values:

m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1)
randomData <- matrix(rnorm(n * m), m, n)
nullpvals <- rowttests(randomData, g)$p.value

par(mfrow = c(1, 2))
hist(pvals, ylim = c(0, 1400))
hist(nullpvals, ylim = c(0, 1400))

2.13 P-value histograms

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

2.14 P-value histograms (summary)

Give a quick look at how many significant p-values there may be
When using permuted labels, can expose non-independence among the samples
- can be due to batch effects or family structure
Most common approaches for correcting batch effects are:
- including batch in your model formula
- ComBat: corrects for known batch effects by linear model), and
- sva: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
- correction using control (housekeeping) genes

All are available from the sva Bioconductor package

2.15 MA plot

just a scatterplot rotated 45\(^o\)

rafalib::mypar(1, 2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

2.16 MA plot (summary)

useful for quality control of high-dimensional data
plot all data values for a sample against another sample or a median “pseudosample”
affyPLM::MAplots better MA plots
- adds a smoothing line to highlight departures from horizontal line
- plots a “cloud” rather than many data points

2.17 Heatmaps

Detailed representation of high-dimensional dataset

ge = geneExpression[sample(1:nrow(geneExpression), 200),]
pheatmap::pheatmap(
    ge,
    scale = "row",
    show_colnames = FALSE,
    show_rownames = FALSE
)

2.18 Heatmaps

Clustering becomes slow and memory-intensivefor thousands of rows
- probably too detailed for thousands of rows
can show co-expressed genes, groups of samples

2.19 Colors

Types of color pallettes:
- sequential: shows a gradient
- diverging: goes in two directions from a center
- qualitative: for categorical variables
Keep color blindness in mind (10% of all men)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

rafalib::mypar(1, 1)
RColorBrewer::display.brewer.all(n = 7)

2.20 Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

Always avoid pie charts
Avoid doughnut charts too
Avoid pseudo 3D and most Excel defaults
Effective graphs use color judiciously

3 Distances in high-dimensional data analysis

3.1 The importance of distance

High-dimensional data are complex and impossible to visualize in raw form
Thousands of dimensions, we can only visualize 2-3
Distances can simplify thousands of dimensions

Distances can help organize samples and genomic features

library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
set.seed(1)
ge = geneExpression[sample(1:nrow(geneExpression), 200),]
pheatmap::pheatmap(
    ge,
    scale = "row",
    show_colnames = FALSE,
    show_rownames = FALSE
)

Any clustering or classification of samples and/or genes involves combining or identifying objects that are close or similar.
Distances or similarities are mathematical representations of what we mean by close or similar.
The choice of distance is important and requires thought.
- choice is subject-matter specific

Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf

3.2 Metrics and distances

A metric satisfies the following five properties:

non-negativity \(d(a, b) \ge 0\)
symmetry \(d(a, b) = d(b, a)\)
identification mark \(d(a, a) = 0\)
definiteness \(d(a, b) = 0\) if and only if \(a=b\)
triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)

A distance is only required to satisfy 1-3.
A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

3.3 Euclidian distance (metric)

Remember grade school:

rafalib::mypar()
plot(
    c(0, 1, 1),
    c(0, 0, 1),
    pch = 16,
    cex = 2,
    xaxt = "n",
    yaxt = "n",
    xlab = "",
    ylab = "",
    bty = "n",
    xlim = c(-0.25, 1.25),
    ylim = c(-0.25, 1.25)
)
lines(c(0, 1, 1, 0), c(0, 0, 1, 0))
text(0, .2, expression(paste('(A'[x] * ',A'[y] * ')')), cex = 1.5)
text(1, 1.2, expression(paste('(B'[x] * ',B'[y] * ')')), cex = 1.5)
text(-0.1, 0, "A", cex = 2)
text(1.1, 1, "B", cex = 2)

Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).

Side note: also referred to as \(L_2\) norm

3.4 Euclidian distance in high dimensions

##BiocManager::install("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data

## [1] 22215   189

table(tissue) ##tissue[i] corresponds to e[,i]

## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

3.4.1 Notes:

Points are no longer on the Cartesian plane,
instead they are in higher dimensions. For example:
- sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
- feature \(g\) is defined by a point in 189 dimensions \((Y_{g,189},\dots,Y_{g,189})^\top\)

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

3.5 Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

t(matrix(1:3, ncol = 1))

##      [,1] [,2] [,3]
## [1,]    1    2    3

matrix(1:3, ncol = 1)

##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3

t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)

##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3.6 3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))

## [1] 85.8546

sqrt(sum((kidney1 - colon1) ^ 2))

## [1] 122.8919

3.7 3 sample example using dist()

dim(e)

## [1] 22215   189

(d <- dist(t(e[, c(1, 2, 87)])))

##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773

class(d)

## [1] "dist"

3.8 The dist() function

Excerpt from ?dist:

dist(
    x,
    method = "euclidean",
    diag = FALSE,
    upper = FALSE,
    p = 2
)

method: the distance measure to be used.
- This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Any unambiguous substring can be given.
dist class output from dist() is used for many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

3.9 Note on standardization

In practice, features (e.g. genes) are typically “standardized”, i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
- GC bias, PCR amplification efficiency, …
Euclidian distance and \(1-r\) (Pearson cor) are related:
- \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)

4 Dimension reduction and PCA

4.1 Motivation for dimension reduction

Simulate the heights of twin pairs:

set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)

## [1]   2 100

cor(t(y))

##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

4.1.1 Visualizing twin pairs data

z1 = (y[1,] + y[2,]) / 2 #the sum
z2 = (y[1,] - y[2,])   #the difference

z = rbind(z1, z2) #matrix now same dimensions as y

thelim <- c(-3, 3)
rafalib::mypar(1, 2)

plot(
    y[1,],
    y[2,],
    xlab = "Twin 1 (standardized height)",
    ylab = "Twin 2 (standardized height)",
    xlim = thelim,
    ylim = thelim,
    main = "Original twin heights"
)
points(y[1, 1:2], y[2, 1:2], col = 2, pch = 16)

plot(
    z[1,],
    z[2,],
    xlim = thelim,
    ylim = thelim,
    xlab = "Average height",
    ylab = "Difference in height",
    main = "Manual PCA-like projection"
)
points(z[1, 1:2] , z[2, 1:2], col = 2, pch = 16)

4.1.2 Not much distance is lost in the second dimension

rafalib::mypar()
d = dist(t(y))
d3 = dist(z[1, ]) * sqrt(2) ##distance computed using just first dimension mypar(1,1)
plot(as.numeric(d),
     as.numeric(d3),
     xlab = "Pairwise distances in 2 dimensions",
     ylab = "Pairwise distances in 1 dimension")
abline(0, 1, col = "red")

Not much loss of height differences when just using average heights of twin pairs.
- because twin heights are highly correlated

4.2 Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

note: the above formulation is for \(m > n\)
\(\mathbf{Y}\): the m rows x n cols matrix of measurements
\(\mathbf{U}\): m x n matrix relating original scores to PCA scores (loadings)
\(\mathbf{D}\): n x n diagonal matrix (eigenvalues)
\(\mathbf{V}\): n × n orthogonal matrix (eigenvectors or PCA scores)
- orthogonal = unit length and “perpendicular” in 3-D

4.3 SVD of gene expression dataset

Scaling:

system.time(e.standardize.slow <-
                t(apply(e, 1, function(x)
                    x - mean(x))))

##    user  system elapsed 
##   1.248   0.121   1.441

system.time(e.standardize.fast <- t(scale(t(e), scale = FALSE)))

##    user  system elapsed 
##   0.149   0.001   0.155

all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])

## [1] TRUE

SVD:

s <- svd(e.standardize.fast)
names(s)

## [1] "d" "u" "v"

4.4 Components of SVD results

dim(s$u)     # loadings

## [1] 22215   189

length(s$d)  # eigenvalues

## [1] 189

dim(s$v)     # d %*% vT = scores

## [1] 189 189

4.5 PCA of gene expression dataset

rafalib::mypar()
p <- prcomp(t(e.standardize.fast))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

4.6 PCA interpretation: loadings

\(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
- think of principal component axes as a weighted combination of original axes

4.7 Visualizing PCA loadings

plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "Scores of PC1") #or, predict(p)
abline(h = c(-0.03, 0.04), col = "red")

Genes with high PC1 loadings:

e.pc1genes <-
    e.standardize.fast[p$rotation[, 1] < -0.03 |
                           p$rotation[, 1] > 0.03,]
pheatmap::pheatmap(
    e.pc1genes,
    scale = "none",
    show_rownames = TRUE,
    show_colnames = FALSE
)

4.8 PCA interpretation: eigenvalues

\(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.

rafalib::mypar()
plot(
    p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
    xlim = c(0, 150),
    type = "b",
    ylab = "% variance explained",
    main = "Screeplot"
)

Alternatively as cumulative % variance explained (using cumsum() function):

rafalib::mypar()
plot(
    cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
    ylab = "cumulative % variance explained",
    ylim = c(0, 100),
    type = "b",
    main = "Cumulative screeplot"
)

4.9 PCA interpretation: scores

\(\mathbf{V}\) (scores): The “datapoints” in the reduced prinipal component space
In some implementations (like prcomp()), scores \(\mathbf{D V^T}\)

4.10 PCA interpretation: scores

rafalib::mypar()
plot(
    p$x[, 1:2],
    xlab = "PC1",
    ylab = "PC2",
    main = "plot of p$x[, 1:2]",
    col = factor(tissue),
    pch = as.integer(factor(tissue))
)
legend(
    "topright",
    legend = levels(factor(tissue)),
    col = 1:length(unique(tissue)),
    pch = 1:length(unique(tissue)),
    bty = 'n'
)

4.11 Multi-dimensional Scaling (MDS)

also referred to as Principal Coordinates Analysis (PCoA)
a reduced SVD, performed on a distance matrix
identify two (or more) eigenvalues/vectors that preserve distances

d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)

rafalib::mypar()
plot(mds, col = factor(tissue), pch = as.integer(factor(tissue)))
legend(
    "topright",
    legend = levels(factor(tissue)),
    col = 1:length(unique(tissue)),
    bty = 'n',
    pch = 1:length(unique(tissue))
)

4.12 Summary: distances and dimension reduction

Note: signs of eigenvalues (square to get variances) and eigenvectors (loadings) can be arbitrarily flipped
PCA and MDS are useful for dimension reduction when you have correlated variables
Variables are always centered.
Variables are also scaled unless you know they have the same scale in the population
PCA projection can be applied to new datasets if you know the matrix calculations
PCA is subject to over-fitting, screeplot can be tested by cross-validation

5 Batch Effects

5.1 Batch Effects

pervasive in genomics (e.g. Leek et al. )
affect DNA and RNA sequencing, proteomics, imaging, microarray…
have caused high-profile problems and retractions
- you can’t get rid of them
- but you can make sure they are not confounded with your experiment

5.2 Batch Effects - an example

Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
Title: Common genetic variants account for differences in gene expression among ethnic groups.
- “The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes.”

library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

5.3 Date of processing as a proxy for batch

Sample metadata included date of processing:

head(table(sampleInfo$date))

## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2

year <-  as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)

##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

5.4 Visualizing batch effects by PCA

pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
    pc$x[, 1] ~ month,
    varwidth = TRUE,
    notch = TRUE,
    main = "PC1 scores vs. month",
    xlab = "Month",
    ylab = "PC1"
)

5.5 Visualizing batch effects by MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))

d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
     main = "MDS shaded by batch")

5.6 The batch effects impact clustering

hcclass <- cutree(hclust(d), k = 5)
table(hcclass, year)

##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

5.7 Batch Effects - summary

tend to affect many or all measurements by a little bit
during experimental design:
- keep track of anything that might cause a batch effect for post-hoc analysis
- include control samples in each batch
batches can be corrected for if randomized in study design
- if confounded with treatment or outcome of interest, nothing can help you

5.8 Links

A built html version of this lecture is available.
The source R Markdown is also available from Github.

Applied Statistics for High-throughput Biology 2019: Session 4

Levi Waldron