1 Day 4 outline

Book Chapter 2:

Book chapter 7:

2 Exploratory data analysis

2.1 Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

  • Discover biases, systematic errors and unexpected variability in data
  • Graphical approach to detecting these issues
  • Represents a first step in data analysis and guides hypothesis testing
  • Opportunities for discovery in the outliers

2.2 Quantile Quantile Plots

  • Quantiles divide a distribution into equally sized bins
  • Division into 100 bins gives percentiles
  • Quantiles of a theoretical distribution are plotted against an experimental distribution
    • alternatively, quantiles of two experimental distributions
  • Given a perfect fit, \(x=y\)
  • Useful in determining data distribution (normal, t, etc.)

2.3 Quantile Quantile Plots

suppressPackageStartupMessages(library(UsingR))
suppressPackageStartupMessages(library(rafalib))
# height qq plot
x <- father.son$fheight
ps <- (seq(0, 99) + 0.5) / 100
qs <- quantile(x, ps)
normalqs <- qnorm(ps, mean(x), popsd(x))
par(mfrow = c(1, 3))
plot(normalqs,
     qs,
     xlab = "Normal Percentiles",
     ylab = "Height Percentiles",
     main = "Height Q-Q Plot")
abline(0, 1)
# t-distributed for 12 df
x <- rt(1000, 12)
qqnorm(x,
       xlab = "t quantiles",
       main = "T Quantiles (df=12) Q-Q Plot",
       ylim = c(-6, 6))
qqline(x)
# t-distributed for 3 df
x <- rt(1000, 3)
qqnorm(x,
       xlab = "t quantiles",
       main = "T Quantiles (df=3) Q-Q Plot",
       ylim = c(-6, 6))
qqline(x)

2.4 Boxplots

  • Provide a graph that is easy to interpret where data may not be normally distributed
  • Qualitative + quantitative
  • Particularly informative in relation to outliers and range
  • Possible to compare multiple distributions side by side

2.5 Boxplots

par(mfrow = c(1, 3))
hist(exec.pay, main = "CEO Compensation")
qqnorm(exec.pay, main = "CEO Compensation")
boxplot(exec.pay,
        ylab = "10,000s of dollars",
        ylim = c(0, 400),
        main = "CEO Compensation")
Three different views of a continuous variable

2.6 Scatterplots And Correlation

  • For two continuous variables, scatter plot and calculation of correlation is useful
  • Provides a graphical and numeric estimation of relationships
  • Quick and easy with plot() and cor()

2.7 Scatterplots And Correlation

par(mfrow = c(1, 3))
plot(
    father.son$fheight,
    father.son$sheight,
    xlab = "Father's height in inches",
    ylab = "Son's height in inches",
    main = paste("correlation =", signif(
        cor(father.son$fheight, father.son$sheight), 2
    ))
)
plot(
    cars$speed,
    cars$dist,
    xlab = "Speed",
    ylab = "Stopping Distance",
    main = paste("correlation =", signif(cor(
        cars$speed, cars$dist
    ), 2))
)
plot(
    faithful$eruptions,
    faithful$waiting,
    xlab = "Eruption Duration",
    ylab = "Waiting Time",
    main = paste("correlation =", signif(
        cor(faithful$eruptions, faithful$waiting), 2
    ))
)

2.8 Exploratory data analysis in high dimensions

#install from Github
BiocManager::install("genomicsclass/GSE5859Subset")  
BiocManager::install("genomicsclass/tissuesGeneExpression")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix"     "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

2.9 Volcano plots

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))
##    user  system elapsed 
##   0.007   0.001   0.009
pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

2.10 Volcano plots

par(mar = c(4, 4, 0, 0))
plot(results$dm,
     -log10(results$p.value),
     xlab = "Effect size (difference in group means)",
     ylab = "- log (base 10) p-values")
abline(h = -log10(0.05 / nrow(geneExpression)), col = "red")
legend("bottomright",
       lty = 1,
       col = "red",
       legend = "Bonferroni = 0.05")

2.11 Volcano plots (summary)

  • Many small p-values with small effect size indicate low within-group variability
  • Inspect for asymmetry
  • Can color points by significance threshold

2.12 P-value histograms

  • If all null hypotheses are true, expect a flat histogram of p-values:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1)
randomData <- matrix(rnorm(n * m), m, n)
nullpvals <- rowttests(randomData, g)$p.value
par(mfrow = c(1, 2))
hist(pvals, ylim = c(0, 1400))
hist(nullpvals, ylim = c(0, 1400))

2.13 P-value histograms

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

2.14 P-value histograms (summary)

  • Give a quick look at how many significant p-values there may be
  • When using permuted labels, can expose non-independence among the samples
    • can be due to batch effects or family structure
  • Most common approaches for correcting batch effects are:
    • including batch in your model formula
    • ComBat: corrects for known batch effects by linear model), and
    • sva: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
    • correction using control (housekeeping) genes

All are available from the sva Bioconductor package

2.15 MA plot

  • just a scatterplot rotated 45\(^o\)
rafalib::mypar(1, 2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

2.16 MA plot (summary)

  • useful for quality control of high-dimensional data
  • plot all data values for a sample against another sample or a median “pseudosample”
  • affyPLM::MAplots better MA plots
    • adds a smoothing line to highlight departures from horizontal line
    • plots a “cloud” rather than many data points

2.17 Heatmaps

Detailed representation of high-dimensional dataset

ge = geneExpression[sample(1:nrow(geneExpression), 200),]
pheatmap::pheatmap(
    ge,
    scale = "row",
    show_colnames = FALSE,
    show_rownames = FALSE
)

2.18 Heatmaps

  • Clustering becomes slow and memory-intensivefor thousands of rows
    • probably too detailed for thousands of rows
  • can show co-expressed genes, groups of samples

2.19 Colors

  • Types of color pallettes:
    • sequential: shows a gradient
    • diverging: goes in two directions from a center
    • qualitative: for categorical variables
  • Keep color blindness in mind (10% of all men)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

rafalib::mypar(1, 1)
RColorBrewer::display.brewer.all(n = 7)

2.20 Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

  • Always avoid pie charts
  • Avoid doughnut charts too
  • Avoid pseudo 3D and most Excel defaults
  • Effective graphs use color judiciously

3 Distances in high-dimensional data analysis

3.1 The importance of distance

  • High-dimensional data are complex and impossible to visualize in raw form
  • Thousands of dimensions, we can only visualize 2-3
  • Distances can simplify thousands of dimensions
animals
  • Distances can help organize samples and genomic features
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
set.seed(1)
ge = geneExpression[sample(1:nrow(geneExpression), 200),]
pheatmap::pheatmap(
    ge,
    scale = "row",
    show_colnames = FALSE,
    show_rownames = FALSE
)

  • Any clustering or classification of samples and/or genes involves combining or identifying objects that are close or similar.
  • Distances or similarities are mathematical representations of what we mean by close or similar.
  • The choice of distance is important and requires thought.
    • choice is subject-matter specific

Source: http://master.bioconductor.org/help/course-materials/2002/Summer02Course/Distance/distance.pdf

3.2 Metrics and distances

A metric satisfies the following five properties:

  1. non-negativity \(d(a, b) \ge 0\)
  2. symmetry \(d(a, b) = d(b, a)\)
  3. identification mark \(d(a, a) = 0\)
  4. definiteness \(d(a, b) = 0\) if and only if \(a=b\)
  5. triangle inequality \(d(a, b) + d(b, c) \ge d(a, c)\)
  • A distance is only required to satisfy 1-3.
  • A similarity function satisfies 1-2, and increases as \(a\) and \(b\) become more similar
  • A dissimilarity function satisfies 1-2, and decreases as \(a\) and \(b\) become more similar

3.3 Euclidian distance (metric)

  • Remember grade school:
rafalib::mypar()
plot(
    c(0, 1, 1),
    c(0, 0, 1),
    pch = 16,
    cex = 2,
    xaxt = "n",
    yaxt = "n",
    xlab = "",
    ylab = "",
    bty = "n",
    xlim = c(-0.25, 1.25),
    ylim = c(-0.25, 1.25)
)
lines(c(0, 1, 1, 0), c(0, 0, 1, 0))
text(0, .2, expression(paste('(A'[x] * ',A'[y] * ')')), cex = 1.5)
text(1, 1.2, expression(paste('(B'[x] * ',B'[y] * ')')), cex = 1.5)
text(-0.1, 0, "A", cex = 2)
text(1.1, 1, "B", cex = 2)
Euclidean d = \(\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\).
  • Side note: also referred to as \(L_2\) norm

3.4 Euclidian distance in high dimensions

##BiocManager::install("genomicsclass/tissuesGeneExpression")
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##gene expression data
## [1] 22215   189
table(tissue) ##tissue[i] corresponds to e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

Interested in identifying similar samples and similar genes

3.4.1 Notes:

  • Points are no longer on the Cartesian plane,
  • instead they are in higher dimensions. For example:
    • sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\).
    • feature \(g\) is defined by a point in 189 dimensions \((Y_{g,189},\dots,Y_{g,189})^\top\)

Euclidean distance as for two dimensions. E.g., the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

3.5 Matrix algebra notation

The distance between samples \(i\) and \(j\) can be written as:

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) columns \(i\) and \(j\).

t(matrix(1:3, ncol = 1))
##      [,1] [,2] [,3]
## [1,]    1    2    3
matrix(1:3, ncol = 1)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
t(matrix(1:3, ncol = 1)) %*% matrix(1:3, ncol = 1)
##      [,1]
## [1,]   14

Note: R is very efficient at matrix algebra

3.6 3 sample example

kidney1 <- e[, 1]
kidney2 <- e[, 2]
colon1 <- e[, 87]
sqrt(sum((kidney1 - kidney2) ^ 2))
## [1] 85.8546
sqrt(sum((kidney1 - colon1) ^ 2))
## [1] 122.8919

3.7 3 sample example using dist()

dim(e)
## [1] 22215   189
(d <- dist(t(e[, c(1, 2, 87)])))
##                 GSM11805.CEL.gz GSM11814.CEL.gz
## GSM11814.CEL.gz         85.8546                
## GSM92240.CEL.gz        122.8919        115.4773
class(d)
## [1] "dist"

3.8 The dist() function

Excerpt from ?dist:

dist(
    x,
    method = "euclidean",
    diag = FALSE,
    upper = FALSE,
    p = 2
)
  • method: the distance measure to be used.
    • This must be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”. Any unambiguous substring can be given.
  • dist class output from dist() is used for many clustering algorithms and heatmap functions

Caution: dist(e) creates a 22215 x 22215 matrix that will probably crash your R session.

3.9 Note on standardization

  • In practice, features (e.g. genes) are typically “standardized”, i.e. converted to z-score:

\[x_{gi} \leftarrow \frac{(x_{gi} - \bar{x}_g)}{s_g}\]

  • This is done because the differences in overall levels between features are often not due to biological effects but technical ones, e.g.:
    • GC bias, PCR amplification efficiency, …
  • Euclidian distance and \(1-r\) (Pearson cor) are related:
    • \(\frac{d_E(x, y)^2}{2m} = 1 - r_{xy}\)

4 Dimension reduction and PCA

4.1 Motivation for dimension reduction

Simulate the heights of twin pairs:

set.seed(1)
n <- 100
y <- t(MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))
dim(y)
## [1]   2 100
cor(t(y))
##           [,1]      [,2]
## [1,] 1.0000000 0.9433295
## [2,] 0.9433295 1.0000000

4.1.1 Visualizing twin pairs data

z1 = (y[1,] + y[2,]) / 2 #the sum
z2 = (y[1,] - y[2,])   #the difference

z = rbind(z1, z2) #matrix now same dimensions as y

thelim <- c(-3, 3)
rafalib::mypar(1, 2)

plot(
    y[1,],
    y[2,],
    xlab = "Twin 1 (standardized height)",
    ylab = "Twin 2 (standardized height)",
    xlim = thelim,
    ylim = thelim,
    main = "Original twin heights"
)
points(y[1, 1:2], y[2, 1:2], col = 2, pch = 16)

plot(
    z[1,],
    z[2,],
    xlim = thelim,
    ylim = thelim,
    xlab = "Average height",
    ylab = "Difference in height",
    main = "Manual PCA-like projection"
)
points(z[1, 1:2] , z[2, 1:2], col = 2, pch = 16)

4.1.2 Not much distance is lost in the second dimension

rafalib::mypar()
d = dist(t(y))
d3 = dist(z[1, ]) * sqrt(2) ##distance computed using just first dimension mypar(1,1)
plot(as.numeric(d),
     as.numeric(d3),
     xlab = "Pairwise distances in 2 dimensions",
     ylab = "Pairwise distances in 1 dimension")
abline(0, 1, col = "red")
  • Not much loss of height differences when just using average heights of twin pairs.
    • because twin heights are highly correlated

4.2 Singular Value Decomposition (SVD)

SVD generalizes the example rotation we looked at:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

SVD
  • note: the above formulation is for \(m > n\)

  • \(\mathbf{Y}\): the m rows x n cols matrix of measurements
  • \(\mathbf{U}\): m x n matrix relating original scores to PCA scores (loadings)
  • \(\mathbf{D}\): n x n diagonal matrix (eigenvalues)
  • \(\mathbf{V}\): n × n orthogonal matrix (eigenvectors or PCA scores)
    • orthogonal = unit length and “perpendicular” in 3-D

4.3 SVD of gene expression dataset

Scaling:

system.time(e.standardize.slow <-
                t(apply(e, 1, function(x)
                    x - mean(x))))
##    user  system elapsed 
##   1.248   0.121   1.441
system.time(e.standardize.fast <- t(scale(t(e), scale = FALSE)))
##    user  system elapsed 
##   0.149   0.001   0.155
all.equal(e.standardize.slow[, 1], e.standardize.fast[, 1])
## [1] TRUE

SVD:

s <- svd(e.standardize.fast)
names(s)
## [1] "d" "u" "v"

4.4 Components of SVD results

dim(s$u)     # loadings
## [1] 22215   189
length(s$d)  # eigenvalues
## [1] 189
dim(s$v)     # d %*% vT = scores
## [1] 189 189
SVD

4.5 PCA of gene expression dataset

rafalib::mypar()
p <- prcomp(t(e.standardize.fast))
plot(s$u[, 1] ~ p$rotation[, 1])

Lesson: u and v can each be multiplied by -1 arbitrarily

4.6 PCA interpretation: loadings

SVD
  • \(\mathbf{U}\) (loadings): relate the principal component axes to the original variables
    • think of principal component axes as a weighted combination of original axes

4.7 Visualizing PCA loadings

plot(p$rotation[, 1],
     xlab = "Index of genes",
     ylab = "Loadings of PC1",
     main = "Scores of PC1") #or, predict(p)
abline(h = c(-0.03, 0.04), col = "red")

Genes with high PC1 loadings:

e.pc1genes <-
    e.standardize.fast[p$rotation[, 1] < -0.03 |
                           p$rotation[, 1] > 0.03,]
pheatmap::pheatmap(
    e.pc1genes,
    scale = "none",
    show_rownames = TRUE,
    show_colnames = FALSE
)

4.8 PCA interpretation: eigenvalues

  • \(\mathbf{D}\) (eigenvalues): standard deviation scaling factor that each decomposed variable is multiplied by.
rafalib::mypar()
plot(
    p$sdev ^ 2 / sum(p$sdev ^ 2) * 100,
    xlim = c(0, 150),
    type = "b",
    ylab = "% variance explained",
    main = "Screeplot"
)

Alternatively as cumulative % variance explained (using cumsum() function):

rafalib::mypar()
plot(
    cumsum(p$sdev ^ 2) / sum(p$sdev ^ 2) * 100,
    ylab = "cumulative % variance explained",
    ylim = c(0, 100),
    type = "b",
    main = "Cumulative screeplot"
)

4.9 PCA interpretation: scores

SVD
  • \(\mathbf{V}\) (scores): The “datapoints” in the reduced prinipal component space
  • In some implementations (like prcomp()), scores \(\mathbf{D V^T}\)

4.10 PCA interpretation: scores

rafalib::mypar()
plot(
    p$x[, 1:2],
    xlab = "PC1",
    ylab = "PC2",
    main = "plot of p$x[, 1:2]",
    col = factor(tissue),
    pch = as.integer(factor(tissue))
)
legend(
    "topright",
    legend = levels(factor(tissue)),
    col = 1:length(unique(tissue)),
    pch = 1:length(unique(tissue)),
    bty = 'n'
)

4.11 Multi-dimensional Scaling (MDS)

  • also referred to as Principal Coordinates Analysis (PCoA)
  • a reduced SVD, performed on a distance matrix
  • identify two (or more) eigenvalues/vectors that preserve distances
d <- as.dist(1 - cor(e.standardize.fast))
mds <- cmdscale(d)
rafalib::mypar()
plot(mds, col = factor(tissue), pch = as.integer(factor(tissue)))
legend(
    "topright",
    legend = levels(factor(tissue)),
    col = 1:length(unique(tissue)),
    bty = 'n',
    pch = 1:length(unique(tissue))
)

4.12 Summary: distances and dimension reduction

  • Note: signs of eigenvalues (square to get variances) and eigenvectors (loadings) can be arbitrarily flipped
  • PCA and MDS are useful for dimension reduction when you have correlated variables
  • Variables are always centered.
  • Variables are also scaled unless you know they have the same scale in the population
  • PCA projection can be applied to new datasets if you know the matrix calculations
  • PCA is subject to over-fitting, screeplot can be tested by cross-validation

5 Batch Effects

5.1 Batch Effects

  • pervasive in genomics (e.g. Leek et al. )
  • affect DNA and RNA sequencing, proteomics, imaging, microarray…
  • have caused high-profile problems and retractions
    • you can’t get rid of them
    • but you can make sure they are not confounded with your experiment

5.2 Batch Effects - an example

  • Nat Genet. 2007 Feb;39(2):226-31. Epub 2007 Jan 7.
  • Title: Common genetic variants account for differences in gene expression among ethnic groups.
    • “The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes.”
library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

5.3 Date of processing as a proxy for batch

  • Sample metadata included date of processing:
head(table(sampleInfo$date))
## 
## 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
##          2          2          2          5          4          2
year <-  as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)
##     
## year ASN CEU HAN
##    0   0  32   0
##    1   0  54   0
##    2   0  13   0
##    3  80   3   0
##    4   2   0  24

5.4 Visualizing batch effects by PCA

pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
    pc$x[, 1] ~ month,
    varwidth = TRUE,
    notch = TRUE,
    main = "PC1 scores vs. month",
    xlab = "Month",
    ylab = "PC1"
)

5.5 Visualizing batch effects by MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))
d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
     main = "MDS shaded by batch")

5.6 The batch effects impact clustering

hcclass <- cutree(hclust(d), k = 5)
table(hcclass, year)
##        year
## hcclass  0  1  2  3  4
##       1  0  2  8  1  0
##       2 25 43  0  2  0
##       3  6  0  0  0  0
##       4  1  7  0  0  0
##       5  0  2  5 80 26

5.7 Batch Effects - summary

  • tend to affect many or all measurements by a little bit
  • during experimental design:
    • keep track of anything that might cause a batch effect for post-hoc analysis
    • include control samples in each batch
  • batches can be corrected for if randomized in study design
    • if confounded with treatment or outcome of interest, nothing can help you