Applied Statistics for High-throughput Biology: Session 4

Levi Waldron

Outline

Exploratory data analysis

Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

Quantile Quantile Plots

Example: Quantile Quantile Plots

Boxplots: About

Boxplots: Example

Three different views of a continuous variable

Scatterplots And Correlation: About

Scatterplots And Correlation: Example

Exploratory data analysis in high dimensions

library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
#> [1] "matrix"     "array"      "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
#>      [,1] [,2]
#> [1,] 8793   24
#> [2,]   24    4
head(sampleInfo)
#>     ethnicity       date         filename group
#> 107       ASN 2005-06-23 GSM136508.CEL.gz     1
#> 122       ASN 2005-06-27 GSM136530.CEL.gz     1
#> 113       ASN 2005-06-27 GSM136517.CEL.gz     1
#> 163       ASN 2005-10-28 GSM136576.CEL.gz     1
#> 153       ASN 2005-10-07 GSM136566.CEL.gz     1
#> 161       ASN 2005-10-07 GSM136574.CEL.gz     1

Volcano plots: Setup

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression, g))
#>    user  system elapsed 
#>   0.004   0.000   0.004
pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

Volcano plots: Example

par(mar = c(4, 4, 0, 0))
plot(results$dm,
     -log10(results$p.value),
     xlab = "Effect size (difference in group means)",
     ylab = "- log (base 10) p-values")
abline(h = -log10(0.05 / nrow(geneExpression)), col = "red")
legend("bottomright",
       lty = 1,
       col = "red",
       legend = "Bonferroni = 0.05")

Volcano plots: Summary

P-value histograms: Setup

m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1)
randomData <- matrix(rnorm(n * m), m, n)
nullpvals <- rowttests(randomData, g)$p.value

P-value histograms: Example

P-value histograms: Example 2 (permutation)

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms: Summary

ComBat and sva are available from the sva Bioconductor package

MA plot

rafalib::mypar(1, 2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot: Summary

Heatmaps

Heatmaps: Summary

Colors

Colors (cont’d)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

Batch effects

Pervasiveness of batch Effects

Batch Effects - an example

library(Biobase)
library(genefilter)
library(GSE5859) ## BiocInstaller::biocLite("genomicsclass/GSE5859")
data(GSE5859)
geneExpression = exprs(e)
sampleInfo = pData(e)

Date of processing as a proxy for batch

head(table(sampleInfo$date))
#> 
#> 2002-10-31 2002-11-15 2002-11-19 2002-11-21 2002-11-22 2002-11-27 
#>          2          2          2          5          4          2
year <-  as.integer(format(sampleInfo$date, "%y"))
year <- year - min(year)
month = as.integer(format(sampleInfo$date, "%m")) + 12 * year
table(year, sampleInfo$ethnicity)
#>     
#> year ASN CEU HAN
#>    0   0  32   0
#>    1   0  54   0
#>    2   0  13   0
#>    3  80   3   0
#>    4   2   0  24

Visualizing batch effects by PCA

pc <- prcomp(t(geneExpression), scale. = TRUE)
boxplot(
    pc$x[, 1] ~ month,
    varwidth = TRUE,
    notch = TRUE,
    main = "PC1 scores vs. month",
    xlab = "Month",
    ylab = "PC1"
)

Visualizing batch effects by MDS

A starting point for a color palette:

RColorBrewer::display.brewer.all(n = 3, colorblindFriendly = TRUE)

Interpolate one color per month on a quantitative palette:

col3 <- c(RColorBrewer::brewer.pal(n = 3, "Greys")[2:3], "black")
MYcols <- colorRampPalette(col3, space = "Lab")(length(unique(month)))

Visualizing batch effects by MDS

d <- as.dist(1 - cor(geneExpression))
mds <- cmdscale(d)
plot(mds, col = MYcols[as.integer(factor(month))],
     main = "MDS shaded by month")

The batch effects impact clustering

hcclass <- cutree(hclust(d), k = 5)
table(hcclass, year)
#>        year
#> hcclass  0  1  2  3  4
#>       1  0  2  8  1  0
#>       2 25 43  0  2  0
#>       3  6  0  0  0  0
#>       4  1  7  0  0  0
#>       5  0  2  5 80 26

Approaches to correcting for batch effects

Batch Effects - summary

Exercises