Applied Statistics for High-throughput Biology: Day 3

Levi Waldron

June 22, 2017

Day 3 outline

Hypothesis testing for categorical variables

Hypothesis testing for categorical variables

Lady Tasting Tea

Lady tasting tea

Source: https://en.wikipedia.org/wiki/Lady_tasting_tea

Fisher’s Exact Test

p-value is the probability of the observed number of successes, or more, under \(H_0\)

Tea-Tasting Distribution
Success count Permutations of selection Number of permutations
0 oooo 1 × 1 = 1
1 ooox, ooxo, oxoo, xooo 4 × 4 = 16
2 ooxx, oxox, oxxo, xoxo, xxoo, xoox 6 × 6 = 36
3 oxxx, xoxx, xxox, xxxo 4 × 4 = 16
4 xxxx 1 × 1 = 1
Total 70

What do you notice about all these combinations?

Notes on Fisher’s Exact Test

Notes on Fisher’s Exact Test (cont’d)

Chi-squared test

Application to GWAS

disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
                levels=c("AA/Aa","aa"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up 
summary(dat)
##     disease     genotype  
##  control:220   AA/Aa:204  
##  cases  : 30   aa   : 46

Application to GWAS (cont’d)

table(disease, genotype)
##          genotype
## disease   AA/Aa  aa
##   control   184  36
##   cases      20  10
chisq.test(disease, genotype)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456
chisq.test(disease, genotype, simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.03998

Application to GWAS (cont’d)

Note that the result says nothing about how the departure from independence occurs

library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure
##          odds ratio with 95% C.I.
## Predictor estimate    lower    upper
##     AA/Aa 1.000000       NA       NA
##     aa    2.555556 1.104441 5.913274

Note on factor reference levels

Use the relevel() function if you want to change the reference level of a factor:

epitools::oddsratio(relevel(genotype, "aa"), disease)$measure
##          odds ratio with 95% C.I.
## Predictor  estimate     lower     upper
##     aa    1.0000000        NA        NA
##     AA/Aa 0.3908154 0.1705085 0.9426884

(the default is whichever level is first alphabetically!)

Summary - two categorical variables

Inference in high dimensions

Inference in high dimensions

Multiple testing

  1. Family-wise error rate (e.g., Bonferroni correction).
    • Controlling FWER at 0.05 ensures that the probably of any type I errors is < 0.05.
  2. False Discovery Rate (e.g., Benjamini-Hochberg correction)
    • Controlling FDR at 0.05 ensures that fraction of type I errors is < 0.05.
    • correction for the smallest p-value is the same as the Bonferroni correction
    • correction for larger p-values becomes less stringent
p.adjust(p, method = p.adjust.methods)

Visualizing False Discovery Rate

Benjamini and Hochberg: one visualization

Exploratory data analysis

Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

Quantile Quantile Plots

Quantile Quantile Plots

Boxplots

Boxplots

Three different views of a continuous variable

Scatterplots And Correlation

Scatterplots And Correlation

Exploratory data analysis in high dimensions

library(BiocInstaller)
biocLite("genomicsclass/GSE5859Subset")  #install from Github
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix"     "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

Volcano plots

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))
##    user  system elapsed 
##   0.007   0.000   0.007
pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

Volcano plots

par(mar=c(4, 4, 0, 0))
plot(results$dm, -log10(results$p.value),
     xlab="Effect size (difference in group means)",
     ylab="- log (base 10) p-values")
abline(h=-log10(0.05 / nrow(geneExpression)), col="red")
legend("bottomright", lty=1, col="red", legend="Bonferroni = 0.05")

Volcano plots (summary)

P-value histograms

m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1); randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value

P-value histograms

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms (summary)

All are available from the sva Bioconductor package

MA plot

rafalib::mypar(1,2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot (summary)

Heatmaps

Detailed representation of high-dimensional dataset

ge = geneExpression[sample(1:nrow(geneExpression), 200), ]
pheatmap::pheatmap(ge, scale="row", show_colnames = F, show_rownames = F)

Heatmaps

Colors

Colors (cont’d)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

Lab

Exploratory Data Analysis: http://genomicsclass.github.io/book/pages/exploratory_data_analysis_exercises.html