Applied Statistics for High-throughput Biology: Day 3

Levi Waldron

June 22, 2017

Day 3 outline

Hypothesis testing for categorical variables
- Fisher’s Exact Test and Chi-square Test
Inference in high dimensions: multiple testing
Exploratory data analysis
Chapter 1 - Inference
Chapter 2 - Exploratory Data Analysis
Chapter 5 - Inference for High Dimensional Data
- Basic exploratory data analysis

Hypothesis testing for categorical variables

Hypothesis testing and confidence intervals for one or two continuous variables:
- Z-test, t-test, correlation
Two binary variables:
- Fisher’s Exact Test
- Pearson’s Chi-square Test

Lady Tasting Tea

The Lady in question claimed to be able to tell whether the tea or the milk was added first to a cup
Fisher proposed to give her eight cups, four of each variety, in random order
- the Lady is fully informed of the experimental method
- \(H_0\): the Lady has no ability to tell which was added first

Source: https://en.wikipedia.org/wiki/Lady_tasting_tea

Fisher’s Exact Test

p-value is the probability of the observed number of successes, or more, under \(H_0\)

Tea-Tasting Distribution
Success count	Permutations of selection	Number of permutations
0	oooo	1 × 1 = 1
1	ooox, ooxo, oxoo, xooo	4 × 4 = 16
2	ooxx, oxox, oxxo, xoxo, xxoo, xoox	6 × 6 = 36
3	oxxx, xoxx, xxox, xxxo	4 × 4 = 16
4	xxxx	1 × 1 = 1
Total		70

What do you notice about all these combinations?

Notes on Fisher’s Exact Test

Can also be applied to rxc tables
Remember that the margins of the table are fixed by design
Also referred to as the Hypergeometric Test
Exact p-values are difficult (and unnecessary) for large samples
- fisher.test(x, y = NULL, etc, simulate.p.value = FALSE)

Notes on Fisher’s Exact Test (cont’d)

Has been applied (with peril!) to gene set analysis, e.g.:
- 10 of my top 100 genes are annotated with the cytokinesis GO term
- 465 of 21,000 human genes are annotated with the cytokinesis GO term
- Are my top 100 genes enriched for cytokinesis process?
Problems with this analysis:
- Main problem: top-n genes tend to be correlated, so their selections are not independent trials
- Secondary: does not match design for \(H_0\)
Alternative: permutation test repeating all steps

Chi-squared test

Test of independence for rxc table (two categorical variables)
Does not assume the margins are fixed by design
- i.e., the number of cups of tea with milk poured first can be random, and the Lady doesn’t know how many
- more common in practice
- classic genomics example is GWAS
\(H_0\): the two variables are independent
\(H_A\): there is an association between the variables

Application to GWAS

Interested in association between disease and some potential causative factor
In a case-control study, the numbers of cases and controls are fixed, but the other variable is not
In a prospective or longitudinal cohort study, neither the number of cases or the other variable are fixed

disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
                levels=c("AA/Aa","aa"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up 
summary(dat)

##     disease     genotype  
##  control:220   AA/Aa:204  
##  cases  : 30   aa   : 46

Application to GWAS (cont’d)

table(disease, genotype)

##          genotype
## disease   AA/Aa  aa
##   control   184  36
##   cases      20  10

chisq.test(disease, genotype)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456

chisq.test(disease, genotype, simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.03998

Application to GWAS (cont’d)

Note that the result says nothing about how the departure from independence occurs

library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure

##          odds ratio with 95% C.I.
## Predictor estimate    lower    upper
##     AA/Aa 1.000000       NA       NA
##     aa    2.555556 1.104441 5.913274

Note on factor reference levels

Use the relevel() function if you want to change the reference level of a factor:

epitools::oddsratio(relevel(genotype, "aa"), disease)$measure

##          odds ratio with 95% C.I.
## Predictor  estimate     lower     upper
##     aa    1.0000000        NA        NA
##     AA/Aa 0.3908154 0.1705085 0.9426884

(the default is whichever level is first alphabetically!)

Summary - two categorical variables

Choice between Fisher’s Exact Test and Chi-square test is determined by experimental design
If any counts in the table are less than 5, can use simulate.p.value=TRUE to get a more accurate p-value from the chi-square test
Both assume independent observations (important!!)
In a GWAS, correction for multiple testing is required
Can also use logistic regression for two categorical variables

Inference in high dimensions

Conceptually similar to what we have already done
- \(Y_i\) expression of a gene, etc
Just repeated many times, e.g.:
- is the mean expression of a gene different between two groups? (t-test, linear model)
- is a gene variant associated with disease? (chi-square test, logistic regression)
- do this simple analysis thousands of times
- note: for small sample sizes, some Bayesian improvements can be made (i.e. LIMMA, DESeq2, EdgeR)

Multiple testing

When testing thousands of true null hypotheses with \(\alpha = 0.05\), you expect a 5% type I error rate
What p-values are even smaller than you expect by chance from multiple testing?
Two mainstream approaches for controlling type I error rate:

Family-wise error rate (e.g., Bonferroni correction).
- Controlling FWER at 0.05 ensures that the probably of any type I errors is < 0.05.
False Discovery Rate (e.g., Benjamini-Hochberg correction)
- Controlling FDR at 0.05 ensures that fraction of type I errors is < 0.05.
- correction for the smallest p-value is the same as the Bonferroni correction
- correction for larger p-values becomes less stringent

p.adjust(p, method = p.adjust.methods)

Visualizing False Discovery Rate

Benjamini and Hochberg: one visualization

Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

Discover biases, systematic errors and unexpected variability in data
Graphical approach to detecting these issues
Represents a first step in data analysis and guides hypothesis testing
Opportunities for discovery in the outliers

Quantile Quantile Plots

Quantiles divide a distribution into equally sized bins
Division into 100 bins gives percentiles
Quantiles of a theoretical distribution are plotted against an experimental distribution
- alternatively, quantiles of two experimental distributions
Given a perfect fit, \(x=y\)
Useful in determining data distribution (normal, t, etc.)

Quantile Quantile Plots

Boxplots

Provide a graph that is easy to interpret where data is not normally distributed
Would be an appropriate choice to explore income data, as distribution is highly skewed
Particularly informative in relation to outliers and range
Possible to compare multiple distributions side by side

Boxplots

Three different views of a continuous variable

Scatterplots And Correlation

For two continuous variables, scatter plot and calculation of correlation is useful
Provides a graphical and numeric estimation of relationships
Quick and easy with plot() and cor()

Scatterplots And Correlation

Exploratory data analysis in high dimensions

library(BiocInstaller)
biocLite("genomicsclass/GSE5859Subset")  #install from Github

library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))

## [1] "matrix"     "data.frame"

rbind(dim(geneExpression), dim(sampleInfo))

##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4

head(sampleInfo)

##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

Volcano plots

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))

##    user  system elapsed 
##   0.007   0.000   0.007

pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

Volcano plots

par(mar=c(4, 4, 0, 0))
plot(results$dm, -log10(results$p.value),
     xlab="Effect size (difference in group means)",
     ylab="- log (base 10) p-values")
abline(h=-log10(0.05 / nrow(geneExpression)), col="red")
legend("bottomright", lty=1, col="red", legend="Bonferroni = 0.05")

Volcano plots (summary)

Many small p-values with small effect size indicate low within-group variability
Inspect for asymmetry
Can color points by significance threshold

P-value histograms

If all null hypotheses are true, expect a flat histogram of p-values:

m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1); randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value

P-value histograms

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms (summary)

Give a quick look at how many significant p-values there may be
When using permuted labels, can expose non-independence among the samples
- can be due to batch effects or family structure
Most common approaches for correcting batch effects are:
- including batch in your model formula
- ComBat: corrects for known batch effects by linear model), and
- sva: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
- correction using control (housekeeping) genes

All are available from the sva Bioconductor package

MA plot

just a scatterplot rotated 45\(^o\)

rafalib::mypar(1,2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot (summary)

useful for quality control of high-dimensional data
plot all data values for a sample against another sample or a median “pseudosample”
affyPLM::MAplots better MA plots
- adds a smoothing line to highlight departures from horizontal line
- plots a “cloud” rather than many data points

Heatmaps

Detailed representation of high-dimensional dataset

ge = geneExpression[sample(1:nrow(geneExpression), 200), ]
pheatmap::pheatmap(ge, scale="row", show_colnames = F, show_rownames = F)

Heatmaps

Clustering becomes slow and memory-intensivefor thousands of rows
- probably too detailed for thousands of rows
can show co-expressed genes, groups of samples

Colors

Types of color pallettes:
- sequential: shows a gradient
- diverging: goes in two directions from a center
- qualitative: for categorical variables
Keep color blindness in mind (10% of all men)

Colors (cont’d)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

Always avoid pie charts
Avoid doughnut charts too
Avoid pseudo 3D and most Excel defaults
Effective graphs use color judiciously

Lab

Exploratory Data Analysis: http://genomicsclass.github.io/book/pages/exploratory_data_analysis_exercises.html

Links

A built html version of this lecture is available.
The source R Markdown is also available from Github.