Applied Statistics for High-throughput Biology: Day 3
Levi Waldron
June 22, 2017
Hypothesis testing for categorical variables
Hypothesis testing for categorical variables
- Hypothesis testing and confidence intervals for one or two continuous variables:
- Z-test, t-test, correlation
- Two binary variables:
- Fisher’s Exact Test
- Pearson’s Chi-square Test
Lady Tasting Tea
- The Lady in question claimed to be able to tell whether the tea or the milk was added first to a cup
- Fisher proposed to give her eight cups, four of each variety, in random order
- the Lady is fully informed of the experimental method
- \(H_0\): the Lady has no ability to tell which was added first
Source: https://en.wikipedia.org/wiki/Lady_tasting_tea
Fisher’s Exact Test
p-value is the probability of the observed number of successes, or more, under \(H_0\)
Tea-Tasting Distribution
Success count
|
Permutations of selection
|
Number of permutations
|
0
|
oooo
|
1 × 1 = 1
|
1
|
ooox, ooxo, oxoo, xooo
|
4 × 4 = 16
|
2
|
ooxx, oxox, oxxo, xoxo, xxoo, xoox
|
6 × 6 = 36
|
3
|
oxxx, xoxx, xxox, xxxo
|
4 × 4 = 16
|
4
|
xxxx
|
1 × 1 = 1
|
Total
|
70
|
What do you notice about all these combinations?
Notes on Fisher’s Exact Test
- Can also be applied to rxc tables
- Remember that the margins of the table are fixed by design
- Also referred to as the Hypergeometric Test
- Exact p-values are difficult (and unnecessary) for large samples
fisher.test(x, y = NULL, etc, simulate.p.value = FALSE)
Notes on Fisher’s Exact Test (cont’d)
- Has been applied (with peril!) to gene set analysis, e.g.:
- 10 of my top 100 genes are annotated with the cytokinesis GO term
- 465 of 21,000 human genes are annotated with the cytokinesis GO term
- Are my top 100 genes enriched for cytokinesis process?
- Problems with this analysis:
- Main problem: top-n genes tend to be correlated, so their selections are not independent trials
- Secondary: does not match design for \(H_0\)
- Alternative: permutation test repeating all steps
Chi-squared test
- Test of independence for rxc table (two categorical variables)
- Does not assume the margins are fixed by design
- i.e., the number of cups of tea with milk poured first can be random, and the Lady doesn’t know how many
- more common in practice
- classic genomics example is GWAS
- \(H_0\): the two variables are independent
- \(H_A\): there is an association between the variables
Application to GWAS
- Interested in association between disease and some potential causative factor
- In a case-control study, the numbers of cases and controls are fixed, but the other variable is not
- In a prospective or longitudinal cohort study, neither the number of cases or the other variable are fixed
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
levels=c("AA/Aa","aa"))
dat <- data.frame(disease, genotype)
dat <- dat[sample(nrow(dat)),] #shuffle them up
summary(dat)
## disease genotype
## control:220 AA/Aa:204
## cases : 30 aa : 46
Application to GWAS (cont’d)
## genotype
## disease AA/Aa aa
## control 184 36
## cases 20 10
chisq.test(disease, genotype)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456
chisq.test(disease, genotype, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.03998
Application to GWAS (cont’d)
Note that the result says nothing about how the departure from independence occurs
library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## AA/Aa 1.000000 NA NA
## aa 2.555556 1.104441 5.913274
Note on factor reference levels
Use the relevel()
function if you want to change the reference level of a factor:
epitools::oddsratio(relevel(genotype, "aa"), disease)$measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## aa 1.0000000 NA NA
## AA/Aa 0.3908154 0.1705085 0.9426884
(the default is whichever level is first alphabetically!)
Summary - two categorical variables
- Choice between Fisher’s Exact Test and Chi-square test is determined by experimental design
- If any counts in the table are less than 5, can use
simulate.p.value=TRUE
to get a more accurate p-value from the chi-square test
- Both assume independent observations (important!!)
- In a GWAS, correction for multiple testing is required
- Can also use logistic regression for two categorical variables
Inference in high dimensions
Inference in high dimensions
- Conceptually similar to what we have already done
- \(Y_i\) expression of a gene, etc
- Just repeated many times, e.g.:
- is the mean expression of a gene different between two groups? (t-test, linear model)
- is a gene variant associated with disease? (chi-square test, logistic regression)
- do this simple analysis thousands of times
- note: for small sample sizes, some Bayesian improvements can be made (i.e. LIMMA, DESeq2, EdgeR)
Multiple testing
- When testing thousands of true null hypotheses with \(\alpha = 0.05\), you expect a 5% type I error rate
- What p-values are even smaller than you expect by chance from multiple testing?
- Two mainstream approaches for controlling type I error rate:
- Family-wise error rate (e.g., Bonferroni correction).
- Controlling FWER at 0.05 ensures that the probably of any type I errors is < 0.05.
- False Discovery Rate (e.g., Benjamini-Hochberg correction)
- Controlling FDR at 0.05 ensures that fraction of type I errors is < 0.05.
- correction for the smallest p-value is the same as the Bonferroni correction
- correction for larger p-values becomes less stringent
p.adjust(p, method = p.adjust.methods)
Visualizing False Discovery Rate
Benjamini and Hochberg: one visualization
Exploratory data analysis
Introduction
“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey
- Discover biases, systematic errors and unexpected variability in data
- Graphical approach to detecting these issues
- Represents a first step in data analysis and guides hypothesis testing
- Opportunities for discovery in the outliers
Quantile Quantile Plots
- Quantiles divide a distribution into equally sized bins
- Division into 100 bins gives percentiles
- Quantiles of a theoretical distribution are plotted against an experimental distribution
- alternatively, quantiles of two experimental distributions
- Given a perfect fit, \(x=y\)
- Useful in determining data distribution (normal, t, etc.)
Quantile Quantile Plots

Boxplots
- Provide a graph that is easy to interpret where data is not normally distributed
- Would be an appropriate choice to explore income data, as distribution is highly skewed
- Particularly informative in relation to outliers and range
- Possible to compare multiple distributions side by side
Boxplots
Three different views of a continuous variable
Scatterplots And Correlation
- For two continuous variables, scatter plot and calculation of correlation is useful
- Provides a graphical and numeric estimation of relationships
- Quick and easy with
plot()
and cor()
Scatterplots And Correlation

Exploratory data analysis in high dimensions
library(BiocInstaller)
biocLite("genomicsclass/GSE5859Subset") #install from Github
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix" "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
## [,1] [,2]
## [1,] 8793 24
## [2,] 24 4
## ethnicity date filename group
## 107 ASN 2005-06-23 GSM136508.CEL.gz 1
## 122 ASN 2005-06-27 GSM136530.CEL.gz 1
## 113 ASN 2005-06-27 GSM136517.CEL.gz 1
## 163 ASN 2005-10-28 GSM136576.CEL.gz 1
## 153 ASN 2005-10-07 GSM136566.CEL.gz 1
## 161 ASN 2005-10-07 GSM136574.CEL.gz 1
Volcano plots
T-test for every row (gene) of gene expression matrix:
library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))
## user system elapsed
## 0.007 0.000 0.007
Note that these 8,793 tests are done in about 0.01s
Volcano plots
par(mar=c(4, 4, 0, 0))
plot(results$dm, -log10(results$p.value),
xlab="Effect size (difference in group means)",
ylab="- log (base 10) p-values")
abline(h=-log10(0.05 / nrow(geneExpression)), col="red")
legend("bottomright", lty=1, col="red", legend="Bonferroni = 0.05")

Volcano plots (summary)
- Many small p-values with small effect size indicate low within-group variability
- Inspect for asymmetry
- Can color points by significance threshold
P-value histograms
- If all null hypotheses are true, expect a flat histogram of p-values:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1); randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value

P-value histograms
Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:
permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms (summary)
- Give a quick look at how many significant p-values there may be
- When using permuted labels, can expose non-independence among the samples
- can be due to batch effects or family structure
- Most common approaches for correcting batch effects are:
- including batch in your model formula
ComBat
: corrects for known batch effects by linear model), and
sva
: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
- correction using control (housekeeping) genes
All are available from the sva Bioconductor package
MA plot
- just a scatterplot rotated 45\(^o\)
rafalib::mypar(1,2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot (summary)
- useful for quality control of high-dimensional data
- plot all data values for a sample against another sample or a median “pseudosample”
affyPLM::MAplots
better MA plots
- adds a smoothing line to highlight departures from horizontal line
- plots a “cloud” rather than many data points
Heatmaps
Detailed representation of high-dimensional dataset
ge = geneExpression[sample(1:nrow(geneExpression), 200), ]
pheatmap::pheatmap(ge, scale="row", show_colnames = F, show_rownames = F)

Heatmaps
- Clustering becomes slow and memory-intensivefor thousands of rows
- probably too detailed for thousands of rows
- can show co-expressed genes, groups of samples
Colors
- Types of color pallettes:
- sequential: shows a gradient
- diverging: goes in two directions from a center
- qualitative: for categorical variables
- Keep color blindness in mind (10% of all men)
Colors (cont’d)
Combination of RColorBrewer
package and colorRampPalette()
can create anything you want

Plots To Avoid
“Pie charts are a very bad way of displaying information.” - R Help
- Always avoid pie charts
- Avoid doughnut charts too
- Avoid pseudo 3D and most Excel defaults
- Effective graphs use color judiciously
Links
- A built html version of this lecture is available.
- The source R Markdown is also available from Github.