Applied Statistics for High-throughput Biology:
Session 2
Levi Waldron
Hypothesis testing for categorical variables
- Recall hypothesis testing and confidence intervals for one or two
continuous variables:
- Z-test, t-test, correlation
- Two binary variables:
- Fisher’s Exact Test
- Pearson’s Chi-square Test
Lady Tasting Tea
- The Lady in question claimed to be able to tell whether the tea or
the milk was added first to a cup
- Fisher proposed to give her eight cups,
four of each variety, in random order
- the Lady is fully informed of the experimental
method
- H0: the Lady has no ability to
tell which was added first
Source: https://en.wikipedia.org/wiki/Lady_tasting_tea
Fisher’s Exact Test
p-value is the probability of the observed number of successes, or
more, under H0
Tea-Tasting Distribution
| Success count | Permutations of selection | Number of permutations |
|---------------|------------------------------------|------------------------|
| 0 | oooo | 1 × 1 = 1 |
| 1 | ooox, ooxo, oxoo, xooo | 4 × 4 = 16 |
| 2 | ooxx, oxox, oxxo, xoxo, xxoo, xoox | 6 × 6 = 36 |
| 3 | oxxx, xoxx, xxox, xxxo | 4 × 4 = 16 |
| 4 | xxxx | 1 × 1 = 1 |
| Total | | 70 |

- x = correct, o = incorrect
What do you notice about all these combinations?
- Each only involves four guesses
- Without knowing how many cups had milk added first, there would be
many more possible guesses
Notes on Fisher’s Exact Test
- Can also be applied to r×c tables
- Remember that the margins of the table are fixed by
design
- One-tailed version (used in Gene Set Enrichment Analysis) is
referred to as the Hypergeometric Test
- Exact p-values are difficult (and unnecessary) for large samples
fisher.test(x, y = NULL, ..., simulate.p.value = FALSE)
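As a sketch, the tea-tasting experiment itself can be run through fisher.test(); this mirrors the example in ?fisher.test, and the 3-of-4-correct outcome is assumed for illustration:

## 2x2 table: the Lady guesses 3 of the 4 milk-first cups correctly
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"),
                                     Truth = c("Milk", "Tea")))
fisher.test(TeaTasting, alternative = "greater")
## one-tailed p-value = (16 + 1) / 70 ≈ 0.243, matching the tea-tasting distribution above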
Applications of Fisher’s Exact Test
- Has been applied (with peril) to gene set analysis, e.g.:
- 10 of my top 100 genes are annotated with the cytokinesis GO
term
- 465 of 21,000 human genes are annotated with the cytokinesis GO
term
- Are my top 100 genes enriched for the cytokinesis process? (a sketch of this calculation follows below)
- Problems with this analysis:
- Main problem: top-n genes tend to be correlated, so their selections
are not independent trials
- Secondary: does not match design for H0
- There are many alternative approaches in Gene Set Analysis, but this
approach isn’t as bad as statisticians thought. See:
Geistlinger et al. Toward a gold standard for benchmarking
gene set enrichment analysis. Brief. Bioinform. (2020) doi:10.1093/bib/bbz158
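For illustration only (and subject to the caveats above), a minimal sketch of the one-tailed hypergeometric calculation for the cytokinesis example, using the counts from the bullets above:

## P(10 or more annotated genes among the top 100), given 465 of 21,000 annotated
phyper(10 - 1, m = 465, n = 21000 - 465, k = 100, lower.tail = FALSE)
## equivalently, a one-tailed Fisher's Exact Test on the corresponding 2x2 table
tab <- matrix(c(10, 455, 90, 20445), nrow = 2)
fisher.test(tab, alternative = "greater")$p.value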
Chi-squared test
- Test of independence for r×c table (two categorical variables)
- Does not assume the margins are fixed by design
- i.e., whether each cup of tea has milk poured first is determined by
coin flip, and the Lady doesn’t know how many
- more common in practice
- classic genomics example is GWAS
- H0: the two variables are
independent
- HA: there is an association
between the variables
Application to GWAS
- Interested in association between disease and some potential
causative factor
- In a case-control study, the numbers of cases and controls are
fixed, but the other variable is not
- In a prospective or longitudinal cohort study, neither the number of
cases nor the other variable is fixed
##     disease      genotype
##  control:220   AA/Aa:204
##  cases : 30    aa   : 46
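The code that created these data is not shown in the slides; a minimal reconstruction that reproduces the displayed counts (hypothetical, for reproducibility):

## two factors matching the joint counts tabulated below
disease <- factor(rep(c("control", "cases"), times = c(220, 30)),
                  levels = c("control", "cases"))
genotype <- factor(rep(c("AA/Aa", "aa", "AA/Aa", "aa"),
                       times = c(184, 36, 20, 10)),
                   levels = c("AA/Aa", "aa"))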
Apply the chi-squared test
##          genotype
## disease   AA/Aa aa
##   control   184 36
##   cases      20 10
chisq.test(disease, genotype)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456
chisq.test(disease, genotype, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.03998
- Chi-squared test p-values are approximate, and the approximation worsens
with small cell counts in r×c tables (say, less than 5)
- simulate.p.value is a more valid way to address small counts than
switching to Fisher's Exact Test, if H0 corresponds to a chi-squared test
- Fisher’s Exact Test provides an exact p-value for a different
hypothesis
Calculate odds ratio
- Note that the result says nothing about how the departure
from independence occurs
- Odds ratios do
library(epitools)
epitools::oddsratio(genotype, disease, method = "wald")$measure
##           odds ratio with 95% C.I.
## Predictor  estimate    lower    upper
##     AA/Aa  1.000000       NA       NA
##     aa     2.555556 1.104441 5.913274
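As a check, the Wald point estimate can be reproduced by hand from the 2×2 table above:

## odds of aa among cases, divided by odds of aa among controls
(10 / 20) / (36 / 184)
## [1] 2.555556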
Note on factor reference levels
Use the relevel() function if you want to change the reference level of a factor, or dplyr::recode_factor():
epitools::oddsratio(relevel(genotype, "aa"), disease)$measure
##           odds ratio with 95% C.I.
## Predictor   estimate     lower     upper
##     aa     1.0000000        NA        NA
##     AA/Aa  0.3908154 0.1705085 0.9426884
(the default is whichever level is first alphabetically!)
Summary - two categorical variables
- Choice between Fisher’s Exact Test and Chi-square test is determined
by experimental design
- If any counts in the table are less than 5, can use
simulate.p.value=TRUE
to get a more accurate p-value from
the chi-square test
- Both assume independent observations (important!!)
- In a GWAS, correction for multiple testing is required
Resampling methods
Big classes of resampling methods
- Resampling involves simulating repeated samples from the one
available sample
- Permutation tests: shuffling labels to generate an
empirical null distribution
- Cross-validation: generate training and test sets
without replacement
- Bootstrap: generate samples of size n, with replacement
Permutation test
Response (y) variable is permuted to guarantee true H0:
Genome-wide association studies for common diseases
and complex traits. Nature Reviews Genetics 6, 95-108 (February
2005).
Permutation test
- Calculate the test statistic for each permutation
- P-value is the quantile of the real test statistic in the “empirical
null distribution” of permutation test statistics
- Permutation tests still have assumptions:
- samples are assumed to be independent and “exchangeable”
- hidden structure such as families can cause anti-conservative
p-values
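A minimal sketch of a permutation test for a difference in group means (simulated data; 1,000 permutations chosen arbitrarily):

set.seed(1)
y <- rnorm(20)
grp <- rep(c("A", "B"), each = 10)
obs <- mean(y[grp == "A"]) - mean(y[grp == "B"])
permstats <- replicate(1000, {
  g <- sample(grp)  # shuffle labels so H0 is true by construction
  mean(y[g == "A"]) - mean(y[g == "B"])
})
mean(abs(permstats) >= abs(obs))  # two-sided empirical p-value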
K-fold cross-validation: setting
- Setting: prediction (as opposed to inference)
- Potential uses:
- to tune a model to the data
- to avoid reporting inflated prediction accuracy due to
over-fitting
- to compare the accuracy of competing algorithms in one dataset

Three-fold cross-validation
Definitions: Training, Validation, Test
- The model is initially fit on a training
dataset
- Successively, the fitted model is used to predict the responses for
the observations in a second dataset called the validation
dataset. The validation dataset might be used repeatedly for
selecting from different models.
- Finally, the test dataset is a dataset used to
provide an unbiased evaluation of a final model fit on the
training dataset.
https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
Tuning: training set and validation set error

ISLR Figure 2.9: Training set vs. Validation set error: the U
- Training set error always decreases with more flexible models
- Validation set error tends to be U-shaped: model “tuning” searches
for this low point
- Cross-validation for tuning does not equal cross-validation for
estimating independent accuracy
James, Witten, Hastie, Tibshirani. An Introduction to
Statistical Learning with Applications in R. Springer, 2014. https://www.statlearning.com/
FIGURE 2.9. Left: Data simulated from f, shown in black. Three
estimates of f are shown: the linear regression line (orange curve), and
two smoothing spline fits (blue and green curves). Right: Training MSE
(grey curve), test MSE (red curve), and minimum possible test MSE over
all methods (dashed line). Squares represent the training and test MSEs
for the three fits shown in the left-hand panel.
K-fold cross-validation: algorithm
- Create K “folds” from the
sample of size n, K≤n
- Randomly sample 1/K of the
observations (without replacement) as the validation set
- Use remaining samples as the training set
- Fit model on the training set, estimate accuracy on the validation
set
- Repeat K times, not using the
same validation samples
- Average validation accuracy from each of the validation sets
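A minimal sketch of this algorithm for a linear model on simulated data (K = 5 chosen arbitrarily):

set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)
K <- 5
fold <- sample(rep(seq_len(K), length.out = nrow(dat)))  # random fold labels
cvmse <- sapply(seq_len(K), function(k) {
  fit <- lm(y ~ x, data = dat[fold != k, ])         # training set
  pred <- predict(fit, newdata = dat[fold == k, ])  # validation set
  mean((dat$y[fold == k] - pred)^2)
})
mean(cvmse)  # average validation MSE across the K folds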
Variability in cross-validation

ISLR Figure 5.2: Variability in 2-fold cross-validation
- Repeating cross-validation many times then averaging the results is
called smoothing.
- Simply choosing a different random training and validation set each
time is called Monte Carlo Cross-validation (MCCV)
James, Witten, Hastie, Tibshirani. An Introduction to
Statistical Learning with Applications in R. Springer, 2014. https://www.statlearning.com/
FIGURE 5.2. The validation set approach was used on the Auto data set
in order to estimate the test error that results from predicting mpg
using polynomial functions of horsepower. Left: Validation error
estimates for a single split into training and validation data sets.
Right: The validation method was repeated ten times, each time using a
different random split of the observations into a training set and a
validation set. This illustrates the variability in the estimated test
MSE that results from this approach.
Cross-validation summary
- In prediction modeling, we think of data as training or
test
- cross-validation is a way to estimate test set error from a training
set
- Training set error always decreases with more complex (flexible)
models
- Test set error as a function of model flexibility tends to be
U-shaped
- the low point of the U represents the most appropriate amount of
model complexity
Summary of The Bootstrap
- The Bootstrap is a very general approach to estimating uncertainty,
e.g. standard errors
- Can be applied to a wide range of models and statistics
- Robust to outliers and violations of model assumptions
- The basic approach:
- Using the available sample (size n), generate a new sample of size n (with replacement)
- Calculate the statistic of interest
- Repeat
- Use repeated experiments to estimate the variability of your
statistic of interest
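A minimal sketch of the basic approach, estimating the standard error of a sample median (simulated data; 1,000 resamples chosen arbitrarily):

set.seed(1)
x <- rexp(50)  # the one available sample, of size n = 50
bootmeds <- replicate(1000, median(sample(x, replace = TRUE)))  # resample with replacement
sd(bootmeds)  # bootstrap estimate of the standard error of the median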
The Bootstrap (schematic)
Exploratory data analysis
Introduction
“The greatest value of a picture is when it forces us to notice what
we never expected to see.” - John W. Tukey
- Discover biases, systematic errors and unexpected variability in
data
- Graphical approach to detecting these issues
- Represents a first step in data analysis and guides hypothesis
testing
- Opportunities for discovery in the outliers
Quantile Quantile Plots
- Quantiles divide a distribution into equally sized bins
- Division into 100 bins gives percentiles
- Quantiles of a theoretical distribution are plotted against an
experimental distribution
- alternatively, quantiles of two experimental distributions
- Given a perfect fit, x=y
- Useful in determining data distribution (normal, t, etc.)
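A quick sketch in base R, comparing simulated data to the theoretical normal distribution:

set.seed(1)
x <- rnorm(100)
qqnorm(x)  # sample quantiles vs. theoretical normal quantiles
qqline(x)  # reference line; points falling along it indicate a good fit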
Example: Quantile Quantile Plots

Boxplots: About
- Provide a graph that is easy to interpret when data are not normally
distributed
- An appropriate choice for exploring income data, since the
distribution is highly skewed
- Particularly informative in relation to outliers and range
- Possible to compare multiple distributions side by side
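A sketch using simulated right-skewed "income" data, compared across two hypothetical groups:

set.seed(1)
income <- rlnorm(200, meanlog = 10, sdlog = 1)  # log-normal: highly skewed
group <- sample(c("A", "B"), 200, replace = TRUE)
boxplot(income ~ group, ylab = "income")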
Boxplots: Example
Three different views of a continuous variable
Scatterplots And Correlation: About
- For two continuous variables, scatter plot and calculation of
correlation is useful
- Provides a graphical and numeric estimation of relationships
- Quick and easy with plot() and cor()
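A sketch on simulated data:

set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
plot(x, y)  # graphical view of the relationship
cor(x, y)   # Pearson correlation coefficient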
Scatterplots And Correlation: Example

Exploratory data analysis in high dimensions
if(!require(GSE5859Subset)){
BiocManager::install("genomicsclass/GSE5859Subset")
}
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix" "array" "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1
Volcano plots: Setup
T-test for every row (gene) of gene expression matrix:
library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression, g))
## user system elapsed
## 0.006 0.000 0.005
Note that these 8,793 tests are done in about 0.01s
Volcano plots: Example
par(mar = c(4, 4, 0, 0))
plot(results$dm,
-log10(results$p.value),
xlab = "Effect size (difference in group means)",
ylab = "- log (base 10) p-values")
abline(h = -log10(0.05 / nrow(geneExpression)), col = "red")
legend("bottomright",
lty = 1,
col = "red",
legend = "Bonferroni = 0.05")

Volcano plots: Summary
- Many small p-values with small effect size indicate low within-group
variability
- Inspect for asymmetry
- Can color points by significance threshold
P-value histograms: Setup
- If all null hypotheses are true, expect a flat histogram of
p-values:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1)
randomData <- matrix(rnorm(n * m), m, n)
nullpvals <- rowttests(randomData, g)$p.value
P-value histograms: Example
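The histogram itself is not reproduced here; it would be drawn from the setup above with:

hist(nullpvals)  # approximately flat when all null hypotheses are true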

P-value histograms: Example 2 (permutation)
Note that permuting these data doesn’t produce an ideal null p-value
histogram due to batch effects:
permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms: Summary
- Give a quick look at how many significant p-values there may be
- When using permuted labels, can expose non-independence among the samples
- can be due to batch effects or family structure
- Most common approaches for correcting batch effects are:
  - ComBat: corrects for known batch effects by linear model
  - sva: creates surrogate variables for unknown batch effects, and corrects the structure of permutation p-values
  - correction using control (housekeeping) genes
- All are available from the sva Bioconductor package
MA plot
- just a scatterplot rotated 45°
rafalib::mypar(1, 2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot: Summary
- useful for quality control of high-dimensional data
- plot all data values for a sample against another sample or a median
“pseudosample”
affyPLM::MAplots provides better MA plots:
- adds a smoothing line to highlight departures from horizontal
line
- plots a “cloud” rather than many data points
Heatmaps
- Detailed representation of a high-dimensional dataset
- The ComplexHeatmap package is the best as of 2022: large datasets, interactive heatmaps, simple defaults but many customizations possible
suppressPackageStartupMessages(library(ComplexHeatmap))
keep <- rank(-apply(geneExpression, 1, var)) <= 100 # the 100 most variable genes
ge <- geneExpression[keep, ]
ge <- t(scale(t(ge))) #scale
rownames(ge) <- NULL; colnames(ge) <- NULL
chr <- sub("chr", "", geneAnnotation$CHR)
chr[is.na(chr)] <- "Un"
chr <- factor(chr, levels = c(1:22, "Un", "X", "Y"))
chrcols <- list(chromosome = c(colorRampPalette(c("lightgrey", "black"))(22), "white", "pink", "blue"))
names(chrcols[["chromosome"]]) <- c(1:22, "Un", "X", "Y")
row_ha <- rowAnnotation(chromosome = chr[keep], col = chrcols, na_col = "white")
column_ha <- HeatmapAnnotation(ethnicity = sampleInfo$ethnicity, group = factor(sampleInfo$group),
col = list(ethnicity = c("ASN"="lightgrey", "CEU"="darkgrey"),
group = c("0" = "brown", "1" = "blue")))
Heatmap(ge, use_raster = TRUE, top_annotation = column_ha, right_annotation = row_ha)

Heatmaps: Summary
- Clustering becomes slow and memory-intensive for thousands of rows
- probably too detailed for thousands of rows
- can show co-expressed genes, groups of samples
Colors
- Types of color palettes:
- sequential: shows a gradient
- diverging: goes in two directions from a
center
- qualitative: for categorical variables
- Keep color blindness in mind (10% of all men)
Colors (cont’d)
A combination of the RColorBrewer package and colorRampPalette() can create anything you want, as sketched below:
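For example, a sketch generating a 100-color sequential palette (the "Blues" scheme chosen arbitrarily):

library(RColorBrewer)
pal <- colorRampPalette(brewer.pal(9, "Blues"))(100)  # interpolate 9 colors to 100
image(matrix(seq_len(100)), col = pal, axes = FALSE)  # preview the gradient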

Plots To Avoid
“Pie charts are a very bad way of displaying information.” - R
Help
- Always avoid pie charts
- Avoid doughnut charts too
- Avoid pseudo 3D and most Excel defaults
- Effective graphs use color judiciously
Links
- A built html version of this lecture is available.
- The source R Markdown is also available from GitHub.