Applied Statistics for High-throughput Biology:
Session 2
Levi Waldron
Hypothesis testing for categorical variables
- Recall hypothesis testing and confidence intervals for one or two
continuous variables:
- Z-test, t-test, correlation
- Two binary variables:
- Fisher’s Exact Test
- Pearson’s Chi-square Test
Lady Tasting Tea
- The Lady in question claimed to be able to tell whether the tea or
the milk was added first to a cup
- Fisher proposed to give her eight cups,
four of each variety, in random order
- the Lady is fully informed of the experimental
method
- H0: the Lady has no ability to
tell which was added first
Source: https://en.wikipedia.org/wiki/Lady_tasting_tea
Fisher’s Exact Test
p-value is the probability of the observed number of successes, or
more, under H0
Tea-Tasting Distribution
| Success count | Permutations of selection | Number of permutations |
|---------------|------------------------------------|------------------------|
| 0 | oooo | 1 × 1 = 1 |
| 1 | ooox, ooxo, oxoo, xooo | 4 × 4 = 16 |
| 2 | ooxx, oxox, oxxo, xoxo, xxoo, xoox | 6 × 6 = 36 |
| 3 | oxxx, xoxx, xxox, xxxo | 4 × 4 = 16 |
| 4 | xxxx | 1 × 1 = 1 |
| Total | | 70 |

- x = correct, o = incorrect
What do you notice about all these combinations?
- Each only involves four guesses
- Without knowing how many cups had milk added first, there would be
many more possible guesses
Notes on Fisher’s Exact Test
- Can also be applied to r×c tables
- Remember that the margins of the table are fixed by
design
- One-tailed version (used in Gene Set Enrichment Analysis) is
referred to as the Hypergeometric Test
- Exact p-values are difficult (and unnecessary) for large samples
fisher.test(x, y = NULL, ..., simulate.p.value = FALSE)
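As a sketch, the tea-tasting experiment itself can be run through fisher.test(); this mirrors the example in ?fisher.test, and the 3-of-4-correct outcome is assumed for illustration:

## 2x2 table: the Lady guesses 3 of the 4 milk-first cups correctly
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"),
                                     Truth = c("Milk", "Tea")))
fisher.test(TeaTasting, alternative = "greater")
## one-tailed p-value = (16 + 1) / 70 ≈ 0.243, matching the tea-tasting distribution above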
Applications of Fisher’s Exact Test
- Has been applied (with peril) to gene set analysis, e.g.:
- 10 of my top 100 genes are annotated with the cytokinesis GO
term
- 465 of 21,000 human genes are annotated with the cytokinesis GO
term
- Are my top 100 genes enriched for the cytokinesis process? (a sketch of this calculation follows below)
- Problems with this analysis:
- Main problem: top-n genes tend to be correlated, so their selections
are not independent trials
- Secondary: does not match design for H0
- There are many alternative approaches in Gene Set Analysis, but this
approach isn’t as bad as statisticians thought. See:
Geistlinger et al. Toward a gold standard for benchmarking
gene set enrichment analysis. Brief. Bioinform. (2020) doi:10.1093/bib/bbz158
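For illustration only (and subject to the caveats above), a minimal sketch of the one-tailed hypergeometric calculation for the cytokinesis example, using the counts from the bullets above:

## P(10 or more annotated genes among the top 100), given 465 of 21,000 annotated
phyper(10 - 1, m = 465, n = 21000 - 465, k = 100, lower.tail = FALSE)
## equivalently, a one-tailed Fisher's Exact Test on the corresponding 2x2 table
tab <- matrix(c(10, 455, 90, 20445), nrow = 2)
fisher.test(tab, alternative = "greater")$p.value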
Chi-squared test
- Test of independence for r×c table (two categorical variables)
- Does not assume the margins are fixed by design
- i.e., whether each cup of tea has milk poured first is determined by
coin flip, and the Lady doesn’t know how many
- more common in practice
- classic genomics example is GWAS
- H0: the two variables are
independent
- HA: there is an association
between the variables
Application to GWAS
- Interested in association between disease and some potential
causative factor
- In a case-control study, the numbers of cases and controls are
fixed, but the other variable is not
- In a prospective or longitudinal cohort study, neither the number of
cases nor the other variable is fixed
##     disease      genotype
##  control:220   AA/Aa:204
##  cases : 30    aa   : 46
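The code that created these data is not shown in the slides; a minimal reconstruction that reproduces the displayed counts (hypothetical, for reproducibility):

## two factors matching the joint counts tabulated below
disease <- factor(rep(c("control", "cases"), times = c(220, 30)),
                  levels = c("control", "cases"))
genotype <- factor(rep(c("AA/Aa", "aa", "AA/Aa", "aa"),
                       times = c(184, 36, 20, 10)),
                   levels = c("AA/Aa", "aa"))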
Apply the chi-squared test
##          genotype
## disease   AA/Aa aa
##   control   184 36
##   cases      20 10
chisq.test(disease, genotype)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456
chisq.test(disease, genotype, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.03998
- Chi-squared test p-values are approximate, and the approximation worsens
with small cell counts in r×c tables (say, less than 5)
- simulate.p.value is a more valid way to address small counts than
switching to Fisher's Exact Test, if H0 corresponds to a chi-squared test
- Fisher’s Exact Test provides an exact p-value for a different
hypothesis
Calculate odds ratio
- Note that the result says nothing about how the departure
from independence occurs
- Odds ratios do
library(epitools)
epitools::oddsratio(genotype, disease, method = "wald")$measure
##           odds ratio with 95% C.I.
## Predictor  estimate    lower    upper
##     AA/Aa  1.000000       NA       NA
##     aa     2.555556 1.104441 5.913274
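As a check, the Wald point estimate can be reproduced by hand from the 2×2 table above:

## odds of aa among cases, divided by odds of aa among controls
(10 / 20) / (36 / 184)
## [1] 2.555556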
Note on factor reference levels
Use the relevel() function if you want to change the reference level of a factor, or dplyr::recode_factor():
epitools::oddsratio(relevel(genotype, "aa"), disease)$measure
##           odds ratio with 95% C.I.
## Predictor   estimate     lower     upper
##     aa     1.0000000        NA        NA
##     AA/Aa  0.3908154 0.1705085 0.9426884
(the default is whichever level is first alphabetically!)
Summary - two categorical variables
- Choice between Fisher’s Exact Test and Chi-square test is determined
by experimental design
- If any counts in the table are less than 5, can use
simulate.p.value=TRUE
to get a more accurate p-value from
the chi-square test
- Both assume independent observations (important!!)
- In a GWAS, correction for multiple testing is required
Resampling methods
Big classes of resampling methods
- Resampling involves simulating repeated samples from the one
available sample
- Permutation tests: shuffling labels to generate an
empirical null distribution
- Cross-validation: generate training and test sets
without replacement
- Bootstrap: generate samples of size n, with replacement
Permutation test
Response (y) variable is permuted to guarantee true H0:
Genome-wide association studies for common diseases
and complex traits. Nature Reviews Genetics 6, 95-108 (February
2005).
Permutation test
- Calculate the test statistic for each permutation
- P-value is the quantile of the real test statistic in the “empirical
null distribution” of permutation test statistics
- Permutation tests still have assumptions:
- samples are assumed to be independent and “exchangeable”
- hidden structure such as families can cause anti-conservative
p-values
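A minimal sketch of a permutation test for a difference in group means (simulated data; 1,000 permutations chosen arbitrarily):

set.seed(1)
y <- rnorm(20)
grp <- rep(c("A", "B"), each = 10)
obs <- mean(y[grp == "A"]) - mean(y[grp == "B"])
permstats <- replicate(1000, {
  g <- sample(grp)  # shuffle labels so H0 is true by construction
  mean(y[g == "A"]) - mean(y[g == "B"])
})
mean(abs(permstats) >= abs(obs))  # two-sided empirical p-value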
K-fold cross-validation: setting
- Setting: prediction (as opposed to inference)
- Potential uses:
- to tune a model to the data
- to avoid reporting inflated prediction accuracy due to
over-fitting
- to compare the accuracy of competing algorithms in one dataset

Three-fold cross-validation
Definitions: Training, Validation, Test
- The model is initially fit on a training
dataset
- Successively, the fitted model is used to predict the responses for
the observations in a second dataset called the validation
dataset. The validation dataset might be used repeatedly for
selecting from different models.
- Finally, the test dataset is a dataset used to
provide an unbiased evaluation of a final model fit on the
training dataset.
https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
Tuning: training set and validation set error

ISLR Figure 2.9: Training set vs. Validation set error: the U
- Training set error always decreases with more flexible models
- Validation set error tends to be U-shaped: model “tuning” searches
for this low point
- Cross-validation for tuning does not equal cross-validation for
estimating independent accuracy
James, Witten, Hastie, Tibshirani. An Introduction to
Statistical Learning with Applications in R. Springer, 2014. https://www.statlearning.com/
FIGURE 2.9. Left: Data simulated from f, shown in black. Three
estimates of f are shown: the linear regression line (orange curve), and
two smoothing spline fits (blue and green curves). Right: Training MSE
(grey curve), test MSE (red curve), and minimum possible test MSE over
all methods (dashed line). Squares represent the training and test MSEs
for the three fits shown in the left-hand panel.
K-fold cross-validation: algorithm
- Create K “folds” from the
sample of size n, K≤n
- Randomly sample 1/K of the
observations (without replacement) as the validation set
- Use remaining samples as the training set
- Fit model on the training set, estimate accuracy on the validation
set
- Repeat K times, not using the
same validation samples
- Average validation accuracy from each of the validation sets
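A minimal sketch of this algorithm for a linear model on simulated data (K = 5 chosen arbitrarily):

set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)
K <- 5
fold <- sample(rep(seq_len(K), length.out = nrow(dat)))  # random fold labels
cvmse <- sapply(seq_len(K), function(k) {
  fit <- lm(y ~ x, data = dat[fold != k, ])         # training set
  pred <- predict(fit, newdata = dat[fold == k, ])  # validation set
  mean((dat$y[fold == k] - pred)^2)
})
mean(cvmse)  # average validation MSE across the K folds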
Variability in cross-validation

ISLR Figure 5.2: Variability in 2-fold cross-validation
- Repeating cross-validation many times then averaging the results is
called smoothing.
- Simply choosing a different random training and validation set each
time is called Monte Carlo Cross-validation (MCCV)
James, Witten, Hastie, Tibshirani. An Introduction to
Statistical Learning with Applications in R. Springer, 2014. https://www.statlearning.com/
FIGURE 5.2. The validation set approach was used on the Auto data set
in order to estimate the test error that results from predicting mpg
using polynomial functions of horsepower. Left: Validation error
estimates for a single split into training and validation data sets.
Right: The validation method was repeated ten times, each time using a
different random split of the observations into a training set and a
validation set. This illustrates the variability in the estimated test
MSE that results from this approach.
Cross-validation summary
- In prediction modeling, we think of data as training or
test
- cross-validation is a way to estimate test set error from a training
set
- Training set error always decreases with more complex (flexible)
models
- Test set error as a function of model flexibility tends to be
U-shaped
- the low point of the U represents the most appropriate amount of
model complexity
Summary of The Bootstrap
- The Bootstrap is a very general approach to estimating uncertainty,
e.g. standard errors
- Can be applied to a wide range of models and statistics
- Robust to outliers and violations of model assumptions
- The basic approach:
- Using the available sample (size n), generate a new sample of size n (with replacement)
- Calculate the statistic of interest
- Repeat
- Use repeated experiments to estimate the variability of your
statistic of interest
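A minimal sketch of the basic approach, estimating the standard error of a sample median (simulated data; 1,000 resamples chosen arbitrarily):

set.seed(1)
x <- rexp(50)  # the one available sample, of size n = 50
bootmeds <- replicate(1000, median(sample(x, replace = TRUE)))  # resample with replacement
sd(bootmeds)  # bootstrap estimate of the standard error of the median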
The Bootstrap (schematic)
Exploratory data analysis
Introduction
“The greatest value of a picture is when it forces us to notice what
we never expected to see.” - John W. Tukey
- Discover biases, systematic errors and unexpected variability in
data
- Graphical approach to detecting these issues
- Represents a first step in data analysis and guides hypothesis
testing
- Opportunities for discovery in the outliers
Quantile Quantile Plots
- Quantiles divide a distribution into equally sized bins
- Division into 100 bins gives percentiles
- Quantiles of a theoretical distribution are plotted against an
experimental distribution
- alternatively, quantiles of two experimental distributions
- Given a perfect fit, x=y
- Useful in determining data distribution (normal, t, etc.)
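A quick sketch in base R, comparing simulated data to the theoretical normal distribution:

set.seed(1)
x <- rnorm(100)
qqnorm(x)  # sample quantiles vs. theoretical normal quantiles
qqline(x)  # reference line; points falling along it indicate a good fit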
Example: Quantile Quantile Plots

Boxplots: About
- Provide a graph that is easy to interpret when data are not normally
distributed
- An appropriate choice for exploring income data, since the
distribution is highly skewed
- Particularly informative in relation to outliers and range
- Possible to compare multiple distributions side by side
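A sketch using simulated right-skewed "income" data, compared across two hypothetical groups:

set.seed(1)
income <- rlnorm(200, meanlog = 10, sdlog = 1)  # log-normal: highly skewed
group <- sample(c("A", "B"), 200, replace = TRUE)
boxplot(income ~ group, ylab = "income")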
Boxplots: Example
Three different views of a continuous variable
Scatterplots And Correlation: About
- For two continuous variables, scatter plot and calculation of
correlation is useful
- Provides a graphical and numeric estimation of relationships
- Quick and easy with plot() and cor()
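A sketch on simulated data:

set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
plot(x, y)  # graphical view of the relationship
cor(x, y)   # Pearson correlation coefficient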
Scatterplots And Correlation: Example

Exploratory data analysis in high dimensions
if(!require(GSE5859Subset)){
BiocManager::install("genomicsclass/GSE5859Subset")
}
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix" "array" "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1
Volcano plots: Setup
T-test for every row (gene) of gene expression matrix:
library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression, g))
## user system elapsed
## 0.006 0.000 0.005
Note that these 8,793 tests are done in about 0.01s
Volcano plots: Example
par(mar = c(4, 4, 0, 0))
plot(results$dm,
-log10(results$p.value),
xlab = "Effect size (difference in group means)",
ylab = "- log (base 10) p-values")
abline(h = -log10(0.05 / nrow(geneExpression)), col = "red")
legend("bottomright",
lty = 1,
col = "red",
legend = "Bonferroni = 0.05")

Volcano plots: Summary
- Many small p-values with small effect size indicate low within-group
variability
- Inspect for asymmetry
- Can color points by significance threshold
P-value histograms: Setup
- If all null hypotheses are true, expect a flat histogram of
p-values:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1)
randomData <- matrix(rnorm(n * m), m, n)
nullpvals <- rowttests(randomData, g)$p.value
P-value histograms: Example
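The histogram itself is not reproduced here; it would be drawn from the setup above with:

hist(nullpvals)  # approximately flat when all null hypotheses are true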

P-value histograms: Example 2 (permutation)
Note that permuting these data doesn’t produce an ideal null p-value
histogram due to batch effects:
permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms: Summary
- Give a quick look at how many significant p-values there may be
- When using permuted labels, can expose non-independence among the samples
- can be due to batch effects or family structure
- Most common approaches for correcting batch effects are:
  - ComBat: corrects for known batch effects by linear model
  - sva: creates surrogate variables for unknown batch effects, and corrects the structure of permutation p-values
  - correction using control (housekeeping) genes
- All are available from the sva Bioconductor package
MA plot
- just a scatterplot rotated 45°
rafalib::mypar(1, 2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot: Summary
- useful for quality control of high-dimensional data
- plot all data values for a sample against another sample or a median
“pseudosample”
affyPLM::MAplots provides better MA plots:
- adds a smoothing line to highlight departures from horizontal
line
- plots a “cloud” rather than many data points
Heatmaps
- Detailed representation of a high-dimensional dataset
- The ComplexHeatmap package is the best as of 2022: large datasets, interactive heatmaps, simple defaults but many customizations possible
suppressPackageStartupMessages(library(ComplexHeatmap))
keep <- rank(-apply(geneExpression, 1, var)) <= 100 # the 100 most variable genes
ge <- geneExpression[keep, ]
ge <- t(scale(t(ge))) #scale
rownames(ge) <- NULL; colnames(ge) <- NULL
chr <- sub("chr", "", geneAnnotation$CHR)
chr[is.na(chr)] <- "Un"
chr <- factor(chr, levels = c(1:22, "Un", "X", "Y"))
chrcols <- list(chromosome = c(colorRampPalette(c("lightgrey", "black"))(22), "white", "pink", "blue"))
names(chrcols[["chromosome"]]) <- c(1:22, "Un", "X", "Y")
row_ha <- rowAnnotation(chromosome = chr[keep], col = chrcols, na_col = "white")
column_ha <- HeatmapAnnotation(ethnicity = sampleInfo$ethnicity, group = factor(sampleInfo$group),
col = list(ethnicity = c("ASN"="lightgrey", "CEU"="darkgrey"),
group = c("0" = "brown", "1" = "blue")))
Heatmap(ge, use_raster = TRUE, top_annotation = column_ha, right_annotation = row_ha)

Heatmaps: Summary
- Clustering becomes slow and memory-intensive for thousands of rows
- probably too detailed for thousands of rows
- can show co-expressed genes, groups of samples
Colors
- Types of color palettes:
- sequential: shows a gradient
- diverging: goes in two directions from a
center
- qualitative: for categorical variables
- Keep color blindness in mind (10% of all men)
Colors (cont’d)
A combination of the RColorBrewer package and colorRampPalette() can create anything you want, as sketched below:
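For example, a sketch generating a 100-color sequential palette (the "Blues" scheme chosen arbitrarily):

library(RColorBrewer)
pal <- colorRampPalette(brewer.pal(9, "Blues"))(100)  # interpolate 9 colors to 100
image(matrix(seq_len(100)), col = pal, axes = FALSE)  # preview the gradient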

Plots To Avoid
“Pie charts are a very bad way of displaying information.” - R
Help
- Always avoid pie charts
- Avoid doughnut charts too
- Avoid pseudo 3D and most Excel defaults
- Effective graphs use color judiciously
Links
- A built html version of this lecture is available.
- The source R Markdown is also available from GitHub.