Applied Statistics for High-throughput Biology: Session 2

Levi Waldron

Session 2 outline

Hypothesis testing for categorical variables

Lady Tasting Tea

Lady tasting tea

Source: https://en.wikipedia.org/wiki/Lady_tasting_tea

Fisher’s Exact Test

p-value is the probability of the observed number of successes, or more, under \(H_0\)

Tea-Tasting Distribution
Success count Permutations of selection Number of permutations
0 oooo 1 × 1 = 1
1 ooox, ooxo, oxoo, xooo 4 × 4 = 16
2 ooxx, oxox, oxxo, xoxo, xxoo, xoox 6 × 6 = 36
3 oxxx, xoxx, xxox, xxxo 4 × 4 = 16
4 xxxx 1 × 1 = 1
Total 70

What do you notice about all these combinations?

Notes on Fisher’s Exact Test

Applications of Fisher’s Exact Test

Geistlinger et al. Toward a gold standard for benchmarking gene set enrichment analysis. Brief. Bioinform. (2020) doi:10.1093/bib/bbz158

Chi-squared test

Application to GWAS

##     disease     genotype  
##  control:220   AA/Aa:204  
##  cases  : 30   aa   : 46

Apply the chi-squared test

table(disease, genotype)
##          genotype
## disease   AA/Aa  aa
##   control   184  36
##   cases      20  10
chisq.test(disease, genotype)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456
chisq.test(disease, genotype, simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.03998

Calculate odds ratio

library(epitools)
epitools::oddsratio(genotype, disease, method = "wald")$measure
##          odds ratio with 95% C.I.
## Predictor estimate    lower    upper
##     AA/Aa 1.000000       NA       NA
##     aa    2.555556 1.104441 5.913274

Note on factor reference levels

Use the relevel() function if you want to change the reference level of a factor, or dplyr::recode_factor():

epitools::oddsratio(relevel(genotype, "aa"), disease)$measure
##          odds ratio with 95% C.I.
## Predictor  estimate     lower     upper
##     aa    1.0000000        NA        NA
##     AA/Aa 0.3908154 0.1705085 0.9426884

(the default is whichever level is first alphabetically!)

Summary - two categorical variables

Resampling methods

Big classes of resampling methods

Permutation test

Response (y) variable is permuted to guarantee true \(H_0\):

nrg1521-i1.gif

Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6, 95-108 (February 2005).

Permutation test

K-fold cross-validation: setting

three-fold.png

Three-fold cross-validation

Definitions: Training, Validation, Test

https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets

Tuning: training set and validation set error

ISLR Fig 2.9 U curve

ISLR Figure 2.9: Training set vs. Validation set error: the U

James, Witten, Hastie, Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2014. https://www.statlearning.com/

FIGURE 2.9. Left: Data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.

K-fold cross-validation: algorithm

  1. Randomly sample \(1/K\) observations (without replacement) as the validation set
  2. Use remaining samples as the training set
  3. Fit model on the training set, estimate accuracy on the validation set
  4. Repeat \(K\) times, not using the same validation samples
  5. Average validation accuracy from each of the validation sets

Variability in cross-validation

ISLR_Fig52.png

ISLR Figure 5.2: Variability in 2-fold cross-validation

James, Witten, Hastie, Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2014. https://www.statlearning.com/

FIGURE 5.2. The validation set approach was used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower. Left: Validation error estimates for a single split into training and validation data sets. Right: The validation method was repeated ten times, each time using a different random split of the observations into a training set and a validation set. This illustrates the variability in the estimated test MSE that results from this approach.

Cross-validation summary

Summary of The Bootstrap

The Bootstrap (schematic)

ISLR_Fig511.png

Exploratory data analysis

Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

Quantile Quantile Plots

Example: Quantile Quantile Plots

Boxplots: About

Boxplots: Example

Three different views of a continuous variable

Scatterplots And Correlation: About

Scatterplots And Correlation: Example

Exploratory data analysis in high dimensions

if(!require(GSE5859Subset)){
  BiocManager::install("genomicsclass/GSE5859Subset")
}
## Loading required package: GSE5859Subset
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'GSE5859Subset'
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## '?repositories' for details
## 
## replacement repositories:
##     CRAN: https://cloud.r-project.org
## Bioconductor version 3.15 (BiocManager 1.30.18), R 4.2.0 (2022-04-22)
## Installing github package(s) 'genomicsclass/GSE5859Subset'
## Downloading GitHub repo genomicsclass/GSE5859Subset@HEAD
## * checking for file ‘/tmp/RtmpDzZT2K/remotes111e4e5ebc9aee/genomicsclass-GSE5859Subset-8ada5f4/DESCRIPTION’ ... OK
## * preparing ‘GSE5859Subset’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * creating default NAMESPACE file
## * building ‘GSE5859Subset_1.0.tar.gz’
## Old packages: 'terra'
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix"     "array"      "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

Volcano plots: Setup

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression, g))
##    user  system elapsed 
##   0.006   0.000   0.005
pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

Volcano plots: Example

par(mar = c(4, 4, 0, 0))
plot(results$dm,
     -log10(results$p.value),
     xlab = "Effect size (difference in group means)",
     ylab = "- log (base 10) p-values")
abline(h = -log10(0.05 / nrow(geneExpression)), col = "red")
legend("bottomright",
       lty = 1,
       col = "red",
       legend = "Bonferroni = 0.05")

Volcano plots: Summary

P-value histograms: Setup

m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1)
randomData <- matrix(rnorm(n * m), m, n)
nullpvals <- rowttests(randomData, g)$p.value

P-value histograms: Example

P-value histograms: Example 2 (permutation)

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms: Summary

All are available from the sva Bioconductor package

MA plot

rafalib::mypar(1, 2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot: Summary

Heatmaps

suppressPackageStartupMessages(library(ComplexHeatmap))
keep <- rank(apply(geneExpression, 1, var)) <= 100  # 500 most variable genes
ge <- geneExpression[keep, ]
ge <- t(scale(t(ge))) #scale
rownames(ge) <- NULL; colnames(ge) <- NULL
chr <- sub("chr", "", geneAnnotation$CHR)
chr[is.na(chr)] <- "Un"
chr <- factor(chr, levels = c(1:22, "Un", "X", "Y"))
chrcols <- list(chromosome = c(colorRampPalette(c("lightgrey", "black"))(22), "white", "pink", "blue"))
names(chrcols[["chromosome"]]) <- c(1:22, "Un", "X", "Y")
row_ha <- rowAnnotation(chromosome = chr[keep], col = chrcols, na_col = "white")
column_ha <- HeatmapAnnotation(ethnicity = sampleInfo$ethnicity, group = factor(sampleInfo$group),
                               col = list(ethnicity = c("ASN"="lightgrey", "CEU"="darkgrey"),
                                          group = c("0" = "brown", "1" = "blue")))
Heatmap(ge, use_raster = TRUE, top_annotation = column_ha, right_annotation = row_ha)

Heatmaps: Summary

Colors

Colors (cont’d)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

Lab