0.1 Session 2 outline

1 Hypothesis testing for categorical variables

1.1 Lady Tasting Tea

  • The Lady in question claimed to be able to tell whether the tea or the milk was added first to a cup
  • Fisher proposed to give her eight cups, four of each variety, in random order
    • the Lady is fully informed of the experimental method
    • \(H_0\): the Lady has no ability to tell which was added first
Lady tasting tea

Source: https://en.wikipedia.org/wiki/Lady_tasting_tea

1.2 Fisher’s Exact Test

p-value is the probability of the observed number of successes, or more, under \(H_0\)

Tea-Tasting Distribution
Success count Permutations of selection Number of permutations
0 oooo 1 × 1 = 1
1 ooox, ooxo, oxoo, xooo 4 × 4 = 16
2 ooxx, oxox, oxxo, xoxo, xxoo, xoox 6 × 6 = 36
3 oxxx, xoxx, xxox, xxxo 4 × 4 = 16
4 xxxx 1 × 1 = 1
Total 70

What do you notice about all these combinations?

1.3 Notes on Fisher’s Exact Test

  • Can also be applied to \(r x c\) tables
  • Remember that the margins of the table are fixed by design
  • Also referred to as the Hypergeometric Test
  • Exact p-values are difficult (and unnecessary) for large samples
    • fisher.test(x, y = NULL, etc, simulate.p.value = FALSE)

1.4 Applications of Fisher’s Exact Test

  • Has been applied (with peril) to gene set analysis, e.g.:
    • 10 of my top 100 genes are annotated with the cytokinesis GO term
    • 465 of 21,000 human genes are annotated with the cytokinesis GO term
    • Are my top 100 genes enriched for cytokinesis process?
  • Problems with this analysis:
    • Main problem: top-n genes tend to be correlated, so their selections are not independent trials
    • Secondary: does not match design for \(H_0\)
  • Alternative: permutation test repeating all steps

1.5 Chi-squared test

  • Test of independence for rxc table (two categorical variables)
  • Does not assume the margins are fixed by design
    • i.e., the number of cups of tea with milk poured first can be random, and the Lady doesn’t know how many
    • more common in practice
    • classic genomics example is GWAS
  • \(H_0\): the two variables are independent
  • \(H_A\): there is an association between the variables

1.6 Application to GWAS

  • Interested in association between disease and some potential causative factor
  • In a case-control study, the numbers of cases and controls are fixed, but the other variable is not
  • In a prospective or longitudinal cohort study, neither the number of cases or the other variable are fixed
##     disease     genotype  
##  control:220   AA/Aa:204  
##  cases  : 30   aa   : 46

1.7 Apply the chi-squared test

table(disease, genotype)
##          genotype
## disease   AA/Aa  aa
##   control   184  36
##   cases      20  10
chisq.test(disease, genotype)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456
chisq.test(disease, genotype, simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.04248

1.8 Calculate odds ratio

Note that the result says nothing about how the departure from independence occurs

library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure
##          odds ratio with 95% C.I.
## Predictor estimate    lower    upper
##     AA/Aa 1.000000       NA       NA
##     aa    2.555556 1.104441 5.913274

1.9 Note on factor reference levels

Use the relevel() function if you want to change the reference level of a factor:

epitools::oddsratio(relevel(genotype, "aa"), disease)$measure
##          odds ratio with 95% C.I.
## Predictor  estimate     lower     upper
##     aa    1.0000000        NA        NA
##     AA/Aa 0.3908154 0.1705085 0.9426884

(the default is whichever level is first alphabetically!)

1.10 Summary - two categorical variables

  • Choice between Fisher’s Exact Test and Chi-square test is determined by experimental design
  • If any counts in the table are less than 5, can use simulate.p.value=TRUE to get a more accurate p-value from the chi-square test
  • Both assume independent observations (important!!)
  • In a GWAS, correction for multiple testing is required

2 Resampling methods

2.1 Big classes of resampling methods

  • Resampling involves simulating repeated samples from the one available sample
    • Permutation tests: shuffling labels to generate an empirical null distribution
    • Cross-validation: generate training and test sets without replacement
    • Bootstrap: generate samples of size \(n\), with replacement

2.2 Permutation test

Response (y) variable is permuted to guarantee true \(H_0\):

nrg1521-i1.gif

Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6, 95-108 (February 2005).

2.3 Permutation test

  • Calculate the test statistic for each permutation
    • 999 is a typical number
  • P-value is the quantile of the real test statistic in the “empirical null distribution” of permutation test statistics
  • Permutations tests still have assumptions:
    • samples are assumed to be independent and “exchangeable”
    • hidden structure such as families can cause anti-conservative p-values

2.4 K-fold cross-validation: setting

  • Setting: prediction (as opposed to inference)
  • Purpose: to avoid reporting inflated prediction accuracy due to over-fitting

three-fold.png

Three-fold cross-validation

2.5 K-fold cross-validation: algorithm

  • Create \(K\) “folds” from the sample of size \(n\), \(K \le n\)
  1. Randomly sample \(1/K\) observations (without replacement) as the validation set
  2. Use remaining samples as the training set
  3. Fit model on the training set, estimate accuracy on the validation set
  4. Repeat \(K\) times, not using the same validation samples
  5. Average validation accuracy from each of the validation sets

2.6 Variability in cross-validation

ISLR_Fig52.png

ISLR Figure 5.2: Variability in 2-fold cross-validation

James, Witten, Hastie, Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2014.

http://www-bcf.usc.edu/~gareth/ISL/

2.7 Cross-validation summary

  • In prediction modeling, we think of data as training or test
    • cross-validation is a way to estimate test set error from a training set
  • Training set error always decreases with more complex (flexible) models
  • Test set error as a function of model flexibility tends to be U-shaped
    • the low point of the U represents the most appropriate amount of model complexity

2.8 Summary of The Bootstrap

  • The Bootstrap is a very general approach to estimating uncertainty, e.g. standard errors
  • Can be applied to a wide range of models and statistics
  • Robust to outliers and violations of model assumptions
  • The basic approach:
    1. Using the available sample (size \(n\)), generate a new sample of size \(n\) (with replacement)
    2. Calculate the statistic of interest
    3. Repeat
    4. Use repeated experiments to estimate the variability of your statistic of interest

2.9 The Bootstrap (schematic)

ISLR_Fig511.png

3 Exploratory data analysis

3.1 Introduction

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

  • Discover biases, systematic errors and unexpected variability in data
  • Graphical approach to detecting these issues
  • Represents a first step in data analysis and guides hypothesis testing
  • Opportunities for discovery in the outliers

3.2 Quantile Quantile Plots

  • Quantiles divide a distribution into equally sized bins
  • Division into 100 bins gives percentiles
  • Quantiles of a theoretical distribution are plotted against an experimental distribution
    • alternatively, quantiles of two experimental distributions
  • Given a perfect fit, \(x=y\)
  • Useful in determining data distribution (normal, t, etc.)

3.3 Example: Quantile Quantile Plots

3.4 Boxplots: About

  • Provide a graph that is easy to interpret where data is not normally distributed
  • Would be an appropriate choice to explore income data, as distribution is highly skewed
  • Particularly informative in relation to outliers and range
  • Possible to compare multiple distributions side by side

3.5 Boxplots: Example

Three different views of a continuous variable

3.6 Scatterplots And Correlation: About

  • For two continuous variables, scatter plot and calculation of correlation is useful
  • Provides a graphical and numeric estimation of relationships
  • Quick and easy with plot() and cor()

3.7 Scatterplots And Correlation: Example

3.8 Exploratory data analysis in high dimensions

BiocManager("genomicsclass/GSE5859Subset")  #install from Github
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix"     "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
##      [,1] [,2]
## [1,] 8793   24
## [2,]   24    4
head(sampleInfo)
##     ethnicity       date         filename group
## 107       ASN 2005-06-23 GSM136508.CEL.gz     1
## 122       ASN 2005-06-27 GSM136530.CEL.gz     1
## 113       ASN 2005-06-27 GSM136517.CEL.gz     1
## 163       ASN 2005-10-28 GSM136576.CEL.gz     1
## 153       ASN 2005-10-07 GSM136566.CEL.gz     1
## 161       ASN 2005-10-07 GSM136574.CEL.gz     1

3.9 Volcano plots: Setup

T-test for every row (gene) of gene expression matrix:

library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))
##    user  system elapsed 
##   0.013   0.001   0.020
pvals <- results$p.value

Note that these 8,793 tests are done in about 0.01s

3.10 Volcano plots: Example

par(mar=c(4, 4, 0, 0))
plot(results$dm, -log10(results$p.value),
     xlab="Effect size (difference in group means)",
     ylab="- log (base 10) p-values")
abline(h=-log10(0.05 / nrow(geneExpression)), col="red")
legend("bottomright", lty=1, col="red", legend="Bonferroni = 0.05")

3.11 Volcano plots: Summary

  • Many small p-values with small effect size indicate low within-group variability
  • Inspect for asymmetry
  • Can color points by significance threshold

3.12 P-value histograms: Setup

  • If all null hypotheses are true, expect a flat histogram of p-values:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1); randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value

3.13 P-value histograms: Example

3.14 P-value histograms: Example 2 (permutation)

Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:

permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

3.15 P-value histograms: Summary

  • Give a quick look at how many significant p-values there may be
  • When using permuted labels, can exposes non-independence among the samples
    • can be due to batch effects or family structure
  • Most common approaches for correcting batch effects are:
    • ComBat: corrects for known batch effects by linear model), and
    • sva: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
    • correction using control (housekeeping) genes

All are available from the sva Bioconductor package

3.16 MA plot

  • just a scatterplot rotated 45\(^o\)
rafalib::mypar(1,2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

3.17 MA plot: Summary

  • useful for quality control of high-dimensional data
  • plot all data values for a sample against another sample or a median “pseudosample”
  • affyPLM::MAplots better MA plots
    • adds a smoothing line to highlight departures from horizontal line
    • plots a “cloud” rather than many data points

3.18 Heatmaps

Detailed representation of high-dimensional dataset

ge = geneExpression[sample(1:nrow(geneExpression), 200), ]
pheatmap::pheatmap(ge, scale="row", show_colnames = F, show_rownames = F)

3.19 Heatmaps: Summary

  • Clustering becomes slow and memory-intensivefor thousands of rows
    • probably too detailed for thousands of rows
  • can show co-expressed genes, groups of samples

3.20 Colors

  • Types of color pallettes:
    • sequential: shows a gradient
    • diverging: goes in two directions from a center
    • qualitative: for categorical variables
  • Keep color blindness in mind (10% of all men)

3.21 Colors (cont’d)

Combination of RColorBrewer package and colorRampPalette() can create anything you want

3.22 Plots To Avoid

“Pie charts are a very bad way of displaying information.” - R Help

  • Always avoid pie charts
  • Avoid doughnut charts too
  • Avoid pseudo 3D and most Excel defaults
  • Effective graphs use color judiciously