Hypothesis testing for categorical variables
- Hypothesis testing and confidence intervals for one or two continuous variables:
- Z-test, t-test, correlation
- Two binary variables:
- Fisher’s Exact Test
- Pearson’s Chi-square Test
Lady Tasting Tea
- The Lady in question claimed to be able to tell whether the tea or the milk was added first to a cup
- Fisher proposed to give her eight cups, four of each variety, in random order
- the Lady is fully informed of the experimental method
- \(H_0\): the Lady has no ability to tell which was added first
Source: https://en.wikipedia.org/wiki/Lady_tasting_tea
Fisher’s Exact Test
p-value is the probability of the observed number of successes, or more, under \(H_0\)
Tea-Tasting Distribution
| Success count | Permutations of selection | Number of permutations |
|---|---|---|
| 0 | oooo | 1 × 1 = 1 |
| 1 | ooox, ooxo, oxoo, xooo | 4 × 4 = 16 |
| 2 | ooxx, oxox, oxxo, xoxo, xxoo, xoox | 6 × 6 = 36 |
| 3 | oxxx, xoxx, xxox, xxxo | 4 × 4 = 16 |
| 4 | xxxx | 1 × 1 = 1 |
| Total | | 70 |
What do you notice about all these combinations?
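These counts are hypergeometric; a minimal sketch reproducing the distribution in R (4 milk-first cups, 4 tea-first cups, 4 cups selected):
probs <- dhyper(0:4, m = 4, n = 4, k = 4)   # P(0, 1, ..., 4 correct selections)
probs * choose(8, 4)                        # 1 16 36 16 1, out of choose(8, 4) = 70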
Notes on Fisher’s Exact Test
- Can also be applied to \(r \times c\) tables
- Remember that the margins of the table are fixed by design
- Also referred to as the Hypergeometric Test
- Exact p-values are difficult (and unnecessary) for large samples
fisher.test(x, y = NULL, ..., simulate.p.value = FALSE)
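For example, a minimal sketch for a hypothetical outcome in which the Lady identifies 3 of the 4 milk-first cups correctly:
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"),
                                     Truth = c("Milk", "Tea")))
fisher.test(TeaTasting, alternative = "greater")  # one-sided p = (16 + 1)/70, about 0.24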
Applications of Fisher’s Exact Test
- Has been applied (with peril) to gene set analysis, e.g.:
- 10 of my top 100 genes are annotated with the cytokinesis GO term
- 465 of 21,000 human genes are annotated with the cytokinesis GO term
- Are my top 100 genes enriched for cytokinesis process?
- Problems with this analysis:
- Main problem: top-n genes tend to be correlated, so their selections are not independent trials
- Secondary: does not match design for \(H_0\)
- Alternative: permutation test repeating all steps
Chi-squared test
- Test of independence for \(r \times c\) table (two categorical variables)
- Does not assume the margins are fixed by design
- i.e., the number of cups of tea with milk poured first can be random, and the Lady doesn’t know how many
- more common in practice
- classic genomics example is GWAS
- \(H_0\): the two variables are independent
- \(H_A\): there is an association between the variables
Application to GWAS
- Interested in association between disease and some potential causative factor
- In a case-control study, the numbers of cases and controls are fixed, but the other variable is not
- In a prospective or longitudinal cohort study, neither the number of cases nor the other variable is fixed
##     disease     genotype 
##  control:220   AA/Aa:204 
##  cases  : 30   aa   : 46 
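The commands that created these factors are not shown; a hypothetical reconstruction with the same counts (names and ordering are assumptions):
disease  <- factor(rep(c("control", "cases"), times = c(220, 30)),
                   levels = c("control", "cases"))
genotype <- factor(rep(c("AA/Aa", "aa", "AA/Aa", "aa"),
                       times = c(184, 36, 20, 10)),   # matches the 2 x 2 table below
                   levels = c("AA/Aa", "aa"))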
Apply the chi-squared test
table(disease, genotype)
##          genotype
## disease   AA/Aa aa
##   control   184 36
##   cases      20 10
chisq.test(disease, genotype)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: disease and genotype
## X-squared = 3.9963, df = 1, p-value = 0.0456
chisq.test(disease, genotype, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: disease and genotype
## X-squared = 5.0634, df = NA, p-value = 0.04248
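For reference, the statistic can be computed by hand from the table above; a minimal sketch (without Yates' continuity correction it matches the 5.0634 reported under simulation):
obs <- matrix(c(184, 20, 36, 10), nrow = 2)          # rows: control, cases; cols: AA/Aa, aa
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
x2 <- sum((obs - expected)^2 / expected)             # 5.0634, no continuity correction
pchisq(x2, df = 1, lower.tail = FALSE)               # asymptotic p-value, about 0.024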
Calculate odds ratio
Note that the chi-squared result says nothing about how the departure from independence occurs; the odds ratio describes the direction and magnitude of the association
library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## AA/Aa 1.000000 NA NA
## aa 2.555556 1.104441 5.913274
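A minimal by-hand check of the Wald estimate and interval from the 2 x 2 table:
or <- (10 * 184) / (20 * 36)                         # 2.56: odds ratio for aa vs AA/Aa
se <- sqrt(1/184 + 1/36 + 1/20 + 1/10)               # Wald SE of log(or)
exp(log(or) + c(-1.96, 1.96) * se)                   # about 1.10 to 5.91, as above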
Note on factor reference levels
Use the relevel() function if you want to change the reference level of a factor:
epitools::oddsratio(relevel(genotype, "aa"), disease)$measure
## odds ratio with 95% C.I.
## Predictor estimate lower upper
## aa 1.0000000 NA NA
## AA/Aa 0.3908154 0.1705085 0.9426884
(the default is whichever level is first alphabetically!)
Summary - two categorical variables
- Choice between Fisher’s Exact Test and Chi-square test is determined by experimental design
- If any counts in the table are less than 5, can use simulate.p.value=TRUE to get a more accurate p-value from the chi-square test
- Both assume independent observations (important!!)
- In a GWAS, correction for multiple testing is required
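As a reminder, a minimal sketch of multiple-testing adjustment with hypothetical p-values (gwas_p is made up for illustration):
gwas_p <- c(1e-8, 2e-6, 0.01, 0.2)                   # hypothetical p-values
p.adjust(gwas_p, method = "bonferroni")              # controls family-wise error rate
p.adjust(gwas_p, method = "BH")                      # controls false discovery rate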
Resampling methods
Big classes of resampling methods
- Resampling involves simulating repeated samples from the one available sample
- Permutation tests: shuffling labels to generate an empirical null distribution
- Cross-validation: generate training and test sets without replacement
- Bootstrap: generate samples of size \(n\), with replacement
Permutation test
Response (y) variable is permuted to guarantee true \(H_0\):
Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6, 95-108 (February 2005).
Permutation test
- Calculate the test statistic for each permutation
- P-value is the proportion of the “empirical null distribution” of permutation test statistics that is at least as extreme as the real test statistic
- Permutations tests still have assumptions:
- samples are assumed to be independent and “exchangeable”
- hidden structure such as families can cause anti-conservative p-values
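A minimal permutation test sketch on hypothetical two-group data, using the difference in means as the test statistic:
set.seed(1)
y <- rnorm(20); grp <- rep(c("A", "B"), each = 10)   # hypothetical data and group labels
obs <- diff(tapply(y, grp, mean))                    # observed difference in means
perm <- replicate(1000, diff(tapply(y, sample(grp), mean)))  # shuffle labels to enforce H0
mean(abs(perm) >= abs(obs))                          # two-sided empirical p-value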
K-fold cross-validation: setting
- Setting: prediction (as opposed to inference)
- Purpose: to avoid reporting inflated prediction accuracy due to over-fitting

Three-fold cross-validation
K-fold cross-validation: algorithm
- Create \(K\) “folds” from the sample of size \(n\), \(K \le n\)
- Randomly sample \(1/K\) of the observations (without replacement) as the validation set
- Use remaining samples as the training set
- Fit model on the training set, estimate accuracy on the validation set
- Repeat \(K\) times, using a different fold as the validation set each time
- Average validation accuracy from each of the validation sets
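A minimal sketch of this algorithm, using a linear model on hypothetical data and mean squared error as the accuracy measure:
set.seed(1)
n <- 100; K <- 3
x <- rnorm(n); y <- 2 * x + rnorm(n)                 # hypothetical data
fold <- sample(rep(1:K, length.out = n))             # random fold assignment
cv_mse <- sapply(1:K, function(k) {
  fit <- lm(y ~ x, subset = fold != k)               # fit on the training folds
  mean((y[fold == k] - predict(fit, data.frame(x = x[fold == k])))^2)
})
mean(cv_mse)                                         # cross-validated estimate of test error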
Variability in cross-validation

ISLR Figure 5.2: Variability in 2-fold cross-validation
James, Witten, Hastie, Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2014.
http://www-bcf.usc.edu/~gareth/ISL/
Cross-validation summary
- In prediction modeling, we think of data as training or test
- cross-validation is a way to estimate test set error from a training set
- Training set error always decreases with more complex (flexible) models
- Test set error as a function of model flexibility tends to be U-shaped
- the low point of the U represents the most appropriate amount of model complexity
Summary of The Bootstrap
- The Bootstrap is a very general approach to estimating uncertainty, e.g. standard errors
- Can be applied to a wide range of models and statistics
- Robust to outliers and violations of model assumptions
- The basic approach:
- Using the available sample (size \(n\)), generate a new sample of size \(n\) (with replacement)
- Calculate the statistic of interest
- Repeat
- Use repeated experiments to estimate the variability of your statistic of interest
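A minimal sketch of this approach, estimating the standard error of the median of a hypothetical sample:
set.seed(1)
x <- rexp(50)                                        # hypothetical sample, n = 50
boot_meds <- replicate(1000, median(sample(x, replace = TRUE)))  # resample with replacement
sd(boot_meds)                                        # bootstrap standard error of the median
quantile(boot_meds, c(0.025, 0.975))                 # simple percentile interval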
The Bootstrap (schematic)
Exploratory data analysis
Introduction
“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey
- Discover biases, systematic errors and unexpected variability in data
- Graphical approach to detecting these issues
- Represents a first step in data analysis and guides hypothesis testing
- Opportunities for discovery in the outliers
Quantile Quantile Plots
- Quantiles divide a distribution into equally sized bins
- Division into 100 bins gives percentiles
- Quantiles of a theoretical distribution are plotted against an experimental distribution
- alternatively, quantiles of two experimental distributions
- Given a perfect fit, points fall on the line \(x = y\)
- Useful in determining data distribution (normal, t, etc.)
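A minimal sketch comparing hypothetical heavy-tailed data to the normal distribution:
set.seed(1)
x <- rt(200, df = 3)                                 # hypothetical heavy-tailed data
qqnorm(x); qqline(x)                                 # points bend away from the line in the tails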
Example: Quantile Quantile Plots

Boxplots: About
- Provide a graph that is easy to interpret when data are not normally distributed
- Would be an appropriate choice to explore income data, as distribution is highly skewed
- Particularly informative in relation to outliers and range
- Possible to compare multiple distributions side by side
Boxplots: Example
Three different views of a continuous variable
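One way to produce such side-by-side views of hypothetical skewed data (the panels in the actual figure may differ):
set.seed(1)
income <- rlnorm(500, meanlog = 10, sdlog = 1)       # hypothetical right-skewed data
par(mfrow = c(1, 3))
hist(income); boxplot(income); qqnorm(income)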
Scatterplots And Correlation: About
- For two continuous variables, scatter plot and calculation of correlation is useful
- Provides a graphical and numeric estimation of relationships
- Quick and easy with plot() and cor()
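A minimal sketch on hypothetical correlated data:
set.seed(1)
x <- rnorm(100); y <- x + rnorm(100)                 # hypothetical correlated variables
plot(x, y)
cor(x, y)                                            # Pearson correlation by default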
Scatterplots And Correlation: Example

Exploratory data analysis in high dimensions
BiocManager::install("genomicsclass/GSE5859Subset") # install from GitHub
library(GSE5859Subset)
data(GSE5859Subset) ##this loads three tables
c(class(geneExpression), class(sampleInfo))
## [1] "matrix" "data.frame"
rbind(dim(geneExpression), dim(sampleInfo))
## [,1] [,2]
## [1,] 8793 24
## [2,] 24 4
head(sampleInfo)
## ethnicity date filename group
## 107 ASN 2005-06-23 GSM136508.CEL.gz 1
## 122 ASN 2005-06-27 GSM136530.CEL.gz 1
## 113 ASN 2005-06-27 GSM136517.CEL.gz 1
## 163 ASN 2005-10-28 GSM136576.CEL.gz 1
## 153 ASN 2005-10-07 GSM136566.CEL.gz 1
## 161 ASN 2005-10-07 GSM136574.CEL.gz 1
Volcano plots: Setup
T-test for every row (gene) of gene expression matrix:
library(genefilter)
g <- factor(sampleInfo$group)
system.time(results <- rowttests(geneExpression,g))
##    user  system elapsed 
##   0.013   0.001   0.020
pvals <- results$p.value
Note that these 8,793 tests are done in about 0.01s
Volcano plots: Example
par(mar=c(4, 4, 0, 0))
plot(results$dm, -log10(results$p.value),
xlab="Effect size (difference in group means)",
ylab="- log (base 10) p-values")
abline(h=-log10(0.05 / nrow(geneExpression)), col="red")
legend("bottomright", lty=1, col="red", legend="Bonferroni = 0.05")

Volcano plots: Summary
- Many small p-values with small effect size indicate low within-group variability
- Inspect for asymmetry
- Can color points by significance threshold
P-value histograms: Setup
- If all null hypotheses are true, expect a flat histogram of p-values:
m <- nrow(geneExpression)
n <- ncol(geneExpression)
set.seed(1); randomData <- matrix(rnorm(n*m),m,n)
nullpvals <- rowttests(randomData,g)$p.value
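These null p-values can be plotted with:
hist(nullpvals)                                      # approximately flat when all nulls are true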
P-value histograms: Example

P-value histograms: Example 2 (permutation)
Note that permuting these data doesn’t produce an ideal null p-value histogram due to batch effects:
permg <- sample(g)
permresults <- rowttests(geneExpression, permg)
hist(permresults$p.value)

P-value histograms: Summary
- Give a quick look at how many significant p-values there may be
- When using permuted labels, can expose non-independence among the samples
- can be due to batch effects or family structure
- Most common approaches for correcting batch effects are:
- ComBat: corrects for known batch effects by linear model
- sva: creates surrogate variables for unknown batch effects, corrects the structure of permutation p-values
- correction using control (housekeeping) genes
All are available from the sva Bioconductor package
MA plot
- just a scatterplot rotated \(45^\circ\)
rafalib::mypar(1,2)
pseudo <- apply(geneExpression, 1, median)
plot(geneExpression[, 1], pseudo)
plot((geneExpression[, 1] + pseudo) / 2, (geneExpression[, 1] - pseudo))

MA plot: Summary
- useful for quality control of high-dimensional data
- plot all data values for a sample against another sample or a median “pseudosample”
- affyPLM::MAplots: better MA plots
- adds a smoothing line to highlight departures from horizontal line
- plots a “cloud” rather than many data points
Heatmaps
Detailed representation of high-dimensional dataset
ge = geneExpression[sample(1:nrow(geneExpression), 200), ]
pheatmap::pheatmap(ge, scale="row", show_colnames = F, show_rownames = F)

Heatmaps: Summary
- Clustering becomes slow and memory-intensive for thousands of rows
- probably too detailed for thousands of rows
- can show co-expressed genes, groups of samples
Colors
- Types of color palettes:
- sequential: shows a gradient
- diverging: goes in two directions from a center
- qualitative: for categorical variables
- Keep color blindness in mind (10% of all men)
Colors (cont’d)
Combination of the RColorBrewer package and colorRampPalette() can create anything you want
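For example, a minimal sketch interpolating a sequential Brewer palette to 100 colors (the palette name is an arbitrary choice):
library(RColorBrewer)
pal <- colorRampPalette(brewer.pal(9, "Blues"))(100) # 100-step sequential gradient
image(matrix(1:100, ncol = 1), col = pal, axes = FALSE)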

Plots To Avoid
“Pie charts are a very bad way of displaying information.” - R Help
- Always avoid pie charts
- Avoid doughnut charts too
- Avoid pseudo 3D and most Excel defaults
- Effective graphs use color judiciously
Links
- A built html version of this lecture is available.
- The source R Markdown is also available from Github.