1 Welcome and outline
- 1.1 Learning objectives
- 1.2 A bit about me - research interests
2 R - basic usage
3 Introduction to the R language
4 dplyr
5 Random Variables and Distributions
6 Hypothesis Testing
7 Hypothesis testing for categorical variables
8 Summary - hypothesis testing
9 Lab Exercises

1 Welcome and outline

Some essential R classes and related Bioconductor classes
Introduction to dplyr
Random variables and distributions
Hypothesis testing for one or two samples (t-test, Wilcoxon test, etc)
Hypothesis testing for categorical variables (Fisher’s Test, Chi-square test)
Book chapters 0 and 1

1.1 Learning objectives

Perform basic data manipulation/exploration in R and dplyr
Identify key R and Bioconductor data classes
Define random variables and distinguish them from non-random ones
Distinguish population from sampling distributions
Interpret t-tests and confidence intervals
Identify when to use Fisher’s Exact Test and Chi-square Test

1.2 A bit about me - research interests

High-dimensional statistics (more variables than observations)
Predictive modeling and methodology for validation
Metagenomic profiling of the human microbiome
Multi-omic data analysis for cancer
http://www.waldronlab.io

2 R - basic usage

2.1 Tips for learning R

Pseudo code	Example code
library(packagename)	library(dplyr)
?functionname	?select
?package::functionname	?dplyr::select
? ‘Reserved keyword or symbol’	? ‘%>%’
??searchforpossiblyexistingfunctionandortopic	??simulate
help(package = “loadedpackage”)	help(“dplyr”)
browseVignettes(“packagename”)	browseVignettes(“dplyr”)

Slide credit: Marcel Ramos

2.2 Installing Packages the Bioconductor Way

See the Bioconductor site for more info

Pseudo code:

install.packages("BiocManager") #from CRAN
packages <- c("packagename", "githubuser/repository", "biopackage")
BiocManager::install(packages)
BiocManager::valid()  #check validity of installed packages

Works for CRAN, GitHub, and Bioconductor packages!

2.3 Note about installing `devtools`

Useful for building packages
Download and install from GitHub, directly or via BiocManager::install()
Installation dependent on OS (Rtools for Windows, OSX installation requires XCode and Command Line Tools, GNU/Linux “just works”)

3 Introduction to the R language

3.1 Base R Data Types: atomic vectors

numeric (set seed to sync random number generator):

set.seed(1)
rnorm(5)

## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

integer:

sample( 1:5 )

## [1] 2 1 3 4 5

logical:

1:3 %in% 3

## [1] FALSE FALSE  TRUE

character:

c("yes", "no")

## [1] "yes" "no"

factor:

factor(c("yes", "no"))

## [1] yes no 
## Levels: no yes

Demo: integer-like properties, relevel()

3.2 Base R Data Types: missingness

Missing Values and others - IMPORTANT

c(NA, NaN, -Inf, Inf)

## [1]   NA  NaN -Inf  Inf

class() to find the class of a variable.

3.3 Base R Data Types: matrix, list, data.frame

matrix:

matrix(1:9, nrow = 3)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

The list is a non-atomic vector:

measurements <- c( 1.3, 1.6, 3.2, 9.8, 10.2 )
parents <- c( "Parent1.name", "Parent2.name" )
my.list <- list( measurements, parents)
my.list

## [[1]]
## [1]  1.3  1.6  3.2  9.8 10.2
## 
## [[2]]
## [1] "Parent1.name" "Parent2.name"

The data.frame has list-like and matrix-like properties:

x <- 11:16
y <- seq(0,1,.2)
z <- c( "one", "two", "three", "four", "five", "six" )
a <- factor( z )
my.df <- data.frame(x,y,z,a, stringsAsFactors = FALSE)

3.4 Bioconductor S4 vectors: DataFrame

Bioconductor (www.bioconductor.org) defines its own set of vectors using the S4 formal class system DataFrame: like a data.frame but more flexible. columns can be any atomic vector type:
- GenomicRanges objects
- Rle (run-length encoding)

suppressPackageStartupMessages(library(S4Vectors))
df <- DataFrame(var1 = Rle(c("a", "a", "b")),
          var2 = 1:3)
metadata(df) <- list(father="Levi is my father")
df

## DataFrame with 3 rows and 2 columns
##    var1      var2
##   <Rle> <integer>
## 1     a         1
## 2     a         2
## 3     b         3

3.5 Bioconductor S4 vectors: List and derived classes

List(my.list)

## List of length 2

str(List(my.list))

## Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
##   ..@ listData       :List of 2
##   .. ..$ : num [1:5] 1.3 1.6 3.2 9.8 10.2
##   .. ..$ : chr [1:2] "Parent1.name" "Parent2.name"
##   ..@ elementType    : chr "ANY"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

suppressPackageStartupMessages(library(IRanges))
IntegerList(var1=1:26, var2=1:100)

## IntegerList of length 2
## [["var1"]] 1 2 3 4 5 6 7 8 9 10 11 12 ... 16 17 18 19 20 21 22 23 24 25 26
## [["var2"]] 1 2 3 4 5 6 7 8 9 10 11 ... 90 91 92 93 94 95 96 97 98 99 100

CharacterList(var1=letters[1:100], var2=LETTERS[1:26])

## CharacterList of length 2
## [["var1"]] a b c d e f g h i ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## [["var2"]] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

LogicalList(var1=1:100 %in% 5, var2=1:100 %% 2)

## LogicalList of length 2
## [["var1"]] FALSE FALSE FALSE FALSE TRUE ... FALSE FALSE FALSE FALSE FALSE
## [["var2"]] TRUE FALSE TRUE FALSE TRUE ... FALSE TRUE FALSE TRUE FALSE

3.6 Bioconductor S4 vectors: Biostrings

suppressPackageStartupMessages(library(Biostrings))
bstring = BString("I am a BString object")
bstring

##   21-letter "BString" instance
## seq: I am a BString object

dnastring = DNAString("TTGAAA-CTC-N")
dnastring

##   12-letter "DNAString" instance
## seq: TTGAAA-CTC-N

str(dnastring)

## Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. ..@ xp                    :<externalptr> 
##   .. .. ..@ .link_to_cached_object:<environment: 0x7ff37196c090> 
##   ..@ offset         : int 0
##   ..@ length         : int 12
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()

alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)

##          A          C          G          T      other 
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000

4 dplyr

4.1 Data Manipulation using `dplyr`

dplyr convention aims to ease cognitive burden
Function names are easy to remember:

select (Y)
mutate/transmute (add Ys / new Y)
filter (get Xs based on condition)
slice (get Xs specified)
summarise (reduce to single observation)
arrange (re-order observations)

4.2 `dplyr` example

library(nycflights13)
library(dplyr)
delays <- flights %>% 
  filter(!is.na(dep_delay)) %>%
  group_by(year, month, day, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)

4.3 `dplyr` example (cont’d)

hist(delays$delay, main="Mean hourly delay", xlab="Delay (hours)")

5 Random Variables and Distributions

5.1 Random Variables

A random variable: any characteristic that can be measured or categorized, and where any particular outcome is determined at least partially by chance.
Examples:
- number of new diabetes cases in population in a given year
- The weight of a randomly selected individual in a population
Types:
- Categorical random variable (e.g. disease / healthy)
- Discrete random variable (e.g. sequence read counts)
- Continuous random variable (e.g. normalized qPCR intensity)

5.2 Probability Distributions

We use probability distributions to describe the probability of all possible realizations of a random variable
In public health we use probability distributions to describe hypotheses or inference about the population
- because we normally cannot observe the entire population
In practice we study a sample selected from the population

5.3 Random Variables - examples

Normally distributed random variable with mean \(\mu = 0\) / standard deviation \(\sigma = 1\), and a sample of \(n=100\)

5.4 Random Variables - examples

Poisson distributed random variable (\(\lambda = 2\)), and a sample of \(n=100\).

5.5 Random Variables - examples

Negative Binomially distributed random variable (\(size=30, \mu=2\)), and a sample of \(n=100\).

5.6 Random Variables - examples

Binomial Distribution random variable (\(size=20, prob=0.25\)), and a sample of \(n=100\).
- use for binary outcomes

6 Hypothesis Testing

6.1 Population vs sampling distributions

Population distributions
- Each realization / point is an individual
Sampling distributions
- Each realization / point is a sample
- distribution depends on sample size
- large sample distributions are given by the Central Limit Theorem
Hypothesis testing is about sampling distributions
- Did my sample likely come from that distribution?

6.2 Logic of hypothesis testing

One Sample - observations come from one of two population distributions:

usual distribution that has been true in the past
a potentially new distribution induced by an intervention or changing condition

Two Sample - two samples are drawn, either:

from a single population
from two different populations

6.3 The t-tests

In a one-sample test, only a single sample is drawn:
- \(H_0: \mu = \mu_0\)
In a two-sample test, two samples are drawn independently:
- \(H_0: \mu_1 = \mu_2\)
A paired test is one sample of paired measurements, e.g.:
- individuals before and after treatment

6.4 When to use t-tests

\(\frac{{}\bar{x} - \mu}{s}\) and \(\frac{\bar{x_1} - \bar{x_2}}{s}\) are t-distributed if:
- the standard deviation \(s\) is estimated from the sample
- the population values are normally distributed
- the population values are slightly skewed but n > 15
- the population values are quite skewed but n > 30
Wilcoxon tests are an alternative for non-normal populations
- e.g. rank data
- data where ranks are more informative than means
- not when many ranks are arbitrary

7 Hypothesis testing for categorical variables

7.1 Lady Tasting Tea

The Lady in question claimed to be able to tell whether the tea or the milk was added first to a cup
Fisher proposed to give her eight cups, four of each variety, in random order
- the Lady is fully informed of the experimental method
- \(H_0\): the Lady has no ability to tell which was added first

Source: https://en.wikipedia.org/wiki/Lady_tasting_tea

7.2 Fisher’s Exact Test

p-value is the probability of the observed number of successes, or more, under \(H_0\)

Tea-Tasting Distribution
Success count	Permutations of selection	Number of permutations
0	oooo	1 × 1 = 1
1	ooox, ooxo, oxoo, xooo	4 × 4 = 16
2	ooxx, oxox, oxxo, xoxo, xxoo, xoox	6 × 6 = 36
3	oxxx, xoxx, xxox, xxxo	4 × 4 = 16
4	xxxx	1 × 1 = 1
Total		70

What do you notice about all these combinations?

7.3 Notes on Fisher’s Exact Test

Can also be applied to rxc tables
Remember that the margins of the table are fixed by design
Also referred to as the Hypergeometric Test
Exact p-values are difficult (and unnecessary) for large samples
- fisher.test(x, y = NULL, etc, simulate.p.value = FALSE)

7.4 Notes on Fisher’s Exact Test (cont’d)

Has been applied (with peril!) to gene set analysis, e.g.:
- 10 of my top 100 genes are annotated with the cytokinesis GO term
- 465 of 21,000 human genes are annotated with the cytokinesis GO term
- Are my top 100 genes enriched for cytokinesis process?
Problems with this analysis:
- Main problem: top-n genes tend to be correlated, so their selections are not independent trials
- Secondary: does not match design for \(H_0\)
Alternative: permutation test repeating all steps

7.5 Chi-squared test

Test of independence for rxc table (two categorical variables)
Does not assume the margins are fixed by design
- i.e., the number of cups of tea with milk poured first can be random, and the Lady doesn’t know how many
- more common in practice
- classic genomics example is GWAS
\(H_0\): the two variables are independent
\(H_A\): there is an association between the variables

7.6 Application to GWAS

Interested in association between disease and some potential causative factor
In a case-control study, the numbers of cases and controls are fixed, but the other variable is not
In a prospective or longitudinal cohort study, neither the number of cases or the other variable are fixed

disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
                levels=c("AA/Aa","aa"))
set.seed(1)
genotype <- sample(genotype)
dat <- data.frame(disease, genotype)
summary(dat)

##     disease     genotype  
##  control:220   AA/Aa:204  
##  cases  : 30   aa   : 46

7.7 Application to GWAS (cont’d)

table(disease, genotype)

##          genotype
## disease   AA/Aa  aa
##   control   181  39
##   cases      23   7

chisq.test(disease, genotype)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  disease and genotype
## X-squared = 0.24229, df = 1, p-value = 0.6226

chisq.test(disease, genotype, simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  disease and genotype
## X-squared = 0.5526, df = NA, p-value = 0.5777

7.8 Application to GWAS (cont’d)

Note that the result says nothing about how the departure from independence occurs

library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure

##          odds ratio with 95% C.I.
## Predictor estimate     lower    upper
##     AA/Aa 1.000000        NA       NA
##     aa    1.412486 0.5662511 3.523379

(the default is whichever level is first alphabetically!)

7.9 Summary - two categorical variables

Choice between Fisher’s Exact Test and Chi-square test is determined by experimental design
If any counts in the table are less than 5, can use simulate.p.value=TRUE to get a more accurate p-value from the chi-square test
Both assume independent observations (important!!)
In a GWAS, correction for multiple testing is required
Can also use logistic regression for two categorical variables

8 Summary - hypothesis testing

8.1 Power and type I and II error

True state of nature	Result of test
	Reject \(H_0\)	Fail to reject \(H_0\)
\(H_0\) TRUE	Type I error, probability = \(\alpha\)	No error, probability = \(1-\alpha\)
\(H_0\) FALSE	No error, probability is called power = \(1-\beta\)	Type II error, probability = \(\beta\) (false negative)

8.2 Use and mis-use of the p-value

The p-value is the probability of observing a sample statistic as or more extreme as you did, assuming that \(H_0\) is true
The p-value is a random variable:
- don’t treat it as precise.
- don’t do silly things like try to interpret differences or ratios between p-values
- don’t lose track of test assumptions such as independence of observations
- do use a moderate p-value cutoff, then use some effect size measure for ranking
Small p-values are particularly:
- variable under repeated sampling, and
- sensitive to test assumptions

8.3 Use and mis-use of the p-value (cont’d)

If we fail to reject \(H_0\), is the probability that \(H_0\) is true equal to (\(1-\alpha\))? (Hint: NO NO NO!)
- Failing to reject \(H_0\) does not mean \(H_0\) is true
- “No evidence of difference \(\neq\) evidence of no difference”
Statistical significance vs. practical significance
- As sample size increases, point estimates such as the mean are more precise
- With large sample size, small differences in means may be statistically significant but not practically significant
Although \(\alpha = 0.05\) is a common cut-off for the p-value, there is no set border between “significant” and “insignificant,” only increasingly strong evidence against \(H_0\) (in favor of \(H_A\)) as the p-value gets smaller.

Applied Statistics for High-throughput Biology: Session 1

Levi Waldron

Jan 10, 2019