1 Welcome and outline

1.1 Learning objectives

  • Perform basic data manipulation/exploration in R and dplyr
  • Identify key R and Bioconductor data classes
  • Define random variables and distinguish them from non-random ones
  • Distinguish population from sampling distributions
  • Interpret t-tests and confidence intervals
  • Identify when to use Fisher’s Exact Test and Chi-square Test

1.2 A bit about me - research interests

  • High-dimensional statistics (more variables than observations)
  • Predictive modeling and methodology for validation
  • Metagenomic profiling of the human microbiome
  • Multi-omic data analysis for cancer
  • http://www.waldronlab.io

2 R - basic usage

2.1 Tips for learning R

Pseudo code Example code
library(packagename) library(dplyr)
?functionname ?select
?package::functionname ?dplyr::select
? ‘Reserved keyword or symbol’ ? ‘%>%’
??searchforpossiblyexistingfunctionandortopic ??simulate
help(package = “loadedpackage”) help(“dplyr”)
browseVignettes(“packagename”) browseVignettes(“dplyr”)

Slide credit: Marcel Ramos

2.2 Installing Packages the Bioconductor Way

Pseudo code:

install.packages("BiocManager") #from CRAN
packages <- c("packagename", "githubuser/repository", "biopackage")
BiocManager::install(packages)
BiocManager::valid()  #check validity of installed packages
  • Works for CRAN, GitHub, and Bioconductor packages!

2.3 Note about installing devtools

  • Useful for building packages
  • Download and install from GitHub, directly or via BiocManager::install()
  • Installation dependent on OS (Rtools for Windows, OSX installation requires XCode and Command Line Tools, GNU/Linux “just works”)

3 Introduction to the R language

3.1 Base R Data Types: atomic vectors

numeric (set seed to sync random number generator):

set.seed(1)
rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

integer:

sample( 1:5 )
## [1] 2 1 3 4 5

logical:

1:3 %in% 3
## [1] FALSE FALSE  TRUE

character:

c("yes", "no")
## [1] "yes" "no"

factor:

factor(c("yes", "no"))
## [1] yes no 
## Levels: no yes

Demo: integer-like properties, relevel()

3.2 Base R Data Types: missingness

  • Missing Values and others - IMPORTANT
c(NA, NaN, -Inf, Inf)
## [1]   NA  NaN -Inf  Inf

class() to find the class of a variable.

3.3 Base R Data Types: matrix, list, data.frame

matrix:

matrix(1:9, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

The list is a non-atomic vector:

measurements <- c( 1.3, 1.6, 3.2, 9.8, 10.2 )
parents <- c( "Parent1.name", "Parent2.name" )
my.list <- list( measurements, parents)
my.list
## [[1]]
## [1]  1.3  1.6  3.2  9.8 10.2
## 
## [[2]]
## [1] "Parent1.name" "Parent2.name"

The data.frame has list-like and matrix-like properties:

x <- 11:16
y <- seq(0,1,.2)
z <- c( "one", "two", "three", "four", "five", "six" )
a <- factor( z )
my.df <- data.frame(x,y,z,a, stringsAsFactors = FALSE)

3.4 Bioconductor S4 vectors: DataFrame

  • Bioconductor (www.bioconductor.org) defines its own set of vectors using the S4 formal class system DataFrame: like a data.frame but more flexible. columns can be any atomic vector type:
    • GenomicRanges objects
    • Rle (run-length encoding)
suppressPackageStartupMessages(library(S4Vectors))
df <- DataFrame(var1 = Rle(c("a", "a", "b")),
          var2 = 1:3)
metadata(df) <- list(father="Levi is my father")
df
## DataFrame with 3 rows and 2 columns
##    var1      var2
##   <Rle> <integer>
## 1     a         1
## 2     a         2
## 3     b         3

3.5 Bioconductor S4 vectors: List and derived classes

List(my.list)
## List of length 2
str(List(my.list))
## Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
##   ..@ listData       :List of 2
##   .. ..$ : num [1:5] 1.3 1.6 3.2 9.8 10.2
##   .. ..$ : chr [1:2] "Parent1.name" "Parent2.name"
##   ..@ elementType    : chr "ANY"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()
suppressPackageStartupMessages(library(IRanges))
IntegerList(var1=1:26, var2=1:100)
## IntegerList of length 2
## [["var1"]] 1 2 3 4 5 6 7 8 9 10 11 12 ... 16 17 18 19 20 21 22 23 24 25 26
## [["var2"]] 1 2 3 4 5 6 7 8 9 10 11 ... 90 91 92 93 94 95 96 97 98 99 100
CharacterList(var1=letters[1:100], var2=LETTERS[1:26])
## CharacterList of length 2
## [["var1"]] a b c d e f g h i ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## [["var2"]] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
LogicalList(var1=1:100 %in% 5, var2=1:100 %% 2)
## LogicalList of length 2
## [["var1"]] FALSE FALSE FALSE FALSE TRUE ... FALSE FALSE FALSE FALSE FALSE
## [["var2"]] TRUE FALSE TRUE FALSE TRUE ... FALSE TRUE FALSE TRUE FALSE

3.6 Bioconductor S4 vectors: Biostrings

suppressPackageStartupMessages(library(Biostrings))
bstring = BString("I am a BString object")
bstring
##   21-letter "BString" instance
## seq: I am a BString object
dnastring = DNAString("TTGAAA-CTC-N")
dnastring
##   12-letter "DNAString" instance
## seq: TTGAAA-CTC-N
str(dnastring)
## Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. ..@ xp                    :<externalptr> 
##   .. .. ..@ .link_to_cached_object:<environment: 0x7ff37196c090> 
##   ..@ offset         : int 0
##   ..@ length         : int 12
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()
alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)
##          A          C          G          T      other 
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000

4 dplyr

4.1 Data Manipulation using dplyr

  • dplyr convention aims to ease cognitive burden
  • Function names are easy to remember:
  1. select (Y)
  2. mutate/transmute (add Ys / new Y)
  3. filter (get Xs based on condition)
  4. slice (get Xs specified)
  5. summarise (reduce to single observation)
  6. arrange (re-order observations)

4.2 dplyr example

library(nycflights13)
library(dplyr)
delays <- flights %>% 
  filter(!is.na(dep_delay)) %>%
  group_by(year, month, day, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)

4.3 dplyr example (cont’d)

hist(delays$delay, main="Mean hourly delay", xlab="Delay (hours)")

5 Random Variables and Distributions

5.1 Random Variables

  • A random variable: any characteristic that can be measured or categorized, and where any particular outcome is determined at least partially by chance.

  • Examples:
    • number of new diabetes cases in population in a given year
    • The weight of a randomly selected individual in a population
  • Types:
    • Categorical random variable (e.g. disease / healthy)
    • Discrete random variable (e.g. sequence read counts)
    • Continuous random variable (e.g. normalized qPCR intensity)

5.2 Probability Distributions

  • We use probability distributions to describe the probability of all possible realizations of a random variable
  • In public health we use probability distributions to describe hypotheses or inference about the population
    • because we normally cannot observe the entire population
  • In practice we study a sample selected from the population
Random Sample

5.3 Random Variables - examples

Normally distributed random variable with mean \(\mu = 0\) / standard deviation \(\sigma = 1\), and a sample of \(n=100\)

5.4 Random Variables - examples

Poisson distributed random variable (\(\lambda = 2\)), and a sample of \(n=100\).

5.5 Random Variables - examples

Negative Binomially distributed random variable (\(size=30, \mu=2\)), and a sample of \(n=100\).

5.6 Random Variables - examples

  • Binomial Distribution random variable (\(size=20, prob=0.25\)), and a sample of \(n=100\).
    • use for binary outcomes

6 Hypothesis Testing

6.1 Population vs sampling distributions

  • Population distributions
    • Each realization / point is an individual
  • Sampling distributions
    • Each realization / point is a sample
    • distribution depends on sample size
    • large sample distributions are given by the Central Limit Theorem
  • Hypothesis testing is about sampling distributions
    • Did my sample likely come from that distribution?

6.2 Logic of hypothesis testing

One Sample - observations come from one of two population distributions:

  1. usual distribution that has been true in the past
  2. a potentially new distribution induced by an intervention or changing condition

Two Sample - two samples are drawn, either:

  1. from a single population
  2. from two different populations

6.3 The t-tests

  • In a one-sample test, only a single sample is drawn:
    • \(H_0: \mu = \mu_0\)
  • In a two-sample test, two samples are drawn independently:
    • \(H_0: \mu_1 = \mu_2\)
  • A paired test is one sample of paired measurements, e.g.:
    • individuals before and after treatment

6.4 When to use t-tests

  • \(\frac{{}\bar{x} - \mu}{s}\) and \(\frac{\bar{x_1} - \bar{x_2}}{s}\) are t-distributed if:
    • the standard deviation \(s\) is estimated from the sample
    • the population values are normally distributed
    • the population values are slightly skewed but n > 15
    • the population values are quite skewed but n > 30
  • Wilcoxon tests are an alternative for non-normal populations
    • e.g. rank data
    • data where ranks are more informative than means
    • not when many ranks are arbitrary

7 Hypothesis testing for categorical variables

7.1 Lady Tasting Tea

  • The Lady in question claimed to be able to tell whether the tea or the milk was added first to a cup
  • Fisher proposed to give her eight cups, four of each variety, in random order
    • the Lady is fully informed of the experimental method
    • \(H_0\): the Lady has no ability to tell which was added first
Lady tasting tea

Source: https://en.wikipedia.org/wiki/Lady_tasting_tea

7.2 Fisher’s Exact Test

p-value is the probability of the observed number of successes, or more, under \(H_0\)

Tea-Tasting Distribution
Success count Permutations of selection Number of permutations
0 oooo 1 × 1 = 1
1 ooox, ooxo, oxoo, xooo 4 × 4 = 16
2 ooxx, oxox, oxxo, xoxo, xxoo, xoox 6 × 6 = 36
3 oxxx, xoxx, xxox, xxxo 4 × 4 = 16
4 xxxx 1 × 1 = 1
Total 70

What do you notice about all these combinations?

7.3 Notes on Fisher’s Exact Test

  • Can also be applied to rxc tables
  • Remember that the margins of the table are fixed by design
  • Also referred to as the Hypergeometric Test
  • Exact p-values are difficult (and unnecessary) for large samples
    • fisher.test(x, y = NULL, etc, simulate.p.value = FALSE)

7.4 Notes on Fisher’s Exact Test (cont’d)

  • Has been applied (with peril!) to gene set analysis, e.g.:
    • 10 of my top 100 genes are annotated with the cytokinesis GO term
    • 465 of 21,000 human genes are annotated with the cytokinesis GO term
    • Are my top 100 genes enriched for cytokinesis process?
  • Problems with this analysis:
    • Main problem: top-n genes tend to be correlated, so their selections are not independent trials
    • Secondary: does not match design for \(H_0\)
  • Alternative: permutation test repeating all steps

7.5 Chi-squared test

  • Test of independence for rxc table (two categorical variables)
  • Does not assume the margins are fixed by design
    • i.e., the number of cups of tea with milk poured first can be random, and the Lady doesn’t know how many
    • more common in practice
    • classic genomics example is GWAS
  • \(H_0\): the two variables are independent
  • \(H_A\): there is an association between the variables

7.6 Application to GWAS

  • Interested in association between disease and some potential causative factor
  • In a case-control study, the numbers of cases and controls are fixed, but the other variable is not
  • In a prospective or longitudinal cohort study, neither the number of cases or the other variable are fixed
disease=factor(c(rep(0,180),rep(1,20),rep(0,40),rep(1,10)),
               labels=c("control","cases"))
genotype=factor(c(rep("AA/Aa",204),rep("aa",46)),
                levels=c("AA/Aa","aa"))
set.seed(1)
genotype <- sample(genotype)
dat <- data.frame(disease, genotype)
summary(dat)
##     disease     genotype  
##  control:220   AA/Aa:204  
##  cases  : 30   aa   : 46

7.7 Application to GWAS (cont’d)

table(disease, genotype)
##          genotype
## disease   AA/Aa  aa
##   control   181  39
##   cases      23   7
chisq.test(disease, genotype)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  disease and genotype
## X-squared = 0.24229, df = 1, p-value = 0.6226
chisq.test(disease, genotype, simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  disease and genotype
## X-squared = 0.5526, df = NA, p-value = 0.5777

7.8 Application to GWAS (cont’d)

Note that the result says nothing about how the departure from independence occurs

library(epitools)
epitools::oddsratio(genotype, disease, method="wald")$measure
##          odds ratio with 95% C.I.
## Predictor estimate     lower    upper
##     AA/Aa 1.000000        NA       NA
##     aa    1.412486 0.5662511 3.523379

(the default is whichever level is first alphabetically!)

7.9 Summary - two categorical variables

  • Choice between Fisher’s Exact Test and Chi-square test is determined by experimental design
  • If any counts in the table are less than 5, can use simulate.p.value=TRUE to get a more accurate p-value from the chi-square test
  • Both assume independent observations (important!!)
  • In a GWAS, correction for multiple testing is required
  • Can also use logistic regression for two categorical variables

8 Summary - hypothesis testing

8.1 Power and type I and II error

True state of nature Result of test
Reject \(H_0\) Fail to reject \(H_0\)
\(H_0\) TRUE Type I error, probability = \(\alpha\) No error, probability = \(1-\alpha\)
\(H_0\) FALSE No error, probability is called power = \(1-\beta\) Type II error, probability = \(\beta\) (false negative)

8.2 Use and mis-use of the p-value

  • The p-value is the probability of observing a sample statistic as or more extreme as you did, assuming that \(H_0\) is true
  • The p-value is a random variable:
    • don’t treat it as precise.
    • don’t do silly things like try to interpret differences or ratios between p-values
    • don’t lose track of test assumptions such as independence of observations
    • do use a moderate p-value cutoff, then use some effect size measure for ranking
  • Small p-values are particularly:
    • variable under repeated sampling, and
    • sensitive to test assumptions

8.3 Use and mis-use of the p-value (cont’d)

  • If we fail to reject \(H_0\), is the probability that \(H_0\) is true equal to (\(1-\alpha\))? (Hint: NO NO NO!)
    • Failing to reject \(H_0\) does not mean \(H_0\) is true
    • “No evidence of difference \(\neq\) evidence of no difference”
  • Statistical significance vs. practical significance
    • As sample size increases, point estimates such as the mean are more precise
    • With large sample size, small differences in means may be statistically significant but not practically significant
  • Although \(\alpha = 0.05\) is a common cut-off for the p-value, there is no set border between “significant” and “insignificant,” only increasingly strong evidence against \(H_0\) (in favor of \(H_A\)) as the p-value gets smaller.

9 Lab Exercises

  1. dplyr exercises
  2. random variables exercises