Applied Statistics for High-throughput Biology: Session 1

Levi Waldron

Front matter

Welcome

My research interests:

The textbook

Biomedical Data Science book cover

Day 1 outline

Learning objectives

Random Variables and Distributions

Random Variables

Random Variables - examples

Normally distributed random variable with mean \(\mu = 0\) / standard deviation \(\sigma = 1\), and a sample of \(n=100\)

Random Variables - examples

Poisson distributed random variable (\(\lambda = 2\)), and a sample of \(n=100\).

Random Variables - examples

Negative Binomially distributed random variable (\(size=30, \mu=2\)), and a sample of \(n=100\).

Random Variables - examples

Hypothesis Testing

Why hypothesis testing?

Logic of hypothesis testing

Sample vs Population

One and two-sample hypothesis tests of mean

One-sample hypothesis test of mean - sample comes from one of two population distributions:

  1. usual distribution that has been true in the past
    • \(H_0: \mu = \mu_0\) (null hypothesis)
  2. a potentially new distribution induced by an intervention or changing condition
    • \(H_A: \mu \neq \mu_0\) (alternative hypothesis)

Two-sample hypothesis test of means - two samples are drawn, either:

  1. from a single population
    • \(H_0: \mu_1 = \mu_2\) (null hypothesis)
  2. from two different populations
    • \(H_A: \mu_1 \neq \mu_2\) (alternative hypothesis)

Population vs sampling distributions

Note about the Central Limit Theorem

When sample size is large, the average \(\bar{Y}\) of a random sample:

  1. Follows a normal distribution
  2. The normal distribution has mean entered at the population average \(\mu_Y\)
  3. with standard deviation equal to the population standard deviation \(\sigma_Y\), divided by the square root of the sample size \(N\). We refer to the standard deviation of the distribution of a random variable as the random variable’s standard error.

Thanks to CLT, linear modeling, t-tests, ANOVA are all guaranteed to work reasonably well for large samples (more than ~30 observations).

Go to online demo: http://onlinestatbook.com/stat_sim/sampling_dist/

Example applications of hypothesis tests for continuous variables

  1. Do study participants on a diet lose weight compared to a control group?
  2. Does a gene knockout result in less viable cell colonies, as measured by the number of living cells in replicate experiments?
  3. Is Prevotella copri more abundant in the guts of vegans than of meat-eaters?
  4. Do infants born in this region have a higher birthweight than the national average?

These are research hypotheses. What are the corresponding null hypotheses?

The z-tests

When to use z-tests

Example of one-sample z-test

library(TeachingDemos)
set.seed(1)
(weights = rnorm(10, mean = 75, sd = 10))
##  [1] 68.73546 76.83643 66.64371 90.95281 78.29508 66.79532 79.87429 82.38325
##  [9] 80.75781 71.94612
TeachingDemos::z.test(
    x = weights,
    mu = 70,  #specified for population under H0 because this is a one-sample test
    stdev = 10, #specified for population because this is a z-test
    alternative = "two.sided"
)
## 
##  One Sample z-test
## 
## data:  weights
## z = 1.9992, n = 10.0000, Std. Dev. = 10.0000, Std. Dev. of the sample
## mean = 3.1623, p-value = 0.04559
## alternative hypothesis: true mean is not equal to 70
## 95 percent confidence interval:
##  70.12408 82.51998
## sample estimates:
## mean of weights 
##        76.32203

The t-tests

When to use t-tests

Example of one-sample t-test

Note that here we do not specify the standard deviation, because it is estimated from the sample.

stats::t.test(
    x = weights,
    mu = 70,  #specified for population under H0 because this is a one-sample test
    alternative = "two.sided"
)
## 
##  One Sample t-test
## 
## data:  weights
## t = 2.5612, df = 9, p-value = 0.03063
## alternative hypothesis: true mean is not equal to 70
## 95 percent confidence interval:
##  70.73805 81.90600
## sample estimates:
## mean of x 
##  76.32203

In this particular example the p-value is smaller than from the z-test, but in general it would be less powerful than a z-test if its assumptions are met.

Confidence intervals

Why confidence intervals?

Confidence intervals provide a probable range for the true population mean.

Intervals for any confidence level

Confidence intervals vs. hypothesis testing

Q & A: confidence intervals

Which of the following are true?

Summary - hypothesis testing

Power and type I and II error

True state of nature Result of test
Reject \(H_0\) Fail to reject \(H_0\)
\(H_0\) TRUE Type I error, probability = \(\alpha\) No error, probability = \(1-\alpha\)
\(H_0\) FALSE No error, probability is called power = \(1-\beta\) Type II error, probability = \(\beta\) (false negative)

Use and mis-use of the p-value

Use and mis-use of the p-value (cont’d)

R - basic usage

Tips for learning R

Pseudo code Example code
library(packagename) library(dplyr)
?functionname ?select
?package::functionname ?dplyr::select
? ‘Reserved keyword or symbol’ ? ‘%>%’
??searchforpossiblyexistingfunctionandortopic ??simulate
help(package = “loadedpackage”) help(“dplyr”)
browseVignettes(“packagename”) browseVignettes(“dplyr”)

Slide credit: Marcel Ramos

Installing Packages the Bioconductor Way

Pseudo code:

install.packages("BiocManager") #from CRAN
packages <- c("packagename", "githubuser/repository", "biopackage")
BiocManager::install(packages)
BiocManager::valid()  #check validity of installed packages

Introduction to the R/Bioconductor data classes

Base R Data Types: atomic vectors

numeric (set seed to sync random number generator):

set.seed(1)
rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

integer:

sample(1L:5L)
## [1] 3 5 1 4 2

logical:

1:3 %in% 3
## [1] FALSE FALSE  TRUE

character:

c("yes", "no")
## [1] "yes" "no"

factor:

factor(c("yes", "no"))
## [1] yes no 
## Levels: no yes

For real-life recoding of factors, try dplyr::recode, dplyr::recode_factor

Base R Data Types: missingness

c(NA, NaN, -Inf, Inf)
## [1]   NA  NaN -Inf  Inf

class() to find the class of a variable.

Base R Data Types: matrix, list, data.frame

matrix:

matrix(1:9, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

The list is a non-atomic vector:

measurements <- c(1.3, 1.6, 3.2, 9.8, 10.2)
parents <- c("Parent1.name", "Parent2.name")
my.list <- list(measurements, parents)
my.list
## [[1]]
## [1]  1.3  1.6  3.2  9.8 10.2
## 
## [[2]]
## [1] "Parent1.name" "Parent2.name"

The data.frame has list-like and matrix-like properties:

x <- 11:16
y <- seq(0, 1, .2)
z <- c("one", "two", "three", "four", "five", "six")
a <- factor(z)
my.df <- data.frame(x, y, z, a, stringsAsFactors = FALSE)

Bioconductor S4 vectors: DataFrame

suppressPackageStartupMessages(library(S4Vectors))
df <- DataFrame(var1 = Rle(c("a", "a", "b")),
                var2 = 1:3)
metadata(df) <- list(parent = "DataFrame is my virtual parent")
df
## DataFrame with 3 rows and 2 columns
##    var1      var2
##   <Rle> <integer>
## 1     a         1
## 2     a         2
## 3     b         3

DataFrame is actually a virtual classes

List and derived classes

List(my.list)
## List of length 2
str(List(my.list))
## Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
##   ..@ listData       :List of 2
##   .. ..$ : num [1:5] 1.3 1.6 3.2 9.8 10.2
##   .. ..$ : chr [1:2] "Parent1.name" "Parent2.name"
##   ..@ elementType    : chr "ANY"
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()
suppressPackageStartupMessages(library(IRanges))
IntegerList(var1 = 1:26, var2 = 1:100)
## IntegerList of length 2
## [["var1"]] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## [["var2"]] 1 2 3 4 5 6 7 8 9 10 11 12 ... 89 90 91 92 93 94 95 96 97 98 99 100
CharacterList(var1 = letters[1:100], var2 = LETTERS[1:26])
## CharacterList of length 2
## [["var1"]] a b c d e f g h i j ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## [["var2"]] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
LogicalList(var1 = 1:100 %in% 5, var2 = 1:100 %% 2)
## LogicalList of length 2
## [["var1"]] FALSE FALSE FALSE FALSE TRUE FALSE ... FALSE FALSE FALSE FALSE FALSE
## [["var2"]] TRUE FALSE TRUE FALSE TRUE FALSE ... FALSE TRUE FALSE TRUE FALSE

Biostrings

suppressPackageStartupMessages(library(Biostrings))
bstring <- BString("I am a BString object")
bstring
## 21-letter BString object
## seq: I am a BString object
dnastring <- DNAString("TTGAAA-CTC-N")
dnastring
## 12-letter DNAString object
## seq: TTGAAA-CTC-N
str(dnastring)
## Formal class 'DNAString' [package "Biostrings"] with 5 slots
##   ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
##   .. .. ..@ xp                    :<externalptr> 
##   .. .. ..@ .link_to_cached_object:<environment: 0x5640ba3f7668> 
##   ..@ offset         : int 0
##   ..@ length         : int 12
##   ..@ elementMetadata: NULL
##   ..@ metadata       : list()
alphabetFrequency(dnastring, baseOnly=TRUE, as.prob=TRUE)
##          A          C          G          T      other 
## 0.25000000 0.16666667 0.08333333 0.25000000 0.25000000

(Ranged)SummarizedExperiment

Summarized Experiment * Source: https://bioconductor.org/packages/SummarizedExperiment/

(Ranged)SummarizedExperiment