Common structure to well-known statistical tests
The linear model is pervasive
Multiple routes to generalisation of the linear model
We could use base R instead of limma / edgeR / DESeq2 / variancePartition (but thankfully don’t)
30th September, 2019
Common structure to well-known statistical tests
The linear model is pervasive
Multiple routes to generalisation of the linear model
We could use base R instead of limma / edgeR / DESeq2 / variancePartition (but thankfully don’t)
Github:
Dataset:
R Packages:
# format: xaringan
pkgs <- c(
# CRAN
"dplyr", "magrittr", "ggplot2",
# Bioconductor
"ComplexHeatmap", "edgeR", "limma", "variancePartition"
)
for (pkg in pkgs) {
suppressPackageStartupMessages(
library(pkg, character.only = TRUE))
}
Sources of variability:
Data:
floored the RSEM values)dge <- readRDS(here("data", "job", "GSE103528.dgelist.rds"))
dge <- dge[dge$genes$gene_biotype == "protein_coding", ]
dim(dge)
## [1] 19932 27
\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]
Where,
\(\mathbf{y}\) - observed values
\(\mathtt{lhs} \sim \mathcal{N}(., .)\) - normality assumption (on the residuals; or around the fitted values)
\(X\) - a design matrix (one row per observation)
\(\boldsymbol{\beta}, \sigma\) - parameters to be estimated (vector and scalar, resp)
… plus lots of assumptions
\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]
Where,
\(\mathbf{y}\) - eg, log-expression-intensity for a single gene across a set of samples
\(\mathtt{lhs} \sim \mathcal{N}(., .)\) - some statistical distribution
\(X\) - some encoding of experimental design / predictors for the samples
\(\boldsymbol{\beta}\) - how much does each predictor contribute …
\(\sigma\) - and how much noise is there around …
… the fitted values
Encode the experimental / statistical design:
Model: intensity ~ cell + treatment
See also: ExploreModelMatrix
Encode the experimental comparisons
Dependent on Design Matrix
Columns: Comparison of interest
Rows: Model coefficients (~ columns of design matrix)
Treatment Y vs Treatment X:
Be careful with contrasts over ‘interactions’ - genomicsclass.github.io/book/pages/interactions_and_contrasts.html
lm())If the residuals aren’t normal(ish) in
\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\] \[\downarrow\] \[\mathbf{Y} = f(\mathbf{y})\] \[\mathbf{Y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]
design <- factor_model(~ cell + treatment, data = dge$samples) dge_voom <- voom(dge, design = design)
lm() (cont.)## ## Call: ## lm(formula = intensity ~ cell + treatment, data = df_carm1) ## ## Coefficients: ## (Intercept) cellmv411 cellskno1 treatmentsh55 treatmentsh57 ## 4.3966 0.0602 0.3932 -2.0898 -2.8950
Baselines (for cell: MOLM13; for treatment: shct)
Fitted values:
with(
lm_carm1,
sum(coefficients[c("(Intercept)", "cellskno1", "treatmentsh57")]))
## [1] 1.894793
But
design is a N x K matrix (here, binary)
coefficients is K x 1 vector (entries ~ columns of the design)
What is the matrix * vector product?
( design %*% lm_carm1$coefficients ) %>% tail(n = 6)
## [,1] ## skno1_sh55_a 2.700022 ## skno1_sh55_b 2.700022 ## skno1_sh55_c 2.700022 ## skno1_sh57_a 1.894793 ## skno1_sh57_b 1.894793 ## skno1_sh57_c 1.894793
LHS skno1 != RHS skno1 coef
limma)\[\mathbf{y}_{[g_1]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g_1]}, \sigma^2_{[g_1]}I \right)\] \[\mathbf{y}_{[g_2]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g_2]}, \sigma^2_{[g_2]}I \right)\] \[\downarrow\]
\(\sigma_{g}\) : balance between gene-specific estimate and global estimate}
\[\mathbf{y}_{[g]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g]}, \sigma^2_{[g]}I \right)\]
colnames(design)
## [1] "(Intercept)" "mv411" "skno1" "sh55" "sh57"
contrasts <- limma::makeContrasts( # baseline treatment is shct, so # (fit for sh55) - (fit for shCTRL) = (Intercept + k_sh55) - (Intercept) # (fit for sh57) - (fit for shCTRL) = (Intercept + k_sh57) - (Intercept) sh_55 = "sh55", sh_57 = "sh57", sh_average = "(sh55 + sh57) / 2", levels = design )
# standard limma workflow # - vanilla linear model fit_raw <- limma::lmFit(dge_voom, design) # - estimate experimental contrasts fit_cont <- limma::contrasts.fit(fit_raw, contrasts = contrasts) # - moderated t-statistics (uses distribution of standard # errors for all features to better estimate that for a # given feature) fit_ebayes <- limma::eBayes(fit_cont)
… but voom-contrasts != the lm() estimates
## sh_55 sh_57 sh_average ## after limma::lmFit -2.065455 -2.922189 -2.493822 ## after limma::eBayes -2.065455 -2.922189 -2.493822 ## from `lm()` -2.089766 -2.894996 -2.492381
“voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Law et al (Genome Biol. 2014)
with(
dge_voom[row_carm1, ],
lm.wfit(x = design, y = as.vector(E), w = as.vector(weights)) #<<
)$coefficients %*% contrasts
## Contrasts ## sh_55 sh_57 sh_average ## [1,] -2.065455 -2.922189 -2.493822
limma does other stuff …Source: Richie et al. Nuc. Acids Res. (2015)
Why should we assume / allow
the residuals are normal?
only a single outcome vector?
that each hypothesis should be ran completely independently?
equal weighting of all samples?
Source: Ehsan Azhdari [CC BY-SA 3.0] via Wikipedia
edgeR / DESeq2)\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]
that is, for the \(i\)th observation:
\[y_i \sim \mathcal{N} \left(\Sigma_j X_{ij}\beta_j, \sigma^2 \right)\]
\[\downarrow\]
\[y_i \sim \mathcal{\psi} \left(\Sigma_j X_{ij}\beta_j,\ other\ params \right)\]
Count distributions (examples)
Negative-Binomial \(\mathcal{NB} \left( r, q \right)\)
used in both edgeR and DESeq2
generalisation of
practically: just more flexible than Poisson
many ways of parameterising:
edgeR modelFor a gene \(g\) and sample \(i\) (from experimental group \(k\))
\[y_i \sim \mathcal{NB}\left( mean = m_i, var = v_i \right)\]
where
\(m_i = M_i p_k\)
\(M_i\) = library size
\(p_k\) = relative abundance of \(g\) in experimental group \(k\)
\(v_i = m_i\left( 1 + m_i \phi \right)\)
\(p_k = \exp\left( \Sigma_j {X_{ij}\beta_j} \right)\)
\(y_i\) is the observed count
edgeR call \(\phi\) the dispersion parameter (note, \(\phi = 0\) is a Poisson model; note \(\phi ^ {1/2} = BCV\))
dge_edger <- edgeR::estimateDisp(dge, design = design, tagwise = TRUE) fit_edger <- edgeR::glmFit(dge_edger, design = design)
# convert coefs to log2 fit_edger$coefficients[row_carm1, ] * log2(exp(1))
## (Intercept) mv411 skno1 sh55 sh57 ## -15.487436975 0.002539948 0.356581304 -2.033617365 -2.871841062
## sh_55 sh_57 sh_average ## `lm()` -2.089766 -2.894996 -2.492381 ## `limma::eBayes` -2.065455 -2.922189 -2.493822 ## `edgeR::glmFit` -2.033617 -2.871841 -2.452729
To get (approx) the same coefficients as edgeR using glm
model_data <- with(
dge_edger,
data.frame(
counts = as.vector(counts[row_carm1, ]),
samples
))
nb <- glm(
# note that the `offset()` introduces a fixed constant (it's included in the
# model with a coefficient of 1)
counts ~ offset(log(lib.size)) + cell + treatment,
data = model_data,
family = MASS::negative.binomial(
# R/MASS uses theta = 1 / phi, where phi is `edgeR`'s dispersion parameter
# note that v = mu + mu^2/theta = mu + phi * mu^2
theta = 1 / dge_edger$tagwise.dispersion[row_carm1]
)
)
nb$coefficients * log2(exp(1))
## (Intercept) cellmv411 cellskno1 treatmentsh55 treatmentsh57 ## -15.487681768 0.002493712 0.356768237 -2.034286473 -2.873166449
(nb$coefficients * log2(exp(1))) %*% contrasts
## Contrasts ## sh_55 sh_57 sh_average ## [1,] -2.034286 -2.873166 -2.453726
lme4, variancePartition)\[\mathbf{y} \sim X\boldsymbol{\beta} + \mathcal{N} \left(\boldsymbol{0}, \sigma^2I \right)\] that is,
\[\mathbf{y} \sim {Fixed\ Effects} + Noise\] \[\downarrow\] \[\mathbf{y} \sim [Fixed Effects] + [Random Effects] + Noise\]
Sources of variability in the current experiment:
Cell type
Batch (not modelled here)
Treatment
Technical issues, noise etc
Fixed effect:
Random effect:
A model parameter that is a random variable
You estimate the properties of its distribution, not its value
voom::duplicateCorrelation# limma can accomodate at most one random effect treatment_design <- model.matrix( ~ -1 + treatment, data = dge_voom$targets ) treatment_contrast <- makeContrasts( sh55 = "treatmentsh55 - treatmentshct", sh57 = "treatmentsh57 - treatmentshct", levels = treatment_design ) # 'random effects' as used in limma blocks <- factor(dge_voom$targets$cell)
duplicateCorrelation (cont.)dupcor <- limma::duplicateCorrelation( dge_voom, treatment_design, block = blocks)
## Warning in glmgam.fit(dx, dy, coef.start = start, tol = tol, maxit = ## maxit, : Too much damping - convergence tolerance not achievable
fit_cor <- lmFit( dge_voom, treatment_design, block = blocks, correlation = dupcor$consensus #<< ) fit_cor$coefficients[row_carm1, ] %*% treatment_contrast
## Contrasts ## sh55 sh57 ## [1,] -2.069139 -2.930064
variancePartitionvp <- variancePartition::dream( dge_voom[row_carm1, ], # slow for all genes; recommend parallel formula = ~ -1 + treatment + (1|cell), # lme4 model syntax #<< data = dge_voom$targets, L = treatment_contrast )
## Projected memory usage: > 30.1 Kb ## ## Finished... ## Total: 0 s
vp$coefficients
## sh55 sh57 ## ENSG00000142453 -2.089766 -2.894996
variancePartition::dreamvp2 <- variancePartition::dream( dge_voom[row_carm1, ], formula = ~ -1 + treatment + (1|cell) + (1|batch), #<< data = dge_voom$targets, L = treatment_contrast )
## Projected memory usage: > 34.4 Kb ## ## Finished... ## Total: 0 s
vp2$coefficients
## sh55 sh57 ## ENSG00000142453 -2.089766 -2.894996
fully Bayesian approaches (STAN etc)
genes aren’t independent of each other …
epigenomic context
coregulatory modules
dynamic modelling of gene expression …
the cell population milkshake …
Gene expression course (Love / Irizarry)
Connections between different models
Bayesian stats (beyond the main part of this talk)
Design matrix explorer:
[Could have included batch-effect and cell/treatment interaction]
To sample 1000 NB(r, p) values
sample 1000 Gamma(shape = r, scale = p/(1-p)) values
then for each Gamma value, g, sample a Poisson(rate = g)
The latter are NB(r, p) // but R uses 1 - p rather than p
edgeR fits larger coefficients for poorly-detected genes than limma## Warning: Removed 6 rows containing missing values (geom_point).
stan to fit a negative binomial model for RNA-SeqsuppressPackageStartupMessages( library(rstan) )
# modified from http://rstudio-pubs-static.s3.amazonaws.com/34099_2e35c3966ef548c2918d5b6c2146bfd1.html
model <- "
// negative binomial parameterized as eta (log(mu)) and dispersion (phi)
// see p286 in stan-reference-2.4.0.pdf
// a basic GLM example
data {
int<lower=1> N; // rows of data
matrix[N, 5] X; // predictor
vector[N] L; // lib sizes
int<lower=0> y[N]; // response
real phi; // edgeR dispersion parameter (glm() and stan use the
// inverse of this to specify mean/variance relationship;
// we call the inverse `theta` for consistency with glm())
// Apologies for the confusion, stan's docs refer to our
// `theta` as `phi`
}
parameters {
vector[5] beta; // coefficients for the linear component
}
transformed parameters {
vector[N] y_hat = L .* exp(X * beta); // linear component
real theta = 1 / phi;
}
model {
// data model:
y ~ neg_binomial_2(y_hat, theta);
}
"
stan_data <- list(
N = ncol(dge_edger),
y = as.vector(dge_edger$counts[row_carm1, ]),
L = as.vector(dge_edger$samples$lib.size),
X = set_colnames(design, c("intercept", colnames(design)[-1])),
phi = dge_edger$tagwise.dispersion[row_carm1]
)
write(model, file = here("models/negbin-glm.stan"))
sm <- stan_model(here("models/negbin-glm.stan"))
# refresh = 0 prevents sampling output from printing
m <- sampling(sm, data = stan_data,
pars = c("beta"),
iter = 2000, chains = 4, verbose = FALSE, refresh = 0)
log2(exp(1)) *
rbind(
edgeR = fit_edger$coefficients[row_carm1, ],
glm = nb$coefficients,
stan = summary(m, "beta")$summary[, "mean"]
) %*% contrasts
## Contrasts ## sh_55 sh_57 sh_average ## edgeR -2.033617 -2.871841 -2.452729 ## glm -2.034286 -2.873166 -2.453726 ## stan -2.039033 -2.873223 -2.456128