A Link To The Past

30th September, 2019

Key Points

Common structure to well-known statistical tests
The linear model is pervasive
Multiple routes to generalisation of the linear model
We could use base R instead of limma / edgeR / DESeq2 / variancePartition (but thankfully don’t)

Preamble

Github:

https://www.github.com/russHyde/bfx_201909

Dataset:

GSE103528

R Packages:

# format: xaringan
pkgs <- c(
  # CRAN
  "dplyr", "magrittr", "ggplot2",
  # Bioconductor
  "ComplexHeatmap", "edgeR", "limma", "variancePartition"
)

for (pkg in pkgs) {
  suppressPackageStartupMessages(
    library(pkg, character.only = TRUE))
}

The Dataset

GSE103528

Sources of variability:
- 3x AML cell lines (Acute myeloid leukaemia)
- 3x treatments (“Scramble”, and two shRNAs vs CARM1 (aka PRMT4))
- Triplicates (not obvious: are replicates batches?)
Data:
- Star / RSEM / GRCh38 / Ensembl-87
- (@russHyde should have aligned the data and used actual counts
  - but life’s too short
  - he just floored the RSEM values)

The Dataset (cont.)

dge <- readRDS(here("data", "job", "GSE103528.dgelist.rds"))
dge <- dge[dge$genes$gene_biotype == "protein_coding", ]
dim(dge)

## [1] 19932    27

A Simple Class of Models (shorthand)

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

Where,

\(\mathbf{y}\) - observed values

\(\mathtt{lhs} \sim \mathcal{N}(., .)\) - normality assumption (on the residuals; or around the fitted values)

\(X\) - a design matrix (one row per observation)

\(\boldsymbol{\beta}, \sigma\) - parameters to be estimated (vector and scalar, resp)

… plus lots of assumptions

… And In Practice

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

Where,

\(\mathbf{y}\) - eg, log-expression-intensity for a single gene across a set of samples

\(\mathtt{lhs} \sim \mathcal{N}(., .)\) - some statistical distribution

\(X\) - some encoding of experimental design / predictors for the samples

\(\boldsymbol{\beta}\) - how much does each predictor contribute …

\(\sigma\) - and how much noise is there around …

… the fitted values

Design Matrices

Encode the experimental / statistical design:

rows = samples
columns = model coefficients

Model: intensity ~ cell + treatment

Contrast Matrices

Encode the experimental comparisons

Dependent on Design Matrix

Columns: Comparison of interest

Rows: Model coefficients (~ columns of design matrix)

Treatment Y vs Treatment X:

1. What combination of model coefficients gives the ‘fitted value’ for X
1. … for Y
Subtract A from B

Be careful with contrasts over ‘interactions’ - genomicsclass.github.io/book/pages/interactions_and_contrasts.html

The implicit generalisation: transform the outcome (`lm()`)

If the residuals aren’t normal(ish) in

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\] \[\downarrow\] \[\mathbf{Y} = f(\mathbf{y})\] \[\mathbf{Y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

design <- factor_model(~ cell + treatment, data = dge$samples)
dge_voom <- voom(dge, design = design)

`lm()` (cont.)

## 
## Call:
## lm(formula = intensity ~ cell + treatment, data = df_carm1)
## 
## Coefficients:
##   (Intercept)      cellmv411      cellskno1  treatmentsh55  treatmentsh57  
##        4.3966         0.0602         0.3932        -2.0898        -2.8950

Baselines (for cell: MOLM13; for treatment: shct)

Fitted values:

- (Intercept) + cellskno1 + treatmentssh57

with(
  lm_carm1,
  sum(coefficients[c("(Intercept)", "cellskno1", "treatmentsh57")]))

## [1] 1.894793

But

design is a N x K matrix (here, binary)
coefficients is K x 1 vector (entries ~ columns of the design)

What is the matrix * vector product?

(
  design %*% lm_carm1$coefficients
  ) %>%
  tail(n = 6)

##                  [,1]
## skno1_sh55_a 2.700022
## skno1_sh55_b 2.700022
## skno1_sh55_c 2.700022
## skno1_sh57_a 1.894793
## skno1_sh57_b 1.894793
## skno1_sh57_c 1.894793

Fitted- versus Observed-Values

Equivalent Design Matrices

LHS skno1 != RHS skno1 coef

The ‘borrow information’ generalisation (`limma`)

\[\mathbf{y}_{[g_1]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g_1]}, \sigma^2_{[g_1]}I \right)\] \[\mathbf{y}_{[g_2]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g_2]}, \sigma^2_{[g_2]}I \right)\] \[\downarrow\]

\(\sigma_{g}\) : balance between gene-specific estimate and global estimate}

\[\mathbf{y}_{[g]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g]}, \sigma^2_{[g]}I \right)\]

colnames(design)

## [1] "(Intercept)" "mv411"       "skno1"       "sh55"        "sh57"

contrasts <- limma::makeContrasts(
  # baseline treatment is shct, so
  # (fit for sh55) - (fit for shCTRL) = (Intercept + k_sh55) - (Intercept)
  # (fit for sh57) - (fit for shCTRL) = (Intercept + k_sh57) - (Intercept)
  sh_55 = "sh55",
  sh_57 = "sh57",
  sh_average = "(sh55 + sh57) / 2",
  levels = design
)

# standard limma workflow
# - vanilla linear model
fit_raw <- limma::lmFit(dge_voom, design)
# - estimate experimental contrasts
fit_cont <- limma::contrasts.fit(fit_raw, contrasts = contrasts)
# - moderated t-statistics (uses distribution of standard
# errors for all features to better estimate that for a
# given feature)
fit_ebayes <- limma::eBayes(fit_cont)

… ‘information borrowing’ does not affect ‘fold-change’

… but voom-contrasts != the lm() estimates

##                         sh_55     sh_57 sh_average
## after limma::lmFit  -2.065455 -2.922189  -2.493822
## after limma::eBayes -2.065455 -2.922189  -2.493822
## from `lm()`         -2.089766 -2.894996  -2.492381

“voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Law et al (Genome Biol. 2014)

with(
  dge_voom[row_carm1, ],
    lm.wfit(x = design, y = as.vector(E), w = as.vector(weights)) #<<
)$coefficients %*% contrasts

##       Contrasts
##            sh_55     sh_57 sh_average
##   [1,] -2.065455 -2.922189  -2.493822

`limma` does other stuff …

Source: Richie et al. Nuc. Acids Res. (2015)

Generalisations so far

Why should we assume / allow

the residuals are normal?
only a single outcome vector?
that each hypothesis should be ran completely independently?
equal weighting of all samples?

Some Univariate Distibutions

Source: Ehsan Azhdari [CC BY-SA 3.0] via Wikipedia

The ‘why should the residuals be Normal?’ generalisation: (`edgeR` / `DESeq2`)

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

that is, for the \(i\)th observation:

\[y_i \sim \mathcal{N} \left(\Sigma_j X_{ij}\beta_j, \sigma^2 \right)\]

\[\downarrow\]

\[y_i \sim \mathcal{\psi} \left(\Sigma_j X_{ij}\beta_j,\ other\ params \right)\]

eg, generalised linear models

RNA-Seq counts are counts

Count distributions (examples)

{0, 1} Bernoulli
{0, 1, 2, …, n} Binomial
{0, 1, 2, …, Inf} Poisson, Geometric

Negative-Binomial \(\mathcal{NB} \left( r, q \right)\)

used in both edgeR and DESeq2
generalisation of
- Poisson
- Geometric
practically: just more flexible than Poisson
many ways of parameterising:
- r = size, q = prob of success
- m = mean = \(r(1-q)/q\) , v = variance = \(r(1-q)/q^2\)

The `edgeR` model

For a gene \(g\) and sample \(i\) (from experimental group \(k\))

\[y_i \sim \mathcal{NB}\left( mean = m_i, var = v_i \right)\]

where

\(m_i = M_i p_k\)
- \(M_i\) = library size
- \(p_k\) = relative abundance of \(g\) in experimental group \(k\)
\(v_i = m_i\left( 1 + m_i \phi \right)\)
\(p_k = \exp\left( \Sigma_j {X_{ij}\beta_j} \right)\)
\(y_i\) is the observed count
edgeR call \(\phi\) the dispersion parameter (note, \(\phi = 0\) is a Poisson model; note \(\phi ^ {1/2} = BCV\))

dge_edger <- edgeR::estimateDisp(dge, design = design, tagwise = TRUE)

fit_edger <- edgeR::glmFit(dge_edger, design = design)

# convert coefs to log2
fit_edger$coefficients[row_carm1, ] * log2(exp(1))

##   (Intercept)         mv411         skno1          sh55          sh57 
## -15.487436975   0.002539948   0.356581304  -2.033617365  -2.871841062

##                     sh_55     sh_57 sh_average
## `lm()`          -2.089766 -2.894996  -2.492381
## `limma::eBayes` -2.065455 -2.922189  -2.493822
## `edgeR::glmFit` -2.033617 -2.871841  -2.452729

To get (approx) the same coefficients as edgeR using glm

model_data <- with(
  dge_edger,
  data.frame(
    counts = as.vector(counts[row_carm1, ]),
    samples
))

nb <- glm(
  # note that the `offset()` introduces a fixed constant (it's included in the
  # model with a coefficient of 1)
  counts ~ offset(log(lib.size)) + cell + treatment,
  data = model_data,
  family = MASS::negative.binomial(
    # R/MASS uses theta = 1 / phi, where phi is `edgeR`'s dispersion parameter
    # note that v = mu + mu^2/theta = mu + phi * mu^2
    theta = 1 / dge_edger$tagwise.dispersion[row_carm1]
  )
)

nb$coefficients * log2(exp(1))

##   (Intercept)     cellmv411     cellskno1 treatmentsh55 treatmentsh57 
## -15.487681768   0.002493712   0.356768237  -2.034286473  -2.873166449

(nb$coefficients * log2(exp(1))) %*% contrasts

##       Contrasts
##            sh_55     sh_57 sh_average
##   [1,] -2.034286 -2.873166  -2.453726

The ‘why should all the effects be fixed?’ generalisation: (`lme4`, `variancePartition`)

\[\mathbf{y} \sim X\boldsymbol{\beta} + \mathcal{N} \left(\boldsymbol{0}, \sigma^2I \right)\] that is,

\[\mathbf{y} \sim {Fixed\ Effects} + Noise\] \[\downarrow\] \[\mathbf{y} \sim [Fixed Effects] + [Random Effects] + Noise\]

Sources of variability in the current experiment:

Cell type
Batch (not modelled here)
Treatment
Technical issues, noise etc

Fixed effect:

Unknown constant we are trying to estimate

Why are we estimating the baseline difference between cell types?

Random effect:

A model parameter that is a random variable
You estimate the properties of its distribution, not its value

Mixed Model version

`voom::duplicateCorrelation`

# limma can accomodate at most one random effect
treatment_design <- model.matrix(
  ~ -1 + treatment, data = dge_voom$targets
)
treatment_contrast <- makeContrasts(
  sh55 = "treatmentsh55 - treatmentshct",
  sh57 = "treatmentsh57 - treatmentshct",
  levels = treatment_design
)
# 'random effects' as used in limma
blocks <- factor(dge_voom$targets$cell)

`duplicateCorrelation (cont.)`

dupcor <- limma::duplicateCorrelation(
  dge_voom, treatment_design, block = blocks)

## Warning in glmgam.fit(dx, dy, coef.start = start, tol = tol, maxit =
## maxit, : Too much damping - convergence tolerance not achievable

fit_cor <- lmFit(
  dge_voom, treatment_design,
  block = blocks, correlation = dupcor$consensus #<<
)
fit_cor$coefficients[row_carm1, ] %*% treatment_contrast

##       Contrasts
##             sh55      sh57
##   [1,] -2.069139 -2.930064

`variancePartition`

vp <- variancePartition::dream(
  dge_voom[row_carm1, ], # slow for all genes; recommend parallel
  formula = ~ -1 + treatment + (1|cell), # lme4 model syntax #<<
  data = dge_voom$targets,
  L = treatment_contrast
)

## Projected memory usage: > 30.1 Kb 
## 
## Finished...
## Total: 0 s

vp$coefficients

##                      sh55      sh57
## ENSG00000142453 -2.089766 -2.894996

Multiple random effects in `variancePartition::dream`

vp2 <- variancePartition::dream(
  dge_voom[row_carm1, ],
  formula = ~ -1 + treatment + (1|cell) + (1|batch), #<<
  data = dge_voom$targets,
  L = treatment_contrast
)

## Projected memory usage: > 34.4 Kb 
## 
## Finished...
## Total: 0 s

vp2$coefficients

##                      sh55      sh57
## ENSG00000142453 -2.089766 -2.894996

The ‘beyond the scope of the talk, and of my explanatory powers’ generalisations …

fully Bayesian approaches (STAN etc)
genes aren’t independent of each other …
- epigenomic context
- coregulatory modules
dynamic modelling of gene expression …
the cell population milkshake …

Appendix

Was the model appropriate?

[Could have included batch-effect and cell/treatment interaction]

NB = Gamma-Poisson Mixture

To sample 1000 NB(r, p) values

sample 1000 Gamma(shape = r, scale = p/(1-p)) values
then for each Gamma value, g, sample a Poisson(rate = g)
The latter are NB(r, p) // but R uses 1 - p rather than p

`edgeR` fits larger coefficients for poorly-detected genes than `limma`

## Warning: Removed 6 rows containing missing values (geom_point).

Using `stan` to fit a negative binomial model for RNA-Seq

suppressPackageStartupMessages(
  library(rstan)
)

stan (cont.)

# modified from http://rstudio-pubs-static.s3.amazonaws.com/34099_2e35c3966ef548c2918d5b6c2146bfd1.html

model <- "
// negative binomial parameterized as eta (log(mu)) and dispersion (phi)
// see p286 in stan-reference-2.4.0.pdf
// a basic GLM example
data {
  int<lower=1> N;    // rows of data
  matrix[N, 5] X;    // predictor
  vector[N] L;       // lib sizes
  int<lower=0> y[N]; // response
  real phi;          // edgeR dispersion parameter (glm() and stan use the
                     //   inverse of this to specify mean/variance relationship;
                     //   we call the inverse `theta` for consistency with glm())
                     // Apologies for the confusion, stan's docs refer to our
                     //   `theta` as `phi`
}
parameters {
  vector[5] beta; // coefficients for the linear component
}
transformed parameters {
  vector[N] y_hat = L .* exp(X * beta); // linear component
  real theta = 1 / phi;
}
model {
  // data model:
  y ~ neg_binomial_2(y_hat, theta);
}
"

stan (cont.)

stan_data <- list(
  N = ncol(dge_edger),
  y = as.vector(dge_edger$counts[row_carm1, ]),
  L = as.vector(dge_edger$samples$lib.size),
  X = set_colnames(design, c("intercept", colnames(design)[-1])),
  phi = dge_edger$tagwise.dispersion[row_carm1]
)

write(model, file = here("models/negbin-glm.stan"))

sm <- stan_model(here("models/negbin-glm.stan"))

stan (cont.)

# refresh = 0 prevents sampling output from printing
m <- sampling(sm, data = stan_data,
  pars = c("beta"),
  iter = 2000, chains = 4, verbose = FALSE, refresh = 0)

log2(exp(1)) *
  rbind(
    edgeR = fit_edger$coefficients[row_carm1, ],
    glm = nb$coefficients,
    stan = summary(m, "beta")$summary[, "mean"]
) %*% contrasts

##        Contrasts
##             sh_55     sh_57 sh_average
##   edgeR -2.033617 -2.871841  -2.452729
##   glm   -2.034286 -2.873166  -2.453726
##   stan  -2.039033 -2.873223  -2.456128

Key Points

Preamble

The Dataset

The Dataset (cont.)

A Simple Class of Models (shorthand)

… And In Practice

Design Matrices

Contrast Matrices

The implicit generalisation: transform the outcome (lm())

lm() (cont.)

Fitted- versus Observed-Values

Equivalent Design Matrices

The ‘borrow information’ generalisation (limma)

… ‘information borrowing’ does not affect ‘fold-change’

limma does other stuff …

Generalisations so far

Some Univariate Distibutions

The ‘why should the residuals be Normal?’ generalisation: (edgeR / DESeq2)

RNA-Seq counts are counts

The edgeR model

The ‘why should all the effects be fixed?’ generalisation: (lme4, variancePartition)

Why are we estimating the baseline difference between cell types?

Mixed Model version

voom::duplicateCorrelation

duplicateCorrelation (cont.)

variancePartition

Multiple random effects in variancePartition::dream

The ‘beyond the scope of the talk, and of my explanatory powers’ generalisations …

Further Reading

Appendix

Was the model appropriate?

NB = Gamma-Poisson Mixture

edgeR fits larger coefficients for poorly-detected genes than limma

Using stan to fit a negative binomial model for RNA-Seq

stan (cont.)

stan (cont.)

stan (cont.)

The implicit generalisation: transform the outcome (`lm()`)

`lm()` (cont.)

The ‘borrow information’ generalisation (`limma`)

`limma` does other stuff …

The ‘why should the residuals be Normal?’ generalisation: (`edgeR` / `DESeq2`)

The `edgeR` model

The ‘why should all the effects be fixed?’ generalisation: (`lme4`, `variancePartition`)

`voom::duplicateCorrelation`

`duplicateCorrelation (cont.)`

`variancePartition`

Multiple random effects in `variancePartition::dream`

`edgeR` fits larger coefficients for poorly-detected genes than `limma`

Using `stan` to fit a negative binomial model for RNA-Seq