30th September, 2019

Key Points

  • Common structure to well-known statistical tests

  • The linear model is pervasive

  • Multiple routes to generalisation of the linear model

  • We could use base R instead of limma / edgeR / DESeq2 / variancePartition (but thankfully don’t)

Preamble

Github:

Dataset:

R Packages:

# format: xaringan
pkgs <- c(
  # CRAN
  "dplyr", "magrittr", "ggplot2",
  # Bioconductor
  "ComplexHeatmap", "edgeR", "limma", "variancePartition"
)

for (pkg in pkgs) {
  suppressPackageStartupMessages(
    library(pkg, character.only = TRUE))
}

The Dataset

GSE103528

  • Sources of variability:

    • 3x AML cell lines (Acute myeloid leukaemia)
    • 3x treatments (“Scramble”, and two shRNAs vs CARM1 (aka PRMT4))
    • Triplicates (not obvious: are replicates batches?)
  • Data:

    • Star / RSEM / GRCh38 / Ensembl-87
    • (@russHyde should have aligned the data and used actual counts
      • but life’s too short
      • he just floored the RSEM values)

The Dataset (cont.)

dge <- readRDS(here("data", "job", "GSE103528.dgelist.rds"))
dge <- dge[dge$genes$gene_biotype == "protein_coding", ]
dim(dge)
## [1] 19932    27

A Simple Class of Models (shorthand)

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

Where,

\(\mathbf{y}\) - observed values

\(\mathtt{lhs} \sim \mathcal{N}(., .)\) - normality assumption (on the residuals; or around the fitted values)

\(X\) - a design matrix (one row per observation)

\(\boldsymbol{\beta}, \sigma\) - parameters to be estimated (vector and scalar, resp)

… plus lots of assumptions

… And In Practice

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

Where,

\(\mathbf{y}\) - eg, log-expression-intensity for a single gene across a set of samples

\(\mathtt{lhs} \sim \mathcal{N}(., .)\) - some statistical distribution

\(X\) - some encoding of experimental design / predictors for the samples

\(\boldsymbol{\beta}\) - how much does each predictor contribute …

\(\sigma\) - and how much noise is there around …

… the fitted values

Design Matrices

Contrast Matrices

Encode the experimental comparisons

Dependent on Design Matrix

Columns: Comparison of interest

Rows: Model coefficients (~ columns of design matrix)

Treatment Y vs Treatment X:

    1. What combination of model coefficients gives the ‘fitted value’ for X
    1. … for Y
  • Subtract A from B

Be careful with contrasts over ‘interactions’ - genomicsclass.github.io/book/pages/interactions_and_contrasts.html

The implicit generalisation: transform the outcome (lm())

If the residuals aren’t normal(ish) in

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\] \[\downarrow\] \[\mathbf{Y} = f(\mathbf{y})\] \[\mathbf{Y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

design <- factor_model(~ cell + treatment, data = dge$samples)
dge_voom <- voom(dge, design = design)

lm() (cont.)

## 
## Call:
## lm(formula = intensity ~ cell + treatment, data = df_carm1)
## 
## Coefficients:
##   (Intercept)      cellmv411      cellskno1  treatmentsh55  treatmentsh57  
##        4.3966         0.0602         0.3932        -2.0898        -2.8950

Baselines (for cell: MOLM13; for treatment: shct)

Fitted values:

    • (Intercept) + cellskno1 + treatmentssh57
with(
  lm_carm1,
  sum(coefficients[c("(Intercept)", "cellskno1", "treatmentsh57")]))
## [1] 1.894793

But

  • design is a N x K matrix (here, binary)

  • coefficients is K x 1 vector (entries ~ columns of the design)

What is the matrix * vector product?

(
  design %*% lm_carm1$coefficients
  ) %>%
  tail(n = 6)
##                  [,1]
## skno1_sh55_a 2.700022
## skno1_sh55_b 2.700022
## skno1_sh55_c 2.700022
## skno1_sh57_a 1.894793
## skno1_sh57_b 1.894793
## skno1_sh57_c 1.894793

Fitted- versus Observed-Values

Equivalent Design Matrices

LHS skno1 != RHS skno1 coef

The ‘borrow information’ generalisation (limma)

\[\mathbf{y}_{[g_1]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g_1]}, \sigma^2_{[g_1]}I \right)\] \[\mathbf{y}_{[g_2]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g_2]}, \sigma^2_{[g_2]}I \right)\] \[\downarrow\]

\(\sigma_{g}\) : balance between gene-specific estimate and global estimate}

\[\mathbf{y}_{[g]} \sim \mathcal{N} \left(X\boldsymbol{\beta}_{[g]}, \sigma^2_{[g]}I \right)\]

colnames(design)
## [1] "(Intercept)" "mv411"       "skno1"       "sh55"        "sh57"
contrasts <- limma::makeContrasts(
  # baseline treatment is shct, so
  # (fit for sh55) - (fit for shCTRL) = (Intercept + k_sh55) - (Intercept)
  # (fit for sh57) - (fit for shCTRL) = (Intercept + k_sh57) - (Intercept)
  sh_55 = "sh55",
  sh_57 = "sh57",
  sh_average = "(sh55 + sh57) / 2",
  levels = design
)
# standard limma workflow
# - vanilla linear model
fit_raw <- limma::lmFit(dge_voom, design)
# - estimate experimental contrasts
fit_cont <- limma::contrasts.fit(fit_raw, contrasts = contrasts)
# - moderated t-statistics (uses distribution of standard
# errors for all features to better estimate that for a
# given feature)
fit_ebayes <- limma::eBayes(fit_cont)

… ‘information borrowing’ does not affect ‘fold-change’

… but voom-contrasts != the lm() estimates

##                         sh_55     sh_57 sh_average
## after limma::lmFit  -2.065455 -2.922189  -2.493822
## after limma::eBayes -2.065455 -2.922189  -2.493822
## from `lm()`         -2.089766 -2.894996  -2.492381

“voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Law et al (Genome Biol. 2014)

with(
  dge_voom[row_carm1, ],
    lm.wfit(x = design, y = as.vector(E), w = as.vector(weights)) #<<
)$coefficients %*% contrasts
##       Contrasts
##            sh_55     sh_57 sh_average
##   [1,] -2.065455 -2.922189  -2.493822

limma does other stuff …

Source: Richie et al. Nuc. Acids Res. (2015)

Source: Richie et al. Nuc. Acids Res. (2015)

Generalisations so far

Why should we assume / allow

  • the residuals are normal?

  • only a single outcome vector?

  • that each hypothesis should be ran completely independently?

  • equal weighting of all samples?

Some Univariate Distibutions

Source: Ehsan Azhdari [CC BY-SA 3.0] via Wikipedia

Source: Ehsan Azhdari [CC BY-SA 3.0] via Wikipedia

The ‘why should the residuals be Normal?’ generalisation: (edgeR / DESeq2)

\[\mathbf{y} \sim \mathcal{N} \left(X\boldsymbol{\beta}, \sigma^2I \right)\]

that is, for the \(i\)th observation:

\[y_i \sim \mathcal{N} \left(\Sigma_j X_{ij}\beta_j, \sigma^2 \right)\]

\[\downarrow\]

\[y_i \sim \mathcal{\psi} \left(\Sigma_j X_{ij}\beta_j,\ other\ params \right)\]

eg, generalised linear models

RNA-Seq counts are counts

Count distributions (examples)

  • {0, 1} Bernoulli
  • {0, 1, 2, …, n} Binomial
  • {0, 1, 2, …, Inf} Poisson, Geometric

Negative-Binomial \(\mathcal{NB} \left( r, q \right)\)

  • used in both edgeR and DESeq2

  • generalisation of

    • Poisson
    • Geometric
  • practically: just more flexible than Poisson

  • many ways of parameterising:

    • r = size, q = prob of success
    • m = mean = \(r(1-q)/q\) , v = variance = \(r(1-q)/q^2\)

The edgeR model

For a gene \(g\) and sample \(i\) (from experimental group \(k\))

\[y_i \sim \mathcal{NB}\left( mean = m_i, var = v_i \right)\]

where

  • \(m_i = M_i p_k\)

    • \(M_i\) = library size

    • \(p_k\) = relative abundance of \(g\) in experimental group \(k\)

  • \(v_i = m_i\left( 1 + m_i \phi \right)\)

  • \(p_k = \exp\left( \Sigma_j {X_{ij}\beta_j} \right)\)

  • \(y_i\) is the observed count

  • edgeR call \(\phi\) the dispersion parameter (note, \(\phi = 0\) is a Poisson model; note \(\phi ^ {1/2} = BCV\))

dge_edger <- edgeR::estimateDisp(dge, design = design, tagwise = TRUE)

fit_edger <- edgeR::glmFit(dge_edger, design = design)
# convert coefs to log2
fit_edger$coefficients[row_carm1, ] * log2(exp(1))
##   (Intercept)         mv411         skno1          sh55          sh57 
## -15.487436975   0.002539948   0.356581304  -2.033617365  -2.871841062
##                     sh_55     sh_57 sh_average
## `lm()`          -2.089766 -2.894996  -2.492381
## `limma::eBayes` -2.065455 -2.922189  -2.493822
## `edgeR::glmFit` -2.033617 -2.871841  -2.452729

To get (approx) the same coefficients as edgeR using glm

model_data <- with(
  dge_edger,
  data.frame(
    counts = as.vector(counts[row_carm1, ]),
    samples
))
nb <- glm(
  # note that the `offset()` introduces a fixed constant (it's included in the
  # model with a coefficient of 1)
  counts ~ offset(log(lib.size)) + cell + treatment,
  data = model_data,
  family = MASS::negative.binomial(
    # R/MASS uses theta = 1 / phi, where phi is `edgeR`'s dispersion parameter
    # note that v = mu + mu^2/theta = mu + phi * mu^2
    theta = 1 / dge_edger$tagwise.dispersion[row_carm1]
  )
)

nb$coefficients * log2(exp(1))
##   (Intercept)     cellmv411     cellskno1 treatmentsh55 treatmentsh57 
## -15.487681768   0.002493712   0.356768237  -2.034286473  -2.873166449
(nb$coefficients * log2(exp(1))) %*% contrasts
##       Contrasts
##            sh_55     sh_57 sh_average
##   [1,] -2.034286 -2.873166  -2.453726

The ‘why should all the effects be fixed?’ generalisation: (lme4, variancePartition)

\[\mathbf{y} \sim X\boldsymbol{\beta} + \mathcal{N} \left(\boldsymbol{0}, \sigma^2I \right)\] that is,

\[\mathbf{y} \sim {Fixed\ Effects} + Noise\] \[\downarrow\] \[\mathbf{y} \sim [Fixed Effects] + [Random Effects] + Noise\]

Sources of variability in the current experiment:

  • Cell type

  • Batch (not modelled here)

  • Treatment

  • Technical issues, noise etc

Fixed effect:

  • Unknown constant we are trying to estimate

Why are we estimating the baseline difference between cell types?

Random effect:

  • A model parameter that is a random variable

  • You estimate the properties of its distribution, not its value

Mixed Model version

voom::duplicateCorrelation

# limma can accomodate at most one random effect
treatment_design <- model.matrix(
  ~ -1 + treatment, data = dge_voom$targets
)
treatment_contrast <- makeContrasts(
  sh55 = "treatmentsh55 - treatmentshct",
  sh57 = "treatmentsh57 - treatmentshct",
  levels = treatment_design
)
# 'random effects' as used in limma
blocks <- factor(dge_voom$targets$cell)

duplicateCorrelation (cont.)

dupcor <- limma::duplicateCorrelation(
  dge_voom, treatment_design, block = blocks)
## Warning in glmgam.fit(dx, dy, coef.start = start, tol = tol, maxit =
## maxit, : Too much damping - convergence tolerance not achievable
fit_cor <- lmFit(
  dge_voom, treatment_design,
  block = blocks, correlation = dupcor$consensus #<<
)
fit_cor$coefficients[row_carm1, ] %*% treatment_contrast
##       Contrasts
##             sh55      sh57
##   [1,] -2.069139 -2.930064

variancePartition

vp <- variancePartition::dream(
  dge_voom[row_carm1, ], # slow for all genes; recommend parallel
  formula = ~ -1 + treatment + (1|cell), # lme4 model syntax #<<
  data = dge_voom$targets,
  L = treatment_contrast
)
## Projected memory usage: > 30.1 Kb 
## 
## Finished...
## Total: 0 s
vp$coefficients
##                      sh55      sh57
## ENSG00000142453 -2.089766 -2.894996

Multiple random effects in variancePartition::dream

vp2 <- variancePartition::dream(
  dge_voom[row_carm1, ],
  formula = ~ -1 + treatment + (1|cell) + (1|batch), #<<
  data = dge_voom$targets,
  L = treatment_contrast
)
## Projected memory usage: > 34.4 Kb 
## 
## Finished...
## Total: 0 s
vp2$coefficients
##                      sh55      sh57
## ENSG00000142453 -2.089766 -2.894996

The ‘beyond the scope of the talk, and of my explanatory powers’ generalisations …

  • fully Bayesian approaches (STAN etc)

  • genes aren’t independent of each other …

    • epigenomic context

    • coregulatory modules

  • dynamic modelling of gene expression …

  • the cell population milkshake …

Further Reading

Appendix

Was the model appropriate?

[Could have included batch-effect and cell/treatment interaction]

NB = Gamma-Poisson Mixture

To sample 1000 NB(r, p) values

  • sample 1000 Gamma(shape = r, scale = p/(1-p)) values

  • then for each Gamma value, g, sample a Poisson(rate = g)

  • The latter are NB(r, p) // but R uses 1 - p rather than p

edgeR fits larger coefficients for poorly-detected genes than limma

## Warning: Removed 6 rows containing missing values (geom_point).

Using stan to fit a negative binomial model for RNA-Seq

suppressPackageStartupMessages(
  library(rstan)
)

stan (cont.)

# modified from http://rstudio-pubs-static.s3.amazonaws.com/34099_2e35c3966ef548c2918d5b6c2146bfd1.html

model <- "
// negative binomial parameterized as eta (log(mu)) and dispersion (phi)
// see p286 in stan-reference-2.4.0.pdf
// a basic GLM example
data {
  int<lower=1> N;    // rows of data
  matrix[N, 5] X;    // predictor
  vector[N] L;       // lib sizes
  int<lower=0> y[N]; // response
  real phi;          // edgeR dispersion parameter (glm() and stan use the
                     //   inverse of this to specify mean/variance relationship;
                     //   we call the inverse `theta` for consistency with glm())
                     // Apologies for the confusion, stan's docs refer to our
                     //   `theta` as `phi`
}
parameters {
  vector[5] beta; // coefficients for the linear component
}
transformed parameters {
  vector[N] y_hat = L .* exp(X * beta); // linear component
  real theta = 1 / phi;
}
model {
  // data model:
  y ~ neg_binomial_2(y_hat, theta);
}
"

stan (cont.)

stan_data <- list(
  N = ncol(dge_edger),
  y = as.vector(dge_edger$counts[row_carm1, ]),
  L = as.vector(dge_edger$samples$lib.size),
  X = set_colnames(design, c("intercept", colnames(design)[-1])),
  phi = dge_edger$tagwise.dispersion[row_carm1]
)
write(model, file = here("models/negbin-glm.stan"))
sm <- stan_model(here("models/negbin-glm.stan"))

stan (cont.)

# refresh = 0 prevents sampling output from printing
m <- sampling(sm, data = stan_data,
  pars = c("beta"),
  iter = 2000, chains = 4, verbose = FALSE, refresh = 0)
log2(exp(1)) *
  rbind(
    edgeR = fit_edger$coefficients[row_carm1, ],
    glm = nb$coefficients,
    stan = summary(m, "beta")$summary[, "mean"]
) %*% contrasts
##        Contrasts
##             sh_55     sh_57 sh_average
##   edgeR -2.033617 -2.871841  -2.452729
##   glm   -2.034286 -2.873166  -2.453726
##   stan  -2.039033 -2.873223  -2.456128