Meta-analysis for genomic data

A Bioc2014 workshop

author: Levi Waldron
date: August 1, 2014


  • Introduction and Motivation
  • Preparation
    • Finding datasets
    • Downloading datasets
    • Curation
    • Preprocessing and gene mapping
    • Duplicate checking
  • Fixed and Random Effects Synthesis
  • Assessing Heterogeneity
  • Leave-one-dataset-in and Leave-one-dataset-out Validation of Prediction Models

Scope: what is meta-analysis?

  • Broad definition: the full scope of among-study analysis
  • Narrow definition: a synthesis of per-study estimates

  • Not: pooling of per-patient data

“We understand meta-analysis as being the use of statistical techniques to combine the results of studies addressing the same question into a summary measure.”

Villar et al. (2001)

Scope: what is meta-analysis?

Baggstrom et al.

Classic meta-analysis: Third-generation agents compared with best supportive care. Differences in 1-year survival proportions (Baggstrom et al. 2007).

  • In genomics we extend to thousands of variables

Systematic meta-analysis

  • Related to systematic review
  • Formalizes the study inclusion and exclusion process
  • PRISMA Guidelines (rarely cited, less relevant for genomics?)

Preparation: finding datasets

  • Systematic literature review
  • Gene Expression Omnibus (GEO)
    • web page (filter by species, disease, sample size)
    • GEOmetadb Bioconductor package (requires SQL knowledge)
  • ArrayExpress
    • web page also includes many GEO datasets
    • Bioconductor package has search features
  • InSilicoDB
    • better curation, lower coverage

Preparation: downloading datasets

  • GEOquery::getGEO() is a workshorse
    • maximum coverage, minimum frills
    • all metadata included, most is irrelevant
    • large studies limited to 256 patients per list element
    • processed data as uploaded by authors -> list of ExpressionSets
    • no probeset to gene mapping

Preparation: downloading datasets (cont'd)

  • A couple helpful functions from LeviRmisc
    • getGEO2(): consolidate and simplify getGEO() output
    • geoPmidLookup(): look up experiment and publication data from GEO and Pubmed, put in dataframe
df <- geoPmidLookup(c("GSE26712", "PMID18593951")) 

Preparation: curation

  • per-sample metadata must be standardized across studies
  • process is error-prone and it is easy to miss mistakes in high-throughput analysis
  • therefore template-based syntax checking highly recommendable, e.g. see my template and checker for curatedOvarianData.

Preparation: preprocessing and gene mapping

  • it is possible and desirable to synthesize across array platforms
  • common preprocessing is desirable but not necessary
    • deal with non-standardized preprocessing through gene scaling, e.g. z-score
  • must map probeset IDs to common gene identifiers:
    • if using a representative probeset for a gene, best to use the same one in each dataset
    • alternatively, simply average redundant probesets

Preparation: duplicate checking

  • duplicate samples bias meta-analysis
  • be very cautious of multiple studies from the same institution - check sample identifiers and expression profiles
  • doppelgangR package for high-throughput duplicate checking

Fixed and Random Effects Synthesis

Fixed-effects model: \[ \hat{\theta_k} = \theta + \sigma_k \epsilon_k, \; \epsilon_k \stackrel{iid}{\sim} N(0, 1) \]

Random-effects model: \[ \hat{\theta_k} = \theta + \mu_k + \sigma_k \epsilon_k, \; \epsilon_k \stackrel{iid}{\sim} N(0, 1); \mu_k \stackrel{iid}{\sim} N(0, \tau^2) \]

\[ \begin{align*} \textrm{where: }\hat{\theta_k} &= \textrm{effect size estimate from study}\, k \\ \theta &= \textrm{synthesized effect size} \\ \sigma_k &= \textrm{standard error of effect size in study}\, k \\ \epsilon_k &= \textrm{within-study error term}\\ \mu_k &= \textrm{between-study (random-effects) error term in RE model}\\ \tau^2 &= \textrm{variance of "true effects" across studies in RE model} \end{align*} \]

Fixed effects model

\[ \hat{\theta_k} = \theta + \sigma_k \epsilon_k, \; \epsilon_k \stackrel{iid}{\sim} N(0, 1) \]

Maximum-likelihood estimate of \( \theta \) under fixed-effects model is:

\[ \hat{\theta}_F = \frac{\sum\limits_{k=1}^{k}w_k \hat{\theta}_k}{\sum\limits_{k=1}^k w_k}\\ \textrm{where: } w_k = 1 / \sigma_k^2 \]

i.e., the synthesized effect is the average of the study-specific effects, weighted by the inverse squared standard error

Fixed effects model

  • Variance of fixed-effects estimate is the inverse mean of the study-specific weights: \[ S.E.(\hat{\theta}_F) = \sqrt{\frac{1}{\sum\limits_{k=1}^K w_k}} \]

\[ \begin{align*} \textrm{where }w_k &= 1 / \sigma_k^2,\\ \sigma_k &= \textrm{study-specific standard errors} \end{align*} \]

  • \( (1-\alpha) \) confidence interval: \[ \hat{\theta}_F \pm z_{(1-\alpha/2)} \times S.E.(\hat{\theta}_F) \]

Random effects model (cont'd)

  • DerSimonian and Laird RE estimate of \( \hat\theta_F \):

\[ \begin{align*} \scriptsize \hat{\theta}_F &= \scriptsize \frac{\sum\limits_{k=1}^{k}w^*_k \hat{\theta}_k}{\sum\limits_{k=1}^k w^*_k}\\ \textrm{where: } w^*_k &= 1 / (\tau^2 + \sigma_k^2)\\ \end{align*} \]

  • with no heterogeneity (\( \tau = 0 \)), this converges to the FE estimate
  • with large heterogeneity (\( \tau \rightarrow \infty \)) it is the simple average of individual study estimates

Random effects model (cont'd)

  • Definitions of Q and S: \[ \begin{align*} \hat{\tau}^2 &= max\left(\frac{Q-(K-1)}{S}, 0\right),\\ Q &= \sum\limits_{k=1}^K w_k (\hat\theta_k - \hat\theta_F)^2,\\ S &= \sum\limits_{k=1}^K w_k - \frac{\sum\limits_{k=1}^K w_k^2}{\sum\limits_{k=1}^K w_k} \end{align*} \]

Assessing Heterogeneity

  • Standard hypothesis test for heterogeneity: under the null hypothesis of no heterogeneity between studies (\( \tau = 0 \)), \[ Q \sim \chi^2_{K-1} \]

  • Standard descriptions of heterogeneity:

    • \( \tau^2 \): estimate of total amount of heterogeneity
    • \( I^2 \): % of total variability due to heterogeneity
    • \( H^2 \): total variability / within-study variance
  • For further info:

    • Viechtbauer W: Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 2010.

Example 1: Is CXCL12 gene a prognostic factor for ovarian cancer?

Load the curatedOvarianData package, look at available datasets:


Load (and check out) rules defined in default configuration file:

download.file("", method="wget", destfile="patientselection.config")
impute.missing <- TRUE
keep.common.only <- TRUE

Example 1 (cont'd)

Create list of ExpressionSets meeting criteria:

download.file("", method="wget", destfile="createEsetList.R")
[1] 10

Example 1 (cont'd)

  • Calculate “effect size” log(HR) and S.E. for one dataset:
runCox <- function(eset, probeset="CXCL12"){
  eset$y <- Surv(eset$days_to_death, eset$vital_status == "deceased")
  if(probeset %in% featureNames(eset)){
    obj <- coxph(eset$y ~ scale(t(exprs(eset[probeset, ]))[, 1]))
    output <- c(obj$coefficients, sqrt(obj$var))
    names(output) <- c("log.HR", "SE")
  }else{output <- NULL}
log.HR     SE 
0.1080 0.1167 

Example 1 (cont'd)

  • Calculate “effect size” (HR) and Standard Error for all datasets:
study.coefs <- t(sapply(esets, runCox)); head(study.coefs)
                         log.HR      SE
E.MTAB.386_eset        0.108038 0.11671
GSE13876_eset         -0.015534 0.12165
GSE17260_eset          0.196605 0.22132
GSE18520_eset          0.004335 0.15786
GSE19829.GPL8300_eset  0.072413 0.19658
GSE26712_eset          0.192926 0.09192

Example 1 (cont'd): forest plot

res.fe <- metafor::rma(yi=study.coefs[, 1], sei=study.coefs[, 2], method="FE")
forest.rma(res.fe, slab=gsub("_eset$","",rownames(study.coefs)), atransf=exp)

plot of chunk unnamed-chunk-8

Example 1 (cont'd): FE vs. RE

( <- metafor::rma(yi=study.coefs[, 1], sei=study.coefs[, 2], method="DL"))

Random-Effects Model (k = 10; tau^2 estimator: DL)

tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.0073)
tau (square root of estimated tau^2 value):      0
I^2 (total heterogeneity / total variability):   0.00%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity: 
Q(df = 9) = 5.3487, p-val = 0.8029

Model Results:

estimate       se     zval     pval    ci.ub          
  0.0966   0.0381   2.5334   0.0113   0.0219   0.1714        * 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Example 1 (cont'd): closing comments

  • Between-study variability is completely consistent with zero heterogeneity between studies
  • Replace simple univariate regression with multivariate regression to correct for known clinical factors (e.g. see Ganzfried et. al. 2013)
  • Replace HR with any coefficient + S.E.
  • Replace single probeset with any fully-specified score or classifier

Example 2: rank all genes by prognostic association

if( !require("survHD") || package.version("survHD") != "0.5.0" ){
download.file("", destfile="metaCMA.R", method="wget")
gene.coefs <- metaCMA.coefs(esets)
FE.res <- metaCMA.opt(esets=esets, coefs=gene.coefs, rma.method="FE", n=200)

Riester et al. JNCI 2014 - see Bitbucket page to reproduce paper and find helpful tools

Example 3: Leave-one-dataset-out validation

  • Validation of prediction models
  • Use 1 study to test, K-1 studies to train
  • Use meta-analysis of K-1 studies to get univariate coefficients e.g. to generate Tukey's “Compound Covariate” model
LODO.res <- metaCMA(esets,coefs=gene.coefs,n=200, rma.method="FE")

Example 3: Leave-one-dataset-out validation (cont'd)



Leave-one-dataset-out validation of a survival signature. (Riester et al. JNCI 2014)

Leave-one-dataset-in validation


Leave-one-dataset-in validation (cont'd)

“Improvement over random signatures (IOR)” score of gene signatures relative to random gene signatures, equalizing the influences of authors’ algorithms for generating risk scores, quality of the original training data, and gene signature size (Waldron et al. JNCI 2014).
source scripts: genMumatrix.R and analyseMumatrix.R


  • many alternatives for meta-analysis of genomics experiments have been proposed, none as flexible or well-understood as traditional approaches
  • metafor R package is highly recommendable and well-documented Viechtbauer 2010
  • simple pooling makes it difficult to assess heterogeneity arising from genomic or clinical data

DIY exercise: suboptimal debulking

  • Change patientselection.config filter to keep genes with debulking variable
  • Identify the most differentially expressed gene with respect to optimal / suboptimal debulking. Manually:
    • Use rowttests() from genefilter package to get per-study fold-change and standard error
    • Convert t-statistic and difference in means (dm) to S.E.: \[ t=\frac{dm}{S.E.} \]
    • Synthesize per-gene using metafor::rma()
  • Automatically: Use metaCMA.coefs() and metaCMA.opt()