Meta-analysis for genomic data - a short tutorial

Stockholm workshop: Data Integration in Biomedical Research

author: Levi Waldron
date: May 3, 2016

Outline

  • Introduction and Motivation
  • Preparation
    • Downloading datasets
    • Curation
    • Preprocessing and gene mapping
    • Duplicate checking
  • Fixed and Random Effects Synthesis
    • Assessing Heterogeneity
  • Leave-one-dataset-in and Leave-one-dataset-out Validation of Prediction Models

Scope: what is meta-analysis?

  • Broad definition: the full scope of among-study analysis
  • Narrow definition: a synthesis of per-study estimates

  • Not: pooling of per-patient data

“We understand meta-analysis as being the use of statistical techniques to combine the results of studies addressing the same question into a summary measure.”

Villar et al. (2001)

Scope: what is meta-analysis?

Baggstrom et al.

Classic meta-analysis: Third-generation agents compared with best supportive care. Differences in 1-year survival proportions (Baggstrom et al. 2007).

  • In genomics we extend to thousands of variables

Preparation: downloading datasets (cont'd)

  • A couple helpful functions from LeviRmisc
    • getGEO2(): consolidate and simplify getGEO() output
    • geoPmidLookup(): look up experiment and publication data from GEO and Pubmed, put in dataframe
library(LeviRmisc)
df <- geoPmidLookup(c("GSE26712", "PMID18593951")) 

Preparation: curation

  • per-sample metadata must be standardized across studies
  • process is error-prone and it is easy to miss mistakes in high-throughput analysis
  • therefore template-based syntax checking highly recommendable, e.g. see my template and checker for curatedOvarianData.

Preparation: preprocessing and gene mapping

  • it is possible and desirable to synthesize across array platforms
  • common preprocessing is desirable but not necessary
    • deal with non-standardized preprocessing through gene scaling, e.g. z-score
  • must map probeset IDs to common gene identifiers
    • microarray: if using a representative probeset for a gene, best to use the same one in each dataset
    • RNA-seq: map to same genomic build across datasets

Preparation: duplicate checking

  • duplicate samples bias meta-analysis
  • be very cautious of multiple studies from the same institution - check sample identifiers and expression profiles
  • Waldron et al. The doppelgänger effect: hidden duplicates in databases of transcriptome profiles. JNCI 2016 (In Press)

Fixed and Random Effects Synthesis

  • Fixed-effects model

    • Effect size is the same in every study, variation in estimates is due to individual variability only
  • Random-effects model

    • Effect size is normally distributed across studies, variation in estimates is due to individual variability and study variability

Assessing Heterogeneity

  • Standard hypothesis test for heterogeneity: under the null hypothesis of no heterogeneity between studies (\( \tau = 0 \)), \[ Q \sim \chi^2_{K-1} \]

  • Standard descriptions of heterogeneity:

    • \( \tau^2 \): estimate of total amount of heterogeneity
    • \( I^2 \): % of total variability due to heterogeneity
    • \( H^2 \): total variability / within-study variance
  • For further info:

    • Viechtbauer W: Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 2010.

Example 1: Is CXCL12 gene a prognostic factor for ovarian cancer?

Load the curatedOvarianData package, look at available datasets:

library(curatedOvarianData)
data(package="curatedOvarianData")

Load (and check out) rules defined in default configuration file:

source("patientselection.config")
impute.missing <- TRUE
keep.common.only <- TRUE

Example 1 (cont'd)

Create list of ExpressionSets meeting criteria:

source("createEsetList.R")
length(esets)
[1] 14

Example 1 (cont'd)

  • Calculate “effect size” log(HR) and S.E. for one dataset:
runCox <- function(eset, probeset="CXCL12"){
  library(survival)
  eset$y <- Surv(eset$days_to_death, eset$vital_status == "deceased")
  if(probeset %in% featureNames(eset)){
    obj <- coxph(eset$y ~ scale(t(exprs(eset[probeset, ]))[, 1]))
    output <- c(obj$coefficients, sqrt(obj$var))
    names(output) <- c("log.HR", "SE")
  }else{output <- NULL}
    output}
runCox(esets[[1]])
   log.HR        SE 
0.1080378 0.1167063 

Example 1 (cont'd)

  • Calculate “effect size” (HR) and Standard Error for all datasets:
study.coefs <- t(sapply(esets, runCox)); head(study.coefs)
                            log.HR        SE
E.MTAB.386_eset        0.108037829 0.1167063
GSE13876_eset         -0.015533625 0.1216511
GSE17260_eset          0.196604844 0.2213214
GSE18520_eset          0.004334577 0.1578573
GSE19829.GPL8300_eset  0.072413433 0.1965850
GSE26193_eset         -0.035518891 0.1688681

Example 1 (cont'd): forest plot

plot of chunk unnamed-chunk-9

Example 1 (cont'd): FE vs. RE

(res.re <- metafor::rma(yi=study.coefs[, 1], sei=study.coefs[, 2], method="DL"))

Random-Effects Model (k = 14; tau^2 estimator: DL)

tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.0062)
tau (square root of estimated tau^2 value):      0
I^2 (total heterogeneity / total variability):   0.00%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity: 
Q(df = 13) = 11.2219, p-val = 0.5922

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub          
  0.1108   0.0329   3.3664   0.0008   0.0463   0.1754      *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Example 1 (cont'd): closing comments

  • Between-study variability is completely consistent with zero heterogeneity between studies
  • Replace simple univariate regression with multivariate regression to correct for known clinical factors (e.g. see Ganzfried et. al. 2013)
  • Replace HR with any coefficient + S.E.
  • Replace single probeset with any fully-specified score or classifier

Example 2: Leave-one-dataset-out validation

  • Validation of prediction models
  • Use 1 study to test, K-1 studies to train
  • Use meta-analysis of K-1 studies to get univariate coefficients e.g. to generate Tukey's “Compound Covariate” model
LODO.res <- metaCMA(esets,coefs=gene.coefs,n=200, rma.method="FE") 

Example 2: Leave-one-dataset-out validation (cont'd)

 

 
 
 

Leave-one-dataset-out validation of a survival signature. (Riester et al. JNCI 2014)

Leave-one-dataset-in validation

zmatrix

Leave-one-dataset-in validation (cont'd)

“Improvement over random signatures (IOR)” score of gene signatures relative to random gene signatures, equalizing the influences of authors’ algorithms for generating risk scores, quality of the original training data, and gene signature size (Waldron et al. JNCI 2014).
source scripts: genMumatrix.R and analyseMumatrix.R

Resources

Conclusions

  • many alternatives for meta-analysis of genomics experiments have been proposed, none as flexible or well-understood as traditional approaches
  • simple pooling makes it difficult to assess heterogeneity arising from genomic or clinical data
  • metafor R package is highly recommendable and well-documented Viechtbauer 2010
  • For multi-assay experiments, MultiAssayExperiment in bioc-devel