Meta-analysis for genomic data - a short tutorial

Stockholm workshop: Data Integration in Biomedical Research

author: Levi Waldron
date: May 3, 2016

Outline

Introduction and Motivation
Preparation
- Downloading datasets
- Curation
- Preprocessing and gene mapping
- Duplicate checking
Fixed and Random Effects Synthesis
- Assessing Heterogeneity
Leave-one-dataset-in and Leave-one-dataset-out Validation of Prediction Models

Scope: what is meta-analysis?

Broad definition: the full scope of among-study analysis
Narrow definition: a synthesis of per-study estimates
Not: pooling of per-patient data

“We understand meta-analysis as being the use of statistical techniques to combine the results of studies addressing the same question into a summary measure.”

Villar et al. (2001)

Scope: what is meta-analysis?

Baggstrom et al.

Classic meta-analysis: Third-generation agents compared with best supportive care. Differences in 1-year survival proportions (Baggstrom et al. 2007).

In genomics we extend to thousands of variables

Preparation: downloading datasets (cont'd)

A couple helpful functions from LeviRmisc
- getGEO2(): consolidate and simplify getGEO() output
- geoPmidLookup(): look up experiment and publication data from GEO and Pubmed, put in dataframe

library(LeviRmisc)
df <- geoPmidLookup(c("GSE26712", "PMID18593951"))

Preparation: curation

per-sample metadata must be standardized across studies
process is error-prone and it is easy to miss mistakes in high-throughput analysis
therefore template-based syntax checking highly recommendable, e.g. see my template and checker for curatedOvarianData.

Preparation: preprocessing and gene mapping

it is possible and desirable to synthesize across array platforms
common preprocessing is desirable but not necessary
- deal with non-standardized preprocessing through gene scaling, e.g. z-score
must map probeset IDs to common gene identifiers
- microarray: if using a representative probeset for a gene, best to use the same one in each dataset
- RNA-seq: map to same genomic build across datasets

Preparation: duplicate checking

duplicate samples bias meta-analysis
be very cautious of multiple studies from the same institution - check sample identifiers and expression profiles
Waldron et al. The doppelgänger effect: hidden duplicates in databases of transcriptome profiles. JNCI 2016 (In Press)
- doppelgangR Bioconductor package

Fixed and Random Effects Synthesis

Fixed-effects model
- Effect size is the same in every study, variation in estimates is due to individual variability only
Random-effects model
- Effect size is normally distributed across studies, variation in estimates is due to individual variability and study variability

Assessing Heterogeneity

Standard hypothesis test for heterogeneity: under the null hypothesis of no heterogeneity between studies (\( \tau = 0 \)), \[ Q \sim \chi^2_{K-1} \]
Standard descriptions of heterogeneity:
- \( \tau^2 \): estimate of total amount of heterogeneity
- \( I^2 \): % of total variability due to heterogeneity
- \( H^2 \): total variability / within-study variance
For further info:
- Viechtbauer W: Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 2010.

Example 1: Is CXCL12 gene a prognostic factor for ovarian cancer?

Load the curatedOvarianData package, look at available datasets:

library(curatedOvarianData)
data(package="curatedOvarianData")

Load (and check out) rules defined in default configuration file:

source("patientselection.config")
impute.missing <- TRUE
keep.common.only <- TRUE

Example 1 (cont'd)

Create list of ExpressionSets meeting criteria:

source("createEsetList.R")

length(esets)

[1] 14

Example 1 (cont'd)

Calculate “effect size” log(HR) and S.E. for one dataset:

runCox <- function(eset, probeset="CXCL12"){
  library(survival)
  eset$y <- Surv(eset$days_to_death, eset$vital_status == "deceased")
  if(probeset %in% featureNames(eset)){
    obj <- coxph(eset$y ~ scale(t(exprs(eset[probeset, ]))[, 1]))
    output <- c(obj$coefficients, sqrt(obj$var))
    names(output) <- c("log.HR", "SE")
  }else{output <- NULL}
    output}
runCox(esets[[1]])

   log.HR        SE 
0.1080378 0.1167063

Example 1 (cont'd)

Calculate “effect size” (HR) and Standard Error for all datasets:

study.coefs <- t(sapply(esets, runCox)); head(study.coefs)

                            log.HR        SE
E.MTAB.386_eset        0.108037829 0.1167063
GSE13876_eset         -0.015533625 0.1216511
GSE17260_eset          0.196604844 0.2213214
GSE18520_eset          0.004334577 0.1578573
GSE19829.GPL8300_eset  0.072413433 0.1965850
GSE26193_eset         -0.035518891 0.1688681

Example 1 (cont'd): forest plot

plot of chunk unnamed-chunk-9

Example 1 (cont'd): FE vs. RE

(res.re <- metafor::rma(yi=study.coefs[, 1], sei=study.coefs[, 2], method="DL"))


Random-Effects Model (k = 14; tau^2 estimator: DL)

tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.0062)
tau (square root of estimated tau^2 value):      0
I^2 (total heterogeneity / total variability):   0.00%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity: 
Q(df = 13) = 11.2219, p-val = 0.5922

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub          
  0.1108   0.0329   3.3664   0.0008   0.0463   0.1754      *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example 1 (cont'd): closing comments

Between-study variability is completely consistent with zero heterogeneity between studies
Replace simple univariate regression with multivariate regression to correct for known clinical factors (e.g. see Ganzfried et. al. 2013)
Replace HR with any coefficient + S.E.
Replace single probeset with any fully-specified score or classifier
- noting that random gene sets are likely to be prognostic (Waldron et al. JNCI 2014, Venet et al. PLoS Comp Bio 2011)

Example 2: Leave-one-dataset-out validation

Validation of prediction models
Use 1 study to test, K-1 studies to train
Use meta-analysis of K-1 studies to get univariate coefficients e.g. to generate Tukey's “Compound Covariate” model

LODO.res <- metaCMA(esets,coefs=gene.coefs,n=200, rma.method="FE")

Example 2: Leave-one-dataset-out validation (cont'd)

Leave-one-dataset-out validation of a survival signature. (Riester et al. JNCI 2014)

Leave-one-dataset-in validation

Independent datasets for evaluation of prediction models or gene signatures
Train and test using all dataset pairs (Waldron et al. JNCI 2014, Bernau et al. Bioinformatics 2014, Zhao et al. Bioinformatics 2014)

zmatrix

Leave-one-dataset-in validation (cont'd)

“Improvement over random signatures (IOR)” score of gene signatures relative to random gene signatures, equalizing the influences of authors’ algorithms for generating risk scores, quality of the original training data, and gene signature size (Waldron et al. JNCI 2014).
source scripts: genMumatrix.R and analyseMumatrix.R

Resources

Riester et al. JNCI 2014 and associated Bitbucket page
(Waldron et al. JNCI 2014) and associated Bitbucket page
Bioconductor packages: curatedOvarianData, curatedBladderData, curatedCRCData, doppelgangR, simulatorZ
Book chapter: Waldron L, Riester M: Meta-Analysis in Gene Expression Studies. Methods Mol. Biol. 2016, 1418:161–176.
Longer workshop materials from Bioc2014
Code to make this presentation on Github

Conclusions

many alternatives for meta-analysis of genomics experiments have been proposed, none as flexible or well-understood as traditional approaches
simple pooling makes it difficult to assess heterogeneity arising from genomic or clinical data
metafor R package is highly recommendable and well-documented Viechtbauer 2010
For multi-assay experiments, MultiAssayExperiment in bioc-devel