Meta-analysis for genomic data

A Bioc2014 workshop

author: Levi Waldron
date: August 1, 2014

Outline

Introduction and Motivation
Preparation
- Finding datasets
- Downloading datasets
- Curation
- Preprocessing and gene mapping
- Duplicate checking
Fixed and Random Effects Synthesis
Assessing Heterogeneity
Leave-one-dataset-in and Leave-one-dataset-out Validation of Prediction Models

Scope: what is meta-analysis?

Broad definition: the full scope of among-study analysis
Narrow definition: a synthesis of per-study estimates
Not: pooling of per-patient data

“We understand meta-analysis as being the use of statistical techniques to combine the results of studies addressing the same question into a summary measure.”

Villar et al. (2001)

Scope: what is meta-analysis?

Baggstrom et al.

Classic meta-analysis: Third-generation agents compared with best supportive care. Differences in 1-year survival proportions (Baggstrom et al. 2007).

In genomics we extend to thousands of variables

Systematic meta-analysis

Related to systematic review
Formalizes the study inclusion and exclusion process
PRISMA Guidelines (rarely cited, less relevant for genomics?)

Preparation: finding datasets

Systematic literature review
Gene Expression Omnibus (GEO)
- web page (filter by species, disease, sample size)
- GEOmetadb Bioconductor package (requires SQL knowledge)
ArrayExpress
- web page also includes many GEO datasets
- Bioconductor package has search features
InSilicoDB
- better curation, lower coverage

Preparation: downloading datasets

GEOquery::getGEO() is a workshorse
- maximum coverage, minimum frills
- all metadata included, most is irrelevant
- large studies limited to 256 patients per list element
- processed data as uploaded by authors -> list of ExpressionSets
- no probeset to gene mapping

Preparation: downloading datasets (cont'd)

A couple helpful functions from LeviRmisc
- getGEO2(): consolidate and simplify getGEO() output
- geoPmidLookup(): look up experiment and publication data from GEO and Pubmed, put in dataframe

library(LeviRmisc)
df <- geoPmidLookup(c("GSE26712", "PMID18593951"))

Preparation: curation

per-sample metadata must be standardized across studies
process is error-prone and it is easy to miss mistakes in high-throughput analysis
therefore template-based syntax checking highly recommendable, e.g. see my template and checker for curatedOvarianData.

Preparation: preprocessing and gene mapping

it is possible and desirable to synthesize across array platforms
common preprocessing is desirable but not necessary
- deal with non-standardized preprocessing through gene scaling, e.g. z-score
must map probeset IDs to common gene identifiers:
- if using a representative probeset for a gene, best to use the same one in each dataset
- alternatively, simply average redundant probesets

Preparation: duplicate checking

duplicate samples bias meta-analysis
be very cautious of multiple studies from the same institution - check sample identifiers and expression profiles
doppelgangR package for high-throughput duplicate checking

Fixed and Random Effects Synthesis

Fixed-effects model: \[ \hat{\theta_k} = \theta + \sigma_k \epsilon_k, \; \epsilon_k \stackrel{iid}{\sim} N(0, 1) \]

Random-effects model: \[ \hat{\theta_k} = \theta + \mu_k + \sigma_k \epsilon_k, \; \epsilon_k \stackrel{iid}{\sim} N(0, 1); \mu_k \stackrel{iid}{\sim} N(0, \tau^2) \]

\[ \begin{align*} \textrm{where: }\hat{\theta_k} &= \textrm{effect size estimate from study}\, k \\ \theta &= \textrm{synthesized effect size} \\ \sigma_k &= \textrm{standard error of effect size in study}\, k \\ \epsilon_k &= \textrm{within-study error term}\\ \mu_k &= \textrm{between-study (random-effects) error term in RE model}\\ \tau^2 &= \textrm{variance of "true effects" across studies in RE model} \end{align*} \]

Fixed effects model

\[ \hat{\theta_k} = \theta + \sigma_k \epsilon_k, \; \epsilon_k \stackrel{iid}{\sim} N(0, 1) \]

Maximum-likelihood estimate of \( \theta \) under fixed-effects model is:

\[ \hat{\theta}_F = \frac{\sum\limits_{k=1}^{k}w_k \hat{\theta}_k}{\sum\limits_{k=1}^k w_k}\\ \textrm{where: } w_k = 1 / \sigma_k^2 \]

i.e., the synthesized effect is the average of the study-specific effects, weighted by the inverse squared standard error

Fixed effects model

Variance of fixed-effects estimate is the inverse mean of the study-specific weights: \[ S.E.(\hat{\theta}_F) = \sqrt{\frac{1}{\sum\limits_{k=1}^K w_k}} \]

\[ \begin{align*} \textrm{where }w_k &= 1 / \sigma_k^2,\\ \sigma_k &= \textrm{study-specific standard errors} \end{align*} \]

\( (1-\alpha) \) confidence interval: \[ \hat{\theta}_F \pm z_{(1-\alpha/2)} \times S.E.(\hat{\theta}_F) \]

Random effects model (cont'd)

DerSimonian and Laird RE estimate of \( \hat\theta_F \):

\[ \begin{align*} \scriptsize \hat{\theta}_F &= \scriptsize \frac{\sum\limits_{k=1}^{k}w^*_k \hat{\theta}_k}{\sum\limits_{k=1}^k w^*_k}\\ \textrm{where: } w^*_k &= 1 / (\tau^2 + \sigma_k^2)\\ \end{align*} \]

with no heterogeneity (\( \tau = 0 \)), this converges to the FE estimate
with large heterogeneity (\( \tau \rightarrow \infty \)) it is the simple average of individual study estimates

Random effects model (cont'd)

Definitions of Q and S: \[ \begin{align*} \hat{\tau}^2 &= max\left(\frac{Q-(K-1)}{S}, 0\right),\\ Q &= \sum\limits_{k=1}^K w_k (\hat\theta_k - \hat\theta_F)^2,\\ S &= \sum\limits_{k=1}^K w_k - \frac{\sum\limits_{k=1}^K w_k^2}{\sum\limits_{k=1}^K w_k} \end{align*} \]

Assessing Heterogeneity

Standard hypothesis test for heterogeneity: under the null hypothesis of no heterogeneity between studies (\( \tau = 0 \)), \[ Q \sim \chi^2_{K-1} \]
Standard descriptions of heterogeneity:
- \( \tau^2 \): estimate of total amount of heterogeneity
- \( I^2 \): % of total variability due to heterogeneity
- \( H^2 \): total variability / within-study variance
For further info:
- Viechtbauer W: Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 2010.

Example 1: Is CXCL12 gene a prognostic factor for ovarian cancer?

Load the curatedOvarianData package, look at available datasets:

library(curatedOvarianData)
data(package="curatedOvarianData")

Load (and check out) rules defined in default configuration file:

download.file("https://bitbucket.org/lwaldron/ovrc4_sigvalidation/raw/tip/input/patientselection.config", method="wget", destfile="patientselection.config")
source("patientselection.config")
impute.missing <- TRUE
keep.common.only <- TRUE

Example 1 (cont'd)

Create list of ExpressionSets meeting criteria:

download.file("https://bitbucket.org/lwaldron/ovrc4_sigvalidation/raw/tip/src/createEsetList_source.R", method="wget", destfile="createEsetList.R")
source("createEsetList.R")

length(esets)

[1] 10

Example 1 (cont'd)

Calculate “effect size” log(HR) and S.E. for one dataset:

runCox <- function(eset, probeset="CXCL12"){
  library(survival)
  eset$y <- Surv(eset$days_to_death, eset$vital_status == "deceased")
  if(probeset %in% featureNames(eset)){
    obj <- coxph(eset$y ~ scale(t(exprs(eset[probeset, ]))[, 1]))
    output <- c(obj$coefficients, sqrt(obj$var))
    names(output) <- c("log.HR", "SE")
  }else{output <- NULL}
    output}
runCox(esets[[1]])

log.HR     SE 
0.1080 0.1167

Example 1 (cont'd)

Calculate “effect size” (HR) and Standard Error for all datasets:

study.coefs <- t(sapply(esets, runCox)); head(study.coefs)

                         log.HR      SE
E.MTAB.386_eset        0.108038 0.11671
GSE13876_eset         -0.015534 0.12165
GSE17260_eset          0.196605 0.22132
GSE18520_eset          0.004335 0.15786
GSE19829.GPL8300_eset  0.072413 0.19658
GSE26712_eset          0.192926 0.09192

Example 1 (cont'd): forest plot

library(metafor)
res.fe <- metafor::rma(yi=study.coefs[, 1], sei=study.coefs[, 2], method="FE")
forest.rma(res.fe, slab=gsub("_eset$","",rownames(study.coefs)), atransf=exp)

plot of chunk unnamed-chunk-8

Example 1 (cont'd): FE vs. RE

(res.re <- metafor::rma(yi=study.coefs[, 1], sei=study.coefs[, 2], method="DL"))


Random-Effects Model (k = 10; tau^2 estimator: DL)

tau^2 (estimated amount of total heterogeneity): 0 (SE = 0.0073)
tau (square root of estimated tau^2 value):      0
I^2 (total heterogeneity / total variability):   0.00%
H^2 (total variability / sampling variability):  1.00

Test for Heterogeneity: 
Q(df = 9) = 5.3487, p-val = 0.8029

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub          
  0.0966   0.0381   2.5334   0.0113   0.0219   0.1714        * 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Example 1 (cont'd): closing comments

Between-study variability is completely consistent with zero heterogeneity between studies
Replace simple univariate regression with multivariate regression to correct for known clinical factors (e.g. see Ganzfried et. al. 2013)
Replace HR with any coefficient + S.E.
Replace single probeset with any fully-specified score or classifier
- noting that random gene sets are likely to be prognostic (Waldron et al. JNCI 2014, Venet et al. PLoS Comp Bio 2011)

Example 2: rank all genes by prognostic association

if( !require("survHD") || package.version("survHD") != "0.5.0" ){
    library(devtools)
    install_url("https://bitbucket.org/lwaldron/survhd/downloads/survHD_0.5.0.tar.gz")
}
download.file("https://bitbucket.org/lima1/ovrc4_signew/raw/tip/src/metaCMA.R", destfile="metaCMA.R", method="wget")
source("metaCMA.R")
gene.coefs <- metaCMA.coefs(esets)
FE.res <- metaCMA.opt(esets=esets, coefs=gene.coefs, rma.method="FE", n=200)

Riester et al. JNCI 2014 - see Bitbucket page to reproduce paper and find helpful tools

Example 3: Leave-one-dataset-out validation

Validation of prediction models
Use 1 study to test, K-1 studies to train
Use meta-analysis of K-1 studies to get univariate coefficients e.g. to generate Tukey's “Compound Covariate” model

LODO.res <- metaCMA(esets,coefs=gene.coefs,n=200, rma.method="FE")

Example 3: Leave-one-dataset-out validation (cont'd)

Leave-one-dataset-out validation of a survival signature. (Riester et al. JNCI 2014)

Leave-one-dataset-in validation

Independent datasets for evaluation of prediction models or gene signatures
Train and test using all dataset pairs (Waldron et al. JNCI 2014, Bernau et al. Bioinformatics 2014, Zhao et al. Bioinformatics 2014)

zmatrix

Leave-one-dataset-in validation (cont'd)

“Improvement over random signatures (IOR)” score of gene signatures relative to random gene signatures, equalizing the influences of authors’ algorithms for generating risk scores, quality of the original training data, and gene signature size (Waldron et al. JNCI 2014).
source scripts: genMumatrix.R and analyseMumatrix.R

Conclusions

many alternatives for meta-analysis of genomics experiments have been proposed, none as flexible or well-understood as traditional approaches
metafor R package is highly recommendable and well-documented Viechtbauer 2010
simple pooling makes it difficult to assess heterogeneity arising from genomic or clinical data

DIY exercise: suboptimal debulking

Change patientselection.config filter to keep genes with debulking variable
Identify the most differentially expressed gene with respect to optimal / suboptimal debulking. Manually:
- Use rowttests() from genefilter package to get per-study fold-change and standard error
- Convert t-statistic and difference in means (dm) to S.E.: \[ t=\frac{dm}{S.E.} \]
- Synthesize per-gene using metafor::rma()
Automatically: Use metaCMA.coefs() and metaCMA.opt()