Multi-Omics Pathway Testing with the MiniMax Statistic

Gabriel J. Odom

<https://stempel.fiu.edu/faculty/gabriel-j-odom/>

5 November 2020

Outline

Outline

  • Introduction and Motivation
  • Limitations of Current “Multi-Omics” Approaches
  • Solution Overview
  • The MiniMax Statistic
  • Simulation Study Design
  • Simulation Study Results
  • Additional Comments
  • Conclusion and Discussion

Introduction and Motivation

Introduction

  • Cancer is the second leading cause of death in the United States: over half a million Americans die from cancer each year (CDC, 2018)
  • Every year, 442 people per 100,000 are diagnosed with cancer. Every year, 158 people per 100,000 will die (National Cancer Institute, 2017).
  • The risk to develop a cancer is strongly associated with age, and our society is “getting older”; global cancer incidence is expected to more than double by 2050 (Pilleron et al., 2020)
  • In the study of thousands of cancers, analysing genomic and genetic data is critical to create novel therapies.

Background and Motivation

  • Start with an example question:
    • “In breast cancer, is increased activity of ERBB2 signalling related to overall survival time?”
    • “In colon cancer, how do MMR mutations increase microsatellite instability?”
  • Basic components of these questions:
    • Disease: “breast cancer”, “colon cancer”
    • Biological process: “ERBB2 signalling”, “MMR mutations”
    • Clinical response: “overall patient survival”, “high microsatellite instability”
    • (potential) Statistical method: Cox Proportional Hazards, Elastic-Net Logistic Regression

Multiple Files of Data

  • To answer questions of these type, we have a wealth of genomic data from clinical trials and observational studies.
  • Recall the “central dogma” of molecular biology: DNA \(\rightarrow\) RNA \(\rightarrow\) Protein.
  • Data related to these biological processes are available from the level of regions of base pairs of DNA all the way to expression levels of complex phospoproteins.
  • Generally speaking, multiple levels of genomic measurements related to the same disease are known as multi-omic data.

Multi-Omics Categories

  • Multi-Way: data is matched on subjects AND features while being recorded repeatedly over time, space, or tissue. Examples: DNA methylation from blood drawn on the same subjects every morning and evening for two weeks (data can be stored in a regular array).
  • Sample Multi-Table: data is matched on subjects and recorded over different media. Examples: one draw of DNA methylation, gene expression, and protein expression on the same subjects.
  • Feature Multi-Table: data is matched on features and recorded over different cohorts and/or media. Example: phosphoprotein expression recorded for three different cohorts of breast cancer survivors.

Limitations of Current Multi-Omics Approaches

Example Data

Limitations of Current Multi-Omics Approaches

  • These multi-omics techniques all require that we “match” on something: time, subjects, or features.
  • Matching on “rows” yields 5 subjects (15 samples) to analyse (discarding 600+ subjects and 1000+ samples).
  • Matching on “columns” (genes) requires discarding any genes not in the intersection of the three platforms (roughly 35,000 gene measurements per subject) and any within-subject replicates (roughly 400 samples), ignoring the whole point of multiple platforms.

Solution Overview

Solution (p1): Match on Pathways

Solution (p2): Test Pathways

  • Given a set of biological pathways appropriate to the disease / biological question of interest:
    • Supervised: test the statistical significance of the relationship between that pathway and a clinical response (such as survival status), or
    • Unsupervised: measure the disregulation of a pathway without clinical response
  • Certain methods are available to statistically test the activity of a pathway such as GSEA, rnaEditr, pathwayPCA, MOGSA, WebGestalt, and many others.
  • Using a statistically and biologically appropriate method, assign a p-value to each pathway for each data platform.

Solution (p3): Summarise Pathway Significance

  • We essentially have multiple measurements of statistical significance (p-values) for the same biological process, which we need to combine.
  • Some platforms will share subjects with another platform (e.g. a clinical trial where subjects’ gene and protein expression were collected), while some platforms may be essentially independent (e.g. two studies on prostate cancer with different cohort, but one using DNA methylation and another RNA editing levels).
  • Based on sample and biological correlations between pathways, their p-value correlations range from 0 to 1.

The MiniMax Statistic

Defining the MiniMax Statistic

  • Consider \(G\) data platforms used to measure the statistical significance of a single pathway.
  • Under an independence assumption and the null hypothesis, these p-values are uniformally distributed over a unit \(G\)-cube in \([0,1]^G\).
  • We define the MiniMax statistic to be the minimum of all \(G - 1\) pairwise p-value maxima; calculated simply as the second order statistic of the p-values. We interpret this as the “best of the worst-case scenarios”.

Distribution of the MiniMax Statistic

  • Again, assuming independence of the pathway p-values across \(G\) platforms, the distribution of the \(k^{th}\) order statistic of these p-values (\(p_{[k]}\)) under the null hypothesis is Beta (\(\mathbb{B}(\alpha, \beta)\)) with parameters (see Gentle, (2009). p. 63):

\[ \begin{align} \alpha &= k \\ \beta &= G - k + 1. \end{align} \]

  • Thus, the MiniMax statistic is distributed \[ p_{[2]} \sim \mathbb{B}(2, G - 1). \]

Relationship Between MLEs and p-Value Correlations

  • First of all: we know that these p-values are NOT independent.
  • We hypothesized that the correlation between p-values would affect the Beta distribution of the MiniMax statistic (I originally believed this dependence would only affect the \(G\) parameter of this distribution).
  • We used a Gaussian copula to simulate sets of 100,000 correlated p-values across a grid of ~7700 correlation matrices \(\textbf{P} \in \mathbb{R}_{3 \times 3}^>\).
  • Under independence, the MLEs should be \(\{\hat{\alpha} = 2,\ \hat{\beta} = 2\}\).

Correlation to MLEs Relationship Continued

  • We built this grid of points subject to the constraint below to ensure each correlation matrix \(\textbf{P}\) was positive definite (see this derivation): \[ 1 + 2\rho_{12}\rho_{13}\rho_{23} - \rho_{12}^2 - \rho_{13}^2 - \rho_{23}^2 > 0 \]
  • We discovered that the MLEs of the Beta distribution parameters were most often near equal (maximum absolute difference = 0.01524), suggesting that the correlations affect the true distribution equally through the \(k\) and \(G\) parameters.
  • We then use \(\text{det}(\textbf{P})\) to predict the average of \(\hat{\alpha}\) and \(\hat{\beta}\).

MLEs for \(|\alpha - \beta|\)

Visualising this Relationship

Some Mathematical Details

For \(k = 2,\ G = 3\), our current best guess for the relationship between the determinant of the correlation matrix of the p-values and the MLEs for the Beta distribution is:

\[ \hat{\alpha} = \hat{\beta} \approx 2 - \frac{7}{8}\sqrt{1 - |\textbf{P}|}. \]

Because the estimates for \(\alpha\) and \(\beta\) are nearly identical, we conjecture that \(\text{det}(\textbf{P})\) affects \(k\) and \(G\) equally, but this needs further exploration to confirm.

Simulation Study

The Base Layer of Data Across Platforms

  • We have 631 samples of patients with colorectal adenocarcinoma (our Venn diagram).
  • There are seven molecular platforms plus clinical data (\(n = 629\)); our simulation will use centred and scaled data from these 3 platforms:
    • Copy Number Changes (\(n = 616,\ p = 24776\))
    • Normalised RNAseq (\(n = 379,\ p = 19828\))
    • Gene-Level Proteome (\(n = 90,\ p = 5538\))
  • Source: http://www.linkedomics.org/data_download/TCGA-COADREAD/

Pathways and Synthetic Subtypes

  • We want to treat a certain subset of biological pathways, but “real” pathways are often dependent (many genes are important in multiple biological processes, appearing in more than one pathway).
  • Because each data set has a different dimensionality, we cluster the 5252 genes included across each platform.
  • To ensure that treatments applied to pathways do not “bleed over” to other pathways, we partitioned genes into a set of 200 pathways.
  • All samples were from subjects with cancer, so we randomly assigned each subject to a synthetic subtype (“A” or “B”), which we repeated 100 times.

Statistic Properties under the Null Distribution

For each of the 100 response sets:

  • Before treating the pathways, use the pathwayPCA:: R package to fit the logistic regression model of the random subtype against the first principal component of each pathway.
  • Repeat this process for each of the three data platforms (copy number, RNAseq, and protomics).
  • Calculate the MiniMax statistic for each pathway.
  • Use these results under the null hypothesis to
    • Estimate the test size of the MiniMax statistic, and
    • Find the Maximum Likelihood Estimators of the Beta distribution parameters \(\alpha\) and \(\beta\).

Design Points

For each of the 100 response sets:

  • Treat 20 pathways at random for all samples of synthetic type “B”.
  • Treatment proportion: 20%, 40%, 60%, or 80% of genes in “treated” pathways are selected for direct treatment. Based on the clustering procedure (Euclidean distance with agglomeration clustering via a modified Ward’s method), clusters range in size from roughly 5 to 75 genes, with a median of 25. Thus, we vary the proportion of treated genes rather than the count.
  • Treatment strength: +0.1, +0.2, …, +0.5. Recall that each gene has been centred and scaled, so their standard deviations are approximately 1.

Within-pathway Dependent Hypothesis Testing

  • For each pathway and each platform, we use the pathwayPCA:: R package to fit the logistic regression model of the synthetic subtype against the first principal component of each pathway. (Recall that each platform has a different sample size.)
  • For each pathway, calculate the MiniMax statistic from these p-values.
  • For each pathway, find the p-value of the MiniMax statistic using \(\hat{\alpha}_{\text{MLE}}\) and \(\hat{\beta}_{\text{MLE}}\) calculated from the results under the null distribution (which has the same dependence across platforms).

Simulation Study Results

Potential Competing Methods

  • Because of the limitations described above, we haven’t been able to find methods to compare against that have reasonable performance (less than 20% power for scenario with strongest signal).
  • Because only 5 samples are shared across the three platforms, standard multi-omics techniques ignore over 90% of the data.
  • Potential competing method: “stack” the data; this however requires us to discard any multi-platform data for the same subject (we can’t include multiple rows from the same person without violating independence assumptions).

Test Size

  • Because the test size of the MiniMax statistic depends on the parameters of the Beta distribution, we estimate the test size under four assumptions:
    1. Independent hypotheses: \(\{\alpha,\beta\} = \{2,2\}\)
    2. Dependence only affects \(G\): \(\{\alpha,\beta\} = \{2,\hat{\beta}\}\)
    3. Dependence affects both parameters: \(\{\alpha,\beta\} = \{\hat{\alpha}_{\text{MLE}},\hat{\beta}_{\text{MLE}}\}\)
    4. Same as (3), but estimating the parameters using their functional dependence on the correlation matrix of the pathway p-values: \(\{\alpha,\beta\} = \{\hat{\alpha}_f,\hat{\beta}_f\}\)
  • We also estimated the test size of the “stacked” method.

Test Size Results

Method Median Mean Q3
(1) 0.055 0.062 0.080
(2) 0.060 0.058 0.070
(3) 0.050 0.051 0.060
(4) 0.048 0.051 0.065
Stacked 0.043 0.051 0.070


  • Test size for Methods (3) and (4)—the MLE-based approaches—were well controlled, so we show power results for those methods compared to the more rigid “stacked” approach.
  • Power results were nearly identical for Methods (3) and (4).

Power

Comments on Flexibility

  • We have seen that the test sizes for these two techniques are nearly identical, but the “stacked” data method offers superior power to the MiniMax statistic.
  • This is because the MiniMax does not use the full data to estimate pathway significance, only p-values; therefore, it can be used even when original data is unavailable.
  • Moreover, “stacking” the data requires us to discard nearly 20k gene-level copy number measurements, 15k gene expression measurements, and 400 matched-platform samples.
  • “Stacking” prohibits any investigation into molecular systems within the same subject.

Conclusion and Discussion

Review of Points

  • Genetics and genomics data often comes from multiple studies with unmatched or partially matched samples.
  • Despite this sample and feature discordance, we would like to synthesize signals within the same disease using biological pathways.
  • The MiniMax statistic can combine biological process significance across multiple data sets, without requiring access to the original data.
  • This statistic shows well controlled test size and reasonable power across common scenarios with moderate biological signal.

Discussion Questions

  • What are other potential competing methods?
  • How can I derive the functional relationship between \(\text{det}(\textbf{P})\) and \(\{\hat{\alpha},\hat{\beta}\}\) for the Beta distribution?

Thank You