Multi-Omics Pathway Testing with the MiniMax Statistic

Gabriel J. Odom

<https://stempel.fiu.edu/faculty/gabriel-j-odom/>

5 November 2020

Outline

Introduction and Motivation
Limitations of Current “Multi-Omics” Approaches
Solution Overview
The MiniMax Statistic
Simulation Study Design
Simulation Study Results
Additional Comments
Conclusion and Discussion

Introduction and Motivation

Introduction

Cancer is the second leading cause of death in the United States: over half a million Americans die from cancer each year (CDC, 2018)
Every year, 442 people per 100,000 are diagnosed with cancer. Every year, 158 people per 100,000 will die (National Cancer Institute, 2017).
The risk to develop a cancer is strongly associated with age, and our society is “getting older”; global cancer incidence is expected to more than double by 2050 (Pilleron et al., 2020)
In the study of thousands of cancers, analysing genomic and genetic data is critical to create novel therapies.

Background and Motivation

Start with an example question:
- “In breast cancer, is increased activity of ERBB2 signalling related to overall survival time?”
- “In colon cancer, how do MMR mutations increase microsatellite instability?”
Basic components of these questions:
- Disease: “breast cancer”, “colon cancer”
- Biological process: “ERBB2 signalling”, “MMR mutations”
- Clinical response: “overall patient survival”, “high microsatellite instability”
- (potential) Statistical method: Cox Proportional Hazards, Elastic-Net Logistic Regression

Multiple Files of Data

To answer questions of these type, we have a wealth of genomic data from clinical trials and observational studies.
Recall the “central dogma” of molecular biology: DNA \(\rightarrow\) RNA \(\rightarrow\) Protein.
Data related to these biological processes are available from the level of regions of base pairs of DNA all the way to expression levels of complex phospoproteins.
Generally speaking, multiple levels of genomic measurements related to the same disease are known as multi-omic data.

Multi-Omics Categories

Multi-Way: data is matched on subjects AND features while being recorded repeatedly over time, space, or tissue. Examples: DNA methylation from blood drawn on the same subjects every morning and evening for two weeks (data can be stored in a regular array).
Sample Multi-Table: data is matched on subjects and recorded over different media. Examples: one draw of DNA methylation, gene expression, and protein expression on the same subjects.
Feature Multi-Table: data is matched on features and recorded over different cohorts and/or media. Example: phosphoprotein expression recorded for three different cohorts of breast cancer survivors.

Limitations of Current Multi-Omics Approaches

Example Data

Limitations of Current Multi-Omics Approaches

These multi-omics techniques all require that we “match” on something: time, subjects, or features.
Matching on “rows” yields 5 subjects (15 samples) to analyse (discarding 600+ subjects and 1000+ samples).
Matching on “columns” (genes) requires discarding any genes not in the intersection of the three platforms (roughly 35,000 gene measurements per subject) and any within-subject replicates (roughly 400 samples), ignoring the whole point of multiple platforms.

Solution Overview

Solution (p1): Match on Pathways

Can’t match completely on samples (because not all samples have matched measurements across all platforms).
Can’t match on features (because different platforms record different molecular components).
Can match on biological pathways (pre-defined groups of genomic regions, genes, or proteins that are all related to a single biological function).
Biological pathways can be imported from online databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), The Broad Institute’s Molecular Signatures Database (MSigDB), The Gene Ontology (GO) database, WikiPathways, and more.

Solution (p2): Test Pathways

Given a set of biological pathways appropriate to the disease / biological question of interest:
- Supervised: test the statistical significance of the relationship between that pathway and a clinical response (such as survival status), or
- Unsupervised: measure the disregulation of a pathway without clinical response
Certain methods are available to statistically test the activity of a pathway such as GSEA, rnaEditr, pathwayPCA, MOGSA, WebGestalt, and many others.
Using a statistically and biologically appropriate method, assign a p-value to each pathway for each data platform.

Solution (p3): Summarise Pathway Significance

We essentially have multiple measurements of statistical significance (p-values) for the same biological process, which we need to combine.
Some platforms will share subjects with another platform (e.g. a clinical trial where subjects’ gene and protein expression were collected), while some platforms may be essentially independent (e.g. two studies on prostate cancer with different cohort, but one using DNA methylation and another RNA editing levels).
Based on sample and biological correlations between pathways, their p-value correlations range from 0 to 1.

The MiniMax Statistic

Defining the MiniMax Statistic

Consider \(G\) data platforms used to measure the statistical significance of a single pathway.
Under an independence assumption and the null hypothesis, these p-values are uniformally distributed over a unit \(G\)-cube in \([0,1]^G\).
We define the MiniMax statistic to be the minimum of all \(G - 1\) pairwise p-value maxima; calculated simply as the second order statistic of the p-values. We interpret this as the “best of the worst-case scenarios”.

Distribution of the MiniMax Statistic

Again, assuming independence of the pathway p-values across \(G\) platforms, the distribution of the \(k^{th}\) order statistic of these p-values (\(p_{[k]}\)) under the null hypothesis is Beta (\(\mathbb{B}(\alpha, \beta)\)) with parameters (see Gentle, (2009). p. 63):

\[ \begin{align} \alpha &= k \\ \beta &= G - k + 1. \end{align} \]

Thus, the MiniMax statistic is distributed \[ p_{[2]} \sim \mathbb{B}(2, G - 1). \]

Relationship Between MLEs and p-Value Correlations

First of all: we know that these p-values are NOT independent.
We hypothesized that the correlation between p-values would affect the Beta distribution of the MiniMax statistic (I originally believed this dependence would only affect the \(G\) parameter of this distribution).
We used a Gaussian copula to simulate sets of 100,000 correlated p-values across a grid of ~7700 correlation matrices \(\textbf{P} \in \mathbb{R}_{3 \times 3}^>\).
Under independence, the MLEs should be \(\{\hat{\alpha} = 2,\ \hat{\beta} = 2\}\).

Correlation to MLEs Relationship Continued

We built this grid of points subject to the constraint below to ensure each correlation matrix \(\textbf{P}\) was positive definite (see this derivation): \[ 1 + 2\rho_{12}\rho_{13}\rho_{23} - \rho_{12}^2 - \rho_{13}^2 - \rho_{23}^2 > 0 \]
We discovered that the MLEs of the Beta distribution parameters were most often near equal (maximum absolute difference = 0.01524), suggesting that the correlations affect the true distribution equally through the \(k\) and \(G\) parameters.
We then use \(\text{det}(\textbf{P})\) to predict the average of \(\hat{\alpha}\) and \(\hat{\beta}\).

MLEs for \(|\alpha - \beta|\)

Visualising this Relationship

Some Mathematical Details

For \(k = 2,\ G = 3\), our current best guess for the relationship between the determinant of the correlation matrix of the p-values and the MLEs for the Beta distribution is:

\[ \hat{\alpha} = \hat{\beta} \approx 2 - \frac{7}{8}\sqrt{1 - |\textbf{P}|}. \]

Because the estimates for \(\alpha\) and \(\beta\) are nearly identical, we conjecture that \(\text{det}(\textbf{P})\) affects \(k\) and \(G\) equally, but this needs further exploration to confirm.

Simulation Study

The Base Layer of Data Across Platforms

We have 631 samples of patients with colorectal adenocarcinoma (our Venn diagram).
There are seven molecular platforms plus clinical data (\(n = 629\)); our simulation will use centred and scaled data from these 3 platforms:
- Copy Number Changes (\(n = 616,\ p = 24776\))
- Normalised RNAseq (\(n = 379,\ p = 19828\))
- Gene-Level Proteome (\(n = 90,\ p = 5538\))
Source: http://www.linkedomics.org/data_download/TCGA-COADREAD/

Pathways and Synthetic Subtypes

We want to treat a certain subset of biological pathways, but “real” pathways are often dependent (many genes are important in multiple biological processes, appearing in more than one pathway).
Because each data set has a different dimensionality, we cluster the 5252 genes included across each platform.
To ensure that treatments applied to pathways do not “bleed over” to other pathways, we partitioned genes into a set of 200 pathways.
All samples were from subjects with cancer, so we randomly assigned each subject to a synthetic subtype (“A” or “B”), which we repeated 100 times.

Statistic Properties under the Null Distribution

For each of the 100 response sets:

Before treating the pathways, use the pathwayPCA:: R package to fit the logistic regression model of the random subtype against the first principal component of each pathway.
Repeat this process for each of the three data platforms (copy number, RNAseq, and protomics).
Calculate the MiniMax statistic for each pathway.
Use these results under the null hypothesis to
- Estimate the test size of the MiniMax statistic, and
- Find the Maximum Likelihood Estimators of the Beta distribution parameters \(\alpha\) and \(\beta\).

Design Points

For each of the 100 response sets:

Treat 20 pathways at random for all samples of synthetic type “B”.
Treatment proportion: 20%, 40%, 60%, or 80% of genes in “treated” pathways are selected for direct treatment. Based on the clustering procedure (Euclidean distance with agglomeration clustering via a modified Ward’s method), clusters range in size from roughly 5 to 75 genes, with a median of 25. Thus, we vary the proportion of treated genes rather than the count.
Treatment strength: +0.1, +0.2, …, +0.5. Recall that each gene has been centred and scaled, so their standard deviations are approximately 1.

Within-pathway Dependent Hypothesis Testing

For each pathway and each platform, we use the pathwayPCA:: R package to fit the logistic regression model of the synthetic subtype against the first principal component of each pathway. (Recall that each platform has a different sample size.)
For each pathway, calculate the MiniMax statistic from these p-values.
For each pathway, find the p-value of the MiniMax statistic using \(\hat{\alpha}_{\text{MLE}}\) and \(\hat{\beta}_{\text{MLE}}\) calculated from the results under the null distribution (which has the same dependence across platforms).

Simulation Study Results

Potential Competing Methods

Because of the limitations described above, we haven’t been able to find methods to compare against that have reasonable performance (less than 20% power for scenario with strongest signal).
Because only 5 samples are shared across the three platforms, standard multi-omics techniques ignore over 90% of the data.
Potential competing method: “stack” the data; this however requires us to discard any multi-platform data for the same subject (we can’t include multiple rows from the same person without violating independence assumptions).

Test Size

Because the test size of the MiniMax statistic depends on the parameters of the Beta distribution, we estimate the test size under four assumptions:
1. Independent hypotheses: \(\{\alpha,\beta\} = \{2,2\}\)
2. Dependence only affects \(G\): \(\{\alpha,\beta\} = \{2,\hat{\beta}\}\)
3. Dependence affects both parameters: \(\{\alpha,\beta\} = \{\hat{\alpha}_{\text{MLE}},\hat{\beta}_{\text{MLE}}\}\)
4. Same as (3), but estimating the parameters using their functional dependence on the correlation matrix of the pathway p-values: \(\{\alpha,\beta\} = \{\hat{\alpha}_f,\hat{\beta}_f\}\)
We also estimated the test size of the “stacked” method.

Test Size Results

Method	Median	Mean	Q3
(1)	0.055	0.062	0.080
(2)	0.060	0.058	0.070
(3)	0.050	0.051	0.060
(4)	0.048	0.051	0.065
Stacked	0.043	0.051	0.070

Test size for Methods (3) and (4)—the MLE-based approaches—were well controlled, so we show power results for those methods compared to the more rigid “stacked” approach.
Power results were nearly identical for Methods (3) and (4).

Power

Comments on Flexibility

We have seen that the test sizes for these two techniques are nearly identical, but the “stacked” data method offers superior power to the MiniMax statistic.
This is because the MiniMax does not use the full data to estimate pathway significance, only p-values; therefore, it can be used even when original data is unavailable.
Moreover, “stacking” the data requires us to discard nearly 20k gene-level copy number measurements, 15k gene expression measurements, and 400 matched-platform samples.
“Stacking” prohibits any investigation into molecular systems within the same subject.

Conclusion and Discussion

Review of Points

Genetics and genomics data often comes from multiple studies with unmatched or partially matched samples.
Despite this sample and feature discordance, we would like to synthesize signals within the same disease using biological pathways.
The MiniMax statistic can combine biological process significance across multiple data sets, without requiring access to the original data.
This statistic shows well controlled test size and reasonable power across common scenarios with moderate biological signal.

Discussion Questions

What are other potential competing methods?
How can I derive the functional relationship between \(\text{det}(\textbf{P})\) and \(\{\hat{\alpha},\hat{\beta}\}\) for the Beta distribution?

Multi-Omics Pathway Testing with the MiniMax Statistic

Gabriel J. Odom

<https://stempel.fiu.edu/faculty/gabriel-j-odom/>

5 November 2020

Outline

Outline

Introduction and Motivation

Introduction

Background and Motivation

Multiple Files of Data

Multi-Omics Categories

Limitations of Current Multi-Omics Approaches

Example Data

Limitations of Current Multi-Omics Approaches

Solution Overview

Solution (p1): Match on Pathways

Solution (p2): Test Pathways

Solution (p3): Summarise Pathway Significance

The MiniMax Statistic

Defining the MiniMax Statistic

Distribution of the MiniMax Statistic

Relationship Between MLEs and p-Value Correlations

Correlation to MLEs Relationship Continued

MLEs for \(|\alpha - \beta|\)

Visualising this Relationship

Some Mathematical Details

Simulation Study

The Base Layer of Data Across Platforms

Pathways and Synthetic Subtypes

Statistic Properties under the Null Distribution

Design Points

Within-pathway Dependent Hypothesis Testing

Simulation Study Results

Potential Competing Methods

Test Size

Test Size Results

Power

Comments on Flexibility

Conclusion and Discussion

Review of Points

Discussion Questions

Thank You