Multi-Omics Pathway Testing with the MiniMax Statistic

Gabriel J. Odom, PhD, ThD

<https://rpubs.com/gabrielodom/pathway_minimax/>

8 February 2022

Outline

Introduction and Motivation
Example Data Sets
Solution Overview
The MiniMax Statistic
Application in Practice
Conclusion and Discussion

Suppl. 1: Simulation Study Design
Suppl. 2: Simulation Study Results

Introduction and Motivation

Introduction

In the study of neurodegenerative diseases, thousands of cancers, and heart disease, analysing genomic and genetic data is critical to create novel therapies.
Generally speaking, multiple levels of genomic measurements related to the same disease are known as multi-omic data.
Multi-omics data often include multiple related samples in the two or more of the following forms: genome, transcriptome, epigenome, proteome, lipidome, and others (Krassowski et al., 2020).
Multi-omics data shows “the flow of biological information at multiple levels and thus can help in unraveling the mechanisms underlying the biological condition of interest” (Subramanian et al., 2020).

Categories of Multi-Omics Data

Sample Multi-Table: data is matched on subjects and recorded over different media. Examples: one draw of DNA methylation, gene expression, and protein expression on the same subjects (today’s subject).
Feature Multi-Table: data is matched on features and recorded over different cohorts and/or media. Example: phosphoprotein expression recorded for three different cohorts of breast cancer survivors.
Multi-Way: data is matched on subjects AND features while being recorded repeatedly over time, space, or tissue. Examples: DNA methylation from blood drawn on the same subjects every morning and evening for two weeks (data can be stored in a regular array).

Background and Motivation

Start with an example question:
- “In breast cancer, is increased activity of ERBB2 signalling related to overall survival time?”
- “In colon cancer, how do MMR mutations increase microsatellite instability?”
Basic components of these questions:
- Disease: “breast cancer”, “colon cancer”
- Biological process: “ERBB2 signalling”, “MMR mutations”
- Clinical response: “overall patient survival”, “high microsatellite instability”
- (potential) Statistical method: Cox Proportional Hazards, Elastic-Net Logistic Regression

Example Data Sets

Multiple Files of Data

To answer questions of these type, we have a wealth of genomic data from clinical trials and observational studies.
Recall the “central dogma” of molecular biology: DNA \(\rightarrow\) RNA \(\rightarrow\) Protein.
Data related to these biological processes are available from the level of regions of base pairs of DNA all the way to expression levels of complex phospoproteins.
Data repositories include: The Cancer Genome Atlas, Omics Discovery Index, Gene Expression Omnibus (recently added some multi-omics data sets), and others.

Example Data 1: TCGA-COADREAD

We have 631 samples (\(n = 629\) with complete phenotype) of patients with colorectal adenocarcinoma.
There are seven molecular platforms, but we pick 3 platforms for multi-omics analysis:
- Copy Number Changes (\(n = 616,\ p = 24776\))
- Normalised RNAseq (\(n = 379,\ p = 19828\))
- Gene-Level Proteome (\(n = 90,\ p = 5538\))
Matching on “rows” yields 5 subjects (15 samples) to analyse (discarding 600+ subjects and 1000+ samples).
Matching on “columns” (genes) yields 1,710 features to analyse (discarding thousands of proteins and over 20,000 genes)

COADREAD Sample Overlap

Example Data 2: Alzheimer’s Disease Meta-Analysis

SNP-level summary statistics: Kunkle et al. (2019) performed a genome-wide association study (GWAS) to find single nucleotide polymorphisms (SNPs) related to disease status.
DNA Methylation summary statistics: Zhang et al. (2020) performed an epigenome-wide DNA methylation study to identify CpGs (cytosine-p-guanine pairs in the genome) and regions of concurrent differential methlyation.
RNAseq data: The Religious Orders Study and Memory and Aging Project (ROSMAP) data set has 640 samples of postmortem prefrontal cortex samples.

Solution Overview

Solution (p1): Match on Pathways

Can’t match completely on samples (because not all samples have matched measurements across all platforms).
Can’t match on features (because different platforms record different molecular components).
Can match on biological pathways (AKA gene sets: pre-defined groups of genomic regions, genes, or proteins that are all related to a single biological function).
Biological pathways can be imported from online databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), the Broad Institute’s Molecular Signatures Database (MSigDB), the Gene Ontology (GO) database, WikiPathways, and more.

Limitations of Current Multi-Omics Approaches

These multi-omics techniques all require that we “match” on something: time, subjects, or features.
In the toy example (sample multi-table) on the next slide:
- there is only one shared sample (Sample 006) to match on
- there are no shared features to match on directly
- matching RNAseq to proteomics requires us to discard any non-coding RNA
NOTE: if your data are matched, stop paying attention to this talk and go use one of these methods instead.

Solution (p2): Test Pathways

Given a set of biological pathways appropriate to the disease / biological question of interest:
- Supervised: test the statistical significance of the relationship between that pathway and a clinical response (such as survival status), or
- Unsupervised: measure the disregulation of a pathway without clinical response
Certain methods are available to statistically test the activity of a pathway such as GSEA, rnaEditr, MOGSA, pathwayPCA, WebGestalt, and many others.
Using a statistically and biologically appropriate method, assign a p-value to each pathway for each data platform.

Solution (p3): Summarise Pathway Significance

Pathway	RNAseq p-value	DNAm p-value	Proteomics p-value
`cell_process_1`	`2.3e-4`	`1.8e-3`	`2.2e-1`
`cell_process_2`	`1.2e-2`	`5.9e-4`	`9.7e-2`

Our “data” is now this matrix of p-values: one row for each biological process and one column for each single-omic data platform.
We essentially have multiple measurements of statistical significance (p-values) for the same biological process, which we need to combine.
Choose a pathway collection such that the pathways in it are as distinct as possible.
What causes dependence? Sample / feature overlap.

The MiniMax Statistic

Defining the MiniMax Statistic

Consider \(G\) data platforms used to measure the statistical significance of a single pathway.
Under an independence assumption and the null hypothesis, these p-values are uniformally distributed over a unit \(G\)-cube in \([0,1]^G\).
We define the MiniMax statistic to be the minimum of all \(G - 1\) pairwise p-value maxima; calculated simply as the second order statistic of the p-values.
We interpret this as the “best of the worst-case scenarios”.
This is arithmetically equivalent to taking the second smallest p-value.

Distribution of the MiniMax Statistic

Again, assuming independence of the pathway p-values across \(G\) platforms, the distribution of the \(k^{th}\) order statistic of these p-values (\(p_{[k]}\)) under the null hypothesis is Beta (\(\mathbb{B}(\alpha, \beta)\)) with parameters (see Gentle, (2009). p. 63):

\[ \begin{align} \alpha &= k \\ \beta &= G - k + 1. \end{align} \]

Thus, the MiniMax statistic is distributed \[ p_{[2]} \sim \mathbb{B}(2, G - 1). \]

Method Summary

Application in Practice: Alzheimer’s Disease

Step 1A: Multi-Omics Pathway p-Values

Select the MSigDB’s full C2:CP collection (2982 pathways).
In order to calculate pathway-level summaries of these data sets on subjects with late-onset Alzheimer’s disease, we performed
- SNP: Mixed-Model GWAS Pathway Analysis (Wang et al., 2011)
- DNAm: Illumina 450k Gene Set Analysis (missMethyl::, Phipson et al., 2016)
- RNAseq: Fast Gene Set Enrichment Analysis (fgsea::, Korotkevich et al., 2021) on expression levels controlled for age, sex, and Braak stage.

Steps 1B - 3: MiniMax Significance

Step 1B: For pathway \(g\), we calculate the minimum of all pairwise maxima of the p-values, denoted \(P_g\).
Step 2: Because we do not have access to the original data, we do not estimate the parameters of the Beta distribution, but rather accept their parametric default values: \(\alpha = 2\); \(\beta = G - k + 1 = 3 - 2 + 1\).
Step 3: using each \(P_g\) as a critical value, find the MiniMax p-value for each pathway from the \(\mathbb{B}(\alpha = 2, \beta = 2)\) distribution .

Top Results

As discussed in Odom et al. (2021), these pathways all have strong implication for the diagnosis and treatment of neurodegenerative diseases.

Conclusion and Discussion

Review of Points

Genetics and genomics data often comes from multiple studies with unmatched or partially matched samples.
Despite this sample and feature discordance, we would like to synthesize signals within the same disease using biological pathways.
The MiniMax statistic can combine biological process significance across multiple data sets, without requiring access to the original data.
This statistic shows well controlled test size and reasonable power across common scenarios with moderate biological signal (supplemental slides).

Thank You

Simulation Study

Pathways and Synthetic Subtypes

We want to treat a certain subset of biological pathways, but “real” pathways are often dependent (many genes are important in multiple biological processes, appearing in more than one pathway).
We use the 3 centred and scaled TCGA-COADREAD data sets as a “base layer” (all cases).
Because each data set has a different dimensionality, we cluster the 1710 genes included across each platform.
To ensure that treatments applied to pathways do not “bleed over” to other pathways, we partitioned genes into a set of 50 pathways.
We randomly assigned each subject to a synthetic subtype (“A” or “B”), which we repeated 100 times.

MiniMax Properties under \(H_0\)

For each of the 100 response sets:

Before treating the pathways, use the pathwayPCA:: R package to fit the logistic regression model of the random subtype against the first principal component of each pathway.
Repeat this process for each of the three data platforms (copy number, RNAseq, and protomics).
Calculate the MiniMax statistic for each pathway.
Use these results under the null hypothesis to
- Estimate the test size of the MiniMax statistic, and
- Find the Maximum Likelihood Estimators of the Beta distribution parameters \(\alpha\) and \(\beta\).

Design Points

For each of the 100 response sets:

Treat 20 pathways at random for all samples of synthetic type “B”.
Treatment proportion: 20%, 40%, 60%, or 80% of genes in “treated” pathways are selected for direct treatment.
Based on the clustering procedure (Euclidean distance with agglomeration clustering via a modified Ward’s method), clusters range in size from roughly 5 to 75 genes, with a median of 25.
Treatment strength: +0.1, +0.2, …, +0.5. Recall that each gene has been centred and scaled, so their standard deviations are approximately 1.

Within-pathway Dependent Hypothesis Testing

For each pathway and each platform, we use the pathwayPCA:: R package to fit the logistic regression model of the synthetic subtype against the first principal component of each pathway.
For each pathway, calculate the MiniMax statistic from these p-values.
For each pathway, find the p-value of the MiniMax statistic using \(\hat{\alpha}_{\text{MLE}}\) and \(\hat{\beta}_{\text{MLE}}\) calculated from the results under the null distribution (which has the same dependence across platforms).

Simulation Study Results

Competing Methods

Because only 5 samples are shared across the three platforms, standard multi-omics techniques ignore over 90% of the data.
We found the following methods to perform pathway-level multi-omics analysis (in order of performance in silico):
- MiniMax plus pathwayPCA (our previously published single-omics principal component analysis method)
- mitch: multi-contrast pathway enrichment
- iProFun: Integrative analysis of PROteogenomic FUNctional traits
- MFA: multi-factor analysis
- sCCA: sparse multiple canonical correlation analysis

Test Size

Because the test size of the MiniMax statistic depends on the parameters of the Beta distribution, we estimate the test size under two assumptions:
1. Independent hypotheses: \(\{\alpha,\beta\} = \{2,2\}\)
2. Dependence affects both parameters: \(\{\alpha,\beta\} = \{\hat{\alpha}_{\text{MLE}},\hat{\beta}_{\text{MLE}}\}\)
We experimented with having dependence affect only \(\beta\) and leaving \(\alpha = 2\) fixed, but these results were not as good.
We also estimated the test size of the other methods to find the test statistic thresholds which preserved a test size of 5%.

Test Size Results

Method	Parameters	Median	Mean	Q3
(1)	\(\{2, 2\}\)	0.044	0.063	0.080
(2)	\(\{1.85, 1.9\}\)	0.045	0.054	0.060

Power results were nearly identical for both methods. We believe that this is because our pathways were constructed to be as independent as possible.
In practice, great care should be taken when selecting a collection of pathways.

Multi-Omics Pathway Testing with the MiniMax Statistic

Gabriel J. Odom, PhD, ThD

<https://rpubs.com/gabrielodom/pathway_minimax/>

8 February 2022

Outline

Outline

Introduction and Motivation

Introduction

Categories of Multi-Omics Data

Background and Motivation

Example Data Sets

Multiple Files of Data

Example Data 1: TCGA-COADREAD

COADREAD Sample Overlap

Example Data 2: Alzheimer’s Disease Meta-Analysis

Solution Overview

Solution (p1): Match on Pathways

Limitations of Current Multi-Omics Approaches

Solution (p2): Test Pathways

Solution (p3): Summarise Pathway Significance

The MiniMax Statistic

Defining the MiniMax Statistic

Distribution of the MiniMax Statistic

Method Summary

Application in Practice: Alzheimer’s Disease

Step 1A: Multi-Omics Pathway p-Values

Steps 1B - 3: MiniMax Significance

Top Results

Conclusion and Discussion

Review of Points

Related Works

Thank You

Simulation Study

Pathways and Synthetic Subtypes

MiniMax Properties under \(H_0\)

Design Points

Within-pathway Dependent Hypothesis Testing

Simulation Study Results

Competing Methods

Test Size

Test Size Results