Deep Learning for omics data

Author

Yuchen

Published

February 19, 2024

Outline

XAI for omics data (Section 1)
Multimodal deep learning for omics data (Section 2)
Benchmark dataset (Section 3)

1 XAI for omics data

Explainable artificial intelligence for omics data: a systematic mapping study (Toussaint et al. 2023)

1.1 AI methods used for omics data

1.1.1 DNA-based omics data

Some approaches use models generally considered transparent:

Linear/logistic regression
Generalized linear models
Gradient boosting
Rule mining

However, the majority of studies focus on non-transparent models:

Random forests
Support vector machines
Neural networks

In fact, DNA sequences analyzed with convolutional neural networks (CNNs) are the most common combination overall.
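To make the CNN-on-DNA combination concrete, the following is a minimal sketch of the usual preprocessing and first layer: the sequence is one-hot encoded and a single convolution filter is slid along it. The filter, the example sequence and the function names are assumptions made for illustration; a trained CNN would learn many such filters from data.

```python
# One-hot encode a DNA sequence and scan it with a single convolution filter.
BASES = "ACGT"

def one_hot(seq):
    """Return one 4-element row per base (unknown bases become all zeros)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d(encoded, kernel):
    """Valid (no padding) 1D cross-correlation of an L x 4 input with a k x 4 kernel."""
    k = len(kernel)
    out = []
    for start in range(len(encoded) - k + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(k)
            for j in range(4)
        )
        out.append(score)
    return out

# Hypothetical filter that responds most strongly to the motif "TATA".
tata_filter = one_hot("TATA")

sequence = "GGCTATAAGC"  # toy sequence, not real data
activations = conv1d(one_hot(sequence), tata_filter)

# The position with the highest activation is where the motif matches best.
best_position = max(range(len(activations)), key=activations.__getitem__)
print(activations, best_position)
```

Because each filter acts as a motif detector, the activation profile is also what feature-relevance methods later attribute predictions to.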

Gene expression data are the most frequently used omics data type in the mapping study.

1.1.2 Transcriptomic data (RNA sequence data)

1.1.3 Microbiomic, proteomic and multi-omics data

Rarely analyzed data types

The most popular approach for this group is to use multiple AI methods, followed by DNNs and random forests.

1.1.4 Multi-omics data

Papers usually argue that one omics data type is not sufficient to achieve their results and thus apply a second data type for additional input information.

1.2 Explainability methods used for AI methods

1.2.1 Neural networks

DNNs and CNNs are predominant. Across all NNs, feature relevance is applied in 87 studies (47.3%). Many approaches utilize SHAP values to identify the feature relevance of the NNs.
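SHAP values approximate Shapley values, which split a prediction among the input features. As a rough illustration of what is being estimated, the sketch below computes exact Shapley values by enumerating all feature coalitions for a tiny model. It does not use the SHAP library; the model, the inputs and the all-zero baseline are hypothetical, and the enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition are set to the baseline."""
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values

# Hypothetical model over three expression features with one interaction term.
def predict(features):
    g1, g2, g3 = features
    return 2.0 * g1 - 1.0 * g2 + 0.5 * g1 * g3

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for x and the prediction for the baseline, and the interaction term is shared equally between the two features involved.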

Some apply architecture modification on NNs to achieve interpretability. These architecture modifications change the network layout to represent biological connections.
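One common form of such a modification is a sparse layer in which a hidden node stands for a biological entity (for example a pathway) and is connected only to its member genes. The sketch below assumes a simple 0/1 membership mask; the gene-to-pathway assignment, weights and expression values are illustrative and not taken from any of the reviewed studies.

```python
# Illustrative membership: mask[p][g] is 1 if gene g is assigned to pathway p.
genes = ["TP53", "MDM2", "EGFR", "KRAS"]
pathways = ["p53_signalling", "MAPK_signalling"]
mask = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

def masked_linear(x, weights, mask, bias):
    """Dense layer whose weights are multiplied element-wise by a fixed 0/1 mask."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x)) + b
        for w_row, m_row, b in zip(weights, mask, bias)
    ]

# Hypothetical weights; masked entries never influence the output.
weights = [[0.5, -0.2, 0.9, 0.1], [0.3, 0.7, 0.4, -0.6]]
bias = [0.0, 0.0]
expression = [1.0, 2.0, 3.0, 4.0]

pathway_activity = dict(zip(pathways, masked_linear(expression, weights, mask, bias)))
print(pathway_activity)
```

Since each hidden node only sees its own genes, its activation can be read as a pathway-level score, which is what makes this kind of layout interpretable.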

Variational autoencoders and transformers are emerging NN variants, which mostly use feature relevance or architecture modifications.

1.2.2 Tree-based methods

Decision trees, Random forests

1.2.3 Statistical methods

Bayesian networks
Linear or logistic regression
Do not apply additional post-hoc approaches

1.3 Relationships between omics data, AI method and explainability method

CNNs for DNA sequence data
Rule mining and DNNs for gene expression data
Multiple AI methods for multi-omics data

One interesting representation example:

Karim et al. [98] utilize the DeepInsight framework to transform their gene expression data into images, which are then classified with a CNN, allowing for visual post-hoc explanations via SHAP pixel maps.

1.4 Transformer-based + XAI

SetQuence & SetOmic: Deep Set Transformer-based Representations of Cancer Multi-Omics

1.4.1 Methods

They propose relying on non-fixed sets of mutated genome sequences, which a Transformer-based deep neural network can use for supervised learning of oncology-relevant tasks.

With SetOmic, they extend the model to incorporate these representations together with multiple sources of omics data in a flexible way.
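A generic way to handle a non-fixed set of sequences per sample is to embed each sequence and then pool the embeddings with a permutation-invariant operation, as in Deep Sets. The sketch below illustrates only that idea and is not the architecture from the paper: the `embed` function is a toy stand-in for a Transformer encoder, and the patient data are made up.

```python
def embed(sequence):
    """Toy stand-in for a sequence encoder: fractions of A, C, G and T."""
    seq = sequence.upper()
    return [seq.count(base) / len(seq) for base in "ACGT"]

def pool_set(sequences):
    """Mean-pool element embeddings into one fixed-size vector per sample."""
    embeddings = [embed(s) for s in sequences]
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

# Hypothetical samples with different numbers of mutated sequences.
patient_a = ["ACGT", "AAAA", "GGCC"]
patient_b = ["TTTT", "ACAC"]

rep_a = pool_set(patient_a)
rep_b = pool_set(patient_b)
print(rep_a, rep_b)
```

Both representations have the same length regardless of set size, and reordering the sequences of a sample does not change its representation, so a standard classifier can be placed on top.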

1.4.2 Explainability through primary attribution methods

Attribution-based methods:

Integrated Gradients
Input × Gradient
DeepLIFT
SHAP
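To show what two of these attribution methods compute, here is a minimal sketch of Input × Gradient and Integrated Gradients for a small logistic model with an analytic gradient. The weights, the input and the all-zero baseline are hypothetical, and the path integral is approximated with a midpoint Riemann sum; libraries used in practice rely on automatic differentiation instead.

```python
from math import exp

weights = [1.5, -2.0, 0.5]  # hypothetical model parameters
bias = 0.1

def predict(x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + exp(-z))

def gradient(x):
    """Analytic gradient of the sigmoid output with respect to each input."""
    p = predict(x)
    return [p * (1.0 - p) * w for w in weights]

def input_x_gradient(x):
    """Attribution = input value times local gradient."""
    return [xi * g for xi, g in zip(x, gradient(x))]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the path integral from baseline to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g / steps
    return [(xi - b) * t for xi, b, t in zip(x, baseline, totals)]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)
ixg = input_x_gradient(x)
print(ig, ixg)
```

Integrated Gradients attributions add up (approximately) to the difference between the prediction at x and at the baseline, whereas Input × Gradient only uses the gradient at x itself.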

1.4.3 Data

TCGA

They used transcriptome expression data, somatic mutation data and the corresponding clinical annotations across 33 tumour types from the TCGA pan-cancer dataset.

2 Multimodal deep learning for omics data

Multimodal deep learning approaches for single-cell multi-omics data integration (Athaya et al. 2023)

2.1 Multi-omics data modalities

2.1.1 Single-cell genomics data

Single-cell DNA sequencing (scDNA-seq)

2.1.2 Single-cell transcriptomics data

scRNA-seq, also known as single-cell transcriptomics or gene expression data

2.1.3 Single-cell epigenomics data

Epigenomics measures genome-wide epigenomic modifications, such as DNA methylation, histone modifications and chromatin accessibility.

2.1.4 Single-cell proteomics data

Single-cell proteomics investigates individual cells’ protein content, analyzing their roles and interactions

2.2 Multimodal deep learning techniques

Fully connected neural network (FCNN)

Convolutional neural network (CNN)

Recurrent neural network (RNN)

Autoencoder (AE)

Heterogeneous model

Zhang et al. (2020) developed fusion models based on CNNs and RNNs to learn patient representations by combining sequential clinical notes with static demographic and admission data.

Lin et al. (2020) utilized three separate encoder networks to learn marginal representations of mRNA, DNA methylation and copy number variation data for breast cancer subtype prediction. These marginal representations were concatenated and fed into a classification subnetwork to learn a joint representation.
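The encoder-then-concatenate pattern described above can be sketched as follows. This is a schematic forward pass only, with randomly initialised weights and made-up dimensions; it is not the model from Lin et al., and all names and sizes are assumptions chosen for illustration.

```python
import random

random.seed(0)

def dense(n_in, n_out):
    """Randomly initialised weight matrix (n_out x n_in); illustrative only."""
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def forward(weights, x):
    """Linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

# Hypothetical feature counts per modality, latent size and number of subtypes.
input_dims = {"mrna": 6, "methylation": 5, "cnv": 4}
latent_dim = 3
n_classes = 4

encoders = {name: dense(dim, latent_dim) for name, dim in input_dims.items()}
classifier = dense(latent_dim * len(input_dims), n_classes)

def predict_scores(sample):
    # 1) Encode each modality separately (marginal representations).
    marginals = [forward(encoders[name], sample[name]) for name in input_dims]
    # 2) Concatenate them into one joint input.
    joint = [value for rep in marginals for value in rep]
    # 3) Classification subnetwork on the concatenated representation.
    return [sum(w * v for w, v in zip(row, joint)) for row in classifier]

# Toy sample with random values for every modality.
sample = {name: [random.random() for _ in range(dim)] for name, dim in input_dims.items()}
scores = predict_scores(sample)
print(scores)
```

Keeping one encoder per modality lets each data type be compressed on its own terms before the classifier sees the combined representation.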

sciCAN (Xu, Begoli, and McCord 2022) combined generative adversarial networks (GAN) and encoder models for integrating single-cell multi-omics data.

2.3 Model architecture for data integration

VAE
AE
Encoder
GAN
FCNN
Heterogeneous model

2.4 Fusion methods

Early Fusion

Concatenating input features from different modalities to serve as the input of a deep learning model

Intermediate Fusion
Late Fusion
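The difference between the two ends of this spectrum can be sketched with toy logistic models: early fusion concatenates the raw features and applies a single model, while late fusion, as the term is commonly used, trains one model per modality and combines their predictions. The features, weights and the simple averaging rule below are assumptions for illustration.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def linear_model(weights):
    """Return a tiny logistic model with fixed, hypothetical weights."""
    return lambda x: sigmoid(sum(w * xi for w, xi in zip(weights, x)))

rna = [0.2, 1.1, 0.7]   # toy transcriptomic features
atac = [0.9, 0.1]       # toy chromatin accessibility features

# Early fusion: concatenate raw features, then apply a single model.
early_model = linear_model([0.5, -0.3, 0.8, 0.4, -0.6])
early_prediction = early_model(rna + atac)

# Late fusion: one model per modality, then combine the predictions.
rna_model = linear_model([0.5, -0.3, 0.8])
atac_model = linear_model([0.4, -0.6])
late_prediction = (rna_model(rna) + atac_model(atac)) / 2.0

print(early_prediction, late_prediction)
```

Intermediate fusion sits between the two: each modality is first encoded separately and the learned representations, rather than raw features or final predictions, are combined.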

3 Benchmark dataset

3.1 R simulated data

3.2 TCGA

TCGA’s principal aims are to generate, quality control, merge, analyze and interpret molecular profiles at the DNA, RNA, protein and epigenetic levels for hundreds of clinical tumors representing various tumor types and their subtypes.

As of 24 July 2013, TCGA had mapped molecular patterns across 7,992 total cases representing 27 tumor types.

The TCGA Pan-Cancer project assembled data from thousands of patients with primary tumors occurring in different sites of the body, covering 12 tumor types.

Six types of omics characterization were performed, creating a ‘data stack’ in which data elements across the platforms are linked by the fact that the same samples were used for each, thus maximizing the potential of integrative analysis.
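In practice, building such a stack amounts to keeping the samples that were profiled on every platform and joining their measurements by sample identifier. A minimal sketch, assuming each platform is available as a table keyed by sample ID (the IDs, platform names and values below are hypothetical and do not follow the real TCGA barcode format):

```python
# Hypothetical per-platform tables keyed by sample ID (values are toy measurements).
platforms = {
    "mutation":    {"S1": 3, "S2": 0, "S3": 7},
    "expression":  {"S1": 5.2, "S3": 1.8, "S4": 2.4},
    "methylation": {"S1": 0.61, "S2": 0.12, "S3": 0.45},
}

# Samples profiled on every platform can be stacked for integrative analysis.
shared = sorted(set.intersection(*(set(table) for table in platforms.values())))

# One record per shared sample, with one entry per platform.
data_stack = {
    sample: {name: table[sample] for name, table in platforms.items()}
    for sample in shared
}
print(shared)
print(data_stack)
```

Samples missing from any platform are dropped here; whether to drop or impute them is a design choice that depends on the downstream model.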

3.3 UK Biobank

The UK Biobank resource currently contains the following genomic data:

Genotypes and imputation therefrom for 488,000 participants, with imputed versions provided using TOPMed, Genomics England, and Haplotype Reference Consortium pipelines;

Exome sequences for 470,000 participants, with versions using OQFE, DRAGEN, and gnomAD pipelines;

Whole genome sequences for 200,000 participants.

References

Athaya, Tasbiraha, Rony Chowdhury Ripan, Xiaoman Li, and Haiyan Hu. 2023. “Multimodal Deep Learning Approaches for Single-Cell Multi-Omics Data Integration.” Briefings in Bioinformatics 24 (5). https://doi.org/10.1093/bib/bbad313.

Lin, Yuqi, Wen Zhang, Huanshen Cao, Gaoyang Li, and Wei Du. 2020. “Classifying Breast Cancer Subtypes Using Deep Neural Networks Based on Multi-Omics Data.” Genes 11 (8): 888. https://doi.org/10.3390/genes11080888.

Toussaint, Philipp A, Florian Leiser, Scott Thiebes, Matthias Schlesner, Benedikt Brors, and Ali Sunyaev. 2023. “Explainable Artificial Intelligence for Omics Data: A Systematic Mapping Study.” Briefings in Bioinformatics 25 (1). https://doi.org/10.1093/bib/bbad453.

Xu, Yang, Edmon Begoli, and Rachel Patton McCord. 2022. “sciCAN: Single-Cell Chromatin Accessibility and Gene Expression Data Integration via Cycle-Consistent Adversarial Network.” npj Systems Biology and Applications 8 (1). https://doi.org/10.1038/s41540-022-00245-6.

Zhang, Dongdong, Changchang Yin, Jucheng Zeng, Xiaohui Yuan, and Ping Zhang. 2020. “Combining Structured and Unstructured Data for Predictive Models: A Deep Learning Approach.” BMC Medical Informatics and Decision Making 20 (1). https://doi.org/10.1186/s12911-020-01297-6.