Deep Learning for omics data
Outline
XAI for omics data Section 1
Multimodal deep learning for omics data Section 2
Benchmark dataset Section 3
1 XAI for omics data
Explainable artificial intelligence for omics data: a systematic mapping study (Toussaint et al. 2023)
1.1 AI methods used for omics data
1.1.1 DNA-based omics data
Some approaches use models generally considered transparent:
Linear/logistic regression
Generalized linear models
Gradient boosting
Rule mining
However, the majority of studies focus on non-transparent models:
Random forests
Support vector machines
Neural networks
In fact, DNA sequences analyzed with convolutional neural networks (CNNs) are the most common combination overall.
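A minimal sketch of why CNNs pair naturally with DNA sequences: once a sequence is one-hot encoded, a convolutional filter acts as a motif scanner that slides along it. The filter, the motif "TATA" and the example sequence are illustrative assumptions, not taken from the mapping study; a trained CNN would learn its filters from data.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) matrix; unknown bases such as N stay all-zero."""
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

def conv1d_scan(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 1D cross-correlation of a (L, 4) input with a (k, 4) filter."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(x.shape[0] - k + 1)])

# Hypothetical hand-written filter that responds most strongly to the motif "TATA".
motif_filter = one_hot("TATA")

scores = conv1d_scan(one_hot("GCTATAAGC"), motif_filter)
best = int(np.argmax(scores))  # start position of the best-matching window
```

Each score counts how many positions of the window match the filter, so the position of the maximum gives a direct, human-readable location for the detected motif.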
1.1.2 Transcriptomic data (RNA sequence data)
Gene expression data are the most used omics data type in the mapping study.
1.1.3 Microbiomic, proteomic and multi-omics data
Rarely analyzed data types
The most popular approach for these data types is using multiple AI methods, followed by DNNs and random forests.
1.1.4 Multi-omics data
Papers usually argue that one omics data type is not sufficient to achieve their results and thus apply a second data type for additional input information.
1.2 Explainability methods used for AI methods
1.2.1 Neural networks
DNNs and CNNs are predominant. Across all NNs, feature relevance is applied in 87 studies (47.3%). Many approaches utilize SHAP values to identify the feature relevance of the NNs.
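To make the idea behind SHAP-style feature relevance concrete, the sketch below computes exact Shapley values for a tiny model by enumerating all feature coalitions, with absent features replaced by a baseline value. This is an illustration of the underlying quantity, not the SHAP library's implementation (which relies on approximations for realistic models); `toy_model`, the inputs and the all-zero baseline are assumptions made for the example.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical toy model: two features interact, a third contributes additively.
def toy_model(z):
    return 2.0 * z[0] * z[1] + 0.5 * z[2]

x = [1.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split equally between the two interacting features. The enumeration is exponential in the number of features, which is why it is only feasible for toy cases like this one.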
Some apply architecture modification on NNs to achieve interpretability. These architecture modifications change the network layout to represent biological connections.
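One common way to realise such an architecture modification is a masked layer, where each hidden unit stands for a pathway and is connected only to the genes assigned to it. The sketch below assumes this design; the gene names, pathway names, membership table and random weights are all hypothetical placeholders.

```python
import numpy as np

genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
# Hypothetical gene-to-pathway membership, for illustration only.
pathways = {
    "pathway_1": ["gene_a", "gene_b"],
    "pathway_2": ["gene_c", "gene_d"],
}

# Binary mask of shape (n_genes, n_pathways): 1 where a gene belongs to a pathway.
mask = np.array(
    [[1.0 if g in members else 0.0 for members in pathways.values()] for g in genes]
)

rng = np.random.default_rng(0)
weights = rng.normal(size=mask.shape)  # untrained weights, stand-in for learned ones

def pathway_layer(x, weights, mask):
    """Linear layer plus ReLU in which each hidden unit only sees the genes of its pathway."""
    return np.maximum(0.0, x @ (weights * mask))

x = np.array([[0.5, 1.0, 2.0, 0.0]])
hidden = pathway_layer(x, weights, mask)

# Changing gene_c (a pathway_2 gene) leaves the pathway_1 unit untouched.
x_perturbed = x.copy()
x_perturbed[0, 2] = 10.0
hidden_perturbed = pathway_layer(x_perturbed, weights, mask)
```

Because the connectivity mirrors the biology, the activation of a hidden unit can be read as a pathway-level signal without a separate post-hoc explanation step.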
Emerging NN variants include variational autoencoders and transformers, which mostly use feature relevance or architecture modifications.
1.2.2 Tree-based methods
Decision trees
Random forests
1.2.3 Statistical methods
Bayesian networks
Linear or logistic regression
Studies using these methods do not apply additional post-hoc approaches
1.3 Relationships between omics data, AI method and explainability method
CNNs for DNA sequence data
Rule mining and DNNs for gene expression data
Multiple AI methods for multi-omics data
One interesting representation example:
Karim et al. [98] utilize the DeepInsight framework to transform their gene expression data into images, which are then classified with a CNN, allowing for visual post-hoc explanations via SHAP pixel maps.
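The tabular-to-image idea can be sketched as follows: place each gene on a 2D plane according to its expression profile across samples, map those coordinates to pixels, and write each sample's expression values into the resulting grid. This is a simplified sketch, not the DeepInsight implementation: it uses plain PCA for the 2D placement, whereas the published framework uses nonlinear embeddings and an additional rotation step that are omitted here. The data are synthetic and the 8 × 8 image size is an arbitrary choice.

```python
import numpy as np

def feature_coordinates(X):
    """Place each feature (column of X) on a 2D plane via PCA of the transposed matrix."""
    F = X.T - X.T.mean(axis=0)  # (n_features, n_samples), centered
    U, S, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :2] * S[:2]     # (n_features, 2)

def to_pixels(coords, size):
    """Rescale 2D coordinates to integer pixel indices in [0, size - 1]."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    scaled = (coords - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.round(scaled * (size - 1)).astype(int)

def sample_to_image(sample, pixels, size):
    """Write feature values into the grid, averaging features that share a pixel."""
    total = np.zeros((size, size))
    count = np.zeros((size, size))
    np.add.at(total, (pixels[:, 0], pixels[:, 1]), sample)
    np.add.at(count, (pixels[:, 0], pixels[:, 1]), 1)
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))  # synthetic data: 20 samples, 50 genes
pixels = to_pixels(feature_coordinates(X), size=8)
image = sample_to_image(X[0], pixels, size=8)
```

Since every pixel corresponds to a known set of genes, a pixel-level attribution map from the CNN can be traced back to the genes behind it.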
1.4 Transformer-based + XAI
SetQuence & SetOmic: Deep Set Transformer-based Representations of Cancer Multi-Omics
1.4.1 Methods
They propose relying on non-fixed sets of mutated genome sequences, which a Transformer-based deep neural network (SetQuence) can use for supervised learning of oncology-relevant tasks.
They extend the model with SetOmic, which incorporates these representations together with multiple sources of omics data in a flexible way.
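Working with non-fixed sets means the model must accept a variable number of elements and must not depend on their order. A minimal sketch of that property, in the spirit of Deep Sets: encode each element, pool across the set, then classify. The single linear encoder, the mean pooling and all dimensions are assumptions for illustration; in the paper each element is a sequence processed by a Transformer, and this sketch is not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(6, 4))  # hypothetical per-element encoder weights
W_out = rng.normal(size=(4, 3))    # hypothetical classifier weights (3 classes)

def set_forward(elements):
    """Encode each set element, mean-pool over the set, then score the pooled vector."""
    h = np.tanh(elements @ W_embed)  # (n_elements, 4)
    pooled = h.mean(axis=0)          # order-independent summary of the set
    return pooled @ W_out            # (3,) class scores

small_set = rng.normal(size=(3, 6))  # a sample with 3 mutated sequences (synthetic)
large_set = rng.normal(size=(7, 6))  # a sample with 7 mutated sequences (synthetic)

scores_small = set_forward(small_set)
scores_reversed = set_forward(small_set[::-1])
scores_large = set_forward(large_set)
```

Reversing the element order gives the same scores, and sets of different sizes produce outputs of the same shape.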
1.4.2 Explainability through primary attribution methods
Attribution-based methods:
Integrated Gradients
Input × Gradient
DeepLIFT
SHAP
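Two of the listed methods are simple enough to sketch by hand for a small differentiable model. The example below applies Input × Gradient and Integrated Gradients to a logistic model with an analytic gradient; the weights, input, all-zero baseline and number of integration steps are assumptions for illustration, and a real study would typically use an attribution library on the trained network instead.

```python
import math

w = [1.5, -2.0, 0.5]  # hypothetical model weights
b = 0.1               # hypothetical bias

def model(x):
    """Logistic model: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x):
    """Analytic gradient of the model output with respect to the input."""
    p = model(x)
    return [p * (1.0 - p) * wi for wi in w]

def input_x_gradient(x):
    """Input x Gradient: elementwise product of the input and its gradient."""
    return [xi * gi for xi, gi in zip(x, gradient(x))]

def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x (midpoint rule),
    then scale by the input difference."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, gi in enumerate(gradient(point)):
            avg_grad[i] += gi / steps
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]

x = [1.0, -0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)
ixg = input_x_gradient(x)
```

Integrated Gradients satisfies completeness: the attributions sum (up to integration error) to the difference between the model output at `x` and at the baseline. Input × Gradient uses only the local gradient at `x`, so it carries no such guarantee.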
1.4.3 Data
TCGA
They used transcriptome expression data, somatic mutation data and their corresponding clinical annotations across 33 tumour types from the TCGA pan-cancer dataset.
2 Multimodal deep learning for omics data
Multimodal deep learning approaches for single-cell multi-omics data integration (Athaya et al. 2023)
2.1 Multi-omics data modalities
2.1.1 Single-cell genomics data
Single-cell DNA sequencing (scDNA-seq)
2.1.2 Single-cell transcriptomics data
scRNA-seq, also known as single-cell transcriptomics or gene expression data
2.1.3 Single-cell epigenomics data
Epigenomics measures genome-wide epigenomic modifications, such as DNA methylation, histone modifications and chromatin accessibility
2.1.4 Single-cell proteomics data
Single-cell proteomics investigates individual cells’ protein content, analyzing their roles and interactions
2.2 Multimodal deep learning techniques
Fully connected neural network (FCNN)
Convolutional neural network (CNN)
Recurrent neural network (RNN)
Autoencoder (AE)
Heterogeneous model
Zhang et al. (Zhang et al. 2020) developed fusion models based on CNNs and RNNs to learn patient representations by combining sequential clinical notes with static demographic and admission data.
Lin et al. (Lin et al. 2020) utilized three separate encoder networks to learn marginal representations of mRNA, DNA methylation and copy number variation data for breast cancer subtype prediction. These marginal representations were concatenated and fed into a classification subnetwork to learn a joint representation.
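The data flow described for Lin et al. can be sketched as three modality-specific encoders whose outputs are concatenated and passed to a classifier. Only the overall structure follows the description above; the layer sizes, the one-layer ReLU encoders, the number of subtypes (4) and the random untrained weights are assumptions, and the inputs are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, out_dim):
    """Return a one-layer ReLU encoder with random (untrained) weights."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: np.maximum(0.0, x @ W)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

n_patients = 5
inputs = {
    "mrna": rng.normal(size=(n_patients, 30)),
    "methylation": rng.normal(size=(n_patients, 20)),
    "cnv": rng.normal(size=(n_patients, 10)),
}

# One encoder per modality, each producing an 8-dimensional marginal representation.
encoders = {name: make_encoder(x.shape[1], 8) for name, x in inputs.items()}
marginals = [encoders[name](x) for name, x in inputs.items()]

# Concatenate the marginal representations and classify (hypothetical 4 subtypes).
joint_input = np.concatenate(marginals, axis=1)
W_cls = rng.normal(scale=0.1, size=(joint_input.shape[1], 4))
subtype_probs = softmax(joint_input @ W_cls)
```

Keeping one encoder per modality lets each data type have its own input dimensionality while the classifier still sees a single joint vector per patient.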
sciCAN (Xu, Begoli, and McCord 2022) combined generative adversarial networks (GAN) and encoder models for integrating single-cell multi-omics data.
2.3 Model architecture for data integration
VAE
AE
Encoder
GAN
FCNN
Heterogeneous model
2.4 Fusion methods
Early Fusion
Concatenating input features from different modalities to serve as the input of a deep learning model.
Intermediate Fusion
Late Fusion
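A minimal sketch contrasting the two ends of this spectrum. Early fusion follows the definition given above. For late fusion the sketch assumes the common usage of the term, namely one model per modality whose predictions are combined afterwards (here by averaging); intermediate fusion, which combines learned representations inside the network, sits between the two and is not shown. The modalities, dimensions and random weights are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: 4 cells measured in two modalities.
rna = rng.normal(size=(4, 6))
protein = rng.normal(size=(4, 3))

# Early fusion: concatenate raw features, then apply a single model.
early_input = np.concatenate([rna, protein], axis=1)  # (4, 9)
w_early = rng.normal(size=early_input.shape[1])
early_pred = sigmoid(early_input @ w_early)

# Late fusion (assumed definition): one model per modality, predictions averaged.
w_rna = rng.normal(size=rna.shape[1])
w_protein = rng.normal(size=protein.shape[1])
late_pred = 0.5 * (sigmoid(rna @ w_rna) + sigmoid(protein @ w_protein))
```

Early fusion lets the model learn cross-modality interactions from the start but requires all modalities for every sample; late fusion keeps the modality models independent, which makes it easier to handle a missing modality.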
3 Benchmark dataset
3.1 R simulated data
3.2 TCGA
TCGA’s principal aims are to generate, quality control, merge, analyze and interpret molecular profiles at the DNA, RNA, protein and epigenetic levels for hundreds of clinical tumors representing various tumor types and their subtypes.
As of 24 July 2013, TCGA had mapped molecular patterns across 7,992 total cases representing 27 tumor types.
The TCGA Pan-Cancer project assembled data from thousands of patients with primary tumors occurring in different sites of the body, covering 12 tumor types.
Six types of omics characterization were performed, creating a ‘data stack’ in which data elements across the platforms are linked by the fact that the same samples were used for each, thus maximizing the potential of integrative analysis.
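In practice, building such a stack comes down to keeping the samples that were profiled on every platform and aligning the platform tables on the sample identifier. The sketch below shows that step on toy data; the platform names, sample IDs and values are invented for illustration and are not TCGA data or identifiers.

```python
# Hypothetical per-platform tables keyed by sample ID (toy values).
platforms = {
    "mutation": {"S1": [0, 1], "S2": [1, 1], "S3": [0, 0]},
    "expression": {"S1": [2.3], "S3": [0.7], "S4": [1.1]},
    "methylation": {"S1": [0.9], "S2": [0.2], "S3": [0.4]},
}

# Samples present on every platform.
shared = sorted(set.intersection(*(set(table) for table in platforms.values())))

# The "data stack": for each shared sample, its measurements on each platform.
stack = {
    sample: {name: table[sample] for name, table in platforms.items()}
    for sample in shared
}
```

Only samples S1 and S3 appear on all three platforms, so only those two enter the integrated stack.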
3.3 UK Biobank
The UK Biobank resource currently contains the following genomic data:
Genotypes and imputation therefrom for 488,000 participants, with imputed versions provided using TOPMed, Genomics England, and Haplotype Reference Consortium pipelines;
Exome sequences for 470,000 participants, with versions using OQFE, DRAGEN, and gnomAD pipelines;
Whole genome sequences for 200,000 participants.