The SUMO R package provides a flexible simulation framework for generating synthetic multi-omics datasets. Such datasets are vital for evaluating and benchmarking integrative bioinformatics tools, particularly factor-analysis-based methods, although they can also be used with other approaches such as clustering and dimensionality reduction. SUMO lets users simulate multiple omics datasets with predefined signal structures, making it well suited to method development and reproducible benchmarking.
This vignette accompanies the SUMO R package and serves as both a tutorial and reference for researchers and developers interested in synthetic data for multi-omics.
You can install SUMO directly from CRAN using the following command:
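A minimal sketch of the standard CRAN installation call. The `requireNamespace()` guard is an addition for illustration, so that the package is only downloaded when it is not already installed:

```r
# Install SUMO from CRAN only if it is not already installed
if (!requireNamespace("SUMO", quietly = TRUE)) {
  install.packages("SUMO")
}
```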
You can get help in SUMO directly by using the following command:
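The exact help command is an assumption here. `help(package = ...)` is the generic base R way to open the documentation index of an installed package, whereas `?SUMO` only works if the package ships a help topic of that name:

```r
# Open the documentation index of the installed package (generic base R call)
if (requireNamespace("SUMO", quietly = TRUE)) {
  help(package = "SUMO")
}

# Alternative, assuming a package-level help topic named "SUMO" exists:
# ?SUMO
```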
You will receive a quick description of SUMO, its key features, its main functions, and contact details for the author and maintainer. SUMO's main functions include:

- simulateMultiOmics(): Simulates two or more high-dimensional multi-omics datasets.
- simulate_twoOmicsData(): Simulates two high-dimensional multi-omics datasets.
- plot_simData(): Visualizes generated data at different levels.
- plot_factor(): Displays raw factor scores across samples for signal inspection.
- plot_weights(): Visualizes raw feature loadings to assess signal versus noise.
- demo_multiomics_analysis(): Full demo function for applying MOFA on SUMO-generated or real-world chronic lymphocytic leukemia (CLL) data.
- compute_means_vars(): Estimates parameters from a real experimental dataset.

SUMO originally introduced the simulate_twoOmicsData() function to support synthetic data generation for two omics layers, such as transcriptomics and proteomics. This function gives users control over the simulated data: it creates datasets with user-defined dimensions, signal regions, and latent factor relationships.

However, modern integrative analysis often involves more than two omics datasets. To address this, SUMO introduces an extended, general-purpose function: simulateMultiOmics().
The function simulateMultiOmics() generalizes and
replaces the two-layer simulator, allowing users to define two
or more omics layers with distinct or shared signal structures.
It is now the recommended core simulator for all multi-omics
benchmarking and demo applications.
library(SUMO)

sim_object1 <- simulateMultiOmics(
  vector_features = c(3000, 2500, 2000),  # Features in omic1, omic2, omic3
  n_samples = 100,                        # Shared samples
  n_factors = 3,                          # Number of latent factors
  snr = 3,                                # Signal-to-noise ratio
  signal.samples = c(5, 1),
  signal.features = list(c(3, 0.3), c(2.5, 0.25), c(2, 0.2)),
  factor_structure = "mixed",             # Shared + specific factors
  num.factor = "multiple",
  seed = 123
)

The table below explains each parameter used in the simulateMultiOmics() example:
| Parameter | Meaning |
|---|---|
| vector_features = c(3000, 2500, 2000) | Specifies the number of features in each omics layer. Omic 1 has 3000, Omic 2 has 2500, and Omic 3 has 2000 features. |
| n_samples = 100 | Number of shared samples across all omics. These samples will be simulated identically across the 3 layers. |
| n_factors = 3 | The number of latent factors (biological processes, patterns, or signals) to simulate. |
| snr = 3 | Signal-to-noise ratio. A higher value means simulated signal is more distinguishable from noise. |
| signal.samples = c(5, 1) | Mean and standard deviation of the signal strength across samples. |
| signal.features = list(c(3, 0.3), c(2.5, 0.25), c(2, 0.2)) | Mean and standard deviation of signal across features, defined separately for each omic. |
| factor_structure = "mixed" | Specifies that some latent factors are shared across all omics, while others are omic-specific. |
| num.factor = "multiple" | Indicates that multiple factors are being simulated (not just one global factor). |
| seed = 123 | Ensures reproducibility. The same random seed yields the same output when rerun. |
The object returned by this example contains the following components:

- omic.list: A list of 3 numeric matrices, each representing a simulated omics dataset with dimensions (features × samples). These are the core data matrices used for downstream analysis or benchmarking.
- signal_annotation: A structured list containing information about where signal was injected in both samples and features. This is essential for validating method accuracy (e.g., sensitivity, specificity).
- list_alphas: A matrix of sample-level latent factor scores with dimensions (samples × latent factors). Each row corresponds to a sample and each column to a latent factor.
- list_betas: A list of feature-level loading matrices, one for each omic. Each matrix has dimensions (features × latent factors) and captures how strongly each feature loads onto each latent factor, specific to its omics layer.

Use either simulate_twoOmicsData() or simulateMultiOmics() for two-layer simulation. This allows more fine-tuned control over individual omic signal distributions and latent factor type (shared, unique, mixed). The example below demonstrates using the former function to simulate a two-layer dataset.
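For comparison, a two-layer run can also be sketched with simulateMultiOmics(), the second of the two functions, by passing a two-element vector_features. Only argument names that appear in this vignette are used. The specific values (feature counts, signal settings) are illustrative assumptions, and the call is skipped when SUMO is not installed.

```r
# Arguments for a two-layer simulation; the values are illustrative assumptions
two_layer_args <- list(
  vector_features  = c(3000, 2500),                  # features in omic 1 and omic 2
  n_samples        = 100,                            # samples shared by both layers
  n_factors        = 2,                              # latent factors to simulate
  snr              = 3,                              # signal-to-noise ratio
  signal.samples   = c(5, 1),                        # mean and SD across samples
  signal.features  = list(c(3, 0.3), c(2.5, 0.25)),  # one (mean, SD) pair per omic
  factor_structure = "mixed",
  num.factor       = "multiple",
  seed             = 123
)

# Sanity check: one signal.features entry is expected per omics layer
stopifnot(length(two_layer_args$signal.features) ==
            length(two_layer_args$vector_features))

# Run the simulation only when SUMO is available
if (requireNamespace("SUMO", quietly = TRUE)) {
  sim_two_layer <- do.call(SUMO::simulateMultiOmics, two_layer_args)
  names(sim_two_layer)  # list the returned components
}
```

Keeping the arguments in a named list makes it easy to reuse the same settings across runs or to vary a single value, such as snr, in a benchmarking loop.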
The simulateMultiOmics() function in SUMO supports three approaches for setting up simulation parameters, depending on the user's goals and level of expertise, so that both beginners and advanced users can generate high-quality synthetic data:

- Rely on the default parameters.
- Supply user-defined parameters.
- Estimate parameters from a real dataset with the compute_means_vars() function and use them in the simulation.

Below is a detailed description of each parameter accepted by simulateMultiOmics():
| Parameter | Required | Default | Description |
|---|---|---|---|
| vector_features | ✅ | none | Specifies the number of features in each omics layer. E.g., c(3000, 2500, 2000) simulates 3 omics with 3000, 2500, and 2000 features. |
| n_samples | ✅ | none | Number of shared samples across all omics. These samples will be simulated identically across layers. |
| n_factors | ✅ | none | Number of latent factors (biological patterns or signals) to simulate. |
| snr | ❌ | 2 | Signal-to-noise ratio. Higher values simulate clearer signal. E.g., snr = 3 makes signal more distinct from background noise. |
| signal.samples | ❌ | c(5, 1) | Mean and standard deviation of signal intensity across samples for each factor. |
| signal.features | ❌ | NULL | A list like list(c(3, 0.3), c(2.5, 0.25), c(2, 0.2)) defines signal strength (mean, SD) across features per omic. |
| factor_structure | ❌ | "mixed" | Specifies whether factors are shared, unique to each omic, or a mix of both ("shared", "unique", or "mixed"). |
| num.factor | ❌ | "multiple" | Indicates how many factors each omic receives: "single" for one factor per omic, "multiple" for several. |
| seed | ❌ | NULL | Optional seed value for reproducibility. Using the same seed ensures the same output when rerun. |
Here’s an example of using default and user-defined parameters:
# Minimal example using default parameters
sim_default <- simulateMultiOmics(
  vector_features = c(3000, 2500, 2000),
  n_samples = 100,
  n_factors = 3
)

# Fully customized example with user-defined parameters
sim_custom <- simulateMultiOmics(
  vector_features = c(3000, 2500, 2000),
  n_samples = 100,
  n_factors = 3,
  snr = 3,
  signal.samples = c(10, 2),
  signal.features = list(c(3, 0.3), c(2.5, 0.25), c(2, 0.2)),
  factor_structure = "mixed",
  num.factor = "multiple",
  seed = 123
)

To simulate omics layers with real-world-like distributions, users can leverage the compute_means_vars() function. This function calculates the overall, row-wise, and column-wise means and standard deviations of each dataset. These statistics can then guide signal generation in simulateMultiOmics().
This approach is especially useful when benchmarking methods using simulated data that closely mimic real omics datasets (e.g., gene expression and methylation data).
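To make these kinds of summaries concrete, here is a minimal base R sketch of overall, row-wise, and column-wise means and standard deviations for a list of matrices. The helper `summarise_matrix()` is hypothetical and is not SUMO's implementation; compute_means_vars() may organize and name its output differently.

```r
# Hypothetical helper (not part of SUMO): summary statistics for one matrix
summarise_matrix <- function(m) {
  list(
    overall_mean = mean(m),
    overall_sd   = sd(as.vector(m)),
    row_means    = rowMeans(m),
    row_sds      = apply(m, 1, sd),
    col_means    = colMeans(m),
    col_sds      = apply(m, 2, sd)
  )
}

# Two small toy matrices standing in for real omics data
set.seed(1)
toy_data <- list(
  omic1 = matrix(rnorm(200, mean = 5, sd = 2), nrow = 20, ncol = 10),
  omic2 = matrix(rnorm(300, mean = 10, sd = 3), nrow = 30, ncol = 10)
)

# Apply the helper to every dataset in the list
toy_stats <- lapply(toy_data, summarise_matrix)

# Overall mean and SD per dataset, one column per omic
sapply(toy_stats, function(s) c(mean = s$overall_mean, sd = s$overall_sd))
```

Each (mean, SD) pair obtained this way has the same shape as the pairs expected by signal.samples and signal.features.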
Below is an example using two simple simulated matrices to
demonstrate how users can extract empirical means and standard
deviations and plug them into simulateMultiOmics():
# Simulate two small matrices that stand in for real omics data
set.seed(123)
real_data1 <- matrix(rnorm(100, mean = 5, sd = 2), nrow = 10, ncol = 10)   # Omic 1
real_data2 <- matrix(rnorm(100, mean = 10, sd = 3), nrow = 10, ncol = 10)  # Omic 2
data_list <- list(real_data1, real_data2)

# Compute summary statistics using SUMO's built-in function
real_stats <- compute_means_vars(data_list)
print(real_stats)

# Use output from `real_stats` to inform custom simulation settings
# (future versions may allow direct plugging into simulateMultiOmics)

This prints the computed summary statistics for each dataset. You can now plug these values into the signal.features and signal.samples parameters to simulate data with similar distributions:
# Simulate using real-data-informed parameters
sim_data_real <- simulateMultiOmics(
  vector_features = c(10000, 15000),
  n_samples = 150,
  n_factors = 2,
  snr = 0.5,
  signal.samples = c(real_stats$mean_smp, real_stats$sd_smp),    # Use sample-level mean/sd
  signal.features = list(
    c(real_stats$overall_mean.one, real_stats$overall_sd.one),   # Omic 1
    c(real_stats$overall_mean.two, real_stats$overall_sd.two)    # Omic 2
  ),
  factor_structure = "mixed",
  num.factor = "multiple",
  seed = 42
)

This real-data-informed simulation mode bridges the gap between fully synthetic and real datasets, offering realism while retaining control over ground-truth signal structures.
You can view the resulting structure with:
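One way to do this is with base R's `str()`. The sketch below runs on a small mock list that mimics the component layout described earlier in this vignette (omic.list, list_alphas, list_betas). The mock and its sizes are assumptions for illustration; in a real session, `sim_data_real` would take its place.

```r
# Mock object mirroring the documented layout (assumption: 2 omics, 2 factors)
set.seed(42)
n_samples  <- 15
n_factors  <- 2
n_features <- c(40, 60)

mock_sim <- list(
  omic.list   = lapply(n_features, function(p) matrix(rnorm(p * n_samples), nrow = p)),
  list_alphas = matrix(rnorm(n_samples * n_factors), nrow = n_samples),
  list_betas  = lapply(n_features, function(p) matrix(rnorm(p * n_factors), nrow = p))
)

# Compact overview of the top-level components
str(mock_sim, max.level = 1)

# Dimensions of each omics matrix (features x samples), one column per omic
omic_dims <- sapply(mock_sim$omic.list, dim)
omic_dims

# Consistency checks: samples match the factor scores, features match the loadings
stopifnot(all(omic_dims[2, ] == nrow(mock_sim$list_alphas)))
stopifnot(all(omic_dims[1, ] == sapply(mock_sim$list_betas, nrow)))
```

Limiting `str()` to `max.level = 1` keeps the overview readable when the matrices are large.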
SUMO provides a variety of built-in visualization tools to help users explore simulated multi-omics data. These include heatmaps, 3D surface plots, factor score scatterplots, and feature loading visualizations. All functions are compatible with both merged and individual omics layers.
You can visualize either the merged omics matrix or individual omics layers using 2D heatmaps or 3D surface plots. These are useful for inspecting co-expression patterns, block structure, and noise.
The plot_factor() function shows factor scores across samples. When factor_num = "all" is used, multiple scatterplots are displayed, one per factor. You can also select a single factor to isolate its effect.
The plot_weights() function visualizes how strongly each
feature loads onto a selected latent factor. This helps identify
signal-driving features or clusters.
These plots help users verify the location and magnitude of signal across features and samples.
The function demo_multiomics_analysis() provides a comprehensive demo using either SUMO-generated data (option data_type = "SUMO") or real-world CLL data (option data_type = "real_world"). Here, we run a complete MOFA2-based analysis pipeline using these two data types.
This function includes preprocessing, MOFA model training, variance decomposition visualization, and optional PowerPoint report generation when export_pptx = TRUE. Future development will include HTML, DOCX, and PDF reports.
This means SUMO can be integrated seamlessly as a plugin into any factor-analysis-based multi-omics pipeline for testing and validating methods.
When results are saved to .pptx, the file is written automatically to the working directory on your local computer.
This includes model fitting, variance decomposition, and PowerPoint export.
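A guarded call might look like the sketch below. The argument names data_type and export_pptx are the ones mentioned above. Skipping the run when SUMO or MOFA2 is missing, and storing the result in `demo_result`, are assumptions for illustration.

```r
# Options taken from the text: simulated input, no PowerPoint report
demo_args <- list(data_type = "SUMO", export_pptx = FALSE)

# Run the demo only when both SUMO and MOFA2 are installed
have_pkgs <- requireNamespace("SUMO", quietly = TRUE) &&
  requireNamespace("MOFA2", quietly = TRUE)

if (have_pkgs) {
  demo_result <- do.call(SUMO::demo_multiomics_analysis, demo_args)
} else {
  message("SUMO and/or MOFA2 not installed; skipping the demo run.")
}
```

Setting data_type = "real_world" switches the input to the CLL data, and export_pptx = TRUE additionally writes the PowerPoint report.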
The following functions are essential for simulation logic and block generation:
- divide_samples(): Segments sample indices across latent factors.
- divide_vector(): Random sampling helper.
- feature_selection_one() / feature_selection_two(): Assign signal to omics.

SUMO is a foundational tool for synthetic multi-omics simulation. Its rigorously tested structure, customizability, and compatibility with modern pipelines make it a valuable asset to bioinformatics methodologists, educators, and practitioners.
For further information and updates, please refer to the accompanying publication in Bioinformatics Advances and the official GitHub repository.