SymSim: single cell RNA-Seq data simulator

SymSim is an R package made to simulate single cell RNA-seq data. It can be used to generate a single population of cells with similar statistical properties to real data, or to generate multiple discrete or continuous populations of cells, where users can input a tree to represent relationships between multiple populations. SymSim has the following applications:

Each cell has an EVF vector which defines the identify of the cell. Each gene has an gene effect vector of the same length as the EVF vector, and can be thought of as the weights of EVFs. The EVFs allow us to simulate discrete or continuous populations. We use the kinetic model to sample the true transcript count for each gene in each cell, and the parameters of the kinetic model (k_on, k_off, s) are calculated from external variability factors (EVFs) and gene-specific effects (or gene effect). The gene effect vectors allow us to simulate differentially expressed genes or co-expressed gene modules. From true transcript counts to observed counts (which can be read counts or UMI counts) we consider capture efficiency, amplification bias, sequencing depth and batch effects. At different steps, users have control on the amount of extrinsic variation, intrinsic variation and technical variation.

Install SymSim

library("devtools")
devtools::install_github('YosefLab/SymSim')

Main functions

SimulateTrueCounts( ) and True2ObservedCounts( ) are the two major functions to generate datasets. SimulateTrueCounts generates true transcript counts for the given number of genes and cells, where the cells can come from one single population, or multiple discrete populations, or continuous populations. True2ObservedCounts then simulates the library preparation and sequencing procedures, and convert the true transcript counts into observed read counts or UMI counts.

The input parameters of the package allow users to control intrinsic variation, technical variation and biological variation in the data. Intrinsic variation is modeled through the kinetic model; technical variation takes into account dropouts, amplification biases including length bias and GC bias, and batch effects; biological variation is modeled by EVFs (extrinsic variation factors).

SimulateTrueCounts( ) returns a list of four elements:
1 a matrix containing the true transcript counts;
2 gene level meta information;
3 cell level meta information, including a matrix of EVFs and a vector of cell identity (for example, the population the cell belongs to);
4 the parameters kon, koff and s used to simulation the true counts.

True2ObservedCounts( ) returns a list of two elements:
1 a matrix containing the observed read counts or UMI counts;
2 cell level meta information;

DivideBatches( ) takes the output of True2ObservedCounts( ) as input and split the data into multiple batches while adding batch effect for each batch. The batch information of each cell is in the output cell level meta information.

We provide functions for users to retrieve goldstandard information to benchmark computational methods for clustering, differential expression and trajectory inference. The “true” cluster information for each cell can be simply obtained from the output of function SimulateTrueCounts( ). Function getDEgenes( ) can be used to obtain differential expression information for genes, given two populations. Function getTrajectoryGenes( ) returns the simulated trajectory information of cells.

For a single population, BestMatchParams( ) returns parameters to generate datasets with similar statistical properties as a given input real dataset.

Description of input parameters for these functions can be found at the end of this document.

Simulate one population

First, we simulate the case where there is one population.

library("SymSim")

## Loading required package: plyr

## Loading required package: ggplot2

## Loading required package: Rtsne

## Loading required package: grid

## Loading required package: reshape

## 
## Attaching package: 'reshape'

## The following objects are masked from 'package:plyr':
## 
##     rename, round_any

## Loading required package: RColorBrewer

## Loading required package: Biobase

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which, which.max, which.min

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Loading required package: ape

## Loading required package: MASS

## Loading required package: stringi

## Loading required package: roxygen2

## Loading required package: phytools

## Loading required package: maps

## 
## Attaching package: 'maps'

## The following object is masked from 'package:plyr':
## 
##     ozone

## Loading required package: repr

## Loading required package: IRanges

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:reshape':
## 
##     expand, rename

## The following object is masked from 'package:plyr':
## 
##     rename

## The following object is masked from 'package:base':
## 
##     expand.grid

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:plyr':
## 
##     desc

## Loading required package: SummarizedExperiment

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## The following object is masked from 'package:plyr':
## 
##     count

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following objects are masked from 'package:base':
## 
##     aperm, apply, rowsum

ngenes <- 500
true_counts_res <- SimulateTrueCounts(ncells_total=300, ngenes=ngenes, evf_type="one.population", Sigma=0.4, randseed=0)
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="one.population", n_pc=20, label='pop', saving = F, plotname="one.population")
tsne_true_counts[[2]]

For the true counts computed above, taking them through the library preparation and sequencing procedures will yield read counts (if protocol=“nonUMI”) or UMI counts (if protocol=“UMI”). Each genes needs to be assigned a gene length. We sample lengths from human transcript lengths.

data(gene_len_pool)
gene_len <- sample(gene_len_pool, ngenes, replace = FALSE)
observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="nonUMI", alpha_mean=0.1, alpha_sd=0.05, gene_len=gene_len, depth_mean=1e5, depth_sd=3e3)

One can plot the mean-variance relationship in the observed read counts:

plot(log2(rowMeans(observed_counts[[1]])+1), log2(apply(observed_counts[[1]],1,cv)), col=adjustcolor("blue", alpha.f = 0.5), pch=19, xlab="log2(mean+1)", ylab="log2(CV)")

Simulate multiple discrete populations

When there are multiple populations, users need to provide a tree. A tree with five leaves (five populations) can be generated as follows:

phyla1 <- Phyla5()

or read from a file with the tree in Newick format:

phyla2 <- read.tree(system.file("extdata", "Newick_ABCDE.txt", package = "SymSim"))
par(mfrow=c(1,2))
plot(phyla1)
plot(phyla2)

The true counts of the five populations can be simulated:

true_counts_res <- SimulateTrueCounts(ncells_total=300, min_popsize=50, i_minpop=2, ngenes=ngenes, nevf=10, evf_type="discrete", n_de_evf=9, vary="s", Sigma=0.4, phyla=Phyla5(), randseed=0)
true_counts_res_dis <- true_counts_res
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="discrete populations (true counts)")
tsne_true_counts[[2]]

observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="nonUMI", alpha_mean=0.1, alpha_sd=0.05, gene_len=gene_len, depth_mean=1e5, depth_sd=3e3)
tsne_nonUMI_counts <- PlotTsne(meta=observed_counts[[2]], data=log2(observed_counts[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="observed counts nonUMI")
tsne_nonUMI_counts[[2]]

observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="UMI", alpha_mean=0.05, alpha_sd=0.02, gene_len=gene_len, depth_mean=5e4, depth_sd=3e3)
tsne_UMI_counts <- PlotTsne(meta=observed_counts[[2]], data=log2(observed_counts[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="observed counts UMI")
tsne_UMI_counts[[2]]

Add batch effects

We can divide the data we simulated using the previous steps into multiple batches and add batch effects to each batch.

observed_counts_2batches <- DivideBatches(observed_counts_res = observed_counts, nbatch = 2, batch_effect_size = 1)
tsne_batches <- PlotTsne(meta=observed_counts_2batches[[2]], data=log2(observed_counts_2batches[[1]]+1), evf_type="discrete", n_pc=20, label='batch', saving = F, plotname="observed counts in batches")
tsne_batches[[2]]

Simulate continuous populations

We use the same tree as used for the simulation of discrete populations above. To visualize the continuous populations, we color the cells by the edges they belong to on the tree. We then label both the internal and tip nodes of the tree.

plot(phyla1, show.tip.label=F)
nodelabels()
tiplabels()

true_counts_res <- SimulateTrueCounts(ncells_total=500, ngenes=ngenes, nevf=20, evf_type="continuous", n_de_evf=12, vary="s", Sigma=0.4, phyla=Phyla5(), randseed=1)
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="continuous", n_pc=20, label='pop', saving = F, plotname="continuous populations (true counts)")
tsne_true_counts[[2]]

Retrieve goldstandard information to benchmark computational methods

getDEgenes( ) allows users to obtain DE information for genes. It returns the following information for each gene:
nDiffEVF: the number of DiffEVFs used for each gene
logFC_theoretical: log2 fold change based on kinetic parameters
wil.p_true_counts: adjusted wilcoxon p-value based on true counts

DEinfo <- getDEgenes(true_counts_res = true_counts_res_dis, popA = 1, popB = 3)
summary(DEinfo)

##                   Length Class  Mode   
## nDiffEVF          500    -none- numeric
## logFC_theoretical 500    -none- numeric
## wil.p_true_counts 500    -none- numeric

getTrajectoryGenes( ) retrieve thes information of cells on continuous trajectories. It outputs a data frame, where each row corresponds to a cell. Each cell has information “pseudotime” (distance from root) and “branch” (on which branch is the cell).

TrajInfo <- getTrajectoryGenes(true_counts_res$cell_meta)
head(TrajInfo)

##        branch pseudotime
## cell_1    6_7 0.00000000
## cell_2    6_7 0.01694915
## cell_3    6_7 0.03389831
## cell_4    6_7 0.05084746
## cell_5    6_7 0.06779661
## cell_6    6_7 0.08474576

Find simulations which match a given experimental dataset from our database

Given an experimental dataset, users can find the simulation which match the experimental dataset in terms of statistics including the mean expression, the fano factor and the percentage of expressing cells of genes. The function BestMatchParams() returns the parameter configurations which generate the top matching simulations. In the following example, we obtain the top 5 best matching parameter configurations for respectively the cortex and the Th17 datasets. The function also generates Q-Q plots of mean, percent-non-zero, standard deviation (sd) of genes between the given experimental dataset and simulated dataset with top ranking. Given the cortex dataset:

cortex_counts <- read.table(system.file("extdata", "cortex_counts.txt", package = "SymSim"), header = F, stringsAsFactors = F)
best_matches_UMI <- BestMatchParams('UMI',cortex_counts,'best_params.umi.qqplot.pdf', depth_range = c(200e3,Inf), n_optimal=5)

best_matches_UMI

##       gene_effects_sd Sigma scale_s alpha_mean alpha_sd depth_mean depth_sd
## 11355               2   0.2     0.4       0.05    0.025      3e+05   150000
## 11356               2   0.2     0.4       0.05    0.025      3e+05   150000
## 7204                2   0.2     0.5       0.03    0.009      3e+05    90000
## 11344               2   0.2     0.4       0.05    0.015      3e+05    90000
## 7225                2   0.2     0.5       0.03    0.009      5e+05   150000
##       nPCR1 prop_hge mean_hge protocol gene_effect_prob      dist
## 11355    14     0.01        3      UMI              0.3 0.2846441
## 11356    14     0.02        3      UMI              0.3 0.3194173
## 7204     10     0.02        4      UMI              0.3 0.3478997
## 11344    14     0.02        3      UMI              0.3 0.3489926
## 7225     10     0.03        4      UMI              0.3 0.3538994

Given the Th17 dataset:

load(system.file("extdata", "Th17_data.RData", package = "SymSim"))
counts <- ss2_130cells_counts[which(rowSums(ss2_130cells_counts > 0) > 10),]
best_matches_nonUMI <- BestMatchParams('nonUMI',counts,'best_params.nonUMI.qqplot.pdf', depth_range = c(1e6, 2e6), n_optimal=5)

best_matches_nonUMI

##      gene_effects_sd Sigma scale_s alpha_mean alpha_sd depth_mean depth_sd
## 1651               1   0.2     0.5       0.15    0.045      2e+06    6e+05
## 1649               1   0.2     0.5       0.15    0.045      2e+06    6e+05
## 5110               2   0.2     0.5       0.15    0.045      2e+06    6e+05
## 7480               2   0.1     0.3       0.20    0.060      2e+06    6e+05
## 1654               1   0.2     0.5       0.15    0.045      2e+06    6e+05
##      nPCR1 prop_hge mean_hge protocol gene_effect_prob      dist
## 1651    18    0.020        4   nonUMI              0.3 0.8500673
## 1649    18    0.015        5   nonUMI              0.3 0.8535507
## 5110    18    0.025        4   nonUMI              0.3 0.8593795
## 7480    18    0.020        4   nonUMI              0.3 0.8649737
## 1654    18    0.025        4   nonUMI              0.3 0.8699528

Description of input parameters for major functions

Parameters for function SimulateTrueCounts( ) are:
ncells_total total number of cells from all populations;
min_popsize number of cells in the rarest population;
i_minpop specifies which population has the smallest size;
ngenes number of genes;
evf_center the value which evf mean is generated from (default=1);
nevf number of EVFs for each kinetic parameter (default=30);
evf_type indicates the population structure of the cells, can be “one.population”, “discrete” or “continuous”;
n_de_evf number of differential evfs between populations for one kinetic parameter (default=18 when vary=‘s’);
impulse when generating continuous populations, use the impulse model or not. Default is FALSE;
vary which kinetic parameters have differential evfs. Can be “all”, “kon”, “koff”, “s”, “except_kon”, “except_koff”, “except_s”;
Sigma controls heterogeneity each population;
phyla a tree which defines relationship between populations;
geffect_mean the mean of gene effect size;
gene_effects_sd controls differences between genes;
gene_effect_prob probability of non-zero values in the gene effect vectors;
bimod adjusts the bimodality of gene expression, thus controls intrinsic variation;
param_realdata the experimental dataset used to estimate kon, koff and s parameters;
scale_s the cell size parameter in (0,1). Use smaller value for cell types known to be small (like naive cells);
prop_hge proportion of high expression outlier genes (default=0.015);
mean_hge the inflation parameter to increase s for the high expression outlier genes;
randseed random seed to reproduce the results;

Parameters for function True2ObservedCounts( ) are:
true_counts true transcript counts from function SimulateTrueCounts;
meta_cell cell identity from function SimulateTrueCounts;
nbatch number of batches the cells are sequenced on;
protocol protocol for library preparation, can be “nonUMI” (without UMIs) or “10x” (with UMIs);
alpha_mean mean capture effeciency of all cells;
alpha_sd standard deviation of capture efficiency of all cells;
lenslope controls the amount of gene length bias;
nbins number of bins to simulate gene length bias;
gene_len gene lengths;
amp_bias_limit amount of amplification bias;
rate_2PCR PCR efficiency during amplification;
nPCR1 number of PCR cycles in the pre-amplification step;
nPCR2 number of PCR cycles in the second amplification step for fragments;
LinearAmp if linear amplification should be used instead of PCR amplification for the pre-amplification step. Default is FALSE;
LinearAmp_coef the number by which the number of transcript is multiplied if linear amplification is used;
depth_mean mean sequencing depth of all cells;
depth_sd standard deviation of sequencing depth of all cells;