SymSim is an R package made to simulate single cell RNA-seq data. It can be used to generate a single population of cells with similar statistical properties to real data, or to generate multiple discrete or continuous populations of cells, where users can input a tree to represent relationships between multiple populations. SymSim has the following applications:
Each cell has an EVF vector which defines the identify of the cell. Each gene has an gene effect vector of the same length as the EVF vector, and can be thought of as the weights of EVFs. The EVFs allow us to simulate discrete or continuous populations. We use the kinetic model to sample the true transcript count for each gene in each cell, and the parameters of the kinetic model (k_on, k_off, s) are calculated from external variability factors (EVFs) and gene-specific effects (or gene effect). The gene effect vectors allow us to simulate differentially expressed genes or co-expressed gene modules. From true transcript counts to observed counts (which can be read counts or UMI counts) we consider capture efficiency, amplification bias, sequencing depth and batch effects. At different steps, users have control on the amount of extrinsic variation, intrinsic variation and technical variation.
Main functions
SimulateTrueCounts( ) and True2ObservedCounts( ) are the two major functions to generate datasets. SimulateTrueCounts generates true transcript counts for the given number of genes and cells, where the cells can come from one single population, or multiple discrete populations, or continuous populations. True2ObservedCounts then simulates the library preparation and sequencing procedures, and convert the true transcript counts into observed read counts or UMI counts.
The input parameters of the package allow users to control intrinsic variation, technical variation and biological variation in the data. Intrinsic variation is modeled through the kinetic model; technical variation takes into account dropouts, amplification biases including length bias and GC bias, and batch effects; biological variation is modeled by EVFs (extrinsic variation factors).
SimulateTrueCounts( ) returns a list of four elements:
1 a matrix containing the true transcript counts;
2 gene level meta information;
3 cell level meta information, including a matrix of EVFs and a vector of cell identity (for example, the population the cell belongs to);
4 the parameters kon, koff and s used to simulation the true counts.
True2ObservedCounts( ) returns a list of two elements:
1 a matrix containing the observed read counts or UMI counts;
2 cell level meta information;
DivideBatches( ) takes the output of True2ObservedCounts( ) as input and split the data into multiple batches while adding batch effect for each batch. The batch information of each cell is in the output cell level meta information.
We provide functions for users to retrieve goldstandard information to benchmark computational methods for clustering, differential expression and trajectory inference. The “true” cluster information for each cell can be simply obtained from the output of function SimulateTrueCounts( ). Function getDEgenes( ) can be used to obtain differential expression information for genes, given two populations. Function getTrajectoryGenes( ) returns the simulated trajectory information of cells.
For a single population, BestMatchParams( ) returns parameters to generate datasets with similar statistical properties as a given input real dataset.
Description of input parameters for these functions can be found at the end of this document.
Simulate one population
First, we simulate the case where there is one population.
## Loading required package: plyr
## Loading required package: ggplot2
## Loading required package: Rtsne
## Loading required package: grid
## Loading required package: reshape
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:plyr':
##
## rename, round_any
## Loading required package: RColorBrewer
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: ape
## Loading required package: MASS
## Loading required package: stringi
## Loading required package: roxygen2
## Loading required package: phytools
## Loading required package: maps
##
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
##
## ozone
## Loading required package: repr
## Loading required package: IRanges
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:reshape':
##
## expand, rename
## The following object is masked from 'package:plyr':
##
## rename
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:plyr':
##
## desc
## Loading required package: SummarizedExperiment
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following object is masked from 'package:plyr':
##
## count
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply, rowsum
ngenes <- 500
true_counts_res <- SimulateTrueCounts(ncells_total=300, ngenes=ngenes, evf_type="one.population", Sigma=0.4, randseed=0)
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="one.population", n_pc=20, label='pop', saving = F, plotname="one.population")
tsne_true_counts[[2]]

For the true counts computed above, taking them through the library preparation and sequencing procedures will yield read counts (if protocol=“nonUMI”) or UMI counts (if protocol=“UMI”). Each genes needs to be assigned a gene length. We sample lengths from human transcript lengths.
data(gene_len_pool)
gene_len <- sample(gene_len_pool, ngenes, replace = FALSE)
observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="nonUMI", alpha_mean=0.1, alpha_sd=0.05, gene_len=gene_len, depth_mean=1e5, depth_sd=3e3)
One can plot the mean-variance relationship in the observed read counts:
plot(log2(rowMeans(observed_counts[[1]])+1), log2(apply(observed_counts[[1]],1,cv)), col=adjustcolor("blue", alpha.f = 0.5), pch=19, xlab="log2(mean+1)", ylab="log2(CV)")

Simulate multiple discrete populations
When there are multiple populations, users need to provide a tree. A tree with five leaves (five populations) can be generated as follows:
or read from a file with the tree in Newick format:
phyla2 <- read.tree(system.file("extdata", "Newick_ABCDE.txt", package = "SymSim"))
par(mfrow=c(1,2))
plot(phyla1)
plot(phyla2)

The true counts of the five populations can be simulated:
true_counts_res <- SimulateTrueCounts(ncells_total=300, min_popsize=50, i_minpop=2, ngenes=ngenes, nevf=10, evf_type="discrete", n_de_evf=9, vary="s", Sigma=0.4, phyla=Phyla5(), randseed=0)
true_counts_res_dis <- true_counts_res
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="discrete populations (true counts)")
tsne_true_counts[[2]]

observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="nonUMI", alpha_mean=0.1, alpha_sd=0.05, gene_len=gene_len, depth_mean=1e5, depth_sd=3e3)
tsne_nonUMI_counts <- PlotTsne(meta=observed_counts[[2]], data=log2(observed_counts[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="observed counts nonUMI")
tsne_nonUMI_counts[[2]]

observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="UMI", alpha_mean=0.05, alpha_sd=0.02, gene_len=gene_len, depth_mean=5e4, depth_sd=3e3)
tsne_UMI_counts <- PlotTsne(meta=observed_counts[[2]], data=log2(observed_counts[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="observed counts UMI")
tsne_UMI_counts[[2]]
