This package allows for the differential expression analysis of scRNA-seq and other NGS count data. It employs a novel family of discrete distributions for describing the difference of two NB distributions (named DOTNB). DEGage takes the raw counts of scRNA-seq as inputs, and thus avoids introducing artificially bias in normalization steps in current methods. A workflow is shown as follows.
To install DEGage on your local machine, ensure devtools is installed and run the following commands
library(devtools)
install_github("chenyongrowan/DEGage")
DEGage require the following packages to be installed and up to date. The installation will be performed automatically if these packages are not already installed.
MASS >= 7.3-59
pscl >= 1.5.5
hypergeo >= 1.2-13
stats >= 4.2.2
doParallel >= 1.0.17
parallel >= 4.3.0
foreach >=1.5.2
A recent MASS update made it incompatible with R versions below 4.4. To install a legacy version of MASS, go to https://cran.r-project.org/src/contrib/Archive/MASS/ and download an archived version that is compatible with your current R installation. Then, run the following in an R terminal:
install.pacakges("path/to/MASS_SOURCE.tar.gz", repos=NULL, type="source")
Although DEGage does not require counts to be normalized, it’s recommended to filter out low quality cells before use. To accomplish this, we have provided an optional function that filters out low quality cells and poorly expressed genes before hand. Note that the default parameters should be adjusted depending on the sample sizes and sequencing depths of the dataset being used.
DEGage_preprocess <- function(counts,
min.transcripts.per.gene = 200,
max.transcripts.per.gene = 10000,
min.transcripts.per.cell = 500,
max.transcripts.per.cell = 20000,
min.cells.per.gene = 3,
max.dropout.prop = 0.8,
max.mt.percent = 0.2,
mt.prefix = 'MT-',
max.rRna.percent = 0.2,
rRna.prefix = 'rr')
counts: A dataframe containing countsmin.transcripts.per.gene: The minimum number of
transcripts a gene can havemax.transcripts.per.gene: The maximum number of
transcripts a gene can havemin.transcripts.per.cell: The the minimum number of
transcripts required for a cell to be retainedmax.transcripts.per.cell: The maximum number of
transcripts required for a cell to be retainedmin.cells.per.gene: The number of cells a gene must be
present in to be retainedmax.dropout.prop: Removes genes with zero counts
present in a number of cells greater than the specified proportionmax.mt.percent: The maximum proportion of transcripts
from mitochondrial genes that can be present in a cellmt.prefix: The prefix for mitochondrial genesmax.rRna.precent: The maximum proportion of transcripts
from ribosomal RNAs that can be present in a cellrRna.prefix: The prefix for ribosomal genesA filtered dataframe of counts
DEGage performs pairwise differential expression analysis on scRNA-seq data. The input is typically a dataframe where columns contain samples and rows contain genes. Counts do not need to be normalized prior to use with DEGage, however covariants and batch effects should be accounted for prior to its use.
DEGage(counts,
group,
perm.preprocess = TRUE,
gene.filter.threshold = 0.9,
nperms = 2000,
nsubsample = NA,
perm.pval = 0.1,
ncores = 4,
maxiter = 1000,
mean.ratio = 4,
subsampled.k = T)
counts: An object or path to directory containing read
counts. The following inputs are accepted:
group: A factor of a numeric vector that indicates
which samples belong to which pairwise condition. An example is shown
below under the Example Usage sectionperm.preprocess: A boolean indicating whether a
permutation test is used to pre-filter genesgene.filter.threshold: A value between 0-1 indicating
the maximum proportion of zero counts a gene can have before being
filtered out. A value of 1 means only genes with all zero counts are
filtered out, where as a value of 0.5 would filter out genes where half
the counts are zerosnsubsample: The number of cells to sub-sample for each
group. If left as NA, the default is 125+0.05n, where n is the number of
samples in the data set.nperms: The number of permutations performed during the
permutation testperm.pval: A p value used by the permutation test to
pre-filter genesncores: The number of cores to use for parallel
computingmaxiter: The maxiumum number of iterations to perform
while calculating the cdfmean.ratio: In cases of hypergeomtric non-convergence,
considers the minimum ratio between the count means for each group for a
gene with an incalculable p-value to be considered differentially
expressedsubsampled.k: If true the subsampling procedure is used
to estimate k. If false, the random assignment procedure is used.A dataframe containing the following information about each gene is
output:
- r1: The r parameter for the NB distribution fit to the
first condition of counts
- p1: The p parameter for the NB distribution fit to the
first condition of counts
- mu1: The mean of the counts for the first condition
- r2: The r parameter for the NB distribution fit to the
second condition of counts
- p2: The p parameter for the NB distribution fit to the
second condition of counts
- mu2: The mean of the counts for the second
condition
- basemean : The mean of all counts pooled across two
conditions
- z.r1: The r parameter for the NB distribution fit to the
first condition of counts with zero inflated NB regression
- z.p1: The p parameter for the NB distribution fit to the
first condition of counts with zero inflated NB regression
- z.r2: The r parameter for the NB distribution fit to the
second condition of counts with zero inflated NB regression with zero
inflated NB regression
- z.p2: The p parameter for the NB distribution fit to the
second condition of counts with zero inflated NB regression
- k: Observed difference between the two NB
distributions
- permPvals: p-value generated from a pre-filtering
permutation test. It is recommended to filter genes according to a
permutation p-value of 0.1 to minimize false positive rates.
- pval: p-values generated according to DOTNB
- FDR: FDR adjusted p-values
DEGage_multitest is similar to DEGage, except it can perform an indefinite number of pairwise comparisons.
DEGage_multitest(counts,
group,
perm.preprocess = TRUE,
gene.filter.threshold = 0.9,
nperms = 2000,
nsubsample = NA,
perm.pval = 0.1,
ncores = 4,
maxiter = 100,
mean.ratio = 4,
subsampled.k = T,
writing.dir = NULL)
counts: An object or path to directory containing read
counts. The following inputs are accepted:
group: A factor of a numeric vector that indicates
which samples belong to which pairwise condition. An example is shown
below under the Example Usage sectionperm.preprocess: A boolean indicating whether a
permutation test is used to pre-filter genesgene.filter.threshold: A value between 0-1 indicating
the maximum proportion of zero counts a gene can have before being
filtered out. A value of 1 means only genes with all zero counts are
filtered out, where as a value of 0.5 would filter out genes where half
the counts are zerosnsubsample: The number of cells to sub-sample for each
group. If left as NA, the default is 125+0.05n, where n is the number of
samples in the data set.nperms: The number of permutations performed during the
permutation testperm.pval: A p value used by the permutation test to
pre-filter genesncores: The number of cores to use for parallel
computingmaxiter: The maxiumum number of iterations to perform
while calculating the cdfmean.ratio: In cases of hypergeomtric non-convergence,
considers the minimum ratio between the count means for each group for a
gene with an incalculable p-value to be considered differentially
expressedsubsampled.k: If true the subsampling procedure is used
to estimate k. If false, the random assignment procedure is used.writing.dir: A directory where the results of each
comparison are to be written to. It is highly recommended to provide a
directory for large numbers of comparisons.If writing.dir is NULL, a list of Dataframes structured in the same way as the output of DEGage() that each contain a single comparison are returned. If writing.dir is not null and is a valid directory, nothing is output, and the results can be retrieved with list.files(“directory”)
DEGage_Simulation generates very simplified simulated scRNA-Seq counts following an NB distribution with pre-defined proportions of dropouts.
DEGage_Simulation(ngenes,
ndegs,
ngroup1,
ngroup2,
lfc = 1,
min.prop.zeros = 0.1,
max.prop.zeros = 0.5,
seed = NULL,
ncores = 4,
dispersions = NULL,
means = NULL)
ngenes: The number of genes to be simulated totalndegs: The number of genes to be differentially
expressedngroup1: The number of cells in the first groupngroup2: The number of cells in the second grouplfc: Either an integer or vector with a length of ndegs
which dictates log fold change value to adjust means for differentially
expressed genesmin.prop.zeros: The minimum proportion of dropouts that
are to be introduced for a genemax.prop.zeros: The maximum proportion of dropouts that
are to be introduced for a geneseed: Sets a seed for random generation. If user does
not input one, a random seed is generatedncores: The number of cores to allocated for parallel
computingdispersions: An optional integer or vector the length
of ngenes+ndegs that sets genewise dispersion valuesmeans: An optional integer or vector the length of
ngenes+ndegs that sets the means for genewise distributionsA dataframe where cells are represented by columns and genes are represented by rows. Differentially expressed genes have the prefix “DE”, while equivalently expressed genes have the prefix “EE”. The group each cell belongs to is noted within the column names.
In this section, we will detail how to use DEGage functions
First, we will simulate a small data frame to pass through DEGage() using DEGage simulate:
library(DEGage)
df <- DEGage_Simulation(ngenes = 1000, ndegs = 100, ngroup1 = 50, ngroup2 = 50)
Next, we will pass this through DEGage:
cellgroups <- factor(c(rep(1,50), rep(2,50)))
results <- DEGage(counts = df, group = cellgroups)
To test DEGage_multitest, we will simulate a second dataframe of counts, merge them together, and pass them through DEGage_multitest():
df2 <- DEGage_Simulation(ngenes = 1000, ndegs = 100, ngroup1 = 50, ngroup2 = 50)
df <- cbind(df, df2)
cellgroups2 <- factor(c(rep(3,50), rep(4,50)))
cellgroups <- factor(c(cellgroups, cellgroups2))
multitest.results <- DEGage_multitest(df, cellgroups)