This package allows for the differential expression analysis of scRNA-seq and other NGS count data. It employs a novel family of discrete distributions for describing the difference of two NB distributions (named DOTNB). DEGage takes the raw counts of scRNA-seq as inputs, and thus avoids introducing artificially bias in normalization steps in current methods. A workflow is shown as follows.

1 Setup Instructions

1.1 Installation

To install DEGage on your local machine, ensure devtools is installed and run the following commands

library(devtools)
install_github("chenyongrowan/DEGage")

1.2 Dependencies

DEGage require the following packages to be installed and up to date. The installation will be performed automatically if these packages are not already installed.

MASS >= 7.3-59
pscl >= 1.5.5
hypergeo >= 1.2-13
stats >= 4.2.2
doParallel >= 1.0.17
parallel >= 4.3.0
foreach >=1.5.2

A recent MASS update made it incompatible with R versions below 4.4. To install a legacy version of MASS, go to https://cran.r-project.org/src/contrib/Archive/MASS/ and download an archived version that is compatible with your current R installation. Then, run the following in an R terminal:

install.pacakges("path/to/MASS_SOURCE.tar.gz", repos=NULL, type="source")

2 Preprocessing

Although DEGage does not require counts to be normalized, it’s recommended to filter out low quality cells before use. To accomplish this, we have provided an optional function that filters out low quality cells and poorly expressed genes before hand. Note that the default parameters should be adjusted depending on the sample sizes and sequencing depths of the dataset being used.

DEGage_preprocess <- function(counts,
                              min.transcripts.per.gene = 200,
                              max.transcripts.per.gene = 10000,
                              min.transcripts.per.cell = 500,
                              max.transcripts.per.cell = 20000,
                              min.cells.per.gene = 3,
                              max.dropout.prop = 0.8,
                              max.mt.percent = 0.2,
                              mt.prefix = 'MT-',
                              max.rRna.percent = 0.2,
                              rRna.prefix = 'rr')

2.0.1 Parameters

  • counts: A dataframe containing counts
  • min.transcripts.per.gene: The minimum number of transcripts a gene can have
  • max.transcripts.per.gene: The maximum number of transcripts a gene can have
  • min.transcripts.per.cell: The the minimum number of transcripts required for a cell to be retained
  • max.transcripts.per.cell: The maximum number of transcripts required for a cell to be retained
  • min.cells.per.gene: The number of cells a gene must be present in to be retained
  • max.dropout.prop: Removes genes with zero counts present in a number of cells greater than the specified proportion
  • max.mt.percent: The maximum proportion of transcripts from mitochondrial genes that can be present in a cell
  • mt.prefix: The prefix for mitochondrial genes
  • max.rRna.precent: The maximum proportion of transcripts from ribosomal RNAs that can be present in a cell
  • rRna.prefix: The prefix for ribosomal genes

2.0.2 Output

A filtered dataframe of counts

3 Differential Expression Functions

3.1 DEGage

DEGage performs pairwise differential expression analysis on scRNA-seq data. The input is typically a dataframe where columns contain samples and rows contain genes. Counts do not need to be normalized prior to use with DEGage, however covariants and batch effects should be accounted for prior to its use.

DEGage(counts,
         group,
         perm.preprocess = TRUE,
         gene.filter.threshold = 0.9,
         nperms = 2000,
         nsubsample = NA,
         perm.pval = 0.1,
         ncores = 4,
         maxiter = 1000,
         mean.ratio = 4,
         subsampled.k = T)

3.1.1 Parameters

  • counts: An object or path to directory containing read counts. The following inputs are accepted:
    • A dataframe where columns are samples and rows are genes
    • A SeuratObject with an assay titled “RNA” containing counts
    • A SingleCellExperiment containing an assay titled “counts”
  • group: A factor of a numeric vector that indicates which samples belong to which pairwise condition. An example is shown below under the Example Usage section
  • perm.preprocess: A boolean indicating whether a permutation test is used to pre-filter genes
  • gene.filter.threshold: A value between 0-1 indicating the maximum proportion of zero counts a gene can have before being filtered out. A value of 1 means only genes with all zero counts are filtered out, where as a value of 0.5 would filter out genes where half the counts are zeros
  • nsubsample: The number of cells to sub-sample for each group. If left as NA, the default is 125+0.05n, where n is the number of samples in the data set.
  • nperms: The number of permutations performed during the permutation test
  • perm.pval: A p value used by the permutation test to pre-filter genes
  • ncores: The number of cores to use for parallel computing
  • maxiter: The maxiumum number of iterations to perform while calculating the cdf
  • mean.ratio: In cases of hypergeomtric non-convergence, considers the minimum ratio between the count means for each group for a gene with an incalculable p-value to be considered differentially expressed
  • subsampled.k: If true the subsampling procedure is used to estimate k. If false, the random assignment procedure is used.

3.1.2 Output

A dataframe containing the following information about each gene is output:
- r1: The r parameter for the NB distribution fit to the first condition of counts
- p1: The p parameter for the NB distribution fit to the first condition of counts
- mu1: The mean of the counts for the first condition
- r2: The r parameter for the NB distribution fit to the second condition of counts
- p2: The p parameter for the NB distribution fit to the second condition of counts
- mu2: The mean of the counts for the second condition
- basemean : The mean of all counts pooled across two conditions
- z.r1: The r parameter for the NB distribution fit to the first condition of counts with zero inflated NB regression
- z.p1: The p parameter for the NB distribution fit to the first condition of counts with zero inflated NB regression
- z.r2: The r parameter for the NB distribution fit to the second condition of counts with zero inflated NB regression with zero inflated NB regression
- z.p2: The p parameter for the NB distribution fit to the second condition of counts with zero inflated NB regression
- k: Observed difference between the two NB distributions
- permPvals: p-value generated from a pre-filtering permutation test. It is recommended to filter genes according to a permutation p-value of 0.1 to minimize false positive rates.
- pval: p-values generated according to DOTNB
- FDR: FDR adjusted p-values

3.2 DEGage_multitest

DEGage_multitest is similar to DEGage, except it can perform an indefinite number of pairwise comparisons.

DEGage_multitest(counts,
                   group,
                   perm.preprocess = TRUE,
                   gene.filter.threshold = 0.9,
                   nperms = 2000,
                   nsubsample = NA,
                   perm.pval = 0.1,
                   ncores = 4,
                   maxiter = 100,
                   mean.ratio = 4,
                   subsampled.k = T,
                   writing.dir = NULL)

3.2.1 Parameters

  • counts: An object or path to directory containing read counts. The following inputs are accepted:
    • A dataframe where columns are samples and rows are genes
    • A SeuratObject with an assay titled “RNA” containing counts
    • A SingleCellExperiment containing an assay titled “counts”
    • A path to an mtx directory
  • group: A factor of a numeric vector that indicates which samples belong to which pairwise condition. An example is shown below under the Example Usage section
  • perm.preprocess: A boolean indicating whether a permutation test is used to pre-filter genes
  • gene.filter.threshold: A value between 0-1 indicating the maximum proportion of zero counts a gene can have before being filtered out. A value of 1 means only genes with all zero counts are filtered out, where as a value of 0.5 would filter out genes where half the counts are zeros
  • nsubsample: The number of cells to sub-sample for each group. If left as NA, the default is 125+0.05n, where n is the number of samples in the data set.
  • nperms: The number of permutations performed during the permutation test
  • perm.pval: A p value used by the permutation test to pre-filter genes
  • ncores: The number of cores to use for parallel computing
  • maxiter: The maxiumum number of iterations to perform while calculating the cdf
  • mean.ratio: In cases of hypergeomtric non-convergence, considers the minimum ratio between the count means for each group for a gene with an incalculable p-value to be considered differentially expressed
  • subsampled.k: If true the subsampling procedure is used to estimate k. If false, the random assignment procedure is used.
  • writing.dir: A directory where the results of each comparison are to be written to. It is highly recommended to provide a directory for large numbers of comparisons.

3.2.2 Output

If writing.dir is NULL, a list of Dataframes structured in the same way as the output of DEGage() that each contain a single comparison are returned. If writing.dir is not null and is a valid directory, nothing is output, and the results can be retrieved with list.files(“directory”)

4 Count Simulation

DEGage_Simulation generates very simplified simulated scRNA-Seq counts following an NB distribution with pre-defined proportions of dropouts.

DEGage_Simulation(ngenes, 
                  ndegs, 
                  ngroup1, 
                  ngroup2, 
                  lfc = 1, 
                  min.prop.zeros = 0.1, 
                  max.prop.zeros = 0.5, 
                  seed = NULL, 
                  ncores = 4, 
                  dispersions = NULL, 
                  means = NULL)

4.0.1 Input

  • ngenes: The number of genes to be simulated total
  • ndegs: The number of genes to be differentially expressed
  • ngroup1: The number of cells in the first group
  • ngroup2: The number of cells in the second group
  • lfc: Either an integer or vector with a length of ndegs which dictates log fold change value to adjust means for differentially expressed genes
  • min.prop.zeros: The minimum proportion of dropouts that are to be introduced for a gene
  • max.prop.zeros: The maximum proportion of dropouts that are to be introduced for a gene
  • seed: Sets a seed for random generation. If user does not input one, a random seed is generated
  • ncores: The number of cores to allocated for parallel computing
  • dispersions: An optional integer or vector the length of ngenes+ndegs that sets genewise dispersion values
  • means: An optional integer or vector the length of ngenes+ndegs that sets the means for genewise distributions

4.0.2 Output

A dataframe where cells are represented by columns and genes are represented by rows. Differentially expressed genes have the prefix “DE”, while equivalently expressed genes have the prefix “EE”. The group each cell belongs to is noted within the column names.

5 Example Usage

In this section, we will detail how to use DEGage functions

First, we will simulate a small data frame to pass through DEGage() using DEGage simulate:

library(DEGage)
df <- DEGage_Simulation(ngenes = 1000, ndegs = 100, ngroup1 = 50, ngroup2 = 50)

Next, we will pass this through DEGage:

cellgroups <- factor(c(rep(1,50), rep(2,50)))
results <- DEGage(counts = df, group = cellgroups)

To test DEGage_multitest, we will simulate a second dataframe of counts, merge them together, and pass them through DEGage_multitest():

df2 <- DEGage_Simulation(ngenes = 1000, ndegs = 100, ngroup1 = 50, ngroup2 = 50)

df <- cbind(df, df2)
cellgroups2 <- factor(c(rep(3,50), rep(4,50)))
cellgroups <- factor(c(cellgroups, cellgroups2))

multitest.results <- DEGage_multitest(df, cellgroups)