scRICA: an R package for multiple-sample single-cell RNA-Seq Integrative Comparative Analysis

Yan Li

2023-04-21

1. What is scRICA

It is a systematic workflow R package for integrative and comparative scRNA-seq analysis. Various parameter options are inputted from a metadata table and then inherited through the entire analysis workflow. scRICA can significantly improves efficiency of programming for comparative analyses with multiple sample attributes. It is distributed as an R package and also offers a command line execution option.

scRICA categorizes the workflow of integrative and comparative analysis using multiple-sample scRNA-seq data into four steps as shown below picture:

2. pacakge installation

2.1 Prerequisite packages installation

If below prerequisite packages have not been installed, you can install them via below commands:

BiocManager::install("MAST")
BiocManager::install("scater")
BiocManager::install("destiny")
install.packages("mclust")
install.packages("easylabel")
install.packages("corrplot")
devtools::install_github('EDePasquale/DoubletDecon')
devtools::install_github("ChenMengjie/lightHippo")
BiocManager::install("slingshot")
BiocManager::install("DESeq2")
BiocManager::install("SingleCellExperiment")
BiocManager::install("tradeSeq")

2.2 scRICA installation

3. Demonstration data

3.1 Demonstration data for multi-sample pre-processing and integration

As shown in below, a total of 6 samples single cell gene expression count matrix results from cell ranger analysis can be downloaded from https://drive.google.com/drive/folders/1wwxmi4cvASRDFtiFKs6MbflFDOOxw3AY?usp=sharing using as the demonstration data for this package’s step 1 and step 2 analysis implementations.

-- 3041A
   |__barcodes.tsv
   |__genes.tsv
   |__matrix.mtx
-- 3041F
   |__barcodes.tsv
   |__genes.tsv
   |__matrix.mtx
-- 3041I
   |__barcodes.tsv
   |__genes.tsv
   |__matrix.mtx
-- 3396A
   |__barcodes.tsv.gz
   |__features.tsv.gz
   |__matrix.mtx.gz
-- 3396F
   |__barcodes.tsv.gz
   |__features.tsv.gz
   |__matrix.mtx.gz
-- 3396I
   |__barcodes.tsv.gz
   |__features.tsv.gz
   |__matrix.mtx.gz

3.2 Demonstration data example of an integrated RDS

This RDS file can be downloaded from https://drive.google.com/drive/folders/1wwxmi4cvASRDFtiFKs6MbflFDOOxw3AY?usp=sharing, which can be used as demonstration data for scRICA step 3 visualization and step 4 downstream analysis implementations.

This RDS includes an integration study from our human cell atlas study of Fallopian Tubes, where a total of 18 scRNA-seq sequencing samples from 8 donors fallopian anatomical sites, including isthmus (I), ampule (A) and fambiriae (F). As shown in the table head, these experimental attributes, including donor and anatomical sites, were defined in the original metadata table column ’’ and ’’ which has been inherited into the integrated RDS object column ’’ and ’’ respectively.

4. Input metadata table

4.1 metadata table specification

4.2 metadata table examples

sample path expCond1 expCond2 doubletsRmMethod
3041A /FullPath/to/CountMatrix/ A patient1 OL
3041F /FullPath/to/CountMatrix/ F patient1 OL
3041I /FullPath/to/CountMatrix/ I patient1 OL
3396A /FullPath/to/CountMatrix/ A patient2 OL
3396F /FullPath/to/CountMatrix/ F patient2 OL
3396I /FullPath/to/CountMatrix/ I patient2 OL

Where you can see that ‘OL’ doublets removal method is specified in column doubletsRmMethod for all samples, you can also specify different methods in that column for each sample, because doublet removal is conducted on samples separately.

sample path
3041A /FullPath/to/CountMatrix/
3041F /FullPath/to/CountMatrix/
3041I /FullPath/to/CountMatrix/
3396A /FullPath/to/CountMatrix/
3396F /FullPath/to/CountMatrix/
3396I /FullPath/to/CountMatrix/
sample path expCond1 expCond2 doubletsRmMethod doubletsResDir
3041A /FullPath/to/CountMatrix/ A patient1 OL /FullPath/to/DoubletDeconResults/
3041F /FullPath/to/CountMatrix/ F patient1 OL /FullPath/to/DoubletDeconResults/
3041I /FullPath/to/CountMatrix/ I patient1 OL /FullPath/to/DoubletDeconResults/
3396A /FullPath/to/CountMatrix/ A patient2 OL /FullPath/to/DoubletDeconResults/
3396F /FullPath/to/CountMatrix/ F patient2 OL /FullPath/to/DoubletDeconResults/
3396I /FullPath/to/CountMatrix/ I patient2 OL /FullPath/to/DoubletDeconResults/

5. Analysis Workflow implementation

5.1: ‘Step 1’ data pre-processing and quality control(QC)

5.2: ‘Step 2’ Data integration

5.3: Commannd line execution option for Step 1 and Step 2 analysis

Rscript run_scRICA_qc_integration.R -m PATH/TO/METADATA_TABLE -o scRICA_results -g human --minCells 3 --minFeatures 100 --mtFiltering True --mtPerCutoff 20 --multiomics False --extraFilter False --integration.method CCA --integration.nfeatures 500 > $PWD/run_scRICA.log 2>&1

5.4: ‘Step 3’ integrated results visualization

5.4.1 Cellular compositions summary and visualization

  1. These can be done via getClusterSummaryReplot(). In our demonstrated data named as ‘scRICA_demo.rds’, where ‘expCond1’ specify the patient donor IDs, and ‘expCond2’ specify fallopian tube anatomical sites (I/A/F). If you would like to update the cells annotation, just run on newAnnotation=T, and provide the corresponding annotation file with the option newAnnotationRscriptName=/full/path/to/file, an example of new annotation file named as anno_example.R can be found from ‘misc’ folder, where all cell types need to be updated can be included.

    An example code to summarize the cellular composition based on samples attribute specified in ‘expCond1’ with respect to different donors is shown below:

rds1 <- readRDS('~/Desktop/757_scRNA-seq/RDS/scRNAseq_FT_finalized_wAnno.rds')
library(scRICA)
## seperation on donor: expCond1
getClusterSummaryReplot(rds = rds1, resDir = 'scRICA_results', newAnnotation = F, expCondCheck = 'expCond1', expCondCheckFname = 'donors_summary')
## seperation on fallopian tissue types (I/A/F): expCond2 togher with new annotation
getClusterSummaryReplot(rds = rds1, resDir = 'scRICA_results', newAnnotation = T, newAnnotationRscriptName = 'anno_example.R', expCondCheck = 'expCond1', expCondCheckFname = 'donors_summary')
  1. Depending on whether newAnnotation is True or False, all related results will be saved into scRICA_results/results_wOrgClusterAnnotation/expCond_donors_summary or scRICA_results/results_wOrgClusterAnnotation/expCond_donors_summary respectively. An example of data structure of result with newAnnotation=F can be seen as:

    -- scRICA_results
      |__results_wOrgClusterAnnotation
    |expCond_donors_summary
       |__cellNo_summary_orgClusterAnnotation_expCond_donors_summary.pdf
       |__cellNo_summary_orgClusterAnnotation_expCond_donors_summary.txt
       |__new_tSNE_plot_expCond_donors_summary
          |__tsne_plot_noLabel_integrate_orgAnnotation.pdf
          |__tsne_plot_wLabel_integrate_orgAnnotation.pdf
          |__tsne_plot_wLabel_orgAnnotation_expCond_donors_summary.pdf
       |__new_UMAP_plot_expCond_donors_summary
          |__UMAP_plot_noLabel_integrate_orgAnnotation.pdf
          |__UMAP_plot_wLabel_integrate_orgAnnotation.pdf
          |__UMAP_plot_wLabel_orgAnnotation_expCond_donors_summary.pdf
  • where ‘cellNo_summary_orgClusterAnnotation_expCond_donors_summary.pdf’: is a cellular composition plot separated by the specified sample’s attribute by the option expCondCheck = 'expCond1'.
  • ‘cellNo_summary_orgClusterAnnotation_expCond_donors_summary.txt’: provides a summary of corresponding cell number and percentage of cells normalized by the total number of cells from each attribute.
  • cellular composition plot separated by patient donors (expCondCheck = 'expCond1') is:
  • cellular composition plot separated by fallopian tube anatomical sites (expCondCheck = 'expCond2') with new cell annotation specified in newAnnotation = T, newAnnotationRscriptName = 'anno_example.R' is:
  • UMAP plot separated into 3 different anatomical sites (I/A/F) together with new cell annotations is:

  1. There are several options which can be used to alter the cellular composition percentage bar-plot, such as

    • stack =T: make stacked or un-stacked cellular composition bar-plot by turning on/off this option

    • perPlotVertical = F, either in the vertical or horizontal position for this cellular composition bar-plot

    • fpRemake = F, perRemake = T: only make cellular composition bar-plot by turn on perRemake without making corresponding UMAP/tSNE plots (perPlotVertical = F)

    • perPlotVertical = F, perPlotHeight = 3, perPlotWidth = 7, fpRemake = F, stack =T,

    • expCondNameOrder = c('I', 'A', 'F'): specify the order of attributes specified from expCondCheck = 'expCond2', where ‘expCond2’ has 3 levels, which can be re-ordered as ‘I->A->F’.

    • An example of un-stacked re-ordered cellular composition plot shown below can be made by:

    getClusterSummaryReplot(rds = rds1, resDir = 'misc/visualization_results', 
                       perRemake = T, fpRemake = F, stack =F,  
                       perPlotVertical = T, perPlotHeight = 3, perPlotWidth = 7, 
                       expCondNameOrder = c('I', 'A', 'F'),
                       newAnnotation = F, 
                       expCondCheck = 'expCond2', expCondCheckFname = 'tissue_sep1')
  1. Other options can be found from the help document via ?getClusterSummaryReplot
5.4.2 feature genes dot-plot visualization
  1. This plot can be made via function getGoiDotplot(). A code example to make such plot on a list of cell cycle genes list (seen in misc folder named as ‘marker_genes_CC.xlsx’), which all cell cycle genes are classified into 2 groups: ‘S phase’ and ‘G2/M phase’.

    getGoiDotplot(resDir = 'misc/visualization_results', rds = rds1,
            newAnnotation = F,
            goiFname = 'misc/marker_genes_CC.xlsx',
            expCondCheck = 'expCond2', expCondCheckFname = 'dotplot_tissue',
            dotPlotFnamePrefix = 'cc')
  2. scRICA provides options for users to select certain cell types cellcluster and/or experimental conditions based on the levels of chosen attribute in expCondCheck = 'expCond2' for this feature dot-plot visualization. For example, we can visualize only 3 T/NK cells (T/NK1, T/NK2, and T/NK2) from 2 fallopian tube tissues I and A via below command.

    getGoiDotplot(resDir = 'misc/visualization_results', rds = rds1,
            newAnnotation = F,
            goiFname = 'misc/marker_genes_CC.xlsx',
            expCondCheck = 'expCond2', expCondCheckFname = 'dotplot_tissue',
            geneTypeOrder = c('S phase', 'G2/M phase'),
            cellcluster = c('T/NK1', 'T/NK2', 'T/NK3'),
            expCond = c('I', 'A'),
            expCondReorderLevels = paste(rep(rev(c('T/NK1', 'T/NK2', 'T/NK3')), each=2), rep(rev(c('I', 'A')), time = 3), sep = '_'),
            dotPlotFnamePrefix = 'cc_TNKonly_IAonly',
            dotPlotMinExpCutoff = 0, dotPlotMaxExpCutoff = 4,
            dotPlotWidth = 26, dotPlotHeight = 6,
            legendPer = 0.08, genetypebarPer = 0.03,
            fontsize.x = 20, fontsize.y = 16,
            fontangle.x = 90, fontangle.y = 0,
            fontsize.legend1 = 10,
            geneTypeLegendOn = T, fontsize.legend2 = 20)
  3. As shown in above code, there are several options which can be used to adjust this dot-plot display, including x-/y-axis font size and angles (fontsize.x, fontangle.x etc.), whether to include gene type legend color bar (geneTypeLegendOn), the size of the dot plot (dotPlotWidth and dotPlotHeight), top legend bar height (genetypebarPer) etc., for detailed information please refer the help manual ?getGoiDotplot. And the finalized plot can be seen as this:

5.4.3 heatmap visualization
  1. This plot can be made via function getGoiHeatmap(). The heatmap can be visualized in 3 ways via option heatmap.view, including visualize on 1). the identified cell types by heatmap.view = 'idents'; 2) samples attributes levels by heatmap.view = 'expCond'; 3) across samples attributes levels with respect to different cell types by heatmap.view = 'expCond.idents'. Additionally, we also can make heatmap based on 1) cell levels with display = 'cell'; 2) aggregated expressions across various levels of samples attributes with display = 'expCond.merge'; 3) aggregated expressions for each sample defined in the metadata table with display = 'sample.merge'. For example, we can use below command to generate a heatmap of selected genes expression aggregated at the sample level across all cell types.

    getGoiHeatmap(heatmap.view = 'idents', resDir = 'misc/visualization_results', rds = rds1,
            newAnnotation = F, geneNames = c('CD44', 'ITGA6', 'EPCAM', 'KRT5', 'KRT7', 'TUBB4A', 'TUBB4B', 'PAX8'),
            expCondCheck = 'expCond2',
            expCondCheckFname = 'heatmap_tissue',
            plotFnamePrefix = 'geneset1_sampleMerge',
            display = 'sample.merge')

2. scRICA includes various options to change the plot size, font size etc. to make the figure with high resolution quality for publication, they can be found from the help manual via ?getGoiHeatmap. Additionally scRICA also allows users to select certain cell types and samples attributes for heatmap visualization, here is an example of the heatmap from fallopian tube I and A for the selected genes on each sample level.

getGoiHeatmap(heatmap.view = 'expCond.idents', resDir = 'misc/visualization_results', rds = rds1,
            newAnnotation = F, geneNames = c('CD44', 'ITGA6', 'EPCAM', 'KRT5', 'KRT7', 'TUBB4A', 'TUBB4B', 'PAX8'),
            expCondCheck = 'expCond2',
            expCondCheckFname = 'heatmap_tissue',
            plotFnamePrefix = 'geneset1_expCond_idents_merge_SECEonly_IAonly',
            expCondReorderLevels = c('I CE', 'A CE', 'I SE', 'A SE'),
            display = 'sample.merge', draw.lines = F,
            cellcluster = c('CE', 'SE'),
            expCond = c('I', 'A'),
            fontsize.top = 2, fontsize.y = 12, barHeight.top = 0.05,
            plotWidth = 5, plotHeight = 3, fontangle.top = 90, debug = T)

5.4.4 scatter plot visualization

5.5: ‘Step 4’ downatream analysis

5.5.1 Differential expression analysis between different samples attributes for cell types
  • DE analysis on cells

    scRICA allow users to conduct DE analysis on any 2 group of cells with different sample’s attributes defined in any specific metadata table column (expCond*). For example, the demonstration RDs data has 2 columns, expCond1 and expCond2, defining samples attributes of donors and fallopian tissue types respectively. If we would like to check DEGs between tissue types I and F for a specific cell type with option cellcluster or all cell types without specifying cellcluster. Once you are running on all cell clusters, all cell types DE results are saved into the excel file spreadsheet, which can cause memory deficiency for excel file storage. commands to run these 2 analysis can be seen as below:

    de.res1 <- getClusterExpCondDe(resDir = 'misc/', rds = rds1,
                               newAnnotation = F, 
                               cellcluster = 'MP', deMethod = 'wilcox',
                               expCondCheck = 'expCond2', expCondCheckFname = 'IFcomp_wilcox', 
                               compGroup = 'I/F')
    de.res2 <- getClusterExpCondDe(resDir = 'misc/', rds = rds1,
                               newAnnotation = F, deMethod = 'MAST',
                               outputExcel = as.logical(F),
                               expCondCheck = 'expCond2', expCondCheckFname = 'IFcomp_allCells_MAST',
                               compGroup = 'I/F')

    All analysis results will be saved into the folder specified by resDir= 'misc' with newAnnotation = F into misc/results_wOrgClusterAnnotation_DEGs or misc/results_wNewAnnotation_DEGs when newAnnotation = T. For each different analysis, results will be saved into specific folder named via option expCondCheckFname, as shown below, above 2 analysis are saved into folder misc/results_wNewAnnotation_DEGs/IFcomp_wilcox and misc/results_wNewAnnotation_DEGs/IFcomp_allCells_MAST respectively, including 1) full list of DEGs (differentially expressed genes) results; 2) significantly DEGs at the specified pAdjValCutoff level, by default it is 0.05; 3) significantly up- and down-regulated DEGs at the specified pAdjValCutoff level; 4) summary of number of DEGs; 5) top number of DEGs list. When outputExcel = F, full list of DE results and significantly expressed DEGs are saved into the spreadsheet of the same excel file, otherwise, they are saved into the sperate txt file with respect to each cell type.

    -- results_wOrgClusterAnnotation_DEGs
       |__IFcomp_allCells_MAST
      |__expCondCompDeMarkers_I-F_adjSig_allClusters_cluster*.txt
      |__expCondCompDeMarkers_I-F_adjSig_down_allClusters_cluster*.txt
      |__expCondCompDeMarkers_I-F_adjSig_up_allClusters_cluster*.txt
      |__expCondCompDeMarkers_I-F_full_allClusters_cluster*.txt
      |__expCondCompDeMarkers_I-F_NoDEmarkers_summary.xlsx
      |__expCondCompDeMarkers_I-F_top10_downDe.xlsx
      |__expCondCompDeMarkers_I-F_top10_upDe.xlsx
       |__IFcomp_wilcox
      |__expCondCompDeMarkers_I-F_adjSig_1SelClusters.xlsx
      |__expCondCompDeMarkers_I-F_adjSig_down_1SelClusters.xlsx
      |__expCondCompDeMarkers_I-F_adjSig_up_1SelClusters.xlsx
      |__expCondCompDeMarkers_I-F_full_1SelClusters.xlsx
      |__expCondCompDeMarkers_I-F_NoDEmarkers_summary.xlsx
      |__expCondCompDeMarkers_I-F_top10_downDe.xlsx
      |__expCondCompDeMarkers_I-F_top10_upDe.xlsx

    Above returned DE analysis results are listed item, where you can access all DE analysis results with below command with de.res1 as an example:

    head(de.res1$clusterDeMarkers$MP)
    de.res1$clusterDeResSummar
    head(de.res1$clusterTopDeMarkers$MP)
  • DE analysis on sample’s pseudo-bulk counts

    scRICA also provides DE analysis based on pseudo-bulk counts aggregated on samples defined in the initial metadata table, this information is inherited into the integrated RDS metadata table column expCond (rds1$expCond). For example, in our demonstration RDS file, there are totally 18 samples from different donor fallopian tube tissues, then we can use option deMethod = 'DESeq2.bulk' to conduct pseudo-bulk DE analysis, where counts from each sample are aggregrated and normalized based on the total number of cells. Additionally, users can also use an addtional metadata table to specify certain confounding covariates for model to correct those potential confounding effects for DE analysis with DESeq2. In this confounding factor defining metadata table, 1) column bulksamp should include all samples listed in rds1$expCond or initial metadata table for integration analysis; 2) other confounding factors to be corrected should be included in the columns of this table, such as we make an example confounding factor metadata table for our RDS demonstration data as below, here we would like to address potential confounding factors effects from gender and bmi.

    meta1 <- data.frame('bulksamp' = names(table(rds1$expCond)),
                    'gender' = c('F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F' ),
                    'bmi' = c(23.5, 23.5, 23.5, 18, 18, 28.4, 28.4, 28.4, 32, 32, 20, 20, 20, 30, 30, 30, 21, 21))

    With this confounding factor metadata table, we can use option deseq2bulk.metaCovariateInput to remove all potential confounding factor effects from pseudo-bulk DE analysis performed via DESeq2. Users also can use option norm.method to adjust normalization methods implemented across samples, including TMM, upper-quantile (UQ), and default DESeq2 normalization methods. An example of running command can be seen as below:

    de.res3 <- getClusterExpCondDe(resDir = 'misc/', rds = rds1,
                               newAnnotation = F, cellcluster = 'MP',
                               deMethod = 'DESeq2.bulk',
                               deseq2bulk.metaCovariateInput = meta1,
                               expCondCheck = 'expCond2', expCondCheckFname = 'IFcomp_DEseq2bulk',
                               compGroup = 'I/F')

    In addition to above analysis results, scRICA also return original and cell number normalized count table, which users can use below command to access them:

    head(de.res3$normCounts$MP)
    head(de.res3$orgCounts$MP)
5.5.2 Over expressed marker genes across comparison cell types with respect to different sample’s attributes levels.

scRICA allow users to check over expressed marker genes across defined cell types via function getExpCondClusterMarkers(). For example, users can check over-expressed genes among 5 selected immune cells, including T/NK1, T/NK2, T/NK3, MP, and B/P with respect to samples tissue types defined in expCond2 with below command:

oe.res1 <- getExpCondClusterMarkers(resDir = 'misc/', rds = rds1,
                                    deMethod = 'MAST',
                                    cellcluster = c('T/NK1', 'T/NK2', 'T/NK3', 'MP', 'B/P'),
                                    newAnnotation = F,
                                    expCondCheck = 'expCond2',
                                    expCondCheckFname = 'tissue_IR_overExpressedGenes')

Where, over-expressed marker genes are identified from each cell type in comparison with all other cell types with respect to each fallopian tube tissue type (I/A/F). Similarly, as other DE results, all analysis results are also saved into the folder based on defined options resDir, newAnnotation, and expCondCheckFname. In our example, results are saved into the foler named as misc/clusterMarkerGenes_results_wOrgClusterAnnotation with below data structure, where 1) full list of marker genes analysis results with respect to each fallopian tube tissue type are saved into txt file named as *_resFull.txt; 2) significant over-expressed genes list are also saved into different txt file named as *_UpSig.txt; 3) the number of over-expressed marker genes are also summarized into *_UpSig_NoSummary.txt; 4) top expressed marker genes heatmap is also made with respect to different levels of samples attributes as *_UpSig_heatmap.pdf.

-- clusterMarkerGenes_results_wOrgClusterAnnotation
   |__tissue_IR_overExpressedGenes
      |__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_resFull.txt
      |__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_UpSig_heatmap.pdf
      |__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_UpSig_NoSummary.txt
      |__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_UpSig.txt
      |__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_resFull.txt
      |__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_UpSig_heatmap.pdf
      |__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_UpSig_NoSummary.txt
      |__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_UpSig.txt
      |__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_resFull.txt
      |__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_UpSig_heatmap.pdf
      |__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_UpSig_NoSummary.txt
      |__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_UpSig.txt

Similarly, some analysis results can also be accessed from returned analysis object, including the full list of over-expressed genes and significantly over-express gene list, such as oe.res1 from our above analysis example.

head(oe.res1$fullRes$A)
head(oe.res1$sigUp$A)
table(oe.res1$sigUp$A$cluster)
5.5.3 HIPPO cells sub-clustering analysis

scRICA allow users to conduct sub-clustering HIPPO analysis for any specific cell types (cellcluster) with respect to cell from all samples or any specific sample’s attribute defined via expCondCheck and expCond together. This analyis can be done inside R with R function getHippoRes() or by command line exections Rscript run_scRICA_hippo.R [Options]....

  • R function exectution: ?getHippoRes()

    For example, you can sub-cluster 3 types of T/NK cells (T/NK1, T/NK2 and T/NK3) into 5 sub-clusters via noClusters=5 based on zero-inflation clustering method for all cells regardless of theirs samples attributes, the command is as below:

    getHippoRes(resDir = 'misc/', rds = rds1, newAnnotation = F,
            hippoResNamePrefix = 'Allcells_TNK',
            cellcluster = c('T/NK1', 'T/NK2', 'T/NK3'), noClusters = 4)

    Users also can conduct HIPPO sub-clustering analysis on cells defined from a specific sample’s attribute level, such as you can conduct HIPPO sub-clustering analysis on 2 groups of T/NK cells from fallopian tissue F only via below command:

    getHippoRes(resDir = 'misc/', rds = rds1, newAnnotation = F,
            expCondCheck = 'expCond2', expCond = 'F',
            hippoResNamePrefix = 'Fonly_TNK23',
            cellcluster = c('T/NK2', 'T/NK3'), noClusters = 5)

    If too many cells need to be performed, you also can turn on sparseMatrix = T to avoid memory deficiency. The HIPPO analysis results are saved into the defined folder named via option hippoResNamePrefix inside resDir direcotry sub-folder hippo_results, an example of above analysis results data structure can be seen as below:

    -- hippo_results
       |__Allcells_TNK
      |__Allcells_TNK_k4_clustersTree.pdf
      |__diagnosticPlot_cellCluster_Allcells_TNK_k4.pdf
      |__lightHIPPOres_Allcells_TNK_k4.Rdata
      |__Per_cellCluster_Allcells_TNK_k4.xlsx
      |__topFeatures_cellCluster_Allcells_TNK_k4_dotplot.pdf
       |__Fonly_TNK23
      |__...

    where includes 1) hierarchy clustering tree plot; 2) diagnostic clustering plot; 3) all analysis results saved in Rdata; 4) number of cells in each HIPPO identified sub-clusters in excel file; 5) top expressed genes with respect to each HIPPO identified sub-clusters.

  • Command line execution: Rscript run_scRICA_hippo.R --help

    Users can also conduct above analysis with command line executions as below:

    Rscript run_scRICA_hippo.R --outputDir misc/ --hippoResNamePrefix Allcells_TNK_commandLine --expCondCheck 'all' --rdsFname ~/Desktop/scRICA_demo.rds --cellclusters 'T/NK1, T/NK2, T/NK3' --noCluster 4 > $PWD/run_scRICA_hippo_tnk_allcells_k4.log 2>&1
    Rscript run_scRICA_hippo.R --outputDir misc/ --hippoResNamePrefix Fonly_TNK23_commandLine --expCondCheck 'expCond2' --expCond 'F' --rdsFname ~/Desktop/scRICA_demo.rds --cellclusters 'T/NK2, T/NK3' --noCluster 5 > $PWD/run_scRICA_hippo_tnk23_Fonly_k5.log 2>&1
  • Update RDS file with HIPPO results

    As shown in above examples, 3 types of T/NK cells were conducted with sub-clustering analysis with HIPPO for a k=4 sub-clusters identification. Below command can be used to update the cells clustering information saved in Idents() based on HIPPO results saved in Rdata format. If list names are provided in the option hippoResList, updated cells idents names are based on them. Otherwise, users also can use option hippoResNames for updated HIIPO sub-clusters names. More detailed information regarding to these options, please refer ?updateHippoIdents.

    rds.hippo1 <- updateHippoIdents(resDir = 'misc/', rds = rds1, resSaveFname = 'hippo_tnk.Rds',
                             hippoResList = list('TNK_hippo' = 'misc/hippo_results/Allcells_TNK_k4/lightHIPPOres_Allcells_TNK_k4.Rdata'),
                             hippoResK = c(4),
                             newAnnotation = F )
    table(Idents(rds.hippo1))
5.5.4 pseudo-time trajectory analysis with slingshot

Similarly, scRICA also allow users to conduct pseudo-time trajectory analysis with slingshot for any specific cell types (cellcluster) with respect to cells from all samples or any specific sample’s attribute defined via expCondCheck and expCond together. This analyis can be done inside R with R function getExpCondClusterPseudotime() or by command line exections Rscript run_scRICA_pseudoTime.R [Options]....

  • R function exectution: ?getExpCondClusterPseudotime(). For example, you can run slingshot pseudo-time trajectory analysis on SE and CE cell types with respect to 3 different fallopian tube anatomical sites seperations with below command:

    psdt.res1 <- getExpCondClusterPseudotime(resDir = 'misc/visualization_results', rds = rds1,
                                       expCondCheck = 'expCond2', 
                                       expCondCheckFname = 'tissueSep_SECE',
                                       cellcluster = c('SE1', 'CE'),
                                       newAnnotation = F)
  • The above analysis results are saved inside folder named via options resDir and expCondCheckFname. For example, the above analysis results are saved into the folder named as ./misc/visualization_results/results_wOrgClusterAnnotation_pseudoTime/tissueSep_SECE with below data structure:

    -- results_wOrgClusterAnnotation_pseudoTime
     |__tissueSep_SECE
      |__expCondLevel_A_cluster_SE_fnPseudoTimeRes.Rdata
      |__expCondLevel_A_cluster_SE_ptGMM_GAMheatmap.pdf
      |__expCondLevel_A_cluster_SE_ptGMM_GAMheatmapGenenames.txt
      |__expCondLevel_A_cluster_SE_ptGMM_lineage.pdf
      |__expCondLevel_A_cluster_SE_ptGMM_plots.pdf
      |__...

    where includes results with respect to each samples attributes condition and cell types, such as for fallpian site ampulla, SE cell, 1) slingshot linage plots (expCondLevel_A_cluster_SE_ptGMM_lineage.pdf); 2) PCA and diffusion map seperation plots (expCondLevel_A_cluster_SE_ptGMM_plots.pdf); 3) heatmap of identified genes seperating lingeages (expCondLevel_A_cluster_SE_ptGMM_GAMheatmap.pdf); 4) corresponding gene list presented in heatmap (expCondLevel_A_cluster_SE_ptGMM_GAMheatmapGenenames.txt).

  • Command line execution: Rscript run_scRICA_pseudoTime.R --help

    For above analysis, it can also be done with command line option shown as below:

    Rscript run_scRICA_pseudoTime.R --outputDir misc/ --hippoResNamePrefix tissueSep_SECE --expCondCheck 'expCond2' --rdsFname ~/Desktop/scRICA_demo.rds --cellclusters 'SE, CE' > $PWD/run_scRICA_pseudoTime_CESEonly.log 2>&1

Feedback

If you have further questions or suggestions regarding this package, please contact Yan Li at from the bioinformatics core at the Center for Research Bioinformatics (CRI), biological science division (BSD), University of Chicago.