It is a systematic workflow R package for integrative and comparative scRNA-seq analysis. Various parameter options are inputted from a metadata table and then inherited through the entire analysis workflow. scRICA can significantly improves efficiency of programming for comparative analyses with multiple sample attributes. It is distributed as an R package and also offers a command line execution option.
scRICA categorizes the workflow of integrative and comparative analysis using multiple-sample scRNA-seq data into four steps as shown below picture:
If below prerequisite packages have not been installed, you can install them via below commands:
BiocManager::install("MAST")
BiocManager::install("scater")
BiocManager::install("destiny")
install.packages("mclust")
install.packages("easylabel")
install.packages("corrplot")
devtools::install_github('EDePasquale/DoubletDecon')
devtools::install_github("ChenMengjie/lightHippo")
BiocManager::install("slingshot")
BiocManager::install("DESeq2")
BiocManager::install("SingleCellExperiment")
BiocManager::install("tradeSeq")
github installation
devtools::install_github(repo = 'yan-cri/scRICA', build_vignettes = T, force = T)
library(scRICA)Installation by downloading scRICA into your local computer via
git clone https://github.com/yan-cri/scRICA.git, then
install this package by:
devtools::install('scRICA', build_vignettes = T)
browseVignettes(package = 'scRICA')As shown in below, a total of 6 samples single cell gene expression count matrix results from cell ranger analysis can be downloaded from https://drive.google.com/drive/folders/1wwxmi4cvASRDFtiFKs6MbflFDOOxw3AY?usp=sharing using as the demonstration data for this package’s step 1 and step 2 analysis implementations.
-- 3041A
|__barcodes.tsv
|__genes.tsv
|__matrix.mtx
-- 3041F
|__barcodes.tsv
|__genes.tsv
|__matrix.mtx
-- 3041I
|__barcodes.tsv
|__genes.tsv
|__matrix.mtx
-- 3396A
|__barcodes.tsv.gz
|__features.tsv.gz
|__matrix.mtx.gz
-- 3396F
|__barcodes.tsv.gz
|__features.tsv.gz
|__matrix.mtx.gz
-- 3396I
|__barcodes.tsv.gz
|__features.tsv.gz
|__matrix.mtx.gz
This RDS file can be downloaded from https://drive.google.com/drive/folders/1wwxmi4cvASRDFtiFKs6MbflFDOOxw3AY?usp=sharing, which can be used as demonstration data for scRICA step 3 visualization and step 4 downstream analysis implementations.
This RDS includes an integration study from our human cell atlas study of Fallopian Tubes, where a total of 18 scRNA-seq sequencing samples from 8 donors fallopian anatomical sites, including isthmus (I), ampule (A) and fambiriae (F). As shown in the table head, these experimental attributes, including donor and anatomical sites, were defined in the original metadata table column ’’ and ’’ which has been inherited into the integrated RDS object column ’’ and ’’ respectively.
In general, the input metadata table includes four mandatory
columns: sample, path, expCond1,
doubletsRmMethod.
sample: specifies all sample names which will be
used for integration;path: specifies the full path of count matrix
for each corresponding sample in a row, where the processed count matrix
can be in different formats, including 1). a processed count matrix
saved in a folder as above downloaded data, which is directly generated
from the 10X genomic Cell Ranger analysis tool; 2). a count matrix
stored in text or hd5 format.doubletsRmMethod: specify the doublet removal
with 4 options, 1) None: no doublet removal; 2) centroids: one algorithm
used for doublets removal from DoubletDecon; 3) medoids: another
algorithm used in DobuletDecon for doublets identification; 4) OL, cells
detected as doublets by both algorithm (centroids & medoids) from
DoubletDecon.findDoublets() or command line
option Rscript run_scRICA_findDoublets.R --options, which
only sample and path are required as mandatory
columnsdoubletsResDir to specify the full path of DoubletDecon
results directory.expCond1: specify sample’s attributes under an
experimental condition, such as our demonstration data, it can be the
fallopian tube anatomical side (I/A/F)Extra columns can be added to provide more options, they are:
expCond*': specify more sample's attributes, they will be inherited into the integrated Seurat object asexpCond2,expCond3,expCond4`
…filterFname: the full path of a file, which includes
all cells in the rows specified with column barcode and
extra column `filter’ with True or False to specify whether the cell in
a row need to be removed.| sample | path | expCond1 | expCond2 | doubletsRmMethod |
|---|---|---|---|---|
| 3041A | /FullPath/to/CountMatrix/ | A | patient1 | OL |
| 3041F | /FullPath/to/CountMatrix/ | F | patient1 | OL |
| 3041I | /FullPath/to/CountMatrix/ | I | patient1 | OL |
| 3396A | /FullPath/to/CountMatrix/ | A | patient2 | OL |
| 3396F | /FullPath/to/CountMatrix/ | F | patient2 | OL |
| 3396I | /FullPath/to/CountMatrix/ | I | patient2 | OL |
Where you can see that ‘OL’ doublets removal method is specified in
column doubletsRmMethod for all samples, you can also
specify different methods in that column for each sample, because
doublet removal is conducted on samples separately.
| sample | path |
|---|---|
| 3041A | /FullPath/to/CountMatrix/ |
| 3041F | /FullPath/to/CountMatrix/ |
| 3041I | /FullPath/to/CountMatrix/ |
| 3396A | /FullPath/to/CountMatrix/ |
| 3396F | /FullPath/to/CountMatrix/ |
| 3396I | /FullPath/to/CountMatrix/ |
| sample | path | expCond1 | expCond2 | doubletsRmMethod | doubletsResDir |
|---|---|---|---|---|---|
| 3041A | /FullPath/to/CountMatrix/ | A | patient1 | OL | /FullPath/to/DoubletDeconResults/ |
| 3041F | /FullPath/to/CountMatrix/ | F | patient1 | OL | /FullPath/to/DoubletDeconResults/ |
| 3041I | /FullPath/to/CountMatrix/ | I | patient1 | OL | /FullPath/to/DoubletDeconResults/ |
| 3396A | /FullPath/to/CountMatrix/ | A | patient2 | OL | /FullPath/to/DoubletDeconResults/ |
| 3396F | /FullPath/to/CountMatrix/ | F | patient2 | OL | /FullPath/to/DoubletDeconResults/ |
| 3396I | /FullPath/to/CountMatrix/ | I | patient2 | OL | /FullPath/to/DoubletDeconResults/ |
QC assessment without mitochondrial content filtering: turn on
mtFiltering option and setup the mitochondrial content
percentage cut-off values with option mtPerCutoff
library(scRICA)
qcResult <- processQC(metadata = metadata, resDirName = 'scRICA_test_result', genomeSpecies = 'human', mtFiltering = T, mtPerCutoff = 20)Options for this function includes:
genomeSpecies: specify sample’s genome species, by
default ‘human’, currently supporting human, mouse, and rate.extraFilter: True or False option, if ‘TRUE’, column
‘filterFname’ needs to be included in the provided metadata table.minCells: specify the minimum detected number of cells
to be included in the analysis for each sample used by Seurat, by
default = 3.minFeatures: specify the minimum number of expressed
gene features for cells included in the analysis used by Seurat, by
default = 200.?processQCResults will be saved into the folder specified in the option
resDirName under the current working directory shown as
below data structure, where includes 2 directories and 3 text files.
## -- scRICA_test_result
## |__allCluster_pos_markers_no.txt
## |__allCluster_pos_markers_top10.txt
## |__allCluster_pos_markers.txt
## |__cluster_heatmap_top10PosMarkers.pdf
## |__pca_plot.pdf
## |__RDS_Dir
## |__scRICA_test_result.rds
## |__tsne_plot_samplSep.pdf
## |__tsne_plot.pdf
## |__umap_plot_samplSep.pdf
## |__umap_plot.pdf
mtFiltering = T.This step integrates all samples listed in the input metadata table using two popular integration methods, the CCA and RPCA algorithms in Seurat, the input file for this function is the pre-processed QC results from above ‘step 1’ for example:
results <- getClusterMarkers(qcProcessedResults = qcResult)Options for this function includes:
integrationMethod: specify integration method applied
for anchor identification for integration, 2 options, ‘CCA’ and ‘RPCA’,
by default = ‘CCA’.nfeatures: specify the number of identified over
expressed genes from each identified cell clusters, by default =
10.topN: specify the folder/directory name where
integration analysis results will be saved, by default: the save
directory where QC results are saved.ribo: True or False logical option, if True, rRNA genes
are removed from the variable features for integration analysis.int.k.weight: specify the number of neighbors to
consider when weighting anchors in the step of integration, by default =
100, if too few number cells used in the study, this number can be
reduced.The integration results will be saved in the same directory
defined in the processQC() execution, here we can see:
‘umap_plot.pdf’: multi-sample integrated UMAP plot;
‘tsne_plot.pdf’: multi-sample integrated tSNE plot:
Corresponding UMAP and tSNE plots separated on samples from the input metadata table: ‘umap_plot_samplSep.pdf’ and ‘tsne_plot_samplSep.pdf’
‘allCluster_pos_markers.txt’: identified over-expressed genes with respect to each identified cell cluster;
‘allCluster_pos_markers_no.txt’: summarize the number of over-expressed genes from each identified cell cluster;
‘allCluster_pos_markers_top10.txt’: presents topN
identified over-expressed genes from each identified cell
cluster;
‘cluster_heatmap_top10PosMarkers.pdf’: a heatmap of
topN identified over-expressed genes from each identified
cell cluster.
‘seuratReadin_qcProcessed.Rdata’: intermediate Rdata file, where
the pre-processed QC results from step 1 is saved as an object named as
seuratProcssedObjList1, which can be loaded and used for
integration analysis separately, such as
load(seuratReadin_qcProcessed.Rdata)
results2 <- getClusterMarkers(qcProcessedResults = seuratProcssedObjList1, integrationMethod = 'RPCA')Rscript run_scRICA_qc_integration.R -m PATH/TO/METADATA_TABLE -o scRICA_results -g human --minCells 3 --minFeatures 100 --mtFiltering True --mtPerCutoff 20 --multiomics False --extraFilter False --integration.method CCA --integration.nfeatures 500 > $PWD/run_scRICA.log 2>&1These can be done via getClusterSummaryReplot(). In
our demonstrated data named as ‘scRICA_demo.rds’, where ‘expCond1’
specify the patient donor IDs, and ‘expCond2’ specify fallopian tube
anatomical sites (I/A/F). If you would like to update the cells
annotation, just run on newAnnotation=T, and provide the
corresponding annotation file with the option
newAnnotationRscriptName=/full/path/to/file, an example of
new annotation file named as anno_example.R can be found
from ‘misc’ folder, where all cell types need to be updated can be
included.
An example code to summarize the cellular composition based on samples attribute specified in ‘expCond1’ with respect to different donors is shown below:
rds1 <- readRDS('~/Desktop/757_scRNA-seq/RDS/scRNAseq_FT_finalized_wAnno.rds')
library(scRICA)
## seperation on donor: expCond1
getClusterSummaryReplot(rds = rds1, resDir = 'scRICA_results', newAnnotation = F, expCondCheck = 'expCond1', expCondCheckFname = 'donors_summary')
## seperation on fallopian tissue types (I/A/F): expCond2 togher with new annotation
getClusterSummaryReplot(rds = rds1, resDir = 'scRICA_results', newAnnotation = T, newAnnotationRscriptName = 'anno_example.R', expCondCheck = 'expCond1', expCondCheckFname = 'donors_summary')
Depending on whether newAnnotation is True or False,
all related results will be saved into
scRICA_results/results_wOrgClusterAnnotation/expCond_donors_summary
or
scRICA_results/results_wOrgClusterAnnotation/expCond_donors_summary
respectively. An example of data structure of result with
newAnnotation=F can be seen as:
-- scRICA_results
|__results_wOrgClusterAnnotation
|expCond_donors_summary
|__cellNo_summary_orgClusterAnnotation_expCond_donors_summary.pdf
|__cellNo_summary_orgClusterAnnotation_expCond_donors_summary.txt
|__new_tSNE_plot_expCond_donors_summary
|__tsne_plot_noLabel_integrate_orgAnnotation.pdf
|__tsne_plot_wLabel_integrate_orgAnnotation.pdf
|__tsne_plot_wLabel_orgAnnotation_expCond_donors_summary.pdf
|__new_UMAP_plot_expCond_donors_summary
|__UMAP_plot_noLabel_integrate_orgAnnotation.pdf
|__UMAP_plot_wLabel_integrate_orgAnnotation.pdf
|__UMAP_plot_wLabel_orgAnnotation_expCond_donors_summary.pdfexpCondCheck = 'expCond1'.expCondCheck = 'expCond1') is:expCondCheck = 'expCond2') with new cell annotation
specified in
newAnnotation = T, newAnnotationRscriptName = 'anno_example.R'
is:UMAP plot separated into 3 different anatomical sites (I/A/F) together with new cell annotations is:
There are several options which can be used to alter the cellular composition percentage bar-plot, such as
stack =T: make stacked or un-stacked cellular
composition bar-plot by turning on/off this option
perPlotVertical = F, either in the vertical or
horizontal position for this cellular composition bar-plot
fpRemake = F, perRemake = T: only make cellular
composition bar-plot by turn on perRemake without making
corresponding UMAP/tSNE plots
(perPlotVertical = F)
perPlotVertical = F, perPlotHeight = 3, perPlotWidth = 7, fpRemake = F, stack =T,
expCondNameOrder = c('I', 'A', 'F'): specify the
order of attributes specified from
expCondCheck = 'expCond2', where ‘expCond2’ has 3 levels,
which can be re-ordered as ‘I->A->F’.
An example of un-stacked re-ordered cellular composition plot shown below can be made by:
getClusterSummaryReplot(rds = rds1, resDir = 'misc/visualization_results',
perRemake = T, fpRemake = F, stack =F,
perPlotVertical = T, perPlotHeight = 3, perPlotWidth = 7,
expCondNameOrder = c('I', 'A', 'F'),
newAnnotation = F,
expCondCheck = 'expCond2', expCondCheckFname = 'tissue_sep1')?getClusterSummaryReplotThis plot can be made via function getGoiDotplot().
A code example to make such plot on a list of cell cycle genes list
(seen in misc folder named as ‘marker_genes_CC.xlsx’),
which all cell cycle genes are classified into 2 groups: ‘S phase’ and
‘G2/M phase’.
getGoiDotplot(resDir = 'misc/visualization_results', rds = rds1,
newAnnotation = F,
goiFname = 'misc/marker_genes_CC.xlsx',
expCondCheck = 'expCond2', expCondCheckFname = 'dotplot_tissue',
dotPlotFnamePrefix = 'cc')scRICA provides options for users to select certain cell types
cellcluster and/or experimental conditions based on the
levels of chosen attribute in expCondCheck = 'expCond2' for
this feature dot-plot visualization. For example, we can visualize only
3 T/NK cells (T/NK1, T/NK2, and T/NK2) from 2 fallopian tube tissues I
and A via below command.
getGoiDotplot(resDir = 'misc/visualization_results', rds = rds1,
newAnnotation = F,
goiFname = 'misc/marker_genes_CC.xlsx',
expCondCheck = 'expCond2', expCondCheckFname = 'dotplot_tissue',
geneTypeOrder = c('S phase', 'G2/M phase'),
cellcluster = c('T/NK1', 'T/NK2', 'T/NK3'),
expCond = c('I', 'A'),
expCondReorderLevels = paste(rep(rev(c('T/NK1', 'T/NK2', 'T/NK3')), each=2), rep(rev(c('I', 'A')), time = 3), sep = '_'),
dotPlotFnamePrefix = 'cc_TNKonly_IAonly',
dotPlotMinExpCutoff = 0, dotPlotMaxExpCutoff = 4,
dotPlotWidth = 26, dotPlotHeight = 6,
legendPer = 0.08, genetypebarPer = 0.03,
fontsize.x = 20, fontsize.y = 16,
fontangle.x = 90, fontangle.y = 0,
fontsize.legend1 = 10,
geneTypeLegendOn = T, fontsize.legend2 = 20)As shown in above code, there are several options which can be
used to adjust this dot-plot display, including x-/y-axis font size and
angles (fontsize.x, fontangle.x etc.), whether
to include gene type legend color bar (geneTypeLegendOn),
the size of the dot plot (dotPlotWidth and
dotPlotHeight), top legend bar height
(genetypebarPer) etc., for detailed information please
refer the help manual ?getGoiDotplot. And the finalized
plot can be seen as this:
This plot can be made via function getGoiHeatmap().
The heatmap can be visualized in 3 ways via option
heatmap.view, including visualize on 1). the identified
cell types by heatmap.view = 'idents'; 2) samples
attributes levels by heatmap.view = 'expCond'; 3) across
samples attributes levels with respect to different cell types by
heatmap.view = 'expCond.idents'. Additionally, we also can
make heatmap based on 1) cell levels with display = 'cell';
2) aggregated expressions across various levels of samples attributes
with display = 'expCond.merge'; 3) aggregated expressions
for each sample defined in the metadata table with
display = 'sample.merge'. For example, we can use below
command to generate a heatmap of selected genes expression aggregated at
the sample level across all cell types.
getGoiHeatmap(heatmap.view = 'idents', resDir = 'misc/visualization_results', rds = rds1,
newAnnotation = F, geneNames = c('CD44', 'ITGA6', 'EPCAM', 'KRT5', 'KRT7', 'TUBB4A', 'TUBB4B', 'PAX8'),
expCondCheck = 'expCond2',
expCondCheckFname = 'heatmap_tissue',
plotFnamePrefix = 'geneset1_sampleMerge',
display = 'sample.merge') 2. scRICA includes
various options to change the plot size, font size etc. to make the
figure with high resolution quality for publication, they can be found
from the help manual via
?getGoiHeatmap. Additionally
scRICA also allows users to select certain cell types and samples
attributes for heatmap visualization, here is an example of the heatmap
from fallopian tube I and A for the selected genes on each sample
level.
getGoiHeatmap(heatmap.view = 'expCond.idents', resDir = 'misc/visualization_results', rds = rds1,
newAnnotation = F, geneNames = c('CD44', 'ITGA6', 'EPCAM', 'KRT5', 'KRT7', 'TUBB4A', 'TUBB4B', 'PAX8'),
expCondCheck = 'expCond2',
expCondCheckFname = 'heatmap_tissue',
plotFnamePrefix = 'geneset1_expCond_idents_merge_SECEonly_IAonly',
expCondReorderLevels = c('I CE', 'A CE', 'I SE', 'A SE'),
display = 'sample.merge', draw.lines = F,
cellcluster = c('CE', 'SE'),
expCond = c('I', 'A'),
fontsize.top = 2, fontsize.y = 12, barHeight.top = 0.05,
plotWidth = 5, plotHeight = 3, fontangle.top = 90, debug = T)
DE analysis on cells
scRICA allow users to conduct DE analysis on any 2 group of cells
with different sample’s attributes defined in any specific metadata
table column (expCond*). For example, the demonstration RDs data has 2
columns, expCond1 and expCond2, defining samples attributes of donors
and fallopian tissue types respectively. If we would like to check DEGs
between tissue types I and F for a specific cell type with option
cellcluster or all cell types without specifying
cellcluster. Once you are running on all cell clusters, all
cell types DE results are saved into the excel file spreadsheet, which
can cause memory deficiency for excel file storage. commands to run
these 2 analysis can be seen as below:
de.res1 <- getClusterExpCondDe(resDir = 'misc/', rds = rds1,
newAnnotation = F,
cellcluster = 'MP', deMethod = 'wilcox',
expCondCheck = 'expCond2', expCondCheckFname = 'IFcomp_wilcox',
compGroup = 'I/F')
de.res2 <- getClusterExpCondDe(resDir = 'misc/', rds = rds1,
newAnnotation = F, deMethod = 'MAST',
outputExcel = as.logical(F),
expCondCheck = 'expCond2', expCondCheckFname = 'IFcomp_allCells_MAST',
compGroup = 'I/F')
All analysis results will be saved into the folder specified by
resDir= 'misc' with newAnnotation = F into
misc/results_wOrgClusterAnnotation_DEGs or
misc/results_wNewAnnotation_DEGs when
newAnnotation = T. For each different analysis, results
will be saved into specific folder named via option
expCondCheckFname, as shown below, above 2 analysis are
saved into folder
misc/results_wNewAnnotation_DEGs/IFcomp_wilcox and
misc/results_wNewAnnotation_DEGs/IFcomp_allCells_MAST
respectively, including 1) full list of DEGs (differentially expressed
genes) results; 2) significantly DEGs at the specified
pAdjValCutoff level, by default it is 0.05; 3)
significantly up- and down-regulated DEGs at the specified
pAdjValCutoff level; 4) summary of number of DEGs; 5) top
number of DEGs list. When outputExcel = F, full list of DE
results and significantly expressed DEGs are saved into the spreadsheet
of the same excel file, otherwise, they are saved into the sperate txt
file with respect to each cell type.
-- results_wOrgClusterAnnotation_DEGs
|__IFcomp_allCells_MAST
|__expCondCompDeMarkers_I-F_adjSig_allClusters_cluster*.txt
|__expCondCompDeMarkers_I-F_adjSig_down_allClusters_cluster*.txt
|__expCondCompDeMarkers_I-F_adjSig_up_allClusters_cluster*.txt
|__expCondCompDeMarkers_I-F_full_allClusters_cluster*.txt
|__expCondCompDeMarkers_I-F_NoDEmarkers_summary.xlsx
|__expCondCompDeMarkers_I-F_top10_downDe.xlsx
|__expCondCompDeMarkers_I-F_top10_upDe.xlsx
|__IFcomp_wilcox
|__expCondCompDeMarkers_I-F_adjSig_1SelClusters.xlsx
|__expCondCompDeMarkers_I-F_adjSig_down_1SelClusters.xlsx
|__expCondCompDeMarkers_I-F_adjSig_up_1SelClusters.xlsx
|__expCondCompDeMarkers_I-F_full_1SelClusters.xlsx
|__expCondCompDeMarkers_I-F_NoDEmarkers_summary.xlsx
|__expCondCompDeMarkers_I-F_top10_downDe.xlsx
|__expCondCompDeMarkers_I-F_top10_upDe.xlsx
Above returned DE analysis results are listed item, where you can
access all DE analysis results with below command with
de.res1 as an example:
head(de.res1$clusterDeMarkers$MP)
de.res1$clusterDeResSummar
head(de.res1$clusterTopDeMarkers$MP)DE analysis on sample’s pseudo-bulk counts
scRICA also provides DE analysis based on pseudo-bulk counts
aggregated on samples defined in the initial metadata table, this
information is inherited into the integrated RDS metadata table column
expCond (rds1$expCond). For example, in our demonstration
RDS file, there are totally 18 samples from different donor fallopian
tube tissues, then we can use option
deMethod = 'DESeq2.bulk' to conduct pseudo-bulk DE
analysis, where counts from each sample are aggregrated and normalized
based on the total number of cells. Additionally, users can also use an
addtional metadata table to specify certain confounding covariates for
model to correct those potential confounding effects for DE analysis
with DESeq2. In this confounding factor defining metadata table, 1)
column bulksamp should include all samples listed in
rds1$expCond or initial metadata table for integration
analysis; 2) other confounding factors to be corrected should be
included in the columns of this table, such as we make an example
confounding factor metadata table for our RDS demonstration data as
below, here we would like to address potential confounding factors
effects from gender and bmi.
meta1 <- data.frame('bulksamp' = names(table(rds1$expCond)),
'gender' = c('F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F' ),
'bmi' = c(23.5, 23.5, 23.5, 18, 18, 28.4, 28.4, 28.4, 32, 32, 20, 20, 20, 30, 30, 30, 21, 21))
With this confounding factor metadata table, we can use option
deseq2bulk.metaCovariateInput to remove all potential
confounding factor effects from pseudo-bulk DE analysis performed via
DESeq2. Users also can use option norm.method to adjust
normalization methods implemented across samples, including TMM,
upper-quantile (UQ), and default DESeq2 normalization methods. An
example of running command can be seen as below:
de.res3 <- getClusterExpCondDe(resDir = 'misc/', rds = rds1,
newAnnotation = F, cellcluster = 'MP',
deMethod = 'DESeq2.bulk',
deseq2bulk.metaCovariateInput = meta1,
expCondCheck = 'expCond2', expCondCheckFname = 'IFcomp_DEseq2bulk',
compGroup = 'I/F')
In addition to above analysis results, scRICA also return original and cell number normalized count table, which users can use below command to access them:
head(de.res3$normCounts$MP)
head(de.res3$orgCounts$MP)scRICA allow users to check over expressed marker genes across
defined cell types via function getExpCondClusterMarkers().
For example, users can check over-expressed genes among 5 selected
immune cells, including T/NK1, T/NK2, T/NK3, MP, and B/P with respect to
samples tissue types defined in expCond2 with below
command:
oe.res1 <- getExpCondClusterMarkers(resDir = 'misc/', rds = rds1,
deMethod = 'MAST',
cellcluster = c('T/NK1', 'T/NK2', 'T/NK3', 'MP', 'B/P'),
newAnnotation = F,
expCondCheck = 'expCond2',
expCondCheckFname = 'tissue_IR_overExpressedGenes')
Where, over-expressed marker genes are identified from each cell type
in comparison with all other cell types with respect to each fallopian
tube tissue type (I/A/F). Similarly, as other DE results, all analysis
results are also saved into the folder based on defined options
resDir, newAnnotation, and
expCondCheckFname. In our example, results are saved into
the foler named as
misc/clusterMarkerGenes_results_wOrgClusterAnnotation with
below data structure, where 1) full list of marker genes analysis
results with respect to each fallopian tube tissue type are saved into
txt file named as *_resFull.txt; 2) significant
over-expressed genes list are also saved into different txt file named
as *_UpSig.txt; 3) the number of over-expressed marker
genes are also summarized into *_UpSig_NoSummary.txt; 4)
top expressed marker genes heatmap is also made with respect to
different levels of samples attributes as
*_UpSig_heatmap.pdf.
-- clusterMarkerGenes_results_wOrgClusterAnnotation
|__tissue_IR_overExpressedGenes
|__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_resFull.txt
|__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_UpSig_heatmap.pdf
|__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_UpSig_NoSummary.txt
|__expCond_tissue_IR_overExpressedGenes_A_allCluster_pos_markers_UpSig.txt
|__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_resFull.txt
|__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_UpSig_heatmap.pdf
|__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_UpSig_NoSummary.txt
|__expCond_tissue_IR_overExpressedGenes_F_allCluster_pos_markers_UpSig.txt
|__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_resFull.txt
|__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_UpSig_heatmap.pdf
|__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_UpSig_NoSummary.txt
|__expCond_tissue_IR_overExpressedGenes_I_allCluster_pos_markers_UpSig.txt
Similarly, some analysis results can also be accessed from returned
analysis object, including the full list of over-expressed genes and
significantly over-express gene list, such as oe.res1 from
our above analysis example.
head(oe.res1$fullRes$A)
head(oe.res1$sigUp$A)
table(oe.res1$sigUp$A$cluster)
scRICA allow users to conduct sub-clustering HIPPO analysis for any
specific cell types (cellcluster) with respect to cell from
all samples or any specific sample’s attribute defined via
expCondCheck and expCond together. This
analyis can be done inside R with R function getHippoRes()
or by command line exections
Rscript run_scRICA_hippo.R [Options]....
R function exectution: ?getHippoRes()
For example, you can sub-cluster 3 types of T/NK cells (T/NK1, T/NK2
and T/NK3) into 5 sub-clusters via noClusters=5 based on
zero-inflation clustering method for all cells regardless of theirs
samples attributes, the command is as below:
getHippoRes(resDir = 'misc/', rds = rds1, newAnnotation = F,
hippoResNamePrefix = 'Allcells_TNK',
cellcluster = c('T/NK1', 'T/NK2', 'T/NK3'), noClusters = 4)
Users also can conduct HIPPO sub-clustering analysis on cells defined
from a specific sample’s attribute level, such as you can conduct HIPPO
sub-clustering analysis on 2 groups of T/NK cells from fallopian tissue
F only via below command:
getHippoRes(resDir = 'misc/', rds = rds1, newAnnotation = F,
expCondCheck = 'expCond2', expCond = 'F',
hippoResNamePrefix = 'Fonly_TNK23',
cellcluster = c('T/NK2', 'T/NK3'), noClusters = 5)
If too many cells need to be performed, you also can turn on
sparseMatrix = T to avoid memory deficiency. The HIPPO
analysis results are saved into the defined folder named via option
hippoResNamePrefix inside resDir direcotry
sub-folder hippo_results, an example of above analysis
results data structure can be seen as below:
-- hippo_results
|__Allcells_TNK
|__Allcells_TNK_k4_clustersTree.pdf
|__diagnosticPlot_cellCluster_Allcells_TNK_k4.pdf
|__lightHIPPOres_Allcells_TNK_k4.Rdata
|__Per_cellCluster_Allcells_TNK_k4.xlsx
|__topFeatures_cellCluster_Allcells_TNK_k4_dotplot.pdf
|__Fonly_TNK23
|__...
where includes 1) hierarchy clustering tree plot; 2) diagnostic clustering plot; 3) all analysis results saved in Rdata; 4) number of cells in each HIPPO identified sub-clusters in excel file; 5) top expressed genes with respect to each HIPPO identified sub-clusters.
Command line execution:
Rscript run_scRICA_hippo.R --help
Users can also conduct above analysis with command line executions as below:
Rscript run_scRICA_hippo.R --outputDir misc/ --hippoResNamePrefix Allcells_TNK_commandLine --expCondCheck 'all' --rdsFname ~/Desktop/scRICA_demo.rds --cellclusters 'T/NK1, T/NK2, T/NK3' --noCluster 4 > $PWD/run_scRICA_hippo_tnk_allcells_k4.log 2>&1Rscript run_scRICA_hippo.R --outputDir misc/ --hippoResNamePrefix Fonly_TNK23_commandLine --expCondCheck 'expCond2' --expCond 'F' --rdsFname ~/Desktop/scRICA_demo.rds --cellclusters 'T/NK2, T/NK3' --noCluster 5 > $PWD/run_scRICA_hippo_tnk23_Fonly_k5.log 2>&1Update RDS file with HIPPO results
As shown in above examples, 3 types of T/NK cells were conducted with
sub-clustering analysis with HIPPO for a k=4 sub-clusters
identification. Below command can be used to update the cells clustering
information saved in Idents() based on HIPPO results saved
in Rdata format. If list names are provided in the option
hippoResList, updated cells idents names are based on them.
Otherwise, users also can use option hippoResNames for
updated HIIPO sub-clusters names. More detailed information regarding to
these options, please refer ?updateHippoIdents.
rds.hippo1 <- updateHippoIdents(resDir = 'misc/', rds = rds1, resSaveFname = 'hippo_tnk.Rds',
hippoResList = list('TNK_hippo' = 'misc/hippo_results/Allcells_TNK_k4/lightHIPPOres_Allcells_TNK_k4.Rdata'),
hippoResK = c(4),
newAnnotation = F )
table(Idents(rds.hippo1))Similarly, scRICA also allow users to conduct pseudo-time trajectory
analysis with slingshot for any specific cell types (cellcluster) with
respect to cells from all samples or any specific sample’s attribute
defined via expCondCheck and expCond together. This analyis can be done
inside R with R function getExpCondClusterPseudotime() or
by command line exections
Rscript run_scRICA_pseudoTime.R [Options]....
R function exectution:
?getExpCondClusterPseudotime(). For example, you can run
slingshot pseudo-time trajectory analysis on SE and CE cell types with
respect to 3 different fallopian tube anatomical sites seperations with
below command:
psdt.res1 <- getExpCondClusterPseudotime(resDir = 'misc/visualization_results', rds = rds1,
expCondCheck = 'expCond2',
expCondCheckFname = 'tissueSep_SECE',
cellcluster = c('SE1', 'CE'),
newAnnotation = F)The above analysis results are saved inside folder named via
options resDir and expCondCheckFname. For
example, the above analysis results are saved into the folder named as
./misc/visualization_results/results_wOrgClusterAnnotation_pseudoTime/tissueSep_SECE
with below data structure:
-- results_wOrgClusterAnnotation_pseudoTime
|__tissueSep_SECE
|__expCondLevel_A_cluster_SE_fnPseudoTimeRes.Rdata
|__expCondLevel_A_cluster_SE_ptGMM_GAMheatmap.pdf
|__expCondLevel_A_cluster_SE_ptGMM_GAMheatmapGenenames.txt
|__expCondLevel_A_cluster_SE_ptGMM_lineage.pdf
|__expCondLevel_A_cluster_SE_ptGMM_plots.pdf
|__...
where includes results with respect to each samples attributes
condition and cell types, such as for fallpian site ampulla, SE cell, 1)
slingshot linage plots
(expCondLevel_A_cluster_SE_ptGMM_lineage.pdf); 2) PCA and
diffusion map seperation plots
(expCondLevel_A_cluster_SE_ptGMM_plots.pdf); 3) heatmap of
identified genes seperating lingeages
(expCondLevel_A_cluster_SE_ptGMM_GAMheatmap.pdf); 4)
corresponding gene list presented in heatmap
(expCondLevel_A_cluster_SE_ptGMM_GAMheatmapGenenames.txt).
Command line execution:
Rscript run_scRICA_pseudoTime.R --help
For above analysis, it can also be done with command line option shown as below:
Rscript run_scRICA_pseudoTime.R --outputDir misc/ --hippoResNamePrefix tissueSep_SECE --expCondCheck 'expCond2' --rdsFname ~/Desktop/scRICA_demo.rds --cellclusters 'SE, CE' > $PWD/run_scRICA_pseudoTime_CESEonly.log 2>&1If you have further questions or suggestions regarding this package, please contact Yan Li at liyan@uchicagomedicine.org from the bioinformatics core at the Center for Research Bioinformatics (CRI), biological science division (BSD), University of Chicago.