A Journey through Single-Cell Transcriptomics: Guiding Principles for Annotating Cellular Realms (using ScType)

Single-cell transcriptomics is a powerful technique that allows us to study thousands of cells at once and discover new types of cells, understand their different states, and observe how they change over time. Scientists have developed methods and steps to create detailed maps of these individual cells in tissues. This tutorial is designed to help us understand and make sense of the data from single-cell transcriptomics experiments. We will learn how to identify different types of cells, recognize their states, and uncover important patterns that are relevant to biology. Ultimately, our goal is to create a map that annotates and describes the characteristics of these cells, providing valuable insights into their functions and behaviors.

(Write this under references:) We recommend the following reference article for an overview of guidelines for annotating single-cell transcriptomic maps: Clarke, Zoe A., et al. “Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods.” Nature protocols 16.6 (2021): 2749-2764.

They recommend a 3 step workflow: Automatic Cell Annotation Step 1 of the annotation process in single-cell transcriptomics involves automatic cell annotation. This method efficiently labels cells or clusters using computer algorithms and existing biological knowledge. There are two main approaches: marker-based automatic annotation and reference-based automatic annotation. Marker-based annotation uses known marker genes specific to each cell type, while reference-based annotation transfers labels from a reference dataset to unlabeled cells with similar gene expression profiles. Automatic annotation can be applied to individual cells or clusters, with cluster annotation being faster but potentially less accurate. Challenges include incomplete labeling for cell types with poorly characterized signatures. Marker-based annotation tools include SCINA and AUCell, while GSVA is used for cluster annotation. Reference-based annotation relies on high-quality reference data, and the performance of automatic annotation tools can vary. Conflicting annotations can be resolved through various approaches, including confidence scores and majority rule. Manual annotation may be necessary for unresolved cases or novel cell identities.

Expert Manual Cell Annotation Step 2 of the annotation process involves expert manual cell annotation. Manual annotation becomes necessary when automated methods result in lower confidence, conflicting labels, or absent cell annotations. In this step, cells are examined manually to determine their function using various resources and marker-based annotation principles. Visualization tools like gene expression overlays and heat maps assist in identifying marker genes for known cell types. Additional markers can be found through literature research and mining of existing single-cell transcriptomic data. Manual annotation operates at the cluster level, but rare cells can be individually examined. Pathway enrichment analysis and differential expression analysis aid in determining cluster-specific pathways and identifying novel cell types. Manual annotation is labor-intensive and subjective but is considered the gold standard method. Cell states and gradients are also addressed in this step, requiring careful examination of stable cell types versus cell states and the annotation of intermediate stages within gradients. Overall, standard nomenclature and integration with Cell Ontology are recommended for consistent annotation across studies.

Annotation Verification Step 3 of the annotation process involves the verification of cell annotations using independent methods. While automated tools and manual annotation provide confident labels, it is crucial to confirm annotations through statistical methods, expert consultation, and experimental validation. Independent methods such as T-cell receptor (TCR) and B-cell receptor (BCR) clonotyping can refine annotations for tissue-resident immune cells by examining their transcriptional signatures. Functional assays, imaging experiments, and single-cell qPCR can also increase annotation confidence. Complementary single-cell genomic methods, including cellular indexing, single-cell ATAC-seq, and spatial transcriptomics, provide additional insights into immunophenotyping, chromatin state, and spatial transcript patterns. In the context of tumor biology, genetic alterations like single-nucleotide variants and copy-number variants (CNVs) can be detected in single-cell data using specialized tools. CNV inference methods analyze gene expression values across genomic regions to identify amplifications or deletions. Verification experiments and genetic analysis are essential to validate novel cell types and distinguish cancer cells from normal cells.

#BiocManager::install("GEOquery")

We are going to work with single-cell RNA sequencing data from Intrahepatic cholangiocarcinoma cell samples. You can find more details here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi

library(GEOquery)

## Loading required package: Biobase

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## Setting options('download.file.method.GEOquery'='auto')

## Setting options('GEOquery.inmemory.gpl'=FALSE)

gse_id <- "GSE138709" #This line fetches the GEO dataset specified by the gse_id using the getGEO() function.
geo_object <- getGEO(gse_id)

## Found 2 file(s)

## GSE138709-GPL20795_series_matrix.txt.gz

## GSE138709-GPL24676_series_matrix.txt.gz

geo_object

## $`GSE138709-GPL20795_series_matrix.txt.gz`
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 0 features, 7 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM4116579 GSM4116580 ... GSM4116585 (7 total)
##   varLabels: title geo_accession ... tissue:ch1 (41 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: GPL20795 
## 
## $`GSE138709-GPL24676_series_matrix.txt.gz`
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 0 features, 1 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 1
##   varLabels: title geo_accession ... tissue:ch1 (40 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: GPL24676

geo_object$`GSE138709-GPL20795_series_matrix.txt.gz`@experimentData

## Experiment data
##   Experimenter name: Min,,Zhang 
##   Laboratory:  
##   Contact information: dermingammc@sina.com 
##   Title: Dissecting the transcriptomic landscape of the human intrahepatic cholangiocarcinoma by single cell RNA-sequencing 
##   URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE138709 
##   PMIDs:  
## 
##   Abstract: A 45 word abstract is available. Use 'abstract' method.
##   notes:
##    citation:     
##       Zhang M, Yang H, Wan L, Wang Z, Wang H, Ge C, Liu Y, Hao Y, Zhang D, Shi G
## , Gong Y, Ni Y, Wang C, Zhang Y, Xi J, Wang S, Shi L, Zhang L, Yue W, Pei 
## X, Liu B, and Yan X. Single cell transcriptomic architecture and intercell
## ular crosstalk of human intrahepatic cholangiocarcinoma. Journal of Hepato
## logy (2020). doi:10.1016/j.jhep.2020.05.039
##    contact_address:     
##       No. 8 East Road, Fengtai District
##    contact_city:     
##       Beijing
##    contact_country:     
##       China
##    contact_department:     
##       The Fifth Medical Center of Chinese PLA General Hospital
##    contact_email:     
##       dermingammc@sina.com
##    contact_institute:     
##       Academy of Military Medical Sciences
##    contact_name:     
##       Min,,Zhang
##    contact_phone:     
##       13241302826
##    contact_state:     
##       Beijing
##    contact_zip/postal_code:     
##       100071
##    contributor:     
##       Min,,Zhang
## Hui,,Yang
## Bing,,Liu
## Xinlong,,Yan
##    geo_accession:     
##       GSE138709
##    last_update_date:     
##       Aug 02 2022
##    overall_design:     
##       To comprehensively elucidate the ICC heterogeneity, we employed single-cel
## l RNA sequencing (scRNA-seq) to profile 31,302 cells of five ICC samples, 
## as well as three paired adjacent non-tumor tissues. Our results identify 3
## 1 cell subtypes and the heterogeneous malignant subtype was highly patient
## -specific, which has differential proliferative, migratory and inflammatio
## n phenotype.
##    platform_id:     
##       GPL20795
## GPL24676
##    platform_taxid:     
##       9606
##    relation:     
##       BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA576876
##    sample_id:     
##       GSM4116579 GSM4116580 GSM4116581 GSM4116582 GSM4116583 GSM4116584 GSM41165
## 85 GSM4116586 
##    sample_taxid:     
##       9606
##    status:     
##       Public on Jun 06 2020
##    submission_date:     
##       Oct 10 2019
##    summary:     
##       Intrahepatic cholangiocarcinoma (ICC) is the second most common malignancy
##  arising in the liver, featured with high tumor heterogeneity and dismal p
## rognosis. The cellular disversity and molecular basis of tumor cells inter
## acted with surrounding niche cells remains poorly understood, limiting the
##  development of targeted anti-tumor therapies.
##    supplementary_file:     
##       ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE138nnn/GSE138709/suppl/GSE138709_
## RAW.tar
##    title:     
##       Dissecting the transcriptomic landscape of the human intrahepatic cholangi
## ocarcinoma by single cell RNA-sequencing
##    type:     
##       Expression profiling by high throughput sequencing

metadata <- pData(geo_object[[1]]) #This line extracts the sample metadata from the geo_object using the pData() function.
metadata

# Download the supplementary files for this GSE ID
getGEOSuppFiles(gse_id, makeDirectory = TRUE)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Seurat)

## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)

## Attaching SeuratObject

library(patchwork)

data <- read.csv("GSE138709/GSM4116579_ICC_18_Adjacent_UMI.csv.gz")

row.names(data) <- data[,1]
head(data)

dim(data)

## [1] 14647 10318

data <- data[,2:10318]
head(data)

seurat_obj <- CreateSeuratObject(counts = data, project = "ScType_workflow", min.cells = 3, min.features = 200)

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

rm(data)

str(seurat_obj)

## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
##   .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:13574652] 12 42 58 62 79 89 158 172 200 241 ...
##   .. .. .. .. .. ..@ p       : int [1:10318] 0 858 1758 2658 3731 5070 6195 7272 8033 10065 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 14647 10317
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:14647] "AL627309.1" "AP006222.2" "RP11-206L10.3" "RP11-206L10.2" ...
##   .. .. .. .. .. .. ..$ : chr [1:10317] "X18P_AAACCTGAGATCCCAT" "X18P_AAACCTGCAATAGCAA" "X18P_AAACCTGCAATGAAAC" "X18P_AAACCTGCAGCCAATT" ...
##   .. .. .. .. .. ..@ x       : num [1:13574652] 1 2 1 1 1 1 1 1 2 7 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:13574652] 12 42 58 62 79 89 158 172 200 241 ...
##   .. .. .. .. .. ..@ p       : int [1:10318] 0 858 1758 2658 3731 5070 6195 7272 8033 10065 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 14647 10317
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:14647] "AL627309.1" "AP006222.2" "RP11-206L10.3" "RP11-206L10.2" ...
##   .. .. .. .. .. .. ..$ : chr [1:10317] "X18P_AAACCTGAGATCCCAT" "X18P_AAACCTGCAATAGCAA" "X18P_AAACCTGCAATGAAAC" "X18P_AAACCTGCAGCCAATT" ...
##   .. .. .. .. .. ..@ x       : num [1:13574652] 1 2 1 1 1 1 1 1 2 7 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ scale.data   : num[0 , 0 ] 
##   .. .. .. ..@ key          : chr "rna_"
##   .. .. .. ..@ assay.orig   : NULL
##   .. .. .. ..@ var.features : logi(0) 
##   .. .. .. ..@ meta.features:'data.frame':   14647 obs. of  0 variables
##   .. .. .. ..@ misc         : list()
##   ..@ meta.data   :'data.frame': 10317 obs. of  3 variables:
##   .. ..$ orig.ident  : Factor w/ 1 level "X18P": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ nCount_RNA  : num [1:10317] 1882 2068 2296 2580 3807 ...
##   .. ..$ nFeature_RNA: int [1:10317] 858 900 900 1073 1339 1125 1077 761 2032 1941 ...
##   ..@ active.assay: chr "RNA"
##   ..@ active.ident: Factor w/ 1 level "X18P": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:10317] "X18P_AAACCTGAGATCCCAT" "X18P_AAACCTGCAATAGCAA" "X18P_AAACCTGCAATGAAAC" "X18P_AAACCTGCAGCCAATT" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "ScType_workflow"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 4 1 3
##   ..@ commands    : list()
##   ..@ tools       : list()

#QC
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(seurat_obj, pattern = "^MT-")
# Normalize data
seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)
#Feature selection
seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 2000)
# Scaling
seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj))

## Centering and scaling data matrix

#Dimensionality reduction by PCA
seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(object = seurat_obj))

## PC_ 1 
## Positive:  IL32, GZMA, CCL5, CD69, NKG7, CORO1A, GZMH, ITM2A, TUBA4A, CD8A 
##     LDHB, PRF1, PFN1, KLRD1, HMGB2, CD8B, HIST1H4C, CTSW, CCL4, C12orf75 
##     KIAA0101, APOBEC3G, HMGB1, KLRB1, MKI67, PTTG1, BIRC5, UBE2C, NUSAP1, ISG20 
## Negative:  CST3, LYZ, FGL2, FCGRT, IFI30, LST1, GRN, CD68, MS4A6A, AIF1 
##     FTL, TYROBP, NPC2, FCER1G, C1orf162, MS4A7, CPVL, MNDA, IGSF6, GPX1 
##     HLA-DQA1, CD14, HLA-DRA, TIMP1, PSAP, FTH1, CTSS, PLAUR, C1QA, LGALS2 
## PC_ 2 
## Positive:  KLRB1, LTB, CD69, IL7R, CXCR6, KLF6, CCR6, CD40LG, NFKBIA, JUN 
##     IL32, CCL20, ICOS, BTG2, CD7, PHLDA1, IFNG, ARID5B, TMIGD2, CD160 
##     FOS, SELM, CCL5, TNFRSF4, RP11-138A9.2, IL23A, AMICA1, FKBP11, BCL2A1, SOCS3 
## Negative:  TUBA1B, STMN1, KIAA0101, UBE2C, BIRC5, MKI67, RRM2, TUBB, CKS1B, TOP2A 
##     CDK1, NUSAP1, ZWINT, CENPF, CCNB2, TK1, AURKB, CDKN3, CCNA2, SPC25 
##     CENPW, CDC20, PLK1, PTTG1, H2AFZ, GTSE1, SMC2, TPX2, CCNB1, NUF2 
## PC_ 3 
## Positive:  TMSB4X, RPS2, ACTB, CORO1A, S100A4, HLA-DPB1, AIF1, LST1, TMSB10, HLA-DPA1 
##     LYZ, IFI30, HLA-DQB1, C1orf162, PFN1, HLA-DRA, MNDA, HLA-DRB1, FCER1G, TYROBP 
##     CD74, HLA-DQA1, FGL2, ARPC2, HLA-DMB, IGSF6, CPVL, MS4A7, FCN1, CTSS 
## Negative:  TM4SF1, CALD1, NUPR1, S100A16, SEPP1, ARHGAP29, GNG11, TFPI, NNMT, CNN3 
##     RAMP2, IFI27, RNASE1, VTN, C8orf4, EGFL7, AMBP, SLC9A3R2, TINAGL1, APOH 
##     TTR, S100A13, CRIP2, AQP1, WBP5, ASS1, NGFRAP1, RBP4, FGG, ALB 
## PC_ 4 
## Positive:  APOH, APOC3, TTR, FABP1, HP, RBP4, APOA2, ORM1, FGG, ALDOB 
##     FGB, ADH1B, AZGP1, GATM, ALB, MT1G, GSTA1, AHSG, APOC1, HPD 
##     ADH4, APOA1, AGXT, AMBP, APOC2, PCK1, ANGPTL3, VTN, CFHR2, SAA4 
## Negative:  RNASE1, GNG11, RAMP2, EGFL7, TM4SF1, RAMP3, HYAL2, AC011526.1, CLDN5, ELTD1 
##     HSPG2, ESAM, IFI27, IGFBP7, SPARC, ARHGAP29, CLEC14A, CRIP2, A2M, S100A16 
##     FAM167B, CAV1, TINAGL1, WBP5, IL33, PTRF, LDB2, INSR, FCN3, MGP 
## PC_ 5 
## Positive:  TMSB4X, ACTB, IL32, PFN1, TM4SF1, RNASE1, GNG11, EGFL7, RAMP2, MT2A 
##     CCL5, IFI27, RAMP3, S100A16, AC011526.1, ARHGAP29, GZMA, ELTD1, IGFBP7, CFL1 
##     HSPG2, CLDN5, TMSB10, ESAM, S100A4, HYAL2, CORO1A, A2M, TINAGL1, PTMA 
## Negative:  MZB1, DERL3, TNFRSF17, IGJ, CD79A, POU2AF1, EAF2, PRDX4, IGLL5, TNFRSF13B 
##     ITM2C, SEC11C, GNG7, RP11-16E12.2, UBE2J1, LMAN1, XBP1, HSP90B1, JSRP1, C19orf10 
##     SPATS2, SSR4, PKHD1L1, PDIA4, RP5-887A10.1, RAB30, FKBP11, FCRL5, BLNK, CHPF

# Check number of PC components
ElbowPlot(seurat_obj)

#We selected 14 PCs for downstream analysis, based on Elbow plot

# Cluster and visualize
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:14)

## Computing nearest neighbor graph

## Computing SNN

seurat_obj <- FindClusters(seurat_obj, resolution = 0.7)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10317
## Number of edges: 362859
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8831
## Number of communities: 18
## Elapsed time: 1 seconds

seurat_obj <- RunUMAP(seurat_obj, dims = 1:14)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 15:55:29 UMAP embedding parameters a = 0.9922 b = 1.112

## 15:55:29 Read 10317 rows and found 14 numeric columns

## 15:55:29 Using Annoy for neighbor search, n_neighbors = 30

## 15:55:29 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 15:55:30 Writing NN index file to temp file /tmp/RtmppH73t8/file102d27f52b267
## 15:55:30 Searching Annoy index using 1 thread, search_k = 3000
## 15:55:33 Annoy recall = 100%
## 15:55:33 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 15:55:34 Initializing from normalized Laplacian + noise (using irlba)
## 15:55:34 Commencing optimization for 200 epochs, with 437226 positive edges
## 15:55:38 Optimization finished

DimPlot(seurat_obj, reduction = "umap")

Now that we have grouped similar cells into clusters, let’s move onto our mission of uncovering the secrets of cell types hidden within the dataset using ScType, a powerful tool that will aid us on our quest.

ScType is a specialized tool that automatically and quickly identifies cell types based on specific combinations of genes from the single-cell gene expression data.

They provide cell-type-specific gene sets from their in-built database (DB).

lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T)

## [[1]]
##  [1] "patchwork"    "SeuratObject" "Seurat"       "dplyr"        "GEOquery"    
##  [6] "Biobase"      "BiocGenerics" "stats"        "graphics"     "grDevices"   
## [11] "utils"        "datasets"     "methods"      "base"        
## 
## [[2]]
##  [1] "patchwork"    "SeuratObject" "Seurat"       "dplyr"        "GEOquery"    
##  [6] "Biobase"      "BiocGenerics" "stats"        "graphics"     "grDevices"   
## [11] "utils"        "datasets"     "methods"      "base"        
## 
## [[3]]
##  [1] "HGNChelper"   "patchwork"    "SeuratObject" "Seurat"       "dplyr"       
##  [6] "GEOquery"     "Biobase"      "BiocGenerics" "stats"        "graphics"    
## [11] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "openxlsx"     "HGNChelper"   "patchwork"    "SeuratObject" "Seurat"      
##  [6] "dplyr"        "GEOquery"     "Biobase"      "BiocGenerics" "stats"       
## [11] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [16] "base"

#For that, we first load 2 additional ScType functions: 
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

ScType comes with its own database of cell markers, which are genes known to be specific to particular cell types. However, it’s worth noting that we can also use our own data by preparing an input file in a specific format similar to their database.

db_ = "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
#db_ = "ScTypeDB_short.xlsx"
tissue = "Immune system"
# prepare gene sets
gs_list = gene_sets_prepare(db_, tissue)

## Warning in checkGeneSymbols(markers_all): Human gene symbols should be all
## upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

## Warning in checkGeneSymbols(markers_all): x contains non-approved gene symbols

Once we have the input data ready, we can use the ScType function called “sctype_score” to assign cell types to each cluster. This function takes both positive and negative markers (genes that provide evidence against a cell being of a specific cell type) as input. If there are no negative markers, we can simply set the negative markers argument to NULL.

# get cell-type by cell matrix
es.max = sctype_score(scRNAseqData =  seurat_obj[["RNA"]]@scale.data, scaled = TRUE, 
                      gs = gs_list$gs_positive, gs2 = gs_list$gs_negative) 

# NOTE: scRNAseqData parameter should correspond to your input scRNA-seq matrix. 
# In case Seurat is used, it is either pbmc[["RNA"]]@scale.data (default), pbmc[["SCT"]]@scale.data, in case sctransform is used for normalization,
# or pbmc[["integrated"]]@scale.data, in case a joint analysis of multiple single-cell datasets is performed.

# merge by cluster
cL_results = do.call("rbind", lapply(unique(seurat_obj@meta.data$seurat_clusters), function(cl){
    es.max.cl = sort(rowSums(es.max[ ,rownames(seurat_obj@meta.data[seurat_obj@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
    head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(seurat_obj@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores = cL_results %>% group_by(cluster) %>% top_n(n = 1, wt = scores)

# set low-confident (low ScType score) clusters to "unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] = "Unknown"
print(sctype_scores[,1:3])

## # A tibble: 18 × 3
## # Groups:   cluster [18]
##    cluster type                         scores
##    <fct>   <chr>                         <dbl>
##  1 1       CD8+ NKT-like cells           1045.
##  2 3       CD8+ NKT-like cells            943.
##  3 0       Naive CD4+ T cells            1457.
##  4 2       γδ-T cells                     570.
##  5 5       CD8+ NKT-like cells           1029.
##  6 15      Endothelial                    660.
##  7 11      Naive CD8+ T cells             162.
##  8 6       Myeloid Dendritic cells       2766.
##  9 4       CD8+ NKT-like cells            687.
## 10 10      Macrophages                   1803.
## 11 7       Non-classical monocytes       2308.
## 12 8       Natural killer  cells         1196.
## 13 13      Naive B cells                  590.
## 14 9       Memory B cells                2319.
## 15 12      Myeloid Dendritic cells        674.
## 16 16      Platelets                      100.
## 17 14      Plasmacytoid Dendritic cells   727.
## 18 17      Cancer cells                   335.

To visualize the results, we can overlay the identified cell types on a UMAP plot, which is a way to visualize high-dimensional data in two dimensions. The UMAP plot will show each cluster with different colors representing the assigned cell types.

seurat_obj@meta.data$customclassif = ""
for(j in unique(sctype_scores$cluster)){
  cl_type = sctype_scores[sctype_scores$cluster==j,]; 
  seurat_obj@meta.data$customclassif[seurat_obj@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}

DimPlot(seurat_obj, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'customclassif')

Additionally, ScType provides a way to visualize a bubble plot that shows all the cell types considered for cluster annotation. The outer (grey) bubbles correspond to each cluster, with bigger bubbles representing clusters with more cells. The inner bubbles correspond to the different cell types considered for each cluster, and the biggest bubble inside each cluster corresponds to the final assigned cell type.

# load libraries
lapply(c("ggraph","igraph","tidyverse", "data.tree"), library, character.only = T)

## Loading required package: ggplot2

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     normalize, path, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%--%()       masks igraph::%--%()
## ✖ tibble::as_data_frame() masks igraph::as_data_frame(), dplyr::as_data_frame()
## ✖ dplyr::combine()        masks Biobase::combine(), BiocGenerics::combine()
## ✖ purrr::compose()        masks igraph::compose()
## ✖ tidyr::crossing()       masks igraph::crossing()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ ggplot2::Position()     masks BiocGenerics::Position(), base::Position()
## ✖ purrr::simplify()       masks igraph::simplify()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'data.tree'
## 
## 
## The following object is masked from 'package:Biobase':
## 
##     Aggregate

## [[1]]
##  [1] "ggraph"       "ggplot2"      "openxlsx"     "HGNChelper"   "patchwork"   
##  [6] "SeuratObject" "Seurat"       "dplyr"        "GEOquery"     "Biobase"     
## [11] "BiocGenerics" "stats"        "graphics"     "grDevices"    "utils"       
## [16] "datasets"     "methods"      "base"        
## 
## [[2]]
##  [1] "igraph"       "ggraph"       "ggplot2"      "openxlsx"     "HGNChelper"  
##  [6] "patchwork"    "SeuratObject" "Seurat"       "dplyr"        "GEOquery"    
## [11] "Biobase"      "BiocGenerics" "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[3]]
##  [1] "lubridate"    "forcats"      "stringr"      "purrr"        "readr"       
##  [6] "tidyr"        "tibble"       "tidyverse"    "igraph"       "ggraph"      
## [11] "ggplot2"      "openxlsx"     "HGNChelper"   "patchwork"    "SeuratObject"
## [16] "Seurat"       "dplyr"        "GEOquery"     "Biobase"      "BiocGenerics"
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"        
## 
## [[4]]
##  [1] "data.tree"    "lubridate"    "forcats"      "stringr"      "purrr"       
##  [6] "readr"        "tidyr"        "tibble"       "tidyverse"    "igraph"      
## [11] "ggraph"       "ggplot2"      "openxlsx"     "HGNChelper"   "patchwork"   
## [16] "SeuratObject" "Seurat"       "dplyr"        "GEOquery"     "Biobase"     
## [21] "BiocGenerics" "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"

# prepare edges
cL_results=cL_results[order(cL_results$cluster),]; edges = cL_results; edges$type = paste0(edges$type,"_",edges$cluster); edges$cluster = paste0("cluster ", edges$cluster); edges = edges[,c("cluster", "type")]; colnames(edges) = c("from", "to"); rownames(edges) <- NULL

# prepare nodes
nodes_lvl1 = sctype_scores[,c("cluster", "ncells")]; nodes_lvl1$cluster = paste0("cluster ", nodes_lvl1$cluster); nodes_lvl1$Colour = "#f1f1ef"; nodes_lvl1$ord = 1; nodes_lvl1$realname = nodes_lvl1$cluster; nodes_lvl1 = as.data.frame(nodes_lvl1); nodes_lvl2 = c(); 
ccolss= c("#5f75ae","#92bbb8","#64a841","#e5486e","#de8e06","#eccf5a","#b5aa0f","#e4b680","#7ba39d","#b15928","#ffff99", "#6a3d9a","#cab2d6","#ff7f00","#fdbf6f","#e31a1c","#fb9a99","#33a02c","#b2df8a","#1f78b4","#a6cee3")

for (i in 1:length(unique(cL_results$cluster))){
  dt_tmp = cL_results[cL_results$cluster == unique(cL_results$cluster)[i], ]; nodes_lvl2 = rbind(nodes_lvl2, data.frame(cluster = paste0(dt_tmp$type,"_",dt_tmp$cluster), ncells = dt_tmp$scores, Colour = ccolss[i], ord = 2, realname = dt_tmp$type))
}
nodes = rbind(nodes_lvl1, nodes_lvl2); nodes$ncells[nodes$ncells<1] = 1;
files_db = openxlsx::read.xlsx(db_)[,c("cellName","shortName")]; files_db = unique(files_db); nodes = merge(nodes, files_db, all.x = T, all.y = F, by.x = "realname", by.y = "cellName", sort = F)
nodes$shortName[is.na(nodes$shortName)] = nodes$realname[is.na(nodes$shortName)]; nodes = nodes[,c("cluster", "ncells", "Colour", "ord", "shortName", "realname")]

#create Graph
mygraph <- graph_from_data_frame(edges, vertices=nodes)

# Set Plot Dimensions
plot_width <- 8  # Adjust the width as desired
plot_height <- 6  # Adjust the height as desired

# Make the graph
gggr<- ggraph(mygraph, layout = 'circlepack', weight=I(ncells)) + 
  geom_node_circle(aes(filter=ord==1,fill=I("#F5F5F5"), colour=I("#D3D3D3")), alpha=0.9) + geom_node_circle(aes(filter=ord==2,fill=I(Colour), colour=I("#D3D3D3")), alpha=0.9) +
  theme_void() + geom_node_text(aes(filter=ord==2, label=shortName, colour=I("#ffffff"), fill="white", repel = !1, parse = T, size = I(log(ncells,25)*1.5)))+ geom_node_label(aes(filter=ord==1,  label=shortName, colour=I("#000000"), size = I(3), fill="white", parse = T), repel = !0, segment.linetype="dotted")

## Non-leaf weights ignored

## Warning in geom_node_text(aes(filter = ord == 2, label = shortName, colour =
## I("#ffffff"), : Ignoring unknown aesthetics: fill, repel, and parse

## Warning in geom_node_label(aes(filter = ord == 1, label = shortName, colour =
## I("#000000"), : Ignoring unknown aesthetics: parse

scater::multiplot(DimPlot(seurat_obj, reduction = "umap", label = TRUE, repel = TRUE, cols = ccolss), gggr, cols = 2)

## Warning: 'scater::multiplot' is deprecated.
## Use 'gridExtra::grid.arrange' instead.
## See help("Deprecated")

# load libraries
lapply(c("ggraph","igraph","tidyverse", "data.tree"), library, character.only = T)

## [[1]]
##  [1] "data.tree"    "lubridate"    "forcats"      "stringr"      "purrr"       
##  [6] "readr"        "tidyr"        "tibble"       "tidyverse"    "igraph"      
## [11] "ggraph"       "ggplot2"      "openxlsx"     "HGNChelper"   "patchwork"   
## [16] "SeuratObject" "Seurat"       "dplyr"        "GEOquery"     "Biobase"     
## [21] "BiocGenerics" "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"        
## 
## [[2]]
##  [1] "data.tree"    "lubridate"    "forcats"      "stringr"      "purrr"       
##  [6] "readr"        "tidyr"        "tibble"       "tidyverse"    "igraph"      
## [11] "ggraph"       "ggplot2"      "openxlsx"     "HGNChelper"   "patchwork"   
## [16] "SeuratObject" "Seurat"       "dplyr"        "GEOquery"     "Biobase"     
## [21] "BiocGenerics" "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"        
## 
## [[3]]
##  [1] "data.tree"    "lubridate"    "forcats"      "stringr"      "purrr"       
##  [6] "readr"        "tidyr"        "tibble"       "tidyverse"    "igraph"      
## [11] "ggraph"       "ggplot2"      "openxlsx"     "HGNChelper"   "patchwork"   
## [16] "SeuratObject" "Seurat"       "dplyr"        "GEOquery"     "Biobase"     
## [21] "BiocGenerics" "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "data.tree"    "lubridate"    "forcats"      "stringr"      "purrr"       
##  [6] "readr"        "tidyr"        "tibble"       "tidyverse"    "igraph"      
## [11] "ggraph"       "ggplot2"      "openxlsx"     "HGNChelper"   "patchwork"   
## [16] "SeuratObject" "Seurat"       "dplyr"        "GEOquery"     "Biobase"     
## [21] "BiocGenerics" "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"

# prepare edges
cL_results=cL_results[order(cL_results$cluster),]; edges = cL_results; edges$type = paste0(edges$type,"_",edges$cluster); edges$cluster = paste0("cluster ", edges$cluster); edges = edges[,c("cluster", "type")]; colnames(edges) = c("from", "to"); rownames(edges) <- NULL

# prepare nodes
nodes_lvl1 = sctype_scores[,c("cluster", "ncells")]; nodes_lvl1$cluster = paste0("cluster ", nodes_lvl1$cluster); nodes_lvl1$Colour = "#f1f1ef"; nodes_lvl1$ord = 1; nodes_lvl1$realname = nodes_lvl1$cluster; nodes_lvl1 = as.data.frame(nodes_lvl1); nodes_lvl2 = c(); 
ccolss= c("#5f75ae","#92bbb8","#64a841","#e5486e","#de8e06","#eccf5a","#b5aa0f","#e4b680","#7ba39d","#b15928","#ffff99", "#6a3d9a","#cab2d6","#ff7f00","#fdbf6f","#e31a1c","#fb9a99","#33a02c","#b2df8a","#1f78b4","#a6cee3")
for (i in 1:length(unique(cL_results$cluster))){
  dt_tmp = cL_results[cL_results$cluster == unique(cL_results$cluster)[i], ]; nodes_lvl2 = rbind(nodes_lvl2, data.frame(cluster = paste0(dt_tmp$type,"_",dt_tmp$cluster), ncells = dt_tmp$scores, Colour = ccolss[i], ord = 2, realname = dt_tmp$type))
}
nodes = rbind(nodes_lvl1, nodes_lvl2); nodes$ncells[nodes$ncells<1] = 1;
files_db = openxlsx::read.xlsx(db_)[,c("cellName","shortName")]; files_db = unique(files_db); nodes = merge(nodes, files_db, all.x = T, all.y = F, by.x = "realname", by.y = "cellName", sort = F)
nodes$shortName[is.na(nodes$shortName)] = nodes$realname[is.na(nodes$shortName)]; nodes = nodes[,c("cluster", "ncells", "Colour", "ord", "shortName", "realname")]

mygraph <- graph_from_data_frame(edges, vertices=nodes)

# Set Plot Dimensions
plot_width <- 20  # Adjust the width as desired
plot_height <- 20  # Adjust the height as desired


# Make the graph
gggr<- ggraph(mygraph, layout = 'circlepack', weight=I(ncells)) + 
  geom_node_circle(aes(filter=ord==1,fill=I("#F5F5F5"), colour=I("#D3D3D3")), alpha=0.9) + geom_node_circle(aes(filter=ord==2,fill=I(Colour), colour=I("#D3D3D3")), alpha=0.9) +
  theme_void() + geom_node_text(aes(filter=ord==2, label=shortName, colour=I("#000000"), fill="white", repel = !1, parse = T, size = I(log(ncells,25)*1.5)))+ geom_node_label(aes(filter=ord==1,  label=shortName, colour=I("#000000"), size = I(3), fill="white", parse = T), repel = !0, segment.linetype="dotted")

## Non-leaf weights ignored

## Warning in geom_node_text(aes(filter = ord == 2, label = shortName, colour =
## I("#000000"), : Ignoring unknown aesthetics: fill, repel, and parse

## Warning in geom_node_label(aes(filter = ord == 1, label = shortName, colour =
## I("#000000"), : Ignoring unknown aesthetics: parse

#scater::multiplot(DimPlot(seurat_obj, reduction = "umap", label = TRUE, repel = TRUE, cols = ccolss), gggr, cols = 2)

print(gggr)

scater::multiplot(DimPlot(seurat_obj, reduction = "umap", label = TRUE, repel = TRUE, cols = ccolss), gggr, cols = 2)

## Warning: 'scater::multiplot' is deprecated.
## Use 'gridExtra::grid.arrange' instead.
## See help("Deprecated")

Reference for ScType: https://github.com/IanevskiAleksandr/sc-type