Introduction

Multiple toolkits and analytic frameworks have been developed to facilitate scRNA-seq data analysis. These options include but are not limit to Seurat, developed by Rahul Satija’s Lab in R, and scanpy, developed by Fabian Theis’s Lab in Python. Both toolkits provide functions and rich parameter sets that serve most of the routine analysis that one usually does on scRNA-seq data. However, one should be aware that these analytic frameworks do not cover all the interesting analyses that one can do when analyzing data. It is also important to get to know other tools for scRNA-seq data analysis.

Preparation

This Peer teaching session assumes that the sequencing data preprocessing steps, including base calling, mapping and read counting, have been done. 10x Genomics has its own analysis pipelineCellRanger for data generated with the 10x Genomics Chromium Single Cell Gene Expression Solution. At the end of the Cell Ranger pipeline, a count matrix is generated.

As part of this tutorial, we are providing two data sets (DS1 and DS3), both generated using 10x Genomics and preprocessed using Cell Ranger. They are both public scRNA-seq data of human cerebral organoids and are part of the data presented in this paper.The first part of this tutorial, which includes most of the general analysis pipeline, is based on DS1, while the second part, which focuses on data integration and batch effect correction, is based on both data sets.

Now let’s start Part 1

### Step 0. Import Seurat package
#First of all, please make sure that Seurat is installed in your R.
#data location /Users/usri/Desktop/scRNAseq_analysis_april5.2025
#Step 1. Create a Seurat object
#What the ```Read10X``` function does is to read in the matrix and rename its row names and col names by gene symbols and cell barcodes, respectively. Alternatively, one can do this manually, which is probably what one would do when the data is not generated using 10x.

#Seurat implements a new data type which is named 'Seurat'. It allows Seurat to store all the steps and results along the whole analysis. Therefore, the first step is to read in the data and create a Seurat object. Seurat has an easy solution for data generated using the 10x Genomics platform.

#Method-1 diectly call the 10x files with Seurat function 'Read10x'.

library(Seurat)
## Warning: package 'Seurat' was built under R version 4.3.3
## Loading required package: SeuratObject
## Warning: package 'SeuratObject' was built under R version 4.3.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.3.3
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t
# Initialize the Seurat object with the raw (non-normalized data).
counts <- Read10X(data.dir = "/Users/usri/Desktop/scRNAseq_analysis_april5.2025/data/DS1/")

seurat <- CreateSeuratObject(counts = counts, project = "DS1") 
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
seurat
## An object of class Seurat 
## 33694 features across 4672 samples within 1 assay 
## Active assay: RNA (33694 features, 0 variable features)
##  1 layer present: counts
saveRDS(seurat, file = 'seurat.rds')

#Method-2

library(Matrix)
# Initialize the Seurat object with the raw (non-normalized data).
counts <- readMM("/Users/usri/Desktop/scRNAseq_analysis_april5.2025/data/DS1/matrix.mtx.gz")
barcodes <- read.table("/Users/usri/Desktop/scRNAseq_analysis_april5.2025/data/DS1/barcodes.tsv.gz", stringsAsFactors=F)[,1]
features <- read.csv("/Users/usri/Desktop/scRNAseq_analysis_april5.2025/data/DS1/features.tsv.gz", stringsAsFactors=F, sep="\t", header=F)
rownames(counts) <- make.unique(features[,2])
colnames(counts) <- barcodes
seurat2 <- CreateSeuratObject(counts, project="DS1")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
## Warning: Data is of class dgTMatrix. Coercing to dgCMatrix.
seurat2
## An object of class Seurat 
## 33694 features across 4672 samples within 1 assay 
## Active assay: RNA (33694 features, 0 variable features)
##  1 layer present: counts

#Step 2. Quality control ### Step 2. Quality control After creating the Seurat object, the next step is to do quality control on the data. The most common quality control is to filter out 1. Cells with too few genes detected. They usually represent cells which are not sequenced deep enough for reliable characterization. 2. Cells with too many genes detected. They may represent doublets or multiplets (i.e. two or more cells in the same droplet, therefore sharing the same cell barcode). 3. Cells with high mitochondrial transcript percentage. As most of the scRNA-seq experiments use oligo-T to capture mRNAs, mitochondrial transcripts should be relatively under-representative due to their lack of poly-A tails, but it is unavoidable that some mitochondrial transcripts are captured. Meanwhile, there is also some evidence that stable poly-A tails exist in some mitochondrial transcripts but serve as a marker for degradation (e.g. this paper). Together, cells with high mitochondrial transcript percentage likely represent cells under stress (e.g. hypoxia) which produce a lot of mitochondria, or which produce an abnormally high amount of truncated mitochondrial transcripts.

While numbers of detected genes are summarized by Seurat automatically when creating the Seurat object (with nFeature_RNA being the number of detected genes/features; nCount_RNA being the number of detected transcripts), one needs to calculate mitochondial transcript percentages manually. Still, Seurat provides an easy solution

#calculates the percentage of mitochondrial gene expression per cell and stores it in the metadata of the Seurat object.Adds a new column to the object’s meta.data slot called "percent.mt". PercentageFeatureSet() calculates what fraction of total RNA counts comes from genes matching the pattern.pattern = "^MT-" matches human mitochondrial genes, which typically start with "MT-" (like MT-CO1, MT-ND1, etc.).for mouse data need to use mt-.
seurat[["percent.mt"]] <- PercentageFeatureSet(seurat, pattern = "^MT[-\\.]")
##check in metadata
head(seurat@meta.data, 5)
##                    orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCTGAGAATGTTG-1        DS1      16752         3287   1.098376
## AAACCTGAGAGCCTAG-1        DS1      13533         3399   1.950787
## AAACCTGAGTAATCCC-1        DS1       3098         1558   2.453196
## AAACCTGCACACTGCG-1        DS1       5158         2015   3.761148
## AAACCTGCATCGGAAG-1        DS1       6966         2322   2.182027
# visualize the mitochondrial percentages with violin plot
VlnPlot(seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), pt.size = 1)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.

#how you can interpret this violin plot
#nFeature_RNA: Gene count distribution ( Cells with very low gene counts may be empty droplets, while very high ones may be doublets (two cells captured together).
#nCount_RNA: Total RNA count distribution (Shows total RNA content per cell — extremely high or low counts can also suggest technical artifacts).
#percent.mt: Mitochondrial % distribution (High mitochondrial gene percentage (>10%) often indicates stressed or dying cells, which should be filtered out).

#if you don't like the dots (individual cells), run this.
VlnPlot(seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size=0, cols = 'darkgreen')
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.

#number of detected genes and number of detected transcripts are well correlated across cells while mitochondrial transcript percentage is not.
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
plot1 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

#Interpretaion of the scater plot
#Left Plot: percent.mt vs nCount_RNA: There's no strong correlation between total RNA and mitochondrial percentage (correlation = 0), which is expected.Cells with high percent.mt and low nCount_RNA (bottom-left) are likely low-quality or dying and These cells should be filtered out using a threshold like percent.mt < 10.
#Right Plot: nFeature_RNA vs nCount_RNA
#Strong positive correlation (0.93) indicates that more RNA counts usually mean more genes detected.This is a good sign — it shows consistent detection across cells.However, a few cells with very high counts and features may be doublets and should be checked.
#Due to the correlation of gene number and transcript number, we only need to set a cutoff to either one of these metrics, combined with an upper threshold of mitochondrial transcript percentage, for the QC. For instance, for this data set, a detected gene number between 500 and 5000, and a mitochondrial transcript percentage lower than 5% would be quite reasonable, but it is fine to use different thresholds.
#quality control filtering — to remove low-quality cells from your Seurat object.
seurat <- subset(seurat, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 5)
seurat
## An object of class Seurat 
## 33694 features across 4317 samples within 1 assay 
## Active assay: RNA (33694 features, 0 variable features)
##  1 layer present: counts
#Similar to bulk RNA-seq, the amount of captured RNA is different from cell to cell, and one should therefore not directly compare the number of captured transcripts for each gene between cells. A normalization step, aiming to make gene expression levels between different cells comparable, is therefore necessary. The most commonly used normalization in scRNA-seq data analysis is very similar to the concept of TPM (Transcripts Per Million reads) - one normalizes the feature expression measurements for each cell to the total expression, and then multiplies this by a scale factor (10000 by default). At the end, the resulting expression levels are log-transformed so that the expression values better fit a normal distribution. It is worth to mention that before doing the log-transformation, one pseudocount is added to every value so that genes with zero transcripts detected in a cell still present values of zero after log-transform. In principle there are several parameters one can set in the ```NormalizeData``` function, but most of the time the default settings are good.
seurat <- NormalizeData(seurat)
## Normalizing layer: counts
#saveRDS(seurat, file = 'Normalized.seurat.rds)
#The biggest advantage of single-cell as compared to bulk RNA-seq is the potential to look into cellular heterogeneity of samples, by looking for cell groups with distinct molecular signatures. However, not every gene has the same level of information and the same contribution when trying to identify different cell groups. For instance, genes with low expression levels, and those with similar expression levels across all cells, are not very informative and may dilute differences between distinct cell groups. Therefore, it is necessary to perform a proper feature selection before further exploring the scRNA-seq data.

#In Seurat, or more general in scRNA-seq data analysis, this step usually refers to the identification of highly variable features/genes, which are genes with the most varied expression levels across cells.
seurat <- FindVariableFeatures(seurat, nfeatures = 3000) #By default, Seurat calculates the standardized variance of each gene across cells, and picks the top 2000 ones as the highly variable features. 
## Finding variable features for layer counts
seurat
## An object of class Seurat 
## 33694 features across 4317 samples within 1 assay 
## Active assay: RNA (33694 features, 3000 variable features)
##  2 layers present: counts, data
#visualize the result in a variable feature plot.
top_features <- head(VariableFeatures(seurat), 10)
plot1 <- VariableFeaturePlot(seurat)
plot2 <- LabelPoints(plot = plot1, points = top_features, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## log-10 transformation introduced infinite values.
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#Since different genes have different base expression levels and distributions, the contribution of each gene to the analysis is different if no data transformation is performed. This is something we do not want as we don't want our analysis to only depend on genes that are highly expressed. Therefore a scaling is applied to the data using the selected features.by setting the parameter `var.to.regress` function.
#What Does “Regress Out” Mean in Seurat? 
#In Seurat, regressing out variables like gene count, RNA count, or mitochondrial percentage helps reduce their effect on analysis by adjusting gene expression through a linear model. It can improve data quality but also slows processing and doesn’t always help, so it's best to use only if those factors clearly affect your results.

seurat <- ScaleData(seurat, vars.to.regress = c("nFeature_RNA", "percent.mt"))
## Regressing out nFeature_RNA, percent.mt
## Centering and scaling data matrix
seurat
## An object of class Seurat 
## 33694 features across 4317 samples within 1 assay 
## Active assay: RNA (33694 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
#saveRDS(seurat, file = 'seurat.scaled.rds')
#A common issue with standard log-normalization in scRNA-seq is that it can create artificial zero inflation in the data. To address this, Hafemeister and Satija developed the sctransform R package, which applies a regularized negative binomial regression model for improved normalization. Seurat includes a built-in wrapper function called SCTransform to easily apply this method.
seurat <- SCTransform(seurat,
                      vars.to.regress = c("nFeature_RNA", "percent.mt"),
                      variable.features.n = 3000)
## Running SCTransform on assay: RNA
## Running SCTransform on layer: counts
## vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
## Variance stabilizing transformation of count matrix of size 17751 by 4317
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 4317 cells
## Found 92 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 17751 genes
## Computing corrected count matrix for 17751 genes
## Calculating gene attributes
## Wall clock passed: Time difference of 26.48952 secs
## Determine variable features
## Regressing out nFeature_RNA, percent.mt
## Centering data matrix
## Set default assay to SCT
seurat
## An object of class Seurat 
## 51445 features across 4317 samples within 2 assays 
## Active assay: SCT (17751 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: RNA
saveRDS(seurat, file = 'SCT.seurat.scaled.rds')

#Step 6. Linear dimensionality reduction using principal component analysis (PCA) #In principle one can start to look at cell heterogeneity after identifying highly variable genes and scaling the data.However, applying a linear dimension reduction before doing any further analysis is strongly recommended andsometimes even seen as compulsory. The benefit of doing such a dimension reduction includes but is not limitedto: #1.The data becomes much more compact so that computation becomes much faster. #2.As scRNA-seq data is intrinsically sparse, summarizing measurements of related features greatly enhancesthe signal robustness. #Any Drawback? Basically none. Well, one needs some extra lines in the script and needs to decide on the numberof reduced dimensions to use in the following analysis, but that’s it.For scRNA-seq data, the linear dimension reduction mostly refers to principal component analysis, short PCA.

#The number of principal components (PCs) that one can calculate for a data set is equal to the number of highlyvariable genes or the number of cells, whichever value is smaller. However, most of these PCs are not informativeand only represent random noise. Only the top PCs are informative and represent differences among cellpopulations. Therefore, instead of calculating all possible PCs, Seurat uses truncated PCA to only calculate thefirst PCs, by default the top 50 PCs. One can change that by setting the npcs parameter.
seurat <- RunPCA(seurat, npcs = 50)
## PC_ 1 
## Positive:  VIM, HMGB2, PTTG1, CENPF, TOP2A, UBE2C, NUSAP1, FABP7, CKS2, MKI67 
##     SMC4, CDC20, CCNB2, CDK1, CCNB1, BIRC5, HMGN2, CCNA2, CKS1B, TUBA1B 
##     TPX2, KPNA2, ASPM, PLK1, PRC1, NUF2, GTSE1, TUBB4B, PBK, HES1 
## Negative:  STMN2, RTN1, TUBA1A, GAP43, DCX, HMP19, TUBB2A, NNAT, NEUROD6, MLLT11 
##     SOX4, TMSB10, NEUROD2, GPM6A, MAPT, STMN4, PCSK1N, UCHL1, VAMP2, CELF4 
##     CD24, CRMP1, RAB3A, SCG5, TUBB2B, RBFOX2, SYT1, NSG1, BEX1, CAMK2N1 
## PC_ 2 
## Positive:  CENPF, PTTG1, STMN2, UBE2C, HMGB2, CDC20, TOP2A, NUSAP1, CCNB1, MKI67 
##     CCNB2, PLK1, ARL6IP1, KPNA2, UBE2S, TUBB4B, CCNA2, ASPM, CKS2, TPX2 
##     BIRC5, GTSE1, NUF2, SGOL2, CDK1, PRC1, AURKA, CENPE, TUBA1A, CDKN3 
## Negative:  VIM, PTN, FGFBP3, C1orf61, TTYH1, FABP7, PLP1, EEF1A1, RPL13A, RPS2 
##     RPL13, CXCR4, RPS19, RPS6, RPS3A, RPS18, RPS24, CLU, RPL41, RPS15A 
##     HES1, NTRK2, RPLP1, SPARC, HES4, RPL7, RPS14, RPS17, RPS15, ZFP36L1 
## PC_ 3 
## Positive:  NTRK2, NNAT, MAB21L1, CXCR4, PTN, LHX9, TTYH1, LHX1, NEFL, ZIC1 
##     NOVA1, CRNDE, PLP1, MEST, CLU, BTG1, EDNRB, PBX3, ZFHX3, SPARC 
##     GPM6B, ID3, TFAP2B, MAB21L2, NES, VIM, MYL6, CXXC4, LHX5-AS1, IRX3 
## Negative:  NEUROD6, NEUROD2, NFIB, NFIA, CNTNAP2, TMSB4X, SSTR2, BCL11A, LINC01551, TBR1 
##     FEZF2, SOX5, SOX11, CSRP2, FOXG1, IGFBP5, BCL11B, ABRACL, ZBTB18, EMX1 
##     TCF4, SFRP1, RASL11B, NEUROG2, C1orf61, SLA, HEY1, TSPO, SLC17A7, NHLH1 
## PC_ 4 
## Positive:  FABP7, VIM, PTN, SFRP1, C1orf61, TTYH1, RTN1, SOX2, SST, HES1 
##     VAMP2, IFI44L, LY6H, IGFBP5, CNTNAP2, MAPT, TUBB2A, SNCA, SIX3, CALY 
##     LINC01551, TXNIP, ZFP36L1, GRIA2, SCG5, RBFOX1, PRKCB, CLU, PLP1, CREB5 
## Negative:  NHLH1, HES6, CRABP1, SOX11, TUBA1A, TAGLN3, GADD45G, KLHL35, RGS16, GADD45A 
##     TMSB4X, TAC3, TFAP2B, CD24, CRNDE, BTG1, PTMA, LHX1, INSM1, ACTB 
##     SOX4, RGMB, NEUROD1, MLLT11, NHLH2, NPTX2, DCC, CITED2, EBF2, DLL1 
## PC_ 5 
## Positive:  KIAA0101, HIST1H4C, SFRP1, FABP7, TMSB4X, GINS2, PCNA, TYMS, LHX2, SYNE2 
##     MCM7, NNAT, MEIS2, RRM2, EMX2, STXBP6, DHFR, MYBL2, HELLS, DEK 
##     MCM3, CLSPN, PHLDA1, ATAD2, ORC6, AMBN, CENPU, CDT1, CENPK, CREB5 
## Negative:  NTRK2, NEUROD2, NEUROD6, CXCR4, VIM, PTN, ID2, EDNRB, PLP1, TTYH1 
##     IGFBP5, GPM6B, ID3, SPARC, CTGF, NES, CNTNAP2, FGFBP3, ZIC1, NFIB 
##     CSRP2, CRNDE, CLU, ID1, MGST1, RFX4, PDE1A, LHFP, LMO3, ARL6IP1
# Embeddings is the function in Seurat to obtain the dimension reduction result given the name of the dimension reduction of interest.By default, the RunPCA function stores the PCA result in the embedding called 'pca', with each column being one PC. So here it tells Seuratto construct the elbow plot to show the standarized variation of all the PCs that are calculated.

ElbowPlot(seurat, ndims = ncol(Embeddings(seurat, "pca")))

#As it is defined, higher-ranked PCs explain more variation in the data (have higher standard deviations) than lower-ranked PCs. However, the decrease of standard deviation is not linear. The curve of the elbow plot dropsdramatically for the first few PCs, and then slows down and becomes pretty flat. One would assume that the firstphase of the curve represents the 'real' signal related to biological differences between cell populations, while thesecond phase mostly represents technical variation or the stochastic nature of individual cells. To that perspective,choosing the top-15 PCs is probably good.
#
DimPlot(seurat, reduction = "pca", group.by = "orig.ident") #This shows the cells projected onto the first two principal components (PC_1 vs. PC_2), usually colored by identity class (Idents(seurat)).

#Why We Use a Heatmap in scRNA-seq Analysis? #In single-cell RNA-seq, a heatmap is used to visually explore gene expression patterns across cells. Specifically:

#Identify Key Genes per Principal Component (PC):After PCA, we use a heatmap (like PCHeatmap) to see which genes contribute most to each principal component. This helps us understand what biological signals (e.g., cell type, cell cycle, stress) are driving variation in the data.

#Spot Patterns Across Cells:Heatmaps let us visually compare expression of top genes across many cells, revealing clusters, gradients, or outliers.

#Guide Downstream Steps:By seeing which PCs are meaningful (and not just noise), we can decide how many PCs to keep for dimensionality reduction (e.g., UMAP) and clustering.

#Explore which genes influence each PCs.
PCHeatmap(seurat, dims = 1:4, cells = 500, balanced = TRUE, ncol = 2)

### Step 7. Non-linear dimension reduction for visualization #A linear dimension reduction has both pros and cons. The good side is that every PC is a linear combination of gene expression so interpretation of PCs are straightforward. Also the data is compressed but not disorted, therefore information in the data is largely remained. The bad side, on the other hand, is that one usually needs more than 10 PCs to cover most of the information. This is fine for most of the analysis, but not for visualization where it is impossible to go over three dimensions for ordinary persons.

#To overcome this issue, non-linear dimension reductions is introduced. The most commonly used non-linear dimension reduction methods in scRNA-seq data analysis are t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). Both methods try to place every sample in a low-dimensional space (2D/3D), so that distances or neighborhood relationships between different samples (here cells) in the original space are largely retained in the low-dimensional space.Now let’s focus on tSNE and UMAP which Seurat has included. The top PCs in the PCA analysis are used as the input to create a tSNE and UMAP embedding of the data.

#TSNE provides great visualization when cells form distinct cell groups, while UMAP perserves trajectory-like structure better when data contains ‘continuum’, e.g. the continuous cell state change during development and differentiation. It is therefore good to try both, and choose the one works better for your data.

#Once the tSNE or UMAP embedding is created, one can start to check whether certain cell types or cell states exist in the data, by doing feature plots of some known canonical markers of the cell types of interest.

seurat <- RunTSNE(seurat, dims = 1:30)
seurat <- RunUMAP(seurat, dims = 1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 12:26:03 UMAP embedding parameters a = 0.9922 b = 1.112
## 12:26:03 Read 4317 rows and found 30 numeric columns
## 12:26:03 Using Annoy for neighbor search, n_neighbors = 30
## 12:26:03 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 12:26:03 Writing NN index file to temp file /var/folders/fj/btmxy73n74zgn38kbjwfv7zh0000gn/T//Rtmpbl01xP/file148875ffba880
## 12:26:03 Searching Annoy index using 1 thread, search_k = 3000
## 12:26:04 Annoy recall = 100%
## 12:26:05 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 12:26:05 Initializing from normalized Laplacian + noise (using RSpectra)
## 12:26:06 Commencing optimization for 500 epochs, with 178630 positive edges
## 12:26:06 Using rng type: pcg
## 12:26:11 Optimization finished
seurat 
## An object of class Seurat 
## 51445 features across 4317 samples within 2 assays 
## Active assay: SCT (17751 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: RNA
##  3 dimensional reductions calculated: pca, tsne, umap
# Visualize both tSNE and UMAP
plot1 <- TSNEPlot(seurat)
plot2 <- UMAPPlot(seurat)
plot1 + plot2

plot1 <- FeaturePlot(seurat, c("MKI67","NES","DCX"),
                     ncol=3, reduction = "tsne")
plot2 <- FeaturePlot(seurat, c("MKI67","NES","DCX"),
                     ncol=3, reduction = "umap")
plot1 / plot2

### Step 8. Cluster the cells #Doing feature plot of markers is usually a good way to start with when exploring scRNA-seq data. However, to more comprehensively understand the underlying heterogeneity in the data, it is necessary to identify cell groups with an unbiased manner. This is what clustering does. In principle, one can apply any clustering methods, including those widely used in bulk RNA-seq data analysis such as hierarchical clustering and k-means, to the scRNA-seq data. However, in practice, this is very difficult, as the sample size in scRNA-seq data is too much larger (one 10x experiment usually gives several thousands of cells). It would be extremely slow to use these methods. In addition, due to the intrinsic sparseness of scRNA-seq data, even if data is denoised by dimension reduction like PCA, differences between different cells are not as well quantitative as those of bulk RNA-seq data. Therefore, the more commonly used clustering methods in scRNA-seq data analysis is graph-based community identification algorithm. Here, graph is the mathematical concept, where there is a set of objects, and some pairs of these objects are related with each other; or in a simplified way, a network of something, and here, a network of cells.

#First of all, a k-nearest neighbor network of cells is generated. Every cells is firstly connected to cells with the shortest distances, based on their corresponding PC values. Only cell pairs which are neighbors of each other are considered as connected. Proportion of shared neighbors between every cell pairs is then calculated and used to describe the strength of the connection between two cells. Weak connections are trimmed. This gives the resulted Shared Nearest Neighbor (SNN) network. In practice, this is very simple in Seurat.

#FindNeighbors() function in  Seurat, tell us about which cells are close to each other, based on their gene expression patterns in PCs 1–30.
#builds a nearest neighbor graph based on the first 30 principal components (PCs)

seurat <- FindNeighbors(seurat, dims = 1:30)
## Computing nearest neighbor graph
## Computing SNN
#Clustering the cells:resolution``` parameter is used to control whether the major and coarsed cell groups (e.g. major cell types), or the smaller but finer cell groups are returned (e.g. cell subtypes). The commonly used resolution ranges between 0.1 and 1, and which is the best option largely depends on the aim of the analysis. Here, a high resolution parameter is used to get a finer clustering. One can run multiple times of the ```FindClusters``` function with different resolutions. The newest clustering result can be obtained by ```Idents(seurat)``` or ```seurat@active.ident```. Other clustering results are also stored, as different columns in the meta.data slot 'seurat@meta.data.

#FindClusters() groups your cells into similar expression-based clusters.The resolution controls cluster granularity.

seurat <- FindClusters(seurat, resolution = 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4317
## Number of edges: 148203
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8532
## Number of communities: 15
## Elapsed time: 0 seconds
seurat
## An object of class Seurat 
## 51445 features across 4317 samples within 2 assays 
## Active assay: SCT (17751 features, 3000 variable features)
##  3 layers present: counts, data, scale.data
##  1 other assay present: RNA
##  3 dimensional reductions calculated: pca, tsne, umap
saveRDS(seurat, file = 'Clustered.seurat.rds')
head(seurat@meta.data)
##                    orig.ident nCount_RNA nFeature_RNA percent.mt nCount_SCT
## AAACCTGAGAATGTTG-1        DS1      16752         3287   1.098376       6978
## AAACCTGAGAGCCTAG-1        DS1      13533         3399   1.950787       7361
## AAACCTGAGTAATCCC-1        DS1       3098         1558   2.453196       6011
## AAACCTGCACACTGCG-1        DS1       5158         2015   3.761148       5970
## AAACCTGCATCGGAAG-1        DS1       6966         2322   2.182027       6708
## AAACCTGGTGTAACGG-1        DS1       4108         1542   2.507303       5889
##                    nFeature_SCT SCT_snn_res.1 seurat_clusters
## AAACCTGAGAATGTTG-1         2086             6               6
## AAACCTGAGAGCCTAG-1         2862             6               6
## AAACCTGAGTAATCCC-1         1589             2               2
## AAACCTGCACACTGCG-1         2000             2               2
## AAACCTGCATCGGAAG-1         2308             8               8
## AAACCTGGTGTAACGG-1         1538             3               3
#visualize the clustering result using the tSNE and UMAP embeddings.
plot1 <- DimPlot(seurat, reduction = "tsne", label = TRUE)
plot2 <- DimPlot(seurat, reduction = "umap", label = TRUE)
plot1 + plot2

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Find marker genes for each cluster
cl_markers <- FindAllMarkers(
  seurat,
  only.pos = TRUE,                  # Only return genes that are upregulated
  min.pct = 0.25,                   # Keep genes expressed in at least 25% of cells in a cluster
  logfc.threshold = log(1.2)        # Keep genes with log fold change > log(1.2), (log(1.2) ≈ 0.18 )
)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
head(cl_markers, 10)
##                   p_val avg_log2FC pct.1 pct.2     p_val_adj cluster      gene
## SFRP1     8.851487e-299  2.4226325 0.963 0.324 1.571228e-294       0     SFRP1
## IFI44L    1.552503e-266  2.7810384 0.758 0.151 2.755849e-262       0    IFI44L
## C1orf61   5.584454e-229  1.7354731 1.000 0.751 9.912965e-225       0   C1orf61
## FABP7     6.637338e-216  2.0732509 0.976 0.552 1.178194e-211       0     FABP7
## ZFP36L1   2.325831e-202  1.7960452 0.916 0.311 4.128583e-198       0   ZFP36L1
## LINC01551 1.402614e-198  1.8982363 0.904 0.336 2.489780e-194       0 LINC01551
## HES1      2.821944e-194  1.6702571 0.900 0.285 5.009233e-190       0      HES1
## CREB5     9.020389e-189  2.2638340 0.700 0.176 1.601209e-184       0     CREB5
## SOX2      6.048017e-163  1.3859093 0.981 0.497 1.073583e-158       0      SOX2
## RPS6      3.806056e-159  0.9522958 0.999 0.998 6.756131e-155       0      RPS6
#Because of the nature of large sample size in scRNA-seq data (one cell is one sample), it is strongly recommended to not only look at p-values, but also detection rate of the gene in the cluster (```pct```) and fold change (```logfc```) between cells in and outside the cluster. That's why there are options ```min.pct``` and ```logfc.threshold``` in the function to require threshold on the effect size.

#Extract top 2 markers per cluster
cl_markers.top2 <- cl_markers %>%
  group_by(cluster) %>%
  top_n(n = 2, wt = avg_log2FC)
head(cl_markers.top2, 10)
## # A tibble: 10 × 7
## # Groups:   cluster [5]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene   
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>  
##  1 1.55e-266       2.78 0.758 0.151 2.76e-262 0       IFI44L 
##  2 2.57e-107       3.00 0.296 0.043 4.55e-103 0       C8orf4 
##  3 0               4.77 0.996 0.124 0         1       NEUROD2
##  4 0               6.03 0.579 0.014 0         1       SLA    
##  5 1.41e-189       3.48 0.561 0.084 2.50e-185 2       LHX1   
##  6 5.57e- 95       4.10 0.268 0.033 9.89e- 91 2       TFAP2A 
##  7 1.60e-231       4.60 0.424 0.022 2.84e-227 3       ENPP2  
##  8 7.94e-168       4.48 0.386 0.032 1.41e-163 3       FRZB   
##  9 0               5.64 0.911 0.052 0         4       LHX9   
## 10 1.09e-213       6.02 0.328 0.009 1.93e-209 4       TFAP2D
# Load libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(RColorBrewer)
library(scales)

# Prepare cluster data
cluster_counts <- table(seurat$seurat_clusters)
cluster_df <- as.data.frame(cluster_counts)
colnames(cluster_df) <- c("cluster", "cell_count")

# Calculate fractions and percentages
cluster_df$fraction <- cluster_df$cell_count / sum(cluster_df$cell_count)
cluster_df$percent <- round(cluster_df$fraction * 100, 2)

# Set number of clusters
n_clusters <- length(unique(cluster_df$cluster))

# Use a dark color palette (e.g., Dark2 from RColorBrewer)
custom_colors <- colorRampPalette(brewer.pal(8, "Dark2"))(n_clusters)
names(custom_colors) <- levels(factor(cluster_df$cluster))

#check table 
head(cluster_df, 10)
##    cluster cell_count   fraction percent
## 1        0        669 0.15496873   15.50
## 2        1        475 0.11003011   11.00
## 3        2        462 0.10701876   10.70
## 4        3        389 0.09010887    9.01
## 5        4        348 0.08061154    8.06
## 6        5        346 0.08014825    8.01
## 7        6        323 0.07482048    7.48
## 8        7        261 0.06045865    6.05
## 9        8        251 0.05814223    5.81
## 10       9        226 0.05235117    5.24
#save table 
write.csv(cluster_df[, c("cluster", "cell_count", "fraction", "percent")],
          file = "cluster_summary.csv", row.names = FALSE)

# Bar plot of cluster fractions with percent labels
p1 <- ggplot(cluster_df, aes(x = factor(cluster), y = fraction, fill = factor(cluster))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::percent(fraction, accuracy = 0.1)), 
            vjust = -0.5, size = 4, color = "black") +  # add percent labels
  scale_fill_manual(values = custom_colors, name = "Cluster") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Fraction of Cells per Cluster",
    x = "Cluster",
    y = "Fraction"
  ) +
  theme_classic(base_size = 14) +
  theme(legend.position = "none")


# Bar plot of absolute cell counts with count labels
p2 <- ggplot(cluster_df, aes(x = factor(cluster), y = cell_count, fill = factor(cluster))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = cell_count), vjust = -0.5, size = 4, color = "black") +  # add count labels
  scale_fill_manual(values = custom_colors, name = "Cluster") +
  labs(
    title = "Number of Cells per Cluster",
    x = "Cluster",
    y = "Cell Count"
  ) +
  theme_classic(base_size = 14) +
  theme(legend.position = "none")
# Print plots
p1

p2

# Define custom colors for clusters
custom_colors <- c(
  "0"  = "#1B9E77",
  "1"  = "#7A7E3C",
  "2"  = "#D95F02",
  "3"  = "#A7675A",
  "4"  = "#7570B3",
  "5"  = "#AD4C9E",
  "6"  = "#E7298A",
  "7"  = "#A66753",
  "8"  = "#66A61E",
  "9"  = "#A6A810",
  "10" = "#E6AB02",
  "11" = "#C5900F",
  "12" = "#A6761D",
  "13" = "#866E41",
  "14" = "#666666"
)

# Get top 2 markers per cluster
top2_cl_markers <- cl_markers %>%
  group_by(cluster) %>%
  top_n(n = 2, wt = avg_log2FC)

# Ensure Idents match cluster identities
Idents(seurat) <- "seurat_clusters"

# Draw heatmap with custom cluster colors
DoHeatmap(
  seurat,
  features = top2_cl_markers$gene,
  group.colors = custom_colors
) + NoLegend()
## Warning in DoHeatmap(seurat, features = top2_cl_markers$gene, group.colors =
## custom_colors): The following features were omitted as they were not found in
## the scale.data slot for the SCT assay: KDELR3, E2F8

#One can also check those markers of different clusters in more details, by doing feature plot or violin plot for them. For instance, we can use NEUROD2 and NEUROD6 as very strong markers of cluster 2, so let's take a closer look
plot1 <- FeaturePlot(seurat, c("NEUROD2","NEUROD6"), ncol = 2)
plot2 <- VlnPlot(seurat, features = c("NEUROD2","NEUROD6"), pt.size = 0.5, ncol = 1)
plot1 

plot2

# Define known marker genes for various brain regions and cell types
ct_markers <- c(
  "MKI67", "NES", "DCX", "FOXG1",             # Cell cycle, neural progenitors, neurons, telencephalon
  "DLX2", "DLX5", "ISL1", "SIX3",  # Ventral telencephalon markers
   "GLI3", "EOMES", "NEUROD6", # Dorsal telencephalon markers
  "RSPO3", "OTX2", "LHX9", "TFAP2A", "RELN"# Non-telencephalon related markers
)

# Create a DotPlot showing expression of these genes across clusters
DotPlot(object = seurat, features = ct_markers, group.by = 'seurat_clusters') +
  # Rotate x-axis labels for readability
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  # Customize color gradient (expression level)
  scale_color_gradient(low = "lightgrey", high = "darkblue") +
   # Optional: better layout
  labs(title = "Marker Gene Expression Across Clusters",
       x = "Genes",
       y = "Clusters") +
  
  theme_minimal()
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

saveRDS(seurat, file = "seurat.DS1.rds")