First, we load our pre-processed Seurat Object (SO) from the local directory. This object contains our raw single-cell RNA-seq data.
Code
# Define the file pathdata_path <-"/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/data/raw/SO_PD_Sohei.rds"# Import the Seurat objectSO <-readRDS(data_path)
Step 2: Quality Control (QC)
Data quality control is a critical step in single-cell RNA-seq analysis. We need to filter out low-quality cells, empty droplets, dying cells, and doublets (two cells captured in one droplet) to prevent them from biasing our downstream clustering and UMAP projections.
2.1 Identify Doublets with scDblFinder
We use the scDblFinder package to identify potential doublets. Because this dataset contains multiple samples (e.g., different patients or conditions), it is crucial to use the samples argument. This ensures doublets are calculated within each specific sample, avoiding false positives.
Code
# 1. Convert Seurat objectsce <-as.SingleCellExperiment(SO, assay ="RNA")# 2. Run scDblFinder (verbose = FALSE to hide the progress bar)sce <-scDblFinder(sce, samples ="Sample", verbose =FALSE)# 3. Add the results back to metadataSO$scDblFinder.score <- sce$scDblFinder.scoreSO$scDblFinder.class <- sce$scDblFinder.class
2.2 General QC and Final Filtering
Next, we calculate the percentage of reads mapping to the mitochondrial genome. A high mitochondrial rate typically indicates a stressed or dying cell where the cytoplasmic RNA has leaked out.
We then apply our strict biological filtering criteria:
Gene count (nFeature_RNA): Keep cells with between 200 and 7,000 detected genes.
Mitochondrial rate (percent.mt): Keep cells with 10% or less mitochondrial reads.
Doublets: Retain ONLY the cells classified as “singlet”.
After executing the filters, we check the dimensions to see how many high-quality cells remain. Finally, we visualize the distributions of our key metrics using Violin plots to confirm our dataset is clean and ready for dimensionality reduction.
Code
# 1. Check final dimensions (Rows = Genes, Columns = Cells)print("Final SO dimensions after all QC:")
[1] "Final SO dimensions after all QC:"
Code
dim(SO)
[1] 47773 43328
Code
# 2. Final visualizationSeurat::VlnPlot(SO, features =c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol =3, pt.size =0.1)
Violin plots showing the distribution of genes, RNA counts, and mitochondrial percentages after QC filtering.
Step 3: Data Normalization
After removing low-quality cells and doublets, the next step is to normalize the raw gene expression counts.
In single-cell RNA sequencing, different cells often have different sequencing depths (some cells just happen to get sequenced more deeply than others due to technical capture variations). To make the gene expression levels truly comparable across all cells, we apply a global-scaling normalization method called LogNormalize.
This method normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 is the standard), and log-transforms the result.
Code
# Normalize the raw count dataSO <-NormalizeData(SO, normalization.method ="LogNormalize", scale.factor =10000)
Step 4: Feature Selection (Highly Variable Genes)
To reduce the computational burden and focus our analysis on biological signal rather than technical noise, we identify a subset of features (genes) that exhibit high cell-to-cell variation in the dataset.
These are genes that are highly expressed in some cells, and lowly expressed in others. We use the vst (variance-stabilizing transformation) method to select the top 2,000 highly variable genes. This method helps control for the relationship between a gene’s mean expression and its variance.
Code
# Identify the top 2000 most highly variable genesSO <-FindVariableFeatures(SO, selection.method ="vst", nfeatures =2000)
4.1 Visualizing Variable Features
To verify our feature selection, we extract the top 10 most variable genes and plot the variance of all genes against their mean expression. The highly variable genes are highlighted, with the top 10 explicitly labeled.
Code
# Extract the top 10 highly variable genestop10 <-head(VariableFeatures(SO), 10)print("Top 10 Variable Genes:")
# Create the variance plotp1 <-VariableFeaturePlot(SO)# Add labels to the top 10 genesp2 <-LabelPoints(plot = p1, points = top10, repel =TRUE, xnudge =0, ynudge =0)# Display the plotp2
Scatter plot of gene variance vs. mean expression. The top 2000 highly variable genes are highlighted in red, with the top 10 labeled.
Step 5: Scaling Data and Regressing Nuisance Variables
Before performing Principal Component Analysis (PCA), we must scale and center the data. This ensures that highly expressed genes do not dominate the analysis and that each gene carries equal weight.
Furthermore, in neurological research, we want to cluster cells based on their true biological identity (e.g., neurons vs. glia), not by their current cell-cycle phase or cellular stress levels. Therefore, we score the cells for their cell-cycle phase and regress out these “nuisance” variables alongside mitochondrial percentages and total read counts.
5.1 Initial Scaling and Cell Cycle Scoring
To prevent memory limit crashes, we first scale only the genes necessary for cell-cycle scoring and our previously identified highly variable genes.
Now that we have the cell cycle scores (S.Score and G2M.Score), we perform the final scaling on our variable features while explicitly regressing out the technical and cell-cycle noise. This isolates the pure biological signal needed for our downstream PD research.
Code
# 1. Define variables to regress outvars_to_regress <-c("nCount_RNA", "S.Score", "G2M.Score", "percent.mt")# 2. Perform final scaling with regressionSO <-ScaleData(SO, features = var_genes, vars.to.regress = vars_to_regress)
Step 6: Dimensionality Reduction (PCA)
With our data scaled and technical noise regressed out, we now apply Principal Component Analysis (PCA).
Single-cell data is incredibly high-dimensional (we are looking at ~2,000 highly variable genes at once). PCA condenses this information into a smaller number of “Principal Components” (PCs). Each PC represents a major axis of variance in the dataset.
Code
# Run PCA on the highly variable featuresSO <-RunPCA(SO, npcs =20, features =VariableFeatures(object = SO))
6.1 PCA Visualization
First, we plot PC1 against PC2 to ensure our cells are separating based on major biological variances.
Next, we generate an Elbow Plot. This is a critical diagnostic visualization. It ranks the principal components based on the percentage of variance they explain. We look for the “elbow” (the point where the graph sharply levels off) to determine how many PCs are actually meaningful. We will use this number for our final clustering and UMAP.
Code
# Plot PC1 vs. PC2DimPlot(object = SO, reduction ="pca") +NoLegend()# Visualize the Elbow Plot to determine dimensionalityElbowPlot(object = SO, ndims =20, reduction ="pca")
Cells plotted along the first two principal components.
Elbow Plot determining the optimal number of PCs.
Step 7: Clustering Cells
After dimensionality reduction, we mathematically group the cells based on their similar gene expression profiles. This is a two-step process in Seurat: first, we construct a K-nearest neighbor (KNN) graph, and second, we apply a modularity optimization algorithm to identify the actual clusters.
7.1 Memory Allocation
Single-cell clustering requires significant computational memory. To prevent the R session from crashing due to memory limits during parallel processing, we increase the maximum allowable global object size to 4 GB.
Code
# Increase future.globals.maxSize to 4 GBoptions(future.globals.maxSize =4*1024^3)
7.2 Finding Neighbors
We use the FindNeighbors function to construct the KNN graph based on the euclidean distance in PCA space. We use the first 20 Principal Components, as determined by the Elbow Plot in the previous step.
Code
# Find neighbors using the first 20 PCsSO <-FindNeighbors(SO, dims =1:20)
7.3 Finding Clusters
Next, we identify the actual clusters using the Louvain algorithm. We set the resolution parameter to 0.5. A higher resolution will yield more, smaller clusters, while a lower resolution will yield fewer, larger clusters.
Note: We set random.seed = 123 to ensure this exact clustering result is mathematically reproducible.
Now that our cells are mathematically clustered in high-dimensional space, we need a way to visualize them.
For this, we use UMAP (Uniform Manifold Approximation and Projection). Unlike PCA, which is a linear dimensionality reduction method, UMAP is a non-linear technique. It excels at taking the complex, high-dimensional neighborhood graph we built in the previous step and flattening it into 2D space while preserving both the local clustering structure and the broader global relationships between different cell types.
Cells that have highly similar gene expression profiles will cluster closely together, forming distinct visual “islands.”
Code
# 1. Calculate UMAP using the same 20 Principal ComponentsSO <-RunUMAP(object = SO, dims =1:20)# 2. Generate the UMAP plot with cluster labelsumap_clusters <-DimPlot(SO, reduction ="umap", label =TRUE) +NoLegend()# 3. Display the plotumap_clusters
UMAP projection of single cells. Each dot represents a single cell, colored by its mathematically assigned cluster.