PD_SO

Author

Sohei Yoshimura

Step 0: Prepare R environment

Analysis in bioinformatics requires specific tools. In this step, we ensure all necessary libraries are installed and loaded.

Code

# 1. Install BiocManager
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# 2. Install scDblFinder
if (!require("scDblFinder", quietly = TRUE))
  BiocManager::install("scDblFinder")

# 3. Load libraries
library(Seurat)
library(tidyverse)
library(ggpubr)
library(patchwork)
library(scDblFinder)
library(SingleCellExperiment)

print("Environment is ready!")

[1] "Environment is ready!"

Step 1. Importing the Data

First, we load our pre-processed Seurat Object (SO) from the local directory. This object contains our raw single-cell RNA-seq data.

Code

# Define the file path
data_path <- "/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/data/raw/SO_PD_Sohei.rds"

# Import the Seurat object
SO <- readRDS(data_path)

Step 2: Quality Control (QC)

Data quality control is a critical step in single-cell RNA-seq analysis. We need to filter out low-quality cells, empty droplets, dying cells, and doublets (two cells captured in one droplet) to prevent them from biasing our downstream clustering and UMAP projections.

2.1 Identify Doublets with scDblFinder

We use the scDblFinder package to identify potential doublets. Because this dataset contains multiple samples (e.g., different patients or conditions), it is crucial to use the samples argument. This ensures doublets are calculated within each specific sample, avoiding false positives.

Code

# 1. Convert Seurat object
sce <- as.SingleCellExperiment(SO, assay = "RNA")

# 2. Run scDblFinder (verbose = FALSE to hide the progress bar)
sce <- scDblFinder(sce, samples = "Sample", verbose = FALSE)

# 3. Add the results back to metadata
SO$scDblFinder.score <- sce$scDblFinder.score
SO$scDblFinder.class <- sce$scDblFinder.class

2.2 General QC and Final Filtering

Next, we calculate the percentage of reads mapping to the mitochondrial genome. A high mitochondrial rate typically indicates a stressed or dying cell where the cytoplasmic RNA has leaked out.

We then apply our strict biological filtering criteria:

Gene count (nFeature_RNA): Keep cells with between 200 and 7,000 detected genes.

Mitochondrial rate (percent.mt): Keep cells with 10% or less mitochondrial reads.

Doublets: Retain ONLY the cells classified as “singlet”.

Code

# 1. Calculate mitochondrial rates
SO[["percent.mt"]] <- Seurat::PercentageFeatureSet(SO, pattern = "^MT-")

# 2. Apply filtering logic
SO <- subset(SO, 
             subset = nFeature_RNA > 200 & 
               nFeature_RNA <= 7000 & 
               percent.mt <= 10 & 
               scDblFinder.class == "singlet")

2.3 Final QC Inspection

After executing the filters, we check the dimensions to see how many high-quality cells remain. Finally, we visualize the distributions of our key metrics using Violin plots to confirm our dataset is clean and ready for dimensionality reduction.

Code

# 1. Check final dimensions (Rows = Genes, Columns = Cells)
print("Final SO dimensions after all QC:")

[1] "Final SO dimensions after all QC:"

Code

dim(SO)

[1] 47773 43328

Code

# 2. Final visualization
Seurat::VlnPlot(SO, 
                features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
                ncol = 3, 
                pt.size = 0.1)

Violin plots showing the distribution of genes, RNA counts, and mitochondrial percentages after QC filtering.

Step 3: Data Normalization

After removing low-quality cells and doublets, the next step is to normalize the raw gene expression counts.

In single-cell RNA sequencing, different cells often have different sequencing depths (some cells just happen to get sequenced more deeply than others due to technical capture variations). To make the gene expression levels truly comparable across all cells, we apply a global-scaling normalization method called LogNormalize.

This method normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 is the standard), and log-transforms the result.

Code

# Normalize the raw count data
SO <- NormalizeData(SO, 
                    normalization.method = "LogNormalize", 
                    scale.factor = 10000)

Step 4: Feature Selection (Highly Variable Genes)

To reduce the computational burden and focus our analysis on biological signal rather than technical noise, we identify a subset of features (genes) that exhibit high cell-to-cell variation in the dataset.

These are genes that are highly expressed in some cells, and lowly expressed in others. We use the vst (variance-stabilizing transformation) method to select the top 2,000 highly variable genes. This method helps control for the relationship between a gene’s mean expression and its variance.

Code

# Identify the top 2000 most highly variable genes
SO <- FindVariableFeatures(SO, 
                           selection.method = "vst", 
                           nfeatures = 2000)

4.1 Visualizing Variable Features

To verify our feature selection, we extract the top 10 most variable genes and plot the variance of all genes against their mean expression. The highly variable genes are highlighted, with the top 10 explicitly labeled.

Code

# Extract the top 10 highly variable genes
top10 <- head(VariableFeatures(SO), 10)
print("Top 10 Variable Genes:")

[1] "Top 10 Variable Genes:"

Code

print(top10)

 [1] "NRG1"    "KCNQ5"   "ROBO2"   "SST"     "NEFL"    "GALNTL6" "NEFM"   
 [8] "CNTN5"   "HBB"     "SYT1"

Code

# Create the variance plot
p1 <- VariableFeaturePlot(SO)

# Add labels to the top 10 genes
p2 <- LabelPoints(plot = p1, 
                  points = top10, 
                  repel = TRUE, 
                  xnudge = 0, 
                  ynudge = 0)

# Display the plot
p2

Scatter plot of gene variance vs. mean expression. The top 2000 highly variable genes are highlighted in red, with the top 10 labeled.

Step 5: Scaling Data and Regressing Nuisance Variables

Before performing Principal Component Analysis (PCA), we must scale and center the data. This ensures that highly expressed genes do not dominate the analysis and that each gene carries equal weight.

Furthermore, in neurological research, we want to cluster cells based on their true biological identity (e.g., neurons vs. glia), not by their current cell-cycle phase or cellular stress levels. Therefore, we score the cells for their cell-cycle phase and regress out these “nuisance” variables alongside mitochondrial percentages and total read counts.

5.1 Initial Scaling and Cell Cycle Scoring

To prevent memory limit crashes, we first scale only the genes necessary for cell-cycle scoring and our previously identified highly variable genes.

Code

# 1. Define genes needed (Top 2000 variable genes + Cell cycle markers)
var_genes <- VariableFeatures(SO)
cc_genes <- c(cc.genes$s.genes, cc.genes$g2m.genes)
genes_to_scale <- unique(c(var_genes, cc_genes))

# 2. Initial scaling of selected features
SO <- ScaleData(SO, features = genes_to_scale)

# 3. Cell Cycle Scoring (Classifying cells into G1, S, or G2M)
SO <- CellCycleScoring(SO, 
                       s.features = cc.genes$s.genes, 
                       g2m.features = cc.genes$g2m.genes, 
                       set.ident = TRUE)

5.2 Final Scaling and Regression

Now that we have the cell cycle scores (S.Score and G2M.Score), we perform the final scaling on our variable features while explicitly regressing out the technical and cell-cycle noise. This isolates the pure biological signal needed for our downstream PD research.

Code

# 1. Define variables to regress out
vars_to_regress <- c("nCount_RNA", "S.Score", "G2M.Score", "percent.mt")

# 2. Perform final scaling with regression
SO <- ScaleData(SO, 
                features = var_genes, 
                vars.to.regress = vars_to_regress)

Step 6: Dimensionality Reduction (PCA)

With our data scaled and technical noise regressed out, we now apply Principal Component Analysis (PCA).

Single-cell data is incredibly high-dimensional (we are looking at ~2,000 highly variable genes at once). PCA condenses this information into a smaller number of “Principal Components” (PCs). Each PC represents a major axis of variance in the dataset.

Code

# Run PCA on the highly variable features
SO <- RunPCA(SO, 
             npcs = 20, 
             features = VariableFeatures(object = SO))

6.1 PCA Visualization

First, we plot PC1 against PC2 to ensure our cells are separating based on major biological variances.

Next, we generate an Elbow Plot. This is a critical diagnostic visualization. It ranks the principal components based on the percentage of variance they explain. We look for the “elbow” (the point where the graph sharply levels off) to determine how many PCs are actually meaningful. We will use this number for our final clustering and UMAP.

Code

# Plot PC1 vs. PC2
DimPlot(object = SO, reduction = "pca") + NoLegend()
# Visualize the Elbow Plot to determine dimensionality
ElbowPlot(object = SO, ndims = 20, reduction = "pca")

Cells plotted along the first two principal components.

Elbow Plot determining the optimal number of PCs.

Step 7: Clustering Cells

After dimensionality reduction, we mathematically group the cells based on their similar gene expression profiles. This is a two-step process in Seurat: first, we construct a K-nearest neighbor (KNN) graph, and second, we apply a modularity optimization algorithm to identify the actual clusters.

7.1 Memory Allocation

Single-cell clustering requires significant computational memory. To prevent the R session from crashing due to memory limits during parallel processing, we increase the maximum allowable global object size to 4 GB.

Code

# Increase future.globals.maxSize to 4 GB
options(future.globals.maxSize = 4 * 1024^3)

7.2 Finding Neighbors

We use the FindNeighbors function to construct the KNN graph based on the euclidean distance in PCA space. We use the first 20 Principal Components, as determined by the Elbow Plot in the previous step.

Code

# Find neighbors using the first 20 PCs
SO <- FindNeighbors(SO, dims = 1:20)

7.3 Finding Clusters

Next, we identify the actual clusters using the Louvain algorithm. We set the resolution parameter to 0.5. A higher resolution will yield more, smaller clusters, while a lower resolution will yield fewer, larger clusters.

Note: We set random.seed = 123 to ensure this exact clustering result is mathematically reproducible.

Code

# Find clusters at resolution 0.5
SO <- FindClusters(SO, 
                   resolution = 0.5, 
                   random.seed = 123)

Step 8: Visualizing Cell Clusters (UMAP)

Now that our cells are mathematically clustered in high-dimensional space, we need a way to visualize them.

For this, we use UMAP (Uniform Manifold Approximation and Projection). Unlike PCA, which is a linear dimensionality reduction method, UMAP is a non-linear technique. It excels at taking the complex, high-dimensional neighborhood graph we built in the previous step and flattening it into 2D space while preserving both the local clustering structure and the broader global relationships between different cell types.

Cells that have highly similar gene expression profiles will cluster closely together, forming distinct visual “islands.”

Code

# 1. Calculate UMAP using the same 20 Principal Components
SO <- RunUMAP(object = SO, dims = 1:20)

# 2. Generate the UMAP plot with cluster labels
umap_clusters <- DimPlot(SO, 
                         reduction = "umap", 
                         label = TRUE) + NoLegend()

# 3. Display the plot
umap_clusters

UMAP projection of single cells. Each dot represents a single cell, colored by its mathematically assigned cluster.