PD_SO

Author

Sohei Yoshimura

Step 0: Prepare R environment

Analysis in bioinformatics requires specific tools. In this step, we ensure all necessary libraries are installed and loaded.

Code

# 1. Install BiocManager
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# 2. Install scDblFinder
if (!require("scDblFinder", quietly = TRUE))
  BiocManager::install("scDblFinder")

# 3. Load libraries
library(Seurat)
library(tidyverse)
library(ggpubr)
library(patchwork)
library(scDblFinder)
library(SingleCellExperiment)
library(DT)
print("Environment is ready!")

[1] "Environment is ready!"

Step 1. Importing the Data

First, we load our pre-processed Seurat Object (SO) from the local directory. This object contains our raw single-cell RNA-seq data.

Code

# Define the file path
data_path <- "/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/data/raw/SO_PD_Sohei.rds"

# Import the Seurat object
SO <- readRDS(data_path)

1.1 Data Summary

1. How many genes and cells total in the data? To understand the scope of our dataset, we need to determine the total number of genes and cells, as well as the distribution of cells across different libraries (patients).

Code

# Print high-level overview
SO

An object of class Seurat 
47773 features across 46066 samples within 1 assay 
Active assay: RNA (47773 features, 0 variable features)
 1 layer present: counts

Code

# Check exact dimensions (Rows = Genes, Columns = Cells)
dim(SO)

[1] 47773 46066

To ensure our data is balanced and to see the contribution of each patient, we tabulate the cells based on their sample origin in the metadata.

Code

# First, let's verify our metadata columns
head(colnames(SO@meta.data), 10)

[1] "orig.ident"   "nCount_RNA"   "nFeature_RNA" "Sample"       "condition"   
[6] "Sex"

Code

# Count the number of cells per library (patient)
# Sorted in descending order and formatted with commas for readability
table(SO$Sample) |> 
  sort(decreasing = TRUE) |> 
  format(big.mark = ",") |> 
  print(quote = FALSE)


NOSN03_F PDSN02_F NOSN04_M PDSN09_M PDSN04_F PDSN01_F NOSN05_M NOSN01_F 
   9,557    6,604    4,977    4,343    3,868    3,592    3,460    3,422 
PDSN06_M NOSN06_M PDSN07_M NOSN02_F 
   3,130    1,157    1,002      954

Step 2: Quality Control (QC)

Data quality control is a critical step in single-cell RNA-seq analysis. We need to filter out low-quality cells, empty droplets, dying cells, and doublets (two cells captured in one droplet) to prevent them from biasing our downstream clustering and UMAP projections.

2.1 Identify Doublets with scDblFinder

We use the scDblFinder package to identify potential doublets. Because this dataset contains multiple samples (e.g., different patients or conditions), it is crucial to use the samples argument. This ensures doublets are calculated within each specific sample, avoiding false positives.

Code

# 1. Convert Seurat object
sce <- as.SingleCellExperiment(SO, assay = "RNA")

# 2. Run scDblFinder (verbose = FALSE to hide the progress bar)
sce <- scDblFinder(sce, samples = "Sample", verbose = FALSE)

# 3. Add the results back to metadata
SO$scDblFinder.score <- sce$scDblFinder.score
SO$scDblFinder.class <- sce$scDblFinder.class

2.2 Checking Doublet Results

After running scDblFinder, we can review the exact number of singlets (high-quality single cells) versus doublets.

Code

# Look for the number of 'doublet' vs 'singlet'
print("Doublet identification results:")

[1] "Doublet identification results:"

Code

table(SO$scDblFinder.class)


singlet doublet 
  43596    2470

How many per library?

Code

library(dplyr)
library(tidyr)
library(DT) # インタラクティブな表のために追加

# 1. 集計とデータ成形（ここはそのまま活用）
dbl_df <- as.data.frame.matrix(table(SO$Sample, SO$scDblFinder.class))

dbl_summary <- dbl_df |>
  tibble::rownames_to_column("Sample") |>
  mutate(
    Total = singlet + doublet,
    Doublet_Percent = round((doublet / Total) * 100, 1)
  ) |>
  arrange(desc(Total))

# 2. DT パッケージを使用してインタラクティブな表を出力
datatable(
  dbl_summary,
  colnames = c("Sample", "Singlet", "Doublet", "Total", "Doublet (%)"),
  caption = "Table: Doublet detection summary across libraries (Interactive)",
  options = list(
    pageLength = 12,       # 全サンプル（12個）を1ページに表示
    autoWidth = TRUE,
    columnDefs = list(list(className = 'dt-center', targets = "_all")), # 文字を中央揃え
    order = list(list(3, 'desc')) # デフォルトでTotal（4列目）が多い順にソート
  ),
  rownames = FALSE,
  filter = 'top'           # 検索バーを上部に追加
) |>
  # 数値のカンマ区切り（interval=3, mark=","）
  formatCurrency(c('singlet', 'doublet', 'Total'), currency = "", interval = 3, mark = ",") |>
  # パーセンテージに単位を付与
  formatString('Doublet_Percent', suffix = ' %')

2.3 Justifying QC Thresholds (Pre-filtering Visualization)

Next, we calculate the percentage of reads mapping to the mitochondrial genome. A high mitochondrial rate typically indicates a stressed or dying cell.

Before applying strict filters, we must inspect the natural distribution of our dataset to justify our thresholds. We visualize the distributions and overlay red dashed lines to indicate our chosen cutoffs.

Code

# 1. Calculate mitochondrial rates FIRST so we can plot it
SO[["percent.mt"]] <- Seurat::PercentageFeatureSet(SO, pattern = "^MT-")

# 2. Plot nFeature_RNA distribution BEFORE filtering
# Adding red dashed lines to show our 200 and 7000 cutoffs
p1 <- Seurat::VlnPlot(SO, features = "nFeature_RNA", group.by = "Sample", pt.size = 0.1) +
  ggplot2::geom_hline(yintercept = c(200, 7000), color = "red", linetype = "dashed") +
  Seurat::NoLegend()

# 3. Plot percent.mt distribution BEFORE filtering
# Adding a red dashed line at 10%
p2 <- Seurat::VlnPlot(SO, features = "percent.mt", group.by = "Sample", pt.size = 0.1) +
  ggplot2::geom_hline(yintercept = 10, color = "red", linetype = "dashed") +
  Seurat::NoLegend()

# 4. Display the plots side-by-side
p1 | p2

Pre-filtering distributions of detected genes and mitochondrial rates. Red dashed lines indicate the chosen biological cutoffs.

Code

# 1. フィルター前の総細胞数（シングレットのみを分母とする）を取得
total_singlets <- sum(SO$scDblFinder.class == "singlet")

# 2. nFeature_RNAのカットオフ（200超〜7000以下）を満たすシングレットの割合を計算
kept_nFeature <- sum(SO$nFeature_RNA > 200 & SO$nFeature_RNA <= 7000 & SO$scDblFinder.class == "singlet")
pct_nFeature <- round((kept_nFeature / total_singlets) * 100, 1)

# 3. percent.mtのカットオフ（10以下）を満たすシングレットの割合を計算
kept_mt <- sum(SO$percent.mt <= 10 & SO$scDblFinder.class == "singlet")
pct_mt <- round((kept_mt / total_singlets) * 100, 1)

# --- 変更点: 全細胞を1つのバイオリンにまとめるためのダミー列を作成 ---
SO$merge_group <- "All Cells"

# 4. 左側のグラフ作成 (nFeature_RNA)
# group.by引数でさっき作ったダミー列を指定する
p3_nFeature <- Seurat::VlnPlot(SO, features = "nFeature_RNA", group.by = "merge_group", pt.size = 0.1) +
  ggplot2::geom_hline(yintercept = c(200, 7000), color = "red", linetype = "dashed") +
  Seurat::NoLegend() +
  ggplot2::theme(axis.title.x = ggplot2::element_blank(), # x軸の「merge_group」というタイトルだけ消す
                 plot.subtitle = ggplot2::element_text(size = 10, face = "bold")) +
  ggplot2::labs(title = "nFeature_RNA", subtitle = paste0("Kept: ", pct_nFeature, "% / Dropped: ", round(100 - pct_nFeature, 1), "%"))

# 5. 右側のグラフ作成 (percent.mt)
p3_mt <- Seurat::VlnPlot(SO, features = "percent.mt", group.by = "merge_group", pt.size = 0.1) +
  ggplot2::geom_hline(yintercept = 10, color = "red", linetype = "dashed") +
  Seurat::NoLegend() +
  ggplot2::theme(axis.title.x = ggplot2::element_blank(),
                 plot.subtitle = ggplot2::element_text(size = 10, face = "bold")) +
  ggplot2::labs(title = "percent.mt", subtitle = paste0("Kept: ", pct_mt, "% / Dropped: ", round(100 - pct_mt, 1), "%"))

# 6. p3として結合して表示
p3 <- p3_nFeature | p3_mt
p3

As visualized above: nFeature_RNA (7,000 cutoff): The vast majority of viable cells fall below 7,000 detected genes. Cells above this red dashed line are extreme outliers, likely representing unstably captured droplets or undiscovered doublets.

percent.mt (10% cutoff): The distribution shows that this specific pre-processed dataset is already bounded below a 10% mitochondrial rate. We retain the <= 10% logic in our pipeline to explicitly state our upper tolerance for cellular stress.

2.4 Applying Final QC Filters

Now that our thresholds are justified, we apply our strict biological filtering criteria:

Gene count (nFeature_RNA): Keep cells with between 200 and 7,000 detected genes.

Mitochondrial rate (percent.mt): Keep cells with 10% or less mitochondrial reads.

Doublets: Retain ONLY the cells classified as “singlet”.

Code

# Apply filtering logic
SO <- subset(SO, 
             subset = nFeature_RNA > 200 & 
               nFeature_RNA <= 7000 & 
               percent.mt <= 10 & 
               scDblFinder.class == "singlet")

2.5 Final QC Inspection

After executing the filters, we check the dimensions to see how many high-quality cells remain. Finally, we visualize the post-filtering distributions to confirm our dataset is clean and ready for dimensionality reduction.

Code

# 1. Check final dimensions (Rows = Genes, Columns = Cells)
print("Final SO dimensions after all QC:")

[1] "Final SO dimensions after all QC:"

Code

dim(SO)

[1] 47773 43448

Code

# 2. Final visualization
Seurat::VlnPlot(SO, 
                features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
                group.by = "Sample",
                ncol = 3, 
                pt.size = 0.1)+
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1)) # 文字被り防止

Violin plots showing the distribution of genes, RNA counts, and mitochondrial percentages after QC filtering.

Step 3: Data Normalization

After removing low-quality cells and doublets, the next step is to normalize the raw gene expression counts.

In single-cell RNA sequencing, different cells often have different sequencing depths (some cells just happen to get sequenced more deeply than others due to technical capture variations). To make the gene expression levels truly comparable across all cells, we apply a global-scaling normalization method called LogNormalize.

This method normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 is the standard), and log-transforms the result.

Code

# Normalize the raw count data
SO <- NormalizeData(SO, 
                    normalization.method = "LogNormalize", 
                    scale.factor = 10000)

Step 4: Feature Selection (Highly Variable Genes)

To reduce the computational burden and focus our analysis on biological signal rather than technical noise, we identify a subset of features (genes) that exhibit high cell-to-cell variation in the dataset.

These are genes that are highly expressed in some cells, and lowly expressed in others. We use the vst (variance-stabilizing transformation) method to select the top 2,000 highly variable genes. This method helps control for the relationship between a gene’s mean expression and its variance.

Code

# Identify the top 2000 most highly variable genes
SO <- FindVariableFeatures(SO, 
                           selection.method = "vst", 
                           nfeatures = 2000)

4.1 Visualizing Variable Features

To verify our feature selection, we extract the top 10 most variable genes and plot the variance of all genes against their mean expression. The highly variable genes are highlighted, with the top 10 explicitly labeled.

Code

# Extract the top 10 highly variable genes
top10 <- head(VariableFeatures(SO), 10)
print("Top 10 Variable Genes:")

[1] "Top 10 Variable Genes:"

Code

print(top10)

 [1] "NRG1"    "KCNQ5"   "ROBO2"   "SST"     "NEFL"    "GALNTL6" "NEFM"   
 [8] "CNTN5"   "HBB"     "SYT1"

Code

# Create the variance plot
p1 <- VariableFeaturePlot(SO)

# Add labels to the top 10 genes
p2 <- LabelPoints(plot = p1, 
                  points = top10, 
                  repel = TRUE, 
                  xnudge = 0, 
                  ynudge = 0)

# Display the plot
p2

Scatter plot of gene variance vs. mean expression. The top 2000 highly variable genes are highlighted in red, with the top 10 labeled.

Step 5: Scaling Data and Regressing Nuisance Variables

Before performing Principal Component Analysis (PCA), we must scale and center the data. This ensures that highly expressed genes do not dominate the analysis and that each gene carries equal weight.

Furthermore, in neurological research, we want to cluster cells based on their true biological identity (e.g., neurons vs. glia), not by their current cell-cycle phase or cellular stress levels. Therefore, we score the cells for their cell-cycle phase and regress out these “nuisance” variables alongside mitochondrial percentages and total read counts.

5.1 Initial Scaling and Cell Cycle Scoring

To prevent memory limit crashes, we first scale only the genes necessary for cell-cycle scoring and our previously identified highly variable genes.

Code

# 1. Define genes needed (Top 2000 variable genes + Cell cycle markers)
var_genes <- VariableFeatures(SO)
cc_genes <- c(cc.genes$s.genes, cc.genes$g2m.genes)
genes_to_scale <- unique(c(var_genes, cc_genes))

# 2. Initial scaling of selected features
SO <- ScaleData(SO, features = genes_to_scale)

# 3. Cell Cycle Scoring (Classifying cells into G1, S, or G2M)
SO <- CellCycleScoring(SO, 
                       s.features = cc.genes$s.genes, 
                       g2m.features = cc.genes$g2m.genes, 
                       set.ident = TRUE)

5.2 Final Scaling and Regression

Now that we have the cell cycle scores (S.Score and G2M.Score), we perform the final scaling on our variable features while explicitly regressing out the technical and cell-cycle noise. This isolates the pure biological signal needed for our downstream PD research.

Code

# 1. Define variables to regress out
vars_to_regress <- c("nCount_RNA", "S.Score", "G2M.Score", "percent.mt")

# 2. Perform final scaling with regression
SO <- ScaleData(SO, 
                features = var_genes, 
                vars.to.regress = vars_to_regress)

Step 6: Dimensionality Reduction (PCA)

With our data scaled and technical noise regressed out, we now apply Principal Component Analysis (PCA).

Single-cell data is incredibly high-dimensional. PCA condenses this information into a smaller number of “Principal Components” (PCs). Each PC represents a major axis of variance in the dataset.

Code

# Run PCA on the highly variable features
SO <- RunPCA(SO, 
             npcs = 50, 
             features = VariableFeatures(object = SO))

6.1 PCA Visualization

First, we plot PC1 against PC2 to ensure our cells are separating based on major biological variances.

Next, we generate an Elbow Plot. This is a critical diagnostic visualization. It ranks the principal components based on the percentage of variance they explain. We look for the “elbow” (the point where the graph sharply levels off) to determine how many PCs are actually meaningful. We will use this number for our final clustering and UMAP.

Code

# Plot PC1 vs. PC2
DimPlot(object = SO, reduction = "pca") 
# Visualize the Elbow Plot to determine dimensionality
ElbowPlot(object = SO, ndims = 50, reduction = "pca")

Cells plotted along the first two principal components.

Elbow Plot determining the optimal number of PCs.

Step 7: Clustering Cells

After dimensionality reduction, we mathematically group the cells based on their similar gene expression profiles. This is a two-step process in Seurat: first, we construct a K-nearest neighbor (KNN) graph, and second, we apply a modularity optimization algorithm to identify the actual clusters.

7.1 Memory Allocation

Single-cell clustering requires significant computational memory. To prevent the R session from crashing due to memory limits during parallel processing, we increase the maximum allowable global object size to 4 GB.

Code

# Increase future.globals.maxSize to 4 GB
options(future.globals.maxSize = 4 * 1024^3)

7.2 Finding Neighbors

We use the FindNeighbors function to construct the KNN graph based on the euclidean distance in PCA space. We use the first 35 Principal Components, as determined by the Elbow Plot in the previous step.

Code

# Find neighbors using the first 20 PCs
SO <- FindNeighbors(SO, dims = 1:35)

7.3 Finding Clusters

Next, we identify the actual clusters using the Louvain algorithm. We set the resolution parameter to 0.5. A higher resolution will yield more, smaller clusters, while a lower resolution will yield fewer, larger clusters.

Note: We set random.seed = 123 to ensure this exact clustering result is mathematically reproducible.

Code

# Find clusters at resolution 0.5
SO <- FindClusters(SO, 
                   resolution = 0.5, 
                   random.seed = 123)

Step 8: Visualizing Cell Clusters (UMAP)

Now that our cells are mathematically clustered in high-dimensional space, we need a way to visualize them.

For this, we use UMAP. Unlike PCA, which is a linear dimensionality reduction method, UMAP is a non-linear technique. It excels at taking the complex, high-dimensional neighborhood graph we built in the previous step and flattening it into 2D space while preserving both the local clustering structure and the broader global relationships between different cell types.

Cells that have highly similar gene expression profiles will cluster closely together, forming distinct visual “islands.”

Code

# 1. Calculate UMAP using the same 35 Principal Components
SO <- RunUMAP(object = SO, dims = 1:35)

# 2. Generate the UMAP plot with cluster labels
umap_clusters <- DimPlot(SO, 
                         reduction = "umap", 
                         label = TRUE) + NoLegend()

# 3. Display the plot
umap_clusters

UMAP projection of single cells. Each dot represents a single cell, colored by its mathematically assigned cluster.

Step 8.5: Saving the Processed Data

After successfully performing quality control, normalization, scaling, dimensionality reduction, and clustering, it is crucial to save the processed Seurat object.

This creates a checkpoint. In future analytical sessions (such as identifying cluster markers or performing differential expression analysis between Normal and PD), we can load this object directly in seconds, bypassing all the computationally heavy steps above.

Code

# Define the file path for the processed data
# (必要に応じてパスを /Users/... に変更してください)
save_path <- "/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/scripts/SO_processed_UMAP.rds"

# Save the object
saveRDS(SO, file = save_path)

print("Processed Seurat Object saved successfully!")

[1] "Processed Seurat Object saved successfully!"