First, we load our pre-processed Seurat Object (SO) from the local directory. This object contains our raw single-cell RNA-seq data.
Code
# Define the file pathdata_path <-"/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/data/raw/SO_PD_Sohei.rds"# Import the Seurat objectSO <-readRDS(data_path)
1.1 Data Summary
1. How many genes and cells total in the data? To understand the scope of our dataset, we need to determine the total number of genes and cells, as well as the distribution of cells across different libraries (patients).
Code
# Print high-level overviewSO
An object of class Seurat
47773 features across 46066 samples within 1 assay
Active assay: RNA (47773 features, 0 variable features)
1 layer present: counts
# Count the number of cells per library (patient)# Sorted in descending order and formatted with commas for readabilitytable(SO$Sample) |>sort(decreasing =TRUE) |>format(big.mark =",") |>print(quote =FALSE)
Data quality control is a critical step in single-cell RNA-seq analysis. We need to filter out low-quality cells, empty droplets, dying cells, and doublets (two cells captured in one droplet) to prevent them from biasing our downstream clustering and UMAP projections.
2.1 Identify Doublets with scDblFinder
We use the scDblFinder package to identify potential doublets. Because this dataset contains multiple samples (e.g., different patients or conditions), it is crucial to use the samples argument. This ensures doublets are calculated within each specific sample, avoiding false positives.
Code
# 1. Convert Seurat objectsce <-as.SingleCellExperiment(SO, assay ="RNA")# 2. Run scDblFinder (verbose = FALSE to hide the progress bar)sce <-scDblFinder(sce, samples ="Sample", verbose =FALSE)# 3. Add the results back to metadataSO$scDblFinder.score <- sce$scDblFinder.scoreSO$scDblFinder.class <- sce$scDblFinder.class
2.2 Checking Doublet Results
After running scDblFinder, we can review the exact number of singlets (high-quality single cells) versus doublets.
Code
# Look for the number of 'doublet' vs 'singlet'print("Doublet identification results:")
Next, we calculate the percentage of reads mapping to the mitochondrial genome. A high mitochondrial rate typically indicates a stressed or dying cell.
Before applying strict filters, we must inspect the natural distribution of our dataset to justify our thresholds. We visualize the distributions and overlay red dashed lines to indicate our chosen cutoffs.
Code
# 1. Calculate mitochondrial rates FIRST so we can plot itSO[["percent.mt"]] <- Seurat::PercentageFeatureSet(SO, pattern ="^MT-")# 2. Plot nFeature_RNA distribution BEFORE filtering# Adding red dashed lines to show our 200 and 7000 cutoffsp1 <- Seurat::VlnPlot(SO, features ="nFeature_RNA", group.by ="Sample", pt.size =0.1) + ggplot2::geom_hline(yintercept =c(200, 7000), color ="red", linetype ="dashed") + Seurat::NoLegend()# 3. Plot percent.mt distribution BEFORE filtering# Adding a red dashed line at 10%p2 <- Seurat::VlnPlot(SO, features ="percent.mt", group.by ="Sample", pt.size =0.1) + ggplot2::geom_hline(yintercept =10, color ="red", linetype ="dashed") + Seurat::NoLegend()# 4. Display the plots side-by-sidep1 | p2
Pre-filtering distributions of detected genes and mitochondrial rates. Red dashed lines indicate the chosen biological cutoffs.
Pre-filtering distributions of detected genes and mitochondrial rates. Red dashed lines indicate the chosen biological cutoffs.
As visualized above: nFeature_RNA (7,000 cutoff): The vast majority of viable cells fall below 7,000 detected genes. Cells above this red dashed line are extreme outliers, likely representing unstably captured droplets or undiscovered doublets.
percent.mt (10% cutoff): The distribution shows that this specific pre-processed dataset is already bounded below a 10% mitochondrial rate. We retain the <= 10% logic in our pipeline to explicitly state our upper tolerance for cellular stress.
2.4 Applying Final QC Filters
Now that our thresholds are justified, we apply our strict biological filtering criteria:
Gene count (nFeature_RNA): Keep cells with between 200 and 7,000 detected genes.
Mitochondrial rate (percent.mt): Keep cells with 10% or less mitochondrial reads.
Doublets: Retain ONLY the cells classified as “singlet”.
After executing the filters, we check the dimensions to see how many high-quality cells remain. Finally, we visualize the post-filtering distributions to confirm our dataset is clean and ready for dimensionality reduction.
Code
# 1. Check final dimensions (Rows = Genes, Columns = Cells)print("Final SO dimensions after all QC:")
[1] "Final SO dimensions after all QC:"
Code
dim(SO)
[1] 47773 43448
Code
# 2. Final visualizationSeurat::VlnPlot(SO, features =c("nFeature_RNA", "nCount_RNA", "percent.mt"),group.by ="Sample",ncol =3, pt.size =0.1)+ ggplot2::theme(axis.text.x = ggplot2::element_text(angle =45, hjust =1)) # 文字被り防止
Violin plots showing the distribution of genes, RNA counts, and mitochondrial percentages after QC filtering.
Step 3: Data Normalization
After removing low-quality cells and doublets, the next step is to normalize the raw gene expression counts.
In single-cell RNA sequencing, different cells often have different sequencing depths (some cells just happen to get sequenced more deeply than others due to technical capture variations). To make the gene expression levels truly comparable across all cells, we apply a global-scaling normalization method called LogNormalize.
This method normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 is the standard), and log-transforms the result.
Code
# Normalize the raw count dataSO <-NormalizeData(SO, normalization.method ="LogNormalize", scale.factor =10000)
Step 4: Feature Selection (Highly Variable Genes)
To reduce the computational burden and focus our analysis on biological signal rather than technical noise, we identify a subset of features (genes) that exhibit high cell-to-cell variation in the dataset.
These are genes that are highly expressed in some cells, and lowly expressed in others. We use the vst (variance-stabilizing transformation) method to select the top 2,000 highly variable genes. This method helps control for the relationship between a gene’s mean expression and its variance.
Code
# Identify the top 2000 most highly variable genesSO <-FindVariableFeatures(SO, selection.method ="vst", nfeatures =2000)
4.1 Visualizing Variable Features
To verify our feature selection, we extract the top 10 most variable genes and plot the variance of all genes against their mean expression. The highly variable genes are highlighted, with the top 10 explicitly labeled.
Code
# Extract the top 10 highly variable genestop10 <-head(VariableFeatures(SO), 10)print("Top 10 Variable Genes:")
# Create the variance plotp1 <-VariableFeaturePlot(SO)# Add labels to the top 10 genesp2 <-LabelPoints(plot = p1, points = top10, repel =TRUE, xnudge =0, ynudge =0)# Display the plotp2
Scatter plot of gene variance vs. mean expression. The top 2000 highly variable genes are highlighted in red, with the top 10 labeled.
Step 5: Scaling Data and Regressing Nuisance Variables
Before performing Principal Component Analysis (PCA), we must scale and center the data. This ensures that highly expressed genes do not dominate the analysis and that each gene carries equal weight.
Furthermore, in neurological research, we want to cluster cells based on their true biological identity (e.g., neurons vs. glia), not by their current cell-cycle phase or cellular stress levels. Therefore, we score the cells for their cell-cycle phase and regress out these “nuisance” variables alongside mitochondrial percentages and total read counts.
5.1 Initial Scaling and Cell Cycle Scoring
To prevent memory limit crashes, we first scale only the genes necessary for cell-cycle scoring and our previously identified highly variable genes.
Now that we have the cell cycle scores (S.Score and G2M.Score), we perform the final scaling on our variable features while explicitly regressing out the technical and cell-cycle noise. This isolates the pure biological signal needed for our downstream PD research.
Code
# 1. Define variables to regress outvars_to_regress <-c("nCount_RNA", "S.Score", "G2M.Score", "percent.mt")# 2. Perform final scaling with regressionSO <-ScaleData(SO, features = var_genes, vars.to.regress = vars_to_regress)
Step 6: Dimensionality Reduction (PCA)
With our data scaled and technical noise regressed out, we now apply Principal Component Analysis (PCA).
Single-cell data is incredibly high-dimensional. PCA condenses this information into a smaller number of “Principal Components” (PCs). Each PC represents a major axis of variance in the dataset.
Code
# Run PCA on the highly variable featuresSO <-RunPCA(SO, npcs =50, features =VariableFeatures(object = SO))
6.1 PCA Visualization
First, we plot PC1 against PC2 to ensure our cells are separating based on major biological variances.
Next, we generate an Elbow Plot. This is a critical diagnostic visualization. It ranks the principal components based on the percentage of variance they explain. We look for the “elbow” (the point where the graph sharply levels off) to determine how many PCs are actually meaningful. We will use this number for our final clustering and UMAP.
Code
# Plot PC1 vs. PC2DimPlot(object = SO, reduction ="pca") # Visualize the Elbow Plot to determine dimensionalityElbowPlot(object = SO, ndims =50, reduction ="pca")
Cells plotted along the first two principal components.
Elbow Plot determining the optimal number of PCs.
Step 7: Clustering Cells
After dimensionality reduction, we mathematically group the cells based on their similar gene expression profiles. This is a two-step process in Seurat: first, we construct a K-nearest neighbor (KNN) graph, and second, we apply a modularity optimization algorithm to identify the actual clusters.
7.1 Memory Allocation
Single-cell clustering requires significant computational memory. To prevent the R session from crashing due to memory limits during parallel processing, we increase the maximum allowable global object size to 4 GB.
Code
# Increase future.globals.maxSize to 4 GBoptions(future.globals.maxSize =4*1024^3)
7.2 Finding Neighbors
We use the FindNeighbors function to construct the KNN graph based on the euclidean distance in PCA space. We use the first 35 Principal Components, as determined by the Elbow Plot in the previous step.
Code
# Find neighbors using the first 20 PCsSO <-FindNeighbors(SO, dims =1:35)
7.3 Finding Clusters
Next, we identify the actual clusters using the Louvain algorithm. We set the resolution parameter to 0.5. A higher resolution will yield more, smaller clusters, while a lower resolution will yield fewer, larger clusters.
Note: We set random.seed = 123 to ensure this exact clustering result is mathematically reproducible.
Now that our cells are mathematically clustered in high-dimensional space, we need a way to visualize them.
For this, we use UMAP. Unlike PCA, which is a linear dimensionality reduction method, UMAP is a non-linear technique. It excels at taking the complex, high-dimensional neighborhood graph we built in the previous step and flattening it into 2D space while preserving both the local clustering structure and the broader global relationships between different cell types.
Cells that have highly similar gene expression profiles will cluster closely together, forming distinct visual “islands.”
Code
# 1. Calculate UMAP using the same 35 Principal ComponentsSO <-RunUMAP(object = SO, dims =1:35)# 2. Generate the UMAP plot with cluster labelsumap_clusters <-DimPlot(SO, reduction ="umap", label =TRUE) +NoLegend()# 3. Display the plotumap_clusters
UMAP projection of single cells. Each dot represents a single cell, colored by its mathematically assigned cluster.
Step 8.5: Saving the Processed Data
After successfully performing quality control, normalization, scaling, dimensionality reduction, and clustering, it is crucial to save the processed Seurat object.
This creates a checkpoint. In future analytical sessions (such as identifying cluster markers or performing differential expression analysis between Normal and PD), we can load this object directly in seconds, bypassing all the computationally heavy steps above.
Code
# Define the file path for the processed data# (必要に応じてパスを /Users/... に変更してください)save_path <-"/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/scripts/SO_processed_UMAP.rds"# Save the objectsaveRDS(SO, file = save_path)print("Processed Seurat Object saved successfully!")