Step 08: Doublet Validation & Quality Audit
Setup: Environment and Data
In Step 02, we filtered the dataset to retain only the cells that scDblFinder classified as “singlets.” In this module, we check whether that filtering was effective and confirm that no annotated cell type is contaminated by residual high-score cells.
library(Seurat)
library(tidyverse)
library(patchwork)
library(viridis) # for cleaner heatmap color palettes
# Load the annotated data saved in Step 07
import_path <- "/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/scripts/SO_07_Annotated.RDS"
SO <- readRDS(import_path)
Step 8.1: Audit Cell Counts
First, let’s confirm by direct count that the “doublet” class was completely removed during the Step 02 QC.
print("Current cell classification counts:")
[1] "Current cell classification counts:"
table(SO$scDblFinder.class)
singlet doublet
  43448       0
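The count check above can also be made programmatic, so that a script halts if any doublets slip through. Below is a minimal sketch in base R. It uses a toy vector `dbl_class` as a stand-in for `SO$scDblFinder.class`, and the helper name `count_doublets` is hypothetical (it is not part of Seurat or scDblFinder).

```r
# Toy stand-in for SO$scDblFinder.class (hypothetical data, not the real object).
dbl_class <- factor(c("singlet", "singlet", "singlet"),
                    levels = c("singlet", "doublet"))

# Hypothetical helper: count cells still labelled "doublet".
count_doublets <- function(class_vec) {
  sum(as.character(class_vec) == "doublet", na.rm = TRUE)
}

n_doublets <- count_doublets(dbl_class)

# Halt early if the Step 02 filter did not remove every doublet.
stopifnot(n_doublets == 0)
```

On the real object, the same idea would presumably be applied by passing `SO$scDblFinder.class` to the helper.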
Step 8.2: Visualizing Doublet Scores on UMAP
Even though we kept only “singlets,” each cell still retains its scDblFinder.score, a continuous value between 0 and 1 where higher means more doublet-like. A high score means the cell was “borderline.” We visualize this score on the UMAP to check that these borderline cells are not forming a “fake” biological cluster.
FeaturePlot(SO, features = "scDblFinder.score", pt.size = 0.5) +
  scale_color_viridis(option = "magma", direction = -1) +
  labs(title = "Residual Doublet Scores on UMAP")
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Step 8.3: Comparison Across Annotated Cell Types
Which cell types harbor the cells with the highest residual doublet scores? We use a Violin Plot to audit the “cleanliness” of each biological identity we assigned in Step 07.
VlnPlot(SO, features = "scDblFinder.score", pt.size = 0) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  labs(title = "Doublet Score Distribution by Cell Type",
       subtitle = "Red line represents a standard high-confidence warning threshold") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Step 8.4: Identifying “Suspicious” Clusters
If a cluster’s mean or maximum doublet score is markedly higher than that of the other clusters, we should review its biological validity.
# Compute the mean and maximum score for each cluster
doublet_audit <- SO@meta.data %>%
  group_by(cell_type = Idents(SO)) %>%
  summarize(mean_score = mean(scDblFinder.score),
            max_score = max(scDblFinder.score),
            cell_count = n()) %>%
  arrange(desc(mean_score))
print(doublet_audit)
# A tibble: 15 × 4
   cell_type              mean_score max_score cell_count
   <fct>                       <dbl>     <dbl>      <int>
 1 Neuron                     0.114      0.892        992
 2 Inhibitory Neuron          0.0865     0.857       1057
 3 Stressed Cell              0.0704     0.755       3352
 4 Unassigned                 0.0563     0.782      14949
 5 Astrocyte                  0.0530     0.868       7927
 6 Oligodendrocyte            0.0443     0.531       4209
 7 Fibroblast                 0.0355     0.436         88
 8 Excitatory Neuron          0.0296     0.874        356
 9 Reactive Astrocyte         0.0274     0.594       1025
10 Fibroblast / Pericyte      0.0261     0.628       1045
11 Endothelial Cell           0.0171     0.732       1777
12 OPCs                       0.0164     0.782       2703
13 Microglia / Macrophage     0.0159     0.881       3248
14 Unassigned Glial           0.0141     0.751        596
15 T Cell / NK Cell           0.0133     0.414        124
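A per-cluster mean can hide a small subpopulation of high-score cells, and a maximum reflects only a single cell. As a complementary view, the sketch below computes the median score and the fraction of cells above a cutoff for each cell type. It is a minimal base R example on a toy data frame `toy_meta` (hypothetical values standing in for `SO@meta.data`), and the 0.5 cutoff is an assumption that simply mirrors the dashed line in the violin plot.

```r
# Toy stand-in for SO@meta.data (hypothetical values, not from the real dataset).
toy_meta <- data.frame(
  cell_type = c("A", "A", "A", "A", "B", "B", "B", "B"),
  scDblFinder.score = c(0.1, 0.2, 0.6, 0.9, 0.05, 0.1, 0.1, 0.15)
)

threshold <- 0.5  # assumed cutoff, matching the dashed line in the violin plot

# Per cell type: median score and fraction of cells above the threshold.
median_score <- tapply(toy_meta$scDblFinder.score, toy_meta$cell_type, median)
frac_high <- tapply(toy_meta$scDblFinder.score > threshold, toy_meta$cell_type, mean)

audit <- data.frame(
  cell_type = names(median_score),
  median_score = as.numeric(median_score),
  frac_high = as.numeric(frac_high)
)

# Most suspicious cell types first.
audit <- audit[order(-audit$frac_high), ]
print(audit)
```

A cell type with a low median but a noticeable `frac_high` would suggest a minority of borderline cells rather than a uniformly suspicious cluster.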
Step 8.5: Final Check & Save
Assuming our annotated clusters look biologically clean (no cluster is overwhelmingly composed of high-score residuals), we save this validated object to be our pristine dataset for the upcoming PD vs Normal differential expression analysis.
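# If a cluster has an abnormally high score (for example "Unassigned") and you want to remove it here,
# SO <- subset(SO, idents = "Unassigned", invert = TRUE)
# handles it as shown above. If there are no problems, save the object as is.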
save_path <- "/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/scripts/SO_08_Validated.RDS"
saveRDS(SO, file = save_path)
print("Step 08 Complete! Dataset is validated as clean.")
[1] "Step 08 Complete! Dataset is validated as clean."
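Before relying on the saved file in later steps, it can be worth confirming that the object reloads intact. The sketch below illustrates the round trip with base R only. `toy_obj` is a hypothetical stand-in for the Seurat object, and a temporary file is used instead of the real `save_path`.

```r
# Toy object standing in for the Seurat object (hypothetical).
toy_obj <- list(scores = c(0.1, 0.2), label = "validated")

tmp_path <- tempfile(fileext = ".RDS")  # temporary file, not the real save_path
saveRDS(toy_obj, file = tmp_path)

# Reload and confirm the contents survived the round trip.
reloaded <- readRDS(tmp_path)
stopifnot(identical(toy_obj, reloaded))

# Clean up the temporary file.
unlink(tmp_path)
```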