Step 08: Doublet Validation & Quality Audit

Setup: Environment and Data

In Step 02, we filtered the dataset to retain only the cells that scDblFinder classified as “singlets”. In this module, we check whether that filtering was effective and whether any annotated cell type is enriched for residual high-score cells.

library(Seurat)
library(tidyverse)
library(patchwork)
library(viridis) # for nicer heatmap color palettes

# Load the annotated data saved in Step 07
import_path <- "/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/scripts/SO_07_Annotated.RDS"
SO <- readRDS(import_path)

Step 8.1: Audit Cell Counts

First, let’s confirm by direct count that no cells labeled “doublet” remain after the Step 02 QC.

print("Current cell classification counts:")

[1] "Current cell classification counts:"

table(SO$scDblFinder.class)


singlet doublet 
  43448       0
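
The same check can be made to fail loudly instead of relying on someone reading the table. The following is a minimal sketch on toy data: `dbl_class` is a hypothetical stand-in for `SO$scDblFinder.class`, and the three-cell vector is invented for illustration.

```r
# Toy stand-in for SO$scDblFinder.class (hypothetical values, not the real data).
dbl_class <- factor(c("singlet", "singlet", "singlet"),
                    levels = c("singlet", "doublet"))

# table() keeps empty factor levels, which is why "doublet" can still
# appear in the output with a count of 0.
class_counts <- table(dbl_class)
print(class_counts)

# Count doublet-labelled cells directly and stop the script if any remain.
n_doublets <- sum(dbl_class == "doublet")
stopifnot(n_doublets == 0)
```

On the real object, the same `stopifnot()` line applied to `SO$scDblFinder.class` would halt the notebook if the Step 02 filter had been skipped or partially applied.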

Step 8.2: Visualizing Doublet Scores on UMAP

Even though we only kept “singlets,” each cell still retains its scDblFinder.score, a continuous, probability-like value where higher means more doublet-like. A retained cell with a high score was a “borderline” call. We visualize this score to check that these borderline cells are not concentrated in one region of the embedding, where they could form a “fake” biological cluster.

FeaturePlot(SO, features = "scDblFinder.score", pt.size = 0.5) +
  scale_color_viridis(option = "magma", direction = -1) +
  labs(title = "Residual Doublet Scores on UMAP")

Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

Doublet scores projected onto UMAP. Darker colors indicate a higher probability of being a residual doublet.

Step 8.3: Comparison Across Annotated Cell Types

Which cell types harbor the cells with the highest residual doublet scores? We use a Violin Plot to audit the “cleanliness” of each biological identity we assigned in Step 07.

VlnPlot(SO, features = "scDblFinder.score", pt.size = 0) + 
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  labs(title = "Doublet Score Distribution by Cell Type",
       subtitle = "Dashed red line marks 0.5, used here as a warning threshold") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
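
A violin plot shows the shape of each distribution, but the share of cells above the dashed line is easier to compare as a number. The sketch below computes that share per cell type on toy data. The data frame `meta` is a hypothetical stand-in for `SO@meta.data` combined with `Idents(SO)`, the scores are invented, and `score_cutoff` is assumed to match the 0.5 line in the plot.

```r
# Toy stand-in for the per-cell metadata (invented values for illustration).
meta <- data.frame(
  cell_type = c("Neuron", "Neuron", "Neuron", "Neuron",
                "Astrocyte", "Astrocyte", "Astrocyte", "Astrocyte"),
  scDblFinder.score = c(0.10, 0.60, 0.80, 0.05, 0.02, 0.03, 0.55, 0.01),
  stringsAsFactors = FALSE
)

score_cutoff <- 0.5  # assumption: same value as the dashed line in the violin plot

# The mean of a logical vector is the fraction of TRUE values, so this gives
# the fraction of cells per type whose score exceeds the cutoff.
frac_high <- tapply(meta$scDblFinder.score > score_cutoff, meta$cell_type, mean)
print(sort(frac_high, decreasing = TRUE))
```

A cell type with a large fraction above the cutoff deserves a closer look than one with a single extreme outlier, which a maximum alone cannot distinguish.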

Step 8.4: Identifying “Suspicious” Clusters

If a cluster has a mean or maximum doublet score markedly higher than the others, we should review its biological validity.

# Compute the mean and maximum score for each cluster
doublet_audit <- SO@meta.data %>%
  group_by(cell_type = Idents(SO)) %>%
  summarize(mean_score = mean(scDblFinder.score),
            max_score = max(scDblFinder.score),
            cell_count = n()) %>%
  arrange(desc(mean_score))

print(doublet_audit)

# A tibble: 15 × 4
   cell_type              mean_score max_score cell_count
   <fct>                       <dbl>     <dbl>      <int>
 1 Neuron                     0.114      0.892        992
 2 Inhibitory Neuron          0.0865     0.857       1057
 3 Stressed Cell              0.0704     0.755       3352
 4 Unassigned                 0.0563     0.782      14949
 5 Astrocyte                  0.0530     0.868       7927
 6 Oligodendrocyte            0.0443     0.531       4209
 7 Fibroblast                 0.0355     0.436         88
 8 Excitatory Neuron          0.0296     0.874        356
 9 Reactive Astrocyte         0.0274     0.594       1025
10 Fibroblast / Pericyte      0.0261     0.628       1045
11 Endothelial Cell           0.0171     0.732       1777
12 OPCs                       0.0164     0.782       2703
13 Microglia / Macrophage     0.0159     0.881       3248
14 Unassigned Glial           0.0141     0.751        596
15 T Cell / NK Cell           0.0133     0.414        124
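
One way to turn "markedly higher" into a reproducible rule is to compare each cluster's mean score with the median of all cluster means. The sketch below re-enters the `mean_score` column printed above into a plain data frame. The name `audit` and the `fold_cutoff` of 2 are assumptions chosen for illustration, not an established threshold.

```r
# Mean scores copied from the audit table printed above.
audit <- data.frame(
  cell_type = c("Neuron", "Inhibitory Neuron", "Stressed Cell", "Unassigned",
                "Astrocyte", "Oligodendrocyte", "Fibroblast", "Excitatory Neuron",
                "Reactive Astrocyte", "Fibroblast / Pericyte", "Endothelial Cell",
                "OPCs", "Microglia / Macrophage", "Unassigned Glial",
                "T Cell / NK Cell"),
  mean_score = c(0.114, 0.0865, 0.0704, 0.0563, 0.0530, 0.0443, 0.0355, 0.0296,
                 0.0274, 0.0261, 0.0171, 0.0164, 0.0159, 0.0141, 0.0133),
  stringsAsFactors = FALSE
)

fold_cutoff <- 2  # hypothetical: flag clusters above twice the typical mean

# Use the median of the cluster means as a robust "typical" value.
reference <- median(audit$mean_score)

# Keep the names of clusters whose mean exceeds the fold cutoff.
flagged <- audit$cell_type[audit$mean_score > fold_cutoff * reference]
print(flagged)
```

With these values the median cluster mean is 0.0296, so the rule flags Neuron, Inhibitory Neuron, and Stressed Cell for review. A flag is only a prompt to inspect marker genes; neuronal populations may score higher for biological reasons rather than because they are doublets.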

Step 8.5: Final Check & Save

Assuming our annotated clusters look biologically clean (no cluster is overwhelmingly composed of high-score residuals), we save this validated object to be our pristine dataset for the upcoming PD vs Normal differential expression analysis.

# If a cluster has an abnormally high score (for example "Unassigned") and you want to drop it here,
# SO <- subset(SO, idents = "Unassigned", invert = TRUE) 
# is one way to do it. If nothing looks wrong, save the object as is.

save_path <- "/Users/yoshimurasouhei/Downloads/010_school/4年生/bioinfomaticsリサーチクラークシップ/PD_2026/scripts/SO_08_Validated.RDS"
saveRDS(SO, file = save_path)

print("Step 08 Complete! Dataset is validated as clean.")

[1] "Step 08 Complete! Dataset is validated as clean."
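
Before relying on the saved file in the next step, it can be worth confirming that it reads back intact. The sketch below shows the round trip on a toy object. Both `toy_object` and `tmp_path` are hypothetical stand-ins for `SO` and `save_path`, used so the example runs without the real data.

```r
# Toy object standing in for the Seurat object SO.
toy_object <- list(scores = c(0.1, 0.2, 0.9), label = "toy")

# tmp_path is a temporary location, not the project path used above.
tmp_path <- tempfile(fileext = ".RDS")
saveRDS(toy_object, file = tmp_path)

# Read the file back and confirm it matches what was written.
reloaded <- readRDS(tmp_path)
round_trip_ok <- identical(reloaded, toy_object)
print(round_trip_ok)

unlink(tmp_path)  # clean up the temporary file
```

For an object as large as `SO`, a full `identical()` comparison may be slow. Comparing `dim()` and the identity levels of the reloaded object is a cheaper, if weaker, check.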