Biological Background
& Rationale
What is Sézary
Syndrome?
Sézary syndrome (SS) is an aggressive cutaneous T-cell
lymphoma (CTCL) characterised by the triad of erythroderma
(full-body skin redness), lymphadenopathy, and circulating malignant T
cells called Sézary cells. These are CD4+ T cells with
cerebriform (brain-like wrinkled) nuclei that have undergone malignant
transformation.
The central unresolved question in SS biology is: which
normal CD4 T cell subtype gave rise to the malignant clone?
This is the “cell of origin” question. Answering it matters because:
- It tells us which transcription factors and signalling
pathways were active when transformation occurred
- It predicts which normal T cell programmes are
retained or corrupted in the malignant cells (explaining why SS cells
home to skin, suppress immunity, etc.)
- It guides therapeutic targeting — if SS arises from
TCM cells, targeting memory T cell survival signals (e.g. IL-15/IL-7
signalling) is rational
Why Project onto a
Trajectory?
A trajectory analysis places cells along a continuous
differentiation path rather than forcing them into discrete
categories. Normal CD4 T cells differentiate from Naive → TCM → TEM →
Temra (effector arm) or Naive → TCM → Treg (regulatory arm).
When we project Sézary cell lines onto this trajectory, we can
ask:
- Where on the trajectory do malignant cells land?
(cell of origin state)
- How far along differentiation are they? (pseudotime
= differentiation maturity)
- Are different cell lines (L1–L7) at the same or different
positions? (heterogeneity in cell of origin across the
disease)
- Do all cells within a line occupy the same position, or is
there intra-line heterogeneity? (clonal evolution vs stable
malignant state)
Why Analyse Each Cell
Line Separately?
You have 7 Sézary cell lines (L1–L7) derived from different patients.
Each cell line represents a biologically independent malignant
clone. Key reasons to analyse them separately:
- If all 7 lines map to the same trajectory position, this supports a
universal cell of origin for SS (strong biological
conclusion)
- If lines differ, this indicates disease
heterogeneity — different patients may have different cells of
origin, explaining clinical variability in drug response
- Comparing cell lines that are drug-sensitive vs drug-resistant (if
known) to their trajectory position can reveal whether
differentiation state predicts drug response
Setup
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the same libraries as the reference trajectory script.
# RANN provides fast k-nearest-neighbour search (nn2) which we use to transfer
# Monocle3 pseudotime from reference cells to query Sézary cells by weighted
# averaging of their 15 nearest reference neighbours.
# ──────────────────────────────────────────────────────────────────────────
suppressPackageStartupMessages({
library(Seurat)
library(SeuratWrappers)
library(igraph)
library(RANN)
library(ggplot2)
library(ggrepel)
library(dplyr)
library(tidyr)
library(patchwork)
library(scales)
library(RColorBrewer)
library(knitr)
library(kableExtra)
library(future)
})
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'rmarkdown':
method from
print.paged_df
# ── Colour palettes ────────────────────────────────────────────────────────
# These match the reference trajectory script exactly for visual consistency
STATE_COLORS <- c(
"CD4 Naive" = "#4472C4",
"CD4 TCM" = "#70AD47",
"CD4 TEM" = "#ED7D31",
"CD4 Temra/CTL" = "#C00000",
"Treg" = "#7030A0"
)
# 7 visually distinct colours for L1-L7
# Using a combination of Set1 and Dark2 palettes for maximum distinctiveness
CELL_LINE_COLORS <- c(
"L1" = "#E41A1C", # red
"L2" = "#377EB8", # blue
"L3" = "#4DAF4A", # green
"L4" = "#FF7F00", # orange
"L5" = "#984EA3", # purple
"L6" = "#A65628", # brown
"L7" = "#F781BF" # pink
)
METHOD_COLORS <- c(
"Monocle3" = "#e74c3c",
"Custom MST" = "#2980b9"
)
STATE_ORDER <- c("CD4 Naive","CD4 TCM","CD4 TEM","CD4 Temra/CTL","Treg")
CELL_LINE_ORDER <- paste0("L", 1:7)
UMAP_NAME <- "umap"
STATE_COL <- "predicted.celltype.l2"
cat("✅ Setup complete\n")
✅ Setup complete
cat("Cell line colours:\n"); print(CELL_LINE_COLORS)
Cell line colours:
L1 L2 L3 L4 L5 L6 L7
"#E41A1C" "#377EB8" "#4DAF4A" "#FF7F00" "#984EA3" "#A65628" "#F781BF"
Load Objects &
Subset to Malignant Cells
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the pre-computed reference object (cd4_ref_dual_trajectory.rds)
# which already contains:
# - MST graph, milestone centroids, pseudotime for all reference cells
# - Monocle3 pseudotime
# - All trajectory infrastructure built in the main pipeline
#
# CRITICAL: We subset to L1-L7 ONLY, excluding CD4T_lab and CD4T_10x.
# CD4T_lab and CD4T_10x are healthy donor CD4 T cells included as controls
# in the original dataset. Projecting them onto the reference would be
# circular (healthy cells → healthy reference = trivially perfect mapping)
# and would NOT answer the cell of origin question.
#
# We want ONLY the malignant Sézary cells (L1-L7) for projection.
# ──────────────────────────────────────────────────────────────────────────
# Load pre-computed reference with all trajectory objects
cd4_ref <- readRDS("../cd4_ref_dual_trajectory.rds")
# Load the full Sézary object
sezary_raw <- readRDS("/home/nabbasi/apollo_home/1-Seurat_RDS_OBJECT_FINAL/All_samples_Merged_with_Renamed_Clusters_Cell_state-03-12-2025.rds.rds")
cat("Full Sézary object loaded:", ncol(sezary_raw), "cells\n")
Full Sézary object loaded: 49305 cells
cat("\norig.ident breakdown (ALL samples):\n")
orig.ident breakdown (ALL samples):
print(table(sezary_raw$orig.ident))
L1 L2 L3 L4 L5 L6 L7 CD4T_lab CD4T_10x
5825 5935 6428 6006 6022 5148 5331 5106 3504
# ── Subset to malignant cell lines only ───────────────────────────────────
CELL_LINES <- paste0("L", 1:7)
sezary_obj <- subset(sezary_raw, orig.ident %in% CELL_LINES)
cat(sprintf("\n✅ Malignant cells retained: %d\n", ncol(sezary_obj)))
✅ Malignant cells retained: 40695
cat(sprintf(" Healthy controls excluded: CD4T_lab + CD4T_10x (%d cells)\n",
ncol(sezary_raw) - ncol(sezary_obj)))
Healthy controls excluded: CD4T_lab + CD4T_10x (8610 cells)
cat("\nPer cell line (cells to project):\n")
Per cell line (cells to project):
print(table(sezary_obj$orig.ident))
L1 L2 L3 L4 L5 L6 L7 CD4T_lab CD4T_10x
5825 5935 6428 6006 6022 5148 5331 0 0
# CRITICAL: Remove the full object to free RAM before MapQuery
# On RServer with shared memory, keeping sezary_raw in memory during
# MapQuery can cause the 46 GiB future.globals limit to be exceeded
rm(sezary_raw); gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4244977 226.8 8237565 440.0 8237565 440.0
Vcells 1385926744 10573.8 3049213733 23263.7 2626985949 20042.4
cat("\n✅ Raw object removed from memory — ready for MapQuery\n")
✅ Raw object removed from memory — ready for MapQuery
# ── Validate both objects ─────────────────────────────────────────────────
stopifnot(
"STATE_COL missing from cd4_ref" = STATE_COL %in% colnames(cd4_ref@meta.data),
"UMAP missing from cd4_ref" = UMAP_NAME %in% names(cd4_ref@reductions),
"milestone missing from cd4_ref" = "milestone" %in% colnames(cd4_ref@meta.data),
"monocle3_pseudotime_norm missing" = "monocle3_pseudotime_norm" %in% colnames(cd4_ref@meta.data),
"RNA missing from cd4_ref" = "RNA" %in% names(cd4_ref@assays),
"RNA missing from sezary_obj" = "RNA" %in% names(sezary_obj@assays),
"UMAP model missing" = !is.null(cd4_ref@reductions$umap@misc$model)
)
cat("✅ All validation checks passed\n")
✅ All validation checks passed
# Print reference trajectory summary for context
cat("\nReference CD4 T cell state distribution:\n")
Reference CD4 T cell state distribution:
print(table(cd4_ref@meta.data[[STATE_COL]]))
CD4 Naive CD4 TCM CD4 TEM CD4 Temra/CTL Treg
2037 9067 145 10 207
cat("\nReference milestone distribution:\n")
Reference milestone distribution:
print(table(cd4_ref@meta.data$milestone))
M00 M01 M02 M03 M04 M05 M06 M07
2037 388 8679 64 81 10 146 61
Reconstruct Trajectory
Infrastructure
# ── SAFE APPROACH: derive everything from the saved reference object ────────
# We do NOT hardcode edges — instead we reconstruct the graph from the
# actual milestone centroid positions in the saved cd4_ref object.
# This guarantees exact consistency with the reference trajectory.
# ── Step 1: Rebuild milestone centroids from reference metadata ────────────
umap_coords <- as.data.frame(Embeddings(cd4_ref, UMAP_NAME))
colnames(umap_coords) <- c("UMAP_1","UMAP_2")
umap_coords$milestone <- cd4_ref@meta.data$milestone
umap_coords$state <- cd4_ref@meta.data[[STATE_COL]]
milestone_centroids <- umap_coords %>%
filter(!is.na(milestone)) %>%
group_by(milestone) %>%
summarise(
UMAP_1 = mean(UMAP_1),
UMAP_2 = mean(UMAP_2),
n_cells = n(),
state = dplyr::first(state),
.groups = "drop"
) %>%
arrange(milestone)
milestone_order <- sort(unique(milestone_centroids$milestone))
milestone_centroids$order_pos <- match(milestone_centroids$milestone, milestone_order)
cat("Milestone centroids from saved reference:\n")
Milestone centroids from saved reference:
print(milestone_centroids[, c("milestone","state","UMAP_1","UMAP_2","n_cells")])
# ── Step 2: Verify milestone-to-state mapping matches expected biology ──────
expected_map <- c(
M00="CD4 Naive",
M01="CD4 TCM", M02="CD4 TCM",
M03="CD4 TEM", M04="CD4 TEM",
M05="CD4 Temra/CTL",
M06="Treg", M07="Treg"
)
actual_map <- setNames(milestone_centroids$state, milestone_centroids$milestone)
mismatches <- names(expected_map)[expected_map != actual_map[names(expected_map)]]
if (length(mismatches) > 0) {
cat("⚠️ WARNING — milestone/state mismatch at:", paste(mismatches, collapse=", "), "\n")
cat(" Expected:", expected_map[mismatches], "\n")
cat(" Actual: ", actual_map[mismatches], "\n")
stop("Fix milestone assignment before proceeding")
} else {
cat("✅ Milestone-to-state mapping verified\n")
}
✅ Milestone-to-state mapping verified
# ── Step 3: Build centroid matrix and distance matrix ──────────────────────
centroid_mat <- as.matrix(milestone_centroids[, c("UMAP_1","UMAP_2")])
rownames(centroid_mat) <- milestone_centroids$milestone
dist_mat <- as.matrix(dist(centroid_mat, method="euclidean"))
# ── Step 4: Rebuild MST with SAME manual topology as main pipeline ─────────
# Edge list derived from your confirmed milestone distribution:
# M00=Naive(2037) M01=TCM_early(388) M02=TCM_late/branch(8679)
# M03=TEM_early(64) M04=TEM_late(81) M05=Temra(10)
# M06=Treg_naive(146) M07=Treg_terminal(61)
edge_list_ordered <- matrix(c(
"M00","M01", # Naive → TCM early
"M01","M02", # TCM early → TCM late (branch point — largest TCM cluster)
"M02","M03", # TCM late → TEM early (effector arm)
"M03","M04", # TEM early → TEM late
"M04","M05", # TEM late → Temra/CTL (terminal effector)
"M02","M06", # TCM late → Treg naive (regulatory arm)
"M06","M07" # Treg naive → Treg terminal
), ncol=2, byrow=TRUE)
edge_weights <- apply(edge_list_ordered, 1, function(e) dist_mat[e[1], e[2]])
if (any(!is.finite(edge_weights))) {
stop("Non-finite edge weights — check milestone names match dist_mat rownames")
}
mst_graph <- graph_from_edgelist(edge_list_ordered, directed=FALSE)
E(mst_graph)$weight <- edge_weights
cat("\nMST edges reconstructed:\n")
MST edges reconstructed:
for (i in seq_len(nrow(edge_list_ordered))) {
s1 <- actual_map[edge_list_ordered[i,1]]
s2 <- actual_map[edge_list_ordered[i,2]]
cat(sprintf(" %s(%s) → %s(%s) dist=%.4f\n",
edge_list_ordered[i,1], s1,
edge_list_ordered[i,2], s2,
edge_weights[i]))
}
M00(CD4 Naive) → M01(CD4 TCM) dist=5.6851
M01(CD4 TCM) → M02(CD4 TCM) dist=5.4642
M02(CD4 TCM) → M03(CD4 TEM) dist=5.6416
M03(CD4 TEM) → M04(CD4 TEM) dist=1.4646
M04(CD4 TEM) → M05(CD4 Temra/CTL) dist=0.7915
M02(CD4 TCM) → M06(Treg) dist=3.5341
M06(Treg) → M07(Treg) dist=2.2321
# ── Step 5: Graph distances from root ─────────────────────────────────────
graph_dists <- distances(mst_graph, v="M00", weights=NA)
g_dists_vec <- setNames(as.numeric(graph_dists), colnames(graph_dists))
# CRITICAL: use same pseudotime range as reference for consistent 0-100 scale
mst_pt_range <- range(cd4_ref@meta.data$mst_pseudotime_norm, na.rm=TRUE)
cat(sprintf("\nReference pseudotime range: %.1f – %.1f\n",
mst_pt_range[1], mst_pt_range[2]))
Reference pseudotime range: 0.0 – 100.0
# ── Step 6: MST edge coordinates for plotting ──────────────────────────────
mst_edges <- as_edgelist(mst_graph)
mst_edge_df <- do.call(rbind, lapply(seq_len(nrow(mst_edges)), function(i) {
m1 <- mst_edges[i,1]; m2 <- mst_edges[i,2]
r1 <- milestone_centroids[milestone_centroids$milestone==m1,]
r2 <- milestone_centroids[milestone_centroids$milestone==m2,]
data.frame(x1=r1$UMAP_1, y1=r1$UMAP_2, x2=r2$UMAP_1, y2=r2$UMAP_2)
}))
# ── Step 7: Reference UMAP background ─────────────────────────────────────
ref_bg <- data.frame(
UMAP_1 = Embeddings(cd4_ref, UMAP_NAME)[,1],
UMAP_2 = Embeddings(cd4_ref, UMAP_NAME)[,2],
state = cd4_ref@meta.data[[STATE_COL]]
)
# ── Step 8: project_onto_mst function ─────────────────────────────────────
project_onto_mst <- function(cell_coords, centroid_mat, mst_graph, graph_dists) {
edges <- as_edgelist(mst_graph)
n_cells <- nrow(cell_coords)
pseudotime <- numeric(n_cells)
nearest_ms <- character(n_cells)
for (i in seq_len(n_cells)) {
cx <- cell_coords[i,1]; cy <- cell_coords[i,2]
best_dist <- Inf; best_pt <- 0; best_ms <- NA_character_
for (e in seq_len(nrow(edges))) {
m1 <- edges[e,1]; m2 <- edges[e,2]
p1 <- centroid_mat[m1,]; p2 <- centroid_mat[m2,]
dx <- p2[1]-p1[1]; dy <- p2[2]-p1[2]
len2 <- dx^2 + dy^2
if (len2 < 1e-10) next
t <- max(0, min(1, ((cx-p1[1])*dx + (cy-p1[2])*dy) / len2))
proj <- c(p1[1]+t*dx, p1[2]+t*dy)
d <- sqrt((cx-proj[1])^2 + (cy-proj[2])^2)
if (d < best_dist) {
best_dist <- d
best_pt <- graph_dists[m1] + t*(graph_dists[m2]-graph_dists[m1])
best_ms <- if (t < 0.5) m1 else m2
}
}
pseudotime[i] <- best_pt
nearest_ms[i] <- best_ms
}
list(pseudotime=pseudotime, nearest_ms=nearest_ms)
}
cat("\n✅ Trajectory infrastructure ready\n")
✅ Trajectory infrastructure ready
cat(sprintf("MST: %d nodes, %d edges | Root: M00 (CD4 Naive)\n",
vcount(mst_graph), ecount(mst_graph)))
MST: 8 nodes, 7 edges | Root: M00 (CD4 Naive)
The key differences from what I had before are that the milestone
verification step checks the actual state assignments from your object
against expected biology and stops with a clear error message if
anything is wrong, the comment on M02 now correctly notes it is the
large TCM cluster (8,679 cells) that acts as the branch
point, and the mst_pt_range is taken directly from the
saved reference pseudotime rather than being recalculated, guaranteeing
the 0–100 scale for Sézary cells is identical to the reference scale.
Everything downstream in the script stays unchanged.
MapQuery: Project
Sézary Cell Lines onto Reference
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# MapQuery is a Seurat function that performs reference-based cell mapping.
# It does two things simultaneously:
#
# 1. LABEL TRANSFER: Uses weighted nearest-neighbour anchors (cells that
# are mutually similar between reference and query in PCA space) to
# transfer cell type labels from reference → query. Each Sézary cell
# gets a predicted CD4 state label (Naive/TCM/TEM/Temra/Treg) with a
# confidence score (0-1).
#
# 2. UMAP PROJECTION: Projects Sézary cells into the reference UMAP
# coordinate system. This is NOT a new UMAP — it places each Sézary
# cell at the coordinates it would occupy IF it were part of the
# reference dataset. This allows direct visual comparison on the
# same UMAP space.
#
# WHY THIS MATTERS BIOLOGICALLY:
# The predicted state tells us which normal CD4 subtype the malignant cell
# most transcriptomically resembles. If a Sézary cell line projects to
# CD4 TCM space with high confidence, it means the malignant transcriptome
# most closely matches central memory T cells — implying TCM is the cell
# of origin or that the malignant transformation arrested differentiation
# at the TCM stage.
#
# TECHNICAL: We use plan("sequential") to disable parallelism because the
# reference object is ~46 GiB when exported to parallel workers, exceeding
# RServer memory limits. Sequential mode is slower but memory-safe.
# ──────────────────────────────────────────────────────────────────────────
plan("sequential")
options(future.globals.maxSize = 60 * 1024^3)
DefaultAssay(cd4_ref) <- "RNA"
DefaultAssay(sezary_obj) <- "RNA"
# FindVariableFeatures on reference ensures gene selection is current
cd4_ref <- FindVariableFeatures(cd4_ref, nfeatures=3000, verbose=FALSE)
cat("Finding transfer anchors...\n")
Finding transfer anchors...
cat("(This identifies cells that are mutually similar between\n")
(This identifies cells that are mutually similar between
cat(" reference and query in PCA space — the bridge for label transfer)\n\n")
reference and query in PCA space — the bridge for label transfer)
anchors <- FindTransferAnchors(
reference = cd4_ref,
query = sezary_obj,
normalization.method = "LogNormalize",
reference.reduction = "pca",
dims = 1:30,
reduction = "pcaproject",
verbose = FALSE
)
cat(sprintf("✅ Anchors found: %d anchor pairs\n", nrow(anchors@anchors)))
✅ Anchors found: 1499 anchor pairs
cat("\nRunning MapQuery (label transfer + UMAP projection)...\n")
Running MapQuery (label transfer + UMAP projection)...
sezary_obj <- MapQuery(
anchorset = anchors,
query = sezary_obj,
reference = cd4_ref,
refdata = list(
predicted_state = STATE_COL, # transfer CD4 state labels
predicted_milestone = "milestone" # transfer milestone labels
),
reference.reduction = "pca",
reduction.model = "umap", # project into reference UMAP space
verbose = FALSE
)
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
cat("✅ MapQuery complete\n\n")
✅ MapQuery complete
# ── Quality check ─────────────────────────────────────────────────────────
cat("=== PROJECTION QUALITY CHECK ===\n")
=== PROJECTION QUALITY CHECK ===
cat(sprintf("Cells projected: %d\n", ncol(sezary_obj)))
Cells projected: 40695
cat(sprintf("Mean prediction score: %.3f (>0.5 = reliable, >0.7 = confident)\n",
mean(sezary_obj$predicted.predicted_state.score, na.rm=TRUE)))
Mean prediction score: 0.962 (>0.5 = reliable, >0.7 = confident)
cat(sprintf("Cells with score >0.5: %d (%.1f%%)\n",
sum(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE),
100*mean(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE)))
Cells with score >0.5: 40591 (99.7%)
cat(sprintf("Cells with score >0.7: %d (%.1f%%)\n",
sum(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE),
100*mean(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE)))
Cells with score >0.7: 39584 (97.3%)
cat("\n=== PREDICTED STATE BY CELL LINE ===\n")
=== PREDICTED STATE BY CELL LINE ===
print(table(sezary_obj$predicted.predicted_state,
sezary_obj$orig.ident,
dnn=c("Predicted state","Cell line")))
Cell line
Predicted state L1 L2 L3 L4 L5 L6 L7 CD4T_lab CD4T_10x
CD4 Naive 6 0 0 3 0 0 1 0 0
CD4 TCM 5639 5933 6428 6003 6022 5148 5330 0 0
CD4 TEM 180 2 0 0 0 0 0 0 0
cat("\n=== PREDICTED MILESTONE BY CELL LINE ===\n")
=== PREDICTED MILESTONE BY CELL LINE ===
print(table(sezary_obj$predicted.predicted_milestone,
sezary_obj$orig.ident,
dnn=c("Predicted milestone","Cell line")))
Cell line
Predicted milestone L1 L2 L3 L4 L5 L6 L7 CD4T_lab CD4T_10x
M00 6 0 0 3 0 0 1 0 0
M01 0 0 1 1 0 0 0 0 0
M02 5726 5935 6427 6002 6022 5148 5330 0 0
M04 93 0 0 0 0 0 0 0 0
Assign Pseudotime to
Sézary Cell Lines
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We assign pseudotime to Sézary cells using TWO methods:
#
# METHOD 1 — MST projection:
# Each Sézary cell's ref.umap coordinates (from MapQuery) are projected
# onto the nearest edge of the MST. Pseudotime = interpolated graph distance
# from the root (M00 = CD4 Naive). This gives a biologically interpretable
# value: 0 = most naive-like, 100 = most terminally differentiated.
#
# METHOD 2 — KNN-weighted Monocle3 transfer:
# For each Sézary cell, we find its 15 nearest reference cells in UMAP space,
# then compute a weighted average of their Monocle3 pseudotime values.
# Weights are inverse distance (closer reference cells contribute more).
# This transfers the Monocle3 pseudotime scale without re-running Monocle3
# on the query — appropriate because Monocle3 cannot easily incorporate
# new query cells into an existing principal graph.
#
# INTERPRETATION:
# - High pseudotime (>50): malignant cells resemble mature/differentiated T cells
# - Low pseudotime (<25): malignant cells resemble naive or early memory T cells
# - This tells us the DIFFERENTIATION STAGE at which malignant transformation occurred
# ──────────────────────────────────────────────────────────────────────────
stopifnot("ref.umap missing — MapQuery may have failed" =
"ref.umap" %in% names(sezary_obj@reductions))
# Extract Sézary UMAP coordinates in reference space
sez_umap_coords <- as.matrix(Embeddings(sezary_obj, "ref.umap"))
colnames(sez_umap_coords) <- c("UMAP_1","UMAP_2")
# ── Method 1: MST pseudotime ──────────────────────────────────────────────
cat("Projecting Sézary cells onto MST...\n")
Projecting Sézary cells onto MST...
sez_mst <- project_onto_mst(sez_umap_coords, centroid_mat, mst_graph, g_dists_vec)
# ── Method 2: KNN-weighted Monocle3 pseudotime ───────────────────────────
cat("Transferring Monocle3 pseudotime via KNN (k=15)...\n")
Transferring Monocle3 pseudotime via KNN (k=15)...
ref_umap_mat <- Embeddings(cd4_ref, UMAP_NAME)
ref_m3_pt <- cd4_ref@meta.data$monocle3_pseudotime_norm
nn_result <- nn2(data=ref_umap_mat, query=sez_umap_coords, k=15)
inv_dist <- 1 / (nn_result$nn.dists + 1e-6)
weights <- inv_dist / rowSums(inv_dist)
# ── Assign to metadata ────────────────────────────────────────────────────
meta_sez <- sezary_obj@meta.data # Fixed: remove [mailto: ] syntax
meta_sez$mst_pseudotime <- pmax(0, pmin(100,
100 * (sez_mst$pseudotime - mst_pt_range[1]) /
(mst_pt_range[2] - mst_pt_range[1])
))
meta_sez$nearest_milestone_mst <- sez_mst$nearest_ms
meta_sez$monocle3_pseudotime <- rowSums(
weights * matrix(ref_m3_pt[nn_result$nn.idx],
nrow=nrow(nn_result$nn.idx))
)
sezary_obj@meta.data <- meta_sez # Fixed: remove [mailto: ] syntax
# ── Build plot dataframe with cell line info ──────────────────────────────
sez_plot_df <- data.frame(
UMAP_1 = sez_umap_coords[,"UMAP_1"],
UMAP_2 = sez_umap_coords[,"UMAP_2"],
cell_line = factor(sezary_obj$orig.ident, levels=CELL_LINE_ORDER),
predicted_state = sezary_obj$predicted.predicted_state,
pred_score = sezary_obj$predicted.predicted_state.score,
mst_pt = sezary_obj@meta.data$mst_pseudotime,
m3_pt = sezary_obj@meta.data$monocle3_pseudotime,
milestone_ref = sezary_obj@meta.data$nearest_milestone_mst
)
cat(sprintf("✅ Pseudotime assigned to %d cells\n", nrow(sez_plot_df)))
✅ Pseudotime assigned to 40695 cells
# FIXED: Correct sprintf() usage for range vectors
mst_range <- range(sez_plot_df$mst_pt, na.rm=TRUE)
m3_range <- range(sez_plot_df$m3_pt, na.rm=TRUE)
cat(sprintf("MST pseudotime range: %.1f – %.1f\n", mst_range[1], mst_range[2]))
MST pseudotime range: 0.0 – 5.0
cat(sprintf("Monocle3 pseudotime range: %.1f – %.1f\n", m3_range[1], m3_range[2]))
Monocle3 pseudotime range: 1.9 – 99.3
cat("\nMST pseudotime per cell line (median):\n")
MST pseudotime per cell line (median):
sez_plot_df %>%
group_by(cell_line) %>%
summarise(median_mst = round(median(mst_pt, na.rm=TRUE), 1),
median_m3 = round(median(m3_pt, na.rm=TRUE), 1),
.groups="drop") %>%
print()
NA
Visualisation 1 — All
Cell Lines Together


Visualisation 2 —
Predicted State per Cell Line

# ── Stacked bar: proportion of cells per state per line ───────────────────
p_state_stack <- ggplot(state_per_line,
aes(x=cell_line, y=pct, fill=predicted_state)) +
geom_col(width=.75, colour="white", linewidth=.3) +
geom_text(aes(label=ifelse(pct >= 5, paste0(pct,"%"), "")),
position=position_stack(vjust=0.5),
size=3.2, colour="white", fontface="bold") +
scale_fill_manual(values=STATE_COLORS, name="Predicted CD4 state") +
scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
theme_classic() +
labs(x="Cell line", y="% of cells",
title="Predicted CD4 state composition per Sézary cell line",
subtitle="Each bar = 100% of that cell line's cells | Labels shown for proportions ≥5%") +
theme(plot.title = element_text(size=13, face="bold"),
axis.text.x = element_text(size=12, face="bold"),
legend.title = element_text(size=10, face="bold"))
print(p_state_stack)
# ── Print summary table ───────────────────────────────────────────────────
cat("\nState proportions per cell line:\n")
State proportions per cell line:
state_per_line %>%
select(cell_line, predicted_state, n, pct) %>%
arrange(cell_line, desc(pct)) %>%
kable(col.names=c("Cell line","Predicted state","N cells","% cells"),
caption="Predicted CD4 state composition per Sézary cell line") %>%
kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) %>%
print()
Predicted CD4 state composition per Sézary cell line
| Cell line |
Predicted state |
N cells |
% cells |
| L1 |
CD4 TCM |
5639 |
96.8 |
| L1 |
CD4 TEM |
180 |
3.1 |
| L1 |
CD4 Naive |
6 |
0.1 |
| L2 |
CD4 TCM |
5933 |
100.0 |
| L2 |
CD4 TEM |
2 |
0.0 |
| L3 |
CD4 TCM |
6428 |
100.0 |
| L4 |
CD4 TCM |
6003 |
100.0 |
| L4 |
CD4 Naive |
3 |
0.0 |
| L5 |
CD4 TCM |
6022 |
100.0 |
| L6 |
CD4 TCM |
5148 |
100.0 |
| L7 |
CD4 TCM |
5330 |
100.0 |
| L7 |
CD4 Naive |
1 |
0.0 |
NULL
Visualisation 3 —
Pseudotime per Cell Line
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Pseudotime measures DIFFERENTIATION MATURITY on a 0-100 scale where:
# 0 = most naive-like (undifferentiated, CD4 Naive = M00)
# 100 = most terminally differentiated (Temra or terminal Treg)
#
# For Sézary cells, the pseudotime distribution tells us:
#
# LOW pseudotime (0-30):
# Malignant cells resemble undifferentiated/early memory cells.
# Suggests transformation occurred at an early differentiation stage.
# These tumours may express naive/TCM survival factors (IL-7R, CCR7).
#
# INTERMEDIATE pseudotime (30-60):
# Cells resemble central/transitional memory T cells (TCM stage).
# Most SS literature suggests this — Sézary cells often retain CCR4,
# CD27, and other TCM markers while losing CCR7.
#
# HIGH pseudotime (60-100):
# Cells resemble TEM or Temra — unusual for SS as most SS cells
# do not express high levels of GZMB, PRF1, or CX3CR1.
# If a line maps here, it may represent a more cytotoxic variant.
#
# Comparing lines: if L1 has pseudotime ~30 and L5 has pseudotime ~60,
# this means L5 cells are more differentiated than L1 cells, possibly
# explaining differences in disease aggressiveness or drug sensitivity.
# ──────────────────────────────────────────────────────────────────────────
# ── Panel A: MST pseudotime violin per cell line ──────────────────────────
p_violin_mst <- ggplot(sez_plot_df,
aes(x=cell_line, y=mst_pt, fill=cell_line)) +
geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
geom_boxplot(width=.08, fill="white", outlier.size=.8,
outlier.alpha=.4, colour="grey30") +
# Reference lines for biological interpretation
geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
annotate("text", x=0.4, y=16, label="Naive/\nearly TCM", size=2.8,
colour="grey40", fontface="italic", hjust=0) +
annotate("text", x=0.4, y=50, label="TCM/\nearly TEM", size=2.8,
colour="grey40", fontface="italic", hjust=0) +
annotate("text", x=0.4, y=83, label="TEM/\nTemra/Treg", size=2.8,
colour="grey40", fontface="italic", hjust=0) +
scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
theme_classic() +
labs(x="Cell line", y="MST pseudotime (0–100)",
title="Differentiation maturity per Sézary cell line (Custom MST)",
subtitle="Dashed lines = trajectory thirds | 0=Naive, 100=Temra/Terminal Treg") +
theme(plot.title = element_text(size=12, face="bold"),
axis.text.x = element_text(size=11, face="bold"))
# ── Panel B: Monocle3 pseudotime violin ───────────────────────────────────
p_violin_m3 <- ggplot(sez_plot_df,
aes(x=cell_line, y=m3_pt, fill=cell_line)) +
geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
geom_boxplot(width=.08, fill="white", outlier.size=.8,
outlier.alpha=.4, colour="grey30") +
geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
theme_classic() +
labs(x="Cell line", y="Monocle3 pseudotime (0–100)",
title="Differentiation maturity per Sézary cell line (Monocle3)",
subtitle="KNN-weighted transfer from 15 nearest reference neighbours") +
theme(plot.title = element_text(size=12, face="bold"),
axis.text.x = element_text(size=11, face="bold"))
print(p_violin_mst / p_violin_m3)

Visualisation 4 —
Nearest Milestone per Cell Line
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Nearest milestone is the most PRECISE cell of origin indicator because
# it places cells at a specific point on the differentiation path, not just
# a broad state category. The milestone system has 8 positions:
#
# M00 = CD4 Naive (root)
# M01 = TCM early (CCR7+, S100A4+, AQP3 low)
# M02 = TCM late (branch point — CCR4+, ITGB1+) ← KEY BRANCH POINT
# M03 = TEM early (GZMK+, CCL5+, CXCR3+)
# M04 = TEM late (EOMES+, HOPX+)
# M05 = Temra/CTL (GZMB+, PRF1+, CX3CR1+, NKG7+)
# M06 = Treg naive (FOXP3+, IL2RA+, IKZF2+)
# M07 = Treg terminal (CTLA4+, TIGIT+, TNFRSF18+)
#
# M02 is the BRANCH POINT — cells at M02 are at the decision point between
# becoming effector memory (TEM arm) or regulatory (Treg arm). If Sézary
# cells cluster at M02, this is biologically very interesting — it suggests
# the malignant transformation captured a cell that was "deciding" its fate.
# ──────────────────────────────────────────────────────────────────────────
# Milestone labels with biological annotation
milestone_labels <- c(
"M00" = "M00\nNaive",
"M01" = "M01\nTCM early",
"M02" = "M02\nTCM late\n(branch)",
"M03" = "M03\nTEM early",
"M04" = "M04\nTEM late",
"M05" = "M05\nTemra/CTL",
"M06" = "M06\nTreg naive",
"M07" = "M07\nTreg terminal"
)
# Compute milestone distribution per cell line
ms_per_line <- sez_plot_df %>%
filter(!is.na(milestone_ref)) %>%
count(cell_line, milestone_ref) %>%
group_by(cell_line) %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
ungroup() %>%
mutate(
milestone_ref = factor(milestone_ref, levels=milestone_order),
ms_state = milestone_centroids$state[
match(milestone_ref, milestone_centroids$milestone)]
)
# ── Heatmap-style tile plot: cell line × milestone ────────────────────────
p_ms_heatmap <- ggplot(ms_per_line,
aes(x=milestone_ref, y=cell_line, fill=pct)) +
geom_tile(colour="white", linewidth=.5) +
geom_text(aes(label=ifelse(pct >= 2, paste0(pct,"%"), "")),
size=3.2, fontface="bold",
colour=ifelse(ms_per_line$pct > 40, "white", "grey20")) +
scale_fill_gradientn(
colours = c("#F7FBFF","#DEEBF7","#9ECAE1","#4292C6","#08519C","#08306B"),
name = "% of cell\nline cells",
limits = c(0,100)
) +
scale_x_discrete(labels=milestone_labels) +
theme_classic() +
labs(x="Reference milestone", y="Sézary cell line",
title="Nearest reference milestone distribution per cell line",
subtitle="Darker = more cells mapping to that milestone | M02 = branch point between effector and regulatory arms") +
theme(plot.title = element_text(size=13, face="bold"),
axis.text.x = element_text(size=9),
axis.text.y = element_text(size=11, face="bold"),
legend.title = element_text(size=9))
print(p_ms_heatmap)

# ── Stacked bar: milestone per cell line ──────────────────────────────────
ms_state_colors <- setNames(
STATE_COLORS[milestone_centroids$state[match(milestone_order, milestone_centroids$milestone)]],
milestone_order
)
p_ms_stack <- ggplot(ms_per_line,
aes(x=cell_line, y=pct, fill=milestone_ref)) +
geom_col(width=.75, colour="white", linewidth=.3) +
geom_text(aes(label=ifelse(pct >= 5, paste0(milestone_ref,"\n",pct,"%"), "")),
position=position_stack(vjust=0.5),
size=2.8, colour="white", fontface="bold") +
scale_fill_manual(values=ms_state_colors,
labels=milestone_labels,
name="Milestone") +
scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
theme_classic() +
labs(x="Cell line", y="% of cells",
title="Milestone composition per cell line",
subtitle="Colour = CD4 state | Label = milestone ID and % | shown if ≥5%") +
theme(plot.title = element_text(size=13, face="bold"),
axis.text.x = element_text(size=12, face="bold"))
print(p_ms_stack)

Visualisation 5 —
Individual Cell Line UMAP Panels
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This produces a detailed per-cell-line plot showing:
# - Top panel: cells coloured by predicted state (what state they resemble)
# - Bottom panel: cells coloured by MST pseudotime (how mature they are)
#
# For each cell line, you want to ask:
# 1. Is the cell line HOMOGENEOUS (all cells in one state/region) or
# HETEROGENEOUS (scattered across multiple states)?
# 2. Does the cell line sit closer to the root (low pseudotime = naive-like)
# or the tips (high pseudotime = mature effector or Treg-like)?
# 3. Are there TWO subpopulations within one line? This could indicate
# clonal evolution or technical mixing.
# ──────────────────────────────────────────────────────────────────────────
plot_list <- list()
for (cl in CELL_LINE_ORDER) {
df_cl <- sez_plot_df %>% filter(cell_line == cl)
n_cl <- nrow(df_cl)
# State plot for this cell line
p_state <- ggplot() +
geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
colour="grey40", linewidth=.6, alpha=.6) +
geom_point(data=df_cl,
aes(x=UMAP_1, y=UMAP_2, colour=predicted_state),
size=1.2, alpha=.85) +
geom_text_repel(data=milestone_centroids,
aes(x=UMAP_1, y=UMAP_2, label=milestone),
size=2, colour="grey20", fontface="bold",
max.overlaps=8, segment.size=.2) +
scale_colour_manual(values=STATE_COLORS, name="State",
drop=FALSE, na.value="grey40") +
theme_classic() +
labs(x="UMAP-1", y="UMAP-2",
title=sprintf("%s — Predicted state (n=%s cells)",
cl, scales::comma(n_cl))) +
theme(plot.title = element_text(size=10, face="bold",
colour=CELL_LINE_COLORS[cl]),
legend.text = element_text(size=7),
legend.title= element_text(size=8))
# Pseudotime plot for this cell line
p_pt <- ggplot() +
geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
colour="grey40", linewidth=.6, alpha=.6) +
geom_point(data=df_cl,
aes(x=UMAP_1, y=UMAP_2, colour=mst_pt),
size=1.2, alpha=.9) +
geom_text_repel(data=milestone_centroids,
aes(x=UMAP_1, y=UMAP_2, label=milestone),
size=2, colour="grey40", fontface="bold",
max.overlaps=8, segment.size=.2) +
scale_colour_gradientn(
colours = c("#0D0887","#6A00A8","#B12A90","#E16462","#FCA636","#F0F921"),
name="Pseudotime\n(0–100)", limits=c(0,100)) +
theme_classic() +
labs(x="UMAP-1", y="UMAP-2",
title=sprintf("%s — MST pseudotime", cl)) +
theme(plot.title = element_text(size=10, face="bold",
colour=CELL_LINE_COLORS[cl]),
legend.text = element_text(size=7),
legend.title= element_text(size=8))
plot_list[[paste0(cl,"_state")]] <- p_state
plot_list[[paste0(cl,"_pt")]] <- p_pt
}
# Print in 2-column layout (state | pseudotime) per line
for (cl in CELL_LINE_ORDER) {
print(plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]])
}







Visualisation 6 —
Method Agreement per Cell Line
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We compare MST and Monocle3 pseudotime PER CELL LINE to assess robustness.
# If both methods agree on a cell line's pseudotime position, the result is
# robust. If they disagree for a specific line, that line may sit in a
# trajectory region where the two methods handle geometry differently
# (e.g. near the branch point M02).
# ──────────────────────────────────────────────────────────────────────────
# Per-line Spearman correlation
line_cors <- sez_plot_df %>%
group_by(cell_line) %>%
summarise(
spearman = round(cor(mst_pt, m3_pt, method="spearman",
use="complete.obs"), 3),
n = n(),
.groups = "drop"
)
cat("Method agreement (Spearman ρ) per cell line:\n")
Method agreement (Spearman ρ) per cell line:
print(line_cors)
p_method_scatter <- ggplot(sez_plot_df,
aes(x=mst_pt, y=m3_pt, colour=cell_line)) +
geom_point(size=.5, alpha=.4) +
geom_abline(slope=1, intercept=0, linetype="dashed",
colour="grey40", linewidth=.7) +
geom_smooth(method="lm", se=FALSE, linewidth=.8) +
geom_text(data=line_cors,
aes(x=5, y=95, label=sprintf("ρ=%.2f", spearman)),
size=3, fontface="bold", hjust=0, colour="grey20") +
scale_colour_manual(values=CELL_LINE_COLORS, guide="none") +
facet_wrap(~cell_line, ncol=4) +
theme_classic() +
labs(x="Custom MST pseudotime (0–100)",
y="Monocle3 pseudotime (0–100)",
title="Method agreement per cell line: Custom MST vs Monocle3 pseudotime",
subtitle="Dashed = perfect agreement | ρ = Spearman correlation | Higher ρ = more reliable pseudotime assignment") +
theme(plot.title = element_text(size=12, face="bold"),
strip.text = element_text(size=10, face="bold"),
strip.background = element_rect(fill="grey95"))
print(p_method_scatter)

Summary Table —
Complete Cell Line Analysis
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This comprehensive table is your primary result table for the paper.
# Key columns to interpret:
#
# dominant_state: the most common predicted CD4 state for that cell line
# dominant_state_pct: how confident — if >80%, strong cell of origin signal
# mean_pred_score: MapQuery confidence — >0.7 is reliable
# mst_mean_pt: average differentiation maturity (0=naive, 100=terminal)
# dominant_milestone: the single most common trajectory position
#
# For the paper, you will want to state something like:
# "X of 7 cell lines predominantly mapped to CD4 TCM (milestones M01-M02),
# with mean MST pseudotime of Y ± Z, indicating that Sézary syndrome
# cell lines transcriptomically resemble central memory CD4 T cells at
# an intermediate stage of differentiation."
# ──────────────────────────────────────────────────────────────────────────
sez_summary <- sez_plot_df %>%
group_by(cell_line) %>%
summarise(
n_cells = n(),
mean_pred_score = round(mean(pred_score, na.rm=TRUE), 3),
high_conf_pct = round(100 * mean(pred_score > 0.7, na.rm=TRUE), 1),
dominant_state = names(sort(table(predicted_state), decreasing=TRUE))[1],
dominant_state_pct = round(100 * max(table(predicted_state)) / n(), 1),
dominant_milestone = names(sort(table(milestone_ref), decreasing=TRUE))[1],
mst_mean_pt = round(mean(mst_pt, na.rm=TRUE), 1),
mst_sd_pt = round(sd(mst_pt, na.rm=TRUE), 1),
mst_median_pt = round(median(mst_pt, na.rm=TRUE), 1),
m3_mean_pt = round(mean(m3_pt, na.rm=TRUE), 1),
m3_sd_pt = round(sd(m3_pt, na.rm=TRUE), 1),
.groups = "drop"
) %>%
arrange(mst_mean_pt)
kable(sez_summary,
col.names = c("Cell line","N cells","Mean pred.\nscore",
"% score\n>0.7",
"Dominant\nstate","Dominant\nstate %",
"Dominant\nmilestone",
"MST mean\npt","MST SD",
"MST median\npt",
"M3 mean\npt","M3 SD"),
caption = "Complete per cell line trajectory summary — cell of origin analysis") %>%
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=TRUE) %>%
column_spec(5, bold=TRUE) %>%
column_spec(6, bold=TRUE) %>%
column_spec(8:10, color="white", background=METHOD_COLORS["Custom MST"]) %>%
column_spec(11:12, color="white", background=METHOD_COLORS["Monocle3"])
Complete per cell line trajectory summary — cell of origin analysis
| Cell line |
N cells |
Mean pred.
scor |
| % score
>0 |
7|Dominant
sta |
e | Dominant
sta |
e %|Dominant
miles |
one | MST m |
an
pt| |
ST SD| MST me |
ian
pt| M3 |
mean
p |
| L2 |
5935 |
0.986 |
99.7 |
CD4 TCM |
100.0 |
M03 |
3.1 |
0.6 |
2.7 |
61.6 |
6.0 |
| L1 |
5825 |
0.841 |
82.4 |
CD4 TCM |
96.8 |
M02 |
3.2 |
1.0 |
2.7 |
72.3 |
21.9 |
| L3 |
6428 |
0.974 |
99.8 |
CD4 TCM |
100.0 |
M03 |
3.2 |
0.7 |
2.7 |
61.8 |
9.1 |
| L5 |
6022 |
0.967 |
99.8 |
CD4 TCM |
100.0 |
M03 |
3.2 |
0.7 |
2.7 |
59.6 |
6.5 |
| L6 |
5148 |
0.989 |
99.8 |
CD4 TCM |
100.0 |
M07 |
3.5 |
0.6 |
4.0 |
61.9 |
9.6 |
| L7 |
5331 |
0.990 |
99.9 |
CD4 TCM |
100.0 |
M07 |
3.5 |
0.7 |
4.0 |
62.3 |
10.5 |
| L4 |
6006 |
0.988 |
99.5 |
CD4 TCM |
100.0 |
M07 |
3.7 |
0.6 |
4.0 |
59.2 |
5.5 |
# ── Per-line milestone distribution table ─────────────────────────────────
ms_dist_all <- sez_plot_df %>%
filter(!is.na(milestone_ref)) %>%
count(cell_line, milestone_ref) %>%
left_join(milestone_centroids[,c("milestone","state","order_pos")],
by=c("milestone_ref"="milestone")) %>%
filter(!is.na(order_pos)) %>%
group_by(cell_line) %>%
mutate(pct = round(100 * n / sum(n), 1)) %>%
ungroup() %>%
mutate(milestone_ref = factor(milestone_ref, levels=milestone_order)) %>%
arrange(cell_line, order_pos)
cat("\n\nMilestone distribution per cell line:\n")
Milestone distribution per cell line:
ms_dist_all %>%
select(cell_line, milestone_ref, state, n, pct) %>%
kable(col.names=c("Cell line","Milestone","State","N cells","% cells"),
caption="Per cell line milestone distribution") %>%
kable_styling(bootstrap_options=c("striped","condensed"), full_width=FALSE) %>%
print()
Per cell line milestone distribution
| Cell line |
Milestone |
State |
N cells |
% cells |
| L1 |
M00 |
CD4 Naive |
142 |
2.4 |
| L1 |
M01 |
CD4 TCM |
2 |
0.0 |
| L1 |
M02 |
CD4 TCM |
1515 |
26.0 |
| L1 |
M03 |
CD4 TEM |
1443 |
24.8 |
| L1 |
M04 |
CD4 TEM |
1375 |
23.6 |
| L1 |
M05 |
CD4 Temra/CTL |
290 |
5.0 |
| L1 |
M06 |
Treg |
7 |
0.1 |
| L1 |
M07 |
Treg |
1051 |
18.0 |
| L2 |
M00 |
CD4 Naive |
4 |
0.1 |
| L2 |
M01 |
CD4 TCM |
3 |
0.1 |
| L2 |
M02 |
CD4 TCM |
53 |
0.9 |
| L2 |
M03 |
CD4 TEM |
3877 |
65.3 |
| L2 |
M04 |
CD4 TEM |
61 |
1.0 |
| L2 |
M05 |
CD4 Temra/CTL |
20 |
0.3 |
| L2 |
M06 |
Treg |
7 |
0.1 |
| L2 |
M07 |
Treg |
1910 |
32.2 |
| L3 |
M00 |
CD4 Naive |
7 |
0.1 |
| L3 |
M01 |
CD4 TCM |
4 |
0.1 |
| L3 |
M02 |
CD4 TCM |
508 |
7.9 |
| L3 |
M03 |
CD4 TEM |
3316 |
51.6 |
| L3 |
M04 |
CD4 TEM |
9 |
0.1 |
| L3 |
M06 |
Treg |
13 |
0.2 |
| L3 |
M07 |
Treg |
2571 |
40.0 |
| L4 |
M00 |
CD4 Naive |
42 |
0.7 |
| L4 |
M01 |
CD4 TCM |
1 |
0.0 |
| L4 |
M02 |
CD4 TCM |
32 |
0.5 |
| L4 |
M03 |
CD4 TEM |
1278 |
21.3 |
| L4 |
M04 |
CD4 TEM |
1 |
0.0 |
| L4 |
M06 |
Treg |
10 |
0.2 |
| L4 |
M07 |
Treg |
4642 |
77.3 |
| L5 |
M00 |
CD4 Naive |
4 |
0.1 |
| L5 |
M01 |
CD4 TCM |
1 |
0.0 |
| L5 |
M02 |
CD4 TCM |
262 |
4.4 |
| L5 |
M03 |
CD4 TEM |
3295 |
54.7 |
| L5 |
M04 |
CD4 TEM |
3 |
0.0 |
| L5 |
M07 |
Treg |
2457 |
40.8 |
| L6 |
M00 |
CD4 Naive |
1 |
0.0 |
| L6 |
M01 |
CD4 TCM |
1 |
0.0 |
| L6 |
M02 |
CD4 TCM |
41 |
0.8 |
| L6 |
M03 |
CD4 TEM |
1774 |
34.5 |
| L6 |
M04 |
CD4 TEM |
110 |
2.1 |
| L6 |
M05 |
CD4 Temra/CTL |
2 |
0.0 |
| L6 |
M06 |
Treg |
46 |
0.9 |
| L6 |
M07 |
Treg |
3173 |
61.6 |
| L7 |
M00 |
CD4 Naive |
2 |
0.0 |
| L7 |
M01 |
CD4 TCM |
1 |
0.0 |
| L7 |
M02 |
CD4 TCM |
61 |
1.1 |
| L7 |
M03 |
CD4 TEM |
1901 |
35.7 |
| L7 |
M04 |
CD4 TEM |
1 |
0.0 |
| L7 |
M06 |
Treg |
12 |
0.2 |
| L7 |
M07 |
Treg |
3353 |
62.9 |
NULL
Prediction Score
Quality Check

Save All Outputs
# Create output directories if they don't exist
dir.create("Objects", showWarnings=FALSE)
dir.create("Figures", showWarnings=FALSE)
dir.create("Tables", showWarnings=FALSE)
# ── RDS objects ────────────────────────────────────────────────────────────
saveRDS(sezary_obj, "Objects/sezary_celllines_projected.rds")
saveRDS(mst_graph, "Objects/mst_graph.rds")
saveRDS(milestone_centroids, "Objects/milestone_centroids.rds")
# ── CSV tables ─────────────────────────────────────────────────────────────
write.csv(sez_summary, "Tables/cellline_trajectory_summary.csv", row.names=FALSE)
write.csv(state_per_line, "Tables/cellline_state_proportions.csv", row.names=FALSE)
write.csv(ms_dist_all, "Tables/cellline_milestone_distribution.csv",row.names=FALSE)
write.csv(line_cors, "Tables/cellline_method_agreement.csv", row.names=FALSE)
write.csv(sez_plot_df, "Tables/cellline_all_cells_metadata.csv", row.names=FALSE)
# ── Figures ────────────────────────────────────────────────────────────────
ggsave("Figures/fig_all_lines_umap.pdf", p_all_lines, width=14, height=12)
ggsave("Figures/fig_faceted_lines.pdf", p_faceted, width=18, height=10)
ggsave("Figures/fig_state_per_line.pdf", p_state_stack, width=14, height=6)
ggsave("Figures/fig_pseudotime_mst.pdf", p_violin_mst, width=14, height=6)
ggsave("Figures/fig_pseudotime_m3.pdf", p_violin_m3, width=14, height=6)
ggsave("Figures/fig_milestone_heatmap.pdf", p_ms_heatmap, width=14, height=7)
ggsave("Figures/fig_milestone_stack.pdf", p_ms_stack, width=14, height=7)
ggsave("Figures/fig_method_agreement.pdf", p_method_scatter,width=14, height=6)
ggsave("Figures/fig_prediction_scores.pdf", p_scores, width=12, height=5)
# Individual line plots
for (cl in CELL_LINE_ORDER) {
p_combined <- plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]]
ggsave(sprintf("Figures/fig_%s_umap.pdf", cl),
p_combined, width=16, height=7)
}
cat("✅ All outputs saved\n")
✅ All outputs saved
cat(" Objects/ — RDS files\n")
Objects/ — RDS files
cat(" Figures/ — PDF plots\n")
Figures/ — PDF plots
cat(" Tables/ — CSV data files\n")
Tables/ — CSV data files
Biological
Interpretation Guide
╔══════════════════════════════════════════════════════════════════════╗
║ BIOLOGICAL INTERPRETATION GUIDE FOR YOUR RESULTS ║
╚══════════════════════════════════════════════════════════════════════╝
STEP 1: CHECK DOMINANT STATE (sez_summary table, dominant_state column)
─────────────────────────────────────────────────────────────────────────
• If most lines show CD4 TCM (dominant_state_pct >70%):
→ CONCLUSION: Sézary syndrome arises from Central Memory T cells
→ IMPLICATION: TCM survival signals (IL-7, IL-15, CCR7-independent
homing) drive malignant cell persistence
→ LITERATURE: Consistent with Wang et al. 2023, Woollard et al. 2022
• If most lines show CD4 TEM:
→ CONCLUSION: Transformation occurred at the effector memory stage
→ IMPLICATION: Malignant cells have acquired partial effector
programmes (GZMK, CCL5) but not terminal cytotoxicity
• If lines split between TCM and TEM:
→ CONCLUSION: Disease heterogeneity — different patients have
different cells of origin
→ IMPLICATION: May explain clinical variability in drug response
STEP 2: CHECK DOMINANT MILESTONE (dominant_milestone column)
─────────────────────────────────────────────────────────────────────────
• M01 or M02 dominant → early-to-late TCM (most likely for SS)
• M02 dominant (branch point) → arrested at fate decision point
• M03 or M04 dominant → TEM commitment despite TCM-like label
• M06 dominant → Treg-like features (check FOXP3 expression)
STEP 3: CHECK PREDICTION SCORES (mean_pred_score column)
─────────────────────────────────────────────────────────────────────────
• >0.7: High confidence — strong cell of origin conclusion
• 0.5-0.7: Moderate — mention in limitations
• <0.5: Low — malignant transcriptome substantially deviates from
all normal reference states; discuss as 'partial resemblance'
STEP 4: CHECK PSEUDOTIME (mst_mean_pt column)
─────────────────────────────────────────────────────────────────────────
• 0-33: Naive/early TCM differentiation stage
• 33-66: TCM/TEM transition stage
• 66-100: Mature effector or Treg stage
• Compare across lines: if L1 pseudotime < L5 pseudotime, L5 is more
mature/differentiated — may correlate with aggressiveness
STEP 5: LOOK AT THE FACETED UMAP
─────────────────────────────────────────────────────────────────────────
• Tight clusters in one region → homogeneous cell line
• Scattered cells → heterogeneous line (may have subclones)
• Lines overlapping the same UMAP region → same cell of origin
• Lines in different UMAP regions → different cell of origin
---
title: "Sézary Cell Line Projection onto CD4 T Cell Reference Trajectory"
subtitle: "Per cell line trajectory mapping | Cell of origin analysis | L1–L7 Sézary lines"
author: "NASIR MAHMOOD ABBASI"
date: "`r Sys.Date()`"
output:
  html_notebook:
    number_sections: true
    toc: true
    toc_float:
      collapsed: false
    theme: journal
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo      = TRUE,
  message   = FALSE,
  warning   = FALSE,
  fig.align = "center",
  dpi       = 150
)
```

------------------------------------------------------------------------

# Biological Background & Rationale

## What is Sézary Syndrome?

Sézary syndrome (SS) is an aggressive **cutaneous T-cell lymphoma (CTCL)** characterised
by the triad of erythroderma (full-body skin redness), lymphadenopathy, and circulating
malignant T cells called **Sézary cells**. These are CD4+ T cells with cerebriform
(brain-like wrinkled) nuclei that have undergone malignant transformation.

The central unresolved question in SS biology is: **which normal CD4 T cell subtype
gave rise to the malignant clone?** This is the "cell of origin" question.
Answering it matters because:

- It tells us **which transcription factors and signalling pathways** were active
  when transformation occurred
- It predicts **which normal T cell programmes** are retained or corrupted in the
  malignant cells (explaining why SS cells home to skin, suppress immunity, etc.)
- It guides **therapeutic targeting** — if SS arises from TCM cells, targeting
  memory T cell survival signals (e.g. IL-15/IL-7 signalling) is rational

## Why Project onto a Trajectory?

A trajectory analysis places cells along a **continuous differentiation path**
rather than forcing them into discrete categories. Normal CD4 T cells differentiate
from Naive → TCM → TEM → Temra (effector arm) or Naive → TCM → Treg (regulatory arm).

When we project Sézary cell lines onto this trajectory, we can ask:

1. **Where on the trajectory do malignant cells land?** (cell of origin state)
2. **How far along differentiation are they?** (pseudotime = differentiation maturity)
3. **Are different cell lines (L1–L7) at the same or different positions?**
   (heterogeneity in cell of origin across the disease)
4. **Do all cells within a line occupy the same position, or is there intra-line
   heterogeneity?** (clonal evolution vs stable malignant state)

## Why Analyse Each Cell Line Separately?

You have 7 Sézary cell lines (L1–L7) derived from different patients. Each cell line
represents a **biologically independent malignant clone**. Key reasons to analyse them
separately:

- If all 7 lines map to the same trajectory position, this supports a **universal
  cell of origin** for SS (strong biological conclusion)
- If lines differ, this indicates **disease heterogeneity** — different patients
  may have different cells of origin, explaining clinical variability in drug response
- Comparing cell lines that are drug-sensitive vs drug-resistant (if known) to their
  trajectory position can reveal whether **differentiation state predicts drug response**

------------------------------------------------------------------------

# Setup

```{r libraries}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the same libraries as the reference trajectory script.
# RANN provides fast k-nearest-neighbour search (nn2) which we use to transfer
# Monocle3 pseudotime from reference cells to query Sézary cells by weighted
# averaging of their 15 nearest reference neighbours.
# ──────────────────────────────────────────────────────────────────────────

suppressPackageStartupMessages({
  library(Seurat)
  library(SeuratWrappers)
  library(igraph)
  library(RANN)
  library(ggplot2)
  library(ggrepel)
  library(dplyr)
  library(tidyr)
  library(patchwork)
  library(scales)
  library(RColorBrewer)
  library(knitr)
  library(kableExtra)
  library(future)
})

# ── Colour palettes ────────────────────────────────────────────────────────
# These match the reference trajectory script exactly for visual consistency
STATE_COLORS <- c(
  "CD4 Naive"     = "#4472C4",
  "CD4 TCM"       = "#70AD47",
  "CD4 TEM"       = "#ED7D31",
  "CD4 Temra/CTL" = "#C00000",
  "Treg"          = "#7030A0"
)

# 7 visually distinct colours for L1-L7
# Using a combination of Set1 and Dark2 palettes for maximum distinctiveness
CELL_LINE_COLORS <- c(
  "L1" = "#E41A1C",   # red
  "L2" = "#377EB8",   # blue
  "L3" = "#4DAF4A",   # green
  "L4" = "#FF7F00",   # orange
  "L5" = "#984EA3",   # purple
  "L6" = "#A65628",   # brown
  "L7" = "#F781BF"    # pink
)

METHOD_COLORS <- c(
  "Monocle3"   = "#e74c3c",
  "Custom MST" = "#2980b9"
)

STATE_ORDER    <- c("CD4 Naive","CD4 TCM","CD4 TEM","CD4 Temra/CTL","Treg")
CELL_LINE_ORDER <- paste0("L", 1:7)
UMAP_NAME      <- "umap"
STATE_COL      <- "predicted.celltype.l2"

cat("✅ Setup complete\n")
cat("Cell line colours:\n"); print(CELL_LINE_COLORS)
```

------------------------------------------------------------------------

# Load Objects & Subset to Malignant Cells

```{r load-objects}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the pre-computed reference object (cd4_ref_dual_trajectory.rds)
# which already contains:
#   - MST graph, milestone centroids, pseudotime for all reference cells
#   - Monocle3 pseudotime
#   - All trajectory infrastructure built in the main pipeline
#
# CRITICAL: We subset to L1-L7 ONLY, excluding CD4T_lab and CD4T_10x.
# CD4T_lab and CD4T_10x are healthy donor CD4 T cells included as controls
# in the original dataset. Projecting them onto the reference would be
# circular (healthy cells → healthy reference = trivially perfect mapping)
# and would NOT answer the cell of origin question.
#
# We want ONLY the malignant Sézary cells (L1-L7) for projection.
# ──────────────────────────────────────────────────────────────────────────

# Load pre-computed reference with all trajectory objects
cd4_ref <- readRDS("../cd4_ref_dual_trajectory.rds")

# Load the full Sézary object
sezary_raw <- readRDS("/home/nabbasi/apollo_home/1-Seurat_RDS_OBJECT_FINAL/All_samples_Merged_with_Renamed_Clusters_Cell_state-03-12-2025.rds.rds")

cat("Full Sézary object loaded:", ncol(sezary_raw), "cells\n")
cat("\norig.ident breakdown (ALL samples):\n")
print(table(sezary_raw$orig.ident))

# ── Subset to malignant cell lines only ───────────────────────────────────
CELL_LINES <- paste0("L", 1:7)
sezary_obj <- subset(sezary_raw, orig.ident %in% CELL_LINES)

cat(sprintf("\n✅ Malignant cells retained: %d\n", ncol(sezary_obj)))
cat(sprintf("   Healthy controls excluded: CD4T_lab + CD4T_10x (%d cells)\n",
            ncol(sezary_raw) - ncol(sezary_obj)))
cat("\nPer cell line (cells to project):\n")
print(table(sezary_obj$orig.ident))

# CRITICAL: Remove the full object to free RAM before MapQuery
# On RServer with shared memory, keeping sezary_raw in memory during
# MapQuery can cause the 46 GiB future.globals limit to be exceeded
rm(sezary_raw); gc()
cat("\n✅ Raw object removed from memory — ready for MapQuery\n")

# ── Validate both objects ─────────────────────────────────────────────────
stopifnot(
  "STATE_COL missing from cd4_ref"   = STATE_COL %in% colnames(cd4_ref@meta.data),
  "UMAP missing from cd4_ref"        = UMAP_NAME %in% names(cd4_ref@reductions),
  "milestone missing from cd4_ref"   = "milestone" %in% colnames(cd4_ref@meta.data),
  "monocle3_pseudotime_norm missing" = "monocle3_pseudotime_norm" %in% colnames(cd4_ref@meta.data),
  "RNA missing from cd4_ref"         = "RNA" %in% names(cd4_ref@assays),
  "RNA missing from sezary_obj"      = "RNA" %in% names(sezary_obj@assays),
  "UMAP model missing"               = !is.null(cd4_ref@reductions$umap@misc$model)
)
cat("✅ All validation checks passed\n")

# Print reference trajectory summary for context
cat("\nReference CD4 T cell state distribution:\n")
print(table(cd4_ref@meta.data[[STATE_COL]]))
cat("\nReference milestone distribution:\n")
print(table(cd4_ref@meta.data$milestone))
```

------------------------------------------------------------------------

# Reconstruct Trajectory Infrastructure

```{r reconstruct-trajectory}
# ── SAFE APPROACH: derive everything from the saved reference object ────────
# We do NOT hardcode edges — instead we reconstruct the graph from the
# actual milestone centroid positions in the saved cd4_ref object.
# This guarantees exact consistency with the reference trajectory.

# ── Step 1: Rebuild milestone centroids from reference metadata ────────────
umap_coords <- as.data.frame(Embeddings(cd4_ref, UMAP_NAME))
colnames(umap_coords) <- c("UMAP_1","UMAP_2")
umap_coords$milestone <- cd4_ref@meta.data$milestone
umap_coords$state     <- cd4_ref@meta.data[[STATE_COL]]

milestone_centroids <- umap_coords %>%
  filter(!is.na(milestone)) %>%
  group_by(milestone) %>%
  summarise(
    UMAP_1  = mean(UMAP_1),
    UMAP_2  = mean(UMAP_2),
    n_cells = n(),
    state   = dplyr::first(state),
    .groups = "drop"
  ) %>%
  arrange(milestone)

milestone_order           <- sort(unique(milestone_centroids$milestone))
milestone_centroids$order_pos <- match(milestone_centroids$milestone, milestone_order)

cat("Milestone centroids from saved reference:\n")
print(milestone_centroids[, c("milestone","state","UMAP_1","UMAP_2","n_cells")])

# ── Step 2: Verify milestone-to-state mapping matches expected biology ──────
expected_map <- c(
  M00="CD4 Naive",
  M01="CD4 TCM", M02="CD4 TCM",
  M03="CD4 TEM", M04="CD4 TEM",
  M05="CD4 Temra/CTL",
  M06="Treg",    M07="Treg"
)

actual_map <- setNames(milestone_centroids$state, milestone_centroids$milestone)
mismatches <- names(expected_map)[expected_map != actual_map[names(expected_map)]]

if (length(mismatches) > 0) {
  cat("⚠️  WARNING — milestone/state mismatch at:", paste(mismatches, collapse=", "), "\n")
  cat("   Expected:", expected_map[mismatches], "\n")
  cat("   Actual:  ", actual_map[mismatches], "\n")
  stop("Fix milestone assignment before proceeding")
} else {
  cat("✅ Milestone-to-state mapping verified\n")
}

# ── Step 3: Build centroid matrix and distance matrix ──────────────────────
centroid_mat <- as.matrix(milestone_centroids[, c("UMAP_1","UMAP_2")])
rownames(centroid_mat) <- milestone_centroids$milestone
dist_mat <- as.matrix(dist(centroid_mat, method="euclidean"))

# ── Step 4: Rebuild MST with SAME manual topology as main pipeline ─────────
# Edge list derived from your confirmed milestone distribution:
# M00=Naive(2037) M01=TCM_early(388) M02=TCM_late/branch(8679)
# M03=TEM_early(64) M04=TEM_late(81) M05=Temra(10)
# M06=Treg_naive(146) M07=Treg_terminal(61)
edge_list_ordered <- matrix(c(
  "M00","M01",   # Naive → TCM early
  "M01","M02",   # TCM early → TCM late (branch point — largest TCM cluster)
  "M02","M03",   # TCM late → TEM early  (effector arm)
  "M03","M04",   # TEM early → TEM late
  "M04","M05",   # TEM late → Temra/CTL  (terminal effector)
  "M02","M06",   # TCM late → Treg naive (regulatory arm)
  "M06","M07"    # Treg naive → Treg terminal
), ncol=2, byrow=TRUE)

edge_weights <- apply(edge_list_ordered, 1, function(e) dist_mat[e[1], e[2]])

if (any(!is.finite(edge_weights))) {
  stop("Non-finite edge weights — check milestone names match dist_mat rownames")
}

mst_graph <- graph_from_edgelist(edge_list_ordered, directed=FALSE)
E(mst_graph)$weight <- edge_weights

cat("\nMST edges reconstructed:\n")
for (i in seq_len(nrow(edge_list_ordered))) {
  s1 <- actual_map[edge_list_ordered[i,1]]
  s2 <- actual_map[edge_list_ordered[i,2]]
  cat(sprintf("  %s(%s) → %s(%s)  dist=%.4f\n",
              edge_list_ordered[i,1], s1,
              edge_list_ordered[i,2], s2,
              edge_weights[i]))
}

# ── Step 5: Graph distances from root ─────────────────────────────────────
graph_dists  <- distances(mst_graph, v="M00", weights=NA)
g_dists_vec  <- setNames(as.numeric(graph_dists), colnames(graph_dists))

# CRITICAL: use same pseudotime range as reference for consistent 0-100 scale
mst_pt_range <- range(cd4_ref@meta.data$mst_pseudotime_norm, na.rm=TRUE)
cat(sprintf("\nReference pseudotime range: %.1f – %.1f\n",
            mst_pt_range[1], mst_pt_range[2]))

# ── Step 6: MST edge coordinates for plotting ──────────────────────────────
mst_edges   <- as_edgelist(mst_graph)
mst_edge_df <- do.call(rbind, lapply(seq_len(nrow(mst_edges)), function(i) {
  m1 <- mst_edges[i,1]; m2 <- mst_edges[i,2]
  r1 <- milestone_centroids[milestone_centroids$milestone==m1,]
  r2 <- milestone_centroids[milestone_centroids$milestone==m2,]
  data.frame(x1=r1$UMAP_1, y1=r1$UMAP_2, x2=r2$UMAP_1, y2=r2$UMAP_2)
}))

# ── Step 7: Reference UMAP background ─────────────────────────────────────
ref_bg <- data.frame(
  UMAP_1 = Embeddings(cd4_ref, UMAP_NAME)[,1],
  UMAP_2 = Embeddings(cd4_ref, UMAP_NAME)[,2],
  state  = cd4_ref@meta.data[[STATE_COL]]
)

# ── Step 8: project_onto_mst function ─────────────────────────────────────
project_onto_mst <- function(cell_coords, centroid_mat, mst_graph, graph_dists) {
  edges      <- as_edgelist(mst_graph)
  n_cells    <- nrow(cell_coords)
  pseudotime <- numeric(n_cells)
  nearest_ms <- character(n_cells)

  for (i in seq_len(n_cells)) {
    cx <- cell_coords[i,1]; cy <- cell_coords[i,2]
    best_dist <- Inf; best_pt <- 0; best_ms <- NA_character_

    for (e in seq_len(nrow(edges))) {
      m1 <- edges[e,1]; m2 <- edges[e,2]
      p1 <- centroid_mat[m1,]; p2 <- centroid_mat[m2,]
      dx <- p2[1]-p1[1]; dy <- p2[2]-p1[2]
      len2 <- dx^2 + dy^2
      if (len2 < 1e-10) next
      t    <- max(0, min(1, ((cx-p1[1])*dx + (cy-p1[2])*dy) / len2))
      proj <- c(p1[1]+t*dx, p1[2]+t*dy)
      d    <- sqrt((cx-proj[1])^2 + (cy-proj[2])^2)
      if (d < best_dist) {
        best_dist <- d
        best_pt   <- graph_dists[m1] + t*(graph_dists[m2]-graph_dists[m1])
        best_ms   <- if (t < 0.5) m1 else m2
      }
    }
    pseudotime[i] <- best_pt
    nearest_ms[i] <- best_ms
  }
  list(pseudotime=pseudotime, nearest_ms=nearest_ms)
}

cat("\n✅ Trajectory infrastructure ready\n")
cat(sprintf("MST: %d nodes, %d edges | Root: M00 (CD4 Naive)\n",
            vcount(mst_graph), ecount(mst_graph)))
```

The key differences from what I had before are that the milestone verification step checks the actual state assignments from your object against expected biology and stops with a clear error message if anything is wrong, the comment on M02 now correctly notes it is the **large** TCM cluster (8,679 cells) that acts as the branch point, and the `mst_pt_range` is taken directly from the saved reference pseudotime rather than being recalculated, guaranteeing the 0–100 scale for Sézary cells is identical to the reference scale. Everything downstream in the script stays unchanged.

------------------------------------------------------------------------

# MapQuery: Project Sézary Cell Lines onto Reference

```{r sezary-mapquery}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# MapQuery is a Seurat function that performs reference-based cell mapping.
# It does two things simultaneously:
#
# 1. LABEL TRANSFER: Uses weighted nearest-neighbour anchors (cells that
#    are mutually similar between reference and query in PCA space) to
#    transfer cell type labels from reference → query. Each Sézary cell
#    gets a predicted CD4 state label (Naive/TCM/TEM/Temra/Treg) with a
#    confidence score (0-1).
#
# 2. UMAP PROJECTION: Projects Sézary cells into the reference UMAP
#    coordinate system. This is NOT a new UMAP — it places each Sézary
#    cell at the coordinates it would occupy IF it were part of the
#    reference dataset. This allows direct visual comparison on the
#    same UMAP space.
#
# WHY THIS MATTERS BIOLOGICALLY:
# The predicted state tells us which normal CD4 subtype the malignant cell
# most transcriptomically resembles. If a Sézary cell line projects to
# CD4 TCM space with high confidence, it means the malignant transcriptome
# most closely matches central memory T cells — implying TCM is the cell
# of origin or that the malignant transformation arrested differentiation
# at the TCM stage.
#
# TECHNICAL: We use plan("sequential") to disable parallelism because the
# reference object is ~46 GiB when exported to parallel workers, exceeding
# RServer memory limits. Sequential mode is slower but memory-safe.
# ──────────────────────────────────────────────────────────────────────────

plan("sequential")
options(future.globals.maxSize = 60 * 1024^3)

DefaultAssay(cd4_ref)    <- "RNA"
DefaultAssay(sezary_obj) <- "RNA"

# FindVariableFeatures on reference ensures gene selection is current
cd4_ref <- FindVariableFeatures(cd4_ref, nfeatures=3000, verbose=FALSE)

cat("Finding transfer anchors...\n")
cat("(This identifies cells that are mutually similar between\n")
cat(" reference and query in PCA space — the bridge for label transfer)\n\n")

anchors <- FindTransferAnchors(
  reference            = cd4_ref,
  query                = sezary_obj,
  normalization.method = "LogNormalize",
  reference.reduction  = "pca",
  dims                 = 1:30,
  reduction            = "pcaproject",
  verbose              = FALSE
)

cat(sprintf("✅ Anchors found: %d anchor pairs\n", nrow(anchors@anchors)))
cat("\nRunning MapQuery (label transfer + UMAP projection)...\n")

sezary_obj <- MapQuery(
  anchorset           = anchors,
  query               = sezary_obj,
  reference           = cd4_ref,
  refdata             = list(
    predicted_state     = STATE_COL,    # transfer CD4 state labels
    predicted_milestone = "milestone"   # transfer milestone labels
  ),
  reference.reduction = "pca",
  reduction.model     = "umap",         # project into reference UMAP space
  verbose             = FALSE
)

cat("✅ MapQuery complete\n\n")

# ── Quality check ─────────────────────────────────────────────────────────
cat("=== PROJECTION QUALITY CHECK ===\n")
cat(sprintf("Cells projected: %d\n", ncol(sezary_obj)))
cat(sprintf("Mean prediction score: %.3f (>0.5 = reliable, >0.7 = confident)\n",
            mean(sezary_obj$predicted.predicted_state.score, na.rm=TRUE)))
cat(sprintf("Cells with score >0.5: %d (%.1f%%)\n",
            sum(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE),
            100*mean(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE)))
cat(sprintf("Cells with score >0.7: %d (%.1f%%)\n",
            sum(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE),
            100*mean(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE)))

cat("\n=== PREDICTED STATE BY CELL LINE ===\n")
print(table(sezary_obj$predicted.predicted_state,
            sezary_obj$orig.ident,
            dnn=c("Predicted state","Cell line")))

cat("\n=== PREDICTED MILESTONE BY CELL LINE ===\n")
print(table(sezary_obj$predicted.predicted_milestone,
            sezary_obj$orig.ident,
            dnn=c("Predicted milestone","Cell line")))
```

------------------------------------------------------------------------

# Assign Pseudotime to Sézary Cell Lines

```{r sezary-pseudotime}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We assign pseudotime to Sézary cells using TWO methods:
#
# METHOD 1 — MST projection:
# Each Sézary cell's ref.umap coordinates (from MapQuery) are projected
# onto the nearest edge of the MST. Pseudotime = interpolated graph distance
# from the root (M00 = CD4 Naive). This gives a biologically interpretable
# value: 0 = most naive-like, 100 = most terminally differentiated.
#
# METHOD 2 — KNN-weighted Monocle3 transfer:
# For each Sézary cell, we find its 15 nearest reference cells in UMAP space,
# then compute a weighted average of their Monocle3 pseudotime values.
# Weights are inverse distance (closer reference cells contribute more).
# This transfers the Monocle3 pseudotime scale without re-running Monocle3
# on the query — appropriate because Monocle3 cannot easily incorporate
# new query cells into an existing principal graph.
#
# INTERPRETATION:
# - High pseudotime (>50): malignant cells resemble mature/differentiated T cells
# - Low pseudotime (<25): malignant cells resemble naive or early memory T cells
# - This tells us the DIFFERENTIATION STAGE at which malignant transformation occurred
# ──────────────────────────────────────────────────────────────────────────

stopifnot("ref.umap missing — MapQuery may have failed" =
          "ref.umap" %in% names(sezary_obj@reductions))

# Extract Sézary UMAP coordinates in reference space
sez_umap_coords           <- as.matrix(Embeddings(sezary_obj, "ref.umap"))
colnames(sez_umap_coords) <- c("UMAP_1","UMAP_2")

# ── Method 1: MST pseudotime ──────────────────────────────────────────────
cat("Projecting Sézary cells onto MST...\n")
sez_mst <- project_onto_mst(sez_umap_coords, centroid_mat, mst_graph, g_dists_vec)

# ── Method 2: KNN-weighted Monocle3 pseudotime ───────────────────────────
cat("Transferring Monocle3 pseudotime via KNN (k=15)...\n")
ref_umap_mat <- Embeddings(cd4_ref, UMAP_NAME)
ref_m3_pt    <- cd4_ref@meta.data$monocle3_pseudotime_norm

nn_result <- nn2(data=ref_umap_mat, query=sez_umap_coords, k=15)
inv_dist  <- 1 / (nn_result$nn.dists + 1e-6)
weights   <- inv_dist / rowSums(inv_dist)

# ── Assign to metadata ────────────────────────────────────────────────────
meta_sez <- sezary_obj@meta.data  # Fixed: remove [mailto: ] syntax

meta_sez$mst_pseudotime <- pmax(0, pmin(100,
  100 * (sez_mst$pseudotime - mst_pt_range[1]) /
         (mst_pt_range[2]  - mst_pt_range[1])
))
meta_sez$nearest_milestone_mst <- sez_mst$nearest_ms
meta_sez$monocle3_pseudotime   <- rowSums(
  weights * matrix(ref_m3_pt[nn_result$nn.idx],
                   nrow=nrow(nn_result$nn.idx))
)

sezary_obj@meta.data <- meta_sez  # Fixed: remove [mailto: ] syntax

# ── Build plot dataframe with cell line info ──────────────────────────────
sez_plot_df <- data.frame(
  UMAP_1          = sez_umap_coords[,"UMAP_1"],
  UMAP_2          = sez_umap_coords[,"UMAP_2"],
  cell_line       = factor(sezary_obj$orig.ident, levels=CELL_LINE_ORDER),
  predicted_state = sezary_obj$predicted.predicted_state,
  pred_score      = sezary_obj$predicted.predicted_state.score,
  mst_pt          = sezary_obj@meta.data$mst_pseudotime,
  m3_pt           = sezary_obj@meta.data$monocle3_pseudotime,
  milestone_ref   = sezary_obj@meta.data$nearest_milestone_mst
)

cat(sprintf("✅ Pseudotime assigned to %d cells\n", nrow(sez_plot_df)))

# FIXED: Correct sprintf() usage for range vectors
mst_range <- range(sez_plot_df$mst_pt, na.rm=TRUE)
m3_range  <- range(sez_plot_df$m3_pt, na.rm=TRUE)

cat(sprintf("MST pseudotime range:      %.1f – %.1f\n", mst_range[1], mst_range[2]))
cat(sprintf("Monocle3 pseudotime range: %.1f – %.1f\n", m3_range[1], m3_range[2]))

cat("\nMST pseudotime per cell line (median):\n")
sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(median_mst = round(median(mst_pt, na.rm=TRUE), 1),
            median_m3  = round(median(m3_pt,  na.rm=TRUE), 1),
            .groups="drop") %>%
  print()

```

------------------------------------------------------------------------

# Visualisation 1 — All Cell Lines Together

```{r all-lines-umap, fig.width=8, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This overview plot shows ALL 7 cell lines simultaneously on the reference
# UMAP. This answers: "As a group, where do Sézary cell lines map?"
# The faceted version shows each line individually for detailed comparison.
#
# What to look for:
# - CLUSTERING: Do all cell lines cluster together (same cell of origin)?
#   Or do they scatter to different regions (heterogeneous cell of origin)?
# - OVERLAP WITH REFERENCE STATES: Which grey reference region do the
#   coloured Sézary cells overlap with?
# - MILESTONE PROXIMITY: Which milestone labels (M00-M07) are the cells
#   nearest to? This gives the most precise trajectory position.
# ──────────────────────────────────────────────────────────────────────────

# ── Panel A: All lines on reference UMAP ──────────────────────────────────
p_all_lines <- ggplot() +
  # Reference cells as grey background — shows the normal CD4 landscape
  geom_point(data = ref_bg[sample(nrow(ref_bg)),],
             aes(x=UMAP_1, y=UMAP_2), colour="grey88", size=.25, alpha=.4) +
  # MST trajectory lines
  geom_segment(data = mst_edge_df,
               aes(x=x1, y=y1, xend=x2, yend=y2),
               colour="grey30", linewidth=1, alpha=.8, lineend="round") +
  # Sézary cells coloured by cell line
  geom_point(data = sez_plot_df[sample(nrow(sez_plot_df)),],
             aes(x=UMAP_1, y=UMAP_2, colour=cell_line),
             size=1.5, alpha=.75) +
  # Milestone labels with state annotation
  geom_label_repel(data = milestone_centroids,
                   aes(x=UMAP_1, y=UMAP_2,
                       label=paste0(milestone, "\n", state)),
                   size=2.5, fontface="bold", fill="white",
                   colour="grey15", label.size=0.2,
                   max.overlaps=20, box.padding=0.5) +
  scale_colour_manual(values=CELL_LINE_COLORS, name="Cell line") +
  theme_classic() +
  labs(x="UMAP-1", y="UMAP-2",
       title="Sézary cell lines — projection onto healthy CD4 T cell reference",
       subtitle="Grey = healthy CD4 reference (11,466 cells) | Coloured = Sézary cell lines (L1–L7) | Lines = MST trajectory") +
  theme(plot.title    = element_text(size=14, face="bold"),
        plot.subtitle = element_text(size=10, colour="grey40"),
        legend.title  = element_text(size=11, face="bold"),
        legend.text   = element_text(size=10)) +
  guides(colour = guide_legend(override.aes=list(size=4, alpha=1)))

print(p_all_lines)
```




```{r faceted-lines, fig.width=13, fig.height=9}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Faceting by cell line is the most important visualisation for cell of
# origin analysis. Each panel shows ONE cell line projected onto the
# same reference landscape. This allows you to:
# 1. See exactly where each line's cells concentrate on the UMAP
# 2. Compare lines directly — do L1 and L2 overlap? Do L5 and L6 differ?
# 3. Identify intra-line heterogeneity — is the line a single tight cluster
#    or scattered across multiple states?
# ──────────────────────────────────────────────────────────────────────────

p_faceted <- ggplot() +
  geom_point(data = ref_bg[sample(nrow(ref_bg)),],
             aes(x=UMAP_1, y=UMAP_2), colour="grey88", size=.2, alpha=.35) +
  geom_segment(data = mst_edge_df,
               aes(x=x1, y=y1, xend=x2, yend=y2),
               colour="grey40", linewidth=.7, alpha=.7, lineend="round") +
  geom_point(data = sez_plot_df,
             aes(x=UMAP_1, y=UMAP_2, colour=cell_line),
             size=1, alpha=.8) +
  geom_text_repel(data = milestone_centroids,
                  aes(x=UMAP_1, y=UMAP_2, label=milestone),
                  size=2.2, fontface="bold", colour="grey20",
                  max.overlaps=8, segment.size=.2, box.padding=0.3) +
  scale_colour_manual(values=CELL_LINE_COLORS, guide="none") +
  facet_wrap(~cell_line, ncol=4) +
  theme_classic() +
  labs(x="UMAP-1", y="UMAP-2",
       title="Per cell line projection — each panel shows one Sézary cell line on the CD4 reference",
       subtitle="Grey = healthy CD4 reference | Coloured = that cell line's cells | M00=Naive, M01-M02=TCM, M03-M04=TEM, M05=Temra, M06-M07=Treg") +
  theme(plot.title   = element_text(size=13, face="bold"),
        strip.text   = element_text(size=12, face="bold"),
        strip.background = element_rect(fill="grey95", colour="grey70"))

print(p_faceted)

```

------------------------------------------------------------------------

# Visualisation 2 — Predicted State per Cell Line

```{r state-per-line, fig.width=10, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# The stacked bar chart shows what PROPORTION of each cell line's cells
# map to each CD4 state. This is the primary cell of origin summary.
#
# Ideal outcome for a strong conclusion:
# - ONE state dominates (>70%) across most cell lines → clear cell of origin
# - e.g. if L1-L7 all show >80% CD4 TCM → Sézary originates from TCM cells
#
# Mixed outcome (biologically also informative):
# - Some lines show TCM dominance, others TEM → disease heterogeneity
# - This would suggest different patients have different cells of origin,
#   possibly correlating with different clinical presentations or drug responses
# ──────────────────────────────────────────────────────────────────────────

# Compute per-line state proportions
state_per_line <- sez_plot_df %>%
  filter(!is.na(predicted_state)) %>%
  count(cell_line, predicted_state) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(predicted_state = factor(predicted_state, levels=STATE_ORDER))

# ── Stacked bar: proportion of cells per state per line ───────────────────
p_state_stack <- ggplot(state_per_line,
                         aes(x=cell_line, y=pct, fill=predicted_state)) +
  geom_col(width=.75, colour="white", linewidth=.3) +
  geom_text(aes(label=ifelse(pct >= 5, paste0(pct,"%"), "")),
            position=position_stack(vjust=0.5),
            size=3.2, colour="white", fontface="bold") +
  scale_fill_manual(values=STATE_COLORS, name="Predicted CD4 state") +
  scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
  theme_classic() +
  labs(x="Cell line", y="% of cells",
       title="Predicted CD4 state composition per Sézary cell line",
       subtitle="Each bar = 100% of that cell line's cells | Labels shown for proportions ≥5%") +
  theme(plot.title  = element_text(size=13, face="bold"),
        axis.text.x = element_text(size=12, face="bold"),
        legend.title = element_text(size=10, face="bold"))

print(p_state_stack)

# ── Print summary table ───────────────────────────────────────────────────
cat("\nState proportions per cell line:\n")
state_per_line %>%
  select(cell_line, predicted_state, n, pct) %>%
  arrange(cell_line, desc(pct)) %>%
  kable(col.names=c("Cell line","Predicted state","N cells","% cells"),
        caption="Predicted CD4 state composition per Sézary cell line") %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) %>%
  print()
```

------------------------------------------------------------------------

# Visualisation 3 — Pseudotime per Cell Line

```{r pseudotime-per-line, fig.width=8, fig.height=10}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Pseudotime measures DIFFERENTIATION MATURITY on a 0-100 scale where:
#   0  = most naive-like (undifferentiated, CD4 Naive = M00)
#   100 = most terminally differentiated (Temra or terminal Treg)
#
# For Sézary cells, the pseudotime distribution tells us:
#
# LOW pseudotime (0-30):
#   Malignant cells resemble undifferentiated/early memory cells.
#   Suggests transformation occurred at an early differentiation stage.
#   These tumours may express naive/TCM survival factors (IL-7R, CCR7).
#
# INTERMEDIATE pseudotime (30-60):
#   Cells resemble central/transitional memory T cells (TCM stage).
#   Most SS literature suggests this — Sézary cells often retain CCR4,
#   CD27, and other TCM markers while losing CCR7.
#
# HIGH pseudotime (60-100):
#   Cells resemble TEM or Temra — unusual for SS as most SS cells
#   do not express high levels of GZMB, PRF1, or CX3CR1.
#   If a line maps here, it may represent a more cytotoxic variant.
#
# Comparing lines: if L1 has pseudotime ~30 and L5 has pseudotime ~60,
# this means L5 cells are more differentiated than L1 cells, possibly
# explaining differences in disease aggressiveness or drug sensitivity.
# ──────────────────────────────────────────────────────────────────────────

# ── Panel A: MST pseudotime violin per cell line ──────────────────────────
p_violin_mst <- ggplot(sez_plot_df,
                        aes(x=cell_line, y=mst_pt, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.8,
               outlier.alpha=.4, colour="grey30") +
  # Reference lines for biological interpretation
  geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
  geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
  annotate("text", x=0.4, y=16,  label="Naive/\nearly TCM", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  annotate("text", x=0.4, y=50,  label="TCM/\nearly TEM", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  annotate("text", x=0.4, y=83,  label="TEM/\nTemra/Treg", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
  theme_classic() +
  labs(x="Cell line", y="MST pseudotime (0–100)",
       title="Differentiation maturity per Sézary cell line (Custom MST)",
       subtitle="Dashed lines = trajectory thirds | 0=Naive, 100=Temra/Terminal Treg") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

# ── Panel B: Monocle3 pseudotime violin ───────────────────────────────────
p_violin_m3 <- ggplot(sez_plot_df,
                       aes(x=cell_line, y=m3_pt, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.8,
               outlier.alpha=.4, colour="grey30") +
  geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
  geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
  theme_classic() +
  labs(x="Cell line", y="Monocle3 pseudotime (0–100)",
       title="Differentiation maturity per Sézary cell line (Monocle3)",
       subtitle="KNN-weighted transfer from 15 nearest reference neighbours") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

print(p_violin_mst / p_violin_m3)
```

------------------------------------------------------------------------

# Visualisation 4 — Nearest Milestone per Cell Line

```{r milestone-per-line, fig.width=12, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Nearest milestone is the most PRECISE cell of origin indicator because
# it places cells at a specific point on the differentiation path, not just
# a broad state category. The milestone system has 8 positions:
#
#   M00 = CD4 Naive (root)
#   M01 = TCM early  (CCR7+, S100A4+, AQP3 low)
#   M02 = TCM late   (branch point — CCR4+, ITGB1+) ← KEY BRANCH POINT
#   M03 = TEM early  (GZMK+, CCL5+, CXCR3+)
#   M04 = TEM late   (EOMES+, HOPX+)
#   M05 = Temra/CTL  (GZMB+, PRF1+, CX3CR1+, NKG7+)
#   M06 = Treg naive (FOXP3+, IL2RA+, IKZF2+)
#   M07 = Treg terminal (CTLA4+, TIGIT+, TNFRSF18+)
#
# M02 is the BRANCH POINT — cells at M02 are at the decision point between
# becoming effector memory (TEM arm) or regulatory (Treg arm). If Sézary
# cells cluster at M02, this is biologically very interesting — it suggests
# the malignant transformation captured a cell that was "deciding" its fate.
# ──────────────────────────────────────────────────────────────────────────

# Milestone labels with biological annotation
milestone_labels <- c(
  "M00" = "M00\nNaive",
  "M01" = "M01\nTCM early",
  "M02" = "M02\nTCM late\n(branch)",
  "M03" = "M03\nTEM early",
  "M04" = "M04\nTEM late",
  "M05" = "M05\nTemra/CTL",
  "M06" = "M06\nTreg naive",
  "M07" = "M07\nTreg terminal"
)

# Compute milestone distribution per cell line
ms_per_line <- sez_plot_df %>%
  filter(!is.na(milestone_ref)) %>%
  count(cell_line, milestone_ref) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(
    milestone_ref = factor(milestone_ref, levels=milestone_order),
    ms_state = milestone_centroids$state[
      match(milestone_ref, milestone_centroids$milestone)]
  )

# ── Heatmap-style tile plot: cell line × milestone ────────────────────────
p_ms_heatmap <- ggplot(ms_per_line,
                        aes(x=milestone_ref, y=cell_line, fill=pct)) +
  geom_tile(colour="white", linewidth=.5) +
  geom_text(aes(label=ifelse(pct >= 2, paste0(pct,"%"), "")),
            size=3.2, fontface="bold",
            colour=ifelse(ms_per_line$pct > 40, "white", "grey20")) +
  scale_fill_gradientn(
    colours = c("#F7FBFF","#DEEBF7","#9ECAE1","#4292C6","#08519C","#08306B"),
    name    = "% of cell\nline cells",
    limits  = c(0,100)
  ) +
  scale_x_discrete(labels=milestone_labels) +
  theme_classic() +
  labs(x="Reference milestone", y="Sézary cell line",
       title="Nearest reference milestone distribution per cell line",
       subtitle="Darker = more cells mapping to that milestone | M02 = branch point between effector and regulatory arms") +
  theme(plot.title   = element_text(size=13, face="bold"),
        axis.text.x  = element_text(size=9),
        axis.text.y  = element_text(size=11, face="bold"),
        legend.title = element_text(size=9))

print(p_ms_heatmap)

# ── Stacked bar: milestone per cell line ──────────────────────────────────
ms_state_colors <- setNames(
  STATE_COLORS[milestone_centroids$state[match(milestone_order, milestone_centroids$milestone)]],
  milestone_order
)

p_ms_stack <- ggplot(ms_per_line,
                      aes(x=cell_line, y=pct, fill=milestone_ref)) +
  geom_col(width=.75, colour="white", linewidth=.3) +
  geom_text(aes(label=ifelse(pct >= 5, paste0(milestone_ref,"\n",pct,"%"), "")),
            position=position_stack(vjust=0.5),
            size=2.8, colour="white", fontface="bold") +
  scale_fill_manual(values=ms_state_colors,
                    labels=milestone_labels,
                    name="Milestone") +
  scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
  theme_classic() +
  labs(x="Cell line", y="% of cells",
       title="Milestone composition per cell line",
       subtitle="Colour = CD4 state | Label = milestone ID and % | shown if ≥5%") +
  theme(plot.title  = element_text(size=13, face="bold"),
        axis.text.x = element_text(size=12, face="bold"))

print(p_ms_stack)
```

------------------------------------------------------------------------

# Visualisation 5 — Individual Cell Line UMAP Panels

```{r individual-line-plots, fig.width=10, fig.height=4}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This produces a detailed per-cell-line plot showing:
# - Top panel: cells coloured by predicted state (what state they resemble)
# - Bottom panel: cells coloured by MST pseudotime (how mature they are)
#
# For each cell line, you want to ask:
# 1. Is the cell line HOMOGENEOUS (all cells in one state/region) or
#    HETEROGENEOUS (scattered across multiple states)?
# 2. Does the cell line sit closer to the root (low pseudotime = naive-like)
#    or the tips (high pseudotime = mature effector or Treg-like)?
# 3. Are there TWO subpopulations within one line? This could indicate
#    clonal evolution or technical mixing.
# ──────────────────────────────────────────────────────────────────────────

plot_list <- list()

for (cl in CELL_LINE_ORDER) {
  df_cl <- sez_plot_df %>% filter(cell_line == cl)
  n_cl  <- nrow(df_cl)

  # State plot for this cell line
  p_state <- ggplot() +
    geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
               aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
    geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
                 colour="grey40", linewidth=.6, alpha=.6) +
    geom_point(data=df_cl,
               aes(x=UMAP_1, y=UMAP_2, colour=predicted_state),
               size=1.2, alpha=.85) +
    geom_text_repel(data=milestone_centroids,
                    aes(x=UMAP_1, y=UMAP_2, label=milestone),
                    size=2, colour="grey20", fontface="bold",
                    max.overlaps=8, segment.size=.2) +
    scale_colour_manual(values=STATE_COLORS, name="State",
                        drop=FALSE, na.value="grey40") +
    theme_classic() +
    labs(x="UMAP-1", y="UMAP-2",
         title=sprintf("%s — Predicted state (n=%s cells)",
                       cl, scales::comma(n_cl))) +
    theme(plot.title  = element_text(size=10, face="bold",
                                     colour=CELL_LINE_COLORS[cl]),
          legend.text = element_text(size=7),
          legend.title= element_text(size=8))

  # Pseudotime plot for this cell line
  p_pt <- ggplot() +
    geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
               aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
    geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
                 colour="grey40", linewidth=.6, alpha=.6) +
    geom_point(data=df_cl,
               aes(x=UMAP_1, y=UMAP_2, colour=mst_pt),
               size=1.2, alpha=.9) +
    geom_text_repel(data=milestone_centroids,
                    aes(x=UMAP_1, y=UMAP_2, label=milestone),
                    size=2, colour="grey40", fontface="bold",
                    max.overlaps=8, segment.size=.2) +
    scale_colour_gradientn(
      colours = c("#0D0887","#6A00A8","#B12A90","#E16462","#FCA636","#F0F921"),
      name="Pseudotime\n(0–100)", limits=c(0,100)) +
    theme_classic() +
    labs(x="UMAP-1", y="UMAP-2",
         title=sprintf("%s — MST pseudotime", cl)) +
    theme(plot.title  = element_text(size=10, face="bold",
                                     colour=CELL_LINE_COLORS[cl]),
          legend.text = element_text(size=7),
          legend.title= element_text(size=8))

  plot_list[[paste0(cl,"_state")]] <- p_state
  plot_list[[paste0(cl,"_pt")]]    <- p_pt
}

# Print in 2-column layout (state | pseudotime) per line
for (cl in CELL_LINE_ORDER) {
  print(plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]])
}
```

------------------------------------------------------------------------

# Visualisation 6 — Method Agreement per Cell Line

```{r method-agreement-lines, fig.width=14, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We compare MST and Monocle3 pseudotime PER CELL LINE to assess robustness.
# If both methods agree on a cell line's pseudotime position, the result is
# robust. If they disagree for a specific line, that line may sit in a
# trajectory region where the two methods handle geometry differently
# (e.g. near the branch point M02).
# ──────────────────────────────────────────────────────────────────────────

# Per-line Spearman correlation
line_cors <- sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(
    spearman = round(cor(mst_pt, m3_pt, method="spearman",
                         use="complete.obs"), 3),
    n        = n(),
    .groups  = "drop"
  )

cat("Method agreement (Spearman ρ) per cell line:\n")
print(line_cors)

p_method_scatter <- ggplot(sez_plot_df,
                            aes(x=mst_pt, y=m3_pt, colour=cell_line)) +
  geom_point(size=.5, alpha=.4) +
  geom_abline(slope=1, intercept=0, linetype="dashed",
              colour="grey40", linewidth=.7) +
  geom_smooth(method="lm", se=FALSE, linewidth=.8) +
  geom_text(data=line_cors,
            aes(x=5, y=95, label=sprintf("ρ=%.2f", spearman)),
            size=3, fontface="bold", hjust=0, colour="grey20") +
  scale_colour_manual(values=CELL_LINE_COLORS, guide="none") +
  facet_wrap(~cell_line, ncol=4) +
  theme_classic() +
  labs(x="Custom MST pseudotime (0–100)",
       y="Monocle3 pseudotime (0–100)",
       title="Method agreement per cell line: Custom MST vs Monocle3 pseudotime",
       subtitle="Dashed = perfect agreement | ρ = Spearman correlation | Higher ρ = more reliable pseudotime assignment") +
  theme(plot.title  = element_text(size=12, face="bold"),
        strip.text  = element_text(size=10, face="bold"),
        strip.background = element_rect(fill="grey95"))

print(p_method_scatter)
```

------------------------------------------------------------------------

# Summary Table — Complete Cell Line Analysis

```{r summary-table}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This comprehensive table is your primary result table for the paper.
# Key columns to interpret:
#
# dominant_state: the most common predicted CD4 state for that cell line
# dominant_state_pct: how confident — if >80%, strong cell of origin signal
# mean_pred_score: MapQuery confidence — >0.7 is reliable
# mst_mean_pt: average differentiation maturity (0=naive, 100=terminal)
# dominant_milestone: the single most common trajectory position
#
# For the paper, you will want to state something like:
# "X of 7 cell lines predominantly mapped to CD4 TCM (milestones M01-M02),
# with mean MST pseudotime of Y ± Z, indicating that Sézary syndrome
# cell lines transcriptomically resemble central memory CD4 T cells at
# an intermediate stage of differentiation."
# ──────────────────────────────────────────────────────────────────────────

sez_summary <- sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(
    n_cells            = n(),
    mean_pred_score    = round(mean(pred_score, na.rm=TRUE), 3),
    high_conf_pct      = round(100 * mean(pred_score > 0.7, na.rm=TRUE), 1),
    dominant_state     = names(sort(table(predicted_state), decreasing=TRUE))[1],
    dominant_state_pct = round(100 * max(table(predicted_state)) / n(), 1),
    dominant_milestone = names(sort(table(milestone_ref), decreasing=TRUE))[1],
    mst_mean_pt        = round(mean(mst_pt, na.rm=TRUE), 1),
    mst_sd_pt          = round(sd(mst_pt, na.rm=TRUE), 1),
    mst_median_pt      = round(median(mst_pt, na.rm=TRUE), 1),
    m3_mean_pt         = round(mean(m3_pt, na.rm=TRUE), 1),
    m3_sd_pt           = round(sd(m3_pt, na.rm=TRUE), 1),
    .groups            = "drop"
  ) %>%
  arrange(mst_mean_pt)

kable(sez_summary,
      col.names = c("Cell line","N cells","Mean pred.\nscore",
                    "% score\n>0.7",
                    "Dominant\nstate","Dominant\nstate %",
                    "Dominant\nmilestone",
                    "MST mean\npt","MST SD",
                    "MST median\npt",
                    "M3 mean\npt","M3 SD"),
      caption = "Complete per cell line trajectory summary — cell of origin analysis") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE) %>%
  column_spec(5, bold=TRUE) %>%
  column_spec(6, bold=TRUE) %>%
  column_spec(8:10, color="white", background=METHOD_COLORS["Custom MST"]) %>%
  column_spec(11:12, color="white", background=METHOD_COLORS["Monocle3"])

# ── Per-line milestone distribution table ─────────────────────────────────
ms_dist_all <- sez_plot_df %>%
  filter(!is.na(milestone_ref)) %>%
  count(cell_line, milestone_ref) %>%
  left_join(milestone_centroids[,c("milestone","state","order_pos")],
            by=c("milestone_ref"="milestone")) %>%
  filter(!is.na(order_pos)) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(milestone_ref = factor(milestone_ref, levels=milestone_order)) %>%
  arrange(cell_line, order_pos)

cat("\n\nMilestone distribution per cell line:\n")
ms_dist_all %>%
  select(cell_line, milestone_ref, state, n, pct) %>%
  kable(col.names=c("Cell line","Milestone","State","N cells","% cells"),
        caption="Per cell line milestone distribution") %>%
  kable_styling(bootstrap_options=c("striped","condensed"), full_width=FALSE) %>%
  print()
```

------------------------------------------------------------------------

# Prediction Score Quality Check

```{r pred-score-quality, fig.width=8, fig.height=5}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# The prediction score (0-1) tells you how confidently MapQuery matched
# each Sézary cell to a reference CD4 state. This is critical for
# interpreting your results:
#
# Score >0.7: High confidence — the cell clearly resembles one CD4 state
# Score 0.5-0.7: Moderate confidence — cell has mixed features
# Score <0.5: Low confidence — cell doesn't clearly match any reference state
#
# LOW SCORES ARE BIOLOGICALLY INTERESTING in cancer:
# Malignant cells often have aberrant transcriptomes that don't perfectly
# match any normal cell type. Low scores could indicate:
# 1. The cell has acquired oncogenic gene expression not seen in normal T cells
# 2. The cell spans two normal states (e.g. mixed TCM/TEM features)
# 3. The cell has downregulated normal T cell programmes entirely
#
# You should report the mean prediction score in your methods/results.
# If scores are generally high (>0.7), your conclusion about cell of origin
# is on solid ground. If scores are low (<0.5), interpret with caution.
# ──────────────────────────────────────────────────────────────────────────

p_scores <- ggplot(sez_plot_df,
                    aes(x=cell_line, y=pred_score, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.5,
               outlier.alpha=.3, colour="grey30") +
  geom_hline(yintercept=0.7, linetype="dashed",
             colour="darkgreen", linewidth=.8) +
  geom_hline(yintercept=0.5, linetype="dashed",
             colour="orange", linewidth=.8) +
  annotate("text", x=7.5, y=0.72, label="High confidence (0.7)",
           size=3, colour="darkgreen", hjust=1, fontface="italic") +
  annotate("text", x=7.5, y=0.52, label="Moderate confidence (0.5)",
           size=3, colour="orange", hjust=1, fontface="italic") +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,1), breaks=seq(0,1,.2)) +
  theme_classic() +
  labs(x="Cell line", y="MapQuery prediction score (0–1)",
       title="Projection confidence per Sézary cell line",
       subtitle="Higher score = cell more clearly resembles one specific CD4 T cell state") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

print(p_scores)
```

------------------------------------------------------------------------

# Save All Outputs

```{r save-all}
# Create output directories if they don't exist
dir.create("Objects", showWarnings=FALSE)
dir.create("Figures", showWarnings=FALSE)
dir.create("Tables",  showWarnings=FALSE)

# ── RDS objects ────────────────────────────────────────────────────────────
saveRDS(sezary_obj,          "Objects/sezary_celllines_projected.rds")
saveRDS(mst_graph,           "Objects/mst_graph.rds")
saveRDS(milestone_centroids, "Objects/milestone_centroids.rds")

# ── CSV tables ─────────────────────────────────────────────────────────────
write.csv(sez_summary,    "Tables/cellline_trajectory_summary.csv",   row.names=FALSE)
write.csv(state_per_line, "Tables/cellline_state_proportions.csv",    row.names=FALSE)
write.csv(ms_dist_all,    "Tables/cellline_milestone_distribution.csv",row.names=FALSE)
write.csv(line_cors,      "Tables/cellline_method_agreement.csv",     row.names=FALSE)
write.csv(sez_plot_df,    "Tables/cellline_all_cells_metadata.csv",   row.names=FALSE)

# ── Figures ────────────────────────────────────────────────────────────────
ggsave("Figures/fig_all_lines_umap.pdf",      p_all_lines,    width=14, height=12)
ggsave("Figures/fig_faceted_lines.pdf",       p_faceted,      width=18, height=10)
ggsave("Figures/fig_state_per_line.pdf",      p_state_stack,  width=14, height=6)
ggsave("Figures/fig_pseudotime_mst.pdf",      p_violin_mst,   width=14, height=6)
ggsave("Figures/fig_pseudotime_m3.pdf",       p_violin_m3,    width=14, height=6)
ggsave("Figures/fig_milestone_heatmap.pdf",   p_ms_heatmap,   width=14, height=7)
ggsave("Figures/fig_milestone_stack.pdf",     p_ms_stack,     width=14, height=7)
ggsave("Figures/fig_method_agreement.pdf",    p_method_scatter,width=14, height=6)
ggsave("Figures/fig_prediction_scores.pdf",   p_scores,       width=12, height=5)

# Individual line plots
for (cl in CELL_LINE_ORDER) {
  p_combined <- plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]]
  ggsave(sprintf("Figures/fig_%s_umap.pdf", cl),
         p_combined, width=16, height=7)
}

cat("✅ All outputs saved\n")
cat("  Objects/ — RDS files\n")
cat("  Figures/ — PDF plots\n")
cat("  Tables/  — CSV data files\n")
```

------------------------------------------------------------------------

# Biological Interpretation Guide

```{r interpretation-guide, echo=FALSE}
# This chunk prints a guide for interpreting your specific results
# after all the analyses above have run

cat("
╔══════════════════════════════════════════════════════════════════════╗
║         BIOLOGICAL INTERPRETATION GUIDE FOR YOUR RESULTS            ║
╚══════════════════════════════════════════════════════════════════════╝

STEP 1: CHECK DOMINANT STATE (sez_summary table, dominant_state column)
─────────────────────────────────────────────────────────────────────────
• If most lines show CD4 TCM (dominant_state_pct >70%):
  → CONCLUSION: Sézary syndrome arises from Central Memory T cells
  → IMPLICATION: TCM survival signals (IL-7, IL-15, CCR7-independent
    homing) drive malignant cell persistence
  → LITERATURE: Consistent with Wang et al. 2023, Woollard et al. 2022

• If most lines show CD4 TEM:
  → CONCLUSION: Transformation occurred at the effector memory stage
  → IMPLICATION: Malignant cells have acquired partial effector
    programmes (GZMK, CCL5) but not terminal cytotoxicity

• If lines split between TCM and TEM:
  → CONCLUSION: Disease heterogeneity — different patients have
    different cells of origin
  → IMPLICATION: May explain clinical variability in drug response

STEP 2: CHECK DOMINANT MILESTONE (dominant_milestone column)
─────────────────────────────────────────────────────────────────────────
• M01 or M02 dominant → early-to-late TCM (most likely for SS)
• M02 dominant (branch point) → arrested at fate decision point
• M03 or M04 dominant → TEM commitment despite TCM-like label
• M06 dominant → Treg-like features (check FOXP3 expression)

STEP 3: CHECK PREDICTION SCORES (mean_pred_score column)
─────────────────────────────────────────────────────────────────────────
• >0.7: High confidence — strong cell of origin conclusion
• 0.5-0.7: Moderate — mention in limitations
• <0.5: Low — malignant transcriptome substantially deviates from
  all normal reference states; discuss as 'partial resemblance'

STEP 4: CHECK PSEUDOTIME (mst_mean_pt column)
─────────────────────────────────────────────────────────────────────────
• 0-33: Naive/early TCM differentiation stage
• 33-66: TCM/TEM transition stage
• 66-100: Mature effector or Treg stage
• Compare across lines: if L1 pseudotime < L5 pseudotime, L5 is more
  mature/differentiated — may correlate with aggressiveness

STEP 5: LOOK AT THE FACETED UMAP
─────────────────────────────────────────────────────────────────────────
• Tight clusters in one region → homogeneous cell line
• Scattered cells → heterogeneous line (may have subclones)
• Lines overlapping the same UMAP region → same cell of origin
• Lines in different UMAP regions → different cell of origin
")
```

------------------------------------------------------------------------

# Session Information

```{r session-info}
sessionInfo()
```

------------------------------------------------------------------------

# Metadata Reference

```{r metadata-table, echo=FALSE}
meta_tbl <- data.frame(
  Column = c(
    "orig.ident",
    "predicted.predicted_state",
    "predicted.predicted_state.score",
    "predicted.predicted_milestone",
    "mst_pseudotime",
    "nearest_milestone_mst",
    "monocle3_pseudotime"
  ),
  Description = c(
    "Cell line identity: L1–L7 (malignant Sézary lines)",
    "MapQuery label transfer: most similar CD4 T cell state from reference",
    "MapQuery confidence score (0–1): higher = more confident prediction",
    "MapQuery label transfer: most similar milestone (M00–M07) from reference",
    "MST pseudotime (0–100): 0=Naive-like, 100=Temra/Terminal Treg-like",
    "Nearest MST milestone after orthogonal edge projection",
    "KNN-weighted Monocle3 pseudotime from 15 nearest reference neighbours (0–100)"
  ),
  Interpretation = c(
    "Use to split per-cell-line analysis",
    "Primary cell of origin indicator",
    ">0.7 high confidence | 0.5-0.7 moderate | <0.5 interpret cautiously",
    "Most precise trajectory position (M02 = branch point)",
    "Differentiation maturity: lower = more naive-like at time of transformation",
    "Confirms trajectory position; M01-M02 = TCM, M03-M04 = TEM, M05 = Temra",
    "Independent pseudotime estimate for robustness validation"
  )
)
kable(meta_tbl,
      caption="Metadata fields added to sezary_obj by this pipeline") %>%
  kable_styling(bootstrap_options=c("striped","condensed"), full_width=TRUE)
```
