1 Biological Background & Rationale

1.1 What is Sézary Syndrome?

Sézary syndrome (SS) is an aggressive cutaneous T-cell lymphoma (CTCL) characterised by the triad of erythroderma (full-body skin redness), lymphadenopathy, and circulating malignant T cells called Sézary cells. These are CD4+ T cells with cerebriform (brain-like wrinkled) nuclei that have undergone malignant transformation.

The central unresolved question in SS biology is: which normal CD4 T cell subtype gave rise to the malignant clone? This is the “cell of origin” question. Answering it matters because:

  • It tells us which transcription factors and signalling pathways were active when transformation occurred
  • It predicts which normal T cell programmes are retained or corrupted in the malignant cells (explaining why SS cells home to skin, suppress immunity, etc.)
  • It guides therapeutic targeting — if SS arises from TCM cells, targeting memory T cell survival signals (e.g. IL-15/IL-7 signalling) is rational

1.2 Why Project onto a Trajectory?

A trajectory analysis places cells along a continuous differentiation path rather than forcing them into discrete categories. Normal CD4 T cells differentiate from Naive → TCM → TEM → Temra (effector arm) or Naive → TCM → Treg (regulatory arm).

When we project Sézary cell lines onto this trajectory, we can ask:

  1. Where on the trajectory do malignant cells land? (cell of origin state)
  2. How far along differentiation are they? (pseudotime = differentiation maturity)
  3. Are different cell lines (L1–L7) at the same or different positions? (heterogeneity in cell of origin across the disease)
  4. Do all cells within a line occupy the same position, or is there intra-line heterogeneity? (clonal evolution vs stable malignant state)

1.3 Why Analyse Each Cell Line Separately?

You have 7 Sézary cell lines (L1–L7) derived from different patients. Each cell line represents a biologically independent malignant clone. Key reasons to analyse them separately:

  • If all 7 lines map to the same trajectory position, this supports a universal cell of origin for SS (strong biological conclusion)
  • If lines differ, this indicates disease heterogeneity — different patients may have different cells of origin, explaining clinical variability in drug response
  • Comparing cell lines that are drug-sensitive vs drug-resistant (if known) to their trajectory position can reveal whether differentiation state predicts drug response

2 Setup

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the same libraries as the reference trajectory script.
# RANN provides fast k-nearest-neighbour search (nn2) which we use to transfer
# Monocle3 pseudotime from reference cells to query Sézary cells by weighted
# averaging of their 15 nearest reference neighbours.
# ──────────────────────────────────────────────────────────────────────────

suppressPackageStartupMessages({
  library(Seurat)
  library(SeuratWrappers)
  library(igraph)
  library(RANN)
  library(ggplot2)
  library(ggrepel)
  library(dplyr)
  library(tidyr)
  library(patchwork)
  library(scales)
  library(RColorBrewer)
  library(knitr)
  library(kableExtra)
  library(future)
})
Registered S3 methods overwritten by 'htmltools':
  method               from         
  print.html           tools:rstudio
  print.shiny.tag      tools:rstudio
  print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'rmarkdown':
  method         from
  print.paged_df     
# ── Colour palettes ────────────────────────────────────────────────────────
# These match the reference trajectory script exactly for visual consistency
STATE_COLORS <- c(
  "CD4 Naive"     = "#4472C4",
  "CD4 TCM"       = "#70AD47",
  "CD4 TEM"       = "#ED7D31",
  "CD4 Temra/CTL" = "#C00000",
  "Treg"          = "#7030A0"
)

# 7 visually distinct colours for L1-L7
# Using a combination of Set1 and Dark2 palettes for maximum distinctiveness
CELL_LINE_COLORS <- c(
  "L1" = "#E41A1C",   # red
  "L2" = "#377EB8",   # blue
  "L3" = "#4DAF4A",   # green
  "L4" = "#FF7F00",   # orange
  "L5" = "#984EA3",   # purple
  "L6" = "#A65628",   # brown
  "L7" = "#F781BF"    # pink
)

METHOD_COLORS <- c(
  "Monocle3"   = "#e74c3c",
  "Custom MST" = "#2980b9"
)

STATE_ORDER    <- c("CD4 Naive","CD4 TCM","CD4 TEM","CD4 Temra/CTL","Treg")
CELL_LINE_ORDER <- paste0("L", 1:7)
UMAP_NAME      <- "umap"
STATE_COL      <- "predicted.celltype.l2"

cat("✅ Setup complete\n")
✅ Setup complete
cat("Cell line colours:\n"); print(CELL_LINE_COLORS)
Cell line colours:
       L1        L2        L3        L4        L5        L6        L7 
"#E41A1C" "#377EB8" "#4DAF4A" "#FF7F00" "#984EA3" "#A65628" "#F781BF" 

3 Load Objects & Subset to Malignant Cells

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the pre-computed reference object (cd4_ref_dual_trajectory.rds)
# which already contains:
#   - MST graph, milestone centroids, pseudotime for all reference cells
#   - Monocle3 pseudotime
#   - All trajectory infrastructure built in the main pipeline
#
# CRITICAL: We subset to L1-L7 ONLY, excluding CD4T_lab and CD4T_10x.
# CD4T_lab and CD4T_10x are healthy donor CD4 T cells included as controls
# in the original dataset. Projecting them onto the reference would be
# circular (healthy cells → healthy reference = trivially perfect mapping)
# and would NOT answer the cell of origin question.
#
# We want ONLY the malignant Sézary cells (L1-L7) for projection.
# ──────────────────────────────────────────────────────────────────────────

# Load pre-computed reference with all trajectory objects
cd4_ref <- readRDS("../cd4_ref_dual_trajectory.rds")

# Load the full Sézary object
sezary_raw <- readRDS("/home/nabbasi/apollo_home/1-Seurat_RDS_OBJECT_FINAL/All_samples_Merged_with_Renamed_Clusters_Cell_state-03-12-2025.rds.rds")

cat("Full Sézary object loaded:", ncol(sezary_raw), "cells\n")
Full Sézary object loaded: 49305 cells
cat("\norig.ident breakdown (ALL samples):\n")

orig.ident breakdown (ALL samples):
print(table(sezary_raw$orig.ident))

      L1       L2       L3       L4       L5       L6       L7 CD4T_lab CD4T_10x 
    5825     5935     6428     6006     6022     5148     5331     5106     3504 
# ── Subset to malignant cell lines only ───────────────────────────────────
CELL_LINES <- paste0("L", 1:7)
sezary_obj <- subset(sezary_raw, orig.ident %in% CELL_LINES)

cat(sprintf("\n✅ Malignant cells retained: %d\n", ncol(sezary_obj)))

✅ Malignant cells retained: 40695
cat(sprintf("   Healthy controls excluded: CD4T_lab + CD4T_10x (%d cells)\n",
            ncol(sezary_raw) - ncol(sezary_obj)))
   Healthy controls excluded: CD4T_lab + CD4T_10x (8610 cells)
cat("\nPer cell line (cells to project):\n")

Per cell line (cells to project):
print(table(sezary_obj$orig.ident))

      L1       L2       L3       L4       L5       L6       L7 CD4T_lab CD4T_10x 
    5825     5935     6428     6006     6022     5148     5331        0        0 
# CRITICAL: Remove the full object to free RAM before MapQuery
# On RServer with shared memory, keeping sezary_raw in memory during
# MapQuery can cause the 46 GiB future.globals limit to be exceeded
rm(sezary_raw); gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells    4244977   226.8    8237565   440.0    8237565   440.0
Vcells 1385926744 10573.8 3049213733 23263.7 2626985949 20042.4
cat("\n✅ Raw object removed from memory — ready for MapQuery\n")

✅ Raw object removed from memory — ready for MapQuery
# ── Validate both objects ─────────────────────────────────────────────────
stopifnot(
  "STATE_COL missing from cd4_ref"   = STATE_COL %in% colnames(cd4_ref@meta.data),
  "UMAP missing from cd4_ref"        = UMAP_NAME %in% names(cd4_ref@reductions),
  "milestone missing from cd4_ref"   = "milestone" %in% colnames(cd4_ref@meta.data),
  "monocle3_pseudotime_norm missing" = "monocle3_pseudotime_norm" %in% colnames(cd4_ref@meta.data),
  "RNA missing from cd4_ref"         = "RNA" %in% names(cd4_ref@assays),
  "RNA missing from sezary_obj"      = "RNA" %in% names(sezary_obj@assays),
  "UMAP model missing"               = !is.null(cd4_ref@reductions$umap@misc$model)
)
cat("✅ All validation checks passed\n")
✅ All validation checks passed
# Print reference trajectory summary for context
cat("\nReference CD4 T cell state distribution:\n")

Reference CD4 T cell state distribution:
print(table(cd4_ref@meta.data[[STATE_COL]]))

    CD4 Naive       CD4 TCM       CD4 TEM CD4 Temra/CTL          Treg 
         2037          9067           145            10           207 
cat("\nReference milestone distribution:\n")

Reference milestone distribution:
print(table(cd4_ref@meta.data$milestone))

 M00  M01  M02  M03  M04  M05  M06  M07 
2037  388 8679   64   81   10  146   61 

4 Reconstruct Trajectory Infrastructure

# ── SAFE APPROACH: derive everything from the saved reference object ────────
# We do NOT hardcode edges — instead we reconstruct the graph from the
# actual milestone centroid positions in the saved cd4_ref object.
# This guarantees exact consistency with the reference trajectory.

# ── Step 1: Rebuild milestone centroids from reference metadata ────────────
umap_coords <- as.data.frame(Embeddings(cd4_ref, UMAP_NAME))
colnames(umap_coords) <- c("UMAP_1","UMAP_2")
umap_coords$milestone <- cd4_ref@meta.data$milestone
umap_coords$state     <- cd4_ref@meta.data[[STATE_COL]]

milestone_centroids <- umap_coords %>%
  filter(!is.na(milestone)) %>%
  group_by(milestone) %>%
  summarise(
    UMAP_1  = mean(UMAP_1),
    UMAP_2  = mean(UMAP_2),
    n_cells = n(),
    state   = dplyr::first(state),
    .groups = "drop"
  ) %>%
  arrange(milestone)

milestone_order           <- sort(unique(milestone_centroids$milestone))
milestone_centroids$order_pos <- match(milestone_centroids$milestone, milestone_order)

cat("Milestone centroids from saved reference:\n")
Milestone centroids from saved reference:
print(milestone_centroids[, c("milestone","state","UMAP_1","UMAP_2","n_cells")])

# ── Step 2: Verify milestone-to-state mapping matches expected biology ──────
expected_map <- c(
  M00="CD4 Naive",
  M01="CD4 TCM", M02="CD4 TCM",
  M03="CD4 TEM", M04="CD4 TEM",
  M05="CD4 Temra/CTL",
  M06="Treg",    M07="Treg"
)

actual_map <- setNames(milestone_centroids$state, milestone_centroids$milestone)
mismatches <- names(expected_map)[expected_map != actual_map[names(expected_map)]]

if (length(mismatches) > 0) {
  cat("⚠️  WARNING — milestone/state mismatch at:", paste(mismatches, collapse=", "), "\n")
  cat("   Expected:", expected_map[mismatches], "\n")
  cat("   Actual:  ", actual_map[mismatches], "\n")
  stop("Fix milestone assignment before proceeding")
} else {
  cat("✅ Milestone-to-state mapping verified\n")
}
✅ Milestone-to-state mapping verified
# ── Step 3: Build centroid matrix and distance matrix ──────────────────────
centroid_mat <- as.matrix(milestone_centroids[, c("UMAP_1","UMAP_2")])
rownames(centroid_mat) <- milestone_centroids$milestone
dist_mat <- as.matrix(dist(centroid_mat, method="euclidean"))

# ── Step 4: Rebuild MST with SAME manual topology as main pipeline ─────────
# Edge list derived from your confirmed milestone distribution:
# M00=Naive(2037) M01=TCM_early(388) M02=TCM_late/branch(8679)
# M03=TEM_early(64) M04=TEM_late(81) M05=Temra(10)
# M06=Treg_naive(146) M07=Treg_terminal(61)
edge_list_ordered <- matrix(c(
  "M00","M01",   # Naive → TCM early
  "M01","M02",   # TCM early → TCM late (branch point — largest TCM cluster)
  "M02","M03",   # TCM late → TEM early  (effector arm)
  "M03","M04",   # TEM early → TEM late
  "M04","M05",   # TEM late → Temra/CTL  (terminal effector)
  "M02","M06",   # TCM late → Treg naive (regulatory arm)
  "M06","M07"    # Treg naive → Treg terminal
), ncol=2, byrow=TRUE)

edge_weights <- apply(edge_list_ordered, 1, function(e) dist_mat[e[1], e[2]])

if (any(!is.finite(edge_weights))) {
  stop("Non-finite edge weights — check milestone names match dist_mat rownames")
}

mst_graph <- graph_from_edgelist(edge_list_ordered, directed=FALSE)
E(mst_graph)$weight <- edge_weights

cat("\nMST edges reconstructed:\n")

MST edges reconstructed:
for (i in seq_len(nrow(edge_list_ordered))) {
  s1 <- actual_map[edge_list_ordered[i,1]]
  s2 <- actual_map[edge_list_ordered[i,2]]
  cat(sprintf("  %s(%s) → %s(%s)  dist=%.4f\n",
              edge_list_ordered[i,1], s1,
              edge_list_ordered[i,2], s2,
              edge_weights[i]))
}
  M00(CD4 Naive) → M01(CD4 TCM)  dist=5.6851
  M01(CD4 TCM) → M02(CD4 TCM)  dist=5.4642
  M02(CD4 TCM) → M03(CD4 TEM)  dist=5.6416
  M03(CD4 TEM) → M04(CD4 TEM)  dist=1.4646
  M04(CD4 TEM) → M05(CD4 Temra/CTL)  dist=0.7915
  M02(CD4 TCM) → M06(Treg)  dist=3.5341
  M06(Treg) → M07(Treg)  dist=2.2321
# ── Step 5: Graph distances from root ─────────────────────────────────────
graph_dists  <- distances(mst_graph, v="M00", weights=NA)
g_dists_vec  <- setNames(as.numeric(graph_dists), colnames(graph_dists))

# CRITICAL: use same pseudotime range as reference for consistent 0-100 scale
mst_pt_range <- range(cd4_ref@meta.data$mst_pseudotime_norm, na.rm=TRUE)
cat(sprintf("\nReference pseudotime range: %.1f – %.1f\n",
            mst_pt_range[1], mst_pt_range[2]))

Reference pseudotime range: 0.0 – 100.0
# ── Step 6: MST edge coordinates for plotting ──────────────────────────────
mst_edges   <- as_edgelist(mst_graph)
mst_edge_df <- do.call(rbind, lapply(seq_len(nrow(mst_edges)), function(i) {
  m1 <- mst_edges[i,1]; m2 <- mst_edges[i,2]
  r1 <- milestone_centroids[milestone_centroids$milestone==m1,]
  r2 <- milestone_centroids[milestone_centroids$milestone==m2,]
  data.frame(x1=r1$UMAP_1, y1=r1$UMAP_2, x2=r2$UMAP_1, y2=r2$UMAP_2)
}))

# ── Step 7: Reference UMAP background ─────────────────────────────────────
ref_bg <- data.frame(
  UMAP_1 = Embeddings(cd4_ref, UMAP_NAME)[,1],
  UMAP_2 = Embeddings(cd4_ref, UMAP_NAME)[,2],
  state  = cd4_ref@meta.data[[STATE_COL]]
)

# ── Step 8: project_onto_mst function ─────────────────────────────────────
project_onto_mst <- function(cell_coords, centroid_mat, mst_graph, graph_dists) {
  edges      <- as_edgelist(mst_graph)
  n_cells    <- nrow(cell_coords)
  pseudotime <- numeric(n_cells)
  nearest_ms <- character(n_cells)

  for (i in seq_len(n_cells)) {
    cx <- cell_coords[i,1]; cy <- cell_coords[i,2]
    best_dist <- Inf; best_pt <- 0; best_ms <- NA_character_

    for (e in seq_len(nrow(edges))) {
      m1 <- edges[e,1]; m2 <- edges[e,2]
      p1 <- centroid_mat[m1,]; p2 <- centroid_mat[m2,]
      dx <- p2[1]-p1[1]; dy <- p2[2]-p1[2]
      len2 <- dx^2 + dy^2
      if (len2 < 1e-10) next
      t    <- max(0, min(1, ((cx-p1[1])*dx + (cy-p1[2])*dy) / len2))
      proj <- c(p1[1]+t*dx, p1[2]+t*dy)
      d    <- sqrt((cx-proj[1])^2 + (cy-proj[2])^2)
      if (d < best_dist) {
        best_dist <- d
        best_pt   <- graph_dists[m1] + t*(graph_dists[m2]-graph_dists[m1])
        best_ms   <- if (t < 0.5) m1 else m2
      }
    }
    pseudotime[i] <- best_pt
    nearest_ms[i] <- best_ms
  }
  list(pseudotime=pseudotime, nearest_ms=nearest_ms)
}

cat("\n✅ Trajectory infrastructure ready\n")

✅ Trajectory infrastructure ready
cat(sprintf("MST: %d nodes, %d edges | Root: M00 (CD4 Naive)\n",
            vcount(mst_graph), ecount(mst_graph)))
MST: 8 nodes, 7 edges | Root: M00 (CD4 Naive)

The key differences from what I had before are that the milestone verification step checks the actual state assignments from your object against expected biology and stops with a clear error message if anything is wrong, the comment on M02 now correctly notes it is the large TCM cluster (8,679 cells) that acts as the branch point, and the mst_pt_range is taken directly from the saved reference pseudotime rather than being recalculated, guaranteeing the 0–100 scale for Sézary cells is identical to the reference scale. Everything downstream in the script stays unchanged.


5 MapQuery: Project Sézary Cell Lines onto Reference

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# MapQuery is a Seurat function that performs reference-based cell mapping.
# It does two things simultaneously:
#
# 1. LABEL TRANSFER: Uses weighted nearest-neighbour anchors (cells that
#    are mutually similar between reference and query in PCA space) to
#    transfer cell type labels from reference → query. Each Sézary cell
#    gets a predicted CD4 state label (Naive/TCM/TEM/Temra/Treg) with a
#    confidence score (0-1).
#
# 2. UMAP PROJECTION: Projects Sézary cells into the reference UMAP
#    coordinate system. This is NOT a new UMAP — it places each Sézary
#    cell at the coordinates it would occupy IF it were part of the
#    reference dataset. This allows direct visual comparison on the
#    same UMAP space.
#
# WHY THIS MATTERS BIOLOGICALLY:
# The predicted state tells us which normal CD4 subtype the malignant cell
# most transcriptomically resembles. If a Sézary cell line projects to
# CD4 TCM space with high confidence, it means the malignant transcriptome
# most closely matches central memory T cells — implying TCM is the cell
# of origin or that the malignant transformation arrested differentiation
# at the TCM stage.
#
# TECHNICAL: We use plan("sequential") to disable parallelism because the
# reference object is ~46 GiB when exported to parallel workers, exceeding
# RServer memory limits. Sequential mode is slower but memory-safe.
# ──────────────────────────────────────────────────────────────────────────

plan("sequential")
options(future.globals.maxSize = 60 * 1024^3)

DefaultAssay(cd4_ref)    <- "RNA"
DefaultAssay(sezary_obj) <- "RNA"

# FindVariableFeatures on reference ensures gene selection is current
cd4_ref <- FindVariableFeatures(cd4_ref, nfeatures=3000, verbose=FALSE)

cat("Finding transfer anchors...\n")
Finding transfer anchors...
cat("(This identifies cells that are mutually similar between\n")
(This identifies cells that are mutually similar between
cat(" reference and query in PCA space — the bridge for label transfer)\n\n")
 reference and query in PCA space — the bridge for label transfer)
anchors <- FindTransferAnchors(
  reference            = cd4_ref,
  query                = sezary_obj,
  normalization.method = "LogNormalize",
  reference.reduction  = "pca",
  dims                 = 1:30,
  reduction            = "pcaproject",
  verbose              = FALSE
)

cat(sprintf("✅ Anchors found: %d anchor pairs\n", nrow(anchors@anchors)))
✅ Anchors found: 1499 anchor pairs
cat("\nRunning MapQuery (label transfer + UMAP projection)...\n")

Running MapQuery (label transfer + UMAP projection)...
sezary_obj <- MapQuery(
  anchorset           = anchors,
  query               = sezary_obj,
  reference           = cd4_ref,
  refdata             = list(
    predicted_state     = STATE_COL,    # transfer CD4 state labels
    predicted_milestone = "milestone"   # transfer milestone labels
  ),
  reference.reduction = "pca",
  reduction.model     = "umap",         # project into reference UMAP space
  verbose             = FALSE
)
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
cat("✅ MapQuery complete\n\n")
✅ MapQuery complete
# ── Quality check ─────────────────────────────────────────────────────────
cat("=== PROJECTION QUALITY CHECK ===\n")
=== PROJECTION QUALITY CHECK ===
cat(sprintf("Cells projected: %d\n", ncol(sezary_obj)))
Cells projected: 40695
cat(sprintf("Mean prediction score: %.3f (>0.5 = reliable, >0.7 = confident)\n",
            mean(sezary_obj$predicted.predicted_state.score, na.rm=TRUE)))
Mean prediction score: 0.962 (>0.5 = reliable, >0.7 = confident)
cat(sprintf("Cells with score >0.5: %d (%.1f%%)\n",
            sum(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE),
            100*mean(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE)))
Cells with score >0.5: 40591 (99.7%)
cat(sprintf("Cells with score >0.7: %d (%.1f%%)\n",
            sum(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE),
            100*mean(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE)))
Cells with score >0.7: 39584 (97.3%)
cat("\n=== PREDICTED STATE BY CELL LINE ===\n")

=== PREDICTED STATE BY CELL LINE ===
print(table(sezary_obj$predicted.predicted_state,
            sezary_obj$orig.ident,
            dnn=c("Predicted state","Cell line")))
               Cell line
Predicted state   L1   L2   L3   L4   L5   L6   L7 CD4T_lab CD4T_10x
      CD4 Naive    6    0    0    3    0    0    1        0        0
      CD4 TCM   5639 5933 6428 6003 6022 5148 5330        0        0
      CD4 TEM    180    2    0    0    0    0    0        0        0
cat("\n=== PREDICTED MILESTONE BY CELL LINE ===\n")

=== PREDICTED MILESTONE BY CELL LINE ===
print(table(sezary_obj$predicted.predicted_milestone,
            sezary_obj$orig.ident,
            dnn=c("Predicted milestone","Cell line")))
                   Cell line
Predicted milestone   L1   L2   L3   L4   L5   L6   L7 CD4T_lab CD4T_10x
                M00    6    0    0    3    0    0    1        0        0
                M01    0    0    1    1    0    0    0        0        0
                M02 5726 5935 6427 6002 6022 5148 5330        0        0
                M04   93    0    0    0    0    0    0        0        0

6 Assign Pseudotime to Sézary Cell Lines

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We assign pseudotime to Sézary cells using TWO methods:
#
# METHOD 1 — MST projection:
# Each Sézary cell's ref.umap coordinates (from MapQuery) are projected
# onto the nearest edge of the MST. Pseudotime = interpolated graph distance
# from the root (M00 = CD4 Naive). This gives a biologically interpretable
# value: 0 = most naive-like, 100 = most terminally differentiated.
#
# METHOD 2 — KNN-weighted Monocle3 transfer:
# For each Sézary cell, we find its 15 nearest reference cells in UMAP space,
# then compute a weighted average of their Monocle3 pseudotime values.
# Weights are inverse distance (closer reference cells contribute more).
# This transfers the Monocle3 pseudotime scale without re-running Monocle3
# on the query — appropriate because Monocle3 cannot easily incorporate
# new query cells into an existing principal graph.
#
# INTERPRETATION:
# - High pseudotime (>50): malignant cells resemble mature/differentiated T cells
# - Low pseudotime (<25): malignant cells resemble naive or early memory T cells
# - This tells us the DIFFERENTIATION STAGE at which malignant transformation occurred
# ──────────────────────────────────────────────────────────────────────────

stopifnot("ref.umap missing — MapQuery may have failed" =
          "ref.umap" %in% names(sezary_obj@reductions))

# Extract Sézary UMAP coordinates in reference space
sez_umap_coords           <- as.matrix(Embeddings(sezary_obj, "ref.umap"))
colnames(sez_umap_coords) <- c("UMAP_1","UMAP_2")

# ── Method 1: MST pseudotime ──────────────────────────────────────────────
cat("Projecting Sézary cells onto MST...\n")
Projecting Sézary cells onto MST...
sez_mst <- project_onto_mst(sez_umap_coords, centroid_mat, mst_graph, g_dists_vec)

# ── Method 2: KNN-weighted Monocle3 pseudotime ───────────────────────────
cat("Transferring Monocle3 pseudotime via KNN (k=15)...\n")
Transferring Monocle3 pseudotime via KNN (k=15)...
ref_umap_mat <- Embeddings(cd4_ref, UMAP_NAME)
ref_m3_pt    <- cd4_ref@meta.data$monocle3_pseudotime_norm

nn_result <- nn2(data=ref_umap_mat, query=sez_umap_coords, k=15)
inv_dist  <- 1 / (nn_result$nn.dists + 1e-6)
weights   <- inv_dist / rowSums(inv_dist)

# ── Assign to metadata ────────────────────────────────────────────────────
meta_sez <- sezary_obj@meta.data  # Fixed: remove [mailto: ] syntax

meta_sez$mst_pseudotime <- pmax(0, pmin(100,
  100 * (sez_mst$pseudotime - mst_pt_range[1]) /
         (mst_pt_range[2]  - mst_pt_range[1])
))
meta_sez$nearest_milestone_mst <- sez_mst$nearest_ms
meta_sez$monocle3_pseudotime   <- rowSums(
  weights * matrix(ref_m3_pt[nn_result$nn.idx],
                   nrow=nrow(nn_result$nn.idx))
)

sezary_obj@meta.data <- meta_sez  # Fixed: remove [mailto: ] syntax

# ── Build plot dataframe with cell line info ──────────────────────────────
sez_plot_df <- data.frame(
  UMAP_1          = sez_umap_coords[,"UMAP_1"],
  UMAP_2          = sez_umap_coords[,"UMAP_2"],
  cell_line       = factor(sezary_obj$orig.ident, levels=CELL_LINE_ORDER),
  predicted_state = sezary_obj$predicted.predicted_state,
  pred_score      = sezary_obj$predicted.predicted_state.score,
  mst_pt          = sezary_obj@meta.data$mst_pseudotime,
  m3_pt           = sezary_obj@meta.data$monocle3_pseudotime,
  milestone_ref   = sezary_obj@meta.data$nearest_milestone_mst
)

cat(sprintf("✅ Pseudotime assigned to %d cells\n", nrow(sez_plot_df)))
✅ Pseudotime assigned to 40695 cells
# FIXED: Correct sprintf() usage for range vectors
mst_range <- range(sez_plot_df$mst_pt, na.rm=TRUE)
m3_range  <- range(sez_plot_df$m3_pt, na.rm=TRUE)

cat(sprintf("MST pseudotime range:      %.1f – %.1f\n", mst_range[1], mst_range[2]))
MST pseudotime range:      0.0 – 5.0
cat(sprintf("Monocle3 pseudotime range: %.1f – %.1f\n", m3_range[1], m3_range[2]))
Monocle3 pseudotime range: 1.9 – 99.3
cat("\nMST pseudotime per cell line (median):\n")

MST pseudotime per cell line (median):
sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(median_mst = round(median(mst_pt, na.rm=TRUE), 1),
            median_m3  = round(median(m3_pt,  na.rm=TRUE), 1),
            .groups="drop") %>%
  print()
NA

7 Visualisation 1 — All Cell Lines Together


8 Visualisation 2 — Predicted State per Cell Line

# ── Stacked bar: proportion of cells per state per line ───────────────────
p_state_stack <- ggplot(state_per_line,
                         aes(x=cell_line, y=pct, fill=predicted_state)) +
  geom_col(width=.75, colour="white", linewidth=.3) +
  geom_text(aes(label=ifelse(pct >= 5, paste0(pct,"%"), "")),
            position=position_stack(vjust=0.5),
            size=3.2, colour="white", fontface="bold") +
  scale_fill_manual(values=STATE_COLORS, name="Predicted CD4 state") +
  scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
  theme_classic() +
  labs(x="Cell line", y="% of cells",
       title="Predicted CD4 state composition per Sézary cell line",
       subtitle="Each bar = 100% of that cell line's cells | Labels shown for proportions ≥5%") +
  theme(plot.title  = element_text(size=13, face="bold"),
        axis.text.x = element_text(size=12, face="bold"),
        legend.title = element_text(size=10, face="bold"))

print(p_state_stack)

# ── Print summary table ───────────────────────────────────────────────────
cat("\nState proportions per cell line:\n")

State proportions per cell line:
state_per_line %>%
  select(cell_line, predicted_state, n, pct) %>%
  arrange(cell_line, desc(pct)) %>%
  kable(col.names=c("Cell line","Predicted state","N cells","% cells"),
        caption="Predicted CD4 state composition per Sézary cell line") %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) %>%
  print()
Predicted CD4 state composition per Sézary cell line
Cell line Predicted state N cells % cells
L1 CD4 TCM 5639 96.8
L1 CD4 TEM 180 3.1
L1 CD4 Naive 6 0.1
L2 CD4 TCM 5933 100.0
L2 CD4 TEM 2 0.0
L3 CD4 TCM 6428 100.0
L4 CD4 TCM 6003 100.0
L4 CD4 Naive 3 0.0
L5 CD4 TCM 6022 100.0
L6 CD4 TCM 5148 100.0
L7 CD4 TCM 5330 100.0
L7 CD4 Naive 1 0.0
NULL

9 Visualisation 3 — Pseudotime per Cell Line

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Pseudotime measures DIFFERENTIATION MATURITY on a 0-100 scale where:
#   0  = most naive-like (undifferentiated, CD4 Naive = M00)
#   100 = most terminally differentiated (Temra or terminal Treg)
#
# For Sézary cells, the pseudotime distribution tells us:
#
# LOW pseudotime (0-30):
#   Malignant cells resemble undifferentiated/early memory cells.
#   Suggests transformation occurred at an early differentiation stage.
#   These tumours may express naive/TCM survival factors (IL-7R, CCR7).
#
# INTERMEDIATE pseudotime (30-60):
#   Cells resemble central/transitional memory T cells (TCM stage).
#   Most SS literature suggests this — Sézary cells often retain CCR4,
#   CD27, and other TCM markers while losing CCR7.
#
# HIGH pseudotime (60-100):
#   Cells resemble TEM or Temra — unusual for SS as most SS cells
#   do not express high levels of GZMB, PRF1, or CX3CR1.
#   If a line maps here, it may represent a more cytotoxic variant.
#
# Comparing lines: if L1 has pseudotime ~30 and L5 has pseudotime ~60,
# this means L5 cells are more differentiated than L1 cells, possibly
# explaining differences in disease aggressiveness or drug sensitivity.
# ──────────────────────────────────────────────────────────────────────────

# ── Panel A: MST pseudotime violin per cell line ──────────────────────────
p_violin_mst <- ggplot(sez_plot_df,
                        aes(x=cell_line, y=mst_pt, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.8,
               outlier.alpha=.4, colour="grey30") +
  # Reference lines for biological interpretation
  geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
  geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
  annotate("text", x=0.4, y=16,  label="Naive/\nearly TCM", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  annotate("text", x=0.4, y=50,  label="TCM/\nearly TEM", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  annotate("text", x=0.4, y=83,  label="TEM/\nTemra/Treg", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
  theme_classic() +
  labs(x="Cell line", y="MST pseudotime (0–100)",
       title="Differentiation maturity per Sézary cell line (Custom MST)",
       subtitle="Dashed lines = trajectory thirds | 0=Naive, 100=Temra/Terminal Treg") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

# ── Panel B: Monocle3 pseudotime violin ───────────────────────────────────
p_violin_m3 <- ggplot(sez_plot_df,
                       aes(x=cell_line, y=m3_pt, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.8,
               outlier.alpha=.4, colour="grey30") +
  geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
  geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
  theme_classic() +
  labs(x="Cell line", y="Monocle3 pseudotime (0–100)",
       title="Differentiation maturity per Sézary cell line (Monocle3)",
       subtitle="KNN-weighted transfer from 15 nearest reference neighbours") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

print(p_violin_mst / p_violin_m3)


10 Visualisation 4 — Nearest Milestone per Cell Line

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Nearest milestone is the most PRECISE cell of origin indicator because
# it places cells at a specific point on the differentiation path, not just
# a broad state category. The milestone system has 8 positions:
#
#   M00 = CD4 Naive (root)
#   M01 = TCM early  (CCR7+, S100A4+, AQP3 low)
#   M02 = TCM late   (branch point — CCR4+, ITGB1+) ← KEY BRANCH POINT
#   M03 = TEM early  (GZMK+, CCL5+, CXCR3+)
#   M04 = TEM late   (EOMES+, HOPX+)
#   M05 = Temra/CTL  (GZMB+, PRF1+, CX3CR1+, NKG7+)
#   M06 = Treg naive (FOXP3+, IL2RA+, IKZF2+)
#   M07 = Treg terminal (CTLA4+, TIGIT+, TNFRSF18+)
#
# M02 is the BRANCH POINT — cells at M02 are at the decision point between
# becoming effector memory (TEM arm) or regulatory (Treg arm). If Sézary
# cells cluster at M02, this is biologically very interesting — it suggests
# the malignant transformation captured a cell that was "deciding" its fate.
# ──────────────────────────────────────────────────────────────────────────

# Milestone labels with biological annotation
milestone_labels <- c(
  "M00" = "M00\nNaive",
  "M01" = "M01\nTCM early",
  "M02" = "M02\nTCM late\n(branch)",
  "M03" = "M03\nTEM early",
  "M04" = "M04\nTEM late",
  "M05" = "M05\nTemra/CTL",
  "M06" = "M06\nTreg naive",
  "M07" = "M07\nTreg terminal"
)

# Compute milestone distribution per cell line
ms_per_line <- sez_plot_df %>%
  filter(!is.na(milestone_ref)) %>%
  count(cell_line, milestone_ref) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(
    milestone_ref = factor(milestone_ref, levels=milestone_order),
    ms_state = milestone_centroids$state[
      match(milestone_ref, milestone_centroids$milestone)]
  )

# ── Heatmap-style tile plot: cell line × milestone ────────────────────────
p_ms_heatmap <- ggplot(ms_per_line,
                        aes(x=milestone_ref, y=cell_line, fill=pct)) +
  geom_tile(colour="white", linewidth=.5) +
  geom_text(aes(label=ifelse(pct >= 2, paste0(pct,"%"), "")),
            size=3.2, fontface="bold",
            colour=ifelse(ms_per_line$pct > 40, "white", "grey20")) +
  scale_fill_gradientn(
    colours = c("#F7FBFF","#DEEBF7","#9ECAE1","#4292C6","#08519C","#08306B"),
    name    = "% of cell\nline cells",
    limits  = c(0,100)
  ) +
  scale_x_discrete(labels=milestone_labels) +
  theme_classic() +
  labs(x="Reference milestone", y="Sézary cell line",
       title="Nearest reference milestone distribution per cell line",
       subtitle="Darker = more cells mapping to that milestone | M02 = branch point between effector and regulatory arms") +
  theme(plot.title   = element_text(size=13, face="bold"),
        axis.text.x  = element_text(size=9),
        axis.text.y  = element_text(size=11, face="bold"),
        legend.title = element_text(size=9))

print(p_ms_heatmap)


# ── Stacked bar: milestone per cell line ──────────────────────────────────
ms_state_colors <- setNames(
  STATE_COLORS[milestone_centroids$state[match(milestone_order, milestone_centroids$milestone)]],
  milestone_order
)

p_ms_stack <- ggplot(ms_per_line,
                      aes(x=cell_line, y=pct, fill=milestone_ref)) +
  geom_col(width=.75, colour="white", linewidth=.3) +
  geom_text(aes(label=ifelse(pct >= 5, paste0(milestone_ref,"\n",pct,"%"), "")),
            position=position_stack(vjust=0.5),
            size=2.8, colour="white", fontface="bold") +
  scale_fill_manual(values=ms_state_colors,
                    labels=milestone_labels,
                    name="Milestone") +
  scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
  theme_classic() +
  labs(x="Cell line", y="% of cells",
       title="Milestone composition per cell line",
       subtitle="Colour = CD4 state | Label = milestone ID and % | shown if ≥5%") +
  theme(plot.title  = element_text(size=13, face="bold"),
        axis.text.x = element_text(size=12, face="bold"))

print(p_ms_stack)


11 Visualisation 5 — Individual Cell Line UMAP Panels

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This produces a detailed per-cell-line plot showing:
# - Top panel: cells coloured by predicted state (what state they resemble)
# - Bottom panel: cells coloured by MST pseudotime (how mature they are)
#
# For each cell line, you want to ask:
# 1. Is the cell line HOMOGENEOUS (all cells in one state/region) or
#    HETEROGENEOUS (scattered across multiple states)?
# 2. Does the cell line sit closer to the root (low pseudotime = naive-like)
#    or the tips (high pseudotime = mature effector or Treg-like)?
# 3. Are there TWO subpopulations within one line? This could indicate
#    clonal evolution or technical mixing.
# ──────────────────────────────────────────────────────────────────────────

plot_list <- list()

for (cl in CELL_LINE_ORDER) {
  df_cl <- sez_plot_df %>% filter(cell_line == cl)
  n_cl  <- nrow(df_cl)

  # State plot for this cell line
  p_state <- ggplot() +
    geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
               aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
    geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
                 colour="grey40", linewidth=.6, alpha=.6) +
    geom_point(data=df_cl,
               aes(x=UMAP_1, y=UMAP_2, colour=predicted_state),
               size=1.2, alpha=.85) +
    geom_text_repel(data=milestone_centroids,
                    aes(x=UMAP_1, y=UMAP_2, label=milestone),
                    size=2, colour="grey20", fontface="bold",
                    max.overlaps=8, segment.size=.2) +
    scale_colour_manual(values=STATE_COLORS, name="State",
                        drop=FALSE, na.value="grey40") +
    theme_classic() +
    labs(x="UMAP-1", y="UMAP-2",
         title=sprintf("%s — Predicted state (n=%s cells)",
                       cl, scales::comma(n_cl))) +
    theme(plot.title  = element_text(size=10, face="bold",
                                     colour=CELL_LINE_COLORS[cl]),
          legend.text = element_text(size=7),
          legend.title= element_text(size=8))

  # Pseudotime plot for this cell line
  p_pt <- ggplot() +
    geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
               aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
    geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
                 colour="grey40", linewidth=.6, alpha=.6) +
    geom_point(data=df_cl,
               aes(x=UMAP_1, y=UMAP_2, colour=mst_pt),
               size=1.2, alpha=.9) +
    geom_text_repel(data=milestone_centroids,
                    aes(x=UMAP_1, y=UMAP_2, label=milestone),
                    size=2, colour="grey40", fontface="bold",
                    max.overlaps=8, segment.size=.2) +
    scale_colour_gradientn(
      colours = c("#0D0887","#6A00A8","#B12A90","#E16462","#FCA636","#F0F921"),
      name="Pseudotime\n(0–100)", limits=c(0,100)) +
    theme_classic() +
    labs(x="UMAP-1", y="UMAP-2",
         title=sprintf("%s — MST pseudotime", cl)) +
    theme(plot.title  = element_text(size=10, face="bold",
                                     colour=CELL_LINE_COLORS[cl]),
          legend.text = element_text(size=7),
          legend.title= element_text(size=8))

  plot_list[[paste0(cl,"_state")]] <- p_state
  plot_list[[paste0(cl,"_pt")]]    <- p_pt
}

# Print in 2-column layout (state | pseudotime) per line
for (cl in CELL_LINE_ORDER) {
  print(plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]])
}


12 Visualisation 6 — Method Agreement per Cell Line

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We compare MST and Monocle3 pseudotime PER CELL LINE to assess robustness.
# If both methods agree on a cell line's pseudotime position, the result is
# robust. If they disagree for a specific line, that line may sit in a
# trajectory region where the two methods handle geometry differently
# (e.g. near the branch point M02).
# ──────────────────────────────────────────────────────────────────────────

# Per-line Spearman correlation
line_cors <- sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(
    spearman = round(cor(mst_pt, m3_pt, method="spearman",
                         use="complete.obs"), 3),
    n        = n(),
    .groups  = "drop"
  )

cat("Method agreement (Spearman ρ) per cell line:\n")
Method agreement (Spearman ρ) per cell line:
print(line_cors)

p_method_scatter <- ggplot(sez_plot_df,
                            aes(x=mst_pt, y=m3_pt, colour=cell_line)) +
  geom_point(size=.5, alpha=.4) +
  geom_abline(slope=1, intercept=0, linetype="dashed",
              colour="grey40", linewidth=.7) +
  geom_smooth(method="lm", se=FALSE, linewidth=.8) +
  geom_text(data=line_cors,
            aes(x=5, y=95, label=sprintf("ρ=%.2f", spearman)),
            size=3, fontface="bold", hjust=0, colour="grey20") +
  scale_colour_manual(values=CELL_LINE_COLORS, guide="none") +
  facet_wrap(~cell_line, ncol=4) +
  theme_classic() +
  labs(x="Custom MST pseudotime (0–100)",
       y="Monocle3 pseudotime (0–100)",
       title="Method agreement per cell line: Custom MST vs Monocle3 pseudotime",
       subtitle="Dashed = perfect agreement | ρ = Spearman correlation | Higher ρ = more reliable pseudotime assignment") +
  theme(plot.title  = element_text(size=12, face="bold"),
        strip.text  = element_text(size=10, face="bold"),
        strip.background = element_rect(fill="grey95"))

print(p_method_scatter)


13 Summary Table — Complete Cell Line Analysis

# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This comprehensive table is your primary result table for the paper.
# Key columns to interpret:
#
# dominant_state: the most common predicted CD4 state for that cell line
# dominant_state_pct: how confident — if >80%, strong cell of origin signal
# mean_pred_score: MapQuery confidence — >0.7 is reliable
# mst_mean_pt: average differentiation maturity (0=naive, 100=terminal)
# dominant_milestone: the single most common trajectory position
#
# For the paper, you will want to state something like:
# "X of 7 cell lines predominantly mapped to CD4 TCM (milestones M01-M02),
# with mean MST pseudotime of Y ± Z, indicating that Sézary syndrome
# cell lines transcriptomically resemble central memory CD4 T cells at
# an intermediate stage of differentiation."
# ──────────────────────────────────────────────────────────────────────────

sez_summary <- sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(
    n_cells            = n(),
    mean_pred_score    = round(mean(pred_score, na.rm=TRUE), 3),
    high_conf_pct      = round(100 * mean(pred_score > 0.7, na.rm=TRUE), 1),
    dominant_state     = names(sort(table(predicted_state), decreasing=TRUE))[1],
    dominant_state_pct = round(100 * max(table(predicted_state)) / n(), 1),
    dominant_milestone = names(sort(table(milestone_ref), decreasing=TRUE))[1],
    mst_mean_pt        = round(mean(mst_pt, na.rm=TRUE), 1),
    mst_sd_pt          = round(sd(mst_pt, na.rm=TRUE), 1),
    mst_median_pt      = round(median(mst_pt, na.rm=TRUE), 1),
    m3_mean_pt         = round(mean(m3_pt, na.rm=TRUE), 1),
    m3_sd_pt           = round(sd(m3_pt, na.rm=TRUE), 1),
    .groups            = "drop"
  ) %>%
  arrange(mst_mean_pt)

kable(sez_summary,
      col.names = c("Cell line","N cells","Mean pred.\nscore",
                    "% score\n>0.7",
                    "Dominant\nstate","Dominant\nstate %",
                    "Dominant\nmilestone",
                    "MST mean\npt","MST SD",
                    "MST median\npt",
                    "M3 mean\npt","M3 SD"),
      caption = "Complete per cell line trajectory summary — cell of origin analysis") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE) %>%
  column_spec(5, bold=TRUE) %>%
  column_spec(6, bold=TRUE) %>%
  column_spec(8:10, color="white", background=METHOD_COLORS["Custom MST"]) %>%
  column_spec(11:12, color="white", background=METHOD_COLORS["Monocle3"])
Complete per cell line trajectory summary — cell of origin analysis
Cell line N cells Mean pred. scor | % score >0 7|Dominant sta e | Dominant sta e %|Dominant miles one | MST m an pt| ST SD| MST me ian pt| M3 mean p
L2 5935 0.986 99.7 CD4 TCM 100.0 M03 3.1 0.6 2.7 61.6 6.0
L1 5825 0.841 82.4 CD4 TCM 96.8 M02 3.2 1.0 2.7 72.3 21.9
L3 6428 0.974 99.8 CD4 TCM 100.0 M03 3.2 0.7 2.7 61.8 9.1
L5 6022 0.967 99.8 CD4 TCM 100.0 M03 3.2 0.7 2.7 59.6 6.5
L6 5148 0.989 99.8 CD4 TCM 100.0 M07 3.5 0.6 4.0 61.9 9.6
L7 5331 0.990 99.9 CD4 TCM 100.0 M07 3.5 0.7 4.0 62.3 10.5
L4 6006 0.988 99.5 CD4 TCM 100.0 M07 3.7 0.6 4.0 59.2 5.5

# ── Per-line milestone distribution table ─────────────────────────────────
ms_dist_all <- sez_plot_df %>%
  filter(!is.na(milestone_ref)) %>%
  count(cell_line, milestone_ref) %>%
  left_join(milestone_centroids[,c("milestone","state","order_pos")],
            by=c("milestone_ref"="milestone")) %>%
  filter(!is.na(order_pos)) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(milestone_ref = factor(milestone_ref, levels=milestone_order)) %>%
  arrange(cell_line, order_pos)

cat("\n\nMilestone distribution per cell line:\n")


Milestone distribution per cell line:
ms_dist_all %>%
  select(cell_line, milestone_ref, state, n, pct) %>%
  kable(col.names=c("Cell line","Milestone","State","N cells","% cells"),
        caption="Per cell line milestone distribution") %>%
  kable_styling(bootstrap_options=c("striped","condensed"), full_width=FALSE) %>%
  print()
Per cell line milestone distribution
Cell line Milestone State N cells % cells
L1 M00 CD4 Naive 142 2.4
L1 M01 CD4 TCM 2 0.0
L1 M02 CD4 TCM 1515 26.0
L1 M03 CD4 TEM 1443 24.8
L1 M04 CD4 TEM 1375 23.6
L1 M05 CD4 Temra/CTL 290 5.0
L1 M06 Treg 7 0.1
L1 M07 Treg 1051 18.0
L2 M00 CD4 Naive 4 0.1
L2 M01 CD4 TCM 3 0.1
L2 M02 CD4 TCM 53 0.9
L2 M03 CD4 TEM 3877 65.3
L2 M04 CD4 TEM 61 1.0
L2 M05 CD4 Temra/CTL 20 0.3
L2 M06 Treg 7 0.1
L2 M07 Treg 1910 32.2
L3 M00 CD4 Naive 7 0.1
L3 M01 CD4 TCM 4 0.1
L3 M02 CD4 TCM 508 7.9
L3 M03 CD4 TEM 3316 51.6
L3 M04 CD4 TEM 9 0.1
L3 M06 Treg 13 0.2
L3 M07 Treg 2571 40.0
L4 M00 CD4 Naive 42 0.7
L4 M01 CD4 TCM 1 0.0
L4 M02 CD4 TCM 32 0.5
L4 M03 CD4 TEM 1278 21.3
L4 M04 CD4 TEM 1 0.0
L4 M06 Treg 10 0.2
L4 M07 Treg 4642 77.3
L5 M00 CD4 Naive 4 0.1
L5 M01 CD4 TCM 1 0.0
L5 M02 CD4 TCM 262 4.4
L5 M03 CD4 TEM 3295 54.7
L5 M04 CD4 TEM 3 0.0
L5 M07 Treg 2457 40.8
L6 M00 CD4 Naive 1 0.0
L6 M01 CD4 TCM 1 0.0
L6 M02 CD4 TCM 41 0.8
L6 M03 CD4 TEM 1774 34.5
L6 M04 CD4 TEM 110 2.1
L6 M05 CD4 Temra/CTL 2 0.0
L6 M06 Treg 46 0.9
L6 M07 Treg 3173 61.6
L7 M00 CD4 Naive 2 0.0
L7 M01 CD4 TCM 1 0.0
L7 M02 CD4 TCM 61 1.1
L7 M03 CD4 TEM 1901 35.7
L7 M04 CD4 TEM 1 0.0
L7 M06 Treg 12 0.2
L7 M07 Treg 3353 62.9
NULL

14 Prediction Score Quality Check


15 Save All Outputs

# Create output directories if they don't exist
dir.create("Objects", showWarnings=FALSE)
dir.create("Figures", showWarnings=FALSE)
dir.create("Tables",  showWarnings=FALSE)

# ── RDS objects ────────────────────────────────────────────────────────────
saveRDS(sezary_obj,          "Objects/sezary_celllines_projected.rds")
saveRDS(mst_graph,           "Objects/mst_graph.rds")
saveRDS(milestone_centroids, "Objects/milestone_centroids.rds")

# ── CSV tables ─────────────────────────────────────────────────────────────
write.csv(sez_summary,    "Tables/cellline_trajectory_summary.csv",   row.names=FALSE)
write.csv(state_per_line, "Tables/cellline_state_proportions.csv",    row.names=FALSE)
write.csv(ms_dist_all,    "Tables/cellline_milestone_distribution.csv",row.names=FALSE)
write.csv(line_cors,      "Tables/cellline_method_agreement.csv",     row.names=FALSE)
write.csv(sez_plot_df,    "Tables/cellline_all_cells_metadata.csv",   row.names=FALSE)

# ── Figures ────────────────────────────────────────────────────────────────
ggsave("Figures/fig_all_lines_umap.pdf",      p_all_lines,    width=14, height=12)
ggsave("Figures/fig_faceted_lines.pdf",       p_faceted,      width=18, height=10)
ggsave("Figures/fig_state_per_line.pdf",      p_state_stack,  width=14, height=6)
ggsave("Figures/fig_pseudotime_mst.pdf",      p_violin_mst,   width=14, height=6)
ggsave("Figures/fig_pseudotime_m3.pdf",       p_violin_m3,    width=14, height=6)
ggsave("Figures/fig_milestone_heatmap.pdf",   p_ms_heatmap,   width=14, height=7)
ggsave("Figures/fig_milestone_stack.pdf",     p_ms_stack,     width=14, height=7)
ggsave("Figures/fig_method_agreement.pdf",    p_method_scatter,width=14, height=6)
ggsave("Figures/fig_prediction_scores.pdf",   p_scores,       width=12, height=5)

# Individual line plots
for (cl in CELL_LINE_ORDER) {
  p_combined <- plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]]
  ggsave(sprintf("Figures/fig_%s_umap.pdf", cl),
         p_combined, width=16, height=7)
}

cat("✅ All outputs saved\n")
✅ All outputs saved
cat("  Objects/ — RDS files\n")
  Objects/ — RDS files
cat("  Figures/ — PDF plots\n")
  Figures/ — PDF plots
cat("  Tables/  — CSV data files\n")
  Tables/  — CSV data files

16 Biological Interpretation Guide


╔══════════════════════════════════════════════════════════════════════╗
║         BIOLOGICAL INTERPRETATION GUIDE FOR YOUR RESULTS            ║
╚══════════════════════════════════════════════════════════════════════╝

STEP 1: CHECK DOMINANT STATE (sez_summary table, dominant_state column)
─────────────────────────────────────────────────────────────────────────
• If most lines show CD4 TCM (dominant_state_pct >70%):
  → CONCLUSION: Sézary syndrome arises from Central Memory T cells
  → IMPLICATION: TCM survival signals (IL-7, IL-15, CCR7-independent
    homing) drive malignant cell persistence
  → LITERATURE: Consistent with Wang et al. 2023, Woollard et al. 2022

• If most lines show CD4 TEM:
  → CONCLUSION: Transformation occurred at the effector memory stage
  → IMPLICATION: Malignant cells have acquired partial effector
    programmes (GZMK, CCL5) but not terminal cytotoxicity

• If lines split between TCM and TEM:
  → CONCLUSION: Disease heterogeneity — different patients have
    different cells of origin
  → IMPLICATION: May explain clinical variability in drug response

STEP 2: CHECK DOMINANT MILESTONE (dominant_milestone column)
─────────────────────────────────────────────────────────────────────────
• M01 or M02 dominant → early-to-late TCM (most likely for SS)
• M02 dominant (branch point) → arrested at fate decision point
• M03 or M04 dominant → TEM commitment despite TCM-like label
• M06 dominant → Treg-like features (check FOXP3 expression)

STEP 3: CHECK PREDICTION SCORES (mean_pred_score column)
─────────────────────────────────────────────────────────────────────────
• >0.7: High confidence — strong cell of origin conclusion
• 0.5-0.7: Moderate — mention in limitations
• <0.5: Low — malignant transcriptome substantially deviates from
  all normal reference states; discuss as 'partial resemblance'

STEP 4: CHECK PSEUDOTIME (mst_mean_pt column)
─────────────────────────────────────────────────────────────────────────
• 0-33: Naive/early TCM differentiation stage
• 33-66: TCM/TEM transition stage
• 66-100: Mature effector or Treg stage
• Compare across lines: if L1 pseudotime < L5 pseudotime, L5 is more
  mature/differentiated — may correlate with aggressiveness

STEP 5: LOOK AT THE FACETED UMAP
─────────────────────────────────────────────────────────────────────────
• Tight clusters in one region → homogeneous cell line
• Scattered cells → heterogeneous line (may have subclones)
• Lines overlapping the same UMAP region → same cell of origin
• Lines in different UMAP regions → different cell of origin

17 Session Information

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-redhat-linux-gnu
Running under: Rocky Linux 9.7 (Blue Onyx)

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8   
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C        
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] future_1.69.0        kableExtra_1.4.0     knitr_1.51           RColorBrewer_1.1-3   scales_1.4.0         patchwork_1.3.2     
 [7] tidyr_1.3.2          dplyr_1.2.0          ggrepel_0.9.6        ggplot2_4.0.2        RANN_2.6.2           igraph_2.2.2        
[13] SeuratWrappers_0.4.0 Seurat_5.4.0         SeuratObject_5.3.0   sp_2.2-1            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.18.0      jsonlite_2.0.0         magrittr_2.0.4         spatstat.utils_3.2-1   farver_2.1.2          
  [6] rmarkdown_2.30         ragg_1.5.0             vctrs_0.7.1            ROCR_1.0-12            spatstat.explore_3.7-0
 [11] htmltools_0.5.9        sass_0.4.10            sctransform_0.4.3      parallelly_1.46.1      KernSmooth_2.23-26    
 [16] bslib_0.10.0           htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9             plotly_4.12.0         
 [21] zoo_1.8-15             cachem_1.1.0           mime_0.13              lifecycle_1.0.5        pkgconfig_2.0.3       
 [26] rsvd_1.0.5             Matrix_1.7-4           R6_2.6.1               fastmap_1.2.0          fitdistrplus_1.2-6    
 [31] shiny_1.12.1           digest_0.6.39          tensor_1.5.1           RSpectra_0.16-2        irlba_2.3.7           
 [36] textshaping_1.0.4      labeling_0.4.3         progressr_0.18.0       spatstat.sparse_3.1-0  mgcv_1.9-4            
 [41] httr_1.4.8             polyclip_1.10-7        abind_1.4-8            compiler_4.5.2         remotes_2.5.0         
 [46] withr_3.0.2            S7_0.2.1               fastDummies_1.7.5      R.utils_2.13.0         MASS_7.3-65           
 [51] tools_4.5.2            lmtest_0.9-40          otel_0.2.0             httpuv_1.6.16          future.apply_1.20.1   
 [56] goftest_1.2-3          R.oo_1.27.1            glue_1.8.0             nlme_3.1-168           promises_1.5.0        
 [61] grid_4.5.2             Rtsne_0.17             cluster_2.1.8.2        reshape2_1.4.5         generics_0.1.4        
 [66] gtable_0.3.6           spatstat.data_3.1-9    R.methodsS3_1.8.2      data.table_1.18.2.1    xml2_1.5.2            
 [71] spatstat.geom_3.7-0    RcppAnnoy_0.0.23       pillar_1.11.1          stringr_1.6.0          spam_2.11-3           
 [76] RcppHNSW_0.6.0         later_1.4.6            splines_4.5.2          lattice_0.22-9         survival_3.8-6        
 [81] deldir_2.0-4           tidyselect_1.2.1       miniUI_0.1.2           pbapply_1.7-4          gridExtra_2.3         
 [86] svglite_2.2.2          scattermore_1.2        xfun_0.56              matrixStats_1.5.0      stringi_1.8.7         
 [91] lazyeval_0.2.2         yaml_2.3.12            evaluate_1.0.5         codetools_0.2-20       tibble_3.3.1          
 [96] BiocManager_1.30.27    cli_3.6.5              uwot_0.2.4             xtable_1.8-4           reticulate_1.45.0     
[101] systemfonts_1.3.1      jquerylib_0.1.4        dichromat_2.0-0.1      Rcpp_1.1.1             globals_0.19.0        
[106] spatstat.random_3.4-4  png_0.1-8              spatstat.univar_3.1-6  parallel_4.5.2         dotCall64_1.2         
[111] listenv_0.10.0         viridisLite_0.4.3      ggridges_0.5.7         purrr_1.2.1            rlang_1.1.7           
[116] cowplot_1.2.0         

18 Metadata Reference

Metadata fields added to sezary_obj by this pipeline
Column Description Interpretation
orig.ident Cell line identity: L1–L7 (malignant Sézary lines) Use to split per-cell-line analysis
predicted.predicted_state MapQuery label transfer: most similar CD4 T cell state from reference Primary cell of origin indicator
predicted.predicted_state.score MapQuery confidence score (0–1): higher = more confident prediction >0.7 high confidence &#124; 0.5-0.7 moderate &#124; <0.5 interpret cautiously
predicted.predicted_milestone MapQuery label transfer: most similar milestone (M00–M07) from reference Most precise trajectory position (M02 = branch point)
mst_pseudotime MST pseudotime (0–100): 0=Naive-like, 100=Temra/Terminal Treg-like Differentiation maturity: lower = more naive-like at time of transformation
nearest_milestone_mst Nearest MST milestone after orthogonal edge projection Confirms trajectory position; M01-M02 = TCM, M03-M04 = TEM, M05 = Temra
monocle3_pseudotime KNN-weighted Monocle3 pseudotime from 15 nearest reference neighbours (0–100) Independent pseudotime estimate for robustness validation
---
title: "Sézary Cell Line Projection onto CD4 T Cell Reference Trajectory"
subtitle: "Per cell line trajectory mapping | Cell of origin analysis | L1–L7 Sézary lines"
author: "NASIR MAHMOOD ABBASI"
date: "`r Sys.Date()`"
output:
  html_notebook:
    number_sections: true
    toc: true
    toc_float:
      collapsed: false
    theme: journal
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo      = TRUE,
  message   = FALSE,
  warning   = FALSE,
  fig.align = "center",
  dpi       = 150
)
```

------------------------------------------------------------------------

# Biological Background & Rationale

## What is Sézary Syndrome?

Sézary syndrome (SS) is an aggressive **cutaneous T-cell lymphoma (CTCL)** characterised
by the triad of erythroderma (full-body skin redness), lymphadenopathy, and circulating
malignant T cells called **Sézary cells**. These are CD4+ T cells with cerebriform
(brain-like wrinkled) nuclei that have undergone malignant transformation.

The central unresolved question in SS biology is: **which normal CD4 T cell subtype
gave rise to the malignant clone?** This is the "cell of origin" question.
Answering it matters because:

- It tells us **which transcription factors and signalling pathways** were active
  when transformation occurred
- It predicts **which normal T cell programmes** are retained or corrupted in the
  malignant cells (explaining why SS cells home to skin, suppress immunity, etc.)
- It guides **therapeutic targeting** — if SS arises from TCM cells, targeting
  memory T cell survival signals (e.g. IL-15/IL-7 signalling) is rational

## Why Project onto a Trajectory?

A trajectory analysis places cells along a **continuous differentiation path**
rather than forcing them into discrete categories. Normal CD4 T cells differentiate
from Naive → TCM → TEM → Temra (effector arm) or Naive → TCM → Treg (regulatory arm).

When we project Sézary cell lines onto this trajectory, we can ask:

1. **Where on the trajectory do malignant cells land?** (cell of origin state)
2. **How far along differentiation are they?** (pseudotime = differentiation maturity)
3. **Are different cell lines (L1–L7) at the same or different positions?**
   (heterogeneity in cell of origin across the disease)
4. **Do all cells within a line occupy the same position, or is there intra-line
   heterogeneity?** (clonal evolution vs stable malignant state)

## Why Analyse Each Cell Line Separately?

You have 7 Sézary cell lines (L1–L7) derived from different patients. Each cell line
represents a **biologically independent malignant clone**. Key reasons to analyse them
separately:

- If all 7 lines map to the same trajectory position, this supports a **universal
  cell of origin** for SS (strong biological conclusion)
- If lines differ, this indicates **disease heterogeneity** — different patients
  may have different cells of origin, explaining clinical variability in drug response
- Comparing cell lines that are drug-sensitive vs drug-resistant (if known) to their
  trajectory position can reveal whether **differentiation state predicts drug response**

------------------------------------------------------------------------

# Setup

```{r libraries}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the same libraries as the reference trajectory script.
# RANN provides fast k-nearest-neighbour search (nn2) which we use to transfer
# Monocle3 pseudotime from reference cells to query Sézary cells by weighted
# averaging of their 15 nearest reference neighbours.
# ──────────────────────────────────────────────────────────────────────────

suppressPackageStartupMessages({
  library(Seurat)
  library(SeuratWrappers)
  library(igraph)
  library(RANN)
  library(ggplot2)
  library(ggrepel)
  library(dplyr)
  library(tidyr)
  library(patchwork)
  library(scales)
  library(RColorBrewer)
  library(knitr)
  library(kableExtra)
  library(future)
})

# ── Colour palettes ────────────────────────────────────────────────────────
# These match the reference trajectory script exactly for visual consistency
STATE_COLORS <- c(
  "CD4 Naive"     = "#4472C4",
  "CD4 TCM"       = "#70AD47",
  "CD4 TEM"       = "#ED7D31",
  "CD4 Temra/CTL" = "#C00000",
  "Treg"          = "#7030A0"
)

# 7 visually distinct colours for L1-L7
# Using a combination of Set1 and Dark2 palettes for maximum distinctiveness
CELL_LINE_COLORS <- c(
  "L1" = "#E41A1C",   # red
  "L2" = "#377EB8",   # blue
  "L3" = "#4DAF4A",   # green
  "L4" = "#FF7F00",   # orange
  "L5" = "#984EA3",   # purple
  "L6" = "#A65628",   # brown
  "L7" = "#F781BF"    # pink
)

METHOD_COLORS <- c(
  "Monocle3"   = "#e74c3c",
  "Custom MST" = "#2980b9"
)

STATE_ORDER    <- c("CD4 Naive","CD4 TCM","CD4 TEM","CD4 Temra/CTL","Treg")
CELL_LINE_ORDER <- paste0("L", 1:7)
UMAP_NAME      <- "umap"
STATE_COL      <- "predicted.celltype.l2"

cat("✅ Setup complete\n")
cat("Cell line colours:\n"); print(CELL_LINE_COLORS)
```

------------------------------------------------------------------------

# Load Objects & Subset to Malignant Cells

```{r load-objects}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We load the pre-computed reference object (cd4_ref_dual_trajectory.rds)
# which already contains:
#   - MST graph, milestone centroids, pseudotime for all reference cells
#   - Monocle3 pseudotime
#   - All trajectory infrastructure built in the main pipeline
#
# CRITICAL: We subset to L1-L7 ONLY, excluding CD4T_lab and CD4T_10x.
# CD4T_lab and CD4T_10x are healthy donor CD4 T cells included as controls
# in the original dataset. Projecting them onto the reference would be
# circular (healthy cells → healthy reference = trivially perfect mapping)
# and would NOT answer the cell of origin question.
#
# We want ONLY the malignant Sézary cells (L1-L7) for projection.
# ──────────────────────────────────────────────────────────────────────────

# Load pre-computed reference with all trajectory objects
cd4_ref <- readRDS("../cd4_ref_dual_trajectory.rds")

# Load the full Sézary object
sezary_raw <- readRDS("/home/nabbasi/apollo_home/1-Seurat_RDS_OBJECT_FINAL/All_samples_Merged_with_Renamed_Clusters_Cell_state-03-12-2025.rds.rds")

cat("Full Sézary object loaded:", ncol(sezary_raw), "cells\n")
cat("\norig.ident breakdown (ALL samples):\n")
print(table(sezary_raw$orig.ident))

# ── Subset to malignant cell lines only ───────────────────────────────────
CELL_LINES <- paste0("L", 1:7)
sezary_obj <- subset(sezary_raw, orig.ident %in% CELL_LINES)

cat(sprintf("\n✅ Malignant cells retained: %d\n", ncol(sezary_obj)))
cat(sprintf("   Healthy controls excluded: CD4T_lab + CD4T_10x (%d cells)\n",
            ncol(sezary_raw) - ncol(sezary_obj)))
cat("\nPer cell line (cells to project):\n")
print(table(sezary_obj$orig.ident))

# CRITICAL: Remove the full object to free RAM before MapQuery
# On RServer with shared memory, keeping sezary_raw in memory during
# MapQuery can cause the 46 GiB future.globals limit to be exceeded
rm(sezary_raw); gc()
cat("\n✅ Raw object removed from memory — ready for MapQuery\n")

# ── Validate both objects ─────────────────────────────────────────────────
stopifnot(
  "STATE_COL missing from cd4_ref"   = STATE_COL %in% colnames(cd4_ref@meta.data),
  "UMAP missing from cd4_ref"        = UMAP_NAME %in% names(cd4_ref@reductions),
  "milestone missing from cd4_ref"   = "milestone" %in% colnames(cd4_ref@meta.data),
  "monocle3_pseudotime_norm missing" = "monocle3_pseudotime_norm" %in% colnames(cd4_ref@meta.data),
  "RNA missing from cd4_ref"         = "RNA" %in% names(cd4_ref@assays),
  "RNA missing from sezary_obj"      = "RNA" %in% names(sezary_obj@assays),
  "UMAP model missing"               = !is.null(cd4_ref@reductions$umap@misc$model)
)
cat("✅ All validation checks passed\n")

# Print reference trajectory summary for context
cat("\nReference CD4 T cell state distribution:\n")
print(table(cd4_ref@meta.data[[STATE_COL]]))
cat("\nReference milestone distribution:\n")
print(table(cd4_ref@meta.data$milestone))
```

------------------------------------------------------------------------

# Reconstruct Trajectory Infrastructure

```{r reconstruct-trajectory}
# ── SAFE APPROACH: derive everything from the saved reference object ────────
# We do NOT hardcode edges — instead we reconstruct the graph from the
# actual milestone centroid positions in the saved cd4_ref object.
# This guarantees exact consistency with the reference trajectory.

# ── Step 1: Rebuild milestone centroids from reference metadata ────────────
umap_coords <- as.data.frame(Embeddings(cd4_ref, UMAP_NAME))
colnames(umap_coords) <- c("UMAP_1","UMAP_2")
umap_coords$milestone <- cd4_ref@meta.data$milestone
umap_coords$state     <- cd4_ref@meta.data[[STATE_COL]]

milestone_centroids <- umap_coords %>%
  filter(!is.na(milestone)) %>%
  group_by(milestone) %>%
  summarise(
    UMAP_1  = mean(UMAP_1),
    UMAP_2  = mean(UMAP_2),
    n_cells = n(),
    state   = dplyr::first(state),
    .groups = "drop"
  ) %>%
  arrange(milestone)

milestone_order           <- sort(unique(milestone_centroids$milestone))
milestone_centroids$order_pos <- match(milestone_centroids$milestone, milestone_order)

cat("Milestone centroids from saved reference:\n")
print(milestone_centroids[, c("milestone","state","UMAP_1","UMAP_2","n_cells")])

# ── Step 2: Verify milestone-to-state mapping matches expected biology ──────
expected_map <- c(
  M00="CD4 Naive",
  M01="CD4 TCM", M02="CD4 TCM",
  M03="CD4 TEM", M04="CD4 TEM",
  M05="CD4 Temra/CTL",
  M06="Treg",    M07="Treg"
)

actual_map <- setNames(milestone_centroids$state, milestone_centroids$milestone)
mismatches <- names(expected_map)[expected_map != actual_map[names(expected_map)]]

if (length(mismatches) > 0) {
  cat("⚠️  WARNING — milestone/state mismatch at:", paste(mismatches, collapse=", "), "\n")
  cat("   Expected:", expected_map[mismatches], "\n")
  cat("   Actual:  ", actual_map[mismatches], "\n")
  stop("Fix milestone assignment before proceeding")
} else {
  cat("✅ Milestone-to-state mapping verified\n")
}

# ── Step 3: Build centroid matrix and distance matrix ──────────────────────
centroid_mat <- as.matrix(milestone_centroids[, c("UMAP_1","UMAP_2")])
rownames(centroid_mat) <- milestone_centroids$milestone
dist_mat <- as.matrix(dist(centroid_mat, method="euclidean"))

# ── Step 4: Rebuild MST with SAME manual topology as main pipeline ─────────
# Edge list derived from your confirmed milestone distribution:
# M00=Naive(2037) M01=TCM_early(388) M02=TCM_late/branch(8679)
# M03=TEM_early(64) M04=TEM_late(81) M05=Temra(10)
# M06=Treg_naive(146) M07=Treg_terminal(61)
edge_list_ordered <- matrix(c(
  "M00","M01",   # Naive → TCM early
  "M01","M02",   # TCM early → TCM late (branch point — largest TCM cluster)
  "M02","M03",   # TCM late → TEM early  (effector arm)
  "M03","M04",   # TEM early → TEM late
  "M04","M05",   # TEM late → Temra/CTL  (terminal effector)
  "M02","M06",   # TCM late → Treg naive (regulatory arm)
  "M06","M07"    # Treg naive → Treg terminal
), ncol=2, byrow=TRUE)

edge_weights <- apply(edge_list_ordered, 1, function(e) dist_mat[e[1], e[2]])

if (any(!is.finite(edge_weights))) {
  stop("Non-finite edge weights — check milestone names match dist_mat rownames")
}

mst_graph <- graph_from_edgelist(edge_list_ordered, directed=FALSE)
E(mst_graph)$weight <- edge_weights

cat("\nMST edges reconstructed:\n")
for (i in seq_len(nrow(edge_list_ordered))) {
  s1 <- actual_map[edge_list_ordered[i,1]]
  s2 <- actual_map[edge_list_ordered[i,2]]
  cat(sprintf("  %s(%s) → %s(%s)  dist=%.4f\n",
              edge_list_ordered[i,1], s1,
              edge_list_ordered[i,2], s2,
              edge_weights[i]))
}

# ── Step 5: Graph distances from root ─────────────────────────────────────
graph_dists  <- distances(mst_graph, v="M00", weights=NA)
g_dists_vec  <- setNames(as.numeric(graph_dists), colnames(graph_dists))

# CRITICAL: use same pseudotime range as reference for consistent 0-100 scale
mst_pt_range <- range(cd4_ref@meta.data$mst_pseudotime_norm, na.rm=TRUE)
cat(sprintf("\nReference pseudotime range: %.1f – %.1f\n",
            mst_pt_range[1], mst_pt_range[2]))

# ── Step 6: MST edge coordinates for plotting ──────────────────────────────
mst_edges   <- as_edgelist(mst_graph)
mst_edge_df <- do.call(rbind, lapply(seq_len(nrow(mst_edges)), function(i) {
  m1 <- mst_edges[i,1]; m2 <- mst_edges[i,2]
  r1 <- milestone_centroids[milestone_centroids$milestone==m1,]
  r2 <- milestone_centroids[milestone_centroids$milestone==m2,]
  data.frame(x1=r1$UMAP_1, y1=r1$UMAP_2, x2=r2$UMAP_1, y2=r2$UMAP_2)
}))

# ── Step 7: Reference UMAP background ─────────────────────────────────────
ref_bg <- data.frame(
  UMAP_1 = Embeddings(cd4_ref, UMAP_NAME)[,1],
  UMAP_2 = Embeddings(cd4_ref, UMAP_NAME)[,2],
  state  = cd4_ref@meta.data[[STATE_COL]]
)

# ── Step 8: project_onto_mst function ─────────────────────────────────────
project_onto_mst <- function(cell_coords, centroid_mat, mst_graph, graph_dists) {
  edges      <- as_edgelist(mst_graph)
  n_cells    <- nrow(cell_coords)
  pseudotime <- numeric(n_cells)
  nearest_ms <- character(n_cells)

  for (i in seq_len(n_cells)) {
    cx <- cell_coords[i,1]; cy <- cell_coords[i,2]
    best_dist <- Inf; best_pt <- 0; best_ms <- NA_character_

    for (e in seq_len(nrow(edges))) {
      m1 <- edges[e,1]; m2 <- edges[e,2]
      p1 <- centroid_mat[m1,]; p2 <- centroid_mat[m2,]
      dx <- p2[1]-p1[1]; dy <- p2[2]-p1[2]
      len2 <- dx^2 + dy^2
      if (len2 < 1e-10) next
      t    <- max(0, min(1, ((cx-p1[1])*dx + (cy-p1[2])*dy) / len2))
      proj <- c(p1[1]+t*dx, p1[2]+t*dy)
      d    <- sqrt((cx-proj[1])^2 + (cy-proj[2])^2)
      if (d < best_dist) {
        best_dist <- d
        best_pt   <- graph_dists[m1] + t*(graph_dists[m2]-graph_dists[m1])
        best_ms   <- if (t < 0.5) m1 else m2
      }
    }
    pseudotime[i] <- best_pt
    nearest_ms[i] <- best_ms
  }
  list(pseudotime=pseudotime, nearest_ms=nearest_ms)
}

cat("\n✅ Trajectory infrastructure ready\n")
cat(sprintf("MST: %d nodes, %d edges | Root: M00 (CD4 Naive)\n",
            vcount(mst_graph), ecount(mst_graph)))
```

The key differences from what I had before are that the milestone verification step checks the actual state assignments from your object against expected biology and stops with a clear error message if anything is wrong, the comment on M02 now correctly notes it is the **large** TCM cluster (8,679 cells) that acts as the branch point, and the `mst_pt_range` is taken directly from the saved reference pseudotime rather than being recalculated, guaranteeing the 0–100 scale for Sézary cells is identical to the reference scale. Everything downstream in the script stays unchanged.

------------------------------------------------------------------------

# MapQuery: Project Sézary Cell Lines onto Reference

```{r sezary-mapquery}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# MapQuery is a Seurat function that performs reference-based cell mapping.
# It does two things simultaneously:
#
# 1. LABEL TRANSFER: Uses weighted nearest-neighbour anchors (cells that
#    are mutually similar between reference and query in PCA space) to
#    transfer cell type labels from reference → query. Each Sézary cell
#    gets a predicted CD4 state label (Naive/TCM/TEM/Temra/Treg) with a
#    confidence score (0-1).
#
# 2. UMAP PROJECTION: Projects Sézary cells into the reference UMAP
#    coordinate system. This is NOT a new UMAP — it places each Sézary
#    cell at the coordinates it would occupy IF it were part of the
#    reference dataset. This allows direct visual comparison on the
#    same UMAP space.
#
# WHY THIS MATTERS BIOLOGICALLY:
# The predicted state tells us which normal CD4 subtype the malignant cell
# most transcriptomically resembles. If a Sézary cell line projects to
# CD4 TCM space with high confidence, it means the malignant transcriptome
# most closely matches central memory T cells — implying TCM is the cell
# of origin or that the malignant transformation arrested differentiation
# at the TCM stage.
#
# TECHNICAL: We use plan("sequential") to disable parallelism because the
# reference object is ~46 GiB when exported to parallel workers, exceeding
# RServer memory limits. Sequential mode is slower but memory-safe.
# ──────────────────────────────────────────────────────────────────────────

plan("sequential")
options(future.globals.maxSize = 60 * 1024^3)

DefaultAssay(cd4_ref)    <- "RNA"
DefaultAssay(sezary_obj) <- "RNA"

# FindVariableFeatures on reference ensures gene selection is current
cd4_ref <- FindVariableFeatures(cd4_ref, nfeatures=3000, verbose=FALSE)

cat("Finding transfer anchors...\n")
cat("(This identifies cells that are mutually similar between\n")
cat(" reference and query in PCA space — the bridge for label transfer)\n\n")

anchors <- FindTransferAnchors(
  reference            = cd4_ref,
  query                = sezary_obj,
  normalization.method = "LogNormalize",
  reference.reduction  = "pca",
  dims                 = 1:30,
  reduction            = "pcaproject",
  verbose              = FALSE
)

cat(sprintf("✅ Anchors found: %d anchor pairs\n", nrow(anchors@anchors)))
cat("\nRunning MapQuery (label transfer + UMAP projection)...\n")

sezary_obj <- MapQuery(
  anchorset           = anchors,
  query               = sezary_obj,
  reference           = cd4_ref,
  refdata             = list(
    predicted_state     = STATE_COL,    # transfer CD4 state labels
    predicted_milestone = "milestone"   # transfer milestone labels
  ),
  reference.reduction = "pca",
  reduction.model     = "umap",         # project into reference UMAP space
  verbose             = FALSE
)

cat("✅ MapQuery complete\n\n")

# ── Quality check ─────────────────────────────────────────────────────────
cat("=== PROJECTION QUALITY CHECK ===\n")
cat(sprintf("Cells projected: %d\n", ncol(sezary_obj)))
cat(sprintf("Mean prediction score: %.3f (>0.5 = reliable, >0.7 = confident)\n",
            mean(sezary_obj$predicted.predicted_state.score, na.rm=TRUE)))
cat(sprintf("Cells with score >0.5: %d (%.1f%%)\n",
            sum(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE),
            100*mean(sezary_obj$predicted.predicted_state.score > 0.5, na.rm=TRUE)))
cat(sprintf("Cells with score >0.7: %d (%.1f%%)\n",
            sum(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE),
            100*mean(sezary_obj$predicted.predicted_state.score > 0.7, na.rm=TRUE)))

cat("\n=== PREDICTED STATE BY CELL LINE ===\n")
print(table(sezary_obj$predicted.predicted_state,
            sezary_obj$orig.ident,
            dnn=c("Predicted state","Cell line")))

cat("\n=== PREDICTED MILESTONE BY CELL LINE ===\n")
print(table(sezary_obj$predicted.predicted_milestone,
            sezary_obj$orig.ident,
            dnn=c("Predicted milestone","Cell line")))
```

------------------------------------------------------------------------

# Assign Pseudotime to Sézary Cell Lines

```{r sezary-pseudotime}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We assign pseudotime to Sézary cells using TWO methods:
#
# METHOD 1 — MST projection:
# Each Sézary cell's ref.umap coordinates (from MapQuery) are projected
# onto the nearest edge of the MST. Pseudotime = interpolated graph distance
# from the root (M00 = CD4 Naive). This gives a biologically interpretable
# value: 0 = most naive-like, 100 = most terminally differentiated.
#
# METHOD 2 — KNN-weighted Monocle3 transfer:
# For each Sézary cell, we find its 15 nearest reference cells in UMAP space,
# then compute a weighted average of their Monocle3 pseudotime values.
# Weights are inverse distance (closer reference cells contribute more).
# This transfers the Monocle3 pseudotime scale without re-running Monocle3
# on the query — appropriate because Monocle3 cannot easily incorporate
# new query cells into an existing principal graph.
#
# INTERPRETATION:
# - High pseudotime (>50): malignant cells resemble mature/differentiated T cells
# - Low pseudotime (<25): malignant cells resemble naive or early memory T cells
# - This tells us the DIFFERENTIATION STAGE at which malignant transformation occurred
# ──────────────────────────────────────────────────────────────────────────

stopifnot("ref.umap missing — MapQuery may have failed" =
          "ref.umap" %in% names(sezary_obj@reductions))

# Extract Sézary UMAP coordinates in reference space
sez_umap_coords           <- as.matrix(Embeddings(sezary_obj, "ref.umap"))
colnames(sez_umap_coords) <- c("UMAP_1","UMAP_2")

# ── Method 1: MST pseudotime ──────────────────────────────────────────────
cat("Projecting Sézary cells onto MST...\n")
sez_mst <- project_onto_mst(sez_umap_coords, centroid_mat, mst_graph, g_dists_vec)

# ── Method 2: KNN-weighted Monocle3 pseudotime ───────────────────────────
cat("Transferring Monocle3 pseudotime via KNN (k=15)...\n")
ref_umap_mat <- Embeddings(cd4_ref, UMAP_NAME)
ref_m3_pt    <- cd4_ref@meta.data$monocle3_pseudotime_norm

nn_result <- nn2(data=ref_umap_mat, query=sez_umap_coords, k=15)
inv_dist  <- 1 / (nn_result$nn.dists + 1e-6)
weights   <- inv_dist / rowSums(inv_dist)

# ── Assign to metadata ────────────────────────────────────────────────────
meta_sez <- sezary_obj@meta.data  # Fixed: remove [mailto: ] syntax

meta_sez$mst_pseudotime <- pmax(0, pmin(100,
  100 * (sez_mst$pseudotime - mst_pt_range[1]) /
         (mst_pt_range[2]  - mst_pt_range[1])
))
meta_sez$nearest_milestone_mst <- sez_mst$nearest_ms
meta_sez$monocle3_pseudotime   <- rowSums(
  weights * matrix(ref_m3_pt[nn_result$nn.idx],
                   nrow=nrow(nn_result$nn.idx))
)

sezary_obj@meta.data <- meta_sez  # Fixed: remove [mailto: ] syntax

# ── Build plot dataframe with cell line info ──────────────────────────────
sez_plot_df <- data.frame(
  UMAP_1          = sez_umap_coords[,"UMAP_1"],
  UMAP_2          = sez_umap_coords[,"UMAP_2"],
  cell_line       = factor(sezary_obj$orig.ident, levels=CELL_LINE_ORDER),
  predicted_state = sezary_obj$predicted.predicted_state,
  pred_score      = sezary_obj$predicted.predicted_state.score,
  mst_pt          = sezary_obj@meta.data$mst_pseudotime,
  m3_pt           = sezary_obj@meta.data$monocle3_pseudotime,
  milestone_ref   = sezary_obj@meta.data$nearest_milestone_mst
)

cat(sprintf("✅ Pseudotime assigned to %d cells\n", nrow(sez_plot_df)))

# FIXED: Correct sprintf() usage for range vectors
mst_range <- range(sez_plot_df$mst_pt, na.rm=TRUE)
m3_range  <- range(sez_plot_df$m3_pt, na.rm=TRUE)

cat(sprintf("MST pseudotime range:      %.1f – %.1f\n", mst_range[1], mst_range[2]))
cat(sprintf("Monocle3 pseudotime range: %.1f – %.1f\n", m3_range[1], m3_range[2]))

cat("\nMST pseudotime per cell line (median):\n")
sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(median_mst = round(median(mst_pt, na.rm=TRUE), 1),
            median_m3  = round(median(m3_pt,  na.rm=TRUE), 1),
            .groups="drop") %>%
  print()

```

------------------------------------------------------------------------

# Visualisation 1 — All Cell Lines Together

```{r all-lines-umap, fig.width=8, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This overview plot shows ALL 7 cell lines simultaneously on the reference
# UMAP. This answers: "As a group, where do Sézary cell lines map?"
# The faceted version shows each line individually for detailed comparison.
#
# What to look for:
# - CLUSTERING: Do all cell lines cluster together (same cell of origin)?
#   Or do they scatter to different regions (heterogeneous cell of origin)?
# - OVERLAP WITH REFERENCE STATES: Which grey reference region do the
#   coloured Sézary cells overlap with?
# - MILESTONE PROXIMITY: Which milestone labels (M00-M07) are the cells
#   nearest to? This gives the most precise trajectory position.
# ──────────────────────────────────────────────────────────────────────────

# ── Panel A: All lines on reference UMAP ──────────────────────────────────
p_all_lines <- ggplot() +
  # Reference cells as grey background — shows the normal CD4 landscape
  geom_point(data = ref_bg[sample(nrow(ref_bg)),],
             aes(x=UMAP_1, y=UMAP_2), colour="grey88", size=.25, alpha=.4) +
  # MST trajectory lines
  geom_segment(data = mst_edge_df,
               aes(x=x1, y=y1, xend=x2, yend=y2),
               colour="grey30", linewidth=1, alpha=.8, lineend="round") +
  # Sézary cells coloured by cell line
  geom_point(data = sez_plot_df[sample(nrow(sez_plot_df)),],
             aes(x=UMAP_1, y=UMAP_2, colour=cell_line),
             size=1.5, alpha=.75) +
  # Milestone labels with state annotation
  geom_label_repel(data = milestone_centroids,
                   aes(x=UMAP_1, y=UMAP_2,
                       label=paste0(milestone, "\n", state)),
                   size=2.5, fontface="bold", fill="white",
                   colour="grey15", label.size=0.2,
                   max.overlaps=20, box.padding=0.5) +
  scale_colour_manual(values=CELL_LINE_COLORS, name="Cell line") +
  theme_classic() +
  labs(x="UMAP-1", y="UMAP-2",
       title="Sézary cell lines — projection onto healthy CD4 T cell reference",
       subtitle="Grey = healthy CD4 reference (11,466 cells) | Coloured = Sézary cell lines (L1–L7) | Lines = MST trajectory") +
  theme(plot.title    = element_text(size=14, face="bold"),
        plot.subtitle = element_text(size=10, colour="grey40"),
        legend.title  = element_text(size=11, face="bold"),
        legend.text   = element_text(size=10)) +
  guides(colour = guide_legend(override.aes=list(size=4, alpha=1)))

print(p_all_lines)
```




```{r faceted-lines, fig.width=13, fig.height=9}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Faceting by cell line is the most important visualisation for cell of
# origin analysis. Each panel shows ONE cell line projected onto the
# same reference landscape. This allows you to:
# 1. See exactly where each line's cells concentrate on the UMAP
# 2. Compare lines directly — do L1 and L2 overlap? Do L5 and L6 differ?
# 3. Identify intra-line heterogeneity — is the line a single tight cluster
#    or scattered across multiple states?
# ──────────────────────────────────────────────────────────────────────────

p_faceted <- ggplot() +
  geom_point(data = ref_bg[sample(nrow(ref_bg)),],
             aes(x=UMAP_1, y=UMAP_2), colour="grey88", size=.2, alpha=.35) +
  geom_segment(data = mst_edge_df,
               aes(x=x1, y=y1, xend=x2, yend=y2),
               colour="grey40", linewidth=.7, alpha=.7, lineend="round") +
  geom_point(data = sez_plot_df,
             aes(x=UMAP_1, y=UMAP_2, colour=cell_line),
             size=1, alpha=.8) +
  geom_text_repel(data = milestone_centroids,
                  aes(x=UMAP_1, y=UMAP_2, label=milestone),
                  size=2.2, fontface="bold", colour="grey20",
                  max.overlaps=8, segment.size=.2, box.padding=0.3) +
  scale_colour_manual(values=CELL_LINE_COLORS, guide="none") +
  facet_wrap(~cell_line, ncol=4) +
  theme_classic() +
  labs(x="UMAP-1", y="UMAP-2",
       title="Per cell line projection — each panel shows one Sézary cell line on the CD4 reference",
       subtitle="Grey = healthy CD4 reference | Coloured = that cell line's cells | M00=Naive, M01-M02=TCM, M03-M04=TEM, M05=Temra, M06-M07=Treg") +
  theme(plot.title   = element_text(size=13, face="bold"),
        strip.text   = element_text(size=12, face="bold"),
        strip.background = element_rect(fill="grey95", colour="grey70"))

print(p_faceted)

```

------------------------------------------------------------------------

# Visualisation 2 — Predicted State per Cell Line

```{r state-per-line, fig.width=10, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# The stacked bar chart shows what PROPORTION of each cell line's cells
# map to each CD4 state. This is the primary cell of origin summary.
#
# Ideal outcome for a strong conclusion:
# - ONE state dominates (>70%) across most cell lines → clear cell of origin
# - e.g. if L1-L7 all show >80% CD4 TCM → Sézary originates from TCM cells
#
# Mixed outcome (biologically also informative):
# - Some lines show TCM dominance, others TEM → disease heterogeneity
# - This would suggest different patients have different cells of origin,
#   possibly correlating with different clinical presentations or drug responses
# ──────────────────────────────────────────────────────────────────────────

# Compute per-line state proportions
state_per_line <- sez_plot_df %>%
  filter(!is.na(predicted_state)) %>%
  count(cell_line, predicted_state) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(predicted_state = factor(predicted_state, levels=STATE_ORDER))

# ── Stacked bar: proportion of cells per state per line ───────────────────
p_state_stack <- ggplot(state_per_line,
                         aes(x=cell_line, y=pct, fill=predicted_state)) +
  geom_col(width=.75, colour="white", linewidth=.3) +
  geom_text(aes(label=ifelse(pct >= 5, paste0(pct,"%"), "")),
            position=position_stack(vjust=0.5),
            size=3.2, colour="white", fontface="bold") +
  scale_fill_manual(values=STATE_COLORS, name="Predicted CD4 state") +
  scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
  theme_classic() +
  labs(x="Cell line", y="% of cells",
       title="Predicted CD4 state composition per Sézary cell line",
       subtitle="Each bar = 100% of that cell line's cells | Labels shown for proportions ≥5%") +
  theme(plot.title  = element_text(size=13, face="bold"),
        axis.text.x = element_text(size=12, face="bold"),
        legend.title = element_text(size=10, face="bold"))

print(p_state_stack)

# ── Print summary table ───────────────────────────────────────────────────
cat("\nState proportions per cell line:\n")
state_per_line %>%
  select(cell_line, predicted_state, n, pct) %>%
  arrange(cell_line, desc(pct)) %>%
  kable(col.names=c("Cell line","Predicted state","N cells","% cells"),
        caption="Predicted CD4 state composition per Sézary cell line") %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) %>%
  print()
```

------------------------------------------------------------------------

# Visualisation 3 — Pseudotime per Cell Line

```{r pseudotime-per-line, fig.width=8, fig.height=10}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Pseudotime measures DIFFERENTIATION MATURITY on a 0-100 scale where:
#   0  = most naive-like (undifferentiated, CD4 Naive = M00)
#   100 = most terminally differentiated (Temra or terminal Treg)
#
# For Sézary cells, the pseudotime distribution tells us:
#
# LOW pseudotime (0-30):
#   Malignant cells resemble undifferentiated/early memory cells.
#   Suggests transformation occurred at an early differentiation stage.
#   These tumours may express naive/TCM survival factors (IL-7R, CCR7).
#
# INTERMEDIATE pseudotime (30-60):
#   Cells resemble central/transitional memory T cells (TCM stage).
#   Most SS literature suggests this — Sézary cells often retain CCR4,
#   CD27, and other TCM markers while losing CCR7.
#
# HIGH pseudotime (60-100):
#   Cells resemble TEM or Temra — unusual for SS as most SS cells
#   do not express high levels of GZMB, PRF1, or CX3CR1.
#   If a line maps here, it may represent a more cytotoxic variant.
#
# Comparing lines: if L1 has pseudotime ~30 and L5 has pseudotime ~60,
# this means L5 cells are more differentiated than L1 cells, possibly
# explaining differences in disease aggressiveness or drug sensitivity.
# ──────────────────────────────────────────────────────────────────────────

# ── Panel A: MST pseudotime violin per cell line ──────────────────────────
p_violin_mst <- ggplot(sez_plot_df,
                        aes(x=cell_line, y=mst_pt, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.8,
               outlier.alpha=.4, colour="grey30") +
  # Reference lines for biological interpretation
  geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
  geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
  annotate("text", x=0.4, y=16,  label="Naive/\nearly TCM", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  annotate("text", x=0.4, y=50,  label="TCM/\nearly TEM", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  annotate("text", x=0.4, y=83,  label="TEM/\nTemra/Treg", size=2.8,
           colour="grey40", fontface="italic", hjust=0) +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
  theme_classic() +
  labs(x="Cell line", y="MST pseudotime (0–100)",
       title="Differentiation maturity per Sézary cell line (Custom MST)",
       subtitle="Dashed lines = trajectory thirds | 0=Naive, 100=Temra/Terminal Treg") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

# ── Panel B: Monocle3 pseudotime violin ───────────────────────────────────
p_violin_m3 <- ggplot(sez_plot_df,
                       aes(x=cell_line, y=m3_pt, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.8,
               outlier.alpha=.4, colour="grey30") +
  geom_hline(yintercept=33, linetype="dashed", colour="grey60", linewidth=.5) +
  geom_hline(yintercept=66, linetype="dashed", colour="grey60", linewidth=.5) +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,100), breaks=seq(0,100,25)) +
  theme_classic() +
  labs(x="Cell line", y="Monocle3 pseudotime (0–100)",
       title="Differentiation maturity per Sézary cell line (Monocle3)",
       subtitle="KNN-weighted transfer from 15 nearest reference neighbours") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

print(p_violin_mst / p_violin_m3)
```

------------------------------------------------------------------------

# Visualisation 4 — Nearest Milestone per Cell Line

```{r milestone-per-line, fig.width=12, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# Nearest milestone is the most PRECISE cell of origin indicator because
# it places cells at a specific point on the differentiation path, not just
# a broad state category. The milestone system has 8 positions:
#
#   M00 = CD4 Naive (root)
#   M01 = TCM early  (CCR7+, S100A4+, AQP3 low)
#   M02 = TCM late   (branch point — CCR4+, ITGB1+) ← KEY BRANCH POINT
#   M03 = TEM early  (GZMK+, CCL5+, CXCR3+)
#   M04 = TEM late   (EOMES+, HOPX+)
#   M05 = Temra/CTL  (GZMB+, PRF1+, CX3CR1+, NKG7+)
#   M06 = Treg naive (FOXP3+, IL2RA+, IKZF2+)
#   M07 = Treg terminal (CTLA4+, TIGIT+, TNFRSF18+)
#
# M02 is the BRANCH POINT — cells at M02 are at the decision point between
# becoming effector memory (TEM arm) or regulatory (Treg arm). If Sézary
# cells cluster at M02, this is biologically very interesting — it suggests
# the malignant transformation captured a cell that was "deciding" its fate.
# ──────────────────────────────────────────────────────────────────────────

# Milestone labels with biological annotation
milestone_labels <- c(
  "M00" = "M00\nNaive",
  "M01" = "M01\nTCM early",
  "M02" = "M02\nTCM late\n(branch)",
  "M03" = "M03\nTEM early",
  "M04" = "M04\nTEM late",
  "M05" = "M05\nTemra/CTL",
  "M06" = "M06\nTreg naive",
  "M07" = "M07\nTreg terminal"
)

# Compute milestone distribution per cell line
ms_per_line <- sez_plot_df %>%
  filter(!is.na(milestone_ref)) %>%
  count(cell_line, milestone_ref) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(
    milestone_ref = factor(milestone_ref, levels=milestone_order),
    ms_state = milestone_centroids$state[
      match(milestone_ref, milestone_centroids$milestone)]
  )

# ── Heatmap-style tile plot: cell line × milestone ────────────────────────
p_ms_heatmap <- ggplot(ms_per_line,
                        aes(x=milestone_ref, y=cell_line, fill=pct)) +
  geom_tile(colour="white", linewidth=.5) +
  geom_text(aes(label=ifelse(pct >= 2, paste0(pct,"%"), "")),
            size=3.2, fontface="bold",
            colour=ifelse(ms_per_line$pct > 40, "white", "grey20")) +
  scale_fill_gradientn(
    colours = c("#F7FBFF","#DEEBF7","#9ECAE1","#4292C6","#08519C","#08306B"),
    name    = "% of cell\nline cells",
    limits  = c(0,100)
  ) +
  scale_x_discrete(labels=milestone_labels) +
  theme_classic() +
  labs(x="Reference milestone", y="Sézary cell line",
       title="Nearest reference milestone distribution per cell line",
       subtitle="Darker = more cells mapping to that milestone | M02 = branch point between effector and regulatory arms") +
  theme(plot.title   = element_text(size=13, face="bold"),
        axis.text.x  = element_text(size=9),
        axis.text.y  = element_text(size=11, face="bold"),
        legend.title = element_text(size=9))

print(p_ms_heatmap)

# ── Stacked bar: milestone per cell line ──────────────────────────────────
ms_state_colors <- setNames(
  STATE_COLORS[milestone_centroids$state[match(milestone_order, milestone_centroids$milestone)]],
  milestone_order
)

p_ms_stack <- ggplot(ms_per_line,
                      aes(x=cell_line, y=pct, fill=milestone_ref)) +
  geom_col(width=.75, colour="white", linewidth=.3) +
  geom_text(aes(label=ifelse(pct >= 5, paste0(milestone_ref,"\n",pct,"%"), "")),
            position=position_stack(vjust=0.5),
            size=2.8, colour="white", fontface="bold") +
  scale_fill_manual(values=ms_state_colors,
                    labels=milestone_labels,
                    name="Milestone") +
  scale_y_continuous(expand=c(0,0), limits=c(0,105)) +
  theme_classic() +
  labs(x="Cell line", y="% of cells",
       title="Milestone composition per cell line",
       subtitle="Colour = CD4 state | Label = milestone ID and % | shown if ≥5%") +
  theme(plot.title  = element_text(size=13, face="bold"),
        axis.text.x = element_text(size=12, face="bold"))

print(p_ms_stack)
```

------------------------------------------------------------------------

# Visualisation 5 — Individual Cell Line UMAP Panels

```{r individual-line-plots, fig.width=10, fig.height=4}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This produces a detailed per-cell-line plot showing:
# - Top panel: cells coloured by predicted state (what state they resemble)
# - Bottom panel: cells coloured by MST pseudotime (how mature they are)
#
# For each cell line, you want to ask:
# 1. Is the cell line HOMOGENEOUS (all cells in one state/region) or
#    HETEROGENEOUS (scattered across multiple states)?
# 2. Does the cell line sit closer to the root (low pseudotime = naive-like)
#    or the tips (high pseudotime = mature effector or Treg-like)?
# 3. Are there TWO subpopulations within one line? This could indicate
#    clonal evolution or technical mixing.
# ──────────────────────────────────────────────────────────────────────────

plot_list <- list()

for (cl in CELL_LINE_ORDER) {
  df_cl <- sez_plot_df %>% filter(cell_line == cl)
  n_cl  <- nrow(df_cl)

  # State plot for this cell line
  p_state <- ggplot() +
    geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
               aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
    geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
                 colour="grey40", linewidth=.6, alpha=.6) +
    geom_point(data=df_cl,
               aes(x=UMAP_1, y=UMAP_2, colour=predicted_state),
               size=1.2, alpha=.85) +
    geom_text_repel(data=milestone_centroids,
                    aes(x=UMAP_1, y=UMAP_2, label=milestone),
                    size=2, colour="grey20", fontface="bold",
                    max.overlaps=8, segment.size=.2) +
    scale_colour_manual(values=STATE_COLORS, name="State",
                        drop=FALSE, na.value="grey40") +
    theme_classic() +
    labs(x="UMAP-1", y="UMAP-2",
         title=sprintf("%s — Predicted state (n=%s cells)",
                       cl, scales::comma(n_cl))) +
    theme(plot.title  = element_text(size=10, face="bold",
                                     colour=CELL_LINE_COLORS[cl]),
          legend.text = element_text(size=7),
          legend.title= element_text(size=8))

  # Pseudotime plot for this cell line
  p_pt <- ggplot() +
    geom_point(data=ref_bg[sample(nrow(ref_bg), min(3000, nrow(ref_bg))),],
               aes(x=UMAP_1, y=UMAP_2), colour="grey60", size=.2, alpha=.4) +
    geom_segment(data=mst_edge_df, aes(x=x1,y=y1,xend=x2,yend=y2),
                 colour="grey40", linewidth=.6, alpha=.6) +
    geom_point(data=df_cl,
               aes(x=UMAP_1, y=UMAP_2, colour=mst_pt),
               size=1.2, alpha=.9) +
    geom_text_repel(data=milestone_centroids,
                    aes(x=UMAP_1, y=UMAP_2, label=milestone),
                    size=2, colour="grey40", fontface="bold",
                    max.overlaps=8, segment.size=.2) +
    scale_colour_gradientn(
      colours = c("#0D0887","#6A00A8","#B12A90","#E16462","#FCA636","#F0F921"),
      name="Pseudotime\n(0–100)", limits=c(0,100)) +
    theme_classic() +
    labs(x="UMAP-1", y="UMAP-2",
         title=sprintf("%s — MST pseudotime", cl)) +
    theme(plot.title  = element_text(size=10, face="bold",
                                     colour=CELL_LINE_COLORS[cl]),
          legend.text = element_text(size=7),
          legend.title= element_text(size=8))

  plot_list[[paste0(cl,"_state")]] <- p_state
  plot_list[[paste0(cl,"_pt")]]    <- p_pt
}

# Print in 2-column layout (state | pseudotime) per line
for (cl in CELL_LINE_ORDER) {
  print(plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]])
}
```

------------------------------------------------------------------------

# Visualisation 6 — Method Agreement per Cell Line

```{r method-agreement-lines, fig.width=14, fig.height=6}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# We compare MST and Monocle3 pseudotime PER CELL LINE to assess robustness.
# If both methods agree on a cell line's pseudotime position, the result is
# robust. If they disagree for a specific line, that line may sit in a
# trajectory region where the two methods handle geometry differently
# (e.g. near the branch point M02).
# ──────────────────────────────────────────────────────────────────────────

# Per-line Spearman correlation
line_cors <- sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(
    spearman = round(cor(mst_pt, m3_pt, method="spearman",
                         use="complete.obs"), 3),
    n        = n(),
    .groups  = "drop"
  )

cat("Method agreement (Spearman ρ) per cell line:\n")
print(line_cors)

p_method_scatter <- ggplot(sez_plot_df,
                            aes(x=mst_pt, y=m3_pt, colour=cell_line)) +
  geom_point(size=.5, alpha=.4) +
  geom_abline(slope=1, intercept=0, linetype="dashed",
              colour="grey40", linewidth=.7) +
  geom_smooth(method="lm", se=FALSE, linewidth=.8) +
  geom_text(data=line_cors,
            aes(x=5, y=95, label=sprintf("ρ=%.2f", spearman)),
            size=3, fontface="bold", hjust=0, colour="grey20") +
  scale_colour_manual(values=CELL_LINE_COLORS, guide="none") +
  facet_wrap(~cell_line, ncol=4) +
  theme_classic() +
  labs(x="Custom MST pseudotime (0–100)",
       y="Monocle3 pseudotime (0–100)",
       title="Method agreement per cell line: Custom MST vs Monocle3 pseudotime",
       subtitle="Dashed = perfect agreement | ρ = Spearman correlation | Higher ρ = more reliable pseudotime assignment") +
  theme(plot.title  = element_text(size=12, face="bold"),
        strip.text  = element_text(size=10, face="bold"),
        strip.background = element_rect(fill="grey95"))

print(p_method_scatter)
```

------------------------------------------------------------------------

# Summary Table — Complete Cell Line Analysis

```{r summary-table}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# This comprehensive table is your primary result table for the paper.
# Key columns to interpret:
#
# dominant_state: the most common predicted CD4 state for that cell line
# dominant_state_pct: how confident — if >80%, strong cell of origin signal
# mean_pred_score: MapQuery confidence — >0.7 is reliable
# mst_mean_pt: average differentiation maturity (0=naive, 100=terminal)
# dominant_milestone: the single most common trajectory position
#
# For the paper, you will want to state something like:
# "X of 7 cell lines predominantly mapped to CD4 TCM (milestones M01-M02),
# with mean MST pseudotime of Y ± Z, indicating that Sézary syndrome
# cell lines transcriptomically resemble central memory CD4 T cells at
# an intermediate stage of differentiation."
# ──────────────────────────────────────────────────────────────────────────

sez_summary <- sez_plot_df %>%
  group_by(cell_line) %>%
  summarise(
    n_cells            = n(),
    mean_pred_score    = round(mean(pred_score, na.rm=TRUE), 3),
    high_conf_pct      = round(100 * mean(pred_score > 0.7, na.rm=TRUE), 1),
    dominant_state     = names(sort(table(predicted_state), decreasing=TRUE))[1],
    dominant_state_pct = round(100 * max(table(predicted_state)) / n(), 1),
    dominant_milestone = names(sort(table(milestone_ref), decreasing=TRUE))[1],
    mst_mean_pt        = round(mean(mst_pt, na.rm=TRUE), 1),
    mst_sd_pt          = round(sd(mst_pt, na.rm=TRUE), 1),
    mst_median_pt      = round(median(mst_pt, na.rm=TRUE), 1),
    m3_mean_pt         = round(mean(m3_pt, na.rm=TRUE), 1),
    m3_sd_pt           = round(sd(m3_pt, na.rm=TRUE), 1),
    .groups            = "drop"
  ) %>%
  arrange(mst_mean_pt)

kable(sez_summary,
      col.names = c("Cell line","N cells","Mean pred.\nscore",
                    "% score\n>0.7",
                    "Dominant\nstate","Dominant\nstate %",
                    "Dominant\nmilestone",
                    "MST mean\npt","MST SD",
                    "MST median\npt",
                    "M3 mean\npt","M3 SD"),
      caption = "Complete per cell line trajectory summary — cell of origin analysis") %>%
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=TRUE) %>%
  column_spec(5, bold=TRUE) %>%
  column_spec(6, bold=TRUE) %>%
  column_spec(8:10, color="white", background=METHOD_COLORS["Custom MST"]) %>%
  column_spec(11:12, color="white", background=METHOD_COLORS["Monocle3"])

# ── Per-line milestone distribution table ─────────────────────────────────
ms_dist_all <- sez_plot_df %>%
  filter(!is.na(milestone_ref)) %>%
  count(cell_line, milestone_ref) %>%
  left_join(milestone_centroids[,c("milestone","state","order_pos")],
            by=c("milestone_ref"="milestone")) %>%
  filter(!is.na(order_pos)) %>%
  group_by(cell_line) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  mutate(milestone_ref = factor(milestone_ref, levels=milestone_order)) %>%
  arrange(cell_line, order_pos)

cat("\n\nMilestone distribution per cell line:\n")
ms_dist_all %>%
  select(cell_line, milestone_ref, state, n, pct) %>%
  kable(col.names=c("Cell line","Milestone","State","N cells","% cells"),
        caption="Per cell line milestone distribution") %>%
  kable_styling(bootstrap_options=c("striped","condensed"), full_width=FALSE) %>%
  print()
```

------------------------------------------------------------------------

# Prediction Score Quality Check

```{r pred-score-quality, fig.width=8, fig.height=5}
# ── BIOLOGICAL NOTE ────────────────────────────────────────────────────────
# The prediction score (0-1) tells you how confidently MapQuery matched
# each Sézary cell to a reference CD4 state. This is critical for
# interpreting your results:
#
# Score >0.7: High confidence — the cell clearly resembles one CD4 state
# Score 0.5-0.7: Moderate confidence — cell has mixed features
# Score <0.5: Low confidence — cell doesn't clearly match any reference state
#
# LOW SCORES ARE BIOLOGICALLY INTERESTING in cancer:
# Malignant cells often have aberrant transcriptomes that don't perfectly
# match any normal cell type. Low scores could indicate:
# 1. The cell has acquired oncogenic gene expression not seen in normal T cells
# 2. The cell spans two normal states (e.g. mixed TCM/TEM features)
# 3. The cell has downregulated normal T cell programmes entirely
#
# You should report the mean prediction score in your methods/results.
# If scores are generally high (>0.7), your conclusion about cell of origin
# is on solid ground. If scores are low (<0.5), interpret with caution.
# ──────────────────────────────────────────────────────────────────────────

p_scores <- ggplot(sez_plot_df,
                    aes(x=cell_line, y=pred_score, fill=cell_line)) +
  geom_violin(scale="width", trim=FALSE, alpha=.85, colour="white") +
  geom_boxplot(width=.08, fill="white", outlier.size=.5,
               outlier.alpha=.3, colour="grey30") +
  geom_hline(yintercept=0.7, linetype="dashed",
             colour="darkgreen", linewidth=.8) +
  geom_hline(yintercept=0.5, linetype="dashed",
             colour="orange", linewidth=.8) +
  annotate("text", x=7.5, y=0.72, label="High confidence (0.7)",
           size=3, colour="darkgreen", hjust=1, fontface="italic") +
  annotate("text", x=7.5, y=0.52, label="Moderate confidence (0.5)",
           size=3, colour="orange", hjust=1, fontface="italic") +
  scale_fill_manual(values=CELL_LINE_COLORS, guide="none") +
  scale_y_continuous(limits=c(0,1), breaks=seq(0,1,.2)) +
  theme_classic() +
  labs(x="Cell line", y="MapQuery prediction score (0–1)",
       title="Projection confidence per Sézary cell line",
       subtitle="Higher score = cell more clearly resembles one specific CD4 T cell state") +
  theme(plot.title  = element_text(size=12, face="bold"),
        axis.text.x = element_text(size=11, face="bold"))

print(p_scores)
```

------------------------------------------------------------------------

# Save All Outputs

```{r save-all}
# Create output directories if they don't exist
dir.create("Objects", showWarnings=FALSE)
dir.create("Figures", showWarnings=FALSE)
dir.create("Tables",  showWarnings=FALSE)

# ── RDS objects ────────────────────────────────────────────────────────────
saveRDS(sezary_obj,          "Objects/sezary_celllines_projected.rds")
saveRDS(mst_graph,           "Objects/mst_graph.rds")
saveRDS(milestone_centroids, "Objects/milestone_centroids.rds")

# ── CSV tables ─────────────────────────────────────────────────────────────
write.csv(sez_summary,    "Tables/cellline_trajectory_summary.csv",   row.names=FALSE)
write.csv(state_per_line, "Tables/cellline_state_proportions.csv",    row.names=FALSE)
write.csv(ms_dist_all,    "Tables/cellline_milestone_distribution.csv",row.names=FALSE)
write.csv(line_cors,      "Tables/cellline_method_agreement.csv",     row.names=FALSE)
write.csv(sez_plot_df,    "Tables/cellline_all_cells_metadata.csv",   row.names=FALSE)

# ── Figures ────────────────────────────────────────────────────────────────
ggsave("Figures/fig_all_lines_umap.pdf",      p_all_lines,    width=14, height=12)
ggsave("Figures/fig_faceted_lines.pdf",       p_faceted,      width=18, height=10)
ggsave("Figures/fig_state_per_line.pdf",      p_state_stack,  width=14, height=6)
ggsave("Figures/fig_pseudotime_mst.pdf",      p_violin_mst,   width=14, height=6)
ggsave("Figures/fig_pseudotime_m3.pdf",       p_violin_m3,    width=14, height=6)
ggsave("Figures/fig_milestone_heatmap.pdf",   p_ms_heatmap,   width=14, height=7)
ggsave("Figures/fig_milestone_stack.pdf",     p_ms_stack,     width=14, height=7)
ggsave("Figures/fig_method_agreement.pdf",    p_method_scatter,width=14, height=6)
ggsave("Figures/fig_prediction_scores.pdf",   p_scores,       width=12, height=5)

# Individual line plots
for (cl in CELL_LINE_ORDER) {
  p_combined <- plot_list[[paste0(cl,"_state")]] | plot_list[[paste0(cl,"_pt")]]
  ggsave(sprintf("Figures/fig_%s_umap.pdf", cl),
         p_combined, width=16, height=7)
}

cat("✅ All outputs saved\n")
cat("  Objects/ — RDS files\n")
cat("  Figures/ — PDF plots\n")
cat("  Tables/  — CSV data files\n")
```

------------------------------------------------------------------------

# Biological Interpretation Guide

```{r interpretation-guide, echo=FALSE}
# This chunk prints a guide for interpreting your specific results
# after all the analyses above have run

cat("
╔══════════════════════════════════════════════════════════════════════╗
║         BIOLOGICAL INTERPRETATION GUIDE FOR YOUR RESULTS            ║
╚══════════════════════════════════════════════════════════════════════╝

STEP 1: CHECK DOMINANT STATE (sez_summary table, dominant_state column)
─────────────────────────────────────────────────────────────────────────
• If most lines show CD4 TCM (dominant_state_pct >70%):
  → CONCLUSION: Sézary syndrome arises from Central Memory T cells
  → IMPLICATION: TCM survival signals (IL-7, IL-15, CCR7-independent
    homing) drive malignant cell persistence
  → LITERATURE: Consistent with Wang et al. 2023, Woollard et al. 2022

• If most lines show CD4 TEM:
  → CONCLUSION: Transformation occurred at the effector memory stage
  → IMPLICATION: Malignant cells have acquired partial effector
    programmes (GZMK, CCL5) but not terminal cytotoxicity

• If lines split between TCM and TEM:
  → CONCLUSION: Disease heterogeneity — different patients have
    different cells of origin
  → IMPLICATION: May explain clinical variability in drug response

STEP 2: CHECK DOMINANT MILESTONE (dominant_milestone column)
─────────────────────────────────────────────────────────────────────────
• M01 or M02 dominant → early-to-late TCM (most likely for SS)
• M02 dominant (branch point) → arrested at fate decision point
• M03 or M04 dominant → TEM commitment despite TCM-like label
• M06 dominant → Treg-like features (check FOXP3 expression)

STEP 3: CHECK PREDICTION SCORES (mean_pred_score column)
─────────────────────────────────────────────────────────────────────────
• >0.7: High confidence — strong cell of origin conclusion
• 0.5-0.7: Moderate — mention in limitations
• <0.5: Low — malignant transcriptome substantially deviates from
  all normal reference states; discuss as 'partial resemblance'

STEP 4: CHECK PSEUDOTIME (mst_mean_pt column)
─────────────────────────────────────────────────────────────────────────
• 0-33: Naive/early TCM differentiation stage
• 33-66: TCM/TEM transition stage
• 66-100: Mature effector or Treg stage
• Compare across lines: if L1 pseudotime < L5 pseudotime, L5 is more
  mature/differentiated — may correlate with aggressiveness

STEP 5: LOOK AT THE FACETED UMAP
─────────────────────────────────────────────────────────────────────────
• Tight clusters in one region → homogeneous cell line
• Scattered cells → heterogeneous line (may have subclones)
• Lines overlapping the same UMAP region → same cell of origin
• Lines in different UMAP regions → different cell of origin
")
```

------------------------------------------------------------------------

# Session Information

```{r session-info}
sessionInfo()
```

------------------------------------------------------------------------

# Metadata Reference

```{r metadata-table, echo=FALSE}
meta_tbl <- data.frame(
  Column = c(
    "orig.ident",
    "predicted.predicted_state",
    "predicted.predicted_state.score",
    "predicted.predicted_milestone",
    "mst_pseudotime",
    "nearest_milestone_mst",
    "monocle3_pseudotime"
  ),
  Description = c(
    "Cell line identity: L1–L7 (malignant Sézary lines)",
    "MapQuery label transfer: most similar CD4 T cell state from reference",
    "MapQuery confidence score (0–1): higher = more confident prediction",
    "MapQuery label transfer: most similar milestone (M00–M07) from reference",
    "MST pseudotime (0–100): 0=Naive-like, 100=Temra/Terminal Treg-like",
    "Nearest MST milestone after orthogonal edge projection",
    "KNN-weighted Monocle3 pseudotime from 15 nearest reference neighbours (0–100)"
  ),
  Interpretation = c(
    "Use to split per-cell-line analysis",
    "Primary cell of origin indicator",
    ">0.7 high confidence | 0.5-0.7 moderate | <0.5 interpret cautiously",
    "Most precise trajectory position (M02 = branch point)",
    "Differentiation maturity: lower = more naive-like at time of transformation",
    "Confirms trajectory position; M01-M02 = TCM, M03-M04 = TEM, M05 = Temra",
    "Independent pseudotime estimate for robustness validation"
  )
)
kable(meta_tbl,
      caption="Metadata fields added to sezary_obj by this pipeline") %>%
  kable_styling(bootstrap_options=c("striped","condensed"), full_width=TRUE)
```
