This RMarkdown script provides a comprehensive workflow for establishing a normal CD4+ T-cell differentiation trajectory and subsequently projecting Sézary syndrome cell lines onto this reference. The goal is to identify the potential cell of origin for malignant T cells in Sézary syndrome by understanding their position within the normal T-cell developmental landscape. This approach addresses the critical need for a biological reference to interpret the differentiation state of malignant cells.
CD4T_lab
(your lab-processed data) and
CD4T_10x
(publicly available 10x Genomics PBMC-derived CD4+
T cells).Before running this script, ensure you have the following data prepared:
CD4T_lab
Seurat Object: A Seurat
object containing your lab-processed normal CD4+ T-cell data. This
should have undergone initial QC and basic processing (e.g.,
normalization, scaling, variable feature identification).CD4T_10x
Seurat Object: A Seurat
object containing CD4+ T cells extracted from a publicly available 10x
Genomics PBMC dataset. This should also be pre-processed similarly to
your CD4T_lab
data.sezary_cell_lines
Seurat Object: Your
integrated Seurat object containing the Sézary syndrome cell lines (RNA,
CITE-seq, TCR data), which you have already processed and analyzed. This
object should have a UMAP reduction and cell type annotations.Note on Data Preparation: It is crucial that both normal CD4+ T-cell datasets and your Sézary cell line dataset share common gene sets and ideally similar processing steps up to the point of integration to ensure compatibility.
This phase focuses on loading your main Seurat object
(All_sample_Merged
) which has already been processed with
SCTransform
and 3000 features. We will then subset the
normal CD4+ T-cell controls (CD4T_lab
and
CD4T_10x
) based on their orig.ident
values.
This ensures that all subsequent analyses on these normal controls
leverage the consistent SCT
assay and feature set.
library(Seurat)
library(tidyverse)
library(patchwork)
# Set a consistent ggplot2 theme
theme_set(theme_bw() + theme(text = element_text(size = 12)))
All_sample_Merged
) and Inspect
Sample CompositionReplace path/to/your/All_sample_Merged.rds
with the
actual path to your main Seurat object. This object should already
contain the SCT
assay as its default, with 3000 variable
features.
# Ensure the SCT assay is the default
DefaultAssay(main_seurat) <- "SCT"
# Inspect the orig.ident to see all sample identifiers
print("Sample composition in orig.ident:")
[1] "Sample composition in orig.ident:"
print(table(main_seurat$orig.ident))
L1 L2 L3 L4 L5 L6 L7 CD4T_lab CD4T_10x
5825 5935 6428 6006 6022 5148 5331 5106 3504
# Visualize sample distribution (using the UMAP from the main object)
sample_plot <- DimPlot(main_seurat, reduction = "umap", group.by = "orig.ident", label = TRUE) +
ggtitle("All Samples in Main Seurat Object (SCTransformed)")
print(sample_plot)
print(main_seurat)
An object of class Seurat
62900 features across 49305 samples within 6 assays
Active assay: SCT (26176 features, 3000 variable features)
3 layers present: counts, data, scale.data
5 other assays present: RNA, ADT, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3
5 dimensional reductions calculated: integrated_dr, ref.umap, pca, umap, harmony
Now we will extract only the normal control samples
(CD4T_lab
and CD4T_10x
) from the main object
to create our reference trajectory. These subsetted objects will
automatically inherit the SCT
assay.
# Subset normal CD4+ T-cell controls based on orig.ident
# Adjust the identifiers if they are named differently in your data
normal_controls <- subset(main_seurat, subset = orig.ident %in% c("CD4T_lab", "CD4T_10x"))
# Verify the subsetting worked correctly
print("Normal controls after subsetting:")
[1] "Normal controls after subsetting:"
print(table(normal_controls$orig.ident))
CD4T_lab CD4T_10x
5106 3504
# Add a \'dataset\' metadata column for clarity
normal_controls$dataset <- normal_controls$orig.ident
# Ensure SCT assay is default for the subsetted object
DefaultAssay(normal_controls) <- "SCT"
# Basic QC visualization for the normal controls
# Note: If percent.mt, nCount_RNA, nFeature_RNA were regressed out during SCTransform,
# these plots might show less variation. However, they are still useful for checking.
if(!"percent.mt" %in% colnames(normal_controls@meta.data)) {
normal_controls[["percent.mt"]] <- PercentageFeatureSet(normal_controls, pattern = "^MT-")
}
vln_plot_normal <- VlnPlot(normal_controls, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
group.by = "dataset", ncol = 3, pt.size = 0.1)
print(vln_plot_normal)
# Optional: Apply additional QC filtering if needed (though already done on main object)
normal_controls <- subset(normal_controls, subset = nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mt < 6)
print(paste("Total normal control cells:", ncol(normal_controls)))
[1] "Total normal control cells: 8457"
print(paste("CD4T_lab cells:", sum(normal_controls$orig.ident == "CD4T_lab")))
[1] "CD4T_lab cells: 5053"
print(paste("CD4T_10x cells:", sum(normal_controls$orig.ident == "CD4T_10x")))
[1] "CD4T_10x cells: 3404"
print(normal_controls)
An object of class Seurat
62900 features across 8457 samples within 6 assays
Active assay: SCT (26176 features, 3000 variable features)
3 layers present: counts, data, scale.data
5 other assays present: RNA, ADT, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3
5 dimensional reductions calculated: integrated_dr, ref.umap, pca, umap, harmony
vln_plot_normal <- VlnPlot(normal_controls, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
group.by = "dataset", ncol = 3, pt.size = 0.1)
print(vln_plot_normal)
Since these samples were processed together in your main object, they
should already share the same gene set and basic preprocessing. We will
now split them and prepare for RPCA integration to remove any remaining
batch effects between the two normal control datasets in the
SCT
assay.
# Split the normal controls object by dataset for integration
normal_list <- SplitObject(normal_controls, split.by = "orig.ident")
# Verify the split
print("Split normal controls:")
[1] "Split normal controls:"
for(i in names(normal_list)) {
print(paste(i, ":", ncol(normal_list[[i]]), "cells"))
}
[1] "CD4T_lab : 5053 cells"
[1] "CD4T_10x : 3404 cells"
# SCTransform has already been applied to the main object, so we don't need to run it again here.
# However, for RPCA, we need to ensure the `SCT` assay is the default for each object in the list.
for (i in 1:length(normal_list)) {
DefaultAssay(normal_list[[i]]) <- "SCT"
}
# Merge objects
temp_merged <- merge(normal_list[["CD4T_lab"]], y = normal_list[["CD4T_10x"]],
add.cell.ids = c("Lab", "10x"), project = "NormalCD4T_PreIntegration")
DefaultAssay(temp_merged) <- "SCT"
# Use SCTransform variable features from each object and combine them
# Seurat recommends taking the top variable features from each object (union or intersection)
var_features_lab <- VariableFeatures(normal_list[["CD4T_lab"]])
var_features_10x <- VariableFeatures(normal_list[["CD4T_10x"]])
# Combine and remove unwanted genes
combined_var_features <- setdiff(union(var_features_lab, var_features_10x),
grep("^HLA-|^TR[ABGD]|^IG[HKL]|^XIST$", rownames(temp_merged), value = TRUE))
# Assign to merged object
VariableFeatures(temp_merged) <- combined_var_features
# Now PCA will work
temp_merged <- RunPCA(temp_merged, verbose = FALSE)
# Visualize the elbow plot
ElbowPlot(temp_merged, ndims = 50) + ggtitle("PCA Elbow Plot - Normal Controls")
temp_merged <- RunUMAP(temp_merged, dims = 1:15, verbose = FALSE)
# Plot UMAP colored by dataset to show potential batch effect
pre_integration_plot <- DimPlot(temp_merged, reduction = "umap", group.by = "dataset") +
ggtitle("Normal Controls before RPCA Integration (SCT Assay)")
print(pre_integration_plot)
DimPlot(temp_merged, reduction = "umap", group.by = "predicted.celltype.l2", label = TRUE, label.box = T, repel = T)
DimPlot(temp_merged, reduction = "umap", group.by = "Prediction", label = TRUE, label.box = T, repel = T)
# Clean up temporary object
rm(temp_merged)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4190807 223.9 7567777 404.2 7567777 404.2
Vcells 1608502678 12272.0 2608179701 19898.9 2233257975 17038.5
print("Normal controls prepared for RPCA integration using SCT assay.")
[1] "Normal controls prepared for RPCA integration using SCT assay."
This phase details the integration of the CD4T_lab
and
CD4T_10x
datasets using Seurat’s Reciprocal PCA (RPCA)
integration method. Since your data has been processed with
SCTransform
, we will ensure that the integration steps are
compatible with the SCT
assay. RPCA is particularly
effective for integrating datasets with significant biological
differences or large batch effects, as it identifies shared biological
signals while minimizing technical variation. This is crucial for
creating a unified reference map for normal CD4+ T-cell
differentiation.
When working with SCTransform
data, it’s important to
use integration methods that are designed to work with the corrected
counts or normalized values produced by SCTransform
. RPCA,
when combined with PrepSCTIntegration
and specifying
normalization.method = "SCT"
, ensures that the integration
leverages the benefits of SCTransform
’s improved
normalization and variance stabilization. This leads to more accurate
anchor finding and better batch effect correction.
# Increase future global size to 5 GB (adjust as needed)
options(future.globals.maxSize = 5 * 1024^3) # 5 GB
# List of Seurat objects to integrate (normal_list created in Phase 1)
# For SCTransform data, we need to run `PrepSCTIntegration` before finding anchors.
# This function ensures that the SCT assay is properly prepared for integration.
normal_list <- PrepSCTIntegration(object.list = normal_list, anchor.features = 3000, verbose = FALSE)
# Select integration features
# When using PrepSCTIntegration, features are already selected, but we can re-confirm.
features <- SelectIntegrationFeatures(object.list = normal_list, nfeatures = 3000)
# Remove unwanted genes from integration features
unwanted_genes <- grep("^HLA-|^TR[ABGD]|^IG[HKL]|^XIST$", features, value = TRUE)
features <- setdiff(features, unwanted_genes)
# Find integration anchors using RPCA
# We specify `normalization.method = "SCT"` to ensure anchors are found on SCT-normalized data.
cd4t_anchors <- FindIntegrationAnchors(object.list = normal_list, anchor.features = features,
normalization.method = "SCT", reduction = "rpca", verbose = FALSE)
| | 0 % ~calculating
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s
# Integrate the datasets
# This step uses the identified anchors to correct for batch effects and create a combined expression matrix
cd4t_integrated <- IntegrateData(anchorset = cd4t_anchors, normalization.method = "SCT", verbose = FALSE)
[1] 1
[1] 2
print(cd4t_integrated)
An object of class Seurat
65786 features across 8457 samples within 7 assays
Active assay: integrated (2886 features, 2886 variable features)
2 layers present: data, scale.data
6 other assays present: RNA, ADT, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3, SCT
After integration, we need to scale the integrated data, perform PCA,
and then generate a UMAP embedding to visualize the integrated cell
populations. The integrated
assay now contains the
batch-corrected expression values, derived from the SCT-normalized
data.
# Set default assay to 'integrated' for downstream analysis
DefaultAssay(cd4t_integrated) <- "integrated"
# Remove unwanted genes from variable features
unwanted_genes <- grep("^HLA-|^TR[ABGD]|^IG[HKL]|^XIST$|^MALAT1$",
rownames(cd4t_integrated), value = TRUE)
VariableFeatures(cd4t_integrated) <- setdiff(VariableFeatures(cd4t_integrated), unwanted_genes)
# Confirm removal
VariableFeatures(cd4t_integrated)[1:10] # just to inspect the first few
[1] "CCL17" "CCL1" "CCL4" "PPBP" "CCL3" "OASL" "IFIT2" "GZMB" "XCL1" "CSF2"
# Run PCA on the integrated data
# When using SCTransform, scaling is typically handled by the SCTransform function itself.
# The `integrated` assay is already scaled for PCA by the integration process.
cd4t_integrated <- RunPCA(cd4t_integrated, npcs = 50, verbose = FALSE)
# Visualize elbow plot
ElbowPlot(cd4t_integrated, ndims = 50) + ggtitle("Elbow Plot - Integrated Normal CD4T")
# Run UMAP on the integrated data
cd4t_integrated <- RunUMAP(cd4t_integrated, reduction = "pca", dims = 1:20, verbose = FALSE)
# Find clusters on the integrated data
cd4t_integrated <- FindNeighbors(cd4t_integrated, reduction = "pca", dims = 1:20, verbose = FALSE)
cd4t_integrated <- FindClusters(cd4t_integrated, resolution = c(0.2, 0.3,0.4, 0.5), verbose = FALSE)
print(cd4t_integrated)
An object of class Seurat
65786 features across 8457 samples within 7 assays
Active assay: integrated (2886 features, 2885 variable features)
2 layers present: data, scale.data
6 other assays present: RNA, ADT, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3, SCT
2 dimensional reductions calculated: pca, umap
Now, we can visualize the integrated data to confirm that batch effects have been successfully removed and that cells from different datasets are well-mixed within their respective biological populations.
# Plot UMAP colored by dataset to show successful integration
integrated_plot_dataset <- DimPlot(cd4t_integrated, reduction = "umap", group.by = "dataset") +
ggtitle("UMAP after RPCA Integration (Batch Effect Removed)")
print(integrated_plot_dataset)
# Plot UMAP colored by cluster to see the clustering results
integrated_plot_clusters <- DimPlot(cd4t_integrated, reduction = "umap", group.by = "integrated_snn_res.0.2", label = TRUE) +
ggtitle("UMAP of Integrated Normal CD4+ T Cells (Clusters)")
print(integrated_plot_clusters)
# Combine plots for comparison
combined_integration_plots <- integrated_plot_dataset + integrated_plot_clusters
print(combined_integration_plots)
DimPlot(cd4t_integrated, reduction = "umap", group.by = "integrated_snn_res.0.3", label = TRUE, label.box = T, repel = T)
DimPlot(cd4t_integrated, reduction = "umap", group.by = "integrated_snn_res.0.4", label = TRUE, label.box = T, repel = T)
DimPlot(cd4t_integrated, reduction = "umap", group.by = "integrated_snn_res.0.5", label = TRUE, label.box = T, repel = T)
DimPlot(cd4t_integrated, reduction = "umap", group.by = "predicted.celltype.l2", label = TRUE, label.box = T, repel = T)
DimPlot(cd4t_integrated, reduction = "umap", group.by = "Prediction", label = TRUE, label.box = T, repel = T)
# Save the integrated object for subsequent steps
saveRDS(cd4t_integrated, "integrated_normal_cd4t_seurat_object.rds")
print("RPCA integration complete. Integrated Seurat object saved.")
[1] "RPCA integration complete. Integrated Seurat object saved."
Since your integrated object already contains cell type annotations
from Azimuth (Prediction
and
predicted.celltype.l2
columns), we will utilize these
existing annotations to define the CD4+ T-cell differentiation path.
This phase focuses on organizing these annotations into a biologically
meaningful order for trajectory inference, following the approach used
by Cerapio et al. for CD8+ T cells, but adapted for CD4+ T-cell
differentiation.
If you are running this script in sections, ensure you load the
cd4t_integrated
object saved from the previous phase.
# Uncomment and run if starting from this phase
# library(Seurat)
# cd4t_integrated <- readRDS("path/to/save/integrated_normal_cd4t_seurat_object.rds")
# DefaultAssay(cd4t_integrated) <- "integrated"
# print(cd4t_integrated)
First, let’s examine the existing cell type annotations in your integrated object to understand what CD4+ T-cell subsets are present.
# Check available metadata columns
print("Available metadata columns:")
[1] "Available metadata columns:"
print(colnames(cd4t_integrated@meta.data))
[1] "orig.ident" "nCount_RNA"
[3] "nFeature_RNA" "nCount_ADT"
[5] "nFeature_ADT" "nUMI"
[7] "ngene" "cell_line"
[9] "Patient_origin" "Patient_Immunophenotype"
[11] "condition_of_amplification_in_vivo" "culture_medium"
[13] "Stromal_cells" "Cell_line_Immunophenotype"
[15] "TP53_mutation" "age_at_diagnosis"
[17] "stage_diagnosis" "stage_analysis"
[19] "Treatments_analysis" "TCRVB2"
[21] "CD3_M" "CD30_M"
[23] "CCR4_M" "CD162_BL"
[25] "CD26_BL" "CD7_M"
[27] "CLA_BD" "CD4_BD"
[29] "CCR7_M" "CD45RO_BD"
[31] "CD45RA_BD" "CellName"
[33] "percent.mt" "CD26_BD"
[35] "CD45RA_M" "predicted.celltype.l1.score"
[37] "predicted.celltype.l1" "predicted.celltype.l2.score"
[39] "predicted.celltype.l2" "predicted.celltype.l3.score"
[41] "predicted.celltype.l3" "mapping.score"
[43] "percent.rb" "nCount_SCT"
[45] "nFeature_SCT" "S.Score"
[47] "G2M.Score" "Phase"
[49] "old.ident" "CC.Difference"
[51] "SCT_snn_res.0.4" "SCT_snn_res.0.5"
[53] "SCT_snn_res.0.6" "SCT_snn_res.0.7"
[55] "SCT_snn_res.0.8" "seurat_clusters"
[57] "harmony_res_0.1" "harmony_res_0.2"
[59] "harmony_res_0.3" "harmony_res_0.4"
[61] "harmony_res_0.5" "harmony_res_0.6"
[63] "harmony_res_0.7" "harmony_res_0.8"
[65] "harmony_res_0.9" "harmony_res_1"
[67] "harmony_res_1.2" "predicted.celltype.l1_backup"
[69] "predicted.celltype.l2_backup" "predicted.celltype.l3_backup"
[71] "Prediction" "Cluster"
[73] "Uncertainty_score" "Condition"
[75] "dataset" "integrated_snn_res.0.5"
[77] "integrated_snn_res.0.2" "integrated_snn_res.0.3"
[79] "integrated_snn_res.0.4"
# Inspect the existing annotations
if("Prediction" %in% colnames(cd4t_integrated@meta.data)) {
print("Prediction column cell types:")
print(table(cd4t_integrated$Prediction))
}
[1] "Prediction column cell types:"
CD4 proliferation CD4 Tcm CD4 Tisg CD4 Tisg cell-death CD4 Tn
13 2429 42 190 5689
CD4 Tstr None T
5 89
if("predicted.celltype.l2" %in% colnames(cd4t_integrated@meta.data)) {
print("predicted.celltype.l2 column cell types:")
print(table(cd4t_integrated$predicted.celltype.l2))
}
[1] "predicted.celltype.l2 column cell types:"
CD4 CTL CD4 Naive CD4 TCM CD4 TEM
12 1834 6530 81
# Visualize existing annotations
if("Prediction" %in% colnames(cd4t_integrated@meta.data)) {
p_prediction <- DimPlot(cd4t_integrated, reduction = "umap", group.by = "Prediction", label = TRUE, repel = TRUE) +
ggtitle("Existing Cell Type Annotations (Prediction)")
print(p_prediction)
}
if("predicted.celltype.l2" %in% colnames(cd4t_integrated@meta.data)) {
p_predicted_l2 <- DimPlot(cd4t_integrated, reduction = "umap", group.by = "predicted.celltype.l2", label = TRUE, repel = TRUE) +
ggtitle("Existing Cell Type Annotations (predicted.celltype.l2)")
print(p_predicted_l2)
}
# Check cluster composition
table(cd4t_integrated$integrated_snn_res.0.5, cd4t_integrated$predicted.celltype.l2)
CD4 CTL CD4 Naive CD4 TCM CD4 TEM
0 0 383 1786 3
1 3 3 1749 21
2 8 0 1234 53
3 0 853 429 1
4 0 563 547 0
5 0 17 676 3
6 0 15 93 0
7 1 0 16 0
# Check cluster composition
table(cd4t_integrated$integrated_snn_res.0.5, cd4t_integrated$Prediction)
CD4 proliferation CD4 Tcm CD4 Tisg CD4 Tisg cell-death CD4 Tn CD4 Tstr None T
0 0 31 21 0 2109 4 7
1 0 1226 8 0 531 0 11
2 0 988 11 1 293 0 2
3 0 86 0 0 1196 0 1
4 0 78 0 0 1031 1 0
5 0 20 2 189 465 0 20
6 0 0 0 0 60 0 48
7 13 0 0 0 4 0 0
markers <- c("CCR7","SELL","IL7R","TCF7","GZMK","GZMB","IFNG","PRF1","NKG7","MKI67","TOP2A","FOXP3","IL2RA")
avg_exp <- AverageExpression(cd4t_integrated, assays = "SCT", features = markers)
avg_exp$SCT
13 x 8 sparse Matrix of class "dgCMatrix"
g0 g1 g2 g3 g4 g5 g6 g7
CCR7 4.5543278085 3.094594595 1.610038610 3.2447388932 2.8513513514 3.318965517 1.81481481 1.4705882
SELL 3.1652854512 3.228040541 2.573745174 5.0413094310 4.0729729730 1.251436782 1.88888889 2.9411765
IL7R 1.5828729282 6.864864865 10.414671815 6.7100545596 4.5720720721 3.580459770 4.67592593 4.7647059
TCF7 3.7269797422 2.426238739 3.005405405 5.1371784879 4.2198198198 2.172413793 3.46296296 2.6470588
GZMK 0.0161141805 0.074324324 0.288030888 0.0187061574 0.0108108108 0.037356322 . 0.4117647
GZMB 0.0004604052 0.004504505 0.005405405 0.0007794232 0.0018018018 . . .
IFNG 0.0004604052 0.001126126 0.009266409 0.0031176929 0.0027027027 0.004310345 . .
PRF1 0.0156537753 0.036599099 0.135907336 0.1013250195 0.0954954955 0.025862069 0.01851852 0.2941176
NKG7 0.0041436464 0.015765766 0.103474903 0.0077942323 0.0063063063 0.002873563 . 0.6470588
MKI67 0.0027624309 0.002815315 0.009266409 0.0023382697 0.0027027027 . 0.01851852 .
TOP2A 0.0023020258 0.008445946 0.009266409 0.0007794232 0.0009009009 . . .
FOXP3 0.0188766114 0.032094595 0.107335907 0.0077942323 0.0045045045 0.011494253 . .
IL2RA 0.0686003683 0.069819820 0.121235521 0.0163678878 0.0081081081 0.081896552 0.20370370 .
library(pheatmap)
pheatmap(avg_exp$SCT, cluster_rows = FALSE, cluster_cols = FALSE,
main = "CD4 T-cell marker expression per cluster")
Based on the existing annotations, we will create a ordered differentiation path similar to Cerapio et al.’s approach. The typical CD4+ T-cell differentiation progression follows: Naive CD4+ T cells (CD4Tn) → Central Memory (CD4Tcm) → Effector Memory (CD4Tem) → potentially other specialized subsets.
# Ensure cluster column exists
table(cd4t_integrated$integrated_snn_res.0.5)
0 1 2 3 4 5 6 7
2172 1776 1295 1283 1110 696 108 17
# Map clusters to differentiation stages based on marker expression
cluster_to_stage <- c(
"0" = "CD4Tn", # g0: CCR7, SELL, IL7R high → Naive
"1" = "CD4Tcm", # g1: CCR7, SELL, IL7R high → Central Memory
"2" = "CD4Tcm_CTL", # g2: IL7R very high, moderate cytotoxic markers → Hybrid central memory / early CTL
"3" = "CD4Tn", # g3: CCR7, TCF7, SELL high → Naive
"4" = "CD4Tcm", # g4: CCR7, TCF7 → Central Memory
"5" = "CD4Tcm", # g5: MKI67/TOP2A low → treat as central memory (not proliferating)
"6" = "CD4Tcm", # g6: MKI67/TOP2A low → treat as central memory
"7" = "CD4Tctl" # g7: GZMK, PRF1, NKG7 high → Cytotoxic
)
# Assign differentiation stage to each cell
diff_stage_vec <- sapply(cd4t_integrated$integrated_snn_res.0.5,
function(x) cluster_to_stage[as.character(x)])
names(diff_stage_vec) <- colnames(cd4t_integrated)
cd4t_integrated <- AddMetaData(cd4t_integrated, metadata = diff_stage_vec, col.name = "differentiation_stage")
# Set ordered factor for trajectory or plotting
differentiation_order <- c("CD4Tn", "CD4Tcm", "CD4Tcm_CTL", "CD4Tctl")
cd4t_integrated$differentiation_stage <- factor(cd4t_integrated$differentiation_stage,
levels = differentiation_order)
# Check cluster counts
table(cd4t_integrated$differentiation_stage)
CD4Tn CD4Tcm CD4Tcm_CTL CD4Tctl
3455 3690 1295 17
# # Use the more detailed annotation column (predicted.celltype.l2 if available, otherwise Prediction)
# if("predicted.celltype.l2" %in% colnames(cd4t_integrated@meta.data)) {
# cd4t_integrated$cell_type_detailed <- cd4t_integrated$predicted.celltype.l2
# } else if("Prediction" %in% colnames(cd4t_integrated@meta.data)) {
# cd4t_integrated$cell_type_detailed <- cd4t_integrated$Prediction
# } else {
# stop("Neither 'Prediction' nor 'predicted.celltype.l2' columns found in metadata")
# }
#
# # Map detailed cell types to differentiation stages
# # You may need to adjust these mappings based on your actual annotation results
# cd4t_integrated$differentiation_stage <- cd4t_integrated$cell_type_detailed
#
# # Create a mapping from detailed cell types to ordered differentiation stages
# # This is a template - adjust based on your actual cell type names
# differentiation_mapping <- c(
# # Naive CD4+ T cells (earliest stage)
# "CD4 Naive" = "CD4Tn",
# "CD4 Tn" = "CD4Tn",
#
#
# # Central Memory CD4+ T cells
# "CD4 TCM" = "CD4Tcm",
# "CD4 Tcm" = "CD4Tcm",
#
#
# # Effector Memory CD4+ T cells
# "CD4 TEM" = "CD4Tem",
#
# # Cytotoxic CD4+ T cells
# "CD4 CTL" = "CD4ctl",
#
#
# # Other CD4+ subsets (adjust as needed)
# "CD4 proliferation" = "CD4p",
# )
#
# # Apply the mapping
# for(original_name in names(differentiation_mapping)) {
# cd4t_integrated$differentiation_stage[cd4t_integrated$cell_type_detailed == original_name] <- differentiation_mapping[original_name]
# }
#
# # Check the mapping results
# print("Differentiation stage mapping results:")
# print(table(cd4t_integrated$differentiation_stage))
#
# # Define the biological order for trajectory inference (from naive to most differentiated)
# differentiation_order <- c("CD4Tn", "CD4Tcm", "CD4Tem", "CD4Tctl", "CD4p")
#
# # Filter to only include stages present in the data
# present_stages <- intersect(differentiation_order, unique(cd4t_integrated$differentiation_stage))
# cd4t_integrated$differentiation_stage <- factor(cd4t_integrated$differentiation_stage, levels = present_stages)
#
# print("Final differentiation stages (in order):")
# print(levels(cd4t_integrated$differentiation_stage))
# Plot the ordered differentiation stages
p_diff_stages <- DimPlot(cd4t_integrated, reduction = "umap", group.by = "differentiation_stage", label = TRUE, repel = TRUE) +
ggtitle("CD4+ T-cell Differentiation Stages (Ordered)") +
theme(legend.position = "bottom")
print(p_diff_stages)
# Create a summary table of cell counts per stage
stage_summary <- table(cd4t_integrated$differentiation_stage, cd4t_integrated$dataset)
print("Cell counts per differentiation stage by dataset:")
[1] "Cell counts per differentiation stage by dataset:"
print(stage_summary)
CD4T_10x CD4T_lab
CD4Tn 1278 2177
CD4Tcm 1169 2521
CD4Tcm_CTL 944 351
CD4Tctl 13 4
# Visualize marker genes for validation
# These are canonical CD4+ T-cell differentiation markers
marker_genes <- c("CCR7", "SELL", "IL7R", "CD44", "GZMB", "PRF1", "FOXP3", "CXCR5", "BCL6")
available_markers <- intersect(marker_genes, rownames(cd4t_integrated))
if(length(available_markers) > 0) {
# Switch to SCT assay for marker visualization
DefaultAssay(cd4t_integrated) <- "SCT"
p_markers <- FeaturePlot(cd4t_integrated, features = available_markers, ncol = 3, reduction = "umap")
print(p_markers)
# Switch back to integrated assay
DefaultAssay(cd4t_integrated) <- "integrated"
}
# Save the annotated object
saveRDS(cd4t_integrated, "annotated_normal_cd4t_seurat_object.rds")
print("Cell type annotation and differentiation path ordering complete.")
[1] "Cell type annotation and differentiation path ordering complete."
With the integrated and accurately annotated normal CD4+ T-cell dataset, we can now proceed with constructing the differentiation trajectory using Monocle3. This will establish the reference map onto which the Sézary cells will later be projected. Monocle3 is well-suited for this task due to its ability to model complex, branching trajectories and assign pseudotime values, representing the progression of cells through a biological process.
When working with SCTransform-processed data, it’s important to
understand how Monocle3 handles the different assays. While we used the
integrated
assay for dimensionality reduction and
clustering, Monocle3 typically works best with raw counts or
log-normalized data for its core trajectory inference algorithms. The
SCT
assay contains corrected counts that can be used, but
we’ll primarily use the RNA
assay’s counts for Monocle3
while leveraging the UMAP embedding from our integrated analysis.
If you are running this script in sections, ensure you load the
cd4t_integrated
object saved from the previous phase.
# Uncomment and run if starting from this phase
# library(Seurat)
# cd4t_integrated <- readRDS("path/to/save/annotated_normal_cd4t_seurat_object.rds")
# DefaultAssay(cd4t_integrated) <- "integrated"
# print(cd4t_integrated)
# Install Monocle3 if not already installed
# BiocManager::install(c("monocle3"))
library(monocle3)
library(Seurat)
library(tidyverse)
library(patchwork)
Monocle3 requires a cell_data_set
(CDS) object. We will
convert our integrated Seurat object into a CDS object, ensuring that
the UMAP embedding and the differentiation_stage
annotations are correctly transferred. It’s important to use the
RNA
assay’s raw counts for Monocle3’s core functions, even
though the integrated
assay was used for dimensionality
reduction.
# Ensure the RNA assay is available and contains counts
# Monocle3 typically works best with raw counts for accurate dispersion estimation
DefaultAssay(cd4t_integrated) <- "RNA"
# Join the RNA layers
cd4t_integrated <- JoinLayers(cd4t_integrated, new.assay = "RNA")
# Extract expression matrix (raw counts)
# Using counts is generally recommended for Monocle3
expression_matrix <- GetAssayData(cd4t_integrated, assay = "RNA", slot = "counts")
# Extract cell metadata
cell_metadata <- cd4t_integrated@meta.data
# Create gene metadata (simple dataframe with gene names)
gene_metadata <- data.frame(gene_short_name = rownames(expression_matrix),
row.names = rownames(expression_matrix))
# Create the CDS object
cds_normal <- new_cell_data_set(expression_matrix,
cell_metadata = cell_metadata,
gene_metadata = gene_metadata)
# Add the UMAP embedding from Seurat to the CDS object
# Monocle3 can use the existing UMAP embedding from our integrated analysis
if("umap" %in% names(cd4t_integrated@reductions)) {
cds_normal@int_colData@listData$reducedDims$UMAP <- cd4t_integrated@reductions$umap@cell.embeddings
} else {
stop("UMAP reduction not found in Seurat object. Please ensure it's computed.")
}
# Add partitions (clusters) from Seurat to CDS
# This is important for Monocle3's graph learning
partitions <- as.factor(cd4t_integrated$seurat_clusters)
names(partitions) <- colnames(cd4t_integrated)
cds_normal@clusters$UMAP$partitions <- partitions
print(cds_normal)
class: cell_data_set
dim: 36601 8457
metadata(1): cds_version
assays(1): counts
rownames(36601): MIR1302-2HG FAM138A ... AC007325.4 AC007325.2
rowData names(1): gene_short_name
colnames(8457): PBMC_AAACCTGAGAAACCAT-1 PBMC_AAACCTGAGCGTAATA-1 ... PBMC_10x_TTTGTCATCTCTGAGA-1
PBMC_10x_TTTGTCATCTTCGAGA-1
colData names(81): orig.ident nCount_RNA ... differentiation_stage Size_Factor
reducedDimNames(1): UMAP
mainExpName: NULL
altExpNames(0):
Monocle3 learns a principal graph that represents the trajectory structure. This graph connects cells based on their proximity in the reduced-dimensional space (UMAP in our case).
# The main issue was that you were assigning factors with biological labels as partitions, but Monocle3 requires character vectors, and if use_partition=TRUE, partitions must be simple IDs (e.g., "1"). Once you fixed partitions and ensured UMAP rownames matched CDS colnames, learn_graph() worked.
# Assign clusters
cds_normal@clusters$UMAP$clusters <- as.character(cds_normal$cd4t_stage)
names(cds_normal@clusters$UMAP$clusters) <- colnames(cds_normal)
# Assign partitions (can just set all to "1" if using use_partition = FALSE)
cds_normal@clusters$UMAP$partitions <- rep("1", ncol(cds_normal))
names(cds_normal@clusters$UMAP$partitions) <- colnames(cds_normal)
# Ensure UMAP embedding rownames match colnames of CDS
umap_mat <- reducedDims(cds_normal)$UMAP
umap_mat <- umap_mat[colnames(cds_normal), , drop = FALSE]
reducedDims(cds_normal)$UMAP <- as.matrix(umap_mat)
stopifnot(all(names(cds_normal@clusters$UMAP$partitions) == colnames(cds_normal)))
stopifnot(all(rownames(reducedDims(cds_normal)$UMAP) == colnames(cds_normal)))
# ------------------------------
# Learn the principal graph
# ------------------------------
cds_normal <- learn_graph(cds_normal, use_partition = TRUE)
|
| | 0%
|
|========================================================================================================| 100%
# Plot the learned graph colored by differentiation stage
p_graph_stages <- plot_cells(cds_normal, color_cells_by = "differentiation_stage",
label_groups_by_cluster = FALSE,
label_leaves = FALSE,
label_branch_points = FALSE,
graph_label_size = 4) +
ggtitle("Monocle3 Graph on Normal CD4+ T Cells (by Differentiation Stage)")
print(p_graph_stages)
# Plot the learned graph colored by dataset to ensure no batch effects
p_graph_dataset <- plot_cells(cds_normal, color_cells_by = "dataset",
label_groups_by_cluster = FALSE,
label_leaves = FALSE,
label_branch_points = FALSE,
graph_label_size = 4) +
ggtitle("Monocle3 Graph on Normal CD4+ T Cells (by Dataset)")
print(p_graph_dataset)
NA
NA
This is the core step where cells are ordered along the learned trajectory. You need to specify a “root” node (start point) for the trajectory. For normal CD4+ T-cell differentiation, the root should typically be the Naive CD4+ T-cell population (CD4Tn).
# 1️⃣ Get naive CD4Tn cells
naive_cells <- colnames(cds_normal)[cds_normal$differentiation_stage == "CD4Tn"]
if(length(naive_cells) > 0) {
# 2️⃣ Map cells to their closest principal graph nodes
closest_vertex <- cds_normal@principal_graph_aux$UMAP$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds_normal), , drop = FALSE])
# 3️⃣ Get the most frequent node among naive cells (preserve original values)
root_pr_nodes <- names(which.max(table(closest_vertex[naive_cells, 1])))
# 4️⃣ Convert numeric IDs to the proper "Y_" format if needed
if(!grepl("^Y_", root_pr_nodes)) {
root_pr_nodes <- paste0("Y_", root_pr_nodes)
}
# 5️⃣ Check validity
valid_nodes <- igraph::V(principal_graph(cds_normal)$UMAP)$name
if(!(root_pr_nodes %in% valid_nodes)) {
stop(paste("Selected root node", root_pr_nodes, "not found in principal graph."))
}
# 6️⃣ Order cells
cds_normal <- order_cells(cds_normal, root_pr_nodes = root_pr_nodes)
} else {
cds_normal <- order_cells(cds_normal)
warning("No CD4Tn cells found. Using automatic root selection.")
}
# Plot cells colored by pseudotime
p_pseudotime <- plot_cells(cds_normal, color_cells_by = "pseudotime",
label_groups_by_cluster = FALSE,
label_leaves = FALSE,
label_branch_points = FALSE,
graph_label_size = 4) +
ggtitle("Normal CD4+ T-cell Trajectory (Pseudotime)")
print(p_pseudotime)
# Plot cells colored by differentiation stage with pseudotime trajectory
p_stages_trajectory <- plot_cells(cds_normal, color_cells_by = "differentiation_stage",
label_groups_by_cluster = TRUE,
label_leaves = TRUE,
label_branch_points = TRUE,
graph_label_size = 4) +
ggtitle("Normal CD4+ T-cell Trajectory (Differentiation Stages)")
print(p_stages_trajectory)
# Create a combined plot
combined_trajectory_plots <- p_pseudotime + p_stages_trajectory
print(combined_trajectory_plots)
# Summary statistics
print("Pseudotime summary:")
[1] "Pseudotime summary:"
print(summary(cds_normal@principal_graph_aux@listData$UMAP$pseudotime))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.809 11.181 12.903 21.074 29.368
print("Pseudotime by differentiation stage:")
[1] "Pseudotime by differentiation stage:"
pseudotime_by_stage <- aggregate(cds_normal@principal_graph_aux@listData$UMAP$pseudotime,
by = list(cds_normal@colData$differentiation_stage),
FUN = function(x) c(mean = mean(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE)))
print(pseudotime_by_stage)
Find genes that change expression along the pseudotime trajectory. These genes will be important for understanding the biological processes underlying CD4+ T-cell differentiation and for projecting Sézary cells.
print(paste("Number of trajectory-associated genes:", nrow(significant_trajectory_genes)))
[1] "Number of trajectory-associated genes: 11234"
print("Top 10 trajectory-associated genes:")
[1] "Top 10 trajectory-associated genes:"
print(head(significant_trajectory_genes, 10))
# Plot expression of top trajectory genes
if(nrow(significant_trajectory_genes) > 0) {
top_genes <- rownames(significant_trajectory_genes)[1:min(6, nrow(significant_trajectory_genes))]
p_trajectory_genes <- plot_cells(cds_normal, genes = top_genes,
show_trajectory_graph = TRUE,
label_cell_groups = FALSE,
label_leaves = FALSE)
print(p_trajectory_genes)
}
# ------------------------------
# Filter for significant genes (FDR < 0.05)
# ------------------------------
significant_trajectory_genes <- trajectory_genes %>%
dplyr::filter(q_value < 0.05)
cat("Number of trajectory-associated genes:", nrow(significant_trajectory_genes), "\n")
Number of trajectory-associated genes: 8725
if(nrow(significant_trajectory_genes) > 0) {
# ------------------------------
# Select top genes by Moran's I (strongest trajectory association)
# ------------------------------
top_genes <- significant_trajectory_genes %>%
dplyr::arrange(desc(morans_I)) %>%
dplyr::slice_head(n = 6) %>%
dplyr::pull(gene_short_name)
cat("Top trajectory-associated genes:\n")
print(top_genes)
# ------------------------------
# Plot expression of top trajectory genes along trajectory
# ------------------------------
p_trajectory_genes <- plot_cells(
cds_normal,
genes = top_genes,
show_trajectory_graph = TRUE,
label_cell_groups = FALSE,
label_leaves = FALSE
)
print(p_trajectory_genes)
} else {
warning("No significant trajectory-associated genes found (q_value < 0.05).")
}
Top trajectory-associated genes:
[1] "RPL30" "RPS15A" "RPL41" "RPS12" "RPL32" "RPS14"
# -----------------------------
# Filter for significant genes (q < 0.05)
# -----------------------------
significant_trajectory_genes <- trajectory_genes[trajectory_genes$q_value < 0.05, ]
# -----------------------------
# Remove ribosomal and mitochondrial genes
# Ribosomal: start with "RPL" or "RPS"
# Mitochondrial: start with "MT-"
# -----------------------------
significant_trajectory_genes <- significant_trajectory_genes[!grepl("^RPL|^RPS|^MT-", significant_trajectory_genes$gene_short_name), ]
# -----------------------------
# Sort by Moran's I (effect size)
# -----------------------------
significant_trajectory_genes <- significant_trajectory_genes[order(-significant_trajectory_genes$morans_I), ]
# -----------------------------
# Print summary and top genes
# -----------------------------
cat("Number of trajectory-associated genes after filtering:", nrow(significant_trajectory_genes), "\n")
Number of trajectory-associated genes after filtering: 11127
cat("Top 10 trajectory-associated genes:\n")
Top 10 trajectory-associated genes:
print(head(significant_trajectory_genes$gene_short_name, 10))
[1] "EEF1A1" "B2M" "HLA-B" "TPT1" "FTH1" "FAU" "TXNIP" "NACA" "PTMA" "TMSB4X"
# -----------------------------
# Plot top genes along trajectory
# -----------------------------
top_genes <- head(significant_trajectory_genes$gene_short_name, 6)
if(length(top_genes) > 0) {
p_trajectory_genes <- plot_cells(cds_normal, genes = top_genes,
show_trajectory_graph = TRUE,
label_cell_groups = FALSE,
label_leaves = FALSE)
print(p_trajectory_genes)
}
# Save the CDS object with trajectory information
saveRDS(cds_normal, "normal_cd4t_monocle3_cds.rds")
# Also save key trajectory information back to the Seurat object
cd4t_integrated$monocle3_pseudotime <- cds_normal@principal_graph_aux@listData$UMAP$pseudotime
cd4t_integrated$monocle3_partition <- cds_normal@clusters$UMAP$partitions
# Save updated Seurat object
saveRDS(cd4t_integrated, "normal_cd4t_with_trajectory.rds")
print("Monocle3 trajectory inference complete. Objects saved.")
[1] "Monocle3 trajectory inference complete. Objects saved."
This is a critical phase where we leverage the established normal CD4+ T-cell trajectory to understand the differentiation state of your Sézary syndrome cell lines. By projecting the malignant cells onto this normal reference, we can identify where they map within the healthy developmental landscape, providing insights into their potential cell of origin and any differentiation arrest or deviation.
Load your pre-processed and integrated Sézary cell line Seurat object. This object should contain RNA expression data, UMAP embeddings, and any relevant metadata (e.g., patient ID, cell line ID, TCR clonotypes).
# Load your integrated Sézary cell line Seurat object
sezary_obj <- readRDS("integrated_sezary_seurat_object.rds")
# Ensure the RNA assay is set as default for projection
DefaultAssay(sezary_obj) <- "RNA"
print(sezary_obj)
To project the Sézary cells onto the normal trajectory, we need to
ensure their data is compatible with the Monocle3 CDS object of normal
cells. This involves ensuring common genes and using the
SCT
assay data, as your Sézary object was also processed
with SCTransform
.
# Ensure the SCT assay is the default for the Sézary object
DefaultAssay(sezary_obj) <- "SCT"
# Extract expression matrix (SCT normalized data)
# We use the 'data' slot of the SCT assay, which contains corrected and normalized values
sezary_expression_matrix <- GetAssayData(sezary_obj, assay = "SCT", slot = "data")
# Ensure common genes between normal CDS and Sézary expression matrix
# We should use the features that were used for SCTransform (3000 features)
# For projection, it's crucial to use the same feature set as the reference trajectory.
common_features <- intersect(rownames(sezary_expression_matrix), fData(cds_normal)$gene_short_name)
sezary_expression_matrix_filtered <- sezary_expression_matrix[common_features, ]
# Create a temporary CDS object for Sézary cells (optional, but good for consistency)
sezary_cell_metadata <- sezary_obj@meta.data
sezary_gene_metadata <- data.frame(gene_short_name = common_features, row.names = common_features)
cds_sezary_temp <- new_cell_data_set(sezary_expression_matrix_filtered,
cell_metadata = sezary_cell_metadata,
gene_metadata = sezary_gene_metadata)
# Add UMAP embedding from Sézary object to temporary CDS
# Ensure the UMAP reduction name matches what's in your Seurat object
if("umap" %in% names(sezary_obj@reductions)) {
cds_sezary_temp@int_colData@listData$reducedDims$UMAP <- sezary_obj@reductions$umap@cell.embeddings
} else {
stop("UMAP reduction not found in Sézary Seurat object. Please ensure it's computed.")
}
print(cds_sezary_temp)
Monocle3 provides functions to project new cells onto an existing principal graph. This allows us to see where the Sézary cells fall within the normal CD4+ T-cell differentiation landscape.
# Combine the normal and Sézary CDS objects for projection
# This step essentially adds the Sézary cells to the existing graph of normal cells
# and finds their closest projection onto the learned trajectory.
# First, ensure that the column names of the expression matrices are unique
colnames(sezary_expression_matrix_filtered) <- paste0("sezary_", colnames(sezary_expression_matrix_filtered))
# Combine expression matrices
combined_expression_matrix <- cbind(exprs(cds_normal), sezary_expression_matrix_filtered)
# Combine cell metadata
# Ensure metadata columns are consistent or handled appropriately
# For simplicity, we'll create a new combined metadata, preserving key columns
combined_cell_metadata <- rbind(
colData(cds_normal) %>% as.data.frame() %>% select(any_of(c("dataset", "orig.ident", "differentiation_stage_ordered", "azimuth_cell_type", "seurat_clusters"))),
sezary_cell_metadata %>% as.data.frame() %>% mutate(dataset = "Sézary",
differentiation_stage_ordered = "Sézary",
azimuth_cell_type = "Sézary") %>%
select(any_of(c("dataset", "orig.ident", "differentiation_stage_ordered", "azimuth_cell_type", "seurat_clusters")))
)
# Ensure row names match column names of expression matrix
rownames(combined_cell_metadata) <- colnames(combined_expression_matrix)
# Combine gene metadata (should be the same)
combined_gene_metadata <- fData(cds_normal)
# Create a new CDS object with combined data
cds_combined <- new_cell_data_set(combined_expression_matrix,
cell_metadata = combined_cell_metadata,
gene_metadata = combined_gene_metadata)
# Transfer UMAP embedding from original CDS objects
# This is a crucial step to ensure cells are placed correctly in the UMAP space
combined_umap <- rbind(reducedDims(cds_normal)$UMAP, reducedDims(cds_sezary_temp)$UMAP)
rownames(combined_umap) <- colnames(combined_expression_matrix)
cds_combined@int_colData@listData$reducedDims$UMAP <- combined_umap
# Transfer graph from normal CDS to combined CDS
cds_combined@principal_graph <- cds_normal@principal_graph
cds_combined@principal_graph_aux <- cds_normal@principal_graph_aux
# Project cells onto the existing graph
# This will calculate pseudotime for the Sézary cells based on the normal trajectory
cds_combined <- order_cells(cds_combined, root_pr_nodes = root_node_id) # Use the same root node as for normal cells
print(cds_combined)
Now we can visualize the combined dataset, with Sézary cells projected onto the normal CD4+ T-cell trajectory. This will allow us to see their relative position and infer their differentiation state.
# Plot the combined trajectory, colored by original dataset (normal vs. Sézary)
plot_cells(cds_combined, color_cells_by = "dataset",
label_groups_by_cluster = FALSE,
label_leaves = FALSE,
label_branch_points = FALSE,
graph_label_size = 4) +
ggtitle("Sézary Cells Projected onto Normal CD4+ T-cell Trajectory")
# Plot the combined trajectory, colored by differentiation stage (for normal cells) and Sézary for malignant
plot_cells(cds_combined, color_cells_by = "differentiation_stage_ordered",
label_groups_by_cluster = TRUE,
label_leaves = FALSE,
label_branch_points = FALSE,
graph_label_size = 4) +
ggtitle("Differentiation Stages and Sézary Cell Projection")
# Plot pseudotime for all cells
plot_cells(cds_combined, color_cells_by = "pseudotime",
label_groups_by_cluster = FALSE,
label_leaves = FALSE,
label_branch_points = FALSE,
graph_label_size = 4) +
ggtitle("Pseudotime of Combined Normal and Sézary Cells")
# Analyze pseudotime distribution for Sézary cells
sezary_pseudotime_df <- colData(cds_combined) %>%
as.data.frame() %>%
filter(dataset == "Sézary") %>%
select(pseudotime, orig.ident)
ggplot(sezary_pseudotime_df, aes(x = pseudotime, fill = orig.ident)) +
geom_density(alpha = 0.6) +
labs(title = "Pseudotime Distribution of Sézary Cell Lines",
x = "Pseudotime", y = "Density") +
theme_minimal()
# Analyze gene expression of key markers in Sézary cells along the normal pseudotime
# Define key Sézary markers (e.g., CD7, CD26, TOX, DPEP1) and T-cell differentiation markers
sezary_markers <- c("CD7", "CD26", "DPEP1", "TOX", "GATA3", "CCR7", "SELL", "CD45RA", "CD45RO")
# Filter for genes present in the combined dataset
sezary_markers_present <- sezary_markers[sezary_markers %in% fData(cds_combined)$gene_short_name]
if(length(sezary_markers_present) > 0) {
plot_genes_in_pseudotime(cds_combined[fData(cds_combined)$gene_short_name %in% sezary_markers_present,],
color_cells_by = "dataset") +
ggtitle("Key Marker Expression along Normal Trajectory (with Sézary Cells)")
}
# Further analysis: Identify which normal cell types Sézary cells are closest to
# This can be done by looking at the density of Sézary cells in different regions of the UMAP
# or by calculating distances to normal cell type centroids.
# Example: Calculate mean pseudotime for each Sézary cell line
mean_pseudotime_by_sezary_line <- sezary_pseudotime_df %>%
group_by(orig.ident) %>%
summarise(mean_pseudotime = mean(pseudotime, na.rm = TRUE)) %>%
arrange(mean_pseudotime)
print("Mean pseudotime for each Sézary cell line:")
print(mean_pseudotime_by_sezary_line)
# Save the combined CDS object for future reference
saveRDS(cds_combined, "path/to/save/combined_normal_sezary_cds_object.rds")
print("Sézary cell projection onto normal trajectory complete.")
This comprehensive RMarkdown script provides a robust framework for understanding the differentiation landscape of normal CD4+ T cells and, crucially, for mapping Sézary syndrome malignant cells onto this normal trajectory. By following these steps, you can gain critical insights into the potential cell of origin for Sézary syndrome and the degree to which malignant cells deviate from normal differentiation pathways.
This script provides a powerful foundation for your investigation into the cell of origin of Sézary syndrome. Remember to adapt the paths, thresholds, and specific cell type mappings to your unique dataset and biological questions.