This R Markdown script provides the complete, executable code and
interpretation guidance to perform the single-cell RNA sequencing
(scRNA-seq) analysis on All_samples_Merged Seurat object.
The goal is to achieve a robust T cell annotation and answer specific
questions regarding Sezary cell phenotype, cell line heterogeneity, and
the drivers of expression clustering.
Prerequisite: Ensure the
All_samples_Merged Seurat object is loaded into your R
environment before running the chunks below.
We will use the Seurat package for core scRNA-seq
analysis, dplyr for data manipulation, and
ggplot2 for visualization.
# Load necessary libraries
library(Seurat)
library(dplyr)
library(ggplot2)
library(patchwork)
library(Matrix)
# IMPORTANT: Replace this with the actual path and command to load your object
All_samples_Merged <- readRDS("../../../PHD_3rd_YEAR_Analysis/0-Seurat_RDS_OBJECT_FINAL/All_samples_Merged_with_Renamed_Clusters_final-26-10-2025.rds")
# Placeholder:
if (!exists("All_samples_Merged")) {
stop("The 'All_samples_Merged' Seurat object is not loaded. Please load your data.")
}
# Check the object structure
print(All_samples_Merged)
An object of class Seurat
62900 features across 49305 samples within 6 assays
Active assay: RNA (36601 features, 0 variable features)
2 layers present: data, counts
5 other assays present: ADT, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3, SCT
5 dimensional reductions calculated: integrated_dr, ref.umap, pca, umap, harmony
The first critical step is to isolate the T cell population and re-run the clustering workflow to maximize resolution within this specific lineage.
We subset based on the expression of pan-T cell markers (CD3D, CD3E, CD3G).
# Identify cells expressing at least one of the pan-T cell markers
T_cells <- subset(All_samples_Merged, subset = CD3D > 0 | CD3E > 0 | CD3G > 0)
# Check the number of T cells retained
cat("Number of T cells retained:", ncol(T_cells), "\n")
Number of T cells retained: 48386
# Visualize the new T cell clusters
DimPlot(T_cells, reduction = "umap", label = F)
# Visualize the new T cell clusters
FeaturePlot(T_cells, features = c("CD4", "CD8A", "FOXP3", "KIR3DL2"), reduction = "umap")
Interpretation Guidance (Phase 1): *
UMAP: Visually inspect the clusters. Are they
well-separated? * Feature Plots: Use known markers
(e.g., CD4, CD8A, FOXP3, CCR7,
GZMB) to assign initial identities (CD4+, CD8+, Treg, Naive,
Effector). * Marker Identification: Use
FindAllMarkers(T_cells, only.pos = TRUE) to get a list of
genes for each cluster and finalize the annotation. You must add
a new column to the metadata, e.g., T_cells$cell_subtype,
with the final, robust annotations (e.g., “CD4_Naive”, “CD8_Exhausted”,
“Treg”, “Sezary_Malignant”).
We will check if the identified Sezary cell cluster(s) (likely CD4+ and expressing markers like KIR3DL2 or having a unique malignant signature) exhibit a T regulatory cell (Treg) phenotype.
A module score is a robust way to quantify the expression of a gene set.
# Define the core Treg gene signature
Treg_genes <- c("FOXP3", "IL2RA", "CTLA4", "TIGIT")
# Calculate the module score
T_cells <- AddModuleScore(
object = T_cells,
features = list(Treg_genes),
name = "Treg_Score"
)
# Rename the score column for simplicity
T_cells$Treg_Score <- T_cells$Treg_Score1
We compare the Treg score across all clusters, paying close attention to the Sezary cluster(s) and the bona fide Treg cluster(s).
# Visualize the distribution of the Treg score per cluster
VlnPlot(T_cells, features = "Treg_Score", group.by = "seurat_clusters", pt.size = 0) +
geom_boxplot(width = 0.1, outlier.shape = NA) +
labs(title = "Treg Signature Score Distribution by Cluster")
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "2"],
T_cells$Treg_Score[T_cells$seurat_clusters == "6"])
Wilcoxon rank sum test with continuity correction
data: T_cells$Treg_Score[T_cells$seurat_clusters == "2"] and T_cells$Treg_Score[T_cells$seurat_clusters == "6"]
W = 9672619, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "2"],
T_cells$Treg_Score[T_cells$seurat_clusters == "8"])
Wilcoxon rank sum test with continuity correction
data: T_cells$Treg_Score[T_cells$seurat_clusters == "2"] and T_cells$Treg_Score[T_cells$seurat_clusters == "8"]
W = 4951569, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "6"],
T_cells$Treg_Score[T_cells$seurat_clusters == "8"])
Wilcoxon rank sum test with continuity correction
data: T_cells$Treg_Score[T_cells$seurat_clusters == "6"] and T_cells$Treg_Score[T_cells$seurat_clusters == "8"]
W = 2515212, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "5"],
T_cells$Treg_Score[T_cells$seurat_clusters == "1"])
Wilcoxon rank sum test with continuity correction
data: T_cells$Treg_Score[T_cells$seurat_clusters == "5"] and T_cells$Treg_Score[T_cells$seurat_clusters == "1"]
W = 14625089, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Interpretation Guidance (Phase 2): * VlnPlot: If the Sezary cluster shows a high median Treg score, comparable to or higher than the canonical Treg cluster, it suggests a Treg-like immunosuppressive phenotype. * Statistical Test: A significant p-value (e.g., p < 0.05) from the Wilcoxon test confirms a statistically different (or similar, depending on the means) expression of the Treg signature between the two populations.
This answers the question: “Could you show the distribution for each cell lines in T cell state?”
We calculate the proportion of each T cell cluster within each cell
line (patient origin). Ensure your cell line/patient origin
information is in a metadata column named
cell_line.
# Calculate the raw counts of cells per cluster per cell line
count_table <- table(T_cells$seurat_clusters, T_cells$cell_line)
# Calculate the proportion (margin = 2 means proportions within each column, i.e., within each cell line)
prop_table <- prop.table(count_table, margin = 2)
# Convert to data frame for plotting
prop_df <- as.data.frame(prop_table) %>%
rename(Cluster = Var1, CellLine = Var2, Proportion = Freq)
# Display the first few rows of the proportion table
head(prop_df)
# Stacked bar plot visualization
ggplot(prop_df, aes(x = CellLine, y = Proportion, fill = Cluster)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "T Cell Cluster Composition by Cell Line (Patient Origin)",
y = "Proportion of Total T Cells per Line") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation Guidance (Phase 3): * Stacked Bar Plot: Look for dominant clusters in specific cell lines. For example, if Cell Line A is 80% “Sezary_Malignant” but Cell Line B is only 20%, it indicates a significant difference in the purity or in vitro differentiation state of the cell lines. This heterogeneity will be the basis for the next clustering step.
This addresses: “Could you make an unsupervised clustering of our cell lines based upon expression?” We treat each cell line as a single sample and cluster them based on their overall gene expression profile.
# Calculate the average expression for all genes across all T cells, grouped by 'cell_line'
# We use the 'data' slot (normalized expression)
pseudobulk <- AverageExpression(T_cells,
group.by = "cell_line",
assays = "RNA",
slot = "data")
# Extract the expression matrix and transpose it (Cell Lines as rows, Genes as columns)
pb_matrix <- as.matrix(pseudobulk$RNA)
pb_matrix_t <- t(pb_matrix)
# Perform PCA on the pseudobulk matrix
# scale. = TRUE standardizes the gene expression across cell lines
pb_pca <- prcomp(pb_matrix_t, scale. = TRUE)
Error in prcomp.default(pb_matrix_t, scale. = TRUE) :
cannot rescale a constant/zero column to unit variance
Interpretation Guidance (Phase 4): * PCA
Plot: Cell lines that group closely together share similar
overall gene expression profiles. * Dendrogram: The
branches show the similarity. Cell lines joining early are highly
similar. Cutting the tree (e.g., k=3) defines the
Cell Line Clusters (e.g., “Cluster A”, “Cluster B”,
“Cluster C”).
This addresses: “Could you test metadata that could explain the clustering, instead of only patient origin? Like all the parameters that you have in metadata and also composition in T cell immunophenotype…”
We need a table where each row is a cell line, and columns contain: 1. The PC scores (quantitative representation of the clustering from Phase 4). 2. The proportion of each T cell subtype (immunophenotype composition from Phase 3). 3. Any other relevant metadata (e.g., age, sex, treatment status, batch).
# 1. Get PC scores (quantitative clustering)
cell_line_pcs <- data.frame(pb_pca$x) %>%
rownames_to_column("cell_line") %>%
select(cell_line, PC1, PC2, PC3) # Use the top PCs that explain most variance
# 2. Get T cell immunophenotype composition (proportions from Phase 3)
# *NOTE: This assumes you have finalized the 'cell_subtype' column in Phase 1*
cell_subtype_counts <- T_cells@meta.data %>%
group_by(cell_line, cell_subtype) %>%
summarise(Count = n(), .groups = 'drop')
cell_subtype_proportions <- cell_subtype_counts %>%
group_by(cell_line) %>%
mutate(Total = sum(Count),
Proportion = Count / Total) %>%
select(cell_line, cell_subtype, Proportion) %>%
# Widen the table so each cell subtype proportion is a column
tidyr::pivot_wider(names_from = cell_subtype, values_from = Proportion, values_fill = 0) %>%
rename_with(~paste0("Prop_", .), -cell_line) # Prefix column names
# 3. Get other metadata (e.g., Age, Sex)
# *NOTE: Replace "Age" and "Sex" with actual column names from your All_samples_Merged metadata*
other_metadata <- T_cells@meta.data %>%
select(cell_line, Age, Sex, Batch) %>%
distinct() # Ensure only one row per cell line
# 4. Merge all data into the final model table
model_data <- cell_line_pcs %>%
left_join(cell_subtype_proportions, by = "cell_line") %>%
left_join(other_metadata, by = "cell_line")
print(head(model_data))
We use a linear model to test if the metadata variables (e.g.,
Prop_Treg, Age) significantly predict the cell
line’s expression profile, represented by its PC scores.
# Test PC1 (the primary driver of the cell line clustering) against all predictors
# We test the T cell composition (e.g., Prop_Treg) and other metadata (e.g., Age)
# *NOTE: Adjust the formula to include all relevant metadata columns*
lm_model_pc1 <- lm(PC1 ~ Prop_Treg + Prop_Sezary_Malignant + Age + Sex + Batch, data = model_data)
# View the results
summary(lm_model_pc1)
# You can repeat this for PC2, PC3, etc.
# lm_model_pc2 <- lm(PC2 ~ Prop_Treg + Prop_Sezary_Malignant + Age + Sex + Batch, data = model_data)
# summary(lm_model_pc2)
Interpretation Guidance (Phase 6): *
summary(lm_model_pc1): Focus on the
p-value (Pr(>|t|)) for each predictor variable
(e.g., Prop_Treg). * Significant Predictor (p <
0.05): If a variable like Prop_Treg has a low
p-value, it means the proportion of Tregs in a cell line is a
significant predictor of where that cell line falls in
the expression clustering space (PC1). This suggests that the biological
difference captured by PC1 (the clustering) is explained
by the underlying T cell immunophenotype composition,
independent of the simple “patient origin” label. * Coefficient
(Estimate): The sign of the coefficient indicates the
direction. A positive coefficient for Prop_Treg means cell
lines with a higher proportion of Tregs tend to have a higher PC1
score.
This script provides a comprehensive framework to move from robust annotation to deep biological interpretation. By following these steps, you will be able to not only describe the heterogeneity of your cell lines but also identify the underlying cellular and clinical factors that drive their global expression differences. ```