Introduction

This R Markdown script provides the complete, executable code and interpretation guidance to perform the single-cell RNA sequencing (scRNA-seq) analysis on All_samples_Merged Seurat object. The goal is to achieve a robust T cell annotation and answer specific questions regarding Sezary cell phenotype, cell line heterogeneity, and the drivers of expression clustering.

Prerequisite: Ensure the All_samples_Merged Seurat object is loaded into your R environment before running the chunks below.

Setup and Data Loading

We will use the Seurat package for core scRNA-seq analysis, dplyr for data manipulation, and ggplot2 for visualization.

# Load necessary libraries
library(Seurat)
library(dplyr)
library(ggplot2)
library(patchwork)
library(Matrix)

# IMPORTANT: Replace this with the actual path and command to load your object
All_samples_Merged <- readRDS("../../../PHD_3rd_YEAR_Analysis/0-Seurat_RDS_OBJECT_FINAL/All_samples_Merged_with_Renamed_Clusters_final-26-10-2025.rds")
# Placeholder:
if (!exists("All_samples_Merged")) {
  stop("The 'All_samples_Merged' Seurat object is not loaded. Please load your data.")
}

# Check the object structure
print(All_samples_Merged)
An object of class Seurat 
62900 features across 49305 samples within 6 assays 
Active assay: RNA (36601 features, 0 variable features)
 2 layers present: data, counts
 5 other assays present: ADT, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3, SCT
 5 dimensional reductions calculated: integrated_dr, ref.umap, pca, umap, harmony

Phase 1: Robust T Cell Annotation and Subsetting

The first critical step is to isolate the T cell population and re-run the clustering workflow to maximize resolution within this specific lineage.

1 T Cell Subsetting

We subset based on the expression of pan-T cell markers (CD3D, CD3E, CD3G).

# Identify cells expressing at least one of the pan-T cell markers
T_cells <- subset(All_samples_Merged, subset = CD3D > 0 | CD3E > 0 | CD3G > 0)

# Check the number of T cells retained
cat("Number of T cells retained:", ncol(T_cells), "\n")
Number of T cells retained: 48386 
# Visualize the new T cell clusters
DimPlot(T_cells, reduction = "umap", label = F) 

# Visualize the new T cell clusters

FeaturePlot(T_cells, features = c("CD4", "CD8A", "FOXP3", "KIR3DL2"), reduction = "umap")

Interpretation Guidance (Phase 1): * UMAP: Visually inspect the clusters. Are they well-separated? * Feature Plots: Use known markers (e.g., CD4, CD8A, FOXP3, CCR7, GZMB) to assign initial identities (CD4+, CD8+, Treg, Naive, Effector). * Marker Identification: Use FindAllMarkers(T_cells, only.pos = TRUE) to get a list of genes for each cluster and finalize the annotation. You must add a new column to the metadata, e.g., T_cells$cell_subtype, with the final, robust annotations (e.g., “CD4_Naive”, “CD8_Exhausted”, “Treg”, “Sezary_Malignant”).

Phase 2: Sezary Cell and Treg-like Phenotype Check

We will check if the identified Sezary cell cluster(s) (likely CD4+ and expressing markers like KIR3DL2 or having a unique malignant signature) exhibit a T regulatory cell (Treg) phenotype.

2.1 Calculate Treg Module Score

A module score is a robust way to quantify the expression of a gene set.

# Define the core Treg gene signature
Treg_genes <- c("FOXP3", "IL2RA", "CTLA4", "TIGIT")

# Calculate the module score
T_cells <- AddModuleScore(
  object = T_cells,
  features = list(Treg_genes),
  name = "Treg_Score"
)

# Rename the score column for simplicity
T_cells$Treg_Score <- T_cells$Treg_Score1

2.2 Compare Treg Score Distribution

We compare the Treg score across all clusters, paying close attention to the Sezary cluster(s) and the bona fide Treg cluster(s).

# Visualize the distribution of the Treg score per cluster
VlnPlot(T_cells, features = "Treg_Score", group.by = "seurat_clusters", pt.size = 0) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  labs(title = "Treg Signature Score Distribution by Cluster")


# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "2"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "6"])

    Wilcoxon rank sum test with continuity correction

data:  T_cells$Treg_Score[T_cells$seurat_clusters == "2"] and T_cells$Treg_Score[T_cells$seurat_clusters == "6"]
W = 9672619, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "2"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "8"])

    Wilcoxon rank sum test with continuity correction

data:  T_cells$Treg_Score[T_cells$seurat_clusters == "2"] and T_cells$Treg_Score[T_cells$seurat_clusters == "8"]
W = 4951569, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "6"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "8"])

    Wilcoxon rank sum test with continuity correction

data:  T_cells$Treg_Score[T_cells$seurat_clusters == "6"] and T_cells$Treg_Score[T_cells$seurat_clusters == "8"]
W = 2515212, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "5"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "1"])

    Wilcoxon rank sum test with continuity correction

data:  T_cells$Treg_Score[T_cells$seurat_clusters == "5"] and T_cells$Treg_Score[T_cells$seurat_clusters == "1"]
W = 14625089, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Interpretation Guidance (Phase 2): * VlnPlot: If the Sezary cluster shows a high median Treg score, comparable to or higher than the canonical Treg cluster, it suggests a Treg-like immunosuppressive phenotype. * Statistical Test: A significant p-value (e.g., p < 0.05) from the Wilcoxon test confirms a statistically different (or similar, depending on the means) expression of the Treg signature between the two populations.

Phase 3: Distribution of Cell Lines in T Cell States

This answers the question: “Could you show the distribution for each cell lines in T cell state?”

3.1 Calculate Proportions

We calculate the proportion of each T cell cluster within each cell line (patient origin). Ensure your cell line/patient origin information is in a metadata column named cell_line.

# Calculate the raw counts of cells per cluster per cell line
count_table <- table(T_cells$seurat_clusters, T_cells$cell_line)

# Calculate the proportion (margin = 2 means proportions within each column, i.e., within each cell line)
prop_table <- prop.table(count_table, margin = 2)

# Convert to data frame for plotting
prop_df <- as.data.frame(prop_table) %>%
  rename(Cluster = Var1, CellLine = Var2, Proportion = Freq)

# Display the first few rows of the proportion table
head(prop_df)

3.2 Visualize Distribution

# Stacked bar plot visualization
ggplot(prop_df, aes(x = CellLine, y = Proportion, fill = Cluster)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "T Cell Cluster Composition by Cell Line (Patient Origin)",
       y = "Proportion of Total T Cells per Line") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation Guidance (Phase 3): * Stacked Bar Plot: Look for dominant clusters in specific cell lines. For example, if Cell Line A is 80% “Sezary_Malignant” but Cell Line B is only 20%, it indicates a significant difference in the purity or in vitro differentiation state of the cell lines. This heterogeneity will be the basis for the next clustering step.

Phase 4: Unsupervised Clustering of Cell Lines (Pseudobulk)

This addresses: “Could you make an unsupervised clustering of our cell lines based upon expression?” We treat each cell line as a single sample and cluster them based on their overall gene expression profile.

4.1 Aggregate Expression (Pseudobulk)

# Calculate the average expression for all genes across all T cells, grouped by 'cell_line'
# We use the 'data' slot (normalized expression)
pseudobulk <- AverageExpression(T_cells,
                                group.by = "cell_line",
                                assays = "RNA",
                                slot = "data")

# Extract the expression matrix and transpose it (Cell Lines as rows, Genes as columns)
pb_matrix <- as.matrix(pseudobulk$RNA)
pb_matrix_t <- t(pb_matrix)

4.2 Dimensionality Reduction and Hierarchical Clustering

# Perform PCA on the pseudobulk matrix
# scale. = TRUE standardizes the gene expression across cell lines
pb_pca <- prcomp(pb_matrix_t, scale. = TRUE)
Error in prcomp.default(pb_matrix_t, scale. = TRUE) : 
  cannot rescale a constant/zero column to unit variance

Interpretation Guidance (Phase 4): * PCA Plot: Cell lines that group closely together share similar overall gene expression profiles. * Dendrogram: The branches show the similarity. Cell lines joining early are highly similar. Cutting the tree (e.g., k=3) defines the Cell Line Clusters (e.g., “Cluster A”, “Cluster B”, “Cluster C”).

Phase 5: Defining Cluster-Specific Genes and Shared Clusters

This addresses: “define clusters that we observe, a list of genes that we have in each clusters? Maybe we have clusters that are shared with cell lines derived from other patients?”

5.1 Marker Genes for T Cell Subtypes (Review)

You should have already done this in Phase 1, but here is the code to re-run and extract the top markers for the single-cell clusters.

# Find markers for all single-cell clusters
T_cells.markers <- FindAllMarkers(T_cells,
                                  only.pos = TRUE,
                                  min.pct = 0.25,
                                  logfc.threshold = 0.25)

# Get the top 10 markers per cluster
top10 <- T_cells.markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)

# Print the top markers for review
print(top10)
# Find markers for all single-cell clusters
T_cells.markers <- FindAllMarkers(T_cells,
                                  group.by = "cell_line",
                                  only.pos = TRUE,
                                  min.pct = 0.25,
                                  logfc.threshold = 0.25)

# Get the top 10 markers per cluster
top10_cl <- T_cells.markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)

# Print the top markers for review
print(top10_cl)

5.2 Marker Genes for Cell Line Clusters (The New Insight)

We now want to find the genes that define the Cell Line Clusters identified in Phase 4.

# Now, find markers between the Cell Line Clusters
line_cluster_markers <- FindAllMarkers(T_cells,
                                       group.by = "line_cluster",
                                       only.pos = TRUE,
                                       min.pct = 0.25,
                                       logfc.threshold = 0.25)
Error in FindAllMarkers(T_cells, group.by = "line_cluster", only.pos = TRUE,  : 
  'line_cluster' not found in object metadata

Interpretation Guidance (Phase 5): * Gene List: The top10_line_markers list is the answer to the “list of genes that we have in each clusters” question, but for the cell line clusters. These genes represent the core biological differences between the groups of cell lines. * Shared Clusters: To check if these clusters are “shared with cell lines derived from other patients,” you need to manually search the top marker genes (e.g., S100A4, GZMB, FOXP3) in public databases (e.g., GEO, literature) for known T cell states or malignant signatures. If the marker list for “Line Cluster A” matches a known signature (e.g., “Interferon Response Signature”), then the cluster is not unique to your patient cohort.

Phase 6: Testing Metadata as Explanatory Variables for Clustering

This addresses: “Could you test metadata that could explain the clustering, instead of only patient origin? Like all the parameters that you have in metadata and also composition in T cell immunophenotype…”

6.1 Prepare Metadata for Modeling

We need a table where each row is a cell line, and columns contain: 1. The PC scores (quantitative representation of the clustering from Phase 4). 2. The proportion of each T cell subtype (immunophenotype composition from Phase 3). 3. Any other relevant metadata (e.g., age, sex, treatment status, batch).

# 1. Get PC scores (quantitative clustering)
cell_line_pcs <- data.frame(pb_pca$x) %>%
  rownames_to_column("cell_line") %>%
  select(cell_line, PC1, PC2, PC3) # Use the top PCs that explain most variance

# 2. Get T cell immunophenotype composition (proportions from Phase 3)
# *NOTE: This assumes you have finalized the 'cell_subtype' column in Phase 1*
cell_subtype_counts <- T_cells@meta.data %>%
  group_by(cell_line, cell_subtype) %>%
  summarise(Count = n(), .groups = 'drop')

cell_subtype_proportions <- cell_subtype_counts %>%
  group_by(cell_line) %>%
  mutate(Total = sum(Count),
         Proportion = Count / Total) %>%
  select(cell_line, cell_subtype, Proportion) %>%
  # Widen the table so each cell subtype proportion is a column
  tidyr::pivot_wider(names_from = cell_subtype, values_from = Proportion, values_fill = 0) %>%
  rename_with(~paste0("Prop_", .), -cell_line) # Prefix column names

# 3. Get other metadata (e.g., Age, Sex)
# *NOTE: Replace "Age" and "Sex" with actual column names from your All_samples_Merged metadata*
other_metadata <- T_cells@meta.data %>%
  select(cell_line, Age, Sex, Batch) %>%
  distinct() # Ensure only one row per cell line

# 4. Merge all data into the final model table
model_data <- cell_line_pcs %>%
  left_join(cell_subtype_proportions, by = "cell_line") %>%
  left_join(other_metadata, by = "cell_line")

print(head(model_data))

6.2 Statistical Modeling

We use a linear model to test if the metadata variables (e.g., Prop_Treg, Age) significantly predict the cell line’s expression profile, represented by its PC scores.

# Test PC1 (the primary driver of the cell line clustering) against all predictors
# We test the T cell composition (e.g., Prop_Treg) and other metadata (e.g., Age)
# *NOTE: Adjust the formula to include all relevant metadata columns*
lm_model_pc1 <- lm(PC1 ~ Prop_Treg + Prop_Sezary_Malignant + Age + Sex + Batch, data = model_data)

# View the results
summary(lm_model_pc1)

# You can repeat this for PC2, PC3, etc.
# lm_model_pc2 <- lm(PC2 ~ Prop_Treg + Prop_Sezary_Malignant + Age + Sex + Batch, data = model_data)
# summary(lm_model_pc2)

Interpretation Guidance (Phase 6): * summary(lm_model_pc1): Focus on the p-value (Pr(>|t|)) for each predictor variable (e.g., Prop_Treg). * Significant Predictor (p < 0.05): If a variable like Prop_Treg has a low p-value, it means the proportion of Tregs in a cell line is a significant predictor of where that cell line falls in the expression clustering space (PC1). This suggests that the biological difference captured by PC1 (the clustering) is explained by the underlying T cell immunophenotype composition, independent of the simple “patient origin” label. * Coefficient (Estimate): The sign of the coefficient indicates the direction. A positive coefficient for Prop_Treg means cell lines with a higher proportion of Tregs tend to have a higher PC1 score.

Conclusion

This script provides a comprehensive framework to move from robust annotation to deep biological interpretation. By following these steps, you will be able to not only describe the heterogeneity of your cell lines but also identify the underlying cellular and clinical factors that drive their global expression differences. ```

---
title: "Single-Cell RNA-seq Analysis of T Cell Populations in Sezary Syndrome"
author: Nasir Mahmood Abbasi
date: "`r Sys.Date()`"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    toc_collapsed: yes
  word_document:
    toc: yes
  html_document:
    toc: yes
    df_print: paged
  pdf_document:
    toc: yes
---

# Introduction

This R Markdown script provides the complete, executable code and interpretation guidance to perform the single-cell RNA sequencing (scRNA-seq) analysis on `All_samples_Merged` Seurat object. The goal is to achieve a robust T cell annotation and answer specific questions regarding Sezary cell phenotype, cell line heterogeneity, and the drivers of expression clustering.

**Prerequisite:** Ensure the `All_samples_Merged` Seurat object is loaded into your R environment before running the chunks below.

## Setup and Data Loading

We will use the `Seurat` package for core scRNA-seq analysis, `dplyr` for data manipulation, and `ggplot2` for visualization.

```{r}
# Load necessary libraries
library(Seurat)
library(dplyr)
library(ggplot2)
library(patchwork)
library(Matrix)

# IMPORTANT: Replace this with the actual path and command to load your object
All_samples_Merged <- readRDS("../../../PHD_3rd_YEAR_Analysis/0-Seurat_RDS_OBJECT_FINAL/All_samples_Merged_with_Renamed_Clusters_final-26-10-2025.rds")
# Placeholder:
if (!exists("All_samples_Merged")) {
  stop("The 'All_samples_Merged' Seurat object is not loaded. Please load your data.")
}

# Check the object structure
print(All_samples_Merged)
```

# Phase 1: Robust T Cell Annotation and Subsetting

The first critical step is to isolate the T cell population and re-run the clustering workflow to maximize resolution within this specific lineage.

## 1 T Cell Subsetting

We subset based on the expression of pan-T cell markers (*CD3D*, *CD3E*, *CD3G*).

```{r}
# Identify cells expressing at least one of the pan-T cell markers
T_cells <- subset(All_samples_Merged, subset = CD3D > 0 | CD3E > 0 | CD3G > 0)

# Check the number of T cells retained
cat("Number of T cells retained:", ncol(T_cells), "\n")
```


```{r , fig.height=4, fig.width=10}
# Visualize the new T cell clusters
DimPlot(T_cells, reduction = "umap", label = F) 
```



```{r , fig.height=6, fig.width=8}
# Visualize the new T cell clusters

FeaturePlot(T_cells, features = c("CD4", "CD8A", "FOXP3", "KIR3DL2"), reduction = "umap")
```



**Interpretation Guidance (Phase 1):**
*   **UMAP:** Visually inspect the clusters. Are they well-separated?
*   **Feature Plots:** Use known markers (e.g., *CD4*, *CD8A*, *FOXP3*, *CCR7*, *GZMB*) to assign initial identities (CD4+, CD8+, Treg, Naive, Effector).
*   **Marker Identification:** Use `FindAllMarkers(T_cells, only.pos = TRUE)` to get a list of genes for each cluster and finalize the annotation. **You must add a new column to the metadata, e.g., `T_cells$cell_subtype`, with the final, robust annotations (e.g., "CD4_Naive", "CD8_Exhausted", "Treg", "Sezary_Malignant").**

# Phase 2: Sezary Cell and Treg-like Phenotype Check

We will check if the identified Sezary cell cluster(s) (likely *CD4+* and expressing markers like *KIR3DL2* or having a unique malignant signature) exhibit a T regulatory cell (Treg) phenotype.

## 2.1 Calculate Treg Module Score

A module score is a robust way to quantify the expression of a gene set.

```{r}
# Define the core Treg gene signature
Treg_genes <- c("FOXP3", "IL2RA", "CTLA4", "TIGIT")

# Calculate the module score
T_cells <- AddModuleScore(
  object = T_cells,
  features = list(Treg_genes),
  name = "Treg_Score"
)

# Rename the score column for simplicity
T_cells$Treg_Score <- T_cells$Treg_Score1
```

## 2.2 Compare Treg Score Distribution

We compare the Treg score across all clusters, paying close attention to the Sezary cluster(s) and the bona fide Treg cluster(s).

```{r , fig.height=8, fig.width=12}
# Visualize the distribution of the Treg score per cluster
VlnPlot(T_cells, features = "Treg_Score", group.by = "seurat_clusters", pt.size = 0) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  labs(title = "Treg Signature Score Distribution by Cluster")

```


```{r , fig.height=8, fig.width=12}

# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "2"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "6"])
```

```{r , fig.height=8, fig.width=12}

# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "2"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "8"])
```

```{r , fig.height=8, fig.width=12}
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "6"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "8"])
```
```{r , fig.height=8, fig.width=12}
# Statistical comparison (e.g., comparing Sezary cluster to canonical Treg cluster)
# *NOTE: Replace "Sezary Cluster ID" and "Treg Cluster ID" with your actual cluster numbers/names*
wilcox.test(T_cells$Treg_Score[T_cells$seurat_clusters == "5"],
             T_cells$Treg_Score[T_cells$seurat_clusters == "1"])
```


**Interpretation Guidance (Phase 2):**
*   **VlnPlot:** If the Sezary cluster shows a high median Treg score, comparable to or higher than the canonical Treg cluster, it suggests a **Treg-like immunosuppressive phenotype**.
*   **Statistical Test:** A significant p-value (e.g., p < 0.05) from the Wilcoxon test confirms a statistically different (or similar, depending on the means) expression of the Treg signature between the two populations.

# Phase 3: Distribution of Cell Lines in T Cell States

This answers the question: "Could you show the distribution for each cell lines in T cell state?"

## 3.1 Calculate Proportions

We calculate the proportion of each T cell cluster within each cell line (patient origin). **Ensure your cell line/patient origin information is in a metadata column named `cell_line`.**

```{r , fig.height=8, fig.width=12}
# Calculate the raw counts of cells per cluster per cell line
count_table <- table(T_cells$seurat_clusters, T_cells$cell_line)

# Calculate the proportion (margin = 2 means proportions within each column, i.e., within each cell line)
prop_table <- prop.table(count_table, margin = 2)

# Convert to data frame for plotting
prop_df <- as.data.frame(prop_table) %>%
  rename(Cluster = Var1, CellLine = Var2, Proportion = Freq)

# Display the first few rows of the proportion table
head(prop_df)
```

## 3.2 Visualize Distribution

```{r , fig.height=8, fig.width=12}
# Stacked bar plot visualization
ggplot(prop_df, aes(x = CellLine, y = Proportion, fill = Cluster)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "T Cell Cluster Composition by Cell Line (Patient Origin)",
       y = "Proportion of Total T Cells per Line") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

**Interpretation Guidance (Phase 3):**
*   **Stacked Bar Plot:** Look for **dominant clusters** in specific cell lines. For example, if Cell Line A is 80% "Sezary_Malignant" but Cell Line B is only 20%, it indicates a significant difference in the purity or *in vitro* differentiation state of the cell lines. This heterogeneity will be the basis for the next clustering step.

# Phase 4: Unsupervised Clustering of Cell Lines (Pseudobulk)

This addresses: "Could you make an unsupervised clustering of our cell lines based upon expression?" We treat each cell line as a single sample and cluster them based on their overall gene expression profile.

## 4.1 Aggregate Expression (Pseudobulk)

```{r}
# Calculate the average expression for all genes across all T cells, grouped by 'cell_line'
# We use the 'data' slot (normalized expression)
pseudobulk <- AverageExpression(T_cells,
                                group.by = "cell_line",
                                assays = "RNA",
                                slot = "data")

# Extract the expression matrix and transpose it (Cell Lines as rows, Genes as columns)
pb_matrix <- as.matrix(pseudobulk$RNA)
pb_matrix_t <- t(pb_matrix)
```

## 4.2 Dimensionality Reduction and Hierarchical Clustering

```{r , fig.height=8, fig.width=12}
# Perform PCA on the pseudobulk matrix
# scale. = TRUE standardizes the gene expression across cell lines
pb_pca <- prcomp(pb_matrix_t, scale. = TRUE)

# Visualize the cell lines in the PCA space
plot(pb_pca$x[, 1], pb_pca$x[, 2],
     main = "PCA of Cell Lines (PC1 vs PC2)",
     xlab = paste0("PC1 (", round(summary(pb_pca)$importance[2, 1] * 100, 2), "%)"),
     ylab = paste0("PC2 (", round(summary(pb_pca)$importance[2, 2] * 100, 2), "%)"),
     pch = 19,
     col = factor(rownames(pb_pca$x)))
text(pb_pca$x[, 1], pb_pca$x[, 2], labels = rownames(pb_pca$x), pos = 3)

# Hierarchical Clustering based on the first 10 PCs
hclust_res <- hclust(dist(pb_pca$x[, 1:10]), method = "ward.D2")

# Visualize the dendrogram
plot(hclust_res, main = "Hierarchical Clustering of Cell Lines", xlab = "Cell Line")

# Cut the tree to define 'Cell Line Clusters' (e.g., 3 clusters)
cell_line_clusters <- cutree(hclust_res, k = 3)
cell_line_clusters_df <- data.frame(cell_line = names(cell_line_clusters),
                                    line_cluster = factor(cell_line_clusters))
```

**Interpretation Guidance (Phase 4):**
*   **PCA Plot:** Cell lines that group closely together share similar overall gene expression profiles.
*   **Dendrogram:** The branches show the similarity. Cell lines joining early are highly similar. Cutting the tree (e.g., `k=3`) defines the **Cell Line Clusters** (e.g., "Cluster A", "Cluster B", "Cluster C").

# Phase 5: Defining Cluster-Specific Genes and Shared Clusters

This addresses: "define clusters that we observe, a list of genes that we have in each clusters? Maybe we have clusters that are shared with cell lines derived from other patients?"

## 5.1 Marker Genes for T Cell Subtypes (Review)

You should have already done this in Phase 1, but here is the code to re-run and extract the top markers for the single-cell clusters.

```{r , fig.height=8, fig.width=12}
# Find markers for all single-cell clusters
T_cells.markers <- FindAllMarkers(T_cells,
                                  only.pos = TRUE,
                                  min.pct = 0.25,
                                  logfc.threshold = 0.25)

# Get the top 10 markers per cluster
top10 <- T_cells.markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)

# Print the top markers for review
print(top10)
```
```{r , fig.height=8, fig.width=12}
# Find markers for all single-cell clusters
T_cells.markers <- FindAllMarkers(T_cells,
                                  group.by = "cell_line",
                                  only.pos = TRUE,
                                  min.pct = 0.25,
                                  logfc.threshold = 0.25)

# Get the top 10 markers per cluster
top10_cl <- T_cells.markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)

# Print the top markers for review
print(top10_cl)
```
## 5.2 Marker Genes for Cell Line Clusters (The New Insight)

We now want to find the genes that define the **Cell Line Clusters** identified in Phase 4.

```{r , fig.height=8, fig.width=12}
# 1. Merge the new 'line_cluster' information into the pseudobulk PCA results
pb_pca_df <- data.frame(pb_pca$x[, 1:10]) %>%
  rownames_to_column("cell_line") %>%
  left_join(cell_line_clusters_df, by = "cell_line")

# 2. Find the marker genes that differentiate the Cell Line Clusters
# This requires a differential expression test on the pseudobulk data
# For simplicity, we will use the 'line_cluster' as a grouping variable on the original T_cells object
# to find genes highly expressed in cells belonging to lines in a specific 'line_cluster'.

# First, add the cell line cluster information to the single-cell object metadata
T_cells@meta.data <- T_cells@meta.data %>%
  left_join(cell_line_clusters_df, by = "cell_line") %>%
  # Handle potential NA values if 'cell_line' column was missing
  mutate(line_cluster = as.factor(ifelse(is.na(line_cluster), "Unknown", as.character(line_cluster))))

# Now, find markers between the Cell Line Clusters
line_cluster_markers <- FindAllMarkers(T_cells,
                                       group.by = "line_cluster",
                                       only.pos = TRUE,
                                       min.pct = 0.25,
                                       logfc.threshold = 0.25)

# Get the top 10 markers for each Cell Line Cluster
top10_line_markers <- line_cluster_markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)

print(top10_line_markers)
```

**Interpretation Guidance (Phase 5):**
*   **Gene List:** The `top10_line_markers` list is the answer to the "list of genes that we have in each clusters" question, but for the *cell line* clusters. These genes represent the core biological differences between the groups of cell lines.
*   **Shared Clusters:** To check if these clusters are "shared with cell lines derived from other patients," you need to **manually search** the top marker genes (e.g., *S100A4*, *GZMB*, *FOXP3*) in public databases (e.g., GEO, literature) for known T cell states or malignant signatures. If the marker list for "Line Cluster A" matches a known signature (e.g., "Interferon Response Signature"), then the cluster is not unique to your patient cohort.

# Phase 6: Testing Metadata as Explanatory Variables for Clustering

This addresses: "Could you test metadata that could explain the clustering, instead of only patient origin? Like all the parameters that you have in metadata and also composition in T cell immunophenotype..."

## 6.1 Prepare Metadata for Modeling

We need a table where each row is a cell line, and columns contain:
1.  The PC scores (quantitative representation of the clustering from Phase 4).
2.  The proportion of each T cell subtype (immunophenotype composition from Phase 3).
3.  Any other relevant metadata (e.g., age, sex, treatment status, batch).

```{r , fig.height=8, fig.width=12}
# 1. Get PC scores (quantitative clustering)
cell_line_pcs <- data.frame(pb_pca$x) %>%
  rownames_to_column("cell_line") %>%
  select(cell_line, PC1, PC2, PC3) # Use the top PCs that explain most variance

# 2. Get T cell immunophenotype composition (proportions from Phase 3)
# *NOTE: This assumes you have finalized the 'cell_subtype' column in Phase 1*
cell_subtype_counts <- T_cells@meta.data %>%
  group_by(cell_line, cell_subtype) %>%
  summarise(Count = n(), .groups = 'drop')

cell_subtype_proportions <- cell_subtype_counts %>%
  group_by(cell_line) %>%
  mutate(Total = sum(Count),
         Proportion = Count / Total) %>%
  select(cell_line, cell_subtype, Proportion) %>%
  # Widen the table so each cell subtype proportion is a column
  tidyr::pivot_wider(names_from = cell_subtype, values_from = Proportion, values_fill = 0) %>%
  rename_with(~paste0("Prop_", .), -cell_line) # Prefix column names

# 3. Get other metadata (e.g., Age, Sex)
# *NOTE: Replace "Age" and "Sex" with actual column names from your All_samples_Merged metadata*
other_metadata <- T_cells@meta.data %>%
  select(cell_line, Age, Sex, Batch) %>%
  distinct() # Ensure only one row per cell line

# 4. Merge all data into the final model table
model_data <- cell_line_pcs %>%
  left_join(cell_subtype_proportions, by = "cell_line") %>%
  left_join(other_metadata, by = "cell_line")

print(head(model_data))
```

## 6.2 Statistical Modeling

We use a linear model to test if the metadata variables (e.g., `Prop_Treg`, `Age`) significantly predict the cell line's expression profile, represented by its PC scores.

```{r , fig.height=8, fig.width=12}
# Test PC1 (the primary driver of the cell line clustering) against all predictors
# We test the T cell composition (e.g., Prop_Treg) and other metadata (e.g., Age)
# *NOTE: Adjust the formula to include all relevant metadata columns*
lm_model_pc1 <- lm(PC1 ~ Prop_Treg + Prop_Sezary_Malignant + Age + Sex + Batch, data = model_data)

# View the results
summary(lm_model_pc1)

# You can repeat this for PC2, PC3, etc.
# lm_model_pc2 <- lm(PC2 ~ Prop_Treg + Prop_Sezary_Malignant + Age + Sex + Batch, data = model_data)
# summary(lm_model_pc2)
```

**Interpretation Guidance (Phase 6):**
*   **`summary(lm_model_pc1)`:** Focus on the **p-value** (Pr(>|t|)) for each predictor variable (e.g., `Prop_Treg`).
*   **Significant Predictor (p < 0.05):** If a variable like `Prop_Treg` has a low p-value, it means the proportion of Tregs in a cell line is a **significant predictor** of where that cell line falls in the expression clustering space (PC1). This suggests that the biological difference captured by PC1 (the clustering) is **explained by** the underlying T cell immunophenotype composition, independent of the simple "patient origin" label.
*   **Coefficient (Estimate):** The sign of the coefficient indicates the direction. A positive coefficient for `Prop_Treg` means cell lines with a higher proportion of Tregs tend to have a higher PC1 score.

# Conclusion

This script provides a comprehensive framework to move from robust annotation to deep biological interpretation. By following these steps, you will be able to not only describe the heterogeneity of your cell lines but also identify the underlying cellular and clinical factors that drive their global expression differences.
```
